FinGPT: Open-Source Financial Large Language Models

Paper

[Paper] Blueprint of FinGPT
[Hugging Face pretrained model downloads] https://huggingface.co/FinGPT

Why FinGPT?

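1). Finance is highly dynamic. BloombergGPT trained an LLM on a mixture of financial and general-purpose data, which took about 53 days and cost about US$3 million. Retraining a model like BloombergGPT every month or every week is expensive, so lightweight adaptation is highly favorable. FinGPT can be fine-tuned quickly on new data at a much lower cost, less than US$300 per fine-tuning run.

2). Democratizing internet-scale financial data is critical, for example by using an automated data curation pipeline to keep the model up to date (monthly or weekly updates). BloombergGPT has privileged data access and APIs, whereas FinGPT is a more accessible alternative. It prioritizes lightweight adaptation and builds on the best available open-source LLMs.

3). The key technology is RLHF (reinforcement learning from human feedback), which BloombergGPT lacks. RLHF enables an LLM to learn individual preferences (risk-aversion level, investing habits, personalized robo-advisor, etc.), which is the "secret ingredient" of ChatGPT and GPT-4.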

Model Architecture

### Multi-Task LLMs:

  demo_tasks = [
      'Financial Sentiment Analysis',
      'Financial Relation Extraction',
      'Financial Headline Classification',
      'Financial Named Entity Recognition',]
  demo_inputs = [
      "Glaxo's ViiV Healthcare Signs China Manufacturing Deal With Desano",
      "Apple Inc. Chief Executive Steve Jobs sought to soothe investor concerns about his health on Monday, saying his weight loss was caused by a hormone imbalance that is relatively simple to treat.",
      'gold trades in red in early trade; eyes near-term range at rs 28,300-28,600',
      'This LOAN AND SECURITY AGREEMENT dated January 27 , 1999 , between SILICON VALLEY BANK (" Bank "), a California - chartered bank with its principal place of business at 3003 Tasman Drive , Santa Clara , California 95054 with a loan production office located at 40 William St ., Ste .',]
  demo_instructions = [
      'What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.',
      'Given phrases that describe the relationship between two words/phrases as options, extract the word/phrase pair and the corresponding lexical relationship between them from the input text. The output format should be "relation1: word1, word2; relation2: word3, word4". Options: product/material produced, manufacturer, distributed by, industry, position held, original broadcaster, owned by, founded by, distribution format, headquarters location, stock exchange, currency, parent organization, chief executive officer, director/manager, owner of, operator, member of, employer, chairperson, platform, subsidiary, legal form, publisher, developer, brand, business division, location of formation, creator.',
      'Does the news headline talk about price going up? Please choose an answer from {Yes/No}.',
      'Please extract entities and their types from the input sentence, entity types should be chosen from {person/organization/location}.',]
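The three lists above line up index by index: each task has one input and one instruction. A minimal sketch of combining them into prompts follows. The `Instruction / Input / Answer` template and the `build_prompt` and `build_prompts` helpers are assumptions made for illustration, not something this README specifies, and the lists are shortened so the block stays self-contained.

```python
# Hypothetical prompt template; the exact format a given FinGPT
# checkpoint expects is an assumption here.
TEMPLATE = "Instruction: {instruction}\nInput: {input}\nAnswer: "

demo_tasks = [
    "Financial Sentiment Analysis",
    "Financial Headline Classification",
]
demo_inputs = [
    "Glaxo's ViiV Healthcare Signs China Manufacturing Deal With Desano",
    "gold trades in red in early trade; eyes near-term range at rs 28,300-28,600",
]
demo_instructions = [
    "What is the sentiment of this news? Please choose an answer from {negative/neutral/positive}.",
    "Does the news headline talk about price going up? Please choose an answer from {Yes/No}.",
]


def build_prompt(instruction: str, text: str) -> str:
    """Fill the template with one instruction/input pair."""
    return TEMPLATE.format(instruction=instruction, input=text)


def build_prompts(tasks, inputs, instructions):
    """Pair the three lists by position and return {task: prompt}."""
    if not (len(tasks) == len(inputs) == len(instructions)):
        raise ValueError("tasks, inputs and instructions must have equal length")
    return {
        task: build_prompt(instruction, text)
        for task, text, instruction in zip(tasks, inputs, instructions)
    }


prompts = build_prompts(demo_tasks, demo_inputs, demo_instructions)
for task, prompt in prompts.items():
    print(f"== {task} ==\n{prompt}")
```

The braces inside the instructions are safe because `str.format` only interprets the template, not the substituted values.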

Models	Description	Function
fingpt-mt_llama2-7b_lora	Fine-tuned Llama2-7b model with LoRA	Multi-Task
fingpt-mt_falcon-7b_lora	Fine-tuned falcon-7b model with LoRA	Multi-Task
fingpt-mt_bloom-7b1_lora	Fine-tuned bloom-7b1 model with LoRA	Multi-Task
fingpt-mt_mpt-7b_lora	Fine-tuned mpt-7b model with LoRA	Multi-Task
fingpt-mt_chatglm2-6b_lora	Fine-tuned chatglm2-6b model with LoRA	Multi-Task
fingpt-mt_qwen-7b_lora	Fine-tuned qwen-7b model with LoRA	Multi-Task
fingpt-sentiment_llama2-13b_lora	Fine-tuned llama2-13b model with LoRA	Single-Task
fingpt-forecaster_dow30_llama2-7b_lora	Fine-tuned llama2-7b model with LoRA	Single-Task

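FinGPT Ecosystem

FinGPT provides a full-stack framework for FinLLMs with five layers:

Data source layer: This layer ensures comprehensive market coverage and addresses the temporal sensitivity of financial data through real-time information capture.
Data engineering layer: Designed for real-time NLP data processing, this layer tackles the inherent challenges of high temporal sensitivity and low signal-to-noise ratio in financial data.
LLMs layer: This layer focuses on a range of fine-tuning methods such as LoRA, which mitigate the highly dynamic nature of financial data and keep the model relevant and accurate.
Task layer: This layer executes fundamental tasks. These tasks serve as benchmarks for performance evaluation and cross-comparison among FinLLMs.
Application layer: This layer showcases practical applications and demos, highlighting the potential capabilities of FinGPT in the financial sector.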

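FinGPT Framework: Open-Source Financial Large Language Models

FinGPT-RAG: We present a retrieval-augmented large language model framework designed specifically for financial sentiment analysis. It enriches the depth and context of the available information through external knowledge retrieval, which supports more nuanced predictions.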
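To make the retrieval-augmented idea concrete, here is a toy sketch: retrieve the most relevant snippet from an external corpus and place it in the prompt ahead of the headline. Everything in it is an assumption for illustration, including the word-overlap scoring, the `Context:` field in the prompt, the helper names, and the two made-up corpus sentences. It is not the retrieval method FinGPT-RAG actually uses.

```python
import re

INSTRUCTION = (
    "What is the sentiment of this news? "
    "Please choose an answer from {negative/neutral/positive}."
)


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, corpus, k=1):
    """Return the k corpus entries that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(headline, corpus, k=1):
    """Prepend the retrieved context to a sentiment-analysis prompt."""
    context = "\n".join(retrieve(headline, corpus, k))
    return (
        f"Instruction: {INSTRUCTION}\n"
        f"Context: {context}\n"
        f"Input: {headline}\n"
        "Answer: "
    )


# Made-up background snippets standing in for an external knowledge source.
corpus = [
    "The central bank kept interest rates unchanged.",
    "Gold prices fell as the dollar strengthened.",
]
headline = "gold trades in red in early trade; eyes near-term range at rs 28,300-28,600"
print(build_rag_prompt(headline, corpus))
```

A real system would replace the word-overlap score with a proper retriever (for example, embedding similarity over news sources), but the prompt-assembly step keeps the same shape.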

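FinGPT-FinNLP: FinNLP provides a platform for everyone interested in LLMs and NLP in the financial domain. It offers a complete pipeline for training and fine-tuning LLMs in finance. The full architecture is shown in the figure below. Detailed code and an introduction are available in the FinNLP project, or you can refer to its wiki.

FinGPT-Benchmark: We introduce a novel instruction tuning paradigm optimized for open-source large language models (LLMs) in finance. It improves the adaptability of these models to diverse financial datasets and enables cost-effective, systematic benchmarking across task-specific, multi-task, and zero-shot instruction tuning tasks.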

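Open-Source Base Models Used in the FinGPT LLMs Layer

Contributions of additional open-source base models for language-specific financial markets are welcome.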

Base Model	Pretraining Tokens	Context Length	Model Advantages	Model Size	Experiment Results	Applications
Llama-2	2T	4096	Llama-2 excels on English-based market data	llama-2-7b and llama-2-13b	Llama-2 consistently shows superior fine-tuning results	Financial Sentiment Analysis, Robo-Advisor
Falcon	1.5T	2048	Maintains high-quality results while being more resource-efficient	falcon-7b	Good for English market data	Financial Sentiment Analysis
MPT	1T	2048	MPT models can be trained with high throughput efficiency and stable convergence	mpt-7b	Good for English market data	Financial Sentiment Analysis
Bloom	366B	2048	World’s largest open multilingual language model	bloom-7b1	Good for English market data	Financial Sentiment Analysis
ChatGLM2	1.4T	32k	Exceptional capability for Chinese language expression	chatglm2-6b	Shows prowess for Chinese market data	Financial Sentiment Analysis, Financial Report Summary
Qwen	2.2T	8k	Fast response and high accuracy	qwen-7b	Effective for Chinese market data	Financial Sentiment Analysis
InternLM	1.8T	8k	Can flexibly and independently construct workflows	internlm-7b	Effective for Chinese market data	Financial Sentiment Analysis

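Algorithm Principles

The model is based on the General Language Model (GLM) architecture. GLM is a Transformer-based language model trained with an autoregressive blank-infilling objective, which gives it both autoregressive and autoencoding capabilities.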
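As a rough illustration of autoregressive blank infilling, the sketch below builds one training example from a token list. Masked spans are replaced by a single `[MASK]` in the corrupted text (Part A), and the model learns to generate each span token by token (Part B). The function name and the `[MASK]`, `[S]`, and `[E]` token strings are illustrative assumptions, and the sketch keeps spans in their original order, whereas GLM additionally shuffles the span order.

```python
def make_blank_infilling_example(tokens, spans):
    """Build (part_a, part_b_input, part_b_target) for blank infilling.

    tokens: list of token strings.
    spans:  sorted, non-overlapping (start, end) pairs, end exclusive.
    """
    part_a = []          # corrupted text, each span collapsed to one [MASK]
    part_b_input = []    # what the decoder side is fed
    part_b_target = []   # what it must predict at each position
    cursor = 0
    for start, end in spans:
        part_a.extend(tokens[cursor:start])
        part_a.append("[MASK]")
        span = tokens[start:end]
        # Input is shifted right with a start token; target ends with an end token.
        part_b_input.extend(["[S]"] + span)
        part_b_target.extend(span + ["[E]"])
        cursor = end
    part_a.extend(tokens[cursor:])
    return part_a, part_b_input, part_b_target


tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
part_a, b_in, b_out = make_blank_infilling_example(tokens, [(2, 3), (4, 6)])
print("Part A:       ", part_a)
print("Part B input: ", b_in)
print("Part B target:", b_out)
```

Part A is encoded with bidirectional attention (the autoencoding side), while Part B is predicted left to right conditioned on Part A (the autoregressive side), which is how one objective covers both capabilities.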

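ChatGLM2-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model ChatGLM-6B. It retains the strengths of the first generation, such as fluent dialogue and a low deployment threshold, and introduces the following new features:

Stronger performance: Building on the experience of developing the first-generation ChatGLM model, we fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of GLM and was pretrained on 1.4T Chinese and English tokens, followed by human preference alignment training. Evaluation shows substantial improvements over the first-generation model on MMLU (+23%), CEval (+33%), GSM8K (+571%), and BBH (+60%), making it strongly competitive among open-source models of the same size.
Longer context: Based on FlashAttention, we extended the context length of the base model from 2K in ChatGLM-6B to 32K, and used a context length of 8K during dialogue-stage training. For longer contexts, we released the ChatGLM2-6B-32K model. LongBench results show that ChatGLM2-6B-32K has a clear competitive advantage among open-source models of comparable size.
More efficient inference: Based on Multi-Query Attention, ChatGLM2-6B offers faster inference and lower GPU memory usage. With the official implementation, inference is 42% faster than the first generation, and under INT4 quantization the dialogue length supported by 6 GB of GPU memory increases from 1K to 8K.
More open license: The ChatGLM2-6B weights are fully open for academic research, and free commercial use is also permitted after registering through a questionnaire.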
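The memory saving from Multi-Query Attention comes mostly from the key/value cache: instead of one key/value head per query head, a small number of key/value heads is shared by all query heads. The arithmetic can be sketched as follows. The configuration values below (28 layers, 32 query heads, head dimension 128, 2 key/value heads, FP16 cache, batch size 1) are illustrative assumptions and are not stated in this README.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence, in bytes.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative configuration (assumed, see the note above).
NUM_LAYERS = 28
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 2
HEAD_DIM = 128
SEQ_LEN = 8192

# Standard multi-head attention caches one K/V head per query head.
mha = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Multi-query (grouped) attention caches only the shared K/V heads.
mqa = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"multi-head KV cache:  {mha / 2**30:.3f} GiB")
print(f"multi-query KV cache: {mqa / 2**30:.3f} GiB")
print(f"reduction factor:     {mha // mqa}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads, which is what allows a longer dialogue to fit in the same amount of GPU memory.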

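The open-source ChatGLM2-6B model is intended to advance large-model technology together with the open-source community. Developers and all users are asked to comply with the open-source license and not to use the open-source model, its code, or any derivatives of the open-source project for purposes that may harm the country or society, or for any service that has not undergone safety assessment and registration. At present, the project team has not developed any application based on ChatGLM2-6B, including web, Android, Apple iOS, or Windows apps.

Environment Setup

Docker (Method 1)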

docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk-23.10-py10-latest

docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=64G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name fingpt <your imageID> bash

docker exec -it fingpt bash

cd /path/your_code_data/FinGPT

pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip install deepspeed-0.12.3+gitfe61783.abi0.dtk2310.torch2.1.0a0-cp310-cp310-manylinux2014_x86_64.whl

Dockerfile (Method 2)

cd /path/your_code_data/FinGPT/docker

docker build --no-cache -t fingpt:latest .

docker run --shm-size=64G --name fingpt -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v /path/your_code_data/:/path/your_code_data/ -it fingpt bash

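Anaconda (Method 3)

The special deep learning libraries that this project requires for DCU accelerators can be downloaded and installed from the 光合 developer community.

conda create -n fingpt python=3.10

conda activate fingpt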

cd /path/your_code_data/FinGPT

pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

pip install deepspeed-0.12.3+gitfe61783.abi0.dtk2310.torch2.1.0a0-cp310-cp310-manylinux2014_x86_64.whl

DTK driver: dtk23.10
Python: 3.10
torch: 2.1

Tips: The versions of the DCU-related tools listed above (DTK driver, Python, torch, deepspeed) must match one another exactly.
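The deepspeed wheel filename used above already encodes the DTK, torch, and CPython versions it was built for, so one way to catch a mismatch early is to parse those tags and compare them with the versions listed here. The following is a minimal sketch. The helper names and the regular expressions are assumptions based only on the shape of this one filename, not an official naming scheme.

```python
import re

WHEEL = (
    "deepspeed-0.12.3+gitfe61783.abi0.dtk2310.torch2.1.0a0"
    "-cp310-cp310-manylinux2014_x86_64.whl"
)

# Versions listed in this README.
EXPECTED = {"dtk": "23.10", "torch": "2.1", "python": "3.10"}


def parse_wheel_tags(filename):
    """Extract the DTK, torch and CPython versions embedded in the filename."""
    dtk = re.search(r"dtk(\d{2})(\d{2})", filename)
    torch = re.search(r"torch(\d+)\.(\d+)", filename)
    python = re.search(r"-cp(\d)(\d+)-", filename)
    if not (dtk and torch and python):
        raise ValueError(f"unrecognized wheel name: {filename}")
    return {
        "dtk": f"{dtk.group(1)}.{dtk.group(2)}",
        "torch": f"{torch.group(1)}.{torch.group(2)}",
        "python": f"{python.group(1)}.{python.group(2)}",
    }


def find_mismatches(expected, found):
    """Return the sorted list of components whose versions differ."""
    return sorted(key for key in expected if found.get(key) != expected[key])


tags = parse_wheel_tags(WHEEL)
mismatches = find_mismatches(EXPECTED, tags)
print("wheel tags:", tags)
print("mismatches:", mismatches or "none")
```

The same comparison could be extended to the running interpreter by checking `sys.version_info` against the expected Python version before installing the wheel.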

Datasets

Instruction Tuning Datasets and Models

Datasets	Train Rows	Test Rows	Description
fingpt-sentiment-train	76.8k	N/A	Sentiment Analysis Training Instructions
fingpt-finred	27.6k	5.11k	Financial Relation Extraction Instructions
fingpt-headline	82.2k	20.5k	Financial Headline Analysis Instructions
fingpt-ner	511	98	Financial Named-Entity Recognition Instructions
fingpt-fiqa_qa	17.1k	N/A	Financial Q&A Instructions
fingpt-fineval	1.06k	265	Chinese Multiple-Choice Questions Instructions

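[Hugging Face dataset downloads] https://huggingface.co/FinGPT The site provides datasets for 12 different tasks that can be used for training.

Data Processing Code

The training data must be converted to the required JSON format. Each record has the following form: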

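  {
    "instruction": "Describe the structure of an atom.",
    "input": "",
    "output": "An atom is the basic unit of matter and is made up of three fundamental particles: protons, neutrons, and electrons. Protons and neutrons form the nucleus at the center of the atom, and the electrons outside the nucleus move around it.\n\nAtomic structure is layered. In the nucleus, protons carry a positive charge and neutrons carry no charge (they are neutral). The nucleus is very small and dense and accounts for the vast majority of the atom's total mass. Electrons carry a negative charge and generally move around the nucleus in several levels called shells or electron layers. The number of electrons equals the number of protons, which makes the atom electrically neutral.\n\nElectrons are distributed in a regular pattern within each shell, and different shells can hold different numbers of electrons. The innermost shell can generally hold only 2 electrons, the next shell can hold at most 8, and shells further out can hold progressively more.\n\nThe nucleus is governed mainly by two interactions: the strong force and the electromagnetic force. The strong force has a very short range, largely confined to the nucleus, and is strongly attractive, binding the nucleons (protons and neutrons) tightly together. The electromagnetic force has a longer range and acts mainly through the interaction between the electrons outside the nucleus and the nucleus itself.\n\nThat is the basic structure of an atom. The internal structure of atoms is complex and varied: atoms of different elements have different numbers of protons and neutrons in the nucleus and different arrangements of electrons outside it, which gives rise to the rich variety of the chemical world."
  },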

You can use data_process.py to process the data:

python data_process.py
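The contents of data_process.py are not shown in this README. As a rough idea of what such a conversion involves, the sketch below turns rows of a sentiment CSV into records with the `instruction`, `input`, and `output` fields described above and writes them as JSON Lines. The `text` and `label` column names, the integer-to-label mapping, the helper names, and the two sample rows are all assumptions made for illustration.

```python
import csv
import io
import json

# Assumed mapping from integer labels to answer strings.
LABELS = {"0": "negative", "1": "positive", "2": "neutral"}

INSTRUCTION = (
    "What is the sentiment of this news? "
    "Please choose an answer from {negative/neutral/positive}."
)


def rows_to_records(csv_file):
    """Convert CSV rows with 'text' and 'label' columns into instruction records."""
    reader = csv.DictReader(csv_file)
    return [
        {
            "instruction": INSTRUCTION,
            "input": row["text"],
            "output": LABELS[row["label"]],
        }
        for row in reader
    ]


def write_jsonl(records, out_file):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    for record in records:
        out_file.write(json.dumps(record, ensure_ascii=False) + "\n")


# Made-up sample rows standing in for a real CSV file.
sample = io.StringIO(
    "text,label\n"
    "Shares slide after weak guidance,0\n"
    "Dividend raised for the fifth year,1\n"
)
records = rows_to_records(sample)

buffer = io.StringIO()
write_jsonl(records, buffer)
print(buffer.getvalue())
```

To work with real files, replace the `io.StringIO` objects with files opened using `open(path, newline="", encoding="utf-8")` for reading and `open(path, "w", encoding="utf-8")` for writing.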

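The project includes a mini dataset for trial training. The training data directory is laid out as shown below; prepare the full dataset for regular training using the same layout: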

  dataset
  ├── dataset_new
  │   ├── data-00000-of-00001.arrow
  │   ├── dataset_info.json
  │   └── state.json
  ├── twitter-financial-news-sentiment
  │   ├── sent_dataset_meta.txt
  │   ├── sent_train.csv
  │   └── sent_valid.csv
  ├── dataset_new.json
  └── dataset_new.jsonl

Training

Single node, multiple DCUs

bash multi_dcu_train.sh

Single node, single DCU

bash single_dcu_train.sh

Inference

python inference_FinGPT.py

Results

Multi-task English inference

Single-task Chinese inference

Accuracy

Test data: twitter-financial-news-sentiment. Accelerators used: V100S and K100.

Test results:

Accelerator	train_loss	train_runtime	eval_loss	eval_runtime
V100S	0.371248	4445.348	0.06542	30.5495
K100	0.371148	4018.4874	0.06536	53.1593

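Application Scenarios

Finance, education, government, scientific research

Algorithm Category

NLP

Pretrained Weights

[Hugging Face pretrained model downloads] https://huggingface.co/FinGPT

Download the pretrained weights whose names carry the lora suffix, then merge them into the base model with the merge_model.py script provided by the author of this project.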
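The merge_model.py script itself is not shown here. Conceptually, merging a LoRA adapter folds the low-rank update into each adapted base weight matrix, W' = W + (alpha / r) · B · A, so that inference no longer needs the separate adapter. The pure-Python sketch below illustrates that arithmetic on a tiny matrix. The function names and the toy numbers are assumptions for illustration, and a real script would apply the same step to every adapted layer, typically through a library such as PEFT.

```python
def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
        for row in x
    ]


def merge_lora(w, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A.

    w:      d_out x d_in base weight
    lora_a: r x d_in     down-projection
    lora_b: d_out x r    up-projection
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example with d_out = d_in = 2 and rank r = 1.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]
lora_b = [[0.5], [0.0]]
merged = merge_lora(w, lora_a, lora_b, alpha=2.0)
print(merged)
```

After merging, multiplying an input by `merged` gives the same result as applying the base weight and the scaled adapter path separately and adding the two outputs.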

References

GitLab repository for this project: FinGPT
Tutorial from the reference project: Notebook tutorial

FinGPT_ChatGLM2-6B pretrained model
