Lingzhi LLM - A Vertical-Domain Industry Expert

✨ Highlights

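Reproduced Qwen2-chat starting from Qwen2-base, and released the training data;
In vertical-domain training, Lingzhi models improve domain-specific performance while preserving general-domain performance;
Summarized eight training paradigms (for example, direct instruction fine-tuning, or continual pre-training followed by instruction fine-tuning) and applied the best-performing paradigm for each model size;
Open-sourced eight Lingzhi models: Lingzhi-0.5B-chat, Lingzhi-0.8B-chat, Lingzhi-1.5B-chat, Lingzhi-2.7B-chat, Lingzhi-7B-chat, Lingzhi-10B-chat, Lingzhi-57MOE14B-chat, Lingzhi-72B-chat.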

Abstract

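In practice, continual training is common when the original pre-training data is unavailable. However, continual training often causes large language models (LLMs) to catastrophically forget their general capabilities while strengthening domain-specific skills. In this work, we first conduct an empirical study of common continual training paradigms and then select the best paradigm to train the Lingzhi model series. Experiments show that Lingzhi improves domain-specific performance while retaining general capabilities. We have open-sourced all models, training data, and benchmarks so that users can apply them to their own domains.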

Introduction

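Large language models (LLMs) have attracted considerable attention in recent years for their strong performance on a wide range of real-world downstream tasks. Although existing LLMs perform well in general domains, they may underperform in the specific domains users care about (such as accounting, law, and finance) because they saw too little domain-specific data during pre-training or instruction fine-tuning.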

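To improve LLM performance in a specific domain, we need to collect domain data for continual learning, such as continual pre-training (CPT) or supervised fine-tuning (SFT). However, we observe that continual learning on domain data alone can cause catastrophic forgetting of general capabilities such as planning, instruction following, mathematics, coding, and natural language understanding.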

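To retain both general and domain-specific capabilities, a common practice is to deploy an unmodified original model for general tasks alongside a fine-tuned model for specialized tasks. This places heavy demands on compute hardware (such as GPUs and memory) and hinders commercial deployment. It is widely regarded as a difficult problem in industry, and it raises a question worth studying: how can we improve domain-specific performance during continual learning without harming general capabilities?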
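To make the hardware cost of serving two models concrete, here is a rough back-of-the-envelope sketch. It assumes a nominal 7 billion parameters stored in 16-bit precision (2 bytes per parameter) and counts weights only; both figures are illustrative assumptions rather than measurements from this project, and real deployments also need memory for activations and the KV cache.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: a nominal 7B-parameter model held in 16-bit precision.
single_model_gb = weight_memory_gb(7e9)

# Serving an unmodified general model plus a fine-tuned domain model doubles the weights.
dual_model_gb = 2 * single_model_gb

print(f"one model:  {single_model_gb:.1f} GB of weights")
print(f"two models: {dual_model_gb:.1f} GB of weights")
```

Under these assumptions, a single model that handles both general and domain tasks would halve the weight memory compared with the two-model setup.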

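To address this question, we conducted an empirical study that explored a range of continual learning paradigms and summarized their strengths and weaknesses. Based on that study, we selected the best learning paradigm and training data and performed continual learning on Qwen2-base, producing the Lingzhi model series. Extensive experiments show that Lingzhi performs well across multiple specific domains while showing general capabilities comparable to the original Qwen2-chat models.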

Examples

Hugging Face example code

from transformers import AutoModelForCausalLM, AutoTokenizer

lingzhi_model_path = "Lingzhi-AI/Lingzhi-7B-chat"

# Load the weights; device_map="auto" places them on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    lingzhi_model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(lingzhi_model_path)

# Prompt: "Please introduce the Lingzhi large model."
prompt = "帮我介绍一下灵智大模型。"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template and append the assistant prefix.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Move the inputs to the device the model was actually placed on.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Pass input_ids together with attention_mask.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens so that only the newly generated reply is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
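For readers wondering what `apply_chat_template` returns in the example above: Qwen2 chat models use a ChatML-style template, and the sketch below approximates it in plain Python. It assumes that Lingzhi inherits this template from Qwen2, which this README does not confirm. `render_chatml` is a hypothetical helper written for illustration, and it omits details such as the default system prompt that Qwen2 injects when none is given. The template shipped with the tokenizer remains the authoritative source.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Hypothetical helper: approximate a ChatML-style chat template."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # An open assistant turn tells the model to start its reply here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Introduce the Lingzhi models."},
]
text = render_chatml(messages)
print(text)
```

Printing `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` for the real tokenizer is the reliable way to check whether the actual template matches this sketch.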

ModelScope example code

from modelscope import AutoModelForCausalLM, AutoTokenizer

lingzhi_model_path = "LingzhiLLM/Lingzhi-7B-chat"

# Load the weights; device_map="auto" places them on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    lingzhi_model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(lingzhi_model_path)

# Prompt: "Please introduce the Lingzhi large model."
prompt = "帮我介绍一下灵智大模型。"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template and append the assistant prefix.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Move the inputs to the device the model was actually placed on.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Pass input_ids together with attention_mask.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens so that only the newly generated reply is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
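The slicing step near the end of both examples is easy to misread, so here is a small standalone illustration. For decoder-only models, `generate` returns the prompt tokens followed by the continuation, which is why the prompt has to be sliced off before decoding. The token IDs below are made up for illustration, and `strip_prompt` is a hypothetical helper name, not part of any library.

```python
def strip_prompt(input_ids, output_ids):
    """Return only the tokens generated after the prompt."""
    return output_ids[len(input_ids):]


# Made-up token IDs: the first three entries of the output echo the prompt.
prompt_ids = [101, 102, 103]
full_output = [101, 102, 103, 7, 8, 9]

new_tokens = strip_prompt(prompt_ids, full_output)
print(new_tokens)

# The same idea applied to a batch, mirroring the list comprehension in the examples.
batch_inputs = [[1, 2], [3, 4]]
batch_outputs = [[1, 2, 50, 51], [3, 4, 60, 61]]
batch_new = [strip_prompt(i, o) for i, o in zip(batch_inputs, batch_outputs)]
print(batch_new)
```

Without this step, the decoded response would begin with the rendered system and user messages rather than with the model's reply.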

Results

Note: All Qwen2 results under Baselines were evaluated by us in the same unified environment.

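Model	MMLU (General, English)	BBH (General, English)	C-Eval (General, Chinese)	CMMLU (General, Chinese)	GSM8K (General, Math)	MathQA (General, Math)	HumanEval (General, Code)	MBPP (General, Code)	Account (Domain)	Law (Domain)	Avg.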
*Baselines*
Qwen2-0.5B-chat	43.30	10.35	54.16	53.57	33.97	25.76	20.73	12.40	17.01	25.00	29.62
Qwen2-1.5B-chat	55.73	9.55	69.32	70.13	54.21	32.93	42.68	20.60	32.65	42.07	42.99
Qwen2-7B-chat	69.82	30.56	81.58	81.77	66.26	44.09	72.56	42.20	55.10	59.15	60.31
Qwen2-57MOE14B-chat
Qwen2-72B-chat
*Lingzhi Models*
Lingzhi-0.5B-chat	44.25	25.65	55.05	53.74	29.34	29.18	25.00	22.40	25.85	40.24	35.07
Lingzhi-0.8B-chat	42.93	27.77	53.34	50.98	21.00	28.84	28.66	18.00	24.49	40.85	33.69
Lingzhi-1.5B-chat	55.35	33.67	69.47	69.10	49.58	35.31	39.02	31.00	37.41	42.68	46.26
Lingzhi-2.7B-chat	53.65	36.77	67.09	67.39	46.02	34.51	40.85	30.00	38.10	60.98	47.54
Lingzhi-7B-chat	69.06	58.95	82.69	83.05	74.22	45.59	56.10	49.80	72.79	89.02	68.13
Lingzhi-10B-chat	69.37	64.37	81.50	82.27	76.19	46.00	60.98	50.40	70.07	82.93	68.41
Lingzhi-57MOE14B-chat
Lingzhi-72B-chat
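The README does not state how the Avg. column is computed. The sketch below checks one plausible reading: an unweighted mean over the ten benchmark columns. The scores are copied from the Qwen2-7B-chat and Lingzhi-7B-chat rows of the table; the unweighted-mean formula itself is an assumption, tested here against the reported values.

```python
from statistics import mean

BENCHMARKS = ["MMLU", "BBH", "C-Eval", "CMMLU", "GSM8K",
              "MathQA", "HumanEval", "MBPP", "Account", "Law"]

# Per-benchmark scores and the reported Avg., copied from the table.
ROWS = {
    "Qwen2-7B-chat": ([69.82, 30.56, 81.58, 81.77, 66.26,
                       44.09, 72.56, 42.20, 55.10, 59.15], 60.31),
    "Lingzhi-7B-chat": ([69.06, 58.95, 82.69, 83.05, 74.22,
                         45.59, 56.10, 49.80, 72.79, 89.02], 68.13),
}


def recomputed_avg(scores):
    """Assumed formula: unweighted mean over all benchmark columns."""
    return mean(scores)


for name, (scores, reported) in ROWS.items():
    print(f"{name}: recomputed {recomputed_avg(scores):.2f}, reported {reported:.2f}")

# Per-benchmark change of Lingzhi-7B-chat relative to Qwen2-7B-chat.
deltas = {
    bench: round(lingzhi - qwen, 2)
    for bench, qwen, lingzhi in zip(
        BENCHMARKS, ROWS["Qwen2-7B-chat"][0], ROWS["Lingzhi-7B-chat"][0]
    )
}
for bench, delta in deltas.items():
    print(f"{bench}: {delta:+.2f}")
```

For these two rows the recomputed means agree with the reported Avg. values to two decimal places, which is consistent with the unweighted-mean reading. The per-benchmark differences show where each model is ahead.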

Citation

⚠️ Warning: If you use our models or data, please cite the following reference.

@misc{lingzhi,
      title={Lingzhi: Improving Domain-Specific Performance without Compromising General Capabilities}, 
      author={Daoguang Zan and Lei Yu and Ailun Yu and Zhirong Huang and Zongshuai Ruan and Pengjie Huang},
      year={2024},
      note={All authors contributed equally. The computational power required to train the Lingzhi models (12*8 H800 80G) was provided by Lingzhi AI. Special thanks to them.}
}
