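PolyLM-Qwen-7B Multilingual Pretrained Model

Model Introduction

PolyLM is a large language model proficient in multiple languages, covering 18 languages including Chinese, English, Spanish, French, German, Russian, Portuguese, Italian, Arabic, Japanese, Korean, Thai, Vietnamese, and Indonesian. This model is obtained by continuing to train the Qwen-7B pretrained model on PolyLM's multilingual pretraining data. It retains Qwen-7B's Chinese and English capabilities while substantially improving performance in the other languages. It can be applied to dialogue and question answering, text generation, machine translation, and sentiment analysis, and it can generate high-quality multilingual text, which facilitates communication across languages and cultures.

The PolyLM-Qwen-7B multilingual pretrained model has the following features: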

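Built on Tongyi Qianwen (Qwen-7B): Qwen-7B was pretrained on more than 2.2 trillion tokens, including high-quality Chinese, English, multilingual, code, and math data, covering both general and domain-specific corpora. The distribution of the pretraining corpus was optimized through extensive comparison experiments, and the model is highly competitive among open-source models of the same size.
Better multilingual capability: PolyLM-Qwen-7B is obtained by continuing to train Qwen-7B on 98 billion tokens of high-quality multilingual data. Across 18 languages, including Spanish, French, German, Russian, Portuguese, Italian, Arabic, Japanese, Korean, Thai, Vietnamese, and Indonesian, it significantly outperforms other open-source models in the community.
Good multilingual translation capability: the 98 billion tokens of continued-training data include 4.6 billion tokens of multilingual translation data, in which parallel corpora are turned into high-quality documents using over a hundred hand-written prompts.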
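
The last point, turning parallel sentence pairs into documents with hand-written prompts, can be illustrated with a minimal sketch. The two templates and the `build_document` helper below are hypothetical: the source states only that over a hundred prompts were written and does not list them.

```python
import random

# Hypothetical prompt templates; the actual hand-written prompts are not
# given in this model card.
TEMPLATES = [
    "Translate the following {src} text into {tgt}.\n{src}: {source}\n{tgt}: {target}",
    "{src}: {source}\nThe {tgt} translation is: {target}",
]


def build_document(pair: dict, rng: random.Random) -> str:
    """Wrap one parallel sentence pair in a randomly chosen template."""
    template = rng.choice(TEMPLATES)
    return template.format(**pair)


pair = {
    "src": "English",
    "tgt": "German",
    "source": "Good morning.",
    "target": "Guten Morgen.",
}
doc = build_document(pair, random.Random(0))
print(doc)
```

Sampling among many templates is one plausible way to keep the resulting documents varied, which is presumably why a large number of prompts was written.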

Related Models

Model	Layers	Heads	Hidden	Max_length	LR	Batch	Type
PolyLM-1.7B	24	16	2048	2048	1.0e-4	4M	Pretrain Model
PolyLM-13B	40	40	5120	2048	6.0e-5	4M	Pretrain Model
PolyLM-MultiAlpaca-13B	40	40	5120	2048	6.0e-5	4M	Chat Model
PolyLM-Assistant-13B	40	40	5120	2048	6.0e-5	4M	Chat Model

Dependencies

To run PolyLM-Qwen-7B, make sure your PyTorch version is at least 1.12, then run the following pip commands to install the dependencies.

pip install modelscope -U
pip install transformers_stream_generator
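
The version requirement can be checked before loading the model. The following is a minimal sketch; `version_tuple` and `meets_minimum` are illustrative helper names, not part of modelscope or PyTorch.

```python
import re


def version_tuple(version: str) -> tuple:
    """Extract (major, minor) from strings such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str, minimum: tuple = (1, 12)) -> bool:
    """Return True when the version is at least the stated minimum."""
    return version_tuple(version) >= minimum


try:
    import torch
    print("torch", torch.__version__, "ok:", meets_minimum(torch.__version__))
except ImportError:
    print("torch is not installed")
```

Comparing integer tuples avoids the pitfall of comparing version strings lexically, where "1.9" would sort after "1.12".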

In addition, we recommend installing the flash-attention library for higher efficiency and lower GPU memory usage.

git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
pip install csrc/layer_norm
pip install csrc/rotary

Quickstart

You can easily call the model with the following code:

from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig

# Note: The default behavior now has injection attack prevention off.

tokenizer = AutoTokenizer.from_pretrained("damo/nlp_polylm_qwen_7b_text_generation", revision = 'v1.0.1', trust_remote_code=True)
# We recommend checking the support of BF16 first. Run the command below:
# import torch
# torch.cuda.is_bf16_supported()
# use bf16
model = AutoModelForCausalLM.from_pretrained("damo/nlp_polylm_qwen_7b_text_generation", device_map="auto", revision = 'v1.0.1',trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("damo/nlp_polylm_qwen_7b_text_generation", device_map="auto", revision = 'v1.0.1', trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("damo/nlp_polylm_qwen_7b_text_generation", device_map="cpu", revision = 'v1.0.1', trust_remote_code=True).eval()
# use fp32
#model = AutoModelForCausalLM.from_pretrained("damo/nlp_polylm_qwen_7b_text_generation", device_map="auto", revision = 'v1.0.1',trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("damo/nlp_polylm_qwen_7b_text_generation", revision='v1.0.1', trust_remote_code=True)  # generation length, top_p, and other hyperparameters can be set here

inputs = tokenizer('蒙古国的首都是乌兰巴托（Ulaanbaatar）\n冰岛的首都是雷克雅未克（Reykjavik）\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to(model.device)  # follow the model's device so the CPU-only variant also works
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# 蒙古国的首都是乌兰巴托（Ulaanbaatar）\n冰岛的首都是雷克雅未克（Reykjavik）\n埃塞俄比亚的首都是亚的斯亚贝巴（Addis Ababa）...
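
The commented-out variants above differ only in a few keyword arguments. As a hedged sketch, the choice could be expressed as a small helper; `precision_kwargs` is a hypothetical name, and the `bf16`, `fp16`, and `device_map` arguments are the ones shown in the snippet above.

```python
def precision_kwargs(cuda_available: bool, bf16_supported: bool) -> dict:
    """Choose the extra from_pretrained keyword arguments shown above."""
    if not cuda_available:
        # CPU only: load in the default precision on the CPU.
        return {"device_map": "cpu"}
    if bf16_supported:
        return {"device_map": "auto", "bf16": True}
    # GPU without BF16 support: fall back to FP16.
    return {"device_map": "auto", "fp16": True}


# With torch installed, the two flags could come from:
#   cuda = torch.cuda.is_available()
#   bf16 = cuda and torch.cuda.is_bf16_supported()
print(precision_kwargs(cuda_available=True, bf16_supported=False))
```

The returned dictionary would then be unpacked into `AutoModelForCausalLM.from_pretrained(...)` alongside `revision` and `trust_remote_code`.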

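Model Training (SFT)

Code link: https://github.com/modelscope/modelscope/blob/master/examples/pytorch/llm

Supported SFT methods: LoRA, QLoRA, full-parameter fine-tuning, …
Supported features: model quantization, DDP, model parallelism (device_map), gradient checkpointing, gradient accumulation, pushing to the ModelScope hub, custom datasets, …

Running SFT on PolyLM-Qwen-7B with the official ModelScope code requires the following small additions:

Add a multilingual instruction-tuning dataset (recommended)

Paste the following code snippets into https://github.com/modelscope/modelscope/blob/master/examples/pytorch/llm/utils/dataset.py to use the MultiAlpaca dataset:

Define the data-loading logic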

def _processing_multi_alpaca(datasets: [HfDataset, List]) -> HfDataset:
    # Merge one or more Alpaca-style datasets into instruction/output pairs.
    output = []
    res = []
    if not isinstance(datasets, list):
        datasets = [datasets]
    for dataset in datasets:
        instruction = dataset['instruction']
        input_ = dataset['input']
        output_ = dataset['output']
        for inst, inp, opt in zip(instruction, input_, output_):
            if inp is not None and inp != '':
                # Strip the leading '输入：' ("Input:") marker, then append the input.
                if inp.startswith('输入：'):
                    inp = inp[3:]
                inst = f'{inst}\n{inp}'
            # Keep only samples that have a non-empty output.
            if opt is not None and opt != '':
                res.append(inst)
                output.append(opt)
    dataset = HfDataset.from_dict({'instruction': res, 'output': output})
    return dataset
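
To see what this function does to individual records, here is a minimal sketch that mirrors the same filtering on plain dictionaries, so it runs without the `datasets` library. `merge_records` and the toy records are illustrative only.

```python
def merge_records(records):
    """Mirror of the filtering above, on plain dicts instead of HfDataset."""
    instructions, outputs = [], []
    for rec in records:
        inst = rec["instruction"]
        inp = rec.get("input")
        out = rec.get("output")
        if inp:  # skips both None and ''
            if inp.startswith("输入："):
                inp = inp[3:]  # the marker is three characters long
            inst = f"{inst}\n{inp}"
        if out:  # samples with an empty output are dropped
            instructions.append(inst)
            outputs.append(out)
    return {"instruction": instructions, "output": outputs}


toy = [
    {"instruction": "Translate to French.", "input": "输入：Hello", "output": "Bonjour"},
    {"instruction": "Say hi.", "input": "", "output": "Hi!"},
    {"instruction": "Dropped.", "input": None, "output": ""},
]
result = merge_records(toy)
print(result)
```

The first record has its input appended to the instruction, the second is kept unchanged, and the third is discarded because its output is empty.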

Download the dataset

def get_multi_alpaca_dataset() -> HfDataset:
    dataset_multi = []
    for subset_name in ['ar', 'de', 'es', 'fr', 'id', 'ja', 'ko', 'pt', 'ru', 'th', 'vi']:
        dataset_sub: HfDataset = MsDataset.load(
            'damo/nlp_polylm_multialpaca_sft',
            subset_name=subset_name,
            split='train').to_hf_dataset()
        dataset_multi.append(dataset_sub)

    return _processing_multi_alpaca(dataset_multi)

Define the dataset MAPPER

DATASET_MAPPER = {
    'alpaca-en': get_alpaca_en_dataset,
    'alpaca-zh': get_alpaca_zh_dataset,
    'alpaca-multi': get_multi_alpaca_dataset,
}
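
The mapper associates each dataset name with a loader function. How `llm_sft.py` consumes it is not shown in this card, so the following is only a sketch under assumptions: a comma-separated name list is split, each loader is called, and the combined rows are subsampled. `TOY_MAPPER`, the toy loaders, and `load_datasets` are hypothetical stand-ins.

```python
import random


def get_toy_en():
    """Stand-in loader returning five English rows."""
    return [{"instruction": f"en-{i}", "output": "..."} for i in range(5)]


def get_toy_multi():
    """Stand-in loader returning three multilingual rows."""
    return [{"instruction": f"multi-{i}", "output": "..."} for i in range(3)]


# Hypothetical mapper with the same name-to-function shape as DATASET_MAPPER.
TOY_MAPPER = {"alpaca-en": get_toy_en, "alpaca-multi": get_toy_multi}


def load_datasets(names: str, sample: int, seed: int = 42):
    """Resolve comma-separated names, concatenate, then subsample."""
    rows = []
    for name in names.split(","):
        rows.extend(TOY_MAPPER[name.strip()]())
    if 0 < sample < len(rows):
        rows = random.Random(seed).sample(rows, sample)
    return rows


print(len(load_datasets("alpaca-en,alpaca-multi", sample=4)))
```

Registering a new dataset then amounts to adding one loader function and one dictionary entry, which is what the `'alpaca-multi'` line above does.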

Add the model configuration (required)

Paste the following code snippets into https://github.com/modelscope/modelscope/blob/master/examples/pytorch/llm/utils/models.py to fine-tune PolyLM-Qwen-7B:

Define the LoRA modules

class LoRATM(NamedTuple):
    # default lora target modules
    baichuan = ['W_pack']
    chatglm2 = ['query_key_value']
    llama2 = ['q_proj', 'k_proj', 'v_proj']
    qwen = ['c_attn']
    polylm = ['c_attn']
    polylm_qwen = ['c_attn']
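
To get a feel for how small a LoRA adapter on `c_attn` is, the sketch below counts its trainable parameters. It uses the Qwen-7B hyperparameters listed in the model details section of this card (d_model 4096, 32 layers) and the rank from the training command (8). It assumes that `c_attn` is a fused query/key/value projection mapping d_model to 3 × d_model, which this card does not state.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) next to a frozen weight."""
    return rank * (in_features + out_features)


hidden = 4096   # d_model from the Qwen-7B hyperparameter table
n_layers = 32   # n_layers from the same table
rank = 8        # --lora_rank in the training command

# Assumption: c_attn maps hidden -> 3 * hidden (fused Q, K, V projection).
per_layer = lora_param_count(hidden, 3 * hidden, rank)
total = per_layer * n_layers
print(per_layer, total)
```

Under these assumptions the adapter has roughly 4.2 million trainable parameters, a tiny fraction of the 7B base model.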

Add the model MAPPER

MODEL_MAPPER = {
    'baichuan-7b': {
        'model_id': 'baichuan-inc/baichuan-7B',  # model id or model dir
        'revision': 'v1.0.7',
        'lora_TM': LoRATM.baichuan
    },
    'baichuan-13b': {
        'model_id': 'baichuan-inc/Baichuan-13B-Base',
        'revision': 'v1.0.3',
        'torch_dtype': torch.bfloat16,
        'lora_TM': LoRATM.baichuan
    },
    'chatglm2-6b': {
        'model_id': 'ZhipuAI/chatglm2-6b',
        'revision': 'v1.0.6',
        'get_function': get_model_tokenizer_chatglm2,
        'lora_TM': LoRATM.chatglm2
    },
    'llama2-7b': {
        'model_id': 'modelscope/Llama-2-7b-ms',
        'revision': 'v1.0.2',
        'ignore_file_pattern': [r'.+\.bin$'],  # use safetensors
        'lora_TM': LoRATM.llama2
    },
    'llama2-13b': {
        'model_id': 'modelscope/Llama-2-13b-ms',
        'revision': 'v1.0.2',
        'ignore_file_pattern': [r'.+\.bin$'],
        'lora_TM': LoRATM.llama2
    },
    'openbuddy-llama2-13b': {
        'model_id': 'OpenBuddy/openbuddy-llama2-13b-v8.1-fp16',
        'revision': 'v1.0.0',
        'lora_TM': LoRATM.llama2
    },
    'qwen-7b': {
        'model_id': 'QWen/qwen-7b',
        'revision': 'v1.0.0',
        'get_function': get_model_tokenizer_qwen,
        'torch_dtype': torch.bfloat16,
        'lora_TM': LoRATM.qwen,
    },
    'polylm-qwen-7b': {
        'model_id': 'damo/nlp_polylm_qwen_7b_text_generation',
        'revision': 'v1.0.1',
        'get_function': get_model_tokenizer_qwen,
        'torch_dtype': torch.bfloat16,
        'lora_TM': LoRATM.polylm_qwen,
    },
    'polylm-13b': {
        'model_id': 'damo/nlp_polylm_13b_text_generation',
        'revision': 'v1.0.3',
        'get_function': get_model_tokenizer_polylm,
        'torch_dtype': torch.bfloat16,
        'lora_TM': LoRATM.polylm
    }
}

Run training

# git clone https://github.com/modelscope/modelscope.git
# cd modelscope/examples/pytorch/llm

CUDA_VISIBLE_DEVICES=0,1 \
python llm_sft.py \
    --model_type polylm-qwen-7b \
    --sft_type lora \
    --output_dir runs \
    --dataset alpaca-en,alpaca-zh,alpaca-multi \
    --dataset_sample 20000 \
    --max_length 1024 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --batch_size 1 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 10
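
A few quantities follow directly from these flags. The sketch below works them out, assuming a single training process in which the two visible GPUs are used for model parallelism (device_map) and not DDP, and assuming the standard LoRA scaling of alpha / rank. The step count is an upper bound that ignores any held-out evaluation split.

```python
import math


def effective_batch_size(batch_size: int, grad_accum_steps: int,
                         data_parallel_ranks: int = 1) -> int:
    """Samples contributing to one optimizer update."""
    return batch_size * grad_accum_steps * data_parallel_ranks


def lora_scaling(lora_alpha: float, lora_rank: int) -> float:
    """Standard LoRA scales the low-rank update by alpha / rank."""
    return lora_alpha / lora_rank


def max_optimizer_steps(num_samples: int, effective_batch: int) -> int:
    """Upper bound on updates per epoch if every sample is used for training."""
    return math.ceil(num_samples / effective_batch)


batch = effective_batch_size(batch_size=1, grad_accum_steps=16)
print(batch)                              # 16
print(lora_scaling(32, 8))                # 4.0
print(max_optimizer_steps(20000, batch))  # 1250
```

With evaluation and checkpointing every 50 steps, this setup would evaluate at most 25 times per pass over the sampled data.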

Run inference

CUDA_VISIBLE_DEVICES=0,1 \
python llm_infer.py \
    --model_type polylm-qwen-7b \
    --ckpt_path "runs/polylm-qwen-7b/v0-20230812-172425/output_best/pytorch_model.bin" \
    --eval_human true

Model Details

The architecture is identical to Qwen-7B; the following content is copied from the Qwen-7B model card.

The details of the model architecture of Qwen-7B are listed as follows:

Hyperparameter	Value
n_layers	32
n_heads	32
d_model	4096
vocab size	151851
sequence length	2048

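For position encoding, the FFN activation function, and normalization, we also adopt today's most popular choices: RoPE relative position encoding, the SwiGLU activation function, and RMSNorm (flash-attention can optionally be installed for acceleration).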
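
For readers unfamiliar with these three components, the following is a minimal pure-Python sketch of their textbook definitions. It is not Qwen-7B's implementation: the `eps` value and the function names are illustrative assumptions, and real implementations operate on tensors.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root mean square."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """SwiGLU gating: SiLU(gate) * up, elementwise."""
    return [silu(g) * u for g, u in zip(gate, up)]


def rope_rotate_pair(x0, x1, position, inv_freq):
    """RoPE rotates each feature pair by an angle proportional to position."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


normed = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
rotated = rope_rotate_pair(1.0, 0.0, position=2, inv_freq=0.5)
print(normed, swiglu([0.0, 1.0], [5.0, 2.0]), rotated)
```

Note that RMSNorm leaves the vector with a unit root mean square, and the RoPE rotation preserves the length of each pair, so only relative angles between positions affect attention scores.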

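For the tokenizer, whereas current mainstream open-source models rely mainly on Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150,000 tokens. Built on cl100k_base, the BPE vocabulary used by GPT-4, it is optimized for Chinese and other languages: it encodes and decodes Chinese, English, and code efficiently and is also friendlier to a number of other languages, so users can strengthen the model in some languages without extending the vocabulary. Numbers are split into individual digits. Tokenization uses the efficient tiktoken library.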
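
The digit rule can be illustrated with a short sketch. This shows only the idea of splitting numbers into single digits before any subword merging; the regular expression below is an illustration and is not the tokenizer's actual pre-tokenization pattern.

```python
import re


def pre_split_digits(text: str) -> list:
    """Split text so that every digit becomes its own piece."""
    return re.findall(r"\d|\D+", text)


pieces = pre_split_digits("Qwen-7B has 151851 tokens")
print(pieces)
# ['Qwen-', '7', 'B has ', '1', '5', '1', '8', '5', '1', ' tokens']
```

Splitting by digit means a number is always represented the same way regardless of its magnitude, at the cost of more tokens for long numbers.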

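We randomly sampled 1 million documents for each of several languages to compare the encoding compression rates of different models (using XLM-R, which supports 100 languages, as the baseline value of 1; lower is better). See the figure for detailed results.

As can be seen, while keeping efficient decoding of Chinese, English, and code, Qwen-7B also achieves a high compression rate on many widely spoken languages (Thai th, Hebrew he, Arabic ar, Korean ko, Vietnamese vi, Japanese ja, Turkish tr, Indonesian id, Polish pl, Russian ru, Dutch nl, Portuguese pt, Italian it, German de, Spanish es, French fr, and others), giving the model strong scalability and high training and inference efficiency in these languages as well.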
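
The exact formula behind the compression rate is not given here. A plausible reading, sketched below as an assumption, is the number of tokens a tokenizer produces on a corpus divided by the number the baseline tokenizer produces on the same corpus. The token counts are made up for illustration and are not measurements.

```python
def relative_compression(token_counts: dict, baseline: str = "xlm-r") -> dict:
    """Normalize token counts so the baseline tokenizer equals 1.0."""
    base = token_counts[baseline]
    return {name: count / base for name, count in token_counts.items()}


# Illustrative token counts on the same corpus (not real measurements).
counts = {"xlm-r": 1000, "tokenizer-a": 800, "tokenizer-b": 1250}
ratios = relative_compression(counts)
print(ratios)
```

Under this reading, a value below 1 means the tokenizer needs fewer tokens than XLM-R for the same text, so more content fits into a fixed sequence length.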

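Continued Training Data

For the continued-training data, PolyLM-Qwen-7B uses open-source general corpora, cleaned and filtered into a dataset of 98 billion tokens (the content inside the red box in the table below). The language distribution is shown by the orange part of the bar chart below.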

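Evaluation

Cross-lingual Natural Language Understanding

Covering three tasks, XNLI, XCOPA, and PAWS-X, the comparison with PolyLM-13B, LLaMA2-13B, Qwen-7B, and ChatGPT is shown below:

Multilingual Machine Translation

Covering 8 translation directions from the WMT2020 machine translation evaluation (English to Chinese, English to German, English to Russian, English to Japanese, Chinese to English, Russian to English, German to English, and Japanese to English), the comparison with PolyLM-13B, LLaMA2-13B, Qwen-7B, and ChatGPT is shown below:

Chinese Evaluation (C-Eval)

C-Eval is a widely used benchmark for evaluating the Chinese knowledge of pretrained models, covering 52 subjects across four broad areas: humanities, social sciences, STEM, and other specialties. Following standard practice, we use dev-set samples as the few-shot source. On the C-Eval validation set, the accuracy of PolyLM-Qwen-7B compared with other models is as follows: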

Model	Avg.
Alpaca-7B	28.9
Vicuna-7B	31.2
ChatGLM-6B	37.1
Baichuan-7B	42.7
ChatGLM2-6B	50.9
InternLM-7B	53.4
ChatGPT	53.5
Claude-v1.3	55.5
PolyLM-Qwen-7B	57.8
Qwen-7B	60.8

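Compared with Qwen-7B, PolyLM-Qwen-7B drops by 3 percentage points, mainly because Chinese makes up a small share of the continued-training data, far lower than in Qwen-7B's own pretraining data. This shows that PolyLM-Qwen-7B retains most of Qwen-7B's original capability while substantially improving the other languages. It also remains highly competitive with other open-source models.

English Evaluation (MMLU)

MMLU is one of the most authoritative benchmarks for evaluating general English ability; it likewise covers 57 subtasks across different subject areas and difficulty levels.

The 5-shot MMLU accuracy of PolyLM-Qwen-7B is shown in the table below: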

Model	Avg.	STEM	Social Sciences	Humanities	Others
LLaMA-7B	35.1	30.5	38.3	34.0	38.1
Baichuan-7B	42.3	35.6	48.9	38.4	48.1
LLaMA2-7B	45.3	36.4	51.2	42.9	52.2
LLaMA-13B	46.9	35.8	53.8	45.0	53.3
ChatGLM2-6B	47.9	41.2	54.4	43.7	54.5
InternLM-7B	51.0	-	-	-	-
Baichuan-13B	51.6	41.6	60.9	47.4	58.5
LLaMA2-13B	54.8	44.1	62.6	52.8	61.1
ChatGLM2-12B	56.2	48.2	65.1	52.6	60.9
PolyLM-Qwen-7B	53.8	45.1	61.9	49.1	61.8
Qwen-7B	56.7	47.6	65.9	51.5	64.7
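
The cost of continued training on Chinese and English benchmarks can be read off the two tables. The sketch below computes the absolute drop and the share of the base score retained, using only the average scores reported above; `drop_and_retention` is an illustrative helper.

```python
scores = {
    "C-Eval": {"Qwen-7B": 60.8, "PolyLM-Qwen-7B": 57.8},
    "MMLU": {"Qwen-7B": 56.7, "PolyLM-Qwen-7B": 53.8},
}


def drop_and_retention(base: float, tuned: float) -> tuple:
    """Return (absolute drop in points, percentage of base score retained)."""
    return round(base - tuned, 1), round(100 * tuned / base, 1)


summary = {
    name: drop_and_retention(s["Qwen-7B"], s["PolyLM-Qwen-7B"])
    for name, s in scores.items()
}
print(summary)
# {'C-Eval': (3.0, 95.1), 'MMLU': (2.9, 94.9)}
```

On both benchmarks PolyLM-Qwen-7B keeps about 95% of Qwen-7B's average score.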

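License Agreement

Our code and model weights are fully open for academic research, and commercial use is also supported.

Citation

If you find this model helpful, please consider citing the following paper: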

@misc{wei2023polylm,
      title={PolyLM: An Open Source Polyglot Large Language Model}, 
      author={Xiangpeng Wei and Haoran Wei and Huan Lin and Tianhao Li and Pei Zhang and Xingzhang Ren and Mei Li and Yu Wan and Zhiwei Cao and Binbin Xie and Tianxiang Hu and Shangjie Li and Binyuan Hui and Bowen Yu and Dayiheng Liu and Baosong Yang and Fei Huang and Jun Xie},
      year={2023},
      eprint={2307.06018},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
