- This model is a GGUF conversion and int4 quantization of vivo-ai/BlueLM-7B-Chat (https://modelscope.cn/models/vivo-ai/BlueLM-7B-Chat).
Procedure
Step 1: Download the model
from modelscope import snapshot_download
model_dir = snapshot_download("vivo-ai/BlueLM-7B-Chat", revision="v1.0.2")
print(model_dir)
# /mnt/workspace/.cache/modelscope/vivo-ai/BlueLM-7B-Chat
Step 2: Convert to a GGUF model
root@dsw-30793-6fc485bff8-x5qnz:/mnt/workspace/demos/llama.cpp# git log|head
commit a75fa576abba9d37f463580c379e4bbf1e1ad03c
root@dsw-30793-6fc485bff8-x5qnz:/mnt/workspace/demos/llama.cpp# python3 convert.py /mnt/workspace/.cache/modelscope/vivo-ai/BlueLM-7B-Chat
convert.py fails during conversion with this error:
Exception: Unexpected tensor name: model.embed_layer_norm.weight
# added_tokens.json in the model directory needs the following entries added:
{
"[|AI|]:": 100001,
"[|Human|]:": 100000,
"100002": 100002,
"100003": 100003,
...
"100095": 100095
}
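The entries "100002" through "100095" are plain placeholders that pad the tokenizer's vocabulary up to the model's full vocab size. Below is a small sketch that patches the file programmatically, assuming (as the listing above suggests) that the placeholders run contiguously from 100002 to 100095:
import json
import os

model_dir = "/mnt/workspace/.cache/modelscope/vivo-ai/BlueLM-7B-Chat"
path = os.path.join(model_dir, "added_tokens.json")

with open(path) as f:
    tokens = json.load(f)

tokens["[|Human|]:"] = 100000   # the two real special tokens
tokens["[|AI|]:"] = 100001
for i in range(100002, 100096): # placeholder ids 100002..100095
    tokens[str(i)] = i

with open(path, "w") as f:
    json.dump(tokens, f, ensure_ascii=False, indent=2, sort_keys=True)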
Step 3: Fix the GGUF conversion issues
1. Edit gguf-py/gguf/gguf.py: add one line around line 372
root@dsw-30793-6fc485bff8-x5qnz:/mnt/workspace/demos/llama.cpp# vim gguf-py/gguf/gguf.py
# Output norm
MODEL_TENSOR.OUTPUT_NORM: (
"gpt_neox.final_layer_norm", # gptneox
"transformer.ln_f", # gpt2 gpt-j falcon
"model.norm", # llama-hf baichuan
"norm", # llama-pth
"embeddings.LayerNorm", # bert
"model.embed_layer_norm", # BlueLM, 此处的一行是新增的;
"transformer.norm_f", # mpt
"ln_f", # refact bloom
"language_model.encoder.final_layernorm", # persimmon
),
2. Edit convert.py: around line 985, add the following two lines:
< if name.startswith("model.embed_layer_norm."):
< name_new = "token_embd_norm." + name.split('.')[-1]
After the change, the code reads:
for name, lazy_tensor in model.items():
    tensor_type, name_new = tmap.get_type_and_name(name, try_suffixes = (".weight", ".bias")) or (None, None)
    if name.startswith("model.embed_layer_norm."): # this line is new
        name_new = "token_embd_norm." + name.split('.')[-1] # this line is new too
    if name_new is None:
        raise Exception(f"Unexpected tensor name: {name}")
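To see why both edits are needed: get_type_and_name strips the ".weight"/".bias" suffix and looks the base name up in the tensor map, so the gguf.py entry keeps the lookup from returning None; but that entry files model.embed_layer_norm under OUTPUT_NORM, whose GGUF name "output_norm" is already taken by model.norm, so the convert.py override renames it to the distinct token_embd_norm. A simplified emulation of this logic (names follow the snippets above, not the actual gguf-py implementation):
# Simplified emulation of the tensor-name mapping used by convert.py
TENSOR_MAP = {
    "model.norm": "output_norm",
    "model.embed_layer_norm": "output_norm",  # added by the gguf.py edit above
}

def map_name(name, try_suffixes=(".weight", ".bias")):
    for suffix in try_suffixes:
        if name.endswith(suffix):
            base = name[: -len(suffix)]
            if base in TENSOR_MAP:
                return TENSOR_MAP[base] + suffix
    return None

name = "model.embed_layer_norm.weight"
name_new = map_name(name)                      # "output_norm.weight" -- would collide with model.norm
if name.startswith("model.embed_layer_norm."): # hence the convert.py override
    name_new = "token_embd_norm." + name.split(".")[-1]
print(name_new)                                # token_embd_norm.weight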
After these changes, the following command converts the model to GGUF format successfully:
root@dsw-30793-6fc485bff8-x5qnz:/mnt/workspace/demos/llama.cpp# python3 convert.py /mnt/workspace/.cache/modelscope/vivo-ai/BlueLM-7B-Chat
Step 4: Fix model loading and inference issues
Because llama.cpp does not yet support this model, its loading and graph-building code must be modified; otherwise an error like the following appears:
error loading model: done_getting_tensors: wrong number of tensors; expected 293, got 291
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/workspace/.cache/modelscope/vivo-ai/BlueLM-7B-Chat/ggml-model-f16.gguf'
main: error: unable to load model
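The count mismatch (expected 293, got 291) is stock llama.cpp failing to account for the two extra token_embd_norm tensors now present in the GGUF file. One way to confirm what the file actually contains, assuming a gguf-py build that ships GGUFReader:
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("/mnt/workspace/.cache/modelscope/vivo-ai/BlueLM-7B-Chat/ggml-model-f16.gguf")
names = [t.name for t in reader.tensors]
print(len(names))  # should be 293 after the convert.py changes above
print([n for n in names if n.startswith("token_embd")])
# expected: token_embd.weight, token_embd_norm.weight, token_embd_norm.bias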
Here we take modifying llama.cpp (the llama.cpp source file itself) as an example; three changes are needed.
Change 1: add one line at line 344
344d343
< { LLM_TENSOR_TOKEN_EMBD_NORM, "token_embd_norm" },
After the change:
static std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NAMES = {
{
LLM_ARCH_LLAMA,
{
{ LLM_TENSOR_TOKEN_EMBD, "token_embd" },
{ LLM_TENSOR_TOKEN_EMBD_NORM, "token_embd_norm" }, // this line is new
{ LLM_TENSOR_OUTPUT_NORM, "output_norm" },
{ LLM_TENSOR_OUTPUT, "output" },
{ LLM_TENSOR_ROPE_FREQS, "rope_freqs" },
{ LLM_TENSOR_ATTN_NORM, "blk.%d.attn_norm" },
...
Change 2: add two lines at line 2585
2585,2587d2583
< model.tok_norm = ml.create_tensor(ctx, tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight"), {n_embd}, GGML_BACKEND_CPU);
< model.tok_norm_b = ml.create_tensor(ctx, tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias"), {n_embd}, GGML_BACKEND_CPU);
After the change:
switch (model.arch) {
case LLM_ARCH_LLAMA:
case LLM_ARCH_REFACT:
{
model.tok_embd = ml.create_tensor(ctx, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, GGML_BACKEND_CPU);
model.tok_norm = ml.create_tensor(ctx, tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight"), {n_embd}, GGML_BACKEND_CPU); // new line
model.tok_norm_b = ml.create_tensor(ctx, tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias"), {n_embd}, GGML_BACKEND_CPU); // new line too
...
Change 3: add the following at line 3639
3639,3644d3634
<
< inpL = llm_build_norm(ctx0, inpL, hparams,
< model.tok_norm,
< model.tok_norm_b,
< LLM_NORM, cb, -1);
< cb(inpL, "inp_norm", -1);
After the change:
struct ggml_cgraph * build_llama() {
...
// KQ_mask (mask for 1 head, it will be broadcasted to all heads)
struct ggml_tensor * KQ_mask = ggml_new_tensor_3d(ctx0, GGML_TYPE_F32, n_kv, n_tokens, 1);
cb(KQ_mask, "KQ_mask", -1);
// the next 5 lines are new
inpL = llm_build_norm(ctx0, inpL, hparams,
model.tok_norm,
model.tok_norm_b,
LLM_NORM, cb, -1);
cb(inpL, "inp_norm", -1);
// shift the entire K-cache if needed
if (do_rope_shift) {
llm_build_k_shift(ctx0, hparams, cparams, kv_self, gf, LLM_ROPE, n_ctx, n_embd_head, freq_base, freq_scale, cb);
}
...
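For reference, the inserted llm_build_norm call applies a standard LayerNorm (LLM_NORM, with weight and bias, rather than the RMSNorm used elsewhere in LLaMA) to the token embeddings before the first transformer block. A minimal numpy sketch of that operation (the eps value here is an assumption):
import numpy as np

def embed_layer_norm(x, weight, bias, eps=1e-5):
    # Standard LayerNorm over the embedding dimension, as selected by LLM_NORM
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * weight + bias

n_tokens, n_embd = 4, 4096
x = np.random.randn(n_tokens, n_embd).astype(np.float32)
w = np.ones(n_embd, dtype=np.float32)   # model.tok_norm
b = np.zeros(n_embd, dtype=np.float32)  # model.tok_norm_b
print(embed_layer_norm(x, w, b).shape)  # (4, 4096)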
Clean the previous build artifacts and rebuild:
root@dsw-30793-6fc485bff8-x5qnz:/mnt/workspace/demos/llama.cpp# make clean && make
Load the GGUF model with the freshly built binary and run a test prediction:
root@dsw-30793-6fc485bff8-x5qnz:/mnt/workspace/demos/llama.cpp# ./main -ngl 32 -m /mnt/workspace/.cache/modelscope/vivo-ai/BlueLM-7B-Chat/ggml-model-f16.gguf --color -c 2048 --temp 1.0 --repeat_penalty 1.1 -n -1 -p "问题:桃花潭水深千尺的下一句是什么?答案:"
# 问题:桃花潭水深千尺的下一句是什么?答案:不及汪伦送我情
Alternatively, quantize the model and serve it from the command line.
# Quantize first
root@dsw-30793-7bd7ddb664-jtc4b:/mnt/workspace/demos/llama.cpp# ./quantize /tmp/ggml-model-f16.gguf /tmp/BlueLM-7B-Chat-gguf-q4_0.bin q4_0
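q4_0 packs weights into blocks of 32 values, each block carrying one fp16 scale and 32 four-bit codes, which is why the 7B model shrinks to about 4 GB. A simplified numpy sketch of the scheme (the real kernel stores two codes per byte and the scale in fp16; treat this as an approximation of llama.cpp's reference implementation, not the exact kernel):
import numpy as np

def quantize_q4_0_block(x):
    # One block: 32 floats -> one scale + 32 four-bit codes
    assert x.shape == (32,)
    max_val = x[np.argmax(np.abs(x))]  # signed value with the largest magnitude
    d = max_val / -8.0                 # scale (stored as fp16 in the real format)
    inv_d = 1.0 / d if d != 0 else 0.0
    q = np.clip(np.round(x * inv_d) + 8, 0, 15).astype(np.uint8)
    return d, q

def dequantize_q4_0_block(d, q):
    return d * (q.astype(np.float32) - 8.0)

x = np.random.randn(32).astype(np.float32)
d, q = quantize_q4_0_block(x)
print(np.abs(x - dequantize_q4_0_block(d, q)).max())  # small reconstruction error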
# Load the quantized model and start the server
root@dsw-30793-7bd7ddb664-jtc4b:/mnt/workspace/demos/llama.cpp# ./server --host 0.0.0.0 --port 5000 -m /tmp/BlueLM-7B-Chat-gguf-q4_0.bin -ngl 32 -c 2048
Send a request to the server:
curl --request POST \
--url http://localhost:5000/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "床前明月光的下一句是什么?","n_predict": 128}'
The response:
{"content":" 疑是地上霜","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/tmp/BlueLM-7B-Chat-gguf-q4_0.bin","n_ctx":2048,"n_keep":0,"n_predict":128,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"/tmp/BlueLM-7B-Chat-gguf-q4_0.bin","prompt":"床前明月光的下一句是什么?","slot_id":0,"stop":true,"stopped_eos":true,"stopped_limit":false,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":1942.425,"predicted_n":5,"predicted_per_second":2.574101960178643,"predicted_per_token_ms":388.485,"prompt_ms":1651.895,"prompt_n":10,"prompt_per_second":6.053653531247446,"prompt_per_token_ms":165.1895},"tokens_cached":15,"tokens_evaluated":10,"tokens_predicted":5,"truncated":false}
Step 5: Using the pre-built binaries
Runtime environment: Ubuntu 20.04
root@9d42f7e0c8d7:~# tar -zxvf llama.cpp-command.tar.gz
root@9d42f7e0c8d7:~# du -sh /root/BlueLM-7B-Chat-gguf-q4_0.bin
4.0G /root/BlueLM-7B-Chat-gguf-q4_0.bin
root@9d42f7e0c8d7:~# cd llama.cpp-command
root@9d42f7e0c8d7:~/llama.cpp-command# ./main -ngl 32 -m /root/BlueLM-7B-Chat-gguf-q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "问题:静夜思全文是什么?答案:"
问题:静夜思全文是什么?答案:《静夜思》的全文是唐代:李白床前明月光,疑是地上霜。举头望明月,低头思故乡。
《静夜思》的全文的原文如下:
床前明月光,疑是地上霜。
举头望明月,低头思故乡。
# Start the server:
root@9d42f7e0c8d7:~/llama.cpp-command# ./server --host 0.0.0.0 --port 5000 -m /root/BlueLM-7B-Chat-gguf-q4_0.bin -ngl 32 -c 2048
# Send a request:
curl --request POST --url http://localhost:5000/completion --header "Content-Type: application/json" --data '{"prompt": "桃花潭水深千尺的下一句是什么?","n_predict": 128}'
License: Apache License 2.0
Acknowledgements
1. https://modelscope.cn/models/vivo-ai/BlueLM-7B-Chat
2. https://github.com/ggerganov/llama.cpp