# Model Card for CodeFuse-CodeLlama-34B-4bits

[中文] [English]
## Model Description
CodeFuse-CodeLlama-34B-4bits is the 4-bit quantized version of CodeFuse-CodeLlama-34B, a 34B code LLM fine-tuned on multiple code tasks (600k instructions/answers) from the base model CodeLlama-34b-Python.

After 4-bit quantization, CodeFuse-CodeLlama-34B-4bits can be loaded on a single A10 (24GB VRAM) or a single RTX 4090 (24GB VRAM). The quantized model still achieves 73.8% on the HumanEval pass@1 metric.
## News and Updates
2023-09-28: The technical documentation for the 4-bit CodeFuse-CodeLlama-34B has been released. To read it, follow this link to the CodeFuse WeChat official account: https://mp.weixin.qq.com/s/QLycLdgPGQjF7JE_YF466Q

2023-09-26: We are pleased to announce the release of the 4-bit quantized version of CodeFuse-CodeLlama-34B. After quantization, the model still achieves 73.8% accuracy (greedy decoding) on the HumanEval pass@1 metric.

2023-09-11: CodeFuse-CodeLlama-34B achieved 74.4% pass@1 (greedy decoding) on HumanEval, the state-of-the-art result among open-source LLMs at the time.
## Code Community
Homepage: https://github.com/codefuse-ai (Please give us your support with a Star + Fork + Watch)
## Performance
### Code

| Model | HumanEval(pass@1) | Date |
|:--------------------------------|:-----------------:|:-------:|
| CodeFuse-CodeLlama-34B | 74.4% | 2023.9 |
| CodeFuse-CodeLlama-34B-4bits | 73.8% | 2023.9 |
| WizardCoder-Python-34B-V1.0 | 73.2% | 2023.8 |
| GPT-4(zero-shot) | 67.0% | 2023.3 |
| PanGu-Coder2 15B | 61.6% | 2023.8 |
| CodeLlama-34b-Python | 53.7% | 2023.8 |
| CodeLlama-34b | 48.8% | 2023.8 |
| GPT-3.5(zero-shot) | 48.1% | 2022.11 |
| OctoCoder | 46.2% | 2023.8 |
| StarCoder-15B | 33.6% | 2023.5 |
| LLaMA 2 70B(zero-shot) | 29.9% | 2023.7 |
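
With greedy decoding there is exactly one completion per problem, so pass@1 reduces to the fraction of problems whose completion passes all unit tests. The sketch below illustrates that arithmetic, assuming the standard 164-problem HumanEval set. The function name and the solved counts (122 and 121) are illustrative values chosen to reproduce the percentages in the table, not figures reported by the authors.

```python
def pass_at_1(results):
    """Fraction of problems solved, given one boolean per problem (greedy decoding)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(1 for ok in results if ok) / len(results)


NUM_PROBLEMS = 164  # size of the HumanEval benchmark

# Hypothetical outcome lists: True means the single greedy completion passed.
full_precision_results = [True] * 122 + [False] * (NUM_PROBLEMS - 122)
int4_results = [True] * 121 + [False] * (NUM_PROBLEMS - 121)

print(f"full precision: {pass_at_1(full_precision_results):.1%}")  # 74.4%
print(f"4-bit:          {pass_at_1(int4_results):.1%}")            # 73.8%
```

Under these assumed counts, the gap between the two models corresponds to a single HumanEval problem.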
## GPU Memory Usage
We measured GPU memory usage after loading the model, and when encoding 2048/1024 tokens and generating 1024/2048 tokens. The results are shown in the table below.

| Precision | Idle Model | Encoding 2048 tokens and Generating 1024 tokens | Encoding 1024 tokens and Generating 2048 tokens |
|:----------|:-----------|:------------------------:|:------------:|
| bfloat16 | 64.89GB | 69.31GB | 66.41GB |
| int4 | 19.09GB | 22.19GB | 20.78GB |
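
As a rough sanity check on the idle figures, weight storage can be estimated from the parameter count and bits per weight. This is a back-of-the-envelope sketch, not the authors' measurement method. It assumes a round 34e9 parameters, a quantization group size of 64 (suggested by the file name `gptq_model-4bit-64g.bin`), and one 16-bit scale plus one 4-bit zero point per group. It ignores layers kept in higher precision, the CUDA context, activations, and the KV cache, which is why the measured numbers are higher.

```python
def weight_memory_gib(num_params, bits_per_weight, group_size=None,
                      scale_bits=16, zero_bits=4):
    """Approximate weight storage in GiB for a dense model."""
    total_bits = num_params * bits_per_weight
    if group_size is not None:
        # Assumption: each group of weights shares one scale and one zero point.
        num_groups = num_params / group_size
        total_bits += num_groups * (scale_bits + zero_bits)
    return total_bits / 8 / 1024**3


NUM_PARAMS = 34e9  # assumed round figure for a "34B" model

bf16 = weight_memory_gib(NUM_PARAMS, 16)
int4 = weight_memory_gib(NUM_PARAMS, 4, group_size=64)

print(f"bfloat16 weights: ~{bf16:.1f} GiB")
print(f"int4 weights (group size 64): ~{int4:.1f} GiB")
```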
## Requirements
- python>=3.8
- pytorch>=2.0.0
- transformers==4.32.0
- auto_gptq==0.4.2
- Sentencepiece
- CUDA 11.4
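
Before loading the model, it can help to confirm which versions are actually installed. The following is a minimal sketch using only the standard library. The distribution names (`torch`, `auto-gptq`, and so on) are the PyPI names assumed to correspond to the list above, and the helper name is illustrative. It only reports versions; it does not enforce the pins and does not check the CUDA toolkit.

```python
from importlib import metadata

# Assumed PyPI distribution names; the pins mirror the requirements list.
REQUIREMENTS = {
    "torch": ">=2.0.0",
    "transformers": "==4.32.0",
    "auto-gptq": "==0.4.2",
    "sentencepiece": "",
}


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in installed_versions(REQUIREMENTS).items():
    print(f"{name}{REQUIREMENTS[name]}: installed={version}")
```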
## Inference String Format
The inference string is a single string formed by concatenating conversation data (human and bot contents) in the training data format. It is used as the model input during inference.
Here is an example of the concatenated string:
"""
<|role_start|>human<|role_end|>Human 1st round input
<|role_start|>bot<|role_end|>Bot 1st round output</s>
<|role_start|>human<|role_end|>Human 2nd round input
<|role_start|>bot<|role_end|>Bot 2nd round output</s>
...
...
...
<|role_start|>human<|role_end|>Human nth round input
<|role_start|>bot<|role_end|>{Bot output to be generated}</s>
"""
When running inference, always end your input string with "<|role_start|>bot<|role_end|>" to prompt the model to generate an answer.
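
The format above can be assembled with a small helper. This is a minimal sketch rather than the authors' code: the function name and the `history` structure are hypothetical, and placing a newline after each `</s>` follows the layout of the example string and is an assumption. With an empty history, the result matches the single-round string built in the Quickstart code.

```python
HUMAN_PREFIX = "<|role_start|>human<|role_end|>"
BOT_PREFIX = "<|role_start|>bot<|role_end|>"
EOS = "</s>"


def build_prompt(history, user_input):
    """
    history: list of (human_text, bot_text) pairs from earlier rounds.
    user_input: the new human message the model should answer.
    """
    parts = []
    for human_text, bot_text in history:
        parts.append(f"{HUMAN_PREFIX}{human_text}\n")
        # Assumption: a newline separates rounds, as in the example layout.
        parts.append(f"{BOT_PREFIX}{bot_text}{EOS}\n")
    parts.append(f"{HUMAN_PREFIX}{user_input}\n")
    parts.append(BOT_PREFIX)  # the string must end with the bot role marker
    return "".join(parts)


prompt = build_prompt(
    history=[("Write a function that adds two numbers.", "def add(a, b):\n    return a + b")],
    user_input="Now add type hints.",
)
print(prompt)
```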
## Quickstart
```bash
git clone https://www.modelscope.cn/codefuse-ai/CodeFuse-CodeLlama-34B-4bits.git
```

```bash
pip install -r requirements.txt
```

```python
import os
import torch
import time
from modelscope import AutoTokenizer, snapshot_download
from auto_gptq import AutoGPTQForCausalLM

os.environ["TOKENIZERS_PARALLELISM"] = "false"


def load_model_tokenizer(model_path):
    """
    Load the model and tokenizer from the given model name or the local path of a downloaded model.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True,
                                              use_fast=False,
                                              legacy=False)
    # Left padding keeps the prompt adjacent to the generated tokens.
    tokenizer.padding_side = "left"
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")
    tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")

    model = AutoGPTQForCausalLM.from_quantized(model_path,
                                               inject_fused_attention=False,
                                               inject_fused_mlp=False,
                                               use_safetensors=False,
                                               use_cuda_fp16=True,
                                               disable_exllama=False,
                                               device_map='auto'   # Support multi-gpus
                                               )
    return model, tokenizer


def inference(model, tokenizer, prompt):
    """
    Use the given model and tokenizer to generate an answer for the specified prompt.
    """
    st = time.time()
    # The human turn must end with a newline before the bot role marker.
    prompt = prompt if prompt.endswith('\n') else f'{prompt}\n'
    inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>"

    input_ids = tokenizer.encode(inputs,
                                 return_tensors="pt",
                                 padding=True,
                                 add_special_tokens=False).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_ids,
            top_p=0.95,
            temperature=0.1,
            do_sample=True,
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id
        )
    # Count only the newly generated tokens, excluding the prompt.
    print(f'generated tokens num is {len(generated_ids[0][input_ids.size(1):])}')
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(f'generate text is {outputs[0][len(inputs): ]}')
    latency = time.time() - st
    print('latency is {} seconds'.format(latency))


if __name__ == "__main__":
    model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B-4bits', revision='v1.0.0')

    prompt = 'Please write a QuickSort program in Python'

    model, tokenizer = load_model_tokenizer(model_dir)
    inference(model, tokenizer, prompt)
```

**The current inference example code is based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). For higher inference speed, it is recommended to combine it with [TensorRT-LLM (Early Access)](https://developer.nvidia.com/tensorrt-llm-early-access).**
<br>
## Consistency Check
SHA256 values of the model-related files are provided here so that downloads can be checked for consistency.
| File                           |  SHA256                         |
|-------------------------------:|:--------------------------------:|
|config.json | bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b |
|generation_config.json | b625bd13a52d0685313c32919324b9bdc9e75a4f1338ca5c28226d1693e130a3 |
|gptq_model-4bit-64g.bin | 79441bad1d5ab852d0238ed7e113b9912f31189cf9181d7119dd297c4beb454a |
|pytorch_model.bin.index.json | 9a714170172282cfbcaa120af13c0df08b06d040ff24dab30229d8a010821d3d |
|quantize_config.json | 3c1744a928e9d6c3f9a2cbb1bb5a89539077e7d456948bf5aee0deed6a7b8028 |
|special_tokens_map.json | ff3b4a612c4e447acb02d40071bddd989fe0da87eb5b7fe0dbadfc4f74de7531 |
|tokenizer.json | f7b50bcf6d6672eade5e43514d48e9c1e4e63a56aef7b14acdaca94ce93436f7 |
|tokenizer.model | 9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 |
|tokenizer_config.json | c12441e82f2dce0baff87cf5948e82d6e9b51cc0b5266369c30c319fb771eeb2 |
<br>
<br>
<a id="chinese"></a>
## Model Introduction
CodeFuse-CodeLlama-34B-4bits is the 4-bit quantized version of CodeFuse-CodeLlama-34B, a code LLM obtained by fine-tuning the base model CodeLlama-34b-Python on multiple code tasks with QLoRA. Its input length is 4K.
After 4-bit quantization, CodeFuse-CodeLlama-34B-4bits can be loaded on a single A10 (24GB VRAM) or a single RTX 4090 (24GB VRAM), and the quantized model still reaches 73.8% on the HumanEval pass@1 metric.
<br>
## News
2023-09-28: The technical documentation for the 4-bit CodeFuse-CodeLlama-34B has been published. To read it, follow this link to the CodeFuse WeChat official account: https://mp.weixin.qq.com/s/QLycLdgPGQjF7JE_YF466Q

2023-09-26: The 4-bit quantized version of CodeFuse-CodeLlama-34B is released. The quantized model scores 73.8% on HumanEval pass@1 (greedy decoding).

2023-09-11: CodeFuse-CodeLlama-34B is released. It reaches 74.4% on HumanEval pass@1 (greedy decoding), the open-source state of the art at the time.
<br>
## Code Community
**Homepage**: https://github.com/codefuse-ai (**Please support our project with a Star + Fork + Watch**)
+ To fine-tune the model yourself, visit ✨[MFTCoder](https://github.com/codefuse-ai/MFTCoder)✨✨
+ To deploy the model yourself, visit ✨[FasterTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse)✨✨
+ To see the model in action, visit ✨[CodeFuse Demo](https://github.com/codefuse-ai/codefuse)✨✨
<br>
## Performance (Code)
| Model                            | HumanEval(pass@1) |   Date    |
|:--------------------------------|:-----------------:|:-------:|
| **CodeFuse-CodeLlama-34B**      |     **74.4%**      | 2023.9  |
|**CodeFuse-CodeLlama-34B-4bits** |     **73.8%**      |  2023.9 |
| WizardCoder-Python-34B-V1.0     |       73.2%       | 2023.8  |
| GPT-4(zero-shot)                |       67.0%       | 2023.3  |
| PanGu-Coder2 15B                |       61.6%       | 2023.8  |
| CodeLlama-34b-Python            |       53.7%       | 2023.8  |
| CodeLlama-34b                   |       48.8%       | 2023.8  |
| GPT-3.5(zero-shot)              |       48.1%       | 2022.11 |
| OctoCoder                       |       46.2%       | 2023.8  |
| StarCoder-15B                   |       33.6%       | 2023.5  |
| LLaMA 2 70B(zero-shot)          |       29.9%       | 2023.7  |
<br>
## GPU Memory Usage
We measured GPU memory usage after loading the model, and when taking 2048/1024 input tokens and producing 1024/2048 output tokens. The results are shown in the table below.
| Precision                        | Idle Model          |    2048 input tokens + 1024 output tokens | 1024 input tokens + 2048 output tokens      |
|:--------------------------------|:-------------------|:------------------------:|:------------:|
|bfloat16                         |     64.89GB        |             69.31GB        |  66.41GB   |
|int4                             |     19.09GB        |            22.19GB         |  20.78GB |
<br>
## Requirements
* python>=3.8
* pytorch>=2.0.0
* transformers==4.32.0
* auto_gptq==0.4.2
* Sentencepiece
* CUDA 11.4
<br>
## Inference Data Format
The inference data is a string concatenated in the training data format. The input prompt is concatenated the same way at inference time:
```python
"""
<|role_start|>human<|role_end|>Human 1st round input
<|role_start|>bot<|role_end|>Bot 1st round output</s>
<|role_start|>human<|role_end|>Human 2nd round input
<|role_start|>bot<|role_end|>Bot 2nd round output</s>
...
...
...
<|role_start|>human<|role_end|>Human nth round input
<|role_start|>bot<|role_end|>{Bot output to be generated}</s>
"""
```
At inference time, make sure the concatenated prompt string ends with "<|role_start|>bot<|role_end|>" so that the model generates an answer.
<br>
## Quick Start
```bash
git clone https://www.modelscope.cn/codefuse-ai/CodeFuse-CodeLlama-34B-4bits.git
```

```bash
pip install -r requirements.txt
```

```python
import os
import torch
import time
from modelscope import AutoTokenizer, snapshot_download
from auto_gptq import AutoGPTQForCausalLM

os.environ["TOKENIZERS_PARALLELISM"] = "false"


def load_model_tokenizer(model_path):
    """
    Load the model and tokenizer from the given model name or the local path of a downloaded model.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True,
                                              use_fast=False,
                                              legacy=False)
    # Left padding keeps the prompt adjacent to the generated tokens.
    tokenizer.padding_side = "left"
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")
    tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")

    model = AutoGPTQForCausalLM.from_quantized(model_path,
                                               inject_fused_attention=False,
                                               inject_fused_mlp=False,
                                               use_safetensors=False,
                                               use_cuda_fp16=True,
                                               disable_exllama=False,
                                               device_map='auto'   # supports multiple GPUs
                                               )
    return model, tokenizer


def inference(model, tokenizer, prompt):
    """
    Use the given model and tokenizer to generate an answer for the specified prompt.
    """
    st = time.time()
    # The human turn must end with a newline before the bot role marker.
    prompt = prompt if prompt.endswith('\n') else f'{prompt}\n'
    inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>"

    input_ids = tokenizer.encode(inputs,
                                 return_tensors="pt",
                                 padding=True,
                                 add_special_tokens=False).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_ids,
            top_p=0.95,
            temperature=0.1,
            do_sample=True,
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id
        )
    # Count only the newly generated tokens, excluding the prompt.
    print(f'generated tokens num is {len(generated_ids[0][input_ids.size(1):])}')
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(f'generate text is {outputs[0][len(inputs): ]}')
    latency = time.time() - st
    print('latency is {} seconds'.format(latency))


if __name__ == "__main__":
    model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B-4bits', revision='v1.0.0')

    prompt = 'Please implement a quicksort algorithm in Python'

    model, tokenizer = load_model_tokenizer(model_dir)
    inference(model, tokenizer, prompt)
```
The current inference example code is based on AutoGPTQ. For higher inference speed, consider combining it with TensorRT-LLM (Early Access).
## Consistency Check
SHA256 values of the model-related files are provided here so that downloads can be checked for consistency.

| File                           |  SHA256                         |
|-------------------------------:|:--------------------------------:|
| config.json | bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b |
| generation_config.json | b625bd13a52d0685313c32919324b9bdc9e75a4f1338ca5c28226d1693e130a3 |
| gptq_model-4bit-64g.bin | 79441bad1d5ab852d0238ed7e113b9912f31189cf9181d7119dd297c4beb454a |
| pytorch_model.bin.index.json | 9a714170172282cfbcaa120af13c0df08b06d040ff24dab30229d8a010821d3d |
| quantize_config.json | 3c1744a928e9d6c3f9a2cbb1bb5a89539077e7d456948bf5aee0deed6a7b8028 |
| special_tokens_map.json | ff3b4a612c4e447acb02d40071bddd989fe0da87eb5b7fe0dbadfc4f74de7531 |
| tokenizer.json | f7b50bcf6d6672eade5e43514d48e9c1e4e63a56aef7b14acdaca94ce93436f7 |
| tokenizer.model | 9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 |
| tokenizer_config.json | c12441e82f2dce0baff87cf5948e82d6e9b51cc0b5266369c30c319fb771eeb2 |
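
The consistency check can be scripted with the standard library. The sketch below hashes each file in chunks, so the multi-gigabyte weight file does not need to fit in memory, and compares the result with the published value. The function names and the local directory path are illustrative placeholders, and only two entries from the table are included for brevity.

```python
import hashlib
from pathlib import Path

# Two entries copied from the table; extend with the remaining files as needed.
EXPECTED = {
    "config.json": "bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b",
    "tokenizer.model": "9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347",
}


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(model_dir, expected):
    """Return the names of files that are missing or whose hash differs."""
    bad = []
    for name, want in expected.items():
        path = Path(model_dir) / name
        if not path.is_file() or sha256_of_file(path) != want:
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Placeholder path: point this at your local download directory.
    mismatched = verify_files("./CodeFuse-CodeLlama-34B-4bits", EXPECTED)
    print("all files match" if not mismatched else f"check these files: {mismatched}")
```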
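## Join Us
We are the AI Native team within the Platform Technology Business Group, responsible for the intelligent transformation of Ant Group's platform engineering. In the more than three years since the team was founded, it has supported the upgrade of intelligent operations for Ant Group's cloud computing infrastructure. The team's mission is to build algorithm services and platforms with a broad user base through world-class technical innovation and impact, supporting the delivery of internal and external products and businesses. With innovation in its DNA, the team drives technical influence while supporting business delivery. Over the past three years it has published more than 20 papers at top conferences such as ICLR, NeurIPS, KDD, and ACL. Its innovative business results have twice won T-Star, Ant's highest technology award, and once won SuperMA, Ant Group's highest award. The open-source project CodeFuse has received 4K stars (as of February 2024), and its models have been downloaded more than 1.5 million times in total on Hugging Face and ModelScope.

We are looking for top talent to join our team! If you want to grow your career in an environment full of energy, innovation, and a culture of excellence, take a look at our experienced-hire and campus openings, and join us in creating the next industry milestone.

Campus recruitment: https://hrrecommend.antgroup.com/guide.html?code=8uoP5mlus5DqQYbEEqcE2FD5JZH21MwvMUIb9mb6X3osXPuBraG54SyM8GL7

Experienced hires: https://talent.antgroup.com/off-campus-position?positionId=1933830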