CodeFuse-CodeLlama-34B-4bits


Technical Information

Official website
https://github.com/codefuse-ai
Source repository
https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B-4bits
License
other

Details

# Model Card for CodeFuse-CodeLlama-34B-4bits

[中文] [English]

## Model Description

CodeFuse-CodeLlama-34B-4bits is the 4-bit quantized version of CodeFuse-CodeLlama-34B, which is a 34B Code-LLM fine-tuned over multiple code tasks (600k instructions/answers) on the base model CodeLlama-34b-Python.

After undergoing 4-bit quantization, the CodeFuse-CodeLlama-34B-4bits model can be loaded on either a single A10 (24GB VRAM) or an RTX 4090 (24GB VRAM). Moreover, the quantized model still achieves an impressive accuracy of 73.8% on the HumanEval pass@1 metric.


## News and Updates

2023-09-28 The CodeFuse-CodeLlama-34B 4-bit technical documentation has been released. If you are interested, please click the provided link to view it on the CodeFuse WeChat official account. (https://mp.weixin.qq.com/s/QLycLdgPGQjF7JE_YF466Q)

2023-09-26 We are pleased to announce the release of the 4-bit quantized version of CodeFuse-CodeLlama-34B. Despite the quantization process, the model still achieves a remarkable 73.8% accuracy (greedy decoding) on the HumanEval pass@1 metric.

2023-09-11 CodeFuse-CodeLlama-34B has achieved 74.4% pass@1 (greedy decoding) on HumanEval, which is the SOTA result for open-sourced LLMs at present.


## Code Community

**Homepage**: https://github.com/codefuse-ai (**Please give us your support with a Star + Fork + Watch**)

+ If you wish to fine-tune the model yourself, you can visit ✨[MFTCoder](https://github.com/codefuse-ai/MFTCoder)✨✨

+ If you wish to deploy the model yourself, you can visit ✨[FasterTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse)✨✨

+ If you wish to see a demo of the model, you can visit ✨[CodeFuse Demo](https://github.com/codefuse-ai/codefuse)✨✨


## Performance (Code)

| Model                            | HumanEval(pass@1) |  Date   |
|:---------------------------------|:-----------------:|:-------:|
| **CodeFuse-CodeLlama-34B**       |     **74.4%**     | 2023.9  |
| **CodeFuse-CodeLlama-34B-4bits** |     **73.8%**     | 2023.9  |
| WizardCoder-Python-34B-V1.0      |       73.2%       | 2023.8  |
| GPT-4(zero-shot)                 |       67.0%       | 2023.3  |
| PanGu-Coder2 15B                 |       61.6%       | 2023.8  |
| CodeLlama-34b-Python             |       53.7%       | 2023.8  |
| CodeLlama-34b                    |       48.8%       | 2023.8  |
| GPT-3.5(zero-shot)               |       48.1%       | 2022.11 |
| OctoCoder                        |       46.2%       | 2023.8  |
| StarCoder-15B                    |       33.6%       | 2023.5  |
| LLaMA 2 70B(zero-shot)           |       29.9%       | 2023.7  |


## GPU Memory Usage

We measured the GPU memory usage after loading the model, as well as the memory usage when encoding 2048/1024 tokens and generating 1024/2048 tokens. The results are presented in the table below.

| Precision | Idle Model | Encoding 2048 tokens and Generating 1024 tokens | Encoding 1024 tokens and Generating 2048 tokens |
|:----------|:-----------|:-----------------------------------------------:|:-----------------------------------------------:|
| bfloat16  | 64.89GB    | 69.31GB                                          | 66.41GB                                          |
| int4      | 19.09GB    | 22.19GB                                          | 20.78GB                                          |
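
If you want to reproduce comparable numbers on your own hardware, PyTorch's CUDA statistics expose the peak allocation. The snippet below is only a minimal sketch of such a measurement (model loading and generation themselves are shown in the Quickstart section further down); it counts memory allocated by PyTorch tensors, so readings can differ slightly from `nvidia-smi`.

```python
import torch

def peak_vram_gb() -> float:
    """Peak GPU memory allocated by PyTorch tensors so far, in GB."""
    return torch.cuda.max_memory_allocated() / 1024 ** 3

torch.cuda.reset_peak_memory_stats()
# ... load the quantized model and run generate() here (see the Quickstart below) ...
print(f"peak allocated GPU memory: {peak_vram_gb():.2f}GB")
```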


## Requirements

* python>=3.8
* pytorch>=2.0.0
* transformers==4.32.0
* auto_gptq==0.4.2
* Sentencepiece
* CUDA 11.4
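
As a convenience, the short check below is one way to confirm the pinned versions above before loading the model. It is only a sketch and is not part of the repository's own `requirements.txt`.

```python
import sys

import torch
import transformers
from packaging.version import Version

# Sanity-check the environment against the pinned requirements above.
assert sys.version_info >= (3, 8), "python>=3.8 is required"
assert Version(torch.__version__.split("+")[0]) >= Version("2.0.0"), "pytorch>=2.0.0 is required"
assert transformers.__version__ == "4.32.0", "transformers==4.32.0 is required"
print("CUDA available:", torch.cuda.is_available())
```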


## Inference String Format

The inference string is a concatenated string formed by combining conversation data (human and bot contents) in the training data format. It is used as input during the inference process. Here is an example format of the concatenated string:

"""
<|role_start|>huma<|role_ed|>Huma 1st roud iput
<|role_start|>bot<|role_ed|>Bot 1st roud output</s>
<|role_start|>huma<|role_ed|>Huma 2d roud iput
<|role_start|>bot<|role_ed|>Bot 2d roud output</s>
...
...
...
<|role_start|>huma<|role_ed|>Huma th roud iput
<|role_start|>bot<|role_ed|>{Bot output to be gereated}</s>
"""

When running inference, always make your input string end with "<|role_start|>bot<|role_end|>" to ask the model to generate an answer.
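
As a concrete illustration, the helper below builds such a prompt from a list of previous (human, bot) turns plus a new user message. It is a minimal sketch of the format described above, not an official utility from the repository.

```python
def build_prompt(history, new_input):
    """Concatenate (human, bot) turns and a new human input into the
    inference string format described above."""
    prompt = ""
    for human_text, bot_text in history:
        prompt += f"<|role_start|>human<|role_end|>{human_text}"
        prompt += f"<|role_start|>bot<|role_end|>{bot_text}</s>"
    # The string must end with the bot role marker so the model knows
    # it should generate the next bot answer.
    prompt += f"<|role_start|>human<|role_end|>{new_input}"
    prompt += "<|role_start|>bot<|role_end|>"
    return prompt

# Example usage:
history = [("Write a hello-world in Python", "print('Hello, world!')")]
print(build_prompt(history, "Now add a docstring to it"))
```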


## Quickstart

```bash
git clone https://www.modelscope.cn/codefuse-ai/CodeFuse-CodeLlama-34B-4bits.git
```

```bash
pip install -r requirements.txt
```

```python
import os
import torch
import time
from modelscope import AutoTokenizer, snapshot_download
from auto_gptq import AutoGPTQForCausalLM

os.environ["TOKENIZERS_PARALLELISM"] = "false"

def load_model_tokenizer(model_path):
    """
    Load model and tokenizer based on the given model name or local path of the downloaded model.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True,
                                              use_fast=False,
                                              legacy=False)
    tokenizer.padding_side = "left"
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")
    tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")

    model = AutoGPTQForCausalLM.from_quantized(model_path,
                                               inject_fused_attention=False,
                                               inject_fused_mlp=False,
                                               use_safetensors=False,
                                               use_cuda_fp16=True,
                                               disable_exllama=False,
                                               device_map='auto'   # Support multi-gpus
                                              )
    return model, tokenizer

def inference(model, tokenizer, prompt):
    """
    Use the given model and tokenizer to generate an answer for the specified prompt.
    """
    st = time.time()
    prompt = prompt if prompt.endswith('\n') else f'{prompt}\n'
    inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>"

    input_ids = tokenizer.encode(inputs,
                                 return_tensors="pt",
                                 padding=True,
                                 add_special_tokens=False).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_ids,
            top_p=0.95,
            temperature=0.1,
            do_sample=True,
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id
        )
    print(f'generated tokens num is {len(generated_ids[0][input_ids.size(1):])}')
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(f'generate text is {outputs[0][len(inputs): ]}')
    latency = time.time() - st
    print('latency is {} seconds'.format(latency))

if __name__ == "__main__":
    model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B-4bits', revision='v1.0.0')

    prompt = 'Please write a QuickSort program in Python'

    model, tokenizer = load_model_tokenizer(model_dir)
    inference(model, tokenizer, prompt)
```
**The current inference example code is based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). If you want to achieve higher inference speed, it is recommended to combine it with [TensorRT-LLM (Early Access)](https://developer.nvidia.com/tensorrt-llm-early-access).**

<br>

## Consistency Check
Here, SHA256 values are provided for the model-related files for consistency checks during the download.

| File                           |  SHA256                         |
|-------------------------------:|:--------------------------------:|
|config.json | bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b |
|generation_config.json | b625bd13a52d0685313c32919324b9bdc9e75a4f1338ca5c28226d1693e130a3 |
|gptq_model-4bit-64g.bin | 79441bad1d5ab852d0238ed7e113b9912f31189cf9181d7119dd297c4beb454a |
|pytorch_model.bin.index.json | 9a714170172282cfbcaa120af13c0df08b06d040ff24dab30229d8a010821d3d |
|quantize_config.json | 3c1744a928e9d6c3f9a2cbb1bb5a89539077e7d456948bf5aee0deed6a7b8028 |
|special_tokens_map.json | ff3b4a612c4e447acb02d40071bddd989fe0da87eb5b7fe0dbadfc4f74de7531 |
|tokenizer.json | f7b50bcf6d6672eade5e43514d48e9c1e4e63a56aef7b14acdaca94ce93436f7 |
|tokenizer.model | 9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 |
|tokenizer_config.json | c12441e82f2dce0baff87cf5948e82d6e9b51cc0b5266369c30c319fb771eeb2 |
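
To run the check locally, you can hash each downloaded file and compare it against the table. The snippet below is a small convenience sketch (with the expected digests filled in from the table above, one entry shown), not a script shipped with the model.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Expected digests copied from the table above (only one entry shown here).
expected = {
    "config.json": "bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b",
}

model_dir = Path("CodeFuse-CodeLlama-34B-4bits")  # local path of the downloaded model
for name, digest in expected.items():
    ok = sha256_of(model_dir / name) == digest
    print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```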


<br>
<br>


<a id="chiese"></a>

## Model Description

CodeFuse-CodeLlama-34B-4bits is the 4-bit quantized version of CodeFuse-CodeLlama-34B, a code LLM obtained by fine-tuning the base model CodeLlama-34b-Python on multiple code tasks with QLoRA; the model's input length is 4K.

After 4-bit quantization, CodeFuse-CodeLlama-34B-4bits can be loaded on a single A10 (24GB VRAM) or an RTX 4090 (24GB VRAM), and the quantized model still achieves 73.8% on the HumanEval pass@1 metric.

<br>

## News

2023-09-28 The technical documentation for CodeFuse-CodeLlama-34B 4bits has been published. If you are interested, see the CodeFuse WeChat official account: https://mp.weixin.qq.com/s/QLycLdgPGQjF7JE_YF466Q

2023-09-26 The 4-bit quantized version of CodeFuse-CodeLlama-34B has been released; the quantized model scores 73.8% on HumanEval pass@1 (greedy decoding).

2023-09-11 CodeFuse-CodeLlama-34B has been released, reaching 74.4% on HumanEval pass@1 (greedy decoding), the current open-source SOTA.

<br>

## Code Community
**Homepage**: https://github.com/codefuse-ai (**Please support the project with a Star + Fork + Watch**)

+ If you wish to fine-tune the model yourself, you can visit ✨[MFTCoder](https://github.com/codefuse-ai/MFTCoder)✨✨

+ If you wish to deploy the model yourself, you can visit ✨[FasterTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse)✨✨

+ If you wish to see a demo of the model, you can visit ✨[CodeFuse Demo](https://github.com/codefuse-ai/codefuse)✨✨

<br>

## Performance (Code)


| Model                            | HumanEval(pass@1) |  Date   |
|:---------------------------------|:-----------------:|:-------:|
| **CodeFuse-CodeLlama-34B**       |     **74.4%**     | 2023.9  |
| **CodeFuse-CodeLlama-34B-4bits** |     **73.8%**     | 2023.9  |
| WizardCoder-Python-34B-V1.0      |       73.2%       | 2023.8  |
| GPT-4(zero-shot)                 |       67.0%       | 2023.3  |
| PanGu-Coder2 15B                 |       61.6%       | 2023.8  |
| CodeLlama-34b-Python             |       53.7%       | 2023.8  |
| CodeLlama-34b                    |       48.8%       | 2023.8  |
| GPT-3.5(zero-shot)               |       48.1%       | 2022.11 |
| OctoCoder                        |       46.2%       | 2023.8  |
| StarCoder-15B                    |       33.6%       | 2023.5  |
| LLaMA 2 70B(zero-shot)           |       29.9%       | 2023.7  |
<br>

## GPU Memory Usage
We measured the GPU memory usage after loading the model, as well as the memory usage when encoding 2048/1024 tokens and generating 1024/2048 tokens, as shown in the table below.

| Precision | Idle Model | Encoding 2048 tokens + Generating 1024 tokens | Encoding 1024 tokens + Generating 2048 tokens |
|:----------|:-----------|:---------------------------------------------:|:---------------------------------------------:|
| bfloat16  | 64.89GB    | 69.31GB                                        | 66.41GB                                        |
| int4      | 19.09GB    | 22.19GB                                        | 20.78GB                                        |

<br>

## Requirements

* python>=3.8
* pytorch>=2.0.0
* transformers==4.32.0
* auto_gptq==0.4.2
* Sentencepiece
* CUDA 11.4

<br>

## Inference Data Format

The inference data is the concatenated string form of the training data format, and it is also how the input prompt should be concatenated at inference time:

pytho """ <|rolestart|>huma<|roleed|>Huma 1st roud iput <|rolestart|>bot<|roleed|>Bot 1st roud output <|rolestart|>huma<|roleed|>Huma 2d roud iput <|rolestart|>bot<|roleed|>Bot 2d roud output … … … <|ed|><|rolestart|>huma<|roleed|>Huma th roud iput <|ed|><|rolestart|>bot<|roleed|>{Bot output to be gereated} """

At inference time, make sure the concatenated prompt string ends with "<|role_start|>bot<|role_end|>" to prompt the model to generate an answer.

<br>

## Quickstart

```bash
git clone https://www.modelscope.cn/codefuse-ai/CodeFuse-CodeLlama-34B-4bits.git
```

```bash
pip install -r requirements.txt
```

```python
import os
import torch
import time
from modelscope import AutoTokenizer, snapshot_download
from auto_gptq import AutoGPTQForCausalLM

os.environ["TOKENIZERS_PARALLELISM"] = "false"

def load_model_tokenizer(model_path):
    """
    Load model and tokenizer based on the given model name or local path of the downloaded model.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True,
                                              use_fast=False,
                                              legacy=False)
    tokenizer.padding_side = "left"
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")
    tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")

    model = AutoGPTQForCausalLM.from_quantized(model_path,
                                               inject_fused_attention=False,
                                               inject_fused_mlp=False,
                                               use_safetensors=False,
                                               use_cuda_fp16=True,
                                               disable_exllama=False,
                                               device_map='auto'   # Support multi-GPUs
                                              )
    return model, tokenizer

def inference(model, tokenizer, prompt):
    """
    Use the given model and tokenizer to generate an answer for the specified prompt.
    """
    st = time.time()
    prompt = prompt if prompt.endswith('\n') else f'{prompt}\n'
    inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>"

    input_ids = tokenizer.encode(inputs,
                                 return_tensors="pt",
                                 padding=True,
                                 add_special_tokens=False).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_ids,
            top_p=0.95,
            temperature=0.1,
            do_sample=True,
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id
        )
    print(f'generated tokens num is {len(generated_ids[0][input_ids.size(1):])}')
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(f'generate text is {outputs[0][len(inputs): ]}')
    latency = time.time() - st
    print('latency is {} seconds'.format(latency))

if __name__ == "__main__":
    model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B-4bits', revision='v1.0.0')

    prompt = '请用Python实现一个快速排序算法'  # "Please implement a QuickSort algorithm in Python"

    model, tokenizer = load_model_tokenizer(model_dir)
    inference(model, tokenizer, prompt)
```

The current inference example code is based on AutoGPTQ. If you want higher inference speed, it is recommended to use it together with TensorRT-LLM (Early Access).


## Consistency Check

SHA256 values of the model-related files are provided here for consistency checks after download.

| File                           |  SHA256                         |
|-------------------------------:|:--------------------------------:|
|config.json | bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b |
|generation_config.json | b625bd13a52d0685313c32919324b9bdc9e75a4f1338ca5c28226d1693e130a3 |
|gptq_model-4bit-64g.bin | 79441bad1d5ab852d0238ed7e113b9912f31189cf9181d7119dd297c4beb454a |
|pytorch_model.bin.index.json | 9a714170172282cfbcaa120af13c0df08b06d040ff24dab30229d8a010821d3d |
|quantize_config.json | 3c1744a928e9d6c3f9a2cbb1bb5a89539077e7d456948bf5aee0deed6a7b8028 |
|special_tokens_map.json | ff3b4a612c4e447acb02d40071bddd989fe0da87eb5b7fe0dbadfc4f74de7531 |
|tokenizer.json | f7b50bcf6d6672eade5e43514d48e9c1e4e63a56aef7b14acdaca94ce93436f7 |
|tokenizer.model | 9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 |
|tokenizer_config.json | c12441e82f2dce0baff87cf5948e82d6e9b51cc0b5266369c30c319fb771eeb2 |

## Join Us

We are the AI Native team of Ant Group's Platform Technology Business Group, responsible for the intelligentization of Ant Group's platform engineering. Since its founding more than three years ago, the team has supported the intelligent upgrade of cloud-computing infrastructure operations across Ant Group. Our mission is to build widely used algorithm services and platforms through world-class technical innovation and impact, supporting the landing of internal and external products and businesses. Holding to a culture of innovation, the team pushes its technical influence while supporting business delivery: over the past three years it has published more than 20 papers at top venues such as ICLR, NeurIPS, KDD, and ACL; its business innovations have twice won T-Star, Ant Group's highest technology award, and once won SuperMA, the group's highest award. The open-source project CodeFuse has received 4K stars (as of February 2024), and the models have been downloaded more than 1.5 million times in total on Hugging Face and ModelScope.

We are looking for outstanding people in the industry to join our team! If you would like to develop your career in an environment full of energy, innovation, and a culture of excellence, please check our openings for experienced and campus hires, and join us in creating the next industry milestone.

Campus recruitment: https://hrrecommend.antgroup.com/guide.html?code=8uoP5mlus5DqQYbEEqcE2FD5JZH21MwvMUIb9mb6X3osXPuBraG54SyM8GL7

Experienced hires: https://talent.antgroup.com/off-campus-position?positionId=1933830

