InternLM-Chat-20B-4bit

Categories: ai, internlm, pytorch
Source: https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-20b-4bit
License: Apache License 2.0

Details

LMDeploy supports 4-bit weight inference for LLMs. The minimum requirement is an NVIDIA GPU with compute capability sm80 or higher, such as the A10, A100, or GeForce 30/40 series.
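
If you are unsure whether your card qualifies, you can query its compute capability with PyTorch (already listed among this page's dependencies). This is a minimal sketch, not part of the official instructions:

# Check that the GPU meets the sm80 requirement stated above.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm{major}{minor}")
assert (major, minor) >= (8, 0), "4-bit inference requires sm80 or newer"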

Before running inference with internlm-chat-20b-4bit, please ensure that lmdeploy is installed:

pip install 'lmdeploy>=0.0.11'
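
To verify the installation, a quick check of the installed version (a sketch using the standard library, independent of lmdeploy's own API):

# Confirm the installed lmdeploy version meets the requirement.
from importlib.metadata import version

print(version("lmdeploy"))  # expect 0.0.11 or later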

Inference

Please download the internlm-chat-20b-4bit model as follows:

git-lfs install
git clone --depth=1 https://www.modelscope.cn/Shanghai_AI_Laboratory/internlm-chat-20b-4bit.git
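
Alternatively, since the model is hosted on ModelScope, you can fetch it with the modelscope SDK (pip install modelscope); this sketch assumes the default cache location is acceptable:

# Download the weights via the ModelScope hub instead of git.
from modelscope.hub.snapshot_download import snapshot_download

model_dir = snapshot_download("Shanghai_AI_Laboratory/internlm-chat-20b-4bit")
print(model_dir)  # local path containing the downloaded weights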

As demonstrated below, first convert the model's layout with lmdeploy.serve.turbomind.deploy; then you can chat with the AI assistant in the terminal:

# Convert the model's layout and store it in the default path, ./workspace.
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name internlm-chat-20b \
    --model-path ./internlm-chat-20b-4bit \
    --model-format awq \
    --group-size 128

# inference
python3 -m lmdeploy.turbomind.chat ./workspace
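
If you prefer to script the two steps, a minimal sketch that simply shells out to the same commands (the paths and flags mirror the ones above):

# Run the layout conversion, then start the interactive chat CLI.
import subprocess

subprocess.run(
    ["python3", "-m", "lmdeploy.serve.turbomind.deploy",
     "--model-name", "internlm-chat-20b",
     "--model-path", "./internlm-chat-20b-4bit",
     "--model-format", "awq",
     "--group-size", "128"],
    check=True,
)
# The chat CLI is interactive and inherits this terminal's stdin/stdout.
subprocess.run(["python3", "-m", "lmdeploy.turbomind.chat", "./workspace"], check=True)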

Serve with gradio

If you wish to interact with the model via a web UI, please launch the gradio server as shown below:

python3 -m lmdeploy.serve.gradio.app ./workspace --server_name {ip_addr} --server_port {port}

Subsequently, you can open the website http://{ip_addr}:{port} in your browser and interact with the model.

Besides serving with gradio, there are two more serving methods: one is serving with Triton Inference Server (TIS), and the other is an OpenAI-like server named api_server.

Please refer to the user guide for detailed information if you are interested.
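
For a feel of what calling the api_server might look like, here is a hedged client sketch; the /v1/chat/completions route and the payload shape follow the OpenAI convention and are assumptions here, so consult the user guide for the exact schema lmdeploy exposes:

# Assumed OpenAI-style request; replace host/port with your server's.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "internlm-chat-20b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json())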

Inference Performance

LMDeploy provides scripts for benchmarking token throughput and request throughput.

Token throughput measures the speed of generating new tokens given a specified number of prompt tokens and completion tokens, while request throughput measures the number of requests processed per minute on real dialogue data.

We benchmarked internlm-chat-20b-4bit on an A100-80G. Token throughput was measured with 256 prompt tokens and 512 generated completion tokens.

Note: in our test, session_len in workspace/triton_models/weights/config.ini was changed to 2056.

batch  tensor_parallel  prompt_tokens  completion_tokens  thr_per_proc(token/s)  rpm(req/min)  mem_per_proc(GB)
1      1                256            512                88.77                  -             15.65
16     1                256            512                792.7                  220.23        51.46
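
One way to apply the session_len change mentioned in the note above is with Python's configparser; this is a sketch, and the [llama] section name is an assumption, so open the file to confirm it first:

# Edit session_len in the generated turbomind config.
import configparser

path = "workspace/triton_models/weights/config.ini"
cfg = configparser.ConfigParser()
cfg.read(path)
cfg["llama"]["session_len"] = "2056"  # section name is an assumption
with open(path, "w") as f:
    cfg.write(f)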

Token throughput

Run the following command:

python benchmark/profile_generation.py \
  --model-path ./workspace \
  --concurrency 1 8 16 --prompt-tokens 256 512 512 1024 --completion-tokens 512 512 1024 1024 \
  --dst-csv ./token_throughput.csv

You will find the token throughput metrics in ./token_throughput.csv.
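
To inspect the CSV programmatically, a small sketch assuming pandas is installed:

# Load and print the benchmark results.
import pandas as pd

df = pd.read_csv("./token_throughput.csv")
print(df.to_string(index=False))

The numbers we measured on A100-80G: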

batch  prompt_tokens  completion_tokens  thr_per_proc(token/s)  thr_per_node(token/s)  rpm(req/min)  mem_per_proc(GB)  mem_per_gpu(GB)  mem_per_node(GB)
1      256            512                88.77                  710.12                 -             15.65             15.65            125.21
1      512            512                83.89                  671.15                 -             15.68             15.68            125.46
1      512            1024               80.19                  641.5                  -             15.68             15.68            125.46
1      1024           1024               72.34                  578.74                 -             15.75             15.75            125.96
1      1              2048               80.69                  645.55                 -             15.62             15.62            124.96
8      256            512                565.21                 4521.67                -             32.37             32.37            258.96
8      512            512                489.04                 3912.33                -             32.62             32.62            260.96
8      512            1024               467.23                 3737.84                -             32.62             32.62            260.96
8      1024           1024               383.4                  3067.19                -             33.06             33.06            264.46
8      1              2048               487.74                 3901.93                -             32.12             32.12            256.96
16     256            512                792.7                  6341.6                 -             51.46             51.46            411.71
16     512            512                639.4                  5115.17                -             51.93             51.93            415.46
16     512            1024               591.39                 4731.09                -             51.93             51.93            415.46
16     1024           1024               449.11                 3592.85                -             52.06             52.06            416.46
16     1              2048               620.5                  4964.02                -             51                51               407.96

Request throughput

LMDeploy uses the ShareGPT dataset to test request throughput. Run the following commands, and you will get the rpm (requests per minute) metric.

# download the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# run benchmark script
python benchmark/profile_throughput.py \
 ShareGPT_V3_unfiltered_cleaned_split.json \
 ./workspace \
 --concurrency 16
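
If you want to sanity-check the dataset before benchmarking, a quick peek at its structure (the list-of-dialogues layout with a "conversations" field matches the public ShareGPT_V3 dump):

# Inspect the downloaded ShareGPT data.
import json

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)
print(len(data), "dialogues")
print(data[0]["conversations"][0])  # first turn of the first dialogue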