InternLM-Chat-20B-4bit

Categories: ai, internlm, pytorch
Source: https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-20b-4bit
License: Apache License 2.0

Details

LMDeploy supports 4-bit weight inference for LLMs. The minimum requirement is an NVIDIA GPU with compute capability sm80 or higher, such as the A10, A100, or GeForce 30/40 series.
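
If you are unsure whether your card qualifies, you can query its compute capability with PyTorch (already listed among this page's dependencies). This is a minimal sketch, not part of the official instructions:

# Check that the GPU meets the sm80 requirement stated above.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm{major}{minor}")
assert (major, minor) >= (8, 0), "4-bit inference requires sm80 or newer"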

Before running inference with internlm-chat-20b-4bit, please ensure that lmdeploy is installed:

pip install 'lmdeploy>=0.0.11'
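
To verify the installation, a quick check of the installed version (a sketch using the standard library, independent of lmdeploy's own API):

# Confirm the installed lmdeploy version meets the requirement.
from importlib.metadata import version

print(version("lmdeploy"))  # expect 0.0.11 or later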

Inference

Please download the internlm-chat-20b-4bit model as follows:

git-lfs install
git clone --depth=1 https://www.modelscope.cn/Shanghai_AI_Laboratory/internlm-chat-20b-4bit.git
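
Alternatively, since the model is hosted on ModelScope, you can fetch it with the modelscope SDK (pip install modelscope); this sketch assumes the default cache location is acceptable:

# Download the weights via the ModelScope hub instead of git.
from modelscope.hub.snapshot_download import snapshot_download

model_dir = snapshot_download("Shanghai_AI_Laboratory/internlm-chat-20b-4bit")
print(model_dir)  # local path containing the downloaded weights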

As demonstrated below, first convert the model's layout with lmdeploy.serve.turbomind.deploy; then you can chat with the AI assistant in the terminal:

# Convert the model's layout and store it in the default path, ./workspace.
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name internlm-chat-20b \
    --model-path ./internlm-chat-20b-4bit \
    --model-format awq \
    --group-size 128

# inference
python3 -m lmdeploy.turbomind.chat ./workspace
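
If you prefer to script the two steps, a minimal sketch that simply shells out to the same commands (the paths and flags mirror the ones above):

# Run the layout conversion, then start the interactive chat CLI.
import subprocess

subprocess.run(
    ["python3", "-m", "lmdeploy.serve.turbomind.deploy",
     "--model-name", "internlm-chat-20b",
     "--model-path", "./internlm-chat-20b-4bit",
     "--model-format", "awq",
     "--group-size", "128"],
    check=True,
)
# The chat CLI is interactive and inherits this terminal's stdin/stdout.
subprocess.run(["python3", "-m", "lmdeploy.turbomind.chat", "./workspace"], check=True)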

Serve with gradio

If you wish to interact with the model via a web UI, please launch the gradio server as shown below:

python3 -m lmdeploy.serve.gradio.app ./workspace --server_name {ip_addr} --server_port {port}

Subsequently, you can open the website http://{ip_addr}:{port} in your browser and interact with the model.

Besides serving with gradio, there are two more serving methods: one is serving with Triton Inference Server (TIS), and the other is an OpenAI-like server named api_server.

Please refer to the user guide for detailed information if you are interested.
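
For a feel of what calling the api_server might look like, here is a hedged client sketch; the /v1/chat/completions route and the payload shape follow the OpenAI convention and are assumptions here, so consult the user guide for the exact schema lmdeploy exposes:

# Assumed OpenAI-style request; replace host/port with your server's.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "internlm-chat-20b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json())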

Inference Performance

LMDeploy provides scripts for benchmarking token throughput and request throughput.

Token throughput measures the speed of generating new tokens given a specified number of prompt tokens and completion tokens, while request throughput measures the number of requests processed per minute on real dialogue data.

We benchmarked internlm-chat-20b-4bit on an A100-80G. Token throughput was measured with 256 prompt tokens and 512 generated completion tokens.

Note: in our test, session_len in workspace/triton_models/weights/config.ini was changed to 2056.

batch  tensor_parallel  prompt_tokens  completion_tokens  thr_per_proc(token/s)  rpm(req/min)  mem_per_proc(GB)
1      1                256            512                88.77                  -             15.65
16     1                256            512                792.7                  220.23        51.46
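
One way to apply the session_len change mentioned in the note above is with Python's configparser; this is a sketch, and the [llama] section name is an assumption, so open the file to confirm it first:

# Edit session_len in the generated turbomind config.
import configparser

path = "workspace/triton_models/weights/config.ini"
cfg = configparser.ConfigParser()
cfg.read(path)
cfg["llama"]["session_len"] = "2056"  # section name is an assumption
with open(path, "w") as f:
    cfg.write(f)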

Token throughput

Run the following command:

python benchmark/profile_generation.py \
  --model-path ./workspace \
  --concurrency 1 8 16 --prompt-tokens 256 512 512 1024 --completion-tokens 512 512 1024 1024 \
  --dst-csv ./token_throughput.csv

You will find the token throughput metrics in ./token_throughput.csv.
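
To inspect the CSV programmatically, a small sketch assuming pandas is installed:

# Load and print the benchmark results.
import pandas as pd

df = pd.read_csv("./token_throughput.csv")
print(df.to_string(index=False))

The numbers we measured on A100-80G: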

batch  prompt_tokens  completion_tokens  thr_per_proc(token/s)  thr_per_node(token/s)  rpm(req/min)  mem_per_proc(GB)  mem_per_gpu(GB)  mem_per_node(GB)
1      256            512                88.77                  710.12                 -             15.65             15.65            125.21
1      512            512                83.89                  671.15                 -             15.68             15.68            125.46
1      512            1024               80.19                  641.5                  -             15.68             15.68            125.46
1      1024           1024               72.34                  578.74                 -             15.75             15.75            125.96
1      1              2048               80.69                  645.55                 -             15.62             15.62            124.96
8      256            512                565.21                 4521.67                -             32.37             32.37            258.96
8      512            512                489.04                 3912.33                -             32.62             32.62            260.96
8      512            1024               467.23                 3737.84                -             32.62             32.62            260.96
8      1024           1024               383.4                  3067.19                -             33.06             33.06            264.46
8      1              2048               487.74                 3901.93                -             32.12             32.12            256.96
16     256            512                792.7                  6341.6                 -             51.46             51.46            411.71
16     512            512                639.4                  5115.17                -             51.93             51.93            415.46
16     512            1024               591.39                 4731.09                -             51.93             51.93            415.46
16     1024           1024               449.11                 3592.85                -             52.06             52.06            416.46
16     1              2048               620.5                  4964.02                -             51                51               407.96

Request throughput

LMDeploy uses the ShareGPT dataset to test request throughput. Run the following commands, and you will get the rpm (requests per minute) metric.

# download the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# run benchmark script
python benchmark/profile_throughput.py \
 ShareGPT_V3_unfiltered_cleaned_split.json \
 ./workspace \
 --concurrency 16
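
If you want to sanity-check the dataset before benchmarking, a quick peek at its structure (the list-of-dialogues layout with a "conversations" field matches the public ShareGPT_V3 dump):

# Inspect the downloaded ShareGPT data.
import json

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)
print(len(data), "dialogues")
print(data[0]["conversations"][0])  # first turn of the first dialogue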