chatglm2-6B-GGML

我要开发同款
匿名用户2024年07月31日
38阅读
所属分类aipytorch
开源地址https://modelscope.cn/models/Xorbits/chatglm2-6B-GGML
授权协议apache-2.0

作品详情

THUDM's chatglm2 6B GGML

These files are GGML format model files for THUDM's chatglm2 6B.

GGML files are for CPU + GPU inference using chatglm.cpp and Xorbits Inference.

Prompt template

NOTE: prompt template is not available yet since the system prompt is hard coded in chatglm.cpp for now.

Provided files

Name Quant method Bits Size
chatglm2-ggml-q4_0.bin q4_0 4 3.5 GB
chatglm2-ggml-q4_1.bin q4_1 4 3.9 GB
chatglm2-ggml-q5_0.bin q5_0 5 4.3 GB
chatglm2-ggml-q5_1.bin q5_1 5 4.7 GB
chatglm2-ggml-q8_0.bin q8_0 8 6.6 GB

How to run in Xorbits Inference

Install

Xinference can be installed via pip from PyPI. It is highly recommended to create a new virtual environment to avoid conflicts.

$ pip install "xinference[all]"
$ pip install chatglm-cpp

Start Xorbits Inference

To start a local instance of Xinference, run the following command:

$ xinference

Once Xinference is running, an endpoint will be accessible for model management via CLI or Xinference client. The default endpoint is http://localhost:9997. You can also view a web UI using the Xinference endpoint to chat with all the builtin models. You can even chat with two cutting-edge AI models side-by-side to compare their performance!

Xinference web UI

Slack

For further support, and discussions on these models and AI in general, join our slack channel!

Original model card: THUDM's chatglm2 6B

ChatGLM2-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model ChatGLM-6B. It retains the smooth conversation flow and low deployment threshold of the first-generation model, while introducing the following new features:

  1. Stronger Performance: Based on the development experience of the first-generation ChatGLM model, we have fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of GLM, and has undergone pre-training with 1.4T bilingual tokens and human preference alignment training. The evaluation results show that, compared to the first-generation model, ChatGLM2-6B has achieved substantial improvements in performance on datasets like MMLU (+23%), CEval (+33%), GSM8K (+571%), BBH (+60%), showing strong competitiveness among models of the same size.
  2. Longer Context: Based on FlashAttention technique, we have extended the context length of the base model from 2K in ChatGLM-6B to 32K, and trained with a context length of 8K during the dialogue alignment, allowing for more rounds of dialogue. However, the current version of ChatGLM2-6B has limited understanding of single-round ultra-long documents, which we will focus on optimizing in future iterations.
  3. More Efficient Inference: Based on Multi-Query Attention technique, ChatGLM2-6B has more efficient inference speed and lower GPU memory usage: under the official implementation, the inference speed has increased by 42% compared to the first generation; under INT4 quantization, the dialogue length supported by 6G GPU memory has increased from 1K to 8K.

For more instructions, including how to run CLI and web demos, and model quantization, please refer to our Github Repo.

声明:本文仅代表作者观点,不代表本站立场。如果侵犯到您的合法权益,请联系我们删除侵权资源!如果遇到资源链接失效,请您通过评论或工单的方式通知管理员。未经允许,不得转载,本站所有资源文章禁止商业使用运营!
下载安装【程序员客栈】APP
实时对接需求、及时收发消息、丰富的开放项目需求、随时随地查看项目状态

评论