
Technical Information

Repository: https://modelscope.cn/models/AI-ModelScope/gpt-j-6b
License: apache-2.0

Model Details

GPT-J 6B

Model Description

GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.

| Hyperparameter | Value |
|----------------|-------|
| \(n_{parameters}\) | 6053381344 |
| \(n_{layers}\) | 28* |
| \(d_{model}\) | 4096 |
| \(d_{ff}\) | 16384 |
| \(n_{heads}\) | 16 |
| \(d_{head}\) | 256 |
| \(n_{ctx}\) | 2048 |
| \(n_{vocab}\) | 50257/50400† (same tokenizer as GPT-2/3) |
| Positional Encoding | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |
| RoPE Dimensions | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223) |

* Each layer consists of one feedforward block and one self attention block.

† Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer.

The model consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
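For readers who want to check these numbers programmatically, here is a minimal sketch using the Hugging Face transformers library (an assumption; neither the library nor the GPTJConfig class is mentioned in this card), whose GPTJConfig defaults mirror the GPT-J 6B architecture:

from transformers import GPTJConfig

cfg = GPTJConfig()                     # default values correspond to GPT-J 6B
d_head = cfg.n_embd // cfg.n_head      # 4096 // 16 = 256 dimensions per head
d_ff = cfg.n_inner or 4 * cfg.n_embd   # feedforward dimension defaults to 4 * d_model = 16384
print(cfg.n_layer, cfg.n_embd, cfg.n_head, d_head, d_ff,
      cfg.rotary_dim, cfg.n_positions, cfg.vocab_size)
# 28 4096 16 256 16384 64 2048 50400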

Intended Use and Limitations

GPT-J learns an inner representation of the English language that can be used to extract features useful for downstream tasks. The model is, however, best at what it was pretrained for, which is generating text from a prompt.
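As a hedged illustration of feature extraction (not taken from this card), the sketch below assumes the Hugging Face transformers checkpoint "EleutherAI/gpt-j-6b"; the ModelScope checkpoint above could be substituted:

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
model = AutoModel.from_pretrained("EleutherAI/gpt-j-6b", torch_dtype=torch.float16).eval()

with torch.no_grad():
    batch = tok("GPT-J learns an inner representation of English.", return_tensors="pt")
    hidden = model(**batch).last_hidden_state  # shape: [1, seq_len, 4096]
features = hidden[:, -1, :]                    # last-token vector as a simple sequence feature
print(features.shape)                          # torch.Size([1, 4096])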

Out-of-scope use

GPT-J-6B is not intended for deployment without fine-tuning, supervision, and/or moderation. It is not in itself a product and cannot be used for human-facing interactions. For example, the model may generate harmful or offensive text. Please evaluate the risks associated with your particular use case.

GPT-J-6B was trained on an English-language only dataset, and is thus not suitable for translation or generating text in other languages.

GPT-J-6B has not been fine-tuned for downstream contexts in which language models are commonly deployed, such as writing genre prose, or commercial chatbots. This means GPT-J-6B will not respond to a given prompt the way a product like ChatGPT does. This is because, unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement Learning from Human Feedback (RLHF) to better “follow” human instructions.

Limitations and Biases

The core functionality of GPT-J is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-J it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-J to produce factually accurate output.
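To make "predicting the next token" concrete, here is a hedged sketch that inspects the next-token distribution directly; it again assumes the Hugging Face transformers checkpoint "EleutherAI/gpt-j-6b" rather than the ModelScope pipeline shown later:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b", torch_dtype=torch.float16).eval()

prompt = "The capital of France is"
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits  # [1, seq_len, vocab_size]
probs = torch.softmax(logits[0, -1], dim=-1)                    # distribution over the next token
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(int(idx))!r}  {p.item():.3f}")
# The highest-probability continuation is only the statistically most likely
# token; it is not a guarantee of factual accuracy.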

GPT-J was trained on the Pile, a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon use case GPT-J may produce socially unacceptable text. See Sections 5 and 6 of the Pile paper for a more detailed analysis of the biases in the Pile.

As with all language models, it is hard to predict in advance how GPT-J will respond to particular prompts, and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.

Example Code

from modelscope.utils.constant import Tasks
from modelscope.pipelines import pipeline

pipe = pipeline(task=Tasks.text_generation, model='AI-ModelScope/gpt-j-6b', model_revision='v1.0.1', device='cuda')
inputs = 'Once upon a time,'
result = pipe(inputs)
print(result)
# {'text': ["Once upon a time, there was a girl who loved to sing. One day, she walked into a studio, and the studio's owner had a request for her.\n“I'm looking for the next Adele,” they said.\nAt the time, there just weren't that many female singers out there, and the only one that the studio owner knew was pretty well accomplished as far as being a vocalist. At the time, she was kind of into acoustic"]}

Training data

GPT-J 6B was trained on the Pile, a large-scale curated dataset created by EleutherAI.

Training procedure

This model was trained for 402 billion tokens over 383,500 steps on a TPU v3-256 pod. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
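In symbols (a restatement of the sentence above, not a formula from the original card), the objective minimized over each training sequence of tokens \(x_1, \dots, x_T\) is the next-token cross-entropy

\( \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \)

so maximizing the likelihood of the correct next token is equivalent to minimizing this loss.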

Evaluation results

| Model | Public | Training FLOPs | LAMBADA PPL ↓ | LAMBADA Acc ↑ | Winogrande ↑ | Hellaswag ↑ | PIQA ↑ | Dataset Size (GB) |
|-------|--------|----------------|---------------|---------------|--------------|-------------|--------|-------------------|
| Random Chance | ✓ | 0 | ~a lot | ~0% | 50% | 25% | 25% | 0 |
| GPT-3 Ada‡ | ✗ | ----- | 9.95 | 51.6% | 52.9% | 43.4% | 70.5% | ----- |
| GPT-2 1.5B | ✓ | ----- | 10.63 | 51.21% | 59.4% | 50.9% | 70.8% | 40 |
| GPT-Neo 1.3B‡ | ✓ | 3.0e21 | 7.50 | 57.2% | 55.0% | 48.9% | 71.1% | 825 |
| Megatron-2.5B* | ✗ | 2.4e21 | ----- | 61.7% | ----- | ----- | ----- | 174 |
| GPT-Neo 2.7B‡ | ✓ | 6.8e21 | 5.63 | 62.2% | 56.5% | 55.8% | 73.0% | 825 |
| GPT-3 1.3B*‡ | ✗ | 2.4e21 | 5.44 | 63.6% | 58.7% | 54.7% | 75.1% | ~800 |
| GPT-3 Babbage‡ | ✗ | ----- | 5.58 | 62.4% | 59.0% | 54.5% | 75.5% | ----- |
| Megatron-8.3B* | ✗ | 7.8e21 | ----- | 66.5% | ----- | ----- | ----- | 174 |
| GPT-3 2.7B*‡ | ✗ | 4.8e21 | 4.60 | 67.1% | 62.3% | 62.8% | 75.6% | ~800 |
| Megatron-11B† | ✓ | 1.0e22 | ----- | ----- | ----- | ----- | ----- | 161 |
| **GPT-J 6B‡** | **✓** | **1.5e22** | **3.99** | **69.7%** | **65.3%** | **66.1%** | **76.5%** | **825** |
| GPT-3 6.7B*‡ | ✗ | 1.2e22 | 4.00 | 70.3% | 64.5% | 67.4% | 78.0% | ~800 |
| GPT-3 Curie‡ | ✗ | ----- | 4.00 | 69.3% | 65.6% | 68.5% | 77.9% | ----- |
| GPT-3 13B*‡ | ✗ | 2.3e22 | 3.56 | 72.5% | 67.9% | 70.9% | 78.5% | ~800 |
| GPT-3 175B*‡ | ✗ | 3.1e23 | 3.00 | 76.2% | 70.2% | 78.9% | 81.0% | ~800 |
| GPT-3 Davinci‡ | ✗ | ----- | 3.0 | 75% | 72% | 78% | 80% | ----- |

Models roughly sorted by performance, or by FLOPs if not available.

* Evaluation numbers reported by their respective authors. All other numbers are provided by running lm-evaluation-harness either with released weights or with API access. Due to subtle implementation differences as well as different zero-shot task framing, these might not be directly comparable. See this blog post for more details.

† Megatron-11B provides no comparable metrics, and several implementations using the released weights do not reproduce the generation quality and evaluations. (see 1 2 3) Thus, evaluation was not attempted.

‡ These models have been trained with data which contains possible test set contamination. The OpenAI GPT-3 models failed to deduplicate training data for certain test sets, while the GPT-Neo models as well as this one were trained on the Pile, which has not been deduplicated against any test sets.

Citation and Related Information

BibTeX entry

To cite this model:

@misc{gpt-j,
  author = {Wang, Ben and Komatsuzaki, Aran},
  title = {{GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}

To cite the codebase that traied this model:

@misc{mesh-transformer-jax,
  author = {Wang, Ben},
  title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}

If you use this model, we would love to hear about it! Reach out on GitHub, Discord, or shoot Ben an email.

Acknowledgements

This project would not have been possible without compute generously provided by Google through the TPU Research Cloud, as well as the Cloud TPU team for providing early access to the Cloud TPU VM Alpha.

Thanks to everyone who has helped out one way or another (listed alphabetically):

