EasyNLP: An Open-Source Chinese NLP Algorithm Framework


With pretrained models such as BERT, Megatron, and GPT-3 achieving remarkable results in NLP, more and more teams have moved into ultra-large-scale training, pushing model sizes from hundreds of millions of parameters to hundreds of billions or even trillions. However, applying such ultra-large models in real-world scenarios still faces challenges. First, the sheer number of parameters makes training and inference slow and deployment very expensive. Second, in many practical scenarios the amount of available data is limited, which constrains the use of large models in few-shot settings, and improving the generalization of pretrained models on few-shot tasks remains difficult. To address these problems, the Alibaba Cloud Machine Learning PAI team has released EasyNLP, a Chinese NLP algorithm framework that helps large models reach production quickly and efficiently.

Main Features

Ease of use and open-source compatibility: EasyNLP supports common Chinese NLP datasets and models, making it convenient to benchmark Chinese NLP techniques. Besides providing simple PAI commands for invoking cutting-edge NLP algorithms, EasyNLP abstracts customizable modules such as AppZoo and ModelZoo to lower the barrier to building NLP applications. ModelZoo hosts common pretrained models as well as models developed by PAI, including knowledge-enhanced pretrained models. EasyNLP can seamlessly load huggingface/transformers models, is compatible with EasyTransfer models, and can improve training efficiency with its built-in distributed training framework (based on Torch-Accelerator).

Few-shot learning with large models: EasyNLP integrates classic few-shot learning algorithms such as PET and P-Tuning to tune large models on small datasets, addressing the mismatch between large models and small training sets. In addition, the PAI team combined classic few-shot learning with ideas from contrastive learning and proposed Contrastive Prompt Tuning (CP-Tuning), which adds no new parameters and requires no manually designed templates or label words; it ranked first on the FewCLUE few-shot learning leaderboard, improving over fine-tuning by more than 10%.

Knowledge distillation for large models: Since large models are hard to deploy because of their parameter counts, EasyNLP provides knowledge distillation to compress large models into efficient small models that meet the requirements of online serving. It also provides the MetaKD algorithm for meta knowledge distillation, which improves student models to the point where, in many domains, they can match the teacher model. EasyNLP additionally supports data augmentation, using pretrained models to augment data in the target domain, which effectively improves distillation quality.

Multi-modal models: Because many NLP tasks rely on representations from other modalities, EasyNLP supports not only pure NLP tasks but also popular multi-modal pretrained models for NLP tasks that require visual knowledge or visual features. For example, EasyNLP integrates CLIP for text-image matching and a Chinese DALL-E-style model for text-to-image generation.

Installation

$ git clone https://github.com/alibaba/EasyNLP.git
$ cd EasyNLP
$ pip install -r requirements.txt
$ python setup.py install

Environment requirements: Python 3.6, PyTorch >= 1.8.
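
As a quick post-install sanity check, the minimal sketch below verifies the stated environment requirements and confirms that the package can be imported; the `easynlp` import name follows the Quick Start snippets later in this article, and printing `easynlp.__file__` is simply a generic way to see where the package was installed.

# Minimal post-install sanity check (a sketch, not part of EasyNLP itself).
import sys

# The installation notes above require Python 3.6 and PyTorch >= 1.8.
assert sys.version_info >= (3, 6), "Python 3.6 or newer is required"

import torch
print("PyTorch version:", torch.__version__)

import easynlp  # should succeed after `python setup.py install`
print("EasyNLP installed at:", easynlp.__file__)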

Quick Start

Below is a BERT text classification example; a BERT model can be trained with just a few lines of code:

First, load the data through the load_dataset interface; then build a classification model; finally, call Trainer to train.

from easynlp.core import Trainer
from easynlp.appzoo import GeneralDataset, SequenceClassification, load_dataset
from easynlp.utils import initialize_easynlp

args = initialize_easynlp()

row_data = load_dataset('glue', 'qnli')["train"]
train_dataset = GeneralDataset(row_data, args.pretrained_model_name_or_path, args.sequence_length)

model = SequenceClassification(pretrained_model_name_or_path=args.pretrained_model_name_or_path)
Trainer(model=model, train_dataset=train_dataset).train()

For more datasets, please check them out in DataHub.
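
The load_dataset call above pulls the GLUE QNLI training split. To see what the raw examples look like before they are wrapped in GeneralDataset, the following sketch uses the Hugging Face datasets library directly; it assumes EasyNLP's load_dataset exposes the same underlying GLUE data, which is an assumption made for illustration only.

# Hedged sketch: inspect GLUE QNLI with the Hugging Face `datasets` library.
# Whether EasyNLP's load_dataset wraps this exact object is an assumption;
# the goal is only to show the shape of the data being classified.
from datasets import load_dataset as hf_load_dataset

raw = hf_load_dataset("glue", "qnli")["train"]
print(raw)                      # row count and column names
print(raw[0])                   # {'question': ..., 'sentence': ..., 'label': ..., 'idx': ...}
print(raw.features["label"])    # ClassLabel(names=['entailment', 'not_entailment'])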

You can also use the custom data interface:

from easynlp.core import Trainer
from easynlp.appzoo import ClassificationDataset, SequenceClassification
from easynlp.utils import initialize_easynlp

args = initialize_easynlp()

train_dataset = ClassificationDataset(
    pretrained_model_name_or_path=args.pretrained_model_name_or_path,
    data_file=args.tables,
    max_seq_length=args.sequence_length,
    input_schema=args.input_schema,
    first_sequence=args.first_sequence,
    label_name=args.label_name,
    label_enumerate_values=args.label_enumerate_values,
    is_training=True)

model = SequenceClassification(pretrained_model_name_or_path=args.pretrained_model_name_or_path)
Trainer(model=model, train_dataset=train_dataset).train()

Test command:

python main.py \
  --mode train \
  --tables=train_toy.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --first_sequence=sent1 \
  --label_name=label \
  --label_enumerate_values=0,1 \
  --checkpoint_dir=./tmp/ \
  --epoch_num=1 \
  --app_name=text_classify \
  --user_defined_parameters='pretrain_model_name_or_path=bert-tiny-uncased'
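
The command above expects a train_toy.tsv file that matches the declared input_schema: five tab-separated string columns named label, sid1, sid2, sent1 and sent2 (whether a header row is expected is not stated here; the sketch below writes none). A minimal way to generate such a toy file is sketched below; the IDs and sentences are invented purely so the command has something to read.

# Hedged sketch: write a tiny TSV matching
# input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1
# The rows are toy data invented for illustration, not a real dataset.
import csv

rows = [
    # label, sid1, sid2, sent1, sent2
    ("0", "q1", "s1", "how is the weather today", "the game starts at eight"),
    ("1", "q2", "s2", "is this movie worth watching", "the movie got very good reviews"),
]

with open("train_toy.tsv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)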

We also provide an AppZoo command-line interface for training models; training can be started with a simple parameter configuration:

$ easynlp \
  --mode=train \
  --worker_gpu=1 \
  --tables=train.tsv,dev.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --first_sequence=sent1 \
  --label_name=label \
  --label_enumerate_values=0,1 \
  --checkpoint_dir=./classification_model \
  --epoch_num=1 \
  --sequence_length=128 \
  --app_name=text_classify \
  --user_defined_parameters='pretrain_model_name_or_path=bert-small-uncased'

$ easynlp \
  --mode=predict \
  --tables=dev.tsv \
  --outputs=dev.pred.tsv \
  --input_schema=label:str:1,sid1:str:1,sid2:str:1,sent1:str:1,sent2:str:1 \
  --output_schema=predictions,probabilities,logits,output \
  --append_cols=label \
  --first_sequence=sent1 \
  --checkpoint_path=./classification_model \
  --app_name=text_classify

For more AppZoo examples, see the AppZoo documentation.
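
After the predict command finishes, dev.pred.tsv should contain the columns requested by --output_schema (predictions, probabilities, logits, output) plus the gold label appended via --append_cols=label. The sketch below computes accuracy from that file; the assumption that the columns are tab-separated, appear in exactly that order, and place the appended label last must be checked against the file actually produced.

# Hedged sketch: accuracy from dev.pred.tsv.
# Assumed layout (verify against your output): tab-separated columns in the
# order predictions, probabilities, logits, output, label (appended gold label).
import csv

correct = total = 0
with open("dev.pred.tsv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        pred, gold = row[0], row[-1]
        correct += int(pred == gold)
        total += 1

print(f"accuracy: {correct / total:.4f} ({correct}/{total})")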

ModelZoo

EasyNLP's ModelZoo currently supports the following pretrained models.

- PAI-BERT-zh (from Alibaba PAI): pre-trained BERT models with a large Chinese corpus.
- DKPLM (from Alibaba PAI): released with the paper DKPLM: Decomposable Knowledge-enhanced Pre-trained Language Model for Natural Language Understanding by Taolin Zhang, Chengyu Wang, Nan Hu, Minghui Qiu, Chengguang Tang, Xiaofeng He and Jun Huang.
- KGBERT (from Alibaba Damo Academy & PAI): pre-trained BERT models with knowledge graph embeddings injected.
- BERT (from Google): released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
- RoBERTa (from Facebook): released with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov.
- Chinese RoBERTa (from HFL): the Chinese version of RoBERTa.
- MacBERT (from HFL): released with the paper Revisiting Pre-trained Models for Chinese Natural Language Processing by Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang and Guoping Hu.
- WOBERT (from Zhuiyi Technology): the word-based BERT for the Chinese language.
- Mengzi (from Langboat): released with the paper Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese by Zhuosheng Zhang, Hanqing Zhang, Keming Chen, Yuhang Guo, Jingyun Hua, Yulong Wang and Ming Zhou.

For the full list, see the readme.
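
Several of the checkpoints listed above, such as HFL's Chinese RoBERTa and MacBERT, are also published on the Hugging Face hub, and the feature list notes that EasyNLP can load huggingface/transformers models seamlessly. The sketch below loads one of them with the transformers library directly; it illustrates the checkpoints rather than EasyNLP's own loading API, which follows the Quick Start pattern shown earlier.

# Hedged sketch: load HFL's Chinese RoBERTa from the Hugging Face hub using
# the transformers library. This is not EasyNLP's own API; it only shows that
# the ModelZoo checkpoints above have standard Hugging Face counterparts.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

inputs = tokenizer("EasyNLP 支持常用的中文预训练模型", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)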

Landing Large Pre-trained Models

EasyNLP provides few-shot learning and knowledge distillation to help users deploy very large pretrained models.

- PET (from LMU Munich and Sulzer GmbH): released with the paper Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference by Timo Schick and Hinrich Schutze. We have made some slight modifications to make the algorithm suitable for the Chinese language. (A conceptual cloze-style sketch follows this list.)
- P-Tuning (from Tsinghua University, Beijing Academy of AI, MIT and Recurrent AI, Ltd.): released with the paper GPT Understands, Too by Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang and Jie Tang. We have made some slight modifications to make the algorithm suitable for the Chinese language.
- CP-Tuning (from Alibaba PAI): released with the paper Making Pre-trained Language Models End-to-end Few-shot Learners with Contrastive Prompt Tuning by Ziyun Xu, Chengyu Wang, Minghui Qiu, Fuli Luo, Runxin Xu, Songfang Huang and Jun Huang.
- Vanilla KD (from Alibaba PAI): distilling the logits of large BERT-style models to smaller ones.
- MetaKD (from Alibaba PAI): released with the paper Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains by Haojie Pan, Chengyu Wang, Minghui Qiu, Yichang Zhang, Yaliang Li and Jun Huang.
- Data Augmentation (from Alibaba PAI): augmenting the data based on the MLM head of pre-trained language models.
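
To make the cloze idea behind PET-style few-shot learning concrete (see the PET item above), the conceptual sketch below uses the transformers fill-mask pipeline with a generic Chinese BERT to score hand-picked label words in a hand-written template. It is not EasyNLP's PET implementation; the template and the label words "好"/"差" are invented for this example.

# Conceptual sketch of the cloze idea behind PET-style few-shot learning.
# Not EasyNLP's implementation; template and label words are made up.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-chinese")

# Recast a sentiment example as a cloze question: the model fills [MASK],
# and each candidate label word maps to a class (好 -> positive, 差 -> negative).
text = "这家餐厅的菜很好吃"
template = f"{text}。总之很[MASK]。"

for candidate in fill_mask(template, targets=["好", "差"]):
    print(candidate["token_str"], round(candidate["score"], 4))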

CLUE Benchmark

EasyNLP provides CLUE evaluation code so that users can quickly evaluate models on CLUE datasets.

# Format: bash run_clue.sh device_id train/predict dataset
# e.g.: bash run_clue.sh 0 train csl

With this script, you can obtain evaluation results (on dev data) for models such as BERT and RoBERTa:

(1) bert-base-chinese

Task  AFQMC   CMNLI   CSL     IFLYTEK  OCNLI   TNEWS   WSC
P     72.17%  75.74%  80.93%  60.22%   78.31%  57.52%  75.33%
F1    52.96%  75.74%  81.71%  60.22%   78.30%  57.52%  80.82%

(2) chinese-roberta-wwm-ext

Task  AFQMC   CMNLI   CSL     IFLYTEK  OCNLI   TNEWS   WSC
P     73.10%  80.75%  80.07%  60.98%   80.75%  57.93%  86.84%
F1    56.04%  80.75%  81.50%  60.98%   80.75%  57.93%  89.58%

For detailed examples, please refer to the CLUE evaluation example.

Tutorials

- Custom text classification example
- Quick Start - Text Classification
- Quick Start - PAI DSW
- Quick Start - MaxCompute/ODPS data
- AppZoo - Text Vectorization
- AppZoo - Text Classification/Matching
- AppZoo - Sequence Labeling
- AppZoo - GEEP Text Classification
- Basic pre-training practice
- Knowledge pre-training practice
- Knowledge distillation practice
- Cross-task knowledge distillation practice
- Few-shot learning practice
- Rapidformer model training acceleration practice

API docs: https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/easynlp/easynlp_docs/html/index.html

License

This project is licensed under the Apache License (Version 2.0). This toolkit also contains some code modified from other repos under other open-source licenses. See the NOTICE file for more information.

Contact Us

Scan the QR code below to join our DingTalk group; feel free to report any issues there.

[QR code image]

References

- DKPLM: https://paperswithcode.com/paper/dkplm-decomposable-knowledge-enhanced-pre
- MetaKD: https://paperswithcode.com/paper/meta-kd-a-meta-knowledge-distillation
- CP-Tuning: https://paperswithcode.com/paper/making-pre-trained-language-models-end-to-end-1
- FashionBERT: https://paperswithcode.com/paper/fashionbert-text-and-image-matching-with

For a more detailed discussion, please refer to our arXiv paper.

@article{easynlp,
  doi = {10.48550/ARXIV.2205.00258},
  url = {https://arxiv.org/abs/2205.00258},
  author = {Wang, Chengyu and Qiu, Minghui and Zhang, Taolin and Liu, Tingting and Li, Lei and Wang, Jianing and Wang, Ming and Huang, Jun and Lin, Wei},
  title = {EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing},
  publisher = {arXiv},
  year = {2022}
}