attention_is_all_you_need

Development technology: Python
Category: Artificial Intelligence, Machine Learning / Deep Learning
License: BSD-3-Clause License

Project Details

Transformer - Attention Is All You Need

Chainer-based Python implementation of Transformer, an attention-based seq2seq model without convolution and recurrence. If you want to see the architecture, please see net.py.
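For reference, the heart of the model is scaled dot-product attention. The following is a minimal NumPy sketch of the single-head case, not the actual code in net.py; the function and variable names are illustrative.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    # Similarity scores between queries and keys, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key axis gives the attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # The output is a weighted sum of the value vectors.
    return weights @ V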

See"AttentionIsAllYouNeed",AshishVaswani,NoamShazeer,NikiParmar,JakobUszkoreit,LlionJones,AidanN.Gomez,LukaszKaiser,IlliaPolosukhin,arxiv,2017.

This repository is partly derived from my convolutional seq2seq repo, which is also derived from Chainer's official seq2seq example.

Requirement

Python 3.6.0+
Chainer 2.0.0+
numpy 1.12.1+
cupy 1.0.0+ (if using gpu)
nltk
progressbar

(You can install all through pip) and their dependencies.
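For example, the Python dependencies above can be installed with pip in one step (assuming the standard PyPI package names; skip cupy if you do not use a GPU):

pip install chainer numpy cupy nltk progressbar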

Prepare Dataset

You can use any parallel corpus. For example, run

sh download_wmt.sh

which downloads and decompresses the training and development datasets from WMT/Europarl into your current directory. These files and their paths are set as the defaults in the training script train.py.
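Before training, you may want to confirm that a downloaded pair is really aligned line by line. The sketch below uses hypothetical file names (dev.en / dev.de), not the actual defaults wired into train.py:

# Hypothetical file names; substitute whatever download_wmt.sh actually produced.
src_path, tgt_path = "dev.en", "dev.de"

with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
    src_lines = fs.read().splitlines()
    tgt_lines = ft.read().splitlines()

# In a parallel corpus, line i of the source must correspond to line i of the target.
assert len(src_lines) == len(tgt_lines), "source/target line counts differ"
for src, tgt in zip(src_lines[:3], tgt_lines[:3]):
    print(src, "|||", tgt)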

How to Run

PYTHONIOENCODING=utf-8 python -u train.py -g=0 -i DATA_DIR -o SAVE_DIR

During training, logs for loss, perplexity, word accuracy and time are printed at a certain interval, in addition to validation tests (perplexity and BLEU for generation) every half epoch. A generation test is also performed and printed so you can check training progress.

Arguments

Some of them are as follows:

-g: your gpu id. If cpu, set -1.
-i DATA_DIR, -s SOURCE, -t TARGET, -svalid SVALID, -tvalid TVALID: the DATA_DIR directory needs to include a pair of training datasets, SOURCE and TARGET, and a pair of validation datasets, SVALID and TVALID. Each pair should be a parallel corpus with line-by-line sentence alignment.
-o SAVE_DIR: a JSON log report file and a model snapshot will be saved in the SAVE_DIR directory (if it does not exist, it will be made automatically).
-e: max epochs of training corpus.
-b: minibatch size.
-u: size of units and word embeddings.
-l: number of layers in both the encoder and the decoder.
--source-vocab: max size of vocabulary set of source language.
--target-vocab: max size of vocabulary set of target language.
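For example, a full invocation with all of these flags spelled out might look like the line below; the corpus file names and the values chosen for -e, -b, -u, -l and the vocabulary sizes are illustrative, not the script's defaults.

PYTHONIOENCODING=utf-8 python -u train.py -g=0 -i DATA_DIR -s train.en -t train.de -svalid dev.en -tvalid dev.de -o SAVE_DIR -e 40 -b 64 -u 512 -l 6 --source-vocab 40000 --target-vocab 40000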

Please see the others by python train.py -h.

Note

This repository does not aim for complete validation of the results in the paper, so I have not thoroughly confirmed the validity of its performance. However, I expect my implementation is almost compatible with the model described in the paper. Some differences of which I am aware are as follows:

Optimization/training strategy. Detailed information about batch size, parameter initialization, etc. is unclear in the paper. Additionally, the learning rate proposed in the paper may work only with a large batch size (e.g. 4000) for deep-layer nets. I changed warmup_step to 32000 from 4000, though there is room for improvement. I also changed relu into leaky relu in the feedforward net layers for easier gradient propagation.
Vocabulary set, dataset, preprocessing and evaluation. This repo uses a common word-based tokenization, although the paper uses byte-pair encoding. The size of the token set also differs. Evaluation (validation) is a little unfair and incompatible with the one in the paper: for example, even the validation set replaces unknown words with a single "unk" token, and beam search is unused in BLEU calculation.
Model size. The model setting in this repo is the "base model" in the paper, although you can modify some lines to use the "big model".
This code follows some settings used in the tensor2tensor repository, which includes a Transformer model. For example, the positional encoding used in that repository seems to differ from the one in the paper; this code follows the former.
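For reference, the learning-rate schedule from the paper is lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). The sketch below plugs in the warmup of 32000 mentioned above; the actual variable names and implementation in train.py may differ.

def learning_rate(step, d_model=512, warmup_steps=32000):
    # Schedule from "Attention Is All You Need": linear warmup for the first
    # warmup_steps updates, then decay with the inverse square root of the step.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at step == warmup_steps (about 2.5e-4 here).
print(learning_rate(32000))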