A Chainer-based Python implementation of Transformer, an attention-based seq2seq model without convolution and recurrence. If you want to see the architecture, please see `net.py`.

See "Attention Is All You Need", Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, arXiv, 2017.

This repository is partly derived from my convolutional seq2seq repo, which is also derived from Chainer's official seq2seq example.
## Requirement

- Python 3.6.0+
- Chainer 2.0.0+
- numpy 1.12.1+
- cupy 1.0.0+ (if using GPU)
- nltk
- progressbar
- (You can install all through `pip`)
- and their dependencies

## Prepare Dataset

You can use any parallel corpus. For example, run

```
sh download_wmt.sh
```

which downloads and decompresses the training and development datasets from WMT/Europarl into your current directory. These files and their paths are set as defaults in the training script `train.py`.
## How to Run

```
PYTHONIOENCODING=utf-8 python -u train.py -g=0 -i DATA_DIR -o SAVE_DIR
```

During training, logs for loss, perplexity, word accuracy and time are printed at a certain interval, in addition to validation tests (perplexity and BLEU for generation) every half epoch. A generation test is also performed and printed to check training progress.
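For reference, the BLEU reported at validation time is a corpus-level score. The exact computation lives in the training script, but as a rough, hypothetical illustration, corpus BLEU over whitespace-tokenized sentences can be computed with nltk (one of the listed requirements); the sentences below are made up:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference token lists per hypothesis (multiple references
# per sentence are allowed); one token list per generated hypothesis.
references = [[['a', 'cat', 'sat', 'on', 'the', 'mat']]]
hypotheses = [['a', 'cat', 'is', 'on', 'the', 'mat']]

# Smoothing avoids a zero score when some n-gram order has no match.
bleu = corpus_bleu(references, hypotheses,
                   smoothing_function=SmoothingFunction().method1)
print('BLEU: {:.4f}'.format(bleu))
```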
## Arguments

Some of them are as follows:

- `-g`: your GPU id. If using CPU, set `-1`.
- `-i DATA_DIR`, `-s SOURCE`, `-t TARGET`, `-svalid SVALID`, `-tvalid TVALID`: the `DATA_DIR` directory needs to include a pair of training datasets `SOURCE` and `TARGET` with a pair of validation datasets `SVALID` and `TVALID`. Each pair should be a parallel corpus with line-by-line sentence alignment (see the alignment-check sketch below).
- `-o SAVE_DIR`: a JSON log report file and a model snapshot will be saved in the `SAVE_DIR` directory (if it does not exist, it will be automatically made).
- `-e`: max epochs of training corpus.
- `-b`: minibatch size.
- `-u`: size of units and word embeddings.
- `-l`: number of layers in both the encoder and the decoder.
- `--source-vocab`: max size of vocabulary set of source language
- `--target-vocab`: max size of vocabulary set of target language

Please see the others by `python train.py -h`.
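Line-by-line alignment means the n-th sentence of `SOURCE` must correspond to the n-th sentence of `TARGET` (and likewise for `SVALID`/`TVALID`). Below is a minimal sketch of a sanity check for such a pair; the file names are placeholders, not the defaults of `train.py`:

```python
def check_parallel(source_path, target_path):
    """Verify two corpus files are line-aligned: same number of lines,
    no empty sentence on either side."""
    with open(source_path, encoding='utf-8') as fs, \
            open(target_path, encoding='utf-8') as ft:
        src = fs.readlines()
        trg = ft.readlines()
    assert len(src) == len(trg), \
        'line counts differ: {} vs {}'.format(len(src), len(trg))
    for i, (s, t) in enumerate(zip(src, trg)):
        assert s.strip() and t.strip(), \
            'empty sentence at line {}'.format(i + 1)


check_parallel('train.en', 'train.de')  # hypothetical file names
```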
## Note

This repository does not aim for complete validation of the results in the paper, so I have not eagerly confirmed the validity of its performance. But I expect my implementation is almost compatible with the model described in the paper. Some differences I am aware of are as follows:

- Optimization/training strategy. Detailed information about batch size, parameter initialization, etc. is unclear in the paper. Additionally, the learning rate proposed in the paper may work only with a large batch size (e.g. 4000) for deep layer nets. I changed warmup_step to 32000 from 4000, though there is room for improvement. I also changed relu into leaky relu in the feedforward net layers for easier gradient propagation.
- Vocabulary set, dataset, preprocessing and evaluation. This repo uses a common word-based tokenization, although the paper uses byte-pair encoding. The size of the token set also differs. Evaluation (validation) is a little unfair and incompatible with the one in the paper: e.g., even the validation set replaces unknown words with a single "unk" token, and beam search is unused in BLEU calculation.
- Model size. The setting of the model in this repo is that of the "base model" in the paper, although you can modify some lines to use the "big model".
- This code follows some settings used in the tensor2tensor repository, which includes a Transformer model. For example, the positional encoding used in that repository seems to differ from the one in the paper; this code follows the former.
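For reference, the schedule that the warmup_step change applies to is the one published in the paper. The following is a sketch of that formula with the values mentioned above (`d_model=512` for the base model), not an excerpt from this repo:

```python
def transformer_lr(step, d_model=512, warmup_steps=32000):
    """Learning rate schedule from "Attention Is All You Need":
    linear warmup for `warmup_steps` updates, then inverse-square-root
    decay. The paper uses warmup_steps=4000; 32000 is the value used here.
    """
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The positional-encoding difference mentioned above concerns channel layout: the paper interleaves sin/cos over channel pairs, while tensor2tensor places all sines in the first half of the channels and all cosines in the second (it also computes the timescales slightly differently, which this simplified sketch ignores):

```python
import numpy as np


def positional_encoding_paper(length, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i + 1] = cos(...)
    pos = np.arange(length)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.empty((length, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe


def positional_encoding_halves(length, d_model):
    # tensor2tensor-style layout: [sin | cos] concatenated over channels.
    pos = np.arange(length)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2.0 * i / d_model)
    return np.concatenate([np.sin(angle), np.cos(angle)],
                          axis=1).astype(np.float32)
```

Both layouts contain the same values per position; only the channel ordering differs, which matters once learned weights consume them.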