GitHub mirror of M. Zinkevich's great "Rules of Machine Learning" style guide, with extra goodness.
You can find the terminology for this guide in terminology.md.
You can find the overview for this guide in overview.md.

Structure

- Before Machine Learning
- ML Phase 1: Your First Pipeline
- ML Phase 2: Feature Engineering
- ML Phase 3: Slow Growth, Optimization Refinement, and Complex Models
- Related Work
- Acknowledgements & Appendix

Note: Asterisk (*) footnotes are my own. Numbered footnotes are Martin's.
Before Machine Learning

Rule 1 - Don't be afraid to launch a product without machine learning.*

Machine learning is cool, but it requires data. Theoretically, you can take data from a different problem and then tweak the model for a new product, but this will likely underperform basic heuristics. If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there. For instance, if you are ranking apps in an app marketplace, you could use the install rate or number of installs. If you are detecting spam, filter out publishers that have sent spam before. Don’t be afraid to use human editing either. If you need to rank contacts, rank the most recently used highest (or even rank alphabetically). If machine learning is not absolutely required for your product, don't use it until you have data.

* Google Research Blog - The 280-Year-Old Algorithm Inside Google Trips
Rule 2 - First, design and implement metrics.

Before formalizing what your machine learning system will do, track as much as possible in your current system. Do this for the following reasons:

1. It is easier to gain permission from the system’s users earlier on.
2. If you think that something might be a concern in the future, it is better to get historical data now.
3. If you design your system with metric instrumentation in mind, things will go better for you in the future. Specifically, you don’t want to find yourself grepping for strings in logs to instrument your metrics!
4. You will notice what things change and what stays the same. For instance, suppose you want to directly optimize one-day active users. However, during your early manipulations of the system, you may notice that dramatic alterations of the user experience don’t noticeably change this metric.

The Google Plus team measures expands per read, reshares per read, plus-ones per read, comments per read, comments per user, reshares per user, etc., which they use in computing the goodness of a post at serving time. Also, note that an experiment framework, where you can group users into buckets and aggregate statistics by experiment, is important. See Rule #12.

By being more liberal about gathering metrics, you can gain a broader picture of your system. Notice a problem? Add a metric to track it! Excited about some quantitative change on the last release? Add a metric to track it!
Rule 3 - Choose machine learning over a complex heuristic.

A simple heuristic can get your product out the door. A complex heuristic is unmaintainable. Once you have data and a basic idea of what you are trying to accomplish, move on to machine learning. As in most software engineering tasks, you will want to be constantly updating your approach, whether it is a heuristic or a machine-learned model, and you will find that the machine-learned model is easier to update and maintain (see Rule #16).
Your First Pipeline

Focus on your system infrastructure for your first pipeline. While it is fun to think about all the imaginative machine learning you are going to do, it will be hard to figure out what is happening if you don’t first trust your pipeline.
Rule 4 - Keep the first model simple and get the infrastructure right.

The first model provides the biggest boost to your product, so it doesn't need to be fancy. But you will run into many more infrastructure issues than you expect. Before anyone can use your fancy new machine learning system, you have to determine:

1. How to get examples to your learning algorithm.
2. A first cut as to what “good” and “bad” mean to your system.
3. How to integrate your model into your application. You can either apply the model live, or precompute the model on examples offline and store the results in a table. For example, you might want to preclassify web pages and store the results in a table, but you might want to classify chat messages live.

Choosing simple features makes it easier to ensure that:

1. The features reach your learning algorithm correctly.
2. The model learns reasonable weights.
3. The features reach your model in the server correctly.

Once you have a system that does these three things reliably, you have done most of the work. Your simple model provides you with baseline metrics and a baseline behavior that you can use to test more complex models. Some teams aim for a “neutral” first launch: a first launch that explicitly de-prioritizes machine learning gains, to avoid getting distracted.
Rule 5 - Test the infrastructure independently from the machine learning.

Make sure that the infrastructure is testable, and that the learning parts of the system are encapsulated so that you can test everything around it. Specifically:

1. Test getting data into the algorithm. Check that feature columns that should be populated are populated. Where privacy permits, manually inspect the input to your training algorithm. If possible, check statistics in your pipeline in comparison to elsewhere, such as RASTA.
2. Test getting models out of the training algorithm. Make sure that the model in your training environment gives the same score as the model in your serving environment (see Rule #37).

Machine learning has an element of unpredictability, so make sure that you have tests for the code for creating examples in training and serving, and that you can load and use a fixed model during serving. Also, it is important to understand your data: see Practical Advice for Analysis of Large, Complex Data Sets.
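The "populated feature columns" check from Rule 5 could be sketched roughly as follows. This is only an illustration: the example format (one dict per example), the function names, and the 99% threshold are assumptions, not part of the guide.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical example format: one dict per training example,
# mapping a feature column name to its value (or None if missing).
Example = Dict[str, Optional[float]]


def population_rate(examples: Sequence[Example], column: str) -> float:
    """Fraction of examples in which `column` is present and not None."""
    if not examples:
        return 0.0
    populated = sum(1 for ex in examples if ex.get(column) is not None)
    return populated / len(examples)


def underpopulated_columns(
    examples: Sequence[Example], required: Sequence[str], min_rate: float = 0.99
) -> List[str]:
    """Required columns whose population rate falls below `min_rate` (assumed threshold)."""
    return [c for c in required if population_rate(examples, c) < min_rate]


examples = [
    {"installs": 120.0, "rating": 4.5},
    {"installs": 30.0, "rating": None},
    {"installs": 7.0},
]
print(underpopulated_columns(examples, ["installs", "rating"]))
```

A check like this can run as an ordinary unit or integration test on a sample of pipeline output, independently of any model training.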
Rule 6 - Be careful about dropped data when copying pipelines.

Often we create a pipeline by copying an existing pipeline (i.e. cargo cult programming), and the old pipeline drops data that we need for the new pipeline. For example, the pipeline for Google Plus What’s Hot drops older posts (because it is trying to rank fresh posts). This pipeline was copied to use for Google Plus Stream, where older posts are still meaningful, but the pipeline was still dropping old posts. Another common pattern is to only log data that was seen by the user. Thus, this data is useless if we want to model why a particular post was not seen by the user, because all the negative examples have been dropped. A similar issue occurred in Play. While working on Play Apps Home, a new pipeline was created that also contained examples from two other landing pages (Play Games Home and Play Home Home) without any feature to disambiguate where each example came from.
Rule 7 - Turn heuristics into features, or handle them externally.

Usually the problems that machine learning is trying to solve are not completely new. There is an existing system for ranking, or classifying, or whatever problem you are trying to solve. This means that there are a bunch of rules and heuristics. These same heuristics can give you a lift when tweaked with machine learning. Your heuristics should be mined for whatever information they have, for two reasons. First, the transition to a machine learned system will be smoother. Second, usually those rules contain a lot of the intuition about the system you don’t want to throw away. There are four ways you can use an existing heuristic:

1. Preprocess using the heuristic. If the feature is incredibly awesome, then this is an option. For example, if, in a spam filter, the sender has already been blacklisted, don’t try to relearn what “blacklisted” means. Block the message. This approach makes the most sense in binary classification tasks.
2. Create a feature. Directly creating a feature from the heuristic is great. For example, if you use a heuristic to compute a relevance score for a query result, you can include the score as the value of a feature. Later on you may want to use machine learning techniques to massage the value (for example, converting the value into one of a finite set of discrete values, or combining it with other features) but start by using the raw value produced by the heuristic.
3. Mine the raw inputs of the heuristic. If there is a heuristic for apps that combines the number of installs, the number of characters in the text, and the day of the week, then consider pulling these pieces apart, and feeding these inputs into the learning separately. Some techniques that apply to ensembles apply here (see Rule #40).
4. Modify the label. This is an option when you feel that the heuristic captures information not currently contained in the label. For example, if you are trying to maximize the number of downloads, but you also want quality content, then maybe the solution is to multiply the label by the average number of stars the app received. There is a lot of space here for leeway. See the section on “Your First Objective”.

Do be mindful of the added complexity when using heuristics in an ML system. Using old heuristics in your new machine learning algorithm can help to create a smooth transition, but think about whether there is a simpler way to accomplish the same effect.

Monitoring

In general, practice good alerting hygiene, such as making alerts actionable and having a dashboard page.
Rule 8 - Know the freshness requirements of your system.

How much does performance degrade if you have a model that is a day old? A week old? A quarter old? This information can help you to understand the priorities of your monitoring. If you lose 10% of your revenue if the model is not updated for a day, it makes sense to have an engineer watching it continuously. Most ad serving systems have new advertisements to handle every day, and must update daily. For instance, if the ML model for Google Play Search is not updated, it can have an impact on revenue in under a month. Some models for What’s Hot in Google Plus have no post identifier in their model so they can export these models infrequently. Other models that have post identifiers are updated much more frequently. Also notice that freshness can change over time, especially when feature columns are added or removed from your model.
Rule 9 - Detect problems before exporting models.

Many machine learning systems have a stage where you export the model to serving. If there is an issue with an exported model, it is a user-facing issue. If there is an issue before export, then it is a training issue, and users will not notice. Do sanity checks right before you export the model. Specifically, make sure that the model’s performance is reasonable on held-out data. Or, if you have lingering concerns with the data, don’t export a model. Many teams continuously deploying models check the area under the ROC curve (or AUC) before exporting. Issues about models that haven’t been exported require an email alert, but issues on a user-facing model may require a page. So it is better to wait and be sure before impacting users.
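A pre-export AUC check of the kind Rule 9 mentions might look like the sketch below. It computes AUC directly from its pairwise definition, which is quadratic in the number of examples and only suitable for small held-out samples. The function names and the 0.7 threshold are illustrative assumptions; a real threshold depends on the product.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve: the probability that a random positive
    example is scored above a random negative one (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def safe_to_export(labels: Sequence[int], scores: Sequence[float],
                   min_auc: float = 0.7) -> bool:
    """Gate the export step on held-out AUC (threshold is an assumption)."""
    return auc(labels, scores) >= min_auc


held_out_labels = [1, 0, 1, 0]
held_out_scores = [0.9, 0.8, 0.2, 0.1]
print(auc(held_out_labels, held_out_scores), safe_to_export(held_out_labels, held_out_scores))
```

If the gate fails, the previous model simply keeps serving, so the problem stays a training issue rather than a user-facing one.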
Rule 10 - Watch for silent failures.

This is a problem that occurs more for machine learning systems than for other kinds of systems. Suppose that a particular table that is being joined is no longer being updated. The machine learning system will adjust, and behavior will continue to be reasonably good, decaying gradually. Sometimes tables are found that were months out of date, and a simple refresh improved performance more than any other launch that quarter! Similarly, the coverage of a feature may change due to implementation changes: for example, a feature column could be populated in 90% of the examples, and suddenly drop to 60% of the examples. Play once had a table that was stale for 6 months, and refreshing the table alone gave a boost of 2% in install rate. If you track statistics of the data, as well as manually inspect the data on occasion, you can reduce these kinds of failures.*
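Tracking statistics for the two failure modes in Rule 10 (stale tables and coverage drops) could be sketched as below. The inputs are assumed to be precomputed: a mapping from feature column to coverage, and a mapping from table name to last-update time. All names and thresholds here are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def coverage_drops(baseline: Dict[str, float], current: Dict[str, float],
                   max_drop: float = 0.10) -> List[str]:
    """Feature columns whose coverage fell by more than `max_drop` (absolute)."""
    drops = []
    for column, old in baseline.items():
        new = current.get(column, 0.0)  # a column that vanished counts as 0% coverage
        if old - new > max_drop:
            drops.append(column)
    return sorted(drops)


def stale_tables(last_updated: Dict[str, datetime], now: datetime,
                 max_age: timedelta) -> List[str]:
    """Joined tables that have not been refreshed within `max_age`."""
    return sorted(name for name, ts in last_updated.items() if now - ts > max_age)


# The 90% -> 60% coverage drop used as an example in the text.
print(coverage_drops({"country": 0.90, "age": 0.50}, {"country": 0.60, "age": 0.48}))

now = datetime(2024, 1, 10)
updates = {"user_profiles": datetime(2024, 1, 9), "app_stats": datetime(2023, 7, 1)}
print(stale_tables(updates, now, max_age=timedelta(days=7)))
```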
* A Framework for Analysis of Data Freshness - Bouzeghoub & Peralta

Rule 11 - Give feature columns owners and documentation.

If the system is large, and there are many feature columns, know who created or is maintaining each feature column. If you find that the person who understands a feature column is leaving, make sure that someone has the information. Although many feature columns have descriptive names, it's good to have a more detailed description of what the feature is, where it came from, and how it is expected to help.
Your First Objective

You have many metrics, or measurements about the system that you care about, but your machine learning algorithm will often require a single objective, a number that your algorithm is “trying” to optimize. I distinguish here between objectives and metrics: a metric is any number that your system reports, which may or may not be important. See also Rule #2.
Rule 12 - Don't overthink which objective you choose to directly optimize.

You want to make money, make your users happy, and make the world a better place. There are tons of metrics that you care about, and you should measure them all (see Rule #2). However, early in the machine learning process, you will notice them all going up, even those that you do not directly optimize. For instance, suppose you care about number of clicks, time spent on the site, and daily active users. If you optimize for number of clicks, you are likely to see the time spent increase. So, keep it simple and don’t think too hard about balancing different metrics when you can still easily increase all the metrics. Don’t take this rule too far though: do not confuse your objective with the ultimate health of the system (see Rule #39). And, if you find yourself increasing the directly optimized metric, but deciding not to launch, some objective revision may be required.
Rule 13 - Choose a simple, observable and attributable metric for your first objective.

Often you don't know what the true objective is. You think you do, but then as you stare at the data and side-by-side analysis of your old system and new ML system, you realize you want to tweak it. Further, different team members often can't agree on the true objective. The ML objective should be something that is easy to measure and is a proxy for the “true” objective. So train on the simple ML objective, and consider having a "policy layer" on top that allows you to add additional logic (hopefully very simple logic) to do the final ranking.

The easiest thing to model is a user behavior that is directly observed and attributable to an action of the system:

- Was this ranked link clicked?
- Was this ranked object downloaded?
- Was this ranked object forwarded/replied to/emailed?
- Was this ranked object rated?
- Was this shown object marked as spam/pornography/offensive?

Avoid modeling indirect effects at first:

- Did the user visit the next day?
- How long did the user visit the site?
- What were the daily active users?

Indirect effects make great metrics, and can be used during A/B testing and during launch decisions. Finally, don’t try to get the machine learning to figure out:

- Is the user happy using the product?
- Is the user satisfied with the experience?
- Is the product improving the user’s overall wellbeing?
- How will this affect the company’s overall health?

These are all important, but also incredibly hard. Instead, use proxies: if the user is happy, they will stay on the site longer. If the user is satisfied, they will visit again tomorrow. Insofar as wellbeing and company health are concerned, human judgement is required to connect any machine learned objective to the nature of the product you are selling and your business plan, so we don’t end up here.
Rule 14 - Starting with an interpretable model makes debugging easier.

Linear regression, logistic regression, and Poisson regression are directly motivated by a probabilistic model. Each prediction is interpretable as a probability or an expected value. This makes them easier to debug than models that use objectives (zero-one loss, various hinge losses, et cetera) that try to directly optimize classification accuracy or ranking performance. For example, if probabilities in training deviate from probabilities predicted in side-by-sides or by inspecting the production system, this deviation could reveal a problem.

For example, in linear, logistic, or Poisson regression, there are subsets of the data where the average predicted expectation equals the average label (1-moment calibrated, or just calibrated) (see footnote 3). If you have a feature which is either 1 or 0 for each example, then the set of examples where that feature is 1 is calibrated. Also, if you have a feature that is 1 for every example, then the set of all examples is calibrated.
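The calibration property described above can be checked with a few lines of code. The sketch below assumes predictions and labels are plain lists; the function name `calibration_gap` and the `mask` argument are illustrative. Note the hedge: exact calibration is only expected on the training data of a converged, unregularized fit, so in practice one looks for gaps that are large relative to noise rather than for exact zeros.

```python
from typing import Optional, Sequence


def calibration_gap(predictions: Sequence[float], labels: Sequence[float],
                    mask: Optional[Sequence[int]] = None) -> float:
    """Mean prediction minus mean label over the selected examples.

    `mask` holds the values of a 0/1 feature; only examples where it is 1
    are included. With mask=None, every example is included, which
    corresponds to a feature that is 1 for every example.
    """
    if mask is None:
        mask = [1] * len(predictions)
    selected = [(p, y) for p, y, m in zip(predictions, labels, mask) if m == 1]
    if not selected:
        raise ValueError("no examples selected")
    mean_prediction = sum(p for p, _ in selected) / len(selected)
    mean_label = sum(y for _, y in selected) / len(selected)
    return mean_prediction - mean_label


predictions = [0.2, 0.8, 0.5, 0.5]
labels = [0, 1, 1, 0]
is_new_user = [1, 1, 0, 0]  # hypothetical binary feature

print(calibration_gap(predictions, labels))               # all examples
print(calibration_gap(predictions, labels, is_new_user))  # subset where the feature is 1
```

A gap far from zero on a subset defined by one of the model's own binary features is a hint that something differs between what was trained and what is being inspected.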
With simple models, it is easier to deal with feedback loops (see Rule #36). Often, we use these probabilistic predictions to make a decision: e.g. rank posts in decreasing expected value (i.e. probability of click/download/etc.). However, remember when it comes time to choose which model to use, the decision matters more than the likelihood of the data given the model (see Rule #27).
Rule 15 - Separate Spam Filtering and Quality Ranking in a Policy Layer.

Quality ranking is a fine art, but spam filtering is a war.* The signals that you use to determine high quality posts will become obvious to those who use your system, and they will tweak their posts to have these properties. Thus, your quality ranking should focus on ranking content that is posted in good faith. You should not discount the quality ranking learner for ranking spam highly. Similarly, “racy” content should be handled separately from Quality Ranking. Spam filtering is a different story. You have to expect that the features that you need to generate will be constantly changing. Often, there will be obvious rules that you put into the system (if a post has more than three spam votes, don’t retrieve it, et cetera). Any learned model will have to be updated daily, if not faster. The reputation of the creator of the content will play a great role.

At some level, the output of these two systems will have to be integrated. Keep in mind, filtering spam in search results should probably be more aggressive than filtering spam in email messages. Also, it is a standard practice to remove spam from the training data for the quality classifier.

* Google Research Blog - Lessons learned while protecting Gmail
Feature Engineering

In the first phase of the lifecycle of a machine learning system, the important issue is to get the training data into the learning system, get any metrics of interest instrumented, and create a serving infrastructure. After you have a working end to end system with unit and system tests instrumented, Phase II begins.
Rule 16 - Plan to launch and iterate.

Don’t expect that the model you are working on now will be the last one that you will launch, or even that you will ever stop launching models. Thus consider whether the complexity you are adding with this launch will slow down future launches. Many teams have launched a model per quarter or more for years. There are three basic reasons to launch new models:

1. you are coming up with new features,
2. you are tuning regularization and combining old features in new ways, and/or
3. you are tuning the objective.

Regardless, giving a model a bit of love can be good: looking over the data feeding into the example can help find new signals as well as old, broken ones. So, as you build your model, think about how easy it is to add or remove or recombine features. Think about how easy it is to create a fresh copy of the pipeline and verify its correctness. Think about whether it is possible to have two or three copies running in parallel. Finally, don’t worry about whether feature 16 of 35 makes it into this version of the pipeline. You’ll get it next quarter.
Rule 17 - Start with directly observed and reported features as opposed to learned features.

This might be a controversial point, but it avoids a lot of pitfalls. First of all, let’s describe what a learned feature is. A learned feature is a feature generated either by an external system (such as an unsupervised clustering system) or by the learner itself (e.g. via a factored model or deep learning). Both of these can be useful, but they can have a lot of issues, so they should not be in the first model.

If you use an external system to create a feature, remember that the system has its own objective. The external system's objective may be only weakly correlated with your current objective. If you grab a snapshot of the external system, then it can become out of date. If you update the features from the external system, then the meanings may change. If you use an external system to provide a feature, be aware that this approach requires a great deal of care.

The primary issue with factored models and deep models is that they are non-convex. Thus, there is no guarantee that an optimal solution can be approximated or found, and the local minima found on each iteration can be different. This variation makes it hard to judge whether the impact of a change to your system is meaningful or random. By creating a model without deep features, you can get an excellent baseline performance. After this baseline is achieved, you can try more esoteric approaches.
Rule 18 - Explore with features of content that generalize across contexts.

Often a machine learning system is a small part of a much bigger picture. For example, if you imagine a post that might be used in What’s Hot, many people will plus-one, re-share, or comment on a post before it is ever shown in What’s Hot. If you provide those statistics to the learner, it can promote new posts that it has no data for in the context it is optimizing. YouTube Watch Next could use number of watches, or co-watches (counts of how many times one video was watched after another was watched) from YouTube search. You can also use explicit user ratings. Finally, if you have a user action that you are using as a label, seeing that action on the document in a different context can be a great feature. All of these features allow you to bring new content into the context. Note that this is not about personalization: figure out if someone likes the content in this context first, then figure out who likes it more or less.
Rule 19 - Use very specific features when you can.

With tons of data, it is simpler to learn millions of simple features than a few complex features. Identifiers of documents being retrieved and canonicalized queries do not provide much generalization, but align your ranking with your labels on head queries. Thus, don’t be afraid of groups of features where each feature applies to a very small fraction of your data, but overall coverage is above 90%. You can use regularization to eliminate the features that apply to too few examples.
Rule 20 - Combine and modify existing features to create new features in human-understandable ways.

There are a variety of ways to combine and modify features. Machine learning systems such as TensorFlow allow you to preprocess your data through transformations. The two most standard approaches are “discretizations” and “crosses”.

1. Discretization consists of taking a continuous feature and creating many discrete features from it. Consider a continuous feature such as age. You can create a feature which is 1 when age is less than 18, another feature which is 1 when age is between 18 and 35, et cetera. Don’t overthink the boundaries of these histograms: basic quantiles will give you most of the impact.
2. Crosses combine two or more feature columns. A feature column, in TensorFlow's terminology, is a set of homogeneous features (e.g. {male, female}, {US, Canada, Mexico}, et cetera). A cross is a new feature column with features in, for example, {male, female} × {US, Canada, Mexico}. This new feature column will contain the feature (male, Canada). If you are using TensorFlow and you tell TensorFlow to create this cross for you, this (male, Canada) feature will be present in examples representing male Canadians. Note that it takes massive amounts of data to learn models with crosses of three, four, or more base feature columns.
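To make discretization and crosses concrete, here is a small framework-free sketch. It does not use TensorFlow's own feature-column API; the helper names, the feature-name format, and the choice to put an age of exactly 18 into the second bucket are all assumptions for illustration.

```python
from bisect import bisect_right
from statistics import quantiles
from typing import List, Sequence


def quantile_boundaries(values: Sequence[float], buckets: int) -> List[float]:
    """Bucket boundaries taken from basic quantiles of the observed values."""
    return quantiles(values, n=buckets)


def discretize(value: float, boundaries: Sequence[float], name: str) -> str:
    """Map a continuous value to the name of the single bucket feature that is 1."""
    bucket = bisect_right(boundaries, value)
    return f"{name}_bucket_{bucket}"


def cross(*features: str) -> str:
    """Combine one feature from each column into a single crossed feature."""
    return "_x_".join(features)


# Hand-picked boundaries matching the age example in the text.
print(discretize(17, [18, 35], "age"))
print(discretize(25, [18, 35], "age"))

# Quantile-based boundaries computed from (made-up) observed ages.
ages = [16, 19, 22, 25, 31, 38, 44, 52, 67]
print(quantile_boundaries(ages, 4))

# The (male, Canada) cross from the text.
print(cross("male", "Canada"))
```

Crossing a discretized feature with a categorical one, for example `cross(discretize(25, [18, 35], "age"), "Canada")`, follows the same pattern.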
Crosses that produce very large feature columns may overfit. For instance, imagine that you are doing some sort of search, and you have a feature column with words in the query, and you have a feature column with words in the document. You can combine these with a cross, but you will end up with a lot of features (see Rule #21). When working with text there are two alternatives. The most draconian is a dot product. A dot product in its simplest form simply counts the number of common words between the query and the document. This feature can then be discretized. Another approach is an intersection: thus, we will have a feature which is present if and only if the word “pony” is in the document and the query, and another feature which is present if and only if the word “the” is in the document and the query.
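The two text alternatives above, dot product and intersection, can be sketched as follows. Tokenization by lowercasing and splitting on whitespace is a simplifying assumption, as are the function names and the `both_have_` feature-name prefix.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Very simple tokenizer (assumption): lowercase, split on whitespace."""
    return set(text.lower().split())


def dot_product(query: str, document: str) -> int:
    """Simplest form of the dot product: the number of distinct common words."""
    return len(tokens(query) & tokens(document))


def intersection_features(query: str, document: str) -> List[str]:
    """One feature per word that appears in both the query and the document."""
    return sorted(f"both_have_{word}" for word in tokens(query) & tokens(document))


query = "the pony ride"
document = "The pony is small"
print(dot_product(query, document))
print(intersection_features(query, document))
```

The dot product yields a single number (which can then be discretized), while the intersection yields up to one feature per vocabulary word, so it needs more data to learn from.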
Rule 21 - The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.

There are fascinating statistical learning theory results concerning the appropriate level of complexity for a model, but this rule is basically all you need to know. I have had conversations in which people were doubtful that anything can be learned from one thousand examples, or that you would ever need more than 1 million examples, because they get stuck in a certain method of learning. The key is to scale your learning to the size of your data:

1. If you are working on a search ranking system, and there are millions of different words in the documents and the query and you have 1000 labeled examples, then you should use a dot product between document and query features, TF-IDF, and a half-dozen other highly human-engineered features. 1000 examples, a dozen features.
2. If you have a million examples, then intersect the document and query feature columns, using regularization and possibly feature selection. This will give you millions of features, but with regularization you will have fewer. Ten million examples, maybe a hundred thousand features.
3. If you have billions or hundreds of billions of examples, you can cross the feature columns with document and query tokens, using feature selection and regularization. You will have a billion examples, and 10 million features.

Statistical learning theory rarely gives tight bounds, but gives great guidance for a starting point. In the end, use Rule #28 to decide what features to use.
Rule 22 - Clean up features you are no longer using.

Unused features create technical debt. If you find that you are not using a feature, and that combining it with other features is not working, then drop it out of your infrastructure. You want to keep your infrastructure clean so that the most promising features can be tried as fast as possible. If necessary, someone can always add back your feature.

Keep coverage in mind when considering what features to add or keep. How many examples are covered by the feature? For example, if you have some personalization features, but only 8% of your users have any personalization features, it is not going to be very effective. At the same time, some features may punch above their weight. For example, if you have a feature which covers only 1% of the data, but 90% of the examples that have the feature are positive, then it will be a great feature to add.
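The coverage arithmetic in the last example can be reproduced with a short sketch. The example representation (a set of feature names plus a 0/1 label) and the feature name `vip` are made up for illustration.

```python
from typing import List, Set, Tuple

LabeledExample = Tuple[Set[str], int]  # (feature names present, 0/1 label)


def feature_stats(examples: List[LabeledExample], feature: str) -> Tuple[float, float]:
    """Return (coverage, positive_rate) for one feature.

    coverage: fraction of all examples that have the feature.
    positive_rate: fraction of positives among examples that have it.
    """
    labels_with_feature = [label for feats, label in examples if feature in feats]
    coverage = len(labels_with_feature) / len(examples) if examples else 0.0
    positive_rate = (sum(labels_with_feature) / len(labels_with_feature)
                     if labels_with_feature else 0.0)
    return coverage, positive_rate


# 1000 examples: 10 have the feature (1% coverage), 9 of those are positive (90%).
examples = [({"vip"}, 1)] * 9 + [({"vip"}, 0)] + [(set(), 0)] * 990
coverage, positive_rate = feature_stats(examples, "vip")
print(coverage, positive_rate)
```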
Human Analysis of the System

Before going on to the third phase of machine learning, it is important to focus on something that is not taught in any machine learning class: how to look at an existing model, and improve it. This is more of an art than a science, and yet there are several anti-patterns that it helps to avoid.
Rule 23 - You are not a typical end user.*

This is perhaps the easiest way for a team to get bogged down. While there are a lot of benefits to fish-fooding (using a prototype within your team) and dog-fooding (using a prototype within your company), employees should look at whether the performance is correct. While a change which is obviously bad should not be used, anything that looks reasonably near production should be tested further, either by paying laypeople to answer questions on a crowdsourcing platform, or through a live experiment on real users. There are two reasons for this. The first is that you are too close to the code. You may be looking for a particular aspect of the posts, or you are simply too emotionally involved (e.g. confirmation bias). The second is that your time is too valuable. Consider the cost of 9 engineers sitting in a one hour meeting, and think of how many contracted human labels that buys on a crowdsourcing platform.
If you really want to have user feedback, use user experience methodologies. Create user personas (one description is in Bill Buxton’s Sketching User Experiences) early in a process and do usability testing (one description is in Steve Krug’s Don’t Make Me Think) later. User personas involve creating a hypothetical user. For instance, if your team is all male, it might help to design a 35-year-old female user persona (complete with user features), and look at the results it generates rather than 10 results for 25-to-40-year-old males. Bringing in actual people to watch their reaction to your site (locally or remotely) in usability testing can also get you a fresh perspective.
* Google Research Blog - How to measure translation quality in your user interfaces
Rule 24 - Measure the delta between models.

One of the easiest, and sometimes most useful measurements you can make before any users have looked at your new model is to calculate just how different the new results are from production. For instance, if you have a ranking problem, run both models on a sample of queries through the entire system, and look at the size of the symmetric difference of the results (weighted by ranking position). If the difference is very small, then you can tell without running an experiment that there will be little change. If the difference is very large, then you want to make sure that the change is good. Looking over queries where the symmetric difference is high can help you to understand qualitatively what the change was like. Make sure, however, that the system is stable. Make sure that a model when compared with itself has a low (ideally zero) symmetric difference.
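One possible way to compute a position-weighted symmetric difference for a single query is sketched below. The guide does not specify a weighting scheme; the 1/rank weights and the normalization to the range [0, 1] are assumptions chosen for illustration, and each result list is assumed to contain no duplicates.

```python
from typing import Sequence


def weighted_symmetric_difference(old: Sequence[str], new: Sequence[str]) -> float:
    """Rank-weighted size of the symmetric difference of two result lists.

    An item at 1-based position r carries weight 1/r (assumed scheme).
    Returns 0.0 for lists with identical membership and 1.0 for disjoint lists.
    """
    old_set, new_set = set(old), set(new)
    only_in_one = 0.0
    total = 0.0
    for results, other in ((old, new_set), (new, old_set)):
        for position, item in enumerate(results, start=1):
            weight = 1.0 / position
            total += weight
            if item not in other:
                only_in_one += weight
    return only_in_one / total if total else 0.0


production = ["a", "b", "c"]
candidate = ["a", "c", "d"]
print(weighted_symmetric_difference(production, candidate))

# Stability check from the text: a model compared with itself should give zero.
print(weighted_symmetric_difference(production, production))
```

This particular measure ignores pure reorderings of the same items; if order changes matter, a rank-correlation measure may be a better fit. Averaging the per-query value over a sample of queries gives a single number, and sorting queries by it surfaces the ones worth inspecting by hand.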
Rule 25 - When choosing models, utilitarian performance trumps predictive power.

Your model may try to predict click-through-rate. However, in the end, the key question is what you do with that prediction. If you are using it to rank documents, then the quality of the final ranking matters more than the prediction itself. If you predict the probability that a document is spam and then have a cutoff on what is blocked, then the precision of what is allowed through matters more. Most of the time, these two things should be in agreement: when they do not agree, it will likely be on a small gain. Thus, if there is some change that improves log loss but degrades the performance of the system, look for another feature. When this starts happening more often, it is time to revisit the objective of your model.
Rule 26 - Look for patterns in the measured errors, and create new features.

Suppose that you see a training example that the model got “wrong”. In a classification task, this could be a false positive or a false negative. In a ranking task, it could be a pair where a positive was ranked lower than a negative. The most important point is that this is an example that the machine learning system knows it got wrong and would like to fix if given the opportunity. If you give the model a feature that allows it to fix the error, the model will try to use it. On the other hand, if you try to create a feature based upon examples the system doesn’t see as mistakes, the feature will be ignored. For instance, suppose that in Play Apps Search, someone searches for “free games”. Suppose one of the top results is a less relevant gag app. So you create a feature for “gag apps”. However, if you are maximizing number of installs, and people install a gag app when they search for free games, the “gag apps” feature won’t have the effect you want.

Once you have examples that the model got wrong, look for trends that are outside your current feature set. For instance, if the system seems to be demoting longer posts, then add post length. Don’t be too specific about the features you add. If you are going to add post length, don’t try to guess what long means, just add a dozen features and let the model figure out what to do with them (see Rule #21). That is the easiest way to get what you want.
Rule 27 - Try to quantify observed undesirable behavior.

Some members of your team will start to be frustrated with properties of the system they don’t like which aren’t captured by the existing loss function. At this point, they should do whatever it takes to turn their gripes into solid numbers. For example, if they think that too many “gag apps” are being shown in Play Search, they could have human raters identify gag apps. (You can feasibly use human-labelled data in this case because a relatively small fraction of the queries account for a large fraction of the traffic.) If your issues are measurable, then you can start using them as features, objectives, or metrics. The general rule is “measure first, optimize second”.
Rule 28 - Be aware that identical short-term behavior does not imply identical long-term behavior.

Imagine that you have a new system that looks at every doc_id and exact_query, and then calculates the probability of click for every doc for every query. You find that its behavior is nearly identical to your current system in both side by sides and A/B testing, so given its simplicity, you launch it. However, you notice that no new apps are being shown. Why? Well, since your system only shows a doc based on its own history with that query, there is no way to learn that a new doc should be shown.

The only way to understand how such a system would work long term is to have it train only on data acquired when the model was live. This is very difficult.
Training-Serving Skew

Training-serving skew is a difference between performance during training and performance during serving. This skew can be caused by:

- a discrepancy between how you handle data in the training and serving pipelines, or
- a change in the data between when you train and when you serve, or
- a feedback loop between your model and your algorithm.

We have observed production machine learning systems at Google with training-serving skew that negatively impacts performance. The best solution is to explicitly monitor it so that system and data changes don’t introduce skew unnoticed.
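One way to monitor the first cause of skew, a discrepancy between the training and serving pipelines, is to compare the feature values each pipeline produced for the same examples. The sketch below assumes both pipelines can log features keyed by a shared example id; the log format, function name, and tolerance are hypothetical.

```python
from typing import Dict

# Hypothetical log format: example id -> feature name -> value.
FeatureLog = Dict[str, Dict[str, float]]


def feature_skew(training: FeatureLog, serving: FeatureLog,
                 tolerance: float = 1e-6) -> Dict[str, float]:
    """For each feature, the fraction of shared examples whose training and
    serving values differ (or where one side is missing the feature)."""
    mismatches: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for example_id in training.keys() & serving.keys():
        train_feats, serve_feats = training[example_id], serving[example_id]
        for name in train_feats.keys() | serve_feats.keys():
            totals[name] = totals.get(name, 0) + 1
            a, b = train_feats.get(name), serve_feats.get(name)
            if a is None or b is None or abs(a - b) > tolerance:
                mismatches[name] = mismatches.get(name, 0) + 1
    return {name: mismatches.get(name, 0) / totals[name] for name in totals}


training_log = {"e1": {"post_length": 10.0, "age_days": 3.0},
                "e2": {"post_length": 5.0, "age_days": 1.0}}
serving_log = {"e1": {"post_length": 10.0, "age_days": 3.0},
               "e2": {"post_length": 5.0, "age_days": 2.0}}
print(feature_skew(training_log, serving_log))
```

A nonzero rate for a feature points at the code path that computes it differently in the two environments; tracking these rates over time keeps new skew from going unnoticed.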