
Rules of Machine Learning:

Best Practices for ML Engineering

Martin Zinkevich

This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. It presents a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machine-learned model, then you have the necessary background to read this document.

Terminology
Overview
Before Machine Learning
    Rule #1: Don't be afraid to launch a product without machine learning.
    Rule #2: Make metrics design and implementation a priority.
    Rule #3: Choose machine learning over a complex heuristic.
ML Phase I: Your First Pipeline
    Rule #4: Keep the first model simple and get the infrastructure right.
    Rule #5: Test the infrastructure independently from the machine learning.
    Rule #6: Be careful about dropped data when copying pipelines.
    Rule #7: Turn heuristics into features, or handle them externally.
Monitoring
    Rule #8: Know the freshness requirements of your system.
    Rule #9: Detect problems before exporting models.
    Rule #10: Watch for silent failures.
    Rule #11: Give feature columns owners and documentation.
Your First Objective
    Rule #12: Don't overthink which objective you choose to directly optimize.
    Rule #13: Choose a simple, observable and attributable metric for your first objective.
    Rule #14: Starting with an interpretable model makes debugging easier.
    Rule #15: Separate Spam Filtering and Quality Ranking in a Policy Layer.
ML Phase II: Feature Engineering
    Rule #16: Plan to launch and iterate.
    Rule #17: Start with directly observed and reported features as opposed to learned features.
    Rule #18: Explore with features of content that generalize across contexts.
    Rule #19: Use very specific features when you can.
    Rule #20: Combine and modify existing features to create new features in human-understandable ways.
    Rule #21: The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.
    Rule #22: Clean up features you are no longer using.
Human Analysis of the System
    Rule #23: You are not a typical end user.
    Rule #24: Measure the delta between models.
    Rule #25: When choosing models, utilitarian performance trumps predictive power.
    Rule #26: Look for patterns in the measured errors, and create new features.
    Rule #27: Try to quantify observed undesirable behavior.
    Rule #28: Be aware that identical short-term behavior does not imply identical long-term behavior.
Training-Serving Skew
    Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.
    Rule #30: Importance weight sampled data, don't arbitrarily drop it!
    Rule #31: Beware that if you join data from a table at training and serving time, the data in the table may change.
    Rule #32: Reuse code between your training pipeline and your serving pipeline whenever possible.
    Rule #33: If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.
    Rule #34: In binary classification for filtering (such as spam detection or determining interesting e-mails), make small short-term sacrifices in performance for very clean data.
    Rule #35: Beware of the inherent skew in ranking problems.
    Rule #36: Avoid feedback loops with positional features.
    Rule #37: Measure Training/Serving Skew.
ML Phase III: Slowed Growth, Optimization Refinement, and Complex Models
    Rule #38: Don't waste time on new features if unaligned objectives have become the issue.
    Rule #39: Launch decisions will depend upon more than one metric.
    Rule #40: Keep ensembles simple.
    Rule #41: When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.
    Rule #42: Don't expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.
    Rule #43: Your friends tend to be the same across different products. Your interests tend not to be.
Related Work
Acknowledgements
Appendix
    YouTube Overview
    Google Play Overview
    Google Plus Overview

Terminology

The following terms will come up repeatedly in our discussion of effective machine learning:

Instance: The thing about which you want to make a prediction. For example, the instance might be a web page that you want to classify as either "about cats" or "not about cats".
Label: An answer for a prediction task, either the answer produced by a machine learning system, or the right answer supplied in training data. For example, the label for a web page might be "about cats".
Feature: A property of an instance used in a prediction task. For example, a web page might have a feature "contains the word 'cat'".
Feature Column: A set of related features, such as the set of all possible countries in which users might live. An example may have one or more features present in a feature column. A feature column is referred to as a "namespace" in the VW system (at Yahoo/Microsoft), or a "field". ("Feature column" is Google-specific terminology.)
Example: An instance (with its features) and a label.
Model: A statistical representation of a prediction task. You train a model on examples, then use the model to make predictions.
Metric: A number that you care about. May or may not be directly optimized.
Objective: A metric that your algorithm is trying to optimize.
Pipeline: The infrastructure surrounding a machine learning algorithm. Includes gathering the data from the front end, putting it into training data files, training one or more models, and exporting the models to production.

Overview

To make great products:

    do machine learning like the great engineer you are, not like the great machine learning expert you aren't.

Most of the problems you will face are, in fact, engineering problems. Even with all the resources of a great machine learning expert, most of the gains come from great features, not great machine learning algorithms. So, the basic approach is:
1. make sure your pipeline is solid end to end
2. start with a reasonable objective
3. add common-sense features in a simple way
4. make sure that your pipeline stays solid.
This approach will make lots of money and/or make lots of people happy for a long period of time. Diverge from this approach only when there are no more simple tricks to get you any farther. Adding complexity slows future releases.

Once you've exhausted the simple tricks, cutting-edge machine learning might indeed be in your future. See the section on Phase III machine learning projects.

This document is arranged in four parts:
1. The first part should help you understand whether the time is right for building a machine learning system.
2. The second part is about deploying your first pipeline.
3. The third part is about launching and iterating while adding new features to your pipeline, how to evaluate models, and training-serving skew.
4. The final part is about what to do when you reach a plateau.
Afterwards, there is a list of related work and an appendix with some background on the systems commonly used as examples in this document.

Before Machine Learning

Rule #1: Don't be afraid to launch a product without machine learning.
Machine learning is cool, but it requires data. Theoretically, you can take data from a different problem and then tweak the model for a new product, but this will likely underperform basic heuristics. If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.

For instance, if you are ranking apps in an app marketplace, you could use the install rate or number of installs. If you are detecting spam, filter out publishers that have sent spam before. Don't be afraid to use human editing either. If you need to rank contacts, rank the most recently used highest (or even rank alphabetically). If machine learning is not absolutely required for your product, don't use it until you have data.

Rule #2: First, design and implement metrics.
Before formalizing what your machine learning system will do, track as much as possible in your current system. Do this for the following reasons:

1. It is easier to gain permission from the system's users earlier on.
2. If you think that something might be a concern in the future, it is better to get historical data now.
3. If you design your system with metric instrumentation in mind, things will go better for you in the future. Specifically, you don't want to find yourself grepping for strings in logs to instrument your metrics!
4. You will notice what things change and what stays the same. For instance, suppose you want to directly optimize one-day active users. However, during your early manipulations of the system, you may notice that dramatic alterations of the user experience don't noticeably change this metric.

The Google Plus team measures expands per read, reshares per read, plus-ones per read, comments/read, comments per user, reshares per user, etc., which they use in computing the goodness of a post at serving time. Also, note that an experiment framework, where you can group users into buckets and aggregate statistics by experiment, is important. See Rule #12.

By being more liberal about gathering metrics, you can gain a broader picture of your system. Notice a problem? Add a metric to track it! Excited about some quantitative change on the last release? Add a metric to track it!

Rule #3: Choose machine learning over a complex heuristic.
A simple heuristic can get your product out the door. A complex heuristic is unmaintainable. Once you have data and a basic idea of what you are trying to accomplish, move on to machine learning. As in most software engineering tasks, you will want to be constantly updating your approach, whether it is a heuristic or a machine-learned model, and you will find that the machine-learned model is easier to update and maintain (see Rule #16).

ML Phase I: Your First Pipeline

Focus on your system infrastructure for your first pipeline. While it is fun to think about all the imaginative machine learning you are going to do, it will be hard to figure out what is happening if you don't first trust your pipeline.

Rule #4: Keep the first model simple and get the infrastructure right.
The first model provides the biggest boost to your product, so it doesn't need to be fancy. But you will run into many more infrastructure issues than you expect. Before anyone can use your fancy new machine learning system, you have to determine:

1. How to get examples to your learning algorithm.
2. A first cut as to what "good" and "bad" mean to your system.
3. How to integrate your model into your application. You can either apply the model live, or precompute the model on examples offline and store the results in a table. For example, you might want to pre-classify web pages and store the results in a table, but you might want to classify chat messages live.

Choosing simple features makes it easier to ensure that:
1. The features reach your learning algorithm correctly.
2. The model learns reasonable weights.
3. The features reach your model in the server correctly.
Once you have a system that does these three things reliably, you have done most of the work. Your simple model provides you with baseline metrics and a baseline behavior that you can use to test more complex models. Some teams aim for a "neutral" first launch: a first launch that explicitly de-prioritizes machine learning gains, to avoid getting distracted.

Rule #5: Test the infrastructure independently from the machine learning.
Make sure that the infrastructure is testable, and that the learning parts of the system are encapsulated so that you can test everything around it. Specifically:
1. Test getting data into the algorithm. Check that feature columns that should be populated are populated. Where privacy permits, manually inspect the input to your training algorithm. If possible, check statistics in your pipeline in comparison to elsewhere, such as RASTA.
2. Test getting models out of the training algorithm. Make sure that the model in your training environment gives the same score as the model in your serving environment (see Rule #37).

Machine learning has an element of unpredictability, so make sure that you have tests for the code for creating examples in training and serving, and that you can load and use a fixed model during serving. Also, it is important to understand your data: see Practical Advice for Analysis of Large, Complex Data Sets.
Rule #6: Be careful about dropped data when copying pipelines.
Often we create a pipeline by copying an existing pipeline (i.e. cargo cult programming), and the old pipeline drops data that we need for the new pipeline. For example, the pipeline for Google Plus What's Hot drops older posts (because it is trying to rank fresh posts). This pipeline was copied to use for Google Plus Stream, where older posts are still meaningful, but the pipeline was still dropping old posts. Another common pattern is to only log data that was seen by the user. Thus, this data is useless if we want to model why a particular post was not seen by the user, because all the negative examples have been dropped. A similar issue occurred in Play. While working on Play Apps Home, a new pipeline was created that also contained examples from two other landing pages (Play Games Home and Play Home Home) without any feature to disambiguate where each example came from.

Rule #7: Turn heuristics into features, or handle them externally.
Usually the problems that machine learning is trying to solve are not completely new. There is an existing system for ranking, or classifying, or whatever problem you are trying to solve. This means that there are a bunch of rules and heuristics. These same heuristics can give you a lift when tweaked with machine learning. Your heuristics should be mined for whatever information they have, for two reasons. First, the transition to a machine-learned system will be smoother. Second, usually those rules contain a lot of the intuition about the system you don't want to throw away. There are four ways you can use an existing heuristic:
1. Preprocess using the heuristic. If the feature is incredibly awesome, then this is an option. For example, if, in a spam filter, the sender has already been blacklisted, don't try to relearn what "blacklisted" means. Block the message. This approach makes the most sense in binary classification tasks.
2. Create a feature. Directly creating a feature from the heuristic is great. For example, if you use a heuristic to compute a relevance score for a query result, you can include the score as the value of a feature. Later on you may want to use machine learning techniques to massage the value (for example, converting the value into one of a finite set of discrete values, or combining it with other features), but start by using the raw value produced by the heuristic.
3. Mine the raw inputs of the heuristic. If there is a heuristic for apps that combines the number of installs, the number of characters in the text, and the day of the week, then consider pulling these pieces apart, and feeding these inputs into the learning separately. Some techniques that apply to ensembles apply here (see Rule #40).
4. Modify the label. This is an option when you feel that the heuristic captures information not currently contained in the label. For example, if you are trying to maximize the number of downloads, but you also want quality content, then maybe the solution is to multiply the label by the average number of stars the app received. There is a lot of space here for leeway. See the section on "Your First Objective".
Do be mindful of the added complexity when using heuristics in an ML system. Using old heuristics in your new machine learning algorithm can help to create a smooth transition, but think about whether there is a simpler way to accomplish the same effect.
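As a minimal sketch of options 2 and 4 above, here is one way the heuristic output and the label adjustment might look. The function and field names here are hypothetical placeholders, not taken from any real system:

    # Sketch: folding an existing heuristic into an ML pipeline.
    # heuristic_relevance() stands in for your legacy hand-tuned rule.

    def heuristic_relevance(query, doc):
        # Pretend this is the old hand-tuned relevance rule.
        return 1.0 if query in doc["title"] else 0.1

    def build_example(query, doc, downloaded, avg_stars):
        features = {
            # Option 2: expose the raw heuristic output as a feature value
            # and let the learner decide how much to trust it.
            "heuristic_relevance": heuristic_relevance(query, doc),
            "num_installs": doc["num_installs"],
        }
        # Option 4: modify the label, e.g. weight a download by the app's
        # average star rating so quality shows up in the objective.
        label = float(downloaded) * avg_stars
        return features, label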

Monitoring

In general, practice good alerting hygiene, such as making alerts actionable and having a dashboard page.

Rule #8: Know the freshness requirements of your system.
How much does performance degrade if you have a model that is a day old? A week old? A quarter old? This information can help you to understand the priorities of your monitoring. If you lose 10% of your revenue if the model is not updated for a day, it makes sense to have an engineer watching it continuously. Most ad serving systems have new advertisements to handle every day, and must update daily. For instance, if the ML model for Google Play Search is not updated, it can have an impact on revenue in under a month. Some models for What's Hot in Google Plus have no post identifier in their model, so they can export these models infrequently. Other models that have post identifiers are updated much more frequently. Also notice that freshness can change over time, especially when feature columns are added or removed from your model.

Rule #9: Detect problems before exporting models.
Many machine learning systems have a stage where you export the model to serving. If there is an issue with an exported model, it is a user-facing issue. If there is an issue before, then it is a training issue, and users will not notice.

Do sanity checks right before you export the model. Specifically, make sure that the model's performance is reasonable on held-out data. Or, if you have lingering concerns with the data, don't export a model. Many teams continuously deploying models check the area under the ROC curve (or AUC) before exporting. Issues about models that haven't been exported require an e-mail alert, but issues on a user-facing model may require a page. So better to wait and be sure before impacting users.

Rule #10: Watch for silent failures.
This is a problem that occurs more for machine learning systems than for other kinds of systems. Suppose that a particular table that is being joined is no longer being updated. The machine learning system will adjust, and behavior will continue to be reasonably good, decaying gradually. Sometimes tables are found that were months out of date, and a simple refresh improved performance more than any other launch that quarter! The coverage of a feature may also change due to implementation changes: for example, a feature column could be populated in 90% of the examples, and suddenly drop to 60% of the examples. Play once had a table that was stale for 6 months, and refreshing the table alone gave a boost of 2% in install rate. If you track statistics of the data, as well as manually inspect the data on occasion, you can reduce these kinds of failures.
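A minimal sketch of the kind of data-statistics tracking that catches this (the example format and the 10-point alert threshold are illustrative assumptions, not recommendations):

    # Sketch: alert when the coverage of a feature column drops sharply
    # between yesterday's and today's training data.

    def coverage(examples, column):
        # Fraction of examples in which `column` has any feature populated.
        populated = sum(1 for ex in examples if ex.get(column))
        return populated / max(len(examples), 1)

    def coverage_alerts(yesterday, today, columns, max_drop=0.10):
        alerts = []
        for col in columns:
            old, new = coverage(yesterday, col), coverage(today, col)
            if old - new > max_drop:
                alerts.append("%s: coverage fell from %.0f%% to %.0f%%"
                              % (col, 100 * old, 100 * new))
        return alerts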

Rule #11: Give feature columns owners and documentation.
If the system is large, and there are many feature columns, know who created or is maintaining each feature column. If you find that the person who understands a feature column is leaving, make sure that someone has the information. Although many feature columns have descriptive names, it's good to have a more detailed description of what the feature is, where it came from, and how it is expected to help.

Your First Objective

You have many metrics, or measurements about the system that you care about, but your machine learning algorithm will often require a single objective, a number that your algorithm is "trying" to optimize. I distinguish here between objectives and metrics: a metric is any number that your system reports, which may or may not be important. See also Rule #2.

Rule #12: Don't overthink which objective you choose to directly optimize.
You want to make money, make your users happy, and make the world a better place. There are tons of metrics that you care about, and you should measure them all (see Rule #2). However, early in the machine learning process, you will notice them all going up, even those that you do not directly optimize. For instance, suppose you care about number of clicks, time spent on the site, and daily active users. If you optimize for number of clicks, you are likely to see the time spent increase.

So, keep it simple and don't think too hard about balancing different metrics when you can still easily increase all the metrics. Don't take this rule too far though: do not confuse your objective with the ultimate health of the system (see Rule #39). And, if you find yourself increasing the directly optimized metric, but deciding not to launch, some objective revision may be required.

Rule #13: Choose a simple, observable and attributable metric for your first objective.
Often you don't know what the true objective is. You think you do, but as you stare at the data and side-by-side analysis of your old system and new ML system, you realize you want to tweak it. Further, different team members often can't agree on the true objective. The ML objective should be something that is easy to measure and is a proxy for the "true" objective. (There is often no "true" objective; see Rule #39.) So train on the simple ML objective, and consider having a "policy layer" on top that allows you to add additional logic (hopefully very simple logic) to do the final ranking.

The easiest thing to model is a user behavior that is directly observed and attributable to an action of the system:
1. Was this ranked link clicked?
2. Was this ranked object downloaded?
3. Was this ranked object forwarded/replied to/e-mailed?
4. Was this ranked object rated?
5. Was this shown object marked as spam/pornography/offensive?
Avoid modeling indirect effects at first:
1. Did the user visit the next day?
2. How long did the user visit the site?
3. What were the daily active users?
Indirect effects make great metrics, and can be used during A/B testing and during launch decisions.
Finally, don't try to get the machine learning to figure out:
1. Is the user happy using the product?
2. Is the user satisfied with the experience?
3. Is the product improving the user's overall well-being?
4. How will this affect the company's overall health?
These are all important, but also incredibly hard. Instead, use proxies: if the user is happy, they will stay on the site longer. If the user is satisfied, they will visit again tomorrow. Insofar as well-being and company health are concerned, human judgement is required to connect any machine-learned objective to the nature of the product you are selling and your business plan, so we don't end up here.

Rule #14: Starting with an interpretable model makes debugging easier.
Linear regression, logistic regression, and Poisson regression are directly motivated by a probabilistic model. Each prediction is interpretable as a probability or an expected value. This makes them easier to debug than models that use objectives (zero-one loss, various hinge losses, et cetera) that try to directly optimize classification accuracy or ranking performance. For example, if probabilities in training deviate from probabilities predicted in side-by-sides or by inspecting the production system, this deviation could reveal a problem.

For example, in linear, logistic, or Poisson regression, there are subsets of the data where the average predicted expectation equals the average label (1-moment calibrated, or just "calibrated"). This is true assuming that you have no regularization and that your algorithm has converged; it is approximately true in general. If you have a feature which is either 1 or 0 for each example, then the set of examples where that feature is 1 is calibrated. Also, if you have a feature that is 1 for every example, then the set of all examples is calibrated.
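As an illustration, a minimal sketch of that calibration check on one slice of the data (the example format is an assumption):

    # Sketch: 1-moment calibration check on the slice of examples where a
    # given binary feature is 1. For an unregularized, converged logistic
    # model, mean prediction should be close to mean label on the slice.

    def calibration_on_slice(examples, feature):
        slice_ = [ex for ex in examples if ex["features"].get(feature) == 1]
        if not slice_:
            return None
        mean_pred = sum(ex["prediction"] for ex in slice_) / len(slice_)
        mean_label = sum(ex["label"] for ex in slice_) / len(slice_)
        return mean_pred, mean_label  # a large gap here is worth debugging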

With simple models, it is easier to deal with feedback loops (see Rule #36). Often, we use these probabilistic predictions to make a decision: e.g. rank posts in decreasing expected value (i.e. probability of click/download/etc.). However, remember when it comes time to choose which model to use, the decision matters more than the likelihood of the data given the model (see Rule #27).

Rule #15: Separate Spam Filtering and Quality Ranking in a Policy Layer.
Quality ranking is a fine art, but spam filtering is a war. The signals that you use to determine high-quality posts will become obvious to those who use your system, and they will tweak their posts to have these properties. Thus, your quality ranking should focus on ranking content that is posted in good faith. You should not discount the quality ranking learner for ranking spam highly. Similarly, "racy" content should be handled separately from Quality Ranking.

Spam filtering is a different story. You have to expect that the features that you need to generate will be constantly changing. Often, there will be obvious rules that you put into the system (if a post has more than three spam votes, don't retrieve it, et cetera). Any learned model will have to be updated daily, if not faster. The reputation of the creator of the content will play a great role.

At some level, the output of these two systems will have to be integrated. Keep in mind, filtering spam in search results should probably be more aggressive than filtering spam in e-mail messages. Also, it is a standard practice to remove spam from the training data for the quality classifier.

ML Phase II: Feature Engineering

In the first phase of the lifecycle of a machine learning system, the important issue is to get the training data into the learning system, get any metrics of interest instrumented, and create a serving infrastructure. After you have a working end-to-end system with unit and system tests instrumented, Phase II begins.

In the second phase, there is a lot of low-hanging fruit. There are a variety of obvious features that could be pulled into the system. Thus, the second phase of machine learning involves pulling in as many features as possible and combining them in intuitive ways. During this phase, all of the metrics should still be rising. There will be lots of launches, and it is a great time to pull in lots of engineers that can join up all the data that you need to create a truly awesome learning system.

Rule #16: Plan to launch and iterate.
Don't expect that the model you are working on now will be the last one that you will launch, or even that you will ever stop launching models. Thus consider whether the complexity you are adding with this launch will slow down future launches. Many teams have launched a model per quarter or more for years. There are three basic reasons to launch new models:
1. you are coming up with new features,
2. you are tuning regularization and combining old features in new ways, and/or
3. you are tuning the objective.

Regardless, giving a model a bit of love can be good: looking over the data feeding into the example can help find new signals as well as old, broken ones. So, as you build your model, think about how easy it is to add or remove or recombine features. Think about how easy it is to create a fresh copy of the pipeline and verify its correctness. Think about whether it is possible to have two or three copies running in parallel. Finally, don't worry about whether feature 16 of 35 makes it into this version of the pipeline. You'll get it next quarter.

Rule #17: Start with directly observed and reported features as opposed to learned features.
This might be a controversial point, but it avoids a lot of pitfalls. First of all, let's describe what a learned feature is. A learned feature is a feature generated either by an external system (such as an unsupervised clustering system) or by the learner itself (e.g. via a factored model or deep learning). Both of these can be useful, but they can have a lot of issues, so they should not be in the first model.

If you use an external system to create a feature, remember that the system has its own objective. The external system's objective may be only weakly correlated with your current objective. If you grab a snapshot of the external system, then it can become out of date. If you update the features from the external system, then the meanings may change. If you use an external system to provide a feature, be aware that this approach requires a great deal of care.

The primary issue with factored models and deep models is that they are non-convex. Thus, there is no guarantee that an optimal solution can be approximated or found, and the local minima found on each iteration can be different. This variation makes it hard to judge whether the impact of a change to your system is meaningful or random. By creating a model without deep features, you can get an excellent baseline performance. After this baseline is achieved, you can try more esoteric approaches.

Rule #18: Explore with features of content that generalize across contexts.
Often a machine learning system is a small part of a much bigger picture. For example, if you imagine a post that might be used in What's Hot, many people will plus-one, reshare, or comment on a post before it is ever shown in What's Hot. If you provide those statistics to the learner, it can promote new posts that it has no data for in the context it is optimizing. YouTube Watch Next could use number of watches, or co-watches (counts of how many times one video was watched after another was watched) from YouTube search. You can also use explicit user ratings. Finally, if you have a user action that you are using as a label, seeing that action on the document in a different context can be a great feature. All of these features allow you to bring new content into the context. Note that this is not about personalization: figure out if someone likes the content in this context first, then figure out who likes it more or less.

Rule #19: Use very specific features when you can.
With tons of data, it is simpler to learn millions of simple features than a few complex features. Identifiers of documents being retrieved and canonicalized queries do not provide much generalization, but align your ranking with your labels on head queries. Thus, don't be afraid of groups of features where each feature applies to a very small fraction of your data, but overall coverage is above 90%. You can use regularization to eliminate the features that apply to too few examples.

Rule #20: Combine and modify existing features to create new features in human-understandable ways.
There are a variety of ways to combine and modify features. Machine learning systems such as TensorFlow allow you to preprocess your data through transformations. The two most standard approaches are "discretizations" and "crosses".

Discretization consists of taking a continuous feature and creating many discrete features from it. Consider a continuous feature such as age. You can create a feature which is 1 when age is less than 18, another feature which is 1 when age is between 18 and 35, et cetera. Don't overthink the boundaries of these histograms: basic quantiles will give you most of the impact.

Crosses combine two or more feature columns. A feature column, in TensorFlow's terminology, is a set of homogeneous features (e.g. {male, female}, {US, Canada, Mexico}, et cetera). A cross is a new feature column with features in, for example, {male, female} × {US, Canada, Mexico}. This new feature column will contain the feature (male, Canada). If you are using TensorFlow and you tell TensorFlow to create this cross for you, this (male, Canada) feature will be present in examples representing male Canadians. Note that it takes massive amounts of data to learn models with crosses of three, four, or more base feature columns.
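For concreteness, a minimal sketch of a bucketized feature and a cross using the tf.feature_column API; the boundaries, vocabularies, and hash bucket size are illustrative choices, not recommendations:

    import tensorflow as tf

    # Discretization: turn continuous "age" into one-hot bucket features.
    age = tf.feature_column.numeric_column("age")
    age_buckets = tf.feature_column.bucketized_column(
        age, boundaries=[18, 25, 35, 50, 65])

    # Cross: combine two categorical feature columns, e.g. gender x country,
    # so the model can learn a weight for (male, Canada) specifically.
    gender = tf.feature_column.categorical_column_with_vocabulary_list(
        "gender", ["male", "female"])
    country = tf.feature_column.categorical_column_with_vocabulary_list(
        "country", ["US", "Canada", "Mexico"])
    gender_x_country = tf.feature_column.crossed_column(
        [gender, country], hash_bucket_size=1000)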

Crosses that produce very large feature columns may overfit. For instance, imagine that you are doing some sort of search, and you have a feature column with words in the query, and you have a feature column with words in the document. You can combine these with a cross, but you will end up with a lot of features (see Rule #21). When working with text there are two alternatives. The most draconian is a dot product. A dot product in its simplest form simply counts the number of common words between the query and the document. This feature can then be discretized. Another approach is an intersection: thus, we will have a feature which is present if and only if the word "pony" is in the document and the query, and another feature which is present if and only if the word "the" is in the document and the query.

Rule #21: The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.
There are fascinating statistical learning theory results concerning the appropriate level of complexity for a model, but this rule is basically all you need to know. I have had conversations in which people were doubtful that anything can be learned from one thousand examples, or that you would ever need more than 1 million examples, because they get stuck in a certain method of learning. The key is to scale your learning to the size of your data:
1. If you are working on a search ranking system, and there are millions of different words in the documents and the query and you have 1000 labeled examples, then you should use a dot product between document and query features, TF-IDF, and a half-dozen other highly human-engineered features. 1000 examples, a dozen features.
2. If you have a million examples, then intersect the document and query feature columns, using regularization and possibly feature selection. This will give you millions of features, but with regularization you will have fewer. Ten million examples, maybe a hundred thousand features.
3. If you have billions or hundreds of billions of examples, you can cross the feature columns with document and query tokens, using feature selection and regularization. You will have a billion examples, and 10 million features.
Statistical learning theory rarely gives tight bounds, but gives great guidance for a starting point. In the end, use Rule #28 to decide what features to use.


Rule #22: Clean up features you are no longer using.
Unused features create technical debt. If you find that you are not using a feature, and that combining it with other features is not working, then drop it out of your infrastructure. You want to keep your infrastructure clean so that the most promising features can be tried as fast as possible. If necessary, someone can always add back your feature.

Keep coverage in mind when considering what features to add or keep. How many examples are covered by the feature? For example, if you have some personalization features, but only 8% of your users have any personalization features, it is not going to be very effective.

At the same time, some features may punch above their weight. For example, if you have a feature which covers only 1% of the data, but 90% of the examples that have the feature are positive, then it will be a great feature to add.

Human Analysis of the System

Before going on to the third phase of machine learning, it is important to focus on something that is not taught in any machine learning class: how to look at an existing model, and improve it. This is more of an art than a science, and yet there are several anti-patterns that it helps to avoid.

Rule #23: You are not a typical end user.
This is perhaps the easiest way for a team to get bogged down. While there are a lot of benefits to fishfooding (using a prototype within your team) and dogfooding (using a prototype within your company), employees should look at whether the performance is correct. While a change which is obviously bad should not be used, anything that looks reasonably near production should be tested further, either by paying laypeople to answer questions on a crowdsourcing platform, or through a live experiment on real users.

There are two reasons for this. The first is that you are too close to the code. You may be looking for a particular aspect of the posts, or you are simply too emotionally involved (e.g. confirmation bias). The second is that your time is too valuable. Consider the cost of nine engineers sitting in a one-hour meeting, and think of how many contracted human labels that buys on a crowdsourcing platform.

If you really want to have user feedback, use user experience methodologies. Create user personas (one description is in Bill Buxton's Sketching User Experiences) early in a process and do usability testing (one description is in Steve Krug's Don't Make Me Think) later. User personas involve creating a hypothetical user. For instance, if your team is all male, it might help to design a 35-year-old female user persona (complete with user features), and look at the results it generates rather than 10 results for 25-to-40-year-old males. Bringing in actual people to watch their reaction to your site (locally or remotely) in usability testing can also get you a fresh perspective.

Rule #24: Measure the delta between models.
One of the easiest, and sometimes most useful, measurements you can make before any users have looked at your new model is to calculate just how different the new results are from production. For instance, if you have a ranking problem, run both models on a sample of queries through the entire system, and look at the size of the symmetric difference of the results (weighted by ranking position). If the difference is very small, then you can tell without running an experiment that there will be little change. If the difference is very large, then you want to make sure that the change is good. Looking over queries where the symmetric difference is high can help you to understand qualitatively what the change was like. Make sure, however, that the system is stable. Make sure that a model when compared with itself has a low (ideally zero) symmetric difference.
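A minimal sketch of such a comparison for one query; the 1/rank weighting is just one reasonable position discount, not the only choice:

    # Sketch: position-weighted symmetric difference between the result
    # lists of the production model and the candidate model for one query.

    def weighted_symmetric_difference(prod_results, new_results):
        def weights(results):
            # Items near the top of the ranking count more.
            return {doc: 1.0 / (rank + 1) for rank, doc in enumerate(results)}
        prod_w, new_w = weights(prod_results), weights(new_results)
        only_prod = sum(w for doc, w in prod_w.items() if doc not in new_w)
        only_new = sum(w for doc, w in new_w.items() if doc not in prod_w)
        return only_prod + only_new

Averaged over a sample of queries, a value near zero means the new model barely changes anything; comparing a model with itself should give zero.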

Rule #25: When choosing models, utilitarian performance trumps predictive power.
Your model may try to predict click-through rate. However, in the end, the key question is what you do with that prediction. If you are using it to rank documents, then the quality of the final ranking matters more than the prediction itself. If you predict the probability that a document is spam and then have a cutoff on what is blocked, then the precision of what is allowed through matters more. Most of the time, these two things should be in agreement: when they do not agree, it will likely be on a small gain. Thus, if there is some change that improves log loss but degrades the performance of the system, look for another feature. When this starts happening more often, it is time to revisit the objective of your model.

Rule #26: Look for patterns in the measured errors, and create new features.
Suppose that you see a training example that the model got "wrong". In a classification task, this could be a false positive or a false negative. In a ranking task, it could be a pair where a positive was ranked lower than a negative. The most important point is that this is an example that the machine learning system knows it got wrong and would like to fix if given the opportunity. If you give the model a feature that allows it to fix the error, the model will try to use it.

On the other hand, if you try to create a feature based upon examples the system doesn't see as mistakes, the feature will be ignored. For instance, suppose that in Play Apps Search, someone searches for "free games". Suppose one of the top results is a less relevant gag app. So you create a feature for "gag apps". However, if you are maximizing number of installs, and people install a gag app when they search for free games, the "gag apps" feature won't have the effect you want.

Once you have examples that the model got wrong, look for trends that are outside your current feature set. For instance, if the system seems to be demoting longer posts, then add post length. Don't be too specific about the features you add. If you are going to add post length, don't try to guess what "long" means, just add a dozen features and let the model figure out what to do with them (see Rule #21). That is the easiest way to get what you want.

Rule #27: Try to quantify observed undesirable behavior.
Some members of your team will start to be frustrated with properties of the system they don't like which aren't captured by the existing loss function. At this point, they should do whatever it takes to turn their gripes into solid numbers. For example, if they think that too many "gag apps" are being shown in Play Search, they could have human raters identify gag apps. (You can feasibly use human-labelled data in this case because a relatively small fraction of the queries account for a large fraction of the traffic.) If your issues are measurable, then you can start using them as features, objectives, or metrics. The general rule is "measure first, optimize second".

Rule #28: Be aware that identical short-term behavior does not imply identical long-term behavior.
Imagine that you have a new system that looks at every doc_id and exact_query, and then calculates the probability of click for every doc for every query. You find that its behavior is nearly identical to your current system in both side-by-sides and A/B testing, so given its simplicity, you launch it. However, you notice that no new apps are being shown. Why? Well, since your system only shows a doc based on its own history with that query, there is no way to learn that a new doc should be shown.

The only way to understand how such a system would work long-term is to have it train only on data acquired when the model was live. This is very difficult.

Training-Serving Skew

Training-serving skew is a difference between performance during training and performance during serving. This skew can be caused by:
- a discrepancy between how you handle data in the training and serving pipelines, or
- a change in the data between when you train and when you serve, or
- a feedback loop between your model and your algorithm.
We have observed production machine learning systems at Google with training-serving skew that negatively impacts performance. The best solution is to explicitly monitor it so that system and data changes don't introduce skew unnoticed.

Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.

Even if you can't do this for every example, do it for a small fraction, such that you can verify the consistency between serving and training (see Rule #37). Teams that have made this measurement at Google were sometimes surprised by the results. The YouTube home page switched to logging features at serving time with significant quality improvements and a reduction in code complexity, and many teams are switching their infrastructure as we speak.
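A minimal sketch of the idea; the request format and the compute_features/score_fn/log_line hooks are placeholders for your own serving stack, not a real API:

    import json
    import random

    LOG_FRACTION = 0.01  # logging even 1% of requests is enough to verify consistency

    def serve_and_log(request, compute_features, score_fn, log_line):
        features = compute_features(request)   # the exact serving-time features
        score = score_fn(features)
        if random.random() < LOG_FRACTION:
            # Training later reads these records instead of recomputing the
            # features, removing one source of training-serving skew.
            log_line(json.dumps({"request_id": request.get("id"),
                                 "features": features,
                                 "score": score}))
        return score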

Rule #30: Importance weight sampled data, don't arbitrarily drop it!
When you have too much data, there is a temptation to take files 1-12, and ignore files 13-99. This is a mistake: dropping data in training has caused issues in the past for several teams (see Rule #6). Although data that was never shown to the user can be dropped, importance weighting is best for the rest. Importance weighting means that if you decide that you are going to sample example X with a 30% probability, then give it a weight of 10/3. With importance weighting, all of the calibration properties discussed in Rule #14 still hold.
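A minimal sketch of importance-weighted sampling, matching the 30% example above (the example format is an assumption):

    import random

    def sample_with_importance_weight(examples, keep_prob=0.3):
        # Instead of silently dropping 70% of the data, keep a 30% sample and
        # up-weight each kept example by 1 / keep_prob (10/3 here), so every
        # example's expected contribution is unchanged and the calibration
        # properties of Rule #14 still hold.
        for ex in examples:
            if random.random() < keep_prob:
                ex = dict(ex)
                ex["weight"] = ex.get("weight", 1.0) / keep_prob
                yield ex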

Rule #31: Beware that if you join data from a table at training and serving time, the data in the table may change.
Say you join doc ids with a table containing features for those docs (such as number of comments or clicks). Between training and serving time, features in the table may be changed. Your model's prediction for the same document may then differ between training and serving. The easiest way to avoid this sort of problem is to log features at serving time (see Rule #32). If the table is changing only slowly, you can also snapshot the table hourly or daily to get reasonably close data. Note that this still doesn't completely resolve the issue.

Rule #32: Reuse code between your training pipeline and your serving pipeline whenever possible.
Batch processing is different than online processing. In online processing, you must handle each request as it arrives (e.g. you must do a separate lookup for each query), whereas in batch processing, you can combine tasks (e.g. making a join). At serving time, you are doing online processing, whereas training is a batch processing task. However, there are some things that you can do to reuse code. For example, you can create an object that is particular to your system where the result of any queries or joins can be stored in a very human-readable way, and errors can be tested easily. Then, once you have gathered all the information, during serving or training, you run a common method to bridge between the human-readable object that is specific to your system, and whatever format the machine learning system expects. This eliminates a source of training-serving skew. As a corollary, try not to use two different programming languages between training and serving; that decision will make it nearly impossible for you to share code.

Rule #33: If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.
In general, measure performance of a model on the data gathered after the data you trained the model on, as this better reflects what your system will do in production. If you produce a model based on the data until January 5th, test the model on the data from January 6th. You will expect that the performance will not be as good on the new data, but it shouldn't be radically worse. Since there might be daily effects, you might not predict the average click rate or conversion rate, but the area under the curve, which represents the likelihood of giving the positive example a score higher than a negative example, should be reasonably close.
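A minimal sketch of that evaluation, assuming each example carries an ISO date string and using scikit-learn's roc_auc_score purely for illustration:

    from sklearn.metrics import roc_auc_score

    def temporal_split(examples, cutoff):
        # Train on everything up to and including the cutoff date; evaluate on
        # everything after it, which mirrors what the model faces in production.
        train = [ex for ex in examples if ex["date"] <= cutoff]
        test = [ex for ex in examples if ex["date"] > cutoff]
        return train, test

    def next_day_auc(score_fn, test_examples):
        labels = [ex["label"] for ex in test_examples]
        scores = [score_fn(ex["features"]) for ex in test_examples]
        return roc_auc_score(labels, scores)  # should stay reasonably close to holdout AUC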

Rule #34: In binary classification for filtering (such as spam detection or determining interesting e-mails), make small short-term sacrifices in performance for very clean data.
In a filtering task, examples which are marked as negative are not shown to the user. Suppose you have a filter that blocks 75% of the negative examples at serving. You might be tempted to draw additional training data from the instances shown to users. For example, if a user marks an e-mail as spam that your filter let through, you might want to learn from that.

But this approach introduces sampling bias. You can gather cleaner data if instead during serving you label 1% of all traffic as "held out", and send all held-out examples to the user. Now your filter is blocking at least 74% of the negative examples. These held-out examples can become your training data.

Note that if your filter is blocking 95% of the negative examples or more, this becomes less viable. Even so, if you wish to measure serving performance, you can make an even tinier sample (say 0.1% or 0.001%). Ten thousand examples is enough to estimate performance quite accurately.
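A minimal sketch of the 1% held-out traffic idea; the hashing scheme, field names, and threshold are assumptions:

    import hashlib

    HOLDOUT_FRACTION = 0.01  # 1% of traffic bypasses the filter

    def is_holdout(request_id, fraction=HOLDOUT_FRACTION):
        # Deterministic hash so a given request always lands in the same bucket.
        h = int(hashlib.sha256(str(request_id).encode()).hexdigest(), 16)
        return (h % 10000) < fraction * 10000

    def filter_decision(request_id, spam_score, threshold=0.9):
        if is_holdout(request_id):
            # Shown to the user regardless of the score; the user's reaction
            # becomes unbiased, "clean" training data.
            return "deliver"
        return "block" if spam_score > threshold else "deliver"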

Rule #35: Beware of the inherent skew in ranking problems.
When you switch your ranking algorithm radically enough that different results show up, you have effectively changed the data that your algorithm is going to see in the future. This kind of skew will show up, and you should design your model around it. There are multiple different approaches. These approaches are all ways to favor data that your model has already seen.
1. Have higher regularization on features that cover more queries as opposed to those features that are on for only one query. This way, the model will favor features that are specific to one or a few queries over features that generalize to all queries. This approach can help prevent very popular results from leaking into irrelevant queries. Note that this is opposite the more conventional advice of having more regularization on feature columns with more unique values.
2. Only allow features to have positive weights. Thus, any good feature will be better than a feature that is "unknown".
3. Don't have document-only features. This is an extreme version of #1. For example, even if a given app is a popular download regardless of what the query was, you don't want to show it everywhere. Not having document-only features keeps that simple.

(The reason you don't want to show a specific popular app everywhere has to do with the importance of making all the desired apps reachable. For instance, if someone searches for "bird watching app", they might download "angry birds", but that certainly wasn't their intent. Showing such an app might improve download rate, but leave the user's needs ultimately unsatisfied.)

Rule #36: Avoid feedback loops with positional features.
The position of content dramatically affects how likely the user is to interact with it. If you put an app in the first position it will be clicked more often, and you will be convinced it is more likely to be clicked. One way to deal with this is to add positional features, i.e. features about the position of the content in the page. You train your model with positional features, and it learns to weight, for example, the feature "1st-position" heavily. Your model thus gives less weight to other factors for examples with "1st-position=true". Then at serving you don't give any instances the positional feature, or you give them all the same default feature, because you are scoring candidates before you have decided the order in which to display them.

Note that it is important to keep any positional features somewhat separate from the rest of the model because of this asymmetry between training and testing. Having the model be the sum of a function of the positional features and a function of the rest of the features is ideal. For example, don't cross the positional features with any document feature.
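A minimal sketch of that training/serving asymmetry (the feature names are illustrative):

    def training_features(base_features, shown_position):
        # At training time, record where the item was actually shown so the
        # model can absorb position bias into a dedicated positional feature.
        feats = dict(base_features)
        feats["position"] = shown_position
        return feats

    def serving_features(base_features):
        # At serving time the order is not decided yet, so every candidate
        # gets the same default position (or the feature is omitted entirely).
        feats = dict(base_features)
        feats["position"] = 0
        return feats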

Rule #37: Measure Training/Serving Skew.
There are several things that can cause skew in the most general sense. Moreover, you can divide it into several parts:
1. The difference between the performance on the training data and the holdout data. In general, this will always exist, and it is not always bad.
2. The difference between the performance on the holdout data and the "next-day" data. Again, this will always exist. You should tune your regularization to maximize the next-day performance. However, large drops in performance between holdout and next-day data may indicate that some features are time-sensitive and possibly degrading model performance.
3. The difference between the performance on the "next-day" data and the live data. If you apply a model to an example in the training data and the same example at serving, it should give you exactly the same result (see Rule #5). Thus, a discrepancy here probably indicates an engineering error.
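For item 3, a minimal sketch of the consistency check, assuming the serving-time features and scores were logged as in Rule #29:

    def score_skew(logged_records, training_score_fn, tolerance=1e-6):
        # Compare the score logged at serving time with the score the training
        # stack produces for the exact same features; any nonzero gap usually
        # points to an engineering bug rather than a modeling issue.
        mismatches = []
        for rec in logged_records:
            train_score = training_score_fn(rec["features"])
            if abs(train_score - rec["score"]) > tolerance:
                mismatches.append((rec.get("request_id"), rec["score"], train_score))
        return mismatches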

ML Phase III: Slowed Growth, Optimization Refinement, and Complex Models

There will be certain indications that the second phase is reaching a close. First of all, your monthly gains will start to diminish. You will start to have tradeoffs between metrics: you will see some rise and others fall in some experiments. This is where it gets interesting. Since the gains are harder to achieve, the machine learning has to get more sophisticated.

A caveat: this section has more blue-sky rules than earlier sections. We have seen many teams go through the happy times of Phase I and Phase II machine learning. Once Phase III has been reached, teams have to find their own path.

Rule #38: Don't waste time on new features if unaligned objectives have become the issue.
As your measurements plateau, your team will start to look at issues that are outside the scope of the objectives of your current machine learning system. As stated before, if the product goals are not covered by the existing algorithmic objective, you need to change either your objective or your product goals. For instance, you may optimize clicks, plus-ones, or downloads, but make launch decisions based in part on human raters.

Rule #39: Launch decisions are a proxy for long-term product goals.
Alice has an idea about reducing the logistic loss of predicting installs. She adds a feature. The logistic loss drops. When she does a live experiment, she sees the install rate increase. However, when she goes to a launch review meeting, someone points out that the number of daily active users drops by 5%. The team decides not to launch the model. Alice is disappointed, but now realizes that launch decisions depend on multiple criteria, only some of which can be directly optimized using ML.

The truth is that the real world is not Dungeons and Dragons: there are no "hit points" identifying the health of your product. The team has to use the statistics it gathers to try to effectively predict how good the system will be in the future. They need to care about engagement, 1-day active users (DAU), 30-day active users, revenue, and advertisers' return on investment. These metrics that are measurable in A/B tests in themselves are only a proxy for more long-term goals: satisfying users, increasing users, satisfying partners, and profit, which even then you could consider proxies for having a useful, high-quality product and a thriving company five years from now.

The only easy launch decisions are when all metrics get better (or at least do not get worse). If the team has a choice between a sophisticated machine learning algorithm and a simple heuristic, and the simple heuristic does a better job on all these metrics, it should choose the heuristic. Moreover, there is no explicit ranking of all possible metric values. Specifically, consider the following two scenarios:

Experiment    Daily Active Users    Revenue/Day
A             1 million             $4 million
B             2 million             $2 million

If the current system is A, then the team would be unlikely to switch to B. If the current system is B, then the team would be unlikely to switch to A. This seems in conflict with rational behavior; however, predictions of changing metrics may or may not pan out, and thus there is a large risk involved with either change. Each metric covers some risk with which the team is concerned.

Moreover, no metric covers the team's ultimate concern: "where is my product going to be five years from now?"

Individuals, on the other hand, tend to favor one objective that they can directly optimize. Most machine learning tools favor such an environment. An engineer banging out new features can get a steady stream of launches in such an environment. There is a type of machine learning, multi-objective learning, which starts to address this problem. For instance, one can formulate a constraint satisfaction problem that has lower bounds on each metric, and optimizes some linear combination of metrics. However, even then, not all metrics are easily framed as machine learning objectives: if a document is clicked on or an app is installed, it is because the content was shown. But it is far harder to figure out why a user visits your site. How to predict the future success of a site as a whole is AI-complete, as hard as computer vision or natural language processing.

Rule #40: Keep ensembles simple.
Unified models that take in raw features and directly rank content are the easiest models to debug and understand. However, an ensemble of models (a "model" which combines the scores of other models) can work better. To keep things simple, each model should either be an ensemble only taking the input of other models, or a base model taking many features, but not both. If you have models on top of other models that are trained separately, then combining them can result in bad behavior.

Use a simple model for ensembling that takes only the output of your "base" models as inputs. You also want to enforce properties on these ensemble models. For example, an increase in the score produced by a base model should not decrease the score of the ensemble. Also, it is best if the incoming models are semantically interpretable (for example, calibrated) so that changes of the underlying models do not confuse the ensemble model. Also, enforce that an increase in the predicted probability of an underlying classifier does not decrease the predicted probability of the ensemble.
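One simple way to get that monotonicity, sketched below, is to restrict the ensemble to non-negative weights over calibrated base-model probabilities (the weights shown are illustrative, not from any real system):

    def ensemble_score(base_probs, weights):
        # base_probs: calibrated probabilities from each base model.
        # Non-negative weights summing to at most 1 keep the output a
        # probability and guarantee that raising any base score can never
        # lower the ensemble score.
        assert all(w >= 0 for w in weights.values()), "weights must be non-negative"
        return sum(w * base_probs[name] for name, w in weights.items())

    # Illustrative use:
    # ensemble_score({"ctr_model": 0.12, "quality_model": 0.40},
    #                {"ctr_model": 0.7, "quality_model": 0.3})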

Rule #41: When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.
You've added some demographic information about the user. You've added some information about the words in the document. You have gone through template exploration, and tuned the regularization. You haven't seen a launch with more than a 1% improvement in your key metrics in a few quarters. Now what?

It is time to start building the infrastructure for radically different features, such as the history of documents that this user has accessed in the last day, week, or year, or data from a different property. Use wikidata entities or something internal to your company (such as Google's knowledge graph). Use deep learning. Start to adjust your expectations on how much return you expect on investment, and expand your efforts accordingly. As in any engineering project, you have to weigh the benefit of adding new features against the cost of increased complexity.

Rule #42: Don't expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.
Diversity in a set of content can mean many things, with the diversity of the source of the content being one of the most common. Personalization implies each user gets their own results. Relevance implies that the results for a particular query are more appropriate for that query than any other. Thus all three of these properties are defined as being different from the ordinary.

The problem is that the ordinary tends to be hard to beat.

Note that if your system is measuring clicks, time spent, watches, +1s, reshares, et cetera, you are measuring the popularity of the content. Teams sometimes try to learn a personal model with diversity. To personalize, they add features that would allow the system to personalize (some features representing the user's interest) or diversify (features indicating if this document has any features in common with other documents returned, such as author or content), and find that those features get less weight (or sometimes a different sign) than they expect.

This doesn't mean that diversity, personalization, or relevance aren't valuable. As pointed out in the previous rule, you can do post-processing to increase diversity or relevance. If you see longer-term objectives increase, then you can declare that diversity/relevance is valuable, aside from popularity. You can then either continue to use your post-processing, or directly modify the objective based upon diversity or relevance.

Rule #43: Your friends tend to be the same across different products. Your interests tend not to be.
Teams at Google have gotten a lot of traction from taking a model predicting the closeness of a connection in one product, and having it work well on another. Your friends are who they are. On the other hand, I have watched several teams struggle with personalization features across product divides. Yes, it seems like it should work. For now, it doesn't seem like it does. What has sometimes worked is using raw data from one property to predict behavior on another. Also, keep in mind that even knowing that a user has a history on another property can help. For instance, the presence of user activity on two products may be indicative in and of itself.

Related Work

There are many documents on machine learning at Google as well as externally:
- Machine Learning Crash Course: an introduction to applied machine learning.
- Machine Learning: A Probabilistic Perspective by Kevin Murphy, for an understanding of the field of machine learning.
- Practical Advice for the Analysis of Large, Complex Data Sets: a data science approach to thinking about data sets.
- Deep Learning by Ian Goodfellow et al., for learning nonlinear models.
- The Google paper on technical debt, which has a lot of general advice.
- TensorFlow documentation.

Acknowledgements

Thanks to David Westbrook, Peter Brandt, Samuel Ieong, Chenyu Zhao, Li Wei, Michalis Potamias, Evan Rosen, Barry Rosenberg, Christine Robson, James Pine, Tal Shaked, Tushar Chandra, Mustafa Ispir, Jeremiah Harmsen, Konstantinos Katsiapis, Glen Anderson, Dan Duckworth, Shishir Birmiwal, Gal Elidan, Su Lin Wu, Jaihui Liu, Fernando Pereira, and Hrishikesh Aradhye for many corrections, suggestions, and helpful examples for this document. Also, thanks to Kristen Lefevre, Suddha Basu, and Chris Berg who helped with an earlier version. Any errors, omissions, or (gasp!) unpopular opinions are my own.

Appendix

There are a variety of references to Google products in this document. To provide more context, I give a short description of the most common examples below.

YouTube Overview
YouTube is a streaming video service. Both the YouTube Watch Next and YouTube Home Page teams use ML models to rank video recommendations. Watch Next recommends videos to watch after the currently playing one, while Home Page recommends videos to users browsing the home page.

Google Play Overview
Google Play has many models solving a variety of problems. Play Search, Play Home Page Personalized Recommendations, and Users Also Installed apps all use machine learning.

Google Plus Overview
Google Plus uses machine learning in a variety of situations: ranking posts in the "stream" of posts being seen by the user, ranking "What's Hot" posts (posts that are very popular now), ranking people you know, et cetera.
