Decision tree algorithm - mascotasohana

Introduction

A decision tree is a basic classification and regression method. The decision tree model has a tree structure. In a classification problem, it represents the process of classifying instances based on features. It can be viewed as a collection of if-then rules, or as a conditional probability distribution defined over the feature space and the class space.

The main advantages are that the model is readable and classification is fast. During learning, the training data is used to build a decision tree model according to the principle of minimizing a loss function. During prediction, the decision tree model is used to classify new data.

Decision tree learning usually includes three steps: feature selection, decision tree generation, and decision tree pruning.

Decision tree learning

Goal: Build a decision tree model from a given training data set so that it classifies instances correctly. Decision tree learning essentially induces a set of classification rules from the training data set. There may be several decision trees that classify the training data correctly, or none. When choosing a decision tree, prefer one that contradicts the training data as little as possible and generalizes well; viewed as a conditional probability model, the selected model should not only fit the training data well but also predict unseen data well.

Loss function: usually a regularized maximum likelihood function.

Strategy: minimize the objective function, where the objective function is the loss function.
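The text does not give a formula for the regularized loss. One common way to write it, assumed here for illustration, is the cost-complexity form used in tree pruning, where C(T) measures how badly the tree T fits the training data, |T| is the number of leaf nodes, and α ≥ 0 is an assumed trade-off parameter:

```latex
C_\alpha(T) = C(T) + \alpha\,|T|
```

Under this form, a larger α favors smaller trees, while α = 0 considers only the fit to the training data.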

Because selecting the optimal decision tree from all possible decision trees is an NP-complete problem, in practice decision tree learning usually uses heuristic methods to solve this optimization problem approximately, and the resulting decision tree is sub-optimal.

Decision tree learning algorithms usually select the optimal feature recursively and split the training data according to that feature, so that each subset is classified as well as possible. The process includes feature selection, decision tree generation, and decision tree pruning.

Pruning:

Purpose: To make the tree simpler, so that it generalizes better.

Steps: Remove leaf nodes that are too finely subdivided so that their samples fall back to the parent node, or even to a higher node, and then turn that parent or higher node into a new leaf node.

Decision tree generation corresponds to local model selection, and decision tree pruning corresponds to global model selection. Generation considers only the local optimum, whereas pruning considers the global optimum.
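The pruning steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the loss is taken to be the sum over leaves of N_t * H_t (leaf sample count times empirical entropy) plus alpha times the number of leaves, and the dict-based node layout with "counts" and "children" keys is hypothetical.

```python
from math import log2

def leaf_cost(counts):
    """N_t * H_t: empirical entropy of a node weighted by its sample count."""
    n = sum(counts.values())
    return -sum(c * log2(c / n) for c in counts.values() if c > 0)

def prune(node, alpha):
    """Prune bottom-up; return (pruned_node, loss).

    Assumed loss: sum of leaf costs + alpha * number of leaves.
    Each node is a dict with the class counts of its training samples
    and a list of child nodes (empty for a leaf).
    """
    if not node["children"]:
        return node, leaf_cost(node["counts"]) + alpha
    pruned = [prune(child, alpha) for child in node["children"]]
    subtree_loss = sum(loss for _, loss in pruned)
    as_leaf_loss = leaf_cost(node["counts"]) + alpha
    # Collapse the subtree into a single leaf if that does not raise the loss.
    if as_leaf_loss <= subtree_loss:
        return {"counts": node["counts"], "children": []}, as_leaf_loss
    kept = {"counts": node["counts"], "children": [n for n, _ in pruned]}
    return kept, subtree_loss

# Toy tree (made-up counts): the split isolates a single 'b' sample.
toy_tree = {
    "counts": {"a": 9, "b": 1},
    "children": [
        {"counts": {"a": 9}, "children": []},
        {"counts": {"b": 1}, "children": []},
    ],
}
lightly_pruned, _ = prune(toy_tree, alpha=1.0)   # split is kept
heavily_pruned, _ = prune(toy_tree, alpha=5.0)   # split is collapsed
```

With a small alpha the split survives because it removes all impurity; with a large alpha the penalty for the extra leaf outweighs that gain, and the parent becomes a leaf.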

Feature selection:

If the number of features is large, select features at the start of decision tree learning, keeping only those that have sufficient classification ability for the training data. (For example, a name is not selected as a feature.)

Decision tree algorithm

Typical algorithms

Typical decision tree algorithms are ID3, C4.5, CART, etc.

The international authoritative academic organization ICDM (the IEEE International Conference on Data Mining) selected the top ten classic algorithms in the field of data mining in December 2006, with the C4.5 algorithm ranking first. C4.5 is a classification decision tree algorithm in machine learning, and its core algorithm is ID3. The classification rules generated by C4.5 are easy to understand and have high accuracy. However, while constructing the tree, the data set must be scanned and sorted multiple times, which makes the algorithm inefficient in practical applications.
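The text does not state how C4.5 differs from ID3. C4.5 is commonly described as replacing ID3's information gain with the gain ratio, which divides the gain by the entropy of the feature itself. The sketch below illustrates that idea on made-up data; the function names and the toy features are assumptions for illustration only.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on the feature `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain divided by the entropy of the feature's own values."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

# Toy data (made up): 'name' is unique per row, 'windy' is a genuine feature.
labels = ["yes", "yes", "no", "no"]
name = ["a", "b", "c", "d"]
windy = [False, False, True, True]

print(information_gain(name, labels), information_gain(windy, labels))  # 1.0 1.0
print(gain_ratio(name, labels), gain_ratio(windy, labels))              # 0.5 1.0
```

Both features separate the classes perfectly on this sample, so information gain cannot tell them apart, but the gain ratio penalizes the identifier-like `name` feature because it fragments the data into many tiny subsets.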

The advantages of the decision tree algorithm are as follows:

(1) High classification accuracy;

(2) The generated model is simple;

(3) It is robust to noisy data.

Therefore, it is one of the most widely used inductive reasoning algorithms at present, and it has received extensive attention from researchers in data mining.

Basic idea

1) The tree starts with a single node representing the training samples.

2) If the samples all belong to the same category, the node becomes a leaf and is labeled with that category.

3) Otherwise, the algorithm selects the attribute with the greatest classification ability as the current node of the decision tree.

4) According to the different values of the attribute at the current decision node, the training sample set is divided into several subsets. Each value forms a branch, so several values form several branches. For each subset obtained in this step, the previous steps are repeated recursively to form a decision tree on each partition of the samples. Once an attribute appears at a node, it need not be considered in any descendant of that node.

5) The recursive partitioning step stops only when one of the following conditions is true:

① All samples at a given node belong to the same category.

② There are no remaining attributes that can be used to further divide the samples. In this case, majority voting is used to convert the node into a leaf, labeled with the category that has the largest number of tuples in the sample. The category distribution of the node's samples can also be stored.

③ A branch has no samples for its attribute value. In this case, a leaf is created with the majority class of the samples.
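The steps above can be sketched as a short recursive procedure. This is a minimal ID3-style sketch, assuming that "classification ability" is measured by information gain (the text does not name a criterion) and that samples are dicts of categorical attribute values; all names and the toy data are hypothetical. Because branches are created only for values seen in the training subset, the empty-branch case is handled at prediction time by falling back to the majority class stored at the node.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on attribute `attr`."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def build_tree(rows, labels, attrs):
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:      # all samples share one category: leaf
        return labels[0]
    if not attrs:                  # no attributes left: majority-vote leaf
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]   # never reuse in descendants
    node = {"attr": best, "default": majority, "branches": {}}
    for value in sorted({row[best] for row in rows}):   # one branch per value
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return node

def predict(node, row):
    while isinstance(node, dict):
        # A value with no training samples falls back to the node's majority class.
        node = node["branches"].get(row[node["attr"]], node["default"])
    return node

# Toy data (made up): the label depends only on 'outlook'.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
```

On this toy data the root splits on `outlook`, since `windy` carries no information about the label, and both branches become leaves immediately.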

Construction method

The input to decision tree construction is a set of examples with category labels, and the result is a binary tree or a multiway tree. An internal (non-leaf) node of a binary tree is generally expressed as a logical test, such as a test of the form a = aj, where a is an attribute and aj is one of the values that the attribute can take; the edges of the tree are the outcomes of the logical test. An internal node of a multiway tree (ID3) is an attribute, and the edges are all the values of that attribute: there are as many edges as there are attribute values. The leaf nodes of the tree are all category labels.
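The two node shapes described above can be made concrete with a small data-structure sketch. The class names, field names, and the traffic-light example are hypothetical and chosen only for illustration; the text does not prescribe any particular representation.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class Leaf:
    label: str                      # leaf nodes carry a category label

@dataclass
class BinaryNode:
    attr: str                       # internal node is the test  attr == value
    value: str
    if_true: "Tree"                 # edge taken when the test holds
    if_false: "Tree"                # edge taken otherwise

@dataclass
class MultiwayNode:
    attr: str                       # internal node is an attribute
    branches: Dict[str, "Tree"] = field(default_factory=dict)  # one edge per value

Tree = Union[Leaf, BinaryNode, MultiwayNode]

def classify(tree, sample):
    """Follow edges from the root until a leaf is reached."""
    while not isinstance(tree, Leaf):
        if isinstance(tree, BinaryNode):
            tree = tree.if_true if sample[tree.attr] == tree.value else tree.if_false
        else:
            tree = tree.branches[sample[tree.attr]]
    return tree.label

# The same rule expressed both ways (made-up example).
binary_tree = BinaryNode("color", "red", Leaf("stop"),
                         BinaryNode("color", "green", Leaf("go"), Leaf("wait")))
multiway_tree = MultiwayNode("color", {"red": Leaf("stop"),
                                       "green": Leaf("go"),
                                       "yellow": Leaf("wait")})
```

The binary form needs two chained tests to cover three values, whereas the multiway form covers them with a single node that has three edges.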

Because of improper data representation, noise, or duplicate subtrees produced during generation, the resulting decision tree can be too large. Simplifying the decision tree is therefore an indispensable step. To find an optimal decision tree, the following three optimization problems should be addressed: ① generate the fewest leaf nodes; ② make the depth of each leaf node as small as possible; ③ generate a decision tree that has the fewest leaf nodes and in which the depth of each leaf node is as small as possible.

Classification and regression tree model

The classification and regression tree (CART) model also consists of feature selection, tree generation, and pruning, and can be used for both classification and regression.

The CART algorithm consists of the following two steps:

(1) Decision tree generation: Generate a decision tree from the training data set; the generated tree should be as large as possible.

(2) Decision tree pruning: Prune the generated tree using the validation data set and select the optimal subtree, with minimization of the loss function as the pruning criterion.
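The text does not say how CART chooses a split during generation. CART classification trees are usually described as growing binary splits that minimize the Gini index, and the sketch below illustrates that for a single numeric feature. The function names and the toy data are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(xs, ys):
    """Return (threshold, weighted_gini) for the split  x <= threshold
    that minimizes the weighted Gini index of the two sides."""
    n = len(ys)
    best = (None, float("inf"))
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (made up): the classes separate cleanly between 2.0 and 3.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = ["a", "a", "b", "b"]
threshold, score = best_binary_split(xs, ys)
print(threshold, score)  # 2.5 0.0
```

Applying this search recursively to each side until the nodes are pure yields the large tree of step (1), which step (2) then prunes back using the validation data.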