Clustering

Typical applications

"What are typical applications of clustering?" In business, clustering can help market analysts identify distinct customer groups within the customer base and characterize each group by its purchasing patterns. In biology, clustering can be used to derive classifications of plants and animals, to classify genes, and to gain insight into the inherent structure of populations. Clustering can also help identify similar areas in an Earth observation database, group car insurance policyholders, and group the houses in a city by type, value, and geographic location. It can also be used to classify documents on the Web for information discovery.

Typical requirements

Scalability: Many clustering algorithms work well on small data sets of fewer than 200 data objects; however, a large database may contain millions of objects, and clustering only a sample of such a large data set may lead to biased results. Highly scalable clustering algorithms are needed.

The ability to handle different types of data: Many algorithms are designed to cluster numerical data. However, applications may require clustering other types of data, such as binary, categorical (nominal), or ordinal data, or a mixture of these data types.

Discovery of clusters of arbitrary shape: Many clustering algorithms determine clusters using Euclidean or Manhattan distance measures. Algorithms based on such measures tend to find spherical clusters of similar size and density. However, a cluster may have any shape, so it is important to develop algorithms that can find clusters of arbitrary shape.

Minimal domain knowledge required to determine input parameters: Many clustering algorithms require the user to supply certain parameters for the cluster analysis, such as the desired number of clusters. The clustering result is very sensitive to these parameters, which are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users but also makes the quality of the clustering difficult to control.

The ability to deal with "noisy" data: Most real-world databases contain outliers, missing values, or erroneous data. Some clustering algorithms are sensitive to such data and may produce low-quality clustering results.

Insensitivity to the order of input records: Some clustering algorithms are sensitive to the order of the input data. For example, the same data set presented to the same algorithm in a different order may produce very different clustering results. It is important to develop algorithms that are insensitive to the order of the input.

High dimensionality: A database or data warehouse may contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, which may involve only two or three dimensions. The human eye can judge the quality of a clustering well in at most three dimensions. Clustering data objects in a high-dimensional space is very challenging, especially considering that such data may be very sparsely distributed and highly skewed.

Constraint-based clustering: Real-world applications may require clustering under various constraints. Suppose your job is to choose locations for a given number of ATMs in a city. To decide, you can cluster residential areas while taking into account, for example, the city's rivers and road network, the customer requirements in each area, and so on. Finding a grouping of the data that both satisfies the specified constraints and has good clustering properties is a challenging task.

Interpretability and usability: Users want clustering results to be interpretable, understandable, and usable. In other words, clustering may need to be tied to specific semantic interpretations and applications. How application goals affect the choice of clustering method is also an important research topic.
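
As an illustration of the requirement above on handling different types of data, the following sketch computes a dissimilarity between two records that mix numeric and categorical attributes. It assumes a simple Gower-style average of per-attribute dissimilarities, which is one common choice and not something this article prescribes; the function name mixed_dissimilarity and the parameters kinds and ranges are hypothetical.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average per-attribute dissimilarity between records a and b.

    kinds[i] is "numeric" or "categorical" (binary values count as
    categorical). ranges[i] is the value range (max - min) of attribute i
    and is used only for numeric attributes.
    """
    total = 0.0
    for x, y, kind, value_range in zip(a, b, kinds, ranges):
        if kind == "numeric":
            # Scale the absolute difference into [0, 1] by the attribute range.
            total += abs(x - y) / value_range if value_range else 0.0
        else:
            # Simple mismatch: 0 if the values agree, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Hypothetical customer records: (age, favourite colour, has a loyalty card).
kinds = ("numeric", "categorical", "categorical")
ranges = (20, None, None)  # assumed age range of 20 years in this toy data set
d = mixed_dissimilarity((30, "red", True), (40, "blue", True), kinds, ranges)
print(d)
```

An ordinal attribute could be handled in the same sketch by replacing each value with its rank and treating it as numeric.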

Calculation methods

The traditional calculation methods for cluster analysis are mainly the following:

1. Partitioning methods

Given a data set of N tuples or records, a partitioning method constructs K groups, each of which represents a cluster. Typically each group contains at least one record and each record belongs to exactly one group (a requirement that some fuzzy clustering algorithms relax). For a given K, the algorithm first produces an initial grouping and then changes it through repeated iterations, so that each revised grouping is better than the previous one. A grouping is considered good when records in the same group are as close together as possible and records in different groups are as far apart as possible. Algorithms that use this basic idea include K-MEANS, K-MEDOIDS, and CLARANS.

Most partitioning methods are distance-based. Given k, the number of partitions to construct, a partitioning method first creates an initial partition. It then uses an iterative relocation technique that tries to improve the partition by moving objects from one group to another. The general criterion for a good partition is that objects in the same cluster are as close or as related to each other as possible, while objects in different clusters are as far apart or as different as possible. There are many other criteria for judging the quality of a partition. Traditional partitioning methods can be extended to subspace clustering instead of searching the entire data space, which is useful when there are many attributes and the data are sparse. Reaching the global optimum may require enumerating all possible partitions, which is computationally prohibitive. In practice, most applications use popular heuristic methods, such as the k-means and k-medoids algorithms, which progressively improve the clustering quality and converge to a local optimum. These heuristic methods are well suited to finding spherical clusters in small and medium-sized databases. To find clusters with complex shapes and to cluster very large data sets, partition-based methods need to be extended further.
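
The iterative relocation idea described above can be sketched with a minimal k-means in plain Python. This is an illustrative sketch rather than a reference implementation: it assumes points are tuples of numbers, uses squared Euclidean distance, and picks the initial centers at random from the data; the names kmeans and sq_dist are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Return (labels, centers) after iterative relocation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial partition: k random data points
    labels = None
    for _ in range(iterations):
        # Assignment step: move every object to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # no object changed group, so a local optimum was reached
        labels = new_labels
        # Update step: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Because the result depends on the initial centers, the procedure only reaches a local optimum; running it with several seeds and keeping the best result is a common workaround.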

2. Hierarchical methods

This method decomposes a given data set hierarchically until a certain condition is met. It comes in two schemes, "bottom-up" and "top-down". In the "bottom-up" scheme, for example, each data record initially forms a separate group; in each subsequent iteration, neighboring groups are merged into one, until all records form a single group or a certain condition is met. Representative algorithms include BIRCH, CURE, and CHAMELEON.

Hierarchical clustering methods can be based on distance, density, or connectivity. Some extensions of hierarchical clustering also consider subspace clustering. The disadvantage of the hierarchical approach is that once a step (a merge or a split) has been completed, it cannot be undone. This rigidity is useful in that it keeps the computational overhead low, because the method does not have to consider a combinatorial number of alternative choices. However, it also means that wrong decisions cannot be corrected. Several methods have been proposed to improve the quality of hierarchical clustering.
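
A minimal sketch of the "bottom-up" scheme follows. It assumes single linkage (the distance between two groups is the distance between their closest members) and stops when a requested number of groups remains; both choices are assumptions made for illustration, and the function name agglomerative is hypothetical. Note that a merge, once made, is never revisited.

```python
import math


def agglomerative(points, target_clusters):
    """Bottom-up clustering; returns a list of clusters of point indices."""
    # Every record starts in its own group.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > target_clusters:
        # Find the closest pair of groups among all current groups.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Merge group j into group i; this step cannot be undone later.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0,), (1,), (10,), (11,), (50,)]
print(agglomerative(points, target_clusters=3))
```

This naive version recomputes all pairwise group distances in every iteration, so it is only practical for small data sets.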

3. Density-based methods

A fundamental difference between density-based methods and the other methods is that they group objects by density rather than by distance alone. This overcomes the shortcoming of distance-based algorithms, which can only find "quasi-circular" clusters. The guiding idea is that as long as the density of points in a region exceeds a certain threshold, those points are added to the nearby cluster. Representative algorithms include DBSCAN, OPTICS, and DENCLUE.
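
The density threshold idea can be sketched with a simplified DBSCAN-style procedure. This is a minimal sketch under stated assumptions, not the published algorithm in full detail: the neighborhood radius eps and the threshold min_pts are user-chosen parameters, neighbors are found by brute force, and the names dbscan and NOISE are chosen here for illustration.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster number, or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not dense enough; may later become a border point
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # dense point: keep growing the cluster
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

Because clusters grow through chains of dense neighborhoods, they can take any shape, and isolated points are reported as noise instead of being forced into a cluster.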

4. Grid-based methods

This method first divides the data space into a grid structure with a finite number of cells, and all subsequent processing is carried out on the individual cells. A notable advantage is that processing is very fast: the running time usually does not depend on the number of records in the target database, only on the number of cells into which the data space is divided. Representative algorithms include STING, CLIQUE, and WAVE-CLUSTER.

For many spatial data mining problems, using a grid is often an effective approach. Grid-based methods can therefore be integrated with other clustering methods.
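
The following sketch illustrates the grid idea in its simplest form. It is not STING, CLIQUE, or WAVE-CLUSTER; it assumes equal-width cells, treats a cell as dense when it holds at least min_count points, and joins adjacent dense cells into one cluster. The function name grid_cluster and its parameters are hypothetical.

```python
from collections import Counter, deque
from itertools import product


def grid_cluster(points, cell_size, min_count):
    """Return one label per point; -1 marks points in sparse cells."""
    # Map each point to the integer coordinates of its grid cell.
    cell_of = [tuple(int(v // cell_size) for v in p) for p in points]
    counts = Counter(cell_of)
    dense = {cell for cell, n in counts.items() if n >= min_count}

    # Offsets to all adjacent cells (including diagonals), in any dimension.
    dim = len(points[0])
    offsets = [o for o in product((-1, 0, 1), repeat=dim) if any(o)]

    # From here on, the work depends on the number of cells, not of points.
    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for off in offsets:
                nb = tuple(c + d for c, d in zip(cell, off))
                if nb in dense and nb not in cell_label:
                    cell_label[nb] = next_label
                    queue.append(nb)
        next_label += 1
    return [cell_label.get(cell, -1) for cell in cell_of]


points = [(0.1, 0.1), (0.2, 0.3), (0.8, 0.9), (1.2, 0.5), (1.4, 0.6),
          (5.1, 5.2), (5.3, 5.4), (9.0, 9.0)]
print(grid_cluster(points, cell_size=1, min_count=2))
```

The result is sensitive to the cell size: cells that are too large merge distinct clusters, and cells that are too small leave every cell sparse.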

5. Model-based methods

Model-based methods assume a model for each cluster and then look for the data that fit that model well. Such a model may be the density distribution function of the data points in space, or something else. An underlying assumption is that the target data set is generated by a series of probability distributions. There are usually two approaches: statistical schemes and neural network schemes.
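
As one example of a statistical scheme, the sketch below fits a mixture of two one-dimensional Gaussian distributions with the expectation-maximization (EM) procedure. The choice of Gaussians, of exactly two components, and of the initialization (means at the smallest and largest value) are assumptions made for illustration; the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, iterations=50, var_floor=1e-6):
    """Fit a two-component 1-D Gaussian mixture; returns (weights, means, variances)."""
    means = [min(data), max(data)]
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    variances = [max(overall_var, var_floor)] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each point was generated by each component.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue  # component received no weight; leave it unchanged
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, var_floor)  # floor avoids a zero variance
    return weights, means, variances


data = [0.9, 1.0, 1.1, 1.0, 4.9, 5.0, 5.1, 5.0]
print(em_two_gaussians(data))
```

Each point can then be assigned to the component with the highest responsibility, which turns the fitted distributions into clusters.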

Of course, there are other clustering methods as well: the transitive closure method, the Boolean matrix method, the direct clustering method, correlation-analysis clustering, statistics-based clustering methods, and so on.

Research status

Traditional clustering has dealt fairly successfully with low-dimensional data. However, because of the complexity of data in practical applications, existing algorithms often fail on many problems, especially on high-dimensional and large-scale data. Traditional clustering methods run into two main problems when clustering high-dimensional data sets. ① A high-dimensional data set contains a large number of irrelevant attributes, which makes the chance that clusters exist across all dimensions almost zero. ② Data are more sparsely distributed in a high-dimensional space than in a low-dimensional one, and it is common for the distances between data points to be almost equal. Traditional clustering methods cluster by distance, so in high-dimensional space distance no longer provides a usable basis for constructing clusters.
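
The second problem, that distances become almost equal, can be observed with a small experiment. The sketch below assumes points drawn uniformly at random from the unit hypercube and measures the relative contrast between the farthest and the nearest point from a random query point; the function name relative_contrast and the sample size are arbitrary choices made for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 500):
    print(dim, relative_contrast(dim))
```

When the contrast is close to zero, the nearest and the farthest neighbor are nearly indistinguishable, which is why purely distance-based grouping loses its meaning in such spaces.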

High-dimensional cluster analysis has become an important research direction in cluster analysis, and clustering high-dimensional data is also one of the hard problems of clustering technology. As technology advances, collecting data has become easier and easier, leading to ever larger and more complex databases, such as trade transaction data, Web documents, and gene expression data, whose dimensions (attributes) can reach hundreds or thousands, or even more. However, because of the "curse of dimensionality", many clustering methods that perform well in low-dimensional data spaces often fail to produce good results in high-dimensional spaces. Cluster analysis of high-dimensional data remains a very active and challenging field, with a wide range of applications in market analysis, information security, finance, entertainment, and anti-terrorism.
