Typical applications
"What are typical applications of clustering?" In business, clustering can help market analysts discover distinct customer groups within a customer base and characterize each group by its purchasing patterns. In biology, clustering can be used to derive taxonomies of plants and animals, to classify genes, and to gain insight into the inherent structure of populations. Clustering can also help identify similar areas in an Earth observation database, group car insurance policyholders, and group the houses in a city by type, value, and geographic location. It can also be used to classify documents on the Web for information discovery.
Typical requirements
Scalability: Many clustering algorithms work well on small data sets of fewer than 200 data objects; however, a large database may contain millions of objects, and clustering only a sample of such a large data set may lead to biased results. Highly scalable clustering algorithms are needed.
The ability to handle different types of data: Many algorithms are designed to cluster numerical data. However, applications may require clustering other types of data, such as binary, categorical (nominal), or ordinal data, or a mixture of these data types.
Discovery of clusters of arbitrary shape: Many clustering algorithms determine clusters using Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters of similar size and density. However, a cluster may have any shape, so it is important to develop algorithms that can find clusters of arbitrary shape.
Minimal domain knowledge needed to determine input parameters: Many clustering algorithms require the user to supply certain parameters for the cluster analysis, such as the desired number of clusters. The clustering result can be very sensitive to these parameters, which are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users but also makes the quality of the clustering difficult to control.
The ability to deal with "noisy" data: Most real-world databases contain outliers and missing or erroneous data. Some clustering algorithms are sensitive to such data and may produce low-quality clustering results.
Insensitivity to the order of input records: Some clustering algorithms are sensitive to the order of the input data. For example, when the same data set is presented to the same algorithm in a different order, the algorithm may generate very different clustering results. It is important to develop algorithms that are insensitive to the order of the input.
High dimensionality: A database or data warehouse may contain many dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, which may involve only two or three dimensions. The human eye can judge the quality of a clustering well only in up to three dimensions. Clustering data objects in a high-dimensional space is very challenging, especially considering that such data may be very sparsely distributed and highly skewed.
Constraint-based clustering: Real-world applications may require clustering under various constraints. Suppose your job is to choose locations for a given number of ATMs in a city. To make a decision, you can cluster residential areas while taking into account constraints such as the city's rivers and road network and the customer requirements in each area. It is a challenging task to find groups of data that both satisfy the specified constraints and have good clustering characteristics.
Interpretability and usability: Users want clustering results to be interpretable, understandable, and usable. In other words, clustering may need to be tied to specific semantic interpretations and applications. How application goals affect the choice of clustering method is also an important research topic.
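To make the requirement about handling different types of data more concrete, the following is a minimal sketch of one possible dissimilarity measure for records that mix numeric and categorical attributes. It is an illustration only: the function name `mixed_dissimilarity`, the attribute names, and the value ranges are all assumptions introduced here, and the averaging scheme (range-scaled difference for numeric attributes, simple mismatch for categorical ones) is just one of several reasonable choices.

```python
def mixed_dissimilarity(a, b, numeric_ranges):
    """Average per-attribute dissimilarity between two records (dicts).

    numeric_ranges maps each numeric attribute name to its value range
    (max - min over the data set); every other attribute is treated as
    categorical. All names here are hypothetical.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            # Numeric attribute: absolute difference scaled by the range.
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            # Categorical attribute: 0 if equal, 1 if different.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# Hypothetical customer records with two numeric and one categorical attribute.
customer_a = {"age": 30, "income": 50000, "region": "north"}
customer_b = {"age": 40, "income": 50000, "region": "south"}
ranges = {"age": 50, "income": 100000}

d = mixed_dissimilarity(customer_a, customer_b, ranges)
print(round(d, 3))  # (0.2 + 0.0 + 1.0) / 3, approximately 0.4
```

A measure like this can be plugged into any clustering method that only needs pairwise dissimilarities, which is one way to cluster data that is not purely numerical.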
Calculation methods
The main traditional calculation methods for cluster analysis are as follows:
1. Partitioning methods
Given a data set with N tuples or records, a partitioning method constructs K groups, each of which represents a cluster. Typically each record belongs to exactly one group, although this requirement can be relaxed in some clustering algorithms. For a given K, the algorithm first produces an initial grouping and then changes the grouping through repeated iterations, so that each revised grouping is better than the previous one. The criterion for "better" is that records in the same group should be as close together as possible, while records in different groups should be as far apart as possible. Algorithms that use this basic idea include the K-MEANS, K-MEDOIDS, and CLARANS algorithms.
Most partitioning methods are distance-based. Given k, the number of partitions to construct, a partitioning method first creates an initial partition. It then uses an iterative relocation technique that improves the partition by moving objects from one group to another. The general criterion for a good partition is that objects in the same cluster are as close or as related to each other as possible, while objects in different clusters are as far apart or as different as possible. There are many other criteria for judging the quality of a partition. Traditional partitioning methods can be extended to subspace clustering rather than searching the entire data space, which is useful when there are many attributes and the data are sparse. Reaching the global optimum with partition-based clustering may require enumerating all possible partitions, which is computationally prohibitive. In practice, most applications use popular heuristic methods, such as the k-means and k-medoids algorithms, which progressively improve the clustering quality and approach a local optimum. These heuristic clustering methods work well for finding spherical clusters in small to medium-sized databases. To find clusters with complex shapes and to cluster very large data sets, partition-based methods need to be extended further.
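The iterative relocation idea described above can be sketched with a small k-means loop. This is a minimal illustration rather than a production implementation: the function names, the sample points, and the choice of passing the initial centers explicitly (instead of picking them at random) are assumptions made here so that the example is deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centers, max_iter=100):
    """Alternate assignment and center-update steps until labels stop changing."""
    centers = list(initial_centers)
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        # Relocation step: assign every object to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no object moved, so a local optimum has been reached
        labels = new_labels
        # Update step: move each center to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = mean_point(members)
    return labels, centers


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = k_means(points, initial_centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the loop stops as soon as no object changes group, the result depends on the initial centers; a different starting partition can lead to a different local optimum, which is the heuristic behaviour the text describes.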
2. Hierarchical methods
This method decomposes a given data set hierarchically until a certain condition is met. It can follow either a "bottom-up" or a "top-down" scheme. For example, in the "bottom-up" scheme, each data record initially forms a separate group. In each subsequent iteration, neighboring groups are merged into one group, until all the records form a single group or a certain condition is met. Representative algorithms include the BIRCH, CURE, and CHAMELEON algorithms.
Hierarchical clustering methods can be based on distance, density, or connectivity. Some extensions of hierarchical clustering also consider subspace clustering. The disadvantage of the hierarchical approach is that once a step (a merge or a split) has been completed, it cannot be undone. This strict rule is useful because there is no need to consider the combinatorial number of alternative choices, so the computational overhead is smaller. However, it also means that wrong decisions cannot be corrected. Several methods have been proposed to improve the quality of hierarchical clustering.
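The "bottom-up" scheme can be illustrated with a short sketch. The version below uses single linkage (the distance between two groups is the distance between their closest members) and stops when a target number of groups remains. Both choices, along with the function name and sample points, are assumptions made for illustration; this is not BIRCH, CURE, or CHAMELEON.

```python
import math


def single_linkage(points, target_clusters):
    """Bottom-up merging: start with singletons, repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(points))]  # each record starts as its own group
    merges = []  # history of merges; a completed merge is never undone

    def cluster_distance(a, b):
        # Single linkage: smallest distance between any two members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > target_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = single_linkage(points, target_clusters=3)
print(sorted(sorted(c) for c in clusters))  # [[0, 1], [2, 3], [4]]
```

The `merges` list records each step in order. Since the loop only ever appends to it, the sketch also shows why an early wrong merge cannot be corrected later.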
3. Density-based methods
A fundamental difference between density-based methods and other methods is that they define clusters by density rather than by distance alone. This overcomes the shortcoming of distance-based algorithms, which tend to find only roughly spherical clusters. The guiding idea is that as long as the density of points in a region exceeds a certain threshold, those points are added to the nearby cluster. Representative algorithms include the DBSCAN, OPTICS, and DENCLUE algorithms.
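The density threshold idea can be sketched with a simplified DBSCAN-style procedure. The parameter names `eps` (neighborhood radius) and `min_pts` (minimum number of points in a neighborhood, counting the point itself) and the sample data are assumptions for illustration. The brute-force neighbor search is chosen for clarity, not efficiency.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Grow clusters from points whose eps-neighborhood holds at least min_pts points."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # dense enough: keep expanding
    return labels


# A chain-shaped group, a compact group, and one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
          (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

The elongated chain ends up as a single cluster even though it is not spherical, and the isolated point is labeled as noise rather than forced into a cluster.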
4. Grid-based methods
This method first divides the data space into a grid structure with a finite number of cells, and all processing is then performed on individual cells. A notable advantage is that processing is very fast: the time is usually independent of the number of records in the target database and depends only on the number of cells into which the data space is divided. Representative algorithms include the STING, CLIQUE, and WAVE-CLUSTER algorithms.
For many spatial data mining problems, using a grid is often an effective approach. Therefore, grid-based methods can be integrated with other clustering methods.
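As a rough illustration of cell-based processing, the sketch below bins two-dimensional points into square cells, keeps the cells that contain enough points, and joins neighboring dense cells into clusters. It is a generic example, not an implementation of STING, CLIQUE, or WAVE-CLUSTER. The names `cell_size` and `min_count` and the sample data are assumptions.

```python
from collections import Counter


def cell_of(point, cell_size):
    """Grid cell (column, row) that contains a 2-D point."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def grid_cluster(points, cell_size, min_count):
    """Label points by the connected group of dense cells they fall into."""
    # One pass over the records: count how many points land in each cell.
    counts = Counter(cell_of(p, cell_size) for p in points)
    dense = {cell for cell, n in counts.items() if n >= min_count}

    # From here on, only cells are processed, not individual records.
    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = next_label
        stack = [start]
        while stack:
            cx, cy = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in cell_label:
                        cell_label[nb] = next_label
                        stack.append(nb)
        next_label += 1

    # Points in sparse cells get the label -1.
    return [cell_label.get(cell_of(p, cell_size), -1) for p in points]


points = [(0.1, 0.1), (0.4, 0.3), (1.2, 0.5), (1.6, 0.9),
          (5.1, 5.2), (5.5, 5.8), (9.0, 1.0)]
labels = grid_cluster(points, cell_size=1.0, min_count=2)
print(labels)  # [0, 0, 0, 0, 1, 1, -1]
```

After the initial counting pass, the cost of joining cells depends on the number of occupied cells rather than on the number of records, which mirrors the speed advantage described above.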
5. Model-based methods
Model-based methods assume a model for each cluster and then look for the data that best fit that model. Such a model may be a density distribution function of the data points in space, or something else. An underlying assumption is that the target data set is generated by a set of probability distributions. There are usually two approaches: statistical schemes and neural network schemes.
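One statistical instance of this idea is to assume that each cluster follows a Gaussian distribution and to fit the parameters with the expectation-maximization (EM) procedure. The sketch below does this for one-dimensional data and exactly two clusters. The two-component restriction, the initialization at the smallest and largest values, the variance floor, and all names are assumptions made to keep the example short.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with a basic EM loop."""
    means = [min(data), max(data)]  # simple deterministic starting point
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: probability that each point was generated by each component.
        resp = []
        for x in data:
            dens = [weights[k] * gaussian_pdf(x, means[k], variances[k])
                    for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a degenerate component
            weights[k] = nk / len(data)
    return means, variances, weights, resp


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, resp = fit_two_gaussians(data)
labels = [0 if r[0] >= r[1] else 1 for r in resp]
print([round(m, 2) for m in means])  # [1.0, 5.0]
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Unlike the partitioning sketch, each point receives a probability of belonging to each cluster. The hard labels are obtained only at the end by picking the more probable component.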
Of course, there are other clustering methods as well: the transitive closure method, the Boolean matrix method, the direct clustering method, correlation analysis clustering, statistics-based clustering, and so on.
Research situation
Traditional clustering has been fairly successful at solving the clustering problem for low-dimensional data. However, because of the complexity of data in practical applications, existing algorithms often fail on many problems, especially those involving high-dimensional and large-scale data. Traditional clustering methods encounter two main problems when clustering high-dimensional data sets: ① a high-dimensional data set contains a large number of irrelevant attributes, which makes the likelihood that clusters exist across all dimensions almost zero; ② data in a high-dimensional space are more sparsely distributed than in a lower-dimensional space, and it is common for the distances between data points to be almost equal. Because traditional clustering methods cluster on the basis of distance, they cannot reliably construct clusters in a high-dimensional space.
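The claim that distances become almost equal in high-dimensional space can be checked with a small experiment. The sketch below draws uniformly random points and measures the relative contrast, that is, the gap between the farthest and nearest point divided by the nearest distance, as seen from one query point. The uniform distribution, the sample size, and the function name are assumptions. Real data sets may behave differently.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

In runs of this kind the contrast is large in two dimensions and shrinks toward zero as the dimension grows. When the nearest and farthest points are almost equally far away, a distance threshold or a nearest-center rule no longer separates groups well.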
High-dimensional cluster analysis has become an important research direction within cluster analysis, and clustering high-dimensional data is also one of the difficult problems in clustering technology. As technology advances, data collection has become easier and easier, leading to larger and more complex databases, such as various kinds of trade transaction data, Web documents, and gene expression data, whose dimensions (attributes) can reach hundreds or thousands, or even more. However, because of the "dimensional effect" (often called the curse of dimensionality), many clustering methods that perform well in low-dimensional data spaces often fail to produce good clustering results in high-dimensional spaces. Cluster analysis of high-dimensional data is therefore both a very active field and a challenging task. It has a wide range of applications in market analysis, information security, finance, entertainment, and counter-terrorism.