Clintonpr查询

pr查询时间:2021-04-18 阅读:()

TheAnatomyofaLarge-ScaleHypertextualWebSearchEngineSergeyBrinandLawrencePageComputerScienceDepartment,StanfordUniversity,Stanford,CA94305,USAsergey@cs.
stanford.
eduandpage@cs.
stanford.
eduAbstractInthispaper,wepresentGoogle,aprototypeofalarge-scalesearchenginewhichmakesheavyuseofthestructurepresentinhypertext.
GoogleisdesignedtocrawlandindextheWebefficientlyandproducemuchmoresatisfyingsearchresultsthanexistingsystems.
Theprototypewithafulltextandhyperlinkdatabaseofatleast24millionpagesisavailableathttp://google.
stanford.
edu/Toengineerasearchengineisachallengingtask.
Searchenginesindextenstohundredsofmillionsofwebpagesinvolvingacomparablenumberofdistinctterms.
Theyanswertensofmillionsofquerieseveryday.
Despitetheimportanceoflarge-scalesearchenginesontheweb,verylittleacademicresearchhasbeendoneonthem.
Furthermore,duetorapidadvanceintechnologyandwebproliferation,creatingawebsearchenginetodayisverydifferentfromthreeyearsago.
Thispaperprovidesanin-depthdescriptionofourlarge-scalewebsearchengine--thefirstsuchdetailedpublicdescriptionweknowoftodate.
Apartfromtheproblemsofscalingtraditionalsearchtechniquestodataofthismagnitude,therearenewtechnicalchallengesinvolvedwithusingtheadditionalinformationpresentinhypertexttoproducebettersearchresults.
Thispaperaddressesthisquestionofhowtobuildapracticallarge-scalesystemwhichcanexploittheadditionalinformationpresentinhypertext.
Alsowelookattheproblemofhowtoeffectivelydealwithuncontrolledhypertextcollectionswhereanyonecanpublishanythingtheywant.
KeywordsWorldWideWeb,SearchEngines,InformationRetrieval,PageRank,Google1.
Introduction(Note:Therearetwoversionsofthispaper--alongerfullversionandashorterprintedversion.
ThefullversionisavailableonthewebandtheconferenceCD-ROM.
)Thewebcreatesnewchallengesforinformationretrieval.
Theamountofinformationonthewebisgrowingrapidly,aswellasthenumberofnewusersinexperiencedintheartofwebresearch.
Peoplearelikelytosurfthewebusingitslinkgraph,oftenstartingwithhighqualityhumanmaintainedindicessuchasYahoo!
orwithsearchengines.
Humanmaintainedlistscoverpopulartopicseffectivelybutaresubjective,expensivetobuildandmaintain,slowtoimprove,andcannotcoverallesoterictopics.
Automatedsearchenginesthatrelyonkeywordmatchingusuallyreturntoomanylowqualitymatches.
Tomakemattersworse,someadvertisersattempttogainpeople'sattentionbytakingmeasuresmeanttomisleadautomatedsearchengines.
Wehavebuiltalarge-scalesearchenginewhichaddressesmanyoftheproblemsofexistingsystems.
Itmakesespeciallyheavyuseoftheadditionalstructurepresentinhypertexttoprovidemuchhigherqualitysearchresults.
Wechoseoursystemname,Google,becauseitisacommonspellingofgoogol,or10100andfitswellwithourgoalofbuildingverylarge-scalesearchengines.
1.
1WebSearchEngines--ScalingUp:1994-2000Searchenginetechnologyhashadtoscaledramaticallytokeepupwiththegrowthoftheweb.
In1994,oneofthefirstwebsearchengines,theWorldWideWebWorm(WWWW)[McBryan94]hadanindexof110,000webpagesandwebaccessibledocuments.
AsofNovember,1997,thetopsearchenginesclaimtoindexfrom2million(WebCrawler)to100millionwebdocuments(fromSearchEngineWatch).
Itisforeseeablethatbytheyear2000,acomprehensiveindexoftheWebwillcontainoverabilliondocuments.
Atthesametime,thenumberofqueriessearchengineshandlehasgrownincrediblytoo.
InMarchandApril1994,theWorldWideWebWormreceivedanaverageofabout1500queriesperday.
InNovember1997,Altavistaclaimedithandledroughly20millionqueriesperday.
Withtheincreasingnumberofusersontheweb,andautomatedsystemswhichquerysearchengines,itislikelythattopsearchengineswillhandlehundredsofmillionsofqueriesperdaybytheyear2000.
Thegoalofoursystemistoaddressmanyoftheproblems,bothinqualityandscalability,introducedbyscalingsearchenginetechnologytosuchextraordinarynumbers.
1.
2.
Google:ScalingwiththeWebCreatingasearchenginewhichscaleseventotoday'swebpresentsmanychallenges.
Fastcrawlingtechnologyisneededtogatherthewebdocumentsandkeepthemuptodate.
Storagespacemustbeusedefficientlytostoreindicesand,optionally,thedocumentsthemselves.
Theindexingsystemmustprocesshundredsofgigabytesofdataefficiently.
Queriesmustbehandledquickly,atarateofhundredstothousandspersecond.
ThesetasksarebecomingincreasinglydifficultastheWebgrows.
However,hardwareperformanceandcosthaveimproveddramaticallytopartiallyoffsetthedifficulty.
Thereare,however,severalnotableexceptionstothisprogresssuchasdiskseektimeandoperatingsystemrobustness.
IndesigningGoogle,wehaveconsideredboththerateofgrowthoftheWebandtechnologicalchanges.
Googleisdesignedtoscalewelltoextremelylargedatasets.
Itmakesefficientuseofstoragespacetostoretheindex.
Itsdatastructuresareoptimizedforfastandefficientaccess(seesection4.
2).
Further,weexpectthatthecosttoindexandstoretextorHTMLwilleventuallydeclinerelativetotheamountthatwillbeavailable(seeAppendixB).
ThiswillresultinfavorablescalingpropertiesforcentralizedsystemslikeGoogle.
1.
3DesignGoals1.
3.
1ImprovedSearchQualityOurmaingoalistoimprovethequalityofwebsearchengines.
In1994,somepeoplebelievedthatacompletesearchindexwouldmakeitpossibletofindanythingeasily.
AccordingtoBestoftheWeb1994--Navigators,"ThebestnavigationserviceshouldmakeiteasytofindalmostanythingontheWeb(onceallthedataisentered).
"However,theWebof1997isquitedifferent.
Anyonewhohasusedasearchenginerecently,canreadilytestifythatthecompletenessoftheindexisnottheonlyfactorinthequalityofsearchresults.
"Junkresults"oftenwashoutanyresultsthatauserisinterestedin.
Infact,asofNovember1997,onlyoneofthetopfourcommercialsearchenginesfindsitself(returnsitsownsearchpageinresponsetoitsnameinthetoptenresults).
Oneofthemaincausesofthisproblemisthatthenumberofdocumentsintheindiceshasbeenincreasingbymanyordersofmagnitude,buttheuser'sabilitytolookatdocumentshasnot.
Peoplearestillonlywillingtolookatthefirstfewtensofresults.
Becauseofthis,asthecollectionsizegrows,weneedtoolsthathaveveryhighprecision(numberofrelevantdocumentsreturned,sayinthetoptensofresults).
Indeed,wewantournotionof"relevant"toonlyincludetheverybestdocumentssincetheremaybetensofthousandsofslightlyrelevantdocuments.
Thisveryhighprecisionisimportantevenattheexpenseofrecall(thetotalnumberofrelevantdocumentsthesystemisabletoreturn).
Thereisquiteabitofrecentoptimismthattheuseofmorehypertextualinformationcanhelpimprovesearchandotherapplications[Marchiori97][Spertus97][Weiss96][Kleinberg98].
Inparticular,linkstructure[Page98]andlinktextprovidealotofinformationformakingrelevancejudgmentsandqualityfiltering.
Googlemakesuseofbothlinkstructureandanchortext(seeSections2.
1and2.
2).
1.
3.
2AcademicSearchEngineResearchAsidefromtremendousgrowth,theWebhasalsobecomeincreasinglycommercialovertime.
In1993,1.
5%ofwebserverswereon.
comdomains.
Thisnumbergrewtoover60%in1997.
Atthesametime,searchengineshavemigratedfromtheacademicdomaintothecommercial.
Upuntilnowmostsearchenginedevelopmenthasgoneonatcompanieswithlittlepublicationoftechnicaldetails.
Thiscausessearchenginetechnologytoremainlargelyablackartandtobeadvertisingoriented(seeAppendixA).
WithGoogle,wehaveastronggoaltopushmoredevelopmentandunderstandingintotheacademicrealm.
Anotherimportantdesigngoalwastobuildsystemsthatreasonablenumbersofpeoplecanactuallyuse.
Usagewasimportanttousbecausewethinksomeofthemostinterestingresearchwillinvolveleveragingthevastamountofusagedatathatisavailablefrommodernwebsystems.
Forexample,therearemanytensofmillionsofsearchesperformedeveryday.
However,itisverydifficulttogetthisdata,mainlybecauseitisconsideredcommerciallyvaluable.
Ourfinaldesigngoalwastobuildanarchitecturethatcansupportnovelresearchactivitiesonlarge-scalewebdata.
Tosupportnovelresearchuses,Googlestoresalloftheactualdocumentsitcrawlsincompressedform.
OneofourmaingoalsindesigningGooglewastosetupanenvironmentwhereotherresearcherscancomeinquickly,processlargechunksoftheweb,andproduceinterestingresultsthatwouldhavebeenverydifficulttoproduceotherwise.
Intheshorttimethesystemhasbeenup,therehavealreadybeenseveralpapersusingdatabasesgeneratedbyGoogle,andmanyothersareunderway.
AnothergoalwehaveistosetupaSpacelab-likeenvironmentwhereresearchersorevenstudentscanproposeanddointerestingexperimentsonourlarge-scalewebdata.
2.
SystemFeaturesTheGooglesearchenginehastwoimportantfeaturesthathelpitproducehighprecisionresults.
First,itmakesuseofthelinkstructureoftheWebtocalculateaqualityrankingforeachwebpage.
ThisrankingiscalledPageRankandisdescribedindetailin[Page98].
Second,Googleutilizeslinktoimprovesearchresults.
2.
1PageRank:BringingOrdertotheWebThecitation(link)graphofthewebisanimportantresourcethathaslargelygoneunusedinexistingwebsearchengines.
Wehavecreatedmapscontainingasmanyas518millionofthesehyperlinks,asignificantsampleofthetotal.
Thesemapsallowrapidcalculationofawebpage's"PageRank",anobjectivemeasureofitscitationimportancethatcorrespondswellwithpeople'ssubjectiveideaofimportance.
Becauseofthiscorrespondence,PageRankisanexcellentwaytoprioritizetheresultsofwebkeywordsearches.
Formostpopularsubjects,asimpletextmatchingsearchthatisrestrictedtowebpagetitlesperformsadmirablywhenPageRankprioritizestheresults(demoavailableatgoogle.
stanford.
edu).
ForthetypeoffulltextsearchesinthemainGooglesystem,PageRankalsohelpsagreatdeal.
2.
1.
1DescriptionofPageRankCalculationAcademiccitationliteraturehasbeenappliedtotheweb,largelybycountingcitationsorbacklinkstoagivenpage.
Thisgivessomeapproximationofapage'simportanceorquality.
PageRankextendsthisideabynotcountinglinksfromallpagesequally,andbynormalizingbythenumberoflinksonapage.
PageRankisdefinedasfollows:WeassumepageAhaspagesT1.
.
.
Tnwhichpointtoit(i.
e.
,arecitations).
Theparameterdisadampingfactorwhichcanbesetbetween0and1.
Weusuallysetdto0.
85.
Therearemoredetailsaboutdinthenextsection.
AlsoC(A)isdefinedasthenumberoflinksgoingoutofpageA.
ThePageRankofapageAisgivenasfollows:PR(A)=(1-d)+d(PR(T1)/C(T1)PR(Tn)/C(Tn))NotethatthePageRanksformaprobabilitydistributionoverwebpages,sothesumofallwebpages'PageRankswillbeone.
PageRankorPR(A)canbecalculatedusingasimpleiterativealgorithm,andcorrespondstotheprincipaleigenvectorofthenormalizedlinkmatrixoftheweb.
Also,aPageRankfor26millionwebpagescanbecomputedinafewhoursonamediumsizeworkstation.
Therearemanyotherdetailswhicharebeyondthescopeofthispaper.
2.
1.
2IntuitiveJustificationPageRankcanbethoughtofasamodelofuserbehavior.
Weassumethereisa"randomsurfer"whoisgivenawebpageatrandomandkeepsclickingonlinks,neverhitting"back"buteventuallygetsboredandstartsonanotherrandompage.
TheprobabilitythattherandomsurfervisitsapageisitsPageRank.
And,theddampingfactoristheprobabilityateachpagethe"randomsurfer"willgetboredandrequestanotherrandompage.
Oneimportantvariationistoonlyaddthedampingfactordtoasinglepage,oragroupofpages.
Thisallowsforpersonalizationandcanmakeitnearlyimpossibletodeliberatelymisleadthesysteminordertogetahigherranking.
WehaveseveralotherextensionstoPageRank,againsee[Page98].
AnotherintuitivejustificationisthatapagecanhaveahighPageRankiftherearemanypagesthatpointtoit,oriftherearesomepagesthatpointtoitandhaveahighPageRank.
Intuitively,pagesthatarewellcitedfrommanyplacesaroundthewebareworthlookingat.
Also,pagesthathaveperhapsonlyonecitationfromsomethingliketheYahoo!
homepagearealsogenerallyworthlookingat.
Ifapagewasnothighquality,orwasabrokenlink,itisquitelikelythatYahoo'shomepagewouldnotlinktoit.
PageRankhandlesboththesecasesandeverythinginbetweenbyrecursivelypropagatingweightsthroughthelinkstructureoftheweb.
2.
2AnchorTextThetextoflinksistreatedinaspecialwayinoursearchengine.
Mostsearchenginesassociatethetextofalinkwiththepagethatthelinkison.
Inaddition,weassociateitwiththepagethelinkpointsto.
Thishasseveraladvantages.
First,anchorsoftenprovidemoreaccuratedescriptionsofwebpagesthanthepagesthemselves.
Second,anchorsmayexistfordocumentswhichcannotbeindexedbyatext-basedsearchengine,suchasimages,programs,anddatabases.
Thismakesitpossibletoreturnwebpageswhichhavenotactuallybeencrawled.
Notethatpagesthathavenotbeencrawledcancauseproblems,sincetheyarenevercheckedforvaliditybeforebeingreturnedtotheuser.
Inthiscase,thesearchenginecanevenreturnapagethatneveractuallyexisted,buthadhyperlinkspointingtoit.
However,itispossibletosorttheresults,sothatthisparticularproblemrarelyhappens.
ThisideaofpropagatinganchortexttothepageitreferstowasimplementedintheWorldWideWebWorm[McBryan94]especiallybecauseithelpssearchnon-textinformation,andexpandsthesearchcoveragewithfewerdownloadeddocuments.
Weuseanchorpropagationmostlybecauseanchortextcanhelpprovidebetterqualityresults.
Usinganchortextefficientlyistechnicallydifficultbecauseofthelargeamountsofdatawhichmustbeprocessed.
Inourcurrentcrawlof24millionpages,wehadover259millionanchorswhichweindexed.
2.
3OtherFeaturesAsidefromPageRankandtheuseofanchortext,Googlehasseveralotherfeatures.
First,ithaslocationinformationforallhitsandsoitmakesextensiveuseofproximityinsearch.
Second,Googlekeepstrackofsomevisualpresentationdetailssuchasfontsizeofwords.
Wordsinalargerorbolderfontareweightedhigherthanotherwords.
Third,fullrawHTMLofpagesisavailableinarepository.
3RelatedWorkSearchresearchonthewebhasashortandconcisehistory.
TheWorldWideWebWorm(WWWW)[McBryan94]wasoneofthefirstwebsearchengines.
Itwassubsequentlyfollowedbyseveralotheracademicsearchengines,manyofwhicharenowpubliccompanies.
ComparedtothegrowthoftheWebandtheimportanceofsearchenginestherearepreciousfewdocumentsaboutrecentsearchengines[Pinkerton94].
AccordingtoMichaelMauldin(chiefscientist,LycosInc)[Mauldin],"thevariousservices(includingLycos)closelyguardthedetailsofthesedatabases".
However,therehasbeenafairamountofworkonspecificfeaturesofsearchengines.
Especiallywellrepresentedisworkwhichcangetresultsbypost-processingtheresultsofexistingcommercialsearchengines,orproducesmallscale"individualized"searchengines.
Finally,therehasbeenalotofresearchoninformationretrievalsystems,especiallyonwellcontrolledcollections.
Inthenexttwosections,wediscusssomeareaswherethisresearchneedstobeextendedtoworkbetterontheweb.
3.
1InformationRetrievalWorkininformationretrievalsystemsgoesbackmanyyearsandiswelldeveloped[Witten94].
However,mostoftheresearchoninformationretrievalsystemsisonsmallwellcontrolledhomogeneouscollectionssuchascollectionsofscientificpapersornewsstoriesonarelatedtopic.
Indeed,theprimarybenchmarkforinformationretrieval,theTextRetrievalConference[TREC96],usesafairlysmall,wellcontrolledcollectionfortheirbenchmarks.
The"VeryLargeCorpus"benchmarkisonly20GBcomparedtothe147GBfromourcrawlof24millionwebpages.
ThingsthatworkwellonTRECoftendonotproducegoodresultsontheweb.
Forexample,thestandardvectorspacemodeltriestoreturnthedocumentthatmostcloselyapproximatesthequery,giventhatbothqueryanddocumentarevectorsdefinedbytheirwordoccurrence.
Ontheweb,thisstrategyoftenreturnsveryshortdocumentsthatarethequeryplusafewwords.
Forexample,wehaveseenamajorsearchenginereturnapagecontainingonly"BillClintonSucks"andpicturefroma"BillClinton"query.
Somearguethatontheweb,usersshouldspecifymoreaccuratelywhattheywantandaddmorewordstotheirquery.
Wedisagreevehementlywiththisposition.
Ifauserissuesaquerylike"BillClinton"theyshouldgetreasonableresultssincethereisaenormousamountofhighqualityinformationavailableonthistopic.
Givenexampleslikethese,webelievethatthestandardinformationretrievalworkneedstobeextendedtodealeffectivelywiththeweb.
3.
2DifferencesBetweentheWebandWellControlledCollectionsThewebisavastcollectionofcompletelyuncontrolledheterogeneousdocuments.
Documentsonthewebhaveextremevariationinternaltothedocuments,andalsointheexternalmetainformationthatmightbeavailable.
Forexample,documentsdifferinternallyintheirlanguage(bothhumanandprogramming),vocabulary(emailaddresses,links,zipcodes,phonenumbers,productnumbers),typeorformat(text,HTML,PDF,images,sounds),andmayevenbemachinegenerated(logfilesoroutputfromadatabase).
Ontheotherhand,wedefineexternalmetainformationasinformationthatcanbeinferredaboutadocument,butisnotcontainedwithinit.
Examplesofexternalmetainformationincludethingslikereputationofthesource,updatefrequency,quality,popularityorusage,andcitations.
Notonlyarethepossiblesourcesofexternalmetainformationvaried,butthethingsthatarebeingmeasuredvarymanyordersofmagnitudeaswell.
Forexample,comparetheusageinformationfromamajorhomepage,likeYahoo'swhichcurrentlyreceivesmillionsofpageviewseverydaywithanobscurehistoricalarticlewhichmightreceiveonevieweverytenyears.
Clearly,thesetwoitemsmustbetreatedverydifferentlybyasearchengine.
Anotherbigdifferencebetweenthewebandtraditionalwellcontrolledcollectionsisthatthereisvirtuallynocontroloverwhatpeoplecanputontheweb.
Couplethisflexibilitytopublishanythingwiththeenormousinfluenceofsearchenginestoroutetrafficandcompanieswhichdeliberatelymanipulatingsearchenginesforprofitbecomeaseriousproblem.
Thisproblemthathasnotbeenaddressedintraditionalclosedinformationretrievalsystems.
Also,itisinterestingtonotethatmetadataeffortshavelargelyfailedwithwebsearchengines,becauseanytextonthepagewhichisnotdirectlyrepresentedtotheuserisabusedtomanipulatesearchengines.
Thereareevennumerouscompanieswhichspecializeinmanipulatingsearchenginesforprofit.
4SystemAnatomyFirst,wewillprovideahighleveldiscussionofthearchitecture.
Then,thereissomein-depthdescriptionsofimportantdatastructures.
Finally,themajorapplications:crawling,indexing,andsearchingwillbeexaminedindepth.
4.
1GoogleArchitectureOverviewInthissection,wewillgiveahighleveloverviewofhowthewholesystemworksaspicturedinFigure1.
Furthersectionswilldiscusstheapplicationsanddatastructuresnotmentionedinthissection.
MostofGoogleisimplementedinCorC++forefficiencyandcanrunineitherSolarisorLinux.
InGoogle,thewebcrawling(downloadingofwebpages)isdonebyseveraldistributedcrawlers.
ThereisaURLserverthatsendslistsofURLstobefetchedtothecrawlers.
Thewebpagesthatarefetchedarethensenttothestoreserver.
Thestoreserverthencompressesandstoresthewebpagesintoarepository.
EverywebpagehasanassociatedIDnumbercalledadocIDwhichisassignedwheneveranewURLisparsedoutofawebpage.
Theindexingfunctionisperformedbytheindexerandthesorter.
Theindexerperformsanumberoffunctions.
Itreadstherepository,uncompressesthedocuments,andparsesthem.
Eachdocumentisconvertedintoasetofwordoccurrencescalledhits.
Thehitsrecordtheword,positionindocument,anapproximationoffontsize,andcapitalization.
Theindexerdistributesthesehitsintoasetof"barrels",creatingapartiallysortedforwardindex.
Theindexerperformsanotherimportantfunction.
Itparsesoutallthelinksineverywebpageandstoresimportantinformationabouttheminananchorsfile.
Thisfilecontainsenoughinformationtodeterminewhereeachlinkpointsfromandto,andthetextofthelink.
TheURLresolverreadstheanchorsfileandconvertsrelativeURLsintoabsoluteURLsandinturnintodocIDs.
Itputstheanchortextintotheforwardindex,associatedwiththedocIDthattheanchorpointsto.
ItalsogeneratesadatabaseoflinkswhicharepairsofdocIDs.
ThelinksdatabaseisusedtocomputePageRanksforallthedocuments.
Thesortertakesthebarrels,whicharesortedbydocID(thisisasimplification,seeSection4.
2.
5),andresortsthembywordIDtogeneratetheinvertedindex.
Thisisdoneinplacesothatlittletemporaryspaceisneededforthisoperation.
ThesorteralsoproducesalistofwordIDsandoffsetsintotheinvertedindex.
AprogramcalledDumpLexicontakesthislisttogetherwiththelexiconproducedbytheindexerandgeneratesanewlexicontobeusedbythesearcher.
ThesearcherisrunbyawebserverandusesthelexiconbuiltbyDumpLexicontogetherwiththeinvertedindexandthePageRankstoanswerqueries.
4.
2MajorDataStructuresGoogle'sdatastructuresareoptimizedsothatalargedocumentcollectioncanbecrawled,indexed,andsearchedwithlittlecost.
Although,CPUsandbulkinputoutputrateshaveimproveddramaticallyovertheyears,adiskseekstillrequiresabout10mstocomplete.
Googleisdesignedtoavoiddiskseekswheneverpossible,andthishashadaconsiderableinfluenceonthedesignofthedatastructures.
Figure1.
HighLevelGoogleArchitecture4.
2.
1BigFilesBigFilesarevirtualfilesspanningmultiplefilesystemsandareaddressableby64bitintegers.
Theallocationamongmultiplefilesystemsishandledautomatically.
TheBigFilespackagealsohandlesallocationanddeallocationoffiledescriptors,sincetheoperatingsystemsdonotprovideenoughforourneeds.
BigFilesalsosupportrudimentarycompressionoptions.
4.
2.
2RepositoryTherepositorycontainsthefullHTMLofeverywebpage.
Eachpageiscompressedusingzlib(seeRFC1950).
Thechoiceofcompressiontechniqueisatradeoffbetweenspeedandcompressionratio.
Wechosezlib'sspeedoverasignificantimprovementincompressionofferedbybzip.
Thecompressionrateofbzipwasapproximately4to1ontherepositoryascomparedtozlib's3to1compression.
Intherepository,thedocumentsarestoredoneaftertheotherandareprefixedbydocID,length,andURLascanbeseeninFigure2.
Therepositoryrequiresnootherdatastructurestobeusedinordertoaccessit.
Thishelpswithdataconsistencyandmakesdevelopmentmucheasier;wecanrebuildalltheotherdatastructuresfromonlytherepositoryandafilewhichlistscrawlererrors.
4.
2.
3DocumentIndexThedocumentindexkeepsinformationabouteachdocument.
ItisafixedwidthISAM(Indexsequentialaccessmode)index,orderedbydocID.
Theinformationstoredineachentryincludesthecurrentdocumentstatus,apointerintotherepository,adocumentchecksum,andvariousstatistics.
Ifthedocumenthasbeencrawled,italsocontainsapointerintoavariablewidthfilecalleddocinfowhichcontainsitsURLandtitle.
OtherwisethepointerpointsintotheURLlistwhichcontainsjusttheURL.
Thisdesigndecisionwasdrivenbythedesiretohaveareasonablycompactdatastructure,andtheabilitytofetcharecordinonediskseekduringasearchAdditionally,thereisafilewhichisusedtoconvertURLsintodocIDs.
ItisalistofURLchecksumswiththeircorrespondingdocIDsandissortedbychecksum.
InordertofindthedocIDofaparticularURL,theURL'schecksumiscomputedandabinarysearchisperformedonthechecksumsfiletofinditsdocID.
URLsmaybeconvertedintodocIDsinbatchbydoingamergewiththisfile.
ThisisthetechniquetheURLresolverusestoturnURLsintodocIDs.
Thisbatchmodeofupdateiscrucialbecauseotherwisewemustperformoneseekforeverylinkwhichassumingonediskwouldtakemorethanamonthforour322millionlinkdataset.
4.
2.
4LexiconThelexiconhasseveraldifferentforms.
Oneimportantchangefromearliersystemsisthatthelexiconcanfitinmemoryforareasonableprice.
Inthecurrentimplementationwecankeepthelexiconinmemoryonamachinewith256MBofmainmemory.
Thecurrentlexiconcontains14millionwords(thoughsomerarewordswerenotaddedtothelexicon).
Itisimplementedintwoparts--alistofthewords(concatenatedtogetherbutseparatedbynulls)andahashtableofpointers.
Forvariousfunctions,Figure2.
RepositoryDataStructurethelistofwordshassomeauxiliaryinformationwhichisbeyondthescopeofthispapertoexplainfully.
4.
2.
5HitListsAhitlistcorrespondstoalistofoccurrencesofaparticularwordinaparticulardocumentincludingposition,font,andcapitalizationinformation.
Hitlistsaccountformostofthespaceusedinboththeforwardandtheinvertedindices.
Becauseofthis,itisimportanttorepresentthemasefficientlyaspossible.
Weconsideredseveralalternativesforencodingposition,font,andcapitalization--simpleencoding(atripleofintegers),acompactencoding(ahandoptimizedallocationofbits),andHuffmancoding.
IntheendwechoseahandoptimizedcompactencodingsinceitrequiredfarlessspacethanthesimpleencodingandfarlessbitmanipulationthanHuffmancoding.
ThedetailsofthehitsareshowninFigure3.
Ourcompactencodingusestwobytesforeveryhit.
Therearetwotypesofhits:fancyhitsandplainhits.
FancyhitsincludehitsoccurringinaURL,title,anchortext,ormetatag.
Plainhitsincludeeverythingelse.
Aplainhitconsistsofacapitalizationbit,fontsize,and12bitsofwordpositioninadocument(allpositionshigherthan4095arelabeled4096).
Fontsizeisrepresentedrelativetotherestofthedocumentusingthreebits(only7valuesareactuallyusedbecause111istheflagthatsignalsafancyhit).
Afancyhitconsistsofacapitalizationbit,thefontsizesetto7toindicateitisafancyhit,4bitstoencodethetypeoffancyhit,and8bitsofposition.
Foranchorhits,the8bitsofpositionaresplitinto4bitsforpositioninanchorand4bitsforahashofthedocIDtheanchoroccursin.
Thisgivesussomelimitedphrasesearchingaslongastherearenotthatmanyanchorsforaparticularword.
WeexpecttoupdatethewaythatanchorhitsarestoredtoallowforgreaterresolutioninthepositionanddocIDhashfields.
Weusefontsizerelativetotherestofthedocumentbecausewhensearching,youdonotwanttorankotherwiseidenticaldocumentsdifferentlyjustbecauseoneofthedocumentsisinalargerfont.
Thelengthofahitlistisstoredbeforethehitsthemselves.
Tosavespace,thelengthofthehitlistiscombinedwiththewordIDintheforwardindexandthedocIDintheinvertedindex.
Thislimitsitto8and5bitsrespectively(therearesometrickswhichallow8bitstobeborrowedfromthewordID).
Ifthelengthislongerthanwouldfitinthatmanybits,anescapecodeisusedinthosebits,andthenexttwobytescontaintheactuallength.
4.
2.
6ForwardIndexTheforwardindexisactuallyalreadypartiallysorted.
Itisstoredinanumberofbarrels(weused64).
EachbarrelholdsarangeofwordID's.
Ifadocumentcontainswordsthatfallintoaparticularbarrel,thedocIDisrecordedintothebarrel,followedbyalistofwordID'swithhitlistswhichcorrespondtothosewords.
ThisschemerequiresslightlymorestoragebecauseofduplicateddocIDsbutthedifferenceisverysmallforareasonablenumberofbucketsandsavesconsiderabletimeandcodingcomplexityinthefinalindexingphasedonebythesorter.
Furthermore,insteadofstoringactualwordID's,westoreeachwordIDasarelativedifferencefromtheminimumwordIDthatfallsintotheFigure3.
ForwardandReverseIndexesandtheLexiconbarrelthewordIDisin.
Thisway,wecanusejust24bitsforthewordID'sintheunsortedbarrels,leaving8bitsforthehitlistlength.
4.
2.
7InvertedIndexTheinvertedindexconsistsofthesamebarrelsastheforwardindex,exceptthattheyhavebeenprocessedbythesorter.
ForeveryvalidwordID,thelexiconcontainsapointerintothebarrelthatwordIDfallsinto.
ItpointstoadoclistofdocID'stogetherwiththeircorrespondinghitlists.
Thisdoclistrepresentsalltheoccurrencesofthatwordinalldocuments.
AnimportantissueisinwhatorderthedocID'sshouldappearinthedoclist.
OnesimplesolutionistostorethemsortedbydocID.
Thisallowsforquickmergingofdifferentdoclistsformultiplewordqueries.
Anotheroptionistostorethemsortedbyarankingoftheoccurrenceofthewordineachdocument.
Thismakesansweringonewordqueriestrivialandmakesitlikelythattheanswerstomultiplewordqueriesarenearthestart.
However,mergingismuchmoredifficult.
Also,thismakesdevelopmentmuchmoredifficultinthatachangetotherankingfunctionrequiresarebuildoftheindex.
Wechoseacompromisebetweentheseoptions,keepingtwosetsofinvertedbarrels--onesetforhitlistswhichincludetitleoranchorhitsandanothersetforallhitlists.
Thisway,wecheckthefirstsetofbarrelsfirstandiftherearenotenoughmatcheswithinthosebarrelswecheckthelargerones.
4.
3CrawlingtheWebRunningawebcrawlerisachallengingtask.
Therearetrickyperformanceandreliabilityissuesandevenmoreimportantly,therearesocialissues.
Crawlingisthemostfragileapplicationsinceitinvolvesinteractingwithhundredsofthousandsofwebserversandvariousnameserverswhichareallbeyondthecontrolofthesystem.
Inordertoscaletohundredsofmillionsofwebpages,Googlehasafastdistributedcrawlingsystem.
AsingleURLserverserveslistsofURLstoanumberofcrawlers(wetypicallyranabout3).
BoththeURLserverandthecrawlersareimplementedinPython.
Eachcrawlerkeepsroughly300connectionsopenatonce.
Thisisnecessarytoretrievewebpagesatafastenoughpace.
Atpeakspeeds,thesystemcancrawlover100webpagespersecondusingfourcrawlers.
Thisamountstoroughly600Kpersecondofdata.
AmajorperformancestressisDNSlookup.
EachcrawlermaintainsaitsownDNScachesoitdoesnotneedtodoaDNSlookupbeforecrawlingeachdocument.
Eachofthehundredsofconnectionscanbeinanumberofdifferentstates:lookingupDNS,connectingtohost,sendingrequest,andreceivingresponse.
Thesefactorsmakethecrawleracomplexcomponentofthesystem.
ItusesasynchronousIOtomanageevents,andanumberofqueuestomovepagefetchesfromstatetostate.
Itturnsoutthatrunningacrawlerwhichconnectstomorethanhalfamillionservers,andgeneratestensofmillionsoflogentriesgeneratesafairamountofemailandphonecalls.
Becauseofthevastnumberofpeoplecomingonline,therearealwaysthosewhodonotknowwhatacrawleris,becausethisisthefirstonetheyhaveseen.
Almostdaily,wereceiveanemailsomethinglike,"Wow,youlookedatalotofpagesfrommywebsite.
Howdidyoulikeit"Therearealsosomepeoplewhodonotknowabouttherobotsexclusionprotocol,andthinktheirpageshouldbeprotectedfromindexingbyastatementlike,"Thispageiscopyrightedandshouldnotbeindexed",whichneedlesstosayisdifficultforwebcrawlerstounderstand.
Also,becauseofthehugeamountofdatainvolved,unexpectedthingswillhappen.
Forexample,oursystemtriedtocrawlanonlinegame.
Thisresultedinlotsofgarbagemessagesinthemiddleoftheirgame!
Itturnsoutthiswasaneasyproblemtofix.
Butthisproblemhadnotcomeupuntilwehaddownloadedtensofmillionsofpages.
Becauseoftheimmensevariationinwebpagesandservers,itisvirtuallyimpossibletotestacrawlerwithoutrunningitonlargepartoftheInternet.
Invariably,therearehundredsofobscureproblemswhichmayonlyoccurononepageoutofthewholewebandcausethecrawlertocrash,orworse,causeunpredictableorincorrectbehavior.
SystemswhichaccesslargepartsoftheInternetneedtobedesignedtobeveryrobustandcarefullytested.
Sincelargecomplexsystemssuchascrawlerswillinvariablycauseproblems,thereneedstobesignificantresourcesdevotedtoreadingtheemailandsolvingtheseproblemsastheycomeup.
4.
4IndexingtheWebParsing--AnyparserwhichisdesignedtorunontheentireWebmusthandleahugearrayofpossibleerrors.
TheserangefromtyposinHTMLtagstokilobytesofzerosinthemiddleofatag,non-ASCIIcharacters,HTMLtagsnestedhundredsdeep,andagreatvarietyofothererrorsthatchallengeanyone'simaginationtocomeupwithequallycreativeones.
Formaximumspeed,insteadofusingYACCtogenerateaCFGparser,weuseflextogeneratealexicalanalyzerwhichweoutfitwithitsownstack.
Developingthisparserwhichrunsatareasonablespeedandisveryrobustinvolvedafairamountofwork.
IndexingDocumentsintoBarrels--Aftereachdocumentisparsed,itisencodedintoanumberofbarrels.
EverywordisconvertedintoawordIDbyusinganin-memoryhashtable--thelexicon.
Newadditionstothelexiconhashtableareloggedtoafile.
OncethewordsareconvertedintowordID's,theiroccurrencesinthecurrentdocumentaretranslatedintohitlistsandarewrittenintotheforwardbarrels.
Themaindifficultywithparallelizationoftheindexingphaseisthatthelexiconneedstobeshared.
Insteadofsharingthelexicon,wetooktheapproachofwritingalogofalltheextrawordsthatwerenotinabaselexicon,whichwefixedat14millionwords.
Thatwaymultipleindexerscanruninparallelandthenthesmalllogfileofextrawordscanbeprocessedbyonefinalindexer.
Sorting--Inordertogeneratetheinvertedindex,thesortertakeseachoftheforwardbarrelsandsortsitbywordIDtoproduceaninvertedbarrelfortitleandanchorhitsandafulltextinvertedbarrel.
Thisprocesshappensonebarrelatatime,thusrequiringlittletemporarystorage.
Also,weparallelizethesortingphasetouseasmanymachinesaswehavesimplybyrunningmultiplesorters,whichcanprocessdifferentbucketsatthesametime.
Sincethebarrelsdon'tfitintomainmemory,thesorterfurthersubdividesthemintobasketswhichdofitintomemorybasedonwordIDanddocID.
Thenthesorter,loadseachbasketintomemory,sortsitandwritesitscontentsintotheshortinvertedbarrelandthefullinvertedbarrel.
4.
5SearchingThegoalofsearchingistoprovidequalitysearchresultsefficiently.
Manyofthelargecommercialsearchenginesseemedtohavemadegreatprogressintermsofefficiency.
Therefore,wehavefocusedmoreonqualityofsearchinourresearch,althoughwebelieveoursolutionsarescalabletocommercialvolumeswithabitmoreeffort.
ThegooglequeryevaluationprocessisshowinFigure4.
Toputalimitonresponsetime,onceacertainnumber(currently40,000)ofmatchingdocumentsarefound,thesearcherautomaticallygoestostep8inFigure4.
Thismeansthatitispossiblethatsub-optimalresultswouldbereturned.
Wearecurrentlyinvestigatingotherwaystosolvethisproblem.
Inthepast,wesortedthehitsaccordingtoPageRank,whichseemedtoimprovethesituation.
4.
5.
1TheRankingSystemGooglemaintainsmuchmoreinformationaboutwebdocumentsthantypicalsearchengines.
Everyhitlistincludesposition,font,andcapitalizationinformation.
Additionally,wefactorinhitsfromanchortextandthePageRankofthedocument.
Combiningallofthisinformationintoarankisdifficult.
Wedesignedourrankingfunctionsothatnoparticularfactorcanhavetoomuchinfluence.
First,considerthesimplestcase--asinglewordquery.
Inordertorankadocumentwithasinglewordquery,Googlelooksatthatdocument'shitlistforthatword.
Googleconsiderseachhittobeoneofseveraldifferenttypes(title,anchor,URL,plaintextlargefont,plaintextsmallfont,.
.
.
),eachofwhichhasitsowntype-weight.
Thetype-weightsmakeupavectorindexedbytype.
Googlecountsthenumberofhitsofeachtypeinthehitlist.
Theneverycountisconvertedintoacount-weight.
Count-weightsincreaselinearlywithcountsatfirstbutquicklytaperoffsothatmorethanacertaincountwillnothelp.
Wetakethedotproductofthevectorofcount-weightswiththevectoroftype-weightstocomputeanIRscoreforthedocument.
Finally,theIRscoreiscombinedwithPageRanktogiveafinalranktothedocument.
Foramulti-wordsearch,thesituationismorecomplicated.
Nowmultiplehitlistsmustbescannedthroughatoncesothathitsoccurringclosetogetherinadocumentareweightedhigherthanhitsoccurringfarapart.
Thehitsfromthemultiplehitlistsarematchedupsothatnearbyhitsarematchedtogether.
Foreverymatchedsetofhits,aproximityiscomputed.
Theproximityisbasedonhowfarapartthehitsareinthedocument(oranchor)butisclassifiedinto10differentvalue"bins"rangingfromaphrasematchto"notevenclose".
Countsarecomputednotonlyforeverytypeofhitbutforeverytypeandproximity.
Everytypeandproximitypairhasatype-prox-weight.
Thecountsareconvertedintocount-weightsandwetakethedotproductofthecount-weightsandthetype-prox-weightstocomputeanIRscore.
Allofthesenumbersandmatricescanallbedisplayedwiththesearchresultsusingaspecialdebugmode.
Thesedisplayshavebeenveryhelpfulindevelopingtherankingsystem.
4.
5.
2FeedbackTherankingfunctionhasmanyparameterslikethetype-weightsandthetype-prox-weights.
Figuringouttherightvaluesfortheseparametersissomethingofablackart.
Inordertodothis,wehaveauserfeedbackmechanisminthesearchengine.
Atrustedusermayoptionallyevaluatealloftheresultsthatarereturned.
Thisfeedbackissaved.
Thenwhenwemodifytherankingfunction,wecanseetheimpactofthischangeonallprevioussearcheswhichwereranked.
Althoughfarfromperfect,thisgivesussome1.
Parsethequery.
2.
ConvertwordsintowordIDs.
3.
Seektothestartofthedoclistintheshortbarrelforeveryword.
4.
Scanthroughthedoclistsuntilthereisadocumentthatmatchesallthesearchterms.
5.
Computetherankofthatdocumentforthequery.
6.
Ifweareintheshortbarrelsandattheendofanydoclist,seektothestartofthedoclistinthefullbarrelforeverywordandgotostep4.
7.
Ifwearenotattheendofanydoclistgotostep4.
Sortthedocumentsthathavematchedbyrankandreturnthetopk.
Figure4.
GoogleQueryEvaluationideaofhowachangeintherankingfunctionaffectsthesearchresults.
5ResultsandPerformanceThemostimportantmeasureofasearchengineisthequalityofitssearchresults.
Whileacompleteuserevaluationisbeyondthescopeofthispaper,ourownexperiencewithGooglehasshownittoproducebetterresultsthanthemajorcommercialsearchenginesformostsearches.
AsanexamplewhichillustratestheuseofPageRank,anchortext,andproximity,Figure4showsGoogle'sresultsforasearchon"billclinton".
TheseresultsdemonstratessomeofGoogle'sfeatures.
Theresultsareclusteredbyserver.
Thishelpsconsiderablywhensiftingthroughresultsets.
Anumberofresultsarefromthewhitehouse.
govdomainwhichiswhatonemayreasonablyexpectfromsuchasearch.
Currently,mostmajorcommercialsearchenginesdonotreturnanyresultsfromwhitehouse.
gov,muchlesstherightones.
Noticethatthereisnotitleforthefirstresult.
Thisisbecauseitwasnotcrawled.
Instead,Googlereliedonanchortexttodeterminethiswasagoodanswertothequery.
Similarly,thefifthresultisanemailaddresswhich,ofcourse,isnotcrawlable.
Itisalsoaresultofanchortext.
Alloftheresultsarereasonablyhighqualitypagesand,atlastcheck,nonewerebrokenlinks.
ThisislargelybecausetheyallhavehighPageRank.
ThePageRanksarethepercentagesinredalongwithbargraphs.
Finally,therearenoresultsaboutaBillotherthanClintonoraboutaClintonotherthanBill.
Thisisbecauseweplaceheavyimportanceontheproximityofwordoccurrences.
Ofcourseatruetestofthequalityofasearchenginewouldinvolveanextensiveuserstudyorresultsanalysiswhichwedonothaveroomforhere.
Instead,weinvitethereadertotryGoogleforthemselvesathttp://google.
stanford.
edu.
5.
1StorageRequirementsQuery:billclintonhttp://www.
whitehouse.
gov/100.
00%(nodate)(0K)http://www.
whitehouse.
gov/OfficeofthePresident99.
67%(Dec231996)(2K)http://www.
whitehouse.
gov/WH/EOP/OP/html/OP_Home.
htmlWelcomeToTheWhiteHouse99.
98%(Nov091997)(5K)http://www.
whitehouse.
gov/WH/Welcome.
htmlSendElectronicMailtothePresident99.
86%(Jul141997)(5K)http://www.
whitehouse.
gov/WH/Mail/html/Mail_President.
htmlmailto:president@whitehouse.
gov99.
98%mailto:President@whitehouse.
gov99.
27%The"Unofficial"BillClinton94.
06%(Nov111997)(14K)http://zpub.
com/un/un-bc.
htmlBillClintonMeetsTheShrinks86.
27%(Jun291997)(63K)http://zpub.
com/un/un-bc9.
htmlPresidentBillClinton-TheDarkSide97.
27%(Nov101997)(15K)http://www.
realchange.
org/clinton.
htm$3BillClinton94.
73%(nodate)(4K)http://www.
gatewy.
net/~tjohnson/clinton1.
htmlFigure4.
SampleResultsfromGoogleAsidefromsearchquality,GoogleisdesignedtoscalecosteffectivelytothesizeoftheWebasitgrows.
Oneaspectofthisistousestorageefficiently.
Table1hasabreakdownofsomestatisticsandstoragerequirementsofGoogle.
Duetocompressionthetotalsizeoftherepositoryisabout53GB,justoveronethirdofthetotaldataitstores.
Atcurrentdiskpricesthismakestherepositoryarelativelycheapsourceofusefuldata.
Moreimportantly,thetotalofallthedatausedbythesearchenginerequiresacomparableamountofstorage,about55GB.
Furthermore,mostqueriescanbeansweredusingjusttheshortinvertedindex.
WithbetterencodingandcompressionoftheDocumentIndex,ahighqualitywebsearchenginemayfitontoa7GBdriveofanewPC.
5.
2SystemPerformanceItisimportantforasearchenginetocrawlandindexefficiently.
Thiswayinformationcanbekeptuptodateandmajorchangestothesystemcanbetestedrelativelyquickly.
ForGoogle,themajoroperationsareCrawling,Indexing,andSorting.
Itisdifficulttomeasurehowlongcrawlingtookoverallbecausedisksfilledup,nameserverscrashed,oranynumberofotherproblemswhichstoppedthesystem.
Intotalittookroughly9daystodownloadthe26millionpages(includingerrors).
However,oncethesystemwasrunningsmoothly,itranmuchfaster,downloadingthelast11millionpagesinjust63hours,averagingjustover4millionpagesperdayor48.
5pagespersecond.
Werantheindexerandthecrawlersimultaneously.
Theindexerranjustfasterthanthecrawlers.
Thisislargelybecausewespentjustenoughtimeoptimizingtheindexersothatitwouldnotbeabottleneck.
Theseoptimizationsincludedbulkupdatestothedocumentindexandplacementofcriticaldatastructuresonthelocaldisk.
Theindexerrunsatroughly54pagespersecond.
Thesorterscanberuncompletelyinparallel;usingfourmachines,thewholeprocessofsortingtakesabout24hours.
5.
3SearchPerformanceImprovingtheperformanceofsearchwasnotthemajorfocusofourresearchuptothispoint.
ThecurrentversionofGoogleanswersmostqueriesinbetween1and10seconds.
ThistimeismostlydominatedbydiskIOoverNFS(sincedisksarespreadoveranumberofmachines).
Furthermore,Googledoesnothaveanyoptimizationssuchasquerycaching,subindicesoncommonterms,andothercommonoptimizations.
WeintendtospeedupGoogleconsiderablythroughdistributionandhardware,software,andalgorithmicimprovements.
Ourtargetistobeabletohandleseveralhundredqueriespersecond.
Table2hassomesamplequerytimesfromthecurrentversionofGoogle.
TheyarerepeatedtoshowthespeedupsresultingfromcachedIO.
StorageStatisticsTotalSizeofFetchedPages147.
8GBCompressedRepository53.
5GBShortInvertedIndex4.
1GBFullInvertedIndex37.
2GBLexicon293MBTemporaryAnchorData(notintotal)6.
6GBDocumentIndexIncl.
VariableWidthData9.
7GBLinksDatabase3.
9GBTotalWithoutRepository55.
2GBTotalWithRepository108.
7GBWebPageStatisticsNumberofWebPagesFetched24millionNumberofUrlsSeen76.
5millionNumberofEmailAddresses1.
7millionNumberof404's1.
6millionTable1.
Statistics6ConclusionsGoogleisdesignedtobeascalablesearchengine.
TheprimarygoalistoprovidehighqualitysearchresultsoverarapidlygrowingWorldWideWeb.
Googleemploysanumberoftechniquestoimprovesearchqualityincludingpagerank,anchortext,andproximityinformation.
Furthermore,Googleisacompletearchitectureforgatheringwebpages,indexingthem,andperformingsearchqueriesoverthem.
6.
1FutureWorkAlarge-scalewebsearchengineisacomplexsystemandmuchremainstobedone.
Ourimmediategoalsaretoimprovesearchefficiencyandtoscaletoapproximately100millionwebpages.
Somesimpleimprovementstoefficiencyincludequerycaching,smartdiskallocation,andsubindices.
Anotherareawhichrequiresmuchresearchisupdates.
Wemusthavesmartalgorithmstodecidewhatoldwebpagesshouldberecrawledandwhatnewonesshouldbecrawled.
Worktowardthisgoalhasbeendonein[Cho98].
Onepromisingareaofresearchisusingproxycachestobuildsearchdatabases,sincetheyaredemanddriven.
Weareplanningtoaddsimplefeaturessupportedbycommercialsearchengineslikebooleanoperators,negation,andstemming.
However,otherfeaturesarejuststartingtobeexploredsuchasrelevancefeedbackandclustering(Googlecurrentlysupportsasimplehostnamebasedclustering).
Wealsoplantosupportusercontext(liketheuser'slocation),andresultsummarization.
Wearealsoworkingtoextendtheuseoflinkstructureandlinktext.
SimpleexperimentsindicatePageRankcanbepersonalizedbyincreasingtheweightofauser'shomepageorbookmarks.
Asforlinktext,weareexperimentingwithusingtextsurroundinglinksinadditiontothelinktextitself.
AWebsearchengineisaveryrichenvironmentforresearchideas.
WehavefartoomanytolistheresowedonotexpectthisFutureWorksectiontobecomemuchshorterinthenearfuture.
6.
2HighQualitySearchThebiggestproblemfacingusersofwebsearchenginestodayisthequalityoftheresultstheygetback.
Whiletheresultsareoftenamusingandexpandusers'horizons,theyareoftenfrustratingandconsumeprecioustime.
Forexample,thetopresultforasearchfor"BillClinton"ononeofthemostpopularcommercialsearchengineswastheBillClintonJokeoftheDay:April14,1997.
GoogleisdesignedtoprovidehigherqualitysearchsoastheWebcontinuestogrowrapidly,informationcanbefoundeasily.
InordertoaccomplishthisGooglemakesheavyuseofhypertextualinformationconsistingoflinkstructureandlink(anchor)text.
Googlealsousesproximityandfontinformation.
Whileevaluationofasearchengineisdifficult,wehavesubjectivelyfoundthatGooglereturnshigherqualitysearchresultsthancurrentcommercialsearchengines.
TheanalysisoflinkstructureviaPageRankallowsGoogletoevaluatethequalityofwebpages.
Theuseoflinktextasadescriptionofwhatthelinkpointstohelpsthesearchenginereturnrelevant(andtosomedegreehighquality)results.
Finally,theuseofproximityinformationhelpsincreaserelevanceagreatdealformanyqueries.
InitialQuerySameQueryRepeated(IOmostlycached)QueryCPUTime(s)TotalTime(s)CPUTime(s)TotalTime(s)algore0.
092.
130.
060.
06vicepresident1.
773.
841.
661.
80harddisks0.
254.
860.
200.
24searchengines1.
319.
631.
161.
16Table2.
SearchTimes6.
3ScalableArchitectureAsidefromthequalityofsearch,Googleisdesignedtoscale.
Itmustbeefficientinbothspaceandtime,andconstantfactorsareveryimportantwhendealingwiththeentireWeb.
InimplementingGoogle,wehaveseenbottlenecksinCPU,memoryaccess,memorycapacity,diskseeks,diskthroughput,diskcapacity,andnetworkIO.
Googlehasevolvedtoovercomeanumberofthesebottlenecksduringvariousoperations.
Google'smajordatastructuresmakeefficientuseofavailablestoragespace.
Furthermore,thecrawling,indexing,andsortingoperationsareefficientenoughtobeabletobuildanindexofasubstantialportionoftheweb--24millionpages,inlessthanoneweek.
Weexpecttobeabletobuildanindexof100millionpagesinlessthanamonth.
6.
4AResearchToolInadditiontobeingahighqualitysearchengine,Googleisaresearchtool.
ThedataGooglehascollectedhasalreadyresultedinmanyotherpaperssubmittedtoconferencesandmanymoreontheway.
Recentresearchsuchas[Abiteboul97]hasshownanumberoflimitationstoqueriesabouttheWebthatmaybeansweredwithouthavingtheWebavailablelocally.
ThismeansthatGoogle(orasimilarsystem)isnotonlyavaluableresearchtoolbutanecessaryoneforawiderangeofapplications.
WehopeGooglewillbearesourceforsearchersandresearchersallaroundtheworldandwillsparkthenextgenerationofsearchenginetechnology.
7AcknowledgmentsScottHassanandAlanSteremberghavebeencriticaltothedevelopmentofGoogle.
Theirtalentedcontributionsareirreplaceable,andtheauthorsowethemmuchgratitude.
WewouldalsoliketothankHectorGarcia-Molina,RajeevMotwani,JeffUllman,andTerryWinogradandthewholeWebBasegroupfortheirsupportandinsightfuldiscussions.
FinallywewouldliketorecognizethegeneroussupportofourequipmentdonorsIBM,Intel,andSunandourfunders.
TheresearchdescribedherewasconductedaspartoftheStanfordIntegratedDigitalLibraryProject,supportedbytheNationalScienceFoundationunderCooperativeAgreementIRI-9411306.
FundingforthiscooperativeagreementisalsoprovidedbyDARPAandNASA,andbyIntervalResearch,andtheindustrialpartnersoftheStanfordDigitalLibrariesProject.
ReferencesBestoftheWeb1994--Navigatorshttp://botw.
org/1994/awards/navigators.
htmlBillClintonJokeoftheDay:April14,1997.
http://www.
io.
com/~cjburke/clinton/970414.
html.
Bzip2Homepagehttp://www.
muraroa.
demon.
co.
uk/GoogleSearchEnginehttp://google.
stanford.
edu/Harvesthttp://harvest.
transarc.
com/Mauldin,MichaelL.
LycosDesignChoicesinanInternetSearchService,IEEEExpertInterviewhttp://www.
computer.
org/pubs/expert/1997/trends/x1008/mauldin.
htmTheEffectofCellularPhoneUseUponDriverAttentionhttp://www.
webfirst.
com/aaa/text/cell/cell0toc.
htmSearchEngineWatchhttp://www.
searchenginewatch.
com/RFC1950(zlib)ftp://ftp.
uu.
net/graphics/png/documents/zlib/zdoc-index.
htmlRobotsExclusionProtocol:http://info.
webcrawler.
com/mak/projects/robots/exclusion.
htmWebGrowthSummary:http://www.
mit.
edu/people/mkgray/net/web-growth-summary.
htmlYahoo!
http://www.
yahoo.
com/[Abiteboul97]SergeAbiteboulandVictorVianu,QueriesandComputationontheWeb.
ProceedingsoftheInternationalConferenceonDatabaseTheory.
Delphi,Greece1997.
[Bagdikian97]BenH.
Bagdikian.
TheMediaMonopoly.
5thEdition.
Publisher:Beacon,ISBN:0807061557[Cho98]JunghooCho,HectorGarcia-Molina,LawrencePage.
EfficientCrawlingThroughURLOrdering.
SeventhInternationalWebConference(WWW98).
Brisbane,Australia,April14-18,1998.
[Gravano94]LuisGravano,HectorGarcia-Molina,andA.
Tomasic.
TheEffectivenessofGlOSSfortheText-DatabaseDiscoveryProblem.
Proc.
ofthe1994ACMSIGMODInternationalConferenceOnManagementOfData,1994.
[Kleinberg98]JonKleinberg,AuthoritativeSourcesinaHyperlinkedEnvironment,Proc.
ACM-SIAMSymposiumonDiscreteAlgorithms,1998.
[Marchiori97]MassimoMarchiori.
TheQuestforCorrectInformationontheWeb:HyperSearchEngines.
TheSixthInternationalWWWConference(WWW97).
SantaClara,USA,April7-11,1997.
[McBryan94]OliverA.
McBryan.
GENVLandWWWW:ToolsforTamingtheWeb.
FirstInternationalConferenceontheWorldWideWeb.
CERN,Geneva(Switzerland),May25-26-271994.
http://www.
cs.
colorado.
edu/home/mcbryan/mypapers/www94.
ps[Page98]LawrencePage,SergeyBrin,RajeevMotwani,TerryWinograd.
ThePageRankCitationRanking:BringingOrdertotheWeb.
Manuscriptinprogress.
http://google.
stanford.
edu/~backrub/pageranksub.
ps[Pinkerton94]BrianPinkerton,FindingWhatPeopleWant:ExperienceswiththeWebCrawler.
TheSecondInternationalWWWConferenceChicago,USA,October17-20,1994.
http://info.
webcrawler.
com/bp/WWW94.
html[Spertus97]EllenSpertus.
ParaSite:MiningStructuralInformationontheWeb.
TheSixthInternationalWWWConference(WWW97).
SantaClara,USA,April7-11,1997.
[TREC96]ProceedingsofthefifthTextREtrievalConference(TREC-5).
Gaithersburg,Maryland,November20-22,1996.
Publisher:DepartmentofCommerce,NationalInstituteofStandardsandTechnology.
Editors:D.
K.
HarmanandE.
M.
Voorhees.
Fulltextat:http://trec.
nist.
gov/[Witten94]IanHWitten,AlistairMoffat,andTimothyC.
Bell.
ManagingGigabytes:CompressingandIndexingDocumentsandImages.
NewYork:VanNostrandReinhold,1994.
[Weiss96]RonWeiss,BienvenidoVelez,MarkA.
Sheldon,ChanathipManprempre,PeterSzilagyi,AndrzejDuda,andDavidK.
Gifford.
HyPursuit:AHierarchicalNetworkSearchEnginethatExploitsContent-LinkHypertextClustering.
Proceedingsofthe7thACMConferenceonHypertext.
NewYork,1996.
VitaeSergeyBrinreceivedhisB.
S.
degreeinmathematicsandcomputersciencefromtheUniversityofMarylandatCollegeParkin1993.
Currently,heisaPh.
D.
candidateincomputerscienceatStanfordUniversitywherehereceivedhisM.
S.
in1995.
HeisarecipientofaNationalScienceFoundationGraduateFellowship.
Hisresearchinterestsincludesearchengines,informationextractionfromunstructuredsources,anddataminingoflargetextcollectionsandscientificdata.
LawrencePagewasborninEastLansing,Michigan,andreceivedaB.
S.
E.
inComputerEngineeringattheUniversityofMichiganAnnArborin1995.
HeiscurrentlyaPh.
D.
candidateinComputerScienceatStanfordUniversity.
Someofhisresearchinterestsincludethelinkstructureoftheweb,humancomputerinteraction,searchengines,scalabilityofinformationaccessinterfaces,andpersonaldatamining.
8AppendixA:AdvertisingandMixedMotivesCurrently,thepredominantbusinessmodelforcommercialsearchenginesisadvertising.
Thegoalsoftheadvertisingbusinessmodeldonotalwayscorrespondtoprovidingqualitysearchtousers.
Forexample,inourprototypesearchengineoneofthetopresultsforcellularphoneis"TheEffectofCellularPhoneUseUponDriverAttention",astudywhichexplainsingreatdetailthedistractionsandriskassociatedwithconversingonacellphonewhiledriving.
ThissearchresultcameupfirstbecauseofitshighimportanceasjudgedbythePageRankalgorithm,anapproximationofcitationimportanceontheweb[Page,98].
Itisclearthatasearchenginewhichwastakingmoneyforshowingcellularphoneadswouldhavedifficultyjustifyingthepagethatoursystemreturnedtoitspayingadvertisers.
Forthistypeofreasonandhistoricalexperiencewithothermedia[Bagdikian83],weexpectthatadvertisingfundedsearchengineswillbeinherentlybiasedtowardstheadvertisersandawayfromtheneedsoftheconsumers.
Sinceitisverydifficultevenforexpertstoevaluatesearchengines,searchenginebiasisparticularlyinsidious.
AgoodexamplewasOpenText,whichwasreportedtobesellingcompaniestherighttobelistedatthetopofthesearchresultsforparticularqueries[Marchiori97].
Thistypeofbiasismuchmoreinsidiousthanadvertising,becauseitisnotclearwho"deserves"tobethere,andwhoiswillingtopaymoneytobelisted.
Thisbusinessmodelresultedinanuproar,andOpenTexthasceasedtobeaviablesearchengine.
Butlessblatantbiasarelikelytobetoleratedbythemarket.
Forexample,asearchenginecouldaddasmallfactortosearchresultsfrom"friendly"companies,andsubtractafactorfromresultsfromcompetitors.
Thistypeofbiasisverydifficulttodetectbutcouldstillhaveasignificanteffectonthemarket.
Furthermore,advertisingincomeoftenprovidesanincentivetoprovidepoorqualitysearchresults.
Forexample,wenoticedamajorsearchenginewouldnotreturnalargeairline'shomepagewhentheairline'snamewasgivenasaquery.
Itsohappenedthattheairlinehadplacedanexpensivead,linkedtothequerythatwasitsname.
Abettersearchenginewouldnothaverequiredthisad,andpossiblyresultedinthelossoftherevenuefromtheairlinetothesearchengine.
Ingeneral,itcouldbearguedfromtheconsumerpointofviewthatthebetterthesearchengineis,thefeweradvertisementswillbeneededfortheconsumertofindwhattheywant.
Thisofcourseerodestheadvertisingsupportedbusinessmodeloftheexistingsearchengines.
However,therewillalwaysbemoneyfromadvertiserswhowantacustomertoswitchproducts,orhavesomethingthatisgenuinelynew.
Butwebelievetheissueofadvertisingcausesenoughmixedincentivesthatitiscrucialtohaveacompetitivesearchenginethatistransparentandintheacademicrealm.
9AppendixB:Scalability9.
1ScalabilityofGoogleWehavedesignedGoogletobescalableintheneartermtoagoalof100millionwebpages.
Wehavejustreceiveddiskandmachinestohandleroughlythatamount.
Allofthetimeconsumingpartsofthesystemareparallelizeandroughlylineartime.
Theseincludethingslikethecrawlers,indexers,andsorters.
Wealsothinkthatmostofthedatastructureswilldealgracefullywiththeexpansion.
However,at100millionwebpageswewillbeverycloseupagainstallsortsofoperatingsystemlimitsinthecommonoperatingsystems(currentlywerunonbothSolarisandLinux).
Theseincludethingslikeaddressablememory,numberofopenfiledescriptors,networksocketsandbandwidth,andmanyothers.
Webelieveexpandingtoalotmorethan100millionpageswouldgreatlyincreasethecomplexityofoursystem.
9.
2ScalabilityofCentralizedIndexingArchitecturesAsthecapabilitiesofcomputersincrease,itbecomespossibletoindexaverylargeamountoftextforareasonablecost.
Ofcourse,othermorebandwidthintensivemediasuchasvideoislikelytobecomemorepervasive.
But,becausethecostofproductionoftextislowcomparedtomedialikevideo,textislikelytoremainverypervasive.
Also,itislikelythatsoonwewillhavespeechrecognitionthatdoesareasonablejobconvertingspeechintotext,expandingtheamountoftextavailable.
Allofthisprovidesamazingpossibilitiesforcentralizedindexing.
Hereisanillustrativeexample.
WeassumewewanttoindexeverythingeveryoneintheUShaswrittenforayear.
Weassumethatthereare250millionpeopleintheUSandtheywriteanaverageof10kperday.
Thatworksouttobeabout850terabytes.
Alsoassumethatindexingaterabytecanbedonenowforareasonablecost.
Wealsoassumethattheindexingmethodsusedoverthetextarelinear,ornearlylinearintheircomplexity.
Givenalltheseassumptionswecancomputehowlongitwouldtakebeforewecouldindexour850terabytesforareasonablecostassumingcertaingrowthfactors.
Moore'sLawwasdefinedin1965asadoublingevery18monthsinprocessorpower.
Ithasheldremarkablytrue,notjustforprocessors,butforotherimportantsystemparameterssuchasdiskaswell.
IfweassumethatMoore'slawholdsforthefuture,weneedonly10moredoublings,or15yearstoreachourgoalofindexingeverythingeveryoneintheUShaswrittenforayearforapricethatasmallcompanycouldafford.
Ofcourse,hardwareexpertsaresomewhatconcernedMoore'sLawmaynotcontinuetoholdforthenext15years,buttherearecertainlyalotofinterestingcentralizedapplicationsevenifweonlygetpartofthewaytoourhypotheticalexample.
OfcourseadistributedsystemslikeGloss[Gravano94]orHarvestwilloftenbethemostefficientandeleganttechnicalsolutionforindexing,butitseemsdifficulttoconvincetheworldtousethesesystemsbecauseofthehighadministrationcostsofsettinguplargenumbersofinstallations.
Ofcourse,itisquitelikelythatreducingtheadministrationcostdrasticallyispossible.
Ifthathappens,andeveryonestartsrunningadistributedindexingsystem,searchingwouldcertainlyimprovedrastically.
Becausehumanscanonlytypeorspeakafiniteamount,andascomputerscontinueimproving,textindexingwillscaleevenbetterthanitdoesnow.
Ofcoursetherecouldbeaninfiniteamountofmachinegeneratedcontent,butjustindexinghugeamountsofhumangeneratedcontentseemstremendouslyuseful.
Soweareoptimisticthatourcentralizedwebsearchenginearchitecturewillimproveinitsabilitytocoverthepertinenttextinformationovertimeandthatthereisabrightfutureforsearch.

展开全文