Clintonpr查询

pr查询  时间:2021-04-18  阅读:()
TheAnatomyofaLarge-ScaleHypertextualWebSearchEngineSergeyBrinandLawrencePageComputerScienceDepartment,StanfordUniversity,Stanford,CA94305,USAsergey@cs.
stanford.
eduandpage@cs.
stanford.
eduAbstractInthispaper,wepresentGoogle,aprototypeofalarge-scalesearchenginewhichmakesheavyuseofthestructurepresentinhypertext.
GoogleisdesignedtocrawlandindextheWebefficientlyandproducemuchmoresatisfyingsearchresultsthanexistingsystems.
Theprototypewithafulltextandhyperlinkdatabaseofatleast24millionpagesisavailableathttp://google.
stanford.
edu/Toengineerasearchengineisachallengingtask.
Searchenginesindextenstohundredsofmillionsofwebpagesinvolvingacomparablenumberofdistinctterms.
Theyanswertensofmillionsofquerieseveryday.
Despitetheimportanceoflarge-scalesearchenginesontheweb,verylittleacademicresearchhasbeendoneonthem.
Furthermore,duetorapidadvanceintechnologyandwebproliferation,creatingawebsearchenginetodayisverydifferentfromthreeyearsago.
Thispaperprovidesanin-depthdescriptionofourlarge-scalewebsearchengine--thefirstsuchdetailedpublicdescriptionweknowoftodate.
Apartfromtheproblemsofscalingtraditionalsearchtechniquestodataofthismagnitude,therearenewtechnicalchallengesinvolvedwithusingtheadditionalinformationpresentinhypertexttoproducebettersearchresults.
Thispaperaddressesthisquestionofhowtobuildapracticallarge-scalesystemwhichcanexploittheadditionalinformationpresentinhypertext.
Alsowelookattheproblemofhowtoeffectivelydealwithuncontrolledhypertextcollectionswhereanyonecanpublishanythingtheywant.
KeywordsWorldWideWeb,SearchEngines,InformationRetrieval,PageRank,Google1.
Introduction(Note:Therearetwoversionsofthispaper--alongerfullversionandashorterprintedversion.
ThefullversionisavailableonthewebandtheconferenceCD-ROM.
)Thewebcreatesnewchallengesforinformationretrieval.
Theamountofinformationonthewebisgrowingrapidly,aswellasthenumberofnewusersinexperiencedintheartofwebresearch.
Peoplearelikelytosurfthewebusingitslinkgraph,oftenstartingwithhighqualityhumanmaintainedindicessuchasYahoo!
orwithsearchengines.
Humanmaintainedlistscoverpopulartopicseffectivelybutaresubjective,expensivetobuildandmaintain,slowtoimprove,andcannotcoverallesoterictopics.
Automatedsearchenginesthatrelyonkeywordmatchingusuallyreturntoomanylowqualitymatches.
Tomakemattersworse,someadvertisersattempttogainpeople'sattentionbytakingmeasuresmeanttomisleadautomatedsearchengines.
Wehavebuiltalarge-scalesearchenginewhichaddressesmanyoftheproblemsofexistingsystems.
Itmakesespeciallyheavyuseoftheadditionalstructurepresentinhypertexttoprovidemuchhigherqualitysearchresults.
Wechoseoursystemname,Google,becauseitisacommonspellingofgoogol,or10100andfitswellwithourgoalofbuildingverylarge-scalesearchengines.
1.
1WebSearchEngines--ScalingUp:1994-2000Searchenginetechnologyhashadtoscaledramaticallytokeepupwiththegrowthoftheweb.
In1994,oneofthefirstwebsearchengines,theWorldWideWebWorm(WWWW)[McBryan94]hadanindexof110,000webpagesandwebaccessibledocuments.
AsofNovember,1997,thetopsearchenginesclaimtoindexfrom2million(WebCrawler)to100millionwebdocuments(fromSearchEngineWatch).
Itisforeseeablethatbytheyear2000,acomprehensiveindexoftheWebwillcontainoverabilliondocuments.
Atthesametime,thenumberofqueriessearchengineshandlehasgrownincrediblytoo.
InMarchandApril1994,theWorldWideWebWormreceivedanaverageofabout1500queriesperday.
InNovember1997,Altavistaclaimedithandledroughly20millionqueriesperday.
Withtheincreasingnumberofusersontheweb,andautomatedsystemswhichquerysearchengines,itislikelythattopsearchengineswillhandlehundredsofmillionsofqueriesperdaybytheyear2000.
Thegoalofoursystemistoaddressmanyoftheproblems,bothinqualityandscalability,introducedbyscalingsearchenginetechnologytosuchextraordinarynumbers.
1.
2.
Google:ScalingwiththeWebCreatingasearchenginewhichscaleseventotoday'swebpresentsmanychallenges.
Fastcrawlingtechnologyisneededtogatherthewebdocumentsandkeepthemuptodate.
Storagespacemustbeusedefficientlytostoreindicesand,optionally,thedocumentsthemselves.
Theindexingsystemmustprocesshundredsofgigabytesofdataefficiently.
Queriesmustbehandledquickly,atarateofhundredstothousandspersecond.
ThesetasksarebecomingincreasinglydifficultastheWebgrows.
However,hardwareperformanceandcosthaveimproveddramaticallytopartiallyoffsetthedifficulty.
Thereare,however,severalnotableexceptionstothisprogresssuchasdiskseektimeandoperatingsystemrobustness.
IndesigningGoogle,wehaveconsideredboththerateofgrowthoftheWebandtechnologicalchanges.
Googleisdesignedtoscalewelltoextremelylargedatasets.
Itmakesefficientuseofstoragespacetostoretheindex.
Itsdatastructuresareoptimizedforfastandefficientaccess(seesection4.
2).
Further,weexpectthatthecosttoindexandstoretextorHTMLwilleventuallydeclinerelativetotheamountthatwillbeavailable(seeAppendixB).
ThiswillresultinfavorablescalingpropertiesforcentralizedsystemslikeGoogle.
1.
3DesignGoals1.
3.
1ImprovedSearchQualityOurmaingoalistoimprovethequalityofwebsearchengines.
In1994,somepeoplebelievedthatacompletesearchindexwouldmakeitpossibletofindanythingeasily.
AccordingtoBestoftheWeb1994--Navigators,"ThebestnavigationserviceshouldmakeiteasytofindalmostanythingontheWeb(onceallthedataisentered).
"However,theWebof1997isquitedifferent.
Anyonewhohasusedasearchenginerecently,canreadilytestifythatthecompletenessoftheindexisnottheonlyfactorinthequalityofsearchresults.
"Junkresults"oftenwashoutanyresultsthatauserisinterestedin.
Infact,asofNovember1997,onlyoneofthetopfourcommercialsearchenginesfindsitself(returnsitsownsearchpageinresponsetoitsnameinthetoptenresults).
Oneofthemaincausesofthisproblemisthatthenumberofdocumentsintheindiceshasbeenincreasingbymanyordersofmagnitude,buttheuser'sabilitytolookatdocumentshasnot.
Peoplearestillonlywillingtolookatthefirstfewtensofresults.
Becauseofthis,asthecollectionsizegrows,weneedtoolsthathaveveryhighprecision(numberofrelevantdocumentsreturned,sayinthetoptensofresults).
Indeed,wewantournotionof"relevant"toonlyincludetheverybestdocumentssincetheremaybetensofthousandsofslightlyrelevantdocuments.
Thisveryhighprecisionisimportantevenattheexpenseofrecall(thetotalnumberofrelevantdocumentsthesystemisabletoreturn).
Thereisquiteabitofrecentoptimismthattheuseofmorehypertextualinformationcanhelpimprovesearchandotherapplications[Marchiori97][Spertus97][Weiss96][Kleinberg98].
Inparticular,linkstructure[Page98]andlinktextprovidealotofinformationformakingrelevancejudgmentsandqualityfiltering.
Googlemakesuseofbothlinkstructureandanchortext(seeSections2.
1and2.
2).
1.
3.
2AcademicSearchEngineResearchAsidefromtremendousgrowth,theWebhasalsobecomeincreasinglycommercialovertime.
In1993,1.
5%ofwebserverswereon.
comdomains.
Thisnumbergrewtoover60%in1997.
Atthesametime,searchengineshavemigratedfromtheacademicdomaintothecommercial.
Upuntilnowmostsearchenginedevelopmenthasgoneonatcompanieswithlittlepublicationoftechnicaldetails.
Thiscausessearchenginetechnologytoremainlargelyablackartandtobeadvertisingoriented(seeAppendixA).
WithGoogle,wehaveastronggoaltopushmoredevelopmentandunderstandingintotheacademicrealm.
Anotherimportantdesigngoalwastobuildsystemsthatreasonablenumbersofpeoplecanactuallyuse.
Usagewasimportanttousbecausewethinksomeofthemostinterestingresearchwillinvolveleveragingthevastamountofusagedatathatisavailablefrommodernwebsystems.
Forexample,therearemanytensofmillionsofsearchesperformedeveryday.
However,itisverydifficulttogetthisdata,mainlybecauseitisconsideredcommerciallyvaluable.
Ourfinaldesigngoalwastobuildanarchitecturethatcansupportnovelresearchactivitiesonlarge-scalewebdata.
Tosupportnovelresearchuses,Googlestoresalloftheactualdocumentsitcrawlsincompressedform.
OneofourmaingoalsindesigningGooglewastosetupanenvironmentwhereotherresearcherscancomeinquickly,processlargechunksoftheweb,andproduceinterestingresultsthatwouldhavebeenverydifficulttoproduceotherwise.
Intheshorttimethesystemhasbeenup,therehavealreadybeenseveralpapersusingdatabasesgeneratedbyGoogle,andmanyothersareunderway.
AnothergoalwehaveistosetupaSpacelab-likeenvironmentwhereresearchersorevenstudentscanproposeanddointerestingexperimentsonourlarge-scalewebdata.
2.
SystemFeaturesTheGooglesearchenginehastwoimportantfeaturesthathelpitproducehighprecisionresults.
First,itmakesuseofthelinkstructureoftheWebtocalculateaqualityrankingforeachwebpage.
ThisrankingiscalledPageRankandisdescribedindetailin[Page98].
Second,Googleutilizeslinktoimprovesearchresults.
2.
1PageRank:BringingOrdertotheWebThecitation(link)graphofthewebisanimportantresourcethathaslargelygoneunusedinexistingwebsearchengines.
Wehavecreatedmapscontainingasmanyas518millionofthesehyperlinks,asignificantsampleofthetotal.
Thesemapsallowrapidcalculationofawebpage's"PageRank",anobjectivemeasureofitscitationimportancethatcorrespondswellwithpeople'ssubjectiveideaofimportance.
Becauseofthiscorrespondence,PageRankisanexcellentwaytoprioritizetheresultsofwebkeywordsearches.
Formostpopularsubjects,asimpletextmatchingsearchthatisrestrictedtowebpagetitlesperformsadmirablywhenPageRankprioritizestheresults(demoavailableatgoogle.
stanford.
edu).
ForthetypeoffulltextsearchesinthemainGooglesystem,PageRankalsohelpsagreatdeal.
2.
1.
1DescriptionofPageRankCalculationAcademiccitationliteraturehasbeenappliedtotheweb,largelybycountingcitationsorbacklinkstoagivenpage.
Thisgivessomeapproximationofapage'simportanceorquality.
PageRankextendsthisideabynotcountinglinksfromallpagesequally,andbynormalizingbythenumberoflinksonapage.
PageRankisdefinedasfollows:WeassumepageAhaspagesT1.
.
.
Tnwhichpointtoit(i.
e.
,arecitations).
Theparameterdisadampingfactorwhichcanbesetbetween0and1.
Weusuallysetdto0.
85.
Therearemoredetailsaboutdinthenextsection.
AlsoC(A)isdefinedasthenumberoflinksgoingoutofpageA.
ThePageRankofapageAisgivenasfollows:PR(A)=(1-d)+d(PR(T1)/C(T1)PR(Tn)/C(Tn))NotethatthePageRanksformaprobabilitydistributionoverwebpages,sothesumofallwebpages'PageRankswillbeone.
PageRankorPR(A)canbecalculatedusingasimpleiterativealgorithm,andcorrespondstotheprincipaleigenvectorofthenormalizedlinkmatrixoftheweb.
Also,aPageRankfor26millionwebpagescanbecomputedinafewhoursonamediumsizeworkstation.
Therearemanyotherdetailswhicharebeyondthescopeofthispaper.
2.
1.
2IntuitiveJustificationPageRankcanbethoughtofasamodelofuserbehavior.
Weassumethereisa"randomsurfer"whoisgivenawebpageatrandomandkeepsclickingonlinks,neverhitting"back"buteventuallygetsboredandstartsonanotherrandompage.
TheprobabilitythattherandomsurfervisitsapageisitsPageRank.
And,theddampingfactoristheprobabilityateachpagethe"randomsurfer"willgetboredandrequestanotherrandompage.
Oneimportantvariationistoonlyaddthedampingfactordtoasinglepage,oragroupofpages.
Thisallowsforpersonalizationandcanmakeitnearlyimpossibletodeliberatelymisleadthesysteminordertogetahigherranking.
WehaveseveralotherextensionstoPageRank,againsee[Page98].
AnotherintuitivejustificationisthatapagecanhaveahighPageRankiftherearemanypagesthatpointtoit,oriftherearesomepagesthatpointtoitandhaveahighPageRank.
Intuitively,pagesthatarewellcitedfrommanyplacesaroundthewebareworthlookingat.
Also,pagesthathaveperhapsonlyonecitationfromsomethingliketheYahoo!
homepagearealsogenerallyworthlookingat.
Ifapagewasnothighquality,orwasabrokenlink,itisquitelikelythatYahoo'shomepagewouldnotlinktoit.
PageRankhandlesboththesecasesandeverythinginbetweenbyrecursivelypropagatingweightsthroughthelinkstructureoftheweb.
2.
2AnchorTextThetextoflinksistreatedinaspecialwayinoursearchengine.
Mostsearchenginesassociatethetextofalinkwiththepagethatthelinkison.
Inaddition,weassociateitwiththepagethelinkpointsto.
Thishasseveraladvantages.
First,anchorsoftenprovidemoreaccuratedescriptionsofwebpagesthanthepagesthemselves.
Second,anchorsmayexistfordocumentswhichcannotbeindexedbyatext-basedsearchengine,suchasimages,programs,anddatabases.
Thismakesitpossibletoreturnwebpageswhichhavenotactuallybeencrawled.
Notethatpagesthathavenotbeencrawledcancauseproblems,sincetheyarenevercheckedforvaliditybeforebeingreturnedtotheuser.
Inthiscase,thesearchenginecanevenreturnapagethatneveractuallyexisted,buthadhyperlinkspointingtoit.
However,itispossibletosorttheresults,sothatthisparticularproblemrarelyhappens.
ThisideaofpropagatinganchortexttothepageitreferstowasimplementedintheWorldWideWebWorm[McBryan94]especiallybecauseithelpssearchnon-textinformation,andexpandsthesearchcoveragewithfewerdownloadeddocuments.
Weuseanchorpropagationmostlybecauseanchortextcanhelpprovidebetterqualityresults.
Usinganchortextefficientlyistechnicallydifficultbecauseofthelargeamountsofdatawhichmustbeprocessed.
Inourcurrentcrawlof24millionpages,wehadover259millionanchorswhichweindexed.
2.
3OtherFeaturesAsidefromPageRankandtheuseofanchortext,Googlehasseveralotherfeatures.
First,ithaslocationinformationforallhitsandsoitmakesextensiveuseofproximityinsearch.
Second,Googlekeepstrackofsomevisualpresentationdetailssuchasfontsizeofwords.
Wordsinalargerorbolderfontareweightedhigherthanotherwords.
Third,fullrawHTMLofpagesisavailableinarepository.
3RelatedWorkSearchresearchonthewebhasashortandconcisehistory.
TheWorldWideWebWorm(WWWW)[McBryan94]wasoneofthefirstwebsearchengines.
Itwassubsequentlyfollowedbyseveralotheracademicsearchengines,manyofwhicharenowpubliccompanies.
ComparedtothegrowthoftheWebandtheimportanceofsearchenginestherearepreciousfewdocumentsaboutrecentsearchengines[Pinkerton94].
AccordingtoMichaelMauldin(chiefscientist,LycosInc)[Mauldin],"thevariousservices(includingLycos)closelyguardthedetailsofthesedatabases".
However,therehasbeenafairamountofworkonspecificfeaturesofsearchengines.
Especiallywellrepresentedisworkwhichcangetresultsbypost-processingtheresultsofexistingcommercialsearchengines,orproducesmallscale"individualized"searchengines.
Finally,therehasbeenalotofresearchoninformationretrievalsystems,especiallyonwellcontrolledcollections.
Inthenexttwosections,wediscusssomeareaswherethisresearchneedstobeextendedtoworkbetterontheweb.
3.
1InformationRetrievalWorkininformationretrievalsystemsgoesbackmanyyearsandiswelldeveloped[Witten94].
However,mostoftheresearchoninformationretrievalsystemsisonsmallwellcontrolledhomogeneouscollectionssuchascollectionsofscientificpapersornewsstoriesonarelatedtopic.
Indeed,theprimarybenchmarkforinformationretrieval,theTextRetrievalConference[TREC96],usesafairlysmall,wellcontrolledcollectionfortheirbenchmarks.
The"VeryLargeCorpus"benchmarkisonly20GBcomparedtothe147GBfromourcrawlof24millionwebpages.
ThingsthatworkwellonTRECoftendonotproducegoodresultsontheweb.
Forexample,thestandardvectorspacemodeltriestoreturnthedocumentthatmostcloselyapproximatesthequery,giventhatbothqueryanddocumentarevectorsdefinedbytheirwordoccurrence.
Ontheweb,thisstrategyoftenreturnsveryshortdocumentsthatarethequeryplusafewwords.
Forexample,wehaveseenamajorsearchenginereturnapagecontainingonly"BillClintonSucks"andpicturefroma"BillClinton"query.
Somearguethatontheweb,usersshouldspecifymoreaccuratelywhattheywantandaddmorewordstotheirquery.
Wedisagreevehementlywiththisposition.
Ifauserissuesaquerylike"BillClinton"theyshouldgetreasonableresultssincethereisaenormousamountofhighqualityinformationavailableonthistopic.
Givenexampleslikethese,webelievethatthestandardinformationretrievalworkneedstobeextendedtodealeffectivelywiththeweb.
3.
2DifferencesBetweentheWebandWellControlledCollectionsThewebisavastcollectionofcompletelyuncontrolledheterogeneousdocuments.
Documentsonthewebhaveextremevariationinternaltothedocuments,andalsointheexternalmetainformationthatmightbeavailable.
Forexample,documentsdifferinternallyintheirlanguage(bothhumanandprogramming),vocabulary(emailaddresses,links,zipcodes,phonenumbers,productnumbers),typeorformat(text,HTML,PDF,images,sounds),andmayevenbemachinegenerated(logfilesoroutputfromadatabase).
Ontheotherhand,wedefineexternalmetainformationasinformationthatcanbeinferredaboutadocument,butisnotcontainedwithinit.
Examplesofexternalmetainformationincludethingslikereputationofthesource,updatefrequency,quality,popularityorusage,andcitations.
Notonlyarethepossiblesourcesofexternalmetainformationvaried,butthethingsthatarebeingmeasuredvarymanyordersofmagnitudeaswell.
Forexample,comparetheusageinformationfromamajorhomepage,likeYahoo'swhichcurrentlyreceivesmillionsofpageviewseverydaywithanobscurehistoricalarticlewhichmightreceiveonevieweverytenyears.
Clearly,thesetwoitemsmustbetreatedverydifferentlybyasearchengine.
Anotherbigdifferencebetweenthewebandtraditionalwellcontrolledcollectionsisthatthereisvirtuallynocontroloverwhatpeoplecanputontheweb.
Couplethisflexibilitytopublishanythingwiththeenormousinfluenceofsearchenginestoroutetrafficandcompanieswhichdeliberatelymanipulatingsearchenginesforprofitbecomeaseriousproblem.
Thisproblemthathasnotbeenaddressedintraditionalclosedinformationretrievalsystems.
Also,itisinterestingtonotethatmetadataeffortshavelargelyfailedwithwebsearchengines,becauseanytextonthepagewhichisnotdirectlyrepresentedtotheuserisabusedtomanipulatesearchengines.
Thereareevennumerouscompanieswhichspecializeinmanipulatingsearchenginesforprofit.
4SystemAnatomyFirst,wewillprovideahighleveldiscussionofthearchitecture.
Then,thereissomein-depthdescriptionsofimportantdatastructures.
Finally,themajorapplications:crawling,indexing,andsearchingwillbeexaminedindepth.
4.
1GoogleArchitectureOverviewInthissection,wewillgiveahighleveloverviewofhowthewholesystemworksaspicturedinFigure1.
Furthersectionswilldiscusstheapplicationsanddatastructuresnotmentionedinthissection.
MostofGoogleisimplementedinCorC++forefficiencyandcanrunineitherSolarisorLinux.
InGoogle,thewebcrawling(downloadingofwebpages)isdonebyseveraldistributedcrawlers.
ThereisaURLserverthatsendslistsofURLstobefetchedtothecrawlers.
Thewebpagesthatarefetchedarethensenttothestoreserver.
Thestoreserverthencompressesandstoresthewebpagesintoarepository.
EverywebpagehasanassociatedIDnumbercalledadocIDwhichisassignedwheneveranewURLisparsedoutofawebpage.
Theindexingfunctionisperformedbytheindexerandthesorter.
Theindexerperformsanumberoffunctions.
Itreadstherepository,uncompressesthedocuments,andparsesthem.
Eachdocumentisconvertedintoasetofwordoccurrencescalledhits.
Thehitsrecordtheword,positionindocument,anapproximationoffontsize,andcapitalization.
Theindexerdistributesthesehitsintoasetof"barrels",creatingapartiallysortedforwardindex.
Theindexerperformsanotherimportantfunction.
Itparsesoutallthelinksineverywebpageandstoresimportantinformationabouttheminananchorsfile.
Thisfilecontainsenoughinformationtodeterminewhereeachlinkpointsfromandto,andthetextofthelink.
TheURLresolverreadstheanchorsfileandconvertsrelativeURLsintoabsoluteURLsandinturnintodocIDs.
Itputstheanchortextintotheforwardindex,associatedwiththedocIDthattheanchorpointsto.
ItalsogeneratesadatabaseoflinkswhicharepairsofdocIDs.
ThelinksdatabaseisusedtocomputePageRanksforallthedocuments.
Thesortertakesthebarrels,whicharesortedbydocID(thisisasimplification,seeSection4.
2.
5),andresortsthembywordIDtogeneratetheinvertedindex.
Thisisdoneinplacesothatlittletemporaryspaceisneededforthisoperation.
ThesorteralsoproducesalistofwordIDsandoffsetsintotheinvertedindex.
AprogramcalledDumpLexicontakesthislisttogetherwiththelexiconproducedbytheindexerandgeneratesanewlexicontobeusedbythesearcher.
ThesearcherisrunbyawebserverandusesthelexiconbuiltbyDumpLexicontogetherwiththeinvertedindexandthePageRankstoanswerqueries.
4.
2MajorDataStructuresGoogle'sdatastructuresareoptimizedsothatalargedocumentcollectioncanbecrawled,indexed,andsearchedwithlittlecost.
Although,CPUsandbulkinputoutputrateshaveimproveddramaticallyovertheyears,adiskseekstillrequiresabout10mstocomplete.
Googleisdesignedtoavoiddiskseekswheneverpossible,andthishashadaconsiderableinfluenceonthedesignofthedatastructures.
Figure1.
HighLevelGoogleArchitecture4.
2.
1BigFilesBigFilesarevirtualfilesspanningmultiplefilesystemsandareaddressableby64bitintegers.
Theallocationamongmultiplefilesystemsishandledautomatically.
TheBigFilespackagealsohandlesallocationanddeallocationoffiledescriptors,sincetheoperatingsystemsdonotprovideenoughforourneeds.
BigFilesalsosupportrudimentarycompressionoptions.
4.
2.
2RepositoryTherepositorycontainsthefullHTMLofeverywebpage.
Eachpageiscompressedusingzlib(seeRFC1950).
Thechoiceofcompressiontechniqueisatradeoffbetweenspeedandcompressionratio.
Wechosezlib'sspeedoverasignificantimprovementincompressionofferedbybzip.
Thecompressionrateofbzipwasapproximately4to1ontherepositoryascomparedtozlib's3to1compression.
Intherepository,thedocumentsarestoredoneaftertheotherandareprefixedbydocID,length,andURLascanbeseeninFigure2.
Therepositoryrequiresnootherdatastructurestobeusedinordertoaccessit.
Thishelpswithdataconsistencyandmakesdevelopmentmucheasier;wecanrebuildalltheotherdatastructuresfromonlytherepositoryandafilewhichlistscrawlererrors.
4.
2.
3DocumentIndexThedocumentindexkeepsinformationabouteachdocument.
ItisafixedwidthISAM(Indexsequentialaccessmode)index,orderedbydocID.
Theinformationstoredineachentryincludesthecurrentdocumentstatus,apointerintotherepository,adocumentchecksum,andvariousstatistics.
Ifthedocumenthasbeencrawled,italsocontainsapointerintoavariablewidthfilecalleddocinfowhichcontainsitsURLandtitle.
OtherwisethepointerpointsintotheURLlistwhichcontainsjusttheURL.
Thisdesigndecisionwasdrivenbythedesiretohaveareasonablycompactdatastructure,andtheabilitytofetcharecordinonediskseekduringasearchAdditionally,thereisafilewhichisusedtoconvertURLsintodocIDs.
ItisalistofURLchecksumswiththeircorrespondingdocIDsandissortedbychecksum.
InordertofindthedocIDofaparticularURL,theURL'schecksumiscomputedandabinarysearchisperformedonthechecksumsfiletofinditsdocID.
URLsmaybeconvertedintodocIDsinbatchbydoingamergewiththisfile.
ThisisthetechniquetheURLresolverusestoturnURLsintodocIDs.
Thisbatchmodeofupdateiscrucialbecauseotherwisewemustperformoneseekforeverylinkwhichassumingonediskwouldtakemorethanamonthforour322millionlinkdataset.
4.
2.
4LexiconThelexiconhasseveraldifferentforms.
Oneimportantchangefromearliersystemsisthatthelexiconcanfitinmemoryforareasonableprice.
Inthecurrentimplementationwecankeepthelexiconinmemoryonamachinewith256MBofmainmemory.
Thecurrentlexiconcontains14millionwords(thoughsomerarewordswerenotaddedtothelexicon).
Itisimplementedintwoparts--alistofthewords(concatenatedtogetherbutseparatedbynulls)andahashtableofpointers.
Forvariousfunctions,Figure2.
RepositoryDataStructurethelistofwordshassomeauxiliaryinformationwhichisbeyondthescopeofthispapertoexplainfully.
4.
2.
5HitListsAhitlistcorrespondstoalistofoccurrencesofaparticularwordinaparticulardocumentincludingposition,font,andcapitalizationinformation.
Hitlistsaccountformostofthespaceusedinboththeforwardandtheinvertedindices.
Becauseofthis,itisimportanttorepresentthemasefficientlyaspossible.
Weconsideredseveralalternativesforencodingposition,font,andcapitalization--simpleencoding(atripleofintegers),acompactencoding(ahandoptimizedallocationofbits),andHuffmancoding.
IntheendwechoseahandoptimizedcompactencodingsinceitrequiredfarlessspacethanthesimpleencodingandfarlessbitmanipulationthanHuffmancoding.
ThedetailsofthehitsareshowninFigure3.
Ourcompactencodingusestwobytesforeveryhit.
Therearetwotypesofhits:fancyhitsandplainhits.
FancyhitsincludehitsoccurringinaURL,title,anchortext,ormetatag.
Plainhitsincludeeverythingelse.
Aplainhitconsistsofacapitalizationbit,fontsize,and12bitsofwordpositioninadocument(allpositionshigherthan4095arelabeled4096).
Fontsizeisrepresentedrelativetotherestofthedocumentusingthreebits(only7valuesareactuallyusedbecause111istheflagthatsignalsafancyhit).
Afancyhitconsistsofacapitalizationbit,thefontsizesetto7toindicateitisafancyhit,4bitstoencodethetypeoffancyhit,and8bitsofposition.
Foranchorhits,the8bitsofpositionaresplitinto4bitsforpositioninanchorand4bitsforahashofthedocIDtheanchoroccursin.
Thisgivesussomelimitedphrasesearchingaslongastherearenotthatmanyanchorsforaparticularword.
WeexpecttoupdatethewaythatanchorhitsarestoredtoallowforgreaterresolutioninthepositionanddocIDhashfields.
Weusefontsizerelativetotherestofthedocumentbecausewhensearching,youdonotwanttorankotherwiseidenticaldocumentsdifferentlyjustbecauseoneofthedocumentsisinalargerfont.
Thelengthofahitlistisstoredbeforethehitsthemselves.
Tosavespace,thelengthofthehitlistiscombinedwiththewordIDintheforwardindexandthedocIDintheinvertedindex.
Thislimitsitto8and5bitsrespectively(therearesometrickswhichallow8bitstobeborrowedfromthewordID).
Ifthelengthislongerthanwouldfitinthatmanybits,anescapecodeisusedinthosebits,andthenexttwobytescontaintheactuallength.
4.
2.
6ForwardIndexTheforwardindexisactuallyalreadypartiallysorted.
Itisstoredinanumberofbarrels(weused64).
EachbarrelholdsarangeofwordID's.
Ifadocumentcontainswordsthatfallintoaparticularbarrel,thedocIDisrecordedintothebarrel,followedbyalistofwordID'swithhitlistswhichcorrespondtothosewords.
ThisschemerequiresslightlymorestoragebecauseofduplicateddocIDsbutthedifferenceisverysmallforareasonablenumberofbucketsandsavesconsiderabletimeandcodingcomplexityinthefinalindexingphasedonebythesorter.
Furthermore,insteadofstoringactualwordID's,westoreeachwordIDasarelativedifferencefromtheminimumwordIDthatfallsintotheFigure3.
ForwardandReverseIndexesandtheLexiconbarrelthewordIDisin.
Thisway,wecanusejust24bitsforthewordID'sintheunsortedbarrels,leaving8bitsforthehitlistlength.
4.
2.
7InvertedIndexTheinvertedindexconsistsofthesamebarrelsastheforwardindex,exceptthattheyhavebeenprocessedbythesorter.
ForeveryvalidwordID,thelexiconcontainsapointerintothebarrelthatwordIDfallsinto.
ItpointstoadoclistofdocID'stogetherwiththeircorrespondinghitlists.
Thisdoclistrepresentsalltheoccurrencesofthatwordinalldocuments.
AnimportantissueisinwhatorderthedocID'sshouldappearinthedoclist.
OnesimplesolutionistostorethemsortedbydocID.
Thisallowsforquickmergingofdifferentdoclistsformultiplewordqueries.
Anotheroptionistostorethemsortedbyarankingoftheoccurrenceofthewordineachdocument.
Thismakesansweringonewordqueriestrivialandmakesitlikelythattheanswerstomultiplewordqueriesarenearthestart.
However,mergingismuchmoredifficult.
Also,thismakesdevelopmentmuchmoredifficultinthatachangetotherankingfunctionrequiresarebuildoftheindex.
Wechoseacompromisebetweentheseoptions,keepingtwosetsofinvertedbarrels--onesetforhitlistswhichincludetitleoranchorhitsandanothersetforallhitlists.
Thisway,wecheckthefirstsetofbarrelsfirstandiftherearenotenoughmatcheswithinthosebarrelswecheckthelargerones.
4.
3CrawlingtheWebRunningawebcrawlerisachallengingtask.
Therearetrickyperformanceandreliabilityissuesandevenmoreimportantly,therearesocialissues.
Crawlingisthemostfragileapplicationsinceitinvolvesinteractingwithhundredsofthousandsofwebserversandvariousnameserverswhichareallbeyondthecontrolofthesystem.
Inordertoscaletohundredsofmillionsofwebpages,Googlehasafastdistributedcrawlingsystem.
AsingleURLserverserveslistsofURLstoanumberofcrawlers(wetypicallyranabout3).
BoththeURLserverandthecrawlersareimplementedinPython.
Eachcrawlerkeepsroughly300connectionsopenatonce.
Thisisnecessarytoretrievewebpagesatafastenoughpace.
Atpeakspeeds,thesystemcancrawlover100webpagespersecondusingfourcrawlers.
Thisamountstoroughly600Kpersecondofdata.
AmajorperformancestressisDNSlookup.
EachcrawlermaintainsaitsownDNScachesoitdoesnotneedtodoaDNSlookupbeforecrawlingeachdocument.
Eachofthehundredsofconnectionscanbeinanumberofdifferentstates:lookingupDNS,connectingtohost,sendingrequest,andreceivingresponse.
Thesefactorsmakethecrawleracomplexcomponentofthesystem.
ItusesasynchronousIOtomanageevents,andanumberofqueuestomovepagefetchesfromstatetostate.
Itturnsoutthatrunningacrawlerwhichconnectstomorethanhalfamillionservers,andgeneratestensofmillionsoflogentriesgeneratesafairamountofemailandphonecalls.
Becauseofthevastnumberofpeoplecomingonline,therearealwaysthosewhodonotknowwhatacrawleris,becausethisisthefirstonetheyhaveseen.
Almostdaily,wereceiveanemailsomethinglike,"Wow,youlookedatalotofpagesfrommywebsite.
Howdidyoulikeit"Therearealsosomepeoplewhodonotknowabouttherobotsexclusionprotocol,andthinktheirpageshouldbeprotectedfromindexingbyastatementlike,"Thispageiscopyrightedandshouldnotbeindexed",whichneedlesstosayisdifficultforwebcrawlerstounderstand.
Also,becauseofthehugeamountofdatainvolved,unexpectedthingswillhappen.
Forexample,oursystemtriedtocrawlanonlinegame.
Thisresultedinlotsofgarbagemessagesinthemiddleoftheirgame!
Itturnsoutthiswasaneasyproblemtofix.
Butthisproblemhadnotcomeupuntilwehaddownloadedtensofmillionsofpages.
Becauseoftheimmensevariationinwebpagesandservers,itisvirtuallyimpossibletotestacrawlerwithoutrunningitonlargepartoftheInternet.
Invariably,therearehundredsofobscureproblemswhichmayonlyoccurononepageoutofthewholewebandcausethecrawlertocrash,orworse,causeunpredictableorincorrectbehavior.
SystemswhichaccesslargepartsoftheInternetneedtobedesignedtobeveryrobustandcarefullytested.
Sincelargecomplexsystemssuchascrawlerswillinvariablycauseproblems,thereneedstobesignificantresourcesdevotedtoreadingtheemailandsolvingtheseproblemsastheycomeup.
4.
4IndexingtheWebParsing--AnyparserwhichisdesignedtorunontheentireWebmusthandleahugearrayofpossibleerrors.
TheserangefromtyposinHTMLtagstokilobytesofzerosinthemiddleofatag,non-ASCIIcharacters,HTMLtagsnestedhundredsdeep,andagreatvarietyofothererrorsthatchallengeanyone'simaginationtocomeupwithequallycreativeones.
Formaximumspeed,insteadofusingYACCtogenerateaCFGparser,weuseflextogeneratealexicalanalyzerwhichweoutfitwithitsownstack.
Developingthisparserwhichrunsatareasonablespeedandisveryrobustinvolvedafairamountofwork.
IndexingDocumentsintoBarrels--Aftereachdocumentisparsed,itisencodedintoanumberofbarrels.
EverywordisconvertedintoawordIDbyusinganin-memoryhashtable--thelexicon.
Newadditionstothelexiconhashtableareloggedtoafile.
OncethewordsareconvertedintowordID's,theiroccurrencesinthecurrentdocumentaretranslatedintohitlistsandarewrittenintotheforwardbarrels.
Themaindifficultywithparallelizationoftheindexingphaseisthatthelexiconneedstobeshared.
Insteadofsharingthelexicon,wetooktheapproachofwritingalogofalltheextrawordsthatwerenotinabaselexicon,whichwefixedat14millionwords.
Thatwaymultipleindexerscanruninparallelandthenthesmalllogfileofextrawordscanbeprocessedbyonefinalindexer.
Sorting--Inordertogeneratetheinvertedindex,thesortertakeseachoftheforwardbarrelsandsortsitbywordIDtoproduceaninvertedbarrelfortitleandanchorhitsandafulltextinvertedbarrel.
Thisprocesshappensonebarrelatatime,thusrequiringlittletemporarystorage.
Also,weparallelizethesortingphasetouseasmanymachinesaswehavesimplybyrunningmultiplesorters,whichcanprocessdifferentbucketsatthesametime.
Sincethebarrelsdon'tfitintomainmemory,thesorterfurthersubdividesthemintobasketswhichdofitintomemorybasedonwordIDanddocID.
Thenthesorter,loadseachbasketintomemory,sortsitandwritesitscontentsintotheshortinvertedbarrelandthefullinvertedbarrel.
4.
5SearchingThegoalofsearchingistoprovidequalitysearchresultsefficiently.
Manyofthelargecommercialsearchenginesseemedtohavemadegreatprogressintermsofefficiency.
Therefore,wehavefocusedmoreonqualityofsearchinourresearch,althoughwebelieveoursolutionsarescalabletocommercialvolumeswithabitmoreeffort.
ThegooglequeryevaluationprocessisshowinFigure4.
Toputalimitonresponsetime,onceacertainnumber(currently40,000)ofmatchingdocumentsarefound,thesearcherautomaticallygoestostep8inFigure4.
Thismeansthatitispossiblethatsub-optimalresultswouldbereturned.
Wearecurrentlyinvestigatingotherwaystosolvethisproblem.
Inthepast,wesortedthehitsaccordingtoPageRank,whichseemedtoimprovethesituation.
4.
5.
1TheRankingSystemGooglemaintainsmuchmoreinformationaboutwebdocumentsthantypicalsearchengines.
Everyhitlistincludesposition,font,andcapitalizationinformation.
Additionally,wefactorinhitsfromanchortextandthePageRankofthedocument.
Combiningallofthisinformationintoarankisdifficult.
Wedesignedourrankingfunctionsothatnoparticularfactorcanhavetoomuchinfluence.
First,considerthesimplestcase--asinglewordquery.
Inordertorankadocumentwithasinglewordquery,Googlelooksatthatdocument'shitlistforthatword.
Googleconsiderseachhittobeoneofseveraldifferenttypes(title,anchor,URL,plaintextlargefont,plaintextsmallfont,.
.
.
),eachofwhichhasitsowntype-weight.
Thetype-weightsmakeupavectorindexedbytype.
Googlecountsthenumberofhitsofeachtypeinthehitlist.
Theneverycountisconvertedintoacount-weight.
Count-weightsincreaselinearlywithcountsatfirstbutquicklytaperoffsothatmorethanacertaincountwillnothelp.
Wetakethedotproductofthevectorofcount-weightswiththevectoroftype-weightstocomputeanIRscoreforthedocument.
Finally,theIRscoreiscombinedwithPageRanktogiveafinalranktothedocument.
Foramulti-wordsearch,thesituationismorecomplicated.
Nowmultiplehitlistsmustbescannedthroughatoncesothathitsoccurringclosetogetherinadocumentareweightedhigherthanhitsoccurringfarapart.
Thehitsfromthemultiplehitlistsarematchedupsothatnearbyhitsarematchedtogether.
Foreverymatchedsetofhits,aproximityiscomputed.
Theproximityisbasedonhowfarapartthehitsareinthedocument(oranchor)butisclassifiedinto10differentvalue"bins"rangingfromaphrasematchto"notevenclose".
Countsarecomputednotonlyforeverytypeofhitbutforeverytypeandproximity.
Everytypeandproximitypairhasatype-prox-weight.
Thecountsareconvertedintocount-weightsandwetakethedotproductofthecount-weightsandthetype-prox-weightstocomputeanIRscore.
Allofthesenumbersandmatricescanallbedisplayedwiththesearchresultsusingaspecialdebugmode.
Thesedisplayshavebeenveryhelpfulindevelopingtherankingsystem.
4.
5.
2FeedbackTherankingfunctionhasmanyparameterslikethetype-weightsandthetype-prox-weights.
Figuringouttherightvaluesfortheseparametersissomethingofablackart.
Inordertodothis,wehaveauserfeedbackmechanisminthesearchengine.
Atrustedusermayoptionallyevaluatealloftheresultsthatarereturned.
Thisfeedbackissaved.
Thenwhenwemodifytherankingfunction,wecanseetheimpactofthischangeonallprevioussearcheswhichwereranked.
Althoughfarfromperfect,thisgivesussome1.
Parsethequery.
2.
ConvertwordsintowordIDs.
3.
Seektothestartofthedoclistintheshortbarrelforeveryword.
4.
Scanthroughthedoclistsuntilthereisadocumentthatmatchesallthesearchterms.
5.
Computetherankofthatdocumentforthequery.
6.
Ifweareintheshortbarrelsandattheendofanydoclist,seektothestartofthedoclistinthefullbarrelforeverywordandgotostep4.
7.
Ifwearenotattheendofanydoclistgotostep4.
Sortthedocumentsthathavematchedbyrankandreturnthetopk.
Figure4.
GoogleQueryEvaluationideaofhowachangeintherankingfunctionaffectsthesearchresults.
5ResultsandPerformanceThemostimportantmeasureofasearchengineisthequalityofitssearchresults.
Whileacompleteuserevaluationisbeyondthescopeofthispaper,ourownexperiencewithGooglehasshownittoproducebetterresultsthanthemajorcommercialsearchenginesformostsearches.
AsanexamplewhichillustratestheuseofPageRank,anchortext,andproximity,Figure4showsGoogle'sresultsforasearchon"billclinton".
TheseresultsdemonstratessomeofGoogle'sfeatures.
Theresultsareclusteredbyserver.
Thishelpsconsiderablywhensiftingthroughresultsets.
Anumberofresultsarefromthewhitehouse.
govdomainwhichiswhatonemayreasonablyexpectfromsuchasearch.
Currently,mostmajorcommercialsearchenginesdonotreturnanyresultsfromwhitehouse.
gov,muchlesstherightones.
Noticethatthereisnotitleforthefirstresult.
Thisisbecauseitwasnotcrawled.
Instead,Googlereliedonanchortexttodeterminethiswasagoodanswertothequery.
Similarly,thefifthresultisanemailaddresswhich,ofcourse,isnotcrawlable.
Itisalsoaresultofanchortext.
Alloftheresultsarereasonablyhighqualitypagesand,atlastcheck,nonewerebrokenlinks.
ThisislargelybecausetheyallhavehighPageRank.
ThePageRanksarethepercentagesinredalongwithbargraphs.
Finally,therearenoresultsaboutaBillotherthanClintonoraboutaClintonotherthanBill.
Thisisbecauseweplaceheavyimportanceontheproximityofwordoccurrences.
Ofcourseatruetestofthequalityofasearchenginewouldinvolveanextensiveuserstudyorresultsanalysiswhichwedonothaveroomforhere.
Instead,weinvitethereadertotryGoogleforthemselvesathttp://google.
stanford.
edu.
5.
1StorageRequirementsQuery:billclintonhttp://www.
whitehouse.
gov/100.
00%(nodate)(0K)http://www.
whitehouse.
gov/OfficeofthePresident99.
67%(Dec231996)(2K)http://www.
whitehouse.
gov/WH/EOP/OP/html/OP_Home.
htmlWelcomeToTheWhiteHouse99.
98%(Nov091997)(5K)http://www.
whitehouse.
gov/WH/Welcome.
htmlSendElectronicMailtothePresident99.
86%(Jul141997)(5K)http://www.
whitehouse.
gov/WH/Mail/html/Mail_President.
htmlmailto:president@whitehouse.
gov99.
98%mailto:President@whitehouse.
gov99.
27%The"Unofficial"BillClinton94.
06%(Nov111997)(14K)http://zpub.
com/un/un-bc.
htmlBillClintonMeetsTheShrinks86.
27%(Jun291997)(63K)http://zpub.
com/un/un-bc9.
htmlPresidentBillClinton-TheDarkSide97.
27%(Nov101997)(15K)http://www.
realchange.
org/clinton.
htm$3BillClinton94.
73%(nodate)(4K)http://www.
gatewy.
net/~tjohnson/clinton1.
htmlFigure4.
SampleResultsfromGoogleAsidefromsearchquality,GoogleisdesignedtoscalecosteffectivelytothesizeoftheWebasitgrows.
Oneaspectofthisistousestorageefficiently.
Table1hasabreakdownofsomestatisticsandstoragerequirementsofGoogle.
Duetocompressionthetotalsizeoftherepositoryisabout53GB,justoveronethirdofthetotaldataitstores.
Atcurrentdiskpricesthismakestherepositoryarelativelycheapsourceofusefuldata.
Moreimportantly,thetotalofallthedatausedbythesearchenginerequiresacomparableamountofstorage,about55GB.
Furthermore,mostqueriescanbeansweredusingjusttheshortinvertedindex.
WithbetterencodingandcompressionoftheDocumentIndex,ahighqualitywebsearchenginemayfitontoa7GBdriveofanewPC.
5.
2SystemPerformanceItisimportantforasearchenginetocrawlandindexefficiently.
Thiswayinformationcanbekeptuptodateandmajorchangestothesystemcanbetestedrelativelyquickly.
ForGoogle,themajoroperationsareCrawling,Indexing,andSorting.
Itisdifficulttomeasurehowlongcrawlingtookoverallbecausedisksfilledup,nameserverscrashed,oranynumberofotherproblemswhichstoppedthesystem.
Intotalittookroughly9daystodownloadthe26millionpages(includingerrors).
However,oncethesystemwasrunningsmoothly,itranmuchfaster,downloadingthelast11millionpagesinjust63hours,averagingjustover4millionpagesperdayor48.
5pagespersecond.
Werantheindexerandthecrawlersimultaneously.
Theindexerranjustfasterthanthecrawlers.
Thisislargelybecausewespentjustenoughtimeoptimizingtheindexersothatitwouldnotbeabottleneck.
Theseoptimizationsincludedbulkupdatestothedocumentindexandplacementofcriticaldatastructuresonthelocaldisk.
Theindexerrunsatroughly54pagespersecond.
Thesorterscanberuncompletelyinparallel;usingfourmachines,thewholeprocessofsortingtakesabout24hours.
5.
3SearchPerformanceImprovingtheperformanceofsearchwasnotthemajorfocusofourresearchuptothispoint.
ThecurrentversionofGoogleanswersmostqueriesinbetween1and10seconds.
ThistimeismostlydominatedbydiskIOoverNFS(sincedisksarespreadoveranumberofmachines).
Furthermore,Googledoesnothaveanyoptimizationssuchasquerycaching,subindicesoncommonterms,andothercommonoptimizations.
WeintendtospeedupGoogleconsiderablythroughdistributionandhardware,software,andalgorithmicimprovements.
Ourtargetistobeabletohandleseveralhundredqueriespersecond.
Table2hassomesamplequerytimesfromthecurrentversionofGoogle.
TheyarerepeatedtoshowthespeedupsresultingfromcachedIO.
StorageStatisticsTotalSizeofFetchedPages147.
8GBCompressedRepository53.
5GBShortInvertedIndex4.
1GBFullInvertedIndex37.
2GBLexicon293MBTemporaryAnchorData(notintotal)6.
6GBDocumentIndexIncl.
VariableWidthData9.
7GBLinksDatabase3.
9GBTotalWithoutRepository55.
2GBTotalWithRepository108.
7GBWebPageStatisticsNumberofWebPagesFetched24millionNumberofUrlsSeen76.
5millionNumberofEmailAddresses1.
7millionNumberof404's1.
6millionTable1.
Statistics6ConclusionsGoogleisdesignedtobeascalablesearchengine.
TheprimarygoalistoprovidehighqualitysearchresultsoverarapidlygrowingWorldWideWeb.
Googleemploysanumberoftechniquestoimprovesearchqualityincludingpagerank,anchortext,andproximityinformation.
Furthermore,Googleisacompletearchitectureforgatheringwebpages,indexingthem,andperformingsearchqueriesoverthem.
6.
1FutureWorkAlarge-scalewebsearchengineisacomplexsystemandmuchremainstobedone.
Ourimmediategoalsaretoimprovesearchefficiencyandtoscaletoapproximately100millionwebpages.
Somesimpleimprovementstoefficiencyincludequerycaching,smartdiskallocation,andsubindices.
Anotherareawhichrequiresmuchresearchisupdates.
Wemusthavesmartalgorithmstodecidewhatoldwebpagesshouldberecrawledandwhatnewonesshouldbecrawled.
Worktowardthisgoalhasbeendonein[Cho98].
Onepromisingareaofresearchisusingproxycachestobuildsearchdatabases,sincetheyaredemanddriven.
Weareplanningtoaddsimplefeaturessupportedbycommercialsearchengineslikebooleanoperators,negation,andstemming.
However,otherfeaturesarejuststartingtobeexploredsuchasrelevancefeedbackandclustering(Googlecurrentlysupportsasimplehostnamebasedclustering).
Wealsoplantosupportusercontext(liketheuser'slocation),andresultsummarization.
Wearealsoworkingtoextendtheuseoflinkstructureandlinktext.
SimpleexperimentsindicatePageRankcanbepersonalizedbyincreasingtheweightofauser'shomepageorbookmarks.
Asforlinktext,weareexperimentingwithusingtextsurroundinglinksinadditiontothelinktextitself.
AWebsearchengineisaveryrichenvironmentforresearchideas.
WehavefartoomanytolistheresowedonotexpectthisFutureWorksectiontobecomemuchshorterinthenearfuture.
6.
2HighQualitySearchThebiggestproblemfacingusersofwebsearchenginestodayisthequalityoftheresultstheygetback.
Whiletheresultsareoftenamusingandexpandusers'horizons,theyareoftenfrustratingandconsumeprecioustime.
Forexample,thetopresultforasearchfor"BillClinton"ononeofthemostpopularcommercialsearchengineswastheBillClintonJokeoftheDay:April14,1997.
GoogleisdesignedtoprovidehigherqualitysearchsoastheWebcontinuestogrowrapidly,informationcanbefoundeasily.
InordertoaccomplishthisGooglemakesheavyuseofhypertextualinformationconsistingoflinkstructureandlink(anchor)text.
Googlealsousesproximityandfontinformation.
Whileevaluationofasearchengineisdifficult,wehavesubjectivelyfoundthatGooglereturnshigherqualitysearchresultsthancurrentcommercialsearchengines.
TheanalysisoflinkstructureviaPageRankallowsGoogletoevaluatethequalityofwebpages.
Theuseoflinktextasadescriptionofwhatthelinkpointstohelpsthesearchenginereturnrelevant(andtosomedegreehighquality)results.
Finally,theuseofproximityinformationhelpsincreaserelevanceagreatdealformanyqueries.
InitialQuerySameQueryRepeated(IOmostlycached)QueryCPUTime(s)TotalTime(s)CPUTime(s)TotalTime(s)algore0.
092.
130.
060.
06vicepresident1.
773.
841.
661.
80harddisks0.
254.
860.
200.
24searchengines1.
319.
631.
161.
16Table2.
SearchTimes6.
3ScalableArchitectureAsidefromthequalityofsearch,Googleisdesignedtoscale.
Itmustbeefficientinbothspaceandtime,andconstantfactorsareveryimportantwhendealingwiththeentireWeb.
InimplementingGoogle,wehaveseenbottlenecksinCPU,memoryaccess,memorycapacity,diskseeks,diskthroughput,diskcapacity,andnetworkIO.
Googlehasevolvedtoovercomeanumberofthesebottlenecksduringvariousoperations.
Google'smajordatastructuresmakeefficientuseofavailablestoragespace.
Furthermore,thecrawling,indexing,andsortingoperationsareefficientenoughtobeabletobuildanindexofasubstantialportionoftheweb--24millionpages,inlessthanoneweek.
Weexpecttobeabletobuildanindexof100millionpagesinlessthanamonth.
6.
4AResearchToolInadditiontobeingahighqualitysearchengine,Googleisaresearchtool.
ThedataGooglehascollectedhasalreadyresultedinmanyotherpaperssubmittedtoconferencesandmanymoreontheway.
Recentresearchsuchas[Abiteboul97]hasshownanumberoflimitationstoqueriesabouttheWebthatmaybeansweredwithouthavingtheWebavailablelocally.
ThismeansthatGoogle(orasimilarsystem)isnotonlyavaluableresearchtoolbutanecessaryoneforawiderangeofapplications.
WehopeGooglewillbearesourceforsearchersandresearchersallaroundtheworldandwillsparkthenextgenerationofsearchenginetechnology.
7AcknowledgmentsScottHassanandAlanSteremberghavebeencriticaltothedevelopmentofGoogle.
Theirtalentedcontributionsareirreplaceable,andtheauthorsowethemmuchgratitude.
WewouldalsoliketothankHectorGarcia-Molina,RajeevMotwani,JeffUllman,andTerryWinogradandthewholeWebBasegroupfortheirsupportandinsightfuldiscussions.
FinallywewouldliketorecognizethegeneroussupportofourequipmentdonorsIBM,Intel,andSunandourfunders.
TheresearchdescribedherewasconductedaspartoftheStanfordIntegratedDigitalLibraryProject,supportedbytheNationalScienceFoundationunderCooperativeAgreementIRI-9411306.
FundingforthiscooperativeagreementisalsoprovidedbyDARPAandNASA,andbyIntervalResearch,andtheindustrialpartnersoftheStanfordDigitalLibrariesProject.
ReferencesBestoftheWeb1994--Navigatorshttp://botw.
org/1994/awards/navigators.
htmlBillClintonJokeoftheDay:April14,1997.
http://www.
io.
com/~cjburke/clinton/970414.
html.
Bzip2Homepagehttp://www.
muraroa.
demon.
co.
uk/GoogleSearchEnginehttp://google.
stanford.
edu/Harvesthttp://harvest.
transarc.
com/Mauldin,MichaelL.
LycosDesignChoicesinanInternetSearchService,IEEEExpertInterviewhttp://www.
computer.
org/pubs/expert/1997/trends/x1008/mauldin.
htmTheEffectofCellularPhoneUseUponDriverAttentionhttp://www.
webfirst.
com/aaa/text/cell/cell0toc.
htmSearchEngineWatchhttp://www.
searchenginewatch.
com/RFC1950(zlib)ftp://ftp.
uu.
net/graphics/png/documents/zlib/zdoc-index.
htmlRobotsExclusionProtocol:http://info.
webcrawler.
com/mak/projects/robots/exclusion.
htmWebGrowthSummary:http://www.
mit.
edu/people/mkgray/net/web-growth-summary.
htmlYahoo!
http://www.
yahoo.
com/[Abiteboul97]SergeAbiteboulandVictorVianu,QueriesandComputationontheWeb.
ProceedingsoftheInternationalConferenceonDatabaseTheory.
Delphi,Greece1997.
[Bagdikian97]BenH.
Bagdikian.
TheMediaMonopoly.
5thEdition.
Publisher:Beacon,ISBN:0807061557[Cho98]JunghooCho,HectorGarcia-Molina,LawrencePage.
EfficientCrawlingThroughURLOrdering.
SeventhInternationalWebConference(WWW98).
Brisbane,Australia,April14-18,1998.
[Gravano94]LuisGravano,HectorGarcia-Molina,andA.
Tomasic.
TheEffectivenessofGlOSSfortheText-DatabaseDiscoveryProblem.
Proc.
ofthe1994ACMSIGMODInternationalConferenceOnManagementOfData,1994.
[Kleinberg98]JonKleinberg,AuthoritativeSourcesinaHyperlinkedEnvironment,Proc.
ACM-SIAMSymposiumonDiscreteAlgorithms,1998.
[Marchiori97]MassimoMarchiori.
TheQuestforCorrectInformationontheWeb:HyperSearchEngines.
TheSixthInternationalWWWConference(WWW97).
SantaClara,USA,April7-11,1997.
[McBryan94]OliverA.
McBryan.
GENVLandWWWW:ToolsforTamingtheWeb.
FirstInternationalConferenceontheWorldWideWeb.
CERN,Geneva(Switzerland),May25-26-271994.
http://www.
cs.
colorado.
edu/home/mcbryan/mypapers/www94.
ps[Page98]LawrencePage,SergeyBrin,RajeevMotwani,TerryWinograd.
ThePageRankCitationRanking:BringingOrdertotheWeb.
Manuscriptinprogress.
http://google.
stanford.
edu/~backrub/pageranksub.
ps[Pinkerton94]BrianPinkerton,FindingWhatPeopleWant:ExperienceswiththeWebCrawler.
TheSecondInternationalWWWConferenceChicago,USA,October17-20,1994.
http://info.
webcrawler.
com/bp/WWW94.
html[Spertus97]EllenSpertus.
ParaSite:MiningStructuralInformationontheWeb.
TheSixthInternationalWWWConference(WWW97).
SantaClara,USA,April7-11,1997.
[TREC96]ProceedingsofthefifthTextREtrievalConference(TREC-5).
Gaithersburg,Maryland,November20-22,1996.
Publisher:DepartmentofCommerce,NationalInstituteofStandardsandTechnology.
Editors:D.
K.
HarmanandE.
M.
Voorhees.
Fulltextat:http://trec.
nist.
gov/[Witten94]IanHWitten,AlistairMoffat,andTimothyC.
Bell.
ManagingGigabytes:CompressingandIndexingDocumentsandImages.
NewYork:VanNostrandReinhold,1994.
[Weiss96]RonWeiss,BienvenidoVelez,MarkA.
Sheldon,ChanathipManprempre,PeterSzilagyi,AndrzejDuda,andDavidK.
Gifford.
HyPursuit:AHierarchicalNetworkSearchEnginethatExploitsContent-LinkHypertextClustering.
Proceedingsofthe7thACMConferenceonHypertext.
NewYork,1996.
VitaeSergeyBrinreceivedhisB.
S.
degreeinmathematicsandcomputersciencefromtheUniversityofMarylandatCollegeParkin1993.
Currently,heisaPh.
D.
candidateincomputerscienceatStanfordUniversitywherehereceivedhisM.
S.
in1995.
HeisarecipientofaNationalScienceFoundationGraduateFellowship.
Hisresearchinterestsincludesearchengines,informationextractionfromunstructuredsources,anddataminingoflargetextcollectionsandscientificdata.
LawrencePagewasborninEastLansing,Michigan,andreceivedaB.
S.
E.
inComputerEngineeringattheUniversityofMichiganAnnArborin1995.
HeiscurrentlyaPh.
D.
candidateinComputerScienceatStanfordUniversity.
Someofhisresearchinterestsincludethelinkstructureoftheweb,humancomputerinteraction,searchengines,scalabilityofinformationaccessinterfaces,andpersonaldatamining.
8AppendixA:AdvertisingandMixedMotivesCurrently,thepredominantbusinessmodelforcommercialsearchenginesisadvertising.
Thegoalsoftheadvertisingbusinessmodeldonotalwayscorrespondtoprovidingqualitysearchtousers.
Forexample,inourprototypesearchengineoneofthetopresultsforcellularphoneis"TheEffectofCellularPhoneUseUponDriverAttention",astudywhichexplainsingreatdetailthedistractionsandriskassociatedwithconversingonacellphonewhiledriving.
ThissearchresultcameupfirstbecauseofitshighimportanceasjudgedbythePageRankalgorithm,anapproximationofcitationimportanceontheweb[Page,98].
Itisclearthatasearchenginewhichwastakingmoneyforshowingcellularphoneadswouldhavedifficultyjustifyingthepagethatoursystemreturnedtoitspayingadvertisers.
Forthistypeofreasonandhistoricalexperiencewithothermedia[Bagdikian83],weexpectthatadvertisingfundedsearchengineswillbeinherentlybiasedtowardstheadvertisersandawayfromtheneedsoftheconsumers.
Sinceitisverydifficultevenforexpertstoevaluatesearchengines,searchenginebiasisparticularlyinsidious.
AgoodexamplewasOpenText,whichwasreportedtobesellingcompaniestherighttobelistedatthetopofthesearchresultsforparticularqueries[Marchiori97].
Thistypeofbiasismuchmoreinsidiousthanadvertising,becauseitisnotclearwho"deserves"tobethere,andwhoiswillingtopaymoneytobelisted.
Thisbusinessmodelresultedinanuproar,andOpenTexthasceasedtobeaviablesearchengine.
Butlessblatantbiasarelikelytobetoleratedbythemarket.
Forexample,asearchenginecouldaddasmallfactortosearchresultsfrom"friendly"companies,andsubtractafactorfromresultsfromcompetitors.
Thistypeofbiasisverydifficulttodetectbutcouldstillhaveasignificanteffectonthemarket.
Furthermore,advertisingincomeoftenprovidesanincentivetoprovidepoorqualitysearchresults.
Forexample,wenoticedamajorsearchenginewouldnotreturnalargeairline'shomepagewhentheairline'snamewasgivenasaquery.
Itsohappenedthattheairlinehadplacedanexpensivead,linkedtothequerythatwasitsname.
Abettersearchenginewouldnothaverequiredthisad,andpossiblyresultedinthelossoftherevenuefromtheairlinetothesearchengine.
Ingeneral,itcouldbearguedfromtheconsumerpointofviewthatthebetterthesearchengineis,thefeweradvertisementswillbeneededfortheconsumertofindwhattheywant.
Thisofcourseerodestheadvertisingsupportedbusinessmodeloftheexistingsearchengines.
However,therewillalwaysbemoneyfromadvertiserswhowantacustomertoswitchproducts,orhavesomethingthatisgenuinelynew.
Butwebelievetheissueofadvertisingcausesenoughmixedincentivesthatitiscrucialtohaveacompetitivesearchenginethatistransparentandintheacademicrealm.
9AppendixB:Scalability9.
1ScalabilityofGoogleWehavedesignedGoogletobescalableintheneartermtoagoalof100millionwebpages.
Wehavejustreceiveddiskandmachinestohandleroughlythatamount.
Allofthetimeconsumingpartsofthesystemareparallelizeandroughlylineartime.
Theseincludethingslikethecrawlers,indexers,andsorters.
Wealsothinkthatmostofthedatastructureswilldealgracefullywiththeexpansion.
However,at100millionwebpageswewillbeverycloseupagainstallsortsofoperatingsystemlimitsinthecommonoperatingsystems(currentlywerunonbothSolarisandLinux).
Theseincludethingslikeaddressablememory,numberofopenfiledescriptors,networksocketsandbandwidth,andmanyothers.
Webelieveexpandingtoalotmorethan100millionpageswouldgreatlyincreasethecomplexityofoursystem.
9.
2ScalabilityofCentralizedIndexingArchitecturesAsthecapabilitiesofcomputersincrease,itbecomespossibletoindexaverylargeamountoftextforareasonablecost.
Ofcourse,othermorebandwidthintensivemediasuchasvideoislikelytobecomemorepervasive.
But,becausethecostofproductionoftextislowcomparedtomedialikevideo,textislikelytoremainverypervasive.
Also,itislikelythatsoonwewillhavespeechrecognitionthatdoesareasonablejobconvertingspeechintotext,expandingtheamountoftextavailable.
Allofthisprovidesamazingpossibilitiesforcentralizedindexing.
Hereisanillustrativeexample.
WeassumewewanttoindexeverythingeveryoneintheUShaswrittenforayear.
Weassumethatthereare250millionpeopleintheUSandtheywriteanaverageof10kperday.
Thatworksouttobeabout850terabytes.
Alsoassumethatindexingaterabytecanbedonenowforareasonablecost.
Wealsoassumethattheindexingmethodsusedoverthetextarelinear,ornearlylinearintheircomplexity.
Givenalltheseassumptionswecancomputehowlongitwouldtakebeforewecouldindexour850terabytesforareasonablecostassumingcertaingrowthfactors.
Moore'sLawwasdefinedin1965asadoublingevery18monthsinprocessorpower.
Ithasheldremarkablytrue,notjustforprocessors,butforotherimportantsystemparameterssuchasdiskaswell.
IfweassumethatMoore'slawholdsforthefuture,weneedonly10moredoublings,or15yearstoreachourgoalofindexingeverythingeveryoneintheUShaswrittenforayearforapricethatasmallcompanycouldafford.
Ofcourse,hardwareexpertsaresomewhatconcernedMoore'sLawmaynotcontinuetoholdforthenext15years,buttherearecertainlyalotofinterestingcentralizedapplicationsevenifweonlygetpartofthewaytoourhypotheticalexample.
OfcourseadistributedsystemslikeGloss[Gravano94]orHarvestwilloftenbethemostefficientandeleganttechnicalsolutionforindexing,butitseemsdifficulttoconvincetheworldtousethesesystemsbecauseofthehighadministrationcostsofsettinguplargenumbersofinstallations.
Ofcourse,itisquitelikelythatreducingtheadministrationcostdrasticallyispossible.
Ifthathappens,andeveryonestartsrunningadistributedindexingsystem,searchingwouldcertainlyimprovedrastically.
Becausehumanscanonlytypeorspeakafiniteamount,andascomputerscontinueimproving,textindexingwillscaleevenbetterthanitdoesnow.
Ofcoursetherecouldbeaninfiniteamountofmachinegeneratedcontent,butjustindexinghugeamountsofhumangeneratedcontentseemstremendouslyuseful.
Soweareoptimisticthatourcentralizedwebsearchenginearchitecturewillimproveinitsabilitytocoverthepertinenttextinformationovertimeandthatthereisabrightfutureforsearch.

青云互联-洛杉矶CN2弹性云限时五折,9.5元/月起,三网CN2gia回程,可选Windows,可自定义配置

官方网站:点击访问青云互联官网优惠码:五折优惠码:5LHbEhaS (一次性五折,可月付、季付、半年付、年付)活动方案:的套餐分为大带宽限流和小带宽不限流两种套餐,全部为KVM虚拟架构,而且配置都可以弹性设置1、洛杉矶cera机房三网回程cn2gia 洛杉矶cera机房                ...

Vultr新注册赠送100美元活动截止月底 需要可免费享30天福利

昨天晚上有收到VULTR服务商的邮件,如果我们有清楚的朋友应该知道VULTR对于新注册用户已经这两年的促销活动是有赠送100美元最高余额,不过这个余额有效期是30天,如果我们到期未使用完的话也会失效的。但是对于我们一般用户来说,这个活动还是不错的,只需要注册新账户充值10美金激活账户就可以。而且我们自己充值的余额还是可以继续使用且无有效期的。如果我们有需要申请的话可以参考"2021年最新可用Vul...

RAKsmart:美国洛杉矶独服,E3处理器/16G/1TB,$76.77/月;美国/香港/日本/韩国站群服务器,自带5+253个IPv4

RAKsmart怎么样?RAKsmart机房即日起开始针对洛杉矶机房的独立服务器进行特别促销活动:低至$76.77/月,最低100Mbps带宽,最高10Gbps带宽,优化线路,不限制流量,具体包括有:常规服务器、站群服务器、10G大带宽服务器、整机机柜托管。活动截止6月30日结束。RAKsmart,美国华人老牌机房,专注于圣何塞服务器,有VPS、独立服务器等。支持PayPal、支付宝付款。点击直达...

pr查询为你推荐
outlookexpressoutlook Express是什么啊?怎么用啊?申请支付宝账户怎么申请支付宝的账号?360arp防火墙在哪360的9.6版本ARP防火墙在哪?netshwinsockreset游戏出现battlEye Launcher 怎么办sns网站有哪些中国都有哪些sns网站?还有它们都是哪个类型的?信息cuteftp抢米网什么意思抢小米手机三友网网测是什么意思?购物车在超市、商场中为什么需要使用购物车呢?闪拍网闪拍网是真的吗
vps代理 linuxapache虚拟主机 香港托管 shopex空间 双12活动 云主机51web 架设服务器 新天域互联 赞助 免费防火墙 百度云1t 多线空间 江苏双线服务器 空间购买 网页提速 智能dns解析 smtp服务器地址 日本代理ip 双线空间 cdn网站加速 更多