ISSN 1000-9825, CODEN RUXUEW    E-mail: jos@iscas.ac.cn
Journal of Software, Vol.19, No.2, February 2008, pp.246-256    http://www.jos.org.cn
DOI: 10.3724/SP.J.1001.2008.00246    Tel/Fax: +86-10-62562563
© 2008 by Journal of Software. All rights reserved.

Using Classifiers to Find Domain-Specific Online Databases Automatically

WANG Hui+, LIU Yan-Wei, ZUO Wan-Li

(College of Computer Science and Technology, Jilin University, Changchun 130012, China)
+ Corresponding author: Phn: +86-431-85166492, E-mail: whui05@yahoo.com.cn

Wang H, Liu YW, Zuo WL. Using classifiers to find domain-specific online databases automatically. Journal of Software, 2008,19(2):246-256. http://www.jos.org.cn/1000-9825/19/246.htm

Supported by the National Natural Science Foundation of China under Grant No.60373099; the Science and Technology Development Program of Jilin Province of China under Grant No.20070533
Received 2007-08-02; Accepted 2007-11-06

Abstract: In the hidden Web domain, general-purpose search engines (e.g., Google and Yahoo) have shortcomings. Each of them covers less than one-third of the data stored in deep Web databases and, unlike on the surface Web, combining several engines hardly increases coverage, because they cover roughly the same data. The hidden Web is nevertheless a highly important information source, since the content provided by many hidden Web sites is often of very high quality. This paper proposes a three-step framework, built on three classifiers, to automatically identify domain-specific hidden Web entries. The query interfaces obtained can then be integrated into a unified interface through which users submit their queries. Eight large-scale experiments demonstrate that the technique can find domain-specific hidden Web entries accurately and efficiently.

Key words: deep Web; hidden Web; surface Web; hidden Web entry; searchable form

Chinese Library Classification number: TP393    Document code: A

1 Introduction

According to how its data is stored, the World Wide Web can be divided into two categories: the surface Web and the deep Web (also called the hidden Web). In the surface Web, data are stored in document files, while in the deep Web, data are stored in databases [1].
Unlike the surface Web, the deep Web refers to the collection of Web data that is accessible only by interacting with a Web-based query interface, not by traversing static hyperlinks. A July 2000 white paper [2] estimated that the deep Web has 450 000 databases, 7 500 terabytes of information and 550 billion individual documents. In contrast, the surface Web contains 19 terabytes of information and 1 billion individual documents. In addition, according to many studies, the hidden Web is growing rapidly as more organizations release their valuable content online through easy-to-use Web interfaces [3]. The content provided by many hidden Web sites is often of very high quality and can be extremely valuable to many users. The site of the US Patent and Trademark Office is an example: it makes existing patent documents available so that potential inventors can examine what has already been invented.

To retrieve data from online databases, three main problems must be considered: interface unification (also called interface integration), query translation and result merging. Before a hidden Web database is queried, the search system first characterizes the available search interfaces; then, given a query, it selects a subset of useful domain-specific search interfaces, queries them and presents the results to the user.

In this paper, we consider an overlooked problem that precedes these three: discovering deep Web entries. A search system must discover a set of search interfaces, or be provided with such a set, before it can proceed with the other three steps.

Much work has been done in each of these three areas.
For each domain, the MetaQuerier [4] constructs a unified interface through which users send their queries. A mediator translates the queries for each specific online database and then returns the integrated query results to the users. Chang, et al. [5] use a parsing approach that achieves above 85% accuracy for extracting query conditions across randomly selected deep Web sources and for query interface matching. Wu, et al. [6] develop a novel approximation algorithm, LMax, which builds the unified interface via recursive applications of clustering aggregation. Moreover, they extend LMax to handle the irregularities that frequently occur in interface schemas. The interface extractor [7] achieves a deeper understanding of Web search interfaces in the sense that more semantic/meta information on search interfaces can be extracted. With such semantic/meta information, the enriched interface schema can be used in many applications, for instance, query translation, search result extraction and annotation.

Chang, et al. [8] pursue a source-based and rule-driven framework to implement query translation across different deep Web sources. In contrast, He, et al. [9] propose a generic type-based and search-driven query translation framework to reach the same goal.

A deep Web wrapper is a program that extracts contents from search results. Nakatoh, et al. [10] propose a new automatic generation algorithm which discovers a repetitive pattern in search results. Hedley, et al. [11] describe a Two-Phase Sampling (2PS) technique to detect templates and extract query-related information from the sampled documents of a database. Mundluru, et al. [12] give a highly effective and efficient solution for automatically mining result records from search engine response pages. Experimental results showed that their system significantly outperforms MDR [13], a state-of-the-art record mining system.

Though much work has been done in those areas, little has been done on interface discovery, even though all three main problems depend on having a set of known hidden Web interfaces.

The remainder of the paper is organized as follows. We review related work in Section 2. Section 3 describes the page and form classifiers. The three-step framework of our hidden Web crawler is described in Section 4. Experimental results are presented in Section 5. Section 6 concludes and discusses future work.

2 Related Work

In recent years, the hidden Web has become a hot research topic.
It is estimated that there are several million hidden Web sites, which contain a large amount of high-quality information [14]. The difficulties in automatically filling out structured Web forms have been documented in the literature [15]. The work in Ref.[16] describes a semi-automatic crawler called HiWE, which is aided by domain knowledge to generate reasonable queries for hidden Web interfaces. Cope, et al. [17] use an automatic feature generation technique to describe candidate forms and a C4.5 decision tree to classify them. On their two testbeds, the ANU collection and a random Web collection, they obtain an accuracy of more than 85% and a precision of more than 87%, respectively. Bergholz, et al. [18] describe a crawler which starts from the Publicly Indexable Web (PIW) to find entry points into the hidden Web. This crawler is domain-specific and is initialized with pre-classified documents and relevant keywords. Barbosa and Freire [19] compose two classifiers in a hierarchical fashion to identify online databases among a heterogeneous set of Web forms automatically gathered by a focused crawler. In Ref.[20], they present a new adaptive focused crawling strategy for efficiently locating hidden Web entry points. Unfortunately, the ACHE framework they propose cannot handle very sparse domains efficiently. Besides, the ACHE framework is complex and its overhead is large.

Our technique differs from theirs in two respects. First, our modified best-first crawler searches only for domain-specific hidden Web entries. Second, we use a three-step framework to guide our deep Web crawler.

3 Page and Form Classifiers

To find domain-specific hidden Web entries, we use three classifiers that work in a hierarchical fashion to guide our deep Web crawler: a form structure classifier, a form text classifier and a page text classifier.

3.1 Form structure classifier

A form is made up of two parts: a structural part and a textual part. Consider the well-known Perl CPAN Web page as an example, where we can find distributions, modules, documents and IDs. The HTML source code of the form contained in this Web page is listed below:

    <form method="get" action="/search" name="f" class="searchbox">
    <input type="text" name="query" value="" size="35">
    <br>in <select name="mode">
    <option value="all">All</option>
    <option value="module">Modules</option>
    <option value="dist">Distributions</option>
    <option value="author">Authors</option>
    </select> <input type="submit" value="CPAN Search">
    </form>

When displayed in the IE browser, the result is shown in Fig.1.

[Fig.1 rendering of the form: a text box, the word "in", a selection list showing "All", and a "CPAN Search" button]

Fig.1 An illustrated form interface displayed in the IE browser

From Fig.1, we can see that a form contains not only textual contents such as 'in' and 'CPAN Search', but also structural contents such as select elements and submission buttons.
To identify whether a form is a domain-specific searchable form, we use form structural and textual features to train the form structure classifier and the form text classifier, respectively. With a form structure classifier, we can discard non-searchable forms, such as login forms, discussion group interfaces, mailing list subscriptions, purchase forms and Web-based email forms. Barbosa, et al. [21] and Cope, et al. [17] demonstrate that a decision tree gives the best results for separating searchable from non-searchable forms. Accordingly, we also use a decision tree algorithm to train the form structure classifier.

3.2 Form text classifier

With the aid of the decision tree classifier, we can identify whether a form is searchable. To further ascertain whether a searchable form is a domain-specific one, we must make full use of form textual features. Following previous research [19,20,22], we use the libsvm learning algorithm [22] for this task.

To extract textual features from forms, two text extraction methods are tried in this paper: the FT (full text) method and the PT (partial text) method. The FT method simply takes all the HTML code of a form and splits it on non-alphanumeric strings. In contrast, the PT method extracts only the textual features that a human can see when the form is displayed in a browser, together with the form's action attribute, i.e., the URL to which all form data are sent. For example:

    <form action="http://www.hotwire.com/car/search-options.jsp" method="get" name="searchCar">
    This is a form which is used for demo.
    </form>

In the above form, the action attribute value is http://www.hotwire.com/car/search-options.jsp, which is also extracted by the PT method.
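
The difference between the two extraction methods can be illustrated with a minimal Python sketch. It is not the authors' Perl implementation: the function names `ft_tokens` and `pt_tokens` are hypothetical, and treating submit-button captions as visible text is our assumption about what "can be seen" covers.

```python
import re
from html.parser import HTMLParser


def ft_tokens(form_html):
    """FT: split the complete HTML source of the form on non-alphanumeric runs."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", form_html) if t]


class _VisibleText(HTMLParser):
    """Collects visible text, button captions and the form's action URL."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "form" and a.get("action"):
            self.parts.append(a["action"])      # target URL of the form
        elif tag == "input" and a.get("type") in ("submit", "button") and a.get("value"):
            self.parts.append(a["value"])       # caption shown on the button (assumption)

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())     # text between tags, e.g. labels, options


def pt_tokens(form_html):
    """PT: tokens from the visible text plus the action attribute only."""
    parser = _VisibleText()
    parser.feed(form_html)
    parser.close()
    return [t for t in re.split(r"[^A-Za-z0-9]+", " ".join(parser.parts)) if t]
```

On the demo form above, `ft_tokens` returns markup words such as `method` and `name`, whereas `pt_tokens` returns only the words of the action URL and of the visible sentence.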

To use these extracted textual features, some pre-processing steps are needed. First, all non-alphanumeric characters are replaced by a space character; second, uppercase characters, if any, are converted to their lowercase equivalents; third, stop words, if any, are removed, using the CPAN [23] Perl package Lingua::EN::StopWords; fourth, each word in the remaining text is stemmed, using the CPAN Perl package Lingua::Stem::En; finally, TFIDF [24] is used to transform each training example into its corresponding vector. The same procedure is also applied for the page text classifier (see Section 3.3).

Using the extracted textual features, we can train an SVM classifier that identifies whether a searchable form is domain-specific.
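
The five pre-processing steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl pipeline used in the paper: `STOP_WORDS` is a tiny stand-in for Lingua::EN::StopWords, `crude_stem` is a naive suffix stripper standing in for Lingua::Stem::En, and the weighting tf * log(N/df) is one common TF-IDF variant chosen for illustration, since the exact variant is not specified in the text.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "this", "to", "which"}


def crude_stem(word):
    """Naive suffix stripping; only a stand-in for a real English stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    text = re.sub(r"[^A-Za-z0-9]", " ", text)            # 1: non-alphanumerics -> space
    words = text.lower().split()                          # 2: lowercase
    words = [w for w in words if w not in STOP_WORDS]     # 3: remove stop words
    return [crude_stem(w) for w in words]                 # 4: stem


def tfidf_vectors(documents):
    """5: map each document to a sparse {term: tf * log(N / df)} vector."""
    tokenized = [preprocess(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors
```

A term that occurs in every training example receives weight 0 under this variant, so markup-like words shared by all forms contribute nothing to the vector.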

3.3 Page text classifier

To decide automatically whether a Web page is relevant, we use a page text classifier. Given a Web page, we first obtain its plain text. The pre-processing steps of Section 3.2 are then applied so that the text can be used to train an SVM classifier.

With these three classifiers in hand, we can use them to guide a focused crawler toward deep Web entries. First, the page text classifier decides whether the Web page behind a given URL is relevant. Second, if the page is relevant, we extract searchable forms from it with the aid of the form structure classifier. Third, if the relevant page contains searchable forms, the form text classifier ascertains whether they are domain-specific.

We compose the classifiers hierarchically because this leads to modularity: a complex problem is decomposed into simpler sub-components, each devoted to a subset of the hypothesis. This has two merits. First, the overall classification process is not only accurate but also robust. Second, we can apply to each part the learning method best suited to the feature set of that partition.
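
The hierarchical composition described above can be summarized in a short Python sketch. The three classifiers are passed in as plain callables; their names and signatures are hypothetical and stand for the trained page text, form structure and form text classifiers.

```python
from typing import Callable, Iterable, List


def find_domain_entries(
    page_text: str,
    forms: Iterable[str],
    page_is_relevant: Callable[[str], bool],
    form_is_searchable: Callable[[str], bool],
    form_is_in_domain: Callable[[str], bool],
) -> List[str]:
    """Apply the three classifiers in order and return the surviving forms."""
    # Step 1: an irrelevant page is dropped without looking at its forms.
    if not page_is_relevant(page_text):
        return []
    # Step 2: structural features filter out login, subscription forms, etc.
    searchable = [f for f in forms if form_is_searchable(f)]
    # Step 3: textual features keep only the forms of the target domain.
    return [f for f in searchable if form_is_in_domain(f)]
```

Because each stage only sees what the previous stage accepted, the cheaper page-level test shields the form classifiers from most irrelevant input, and each stage can be retrained independently.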

4 Three-Step Framework

Figure 2 shows the high-level architecture proposed in this paper.

[Fig.2 diagram labels: Web pages in relevant Web sites; Internet; Page text classifier; Best-First crawler; Form structure classifier; Searchable forms; Form text classifier; Relevant forms]

Fig.2 The high-level architecture proposed in this paper

Note that the best-first crawler used in this paper is a variation of the best-first crawler proposed in Ref.[22]. In Ref.[22], no distinction is made among the URLs that lie in an on-topic page, whereas we give each URL a priority according to the formula

    a * page_score + b * anchor_score.

Here, we set both a and b to 1.
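
The priority formula can be sketched as a best-first frontier in Python. The class `Frontier` and its methods are hypothetical names; how `page_score` and `anchor_score` are computed (for example, from the page text classifier and from the anchor text) is left open here, since the text does not specify it.

```python
import heapq
import itertools


class Frontier:
    """Best-first URL queue ordered by a * page_score + b * anchor_score."""

    def __init__(self, a=1.0, b=1.0):
        self.a, self.b = a, b
        self._heap = []
        self._order = itertools.count()   # tie-breaker: first come, first served
        self._seen = set()

    def push(self, url, page_score, anchor_score):
        if url in self._seen:             # never enqueue the same URL twice
            return
        self._seen.add(url)
        priority = self.a * page_score + self.b * anchor_score
        # heapq is a min-heap, so the priority is negated to pop the best URL first.
        heapq.heappush(self._heap, (-priority, next(self._order), url))

    def pop(self):
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)
```

With a = b = 1, a link with promising anchor text on a moderately relevant page can outrank a link with uninformative anchor text on a highly relevant page.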

The detailed procedure of our deep Web crawler is displayed in Fig.3.

[Fig.3 flowchart, recoverable labels: "Contain relevant entry"; "Relevant"; "out-frontier is empty"; "in-frontier is empty"; "depth = 3 or the total number of pages visited >= threshold (100)"]

We set depth < 3 because Web databases tend to be located shallowly in their sites and the vast majority of them (approximately 94%) can be found within the top 3 levels [16]. In addition, to keep our crawler from getting trapped in a site, we set a threshold on the maximum number of Web pages visited per site.
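
The two stopping conditions can be expressed as a small admission check, sketched below in Python. The class name `CrawlBudget`, the convention that a site's root page has depth 0, and the use of the host name to identify a site are our assumptions; the limits 3 and 100 are the values given in the text.

```python
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH = 3             # only pages with depth < 3 are visited
MAX_PAGES_PER_SITE = 100  # per-site threshold on visited pages


class CrawlBudget:
    """Tracks per-site visits and decides whether a URL may still be crawled."""

    def __init__(self, max_depth=MAX_DEPTH, max_pages=MAX_PAGES_PER_SITE):
        self.max_depth = max_depth
        self.max_pages = max_pages
        self.visited = Counter()          # host name -> number of pages visited

    def may_visit(self, url, depth):
        site = urlparse(url).netloc
        return depth < self.max_depth and self.visited[site] < self.max_pages

    def record_visit(self, url):
        self.visited[urlparse(url).netloc] += 1
```

The depth bound keeps the crawler near the top of each site, where most query interfaces are reported to be, while the page counter bounds the cost of any single site.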

5 Experimental Results

The TEL-8 Query Interfaces [25] dataset is used to train the form classifiers. This dataset contains the original interfaces extracted from eight representative domains: Airfares, Automobiles, Books, Car Rentals, Hotels, Jobs, Movies and Music Records. Table 1 shows the distribution of instances over the eight representative database domains.

Table 1 The distribution of instances over the eight database domains (223 sources in all)

    Domain    Sources    Domain    Sources
    Airfare   20         Auto      28
    Book      43         Rental    13
    Hotel     34         Job       20
    Movie     32         Music     33

5.1 Training form structure classifier

In this paper, our form structure classifier is trained with a decision tree algorithm. The training data are collected as follows: we extract 223 searchable forms (see Table 1) from TEL-8 Query Interfaces as positive examples and manually gather 318 non-searchable forms as negative ones. For each form in the sample dataset, we count the following features: number of checkboxes; number of file inputs; number of hidden tags; number of image inputs; number of submission methods (get and post); number of select elements; number of password tags; number of radio tags; number of occurrences of the word 'search' within the form tag or submission button; number of text elements; number of textarea elements; and number of occurrences of the word 'email' in input elements' name or value.

The distributions of these features in searchable and non-searchable forms are displayed in Table 2. From Table 2, we can draw the following conclusions: searchable forms have relatively many checkboxes and items (options) in selection lists, whereas non-searchable forms have relatively many password tags and occurrences of 'email' in input elements' name or value.

Using these structural features, we can train a decision tree classifier.
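
Counting such structural features is straightforward with an HTML parser. The following Python sketch is only an illustration: the feature names follow the row labels of Table 2, the exact counting rules (for instance, treating a missing method attribute as get, or counting both select and option elements) are our assumptions, and the resulting dictionary would still have to be turned into a fixed-length vector for a decision tree learner.

```python
from collections import Counter
from html.parser import HTMLParser

INPUT_KINDS = ("checkbox", "file", "hidden", "image", "password", "radio", "text")


class FormFeatureCounter(HTMLParser):
    """Counts structural features of the form(s) fed to the parser."""

    def __init__(self):
        super().__init__()
        self.features = Counter()

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "form":
            if a.get("method", "get").lower() == "get":   # HTML default is get
                self.features["method_get"] += 1
            if "search" in " ".join(a.values()).lower():   # 'search' in the form tag
                self.features["search_yes"] += 1
        elif tag == "input":
            kind = a.get("type", "text").lower()
            if kind in INPUT_KINDS:
                self.features[kind] += 1
            if kind == "submit" and "search" in a.get("value", "").lower():
                self.features["search_yes"] += 1           # 'search' on the button
            if "email" in (a.get("name", "") + " " + a.get("value", "")).lower():
                self.features["email_yes"] += 1
        elif tag in ("select", "option", "textarea"):
            self.features[tag] += 1


def form_features(form_html):
    parser = FormFeatureCounter()
    parser.feed(form_html)
    parser.close()
    return dict(parser.features)
```

Applied to the CPAN form of Section 3.1, this sketch counts one text input, one select element with four options, a get method and two occurrences of 'search', and no password or email fields.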

Two tools are used to construct the form structure classifier: the Weka j48 algorithm [26] and the Algorithm::SVMLight Perl package [23]. In Weka, the precision of the decision tree classifier is 0.948718. The decision tree generated by Weka is displayed in Fig.4. In this paper we actually use the Perl package Algorithm::SVMLight to train the decision tree classifier, and its precision is 0.948717948. The two tools thus give essentially the same result in our experiments. However, the current implementation of Algorithm::SVMLight supports only discrete-valued attributes and consequently outputs a large number of rules, 122 in all. Additionally, we count an instance as misclassified if the decision tree cannot decide to which category it belongs.

Table 2 Feature distributions of searchable and non-searchable forms

    Feature       Searchable   Non-searchable   Ratio
    checkbox      2.39         0.18             13.04:1
    email_yes     0.01         0.12             1:13.13
    file          0.00         0.00
    hidden        4.45         1.63             2.72:1
    image         0.36         0.21             1.73:1
    method_get    0.47         0.40             1.16:1
    option        12.64        0.17             74.23:1
    password      0.02         0.10             1:5.86
    radio         0.48         0.10             4.76:1
    search_yes    0.36         0.09             3.94:1
    text          3.00         1.01             2.97:1
    textarea      0.02         0.07             1:2.98

Fig.4 The decision tree generated by the Weka j48 algorithm using form structural features

5.2 Training form text classifier

Using the FT and PT methods (see Section 3.2), we can extract textual features from forms.
The five most frequent features obtained by the FT and PT methods are presented in Table 3. Table 3 shows that the PT method extracts more valuable features than the FT method does.

Table 3 Five most frequent features extracted by the FT and PT methods, respectively (Feature:Frequency)

    Method  Category  Textual features
    FT      Airfare   option:8113   value:4161  td:1181     id:1069      class:993
    FT      Auto      option:5673   value:3002  td:1520     tr:716       class:498
    FT      Book      option:10997  value:5753  td:1538     tr:788       name:421
    FT      Rental    option:6396   value:3199  td:892      pm:550       class:520
    FT      Hotel     option:13048  value:6271  class:1377  div:1358     td:1265
    FT      Job       option:7423   value:3868  u:775       td:680       tr:413
    FT      Movie     option:6995   value:3616  div:1200    class:1181   td:734
    FT      Music     option:8090   value:679   td:516      record:456   font:323
    PT      Airfare   pm:419        airlin:279  air:124     am:102       airwai:100
    PT      Auto      docum:108     car:105     leas:84     search:63    make:56
    PT      Book      search:130    titl:110    book:95     author:75    new:72
    PT      Rental    pm:402        option:202  am:168      airport:144  car:143
    PT      Hotel     hotel:234     pm:228      island:151  new:135      room:84
    PT      Job       job:207       new:125     locat:84    servic:82    island:81
    PT      Movie     press:211     book:123    s:109       video:107    entertain:107
    PT      Music     record:456    music:226   sub:156     search:97    new:80

Using these extracted textual features, we train eight SVM classifiers, whose precisions are shown in Table 4. Since the PT method extracts more valuable textual features than the FT method does, these features can be used to train a more accurate classifier. Table 4 indicates that, for every category, the PT method yields a more accurate classifier than the FT method.

Table 4 Precisions of SVM classifiers trained with form textual features extracted by the FT and PT methods, respectively

    Category   FT method   PT method
    Airfare    0.9182      0.9364
    Auto       0.9476      1.0
    Book       0.9273      0.9636
    Rental     0.9545      0.9636
    Hotel      0.9333      0.9381
    Job        0.9409      0.9682
    Movie      0.9273      0.9773
    Music      0.9         0.9682

5.3 Training page text classifier

To train the page text classifier, we obtain positive training data from the online Open Directory Project (http://dmoz.org/).
We use a Perl script to fill out the searchable form and extract URLs from the returned result pages automatically. Negative training data are taken from the DMOZ data (http://rdf.dmoz.org/). DMOZ consists of sixteen categories: Arts, Business, Computers, Games, Health, Home, Kids_and_Teens, News, Recreation, Reference, Regional, Science, Shopping, Society, Sports and World. We discard four of them in our experiments: Kids_and_Teens, Reference, Regional and World. The number of URLs in each remaining category is shown in Table 5.

Table 5 The number of URLs in each DMOZ category

    Arts      Business     Computers   Games      Health    Home
    585924    511620       285336      123758     131050    33555
    News      Recreation   Science     Shopping   Society   Sports
    235704    120308       213014      235160     269864    154921

In DMOZ, each example looks like this:

    <ExternalPage about="http://www.airwise.com/airports/us/SLC/index.html">
    <d:Title>Salt Lake City Airport - airwise.com</d:Title>
    <d:Description>Information about the airport including airlines, ground transportation, parking, weather and airport news.</d:Description>
    <topic>Top/Regional/North_America/United_States/Utah/Localities/S/Salt_Lake_City/Transportation/Airports</topic>
    </ExternalPage>

This paper uses the content of the 'd:Description' element and the Web page corresponding to the 'about' attribute of ExternalPage to obtain a negative training example.
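
Extracting the two fields used for a negative example can be sketched in Python with the standard XML parser. The wrapping `RDF` element and the namespace URIs in `SAMPLE` are assumptions added only to make the fragment well-formed; the code matches elements by local name, so it does not depend on the exact URIs, and downloading the page behind `url` is left out.

```python
import xml.etree.ElementTree as ET

# Well-formed wrapper around the record shown above (namespaces are assumptions).
SAMPLE = """<RDF xmlns="http://dmoz.org/rdf/" xmlns:d="http://purl.org/dc/elements/1.0/">
<ExternalPage about="http://www.airwise.com/airports/us/SLC/index.html">
  <d:Title>Salt Lake City Airport - airwise.com</d:Title>
  <d:Description>Information about the airport including airlines, ground transportation, parking, weather and airport news.</d:Description>
  <topic>Top/Regional/North_America/United_States/Utah/Localities/S/Salt_Lake_City/Transportation/Airports</topic>
</ExternalPage>
</RDF>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def external_pages(rdf_text):
    """Yield one dict per ExternalPage with its URL, description and topic."""
    root = ET.fromstring(rdf_text)
    for elem in root.iter():
        if local_name(elem.tag) != "ExternalPage":
            continue
        fields = {local_name(child.tag): (child.text or "").strip() for child in elem}
        yield {
            "url": elem.get("about"),
            "description": fields.get("Description", ""),
            "topic": fields.get("topic", ""),
        }
```

The `topic` field gives the top-level category of each record, which is what a per-category selection of negative examples would be keyed on.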
To make the negative examples more representative, we draw URLs from each category in proportion to its size. Since the 'Arts' category has the largest number of URLs, we draw the largest number of URLs from it.
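
Size-proportional selection can be sketched in a few lines of Python. The function name `proportional_quota` and the category sizes in the usage note are hypothetical; the text does not state how many negative URLs are drawn in total or how rounding is handled.

```python
def proportional_quota(category_sizes, total):
    """Number of URLs to draw from each category, proportional to its size."""
    grand_total = sum(category_sizes.values())
    return {
        name: round(total * size / grand_total)
        for name, size in category_sizes.items()
    }
```

For example, with hypothetical sizes of 600, 300 and 100 URLs and a budget of 50, the quotas are 30, 15 and 5. Because of rounding, the quotas need not add up exactly to the requested total in general.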
After excluding the URLs that cannot be downloaded, the numbers of positive and negative examples used to train the page classifier for each category are listed in Table 6. Using these training data, we train the page classifiers. Their precisions are also shown in Table 6.

Table 6 The number of positive and negative training data for each category as well as the precision of its corresponding page classifier

    Category   Precision
    Airfare    0.9619
    Auto       0.9463
    Book       0.9135
    Rental     0.9737
    Hotel      0.9941
    Job        0.9791
    Movie      0.9004
    Music      0.87

    Positive and negative counts (column boundaries lost in the source): 116251215689131708170616012231635643325228227213178312487

5.4 Using the three-step framework to find deep Web entries

We conduct eight large-scale experiments with our hidden Web crawler.
For each category, we initialize our crawler with 100 seeds extracted from DMOZ as starting points. We save pages and their corresponding URLs if the following two conditions are satisfied at the same time. First, they are judged to be relevant by the page and form text classifiers. Second, each page contains at least one domain-specific deep Web entry. Since Music Records databases are very sparsely distributed, our best-first focused crawler locates only 50 deep Web entries for this category. For each of the other categories, our crawler finds 100 deep Web entries. Five deep Web entries of the Airfares category located by our crawler are listed below:

    http://www.aircanada.ca/
    http://www.itn.net/
    http://www.aircharter.com/
    http://www.orbitz.com/
    http://www.nwa.com/

Finally, we manually verify whether the deep Web entries located by our crawler are what we want. The precisions for all categories are shown in Table 7.

Table 7 The precisions of deep Web entries for each category

    Domain    Precision    Domain    Precision
    Airfare   0.90         Auto      0.88
    Book      0.91         Rental    0.95
    Hotel     0.94         Job       0.81
    Movie     0.86         Music     0.80

6 Conclusion and Future Work

In this paper, a three-step framework is proposed to automatically identify domain-specific hidden Web entries.
To verify its effectiveness and efficiency, eight large-scale experiments are conducted. Experimental results demonstrate that our method can find domain-specific deep Web entries accurately and efficiently. The average precision over the eight representative domains is 0.88.
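
The reported average can be checked directly against the per-domain precisions of Table 7; the short Python computation below is only a verification aid, and the variable names are ours.

```python
# Per-domain precisions as reported in Table 7.
precisions = {
    "Airfare": 0.90, "Auto": 0.88, "Book": 0.91, "Rental": 0.95,
    "Hotel": 0.94, "Job": 0.81, "Movie": 0.86, "Music": 0.80,
}

# Unweighted mean over the eight domains: 7.05 / 8.
average = sum(precisions.values()) / len(precisions)
print(f"{average:.2f}")
```

The mean is unweighted, so the Music domain, for which only 50 entries were located, counts as much as the domains with 100 entries each.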

In the near future, experiments on a larger number of categories are required to further assess the effectiveness of the proposed technique. Additionally, we will integrate the query interfaces obtained into a unified interface through which users can query. Users will then not have to hunt for domain-specific sources or learn the details of querying each of them.

Acknowledgement This work is sponsored by the Science and Technology Development Program of Jilin Province under Grant No.20070533 and the Natural Science Foundation of China under Grant No.60373099.

References:
[1] Rocco D, Caverlee J, Liu L, Critchlow T. Exploiting the deep Web with DynaBot: Matching, probing, and ranking. In: Ellis A, Hagino T, eds. Proc. of the World Wide Web Special Interest Tracks and Posters (WWW). Chiba: ACM, 2005. 1174-1175.
[2] BrightPlanet.com. The deep Web: Surfacing hidden value. http://brightplanet.com
[3] Bergman MK. The deep Web: Surfacing hidden value. Journal of Electronic Publishing, 2001,7(1):1174-1175. http://www.press.umich.edu/jep/07-01/bergman.html
[4] He B, Zhang Z, Chang KCC. Knocking the door to the deep Web: Integrating Web query interfaces. In: Weikum G, ed. Proc. of the SIGMOD Conf. Paris: ACM, 2004. 913-914.
[5] Chang KCC, He B, Zhang Z. MetaQuerier over the deep Web: Shallow integration across holistic sources. In: Nascimento MA, Özsu MT, Kossmann D, Miller RJ, Blakeley JA, Schiefer KB, eds. Proc. of the Int'l Conf. on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers, 2004. 15-21.
[6] Wu W, Doan A, Yu CT. Merging interface schemas on the deep Web via clustering aggregation. In: Proc. of the Int'l Conf. on Data Mining (ICDM). IEEE Computer Society, 2005. 801-804.
[7] He H, Meng WY, Yu CT, Wu ZH. WISE-Integrator: A system for extracting and integrating complex Web search interfaces of the deep Web. In: Böhm K, Jensen CS, Haas LM, Kersten ML, Larson PA, Ooi BC, eds. Proc. of the Int'l Conf. on Very Large Data Bases (VLDB). ACM, 2005. 1314-1317.
[8] Chang KCC, Garcia-Molina H. Mind your vocabulary: Query mapping across heterogeneous information sources. In: Delis A, Faloutsos C, Ghandeharizadeh S, eds. Proc. of the SIGMOD Conf. Philadelphia: ACM Press, 1999. 335-346.
[9] He B, Zhang Z, Chang KCC. MetaQuerier: Querying structured Web sources on-the-fly. In: Özcan F, ed. Proc. of the SIGMOD Conf. ACM, 2005. 927-929.
[10] Nakatoh T, Yamada Y, Hirokawa S. Automatic generation of deep Web wrappers based on discovery of repetition. In: Proc. of the Asia Information Retrieval Symp. (AIRS). Beijing: Springer-Verlag, 2004. 269-272.
[11] Hedley YL, Younas M, James A, Sanderson M. A two-phase sampling technique for information extraction from hidden Web databases. In: Laender AHF, Lee D, Ronthaler M, eds. Proc. of the Int'l Workshop on Web Information and Data Management (WIDM). Washington: ACM, 2004. 1-8.
[12] Mundluru D, Katukuri JR, Celebi S. Automatically mining result records from search engine response pages. In: Proc. of the Int'l Conf. on Data Mining (ICDM). IEEE Computer Society, 2005. 749-752.
[13] Liu B, Grossman R, Zhai YH. Mining data records in Web pages. In: Getoor L, Senator TE, Domingos P, Faloutsos C, eds. Proc. of the Knowledge Discovery and Data Mining (KDD). Washington: ACM, 2003. 601-606.
[14] Hsieh W, Madhavan J, Pike R. Data management projects at Google. In: Chaudhuri S, Hristidis V, Polyzotis N, eds. Proc. of the SIGMOD Conf. Chicago: ACM, 2006. 725-726.
[15] Wu P, Wen JR, Liu H, Ma WY. Query selection techniques for efficient crawling of structured Web sources. In: Liu L, Reuter A, Whang KY, Zhang J, eds. Proc. of the Int'l Conf. on Data Engineering (ICDE). IEEE Computer Society, 2006. 47.
[16] Raghavan S, Garcia-Molina H. Crawling the hidden Web. In: Apers PMG, Atzeni P, Ceri S, Paraboschi S, Ramamohanarao K, Snodgrass RT, eds. Proc. of the Int'l Conf. on Very Large Data Bases (VLDB). Rome: Morgan Kaufmann Publishers, 2001. 129-138.
[17] Cope J, Craswell N, Hawking D. Automated discovery of search interfaces on the Web. In: Schewe KD, Zhou X, eds. Proc. of the Australasian Database Conf. (ADC). Australian Computer Society, 2003. 181-189.
[18] Bergholz A, Chidlovskii B. Crawling for domain-specific hidden Web resources. In: Proc. of the Int'l Conf. on Web Information Systems Engineering (WISE). Roma: IEEE Computer Society, 2003. 125-133.
[19] Barbosa L, Freire J. Combining classifiers to identify online databases. In: Williamson CL, Zurko ME, Patel-Schneider PF, Shenoy PJ, eds. Proc. of the World Wide Web Conf. (WWW). ACM, 2007. 431-440.
[20] Barbosa L, Freire J. An adaptive crawler for locating hidden-Web entry points. In: Williamson CL, Zurko ME, Patel-Schneider PF, Shenoy PJ, eds. Proc. of the World Wide Web Conf. (WWW). ACM, 2007. 441-450.
[21] Barbosa L, Freire J. Searching for hidden-Web databases. In: Doan AH, Neven F, McCann R, Bex GJ, eds. Proc. of the 8th Int'l Workshop on the Web and Databases (WebDB). Baltimore: ACM Press, 2005. 1-6.
[22] Chang CC, Lin CJ. Libsvm: A library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[23] CPAN. http://search.cpan.org/
[24] Torgo L, Gama J. Regression by classification. In: Borges D, Kaestner C, eds. Proc. of the Brazilian Artificial Intelligence Symp. Curitiba: Springer-Verlag, 1996. 51-60.
[25] The UIUC Web integration repository. http://metaquerier.cs.uiuc.edu/repository/
[26] Weka. http://www.cs.waikato.ac.nz/ml/weka/

WANG Hui was born in 1972. He received his Ph.D. degree from Jilin University. His research area is Web information mining.

ZUO Wan-Li was born in 1957. He is a professor and doctoral supervisor at Jilin University and a CCF senior member. His research areas are databases, data mining and Web search engines.

LIU Yan-Wei was born in 1983. He is a graduate student at Jilin University. His research area is Web information mining.