ISSN 1000-9825, CODEN RUXUEW    E-mail: jos@iscas.ac.cn
Journal of Software, Vol.19, No.2, February 2008, pp.246-256    http://www.jos.org.cn
DOI: 10.3724/SP.J.1001.2008.00246    Tel/Fax: +86-10-62562563
© 2008 by Journal of Software. All rights reserved.

Using Classifiers to Find Domain-Specific Online Databases Automatically

WANG Hui+, LIU Yan-Wei, ZUO Wan-Li

(College of Computer Science and Technology, Jilin University, Changchun 130012, China)

+ Corresponding author: Phn: +86-431-85166492, E-mail: whui05@yahoo.com.cn

Wang H, Liu YW, Zuo WL. Using classifiers to find domain-specific online databases automatically. Journal of Software, 2008,19(2):246-256. http://www.jos.org.cn/1000-9825/19/246.htm

Abstract: In the hidden Web domain, general-purpose search engines (e.g., Google and Yahoo) have their shortcomings. Each covers less than one-third of the data stored in document databases and, unlike on the surface Web, combining them covers roughly the same data. The hidden Web is a highly important information source, since the content provided by many hidden Web sites is often of very high quality. This paper proposes a three-step framework to automatically identify domain-specific hidden Web entries. The query interfaces obtained in this way can be integrated into a unified interface that is given to users to query. Eight large-scale experiments demonstrate that the technique can find domain-specific hidden Web entries accurately and efficiently.
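Keywords: deep Web; hidden Web; surface Web; hidden Web entry; searchable form

Chinese abstract (translated): In deep Web research, general-purpose search engines (such as Google and Yahoo) have many shortcomings: the amount of data each of them covers is less than one-third of all deep Web data and, unlike on the surface Web, combining several search engines leaves the covered amount essentially unchanged. Many deep Web sites provide large amounts of high-quality information, and the deep Web is gradually becoming one of the most important information resources. A framework of three classifiers is proposed for automatically identifying domain-specific deep Web entries. Once the query interfaces have been obtained, they can be integrated, and a unified interface can be offered to users to make querying easier. Eight groups of large-scale experiments verify that the proposed method discovers domain-specific deep Web entries accurately and efficiently.

Chinese keywords (translated): deep Web; hidden Web; surface Web; deep Web entry; searchable form

CLC number: TP393    Document code: A

Supported by the National Natural Science Foundation of China under Grant No.60373099; the Science and Technology Development Program of Jilin Province of China under Grant No.20070533
Received 2007-08-02; Accepted 2007-11-06

1 Introduction

According to how its data are stored, the World Wide Web can be divided into two categories: the surface Web and the deep Web (also called the hidden Web). In the surface Web, data are stored in document files, while in the deep Web, data are stored in databases[1].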
Unlike the surface Web, the deep Web refers to the collection of Web data that is accessible by interacting with a Web-based query interface, not through the traversal of static hyperlinks. A July 2000 white paper[2] estimated that the deep Web has 450 000 databases, 7 500 terabytes of information and 550 billion individual documents. In contrast, the surface Web contains 19 terabytes of information and 1 billion individual documents. In addition, according to many studies, the size of the hidden Web increases rapidly as more organizations release their valuable content online through easy-to-use Web interfaces[3]. The content provided by many hidden Web sites is often of very high quality and can be extremely valuable to many users. The site of the US Patent and Trademark Office is an example: it makes existing patent documents available to help potential inventors examine what has already been invented.

To retrieve data from online databases, three main problems should be considered: interface unification (also called interface integration), query translation and result merging. Before a hidden Web database is queried, a search system first characterizes the available search interfaces; then, given a query, it selects a subset of useful domain-specific search interfaces, queries them and presents the results to the user. In this paper, we consider an overlooked problem that precedes these three: discovering deep Web entries. A search system must discover a set of search interfaces, or be provided with such a set, before it can proceed with the other three steps.

Much work has been done in each of these three areas. For each domain, the MetaQuerier[4] constructs a unified interface which is provided for users to query. Users send their queries via the unified interface. A mediator translates the queries to each specific online database and then returns the integrated query results to the users. Chang, et al.[5] use a parsing approach that achieves above 85% accuracy for extracting query conditions across randomly selected deep Web sources and for query interface matching. Wu, et al.[6] develop a novel approximation algorithm, LMax, which builds the unified interface via recursive applications of clustering aggregation. Moreover, they extend LMax to handle the irregularities that frequently occur in interface schemas. The interface Extractor[7] can achieve a deeper understanding of Web search interfaces in the sense that more semantic/meta information on search interfaces can be extracted. With such semantic/meta information, the enriched interface schema can be used in many applications, for instance, query translation, search result extraction and annotation.

Chang, et al.[8] pursue a source-based and rule-driven framework to implement query translation across different deep Web sources. In contrast, He, et al.[9] propose a generic type-based and search-driven query translation framework to reach the same goal.

A deep Web wrapper is a program that extracts contents from search results. Nakatoh, et al.[10] propose a new automatic generation algorithm which discovers a repetitive pattern from search results. Hedley, et al.[11] describe a Two-Phase Sampling (2PS) technique to detect templates and extract query-related information from the sampled documents of a database. Mundluru, et al.[12] give a highly effective and efficient solution for automatically mining result records from search engine response pages. Experimental results showed that their proposed system significantly outperforms MDR[13], a state-of-the-art record mining system.

Though much work has been done in those areas, little work has been done on interface discovery, even though the three main problems all depend on having a set of known hidden Web interfaces.

The remainder of the paper is organized as follows. We review related work in Section 2. Section 3 describes the page and form classifiers. The three-step framework of our hidden Web crawler is described in Section 4. Experimental results are presented in Section 5. Section 6 gives conclusions and future work.

2 Related Work

In recent years, the hidden Web has become a hot research topic. It is estimated that there are several million hidden Web sites, which contain a large amount of high-quality information[14]. The difficulties in automatically filling out structured Web forms have been documented in the literature[15]. The work in Ref.[16] describes a semi-automatic crawler called HiWE, which is aided by domain knowledge to generate reasonable queries for hidden Web interfaces. Cope, et al.[17] use an automatic feature generation technique to describe candidate forms and a C4.5 decision tree to classify them. On their two testbeds, the ANU collection and a random Web collection, they obtain an accuracy of more than 85% and a precision of more than 87%, respectively. Bergholz, et al.[18] describe a crawler which starts from the Publicly Indexable Web (PIW) to find entry points into the hidden Web. This crawler is domain-specific and is initialized with pre-classified documents and relevant keywords. Barbosa and Freire[19] compose two classifiers in a hierarchical fashion to identify online databases among a heterogeneous set of Web forms automatically gathered by a focused crawler. In Ref.[20], they present a new adaptive focused crawling strategy for efficiently locating hidden Web entry points. Unfortunately, the ACHE framework they propose cannot handle very sparse domains efficiently. Besides, the ACHE framework is complex and its overhead is large.

Our technique differs from theirs. First, our modified best-first crawler finds only domain-specific hidden Web entries. Second, we use a three-step framework to guide our deep Web crawler.

3 Page and Form Classifiers

In order to find domain-specific hidden Web entries, we use three classifiers, which work in a hierarchical fashion, to guide our deep Web crawler: a form structure classifier, a form text classifier and a page text classifier.

3.1 Form structure classifier

A form is made up of two parts: a structural part and a textual part. Consider the well-known Perl CPAN Web page as an example, where one can find distributions, modules, documents and IDs. The HTML source code of the form contained in this Web page is listed below:

    <form method="get" action="/search" name="f" class="searchbox">
    <input type="text" name="query" value="" size="35">
    <br>in <select name="mode">
    <option value="all">All</option>
    <option value="module">Modules</option>
    <option value="dist">Distributions</option>
    <option value="author">Authors</option>
    </select>&nbsp;<input type="submit" value="CPAN Search">
    </form>

When displayed in the IE browser, the result is as shown in Fig.1.

[Figure: a text box, the word 'in', a selection list showing 'All', and a 'CPAN Search' button]
Fig.1 An illustrated form interface displayed in IE browser

From Fig.1, we can see that a form contains not only textual content such as 'in' and 'CPAN Search', but also structural content such as select elements and submission buttons. In order to identify whether a form is a domain-specific searchable form, we use form structural and textual features to train the form structure and form text classifiers, respectively. Once we have a form structure classifier, we can discard non-searchable forms, such as login forms, discussion group interfaces, mailing list subscriptions, purchase forms and Web-based email forms. Barbosa, et al.[21] and Cope, et al.[17] demonstrate that the best results are obtained by using a decision tree to separate searchable from non-searchable forms. Accordingly, we also use a decision tree algorithm to train our form structure classifier.
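
As an illustration of what "structural features" means in practice, the following minimal sketch counts a few of them for the CPAN form above. It is not the authors' implementation (the paper uses Perl); the class name `FormFeatureCounter`, the helper `count_form_features` and the feature keys (chosen to resemble the labels of Table 2) are assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class FormFeatureCounter(HTMLParser):
    """Counts simple structural features of the forms in an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "form":
            # A form without a method attribute is submitted with GET.
            self.counts["method_" + attrs.get("method", "get").lower()] += 1
            if "search" in " ".join(attrs.values()).lower():
                self.counts["search_yes"] += 1
        elif tag == "input":
            # An <input> without a type attribute is a text field.
            kind = attrs.get("type", "text").lower()
            self.counts[kind] += 1
            if kind == "submit" and "search" in attrs.get("value", "").lower():
                self.counts["search_yes"] += 1
            name_and_value = attrs.get("name", "") + " " + attrs.get("value", "")
            if "email" in name_and_value.lower():
                self.counts["email_yes"] += 1
        elif tag in ("select", "option", "textarea"):
            self.counts[tag] += 1


def count_form_features(form_html):
    """Return a Counter of structural features (hypothetical helper)."""
    parser = FormFeatureCounter()
    parser.feed(form_html)
    parser.close()
    return parser.counts


CPAN_FORM = """
<form method="get" action="/search" name="f" class="searchbox">
<input type="text" name="query" value="" size="35">
<br>in <select name="mode">
<option value="all">All</option>
<option value="module">Modules</option>
<option value="dist">Distributions</option>
<option value="author">Authors</option>
</select>&nbsp;<input type="submit" value="CPAN Search">
</form>
"""

features = count_form_features(CPAN_FORM)
```

A vector of such counts, one per form, is the kind of input a decision tree learner would be trained on; the exact feature definitions used in the paper may differ from this sketch.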

3.2 Form text classifier

With the aid of the decision tree classifier, we can identify whether a form is searchable. To further ascertain whether a searchable form is domain-specific, we must make full use of the form's textual features. According to previous research[19,20,22], the libsvm learning algorithm[22] should be used in this case.

To extract textual features from forms, two text extraction methods are tried in this paper: the FT (full text) method and the PT (partial text) method. The FT method simply takes all the HTML code of a form and splits it on non-alphanumeric strings. In contrast, the PT method extracts the textual features that a human can see when the form is displayed in a browser, as well as the form's action attribute, to which all form data are sent. For example:

    <form action="http://www.hotwire.com/car/search-options.jsp" method="get" name="searchCar">
    This is a form which is used for demo.
    </form>

In the above form, the action attribute value is http://www.hotwire.com/car/search-options.jsp, which is also extracted by the PT method.
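
The difference between the two extraction methods can be sketched as follows, using the demo form above. This is an illustrative reading of the description, not the authors' Perl code: the function names `ft_tokens` and `pt_tokens` are invented here, and treating the label of a submit button as visible text is an assumption.

```python
import re
from html.parser import HTMLParser


def split_alnum(text):
    """Split on runs of non-alphanumeric characters, dropping empty pieces."""
    return [piece for piece in re.split(r"[^A-Za-z0-9]+", text) if piece]


def ft_tokens(form_html):
    """FT (full text): tokenize the raw HTML code of the form, tags included."""
    return split_alnum(form_html)


class _VisibleTextCollector(HTMLParser):
    """Collects text nodes, submit-button labels and the form action."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "form" and attrs.get("action"):
            self.parts.append(attrs["action"])
        elif tag == "input" and attrs.get("type", "").lower() == "submit":
            # Assumption: a button label counts as text the user can see.
            self.parts.append(attrs.get("value", ""))

    def handle_data(self, data):
        self.parts.append(data)


def pt_tokens(form_html):
    """PT (partial text): visible text plus the action attribute only."""
    collector = _VisibleTextCollector()
    collector.feed(form_html)
    collector.close()
    return split_alnum(" ".join(collector.parts))


DEMO_FORM = (
    '<form action="http://www.hotwire.com/car/search-options.jsp" '
    'method="get" name="searchCar">This is a form which is used for demo.</form>'
)

ft = ft_tokens(DEMO_FORM)
pt = pt_tokens(DEMO_FORM)
```

Here `ft` contains markup words such as `method` and `name`, whereas `pt` keeps only the words of the action URL and of the displayed sentence, which is the behaviour the PT method is described as having.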

In order to use these extracted textual features, some pre-processing steps are needed. First, all characters other than alphanumeric ones are replaced by a space character; second, uppercase characters, if any, are converted to their lowercase equivalents; third, stop words, if any, are removed, using the CPAN[23] Perl package Lingua::EN::StopWords; fourth, each word in the remaining text is stemmed, using the CPAN Perl package Lingua::Stem::En; finally, TFIDF[24] is used to transform each training example into its corresponding vector. The same procedure is also applied to the page text classifier (see Section 3.3).

Using the extracted textual features, we can train an SVM classifier that identifies whether a searchable form is domain-specific.
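
The pre-processing steps listed above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the pipeline used in the paper: the stop-word list is a tiny hand-made stand-in for Lingua::EN::StopWords, `crude_stem` is a deliberately simple substitute for the Lingua::Stem::En stemmer, and the weighting tf * log(N/df) is one common TF-IDF variant (the paper does not say which variant it uses).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption, not the CPAN list).
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "this", "to", "which"}


def crude_stem(word):
    """Rough stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Steps 1-4: strip non-alphanumerics, lowercase, drop stop words, stem."""
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    words = text.lower().split()
    return [crude_stem(word) for word in words if word not in STOP_WORDS]


def tfidf_vectors(documents):
    """Step 5: one sparse {term: weight} vector per document."""
    token_lists = [preprocess(doc) for doc in documents]
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in term_freq.items()}
        )
    return vectors


tokens = preprocess("Search Cars & Flights!")
vectors = tfidf_vectors(["Search cars", "Search books"])
```

With this weighting, a term that occurs in every training example (here `search`) gets weight zero, while terms that distinguish one example from another keep a positive weight.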

3.3 Page text classifier

To decide automatically whether a Web page is relevant, we use a page text classifier. Given a Web page, we first obtain its plain text. After that, the pre-processing steps of Section 3.2 are applied so that the text can be used to train an SVM classifier.

With these three classifiers in hand, we can use them to guide a focused crawler to find deep Web entries:
First, we use the page text classifier to decide whether the Web pages corresponding to the given URLs are relevant;
Second, if a Web page is relevant, we extract searchable forms from it with the aid of the form structure classifier;
Third, if the relevant Web page contains searchable forms, we further use the form text classifier to ascertain whether they are domain-specific.

We use the classifiers in this hierarchical fashion because hierarchical composition leads to modularity. A complex problem is decomposed into simpler sub-components, each devoted to a subset of the hypothesis. This has two merits: first, the overall classification process is not only accurate but also robust; second, we can apply to each part the learning method best suited to the feature set of that partition.
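
The three-step use of the classifiers amounts to a cascade of filters. The sketch below shows only the control flow; the three `toy_*` functions are keyword-based placeholders invented for this example and stand in for the trained SVM and decision tree models.

```python
def find_domain_forms(page_text, forms, page_is_relevant,
                      form_is_searchable, form_is_domain_specific):
    """Apply the three classifiers in order and return the surviving forms."""
    # Step 1: discard the whole page if the page text classifier rejects it.
    if not page_is_relevant(page_text):
        return []
    # Step 2: keep only searchable forms (form structure classifier).
    searchable = [form for form in forms if form_is_searchable(form)]
    # Step 3: keep only domain-specific forms (form text classifier).
    return [form for form in searchable if form_is_domain_specific(form)]


# Placeholder classifiers (assumptions): simple keyword tests on strings.
def toy_page_is_relevant(text):
    return "car" in text.lower()


def toy_form_is_searchable(form):
    return "search" in form.lower() and "password" not in form.lower()


def toy_form_is_domain_specific(form):
    return "rental" in form.lower()


forms_on_page = ["login password", "search rental cars", "search books"]
kept = find_domain_forms("Cheap car hire", forms_on_page, toy_page_is_relevant,
                         toy_form_is_searchable, toy_form_is_domain_specific)
skipped = find_domain_forms("Cooking recipes", forms_on_page, toy_page_is_relevant,
                            toy_form_is_searchable, toy_form_is_domain_specific)
```

Because each stage is passed in as a function, any one of the three classifiers can be retrained or replaced without touching the other two, which is the modularity argument made above.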

4 Three-Step Framework

Figure 2 shows the high-level architecture proposed in this paper.

[Figure: components shown are Internet; Best-First crawler; Page text classifier; Web pages in relevant Web sites; Form structure classifier; Searchable forms; Form text classifier; Relevant forms]
Fig.2 The high-level architecture proposed in this paper

Note that the best-first crawler used in this paper is a variation of the best-first crawler proposed in Ref.[22]. In Ref.[22], no distinction is made among the URLs that lie in an on-topic page, whereas we assign URL priorities according to the following formula:

    a*page_score + b*anchor_score

Here, we let a and b both take the value 1. The detailed procedure of our deep Web crawler is displayed in Fig.3.

[Fig.3: flowchart of the crawling procedure. Recoverable labels: 'Relevant'; 'Contain relevant entry'; 'out-frontier is empty'; 'in-frontier is empty'; 'depth=3 or the total number of pages threshold >=100 is visited']

Note that we set depth<3 because Web databases tend to be located shallowly in their sites, and the vast majority of them (approximately 94%) can be found within the top 3 levels[16]. Besides, to protect our crawler from getting trapped in some sites, we set a threshold on the maximum number of Web pages visited per site.
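
A minimal sketch of a best-first frontier with the priority formula, the depth limit and the per-site page cap described above is given below. It is a simplification under several assumptions: it uses a single priority queue rather than the in-frontier/out-frontier pair named in Fig.3, depth is counted from the seed, and the function name, its parameters and the toy link graph are all invented for this example.

```python
import heapq
from collections import Counter
from urllib.parse import urlparse


def best_first_crawl(seeds, get_links, page_score, anchor_score,
                     a=1.0, b=1.0, max_depth=3, max_pages_per_site=100):
    """Return URLs in visiting order.

    get_links(url) must return a list of (link_url, anchor_text) pairs.
    """
    frontier = []   # min-heap of (-priority, insertion_order, url, depth)
    order = 0
    for url in seeds:
        heapq.heappush(frontier, (0.0, order, url, 0))
        order += 1
    seen = set(seeds)
    pages_per_site = Counter()
    visited = []
    while frontier:
        _, _, url, depth = heapq.heappop(frontier)
        site = urlparse(url).netloc
        if pages_per_site[site] >= max_pages_per_site:
            continue                      # per-site threshold reached
        pages_per_site[site] += 1
        visited.append(url)
        if depth + 1 >= max_depth:
            continue                      # keep every visited page at depth < max_depth
        score = page_score(url)
        for link, anchor in get_links(url):
            if link in seen:
                continue
            seen.add(link)
            priority = a * score + b * anchor_score(anchor)
            heapq.heappush(frontier, (-priority, order, link, depth + 1))
            order += 1
    return visited


# Toy in-memory link graph (hypothetical URLs) instead of real HTTP fetches.
GRAPH = {
    "http://a.example/": [("http://a.example/cars", "car search"),
                          ("http://a.example/about", "about us")],
    "http://a.example/cars": [("http://a.example/cars/deep", "more")],
    "http://a.example/cars/deep": [("http://a.example/too-deep", "x")],
}


def toy_links(url):
    return GRAPH.get(url, [])


def toy_page_score(url):
    return 1.0


def toy_anchor_score(text):
    return 1.0 if "search" in text else 0.0


visit_order = best_first_crawl(["http://a.example/"], toy_links,
                               toy_page_score, toy_anchor_score)
capped = best_first_crawl(["http://a.example/"], toy_links, toy_page_score,
                          toy_anchor_score, max_pages_per_site=2)
```

In the toy run, the link whose anchor text mentions "search" is visited before its sibling, and the page that would lie at depth 3 is never fetched.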

5 Experimental Results

The TEL-8 Query Interfaces[25] data set is used to train a form classifier. This data set contains the original interfaces extracted from eight representative domains: Airfares, Automobiles, Books, Car Rentals, Hotels, Jobs, Movies and Music Records. Table 1 shows the distribution of instances over the eight representative database domains.

Table 1 The distribution of instances over the eight database domains (223 sources in all)

    Domain    Sources    Domain    Sources
    Airfare   20         Auto      28
    Book      43         Rental    13
    Hotel     34         Job       20
    Movie     32         Music     33

5.1 Training form structure classifier

In this paper, our form structure classifier is trained using a decision tree algorithm. The training data are collected as follows: we extract 223 searchable forms (see Table 1) from TEL-8 Query Interfaces as positive examples and manually gather 318 non-searchable forms as negative ones. For each form in the sample data set, we count the following features: number of checkboxes; number of file inputs; number of hidden tags; number of image inputs; number of submission methods (get and post); number of select elements; number of password tags; number of radio tags; number of occurrences of the word 'search' within the form tag or submission button; number of text elements; number of textarea elements; and number of occurrences of the word 'email' in input elements' name or value.

The distributions of all those features in searchable and non-searchable forms are displayed in Table 2. From Table 2, we can draw the following conclusions: Searchable forms have a large number of checkboxes and items (options) in selection lists. Non-searchable forms have a large number of password tags and occurrences of 'email' in input elements' name or value.

Using these structural features, we can train a decision tree classifier. Two tools are used to construct the form structure classifier: the Weka j48 algorithm[26] and the Algorithm::SVMLight Perl package[23]. In Weka, the precision of the decision tree classifier is 0.948718. The decision tree generated by Weka is displayed in Fig.4. In fact, we use the Perl package Algorithm::SVMLight in this paper to train the decision tree classifier, and its precision is 0.948717948. Evidently, the two tools give similar results in our experiments. Nevertheless, the current implementation of Algorithm::SVMLight supports only discrete-valued attributes, and consequently it outputs a large number of rules: 122 in all. Additionally, we count an instance as misclassified if the decision tree cannot decide which category it belongs to.
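
The scoring rule in the last sentence, where an undecided instance counts as an error, can be written down as a short sketch. The function name is invented here, and the test counts below are purely illustrative: the paper does not state the size of its test set, and 37 correct out of 39 is merely one combination that reproduces the reported value 0.948718.

```python
def precision_with_abstentions(predicted, actual):
    """Fraction of instances classified correctly.

    A prediction of None (the tree could not decide) counts as wrong.
    """
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equally long")
    correct = sum(
        1 for guess, truth in zip(predicted, actual)
        if guess is not None and guess == truth
    )
    return correct / len(actual)


# Illustrative data (assumption): 39 searchable forms, 37 labelled correctly,
# one labelled wrongly and one left undecided.
actual_labels = ["searchable"] * 39
predicted_labels = ["searchable"] * 37 + ["non-searchable", None]
score = precision_with_abstentions(predicted_labels, actual_labels)
```

Rounded to six decimals, `score` equals 0.948718, so both reported figures are consistent with the same underlying fraction.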

Table 2 Features' distributions of searchable and non-searchable forms

    Feature/Category   Searchable   Non-Searchable   Ratio
    checkbox           2.39         0.18             13.04:1
    email_yes          0.01         0.12             1:13.13
    file               0.00         0.00             -
    hidden             4.45         1.63             2.72:1
    image              0.36         0.21             1.73:1
    method_get         0.47         0.40             1.16:1
    option             12.64        0.17             74.23:1
    password           0.02         0.10             1:5.86
    radio              0.48         0.10             4.76:1
    search_yes         0.36         0.09             3.94:1
    text               3.00         1.01             2.97:1
    textarea           0.02         0.07             1:2.98

[Figure: decision tree]
Fig.4 The decision tree generated by Weka j48 algorithm using form structural features

5.2 Training form text classifier

Using the FT and PT methods (see Section 3.2), we can extract textual features from forms.
The five most frequent features obtained by the FT and PT methods are presented in Table 3. Table 3 shows that the PT method extracts more valuable features than the FT method does.

Table 3 Five most frequent features extracted by FT and PT methods respectively

    Method   Category   Textual features (Feature:Frequency)
    FT       Airfare    option:8113   value:4161   td:1181      id:1069       class:993
             Auto       option:5673   value:3002   td:1520      tr:716        class:498
             Book       option:10997  value:5753   td:1538      tr:788        name:421
             Rental     option:6396   value:3199   td:892       pm:550        class:520
             Hotel      option:13048  value:6271   class:1377   div:1358      td:1265
             Job        option:7423   value:3868   u:775        td:680        tr:413
             Movie      option:6995   value:3616   div:1200     class:1181    td:734
             Music      option:8090   value:679    td:516       record:456    font:323
    PT       Airfare    pm:419        airlin:279   air:124      am:102        airwai:100
             Auto       docum:108     car:105      leas:84      search:63     make:56
             Book       search:130    titl:110     book:95      author:75     new:72
             Rental     pm:402        option:202   am:168       airport:144   car:143
             Hotel      hotel:234     pm:228       island:151   new:135       room:84
             Job        job:207       new:125      locat:84     servic:82     island:81
             Movie      press:211     book:123     s:109        video:107     entertain:107
             Music      record:456    music:226    sub:156      search:97     new:80

Using the extracted textual features, we train eight SVM classifiers. Their precisions are shown in Table 4. Since the PT method extracts more valuable textual features than the FT method does, we can use them to train a more accurate classifier. Table 4 indicates that, whatever the category, the PT method always yields a more accurate classifier than the FT method.

Table 4 Precisions of SVM classifiers trained with form textual features that are extracted by the FT and PT methods respectively

    Category   FT method   PT method
    Airfare    0.9182      0.9364
    Auto       0.9476      1.0
    Book       0.9273      0.9636
    Rental     0.9545      0.9636
    Hotel      0.9333      0.9381
    Job        0.9409      0.9682
    Movie      0.9273      0.9773
    Music      0.9         0.9682

5.3 Training page text classifier

In order to train the page text classifier, this paper gets its positive training data from the online Open Directory Project (http://dmoz.org/).
We use a Perl script to fill out the searchable form and extract URLs from the returned result pages automatically. As for negative training data, we get them from DMOZ (http://rdf.dmoz.org/). DMOZ consists of sixteen categories: Arts, Business, Computers, Games, Health, Home, Kids_and_Teens, News, Recreation, Reference, Regional, Science, Shopping, Society, Sports and World. We discard four of them in our experiments: Kids_and_Teens, Reference, Regional and World. The number of URLs in each remaining category is shown in Table 5.

Table 5 The number of URLs in each DMOZ category

    Arts     Business     Computers   Games      Health    Home
    585924   511620       285336      123758     131050    33555
    News     Recreation   Science     Shopping   Society   Sports
    235704   120308       213014      235160     269864    154921

In DMOZ, each example looks like this:

    <ExternalPage about="http://www.airwise.com/airports/us/SLC/index.html">
    <d:Title>Salt Lake City Airport - airwise.com</d:Title>
    <d:Description>Information about the airport including airlines, ground transportation, parking, weather and airport news.</d:Description>
    <topic>Top/Regional/North_America/United_States/Utah/Localities/S/Salt_Lake_City/Transportation/Airports</topic>
    </ExternalPage>

This paper uses the content of the 'd:Description' element and the Web page corresponding to the 'about' attribute of ExternalPage to obtain a negative training example.
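
Reading the 'about' attribute and the 'd:Description' element out of such a record can be sketched with Python's standard XML parser. The wrapping `RDF` root element and the two namespace URIs are assumptions added so that the snippet parses on its own; the record itself is the one shown above, and the function name is invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions made so that the fragment is well-formed XML.
NS = {"dmoz": "http://dmoz.org/rdf/", "d": "http://purl.org/dc/elements/1.0/"}

RDF_SNIPPET = """<RDF xmlns="http://dmoz.org/rdf/"
     xmlns:d="http://purl.org/dc/elements/1.0/">
  <ExternalPage about="http://www.airwise.com/airports/us/SLC/index.html">
    <d:Title>Salt Lake City Airport - airwise.com</d:Title>
    <d:Description>Information about the airport including airlines, ground transportation, parking, weather and airport news.</d:Description>
    <topic>Top/Regional/North_America/United_States/Utah/Localities/S/Salt_Lake_City/Transportation/Airports</topic>
  </ExternalPage>
</RDF>"""


def external_pages(rdf_text):
    """Yield one dict per ExternalPage record (hypothetical helper)."""
    root = ET.fromstring(rdf_text)
    for page in root.findall("dmoz:ExternalPage", NS):
        topic = page.findtext("dmoz:topic", default="", namespaces=NS)
        parts = topic.split("/")
        yield {
            "url": page.get("about"),
            "description": page.findtext("d:Description", default="", namespaces=NS),
            # The second path component is the top-level DMOZ category.
            "category": parts[1] if len(parts) > 1 else "",
        }


records = list(external_pages(RDF_SNIPPET))
```

The `category` field is what would allow records from discarded top-level categories to be filtered out before the pages are downloaded.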
To make the negative examples more representative, we draw URLs from each category in proportion to its size. Since the 'Arts' category has the largest number of URLs, we take the largest number of URLs from it. Excluding URLs that cannot be downloaded, the numbers of positive and negative examples used to train the page classifier for each category are listed in Table 6. Using these training data, we train the page classifiers. Their precisions are also shown in Table 6.
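
Drawing URLs "according to category size" can be read as proportional allocation, sketched below. The category sizes, the total of 100 and the URL lists are made-up illustrative values, not figures from the paper, and simple rounding is an assumption (with awkward sizes the rounded quotas may not add up exactly to the requested total).

```python
import random


def proportional_quota(sizes, total):
    """Split `total` samples across categories in proportion to their sizes."""
    grand_total = sum(sizes.values())
    return {name: round(total * size / grand_total) for name, size in sizes.items()}


def sample_urls(urls_by_category, quota, seed=0):
    """Draw quota[name] URLs at random from each category."""
    rng = random.Random(seed)
    return {
        name: rng.sample(urls, min(quota[name], len(urls)))
        for name, urls in urls_by_category.items()
    }


# Hypothetical category sizes and URL lists, for illustration only.
toy_sizes = {"Arts": 600, "Business": 300, "Games": 100}
toy_urls = {
    name: ["http://%s.example/%d" % (name.lower(), i) for i in range(size)]
    for name, size in toy_sizes.items()
}

quota = proportional_quota(toy_sizes, total=100)
sample = sample_urls(toy_urls, quota)
```

The largest category receives the largest share, which matches the remark above about the 'Arts' category.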

Table 6 The number of positive and negative training data for each category, as well as the precision of its corresponding page classifier

    Category   Precision
    Airfare    0.9619
    Auto       0.9463
    Book       0.9135
    Rental     0.9737
    Hotel      0.9941
    Job        0.9791
    Movie      0.9004
    Music      0.87

(The Positive and Negative columns are fused in the extracted text and their boundaries cannot be recovered reliably; the digits read 116251215689131708170616012231635643325228227213178312487.)

5.4 Using the three-step framework to find deep Web entries

We conduct eight large-scale experiments with our hidden Web crawler. For each category, we initialize the crawler with 100 seeds extracted from DMOZ as the starting points. We save pages and their corresponding URLs if two conditions are satisfied at the same time. First, they are judged to be relevant by the page and form text classifiers. Second, each page contains at least one domain-specific deep Web entry. Since Music Records databases are very sparsely distributed, our best-first focused crawler locates only 50 deep Web entries for that category. For each of the other categories, the crawler finds 100 deep Web entries.
Five deep Web entries for the Airfares category located by our crawler are listed below:

    http://www.aircanada.ca/
    http://www.itn.net/
    http://www.aircharter.com/
    http://www.orbitz.com/
    http://www.nwa.com/

Finally, we manually verify whether the deep Web entries located by our crawler are what we want. The precisions for all the categories are shown in Table 7.

Table 7 The precisions of deep Web entries for each category

    Domain    Precision   Domain   Precision
    Airfare   0.90        Auto     0.88
    Book      0.91        Rental   0.95
    Hotel     0.94        Job      0.81
    Movie     0.86        Music    0.80

6 Conclusion and Future Work

In this paper, a three-step framework is proposed to automatically identify domain-specific hidden Web entries. To verify its effectiveness and efficiency, eight large-scale experiments are conducted. Experimental results demonstrate that our method can find domain-specific deep Web entries accurately and efficiently. The average precision over the eight representative domains is 0.88.

In the near future, experiments on a larger number of categories are required in order to further assess the effectiveness of the proposed technique. Additionally, we will integrate the obtained query interfaces into a unified interface for users to query, so that users do not have to hunt for domain-specific sources and learn the details of querying each one.
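
The average precision quoted in the conclusion can be re-derived from the per-domain values of Table 7. The short check below only restates that arithmetic; the variable names are chosen for this example.

```python
from statistics import mean

# Per-domain precisions as listed in Table 7.
table7_precision = {
    "Airfare": 0.90, "Auto": 0.88, "Book": 0.91, "Rental": 0.95,
    "Hotel": 0.94, "Job": 0.81, "Movie": 0.86, "Music": 0.80,
}

average_precision = mean(table7_precision.values())
rounded_average = round(average_precision, 2)
```

The unweighted mean is about 0.881, which rounds to the reported 0.88. This is a plain average over domains; it does not account for the Music domain contributing only 50 entries against 100 for each of the others.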

Acknowledgement This work is sponsored by the Science and Technology Development Program of Jilin Province under Grant No.20070533 and the Natural Science Foundation of China under Grant No.60373099.

References:
[1] Rocco D, Caverlee J, Liu L, Critchlow T. Exploiting the deep Web with DynaBot: Matching, probing, and ranking. In: Ellis A, Hagino T, eds. Proc. of the World Wide Web Special Interest Tracks and Posters (WWW). Chiba: ACM, 2005. 1174-1175.
[2] BrightPlanet.com. The deep Web: Surfacing hidden value. http://brightplanet.com
[3] Bergman MK. The deep Web: Surfacing hidden value. Journal of Electronic Publishing, 2001,7(1):1174-1175. http://www.press.umich.edu/jep/07-01/bergman.html
[4] He B, Zhang Z, Chang KCC. Knocking the door to the deep Web: Integrating Web query interfaces. In: Weikum G, ed. Proc. of the SIGMOD Conf. Paris: ACM, 2004. 913-914.
[5] Chang KCC, He B, Zhang Z. MetaQuerier over the deep Web: Shallow integration across holistic sources. In: Nascimento MA, Özsu MT, Kossmann D, Miller RJ, Blakeley JA, Schiefer KB, eds. Proc. of the Int'l Conf. on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers, 2004. 15-21.
[6] Wu W, Doan A, Yu CT. Merging interface schemas on the deep Web via clustering aggregation. In: Proc. of the Int'l Conf. on Data Mining (ICDM). IEEE Computer Society, 2005. 801-804.
[7] He H, Meng WY, Yu CT, Wu ZH. WISE-Integrator: A system for extracting and integrating complex Web search interfaces of the deep Web. In: Böhm K, Jensen CS, Haas LM, Kersten ML, Larson PA, Ooi BC, eds. Proc. of the Int'l Conf. on Very Large Data Bases (VLDB). ACM, 2005. 1314-1317.
[8] Chang KCC, Garcia-Molina H. Mind your vocabulary: Query mapping across heterogeneous information sources. In: Delis A, Faloutsos C, Ghandeharizadeh S, eds. Proc. of the SIGMOD Conf. Philadelphia: ACM Press, 1999. 335-346.
[9] He B, Zhang Z, Chang KCC. MetaQuerier: Querying structured Web sources on-the-fly. In: Özcan F, ed. Proc. of the SIGMOD Conf. ACM, 2005. 927-929.
[10] Nakatoh T, Yamada Y, Hirokawa S. Automatic generation of deep Web wrappers based on discovery of repetition. In: Proc. of the Asia Information Retrieval Symp. (AIRS). Beijing: Springer-Verlag, 2004. 269-272.
[11] Hedley YL, Younas M, James A, Sanderson M. A two-phase sampling technique for information extraction from hidden Web databases. In: Laender AHF, Lee D, Ronthaler M, eds. Proc. of the Int'l Workshop on Web Information and Data Management (WIDM). Washington: ACM, 2004. 1-8.
[12] Mundluru D, Katukuri JR, Celebi S. Automatically mining result records from search engine response pages. In: Proc. of the Int'l Conf. on Data Mining (ICDM). IEEE Computer Society, 2005. 749-752.
[13] Liu B, Grossman R, Zhai YH. Mining data records in Web pages. In: Getoor L, Senator TE, Domingos P, Faloutsos C, eds. Proc. of the Knowledge Discovery and Data Mining (KDD). Washington: ACM, 2003. 601-606.
[14] Hsieh W, Madhavan J, Pike R. Data management projects at Google. In: Chaudhuri S, Hristidis V, Polyzotis N, eds. Proc. of the SIGMOD Conf. Chicago: ACM, 2006. 725-726.
[15] Wu P, Wen JR, Liu H, Ma WY. Query selection techniques for efficient crawling of structured Web sources. In: Liu L, Reuter A, Whang KY, Zhang J, eds. Proc. of the Int'l Conf. on Data Engineering (ICDE). IEEE Computer Society, 2006. 47.
[16] Raghavan S, Garcia-Molina H. Crawling the hidden Web. In: Apers PMG, Atzeni P, Ceri S, Paraboschi S, Ramamohanarao K, Snodgrass RT, eds. Proc. of the Int'l Conf. on Very Large Data Bases (VLDB). Rome: Morgan Kaufmann Publishers, 2001. 129-138.
[17] Cope J, Craswell N, Hawking D. Automated discovery of search interfaces on the Web. In: Schewe KD, Zhou X, eds. Proc. of the Australasian Database Conf. (ADC). Australian Computer Society, 2003. 181-189.
[18] Bergholz A, Chidlovskii B. Crawling for domain-specific hidden Web resources. In: Proc. of the Int'l Conf. on Web Information Systems Engineering (WISE). Roma: IEEE Computer Society, 2003. 125-133.
[19] Barbosa L, Freire J. Combining classifiers to identify online databases. In: Williamson CL, Zurko ME, Patel-Schneider PF, Shenoy PJ, eds. Proc. of the World Wide Web Conf. (WWW). ACM, 2007. 431-440.
[20] Barbosa L, Freire J. An adaptive crawler for locating hidden-Web entry points. In: Williamson CL, Zurko ME, Patel-Schneider PF, Shenoy PJ, eds. Proc. of the World Wide Web Conf. (WWW). ACM, 2007. 441-450.
[21] Barbosa L, Freire J. Searching for hidden-Web databases. In: Doan AH, Neven F, McCann R, Bex GJ, eds. Proc. of the 8th Int'l Workshop on the Web and Databases (WebDB). Baltimore: ACM Press, 2005. 1-6.
[22] Chang CC, Lin CJ. Libsvm: A library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[23] CPAN. http://search.cpan.org/
[24] Torgo L, Gama J. Regression by classification. In: Borges D, Kaestner C, eds. Proc. of the Brazilian Artificial Intelligence Symp. Curitiba: Springer-Verlag, 1996. 51-60.
[25] The UIUC Web integration repository. http://metaquerier.cs.uiuc.edu/repository/
[26] Weka. http://www.cs.waikato.ac.nz/ml/weka/

WANG Hui was born in 1972. He received his Ph.D. degree from Jilin University. His research area is Web information mining.

ZUO Wan-Li was born in 1957. He is a professor and doctoral supervisor at Jilin University and a CCF senior member. His research areas are database, data mining and Web search engine.

LIU Yan-Wei was born in 1983. He is a graduate student at Jilin University. His research area is Web information mining.
