ISSN 1000-9825, CODEN RUXUEW    E-mail: jos@iscas.ac.cn
Journal of Software, Vol.19, No.2, February 2008, pp.246-256    http://www.jos.org.cn
DOI: 10.3724/SP.J.1001.2008.00246    Tel/Fax: +86-10-62562563
© 2008 by Journal of Software. All rights reserved.

Using Classifiers to Find Domain-Specific Online Databases Automatically
(使用分类器自动发现特定领域的深度网入口)

WANG Hui+, LIU Yan-Wei, ZUO Wan-Li (王辉+, 刘艳威, 左万利)
(College of Computer Science and Technology, Jilin University, Changchun 130012, China)
+ Corresponding author: Phn: +86-431-85166492, E-mail: whui05@yahoo.com.cn

Wang H, Liu YW, Zuo WL. Using classifiers to find domain-specific online databases automatically. Journal of Software, 2008,19(2):246-256. http://www.jos.org.cn/1000-9825/19/246.htm

Abstract: In the hidden Web domain, general-purpose search engines (e.g., Google and Yahoo) have their shortcomings. Each of them covers less than one-third of the data stored in document databases and, unlike in the surface Web, combining them yields roughly the same coverage. The hidden Web is a highly important information source, since the content provided by many hidden Web sites is often of very high quality. This paper proposes a three-step framework to automatically identify domain-specific hidden Web entries. The query interfaces obtained in this way can be integrated into a unified interface that is given to users to query. Eight large-scale experiments demonstrate that the technique can find domain-specific hidden Web entries accurately and efficiently.
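Key words: deep Web; hidden Web; surface Web; hidden Web entry; searchable form

摘要 (Chinese abstract, in translation): In deep Web research, general-purpose search engines (such as Google and Yahoo) have many shortcomings: the share of the total deep Web data that each of them covers is less than 1/3, and, unlike the situation in the surface Web, combining several search engines leaves the amount of covered data essentially unchanged. Many deep Web sites provide large amounts of high-quality information, and the deep Web is gradually becoming one of the most important information resources. A three-classifier framework is proposed to automatically identify domain-specific deep Web entries. Once the query interfaces are obtained, they can be integrated and presented to users as one unified interface, which makes querying easier. Eight groups of large-scale experiments verify that the proposed method discovers domain-specific deep Web entries accurately and efficiently.

CLC number: TP393    Document code: A

Supported by the National Natural Science Foundation of China under Grant No.60373099 (国家自然科学基金); the Science and Technology Development Program of Jilin Province of China under Grant No.20070533 (吉林省科技发展计划)
Received 2007-08-02; Accepted 2007-11-06

1 Introduction

According to how its data is stored, the World Wide Web can be classified into two categories: the surface Web and the deep Web (also called the hidden Web). In the surface Web, data are stored in document files, while in the deep Web, data are stored in databases[1].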
Unlike the surface Web, the deep Web refers to the collection of Web data that is accessible by interacting with a Web-based query interface, and not through the traversal of static hyperlinks. A July 2000 white paper[2] estimated that the deep Web has 450 000 databases, 7 500 terabytes of information and 550 billion individual documents. In contrast, the surface Web contains 19 terabytes of information and 1 billion individual documents. In addition, according to many studies, the size of the hidden Web increases rapidly as more organizations release their valuable content online through an easily used Web interface[3]. The content provided by many hidden Web sites is often of very high quality and can be extremely valuable to many users. The site of the US Patent and Trademarks Office is an example: it makes existing patent documents available to help potential inventors examine what has already been invented.

To retrieve data from online databases, three main problems should be considered: interface unification (also called interface integration), query translation and result merging. Before a hidden Web database is queried, the search system first characterizes the available search interfaces; then, given a query, it selects a subset of useful domain-specific search interfaces, queries them and presents the results to the users. In this paper, we consider an overlooked problem that precedes the three main problems: discovering deep Web entries. A search system must discover a set of search interfaces, or be provided with such a set, before it can proceed with the other three steps.

Much work has been done in each of these three areas. For each domain, the MetaQuerier[4] constructs a unified interface which is provided for users to query. Users send their queries via the unified interface. A mediator translates the queries to each specific online database and then returns the integrated query results to the users. Chang, et al.[5] use the parsing approach, achieving above 85% accuracy for extracting query conditions across randomly selected deep Web sources and for query interface matching. Wu, et al.[6] develop a novel approximation algorithm, LMax, which builds the unified interface via recursive applications of clustering aggregation. Moreover, they extend LMax to handle the irregularities that frequently occur in interface schemas. The interface extractor[7] can achieve a deeper understanding of Web search interfaces in the sense that more semantic/meta information on search interfaces can be extracted. With such semantic/meta information, the enriched interface schema can be used in many applications, for instance, query translation, search result extraction and annotation.

Chang, et al.[8] pursue a source-based and rule-driven framework to implement query translation across different deep Web sources. In contrast, He, et al.[9] propose a generic type-based and search-driven query translation framework to reach the same goal.

A deep Web wrapper is a program that extracts contents from search results. Nakatoh, et al.[10] propose a new automatic generation algorithm which discovers a repetitive pattern in search results. Hedley, et al.[11] describe a Two-Phase Sampling (2PS) technique to detect templates and extract query-related information from the sampled documents of a database. Mundluru, et al.[12] give a highly effective and efficient solution for automatically mining result records from search engine response pages. Experimental results showed that their proposed system significantly outperforms MDR[13], a state-of-the-art record mining system.

Though much work has been done in those areas, little work has been done on interface discovery, even though the three main problems depend on having a set of known hidden Web interfaces.

The remainder of the paper is organized as follows. We review related work in Section 2. Section 3 describes the page and form classifiers. The three-step framework of our hidden Web crawler is described in Section 4. Experimental results are presented in Section 5. Section 6 concludes and outlines future work.

2 Related Work

In recent years, the hidden Web has become a hot research topic. It is estimated that there are several million hidden Web sites, which contain a large amount of high-quality information[14]. The difficulties in automatically filling out structured Web forms have been documented in the literature[15]. The work in Ref.[16] describes a semi-automatic crawler called HiWE, which is aided by domain knowledge to generate reasonable queries for hidden Web interfaces. Cope, et al.[17] use an automatic feature generation technique to depict candidate forms and a C4.5 decision tree to classify them. In their two test beds, the ANU collection and a random Web collection, they get an accuracy of more than 85% and a precision of more than 87% respectively. Bergholz, et al.[18] describe a crawler which starts from the Publicly Indexable Web (PIW) to find entry points into the hidden Web. This crawler is domain-specific and is initialized with pre-classified documents and relevant keywords. Barbosa and Freire[19] compose two classifiers in a hierarchical fashion to identify online databases among a heterogeneous set of Web forms automatically gathered by a focused crawler. In Ref.[20], they present a new adaptive focused crawling strategy for efficiently locating hidden Web entry points. Unfortunately, the ACHE framework they proposed cannot handle very sparse domains efficiently. Besides, the ACHE framework is complex and its overhead is large.

Our technique differs from theirs in two respects. First, our modified best-first crawler finds only domain-specific hidden Web entries. Second, we use a three-step framework to guide our deep Web crawler.

3 Page and Form Classifiers

In order to find domain-specific hidden Web entries, we use three classifiers which work in a hierarchical fashion to guide our deep Web crawler: a form structure classifier, a form text classifier and a page text classifier.

3.1 Form structure classifier

A form is made up of two parts: a structural part and a textual part. Consider the well-known Perl CPAN Web page as an example, where we can find distributions, modules, documents and IDs. The HTML source code of the form contained in this Web page is listed below:

    <form method="get" action="/search" name="f" class="searchbox">
    <input type="text" name="query" value="" size="35">
    <br>in <select name="mode">
    <option value="all">All</option>
    <option value="module">Modules</option>
    <option value="dist">Distributions</option>
    <option value="author">Authors</option>
    </select> <input type="submit" value="CPAN Search">
    </form>

When displayed in the IE browser, the result is shown in Fig.1.

Fig.1 An illustrated form interface displayed in IE browser

From Fig.1, we can see that a form contains not only textual contents such as 'in' and 'CPAN Search', but also structural contents such as select elements and submission buttons. In order to identify whether a form is a domain-specific searchable form or not, we use form structural and textual features to train the form structure and form text classifiers respectively.

Once a form structure classifier is available, we can discard non-searchable forms, such as login forms, discussion group interfaces, mailing list subscriptions, purchase forms and Web-based email forms. Barbosa, et al.[21] and Cope, et al.[17] demonstrate that the best results are obtained by using a decision tree to separate searchable from non-searchable forms. Accordingly, we also use the decision tree algorithm to train a form structure classifier.

3.2 Form text classifier

With the aid of the decision tree classifier, we can identify whether a form is searchable or not. To further ascertain whether a searchable form is a domain-specific one, we must make full use of the form's textual features. According to previous research[19,20,22], the libsvm learning algorithm[22] is appropriate in this case.

To extract textual features from forms, two text extraction methods are tried in this paper. One is called the FT (full text) technique and the other the PT (partial text) method. The FT method simply takes all the HTML code of a form and splits it on non-alphanumeric strings. In contrast, the PT technique extracts only those textual features which a human reader can see when the form is displayed in a browser, together with the form's action attribute, to which all form data are sent. For example:

    <form action="http://www.hotwire.com/car/search-options.jsp" method="get" name="searchCar">
    This is a form which is used for demo.
    </form>

In the above form, the action attribute value is:
    http://www.hotwire.com/car/search-options.jsp
which is also extracted by the PT method.
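
The difference between the two extraction methods can be illustrated with a minimal Python sketch. It is not the authors' Perl implementation: the function and class names are hypothetical, and treating submit-button labels as visible text is an assumption about what "can be seen" covers.

```python
import re
from html.parser import HTMLParser


def ft_tokens(form_html):
    """FT: split the complete HTML source of the form on non-alphanumeric runs."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", form_html) if t]


class _VisibleText(HTMLParser):
    """Collects the text a user sees plus the form's action attribute (PT)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "form" and attrs.get("action"):
            self.parts.append(attrs["action"])
        # Assumption: button labels count as visible text.
        if tag == "input" and attrs.get("type") in ("submit", "button"):
            self.parts.append(attrs.get("value", ""))

    def handle_data(self, data):
        self.parts.append(data)


def pt_tokens(form_html):
    """PT: tokens from the visible text and the action URL only."""
    parser = _VisibleText()
    parser.feed(form_html)
    parser.close()
    return [t for t in re.split(r"[^A-Za-z0-9]+", " ".join(parser.parts)) if t]


FORM = ('<form action="http://www.hotwire.com/car/search-options.jsp" '
        'method="get" name="searchCar">This is a form which is used for demo.</form>')

print(ft_tokens(FORM))
print(pt_tokens(FORM))
```

On the demo form, FT keeps markup words such as `method` and `name`, whereas PT keeps only the words of the action URL and of the visible sentence.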

Before these extracted textual features can be used, some pre-processing steps are needed. First, all characters other than alphanumeric ones are replaced by a space character; second, uppercase characters, if any, are converted to their lowercase equivalents; third, stop words, if any, are removed, using the CPAN[23] Perl package Lingua::EN::StopWords; fourth, each word in the remaining text is stemmed, using the CPAN Perl package Lingua::Stem::En; finally, TFIDF[24] is used to transform each training example into its corresponding vector. The same procedure is also applied to the page text classifier (see Section 3.3).

Using those extracted textual features, we can train an SVM classifier which identifies whether a searchable form is domain-specific or not.
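
The pre-processing steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's Perl code: the stop-word set is a tiny illustrative subset, `crude_stem` is a rough stand-in for a real stemmer such as the one in Lingua::Stem::En, and the weight tf x log(N/df) is one common TFIDF variant (the paper does not state which variant it uses).

```python
import math
import re
from collections import Counter

# Illustrative subset only; the paper uses Lingua::EN::StopWords.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "this", "to", "which"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real English stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    text = re.sub(r"[^A-Za-z0-9]", " ", text)            # 1. non-alphanumerics -> space
    words = text.lower().split()                         # 2. lowercase
    words = [w for w in words if w not in STOP_WORDS]    # 3. drop stop words
    return [crude_stem(w) for w in words]                # 4. stem


def tfidf_vectors(documents):
    """5. One sparse vector (term -> weight) per document, weight = tf * log(N / df)."""
    token_lists = [preprocess(d) for d in documents]
    n_docs = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


docs = ["Search flights: cheap Airlines", "Search used cars for sale"]
vecs = tfidf_vectors(docs)
print(preprocess(docs[0]))
print(vecs[0])
```

A term that occurs in every document, such as `search` here, receives weight 0, which is why domain-neutral words contribute little to the resulting vectors.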

3.3 Page text classifier

To decide automatically whether a Web page is relevant or not, we use a page text classifier. Given a Web page, we first obtain its plain text. After that, the pre-processing steps of Section 3.2 are applied so that the text can be used to train an SVM classifier.

With these three classifiers on hand, we can apply them to guide a focused crawler in finding deep Web entries:
First, the page text classifier decides whether the Web pages corresponding to the given URLs are relevant or not;
Second, if a Web page is relevant, we extract searchable forms from it with the aid of the form structure classifier;
Third, if the relevant Web page contains searchable forms, we further use the form text classifier to ascertain whether they are domain-specific or not.

We use the classifiers in this hierarchical fashion because the hierarchical composition of classifiers leads to modularity. A complex problem is decomposed into simpler sub-components, each devoted to a subset of the hypothesis. This has several merits: First, the overall classification process is not only accurate but also robust; Second, we can apply to each part the learning method that is best suited to the feature set of that partition.
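
The hierarchical use of the three classifiers can be sketched as a short Python cascade. The classifiers are passed in as plain functions; the keyword rules below are toy stand-ins (assumptions made only so that the example runs) for the trained SVM and decision tree models, and all names are hypothetical.

```python
def find_domain_entries(page_text, forms, is_relevant_page, is_searchable, is_domain_form):
    """Apply the three classifiers in order and return the surviving forms."""
    if not is_relevant_page(page_text):                      # step 1: page text classifier
        return []
    searchable = [f for f in forms if is_searchable(f)]      # step 2: form structure classifier
    return [f for f in searchable if is_domain_form(f)]      # step 3: form text classifier


# Toy stand-ins for the trained models (illustrative keyword rules only).
def toy_page_classifier(text):
    return "flight" in text.lower()


def toy_structure_classifier(form):
    lowered = form.lower()
    return 'type="password"' not in lowered and 'type="text"' in lowered


def toy_text_classifier(form):
    return "airport" in form.lower()


login_form = '<form><input type="text" name="user"><input type="password" name="pw"></form>'
flight_form = '<form action="/search"><input type="text" name="airport"></form>'
book_form = '<form action="/search"><input type="text" name="title"></form>'

entries = find_domain_entries(
    "Cheap flights and airfare deals",
    [login_form, flight_form, book_form],
    toy_page_classifier, toy_structure_classifier, toy_text_classifier,
)
print(entries)
```

Because each stage only sees what the previous stage let through, any one classifier can be retrained or replaced without touching the other two, which is the modularity argued for above.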

4 Three-Step Framework

Figure 2 shows the high-level architecture proposed in this paper.

[Fig.2 diagram labels: Internet; Best-First crawler; Page text classifier; Web pages in relevant Web sites; Form structure classifier; Searchable forms; Form text classifier; Relevant forms]
Fig.2 The high-level architecture proposed in this paper

Note that the best-first crawler used in this paper is a variation of the best-first crawler proposed in Ref.[22]. In Ref.[22], no distinction is made among the URLs that lie in an on-topic page, whereas we give URLs priorities according to the following formula:
    a*page_score + b*anchor_score.
Here, we let a and b both take the value one. The detailed procedure of our deep Web crawler is displayed in Fig.3.

[Fig.3 flowchart labels: Contain relevant entry; Relevant; out-frontier is empty; in-frontier is empty; depth=3 or the total number of pages threshold >=100 is visited]

Note that the reason why we set depth<3 is that Web databases tend to be located shallowly in their sites and the vast majority of them (approximately 94%) can be found at the top 3 levels[16]. Besides, in order to protect our crawler from getting trapped in some sites, we set a threshold on the maximum number of Web pages visited per site.
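
The URL ordering and the two stopping rules of this section can be sketched as a small priority frontier in Python. It is a simplified illustration, not the crawler of Fig.3: the class and method names are hypothetical, the scores are assumed to be supplied by the caller, and the per-site limit of 100 is read from the surviving Fig.3 labels.

```python
import heapq
import itertools
from collections import defaultdict
from urllib.parse import urlparse

A = 1.0                    # weight of the page score (the text sets a = 1)
B = 1.0                    # weight of the anchor score (the text sets b = 1)
MAX_DEPTH = 3              # only depths 0, 1 and 2 are crawled (depth < 3)
MAX_PAGES_PER_SITE = 100   # per-site page limit, taken from the Fig.3 labels


class Frontier:
    """Priority queue of URLs ordered by a*page_score + b*anchor_score."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()      # keeps insertion order among equal priorities
        self._seen = set()
        self._visited = defaultdict(int)   # pages handed out per site

    def push(self, url, depth, page_score, anchor_score):
        if depth >= MAX_DEPTH or url in self._seen:
            return False
        self._seen.add(url)
        priority = A * page_score + B * anchor_score
        # heapq is a min-heap, so the priority is negated to pop the best URL first.
        heapq.heappush(self._heap, (-priority, next(self._tie), url, depth))
        return True

    def pop(self):
        while self._heap:
            neg_priority, _, url, depth = heapq.heappop(self._heap)
            site = urlparse(url).netloc
            if self._visited[site] >= MAX_PAGES_PER_SITE:
                continue                   # site budget exhausted, skip this URL
            self._visited[site] += 1
            return url, depth, -neg_priority
        return None


frontier = Frontier()
frontier.push("http://a.example/low", 1, page_score=0.2, anchor_score=0.1)
frontier.push("http://a.example/high", 1, page_score=0.9, anchor_score=0.8)
too_deep = frontier.push("http://a.example/deep", 3, page_score=1.0, anchor_score=1.0)
best_url, best_depth, best_priority = frontier.pop()
print(best_url, best_priority, too_deep)
```

With a = b = 1 the page score and the anchor score contribute equally, so a link with a strong anchor text on a moderately relevant page can still outrank a weakly anchored link on a highly relevant page.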

5 Experimental Results

The TEL-8 Query Interfaces[25] data set is used to train a form classifier. This data set contains the original interfaces extracted from eight representative domains: Airfares, Automobiles, Books, Car Rentals, Hotels, Jobs, Movies and Music Records. Table 1 shows the distribution of instances over the eight representative database domains.

Table 1 The instances' distributions of the eight database domains. 223 sources in all
Domain    Sources    Domain    Sources
Airfare   20         Auto      28
Book      43         Rental    13
Hotel     34         Job       20
Movie     32         Music     33

5.1 Training form structure classifier

In this paper, our form structure classifier is trained with the decision tree algorithm. The decision tree training data are collected as follows: we extract 223 searchable forms (see Table 1) from TEL-8 Query Interfaces as positive examples and manually gather 318 non-searchable forms as negative ones. For each form in the sample data set, we count the following features: number of checkboxes; number of file inputs; number of hidden tags; number of image inputs; number of submission methods (get and post); number of select elements; number of password tags; number of radio tags; number of occurrences of the word 'search' within the form tag or submission button; number of text elements; number of textarea elements; and number of occurrences of the word 'email' in input elements' name or value. The distributions of all those features in searchable and non-searchable forms are displayed in Table 2. From Table 2, we can draw the following conclusions:
Searchable forms have a large number of checkboxes and items (options) in selection lists.
Non-searchable forms have a large number of password tags and occurrences of 'email' in input elements' name or value.

Using these structural features, we can train a decision tree classifier.
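
Counting such structural features can be sketched with Python's standard HTML parser. This is a hedged illustration rather than the authors' feature extractor: the class name and the exact counting rules (for example, where the word 'search' is looked for) are assumptions, and the feature names follow the row labels of Table 2.

```python
from collections import Counter
from html.parser import HTMLParser


class FormFeatureCounter(HTMLParser):
    """Counts structural features of a form, keyed like the rows of Table 2."""

    INPUT_TYPES = {"checkbox", "file", "hidden", "image", "password", "radio", "text"}

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "form":
            method = (attrs.get("method") or "get").lower()   # HTML default method is get
            self.counts["method_" + method] += 1
            if "search" in " ".join(attrs.values()).lower():
                self.counts["search_yes"] += 1
        elif tag == "input":
            kind = (attrs.get("type") or "text").lower()      # HTML default type is text
            if kind in self.INPUT_TYPES:
                self.counts[kind] += 1
            if kind == "submit" and "search" in attrs.get("value", "").lower():
                self.counts["search_yes"] += 1
            if "email" in (attrs.get("name", "") + " " + attrs.get("value", "")).lower():
                self.counts["email_yes"] += 1
        elif tag in ("select", "option", "textarea"):
            self.counts[tag] += 1


CPAN_FORM = """
<form method="get" action="/search" name="f" class="searchbox">
<input type="text" name="query" value="" size="35">
<br>in <select name="mode">
<option value="all">All</option>
<option value="module">Modules</option>
<option value="dist">Distributions</option>
<option value="author">Authors</option>
</select> <input type="submit" value="CPAN Search">
</form>
"""

counter = FormFeatureCounter()
counter.feed(CPAN_FORM)
counter.close()
counts = counter.counts
print(dict(counts))
```

For the CPAN form of Section 3.1 this yields four options, one text element, the get method and no password field, the kind of profile that Table 2 associates with searchable forms.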
Two tools are used to construct a form structure classifier: the Weka J48 algorithm[26] and the Algorithm::SVMLight Perl package[23]. In Weka, the precision of the decision tree classifier is 0.948718. The decision tree generated by Weka is displayed in Fig.4. In this paper we actually use the Perl package Algorithm::SVMLight to train the decision tree classifier, and its precision is 0.948717948. The two tools thus give essentially the same result in our experiments. Nevertheless, the current implementation of Algorithm::SVMLight supports only discrete-valued attributes and consequently outputs a large number of rules: 122 rules in all. Additionally, we count an instance as misclassified if the decision tree cannot decide to which category the instance belongs.

Table 2 Features' distributions of searchable and non-searchable forms
Feature/Category   Searchable   Non-Searchable   Ratio
checkbox           2.39         0.18             13.04:1
email_yes          0.01         0.12             1:13.13
file               0.00         0.00
hidden             4.45         1.63             2.72:1
image              0.36         0.21             1.73:1
method_get         0.47         0.40             1.16:1
option             12.64        0.17             74.23:1
password           0.02         0.10             1:5.86
radio              0.48         0.10             4.76:1
search_yes         0.36         0.09             3.94:1
text               3.00         1.01             2.97:1
textarea           0.02         0.07             1:2.98

Fig.4 The decision tree generated by Weka J48 algorithm using form structural features

5.2 Training form text classifier

With the FT and PT methods (see Section 3.2), we can extract textual features from forms. The five most frequent features obtained by the FT and PT techniques are presented in Table 3. Table 3 shows that the PT method extracts more valuable features than the FT technique does.

Table 3 Five most frequent features extracted by FT and PT methods respectively
Method   Category   Textual features (Feature:Frequency)
FT       Airfare    option:8113   value:4161   td:1181      id:1069      class:993
FT       Auto       option:5673   value:3002   td:1520      tr:716       class:498
FT       Book       option:10997  value:5753   td:1538      tr:788       name:421
FT       Rental     option:6396   value:3199   td:892       pm:550       class:520
FT       Hotel      option:13048  value:6271   class:1377   div:1358     td:1265
FT       Job        option:7423   value:3868   u:775        td:680       tr:413
FT       Movie      option:6995   value:3616   div:1200     class:1181   td:734
FT       Music      option:8090   value:679    td:516       record:456   font:323
PT       Airfare    pm:419        airlin:279   air:124      am:102       airwai:100
PT       Auto       docum:108     car:105      leas:84      search:63    make:56
PT       Book       search:130    titl:110     book:95      author:75    new:72
PT       Rental     pm:402        option:202   am:168       airport:144  car:143
PT       Hotel      hotel:234     pm:228       island:151   new:135      room:84
PT       Job        job:207       new:125      locat:84     servic:82    island:81
PT       Movie      press:211     book:123     s:109        video:107    entertain:107
PT       Music      record:456    music:226    sub:156      search:97    new:80

Using those extracted textual features, we train eight SVM classifiers. The precisions of the eight SVM classifiers are shown in Table 4. Since the PT method extracts more valuable textual features than the FT technique does, we can use them to train a more accurate classifier. Table 4 indicates that, whatever the category, the PT method always yields a more accurate classifier than the FT technique.

Table 4 Precisions of SVM classifiers trained with form textual features that are extracted by the FT and PT methods respectively
Category   FT method   PT method
Airfare    0.9182      0.9364
Auto       0.9476      1.0
Book       0.9273      0.9636
Rental     0.9545      0.9636
Hotel      0.9333      0.9381
Job        0.9409      0.9682
Movie      0.9273      0.9773
Music      0.9         0.9682

5.3 Training page text classifier

In order to train the page text classifier, this paper gets its positive training data from the online open directory project (http://dmoz.org/). We use a Perl script to fill out the searchable form and extract URLs from the returned result pages automatically. As for negative training data, we get them from DMOZ (http://rdf.dmoz.org/). DMOZ consists of sixteen categories: Arts, Business, Computers, Games, Health, Home, Kids_and_Teens, News, Recreation, Reference, Regional, Science, Shopping, Society, Sports and World. We exclude four of them from our experiments: Kids_and_Teens, Reference, Regional and World. The number of URLs in each remaining category is shown in Table 5.

Table 5 The number of URLs in each DMOZ category
Arts     Business     Computers   Games      Health    Home
585924   511620       285336      123758     131050    33555
News     Recreation   Science     Shopping   Society   Sports
235704   120308       213014      235160     269864    154921

In DMOZ, each example looks like this:

    <ExternalPage about="http://www.airwise.com/airports/us/SLC/index.html">
    <d:Title>Salt Lake City Airport - airwise.com</d:Title>
    <d:Description>Information about the airport including airlines, ground transportation, parking, weather and airport news.</d:Description>
    <topic>Top/Regional/North_America/United_States/Utah/Localities/S/Salt_Lake_City/Transportation/Airports</topic>
    </ExternalPage>

This paper uses the content of the 'd:Description' element and the Web page corresponding to the 'about' attribute of ExternalPage to obtain a negative training example. In order to be more representative, we draw URLs from each category in proportion to its size. Since the 'Arts' category has the largest number of URLs, we take the largest number of URLs from it. After excluding the URLs that cannot be downloaded, the numbers of positive and negative examples used to train a page classifier for each category are listed in Table 6. Using these training data, we train the page classifiers. The precisions of these page classifiers are also shown in Table 6.

Table 6 The number of positive and negative train data for each category as well as the precision of its corresponding page classifier
Category   Precision
Airfare    0.9619
Auto       0.9463
Book       0.9135
Rental     0.9737
Hotel      0.9941
Job        0.9791
Movie      0.9004
Music      0.87
Positive and negative counts (digit run as extracted; column boundaries lost): 116251215689131708170616012231635643325228227213178312487

5.4 Using the three-step framework to find deep Web entries

We conduct eight large-scale experiments with our hidden Web crawler. For each category, we initialize our crawler with 100 seeds extracted from DMOZ as the starting points. We save pages and their corresponding URLs if the following two conditions are satisfied at the same time. First, they are judged to be relevant by the page and form text classifiers. Second, each page contains at least one domain-specific deep Web entry. Since Music Records databases are very sparsely distributed, our best-first focused crawler locates only 50 deep Web entries for that category. For each of the other categories, our crawler finds 100 deep Web entries.
Five deep Web entries of the Airfares category located by our crawler are listed below:
    http://www.aircanada.ca/
    http://www.itn.net/
    http://www.aircharter.com/
    http://www.orbitz.com/
    http://www.nwa.com/

Finally, we manually verify whether the deep Web entries located by our crawler are what we want. The precisions for all these categories are shown in Table 7.

Table 7 The precisions of deep Web entries for each category
Domain    Precision   Domain    Precision
Airfare   0.90        Auto      0.88
Book      0.91        Rental    0.95
Hotel     0.94        Job       0.81
Movie     0.86        Music     0.80

6 Conclusion and Future Work

In this paper, a three-step framework is proposed to automatically identify domain-specific hidden Web entries. To verify its effectiveness and efficiency, eight large-scale experiments are conducted. Experimental results demonstrate that our method can find domain-specific deep Web entries accurately and efficiently. The average precision of the eight representative domains is 0.88.
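
The reported average can be checked against the per-domain precisions of Table 7 with a few lines of Python; the dictionary below simply restates the table, and an unweighted mean over the eight domains is assumed.

```python
# Per-domain precisions as listed in Table 7.
precisions = {
    "Airfare": 0.90, "Auto": 0.88, "Book": 0.91, "Rental": 0.95,
    "Hotel": 0.94, "Job": 0.81, "Movie": 0.86, "Music": 0.80,
}

# Unweighted mean over the eight domains: 7.05 / 8 = 0.88125.
average = sum(precisions.values()) / len(precisions)
print(round(average, 2))
```

The unweighted mean rounds to 0.88, in agreement with the figure stated above.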

In the near future, experiments on a larger number of categories are required in order to further assess the effectiveness of our proposed technique. Additionally, we will integrate the obtained query interfaces into a unified interface and give it to users to query, so that users do not have to hunt for domain-specific sources and learn the details of querying each resource.

Acknowledgement  This work is sponsored by the Science and Technology Development Program of Jilin Province under Grant No.20070533 and the Natural Science Foundation of China under Grant No.60373099.

References:
[1] Rocco D, Caverlee J, Liu L, Critchlow T. Exploiting the deep Web with DynaBot: Matching, probing, and ranking. In: Ellis A, Hagino T, eds. Proc. of the World Wide Web Special Interest Tracks And Posters (WWW). Chiba: ACM, 2005. 1174-1175.
[2] BrightPlanet.com. The deep Web: Surfacing hidden value. http://brightplanet.com
[3] Bergman MK. The deep Web: Surfacing hidden value. Journal of Electronic Publishing, 2001,7(1):1174-1175. http://www.press.umich.edu/jep/07-01/bergman.html
[4] He B, Zhang Z, Chang KCC. Knocking the door to the deep Web: Integrating Web query interfaces. In: Weikum G, ed. Proc. of the SIGMOD Conf. Paris: ACM, 2004. 913-914.
[5] Chang KCC, He B, Zhang Z. MetaQuerier over the deep Web: Shallow integration across holistic sources. In: Nascimento MA, Özsu MT, Kossmann D, Miller RJ, Blakeley JA, Schiefer KB, eds. Proc. of the Int'l Conf. on Very Large Data Bases (VLDB). Morgan Kaufmann Publishers, 2004. 15-21.
[6] Wu W, Doan A, Yu CT. Merging interface schemas on the deep Web via clustering aggregation. In: Proc. of the Int'l Conf. on Data Mining (ICDM). IEEE Computer Society, 2005. 801-804.
[7] He H, Meng WY, Yu CT, Wu ZH. WISE-Integrator: A system for extracting and integrating complex Web search interfaces of the deep Web. In: Böhm K, Jensen CS, Haas LM, Kersten ML, Larson PA, Ooi BC, eds. Proc. of the Int'l Conf. on Very Large Data Bases (VLDB). ACM, 2005. 1314-1317.
[8] Chang KCC, Garcia-Molina H. Mind your vocabulary: Query mapping across heterogeneous information sources. In: Delis A, Faloutsos C, Ghandeharizadeh S, eds. Proc. of the SIGMOD Conf. Philadelphia: ACM Press, 1999. 335-346.
[9] He B, Zhang Z, Chang KCC. MetaQuerier: Querying structured Web sources on-the-fly. In: Özcan F, ed. Proc. of the SIGMOD Conf. ACM, 2005. 927-929.
[10] Nakatoh T, Yamada Y, Hirokawa S. Automatic generation of deep Web wrappers based on discovery of repetition. In: Proc. of the Asia Information Retrieval Symp. (AIRS). Beijing: Springer-Verlag, 2004. 269-272.
[11] Hedley YL, Younas M, James A, Sanderson M. A two-phase sampling technique for information extraction from hidden Web databases. In: Laender AHF, Lee D, Ronthaler M, eds. Proc. of the Int'l Workshop on Web Information and Data Management (WIDM). Washington: ACM, 2004. 1-8.
[12] Mundluru D, Katukuri JR, Celebi S. Automatically mining result records from search engine response pages. In: Proc. of the Int'l Conf. on Data Mining (ICDM). IEEE Computer Society, 2005. 749-752.
[13] Liu B, Grossman R, Zhai YH. Mining data records in Web pages. In: Getoor L, Senator TE, Domingos P, Faloutsos C, eds. Proc. of the Knowledge Discovery and Data Mining (KDD). Washington: ACM, 2003. 601-606.
[14] Hsieh W, Madhavan J, Pike R. Data management projects at Google. In: Chaudhuri S, Hristidis V, Polyzotis N, eds. Proc. of the SIGMOD Conf. Chicago: ACM, 2006. 725-726.
[15] Wu P, Wen JR, Liu H, Ma WY. Query selection techniques for efficient crawling of structured Web sources. In: Liu L, Reuter A, Whang KY, Zhang J, eds. Proc. of the Int'l Conf. on Data Engineering (ICDE). IEEE Computer Society, 2006. 47.
[16] Raghavan S, Garcia-Molina H. Crawling the hidden Web. In: Apers PMG, Atzeni P, Ceri S, Paraboschi S, Ramamohanarao K, Snodgrass RT, eds. Proc. of the Int'l Conf. on Very Large Data Bases (VLDB). Rome: Morgan Kaufmann Publishers, 2001. 129-138.
[17] Cope J, Craswell N, Hawking D. Automated discovery of search interfaces on the Web. In: Schewe KD, Zhou X, eds. Proc. of the Australasian Database Conf. (ADC). Australian Computer Society, 2003. 181-189.
[18] Bergholz A, Chidlovskii B. Crawling for domain-specific hidden Web resources. In: Proc. of the Int'l Conf. on Web Information Systems Engineering (WISE). Roma: IEEE Computer Society, 2003. 125-133.
[19] Barbosa L, Freire J. Combining classifiers to identify online databases. In: Williamson CL, Zurko ME, Patel-Schneider PF, Shenoy PJ, eds. Proc. of the World Wide Web Conf. (WWW). ACM, 2007. 431-440.
[20] Barbosa L, Freire J. An adaptive crawler for locating hidden-Web entry points. In: Williamson CL, Zurko ME, Patel-Schneider PF, Shenoy PJ, eds. Proc. of the World Wide Web Conf. (WWW). ACM, 2007. 441-450.
[21] Barbosa L, Freire J. Searching for hidden-Web databases. In: Doan AH, Neven F, McCann R, Bex GJ, eds. Proc. of the 8th Int'l Workshop on the Web and Databases (WebDB). Baltimore: ACM Press, 2005. 1-6.
[22] Chang CC, Lin CJ. Libsvm: A library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[23] CPAN. http://search.cpan.org/
[24] Torgo L, Gama J. Regression by classification. In: Borges D, Kaestner C, eds. Proc. of the Brazilian Artificial Intelligence Symp. Curitiba: Springer-Verlag, 1996. 51-60.
[25] The UIUC Web integration repository. http://metaquerier.cs.uiuc.edu/repository/
[26] Weka. http://www.cs.waikato.ac.nz/ml/weka/

WANG Hui was born in 1972. He received his Ph.D. degree from Jilin University. His research area is Web information mining.

ZUO Wan-Li was born in 1957. He is a professor and doctoral supervisor at Jilin University and a CCF senior member. His research areas are database, data mining and Web search engine.

LIU Yan-Wei was born in 1983. He is a graduate student at Jilin University. His research area is Web information mining.
