Analysing Features of Japanese Splogs and Characteristics of Keywords

Yuuki Sato, Takehito Utsuro, University of Tsukuba, Tsukuba, 305-8573, JAPAN
Tomohiro Fukuhara, University of Tokyo, Kashiwa 277-8568, JAPAN
Yasuhide Kawada, Navix Co., Ltd., Tokyo, 141-0031, JAPAN
Yoshiaki Murakami, Navix Co., Ltd., Tokyo, 141-0031, JAPAN
Hiroshi Nakagawa, University of Tokyo, Tokyo, 113-0033, JAPAN
Noriko Kando, National Institute of Informatics, Tokyo, 101-8430, JAPAN

ABSTRACT
This paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them.
We estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs.
Since splogs often cause noise in word occurrence statistics in the blogosphere, we assume that we can efficiently (manually) collect splogs by sampling blog homepages containing keywords of a certain type on the date with its most frequent occurrence.
We manually examine various features of collected blog homepages regarding whether their text content is excerpted from other sources or not, as well as whether they display affiliate advertisement or out-going links to affiliated sites.
Among various informative results, it is important to note that more than half of the collected splogs are created by a very small number of spammers.

Categories and Subject Descriptors
H.3.0 [INFORMATION STORAGE AND RETRIEVAL]: General

General Terms
Reliability

Keywords
Blog analysis, splog, time series characteristics of keywords, keyword bursts

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
AIRWeb'08, April 22, 2008 Beijing, China.
Copyright 2008 ACM 978-1-60558-159-0...$5.00.

1. INTRODUCTION
Weblogs, or blogs, are regarded as a kind of personal journal or as market or product commentary.
While traditional search engines continue to discover and index blogs, the blogosphere has produced custom blog search and analysis engines, systems that employ specialized information retrieval techniques.
There are several previous works and services on blog analysis systems.
[13] proposed a system called blogWatcher that collects and analyzes Japanese blog articles.
[6] proposed a system called BlogPulse that analyzes trends of blog articles.
With respect to blog analysis services on the Internet, there are several commercial and non-commercial services such as Technorati¹, BlogPulse², kizasi.jp³, and blogWatcher⁴.
With respect to multilingual blog services, Globe of Blogs⁵ provides a retrieval function of blog articles across languages.
Best Blogs in Asia Directory⁶ also provides a retrieval function for Asian language blogs.
Blogwise⁷ also analyzes multilingual blog articles.

As with most Internet-enabled applications, the ease of content creation and distribution makes the blogosphere spam prone [7, 1, 10, 12, 9].
Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting advertisements or raising the PageRank of target sites.
[10] reported that for English blogs, around 88% of all pinging URLs (i.e., blog homepages) are splogs, which account for about 75% of all pings.
Based on this estimation, as stated in [1, 11], splogs can cause problems including the degradation of information retrieval quality and the significant waste of network and storage resources.
Several previous works [10, 12, 9] reported important characteristics of splogs.
[12] reported characteristics of ping time series, in-degree/out-degree distributions, and typical words in splogs found in the TREC⁸ Blog06 data collection.
[10, 9] also reported the results of analyzing splogs in the BlogPulse data set.
In the context of semi-automatically collecting web spam including splogs, [16] discuss how to collect spammer-targeted keywords to be used when collecting a large number of web spam pages efficiently.

Unlike those previous works, this paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them [14].

¹ http://technorati.com/
² http://www.blogpulse.com/
³ http://kizasi.jp/ (in Japanese)
⁴ http://blogwatcher.pi.titech.ac.jp/ (in Japanese)
⁵ http://www.globeofblogs.com/
⁶ http://www.misohoni.com/bba/
⁷ http://www.blogwise.com/
⁸ http://trec.nist.gov/

Table 1: Features for Characterizing Splogs and their Rates in Splog Data Set

| Feature Types | Features | Descriptions | Rate in Splogs (%) |
| Affiliate Features | links to affiliated sites | Blog articles (posts) contain sufficiently many out-going links to affiliated sites, except for the out-going links that the blog hosts automatically add to individual blog homepages and blog posts. | 80.5 |
| Affiliate Features | advertisement articles (posts) | Blog articles (posts) themselves contain sufficiently many advertisements, except for the advertisements that the blog hosts automatically add to individual blog homepages and blog posts. | 31.0 |
| Affiliate Features | articles (posts) with adult content | Blog articles (posts) contain adult content. | 8.1 |
| Affiliate Features | keywords with popup advertisement | Certain blog hosts have facilities of automatically adding popup advertisements to keywords. | 42.1 |
| Content Source Features | excerpt from news articles | Text content is automatically or manually excerpted from news articles. | 14.3 |
| Content Source Features | excerpt from blog articles (posts) or other web texts | Text content is automatically or manually excerpted from other blog articles (posts), or web texts other than news articles and advertisement pages. | 70.8 |
| Content Source Features | excerpt from advertisement pages | Text content is automatically or manually excerpted from certain advertisement pages. | 27.1 |
| Content Source Features | originally written texts | Spammers write original splog texts. | 2.9 |
| Content Source Features | meaningless sequence of words | Most of them are so called word salad spam text [2] and are automatically generated. | 3.6 |
| Creation Procedure Features | excerpt from other sources, selected without keyword retrieval | Text content is automatically or manually excerpted from other sources without keyword retrieval. Typical cases are excerpt from news articles or blog posts on the same date or close dates. | 12.7 |
| Creation Procedure Features | excerpt from other sources, retrieved with a keyword varying day by day | Text content is automatically or manually retrieved from other sources with a keyword varying day by day, and then excerpted. | 49.5 |
| Creation Procedure Features | excerpt from other sources, retrieved with a single keyword throughout a blog homepage | For a blog homepage, all of its text content is excerpt, which are automatically or manually retrieved from other sources with a single keyword throughout all of its posts. | 36.9 |
| Creation Procedure Features | keyword stuffed blog [9] | Blog articles (posts) contain lists of keywords for SEO purposes. | 11.5 |
| Creation Procedure Features | automatically generated text | Most of them are so called word salad spam text [2], which is a mixture of seemingly meaningful words that together signify nothing. Sometimes, connecting several sentences each of which is excerpted from other source. | 4.5 |

As has often been noted in the previous works, text content of splogs is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts.
Considering this fact, in this work, we estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs.
The characteristics of a keyword to which we pay attention in this paper are whether the keyword is of public or private concern and the duration of people's concern about the keyword.
Furthermore, since splogs often cause noise in word occurrence statistics in the blogosphere, we assume that we can efficiently collect splogs by sampling blog homepages containing keywords of a certain type on the date with its most frequent occurrence.
We then manually examine various features of collected blog homepages regarding whether their text contents are excerpts from other sources or not, as well as whether they display affiliate advertisement or out-going links to affiliated sites.
Among various informative results of our analysis, it is important to note that more than half of the collected splogs are created by a very small number of spammers, and hence, the analysis reported in this paper is strongly affected by the choices of those spammers when they create those splogs.

2. PROCEDURE OF CREATING SPLOGS
Text content of splogs is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts.
In any case, splogs have commercial intention: they display affiliate advertisement or out-going links to affiliated sites.
For this purpose, splogs are usually created by searching for up-to-date content from other sources and by excerpting them.
This procedure of creating splogs can be roughly divided into the following two cases:

i) excerpting text content from news articles or blog posts on the same date or close dates without keyword retrieval,
ii) excerpting text content by retrieving them from other sources with certain keywords.

[Figure 1: Time Series Characteristics of Keyword Occurrence Statistics in Splogs / Authentic Blogs. Panels: (a) keyword with burst, (b) keyword without burst. Each panel shows the time series for authentic blogs and for splogs.]

Splog posts created by the first procedure just a few days before the current date tend to contain up-to-date text content which is originally from quite recent news articles or blog posts.
On the other hand, for splogs created by the second procedure, spammers usually carefully choose keywords for retrieving text content from other sources such as news articles and blog posts.
They tend to choose high paying adsense⁹ keywords.

3. FEATURES FOR CHARACTERIZING SPLOGS
This section describes the features for characterizing Japanese splog homepages manually collected by the procedure of section 5.3.
As we summarize in Table 1, this paper considers the following three types of features for splogs, namely, 1) affiliate features, 2) content source features, and 3) creation procedure features.
For each of these three feature types, Table 1 lists several binary features each of which denotes whether the given splog homepage has the designated characteristics or not.
Here, note that features of the same type are independent of each other and hence are not necessarily disjoint.
Also note that most of those features are intended for use in manual examination of splogs, and hence, they are not necessarily meant for automatic detection.

3.1 Affiliate Features
Among the three feature types, first we describe affiliate features.
As introduced in [10, 9], splogs are generated with two often overlapping motives, namely, creation of fake blogs for the purpose of hosting profitable advertisement, and unjustifiably increasing the ranking of affiliated sites.
Since both motives are deeply related to affiliate advertising, in this paper, we consider features of splogs regarding issues of affiliates.
As the affiliate features, we manually examine the following four points:

i) whether the blog articles (posts) contain out-going links to affiliated sites,
ii) whether the blog articles (posts) themselves contain advertisements,
iii) whether blog articles (posts) contain adult content¹⁰,
iv) whether blog articles (posts) contain popup advertisements automatically added to certain keywords.

⁹ http://google.com/adsense

3.2 Content Source Features
Second, one of the important characteristics of splogs is that their text content is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts.
In order to estimate the mechanism of creating splogs, we manually examine the content source of splogs and classify them according to the following five features, namely, content source features:

i) excerpt from news articles,
ii) excerpt from blog articles (posts) or other web texts,
iii) excerpt from advertisement pages,
iv) originally written texts,
v) meaningless sequence of words such as word salad spam texts [2].

3.3 Creation Procedure Features
Furthermore, we estimate the procedures of searching the web for those excerpts and manually classify them according to the following five features, namely, creation procedure features:

i) excerpt from other sources, selected without keyword retrieval, where typical cases are excerpt from news articles or blog posts on the same date or close dates,
ii) excerpt from other sources, retrieved with a keyword varying day by day,
iii) excerpt from other sources, retrieved with a single keyword throughout a blog homepage,
iv) keyword stuffed blog [9],
v) automatically generated text including word salad spam texts [2].

As the creation procedure features, we distinguish two major procedures of creating splogs, i.e., a) excerpt from news articles or blog posts on the same date or close dates without keyword retrieval, and b) excerpt by retrieving texts from other sources with certain keywords.
The former type corresponds to the feature i) above, while the latter to the features ii) and iii) above.

¹⁰ Adult content is among the major target genres for affiliate advertising, while other major target genres include health food and slimming products, cosmetics, and finance. We regard blogs which contain adult content as more harmful than others, and record them with an independent feature.

[Figure 2: A Keyword Map for Characterizing Keywords]
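
Because the binary features of one type are not disjoint, a rate such as those in the rightmost column of Table 1 is simply the share of splog homepages on which a feature holds, and the rates within one type may sum to more than 100%. The following is a minimal sketch of that tally in Python. The `annotations` mapping, its URLs, and the function name `feature_rates` are hypothetical and only illustrate the bookkeeping; they are not the annotation format used in this work.

```python
from collections import Counter

# Hypothetical annotations: each splog homepage maps to the set of binary
# features that an annotator marked as holding. Feature names follow Table 1;
# the URLs and the data themselves are made up for illustration.
annotations = {
    "http://example.com/a": {"links to affiliated sites",
                             "excerpt from blog articles or other web texts"},
    "http://example.com/b": {"links to affiliated sites",
                             "excerpt from news articles",
                             "excerpt from advertisement pages"},
    "http://example.com/c": {"keywords with popup advertisement",
                             "excerpt from blog articles or other web texts"},
    "http://example.com/d": {"links to affiliated sites",
                             "originally written texts"},
}


def feature_rates(annotations):
    """Return, per feature, the percentage of homepages on which it holds."""
    counts = Counter()
    for features in annotations.values():
        counts.update(features)  # a homepage may contribute to several features
    total = len(annotations)
    return {name: 100.0 * n / total for name, n in counts.items()}


rates = feature_rates(annotations)
for name in sorted(rates):
    print(f"{name}: {rates[name]:.1f}%")
```

In this toy data the four content source features sum to 125%, which is the expected behavior when one homepage excerpts from more than one kind of source.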

4. CHARACTERISTICS OF SPLOGS AND KEYWORDS
4.1 Time Series Characteristics of Keywords
Among the problems caused by splogs, this section discusses the issue of noise in word occurrence statistics in the blogosphere.
Figure 1 illustrates two typical cases of noise in time series keyword occurrence statistics, where (a) is the case of a keyword with burst, and (b) is the case of a keyword without burst.
For both cases, keyword occurrences are a mixture of those from authentic blogs and splogs.
Without detecting and removing splogs, it is difficult to estimate real keyword occurrence statistics only in authentic blogs.
For the case of the keywords with burst, especially, it is estimated that the burst in splogs may be delayed from that in authentic blogs, because text content of splogs is mostly excerpted from other sources such as news articles and blog posts.
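
The expected delay between the two bursts can be made concrete with a small sketch. It assumes that daily keyword counts are available separately for authentic blogs and for splogs, and it uses the date of maximum occurrence as a crude stand-in for the burst date. The paper does not specify a burst detection method, so this proxy, the function name `peak_date`, and the counts below are all illustrative assumptions.

```python
from datetime import date


def peak_date(daily_counts):
    """Return the date with the most occurrences (earliest date on ties)."""
    return min(daily_counts, key=lambda d: (-daily_counts[d], d))


# Made-up daily occurrence counts of one keyword.
authentic = {
    date(2007, 7, 1): 5,
    date(2007, 7, 2): 40,
    date(2007, 7, 3): 12,
    date(2007, 7, 4): 6,
}
splog = {
    date(2007, 7, 1): 1,
    date(2007, 7, 2): 6,
    date(2007, 7, 3): 9,
    date(2007, 7, 4): 30,
}

# A positive lag means the splog peak follows the authentic-blog peak.
lag_days = (peak_date(splog) - peak_date(authentic)).days
print(lag_days)
```

The same `peak_date` helper also expresses the sampling criterion used later in the collection procedure, where homepages are gathered on the date of a keyword's most frequent occurrence.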

4.2 Keyword Map for Characterizing Keywords
This section introduces the keyword map of Figure 2 for characterizing keywords.
The vertical axis of the map denotes whether each keyword is of public/private concern, while its horizontal axis denotes the duration of people's concern about each keyword.
Keywords with public concern are typically reported in news as social/political/economic issues, while those with private concern are typically issues regarding entertainment or celebrity, or high paying adsense keywords.
On the other hand, keywords with short term duration include seasonal ones and those related to temporary events, while those with long term duration include organization names with a long history such as political parties and country names, or those related to permanent issues such as health and beauty.

On the map of Figure 2, 50 keywords that are balanced in their distribution on the map are placed, where the position of each keyword is determined totally by intuition.
Those keywords vary in their time series characteristics of occurrence statistics, where some of them are with burst while others are not.
Each of those keywords is intended to be used for retrieving blog (authentic blog and splog) homepages in the procedure of section 5.3.
The major purpose of placing such various keywords onto a map like this is to simply examine the correlation between the characteristics of keywords and the rate of splogs among the blogs containing each keyword.

Table 2: Summary of Japanese Blog Data (at December 3rd, 2007, 0:00)

| # of blog homepages | # of articles | # of days | current # of articles per day |
| 3,591,306 | 192,699,276 | 1,355 | 196,975 |

5. ANALYZING SPLOGS BASED ON CHARACTERISTICS OF KEYWORDS
5.1 Motivations
This paper reports the results of analyzing the following three points after collecting blogs and then manually detecting splogs among them.

1. Features of splogs are manually examined according to those introduced in section 3.
2. According to the keyword map for characterizing keywords, various characteristics of keywords are manually examined, which include time series characteristics such as whether with/without burst.
3. Based on the results of examining the above two points, we further analyze various correlations between characteristics of splogs and keywords. This analysis mainly includes the following:
   (a) correlation between the characteristics of keywords and the rate of splogs among the blogs containing each keyword. This will reveal the preference of spammers when choosing keywords.
   (b) correlation between the characteristics of keywords and the splog creation procedures.

5.2 Japanese Blog Data
For collecting the Japanese blog data, we use the system called KANSHIN [3, 4, 5], which collects blog articles (posts) written in Chinese, Japanese, Korean, and English.
The system has lists of blog homepages for each language.
By using these lists, the system collects RSS¹¹ and Atom feed files provided by blog homepages, extracts keywords from the feed files by using morphological analysis tools, and stores keywords and articles in each database.
The system uses several linguistic tools for extracting and indexing keywords from blog articles for each language.
For Japanese, it uses a morphological analysis tool called Juman¹².
The system provides users with functions for retrieving and analyzing articles.
Table 2 shows the summary of Japanese blog data stored in the system (checked at December 3rd, 2007).
3.6 million blog homepages and 193 million articles are registered for Japanese since March 18th, 2004.

¹¹ Several references such as RDF Site Summary or Really Simple Syndication or Rich Site Summary exist.
¹² http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html

[Figure 3: Blog Host Distribution in the Splog Homepage Data Set. seesaa 44%, cocolog 32%, jugem.jp 12%, ameblo 5%, livedoor 1%, yahoo 0%, goo.ne 0%, Rest 6%.]

5.3 Procedure of the Analysis
This section gives the specific procedure of collecting and analyzing splogs based on characteristics of keywords.
The rough strategy of collecting splogs here is to simply collect blog homepages (i.e., not blog posts) which contain a given keyword and then, considering the features of splogs defined in section 3, to manually judge whether each of the collected blog homepages is a splog or an authentic blog.
Considering the result of a preliminary examination, we assume that, for keywords with burst, the rate of splogs among the blog homepages that contain those keywords may be higher on the burst date than on other dates.
We further assume that, even for keywords without burst, the rate of splogs may be higher on the date with the most frequent occurrence in the blogosphere than on other dates.
Based on these assumptions, in order to collect a sufficient number of splogs, for each keyword, we collect blog homepages containing the keyword on the date with its most frequent occurrence.
Furthermore, also considering the result of a preliminary examination, we prefer blog homepages with more posts per day to those with fewer posts per day.
The following list summarizes the above procedure.

1. For each of the 50 keywords in Figure 2, we collect blog homepage URLs which contain the keyword on the date with its most frequent occurrence during the year 2007.
2. Among the collected URLs, we select the topmost 50 with respect to the number of posts per day. We further randomly select 60 URLs from the rest. This amounts to 110 URLs in total, where the topmost 50 URLs are usually with more than three posts per day, while the remaining 60 URLs are with one or two posts per day.
3. For each of the collected URLs, an annotator judges whether each binary feature defined in section 3 holds or not.
4. Based on the above judgement, each URL is judged to be a splog or an authentic blog according to the following rule.
   (a) If one of the following holds for the given URL, then it is mostly¹³ a splog.
       i. The feature "originally written text" does not hold.
       ii. The feature "originally written text" holds and at least one of the features "links to affiliated sites", "advertisement articles (posts)", or "articles (posts) with adult content" holds.
   (b) Otherwise, the given URL is an authentic blog.
5. Finally, we analyze the correlation between characteristics of keywords and the distribution of features manually annotated to splogs.

¹³ By "mostly", we mean that it is usually necessary to judge by considering the contents of each blog.

Table 3: Splog Rate per Blog Host

| Blog Host | seesaa | cocolog | jugem.jp | ameblo | livedoor | goo.ne | yahoo | Rest | Total |
| # of Blog Homepages: Splog | 192 | 142 | 54 | 24 | 3 | 1 | 0 | 26 | 442 |
| # of Blog Homepages: Authentic Blog | 203 | 115 | 169 | 355 | 128 | 130 | 207 | 396 | 1703 |
| # of Blog Homepages: Total | 395 | 257 | 223 | 379 | 131 | 131 | 207 | 422 | 2145 |
| Splog Rate (%) | 48.6 | 55.3 | 24.2 | 6.3 | 2.3 | 0.8 | 0.0 | 6.2 | 20.6 |

Table 4: Splog Rate, Professional Spammer Rate (from professional spammer / splog), # of Professional Spammers, and Amateur Only Splog Rate (from amateur spammer / (from amateur spammer + non-splog)) (in descending order of splog rates, boldfaced: "splog rate > 10%, professional spammer rate > 50%", underlined: "amateur only splog rate 20% or more, mostly with private concern")

| Keyword | Splog Rate (%) | Professional Spammer Rate (%) | # of Professional Spammers | Amateur Only Splog Rate (%) |
| erog, adult content blog | 89.2 | 92.4 | 3 | 38.5 |
| rumor | 88.1 | 94.8 | 1 | 27.8 |
| national pension | 58.1 | 90.2 | 2 | 12.0 |
| no revision | 40.9 | 18.5 | 1 | 36.1 |
| health food | 37.4 | 58.7 | 2 | 19.8 |
| cosmetic surgery | 24.4 | 14.3 | 2 | 21.7 |
| Viagra | 22.5 | 11.1 | 1 | 20.5 |
| Darvish, a Japanese baseball player | 22.1 | 0.0 | 0 | 22.1 |
| video | 19.1 | 0.0 | 0 | 19.1 |
| Asasho-ryu, a sumo wrestler | 15.2 | 80.0 | 2 | 3.4 |
| Billy's Boot Camp | 15.1 | 0.0 | 0 | 15.1 |
| Saeko, a Japanese actress and Darvish's wife | 14.3 | 14.3 | 1 | 12.2 |
| COMSN, Inc., elderly care business company with a scandal | 6.9 | 71.4 | 2 | 2.1 |
| ZARD (a Japanese female singer, accidentally died) | 4.7 | 20.0 | 1 | 3.8 |
| China Airlines | 4.7 | 20.0 | 1 | 3.8 |
| North Korea | 2.9 | 100.0 | 1 | 0.0 |
| Wii (a video game console of Nintendo) | 2.8 | 66.7 | 1 | 1.0 |
| heat wave | 2.8 | 33.3 | 1 | 1.9 |
| "The dignity of the woman", the title of a book | 2.0 | 0.0 | 0 | 2.0 |
| a Japanese slang word for "lazy woman" | 1.8 | 50.0 | 1 | 0.9 |
| Upper House election | 0.0 | 0.0 | 0 | 0.0 |
| Democratic Party of Japan | 0.0 | 0.0 | 0 | 0.0 |
| Total | 20.5 | 61.5 | 10 | 9.0 |
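
Steps 2 and 4 of the procedure above are mechanical enough to sketch in code. The following Python fragment is only an illustration under stated assumptions: the posts-per-day figures are taken to be available as a dictionary keyed by URL, and the names `sample_urls`, `Annotation`, and `is_mostly_splog` are hypothetical. The rule yields a "mostly splog" candidate label only, since the final judgement still requires an annotator to consider the contents of each blog.

```python
import random
from dataclasses import dataclass


def sample_urls(posts_per_day, top_n=50, random_n=60, seed=0):
    """Step 2: take the top_n URLs by posts per day, plus random_n of the rest."""
    ranked = sorted(posts_per_day, key=posts_per_day.get, reverse=True)
    top, rest = ranked[:top_n], ranked[top_n:]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return top + rng.sample(rest, min(random_n, len(rest)))


@dataclass
class Annotation:
    """Subset of the binary features of section 3 that the rule in step 4 uses."""
    originally_written: bool
    links_to_affiliated_sites: bool = False
    advertisement_articles: bool = False
    adult_content: bool = False


def is_mostly_splog(a):
    """Step 4: True for rule (a), False for rule (b)."""
    if not a.originally_written:
        return True  # rule (a) i
    # rule (a) ii: original text, but with at least one of the three features
    return (a.links_to_affiliated_sites
            or a.advertisement_articles
            or a.adult_content)


# Made-up posting frequencies for 200 hypothetical homepages.
posts = {f"http://example.com/{i}": i for i in range(200)}
picked = sample_urls(posts)
print(len(picked))
```

Note that under this rule a blog is labeled authentic only when its text is originally written and none of the three listed affiliate features holds.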

6. PRELIMINARY RESULTS OF ANALYZING SPLOGS
This section discusses preliminary results of analyzing Japanese splogs based on characteristics of keywords, features of splogs, as well as other features which can be automatically analyzed such as blog hosts distribution.
We further analyze the correlation between characteristics of keywords and the feature distribution of splogs.
Here, note that the results shown below are preliminary in that they are for 22 keywords out of the 50 on the map of Figure 2.

6.1 Blog Hosts Statistics
As can be clearly seen from Figure 3, in our Japanese blog homepage data set, more than 88% of splogs are from the top three hosts.
Furthermore, as shown in Table 3, for the top two hosts, about half of the blog homepages are splogs¹⁴.
It is estimated that those hosts with high splog rates spend less effort on manually removing splogs than those with low splog rates.
As we argue in the next section, it is observed that a very small number of spammers actually create a substantial number of splog homepages on those three hosts, and this increases the splog rates of those hosts.
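
The splog rate per host in Table 3 is the number of splog homepages divided by all judged homepages on that host. As a quick check of the arithmetic, the following sketch recomputes the rates from the counts reported in Table 3; the variable and function names are illustrative.

```python
# Counts of judged blog homepages per host, as reported in Table 3.
splog = {"seesaa": 192, "cocolog": 142, "jugem.jp": 54, "ameblo": 24,
         "livedoor": 3, "goo.ne": 1, "yahoo": 0, "Rest": 26}
authentic = {"seesaa": 203, "cocolog": 115, "jugem.jp": 169, "ameblo": 355,
             "livedoor": 128, "goo.ne": 130, "yahoo": 207, "Rest": 396}


def splog_rate(n_splog, n_authentic):
    """Percentage of splogs among all judged homepages."""
    return 100.0 * n_splog / (n_splog + n_authentic)


rates = {host: round(splog_rate(splog[host], authentic[host]), 1)
         for host in splog}
overall = round(splog_rate(sum(splog.values()), sum(authentic.values())), 1)

for host, rate in rates.items():
    print(f"{host}: {rate}%")
print(f"Total: {overall}%")
```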

¹⁴ Due to errors in the procedure of collecting blog URLs for judging splog/authentic blog distinction, for the moment, we do not have 110 blog URLs in total for several keywords.

6.2 Relations between Characteristics of Keywords and Splogs
Next, for each of the 22 keywords, Table 4 gives splog rates in the blog homepages collected with the keyword, in descending order of splog rates.
In the table, those 22 keywords are divided into three groups, i.e., those with splog rates higher than 30%, those with splog rates between 10% and 30%, and the rest.
We further count occurrences of features of splogs in the entire splog data set, and list their rates in the splog data set in the rightmost column of Table 1.
Based on this feature analysis, we examine the correlation between those splog features and the characteristics of keywords with splog rates higher than 10%.

Furthermore, we judged that two splogs are created by an identical spammer when their HTML layouts are similar¹⁵, and then grouped those splogs from an identical spammer.
In this paper, we call those spammers each of whom created more than one splog in our data set professional spammers, and the remaining spammers, each of whom created only one splog in our data set, amateur spammers.
With this judgement, we can identify 10 professional spammers in our splog data set (summarized in Table 5), where out of the total 442 splog homepages, 272 (61.5%) can be regarded as created by those 10 professional spammers.
Based on this professional/amateur spammer analysis, for each keyword, Table 4 shows the rate of splog homepages being created by one of the 10 professional spammers.
Table 4 also shows the number of professional spammers observed for each keyword, as well as splog rates after removing those created by professional spammers (amateur only splog rate).

Table 5: 10 Professional Spammers identified in our Splog Data Set

| ID | # of Splogs | Affiliate Features (in Table 1) | Content Source Features (in Table 1) | Creation Procedure Features (in Table 1) | Keywords |
| 1 | 115 (42.5%) | links to affiliated sites, popup advertisement | blog or other web texts | retrieved with a single keyword | rumor, no revision, cosmetic surgery, Asasho-ryu, Saeko, China Airlines, COMSN, Inc., ZARD, heat wave, Wii, North Korea, "lazy woman" |
| 2 | 56 (20.6%) | links to affiliated sites | blog or other web texts | retrieved with a keyword varying day by day | erog |
| 3 | 30 (11.0%) | links to affiliated sites | news articles, advertisement pages | selected without keyword retrieval | national pension, COMSN, Inc. |
| 4 | 26 (9.6%) | links to affiliated sites, advertisement articles, popup advertisement | blog or other web texts, advertisement pages | retrieved with a keyword varying day by day | national pension |
| 5 | 20 (7.4%) | links to affiliated sites, advertisement articles | advertisement pages | retrieved with a keyword varying day by day, keyword stuffed blog | health food |
| 6 | 10 (3.7%) | links to affiliated sites, adult content, popup advertisement | news articles, blog or other web texts | selected without keyword retrieval | erog, Asasho-ryu, |
| 7-10 | 15 (5.5%) | - | - | - | erog, health food, Viagra, cosmetic surgery, |
| Total | 272 | - | - | - | - |

¹⁵ Our next plan is to employ the technique presented in [15], so that we can automatically group splog homepages into the 10 groups shown here.
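
The professional/amateur split and the two derived rates of Table 4 can be written down compactly. The sketch below assumes that each splog homepage has already been assigned a spammer identifier, for instance by layout similarity; the grouping step itself is not shown. The function names and the toy mapping are hypothetical. The totals passed to the rate functions (272 professional splogs, 442 splogs, 1703 authentic blogs) are those reported in Tables 3 and 5.

```python
from collections import Counter


def split_spammers(splog_to_spammer):
    """Spammers with more than one splog are professional, the rest amateur."""
    sizes = Counter(splog_to_spammer.values())
    professional = {s for s, n in sizes.items() if n > 1}
    amateur = set(sizes) - professional
    return professional, amateur


def professional_spammer_rate(n_professional_splogs, n_splogs):
    """Percentage of splogs that come from professional spammers."""
    return 100.0 * n_professional_splogs / n_splogs


def amateur_only_splog_rate(n_professional_splogs, n_splogs, n_authentic):
    """Splog rate after removing splogs created by professional spammers."""
    n_amateur = n_splogs - n_professional_splogs
    return 100.0 * n_amateur / (n_amateur + n_authentic)


# Toy grouping with made-up identifiers: spammer "A" owns two splogs.
professional, amateur = split_spammers({"s1": "A", "s2": "A", "s3": "B"})

# Totals over the whole data set.
print(round(professional_spammer_rate(272, 442), 1))
print(round(amateur_only_splog_rate(272, 442, 1703), 1))
```

With these totals the professional spammer rate comes out at about 61.5% and the amateur only splog rate at roughly 9%, in line with the Total row of Table 4.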

Major conclusions of this analysis can be summarized as below, some of which are also noted in the map of the 22 keywords in Figure 4.

(1) The most important fact to note here is that, for four out of the five keywords with splog rate over 30%, most splog homepages are created by professional spammers.
Splogs containing these four keywords actually amount to more than half of the entire splog data set.
This fact is very important because the following analysis is strongly affected by the choices of those professional spammers in creating those splogs.

(2) As can be seen from the map in Figure 4, most of the keywords placed in the upper half of the map have low splog rates.
This means that splogs tend to contain keywords with private concern more often than those with public concern.
"National pension" and "Asasho-ryu" have exceptionally high splog rates, though these statistics are strongly affected by the choices of professional spammers.
Those spammers posted splog posts on certain dates, where the splog articles are created from the excerpts of the news reports and blog posts on those dates.
Those excerpts occasionally include scandal reports closely related to the two keywords.

(3) The three keywords "rumor", "erog, adult content blog", and "health food" correspond to another group of splogs created by professional spammers.
In the case of these keywords, the spammers posted splog posts, where the splog articles are created from the excerpts of other blogs and advertisements, but not news articles, by retrieving them with certain keywords.

7. CONCLUSION
This paper focused on analyzing (Japanese) splogs based on various characteristics of keywords contained in them.
Among various informative results of our analysis, it is important to note that more than half of the collected splogs are created by a very small number of professional spammers.
Future work includes further analysis of splogs by integrating with other features studied in the previous works [12, 10, 9], such as characteristic words in splogs, in-degree/out-degree distributions, and ping time series.
Next, we plan to apply existing splog detection techniques [11, 8] to our splog data set, and then to develop a splog detector with high accuracy.
Splogs/authentic blogs collected in this work are also useful for analyzing characteristics of keywords at a much larger scale, simply by automatically collecting a much larger number of keywords, and then measuring correlation between splogs and each keyword.

[Figure 4: Keyword Map with Splog Analysis Results]

8. REFERENCES
[1] Wikipedia, Spam blog. http://en.wikipedia.org/wiki/Spam_blog.
[2] Wikipedia, Word salad (computer science). http://en.wikipedia.org/wiki/Wordsalad%28computer_science%29.
[3] T. Fukuhara, T. Murayama, and T. Nishida. Analyzing concerns of people using Weblog articles and real world temporal data. In Proceedings of WWW 2005 2nd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2005.
[4] T. Fukuhara, H. Nakagawa, and T. Nishida. Understanding sentiment of people from news articles: Temporal sentiment analysis of social events. In Proceedings of ICWSM, pages 271–272, 2007.
[5] T. Fukuhara, T. Utsuro, and H. Nakagawa. Cross-lingual concern analysis from multilingual weblog articles. In A. Nijholt, O. Stock, and T. Nishida, editors, Proceedings of the 6th International Workshop on Social Intelligence Design, pages 55–64, 2007.
[6] N. Glance, M. Hurst, and T. Tomokiyo. Blogpulse: Automated trend discovery for Weblogs. In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004.
[7] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. 1st AIRWeb, pages 39–47, 2005.
[8] P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog identification and Splog detection. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, pages 92–99, 2006.
[9] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social media. In Tutorial at ICWSM, 2007.
[10] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proceedings of WWW 2006 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.
[11] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proc. 3rd AIRWeb, pages 1–8, 2007.
[12] C. Macdonald and I. Ounis. The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical Report TR-2006-224, University of Glasgow, Department of Computing Science, 2006.
[13] T. Nanno, T. Fujiki, Y. Suzuki, and M. Okumura. Automatically collecting, monitoring, and mining Japanese weblogs. In WWW Alt. '04: Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, pages 320–321. ACM Press, 2004.
[14] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami, H. Nakagawa, and N. Kando. Collecting and analyzing Japanese splogs based on characteristics of keywords. In Proc. ICWSM, pages 218–219, 2008.
[15] T. Urvoy, T. Lavergne, and P. Filoche. Tracking Web spam with hidden style similarity. In Proc. 2nd AIRWeb, pages 25–30, 2006.
[16] Y. Wang, M. Ma, Y. Niu, and H. Chen. Spam double-funnel: Connecting web spammers with advertisers. In Proc. 16th WWW Conf., pages 291–300, 2007.
