Analysing Features of Japanese Splogs and Characteristics of Keywords

Yuuki Sato, Takehito Utsuro
University of Tsukuba, Tsukuba, 305-8573, JAPAN

Tomohiro Fukuhara
University of Tokyo, Kashiwa 277-8568, JAPAN

Yasuhide Kawada
Navix Co., Ltd., Tokyo, 141-0031, JAPAN

Yoshiaki Murakami
Navix Co., Ltd., Tokyo, 141-0031, JAPAN

Hiroshi Nakagawa
University of Tokyo, Tokyo, 113-0033, JAPAN

Noriko Kando
National Institute of Informatics, Tokyo, 101-8430, JAPAN

ABSTRACT
This paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them. We estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs. Since splogs often cause noise in word occurrence statistics in the blogosphere, we assume that we can efficiently (manually) collect splogs by sampling blog homepages containing keywords of a certain type on the date with its most frequent occurrence. We manually examine various features of collected blog homepages regarding whether their text content is excerpted from other sources or not, as well as whether they display affiliate advertisements or out-going links to affiliated sites. Among various informative results, it is important to note that more than half of the collected splogs are created by a very small number of spammers.

Categories and Subject Descriptors
H.3.0 [INFORMATION STORAGE AND RETRIEVAL]: General

General Terms
Reliability

Keywords
Blog analysis, splog, time series characteristics of keywords, keyword bursts

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
AIRWeb '08, April 22, 2008 Beijing, China.
Copyright 2008 ACM 978-1-60558-159-0 ...$5.00.

1. INTRODUCTION
Weblogs or blogs are regarded as a form of personal journal, or of market or product commentary. While traditional search engines continue to discover and index blogs, the blogosphere has produced custom blog search and analysis engines, systems that employ specialized information retrieval techniques.

There are several previous works and services on blog analysis systems. [13] proposed a system called blogWatcher that collects and analyzes Japanese blog articles. [6] proposed a system called BlogPulse that analyzes trends of blog articles. With respect to blog analysis services on the Internet, there are several commercial and non-commercial services such as Technorati¹, BlogPulse², kizasi.jp³, and blogWatcher⁴. With respect to multilingual blog services, Globe of Blogs⁵ provides a retrieval function of blog articles across languages. Best Blogs in Asia Directory⁶ also provides a retrieval function for Asian language blogs. Blogwise⁷ also analyzes multilingual blog articles.

As with most Internet-enabled applications, the ease of content creation and distribution makes the blogosphere spam prone [7, 1, 10, 12, 9]. Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting advertisements or raising the PageRank of target sites. [10] reported that for English blogs, around 88% of all pinging URLs (i.e., blog homepages) are splogs, which account for about 75% of all pings. Based on this estimation, as stated in [1, 11], splogs can cause problems including the degradation of information retrieval quality and the significant waste of network and storage resources. Several previous works [10, 12, 9] reported important characteristics of splogs. [12] reported characteristics of ping time series, in-degree/out-degree distributions, and typical words in splogs found in TREC⁸ Blog06 data collection. [10, 9] also reported the results of analyzing splogs in the BlogPulse data set. In the context of semi-automatically collecting web spams including splogs, [16] discuss how to collect spammer-targeted keywords to be used when collecting a large number of web spams efficiently.

Unlike those previous works, this paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them [14].
As has been often noted in the previous works, text content of splogs is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. Considering this fact, in this work, we estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs. The characteristics of a keyword to which we pay attention in this paper are whether the keyword is of public/private concern as well as the duration of people's concern to the keyword. Furthermore, since splogs often cause noise in word occurrence statistics in the blogosphere, we assume that we can efficiently collect splogs by sampling blog homepages containing keywords of a certain type on the date with its most frequent occurrence. We then manually examine various features of collected blog homepages regarding whether their text contents are excerpts from other sources or not, as well as whether they display affiliate advertisements or out-going links to affiliated sites. Among various informative results of our analysis, it is important to note that more than half of the collected splogs are created by a very small number of spammers, and hence, the analysis reported in this paper is strongly affected by the choices of those spammers when they create those splogs.

¹ http://technorati.com/
² http://www.blogpulse.com/
³ http://kizasi.jp/ (in Japanese)
⁴ http://blogwatcher.pi.titech.ac.jp/ (in Japanese)
⁵ http://www.globeofblogs.com/
⁶ http://www.misohoni.com/bba/
⁷ http://www.blogwise.com/
⁸ http://trec.nist.gov/

Table 1: Features for Characterizing Splogs and their Rates in Splog Data Set

| Feature Types | Features | Descriptions | Rate in Splogs (%) |
|---|---|---|---|
| Affiliate Features | links to affiliated sites | Blog articles (posts) contain sufficiently many out-going links to affiliated sites, except for the out-going links that the blog hosts automatically add to individual blog homepages and blog posts. | 80.5 |
| Affiliate Features | advertisement articles (posts) | Blog articles (posts) themselves contain sufficiently many advertisements, except for the advertisements that the blog hosts automatically add to individual blog homepages and blog posts. | 31.0 |
| Affiliate Features | articles (posts) with adult content | Blog articles (posts) contain adult content. | 8.1 |
| Affiliate Features | keywords with popup advertisement | Certain blog hosts have facilities of automatically adding popup advertisements to keywords. | 42.1 |
| Content Source Features | excerpt from news articles | Text content is automatically or manually excerpted from news articles. | 14.3 |
| Content Source Features | excerpt from blog articles (posts) or other web texts | Text content is automatically or manually excerpted from other blog articles (posts), or web texts other than news articles and advertisement pages. | 70.8 |
| Content Source Features | excerpt from advertisement pages | Text content is automatically or manually excerpted from certain advertisement pages. | 27.1 |
| Content Source Features | originally written texts | Spammers write original splog texts. | 2.9 |
| Content Source Features | meaningless sequence of words | Most of them are so called word salad spam text [2] and are automatically generated. | 3.6 |
| Creation Procedure Features | excerpt from other sources, selected without keyword retrieval | Text content is automatically or manually excerpted from other sources without keyword retrieval. Typical cases are excerpts from news articles or blog posts on the same date or close dates. | 12.7 |
| Creation Procedure Features | excerpt from other sources, retrieved with a keyword varying day by day | Text content is automatically or manually retrieved from other sources with a keyword varying day by day, and then excerpted. | 49.5 |
| Creation Procedure Features | excerpt from other sources, retrieved with a single keyword throughout a blog homepage | For a blog homepage, all of its text content consists of excerpts, which are automatically or manually retrieved from other sources with a single keyword throughout all of its posts. | 36.9 |
| Creation Procedure Features | keyword stuffed blog [9] | Blog articles (posts) contain lists of keywords for SEO purposes. | 11.5 |
| Creation Procedure Features | automatically generated text | Most of them are so called word salad spam text [2], which is a mixture of seemingly meaningful words that together signify nothing. In some cases, the text connects several sentences, each of which is excerpted from another source. | 4.5 |

2. PROCEDURE OF CREATING SPLOGS
Text content of splogs is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. In any case, splogs have commercial intention: they display affiliate advertisements or out-going links to affiliated sites. For this purpose, splogs are usually created by searching for up-to-date content from other sources and by excerpting it. This procedure of creating splogs can be roughly divided into the following two cases:

i) excerpting text content from news articles or blog posts on the same date or close dates without keyword retrieval,
ii) excerpting text content by retrieving it from other sources with certain keywords.

Splog posts created by the first procedure just a few days before the current date tend to contain up-to-date text content which is originally from quite recent news articles or blog posts. On the other hand, for splogs created by the second procedure, spammers usually carefully choose keywords for retrieving text content from other sources such as news articles and blog posts. They tend to choose high paying adsense⁹ keywords.

[Figure 1: Time Series Characteristics of Keyword Occurrence Statistics in Splogs / Authentic Blogs. Panels: (a) keyword with burst, (b) keyword without burst; each panel shows the time series for authentic blogs and for splogs.]

3. FEATURES FOR CHARACTERIZING SPLOGS
This section describes the features for characterizing Japanese splog homepages manually collected by the procedure of section 5.3. As we summarize in Table 1, this paper considers the following three types of features for splogs, namely, 1) affiliate features, 2) content source features, and 3) creation procedure features. For each of these three feature types, Table 1 lists several binary features, each of which denotes whether the given splog homepage has the designated characteristics or not. Here, note that features of the same type are independent of each other and hence are not necessarily disjoint. Also note that most of those features are for use in manual examination of splogs, and hence, they are not necessarily meant for detecting splogs automatically.
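
The binary, non-disjoint nature of these features can be illustrated with a minimal Python sketch. The class name `AffiliateFeatures`, the function `feature_rates`, and the four sample annotations are hypothetical and are not taken from the paper's data; the sketch only shows how per-feature rates such as those in the rightmost column of Table 1 could be computed from manual annotations.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AffiliateFeatures:
    """Hypothetical record of the four binary affiliate features of one homepage."""
    links_to_affiliated_sites: bool = False
    advertisement_articles: bool = False
    adult_content: bool = False
    popup_advertisement: bool = False


def feature_rates(annotations):
    """Percentage of annotated homepages for which each binary feature holds."""
    if not annotations:
        raise ValueError("at least one annotated homepage is required")
    total = len(annotations)
    return {
        f.name: 100.0 * sum(getattr(a, f.name) for a in annotations) / total
        for f in fields(AffiliateFeatures)
    }


# Made-up annotations for four homepages (illustration only).
sample = [
    AffiliateFeatures(links_to_affiliated_sites=True, popup_advertisement=True),
    AffiliateFeatures(links_to_affiliated_sites=True),
    AffiliateFeatures(advertisement_articles=True),
    AffiliateFeatures(links_to_affiliated_sites=True, adult_content=True),
]
rates = feature_rates(sample)
```

Because one homepage may carry several features of the same type, the rates within a feature type can sum to more than 100%, as they do for each feature type in Table 1.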

3.1 Affiliate Features
Among the three feature types, first we describe affiliate features. As introduced in [10, 9], splogs are generated with two often overlapping motives, namely, creation of fake blogs for the purpose of hosting profitable advertisement, and unjustifiably increasing the ranking of affiliated sites. Since both motives are deeply related to affiliate advertising, in this paper, we consider features of splogs regarding issues of affiliates. As the affiliate features, we manually examine the following four points:

i) whether the blog articles (posts) contain out-going links to affiliated sites,
ii) whether the blog articles (posts) themselves contain advertisements,
iii) whether blog articles (posts) contain adult content¹⁰,
iv) whether blog articles (posts) contain popup advertisements automatically added to certain keywords.

⁹ http://google.com/adsense

3.2 Content Source Features
Second, one of the important characteristics of splogs is that their text content is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. In order to estimate the mechanism of creating splogs, we manually examine the content source of splogs and classify them according to the following five features, namely, content source features:

i) excerpt from news articles,
ii) excerpt from blog articles (posts) or other web texts,
iii) excerpt from advertisement pages,
iv) originally written texts,
v) meaningless sequence of words such as word salad spam texts [2].

3.3 Creation Procedure Features
Furthermore, we estimate the procedures of searching the web for those excerpts and manually classify them according to the following five features, namely, creation procedure features:

i) excerpt from other sources, selected without keyword retrieval, where typical cases are excerpts from news articles or blog posts on the same date or close dates,
ii) excerpt from other sources, retrieved with a keyword varying day by day,
iii) excerpt from other sources, retrieved with a single keyword throughout a blog homepage,
iv) keyword stuffed blog [9],
v) automatically generated text including word salad spam texts [2].

As the creation procedure features, we distinguish two major procedures of creating splogs, i.e., a) excerpt from news articles or blog posts on the same date or close dates without keyword retrieval, and b) excerpt by retrieving texts from other sources with certain keywords. The former type corresponds to the feature i) above, while the latter corresponds to the features ii) and iii) above.

¹⁰ Adult content is among the major target genres for affiliate advertising, while other major target genres include health food and slimming products, cosmetics, and finance. We regard blogs which contain adult content as more harmful than others, and record them with an independent feature.

[Figure 2: A Keyword Map for Characterizing Keywords]

4. CHARACTERISTICS OF SPLOGS AND KEYWORDS
4.1 Time Series Characteristics of Keywords
Among the problems caused by splogs, this section discusses issues on noise in word occurrence statistics in the blogosphere. Figure 1 illustrates two typical cases of noise in time series keyword occurrence statistics, where (a) is the case of a keyword with burst, and (b) is the case of a keyword without burst. For both cases, keyword occurrences are a mixture of those from authentic blogs and splogs. Without detecting and removing splogs, it is difficult to estimate real keyword occurrence statistics only in authentic blogs. For the case of the keywords with burst, especially, it is estimated that the burst in splogs may be delayed from that in authentic blogs, because text content of splogs is mostly excerpted from other sources such as news articles and blog posts.
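
The delayed-burst effect can be illustrated with a small Python sketch. It assumes that daily keyword counts are already separated into authentic-blog and splog series, and it uses the date of most frequent occurrence as a simple stand-in for the burst date; the paper does not specify a burst detection method. The function names and the daily counts below are hypothetical.

```python
from datetime import date


def peak_date(daily_counts):
    """Date with the highest keyword count; the earliest date wins ties."""
    return min(daily_counts, key=lambda d: (-daily_counts[d], d))


def burst_delay_days(authentic, splog):
    """Days by which the splog peak lags the authentic-blog peak."""
    return (peak_date(splog) - peak_date(authentic)).days


# Made-up daily occurrence counts of one keyword (illustration only).
authentic = {
    date(2007, 7, 1): 12,
    date(2007, 7, 2): 95,
    date(2007, 7, 3): 40,
    date(2007, 7, 4): 15,
}
splog = {
    date(2007, 7, 1): 3,
    date(2007, 7, 2): 10,
    date(2007, 7, 3): 60,
    date(2007, 7, 4): 35,
}

# What is observed before splogs are removed is the sum of both series.
mixed = {d: authentic[d] + splog[d] for d in authentic}
delay = burst_delay_days(authentic, splog)
```

In this made-up example the splog series peaks one day after the authentic series, and the mixed series hides that difference, which is the kind of noise this section describes.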

4.2 Keyword Map for Characterizing Keywords
This section introduces the keyword map of Figure 2 for characterizing keywords. The vertical axis of the map denotes whether each keyword is of public/private concern, while its horizontal axis denotes the duration of people's concern to each keyword. Keywords with public concern are typically reported in news as social/political/economical issues, while those with private concern are typically issues regarding entertainment or celebrity, or high paying adsense keywords. On the other hand, keywords with short term duration include seasonal ones and those related to temporary events, while those with long term duration include organization names with a long history such as political parties and country names, or those related to permanent issues such as health and beauty.

On the map of Figure 2, 50 keywords that are balanced in their distribution on the map are placed, where the position of each keyword is determined totally by intuition. Those keywords vary in their time series characteristics of occurrence statistics, where some of them are with burst while others are not. Each of those keywords is intended to be used for retrieving blog (authentic blog and splog) homepages in the procedure of section 5.3. The major purpose of placing such various keywords onto a map like this is to simply examine the correlation between the characteristics of keywords and the rate of splogs among the blogs containing each keyword.

Table 2: Summary of Japanese Blog Data (at December 3rd, 2007, 0:00)

| # of blog homepages | # of articles | # of days | current # of articles per day |
|---|---|---|---|
| 3,591,306 | 192,699,276 | 1,355 | 196,975 |

5. ANALYZING SPLOGS BASED ON CHARACTERISTICS OF KEYWORDS
5.1 Motivations
This paper reports the results of analyzing the following three points after collecting blogs and then manually detecting splogs among them.

1. Features of splogs are manually examined according to those introduced in section 3.
2. According to the keyword map for characterizing keywords, various characteristics of keywords are manually examined, which include time series characteristics such as whether the keyword is with or without burst.
3. Based on the results of examining the above two points, we further analyze various correlations between characteristics of splogs and keywords. This analysis mainly includes the following:
   (a) correlation between the characteristics of keywords and the rate of splogs among the blogs containing each keyword. This will reveal the preference of spammers when choosing keywords.
   (b) correlation between the characteristics of keywords and the splog creation procedures.

5.2 Japanese Blog Data
For collecting the Japanese blog data, we use the system called KANSHIN [3, 4, 5] which collects blog articles (posts) written in Chinese, Japanese, Korean, and English. The system has lists of blog homepages for each language. By using these lists, the system collects RSS¹¹ and Atom feed files provided by blog homepages, extracts keywords from feed files by using morphological analysis tools, and stores keywords and articles in each database. The system uses several linguistic tools for extracting and indexing keywords from blog articles for each language. For Japanese, it uses a morphological analysis tool called Juman¹². The system provides users with functions for retrieving and analyzing articles.

Table 2 shows the summary of Japanese blog data stored in the system (checked at December 3rd, 2007). 3.6 million blog homepages and 193 million articles have been registered for Japanese since March 18th, 2004.

5.3 Procedure of the Analysis
This section gives the specific procedure of collecting and analyzing splogs based on characteristics of keywords. The rough strategy of collecting splogs here is to simply collect blog homepages (i.e., not blog posts) which contain a given keyword and then, considering the features of splogs defined in section 3, to manually judge whether each of the collected blog homepages is a splog or an authentic blog. Considering the result of a preliminary examination, we assume that, for keywords with burst, the rate of splogs among the blog homepages that contain those keywords may be higher on the burst date than on other dates. We further assume that, even for keywords without burst, the rate of splogs may be higher on the date with the most frequent occurrence in the blogosphere than on other dates. Based on these assumptions, in order to collect a sufficient number of splogs, for each keyword, we collect blog homepages containing the keyword on the date with its most frequent occurrence. Furthermore, also considering the result of a preliminary examination, we prefer blog homepages with more posts per day to those with fewer posts per day.

¹¹ Several expansions of the acronym exist, such as RDF Site Summary, Really Simple Syndication, and Rich Site Summary.
¹² http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html

[Figure 3: Blog Host Distribution in the Splog Homepage Data Set. Shares: seesaa 44%, cocolog 32%, jugem.jp 12%, ameblo 5%, livedoor 1%, yahoo 0%, goo.ne 0%, Rest 6%.]

The following list summarizes the above procedure.

1. For each of the 50 keywords in Figure 2, we collect blog homepage URLs which contain the keyword on the date with its most frequent occurrence during the year 2007.
2. Among the collected URLs, we select the topmost 50 with respect to the number of posts per day. We further randomly select 60 URLs from the rest. This amounts to 110 URLs in total, where the topmost 50 URLs are usually with more than three posts per day, while the remaining 60 URLs are with one or two posts per day.
3. For each of the collected URLs, an annotator judges whether each binary feature defined in section 3 holds or not.
4. Based on the above judgement, each URL is judged to be a splog or an authentic blog according to the following rule.
   (a) If one of the following holds for the given URL, then it is mostly¹³ a splog.
       i. The feature "originally written text" does not hold.
       ii. The feature "originally written text" holds and at least one of the features "links to affiliated sites", "advertisement articles (posts)", or "articles (posts) with adult content" holds.
   (b) Otherwise, the given URL is an authentic blog.
5. Finally, we analyze the correlation between characteristics of keywords and the distribution of features manually annotated to splogs.

¹³ By "mostly", we mean that it is usually necessary to judge by considering the contents of each blog.

Table 3: Splog Rate per Blog Host

| Blog Host | seesaa | cocolog | jugem.jp | ameblo | livedoor | goo.ne | yahoo | Rest | Total |
|---|---|---|---|---|---|---|---|---|---|
| # of Blog Homepages: Splog | 192 | 142 | 54 | 24 | 3 | 1 | 0 | 26 | 442 |
| # of Blog Homepages: Authentic Blog | 203 | 115 | 169 | 355 | 128 | 130 | 207 | 396 | 1703 |
| # of Blog Homepages: Total | 395 | 257 | 223 | 379 | 131 | 131 | 207 | 422 | 2145 |
| Splog Rate (%) | 48.6 | 55.3 | 24.2 | 6.3 | 2.3 | 0.8 | 0.0 | 6.2 | 20.6 |

Table 4: Splog Rate, Professional Spammer Rate (from professional spammer / splog), # of Professional Spammers, and Amateur Only Splog Rate (from amateur spammer / (from amateur spammer + non-splog)) (in descending order of splog rates, boldfaced: "splog rate > 10%, professional spammer rate > 50%", underlined: "amateur only splog rate 20% or more, mostly with private concern")

| Keyword | Splog Rate (%) | Professional Spammer Rate (%) | # of Professional Spammers | Amateur Only Splog Rate (%) |
|---|---|---|---|---|
| erog, adult content blog | 89.2 | 92.4 | 3 | 38.5 |
| rumor | 88.1 | 94.8 | 1 | 27.8 |
| national pension | 58.1 | 90.2 | 2 | 12.0 |
| no revision | 40.9 | 18.5 | 1 | 36.1 |
| health food | 37.4 | 58.7 | 2 | 19.8 |
| cosmetic surgery | 24.4 | 14.3 | 2 | 21.7 |
| Viagra | 22.5 | 11.1 | 1 | 20.5 |
| Darvish, a Japanese baseball player | 22.1 | 0.0 | 0 | 22.1 |
| video | 19.1 | 0.0 | 0 | 19.1 |
| Asasho-ryu, a sumo wrestler | 15.2 | 80.0 | 2 | 3.4 |
| Billy's Boot Camp | 15.1 | 0.0 | 0 | 15.1 |
| Saeko, a Japanese actress and Darvish's wife | 14.3 | 14.3 | 1 | 12.2 |
| COMSN, Inc., elderly care business company with a scandal | 6.9 | 71.4 | 2 | 2.1 |
| ZARD (a Japanese female singer, accidentally died) | 4.7 | 20.0 | 1 | 3.8 |
| China Airlines | 4.7 | 20.0 | 1 | 3.8 |
| North Korea | 2.9 | 100.0 | 1 | 0.0 |
| Wii (a video game console of Nintendo) | 2.8 | 66.7 | 1 | 1.0 |
| heat wave | 2.8 | 33.3 | 1 | 1.9 |
| "The dignity of the woman", the title of a book | 2.0 | 0.0 | 0 | 2.0 |
| a Japanese slang word for "lazy woman" | 1.8 | 50.0 | 1 | 0.9 |
| Upper House election | 0.0 | 0.0 | 0 | 0.0 |
| Democratic Party of Japan | 0.0 | 0.0 | 0 | 0.0 |
| Total | 20.5 | 61.5 | 10 | 9.0 |
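
Steps 2 and 4 of the procedure in section 5.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the names `BlogHomepage`, `sample_homepages`, and `is_mostly_splog` are hypothetical, the candidate URLs are made up, and the rule only returns the "mostly splog" default, which per footnote 13 an annotator would still confirm by reading the blog.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BlogHomepage:
    """Hypothetical record for one collected blog homepage URL."""
    url: str
    posts_per_day: float


def sample_homepages(candidates, top_n=50, random_n=60, seed=0):
    """Step 2: take the top_n URLs by posts per day, plus random_n random URLs from the rest."""
    ranked = sorted(candidates, key=lambda b: b.posts_per_day, reverse=True)
    top, rest = ranked[:top_n], ranked[top_n:]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    extra = rng.sample(rest, min(random_n, len(rest)))
    return top + extra


def is_mostly_splog(originally_written, links_to_affiliated_sites,
                    advertisement_articles, adult_content):
    """Step 4: rule (a)i, then rule (a)ii; otherwise (b) authentic blog."""
    if not originally_written:
        return True  # (a)i: the text is not originally written
    # (a)ii: original text, but at least one affiliate-related feature holds
    return bool(links_to_affiliated_sites or advertisement_articles or adult_content)


# Made-up candidates: 200 URLs with 1 to 7 posts per day (illustration only).
candidates = [
    BlogHomepage(f"http://example.com/blog{i}", float(i % 7 + 1)) for i in range(200)
]
selected = sample_homepages(candidates)
```

With the default parameters this yields 110 URLs per keyword when enough candidates exist, matching the 50 + 60 split described in step 2.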

6. PRELIMINARY RESULTS OF ANALYZING SPLOGS
This section discusses preliminary results of analyzing Japanese splogs based on characteristics of keywords, features of splogs, as well as other features which can be automatically analyzed such as blog hosts distribution. We further analyze the correlation between characteristics of keywords and the feature distribution of splogs. Here, note that the results shown below are preliminary in that they are for 22 keywords out of the 50 on the map of Figure 2.

6.1 Blog Hosts Statistics
As can be clearly seen from Figure 3, in our Japanese blog homepage data set, about 88% of splogs are from the top three hosts. Furthermore, as shown in Table 3, for the top two hosts, about half of the blog homepages are splogs¹⁴. It is estimated that those hosts with high splog rates spend less effort on manually removing splogs than those with low splog rates. As we argue in the next section, it is observed that a very small number of spammers actually create a substantial number of splog homepages on those three hosts, and this increases the splog rates of those hosts.
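
The per-host splog rates can be recomputed from the splog and authentic blog counts reported in Table 3. The following Python sketch assumes only those counts; the names `TABLE3_COUNTS` and `splog_rate` are introduced here for illustration.

```python
# (splog homepages, authentic blog homepages) per host, as reported in Table 3.
TABLE3_COUNTS = {
    "seesaa": (192, 203),
    "cocolog": (142, 115),
    "jugem.jp": (54, 169),
    "ameblo": (24, 355),
    "livedoor": (3, 128),
    "goo.ne": (1, 130),
    "yahoo": (0, 207),
    "Rest": (26, 396),
}


def splog_rate(n_splog, n_authentic):
    """Splog rate in percent: splogs / (splogs + authentic blogs)."""
    total = n_splog + n_authentic
    return 100.0 * n_splog / total if total else 0.0


rates = {host: round(splog_rate(s, a), 1) for host, (s, a) in TABLE3_COUNTS.items()}

total_splog = sum(s for s, _ in TABLE3_COUNTS.values())
total_authentic = sum(a for _, a in TABLE3_COUNTS.values())
overall_rate = round(splog_rate(total_splog, total_authentic), 1)
```

The two hosts with rates near 50% in this computation are the "top two hosts" referred to above.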

6.2 Relations between Characteristics of Keywords and Splogs
Next, for each of the 22 keywords, Table 4 gives splog rates in the blog homepages collected with the keyword, in descending order of splog rates. In the table, those 22 keywords are divided into three groups, i.e., those with splog rates higher than 30%, those with splog rates between 30% and 10%, and the rest. We further count occurrences of features of splogs in the entire splog data set, and list their rates in the splog data set in the rightmost column of Table 1. Based on this feature analysis, we examine the correlation between those splog features and the characteristics of keywords with splog rates higher than 10%.

Furthermore, we judged whether two splogs are created by an identical spammer when their HTML layouts are similar¹⁵, and then grouped those splogs from an identical spammer. In this paper, we name those spammers each of whom created more than one splog in our data set professional spammers, while we name the remaining spammers, each of whom created only one splog in our data set, amateur spammers. With this judgement, we can identify 10 professional spammers in our splog data set (summarized in Table 5), where out of the total 442 splog homepages, 272 (61.5%) can be regarded as created by those 10 professional spammers. Based on this professional/amateur spammer analysis, for each keyword, Table 4 shows the rate of splog homepages being created by one of the 10 professional spammers. Table 4 also shows the number of professional spammers observed for each keyword, as well as splog rates after removing those created by professional spammers (amateur only splog rate). Major conclusions of this analysis can be summarized as below, some of which are also noted in the map of the 22 keywords in Figure 4.

(1) The most important fact to note here is that, for four out of the five keywords with splog rate over 30%, most splog homepages are created by professional spammers. Splogs containing these four keywords actually amount to more than half of the entire splog data set. This fact is very important because the following analysis is strongly affected by the choices of those professional spammers in creating those splogs.

(2) As can be seen from the map in Figure 4, most of the keywords placed in the upper half of the map have low splog rates. This means that splogs tend to contain keywords with private concern more often than those with public concern. "National pension" and "Asasho-ryu" have exceptionally high splog rates, though these statistics are strongly affected by the choices of professional spammers. Those spammers posted splog posts on certain dates, where the splog articles are created from the excerpts of the news reports and blog posts on those dates. Those excerpts occasionally include scandal reports closely related to the two keywords.

(3) The three keywords "rumor", "erog, adult content blog", and "health food" correspond to another group of splogs created by professional spammers. In the case of these keywords, the spammers posted splog posts, where the splog articles are created from the excerpts of other blogs and advertisements, but not news articles, by retrieving them with certain keywords.

¹⁴ Due to errors in the procedure of collecting blog URLs for judging splog/authentic blog distinction, for the moment, we do not have 110 blog URLs in total for several keywords.
¹⁵ Our next plan is to employ the technique presented in [15], so that we can automatically group splog homepages into the 10 groups shown here.

Table 5: 10 Professional Spammers identified in our Splog Data Set (features of splogs as in Table 1)

| ID | # of Splogs | Affiliate | Content Source | Creation Procedure | Keywords |
|---|---|---|---|---|---|
| 1 | 115 (42.5%) | links to affiliated sites, popup advertisement | blog or other web texts | retrieved with a single keyword | rumor, no revision, cosmetic surgery, Asasho-ryu, Saeko, China Airlines, COMSN, Inc., ZARD, heat wave, Wii, North Korea, "lazy woman" |
| 2 | 56 (20.6%) | links to affiliated sites | blog or other web texts | retrieved with a keyword varying day by day | erog |
| 3 | 30 (11.0%) | links to affiliated sites | news articles, advertisement pages | selected without keyword retrieval | national pension, COMSN, Inc. |
| 4 | 26 (9.6%) | links to affiliated sites, advertisement articles, popup advertisement | blog or other web texts, advertisement pages | retrieved with a keyword varying day by day | national pension |
| 5 | 20 (7.4%) | links to affiliated sites, advertisement articles | advertisement pages | retrieved with a keyword varying day by day, keyword stuffed blog | health food |
| 6 | 10 (3.7%) | links to affiliated sites, adult content, popup advertisement | news articles, blog or other web texts | selected without keyword retrieval | erog, Asasho-ryu |
| 7-10 | 15 (5.5%) | - | - | - | erog, health food, Viagra, cosmetic surgery |
| Total | 272 | - | - | - | - |
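
The professional/amateur distinction and the rates of Table 4 can be sketched in Python as follows. The sketch assumes that splogs have already been grouped by spammer (the paper does this manually from HTML layout similarity); the function names, the `KeywordRates` tuple, the spammer labels, and the per-keyword counts in the first example are hypothetical. The second example uses the totals stated in the text (442 splogs, of which 272 are from professional spammers) and the 1703 authentic blogs of Table 3.

```python
from collections import Counter
from typing import NamedTuple


class KeywordRates(NamedTuple):
    splog_rate: float
    professional_spammer_rate: float
    amateur_only_splog_rate: float


def professional_spammers(splog_to_spammer):
    """Spammers who created more than one splog in the data set."""
    counts = Counter(splog_to_spammer.values())
    return {spammer for spammer, n in counts.items() if n > 1}


def keyword_rates(n_professional, n_amateur, n_authentic):
    """Rates in percent, following the definitions in the caption of Table 4."""
    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    n_splog = n_professional + n_amateur
    return KeywordRates(
        splog_rate=pct(n_splog, n_splog + n_authentic),
        professional_spammer_rate=pct(n_professional, n_splog),
        amateur_only_splog_rate=pct(n_amateur, n_amateur + n_authentic),
    )


# Made-up grouping of three splog URLs into layout groups (illustration only).
owners = {
    "http://example.com/a": "layout-1",
    "http://example.com/b": "layout-1",
    "http://example.com/c": "layout-2",
}
pros = professional_spammers(owners)

# Made-up counts for one keyword (illustration only).
rates = keyword_rates(n_professional=40, n_amateur=10, n_authentic=50)

# Totals from the text and Table 3.
overall = keyword_rates(n_professional=272, n_amateur=442 - 272, n_authentic=1703)
```

Removing the professional spammers' splogs from both numerator and denominator is what lets the amateur only splog rate show how attractive a keyword is to spammers in general, independent of a few prolific individuals.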

7. CONCLUSION
This paper focused on analyzing (Japanese) splogs based on various characteristics of keywords contained in them. Among various informative results of our analysis, it is important to note that more than half of the collected splogs are created by a very small number of professional spammers. Future work includes further analysis of splogs by integrating other features studied in the previous works [12, 10, 9], such as characteristic words in splogs, in-degree/out-degree distributions, and ping time series. Next, we plan to apply existing splog detection techniques [11, 8] to our splog data set, and then to develop a splog detector with high accuracy. Splogs/authentic blogs collected in this work are also useful for analyzing characteristics of keywords on a much larger scale, simply by automatically collecting a much larger number of keywords, and then measuring correlation between splogs and each keyword.

[Figure 4: Keyword Map with Splog Analysis Results]

8. REFERENCES
[1] Wikipedia, Spam blog. http://en.wikipedia.org/wiki/Spam_blog.
[2] Wikipedia, Word salad (computer science). http://en.wikipedia.org/wiki/Wordsalad%28computer_science%29.
[3] T. Fukuhara, T. Murayama, and T. Nishida. Analyzing concerns of people using Weblog articles and real world temporal data. In Proceedings of WWW 2005 2nd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2005.
[4] T. Fukuhara, H. Nakagawa, and T. Nishida. Understanding sentiment of people from news articles: Temporal sentiment analysis of social events. In Proceedings of ICWSM, pages 271–272, 2007.
[5] T. Fukuhara, T. Utsuro, and H. Nakagawa. Cross-lingual concern analysis from multilingual weblog articles. In A. Nijholt, O. Stock, and T. Nishida, editors, Proceedings of the 6th International Workshop on Social Intelligence Design, pages 55–64, 2007.
[6] N. Glance, M. Hurst, and T. Tomokiyo. Blogpulse: Automated trend discovery for Weblogs. In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004.
[7] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. 1st AIRWeb, pages 39–47, 2005.
[8] P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog identification and Splog detection. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, pages 92–99, 2006.
[9] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social media. In Tutorial at ICWSM, 2007.
[10] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proceedings of WWW 2006 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.
[11] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proc. 3rd AIRWeb, pages 1–8, 2007.
[12] C. Macdonald and I. Ounis. The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical Report TR-2006-224, University of Glasgow, Department of Computing Science, 2006.
[13] T. Nanno, T. Fujiki, Y. Suzuki, and M. Okumura. Automatically collecting, monitoring, and mining Japanese weblogs. In WWW Alt. '04: Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, pages 320–321. ACM Press, 2004.
[14] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami, H. Nakagawa, and N. Kando. Collecting and analyzing Japanese splogs based on characteristics of keywords. In Proc. ICWSM, pages 218–219, 2008.
[15] T. Urvoy, T. Lavergne, and P. Filoche. Tracking Web spam with hidden style similarity. In Proc. 2nd AIRWeb, pages 25–30, 2006.
[16] Y. Wang, M. Ma, Y. Niu, and H. Chen. Spam double-funnel: Connecting web spammers with advertisers. In Proc. 16th WWW Conf., pages 291–300, 2007.