Analysing Features of Japanese Splogs and Characteristics of Keywords

Yuuki Sato, Takehito Utsuro
University of Tsukuba, Tsukuba, 305-8573, JAPAN

Tomohiro Fukuhara
University of Tokyo, Kashiwa 277-8568, JAPAN

Yasuhide Kawada
Navix Co., Ltd., Tokyo, 141-0031, JAPAN

Yoshiaki Murakami
Navix Co., Ltd., Tokyo, 141-0031, JAPAN

Hiroshi Nakagawa
University of Tokyo, Tokyo, 113-0033, JAPAN

Noriko Kando
National Institute of Informatics, Tokyo, 101-8430, JAPAN

ABSTRACT
This paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them. We estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs. Since splogs often cause noise in word occurrence statistics in the blogosphere, we assume that we can efficiently (manually) collect splogs by sampling blog homepages containing keywords of a certain type on the date with its most frequent occurrence. We manually examine various features of collected blog homepages regarding whether their text content is excerpted from other sources or not, as well as whether they display affiliate advertisement or out-going links to affiliated sites. Among various informative results, it is important to note that more than half of the collected splogs are created by a very small number of spammers.
Categories and Subject Descriptors
H.3.0 [INFORMATION STORAGE AND RETRIEVAL]: General

General Terms
Reliability

Keywords
Blog analysis, splog, time series characteristics of keywords, keyword bursts

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
AIRWeb '08, April 22, 2008 Beijing, China.
Copyright 2008 ACM 978-1-60558-159-0 ...$5.00.

1. INTRODUCTION
Weblogs, or blogs, are regarded as personal journals and as market or product commentaries. While traditional search engines continue to discover and index blogs, the blogosphere has produced custom blog search and analysis engines, systems that employ specialized information retrieval techniques. There are several previous works and services on blog analysis systems. [13] proposed a system called blogWatcher that collects and analyzes Japanese blog articles. [6] proposed a system called BlogPulse that analyzes trends of blog articles. With respect to blog analysis services on the Internet, there are several commercial and non-commercial services such as Technorati¹, BlogPulse², kizasi.jp³, and blogWatcher⁴. With respect to multilingual blog services, Globe of Blogs⁵ provides a retrieval function of blog articles across languages. Best Blogs in Asia Directory⁶ also provides a retrieval function for Asian language blogs. Blogwise⁷ also analyzes multilingual blog articles.

As with most Internet-enabled applications, the ease of content creation and distribution makes the blogosphere spam prone [7, 1, 10, 12, 9]. Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting advertisements or raising the PageRank of target sites. [10] reported that for English blogs, around 88% of all pinging URLs (i.e., blog homepages) are splogs, which account for about 75% of all pings. Based on this estimation, as stated in [1, 11], splogs can cause problems including the degradation of information retrieval quality and the significant waste of network and storage resources. Several previous works [10, 12, 9] reported important characteristics of splogs. [12] reported characteristics of ping time series, in-degree/out-degree distributions, and typical words in splogs found in the TREC⁸ Blog06 data collection. [10, 9] also reported the results of analyzing splogs in the BlogPulse data set. In the context of semi-automatically collecting web spam including splogs, [16] discuss how to collect spammer-targeted keywords to be used when collecting a large number of web spam pages efficiently.

Unlike those previous works, this paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them [14].
As has often been noted in the previous works, the text content of splogs is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. Considering this fact, in this work, we estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs. The characteristics of a keyword to which we pay attention in this paper are whether the keyword is of public or private concern, and the duration of people's concern with the keyword. Furthermore, since splogs often cause noise in word occurrence statistics in the blogosphere, we assume that we can efficiently collect splogs by sampling blog homepages containing keywords of a certain type on the date with its most frequent occurrence. We then manually examine various features of collected blog homepages regarding whether their text contents are excerpts from other sources or not, as well as whether they display affiliate advertisement or out-going links to affiliated sites. Among various informative results of our analysis, it is important to note that more than half of the collected splogs are created by a very small number of spammers, and hence, the analysis reported in this paper is strongly affected by the choices of those spammers when they create those splogs.

¹ http://technorati.com/
² http://www.blogpulse.com/
³ http://kizasi.jp/ (in Japanese)
⁴ http://blogwatcher.pi.titech.ac.jp/ (in Japanese)
⁵ http://www.globeofblogs.com/
⁶ http://www.misohoni.com/bba/
⁷ http://www.blogwise.com/
⁸ http://trec.nist.gov/

Table 1: Features for Characterizing Splogs and their Rates in Splog Data Set

| Feature Types | Features | Descriptions | Rate in Splogs (%) |
|---|---|---|---|
| Affiliate Features | links to affiliated sites | Blog articles (posts) contain sufficiently many out-going links to affiliated sites, except for the out-going links that the blog hosts automatically add to individual blog homepages and blog posts. | 80.5 |
| Affiliate Features | advertisement articles (posts) | Blog articles (posts) themselves contain sufficiently many advertisements, except for the advertisements that the blog hosts automatically add to individual blog homepages and blog posts. | 31.0 |
| Affiliate Features | articles (posts) with adult content | Blog articles (posts) contain adult content. | 8.1 |
| Affiliate Features | keywords with popup advertisement | Certain blog hosts have facilities of automatically adding popup advertisements to keywords. | 42.1 |
| Content Source Features | excerpt from news articles | Text content is automatically or manually excerpted from news articles. | 14.3 |
| Content Source Features | excerpt from blog articles (posts) or other web texts | Text content is automatically or manually excerpted from other blog articles (posts), or web texts other than news articles and advertisement pages. | 70.8 |
| Content Source Features | excerpt from advertisement pages | Text content is automatically or manually excerpted from certain advertisement pages. | 27.1 |
| Content Source Features | originally written texts | Spammers write original splog texts. | 2.9 |
| Content Source Features | meaningless sequence of words | Most of them are so called word salad spam text [2] and are automatically generated. | 3.6 |
| Creation Procedure Features | excerpt from other sources, selected without keyword retrieval | Text content is automatically or manually excerpted from other sources without keyword retrieval. Typical cases are excerpts from news articles or blog posts on the same date or close dates. | 12.7 |
| Creation Procedure Features | excerpt from other sources, retrieved with a keyword varying day by day | Text content is automatically or manually retrieved from other sources with a keyword varying day by day, and then excerpted. | 49.5 |
| Creation Procedure Features | excerpt from other sources, retrieved with a single keyword throughout a blog homepage | For a blog homepage, all of its text content consists of excerpts, which are automatically or manually retrieved from other sources with a single keyword throughout all of its posts. | 36.9 |
| Creation Procedure Features | keyword stuffed blog [9] | Blog articles (posts) contain lists of keywords for SEO purposes. | 11.5 |
| Creation Procedure Features | automatically generated text | Most of them are so called word salad spam text [2], which is a mixture of seemingly meaningful words that together signify nothing. Sometimes, the text connects several sentences, each of which is excerpted from another source. | 4.5 |
2. PROCEDURE OF CREATING SPLOGS
Text content of splogs is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. In any case, splogs have commercial intention: they display affiliate advertisement or out-going links to affiliated sites. For this purpose, splogs are usually created by searching for up-to-date content from other sources and by excerpting it. This procedure of creating splogs can be roughly divided into the following two cases:

i) excerpting text content from news articles or blog posts on the same date or close dates without keyword retrieval,
ii) excerpting text content by retrieving it from other sources with certain keywords.

[Figure 1: Time Series Characteristics of Keyword Occurrence Statistics in Splogs / Authentic Blogs. Panels: (a) keyword with burst, (b) keyword without burst. Each panel plots the time series of authentic blogs and of splogs.]

Splog posts created by the first procedure just a few days before the current date tend to contain up-to-date text content which is originally from quite recent news articles or blog posts. On the other hand, for splogs created by the second procedure, spammers usually carefully choose keywords for retrieving text content from other sources such as news articles and blog posts. They tend to choose high paying AdSense⁹ keywords.
3. FEATURES FOR CHARACTERIZING SPLOGS
This section describes the features for characterizing Japanese splog homepages manually collected by the procedure of Section 5.3. As we summarize in Table 1, this paper considers the following three types of features for splogs, namely, 1) affiliate features, 2) content source features, and 3) creation procedure features. For each of these three feature types, Table 1 lists several binary features, each of which denotes whether the given splog homepage has the designated characteristic or not. Here, note that features of the same type are independent of each other and hence are not necessarily disjoint. Also note that most of those features are intended for manual examination of splogs, and hence are not necessarily meant for detecting splogs automatically.

3.1 Affiliate Features
Among the three feature types, first we describe affiliate features. As introduced in [10, 9], splogs are generated with two often overlapping motives, namely, creation of fake blogs for the purpose of hosting profitable advertisement, and unjustifiably increasing the ranking of affiliated sites. Since both motives are deeply related to affiliate advertising, in this paper, we consider features of splogs regarding issues of affiliates. As the affiliate features, we manually examine the following four points:

i) whether the blog articles (posts) contain out-going links to affiliated sites,
ii) whether the blog articles (posts) themselves contain advertisements,
iii) whether the blog articles (posts) contain adult content¹⁰,
iv) whether the blog articles (posts) contain popup advertisements automatically added to certain keywords.

⁹ http://google.com/adsense
¹⁰ Adult content is among the major target genres for affiliate advertising, while other major target genres include health food and slimming products, cosmetics, and finance. We regard blogs which contain adult content as more harmful than others, and record them with an independent feature.

3.2 Content Source Features
Second, one of the important characteristics of splogs is that their text content is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. In order to estimate the mechanism of creating splogs, we manually examine the content source of splogs and classify them according to the following five features, namely, content source features:

i) excerpt from news articles,
ii) excerpt from blog articles (posts) or other web texts,
iii) excerpt from advertisement pages,
iv) originally written texts,
v) meaningless sequence of words such as word salad spam texts [2].

3.3 Creation Procedure Features
Furthermore, we estimate the procedures of searching the web for those excerpts and manually classify them according to the following five features, namely, creation procedure features:

i) excerpt from other sources, selected without keyword retrieval, where typical cases are excerpts from news articles or blog posts on the same date or close dates,
ii) excerpt from other sources, retrieved with a keyword varying day by day,
iii) excerpt from other sources, retrieved with a single keyword throughout a blog homepage,
iv) keyword stuffed blog [9],
v) automatically generated text including word salad spam texts [2].

As the creation procedure features, we distinguish two major procedures of creating splogs, i.e., a) excerpt from news articles or blog posts on the same date or close dates without keyword retrieval, and b) excerpt by retrieving texts from other sources with certain keywords. The former type corresponds to the feature i) above, while the latter corresponds to the features ii) and iii) above.

[Figure 2: A Keyword Map for Characterizing Keywords]
4. CHARACTERISTICS OF SPLOGS AND KEYWORDS
4.1 Time Series Characteristics of Keywords
Among the problems caused by splogs, this section discusses noise in word occurrence statistics in the blogosphere. Figure 1 illustrates two typical cases of noise in time series keyword occurrence statistics, where (a) is the case of a keyword with burst, and (b) is the case of a keyword without burst. In both cases, keyword occurrences are a mixture of those from authentic blogs and those from splogs. Without detecting and removing splogs, it is difficult to estimate real keyword occurrence statistics only in authentic blogs. For keywords with burst, especially, it is estimated that the burst in splogs may be delayed from that in authentic blogs, because text content of splogs is mostly excerpted from other sources such as news articles and blog posts.
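The mixture of authentic-blog and splog occurrences, and the possible delay of the splog burst, can be illustrated with a minimal sketch. The daily counts below are toy values made up for illustration (they are not data from this paper), and `peak_date` and `merged` are hypothetical helper names.

```python
from datetime import date

def peak_date(daily_counts):
    """Return the date with the highest count; ties go to the earliest date."""
    return min(daily_counts, key=lambda d: (-daily_counts[d], d))

def merged(a, b):
    """Sum two daily-count series, as observed when splogs are not removed."""
    return {d: a.get(d, 0) + b.get(d, 0) for d in set(a) | set(b)}

# Toy counts of one keyword per day (assumed values, not from the paper).
authentic = {date(2007, 7, 1): 12, date(2007, 7, 2): 95, date(2007, 7, 3): 40}
splog = {date(2007, 7, 1): 5, date(2007, 7, 2): 20, date(2007, 7, 3): 70}

observed = merged(authentic, splog)

# Delay (in days) of the splog burst relative to the authentic-blog burst.
delay_days = (peak_date(splog) - peak_date(authentic)).days
```

In this toy series the splog peak falls one day after the authentic peak, while the observed (mixed) series hides that separation, which is the kind of noise discussed above.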
4.2 Keyword Map for Characterizing Keywords
This section introduces the keyword map of Figure 2 for characterizing keywords. The vertical axis of the map denotes whether each keyword is of public or private concern, while its horizontal axis denotes the duration of people's concern with each keyword. Keywords of public concern are typically reported in news as social, political, or economic issues, while those of private concern are typically issues regarding entertainment or celebrities, or high paying AdSense keywords. On the other hand, keywords with short term duration include seasonal ones and those related to temporary events, while those with long term duration include names of organizations with a long history such as political parties, country names, or keywords related to permanent issues such as health and beauty.

On the map of Figure 2, we place 50 keywords that are balanced in their distribution over the map, where the position of each keyword is determined entirely by intuition. Those keywords vary in their time series characteristics of occurrence statistics: some of them show bursts while others do not. Each of those keywords is intended to be used for retrieving blog (authentic blog and splog) homepages in the procedure of Section 5.3. The major purpose of placing such various keywords onto a map like this is simply to examine the correlation between the characteristics of keywords and the rate of splogs among the blogs containing each keyword.
Table 2: Summary of Japanese Blog Data (at December 3rd, 2007, 0:00)

| # of blog homepages | # of articles | # of days | current # of articles per day |
|---|---|---|---|
| 3,591,306 | 192,699,276 | 1,355 | 196,975 |

5. ANALYZING SPLOGS BASED ON CHARACTERISTICS OF KEYWORDS
5.1 Motivations
This paper reports the results of analyzing the following three points after collecting blogs and then manually detecting splogs among them.

1. Features of splogs are manually examined according to those introduced in Section 3.
2. According to the keyword map for characterizing keywords, various characteristics of keywords are manually examined, including time series characteristics such as whether the keyword is with or without burst.
3. Based on the results of examining the above two points, we further analyze various correlations between characteristics of splogs and keywords. This analysis mainly includes the following:
(a) correlation between the characteristics of keywords and the rate of splogs among the blogs containing each keyword. This will reveal the preference of spammers when choosing keywords.
(b) correlation between the characteristics of keywords and the splog creation procedures.
5.2 Japanese Blog Data
For collecting the Japanese blog data, we use the system called KANSHIN [3, 4, 5], which collects blog articles (posts) written in Chinese, Japanese, Korean, and English. The system has lists of blog homepages for each language. By using these lists, the system collects RSS¹¹ and Atom feed files provided by blog homepages, extracts keywords from the feed files by using morphological analysis tools, and stores keywords and articles in each database. The system uses several linguistic tools for extracting and indexing keywords from blog articles for each language. For Japanese, it uses a morphological analysis tool called Juman¹². The system provides users with functions for retrieving and analyzing articles. Table 2 shows the summary of Japanese blog data stored in the system (checked at December 3rd, 2007). 3.6 million blog homepages and 193 million articles have been registered for Japanese since March 18th, 2004.

¹¹ RSS is expanded in several ways, such as RDF Site Summary, Really Simple Syndication, or Rich Site Summary.
¹² http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html

5.3 Procedure of the Analysis
This section gives the specific procedure of collecting and analyzing splogs based on characteristics of keywords. The rough strategy of collecting splogs here is to simply collect blog homepages (i.e., not blog posts) which contain a given keyword and then, considering the features of splogs defined in Section 3, to manually judge whether each of the collected blog homepages is a splog or an authentic blog.

[Figure 3: Blog Host Distribution in the Splog Homepage Data Set. seesaa 44%, cocolog 32%, jugem.jp 12%, ameblo 5%, livedoor 1%, Rest 6%, yahoo 0%, goo.ne 0%]

Considering the result of a preliminary examination, we assume that, for keywords with burst, the rate of splogs among the blog homepages that contain those keywords may be higher on the burst date than on other dates. We further assume that, even for keywords without burst, the rate of splogs may be higher on the date with the most frequent occurrence in the blogosphere than on other dates. Based on these assumptions, in order to collect a sufficient number of splogs, for each keyword, we collect blog homepages containing the keyword on the date with its most frequent occurrence. Furthermore, also considering the result of a preliminary examination, we prefer blog homepages with more posts per day to those with fewer posts per day. The following list summarizes the above procedure.
1. For each of the 50 keywords in Figure 2, we collect blog homepage URLs which contain the keyword on the date with its most frequent occurrence during the year 2007.
2. Among the collected URLs, we select the topmost 50 with respect to the number of posts per day. We further randomly select 60 URLs from the rest. This amounts to 110 URLs in total, where the topmost 50 URLs usually have more than three posts per day, while the remaining 60 URLs have one or two posts per day.
3. For each of the collected URLs, an annotator judges whether each binary feature defined in Section 3 holds or not.
4. Based on the above judgement, each URL is judged to be a splog or an authentic blog according to the following rule.
(a) If one of the following holds for the given URL, then it is mostly¹³ a splog.
i. The feature "originally written text" does not hold.
ii. The feature "originally written text" holds and at least one of the features "links to affiliated sites", "advertisement articles (posts)", or "articles (posts) with adult content" holds.
(b) Otherwise, the given URL is an authentic blog.
5. Finally, we analyze the correlation between characteristics of keywords and the distribution of features manually annotated to splogs.

¹³ By "mostly", we mean that it is usually necessary to judge by considering the contents of each blog.

Table 3: Splog Rate per Blog Host

| Blog Host | seesaa | cocolog | jugem.jp | ameblo | livedoor | goo.ne | yahoo | Rest | Total |
|---|---|---|---|---|---|---|---|---|---|
| # of Blog Homepages: Splog | 192 | 142 | 54 | 24 | 3 | 1 | 0 | 26 | 442 |
| # of Blog Homepages: Authentic Blog | 203 | 115 | 169 | 355 | 128 | 130 | 207 | 396 | 1703 |
| # of Blog Homepages: Total | 395 | 257 | 223 | 379 | 131 | 131 | 207 | 422 | 2145 |
| Splog Rate (%) | 48.6 | 55.3 | 24.2 | 6.3 | 2.3 | 0.8 | 0.0 | 6.2 | 20.6 |

Table 4: Splog Rate, Professional Spammer Rate (from professional spammer / splog), # of Professional Spammers, and Amateur Only Splog Rate (from amateur spammer / (from amateur spammer + non-splog)) (in descending order of splog rates, boldfaced: "splog rate > 10%, professional spammer rate > 50%", underlined: "amateur only splog rate 20% or more, mostly with private concern")

| Keyword | Splog Rate (%) | Professional Spammer Rate (%) | # of Professional Spammers | Amateur Only Splog Rate (%) |
|---|---|---|---|---|
| erog, adult content blog | 89.2 | 92.4 | 3 | 38.5 |
| rumor | 88.1 | 94.8 | 1 | 27.8 |
| national pension | 58.1 | 90.2 | 2 | 12.0 |
| no revision | 40.9 | 18.5 | 1 | 36.1 |
| health food | 37.4 | 58.7 | 2 | 19.8 |
| cosmetic surgery | 24.4 | 14.3 | 2 | 21.7 |
| Viagra | 22.5 | 11.1 | 1 | 20.5 |
| Darvish, a Japanese baseball player | 22.1 | 0.0 | 0 | 22.1 |
| video | 19.1 | 0.0 | 0 | 19.1 |
| Asasho-ryu, a sumo wrestler | 15.2 | 80.0 | 2 | 3.4 |
| Billy's Boot Camp | 15.1 | 0.0 | 0 | 15.1 |
| Saeko, a Japanese actress and Darvish's wife | 14.3 | 14.3 | 1 | 12.2 |
| COMSN, Inc., elderly care business company with a scandal | 6.9 | 71.4 | 2 | 2.1 |
| ZARD (a Japanese female singer, accidentally died) | 4.7 | 20.0 | 1 | 3.8 |
| China Airlines | 4.7 | 20.0 | 1 | 3.8 |
| North Korea | 2.9 | 100.0 | 1 | 0.0 |
| Wii (a video game console of Nintendo) | 2.8 | 66.7 | 1 | 1.0 |
| heat wave | 2.8 | 33.3 | 1 | 1.9 |
| "The dignity of the woman", the title of a book | 2.0 | 0.0 | 0 | 2.0 |
| a Japanese slang word for "lazy woman" | 1.8 | 50.0 | 1 | 0.9 |
| Upper House election | 0.0 | 0.0 | 0 | 0.0 |
| Democratic Party of Japan | 0.0 | 0.0 | 0 | 0.0 |
| Total | 20.5 | 61.5 | 10 | 9.0 |
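Steps 2 and 4 of the procedure above can be sketched as follows. This is only an illustration under stated assumptions: the `Blog` record, its field names, and the fixed random seed are hypothetical, and in the paper the feature values come from a human annotator, with the rule's "mostly" leaving room for the annotator to override the default label.

```python
import random
from dataclasses import dataclass

@dataclass
class Blog:
    # Hypothetical record; the feature flags stand for manual annotations.
    url: str
    posts_per_day: float
    originally_written: bool = False
    links_to_affiliated_sites: bool = False
    advertisement_articles: bool = False
    adult_content: bool = False

def select_urls(blogs, top_n=50, random_n=60, seed=0):
    """Step 2: the top_n blogs by posts per day, plus random_n sampled from the rest."""
    ranked = sorted(blogs, key=lambda b: b.posts_per_day, reverse=True)
    top, rest = ranked[:top_n], ranked[top_n:]
    rng = random.Random(seed)  # seed is an assumption, used for repeatability
    return top + rng.sample(rest, min(random_n, len(rest)))

def judge(blog):
    """Step 4: default label from the rule; an annotator may still override it."""
    if not blog.originally_written:
        return "splog"  # rule (a) i
    if (blog.links_to_affiliated_sites
            or blog.advertisement_articles
            or blog.adult_content):
        return "splog"  # rule (a) ii
    return "authentic"  # rule (b)
```

For example, `judge(Blog("http://example.com/", 1.0, originally_written=True))` yields `"authentic"`, since the text is original and no affiliate feature holds.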
6. PRELIMINARY RESULTS OF ANALYZING SPLOGS
This section discusses preliminary results of analyzing Japanese splogs based on characteristics of keywords, features of splogs, as well as other features which can be automatically analyzed, such as the distribution of blog hosts. We further analyze the correlation between characteristics of keywords and the feature distribution of splogs. Here, note that the results shown below are preliminary in that they are for 22 keywords out of the 50 on the map of Figure 2.

6.1 Blog Hosts Statistics
As can be clearly seen from Figure 3, in our Japanese blog homepage data set, about 88% of splogs are from the top three hosts. Furthermore, as shown in Table 3, for the top two hosts, about half of the blog homepages are splogs¹⁴. It is estimated that those hosts with high splog rates spend less effort on manually removing splogs than those with low splog rates. As we argue in the next section, it is observed that a very small number of spammers actually create a substantial number of splog homepages on those three hosts, and this increases the splog rates of those hosts.
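The splog rates per blog host follow directly from the splog and authentic blog counts reported in Table 3. The sketch below recomputes them; the counts are taken from Table 3, while the function and variable names are illustrative choices.

```python
# Counts per blog host as reported in Table 3.
hosts = ["seesaa", "cocolog", "jugem.jp", "ameblo",
         "livedoor", "goo.ne", "yahoo", "Rest"]
splogs = [192, 142, 54, 24, 3, 1, 0, 26]
authentic = [203, 115, 169, 355, 128, 130, 207, 396]

def splog_rate(n_splog, n_authentic):
    """Splog rate in percent: splogs / (splogs + authentic blogs)."""
    return round(100.0 * n_splog / (n_splog + n_authentic), 1)

rates = {h: splog_rate(s, a) for h, s, a in zip(hosts, splogs, authentic)}
total_rate = splog_rate(sum(splogs), sum(authentic))
```

With these counts, `rates["seesaa"]` is 48.6, `rates["cocolog"]` is 55.3, and `total_rate` is 20.6, matching the last row of Table 3.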
6.2 Relations between Characteristics of Keywords and Splogs
Next, for each of the 22 keywords, Table 4 gives splog rates in the blog homepages collected with the keyword, in descending order of splog rates. In the table, those 22 keywords are divided into three groups, i.e., those with splog rates higher than 30%, those with splog rates between 30% and 10%, and the rest. We further count occurrences of features of splogs in the entire splog data set, and list their rates in the splog data set in the rightmost column of Table 1. Based on this feature analysis, we examine the correlation between those splog features and the characteristics of keywords with splog rates higher than 10%.

Furthermore, we judged two splogs to be created by an identical spammer when their HTML layouts are similar¹⁵, and then grouped those splogs from an identical spammer. In this paper, we call spammers who each created more than one splog in our data set professional spammers, and the remaining spammers, who each created only one splog in our data set, amateur spammers. With this judgement, we can identify 10 professional spammers in our splog data set (summarized in Table 5), where out of the total 442 splog homepages, 272 (61.5%) can be regarded as created by those 10 professional spammers. Based on this professional/amateur spammer analysis, for each keyword, Table 4 shows the rate of splog homepages created by one of the 10 professional spammers. Table 4 also shows the number of professional spammers observed for each keyword, as well as splog rates after removing those created by professional spammers (amateur only splog rate).

Table 5: 10 Professional Spammers identified in our Splog Data Set (the Affiliate, Content Source, and Creation Procedure columns are the features of splogs in Table 1)

| ID | # of Splogs | Affiliate | Content Source | Creation Procedure | Keywords |
|---|---|---|---|---|---|
| 1 | 115 (42.5%) | links to affiliated sites, popup advertisement | blog or other web texts | retrieved with a single keyword | rumor, no revision, cosmetic surgery, Asasho-ryu, Saeko, China Airlines, COMSN, Inc., ZARD, heat wave, Wii, North Korea, "lazy woman" |
| 2 | 56 (20.6%) | links to affiliated sites | blog or other web texts | retrieved with a keyword varying day by day | erog |
| 3 | 30 (11.0%) | links to affiliated sites | news articles, advertisement pages | selected without keyword retrieval | national pension, COMSN, Inc. |
| 4 | 26 (9.6%) | links to affiliated sites, advertisement articles, popup advertisement | blog or other web texts, advertisement pages | retrieved with a keyword varying day by day | national pension |
| 5 | 20 (7.4%) | links to affiliated sites, advertisement articles | advertisement pages | retrieved with a keyword varying day by day, keyword stuffed blog | health food |
| 6 | 10 (3.7%) | links to affiliated sites, adult content, popup advertisement | news articles, blog or other web texts | selected without keyword retrieval | erog, Asasho-ryu |
| 7-10 | 15 (5.5%) | - | - | - | erog, health food, Viagra, cosmetic surgery |
| Total | 272 | - | - | - | - |

Major conclusions of this analysis can be summarized as below, some of which are also noted in the map of the 22 keywords in Figure 4.

(1) The most important fact to note here is that, for four out of the five keywords with splog rate over 30%, most splog homepages are created by professional spammers. Splogs containing these four keywords actually amount to more than half of the entire splog data set. This fact is very important because the following analysis is strongly affected by the choices of those professional spammers in creating those splogs.

(2) As can be seen from the map in Figure 4, most of the keywords placed in the upper half of the map have low splog rates. This means that splogs tend to contain keywords of private concern more often than those of public concern. "National pension" and "Asasho-ryu" have exceptionally high splog rates, though these statistics are strongly affected by the choices of professional spammers. Those spammers posted splog posts on certain dates, where the splog articles were created from excerpts of the news reports and blog posts on those dates. Those excerpts occasionally include scandal reports closely related to the two keywords.

(3) The three keywords "rumor", "erog, adult content blog", and "health food" correspond to another group of splogs created by professional spammers. In the case of these keywords, the spammers posted splog posts where the splog articles were created from excerpts of other blogs and advertisements, but not news articles, by retrieving them with certain keywords.

¹⁴ Due to errors in the procedure of collecting blog URLs for judging the splog/authentic blog distinction, for the moment, we do not have 110 blog URLs in total for several keywords.
¹⁵ Our next plan is to employ the technique presented in [15], so that we can automatically group splog homepages into the 10 groups shown here.
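The two derived rates used in Table 4 can be written out as a short sketch, following the definitions in the table caption (professional spammer rate = from professional spammer / splog; amateur only splog rate = from amateur spammer / (from amateur spammer + non-splog)). The function names are illustrative, and since per-keyword counts are not listed in the paper, only the overall professional spammer rate (272 of 442 splogs) is checked against a reported figure; the second call uses toy counts.

```python
def professional_spammer_rate(professional_splogs, all_splogs):
    """Share (%) of splogs attributed to professional spammers."""
    return round(100.0 * professional_splogs / all_splogs, 1)

def amateur_only_splog_rate(all_splogs, professional_splogs, authentic_blogs):
    """Splog rate (%) after removing splogs created by professional spammers."""
    amateur_splogs = all_splogs - professional_splogs
    return round(100.0 * amateur_splogs / (amateur_splogs + authentic_blogs), 1)

# Overall figure reported in the text: 272 of 442 splogs.
overall = professional_spammer_rate(272, 442)

# Toy counts (assumed): 10 splogs, 4 of them professional, 14 authentic blogs.
toy = amateur_only_splog_rate(10, 4, 14)
```

Here `overall` evaluates to 61.5, in line with the figure given in the text, and `toy` evaluates to 30.0 (6 amateur splogs out of 20 remaining homepages).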
7. CONCLUSION
This paper focused on analyzing (Japanese) splogs based on various characteristics of keywords contained in them. Among various informative results of our analysis, it is important to note that more than half of the collected splogs are created by a very small number of professional spammers. Future work includes further analysis of splogs by integrating other features studied in the previous works [12, 10, 9], such as characteristic words in splogs, in-degree/out-degree distributions, and ping time series. Next, we plan to apply existing splog detection techniques [11, 8] to our splog data set, and then to develop a splog detector with high accuracy. Splogs/authentic blogs collected in this work are also useful for analyzing characteristics of keywords on a much larger scale, simply by automatically collecting a much larger number of keywords, and then measuring the correlation between splogs and each keyword.

[Figure 4: Keyword Map with Splog Analysis Results]
8. REFERENCES
[1] Wikipedia, Spam blog. http://en.wikipedia.org/wiki/Spam_blog.
[2] Wikipedia, Word salad (computer science). http://en.wikipedia.org/wiki/Wordsalad%28computer_science%29.
[3] T. Fukuhara, T. Murayama, and T. Nishida. Analyzing concerns of people using Weblog articles and real world temporal data. In Proceedings of WWW 2005 2nd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2005.
[4] T. Fukuhara, H. Nakagawa, and T. Nishida. Understanding sentiment of people from news articles: Temporal sentiment analysis of social events. In Proceedings of ICWSM, pages 271–272, 2007.
[5] T. Fukuhara, T. Utsuro, and H. Nakagawa. Cross-lingual concern analysis from multilingual weblog articles. In A. Nijholt, O. Stock, and T. Nishida, editors, Proceedings of the 6th International Workshop on Social Intelligence Design, pages 55–64, 2007.
[6] N. Glance, M. Hurst, and T. Tomokiyo. Blogpulse: Automated trend discovery for Weblogs. In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004.
[7] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. 1st AIRWeb, pages 39–47, 2005.
[8] P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog identification and Splog detection. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, pages 92–99, 2006.
[9] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social media. In Tutorial at ICWSM, 2007.
[10] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proceedings of WWW 2006 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.
[11] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proc. 3rd AIRWeb, pages 1–8, 2007.
[12] C. Macdonald and I. Ounis. The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical Report TR-2006-224, University of Glasgow, Department of Computing Science, 2006.
[13] T. Nanno, T. Fujiki, Y. Suzuki, and M. Okumura. Automatically collecting, monitoring, and mining Japanese weblogs. In WWW Alt. '04: Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, pages 320–321. ACM Press, 2004.
[14] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami, H. Nakagawa, and N. Kando. Collecting and analyzing Japanese splogs based on characteristics of keywords. In Proc. ICWSM, pages 218–219, 2008.
[15] T. Urvoy, T. Lavergne, and P. Filoche. Tracking Web spam with hidden style similarity. In Proc. 2nd AIRWeb, pages 25–30, 2006.
[16] Y. Wang, M. Ma, Y. Niu, and H. Chen. Spam double-funnel: Connecting web spammers with advertisers. In Proc. 16th WWW Conf., pages 291–300, 2007.
