Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?

Hany M. SalahEldeen and Michael L. Nelson
Old Dominion University, Department of Computer Science, Norfolk VA, 23529, USA
{hany,mln}@cs.odu.edu

Abstract. Social media content has grown exponentially in the recent years and the role of social media has evolved from just narrating life events to actually shaping them. In this paper we explore how many resources shared in social media are still available on the live web or in public web archives. By analyzing six different event-centric datasets of resources shared in social media in the period from June 2009 to March 2012, we found about 11% lost and 20% archived after just a year and an average of 27% lost and 41% archived after two and a half years. Furthermore, we found a nearly linear relationship between time of sharing of the resource and the percentage lost, with a slightly less linear relationship between time of sharing and archiving coverage of the resource. From this model we conclude that after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day.
Keywords: Web Archiving, Social Media, Digital Preservation

1 Introduction

With more than 845 million Facebook users at the end of 2011 [5] and over 140 million tweets sent daily in 2011 [16], users can take photos, videos, post their opinions, and report incidents as they happen. Many of the posts and tweets are about quotidian events and their preservation is debatable. However, some of the posts and events are about culturally important events whose preservation is less controversial. In this paper we shed light on the importance of archiving social media content about these events and estimate how much of this content is archived, still available, or lost with no possibility of recovery.

To emphasize the culturally important commentary and sharing, we collected data about six events in the time period of June 2009 to March 2012: the H1N1 virus outbreak, Michael Jackson's death, the Iranian elections and protests, Barack Obama's Nobel Peace Prize, the Egyptian revolution, and the Syrian uprising.
arXiv:1209.3026v1 [cs.DL] 13 Sep 2012

2 Related Work

To our knowledge, no prior study has analyzed the amount of shared resources in social media lost through time.
There have been many studies analyzing the behavior of users within a social network, how they interact, and what content they share [3, 19, 20, 23]. As for Twitter, Kwak et al. [6] studied its nature and its topological characteristics and found a deviation from known characteristics of human social networks that were analyzed by Newman and Park [10]. Lee et al. analyzed the reasons behind sharing news in social media and found that informativeness was the strongest motivation in predicting news sharing intention, followed by socializing and status seeking [4]. Also, content shared in social media like Twitter moves and diffuses relatively fast, as stated by Yang et al. [22].

Furthermore, many concerns were raised about the persistence of shared resources and web content in general. Nelson and Allen studied the persistence of objects in a digital library and found that, within just over a year, 3% of the sample they collected appeared to no longer be available [9]. Sanderson et al. analyzed the persistence and availability of web resources referenced from papers in scholarly repositories using Memento and found that 28% of these resources have been lost [14]. Memento [17] is a collection of HTTP extensions that enables uniform, inter-archive access. Ainsworth et al. [1] examined how much of the web is archived and found it ranges from 16% to 79%, depending on the starting seed URIs. McCown et al. examined the factors affecting reconstructing websites (using caches and archives) and found that PageRank, Age, and the number of hops from the top-level of the site were most influential [8].

3 Data Gathering

We compiled a list of URIs that were shared in social media and correspond to specific culturally important events. In this section we describe the data acquisition and sampling process we performed to extract six different datasets which will be tested and analyzed in the following sections.

3.1 Stanford SNAP Project Dataset

The Stanford Large Network Dataset is a collection of about 50 large network datasets having millions of nodes, edges and tuples. It was collected as a part of the Stanford Network Analysis Platform (SNAP) project [15]. It includes social networks, web graphs, road networks, Internet networks, citation networks, collaboration networks, and communication networks. For the purpose of our investigation, we selected their Twitter posts dataset. This dataset was collected from June 1st, 2009 to December 31st, 2009 and contains nearly 476 million tweets posted by nearly 17 million users. The dataset is estimated to cover 20%-30% of all posts published on Twitter during that time frame [21]. To select which events will be covered in this study, we examined CNN's 2009 events timeline (footnote 1). We wanted to select a small number of events that were diverse, with limited overlap, and relatively important to a large number of people. Given that, we selected four events: the H1N1 virus outbreak, the Iranian protests and elections, Michael Jackson's death, and Barack Obama's Nobel Peace Prize award.

Preparation: A tweet is typically composed of text, hashtags, embedded resources or URIs and user tags, all spanning a maximum of 140 characters. Here is an example of a tweet record in the SNAP dataset:

    T 2009-07-31 23:57:18
    U http://Twitter.com/nickgotch
    W RT @rockingjude: December 21, 2009 Depopulation by Food Will Begin http://is.gd/1WMZb WHOA.. BETTER WATCH RT plz #pwa #tcot

The line starting with the letter T indicates the date and time of the tweet creation, while the line starting with U shows a link to the user who authored this particular tweet. Finally, the line starting with W shows the entire tweet including all the user-references "@rockingjude", the embedded URIs "http://is.gd/1WMZb", and hashtags "#pwa #tcot".
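
As an illustration of the record layout described above, the following minimal sketch splits one T/U/W record into its fields and pulls out the hashtags, user references and embedded URIs. The function name `parse_record` and the regular expressions are assumptions made for this example; they are not part of the SNAP tooling.

```python
import re
from datetime import datetime

# Simple patterns for the three kinds of tokens mentioned in the text.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
URI_RE = re.compile(r"https?://\S+")


def parse_record(lines):
    """Parse one T/U/W record (hypothetical helper) into a dictionary."""
    fields = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # split on the first whitespace run
        if len(parts) == 2 and parts[0] in ("T", "U", "W"):
            fields[parts[0]] = parts[1]
    text = fields.get("W", "")
    return {
        "created": datetime.strptime(fields["T"], "%Y-%m-%d %H:%M:%S"),
        "user": fields["U"],
        "text": text,
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        "mentions": MENTION_RE.findall(text),
        "uris": URI_RE.findall(text),
    }


record = parse_record([
    "T 2009-07-31 23:57:18",
    "U http://Twitter.com/nickgotch",
    "W RT @rockingjude: December 21, 2009 Depopulation by Food Will Begin "
    "http://is.gd/1WMZb WHOA.. BETTER WATCH RT plz #pwa #tcot",
])
print(record["hashtags"], record["uris"])
```

Only tweets whose `uris` list is non-empty would be of interest for the study, since the embedded resource rather than the tweet is what gets tested later.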

Tag Expansion: We wanted to select tweets that we can say with high confidence are about a selected event. In this case, precision is more important than recall, as collecting every single tweet published about a certain event is less important than making sure that the selected tweets are definitely about that event. Several studies focused on estimating the aboutness of a certain web page or a resource in general [12, 18]. Fortunately in Twitter, hashtags incorporated within a tweet can help us estimate their "aboutness". Users normally add certain hashtags to their tweets to ease the search and discoverability in following a certain topic. These hashtags will be utilized in the event-centric filtration process.

For each event, we selected initial tags that describe it (Table 1). Those initial tags were derived empirically after examining some event-related tweets. Next we extracted all the hashtags that co-occurred with our initial set of hashtags. For example, in class H1N1 we extracted all the other hashtags that appeared along with #h1n1 within the same tweet and kept count of their frequency. Those extracted hashtags were sorted in descending order of the frequency of their appearance in tweets. We removed all the general scope tags like #cnn, #health, #death, #war and others. In regards to aboutness, removing general tags will indeed decrease recall but will increase precision. Finally we picked the top 8-10 hashtags to represent this event-class and be utilized in the filtration process. Table 1 shows the final set of tags selected for each class.
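
The tag expansion step can be sketched as a co-occurrence count. This is a minimal illustration under stated assumptions: the function name `cooccurring_tags`, its parameters, and the toy tweets are hypothetical, and the list of general-scope tags has to be supplied by hand, as in the text.

```python
from collections import Counter


def cooccurring_tags(tweets, seed, general_tags=frozenset(), top_n=10):
    """Count hashtags appearing in the same tweet as `seed`.

    `tweets` is an iterable of hashtag lists, one list per tweet.
    General-scope tags are dropped before the top `top_n` are returned.
    """
    counts = Counter()
    for tags in tweets:
        tags = set(tags)
        if seed in tags:
            counts.update(tags - {seed})
    for tag in general_tags:
        counts.pop(tag, None)
    return counts.most_common(top_n)  # sorted by descending frequency


# Toy data for illustration only; these are not tweets from the dataset.
tweets = [
    ["h1n1", "swine", "swineflu"],
    ["h1n1", "swine", "health"],
    ["h1n1", "pandemic"],
    ["flu", "sick"],
]
expanded = cooccurring_tags(tweets, "h1n1", general_tags={"health"}, top_n=3)
print(expanded)
```

Dropping `health` here mirrors the trade-off stated above: some relevant tweets are no longer reachable through that tag, but the tags that remain are more specific to the event.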

Event | Initial hashtags | Top co-occurring hashtags
H1N1 Outbreak | 'h1n1' = 61,351 | 'swine' = 61,829, 'swineflu' = 56,419, 'flu' = 8,436, 'pandemic' = 6,839, 'influenza' = 1,725, 'grippe' = 1,559, 'tamiflu' = 331
M. Jackson's Death | 'michaeljackson' = 22,934 | 'michael' = 27,075, 'mj' = 18,584, 'thisisit' = 8,770, 'rip' = 3,559, 'jacko' = 3,325, 'kingofpop' = 2,888, 'jackson' = 2,559, 'thriller' = 1,357, 'thankyoumichael' = 1,050
Iranian Elections | 'iranelection' = 911,808 | 'iran' = 949,641, 'gr88' = 197,113, 'tehran' = 109,006, 'freeiran' = 13,378, 'neda' = 191,067, 'mousavi' = 16,587, 'united4iran' = 9,198, 'iranrevolution' = 7,295
Obama's Nobel Prize | 'obama' = 48,161 & 'nobel' = 2,261 | 'obamanobel' = 14, 'nobelprize', 'nobelpeace' = 113, 'peace' = 3,721, 'barack' = 1,292, 'nobelpeaceprize' = 107

Table 1. Twitter hashtags generated for filtering and their frequency of occurring

Footnote 1: http://www.cnn.com/2009/US/12/16/year.timeline/index.html

Tweet Filtration: In the previous step we extracted the tags that will help us classify and filter tweets in the dataset according to each event. This filtration process aims to extract a reasonable sized dataset of tweets for each event and to minimize the inter-event overlap.
Since the life and persistence of the tweet itself is not the focus of this study but rather the associated resource that appears in the tweet (image, video, shortened URI or other embedded resource), we will extract only the tweets that contain an embedded resource. This step resulted in 181 million tweets with embedded resources (http://is.gd/1WMZb in the prior example). These tweets were further filtered to keep only the tweets that have at least one of the expanded tags obtained from Table 1. The number of tweets after this phase reached 1.1 million tweets.

Filtering the tweets based only on the occurrence of at least one of the hashtags is undesirable as it will cause two problems. First, it will introduce possible event overlap due to general tweets talking about two or more topics. Second, using only the single occurrence of these tags will yield a huge number of tweets, and we need to reduce this to a more manageable size. Intuitively speaking, strongly related hashtags will co-occur often. For example, a tweet that has #h1n1 along with #swineflu and #pandemic is more likely to be about the H1N1 outbreak than a tweet having just the tag #flu or just #sick. Filtering with this co-occurrence will in turn solve both problems: by increasing relevance to a particular event, general tweets that talk about several events will be filtered out, thus diminishing the overlap, and in turn this will reduce the size of the dataset.

Next, we increase the precision of the tweets associated with each event from the set of 1.1 million tweets. In the first iteration we selected the tag that had the highest frequency of co-occurrence in the dataset with the initial tag and added it to a set we will call the selection set. After that we check the co-occurrence of all the remaining extracted tags with the tag in the selection set and record the frequencies of co-occurrence. After sorting the frequencies of co-occurrence with the tag from the selection set, we pick the highest one and add it to the selection set. We repeat this step of counting co-occurrences, but with all the previously extracted hashtags in the selection set from previous iterations.

To elaborate, for H1N1 assume that the hashtag '#h1n1' had the highest frequency of appearance in the dataset, so we add it to the selection set. In the next iteration we record how many times each tag in the list appeared along with '#h1n1' in the same tweet. If we selected '#swine' as the one with the highest frequency of occurrence with the initial tag '#h1n1', we add it to the selection list, and in the next iteration we record the frequency of occurrence of the remaining hashtags with both of the extracted tags '#h1n1' and '#swine'. We repeat this step, for each event, to the point where we have a dataset of manageable size and we are confident in its 'aboutness' in relation to the event.
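
The iteration described above can be sketched as a greedy loop: at each step, add the candidate tag that co-occurs most often with everything already in the selection set, and keep only the tweets containing all selected tags. The function name, the `max_size` stopping threshold and the alphabetical tie-break are assumptions for this sketch; the text only says the loop stops once the dataset is of manageable size.

```python
from collections import Counter


def narrow_by_cooccurrence(tweets, initial_tag, candidates, max_size):
    """Greedy selection-set filtering (illustrative sketch).

    `tweets` is a list of hashtag sets. Returns the ordered selection set
    and the tweets that contain every tag in it.
    """
    selection = [initial_tag]
    remaining = [tags for tags in tweets if initial_tag in tags]
    candidates = set(candidates) - {initial_tag}
    while len(remaining) > max_size and candidates:
        counts = Counter()
        for tags in remaining:
            counts.update(tags & candidates)
        if not counts:
            break  # no candidate co-occurs with the current selection set
        # Highest co-occurrence wins; ties are broken by tag name.
        best, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        selection.append(best)
        candidates.discard(best)
        remaining = [tags for tags in remaining if best in tags]
    return selection, remaining


# Toy tweets for illustration only.
tweets = [
    {"h1n1", "swine", "swineflu", "pandemic"},
    {"h1n1", "swine", "swineflu"},
    {"h1n1", "swine"},
    {"h1n1", "flu"},
    {"flu", "sick"},
]
selection, kept = narrow_by_cooccurrence(
    tweets, "h1n1", {"swine", "swineflu", "pandemic", "flu"}, max_size=2
)
print(selection, len(kept))
```

Each pass through the loop corresponds to one row of a filtration table of the kind shown in Table 2: one more tag in the conjunction and a smaller number of matching tweets.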

Event | Hashtags selected for filtration | Tweets extracted | Operation performed | Final tweets
MJ | michael | 27,075 | |
MJ | michael & michaeljackson | 22,934 | Sample 10% | 2,293
Iran | iran | 949,641 | |
Iran | iran & iranelection | 911,808 | |
Iran | iran & iranelection & gr88 | 189,757 | |
Iran | iran & iranelection & gr88 & neda | 91,815 | |
Iran | iran & iranelection & gr88 & neda & tehran | 34,294 | Sample 10% | 3,429
H1N1 | h1n1 | 61,351 | |
H1N1 | h1n1 & swine | 44,972 | |
H1N1 | h1n1 & swine & swineflu | 42,574 | |
H1N1 | h1n1 & swine & swineflu & pandemic | 5,517 | Take all | 5,517
Obama | obama | 48,161 | |
Obama | obama & nobel | 1,118 | Take all | 1,118

Table 2. Tweet filtration iterations and final tweet collections

Two problems appeared from this approach with the Iran and Michael Jackson datasets. In the Iran dataset the number of tweets was in the hundreds of thousands, and even with 5 co-occurring tags it was still about 34K+ tweets. To solve this we performed a random sampling from those resulting tweets to take only 10% of them, resulting in a smaller, manageable dataset. The second problem arose with the Michael Jackson dataset: upon using 5 tags to decrease it to a manageable size, we realized there were few unique domains for the embedded resources. A closer look revealed this combination of tags was mostly border-line tweet spam (MJ ringtones). To solve this we used only the two top tags "#michael" and "#michaeljackson", and then we randomly sampled 10% of the resulting tweets to reach the desired dataset size (Table 2).
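
The 10% sampling step can be reproduced with a seeded random sample. The sketch below is an assumption about how such a sample could be drawn (the helper name, the fixed seed and the rounding down are choices made here); it does show that rounding 10% of 22,934 and 34,294 down gives the 2,293 and 3,429 final tweets listed in Table 2.

```python
import random


def sample_fraction(items, fraction=0.10, seed=0):
    """Return a random sample holding `fraction` of `items`, rounded down."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = int(len(items) * fraction)
    return rng.sample(items, k)


# Stand-ins for tweet identifiers; the real tweets are not reproduced here.
mj_sample = sample_fraction(list(range(22934)))
iran_sample = sample_fraction(list(range(34294)))
print(len(mj_sample), len(iran_sample))
```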

3.2 Egyptian Revolution Dataset

The one year anniversary of this event was the original motivation for this study [13]. In this case, we started with an event and then tried to get social media content describing it. Despite its ubiquity, gathering social media for a past event is surprisingly hard. We picked the Egyptian revolution due to the role of the social media in curating and driving the incidents that led to the resignation of the president. Several initiatives were commenced to collect and curate the social media content during the revolution, like R-Shief.org (footnote 2), which specializes in social content analysis of the issues in the Arab world by using aggregate data from Twitter and the Web. We are currently in the process of obtaining the millions of records related to the Arab Spring of 2011. Meanwhile, we decided to build our own dataset manually.

Footnote 2: http://www.r-shief.org/

There are several sites that curate resources about the Egyptian Revolution and we want to investigate as many of them as possible. At the same time, we need to diversify our resources and the types of digital artifacts that are embedded in them. Tweets, videos, images, embedded links, entire web pages and books were included in our investigation. For the sake of consistency, we limited our analysis to resources created within the period from the 20th of January 2011 to the 1st of March 2011. In the next subsections we explain each of the resources we utilized in our data acquisition in detail.

Storify: Storify is a website that enables users to create stories by creating collections of URIs (e.g., tweets, images, videos, links) and arranging them temporally. These entries are posted by reference to their host websites. Thus, adding content to Storify does not necessarily mean it is archived. If a user added a video from YouTube and after a while the publisher of that video decided to remove it from YouTube, the user is left with a gap in their Storify entry. For this purpose we gathered all the Storify entries that were created between the 20th of January 2011 and the 1st of March 2011, resulting in 219 unique resources.

IAmJan25: Some entire websites were dedicated as a collection hub of media to curate the revolution. Based on public contributions, those websites collect different types of media, classify them, order them chronologically and publish them to the public. We picked a website named IAmJan25.com, as an example of these websites, to analyze and investigate. The administrators of the website received selected videos and images for notable events and actions that happened during the revolution. Those images and videos were selected by users, who vouched for them to be of some importance and sent the resource's URI to the website administrators. The website itself is divided into two collections: a video collection and an image collection. The video collection had 2387 unique URIs while the image collection had 3525 unique URIs.

Tweets From Tahrir: Several books were published in 2011 documenting the revolution and the Arab Spring. To bridge the gap between books and digital media we analyzed a book entitled Tweets from Tahrir [11], which was published on April 21st, 2011. As the name states, this book tells a story formed by tweets of people during the revolution and the clashes with the past regime. We analyzed this book as a collection of tweets that had the luxury of a paperback preservation and focused on the tweeted media, in this case images. The book had a total of 1118 tweets having 23 unique images.

3.3 Syria Dataset

This dataset has been selected to represent a current (March 2012) event. Using the Twitter search API, we followed the same pattern of data acquisition as in section 3.1. We started with one hashtag, #Syria, and expanded it. Table 3 shows the tags produced from the tag expansion step. After that, each of those tags was input into a process utilizing the Twitter streaming API, which produced the first 1000 results matching each tag. From this set, we randomly sampled 10%. As a result, 1955 tweets were extracted, each having one or more embedded resources and tags from the expanded tags in Table 3.

Initial hashtags | Extracted hashtags
'Syria' | 'Bashar', 'RiseDamascus', 'GenocideInSyria', 'STOPASSAD2012', 'AssadCrimes', 'Assad'

Table 3. Twitter #Tags generated for filtering the Syrian uprising

Table 4 shows the resources collected along with the top level domains that those resources belong to for each event.

Event | Top domains (number of resources found)
MJ | youtube (110), twitpic (45), latimes (43), cnn (30), amazon (30)
Iran | youtube (385), twitpic (36), blogspot (30), roozonline (29)
H1N1 | rhizalabs (676), reuters (17), google (16), flutrackers (16), calgaryherald (11)
Obama | blogspot (16), nytimes (15), wordpress (12), youtube (11), cnn (10)
Egypt | youtube (2414), cloudfront (2303), yfrog (1255), twitpic (114), imageshack.us (20)
Syria | youtube (130), twitter (61), hostpic.biz (9), telegraph.co.uk (5)

Table 4. The top level domains found for each event, ordered descendingly by the number of resources.

4 Uniqueness and Existence

From the previous data gathering step we obtained six different datasets related to six different historic events. For each event we extracted a list of URIs that were shared in tweets or uploaded to sites like Storify or IAmJan25. To answer the question of how much of the social media content is missing, we first test those URIs for each dataset to eliminate URI aliases, in which several URIs identify the same resource. Upon obtaining those unique URIs we examine how many of them are still available on the live web and how many are available in public web archives.

4.1 Uniqueness

Some URIs, especially those that appear in Twitter, may be aliases for the same resource. For example "http://bit.ly/2EEjBl" and "http://goo.gl/2ViC" both resolve to "http://www.cnn.com". To solve this, we resolved all the URIs following redirects to the final URI. The HTTP response of the last redirect has a field called Location that contains the original long URI of the resource. This step reduced the total number of URIs in the six datasets from 21,625 to 11,051. Table 5 shows the number of unique resources in every dataset.
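
To make the alias elimination concrete, the sketch below follows a chain of redirects to its final URI and collapses aliases into a set. To keep the example self-contained, a dictionary stands in for the Location values that live HTTP responses would provide; the function name `resolve_alias`, the `max_hops` limit, and the two looping example.com entries are assumptions for illustration.

```python
def resolve_alias(uri, location_of, max_hops=10):
    """Follow redirects from `uri` and return the final URI.

    `location_of(uri)` returns the redirect target, or None when the URI
    does not redirect. Returns None on a redirect loop or too many hops.
    """
    seen = set()
    current = uri
    for _ in range(max_hops):
        if current in seen:
            return None  # redirect loop
        seen.add(current)
        target = location_of(current)
        if target is None:
            return current  # no further redirect: this is the final URI
        current = target
    return None


# Simulated redirect table; a real run would issue HTTP requests instead.
redirects = {
    "http://bit.ly/2EEjBl": "http://www.cnn.com",
    "http://goo.gl/2ViC": "http://www.cnn.com",
    "http://a.example.com/x": "http://b.example.com/x",
    "http://b.example.com/x": "http://a.example.com/x",
}
shared = ["http://bit.ly/2EEjBl", "http://goo.gl/2ViC", "http://a.example.com/x"]
resolved = [resolve_alias(uri, redirects.get) for uri in shared]
unique = {uri for uri in resolved if uri is not None}
print(unique)
```

The two shortened URIs from the text collapse to one unique resource, while the looping pair resolves to nothing and would be counted as a failure in the next step.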

4.2 Existence on the Live-Web

After obtaining the unique URIs from the previous step we resolve all of them and classify them as Success or Failure. The Success class includes all the resources that ultimately return a "200 OK" HTTP response. The Failure class includes all the resources that return a "4XX" family response like "404 Not Found", "403 Forbidden" and "410 Gone", the "30X" redirect family while having infinite loop redirects, and server errors with response "50X". To avoid transient errors we repeated the requests, on all datasets, several times for a week to resolve those errors. We also test for "Soft 404s", which are pages that return a "200 OK" response code but are not a representation of the resource, using a technique based on a heuristic for automatically discovering soft 404s from Bar-Yossef et al. [2]. We also include no response from the server, as well as DNS timeouts, as failures. Note that failure means that this resource is missing on the live web.

Event | All | Unique | Available, archived | Available, not archived | Missing, archived | Missing, not archived | Total missing | Total archived
MJ | 2,293 | 1,187 = 51.77% | 316 = 26.62% | 474 = 39.93% | 90 = 7.58% | 307 = 25.86% | 397 = 33.45% | 406 = 34.20%
Iran | 3,429 | 1,340 = 39.08% | 415 = 30.97% | 586 = 43.73% | 101 = 7.54% | 238 = 17.76% | 339 = 25.30% | 516 = 38.51%
H1N1 | 5,517 | 1,645 = 29.82% | 595 = 36.17% | 656 = 39.88% | 98 = 5.96% | 296 = 17.99% | 394 = 23.95% | 693 = 42.12%
Obama | 1,118 | 370 = 33.09% | 143 = 38.65% | 135 = 36.49% | 33 = 8.92% | 59 = 15.95% | 92 = 24.86% | 176 = 47.57%
Egypt | 7,313 | 6,154 = 84.15% | 1,069 = 17.37% | 4,440 = 72.15% | 173 = 2.81% | 472 = 7.67% | 645 = 10.48% | 1,242 = 20.18%
Syria | 1,955 | 355 = 18.16% | 19 = 5.35% | 311 = 87.61% | 0 = 0% | 25 = 7.04% | 25 = 7.04% | 19 = 5.35%

Table 5. Percentages of unique resources from all the extracted ones we obtained per event and the percentages of presence of those unique resources on live web and in archives. Percentages in the availability and archiving columns are out of each event's unique resources (e.g., each / 1,187 for MJ). All resources = 21,625, Unique resources = 11,051

Table 5 summarizes, for each dataset, the total percentages of the resources missing from the live web and the number of missing resources divided by the total number of unique resources.
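
The Success/Failure rule above can be written down as a small function. This is a sketch under stated assumptions: the function names are hypothetical, loop and soft-404 detection are assumed to have been done elsewhere and passed in as flags, and any status other than 200 is treated as a failure because the text defines success only as "200 OK". The counts used at the end are the MJ row of Table 5 (397 missing, 790 available, 1,187 unique).

```python
def classify_response(status, redirect_loop=False, soft_404=False):
    """Classify the final outcome of resolving one URI.

    `status` is the final HTTP status code, or None when the server gave
    no response or the DNS lookup timed out.
    """
    if status is None:
        return "Failure"
    if redirect_loop or soft_404:
        return "Failure"
    return "Success" if status == 200 else "Failure"


def percent_missing(outcomes):
    """Percentage of outcomes classified as Failure."""
    missing = sum(1 for outcome in outcomes if outcome == "Failure")
    return 100.0 * missing / len(outcomes)


# MJ row of Table 5: 397 missing and 316 + 474 = 790 available resources.
mj_outcomes = ["Failure"] * 397 + ["Success"] * 790
print(round(percent_missing(mj_outcomes), 2))
```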

4.3 Existence in the Archives

In the previous step we tested the existence of the unique list of URIs for each event on the live web. Next, we evaluate how many URIs have been archived in public web archives. To check those archives we utilize the Memento framework. If there is a memento for the URI, we download its memento timemap and analyze it. The timemap is a datestamp-ordered list of all known archived versions (called "mementos") of a URI. Next, we parse this timemap and extract the number of mementos that point to versions of the resource in the public archives. We declare the resource to be archived if it has at least one memento. This step was also repeated several times to avoid the transient states of the archives before deeming a resource as unarchived. The results of this experiment along with the archive coverage percentage are presented in Table 5.
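
The "at least one memento" test can be sketched as follows, assuming the timemap is served in the link format commonly used by Memento aggregators, where each entry looks like `<URI>; rel="..."; datetime="..."`. The archive host and the sample timemap are made up for this example, and the regular expressions are a simplification rather than a full link-format parser.

```python
import re

# One entry: <URI> followed by its parameters, up to the next "<".
LINK_RE = re.compile(r'<([^>]+)>\s*;([^<]*)')
REL_RE = re.compile(r'rel="([^"]*)"')


def memento_uris(timemap_text):
    """Return the URIs of entries whose rel value includes "memento"."""
    uris = []
    for uri, params in LINK_RE.findall(timemap_text):
        rel = REL_RE.search(params)
        if rel and "memento" in rel.group(1).split():
            uris.append(uri)
    return uris


def is_archived(timemap_text):
    """A resource counts as archived if it has at least one memento."""
    return len(memento_uris(timemap_text)) >= 1


# Hypothetical timemap for illustration; archive.example.org is not real.
sample_timemap = (
    '<http://example.com/>; rel="original",\n'
    '<http://archive.example.org/timemap/http://example.com/>; rel="self",\n'
    '<http://archive.example.org/20100101000000/http://example.com/>; '
    'rel="first memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT",\n'
    '<http://archive.example.org/20110101000000/http://example.com/>; '
    'rel="last memento"; datetime="Sat, 01 Jan 2011 00:00:00 GMT"'
)
print(len(memento_uris(sample_timemap)), is_archived(sample_timemap))
```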

5 Existence as a Function of Time

Inspecting the results from the previous steps suggests that the number of missing shared resources in social media corresponding to an event is directly proportional to its age. To determine dates for each of the events we extracted all the creation dates from all the tweet-based datasets and sorted them. For each event, we plotted a graph illustrating the number of tweets per day related to that event, as shown in Figure 1. Since the dataset is separated temporally into 3 partitions, and in order to display all the events on one graph, we reduced the size of the x-axis by removing the time periods not covered in our study.

Fig. 1. URIs shared per day corresponding to each event and showing the two peaks in the non-Syrian and non-Egyptian events

Upon examining the graph we found an interesting phenomenon in the non-Syrian and non-Egyptian events: each event has two peaks. Upon investigating history timelines we came to the conclusion that those peaks reflect a second wave of social media interaction as a result of a new incident within the same event after a period of time. For example, in the H1N1 dataset, the first peak illustrates the world-wide outbreak announcement while the second peak denotes the release of the vaccine. In the Iran dataset, the first peak shows the peak of the elections while the second peak pinpoints the Iranian trials. As for the MJ dataset, the first peak corresponds to his death and the second peak describes the rumors that Michael Jackson died of unnatural causes and a possible homicide. For the Obama dataset, the first peak reveals the announcement of his winning the prize while the second peak presents the award-giving ceremony in Oslo. For the Egyptian revolution, the resources are all within a small time slot of 2 weeks around the 11th of February. As for the Syrian event, since the collection was very recent there were no obvious peaks.

Those peaks we examined will become temporal centroids of the social content collections (the datasets): MJ (June 25th & July 10th 2009), Iran (June 13th & August 1st 2009), H1N1 (September 11th & October 5th 2009), and Obama (October 9th & December 10th 2009). Egypt had one centroid (February 11th 2011), and the Syria dataset also had one centroid, on March 27th 2012. We split each two-peak event according to its two centroids. Figure 1 shows those peaks and Table 6 shows the missing content and the archived content percentages corresponding to each centroid.
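
Counting tweets per day and locating the busiest days can be sketched as below. The peak picking rule used here (take the busiest days that are at least `min_gap_days` apart) is an assumption introduced for illustration; the text identifies the peaks by inspecting the plot and history timelines. The dates match the MJ centroids, but the per-day counts are invented.

```python
from collections import Counter
from datetime import date


def daily_counts(created_dates):
    """Number of tweets per calendar day."""
    return Counter(created_dates)


def top_peaks(counts, k=2, min_gap_days=7):
    """Pick up to `k` busiest days that are at least `min_gap_days` apart."""
    peaks = []
    # Busiest day first; equal counts are ordered by date.
    for day, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if all(abs((day - peak).days) >= min_gap_days for peak in peaks):
            peaks.append(day)
        if len(peaks) == k:
            break
    return sorted(peaks)


# Invented counts for illustration only.
created = (
    [date(2009, 6, 25)] * 5
    + [date(2009, 6, 26)] * 4
    + [date(2009, 7, 1)]
    + [date(2009, 7, 10)] * 3
)
peaks = top_peaks(daily_counts(created))
print(peaks)
```

The minimum gap keeps the day after the first peak from being reported as the second peak, which matters because activity around an event decays over several days.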

 | MJ | Iran | H1N1 | Obama | Egypt | Syria
% Missing | 36.24%, 31.62% | 26.98%, 24.47% | 23.49%, 25.64% | 24.59%, 26.15% | 10.48% | 7.04%
% Archived | 39.45%, 30.78% | 43.08%, 36.26% | 41.65%, 43.87% | 47.87%, 46.15% | 20.18% | 5.35%

Table 6. The Split Dataset

Fig. 2. Percentage of content missing and archived for the events as a function of time.

Figure 2 shows the missing and archived values from Table 6 as a function of time since shared. Equation 1 shows the modeled estimate for the percentage of shared resources lost, where Age is in days. While there is a less linear relationship between time and being archived, Equation 2 shows the modeled estimate for the percentage of shared resources archived in a public archive.

    Content Lost Percentage = 0.02 × (Age in days) + 4.20        (1)

    Content Archived Percentage = 0.04 × (Age in days) + 6.74    (2)

Given these observations and our curve fitting we estimate that after a year from publishing about 11% of content shared in social media will be gone. After this point, we are losing roughly 0.02% of this content per day.
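
As a worked example, the two fitted lines can be evaluated directly. The function names below are chosen for this sketch; the coefficients are the rounded values from Equations 1 and 2, so the results are approximate. For a resource shared one year (365 days) ago, the model gives 0.02 × 365 + 4.20 = 11.5% lost and 0.04 × 365 + 6.74 = 21.34% archived, in line with the roughly 11% lost and 20% archived quoted for the first year.

```python
def percent_lost(age_days):
    """Equation 1: estimated percentage of shared resources lost."""
    return 0.02 * age_days + 4.20


def percent_archived(age_days):
    """Equation 2: estimated percentage of shared resources archived."""
    return 0.04 * age_days + 6.74


one_year = 365
print(round(percent_lost(one_year), 2), round(percent_archived(one_year), 2))
```

The slope of Equation 1 is where the "0.02% per day" figure comes from: each additional day of age adds 0.02 percentage points to the estimated loss.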

6 Conclusions and Future Work

We can conclude that there is a nearly linear relationship between time of sharing in the social media and the percentage lost. Although not as linear, there is a similar relationship between the time of sharing and the expected percentage of coverage in the archives. To reach this conclusion, we extracted collections of tweets and other social media content that was posted and shared in relation to six different events that occurred in the time period from June 2009 to March 2012. Next we extracted the embedded resources within this social media content and tested their existence on the live web and in the archives. After analyzing the percentages lost and archived in relation to time and plotting them, we used a linear regression model to fit those points. Finally we presented two linear models that can estimate the existence of a resource that was posted or shared at one point of time in the social media, on the live web and in the archives, as a function of its age in the social media.

In the next stage of our research we need to expand the datasets and import other similar datasets, especially in the uncovered temporal areas (e.g., the year of 2010 and before 2009). Examining more datasets across extended points in time could enable us to better model these two functions of time. Several other factors besides time will also be analyzed to understand their effect on persistence on the live web and archiving coverage, such as publishing venue, rate of sharing, popularity of authors and the nature of the related event.

7 Acknowledgments

This work was supported in part by the Library of Congress and NSF IIS-1009392.

References

1. Ainsworth, S.G., Alsum, A., SalahEldeen, H., Weigle, M.C., Nelson, M.L.: How Much of the Web Is Archived? In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, pages 133-136, (2011).
2. Bar-Yossef, Z., Broder, A.Z., Kumar, R., Tomkins, A.: Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay. In Proceedings of the 13th international conference on World Wide Web, WWW '04, pages 328-337, (2004).
3. Benevenuto, F., Rodrigues, T., Cha, M., Almeida, V.: Characterizing User Behavior in Online Social Networks. In Proc. of ACM SIGCOMM Internet Measurement Conference, SIGCOMM '09, pages 49-62, (2009).
4. Lee, C., Ma, L., Goh, D.: Why Do People Share News in Social Media? Active Media Technology, Springer Berlin/Heidelberg, pages 129-140, Volume 6890, (2011).
5. Facebook official fact sheet, http://newsroom.fb.com/content/default.aspx?NewsAreaId=22
6. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a Social Network or a News Media? In Proceedings of the 19th international conference on World wide web, WWW '10, pages 591-600, (2010).
7. Mohr, G., Kimpton, M., Stack, M., Ranitovic, I.: Introduction to Heritrix, an Archival Quality Web Crawler. In 4th International Web Archiving Workshop, IWAW '04, (2004).
8. McCown, F., Diawara, N., Nelson, M.L.: Factors Affecting Website Reconstruction from the Web Infrastructure. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '07, pages 39-48, (2007).
9. Nelson, M.L., Allen, B.D.: Object Persistence and Availability in Digital Libraries. D-Lib Magazine, Volume 8, Number 1, January (2002).
10. Newman, M.E.J., Park, J.: Why social networks are different from other types of networks. Phys. Rev. E, 68(3):036122, September, (2003).
11. Nunns, A., Idle, N.: Tweets From Tahrir. ISBN-10: 1935928457.
12. Phelps, T.A., Wilensky, R.: Robust Hyperlinks Cost Just Five Words Each. Technical Report, UCB/CSD-00-1091, EECS Department, University of California, Berkeley, (2000).
13. SalahEldeen, H.M., Nelson, M.L.: Losing My Revolution: A year after the Egyptian Revolution, 10% of the social media documentation is gone. http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
14. Sanderson, R., Phillips, M., Van de Sompel, H.: Analyzing the Persistence of Referenced Web Resources with Memento. CoRR, arXiv:1105.3459, (2011).
15. Stanford SNAP Project Dataset, http://snap.stanford.edu/
16. Twitter numbers, http://blog.Twitter.com/2011/03/numbers.html
17. Van de Sompel, H., Nelson, M.L., Sanderson, R., Balakireva, L.L., Ainsworth, S., Shankar, H.: Memento: Time Travel for the Web, Technical Report, arXiv:0911.1112, November, (2009).
18. Wan, X., Yang, J.: Wordrank-based Lexical Signatures for Finding Lost or Related Web Pages. In Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development, APWeb '06, pages 843-849, (2006).
19. Wilson, C., Boe, B., Sala, A., Puttaswamy, K.P., Zhao, B.Y.: User Interactions in Social Networks and their Implications. In Proceedings of the 4th ACM European conference on Computer systems, EuroSys '09, pages 205-218, (2009).
20. Wu, S., Hofman, J.M., Mason, W.A., Watts, D.J.: Who Says What to Whom on Twitter. In Proceedings of the 20th international conference on World wide web, WWW '11, pages 705-714, (2011).
21. Yang, J., Leskovec, J.: Patterns of Temporal Variation in Online Media. In ACM International Conference on Web Search and Data Mining, WSDM '11, pages 177-186, (2011).
22. Yang, J., Counts, S.: Predicting the Speed, Scale, and Range of Information Diffusion in Twitter. In 4th International AAAI Conference on Weblogs and Social Media, ICWSM '10, May, (2010).
23. Zhao, D., Rosson, M.B.: How and Why People Twitter: The Role that Micro-blogging Plays in Informal Communication at Work. In Proceedings of the ACM 2009 international conference on Supporting group work, GROUP '09, pages 243-252, (2009).