Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?

Hany M. SalahEldeen and Michael L. Nelson

Old Dominion University, Department of Computer Science
Norfolk VA, 23529, USA
{hany,mln}@cs.odu.edu

Abstract. Social media content has grown exponentially in the recent years and the role of social media has evolved from just narrating life events to actually shaping them. In this paper we explore how many resources shared in social media are still available on the live web or in public web archives. By analyzing six different event-centric datasets of resources shared in social media in the period from June 2009 to March 2012, we found about 11% lost and 20% archived after just a year and an average of 27% lost and 41% archived after two and a half years. Furthermore, we found a nearly linear relationship between time of sharing of the resource and the percentage lost, with a slightly less linear relationship between time of sharing and archiving coverage of the resource. From this model we conclude that after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day.

Keywords: Web Archiving, Social Media, Digital Preservation

1 Introduction

With more than 845 million Facebook users at the end of 2011 [5] and over 140 million tweets sent daily in 2011 [16], users can share photos and videos, post their opinions, and report incidents as they happen. Many of the posts and tweets are about quotidian events and their preservation is debatable. However, some of the posts and tweets are about culturally important events whose preservation is less controversial. In this paper we shed light on the importance of archiving social media content about these events and estimate how much of this content is archived, still available, or lost with no possibility of recovery.

To emphasize the culturally important commentary and sharing, we collected data about six events in the time period of June 2009 to March 2012: the H1N1 virus outbreak, Michael Jackson's death, the Iranian elections and protests, Barack Obama's Nobel Peace Prize, the Egyptian revolution, and the Syrian uprising.

arXiv:1209.3026v1 [cs.DL] 13 Sep 2012

2 Related Work

To our knowledge, no prior study has analyzed the amount of shared resources in social media lost through time. There have been many studies analyzing the behavior of users within a social network, how they interact, and what content they share [3, 19, 20, 23]. As for Twitter, Kwak et al. [6] studied its nature and its topological characteristics and found a deviation from known characteristics of human social networks that were analyzed by Newman and Park [10]. Lee et al. analyzed the reasons behind sharing news in social media and found that informativeness was the strongest motivation in predicting news sharing intention, followed by socializing and status seeking [4]. Also, shared content in social media like Twitter moves and diffuses relatively fast, as stated by Yang et al. [22].

Furthermore, many concerns were raised about the persistence of shared resources and web content in general. Nelson and Allen studied the persistence of objects in a digital library and found that, within just over a year, 3% of the sample they collected appeared to be no longer available [9]. Sanderson et al. analyzed the persistence and availability of web resources referenced from papers in scholarly repositories using Memento and found that 28% of these resources have been lost [14]. Memento [17] is a collection of HTTP extensions that enables uniform, inter-archive access. Ainsworth et al. [1] examined how much of the web is archived and found it ranges from 16% to 79%, depending on the starting seed URIs. McCown et al. examined the factors affecting reconstructing websites (using caches and archives) and found that PageRank, Age, and the number of hops from the top-level of the site were most influential [8].

3 Data Gathering

We compiled a list of URIs that were shared in social media and correspond to specific culturally important events. In this section we describe the data acquisition and sampling process we performed to extract six different datasets, which are tested and analyzed in the following sections.

3.1 Stanford SNAP Project Dataset

The Stanford Large Network Dataset is a collection of about 50 large network datasets having millions of nodes, edges and tuples. It was collected as a part of the Stanford Network Analysis Platform (SNAP) project [15]. It includes social networks, web graphs, road networks, Internet networks, citation networks, collaboration networks, and communication networks. For the purpose of our investigation, we selected their Twitter posts dataset. This dataset was collected from June 1st, 2009 to December 31st, 2009 and contains nearly 476 million tweets posted by nearly 17 million users. The dataset is estimated to cover 20%-30% of all posts published on Twitter during that time frame [21]. To select which events will be covered in this study, we examined CNN's 2009 events timeline¹. We wanted to select a small number of events that were diverse, with limited overlap, and relatively important to a large number of people. Given that, we selected four events: the H1N1 virus outbreak, the Iranian protests and elections, Michael Jackson's death, and Barack Obama's Nobel Peace Prize award.

Preparation: A tweet is typically composed of text, hashtags, embedded resources or URIs, and user tags, all spanning a maximum of 140 characters. Here is an example of a tweet record in the SNAP dataset:

    T 2009-07-31 23:57:18
    U http://Twitter.com/nickgotch
    W RT @rockingjude: December 21, 2009 Depopulation by Food Will Begin http://is.gd/1WMZb WHOA..BETTER WATCH RT plz #pwa #tcot

The line starting with the letter T indicates the date and time of the tweet creation, while the line starting with U shows a link to the user who authored this particular tweet. Finally, the line starting with W shows the entire tweet including all the user references "@rockingjude", the embedded URIs "http://is.gd/1WMZb", and hashtags "#pwa #tcot".
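
The record layout described above can be read with a few lines of code. The following is a minimal sketch, not the tooling used in this study: the function name `parse_record`, the regular expressions, and the assumption that each key (T, U, W) is separated from its value by whitespace are all illustrative choices.

```python
import re
from datetime import datetime

# Illustrative patterns (assumptions, not taken from the SNAP documentation).
URI_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def parse_record(record: str) -> dict:
    """Parse one T/U/W tweet record into its components."""
    fields = {}
    for line in record.strip().splitlines():
        # The first token is the key (T, U or W); the rest is the value.
        key, value = line.strip().split(None, 1)
        fields[key] = value
    text = fields["W"]
    return {
        "created": datetime.strptime(fields["T"], "%Y-%m-%d %H:%M:%S"),
        "user": fields["U"],
        "text": text,
        "uris": URI_RE.findall(text),
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        "mentions": MENTION_RE.findall(text),
    }


EXAMPLE = """T 2009-07-31 23:57:18
U http://Twitter.com/nickgotch
W RT @rockingjude: December 21, 2009 Depopulation by Food Will Begin http://is.gd/1WMZb WHOA..BETTER WATCH RT plz #pwa #tcot"""

tweet = parse_record(EXAMPLE)
```

Only tweets whose `uris` list is non-empty are of interest later on, since the study concerns the embedded resources rather than the tweets themselves.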

Tag Expansion: We wanted to select tweets that we can say with high confidence are about a selected event. In this case, precision is more important than recall, as collecting every single tweet published about a certain event is less important than making sure that the selected tweets are definitely about that event. Several studies focused on estimating the aboutness of a certain web page or a resource in general [12, 18]. Fortunately in Twitter, hashtags incorporated within a tweet can help us estimate their "aboutness". Users normally add certain hashtags to their tweets to ease the search and discoverability in following a certain topic. These hashtags will be utilized in the event-centric filtration process.

For each event, we selected initial tags that describe it (Table 1). Those initial tags were derived empirically after examining some event-related tweets. Next we extracted all the hashtags that co-occurred with our initial set of hashtags. For example, in class H1N1 we extracted all the other hashtags that appeared along with #h1n1 within the same tweet and kept count of their frequency. Those extracted hashtags were sorted in descending order of the frequency of their appearance in tweets. We removed all the general scope tags like #cnn, #health, #death, #war and others. In regards to aboutness, removing general tags will indeed decrease recall but will increase precision. Finally we picked the top 8-10 hashtags to represent this event-class and be utilized in the filtration process. Table 1 shows the final set of tags selected for each class.
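
The tag expansion step amounts to counting co-occurrences and dropping general tags. A minimal sketch follows, assuming each tweet has already been reduced to its collection of hashtags; the function name `cooccurring_tags` and the contents of `GENERAL_TAGS` beyond the four tags named above are hypothetical.

```python
from collections import Counter

# General-scope tags named in the text; a real list would be longer.
GENERAL_TAGS = {"cnn", "health", "death", "war"}


def cooccurring_tags(tweets, initial_tag, general_tags=GENERAL_TAGS, top_n=10):
    """Rank hashtags by how often they share a tweet with initial_tag.

    tweets is an iterable of hashtag collections, one per tweet.
    Returns (tag, count) pairs in descending order of count.
    """
    counts = Counter()
    for tags in tweets:
        tags = {tag.lower() for tag in tags}
        if initial_tag in tags:
            counts.update(tags - {initial_tag})
    for tag in general_tags:
        counts.pop(tag, None)  # trade recall for precision
    return counts.most_common(top_n)


toy_tweets = [
    {"h1n1", "swine", "cnn"},
    {"h1n1", "swine", "swineflu"},
    {"h1n1", "swineflu"},
    {"swine", "flu"},
    {"h1n1", "swine"},
]
ranked = cooccurring_tags(toy_tweets, "h1n1")
```

On the toy data, `swine` co-occurs with `h1n1` three times and `swineflu` twice, while `cnn` is discarded as a general tag.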

Tweet Filtration: In the previous step we extracted the tags that will help us classify and filter tweets in the dataset according to each event. This filtration process aims to extract a reasonably sized dataset of tweets for each event and to minimize the inter-event overlap. Since the life and persistence of the tweet itself is not the focus of this study but rather the associated resource that appears in the tweet (image, video, shortened URI or other embedded resource), we will extract only the tweets that contain an embedded resource. This step resulted in 181 million tweets with embedded resources (http://is.gd/1WMZb in the prior example). These tweets were further filtered to keep only the tweets that have at least one of the expanded tags obtained from Table 1. The number of tweets after this phase reached 1.1 million tweets.

¹ http://www.cnn.com/2009/US/12/16/year.timeline/index.html

Event                 Initial hashtags             Top co-occurring hashtags
H1N1 Outbreak         'h1n1' = 61,351              'swine' = 61,829, 'swineflu' = 56,419, 'flu' = 8,436,
                                                   'pandemic' = 6,839, 'influenza' = 1,725, 'grippe' = 1,559,
                                                   'tamiflu' = 331
M. Jackson's Death    'michaeljackson' = 22,934    'michael' = 27,075, 'mj' = 18,584, 'thisisit' = 8,770,
                                                   'rip' = 3,559, 'jacko' = 3,325, 'kingofpop' = 2,888,
                                                   'jackson' = 2,559, 'thriller' = 1,357,
                                                   'thankyoumichael' = 1,050
Iranian Elections     'iranelection' = 911,808     'iran' = 949,641, 'gr88' = 197,113, 'tehran' = 109,006,
                                                   'freeiran' = 13,378, 'neda' = 191,067, 'mousavi' = 16,587,
                                                   'united4iran' = 9,198, 'iranrevolution' = 7,295
Obama's Nobel Prize   'obama' = 48,161 &           'obamanobel' = 14, 'nobelprize', 'nobelpeace' = 113,
                      'nobel' = 2,261              'peace' = 3,721, 'barack' = 1292, 'nobelpeaceprize' = 107

Table 1. Twitter hashtags generated for filtering and their frequency of occurring

Filtering the tweets based on the occurrence of at least one of the hashtags only is undesirable as it will cause two problems. First, it will introduce possible event overlap due to general tweets talking about two or more topics. Second, using only the single occurrence of these tags will yield a huge amount of tweets and we need to reduce this to a more manageable size. Intuitively speaking, strongly related hashtags will co-occur often. For example, a tweet that has #h1n1 along with #swineflu and #pandemic is more likely about the H1N1 outbreak than a tweet having just the tag #flu or just #sick. Filtering with this co-occurrence will in turn solve both problems: by increasing relevance to a particular event, general tweets that talk about several events will be filtered out, thus diminishing the overlap, and in turn it will reduce the size of the dataset.

Next, we increase the precision of the tweets associated with each event from the set of 1.1 million tweets. In the first iteration we selected the tag that had the highest frequency of co-occurrence in the dataset with the initial tag and added it to a set we will call the selection set. After that we check the co-occurrence of all the remaining extracted tags with the tag in the selection set and record the frequencies of co-occurrence. After sorting the frequencies of co-occurrence with the tag from the selection set, we pick the highest one and add it to the selection set. We repeat this step of counting co-occurrences, but with all the previously extracted hashtags in the selection set from previous iterations.

To elaborate, for H1N1 assume that the hashtag '#h1n1' had the highest frequency of appearance in the dataset, so we add it to the selection set. In the next iteration we record how many times each tag in the list appeared along with '#h1n1' in the same tweet. If we selected '#swine' as the one with the highest frequency of occurrence with the initial tag '#h1n1', we add it to the selection set and in the next iteration we record the frequency of occurrence of the remaining hashtags with both of the extracted tags '#h1n1' and '#swine'. We repeat this step, for each event, to the point where we have a manageably sized dataset whose 'aboutness' in relation to the event we are confident in.
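
The iteration just described is essentially a greedy selection. The sketch below is one possible reading of it, under stated assumptions: tweets are represented as sets of hashtags, the stopping rule is a hypothetical `max_size` threshold standing in for "a manageable size", and ties are broken alphabetically, which the text does not specify.

```python
from collections import Counter


def greedy_filter(tweets, candidates, max_size):
    """Greedily grow a selection set of co-occurring hashtags.

    tweets: list of sets of hashtags, one set per tweet.
    candidates: the expanded tags for one event.
    max_size: stop once this many tweets or fewer remain (an assumption).
    Returns (selection, remaining_tweets, steps), where steps records
    the tweet count after each tag is added.
    """
    remaining = list(tweets)
    pool = set(candidates)
    selection, steps = [], []
    while pool and len(remaining) > max_size:
        # Count how often each unused tag appears in the tweets that
        # already contain every tag in the selection set.
        counts = Counter()
        for tags in remaining:
            counts.update(tags & pool)
        if not counts:
            break
        # Highest frequency wins; ties broken by tag name (assumption).
        best = max(counts, key=lambda tag: (counts[tag], tag))
        selection.append(best)
        pool.discard(best)
        remaining = [tags for tags in remaining if best in tags]
        steps.append((tuple(selection), len(remaining)))
    return selection, remaining, steps


toy = [
    {"h1n1", "swine", "swineflu", "pandemic"},
    {"h1n1", "swine", "swineflu"},
    {"h1n1", "swine"},
    {"h1n1", "flu"},
    {"h1n1"},
    {"flu", "sick"},
]
selection, kept, steps = greedy_filter(
    toy, {"h1n1", "swine", "swineflu", "pandemic", "flu"}, max_size=2
)
```

Each entry in `steps` corresponds to one row of the kind of iteration log reported for the real datasets: the conjunction of tags so far and the number of tweets that still match.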

Event   Hashtags selected for filtration                  Tweets extracted   Operation performed   Final tweets
MJ      michael                                           27,075
        michael & michaeljackson                          22,934             Sample 10%            2,293
Iran    iran                                              949,641
        iran & iranelection                               911,808
        iran & iranelection & gr88                        189,757
        iran & iranelection & gr88 & neda                 91,815
        iran & iranelection & gr88 & neda & tehran        34,294             Sample 10%            3,429
H1N1    h1n1                                              61,351
        h1n1 & swine                                      44,972
        h1n1 & swine & swineflu                           42,574
        h1n1 & swine & swineflu & pandemic                5,517              Take All              5,517
Obama   obama                                             48,161
        obama & nobel                                     1,118              Take All              1,118

Table 2. Tweet filtration iterations and final tweet collections

Two problems appeared from this approach with the Iran and Michael Jackson datasets. In the Iran dataset the number of tweets was in the hundreds of thousands, and even with the co-occurrence of 5 tags it was still about 34K+ tweets. To solve this we performed a random sampling from those resulting tweets to take only 10% of them, resulting in a smaller, manageable dataset. The second problem arose with the Michael Jackson dataset: upon using 5 tags to decrease it to a manageable size, we realized there were few unique domains for the embedded resources. A closer look revealed this combination of tags was mostly borderline tweet spam (MJ ringtones). To solve this we used only the two top tags "#michael" and "#michaeljackson", and then we randomly sampled 10% of the resulting tweets to reach the desired dataset size (Table 2).
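
The 10% sampling step can be sketched as a uniform random sample without replacement. The helper name `sample_percent`, the optional `seed`, and the choice to round the sample size down are assumptions; rounding down does reproduce the final tweet counts in Table 2 (34,294 to 3,429 and 22,934 to 2,293).

```python
import random


def sample_percent(items, percent=10, seed=None):
    """Return a uniform random sample of `percent` percent of items.

    The sample size is rounded down; pass a seed for repeatability.
    """
    population = list(items)
    rng = random.Random(seed)
    k = len(population) * percent // 100
    return rng.sample(population, k)


iran_sample = sample_percent(range(34294), percent=10, seed=1)
mj_sample = sample_percent(range(22934), percent=10, seed=1)
```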

3.2 Egyptian Revolution Dataset

The one year anniversary of this event was the original motivation for this study [13]. In this case, we started with an event and then tried to get social media content describing it. Despite its ubiquity, gathering social media for a past event is surprisingly hard. We picked the Egyptian revolution due to the role of social media in curating and driving the incidents that led to the resignation of the president. Several initiatives were commenced to collect and curate the social media content during the revolution, like R-Shief.org², which specializes in social content analysis of the issues in the Arab world by using aggregate data from Twitter and the Web. We are currently in the process of obtaining the millions of records related to the Arab Spring of 2011. Meanwhile, we decided to build our own dataset manually.

² http://www.r-shief.org/

There are several sites that curate resources about the Egyptian Revolution and we want to investigate as many of them as possible. At the same time, we need to diversify our resources and the types of digital artifacts that are embedded in them. Tweets, videos, images, embedded links, entire web pages and books were included in our investigation. For the sake of consistency, we limited our analysis to resources created within the period from the 20th of January 2011 to the 1st of March 2011. In the next subsections we explain each of the resources we utilized in our data acquisition in detail.

Storify: Storify is a website that enables users to create stories by creating collections of URIs (e.g., tweets, images, videos, links) and arranging them temporally. These entries are posted by reference to their host websites. Thus, adding content to Storify does not necessarily mean it is archived. If a user added a video from YouTube and after a while the publisher of that video decided to remove it from YouTube, the user is left with a gap in their Storify entry. For this purpose we gathered all the Storify entries that were created between the 20th of January 2011 and the 1st of March 2011, resulting in 219 unique resources.

IAmJan25: Some entire websites were dedicated as a collection hub of media to curate the revolution. Based on public contributions, those websites collect different types of media, classify them, order them chronologically and publish them to the public. We picked a website named IAmJan25.com, as an example of these websites, to analyze and investigate. The administrators of the website received selected videos and images for notable events and actions that happened during the revolution. Those images and videos were selected by users, who vouched for their importance and sent the resource's URI to the website administrators. The website itself is divided into two collections: a video collection and an image collection. The video collection had 2387 unique URIs while the image collection had 3525 unique URIs.

Tweets From Tahrir: Several books were published in 2011 documenting the revolution and the Arab Spring. To bridge the gap between books and digital media we analyzed a book entitled Tweets from Tahrir [11], which was published on April 21st, 2011. As the name states, this book tells a story formed by tweets of people during the revolution and the clashes with the past regime. We analyzed this book as a collection of tweets that had the luxury of a paperback preservation and focused on the tweeted media, in this case images. The book had a total of 1118 tweets having 23 unique images.

3.3 Syria Dataset

This dataset has been selected to represent a current (March 2012) event. Using the Twitter search API, we followed the same pattern of data acquisition as in Section 3.1. We started with one hashtag, #Syria, and expanded it. Table 3 shows the tags produced from the tag expansion step. After that, each of those tags was input into a process utilizing the Twitter streaming API, which produced the first 1000 results matching each tag. From this set, we randomly sampled 10%. As a result, 1955 tweets were extracted, each having one or more embedded resources and tags from the expanded tags in Table 3.

Initial hashtags   Extracted hashtags
'Syria'            'Bashar', 'RiseDamascus', 'GenocideInSyria', 'STOPASSAD2012', 'AssadCrimes', 'Assad'

Table 3. Twitter #Tags generated for filtering the Syrian uprising

Table 4 shows the resources collected along with the top level domains that those resources belong to for each event.

Event   Top domains (number of resources found)
MJ      youtube (110), twitpic (45), latimes (43), cnn (30), amazon (30)
Iran    youtube (385), twitpic (36), blogspot (30), roozonline (29)
H1N1    rhizalabs (676), reuters (17), google (16), flutrackers (16), calgaryherald (11)
Obama   blogspot (16), nytimes (15), wordpress (12), youtube (11), cnn (10)
Egypt   youtube (2414), cloudfront (2303), yfrog (1255), twitpic (114), imageshack.us (20)
Syria   youtube (130), twitter (61), hostpic.biz (9), telegraph.co.uk (5)

Table 4. The top level domains found for each event, in descending order of the number of resources.

4 Uniqueness and Existence

From the previous data gathering step we obtained six different datasets related to six different historic events. For each event we extracted a list of URIs that were shared in tweets or uploaded to sites like Storify or IAmJan25. To answer the question of how much of the social media content is missing, we first test those URIs for each dataset to eliminate URI aliases, in which several URIs identify the same resource. Upon obtaining those unique URIs we examine how many of them are still available on the live web and how many are available in public web archives.

4.1 Uniqueness

Some URIs, especially those that appear in Twitter, may be aliases for the same resource. For example "http://bit.ly/2EEjBl" and "http://goo.gl/2ViC" both resolve to "http://www.cnn.com". To solve this, we resolved all the URIs, following redirects to the final URI. The HTTP response of the last redirect has a field called Location that contains the original long URI of the resource. This step reduced the total number of URIs in the six datasets from 21,625 to 11,051. Table 5 shows the number of unique resources in every dataset.
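
Alias elimination can be sketched as following each redirect chain to its end and grouping URIs by the final target. To keep the example runnable without network access, the HTTP request is abstracted behind a hypothetical `fetch` callable that returns a `(status, location)` pair; in practice it could be backed by an HTTP HEAD request. The function names, the `max_hops` limit and the set of redirect codes are assumptions.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve(uri, fetch, max_hops=10):
    """Follow Location headers until a non-redirect response.

    Returns the final URI, or None on a redirect loop or when
    max_hops is exceeded.
    """
    seen = set()
    for _ in range(max_hops):
        if uri in seen:
            return None  # redirect loop
        seen.add(uri)
        status, location = fetch(uri)
        if status in REDIRECT_CODES and location:
            uri = urljoin(uri, location)  # Location may be relative
        else:
            return uri
    return None


def group_aliases(uris, fetch):
    """Map each final URI to the list of shared URIs that lead to it."""
    groups = {}
    for uri in uris:
        final = resolve(uri, fetch)
        groups.setdefault(final if final is not None else uri, []).append(uri)
    return groups


# Canned responses standing in for live HTTP requests.
FAKE_WEB = {
    "http://bit.ly/2EEjBl": (301, "http://www.cnn.com"),
    "http://goo.gl/2ViC": (301, "http://www.cnn.com"),
    "http://www.cnn.com": (200, None),
    "http://a.example/": (302, "http://b.example/"),
    "http://b.example/": (302, "http://a.example/"),
}
groups = group_aliases(["http://bit.ly/2EEjBl", "http://goo.gl/2ViC"], FAKE_WEB.get)
```

The number of keys in the result is the number of unique resources, which is the quantity reported per dataset in Table 5.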

4.2 Existence on the Live-Web

After obtaining the unique URIs from the previous step we resolve all of them and classify them as Success or Failure. The Success class includes all the resources that ultimately return a "200 OK" HTTP response. The Failure class includes all the resources that return a "4XX" family response like "404 Not Found", "403 Forbidden" and "410 Gone", responses from the "30X" redirect family that end in an infinite redirect loop, and server errors with response "50X". To avoid transient errors we repeated the requests, on all datasets, several times over a week. We also test for "Soft 404s", which are pages that return a "200 OK" response code but are not a representation of the resource, using a technique based on a heuristic for automatically discovering soft 404s from Bar-Yossef et al. [2]. We also include no response from the server, as well as DNS timeouts, as failures. Note that failure means that this resource is missing on the live web.

Event   All     Unique           Available,      Available,       Missing,      Missing,        Total missing   Total archived
                                 archived        not archived     archived      not archived
MJ      2,293   1,187 = 51.77%   316 = 26.62%    474 = 39.93%     90 = 7.58%    307 = 25.86%    397 = 33.45%    406 = 34.20%
Iran    3,429   1,340 = 39.08%   415 = 30.97%    586 = 43.73%     101 = 7.54%   238 = 17.76%    339 = 25.30%    516 = 38.51%
H1N1    5,517   1,645 = 29.82%   595 = 36.17%    656 = 39.88%     98 = 5.96%    296 = 17.99%    394 = 23.95%    693 = 42.12%
Obama   1,118   370 = 33.09%     143 = 38.65%    135 = 36.49%     33 = 8.92%    59 = 15.95%     92 = 24.86%     176 = 47.57%
Egypt   7,313   6,154 = 84.15%   1,069 = 17.37%  4440 = 72.15%    173 = 2.81%   472 = 7.67%     645 = 10.48%    1242 = 20.18%
Syria   1,955   355 = 18.16%     19 = 5.35%      311 = 87.61%     0 = 0%        25 = 7.04%      25 = 7.04%      19 = 5.35%

Table 5. Percentages of unique resources from all the extracted ones we obtained per event and the percentages of presence of those unique resources on live web and in archives. The Unique percentage is out of All; the remaining percentages are out of each event's unique resources. All resources = 21,625, Unique resources = 11,051

Table 5 summarizes, for each dataset, the total percentages of the resources missing from the live web and the number of missing resources divided by the total number of unique resources.
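
The Success/Failure rule above can be written down compactly. In this sketch the soft-404 check and redirect-loop detection are assumed to have been done elsewhere and are passed in as flags, and `None` stands for no response or a DNS timeout. Treating a resource as missing only when every repeated attempt fails is our reading of the retry step, not a rule stated explicitly in the text.

```python
def classify_response(status, redirect_loop=False, soft_404=False):
    """Classify one request outcome as 'Success' or 'Failure'.

    status: final HTTP status code, or None for no response / DNS timeout.
    """
    if status is None or redirect_loop:
        return "Failure"
    if status == 200 and not soft_404:
        return "Success"
    return "Failure"  # 4XX, 50X, soft 404s, anything else


def is_missing(outcomes):
    """A resource is missing if every repeated attempt failed (assumption)."""
    if not outcomes:
        raise ValueError("at least one attempt is required")
    return all(outcome == "Failure" for outcome in outcomes)


# One week of hypothetical attempts for a flaky but live resource.
attempts = [classify_response(s) for s in (503, None, 200)]
```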

4.3 Existence in the Archives

In the previous step we tested the existence of the unique list of URIs for each event on the live web. Next, we evaluate how many URIs have been archived in public web archives. To check those archives we utilize the Memento framework. If there is a memento for the URI, we download its memento timemap and analyze it. The timemap is a datestamp-ordered list of all known archived versions (called "mementos") of a URI. Next, we parse this timemap and extract the number of mementos that point to versions of the resource in the public archives. We declare the resource to be archived if it has at least one memento. This step was also repeated several times to avoid the transient states of the archives before deeming a resource as unarchived. The results of this experiment along with the archive coverage percentage are presented in Table 5.
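
Counting mementos in a timemap can be sketched as follows. The example assumes the timemap is serialized in the link format, where each entry looks like `<URI>;rel="...";datetime="..."` and mementos carry a `rel` value containing the token `memento`; the serialization available at the time of the study may have differed. The function names and the archive host `archive.example` are hypothetical.

```python
import re

# One link-format entry: <target> followed by ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def memento_uris(timemap_text):
    """Return the URIs of all entries whose rel contains 'memento'."""
    found = []
    for target, params in LINK_RE.findall(timemap_text or ""):
        attrs = dict(PARAM_RE.findall(params))
        # rel may hold several tokens, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split():
            found.append(target)
    return found


def is_archived(timemap_text):
    """Archived means at least one memento, as defined in the text."""
    return len(memento_uris(timemap_text)) >= 1


TIMEMAP = (
    '<http://example.com/>;rel="original",\n'
    '<http://archive.example/timemap/http://example.com/>'
    ';rel="self";type="application/link-format",\n'
    '<http://archive.example/20090701/http://example.com/>'
    ';rel="first memento";datetime="Wed, 01 Jul 2009 00:00:00 GMT",\n'
    '<http://archive.example/20100101/http://example.com/>'
    ';rel="last memento";datetime="Fri, 01 Jan 2010 00:00:00 GMT"\n'
)
mementos = memento_uris(TIMEMAP)
```

Entries such as the original resource and the timemap itself are skipped because their `rel` values do not contain the `memento` token.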

5 Existence as a Function of Time

Inspecting the results from the previous steps suggests that the number of missing shared resources in social media corresponding to an event is directly proportional to its age. To determine dates for each of the events, we extracted all the creation dates from all the tweet-based datasets and sorted them. For each event, we plotted a graph illustrating the number of tweets per day related to that event as shown in Figure 1. Since the dataset is separated temporally into 3 partitions, and in order to display all the events on one graph, we reduced the size of the x-axis by removing the time periods not covered in our study.

Fig. 1. URIs shared per day corresponding to each event and showing the two peaks in the non-Syrian and non-Egyptian events

Upon examining the graph we found an interesting phenomenon in the non-Syrian and non-Egyptian events: each event has two peaks. Upon investigating history timelines we came to the conclusion that those peaks reflect a second wave of social media interaction as a result of a new incident within the same event after a period of time. For example, in the H1N1 dataset, the first peak illustrates the world-wide outbreak announcement while the second peak denotes the release of the vaccine. In the Iran dataset, the first peak shows the peak of the elections while the second peak pinpoints the Iranian trials. As for the MJ dataset, the first peak corresponds to his death and the second peak describes the rumors that Michael Jackson died of unnatural causes and a possible homicide. For the Obama dataset, the first peak reveals the announcement of his winning the prize while the second peak presents the award-giving ceremony in Oslo. For the Egyptian revolution, the resources are all within a small time slot of 2 weeks around the date 11th of February. As for the Syrian event, since the collection was very recent there were no obvious peaks.

Those peaks we examined will become temporal centroids of the social content collections (the datasets): MJ (June 25th & July 10th 2009), Iran (June 13th & August 1st 2009), H1N1 (September 11th & October 5th 2009), and Obama (October 9th & December 10th 2009). Egypt had one centroid (February 11th 2011) and the Syria dataset also had one centroid, on March 27th 2012. We split each two-peak event into two parts according to its two centroids. Figure 1 shows those peaks and Table 6 shows the missing content and the archived content percentages corresponding to each centroid.

             MJ                Iran              H1N1              Obama             Egypt    Syria
% Missing    36.24%  31.62%    26.98%  24.47%    23.49%  25.64%    24.59%  26.15%    10.48%   7.04%
% Archived   39.45%  30.78%    43.08%  36.26%    41.65%  43.87%    47.87%  46.15%    20.18%   5.35%

Table 6. The Split Dataset

Fig. 2. Percentage of content missing and archived for the events as a function of time.

Figure 2 shows the missing and archived values from Table 6 as a function of time since shared.
Equation 1 shows the modeled estimate for the percentage of shared resources lost, where Age is in days. While there is a less linear relationship between time and being archived, Equation 2 shows the modeled estimate for the percentage of shared resources archived in a public archive.

$$\text{Content Lost Percentage} = 0.02\,(\text{Age in days}) + 4.20 \qquad (1)$$

$$\text{Content Archived Percentage} = 0.04\,(\text{Age in days}) + 6.74 \qquad (2)$$

Given these observations and our curve fitting we estimate that after a year from publishing about 11% of content shared in social media will be gone. After this point, we are losing roughly 0.02% of this content per day.
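
As a worked example, Equations 1 and 2 can be evaluated directly: at an age of 365 days they give roughly 11.5% lost and 21.3% archived, in line with the one-year figures quoted above. The sketch below also includes an ordinary least-squares helper, `fit_line`, to show how such a line could be fitted; it is an illustration only, and the points passed to it are synthetic values generated from Equation 1 rather than the study's data.

```python
def percent_lost(age_days):
    """Equation 1: estimated percentage of shared resources lost."""
    return 0.02 * age_days + 4.20


def percent_archived(age_days):
    """Equation 2: estimated percentage of shared resources archived."""
    return 0.04 * age_days + 6.74


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


one_year_lost = percent_lost(365)
one_year_archived = percent_archived(365)

# Synthetic points lying exactly on Equation 1, used only to
# check that the fit recovers the slope and intercept.
ages = [0, 100, 200, 300]
slope, intercept = fit_line(ages, [percent_lost(a) for a in ages])
```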

6 Conclusions and Future Work

We can conclude that there is a nearly linear relationship between time of sharing in social media and the percentage lost. Although not as linear, there is a similar relationship between the time of sharing and the expected percentage of coverage in the archives. To reach this conclusion, we extracted collections of tweets and other social media content that was posted and shared in relation to six different events that occurred in the time period from June 2009 to March 2012. Next we extracted the embedded resources within this social media content and tested their existence on the live web and in the archives. After analyzing the percentages lost and archived in relation to time and plotting them, we used a linear regression model to fit those points. Finally we presented two linear models that can estimate the existence of a resource that was posted or shared at one point in time in social media, on the live web and in the archives, as a function of its age in social media.

In the next stage of our research we need to expand the datasets and import other similar datasets, especially in the uncovered temporal areas (e.g., the year of 2010 and before 2009). Examining more datasets across extended points in time could enable us to better model these two functions of time. Also, several other factors besides time will be analyzed to understand their effect on persistence on the live web and archiving coverage, such as publishing venue, rate of sharing, popularity of authors and the nature of the related event.

7 Acknowledgments

This work was supported in part by the Library of Congress and NSF IIS-1009392.

References

1. Ainsworth, Scott G. and Alsum, Ahmed and SalahEldeen, Hany and Weigle, Michele C. and Nelson, Michael L.: How Much of the Web Is Archived? In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, pages 133-136, (2011).
2. Bar-Yossef, Ziv and Broder, Andrei Z. and Kumar, Ravi and Tomkins, Andrew: Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay. In Proceedings of the 13th international conference on World Wide Web, WWW '04, pages 328-337, (2004).
3. F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida: Characterizing User Behavior in Online Social Networks. In Proc. of ACM SIGCOMM Internet Measurement Conference, SIGCOMM '09, pages 49-62, (2009).
4. Lee, Chei and Ma, Long and Goh, Dion: Why Do People Share News in Social Media? Active Media Technology, Springer Berlin/Heidelberg, pages 129-140, Volume 6890, (2011).
5. Facebook official fact sheet, http://newsroom.fb.com/content/default.aspx?NewsAreaId=22
6. Kwak, Haewoon and Lee, Changhyun and Park, Hosung and Moon, Sue: What is Twitter, a Social Network or a News Media? In Proceedings of the 19th international conference on World wide web, WWW '10, pages 591-600, (2010).
7. Gordon Mohr, Michele Kimpton, Michael Stack and Igor Ranitovic: Introduction to Heritrix, an Archival Quality Web Crawler. In 4th International Web Archiving Workshop, IWAW '04, (2004).
8. Frank McCown and Norou Diawara and Michael L. Nelson: Factors Affecting Website Reconstruction from the Web Infrastructure. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '07, pages 39-48, (2007).
9. Michael L. Nelson, B. Danette Allen: Object Persistence and Availability in Digital Libraries. D-Lib Magazine, Volume 8, Number 1, January (2002).
10. M. E. J. Newman and J. Park: Why social networks are different from other types of networks. Phys. Rev. E, 68(3):036122, September, (2003).
11. Alex Nunns and Nadia Idle: Tweets From Tahrir. ISBN-10: 1935928457.
12. T. A. Phelps and R. Wilensky: Robust Hyperlinks Cost Just Five Words Each. Technical Report, UCB/CSD-00-1091, EECS Department, University of California, Berkeley, (2000).
13. Hany M. SalahEldeen, Michael L. Nelson: Losing My Revolution: A year after the Egyptian Revolution, 10% of the social media documentation is gone. http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
14. Robert Sanderson, Mark Phillips and Herbert Van de Sompel: Analyzing the Persistence of Referenced Web Resources with Memento. CoRR, arXiv:1105.3459, (2011).
15. Stanford SNAP Project Dataset, http://snap.stanford.edu/
16. Twitter numbers, http://blog.Twitter.com/2011/03/numbers.html
17. H. Van de Sompel, M. L. Nelson, R. Sanderson, L. L. Balakireva, S. Ainsworth, H. Shankar: Memento: Time Travel for the Web. Technical Report, arXiv:0911.1112, November, (2009).
18. Wan, X., Yang, J.: Wordrank-based Lexical Signatures for Finding Lost or Related Web Pages. In Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development, APWeb '06, pages 843-849, (2006).
19. C. Wilson, B. Boe, A. Sala, K. P. Puttaswamy, and B. Y. Zhao: User Interactions in Social Networks and their Implications. In Proceedings of the 4th ACM European conference on Computer systems, EuroSys '09, pages 205-218, (2009).
20. Wu, Shaomei and Hofman, Jake M. and Mason, Winter A. and Watts, Duncan J.: Who Says What to Whom on Twitter. In Proceedings of the 20th international conference on World wide web, WWW '11, pages 705-714, (2011).
21. Jaewon Yang and Jure Leskovec: Patterns of Temporal Variation in Online Media. In ACM International Conference on Web Search and Data Mining, WSDM '11, pages 177-186, (2011).
22. J. Yang and S. Counts: Predicting the Speed, Scale, and Range of Information Diffusion in Twitter. In 4th International AAAI Conference on Weblogs and Social Media, ICWSM '10, May, (2010).
23. D. Zhao and M. B. Rosson: How and Why People Twitter: The Role that Micro-blogging Plays in Informal Communication at Work. In Proceedings of the ACM 2009 international conference on Supporting group work, GROUP '09, pages 243-252, (2009).