
Understanding big data themes from scientific biomedical literature through topic modeling

Allard J. van Altena*, Perry D. Moerland, Aeilko H. Zwinderman and Sílvia D. Olabarriaga

RESEARCH, Open Access. van Altena et al. J Big Data (2016) 3:23, DOI 10.1186/s40537-016-0057-0

*Correspondence: a.j.vanaltena@amc.uva.nl. Department of Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center of the University of Amsterdam, 1105 AZ Amsterdam, The Netherlands

Abstract
Nowadays, big data is a key component in (bio)medical research. However, the meaning of the term is subject to a wide array of opinions, without a formal definition. This hampers communication and leads to missed opportunities. For example, in the (bio)medical field we have observed many different interpretations, some of which have a negative connotation, impeding exploitation of big data approaches. In this paper we pursue a better understanding of the term big data through a data-driven systematic approach using text analysis of scientific (bio)medical literature. We attempt to find how existing big data definitions are expressed within the chosen application domain. We build upon findings of previous qualitative research by De Mauro et al. (Lib Rev 65:122–135, 14), which analysed fifteen definitions and identified four key big data themes (i.e., information, methods, technology, and impact). We have revisited these and other definitions of big data, and consolidated them into eight additional themes, resulting in a total of twelve themes. The corpus was composed of paper abstracts extracted from (bio)medical literature databases, searching for 'big data'. After text pre-processing and parameter selection, topic modelling was applied with 25 topics. The resulting top-20 words per topic were annotated with the twelve big data themes by seven observers. The analysis of these annotations shows that the themes proposed by De Mauro et al. are strongly expressed in the corpus. Furthermore, several of the most popular big data V's (i.e., volume, velocity, and value) also have a relatively high presence. Other V's introduced more recently (e.g., variability) were however hardly found in the 25 topics. These findings show that the current understanding of big data within the (bio)medical domain is in agreement with more general definitions of the term.

Keywords: Text mining, Topic modelling, Big data, Biomedical research

The Author(s) 2016. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Background
The usage of the term 'big data' has picked up since 2011. This was the year that Gartner introduced "Big Data and Extreme Information Processing and Management" in its hype cycle [1]. Furthermore, increased interest is visible in the ever growing search traffic shown by Google Trends [2]. Scientific publications in (bio)medicine, which are our main interest in this study, also show a massive increase in the number of papers published yearly that mention big data [3].

Still, in spite of the popularity of this term, there is much debate about the definition of big data.
In 2001 Gartner (called "META Group" at the time [4]) published a report which in hindsight is often referred to as the first description of big data. It defines the term through information technology (IT) challenges described by three V's: volume, velocity, and variety [5]. Over the years this has evolved into many interpretations. Mostly, companies define big data in the light of their prime business, meaning that Google will mention analysis (e.g., Google Flu), while Oracle emphasises volume and storage [6], and IBM or Microsoft focus on computation and usability [7]. In a web-blog, posted on the data science sub-domain of the Berkeley school of information, 43 'thought leaders' from the industry were asked for their definition of big data [8]. Not many of these leaders agreed with each other and definitions range from "data that cannot fit easily into a standard relational database" to "big data is not all about volume, it is more about combining different data sets and to analyse it in real-time to get insights for your organisation". On a governmental level, the US National Institute of Standards and Technology (NIST) defined big data in 2014 as the need for scalable technology and four V's: volume, velocity, variety, and variability. Finally, in the scientific domain, big data is mostly understood as the challenges of working with large volumes of data [9–11].

Possibly due to this great variety of definitions, in practice we have observed many different interpretations of the term big data among (bio)medical scientists. Some understand big data as a positive development, and actively pursue usage of new methods and technology associated with the term [3]. Others, however, view it as a harmful influence on, for example, the strength of research evidence, preferring classical statistical methods [12]. A better understanding of big data would facilitate communication and clarify expectations regarding this overloaded term [13].
Some researchers have attempted to capture comprehensive definitions of big data, such as De Mauro et al. [14], Ward and Barker [15], and Andreu-Perez et al. [3]. The first two focus on no domain in particular, whereas Andreu-Perez et al. [3] focus on health-oriented applications. Of particular interest is the work by De Mauro et al., which analyses various big data definitions and distils its own definition from them. Their proposed definition is based on four themes found in the underlying definitions that were gathered, namely information, methods, technology, and impact. Note that all the cases mentioned above are based on qualitative literature studies. Hansmann and Niemeyer [16], however, used text mining to understand the themes included in big data literature. They combined automatic and manual approaches to identify three themes: IT infrastructure, methods, and data.

While these efforts have been valuable for a better understanding of the term big data, they do not present systematic evidence of the actual themes used in the scientific literature, in particular for the (bio)medical research domain.
In this paper we present our efforts to answer the following research question: Which themes from various existing big data definitions are expressed in (bio)medical scientific publications?

For this purpose, we adopted a data-driven systematic approach. First, big data definitions were revised and 12 themes were identified. Then, (bio)medical literature was systematically gathered from two scientific databases (i.e., PubMed and PubMed Central) and analysed automatically with text mining. While there are many text mining and clustering methods, we chose topic modelling (TM, [17, 18]) because this method captures two aspects that are important for this dataset: words may have multiple meanings or interpretations and documents may contain one or more topics. The topics identified through TM were annotated with the 12 themes by a small group of observers. In the following sections we detail the methods, present the results and discuss our findings.
Methods
In this section the construction of the corpus is described, followed by an explanation of the concepts behind TM. Then the application of TM to the corpus is presented in three steps: pre-processing, model fitting, and post-processing. Finally we present the gathering and summary of existing big data definitions, and the process used to identify them in the topics determined by TM.

Corpus
The corpus of documents was created by querying two literature databases focused on (bio)medical publications: PubMed and PubMed Central (PMC). The search queries were as follows:
PubMed: "big data"[TIAB] OR (big[TIAB] AND "health data"[TIAB]) OR "large data"[TI];
PMC: "big data"[TI] OR "big data"[AB] OR (big[TI] AND "health data"[TI]) OR (big[AB] AND "health data"[AB]) OR "large data"[TI].

Each query was built to search for literal use of the term 'big data', therefore selecting documents that were self-identified with big data. No word spacing was allowed to minimise the number of irrelevant results. The terms 'big health data' and 'large data' were added because they also retrieved relevant literature, especially for publications before 2011, when the term big data was not popular yet.

Titles and abstracts were exported from the databases and merged into a local repository for further processing. Based on the title (stripped of all special characters and spaces) or the digital object identifier (DOI, if available), duplicates were removed from the corpus. Lastly, any record with an empty abstract (i.e., not provided in the database) was also removed from the corpus.
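The corpus cleaning described above can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the record layout (dictionaries with `title`, `doi` and `abstract` keys), the function names, and the lower-casing of titles and DOIs are assumptions made for the example. For simplicity it filters empty abstracts and duplicates in a single pass.

```python
import re


def normalise_title(title):
    """Lower-case a title and strip everything except letters and digits.

    Lower-casing is an assumption of this sketch; the paper only states that
    special characters and spaces were stripped.
    """
    return re.sub(r"[^a-z0-9]", "", title.lower())


def build_corpus(records):
    """Keep the first record per DOI or normalised title, skipping empty abstracts.

    `records` is a hypothetical list of dicts with 'title', 'doi', 'abstract'.
    """
    seen_dois, seen_titles, corpus = set(), set(), []
    for rec in records:
        abstract = (rec.get("abstract") or "").strip()
        if not abstract:
            continue  # abstract not provided in the database
        doi = (rec.get("doi") or "").strip().lower()
        title_key = normalise_title(rec.get("title") or "")
        if (doi and doi in seen_dois) or (title_key and title_key in seen_titles):
            continue  # duplicate by DOI or by stripped title
        if doi:
            seen_dois.add(doi)
        if title_key:
            seen_titles.add(title_key)
        corpus.append(rec)
    return corpus
```

Matching on either key is what lets a PubMed record without a DOI be recognised as a duplicate of its PMC counterpart.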
Topic modelling concepts
A specific type of TM was chosen, namely Latent Dirichlet Allocation (LDA) [17]. Throughout this paper the abbreviations TM and LDA are used interchangeably to indicate topic modelling through the application of LDA.

The concept of TM is captured in Fig. 1 using the plate notation [17–19]. Plate D denotes the set of documents, while θ(d) is the multinomial distribution over topics for document d. Plate N(d) denotes the set of words w for a specific document d, while z is the topic to which word w is assigned. Lastly, plate T denotes the set of topics where φ(z) is the multinomial distribution over words for topic z. In TM, θ, φ, and z are the latent variables that have to be estimated. Together with the Dirichlet distributed hyperparameters α and β, the model is called Latent Dirichlet Allocation [17, 19]. The hyperparameters α and β should be interpreted as smoothing factors for respectively topic-to-document (θ) and word-to-topic (φ) assignments.
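For readers who prefer formulas to plate notation, the generative process can be written out as below. This is the standard smoothed LDA formulation with symmetric Dirichlet priors, using the symbols introduced above; the index i over word positions is introduced here for illustration only.

```latex
\begin{align*}
\theta^{(d)} &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
\phi^{(z)} &\sim \mathrm{Dirichlet}(\beta), & z &= 1, \dots, T \\
z_i \mid \theta^{(d)} &\sim \mathrm{Multinomial}\big(\theta^{(d)}\big), & i &= 1, \dots, N^{(d)} \\
w_i \mid z_i, \phi &\sim \mathrm{Multinomial}\big(\phi^{(z_i)}\big)
\end{align*}
```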
Topic modelling implementation
The statistical software R [20] was used to implement the pre-processing, TM fitting, model selection, and post-processing steps.

Pre-processing was executed using the R tm and quanteda packages [21, 22]. Processing consisted of removing stop words taken from the SMART list [23, 24] (e.g., about, the, which) [footnote 1]. Extra stop words were added, which were either junk words resulting from processing steps, or terms that appeared very often and diluted the TM outcome, such as 'big data', 'introduction' and 'discussion' [footnote 2]. From the remaining words, bi-grams were created with function dfm: two words that occur next to each other at least 15 times in the whole corpus are joined by an underscore (e.g., health_care). Furthermore, words were stemmed with function stemDocument; e.g., 'develop', 'developed', and 'development' were all stemmed to 'develop'. Lastly, words longer than 26 characters were removed.
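The pre-processing steps can be illustrated with a short Python sketch. It is not the authors' R code (which relied on tm and quanteda): the stop word set is a small illustrative subset, stemming is omitted, and the choice to replace a frequent word pair by its joined form (rather than adding the bi-gram next to the single words) is an assumption.

```python
from collections import Counter

# Illustrative subset only; the paper used the SMART list plus extra terms.
STOP_WORDS = {"about", "the", "which", "big", "data", "introduction", "discussion"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in stop_words]


def count_bigrams(docs):
    """Count adjacent word pairs over the whole corpus."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return counts


def join_bigrams(tokens, frequent):
    """Greedily join frequent adjacent pairs with an underscore, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def preprocess(docs, min_bigram_count=15, max_word_length=26):
    """Stop word removal, bi-gram joining and length filtering (no stemming)."""
    docs = [remove_stop_words([t.lower() for t in d]) for d in docs]
    counts = count_bigrams(docs)
    frequent = {pair for pair, n in counts.items() if n >= min_bigram_count}
    docs = [join_bigrams(d, frequent) for d in docs]
    return [[t for t in d if len(t) <= max_word_length] for d in docs]
```

The thresholds 15 and 26 are the values reported in the text; everything else in the sketch is a simplification.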
Fitting the model consisted of estimating the latent variables θ, φ and z, which was done with the R topicmodels package [26]. Directly calculating θ and φ was shown to be suboptimal [19], therefore we used a Bayesian approach from the topicmodels package using Gibbs iterative sampling to approximate the distribution z. In this sampling process the probability of a word occurring in a topic is estimated. This probability of a given word-to-topic assignment is calculated from how often the word already occurs in the topic and how dominant the topic is for the document from which the word was sampled. Once the model fitting converges, θ and φ can be derived from the approximated distribution z with the posterior function.
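The word-to-topic assignment probability described above corresponds to the usual collapsed Gibbs sampling update for LDA. The sketch below shows that update for a single word occurrence in plain Python. It is an illustration of the principle, not the topicmodels implementation, and the function and argument names are hypothetical.

```python
def topic_weights(word_topic_count, topic_total, doc_topic_count,
                  alpha, beta, vocab_size):
    """Probability of assigning one word occurrence to each topic.

    word_topic_count[t]: how often this word is currently assigned to topic t
    topic_total[t]:      total number of words currently assigned to topic t
    doc_topic_count[t]:  words of the current document assigned to topic t
    All counts exclude the occurrence that is being resampled.
    """
    weights = []
    for n_wt, n_t, n_dt in zip(word_topic_count, topic_total, doc_topic_count):
        # How often the word already occurs in the topic, smoothed by beta.
        word_part = (n_wt + beta) / (n_t + vocab_size * beta)
        # How dominant the topic is in the document, smoothed by alpha.
        doc_part = n_dt + alpha
        weights.append(word_part * doc_part)
    total = sum(weights)
    return [w / total for w in weights]
```

A sampler would draw the new topic for the word from these probabilities, update the counts, and repeat over all words for a fixed number of iterations.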
Multiple models were fitted to determine the best TM parameters. We first conducted experiments to find adequate values for α and β. These influence the model as follows: with a small α (i.e., with many topics α = 50/T becomes smaller) it is likely for documents to contain only a few topics, whereas a bigger α (i.e., few topics) results in more topics per document. A small β similarly makes it likely for a topic to contain a mixture of a few words, thereby pushing the model to select highly specific words per topic. A range of values was fitted for both α and β and model outcomes were compared. Within a reasonable range (i.e., 0.1 < α < 1) we observed only minor differences between topics. Ultimately, fixed values were chosen for α and β, respectively 50/T and 0.01 as suggested in the literature [19, 27].

Footnote 1: The full list can be found at [25].
Footnote 2: The complete list is: big, data, ieee, discussion, conclusion, introduction, methods, psycinfo database, rights reserved, record apa, journal abstract, apa rights, psycinfo, reserved journal.

Fig. 1 Plate notation of topic modelling; plates are shown as rectangles and the arrows are conditional dependencies. Shows the relations between known variables (documents D, number of words N(d), and words w), latent variables (multinomial distributions θ(d) and φ(z), and word to topic assignment z), and hyperparameters (α and β)
For model selection we analysed the likelihood for varying numbers of topics in the range T ∈ {5, 10, 15, 100, 150, 200, 500}. However, likelihood alone cannot be used to find the best model. A penalising factor has to be added for the model's complexity (i.e., the number of variables that have to be estimated). Two information criteria were considered, namely the Bayesian Information Criterion (BIC) [28] and the Akaike Information Criterion (AIC) [29]. When increasing the number of topics in a model, each topic becomes more specific and, therefore, easier to interpret. BIC puts more emphasis on the simplicity (in terms of the number of free parameters) of the model, resulting in a smaller number of topics as compared to AIC. We therefore chose to perform model selection using the AIC. In the case of TM, the variables to be estimated are the latent variables φ and θ, which grow with the number of topics. The model where the AIC reached its minimum was considered the optimal model.
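As an illustration of AIC-based selection, the sketch below computes the criterion with the parameter count (T − 1) + T(W − 1) of Eq. (1) and picks the number of topics with the lowest score. The sign convention (−2 log L plus twice the parameter count) is the standard AIC and is assumed to be what Eq. (1) expresses; the log-likelihood values in the usage comment are made up.

```python
def aic(log_likelihood, n_topics, n_unique_words):
    """AIC of a topic model with T topics and W unique words."""
    n_params = (n_topics - 1) + n_topics * (n_unique_words - 1)
    return -2.0 * log_likelihood + 2.0 * n_params


def select_model(log_likelihoods, n_unique_words):
    """Return the T with minimal AIC and all scores.

    `log_likelihoods` is a hypothetical dict mapping T to the fitted
    log-likelihood of model M_T.
    """
    scores = {t: aic(ll, t, n_unique_words) for t, ll in log_likelihoods.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical usage with invented log-likelihoods:
# best_t, scores = select_model({5: -910000.0, 14: -860000.0, 25: -855000.0}, 7849)
```

Because the penalty grows linearly in T × W, a larger model is only preferred when its gain in log-likelihood outweighs the extra word-to-topic parameters.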
Equation (1) defines the AIC, where T is the number of topics in model M_T, L is the likelihood of model M_T, and W is the number of unique words in the corpus.

Post-processing consisted of retrieving θ and φ for the optimal model, and calculating the relevance of words within a topic according to the method described by Sievert et al. [30]. Equation (2) defines how relevance r was calculated for word w in topic t given λ. The relevance is a convex combination of two measures: the topic-specific distribution (φ_tw) and 'lift' (φ_tw / p_w), which is a ratio between topic-specific and corpus-wide distributions. These measures can be balanced with 0 ≤ λ ≤ 1, by giving more weight to φ (λ = 1) or to the lift (λ = 0). In our experiments a value of 0.6 was chosen for λ, as suggested in Sievert et al. [30]. T × W relevancies were calculated (i.e., each word had one relevance score per topic) and used to sort the most relevant words per topic.
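The relevance ranking can be sketched in a few lines of Python. The dictionaries `phi_t` (word probabilities within one topic) and `p` (corpus-wide word probabilities) are hypothetical inputs for illustration; the formula itself follows the description above, with λ = 0.6 as the default.

```python
import math


def relevance(phi_tw, p_w, lam=0.6):
    """Relevance of a word in a topic: weighted sum of log(phi) and log(lift)."""
    return lam * math.log(phi_tw) + (1.0 - lam) * math.log(phi_tw / p_w)


def top_words(phi_t, p, lam=0.6, n=20):
    """Return the n most relevant words for one topic.

    phi_t: dict word -> probability of the word in the topic
    p:     dict word -> probability of the word in the whole corpus
    """
    scores = {w: relevance(phi_t[w], p[w], lam) for w in phi_t if phi_t[w] > 0}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

With λ = 1 the ranking reduces to the most probable words of the topic, whereas lower values promote words that are rare in the corpus but concentrated in the topic.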
Big data definitions
The definition proposed by De Mauro et al. was used as a starting point for this study. Furthermore, the underlying definitions gathered in De Mauro et al. were reassessed and where necessary updated (e.g., updates in white papers published by industry). Lastly, a publication by Andreu-Perez et al. [3] was added because it defined six big data V's in the context of (bio)medical research.

All the definitions were analysed. If the definition was given in free text, the major themes were extracted. Themes were then grouped on similarity, for example, volume and size were merged into one theme. For various reasons a few definitions were discarded, as discussed in the "Big data definitions" section.
\[ \mathrm{AIC}(M_T) = -2 \log(L) + 2\big((T - 1) + T(W - 1)\big) \tag{1} \]

\[ r(t, w \mid \lambda) = \lambda \log(\phi_{tw}) + (1 - \lambda) \log\!\left(\frac{\phi_{tw}}{p_w}\right) \tag{2} \]

Topic analysis
Topic model results were analysed manually by inspecting the top relevant words (i.e., 20 per topic). The observers received a list of topics and a description of each theme. They were instructed to read all the words in each topic, then consult the big data definition themes, and finally provide their opinion about which themes are associated with that set of words. Each of the topics was assigned zero, one, or more themes by each observer individually. In total seven persons performed the analysis independently: each of the authors and three external health data scientists.
Results
This section reports the results of corpus extraction, TM model fitting and selection, gathering and consolidation of big data definitions, and annotation of topics with the themes.

Corpus
A total of 1659 documents were extracted from PubMed and 543 from PubMed Central. After removing duplicates and records with an empty abstract, 1308 documents were included in the corpus as shown in Fig. 2. After pre-processing 136,339 words remained in the corpus, of which 7849 were unique. A large portion (7081 words) had a low frequency (< 40 occurrences). Figures 3 and 4 give an impression of the corpus's contents. Figure 3 shows a frequency plot of the top 2000 words, which seems to be in accordance with Zipf's law [31]. To create the word cloud in Fig. 4, the top 100 most frequent words were extracted (as marked with the vertical line in the frequency plot).
Topic modelling and model selection
In total 49 models M_T were fitted with T ranging between 5 and 500. The AIC curve for all fitted models M is shown in Fig. 5. The minimum of the AIC curve lies at T = 14, however the differences are small until T = 25. We also calculated the distances between topics from diverse models (T ∈ {14, …, 25}), which showed that topics are fairly stable (data not shown). When increasing the number of topics, changes observed include one topic splitting into two topics or a new topic appearing. We saw no major reorganisation of topics or words within topics. We also observed that increasing the number of topics in the model makes the terms in each individual topic more specific. For example, one topic covering both application and big data themes might be split into two separate topics in a larger model. We therefore selected M25 for annotation, as this model has a better interpretability compared to M14 (more specific topics), with comparable quality of model fit (similar AIC).

To assess the robustness of the model M25, the log-likelihood was tracked for each iteration of Gibbs sampling. This model was fitted three times with fixed input, but with different starting seeds for the sampling. The outcome of these fits is presented in Fig. 6. It shows that the log-likelihood reaches its approximate maximum after 100–150 iterations. Models run with a higher number of iterations (up to 4000, data not shown) showed no major difference in log-likelihood convergence, therefore, final models such as M14 and M25 were run for 500 iterations. The top-20 most relevant words per topic of the M25 model are shown in Table 4.

Fig. 2 Corpus generation: documents extracted per literature database, documents removed from the corpus, and total number of included documents

Fig. 3 Frequency of the top 2000 unique words in the corpus (x-axis: word rank; y-axis: frequency). The vertical line is the cut-off point (n = 100) used for the word cloud

Fig. 4 Word cloud of the top-100 unique words in the corpus

Fig. 5 Left: AIC curve of the 49 fitted TM models (20 models between T = 5 and T = 30 not plotted, see right). The minimum is marked by the dotted line (T = 14). Right: Close-up of the AIC curve between T = 5 and T = 30, showing 26 fitted TM models. The minimum is marked by the dotted line (T = 14)
Big data definitions
In total 17 definitions of big data were considered from the following sources [3, 5, 6, 14, 15, 32–43]. Table 1 presents the results of our analysis listing the found themes, their description, and respective sources. Note that we have not attempted to consolidate the names of the themes, leaving the complete description as found in the sources. The definitions can be divided into three groups, with each group containing multiple themes.

The first group (I) corresponds to the big data V's, which occur in various forms in many of the analysed definitions. Some words were merged into one theme because they are essentially synonyms of each other. For example: volume, size, voluminous, and cardinality were found in ten of the definitions and, from their descriptions, refer to the amount of data. Also note that velocity and continuity, and complexity and variety were combined.

The second group (II) corresponds to the aggregated themes proposed by De Mauro et al., which represent concepts of a higher level of abstraction than the previous group.

The third group (III) includes a theme identified in three definitions, which describe big data as data that is beyond conventional processing and analysis. The V's describe data by many different aspects, but none of those define a hard limit beyond which data becomes big. The theme 'beyond conventional' therefore describes big data as something that needs novel specialised and scalable solutions. This also means that the types of problems and applications that are assigned to the scope of big data change over time, as technology and methods evolve and improve.

The fourth group (IV) was not found in the studied definitions, but was added to cope with the reality of our data. Because the body of literature used in this study was obtained from (bio)medical literature databases, we expected to see application-related themes to be strongly represented in the resulting topics. We therefore included the Application theme to classify those topics that do not fall under big data.

Note that some definitions considered by De Mauro et al. were not used here: the definition by Microsoft [40] was a web-blog post from 2013, therefore possibly outdated; Shneiderman et al. [41] does not specifically mention big data, as it was a publication from 2008 when this term was not in use yet; the definition by Manyika et al. [43] was only described in the executive summary; Mayer-Schönberger et al. [42] propose an abstract definition that was considered too difficult to convert into interpretable themes for topic analysis.

Fig. 6 Convergence of the log-likelihood for the chosen optimal model M25 for three runs starting from different seeds (x-axis: iteration; y-axis: log likelihood)

Table 1 Description of themes identified in big data definitions from literature
Group | Theme name | Theme description | Definition sources
I | Volume, size, voluminous, cardinality | Large quantities of data in number of bytes; size of available data (e.g. all records instead of a sample); beyond conventional storage techniques; number of records at a particular instance | [3, 5, 6, 15, 32–34, 36, 37, 39]
I | Velocity, continuity | Flow rate at which data is created, stored, analysed, and visualised; increased through invention of new data streams such as social media; beyond conventional means of processing, needing new techniques such as streaming; growth of data over time | [3, 5, 6, 32–34, 37]
I | Variety, complexity | Many different types of data; not bound to a traditional data format; format changes over time; heterogeneous and unstructured data | [3, 5, 6, 15, 32–34, 36, 37, 39]
I | Veracity | Trustworthiness of data; reliability of data quality and gathering environment | [3, 32]
I | Value | Worth/relevancy of data (e.g. economic, individual/privacy, societal, humanity value) | [3, 6, 38]
I | Variability | Consistency of data over time; influences which systematically change data measures over time | [3, 34]
II | Information | Where signals are turned into data (e.g. book digitalisation, or gathering from personal device measurements) | [14]
II | Technology | Tools, systems, and software (e.g. scalable processing and transmission systems such as Hadoop) | [14, 15, 34–36, 38]
II | Methods | Procedures and their application (e.g. clustering, natural language processing, machine learning, neural networks, visualisation) | [14, 35, 38]
II | Impact | Ethical, business, societal | [14]
III | Beyond conventional | Data whose size calls for methods beyond the tried-and-true; necessity of scalable systems for storage, processing, manipulation, analysis, visualisation | [35–37]
IV | Application | About the application domain treated in the papers | –
Topic analysis
The list of topics and words and the big data themes were analysed by the seven observers. The observers all worked at the local department of epidemiology, biostatistics and bioinformatics, therefore they were extremely suitable for the annotation task. The big data themes (Table 1) and topic words (Table 4) were well understood and the task could be finished without further help in a reasonable amount of time (30 min to an hour).

The raw annotation results are displayed per observer and per topic in Table 2. Note that some observers did not assign any theme to some topics, and that in many cases more than one theme was assigned to the topics. Table 3 presents the frequency of themes assigned per topic, highlighting high or unanimous agreement among the observers (shown underlined and bold). It also shows the overall themes, i.e., those that were assigned to a topic by at least four observers. In four topics (i.e., 3, 17, 19 and 25) no theme was assigned by four or more observers. Out of the remaining 21 topics, five had unanimous agreement between the observers for some theme (i.e., 6, 7, 8, 20 and 21). The remaining 16 topics could be split into topics with a single overall theme (i.e., 2, 4, 9, 10, 11, 13, 14, 15, 16, 18, 22, 24) and topics with two overall themes (i.e., 1, 5, 12, 23).
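The rule for deriving overall themes (a theme counts when at least four of the seven observers assigned it to the topic) can be sketched as follows. The function names are hypothetical; the example annotations in the usage note are those of topics 20 and 3 as listed in Table 2.

```python
from collections import Counter


def theme_counts(annotations):
    """Count how many observers assigned each theme to one topic.

    `annotations` holds one list of theme codes per observer; an observer
    who assigned no theme contributes an empty list.
    """
    counts = Counter()
    for themes in annotations:
        counts.update(set(themes))  # count each theme once per observer
    return counts


def overall_themes(annotations, min_observers=4):
    """Themes assigned to the topic by at least `min_observers` observers."""
    counts = theme_counts(annotations)
    return sorted(t for t, n in counts.items() if n >= min_observers)
```

For topic 20 the seven annotations are six times "met" and once "met, info", which yields the single overall theme Methods with unanimous agreement. For topic 3 only three observers assigned a theme, so no overall theme results.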
Note that the most frequently assigned theme was Application (66 times), followed by the themes in the second group, proposed by De Mauro et al. From the themes in the first group, volume and velocity occurred more often than the others. Notably, variability was hardly identified among these topics.

Figure 7 presents the distribution of topics over documents based on the probability of each topic to each document (i.e., θ). The large majority of topics (in black) have a strong presence in only a few hundred documents. However, there are four topics (in red and blue) that deviate from this pattern. The two red topics (topic 1 and 2, see Table 4) have a stronger presence in more documents as compared to the topics pictured in black. The blue topics (topic 3 and 5, see Table 4) have a stronger presence in nearly all documents.
Discussion
In this paper we attempted to identify themes related to big data definitions in a large corpus of (bio)medical literature through topic modelling. We have followed a structured and objective approach as much as possible. This process delivered novel and interesting results, which however need to be carefully interpreted due to remaining limitations in our study.

Identification of themes in big data definitions
Due to the lack of a consolidated and widely accepted definition of big data, it was necessary to consult a large number of scientific papers. This work is limited to scientific literature, but obviously there are many other definitions of big data that have not been considered in our work, such as the Berkeley blog mentioned in the introduction [8]. Nevertheless, most of the definitions in [8] can be mapped to the themes identified in this study. Interestingly, the word cloud in [8] highlights words such as size, complex, and techniques, which are also found in the descriptions of the themes consolidated in Table 1. Furthermore, there are qualitative approaches to describing the big data field in
Furthermore,therearequalitativeapproachestodescribingthebigdatafieldinTable2RawannotationresultsperobserverThefollowingcodingisusedtorepresentthethemesdescribedinTable1:volvolume,velovelocity,veraveracity,infoinformation,metmethods,techtechnology,impimpact,appapplication,beyondbeyondconventionalTopicThemeassignmentgroupedbyobserverABCDEFG1Imp,valueValueApp,imp,valueVera,valueimp,app,veraImp,value2Vera,appImp,appInfo,appVera,veloAppTech,variety,vera3Imp,appAppApp4MetMetVol,metMetTech,metTech,veloMet5Vol,velo,beyondTechVol,tech,beyondBeyond,vol,veloTech,complex,beyondVolVol6TechTechTech,veloTech,beyondTech,beyondTechTech,variety,vera7MetMetVera,metMetTech,met,info,appMetMet8AppAppInfo,appApp,infoAppAppVariety,app9AppImpImpImpValue,imp,app10AppMet,techVariety,info,metApp,metAppApp,variety,infoVol,beyond11AppAppAppApp,ImpAppAppImp,value12Tech,vol,veloVolVol,veloVol,velo,beyondTech,vol,veloVol,veloMet,vol13Variability,veraMetMetMetApp,infoMetMet14InfoInfoTech,appApp,infoImpInfoValue,imp,app15ImpAppImpAppInfo,appApp,impValue,vera16AppMetAppInfo,appInfo,appAppBeyond,vol17ValueInfoTech,beyondInfoContinuity,variabilityTechValue,tech18AppMetInfoApp,infoMet,app,tech,infoAppVol,vera19ValueAppMet,appInfoContinuity,appVarietyTech,imp20MetMetMetMetMet,infoMetMet21AppAppAppApp,impInfo,appAppVariety,app,vera22Info,veloInfoInfo,appInfo,veraVelo,continu-ity,appApp,infoInfo23Info,appAppInfo,appInfoInfoApp,infoBeyond,vol,vera,info24ValueAppInfo,appInfo,appContinuity,info,impAppVol,variety25MetMetInfoInfo,met,techVol,veloVeloTotal33223940533549Page12of21vanAltenaetal.
JBigData(2016)3:23Table3Summedannotationspertopicandtheme,andoverallthemepertopic(≥4counts)TopicThemesOverallVolumeVelocityVarietyVeracityValueVariabilityInformationTechnologyMethodsImpactBeyondcon.
Application12542Value,Impact21131114Application313–41126Methods552134Volume,Beyondconventional6172Technology711171Methods8127Application9142Impact101221314Application11126Application12651211Volume,Velocity1311151Methods1414112Information1511134Application1612115Application17111231–18113124Application1911111113–2017Methods2111117Application222163Information2311614Application,Information241111314Application2512213–total1717812142392436191166Page13of21vanAltenaetal.
publications such as Chen et al. [13] and Tsai et al. [44]. Note that, although these works do not strive to deliver a formal definition, the description of the big data field in both these publications includes the same aspects found in the definition themes.

We have observed a large overlap among the big data definition literature considered in this study, nevertheless with variations in the focus applied by each author. Furthermore, certain themes occur more often than others in the definitions (Table 1). The original three V's (volume, velocity, variety) occur in many definitions compared to the relatively 'newer' V's (veracity, value, variability), which are present in only a few. This is also the case with Technology and Methods, which are found in definitions more often than Information and Impact.

Finally, as the corpus was gathered from (bio)medical literature databases, we expected to find topics describing this domain. Therefore the theme 'Application' has been introduced, which is obviously not found in the published big data definitions. Indeed, the annotation results presented in Table 3 show that 10 out of 25 topics have been annotated with Application by the majority of the observers. Note that the large fraction of application-related words might have overshadowed others that are related to big data themes. Scrubbing the corpus of application-related words could be used to circumvent this problem. This opens the possibility for fitting highly granular models that would be more easily interpretable and better reflect big data instead of the research field topics.

Table 4 Top 20 words for the 25-topic model identified with TM
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5
Health | Patient | Article | Algorithm | Challenged
Research | Clinic | Review | Cluster | Analyte
Healthcare | Hospital | Discuss | Learn | Tool
Policies | Electron | Field | Method | Amount
Health_care | Care | Recent | Feature | Technologic
Privacies | Outcome | Issue | Efficiencies | Computability
Nation | Medicaid | Aspect | Approximate | Analysing
Ethic | Record | Focus | Tree | Require
Protect | Ehr | Emerge | Represent | Advance
Govern | Clinical_research | Future | Fast | Varieties
Inform | Health_record | Highlight | Matrix | Solution
Secure | Clinician | Current | Accuracies | Growth
Challenged | Treatment | Context | Problem | Large_amount
Share | Improve | Overview | Distance | Massive
Concern | Assess | Paper | Hierarchical | Generate
Access | Healthcare | Paradigm | Computability | Dataset
Communities | Qualities | Confer | Faster | Vast
Fund | Potential | Natural | Calculate | Process
Health_informatics | Patient_care | Technologic | Graph | Handle
Health_system | Routine | Literature | Outperform | Infrastructural

Topic 6 | Topic 7 | Topic 8 | Topic 9 | Topic 10
System | Model | Age | Change | Network
Process | Predict | Risk | Nurse | Molecular
Device | Infer | Influenza | Innovated | Structural
Framework | Statistic | Indicating | Science | Biomarker
Cloud | Regress | Exposure | Social | Complex
Architectural | Simulate | Cohort | Question | Heterogeneities
Hadoop | Predictor | Rate | Historian | Integral
Applicability | Bayesian | Symptom | Influence | Systems_biology
Service | Fit | Month | Practical | Mechanical
Manage | Good | Yearbook | Insight | Omic
Platform | Optimal | Variable | Cultural | Approach
Design | Prior | Life | Turn | Character
Mapreducable | Base | Death | Product | Dynameomics
Computability | Variable | Diabetes | Food | Function
Base | Machine_learning | Adjust | Societies | Biologic
Support | High_dimensional | Geographic | Understand | Transit
Implement | Tradition | Condition | Drive | Rdge
Task | Rank | Factor | Evolution | Topological
Deploy | Parameter | Demographic | Scientific | Protein
Cloud_computing | Feature | Incidence | Principle | Organ

Topic 11 | Topic 12 | Topic 13 | Topic 14 | Topic 15
Disease | Dataset | Effect | Search | Biomedical
Prevent | Time | Group | Social_media | Informatic
Epidemiologic | Sample | Measurable | Language | Science
Vaccination | Large_scale | Testable | Google | Medicinal
Progress | Computability | Estimate | Word | Medicaid
Immune | Speed | Analysing | Public | Educate
Leverage | Performance | Studied | Relate | Research
Popular | Increased | Statistic | Psychological | Learn
Initial | Approach | Bias | Trend | Personalized_medicine
Develop | Thousand | Large | Emoticon | Era
Heart | Step | Eandom | Twitter | Ontological
Administration | Rate | Valuable | Message | Disciplinary
Intervention | Implement | Power | Online | Translate
Generate | Full | Method | Relationship | Student
Blood | Memorial | Sample_size | Social | Scientist
Advance | Scale | Marker | Visit | Train
Public_health | Hundred | Find | Content | Impact
Reported | Block | Large_set | Caseness | Workshop
Consensus | Applicability | Import | Posit | Discoveries
Earlier | Multiple | Error | Investigacin | Knowledge

Topic 16 | Topic 17 | Topic 18 | Topic 19 | Topic 20
Genet | Web | Sequence | Mine | Classifiable
Gene | Resource | Genome | Knowledge | Set
Associating | Code | Bioinformatic | Extract | Object
Phenotype | File | Proteome | Inform | Large_set
Pathway | Laboratories | High_throughput | Chemical | Class
Disease | Public | DNA | Specialised | Noise
Genotype | Compress | Transcriptome | Plant | General
Factor | Semantic | Protein | Biologic | Pair
Enrich | Software | Composite | Concept | Performance
Trait | Retrievable | Ngs | Develop | Abilities
Genome_wide | Access | Metagenome | Toxic | Neural_network
Metabolic | Share | Virus | Construct | Similar
Genome | Format | Analysing | Note | Train
Mutated | Inform | Host | Curate | Dimension
Number | Interface | Biologic | Rich | Machine
Identifi | Source | Assemble | Gap | Categorical
Polymorphism | Platform | Cell | Preservation | Appliance
Individual | Metadata | Microbiome | Ecological | Formula
Regular | Storage | Align | diverse | Encounter
Unification | Exchange | Human | Abstract | Coefficient

Topic 21 | Topic 22 | Topic 23 | Topic 24 | Topic 25
Drug | Visual | Image | Cancer | Low
Target | Activated | Brain | Studied | Reduce
Cell | Human | Disorder | Tumor | Time
Event | Behavior | Signal | Valid | Base
Screen | Mobile | Subject | Research | Reduction
Response | Environment | Resolution | Registries | Digital
Experiment | Interact | Neuroimaging | Therapeutic | Node
Detected | Exploration | Function | Database | Energies
Analyse | User | Neuron | Injuries | Deep
Adversary | Collect | Segment | Oncologist | Small
Multiple | Sensor | Psychiatric | Clinical_trials | Cost
Compound | Tool | Connectome | Claim | Size
Profile | Wearable | Neuroscience | Therapies | Numerator
Miss | Quantifiable | Mode | Efficacies | Operability
Type | Track | Mri | Diagnostic | Combina
Potential | Movement | Scan | Heterogeneities | Peak
Combina | Physical | Quantitation | Set | Spectral
Meta | Display | Analysing | Specific | Structural
Complete | Smartphone | Microscopic | Ongoing | Locate
Point | Interest | Multi | Consortium | Qualities

Fig. 7 Distribution of topics over documents (i.e., θ, y-axis: topic to document probability). The documents are sorted on topic-to-document relevance within each topic. The x-axis represents the order of the sorted documents (ordered document index). Each line represents one topic, in black. Exceptions are topics 1 and 2, plotted in red, and topic 3 and 5, plotted in blue
JBigData(2016)3:23CorpusgatheringBydesign,inthisstudyweonlyconsideredpapersthatwereself-annotatedwithbigdata,whateverdefinitiontheauthorsmighthaveused.
This led to an interesting observation by one observer who could not find his research domain in any of the topics. However, the searched databases certainly included this domain and many of the big data themes could potentially be assigned to its papers. The domain could be missing due to various reasons, such as a low frequency of this research domain in the corpus. However, this observer acknowledged that he considers his domain 'conventional'; papers published about this research domain therefore most likely do not mention big data and were not captured in the search performed in this study.

Note also that we only considered two databases, whereas many others could be included as well (e.g., Scopus or Ovid). Nevertheless, PubMed and PMC are important sources in medical research and therefore have been considered sufficiently representative for the purposes of our study.

Finally, a potential limitation of our study is that only abstracts were included in the corpus instead of full-text papers. Our assumption is that the abstracts contain the essence of a paper and are therefore representative of the actual themes found in a full paper. Moreover, it is currently still difficult to retrieve and parse full papers in an automated fashion, which would have severely limited the number of papers considered in our study.

Automatic identification of topics
In the course of this research various text mining approaches were attempted to identify relevant topics to characterise the publications. First, we attempted to use AlchemyAPI [45], a natural language processing service that is accessible through the web. However, in a pilot experiment of 100 documents we observed that the number of results produced would be too big for effective analysis (i.e., 3774 results, of which 3006 were unique). Moreover, AlchemyAPI's method is implemented by proprietary code, so relations between documents and results were difficult to interpret.

We continued searching for a text mining method and considered document clustering to find the definition themes in literature. In principle, document clustering could capture themes but results are often limited to one theme per document. Furthermore, analysing document clusters to find definition themes would be a non-trivial (if not impossible) task.

A seemingly more suitable method was topic modelling, a method that can discover latent semantics in text. The main purpose of topic models is described as "discovering main themes that pervade large unstructured collections of documents" [18]. Furthermore, TM captures multiple meanings of words, but most importantly, it can identify multiple topics for each observed document.
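
The difference between one theme per document and multiple topics per document can be illustrated with a single row of the document-topic distribution θ. This is a minimal sketch: the probabilities and the 0.2 cut-off are made-up values for illustration, not settings from this study.

```python
def hard_assignment(theta_row):
    """Clustering-style view: index of the single most probable topic."""
    return max(range(len(theta_row)), key=lambda k: theta_row[k])


def topics_above(theta_row, threshold=0.2):
    """Topic-model view: every topic whose probability reaches the threshold.

    The threshold is an assumed, illustrative cut-off.
    """
    return [k for k, p in enumerate(theta_row) if p >= threshold]


# Hypothetical topic probabilities for one document (they sum to 1).
theta_row = [0.45, 0.05, 0.40, 0.10]

print(hard_assignment(theta_row))  # a single topic index
print(topics_above(theta_row))     # all topics with substantial weight
```

A hard assignment would discard the second strong topic of this document, whereas the mixture view keeps both.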

The LDA approach is perhaps the most popular and common topic model. The R package implementing the algorithm, topicmodels, had 22,576 downloads in 2015 (http://cran-logs.rstudio.com/ on 9 June 2016). Moreover, the paper describing the underlying model by Blei et al. [17] has been cited over 16,000 times (https://scholar.google.com/ on 20 October 2016). We therefore chose to use the LDA implementation of TM because of its appropriateness for our data, the relative ease of use of this approach (i.e., ready to use implementations in R), and extensive use in the literature by our peers.

Various TM approaches were tried to find a model with a manageable number of topics which allowed for manual annotation. The largest challenges were encountered during model selection. Two model evaluation methods (i.e., perplexity and harmonic mean) are often used in TM literature [16, 19, 46, 47]. The harmonic mean method calculates an approximation of the marginal likelihood of a fitted model, while perplexity measures how well a fitted model can predict unseen data. These criteria were calculated for multiple models with varying parameters, expecting that the model decision boundary lay at some optimum of the response curve. For both criteria we were looking for a sudden decrease in marginal difference between two consecutive data points (i.e., models). Unfortunately, in our case, even when fitting models with up to 1,500 topics (data not shown), the curves did not show an optimum.
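
As a rough illustration of the two criteria, the sketch below computes perplexity from held-out token log-probabilities and the log of the harmonic mean of sampled likelihoods, and then takes differences between consecutive models. It is a generic formulation under stated assumptions, not the authors' R implementation; all function names and input values are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean log-probability of held-out tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def log_harmonic_mean(sample_log_likelihoods):
    """Log of the harmonic mean of likelihoods, computed in log space.

    HM = S / sum_s exp(-ll_s), so log HM = log S - logsumexp(-ll_s).
    """
    neg = [-ll for ll in sample_log_likelihoods]
    m = max(neg)  # subtract the maximum to avoid overflow
    lse = m + math.log(sum(math.exp(x - m) for x in neg))
    return math.log(len(neg)) - lse


def marginal_differences(values):
    """Differences between consecutive points of a criterion curve."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical inputs, for illustration only.
print(perplexity([math.log(0.25)] * 4))
print(log_harmonic_mean([-1000.0, -1001.0]))
print(marginal_differences([10, 7, 6]))
```

A flattening of the marginal differences would mark the candidate number of topics; as noted above, no such point appeared for this corpus.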

Finally we opted for TM with model selection through AIC, a method based on likelihood and model complexity. The AIC curve shows an optimum at M14; however, M25 was chosen for further analysis. While experimenting with the parameter T we noticed that quantitatively measuring model fit did not relate to the interpretability of the topics, as also noted in [30, 48]. Comparison between models showed that there was no major reorganisation of topics (data not shown), but increasing the number of topics made them more specific and therefore more interpretable.
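
AIC-based selection can be sketched as follows. AIC is 2k - 2 log L, where k is the number of free parameters and L the likelihood; the parameter count used here (free topic-word probabilities only) is an assumption, and the log-likelihoods and vocabulary size are invented for illustration rather than taken from this study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def lda_param_count(n_topics, vocab_size):
    """Assumed parameter count: free topic-word probabilities only."""
    return n_topics * (vocab_size - 1)


# Hypothetical log-likelihoods of models fitted with T topics.
fits = {5: -52000.0, 10: -47000.0, 15: -45500.0, 20: -45200.0}
VOCAB_SIZE = 100  # hypothetical vocabulary size

scores = {t: aic(ll, lda_param_count(t, VOCAB_SIZE)) for t, ll in fits.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

The penalty term grows with the number of topics, so the curve turns upward once the gain in likelihood no longer outweighs the added parameters.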

Manual annotation of topics
Subjectivity of the manual annotation is one of the limitations of this study. Some research has been done in objectifying the analysis of TM results [27, 30, 49, 50]. However, so far, the results of TM cannot be quantitatively evaluated [16, 48]. For the purpose of this study, a group of seven observers was deemed enough for the topic analysis. We also present all the data in the paper, such that readers can assess the topics themselves to confirm or dispute our results.

We took great effort to objectify the interpretation of TM results, but seven is a small number of observers. Ideally more persons should be involved in the assessment of theme assignment. For example, crowdsourcing services such as Mechanical Turk could be used [51]. However, this particular annotation task requires sufficient background knowledge in health data science, which significantly reduces the pool of suitable observers. All the observers in this study were trained in health data science and are therefore familiar with the terms and concepts that appeared in the topics and the big data themes. Nevertheless, no baseline assessment was performed to more precisely understand their own interpretations, which might have introduced some noise in our results.

In general, the observers reported some difficulty associating words with a theme. They also noted that their annotation decisions were mostly based on words that stood out in the topic, which means that not all words were considered equally. This possibly led to the discrepancy between annotators displayed by the results (Tables 2, 3). For example, when asked, annotator F noted that he chose Technology for topic 4 because of the specific word 'cluster', while all others chose Methods. Note that cluster could be interpreted as a computer cluster (i.e., Technology) or a cluster used in unsupervised machine learning (i.e., Methods). Furthermore, note that Information is often co-annotated or interchanged with Application. For example, neuroimaging, neuroscience, image, and signal are present in topic 23. The first two words can be associated with Application, and the latter two with Information. Also, topics containing words referring to data (e.g., images and age) have been annotated as Information and/or Application by some observers. For such reasons some observers said that their annotation might change slightly if they analysed the topics again.
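
Aggregating the annotations of several observers by majority can be sketched as below. The function name and the example annotations are hypothetical simplifications (loosely modelled on the topic 4 example above); they are not the study's actual annotation data.

```python
from collections import Counter


def majority_themes(annotations):
    """Return the themes chosen by more than half of the observers.

    annotations: one set of theme names per observer, for a single topic.
    """
    counts = Counter(theme for chosen in annotations for theme in chosen)
    needed = len(annotations) // 2 + 1
    return sorted(t for t, c in counts.items() if c >= needed)


# Hypothetical annotations by seven observers for one topic.
single_theme = [{"Methods"}] * 6 + [{"Technology"}]

# Hypothetical co-annotation of Information and Application.
co_annotated = (
    [{"Information", "Application"}] * 3
    + [{"Application"}] * 2
    + [{"Information"}] * 2
)

print(majority_themes(single_theme))
print(majority_themes(co_annotated))
```

With seven observers a theme needs at least four votes, so a theme that splits the group can either pass or fail on a single annotation.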

Big data themes in biomedical literature
Despite annotation subjectivity, we consider that we have found sufficient agreement between the observers to support our findings, which show how big data themes are identified in biomedical literature (see Table 3).

Technology and methods are found fairly often in topics. Note that the identification of these themes is facilitated because they can be associated with concrete terms such as device, cloud, and platform for Technology, or model, infer, and simulate for Methods. From the V's, volume and velocity were the most identified themes, which are also easily associated with terms such as large scale, performance, and computability. These terms are frequently used in practice, explaining why they have been so strongly identified in topics 4, 5, 6, 7, 12, 13 and 20.

Impact, variety, veracity, value, and beyond conventional were annotated less often. Because these are more abstract concepts it is likely that they are more difficult to discover within topics. For example, Value was annotated to topic 1, containing words such as secure, challenged, and protect. Compared to concrete themes (e.g., technology and volume), it was more difficult for the annotators to find a fitting theme. Variability was annotated only twice; however, we do believe that it is an integral part of big data. Variability not being recognised could mean that the observers could not identify the theme properly (due to poor theme description or understanding), or that the topics in the selected model could not capture this theme (due to insufficient representation in the corpus).

Each of the themes from the definition by De Mauro et al. (information, methods, technology, impact) was annotated more often than any other (apart from Application). Note that by design these themes are defined in a broader manner, which means that they include the others. For example, Methods includes a few V's such as volume and velocity as well as beyond conventional. Perhaps due to their broadness, the themes from De Mauro et al. were chosen more easily, indicating that their definition covers the understanding of big data in a better way. However, one might wonder whether these themes are exclusively related to big data or whether they will also appear in other types of papers. The set-up of our study is not able to answer this question.

Related work
Other studies have been performed to discern a definition of big data [3, 14, 15]. These have provided an overview of big data research in different research fields [3]; a literature analysis to discover big data themes and a proposal for their consolidation into one definition [14]; and an analysis of industry statements on big data [15]. Each of these studies used qualitative methods, whereas our work builds upon their findings with a quantitative method. In particular, our study provides evidence that supports the definition proposed by De Mauro et al. [14] and an aggregation of its underlying definitions (see Table 1).

Many researchers have applied TM for text analysis in various fields [52]. Most similar to our approach is a study by Hansmann and Niemeyer [16], which applied TM to a big data corpus to discover its characteristics. Their research identified three themes, namely IT infrastructure, methods, and data, and applied TM in two stages. The first stage separated the corpus of 248 manually selected papers into the three themes mentioned above. Then, in the second stage, TM was applied to the papers which had been grouped by theme. An in-depth word-by-word analysis of big data characteristics was performed on the second stage TM results. The meaning of each word was assessed, finding the important concepts for each of the themes and where research focus lies in the corpus.

Our work differs from [16] in three ways. First, their analysis was based on only three big data themes, whereas we used multiple definitions which led to twelve themes. Secondly, we collected a larger corpus resulting from a systematic review of the literature. Lastly, the research goals differ: instead of finding the defining concepts for each of the themes, our approach identifies existing definitions in a biomedical big data corpus.

There are also more sophisticated (and complex) text analysis approaches such as the method described by Hurtado et al. [53]. Whereas we applied a bag-of-words principle, where each word is considered independently, the method by Hurtado et al. processes whole sentences and preserves context information. In [53] text mining was applied to find trends in topics over time and predict topic popularity in the future. While this is not applicable in our current case it might be interesting for further research (e.g., finding trends of big data over time within scientific literature). Lastly, their method to generate topics also gives them a concise label built from the topic's keywords. This would partially remove subjectivity from annotation; however, the results would still depend on human interpretation.
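
The bag-of-words principle mentioned above can be illustrated with a small sketch: two sentences with the same words in a different order yield the same representation, which is exactly the context information that sentence-level methods retain. The tokenisation rule and the example sentences are assumptions for illustration, not the pre-processing used in this study.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased alphabetic tokens, ignoring their order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Two hypothetical sentences with opposite meaning but identical words.
first = bag_of_words("Big data improves health care")
second = bag_of_words("Health care improves big data")

print(first == second)
```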

Conclusion
In this work we describe a systematic study that attempted to answer the question: 'Which themes from various existing big data definitions are expressed in (bio)medical scientific publications?' A large number of existing definitions were analysed and consolidated into twelve themes. A large corpus of representative biomedical scientific publications was collected and automatically analysed with text mining to identify the 25 most relevant topics based on title and abstract. Manual annotation was performed by seven observers to identify big data themes in the topics. In spite of the limitations of our study, the results show that these themes can be identified in this corpus. Volume, Velocity and Value are recognized frequently, but in particular results show strong presence of the themes defined by De Mauro et al. (i.e., Information, Methods, Technology, and Impact). This finding indicates that their definition of big data is supported by the current understanding expressed by authors when they use the term big data in their own (bio)medical publications in this corpus. To our knowledge this is the first time that this is shown in a systematic manner for literature in an application field.

Abbreviations
IT: information technology; NIST: National Institute of Standards and Technology; TM: topic modelling; DOI: digital object identifier; LDA: latent dirichlet allocation; V's: big data aspects, i.e., volume, velocity, variety, veracity, value, variability.

Authors' contributions
SDO and AJvA conceived the study and together with PDM and AHZ created the study design. AJvA performed the study execution; SDO and AJvA analysed and interpreted the results. AJvA drafted the manuscript, which was proofread and edited by SDO; the final manuscript was also proofread by PDM and AHZ. All authors read and approved the final manuscript.

Acknowledgements
This work was carried out on the High Performance Computing Cloud resources of the Dutch national e-infrastructure with the support of SURF Foundation. Furthermore, we would like to thank the observers for their work on annotating the results.

Competing interests
The authors declare that they have no competing interests.

Availability of data and materials
The original corpus data will not be published due to copyright concerns. However, the search can be repeated with the same results, see Methods section. The search was performed on 29 March 2016 and therefore includes publications up to this date. Our R implementation of TM can be found on GitHub, see [54].

Funding
This publication was supported by the Dutch national program COMMIT/ which is funded by the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO).

Received: 24 September 2016 Accepted: 1 November 2016

References
1. Fenn J, LeHong H. Hype cycle for emerging technologies, 2011. Stamford: Gartner; 2011.
2. Google: Google Trends. https://www.google.com/trends/explore#q=big+data. Accessed 28 Mar 2016.
3. Andreu-Perez J, Poon CC, Merrifield RD, Wong ST, Yang G-Z. Big data for health. IEEE J Biomed Health Inform. 2015;19(4):1193–208. doi:10.1109/JBHI.2015.2450362.
4. Gartner: Gartner Acquisitions. http://www.gartner.com/technology/about/acquisition_history.jsp. Accessed 27 Mar 2016.
5. Laney D. 3D data management: controlling data volume, velocity and variety. META Group Res Note. 2001;6:70.
6. Dijcks JP. Oracle: Big data for the enterprise. Redwood City: Oracle; 2012.
7. IBM: IBM - What is big data? Accessed through Google cache. https://www.ibm.com/software/data/bigdata/what-is-big-data.html. Accessed 17 Dec 2015.
8. Dutcher J. What is big data? https://datascience.berkeley.edu/what-is-big-data/. Accessed 12 Sept 2016.
9. Jacobs A. The pathologies of big data. Commun ACM. 2009;52(8):36–44. doi:10.1145/1536616.1536632.
10. DeRouen T. Promises and pitfalls in the use of "Big Data" for clinical research. J Dent Res. 2015;94(9):107–9. doi:10.1177/0022034515587863.
11. Zikopoulos P, Eaton C. Understanding big data: analytics for enterprise class hadoop and streaming data. New York: McGraw-Hill Osborne Media; 2011.
12. Levi M. Kleren van de keizer [The emperor's clothes]. Medisch Contact; 2015.
13. Chen M, Mao S, Liu Y. Big data: a survey. Mobile Netw Appl. 2014;19(2):171–209. doi:10.1007/s11036-013-0489-0.
14. De Mauro A, Greco M, Grimaldi M. A formal definition of big data based on its essential features. Lib Rev. 2016;65(3):122–35. doi:10.1108/LR-06-2015-0061.
15. Ward JS, Barker A. Undefined by data: a survey of big data definitions; 2013.
16. Hansmann T, Niemeyer P. Big data - characterizing an emerging research field using topic models. In: Proceedings of the 2014 IEEE/WIC/ACM International joint conferences on web intelligence (WI) and Intelligent Agent Technologies (IAT). Vol 1. WI-IAT'14. Washington, DC: IEEE Computer Society; 2014. p. 43–51. doi:10.1109/WI-IAT.2014.15.
17. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J Mach Learn Res. 2003;3:993–1022.
18. Blei DM. Probabilistic topic models. Commun ACM. 2012;55(4):77–84. doi:10.1145/2133806.2133826.
19. Steyvers M, Griffiths T. Probabilistic topic models. Handbook Latent Semant Anal. 2007;427(7):424–40.
20. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2015. https://www.R-project.org/.
21. Feinerer I, Hornik K, Meyer D. Text mining infrastructure in R. J Stat Softw. 2008;25(5):1–54.
22. Benoit K, Nulty P. Quanteda: quantitative analysis of textual data. 2015. R package version 0.8.5-10. http://github.com/kbenoit/quanteda.
23. Lewis DD, Yang Y, Rose TG, Li F. Rcv1: a new benchmark collection for text categorization research. J Mach Learn Res. 2004;5:361–97.
24. Salton G. The SMART retrieval system - experiments in automatic document processing. Upper Saddle River: Prentice-Hall Inc; 1971.
25. Lewis DD, Yang Y, Rose TG, Li F. http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop. Accessed 20 Nov 2015.
26. Grün B, Hornik K. Topicmodels: an R package for fitting topic models. J Stat Softw. 2011;13(40):1–30.
27. Chuang J, Gupta S, Manning C, Heer J. Topic model diagnostics: assessing domain relevance via topical alignment. In: Proceedings of the 30th International Conference on machine learning (ICML-13); 2013. p. 612–20.
28. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6(2):461–4. doi:10.1214/aos/1176344136.
29. Akaike H. In: Parzen E, Tanabe K, Kitagawa G, editors. Information theory and an extension of the maximum likelihood principle. New York: Springer; 1998. p. 199–213. doi:10.1007/978-1-4612-1694-0_15.
30. Sievert C, Shirley KE. LDAvis: a method for visualizing and interpreting topics. In: Proceedings of the Workshop on interactive language learning, visualization, and interfaces; 2014. p. 63–70.
31. Zipf GK. Human behavior and the principle of least effort: an introduction to human ecology. Indianapolis: Addison-Wesley Press; 1949.
32. Schroeck M, Shockley R, Smart J, Romero-Morales D, Tufano P. Analytics: the real-world use of big data. IBM Global Business Services. 2012:1–20.
33. Suthaharan S. Big data classification: problems and challenges in network intrusion prediction with machine learning. SIGMETRICS Perform Eval Rev. 2014;41(4):70–3. doi:10.1145/2627534.2627557.
34. Chang L. NIST big data interoperability framework. Vol 1. Definitions. doi:10.6028/NIST.SP.1500-1.
35. Fisher D, DeLine R, Czerwinski M, Drucker S. Interactions with big data analytics. Interactions. 2012;19(3):50–9. doi:10.1145/2168931.2168943.
36. Chen H, Chiang RH, Storey VC. Business intelligence and analytics: from big data to big impact. MIS Q. 2012;36(4):1165–88.
37. Dumbill E. Making sense of big data. Big Data. 2013;1(1):1–2.
38. Boyd D, Crawford K. Critical questions for big data: provocations for a cultural, technological and scholarly phenomenon. Inf Commun Soc. 2012;15(5):662–79. doi:10.1080/1369118X.2012.678878.
39. Center I. Big data analytics. Intel IT Center; 2012.
40. Microsoft: The big bang: how the big data explosion is changing the world. https://news.microsoft.com/2013/02/11/the-big-bang-how-the-big-data-explosion-is-changing-the-world/. Accessed 11 Feb 2013.
41. Shneiderman B. Extreme visualization: squeezing a billion records into a million pixels. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. SIGMOD'08. New York: ACM; 2008. p. 3–12. doi:10.1145/1376616.1376618.
42. Mayer-Schönberger V, Cukier K. Big data: a revolution that will transform how we live. London: John Murray Publishers; 2013.
43. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH. Big data: the next frontier for innovation, competition, and productivity. 2011.
44. Tsai C-W, Lai C-F, Chao H-C, Vasilakos AV. Big data analytics: a survey. J Big Data. 2015;2(1):21. doi:10.1186/s40537-015-0030-3.
45. AlchemyAPI: Alchemy. http://www.alchemyapi.com. Accessed 15 Dec 2015.
46. Wallach HM, Murray I, Salakhutdinov R, Mimno D. Evaluation methods for topic models. In: Proceedings of the 26th Annual international conference on machine learning. ICML'09. New York: ACM; 2009. p. 1105–12. doi:10.1145/1553374.1553515.
47. Sievert C. Finding structure in xkcd comics with latent dirichlet allocation. https://cpsievert.github.io/xkcd/. Accessed 20 Nov 2015.
48. Chang J, Gerrish S, Wang C, Boyd-Graber JL, Blei DM. Reading tea leaves: how humans interpret topic models. In: Bengio Y, Schuurmans D, Lafferty J, Williams C, Culotta A, editors. Advances in neural information processing systems 22. Red Hook: Curran Associates Inc; 2009. p. 288–96.
49. Lau JH, Grieser K, Newman D, Baldwin T. Automatic labelling of topic models. In: Proceedings of the 49th Annual Meeting of the association for computational linguistics: human language technologies, vol 1. HLT'11. Stroudsburg: Association for Computational Linguistics; 2011. p. 1536–45.
50. Mei Q, Shen X, Zhai C. Automatic labeling of multinomial topic models. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining. KDD'07. New York: ACM; 2007. p. 490–9. doi:10.1145/1281192.1281246.
51. Amazon: Amazon Mechanical Turk. https://www.mturk.com. Accessed 27 Feb 2016.
52. Zhao WX, Jiang J, Weng J, He J, Lim EP, Yan H, Li X. Comparing twitter and traditional media using topic models. In: Proceedings of the 33rd European Conference on advances in information retrieval. ECIR'11. Berlin: Springer; 2011. p. 338–49. http://dl.acm.org/citation.cfm?id=1996889.1996934.
53. Hurtado JL, Agarwal A, Zhu X. Topic discovery and future trend forecasting for texts. J Big Data. 2016;3(1):1–21. doi:10.1186/s40537-016-0039-2.
54. Altena, van AJ. AMCeScience/R-topicmodelling at Submission. https://github.com/AMCeScience/R-topicmodelling/releases/tag/Submission.
