Rethinking big data: A review on the data quality and usage issues

Review Article

Jianzheng Liu a, Jie Li a, Weifeng Li a, Jiansheng Wu b,c

a Department of Urban Planning and Design, Faculty of Architecture, Knowles Building, The University of Hong Kong, Pokfulam Road, Hong Kong
b Key Laboratory of Human Environmental Science and Technology, Room E318, Peking University Shenzhen Graduate School, University Town, Shenzhen 518055, China
c Key Laboratory for Earth Surface Processes, College of Urban and Environmental Sciences, Peking University, Beijing 100871, China

Article in ISPRS Journal of Photogrammetry and Remote Sensing, December 2015
DOI: 10.1016/j.isprsjprs.2015.11.006

Article history: Received 18 May 2015; received in revised form 17 November 2015; accepted 17 November 2015

Keywords: Big data; Data quality and error; Data ethics; Spatial information sciences

Abstract

The recent explosion of publications on big data has well documented the rise of big data and its ongoing prevalence. Different types of "big data" have emerged and have greatly enriched spatial information sciences and related fields in terms of breadth and granularity. Studies that were difficult to conduct in the past because of limited data availability can now be carried out. However, big data brings many "big errors" in data quality and data usage, and it cannot be used as a substitute for sound research design and solid theories. We identify and summarize the problems faced by current big data studies with regard to data collection, processing, and analysis: inauthentic data collection, information incompleteness and noise of big data, unrepresentativeness, consistency and reliability, and ethical issues. Cases of empirical studies are provided as evidence for each problem. We propose that big data research should closely follow good scientific practice to provide reliable and scientific "stories", as well as explore and develop techniques and methods to mitigate or rectify those "big errors" brought by big data.

© 2015 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.

1. Introduction

The prevalence of big data exerts a profound impact on many disciplines, including public health and economics (Einav and Levin, 2014; Khoury and Ioannidis, 2014). Almost all disciplines and research areas, including computer science, business, and medicine, are currently deeply involved in this spreading computational culture of big data because of its broad reach of influence and potential within multiple disciplines (Boyd and Crawford, 2012). As a highly interdisciplinary subject, spatial information science and related disciplines (e.g., geography and urban studies) are also largely affected by the new technical wave of big data. The past several years have seen popular applications of big data, such as inferring people's daily travel behavior and interaction using mobile phone data and taxi trajectory data. We can foresee that the wave of big data will eventually be extended to other city applications such as real-time population census and energy use at home or in vehicles. The key question is no longer technological but organizational (Batty, 2012).

However, "big data is part of the wave but that is just data. Data only matters if it is useful" (Webster, 2014). We argue that big data also brings problems in data quality and data usage, which undermine the usability of big data. Research based on data with errors does not meet the requirements of good scientific research in terms of authenticity and accuracy. This type of research will likely result in biased or wrong conclusions if we do not have a deeper understanding of the quality issues of big data and its consequent problems.

This study reviews existing literature on big data in spatial information sciences and related fields to obtain an understanding of the current hype on big data and its data quality. We attempt to determine the typical data quality and data usage problems that undermine the authenticity and reliability of big data research in this field. Our intention is not to discourage big data research but to promote a scientific and reliable research culture for big data studies and to facilitate the production of high-quality research.

This review comprises three sections. We first present an overview of big data research in spatial information sciences and related fields. This section clarifies the definition of big data and summarizes the current application scope of big data in spatial information sciences and related fields. Several influential empirical publications that focus on big data are highlighted. We also explicate the three paths through which big data influences spatial information sciences and related disciplines, namely data collection, data processing, and data analysis, and we assess current big data studies from the perspective of each path. We then focus on the "big errors" in data collection, processing, and analysis for big data in spatial information sciences and related fields. We elaborate on the five data quality and data usage issues of big data, namely, the authoritativeness problem, the information incompleteness and noise problem, the representativeness problem, consistency and reliability issues, and ethical problems. Cases of empirical studies are presented as evidence. Finally, this paper presents several specific coping strategies and recommendations to help decrease the data error of big data research.

Corresponding author. E-mail addresses: jzliu@hku.hk (J. Liu), Jessieleepku@hotmail.com (J. Li), wi@hku.hk (W. Li), wujs@pkusz.edu.cn (J. Wu).

2. Overview on big data research in spatial information sciences

2.1. What exactly is big data and what makes it popular?

Big data is generally considered linkable information that has large data volumes and complex data structures (Khoury and Ioannidis, 2014), such as social media data, mobile phone call records, commercial website data (e.g., eBay, Taobao), volunteered geographic information, search engine data, smart card data, and taxi trajectory data. Big data came into the focus of academics only in the past decade, as shown in Fig. 1, but the explosive growth in publications on big data suggests that big data topics will probably continue to proliferate in the next few years.

The most popular description of big data thus far is the "3V" model, where "3V" refers to volume, variety, and velocity (Laney, 2001). Volume literally means that typical big data have particularly large data volume. For example, a mobile phone call record dataset can have 70 million entries (Gao et al., 2013), and video surveillance records can have even larger data volume in terms of data storage. Variety means that big data have diversified data sources, data structures, and potential applications. Velocity refers to real-time or quasi-real-time data updating. For instance, air quality monitoring data are often updated once or several times each day.

In addition to the "3V" model, "4V" and "5V" models are emerging as researchers attempt to redefine big data. IBM adds veracity to account for the bias problems brought by big data and believes that the "4V" model can accurately describe big data (IBM, 2013). Several media columns argue that big data also have the features of value, variability, and visualization (McNulty, 2014).

However, the typical "big data" in spatial information sciences and related fields appears unfit for the "4V" big data model. Some "big data", such as the social media data of a specific topic, are small in terms of data volume and are even smaller than some traditional datasets such as census data. Big data is more about the capacity to search, aggregate, and cross-reference large datasets than its large volume (Boyd and Crawford, 2012). Thus, we argue that "fine-scale spatial–temporal data" is a more appropriate term to describe the big data in spatial information sciences and related fields, since the big data in these fields is usually characterized by very fine granularity and spatial–temporal dimensions.

We believe that one of the reasons why big data is popular in most disciplines is that it largely improves the data availability and accessibility of research subjects, thus allowing the study of topics that were difficult to interrogate because of poor data availability. Big data provide "the capacity to collect and analyze data with an unprecedented breadth and depth and scale" (Lazer et al., 2009). For example, obtaining detailed data on the spatial–temporal behavior of urban residents used to be difficult. However, such information has now become accessible and easy to collect because of the popularity of personal communication devices with smart sensors.

Another important feature of big data that makes it prevalent is that it provides extraordinarily fine-grained data in terms of analysis units and spatial and temporal resolution. For instance, smart card and mobile phone data are collected at the individual level (Richardson et al., 2013). Such data can be observed at short intervals, for example, on a per-hour basis. Data with fine analysis units offer a significant chance for rigorous and accurate research because researchers can examine the causal relationship in a small analysis unit and avoid ecological fallacy and the other issues caused by data aggregation (Robinson, 2009). Furthermore, the fine spatial and temporal resolution of big data enables researchers to look into urban issues and other geographical processes in fine detail to generate new understanding and theories, because most current theories are built on radical and massive changes to urban issues and other geographical processes instead of gradual and subtle changes, which are probably more important (Batty, 2012).

2.2. A glimpse of big-data-related research

Big data research is basically data driven in almost every discipline and field. Therefore, big data research either focuses on methodological innovation or prioritizes the application of big data to different topics in geography and urban studies. The scope of big data research is difficult to summarize because big data have different types and each type has different applications. Methodological big data studies are generally computation intensive. For example, a few scholars have proposed innovative computational frameworks for data mining on big data (Gao et al., 2014; Wu et al., 2014). Notable studies include urban computing (Zheng et al., 2013) and the application of machine learning techniques such as neural networks and deep learning to big data analysis (O'Leary, 2013; Pijanowski et al., 2014). Visualization tools and techniques are also becoming popular (Cheshire and Batty, 2012). The most frequently investigated topic in the application of big data in this field is human mobility (Gao et al., 2013; Gonzalez et al., 2008; Liu et al., 2012; Pei et al., 2014; Roth et al., 2011; Song et al., 2010), followed by spatial interaction (Gao et al., 2013; Krings et al., 2009) and urban structure patterns (Lee et al., 2013; Toole et al., 2012; Yuan et al., 2012).

Significant progress has been achieved in big data research in spatial information sciences and related fields. Table 1 shows several empirical studies in the spatial information sciences and related fields. These studies are selected based on their potential research impact, diversified big data types, and research problems. We highlight these studies by summarizing the study focus, data source, methods, and results for each empirical study. This summary can serve as a quick overview of the achievements of big data research in this field.

Fig. 1. Number of published studies on "big data" based on a literature search of the phrase "big data" in the topic field in the Web of Knowledge database from the year 1956 to 2014.

2.3. How does big data change current research?

We believe that big data exert their impact on spatial information sciences and related fields in three aspects, namely, data collection, data processing, and data analysis.

First, the data collection approach has been transformed from traditional methods (e.g., questionnaires and interviews) into fast and powerful ICT-based methods, including the web services provided by different data vendors such as national environmental protection agencies and commercial institutions, as well as devices with sensors such as mobile phones and smart transportation cards. This transformation is fundamental because it has changed the other two aspects, namely, the way researchers process and analyze data.

The change in data collection has led to changes in data processing. Big data is characterized by high volume, velocity, variety, veracity, fine granularity, and rich data availability. Consequently, the methods and procedures to process these big data must have the capability to handle high-volume and real-time data and serve as a filter to decrease data errors and data "noise."

The third aspect of changes brought by big data in current research is data analysis. The data structure and fine granularity of big data require new data analysis methods and tools because many existing tools and instruments for data analysis are designed for traditional data of coarse temporal and spatial resolutions (Cheshire and Batty, 2012).

2.4. Evaluating current big data research

The emergence of big data has clearly been transforming the research landscape. The transition of data collection has increased the richness and availability of data. A boom in the research areas is expected, as evidenced by the extensive publications in recent years (Fig. 1). However, it appears that some researchers are immersed in this carnival of big data and overlook its potential problems. These researchers tend to embrace big data without any doubt or scrutiny.

The process of data collection and processing for big data is expected to play a key role in avoiding the systematic biases of big data and decreasing the data "noise" to ensure the appropriate use of big data in authentic scientific inquiries. However, we have not seen this happen, as shown in the "big errors" section that follows in this review.

Data analysis is expected to change in this new era of big data. The features of big data require new approaches and tools that can accommodate big data with different data structures and can process data with different spatial and temporal scales. However, thus far we have not found any data analysis method that is significantly different from the traditional approach. Some of the existing big data studies basically follow a research paradigm of combining "new approaches based on new data" with old topics (Yuan et al., 2012). In an attempt to explore big data, these studies usually apply or develop methods based on traditional data mining techniques, and use this seemingly "new" approach to explore an old topic. The original contribution of an empirical study usually comes from exploring a new topic or phenomenon, using a new method or model, or gaining new insights. Employing new data alone is not enough for a good and original empirical study, as shown in Table 2. Some of the current big data studies run the risk of merely doing data exercises instead of making an original contribution. Whether these studies contribute to the understanding of research problems, produce any new insights into the research problem, or improve the theories that explain the real world is doubtful.

Table 1. Summary of selected big data research in spatial information sciences.

Entry | Study focus | Big data | Method | Result
Gonzalez et al. (2008) | Human mobility | Mobile phone data | Statistical fitting | Human trajectories show a high degree of temporal and spatial regularity.
Roth et al. (2011) | Human mobility | London subway "Oyster" card data | Null model | A polycentric structure is composed of large flows organized around a limited number of activity centers.
Krings et al. (2009) | Spatial interaction | Mobile phone data | Gravity model | Communication intensity between two cities is proportional to the product of their sizes divided by the square of their distance.
Huang et al. (2015) | Human mobility | GPS trajectories | Markov chain model with consideration of activity changes | The proposed method improves the accuracy in analyzing and predicting human movement.
Zheng et al. (2013) | Urban computing | Air monitoring data and points of interest | A semi-supervised learning approach based on the artificial neural network and conditional random field | The proposed method has advantages over decision tree, conditional random field, and artificial neural network.
Fu and Chau (2013) | Data quality | Social media (Sina Weibo) | Random sampling approach | Representative and reliable statistics on Chinese micro-bloggers are limited.
Haklay (2010) | Data quality | OpenStreetMap | Comparison with Ordnance Survey datasets | OpenStreetMap information can be fairly accurate in terms of positional accuracy.

Table 2. Research contribution of different research scenarios.

Scenario | New phenomenon/problem/topic | New method | New data/context
Old phenomenon/problem/topic | Nil | Good (methodological study) | Problematic if without new insights
Old method | Good (new areas) | Nil | Problematic if without new insights
Old data/context | Good (new areas) | Good | Nil

3. "Big errors" brought by big data

Anderson (2008), former editor-in-chief of Wired magazine, indicated the following in his article "The end of theory": "out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves." However, does big data really have "unprecedented fidelity" and speak for itself without a theory? We doubt these ideas. We believe that the bigness of big data not only refers to its large data volume, complex data structure, and fine granularity but also to the significance of data quality and usage problems in big data. Researchers in the fields of public health, biology, and information communication technology share our view on the claim made in Wired magazine (Boyd and Crawford, 2012; Khoury and Ioannidis, 2014; Pigliucci, 2009).

This review examines current empirical studies on big data and summarizes the prevailing problems during data collection, processing, and analysis of big data to elucidate the "big errors" of big data studies in spatial information sciences and related fields. We attempt to provide a constructive understanding and argument for the reflection and reexamination of the data authenticity problem of big data studies, as well as suggest concrete measures to mitigate these problems. Given our limited experience with big data, this study focuses on mobile phone data, social media, volunteered data, and search engine data. Thus, each problem discussed in this review may not be applicable to all types of big data, and each type of big data may suffer from one or multiple problems in different ways.

3.1. Inauthentic data collection

Traditional data collection is usually conducted or supervised by scientific investigators, research institutions, or governmental agencies. The data collected by these authoritative scientific bodies generally have high data authenticity and credibility because researchers in these organizations generally obey research ethics and follow good scientific practices. These institutions also have more resources and power to perform these tasks. Furthermore, these scientific data collection tasks are what these researchers, research institutions, and governmental agencies are hired and paid to do. It is their job.

However, some big data suffer from authenticity and credibility problems in data collection. For example, social media data are collected from Twitter, Sina Weibo, and other social networking platforms. These organizations are commercial companies that are not established for scientific research purposes but are business platforms that pursue profits. At least three types of differences between a scientific research institution and a commercial company exist.

First, commercial business platforms neither adopt scientific data collection procedures such as random sampling nor follow a series of solid and scientific data processing procedures to address biases and other data problems. Commercial business platforms perform data collection not for the sake of science but for profit. The collected data are "repurposed" data that were previously used for commercial purposes but are now used for scientific purposes (Loshin, 2012). The target populations of these companies are people who are profitable to them rather than the overall population. Furthermore, the sampling method and processing algorithms behind the web services provided by these companies are unknown, just like a black box.

Second, these commercial big data providers can change the sampling methods and processing algorithms at any time without any notice. Researchers may not know about these changes at all. Research that adopts data generated by different algorithms for temporal comparison can produce biased or even wrong conclusions.

Third, commercial platforms have no obligation or motivation to ensure the authenticity and validity of the data they collect. A good example is that social media such as Twitter and Sina Weibo have many robot or "zombie" accounts that are run by machines. These companies have no desire to "eliminate" these unqualified samples. Instead, these companies count on these accounts to make money. Other big-data-based web services, such as Facebook, Fang, Baidu, and Google, have similar problems.

The reputable Google Flu Trends Index is an example of inauthentic data collection. Google reports that this index can use the search records of users who search for influenza on Google to predict the outbreak of an influenza pandemic. This index has attracted considerable attention, but scholars have reported that Google often changes the algorithms, which makes the prediction unstable (Lazer et al., 2014). A comparison between the Google estimates of influenza and the influenza records of the US Centers for Disease Control and Prevention (CDC) shows that Google estimates of doctor visits for influenza-like illness are more than double those of the CDC, and Google estimates are also higher than those of the CDC in 100 out of 108 weeks (Lazer et al., 2014).

3.2. Information incompleteness and noise of big data

As discussed in Section 3.1, some big data are repurposed from commercial use to scientific use, thus resulting in many data problems. Information incompleteness is one of these problems. Some big data are good in data volume but contain limited information. This situation constrains the further application of these data. For example, mobile phone call detail record (CDR) data record the calling logs of many phone users in a city in a real-time manner at a low cost. However, the information in the data is incomplete and has a narrow application scope without the socio-economic attribute data of users. This condition can be attributed to the limited number of data fields of mobile phone data, which include only pseudo user IDs, base station locations, and calling timestamps. Although the data is recorded at the individual level, mobile phone data have extremely limited applications because of the lack of socio-economic attributes. This type of data cannot reflect the differences among respondents, or describe residents' behavior characteristics and other essential information that interests researchers and readers. Current literature shows that mobile phone data can only be used to look into the spatial aspects of human activities and not the socio-economic aspects: existing studies either interrogate human mobility (Gonzalez et al., 2008; Pei et al., 2014) or examine the spatial structure and interaction between or within cities (Gao et al., 2013; Krings et al., 2009; Reades et al., 2009; Soto and Frías-Martínez, 2011; Toole et al., 2012). For studies that report that mobile phone data can be used to infer personalities on the basis of the calling patterns of phone users (de Montjoye et al., 2013b), the rationale for personality inference is problematic, which is why the accuracy of personality inference is unsatisfactory.

Moreover, the geo-location information in mobile phone data is not the exact location of phone calling activities but the location of the mobile phone towers on which the mobile phone positioning network is built (Gao et al., 2013; Zuo and Zhang, 2012). The geo-location accuracy for phone activities depends on the density of these towers and signal strength, which vary considerably within a city (Gao et al., 2013). Furthermore, the data can identify only working and residential activities during weekdays and recreation activities during weekends. The problem with activity identification is that researchers generally use an arbitrary pre-defined activity time to differentiate these activities, thus adding uncertainties to the conclusions of studies using the arbitrary parameters. In addition, mobile phone data only record the moving patterns of people, which is only a small part of the daily life of people, but ignore most of the time spent in the office or at home. These activities are much more closely related to the behavior, health, social, and economic activities of urban residents, which are the top concern of the public instead of the human moving patterns. Mobile phone data, however, cannot perform this task because it only contains incomplete and less important information.

Smart card data have information incompleteness problems similar to those of mobile phone data. Several researchers have adopted a roundabout approach to obtain the socio-economic data of smart card holders. They combined a traditional resident travel survey and land use data to compute the residential addresses of these smart card holders, and then inferred their income levels based on the residential addresses (Long and Thill, 2015). These data mining procedures can attain certain accuracy, but introduce large errors because of the many uncertainties involved in the study logic.

A special problem of information noise is associated with big data analysis. This problem probably occurs, but we have no solid evidence yet. As previously discussed in Section 2.1, big data generally have fine granularity, which provides finely detailed data for research. However, although big data contain information in a finer-grained and more detailed manner, they also record random variations, fluctuations, and even noise during the measurement. When applying traditional methods such as machine learning algorithms to analyze big data, researchers can run into the phenomenon of over-fitting, where the machine learning algorithm learns from the noise embedded in the fine-grained big data and predicts based on the noisy information. Take a simple data fitting task for example. Assume that the true relationship between variables x and y is linear (Fig. 2a). However, when big data provide finer and more detailed data points, as shown in Fig. 2b, the machine learning algorithm will likely produce a higher-order polynomial fitting curve with a higher R-squared rather than the true linear relationship.
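
The over-fitting risk described above can be illustrated with a minimal sketch. It assumes a hypothetical true relationship y = 2x + 1 with Gaussian noise; the slope, intercept, noise level, sample size, and polynomial degree are illustrative choices, not values from the reviewed studies. Both models are fitted by least squares with NumPy. On the data used for fitting, the higher-degree polynomial scores at least as high an R-squared as the straight line, which is why R-squared on the fitted data alone is a poor guide to the true relationship. Scoring both fits on held-out points is one common way to expose the difference.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against observations y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_and_score(x_fit, y_fit, x_new, y_new, degree):
    """Fit a polynomial on (x_fit, y_fit); return R-squared on fitted and new data."""
    coeffs = np.polyfit(x_fit, y_fit, degree)
    fitted_r2 = r_squared(y_fit, np.polyval(coeffs, x_fit))
    new_r2 = r_squared(y_new, np.polyval(coeffs, x_new))
    return fitted_r2, new_r2


# Hypothetical data: a truly linear relationship plus measurement noise.
rng = np.random.default_rng(0)
x_fit = np.linspace(-1.0, 1.0, 40)
x_new = np.linspace(-0.95, 0.95, 40)
y_fit = 2.0 * x_fit + 1.0 + rng.normal(0.0, 0.3, x_fit.size)
y_new = 2.0 * x_new + 1.0 + rng.normal(0.0, 0.3, x_new.size)

linear_train_r2, linear_new_r2 = fit_and_score(x_fit, y_fit, x_new, y_new, degree=1)
poly_train_r2, poly_new_r2 = fit_and_score(x_fit, y_fit, x_new, y_new, degree=8)

print(f"degree 1: fitted R2 = {linear_train_r2:.3f}, held-out R2 = {linear_new_r2:.3f}")
print(f"degree 8: fitted R2 = {poly_train_r2:.3f}, held-out R2 = {poly_new_r2:.3f}")
```

Comparing the two printed rows shows how much of the polynomial's apparent gain on the fitted data carries over to points it has not seen.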

3.3. Representativeness problems of big data

The representativeness problem is another consequence of the repurposed big data discussed in Section 3.1. Commercial big data providers generally do not adopt a scientific sampling method when collecting data because of their nature of pursuing commercial profits, which limits the population represented in these data to only a small group of people with low significance and implication. Studies based on these big data that ignore the representativeness problem have probably drawn conclusions that mismatch the claimed population in the studies.

Much of the current big data research that uses social media data is built on the following assumption: social media data have a particularly large number of users, which indicates that the data has a large sample coverage, so the data can represent the entire population, or possible population sampling bias can be dismissed (Mayer-Schönberger and Cukier, 2013). However, this assumption is false. The large size and volume of big data do not necessarily mean that the data is random and representative (Boyd and Crawford, 2012), and increasing the quantity does not increase the quality (Cheshire and Batty, 2012). For example, Sina Weibo, the Chinese-localized Twitter, reported that it had 61.4 million active users per day in the 4th quarter of 2013, accounting for only 9.94% of all Internet users and 4.51% of the total population, according to the statistics report on Internet users released by the China Internet Network Information Center (CNNIC) (China Internet Network Information Center, 2014b). A survey conducted by CNNIC in early 2014 shows that nearly 70% of social media users are under 30 years old (China Internet Network Information Center, 2014a). The population that uses social media is only a small population that is mainly comprised of young people, which is far from being representative of the entire population or even the Internet user population. The representativeness problem of social media data is not only in the age structure, but also in regional divisions. More than one quarter of microblogging users are located in well-developed regions, including Guangdong, Beijing, and Shanghai, whereas the Internet users of these three places only account for 9% of the total population of China's Internet users (Fu and Chau, 2013).
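
The coverage figures quoted above can be checked with simple arithmetic. The sketch below uses only the three numbers given in the text (61.4 million daily active users, 9.94% of Internet users, 4.51% of the total population) and back-calculates the base populations they imply. The function and variable names are illustrative, and the derived totals are implied values rather than figures quoted from the CNNIC report.

```python
def implied_base(count, share):
    """Size of the base population implied by a count and its share of that base."""
    return count / share


daily_active_users = 61.4e6        # Sina Weibo, 4th quarter of 2013 (from the text)
share_of_internet_users = 0.0994   # 9.94% (from the text)
share_of_population = 0.0451       # 4.51% (from the text)

# Implied totals, derived here for illustration only.
implied_internet_users = implied_base(daily_active_users, share_of_internet_users)
implied_population = implied_base(daily_active_users, share_of_population)

# Share of the total population that the daily active users do not cover.
uncovered_share = 1.0 - share_of_population

print(f"implied Internet users: {implied_internet_users / 1e6:.1f} million")
print(f"implied population:     {implied_population / 1e6:.1f} million")
print(f"population not covered: {uncovered_share:.2%}")
```

Even with tens of millions of daily users, more than 95% of the total population is absent from such a dataset, so a large sample size by itself says nothing about representativeness.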

This condition is occurring among social media services all over the world. A 2012 survey that interviewed 1802 American Internet users shows that only 16% of Internet users have a Twitter account, and the majority of Twitter users are African-American, urban residents, and young people between 18 and 29 years old (Duggan and Brenner, 2013). Instagram, which is a popular online photo sharing social media platform, is only used by 13% of the survey participants, and the survey shows that it is only popular among women from Latin America (Duggan and Brenner, 2013).

Mobile phone data also have representativeness problems. Most mobile phone data used in big data studies are from mobile phone users who signed service contracts with a mobile phone operator (Krings et al., 2009). However, many phone users have not signed contracts with a mobile phone operator, whereas others have signed a contract with other mobile phone operators in a city. The phone activities of these users have not been recorded in these data.

Another assumption of these studies based on mobile phone records and social media is that one account or phone number represents one person. However, the truth is that "accounts" and users are clearly never equivalent. Some people share one mobile phone, whereas others have multiple phones for different purposes. For instance, each person in Shanghai reportedly has 1.32 phones on average (Niu et al., 2015). The same applies to social media accounts because these user-generated contents are not solely produced by humans but by a complex and more-than-human assemblage (Crampton et al., 2013).

Fig. 2. Fine-grained big data could cause over-fitting when wrong models or analysis methods are applied to pursue a higher coefficient of determination (R-squared).

3.4. Consistency and reliability problems of big data

Data reliability depends on two aspects. First, the data, their derived measures, and indicators can genuinely represent the facts and information on research subjects without being influenced by the data collector's behavior. Second, the data, their derived measures, and indicators are consistent and stable, regardless of how other unrelated factors change. However, some big data fail in both aspects of data reliability.

Take social media data for example. Studies that use social media data often ignore the fact that the operating company that provides these social media data is a significant confounding variable. The actions and behavior of these companies distort social media data in many ways. For example, when studying the inter-personal communication links between cities, researchers are likely to overlook the fact that the social media service itself alters the true inter-personal networks by recommending friends (Fig. 3) on the basis of the user's place of birth, gender, education background, and other attributes. This personalization feature of "friend recommendation" in social media platforms is similar to the targeted advertising of Google and Baidu, which distracts users' attention and distorts the original intention of online surfing. This hidden influence exerted by these social media companies undermines the credibility of the conclusions of inter-personal communication or inter-city cyberspace analyses such as Zhen et al. (2012). In short, the information in social media big data does not only contain inter-personal or inter-city communication information but also contains the interfering influences of the operating companies behind such social media big data. Using social media big data to analyze users' behavior and activities without eliminating the influences exerted by the operating companies is problematic.

Search engine data are another example of the distortion of users' behavior and activities. Search engines such as Google generally provide users with a handy function called autocomplete to predict the rest of a word being typed by a user, as shown in Fig. 4. This feature is convenient for search engine users, but it distorts users' behavior (Ruths and Pfeffer, 2014). Users can be distracted into searching for other words or word combinations instead.

The collection and submission procedures of big data are unreliable. For example, the geo-location of geo-tagged tweets may well be inaccurate or wrong because Twitter users can post geo-tagged tweets at literally "any place" on Earth. Fig. 5 shows a Twitter user who posted a geo-tagged tweet in Hong Kong and then posted another geo-tagged tweet in New York. How is it possible for a person to jump from Hong Kong to New York in only one minute? Free, unverified geo-tagged tweet postings are clearly problematic. Collecting and submitting data without verification largely decreases the credibility of studies based on such geo-tagged big data. Several studies that adopted the user-specified location in Twitter users' profiles encounter the same verification problem because users can provide a fuzzy location such as "Middle Earth" (Crampton et al., 2013; Graham et al., 2014).
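
The Hong Kong to New York example suggests a simple plausibility screen: compute the travel speed implied by consecutive geo-tagged posts from the same user and flag pairs that no real journey could produce. The sketch below is illustrative only. The function names, the (timestamp, latitude, longitude) record layout, the sample coordinates, and the 1,000 km/h threshold are all assumptions introduced here; they are not part of the original study or of any Twitter API.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def flag_implausible_jumps(posts, max_speed_kmh=1000.0):
    """Return pairs of consecutive posts whose implied speed exceeds max_speed_kmh.

    posts: iterable of (timestamp: datetime, lat: float, lon: float) for ONE user.
    The default threshold is an assumed value, roughly an airliner's cruising speed.
    """
    ordered = sorted(posts, key=lambda p: p[0])
    flagged = []
    for prev, curr in zip(ordered, ordered[1:]):
        dist_km = haversine_km(prev[1], prev[2], curr[1], curr[2])
        hours = (curr[0] - prev[0]).total_seconds() / 3600.0
        if hours == 0:
            implausible = dist_km > 0  # two different places at the same instant
        else:
            implausible = dist_km / hours > max_speed_kmh
        if implausible:
            flagged.append((prev, curr))
    return flagged


# Hypothetical posts mirroring the Fig. 5 scenario (approximate city-centre coordinates).
posts = [
    (datetime(2015, 5, 1, 12, 0), 22.3193, 114.1694),  # Hong Kong
    (datetime(2015, 5, 1, 12, 1), 40.7128, -74.0060),  # New York, one minute later
]
suspicious = flag_implausible_jumps(posts)
```

A screen of this kind only detects physically impossible sequences; a single falsely tagged post, or a fuzzy profile location, would still pass unnoticed.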

The unreliability of big data is also manifested in its instability and inconsistency. The population represented in social media data varies with time and is never stable or consistent. The number of valid social media users keeps changing because user tastes change and other social networking platforms emerge. This variation directly changes the demographic structure of the user population. Research based on these social media data can be true now but become false a few months later (Ruths and Pfeffer, 2014).

OpenStreetMap, a well-known source of volunteered geographic information, provides excellent free map data for many researchers (Liu and Long, 2015). However, its accuracy and data quality are worrisome. Given that most of the data in OpenStreetMap are provided by amateur users who have never received professional scientific training in mapping, no uniform and standard data collection methods exist, which makes data accuracy, quality, and completeness vary considerably across a country or city. The OpenStreetMap data can only be regarded as a "generalized dataset" (Haklay, 2010). In addition, OpenStreetMap appears to "bully the poor and flatter the rich" because it provides detailed map data for wealthy regions but only sparse and incomplete data for poor areas (Haklay, 2010). A reasonable explanation is that the volunteer data collectors of OpenStreetMap tend to collect data in affluent areas rather than in less affluent ones. This inconsistent and inequitable data abundance undermines the reliability of this volunteered geographic big data.

Fig. 3. Social media use friend recommendations and other methods to affect the real social network.

J. Liu et al. / ISPRS Journal of Photogrammetry and Remote Sensing xxx (2015) xxx–xxx

Please cite this article in press as: Liu, J., et al. Rethinking big data: A review on the data quality and usage issues. ISPRS J. Photogram. Remote Sensing (2015), http://dx.doi.org/10.1016/j.isprsjprs.2015.11.006

3.5. Ethical issues of big data

We once had a discussion with Professor John Logan of Brown University, the co-author of "Urban Fortunes: The Political Economy of Place." When Professor Logan learned that we used the Sina Weibo microblog to collect public activity data, he asked us, "Is it ethical?" We replied, "Since the users make the information public, the information is open. So data mining from the information is certainly allowed." However, the truth is that "just because it is accessible does not make it ethical," and ignoring the ethical evaluation of research is problematic (Boyd and Crawford, 2012). Collecting public data without seeking appropriate ethical approval still carries risks. If the activity pattern of an individual user is revealed during the data mining process, the research can infringe on the user's privacy and exceed the "minimal risk" of research ethics. Minimal risk means that the pain or discomfort that people experience in a study should not be more severe than what they experience in daily life (Bacon-Shone, 2014). What we did not consider and acknowledge is that social media data being public is different from all involved users having granted permission to use the data (Boyd and Marwick, 2011).

Several scholars identified nearly 10,000 people engaged in urban planning and related careers by searching the Sina microblog and analyzed their personal connections (Mao and Long, 2013). Such academic exploration seems fine, but the findings can violate the privacy of the planning practitioners under investigation. For studies whose research topics involve user privacy, does the researcher collect sensitive data regarding social network users' political ideas or actions? Is the research data stored properly? Is there any risk of data leakage? Are the user IDs and names replaced with pseudocodes? Have measures been taken to stay within the minimal risk emphasized in international research ethics?

Fig. 4. The autocomplete function of the Google search engine distorts the user's original search topic.

Fig. 5. Geo-tagged tweets can be sent without verification of the geo-location.

Other big data-based studies also reveal how big data can violate user privacy. For example, a study of the communication data of 1.5 million anonymous mobile phone users in a Western European country shows that four spatio-temporal position records are sufficient to confirm the identity of 95% of the people; the researchers also found that user privacy protection hardly improves after the temporal and spatial resolution of mobile phone datasets is coarsened (de Montjoye et al., 2013a). Key information in our identification documents, such as social security numbers, is also unsafe. Researchers have shown that individuals' social security numbers can be inferred by combining publicly available data, including profiles on social networking sites (Acquisti and Gross, 2009).
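
To make the re-identification risk concrete, the following toy sketch estimates how many users in a small dataset are singled out by a handful of points from their own trace. It is only loosely inspired by the notion of uniqueness discussed above and does not reproduce the method of de Montjoye et al.; the function name `unicity`, the (cell, hour) record layout, and the sample traces are hypothetical.

```python
import random


def unicity(traces, p=4, seed=0):
    """Estimate the share of users uniquely identified by p points of their own trace.

    traces: dict mapping user_id -> set of (cell_id, hour) tuples.
    Users with fewer than p points are skipped. One random draw of p points is
    made per user, so this is a rough estimate rather than an exact measure.
    """
    rng = random.Random(seed)
    eligible = 0
    unique = 0
    for user, points in traces.items():
        if len(points) < p:
            continue
        eligible += 1
        known = set(rng.sample(sorted(points), p))  # the points an attacker happens to know
        # Count every trace in the dataset that contains all of the known points.
        matches = sum(1 for other in traces.values() if known <= other)
        if matches == 1:  # only the user's own trace matches
            unique += 1
    return unique / eligible if eligible else 0.0


# Hypothetical traces: users "a" and "b" share identical movements, "c" differs.
toy_traces = {
    "a": {(1, 8), (2, 9), (3, 18)},
    "b": {(1, 8), (2, 9), (3, 18)},
    "c": {(7, 8), (9, 20)},
}
share_unique = unicity(toy_traces, p=2)
```

In this toy dataset only user "c" is unique, so the estimate is one third. Real mobility traces are far more distinctive, which is why the reported figure is so high.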

These studies point out the difficult underlying ethical problems of big data. However, these problems are beyond the reach of current ethical control mechanisms because we neither know which types of big data are liable to violate user privacy nor understand what measures can be taken to cope with them. Our research ethics committees are also unprepared for the big data problem (Boyd and Crawford, 2012).

4. Summary and coping strategies

In the previous section, we briefly introduced the "big errors" in the data quality and usage of big data: inauthentic data collection, information incompleteness, unrepresentativeness, inconsistency and unreliability, and ethical issues. Most of these issues stem from unscientific practices in data collection and data processing and from the lack of data verification. The "big errors" brought by big data appear critical, but big data still has potential. As mentioned at the beginning of this study, our aim is not to discourage big data studies. Instead, our intention is to increase researchers' awareness of the potential biases and errors in this field and to encourage them to unravel the big data puzzle with care. Big data studies, as currently observed, will continue to prosper and produce more interesting findings.

The next question is how spatial information sciences and related disciplines can cope with the "big errors" brought by big data. This paper argues that big data studies in spatial information sciences and related disciplines should focus on the following aspects. Note that the following recommendations are preliminary thoughts on how to mitigate the "big errors" rather than a panacea for all problems.

4.1. Big data errors should be further understood and evaluated, and new reliable data analysis methods should be developed to decrease errors

Obtaining a deep understanding of an issue is essential to developing appropriate methods to solve it. This also applies to the "big errors" brought by big data. First, scholars who use big data should understand in depth how the data supplier's behavior affects the quality of big data and biases the results. In addition, the quality of big data and its possible errors, such as positioning accuracy, data integrity, logical consistency, and other accuracy characteristics, need to be evaluated before the data are used in analysis. Second, we should actively develop reliable, widely applicable technical methods to decrease or eliminate big data errors. When countermeasures to solve these problems are unrealistic, we should carefully define the scope of the study, such as the targeted study population or subjects, and explain it in the discussion section.
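
As a rough illustration of evaluating data quality before analysis, the sketch below summarizes three of the characteristics named above for a batch of geo-located records: completeness (missing coordinates), logical consistency (coordinates outside valid ranges), and integrity (duplicate identifiers). The record layout with `id`, `lat`, and `lon` keys and the function name `quality_report` are assumptions made for this example, not a standard schema.

```python
def quality_report(records):
    """Summarize basic quality indicators for a list of geo-located records.

    records: list of dicts with keys 'id', 'lat', 'lon'; lat/lon may be None.
    """
    total = len(records)
    missing_coords = 0
    out_of_range = 0
    duplicate_ids = 0
    seen_ids = set()
    for rec in records:
        lat, lon = rec.get("lat"), rec.get("lon")
        if lat is None or lon is None:
            missing_coords += 1  # completeness problem
        elif not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            out_of_range += 1  # logically impossible coordinates
        rec_id = rec.get("id")
        if rec_id in seen_ids:
            duplicate_ids += 1  # the same record submitted more than once
        else:
            seen_ids.add(rec_id)
    return {
        "total": total,
        "missing_coords": missing_coords,
        "out_of_range_coords": out_of_range,
        "duplicate_ids": duplicate_ids,
        "completeness": (total - missing_coords) / total if total else 0.0,
    }


# Hypothetical records containing one of each problem.
sample = [
    {"id": 1, "lat": 22.3, "lon": 114.2},
    {"id": 2, "lat": None, "lon": 114.2},
    {"id": 3, "lat": 95.0, "lon": 10.0},
    {"id": 1, "lat": 22.3, "lon": 114.2},
]
report = quality_report(sample)
```

Reporting such indicators alongside the results lets readers judge how far the conclusions can be trusted; positional accuracy itself would still require comparison against a reference dataset.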

4.2. We should cooperate with data providers and adopt rigorous research designs, such as experimental research designs, to decrease or eliminate the influence of data providers

The impacts of big data suppliers on big data quality were discussed in the previous section. If researchers who know the scientific methods of data collection cooperate with big data suppliers and carry out rigorous research plans together, they are likely to reach conclusions with high credibility. This model is feasible because researchers need the data and support from big data suppliers, and the suppliers want to learn more about their business from the data. For example, eBay may want to know how online transactions and logistics would be affected if it increased the trading commissions charged to merchants. There are precedents of scholars in other fields cooperating with online commercial platforms. For example, Kohavi et al. (2009), who were members of Microsoft's Experimentation Platform team, cooperated with Amazon, Google, and NASA, using experimental design methods to study user satisfaction with, or tolerance of, search waiting times, certain website designs, or certain services.

4.3. Research on big data should be supplemented by traditional scientific data collection methods to obtain more detailed and representative data

Big data such as mobile phone data and smart card data suffer from incompleteness, and other big data such as social media microblogs have biased samples. Traditional data can be used to complement such big data. Using a scientific traditional sampling survey, we can collect more detailed information on the target population, including its socio-economic characteristics, and make the collected data more representative. Many scholars are already using this approach, such as combining smart card data with travel diaries (Long et al., 2015) and combining news reports and crime incident data with social media data (Crampton et al., 2013).
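
One way survey data can make a biased sample more representative is post-stratification, in which each group in the big data sample is reweighted to match its share of the target population as measured by a survey or census. This technique is named here as an illustration and is not described in the paper itself; the group labels, population shares, and function names below are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return a weight per group: population share divided by sample share.

    sample_groups: list with the group label of every sampled unit.
    population_shares: dict mapping group label -> share in the target population.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    weights = {}
    for group, count in counts.items():
        if group not in population_shares:
            raise KeyError(f"no population share given for group {group!r}")
        weights[group] = population_shares[group] / (count / n)
    return weights


def weighted_mean(values, groups, weights):
    """Mean of values where each unit is weighted by its group's weight."""
    numerator = sum(weights[g] * v for v, g in zip(values, groups))
    denominator = sum(weights[g] for g in groups)
    return numerator / denominator


# Hypothetical microblog sample that over-represents young users.
groups = ["young", "young", "young", "old"]
values = [1, 1, 1, 0]  # e.g. 1 = posted about a given topic
shares = {"young": 0.5, "old": 0.5}  # assumed shares from a sampling survey

weights = poststratification_weights(groups, shares)
raw_estimate = sum(values) / len(values)
adjusted_estimate = weighted_mean(values, groups, weights)
```

Here the unweighted estimate is 0.75, while the reweighted estimate is 0.5. Reweighting corrects only for the characteristics that the survey measures; groups that are absent from the big data sample altogether cannot be recovered this way.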

4.4. Multiple data sources can be used to expand sample representativeness and enhance the reliability of research findings based on big data

Data from multiple sources result in relatively complex data structures and data processing tasks, but they increase diversity. This diversity is particularly important for addressing the representativeness and reliability problems of big data. The robustness of conclusions derived from one type of big data can be tested with big data provided by another platform to make them more convincing.
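
A minimal sketch of such a cross-platform robustness check, assuming the same quantity (for example, an inter-city connection strength) has been measured on two platforms: if the two rankings agree closely, the conclusion is less likely to be an artifact of a single platform. Spearman's rank correlation is used here as one reasonable choice; the paper does not prescribe a specific statistic, and the city values are invented for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (sx * sy)


# Hypothetical connection strengths for four cities on two platforms.
platform_a = [120, 95, 60, 30]
platform_b = [300, 310, 150, 80]
rho = spearman(platform_a, platform_b)
```

A high correlation supports, but does not prove, robustness, because both platforms may share the same biases, such as an over-representation of young urban users.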

4.5. Current research ethics and good practices should be enhanced at the governmental, university, and individual levels all over the world

As providers of public goods, government agencies, universities, research institutions, and individual researchers are obliged to protect the privacy of individuals and the public's right to know. Ethics review is well implemented in developed countries but poorly implemented in developing countries such as China. To the best of our knowledge, clinical research ethics reviews are widely adopted in the medical schools of Chinese universities, but there are no compulsory institutional-level measures governing non-clinical research that involves human subjects. We believe that research ethics review is not meant to hinder the progress of research. Instead, it attempts to protect the public, research institutions, and researchers themselves. As for big data research, we recommend regulating access to big data that are liable to disclose the privacy of the public. Pseudocodes should be used to substitute for all identifiable information. However, which kinds of big data are vulnerable to privacy disclosure and what corresponding countermeasures can be developed should be intensively studied first.
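
The recommendation to substitute pseudocodes for identifiable information can be sketched with a keyed hash, which maps each user ID to a stable code that cannot be reversed or recomputed without the secret key. This is one possible approach under assumptions made here (the function name, the key handling, and the sample IDs are hypothetical), and pseudonymization alone does not prevent re-identification through behavioral traces such as location records.

```python
import hashlib
import hmac


def pseudonymize(user_id, secret_key):
    """Return a stable pseudocode for user_id using HMAC-SHA256.

    secret_key: bytes kept separately from the research dataset. Anyone holding
    the key can recompute the mapping, so it should be stored securely or destroyed.
    """
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key and user IDs, for illustration only.
key = b"replace-with-a-randomly-generated-secret"
records = [
    {"user": "weibo_user_123", "city": "Beijing"},
    {"user": "weibo_user_456", "city": "Shenzhen"},
    {"user": "weibo_user_123", "city": "Shanghai"},
]
safe_records = [
    {"user": pseudonymize(r["user"], key), "city": r["city"]} for r in records
]
```

Because the same ID always maps to the same code, analyses that link records per user still work on the pseudonymized data.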

Acknowledgement

This research was supported by the Early Career Scheme of the Research Grants Council of Hong Kong (Project No.: 27200414).

References

Acquisti, A., Gross, R., 2009. Predicting social security numbers from public data. Proc. Natl. Acad. Sci. 106, 10975–10980.
Anderson, C., 2008. The end of theory: the data deluge makes the scientific method obsolete. Wired Mag.
Bacon-Shone, J., 2014. Human Research Ethics in HKU. (accessed 16.12.15).
Batty, M., 2012. Smart cities, big data. Environ. Plan. B–Plan. Des. 39, 191–193. http://dx.doi.org/10.1068/b3902ed.
Boyd, D., Crawford, K., 2012. Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Inf. Commun. Soc. 15, 662–679. http://dx.doi.org/10.1080/1369118x.2012.678878.
Boyd, D., Marwick, A.E., 2011. Social privacy in networked publics: Teens' attitudes, practices, and strategies. In: Proceedings of the A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society. Oxford Internet Institute, pp. 1–29.
Cheshire, J., Batty, M., 2012. Visualisation tools for understanding big data. Environ. Plan. B–Plan. Des. 39, 413–415. http://dx.doi.org/10.1068/b3903ed.
China Internet Network Information Center, 2014a. Research Report on Social Media User Behaviors 2014. (accessed 16.12.15).
China Internet Network Information Center, 2014b. Statistical Report on Internet Development in China. (accessed 16.12.15).
Crampton, J.W., Graham, M., Poorthuis, A., Shelton, T., Stephens, M., Wilson, M.W., Zook, M., 2013. Beyond the geotag: situating 'big data' and leveraging the potential of the geoweb. Cartogr. Geogr. Inf. Sci. 40, 130–139. http://dx.doi.org/10.1080/15230406.2013.777137.
de Montjoye, Y.-A., Hidalgo, C.A., Verleysen, M., Blondel, V.D., 2013a. Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 3. http://dx.doi.org/10.1038/srep01376.
de Montjoye, Y.-A., Quoidbach, J., Robic, F., Pentland, A., 2013b. Predicting personality using novel mobile phone-based metrics. In: Greenberg, A., Kennedy, W., Bos, N. (Eds.), Social Computing, Behavioral–Cultural Modeling and Prediction. Springer Berlin Heidelberg, pp. 48–55.
Duggan, M., Brenner, J., 2013. The Demographics of Social Media Users – 2012. (accessed 16.12.15).
Einav, L., Levin, J., 2014. Economics in the age of big data. Science 346, 1243089. http://dx.doi.org/10.1126/science.1243089.
Fu, K.-W., Chau, M., 2013. Reality check for the Chinese microblog space: a random sampling approach. PLoS ONE 8, e58356.
Gao, S., Li, L., Li, W., Janowicz, K., Zhang, Y., 2014. Constructing gazetteers from volunteered big geo-data based on Hadoop. Comput. Environ. Urban Syst. http://dx.doi.org/10.1016/j.compenvurbsys.2014.02.004.
Gao, S., Liu, Y., Wang, Y., Ma, X., 2013. Discovering spatial interaction communities from mobile phone data. Trans. GIS 17, 463–481.
Gonzalez, M.C., Hidalgo, C.A., Barabasi, A.-L., 2008. Understanding individual human mobility patterns. Nature 453, 779–782. http://dx.doi.org/10.1038/nature06958.
Graham, M., Hale, S.A., Gaffney, D., 2014. Where in the world are you? Geolocation and language identification in Twitter. Prof. Geogr. 66, 568–578.
Haklay, M., 2010. How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environ. Plan. B–Plan. Des. 37, 682–703.
Huang, W., Li, S., Liu, X., Ban, Y., 2015. Predicting human mobility with activity changes. Int. J. Geogr. Inf. Sci. 29, 1–19.
IBM, 2013. The Four V's of Big Data. (accessed 16.12.15).
Khoury, M.J., Ioannidis, J.P.A., 2014. Big data meets public health. Science 346, 1054–1055. http://dx.doi.org/10.1126/science.aaa2709.
Kohavi, R., Longbotham, R., Sommerfield, D., Henne, R.M., 2009. Controlled experiments on the web: survey and practical guide. Data Min. Knowl. Disc. 18, 140–181.
Krings, G., Calabrese, F., Ratti, C., Blondel, V.D., 2009. Urban gravity: a model for inter-city telecommunication flows. J. Stat. Mech: Theory Exp. 2009, L07003.
Laney, D., 2001. 3-D Data Management: Controlling Data Volume, Velocity and Variety. (accessed 16.12.15).
Lazer, D., Pentland, A.S., Adamic, L., Aral, S., Barabasi, A.L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., 2009. Life in the network: the coming age of computational social science. Science 323, 721.
Lazer, D.M., Kennedy, R., King, G., Vespignani, A., 2014. The parable of Google Flu: traps in big data analysis. Science 343, 1203–1205.
Lee, R., Wakamiya, S., Sumiya, K., 2013. Urban area characterization based on crowd behavioral lifelogs over Twitter. Pers. Ubiquit. Comput. 17, 605–620. http://dx.doi.org/10.1007/s00779-012-0510-9.
Liu, X., Long, Y., 2015. Automated identification and characterization of parcels (AICP) with OpenStreetMap and Points of Interest. Environ. Plan. B–Plan. Des., 1–20. http://dx.doi.org/10.1177/0265813515604767.
Liu, Y., Kang, C., Gao, S., Xiao, Y., Tian, Y., 2012. Understanding intra-urban trip patterns from taxi trajectory data. J. Geogr. Syst. 14, 463–483.
Long, Y., Liu, X., Zhou, J., Chai, Y., 2015. Early Birds, Night Owls, and Tireless/Recurring Itinerants: An Exploratory Analysis of Extreme Transit Behaviors in Beijing, China. arXiv:1502.02056 [physics.soc-ph].
Long, Y., Thill, J., 2015. Combining smart card data, household travel survey and land use pattern for identifying housing-jobs relationships in Beijing. Comput. Environ. Urban Syst. 53, 19–35. http://dx.doi.org/10.1016/j.compenvurbsys.2015.02.005.
Loshin, D., 2012. Data Governance and Quality: Data Reuse vs. Data Repurposing. (accessed 06.05.15).
Mao, M., Long, Y., 2013. Jiyu weibo shuju de guihuaquan shibie chutan [an exploration into the personal connection of urban planning practitioners using Sina Weibo data]. In: Annual National Planning Conference 2013. Urban Planning Society of China, Qingdao, China.
Mayer-Schönberger, V., Cukier, K., 2013. Big Data: A Revolution that Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
McNulty, E., 2014. Understanding Big Data: The Seven V's (accessed 15.12.15).
Niu, X., Ding, L., Song, X., 2015. Understanding urban spatial structure of Shanghai central city based on mobile phone data. China City Planning Review 3, 004.
O'Leary, D.E., 2013. Artificial intelligence and big data. IEEE Intell. Syst. 28, 96–99.
Pei, T., Sobolevsky, S., Ratti, C., Amini, A., Zhou, C., 2014. Uncovering the directional heterogeneity of an aggregated mobile phone network. Trans. GIS 18, 126–142.
Pigliucci, M., 2009. The end of theory in science? EMBO Rep. 10, 534–534.
Pijanowski, B.C., Tayyebi, A., Doucette, J., Pekin, B.K., Braun, D., Plourde, J., 2014. A big data urban growth simulation at a national scale: configuring the GIS and neural network based land transformation model to run in a high performance computing (HPC) environment. Environ. Modell. Softw. 51, 250–268. http://dx.doi.org/10.1016/j.envsoft.2013.09.015.
Reades, J., Calabrese, F., Ratti, C., 2009. Eigenplaces: analysing cities using the space-time structure of the mobile phone network. Environ. Plan. B–Plan. Des. 36, 824–836.
Richardson, D.B., Volkow, N.D., Kwan, M.-P., Kaplan, R.M., Goodchild, M.F., Croyle, R.T., 2013. Spatial turn in health research. Science 339, 1390.
Robinson, W.S., 2009. Ecological correlations and the behavior of individuals. Int. J. Epidemiol. 38, 337–341.
Roth, C., Kang, S.M., Batty, M., Barthélemy, M., 2011. Structure of urban movements: polycentric activity and entangled hierarchical flows. PLoS ONE 6, e15923.
Ruths, D., Pfeffer, J., 2014. Social media for large studies of behavior. Science 346, 1063–1064. http://dx.doi.org/10.1126/science.346.6213.1063.
Song, C., Qu, Z., Blumm, N., Barabási, A.-L., 2010. Limits of predictability in human mobility. Science 327, 1018–1021.
Soto, V., Frías-Martínez, E., 2011. Automated land use identification using cell-phone records. In: Proceedings of the 3rd ACM International Workshop on MobiArch. ACM, pp. 17–22.
Toole, J.L., Ulm, M., González, M.C., Bauer, D., 2012. Inferring land use from mobile phone activity. In: Proceedings of the ACM SIGKDD International Workshop on Urban Computing. ACM, pp. 1–8.
Webster, C., 2014. Dean's Roundup (Friday, 31 October, 2014). (accessed 06.05.15).
Wu, H.Y., Zhang, T., Gong, J.Y., 2014. Geocomputation for geospatial big data. Trans. GIS 18, 1–2. http://dx.doi.org/10.1111/tgis.12131.
Yuan, J., Zheng, Y., Xie, X., 2012. Discovering regions of different functions in a city using human mobility and POIs. In: Proceedings of the ACM KDD. ACM, pp. 186–194.
Zhen, F., Wang, B., Chen, Y., 2012. China's city network characteristics based on social network space: an empirical analysis of sina micro-blog. Acta Geogr. Sin. 67, 1031–1043.
Zheng, Y., Liu, F., Hsieh, H.-P., 2013. U-Air: when urban air quality inference meets big data. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 1436–1444.
Zuo, X., Zhang, Y., 2012. Detection and analysis of urban area hotspots based on cell phone traffic. J. Comput. 7, 1753–1760.
