Data science and open source
Learn about open source tools for converting data into useful information

M. Tim Jones
August 09, 2013

Copyright IBM Corporation 2013

Data science combines mathematics and computer science for the purpose of extracting value from data. This article introduces data science and surveys prominent open source tools in this rapidly growing field.

The goal of data science is the extraction of useful information from a data set. Companies have recognized the value of data as a business asset for a long time. But the huge data volumes that are now available necessitate new ways to make sense of data and manage it efficiently. A growing cadre of engineers and scientists is building systems to apply data science to massive data volumes. This article introduces you to the field of data science and to open source tools that are available for today's data scientist.

Data science and data scientists

Data science begins with the collection of data. Candidates for collection can be open data or data that comes from internal business processes (for example, website statistics). Next comes refinement: the inventive process that reduces the data to useful information that answers specific questions. Typically, the questions define the approach to the extraction of the information. Within the collection and refinement steps are other important aspects such as data cleansing (or preprocessing) and data visualization.

Open data

Open data is the concept of democratizing data by making it freely available to everyone to use as they want. The growing open data movement follows the ideas behind open source. A useful source of open data is Data.gov (see Related topics), a US government website that was created to increase public access to data generated by the executive branch of the federal government.

You can also view data science as a business process. Mike Loukides of O'Reilly makes a compelling case that data science is the conversion of data not only into information but also into products (see Related topics). From that perspective, the field is a modern-day gold rush: a competitive search for the valuable nuggets in mountains of information.

The prospectors in the data gold rush are called data scientists. As businesses recognize the value in their data, the need for talented multidisciplinary engineers and scientists is growing. Data scientists must have skills in computer science, math, and statistics. Ideally, they also have domain knowledge: an understanding of the source of the data (medical, financial, web, and other domains). Figure 1 illustrates data science as the intersection of computer science, math and statistics, and domain knowledge:

Figure 1. Key disciplines of the data scientist

With this complete skill set, the data scientist can translate domain knowledge and math into an application (from the computer science domain) that mines data and refines it into information. The key is a multidisciplinary focus (which can also include domains such as machine learning and information retrieval).

Engineers and scientists with big data analytics experience are in high demand these days. McKinsey & Company predicts that by 2018 a shortage of people who can fit the data scientist role will occur (see Related topics). The ideas and approaches in data science are useful in many other disciplines too. Even if you don't aspire to become a data scientist, data science skills can be a great addition to your engineering toolbox.

Where data science is used

Like cloud computing, data science is rapidly gaining interest and adoption. Over the year before this article was written, interest in data science roughly doubled, according to Google Insights for Search (since merged into Google Trends). Google Insights for Search is itself an example of data science in action. Figure 2 shows that the frequency of data science as a web search term increased dramatically between the summer of 2011 and the spring of 2012:

Figure 2. Google Insights for Search data on interest in data science

Data science is quickly becoming a staple within organizations that harvest data online (be it crawling-based collection or internal collection that is based on user behaviors such as clicks). Major websites such as Google, Amazon, Facebook, and LinkedIn all have their own data science teams to use their available data (see Related topics).

Google's development of the PageRank algorithm is an early example of data science. Google crawls the web and assigns a numerical weight to the hyperlinks on every page to measure the relative importance of those links. (Full details of PageRank are known only within Google.) The algorithm serves as the means of ranking web content as a function of search terms.
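
The link-weighting idea can be illustrated with a minimal sketch of the textbook power-iteration form of PageRank. This is not Google's production algorithm, whose details are not public. The toy link graph, the page names, the damping factor of 0.85, and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from a dict mapping page -> list of linked pages.

    Simplified textbook formulation; damping and iterations are assumed values.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web used only for this example.
toy_web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph, the page that receives the most incoming links ends up with the highest score, and the page that nothing links to ends up with the lowest. The scores always sum to 1, so each can be read as a relative share of importance.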

Large online retailers such as Amazon and Walmart use data science to try to increase sales. They generate recommendations to individual users that are based on the user's product searches and past purchases.
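
One simple way to turn past purchases into recommendations is to count which products are bought together. The sketch below assumes that approach; it is not a description of how Amazon or Walmart build their systems. The shoppers, products, and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(baskets):
    """Count, for each product, how often every other product was bought with it."""
    cooccurrence = {}
    for items in baskets.values():
        for first, second in combinations(sorted(set(items)), 2):
            cooccurrence.setdefault(first, Counter())[second] += 1
            cooccurrence.setdefault(second, Counter())[first] += 1
    return cooccurrence


def recommend(user, baskets, cooccurrence, top_n=3):
    """Suggest products the user does not own, scored by co-purchase counts."""
    owned = set(baskets[user])
    scores = Counter()
    for item in owned:
        for other, count in cooccurrence.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical purchase histories used only for this example.
purchases = {
    "alice": ["tent", "sleeping bag", "stove"],
    "bob": ["tent", "sleeping bag", "lantern"],
    "carol": ["tent", "stove"],
    "dave": ["tent", "sleeping bag"],
}

cooccurrence = build_cooccurrence(purchases)
suggestions = recommend("dave", purchases, cooccurrence)
print(suggestions)
```

Production recommenders also weigh searches, ratings, and recency, and they must scale to millions of users, which is where frameworks for large data sets become relevant.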

LinkedIn, a professional networking site, maintains a huge amount of data that is related to people and their careers, interests, and connections. This massive network of data resulted in various recommendation engines (for individuals, groups, and companies) and projects that use the data at a deeper level to produce new products at LinkedIn.

One novel example of data science at a web property is the company bitly. On the surface, bitly is a service that enables users to shorten any URL to a 19-character maximum URL (which is stored permanently in bitly's data center). References to the shortened URL are redirected from bitly to the original URL. bitly can then see which URLs people shorten and which URLs other users click. This tactic provides an enormous amount of data that bitly (and its chief scientist, Hilary Mason) can use to generate a wealth of statistics about browsing habits. Users who are registered with bitly can see when their shortened URLs were clicked, through which referrer (email client, Twitter, or another URL), and from which country. Businesses can also use bitly to track user behavior for a set of content.
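
The shorten, redirect, and record cycle can be sketched in a few lines. The following in-memory example is an assumption about how such a service might work in principle, not bitly's actual design: the base-62 codes, the starting counter value, the `example.test` domain, and the class and method names are all hypothetical.

```python
import string
from collections import Counter

ALPHABET = string.digits + string.ascii_letters  # 62 symbols for compact codes


def encode_base62(number):
    """Convert a non-negative integer to a short base-62 string."""
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


class Shortener:
    """Toy URL shortener that records which referrer each click came from."""

    def __init__(self, domain="https://example.test/"):  # hypothetical domain
        self.domain = domain
        self.code_to_url = {}
        self.url_to_code = {}
        self.clicks = Counter()  # (code, referrer) -> number of clicks

    def shorten(self, long_url):
        # Reuse the existing code if this URL was shortened before.
        if long_url not in self.url_to_code:
            code = encode_base62(1000 + len(self.code_to_url))
            self.url_to_code[long_url] = code
            self.code_to_url[code] = long_url
        return self.domain + self.url_to_code[long_url]

    def resolve(self, short_url, referrer="direct"):
        # Look up the original URL, then log the click for later analysis.
        code = short_url[len(self.domain):]
        long_url = self.code_to_url[code]
        self.clicks[(code, referrer)] += 1
        return long_url


service = Shortener()
short = service.shorten("https://www.ibm.com/developerworks/")
service.resolve(short, referrer="twitter")
service.resolve(short, referrer="twitter")
service.resolve(short, referrer="email")
print(short, dict(service.clicks))
```

The click log is the valuable part: every redirect adds a record that can later be aggregated by URL, referrer, or time.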

Open source tools for data science

Just as computer programming isn't constrained to a single language or development environment, data science isn't associated with a single tool or tool suite. A rich and broad array of tools in the open source domain advance data science. They include tools that process large data sets numerically, and visualization and prototyping tools that aid in the development of complex processing. Table 1 lists prominent open source tools for data scientists and defines their roles:

Table 1. Open source tools for data science

Tool                                      Description
Apache Hadoop                             Framework for processing big data
Apache Mahout                             Scalable machine-learning algorithms for Hadoop
Spark                                     Cluster-computing framework for data analytics
The R Project for Statistical Computing   Accessible data manipulation and graphing
Python, Ruby, Perl                        Prototyping and production scripting languages
SciPy                                     Python package for scientific computing
scikit-learn                              Python package for machine learning
Axiis                                     Interactive data visualization

The list in Table 1 isn't exhaustive but instead represents some of the core elements within the data scientist's toolbox. The open source domain is also filled with highly specialized and domain-specific libraries and tools (for example, utilities for interactive map visualization and for text analysis).

Hadoop, Mahout, and Spark

The Internet creates opportunities to collect masses of data about users' behavior and habits. Apache Hadoop is the premier framework for processing massive data sets. Hadoop is important for data science because it provides a scalable framework for distributed data processing. Not all data science problems require big data processing, but Hadoop is ideal when your problem involves Internet-scale data. The Google MapReduce framework's implementation of the PageRank algorithm is an early example of data science on a big data framework. (Hadoop is an implementation of MapReduce.) Apache Pig can make Hadoop even more accessible, bringing a query language that automatically builds MapReduce applications (see Related topics).
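
To make the MapReduce model concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to word counting. It uses plain Python rather than the Hadoop API, so the function names and sample documents are assumptions for illustration. A real Hadoop job distributes the same phases across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for raw_word in document.lower().split():
        word = raw_word.strip(".,;:!?")
        if word:
            yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for document in documents for pair in map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Sample input used only for this example.
documents = [
    "Data science extracts value from data.",
    "Open source tools advance data science.",
]

counts = word_count(documents)
print(counts["data"], counts["science"])
```

Because each map call sees only one document and each reduce call sees only one key, both phases can run in parallel on separate nodes, which is what lets the model scale to Internet-size data.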

Apache Mahout is an implementation of scalable machine-learning algorithms on the Hadoop platform (see Related topics). Mahout includes scalable implementations of clustering algorithms and batch-based collaborative filtering algorithms (for implementing recommendation systems).

Another noteworthy solution for large data sets is the Spark framework (see Related topics). Spark includes optimizations such as in-memory cluster computing with fault-tolerant abstractions.

The R project

A tool that's often found in the data miner's toolkit is a programming language and development environment called R. R focuses on statistical computing and graphics. R is relatively simple to learn and is widely used in the domain of data analysis. Being open source and free, R is a popular language with a large user base.

R is a multiparadigm language that supports object-oriented, functional, procedural, and imperative programming styles. The language is interpreted through a command-line interface and also includes extensive production-level graphical capabilities. Static graphics are available out of the box. With additional packages, both dynamic and interactive graphs are possible. Figure 3 shows an example plot that was generated with R:

Figure 3. Sample 3D sinc plot that uses R

The R programming language was developed in C and Fortran. Many of the internal standard functions in R were written in R itself. R supports mixed-language programming, enabling access to R objects from languages such as C and Java. You can easily extend the capabilities of R by using packages, which can be developed in the R, C, Java, and Fortran programming languages.

Scripting languages

Multiparadigm scripting languages such as Python, Ruby, and Perl provide a professional platform for application development and deployment. And they are ideal for prototyping and testing new ideas. These languages also support various data storage and communication formats, such as XML and JavaScript Object Notation (JSON), and a large variety of open source libraries for scientific computing and machine learning. Python is the clear leader in this space, probably because it is the easiest to learn for users who come from backgrounds other than computer science. Knowledge of Python is often a requirement for data scientist jobs.

SciPy and scikit-learn

The SciPy package extends Python into the domain of scientific programming. It supports various functions, including parallel programming tools, integration, ordinary differential equation solvers, and even an extension (called Weave) for including C/C++ code within Python code.

Related to SciPy is scikit-learn, which is a package for Python-based machine learning. Scikit-learn includes many algorithms under the machine-learning umbrella for supervised learning (support vector machines, naive Bayes), unsupervised learning (clustering algorithms), and other algorithms for data-set manipulation.

Both of these packages extend the capabilities of Python for use as a data science platform.

Axiis interactive data visualization

Many open source solutions focus solely on visualization. One especially interesting example is the Axiis framework, which provides a concise markup language for rich and colorful visualizations. Figure 4 shows an example:

Figure 4. Wedge stack graph visualization using the Axiis framework

Figure 4 is a static version of an interactive example from Tom Gonzalez, Managing Director at BrightPoint Consulting. See Related topics for a link to the interactive version.

Going further

The role of data scientist builds on a solid platform of knowledge and experience. But tools are also an important aspect of the data science field. In emerging disciplines, the open source community is often at the vanguard in establishing software where none existed before. The field of data science is no exception. Data science is relatively new, so more new tools, data protocols, and data formats are almost certainly in the works. But in data science, as in many other disciplines, open source solutions already lead in breadth and depth.

Related topics

Google Insights for Search: This Google site enables anyone to view search trends for a topic across regions of the world, including comparative trends of two or more topics.
Open data: Read about open data on Wikipedia.
"What is data science" (Mike Loukides, O'Reilly Radar, June 2010): Read a great introduction to data science and the idea behind transforming data into products.
"Growing Your Own Data Scientists" (Dan Woods, Forbes, March 2012): The article series surveys definitions of data scientist from leading experts in the field.
Hadoop on developerWorks: Explore a wealth of articles and other resources on Apache Hadoop and its related technologies.
"Apache Mahout: Scalable machine learning for everyone" (Grant Ingersoll, developerWorks, November 2011): Mahout committer Ingersoll describes Mahout's features and walks through an example of how to deploy and scale some of Mahout's more popular algorithms.
"Data visualization tools for Linux" (M. Tim Jones, developerWorks, November 2006): This article presents several useful data visualization tools that bear some similarity to the R Project.
Big data: The next frontier for competition: Read about research from McKinsey & Co. on the role of big data and data scientists.
Data.gov: Browse the Data.gov data sets available through the online catalog and use multiple criteria to filter your search.
Science.gov: This portal provides access to more than 55 databases and 2,100 websites from 13 federal agencies for US government science information. As on Data.gov, you can restrict your searches by search criteria or by specific agencies.
"Process your data with Apache Pig" (M. Tim Jones, developerWorks, February 2012): Learn more about Pig and how to put it to work in your applications.
"Spark, an alternative for fast data analytics" (M. Tim Jones, developerWorks, November 2011): Get to know the Spark approach to cluster computing and its differences from Hadoop.
Apache Hadoop: Download Hadoop.
Apache Mahout: Download Mahout from an Apache mirror.
Spark: Get the latest Spark release.
R programming language: Get R, a multiparadigm language and development environment with broad use in statistics and visualization.
Python, Ruby, and Perl: Simplify the development and prototyping of algorithms for data refinement with these multiparadigm scripting languages.
SciPy and scikit-learn: Use Python's data science capabilities with the SciPy package for scientific computing and the scikit-learn package for machine learning.
Axiis: The Axiis data visualization framework is a useful solution for both beginners and experts. Check out the examples page to see what's possible with the framework, including the interactive version of Figure 4.

Copyright IBM Corporation 2013 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)