Copyright IBM Corporation 2013

Data science and open source

Learn about open source tools for converting data into useful information

M. Tim Jones
August 09, 2013

Data science combines mathematics and computer science for the purpose of extracting value from data. This article introduces data science and surveys prominent open source tools in this rapidly growing field.

The goal of data science is the extraction of useful information from a data set. Companies have recognized the value of data as a business asset for a long time. But the huge data volumes that are now available necessitate new ways to make sense of data and manage it efficiently. A growing cadre of engineers and scientists is building systems to apply data science to massive data volumes. This article introduces you to the field of data science and to open source tools that are available for today's data scientist.

Data science and data scientists

Data science begins with the collection of data. Candidates for collection can be open data or data that comes from internal business processes (for example, website statistics). Next comes refinement: the inventive process that reduces the data to useful information that answers specific questions. Typically, the questions define the approach to the extraction of the information. Within the collection and refinement steps are other important aspects such as data cleansing (or preprocessing) and data visualization.

Open data

Open data is the concept of democratizing data by making it freely available to everyone to use as they want. The growing open data movement follows the ideas behind open source. A useful source of open data is Data.gov (see Related topics), a US government website that was created to increase public access to data generated by the executive branch of the federal government.

You can also view data science as a business process. Mike Loukides of O'Reilly makes a compelling case that data science is the conversion of data not only into information but also into products (see Related topics). From that perspective, the field is a modern-day gold rush: a competitive search for the valuable nuggets in mountains of information.

The prospectors in the data gold rush are called data scientists. As businesses recognize the value in their data, the need for talented multidisciplinary engineers and scientists is growing. Data scientists must have skills in computer science, math, and statistics. Ideally, they also have domain knowledge: an understanding of the source of the data (medical, financial, web, and other domains).
Figure 1 illustrates data science as the intersection of computer science, math and statistics, and domain knowledge:

Figure 1. Key disciplines of the data scientist

With this complete skill set, the data scientist can translate domain knowledge and math into an application (from the computer science domain) that mines data and refines it into information. The key is a multidisciplinary focus (which can also include domains such as machine learning and information retrieval).

Engineers and scientists with big data analytics experience are in high demand these days. McKinsey & Company predicts that by 2018 a shortage of people who can fit the data scientist role will occur (see Related topics). The ideas and approaches in data science are useful in many other disciplines too. Even if you don't aspire to become a data scientist, data science skills can be a great addition to your engineering toolbox.

Where data science is used

Like cloud computing, data science is rapidly gaining interest and adoption. Over the year before this article was written, interest in data science roughly doubled, according to Google Insights for Search (formerly Google Trends). Google Insights for Search is itself an example of data science in action. Figure 2 shows that the frequency of data science as a web search term increased dramatically between the summer of 2011 and the spring of 2012:

Figure 2. Google Insights for Search data on interest in data science

Data science is quickly becoming a staple within organizations that harvest data online (be it crawling-based collection or internal collection that is based on user behaviors such as clicks).
Major websites such as Google, Amazon, Facebook, and LinkedIn all have their own data science teams to use their available data (see Related topics).

Google's development of the PageRank algorithm is an early example of data science. Google crawls the web and assigns a numerical weight to the hyperlinks on every page to measure the relative importance of those links. (Full details of PageRank are known only within Google.) The algorithm serves as the means of ranking web content as a function of search terms.
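
The idea of weighting pages by their links can be sketched with the publicly described, textbook form of PageRank, computed by power iteration. This is only an illustration and not Google's production algorithm: the four-page link graph, the damping factor of 0.85, the iteration count, and the function name pagerank are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to
    (hypothetical data, not a real crawl).
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its outgoing links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "c" is linked to by three of the four pages.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most incoming links ends up with the highest weight, and the page that nothing links to ends up with the lowest.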

Large online retailers such as Amazon and Walmart use data science to try to increase sales. They generate recommendations to individual users that are based on the user's product searches and past purchases.

LinkedIn, a professional networking site, maintains a huge amount of data that is related to people and their careers, interests, and connections. This massive network of data resulted in various recommendation engines (for individuals, groups, and companies) and projects that use the data at a deeper level to produce new products at LinkedIn.

One novel example of data science at a web property is the company bitly. On the surface, bitly is a service that enables users to shorten any URL to a URL of at most 19 characters (which is stored permanently in bitly's data center). References to the shortened URL are redirected from bitly to the original URL. bitly can then see which URLs people shorten and which URLs other users click. This tactic provides an enormous amount of data that bitly (and its chief scientist, Hilary Mason) can use to generate a wealth of statistics about browsing habits. Users who are registered with bitly can see when their shortened URLs were clicked, through which referrer (email client, Twitter, or another URL), and from which country. Businesses can also use bitly to track user behavior for a set of content.
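
The shorten, redirect, and track cycle described above can be sketched in a few lines. This is a toy, in-memory illustration and says nothing about how bitly is actually built: the class name ToyShortener, the base-62 codes, and the example URL are all hypothetical.

```python
import string
from collections import Counter

ALPHABET = string.digits + string.ascii_letters  # 62 characters for short codes


class ToyShortener:
    """Hypothetical in-memory URL shortener that records clicks per referrer."""

    def __init__(self):
        self.urls = []           # position in this list -> original URL
        self.clicks = Counter()  # (short code, referrer) -> number of clicks

    def _encode(self, number):
        # Convert a list position into a base-62 string.
        if number == 0:
            return ALPHABET[0]
        digits = []
        while number:
            number, remainder = divmod(number, len(ALPHABET))
            digits.append(ALPHABET[remainder])
        return "".join(reversed(digits))

    def _decode(self, code):
        # Convert a base-62 string back into a list position.
        number = 0
        for character in code:
            number = number * len(ALPHABET) + ALPHABET.index(character)
        return number

    def shorten(self, url):
        self.urls.append(url)
        return self._encode(len(self.urls) - 1)

    def resolve(self, code, referrer="direct"):
        # Each lookup is also a data point: which link, and where it came from.
        self.clicks[(code, referrer)] += 1
        return self.urls[self._decode(code)]


service = ToyShortener()
long_url = "https://www.example.com/a/very/long/path"
code = service.shorten(long_url)
service.resolve(code, referrer="twitter")
service.resolve(code, referrer="email")
service.resolve(code, referrer="twitter")
print(code, dict(service.clicks))
```

The point of the sketch is that the redirect step is where the data is collected: every resolved link adds to the click statistics that can later be analyzed.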

Open source tools for data science

Just as computer programming isn't constrained to a single language or development environment, data science isn't associated with a single tool or tool suite. A rich and broad array of tools in the open source domain advance data science. They include tools that process large data sets numerically, and visualization and prototyping tools that aid in the development of complex processing.
Table 1 lists prominent open source tools for data scientists and defines their roles:

Table 1. Open source tools for data science

Tool                                      Description
Apache Hadoop                             Framework for processing big data
Apache Mahout                             Scalable machine-learning algorithms for Hadoop
Spark                                     Cluster-computing framework for data analytics
The R Project for Statistical Computing   Accessible data manipulation and graphing
Python, Ruby, Perl                        Prototyping and production scripting languages
SciPy                                     Python package for scientific computing
scikit-learn                              Python package for machine learning
Axiis                                     Interactive data visualization

The list in Table 1 isn't exhaustive but instead represents some of the core elements within the data scientist's toolbox. The open source domain is also filled with highly specialized and domain-specific libraries and tools (for example, utilities for interactive map visualization and for text analysis).

Hadoop, Mahout, and Spark

The Internet creates opportunities to collect masses of data about users' behavior and habits. Apache Hadoop is the premier framework for processing massive data sets. Hadoop is important for data science because it provides a scalable framework for distributed data processing. Not all data science problems require big data processing, but Hadoop is ideal when your problem involves Internet-scale data. The Google MapReduce framework's implementation of the PageRank algorithm is an early example of data science on a big data framework. (Hadoop is an implementation of MapReduce.) Apache Pig can make Hadoop even more accessible, bringing a query language that automatically builds MapReduce applications (see Related topics).
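
The MapReduce model that Hadoop implements can be illustrated on a single machine with the classic word-count example. The sketch below is plain Python and does not use the Hadoop API; the function names map_phase, shuffle, and reduce_phase and the two sample documents are assumptions chosen for the illustration. On a real cluster, the map and reduce calls would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; in Hadoop these would be blocks of files in a distributed file system.
documents = [
    "big data needs big tools",
    "open source tools for data science",
]

mapped = [pair for document in documents for pair in map_phase(document)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each map call sees only one document and each reduce call sees only one key, the work can be split across machines without the steps needing to share state.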

Apache Mahout is an implementation of scalable machine-learning algorithms on the Hadoop platform (see Related topics). Mahout includes scalable implementations of clustering algorithms and batch-based collaborative filtering algorithms (for implementing recommendation systems).
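
To give a feel for collaborative filtering, here is a minimal, single-machine sketch of an item co-occurrence recommender computed in one batch pass over purchase histories. It is not Mahout's implementation or API; the function names build_cooccurrence and recommend and the purchase data are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """Batch step: count how often each pair of items appears for the same user."""
    cooccurrence = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for first, second in permutations(set(items), 2):
            cooccurrence[first][second] += 1
    return cooccurrence


def recommend(user, baskets, cooccurrence, top_n=2):
    """Score items the user lacks by how often they co-occur with items the user has."""
    owned = set(baskets[user])
    scores = defaultdict(int)
    for item in owned:
        for other, count in cooccurrence[item].items():
            if other not in owned:
                scores[other] += count
    # Highest score first; ties are broken alphabetically so the result is stable.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical purchase histories.
purchases = {
    "ann": ["book", "lamp", "pen"],
    "bob": ["book", "pen"],
    "cat": ["book", "lamp"],
    "dan": ["book"],
}

cooccurrence = build_cooccurrence(purchases)
suggestions_for_dan = recommend("dan", purchases, cooccurrence)
suggestions_for_bob = recommend("bob", purchases, cooccurrence)
print(suggestions_for_dan, suggestions_for_bob)
```

The same idea, recommending what similar histories contain, is what a scalable implementation distributes across a cluster when the purchase data no longer fits on one machine.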

Another noteworthy solution for large data sets is the Spark framework (see Related topics). Spark includes optimizations such as in-memory cluster computing with fault-tolerant abstractions.

The R project

A tool that's often found in the data miner's toolkit is a programming language and development environment called R. R focuses on statistical computing and graphics. R is relatively simple to learn and is widely used in the domain of data analysis. Being open source and free, R is a popular language with a large user base.

R is a multiparadigm language that supports object-oriented, functional, procedural, and imperative programming styles. The language is interpreted through a command-line interface and also includes extensive production-level graphical capabilities. Static graphics are available out of the box. With additional packages, both dynamic and interactive graphs are possible.
Figure 3 shows an example plot that was generated with R:

Figure 3. Sample 3D sinc plot that uses R

The R programming language was developed in C and Fortran. Many of the internal standard functions in R were written in R itself. R supports mixed-language programming, enabling access to R objects from languages such as C and Java. You can easily extend the capabilities of R by using packages, which can be developed in the R, C, Java, and Fortran programming languages.

Scripting languages

Multiparadigm scripting languages such as Python, Ruby, and Perl provide a professional platform for application development and deployment, and they are ideal for prototyping and testing new ideas. These languages also support various data storage and communication formats, such as XML and JavaScript Object Notation (JSON), and a large variety of open source libraries for scientific computing and machine learning. Python is the clear leader in this space, probably because it is the easiest to learn for users who come from backgrounds other than computer science. Knowledge of Python is often a requirement for data scientist jobs.
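
As a small illustration of the format support mentioned above, the following sketch reads the same record from JSON and from XML by using only modules from Python's standard library (json and xml.etree.ElementTree). The record itself, with its page and clicks fields, is made up for the example.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical web statistics record in two formats.
json_text = '{"page": "/home", "clicks": 42}'
xml_text = "<record><page>/home</page><clicks>42</clicks></record>"

# JSON maps directly onto Python dictionaries, strings, and numbers.
from_json = json.loads(json_text)

# XML text is always a string, so numeric fields are converted explicitly.
root = ET.fromstring(xml_text)
from_xml = {
    "page": root.findtext("page"),
    "clicks": int(root.findtext("clicks")),
}

print(from_json == from_xml)

# Writing the record back out as JSON is a single call.
round_trip = json.loads(json.dumps(from_xml))
```

Once both sources are plain Python dictionaries, the rest of a prototype can treat them identically, which is part of what makes a scripting language convenient for early data exploration.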

SciPy and scikit-learn

The SciPy package extends Python into the domain of scientific programming. It supports various functions, including parallel programming tools, integration, ordinary differential equation solvers, and even an extension (called Weave) for including C/C++ code within Python code.

Related to SciPy is scikit-learn, which is a package for Python-based machine learning. Scikit-learn includes many algorithms under the machine-learning umbrella for supervised learning (support vector machines, naive Bayes), unsupervised learning (clustering algorithms), and other algorithms for data-set manipulation.

Both of these packages extend the capabilities of Python for use as a data science platform.

Axiis interactive data visualization

Many open source solutions focus solely on visualization. One especially interesting example is the Axiis framework, which provides a concise markup language for rich and colorful visualizations. Figure 4 shows an example:

Figure 4. Wedge stack graph visualization using the Axiis framework

Figure 4 is a static version of an interactive example from Tom Gonzalez, Managing Director at BrightPoint Consulting. See Related topics for a link to the interactive version.

Going further

The role of data scientist builds on a solid platform of knowledge and experience. But tools are also an important aspect of the data science field. In emerging disciplines, the open source community is often at the vanguard in establishing software where none existed before. The field of data science is no exception. Data science is relatively new, so more new tools, data protocols, and data formats are almost certainly in the works. But in data science, as in many other disciplines, open source solutions already lead in breadth and depth.

Related topics

Google Insights for Search: This Google site enables anyone to view search trends for a topic across regions of the world, including comparative trends of two or more topics.
Open data: Read about open data on Wikipedia.
"What is data science" (Mike Loukides, O'Reilly Radar, June 2010): Read a great introduction to data science and the idea behind transforming data into products.
"Growing Your Own Data Scientists" (Dan Woods, Forbes, March 2012): The article series surveys definitions of data scientist from leading experts in the field.
Hadoop on developerWorks: Explore a wealth of articles and other resources on Apache Hadoop and its related technologies.
"Apache Mahout: Scalable machine learning for everyone" (Grant Ingersoll, developerWorks, November 2011): Mahout committer Ingersoll describes Mahout's features and walks through an example of how to deploy and scale some of Mahout's more popular algorithms.
"Data visualization tools for Linux" (M. Tim Jones, developerWorks, November 2006): This article presents several useful data visualization tools that bear some similarity to the R Project.
Big data: The next frontier for competition: Read about research from McKinsey & Co. on the role of big data and data scientists.
Data.gov: Browse the Data.gov data sets available through the online catalog and use multiple criteria to filter your search.
Science.gov: This portal provides access to more than 55 databases and 2,100 websites from 13 federal agencies for US government science information. As on Data.gov, you can restrict your searches by search criteria or by specific agencies.
"Process your data with Apache Pig" (M. Tim Jones, developerWorks, February 2012): Learn more about Pig and how to put it to work in your applications.
"Spark, an alternative for fast data analytics" (M. Tim Jones, developerWorks, November 2011): Get to know the Spark approach to cluster computing and its differences from Hadoop.
Apache Hadoop: Download Hadoop.
Apache Mahout: Download Mahout from an Apache mirror.
Spark: Get the latest Spark release.
R programming language: Get R, a multiparadigm language and development environment with broad use in statistics and visualization.
Python, Ruby, and Perl: Simplify the development and prototyping of algorithms for data refinement with these multiparadigm scripting languages.
SciPy and scikit-learn: Use Python's data science capabilities with the SciPy package for scientific computing and the scikit-learn package for machine learning.
Axiis: The Axiis data visualization framework is a useful solution for both beginners and experts. Check out the examples page to see what's possible with the framework, including the interactive version of Figure 4.
