usergoogle

google统计  时间:2021-02-11  阅读:()
CopyrightIBMCorporation2013TrademarksDatascienceandopensourcePage1of8DatascienceandopensourceLearnaboutopensourcetoolsforconvertingdataintousefulinformationM.
TimJonesAugust09,2013Datasciencecombinesmathematicsandcomputerscienceforthepurposeofextractingvaluefromdata.
Thisarticleintroducesdatascienceandsurveysprominentopensourcetoolsinthisrapidlygrowingfield.
Thegoalofdatascienceistheextractionofusefulinformationfromadataset.
Companieshaverecognizedthevalueofdataasabusinessassetforalongtime.
Butthehugedatavolumesthatarenowavailablenecessitatenewwaystomakesenseofdataandmanageitefficiently.
Agrowingcadreofengineersandscientistsarebuildingsystemstoapplydatasciencetomassivedatavolumes.
Thisarticleintroducesyoutothefieldofdatascienceandtoopensourcetoolsthatareavailablefortoday'sdatascientist.
DatascienceanddatascientistsDatasciencebeginswiththecollectionofdata.
Candidatesforcollectioncanbeopendataordatathatcomesfrominternalbusinessprocesses(forexample,websitestatistics).
Nextcomesrefinement:theinventiveprocessthatreducesthedatatousefulinformationthatanswersspecificquestions.
Typically,thequestionsdefinetheapproachtotheextractionoftheinformation.
Withinthecollectionandrefinementstepsareotherimportantaspectssuchasdatacleansing(orpreprocessing)anddatavisualization.
OpendataOpendataistheconceptofdemocratizingdatabymakingitfreelyavailabletoeveryonetouseastheywant.
Thegrowingopendatamovementfollowstheideasbehindopensource.
AusefulsourceofopendataisData.
gov(seeRelatedtopics),aUSgovernmentwebsitethatwascreatedtoincreasepublicaccesstodatageneratedbytheexecutivebranchofthefederalgovernment.
Youcanalsoviewdatascienceasabusinessprocess.
MikeLoukidesofO'Reillymakesacompellingcasethatdatascienceistheconversionofdatanotonlyintoinformationbutalsointoproducts(seeRelatedtopics).
Fromthatperspective,thefieldisamodern-daygoldrush—acompetitivesearchforthevaluablenuggetsinmountainsofinformation.
developerWorksibm.
com/developerWorks/DatascienceandopensourcePage2of8Theprospectorsinthedatagoldrusharecalleddatascientists.
Asbusinessesrecognizethevalueintheirdata,theneedfortalentedmultidisciplinaryengineersandscientistsisgrowing.
Datascientistsmusthaveskillsincomputerscience,math,andstatistics.
Ideally,theyalsohavedomainknowledge—anunderstandingofthesourceofthedata(medical,financial,web,andotherdomains).
Figure1illustratesdatascienceastheintersectionofcomputerscience,mathandstatistics,anddomainknowledge:Figure1.
KeydisciplinesofthedatascientistWiththiscompleteskillset,thedatascientistcantranslatedomainknowledgeandmathintoanapplication(fromthecomputersciencedomain)thatminesdataandrefinesitintoinformation.
Thekeyisamultidisciplinaryfocus(whichcanalsoincludedomainssuchasmachinelearningandinformationretrieval).
Engineersandscientistswithbigdataanalyticsexperienceareinhighdemandthesedays.
McKinsey&Companypredictsthatby2018ashortageofpeoplewhocanfitthedatascientistrolewilloccur(seeRelatedtopics).
Theideasandapproachesindatascienceareusefulinmanyotherdisciplinestoo.
Evenifyoudon'taspiretobecomeadatascientist,datascienceskillscanbeagreatadditiontoyourengineeringtoolbox.
WheredatascienceisusedLikecloudcomputing,datascienceisrapidlygaininginterestandadoption.
Overtheyearbeforethisarticlewaswritten,interestindatascienceroughlydoubled,accordingtoGoogleInsightsforSearch(formerlyGoogleTrends).
GoogleInsightsforSearchisitselfanexampleofdatascienceinaction.
Figure2showsthatthefrequencyofdatascienceasawebsearchtermincreaseddramaticallybetweenthesummerof2011andthespringof2012:ibm.
com/developerWorks/developerWorksDatascienceandopensourcePage3of8Figure2.
GoogleInsightsforSearchdataoninterestindatascienceDatascienceisquicklybecomingastaplewithinorganizationsthatharvestdataonline(beitcrawling-basedcollectionorinternalcollectionthatisbasedonuserbehaviorssuchasclicks).
MajorwebsitessuchasGoogle,Amazon,Facebook,andLinkedInallhavetheirowndatascienceteamstousetheiravailabledata(seeRelatedtopics).
Google'sdevelopmentofthePageRankalgorithmisanearlyexampleofdatascience.
Googlecrawlsthewebandassignsanumericalweighttothehyperlinksoneverypagetomeasuretherelativeimportanceofthoselinks.
(FulldetailsofPageRankareknownonlywithinGoogle.
)Thealgorithmservesasthemeansofrankingwebcontentasafunctionofsearchterms.
LargeonlineretailerssuchaslikeAmazonandWalmartusedatasciencetotrytoincreasesales.
Theygeneraterecommendationstoindividualusersthatarebasedtheuser'sproductsearchesandpastpurchases.
LinkedIn,aprofessionalnetworkingsite,maintainsahugeamountofdatathatisrelatedtopeopleandtheircareers,interests,andconnections.
Thismassivenetworkofdataresultedinvariousrecommendationengines(forindividuals,groups,andcompanies)andprojectsthatusethedataatadeeperleveltoproducenewproductsatLinkedIn.
Onenovelexampleofdatascienceatawebpropertyisthecompanybitly.
Onthesurface,bitlyisaservicethatenablesuserstoshortenanyURLtoa19-charactermaximumURL(whichisstoredpermanentlyinbitly'sdatacenter).
ReferencestotheshortenedURLareredirectedfrombitlytotheoriginalURL.
bitlycanthenseewhichURLspeopleshortenandwhichURLsotherusersclick.
Thistacticprovidesanenormousamountofdatathatbitly(anditschiefscientist,HilaryMason)canusetogenerateawealthofstatisticsaboutbrowsinghabits.
UserswhoareregisteredwithbitlycanseewhentheirshortenedURLswereclicked,throughwhichreferrer(emailclient,Twitter,oranotherURL),andfromwhichcountry.
Businessescanalsousebitlytotrackuserbehaviorforasetofcontent.
developerWorksibm.
com/developerWorks/DatascienceandopensourcePage4of8OpensourcetoolsfordatascienceJustascomputerprogrammingisn'tconstrainedtoasinglelanguageordevelopmentenvironment,datascienceisn'tassociatedwithasingletoolortoolsuite.
Arichandbroadarrayoftoolsintheopensourcedomainadvancedatascience.
Theyincludetoolsthatprocesslargedatasetsnumerically,andvisualizationandprototypingtoolsthataidinthedevelopmentofcomplexprocessing.
Table1listsprominentopensourcetoolsfordatascientistsanddefinestheirroles:Table1.
OpensourcetoolsfordatascienceToolDescriptionApacheHadoopFrameworkforprocessingbigdataApacheMahoutScalablemachine-learningalgorithmsforHadoopSparkCluster-computingframeworkfordataanalyticsTheRProjectforStatisticalComputingAccessibledatamanipulationandgraphingPython,Ruby,PerlPrototypingandproductionscriptinglanguagesSciPyPythonpackageforscientificcomputingscikit-learnPythonpackageformachinelearningAxiisInteractivedatavisualizationThelistinTable1isn'texhaustivebutinsteadrepresentssomeofthecoreelementswithinthedatascientist'stoolbox.
Theopensourcedomainisalsofilledwithhighlyspecializedanddomain-specificlibrariesandtools(forexample,utilitiesforinteractivemapvisualizationandfortextanalysis).
Hadoop,Mahout,andSparkTheInternetcreatesopportunitiestocollectmassesofdataaboutusers'behaviorandhabits.
ApacheHadoopisthepremierframeworkforprocessingmassivedatasets.
Hadoopisimportantfordatasciencebecauseitprovidesascalableframeworkfordistributeddataprocessing.
Notalldatascienceproblemsrequirebigdataprocessing,butHadoopisidealwhenyourprobleminvolvesInternet-scaledata.
TheGoogleMapReduceframework'simplementationofthePageRankalgorithmisanearlyexampleofdatascienceonabigdataframework.
(HadoopisanimplementationofMapReduce.
)ApachePigcanmakeHadoopevenmoreaccessible,bringingaquerylanguagethatautomaticallybuildsMapReduceapplications(seeRelatedtopics).
ApacheMahoutisanimplementationofscalablemachine-learningalgorithmsontheHadoopplatform(seeRelatedtopics).
Mahoutincludesscalableimplementationsofclusteringalgorithmsandbatch-basedcollaborativefilteringalgorithms(forimplementingrecommendationsystems).
AnothernoteworthysolutionforlargedatasetsistheSparkframework(seeRelatedtopics).
Sparkincludesoptimizationssuchasin-memoryclustercomputingwithfault-tolerantabstractions.
TheRprojectAtoolthat'softenfoundinthedataminer'stoolkitisaprogramminglanguageanddevelopmentenvironmentcalledR.
Rfocusesonstatisticalcomputingandgraphics.
Risrelativelysimpleibm.
com/developerWorks/developerWorksDatascienceandopensourcePage5of8tolearnandiswidelyusedinthedomainofdataanalysis.
Beingopensourceandfree,Risapopularlanguagewithalargeuserbase.
Risamultiparadigmlanguagethatsupportsobject-oriented,functional,procedural,andimperativeprogrammingstyles.
Thelanguageisinterpretedthroughacommand-lineinterfaceandalsoincludesextensiveproduction-levelgraphicalcapabilities.
Staticgraphicsareavailableoutofthebox.
Withadditionalpackages,bothdynamicandinteractivegraphsarepossible.
Figure3showsanexampleplotthatwasgeneratedwithR:Figure3.
Sample3DsincplotthatusesRTheRprogramminglanguagewasdevelopedinCandFortran.
ManyoftheinternalstandardfunctionsinRwerewritteninRitself.
Rsupportsmixed-languageprogramming,enablingaccesstoRobjectsfromlanguagessuchasCandJava.
YoucaneasilyextendthecapabilitiesofRbyusingpackages,whichcanbedevelopedintheR,C,Java,andFortranprogramminglanguages.
ScriptinglanguagesMultiparadigmscriptinglanguagessuchasPython,Ruby,andPerlprovideaprofessionalplatformforapplicationdevelopmentanddeployment.
Andtheyareidealforprototypingandtestingnewideas.
Theselanguagesalsosupportvariousdatastorageandcommunicationformats,suchasXMLandJavaScriptObjectNotation(JSON),andalargevarietyofopensourcelibrariesforscientificcomputingandmachinelearning.
Pythonistheclearleaderinthisspace,probablybecauseitistheeasiesttolearnforuserswhocomefrombackgroundsotherthancomputerscience.
KnowledgeofPythonisoftenarequirementfordatascientistjobs.
SciPyandscikit-learnTheSciPypackageextendsPythonintothedomainofscientificprogramming.
Itsupportsvariousfunctions,includingparallelprogrammingtools,integration,ordinarydifferentialequationsolvers,andevenanextension(calledWeave)forincludingC/C++codewithinPythoncode.
RelatedtoSciPyisscikit-learn,whichisapackageforPython-basedmachinelearning.
Scikit-learnincludesmanyalgorithmsunderthemachine-learningumbrellaforsupervisedlearning(supportforvectormachines,naiveBayes),unsupervisedlearning(clusteringalgorithms),andotheralgorithmsfordata-setmanipulation.
developerWorksibm.
com/developerWorks/DatascienceandopensourcePage6of8BothofthesepackagesextendthecapabilitiesofPythonforuseasadatascienceplatform.
AxiisinteractivedatavisualizationManyopensourcesolutionsfocussolelyonvisualization.
OneespeciallyinterestingexampleistheAxiisframework,whichprovidesaconcisemarkuplanguageforrichandcolorfulvisualizations.
Figure4showsanexample:Figure4.
WedgestackgraphvisualizationusingtheAxiisframeworkFigure4isastaticversionofaninteractiveexamplefromTomGonzalez,ManagingDirectoratBrightPointConsulting.
SeeRelatedtopicsforalinktotheinteractiveversion.
GoingfurtherTheroleofdatascientistbuildsonasolidplatformofknowledgeandexperience.
Buttoolsarealsoanimportantaspectofthedatasciencefield.
Inemergingdisciplines,theopensourcecommunityisoftenatthevanguardinestablishingsoftwarewherenoneexistedbefore.
Thefieldofdatascienceisnoexception.
Datascienceisrelativelynew,somorenewtools,dataprotocols,anddataformatsarealmostcertainlyintheworks.
Butindatascience,asinmanyotherdisciplines,opensourcesolutionsalreadyleadinbreadthanddepth.
ibm.
com/developerWorks/developerWorksDatascienceandopensourcePage7of8RelatedtopicsGoogleInsightsforSearch:ThisGooglesiteenablesanyonetoviewsearchtrendsforatopicacrossregionsoftheworld,includingcomparativetrendsoftwoormoretopics.
Opendata:ReadaboutopendataonWikipedia.
"Whatisdatascience"(MikeLoukides,O'ReillyRadar,June2010):Readagreatintroductiontodatascienceandtheideabehindtransformingdataintoproducts.
"GrowingYourOwnDataScientists"(DanWoods,Forbes,March2012):Thearticleseriessurveysdefinitionsofdatascientistfromleadingexpertsinthefield.
HadoopondeveloperWorks:ExploreawealthofarticlesandotherresourcesonApacheHadoopanditsrelatedtechnologies.
"ApacheMahout:Scalablemachinelearningforeveryone"(GrantIngersoll,developerWorks,November2011):MahoutcommitterIngersolldescribesMahout'sfeaturesandwalksthroughanexampleofhowtodeployandscalesomeofMahout'smorepopularalgorithms.
"DatavisualizationtoolsforLinux"(M.
TimJones,developerWorks,November2006):ThisarticlepresentsseveralusefuldatavisualizationtoolsthatbearsomesimilaritytotheRProject.
Bigdata:Thenextfrontierforcompetition:ReadaboutresearchfromMcKinsey&Co.
andontheroleofbigdataanddatascientists.
Data.
gov:BrowsetheData.
govdatasetsavailablethroughtheonlinecatalogandusemultiplecriteriatofilteryoursearch.
Science.
gov:Thisportalprovidesaccesstomorethan55databasesand2,100websitesfrom13federalagenciesforUSgovernmentscienceinformation.
AsonData.
gov,youcanrestrictyoursearchesbysearchcriteriaorbyspecificagencies.
"ProcessyourdatawithApachePig"(M.
TimJones,developerWorks,February2012):LearnmoreaboutPigandhowtoputittoworkinyourapplications.
"Spark,analternativeforfastdataanalytics"(M.
TimJones,developerWorks,November2011):GettoknowtheSparkapproachtoclustercomputinganditsdifferencesfromHadoop.
ApacheHadoop:DownloadHadoop.
ApacheMahout:DownloadMahoutfromanApachemirror.
Spark:GetthelatestSparkrelease.
Rprogramminglanguage:GetR,amultiparadigmlanguageanddevelopmentenvironmentwithbroaduseinstatisticsandvisualizationPython,Ruby,andPerl:Simplifythedevelopmentandprototypingofalgorithmsfordatarefinementwiththesemultiparadigmscriptinglanguages.
SciPyandscikit-learn:UsePython'sdatasciencecapabilitieswiththeSciPypackageforscientificcomputingandthescikit-learnpackageformachinelearning.
Axiis:TheAxiisdatavisualizationframeworkisausefulsolutionforbothbeginnersandexperts.
Checkouttheexamplespagetoseewhat'spossiblewiththeframework,includingtheinteractiveversionofFigure4.
developerWorksibm.
com/developerWorks/DatascienceandopensourcePage8of8CopyrightIBMCorporation2013(www.
ibm.
com/legal/copytrade.
shtml)Trademarks(www.
ibm.
com/developerworks/ibm/trademarks/)

百纵科技云主机首月9元,站群1-8C同价,美国E52670*1,32G内存 50M 899元一月

百纵科技:美国高防服务器,洛杉矶C3机房 独家接入zenlayer清洗 带金盾硬防,CPU全系列E52670、E52680v3 DDR4内存 三星固态盘阵列!带宽接入了cn2/bgp线路,速度快,无需备案,非常适合国内外用户群体的外贸、搭建网站等用途。官方网站:https://www.baizon.cnC3机房,双程CN2线路,默认200G高防,3+1(高防IP),不限流量,季付送带宽美国洛杉矶C...

妮妮云(43元/月 ) 香港 8核8G 43元/月 美国 8核8G

妮妮云的来历妮妮云是 789 陈总 张总 三方共同投资建立的网站 本着“良心 便宜 稳定”的初衷 为小白用户避免被坑妮妮云的市场定位妮妮云主要代理市场稳定速度的云服务器产品,避免新手购买云服务器的时候众多商家不知道如何选择,妮妮云就帮你选择好了产品,无需承担购买风险,不用担心出现被跑路 被诈骗的情况。妮妮云的售后保证妮妮云退款 通过于合作商的友好协商,云服务器提供2天内全额退款,超过2天不退款 物...

极光KVM(限时16元),洛杉矶三网CN2,cera机房,香港cn2

极光KVM创立于2018年,主要经营美国洛杉矶CN2机房、CeRaNetworks机房、中国香港CeraNetworks机房、香港CMI机房等产品。其中,洛杉矶提供CN2 GIA、CN2 GT以及常规BGP直连线路接入。从名字也可以看到,VPS产品全部是基于KVM架构的。极光KVM也有明确的更换IP政策,下单时选择“IP保险计划”多支付10块钱,可以在服务周期内免费更换一次IP,当然也可以不选择,...

google统计为你推荐
党建搜狗浏览器2变量itunes朝阳分局犯罪嫌疑人标准化信息采集系统itunes备份itunes就是备份不了怎么办啊127.0.0.1传奇服务器非法网关连接: 127.0.0.1google中国地图谷歌卫星地图中文版下载在哪下??电信版iphone4s电信版iphone4s是买16gb的好还是32gb的好?google搜图google的直接搜索图片的功能为什么没了chrome18CHROME现在最新版是多少?chrome18谷歌浏览器,你正在用哪个版本呢??
希网动态域名 网络星期一 外国域名 174.127.195.202 免费smtp服务器 panel1 bgp双线 789电视 赞助 秒杀汇 佛山高防服务器 广州服务器 1美金 免费dns解析 常州联通宽带 江苏双线服务器 空间租赁 云营销系统 杭州电信宽带优惠 深圳域名 更多