Data science and open source
Learn about open source tools for converting data into useful information

M. Tim Jones
August 09, 2013
developerWorks (ibm.com/developerWorks/)
Copyright IBM Corporation 2013

Data science combines mathematics and computer science for the purpose of extracting value from data. This article introduces data science and surveys prominent open source tools in this rapidly growing field.

The goal of data science is the extraction of useful information from a data set. Companies have recognized the value of data as a business asset for a long time. But the huge data volumes that are now available necessitate new ways to make sense of data and manage it efficiently. A growing cadre of engineers and scientists are building systems to apply data science to massive data volumes. This article introduces you to the field of data science and to open source tools that are available for today's data scientist.

Data science and data scientists

Data science begins with the collection of data. Candidates for collection can be open data or data that comes from internal business processes (for example, website statistics). Next comes refinement: the inventive process that reduces the data to useful information that answers specific questions. Typically, the questions define the approach to the extraction of the information. Within the collection and refinement steps are other important aspects such as data cleansing (or preprocessing) and data visualization.

Open data

Open data is the concept of democratizing data by making it freely available to everyone to use as they want. The growing open data movement follows the ideas behind open source. A useful source of open data is Data.gov (see Related topics), a US government website that was created to increase public access to data generated by the executive branch of the federal government.

You can also view data science as a business process. Mike Loukides of O'Reilly makes a compelling case that data science is the conversion of data not only into information but also into products (see Related topics). From that perspective, the field is a modern-day gold rush: a competitive search for the valuable nuggets in mountains of information.

The prospectors in the data gold rush are called data scientists. As businesses recognize the value in their data, the need for talented multidisciplinary engineers and scientists is growing. Data scientists must have skills in computer science, math, and statistics. Ideally, they also have domain knowledge: an understanding of the source of the data (medical, financial, web, and other domains). Figure 1 illustrates data science as the intersection of computer science, math and statistics, and domain knowledge:

Figure 1. Key disciplines of the data scientist

With this complete skill set, the data scientist can translate domain knowledge and math into an application (from the computer science domain) that mines data and refines it into information. The key is a multidisciplinary focus (which can also include domains such as machine learning and information retrieval).

Engineers and scientists with big data analytics experience are in high demand these days. McKinsey & Company predicts that by 2018 a shortage of people who can fit the data scientist role will occur (see Related topics). The ideas and approaches in data science are useful in many other disciplines too. Even if you don't aspire to become a data scientist, data science skills can be a great addition to your engineering toolbox.

Where data science is used

Like cloud computing, data science is rapidly gaining interest and adoption. Over the year before this article was written, interest in data science roughly doubled, according to Google Insights for Search (since merged into Google Trends). Google Insights for Search is itself an example of data science in action. Figure 2 shows that the frequency of data science as a web search term increased dramatically between the summer of 2011 and the spring of 2012:

Figure 2. Google Insights for Search data on interest in data science

Data science is quickly becoming a staple within organizations that harvest data online (be it crawling-based collection or internal collection that is based on user behaviors such as clicks). Major websites such as Google, Amazon, Facebook, and LinkedIn all have their own data science teams to use their available data (see Related topics).

Google's development of the PageRank algorithm is an early example of data science. Google crawls the web and assigns a numerical weight to the hyperlinks on every page to measure the relative importance of those links. (Full details of PageRank are known only within Google.) The algorithm serves as the means of ranking web content as a function of search terms.
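
Because the production details are not public, the idea of link-based weighting can only be illustrated with the textbook formulation of PageRank. The following minimal Python sketch uses power iteration over a tiny link graph. The graph, the function name pagerank, the damping factor of 0.85, and the iteration count are all assumptions made for illustration, not a description of Google's system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page "c" ends up with the highest score because most links point to it, and page "d" ends up with the lowest because nothing links to it.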

Large online retailers such as Amazon and Walmart use data science to try to increase sales. They generate recommendations to individual users that are based on the user's product searches and past purchases. LinkedIn, a professional networking site, maintains a huge amount of data that is related to people and their careers, interests, and connections. This massive network of data resulted in various recommendation engines (for individuals, groups, and companies) and projects that use the data at a deeper level to produce new products at LinkedIn.

One novel example of data science at a web property is the company bitly. On the surface, bitly is a service that enables users to shorten any URL to a 19-character maximum URL (which is stored permanently in bitly's data center). References to the shortened URL are redirected from bitly to the original URL. bitly can then see which URLs people shorten and which URLs other users click. This tactic provides an enormous amount of data that bitly (and its chief scientist, Hilary Mason) can use to generate a wealth of statistics about browsing habits. Users who are registered with bitly can see when their shortened URLs were clicked, through which referrer (email client, Twitter, or another URL), and from which country. Businesses can also use bitly to track user behavior for a set of content.
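
To make the data flow concrete, here is a small Python sketch of a shortening service that records a referrer and country for every redirect. It is an illustration only and not bitly's implementation: the Shortener class, the base-62 codes, the starting offset of 1000, and the example URL are all hypothetical.

```python
import string
from collections import Counter

ALPHABET = string.digits + string.ascii_letters  # 62 symbols


def encode(number):
    """Encode a non-negative integer as a base-62 string."""
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


class Shortener:
    """Hypothetical in-memory URL shortener that logs every click."""

    def __init__(self):
        self.urls = {}    # short code -> original URL
        self.codes = {}   # original URL -> short code
        self.clicks = {}  # short code -> list of (referrer, country)

    def shorten(self, url):
        # Reuse the existing code when the same URL is shortened again.
        if url not in self.codes:
            code = encode(len(self.urls) + 1000)
            self.codes[url] = code
            self.urls[code] = url
            self.clicks[code] = []
        return self.codes[url]

    def resolve(self, code, referrer, country):
        # The redirect is the moment at which click data is collected.
        self.clicks[code].append((referrer, country))
        return self.urls[code]

    def stats(self, code):
        referrers = Counter(referrer for referrer, _ in self.clicks[code])
        countries = Counter(country for _, country in self.clicks[code])
        return referrers, countries


service = Shortener()
code = service.shorten("http://example.com/a/very/long/path")
service.resolve(code, "twitter", "US")
service.resolve(code, "email", "US")
service.resolve(code, "twitter", "DE")
referrers, countries = service.stats(code)
print(code, dict(referrers), dict(countries))
```

The statistics come for free from the redirect step: every resolved short URL leaves behind a record that can be counted by referrer or by country.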

Open source tools for data science

Just as computer programming isn't constrained to a single language or development environment, data science isn't associated with a single tool or tool suite. A rich and broad array of tools in the open source domain advance data science. They include tools that process large data sets numerically, and visualization and prototyping tools that aid in the development of complex processing. Table 1 lists prominent open source tools for data scientists and defines their roles:

Table 1. Open source tools for data science

Tool                                     Description
Apache Hadoop                            Framework for processing big data
Apache Mahout                            Scalable machine-learning algorithms for Hadoop
Spark                                    Cluster-computing framework for data analytics
The R Project for Statistical Computing  Accessible data manipulation and graphing
Python, Ruby, Perl                       Prototyping and production scripting languages
SciPy                                    Python package for scientific computing
scikit-learn                             Python package for machine learning
Axiis                                    Interactive data visualization

The list in Table 1 isn't exhaustive but instead represents some of the core elements within the data scientist's toolbox. The open source domain is also filled with highly specialized and domain-specific libraries and tools (for example, utilities for interactive map visualization and for text analysis).

Hadoop, Mahout, and Spark

The Internet creates opportunities to collect masses of data about users' behavior and habits. Apache Hadoop is the premier framework for processing massive data sets. Hadoop is important for data science because it provides a scalable framework for distributed data processing. Not all data science problems require big data processing, but Hadoop is ideal when your problem involves Internet-scale data. The Google MapReduce framework's implementation of the PageRank algorithm is an early example of data science on a big data framework. (Hadoop is an implementation of MapReduce.) Apache Pig can make Hadoop even more accessible, bringing a query language that automatically builds MapReduce applications (see Related topics).
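
The MapReduce programming model itself is small enough to sketch in a few lines. The following single-process Python example counts words with explicit map, shuffle, and reduce steps. It only illustrates the model and does not use the Hadoop API; the function names and the sample documents are assumptions made for this sketch, and a real framework would run the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    pairs = (pair for document in documents for pair in map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["big data big ideas", "data science"])
print(counts)
```

Because each map call sees only one document and each reduce call sees only one key, both steps can be distributed without changing the logic, which is what makes the model attractive for very large data sets.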

Apache Mahout is an implementation of scalable machine-learning algorithms on the Hadoop platform (see Related topics). Mahout includes scalable implementations of clustering algorithms and batch-based collaborative filtering algorithms (for implementing recommendation systems).
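
As a rough picture of what collaborative filtering does, the following Python sketch recommends items that are frequently bought together with the items a user already owns. It is a simplified item co-occurrence approach written for this example, not Mahout's API or its algorithms, and the purchase data and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_occurrence(baskets):
    """Count how often each pair of items appears in the same user's basket."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for first, second in combinations(sorted(set(items)), 2):
            counts[first][second] += 1
            counts[second][first] += 1
    return counts


def recommend(user, baskets, top_n=2):
    """Score unowned items by how often they co-occur with owned items."""
    counts = co_occurrence(baskets)
    owned = set(baskets[user])
    scores = defaultdict(int)
    for item in owned:
        for other, count in counts[item].items():
            if other not in owned:
                scores[other] += count
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical purchase histories.
baskets = {
    "ann": ["book", "lamp", "pen"],
    "bob": ["book", "pen"],
    "cat": ["book", "lamp"],
    "dan": ["book"],
}
print(recommend("bob", baskets))
print(recommend("dan", baskets))
```

This version recomputes the co-occurrence counts on every call, which is fine for a toy data set. A batch-based system would typically compute them once over the full purchase history and then serve recommendations from the stored result.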

Another noteworthy solution for large data sets is the Spark framework (see Related topics). Spark includes optimizations such as in-memory cluster computing with fault-tolerant abstractions.

The R project

A tool that's often found in the data miner's toolkit is a programming language and development environment called R. R focuses on statistical computing and graphics. R is relatively simple to learn and is widely used in the domain of data analysis. Being open source and free, R is a popular language with a large user base.

R is a multiparadigm language that supports object-oriented, functional, procedural, and imperative programming styles. The language is interpreted through a command-line interface and also includes extensive production-level graphical capabilities. Static graphics are available out of the box. With additional packages, both dynamic and interactive graphs are possible.
Figure 3 shows an example plot that was generated with R:

Figure 3. Sample 3D sinc plot that uses R

The R programming language was developed in C and Fortran. Many of the internal standard functions in R were written in R itself. R supports mixed-language programming, enabling access to R objects from languages such as C and Java. You can easily extend the capabilities of R by using packages, which can be developed in the R, C, Java, and Fortran programming languages.

Scripting languages

Multiparadigm scripting languages such as Python, Ruby, and Perl provide a professional platform for application development and deployment. And they are ideal for prototyping and testing new ideas. These languages also support various data storage and communication formats, such as XML and JavaScript Object Notation (JSON), and a large variety of open source libraries for scientific computing and machine learning. Python is the clear leader in this space, probably because it is the easiest to learn for users who come from backgrounds other than computer science. Knowledge of Python is often a requirement for data scientist jobs.
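
As a small taste of this kind of prototyping, the following Python sketch parses a JSON document with the standard library and summarizes it. The click records and their field names are invented for the example.

```python
import json
from collections import Counter

# Hypothetical click log in JSON form.
raw = """
[
  {"url": "http://example.com/a", "referrer": "twitter", "country": "US"},
  {"url": "http://example.com/a", "referrer": "email", "country": "DE"},
  {"url": "http://example.com/b", "referrer": "twitter", "country": "US"}
]
"""

records = json.loads(raw)
by_country = Counter(record["country"] for record in records)
by_referrer = Counter(record["referrer"] for record in records)

# Serialize the summary back to JSON for another tool to consume.
summary = json.dumps({"country": by_country, "referrer": by_referrer}, sort_keys=True)
print(summary)
```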

SciPy and scikit-learn

The SciPy package extends Python into the domain of scientific programming. It supports various functions, including parallel programming tools, integration, ordinary differential equation solvers, and even an extension (called Weave) for including C/C++ code within Python code.
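
Two of the capabilities named above, integration and ordinary differential equation solving, can be tried in a few lines. This sketch assumes that NumPy and a recent SciPy release (1.0 or later, which provides solve_ivp) are installed. The integrand and the decay equation were chosen for the example because their exact answers are known.

```python
import numpy as np
from scipy.integrate import quad, solve_ivp

# Numerical integration: the integral of sin(x) over [0, pi] is exactly 2.
area, error_estimate = quad(np.sin, 0.0, np.pi)


# ODE solving: dy/dt = -0.5 * y with y(0) = 2 has the exact
# solution y(t) = 2 * exp(-0.5 * t).
def decay(t, y):
    return -0.5 * y


solution = solve_ivp(
    decay,
    (0.0, 4.0),              # time span
    [2.0],                   # initial value
    t_eval=[0.0, 2.0, 4.0],  # times at which to report the solution
    rtol=1e-8,
    atol=1e-10,
)

print(area)
print(solution.y[0])
print(2.0 * np.exp(-0.5 * 4.0))  # exact value at t = 4 for comparison
```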

Related to SciPy is scikit-learn, which is a package for Python-based machine learning. Scikit-learn includes many algorithms under the machine-learning umbrella for supervised learning (support vector machines, naive Bayes), unsupervised learning (clustering algorithms), and other algorithms for data-set manipulation.

Both of these packages extend the capabilities of Python for use as a data science platform.
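
The following sketch shows one supervised and one unsupervised algorithm from scikit-learn on the same synthetic data: a Gaussian naive Bayes classifier and k-means clustering. It assumes that NumPy and scikit-learn are installed. The two well-separated point clouds, the random seed, and the query points are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

# Synthetic data: two tight groups of 2D points around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 20 + [1] * 20)

# Supervised learning: fit on labeled points, then classify new ones.
classifier = GaussianNB().fit(X, y)
predicted = classifier.predict([[0.2, -0.1], [4.8, 5.3]])

# Unsupervised learning: recover the two groups without using the labels.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(predicted)
print(clusterer.labels_)
```

The classifier needs the labels in y, while k-means sees only the coordinates. On data this cleanly separated, both should find the same two groups, although the cluster numbers that k-means assigns are arbitrary.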

Axiis interactive data visualization

Many open source solutions focus solely on visualization. One especially interesting example is the Axiis framework, which provides a concise markup language for rich and colorful visualizations. Figure 4 shows an example:

Figure 4. Wedge stack graph visualization using the Axiis framework

Figure 4 is a static version of an interactive example from Tom Gonzalez, Managing Director at BrightPoint Consulting. See Related topics for a link to the interactive version.

Going further

The role of data scientist builds on a solid platform of knowledge and experience. But tools are also an important aspect of the data science field. In emerging disciplines, the open source community is often at the vanguard in establishing software where none existed before. The field of data science is no exception. Data science is relatively new, so more new tools, data protocols, and data formats are almost certainly in the works. But in data science, as in many other disciplines, open source solutions already lead in breadth and depth.

Related topics

Google Insights for Search: This Google site enables anyone to view search trends for a topic across regions of the world, including comparative trends of two or more topics.
Open data: Read about open data on Wikipedia.
"What is data science" (Mike Loukides, O'Reilly Radar, June 2010): Read a great introduction to data science and the idea behind transforming data into products.
"Growing Your Own Data Scientists" (Dan Woods, Forbes, March 2012): The article series surveys definitions of data scientist from leading experts in the field.
Hadoop on developerWorks: Explore a wealth of articles and other resources on Apache Hadoop and its related technologies.
"Apache Mahout: Scalable machine learning for everyone" (Grant Ingersoll, developerWorks, November 2011): Mahout committer Ingersoll describes Mahout's features and walks through an example of how to deploy and scale some of Mahout's more popular algorithms.
"Data visualization tools for Linux" (M. Tim Jones, developerWorks, November 2006): This article presents several useful data visualization tools that bear some similarity to the R Project.
Big data: The next frontier for competition: Read about research from McKinsey & Co. on the role of big data and data scientists.
Data.gov: Browse the Data.gov data sets available through the online catalog and use multiple criteria to filter your search.
Science.gov: This portal provides access to more than 55 databases and 2,100 websites from 13 federal agencies for US government science information. As on Data.gov, you can restrict your searches by search criteria or by specific agencies.
"Process your data with Apache Pig" (M. Tim Jones, developerWorks, February 2012): Learn more about Pig and how to put it to work in your applications.
"Spark, an alternative for fast data analytics" (M. Tim Jones, developerWorks, November 2011): Get to know the Spark approach to cluster computing and its differences from Hadoop.
Apache Hadoop: Download Hadoop.
Apache Mahout: Download Mahout from an Apache mirror.
Spark: Get the latest Spark release.
R programming language: Get R, a multiparadigm language and development environment with broad use in statistics and visualization.
Python, Ruby, and Perl: Simplify the development and prototyping of algorithms for data refinement with these multiparadigm scripting languages.
SciPy and scikit-learn: Use Python's data science capabilities with the SciPy package for scientific computing and the scikit-learn package for machine learning.
Axiis: The Axiis data visualization framework is a useful solution for both beginners and experts. Check out the examples page to see what's possible with the framework, including the interactive version of Figure 4.

Copyright IBM Corporation 2013 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)
