CS246: Mining Massive Datasets
Winter 2014
Problem Set 0
Due 9:30am January 14, 2014

General Instructions

This homework is to be completed individually (no collaboration is allowed). Also, you are not allowed to use any late days for the homework. This homework is worth 1% of the total course grade.

The purpose of this homework is to get you started with Hadoop. Here you will learn how to write, compile, debug and execute a simple Hadoop program. The first part of the homework serves as a tutorial and the second part asks you to write your own Hadoop program.

Section 1 describes the virtual machine environment. Instead of the virtual machine, you are welcome to set up your own pseudo-distributed or fully distributed cluster if you prefer. Any version of Hadoop that is at least 1.0 will suffice. (For an easy way to set up a cluster, try Cloudera Manager: http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin.) If you choose to set up your own cluster, you are responsible for making sure the cluster is working properly. The TAs will be unable to help you debug configuration issues in your own cluster.

Section 2 explains how to use the Eclipse environment in the virtual machine, including how to create a project, how to run jobs, and how to debug jobs. Section 2.5 gives an end-to-end example of creating a project, adding code, building, running, and debugging it.

Section 3 is the actual homework assignment. There are no deliverables for sections 1 and 2. In section 3, you are asked to write and submit your own MapReduce job.

This homework requires you to upload the code and hand in a print-out of the output for Section 3.

Regular (non-SCPD) students should submit hard copies of the answers (Section 3) either in class or in the submission box (see course website for location). For paper submission, please fill in the cover sheet and submit it as a front page with your answers. You should upload your source code and any other files you used.

SCPD students should submit their answers through SCPD and also upload the code. The submission must include the answers to Section 3, the cover sheet and the usual SCPD routing form (http://scpd.stanford.edu/generalInformation/pdf/SCPD_HomeworkRouteForm.pdf).

Cover Sheet: http://cs246.stanford.edu/cover.pdf
Upload Link: http://snap.stanford.edu/submit/

Questions

1 Setting up a virtual machine

- Download and install VirtualBox on your machine: http://virtualbox.org/wiki/Downloads
- Download the Cloudera Quickstart VM at http://www.cloudera.com/content/dev-center/en/home/developer-admin-resources/quickstart-vm.html
- Uncompress the VM archive. It is compressed with 7-Zip. If needed, you can download a tool to uncompress the archive at http://www.7-zip.org/.
- Start VirtualBox and click Import Appliance. Click the folder icon beside the location field. Browse to the uncompressed archive folder, select the .ovf file, and click the Open button. Click the Continue button. Click the Import button.
- Your virtual machine should now appear in the left column. Select it and click on Start to launch it. Username and password are "cloudera" and "cloudera".
- Optional: Open the network properties for the virtual machine. Click on the Adapter 2 tab. Enable the adapter and select Host-only Adapter. If you do this step, you will be able to connect to the running virtual machine from the host OS at 192.168.56.101.

The virtual machine includes the following software:

- CentOS 6.2
- JDK 6 (1.6.0_32)
- Hadoop 2.0.0
- Eclipse 4.2.6 (Juno)

The login user is cloudera, and the password for that account is cloudera.

2 Running Hadoop jobs

Generally Hadoop can be run in three modes.

1. Standalone (or local) mode: There are no daemons used in this mode. Hadoop uses the local file system as a substitute for the HDFS file system. The jobs will run as if there is 1 mapper and 1 reducer.
2. Pseudo-distributed mode: All the daemons run on a single machine and this setting mimics the behavior of a cluster. All the daemons run on your machine locally using the HDFS protocol. There can be multiple mappers and reducers.
3. Fully-distributed mode: This is how Hadoop runs on a real cluster.

In this homework we will show you how to run Hadoop jobs in Standalone mode (very useful for developing and debugging) and also in Pseudo-distributed mode (to mimic the behavior of a cluster environment).
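
2.1 Creating a Hadoop project in Eclipse

(There is a plugin for Eclipse that makes it simple to create a new Hadoop project and execute Hadoop jobs, but the plugin is only well maintained for Hadoop 1.0.4, which is a rather old version of Hadoop. There is a project at https://github.com/winghc/hadoop2x-eclipse-plugin that is working to update the plugin for Hadoop 2.0. You can try it out if you like, but your mileage may vary.)

To create a project:

1. Open or create the ~/.m2/settings.xml file and make sure it has the following contents:

    <settings>
      <profiles>
        <profile>
          <id>standard-extra-repos</id>
          <activation>
            <activeByDefault>true</activeByDefault>
          </activation>
          <repositories>
            <repository>
              <id>central</id>
              <url>http://repo.maven.apache.org/maven2/</url>
              <releases>
                <enabled>true</enabled>
              </releases>
              <snapshots>
                <enabled>true</enabled>
              </snapshots>
            </repository>
            <repository>
              <id>cloudera</id>
              <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
              <releases>
                <enabled>true</enabled>
              </releases>
              <snapshots>
                <enabled>true</enabled>
              </snapshots>
            </repository>
          </repositories>
        </profile>
      </profiles>
    </settings>

2. Open Eclipse and select File → New → Project...
3. Expand the Maven node, select Maven Project, and click the Next > button.
4. On the next screen, click the Next > button.
5. On the next screen, when the archetypes have loaded, select maven-archetype-quickstart and click the Next > button.
6. On the next screen, enter a group name in the Group Id field, and enter a project name in the Artifact Id field. Click the Finish button.
7. In the package explorer, expand the project node and double-click the pom.xml file to open it.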
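8. Replace the current "dependencies" section with the following content:

    <dependencyManagement>
      <dependencies>
        <dependency>
          <groupId>jdk.tools</groupId>
          <artifactId>jdk.tools</artifactId>
          <version>1.6</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-hdfs</artifactId>
          <version>2.0.0-cdh4.0.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-auth</artifactId>
          <version>2.0.0-cdh4.0.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
          <version>2.0.0-cdh4.0.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>2.0.0-mr1-cdh4.0.1</version>
        </dependency>
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit-dep</artifactId>
          <version>4.8.2</version>
        </dependency>
      </dependencies>
    </dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
      </dependency>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-auth</artifactId>
      </dependency>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
      </dependency>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
      </dependency>
      <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.10</version>
        <scope>test</scope>
      </dependency>
    </dependencies>
    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>2.1</version>
          <configuration>
            <source>1.6</source>
            <target>1.6</target>
          </configuration>
        </plugin>
      </plugins>
    </build>

9. Save the file.
10. Right-click on the project node and select Maven → Update Project.

You can now create classes in the src directory. After writing your code, build the JAR file by right-clicking on the project node and selecting Run As → Maven install.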

2.2 Running Hadoop jobs in standalone mode

After creating a project, adding source code, and building the JAR file as outlined above, the JAR file will be located in the ~/workspace//target directory. Open a terminal and run the following command:

    hadoop jar ~/workspace//target/-0.0.1-SNAPSHOT.jar \
        -Dmapred.job.tracker=local -Dfs.defaultFS=local

You will see all of the output from the map and reduce tasks in the terminal.
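
The two -D options in the command above are property overrides that Hadoop picks up when the job is launched through its generic option handling (for example via ToolRunner). The following is only a simplified plain-Java stand-in for that idea, not Hadoop's own parser; the class and method names are hypothetical and exist purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified illustration of -Dkey=value handling; NOT Hadoop's actual parser. */
class GenericOptionsSketch {

    /** Collects "-Dkey=value" and "-D key=value" pairs; later values override earlier ones. */
    static Map<String, String> parseDefines(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];            // "-D key=value" form
            } else if (arg.startsWith("-D")) {
                pair = arg.substring(2);     // "-Dkey=value" form
            }
            if (pair != null) {
                int eq = pair.indexOf('=');
                if (eq > 0) {                // ignore malformed entries without a key
                    props.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
        return props;
    }

    public static void main(String[] args) {
        String[] example = {"-Dmapred.job.tracker=local", "-Dfs.defaultFS=local", "input", "output"};
        // Positional arguments ("input", "output") are left for the job itself.
        System.out.println(parseDefines(example));
    }
}
```

In this sketch the overrides end up in an ordinary map; in a real job they would land in the job's configuration, which is why the same JAR can run in standalone or pseudo-distributed mode without recompiling.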

2.3 Running Hadoop jobs in pseudo-distributed mode

Open a terminal and run the following command:

    hadoop jar ~/workspace//target/-0.0.1-SNAPSHOT.jar

To see all running jobs, run the following command:

    hadoop job -list

To kill a running job, find the job's ID and then run the following command:

    hadoop job -kill

2.4 Debugging Hadoop jobs

To debug an issue with a job, the easiest approach is to add print statements into the source file and run the job in standalone mode. The print statements will appear in the terminal output.
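
As a minimal sketch of the print-statement approach, the plain-Java snippet below tags each debug line and sends it to stderr so it is easy to tell apart from regular output. The class name, the tag format and the sample values are assumptions made for illustration; in a real job the same println calls would sit inside your map or reduce code.

```java
/** Illustrative helper for tagged debug prints; names and format are hypothetical. */
class DebugPrintSketch {

    /** Builds a debug line with a searchable tag naming the stage it came from. */
    static String debugLine(String stage, String detail) {
        return "[DEBUG " + stage + "] " + detail;
    }

    public static void main(String[] args) {
        // Regular program output goes to stdout.
        System.out.println("hello\t1");
        // Debug output goes to stderr, so it can be inspected separately.
        System.err.println(debugLine("map", "token=hello"));
    }
}
```

A fixed tag such as "[DEBUG map]" makes it simple to search for your own statements among the framework's messages.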

When running your job in pseudo-distributed mode, the output from the job is logged in the task tracker's log files, which can be accessed most easily by pointing a web browser to port 50030 of the server. From the job tracker web page, you can drill down into the failing job, the failing task, the failed attempt, and finally the log files. Note that the logs for stdout and stderr are separated, which can be useful when trying to isolate specific debugging print statements.

If you enabled the second network adapter in the VM setup, you can point your local browser to http://192.168.56.101:50030/ to access the job tracker page. Note, though, that when you follow links that lead to the task tracker web page, the links point to localhost.localdomain, which means your browser will return a page not found error. Simply replace localhost.localdomain with 192.168.56.101 in the URL bar and press enter to load the correct page.
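
If you end up fixing such links often, the host replacement can be scripted. The sketch below uses only java.net.URI; the class name, the port and the path in the sample URL are hypothetical and only stand in for whatever link the job tracker page produces.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Illustrative host rewrite for task tracker links; the sample URL is made up. */
class TaskTrackerLinkSketch {

    /** Returns the same URL with its host replaced, keeping scheme, port, path and query. */
    static String rewriteHost(String url, String newHost) throws URISyntaxException {
        URI u = new URI(url);
        URI rewritten = new URI(u.getScheme(), u.getUserInfo(), newHost,
                u.getPort(), u.getPath(), u.getQuery(), u.getFragment());
        return rewritten.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // Hypothetical link of the kind that points at localhost.localdomain.
        String link = "http://localhost.localdomain:50060/tasklog?attemptid=attempt_1";
        System.out.println(rewriteHost(link, "192.168.56.101"));
    }
}
```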

2.5 Example project

In this section you will create a new Eclipse Hadoop project, compile, and execute it. The program will count the frequency of all the words in a given large text file. In your virtual machine, Hadoop, Java environment and Eclipse have already been pre-installed.

Edit the ~/.m2/settings.xml file as outlined above. See Figure 1.

Figure 1: Create a Hadoop Project.

Open Eclipse and create a new project as outlined above. See Figures 2-9.

Figures 2-9: Create a Hadoop Project.

The project will contain a stub source file in the src/main/java directory that we will not use. Instead, create a new class called WordCount. From the File menu, select New → Class. See Figure 10.

Figure 10: Create java file.

On the next screen, enter the package name (e.g., the group ID plus the project name) in the Package field. Enter WordCount as the Name. See Figure 11.

Figure 11: Create java file.

In the Superclass field, enter Configured and click the Browse button. From the pop-up window select Configured - org.apache.hadoop.conf and click the OK button. See Figure 12.

Figure 12: Create java file.

In the Interfaces section, click the Add button. From the pop-up window select Tool - org.apache.hadoop.util and click the OK button. See Figure 13.

Figure 13: Create java file.

Check the boxes for public static void main(String args[]) and Inherited abstract methods and click the Finish button. See Figure 14.

Figure 14: Create WordCount.java.

You will now have a rough skeleton of a Java file as in Figure 15. You can now add code to this class to implement your Hadoop job.

Figure 15: Create WordCount.java.

Rather than implement a job from scratch, copy the contents from http://snap.stanford.edu/class/cs246-data-2014/WordCount.java and paste it into the WordCount.java file. Be careful to leave the package statement at the top intact. See Figure 16.

Figure 16: Create WordCount.java.

The code in WordCount.java calculates the frequency of each word in a given dataset.
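
To make the word-frequency computation concrete without reproducing the course's WordCount.java, here is a plain-Java sketch of the same idea with no Hadoop classes. It assumes words are whitespace-separated tokens and are counted case-sensitively; both choices, as well as the class and method names, are assumptions and may differ from the provided code.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Plain-Java illustration of word counting; not the course's WordCount.java. */
class WordCountSketch {

    /** Counts whitespace-separated tokens across all lines, sorted by token. */
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" step: split the line into tokens, conceptually emitting (token, 1).
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                // "Reduce" step folded in: add up the 1s for each distinct token.
                counts.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("to be or not to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

In an actual MapReduce job the two steps run in separate mapper and reducer classes, with the framework grouping the emitted pairs by key in between.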

Build the project by right-clicking the project node and selecting Run As → Maven install. See Figure 17.

Figure 17: Create WordCount.java.

Download the Complete Works of William Shakespeare from Project Gutenberg at http://www.gutenberg.org/cache/epub/100/pg100.txt. Open a terminal and change to the directory where the dataset was stored.

Run the command:

    hadoop jar ~/workspace/wordcount/target/wordcount-0.0.1-SNAPSHOT.jar \
        edu.stanford.cs246.wordcount.WordCount -Dmapred.job.tracker=local \
        -Dfs.defaultFS=local dataset output

See Figure 18.

Figure 18: Run WordCount job.

If the job succeeds, you will see an output directory in the current directory that contains a file called part-00000. The part-00000 file contains the output from the job. See Figure 19.

Figure 19: Run WordCount job.

Run the command:

    hadoop fs -ls

The command will list the contents of your home directory in HDFS, which should be empty, resulting in no output.

Run the command:

    hadoop fs -copyFromLocal pg100.txt

to copy the dataset into HDFS.

Run the command:

    hadoop fs -ls

again. You should see the dataset listed, as in Figure 20, indicating that the dataset is in HDFS.

Figure 20: Run WordCount job.

Run the command:

    hadoop jar ~/workspace/WordCount/target/WordCount-0.0.1-SNAPSHOT.jar \
        edu.stanford.cs246.wordcount.WordCount pg100.txt output

See Figure 21. If the job fails, you will see a message indicating that the job failed. Otherwise, you can assume the job succeeded.

Figure 21: Run WordCount job.

Run the command:

    hadoop fs -ls output

You should see an output file for each reducer. Since there was only one reducer for this job, you should only see one part-* file. Note that sometimes the files will be called part-NNNNN, and sometimes they'll be called part-r-NNNNN. See Figure 22.

Figure 22: Run WordCount job.
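
If you post-process the output directory from your own code, both naming schemes can be matched with a single pattern. The sketch below assumes NNNNN always means exactly five digits, as in the names above; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Illustrative matcher for reducer output file names; assumes five-digit suffixes. */
class PartFileSketch {

    // Accepts part-NNNNN as well as part-r-NNNNN.
    private static final Pattern PART_FILE = Pattern.compile("part-(r-)?\\d{5}");

    /** Returns true if the file name looks like a reducer output file. */
    static boolean isPartFile(String fileName) {
        return PART_FILE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"part-00000", "part-r-00000", "notes.txt"}) {
            System.out.println(name + " -> " + isPartFile(name));
        }
    }
}
```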

Run the command:

    hadoop fs -cat output/part\* | head

You should see the same output as when you ran the job locally, as shown in Figure 23.

Figure 23: Run WordCount job.

To view the job's logs, open the browser in the VM and point it to http://localhost:50030 as in Figure 24.

Figure 24: View WordCount job logs.

Click on the link for the completed job. See Figure 25.

Figure 25: View WordCount job logs.

Click the link for the map tasks. See Figure 26.

Figure 26: View WordCount job logs.

Click the link for the first attempt. See Figure 27.

Figure 27: View WordCount job logs.

Click the link for the full logs. See Figure 28.

Figure 28: View WordCount job logs.

2.6 Using your local machine for development

If you enabled the second network adapter, you can use your own local machine for development, including your local IDE. In order to do that, you'll need to install a copy of Hadoop locally. The easiest way to do that is to simply download the archive from http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.4.0.tar.gz and unpack it.

In the unpacked archive, you'll find an etc/hadoop-mapreduce1 directory. In that directory, open the core-site.xml file and modify it as follows:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://192.168.56.101:8020</value>
      </property>
    </configuration>

Next, open the mapred-site.xml file in the same directory and modify it as follows:

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>192.168.56.101:8021</value>
      </property>
    </configuration>

After making those modifications, update your command path to include the bin-mapreduce1 directory and set the HADOOP_CONF_DIR environment variable to be the path to the etc/hadoop-mapreduce1 directory. You should now be able to execute Hadoop commands from your local terminal just as you would from the terminal in the virtual machine.

You may also want to set the HADOOP_USER_NAME environment variable to cloudera to let you masquerade as the cloudera user. When you use the VM directly, you're running as the cloudera user.

Further Hadoop tutorials

- Yahoo! Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/
- Cloudera Hadoop Tutorial: http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html
- How to Debug MapReduce Programs: http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms

Further Eclipse tutorials

- General Eclipse tutorial: http://www.vogella.com/articles/Eclipse/article.html
- Tutorial on how to use the Eclipse debugger: http://www.vogella.com/articles/EclipseDebugging/article.html

3 Task: Write your own Hadoop Job

Now you will write your first MapReduce job to accomplish the following task:

Write a Hadoop MapReduce program which outputs the number of words that start with each letter. This means that for every letter we want to count the total number of words that start with that letter. In your implementation ignore the letter case, i.e., consider all words as lower case. You can ignore all non-alphabetic characters.

Run your program over the same input data as above.
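
As a starting point for thinking about the key your mapper would emit, the sketch below derives a lower-case first-letter key from a single word in plain Java. It reads "ignore all non-alphabetic characters" as "skip words whose first character is not a letter from a to z", which is only one possible interpretation; the class and method names are hypothetical, and the MapReduce wiring is deliberately left out.

```java
/** Illustrative key derivation for the per-letter count; one possible reading of the task. */
class FirstLetterSketch {

    /**
     * Returns the lower-cased first character of the word if it is a letter a-z,
     * or null if the word is empty or starts with anything else.
     */
    static Character letterKey(String word) {
        if (word == null || word.isEmpty()) {
            return null;
        }
        char c = Character.toLowerCase(word.charAt(0));
        if (c >= 'a' && c <= 'z') {
            return Character.valueOf(c);
        }
        return null;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Hamlet", "hath", "1603"}) {
            System.out.println(w + " -> " + letterKey(w));
        }
    }
}
```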

What to hand in: Hand in the printout of the output file and upload the source code.