CS246: Mining Massive Datasets, Winter 2014
Problem Set 0
Due 9:30am January 14, 2014

General Instructions

This homework is to be completed individually (no collaboration is allowed). Also, you are not allowed to use any late days for the homework. This homework is worth 1% of the total course grade.

The purpose of this homework is to get you started with Hadoop. Here you will learn how to write, compile, debug and execute a simple Hadoop program. The first part of the homework serves as a tutorial and the second part asks you to write your own Hadoop program.

Section 1 describes the virtual machine environment. Instead of the virtual machine, you are welcome to set up your own pseudo-distributed or fully distributed cluster if you prefer. Any version of Hadoop that is at least 1.0 will suffice. (For an easy way to set up a cluster, try Cloudera Manager: http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin.) If you choose to set up your own cluster, you are responsible for making sure the cluster is working properly. The TAs will be unable to help you debug configuration issues in your own cluster.
Section 2 explains how to use the Eclipse environment in the virtual machine, including how to create a project, how to run jobs, and how to debug jobs. Section 2.5 gives an end-to-end example of creating a project, adding code, building, running, and debugging it.

Section 3 is the actual homework assignment. There are no deliverables for Sections 1 and 2. In Section 3, you are asked to write and submit your own MapReduce job. This homework requires you to upload the code and hand in a printout of the output for Section 3.

Regular (non-SCPD) students should submit hard copies of the answers (Section 3) either in class or in the submission box (see the course website for the location). For paper submission, please fill in the cover sheet and submit it as a front page with your answers. You should upload your source code and any other files you used.

SCPD students should submit their answers through SCPD and also upload the code. The submission must include the answers to Section 3, the cover sheet and the usual SCPD routing form (http://scpd.stanford.edu/generalInformation/pdf/SCPD_HomeworkRouteForm.pdf).

Cover Sheet: http://cs246.stanford.edu/cover.pdf
Upload Link: http://snap.stanford.edu/submit/
Questions

1 Setting up a virtual machine

Download and install VirtualBox on your machine: http://virtualbox.org/wiki/Downloads

Download the Cloudera Quickstart VM at http://www.cloudera.com/content/dev-center/en/home/developer-admin-resources/quickstart-vm.html

Uncompress the VM archive. It is compressed with 7-Zip. If needed, you can download a tool to uncompress the archive at http://www.7-zip.org/.

Start VirtualBox and click Import Appliance. Click the folder icon beside the location field. Browse to the uncompressed archive folder, select the .ovf file, and click the Open button. Click the Continue button. Click the Import button.

Your virtual machine should now appear in the left column. Select it and click on Start to launch it. The username and password are "cloudera" and "cloudera".
Optional: Open the network properties for the virtual machine. Click on the Adapter 2 tab. Enable the adapter and select Host-only Adapter. If you do this step, you will be able to connect to the running virtual machine from the host OS at 192.168.56.101.

The virtual machine includes the following software:
- CentOS 6.2
- JDK 6 (1.6.0_32)
- Hadoop 2.0.0
- Eclipse 4.2.6 (Juno)

The login user is cloudera, and the password for that account is cloudera.
2 Running Hadoop jobs

Generally, Hadoop can be run in three modes.

1. Standalone (or local) mode: There are no daemons used in this mode. Hadoop uses the local file system as a substitute for the HDFS file system. The jobs run as if there were 1 mapper and 1 reducer.

2. Pseudo-distributed mode: All the daemons run on a single machine and this setting mimics the behavior of a cluster. All the daemons run on your machine locally using the HDFS protocol. There can be multiple mappers and reducers.

3. Fully-distributed mode: This is how Hadoop runs on a real cluster.

In this homework we will show you how to run Hadoop jobs in standalone mode (very useful for developing and debugging) and also in pseudo-distributed mode (to mimic the behavior of a cluster environment).
2.1 Creating a Hadoop project in Eclipse

(There is a plugin for Eclipse that makes it simple to create a new Hadoop project and execute Hadoop jobs, but the plugin is only well maintained for Hadoop 1.0.4, which is a rather old version of Hadoop. There is a project at https://github.com/winghc/hadoop2x-eclipse-plugin that is working to update the plugin for Hadoop 2.0. You can try it out if you like, but your mileage may vary.)

To create a project:
1. Open or create the ~/.m2/settings.xml file and make sure it declares an active profile ("standard extra repos") that adds two repositories, both with releases and snapshots enabled: central (http://repo.maven.apache.org/maven2/) and cloudera (https://repository.cloudera.com/artifactory/cloudera-repos). A sketch of such a settings.xml is given after this list.

2. Open Eclipse and select File → New → Project...
3. Expand the Maven node, select Maven Project, and click the Next > button.

4. On the next screen, click the Next > button.

5. On the next screen, when the archetypes have loaded, select maven-archetype-quickstart and click the Next > button.

6. On the next screen, enter a group name in the Group Id field, and enter a project name in the Artifact Id field. Click the Finish button.

7. In the package explorer, expand the project node and double-click the pom.xml file to open it.
8. Replace the current "dependencies" section of the pom.xml so that the project depends on the following artifacts (groupId : artifactId : version):

   jdk.tools : jdk.tools : 1.6
   org.apache.hadoop : hadoop-hdfs : 2.0.0-cdh4.0.0
   org.apache.hadoop : hadoop-auth : 2.0.0-cdh4.0.0
   org.apache.hadoop : hadoop-common : 2.0.0-cdh4.0.0
   org.apache.hadoop : hadoop-core : 2.0.0-mr1-cdh4.0.1
   junit : junit-dep : 4.8.2
   junit : junit : 4.10 (test scope)

   and configure the maven-compiler-plugin (version 2.1) with source and target set to 1.6. A sketch of the corresponding XML is given after this list.

9. Save the file.

10. Right-click on the project node and select Maven → Update Project.
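The following is a minimal sketch of the two XML files referred to in steps 1 and 8, reconstructed from the repository URLs and artifact coordinates listed above. Treat it as a guide rather than a verbatim copy of the original listings: the profile id, the element layout, the system-scoped jdk.tools entry, and the omission of junit-dep are assumptions, and the original pom.xml may organize the same artifacts with dependencyManagement or per-dependency exclusions.

~/.m2/settings.xml:

<settings>
  <profiles>
    <profile>
      <id>standard-extra-repos</id>
      <activation>
        <activeByDefault>true</activeByDefault>
      </activation>
      <repositories>
        <repository>
          <id>central</id>
          <url>http://repo.maven.apache.org/maven2/</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>true</enabled></snapshots>
        </repository>
        <repository>
          <id>cloudera</id>
          <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>true</enabled></snapshots>
        </repository>
      </repositories>
    </profile>
  </profiles>
</settings>

pom.xml (dependency and compiler-plugin configuration):

<dependencies>
  <dependency>
    <!-- assumption: the handout lists only the coordinates; jdk.tools is
         normally resolved from the local JDK via a system-scoped path -->
    <groupId>jdk.tools</groupId>
    <artifactId>jdk.tools</artifactId>
    <version>1.6</version>
    <scope>system</scope>
    <systemPath>${java.home}/../lib/tools.jar</systemPath>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.0.0-cdh4.0.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-auth</artifactId>
    <version>2.0.0-cdh4.0.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.0.0-cdh4.0.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>2.0.0-mr1-cdh4.0.1</version>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.10</version>
    <scope>test</scope>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>2.1</version>
      <configuration>
        <source>1.6</source>
        <target>1.6</target>
      </configuration>
    </plugin>
  </plugins>
</build>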
You can now create classes in the src directory. After writing your code, build the JAR file by right-clicking on the project node and selecting Run As → Maven install.
2.2 Running Hadoop jobs in standalone mode

After creating a project, adding source code, and building the JAR file as outlined above, the JAR file will be located in the ~/workspace/<project name>/target directory. Open a terminal and run the following command:

hadoop jar ~/workspace/<project name>/target/<project name>-0.0.1-SNAPSHOT.jar \
   -Dmapred.job.tracker=local -Dfs.defaultFS=local

You will see all of the output from the map and reduce tasks in the terminal.
2.3 Running Hadoop jobs in pseudo-distributed mode

Open a terminal and run the following command:

hadoop jar ~/workspace/<project name>/target/<project name>-0.0.1-SNAPSHOT.jar

To see all running jobs, run the following command:

hadoop job -list

To kill a running job, find the job's ID and then run the following command:

hadoop job -kill <job id>
2.4 Debugging Hadoop jobs

To debug an issue with a job, the easiest approach is to add print statements to the source file and run the job in standalone mode. The print statements will appear in the terminal output.

When running your job in pseudo-distributed mode, the output from the job is logged in the task tracker's log files, which can be accessed most easily by pointing a web browser to port 50030 of the server. From the job tracker web page, you can drill down into the failing job, the failing task, the failed attempt, and finally the log files. Note that the logs for stdout and stderr are separated, which can be useful when trying to isolate specific debugging print statements.

If you enabled the second network adapter in the VM setup, you can point your local browser to http://192.168.56.101:50030/ to access the job tracker page. Note, though, that when you follow links that lead to the task tracker web page, the links point to localhost.localdomain, which means your browser will return a "page not found" error. Simply replace localhost.localdomain with 192.168.56.101 in the URL bar and press Enter to load the correct page.
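As a concrete illustration of the print-statement approach, a mapper can simply write to System.err. The class below is a hypothetical example (not part of the assignment); its output appears directly in the terminal in standalone mode, and in the task attempt's stderr log in pseudo-distributed mode.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper used only to illustrate debugging output.
public class DebugMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Debugging print statement: terminal in standalone mode,
    // stderr log of the task attempt in pseudo-distributed mode.
    System.err.println("map input at offset " + key.get() + ": " + value);
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), ONE);
      }
    }
  }
}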
2.5 Example project

In this section you will create a new Eclipse Hadoop project, compile it, and execute it. The program will count the frequency of all the words in a given large text file. In your virtual machine, Hadoop, the Java environment and Eclipse have already been pre-installed.

Edit the ~/.m2/settings.xml file as outlined above. See Figure 1.

Figure 1: Create a Hadoop Project.

Open Eclipse and create a new project as outlined above. See Figures 2-9.

Figure 2: Create a Hadoop Project.
Figure 3: Create a Hadoop Project.
Figure 4: Create a Hadoop Project.
Figure 5: Create a Hadoop Project.
Figure 6: Create a Hadoop Project.
Figure 7: Create a Hadoop Project.
Figure 8: Create a Hadoop Project.
Figure 9: Create a Hadoop Project.
The project will contain a stub source file in the src/main/java directory that we will not use. Instead, create a new class called WordCount. From the File menu, select New → Class. See Figure 10.

Figure 10: Create Java file.

On the next screen, enter the package name (e.g., the group ID plus the project name) in the Package field. Enter WordCount as the Name. See Figure 11.

Figure 11: Create Java file.

In the Superclass field, enter Configured and click the Browse button. From the pop-up window select Configured - org.apache.hadoop.conf and click the OK button. See Figure 12.

Figure 12: Create Java file.

In the Interfaces section, click the Add button. From the pop-up window select Tool - org.apache.hadoop.util and click the OK button. See Figure 13.

Figure 13: Create Java file.

Check the boxes for public static void main(String args[]) and Inherited abstract methods and click the Finish button. See Figure 14.

Figure 14: Create WordCount.java.

You will now have a rough skeleton of a Java file as in Figure 15. You can now add code to this class to implement your Hadoop job.

Figure 15: Create WordCount.java.
Rather than implement a job from scratch, copy the contents of http://snap.stanford.edu/class/cs246-data-2014/WordCount.java and paste it into the WordCount.java file. Be careful to leave the package statement at the top intact. See Figure 16. The code in WordCount.java calculates the frequency of each word in a given dataset.

Figure 16: Create WordCount.java.
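In case the link is unavailable: the job in WordCount.java follows the usual Hadoop word-count pattern. The sketch below is an approximation of such a program, written against the new-API classes and the Configured/Tool setup used in the wizard steps above; the actual file from the link may differ in details such as tokenization, and the package name should match the one you created.

package edu.stanford.cs246.wordcount; // adjust to the package you created

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
  }

  @Override
  public int run(String[] args) throws Exception {
    // args[0] is the input path, args[1] the output directory.
    Job job = new Job(getConf(), "WordCount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every whitespace-separated token in the line.
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts emitted for this word by all mappers.
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}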
Build the project by right-clicking the project node and selecting Run As → Maven install. See Figure 17.

Figure 17: Create WordCount.java.

Download the Complete Works of William Shakespeare from Project Gutenberg at http://www.gutenberg.org/cache/epub/100/pg100.txt. Open a terminal and change to the directory where the dataset was stored. Run the command:

hadoop jar ~/workspace/wordcount/target/wordcount-0.0.1-SNAPSHOT.jar \
   edu.stanford.cs246.wordcount.WordCount -Dmapred.job.tracker=local \
   -Dfs.defaultFS=local dataset output

See Figure 18.

Figure 18: Run WordCount job.
If the job succeeds, you will see an output directory in the current directory that contains a file called part-00000. The part-00000 file contains the output from the job. See Figure 19.

Figure 19: Run WordCount job.

Run the command:

hadoop fs -ls

The command will list the contents of your home directory in HDFS, which should be empty, resulting in no output. Run the command:

hadoop fs -copyFromLocal pg100.txt

to copy the dataset into HDFS. Run the command:

hadoop fs -ls

again. You should see the dataset (pg100.txt) listed, as in Figure 20, indicating that it is in HDFS.

Figure 20: Run WordCount job.
Run the command:

hadoop jar ~/workspace/WordCount/target/WordCount-0.0.1-SNAPSHOT.jar \
   edu.stanford.cs246.wordcount.WordCount pg100.txt output

See Figure 21. If the job fails, you will see a message indicating that the job failed. Otherwise, you can assume the job succeeded.

Figure 21: Run WordCount job.

Run the command:

hadoop fs -ls output

You should see an output file for each reducer. Since there was only one reducer for this job, you should only see one part-* file. Note that sometimes the files will be called part-NNNNN, and sometimes they'll be called part-r-NNNNN. See Figure 22.

Figure 22: Run WordCount job.

Run the command:

hadoop fs -cat output/part\* | head

You should see the same output as when you ran the job locally, as shown in Figure 23.

Figure 23: Run WordCount job.
To view the job's logs, open the browser in the VM and point it to http://localhost:50030, as in Figure 24.

Figure 24: View WordCount job logs.

Click on the link for the completed job. See Figure 25.

Figure 25: View WordCount job logs.

Click the link for the map tasks. See Figure 26.

Figure 26: View WordCount job logs.

Click the link for the first attempt. See Figure 27.

Figure 27: View WordCount job logs.

Click the link for the full logs. See Figure 28.

Figure 28: View WordCount job logs.
2.6 Using your local machine for development

If you enabled the second network adapter, you can use your own local machine for development, including your local IDE. In order to do that, you'll need to install a copy of Hadoop locally. The easiest way to do that is to simply download the archive from http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.4.0.tar.gz and unpack it.

In the unpacked archive, you'll find an etc/hadoop-mapreduce1 directory. In that directory, open the core-site.xml file and modify it as follows:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.56.101:8020</value>
  </property>
</configuration>

Next, open the mapred-site.xml file in the same directory and modify it as follows:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.56.101:8021</value>
  </property>
</configuration>

After making those modifications, update your command path to include the bin-mapreduce1 directory and set the HADOOP_CONF_DIR environment variable to the path of the etc/hadoop-mapreduce1 directory. You should now be able to execute Hadoop commands from your local terminal just as you would from the terminal in the virtual machine.

You may also want to set the HADOOP_USER_NAME environment variable to cloudera to let you masquerade as the cloudera user. When you use the VM directly, you're running as the cloudera user.
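For example, assuming the archive was unpacked to ~/hadoop-2.0.0-cdh4.4.0 (the directory name and location are up to you, and HADOOP_HOME here is only a shell convenience), the setup could look roughly like this:

export HADOOP_HOME=~/hadoop-2.0.0-cdh4.4.0
export PATH=$HADOOP_HOME/bin-mapreduce1:$PATH      # adjust if bin-mapreduce1 sits elsewhere in the unpacked tree
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop-mapreduce1
export HADOOP_USER_NAME=cloudera                   # optional: act as the cloudera user
hadoop fs -ls                                      # should now list your HDFS home directory inside the VM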
Further Hadoop tutorials

Yahoo! Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/
Cloudera Hadoop Tutorial: http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html
How to Debug MapReduce Programs: http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms

Further Eclipse tutorials

General Eclipse tutorial: http://www.vogella.com/articles/Eclipse/article.html
Tutorial on how to use the Eclipse debugger: http://www.vogella.com/articles/EclipseDebugging/article.html
3 Task: Write your own Hadoop Job

Now you will write your first MapReduce job to accomplish the following task:

Write a Hadoop MapReduce program which outputs the number of words that start with each letter. This means that for every letter we want to count the total number of words that start with that letter. In your implementation ignore the letter case, i.e., consider all words as lowercase. You can ignore all non-alphabetic characters. Run your program over the same input data as above.

What to hand in: Hand in the printout of the output file and upload the source code.
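One possible starting point (a sketch only, not a prescribed design) is to reuse the WordCount structure and change the mapper so that it emits the lower-cased first letter of each word, keeping a plain summing reducer. The fragment below assumes it is nested inside a driver class like the WordCount sketch above, with the same imports; the class and variable names are illustrative.

// Sketch of a mapper for the letter-count task; pair it with a summing
// reducer like the one in WordCount and register it in the driver.
public static class LetterMap extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text letter = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().toLowerCase().split("\\s+")) {
      String word = token.replaceAll("[^a-z]", ""); // drop non-alphabetic characters
      if (!word.isEmpty()) {
        letter.set(word.substring(0, 1)); // key on the word's first letter
        context.write(letter, ONE);
      }
    }
  }
}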