Fast Virtualized Hadoop and Spark on All-Flash Disks
Best Practices for Optimizing Virtualized Big Data Applications on VMware vSphere 6.5
Performance Study – August 2017

Table of Contents

Executive Summary
Introduction
Best Practices
    Hardware Selection
    Software Selection
    Virtualizing Hadoop
    Tuning
        Operating System
        Hadoop
        Spark on YARN
Workloads
    MapReduce
        TeraSort Suite
        TestDFSIO
    Spark
        K-means Clustering
        Logistic Regression Classification
        Random Forest Decision Trees
Test Environment
    Hardware
    Hadoop Cluster Configuration
    Software
    Hadoop/Spark Configuration
Performance Results – Virtualized vs Bare Metal
    TeraSort Results
    TestDFSIO Results
    Spark Results
        K-means Clustering
        Logistic Regression
        Random Forest
    Spark Scaling as Dataset Exceeds Memory Size
Summary of Best Practices
Conclusion
References

Executive Summary

Best practices are described for optimizing Big Data applications running on VMware vSphere.
Hardware, software, and vSphere configuration parameters are documented, as well as tuning parameters for the operating system, Hadoop, and Spark. The Hewlett Packard Enterprise ProLiant DL380 Gen9 servers used in the test featured fast Intel processors with a large number of cores, large memory (512 GiB), and all-flash disks. Test results are shown from two MapReduce and three Spark applications running on three different configurations of vSphere (with 1, 2, and 4 VMs per host) as well as directly on the hardware. Among the virtualized clusters, the fastest configuration was 4 VMs per host due to NUMA locality and best disk utilization. The 4 VMs per host platform was faster than bare metal for all tests with the exception of a large (10 TB) TeraSort test, where the bare metal advantage of larger memory overcame the disadvantage of NUMA misses.
Introduction

Server virtualization has brought its advantages of rapid deployment, ease of management, and improved resource utilization to many data center applications, and Big Data is no exception. IT departments are being tasked to provide server clusters to run Hadoop, Spark, and other Big Data programs for a variety of different uses and sizes. There might be a requirement, for example, to run test/development, quality assurance, and production clusters at the same time, or different versions of the same software. Virtualizing Hadoop on vSphere avoids the need to dedicate a set of hardware to each different requirement, thus reducing costs. And with the need for many identical worker nodes, the automation tools provided by vSphere can provide for rapid deployment of new clusters.

The current work is the latest in a series of VMware studies into the optimum method to run Big Data workloads on vSphere, comparing those results to running the same workloads on similarly configured bare metal servers. A 2015 paper, Virtualized Hadoop Performance with VMware vSphere 6 [1], showed how Hadoop MapReduce version 1 (MRv1) workloads running on highly tuned vSphere implementations can approach and sometimes even exceed native performance. This was followed in 2016 by a study, Big Data Performance on vSphere 6 [2], that applied some of the learning from the 2015 study to a more typical customer environment, one which utilized tools such as Cloudera Manager and YARN [3] to deploy and manage a Hadoop cluster in a highly available fashion. Yet Another Resource Negotiator (YARN) replaced the JobTracker/TaskTracker resource management of MRv1 with a more general resource management process that can support other workloads such as Spark, in addition to MapReduce. Using YARN, the performance of the cluster was measured with the standard Hadoop MapReduce benchmark suite of TeraGen, TeraSort, and TeraValidate, and the TestDFSIO Hadoop Distributed File System (HDFS) stress tool, as well as a set of Spark machine learning programs that represent some of the leading edge uses of Big Data technology. The conclusion of that study was that a properly virtualized cluster ran both the Hadoop and Spark workloads the same as or faster than the same cluster running the worker servers on bare metal.

The current study expands upon the previous work using a new cluster equipped with the latest hardware: Intel v4 processors, large server memory, Non-Volatile Memory Express (NVMe) and solid state disk (SSD) storage, running the current version of vSphere, 6.5. Additionally, three different virtualization configurations (1, 2, and 4 VMs per host) were tested along with bare metal. Finally, a newer Spark machine learning benchmark, spark-perf [4], from Databricks, Inc., the developer of Spark, was used for the Spark tests.

This paper will show how to best deploy and configure the underlying vSphere infrastructure, as well as the Hadoop cluster, in such an environment. Best practices for all layers of the stack will be documented and their implementation in the test cluster described. The performance of the cluster will be shown with the TeraSort suite, the TestDFSIO HDFS stress tool, and the new Spark machine learning benchmarks.
Best Practices

Hardware Selection

The server hardware used in this study reflected general recommendations for Big Data as listed in the previous paper in this series [2]: large memory, especially for Spark (512 GiB); fast processors (Intel Xeon Processors E5-2683 v4 @ 2.10 GHz) with a high core count (16 cores per CPU); solid state storage (NVMe and SSD); and 10 GbE networking.

Note: In this document, notation such as "GiB" refers to binary quantities such as gibibytes (2^30 or 1,073,741,824) while "GB" refers to gigabytes (10^9 or 1,000,000,000).

The number of servers should be determined by workload size, number of concurrent users, and types of application. For workload size, it is necessary to include in the calculation the number of data block replicas created by HDFS for availability in the event of disk or server failure (3 by default), as well as the possibility that the application needs to keep a copy of the input and output at the same time. Once sufficient capacity is put in place, it might be necessary to increase the cluster's computation resources to improve application performance or handle many simultaneous users.
Software Selection

Hadoop can be deployed directly from open source Apache Hadoop, but many production Hadoop users employ a distribution such as Cloudera, Hortonworks, or MapR, which provide deployment and management tools, performance monitoring, and support. These tests used the Cloudera Distribution of Hadoop 5.10.0.

Most recent Linux operating systems are supported for Hadoop, including Red Hat/CentOS 6 and 7, SUSE Linux Enterprise Server 11 and 12, and Ubuntu 12 and 14. Java Development Kit versions 1.7 and 1.8 are supported.

For any use other than a small test/development cluster, it is recommended to use a standalone database for management and for the Hive Metastore. MySQL, PostgreSQL, and Oracle are supported. Check the distribution for details.
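As a minimal sketch of that setup (the database name, user, and password below are illustrative assumptions, not values from the test cluster), a MySQL database for the Hive Metastore can be created as follows:

    mysql -u root -p <<'SQL'
    -- Hypothetical Hive Metastore database; names and password are placeholders
    CREATE DATABASE metastore DEFAULT CHARACTER SET utf8;
    CREATE USER 'hive'@'%' IDENTIFIED BY 'hive_password';
    GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%';
    FLUSH PRIVILEGES;
    SQL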
Virtualizing Hadoop

On a virtualized server, some memory needs to be reserved for ESXi for its own code and data structures. With vSphere 6.5, about 5-6% of total server memory should be reserved for ESXi, with the remainder used for virtual machines.

Contemporary multiprocessor servers employ a design where each processor is directly connected to a portion of the overall system memory, as shown in Figure 1. This results in non-uniform memory access (NUMA), where a processor's access to its local memory is faster than to remote memory attached to other processors. For best performance with vSphere, it is recommended to size the VMs to each fit within a single NUMA node (the processor and its local memory) by creating one or more VMs per NUMA node. By doing so, the vSphere scheduler will keep each virtual machine on the NUMA node to which it was originally assigned, hence optimizing memory locality. So, for a two-processor system, creating 2 or 4 VMs per host will give the fastest memory access.
Figure 1. NUMA locality in a 2-processor server. The red lines show the virtual machine boundaries.

To maximize the utilization of each data disk, the number of disks per DataNode should be limited: 4 to 6 is a good starting point. (See Virtualized Hadoop Performance with VMware vSphere 6 [1] for a full discussion.)
Disk areas need to be cleared before writing. By choosing the eager-zeroed thick (Thick Provision Eager Zeroed) format for the virtual machine file system's (VMFS) virtual machine disks (VMDKs), this step will be accomplished before use, thereby increasing production performance.
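One way to create such a disk from the ESXi shell (a sketch; the size matches the 745 GB VMDKs described later in this paper, but the datastore path is a placeholder) is with vmkfstools:

    # Create a 745 GB eager-zeroed thick VMDK (datastore path is illustrative)
    vmkfstools -c 745G -d eagerzeroedthick /vmfs/volumes/datastore1/worker1-data1.vmdk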
The file system used inside the guest should be ext4 or xfs. Ext4 provides fast large-block sequential I/O (the kind used in HDFS) by mapping contiguous blocks (up to 128 MB) to a single descriptor or "extent." Xfs is a newer file system that provides quicker disk formatting times.
The VMware Paravirtual SCSI (pvscsi) driver for disk controllers provides the best disk performance. There are four virtual SCSI controllers available in vSphere 6.5; the data disks should be distributed among all four.

Virtual network interface cards (NICs) should be configured with MTU=9000 for jumbo Ethernet frames and use the vmxnet3 network driver; the virtual switches and physical switches to which the host attaches should be configured to enable jumbo frames.
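A quick end-to-end check of jumbo frames (a sketch; the interface name and peer address are placeholders) is to set the guest MTU and send a maximum-size unfragmented ping. 9000 bytes minus 28 bytes of IP and ICMP headers leaves an 8972-byte payload:

    # Set jumbo MTU on the vmxnet3 interface inside the guest
    ip link set dev ens192 mtu 9000
    # -M do forbids fragmentation, so this succeeds only if the whole path passes jumbo frames
    ping -M do -s 8972 -c 3 192.168.1.12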
Tuning

Operating System

For both virtual machines and bare metal servers, the following operating system parameters are recommended for Hadoop (see, for example, the Cloudera documentation for Performance Management [5]).

The Linux kernel parameter vm.swappiness controls the aggressiveness of memory swapping, with higher values increasing the chance that an inactive memory page will be swapped to disk. This can cause problems for Hadoop clusters due to lengthy Java garbage collection pauses for critical system processes. It is recommended this be set to 0 or 1.

The use of transparent huge page compaction should be disabled by setting the value of /sys/kernel/mm/transparent_hugepage/defrag to never. For example, the command echo never > /sys/kernel/mm/transparent_hugepage/defrag can be placed in /etc/rc.local where it will be executed upon system startup. The value of /sys/kernel/mm/transparent_hugepage/enabled should be set similarly.

Jumbo Ethernet frames should be enabled on NICs (both in the bare metal and guest OS) and on physical network switches.
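Gathered into one place (a sketch consistent with the recommendations above; the choice of 0 or 1 for swappiness is left to the administrator), these settings can be applied as follows:

    # Reduce swapping aggressiveness; persist by adding vm.swappiness=1 to /etc/sysctl.conf
    sysctl -w vm.swappiness=1
    # Disable transparent huge page compaction and THP itself;
    # place these two lines in /etc/rc.local to reapply at boot
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    echo never > /sys/kernel/mm/transparent_hugepage/enabled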
Hadoop

The developers of Hadoop have exposed a large number of configuration parameters to the user. In this work, a few key parameters were addressed.

A YARN-based Hadoop cluster has two master processes: the NameNode and the ResourceManager. The NameNode manages the Hadoop Distributed File System (HDFS), communicating with DataNode processes on each worker node to read and write data to HDFS. The YARN ResourceManager manages NodeManager processes on each worker node, which manage containers to run the map and reduce tasks of MapReduce or the executors used by Spark.

For HDFS, the two most important parameters involve the size and replication number of the blocks making up the content stored in the HDFS file system. The block size, dfs.blocksize, controls the number of Hadoop input splits. Higher values result in fewer blocks (and map tasks) and thus more efficiency except for very small workloads. For most workloads, the block size should be raised from the default 128 MiB to 256 MiB. This can be made a global parameter that can be overridden at the command line if necessary. To accommodate the larger block size, mapreduce.task.io.sort.mb (a YARN parameter) should be raised to 400 MiB. This is the buffer space allocated on the mapper to contain the input split while being processed. Raising it prevents unnecessary spills to disk.
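For example (a sketch of a per-job override; the example jar and the input/output paths are illustrative), both values can be raised on the command line without changing the cluster-wide defaults:

    # Run one job with a 256 MiB block size and a 400 MiB map-side sort buffer
    hadoop jar hadoop-mapreduce-examples.jar wordcount \
        -Ddfs.blocksize=268435456 \
        -Dmapreduce.task.io.sort.mb=400 \
        input output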
The number of replicas, dfs.replication, defaults to 3 copies of each block for availability in the case of disk, server, or rack loss.

These are general recommendations; as will be shown for the Hadoop applications tested here, raising these values significantly higher took the best advantage of the large memory servers.
YARN and MapReduce2 do away with the concept of map and reduce slots from MapReduce v1 and instead dynamically assign resources to applications through the allocation of task containers, where a task is a map or reduce task or Spark executor. The user specifies available CPU and memory resources as well as minimum and maximum allocations for container CPU and memory, and YARN will allocate the resources to tasks. This is the best approach for general cluster use, in which multiple applications are running simultaneously. For greater control of applications running one at a time, task container CPU and memory requirements can be specified to manually allocate cluster resources per application. This latter approach was used in the tests described here.

For CPU resources, the key parameter is yarn.nodemanager.resource.cpu-vcores. This tells YARN how many virtual CPU cores (vcores) it can allocate to containers. In these tests, all available virtual CPUs (vSphere vCPUs) or logical cores (bare metal) were allocated. This will be discussed in more detail later after the hardware configurations are described. The application container CPU requirements for map and reduce tasks are specified by mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores, which are specified in the cluster configuration and can be overridden in the command line, per application.
Similar to CPU resources, cluster memory resources are specified by yarn.nodemanager.resource.memory-mb, which should be defined in the cluster configuration parameters, while mapreduce.map.memory.mb and mapreduce.reduce.memory.mb can be used to specify container memory requirements that differ from the default of 1 GiB. The Java virtual machine (JVM) that runs the task within the YARN container needs some overhead; the parameter mapreduce.job.heap.memory-mb.ratio specifies what percentage of the container memory to use for the JVM heap size. The default value of 0.8 was used in these tests, so, for example, a 6400 MiB task container runs its JVM with a 5120 MiB heap. Details of the memory allocation for the virtualized and bare metal clusters follow.
In general, it is not necessary to specify the number of maps and reduces in an application because YARN will allocate sufficient containers based on the calculations described above. However, in certain cases (such as TeraGen, an all-map application that writes data to HDFS for TeraSort to sort), it is necessary to instruct YARN how many tasks to run in the command line. To fully utilize the cluster, the total number of available tasks minus one for the ApplicationMaster task that controls the application can be specified. The parameters controlling the number of tasks are mapreduce.job.maps and mapreduce.job.reduces.
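As a sketch of that arithmetic, using the 640-vcore cluster described later in this paper (the command form is illustrative): with 2-vcore map tasks, 320 tasks would fill the cluster, and reserving one vcore for the ApplicationMaster leaves 319:

    # 640 vcores / 2 vcores per map task = 320; reserve one task slot for the ApplicationMaster
    hadoop jar hadoop-mapreduce-examples.jar teragen \
        -Dmapreduce.job.maps=319 \
        -Dmapreduce.map.cpu.vcores=2 \
        10000000000 terasort_input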
The correct sizing of task containers (larger versus more containers) and the correct balance between map and reduce tasks are fundamental to achieving high performance in a Hadoop cluster. A typical MapReduce job starts with all map tasks ingesting and processing input data from files (possibly the output of a previous MapReduce job). The result of this processing is a set of key-value pairs which the map tasks then partition and sort by key and write (or spill) to disk. The sorted output is then copied to other nodes where reduce tasks merge the files by key, continue the processing, and (usually) write output to HDFS. The map and reduce phases can overlap, so it is necessary to review the above choices in light of the fact that each DataNode/NodeManager node will generally be running both kinds of tasks simultaneously.
Spark on YARN

Spark is a memory-based execution engine that in many ways is replacing MapReduce for Big Data applications. It can run standalone on Big Data clusters or as an application under YARN. For these tests, the latter was used. The parameters spark.executor.cores and spark.executor.memory play the same role for Spark executors as map/reduce task memory and vcore assignment do for MapReduce. Additionally, the parameter spark.yarn.executor.memoryOverhead can be set if the default (10% of spark.executor.memory) is insufficient (if out-of-memory issues are seen). In general, Spark runs best with fewer executors containing more memory.
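As a sketch of how these parameters are passed to an application under YARN (an illustrative spark-submit invocation, not the spark-perf driver used in these tests; the executor values match the K-means settings shown later in this paper):

    # Launch a Spark application on YARN with explicit executor sizing
    spark-submit --master yarn-client \
        --num-executors 119 \
        --executor-cores 4 \
        --executor-memory 22g \
        --driver-memory 20g \
        --conf spark.yarn.executor.memoryOverhead=10240 \
        my_app.jar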
Workloads

Several standard benchmarks that exercise the key components of a Big Data cluster were used for this test. These benchmarks may be used by customers as a starting point for characterizing their Big Data clusters, but their own applications will provide the best guidance for choosing the correct architecture.

MapReduce

Two industry-standard MapReduce benchmarks, the TeraSort suite and TestDFSIO, were used for measuring performance of the cluster.

TeraSort Suite

The TeraSort suite (TeraGen/TeraSort/TeraValidate) is the most commonly used Hadoop benchmark and ships with all Hadoop distributions. By first creating a large dataset, then sorting it, and finally validating that the sort was correct, the suite exercises many of Hadoop's functions and stresses CPU, memory, disk, and network.

TeraGen generates a specified number of 100-byte records, each with a randomized key occupying the first 10 bytes, creating the default number of replicas as set by dfs.replication. In these tests 10, 30, and 100 billion records were specified, resulting in datasets of 1, 3, and 10 TB. TeraSort sorts the TeraGen output, creating one replica of the sorted output. In the first phase of TeraSort, the map tasks read the dataset from HDFS. Following that is a CPU-intensive phase where map tasks partition the records they have processed by a computed key range, sort them by key, and spill them to disk. At this point, the reduce tasks take over, fetch the files from each mapper corresponding to the keys associated with that reducer, merge the files for each key (sorting them in the process) with several passes, and finally write to disk. TeraValidate, which validates that the TeraSort output is indeed in sorted order, is mainly a read operation with a single reduce task at the end.

TestDFSIO

TestDFSIO is a write-intensive HDFS stress tool also supplied with every distribution. It generates a specified number of files of a specified size. In these tests 1,000 files of size 1 GB, 3 GB, or 10 GB were created, for total sizes of 1, 3, and 10 TB.
Spark

Three standard analytic programs from the Spark machine learning library (MLlib), K-means clustering, Logistic Regression classification, and Random Forest decision trees, were driven using spark-perf [4].

K-means Clustering

Clustering is used for analytic tasks such as customer segmentation for purposes of ad placement or product recommendations. K-means groups input into a specified number, k, of clusters in a multi-dimensional space. The code tested takes a large training set with a known grouping for each example and uses this to build a model to quickly place a real input set into one of the groups.

Three K-means tests were run, each with 5 million examples. The number of groups was set to 20 in each. The number of features was varied: 11,500, 23,000, and 34,500 features generated dataset sizes of 1 TB, 2 TB, and 3 TB. The training time reported by the benchmark kit was recorded. Six runs at each size were performed, with the first one being discarded and the remaining five averaged to give the reported elapsed time.

Logistic Regression Classification

Logistic regression (LR) is a binary classifier used in tools such as credit card fraud detection and spam filters. Given a training set of credit card transaction examples with, say, 20 features (date, time, location, credit card number, amount, etc.) and whether that example is valid or not, LR builds a numerical model that is used to quickly determine if subsequent (real) transactions are fraudulent.

Three LR tests were run, each with 5 million examples. The number of features was varied: 11,500, 23,000, and 34,500 features generated dataset sizes of 1 TB, 2 TB, and 3 TB. The training time reported by the benchmark kit was recorded. Six runs at each size were performed, with the first one being discarded and the remaining five averaged to give the reported elapsed time.

Random Forest Decision Trees

Random Forest automates any kind of decision making or classification algorithm by first creating a model with a set of training data, with the outcomes included. Random Forest runs an ensemble of decision trees in order to reduce the risk of overfitting the training data.

Three Random Forest tests were run, each with 5 million examples. The number of trees was set to 10 in each. The number of features was varied: 15,000, 30,000, and 45,000 features generated dataset sizes of 1 TB, 2 TB, and 3 TB. The training time reported by the benchmark kit was recorded. Six runs at each size were performed, with the first one discarded and the remaining five averaged to give the reported elapsed time.

The Spark MLlib code enables the specification of the number of partitions that each Spark resilient distributed dataset (RDD) employs. For these tests, the number of partitions was initially set equal to the number of Spark executors times the number of cores in each, but was increased in certain configurations as necessary.
Test Environment

Hardware

Thirteen Hewlett Packard Enterprise (HPE) ProLiant DL380 Gen9 servers were used in the test, as shown in Figure 2. The servers were configured identically, with two Intel Xeon Processors E5-2683 v4 ("Broadwell") running at 2.10 GHz with 16 cores each and 512 GiB of memory. Hyperthreading was enabled, so each server showed 64 logical processors or hyperthreads.

Figure 2. Big Data cluster – 13 HPE ProLiant DL380 Gen9 servers

For the virtualized servers, all 64 logical processors were assigned to virtual CPUs (vCPUs), with the 1, 2, and 4 VMs per host platforms having 64, 32, and 16 vCPUs per VM, respectively. To enable the NUMA calculation to include both physical and hyperthreaded logical cores, it is necessary to set Numa.PreferHT to True in vSphere on each host. Then, for example, the ESXi scheduler will calculate that a 32-vCPU VM in the 2 VMs per host platform will indeed fit within a single NUMA node, since 32 vCPUs will fit within that processor's cores if hyperthreads are counted.
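A minimal sketch of setting this from the ESXi command line (the option can also be changed through the vSphere client's advanced system settings):

    # Allow NUMA placement calculations to count hyperthreaded logical cores
    esxcli system settings advanced set --option=/Numa/PreferHT --int-value=1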
Each server was configured with an SD card, two 1.2 TB spinning disks, four 800 GB NVMe SSDs connected to the PCI bus, and twelve 800 GB SAS SSDs connected through the HPE Smart Array P840ar/2GB RAID controller. ESXi 6.5.0 was installed on the SD card on each server. The two internal 1.2 TB disks in each server were mirrored in a RAID1 set and used for the operating system (for bare metal) or as a virtual machine file system (VMFS) datastore upon which the operating system disks for the VMs on that host were stored.

The 16 flash drives attached to each worker server were used as data drives for the NodeManager and DataNode processes. With the NVMe storage providing the highest random read/write I/Os per second, the 4 NVMes in each server were assigned to handle the NodeManager temporary data, which consisted of Hadoop map spills to disk and reduce shuffles. SAS SSDs provide very high speed sequential reads and writes, so the 12 SSDs in each server were assigned to the DataNode traffic, consisting of reads and writes of permanent HDFS data. As shown later in Table 3, the drives were evenly distributed among VMs: for the 1 VM per host platform (as well as bare metal), all 4 NVMes and 12 SSDs were assigned to the VM/OS, while at the other extreme, the 4 VMs per host platform had 1 NVMe and 3 SSDs per VM.

For the bare metal cluster, the drives were formatted and mounted as an ext4 file system. The virtualized clusters were set up as VMFS datastores with one 745 GB virtual machine disk (VMDK) created on each, which was then formatted and mounted as ext4 using the same parameters as in the bare metal case. For best performance, these VMDKs were initialized with eager-zeroed thick formatting, which means the disk was pre-zeroed so writes did not have to zero disk blocks during production.
Each server had one 1 GbE NIC and four 10 GbE NICs (of which only two were used in this test). One virtual switch was created on each ESXi host using the 1 GbE NIC and served as the management network for the host as well as the external virtual machine network for the guests. A second virtual switch, created on two 10 GbE NICs bonded together, served as the internal network between VMs/bare metal hosts for all Hadoop traffic. By setting the bond load balancing policy to Route based on originating virtual port, both failover and performance throughput enhancement were achieved because the VMs attached to that NIC bond had different originating ports. The maximum transmission unit (MTU) of the bonded 10 GbE NIC was set to 9000 bytes to handle jumbo Ethernet frames. For the bare metal hosts, NIC bonding was achieved with adaptive load balancing.
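On CentOS 7, adaptive load balancing corresponds to Linux bonding mode balance-alb (mode 6). A minimal sketch of such a bond via an ifcfg file (interface names, addressing, and the miimon value are illustrative; the study does not reproduce its exact network scripts):

    # /etc/sysconfig/network-scripts/ifcfg-bond0 (values are placeholders)
    DEVICE=bond0
    TYPE=Bond
    BONDING_MASTER=yes
    BONDING_OPTS="mode=balance-alb miimon=100"
    IPADDR=192.168.1.11
    PREFIX=24
    MTU=9000
    BOOTPROTO=none
    ONBOOT=yes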
An optimized vSphere driver for the HPE Smart Array P840ar/2GB controller, VMW-ESX-6.5.0-nhpsa-2.0.10-4593811.zip, was downloaded from Hewlett Packard Enterprise's site and installed on each vSphere host.

Hardware details are summarized in Table 1.

COMPONENT                                    QUANTITY/TYPE
Server                                       HPE ProLiant DL380 Gen9
CPU                                          2x Intel Xeon Processors E5-2683 v4 @ 2.10 GHz w/ 16 cores each
Logical Processors (including hyperthreads)  64
Memory                                       512 GiB (16x 32 GiB DIMMs)
NICs                                         2x 1 GbE ports + 4x 10 GbE ports
Hard Drives                                  2x 1.2 TB 12 Gb/s SAS 10K 2.5 in HDD – RAID1 for OS
NVMes                                        4x 800 GB NVMe PCIe – NodeManager traffic
SSDs                                         12x 800 GB 12G SAS SSD – DataNode traffic
RAID Controller                              HPE Smart Array P840ar/2G controller
Remote Access                                HPE iLO Advanced

Table 1. Server configuration. In this document notation such as "GiB" refers to binary quantities such as gibibytes (2^30 or 1,073,741,824) while "GB" refers to gigabytes (10^9 or 1,000,000,000).
Hadoop Cluster Configuration

In a Hadoop cluster, there are three kinds of servers or nodes (see Table 2). One or more gateway servers act as client systems for Hadoop applications and provide a remote access point for users of cluster applications. Master servers run the Hadoop master services such as the HDFS NameNode and YARN ResourceManager and their associated services (JobHistory Server, etc.), as well as other Hadoop services such as Hive, Oozie, and Hue. Worker nodes run only the resource-intensive HDFS DataNode role and the YARN NodeManager role (which also serve as Spark executors). For NameNode and ResourceManager high availability, at least three ZooKeeper services and three HDFS JournalNodes are required; these can run on the gateway and two master servers.

For these tests, three of the servers were virtualized with VMware vSphere 6.5 and ran infrastructure VMs to manage the Hadoop cluster. On the first server, a VM hosted the gateway node, running Cloudera Manager and several other Hadoop functions as well as the gateways for the HDFS, YARN, Spark, and Hive services. The second and third servers each hosted a master VM, on which the active and passive NameNode and ResourceManager components and associated services ran. The active NameNode and active ResourceManager ran on different nodes for best distribution of CPU load, with the standby of each on the opposite master node. ZooKeeper, running on all three VMs, provided synchronization of the distributed processes. Placing the gateway and two master VMs on different physical servers guaranteed the highest cluster availability.

The other ten servers (either virtualized or bare metal) ran only the worker services, HDFS DataNode and YARN NodeManager. Spark executors ran on the YARN NodeManagers.

In a production vSphere cluster, the three infrastructure hosts dedicated to the gateway and master nodes also host additional VMs supporting services such as VMware vCenter Server and VMware management tools such as the VMware vRealize Operations Manager appliance and VMware vRealize Log Insight. These services can be protected and load-balanced through vSphere high availability (HA) and the distributed resource scheduler (DRS), in addition to the protection offered by NameNode and ResourceManager high availability. The gateway VM is the only server that needs direct network access to the outside world, so it can be protected with a firewall and access limited with a tool such as Kerberos. (The other nodes may need Internet access during installation depending on the method used; that can be achieved by IP forwarding through the gateway.)

The full assignment of roles is shown in Table 2. Key software component versions are shown in Table 4.

NODE          ROLES
Gateway       Cloudera Manager, ZooKeeper Server, HDFS JournalNode, HDFS gateway, YARN gateway, Hive gateway, Spark gateway, Spark History Server, Hive Metastore Server, HiveServer2, Hive WebHCat Server, Hue Server, Oozie Server
Master1       HDFS NameNode (active), YARN ResourceManager (standby), ZooKeeper Server, HDFS JournalNode, HDFS Balancer, HDFS Failover Controller, HDFS HttpFS, HDFS NFS gateway
Master2       HDFS NameNode (standby), YARN ResourceManager (active), ZooKeeper Server, HDFS JournalNode, HDFS Failover Controller, YARN JobHistory Server
Workers (10)  HDFS DataNode, YARN NodeManager, Spark Executor

Table 2. Hadoop/Spark roles

During these tests, the three virtualized servers controlling the cluster were left unchanged, while the configuration of the worker servers, which determine application performance, was changed for each platform. It is a common practice in production clusters to have a dedicated infrastructure (comprising less expensive servers) managing several clusters of more richly configured worker servers.

For the virtualized platforms, vSphere 6.5 was installed on each server. The 64 vCPUs on each host (corresponding to the 64 logical processors with hyperthreading enabled) were evenly committed to the VMs (see Table 3 for per-platform details). Following best practices, 32 GiB of each server's 512 GiB of memory (6%) was provided for ESXi, and the remaining 480 GiB was divided equally between the VMs. On all VMs, the virtual NIC connected to the bonded 10 GbE NICs was assigned an IP address internal to the cluster. The vmxnet3 driver was used for all network connections.
Each virtual machine was installed with its CentOS 7.3 operating system on the RAID1 datastore for that host, one or more NVMes assigned as NodeManager disks, and three or more SSDs assigned as DataNode disks. All disks were connected through virtual SCSI controllers using the VMware Paravirtual SCSI driver. The OS disk and data disks were spread evenly over the four SCSI controllers. The virtualized cluster is shown in Figure 3.

For the bare metal performance tests, the worker servers were reinstalled with CentOS 7.3. Each of the 10 bare metal worker servers used the full complement of 64 logical processors, 512 GiB of memory, one RAID1 pair of 1.2 TB disks for its operating system, four 800 GB NVMes for NodeManager data, and twelve 800 GB SSDs for DataNode data. The two 10 GbE NICs were bonded together using the Linux bonding mode adaptive load balancing to provide both failover and enhanced throughput similar to that employed with ESXi. In neither the virtualized nor the bare metal case was the network switch to which the NICs attach configured with any special configuration for the NIC bonding. The bare metal cluster is shown in Figure 4.

The worker node configurations for virtualized and bare metal are compared in Table 3. The important YARN cluster parameters yarn.nodemanager.resource.memory-mb (container memory) and yarn.nodemanager.resource.cpu-vcores (container CPUs) are discussed in more detail following.

Figure 3. Big Data cluster – virtualized – 4 VMs per host platform shown

Figure 4. Big Data cluster – bare metal

COMPONENT           1 VM PER HOST       2 VMS PER HOST      4 VMS PER HOST      BARE METAL
Virtual CPUs        64                  32                  16                  64
Memory              480 GiB             240 GiB             120 GiB             512 GiB
Container memory    432 GiB             208 GiB             104 GiB             448 GiB
Container vcores    64                  32                  16                  64
Virtual NICs        Bonded 10 GbE NICs  Bonded 10 GbE NICs  Bonded 10 GbE NICs  Bonded 10 GbE NICs
                    – ESX bonding       – ESX bonding       – ESX bonding       – Balance ALB
OS drives           100 GB HDD          100 GB HDD          100 GB HDD          1.1 TB HDD
NodeManager drives  4x 740 GB NVMe      2x 740 GB NVMe      1x 740 GB NVMe      4x 740 GB NVMe
DataNode drives     12x 741 GB SSD      6x 741 GB SSD       3x 741 GB SSD       12x 741 GB SSD
Table 3. Worker node configuration per VM or bare metal server

Software

The software stack employed on each VM is shown in Table 4. The combination of Cloudera Distribution of Hadoop (CDH) 5.10.0 on CentOS 7.3 was used. The Oracle JDK 1.8.0_65 was installed on each node and then employed by CDH. MySQL 5.7 Community Release was installed on the gateway node for the Hive Metastore and Oozie databases.

COMPONENT                        VERSION
vSphere/ESXi                     6.5.0, 4564106
Guest operating system           CentOS 7.3
Cloudera Distribution of Hadoop  5.10.0
Cloudera Manager                 5.10.0
Hadoop                           2.6.0+cdh5.10.0+2102
HDFS                             2.6.0+cdh5.10.0+2102
YARN                             2.6.0+cdh5.10.0+2102
MapReduce2                       2.6.0+cdh5.10.0+2102
Hive                             1.1.0+cdh5.10.0+859
Spark                            1.6.0+cdh5.10.0+457
ZooKeeper                        3.4.5+cdh5.10.0+104
Java                             Oracle 1.8.0_111-b14
MySQL                            5.6.35 community server

Table 4. Key software components
KeysoftwarecomponentsHadoop/SparkConfigurationTheClouderaDistributionofHadoop(CDH)wasinstalledusingClouderaManagerandtheparcelsmethod(seetheClouderadocumentation[6]fordetails).
HDFS,YARN,ZooKeeper,Spark,Hive,Hue,andOoziewereinstalledwiththerolesdeployedasshowninTable2.
OncetheClouderaManagerandMasternodeswereinstalledontheinfrastructureservers,theywereleftinplaceforallthreevirtualizedconfigurations,aswellasthebaremetalcluster,withtheworkernodesbeingredeployedbetweeneachsetoftests.
TheClouderaManagerhomescreenisshowninFigure5.
FASTVIRTUALIZEDHADOOPANDSPARKONALL-FLASHDISKSPERFORMANCESTUDY|15Figure5.
ClouderaManagerscreenshotshowing10TerabyteTeraGen/TeraSort/TeraValidateresourceutilization.
WheretheCPU%dropsmarksthebeginningofthenextapplication.
As described previously, in tuning for Hadoop, the two key cluster parameters that need to be set by the user are yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb, which tell YARN how much CPU and memory resources, respectively, can be allocated to task containers on each worker node.

For the virtualized case, the vCPUs in each worker VM were exactly committed to YARN containers; that is, yarn.nodemanager.resource.cpu-vcores was set equal to the number of vCPUs in each VM. As seen in Figure 6, this actually overcommits the CPUs slightly since, in addition to the 16 vcores for containers (4 VMs per host case), a total of 1 vcore was required for the DataNode and NodeManager processes that run on each worker node. For the bare metal case, all 64 logical CPUs were allocated to containers, again slightly overcommitting the CPU resources.

The memory calculation was a little more involved. Taking the 4 VMs per host platform as an example (Figure 6), each of the 4 VMs had 120 GiB of memory after the 32 GiB of the 512 GiB host memory was set aside for ESXi. 9.4 GiB of that 120 GiB was set aside for the guest operating system. Of the remaining 110.6 GiB, 1.3 GiB each was required for the JVMs running the DataNode and NodeManager processes, plus 4 GiB for cache, leaving 104 GiB for containers (yarn.nodemanager.resource.memory-mb). The amount of memory for the operating system was under Cloudera's target of 20%. Since no swapping was seen on the worker nodes, the warning from Cloudera Manager was suppressed. (The 20% target could have been lowered as well.)
Figure 6. DataNode/NodeManager resources – 4 VMs/host case

The memory calculation for the bare metal servers is similar. Setting aside 57.4 GiB of the 512 GiB server memory for the OS, plus the 6.6 GiB for the DataNode and NodeManager processes and cache, leaves 448 GiB for containers. Again, no swapping was seen with this amount of OS memory.

Totaling the available container memory and vcores for each cluster, all platforms have the same number of vcores available (640), as shown in Table 5. But note that between the ESXi overhead (32 GiB) and the fact that each VM requires one set of JVMs for the DataNode, NodeManager, and associated cache (requiring 6.6 GiB total) versus just one set total on bare metal, the total container memory of the virtualized platforms is lower than that available on the bare metal server. However, as will be shown in the results, this advantage does not make up for the virtualization advantages of NUMA locality and better disk utilization on most applications.

COMPONENT                                          1 VM PER HOST  2 VMS PER HOST  4 VMS PER HOST  BARE METAL
YARN container memory per VM or bare metal server  432 GiB        208 GiB         104 GiB         448 GiB
YARN container vcores per VM or bare metal server  64             32              16              64
Number of VMs or servers per cluster               10             20              40              10
YARN container memory per cluster                  4320 GiB       4160 GiB        4160 GiB        4480 GiB
YARN container vcores per cluster                  640            640             640             640

Table 5. Available YARN container memory and vcores per VM and per cluster by platform

These cluster parameters define the total memory and number of cores available to applications. To achieve the correct balance between the number and size of containers, much experimentation was done, particularly with TeraSort, factoring in the available memory vs. vcores. In general, the best performance on Hadoop was found by utilizing all YARN vcores but not all cluster memory. For Spark, the best performance was achieved by leaving both memory and cores underutilized. The exact settings per application per platform are shown in the Performance Results section.
A few additional parameters were changed from their default values on all platforms. The buffer space allocated on each mapper to contain the input split while being processed (mapreduce.task.io.sort.mb) was raised to its maximum value, 2047 MiB (about 2 GiB), to accommodate the very large block size that was used in the TeraSort suite (see below). The amount of memory dedicated to the ApplicationMaster process, yarn.app.mapreduce.am.resource.mb, was raised from 1 GiB to 4 GiB. The parameter yarn.scheduler.increment-allocation-mb was lowered from 512 MiB to 256 MiB to allow finer grained specification of task sizes. The log levels of all key processes were turned down from the Cloudera default of INFO to WARN for production use, but the much lower level of log writes did not have a measurable impact on application performance. These global parameters are summarized in Table 6.

PARAMETER                               DEFAULT   CONFIGURED
mapreduce.task.io.sort.mb               256 MiB   2047 MiB
yarn.app.mapreduce.am.resource.mb       1 GiB     4 GiB
yarn.scheduler.increment-allocation-mb  512 MiB   256 MiB
Log level on HDFS, YARN, Hive           INFO      WARN

Note: MiB = 2^20 (1,048,576) bytes; GiB = 2^30 (1,073,741,824) bytes
Table 6. Key Hadoop/Spark cluster parameters used in tests

Performance Results – Virtualized vs Bare Metal

TeraSort Results

The commands to run the three components of the TeraSort suite (TeraGen, TeraSort, and TeraValidate) are shown in Table 7. The key application-dependent parameters such as block size, map and reduce task container size (memory and cores), and, in some cases, number of tasks are shown in Table 8 for each platform.

The dfs.blocksize was set at 1 GiB to take advantage of the large memory available to YARN, and, as mentioned previously, mapreduce.task.io.sort.mb was consequently set to the largest possible value, 2047 MiB, to minimize spills to disk during the map processing of each HDFS block.

It was found that the TeraGen map tasks for all platforms ran faster with 2 vcores. With 640 total cores available for each platform, 320 2-vcore map tasks could run simultaneously. However, a vcore must be set aside to run the ApplicationMaster, leaving 319 map tasks. Similarly, the TeraSort reduce task runs best with 1 vcore, so 639 are assigned.

The optimum map and reduce task memory allocations were determined experimentally. In all cases, less than the total cluster memory was assigned. For example, for the bare metal cluster, with 448 GiB of container memory available per server, 64 reduce tasks could be as large as 7 GiB (7168 MiB). However, a value of 6400 MiB, for a total consumption of 400 GiB, was found to give the best performance. As can be seen in Table 8, these values vary for each platform.
TERASORT COMMANDS

    time hadoop jar $examples_jar teragen -Ddfs.blocksize=$blocksize -Dmapreduce.job.maps=$n_maps_tg \
        -Dmapreduce.map.memory.mb=$map_mem_tg -Dmapreduce.map.cpu.vcores=$map_vcores_tg \
        10000000000/30000000000/100000000000 terasort_input

    time hadoop jar $examples_jar terasort -Ddfs.blocksize=$blocksize -Dmapreduce.job.reduces=$n_reds_ts \
        -Dmapreduce.map.memory.mb=$map_mem_ts -Dmapreduce.reduce.memory.mb=$red_mem_ts \
        -Dmapreduce.map.cpu.vcores=$map_vcores_ts -Dmapreduce.reduce.cpu.vcores=$red_vcores_ts \
        terasort_input terasort_output

    time hadoop jar $examples_jar teravalidate -Dmapreduce.map.memory.mb=$map_mem_tv \
        terasort_output terasort1TB_validate

    where $examples_jar = /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

Table 7. TeraSort suite commands for 1, 3, and 10 TB input – see below for parameters

PARAMETER         ALL         1 VM PER HOST  2 VMS PER HOST  4 VMS PER HOST  BARE METAL
blocksize         1342177280
n_maps_tg         319
map_mem_tg (MiB)              13824          12800           13312           12800
map_vcores_tg     2
n_reds_ts         639
map_mem_ts (MiB)              6912           6400            6656            6400
red_mem_ts (MiB)              6912           6400            6656            6400
map_vcores_ts     1
red_vcores_ts     1
map_mem_tv (MiB)              6912           6400            6656            6400
Table 8. TeraSort suite parameters per platform

The TeraSort results are shown in Tables 9-11 and plotted in Figure 7. Focusing first on the optimum virtualized platform, 4 VMs per host: virtualized TeraGen was faster than bare metal due to the smaller number of disks per DataNode. This is consistent as the dataset created grows from 1 TB to 3 TB to 10 TB. Virtualized TeraSort starts at 7.5% faster than bare metal, largely due to the benefits of NUMA locality (the NUMA miss rate as measured by the Linux numastat tool is 0% for the VMs, 21% for bare metal), but decreases to about 1% faster at 3 TB, and finally goes slower than bare metal as the extra memory available in bare metal dominates. Virtualized TeraValidate, mainly reads, was faster than bare metal for the smaller dataset sizes but ended up about the same as bare metal at 10 TB.

Examining the relative performance of the three virtualized platforms, 4 VMs per host is consistently faster than 2 VMs per host due to its better match to the number of disks per DataNode (4 vs. 8). 1 VM per host, which suffers from both NUMA non-locality and vSphere overhead, is the worst of the three virtualized configurations and is, in fact, worse than bare metal.

The resource utilization of the three applications is shown in the Cloudera Manager screenshot in Figure 5 (10 TB TeraSort suite). During TeraGen, about 35.3 GB/s of cluster disk writes occur, accompanied by 23.7 GB/s of cluster network I/O as replicas are copied to other nodes. TeraSort starts with the CPU-intensive map sort, with disk writes occurring as sorted data is spilled to disk. In the reduce copy phase of TeraSort, network I/O ticks up, followed by disk I/O as the sorted data gets written to HDFS by the reducers. Finally, TeraValidate is a high disk read, low CPU operation.
PLATFORM    TERAGEN ELAPSED TIME (SEC)  TERASORT ELAPSED TIME (SEC)  TERAVALIDATE ELAPSED TIME (SEC)
1 VM/host   211.4                       472.5                        52.2
2 VMs/host  127.4                       290.2                        47.3
4 VMs/host  101.2                       274.7                        46.3
Bare metal  145.4                       295.2                        52.6
Performance advantage, 4 VMs/host over bare metal: 43.7%, 7.5%, 13.7%

Table 9. TeraSort performance results showing vSphere vs. bare metal – 1 TB (smaller is better)

PLATFORM    TERAGEN ELAPSED TIME (SEC)  TERASORT ELAPSED TIME (SEC)  TERAVALIDATE ELAPSED TIME (SEC)
1 VM/host   586.1                       1371.9                       108.8
2 VMs/host  318.4                       949.9                        100.9
4 VMs/host  242.5                       722.1                        94.4
Bare metal  382.9                       729.6                        101.8
Performance advantage, 4 VMs/host over bare metal: 57.9%, 1.0%, 7.8%

Table 10. TeraSort performance results showing vSphere vs. bare metal – 3 TB (smaller is better)

PLATFORM    TERAGEN ELAPSED TIME (SEC)  TERASORT ELAPSED TIME (SEC)  TERAVALIDATE ELAPSED TIME (SEC)
1 VM/host   1877.8                      5216.4                       305.1
2 VMs/host  991.6                       3186.2                       283.0
4 VMs/host  748.3                       2519.8                       261.5
Bare metal  1221.1                      2318.7                       264.3
Performance advantage, 4 VMs/host over bare metal: 63.2%, -8.0%, 1.1%

Table 11. TeraSort performance results showing vSphere vs. bare metal – 10 TB (smaller is better)

Figure 7. TeraSort suite performance showing vSphere vs. bare metal

TestDFSIO Results

TestDFSIO was run as shown in Table 12 to generate output of 1, 3, and 10 TB by writing 1,000 files of increasing size. As in TeraSort, the map memory size was adjusted experimentally for best performance for each platform, as shown in Table 13. There is a short reduce phase at the end of the test which was found to run best with 2 cores per reduce task.
TESTDFSIO COMMAND

    time hadoop jar $test_jar TestDFSIO -Ddfs.blocksize=$blocksize -Dmapreduce.map.memory.mb=$map_mem \
        -Dmapreduce.reduce.cpu.vcores=$red_vcores -write -nrFiles 1000 -size 1GB/3GB/10GB

    where $test_jar = /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar

Table 12. TestDFSIO command for 1, 3, and 10 TB datasets – see below for parameters

PARAMETER      ALL         1 VM PER HOST  2 VMS PER HOST  4 VMS PER HOST  BARE METAL
blocksize      1342177280
map_mem (MiB)              6912           6400            6656            6400
red_vcores     2
Table 13. TestDFSIO parameters per platform

The results are shown in Table 14 and Figure 8. Like TeraGen, TestDFSIO was faster on the 4 VMs per host platform, due to its better match of the number of disks (4) per DataNode than the other virtualized platforms (8 or 16) or bare metal (16). For all three sizes, the 4 VMs per host platform was faster than 2 VMs per host, which was faster than bare metal. 1 VM per host was always slowest for the reasons cited above. The 4 VMs per host platform maxed out at 47.5 GB/s total cluster disk I/O compared to 28.3 GB/s for the bare metal platform.
PLATFORM    TESTDFSIO 1 TB ELAPSED TIME (SEC)  TESTDFSIO 3 TB ELAPSED TIME (SEC)  TESTDFSIO 10 TB ELAPSED TIME (SEC)
1 VM/host   234.4                              623.6                              2016.7
2 VMs/host  135.0                              318.9                              1008.3
4 VMs/host  112.7                              237.2                              749.5
Bare metal  145.3                              400.4                              1300.4
Performance advantage, 4 VMs/host over bare metal: 28.9%, 68.8%, 73.5%

Table 14. TestDFSIO performance results showing vSphere vs. bare metal (smaller is better)

Figure 8. TestDFSIO performance showing vSphere vs. bare metal

Spark Results

The three Spark MLlib benchmarks were controlled by configuration files exposing many Spark and algorithm parameters. The parameters that were modified from their default values are documented for each application following.

For Hadoop, the best performance was seen with large numbers of map or reduce tasks, each with 1 or 2 vcores and 6 to 14 GiB per task, whereas the best performance with the Spark benchmarks was seen with a smaller number of larger Spark executors. As shown in the tables below, the Spark applications ran best with 2 or 4 vcores (spark.executor.cores) and up to 34 GiB (spark.executor.memory plus spark.yarn.executor.memoryOverhead) per Spark executor. For K-means and Logistic Regression, it was necessary to vary the memory settings slightly between platforms to achieve the best performance.

The number of resilient distributed dataset (RDD) partitions was initially set to the number of executors times the number of cores per executor so there would be one partition per core. However, for some applications the partition size for the 2 and 3 TB datasets exceeded the 2 GiB Spark partition limit, so additional partitions were specified.

Spark was run in yarn-client mode; this means the Spark master process ran on the Spark gateway on the gateway VM. 20 GiB was assigned to this process throughout (spark.driver.memory).

All three MLlib applications were tested with training dataset sizes of 1, 2, and 3 TB. The cluster memory was sufficient to contain all datasets. For each test, first a training set of the specified size was created. Then the machine learning component was executed and timed, with the training set ingested and used to build the mathematical model to be used to classify real input data. The training times of six runs were recorded, with the first one discarded and the average of the remaining five values reported here.
Throughtestingitwasdeterminedthat4coresperexecutorwasoptimal,butratherthanspecifyatotalof160executorsonthe640-coreclusters,120executorswerefoundtobefaster.
Thistranslatesto12executorsoneachofthe10baremetalserversor3executorsoneachofthe40VMsofthe4VMsperhostplatform.
Asbefore,thenumber,120,wasreducedby1toleaveroomfortheApplicationMaster.
05001000150020002500TestDFSIO(1TB)TestDFSIO(3TB)TestDFSIO(10TB)ElapsedTime(sec)TestDFSIOPerformance-Virtualizedvs.
BareMetal-SmallerIsBetter1VM/Host2VMs/Host4VMs/HostBareMetalFASTVIRTUALIZEDHADOOPANDSPARKONALL-FLASHDISKSPERFORMANCESTUDY|24Thenumberofpartitionsspecifiedwas476(119x4)forthesmallerdatasetsandwasraisedto952forthe3TBdataset.
Eachexecutorinthe1and2VMsperhostcasewasassigned34GiBofmemory(including10GiBofoverheadmemory).
Thiswasslightlyreducedto32GiBforthe4VMsperhostplatformandbaremetal.
Soforthebaremetalcase384GiB(12executorsat32GiBeach)ofthe448GiBcontainermemorywasfoundoptimal.
TheK-meansparametersareshowninTable15.
PARAMETERALL1VMPERHOST2VMSPERHOST4VMSPERHOSTBAREMETAL#examples5,000,0001TB#features11,5002TB#features23,0003TB#features34,500#executors119Coresperexecutor41TB#partitions4762TB#partitions4763TB#partitions952Sparkdrivermemory20GiBExecutormemory24GiB24GiB22GiB22GiBExecutoroverheadmem10GiB10GiB10GiB10GiBTable15.
SparkK-meansparametersperplatformTheK-meansperformanceresultsareshowninTable16andFigure9.
Thespark-perfapplicationswerecodedasNUMA-aware,sotherewerenoNUMAmissesonanyoftheplatforms.
Sincethedatasetsremaininmemory,thenumberofdisksperDataNodedoesnotmatter.
Thustheperformanceacrossthefourplatformsisverysimilar.
Theslightadvantageseenonthe4VMsperhostplatformisduetothelargeamountofdatathathastobetransmittedfromexecutorsononenodetoexecutorsonothernodesinaSparkdistributedapplication.
FortheVMsonthe4VMsperhostplatform,3ofthe39nodes(8%)thatanexecutorcanwritetoare,infact,onthesamephysicalserversothecommunicationoccursthroughtheserver'smemorybus,muchfasterthanhavingtogothroughthenetwork.
Thesamefactoris1of19nodes(5%)forthe2VMsperhostplatformand0%forthe1VMperhostandbaremetalplatforms.
FASTVIRTUALIZEDHADOOPANDSPARKONALL-FLASHDISKSPERFORMANCESTUDY|25PLATFORMK-MEANS1TBELAPSEDTIME(SEC)K-MEANS2TBELAPSEDTIME(SEC)K-MEANS3TBELAPSEDTIME(SEC)1VM/host130.
9272.
4549.
72VMs/host130.
3276.
9567.
64VMs/host120.
2258.
8526.
5Baremetal129.
4277.
8560.
3Performanceadvantage,4VMs/hostoverbaremetal7.
6%7.
3%6.
4%Table16.
SparkK-meansperformanceresultsshowingvSpherevs.
baremetal–smallerisbetterFigure9.
Figure 9. Spark K-means performance

Logistic Regression

Logistic Regression was also tested with 5 million examples and 11,500, 23,000, and 34,500 features. It was found that 2-core executors with 279 executors per cluster (28 per bare metal cluster node) gave the best performance. Executor sizes of 14 or 15 GiB were used (see Table 17). So, using the bare metal cluster as an example, 56 of the 64 container vcores and 420 GiB of the 448 GiB container memory were found to be optimal. The number of partitions specified was set to 558 (279 x 2) in all cases but one.

PARAMETER                 ALL        1 VM PER HOST  2 VMS PER HOST  4 VMS PER HOST  BARE METAL
# examples                5,000,000
1 TB # features           11,500
2 TB # features           23,000
3 TB # features           34,500
# executors               279
Cores per executor        2
1 TB # partitions                    558            558             558             558
2 TB # partitions                    558            558             558             558
3 TB # partitions                    558            1116            558             558
Spark driver memory       20 GiB
Executor memory                      12 GiB         11 GiB          11 GiB          12 GiB
Executor overhead memory             3 GiB          3 GiB           3 GiB           3 GiB

Table 17. Spark Logistic Regression parameters per platform

Logistic Regression performance results are shown in Table 18 and Figure 10. Like K-means, they are close, with the 4 VMs per host platform showing a slight lead over the others due to the higher percentage of in-memory transfers.

PLATFORM    LOGISTIC REGRESSION 1 TB ELAPSED TIME (SEC)  LOGISTIC REGRESSION 2 TB ELAPSED TIME (SEC)  LOGISTIC REGRESSION 3 TB ELAPSED TIME (SEC)
1 VM/host   36.0                                         61.3                                         99.9
2 VMs/host  35.1                                         58.9                                         85.0
4 VMs/host  25.8                                         38.4                                         52.7
Bare metal  34.9                                         54.1                                         83.8
Performance advantage, 4 VMs/host over bare metal: 35.4%, 40.9%, 59.2%

Table 18. Spark Logistic Regression performance results showing vSphere vs. bare metal – smaller is better
SimilartoK-means,itranbestwith1194-coreexecutors.
Foreachplatform,32GiBofexecutormemory(including10GiBofoverhead)wasoptimal.
Thenumberofpartitionsrequiredscaledwiththedatasetsize,from476to952to1666(seeTable19).
020406080100120SparkLR(1TB)SparkLR(2TB)SparkLR(3TB)ElapsedTime(sec)SparkLogisticRegressionPerformance-Virtualizedvs.
BareMetalSmallerIsBetter1VMPerHost2VMsPerHost4VMsPerHostBareMetalFASTVIRTUALIZEDHADOOPANDSPARKONALL-FLASHDISKSPERFORMANCESTUDY|28PARAMETERALL#examples5,000,0001TB#features15,0002TB#features30,0003TB#features45,000#executors119Coresperexecutor41TB#partitions4762TB#partitions9523TB#partitions1666Sparkdrivermemory20GiBExecutormemory22GiBExecutoroverheadmemory10GiBTable19.
SparkRandomForestparametersforallplatformsRandomForestperformanceresults(showninTable20andFigure11)areonceagainclose,with4VMsperhostbeingthebest.
PLATFORMRANDOMFOREST1TBELAPSEDTIME(SEC)RANDOMFOREST2TBELAPSEDTIME(SEC)RANDOMFOREST3TBELAPSEDTIME(SEC)1VM/host157.
3340.
5652.
72VMs/host140.
7305.
4594.
94VMs/host133.
5291.
7534.
8Baremetal149.
0324.
1609.
0Performanceadvantage,4VMs/hostoverbaremetal11.
6%11.
1%13.
9%Table20.
SparkRandomForestperformanceresultsshowingvSpherevs.
baremetal–smallerisbetterFASTVIRTUALIZEDHADOOPANDSPARKONALL-FLASHDISKSPERFORMANCESTUDY|29Figure11.
SparkRandomForestperformanceSparkScalingasDatasetExceedsMemorySizeSparkcachesitsRDDsinmemorywhileprocessing,andthenwritestodiskwhennecessary.
TogaugetheperformanceoftheSparkapplicationswhendatasetsizeexceededclustermemory,theLogisticRegressionworkloadwasrunonthe4VMsperhostplatformwithtrainingdatasetsfrom1TBto6TB.
Thefirstthreefitinmemory,thelastthreerequirethatSparkRDDsbecachedtodisk.
ThedataisinTable21andFigure12.
DATASETSIZELOGISTICREGRESSIONELAPSEDTIME(SEC)1TB24.
52TB37.
63TB52.
24TB157.
45TB203.
96TB274.
4Table21.
SparkLogisticRegressionperformanceforlargedatasetsAsseenclearlyinFigure12,theperformancescaleslinearlywithdatasetsizeasthedatasetremainsinmemory,thenscaleslinearlyagain,butataslowerrateasthedatasetgetscachedtodisk.
0100200300400500600700SparkRF(1TB)SparkRF(2TB)SparkRF(3TB)ElapsedTime(sec)SparkRandomForestPerformance-Virtualizedvs.
BareMetal-SmallerIsBetter1VMPerHost2VMsPerHost4VMsPerHostBareMetalFASTVIRTUALIZEDHADOOPANDSPARKONALL-FLASHDISKSPERFORMANCESTUDY|30Figure12.
SparkLogisticRegressionperformanceasdatasetsizeexceedsclustermemorySummaryofBestPracticesTorecap,herearethebestpracticescitedinthispaper:Reserveabout5-6%oftotalservermemoryforESXi;usetheremainderforthevirtualmachines.
DonotovercommitphysicalmemoryonanyhostserverthatishostingBigDataworkloads.
CreateoneormorevirtualmachinesperNUMAnode.
LimitthenumberofdisksperDataNodetomaximizetheutilizationofeachdisk:4to6isagoodstartingpoint.
Useeager-zeroedthickVMDKsalongwiththeext4orxfsfilesysteminsidetheguest.
UsetheVMwareParavirtualSCSI(pvscsi)adapterfordiskcontrollers;useall4virtualSCSIcontrollersavailableinvSphere6.
5.
Usethevmxnet3networkdriver;configurevirtualswitcheswithMTU=9000forjumboframes.
ConfiguretheguestoperatingsystemforHadoopperformanceincludingenablingjumboIPframes,reducingswappiness,anddisablingtransparenthugepagecompaction.
PlaceHadoopmasterroles,ZooKeeper,andjournalnodesonthreevirtualmachinesforoptimumperformanceandtoenablehighavailability.
DedicatetheworkernodestorunonlytheHDFSDataNode,YARNNodeManager,andSparkExecutorroles.
RuntheHiveMetastoreinaseparateMySQLdatabase.
SettheYarnclustercontainermemoryandvcorestoslightlyovercommitbothresources.
Adjustthetaskmemoryandvcorerequirementtooptimizethenumberofmapsandreducesforeachapplication.
Conclusion

Virtualization of Big Data clusters with vSphere brings advantages that bare metal clusters don't have, including better utilization of cluster resources, enhanced operational flexibility, and ease of deployment and management. From a performance point of view, enhanced NUMA locality and a better match of disk drives per HDFS DataNode more than outweigh the extra memory required for virtualization. Through application of best practices throughout the hardware and software stack, it is possible to utilize all the benefits of virtualization without sacrificing performance in a production cluster.

These best practices were tested in a highly available 13-server cluster running Hadoop MapReduce version 2 and Spark on YARN with the latest cluster management tools. Similarly configured virtualized platforms including 1, 2, and 4 VMs per host, as well as a bare metal platform, were installed on the 10 cluster worker servers. The optimal virtualization point was seen to be 4 VMs on each 2-processor, 16-data-disk host. This configuration ensured that the VMs would fit within NUMA node boundaries for fastest memory access, and the number of data disks per VM (4) was in the optimal range of 4-6 per HDFS DataNode for utilization. Next fastest was 2 VMs per host, which also fit within NUMA node boundaries, but at 8 data disks per VM left the disks somewhat underutilized. The 1 VM per host platform was the slowest since it exhibited neither NUMA locality nor a good match of disks per DataNode.

Input/output-dominated applications such as TeraGen and TestDFSIO ran significantly faster on the 4 VMs per host platform than on bare metal. TeraSort ran faster virtualized on smaller datasets, but at 10 TB the bare metal advantage of larger memory overcame the disadvantage of NUMA misses, so bare metal became faster. Spark Machine Learning Library applications ran up to 60% faster virtualized than on bare metal due to the virtualized advantage of having multiple nodes per physical server.

The large server memory (512 GiB) was put to use by running very large block size Hadoop applications for the best performance. The flash disks provided both very fast random reads and writes for YARN NodeManager traffic (on NVMe storage) and fast large sequential reads and writes for HDFS DataNode traffic (on SSDs).
References

[1] Jeff Buell. (2015, February) Virtualized Hadoop Performance with VMware vSphere 6. http://www.vmware.com/files/pdf/techpaper/Virtualized-Hadoop-Performance-with-VMware-vSphere6.pdf
[2] Dave Jaffe. (2016, August) Big Data Performance on VMware vSphere 6. https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/bigdata-perf-vsphere6.pdf
[3] The Apache Software Foundation. (2015, June) Apache Hadoop NextGen MapReduce (YARN). https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html
[4] Databricks. (2015, December) spark-perf: Spark Performance Tests. https://github.com/databricks/spark-perf
[5] Cloudera. (2017) Performance Management. http://www.cloudera.com/documentation/enterprise/latest/topics/admin_performance.html
[6] Cloudera. (2017) Cloudera Enterprise 5.10.x Documentation. https://www.cloudera.com/documentation/enterprise/5-10-x.html
[7] VMware, Inc. (2013) VMware Storage Policy API 6.0. https://code.vmware.com/apis/32/storage-policy
[8] GitHub, Inc. (2016, June) spark-bench. https://github.com/SparkTC/spark-bench

VMware, Inc. 3401 Hillview Avenue Palo Alto CA 94304 USA Tel 877-486-9273 Fax 650-427-5001 www.vmware.com

Copyright 2017 VMware, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws. VMware products are covered by one or more patents listed at http://www.vmware.com/go/patents. VMware is a registered trademark or trademark of VMware, Inc. in the United States and/or other jurisdictions. All other marks and names mentioned herein may be trademarks of their respective companies.

Comments on this document: https://communities.vmware.com/docs/DOC-32739

About the Author

Dave Jaffe is a staff engineer at VMware, specializing in Big Data performance.

Acknowledgements

The author would like to thank Justin Murray, Technical Marketing Manager for Big Data at VMware, for many valuable discussions, and Paul Cao and Hai-Fang Yun of Hewlett Packard Enterprise for much technical advice on HPE servers.
