Document Number: 340608-001US

Utilizing Linux Swap with Intel Optane DC SSDs as a Memory Overcommit Technique
Solutions Blueprint
June 2019
Version 1

Team Contacts:
Andrzej Jakowski    andrzej.jakowski@intel.com    Kernel Development
Tim C. Chen         tim.c.chen@intel.com          Kernel Development
Ying Huang          ying.huang@intel.com          Kernel Development
Frank Ober          frank.ober@intel.com          Testing and Outreach
David J. Leone      david.j.leone@intel.com       Testing and Outreach
Andrew Ruffin       andrew.ruffin@intel.com       Market Analysis and Outreach
Pragathi Narendra   pragathi.narendra@intel.com   Performance Test and Test Development
Mariusz Barczak     mariusz.barczak@intel.com     Kernel Development
Gert Pauwels        gert.pauwels@intel.com        Field Technical Support EMEA Region
Steven Briscoe      steven.briscoe@intel.com      Field Technical Support EMEA Region
Farib Khondoker     farib.khondoker@intel.com     Testing and Support

Revision History
Revision Number    Description         Revision Date
001                Initial release.    June 2019

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Intel, the Intel logo, Optane, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Intel Corporation

Contents
Introduction
    Scope
Memory Overcommit Use Cases
    Example Server Cost Model
The Kernel Build Process
    Development Tools Required for menuconfig (Possible Pre-requisites)
Appendix A    Automation Scripts and How-to Guide
Appendix B    Memory Management Fundamentals
    B.1 Memory Management System Overview
Appendix C    Linux Kernel Innovations to Leverage Fast SSDs as Memory Extension
    C.1 Swap Improvements Completed in v4.14 of Linux Kernel
Appendix D    Swap Improvements Patch Lists
    D.1 References

Introduction

This solutions blueprint explains how to use Intel Optane DC SSDs in memory extension configurations, or as memory replacement.
We'll describe recent performance improvements that were first introduced in version 4.11 and completed in version 4.14 of the Linux* kernel. For simplicity, we will refer to version 4.14 or newer as the kernel version needed to evaluate high performance swap usage. Very high endurance and low latency devices like Intel Optane DC SSDs can be efficiently used as swap devices, thereby enabling the system to exceed its minimum required system level performance in various memory overcommit use cases. Intel Optane SSDs used as swap devices are expected to have a long lifespan of five or more years in this usage.

For those who intend to immediately implement and test the use cases outlined in this document, please jump to the Appendix sections, and visit the following GitHub link for tools, instructions, and test code.
http://github.com/fxober/LinuxSwap

Scope

We will focus on how the Linux operating system (OS) can utilize Intel Optane DC SSDs as swap devices, thereby allowing storage device capacity to be used in conjunction with DRAM to store memory pages on both DRAM and non-volatile memory type media. The process of moving memory pages between the storage device and main memory is called paging. Paging allows system administrators to perform efficient management of system resources (memory, CPU, storage) at desired cost and service levels. With recent advancements in storage media and Linux kernel improvements, Intel Optane DC SSDs provide a new opportunity to offset DRAM costs and allow for more flexible process memory oversubscription, at higher performance levels than before. This solutions blueprint will explore those usages.

Target Audience

This document is targeted at system administrators, system operators, DevOps teams, and application developers who want to configure their underlying software and hardware resources to maximize system performance at a better cost. It assumes familiarity with basic computer architecture terminology and with the techniques an OS uses to manage physical resources such as CPU, memory and storage. It also explains fundamental concepts of the memory management techniques utilized in modern OSs, focusing on the Linux environment. The improved implementations of Linux Swap* and better, higher endurance memory media, such as Intel Optane memory, are essentially what enable such a solution to be effective in a modern data center environment.

Document Organization

First, this document introduces use cases in which the Intel Optane DC SSD is used as memory augmentation. Later, a server cost model is presented, which can be adopted or adjusted to calculate potential cost savings when leveraging an Intel Optane DC SSD as DRAM replacement. Next, we describe the OS upgrades necessary to maximize system performance when using an Intel Optane DC SSD as a swap device. Specifically, we provide guidance on the minimum required versions of common Linux distributions that include the swap and memory management subsystem improvements, along with details on building the Linux kernel manually to maximize swap performance. The Additional Considerations for Software Configuration section explores system configuration details for maximizing swap performance. Then we compare the performance of the different swap devices. The Appendix sections explain the memory management subsystem and the Linux kernel innovations that improve swap performance. Finally, a kernel patch list is provided for advanced users willing to backport the changes into their own kernel fork.

Glossary

Term: Definition

Physical memory: Fast memory, byte addressable (as opposed to disk storage, which is sector or block addressable). This fast, dynamic system memory is typically provided by DRAM technology.

Swap device: Dedicated space on a storage device for storing memory pages of process data or process code. It can be a whole block storage device, a partition of one, or a file in a file system (swap file).

Virtual memory: Memory management technique implemented in modern OSs. It provides an illusion to the running process that it operates on a contiguous block of memory, while in reality hardware and the OS manage translations from virtual addresses to physical addresses, and transfers of memory pages from the storage device to physical memory. OS virtual memory hides those complexities from the application programmer.

Total Cost of Ownership (TCO): A defined, but often not standardized, approach to analyzing the financial impact of a purchase, and perhaps the ongoing expenses of hardware and software infrastructure over its lifecycle. TCO models typically include various factors impacting cost, e.g. cost to purchase HW (capital spending), operational cost related to electricity used to power and cool a building, and Data Center equipment. This paper focuses on a simplified server cost model. You can consider it a Bill of Material optimization, since the target is not a full analysis of all server operation or acquisition costs.

Memory Overcommit Use Cases

This chapter introduces example use cases in which an Intel Optane DC SSD can be used as memory extension, or as memory replacement, by using the Linux swap mechanism. This chapter also provides an example server cost model that has been developed to illustrate potential cost savings when considering the purchase of new HW infrastructure. Use this server cost model as a framework to calculate potential cost savings at the server capital expenditure level.

Memory Overcommit for Virtualized Environments

One common technique widely used among cloud service providers (CSPs) is to overcommit physical resources, including physical CPU, storage, and memory. The following figure illustrates virtual machine differentiation based on the ratio of guest physical memory that is actually backed by physical DRAM. For example, the guest physical memory of "Gold" VMs is fully backed by DRAM, while for "Silver" VMs half of the guest physical memory is backed by DRAM and the remaining portion is backed by the swap device. Finally, for "Bronze" VMs, a quarter of the guest physical memory is backed by DRAM, and the remaining portion can be paged out to the swap device. With a Linux based hypervisor (KVM), this type of differentiation can be achieved using the mechanism called control groups (cgroups), which controls resource usage (e.g. system memory) for a group of processes; in this case, a class of VMs.
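
The Gold/Silver/Bronze split can be sketched as a small calculation. The following is a minimal sketch, not the authors' tooling: the function names and the cgroup directory are hypothetical, and it only prints the write that would set the limit rather than performing it. It assumes a cgroup v2 hierarchy, where the memory limit file is memory.max (on cgroup v1 the equivalent file is memory.limit_in_bytes).

```shell
# Sketch: derive the DRAM-backed share of a VM class from its guest memory size.
# Class names and ratios follow the Gold/Silver/Bronze example in the text;
# function names and the cgroup directory are illustrative assumptions.

# dram_limit_bytes CLASS GUEST_MEM_GIB -> prints the DRAM-backed share in bytes
dram_limit_bytes() {
    local class="$1" guest_gib="$2" divisor
    case "$class" in
        gold)   divisor=1 ;;   # fully backed by DRAM
        silver) divisor=2 ;;   # half backed by DRAM
        bronze) divisor=4 ;;   # a quarter backed by DRAM
        *) echo "unknown class: $class" >&2; return 1 ;;
    esac
    echo $(( guest_gib * 1024 * 1024 * 1024 / divisor ))
}

# print_cgroup_cmd CLASS GUEST_MEM_GIB CGROUP_DIR
# Prints (does not run) the write that would apply the limit under cgroup v2.
print_cgroup_cmd() {
    local limit
    limit="$(dram_limit_bytes "$1" "$2")" || return 1
    echo "echo ${limit} > $3/memory.max"
}
```

For example, `print_cgroup_cmd bronze 16 /sys/fs/cgroup/vm-bronze` prints the command that would cap a hypothetical "vm-bronze" group of 16 GiB guests at 4 GiB of DRAM, leaving the remainder eligible to be paged out to the swap device.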
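
Figure 1: Example of Virtual Machine Differentiation Based on Memory Overcommit Ratio

Example Server Cost Model

This chapter focuses on deriving an example server cost model, from a system memory hardware costs perspective, for two example configurations of servers: server "A" and server "B." The server cost model does not take into account the varied and unique operational expenses or other capital expenditures related to the larger scope of running a data center. For simplicity of our comparison, differences in space, power, operating costs, and other variable factors are ignored.

Server "A" and server "B" configurations are almost identical with regard to CPU, networking, and storage (both boot disks and data volumes). There are only two differences between them:
- Server "A" total physical DRAM is 384 GiB (24 x 16 GB RDIMMs), while server "B" is populated with only 192 GiB (12 x 16 GB RDIMMs) of physical DRAM.
- Server "A" does not use an Intel Optane DC SSD as a swap device, whereas server "B" uses Intel Optane DC SSDs (2 x 100 GiB devices) as swap devices.

One of the data points most interesting to a system administrator is the relative cost of server "B" to server "A", which illustrates the potential hardware component cost savings on the purchase or lease of new servers for the data center. Additional server cost calculations focus on the relative costs of the server "B" configuration compared to server "A". For simplicity, this costing model takes into account only the memory components (DRAM + Intel Optane DC SSD capacities), because all other components of those server configurations are identical.

The relative cost of the server "B" configuration compared to the server "A" configuration can be defined as follows, where DRAM_A and DRAM_B are the DRAM capacities of the two servers in GiB, Optane_B is the Optane capacity of server "B" in GiB, and cost_DRAM and cost_Optane are the prices per GiB:

\[
\text{Relative\_cost} = \frac{\text{Cost}_B}{\text{Cost}_A} = \frac{\text{DRAM}_B \cdot \text{cost}_{\text{DRAM}} + \text{Optane}_B \cdot \text{cost}_{\text{Optane}}}{\text{DRAM}_A \cdot \text{cost}_{\text{DRAM}}}
\]

Dividing the numerator and denominator of the above equation by cost_Optane leads to the following formula:

\[
\text{Relative\_cost} = \frac{\text{DRAM}_B \cdot \frac{\text{cost}_{\text{DRAM}}}{\text{cost}_{\text{Optane}}} + \text{Optane}_B}{\text{DRAM}_A \cdot \frac{\text{cost}_{\text{DRAM}}}{\text{cost}_{\text{Optane}}}}
\]

Substituting cost_DRAM / cost_Optane with the normalized per GiB DRAM to Optane price ratio (DRAM_to_Optane) leads to this final formula:

\[
\text{Relative\_cost} = \frac{\text{DRAM}_B \cdot \text{DRAM\_to\_Optane} + \text{Optane}_B}{\text{DRAM}_A \cdot \text{DRAM\_to\_Optane}}
\]

Note: Please do your own price calculations using the formula above to calculate your server cost savings.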
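
The final formula is simple enough to evaluate on the command line. The sketch below assumes the relative cost is (DRAM_B x DRAM_to_Optane + Optane_B) / (DRAM_A x DRAM_to_Optane), with capacities in GiB; the function name is hypothetical, and the price ratio passed in is a placeholder you must replace with your own pricing.

```shell
# Sketch: relative memory cost of server "B" versus server "A".
# relative_cost DRAM_A_GIB DRAM_B_GIB OPTANE_B_GIB DRAM_TO_OPTANE
#   DRAM_TO_OPTANE is the per-GiB DRAM price divided by the per-GiB Optane price.
relative_cost() {
    awk -v a="$1" -v b="$2" -v o="$3" -v r="$4" \
        'BEGIN { printf "%.4f\n", (b * r + o) / (a * r) }'
}
```

With the capacities from the example configurations and an assumed (purely illustrative) price ratio of 4, `relative_cost 384 192 200 4` prints 0.6302, meaning the memory components of server "B" would cost about 63% of those of server "A" under that assumption.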

The Kernel Build Process

Recommended Software Upgrades

In order to maximize Intel Optane DC SSD performance in a memory extension configuration (as a swap device), Intel recommends upgrading your Linux distribution to a recent version containing the backported series of patches that were added to the upstream Linux kernel in versions 4.11 and later. The following table contains information on the common Linux distribution versions that adopted performance improvements pertaining to swap performance.

Table 1: Linux Distribution Containing Swap Performance Improvements
Linux Distribution    OS Version
RHEL/CentOS           Starting version 7.5 and forward; starting version 8.0 and forward
Ubuntu                Starting version 18.10 and forward
SLES                  Starting version SLES 15, SLES 12 SP4 and forward
Oracle* Linux         Starting version Oracle Linux 7.5 and later with UEK R5 and RHCK

How to Build your Kernel Based on Upstream Linux Kernel

This section provides instructions on building a Linux kernel image based on the upstream Linux kernel project.
This may be especially useful for those interested in further exploration of Linux kernel improvements relating to swap device performance, and who are willing to upgrade their infrastructure's Linux kernel. Please note that these instructions are based on an Ubuntu* server 18.04.2 system build; the exact steps may differ between Linux distributions, e.g. in the usage of the distribution package manager.

Approximate time needed: 1 hour

Development Tools Required for menuconfig (Possible Pre-requisites)

In order to clone, compile, and build a new kernel/driver, the following packages must be installed. You must be logged in as root to install these packages.

## Dependencies needed to run kernel menuconfig
# apt-get install flex bison
# apt-get install libncurses5-dev libncursesw5-dev
## Dependencies needed to perform kernel build
# apt-get install libssl-dev libelf-dev
# dpkg -i linux-*.deb

Build New Linux Kernel with RCU Setting for Swap

Download Linux kernel 4.14 or 5.x or newer from this repository: https://www.kernel.org/pub/linux/kernel/ into your Linux distribution. It is best to choose the latest stable kernel.
From a working directory:

## Use wget to download the kernel and unpack it (here the example is 4.18.20)
# wget https://mirrors.edge.kernel.org/pub/linux/kernel/v4.x/linux-4.18.20.tar.xz
# tar -xvf linux-4.18.20.tar.xz
# cd linux-4.18.20
## Alternatively clone the whole Linux kernel git repository and check out a specific branch
# git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
# cd linux
# git checkout -b v4.18.20_local v4.18.20

Build and install

To create the kernel configuration file (.config) based on the running kernel, and use the default setting for all new options, run the following command:

# yes "" | make oldconfig

To obtain maximum performance, avoid read-copy-update (RCU) callback processing on the CPUs that handle swap IO, as this may introduce delays. To do so, set "CONFIG_RCU_NOCB_CPU=y" in your local kernel .config file. See Offloading RCU Processing to Dedicated Kernel Threads for details on editing RCU settings. Alternatively, you can run menuconfig and select that option using the user interface.

# make menuconfig

Under "General Setup and Features > RCU Subsystem" set the "Offload RCU callback…" flag. Save and exit menuconfig.

Build the kernel and kernel modules, and install the new kernel on the system.

## To build kernel image and loadable kernel modules invoke
# make
# make modules_install
## Install newly built kernel into operating system
# make install

After a successful install, reboot the system to load the new kernel image and kernel modules. Usually the new kernel becomes the default boot selection. After booting the OS, use "uname -a" to verify that the running kernel version matches the newly installed kernel version. If a different kernel version is loaded, you can change this by reconfiguring the system loader, usually grub2. Refer to the system loader documentation for your specific distribution.
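
The post-install checks can be scripted. This is a minimal sketch with hypothetical helper names; it assumes the kernel configuration is available as a plain-text file (on Ubuntu this is typically /boot/config-$(uname -r), but that location is an assumption and varies by distribution).

```shell
# Sketch: verify the kernel build settings and the running kernel version.

# config_has FILE OPTION -> succeeds only if OPTION is built in (OPTION=y) in FILE
config_has() {
    grep -q "^$2=y\$" "$1"
}

# running_kernel_matches VERSION -> succeeds if `uname -r` equals VERSION
running_kernel_matches() {
    [ "$(uname -r)" = "$1" ]
}
```

For example, `config_has .config CONFIG_RCU_NOCB_CPU && echo "RCU offload enabled"` can be run in the kernel source directory before building, and `running_kernel_matches 4.18.20` after the reboot.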

Additional Considerations for OS Configuration

This section explores OS configuration considerations for maximizing performance of the swap device(s).

Offloading RCU Processing to Dedicated Kernel Threads

To offload RCU processing to dedicated kernel threads, edit the kernel command line option in the system loader. When using Grub2 as the system loader, open the /etc/default/grub file and add "rcu_nocbs=" to the GRUB_CMDLINE_LINUX_DEFAULT entry. See the /etc/default/grub file listing below for an example:

...
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="rcu_nocbs=0-n maybe-ubiquity"
GRUB_CMDLINE_LINUX=""
...

Note: n is the highest CPU number in your system, that is, the number of CPUs (or HW threads) minus one, because CPUs are numbered from 0.

After saving edits, run either the "update-grub" or "grub2-mkconfig" command to update your grub2 settings in the boot partition. Reboot the system and verify that the new settings have been applied to the kernel.

# dmesg | grep -i offload
[ 0.000000] Offload RCU callbacks from CPUs: 0-63.

The reason for this step is to avoid RCU processing in an IO completion path, as RCU processing will likely increase paging latency.
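
Because the CPU range depends on the machine, the rcu_nocbs value can be generated rather than typed by hand. A minimal sketch, assuming CPUs are numbered contiguously from 0; the function name is hypothetical, and `nproc --all` (GNU coreutils) is used to count installed processors when no count is given.

```shell
# Sketch: build the rcu_nocbs= kernel parameter for all CPUs.
# rcu_nocbs_arg [CPU_COUNT] -> prints e.g. "rcu_nocbs=0-63" for 64 CPUs
rcu_nocbs_arg() {
    local n="${1:-$(nproc --all)}"
    if [ "$n" -gt 1 ]; then
        echo "rcu_nocbs=0-$(( n - 1 ))"
    else
        echo "rcu_nocbs=0"
    fi
}
```

The printed string is what goes inside GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub; editing that file and regenerating the grub configuration remain manual steps.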

Turning Off Transparent Hugepages

To minimize the overhead of coalescing memory pages into huge pages and later breaking them up on the swap device, perform the following commands:

# echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
# echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag

Watermark Scale Factor

It is important to increase the watermark scale factor in /proc/sys/vm, as this is the level at which available memory is checked by kswapd. We recommend setting it to 400, or 4% of available memory; doing so will set kswapd to automatically kick off swapping at 4% of available system memory.

# echo '400' > /proc/sys/vm/watermark_scale_factor

NUMA Considerations

When dealing with multiple swap devices on a multi-socket system, we recommend distributing swap devices evenly among the different CPU sockets to avoid QPI/UPI transfers. Moreover, to avoid software overhead, we recommend creating many swap devices on a partitioned NVMe device. Each swap partition must have the same priority. In most cases there can be at least 28 partitions, depending on the kernel configuration. When setting up your system, we recommend adhering to the NUMA locality rules for maximum performance.
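
The transparent hugepage and watermark settings from this section can be applied together. The sketch below is one possible wrapper, not part of the original scripts: the function name and the optional ROOT prefix (useful for a dry run against a scratch directory) are assumptions. Run without an argument as root, it writes to the real /sys and /proc paths; these writes take effect immediately but do not persist across reboots.

```shell
# Sketch: apply the swap-related runtime tunables described in this section.
# apply_swap_tunables [ROOT]
#   ROOT is an optional path prefix; leave it empty to write to the live system.
apply_swap_tunables() {
    local root="${1:-}"
    echo never > "${root}/sys/kernel/mm/transparent_hugepage/enabled" &&
    echo never > "${root}/sys/kernel/mm/transparent_hugepage/defrag" &&
    echo 400 > "${root}/proc/sys/vm/watermark_scale_factor"
}
```

To make the settings permanent, they would need to be reapplied at boot by whatever mechanism your distribution provides.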

Performance Data of 4.18.20 Linux Swap

We used the pmbench utility to test the allocation and access of 4 KiB memory pages on a Linux system. Our test system utilized an Ubuntu 18.04.2 distribution of Linux, which we initially upgraded to the 4.18.20 version of the kernel, as the Ubuntu release comes with a 4.15.x kernel version. We upgraded using the methods noted in Appendix A - Automation Scripts and How-to Guide. There should be no issue running kernel 4.14 or newer, as the kernel patches to Linux swap are upstreamed (public on kernel.org) in 4.14. You cannot gain this level of performance on kernels prior to 4.14. We tested the in-box kernel of Ubuntu 18.04.2 (kernel 4.15.0-46-generic) and saw minimal difference (…

Here is an example variable setting from /etc/default/grub, CPU count specific:

GRUB_CMDLINE_LINUX_DEFAULT="rcu_nocbs=0-[n] maybe-ubiquity"

Where [n] is the highest CPU number in your system, that is, the total number of CPU cores or virtual CPU threads minus one. Configure the kernel with these .config settings if you are able to compile your own kernel.

4. EXPERIMENTAL: Generally speaking, it is best to set the NVMe scheduler to [none] on the NVMe SSDs you are testing, rather than the mq block or kyber scheduler. In most cases your build shows [none], which is fine.

# more /sys/block/nvme1n1/queue/scheduler
[none]

5. Newer kernels allow an NVMe queue size of 1,023, which is sufficient and recommended.

6. If you are seeing NVMe block merges, change your NVMe block size to 4 KiB (not 512 B) sectors. If block merges are still occurring after making this change, try the following. First, check the nomerges value:

# cat /sys/block/queue/nomerges

The nomerges value should be set to 2. Verify and change if necessary:

# echo 2 > /sys/block/queue/nomerges

Appendix B Memory Management Fundamentals

This chapter introduces the basic memory management concepts used in the Linux kernel.
It explains system level bottlenecks observed when Intel Optane DC SSDs are used as swap devices with Linux versions prior to v4.14 of the upstream Linux kernel. Finally, it explains techniques to overcome those bottlenecks in version 4.14, so users can experience improved performance and utilize Intel Optane DC SSDs as swap devices.

B.1 Memory Management System Overview

Modern operating systems implement a virtual memory model which provides many advantages to application developers. The virtual memory model simplifies software development: it leaves the complexity of physical memory allocation and data placement to the underlying operating system. The operating system kernel deals with that complexity by giving any running process the impression that it has a big chunk of memory available (usually 4 GiB) for its exclusive use. In reality, the OS kernel maps process virtual memory to physical DRAM, and potentially overflows to a swap device, which extends available physical memory. The process of transferring data between the swap device and physical memory is called paging. It consists of page-ins, when data is read from the swap device into physical memory, and page-outs, when data is moved out of memory. Note that page-outs may require data to be written out to the swap device, based on the state of the page.
Figure 2 below provides a conceptual diagram of virtual memory and paging.

Figure 2: Virtual Memory Concept through Paging

The paging process is managed by the OS and is heavily supported by CPU hardware through the memory management unit (MMU). For example, the MMU contains a translation lookaside buffer (TLB) cache, which holds recent virtual-to-physical memory translations. This enables a significant reduction in the time needed to access data in memory. Another CPU feature that assists the OS with memory management is a mechanism called page fault. A page fault is an exception raised by CPU hardware when a process tries to access a virtual memory location that is not mapped to a physical address. There are different types of page faults:
- Minor: raised when a page exists in main memory but there is no entry indicating the virtual-to-physical address mapping. The page fault handler, which is implemented in the OS, creates a new mapping entry.
- Major: raised when a page does not exist in main memory. The page fault handler needs to bring the required data from the swap device into memory and create the corresponding mapping entry. For example, this happens in a freshly loaded process, because the OS kernel delays loading the whole program into memory. This technique, called on-demand paging, accelerates process startup.

A major page fault is a performance draining procedure that requires the OS page fault handler to find an available location in physical memory, which can potentially involve paging out, and to load the content of the program from the swap device into memory before the process can continue its execution.
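
Fault and swap activity can be observed through the kernel's cumulative counters. The sketch below reads four fields from /proc/vmstat: pgfault (all page faults), pgmajfault (major faults), pswpin (pages swapped in) and pswpout (pages swapped out). The function name and the optional file argument are assumptions added for illustration.

```shell
# Sketch: print page fault and swap counters from /proc/vmstat.
# paging_counters [VMSTAT_FILE] -> one NAME=VALUE line per counter, in file order
paging_counters() {
    awk '$1 == "pgfault" || $1 == "pgmajfault" || $1 == "pswpin" || $1 == "pswpout" {
             print $1 "=" $2
         }' "${1:-/proc/vmstat}"
}
```

The counters are totals since boot, so sampling them before and after a workload and subtracting gives the number of faults and swapped pages attributable to that interval.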

There are two different types of pages:
- File system pages, or pages backed by files. These are memory pages that contain file data; for example, database files directly mapped into the process address space, or library files containing executable program code. These pages can be paged in to physical memory; for example, when the program starts executing instructions stored on the disk (i.e. program usage of a shared library). The Linux page cache is a cache of these file-backed pages: both resident pages to be read, and changed (dirty) pages that need to be synchronized to some storage device. Direct access IO routines, which do not use the page cache, are also available on Linux. Since the page cache is an opportunistic and general usage cache, it is not appropriate for all usages.
- Anonymous pages. These are memory pages that contain private process information, that is heap or stack, and have no device or file system backing them.

When the system is running into low memory conditions (high memory pressure), anonymous pages can be paged out (swapped out) to the swap file or swap device by the OS process kswapd and its related kernel threads. This process can be more or less aggressive based on the configuration of the swappiness parameter, as this parameter sets the target of when swapping should become more active. The parameter can be set from 0 to 200 (kernels older than 5.8 cap the value at 100); the higher the value, the more swap is utilized over page cache memory reclamation. In our performance study the OS is configured to its default value of 60, which is the typical production recommended setting. A value of 100 means that the OS will reclaim memory pages using page cache and swap equally. You can print out the proc variable /proc/sys/vm/swappiness to view its current value.

Another important parameter used to control when kswapd kernel threads are activated is watermark_scale_factor. The user can set a lower limit of available memory that specifies when kswapd activity will be started. More details are available in the Watermark Scale Factor section.
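
Both parameters can be inspected from the shell. A minimal sketch with a hypothetical function name and an optional ROOT prefix (an assumption, added so the function can be pointed at a copy of the files); it relies on watermark_scale_factor being expressed in units of 1/10,000 of memory, so a value of 400 corresponds to 4%.

```shell
# Sketch: show the current swappiness and watermark_scale_factor settings.
# show_vm_knobs [ROOT] -> reads ROOT/proc/sys/vm/{swappiness,watermark_scale_factor}
show_vm_knobs() {
    local vm="${1:-}/proc/sys/vm" swappiness wsf
    swappiness="$(cat "$vm/swappiness")" || return 1
    wsf="$(cat "$vm/watermark_scale_factor")" || return 1
    echo "swappiness=${swappiness}"
    # watermark_scale_factor is in units of 0.01%, so divide by 100 for percent
    awk -v w="$wsf" 'BEGIN { printf "watermark_scale_factor=%d (%.2f%% of memory)\n", w, w / 100 }'
}
```

Calling `show_vm_knobs` with no argument reads the live values of the running system.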

Appendix C Linux Kernel Innovations to Leverage Fast SSDs as Memory Extension

Until recently the Linux kernel had been primarily optimized for rotational disks because they were the predominant storage devices. One of the techniques used to maximize swap performance for rotational hard disk drives (HDDs) was to maintain swap data in a contiguous location on the disk to minimize disk seek time. The performance yields of this technique were fine for HDDs but inadequate for solid state drives (SSDs). With recent advancements in non-volatile memory (NVM) technologies like Intel Optane technology, new techniques and methods are needed to take advantage of the increased performance of the media and devices. While testing Linux swap against these new devices, many system-level bottlenecks were discovered in Linux swap. Kernel developers have addressed some of the performance bottlenecks in the release of Linux kernel 4.14. In this section we explore some of those enhancements.

A swap device in the Linux kernel is represented by a dedicated data structure (swap_info_struct) that contains information on how memory pages are stored on the swap device; see Figure 3 below. This information is stored in an array called swap_map, which is part of swap_info_struct. Swap_map stores the usage count for each page stored on the swap device. Swap_map entries are aggregated into clusters; these clusters effectively assign specific portions of the swap device to a specific CPU core. Updates to the usage count of individual swap_map entries require per cluster locks to be taken instead of holding a single lock protecting the whole swap_map.

Figure 3: Primary Swap Device Data Structures

In older kernels, even though there were dedicated swap entries per CPU cluster, accesses to the swap_map were protected by a single lock, which is a scalability and performance limiter when concurrent accesses to the swap device are made. The negative impact of this single lock is especially visible in high memory pressure conditions. When the single lock is used to protect critical information in the swap_info_struct data structure, latencies for handling page faults from the swap device are significantly increased. This heavily impacts end user performance and renders the latest HW latency improvements ineffective due to system level bottlenecks. The next section explains techniques to minimize lock contention on the single lock that protects the swap_info_struct data structure, and to improve system level latencies. As previously discussed in the Performance Data section, access latencies on swap average below 20 microseconds when utilizing a higher performance drive.

C.1 Swap Improvements Completed in v4.14 of Linux Kernel

There are many software techniques to address performance problems related to lock contention. These approaches typically rely on the following principles:
- Replacement of a single coarse-grained lock on the swap partition with multiple finer-grained locks on the swap clusters. When many pieces of data are protected from concurrent accesses by a single, big lock, the concurrent threads that are attempting to read or write data are serialized in a queue while awaiting their turn. In such cases, to improve parallelism, a big lock can be split into many smaller locks that protect independent sub-pieces of data. This approach may yield significant performance improvements, especially when multiple threads access independent pieces of data; however, when more than one thread attempts to access the same piece of data, those attempts will still be serialized in a queue.
- Reduction of the time spent holding a lock (or the time spent in a critical section). When multiple threads attempt to access a critical section that is protected by an exclusive lock held by another thread, they are all paused until the lock is released. The longer the critical section is, the longer the other threads will wait before they can continue. Reducing the time that a given thread spends in the critical section is another useful technique for increasing parallelism and reducing latency.

Kernel developers determined that the increased system level latencies observed while swapping to an Intel Optane DC SSD were caused by a single lock protecting the swap_info_struct data structure. They have applied the principles discussed above in the series of swap improvements that are available in Linux kernel version 4.14 and later. The following techniques have been developed to reduce lock contention on the swap_info_struct lock.

1. Bulk operations and per CPU lock cluster improvements: multiple swap_map entries that represent free space on the swap device have been aggregated into larger units and stored in a swap slot cache. Each swap slot cache is managed by a specific CPU core, which is why it is called a "per cpu swap slot cache". When a SW thread requests new swap space, it first tries to allocate it from the swap slot cache on the given CPU. This operation does not require locking. Because a single swap slot cache contains multiple swap_map entries, it is likely that a swap_map entry will be successfully allocated from it. When allocation from the swap slot cache is not possible, the swap software needs to perform a bulk allocation of multiple swap_map entries from swap_map, and assign those entries to the swap slot cache. Swap_info_lock is acquired when doing bulk operations on the swap_map data structure. Please refer to Figure 4 below for details of the changes.

Figure 4: Swap Bulk Operations Improvements

2. Radix tree split: another source of lock contention that existed in the Linux kernel prior to version 4.14 was the radix tree used for the swap cache. The swap cache is an optimization of swapping behavior that reduces the number of writes to the swap device or swap file, and maintains the mapping between a memory page and its swap map entry when the memory page is swapped in or swapped out. A swap write is considered unnecessary when a page exists in a swap device or swap file as well as in main memory, because both of those locations contain the same data. When Linux considers a page for reclamation, it can simply check whether the page exists both in the swap device or swap file and in main memory, and whether the data in those two locations match. In such a case the page in main memory can simply be marked as invalid and reclaimed. A radix tree data structure is used to check whether a swap entry has a corresponding page stored in main memory. Prior to version 4.14 of Linux, the swap cache radix tree was protected by a single swap cache lock, which reduced parallelism. In version 4.14 the single swap cache radix tree has been split into multiple smaller trees. This modification introduced a separate lock for each smaller radix tree and increased parallelism.

The current design method is best implemented with many swap partitions on the physical swap device. See Appendix A and the automation scripts on GitHub to implement the maximum number of Linux swap partitions, typically 28.
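
Enabling many equal-priority swap partitions amounts to one swapon call per partition. The following is a minimal sketch, not the automation scripts referenced above: the function name, the device path and the priority value are illustrative, and it assumes the partitions already exist, follow the NVMe naming scheme (device name followed by p1, p2, ...) and have been initialized with mkswap. It prints the commands instead of running them so the plan can be reviewed first.

```shell
# Sketch: print one swapon command per swap partition, all at the same priority.
# swapon_plan DEVICE COUNT PRIORITY
swapon_plan() {
    local dev="$1" count="$2" prio="$3" i=1
    while [ "$i" -le "$count" ]; do
        # -p sets the swap priority; equal priorities make the kernel
        # spread page-outs across the partitions in round-robin fashion
        echo "swapon -p ${prio} ${dev}p${i}"
        i=$(( i + 1 ))
    done
}
```

For example, `swapon_plan /dev/nvme1n1 28 10` prints 28 commands, from /dev/nvme1n1p1 through /dev/nvme1n1p28; piping the output to `sh` as root would execute them.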

Appendix D Swap Improvements Patch Lists

This section provides a list of kernel patches pertaining to swap improvements that were introduced in the Linux kernel 4.11 and in 4.14. This list of patches may be useful when considering creating a unique kernel image based on kernel versions older than 4.11, and backporting swap improvements into it.

commit 322b8afe4a65906c133102532e63a278775cc5f0
Author: Huang Ying
Date: Wed May 3 14:52:49 2017 -0700
    mm, swap: Fix a race in free_swap_and_cache()

commit 0ccfece6ed507738c0e7e4414c3688b78d4e3756
Author: Huang Ying
Date: Wed May 3 14:56:16 2017 -0700
    mm/swapfile.c: fix swap space leak in error path of swap_free_entries()

commit ba81f83842549871cbd7226fc11530dc464500bb
Author: Huang Ying
Date: Wed Feb 22 15:45:46 2017 -0800
    mm/swap: skip readahead only when swap slot cache is enabled

commit 039939a65059852242c823ece685579370bc574f
Author: Tim Chen
Date: Wed Feb 22 15:45:43 2017 -0800
    mm/swap: enable swap slots cache usage

commit 67afa38e012e9581b9b42f2a41dfc56b1280794d
Author: Tim Chen
Date: Wed Feb 22 15:45:39 2017 -0800
    mm/swap: add cache for swap slots allocation

commit 7c00bafee87c7bac7ed9eced7c161f8e5332cb4e
Author: Tim Chen
Date: Wed Feb 22 15:45:36 2017 -0800
    mm/swap: free swap slots in batch

commit 36005bae205da3eef0016a5c96a34f10a68afa1e
Author: Tim Chen
Date: Wed Feb 22 15:45:33 2017 -0800
    mm/swap: allocate swap slots in batches

commit e8c26ab60598558ec3a626e7925b06e7417d7710
Author: Tim Chen
Date: Wed Feb 22 15:45:29 2017 -0800
    mm/swap: skip readahead for unreferenced swap slots

commit 4b3ef9daa4fc0bba742a79faecb17fdaaead083b
Author: Huang, Ying
Date: Wed Feb 22 15:45:26 2017 -0800
    mm/swap: split swap cache into 64MB trunks

commit 235b62176712b970c815923e36b9a9cc05d4d901
Author: Huang, Ying
Date: Wed Feb 22 15:45:22 2017 -0800
    mm/swap: add cluster lock

commit 6a991fc72d1243b8da0c644d3147d3ec41a0b281
Author: Huang, Ying
Date: Wed Feb 22 15:45:19 2017 -0800
    mm/swap: fix kernel message in swap_info_get()

commit f6498b3f33123a6ee1c81a1b29b9c07964cb95c1
Author: Huang Ying
Date: Fri Oct 8 16:59:30 2016 -0700
    mm: don't use radix tree writeback tags for pages in swap cache

D.1 References

See the following links for important reference information.
Most of the original patches: https://kernelnewbies.org/Linux_4.11#Memory_management
Second step swap optimization notes: https://kernelnewbies.org/Linux_4.14#Memory_management
Whitepaper on PMBench (2018): https://www.semanticscholar.org/paper/Pmbench%3A-A-Micro-Benchmark-for-Profiling-Paging-on-Yang-Seymour/dd0adcde7d074a414a9df76fb20d52a0d8aa8c71#paper-header
Whitepaper with deeper analysis of persistent memory's applicability to memory page access performance: https://web.cs.unlv.edu/jisooy/paper/yang_pmbench.pdf
