Document Number: 340608-001US

Utilizing Linux Swap with Intel Optane DC SSDs as a Memory Overcommit Technique

Solutions Blueprint
June 2019
Version 1

Team Contacts:

Andrzej Jakowski, andrzej.jakowski@intel.com, Kernel Development
Tim C. Chen, tim.c.chen@intel.com, Kernel Development
Ying Huang, ying.huang@intel.com, Kernel Development
Frank Ober, frank.ober@intel.com, Testing and Outreach
David J. Leone, david.j.leone@intel.com, Testing and Outreach
Andrew Ruffin, andrew.ruffin@intel.com, Market Analysis and Outreach
Pragathi Narendra, pragathi.narendra@intel.com, Performance Test and Test Development
Mariusz Barczak, mariusz.barczak@intel.com, Kernel Development
Gert Pauwels, gert.pauwels@intel.com, Field Technical Support EMEA Region
Steven Briscoe, steven.briscoe@intel.com, Field Technical Support EMEA Region
Farib Khondoker, farib.khondoker@intel.com, Testing and Support

Revision History

Revision Number    Description         Revision Date
001                Initial release.    June 2019

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel, the Intel logo, Optane, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.
Intel Corporation

Contents

Introduction
    Scope
Memory Overcommit Use Cases
    Example Server Cost Model
The Kernel Build Process
    Development Tools Required for menuconfig (Possible Pre-requisites)
Appendix A Automation Scripts and How-to Guide
Appendix B Memory Management Fundamentals
    B.1 Memory Management System Overview
Appendix C Linux Kernel Innovations to Leverage Fast SSDs as Memory Extension
    C.1 Swap Improvements Completed in v4.14 of Linux Kernel
Appendix D Swap Improvements Patch Lists
    D.1 References

Introduction

This solutions blueprint explains how to use Intel Optane DC SSDs in memory extension configurations, or as memory replacement.
We'll describe recent performance improvements that were first introduced in version 4.11 and completed in version 4.14 of the Linux* kernel. For simplicity, we will refer to version 4.14 or newer as the kernel version needed to evaluate high performance swap usage. Very high endurance and low latency devices like Intel Optane DC SSDs can be efficiently used as swap devices, thereby enabling the system to exceed its minimum required system level performance in various memory overcommit use cases. Intel Optane SSDs used as swap devices are expected to have a long lifespan of five or more years in this usage.

For those who intend to immediately implement and test the use cases outlined in this document, please jump to the Appendix sections, and visit the following GitHub link for tools, instructions, and test code.

http://github.com/fxober/LinuxSwap

Scope

We will focus on how the Linux operating system (OS) can utilize Intel Optane DC SSDs as swap devices, thereby allowing storage device capacity to be used in conjunction with DRAM to store memory pages on both DRAM and non-volatile memory type media. The process of moving memory pages between the storage device and main memory is called paging. Paging allows system administrators to perform efficient management of system resources (memory, CPU, storage) at desired cost and service levels. With recent advancements in storage media and Linux kernel improvements, Intel Optane DC SSDs provide a new opportunity to offset DRAM costs and allow for more flexible process memory oversubscription, at higher performance levels than before. This solutions blueprint will explore those usages.
Target Audience

This document is intended for system administrators, system operators, DevOps teams, and application developers who want to configure their underlying software and hardware resources to maximize system performance at a better cost. It assumes familiarity with basic computer architecture terminology and with the techniques an OS uses to manage physical resources such as CPU, memory and storage. It also explains fundamental concepts of the memory management techniques utilized in modern OSs, focusing on the Linux environment. The improved implementation of Linux Swap* and higher endurance memory media, such as Intel Optane memory, are essentially what enable such a solution to be effective in a modern data center environment.

Document Organization

First, this document introduces use cases in which the Intel Optane DC SSD is used as memory augmentation. Later, a server cost model is presented, which can be adopted or adjusted to calculate potential cost savings when leveraging an Intel Optane DC SSD as DRAM replacement. Next, we describe the OS upgrades necessary to maximize system performance when using an Intel Optane DC SSD as a swap device. Specifically, we provide guidance on the minimum required versions of common Linux distributions that include the swap and memory management subsystem improvements, along with details on building the Linux kernel manually to maximize swap performance. The Additional Considerations for Software Configuration section explores system configuration details for maximizing swap performance. Then we compare the performance of the different swap devices. The Appendix sections explain the details of the memory management subsystem and of the Linux kernel innovations that improve swap performance. Finally, a kernel patch list is provided for advanced users willing to backport the changes into their own kernel fork.
Glossary

Term: Physical memory
Definition: Fast memory, byte addressable (as opposed to disk storage, which is sector or block addressable). This fast, dynamic system memory is typically provided by DRAM technology.

Term: Swap device
Definition: Dedicated space on a storage device for storing memory pages of process data or process code. It can be a whole block storage device, one of its partitions, or a file in a file system (swap file).

Term: Virtual memory
Definition: Memory management technique implemented in modern OSs. It provides an illusion to the running process that it operates on a contiguous block of memory, while in reality hardware and the OS manage translations from virtual addresses to physical addresses, and transfers of memory pages from the storage device to physical memory. OS virtual memory hides those complexities from the application programmer.

Term: Total Cost of Ownership (TCO)
Definition: A defined, but often not standardized, approach to analyzing the financial impact of a purchase, and perhaps the ongoing expenses of hardware and software infrastructure over its life cycle. TCO models typically include various factors impacting cost, e.g. the cost to purchase HW (capital spending) and the operational cost related to the electricity used to power and cool a building and Data Center equipment. This paper focuses on a simplified server cost model. You can consider it Bill of Material optimization, since the target is not a full analysis of all server operation or acquisition costs.
Memory Overcommit Use Cases

This chapter introduces example use cases in which an Intel Optane DC SSD can be used as memory extension, or as memory replacement, by using the Linux swap mechanism. This chapter also provides an example server cost model that has been developed to illustrate potential cost savings when considering the purchase of new HW infrastructure. Use this server cost model as a framework to calculate potential cost savings at the server capital expenditure level.

Memory Overcommit for Virtualized Environments

One common technique widely used among cloud service providers (CSPs) is to overcommit physical resources, including physical CPU, storage, and memory. The following figure illustrates virtual machine differentiation based on the overcommit ratio, that is, how much of the guest physical memory is actually backed by physical DRAM. For example, the guest physical memory of "Gold" VMs is fully backed by DRAM, while for "Silver" VMs half of the guest physical memory is backed by DRAM and the remaining portion is backed by the swap device. Finally, for "Bronze" VMs, a quarter of the guest physical memory is backed by DRAM, and the remaining portion can be paged out to the swap device. With a Linux based hypervisor (KVM), this type of differentiation can be achieved using the mechanism called control groups (cgroup), which controls the resource usage (e.g. system memory) of a group of processes, in this case a class of VMs.
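As a rough illustration of the three tiers described above, the following shell sketch computes how many bytes of DRAM would back a VM of a given size in each class. The function name `dram_backing_bytes` and the lowercase tier arguments are hypothetical names chosen for this example; only the 1, 1/2 and 1/4 ratios come from the text.

```shell
#!/bin/sh
# Hypothetical helper: bytes of DRAM backing a VM of a given tier.
# Ratios follow the Gold/Silver/Bronze description (1, 1/2, 1/4).
# Usage: dram_backing_bytes <gold|silver|bronze> <guest_memory_GiB>
dram_backing_bytes() {
    tier=$1
    guest_gib=$2
    case "$tier" in
        gold)   divisor=1 ;;
        silver) divisor=2 ;;
        bronze) divisor=4 ;;
        *) echo "unknown tier: $tier" >&2; return 1 ;;
    esac
    # 1 GiB = 1073741824 bytes
    echo $(( guest_gib * 1073741824 / divisor ))
}

dram_backing_bytes silver 16   # half of 16 GiB, prints 8589934592
```

A value computed this way could then be written as the memory limit of the control group that holds the VM process (for example `memory.max` under cgroup v2 or `memory.limit_in_bytes` under cgroup v1). Which file applies depends on how the host is configured, so treat that step as an assumption to verify on your system.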
Figure 1: Example of Virtual Machine Differentiation Based on Memory Overcommit Ratio

Example Server Cost Model

This chapter focuses on deriving an example server cost model, from a system memory hardware costs perspective, for two example configurations of servers: server "A" and server "B." The server cost model does not take into account the varied and unique operational expenses or other capital expenditures related to the larger scope of running a data center. For simplicity of our comparison, differences in space, power, operating costs, and other variable factors are ignored.

Server "A" and server "B" configurations are almost identical with regard to CPU, networking, and storage (both boot disks and data volumes). There are only 2 differences between them:

- Server "A" total physical DRAM is 384 GiB (24 x 16 GB RDIMMs), while server "B" is populated with only 192 GiB (12 x 16 GB RDIMMs) of physical DRAM.
- Server "A" does not use an Intel Optane DC SSD as a swap device, whereas server "B" uses Intel Optane DC SSDs (2 x 100 GiB devices) as swap devices.

One of the data points most interesting to a system administrator is the relative cost of server "B" to server "A", which illustrates the potential hardware component cost savings on the purchase or lease of new servers for the data center. Additional server cost calculations focus on the relative costs of the server "B" configuration compared to server "A". For simplicity, this costing model takes into account only the memory components (DRAM + Intel Optane DC SSD capacities), because all other components of those server configurations are identical.

In the formulas below, $DRAM_A$ and $DRAM_B$ are the DRAM capacities of servers "A" and "B", $Optane_B$ is the Intel Optane DC SSD capacity of server "B", and $cost_{DRAM}$ and $cost_{Optane}$ are the per-GiB prices of DRAM and Optane capacity.

Relative cost comparison of server "B" configuration to server "A" configuration can be defined as follows:

\[
relative\_cost = \frac{cost_B}{cost_A} = \frac{DRAM_B \cdot cost_{DRAM} + Optane_B \cdot cost_{Optane}}{DRAM_A \cdot cost_{DRAM}}
\]

Now simply dividing the numerator and denominator of the above equation by $cost_{Optane}$ leads to the following formula:

\[
relative\_cost = \frac{DRAM_B \cdot \frac{cost_{DRAM}}{cost_{Optane}} + Optane_B}{DRAM_A \cdot \frac{cost_{DRAM}}{cost_{Optane}}}
\]

Substitution of $\frac{cost_{DRAM}}{cost_{Optane}}$ with the normalized per-GiB DRAM to Optane price ratio ($DRAM\_to\_Optane$) will lead to this final formula:

\[
relative\_cost = \frac{DRAM_B \cdot DRAM\_to\_Optane + Optane_B}{DRAM_A \cdot DRAM\_to\_Optane}
\]

Note: Please do your own price calculations using the formula above to calculate your server cost savings.
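To make the relative cost calculation concrete, the sketch below evaluates (DRAM_B x DRAM_to_Optane + Optane_B) / (DRAM_A x DRAM_to_Optane) for the capacities of servers "A" and "B" given above. The function name `relative_cost` is hypothetical, and the price ratio of 4 is an assumed value chosen purely for illustration; substitute your own quoted prices.

```shell
#!/bin/sh
# Relative memory cost of server "B" versus server "A".
# Usage: relative_cost <DRAM_A_GiB> <DRAM_B_GiB> <Optane_B_GiB> <DRAM_to_Optane>
relative_cost() {
    LC_ALL=C awk -v da="$1" -v db="$2" -v ob="$3" -v r="$4" \
        'BEGIN { printf "%.3f\n", (db * r + ob) / (da * r) }'
}

# Capacities from the example servers; the ratio 4 is an assumption.
relative_cost 384 192 200 4   # prints 0.630
```

Under that assumed ratio, the memory components of server "B" would cost roughly 63% of those of server "A". With a ratio of 1 (equal per-GiB prices) the result exceeds 1, meaning the model only shows savings when DRAM is meaningfully more expensive per GiB than the Optane capacity.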
The Kernel Build Process

Recommended Software Upgrades

In order to maximize Intel Optane DC SSD performance in a memory extension configuration (as a swap device), Intel recommends upgrading your Linux distribution to a recent version containing the backported series of patches that were added to the upstream Linux kernel in versions 4.11 and later. The following table contains information on the common Linux distribution versions that adopted performance improvements pertaining to swap performance.

Table 1: Linux Distribution Containing Swap Performance Improvements

Linux Distribution    OS Version
RHEL/CentOS           Starting version 7.5 and forward
                      Starting version 8.0 and forward
Ubuntu                Starting version 18.10 and forward
SLES                  Starting version SLES 15, SLES 12 SP4 and forward
Oracle* Linux         Starting version Oracle Linux 7.5 and later with UEK R5 and RHCK

How to Build your Kernel Based on Upstream Linux Kernel

This section provides instructions on building a Linux kernel image based on the upstream Linux kernel project.
This may be especially useful for those interested in further exploration of Linux kernel improvements relating to swap device performance, and who are willing to upgrade their infrastructure's Linux kernel. Please note that these instructions are based on an Ubuntu* server 18.04.2 system build. The exact steps may differ between Linux distributions, e.g. in the usage of the distribution package manager.

Approximate time needed: 1 hour

Development Tools Required for menuconfig (Possible Pre-requisites)

In order to clone, compile, and build a new kernel/driver, the following packages must be installed. You must be logged in as root to install these packages.

    ## Dependencies needed to run kernel menuconfig
    # apt-get install flex bison
    # apt-get install libncurses5-dev libncursesw5-dev

    ## Dependencies needed to perform kernel build
    # apt-get install libssl-dev libelf-dev
    # dpkg -i linux-*.deb

Build New Linux Kernel with RCU Setting for Swap

Download Linux kernel 4.14 or 5.x or newer from this repository: https://www.kernel.org/pub/linux/kernel/ into your Linux distribution. It is best to choose the latest stable kernel.
From a working directory:

    ## Use wget to download the kernel and unpack it (here the example is 4.18.20)
    # wget https://mirrors.edge.kernel.org/pub/linux/kernel/v4.x/linux-4.18.20.tar.xz
    # tar -xvf linux-4.18.20.tar.xz
    # cd linux-4.18.20

    ## Alternatively clone whole Linux kernel git repository and check out specific branch
    # git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
    # cd linux
    # git checkout -b v4.18.20_local v4.18.20

Build and install

To create the kernel configuration file (.config) based on the running kernel, and use the default setting for all new options, run the following command:

    # yes "" | make oldconfig

To obtain maximum performance, avoid read-copy-update (RCU) callback processing on the CPUs handling swap I/O, as this may introduce delays. To do so, set "CONFIG_RCU_NOCB_CPU=y" in your local kernel .config file. See Offloading RCU Processing to Dedicated Kernel Threads for details on editing RCU settings. Alternatively, you can make changes by running menuconfig to select that option using the user interface as shown in the image below.

    # make menuconfig

Under "General Setup and Features > RCU Subsystem" set the "Offload RCU callback…" flag as shown in the image below:

Save and Exit menuconfig.

Build the kernel and kernel modules, and install the new kernel on the system.

    ## To build kernel image and loadable kernel modules invoke
    # make
    # make modules_install

    ## Install newly built kernel into operating system
    # make install

After a successful install, reboot the system to load the new kernel image and kernel modules. Usually the new kernel becomes the default boot selection. After booting the OS, use "uname -a" to verify that the running kernel version matches the newly installed kernel version. If a different kernel version is loaded, you can change this by reconfiguring the system loader, usually grub2. Refer to the system loader documentation for your specific distribution.
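The post-install check described above can be scripted. The following is a minimal sketch, assuming GNU `sort` (for the `-V` version ordering) is available; the function names `kernel_at_least` and `config_has_rcu_nocb` are hypothetical helpers introduced for this example.

```shell
#!/bin/sh
# Succeeds when kernel version $2 is at least $1 (relies on GNU sort -V).
kernel_at_least() {
    min=$1
    ver=$2
    [ "$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)" = "$min" ]
}

# Succeeds when the given kernel config file enables RCU callback offloading.
config_has_rcu_nocb() {
    grep -q '^CONFIG_RCU_NOCB_CPU=y' "$1"
}

if kernel_at_least 4.14 "$(uname -r)"; then
    echo "running kernel $(uname -r) is 4.14 or newer"
else
    echo "running kernel $(uname -r) is older than 4.14"
fi
```

`config_has_rcu_nocb` can be pointed at the `.config` file in your kernel source tree. Many distributions also install a copy of the running kernel's configuration under `/boot`, but that location is an assumption to confirm for your distribution.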
Additional Considerations for OS Configuration

This section explores OS configuration considerations for maximizing performance of the swap device(s).

Offloading RCU Processing to Dedicated Kernel Threads

To offload RCU processing to dedicated kernel threads, edit the kernel command line option in the system loader. When using Grub2 as the system loader, open the /etc/default/grub file and add "rcu_nocbs=" to the GRUB_CMDLINE_LINUX_DEFAULT entry. See the /etc/default/grub file listing below for an example:

    ...
    GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
    GRUB_CMDLINE_LINUX_DEFAULT="rcu_nocbs=0-n maybe-ubiquity"
    GRUB_CMDLINE_LINUX=""
    ...

Note: n is the number of cpus (or hw threads) in your system.

After saving edits, run either the "update-grub" or "grub2-mkconfig" command to update your grub2 settings in the boot partition. Reboot the system and verify that the new settings have been applied to the kernel.

    # dmesg | grep -i offload
    [ 0.000000] Offload RCU callbacks from CPUs: 0-63.

The reason for this step is to avoid RCU processing in an IO completion path, as RCU processing will likely increase paging latency.
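The `rcu_nocbs=` value can be derived from the CPU count rather than typed by hand. The sketch below is one way to do that; `rcu_nocbs_arg` is a hypothetical helper name. Because CPU numbering is zero-based, it uses the CPU count minus one as the upper bound, which is consistent with the 0-63 range in the sample dmesg output (presumably from a 64-CPU system).

```shell
#!/bin/sh
# Print an rcu_nocbs= kernel argument covering CPUs 0..(count-1).
# CPU numbering is zero-based, so the last index is the CPU count minus one.
rcu_nocbs_arg() {
    cpu_count=$1
    echo "rcu_nocbs=0-$(( cpu_count - 1 ))"
}

rcu_nocbs_arg 64                 # prints rcu_nocbs=0-63
rcu_nocbs_arg "$(nproc --all)"   # value for the machine running the script
```

The printed string would then be placed inside GRUB_CMDLINE_LINUX_DEFAULT alongside any options already present there.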
Turning Off Transparent Hugepages

To minimize the overhead of coalescing memory pages into huge pages and later breaking them up on the swap device, perform the following commands:

    # echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
    # echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag

Watermark Scale Factor

It is important to increase the watermark scale factor in /proc/sys/vm, as this is the level at which available memory is checked by kswapd. We recommend setting it to 400, which corresponds to 4% of available memory (the value is expressed in units of 1/10,000). Doing so will set kswapd to automatically kick off swapping at 4% of available system memory.

    # echo '400' > /proc/sys/vm/watermark_scale_factor

NUMA Considerations

When dealing with multiple swap devices on a multi-socket system, we recommend distributing swap devices evenly among the CPU sockets to avoid QPI/UPI transfers. Moreover, to avoid software overhead, we recommend creating many swap devices on a partitioned NVMe device. Each swap partition must have the same priority. In most cases up to 28 swap partitions can be used, depending on the kernel configuration. When setting up your system, we recommend adhering to the NUMA locality rules for maximum performance.
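The transparent hugepage and watermark settings described earlier in this section can be checked with a short script. This is a minimal sketch; `watermark_factor_for_percent` and `active_choice` are hypothetical helper names, and the script only reads the sysfs file, it does not change any setting.

```shell
#!/bin/sh
# watermark_scale_factor is expressed in units of 1/10000 of memory,
# so a whole percentage maps to percent * 100.
watermark_factor_for_percent() {
    echo $(( $1 * 100 ))
}

# Extract the active (bracketed) choice from a sysfs selector line such as
# "always madvise [never]".
active_choice() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

watermark_factor_for_percent 4            # prints 400
active_choice "always madvise [never]"    # prints never

thp=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp" ]; then
    echo "THP mode: $(active_choice "$(cat "$thp")")"
fi
```

Values written with `echo` to /proc/sys or /sys do not survive a reboot, so a persistent setup would also need the equivalent entry in the distribution's boot-time configuration.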
Performance Data of 4.18.20 Linux Swap

We used the pmbench utility to test the allocation and access of 4 KiB memory pages on a Linux system. Our test system utilized an Ubuntu 18.04.2 distribution of Linux, which we initially upgraded to the 4.18.20 version of the kernel, as the Ubuntu release comes with a 4.15.x kernel version. We upgraded using the methods noted in Appendix A - Automation Scripts and How-to Guide. There should be no issue running kernel 4.14 or newer, as the kernel patches to Linux swap are upstreamed (public on kernel.org) in 4.14. You cannot gain this level of performance on kernels prior to 4.14. We tested the in-box kernel of Ubuntu 18.04.2 (kernel 4.15.0-46-generic) and saw minimal difference (

Here is an example variable setting from /etc/default/grub, CPU count specific:

    GRUB_CMDLINE_LINUX_DEFAULT="rcu_nocbs=0-[n] maybe-ubiquity"

Where [n] is the number of total CPU cores or virtual CPU threads in your system. Configure the kernel with these .config settings if you are able to compile your own kernel.

4. EXPERIMENTAL: Generally speaking, it is best to set the NVMe scheduler to [none] on the NVMe SSDs you are testing, rather than the mq block or kyber scheduler. In most cases your build shows [none], which is fine.

    # more /sys/block/nvme1n1/queue/scheduler
    [none]

5. Newer kernels allow an NVMe queue size of 1,023, which is sufficient and recommended.

6. If you are seeing NVMe block merges, change your NVMe block size to 4 KiB (not 512 B) sectors. If block merges are still occurring after making this change, try the following. First, check the nomerges value:

    # cat /sys/block/queue/nomerges

The nomerges value should be set to 2. Verify and change if necessary:

    echo 2 > /sys/block/queue/nomerges

Appendix B Memory Management Fundamentals

This chapter introduces the basic memory management concepts used in the Linux kernel. It explains system level bottlenecks observed when Intel Optane DC SSDs are used as swap devices with Linux versions prior to v4.14 of the upstream Linux kernel. Finally, it explains techniques to overcome those bottlenecks in version 4.14, so users can experience improved performance and utilize Intel Optane DC SSDs as swap devices.
B.1 Memory Management System Overview

Modern operating systems implement a virtual memory model which provides many advantages to application developers. The virtual memory model simplifies software development because it leaves the complexity of physical memory allocation and data placement to the underlying operating system. The operating system kernel deals with that complexity by giving any running process the impression that it has a big chunk of memory available (usually 4 GiB) for its exclusive use. In reality, the OS kernel maps process virtual memory to physical DRAM, and potentially overflows to a swap device, which extends the available physical memory. The process of transferring data between the swap device and physical memory is called paging. It consists of page-ins, when data is read from the swap device into physical memory, and page-outs, when data is moved out of memory. It should be noted that page-outs may require data to be written out to the swap device, depending on the state of the page. Figure 2 below provides a conceptual diagram of virtual memory and paging.

Figure 2: Virtual Memory Concept through Paging

The paging process is managed by the OS and is heavily supported by CPU hardware through the memory management unit (MMU). For example, the MMU contains a translation lookaside buffer (TLB) cache which holds recent virtual-to-physical memory translations. This enables a significant reduction in the time needed to access data in memory. Another CPU feature that assists the OS with memory management is a mechanism called page fault. A page fault is an exception raised by CPU hardware when a process tries to access a virtual memory location that is not mapped to a physical address. There are different types of page faults:

- Minor: raised when a page exists in main memory but there is no entry indicating the virtual-to-physical address mapping. The page fault handler, implemented in the OS, creates a new mapping entry.
- Major: raised when a page does not exist in main memory. The page fault handler needs to bring the required data from the swap device into memory and create the corresponding mapping entry. For example, this happens in a freshly loaded process, because the OS kernel delays loading the whole program into memory. This technique, called on-demand paging, accelerates process startup.

A major page fault is a performance draining procedure. It requires the OS page fault handler to find an available location in physical memory, which can potentially involve paging-out, and to load the content of the program from the swap device into memory, before the process can continue its execution.
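On Linux, the minor and major fault counters of a process are exposed in /proc/<pid>/stat (the minflt and majflt fields, numbers 10 and 12 in that file). The sketch below extracts both from one stat line; `fault_counts` is a hypothetical helper name introduced for this example.

```shell
#!/bin/sh
# Print "<minor> <major>" fault counts from one line of /proc/<pid>/stat.
# The command name (field 2) may contain spaces, so everything up to the
# last ") " is dropped first; minflt and majflt are then fields 8 and 10.
fault_counts() {
    printf '%s\n' "$1" | sed 's/^.*) //' | awk '{ print $8, $10 }'
}

# Report the counters of the shell running this script, if /proc is available.
if [ -r "/proc/$$/stat" ]; then
    fault_counts "$(cat "/proc/$$/stat")"
fi
```

A steadily growing major fault count for a process under memory pressure is one sign that it is being served from the swap device rather than from DRAM.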
There are two different types of pages:

- File system pages, or pages backed by files. These are memory pages that contain file data; for example, database files directly mapped into the process address space, or library files containing executable program code. These pages can be paged-in to physical memory; for example, when the program starts executing instructions stored on the disk (i.e. program usage of a shared library). The Linux page cache is a cache of these file-backed pages, both resident pages to be read and changed (dirty) pages that need to be synchronized to a storage device. Direct access IO routines, which do not use the page cache, are also available on Linux. Since the page cache is an opportunistic and general usage cache, it is not appropriate for all usages.
- Anonymous pages. These are memory pages that contain private process information, that is heap or stack, and have no device or file system backing them.

When the system is running into low memory conditions (high memory pressure), anonymous pages can be paged-out (swapped out) to the swap file or swap device by the OS process kswapd and its related kernel threads. This process can be more or less aggressive based on the configuration of the swappiness parameter, as this parameter sets the target of when swapping should become more active. The parameter can be set from 0 to 200; the higher the value, the more swap is utilized over page cache memory reclamation. In our performance study the OS is left at its default value of 60, which is the typical recommended setting for production. A value of 100 means that the OS will reclaim memory pages using page cache and swap equally. You can print out the proc variable /proc/sys/vm/swappiness to view its current value.

Another important parameter, used to control when kswapd kernel threads are activated, is watermark_scale_factor. The user can set a lower limit of available memory that specifies when kswapd activity will be started. More details are available in the Watermark Scale Factor section.
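The swappiness behavior described above can be summarized in a few lines of shell. This is only an illustrative sketch of the explanation in this section (100 as the balance point, 0 to 200 as the valid range); `describe_swappiness` is a hypothetical helper, not a kernel interface.

```shell
#!/bin/sh
# Describe a swappiness value following the explanation above:
# 100 balances page cache and swap reclaim equally, lower values lean
# toward page cache reclaim, higher values lean toward swap.
describe_swappiness() {
    value=$1
    if [ "$value" -lt 0 ] || [ "$value" -gt 200 ]; then
        echo "out of range"
    elif [ "$value" -lt 100 ]; then
        echo "favors page cache reclaim"
    elif [ "$value" -eq 100 ]; then
        echo "balanced"
    else
        echo "favors swap"
    fi
}

describe_swappiness 60    # prints: favors page cache reclaim

# Describe the value currently configured on this system, if readable.
if [ -r /proc/sys/vm/swappiness ]; then
    describe_swappiness "$(cat /proc/sys/vm/swappiness)"
fi
```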
Appendix C Linux Kernel Innovations to Leverage Fast SSDs as Memory Extension

Until recently the Linux kernel had been primarily optimized for rotational disks because they were the predominant storage devices. One of the techniques used to maximize swap performance for rotational hard disk drives (HDDs) was to keep swap data in contiguous locations on the disk to minimize disk seek time. The performance yields of this technique were fine for HDDs but inadequate for solid state drives (SSDs).

With recent advancements in non-volatile memory (NVM) technologies like Intel Optane technology, new techniques and methods are needed to take advantage of the increased performance of the media and devices. While testing Linux swap against these new devices, many system-level bottlenecks were discovered in Linux swap. Kernel developers have addressed some of the performance bottlenecks in the release of Linux kernel 4.14. In this section we explore some of those enhancements.

A swap device in the Linux kernel is represented by a dedicated data structure (swap_info_struct) that contains information on how memory pages are stored on the swap device; see Figure 3 below. This information is stored in an array called swap_map, which is part of swap_info_struct. Swap_map stores the usage count for each page stored on the swap device. Swap_map entries are aggregated into clusters, and these clusters effectively assign specific portions of the swap device to a specific CPU core. With the improvements described in this appendix, updates to the usage count of individual swap_map entries require per cluster locks to be taken instead of holding a single lock protecting the whole swap_map.

Figure 3: Primary Swap Device Data Structures

In kernels prior to version 4.14, even though there are dedicated swap entries per CPU cluster, accesses to the swap_map are protected by a single lock, which is a scalability and performance limiter when concurrent accesses to the swap device are made. The negative impact of this single lock is especially visible in high memory pressure conditions. When the single lock is used to protect critical information in the swap_info_struct data structure, latencies for handling page faults from the swap device are significantly increased. This heavily impacts end user performance and renders the latest HW latency improvements ineffective due to system level bottlenecks.

The next section explains techniques to minimize lock contention on the single lock that protects the swap_info_struct data structure, and to improve system level latencies. As previously discussed in the Performance Data section, access latencies on swap average below 20 microseconds when utilizing a higher performance drive.
C.1 Swap Improvements Completed in v4.14 of Linux Kernel

There are many software techniques to address performance problems related to lock contention. These approaches typically rely on the following principles:

- Replacement of a single coarse-grained lock on the swap partition with multiple finer-grained locks on the swap cluster. When many pieces of data are protected from concurrent accesses by a single, big lock, the concurrent threads that are attempting to read or write data are serialized in a queue while awaiting their turn. In such cases, to improve parallelism, a big lock can be split into many smaller locks that protect independent sub-pieces of data. This approach may yield significant performance improvements, especially when multiple threads access independent pieces of data. However, when more than one thread attempts to access the same piece of data, those attempts will still be serialized in a queue.
- Reduction of the time spent holding a lock (or time spent in a critical section). When multiple threads attempt to access a critical section that is protected by an exclusive lock held by another thread, they are all paused until the lock is released. The longer the critical section is, the longer the other threads will wait before they can continue. Reducing the time that a given thread spends in the critical section is another useful technique for increasing parallelism and reducing latency.

Kernel developers determined that the increased system level latencies observed while swapping to an Intel Optane DC SSD were caused by a single lock protecting the swap_info_struct data structure. They have applied the principles discussed above in the series of swap improvements that are available in Linux kernel version 4.14 and later. The following techniques have been developed to reduce lock contention on the swap_info_struct lock.
1. Bulk operations and per CPU lock cluster improvements. Multiple swap_map entries that represent free space on the swap device have been aggregated in larger units and stored in a swap slot cache. Each swap slot cache is managed by a specific CPU core, which is why it is called the "per cpu swap slot cache". When a SW thread requests new swap space, it first tries to allocate it from the swap slot cache on the given CPU. This operation does not require locking. Because a single swap slot cache contains multiple swap_map entries, it is likely that a swap_map entry will be successfully allocated from it. When allocation from the swap slot cache is not possible, the swap software needs to perform a bulk allocation of multiple swap_map entries from swap_map, and assign those entries to the swap slot cache. Swap_info_lock is acquired when doing bulk operations on the swap_map data structure. Please refer to Figure 4 below for details of the changes.

Figure 4: Swap Bulk Operations Improvements

2. Radix tree split. Another source of lock contention that existed in the Linux kernel prior to version 4.14 was the radix tree used for the swap cache. The swap cache is an optimization in swapping behavior that reduces the number of writes to the swap device or swap file, and maintains the mapping between a memory page and its swap map entry when the memory page is swapped in or swapped out. A swap write is considered unnecessary when a page exists in a swap device or swap file as well as in main memory, because both of those locations contain the same data. When Linux considers a page for reclamation, it can simply check whether the page exists both in the swap device or swap file and in main memory, and whether the data in those two locations match. In such a case the page in main memory can simply be marked as invalid and reclaimed. A radix tree data structure is used to check whether a swap entry has a corresponding page stored in main memory. Prior to version 4.14 of Linux, the swap cache radix tree was protected by a single swap cache lock, which reduced parallelism. In version 4.14 the single swap cache radix tree has been split into multiple smaller trees. This modification introduced a separate lock for each smaller radix tree and increased parallelism.

The current design is best used with many swap partitions on the physical swap device. See Appendix A and the automation scripts on GitHub to implement the maximum number of Linux swap partitions, typically 28.
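As a rough sketch of what enabling many equal-priority swap partitions could look like, the function below prints one `swapon -p` command per partition. It only prints the commands and does not run them. The helper name `swapon_commands`, the device name /dev/nvme1n1 and the priority value 10 are assumptions made for illustration, and each partition would need to be initialized with `mkswap` beforehand. The automation scripts referenced above remain the authoritative procedure.

```shell
#!/bin/sh
# Print (not run) one swapon command per partition, all with the same priority.
# Usage: swapon_commands <device> <partition_count> <priority>
swapon_commands() {
    dev=$1
    count=$2
    prio=$3
    i=1
    while [ "$i" -le "$count" ]; do
        echo "swapon -p $prio ${dev}p${i}"
        i=$(( i + 1 ))
    done
}

# Example: three partitions on an assumed NVMe device, priority 10.
swapon_commands /dev/nvme1n1 3 10
```

Giving every partition the same priority matters because the kernel then allocates swap pages across them in round-robin fashion instead of filling one partition before moving to the next.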
Appendix D Swap Improvements Patch Lists

This section provides a list of kernel patches pertaining to swap improvements that were introduced in the Linux kernel 4.11 and in 4.14. This list of patches may be useful when considering creating a unique kernel image based on kernel versions older than 4.11, and backporting swap improvements into it.

commit 322b8afe4a65906c133102532e63a278775cc5f0
Author: Huang Ying
Date: Wed May 3 14:52:49 2017 -0700
    mm, swap: Fix a race in free_swap_and_cache()

commit 0ccfece6ed507738c0e7e4414c3688b78d4e3756
Author: Huang Ying
Date: Wed May 3 14:56:16 2017 -0700
    mm/swapfile.c: fix swap space leak in error path of swap_free_entries()

commit ba81f83842549871cbd7226fc11530dc464500bb
Author: Huang Ying
Date: Wed Feb 22 15:45:46 2017 -0800
    mm/swap: skip readahead only when swap slot cache is enabled

commit 039939a65059852242c823ece685579370bc574f
Author: Tim Chen
Date: Wed Feb 22 15:45:43 2017 -0800
    mm/swap: enable swap slots cache usage

commit 67afa38e012e9581b9b42f2a41dfc56b1280794d
Author: Tim Chen
Date: Wed Feb 22 15:45:39 2017 -0800
    mm/swap: add cache for swap slots allocation

commit 7c00bafee87c7bac7ed9eced7c161f8e5332cb4e
Author: Tim Chen
Date: Wed Feb 22 15:45:36 2017 -0800
    mm/swap: free swap slots in batch

commit 36005bae205da3eef0016a5c96a34f10a68afa1e
Author: Tim Chen
Date: Wed Feb 22 15:45:33 2017 -0800
    mm/swap: allocate swap slots in batches

commit e8c26ab60598558ec3a626e7925b06e7417d7710
Author: Tim Chen
Date: Wed Feb 22 15:45:29 2017 -0800
    mm/swap: skip readahead for unreferenced swap slots

commit 4b3ef9daa4fc0bba742a79faecb17fdaaead083b
Author: Huang, Ying
Date: Wed Feb 22 15:45:26 2017 -0800
    mm/swap: split swap cache into 64MB trunks

commit 235b62176712b970c815923e36b9a9cc05d4d901
Author: Huang, Ying
Date: Wed Feb 22 15:45:22 2017 -0800
    mm/swap: add cluster lock

commit 6a991fc72d1243b8da0c644d3147d3ec41a0b281
Author: Huang, Ying
Date: Wed Feb 22 15:45:19 2017 -0800
    mm/swap: fix kernel message in swap_info_get()

commit f6498b3f33123a6ee1c81a1b29b9c07964cb95c1
Author: Huang Ying
Date: Fri Oct 8 16:59:30 2016 -0700
    mm: don't use radix tree writeback tags for pages in swap cache

D.1 References

See the following links for important reference information.

Most of the original patches: https://kernelnewbies.org/Linux_4.11#Memory_management

Second step swap optimization notes: https://kernelnewbies.org/Linux_4.14#Memory_management

White paper on PMBench (2018): https://www.semanticscholar.org/paper/Pmbench%3A-A-Micro-Benchmark-for-Profiling-Paging-on-Yang-Seymour/dd0adcde7d074a414a9df76fb20d52a0d8aa8c71#paper-header

White paper with deeper analysis of persistent memory's applicability to memory page access performance: https://web.cs.unlv.edu/jisooy/paper/yang_pmbench.pdf
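When working through the patch list in this appendix for a backport, it can help to pull the commit ids out of the text so they can be fed to other tools. The sketch below does that with standard text utilities; `list_commit_ids` is a hypothetical helper name, and the sample input is taken from the first entries of the list.

```shell
#!/bin/sh
# Read patch-list text on stdin and print each distinct 40-character
# commit id once, in order of first appearance.
list_commit_ids() {
    grep -oE 'commit [0-9a-f]{40}' | awk '!seen[$2]++ { print $2 }'
}

printf '%s\n' \
    'commit 322b8afe4a65906c133102532e63a278775cc5f0' \
    'commit 0ccfece6ed507738c0e7e4414c3688b78d4e3756' \
    'commit 322b8afe4a65906c133102532e63a278775cc5f0' | list_commit_ids
```

Inside a kernel git tree that shares history with upstream, each id could then be tested with `git merge-base --is-ancestor <id> HEAD`. Note that backported patches normally carry different commit ids, so for a distribution kernel fork a missing id does not prove that the change is absent.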