DocumentNumber:340608-001USUtilizingLinuxSwapwithIntelOptaneDCSSDsasaMemoryOvercommitTechniqueSolutionsBlueprintJune2019Version1TeamContacts:AndrzejJakowskiandrzej.
jakowski@intel.
comKernelDevelopmentTimC.
Chentim.
c.
chen@intel.
comKernelDevelopmentYingHuangying.
huang@intel.
comKernelDevelopmentFrankOberfrank.
ober@intel.
comTestingandOutreachDavidJ.
Leonedavid.
j.
leone@intel.
comTestingandOutreachAndrewRuffinandrew.
ruffin@intel.
comMarketAnalysisandOutreachPragathiNarendrapragathi.
narendra@intel.
comPerformanceTestandTestDevelopmentMariuszBarczakmariusz.
barczak@intel.
comKernelDevelopmentGertPauwelsgert.
pauwels@intel.
comFieldTechnicalSupportEMEARegionStevenBriscoesteven.
briscoe@intel.
comFieldTechnicalSupportEMEARegionFaribKhondokerfarib.
khondoker@intel.
comTestingandSupportUtilizingLinuxSwapwithIntelOptaneDCSSDsSolutionsBlueprintJune20192340608-001USRevisionHistoryRevisionNumberDescriptionRevisionDate001Initialrelease.
June2019Inteltechnologies'featuresandbenefitsdependonsystemconfigurationandmayrequireenabledhardware,softwareorserviceactivation.
Performancevariesdependingonsystemconfiguration.
Noproductorcomponentcanbeabsolutelysecure.
Checkwithyoursystemmanufacturerorretailerorlearnmoreatintel.
com.
Noproductorcomponentcanbeabsolutelysecure.
Intel,theIntellogo,Optane,andXeonaretrademarksofIntelCorporationoritssubsidiariesintheU.
S.
and/orothercountries.
*Othernamesandbrandsmaybeclaimedasthepropertyofothers.
IntelCorporationUtilizingLinuxSwapwithIntelOptaneDCSSDsJune2019SolutionsBlueprint340608-001US3ContentsIntroduction4Scope.
4MemoryOvercommitUseCases.
6ExampleServerCostModel7TheKernelBuildProcess.
8DevelopmentToolsRequiredformenuconfig(PossiblePre-requisites)8AppendixAAutomationScriptsandHow-toGuide16AppendixBMemoryManagementFundamentals18B.
1MemoryManagementSystemOverview18AppendixCLinuxKernelInnovationstoLeverageFastSSDsasMemoryExtension20C.
1SwapImprovementsCompletedinv4.
14ofLinuxKernel21AppendixDSwapImprovementsPatchLists23D.
1References.
24UtilizingLinuxSwapwithIntelOptaneDCSSDsSolutionsBlueprintJune20194340608-001USIntroductionThissolutionsblueprintexplainshowtouseIntelOptaneDCSSDsinmemoryextensionconfigurations,orasmemoryreplacement.
We'lldescriberecentperformanceimprovementsthatwerefirstintroducedinversion4.
11andcompletedinversion4.
14oftheLinux*kernel.
Forsimplicity,wewillrefertoversion4.
14ornewer,asthekernelversionneededtoevaluatehighperformanceswapusage.
VeryhighenduranceandlowlatencydeviceslikeIntelOptaneDCSSDscanbeefficientlyusedasswapdevices,therebyenablingthesystemtoexceeditsminimumrequiredsystemlevelperformanceinvariousmemoryovercommitusecases.
IntelOptaneSSDsusedasswapdevicesareexpectedtohavealonglifespanoffiveormoreyearsinthisusage.
Forthosewhointendtoimmediatelyimplementandtesttheusecasesoutlinedinthisdocument,pleasejumptotheAppendixsections,andvisitthefollowingGitHublinkfortools,instructions,andtestcode.
http://github.
com/fxober/LinuxSwapScopeWewillfocusonhowtheLinuxoperatingsystem(OS)canutilizeIntelOptaneDCSSDsasswapdevices,therebyallowingstoragedevicecapacitytobeusedinconjunctionwithDRAMtostorememorypagesonbothDRAMandnon-volatilememorytypemedia.
Theprocessofmovingmemorypagesbetweenthestoragedeviceandmainmemoryiscalledpaging.
Pagingallowssystemadministratorstoperformefficientmanagementofsystemresources(memory,CPU,storage)atdesiredcostandservicelevels.
WithrecentadvancementsinstoragemediaandLinuxkernelimprovements,IntelOptaneDCSSDsprovideanewopportunitytooffsetDRAMcostsandallowformoreflexibleprocessmemoryoversubscription,athigherperformancelevelsthanbefore.
Thissolutionsblueprintwillexplorethoseusages.
TargetAudienceTargetedforsystemadministrators,systemoperators,DevOpsteams,andapplicationdeveloperswantingtoconfiguretheirunderlyingsoftwareandhardwareresourcestomaximizesystemperformanceatabettercost.
ThisdocumentassumesfamiliaritywithbasiccomputerarchitectureterminologyandtechniquesinOSusagestomanagephysicalresourcessuchasCPU,memoryandstorage.
ItalsoexplainsfundamentalconceptsofmemorymanagementtechniquesutilizedinmodernOSs,focusingontheLinuxenvironment.
TheimprovedimplementationsofLinuxSwap*andbetterhigherendurancememorymedia,suchasIntelOptanememory,isessentiallywhatenablessuchasolutiontobeeffectiveinamoderndatacenterenvironment.
DocumentOrganizationFirst,thisdocumentintroducesusecasesinwhichtheIntelOptaneDCSSDisusedasmemoryaugmentation.
Later,aservercostmodelispresented,whichcanbeadoptedoradjustedtocalculatepotentialcostsavingswhenleveraginganIntelOptaneDCSSDasDRAMreplacement.
Next,wedescribetheOSupgradesnecessarytomaximizesystemperformancewhenusinganIntelOptaneDCSSDasaswapdevice.
SpecificallyweprovideguidanceonminimumrequiredversionsofcommonLinuxdistributionsthatutilizeswapandmemorymanagementsubsystemimprovements,alongwithdetailsonbuildingtheLinuxkernelmanuallytomaximizeswapperformance.
TheAdditionalConsiderationsforSoftwareConfigurationsectionexploressystemconfigurationdetailsformaximizingswapperformance.
Thenwecomparetheperformanceofthedifferentswapdevices.
FinallyintheAppendixsectionsthedetailsofthememorymanagementsubsystemanddetailsofLinuxkernelinnovationsthatimproveswapperformanceareexplained.
Finally,akernelpatchlistisprovidedforadvanceduserswillingtobackportthechangesintotheirownkernelfork.
UtilizingLinuxSwapwithIntelOptaneDCSSDsJune2019SolutionsBlueprint340608-001US5GlossaryTermDefinitionPhysicalmemoryFastmemory,byteaddressable(asopposedtodiskstoragewhichissectororblockaddressable).
Thisfast,dynamicsystemmemoryistypicallyprovidedbyDRAMtechnology.
SwapdeviceDedicatedspaceonastoragedeviceforstoringmemorypagesofprocessdataorprocesscode.
Itcanbewholeblockstoragedeviceoritspartitionorafileinfilesystem(swapfile).
VirtualmemoryMemorymanagementtechniqueimplementedinmodernOSs.
Itprovidesanillusiontotherunningprocessthatitoperatesonacontiguousblockofmemory,whileinrealityhardwareandtheOSmanagetranslationsbetweenvirtualaddressestophysicaladdresses,andtransfersofmemorypagesfromstoragedevicetophysicalmemory.
OSvirtualmemoryhidesthosecomplexitiesfromtheapplicationprogrammer.
TotalCostofOwnership(TCO)Adefined,butoftennotstandardizedapproachtoanalyzingthefinancialimpactofapurchase,andperhapsongoingexpensesofhardwareandsoftwareinfrastructureoveritslifecycle.
TCOmodelstypicallyincludesvariousfactorsimpactingcost,e.
g.
costtopurchaseHW(capitalspending),operationalcostrelatedtoelectricityusedtopowerandcoolabuilding,andDataCenterequipment.
Thispaperfocusesonasimplifiedservercostmodel.
YoucanconsideritBillofMaterialoptimization,sincethetargetisnotfullanalysisofallserveroperationoracquisitioncosts.
§UtilizingLinuxSwapwithIntelOptaneDCSSDsSolutionsBlueprintJune20196340608-001USMemoryOvercommitUseCasesThischapterintroducesexampleusecasesinwhichanIntelOptaneDCSSDcanbeusedasmemoryextension,orasmemoryreplacementbyusingtheLinuxswapmechanism.
ThischapteralsoprovidesanexampleservercostmodelthathasbeendevelopedtoillustratepotentialcostsavingswhenconsideringthepurchaseofanewHWinfrastructure.
Usethisservercostmodelasaframeworktocalculatepotentialcostsavingsattheservercapitalexpenditurelevel.
MemoryOvercommitforVirtualizesEnvironmentsOnecommontechniquewidelyusedamongcloudserviceproviders(CSPs)istoperformphysicalresourcesover-commitmentincludingphysicalCPU,storage,andmemory.
Thefollowingfigureillustratesvirtualmachinedifferentiationbasedonratio,andhowmuchoftheguestphysicalmemoryisactuallybackedupbyphysicalDRAM.
Forexample"Gold"VMs'guestphysicalmemoryisfullybackedupbyDRAM,whilefor"Silver"VMshalfofitsguestphysicalmemoryisbackedupbyDRAM,andtheremainingportionisbackedupbytheswapdevice.
Finally,for"Bronze"VMs,aquarteroftheguestphysicalmemoryisbackedupbyDRAM,theremainingportioncanbepagedouttotheswapdevice.
WithLinuxbasedhypervisor(KVM)thistypeofdifferentiationcanbeachievedusingthemechanismcalledcontrolgroups(cgroup)whichcontrolsresourceusage(e.
g.
systemmemory)toagroupofprocess–inthiscaseaclassofVMs.
Figure1:ExampleofVirtualMachineDifferentiationBasedonMemoryOvercommitRatio§UtilizingLinuxSwapwithIntelOptaneDCSSDsJune2019SolutionsBlueprint340608-001US7ExampleServerCostModelThischapterfocusesonderivinganexampleservercostmodel,fromasystemmemoryhardwarecostsperspective,fortwoexampleconfigurationsofservers:server"A"andserver"B.
"Theservercostmodeldoesnottakeintoaccountthevariedanduniqueoperationalexpensesorothercapitalexpendituresrelatedtothelargerscopeofrunningadatacenter.
Forsimplicityofourcomparison,differencesinspace,power,operatingcosts,andothervariablefactorsareignored.
Server"A"andserver"B"configurationsarealmostidenticalwithregardstoCPU,networking,andstorage(bothbootdisksanddatavolumes).
Thereareonly2differencesbetweenthem:Server"A"totalphysicalDRAMis384GiB(24x16GBRDIMMs),whileserver"B"ispopulatedwithonly192GiB(12x16GBRDIMMs)ofphysicalDRAMServer"A"doesnotuseIntelOptaneDCSSDasaswapdevice;insteadserver"B"usesIntelOptaneDCSSD(2x100GiBdevices)asswapdevicesOneofthedatapointsmostinterestingtoasystemadministratoristherelativecostofserver"B"toserver"A"whichillustratesthepotentialhardwarecomponentcostsavingsonthepurchaseorleaseofnewserversforthedatacenter.
Additionalservercostcalculationsfocusontherelativecostsofserver"B"configurationcomparedtoserver"A".
Forsimplicity,thiscostingmodeltakesintoaccountonlythememorycomponents(DRAM+IntelOptaneDCSSDcapacities),becauseallothercomponentsofthoseserverconfigurationsareidentical.
Relativecostcomparisonofserver"B"configurationtoserver"A"configurationcanbedefinedasfollows:==_+__NowsimplydividingnumeratoranddenominatorofaboveequationbycostOptaneleadstothefollowingformula:=_+__SubstitutionofwithnormalizedperGiBDRAMtoOptanepriceratio(DRAM_to_Optane)willleadtothisfinalformula:=___+____Note:Pleasedoyourownpricecalculationsusingtheformulaabovetocalculateyourservercostsavings.
§UtilizingLinuxSwapwithIntelOptaneDCSSDsSolutionsBlueprintJune20198340608-001USTheKernelBuildProcessRecommendedSoftwareUpgradesInordertomaximizeIntelOptaneDCSSDperformanceinamemoryextensionconfiguration(asaswapdevice)IntelrecommendsupgradingyourLinuxdistributiontoarecentversioncontainingthebackportedseriesofpatchesthatwereaddedtotheupstreamLinuxkernelinversions4.
11andlater.
ThefollowingtablecontainsinformationonthecommonLinuxdistributionversionsthatadoptedperformanceimprovementspertainingtoswapperformance.
Table1:LinuxDistributionContainingSwapPerformanceImprovementsLinuxDistributionOSVersionRHEL/CentOSStartingversion7.
5andforwardStartingversion8.
0andforwardUbuntuStartingversion18.
10andforwardSLESStartingversionSLES15,SLES12SP4andforwardOracle*LinuxStartingversionOracleLinux7.
5andlaterwithUEKR5andRHCKHowtoBuildyourKernelBasedonUpstreamLinuxKernelThissectionprovidesinstructionsonbuildingaLinuxkernelimagebasedontheupstreamLinuxkernelproject.
ThismaybeespeciallyusefulforthoseinterestedinfurtherexplorationofLinuxkernelimprovementsrelatingtoswapdeviceperformance,andwhoarewillingtoupgradetheirinfrastructure'sLinuxkernel.
PleasenotethattheseinstructionsarebasedonUbuntu*server18.
04.
2systembuild,theexactstepsmaydifferbetweendifferentLinuxdistributions,e.
g.
usageofdistributionpackagemanager.
Approximatetimeneeded:1hourDevelopmentToolsRequiredformenuconfig(PossiblePre-requisites)Inordertoclone,compile,andbuildanewkernel/driver,thefollowingpackagesmustbeinstalled.
Youmustbeloggedinasroottoinstallthesepackages.
##Dependenciesneededtorunkernelmenuconfig#apt-getinstallflexbison#apt-getinstalllibncurses5-devlibncursesw5-dev##Dependenciesneededtoperformkernelbuild#apt-getinstalllibssl-devlibelf-dev#dpkg-ilinux-*.
debUtilizingLinuxSwapwithIntelOptaneDCSSDsJune2019SolutionsBlueprint340608-001US9BuildNewLinuxKernelwithRCUSettingforSwapDownloadLinuxkernel4.
14or5.
xornewerfromthisrepository:https://www.
kernel.
org/pub/linux/kernel/intoyourLinuxdistribution.
Itisthebesttochoosethelateststablekernel.
Fromaworkingdirectory:##Usewgettodownloadthekernelandunpackit(heretheexampleis4.
18.
20)#wgethttps://mirrors.
edge.
kernel.
org/pub/linux/kernel/v4.
x/linux-4.
18.
20.
tar.
xz#tar-xvflinux-4.
18.
20.
tar.
xz##AlternativelyclonewholeLinuxkernelgitrepositoryandcheckoutspecificbranch#gitclonehttps://git.
kernel.
org/pub/scm/linux/kernel/git/stable/linux.
git#gitcheckout–bv4.
18.
20_localv4.
18.
20BuildandinstallTocreatethekernelconfigurationfile(.
config)basedontherunningkernel,andusethedefaultsettingforallnewoptions,runthefollowingcommand:#yes""|makeoldconfigToobtainmaximumperformance,avoidread-copy-update(RCU)callbackprocessingasthismayintroducedelays.
ToavoidRCU,edit"CONFIG_RCU_NOCB_CPU=y"settinginyourlocalkernel.
configfile.
SeeOffloadingRCUProcessingtoDedicatedKernelThreadsfordetailsoneditingRCUsettings.
Alternatively,youcanmakechangesbyrunningmenuconfigtoselectthatoptionusingtheuserinterfaceasshownintheimagebelow.
#makemenuconfigUnder"GeneralSetupandFeatures>RCUSubsystem"setthe"OffloadRCUcallback…"flagasshownintheimagebelow:SaveandExitmenuconfig.
Buildthekernelandkernelmodules,andinstallthenewkernelonthesystem.
##Tobuildkernelimageandloadablekernelmodulesinvoke#make#makemodules_install##Installnewlybuiltkernelintooperatingsystem#makeinstallAftersuccessfulinstall,rebootthesystemtoloadthenewkernelimageandkernelmodules.
Usuallythenewkernelbecomesthedefaultbootselection.
AfterbootingtheOS,use"uname-a"toverifythattherunningkernelversionmatchesthenewlyinstalledkernelversion.
Ifadifferentkernelversionisloaded,youcanmodifythisbyreconfiguringthesystemloader,usuallygrub2.
Refertothesystemloaderdocumentationforyourspecificdistribution.
UtilizingLinuxSwapwithIntelOptaneDCSSDsSolutionsBlueprintJune201910340608-001USAdditionalConsiderationsforOSConfigurationThissectionexploresOSconfigurationconsiderationsformaximizingperformanceoftheswapdevice(s).
OffloadingRCUProcessingtoDedicatedKernelThreadsTooffloadRCUprocessingtodedicatedkernelthreads,editthekernelcommandlineoptioninthesystemloader.
WhenusingGrub2assystemloader,navigateto/etc/default/grubfileandadd"rcu_nocb="totheGRUB_CMDLINE_LINUX_DEFAULTentry.
Seebelow/etc/default/grubfilelistingforexample:.
.
.
GRUB_DISTRIBUTOR=`lsb_release-i-s2>/dev/null||echoDebian`GRUB_CMDLINE_LINUX_DEFAULT="rcu_nocbs=0-nmaybe-ubiquity"GRUB_CMDLINE_LINUX="".
.
.
Note:nisthenumberofcpus(orhwthreads)inyoursystemAftersavingedits,runeitherthe"update-grub"or"grub2-mkconfig"commandtoupdateyourgrub2settingsinthebootpartition.
Rebootthesystemandverifythatthenewsettingshavebeenappliedtothekernel.
#dmesg|grep-ioffload[0.
000000]OffloadRCUcallbacksfromCPUs:0-63.
ThereasonforthisstepistoavoidRCUprocessinginanIOcompletionpath,asRCUprocessingwilllikelyincreasepaginglatency.
TurningOffTransparenthugepagesTominimizetheoverheadofcoalescingmemorypagesintohugepagesandlaterbreakingthemupontheswapdevice,performthefollowingcommands:#echo'never'>/sys/kernel/mm/transparent_hugepage/enabled#echo'never'>/sys/kernel/mm/transparent_hugepage/defragWatermarkScaleFactorItisimportanttoincreasethewatermarkscalefactorin/proc/sys/vmasthisisthelevelwhereavailablememoryischeckedbykswapd.
Werecommendsettingitto400or4%ofavailablememory,doingsowillsetkswapdtoautomaticallykickoffswappingat4%ofavailablesystemmemory.
#echo'400'>/proc/sys/vm/watermark_scale_factorNUMAConsiderationsWhendealingwithmultipleswapdevicesonamulti-socketsystemwerecommenddistributingswapdevicesevenlyamongdifferentCPUsocketstoavoidQPI/UPItransfers.
MoreovertoavoidsoftwareoverheadwerecommendcreatingmanyswapdevicesonapartitionedNVMedevice.
Eachswappartitionmusthavethesamepriority.
Inmostcasestherecanbeatleast28partitions,dependingonthekernelconfiguration.
Whensettingupyoursystem,werecommendadheringtotheNUMAlocalityrulesformaximumperformance.
UtilizingLinuxSwapwithIntelOptaneDCSSDsJune2019SolutionsBlueprint340608-001US11PerformanceDataof4.
18.
20LinuxSwapWeusedthepmbenchutilitytotesttheallocationandaccessof4KiBmemorypagesonaLinuxsystem.
OurtestsystemutilizedanUbuntu18.
4.
2distributionofLinuxwhichweinitiallyupgradedtothe4.
18.
20versionofthekernel,astheUbuntureleasecomeswith4.
15.
xkernelversion.
WeupgradedusingthemethodsnotedinAppendixA-AutomationScriptsandHow-toGuide.
Thereshouldbenoissuerunningkernel4.
14ornewerasthekernelpatchestoLinuxswapareupstreamed(publiconkernel.
org)in4.
14.
Youcannotgainthislevelofperformanceonkernelspriorto4.
14.
Wetestedthein-boxkernelofUbuntu18.
04.
2(kernel4.
15.
0-46-generic)andsawminimaldifference(Hereisanexamplevariablesettingfrom/etc/default/grub,CPUcountspecific:GRUB_CMDLINE_LINUX_DEFAULT="rcu_nocbs=0-[n]maybe-ubiquity"Where[n]isthenumberoftotalCPUcoresorvirtualCPUthreadsinyoursystem.
Configurethekernelwiththese.
configsettingsifyouareabletocompileyourownkernel.
4.
EXPERIMENTAL:Generallyspeaking,itisbesttosettheNVMeschedulerto[none]ontheNVMeSSDswhichyouaretestingthemqblockorkyberscheduler.
Inmostcasesyourbuildshows[none],whichisfine.
#more/sys/block/nvme1n1/queue/scheduler[none]UtilizingLinuxSwapwithIntelOptaneDCSSDsJune2019SolutionsBlueprint340608-001US175.
NewerkernelsallowanNVMequeuesizeof1,023,whichissufficientandrecommended.
6.
IfyouareseeingNVMeblockmerges,changeyourNVMeblocksizeto4Kib(not512b)sectors.
Ifblockmergesarestilloccurringaftermakingthischange,trythefollowing.
First,checkthenomergesvalue:#cat/sys/block/queue/nomergesThenomergesvalueshouldbesetto2.
Verifyandchangeifnecessary:echo2>/sys/block/queue/nomerges§UtilizingLinuxSwapwithIntelOptaneDCSSDsSolutionsBlueprintJune201918340608-001USAppendixBMemoryManagementFundamentalsThischapterintroducesthebasicmemorymanagementconceptsusedintheLinuxkernel.
ItexplainssystemlevelbottlenecksobservedwhenIntelOptaneDCSSDsareusedasswapdeviceswithLinuxversionspriortov4.
14oftheupstreamLinuxkernel.
Finally,itexplainstechniquestoovercomethosebottlenecksinversion4.
14,souserscanexperienceimprovedperformanceandutilizeIntelOptaneDCSSDsasswapdevices.
B.
1MemoryManagementSystemOverviewModernoperatingsystemsimplementavirtualmemorymodelwhichprovidesmanyadvantagestoapplicationdevelopers.
Virtualmemorymodelsimplifiessoftwaredevelopment,itleavesphysicalmemoryallocationanddataplacementcomplexitytotheunderlyingoperatingsystem.
Theoperatingsystemkerneldealswiththatcomplexitybyprovidinganimpressiontoanyrunningprocessthathasabigchunkofmemoryavailable(usually4GiB)foritsexclusiveuse.
InrealityOSkernelmapsprocessvirtualmemorytophysicalDRAM,andpotentiallyoverflowstoaswapdevice,whichextendsavailablephysicalmemory.
Theprocessoftransferringdatabetweentheswapdeviceandphysicalmemoryiscalledpagingandconsistsofpage-inswhenthedataisreadfromtheswapdeviceintophysicalmemory,andpage-outswhendataismovedoutofmemory.
Itshouldbenoted,page-outsmayrequiredatatobewrittenouttotheswapdevice,basedonthestateofthepage.
Figure2belowprovidesaconceptualdiagramofvirtualmemoryandpagingFigure2:VirtualMemoryConceptthroughPagingUtilizingLinuxSwapwithIntelOptaneDCSSDsJune2019SolutionsBlueprint340608-001US19ThepagingprocessismanagedbytheOSandisheavilysupportedbyCPUhardwarethroughthememorymanagementunit(MMU).
Forexample,MMUcontainstranslationlookasidebuffer(TLB)cachewhichcontainsrecentinformationonvirtual-to-physicalmemorytranslations.
Thisenablesasignificantreductionintimeneededtoaccessdatainmemory.
AnotherCPUfeaturethatassiststheOSwithmemorymanagementisamechanismcalledpagefault.
PagefaultisanexceptionraisedbyCPUhardwarewhenaprocesstriestoaccessavirtualmemorylocationthatisnotmappedtoaphysicaladdress.
Therearedifferenttypesofpagefaults:Minor–isrisenwhenapageexistsinmainmemorybutthereisnoentryindicatingvirtual-to-physicaladdressmapping.
ThepagefaulthandlerisimplementedintheOScreatesanewmappingentry.
Major–isrisenwhenapagedoesnotexistinmainmemory.
Thepagefaulthandlerneedstobringrequireddatafromtheswapdeviceintomemoryandcreatecorrespondingmappingentry.
Forexample,thishappensinafreshlyloadedprocesswhichcausestheOSkerneltodelayloadingthewholeprogramintomemory.
Thistechnique,calledon-demandpaging,acceleratesprocessstartup.
AmajorpagefaultisaperformancedrainingprocedurethatrequirestheOSpagefaulthandlertofindanavailablelocationinphysicalmemory,whichcanpotentiallyinvolvepaging-outandloadingcontentoftheprogramfromtheswapdeviceintomemory,beforetheprocesscancontinueitsexecution.
Therearetwodifferenttypesofpages:Filesystempages,orpagesbackedupbythefiles.
Thesearememorypagesthatcontainfiledata;forexample,databasefilesdirectlymappedintotoprocessaddressspace,orlibraryfilescontainingexecutableprogramcode.
Thesepagescanbepaged-intophysicalmemory;forexample,whentheprogramstartsexecutinginstructionsstoredonthedisk(i.
e.
programusageofasharedlibrary).
TheLinuxpagecacheisacacheofthesepagesdestinedforfiles–bothresidentto-be-read,andchanged(dirty)thatneedtobesynchronizedtosomestoragedevice.
DirectaccessIOroutinesforwhichthereisnopagecacheusagearealsoavailableonLinux.
Sincethepagecacheisanopportunisticandgeneralusagecache,itisnotappropriateforallusages.
Anonymouspages.
Thesearememorypagesthatcontainprivateprocessinformation,thatisheaporstack,andhavenodeviceorfilesystembackingthem.
Whenthesystemisrunningintolowmemoryconditions(highmemorypressure)anonymouspagescanbepaged-out(swappedout)totheswappingfileorswapdevicebyOSprocesskswapdanditsrelatedkernelthreads.
Thisprocesscanbemoreorlessaggressivebasedontheconfigurationoftheswappinessparameter,asthisparametersetsthetargetofwhenswappingshouldbecomemoreactive.
Theparametercanbesetfrom0to200;thehigherthevalue,themoreswapisutilizedoverpagecachememoryreclamation.
InourperformancestudytheOSisconfiguredtoitsdefaultvalueof60,whichisthetypicalproductionrecommendedsetting.
Valueof100meansthatOSwillreclaimmemorypagesusingpagecacheandswapequally.
Youcanprintoutprocvariable/proc/sys/vm/swappinesstoviewitscurrentvalue.
Anotherimportantparameterusedtocontrolwhenkswapdkernelthreadsareactivatediswatermark_scale_factor.
Theusercansetalowerlimitofavailablememorythatspecifieswhenkswapdactivitywillbestarted.
MoredetailsareavailableinWatermarkscalefactorsection.
§UtilizingLinuxSwapwithIntelOptaneDCSSDsSolutionsBlueprintJune201920340608-001USAppendixCLinuxKernelInnovationstoLeverageFastSSDsasMemoryExtensionUntilrecentlytheLinuxkernelhadbeenprimarilyoptimizedforrotationaldisksbecausetheywerethepredominantstoragedevices.
Oneofthetechniquesusedtomaximizeswapperformanceforrotationalharddiskdrives(HDDs)wastomaintainswapdatainthecontiguouslocationonthedisktominimizediskseektime.
Theperformanceyieldsofthistechniquewerefineforrotationalharddiskdrives(HDDs)butinadequateforsolidstatedrives(SSDs).
Withrecentadvancementsinnon-volatilememory(NVM)technologieslikeIntelOptanetechnology,newtechniquesandmethodsareneededtotakeadvantageoftheincreasedperformanceofthemediaanddevices.
WhiletestingLinuxswapagainstthesenewdevices,manysystem-levelbottleneckswerediscoveredinLinuxswap.
KerneldevelopershaveaddressedsomeoftheperformancebottlenecksinthereleaseofLinuxkernel4.
14.
Inthissectionweexploresomeofthoseenhancements.
SwapdeviceintheLinuxkernelisrepresentedbyadedicateddatastructure(swap_info_struct)thatcontainsinformationonhowmemorypagesarestoredontheswapdevice,seeFigure3below.
Thisinformationisstoredinanarray,calledswap_mapwhichispartofswap_info_struct.
Swap_mapstoresinformationonusagecountforapagestoredontheswapdevice.
Swap_mapentriesareaggregatedintoclusters,theseclusterseffectivelyassignspecificportionsoftheswapdevicetothespecificCPUcore.
Updatestotheusagecountofindividualswap_mapentriesrequireperclusterlockstobetakeninsteadofholdingasinglelockprotectingthewholeswap_map.
Figure3:PrimarySwapDeviceDataStructuresEventhoughtherearededicatedswapentriesperCPUcluster,accessestotheswap_mapareprotectedbyasinglelockwhichisascalabilityandperformancelimiterwhenconcurrentattemptstotheswapdevicearemade.
Thenegativeimpactofthissinglelockisespeciallyvisibleinhighmemorypressureconditions.
Whenthesinglelockisusedtoprotectcriticalinformationintheswap_info_structdatastructure,latenciesforhandlingpagefaultsfromtheswapdevicearesignificantlyincreased.
ThisheavilyimpactsenduserperformanceandrendersthelatestHWlatencyimprovementsineffectiveduetosystemlevelbottlenecks.
Thenextsectionexplainstechniquestominimizelockcontentiononthesinglelockthatprotectsswap_info_structdatastructure,andtoimprovesystemlevellatencies.
AspreviouslydiscussedinthePerformanceDatasection,accesslatenciesonswapaveragebelow20microsecondswhenutilizingahigherperformancedrive.
UtilizingLinuxSwapwithIntelOptaneDCSSDsJune2019SolutionsBlueprint340608-001US21C.
1SwapImprovementsCompletedinv4.
14ofLinuxKernelTherearemanysoftwaretechniquestoaddressperformanceproblemsrelatedtolockcontention.
Theseapproachestypicallyrelyonthefollowingprinciples:Replacementofsinglecoarse-grainedlockonswappartitionwithmultiplefiner-grainedlocksontheswapcluster–whenmanypiecesofdataareprotectedfromconcurrentaccessesbyasingle,biglock,theconcurrentthreadsthatareattemptingtoreadorwritedataareserializedinaqueuewhileawaitingtheirturn.
Insuchcases,toimproveparallelism,abiglockcanbesplitintomanysmallerlockstoprotectindependentsub-piecesofdata.
Thisapproachmayyieldsignificantperformanceimprovementsespeciallywhenmultiplethreadsaccessindependentpiecesofdata,howeverwhenmorethanonethreadattemptstoaccessthesamepieceofdata,thoseattemptswillbeserializedinaqueue.
Reductionoftimespentwhenholdinglock(ortimespentincriticalsection)–whentherearemultiplethreadsattemptingtoaccessacriticalsectionthatisprotectedbyanexclusivelockheldbyanotherthreadtheyareallpauseduntillockisreleased.
Thelongerthecriticalsectionis,thelongertheotherthreadswillwaitbeforetheycancontinue.
Reductionoftimethatgiventhreadspendsinthecriticalsectionisanotherusefultechniqueincreasingparallelismandreducinglatency.
KernelDevelopersdeterminedthattheoccurrenceofincreasedsystemlevellatencieswhileswappingtoIntelIntelOptaneDCSSDwerecausedbyasinglelockprotectingswap_info_structdatastructure.
TheyhaveappliedtheprinciplesdiscussedaboveintotheseriesofswapimprovementsthatareavailableinLinuxkernelversion4.
14andlater.
Thefollowingtechniqueshavebeendevelopedtoreducelockcontentionontheswap_info_structlock.
1.
BulkoperationsandperCPUlockclusterimprovements–multipleswap_mapentriesthatrepresentfreespaceontheswapdevicehavebeenaggregatedinlargerunitsandstoredinswapslotcache.
SwapslotcacheismanagedbyaspecificCPUcore,becauseofthatitiscalled"percpuswapslotcache".
WhenaSWthreadrequestsnewswapspaceitfirsttriestoallocateitfromswapslotcacheonthegivenCPU.
Thisoperationdoesnotrequirelocking.
Becausesingleswapslotcachecontainsmultipleswap_mapentriesitislikelythatswap_mapentrywillsuccessfullybeallocatedfromit.
Whenallocationfromswapslotcacheisnotpossible,swapsoftwareneedstoperformbulkallocationofmultipleswap_mapentriesfromswap_map,andassignthoseentriestoswapslotcache.
Swap_info_lockisacquiredwhendoingbulkoperationsontheswap_mapdatastructure.
PleaserefertoFigure4belowfordetailsofthechanges.
Figure4:SwapBulkOperationsImprovementsUtilizingLinuxSwapwithIntelOptaneDCSSDsSolutionsBlueprintJune201922340608-001US2.
Radixtreesplit–anothersourceoflockcontentionthatexistedinLinuxkernelpriortoversion4.
14wasradixtreeusedforswapcache.
Swapcacheisanoptimizationinaswappingbehaviorthatreducesthenumberofwritestoswapdeviceorswapfileandmaintainsmappingbetweenmemorypageandswapmapentrywhenmemorypageisswappedinorswappedout.
Swapwriteisconsideredunnecessarywhenapageexistsinaswapdeviceorswapfile,aswellasinmainmemory,becausebothofthoselocationscontainthesamedata.
WhenLinuxconsiderspageforreclamationitcansimplycheckifitexistsinbothswapdeviceorswapfile,andinmainmemoryanddatainthosetwolocationsmatch.
Insuchcasepageinmainmemorycanbesimplymarkedasinvalidandreclaimed.
Toperformcheckifswapentryhascorrespondingpagestoredinmainmemoryradixtreedatastructureisused.
Swapcacheradixtreepriortoversion4.
14ofLinuxusedtobeprotectedbysingleswapcachelockwhichreducedparallelism.
Inversion4.
14singleswapcacheradixtreehasbeensplitintomultiplesmallertrees.
Thismodificationintroducedseparatelockspereachsmallerradixtreeandincreasedparallelism.
Thecurrentdesignmethodisbestimplementedwithmanyswappartitionsonthephysicalswapdevice.
SeeAppendixAandtheautomationscriptsongithubtoimplementthemaximumnumberofLinuxswappartitions,typically28.
§UtilizingLinuxSwapwithIntelOptaneDCSSDsJune2019SolutionsBlueprint340608-001US23AppendixDSwapImprovementsPatchListsThissectionprovidesalistofkernelpatchespertainingtoswapimprovementsthatwereintroducedintheLinuxkernel4.
11andin4.
14.
Thislistofpatchesmaybeusefulwhenconsideringcreatingauniquekernelimagebasedonkernelversionsolderthan4.
11,andbackportingswapimprovementsintoit.
commit322b8afe4a65906c133102532e63a278775cc5f0Author:HuangYingDate:WedMay314:52:492017-0700mm,swap:Fixaraceinfree_swap_and_cache()commit0ccfece6ed507738c0e7e4414c3688b78d4e3756Author:HuangYingDate:WedMay314:56:162017-0700mm/swapfile.
c:fixswapspaceleakinerrorpathofswap_free_entries()commit322b8afe4a65906c133102532e63a278775cc5f0Author:HuangYingDate:WedMay314:52:492017-0700mm,swap:Fixaraceinfree_swap_and_cache()commitba81f83842549871cbd7226fc11530dc464500bbAuthor:HuangYingDate:WedFeb2215:45:462017-0800mm/swap:skipreadaheadonlywhenswapslotcacheisenabledcommit039939a65059852242c823ece685579370bc574fAuthor:TimChenDate:WedFeb2215:45:432017-0800mm/swap:enableswapslotscacheusagecommit67afa38e012e9581b9b42f2a41dfc56b1280794dAuthor:TimChenDate:WedFeb2215:45:392017-0800mm/swap:addcacheforswapslotsallocationcommit7c00bafee87c7bac7ed9eced7c161f8e5332cb4eAuthor:TimChenDate:WedFeb2215:45:362017-0800mm/swap:freeswapslotsinbatchUtilizingLinuxSwapwithIntelOptaneDCSSDsSolutionsBlueprintJune201924340608-001UScommit36005bae205da3eef0016a5c96a34f10a68afa1eAuthor:TimChenDate:WedFeb2215:45:332017-0800mm/swap:allocateswapslotsinbatchescommite8c26ab60598558ec3a626e7925b06e7417d7710Author:TimChenDate:WedFeb2215:45:292017-0800mm/swap:skipreadaheadforunreferencedswapslotscommit4b3ef9daa4fc0bba742a79faecb17fdaaead083bAuthor:Huang,YingDate:WedFeb2215:45:262017-0800mm/swap:splitswapcacheinto64MBtrunkscommit235b62176712b970c815923e36b9a9cc05d4d901Author:Huang,YingDate:WedFeb2215:45:222017-0800mm/swap:addclusterlockcommit6a991fc72d1243b8da0c644d3147d3ec41a0b281Author:Huang,YingDate:WedFeb2215:45:192017-0800mm/swap:fixkernelmessageinswap_info_get()commitf6498b3f33123a6ee1c81a1b29b9c07964cb95c1Author:HuangYingDate:FriOct816:59:302016-0700mm:don'tuseradixtreewritebacktagsforpagesinswapcacheD.
1ReferencesSeethefollowinglinksforimportantreferenceinformation.
Mostoftheoriginalpatches:https://kernelnewbies.
org/Linux_4.
11#Memory_managementSecondstepswapoptimizationnotes:https://kernelnewbies.
org/Linux_4.
14#Memory_managementWhitepaperonPMBench(2018):https://www.
semanticscholar.
org/paper/Pmbench%3A-A-Micro-Benchmark-for-Profiling-Paging-on-Yang-Seymour/dd0adcde7d074a414a9df76fb20d52a0d8aa8c71#paper-headerWhitepaperwithdeeperanalysisofpersistentmemory'sapplicabilitytomemorypageaccessperformance:https://web.
cs.
unlv.
edu/jisooy/paper/yang_pmbench.
pdf§
易探云服务器怎么过户/转让?易探云支持云服务器PUSH功能,该功能可将云服务器过户给指定用户。可带价PUSH,收到PUSH请求的用户在接收云服务器的同时,系统会扣除接收方的款项,同时扣除相关手续费,然后将款项打到发送方的账户下。易探云“PUSH服务器”的这一功能,可以让用户将闲置云服务器转让给更多需要购买的用户!易探云服务器怎么过户/PUSH?1.PUSH双方必须为认证用户:2.买家未接收前,卖家...
LightNode是一家位于香港的VPS服务商.提供基于KVM虚拟化技术的VPS.在提供全球常见节点的同时,还具备东南亚地区、中国香港等边缘节点.满足开发者建站,游戏应用,外贸电商等应用场景的需求。新用户注册充值就送,最高可获得20美元的奖励金!成为LightNode的注册用户后,还可以获得属于自己的邀请链接。通过你的邀请链接带来的注册用户,你将直接获得该用户的消费的10%返佣,永久有效!平台目前...
乐凝网络怎么样?乐凝网络是一家新兴的云服务器商家,目前主要提供香港CN2 GIA、美国CUVIP、美国CERA、日本东京CN2等云服务器及云挂机宝等服务。乐凝网络提供比同行更多的售后服务,让您在使用过程中更加省心,使用零云服务器,可免费享受超过50项运维服务,1分钟内极速响应,平均20分钟内解决运维问题,助您无忧上云。目前,香港HKBN/美国cera云服务器,低至9.88元/月起,支持24小时无理...
pagedefrag为你推荐
行业关键词为什么有些行业关键词竟价出价很低有些行业很高外网和内网什么是外网和内网?快速美白好方法快速美白的好点子!?(不是晒黑的)中国论坛大全天涯论坛的网址?ps抠图技巧photoshop最基本的抠图方法和技巧!硬盘人硬盘是指什么人人人逛街过节了,这儿可真热闹写一段话lockdowndios8.1怎么激活内置卡贴bt封杀BT下载可以封杀迅雷吗?什么原理?能破吗?商标注册查询官网全国商标注册查询在哪里查呀?
域名抢注工具 主机点评 香港机房 php主机 免费ftp空间 qq数据库 hostker 东莞数据中心 佛山高防服务器 福建铁通 免费phpmysql空间 超级服务器 yundun dnspod lick 日本代理ip 免费网络 网页加速 移动王卡 聚惠网 更多