High Performance VMM-Bypass I/O in Virtual Machines
Jiuxing Liu, Wei Huang, Bulent Abali, Dhabaleswar K. Panda
IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532
{jl, abali}@us.ibm.com
Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
{huanwei, panda}@cse.ohio-state.edu
Abstract
Currently, I/O device virtualization models in virtual machine (VM) environments require involvement of a virtual machine monitor (VMM) and/or a privileged VM for each I/O operation, which may turn out to be a performance bottleneck for systems with high I/O demands, especially those equipped with modern high speed interconnects such as InfiniBand.
In this paper, we propose a new device virtualization model called VMM-bypass I/O, which extends the idea of OS-bypass originated from user-level communication. Essentially, VMM-bypass allows time-critical I/O operations to be carried out directly in guest VMs without involvement of the VMM and/or a privileged VM. By exploiting the intelligence found in modern high speed network interfaces, VMM-bypass can significantly improve I/O and communication performance for VMs without sacrificing safety or isolation.
To demonstrate the idea of VMM-bypass, we have developed a prototype called Xen-IB, which offers InfiniBand virtualization support in the Xen 3.0 VM environment. Xen-IB runs with current InfiniBand hardware and does not require modifications to existing user-level applications or kernel-level drivers that use InfiniBand. Our performance measurements show that Xen-IB is able to achieve nearly the same raw performance as the original InfiniBand driver running in a non-virtualized environment.
1 Introduction
Virtual machine (VM) technologies were first introduced in the 1960s [14], but are experiencing a resurgence in recent years and becoming more and more attractive to both the industry and the research communities [35]. A key component in a VM environment is the virtual machine monitor (VMM) (also called hypervisor), which is implemented directly on top of hardware and provides virtualized hardware interfaces to VMs. With the help of VMMs, VM technologies allow running many different virtual machines in a single physical box, with each virtual machine possibly hosting a different operating system. VMs can also provide secure and portable environments to meet the demanding resource requirements of modern computing systems [9].
In VM environments, device I/O access in guest operating systems can be handled in different ways. For instance, in VMware Workstation, device I/O relies on switching back to the host operating system and user-level emulation [37]. In VMware ESX Server, guest VM I/O operations trap into the VMM, which makes direct access to I/O devices [42]. In Xen [11], device I/O follows a split-driver model. Only an isolated device domain (IDD) has access to the hardware using native device drivers. All other virtual machines (guest VMs, or domains) need to pass the I/O requests to the IDD to access the devices. This control transfer between domains needs involvement of the VMM.
In recent years, network interconnects that provide very low latency (less than 5 µs) and very high bandwidth (multiple Gbps) are emerging. Examples of these high speed interconnects include Virtual Interface Architecture (VIA) [12], InfiniBand [19], Quadrics [34], and Myrinet [25]. Due to their excellent performance, these interconnects have become strong players in areas such as high performance computing (HPC). To achieve high performance, these interconnects usually have intelligent network interface cards (NICs) which can be used to offload a large part of the host communication protocol processing. The intelligence in the NICs also supports user-level communication, which enables safe direct I/O access from user-level processes (OS-bypass I/O) and contributes to reduced latency and CPU overhead.
VM technologies can greatly benefit computing systems built from the aforementioned high speed interconnects by not only simplifying cluster management for these systems, but also offering much cleaner solutions to tasks such as check-pointing and fail-over. Recently, as these high speed interconnects become more and more commoditized with their cost going down, they are also used for providing remote I/O access in high-end enterprise systems, which increasingly run in virtualized environments. Therefore, it is very important to provide VM support to high-end systems equipped with these high speed interconnects. However, performance and scalability requirements of these systems pose some challenges. In all the VM I/O access approaches mentioned previously, VMMs have to be involved to make sure that I/O accesses are safe and do not compromise integrity of the system. Therefore, current device I/O access in virtual machines requires context switches between the VMM and guest VMs. Thus, I/O access can suffer from longer latency and higher CPU overhead compared to native I/O access in non-virtualized environments. In some cases, the VMM may also become a performance bottleneck which limits I/O performance in guest VMs. In some of the aforementioned approaches (VMware Workstation and Xen), a host operating system or another virtual machine is also involved in the I/O access path. Although these approaches can greatly simplify VMM design by moving device drivers out of the VMM, they may lead to even higher I/O access overhead when requiring context switches between the host operating system and the guest VM or between two different VMs.
Annual Tech '06: 2006 USENIX Annual Technical Conference, USENIX Association
In this paper, we present a VMM-bypass approach for I/O access in VM environments. Our approach takes advantage of features found in modern high speed intelligent network interfaces to allow time-critical operations to be carried out directly in guest VMs while still maintaining system integrity and isolation. With this method, we can remove the bottleneck of going through the VMM or a separate VM for many I/O operations and significantly improve communication and I/O performance.
The key idea of our VMM-bypass approach is based on the OS-bypass design of modern high speed network interfaces, which allows user processes to access I/O devices directly in a safe way without going through operating systems. OS-bypass was originally proposed by research communities [41, 40, 29, 6, 33] and later adopted by some commercial interconnects such as InfiniBand. Our idea can be regarded as an extension of OS-bypass designs in the context of VM environments.
To demonstrate the idea of VMM-bypass, we have designed and implemented a prototype called Xen-IB to provide virtualization support for InfiniBand in Xen. Basically, our implementation presents to each guest VM a para-virtualized InfiniBand device. Our design requires no modification to existing hardware. Also, through a technique called high-level virtualization, we allow current user-level applications and kernel-level modules that utilize InfiniBand to run without changes. Our performance results, which include benchmarks at the basic InfiniBand level as well as evaluation of upper-layer InfiniBand protocols such as IP over InfiniBand (IPoIB) [1] and MPI [36], demonstrate that performance of our VMM-bypass approach comes close to that in a native, non-virtualized environment. Although our current implementation is for InfiniBand and Xen, the basic VMM-bypass idea and many of our implementation techniques can be readily applied to other high-speed interconnects and other VMMs.
In summary, the main contributions of our work are:
• We proposed the VMM-bypass approach for I/O accesses in VM environments for modern high speed interconnects. Using this approach, many I/O operations can be performed directly without involvement of a VMM or another VM. Thus, I/O performance can be greatly improved.
• Based on the idea of VMM-bypass, we implemented a prototype, Xen-IB, to virtualize InfiniBand devices in Xen guest VMs. Our prototype supports running existing InfiniBand applications and kernel modules in guest VMs without any modification.
• We carried out extensive performance evaluation of our prototype. Our results show that performance of our virtualized InfiniBand device is very close to that of native InfiniBand devices running in a non-virtualized environment.
The rest of the paper is organized as follows: In Section 2, we present background information, including the Xen VM environment and the InfiniBand architecture. In Section 3, we present the basic idea of VMM-bypass I/O. In Section 4, we discuss the detailed design and implementation of our Xen-IB prototype. In Section 5, we discuss several related issues and limitations of our current implementation and how they can be addressed in future. Performance evaluation results are given in Section 6. We discuss related work in Section 7 and conclude the paper in Section 8.
2 Background
In this section, we provide background information for our work. In Section 2.1, we describe how I/O device access is handled in several popular VM environments. In Section 2.3, we describe the OS-bypass feature in modern high speed network interfaces. Since our prototype is based on Xen and InfiniBand, we introduce them in Sections 2.2 and 2.4, respectively.
2.1 I/O Device Access in Virtual Machines
In a VM environment, the VMM plays the central role of virtualizing hardware resources such as CPUs, memory, and I/O devices. To maximize performance, the VMM can let guest VMs access these resources directly whenever possible. Taking CPU virtualization as an example, a guest VM can execute all non-privileged instructions natively in hardware without intervention of the VMM. However, privileged instructions executed in guest VMs will generate a trap into the VMM. The VMM will then take necessary steps to make sure that the execution can continue without compromising system integrity. Since many CPU-intensive workloads seldom use privileged instructions (this is especially true for applications in the HPC area), they can achieve excellent performance even when executed in a VM.
I/O device access in VMs, however, is a completely different story. Since I/O devices are usually shared among all VMs in a physical machine, the VMM has to make sure that accesses to them are legal and consistent. Currently, this requires VMM intervention on every I/O access from guest VMs. For example, in VMware ESX Server [42], all physical I/O accesses are carried out within the VMM, which includes device drivers for popular server hardware. System integrity is achieved with every I/O access going through the VMM. Furthermore, the VMM can serve as an arbitrator/multiplexer/demultiplexer to implement useful features such as QoS control among VMs. However, VMM intervention also leads to longer I/O latency and higher CPU overhead due to the context switches between guest VMs and the VMM. Since the VMM serves as a central control point for all I/O accesses, it may also become a performance bottleneck for I/O-intensive workloads.
Having device I/O access in the VMM also complicates the design of the VMM itself. It significantly limits the range of supported physical devices because new device drivers have to be developed to work within the VMM. To address this problem, VMware Workstation [37] and Xen [13] carry out I/O operations in a host operating system or a special privileged VM called isolated device domain (IDD), which can run popular operating systems such as Windows and Linux that have a large number of existing device drivers. Although this approach can greatly simplify the VMM design and increase the range of supported hardware, it does not directly address performance issues with the approach used in VMware ESX Server. In fact, I/O accesses now may result in expensive operations called a world switch (a switch between the host OS and a guest VM) or a domain switch (a switch between two different VMs), which can lead to even worse I/O performance.
2.2 Overview of the Xen Virtual Machine Monitor
Xen is a popular high performance VMM. It uses para-virtualization [43], in which guest operating systems need to be explicitly ported to the Xen architecture. This architecture is similar to native hardware such as the x86 architecture, with only slight modifications to support efficient virtualization. Since Xen does not require changes to the application binary interface (ABI), existing user applications can run without any modification.
Figure 1: The structure of the Xen hypervisor, hosting three XenoLinux operating systems (courtesy [32])
Figure 1 illustrates the structure of a physical machine running Xen. The Xen hypervisor is at the lowest level and has direct access to the hardware. The hypervisor, instead of the guest operating systems, runs at the most privileged processor level. Xen provides basic control interfaces needed to perform complex policy decisions. Above the hypervisor are the Xen domains (VMs). There can be many domains running simultaneously. Guest VMs are prevented from directly executing privileged processor instructions. A special domain called domain0, which is created at boot time, is allowed to access the control interface provided by the hypervisor. The guest OS in domain0 hosts application-level management software and performs the tasks of creating, terminating, or migrating other domains through the control interface.
There is no guarantee that a domain will get a contiguous stretch of physical memory to run a guest OS. Xen makes a distinction between machine memory and pseudo-physical memory. Machine memory refers to the physical memory installed in a machine, while pseudo-physical memory is a per-domain abstraction, allowing a guest OS to treat its memory as a contiguous range of physical pages. Xen maintains the mapping between the machine and the pseudo-physical memory. Only certain parts of the operating system need to understand the difference between these two abstractions. Guest OSes allocate and manage their own hardware page tables, with minimal involvement of the Xen hypervisor to ensure safety and isolation.
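The distinction between machine memory and pseudo-physical memory can be illustrated with a toy mapping table. This is only a sketch: the class `DomainMemory`, its method, and the frame numbers are hypothetical and are not part of Xen's interface.

```python
# Toy model of a per-domain pseudo-physical to machine frame mapping.
# All names and frame numbers are illustrative assumptions, not Xen APIs.

class DomainMemory:
    """Contiguous pseudo-physical frames backed by scattered machine frames."""

    def __init__(self, machine_frames):
        # Index = pseudo-physical frame number (PFN),
        # value = machine frame number (MFN) that backs it.
        self.p2m = list(machine_frames)

    def pfn_to_mfn(self, pfn):
        if not 0 <= pfn < len(self.p2m):
            raise ValueError(f"PFN {pfn} is outside this domain")
        return self.p2m[pfn]


# The domain receives whatever machine frames happen to be free.
dom = DomainMemory(machine_frames=[0x8F3, 0x120, 0x7A1, 0x121])

# The guest sees PFNs 0..3 as one contiguous range ...
guest_view = list(range(len(dom.p2m)))
# ... while the backing machine frames are not contiguous.
machine_view = [dom.pfn_to_mfn(pfn) for pfn in guest_view]
```

The guest addresses its memory through `guest_view`, and only code that programs hardware needs the translated values in `machine_view`.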
In Xen, domains can communicate with each other through shared pages and event channels. Event channels provide an asynchronous notification mechanism between domains. Each domain has a set of end-points (or ports) which may be bound to an event source. When a pair of end-points in two domains are bound together, a "send" operation on one side will cause an event to be received by the destination domain, which may in turn cause an interrupt. Event channels are only intended for sending notifications between domains. So if a domain wants to send data to another, the typical scheme is for the source domain to grant access to local memory pages to the destination domain. Then, these shared pages are used to transfer data.
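The division of labor between event channels (notification only) and granted pages (data) might be sketched as follows. The classes `Domain` and `EventChannel` and their methods are hypothetical stand-ins for illustration and do not mirror Xen's actual hypercall interface.

```python
# Toy model: data travels through granted pages, the event channel only notifies.
# All names are illustrative assumptions.

class Domain:
    def __init__(self, name):
        self.name = name
        self.pages = {}     # page id -> bytearray owned by this domain
        self.granted = {}   # page id -> names of domains allowed to access it
        self.pending = []   # ports on which a notification has arrived

    def grant(self, page_id, peer):
        self.granted.setdefault(page_id, set()).add(peer.name)

    def read_granted(self, owner, page_id):
        if self.name not in owner.granted.get(page_id, set()):
            raise PermissionError("page not granted to this domain")
        return bytes(owner.pages[page_id])


class EventChannel:
    """Carries notifications only; no payload travels over the channel."""

    def __init__(self, dst, port):
        self.dst, self.port = dst, port

    def send(self):
        self.dst.pending.append(self.port)


src, dst = Domain("guest"), Domain("idd")
src.pages[7] = bytearray(b"request")   # source writes data into its own page
src.grant(7, dst)                      # then grants the destination access
chan = EventChannel(dst, port=3)
chan.send()                            # notification: "something is ready"

port = dst.pending.pop(0)
payload = dst.read_granted(src, 7)     # destination reads the shared page
```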
Virtual machines in Xen usually do not have direct access to hardware. Since most existing device drivers assume they have complete control of the device, there cannot be multiple instantiations of such drivers in different domains for a single device. To ensure manageability and safe access, device virtualization in Xen follows a split device driver model [13]. Each device driver is expected to run in an isolated device domain (IDD), which hosts a backend driver to serve access requests from guest domains. Each guest OS uses a frontend driver to communicate with the backend. The split driver organization provides security: misbehaving code in a guest domain will not result in failure of other guest domains. The split device driver model requires the development of frontend and backend drivers for each device class. A number of popular device classes such as virtual disk and virtual network are currently supported in guest domains.
2.3 OS-bypass I/O
Traditionally, device I/O accesses are carried out inside the OS kernel on behalf of application processes. However, this approach introduces several problems, such as overhead caused by context switches between user processes and OS kernels and extra data copies, which degrade I/O performance [5]. It can also result in QoS crosstalk [17] due to the lack of proper accounting for costs of I/O accesses carried out by the kernel on behalf of applications.
To address these problems, a concept called user-level communication was introduced by the research community. One of the notable features of user-level communication is OS-bypass, with which I/O (communication) operations can be achieved directly by user processes without involvement of OS kernels. OS-bypass was later adopted by commercial products, many of which have become popular in areas such as high performance computing where low latency is vital to applications. It should be noted that OS-bypass does not mean all I/O operations bypass the OS kernel. Usually, devices allow OS-bypass for frequent and time-critical operations while other operations, such as setup and management operations, can go through OS kernels and are handled by a privileged module, as illustrated in Figure 2.
Figure 2: OS-Bypass Communication and I/O
The key challenge to implement OS-bypass I/O is to enable safe access to a device shared by many different applications. To achieve this, OS-bypass capable devices usually require more intelligence in the hardware than traditional I/O devices. Typically, an OS-bypass capable device is able to present virtual access points to different user applications. Hardware data structures for virtual access points can be encapsulated into different I/O pages. With the help of an OS kernel, the I/O pages can be mapped into the virtual address spaces of different user processes. Thus, different processes can access their own virtual access points safely, thanks to the protection provided by the virtual memory mechanism. Although the idea of user-level communication and OS-bypass was developed for traditional, non-virtualized systems, the intelligence and self-virtualizing characteristic of OS-bypass devices lend themselves nicely to a virtualized environment, as we will see later.
2.4 InfiniBand Architecture
InfiniBand [19] is a high speed interconnect offering high performance as well as features such as OS-bypass. InfiniBand host channel adapters (HCAs) are the equivalent of network interface cards (NICs) in traditional networks. InfiniBand uses a queue-based model for communication. A Queue Pair (QP) consists of a send queue and a receive queue. The send queue holds instructions to transmit data and the receive queue holds instructions that describe where received data is to be placed. Communication instructions are described in Work Queue Requests (WQR), or descriptors, and submitted to the queue pairs. The completion of the communication is reported through Completion Queues (CQs) using Completion Queue Entries (CQEs). CQEs can be accessed by using polling or event handlers.
Initiating data transfers (posting descriptors) and notification of their completion (polling for completion) are time-critical tasks which use OS-bypass.
In the Mellanox [21] approach, which represents a typical implementation of the InfiniBand specification, posting descriptors is done by ringing a doorbell. Doorbells are rung by writing to the registers that form the User Access Region (UAR). Each UAR is a 4 KB I/O page mapped into a process's virtual address space. Posting a work request includes putting the descriptors into a QP buffer and writing the doorbell to the UAR, which is completed without the involvement of the operating system. CQ buffers, where the CQEs are located, can also be directly accessed from the process virtual address space. These OS-bypass features make it possible for InfiniBand to provide very low communication latency.
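The post-and-poll cycle described above can be modeled in a few lines. This is a simplified sketch under stated assumptions: `ToyHCA` and `QueuePair` are hypothetical classes, the doorbell is modeled as a method call rather than a write to a mapped UAR page, and the "transfer" completes instantly.

```python
from collections import deque

# Toy model of queue pairs, doorbells, and completion queues.
# All names are illustrative assumptions, not a real verbs API.

class ToyHCA:
    """Stands in for the adapter: consumes descriptors when a doorbell rings."""

    def __init__(self):
        self.qps = {}

    def ring_doorbell(self, qpn):
        qp = self.qps[qpn]
        while qp.send_queue:
            wqr = qp.send_queue.popleft()
            # Pretend the data was transmitted, then report completion.
            qp.cq.append({"qpn": qpn, "wr_id": wqr["wr_id"], "status": "ok"})


class QueuePair:
    def __init__(self, hca, qpn):
        self.hca, self.qpn = hca, qpn
        self.send_queue = deque()   # descriptors describing data to transmit
        self.recv_queue = deque()   # descriptors describing where to place data
        self.cq = deque()           # completion queue entries (CQEs)
        hca.qps[qpn] = self

    def post_send(self, wr_id, data):
        # Step 1: place the descriptor in the QP buffer.
        self.send_queue.append({"wr_id": wr_id, "data": data})
        # Step 2: ring the doorbell (in hardware, a write to the UAR page).
        self.hca.ring_doorbell(self.qpn)

    def poll_cq(self):
        # Polling reads the CQ buffer directly; returns None when empty.
        return self.cq.popleft() if self.cq else None


qp = QueuePair(ToyHCA(), qpn=5)
qp.post_send(wr_id=1, data=b"hello")
cqe = qp.poll_cq()
```

Neither `post_send` nor `poll_cq` involves a kernel in this model, which is the property OS-bypass relies on.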
InfiniBand also provides a comprehensive management scheme. Management communication is achieved by sending management datagrams (MADs) to well-known QPs (QP0 and QP1).
InfiniBand requires all buffers involved in communication to be registered before they can be used in data transfers. In Mellanox HCAs, the purpose of registration is two-fold. First, an HCA needs to keep an entry in the Translation and Protection Table (TPT) so that it can perform virtual-to-physical translation and protection checks during data transfer. Second, the memory buffer needs to be pinned in memory so that the HCA can DMA directly into the target buffer. Upon the success of registration, a local key and a remote key are returned, which can be used later for local and remote (RDMA) accesses. QP and CQ buffers described above are just normal buffers that are directly allocated from the process virtual memory space and registered with the HCA.
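A key-indexed translation and protection table could be sketched as below. The class `ToyTPT`, its methods, and the addresses are hypothetical; the sketch assumes page-aligned buffers and does not model pinning.

```python
import itertools

PAGE_SIZE = 4096

# Toy Translation and Protection Table (TPT) indexed by keys.
# Names and layout are illustrative assumptions, not the HCA's real format.

class ToyTPT:
    def __init__(self):
        self._keys = itertools.count(1)
        self._entries = {}   # key -> registered region

    def register(self, virt_addr, length, machine_pages):
        """Record a region and return (local key, remote key)."""
        entry = {"va": virt_addr, "len": length, "pages": list(machine_pages)}
        lkey, rkey = next(self._keys), next(self._keys)
        self._entries[lkey] = entry
        self._entries[rkey] = entry
        return lkey, rkey

    def translate(self, key, addr):
        """Protection check plus virtual-to-physical translation."""
        entry = self._entries.get(key)
        if entry is None:
            raise PermissionError("unknown key")
        offset = addr - entry["va"]
        if not 0 <= offset < entry["len"]:
            raise PermissionError("address outside the registered region")
        page = entry["pages"][offset // PAGE_SIZE]
        return page * PAGE_SIZE + offset % PAGE_SIZE


tpt = ToyTPT()
lkey, rkey = tpt.register(virt_addr=0x10000, length=2 * PAGE_SIZE,
                          machine_pages=[0x40, 0x99])
# An access 4 bytes into the second page of the buffer.
dma_addr = tpt.translate(lkey, 0x10000 + PAGE_SIZE + 4)
```

Because lookups start from the key rather than the virtual address, two registrations may use the same virtual address without colliding.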
Figure 3: Architectural overview of OpenIB Gen2 stack
There are two popular stacks for InfiniBand drivers. VAPI [23] is the Mellanox implementation, and OpenIB Gen2 [28] has recently emerged as a new generation of IB stack provided by the OpenIB community. In this paper, our prototype implementation is based on OpenIB Gen2, whose architecture is illustrated in Figure 3.
3 VMM-Bypass I/O
VMM-bypass I/O can be viewed as an extension to the idea of OS-bypass I/O in the context of VM environments. In this section, we describe the basic design of VMM-bypass I/O. Two key ideas in our design are para-virtualization and high-level virtualization.
In some VM environments, I/O devices are virtualized at the hardware level [37]. Each I/O instruction to access a device is virtualized by the VMM. With this approach, existing device drivers can be used in the guest VMs without any modification. However, it significantly increases the complexity of virtualizing devices. For example, one popular InfiniBand card (MT23108 from Mellanox [24]) presents itself as a PCI-X device to the system. After initialization, it can be accessed by the OS using memory mapped I/O. Virtualizing this device at the hardware level would require us to not only understand all the hardware commands issued through memory mapped I/O, but also implement a virtual PCI-X bus in the guest VM. Another problem with this approach is performance. Since existing physical devices are typically not designed to run in a virtualized environment, the interfaces presented at the hardware level may exhibit significant performance degradation when they are virtualized.
Figure 4: VMM-Bypass I/O (I/O Handled by VMM Directly)
Our VMM-bypass I/O virtualization design is based on the idea of para-virtualization, similar to [11] and [44]. We do not preserve hardware interfaces of existing devices. To virtualize a device in a guest VM, we implement a device driver called the guest module in the OS of the guest VM. The guest module is responsible for handling all the privileged accesses to the device. In order to achieve VMM-bypass device access, the guest module also needs to set things up properly so that I/O operations can be carried out directly in the guest VM. This means that the guest module must be able to create virtual access points on behalf of the guest OS and map them into the address spaces of user processes. Since the guest module does not have direct access to the device hardware, we need to introduce another software component called the backend module, which provides device hardware access for different guest modules. If devices are accessed inside the VMM, the backend module can be implemented as part of the VMM. It is possible to let the backend module talk to the device directly. However, we can greatly simplify its design by reusing the original privileged module of the OS-bypass device driver. In addition to serving as a proxy for device hardware access, the backend module also coordinates accesses among different VMs so that system integrity can be maintained. The VMM-bypass I/O design is illustrated in Figure 4.
If device accesses are provided by another VM (device driver VM), the backend module can be implemented within the device driver VM. The communication between guest modules and the backend module can be achieved through the inter-VM communication mechanism provided by the VM environment. This approach is shown in Figure 5.
Figure 5: VMM-Bypass I/O (I/O Handled by Another VM)
Para-virtualization can lead to compatibility problems because a para-virtualized device does not conform to any existing hardware interfaces. However, in our design, these problems can be addressed by maintaining existing interfaces which are at a higher level than the hardware interface (a technique we dubbed high-level virtualization). Modern interconnects such as InfiniBand have their own standardized access interfaces. For example, the InfiniBand specification defines a VERBS interface for a host to talk to an InfiniBand device. The VERBS interface is usually implemented in the form of an API set through a combination of software and hardware. Our high-level virtualization approach maintains the same VERBS interface within a guest VM. Therefore, existing kernel drivers and applications that use InfiniBand will be able to run without any modification. Although in theory a driver or an application can bypass the VERBS interface and talk to InfiniBand devices directly, this seldom happens because it leads to poor portability: different InfiniBand devices may have different hardware interfaces.
4 Prototype Design and Implementation
In this section, we present the design and implementation of Xen-IB, our InfiniBand virtualization driver for Xen. We describe details of the design and how we enable accessing the HCA from guest domains directly for time-critical tasks.
4.1 Overview
Like many other device drivers, InfiniBand drivers cannot have multiple instantiations for a single HCA. Thus, a split driver model approach is required to share a single HCA among multiple Xen domains. Figure 6 illustrates a basic design of our Xen-IB driver. The backend runs as a kernel daemon on top of the native InfiniBand driver in the isolated device domain (IDD), which is domain0 in our current implementation. It waits for incoming requests from the frontend drivers in the guest domains. The frontend driver, which corresponds to the guest module mentioned in Section 3, replaces the kernel HCA driver in the OpenIB Gen2 stack. Once the frontend is loaded, it establishes two event channels with the backend daemon. The first channel, together with shared memory pages, forms a device channel [13] which is used to process requests initiated from the guest domain. The second channel is used for sending InfiniBand CQ and QP events to the guest domain and will be discussed in detail later.
Figure 6: The Xen-IB driver structure with the split driver model
The Xen-IB frontend driver provides the same set of interfaces as a normal Gen2 stack for kernel modules. It is a relatively thin layer whose tasks include packing a request together with necessary parameters and sending it to the backend through the device channel. The backend driver reconstructs the commands, performs the operation using the native kernel HCA driver on behalf of the guest domain, and returns the result to the frontend driver.
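The request path through the device channel might look like the following sketch. Everything here is an assumption made for illustration: the JSON message format, the function names, and the stand-in `native_create_cq` are hypothetical, and the real Xen-IB driver passes requests through shared memory pages rather than Python calls.

```python
import json

# Toy frontend/backend request path over a "device channel".
# Message format and all names are illustrative assumptions.

def native_create_cq(entries):
    """Stand-in for the native kernel HCA driver in the IDD."""
    return {"cq_handle": 1, "entries": entries}

NATIVE_OPS = {"create_cq": native_create_cq}


def backend_handle(message):
    """Backend: reconstruct the command, run it natively, return the result."""
    request = json.loads(message)
    op = NATIVE_OPS.get(request["op"])
    if op is None:
        return json.dumps({"ok": False, "error": "unknown op"}).encode()
    return json.dumps({"ok": True, "result": op(**request["args"])}).encode()


class DeviceChannel:
    """Models shared pages plus an event channel as one synchronous call."""

    def __init__(self, backend):
        self.backend = backend

    def call(self, message):
        return self.backend(message)


def frontend_request(channel, op, **args):
    """Frontend: pack the request with its parameters and wait for the reply."""
    packed = json.dumps({"op": op, "args": args}).encode()
    reply = json.loads(channel.call(packed))
    if not reply["ok"]:
        raise RuntimeError(reply["error"])
    return reply["result"]


channel = DeviceChannel(backend_handle)
result = frontend_request(channel, "create_cq", entries=64)
```

The frontend stays thin: it never touches the device, it only marshals arguments and interprets replies.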
The split device driver model in Xen poses difficulties for user-level direct HCA access in Xen guest domains. To enable VMM-bypass, we need to let guest domains have direct access to certain HCA resources such as the UARs and the QP/CQ buffers.
4.2 InfiniBand Privileged Accesses
In the following, we discuss in general how we support all privileged InfiniBand operations, including initialization, InfiniBand resource management, memory registration, and event handling.
Initialization and resource management: Before applications can communicate using InfiniBand, they must finish several preparation steps, including opening the HCA, creating CQs, creating QPs, and modifying QP status. Those operations are usually not in the time-critical path and can be implemented in a straightforward way. Basically, the guest domains forward these commands to the device driver domain (IDD) and wait for the acknowledgments after the operations are completed. All the resources are managed in the backend, and the frontends refer to these resources by handles. Validation checks must be conducted in the IDD to ensure that all references are legal.
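One way to picture handle-based resource management with validation is a table in the backend that records which domain created each resource. The class `ResourceTable` and its methods are hypothetical; the paper does not describe the prototype's data structures at this level.

```python
# Toy backend resource table: frontends hold opaque handles, and every
# reference is validated against the owning domain. Names are assumptions.

class ResourceTable:
    def __init__(self):
        self._next_handle = 1
        self._resources = {}   # handle -> (owning domain id, resource)

    def create(self, domid, resource):
        handle = self._next_handle
        self._next_handle += 1
        self._resources[handle] = (domid, resource)
        return handle

    def lookup(self, domid, handle):
        owner, resource = self._resources.get(handle, (None, None))
        if owner != domid:
            # Covers both unknown handles and handles owned by another domain.
            raise PermissionError(f"domain {domid} may not use handle {handle}")
        return resource


table = ResourceTable()
cq_handle = table.create(domid=1, resource="CQ for domain 1")
qp_handle = table.create(domid=2, resource="QP for domain 2")

own_resource = table.lookup(domid=1, handle=cq_handle)
```

A guest that guesses another domain's handle gets an error instead of access, which is the kind of check the IDD has to perform on every forwarded command.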
Memory Registration: The InfiniBand specification requires all the memory regions involved in data transfers to be registered with the HCA. With Xen's para-virtualization approach, real machine addresses are directly visible to user domains. (Note that access control is still achieved because Xen makes sure a user domain cannot arbitrarily map a machine page.) Thus, a domain can easily figure out the DMA addresses of buffers and there is no extra need for address translation (assuming that no IOMMU is used). The information needed by memory registration is a list of DMA addresses that describes the physical locations of the buffers, the access flags, and the virtual address that the application will use when accessing the buffers. Again, the registration happens in the device domain. The frontend driver sends the above information to the backend driver and gets back the local and remote keys. Note that since the Translation and Protection Table (TPT) on the HCA is indexed by keys, multiple guest domains are allowed to register with the same virtual address.
For security reasons, the backend driver can verify whether the frontend driver offers valid DMA addresses belonging to the specific domain in which it is running. This check makes sure that all later communication activities of guest domains are within the valid address spaces.
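The ownership check on DMA addresses could be as simple as the sketch below. The function name, the representation of a domain's memory as a set of machine frame numbers, and the decision to reject empty lists are all assumptions made for illustration.

```python
PAGE_SIZE = 4096

# Toy check: accept a registration only if every DMA address lies in a
# machine frame owned by the requesting domain. Names are assumptions.

def validate_dma_list(domain_frames, dma_addrs):
    """Return True only for a non-empty list of addresses inside owned frames."""
    if not dma_addrs:
        return False
    return all((addr // PAGE_SIZE) in domain_frames for addr in dma_addrs)


owned_frames = {0x120, 0x121}   # machine frames assigned to this guest

ok = validate_dma_list(owned_frames, [0x120000, 0x121800])
rejected = validate_dma_list(owned_frames, [0x120000, 0x999000])
```

Here the second request is refused because `0x999000` falls in a frame the guest does not own.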
Event Handling: InfiniBand supports several kinds of CQ and QP events. The most commonly used is the completion event. Event handlers are associated with CQs or QPs when they are created. An application can subscribe for event notification by writing a command to the UAR page. When those subscribed events happen, the HCA driver will first be notified by the HCA and then dispatch the event to different CQs or QPs according to the event type. Then the application/driver that owns the CQ/QP will get a callback on the event handler.
For Xen-IB, events are generated for the device domain, where all QPs and CQs are actually created. But the device domain cannot directly give a callback on the event handlers in the guest domains. To address this issue, we create a dedicated event channel between a frontend and the backend driver. The backend driver associates a special event handler with each CQ/QP created due to requests from guest domains. Each time the HCA generates an event for these CQs/QPs, this special event handler gets executed and forwards information such as the event type and the CQ/QP identifier to the guest domain through the event channel. The frontend driver binds an event dispatcher as a callback handler to one end of the event channel after the channel is created. The event handlers given by the applications are associated with the CQs or QPs after they are successfully created. The frontend driver also maintains a translation table between the CQ/QP identifiers and the actual CQs/QPs. Once the event dispatcher gets an event notification from the backend driver, it checks the identifier and gives the corresponding CQ/QP a callback on the associated handler.
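The forwarding path for events, from the backend's special handler to the frontend's dispatcher, can be sketched as follows. `BackendForwarder` and `FrontendDispatcher` are hypothetical names, and the event channel is modeled as a direct function call.

```python
# Toy event forwarding: the backend's special handler forwards
# (identifier, event type) and the frontend dispatcher looks up the
# application's handler. All names are illustrative assumptions.

class FrontendDispatcher:
    def __init__(self):
        self.table = {}   # CQ/QP identifier -> application event handler

    def register(self, ident, handler):
        self.table[ident] = handler

    def on_event(self, ident, event_type):
        """Bound to the guest end of the event channel."""
        handler = self.table.get(ident)
        if handler is None:
            return False          # no such CQ/QP in this guest; drop the event
        handler(event_type)
        return True


class BackendForwarder:
    def __init__(self, send_to_guest):
        self.send_to_guest = send_to_guest   # models the event channel

    def special_handler(self, ident, event_type):
        """Runs in the IDD whenever the HCA raises an event for a guest CQ/QP."""
        return self.send_to_guest(ident, event_type)


seen = []
dispatcher = FrontendDispatcher()
dispatcher.register(("cq", 4), seen.append)

backend = BackendForwarder(dispatcher.on_event)
delivered = backend.special_handler(("cq", 4), "completion")
dropped = backend.special_handler(("qp", 9), "error")
```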
4.3 VMM-Bypass Accesses
In InfiniBand, QP accesses (posting descriptors) include writing WQEs to the QP buffers and ringing doorbells (writing to UAR pages) to notify the HCA. Then the HCA can use DMA to transfer the WQEs to internal HCA memory and perform the send/receive or RDMA operations. Once a work request is completed, the HCA will put a completion entry (CQE) in the CQ buffer. In InfiniBand, QP access functions are used for initiating communication. To detect completion of communication, CQ polling can be used. QP access and CQ polling functions are typically used in the critical path of communication. Therefore, it is very important to optimize their performance by using VMM-bypass. The basic architecture of the VMM-bypass design is shown in Figure 7.
Figure 7: VMM-Bypass design of Xen-IB driver
Supporting VMM-bypass for QP access and CQ polling imposes two requirements on our design of Xen-IB: first, UAR pages must be accessible from a guest domain; second, both QP and CQ buffers should be directly visible in the guest domain.
When a frontend driver is loaded, the backend driver allocates a UAR page and returns its page frame number (machine address) to the frontend. The frontend driver then remaps this page to its own address space so that it can directly access the UAR in the guest domain to serve requests from the kernel drivers. (We have applied a small patch to Xen to enable access to I/O pages in guest domains.) In the same way, when a user application starts, the frontend driver applies for a UAR page from the backend and remaps the page to the application's virtual memory address space, which can be later accessed directly from the user space. Since all UARs are managed in a centralized manner in the IDD, there will be no conflicts between UARs in different guest domains.
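Centralized UAR management can be pictured as a single allocator in the IDD that hands out each page at most once. The class `UARAllocator`, the frame numbers, and the pool size are hypothetical, and the remapping step in the guest is not modeled.

```python
# Toy centralized UAR allocator living in the IDD. Because one allocator
# hands out every page, no two requesters receive the same UAR.
# Names and frame numbers are illustrative assumptions.

class UARAllocator:
    def __init__(self, first_frame, count):
        self.free = list(range(first_frame, first_frame + count))
        self.owner = {}   # page frame number -> (domain id, requester)

    def allocate(self, domid, requester):
        if not self.free:
            raise RuntimeError("no UAR pages left")
        frame = self.free.pop(0)
        self.owner[frame] = (domid, requester)
        return frame      # the frontend would remap this frame locally


uars = UARAllocator(first_frame=0xF000, count=3)
kernel_uar = uars.allocate(domid=1, requester="frontend driver")
app_uar = uars.allocate(domid=1, requester="user application")
other_uar = uars.allocate(domid=2, requester="frontend driver")
```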
To make QP and CQ buffers accessible to guest domains, creating CQs/QPs has to go through two stages. In the first stage, QP or CQ buffers are allocated in the guest domains and registered through the IDD. During the second stage, the frontend sends the CQ/QP creation commands to the IDD along with the keys returned from the registration stage to complete the creation process. Address translations are indexed by keys, so in later operations the HCA can directly read WQRs from and write the CQEs back to the buffers (using DMA) located in the guest domains.
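The two-stage creation can be sketched with a toy backend in which the guest allocates the buffer, registers it to obtain a key, and then creates the CQ by presenting that key. `ToyBackend` and its methods are hypothetical, a Python list stands in for the guest's CQ buffer, and DMA is modeled as an append to that list.

```python
# Toy two-stage CQ creation. Stage 1: the guest-allocated buffer is
# registered and a key is returned. Stage 2: the CQ is created from that key.
# All names are illustrative assumptions.

class ToyBackend:
    def __init__(self):
        self._next_key = 1
        self.regions = {}   # key -> (domain id, guest buffer)
        self.cqs = {}       # CQ handle -> guest buffer the HCA writes into

    def register_buffer(self, domid, buf):
        key = self._next_key
        self._next_key += 1
        self.regions[key] = (domid, buf)
        return key

    def create_cq(self, domid, key):
        owner, buf = self.regions.get(key, (None, None))
        if owner != domid:
            raise PermissionError("key does not belong to this domain")
        self.cqs[key] = buf
        return key          # reuse the key as the CQ handle in this sketch

    def hca_complete(self, cq_handle, cqe):
        # Models the HCA writing a CQE into the guest's buffer via DMA.
        self.cqs[cq_handle].append(cqe)


backend = ToyBackend()
guest_cq_buffer = []                                         # allocated in the guest
key = backend.register_buffer(domid=1, buf=guest_cq_buffer)  # stage 1
cq = backend.create_cq(domid=1, key=key)                     # stage 2

backend.hca_complete(cq, "cqe-1")
polled = guest_cq_buffer[0]   # the guest reads its own buffer, no backend call
```

After creation, completions land in memory the guest already owns, so polling needs no trip through the IDD.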
Since we also allocate UARs to user space applications in guest domains, the user level InfiniBand library keeps its OS-bypass feature. The VMM-bypass Xen-IB workflow is illustrated in Figure 8. It should be noted that since VMM-bypass accesses directly interact with the HCA, they are usually hardware dependent and the frontends need to know how to deal with different types of InfiniBand HCAs. However, existing InfiniBand drivers and user-level libraries already include code for direct access, and it can be reused without new development effort.
Figure 8: Working flow of the VMM-bypass Xen-IB driver
4.4 Virtualizing InfiniBand Management Operations
In an InfiniBand network, management and administrative tasks are achieved through the use of Management Datagrams (MADs). MADs are sent and received just like normal InfiniBand communication, except that they must use two well-known queue-pairs: QP0 and QP1. Since there is only one set of such queue pairs in every HCA, access to them must be virtualized so that many different VMs can use them, which means we must treat them differently than normal queue-pairs. However, since queue-pair accesses can be done directly in guest VMs in our VMM-bypass approach, it would be very difficult to track each queue-pair access and take different actions based on whether it is a management queue-pair or a normal one.

To address this difficulty, we use the idea of high-level virtualization. This is based on the fact that although MAD is the basic mechanism for InfiniBand management, applications and kernel drivers seldom use it directly. Instead, different management tasks are achieved through more user-friendly and standard API sets which are implemented on top of MADs. For example, the kernel IPoIB protocol makes use of the subnet administration (SA) services, which are offered through a high-level, standardized SA API. Therefore, instead of tracking each queue-pair access, we virtualize management functions at the API level by providing our own implementation for guest VMs. Most functions can be implemented in a similar manner as privileged InfiniBand operations, which typically includes sending a request to the backend driver, executing the request (backend), and getting a reply. Since management functions are rarely in time-critical paths, the implementation will not bring any significant performance degradation. However, it does require us to implement every function provided by all the different management interfaces. Fortunately, there are only a couple of such interfaces and the implementation effort is not significant.
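
The request/execute/reply pattern used for API-level virtualization can be sketched as follows. This is only an illustration under stated assumptions: `DeviceChannel`, `backend_serve`, `frontend_call`, and the `sa_path_query` operation name are hypothetical, and the backend is invoked synchronously here, whereas in Xen-IB it runs in a separate domain behind the inter-domain device channel.

```python
from collections import deque


class DeviceChannel:
    """Hypothetical inter-domain channel: a request queue and a reply queue."""

    def __init__(self):
        self.requests = deque()
        self.replies = deque()


def backend_serve(channel, handlers):
    """Backend side: execute every pending request and queue its reply."""
    while channel.requests:
        op, args = channel.requests.popleft()
        channel.replies.append(handlers[op](*args))


def frontend_call(channel, run_backend, op, *args):
    """Frontend side: every management function follows this same shape."""
    channel.requests.append((op, args))  # 1. send the request
    run_backend()                        # 2. backend executes it
    return channel.replies.popleft()     # 3. get the reply


# Hypothetical management function implemented in the backend.
HANDLERS = {"sa_path_query": lambda dst: {"dst": dst, "reachable": True}}
```

A guest-side stub for one management function then reduces to a single `frontend_call`, which is why the per-function implementation effort stays small.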

5 Discussions

In this section, we discuss issues related to our prototype implementation, such as how safe device access is ensured, how performance isolation between different VMs can be achieved, and challenges in implementing VM check-pointing and migration with VMM-bypass. We also point out several limitations of our current prototype and how we can address them in the future.

5.1 Safe Device Access

To ensure that accesses to virtual InfiniBand devices by different VMs will not compromise system integrity, we need to make sure that both privileged accesses and VMM-bypass accesses are safe. Since all privileged accesses need to go through the backend module, access checks are implemented there to guarantee safety.

[Annual Tech '06: 2006 USENIX Annual Technical Conference, USENIX Association]

VMM-bypass operations are achieved through accessing the memory-mapped UAR pages which contain virtual access points. Setting up these mappings is privileged and can be checked. InfiniBand allows using both virtual and physical addresses for sending and receiving messages or carrying out RDMA operations, as long as a valid memory key is presented. Since the key is obtained through InfiniBand memory registration, which is also a privileged operation, we implement necessary safety checks in the backend module to ensure that a VM can only carry out valid memory registration operations. It should be noted that once a memory buffer is registered, its physical memory pages cannot be reclaimed by the VMM. Therefore, we should limit the total size of buffers that can be registered by a single VM. This limit check can also be implemented in the backend module.
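
A per-VM limit check of this kind could look like the sketch below. The class name, the byte-based accounting, and the boolean return value are assumptions made for illustration; the paper does not describe how the check is implemented.

```python
class RegistrationQuota:
    """Hypothetical backend-side accounting of registered bytes per VM."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used = {}  # vm_id -> bytes currently registered (pinned)

    def register(self, vm_id, nbytes):
        """Admit the registration only if the VM stays within its limit."""
        used = self.used.get(vm_id, 0)
        if nbytes <= 0 or used + nbytes > self.limit_bytes:
            return False
        self.used[vm_id] = used + nbytes
        return True

    def unregister(self, vm_id, nbytes):
        """Give the bytes back when a buffer is deregistered."""
        self.used[vm_id] = max(0, self.used.get(vm_id, 0) - nbytes)
```

Since every registration passes through the backend, a rejected request never reaches the HCA and the corresponding pages remain reclaimable by the VMM.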

Memory registration is an expensive operation in InfiniBand. In our virtual InfiniBand implementation, memory registration cost is even higher due to inter-domain communication. This may lead to performance degradation in cases where buffers cannot be registered in advance. Techniques such as a pin-down cache can be applied when buffers are reused frequently, but they are not always effective. To address this issue, some existing InfiniBand kernel drivers create and use a DMA key through which all physical pages can be accessed. Currently, our prototype supports DMA keys. However, this leaves a security hole because all physical memory pages (including those belonging to other VMs) can be accessed. In the future, we plan to address this problem by letting the DMA keys only authorize access to physical pages in the current VM. However, this also means that we need to update the keys whenever the VMM changes the physical pages allocated to a VM.

5.2 Performance Isolation

Although our current prototype does not yet implement performance isolation or QoS among different VMs, this issue can be addressed by taking advantage of QoS mechanisms which are present in the current hardware. For example, Mellanox InfiniBand HCAs support a QoS scheme in which a weighted round-robin algorithm is used to schedule different queue-pairs. In this scheme, QoS policy parameters are assigned when queue-pairs are created and initialized. After that, the HCA hardware is responsible for taking necessary steps to ensure QoS policies. Since queue-pair creations are privileged, we can create desired QoS policies in the backend when queue-pairs are created. These QoS policies will later be enforced by device hardware. We plan to explore more along this direction in the future.
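
To make the scheduling idea concrete, the following is a generic weighted round-robin sketch. It is a textbook formulation, not the algorithm implemented in Mellanox HCAs; the function name and the representation of queue-pairs as lists of pending work requests are assumptions for illustration.

```python
from collections import deque


def weighted_round_robin(pending, weights):
    """Return the order in which work requests are served.

    pending: dict mapping a queue-pair name to its list of work requests.
    weights: dict mapping a queue-pair name to a positive integer weight,
             fixed when the queue-pair is created.
    """
    queues = {name: deque(items) for name, items in pending.items()}
    order = []
    while any(queues.values()):
        for name, queue in queues.items():
            # Each queue-pair may issue up to `weight` requests per round.
            for _ in range(weights[name]):
                if not queue:
                    break
                order.append(queue.popleft())
    return order
```

With weights of 2 and 1, the first queue-pair is served twice as often as the second while both have work pending, which is the kind of policy the backend could assign at queue-pair creation time.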

5.3 VM Check-pointing and Migration

VMM-bypass I/O poses new challenges for implementing VM check-pointing and migration. This is due to two reasons. First, the VMM does not have complete knowledge of VMs with respect to device accesses. This is in contrast to traditional device virtualization approaches, in which the VMM is involved in every I/O operation and can easily suspend and buffer these operations when check-pointing or migration starts. The second problem is that VMM-bypass I/O exploits intelligent devices which can store a large part of the VM system state. For example, an InfiniBand HCA has onboard memory which stores information such as registered buffers, queue-pair data structures, and so on. Some of the state information on an HCA can only be changed as a side effect of VERBS function calls; the HCA does not allow it to be changed in an arbitrary way. This makes check-pointing and migration difficult, because when a VM is restored from a previous checkpoint or migrated to another node, the corresponding state information on the HCA also needs to be restored.

There are two directions to address the above problems. The first one is to involve VMs in the process of check-pointing and migration. For example, the VMs can bring themselves to some determined states which simplify check-pointing and migration. Another way is to introduce some hardware/firmware changes. We are currently working on both directions.

6 Performance Evaluation

In this section, we first evaluate the performance of our Xen-IB prototype using a set of InfiniBand layer micro-benchmarks. Then, we present performance results for the IPoIB protocol based on Xen-IB. We also provide performance numbers of MPI on Xen-IB at both micro-benchmark and application levels.

6.1 Experimental Setup

Our experimental testbed is an InfiniBand cluster. Each system in the cluster is equipped with dual Intel Xeon 3.0 GHz CPUs, 2 GB memory and a Mellanox MT23108 PCI-X InfiniBand HCA. The PCI-X buses on the systems are 64 bit and run at 133 MHz. The systems are connected with an InfiniScale InfiniBand switch. The operating systems are RedHat AS4 with the 2.6.12 kernel. Xen 3.0 is used for all our experiments, with each guest domain running with a single virtual CPU and 512 MB memory.

6.2 InfiniBand Latency and Bandwidth

In this subsection, we compare user-level latency and bandwidth performance between Xen-IB and native InfiniBand. Xen-IB results were obtained from two guest domains on two different physical machines. Polling was used for detecting completion of communication.

The latency tests were carried out in a ping-pong fashion. They were repeated many times and the average half round-trip time was reported as one-way latency. Figures 9 and 10 show the latency for InfiniBand RDMA write and send/receive operations, respectively. There is very little performance difference between Xen-IB and native InfiniBand. This is because in the tests, InfiniBand communication was carried out by directly accessing the HCA from the guest domains with VMM-bypass. The lowest latency achieved by both was around 4.2 µs for RDMA write and 6.6 µs for send/receive.
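
The way one-way latency is derived from a ping-pong test can be sketched as below. The function name and the injectable `clock` parameter are assumptions for illustration; `ping_pong` stands for one full round trip (send a message and wait for the reply), which in the actual benchmark is performed with InfiniBand operations.

```python
import time


def one_way_latency(ping_pong, iterations, clock=time.perf_counter):
    """Average half round-trip time over `iterations` ping-pong exchanges."""
    start = clock()
    for _ in range(iterations):
        ping_pong()  # one message out, one message back
    elapsed = clock() - start
    return elapsed / iterations / 2.0
```

For instance, 1000 round trips that take 8.4 ms in total correspond to a one-way latency of 4.2 µs.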

In the bandwidth tests, a sender sent a number of messages to a receiver and then waited for an acknowledgment. The bandwidth was obtained by dividing the number of bytes transferred from the sender by the elapsed time of the test. From Figures 11 and 12, we again see virtually no difference between Xen-IB and native InfiniBand. Both of them were able to achieve bandwidth up to 880 MByte/s, which was limited by the bandwidth of the PCI-X bus.
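
The bandwidth calculation, and a rough check of the PCI-X limit, can be written out as follows. The helper names are illustrative. The peak figure is simple arithmetic on the 64-bit, 133 MHz bus parameters given in the experimental setup; it ignores protocol overhead, which is an assumption of this sketch rather than a statement from the measurements.

```python
def bandwidth_mbytes_per_s(bytes_sent, elapsed_s):
    """Bytes transferred divided by elapsed time, in million bytes per second."""
    return bytes_sent / elapsed_s / 1e6


def pcix_peak_mbytes_per_s(width_bits=64, clock_mhz=133):
    """Theoretical bus peak: bytes per transfer times transfers per microsecond."""
    return width_bits / 8 * clock_mhz


peak = pcix_peak_mbytes_per_s()   # 8 bytes * 133 MHz = 1064 MByte/s
fraction = 880 / peak             # measured peak as a share of the bus peak
```

Under these assumptions the measured 880 MByte/s is roughly 83% of the theoretical bus peak, which is consistent with the bus, rather than the virtualization layer, being the limiting factor.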

[Figure 9: InfiniBand RDMA Write Latency. Latency (µs) vs. message size (bytes); series: XenIB, Native]

[Figure 10: InfiniBand Send/Receive Latency. Latency (µs) vs. message size (bytes); series: XenIB, Native]

6.3 Event/Interrupt Handling Overhead

The latency numbers we showed in the previous subsection were based on polling schemes. In this section, we characterize the overhead of event/interrupt handling in Xen-IB by showing send/receive latency results with blocking InfiniBand user-level VERBS functions.

[Figure 11: InfiniBand RDMA Write Bandwidth. Throughput (million bytes/s) vs. message size (bytes); series: XenIB, Native]

[Figure 12: InfiniBand Send/Receive Bandwidth. Throughput (million bytes/s) vs. message size (bytes); series: XenIB, Native]

[Figure 13: Inter-domain Communication One Way Latency. Latency (µs) vs. message size (bytes)]

[Figure 14: Send/Receive Latency Using Blocking VERBS Functions. Latency (µs) vs. message size (bytes); series: XenIB, Native]

Compared with native InfiniBand event/interrupt processing, Xen-IB introduces extra overhead because it requires forwarding an event from domain0 to a guest domain, which involves Xen inter-domain communication. In Figure 13, we show performance of Xen inter-domain communication. We can see that the overhead increases with the amount of data transferred. However, even with very small messages, there is an overhead of about 10 µs.

Figure 14 shows the send/receive one-way latency using blocking VERBS. The test is almost the same as the send/receive latency test using polling. The difference is that a process will block and wait for a completion event instead of busy polling on the completion queue. From the figure, we see that Xen-IB has higher latency due to overhead caused by inter-domain communication. For each message, Xen-IB needs to use inter-domain communication twice, once for send completion and once for receive completion. For large messages, we observe that the difference between Xen-IB and native InfiniBand is around 18-20 µs, which is roughly twice the inter-domain communication latency. However, for small messages, the difference is much less. For example, native InfiniBand latency is only 3 µs better for 1 byte messages. This difference gradually increases with message size until it reaches around 20 µs. Our profiling reveals that this is due to "event batching". For small messages, the inter-domain latency is much higher than InfiniBand latency. Thus, when a send completion event is delivered to a guest domain, a reply may have already come back from the other side. Therefore, the guest domain can process two completions with a single inter-domain communication operation, which results in reduced latency. For small messages, event batching happens very often. As message size increases, it becomes less and less frequent and the difference between Xen-IB and native InfiniBand increases.

6.4 Memory Registration

Memory registration is generally a costly operation in InfiniBand. Figure 15 shows the registration time of Xen-IB and native InfiniBand. The benchmark registers and unregisters a chunk of user buffers multiple times and measures the average time for each registration.

As we can see from the graph, Xen-IB consistently adds around 25%-35% overhead to the registration cost. The overhead increases with the number of pages involved in registration. This is because Xen-IB needs to use inter-domain communication to send a message which contains the machine addresses of all the pages. The more pages we register, the bigger the message we need to send to the device domain through the inter-domain device channel. This observation indicates that if registration is a time-critical operation of an application, we need to use techniques such as an efficient implementation of a registration cache [38] to reduce costs.

[Figure 15: Memory Registration Time. Registration time (µs) vs. number of pages; series: Xen-IB, gen2-native]

[Figure 16: IPoIB Netperf Throughput. Throughput (Mbits/sec) vs. message size (bytes); series: XenIB-Native, Native]
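
A registration cache of the kind referred to in Section 6.4 can be sketched as follows. This is a simplified illustration, not the pin-down cache of [38]: it only matches exact (address, length) pairs, uses least-recently-used eviction as an assumed policy, and takes hypothetical `register` and `unregister` callables in place of the real registration operations.

```python
from collections import OrderedDict


class RegistrationCache:
    """Reuse earlier registrations instead of paying the cost again."""

    def __init__(self, register, unregister, capacity):
        self._register = register      # callable(addr, length) -> handle
        self._unregister = unregister  # callable(handle)
        self._capacity = capacity
        self._entries = OrderedDict()  # (addr, length) -> handle, LRU order

    def get(self, addr, length):
        key = (addr, length)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        handle = self._register(addr, length)  # expensive path
        self._entries[key] = handle
        if len(self._entries) > self._capacity:
            _, victim = self._entries.popitem(last=False)
            self._unregister(victim)  # evict the least recently used buffer
        return handle
```

When an application reuses the same buffers, only the first `get` for each buffer pays the registration cost, including the inter-domain message; later calls are served from the cache.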

6.5 IPoIB Performance

IPoIB allows one to run TCP/IP protocol suites over InfiniBand. In this subsection, we compare IPoIB performance between Xen-IB and native InfiniBand using Netperf [2]. For Xen-IB performance, the netperf server is hosted in a guest domain with Xen-IB while the client process is running with native InfiniBand.

Figure 16 illustrates the bulk data transfer rates over a TCP stream using the following command:

    netperf -H $host -l 60 -- -s $size -S $size

Due to the increased cost of interrupt/event processing, we cannot achieve the same throughput when the server is hosted with Xen-IB as with native InfiniBand. However, Xen-IB is still able to reach more than 90% of the native InfiniBand performance for large messages.

We notice that IPoIB achieved much less bandwidth than raw InfiniBand. There are two reasons for this. First, IPoIB uses the InfiniBand unreliable datagram service, which has significantly lower bandwidth than the more frequently used reliable connection service due to the current implementation of Mellanox HCAs. Second, in IPoIB, due to the limit of the MTU, large messages are divided into small packets, which can cause a large number of interrupts and degrade performance.

Figure 17 shows the request/response performance measured by Netperf (transactions/second) using:

    netperf -l 60 -H $host -t TCP_RR -- -r $size,$size

Again, Xen-IB performs worse than native InfiniBand, especially for small messages, where interrupt/event cost plays a dominant role in performance. Xen-IB performance is closer to that of native InfiniBand for large messages.

[Figure 17: Netperf Transaction Test. Transactions/sec vs. message size (bytes); series: XenIB-Native, Native]

[Figure 18: MPI Latency. Latency (µs) vs. message size (bytes); series: XenIB, Native]

6.6 MPI Performance

MPI is a communication protocol used in high performance computing. For tests in this subsection, we have used MVAPICH [27, 20], which is a popular MPI implementation over InfiniBand.

Figures 18 and 19 compare Xen-IB and native InfiniBand in terms of MPI one-way latency and bandwidth. The tests were run between two physical machines in the cluster. Since MVAPICH uses polling for all underlying InfiniBand communication, Xen-IB was able to achieve the same performance as native InfiniBand by using VMM-bypass. The smallest latency achieved by MPI with Xen-IB was 5.4 µs. The peak bandwidth was 870 MBytes/s.

Figure 20 shows performance of the IS, FT, SP and BT applications from the NAS Parallel Benchmarks suite [26] (class A), which is frequently used by researchers in the area of high performance computing. We show normalized execution time based on native InfiniBand. In these tests, two physical nodes were used with two guest domains per node for Xen-IB. For native InfiniBand, two MPI processes were launched for each node. We can see that Xen-IB performs comparably with native InfiniBand, even for communication intensive applications such as IS. Xen-IB performs about 4% worse for FT and around 2-3% better for SP and BT. We believe the difference is due to the fact that MVAPICH uses shared memory communication for processes in a single node. Although MVAPICH with Xen-IB currently does not have this feature, it can be added by taking advantage of the page sharing mechanism provided by Xen.

[Figure 19: MPI Bandwidth. Throughput (million bytes/s) vs. message size (bytes); series: XenIB, Native]

[Figure 20: MPI NAS Benchmarks]

7 Related Work

In Section 2.1, we have discussed current I/O device virtualization approaches such as those in VMware Workstation [37], VMware ESX Server [42], and Xen [13]. All of them require the involvement of the VMM or a privileged VM to handle every I/O operation. In our VMM-bypass approach, many time-critical I/O operations can be executed directly by guest VMs. Since this method makes use of intelligence in modern high speed network interfaces, it is limited to a relatively small range of devices which are used mostly in high-end systems. The traditional approaches can be applied to a much wider range of devices.

OS-bypass is a feature found in user-level communication protocols such as active messages [41], U-Net [40], FM [29], VMMC [6], and Arsenic [33]. Later, it was adopted by the industry [12, 19] and found its way into commercial products [25, 34]. Our work extends the idea of OS-bypass to VM environments. With VMM-bypass, I/O and communication operations can be initiated directly by user space applications, bypassing the guest OS, the VMM, and the device driver VM. VMM-bypass also allows an OS in a guest VM to carry out many I/O operations directly, although virtualizing interrupts still needs the involvement of the VMM.

The idea of direct device access from a VM has been proposed earlier. For example, [7] describes a method to implement direct I/O access from a VM for IBM mainframes. However, it requires an I/O device to be dedicated to a specific VM. The VMM-bypass approach not only enables direct device access, but also allows for safe device sharing among many different VMs. Recently, the industry has started working on standardization of I/O virtualization by extending the PCI Express standard [30] to allow a physical device to present itself as multiple virtual devices to the system [31]. This approach can potentially allow a VM to directly interact with a virtual device. However, it requires building new hardware support into PCI devices, while our VMM-bypass approach is based on existing hardware. At about the same time as we were working on our virtualization support for InfiniBand in Xen, others in the InfiniBand community proposed similar ideas [39, 22]. However, details regarding their implementations are currently not available.

Our InfiniBand virtualization support for Xen uses a para-virtualization approach. As a technique to improve VM performance by introducing small changes in guest OSes, para-virtualization has been used in many VM environments [8, 16, 44, 11]. Essentially, para-virtualization presents a different abstraction to the guest OSes than native hardware, which lends itself to easier and faster virtualization. The same idea can be applied to the virtualization of both CPU and I/O devices. Para-virtualization usually trades compatibility for enhanced performance. However, our InfiniBand virtualization support achieves both high performance and good compatibility by maintaining the same interface as native InfiniBand drivers at a higher level than hardware. As a result, our implementation is able to support existing kernel drivers and user applications. Virtualization at higher levels than native hardware is used in a number of other systems. For example, novel operating systems such as Mach [15], K42 [4], and L4 [18] use OS level API or ABI emulation to support traditional OSes such as Unix and Linux. Several popular VM projects also use this approach [10, 3].

8 Conclusions and Future Work

In this paper, we presented the idea of VMM-bypass, which allows time-critical I/O commands to be processed directly in guest VMs without involvement of a VMM or a privileged VM. VMM-bypass can significantly improve I/O performance in VMs by eliminating the context switching overhead between a VM and the VMM, or between two different VMs, caused by current I/O virtualization approaches. To demonstrate the idea of VMM-bypass, we described the design and implementation of Xen-IB, a VMM-bypass capable InfiniBand driver for the Xen VM environment. Xen-IB runs with current InfiniBand hardware and does not require modification to applications or kernel drivers which use InfiniBand. Our performance evaluations showed that Xen-IB can provide performance close to native hardware under most circumstances, with expected degradation on event/interrupt handling and memory registration.

Currently, we are working on providing check-pointing and migration support for our Xen-IB prototype. We are also investigating how to provide performance isolation by implementing QoS support in Xen-IB. In the future, we plan to study the possibility of introducing VMs into the high performance computing area. We will explore how to take advantage of Xen to provide better support for check-pointing, QoS and cluster management with minimum loss of computing power.

Acknowledgments

We would like to thank Charles Schulz, Orran Krieger, Muli Ben-Yehuda, Dan Poff, Mohammad Banikazemi, and Scott Guthridge of IBM Research for valuable discussions and their support for this project. We thank Ryan Harper and Nivedita Singhvi of IBM Linux Technology Center for extending help to improve the Xen-IB implementation. We also thank the anonymous reviewers for their insightful comments.

This research is supported in part by the following grants and equipment donations to the Ohio State University: Department of Energy's Grant #DE-FC02-01ER25506; National Science Foundation grants #CNS-0403342 and #CCR-0509452; grants from Intel, Mellanox, Sun, Cisco, and Linux Networx; and equipment donations from Apple, AMD, IBM, Intel, Microway, Pathscale, Silverstorm and Sun.

References

[1] IP over InfiniBand Working Group. http://www.ietf.org/html.charters/ipoib-charter.html.
[2] Netperf. http://www.netperf.org.
[3] D. Aloni. Cooperative Linux. http://www.colinux.org.
[4] J. Appavoo, M. Auslander, M. Burtico, D. D. Silva, O. Krieger, M. Mergen, M. Ostrowski, B. Rosenburg, R. W. Wisniewski, and J. Xenidis. K42: an Open-Source Linux-Compatible Scalable Operating System Kernel. IBM Systems Journal, 44(2):427–440, 2005.
[5] R. A. F. Bhoedjang, T. Ruhl, and H. E. Bal. User-Level Network Interface Protocols. IEEE Computer, pages 53–60, November 1998.
[6] M. Blumrich, C. Dubnicki, E. W. Felten, K. Li, and M. R. Mesarina. Virtual-Memory-Mapped Network Interfaces. In IEEE Micro, pages 21–28, Feb. 1995.
[7] T. L. Borden, J. P. Hennessy, and J. W. Rymarczyk. Multiple Operating Systems on One Processor Complex. IBM Systems Journal, 28(1):104–123, 1989.
[8] E. Bugnion, S. Devine, K. Govil, and M. Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. ACM Transactions on Computer Systems, 15(4):412–447, 1997.
[9] P. M. Chen and B. D. Noble. When virtual is better than real. Hot Topics in Operating Systems, pages 133–138, 2001.
[10] J. Dike. User Mode Linux. http://user-mode-linux.sourceforge.net.
[11] B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, and R. Neugebauer. Xen and the Art of Virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles, pages 164–177, October 2003.
[12] D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. Merritt, E. Gronke, and C. Dodd. The Virtual Interface Architecture. IEEE Micro, pages 66–76, March/April 1998.
[13] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson. Safe hardware access with the Xen virtual machine monitor. In Proceedings of OASIS ASPLOS Workshop, 2004.
[14] R. P. Goldberg. Survey of Virtual Machine Research. Computer, pages 34–45, June 1974.
[15] D. B. Golub, R. W. Dean, A. Forin, and R. F. Rashid. UNIX as an application program. In USENIX Summer, pages 87–95, 1990.
[16] K. Govil, D. Teodosiu, Y. Huang, and M. Rosenblum. Cellular disco: resource management using virtual clusters on shared-memory multiprocessors. ACM Transactions on Computer Systems, 18(3):229–262, 2000.
[17] S. M. Hand. Self-paging in the nemesis operating system. In Operating Systems Design and Implementation, USENIX, pages 73–86, 1999.
[18] H. Hartig, M. Hohmuth, J. Liedtke, and S. Schonberg. The performance of micro-kernel-based systems. In Proceedings of the ACM Symposium on Operating Systems Principles, pages 66–77, December 1997.
[19] InfiniBand Trade Association. InfiniBand Architecture Specification, Release 1.2.
[20] J. Liu, J. Wu, S. P. Kini, P. Wyckoff, and D. K. Panda. High Performance RDMA-Based MPI Implementation over InfiniBand. In Proceedings of 17th Annual ACM International Conference on Supercomputing (ICS '03), June 2003.
[21] Mellanox Technologies. http://www.mellanox.com.
[22] Mellanox Technologies. I/O Virtualization with InfiniBand. http://www.xensource.com/company/xensummit.html/XenVirtualizationInfiniBandMellanoxMKagan.pdf.
[23] Mellanox Technologies. Mellanox IB-Verbs API (VAPI), Rev. 1.00.
[24] Mellanox Technologies. Mellanox InfiniBand InfiniHost MT23108 Adapters. http://www.mellanox.com, July 2002.
[25] Myricom, Inc. Myrinet. http://www.myri.com.
[26] NASA. NAS Parallel Benchmarks. http://www.nas.nasa.gov/Software/NPB/.
[27] Network-Based Computing Laboratory. MVAPICH: MPI for InfiniBand on VAPI Layer. http://nowlab.cse.ohio-state.edu/projects/mpi-iba/index.html.
[28] Open InfiniBand Alliance. http://www.openib.org.
[29] S. Pakin, M. Lauria, and A. Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM). In Proceedings of the Supercomputing, 1995.
[30] PCI-SIG. PCI Express Architecture. http://www.pcisig.com.
[31] PCI-SIG. PCI I/O Virtualization. http://www.pcisig.com/newsroom/news/pressreleases/20050606.
[32] I. Pratt. Xen Virtualization. Linux World 2005 Virtualization BOF Presentation.
[33] I. Pratt and K. Fraser. Arsenic: A User-Accessible Gigabit Ethernet Interface. In INFOCOM, pages 67–76, 2001.
[34] Quadrics, Ltd. QsNet. http://www.quadrics.com.
[35] M. Rosenblum and T. Garfinkel. Virtual Machine Monitors: Current Technology and Future Trends. IEEE Computer, pages 39–47, May 2005.
[36] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI–The Complete Reference. Volume 1 - The MPI-1 Core, 2nd edition. The MIT Press, 1998.
[37] J. Sugerman, G. Venkitachalam, and B. H. Lim. Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor. In Proceedings of USENIX, 2001.
[38] H. Tezuka, F. O'Carroll, A. Hori, and Y. Ishikawa. Pin-down cache: A virtual memory management technique for zero-copy communication. In Proceedings of the 12th International Parallel Processing Symposium, 1998.
[39] Voltaire. Fast I/O for Xen using RDMA Technologies. http://www.xensource.com/company/xensummit.html/XenRDMAVoltaireYHaviv.pdf.
[40] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-level Network Interface for Parallel and Distributed Computing. In ACM Symposium on Operating Systems Principles, 1995.
[41] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In International Symposium on Computer Architecture, pages 256–266, 1992.
[42] C. Waldspurger. Memory resource management in VMware ESX Server. In Proceedings of the Fifth Symposium on Operating Systems Design and Implementation, 2002.
[43] A. Whitaker, M. Shaw, and S. Gribble. Denali: Lightweight virtual machines for distributed and networked applications. In Proceedings of the USENIX Annual Technical Conference, Monterey, CA, June 2002.
[44] A. Whitaker, M. Shaw, and S. D. Gribble. Scale and Performance in the Denali Isolation Kernel. In Proceedings of 5th USENIX OSDI, Boston, MA, Dec 2002.