Multiple Flows of Control in Migratable Parallel Programs

Gengbin Zheng, Orion Sky Lawlor*, Laxmikant V. Kale
Department of Computer Science, University of Illinois at Urbana-Champaign
gzheng@uiuc.edu, olawlor@acm.org, kale@uiuc.edu
(*Department of Computer Science, University of Alaska Fairbanks)

Abstract

Many important parallel applications require multiple flows of control to run on a single processor.
In this paper, we present a study of four flow-of-control mechanisms: processes, kernel threads, user-level threads and event-driven objects. Through experiments, we demonstrate the practical performance and limitations of these techniques on a variety of platforms. We also examine migration of these flows-of-control with a focus on thread migration, which is critical for application-independent dynamic load balancing in parallel computing applications. Thread migration, however, is challenging due to the complexity of both user and system state involved. In this paper, we present several techniques to support migratable threads and compare the performance of these techniques.
1 Introduction

When a program consists of many concurrent tasks, it is often simplest to represent these tasks with independent flows of control. Concurrency is required by many applications:

- Large scale scientific and engineering applications such as molecular dynamics simulation code [33], where the computation on each partitioned molecular system cube space can be treated as a separate flow of control.
- Processor virtualization in parallel programming [20, 21], which simplifies development and improves performance by dividing up the work on one physical processor into many "virtual processors", each an independent flow of control.
- Large parallel machine simulation, where separate flows of control can be used to represent each simulated physical processor [40].
- Parallel discrete event simulations, where each simulation object can be treated as a separate flow of control [39].
- Web and other network servers, where communication with each client can be handled by a separate flow of control.

In this paper, we will focus on high-concurrency parallel programs, and use the term "flows" instead of "threads-of-control" to avoid confusion with threads, a particular kind of control flow.
One processor can only execute one flow of control at a time. A flow-of-control may suspend its execution and give control to another flow either because of external reasons, such as an interrupt indicating the end of a time slice; or because of internal reasons, such as an explicit yield or a wait on an event or resource. A suspended flow-of-control can be resumed at a later time.

There are several methods for supporting multiple flows of control. The oldest example of this sort of concurrent programming is coroutines [27]. Some methods are supported by the OS kernel itself, including separate processes as described in Section 2.1, and kernel threads in Section 2.2. Other methods are like coroutines in that they are defined and supported entirely by user code, such as the user-level threads of Section 2.3 and the event-driven objects of Section 2.4. We analyze the performance of these methods in Section 4.

On multiprocessor systems, distributing execution of flows-of-control over multiple processors can exploit parallelism and thus achieve improved performance. However, the load in parallel applications may change over time, and to keep overloaded processors from bottlenecking the overall computation, we must perform load balancing [35]. Although migrating flows-of-control between processors is challenging, it presents a useful way to implement application-independent load balancing [41]. Section 3 describes obstacles to and techniques for migrating the flows of control across processors.
2 Mechanisms for Supporting Multiple Flows of Control

A flow of control is a single sequential stream of executed instructions, including subroutine calls and returns. Normal machines directly represent a flow of control as the set of machine registers and the machine's execution stack. The scheduling of multiple flows of control on a processor can be supported in either OS kernel or user code. The basic methods for flows of control examined in this paper include processes, kernel threads, user-level threads and event-driven objects.
2.1 Processes

The simplest and oldest flow of control is the process, which wraps a complete address space around a flow of control. All modern machines support multiple processes, although some parallel machines or schedulers do not allow them or place restrictions on them. For example, Myricom's GM interface does not allow fork or system calls while Myrinet network ports are open. Other exceptions include the BlueGene/L [2] and ASCI Red [36] microkernels, which do not have conventional virtual memory systems and hence do not support UNIX system calls such as fork, system and exec.

The advantage of the process model is its complete separation of state. Because a process's flow of control is completely walled off within its own separate address space and operating system context, processes fully support separate global variables, memory management, blocking I/O calls, and signals.

However, for parallel programming, the total separation of the process model makes it more difficult to communicate between the various flows of control that make up a program. Processes can only interact using relatively limited, cumbersome methods such as pipes and sockets, or with more difficult explicitly shared memory regions such as SYSV Interprocess Communication (IPC). Furthermore, processes are considered "heavy-weight". The substantial amount of per-process kernel state increases the amount of memory used by each process, and increases the overhead of process creation and switching. Worse, some operating systems have a fixed and fairly low limit on the number of processes that can be created. Table 2 in Section 4 summarizes these practical limitations on several stock systems.
2.2 Kernel Threads

Kernel threads consist of a set of registers, a stack, and a few corresponding kernel data structures. When kernel threads are used, the operating system will have a descriptor for each thread belonging to a process and it will schedule all the threads. Unlike processes, all threads within a process share the same address space. Similar to processes, when a kernel thread makes a blocking call, only that thread blocks. All modern machines support kernel threads, most often via the POSIX threads interface "pthreads". Some dedicated parallel machines support kernel threads poorly or not at all. For example, the BlueGene/L microkernel does not support pthreads.

The purported advantage of kernel threads over processes is faster creation and context switching. For shared-memory multiprocessor architectures, the kernel is able to dispatch threads of one process on several processors, which leads to automatic load balancing within the nodes. For parallel programming, threads allow different parts of the parallel program to communicate by directly accessing each others' memory, which allows very efficient, fine-grained communication.

Kernel threads share a single copy of the entire address space, including regions such as global data that may cause conflicts if used by multiple threads simultaneously. Threads can also cause unintentional data sharing, which leads to corruption and race conditions. To avoid this unintentional sharing, programs must often be modified to either lock or access separate copies of common data structures. Several very widely used language features are unsafe when used with threads, such as the use of global and static variables, or the idiom of returning a reference to a static buffer. Especially with large existing code bases with many global variables, this makes kernel threads very difficult to use, because in most implementations of kernel threads it is not possible to assign each thread a private set of global variables.

Kernel threads are considered "lightweight," and one would expect the number of threads to be limited only by address space and processor time. Since every thread needs only a stack and a small data structure describing the thread, in principle this limit should not be a problem. But in practice, we found that many platforms impose hard limits on the maximum number of pthreads that can be created in a process. Table 2 in Section 4 shows the practical limitations on pthreads on several stock systems.

In particular, operating system kernels tend to see kernel threads as a special kind of process rather than a unique entity. For example, in Solaris, kernel threads are called "lightweight processes" (LWP's). Linux actually creates kernel threads using a special variation of fork called "clone," and until recently gave each thread a separate process ID. Because of this heritage, in practice kernel threads tend to be closer in memory and time cost to processes than user-level threads, although recent work has made some progress in closing the gap, including K42 [5] and the Native POSIX Threading Library (NPTL) and Linux O(1) scheduler.
2.3 User-Level Threads

Like a kernel thread, a user-level thread includes a set of registers and a stack, and shares the entire address space with the other threads in the enclosing process. Unlike a kernel thread, however, a user-level thread is handled entirely in user code, usually by a special library that provides at least start, swap and suspend calls. Because the OS is unaware of a user-level thread's existence, a user-level thread cannot separately receive signals or use operating system scheduling calls such as sleep(). Many implementations of user-level threads exist, including: GNU Portable Threads (Pth) [1], FreeBSD's userland threads, QuickThreads [26] and those developed by us for the Charm++ system [25].

The primary advantages of user-level threads are efficiency and flexibility. Because the operating system is not involved, user-level threads can be made to use very little memory, and can be created and scheduled very quickly. User-level threads are also more flexible because the thread scheduler is in user code, which makes it much easier to schedule threads in an intelligent fashion; for example, the application's priority structure can be directly used by the thread scheduler [9].

The primary disadvantage of user-level threads compared to kernel threads is the lack of operating system support. For example, when a user-level thread makes a blocking call, the kernel does not start running another user-level thread. Instead, the kernel suspends the entire calling kernel thread or process, even though another user-level thread might be ready to run. To avoid this blocking problem, some systems such as AIX and Solaris support "N:M" thread scheduling, which maps some number N of application threads onto a (usually smaller) number M of kernel entities. There are two parties, the kernel and the user parts of the thread system, involved in each thread operation for N:M threading, which is complex. The blocking problem can also be avoided by building a smarter runtime layer which intercepts blocking calls, replaces them with a non-blocking call, and starts another user-level thread while the call proceeds [1]. Yet another approach is to provide support in the kernel to notify the user-level scheduler when a thread blocks, often called "scheduler activations" [3, 38].

Since user-level threads are controlled in user code, there is virtually no limit on the maximum number of threads as long as the resources allow. In practice, one can create 50,000 user-level threads on a Linux box very easily (see Table 2).
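To make the start/swap/suspend interface above concrete, the following is a minimal sketch of a user-level thread built on the POSIX ucontext calls. It is an illustration only, not the Converse/Cth or AMPI implementation, and the uthread_* names are hypothetical.

```c
/* Minimal user-level thread sketch built on POSIX ucontext(3).
 * Illustrative only; not the Cth/AMPI implementation. */
#include <ucontext.h>
#include <stdlib.h>

#define STACK_BYTES (64 * 1024)

typedef struct {
    ucontext_t ctx;
} uthread_t;

static ucontext_t scheduler_ctx;   /* where threads return when they yield */

void uthread_create(uthread_t *t, void (*fn)(void)) {
    getcontext(&t->ctx);                        /* initialize the context      */
    t->ctx.uc_stack.ss_sp   = malloc(STACK_BYTES);
    t->ctx.uc_stack.ss_size = STACK_BYTES;
    t->ctx.uc_link          = &scheduler_ctx;   /* run the scheduler on return */
    makecontext(&t->ctx, fn, 0);
}

void uthread_yield(uthread_t *self) {
    /* Save our registers and stack pointer, resume the scheduler. */
    swapcontext(&self->ctx, &scheduler_ctx);
}

void uthread_resume(uthread_t *t) {
    /* Called from the scheduler: save its context, run the thread. */
    swapcontext(&scheduler_ctx, &t->ctx);
}
```

A scheduler would simply call uthread_resume in a loop over runnable threads; everything happens in user space, which is why creation and switching can be so cheap.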
2.4 Event-Driven Objects

Rather than storing and restoring machine registers to pause and resume execution, the program can be divided into pieces, each of which manually stores and restores its state. A single "scheduler" routine then provides the glue to execute these pieces in the appropriate sequence. For example, rather than making a blocking call, the event-driven style would post an asynchronous request, then explicitly return control back to the scheduler. Once the request completes, the scheduler will call the object again to resume execution. We have explored this idea in the Charm++ runtime system [24] extensively.

This "event-driven" style avoids operating system and other low-level problems completely. Because suspending and resuming execution is simply a function call, the event-driven style can also be very efficient. However, the event-driven style can be more difficult for programmers. In particular, if it becomes necessary to block from inside a deeply nested subroutine call, all the calling routines must be changed to pass control back up to the scheduler. Also, a reactive and event-driven style of programming obscures the description of the life-cycle of an object. For example, a program would specify that when message A arrives, execute method F; when B arrives, execute G; but it cannot specify that the object is going to repeatedly process A and B messages in alternating sequence k times [32].
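As a minimal illustration of this structure (not the Charm++ scheduler; all names here are invented for the example), an object carries its own saved progress state while a plain loop plays the role of the scheduler:

```c
/* Tiny event-driven skeleton: an object keeps its own progress state and a
 * scheduler loop dispatches queued events to a handler function, which
 * always returns control to the loop.  Purely illustrative. */
#include <stdio.h>

typedef struct Object Object;
struct Object {
    int step;                                /* manually saved progress state */
    void (*handle)(Object *self, int event);
};

static void stencil_handle(Object *self, int event) {
    /* Instead of blocking for a message, record progress and return. */
    printf("object at step %d handles event %d\n", self->step, event);
    self->step++;
}

int main(void) {
    Object obj = { 0, stencil_handle };
    int pending[3] = { 7, 8, 9 };            /* stand-in for arriving messages */
    for (int i = 0; i < 3; i++)              /* the "scheduler" glue loop      */
        obj.handle(&obj, pending[i]);
    return 0;
}
```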
2.4.1 Return-Switch Functions

A C or C++ subroutine can be written in a return-switch style to mimic thread suspend/resume. When the subroutine is "suspended", it returns instead of blocking, with a flag indicating the point it left off. When the subroutine is "resumed", the same subroutine is called with the flag, which can then be used in a "goto" or "switch" statement to resume execution at the point it left off. It is possible to wrap a technique similar to Duff's Device inside a set of macros to make this "save, return, and resume from label" process mostly transparent to the programmer [37]. However, this technique can still be confusing, error-prone and tough to debug.
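The macro technique referenced above can be sketched as follows, assuming a single per-function resume point kept in a static variable; these are illustrative macros in the spirit of [37], not a drop-in library:

```c
/* Sketch of "save, return, and resume from label" macros: the static
 * holds the resume point, and the switch jumps back to it next call. */
#include <stdio.h>

#define CR_BEGIN        static int cr_state = 0; switch (cr_state) { case 0:
#define CR_RETURN(val)  do { cr_state = __LINE__; return (val); \
                             case __LINE__:; } while (0)
#define CR_END          }

/* Produces 1, 2, 3, 1, 2, 3, ... across successive calls. */
int next_value(void) {
    CR_BEGIN;
    for (;;) {
        CR_RETURN(1);
        CR_RETURN(2);
        CR_RETURN(3);
    }
    CR_END;
    return 0; /* not reached */
}

int main(void) {
    for (int i = 0; i < 5; i++)
        printf("%d\n", next_value());   /* prints 1 2 3 1 2 */
    return 0;
}
```

Because the state variable is static, this sketch supports only one instance of the "coroutine"; real uses keep the resume point in a caller-provided structure, and the debugging difficulties noted above remain.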
2.4.2 Structured Dagger

Subroutines can also be suspended without using threads or macros by applying a simple pre-processor. We developed Structured Dagger (SDAG) [22] as a coordination language for expressing the control flow within a Charm++ parallel object naturally, using certain C language-like constructs. In parallel programming, it leads to a style of programming which cleanly separates parallel from sequential functionality. Figure 1 shows an example of a parallel 5-point stencil program with 1-D decomposition and ghost cell exchange written in SDAG.

Figure 1. Sample code in Structured Dagger (the entry method stencilLifeCycle).

In the program, the for loop implements an outer iteration loop. Each iteration begins by calling sendStripToLeftAndRight in an atomic construct (the atomic construct encapsulates sequential C/C++ language code) to send out messages to the neighbors. The overlap immediately following asserts that the two events corresponding to getStripFromLeft and getStripFromRight can occur and be processed in any order. The when construct simply says that when a message (e.g. getStripFromLeft) arrives for the construct, it invokes the atomic action, which calls a plain C++ function copyStripFromLeft to process the message. After both when constructs are executed, the function doWork in the last atomic construct will be invoked and the program enters the next iteration of the for loop. The Structured Dagger preprocessor transforms all this syntax into code for an efficient finite-state machine, which receives and processes the network messages at runtime.

Overall, the event-driven style can be made quite efficient and reasonably easy to program. However, the event-driven style is difficult to apply to existing codes; in particular, most parallel programming interfaces like the Message Passing Interface (MPI) are written in terms of blocking subroutine calls like MPI_Recv. A traditional C-compatible compiled library can only hope to switch tasks at these blocking calls by saving and restoring the machine's stack and registers in a threadlike fashion. Hence for the remainder of this paper, we focus on supporting threads.
3 Migration

Migration, the process of moving work from one processor to another, is a very flexible tool that can be used to solve a variety of problems in parallel computing. For example, migration can be used to improve load balance, by migrating work away from overloaded processors [11, 4, 41, 6]. Migration can improve communication performance, by moving pieces of work that communicate with each other closer together [33]. Migration can allow all the work to be moved off a processor [17] to allow another job to run there [10], or to vacate a node that is expected to fail or be shut down [34]. Migration techniques can also be used to implement checkpoint/restart for fault tolerance [12, 42]; under this model, checkpointing is simply migration to disk or to the local memory of a remote processor. In this section, we describe issues related to the migration of the flows of control. However, we will mainly focus on the thread migration techniques that we developed in the Charm++/AMPI runtime systems.
3.1 Migration State

To migrate a piece of work to a new processor, we must ensure that all needed state information will be available on the new processor. In a fully shared-memory parallel system, this is trivial: because all processors implicitly share all state, migration simply means transferring control to the new processor, which will be able to access the work's state directly. In a distributed-memory parallel system, we must either proactively ship all needed state along with the piece of work, or else lazily ship state as it is needed. Lazy memory state shipping is commonly done by software distributed shared memory systems. State that is shared between several threads clearly cannot simply be shipped to a new location, and so must be either shuffled back and forth at runtime (costing performance) or tightly regulated or banned (costing flexibility). We currently allow shared state only via well-defined interfaces in our own work related to the Charm++ and AMPI runtime.

The state required by a migrating flow-of-control invariably falls into one of three categories:

- Memory state, which includes dynamically allocated data stored in the heap, local variables stored in the stack, and global variables.
- Communication state, which includes pending outgoing and incoming network messages.
- Kernel state, which includes open files, mapped and shared regions of memory, and pending signals.

Most migration systems only support shipping a subset of the possible state, forcing users to manually reconstruct other state. In the case of migratable threads, when several threads in a process can share memory, kernel, and communication state, it can be very difficult to extract the set of resources needed by a single thread. In our Charm++/AMPI runtime system, we only allow different threads to share state in a small number of well-defined ways, which makes it possible to efficiently migrate a single thread to a different address space.
3.1.1 Memory State Migration

Migrating memory state is conceptually quite straightforward: all needed memory is collected and shipped to the new processor. This is complicated in practice by various low-level difficulties.

For heap data, we must collect all the allocated data for a particular piece of work. For object-oriented applications, we have built a convenient and general-purpose framework for describing and shipping complicated user-defined objects, called the PUP (Pack/UnPack) framework [19]. For more general applications, it is possible to provide hooks to the memory allocation routines (malloc) to capture all the memory allocated by each thread.

For global variables, process migration is straightforward because each process image contains a separate copy of the global variables. For thread migration, however, if several threads in a process share global variables, it may be impossible to extract the value needed by each thread. Our current solution is to privatize global variables so that each user-level thread has its own copy of the global variables. Thus migration of global variables is conceptually simplified. In order to support the transparent privatization of global variables for existing code bases, we implemented a swap-global scheme by analyzing the Executable and Linking Format (ELF) object files, in a way similar to the Weaves runtime framework [31]. A dynamically linked ELF-format executable always accesses global variables via the Global Offset Table (GOT), which contains one pointer to each global variable. To make separate copies of the global variables, we then simply make separate copies of the GOT, one for each user-level thread. The thread scheduler then swaps the GOT when switching threads.
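Locating the GOT and counting its entries requires parsing the ELF dynamic section and is omitted here; assuming that has been done, a rough sketch of the per-switch swapping step (illustrative names, not the actual Charm++ swap-global code) looks like this:

```c
/* Rough sketch of GOT swapping at context switch.  got_base and got_count
 * are assumed to be filled in at startup by ELF parsing (not shown);
 * the thread structure is illustrative. */
#include <string.h>

static void **got_base;     /* start of this process's GOT                 */
static size_t got_count;    /* number of GOT entries to privatize          */

typedef struct {
    void **got_image;       /* this thread's private copy of the GOT       */
    /* stack pointer, saved registers, ...                                 */
} uthread_t;

void swap_globals(uthread_t *leaving, uthread_t *entering) {
    size_t bytes = got_count * sizeof(void *);
    memcpy(leaving->got_image, got_base, bytes);    /* save outgoing view   */
    memcpy(got_base, entering->got_image, bytes);   /* install incoming view */
}
```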
3.1.2 Communication State Migration

In our runtime system, migratable entities only communicate via the communication sub-system, which provides location-independent communication that supports migration at any time. We have constructed an efficient communication system that allows object or thread migration with ongoing point-to-point and collective communication. As this system is described in our earlier work [28], we will not describe it in detail here.
3.1.3 Kernel State Migration

Kernel state includes data managed by the OS for the application, for example the state of open files and signals. Kernel state is mostly invisible to users and is platform-dependent, so very few systems support kernel state migration. Instead, kernel state is often treated as a non-migratable portion of a migrating process. For example, Mosix [6] divides a migrating process into a migratable user context and non-migratable system context information. Our runtime system currently does not support migration of kernel state.
3.2 Event-driven Objects

The simplest kind of migration is for event-driven objects [41, 8, 7, 10, 11]. In the Charm++ runtime system, event-driven objects are normally location-independent, requiring little persistent runtime or system state. Because the entire execution state normally consists of a few application data structures and the name of the next event to run, to migrate to a new processor we need only copy these data structures to the new processor and begin executing the next event.
3.3 Process Migration

Because processes provide a well-defined memory, kernel, and communication interface, process migration is an old and widely implemented technique. Since the entire address space is migrated, all the pointers in the user application are still valid on the new processor. Systems that support process migration include Sprite [13], Condor [29], MOSIX [6], and many others, including the recent VMADUMP interface for Linux [14]. An extensive survey is available from the ACM [30], so we will not go into further detail here.
3.4 Thread Migration

Thread migration is very challenging due to the fact that the system and user state of a thread is only one part of an enclosing process. It is often very difficult to extract the state of one thread from the others in a process. For example, heap data allocated by a thread may be shared by multiple threads. In our work, we assume that the user data allocated by one thread is used only by that thread. Data sharing among threads can still be achieved via read-only global variables or fast local message passing via the thread scheduler.

Another difficult aspect of thread migration is the fact that a thread stack contains a large number of pointers (including function return addresses, frame pointers, and pointer variables), and many of these point into the stack itself. Thus, if we copy the thread stack to another processor, all these pointers must be updated to point to the new copy of the stack instead of the old copy. However, because the stack layout is determined by the vagaries of the machine architecture and compiler, there is no simple and portable method by which all these pointers can even be identified, much less changed.

A feasible approach for thread migration, then, is to guarantee that the stack will have exactly the same address on the new processor as it did on the old processor. Because the stack addresses haven't changed, no pointers need to be updated, because all references to the original stack's data remain valid on the new processor. This idea was originally developed for thread migration in the PM2 runtime system [4]. In the following subsections, we present three mechanisms to ensure that the stack's address remains the same after migration. These mechanisms for migration apply to both kernel and user-level threads. However, in this paper we will mainly focus on the migration of the user-level threads supported by the Charm++ and AMPI runtime systems.
3.4.1 Stack Copying Threads

A naive implementation of this approach, called "stack-copying threads", is simply to pick one address for the stack system-wide. The thread scheduler then copies each thread's stack data into this address before executing the thread, and copies it out again when the thread is suspended. Because there is only one address used by all active thread stacks, even on different processors, migrating a thread is simple.

One complicating factor is that modern machines may not always allocate the system stack at the same virtual memory address on each processor. A fixed stack address makes it easier to mount a buffer overflow ("stack smashing") attack, so some systems change the stack's starting address from run to run as a security measure. On such systems, it is thus impossible to use the system stack for stack copying threads because the stack address is different on each machine.

Stack copying threads suffer from the high cost incurred by stack copying. Because the entire stack must be copied for each thread switch, changing threads is quite slow, especially when the stack contains a large amount of data. Further, because there is only one stack location, there can only be one thread active in each address space, which means a machine with two physical processors cannot run two stack-copying threads from the same address space simultaneously.
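A rough sketch of the copy-in/copy-out step, run from the scheduler's own stack rather than the common stack being replaced, is shown below; the fixed address and names are illustrative, not the actual implementation.

```c
/* Sketch of a stack-copying context switch: every thread executes at one
 * system-wide stack address; around each switch the scheduler copies the
 * outgoing thread's stack out and the incoming thread's stack in.
 * Must be run from the scheduler's own stack.  Illustrative only. */
#include <string.h>

#define COMMON_STACK ((char *)0x50000000)   /* agreed-upon stack address */

typedef struct {
    char  *saved_stack;    /* heap buffer holding this thread's stack data */
    size_t stack_bytes;    /* how much of the stack is currently in use    */
} uthread_t;

void stackcopy_switch(uthread_t *leaving, uthread_t *entering) {
    memcpy(leaving->saved_stack, COMMON_STACK, leaving->stack_bytes);   /* save    */
    memcpy(COMMON_STACK, entering->saved_stack, entering->stack_bytes); /* restore */
    /* ...then restore 'entering's registers and jump to its saved PC.   */
}
```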
3.4.2 Isomalloc Thread Migration

The PM2 implementation of "isomalloc" [4] overcomes these disadvantages by allocating a globally unique address for each thread stack. To avoid conflicts between threads, stack addresses must be unique across the entire parallel machine, which makes allocating stacks somewhat complicated. As illustrated in Figure 2, the PM2 approach is to divide up the entire unused virtual memory address space, or "isomalloc region," into per-processor slots which can subsequently be allocated in parallel. This isomalloc region must be agreed upon by all processors at startup; normally the largest space available lies between the process stack and the heap. A processor can then grant any local thread a new globally reserved range of virtual addresses from within that processor's region of the shared address space. The threads can then be confident that they can migrate to any other processor, and their addresses will be free for use on that processor.

This approach can be seen as a sort of distributed shared memory system, in that each thread is using globally unique virtual memory addresses. However, because threads never directly share data, we can eliminate the possibility of DSM page faults by sending all of a thread's data along with the thread at migration time.

Clearly, with n threads per processor, s bytes per thread, and p processors, the isomalloc approach uses at least n·s·p bytes of address space. For 10 threads per processor, 1 MB of data per thread, and 1000 processors, this amounts to 10 gigabytes of address space! But luckily we can use the virtual memory hardware to avoid using such a large amount of physical memory. The system call mmap allows individual pages of program virtual address space to be assigned physical memory, so on each processor we assign physical memory only to the addresses in use by local threads. Addresses used by all remote threads are claimed only in principle, but never actually allocated physical memory unless that remote thread migrates in.

The original PM2 required applications to be modified to call the special memory allocation routines isomalloc and isofree, which was a burden on developers and was not feasible when linking with a third-party library. In our runtime system, we extended this approach by overriding the system malloc/free routines to use the new isomalloc/isofree when called within a thread. Of course, malloc/free called from outside the threading context (e.g. by the communication layer of the runtime system) is still directed to the normal system version of malloc/free. This approach thus allows unmodified applications to use migratable thread memory for their heap data.
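A rough sketch of such slot-based allocation is shown below; the region layout, names, and sizes are illustrative assumptions, not the Charm++ implementation.

```c
/* Sketch of isomalloc-style allocation: the virtual address range
 * [iso_base, iso_base + nprocs*SLOT_BYTES) is agreed on by all processors;
 * processor my_rank hands out addresses only from its own slot and backs
 * them with physical memory via mmap.  All names and sizes are illustrative. */
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

#define SLOT_BYTES (256UL * 1024 * 1024)  /* virtual space owned per processor */
#define PAGE       4096UL

static uintptr_t iso_base;      /* agreed upon at startup                   */
static int       my_rank;       /* this processor's index                   */
static uintptr_t next_offset;   /* bump allocator within this rank's slot   */

void *isomalloc(size_t bytes) {
    uintptr_t addr = iso_base + (uintptr_t)my_rank * SLOT_BYTES + next_offset;
    next_offset += (bytes + PAGE - 1) & ~(PAGE - 1);   /* page-align          */
    /* The same addresses stay merely reserved on every other processor, so a
     * migrating thread's heap and stack pointers remain valid anywhere.     */
    return mmap((void *)addr, bytes, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}
```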
Figure 2. Migrating a thread stack allocated with isomalloc.

Thread       | X86 | IA64  | Opteron | MacOSX | IBM SP | SUN | Alpha | BG/L  | Windows
Stack Copy   | Yes | Maybe | Yes     | Maybe  | Yes    | Yes | Yes   | Maybe | Yes
Isomalloc    | Yes | Yes   | Yes     | Yes    | Yes    | Yes | Yes   | No    | Maybe
Memory Alias | Yes | Yes   | Yes     | Yes    | Yes    | Yes | Yes   | Maybe | Maybe

Table 1. Portability of current implementations of migratable threads. "Yes" means we have implemented this technique. "Maybe" indicates there are no theoretical problems, but no existing implementation. "No" indicates the technique is impossible on this machine.
Since one thread's data is always allocated at the same addresses inside the isomalloc region, a thread can be migrated simply by copying all its data to the new processor; pointers within and between the thread's stack and heap need not be modified. Because we only allocate physical memory to local threads, the physical memory usage on each processor is modest. Unlike stack-copy threads, no data needs to be moved when switching threads, and multiple threads can run simultaneously, which allows the straightforward exploitation of SMP machines.

However, isomalloc stacks have the disadvantage that they consume virtual address space on each processor proportional to the total number of threads on all processors. 32-bit machines only have 4 GB of virtual address space, of which a substantial fraction is already occupied by the operating system, shared libraries, and other machinery. Even if the entire 32-bit address space were available for thread stacks, if each thread uses 1 megabyte, there would only be room for 4,096 threads. 64-bit machines, by contrast, normally have terabytes of virtual memory space available, and so never suffer from this problem.

Unfortunately, machines today continue to be constructed with high processor counts using the simplest, cheapest parts available, which today are 32-bit microprocessors. Large 32-bit x86 Linux clusters are commonplace, and the BlueGene/L machine built by IBM for LLNL consists of 128K processors, each of which is a 32-bit PowerPC derivative. On this kind of machine, virtual address space usage quickly becomes a significant impediment to the use of isomalloc-based migratable threads. This motivated us to design a new approach to reduce the usage of virtual address space, which is described next.
3.4.3 Memory Aliasing Stacks

As a complement to the isomalloc approach, we have designed and implemented an alternative implementation that uses much less virtual address space, and so can scale on large 32-bit machines. The idea is simple: accelerate stack-copying threads by simulating the copy using the virtual memory hardware. Like stack-copying threads, in this approach all stacks are executed from the same address. However, to switch in a new thread, we simply map the new thread's stack onto the pages at the stack address by calling mmap, rather than actually copying the stack data. That is, each thread's data is stored in separate pages of physical memory. To run a thread, we first map the thread's data into the common virtual address space with a memory mapping call as shown in Figure 3, then execute the thread.

Figure 3. Memory-aliasing stacks.

The advantage of this approach is that only one stack-size of virtual address space is used, which allows this technique to be used even on machines with very limited address space. However, since each context switch involves an extra mmap system call, the performance of memory aliasing stacks is not as good as that of the isomalloc-based threads. On the other hand, because no data is actually copied, the performance of memory aliasing stacks is much better than that of stack-copying threads. Like the stack copying threads, this technique has the disadvantage that only one thread is allowed to be active in each address space.
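A rough sketch of the remapping step is shown below, assuming each thread's stack pages live in their own shared-memory object (for example one created with shm_open and ftruncate) so they can be mapped at the common stack address; the fixed address and names are illustrative.

```c
/* Sketch of a memory-aliasing context switch: instead of copying stack
 * data, map the incoming thread's stack pages at the one common stack
 * address.  Assumes each thread's stack is backed by its own file
 * descriptor (e.g. from shm_open).  Illustrative only. */
#include <sys/mman.h>

#define STACK_ADDR  ((void *)0x50000000)
#define STACK_BYTES (1UL << 20)

typedef struct {
    int stack_fd;              /* descriptor backing this thread's stack pages */
    /* saved registers, etc. */
} uthread_t;

void alias_in(uthread_t *entering) {
    /* One mmap call replaces the page mappings at STACK_ADDR; the previous
     * thread's pages remain intact in its own backing object. */
    mmap(STACK_ADDR, STACK_BYTES, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, entering->stack_fd, 0);
}
```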
Section 4.2 compares the performance of these three thread migration techniques.
3.4.4 Thread Migration Portability

Table 1 illustrates the portability of our implementation of each thread migration technique on various platforms. Stack copying threads are portable to many platforms, including even Windows. The only restriction on stack copy threads is the requirement that the system stack frames on all processors begin at the same base address. In practice, however, our implementation of stack copying threads is based on QuickThreads, which has not been ported to the IA64. By contrast, the isomalloc and memory aliasing stack thread migration mechanisms work on all machines except for those where the mmap system call is unavailable, for example BlueGene/L and Windows. Windows supports virtual memory with the MapViewOfFileEx call, and so could be supported with only a small amount of effort. BlueGene/L does not have mmap, but we have shown that our scheme for memory aliasing can be supported by adding a small extension to the BlueGene/L microkernel to allow user processes to remap their heap data over the stack location.
4 Performance

We have examined the practical performance of these methods for implementing migratable flows of control on a variety of real machines. The tests were carried out with both benchmark tests, such as the NAS Parallel Benchmark Multi-Zone version [18], and a real-world application, the BigSim parallel simulator [43].
4.1 Number of Flows

We measured the context-switching performance of four different implementations of flows of control:

- Processes, created using fork() and yielding using sched_yield(). This is an imperfect benchmark, because some operating systems seem to ignore sched_yield() when called repeatedly, resulting in an artificially low measurement of the context switching time. (A rough timing sketch for this case follows this list.)
- Pthreads, created using pthread_create() and yielding using sched_yield().
- Cth (Converse Threads) [23], our implementation of user-level threads, created using CthCreate() and scheduled using CthYield(). We used the non-migratable version of these threads.
- AMPI (Adaptive MPI) [16, 15] user-level threads, created by the AMPI runtime and scheduled using the AMPI routine MPI_Yield(). (AMPI is an implementation of MPI that runs each MPI process in an AMPI thread.) These are migratable threads, implemented using the isomalloc stack allocation approach based on the Cth threads, although no migrations actually occur.
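The rough timing sketch referred to in the first item above, for the process case, is shown here; the pthread and user-level thread variants are analogous. It illustrates the benchmark structure only, and as noted, some kernels ignore repeated sched_yield() calls.

```c
/* Rough sketch of the yield benchmark for processes: fork N flows that
 * repeatedly call sched_yield(), and report elapsed time per flow per
 * switch.  Illustrative only. */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define SWITCHES 100000

int main(int argc, char **argv) {
    int nflows = (argc > 1) ? atoi(argv[1]) : 8;
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < nflows; i++) {
        if (fork() == 0) {                        /* child: one flow of control */
            for (int s = 0; s < SWITCHES; s++)
                sched_yield();
            _exit(0);
        }
    }
    for (int i = 0; i < nflows; i++)
        wait(NULL);                               /* reap all flows             */
    gettimeofday(&t1, NULL);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%.3f us per flow per switch\n", us / nflows / SWITCHES);
    return 0;
}
```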
We ran our experiments on a variety of machines. We report context switch times as the time per flow of control per context switch.

- Linux, on a typical x86 laptop with a 1.6 GHz Pentium M running Linux 2.4.25/glibc 2.3.3 (RedHat 9). Context switch time is shown in Figure 4.
- Mac OS X, on the Turing cluster at the University of Illinois; each node has 2 GHz G5 processors and 4 GB of RAM. Context switch time is shown in Figure 5.
- Sun Solaris, with a 700 MHz Sun Blade 1000 workstation running Solaris 9. Context switch time is shown in Figure 6.
- IBM SP, on the production machine cu.ncsa.uiuc.edu, with one 1.3 GHz Power4 "Regatta" node running AIX 5.1. Context switch time is shown in Figure 7.
- HP/Compaq Alpha, on the production machine lemieux.psc.edu, with one 1 GHz ES45 AlphaServer node running Tru64 Unix. Context switch time is shown in Figure 8.
Our experiments have shown that there is a wide variation in the limitations and performance of these methods on different machines. In general, the user-level threads (Cth) have the fastest context switch time on most of these machines, except on the IBM SP and Alpha machines. On all of these machines except the IBM SP, the context switch time of the user-level threads tends to increase slowly as the number of flows increases.

Figure 4. Context switching time vs. number of flows on an x86 Linux machine.
Table 2 illustrates approximate practical limitations on stock systems. It shows the approximate maximum number of processes a user can create on a processor and the maximum number of threads a user can create in a process.

Flow of control     | Limiting factor | Linux  | Sun    | IBM SP | Alpha  | MacOS  | IA-64
Process             | ulimit/kernel   | 8000   | 25000  | 100    | 1000   | 500    | 50000+
Kernel threads      | kernel          | 250    | 3000   | 2000   | 90000+ | 7000   | 30000+
User-level threads  | memory          | 90000+ | 90000+ | 15000  | 90000+ | 90000+ | 50000+

Table 2. Approximate practical limitations (on stock systems) for various methods to implement flows of control.

As we can see, an unmodified Linux RedHat 9 machine can spawn fewer than 256 pthreads in one process, while the per-user process limit on our IBM SP was only 100 processes. Both of these limitations can be extended with only a small amount of system administrator effort, but this effort is likely beyond the reach of a typical parallel user. In general, processes and kernel threads were limited to a few thousand, with only the Alpha allowing more than 5,000 threads at a time and the IA-64 without such a limitation. By contrast, we could create tens of thousands of user-level threads easily on all platforms.

Figure 5. Context switching time vs. number of flows on a Mac Apple G5 machine.

Figure 6. Context switching time vs. number of flows on a Sun Solaris machine.

Figure 7. Context switching time vs. number of flows on an IBM SP machine. This is a 16-way SMP node. We believe the low times for processes and threads are due to the OS ignoring our repeated sched_yield() calls.

Figure 8. Context switching time vs. number of flows on an Alpha machine. This is a 4-way SMP node. Again, process and thread switching numbers may be unrealistically low.
4.2 Stack Size

We also examined the effect the stack size has on the migratable thread context switching time. We examined three migratable thread methods, as described in detail in Section 3.4: stack copying threads, isomalloc threads, and our new memory aliasing threads. As our results in Figure 9 show, stack copying becomes unusably slow if more than 20 KB of stack data is used. Isomalloc threads are the fastest overall, and show no dependence on the amount of stack data in use. Memory aliasing stacks took about 4 µs per context switch, which is substantially slower than isomalloc threads, and the time grows with larger stacks, although very slowly. Nevertheless, memory aliasing stacks can provide the small virtual memory footprint of stack copying, but (especially for larger stacks) with much faster context switching. This technique is suitable for applications with a very large number of threads running on parallel machines with only a 32-bit address space (e.g. BlueGene/L), where isomalloc threads may quickly run out of virtual address space.

Figure 9. Context switching time vs. stack size on an x86 Linux machine. Various amounts of stack space from 8 KB to 8 MB were consumed using alloca().
4.3 Minimal Context Switching

We have determined a lower bound on the number of instructions required to explicitly context switch two user-level threads, as initiated by a subroutine call to a library's switch routine. Because the switch routine is a subroutine, only the saved registers defined in the architecture's subroutine calling convention need actually be saved and restored; scratch registers will automatically be saved and restored by the compiler, just like at any subroutine call. This observation makes it possible to write extremely efficient user-level thread switch routines, particularly for today's popular x86 and x86-64 CPUs (together, the 32 and 64-bit x86 architectures provided 68.2% of the FLOPS in the 2005 Top500 list). Figure 10 shows the minimum correct thread swap routines for 32 and 64-bit x86 CPUs. Note that on x86, the floating point registers must be empty before making a subroutine call, so the compiler will already save and restore floating point registers when needed.

The subroutines in Figure 10 can swap user-level threads in 16 ns (32-bit mode) and 18 ns (64-bit mode) on a 2.2 GHz Athlon 64. Of course, a real thread library also requires a scheduling component to decide which thread to swap in when another thread suspends, but for many applications thread scheduling can be very simple; for example, a circular linked list of runnable threads (sketched below).

Figure 10. Minimal user-level thread context switching routines for 32-bit (a) and 64-bit (b) x86 CPUs. AT&T/GNU assembly syntax shown.
Most user-level thread packages provide far worse performance than this, for two reasons. First, real systems often include multiple layers of scheduling and prioritization, which cost function-call overhead. Secondly, many user-level thread implementations save and restore far more state than is necessary, either through fear or ignorance. In particular, popular implementations of swapcontext and setjmp/longjmp (often used to implement user-level threading) save and restore all registers, including scratch registers. Worse, they often include system calls to save and restore the signal mask, even though very few scientific applications manipulate signals at runtime. If a user-level thread context switch involves even one system call, most of the speed advantage of user-level threads is lost. This is because a system call involves saving application registers when entering kernel space and restoring application registers when leaving kernel space, so the kernel could just as quickly perform a process switch by simply saving the registers of one process and restoring the registers of a different process!
4.4 Application: Parallel Simulator

BigSim [43, 44] is a parallel simulator developed on top of our Charm++ runtime system. It is capable of predicting the performance of parallel applications on a massively parallel machine with petascale performance using an existing parallel machine with only hundreds of processors, even before the target machine is built. Such simulation requires one physical processor to simulate hundreds or even thousands of processors of the simulated machine, hence creating the scenario of running multiple flows-of-control, one for simulating each target processor, on a simulating processor.

In a typical simulation, we simulated a BlueGene-like machine with 200,000 processors running a molecular dynamics (MD) simulation code. Running the test on 4 processors requires that each processor simulate 50,000 separate target processors, which clearly is not feasible on most machines using either processes or kernel threads. But by using user-level threads, we were able to simulate 50,000 target processors using 50,000 user-level threads on just one real processor.

Figure 11 illustrates the performance of BigSim using Cth, the Converse user-level threads. The test was run on LeMieux, which is a 750-node Quad AlphaServer ES45 machine at the Pittsburgh Supercomputing Center (PSC). Each node is a 4-processor SMP with 4 GB of memory. We measured the time taken to simulate one timestep of the MD simulation using 4 to 64 LeMieux processors. The figure demonstrates excellent scalability of the simulator in this test.

Figure 11. Simulation time per step using a total of 200,000 user-level threads.
4.5 Thread Migration and Load Balancing

The "Multi-Zone" NAS Parallel Benchmark [18] is an extension to the well-known NPB suite. It involves solving the application benchmarks LU, BT and SP on various collections of loosely coupled discrete meshes. It is characterized by partitioning the problems on a coarse-grain level to expose more parallelism and to stress the need for balancing the computation load. Among these tests, BT-MZ creates the most dramatic load imbalance, so it is used in our test runs.

We ran the BT-MZ benchmark with Adaptive MPI and used thread migration for the load balancing. The migratable threads use the isomalloc and swap-global mechanisms (Section 3.4.2) to allow transparent thread migration without having to change any of the benchmark code. In order for load balancing to be effective, AMPI requires the number of AMPI migratable threads to be much larger than the actual number of processors, so that AMPI threads can migrate from overloaded processors to underloaded ones to improve load balance.

These tests were run on the Tungsten Xeon Linux cluster at NCSA. This cluster is based on Dell PowerEdge 1750 servers, each with two Intel Xeon 3.2 GHz processors, running Red Hat Linux with a Myrinet interconnect network. Figure 12 shows the total execution time with various configurations of BT-MZ, with vs. without load balancing. The x-axis represents each test case. For example, "A.8, 4PE" indicates that the BT-MZ is compiled with CLASS=A and NPROCS=8, running with 8 AMPI threads but on 4 actual processors. Note that same-class (A, B, etc.) problems have the same problem size. Indeed, for all three class B tests on 8 processors (B.16 8PE, B.32 8PE and B.64 8PE), the execution times after load balancing are about the same, while there is a dramatic variation in execution times before load balancing. This benchmark demonstrates the effect of load balancing that is made feasible via thread migration.

Figure 12. The NAS BT-MZ benchmark with and without thread migration for automatic parallel load balancing.
5 Conclusion

We presented a study of four flow-of-control mechanisms that are widely used in parallel programming. These mechanisms are processes, kernel threads, user-level threads, and event-driven objects. Through experiments, we illustrated the practical limitations of using these techniques on a variety of platforms. We further analyzed the performance of these techniques while varying parameters such as the number of threads and the amount of memory used by each thread. We demonstrated a wide variation in the performance of these flow-of-control mechanisms in context switching overhead. However, in general, user-level threads provide both flexible implementations and scalable performance. This makes user-level threads an attractive approach for programming parallel applications with a large number of flows of control.

We also examined approaches to support thread migration, which is particularly useful for load balancing. We described several methods for implementing migratable threads that can be used for load balancing large scale parallel applications. We have implemented these techniques (stack copying, isomalloc, and memory aliasing stacks) in the Charm++/Adaptive MPI runtime system [16]. We have also shown that these techniques can be used on a wide variety of platforms by a variety of real parallel applications.
References

[1] The GNU Portable Threads Library. http://www.gnu.org/software/pth.
[2] N. Adiga, G. Almasi, G. Almasi, Y. Aridor, R. Barik, D. Beece, R. Bellofatto, G. Bhanot, R. Bickford, M. Blumrich, A. Bright, et al. An overview of the BlueGene/L supercomputer, 2002.
[3] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. Transactions on Computer Systems, 10(1):53–79, February 1992.
[4] G. Antoniu, L. Bouge, and R. Namyst. An efficient and transparent thread migration scheme in the PM2 runtime system. In Proc. 3rd Workshop on Runtime Systems for Parallel Programming (RTSPP), San Juan, Puerto Rico, Lecture Notes in Computer Science 1586, pages 496–510. Springer-Verlag, April 1999.
[5] J. Appavoo, M. Auslander, M. Burtico, D. D. Silva, O. Krieger, M. Mergen, M. Ostrowski, B. Rosenburg, R. W. Wisniewski, and J. Xenidis. K42: an open-source Linux-compatible scalable operating system kernel. IBM Systems Journal, 44(2):427–440, 2005.
[6] A. Barak, S. Guday, and R. G. Wheeler. The MOSIX distributed operating system. In LNCS 672. Springer, 1993.
[7] K. Barker, A. Chernikov, N. Chrisochoides, and K. Pingali. A Load Balancing Framework for Adaptive and Asynchronous Applications. In IEEE Transactions on Parallel and Distributed Systems, volume 15, pages 183–192, 2003.
[8] K. J. Barker and N. P. Chrisochoides. An Evaluation of a Framework for the Dynamic Load Balancing of Highly Adaptive and Irregular Parallel Applications. In Proceedings of SC2003, Phoenix, AZ, 2003.
[9] J. A. Booth. Balancing priorities and load for state space search on large parallel machines. Master's thesis, University of Illinois at Urbana-Champaign, 2003.
[10] R. K. Brunner and L. V. Kale. Adapting to load on workstation clusters. In The Seventh Symposium on the Frontiers of Massively Parallel Computation, pages 106–112. IEEE Computer Society Press, February 1999.
[11] R. K. Brunner and L. V. Kale. Handling application-induced load imbalance using parallel objects. In Parallel and Distributed Computing for Symbolic and Irregular Applications, pages 167–181. World Scientific Publishing, 2000.
[12] S. Chakravorty and L. V. Kale. A fault tolerant protocol for massively parallel machines. In FTPDS Workshop for IPDPS 2004. IEEE Press, 2004.
[13] F. Douglis and J. Ousterhout. Process migration in the Sprite operating system. In Proceedings of the 7th International Conference on Distributed Computer Systems, pages 18–25, 1987.
[14] E. A. Hendriks. BProc: The Beowulf distributed process space. In 16th Annual ACM International Conference on Supercomputing. ACM Press, 2002.
[15] C. Huang, O. Lawlor, and L. V. Kale. Adaptive MPI. In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003), LNCS 2958, pages 306–322, College Station, Texas, October 2003.
[16] C. Huang, G. Zheng, S. Kumar, and L. V. Kale. Performance evaluation of Adaptive MPI. In Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2006, March 2006.
[17] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke. Condor-G: A computation management agent for multi-institutional grids. In Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), 2001.
[18] H. Jin and R. F. Van der Wijngaart. Performance characteristics of the multi-zone NAS parallel benchmarks. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2004.
[19] R. Jyothi, O. S. Lawlor, and L. V. Kale. Debugging support for Charm++. In PADTAD Workshop for IPDPS 2004, page 294. IEEE Press, 2004.
[20] L. V. Kale. The virtualization model of parallel programming: Runtime optimizations and the state of art. In LACSI 2002, Albuquerque, October 2002.
[21] L. V. Kale. Performance and productivity in parallel programming via processor virtualization. In Proc. of the First Intl. Workshop on Productivity and Performance in High-End Computing (at HPCA 10), Madrid, Spain, February 2004.
[22] L. V. Kale and M. Bhandarkar. Structured Dagger: A Coordination Language for Message-Driven Programming. In Proceedings of Second International Euro-Par Conference, volume 1123-1124 of Lecture Notes in Computer Science, pages 646–653, September 1996.
[23] L. V. Kale, M. Bhandarkar, R. Brunner, and J. Yelon. Multiparadigm, Multilingual Interoperability: Experience with Converse. In Proceedings of 2nd Workshop on Runtime Systems for Parallel Programming (RTSPP), Orlando, Florida, USA, Lecture Notes in Computer Science, March 1998.
[24] L. V. Kale and S. Krishnan. Charm++: Parallel Programming with Message-Driven Objects. In G. V. Wilson and P. Lu, editors, Parallel Programming using C++, pages 175–213. MIT Press, 1996.
[25] L. V. Kale and J. Yelon. Threads for Interoperable Parallel Programming. In Proc. 9th Conference on Languages and Compilers for Parallel Computers, San Jose, California, August 1996.
[26] D. Keppel. Tools and techniques for building fast portable threads packages. Technical Report UWCSE 93-05-06, University of Washington Department of Computer Science and Engineering, May 1993.
[27] D. Knuth. The Art of Computer Programming, volume 1. Addison-Wesley, 1997.
[28] O. S. Lawlor and L. V. Kale. Supporting dynamic parallel object arrays. Concurrency and Computation: Practice and Experience, 15:371–393, 2003.
[29] M. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the Unix kernel. In Usenix Conference Proceedings, pages 283–290, January 1992.
[30] D. S. Milojicic, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. Process migration. ACM Computing Surveys, 32(3):241–299, 2000.
[31] J. Mukherjee and S. Varadarajan. Weaves: A framework for reconfigurable programming. International Journal of Parallel Programming, 33(2-3):279–305, 2005.
[32] J. C. Phillips, R. Brunner, A. Shinozaki, M. Bhandarkar, N. Krawetz, L. Kale, R. D. Skeel, and K. Schulten. Avoiding algorithmic obfuscation in a message-driven parallel MD code. In Computational Molecular Dynamics: Challenges, Methods, Ideas, volume 4 of Lecture Notes in Computational Science and Engineering, pages 472–482. Springer-Verlag, November 1998.
[33] J. C. Phillips, G. Zheng, S. Kumar, and L. V. Kale. NAMD: Biomolecular simulation on thousands of processors. In Proceedings of SC 2002, Baltimore, MD, September 2002.
[34] S. Chakravorty, C. Mendes, and L. V. Kale. Proactive fault tolerance in large systems. In HPCRI Workshop in conjunction with HPCA 2005, 2005.
[35] W. W. Shu and L. V. Kale. A dynamic load balancing strategy for the Chare Kernel system. In Proceedings of Supercomputing '89, pages 389–398, November 1989.
[36] D. S. T. G. Mattson and S. T. Wheat. A teraflop supercomputer in 1996: the ASCI TFLOPS system. In Proceedings of the International Parallel Processing Symposium, 1996.
[37] S. Tatham. Coroutines in C, 2005. http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html.
[38] N. Williams. An implementation of scheduler activations on the NetBSD operating system, 2002.
[39] T. Wilmarth and L. V. Kale. POSE: Getting over grainsize in parallel discrete event simulation. In 2004 International Conference on Parallel Processing, pages 12–19, August 2004.
[40] T. L. Wilmarth, G. Zheng, E. J. Bohm, Y. Mehta, N. Choudhury, P. Jagadishprasad, and L. V. Kale. Performance prediction using simulation of large-scale interconnection networks in POSE. In Proceedings of the Workshop on Principles of Advanced and Distributed Simulation, pages 109–118, 2005.
[41] G. Zheng. Achieving High Performance on Extremely Large Parallel Machines: Performance Prediction and Load Balancing. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 2005.
[42] G. Zheng, C. Huang, and L. V. Kale. Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++. ACM SIGOPS Operating Systems Review: Operating and Runtime Systems for High-end Computing Systems, 40(2), April 2006.
[43] G. Zheng, G. Kakulapati, and L. V. Kale. BigSim: A parallel simulator for performance prediction of extremely large parallel machines. In 18th International Parallel and Distributed Processing Symposium (IPDPS), Santa Fe, New Mexico, April 2004.
[44] G. Zheng, A. K. Singla, J. M. Unger, and L. V. Kale. A parallel-object programming model for petaflops machines and BlueGene/Cyclops. In NSF Next Generation Systems Program Workshop, 16th International Parallel and Distributed Processing Symposium (IPDPS), Fort Lauderdale, FL, April 2002.