NVIDIA2010NVIDIA2010PauliusMicikevicius|NVIDIAFundamentalOptimizationsSupercomputing,TutorialS03NewOrleans,Nov14,2010NVIDIA2010OutlineKerneloptimizations–Launchconfiguration–Globalmemorythroughput–Sharedmemoryaccess–Instructionthroughput/controlflowOptimizationofCPU-GPUinteraction–MaximizingPCIethroughput–OverlappingkernelexecutionwithmemorycopiesNVIDIA2010LaunchConfigurationNVIDIA2010LaunchConfigurationHowmanythreads/threadblockstolaunchKeytounderstanding:–Instructionsareissuedinorder–Athreadstallswhenoneoftheoperandsisn'tready:Memoryreadbyitselfdoesn'tstallexecution–LatencyishiddenbyswitchingthreadsGMEMlatency:400-800cyclesArithmeticlatency:18-22cyclesConclusion:–NeedenoughthreadstohidelatencyNVIDIA2010LaunchConfigurationHidingarithmeticlatency:Need~18warps(576)threadsperFermiSMFewerwarpsforpre-FermiGPUs(FermiSMmorethandoubledissuerate)Or,latencycanalsobehiddenwithindependentinstructionsfromthesamewarpForexample,ifinstructionneverdependsontheoutputofprecedinginstruction,thenonly9warpsareneeded,etc.
Maximizingglobalmemorythroughput:Dependsontheaccesspattern,andwordsizeNeedenoughmemorytransactionsinflighttosaturatethebusIndependentloadsandstoresfromthesamethreadLoadsandstoresfromdifferentthreadsLargerwordsizescanalsohelp(float2istwicethetransactionsoffloat,forexample)NVIDIA2010MaximizingMemoryThroughputIncrementofanarrayof64Melements–Twoaccessesperthread(loadthenstore)–Thetwoaccessesaredependent,soreally1accessperthreadatatimeTeslaC2050,ECCon,theoreticalbandwidth:~120GB/sSeveralindependentsmalleraccesseshavethesameeffectasonelargerone.
Forexample:Four32-bit~=one128-bitNVIDIA2010LaunchConfiguration:SummaryNeedenoughtotalthreadstokeepGPUbusy–Typically,you'dlike512+threadsperSMMoreifprocessingonefp32elementperthread–Ofcourse,exceptionsexistThreadblockconfiguration–Threadsperblockshouldbeamultipleofwarpsize(32)–SMcanconcurrentlyexecuteupto8threadblocksReallysmallthreadblockspreventachievinggoodoccupancyReallylargethreadblocksarelessflexibleIgenerallyuse128-256threads/block,butusewhateverisbestfortheapplicationFormoredetails:–VasilyVolkov'sGTC2010talk"BetterPerformanceatLowerOccupancy"NVIDIA2010GlobalMemoryThroughputNVIDIA2010FermiMemoryHierarchyReviewL2GlobalMemoryRegistersL1SM-NSMEMRegistersL1SM-0SMEMRegistersL1SM-1SMEMNVIDIA2010FermiMemoryHierarchyReviewLocalstorage–Eachthreadhasownlocalstorage–Mostlyregisters(managedbythecompiler)Sharedmemory/L1–Programconfigurable:16KBshared/48KBL1OR48KBshared/16KBL1–Sharedmemoryisaccessiblebythethreadsinthesamethreadblock–Verylowlatency–Veryhighthroughput:1+TB/saggregateL2–AllaccessestoglobalmemorygothroughL2,includingcopiesto/fromCPUhostGlobalmemory–Accessiblebyallthreadsaswellashost(CPU)–Higherlatency(400-800cycles)–Throughput:upto177GB/sNVIDIA2010ProgrammingforL1andL2Shortanswer:DON'T–GPUcachesarenotintendedforthesameuseasCPUcachesSmallersize(especiallyperthread),sonotaimedattemporalreuseIntendedtosmoothoutsomeaccesspatterns,helpwithspilledregisters,etc.
–Don'ttrytoblockforL1/L2likeyouwouldonCPUYouhave100sto1,000sofrun-timescheduledthreadshittingthecachesIfitispossibletoblockforL1thenblockforSMEM–Samesize,samebandwidth,hwwillnotevictbehindyourbackOptimizeasifnocacheswerethere–NoFermi-onlytechniquestolearnperse(so,allyouknowisstillgood)–SomecaseswilljustrunfasterNVIDIA2010FermiGMEMOperationsTwotypesofloads:–CachingDefaultmodeAttemptstohitinL1,thenL2,thenGMEMLoadgranularityis128-byteline–Non-cachingCompilewith–Xptxas–dlcm=cgoptiontonvccAttemptstohitinL2,thenGMEM–DonothitinL1,invalidatethelineifit'sinL1alreadyLoadgranularityis32-bytesStores:–InvalidateL1,write-backforL2NVIDIA2010LoadCachingandL1SizeNon-cachingloadscanimproveperfwhen:–LoadingscatteredwordsoronlyapartofawarpissuesaloadBenefit:transactionissmaller,sousefulpayloadisalargerpercentageLoadinghalos,forexample–Spillingregisters(reducelinefightingwithspillage)LargeL1canimproveperfwhen:–Spillingregisters(morelinessofewerevictions)–Somemisaligned,stridedaccesspatterns–16-KBL1/48-KBsmemOR48-KBL1/16-KBsmemCUDAcall,canbesetfortheapporper-kernelHowtouse:–Justtrya2x2experimentmatrix:{CA,CG}x{48-L1,16-L1}Keepthebestcombination-sameasyouwouldwithanyHWmanagedcache,includingCPUsNVIDIA2010LoadOperationMemoryoperationsareissuedperwarp(32threads)–Justlikeallotherinstructions–PriortoFermi,memoryissueswereperhalf-warpOperation:–Threadsinawarpprovidememoryaddresses–Determinewhichlines/segmentsareneeded–Requesttheneededlines/segmentsNVIDIA2010CachingLoadWarprequests32aligned,consecutive4-bytewordsAddressesfallwithin1cache-line–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0NVIDIA2010Non-cachingLoadWarprequests32aligned,consecutive4-bytewordsAddressesfallwithin4segments–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0NVIDIA2010CachingLoad.
.
.
961921281602242882563264352320384448416Memoryaddressesaddressesfromawarp0Warprequests32aligned,permuted4-bytewordsAddressesfallwithin1cache-line–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%NVIDIA2010Non-cachingLoad.
.
.
961921281602242882563264352320384448416Memoryaddressesaddressesfromawarp0Warprequests32aligned,permuted4-bytewordsAddressesfallwithin4segments–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%NVIDIA2010CachingLoad96192128160224288256.
.
.
addressesfromawarp32640352320384448416MemoryaddressesWarprequests32misaligned,consecutive4-bytewordsAddressesfallwithin2cache-lines–Warpneeds128bytes–256bytesmoveacrossthebusonmisses–Busutilization:50%NVIDIA2010Non-cachingLoad96192128160224288256.
.
.
addressesfromawarp32640352320384448416MemoryaddressesWarprequests32misaligned,consecutive4-bytewordsAddressesfallwithinatmost5segments–Warpneeds128bytes–Atmost160bytesmoveacrossthebus–Busutilization:atleast80%Somemisalignedpatternswillfallwithin4segments,so100%utilizationNVIDIA2010CachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Allthreadsinawarprequestthesame4-bytewordAddressesfallwithinasinglecache-line–Warpneeds4bytes–128bytesmoveacrossthebusonamiss–Busutilization:3.
125%NVIDIA2010Non-cachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Allthreadsinawarprequestthesame4-bytewordAddressesfallwithinasinglesegment–Warpneeds4bytes–32bytesmoveacrossthebusonamiss–Busutilization:12.
5%NVIDIA2010CachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Warprequests32scattered4-bytewordsAddressesfallwithinNcache-lines–Warpneeds128bytes–N*128bytesmoveacrossthebusonamiss–Busutilization:128/(N*128)NVIDIA2010Non-cachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Warprequests32scattered4-bytewordsAddressesfallwithinNsegments–Warpneeds128bytes–N*32bytesmoveacrossthebusonamiss–Busutilization:128/(N*32)NVIDIA2010ImpactofAddressAlignmentWarpsshouldaccessalignedregionsformaximummemorythroughput–FermiL1canhelpformisalignedloadsifseveralwarpsareaccessingacontiguousregion–ECCfurthersignificantlyreducesmisalignedstorethroughputExperiment:–Copy16MBoffloats–256threads/blockGreatestthroughputdrop:–GT200:40%–Fermi:–CAloads:15%–CGloads:32%NVIDIA2010GMEMOptimizationGuidelinesStriveforperfectcoalescingperwarp–Alignstartingaddress(mayrequirepadding)–AwarpshouldaccesswithinacontiguousregionHaveenoughconcurrentaccessestosaturatethebus–LaunchenoughthreadstomaximizethroughputLatencyishiddenbyswitchingthreads(warps)–ProcessseveralelementsperthreadMultipleloadsgetpipelinedIndexingcalculationscanoftenbereusedTryL1andcachingconfigurationstoseewhichoneworksbest–Cachingvsnon-cachingloads(compileroption)–16KBvs48KBL1(CUDAcall)NVIDIA2010SharedMemoryNVIDIA2010SharedMemoryUses:–Inter-threadcommunicationwithinablock–Cachedatatoreduceredundantglobalmemoryaccesses–UseittoimproveglobalmemoryaccesspatternsFermiorganization:–32banks,4-bytewidebanks–Successive4-bytewordsbelongtodifferentbanksPerformance:–4bytesperbankper2clockspermultiprocessor–smemaccessesareissuedper32threads(warp)per16-threadsforGPUspriortoFermi–serialization:ifnthreadsinawarpaccessdifferent4-bytewordsinthesamebank,naccessesareexecutedserially–multicast:nthreadsaccessthesamewordinonefetchCouldbedifferentbyteswithinthesamewordPriortoFermi,onlybroadcastwasavailable,sub-wordaccesseswithinthesamebankcausedserializationNVIDIA2010BankAddressingExamplesNoBankConflictsNoBankConflictsBank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0Bank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0NVIDIA2010BankAddressingExamples2-wayBankConflicts8-wayBankConflictsThread31Thread30Thread29Thread28Thread4Thread3Thread2Thread1Thread0Bank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0Bank9Bank8Bank31Bank7Bank2Bank1Bank0x8x8NVIDIA2010SharedMemory:AvoidingBankConflicts32x32SMEMarrayWarpaccessesacolumn:–32-waybankconflicts(threadsinawarpaccessthesamebank)312103121031210warps:01231Bank0Bank1…Bank3120131NVIDIA2010SharedMemory:AvoidingBankConflictsAddacolumnforpadding:–32x33SMEMarrayWarpaccessesacolumn:–32differentbanks,nobankconflicts312103121031210warps:01231paddingBank0Bank1…Bank3131201NVIDIA2010Additional"memories"TextureandconstantRead-onlyDataresidesinglobalmemoryReadthroughdifferentcachesNVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwiseNVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwise__global__voidkernel(constfloat*g_a){.
.
.
floatx=g_a[15];//uniformfloaty=g_a[blockIdx.
x+5];//uniformfloatz=g_a[threadIdx.
x];//non-uniform.
.
.
}NVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwiseNVIDIA2010ConstantMemoryKernelexecutes10Kthreads(320warps)perSMduringitslifetimeAllthreadsaccessthesame4BwordUsingGMEM:–Eachwarpfetches32B->10KBofbustraffic–Cachingloadspotentiallyworse–128Bline,verylikelytobeevictedmultipletimes.
.
.
addressesfromawarp9619212816022428825632643523203844484160NVIDIA2010ConstantMemoryKernelexecutes10Kthreads(320warps)perSMduringitslifetimeAllthreadsaccessthesame4BwordUsingconstant/uniformaccess:–Firstwarpfetches32bytes–Allothershitinconstantcache->32bytesofbustrafficperSMUnlikelytobeevictedoverkernellifetime–otherloadsdonotgothroughthiscache.
.
.
addressesfromawarp9619212816022428825632643523203844484160NVIDIA2010TextureSeparatecacheDedicatedtexturecachehardwareprovides:–Out-of-boundsindexhandlingclamporwrap-around–OptionalinterpolationThink:usingfpindicesforarraysLinear,bilinear,trilinear–Interpolationweightsare9-bit–Optionalformatconversion{char,short,int}->float–Alloftheseare"free"NVIDIA2010InstructionThroughput/ControlFlowNVIDIA2010RuntimeMathLibraryandIntrinsicsTwotypesofruntimemathlibraryfunctions–__func():manymapdirectlytohardwareISAFastbutloweraccuracy(seeCUDAProgrammingGuideforfulldetails)Examples:__sinf(x),__expf(x),__powf(x,y)–func():compiletomultipleinstructionsSlowerbuthigheraccuracy(5ulporless)Examples:sin(x),exp(x),pow(x,y)Anumberofadditionalintrinsics:–__sincosf(),__frcp_rz(),.
.
.
–ExplicitIEEEroundingmodes(rz,rn,ru,rd)NVIDIA2010ControlFlowInstructionsareissuedper32threads(warp)Divergentbranches:–Threadswithinasinglewarptakedifferentpathsif-else,.
.
.
–DifferentexecutionpathswithinawarpareserializedDifferentwarpscanexecutedifferentcodewithnoimpactonperformanceAvoiddivergingwithinawarp–Examplewithdivergence:if(threadIdx.
x>2){.
.
.
}else{.
.
.
}Branchgranularity2){.
.
.
}else{.
.
.
}BranchgranularityisawholemultipleofwarpsizeNVIDIA2010ControlFlowif(.
.
.
){//then-clause}else{//else-clause}instructionsNVIDIA2010Executionwithinwarpsiscoherentinstructions/timeWarp("vector"ofthreads)35343363623232131300Warp("vector"ofthreads)NVIDIA2010Executiondivergeswithinawarpinstructions/time32131300353433636232NVIDIA2010CPU-GPUInteractionNVIDIA2010Pinned(non-pageable)memoryPinnedmemoryenables:–fasterPCIecopies–memcopiesasynchronouswithCPU–memcopiesasynchronouswithGPUUsage–cudaHostAlloc/cudaFreeHostinsteadofmalloc/freeImplication:–pinnedmemoryisessentiallyremovedfromhostvirtualmemoryNVIDIA2010StreamsandAsyncAPIDefaultAPI:–KernellaunchesareasynchronouswithCPU–Memcopies(D2H,H2D)blockCPUthread–CUDAcallsareserializedbythedriverStreamsandasyncfunctionsprovide:–Memcopies(D2H,H2D)asynchronouswithCPU–AbilitytoconcurrentlyexecuteakernelandamemcopyStream=sequenceofoperationsthatexecuteinissue-orderonGPU–Operationsfromdifferentstreamsmaybeinterleaved–AkernelandmemcopyfromdifferentstreamscanbeoverlappedNVIDIA2010OverlapkernelandmemorycopyRequirements:–D2HorH2Dmemcopyfrompinnedmemory–Devicewithcomputecapability≥1.
1(G84andlater)–Kernelandmemcopyindifferent,non-0streamsCode:cudaStream_tstream1,stream2;cudaStreamCreate(&stream1);cudaStreamCreate(&stream2);cudaMemcpyAsync(dst,src,size,dir,stream1);kernel>>(…);potentiallyoverlappedNVIDIA2010CallSequencingforOptimalOverlapCUDAcallsaredispatchedtothehwinthesequencetheywereissuedFermicanconcurrentlyexecute:–Upto16kernels–Upto2memcopies,aslongastheyareindifferentdirections(D2HandH2D)Acallisdispatchedifbotharetrue:–Resourcesareavailable–PrecedingcallsinthesamestreamhavecompletedNotethatifacallblocks,itblocksallothercallsofthesametypebehindit,eveninotherstreams–Typeisoneof{kernel,memcopy}NVIDIA2010StreamExamplesK1,M1,K2,M2:K1M1K2M2K1,K2,M1,M2:K1M1K2M2K1,M1,M2:K1M1M2K1,M2,M1:K1M1M2K1,M2,M2:K1M2M2TimeK:kernelM:memcopyInteger:streadIDNVIDIA2010MoreonFermiConcurrentKernelsKernelsmaybeexecutedconcurrentlyiftheyareissuedintodifferentstreamsScheduling:–Kernelsareexecutedintheorderinwhichtheywereissued–ThreadblocksforagivenkernelarescheduledifallthreadblocksforprecedingkernelshavebeenscheduledandtherestillareSMresourcesavailableNVIDIA2010MoreonFermiDualCopyFermiiscapableofduplexcommunicationwiththehost–PCIebusisduplex–Thetwomemcopiesmustbeindifferentstreams,differentdirectionsNotallcurrenthostsystemscansaturateduplexPCIebandwidth:–LikelylimitationsoftheIOHchips–Ifthisisimportanttoyou,testyourhostsystemNVIDIA2010DuplexCopy:ExperimentalResultsCPU-0IOHX58DRAMGPU-0CPU-0IOHD36DRAMGPU-0CPU-0DRAM10.
8GB/s7.
5GB/sQPI,6.
4GT/s25.
6GB/s3xDDR3,1066MHz25.
8GB/sPCIe,x1616GB/sNVIDIA2010DuplexCopy:ExperimentalResultsCPU-0IOHX58DRAMGPU-0CPU-0IOHD36DRAMGPU-0CPU-1DRAM10.
8GB/s11GB/sQPI,6.
4GT/s25.
6GB/s3xDDR3,1066MHz25.
8GB/sPCIe,x1616GB/sNVIDIA2010SummaryKernelLaunchConfiguration:–LaunchenoughthreadsperSMtohidelatency–LaunchenoughthreadblockstoloadtheGPUGlobalmemory:–Maximizethroughput(GPUhaslotsofbandwidth,useiteffectively)Usesharedmemorywhenapplicable(over1TB/sbandwidth)GPU-CPUinteraction:–MinimizeCPU/GPUidling,maximizePCIethroughputUseanalysis/profilingwhenoptimizing:–"Analysis-drivenOptimization"talknextNVIDIA2010AdditionalResourcesBasics:–CUDAwebinarsonNVIDIAwebsite(justgoogleforCUDAwebinar)–CUDAbyExample"bookbyJ.
SandersandE.
CandrotProfiling,analysis,andoptimizationforFermi:–GTC-2010session2012:"Analysis-drivenOptimization"(tomorrow,3-5pm)GT200optimization:–GTC-2009session1029(slidesandvideo)Slides:–http://www.
nvidia.
com/content/GTC/documents/1029_GTC09.
pdfMaterialsforallsessions:–http://developer.
download.
nvidia.
com/compute/cuda/docs/GTC09Materials.
htmCUDATutorialsatSupercomputing:–http://gpgpu.
org/{sc2007,sc2008,sc2009}CUDAProgrammingGuideCUDABestPracticesGuideNVIDIA2010Questions
博鳌云是一家以海外互联网基础业务为主的高新技术企业,运营全球高品质数据中心业务。自2008年开始为用户提供服务,距今11年,在国人商家中来说非常老牌。致力于为中国用户提供域名注册(国外接口)、免费虚拟主机、香港虚拟主机、VPS云主机和香港、台湾、马来西亚等地服务器租用服务,各类网络应用解決方案等领域的专业网络数据服务。商家支持支付宝、微信、银行转账等付款方式。目前香港有一款特价独立服务器正在促销,...
LOCVPS发来了针对XEN架构VPS的促销方案,其中美国洛杉矶机房7折,其余日本/新加坡/中国香港等机房全部8折,优惠后日本/新加坡机房XEN VPS月付仅29.6元起。这是成立较久的一家国人VPS服务商,目前提供美国洛杉矶(MC/C3)、和中国香港(邦联、沙田电信、大埔)、日本(东京、大阪)、新加坡、德国和荷兰等机房VPS主机,基于XEN或者KVM虚拟架构,均选择国内访问线路不错的机房,适合建...
触碰云怎么样?触碰云是一家成立于2019年的商家。触碰云主营香港/美国 VPS服务器、独立服务器以及免备案CDN。采用的是kvm虚拟构架,硬盘Raid10,Cn2线路,去程电信CN2、移动联通直连,回程三网CN2。最低1核1G带宽1M仅20.8元/月,不过这里推荐香港4核4G15M,香港cn2 gia线路云服务器,仅115.2元/月起,性价比还是不错的。点击进入:触碰云官方网站地址触碰云优惠码:优...
freehost为你推荐
中老铁路一带一路的火车是什么火车地图应用手机地图软件那么多,都不知道用哪个好了?李子柒年入1.6亿李子柒男朋友是谁,李子柒父母怎么去世的?同ip站点同IP网站具体是什么意思,能换独立的吗haole018.com为啥进WWWhaole001)COM怎么提示域名出错?囡道是haole001换地了吗mole.61.com摩尔大陆?????www.zjs.com.cn中通快递投诉网站网址是什么?www.se222se.comhttp://www.qqvip222.com/javlibrary.comsony home network library官方下载地址sodu.tw今天sodu.org为什么打不开了?
北京域名空间 locvps awardspace 正版win8.1升级win10 日本空间 七夕快乐英文 可外链网盘 服务器是干什么的 服务器监测 无限流量 web服务器搭建 百度云加速 西安主机 贵阳电信测速 工信部网站备案查询 徐州电信 杭州电信宽带 好看的空间 开心online forwarder 更多