AMD APP SDK v2.8 FAQ

1 General Questions

1. Do I need to use additional software with the SDK?
To run an OpenCL application, you must have an OpenCL runtime on your system. If your system includes a recent AMD discrete GPU, or an APU, you also should install the latest Catalyst drivers, which can be downloaded from AMD.com. Information on supported devices can be found at developer.amd.com/appsdk. If your system does not include a recent AMD discrete GPU or APU, the SDK installs a CPU-only OpenCL runtime. Also, we recommend using the debugging, profiling, and analysis tools contained in the AMD CodeXL heterogeneous compute tools suite.

2. Which versions of the OpenCL standard does this SDK support?
AMD APP SDK 2.8 supports development of applications using the OpenCL Specification v1.2. As all OpenCL 1.1 APIs are supported within OpenCL 1.2, you also can develop OpenCL 1.1-compliant applications.

3. Will applications developed to execute on OpenCL 1.1 still operate in an OpenCL 1.2 environment?
OpenCL is designed to be backwards compatible. The OpenCL 1.2 runtime delivered with the AMD Catalyst drivers runs any OpenCL 1.1-compliant application. However, an OpenCL 1.2-compliant application will not execute on an OpenCL 1.1 runtime if APIs only supported by OpenCL 1.2 are used.

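As a sketch of how an application might guard 1.2-only calls, the helper below parses the version string that a platform reports. It assumes the string has already been retrieved (for example with clGetPlatformInfo and CL_PLATFORM_VERSION, not shown). It relies on the OpenCL specification's rule that this string begins with "OpenCL <major>.<minor>". The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: parse the "OpenCL <major>.<minor> ..." prefix that the
// OpenCL specification mandates for CL_PLATFORM_VERSION.
// Returns false if the string does not start with that prefix.
bool parseOpenCLVersion(const std::string& version, int& major, int& minor) {
    const std::string prefix = "OpenCL ";
    if (version.compare(0, prefix.size(), prefix) != 0) return false;
    // sscanf stops after the minor number, so trailing vendor text is ignored.
    return std::sscanf(version.c_str() + prefix.size(), "%d.%d", &major, &minor) == 2;
}

// True if APIs introduced in OpenCL 1.2 may be called on this platform.
bool supportsOpenCL12(const std::string& platformVersion) {
    int major = 0, minor = 0;
    if (!parseOpenCLVersion(platformVersion, major, minor)) return false;
    return major > 1 || (major == 1 && minor >= 2);
}
```

An application would check each platform once and fall back to OpenCL 1.1 entry points when supportsOpenCL12 returns false.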
4. Does AMD provide any additional OpenCL samples, other than those contained within the SDK?
The most recent versions of all of the samples contained within the SDK are also available for individual download from the developer.amd.com/appsdk "Samples & Demos" page. This page also contains additional samples that either were too large to include in the SDK, or which have been developed since the most recent SDK release. Check this web page for new, updated, or large samples.

5. How often can I expect to get AMD APP SDK updates?
Developers can expect that the AMD APP SDK may be updated two to three times a year. Actual release intervals may vary depending on available new features and product updates. AMD is committed to providing developers with regular updates to allow them to take advantage of the latest developments in AMD APP technology.

6. What is the difference between the CPU and GPU components of OpenCL that are bundled with the AMD APP SDK?
The CPU component uses the compatible CPU cores in your system to accelerate your OpenCL compute kernels; the GPU component uses the compatible GPU cores in your system to accelerate your OpenCL compute kernels.

7. What CPUs does the AMD APP SDK v2.8 with OpenCL 1.2 support work on?
The CPU component of OpenCL bundled with the AMD APP SDK works with any x86 CPU with SSE3 or later, as well as SSE2.x or later. AMD CPUs have supported SSE3 (and later) since 2005. Some examples of AMD CPUs that support SSE3 (or later) are the AMD Athlon 64 (starting with the Venice/San Diego steppings), AMD Athlon 64 X2, AMD Athlon 64 FX (starting with San Diego stepping), AMD Opteron (starting with E4 stepping), AMD Sempron (starting with Palermo stepping), AMD Phenom, AMD Turion 64, and AMD Turion 64 X2.

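An installer or application could check the SSE3 requirement at runtime. The sketch below is illustrative host code and is not part of the SDK. It assumes GCC or Clang on an x86 target, where the `__builtin_cpu_supports` builtin is available, and it simply reports false on other compilers or architectures. It checks SSE3 only.

```cpp
#include <cassert>

// Returns true if the host CPU reports SSE3 support.
// Uses the GCC/Clang x86 builtin; on other targets this sketch returns false.
bool hostHasSSE3() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // initialize CPU feature data (safe to call repeatedly)
    return __builtin_cpu_supports("sse3") != 0;
#else
    return false;
#endif
}
```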
8. What APUs and GPUs does the AMD APP SDK v2.8 with OpenCL 1.2 support work on?
For the list of supported APUs and GPUs, see the AMD APP SDK v2.8 System Requirements list at: http://developer.amd.com/appsdk

9. Can my OpenCL code run on GPUs from other vendors?
At this time, AMD does not plan to have the AMD APP SDK support GPU products from other vendors; however, since OpenCL is an industry standard programming interface, programs written in OpenCL 1.2 can be recompiled and run with any OpenCL-compliant compiler and runtime.

10. What version of MS Visual Studio is supported?
The AMD APP SDK v2.8 with OpenCL 1.2 supports Microsoft Visual Studio 2008 Professional Edition, Microsoft Visual Studio 2010 Professional Edition, and Microsoft Visual Studio 2012.

11. Is it possible to run multiple AMD APP applications (compute and graphics) concurrently?
Multiple AMD APP applications can be run concurrently, as long as they do not access the same GPU at the same time. AMD APP applications that attempt to access the same GPU at the same time are automatically serialized by the runtime system.

12. Which graphics driver is required for the current AMD APP SDK v2.8 with OpenCL 1.2 CPU support?
For the minimum required graphics driver, see the AMD APP SDK v2.8 System Requirements list at: http://developer.amd.com/appsdk. In general, it is advised that you update your system to use the most recent graphics drivers that are available for it.

13. How does OpenCL compare to other APIs and programming platforms for parallel computing, such as OpenMP and MPI? Which one should I use?
OpenCL is designed to target parallelism within a single system and provide portability to multiple different types of devices (GPUs, multi-core CPUs, etc.). OpenMP targets multi-core CPUs and SMP systems. MPI is a message passing protocol most often used for communication between nodes; it is a popular parallel programming model for clusters of machines. Each programming model has its advantages. Developers are expected to mix APIs, for example by programming a cluster of GPU-equipped machines with MPI and OpenCL.

14. If I write my code on the CPU version, does it work on the GPU version, or do I have to make changes?
Assuming the size limitations for CPUs are considered, the code works on both the CPU and GPU components. Performance tuning, however, is different for each.

15. What is the precision of mathematical operations?
See Chapter 7, "OpenCL Numerical Compliance," of the OpenCL 1.2 Specification for exact mathematical operations precision requirements.
http://developer.amd.com/support/KnowledgeBase/Lists/KnowledgeBase/DispForm.aspx?ID=88

16. Are byte-addressable stores supported?
Byte-addressable stores are supported.

17. Are long integers supported?
Yes, 64-bit integers are supported.

18. Are operations on vectors supported?
Yes, operations on vectors are supported.

19. Is swizzling supported?
Yes, swizzling (the rearranging of elements in a vector) is supported.

20. How does one verify if OpenCL has been installed correctly on the system?
Run clinfo from the command-line interface. The clinfo tool shows the available OpenCL devices on a system.

21. How do I know if I have installed the latest version of the AMD APP SDK?
On installation of the SDK, if an internet connection is available, the installer states whether or not a newer SDK is available in the file VersionInfo.txt in directory C:\Program Files (x86)\AMD APP\docs\. Alternatively, you can check for the latest available version of the AMD APP SDK at http://developer.amd.com/appsdk.

2 Optimizations

22. How do I use constants in a kernel for best performance?
For performance using constants, highest to lowest performance is achieved using:
- Literal values.
- Constant pointer with compile-time constant indexing.
- Constant pointer with runtime constant indexing that is the same for all threads.
- Constant pointer with linear access indexing that is the same for all threads.
- Constant pointer with linear access indexing that is different between threads.
- Constant pointer with random access indexing.

23. Why are literal values the fastest way to use constants?
Up to 96 bits of literal values are embedded in the instruction; thus, in theory, there is no limit on the number of usable literals. In practice, the limit is 16K unique literals in a compilation unit.

24. Why does a*b+c not generate a mad instruction?
Depending on the hardware and the floating-point precision, the compiler may not generate a mad instruction for the computation of a*b+c because of the floating-point precision requirements in the OpenCL specification. In such cases, developers who want to exploit the mad instruction for performance can replace that computation with the mad() built-in function in OpenCL.

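The compiler cannot freely fuse a*b+c because a fused operation rounds once, while a separate multiply and add round twice, and the two can give different answers. The host-side C++ sketch below uses std::fma as a stand-in to show one such case. It is an analogy and not GPU code. The OpenCL specification leaves the rounding of the product inside mad() undefined, because mad() is meant for code where speed matters more than accuracy. The results of mad() may therefore differ from both versions shown here.

```cpp
#include <cassert>
#include <cmath>

// a*b+c with the product rounded to double before the add (two roundings).
// The volatile store keeps the compiler from contracting the expression
// into a fused multiply-add on targets that have one.
double mulThenAdd(double a, double b, double c) {
    volatile double product = a * b;  // first rounding
    return product + c;               // second rounding
}

// a*b+c computed with a single rounding (fused multiply-add).
double fusedMulAdd(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

Take a = 1 + 2^-27, b = 1 - 2^-27, and c = -1. The exact product 1 - 2^-54 rounds to 1.0 in double precision, so mulThenAdd returns 0.0. fusedMulAdd returns the exact value -2^-54.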
25. What is the preferred work-group size?
The preferred work-group size on the AMD platform is a multiple of 64.

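In OpenCL 1.x, the global work size must be evenly divisible by the work-group size. When the problem size is not a multiple of 64, host code therefore typically rounds the global size up and lets the surplus work-items exit early. A minimal sketch of the rounding follows. The helper name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Preferred work-group multiple on the AMD platform (one wavefront is 64 work-items).
constexpr std::size_t kWorkGroupMultiple = 64;

// Round n up to the next multiple of m (m must be > 0).
std::size_t roundUp(std::size_t n, std::size_t m) {
    assert(m > 0);
    return ((n + m - 1) / m) * m;
}
```

For 1000 elements, roundUp(1000, kWorkGroupMultiple) gives a global size of 1024. The kernel then returns immediately when get_global_id(0) >= 1000.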
3 OpenCL Questions

26. What is OpenCL?
OpenCL (Open Computing Language) is the first truly open and royalty-free programming standard for general-purpose computations on heterogeneous systems. OpenCL lets programmers preserve their expensive source code investment and easily target both multi-core CPUs and the latest GPUs, such as those from AMD. Developed in an open standards committee with representatives from major industry vendors, OpenCL gives users a cross-vendor, non-proprietary solution for accelerating their applications on their CPU and GPU cores.

27. How much does the AMD OpenCL development platform cost?
AMD bundles support for OpenCL as part of its AMD APP SDK product offering. The AMD APP SDK is offered to developers and users free of charge.

28. What operating systems does the AMD APP SDK v2.8 with OpenCL 1.2 support?
AMD APP SDK v2.8 runs on 32-bit and 64-bit versions of Windows and Linux. For the exact list of supported operating systems, see the AMD APP SDK v2.8 System Requirements list at: http://developer.amd.com/appsdk

29. Can I write an OpenCL application that works on both CPU and GPU?
Applications that program to the core OpenCL 1.2 API and kernel language should be able to target both CPUs and GPUs. At runtime, the appropriate device (CPU or GPU) must be selected by the application.

30. Does the AMD OpenCL compiler automatically vectorize for SSE on the CPU?
The CPU component of OpenCL that is bundled with the AMD APP SDK takes advantage of SSE3 instructions on the CPU. It also takes advantage of the AVX instructions where supported. In addition to AVX, OpenCL math library functions also leverage XOP and FMA4 capabilities on CPUs that support them.

31. Does the AMD APP SDK v2.8 with OpenCL 1.2 support work on multiple GPUs (ATI CrossFire)?
OpenCL applications can explicitly invoke separate compute kernels on multiple compatible GPUs in a single system. The partitioning of the algorithm into multiple parallel compute kernels must be done by the developer. It is recommended that ATI CrossFire be turned off in most system configurations so that AMD APP applications can access all available GPUs in the system. ATI CrossFire technology allows multiple AMD GPUs to work together on a single graphics-rendering task. This method does not apply to AMD APP computational tasks because it is not compatible with the compute model used for AMD APP applications.

32. Can I ship pre-compiled OpenCL application binaries that work on either CPU or GPU?
By using OpenCL runtime APIs, developers can write OpenCL applications that detect the available compatible CPUs and GPUs in the system. This lets developers pre-compile applications into binaries that run on whichever of the targeted CPUs or GPUs are present. Including LLVM IR in the binary provides a means for the binary to support devices for which the application was not explicitly pre-compiled.

33. Is the OpenCL double precision optional extension supported?
The Khronos and AMD double precision extensions are supported on certain devices. Your application can use the OpenCL API to query if this functionality is supported on the device in use.

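One way to perform this query is to read the device's CL_DEVICE_EXTENSIONS string with clGetDeviceInfo (not shown). The application then looks for cl_khr_fp64 or cl_amd_fp64 as whole, space-separated tokens. The sketch below works on a string that has already been obtained. The function names are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// CL_DEVICE_EXTENSIONS is a space-separated list of extension names.
// Compare whole tokens: a plain substring search would also accept a
// name that is only a prefix of another extension.
bool hasExtension(const std::string& extensions, const std::string& name) {
    std::istringstream tokens(extensions);
    std::string token;
    while (tokens >> token) {
        if (token == name) return true;
    }
    return false;
}

// Double precision is available if either the Khronos or the AMD extension is listed.
bool supportsDoublePrecision(const std::string& extensions) {
    return hasExtension(extensions, "cl_khr_fp64") ||
           hasExtension(extensions, "cl_amd_fp64");
}
```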
34. Is it possible to write OpenCL code that scales transparently over multiple devices?
For OpenCL programs that target only multi-core CPUs, scaling can be done transparently; however, scaling across multiple GPUs requires the developer to explicitly partition the algorithm into multiple compute kernels, as well as explicitly launch the compute kernels onto each compatible GPU.

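Here is a minimal sketch of the explicit partitioning step, assuming an even split across devices. It divides a 1D range of work-items into contiguous slices, one per GPU. Each slice could then be launched on that GPU's own command queue, for example as the global work offset and global size passed to clEnqueueNDRangeKernel (offsets are accepted from OpenCL 1.1 on). The names are hypothetical. Real code may also weight slices by device speed and round them to the work-group multiple.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One contiguous slice of the global index range, launched on one device.
struct Slice {
    std::size_t offset;  // first global index handled by this device
    std::size_t size;    // number of work-items for this device
};

// Split [0, total) into deviceCount near-equal contiguous slices.
// The first (total % deviceCount) slices receive one extra work-item.
std::vector<Slice> partitionRange(std::size_t total, std::size_t deviceCount) {
    assert(deviceCount > 0);
    std::vector<Slice> slices;
    const std::size_t base = total / deviceCount;
    const std::size_t extra = total % deviceCount;
    std::size_t offset = 0;
    for (std::size_t d = 0; d < deviceCount; ++d) {
        const std::size_t size = base + (d < extra ? 1 : 0);
        slices.push_back({offset, size});
        offset += size;
    }
    return slices;
}
```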
35. What should I do if I get wrong results on the Apple platform with AMD devices?
Apple handles support for the Apple platform; please contact them.

36. Is it possible to dynamically index into a vector?
No, this is not possible because a vector is not an array, but a representation of a hardware register.

37. What is the difference between local int a[4] and int a[4]?
local int a[4] uses hardware local memory, which is a small, low-latency, high-bandwidth memory; int a[4] uses per-thread hardware scratch memory, which is located in uncached global memory.

38. Why does using a barrier cause the max kernel work-group size to drop to 64 on HD 4XXX chips?
The supported HD 4XXX chips do not have a hardware barrier, so the OpenCL runtime cannot execute more than a single wavefront per group to satisfy the OpenCL memory consistency model.
Note that HD 4XXX device support is EOL. Catalyst drivers no longer include support for these devices. See the OpenCL SDK driver and compatibility page for more details.

39. How come my program runs slower in OpenCL than in CUDA/Brook+/IL?
When comparing performance, it is better to compare code optimized for our OpenCL platform against code optimized for another vendor's OpenCL platform. By comparing the same toolchain on different vendors, you can find out which vendor's hardware works best for your problem set.

40. Why can I not use textures on RV7XX devices in OpenCL?
RV7XX devices do not support all of the texture modes and precision requirements that OpenCL requires. Since textures are mapped to images in OpenCL, and image support is an "all or nothing" approach, we do not support images on RV7XX devices; thus, there is no access to textures.
Note that HD 4XXX device support is EOL. Catalyst drivers no longer include support for these devices. See the OpenCL SDK driver and compatibility page for more details.

41. Why do read-write images not exist in OpenCL?
OpenCL has a memory consistency model that requires certain constraints (see the OpenCL Specification for more information). Image accesses go through special-function hardware units, and the units used for reading are different from those used for writing. Pointers, by contrast, mostly use the same hardware units for reads and writes, so they can guarantee the consistency that OpenCL requires.

42. Does prefetching work on the GPU?
Prefetch is not needed on the GPU because the hardware has a built-in mechanism to hide latency when many work-groups are running. The LDS can be used as a software-controlled cache.

43. How do you determine the max number of concurrent work-groups?
The maximum number of concurrent work-groups is determined by resource usage. This includes the number of registers, the amount of LDS space, and the number of threads per work-group. There is no way to directly specify the number of registers used by a kernel. Each SIMD has a 64-wide register file, with each column consisting of 256x32x4 registers.

44. Is it possible to tell OpenCL not to use all CPU cores?
Yes, use the device fission extension.

4 OpenCL Optimizations

45. What is more efficient, the ternary operator ?: or the select function?
The select function compiles to the single-cycle instruction cmov_logical; in most cases, ?: also compiles to the same instruction. In some cases, when memory is in one of the operands, the ?: operator is compiled down to an IF/ELSE block. An IF/ELSE block takes more than a single instruction to execute.

46. What is the difference between 24-bit and 32-bit integer operations?
24-bit operations are faster because they use floating-point hardware and can execute on all compute units. Many 32-bit integer operations also run on all stream processors, but if both a 24-bit and a 32-bit version exist for the same instruction, only one 32-bit instruction executes per cycle.

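In OpenCL C, the 24-bit path is exposed through the mul24() and mad24() built-ins. The specification defines their result only when the multiplied operands fit in 24 bits: [-2^23, 2^23 - 1] for signed values and [0, 2^24 - 1] for unsigned values. A host-side range check, such as the hypothetical helpers below, can decide whether a kernel variant that uses these built-ins is safe for a given problem size.

```cpp
#include <cassert>
#include <cstdint>

// Signed mul24/mad24 operands must lie in [-2^23, 2^23 - 1].
constexpr bool fitsSigned24(std::int64_t x) {
    return x >= -(std::int64_t{1} << 23) && x <= (std::int64_t{1} << 23) - 1;
}

// Unsigned operands must lie in [0, 2^24 - 1].
constexpr bool fitsUnsigned24(std::uint64_t x) {
    return x <= (std::uint64_t{1} << 24) - 1;
}

// Hypothetical check: can a linear index row * width + col (row < rows)
// be computed as mad24(row, width, col)? Both multiplicands must fit in
// 24 bits, and the largest index must still fit in a signed 32-bit int.
constexpr bool canUseMad24ForIndex(std::int64_t rows, std::int64_t width) {
    return rows > 0 && width > 0 &&
           fitsSigned24(rows - 1) && fitsSigned24(width) &&
           rows * width <= INT32_MAX;
}
```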
5 Hardware Information

47. How are 8/16-bit operations handled in hardware?
The 8/16-bit operations are emulated with 32-bit registers.

48. Do 24-bit integers exist in hardware?
No, there are 24-bit instructions, such as MUL24/MAD24, but the smallest integer in hardware registers is 32 bits.

49. What are the benefits of using 8/16-bit types over 32-bit integers?
8/16-bit types take less memory than a 32-bit integer type, increasing the amount of data you are able to load with a single instruction. The OpenCL compiler up-converts to 32 bits on load and down-converts to the correct type on store.

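The host-side C++ sketch below illustrates this trade-off. It packs four 8-bit values into one 32-bit word, so a single 32-bit load carries four elements. It then widens each element to 32 bits for arithmetic and narrows the result back to 8 bits on store. This only mirrors the steps the answer describes. It is not the code the OpenCL compiler generates, and the function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Pack four 8-bit values into one 32-bit word (element 0 in the low byte).
std::uint32_t pack4(const std::array<std::uint8_t, 4>& v) {
    return static_cast<std::uint32_t>(v[0]) |
           static_cast<std::uint32_t>(v[1]) << 8 |
           static_cast<std::uint32_t>(v[2]) << 16 |
           static_cast<std::uint32_t>(v[3]) << 24;
}

// Unpack: each byte is widened ("up-converted") to 32 bits for arithmetic.
std::array<std::uint32_t, 4> unpack4(std::uint32_t word) {
    return {word & 0xFFu, (word >> 8) & 0xFFu, (word >> 16) & 0xFFu, word >> 24};
}

// Narrow ("down-convert") a 32-bit result back to 8 bits on store;
// like a C++ conversion to uint8_t, only the low byte is kept.
std::uint8_t narrow8(std::uint32_t x) {
    return static_cast<std::uint8_t>(x);
}
```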
50. What is the difference between a GPR and a shared register?
Although they are physically equivalent, the difference is whether the register offset in the hardware is absolute to the register file or relative to the wavefront ID.

51. How often are wavefronts created?
Wavefronts are created by the hardware to execute as long as resources are available. If they are created but cannot execute immediately, they are put in a wait queue where they stay until currently running wavefronts are finished.

52. What is the maximum number of wavefronts?
The maximum number of wavefronts is determined by whichever resource most limits the number of wavefronts that can be spawned. This can be the number of registers, the amount of local memory, the required stack size, or other factors. A compute shader that uses local memory has a hard cap of 16 wavefronts.

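Questions 43 and 52 both say that the scarcest resource sets the limit. The sketch below writes that rule out as arithmetic under a deliberately simplified model. The structure fields and all budget numbers are hypothetical, not values for a specific AMD device. Only the 16-wavefront cap comes from the answer above.

```cpp
#include <algorithm>
#include <cassert>

// Cap quoted in the answer above for compute shaders that use local memory.
constexpr int kLocalMemoryWavefrontCap = 16;

// Hypothetical per-compute-unit budget; real values depend on the device.
struct ComputeUnit {
    int registersPerLane;  // registers each SIMD lane can hand out
    int ldsBytes;          // local memory (LDS) available
    int maxWavefronts;     // scheduler limit on resident wavefronts
};

// Resource usage of one kernel.
struct Kernel {
    int registersPerWorkItem;
    int ldsBytesPerWorkGroup;    // 0 if the kernel uses no local memory
    int wavefrontsPerWorkGroup;  // e.g. a 256-item work-group is 4 wavefronts
};

// Simplified model: each resident wavefront takes registersPerWorkItem
// registers from every lane, and LDS is allocated per work-group.
// The tightest limit determines how many wavefronts can be resident.
int maxResidentWavefronts(const ComputeUnit& cu, const Kernel& k) {
    assert(k.registersPerWorkItem > 0 && k.wavefrontsPerWorkGroup > 0);
    int limit = std::min(cu.maxWavefronts, cu.registersPerLane / k.registersPerWorkItem);
    if (k.ldsBytesPerWorkGroup > 0) {
        const int groupsByLds = cu.ldsBytes / k.ldsBytesPerWorkGroup;
        limit = std::min(limit, groupsByLds * k.wavefrontsPerWorkGroup);
        limit = std::min(limit, kLocalMemoryWavefrontCap);
    }
    return limit;
}

// A work-group's wavefronts are resident together, so count whole groups only.
int maxResidentWorkGroups(const ComputeUnit& cu, const Kernel& k) {
    return maxResidentWavefronts(cu, k) / k.wavefrontsPerWorkGroup;
}
```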
53. Why do I get blue or black screens when executing longer-running kernels?
The GPU is not a preemptable device. If you are running the GPU as your display device, ensure that a compute program does not use the GPU past a certain time limit set by Windows. Exceeding the time limit causes the watchdog timer to trigger; this can result in undefined program results.

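A common way to respect this limit is to split one long computation into several shorter kernel launches. Each launch covers a slice of the work, for example through the global work offset, so no single launch approaches the watchdog timeout. The sketch below only computes the slice sizes. The per-item cost and the time budget are estimates that the caller supplies, and the helper name is hypothetical. On Windows, the timeout is controlled by the TDR settings (TdrDelay defaults to 2 seconds), so choosing a budget well below that leaves a safety margin.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Split totalItems work-items into launches whose estimated run time stays
// under budgetMs, given an estimated cost per work-item in milliseconds.
// Both timing inputs are assumptions supplied by the caller (e.g. measured).
std::vector<std::size_t> launchSizes(std::size_t totalItems, double msPerItem, double budgetMs) {
    assert(msPerItem > 0.0 && budgetMs > 0.0);
    std::size_t perLaunch = static_cast<std::size_t>(budgetMs / msPerItem);
    perLaunch = std::max<std::size_t>(perLaunch, 1);  // always make progress
    std::vector<std::size_t> sizes;
    for (std::size_t done = 0; done < totalItems; done += perLaunch) {
        sizes.push_back(std::min(perLaunch, totalItems - done));
    }
    return sizes;
}
```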
54. What is the cost of a clause switch?
In general, the latency of a clause switch is around 40 cycles. Note that this is relevant only for Evergreen and Northern Islands devices.

55. How can I hide clause switch latency?
By executing multiple wavefronts in parallel. Note that this is relevant only for Evergreen and Northern Islands devices.

56. How can I reduce clause switches?
Clause switches are almost directly related to source program control flow. By reducing source program control flow, clause switches can also be reduced. This is only relevant for Evergreen and Northern Islands devices.

57. How does the hardware execute a loop on a wavefront?
The loop only ends execution for a wavefront once every thread in the wavefront breaks out of the loop. Once a thread breaks out of the loop, all of its execution results are masked, but the execution still occurs.

58. How does flow control work with wavefronts?
There are no flow control units for each individual thread, so the whole wavefront must execute the branch if any thread in the wavefront executes the branch. If the condition is false, the results are not written to memory, but the execution still occurs.

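The masking behavior described in questions 57 and 58 can be modeled on the host. In the sketch below, lane i of a 64-wide wavefront needs tripCount[i] loop iterations. The wavefront keeps iterating until the slowest lane finishes, and the model counts how many lane-iterations actually produce results. This is a simplified model for intuition, not a description of the scheduler, and the names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWavefrontSize = 64;

struct LoopCost {
    int wavefrontIterations;   // iterations the whole wavefront executes
    int usefulLaneIterations;  // lane-iterations whose results are kept
};

// Lane i needs tripCount[i] iterations. The wavefront loops until every lane
// has exited; lanes that already exited still execute, but are masked off.
LoopCost simulateLoop(const std::array<int, kWavefrontSize>& tripCount) {
    const int maxTrips = *std::max_element(tripCount.begin(), tripCount.end());
    int useful = 0;
    for (int iter = 0; iter < maxTrips; ++iter) {
        for (int trips : tripCount) {
            if (iter < trips) ++useful;  // lane is still inside the loop
        }
    }
    return {maxTrips, useful};
}
```

If lane i needs i iterations (0 to 63), the wavefront runs 63 iterations. Only 2016 of the 63 x 64 = 4032 lane-iterations do useful work, so half of the execution is masked.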
59. What is the constant buffer size on GPU hardware?
64 kB.

60. What happens with out-of-bound memory operations?
Writes are dropped, and reads return a pre-defined value.

61. For 7XX devices, why does 64x1 give bad performance in OpenCL kernels?
One reason is how the caches are set up on RV7XX devices. The caches are optimized to work in a tiled mode, not in linear mode (which is the mode OpenCL kernels use). To get optimal cache re-use from the texture in compute shader mode on RV7XX devices, reblock your thread IDs. A 16x4, 8x8, or 4x16 block should give you good enough blocking to get cache performance similar to that of your pixel shader kernel. This is because a cache line can be thought of as a 4x2 block of data coming in at once. So, for pixel shaders, 64 threads are blocked in an 8x8 block that uses exactly eight cache lines. For OpenCL kernels, your 64x1 block pattern uses 16 cache lines, but only uses half the data in each cache line.
Note that HD 4XXX device support is EOL. Catalyst drivers no longer include support for these devices. See the OpenCL SDK driver and compatibility page for more details.

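The cache-line arithmetic above can be checked with a few lines of code. The sketch assumes the 4x2 cache-line shape given in the answer, a block that starts on a cache-line boundary, and one element read per thread. The function names are hypothetical.

```cpp
#include <cassert>

// Cache line modeled as a 4x2 block of data, as described in the answer above.
constexpr int kLineWidth = 4;
constexpr int kLineHeight = 2;

// Number of 4x2 cache lines touched by a width x height block of threads,
// assuming the block starts on a cache-line boundary and each thread reads
// the one element at its (x, y) position.
int cacheLinesTouched(int width, int height) {
    const int across = (width + kLineWidth - 1) / kLineWidth;
    const int down = (height + kLineHeight - 1) / kLineHeight;
    return across * down;
}

// Fraction of the fetched cache-line data that the block actually uses.
double lineUtilization(int width, int height) {
    return static_cast<double>(width * height) /
           (cacheLinesTouched(width, height) * kLineWidth * kLineHeight);
}
```

An 8x8, 16x4, or 4x16 block touches 8 lines with full utilization. A 64x1 block touches 16 lines and uses half of each.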
62. What is unique about the LDS in HD 4XXX devices, and what are its performance characteristics?
The LDS in the HD 4XXX devices is an owner's write model with limited applications. When used correctly, it has very similar performance characteristics to the L1 cache, but the user gains control over what data exists in the memory. The LDS_Transpose sample in the SDK uses the LDS in the HD 4XXX devices very efficiently.
Note that HD 4XXX device support is EOL. Catalyst drivers no longer include support for these devices. See the OpenCL SDK driver and compatibility page for more details.

6 Microsoft Visual Studio

63. Can I use the SDK with Microsoft Visual Studio 2012 Express?
Due to limitations in Microsoft Visual Studio Express, it is only possible to use build files for individual samples. Microsoft Visual Studio Express does not support building all of the samples at the same time. The project files that build all of the samples are only supported by full versions of Microsoft Visual Studio 2008, 2010, or 2012.

7 Bolt

64. What GPUs is Bolt optimized for?
Bolt is optimized for the AMD "Southern Islands" family of GPUs.

65. Which GPU compute technologies are supported by Bolt?
The preview version of Bolt uses OpenCL as its compute back-end; however, future releases should support both C++ AMP and OpenCL compute back-ends.

66. Are the Bolt APIs final?
The APIs in the Bolt preview are not final. The Bolt API should be stable and should use formal deprecation procedures from Bolt release 1.0 onwards. Please help us to improve Bolt; we encourage you to provide suggestions on how we can improve the Bolt API.

67. Does Bolt require linking to a library?
Yes, Bolt includes a small, static library that must be linked into a user program to properly resolve symbols. Since Bolt is a template library, almost all functionality is written in header files; however, OpenCL follows an online compilation model, and that part makes more sense to include in a library than in header files.

68. Does Bolt depend on any other libraries?
Yes, Bolt contains dependencies on Boost. All dependent header files and pre-compiled libraries are available in the Bolt SDK.

69. What algorithms does Bolt currently support?
transform, reduce, transform_reduce, count, count_if, inclusive_scan, exclusive_scan, and sort.

70. What version of OpenCL does Bolt currently require?
Bolt uses features available in OpenCL v1.2.

71. Does Bolt require the AMD OpenCL runtime?
Yes, because Bolt requires the use of templates in kernel code. At the time of this writing, AMD is the only vendor to provide this support.

72. Which Catalyst package should I use with Bolt?
Generally speaking, the latest Catalyst package contains the most recent OpenCL runtime. As of the time of this writing, the recommended Catalyst package is 12.10.

73. Which license is Bolt licensed under?
The Apache License, Version 2.0.

74. When should I use device_vector vs regular host memory?
bolt::cl::device_vector is used to manage device-local memory and can deliver higher performance on discrete GPU systems. However, the host memory interfaces eliminate the need to create and manage device_vectors. If memory is re-used across multiple Bolt calls or is referenced by other kernels, using device_vector delivers higher performance.

75. How do I measure the performance of OpenCL Bolt library calls?
The first time that a Bolt library routine is called, the runtime calls the OpenCL compiler to compile the kernel. Each unique template instantiation reuses the same OpenCL compilation, so two Bolt calls with the same functors are compiled only one time. When measuring the performance of Bolt, it is important to exclude the first call (with the compilation overhead) from the timing measurement. Here are two ways to time Bolt function calls.

The first example pulls the first call outside the timer loop:

    // a and z: populated vectors of equal length (element type int assumed);
    // numIterations: placeholder, the iteration count is not given here.
    std::vector<int> a, z;
    bolt::cl::transform(a.begin(), a.end(), z.begin(), bolt::cl::negate<int>());
    startTimer(...);
    for (int i = 0; i < numIterations; i++) {
        bolt::cl::transform(a.begin(), a.end(), z.begin(), bolt::cl::negate<int>());
    }
    stopTimer(...);

A second alternative is to explicitly exclude the first compilation from the timing region:

    // Explicitly start timer after first iteration:
    std::vector<int> a, z;
    for (int i = 0; i < numIterations; i++) {
        if (i == 1) startTimer(...);  // iteration 0 absorbs the compile cost
        bolt::cl::transform(a.begin(), a.end(), z.begin(), bolt::cl::negate<int>());
    }
    stopTimer(...);

76. The Bolt library APIs accept host data structures such as std::vector. Are these copied to the GPU or kept in host memory?
For host data structures, Bolt creates an OpenCL buffer with the CL_MEM_USE_HOST_PTR flag and uses the OpenCL map and unmap APIs to manage the host and device access. On the AMD OpenCL implementation, this results in data being kept in host memory when the host memory is aligned on a 256-byte boundary. In other cases, the runtime copies the data to the GPU as needed. See section 4.5.2 of the AMD Accelerated Parallel Processing OpenCL Programming Guide for more information on memory optimization. Higher performance may be obtained by aligning the host data structures to 256-byte boundaries or by using the device_vector class for finer control over the allocation properties.

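A minimal way to obtain 256-byte-aligned host storage in standard C++17 is std::aligned_alloc, sketched below. This is illustrative only, and the helper names are hypothetical. MSVC does not provide std::aligned_alloc; _aligned_malloc and _aligned_free are the usual substitutes there. Handing such a buffer to Bolt through a std::vector would also require a custom allocator, which is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <new>

// Alignment quoted in the answer above for keeping host data in host memory.
constexpr std::size_t kHostAlignment = 256;

// Releases memory obtained from std::aligned_alloc.
struct FreeDeleter {
    void operator()(int* p) const { std::free(p); }
};

using AlignedInts = std::unique_ptr<int[], FreeDeleter>;

// Allocate count ints starting on a 256-byte boundary. std::aligned_alloc
// requires the byte size to be a multiple of the alignment, so the size is
// rounded up. The ints are left uninitialized.
AlignedInts allocateAligned(std::size_t count) {
    std::size_t bytes = count * sizeof(int);
    bytes = (bytes + kHostAlignment - 1) / kHostAlignment * kHostAlignment;
    if (bytes == 0) bytes = kHostAlignment;  // avoid a zero-size request
    void* p = std::aligned_alloc(kHostAlignment, bytes);
    if (p == nullptr) throw std::bad_alloc();
    return AlignedInts(static_cast<int*>(p));
}

// True if p lies on the given alignment boundary.
bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```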
77. Bolt APIs also accept device_vector inputs. When should I use device_vectors for my inputs?
The bolt::cl::device_vector interface is a thin wrapper around the cl::Buffer class; it provides an interface similar to std::vector. Also, the device_vector optionally accepts a flags argument that is passed to the OpenCL buffer creation. The following two flags can be useful:
- CL_MEM_ALLOC_HOST_PTR: Forces memory to be allocated in host memory, even on discrete GPUs. GPU access to the data is slower (over the PCIe bus), but the allocation avoids the overhead of copying data to, and from, the GPU. This can be useful when the data is used in only a single Bolt call before being processed again on the CPU.
- CL_MEM_USE_PERSISTENT_MEM_AMD: Provides device-resident zero copy memory. Use this for data that is write-only by the CPU.
See section 4.5.2 of the AMD Accelerated Parallel Processing OpenCL Programming Guide for more information.

8 C++ AMP

78. What is C++ AMP?
C++ Accelerated Massive Parallelism (C++ AMP) accelerates execution of C++ code by taking advantage of data-parallel hardware, such as the compute units on an APU. C++ AMP is an open specification that extends regular C++ through the use of the restrict keyword to denote portions of code to be executed on the accelerated device.

79. What AMD devices support C++ AMP?
Any AMD APU or GPU that supports Microsoft DirectX 11 Feature Level 11.0 and higher.

80. Can you run a C++ AMP program on a CPU?
You can run a C++ AMP program on a CPU, but its usage is not recommended in a production environment. For more information, see: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/10/cpu-accelerator-in-c-amp.aspx

81. Where can I get support for C++ AMP?
C++ AMP is supported by the C++ compiler in Microsoft Visual Studio 2012.

82. Is C++ AMP supported on Linux?
At this time, C++ AMP is not supported on Linux; however, as C++ AMP is defined as an open standard, Microsoft has opened the door for implementations on operating systems other than Windows.

83. Where can I learn more about C++ AMP?
There is an MSDN page on C++ AMP. Also, the book C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++, by Kate Gregory and Ade Miller, may be useful.

9 Aparapi

84. What is Aparapi?
Aparapi is an API for expressing data parallel algorithms in Java and a runtime component capable of converting Java bytecode to OpenCL for execution on the GPU. If Aparapi cannot execute on the GPU at runtime, it executes the developer's algorithm in a Java thread pool. For appropriate workloads, this extends Java's 'Write Once Run Anywhere' to include GPU devices.

85. What AMD devices support Aparapi?
Any AMD APU, CPU, or GPU that supports OpenCL 1.1 and higher will work. See the AMD APP SDK v2.8 System Requirements list at: http://developer.amd.com/appsdk.

86. Where can I get support for Aparapi?
See: http://code.google.com/p/aparapi/

87. Is Aparapi supported on Linux?
Yes.

88. Which JDK/JRE is recommended for Aparapi?
The Oracle JDK/JRE, although not a requirement for Aparapi, is highly recommended for performance and stability reasons.

89. Where can I learn more about Aparapi?
You can find out more about Aparapi from: http://code.google.com/p/aparapi/

AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.

Copyright and Trademarks
© 2012 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, Radeon, FireStream, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.

The contents of this document are provided in connection with Advanced Micro Devices, Inc. ("AMD") products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. The information contained herein may be of a preliminary or advance nature and is subject to change without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD's Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.

Contact
Advanced Micro Devices, Inc.
One AMD Place
P.O. Box 3453
Sunnyvale, CA, 94088-3453
Phone: +1.408.749.4000

For AMD Accelerated Parallel Processing:
URL: developer.amd.com/appsdk
Developing: developer.amd.com/
Forum: developer.amd.com/openclforum