cgfreehost

freehost  时间:2021-04-10  阅读:()
NVIDIA2010NVIDIA2010PauliusMicikevicius|NVIDIAFundamentalOptimizationsSupercomputing,TutorialS03NewOrleans,Nov14,2010NVIDIA2010OutlineKerneloptimizations–Launchconfiguration–Globalmemorythroughput–Sharedmemoryaccess–Instructionthroughput/controlflowOptimizationofCPU-GPUinteraction–MaximizingPCIethroughput–OverlappingkernelexecutionwithmemorycopiesNVIDIA2010LaunchConfigurationNVIDIA2010LaunchConfigurationHowmanythreads/threadblockstolaunchKeytounderstanding:–Instructionsareissuedinorder–Athreadstallswhenoneoftheoperandsisn'tready:Memoryreadbyitselfdoesn'tstallexecution–LatencyishiddenbyswitchingthreadsGMEMlatency:400-800cyclesArithmeticlatency:18-22cyclesConclusion:–NeedenoughthreadstohidelatencyNVIDIA2010LaunchConfigurationHidingarithmeticlatency:Need~18warps(576)threadsperFermiSMFewerwarpsforpre-FermiGPUs(FermiSMmorethandoubledissuerate)Or,latencycanalsobehiddenwithindependentinstructionsfromthesamewarpForexample,ifinstructionneverdependsontheoutputofprecedinginstruction,thenonly9warpsareneeded,etc.
Maximizingglobalmemorythroughput:Dependsontheaccesspattern,andwordsizeNeedenoughmemorytransactionsinflighttosaturatethebusIndependentloadsandstoresfromthesamethreadLoadsandstoresfromdifferentthreadsLargerwordsizescanalsohelp(float2istwicethetransactionsoffloat,forexample)NVIDIA2010MaximizingMemoryThroughputIncrementofanarrayof64Melements–Twoaccessesperthread(loadthenstore)–Thetwoaccessesaredependent,soreally1accessperthreadatatimeTeslaC2050,ECCon,theoreticalbandwidth:~120GB/sSeveralindependentsmalleraccesseshavethesameeffectasonelargerone.
Forexample:Four32-bit~=one128-bitNVIDIA2010LaunchConfiguration:SummaryNeedenoughtotalthreadstokeepGPUbusy–Typically,you'dlike512+threadsperSMMoreifprocessingonefp32elementperthread–Ofcourse,exceptionsexistThreadblockconfiguration–Threadsperblockshouldbeamultipleofwarpsize(32)–SMcanconcurrentlyexecuteupto8threadblocksReallysmallthreadblockspreventachievinggoodoccupancyReallylargethreadblocksarelessflexibleIgenerallyuse128-256threads/block,butusewhateverisbestfortheapplicationFormoredetails:–VasilyVolkov'sGTC2010talk"BetterPerformanceatLowerOccupancy"NVIDIA2010GlobalMemoryThroughputNVIDIA2010FermiMemoryHierarchyReviewL2GlobalMemoryRegistersL1SM-NSMEMRegistersL1SM-0SMEMRegistersL1SM-1SMEMNVIDIA2010FermiMemoryHierarchyReviewLocalstorage–Eachthreadhasownlocalstorage–Mostlyregisters(managedbythecompiler)Sharedmemory/L1–Programconfigurable:16KBshared/48KBL1OR48KBshared/16KBL1–Sharedmemoryisaccessiblebythethreadsinthesamethreadblock–Verylowlatency–Veryhighthroughput:1+TB/saggregateL2–AllaccessestoglobalmemorygothroughL2,includingcopiesto/fromCPUhostGlobalmemory–Accessiblebyallthreadsaswellashost(CPU)–Higherlatency(400-800cycles)–Throughput:upto177GB/sNVIDIA2010ProgrammingforL1andL2Shortanswer:DON'T–GPUcachesarenotintendedforthesameuseasCPUcachesSmallersize(especiallyperthread),sonotaimedattemporalreuseIntendedtosmoothoutsomeaccesspatterns,helpwithspilledregisters,etc.
–Don'ttrytoblockforL1/L2likeyouwouldonCPUYouhave100sto1,000sofrun-timescheduledthreadshittingthecachesIfitispossibletoblockforL1thenblockforSMEM–Samesize,samebandwidth,hwwillnotevictbehindyourbackOptimizeasifnocacheswerethere–NoFermi-onlytechniquestolearnperse(so,allyouknowisstillgood)–SomecaseswilljustrunfasterNVIDIA2010FermiGMEMOperationsTwotypesofloads:–CachingDefaultmodeAttemptstohitinL1,thenL2,thenGMEMLoadgranularityis128-byteline–Non-cachingCompilewith–Xptxas–dlcm=cgoptiontonvccAttemptstohitinL2,thenGMEM–DonothitinL1,invalidatethelineifit'sinL1alreadyLoadgranularityis32-bytesStores:–InvalidateL1,write-backforL2NVIDIA2010LoadCachingandL1SizeNon-cachingloadscanimproveperfwhen:–LoadingscatteredwordsoronlyapartofawarpissuesaloadBenefit:transactionissmaller,sousefulpayloadisalargerpercentageLoadinghalos,forexample–Spillingregisters(reducelinefightingwithspillage)LargeL1canimproveperfwhen:–Spillingregisters(morelinessofewerevictions)–Somemisaligned,stridedaccesspatterns–16-KBL1/48-KBsmemOR48-KBL1/16-KBsmemCUDAcall,canbesetfortheapporper-kernelHowtouse:–Justtrya2x2experimentmatrix:{CA,CG}x{48-L1,16-L1}Keepthebestcombination-sameasyouwouldwithanyHWmanagedcache,includingCPUsNVIDIA2010LoadOperationMemoryoperationsareissuedperwarp(32threads)–Justlikeallotherinstructions–PriortoFermi,memoryissueswereperhalf-warpOperation:–Threadsinawarpprovidememoryaddresses–Determinewhichlines/segmentsareneeded–Requesttheneededlines/segmentsNVIDIA2010CachingLoadWarprequests32aligned,consecutive4-bytewordsAddressesfallwithin1cache-line–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0NVIDIA2010Non-cachingLoadWarprequests32aligned,consecutive4-bytewordsAddressesfallwithin4segments–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0NVIDIA2010CachingLoad.
.
.
961921281602242882563264352320384448416Memoryaddressesaddressesfromawarp0Warprequests32aligned,permuted4-bytewordsAddressesfallwithin1cache-line–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%NVIDIA2010Non-cachingLoad.
.
.
961921281602242882563264352320384448416Memoryaddressesaddressesfromawarp0Warprequests32aligned,permuted4-bytewordsAddressesfallwithin4segments–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%NVIDIA2010CachingLoad96192128160224288256.
.
.
addressesfromawarp32640352320384448416MemoryaddressesWarprequests32misaligned,consecutive4-bytewordsAddressesfallwithin2cache-lines–Warpneeds128bytes–256bytesmoveacrossthebusonmisses–Busutilization:50%NVIDIA2010Non-cachingLoad96192128160224288256.
.
.
addressesfromawarp32640352320384448416MemoryaddressesWarprequests32misaligned,consecutive4-bytewordsAddressesfallwithinatmost5segments–Warpneeds128bytes–Atmost160bytesmoveacrossthebus–Busutilization:atleast80%Somemisalignedpatternswillfallwithin4segments,so100%utilizationNVIDIA2010CachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Allthreadsinawarprequestthesame4-bytewordAddressesfallwithinasinglecache-line–Warpneeds4bytes–128bytesmoveacrossthebusonamiss–Busutilization:3.
125%NVIDIA2010Non-cachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Allthreadsinawarprequestthesame4-bytewordAddressesfallwithinasinglesegment–Warpneeds4bytes–32bytesmoveacrossthebusonamiss–Busutilization:12.
5%NVIDIA2010CachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Warprequests32scattered4-bytewordsAddressesfallwithinNcache-lines–Warpneeds128bytes–N*128bytesmoveacrossthebusonamiss–Busutilization:128/(N*128)NVIDIA2010Non-cachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Warprequests32scattered4-bytewordsAddressesfallwithinNsegments–Warpneeds128bytes–N*32bytesmoveacrossthebusonamiss–Busutilization:128/(N*32)NVIDIA2010ImpactofAddressAlignmentWarpsshouldaccessalignedregionsformaximummemorythroughput–FermiL1canhelpformisalignedloadsifseveralwarpsareaccessingacontiguousregion–ECCfurthersignificantlyreducesmisalignedstorethroughputExperiment:–Copy16MBoffloats–256threads/blockGreatestthroughputdrop:–GT200:40%–Fermi:–CAloads:15%–CGloads:32%NVIDIA2010GMEMOptimizationGuidelinesStriveforperfectcoalescingperwarp–Alignstartingaddress(mayrequirepadding)–AwarpshouldaccesswithinacontiguousregionHaveenoughconcurrentaccessestosaturatethebus–LaunchenoughthreadstomaximizethroughputLatencyishiddenbyswitchingthreads(warps)–ProcessseveralelementsperthreadMultipleloadsgetpipelinedIndexingcalculationscanoftenbereusedTryL1andcachingconfigurationstoseewhichoneworksbest–Cachingvsnon-cachingloads(compileroption)–16KBvs48KBL1(CUDAcall)NVIDIA2010SharedMemoryNVIDIA2010SharedMemoryUses:–Inter-threadcommunicationwithinablock–Cachedatatoreduceredundantglobalmemoryaccesses–UseittoimproveglobalmemoryaccesspatternsFermiorganization:–32banks,4-bytewidebanks–Successive4-bytewordsbelongtodifferentbanksPerformance:–4bytesperbankper2clockspermultiprocessor–smemaccessesareissuedper32threads(warp)per16-threadsforGPUspriortoFermi–serialization:ifnthreadsinawarpaccessdifferent4-bytewordsinthesamebank,naccessesareexecutedserially–multicast:nthreadsaccessthesamewordinonefetchCouldbedifferentbyteswithinthesamewordPriortoFermi,onlybroadcastwasavailable,sub-wordaccesseswithinthesamebankcausedserializationNVIDIA2010BankAddressingExamplesNoBankConflictsNoBankConflictsBank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0Bank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0NVIDIA2010BankAddressingExamples2-wayBankConflicts8-wayBankConflictsThread31Thread30Thread29Thread28Thread4Thread3Thread2Thread1Thread0Bank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0Bank9Bank8Bank31Bank7Bank2Bank1Bank0x8x8NVIDIA2010SharedMemory:AvoidingBankConflicts32x32SMEMarrayWarpaccessesacolumn:–32-waybankconflicts(threadsinawarpaccessthesamebank)312103121031210warps:01231Bank0Bank1…Bank3120131NVIDIA2010SharedMemory:AvoidingBankConflictsAddacolumnforpadding:–32x33SMEMarrayWarpaccessesacolumn:–32differentbanks,nobankconflicts312103121031210warps:01231paddingBank0Bank1…Bank3131201NVIDIA2010Additional"memories"TextureandconstantRead-onlyDataresidesinglobalmemoryReadthroughdifferentcachesNVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwiseNVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwise__global__voidkernel(constfloat*g_a){.
.
.
floatx=g_a[15];//uniformfloaty=g_a[blockIdx.
x+5];//uniformfloatz=g_a[threadIdx.
x];//non-uniform.
.
.
}NVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwiseNVIDIA2010ConstantMemoryKernelexecutes10Kthreads(320warps)perSMduringitslifetimeAllthreadsaccessthesame4BwordUsingGMEM:–Eachwarpfetches32B->10KBofbustraffic–Cachingloadspotentiallyworse–128Bline,verylikelytobeevictedmultipletimes.
.
.
addressesfromawarp9619212816022428825632643523203844484160NVIDIA2010ConstantMemoryKernelexecutes10Kthreads(320warps)perSMduringitslifetimeAllthreadsaccessthesame4BwordUsingconstant/uniformaccess:–Firstwarpfetches32bytes–Allothershitinconstantcache->32bytesofbustrafficperSMUnlikelytobeevictedoverkernellifetime–otherloadsdonotgothroughthiscache.
.
.
addressesfromawarp9619212816022428825632643523203844484160NVIDIA2010TextureSeparatecacheDedicatedtexturecachehardwareprovides:–Out-of-boundsindexhandlingclamporwrap-around–OptionalinterpolationThink:usingfpindicesforarraysLinear,bilinear,trilinear–Interpolationweightsare9-bit–Optionalformatconversion{char,short,int}->float–Alloftheseare"free"NVIDIA2010InstructionThroughput/ControlFlowNVIDIA2010RuntimeMathLibraryandIntrinsicsTwotypesofruntimemathlibraryfunctions–__func():manymapdirectlytohardwareISAFastbutloweraccuracy(seeCUDAProgrammingGuideforfulldetails)Examples:__sinf(x),__expf(x),__powf(x,y)–func():compiletomultipleinstructionsSlowerbuthigheraccuracy(5ulporless)Examples:sin(x),exp(x),pow(x,y)Anumberofadditionalintrinsics:–__sincosf(),__frcp_rz(),.
.
.
–ExplicitIEEEroundingmodes(rz,rn,ru,rd)NVIDIA2010ControlFlowInstructionsareissuedper32threads(warp)Divergentbranches:–Threadswithinasinglewarptakedifferentpathsif-else,.
.
.
–DifferentexecutionpathswithinawarpareserializedDifferentwarpscanexecutedifferentcodewithnoimpactonperformanceAvoiddivergingwithinawarp–Examplewithdivergence:if(threadIdx.
x>2){.
.
.
}else{.
.
.
}Branchgranularity2){.
.
.
}else{.
.
.
}BranchgranularityisawholemultipleofwarpsizeNVIDIA2010ControlFlowif(.
.
.
){//then-clause}else{//else-clause}instructionsNVIDIA2010Executionwithinwarpsiscoherentinstructions/timeWarp("vector"ofthreads)35343363623232131300Warp("vector"ofthreads)NVIDIA2010Executiondivergeswithinawarpinstructions/time32131300353433636232NVIDIA2010CPU-GPUInteractionNVIDIA2010Pinned(non-pageable)memoryPinnedmemoryenables:–fasterPCIecopies–memcopiesasynchronouswithCPU–memcopiesasynchronouswithGPUUsage–cudaHostAlloc/cudaFreeHostinsteadofmalloc/freeImplication:–pinnedmemoryisessentiallyremovedfromhostvirtualmemoryNVIDIA2010StreamsandAsyncAPIDefaultAPI:–KernellaunchesareasynchronouswithCPU–Memcopies(D2H,H2D)blockCPUthread–CUDAcallsareserializedbythedriverStreamsandasyncfunctionsprovide:–Memcopies(D2H,H2D)asynchronouswithCPU–AbilitytoconcurrentlyexecuteakernelandamemcopyStream=sequenceofoperationsthatexecuteinissue-orderonGPU–Operationsfromdifferentstreamsmaybeinterleaved–AkernelandmemcopyfromdifferentstreamscanbeoverlappedNVIDIA2010OverlapkernelandmemorycopyRequirements:–D2HorH2Dmemcopyfrompinnedmemory–Devicewithcomputecapability≥1.
1(G84andlater)–Kernelandmemcopyindifferent,non-0streamsCode:cudaStream_tstream1,stream2;cudaStreamCreate(&stream1);cudaStreamCreate(&stream2);cudaMemcpyAsync(dst,src,size,dir,stream1);kernel>>(…);potentiallyoverlappedNVIDIA2010CallSequencingforOptimalOverlapCUDAcallsaredispatchedtothehwinthesequencetheywereissuedFermicanconcurrentlyexecute:–Upto16kernels–Upto2memcopies,aslongastheyareindifferentdirections(D2HandH2D)Acallisdispatchedifbotharetrue:–Resourcesareavailable–PrecedingcallsinthesamestreamhavecompletedNotethatifacallblocks,itblocksallothercallsofthesametypebehindit,eveninotherstreams–Typeisoneof{kernel,memcopy}NVIDIA2010StreamExamplesK1,M1,K2,M2:K1M1K2M2K1,K2,M1,M2:K1M1K2M2K1,M1,M2:K1M1M2K1,M2,M1:K1M1M2K1,M2,M2:K1M2M2TimeK:kernelM:memcopyInteger:streadIDNVIDIA2010MoreonFermiConcurrentKernelsKernelsmaybeexecutedconcurrentlyiftheyareissuedintodifferentstreamsScheduling:–Kernelsareexecutedintheorderinwhichtheywereissued–ThreadblocksforagivenkernelarescheduledifallthreadblocksforprecedingkernelshavebeenscheduledandtherestillareSMresourcesavailableNVIDIA2010MoreonFermiDualCopyFermiiscapableofduplexcommunicationwiththehost–PCIebusisduplex–Thetwomemcopiesmustbeindifferentstreams,differentdirectionsNotallcurrenthostsystemscansaturateduplexPCIebandwidth:–LikelylimitationsoftheIOHchips–Ifthisisimportanttoyou,testyourhostsystemNVIDIA2010DuplexCopy:ExperimentalResultsCPU-0IOHX58DRAMGPU-0CPU-0IOHD36DRAMGPU-0CPU-0DRAM10.
8GB/s7.
5GB/sQPI,6.
4GT/s25.
6GB/s3xDDR3,1066MHz25.
8GB/sPCIe,x1616GB/sNVIDIA2010DuplexCopy:ExperimentalResultsCPU-0IOHX58DRAMGPU-0CPU-0IOHD36DRAMGPU-0CPU-1DRAM10.
8GB/s11GB/sQPI,6.
4GT/s25.
6GB/s3xDDR3,1066MHz25.
8GB/sPCIe,x1616GB/sNVIDIA2010SummaryKernelLaunchConfiguration:–LaunchenoughthreadsperSMtohidelatency–LaunchenoughthreadblockstoloadtheGPUGlobalmemory:–Maximizethroughput(GPUhaslotsofbandwidth,useiteffectively)Usesharedmemorywhenapplicable(over1TB/sbandwidth)GPU-CPUinteraction:–MinimizeCPU/GPUidling,maximizePCIethroughputUseanalysis/profilingwhenoptimizing:–"Analysis-drivenOptimization"talknextNVIDIA2010AdditionalResourcesBasics:–CUDAwebinarsonNVIDIAwebsite(justgoogleforCUDAwebinar)–CUDAbyExample"bookbyJ.
SandersandE.
CandrotProfiling,analysis,andoptimizationforFermi:–GTC-2010session2012:"Analysis-drivenOptimization"(tomorrow,3-5pm)GT200optimization:–GTC-2009session1029(slidesandvideo)Slides:–http://www.
nvidia.
com/content/GTC/documents/1029_GTC09.
pdfMaterialsforallsessions:–http://developer.
download.
nvidia.
com/compute/cuda/docs/GTC09Materials.
htmCUDATutorialsatSupercomputing:–http://gpgpu.
org/{sc2007,sc2008,sc2009}CUDAProgrammingGuideCUDABestPracticesGuideNVIDIA2010Questions

老周互联24小时无理由退款,香港原生IP,28元起

老周互联怎么样?老周互联隶属于老周网络科技部旗下,创立于2019年12月份,是一家具有代表性的国人商家。目前主营的产品有云服务器,裸金属服务器。创办一年多以来,我们一直坚持以口碑至上,服务宗旨为理念,为用户提供7*24小时的轮班服务,目前已有上千多家中小型站长选择我们!服务宗旨:老周互联提供7*24小时轮流值班客服,用户24小时内咨询问题可提交工单,我们会在30分钟内为您快速解答!另免费部署服务器...

HostSlim,双E5-2620v2/4x 1TB SATA大硬盘,荷兰服务器60美元月

hostslim美国独立日活动正在进行中,针对一款大硬盘荷兰专用服务器:双E5-2620v2/4x 1TB SATA硬盘,活动价60美元月。HostSlim荷兰服务器允许大人内容,不过只支持电汇、信用卡和比特币付款,商家支持7天内退款保证,有需要欧洲服务器的可以入手试试,记得注册的时候选择中国,这样不用交20%的税。hostslim怎么样?HostSlim是一家成立于2008年的荷兰托管服务器商,...

宝塔面板批量设置站点404页面

今天遇到一个网友,他在一个服务器中搭建有十几个网站,但是他之前都是采集站点数据很大,但是现在他删除数据之后希望设置可能有索引的文章给予404跳转页面。虽然他程序有默认的404页面,但是达不到他引流的目的,他希望设置统一的404页面。实际上设置还是很简单的,我们找到他是Nginx还是Apache,直接在引擎配置文件中设置即可。这里有看到他采用的是宝塔面板,直接在他的Nginx中设置。这里我们找到当前...

freehost为你推荐
百度爱好者武汉理工大学有百度爱好者协会吗?那个协会怎么样美国互联网瘫痪网络中断会对美国军力造成什么影响微信回应封杀钉钉微信发过来的钉钉链接打不开?广东GDP破10万亿广东省2019年各市gdp是多少?7788k.com以前有个网站是7788MP3.com后来改成KK130现在又改网站域名了。有知道现在是什么域名么?lunwenjiancepaperfree论文检测怎样算合格百度关键词分析关键词怎么分析?www.e12.com.cn上海高中除了四大名校,接下来哪所高中最好?顺便讲下它的各方面情况5xoy.comhttp www.05eee.com125xx.comwww.free.com 是官方网站吗?
cn域名注册 韩国vps俄罗斯美女 花生壳免费域名申请 realvnc 万网优惠券 网通ip 太原联通测速平台 炎黄盛世 699美元 hkg 台湾谷歌 支持外链的相册 视频服务器是什么 双线空间 阿里云手机官网 阿里云邮箱申请 深圳主机托管 服务器防御 windowssever2008 web是什么意思 更多