NVIDIA 2010

Fundamental Optimizations
Paulius Micikevicius | NVIDIA
Supercomputing, Tutorial S03
New Orleans, Nov 14, 2010

Outline
Kernel optimizations
  – Launch configuration
  – Global memory throughput
  – Shared memory access
  – Instruction throughput / control flow
Optimization of CPU-GPU interaction
  – Maximizing PCIe throughput
  – Overlapping kernel execution with memory copies

Launch Configuration
How many threads/threadblocks to launch?
Key to understanding:
  – Instructions are issued in order
  – A thread stalls when one of the operands isn't ready:
      Memory read by itself doesn't stall execution
  – Latency is hidden by switching threads
      GMEM latency: 400-800 cycles
      Arithmetic latency: 18-22 cycles
Conclusion:
  – Need enough threads to hide latency

Launch Configuration
Hiding arithmetic latency:
  – Need ~18 warps (576 threads) per Fermi SM
  – Fewer warps for pre-Fermi GPUs (Fermi SM more than doubled issue rate)
  – Or, latency can also be hidden with independent instructions from the same warp
      For example, if an instruction never depends on the output of the preceding instruction, then only 9 warps are needed, etc.
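
The point about independent instructions from the same warp can be illustrated with a small sketch. It is not from the talk: the kernel `two_chains`, the helper `run_two_chains`, and the arithmetic in the loop are hypothetical, chosen only so that each thread carries two dependency chains that never read each other's results. Whether this lowers the warp count needed on a given GPU has to be measured.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread carries two independent multiply-add chains. Neither chain
// reads the other's result, so an instruction from chain B can issue while
// chain A's previous instruction is still in flight.
__global__ void two_chains(float *data, int n, int iters)
{
    int stride = gridDim.x * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // first element
    int j = i + stride;                             // second element

    float a = (i < n) ? data[i] : 0.0f;
    float b = (j < n) ? data[j] : 0.0f;

    for (int k = 0; k < iters; ++k) {
        a = a * 0.5f + 1.0f;  // chain A: depends only on a
        b = b * 0.5f + 1.0f;  // chain B: depends only on b
    }

    if (i < n) data[i] = a;
    if (j < n) data[j] = b;
}

// Hypothetical host helper (assumes n > 0): fills n elements with `init`,
// runs the kernel, and returns the result. Returns an empty vector if a
// CUDA call fails.
std::vector<float> run_two_chains(int n, int iters, float init)
{
    std::vector<float> host(n, init);
    float *dev = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return {};

    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 128;                                   // multiple of the warp size
    const int blocks = (n + 2 * threads - 1) / (2 * threads);  // two elements per thread
    two_chains<<<blocks, threads>>>(dev, n, iters);

    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return {};
    return host;
}
```

With `init = 2.0f` the update `a * 0.5f + 1.0f` is a fixed point, so `run_two_chains(1000, 16, 2.0f)` should return 1000 elements all equal to 2.0f, which makes the sketch easy to check.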
Maximizing global memory throughput:
  – Depends on the access pattern and word size
  – Need enough memory transactions in flight to saturate the bus
      Independent loads and stores from the same thread
      Loads and stores from different threads
      Larger word sizes can also help (float2 is twice the transactions of float, for example)

Maximizing Memory Throughput
Increment of an array of 64M elements
  – Two accesses per thread (load then store)
  – The two accesses are dependent, so really 1 access per thread at a time
Tesla C2050, ECC on, theoretical bandwidth: ~120 GB/s
Several independent smaller accesses have the same effect as one larger one.
  – For example: four 32-bit ~= one 128-bit

Launch Configuration: Summary
Need enough total threads to keep GPU busy
  – Typically, you'd like 512+ threads per SM
      More if processing one fp32 element per thread
  – Of course, exceptions exist
Threadblock configuration
  – Threads per block should be a multiple of warp size (32)
  – SM can concurrently execute up to 8 threadblocks
      Really small threadblocks prevent achieving good occupancy
      Really large threadblocks are less flexible
      I generally use 128-256 threads/block, but use whatever is best for the application
For more details:
  – Vasily Volkov's GTC 2010 talk "Better Performance at Lower Occupancy"

Global Memory Throughput

Fermi Memory Hierarchy Review
[Figure: SM-0, SM-1, ..., SM-N, each with registers, L1, and SMEM; all SMs share L2 and global memory]
Local storage
  – Each thread has own local storage
  – Mostly registers (managed by the compiler)
Shared memory / L1
  – Program configurable: 16 KB shared / 48 KB L1 OR 48 KB shared / 16 KB L1
  – Shared memory is accessible by the threads in the same threadblock
  – Very low latency
  – Very high throughput: 1+ TB/s aggregate
L2
  – All accesses to global memory go through L2, including copies to/from CPU host
Global memory
  – Accessible by all threads as well as host (CPU)
  – Higher latency (400-800 cycles)
  – Throughput: up to 177 GB/s

Programming for L1 and L2
Short answer: DON'T
  – GPU caches are not intended for the same use as CPU caches
      Smaller size (especially per thread), so not aimed at temporal reuse
      Intended to smooth out some access patterns, help with spilled registers, etc.
  – Don't try to block for L1/L2 like you would on CPU
      You have 100s to 1,000s of run-time scheduled threads hitting the caches
      If it is possible to block for L1 then block for SMEM
        Same size, same bandwidth, hw will not evict behind your back
Optimize as if no caches were there
  – No Fermi-only techniques to learn per se (so, all you know is still good)
  – Some cases will just run faster

Fermi GMEM Operations
Two types of loads:
  – Caching
      Default mode
      Attempts to hit in L1, then L2, then GMEM
      Load granularity is 128-byte line
  – Non-caching
      Compile with -Xptxas -dlcm=cg option to nvcc
      Attempts to hit in L2, then GMEM
        Do not hit in L1, invalidate the line if it's in L1 already
      Load granularity is 32 bytes
Stores:
  – Invalidate L1, write-back for L2

Load Caching and L1 Size
Non-caching loads can improve perf when:
  – Loading scattered words or only a part of a warp issues a load
      Benefit: transaction is smaller, so useful payload is a larger percentage
      Loading halos, for example
  – Spilling registers (reduce line fighting with spillage)
Large L1 can improve perf when:
  – Spilling registers (more lines so fewer evictions)
  – Some misaligned, strided access patterns
  – 16-KB L1 / 48-KB smem OR 48-KB L1 / 16-KB smem
      CUDA call, can be set for the app or per-kernel
How to use:
  – Just try a 2x2 experiment matrix: {CA, CG} x {48-L1, 16-L1}
      Keep the best combination, same as you would with any HW managed cache, including CPUs

Load Operation
Memory operations are issued per warp (32 threads)
  – Just like all other instructions
  – Prior to Fermi, memory issues were per half-warp
Operation:
  – Threads in a warp provide memory addresses
  – Determine which lines/segments are needed
  – Request the needed lines/segments

Caching Load
Warp requests 32 aligned, consecutive 4-byte words
Addresses fall within 1 cache-line
  – Warp needs 128 bytes
  – 128 bytes move across the bus on a miss
  – Bus utilization: 100%
[Figure: addresses from a warp mapped onto memory addresses 0-448; the same diagram layout accompanies each load example]

Non-caching Load
Warp requests 32 aligned, consecutive 4-byte words
Addresses fall within 4 segments
  – Warp needs 128 bytes
  – 128 bytes move across the bus on a miss
  – Bus utilization: 100%
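
The bus-utilization figures quoted for these load examples come from simple counting: bytes the warp needs, divided by the number of 128-byte lines (caching) or 32-byte segments (non-caching) touched times their size. The host-side sketch below reproduces that arithmetic. It is an illustration, not part of the talk: the names `LoadCost`, `warp_load_cost`, `bus_utilization`, and `consecutive_words` are hypothetical, every access is assumed to miss, and addresses are assumed to be 4-byte aligned.

```cuda
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

struct LoadCost {
    std::size_t bytes_needed;  // distinct 4-byte words requested by the warp
    std::size_t bytes_moved;   // lines or segments fetched, times their size
};

// addrs: one byte address per thread, assumed 4-byte aligned.
// granularity: 128 for caching loads, 32 for non-caching loads.
// Assumes every line or segment touched is a miss.
LoadCost warp_load_cost(const std::vector<std::uint64_t> &addrs, std::size_t granularity)
{
    std::set<std::uint64_t> words, units;
    for (std::uint64_t a : addrs) {
        words.insert(a / 4);            // which 4-byte word
        units.insert(a / granularity);  // which line or segment
    }
    return { words.size() * 4, units.size() * granularity };
}

double bus_utilization(const LoadCost &c)
{
    return static_cast<double>(c.bytes_needed) / static_cast<double>(c.bytes_moved);
}

// 32 consecutive 4-byte words starting at byte address `base`.
std::vector<std::uint64_t> consecutive_words(std::uint64_t base)
{
    std::vector<std::uint64_t> addrs;
    for (int t = 0; t < 32; ++t) addrs.push_back(base + 4u * t);
    return addrs;
}
```

For an aligned warp, `warp_load_cost(consecutive_words(0), 128)` gives 128 bytes needed and 128 bytes moved. Shifting the base to byte 4 touches two 128-byte lines (256 bytes moved) or five 32-byte segments (160 bytes moved), and 32 copies of one address need 4 bytes while moving 128 or 32.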

Caching Load
Warp requests 32 aligned, permuted 4-byte words
Addresses fall within 1 cache-line
  – Warp needs 128 bytes
  – 128 bytes move across the bus on a miss
  – Bus utilization: 100%

Non-caching Load
Warp requests 32 aligned, permuted 4-byte words
Addresses fall within 4 segments
  – Warp needs 128 bytes
  – 128 bytes move across the bus on a miss
  – Bus utilization: 100%

Caching Load
Warp requests 32 misaligned, consecutive 4-byte words
Addresses fall within 2 cache-lines
  – Warp needs 128 bytes
  – 256 bytes move across the bus on misses
  – Bus utilization: 50%

Non-caching Load
Warp requests 32 misaligned, consecutive 4-byte words
Addresses fall within at most 5 segments
  – Warp needs 128 bytes
  – At most 160 bytes move across the bus
  – Bus utilization: at least 80%
      Some misaligned patterns will fall within 4 segments, so 100% utilization

Caching Load
All threads in a warp request the same 4-byte word
Addresses fall within a single cache-line
  – Warp needs 4 bytes
  – 128 bytes move across the bus on a miss
  – Bus utilization: 3.125%

Non-caching Load
All threads in a warp request the same 4-byte word
Addresses fall within a single segment
  – Warp needs 4 bytes
  – 32 bytes move across the bus on a miss
  – Bus utilization: 12.5%

Caching Load
Warp requests 32 scattered 4-byte words
Addresses fall within N cache-lines
  – Warp needs 128 bytes
  – N*128 bytes move across the bus on a miss
  – Bus utilization: 128 / (N*128)

Non-caching Load
Warp requests 32 scattered 4-byte words
Addresses fall within N segments
  – Warp needs 128 bytes
  – N*32 bytes move across the bus on a miss
  – Bus utilization: 128 / (N*32)

Impact of Address Alignment
Warps should access aligned regions for maximum memory throughput
  – Fermi L1 can help for misaligned loads if several warps are accessing a contiguous region
  – ECC further significantly reduces misaligned store throughput
Experiment:
  – Copy 16 MB of floats
  – 256 threads/block
Greatest throughput drop:
  – GT200: 40%
  – Fermi:
      CA loads: 15%
      CG loads: 32%

GMEM Optimization Guidelines
Strive for perfect coalescing per warp
  – Align starting address (may require padding)
  – A warp should access within a contiguous region
Have enough concurrent accesses to saturate the bus
  – Launch enough threads to maximize throughput
      Latency is hidden by switching threads (warps)
  – Process several elements per thread
      Multiple loads get pipelined
      Indexing calculations can often be reused
Try L1 and caching configurations to see which one works best
  – Caching vs non-caching loads (compiler option)
  – 16 KB vs 48 KB L1 (CUDA call)

Shared Memory
Uses:
  – Inter-thread communication within a block
  – Cache data to reduce redundant global memory accesses
  – Use it to improve global memory access patterns
Fermi organization:
  – 32 banks, 4-byte wide banks
  – Successive 4-byte words belong to different banks
Performance:
  – 4 bytes per bank per 2 clocks per multiprocessor
  – smem accesses are issued per 32 threads (warp)
      per 16 threads for GPUs prior to Fermi
  – serialization: if n threads in a warp access different 4-byte words in the same bank, n accesses are executed serially
  – multicast: n threads access the same word in one fetch
      Could be different bytes within the same word
      Prior to Fermi, only broadcast was available, sub-word accesses within the same bank caused serialization

Bank Addressing Examples
[Figure: two thread-to-bank mappings (threads 0-31, banks 0-31), both labeled "No Bank Conflicts"]
[Figure: two thread-to-bank mappings, labeled "2-way Bank Conflicts" and "8-way Bank Conflicts"]

Shared Memory: Avoiding Bank Conflicts
32x32 SMEM array
Warp accesses a column:
  – 32-way bank conflicts (threads in a warp access the same bank)
[Figure: 32x32 array; columns labeled Bank 0, Bank 1, ..., Bank 31; rows labeled warps 0, 1, 2, ..., 31]

Shared Memory: Avoiding Bank Conflicts
Add a column for padding:
  – 32x33 SMEM array
Warp accesses a column:
  – 32 different banks, no bank conflicts
[Figure: 32x33 array with a padding column; bank labels shift by one per row]

Additional "memories"
Texture and constant
  – Read-only
  – Data resides in global memory
  – Read through different caches

Constant Memory
Ideal for coefficients and other data that is read uniformly by warps
Data is stored in global memory, read through a constant-cache
  – __constant__ qualifier in declarations
  – Can only be read by GPU kernels
  – Limited to 64 KB
Fermi adds uniform accesses:
  – Kernel pointer argument qualified with const
  – Compiler must determine that all threads in a threadblock will dereference the same address
  – No limit on array size, can use any global memory pointer
Constant cache throughput:
  – 32 bits per warp per 2 clocks per multiprocessor
  – To be used when all threads in a warp read the same address
      Serializes otherwise

    __global__ void kernel( const float *g_a )
    {
        ...
        float x = g_a[15];               // uniform
        float y = g_a[blockIdx.x + 5];   // uniform
        float z = g_a[threadIdx.x];      // non-uniform
        ...
    }

Constant Memory
Kernel executes 10K threads (320 warps) per SM during its lifetime
All threads access the same 4B word
Using GMEM:
  – Each warp fetches 32B -> 10KB of bus traffic
  – Caching loads potentially worse: 128B line, very likely to be evicted multiple times

Constant Memory
Kernel executes 10K threads (320 warps) per SM during its lifetime
All threads access the same 4B word
Using constant/uniform access:
  – First warp fetches 32 bytes
  – All others hit in constant cache -> 32 bytes of bus traffic per SM
      Unlikely to be evicted over kernel lifetime: other loads do not go through this cache
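
A minimal sketch of the `__constant__` route for uniformly read coefficients follows. It is an illustration under stated assumptions rather than code from the talk: the table `c_coeff`, the kernel `eval_poly_kernel`, and the host helper `eval_poly` are hypothetical names, and the cubic polynomial is just a convenient workload in which every thread of a warp reads the same coefficient at the same time. The input is assumed to be non-empty.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical coefficient table for a cubic polynomial, read through the
// constant cache.
__constant__ float c_coeff[4];

__global__ void eval_poly_kernel(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi = x[i];
    float acc = 0.0f;
    // Horner's rule. In each iteration every thread of the warp reads the
    // same c_coeff[k], so the read is uniform across the warp.
    for (int k = 3; k >= 0; --k)
        acc = acc * xi + c_coeff[k];
    y[i] = acc;
}

// Hypothetical host helper; returns an empty vector if a CUDA call fails.
std::vector<float> eval_poly(const std::vector<float> &x, const float coeff[4])
{
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);
    std::vector<float> y(n);
    float *d_x = nullptr, *d_y = nullptr;

    // Kernels can only read constant memory; the host fills it before launch.
    if (cudaMemcpyToSymbol(c_coeff, coeff, 4 * sizeof(float)) != cudaSuccess) return {};
    if (cudaMalloc((void **)&d_x, bytes) != cudaSuccess) return {};
    if (cudaMalloc((void **)&d_y, bytes) != cudaSuccess) { cudaFree(d_x); return {}; }

    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    const int threads = 128;
    eval_poly_kernel<<<(n + threads - 1) / threads, threads>>>(d_x, d_y, n);
    cudaError_t err = cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
    if (err != cudaSuccess) return {};
    return y;
}
```

With coefficients {1, 2, 3, 4} the kernel evaluates 1 + 2x + 3x^2 + 4x^3, so inputs 0, 1, and 2 should produce 1, 10, and 49.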

Texture
Separate cache
Dedicated texture cache hardware provides:
  – Out-of-bounds index handling
      clamp or wrap-around
  – Optional interpolation
      Think: using fp indices for arrays
      Linear, bilinear, trilinear
        Interpolation weights are 9-bit
  – Optional format conversion
      {char, short, int} -> float
  – All of these are "free"

Instruction Throughput / Control Flow

Runtime Math Library and Intrinsics
Two types of runtime math library functions
  – __func(): many map directly to hardware ISA
      Fast but lower accuracy (see CUDA Programming Guide for full details)
      Examples: __sinf(x), __expf(x), __powf(x, y)
  – func(): compile to multiple instructions
      Slower but higher accuracy (5 ulp or less)
      Examples: sin(x), exp(x), pow(x, y)
A number of additional intrinsics:
  – __sincosf(), __frcp_rz(), ...
  – Explicit IEEE rounding modes (rz, rn, ru, rd)

Control Flow
Instructions are issued per 32 threads (warp)
Divergent branches:
  – Threads within a single warp take different paths
      if-else, ...
  – Different execution paths within a warp are serialized
Different warps can execute different code with no impact on performance
Avoid diverging within a warp
  – Example with divergence:
      if (threadIdx.x > 2) {...} else {...}
      Branch granularity < warp size
  – Example without divergence:
      if (threadIdx.x / WARP_SIZE > 2) {...} else {...}
      Branch granularity is a whole multiple of warp size

Control Flow
    if ( ... )
    {
        // then-clause
    }
    else
    {
        // else-clause
    }
[Figure: instructions flowing through the branch]

Execution within warps is coherent
[Figure: instructions over time for two warps ("vectors" of threads), threads 0-31 and threads 32-63]

Execution diverges within a warp
[Figure: instructions over time for the same two warps, threads 0-31 and threads 32-63]

CPU-GPU Interaction

Pinned (non-pageable) memory
Pinned memory enables:
  – faster PCIe copies
  – memcopies asynchronous with CPU
  – memcopies asynchronous with GPU
Usage
  – cudaHostAlloc / cudaFreeHost
      instead of malloc / free
Implication:
  – pinned memory is essentially removed from host virtual memory

Streams and Async API
Default API:
  – Kernel launches are asynchronous with CPU
  – Memcopies (D2H, H2D) block CPU thread
  – CUDA calls are serialized by the driver
Streams and async functions provide:
  – Memcopies (D2H, H2D) asynchronous with CPU
  – Ability to concurrently execute a kernel and a memcopy
Stream = sequence of operations that execute in issue-order on GPU
  – Operations from different streams may be interleaved
  – A kernel and memcopy from different streams can be overlapped

Overlap kernel and memory copy
Requirements:
  – D2H or H2D memcopy from pinned memory
  – Device with compute capability ≥ 1.1 (G84 and later)
  – Kernel and memcopy in different, non-0 streams
Code:
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    cudaMemcpyAsync( dst, src, size, dir, stream1 );   // potentially
    kernel<<<grid, block, 0, stream2>>>(…);            // overlapped

Call Sequencing for Optimal Overlap
CUDA calls are dispatched to the hw in the sequence they were issued
Fermi can concurrently execute:
  – Up to 16 kernels
  – Up to 2 memcopies, as long as they are in different directions (D2H and H2D)
A call is dispatched if both are true:
  – Resources are available
  – Preceding calls in the same stream have completed
Note that if a call blocks, it blocks all other calls of the same type behind it, even in other streams
  – Type is one of {kernel, memcopy}

Stream Examples
K: kernel, M: memcopy, integer: stream ID
Issue orders shown:
  – K1, M1, K2, M2
  – K1, K2, M1, M2
  – K1, M1, M2
  – K1, M2, M1
  – K1, M2, M2
[Figure: execution timeline for each issue order]

More on Fermi Concurrent Kernels
Kernels may be executed concurrently if they are issued into different streams
Scheduling:
  – Kernels are executed in the order in which they were issued
  – Threadblocks for a given kernel are scheduled if all threadblocks for preceding kernels have been scheduled and there still are SM resources available

More on Fermi Dual Copy
Fermi is capable of duplex communication with the host
  – PCIe bus is duplex
  – The two memcopies must be in different streams, different directions
Not all current host systems can saturate duplex PCIe bandwidth:
  – Likely limitations of the IOH chips
  – If this is important to you, test your host system

Duplex Copy: Experimental Results
[Figure: two host-system diagrams. Labels: CPU-0, IOH X58, DRAM, GPU-0; CPU-0, IOH D36, DRAM, GPU-0, CPU-0, DRAM. Values: 10.8 GB/s; 7.5 GB/s; QPI, 6.4 GT/s, 25.6 GB/s; 3x DDR3, 1066 MHz, 25.8 GB/s; PCIe x16, 16 GB/s]

Duplex Copy: Experimental Results
[Figure: two host-system diagrams. Labels: CPU-0, IOH X58, DRAM, GPU-0; CPU-0, IOH D36, DRAM, GPU-0, CPU-1, DRAM. Values: 10.8 GB/s; 11 GB/s; QPI, 6.4 GT/s, 25.6 GB/s; 3x DDR3, 1066 MHz, 25.8 GB/s; PCIe x16, 16 GB/s]

Summary
Kernel Launch Configuration:
  – Launch enough threads per SM to hide latency
  – Launch enough threadblocks to load the GPU
Global memory:
  – Maximize throughput (GPU has lots of bandwidth, use it effectively)
Use shared memory when applicable (over 1 TB/s bandwidth)
GPU-CPU interaction:
  – Minimize CPU/GPU idling, maximize PCIe throughput
Use analysis/profiling when optimizing:
  – "Analysis-driven Optimization" talk next

Additional Resources
Basics:
  – CUDA webinars on NVIDIA website (just google for CUDA webinar)
  – "CUDA by Example" book by J. Sanders and E. Kandrot
Profiling, analysis, and optimization for Fermi:
  – GTC-2010 session 2012: "Analysis-driven Optimization" (tomorrow, 3-5pm)
GT200 optimization:
  – GTC-2009 session 1029 (slides and video)
      Slides:
        – http://www.nvidia.com/content/GTC/documents/1029_GTC09.pdf
      Materials for all sessions:
        – http://developer.download.nvidia.com/compute/cuda/docs/GTC09Materials.htm
CUDA Tutorials at Supercomputing:
  – http://gpgpu.org/{sc2007,sc2008,sc2009}
CUDA Programming Guide
CUDA Best Practices Guide

Questions
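
As a closing illustration of the "Overlap kernel and memory copy" requirements (pinned host memory, two non-default streams), here is a minimal self-contained sketch. It is not code from the talk: the kernel `increment`, the helper `overlap_copy_and_kernel`, and the buffer names are hypothetical, and whether the copy and the kernel actually run concurrently depends on the device and the host system.

```cuda
#include <cuda_runtime.h>

__global__ void increment(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Hypothetical helper: copies buffer A to the device in stream1 while a
// kernel updates buffer B in stream2. Returns true if both results check out.
bool overlap_copy_and_kernel(int n)
{
    size_t bytes = n * sizeof(float);
    float *h_a = nullptr, *h_b = nullptr, *d_a = nullptr, *d_b = nullptr;
    cudaStream_t stream1 = nullptr, stream2 = nullptr;

    // Pinned host buffers: needed for the copy to be asynchronous.
    bool ok = cudaHostAlloc((void **)&h_a, bytes, cudaHostAllocDefault) == cudaSuccess
           && cudaHostAlloc((void **)&h_b, bytes, cudaHostAllocDefault) == cudaSuccess
           && cudaMalloc((void **)&d_a, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_b, bytes) == cudaSuccess
           && cudaStreamCreate(&stream1) == cudaSuccess
           && cudaStreamCreate(&stream2) == cudaSuccess;

    if (ok) {
        for (int i = 0; i < n; ++i) { h_a[i] = float(i); h_b[i] = 0.0f; }
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);  // setup, blocking

        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;

        // Different non-default streams: these two operations may overlap.
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
        increment<<<blocks, threads, 0, stream2>>>(d_b, n);

        cudaStreamSynchronize(stream1);
        cudaStreamSynchronize(stream2);

        // Check the kernel result (0 + 1), then the copied data.
        ok = cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (h_b[i] == 1.0f);
        ok = ok && cudaMemcpy(h_b, d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (h_b[i] == h_a[i]);
    }

    if (stream1) cudaStreamDestroy(stream1);
    if (stream2) cudaStreamDestroy(stream2);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    return ok;
}
```

Calling `overlap_copy_and_kernel(1 << 20)` exercises the pattern on 4 MB buffers; a profiler timeline is the way to confirm whether the two operations overlapped on a particular system.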
