cgfreehost

freehost  时间:2021-04-10  阅读:()
NVIDIA2010NVIDIA2010PauliusMicikevicius|NVIDIAFundamentalOptimizationsSupercomputing,TutorialS03NewOrleans,Nov14,2010NVIDIA2010OutlineKerneloptimizations–Launchconfiguration–Globalmemorythroughput–Sharedmemoryaccess–Instructionthroughput/controlflowOptimizationofCPU-GPUinteraction–MaximizingPCIethroughput–OverlappingkernelexecutionwithmemorycopiesNVIDIA2010LaunchConfigurationNVIDIA2010LaunchConfigurationHowmanythreads/threadblockstolaunchKeytounderstanding:–Instructionsareissuedinorder–Athreadstallswhenoneoftheoperandsisn'tready:Memoryreadbyitselfdoesn'tstallexecution–LatencyishiddenbyswitchingthreadsGMEMlatency:400-800cyclesArithmeticlatency:18-22cyclesConclusion:–NeedenoughthreadstohidelatencyNVIDIA2010LaunchConfigurationHidingarithmeticlatency:Need~18warps(576)threadsperFermiSMFewerwarpsforpre-FermiGPUs(FermiSMmorethandoubledissuerate)Or,latencycanalsobehiddenwithindependentinstructionsfromthesamewarpForexample,ifinstructionneverdependsontheoutputofprecedinginstruction,thenonly9warpsareneeded,etc.
Maximizingglobalmemorythroughput:Dependsontheaccesspattern,andwordsizeNeedenoughmemorytransactionsinflighttosaturatethebusIndependentloadsandstoresfromthesamethreadLoadsandstoresfromdifferentthreadsLargerwordsizescanalsohelp(float2istwicethetransactionsoffloat,forexample)NVIDIA2010MaximizingMemoryThroughputIncrementofanarrayof64Melements–Twoaccessesperthread(loadthenstore)–Thetwoaccessesaredependent,soreally1accessperthreadatatimeTeslaC2050,ECCon,theoreticalbandwidth:~120GB/sSeveralindependentsmalleraccesseshavethesameeffectasonelargerone.
Forexample:Four32-bit~=one128-bitNVIDIA2010LaunchConfiguration:SummaryNeedenoughtotalthreadstokeepGPUbusy–Typically,you'dlike512+threadsperSMMoreifprocessingonefp32elementperthread–Ofcourse,exceptionsexistThreadblockconfiguration–Threadsperblockshouldbeamultipleofwarpsize(32)–SMcanconcurrentlyexecuteupto8threadblocksReallysmallthreadblockspreventachievinggoodoccupancyReallylargethreadblocksarelessflexibleIgenerallyuse128-256threads/block,butusewhateverisbestfortheapplicationFormoredetails:–VasilyVolkov'sGTC2010talk"BetterPerformanceatLowerOccupancy"NVIDIA2010GlobalMemoryThroughputNVIDIA2010FermiMemoryHierarchyReviewL2GlobalMemoryRegistersL1SM-NSMEMRegistersL1SM-0SMEMRegistersL1SM-1SMEMNVIDIA2010FermiMemoryHierarchyReviewLocalstorage–Eachthreadhasownlocalstorage–Mostlyregisters(managedbythecompiler)Sharedmemory/L1–Programconfigurable:16KBshared/48KBL1OR48KBshared/16KBL1–Sharedmemoryisaccessiblebythethreadsinthesamethreadblock–Verylowlatency–Veryhighthroughput:1+TB/saggregateL2–AllaccessestoglobalmemorygothroughL2,includingcopiesto/fromCPUhostGlobalmemory–Accessiblebyallthreadsaswellashost(CPU)–Higherlatency(400-800cycles)–Throughput:upto177GB/sNVIDIA2010ProgrammingforL1andL2Shortanswer:DON'T–GPUcachesarenotintendedforthesameuseasCPUcachesSmallersize(especiallyperthread),sonotaimedattemporalreuseIntendedtosmoothoutsomeaccesspatterns,helpwithspilledregisters,etc.
–Don'ttrytoblockforL1/L2likeyouwouldonCPUYouhave100sto1,000sofrun-timescheduledthreadshittingthecachesIfitispossibletoblockforL1thenblockforSMEM–Samesize,samebandwidth,hwwillnotevictbehindyourbackOptimizeasifnocacheswerethere–NoFermi-onlytechniquestolearnperse(so,allyouknowisstillgood)–SomecaseswilljustrunfasterNVIDIA2010FermiGMEMOperationsTwotypesofloads:–CachingDefaultmodeAttemptstohitinL1,thenL2,thenGMEMLoadgranularityis128-byteline–Non-cachingCompilewith–Xptxas–dlcm=cgoptiontonvccAttemptstohitinL2,thenGMEM–DonothitinL1,invalidatethelineifit'sinL1alreadyLoadgranularityis32-bytesStores:–InvalidateL1,write-backforL2NVIDIA2010LoadCachingandL1SizeNon-cachingloadscanimproveperfwhen:–LoadingscatteredwordsoronlyapartofawarpissuesaloadBenefit:transactionissmaller,sousefulpayloadisalargerpercentageLoadinghalos,forexample–Spillingregisters(reducelinefightingwithspillage)LargeL1canimproveperfwhen:–Spillingregisters(morelinessofewerevictions)–Somemisaligned,stridedaccesspatterns–16-KBL1/48-KBsmemOR48-KBL1/16-KBsmemCUDAcall,canbesetfortheapporper-kernelHowtouse:–Justtrya2x2experimentmatrix:{CA,CG}x{48-L1,16-L1}Keepthebestcombination-sameasyouwouldwithanyHWmanagedcache,includingCPUsNVIDIA2010LoadOperationMemoryoperationsareissuedperwarp(32threads)–Justlikeallotherinstructions–PriortoFermi,memoryissueswereperhalf-warpOperation:–Threadsinawarpprovidememoryaddresses–Determinewhichlines/segmentsareneeded–Requesttheneededlines/segmentsNVIDIA2010CachingLoadWarprequests32aligned,consecutive4-bytewordsAddressesfallwithin1cache-line–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0NVIDIA2010Non-cachingLoadWarprequests32aligned,consecutive4-bytewordsAddressesfallwithin4segments–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0NVIDIA2010CachingLoad.
.
.
961921281602242882563264352320384448416Memoryaddressesaddressesfromawarp0Warprequests32aligned,permuted4-bytewordsAddressesfallwithin1cache-line–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%NVIDIA2010Non-cachingLoad.
.
.
961921281602242882563264352320384448416Memoryaddressesaddressesfromawarp0Warprequests32aligned,permuted4-bytewordsAddressesfallwithin4segments–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%NVIDIA2010CachingLoad96192128160224288256.
.
.
addressesfromawarp32640352320384448416MemoryaddressesWarprequests32misaligned,consecutive4-bytewordsAddressesfallwithin2cache-lines–Warpneeds128bytes–256bytesmoveacrossthebusonmisses–Busutilization:50%NVIDIA2010Non-cachingLoad96192128160224288256.
.
.
addressesfromawarp32640352320384448416MemoryaddressesWarprequests32misaligned,consecutive4-bytewordsAddressesfallwithinatmost5segments–Warpneeds128bytes–Atmost160bytesmoveacrossthebus–Busutilization:atleast80%Somemisalignedpatternswillfallwithin4segments,so100%utilizationNVIDIA2010CachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Allthreadsinawarprequestthesame4-bytewordAddressesfallwithinasinglecache-line–Warpneeds4bytes–128bytesmoveacrossthebusonamiss–Busutilization:3.
125%NVIDIA2010Non-cachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Allthreadsinawarprequestthesame4-bytewordAddressesfallwithinasinglesegment–Warpneeds4bytes–32bytesmoveacrossthebusonamiss–Busutilization:12.
5%NVIDIA2010CachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Warprequests32scattered4-bytewordsAddressesfallwithinNcache-lines–Warpneeds128bytes–N*128bytesmoveacrossthebusonamiss–Busutilization:128/(N*128)NVIDIA2010Non-cachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Warprequests32scattered4-bytewordsAddressesfallwithinNsegments–Warpneeds128bytes–N*32bytesmoveacrossthebusonamiss–Busutilization:128/(N*32)NVIDIA2010ImpactofAddressAlignmentWarpsshouldaccessalignedregionsformaximummemorythroughput–FermiL1canhelpformisalignedloadsifseveralwarpsareaccessingacontiguousregion–ECCfurthersignificantlyreducesmisalignedstorethroughputExperiment:–Copy16MBoffloats–256threads/blockGreatestthroughputdrop:–GT200:40%–Fermi:–CAloads:15%–CGloads:32%NVIDIA2010GMEMOptimizationGuidelinesStriveforperfectcoalescingperwarp–Alignstartingaddress(mayrequirepadding)–AwarpshouldaccesswithinacontiguousregionHaveenoughconcurrentaccessestosaturatethebus–LaunchenoughthreadstomaximizethroughputLatencyishiddenbyswitchingthreads(warps)–ProcessseveralelementsperthreadMultipleloadsgetpipelinedIndexingcalculationscanoftenbereusedTryL1andcachingconfigurationstoseewhichoneworksbest–Cachingvsnon-cachingloads(compileroption)–16KBvs48KBL1(CUDAcall)NVIDIA2010SharedMemoryNVIDIA2010SharedMemoryUses:–Inter-threadcommunicationwithinablock–Cachedatatoreduceredundantglobalmemoryaccesses–UseittoimproveglobalmemoryaccesspatternsFermiorganization:–32banks,4-bytewidebanks–Successive4-bytewordsbelongtodifferentbanksPerformance:–4bytesperbankper2clockspermultiprocessor–smemaccessesareissuedper32threads(warp)per16-threadsforGPUspriortoFermi–serialization:ifnthreadsinawarpaccessdifferent4-bytewordsinthesamebank,naccessesareexecutedserially–multicast:nthreadsaccessthesamewordinonefetchCouldbedifferentbyteswithinthesamewordPriortoFermi,onlybroadcastwasavailable,sub-wordaccesseswithinthesamebankcausedserializationNVIDIA2010BankAddressingExamplesNoBankConflictsNoBankConflictsBank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0Bank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0NVIDIA2010BankAddressingExamples2-wayBankConflicts8-wayBankConflictsThread31Thread30Thread29Thread28Thread4Thread3Thread2Thread1Thread0Bank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0Bank9Bank8Bank31Bank7Bank2Bank1Bank0x8x8NVIDIA2010SharedMemory:AvoidingBankConflicts32x32SMEMarrayWarpaccessesacolumn:–32-waybankconflicts(threadsinawarpaccessthesamebank)312103121031210warps:01231Bank0Bank1…Bank3120131NVIDIA2010SharedMemory:AvoidingBankConflictsAddacolumnforpadding:–32x33SMEMarrayWarpaccessesacolumn:–32differentbanks,nobankconflicts312103121031210warps:01231paddingBank0Bank1…Bank3131201NVIDIA2010Additional"memories"TextureandconstantRead-onlyDataresidesinglobalmemoryReadthroughdifferentcachesNVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwiseNVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwise__global__voidkernel(constfloat*g_a){.
.
.
floatx=g_a[15];//uniformfloaty=g_a[blockIdx.
x+5];//uniformfloatz=g_a[threadIdx.
x];//non-uniform.
.
.
}NVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwiseNVIDIA2010ConstantMemoryKernelexecutes10Kthreads(320warps)perSMduringitslifetimeAllthreadsaccessthesame4BwordUsingGMEM:–Eachwarpfetches32B->10KBofbustraffic–Cachingloadspotentiallyworse–128Bline,verylikelytobeevictedmultipletimes.
.
.
addressesfromawarp9619212816022428825632643523203844484160NVIDIA2010ConstantMemoryKernelexecutes10Kthreads(320warps)perSMduringitslifetimeAllthreadsaccessthesame4BwordUsingconstant/uniformaccess:–Firstwarpfetches32bytes–Allothershitinconstantcache->32bytesofbustrafficperSMUnlikelytobeevictedoverkernellifetime–otherloadsdonotgothroughthiscache.
.
.
addressesfromawarp9619212816022428825632643523203844484160NVIDIA2010TextureSeparatecacheDedicatedtexturecachehardwareprovides:–Out-of-boundsindexhandlingclamporwrap-around–OptionalinterpolationThink:usingfpindicesforarraysLinear,bilinear,trilinear–Interpolationweightsare9-bit–Optionalformatconversion{char,short,int}->float–Alloftheseare"free"NVIDIA2010InstructionThroughput/ControlFlowNVIDIA2010RuntimeMathLibraryandIntrinsicsTwotypesofruntimemathlibraryfunctions–__func():manymapdirectlytohardwareISAFastbutloweraccuracy(seeCUDAProgrammingGuideforfulldetails)Examples:__sinf(x),__expf(x),__powf(x,y)–func():compiletomultipleinstructionsSlowerbuthigheraccuracy(5ulporless)Examples:sin(x),exp(x),pow(x,y)Anumberofadditionalintrinsics:–__sincosf(),__frcp_rz(),.
.
.
–ExplicitIEEEroundingmodes(rz,rn,ru,rd)NVIDIA2010ControlFlowInstructionsareissuedper32threads(warp)Divergentbranches:–Threadswithinasinglewarptakedifferentpathsif-else,.
.
.
–DifferentexecutionpathswithinawarpareserializedDifferentwarpscanexecutedifferentcodewithnoimpactonperformanceAvoiddivergingwithinawarp–Examplewithdivergence:if(threadIdx.
x>2){.
.
.
}else{.
.
.
}Branchgranularity2){.
.
.
}else{.
.
.
}BranchgranularityisawholemultipleofwarpsizeNVIDIA2010ControlFlowif(.
.
.
){//then-clause}else{//else-clause}instructionsNVIDIA2010Executionwithinwarpsiscoherentinstructions/timeWarp("vector"ofthreads)35343363623232131300Warp("vector"ofthreads)NVIDIA2010Executiondivergeswithinawarpinstructions/time32131300353433636232NVIDIA2010CPU-GPUInteractionNVIDIA2010Pinned(non-pageable)memoryPinnedmemoryenables:–fasterPCIecopies–memcopiesasynchronouswithCPU–memcopiesasynchronouswithGPUUsage–cudaHostAlloc/cudaFreeHostinsteadofmalloc/freeImplication:–pinnedmemoryisessentiallyremovedfromhostvirtualmemoryNVIDIA2010StreamsandAsyncAPIDefaultAPI:–KernellaunchesareasynchronouswithCPU–Memcopies(D2H,H2D)blockCPUthread–CUDAcallsareserializedbythedriverStreamsandasyncfunctionsprovide:–Memcopies(D2H,H2D)asynchronouswithCPU–AbilitytoconcurrentlyexecuteakernelandamemcopyStream=sequenceofoperationsthatexecuteinissue-orderonGPU–Operationsfromdifferentstreamsmaybeinterleaved–AkernelandmemcopyfromdifferentstreamscanbeoverlappedNVIDIA2010OverlapkernelandmemorycopyRequirements:–D2HorH2Dmemcopyfrompinnedmemory–Devicewithcomputecapability≥1.
1(G84andlater)–Kernelandmemcopyindifferent,non-0streamsCode:cudaStream_tstream1,stream2;cudaStreamCreate(&stream1);cudaStreamCreate(&stream2);cudaMemcpyAsync(dst,src,size,dir,stream1);kernel>>(…);potentiallyoverlappedNVIDIA2010CallSequencingforOptimalOverlapCUDAcallsaredispatchedtothehwinthesequencetheywereissuedFermicanconcurrentlyexecute:–Upto16kernels–Upto2memcopies,aslongastheyareindifferentdirections(D2HandH2D)Acallisdispatchedifbotharetrue:–Resourcesareavailable–PrecedingcallsinthesamestreamhavecompletedNotethatifacallblocks,itblocksallothercallsofthesametypebehindit,eveninotherstreams–Typeisoneof{kernel,memcopy}NVIDIA2010StreamExamplesK1,M1,K2,M2:K1M1K2M2K1,K2,M1,M2:K1M1K2M2K1,M1,M2:K1M1M2K1,M2,M1:K1M1M2K1,M2,M2:K1M2M2TimeK:kernelM:memcopyInteger:streadIDNVIDIA2010MoreonFermiConcurrentKernelsKernelsmaybeexecutedconcurrentlyiftheyareissuedintodifferentstreamsScheduling:–Kernelsareexecutedintheorderinwhichtheywereissued–ThreadblocksforagivenkernelarescheduledifallthreadblocksforprecedingkernelshavebeenscheduledandtherestillareSMresourcesavailableNVIDIA2010MoreonFermiDualCopyFermiiscapableofduplexcommunicationwiththehost–PCIebusisduplex–Thetwomemcopiesmustbeindifferentstreams,differentdirectionsNotallcurrenthostsystemscansaturateduplexPCIebandwidth:–LikelylimitationsoftheIOHchips–Ifthisisimportanttoyou,testyourhostsystemNVIDIA2010DuplexCopy:ExperimentalResultsCPU-0IOHX58DRAMGPU-0CPU-0IOHD36DRAMGPU-0CPU-0DRAM10.
8GB/s7.
5GB/sQPI,6.
4GT/s25.
6GB/s3xDDR3,1066MHz25.
8GB/sPCIe,x1616GB/sNVIDIA2010DuplexCopy:ExperimentalResultsCPU-0IOHX58DRAMGPU-0CPU-0IOHD36DRAMGPU-0CPU-1DRAM10.
8GB/s11GB/sQPI,6.
4GT/s25.
6GB/s3xDDR3,1066MHz25.
8GB/sPCIe,x1616GB/sNVIDIA2010SummaryKernelLaunchConfiguration:–LaunchenoughthreadsperSMtohidelatency–LaunchenoughthreadblockstoloadtheGPUGlobalmemory:–Maximizethroughput(GPUhaslotsofbandwidth,useiteffectively)Usesharedmemorywhenapplicable(over1TB/sbandwidth)GPU-CPUinteraction:–MinimizeCPU/GPUidling,maximizePCIethroughputUseanalysis/profilingwhenoptimizing:–"Analysis-drivenOptimization"talknextNVIDIA2010AdditionalResourcesBasics:–CUDAwebinarsonNVIDIAwebsite(justgoogleforCUDAwebinar)–CUDAbyExample"bookbyJ.
SandersandE.
CandrotProfiling,analysis,andoptimizationforFermi:–GTC-2010session2012:"Analysis-drivenOptimization"(tomorrow,3-5pm)GT200optimization:–GTC-2009session1029(slidesandvideo)Slides:–http://www.
nvidia.
com/content/GTC/documents/1029_GTC09.
pdfMaterialsforallsessions:–http://developer.
download.
nvidia.
com/compute/cuda/docs/GTC09Materials.
htmCUDATutorialsatSupercomputing:–http://gpgpu.
org/{sc2007,sc2008,sc2009}CUDAProgrammingGuideCUDABestPracticesGuideNVIDIA2010Questions

HostYun:联通AS9929线路,最低月付18元起,最高500Mbps带宽,洛杉矶机房

最近AS9929线路比较火,联通A网,对标电信CN2,HostYun也推出了走联通AS9929线路的VPS主机,基于KVM架构,开设在洛杉矶机房,采用SSD硬盘,分为入门和高带宽型,最高提供500Mbps带宽,可使用9折优惠码,最低每月仅18元起。这是一家成立于2008年的VPS主机品牌,原主机分享组织(hostshare.cn),商家以提供低端廉价VPS产品而广为人知,是小成本投入学习练手首选。...

创梦网络-新上雅安电信200G防护值内死扛,无视CC攻击,E5 32核高配/32G内存/1TB SSD/100Mbps独享物理机,原价1299,年未上新促销6折,仅779.4/月,续费同价

创梦网络怎么样,创梦网络公司位于四川省达州市,属于四川本地企业,资质齐全,IDC/ISP均有,从创梦网络这边租的服务器均可以****,属于一手资源,高防机柜、大带宽、高防IP业务,另外创梦网络近期还会上线四川眉山联通、广东优化线路高防机柜,CN2专线相关业务。广东电信大带宽近期可以预约机柜了,成都优化线路,机柜租用、服务器云服务器租用,适合建站做游戏,不须要在套CDN,全国访问快,直连省骨干,大网...

企鹅小屋:垃圾服务商有跑路风险,站长注意转移备份数据!

企鹅小屋:垃圾服务商有跑路风险!企鹅不允许你二次工单的,二次提交工单直接关服务器,再严重就封号,意思是你提交工单要小心,别因为提交工单被干了账号!前段时间,就有站长说企鹅小屋要跑路了,站长不太相信,本站平台已经为企鹅小屋推荐了几千元的业绩,CPS返利达182.67CNY。然后,站长通过企鹅小屋后台申请提现,提现申请至今已经有20几天,企鹅小屋也没有转账。然后,搞笑的一幕出现了:平台账号登录不上提示...

freehost为你推荐
怎么查询商标想要知道一个商标是否被注册,在哪里查到的比较权威?sonicchat国外军人的左胸上有彩色的阁子是什么意思留学生认证留学生为什么要做学历认证?巨星prince去世有几位好莱坞巨星死在2016年18comic.funAnime Comic Fun是什么意思啊 我不懂英文xyq.163.cbg.com梦幻西游藏宝阁钟神发跪求钟神发名言出处,A站大神看过来网站检测请问,对网站进行监控检测的工具有哪些?kb123.netwww.zhmmjyw.net百度收录慢?www.zhiboba.com看NBA直播的网站哪个知道
济南域名注册 网页空间租用 vps是什么 主机优惠码 漂亮qq空间 liquidweb t楼 cpanel主机 cpanel 视频存储服务器 老左博客 http500内部服务器错误 论坛空间 湖南服务器托管 个人域名 国外免费全能空间 phpmyadmin配置 服务器监测 腾讯总部在哪 视频服务器是什么 更多