cgfreehost

freehost  时间:2021-04-10  阅读:()
NVIDIA2010NVIDIA2010PauliusMicikevicius|NVIDIAFundamentalOptimizationsSupercomputing,TutorialS03NewOrleans,Nov14,2010NVIDIA2010OutlineKerneloptimizations–Launchconfiguration–Globalmemorythroughput–Sharedmemoryaccess–Instructionthroughput/controlflowOptimizationofCPU-GPUinteraction–MaximizingPCIethroughput–OverlappingkernelexecutionwithmemorycopiesNVIDIA2010LaunchConfigurationNVIDIA2010LaunchConfigurationHowmanythreads/threadblockstolaunchKeytounderstanding:–Instructionsareissuedinorder–Athreadstallswhenoneoftheoperandsisn'tready:Memoryreadbyitselfdoesn'tstallexecution–LatencyishiddenbyswitchingthreadsGMEMlatency:400-800cyclesArithmeticlatency:18-22cyclesConclusion:–NeedenoughthreadstohidelatencyNVIDIA2010LaunchConfigurationHidingarithmeticlatency:Need~18warps(576)threadsperFermiSMFewerwarpsforpre-FermiGPUs(FermiSMmorethandoubledissuerate)Or,latencycanalsobehiddenwithindependentinstructionsfromthesamewarpForexample,ifinstructionneverdependsontheoutputofprecedinginstruction,thenonly9warpsareneeded,etc.
Maximizingglobalmemorythroughput:Dependsontheaccesspattern,andwordsizeNeedenoughmemorytransactionsinflighttosaturatethebusIndependentloadsandstoresfromthesamethreadLoadsandstoresfromdifferentthreadsLargerwordsizescanalsohelp(float2istwicethetransactionsoffloat,forexample)NVIDIA2010MaximizingMemoryThroughputIncrementofanarrayof64Melements–Twoaccessesperthread(loadthenstore)–Thetwoaccessesaredependent,soreally1accessperthreadatatimeTeslaC2050,ECCon,theoreticalbandwidth:~120GB/sSeveralindependentsmalleraccesseshavethesameeffectasonelargerone.
Forexample:Four32-bit~=one128-bitNVIDIA2010LaunchConfiguration:SummaryNeedenoughtotalthreadstokeepGPUbusy–Typically,you'dlike512+threadsperSMMoreifprocessingonefp32elementperthread–Ofcourse,exceptionsexistThreadblockconfiguration–Threadsperblockshouldbeamultipleofwarpsize(32)–SMcanconcurrentlyexecuteupto8threadblocksReallysmallthreadblockspreventachievinggoodoccupancyReallylargethreadblocksarelessflexibleIgenerallyuse128-256threads/block,butusewhateverisbestfortheapplicationFormoredetails:–VasilyVolkov'sGTC2010talk"BetterPerformanceatLowerOccupancy"NVIDIA2010GlobalMemoryThroughputNVIDIA2010FermiMemoryHierarchyReviewL2GlobalMemoryRegistersL1SM-NSMEMRegistersL1SM-0SMEMRegistersL1SM-1SMEMNVIDIA2010FermiMemoryHierarchyReviewLocalstorage–Eachthreadhasownlocalstorage–Mostlyregisters(managedbythecompiler)Sharedmemory/L1–Programconfigurable:16KBshared/48KBL1OR48KBshared/16KBL1–Sharedmemoryisaccessiblebythethreadsinthesamethreadblock–Verylowlatency–Veryhighthroughput:1+TB/saggregateL2–AllaccessestoglobalmemorygothroughL2,includingcopiesto/fromCPUhostGlobalmemory–Accessiblebyallthreadsaswellashost(CPU)–Higherlatency(400-800cycles)–Throughput:upto177GB/sNVIDIA2010ProgrammingforL1andL2Shortanswer:DON'T–GPUcachesarenotintendedforthesameuseasCPUcachesSmallersize(especiallyperthread),sonotaimedattemporalreuseIntendedtosmoothoutsomeaccesspatterns,helpwithspilledregisters,etc.
–Don'ttrytoblockforL1/L2likeyouwouldonCPUYouhave100sto1,000sofrun-timescheduledthreadshittingthecachesIfitispossibletoblockforL1thenblockforSMEM–Samesize,samebandwidth,hwwillnotevictbehindyourbackOptimizeasifnocacheswerethere–NoFermi-onlytechniquestolearnperse(so,allyouknowisstillgood)–SomecaseswilljustrunfasterNVIDIA2010FermiGMEMOperationsTwotypesofloads:–CachingDefaultmodeAttemptstohitinL1,thenL2,thenGMEMLoadgranularityis128-byteline–Non-cachingCompilewith–Xptxas–dlcm=cgoptiontonvccAttemptstohitinL2,thenGMEM–DonothitinL1,invalidatethelineifit'sinL1alreadyLoadgranularityis32-bytesStores:–InvalidateL1,write-backforL2NVIDIA2010LoadCachingandL1SizeNon-cachingloadscanimproveperfwhen:–LoadingscatteredwordsoronlyapartofawarpissuesaloadBenefit:transactionissmaller,sousefulpayloadisalargerpercentageLoadinghalos,forexample–Spillingregisters(reducelinefightingwithspillage)LargeL1canimproveperfwhen:–Spillingregisters(morelinessofewerevictions)–Somemisaligned,stridedaccesspatterns–16-KBL1/48-KBsmemOR48-KBL1/16-KBsmemCUDAcall,canbesetfortheapporper-kernelHowtouse:–Justtrya2x2experimentmatrix:{CA,CG}x{48-L1,16-L1}Keepthebestcombination-sameasyouwouldwithanyHWmanagedcache,includingCPUsNVIDIA2010LoadOperationMemoryoperationsareissuedperwarp(32threads)–Justlikeallotherinstructions–PriortoFermi,memoryissueswereperhalf-warpOperation:–Threadsinawarpprovidememoryaddresses–Determinewhichlines/segmentsareneeded–Requesttheneededlines/segmentsNVIDIA2010CachingLoadWarprequests32aligned,consecutive4-bytewordsAddressesfallwithin1cache-line–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0NVIDIA2010Non-cachingLoadWarprequests32aligned,consecutive4-bytewordsAddressesfallwithin4segments–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0NVIDIA2010CachingLoad.
.
.
961921281602242882563264352320384448416Memoryaddressesaddressesfromawarp0Warprequests32aligned,permuted4-bytewordsAddressesfallwithin1cache-line–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%NVIDIA2010Non-cachingLoad.
.
.
961921281602242882563264352320384448416Memoryaddressesaddressesfromawarp0Warprequests32aligned,permuted4-bytewordsAddressesfallwithin4segments–Warpneeds128bytes–128bytesmoveacrossthebusonamiss–Busutilization:100%NVIDIA2010CachingLoad96192128160224288256.
.
.
addressesfromawarp32640352320384448416MemoryaddressesWarprequests32misaligned,consecutive4-bytewordsAddressesfallwithin2cache-lines–Warpneeds128bytes–256bytesmoveacrossthebusonmisses–Busutilization:50%NVIDIA2010Non-cachingLoad96192128160224288256.
.
.
addressesfromawarp32640352320384448416MemoryaddressesWarprequests32misaligned,consecutive4-bytewordsAddressesfallwithinatmost5segments–Warpneeds128bytes–Atmost160bytesmoveacrossthebus–Busutilization:atleast80%Somemisalignedpatternswillfallwithin4segments,so100%utilizationNVIDIA2010CachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Allthreadsinawarprequestthesame4-bytewordAddressesfallwithinasinglecache-line–Warpneeds4bytes–128bytesmoveacrossthebusonamiss–Busutilization:3.
125%NVIDIA2010Non-cachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Allthreadsinawarprequestthesame4-bytewordAddressesfallwithinasinglesegment–Warpneeds4bytes–32bytesmoveacrossthebusonamiss–Busutilization:12.
5%NVIDIA2010CachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Warprequests32scattered4-bytewordsAddressesfallwithinNcache-lines–Warpneeds128bytes–N*128bytesmoveacrossthebusonamiss–Busutilization:128/(N*128)NVIDIA2010Non-cachingLoad.
.
.
addressesfromawarp961921281602242882563264352320384448416Memoryaddresses0Warprequests32scattered4-bytewordsAddressesfallwithinNsegments–Warpneeds128bytes–N*32bytesmoveacrossthebusonamiss–Busutilization:128/(N*32)NVIDIA2010ImpactofAddressAlignmentWarpsshouldaccessalignedregionsformaximummemorythroughput–FermiL1canhelpformisalignedloadsifseveralwarpsareaccessingacontiguousregion–ECCfurthersignificantlyreducesmisalignedstorethroughputExperiment:–Copy16MBoffloats–256threads/blockGreatestthroughputdrop:–GT200:40%–Fermi:–CAloads:15%–CGloads:32%NVIDIA2010GMEMOptimizationGuidelinesStriveforperfectcoalescingperwarp–Alignstartingaddress(mayrequirepadding)–AwarpshouldaccesswithinacontiguousregionHaveenoughconcurrentaccessestosaturatethebus–LaunchenoughthreadstomaximizethroughputLatencyishiddenbyswitchingthreads(warps)–ProcessseveralelementsperthreadMultipleloadsgetpipelinedIndexingcalculationscanoftenbereusedTryL1andcachingconfigurationstoseewhichoneworksbest–Cachingvsnon-cachingloads(compileroption)–16KBvs48KBL1(CUDAcall)NVIDIA2010SharedMemoryNVIDIA2010SharedMemoryUses:–Inter-threadcommunicationwithinablock–Cachedatatoreduceredundantglobalmemoryaccesses–UseittoimproveglobalmemoryaccesspatternsFermiorganization:–32banks,4-bytewidebanks–Successive4-bytewordsbelongtodifferentbanksPerformance:–4bytesperbankper2clockspermultiprocessor–smemaccessesareissuedper32threads(warp)per16-threadsforGPUspriortoFermi–serialization:ifnthreadsinawarpaccessdifferent4-bytewordsinthesamebank,naccessesareexecutedserially–multicast:nthreadsaccessthesamewordinonefetchCouldbedifferentbyteswithinthesamewordPriortoFermi,onlybroadcastwasavailable,sub-wordaccesseswithinthesamebankcausedserializationNVIDIA2010BankAddressingExamplesNoBankConflictsNoBankConflictsBank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0Bank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0NVIDIA2010BankAddressingExamples2-wayBankConflicts8-wayBankConflictsThread31Thread30Thread29Thread28Thread4Thread3Thread2Thread1Thread0Bank31Bank7Bank6Bank5Bank4Bank3Bank2Bank1Bank0Thread31Thread7Thread6Thread5Thread4Thread3Thread2Thread1Thread0Bank9Bank8Bank31Bank7Bank2Bank1Bank0x8x8NVIDIA2010SharedMemory:AvoidingBankConflicts32x32SMEMarrayWarpaccessesacolumn:–32-waybankconflicts(threadsinawarpaccessthesamebank)312103121031210warps:01231Bank0Bank1…Bank3120131NVIDIA2010SharedMemory:AvoidingBankConflictsAddacolumnforpadding:–32x33SMEMarrayWarpaccessesacolumn:–32differentbanks,nobankconflicts312103121031210warps:01231paddingBank0Bank1…Bank3131201NVIDIA2010Additional"memories"TextureandconstantRead-onlyDataresidesinglobalmemoryReadthroughdifferentcachesNVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwiseNVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwise__global__voidkernel(constfloat*g_a){.
.
.
floatx=g_a[15];//uniformfloaty=g_a[blockIdx.
x+5];//uniformfloatz=g_a[threadIdx.
x];//non-uniform.
.
.
}NVIDIA2010ConstantMemoryIdealforcoefficientsandotherdatathatisreaduniformlybywarpsDataisstoredinglobalmemory,readthroughaconstant-cache–__constant__qualifierindeclarations–CanonlybereadbyGPUkernels–Limitedto64KBFermiaddsuniformaccesses:–Kernelpointerargumentqualifiedwithconst–Compilermustdeterminethatallthreadsinathreadblockwilldereferencethesameaddress–Nolimitonarraysize,canuseanyglobalmemorypointerConstantcachethroughput:–32bitsperwarpper2clockspermultiprocessor–TobeusedwhenallthreadsinawarpreadthesameaddressSerializesotherwiseNVIDIA2010ConstantMemoryKernelexecutes10Kthreads(320warps)perSMduringitslifetimeAllthreadsaccessthesame4BwordUsingGMEM:–Eachwarpfetches32B->10KBofbustraffic–Cachingloadspotentiallyworse–128Bline,verylikelytobeevictedmultipletimes.
.
.
addressesfromawarp9619212816022428825632643523203844484160NVIDIA2010ConstantMemoryKernelexecutes10Kthreads(320warps)perSMduringitslifetimeAllthreadsaccessthesame4BwordUsingconstant/uniformaccess:–Firstwarpfetches32bytes–Allothershitinconstantcache->32bytesofbustrafficperSMUnlikelytobeevictedoverkernellifetime–otherloadsdonotgothroughthiscache.
.
.
addressesfromawarp9619212816022428825632643523203844484160NVIDIA2010TextureSeparatecacheDedicatedtexturecachehardwareprovides:–Out-of-boundsindexhandlingclamporwrap-around–OptionalinterpolationThink:usingfpindicesforarraysLinear,bilinear,trilinear–Interpolationweightsare9-bit–Optionalformatconversion{char,short,int}->float–Alloftheseare"free"NVIDIA2010InstructionThroughput/ControlFlowNVIDIA2010RuntimeMathLibraryandIntrinsicsTwotypesofruntimemathlibraryfunctions–__func():manymapdirectlytohardwareISAFastbutloweraccuracy(seeCUDAProgrammingGuideforfulldetails)Examples:__sinf(x),__expf(x),__powf(x,y)–func():compiletomultipleinstructionsSlowerbuthigheraccuracy(5ulporless)Examples:sin(x),exp(x),pow(x,y)Anumberofadditionalintrinsics:–__sincosf(),__frcp_rz(),.
.
.
–ExplicitIEEEroundingmodes(rz,rn,ru,rd)NVIDIA2010ControlFlowInstructionsareissuedper32threads(warp)Divergentbranches:–Threadswithinasinglewarptakedifferentpathsif-else,.
.
.
–DifferentexecutionpathswithinawarpareserializedDifferentwarpscanexecutedifferentcodewithnoimpactonperformanceAvoiddivergingwithinawarp–Examplewithdivergence:if(threadIdx.
x>2){.
.
.
}else{.
.
.
}Branchgranularity2){.
.
.
}else{.
.
.
}BranchgranularityisawholemultipleofwarpsizeNVIDIA2010ControlFlowif(.
.
.
){//then-clause}else{//else-clause}instructionsNVIDIA2010Executionwithinwarpsiscoherentinstructions/timeWarp("vector"ofthreads)35343363623232131300Warp("vector"ofthreads)NVIDIA2010Executiondivergeswithinawarpinstructions/time32131300353433636232NVIDIA2010CPU-GPUInteractionNVIDIA2010Pinned(non-pageable)memoryPinnedmemoryenables:–fasterPCIecopies–memcopiesasynchronouswithCPU–memcopiesasynchronouswithGPUUsage–cudaHostAlloc/cudaFreeHostinsteadofmalloc/freeImplication:–pinnedmemoryisessentiallyremovedfromhostvirtualmemoryNVIDIA2010StreamsandAsyncAPIDefaultAPI:–KernellaunchesareasynchronouswithCPU–Memcopies(D2H,H2D)blockCPUthread–CUDAcallsareserializedbythedriverStreamsandasyncfunctionsprovide:–Memcopies(D2H,H2D)asynchronouswithCPU–AbilitytoconcurrentlyexecuteakernelandamemcopyStream=sequenceofoperationsthatexecuteinissue-orderonGPU–Operationsfromdifferentstreamsmaybeinterleaved–AkernelandmemcopyfromdifferentstreamscanbeoverlappedNVIDIA2010OverlapkernelandmemorycopyRequirements:–D2HorH2Dmemcopyfrompinnedmemory–Devicewithcomputecapability≥1.
1(G84andlater)–Kernelandmemcopyindifferent,non-0streamsCode:cudaStream_tstream1,stream2;cudaStreamCreate(&stream1);cudaStreamCreate(&stream2);cudaMemcpyAsync(dst,src,size,dir,stream1);kernel>>(…);potentiallyoverlappedNVIDIA2010CallSequencingforOptimalOverlapCUDAcallsaredispatchedtothehwinthesequencetheywereissuedFermicanconcurrentlyexecute:–Upto16kernels–Upto2memcopies,aslongastheyareindifferentdirections(D2HandH2D)Acallisdispatchedifbotharetrue:–Resourcesareavailable–PrecedingcallsinthesamestreamhavecompletedNotethatifacallblocks,itblocksallothercallsofthesametypebehindit,eveninotherstreams–Typeisoneof{kernel,memcopy}NVIDIA2010StreamExamplesK1,M1,K2,M2:K1M1K2M2K1,K2,M1,M2:K1M1K2M2K1,M1,M2:K1M1M2K1,M2,M1:K1M1M2K1,M2,M2:K1M2M2TimeK:kernelM:memcopyInteger:streadIDNVIDIA2010MoreonFermiConcurrentKernelsKernelsmaybeexecutedconcurrentlyiftheyareissuedintodifferentstreamsScheduling:–Kernelsareexecutedintheorderinwhichtheywereissued–ThreadblocksforagivenkernelarescheduledifallthreadblocksforprecedingkernelshavebeenscheduledandtherestillareSMresourcesavailableNVIDIA2010MoreonFermiDualCopyFermiiscapableofduplexcommunicationwiththehost–PCIebusisduplex–Thetwomemcopiesmustbeindifferentstreams,differentdirectionsNotallcurrenthostsystemscansaturateduplexPCIebandwidth:–LikelylimitationsoftheIOHchips–Ifthisisimportanttoyou,testyourhostsystemNVIDIA2010DuplexCopy:ExperimentalResultsCPU-0IOHX58DRAMGPU-0CPU-0IOHD36DRAMGPU-0CPU-0DRAM10.
8GB/s7.
5GB/sQPI,6.
4GT/s25.
6GB/s3xDDR3,1066MHz25.
8GB/sPCIe,x1616GB/sNVIDIA2010DuplexCopy:ExperimentalResultsCPU-0IOHX58DRAMGPU-0CPU-0IOHD36DRAMGPU-0CPU-1DRAM10.
8GB/s11GB/sQPI,6.
4GT/s25.
6GB/s3xDDR3,1066MHz25.
8GB/sPCIe,x1616GB/sNVIDIA2010SummaryKernelLaunchConfiguration:–LaunchenoughthreadsperSMtohidelatency–LaunchenoughthreadblockstoloadtheGPUGlobalmemory:–Maximizethroughput(GPUhaslotsofbandwidth,useiteffectively)Usesharedmemorywhenapplicable(over1TB/sbandwidth)GPU-CPUinteraction:–MinimizeCPU/GPUidling,maximizePCIethroughputUseanalysis/profilingwhenoptimizing:–"Analysis-drivenOptimization"talknextNVIDIA2010AdditionalResourcesBasics:–CUDAwebinarsonNVIDIAwebsite(justgoogleforCUDAwebinar)–CUDAbyExample"bookbyJ.
SandersandE.
CandrotProfiling,analysis,andoptimizationforFermi:–GTC-2010session2012:"Analysis-drivenOptimization"(tomorrow,3-5pm)GT200optimization:–GTC-2009session1029(slidesandvideo)Slides:–http://www.
nvidia.
com/content/GTC/documents/1029_GTC09.
pdfMaterialsforallsessions:–http://developer.
download.
nvidia.
com/compute/cuda/docs/GTC09Materials.
htmCUDATutorialsatSupercomputing:–http://gpgpu.
org/{sc2007,sc2008,sc2009}CUDAProgrammingGuideCUDABestPracticesGuideNVIDIA2010Questions

RAKsmart裸机云/云服务器/VPS全场7折,独立服务器限量秒杀$30/月起

适逢中国农历新年,RAKsmart也发布了2月促销活动,裸机云、云服务器、VPS主机全场7折优惠,新用户注册送10美元,独立服务器每天限量秒杀最低30.62美元/月起,美国洛杉矶/圣何塞、日本、香港站群服务器大量补货,1-10Gbps大带宽、高IO等特色服务器抄底价格,机器可选大陆优化、国际BGP、精品网及CN2等线路,感兴趣的朋友可以持续关注下。裸机云新品7折,秒杀产品5台/天优惠码:Bare-...

HostYun 新增美国三网CN2 GIA VPS主机 采用美国原生IP低至月15元

在之前几个月中也有陆续提到两次HostYun主机商,这个商家前身是我们可能有些网友熟悉的主机分享团队的,后来改名称的。目前这个品牌主营低价便宜VPS主机,这次有可以看到推出廉价版本的美国CN2 GIA VPS主机,月费地址15元,适合有需要入门级且需要便宜的用户。第一、廉价版美国CN2 GIA VPS主机方案我们可看到这个类型的VPS目前三网都走CN2 GIA网络,而且是原生IP。根据信息可能后续...

HostKvm($4.25/月),俄罗斯CN2带宽大升级,俄罗斯/香港高防限量5折优惠进行中

HostKvm是一家成立于2013年的国外VPS服务商,产品基于KVM架构,数据中心包括日本、新加坡、韩国、美国、俄罗斯、中国香港等多个地区机房,均为国内直连或优化线路,延迟较低,适合建站或者远程办公等。本月,商家旗下俄罗斯、新加坡、美国、香港等节点带宽进行了大幅度升级,俄罗斯机房国内电信/联通直连,CN2线路,150Mbps(原来30Mbps)带宽起,目前俄罗斯和香港高防节点5折骨折码继续优惠中...

freehost为你推荐
neworiental天津新东方总部地址在哪里?西部妈妈网啊,又是星期天嘉兴商标注册嘉兴那里有设计商标的百度关键词分析关键词怎么分析?www.7788dy.comwww.tom365.com这个免费的电影网站有毒吗?www.6vhao.com有哪些电影网站hao.rising.cn我的Google Chrome主页被http://hao.rising.cn//?b=64锁定了,谁有办法?ww.43994399??????????猴山条约中国近代史领土被割占去了多少,包括战争中失去的和吞并的总数酒仙琐事"酒仙"指的是什么人?
重庆网站空间 域名停靠 免费域名申请 kvmla 外贸主机 iis安装教程 牛人与腾讯客服对话 北京双线机房 admit的用法 昆明蜗牛家 江苏双线服务器 网购分享 备案空间 国外的代理服务器 美国服务器 pptpvpn dbank ddos攻击小组 紫田网络 网站服务器硬件配置 更多