NoteEE-218aTechnicalnotesonusingAnalogDevicesDSPs,processorsanddevelopmenttoolsContactourtechnicalsupportatdsp.
support@analog.
comandatdsptools.
support@analog.
comOrvisitouron-lineresourceshttp://www.
analog.
com/ee-notesandhttp://www.
analog.
com/processorsWritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessorsContributedbyBorisLernerRev2–March4,2004Copyright2004,AnalogDevices,Inc.
Allrightsreserved.
AnalogDevicesassumesnoresponsibilityforcustomerproductdesignortheuseorapplicationofcustomers'productsorforanyinfringementsofpatentsorrightsofotherswhichmayresultfromAnalogDevicesassistance.
AlltrademarksandlogosarepropertyIntroductionSo,youwanttowriteefficientcodefortheADSP-TS201TigerSHARCprocessorOr,maybe,youhavecomeacrosstheoptimizedexamplefloating-pointFFTforthisprocessorandwouldliketounderstandhowitworksandwhattheauthorhadinmindwhenwritingit.
ThisapplicationnotetriestoanswerbothquestionsbygoingthroughthatFFTexampleandallitslevelsofoptimizationindetail.
ThisexamplecanbefollowedindevelopingotheralgorithmsandcodeoptimizedfortheADSP-TS201Sprocessor.
Generally,mostalgorithmshaveseverallevelsofoptimization,allofwhicharediscussedindetailinthisnote.
Thefirstandmoststraightforwardlevelofoptimizationisparallelingofinstructions,astheprocessorarchitecturewillallow.
Thisissimpleandboring.
Thesecondlevelofoptimizationisloopunrollingandsoftwarepipeliningtoachievemaximumparallelismandtoavoidpipelinestalls.
Althoughmorecomplexthanthesimpleparallelismoflevelone,thiscanbedoneinprescribedstepswithoutgoodunderstandingofthealgorithmand,thus,requireslittleingenuity.
Thethirdlevelistorestructurethemathofthealgorithmtostillproducevalidresults,butsothatthenewrestructuredalgorithmfitstheprocessor'sarchitecturebetter.
Beingabletodothisrequiresathoroughunderstandingofthealgorithmand,unlikesoftwarepipelining,therearenoprescribedstepsthatleadtotheoptimalsolution.
Thisiswheremostofthefuninwritingoptimizedcodelies.
Inpracticalapplicationsitisoftenunnecessarytogothroughalloftheselevels.
Whenallofthelevelsarerequired,itisalwaysbesttodotheselevelsofoptimizationinreverseorder.
Bythetimethecodeisfullypipelined,itistoolatetotrytochangethefundamentalunderlyingalgorithm.
Thus,aprogrammerwouldhavetothinkaboutthealgorithmstructurefirstandorganizethecodeaccordingly.
Then,levelstwoandone(paralleling,unrolling,andpipelining)areusuallydoneatthesametime.
ThecodethatthisnotereferstoissuppliedbyAnalogDevicesintheformthatallowsittobecalledaseitherarealoracomplexFFT,thelastcallingparameterofthefunctiondefiningifrealorcomplexistobecalled.
TherealN-pointFFTisobtainedfromthecomplexN/2-pointFFTwithanadditionalspecialstageattheend.
Thisnoteisconcernedwithcodeoptimizationmorethanthetechnicalitiesofthespecialstage,soitdiscussesthealgorithmforthecomplexFFTportionofthecodeonly.
ThelastspecialstageoftherealFFTisdiscussedindetailinthecommentsofthecode.
StandardRadix-2FFTAlgorithmFigure1showsastandard16-pointradix-2FFTimplementation,aftertheinputhasbeenbit-reversed.
Traditionally,inthisalgorithm,stagesoftheirrespectiveholders.
InformationfurnishedbyAnalogDevicesApplicationsandDevelopmentToolsEngineersisbelievedtobeaccurateandreliable,howevernoresponsibilityisassumedbyAnalogDevicesregardingtechnicalaccuracyandtopicalityofthecontentprovidedinAnalogDevices'Engineer-to-EngineerNotes.
a1and2arecombinedtogetherwiththerequiredbitreversingintoasingleoptimizedloop(sincethesetwostagesrequirenomultiplies,onlyaddsandsubtracts).
Eachoftheremainingstagesisusuallydonebycombiningthebutterfliesthatsharethesametwiddlefactorstogetherintogroups(sothetwiddleshavetobefetchedonlyonceforeachgroup).
Un-optimizedassemblysourcecodeforaTigerSHARCprocessorimplementingthisalgorithmisshowninListing1.
This,withafewtricksthatareirrelevanttothisdiscussion,isthewaythatthe32-bitfloating-pointFFTcodewaswrittenwhenitwastargetedtoanADSP-TS101processor.
Thebenchmarks(incoreclockcycles)forthisalgorithm,includingbitreversal,runningonanADSP-TS101andADSP-TS201,areshowninTable1.
NotethatsincetheADSP-TS101haslessmemorypermemoryblockthanaADSP-TS201,largerpointsizebenchmarksdonotapplytotheADSP-TS101.
Clearly,aslongasthedatafitsintotheADSP-TS201cache,itisefficient.
Oncethedatabecomestoolargeforthecache,thisFFTimplementationbecomesextremelyinefficient–thecyclecountincreasesfromoptimalbyafactoroffive.
Figure1.
StandardStructureofthe16-PointFFTWritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page2of16aStagesk10=k31+N;;//twiddlesstridek11=k31+N/2;;//butterflies/groupj12=j31+1;;//groupsj10=j31+2;;//widthofbutterflyj11=j31+4;;//butterflystridek20=k31+STAGES;;_stages_loop:j0=j31+j29;;//j0->internal_buffk0=k31+k30;;//k0->twiddlesj13=j31+0;;LC1=j12;;_group_loop:xr1:0=L[k0+=k10];;//xr0=cos,xr1=-sinj1=j0+j10;;//j1->secondinputtobutterflyLC0=k11;;_butterfly_loop:xr3:2=L[j0+=0];;//xr2=Re1,xr3=Im1xr5:4=L[j1+=0];;//xr4=Re2,xr5=Im2xfr6=r4*r0;;//xr6=Re2*cosxfr7=r5*r1;;//xr7=Im2*sinxfr8=r6-r7;;//xr8=Re(z2*twid)xfr9=r4*r1;;//xr6=Re2*sinxfr10=r5*r0;;//xr7=Im2*cosxfr11=r9+r10;;//xr8=Im(z2*twid)xfr12=r2+r8,fr14=r2-r8;;//Re(butterfly)xfr13=r3+r11,fr15=r3-r11;;//Im(butterfly)L[j0+=j11]=xr13:12;;L[j1+=j11]=xr15:14;;ifNLC0E,jump_butterfly_loop(NP);;j13=j13+2;;j0=j29+j13;;//offsetforthenextgroupifNLC1E,jump_group_loop(NP);;k10=lshiftrk10;;//twiddlesstridek11=lshiftrk11;;//butterflies/groupj12=j12+j12;;//groupsj10=j10+j10;;//widthofbutterflyj11=j11+j11;;//butterflystridek20=k20-1;;ifNKEQ,jump_stages_loop(NP);;Listing1.
fft32_unoptimized.
asmPointsNADSP-TS101ADSP-TS201InputnotincacheADSP-TS201Inputincache256217226412218512458255334649102498721217099922048213382661022173409646244197272NA819299886444628NA16384215224987730NA32768NA2133220NA65536NA4720010NATable1.
CoreClockCyclesforN-pointComplexFFTOptimizingtheStructureoftheFFTforADSP-TS201ProcessorsTobeabletore-structurethealgorithmtoperformoptimallyonADSP-TS201,wehavetounderstandwhytheperformanceoflargeFFTsusingtheconventionalFFTstructureissopoor.
ADSP-TS201memoryisoptimizedforsequentialreads.
Cacheisdesignedtohelpwithalgorithmswherethereadsarenotsequential.
IntheconventionalFFTalgorithm,eachstage'sbutterfliesstridedoubles,sothereadsarenon-sequentialand,witheachnewstage,thecacheislessandlesslikelytobeahit–thereadsareallovertheplace.
Thesolutionliesinre-arrangingastage'soutputtoensurethatthenextstage'sreadsaresequential.
ThestructureofthealgorithmimplementationisshowninFigure2.
WritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page3of16aWritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page4of16Figure2.
ReorganizedStructureofthe16-PointFFTItissimpleenoughtotracethisdiagrambyhandtoseethatitissimplyare-orderingofthediagraminFigure1.
Amazinglyenough,thefinaloutputisincorrectorder.
ThiscanbeeasilyprovenforgeneralN=2k=thenumberofpointsintheFFT.
Notethatthere-orderingisgivenbythefollowingformula:()+=oddisif,221evenisif,2nNnnnnfThus,ifniseven,itisshiftedrightandifnisodd,itisshiftedrightanditsmostsignificantbitisset.
Thisis,ofcourse,equivalenttotheoperationofright1-bitrotation,whichafter()NK2log=stepsreturnstheoriginalnback.
Thus,theoutputafterKstagesisincorrectorderagain.
Great!
Wehaveournewstructure.
Ithassequentialreadsandweareluckyenoughthattheoutputisinthecorrectorder.
Thisshouldbemuchmoreefficient.
RightLet'swritethecodeforit!
Well,beforewespendalotoftimewritingthecode,weshouldensurethatalloftheDSPoperationsthatweareabouttodoactuallyfitintoourprocessorarchitectureefficiently.
Theremaybenoreasontooptimizedatamovementsiftheunderlyingmathsuffers.
aThefirstobviouspointtonoticeisthatthisstructurecannotbedonein-placeduetoitsre-ordering.
Thestageswillhavetoping-pongtheirinput/outputbuffers.
Thisshouldnotbeaproblem.
TheADSP-TS201processorhasalotofmemoryonboard,butshouldmemoryoptimizationberequired(andinputdoesnothavetobepreserved),wecanusetheinputasoneofthetwoping-pongbuffers.
Next,wenotethatatraditionalFFTcombinesbutterfliesthatsharetwiddlesintothesamegrouptosavetwiddlefetchcycles.
Amazingly,thetwiddlesofthestructureinFigure2lineuplinearly–onegroupatatime.
Weareluckyagain!
Now,whatwouldabutterflyofthisnewstructureconsistofTable2liststheoperationsnecessarytoperformasinglecomplexbutterfly.
SincetheADSP-TS201isaSIMDprocessor(i.
e.
,itcandoubleallthecomputes),wewillwritethestepsoutlinedinListing1inSIMDfashion,sothattwoadjacentbutterfliesarecomputedinparallel,oneintheX-ComputeblockandtheotheroneintheY-Computeblock.
LetusanalyzetheDSPoperationsinmoredetail.
F1,F2,K2andF4fetchatotaloffour32-bitwords,whichonADSP-TS201canbedoneinasinglequadfetchintoX-Computeblockregisters.
TobeabletosupplySIMDmachinewithdata,wewouldalsohavetoperformasecondbutterflyquadfetchintotheY-Computeblockregisters.
Then,M1,M2,M3,M4,A1,andA2willperformSIMDoperationsforbothbutterflies.
TheADSP-TS201supportsasingleadd/subtractinstruction,soA3andA4canbecombinedintoasingleoperation(whichis,ofcourse,performedSIMDonbothbutterfliesatonce)andsimilarlyA5andA6canbecombined,aswell.
MnemonicOperationF1FetchReal(Input1)oftheButterflyF2FetchImag(Input1)oftheButterflyK2FetchReal(Input2)oftheButterflyF4FetchImag(Input2)oftheButterflyM1K2*Real(twiddle)M2F4*Imag(twiddle)M3K2*Imag(twiddle)M4F4*Real(twiddle)A1M1–M2=Real(Input2*twiddle)A2M3+M4=Imag(Input2*twiddle)A3F1+A1=Real(Output1)A4F1-A1=Real(Output2)A5F2+A2=Imag(Output1)A6F2–A2=Imag(Output2)S1Store(Real(Output1))S2Store(Imag(Output1))S3Store(Real(Output2))S4Store(Imag(Output2))Table2.
SingleButterflyDoneLinearly–LogicalImplementationNowwerunintoaproblem:S1,S2,S3andS4cannotbeperformedinthesamecycle,sinceS3andS4aredestinedtoanotherplaceinmemoryduetoouroutputre-ordering.
Instead,wecanstoreS1andS2forbothbutterfliesinonecycle(luckyagain–theseareadjacent!
)andS3andS4forbothbutterfliesinthenextcycle.
Sofar,sogood–thenewsetofoperationsissummarizedinTable3.
WritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page5of16aMnemonicOperationF1FetchInput1,2oftheButterfly1F2FetchInput1,2oftheButterfly2M1Real(Input2)*Real(twiddle)M2Imag(Input2)*Imag(twiddle)M3Real(Input2)*Imag(twiddle)M4Imag(Input2)*Real(twiddle)A1M1–M2=Real(Input2*twiddle)A2M3+M4=Imag(Input2*twiddle)A3Real(Input1)+/-A1=Real(Output1,2)A4Imag(Input1)+/-A2=Imag(Output1)S1Store(Output1,bothButterflies)S2Store(Output2,bothButterflies)Table3.
SingleButterflyDoneLinearly–ActualADSP-TS20xImplementationEachoperationinTable3isasingle-cycleoperationonADSP-TS201processor.
Thereisatotalof2fetches,4multiplies,4ALU,and2storeinstructions.
SincetheADSP-TS201allowsfetches/storestobeparalleledwithmultipliesandALUsinasinglecycle,loopunrolling,pipelining,andparallelingshouldyielda4-cycleexecutionofthesetwoSIMDbutterflies(andwearestillefficientinthememoryusage!
).
Atthispoint,wecannowbereasonablycertainthattheabovewillyieldefficientcodeandwecanstartdevelopingit.
However,carefulobservationatthispointcanhelpusoptimizethisstructureevenfurther.
Notethatweareonlyusingatotalof4fetchesandstoresfromasinglememoryblock,say,byusingJALUpointerregisters.
Inparallelwecando3morefetches/stores/KALUoperationswithoutlosinganycycles(actually,wecando4ofthem,butwedoneedonereservedplaceinoneoftheinstructionsforaloopjumpback).
Thus,theoldruleoffetchingtwiddlesonlyoncepergroupofbutterfliesthatsharesthemisnolongernecessary–thetwiddlefetchescomefree!
And,sincethestructureofthearrowsofFigure2isidenticalateverystage,wemaybeabletoreducetheFFTfromtheusualthreenestedloopstoonlytwo,providedthatwecanfindawaytocorrectlyfetchthetwiddlesateachstage(twiddlesaretheonlythingthatdistinguishesthestagesofFigure2).
Figure2showshowthetwiddlesmustbefetchedateachstage:1stStage–allareW0.
2ndStage–halfareW0,nexthalfareWN/4.
3rdStage–onequarterareW0,thenextquarterareWN/8,thenextquarterareW2N/8,andthelastquarterareW3N/8.
Andsoon…Ifwekeepavirtualtwiddlepointeroffset,incrementittothenextsequentialtwiddleeverybutterfly,butANDitwithamaskbeforeactuallyusingitinthetwiddlefetch,weachievepreciselythisorderoftwiddlefetch.
Moreover,thisruleisthesameforeverystage,exceptthatthemaskateverystagemustbeshifteddownbyonebit(i.
e.
,eachstagerequirestwiceasfinearesolutionofthetwiddlesasthepreviousstage).
Here,ourunusedKALUoperationscomeinveryhandy.
Toimplementthistwiddlefetch,weneedtoincrementthevirtualoffset,maskitanddoatwiddlefetcheverybutterfly…Oh,no!
WeareinSIMD(i.
e.
wearedoingtwobutterfliestogether)andwedonothavethe6availableinstructionslotsforthis!
Butlucksavesusagain.
WecaneasilynoticethatallstagesexceptthelastsharethetwiddlesbetweentheSIMDpairofbutterflies–so,forthesestages,weneedonlytodothetwiddlefetchonceperSIMDpairofthebutterflies!
Andthethreecyclesarepreciselywhatwehavetodothis.
Unfortunately,inthelaststage,everybutterflyhasitsownuniquetwiddle;butinthelaststage,wedonothavetomask–juststepthepointertothenexttwiddleeverytime!
Itwillhavetobewrittenseparately,butitwilloptimizecompletelyaswell.
Table4summarizesthelateststructure'ssteps.
ThreeWritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page6of16anewKALUoperations(K1,K2andK3)havebeenaddedtoTable3.
TimetowritethecodeWell,no–letusfigureouthowtopipelineitfirst.
MnemonicOperationK1VirtualPointerOffsetMaskK2TwiddlesFetchK3VirtualPointerOffsetIncrementF1FetchInput1,2oftheButterfly1F2FetchInput1,2oftheButterfly2M1Real(Input2)*Real(twiddle)M2Imag(Input2)*Imag(twiddle)M3Real(Input2)*Imag(twiddle)M4Imag(Input2)*Real(twiddle)A1M1–M2=Real(Input2*twiddle)A2M3+M4=Imag(Input2*twiddle)A3Real(Input1)+/-A1=Real(Output1,2)A4Imag(Input1)+/-A2=Imag(Output1)S1Store(Output1,bothButterflies)S2Store(Output2,bothButterflies)Table4.
SingleButterflyDoneLinearly–ModifiedADSP-TS20xImplementationPipeliningoftheAlgorithmFigure3showsthealgorithm'soperationsfromTable4witharrowsshowingthedependencies.
Thearrowsofthedependenciesindicatethattheresultoftheoperationatthestartofthearrowisusedbytheoperationattheendofthatarrowand,thus,mustbecompletedfirsttoensurecorrectdata.
Somearrowshaveastallassociatedwiththem,specifically:K2->M1,M2,M3,M4F1,F2->M1,M2,M3,M4,A3,A4M1,M2->A1M3,M4->A2A1,A2->A3,A4Thismeansthatiftheoperationatthestartofthearrowisimmediatelyfollowedbytheoperationattheendofthatarrow,theresultwillbecorrect,butcodeexecutionwillproduceastall.
Thus,tofullyoptimizethecode,operationsattheendsofarrowswithstallsmustbekeptmorethanoneinstructionlineapart.
Figure3.
ReorganizedStructure'sDependenciesAquickobservationofthedependenciesinFigure3issufficienttoanalyzethelevelofpipeliningandthenumberofcomputeblockregistersneededtodoit.
WritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page7of16aStateDependencyToStatesMaxDepCyclesComputeBlockRegistersNeededK1K210K2M1,M2,M3,M454*([5/4]+1)=8K3K110F1M1,M2,M3,M4,A1,A2104*([10/4]+1)=16F2M1,M2,M3,M4,A1,A2104*([10/4]+1)=16M1A122([2/4]+1)=2M2A122([2/4]+1)=2M3A222([2/4]+1)=2M4A222([2/4]+1)=2A1A3,A422([2/4]+1)=2A2A3,A422([2/4]+1)=2A3S1,S214([1/4]+1)=4A4S1,S214([1/4]+1)=4S1none00S2none00TotalRegs60Table5.
NumberofComputeBlockRegistersRequiredtoPipelinetheButterfliesFullpipelining,asmentionedearlier,wouldgivea4-cycleSIMDpairofbutterflies.
Thus,Pipelined_CB_Registers_Per_State_Output=Unpipelined_CB_Registers_Per_State_Output*([Maximum_Dependency_Cycles/4]+1)Here,[x]denotestheintegerpartofthenumberx.
Wecanthereforedeterminethenumberofcomputeblockregistersneeded,asshowninTable5.
NotethatA3andA4requiretwiceasmanyoutputregistersasM1,M2,M3,M4,A1andA2sinceA3andA4areadd/subtracts.
Theresultingrequirementtofullypipelinethiscodeis60computeblockregisters,outof64total–justbarelymadeit!
Cycle/OperationJALUKALUMACALU1F1K1M4--A3---2F2K2M2-A4---3S1---M3-A2--4S2---K3M1A1-5F1+K1+M4-A3--6F2+K2+M2A4--7S1--M3A2-8S2--K3+M1+A19F1++K1++M4A3-10F2++K2++M2+A4-11S1-M3+A212S2-K3++M1++A1+13F1+++K1+++M4+A314F2+++K2+++M2++A415S1M3++A2+16S2K3+++M1+++A1++Table6.
PipelinedButterfliesWepipelinethisfullysymbolically,usingthemnemonicsofTable4andFigure3.
ThepipeliningisshowninTable6,inwhich"+"intheoperationindicatestheoperationthatcorrespondstothenextsetofthebutterfliesand"-"correspondstotheoperationintheprevioussetofthebutterflies.
WritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page8of16aAllinstructionsareparalleled,therearenostalls,andthereisaplacetoputthejumptothetopoftheloop(actually,fourplaces,butthisisonlybecausethepipelineis4pairsofbutterfliesdeep,eachiterationoftheloopinTable6willactuallydo4pairsofbutterflies).
TheCodeNow,writingthecodeistrivial.
TheADSP-TS201issoflexiblethatittakesallthechallengerightoutofit.
JustfollowthepipelineofTable6andthecodeisdone.
TheresultingcodeforthestagesotherthanlastisshowninListing2.
Outsideofthisinnerloopisastageloopthatping-pongsinput/outputbuffersandshiftsthetwiddlemodifiermask.
Prettysimple!
Additionaloptimizationisdonebybreakingthefirsttwostagesawayfromthemainofthecodeanddoingthemseparately–theydonotreallyrequireacomplexmultiplyandcanbedonefaster.
Also,bit-reversalisincorporatedintothefirsttwostages,aswell.
Now,forthebottomline–howmuchdidthecyclecountimproveInTable7werepeatTable1withadditionalcolumnsforthebenchmarksforthenewalgorithm.
Thecyclecountforthelarger-than-cacheFFTsimprovedbyafactorgreaterthan3!
Moreover,thecyclecountforFFTsthatfitintothecacheisbetterthanitwasontheoriginalADSP-TS101processor,whichhadnocacheormemorylatencyissuesofanykind.
Thereasonforthisisthatthenewarchitectureallowsthecodetobewrittenintwonestedloopsinsteadofthreeand,thus,hassignificantlylessoverhead.
Thiscode,portedtotheADSP-TS101,improvesitsbenchmarks,too–asshowninTable7.
.
align_code4;_BflyLoop:q[j2+=4]=r27:26;k5=k5+k9;fr6=r30*r12;fr16=r6-r7;;//S2----,K3-,M1-,A1--yr3:0=q[j0+=4];k3=k5andk4;fr15=r23*r4;fr24=r8+r18,fr26=r8-r18;;//F1,K1,M4--,A3---xr3:0=q[j0+=4];r5:4=l[k7+k3];fr7=r31*r13;fr25=r9+r19,fr27=r9-r19;;//F2,K2,M2-,A4---q[j1+=4]=r25:24;fr14=r30*r13;fr17=r14+r15;;//S1---,M3-,A2--q[j2+=4]=r27:26;k5=k5+k9;fr6=r2*r4;fr18=r6-r7;;//S2---,K3,M1,A1-yr11:8=q[j0+=4];k3=k5andk4;fr15=r31*r12;fr24=r20+r16,fr26=r20-r16;;//F1+,K1+,M4-,A3--xr11:8=q[j0+=4];r13:12=l[k7+k3];fr7=r3*r5;fr25=r21+r17,fr27=r21-r17;;//F2+,K2+,M2,A4--q[j1+=4]=r25:24;fr14=r2*r5;fr19=r14+r15;;//S1--,M3,A2-q[j2+=4]=r27:26;k5=k5+k9;fr6=r10*r12;fr16=r6-r7;;//S2--,K3+,M1+,A1yr23:20=q[j0+=4];k3=k5andk4;fr15=r3*r4;fr24=r28+r18,fr26=r28-r18;;//F1++,K1++,M4,A3-xr23:20=q[j0+=4];r5:4=l[k7+k3];fr7=r11*r13;fr25=r29+r19,fr27=r29-r19;;//F2++,K2++,M2+,A4-q[j1+=4]=r25:24;fr14=r10*r13;fr17=r14+r15;;//S1-,M3+,A2q[j2+=4]=r27:26;k5=k5+k9;fr6=r22*r4;fr18=r6-r7;;//S2-,K3++,M1++,A1+yr31:28=q[j0+=4];k3=k5andk4;fr15=r11*r12;fr24=r0+r16,fr26=r0-r16;;//F1+++,K1+++,M4+,A3xr31:28=q[j0+=4];r13:12=l[k7+k3];fr7=r23*r5;fr25=r1+r17,fr27=r1-r17;;//F2+++,K2+++,M2++,A4.
align_code4;ifNLC0E,jump_BflyLoop;q[j1+=4]=r25:24;fr14=r22*r5;fr19=r14+r15;;//S1,M3++,A2+Listing2.
fft32.
asm-fragmentWritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page9of16aNADSP-TS101OldstructureADSP-TS101NewstructureADSP-TS201Oldstructure-InputnotincacheADSP-TS201Newstructure-InputnotincacheADSP-TS201Oldstructure-InputincacheADSP-TS201Newstructure-Inputincache25621721958264124022218196351245824276553351924649428310249872941012170116629992941920482133820688266102531622173206994096462444527819727269924NANA81929988698540444628147628NANA16384215224213243987730313292NANA32768NANA2133220662614NANA65536NANA47200101397544NANATable7.
CoreClockCyclesforN-pointComplexFFT,NewversusOldStructureUsageRulesTheC-callablecomplexFFTroutineiscalledasFFT32(&(input),&(ping_pong_buffer1),&(ping_pong_buffer2),&(output),N,F);whereinput->FFTinputbuffer,output->FFToutputbuffer,ping_pong_bufferxarethepingpongbuffers,N=Numberofcomplexpoints,F=0ifFFTisrealand1ifFFTiscomplex.
Asmentionedearlier,duetodatare-ordering,stagescannotbedonein-placeandhavetoping-pong.
Thus,ping_pong_buffer1andping_pong_buffer2havetobetwodistinctbuffers.
However,dependingontheroutine'suserrequirements,somememoryoptimizationispossible.
Ping_pong_buffer1canbemadethesameasinputifinputdoesnotneedtobepreserved.
Also,ifLog2(N)iseven,outputcanbemadethesameasping_pong_buffer2andifLog2(N)isodd,outputcanbemadethesameasping_pong_buffer1.
Belowaretwoexamplesoftheroutineusagewithminimaluseofmemory:FFT32(&(input),&(input),&(output),&(output),1024,1);FFT32(&(input),&(input),&(ping_pong_buffer2),&(input),2048,1);Toeliminatememoryblockaccessconflicts,inputmustresideinadifferentmemoryblockthanping_pong_buffer2andtwiddlefactorsmustresideinadifferentmemoryblockthantheping-pongbuffers.
Ofcourse,allcodemustresideinablockthatisdifferentfromallthedatabuffers,aswell.
Ping-pongbufferscanshareamemoryblock,however–thereisnoinstructionthataccessesbothping-pongbuffersinthesamecycle.
WritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page10of16aAppendixCompleteSourceCodeoftheOptimizedFFT/*fft32.
asmPrelimrev.
October19,2003-BLRev.
1.
0-addedrealinputscase-PMThisisassemblyroutinefortheComplexradix-2C-callableFFTonTigerSHARCfamilyofDSPs.
I.
DescriptionofCalling.
1.
Inputs:j4->input(ping-pongbuffer1)j5->ping-pongbuffer1j6->ping-pongbuffer2j7->outputj27+0x18->N=Numberofpointsj27+0x19->REALorCOMPLEX2.
C-CallingExample:fft32(&(input),&(ping_pong_buffer1),&(ping_pong_buffer2),&(output),N,COMPLEX);3.
Limitations:a.
Allbuffersmustbealignedonmemoryboundarywhichisamultipleof4.
b.
Nmustbebetween32andMAX_FFT_SIZE.
c.
Ifmemoryspacesavingsarerequiredandinputdoesnothavetobepreserved,ping_pong_buffer1canbethesamebufferasinput.
d.
Ifmemoryspacesavingsarerequired,outputcanbethesamebufferasping_pong_buffer2ifthenumberofFFTstagesiseven(i.
e.
Log2(N)iseven)orthesameasping_pong_buffer1ifthenumberofFFTstagesisodd(i.
e.
Log2(N)isodd).
4.
MAX_FFT_SIZEcanbeselectedvia#define.
LargervaluesallowformorechoicesofN,butitstwiddleswilloccupymorememory.
5.
ThisC-callablefunctioncanprocessupto64KblocksofdataonTS201(16KblocksonTS101)becauseCenvironmentitselfnecessitatesmemory.
Therefore,ifmoreinputpointsarenecessary,assemblylanguagedevelopmentmaybecomeamust.
OnTS201,ablockofmemoryis128Kwordslong,somaximumNis128Krealpointsor64Kcomplexpoints.
TS101containsonly2blocksofdatamemoryof64Kwordsand4buffersmustbeaccommodated.
Therefore,maximumNis32Krealwordsor16Kcomplexwords.
II.
DescriptionoftheFFTalgorithm.
1.
TheinputdataistreatedascomplexinterleavedN-point.
2.
Duetore-ordering,nostagecanbedonein-place.
3.
Thebitreversalandthefirsttwostagesarecombinedintoasingleloop.
Thislooptakesdatafrominputandstoresitintheping-pongbuffer1.
4.
Eachsubsequentstageping-pongsthedatabetweenthetwoping-pongbuffers.
ThelaststageusesFFToutputbufferforitsoutput.
5.
AlthoughtheFFTisdesignedtobecalledwithanypointsizeNinput,xr1=bitdifferencebetweenMAXandNk10=lshiftrk10;xr0=bsetr0byr1;xr30=r2-r3;;//k10=N/16,xr30=Stages-3k10=k10-1;xr0=lshiftr0by2;LC1=xr30;;//k10=N/16-1,LC1=Stages-3k9=xr0;k4=k31+(MAX_FFT_SIZE/4-1);;k4=notk4;j10=lshiftrj9;;//initialtwiddlespointermask,j10=N/8BitReverseandStages1&2k5=lshiftrk1;;//k5=N/2j0=j31+j6;k6=k6-k6;;//j0->ping_pong_buffer2j1=j0+j9;LC0=k10;;//j1->ping_pong_buffer2+N/4,LC0=N/16-1j2=j1+j9;k1=k0+k5;;//j2->ping_pong_buffer2+N/2,k1->input+N/2j3=j2+j9;k2=k1+k5;;//j3->ping_pong_buffer2+3N/4,k2->input+Nj12=j3+j9;k3=k2+k5;;//j12->ping_pong_buffer2+N,k3->input+3N/2j13=j12+j9;k5=lshiftrk5;;//j13->ping_pong_buffer2+5N/4,k5=N/4j14=j13+j9;r1:0=q[k0+k6];;//j14->ping_pong_buffer2+3N/2j15=j14+j9;r3:2=q[k2+k6];;//j15->ping_pong_buffer2+7N/4r5:4=q[k1+k6];;r7:6=q[k3+k6];;k6=k6+k5(br);fr0=r0+r2,fr20=r0-r2;;r9:8=q[k0+k6];fr2=r1+r3,fr29=r1-r3;;r11:10=q[k2+k6];fr4=r4+r6,fr21=r4-r6;;r13:12=q[k1+k6];fr5=r5+r7,fr28=r5-r7;;r15:14=q[k3+k6];fr18=r8+r10,fr22=r8-r10;;k6=k6+k5(br);fr19=r9+r11,fr31=r9-r11;;fr26=r12+r14,fr23=r12-r14;;fr27=r13+r15,fr30=r13-r15;;fr20=r20+r28,fr28=r20-r28;;fr29=r29+r21,fr21=r29-r21;;fr22=r22+r30,fr30=r22-r30;;fr31=r31+r23,fr23=r31-r23;;.
align_code4;_Stages1and2Loop:r1:0=q[k0+k6];q[j2+=4]=yr23:20;fr16=r0+r4,fr24=r0-r4;;r3:2=q[k2+k6];q[j3+=4]=xr23:20;fr17=r2+r5,fr25=r2-r5;;r5:4=q[k1+k6];q[j14+=4]=yr31:28;fr18=r18+r26,fr26=r18-r26;;r7:6=q[k3+k6];q[j15+=4]=xr31:28;fr19=r19+r27,fr27=r19-r27;;k6=k6+k5(br);q[j0+=4]=yr19:16;fr0=r0+r2,fr20=r0-r2;;r9:8=q[k0+k6];q[j1+=4]=xr19:16;fr2=r1+r3,fr29=r1-r3;;r11:10=q[k2+k6];q[j12+=4]=yr27:24;fr4=r4+r6,fr21=r4-r6;;r13:12=q[k1+k6];q[j13+=4]=xr27:24;fr5=r5+r7,fr28=r5-r7;;r15:14=q[k3+k6];fr18=r8+r10,fr22=r8-r10;;k6=k6+k5(br);fr19=r9+r11,fr31=r9-r11;;fr26=r12+r14,fr23=r12-r14;;fr27=r13+r15,fr30=r13-r15;;fr20=r20+r28,fr28=r20-r28;;fr29=r29+r21,fr21=r29-r21;;fr22=r22+r30,fr30=r22-r30;;WritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page12of16a.
align_code4;ifNLC0E,jump_Stages1and2Loop;fr31=r31+r23,fr23=r31-r23;;q[j2+=4]=yr23:20;fr16=r0+r4,fr24=r0-r4;;q[j3+=4]=xr23:20;fr17=r2+r5,fr25=r2-r5;;q[j14+=4]=yr31:28;fr18=r18+r26,fr26=r18-r26;;q[j15+=4]=xr31:28;fr19=r19+r27,fr27=r19-r27;;q[j0+=4]=yr19:16;;q[j1+=4]=xr19:16;;q[j12+=4]=yr27:24;;q[j13+=4]=xr27:24;;Stages3toLog2(N)-1j0=j31+j6;k5=k31+0;;.
align_code4;_StageLoop:yr3:0=q[j0+=4];k3=k5andk4;;//F1,K1xr3:0=q[j0+=4];r5:4=l[k7+k3];;//F2,K2LC0=k10;k5=k5+k9;;//K3,M1yr11:8=q[j0+=4];k3=k5andk4;fr6=r2*r4;;//F1+,K1+xr11:8=q[j0+=4];r13:12=l[k7+k3];fr7=r3*r5;;//F2+,K2+,M2fr14=r2*r5;;//M3j1=j31+j5;k5=k5+k9;fr6=r10*r12;fr16=r6-r7;;//K3+,M1+,A1yr23:20=q[j0+=4];k3=k5andk4;fr15=r3*r4;;//F1++,K1++,M4xr23:20=q[j0+=4];r5:4=l[k7+k3];fr7=r11*r13;;//F2++,K2++,M2+fr14=r10*r13;fr17=r14+r15;;//M3+,A2j2=j1+j11;k5=k5+k9;fr6=r22*r4;fr18=r6-r7;;//K3++,M1++,A1+yr31:28=q[j0+=4];k3=k5andk4;fr15=r11*r12;fr24=r0+r16,fr26=r0-r16;;//F1+++,K1+++,M4+,A3xr31:28=q[j0+=4];r13:12=l[k7+k3];fr7=r23*r5;fr25=r1+r17,fr27=r1-r17;;//F2+++,K2+++,M2++,A4q[j1+=4]=r25:24;fr14=r22*r5;fr19=r14+r15;;//S1,M3++,A2+.
align_code4;_BflyLoop:q[j2+=4]=r27:26;k5=k5+k9;fr6=r30*r12;fr16=r6-r7;;//S2----,K3-,M1-,A1--yr3:0=q[j0+=4];k3=k5andk4;fr15=r23*r4;fr24=r8+r18,fr26=r8-r18;;//F1,K1,M4--,A3---xr3:0=q[j0+=4];r5:4=l[k7+k3];fr7=r31*r13;fr25=r9+r19,fr27=r9-r19;;//F2,K2,M2-,A4---q[j1+=4]=r25:24;fr14=r30*r13;fr17=r14+r15;;//S1---,M3-,A2--q[j2+=4]=r27:26;k5=k5+k9;fr6=r2*r4;fr18=r6-r7;;//S2---,K3,M1,A1-yr11:8=q[j0+=4];k3=k5andk4;fr15=r31*r12;fr24=r20+r16,fr26=r20-r16;;//F1+,K1+,M4-,A3--xr11:8=q[j0+=4];r13:12=l[k7+k3];fr7=r3*r5;fr25=r21+r17,fr27=r21-r17;;//F2+,K2+,M2,A4--q[j1+=4]=r25:24;fr14=r2*r5;fr19=r14+r15;;//S1--,M3,A2-q[j2+=4]=r27:26;k5=k5+k9;fr6=r10*r12;fr16=r6-r7;;//S2--,K3+,M1+,A1yr23:20=q[j0+=4];k3=k5andk4;fr15=r3*r4;fr24=r28+r18,fr26=r28-r18;;//F1++,K1++,M4,A3-xr23:20=q[j0+=4];r5:4=l[k7+k3];fr7=r11*r13;fr25=r29+r19,fr27=r29-r19;;//F2++,K2++,M2+,A4-q[j1+=4]=r25:24;fr14=r10*r13;fr17=r14+r15;;//S1-,M3+,A2q[j2+=4]=r27:26;k5=k5+k9;fr6=r22*r4;fr18=r6-r7;;//S2-,K3++,M1++,A1+yr31:28=q[j0+=4];k3=k5andk4;fr15=r11*r12;fr24=r0+r16,fr26=r0-r16;;//F1+++,K1+++,M4+,A3xr31:28=q[j0+=4];r13:12=l[k7+k3];fr7=r23*r5;fr25=r1+r17,fr27=r1-r17;;//F2+++,K2+++,M2++,A4.
align_code4;ifNLC0E,jump_BflyLoop;q[j1+=4]=r25:24;fr14=r22*r5;fr19=r14+r15;;//S1,M3++,A2+q[j2+=4]=r27:26;fr6=r30*r12;fr16=r6-r7;;//S2----,M1-,A1--j0=j31+j5;fr15=r23*r4;fr24=r8+r18,fr26=r8-r18;;//M4--,A3---//swapping-pongpointersj5=j31+j6;fr7=r31*r13;fr25=r9+r19,fr27=r9-r19;;//M2-,A4---q[j1+=4]=r25:24;fr14=r30*r13;fr17=r14+r15;;//S1---,M3-,A2--q[j2+=4]=r27:26;fr18=r6-r7;;//S2---,A1-j6=j31+j0;fr15=r31*r12;fr24=r20+r16,fr26=r20-r16;;//M4-,A3--fr25=r21+r17,fr27=r21-r17;;//A4--q[j1+=4]=r25:24;fr19=r14+r15;;//S1--,A2-q[j2+=4]=r27:26;fr24=r28+r18,fr22=r28-r18;;//S2--A3-j0=j31+j6;fr25=r29+r19,fr23=r29-r19;;//A4-q[j1+=4]=r25:24;k5=k31+0;;//S1-.
align_code4;ifNLC1E,jump_StageLoop;q[j2+=4]=r23:22;k4=ashiftrk4;;//S2-,shiftthemaskLaststagek9=ashiftrk9;;//inthismanneranyMAX_FFT_SIZEcanbeusedyr3:0=q[j0+=4];yr5:4=l[k7+=k9];;//F1,xr3:0=q[j0+=4];xr5:4=l[k7+=k9];;//F2,K2j1=j31+j7;fr6=r2*r4;LC0=k10;;//M1yr11:8=q[j0+=4];yr13:12=l[k7+=k9];;//F1+xr11:8=q[j0+=4];xr13:12=l[k7+=k9];fr7=r3*r5;;//F2+,K2+,M2j2=j1+j11;fr14=r2*r5;;//M3fr6=r10*r12;fr16=r6-r7;;//M1+,A1yr23:20=q[j0+=4];yr5:4=l[k7+=k9];fr15=r3*r4;;//F1++,M4xr23:20=q[j0+=4];xr5:4=l[k7+=k9];fr7=r11*r13;;//F2++,K2++,M2+fr14=r10*r13;fr17=r14+r15;;//M3+,A2fr6=r22*r4;fr18=r6-r7;;//M1++,A1+WritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page13of16ayr31:28=q[j0+=4];yr13:12=l[k7+=k9];fr15=r11*r12;fr24=r0+r16,fr26=r0-r16;;//F1+++,M4+,A3xr31:28=q[j0+=4];xr13:12=l[k7+=k9];fr7=r23*r5;fr25=r1+r17,fr27=r1-r17;;//F2+++,K2+++,M2++,A4q[j1+=4]=r25:24;fr14=r22*r5;fr19=r14+r15;;//S1,M3++,A2+.
align_code4;_BflyLastLoop:q[j2+=4]=r27:26;fr6=r30*r12;fr16=r6-r7;;//S2----,M1-,A1--yr3:0=q[j0+=4];yr5:4=l[k7+=k9];fr15=r23*r4;fr24=r8+r18,fr26=r8-r18;;//F1,M4--,A3---xr3:0=q[j0+=4];xr5:4=l[k7+=k9];fr7=r31*r13;fr25=r9+r19,fr27=r9-r19;;//F2,K2,M2-,A4---q[j1+=4]=r25:24;fr14=r30*r13;fr17=r14+r15;;//S1---,M3-,A2--q[j2+=4]=r27:26;fr6=r2*r4;fr18=r6-r7;;//S2---,M1,A1-yr11:8=q[j0+=4];yr13:12=l[k7+=k9];fr15=r31*r12;fr24=r20+r16,fr26=r20-r16;;//F1+,M4-,A3--xr11:8=q[j0+=4];xr13:12=l[k7+=k9];fr7=r3*r5;fr25=r21+r17,fr27=r21-r17;;//F2+,K2+,M2,A4--q[j1+=4]=r25:24;fr14=r2*r5;fr19=r14+r15;;//S1--,M3,A2-q[j2+=4]=r27:26;fr6=r10*r12;fr16=r6-r7;;//S2--,M1+,A1yr23:20=q[j0+=4];yr5:4=l[k7+=k9];fr15=r3*r4;fr24=r28+r18,fr26=r28-r18;;//F1++,M4,A3-xr23:20=q[j0+=4];xr5:4=l[k7+=k9];fr7=r11*r13;fr25=r29+r19,fr27=r29-r19;;//F2++,K2++,M2+,A4-q[j1+=4]=r25:24;fr14=r10*r13;fr17=r14+r15;;//S1-,M3+,A2q[j2+=4]=r27:26;fr6=r22*r4;fr18=r6-r7;;//S2-,M1++,A1+yr31:28=q[j0+=4];yr13:12=l[k7+=k9];fr15=r11*r12;fr24=r0+r16,fr26=r0-r16;;//F1+++,M4+,A3xr31:28=q[j0+=4];xr13:12=l[k7+=k9];fr7=r23*r5;fr25=r1+r17,fr27=r1-r17;;//F2+++,K2+++,M2++,A4.
align_code4;ifNLC0E,jump_BflyLastLoop;q[j1+=4]=r25:24;fr14=r22*r5;fr19=r14+r15;;//S1,M3++,A2+q[j2+=4]=r27:26;fr6=r30*r12;fr16=r6-r7;;//S2----,M1-,A1--fr15=r23*r4;fr24=r8+r18,fr26=r8-r18;;//M4--,A3---fr7=r31*r13;fr25=r9+r19,fr27=r9-r19;;//M2-,A4---q[j1+=4]=r25:24;fr14=r30*r13;fr17=r14+r15;;//S1---,M3-,A2--q[j2+=4]=r27:26;fr18=r6-r7;;//S2---,A1-fr15=r31*r12;fr24=r20+r16,fr26=r20-r16;;//M4-,A3--fr25=r21+r17,fr27=r21-r17;;//A4--q[j1+=4]=r25:24;fr19=r14+r15;;//S1--,A2-q[j2+=4]=r27:26;;//S2--fr24=r28+r18,fr26=r28-r18;;//A3-fr25=r29+r19,fr27=r29-r19;;//A4-q[j1+=4]=r25:24;;//S1-q[j2+=4]=r27:26;;//S2-j11=[j27+0x19];;//j11=COMPLEXorREAL,offthestackcomp(j11,COMPLEX);;//ComplexorReal.
align_code4;ifjeq,jump_FFTEpilogue;;//IfComplex,doneRealre-combine//j17=N/2,j7=outputk8=k31+_twiddles;j0=j31+j7;;//k8->twiddles,j0->internalbufferk9=ashiftrk9;j10=j31+j7;;//k9=twiddlestride,j10->internalbufferj14=j17+j17;;//j14=N(N/2complexvalues)j14=j14-4;;//j14=N-4real=N/2-2complexj1=j0+j14;;//j1->internalbuffer+(N/2-2)j14=j10+j14;;//j14->internalbuffer+(N/2-2)j29=ashiftrj17;;//j29=N/4k15=k31+MAX_FFT_SIZE/4;j30=ashiftrj29;;//k15=N/4*twiddle_stride,j30=N/8j30=ashiftrj30;;//N/16k8=k8+k9;r0=l[j7+j17];;//k8->twiddles+1,getG(N/4)j0=j0+2;k12=k8+k15;;//j0->internalbuffer+1,k12->twiddles+N/8+1LC0=j30;fr0=r0+r0;j2=j0+j29;;//LC0=N/16,computeF(N/4)=2*conj(G(N/4)),//j2->internalbuffer+1+N/8j3=j1-j29;;//j3->internalbuffer+3N/8-2xfr0=-r0;j10=j10+2;k10=yr0;;//j10->internalbuffer+1,k10=Im(F(N/4))j12=j10+j29;;//j12->internalbuffer+N/8+1ifLC0E;j13=j14-j29;k11=xr0;;//LC0=N/16-1,j13->internalbuffer+3N/8-2,k11=Re(F(N/4))yr3:0=DABq[j0+=4];;//PrimetheDABxr3:0=DABq[j2+=4];;//PrimetheDAByr3:0=DABq[j0+=4];;//yr0=Re(G(n)),yr1=Im(G(n)),yr2=Re(G(n+1)),yr3=Im(G(n+1))xr3:0=DABq[j2+=4];;//xr0=Re(G(n+N/8)),xr1=Im(G(n+N/8))//xr2=Re(G(n+1+N/8)),xr3=Im(G(n+1+N/8))yr7:4=q[j1+=-4];xr9:8=l[k12+=k9];;//yr4=Re(G(N/2-(n+1))),yr5=Im(G(N/2-(n+1)))//yr6=Re(G(N/2-n)),yr7=Im(G(N/2-n))//twiddles(n+N/8)-wanttomultbysin(x)-icos(x)xr7:4=q[j3+=-4];xr11:10=l[k12+=k9];;//xr4=Re(G(N/2-(n+1+N/8))),xr5=Im(G(N/2-(n+1+N/8)))//xr6=Re(G(N/2-(n+N/8))),xr7=Im(G(N/2-(n+N/8)))//twiddles(n+1+N/8)ifLC0E;fr16=r0+r6,fr20=r0-r6;yr9:8=l[k8+=k9];;//LC0=N/16-2,r16=Re(G(n)+conj(G(N/2-n))),//r20=Re(G(n)-conj(G(N/2-n)))//twiddles(n)fr18=r2+r4,fr22=r2-r4;yr11:10=l[k8+=k9];;//r18=Re(G(n+1)+conj(G(N/2-(n+1)))),//r22=Re(G(n+1)-conj(G(N/2-(n+1))))//twiddles(n+1)fr24=r20*r9;fr21=r1+r7,fr17=r1-r7;;//r24=s(n)*Re(G(n)-conj(G(N/2-n)))//r17=Im(G(n)+conj(G(N/2-n))),r21=Im(G(n)-conj(G(N/2-n)))fr26=r22*r11;fr23=r3+r5,fr19=r3-r5;xr3:0=DABq[j2+=4];;//r26=s(n+1)*Re(G(n+1)-conj(G(N/2-(n+1))))//r19=Im(G(n+1)+conj(G(N/2-(n+1)))),//r23=Im(G(n+1)-conj(G(N/2-(n+1))))//xr3:0=nextG(n+2+N/8),G(n+3+N/8)fr25=r21*r8;yr3:0=DABq[j0+=4];;//r25=c(n)*Im(G(n)-conj(G(N/2-n))),//yr3:0=nextG(n+2),G(n+3)fr27=r23*r10;;//r27=c(n+1)*Im(G(n+1)-conj(G(N/2-(n+1))))fr24=r24+r25;fr25=r21*r9;yr7:4=q[j1+=-4];;//r24=Re(-i*exp(2*pi*i*n)(G(n)-conj(G(N/2-n))))//r13=s(n)*Im(G(n)-conj(G(N/2-n)))//yr7:4=nextG(N/2-(n+2)),G(N/2-(n+3))fr26=r26+r27;fr27=r23*r11;xr7:4=q[j3+=-4];;//r26=Re(-i*exp(2*pi*i*(n+1))(G(n+1)-conj(G(N/2-(n+1)))))WritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page14of16a//r27=s(n+1)*Im(G(n+1)-conj(G(N/2-(n+1))))//xr7:4=nextG(N/2-(n+2+N/8)),G(N/2-(n+3+N/8))fr13=r20*r8;fr12=r16+r24,fr30=r16-r24;;//r13=c(n)*Re(G(n)-conj(G(N/2-n))),//r12=Re(F(n)),r30=Re(F(N/2-n))fr15=r22*r10;fr14=r18+r26,fr28=r18-r26;;//r15=c(n+1)*Re(G(n+1)-conj(G(N/2-(n+1))))//r14=Re(F(n+1)),r28=Re(F(N/2-(n+1)))fr13=r25-r13;xr9:8=l[k12+=k9];;//r13=Im(-i*exp(2*pi*i*x)(G(n)-conj(G(N/2-n)))),//nexttwiddles(n+2+N/8)fr15=r27-r15;xr11:10=l[k12+=k9];;//r15=Im(-i*exp(2*pi*i*x)(G(n+1)-conj(G(N/2-(n+1)))))//nexttwiddles(n+3+N/8).
align_code4;_combine_stage:fr16=r0+r6,fr20=r0-r6;yr9:8=l[k8+=k9];;//r16=Re(G(n+2)+conj(G(N/2-(n+2)))),//r20=Re(G(n+2)-conj(G(N/2-(n+2))))//nexttwiddles(n+2)fr18=r2+r4,fr22=r2-r4;yr11:10=l[k8+=k9];;//r18=Re(G(n+3)+conj(G(N/2-(n+3)))),//r22=Re(G(n+3)-conj(G(N/2-(n+3))))//nexttwiddles(n+3)fr13=r13+r17,fr31=r13-r17;;//r13=Im(F(n)),r31=Im(F(N/2-n))fr15=r15+r19,fr29=r15-r19;l[j12+=2]=xr13:12;;//r15=Im(F(n+1)),r29=Im(F(N/2-(n+1))),storeF(n+N/8)fr24=r20*r9;fr21=r1+r7,fr17=r1-r7;q[j14+=-4]=yr31:28;;//r24=s(n+2)*Re(G(n+2)-conj(G(N/2-(n+2))))//r21=Im(G(n+2)+conj(G(N/2-(n+2)))),//r17=Im(G(n+2)-conj(G(N/2-(n+2))))//storeF(N/2-n),F(N/2-(n+1))fr26=r22*r11;fr23=r3+r5,fr19=r3-r5;xr3:0=DABq[j2+=4];;//r26=s(n+3)*Re(G(n+3)-conj(G(N/2-(n+3))))//r23=Im(G(n+3)+conj(G(N/2-(n+3)))),//r19=Im(G(n+3)-conj(G(N/2-(n+3))))//xr3:0=nextG(n+4+N/8),G(n+5+N/8)fr25=r21*r8;yr3:0=DABq[j0+=4];;//r25=c(n+2)*Im(G(n+2)-conj(G(N/2-(n+2))))//yr3:0=nextG(n+4),G(n+5)fr27=r23*r10;q[j13+=-4]=xr31:28;;//r27=c(n+3)*Im(G(n+3)-conj(G(N/2-(n+3))))//storeF(N/2-(n+N/8)),F(N/2-(n+1+N/8))fr24=r24+r25;fr25=r21*r9;l[j10+=2]=yr13:12;;//r24=Re(-i*exp(2*pi*i*x)(G(n+2)-conj(G(N/2-(n+2)))))//r25=s(n+2)*Im(G(n+2)-conj(G(N/2-(n+2)))),storeF(n)fr26=r26+r27;fr27=r23*r11;l[j10+=2]=yr15:14;;//r26=Re(-i*exp(2*pi*i*x)(G(n+3)-conj(G(N/2-(n+3)))))//r27=s(n+3)*Im(G(n+3)-conj(G(N/2-(n+3)))),storeF(n+1)fr13=r20*r8;fr12=r16+r24,fr30=r16-r24;l[j12+=2]=xr15:14;;//r13=cos(n+2)*Re(G(n+2)-conj(G(N/2-(n+2))))//r12=Re(F(n+2)),r30=Re(F(N/2-(n+2))),storeF(n+1+N/8)fr15=r22*r10;fr14=r18+r26,fr28=r18-r26;xr7:4=q[j3+=-4];;//r15=cos(n+3)*Re(G(n+3)-conj(G(N/2-(n+3))))//r14=Re(F(n+3)),r28=Re(F(N/2-(n+3)))//xr7:4=nextG(N/2-(n+4+N/8)),G(N/2-(n+5+N/8))fr13=r25-r13;xr9:8=l[k12+=k9];yr7:4=q[j1+=-4];;//r13=Im(-i*exp(2*pi*i*x)(G(n+2)-conj(G(N/2-(n+2)))))//nexttwiddles(n+4+N/8)//yr7:4=nextG(N/2-(n+4)),G(N/2-(n+5)).
align_code4;ifNLC0E,jump_combine_stage(P);fr15=r27-r15;xr11:10=l[k12+=k9];;//r15=Im(-i*exp(2*pi*i*x)(G(n+3)-conj(G(N/2-(n+3)))))//nexttwiddles(n+5+N/8)fr16=r0+r6,fr20=r0-r6;yr9:8=l[k8+=k9];;//r16=Re(G(n+4)+conj(G(N/2-(n+4)))),//r20=Re(G(n+4)-conj(G(N/2-(n+4))))//nexttwiddles(n+4)fr18=r2+r4,fr22=r2-r4;yr11:10=l[k8+=k9];;//r18=Re(G(n+5)+conj(G(N/2-(n+5)))),//r22=Re(G(n+5)-conj(G(N/2-(n+5))))//nexttwiddles(n+5)fr13=r13+r17,fr31=r13-r17;;//r13=Im(F(n+2)),r31=Im(F(N/2-(n+2)))fr15=r15+r19,fr29=r15-r19;;//r15=Im(F(n+3)),r29=Im(F(N/2-(n+3)))fr24=r20*r9;fr21=r1+r7,fr17=r1-r7;yr1:0=l[j31+j7];;//r24=s(n+4)*Re(G(n+4)-conj(G(N/2-(n+4))))//r21=Im(G(n+4)+conj(G(N/2-(n+4)))),//r17=Im(G(n+4)-conj(G(N/2-(n+4))))//yr0=Re(G(0)),yr1=Im(G(0))fr26=r22*r11;fr23=r3+r5,fr19=r3-r5;;//r26=s(n+5)*Re(G(n+5)-conj(G(N/2-(n+5))))//r23=Im(G(n+5)+conj(G(N/2-(n+5)))),//r19=Im(G(n+5)-conj(G(N/2-(n+5))))fr25=r21*r8;;//r25=cos(x)*Im(G(n)-conj(G(N/2-n)))fr27=r23*r10;l[j12+=2]=xr13:12;;//r27=cos(x)*Im(G(n)-conj(G(N/2-n)))//storeF(n+2+N/8)yfr0=r1+r0;yr1=lshiftr1by-32;q[j14+=-4]=yr31:28;;//yr0=Re(G(0))+Im(G(0)),yr1=0=Im(F(0))//storeF(N/2-(n+2)),F(N/2-(n+3))yfr0=r0+r0;q[j13+=-4]=xr31:28;;//yr0=Re(F(0))//storeF(N/2-(n+2+N/8)),F(N/2-(n+3+N/8))fr24=r24+r25;fr25=r21*r9;l[j10+=2]=yr13:12;;//r24=Re(-i*exp(2*pi*i*x)(G(n+4)-conj(G(N/2-(n+4)))))//r25=s(n+4)*Im(G(n+4)-conj(G(N/2-(n+4)))),storeF(n+2)fr26=r26+r27;fr27=r23*r11;l[j10+=2]=yr15:14;;//r26=Re(-i*exp(2*pi*i*x)(G(n+5)-conj(G(N/2-(n+5)))))//r27=s(n+5)*Im(G(n+5)-conj(G(N/2-(n+5)))),storeF(n+3)fr13=r20*r8;fr12=r16+r24,fr30=r16-r24;l[j12+=2]=xr15:14;;//r13=c(n+4)*Re(G(n+4)-conj(G(N/2-(n+4))))//r12=Re(F(n+4)),r30=Re(F(N/2-(n+4))),storeF(n+3+N/8)fr15=r22*r10;fr14=r18+r26,fr28=r18-r26;l[j31+j7]=yr1:0;;//r15=c(n+5)*Re(G(n+5)-conj(G(N/2-(n+5))))//r14=Re(F(n+5)),r28=Re(F(N/2-(n+5)))//storeF(0)fr13=r25-r13;l[j7+j17]=k11:10;;//r13=Im(-i*exp(2*pi*i*x)(G(n+4)-conj(G(N/2-(n+4)))))//storeF(N/4)fr15=r27-r15;;//r15=Im(-i*exp(2*pi*i*x)(G(n+5)-conj(G(N/2-(n+5)))))fr13=r13+r17,fr31=r13-r17;;//r13=Im(F(n+4)),r31=Im(F(N/2-(n+4)))fr15=r15+r19,fr29=r15-r19;l[j12+=2]=xr13:12;;//r15=Im(F(n+5)),r29=Im(F(N/2-(n+5))),storeF(n+4+N/8)q[j14+=-4]=yr31:28;;//storeF(N/2-(n+4)),F(N/2-(n+5))q[j13+=-4]=xr31:28;;//storeF(N/2-(n+4+N/8)),F(N/2-(n+5+N/8))l[j10+=2]=yr13:12;;//storeF(n+4)l[j10+=2]=yr15:14;;//storeF(n+5)l[j12+=2]=xr15:14;;//storeF(n+5+N/8)Epilogue_FFTEpilogue:mPOPQ(yR27:24)mPOPQ(yR31:28)mPOPQ(xR27:24)mPOPQ(xR31:28)mRETURNWritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page15of16aEndLabelForStatisticalProfiling_FFT32.
end:Listing3.
fft32.
asmReferences[1]ADSP-TS201TigerSHARCProcessorProgrammingReference.
Revision0.
1,June2003.
AnalogDevices,Inc.
DocumentHistoryRevisionDescriptionRev2–March04,2004byBorisLernerAddedmentionoftherealstageandupdatedthecallingexamplesappropriately.
Rev1–December18,2003byBorisLernerInitialReleaseWritingEfficientFloating-PointFFTsforADSP-TS201TigerSHARCProcessors(EE-218)Page16of16
之前分享过很多次CloudCone的信息,主要是VPS主机,其实商家也提供独立服务器租用,同样在洛杉矶MC机房,分为两种线路:普通优化线路及CN2 GIA,今天来分享下商家的CN2 GIA线路独立服务器产品,提供15-100Mbps带宽,不限制流量,可购买额外的DDoS高防IP,最低每月82美元起,支持使用PayPal或者支付宝等付款方式。下面分享几款洛杉矶CN2 GIA线路独立服务器配置信息。配...
腾讯云轻量应用服务器又要免费升级配置了,之前已经免费升级过一次了(腾讯云轻量应用服务器套餐配置升级 轻量老用户专享免费升配!),这次在上次的基础上再次升级。也许这就是良心云吧,名不虚传。腾讯云怎么样?腾讯云好不好。腾讯云轻量应用服务器 Lighthouse 是一种易于使用和管理、适合承载轻量级业务负载的云服务器,能帮助个人和企业在云端快速构建网站、博客、电商、论坛等各类应用以及开发测试环境,并提供...
spinservers是Majestic Hosting Solutions,LLC旗下站点,主营美国独立服务器租用和Hybrid Dedicated等,spinservers这次提供的大硬盘、大内存服务器很多人很喜欢。TheServerStore自1994年以来,它是一家成熟的企业 IT 设备供应商,专门从事二手服务器和工作站业务,在德克萨斯州拥有40,000 平方英尺的仓库,库存中始终有数千台...
www.k8k8.com为你推荐
国家网络安全部国家网络安全嘉兴商标注册嘉兴那里有设计商标的关键字数据库:什么是关键字?陈嘉垣大家觉得陈嘉桓漂亮还是钟嘉欣漂亮?www.228gg.comwww.a8tb.com这个网站该如何改善www.mywife.ccMywife-No 00357 MANAMI SAITO种子下载地址有么?求好心人给www.se222se.comhttp://www.qqvip222.com/斗城网女追男有多易?喜欢你,可我不知道你喜不喜欢我!!平安夜希望有他陪我过haole012.com012.com网站真的可以挂Q升级吗?www.cn12365.orgwww.12365china.net是不是真的防伪网站300373一搓黑是真的吗
高防直连vps 西安电信测速 namecheap webhostingpad koss 英文简历模板word godaddy优惠券 搜狗抢票助手 申请空间 网盘申请 100m独享 息壤代理 爱奇艺vip免费领取 卡巴斯基破解版 idc查询 跟踪路由命令 测速电信 攻击服务器 广州服务器托管 服务器防御 更多