randomopendns

opendns  时间:2021-05-20  阅读:()
NEONcryptoDanielJ.
Bernstein1andPeterSchwabe21DepartmentofComputerScienceUniversityofIllinoisatChicago,Chicago,IL60607–7045,USAdjb@cr.
yp.
to2InstituteofInformationScienceAcademiaSinica,128Section2AcademiaRoad,Taipei115-29,Taiwanpeter@cryptojedi.
orgAbstract.
NEONisavectorinstructionsetincludedinalargefractionofnewARM-basedtabletsandsmartphones.
ThispapershowsthatNEONsupportshigh-securitycryptographyatsurprisinglyhighspeeds;normallydataarrivesatlowerspeeds,givingtheCPUtimetohandletasksotherthancryptography.
Inparticular,thispaperexplainshowtouseasingle800MHzCortexA8coretocom-putetheexistingNaClsuiteofhigh-securitycryptographicprimitivesatthefollowingspeeds:5.
60cyclesperbyte(1.
14Gbps)toencryptusingasharedsecretkey,2.
30cyclesperbyte(2.
78Gbps)toauthenticateusingasharedsecretkey,527102cycles(1517/second)tocomputeasharedsecretkeyforanewpublickey,650102cycles(1230/second)toverifyasignature,and368212cycles(2172/second)tosignamessage.
Thesespeedsmakenouseofsecretbranchesandnouseofsecretmemoryaddresses.
Keywords:vectorization-friendlycryptographicprimitives,ecientsoftwareimplementations,smart-phones,tablets,therebedragons1IntroductionTheAppleA4CPUusedintheiPad1(2010,1GHz)andiPhone4(2010,1GHz)containsasingleCortexA8CPUcore.
ThesameCPUcorealsoappearsinmanyothertabletsandsmartphones.
ThepointofthispaperisthattheCortexA8achievesimpressivespeedsforhigh-securitycryptography:5.
60cyclesperbytetoencryptamessageusingasharedsecretkey;2.
30cyclesperbytetoauthenticateamessageusingasharedsecretkey;527102cyclestocomputeasharedsecretkeyforanewpublickey;650102cyclestoverifyasignatureonashortmessage;and368212cyclestosignashortmessage.
Wedonotclaimthatallhigh-securitycryptographicprimitivesrunwellontheCortexA8.
Quitetheopposite:werelycriticallyonasynergybetweenthecapabilitiesofthe"NEON"vectorunitintheCortexA8andtheparallelizabilityofsomecarefullyselectedcryptographicprimitives.
TheprimitivesweuseareSalsa20[9],amemberofthenalportfoliofromtheECRYPTStreamCipherProject;Poly1305[5],apolynomial-evaluationmessage-authenticationcodesimilartothemessage-authenticationcodeinGCM;Curve25519[6],anelliptic-curveDie–Hellmansystem;andEd25519[10],anelliptic-curvesignaturesystemthatwasintroducedatCHES2011.
TherestofthispaperexplainshowweuseNEONtoobtainsuchhighspeedsfortheseprimitives.
ItisnotacoincidencethatourselectionmatchesthedefaultprimitivesinNaCl,theexisting"NetworkingandCryptographylibrary"[13]usedinapplicationssuchasDNSCrypt[45];vector-izabilitywasoneofthedesigncriteriaforNaCl.
Itisneverthelesssurprisingthatarathersmallvectorunit,carryingoutjustonearithmeticinstructionpercycle,canruntheseprimitivesatthespeedslistedabove.
Ahigh-powerIntelCore2CPUcore(at45nm,liketheAppleA4),witha64-bitinstructionsetandthreefull128-bitvectorunits,hascyclecountsof3.
98/byte,3.
32/byte,307053,365742,and106542forthesamevetaskswiththebestreportedassembly-languageThisworkwassupportedbytheNationalScienceFoundationundergrant1018836;bytheUSAirForcegrantAOARD-11-4092;andbytheCareerDevelopmentAwardofthesecondauthor'semployer.
PermanentIDofthisdocument:9b53e3cd38944dcc8baf4753eeb1c5e7.
Date:2012.
03.
20.
2DanielJ.
BernsteinandPeterSchwabeimplementationsofthesameprimitivesintheSUPERCOPbenchmarkingsuite[12];theCortexA8endsupmuchmorecompetitivethanonemightexpect.
Wealsodobetterthanthe697080CellcyclesforCurve25519achievedin[17],eventhoughtheCellhasmorepowerfulpermutationinstructionsandmanymoreregisters.
Sidechannels.
Allmemoryaddressesandbranchconditionsinoursoftwarearepublic,dependingonlyonmessagelengths.
Thereisnodataowfromsecretdata(keys,plaintext,etc.
)tocachetiming,branchtiming,etc.
Wedonotclaimthatoursoftwareisimmunetohardwareside-channelattackssuchaspoweranalysis,butwedoclaimthatitisimmunetosoftwareside-channelattackssuchas[44],[2],and[47].
Benchmarkingplatform.
Thespeedsreportedaboveweremeasuredonalow-costHerculeseCAFEnetbook(releasedandpurchasedin2011)containingaFreescalei.
MX515CPU.
ThisCPUhasasingle800MHzCortexA8core.
OccasionallywemakecomparisonstobenchmarksthatuseOpenSSLoraCcompiler;thenetbookisshippedwithUbuntu10.
04,andinparticularOpenSSL0.
9.
8kandgcc4.
4.
3,neitherofwhichweclaimisoptimal.
Allofoursoftwarehasbeencheckedagainststandardtestsuites.
Weareplacingoursoftwareonlinetomaximizeveriabilityofourresults,andareplacingitintothepublicdomaintomaximizereusability.
Someofourpreliminaryresultsarealreadyonlineandincludedinvariouspublicbenchmarkreports,butthispaperisourrstformalannouncementandachievesevenbetterspeeds.
MoreCPUswithNEON.
TheCortexA8isnottheonlyhardwaredesignsupportingtheNEONinstructionset.
TheAppleA5CPUusedintheiPad2(2011,1GHz)andiPhone4S(2011,800MHz)containstwoCortexA9CPUcoreswithNEONunits.
TheNVIDIATegra3CPUusedinthe2011AsusEeePadTransformerPrimetablet(2011,1.
3GHz)andHTCOneXsmartphone(2012,1.
5GHz)containsfourCortexA9CPUcoreswithNEONunits.
Qualcomm's"Snapdragon"seriesofCPUsreportedlyincludesadierentNEONmicroarchitecturefortheolder"Scorpion"coresandafasterNEONmicroarchitectureforthenewer"Krait"cores.
WehaveveryrecentlybenchmarkedoursoftwareonaScorpion,obtainingcyclecountsof5.
40/byte,1.
89/byte,606824,756795,and511123forthevetaskslistedabove.
WeexpectthatfurtheroptimizationforCortexA9andSnapdragonwillproduceevenbetterresults.
TherestofthispaperfocusesontheoriginalCortexA8NEONmicroarchitecture.
OneshouldnotthinkthatalltabletsandsmartphonessupportNEONinstructions.
Forexample,NVIDIAomittedNEONfromtheCortexA9coresintheTegra2;lower-costARM11processorsdonotsupportNEONandcontinuetoappearinnewdevices;andsomedevicesuseIntelprocessorswithaquitedierentinstructionset.
However,Applealonehassoldmorethan50milliontabletswithNEONandmanymoresmartphoneswithNEON,andoursamplingsuggeststhatNEONalsoappearsinthemajorityofnewtabletsandsmartphonesfromothermanufacturers.
Thispaperturnsallofthesedevicesintopowerfulcryptographicengines,capableofprotectinglargevolumesofdatawhileleavingtheCPUwithenoughtimetoactuallydosomethingusefulwiththatdata.
2NEONinstructionsandspeedsThissectionreviewsNEON'scapabilities.
Thisisnotacomprehensivereview:itfocusesonthemostimportantinstructionsforoursoftware,andthemainbottlenecksinthoseinstructions.
AllcommentsaboutspeedrefertotheNEONunitinasingleCortexA8core.
Registers.
TheNEONarchitecturehas16128-bitvectorregisters(2048bitsoverall),q0throughq15.
Italsohas3264-bitvectorregisters,d0throughd31,buttheseregisterssharephysicalspacewiththe128-bitvectorregisters:q0istheconcatenationofd0andd1,q1istheconcatenationofd2andd3,etc.
Forcomparison,thebasicARMarchitecturehasonly1632-bitregisters,r0throughr15.
Regis-terr13isthestackpointerandregisterr15istheprogramcounter,leavingonly1432-bitregisters(448bitsoverall)forgeneraluse.
OneofthemostobviousbenetsofNEONforcryptographyisNEONcrypto3thatitprovidesmuchmorespaceinregisters,reducingthenumberofloadsandstoresthatweneed.
Syntax.
WerarelylookatNEONregisternames,eventhoughwewritecodeinassembly:weuseahigher-levelassemblysyntaxthatallowsanynumberofnamesfor128-bitvectorregisters.
Forexample,wewritediag3^=b0andthenanautomatictranslatorproducestraditionalassemblylanguageveorq6,q6,q14forassemblybythestandardGNUassemblergas;herethetranslatorhasselectedq6fordiag3andq14forb0.
Weneverthelesspaycloseattentiontothenumberof"live"128-bitregistersateachmoment,reorganizingourcomputationstotreasonablylargeamountsofworkintoregisters.
Thesyntaxisourowndesign.
Tobuildthetranslatorwereusedtheexistingqhasmtoolkit[7]andwroteashortARM+NEONmachine-descriptionleforqhasm.
Thislecontains,forexample,theline4xr=s+t:>r=reg128:r,>25—butthesecondinstructionoccupiesthearithmeticunitfortwocyclesandgenerallycauseslargerlatencyproblemsthanaseparateshiftandxor.
Apairof32-bitmultiplications,eachproducinga64-bitresult,usesoneinstruction:c[0,1]=a[0]signed*b[0];c[2,3]=a[1]signed*b[1]Thisinstructionoccupiesthearithmeticunitfortwocycles,foratotalthroughputofone32*32→64-bitmultiplicationpercycle.
ThisinstructionreadsbinstageN1,readsainstageN2,andmakescavailableinstageN8.
Thisinstructionhasamultiply-accumulatevariant,carryingoutadditionsforfree:c[0,1]+=a[0]signed*b[0];c[2,3]+=a[1]signed*b[1]TheaccumulatorisnormallyreadinstageN3,butisreadmuchlaterifitistheresultofasimilarmultiplicationinstruction.
Atypicalsequencesuchasc[0,1]=a[0]unsigned*b[0];c[2,3]=a[1]unsigned*b[1]c[0,1]+=e[2]unsigned*f[2];c[2,3]+=e[3]unsigned*f[3]c[0,1]+=g[0]unsigned*h[2];c[2,3]+=g[1]unsigned*h[3]takessixcycleswithoutanystalls.
Loads,stores,andpermutations.
Thereisa128-bitNEONload/storeunitthatrunsinparallelwiththeNEONarithmeticunit.
Analigned128-bitoraligned64-bitloadorstoreconsumestheload/storeunitforonecycleandmakesitsresultavailableinN2.
Alignmentisstatic(encodedexplicitlyintheinstruction),notdynamic:NEONcrypto5x01aligned=mem128[input_1];input_1+=16Theload/storeinstructiondoesnotallowanosetfromtheindexregisterbutdoesallowsubse-quentincrementoftheindexregisterbytheloadamountorbyanotherregister.
Thereareseparateinstructionsforanunaligned128-bitorunaligned64-bitloadorstore,foranunaligned64-bitloadorstorewithanoset,andvariousotherpossibilities,eachconsumingtheload/storeunitforatleasttwocycles.
NEONincludesafewpermutationinstructionsthatconsumetheload/storeunitforonecycle:forexample,r=s[1]t[2]r[2,3]takesasinglecycletoreplacer[0]andr[1]withs[1]andt[2]respectively,leavingr[2]andr[3]unchanged.
ThisinstructionreadssandtinstageN1andwritesrinstageN3.
Therearemorepermutationinstructionsthatconsumetheload/storeunitfortwocycles.
EachNEONcycledispatchesatbestoneinstructiontothearithmeticunitandoneinstructiontotheload/storeunit.
Thesetwodispatchescanoccurineitherorder.
Forexample,asequenceof6single-cycleinstructionsoftheformALSALSALSwilltake3NEONcycles(ALS,ALS,ALS);asequenceLSAALSLSAwilltake3NEONcycles(LSA,ALS,LSA);butasequenceLSLSLSAAAwilltake5NEONcycles(LS,LS,LSA,A,A).
Ac-cycleinstructionisdispatchedinthesamewayascadjacentsingle-cycleinstructions.
Forexample,thepermutationinstructionin4xa2=diag3+diag2diag3=diag3[3]diag3[0,1,2]4xnext_a2=next_diag3+next_diag2takestwoLScycles,sooverallthissequencetakestwocycles(ALS,LSA).
Occasionalpermutationsthusdonotcostanycycles.
Asanotherexample,onecaninterleavetwo-cyclepermutationswithtwo-cyclemultiplications.
3Encryptusingasharedsecretkey:5.
60cycles/byteforSalsa20ThissectionexplainshowtoencryptdatawiththeSalsa20streamcipher[9]at5.
60CortexA8cycles/byte:e.
g.
,1.
14Gbpsonan800MHzcore.
Theinnerloopuses4.
58cycles/byteandscaleslinearlywiththenumberofcipherrounds;forexample,Salsa20/12uses2.
75cycles/bytefortheinnerloopand3.
77cycles/bytefortheentirecipher.
(Thesearelong-messagegures,buttheper-messageoverheadisreasonablysmall:forexample,a1536-bytemessagewithfullSalsa20uses5.
75cycles/byte.
)Forcomparison,[29]reportsthatanewAES-128-CTRassembly-languageimplementation,con-tributedtoOpenSSLbyPolyakov,runsat25.
4CortexA8cyclesperbyte(0.
25Gbpsat800MHz).
Thereisnoindicationthatthisspeedincludesprotectionagainstsoftwareside-channelattacks;infact,therecentpaper[47]byWei,Heinz,andStumpfdemonstratedCortexA8cache-timingleakageofatleasthalftheAESkeybitsfromOpenSSLandseveralotherAESimplementations.
AnAESimplementationalongthelinesof[23]wouldbeprotected,buttheNEONcyclesperbytewouldbefarabovethe7.
59Core2cyclesperbytereportedin[23].
TheeBASCstream-cipherbenchmarks[12]report,forCortexA8,twootherciphersprovidingcomparablelong-messagespeeds:5.
77cycles/byteforNLSv2and7.
18cycles/byteforTPy.
NLSv2iscertainlyfast,butitislimitedtoa128-bitkeyand264bitsofoutput,itreliesonS-boxlookupsthatwouldincurextracosttoprotectagainstcache-timingattacks,andingeneralitdoesnotappeartohaveaslargeasecuritymarginasSalsa20.
Weseeourresultsasshowingthatthesamespeedscanbeachievedwithhighersecurity.
TPyislesscompetitive:itreliesonrandomaccesstoalargesecretarray,requiringanexpensivesetupforeachnonce(notvisibleinthelong-messagetimings)andincurringvastlyhighercostsforprotectionagainstcache-timingattacks.
6DanielJ.
BernsteinandPeterSchwabeReviewofSalsa20;non-NEONbottlenecks.
Salsa20expandsa256-bitkeyanda64-bitnonceintoalongoutputstream,andxorsthisstreamwiththeplaintexttoproduceciphertext.
Thestreamisgeneratedin64-byteblocks.
Themainbottleneckingeneratingeachblockisaseriesof20rounds,eachconsistingof1632-bitadd-rotate-xorsequencessuchasthefollowing:s4=x0+x12x4^=(s4>>>25)Thismightalreadyseemtobeaperfecttforthebasic32-bitARMinstructionset,withouthelpfromNEON.
TheCortexA8hastwo32-bitexecutionunits;additionoccupiesoneunitforonecycle,androtate-xoroccupiesoneunitforonecycle.
Onewouldthusexpect320add-rotate-xorsequencestooccupybothintegerexecutionunitsfor320cycles,i.
e.
,5cyclesperbyte.
However,thereisalatencyof2cyclesbetweenthetwoinstructionsshownabove,andanoveralllatencyof3cyclesbetweentheavailabilityofx0andtheavailabilityofx4.
Furthermore,theARMarchitectureprovidesonly14registers,butSalsa20needsatleast17activevalues:x0throughx15togetherwithasumsuchass4.
(Onecanoverwritex0withs4,butonlyattheexpenseofextraarithmetictorestorex0afterwards.
)Loadsandstoresoccupytheexecutionunits,takingtimeawayfromarithmeticoperations.
(ARMcanmergetwoloadsofadjacentregistersintoasingleinstruction,butthisinstructionconsumesbothexecutionunitsforonecycleandtherstexecutionunitforanothercycle.
)Therearealsovariousoverheadsoutsidethe20-roundinnerloop.
CompilingseveraldierentCimplementationsofSalsa20withmanydierentcompileroptionsdidnotbeat15cyclesperbyte.
Internalparallelization;vectorization;NEONbottlenecks.
EachSalsa20roundhas4-wayparallelism,with4independentadd-rotate-xorsequencestocarryoutateachmoment.
Twoparallelcomputationshidesomelatenciesbutrequire8loadsandstoresperroundwithourbestinstructionschedule;threeorfourparallelcomputationswouldhidealllatenciesbutwouldrequireevenmoreloadsandstoresperround.
NEONhasfarmorespaceinregisters,andits128-bitarithmeticunitcanperform432-bitoperationsineachcycle.
The4operationstocarryoutateachmomentinSalsa20naturallyforma4-wayvectoroperation,atthecostofthree128-bitpermutationsperround.
Salsa20thusseemstobeanaturaltforNEON.
However,NEONrotationconsumes3operationsasdiscussedinSection2,soadd-rotate-xorconsumes5operations,atleast1.
25cycles;5add-rotate-xoroperationsperoutputbyteconsumeatleast6.
25cyclesperbyte.
Furthermore,NEONlatenciesareevenhigherthanbasicARMlatencies.
Thelowest-latencysequenceofinstructionsforadd-rotate-xoris4xa0=diag1+diag04xb0=a0>=25diag3^=b0diag3^=a0withtotallatency9tothenextaddition:theindividuallatenciesare3(N4additionoutputa0toN1shiftinput),0(butcarriedoutthenextcyclesincethearithmeticunitisbusy),2(N4shiftoutputb0toN2xorinput),2(N4xoroutputdiag3toN2xorinput),and2(N4xoroutputdiag3toN2additioninput).
AstraightforwardNEONimplementationcannotdobetterthan11.
25cyclesperbyte.
Externalparallelization.
WedobetterbytakingadvantageofanotherlevelofparallelizabilityinSalsa20:Salsa20,likeAES-CTR,generatesoutputblocksindependentlyasfunctionsofasimplecounter.
Computingtwooutputblocksinparallelwiththefollowingpatternofadd-rotate-xoroperations—0123456789101112131415161718192021NEONcrypto7—hidesalmostallNEONlatencies,reducingourinnerloopto44cyclesperroundforbothblocks,i.
e.
,880cyclesfor20roundsproducing128bytes,i.
e.
,6.
875cyclesperbyte.
Computingthreeout-putblocksinparallelstilltsintoNEONregisters(withaslightlytrickierpatternofoperations—themostobviouspatternswouldneed18registers),furtherreducingourinnerloopto6.
25cyclesperbyte,andalleviateslatencyissuesenoughtoallowtwo-instructionrotations,butasfaraswecantellthisisoutweighedbysomewhatlowereectivenessofthespeedupdiscussedinthenextsubsection.
PreviousworkonSalsa20forother128-bitvectorarchitectureshadvectorizedacrossfouroutputblocks.
However,thisneedsatleast17activevectors(andmoretohidelatencies),requiringextrainstructionsforloadsandstores,morethanthenumberofpermutationinstructionssaved.
Thiswouldalsoaddoverheadoutsidetheinnerloopandwouldinterferewiththespeedupdescribedinthenextsubsection.
InterleavingARMwithNEON.
Wedobetterthan6.
25cyclesperbytebyusingthebasicARMexecutionunitstogenerateoneblockwhileNEONgeneratestwoblocks.
Eachroundinvolves23NEONinstructionsforoneblock(20instructionsforfouradd-rotate-xorsequences,plus3permutationinstructions),23NEONinstructionsforasecondblock,and40ARMinstructionsforathirdblock.
TheextraARMinstructionsreducetheinnerloopto(2/3)6.
875≈4.
58cyclesperbyte:thecyclesfortheloopareexactlythesamebuttheloopproduces1.
5*asmuchoutput.
WearepushingthistechniqueextremelyclosetoanimportantCortexA8limit.
Thelimitisthattheentirecoredecodesatmosttwoinstructionspercycle,whethertheinstructionsareARMinstructionsorNEONinstructions.
The880cyclesthatwespendfor128NEONoutputbyteshave1760instructionslots,whileweuseonly920NEONinstructions,leaving840freeslots;weuse800oftheseslotsforARMinstructionsthatgenerate64additionaloutputbytes,andanadditional35slotsforloopcontroltoavoidexcessivecodesize.
(Registerpressureforcedustospilltheloopcounter,andeachbranchinstructionhasahiddencostof3slots;weendedupunrolling4rounds.
)PuttingevenmarginallymoreworkontheARMunitwouldslowdowntheNEONprocessing,andaneasyquantitativeanalysisshowsthatthiswouldslowdownthecipherasawhole.
ThesamelimitmakesARMinstructionsfarlesseectivefor,e.
g.
,thecomputationsmodulo225519discussedlaterinthispaper.
ThesecomputationsarelargeenoughthattheyrequiremanyNEONloadsandstoresalongsidearithmetic,oftenconsumingbothoftheinstructionslotsavailableinacycle.
TherearestillsomeslotsforARMinstructions,butthesecomputationsrequireanevenlargernumberofARMloadsandstores,leavingveryfewslotsforARMarithmeticinstructions.
Furthermore,thesecomputationsaredominatedbymultiplicationsratherthanrotations,andevenfull-speedARMmultiplicationshaveonlyafractionofthepowerofNEONmultiplications.
Minimizingoverhead.
TheabovediscussionconcentratesontheperformanceoftheSalsa20innerloop,buttherearealsooverheadsforinitializingandnalizingeachblock,readingplaintext,andgeneratingciphertext.
The64-byteSalsa20outputblockconsistsoffourvectorsx0x1x2x3,x4x5x6x7,x8x9x10x11,andx12x13x14x15thatmustbexor'edwithplaintexttoproduceciphertext.
NEONuses0.
125cycles/bytetoreadpotentiallyunalignedplaintext,and0.
125cycles/bytetowritepotentiallyunalignedciphertext,foranoverheadof0.
25cycles/byte;ARMisslower.
Itshouldbepossibletoreducethisoverhead,atsomecostincodesize,byoverlappingmemoryaccesswithcomputation,butwehavenotyetdonethis.
TheSalsa20innerloopnaturallyusesandproduces"diagonal"vectorsx0x5x10x15,x4x9x13x3,etc.
Convertingthesediagonalvectorstotheoutputvectorsx0x1x2x3etc.
posesaninterestingchallengeforNEON'spermutationinstructions.
Weusethefollowingshortsequenceofinstructions(andgratefullyacknowledgeoptimizationassistancefromTanjaLange):r0=.
.
.
#x0x5x10x15r4=.
.
.
#x4x9x14x3r12=.
.
.
#x12x1x6x11r8=.
.
.
#x8x13x2x7t4=r0[1]r4[0]t4[2,3]#x5x4--8DanielJ.
BernsteinandPeterSchwabet12=t12[0,1]r0[3]r4[2]#--x15x14r0=(abab&r0)|(~abab&r12)#x0x1x10x11t4=t4[0,1]r8[3]r12[2]#x5x4x7x6t12=r8[1]r12[0]t12[2,3]#x13x12x15x14r8=(abab&r8)|(~abab&r4)#x8x9x2x3r4=t4[1]t4[0]t4[3]t4[2]#x4x5x6x7r12=t12[1]t12[0]t12[3]t12[2]#x12x13x14x15r0r8=r0[0]r8[1]r8[0]r0[1]#x0x1x2x3x8x9x10x11Thereare7single-cyclepermutationshere,consuming0.
11cycles/byte,and2two-cyclearithmeticinstructions(usingabab)interleavedwiththepermutations.
Similarcommentsapplytoblockinitialization.
Theseandotheroverheadsincreasetheoverallencryptioncoststo5.
60cycles/byte.
4Authenticateusingasharedsecretkey:2.
30cycles/byteforPoly1305ThissectionexplainshowtocomputethePoly1305message-authenticationcode[5]at2.
30CortexA8cycles/byte:e.
g.
,2.
78Gbpsonan800MHzcore.
AuthenticatedencryptionwithSalsa20andPoly1305takesjust7.
90cycles/byte.
Forcomparison,[29]reports50CortexA8cycles/byteforAES-GCMand28.
9cycles/byteforitsproposedAES-OCB3;comparedtothe25.
4cycles/byteofAES-CTRencryption,authenticationadds25or3.
5cycles/byterespectively.
GCM,OCB3,andPoly1305guaranteethatattacksareasdicultasbreakingtheunderlyingcipher,withsimilarquantitativesecuritybounds.
Anotherap-proach,withoutthisguarantee,isHMACusingahashfunction;theCortexA8speedleadersintheeBASHhash-functionbenchmarks[12]areMD5at6.
04cycles/byte,Edon-Rat9.
76cycles/byte,Shabalat12.
94cycles/byte,BMWat13.
55cycles/byte,andSkeinat15.
26cycles/byte.
Oneoftheseauthenticationspeeds,the"free"3.
5-cycle/byteauthenticationinOCB3,iswithinafactorof2ofourPoly1305speed.
However,OCB3alsohastwoimportantdisadvantages.
First,OCB3cannotbecombinedwithafaststreamciphersuchasSalsa20—itrequiresablockcipher,asdiscussedin[29].
Second,rejectinganOCB3forgeryrequirestakingthetimetodecrypttheforgery,afull28.
9cycles/byte;Poly1305rejectsforgeriesanorderofmagnitudemorequickly.
ReviewofPoly1305.
Poly1305readsaone-time32-bytesecretkeyandamessageofanylength.
Itchopsthemessageinto128-bitlittle-endianintegers(andanalb-bitintegerwithb≤128),adds2128toeachinteger(and2btothenalinteger)toobtaincomponentsm[0],m[1]m[1],andproducesthe16-byteauthenticator(((m[0]r+m[1]r1m[1]r)mod21305)+s)mod2128whererandsarecomponentsofthesecretkey.
"Onetime"hasthesamemeaningasforaone-timepad:eachmessagehasanewkey.
Iftheseone-timekeysaretrulyrandomthentheattackerisreducedtoblindguessing;see[5]forquantitativeboundsontheattacker'sforgerychance.
Ifthesekeysareinsteadproducedascipheroutputsfromalong-termkeythensecurityreliesonthepresumeddicultyofdistinguishingthecipheroutputsfromrandom.
ReadersfamiliarwiththeGCMauthenticated-encryptionmode[32]willrecognizethatPoly1305sharesthepolynomial-evaluationstructureoftheGMACauthenticatorinsideGCM.
ThegeneralstructurewasintroducedbydenBoer[18],Johansson,Kabatianskii,andSmeets[24],andinde-pendentlyTaylor[43];concreteexamplesinclude[39],[34],[4],[28],and[27].
Buttheseproposalsdierinmanydetails,notablythechoiceofniteeld:aeldofsize2128forGCM,forexample,andintegersmodulo21305forPoly1305.
Ecientauthenticationinsoftwarereliesprimarilyonfastmultiplicationinthiseld,andsec-ondarilyonfastconversionofmessagebytesintoelementsoftheeld.
Ecientauthenticationunderaone-timekey(addressingthesecurityissuesdiscussedin[4,Section8,Notes],[8,Sections2.
4–2.
5],[21],[14],etc.
)meansthatonecannotaordtoprecomputelargetablesofmultiplesofr;NEONcrypto9wecountthecostsofallprecomputation.
Avoidingthepossibilityofcache-timingattacksmeansthatonecannotusevariable-indextablelookups;see,e.
g.
,thediscussionofGCMsecurityin[23,Section2.
3].
Multiplicationmod21305onNEON.
Werepresentanintegerfmodulo21305inradix226asf0+226f1+252f2+278f3+2104f4.
Attheendofthecomputationwereduceeachfibelow226,andreduceftotheinterval0,1,21306,butearlierinthecomputationweusestandardlazy-reductiontechniques,allowingwiderrangesoffandoffi.
ThemostattractiveNEONmultipliersarethepaired32-bitmultipliers,whichasdiscussedinSection2producetwo64-bitproductseverytwocycles,includingfreeadditions.
Theproductoff0+226f1+···andg0+226g1+···ish0+226h1+···modulo21305whereh0=f0g0+5f1g4+5f2g3+5f3g2+5f4g1,h1=f0g1+f1g0+5f2g4+5f3g3+5f4g2,h2=f0g2+f1g1+f2g0+5f3g4+5f4g3,h3=f0g3+f1g2+f2g1+f3g0+5f4g4,h4=f0g4+f1g3+f2g2+f3g1+f4g0,allofwhicharesmallerthan264/195ifeachfiandgiisboundedby226.
Evidentlysomewhatlargerinputsfiandgi,productsofsumsofinputs,sumsofseveraloutputs,etc.
donotposeanyriskof64-bitoverow.
Thiscomputation(performedfromrighttolefttoabsorballsumsintoproducts)involves25genericmultiplicationsand4multiplicationsby5,butitisbettertoeliminatethemultiplicationsby5infavorofprecomputing5g1,5g2,5g3,5g4,inpartbecausethoseare32-bitmultiplicationsandinpartbecauseamultiplicationinputisoftenreused.
Ratherthanvectorizingwithinamessageblock,andhavingtosearchfor12convenientpairsof32-bitmultiplicationsinthepatternof25multiplicationsshownabove,wesimplyvectorizeacrosstwomessageblocks,usingawell-knownparallelizationofHorner'srule.
Forexample,for=10,wecompute((((m[0]r2+m[2])r2+m[4])r2+m[6])r2+m[8])r2+((((m[1]r2+m[3])r2+m[5])r2+m[7])r2+m[9])rbystartingwiththevector(m[0],m[1]),multiplyingbythevector(r2,r2),adding(m[2],m[3]),multiplyingby(r2,r2),etc.
Theintegerm[0]isactuallyrepresentedasve32-bitwords,sothevector(m[0],m[1])isactuallyrepresentedasvevectorsof32-bitwords.
The25multiplicationsshownabove,timestwoblocks,thentriviallyuse25NEONmultiplicationinstructionscosting50cycles,i.
e.
,1.
5625cyclesperbyte.
Thereare,however,alsooverheadsforreadingthemessageandreducingtheproduct,asdiscussedbelow.
Reduction.
Theproductobtainedabovecanbesafelyaddedtoanewmessageblockbutmustbereducedbeforeitcanbeusedasinputtoanothermultiplication.
Toreducealargecoecienth0,wecarryh0→h1;thismeansreplacing(h0,h1)with(h0mod226,h1+h0/226).
Similarcommentsapplytotheothercoecients.
Carryingh4→h0meansreplacing(h4,h0)with(h4mod226,h0+5h4/226),againtakingadvantageofthesparsityof21305.
NEONuses1cycleforapairof64-bitshifts,1cycleforapairof64-bitmasks,and1cycleforapairof64-bitadditions,foratotalof3cyclesforapairofcarries(plus2cyclesforh4→h0).
Achainofsixcarriesh0→h1→h2→h3→h4→h0→h1isadequateforsubsequentmultiplications:itleavesh1below226+213andeachotherhibelow226.
However,eachstepinthischainhaslatencyatleast5,andevenaggressiveinterleavingofcarriesintothecomputationsofhiwouldeliminateonlyafewoftheresultingidlecycles.
Weinsteadcarryh0→h1andh3→h4,thenh1→h2andh4→h0,thenh2→h3andh0→h1,thenh3→h4,spending3cyclestoeliminatelatencyproblems.
Theselectionofinitialindices(0,3)hereallowsthelongercarryh4→h0tooverlaptwoindependentcarriesh1→h2→h3;weactuallyinterleaveh0→h1→h2→h3→h4withh3→h4→h0→h1,beingcarefultokeeptheseparateusesofhiawayfromeachother.
10DanielJ.
BernsteinandPeterSchwabeThisapproachconsumes23cyclesfortwoblocks,i.
e.
,0.
71875cyclesperbyte.
AsmessagelengthsgrowitbecomesbettertoretreatfromHorner'smethod,forexamplecomputing((m[0]r4+m[2]r2+m[4])r4+m[6]r2+m[8])r2+((m[1]r4+m[3]r2+m[5])r4+m[7]r2+m[9])rbystartingwith(m[0],m[1])and(m[2],m[3]),multiplyingby(r4,r4)and(r2,r2)respectively,adding,adding(m[4],m[5]),thenreducing,etc.
Thiseliminateshalfofthereductionsattheexpenseofextendingtheprecomputationfrom(r2,r2)to(r4,r4).
Onecaneasilyeliminatemorereductionswithmoreprecomputation,butonepaysforprecomputationlinearlyinbothtimeandspace,whilethebenetbecomessmallerandsmaller.
Forcomparison,[4,Section6]precomputed97powersofrforapolynomialevaluationinanothereld.
Thenumber97waschosentojustbarelyavoidoverowofsumsof97intermediatevalues;[4]didnotcountthecostofprecomputation.
Ofcourse,whenwereportlong-messageperformanceguresweblindourselvestoanyconstantamountofprecomputation,butbeyondthosegureswearealsocarefultoavoidexcessiveprecomputation(and,forsimilarreasons,excessivecodesize).
Wethussettledoneliminatinghalfofthereductions.
Readingthemessage.
Theinnerloopinourcomputation,withhalfreductionsasdescribedabove,computesfr4+m[i]r2+m[i+2].
Oneinputisanaccumulatorf;theoutputiswrittenontopoffforthenextpassthroughtheloop.
Twomoreinputsarer2andr4,bothprecomputed.
Thelasttwoinputsaremessageblocksm[i]andm[i+2];theinnerlooploadstheseblocksandconvertsthemtoradix226.
Thefollowingparagraphsdiscussthecostsofthisconversion.
Thesamecomputationsarecarriedoutinparallelonm[i+1]andm[i+3],usinganotheraccumulator.
Wesuppressfurthermentionofthisstraightforwardvectorization:forexample,whenwesaybelowthatNEONtakes0.
5cyclesfora64-bitshiftinvolvedinm[i],whatweactuallymeanisthatNEONtakes1cycleforapairof64-bitshifts,wheretherstshiftisusedform[i]andthesecondisusedform[i+1].
Loadingm[i]producesavector(m0,m1,m2,m3)representingtheintegerm0+232m1+264m2+296m3.
Ourgoalhereistorepresentthesameinteger(plus2128)inradix226asc0+226c1+252c2+278c3+2104c4.
Ashiftofthe64bits(m2,m3)downby40bitsproducesexactlyc4.
Ashiftof(m2,m3)downby14bitsdoesnotproduceexactlyc3,andashiftof(m1,m2)downby20bitsdoesnotproduceexactlyc2,butasingle64-bitmaskthenproduces(c2,c3).
Similarcommentsapplyto(c0,c1),exceptthatc0doesnotrequireashift.
Overallthereareseven64-bitarithmeticinstructionshere(fourshifts,twomasks,andoneadditiontoc4tohandlethe2128),consuming3.
5cyclesforeach16-byteblock.
Thereisalsoatwo-cycle(potentiallyunaligned)load,alongwithjustsixsingle-cyclepermutationinstructions;NEONhasanarithmeticinstructionthatcombinesa64-bitrightshift(byupto32bits)withanextractionofthebottom32bitsoftheresult,eliminatingsome64-bit-to-32-bitshuing.
Thesecondmessageblockm[i+2]hasadierentroleinfr4+m[i]r2+m[i+2]:itisaddedtotheoutputratherthantheinput.
Wetakeadvantageofthisbyloadingm[i+2]intoavector(m0,m1,m2,m3)andaddingm0+232m1+264m2+296m3intoamultiplicationresulth0+226h1+252h2+278h3+2104h4beforecarryingtheresult.
Thismeanssimplyaddingm0intoh0,adding26m1intoh1,etc.
Weabsorbtheadditionsintomultiplicationsbyschedulingm[i+2]beforethecomputationofh.
Theonlyremainingcostsform[i+2]areafewshiftssuchas26m1,oneoperationtoadd2128,andvariouspermutations.
Theconversionofm[i]andm[i+2]costs,onaverage,0.
171875cycles/byteforarithmeticinstruc-tions.
OurtotalcostforNEONarithmeticinPoly1305is2.
09375cycles/byte:1.
5625cycles/byteforonemultiplicationperblock,0.
359375cycles/byteforhalfareductionperblock,and0.
171875cycles/byteforinputconversion.
Wehavenotyetmanagedtoperfectlyscheduletheinnerloop:rightnowittakes147cyclesfor64bytes,slightlyabovethe134cyclesofarithmetic,sooursoftwarecomputesPoly1305at2.
30cycles/byte.
NEONcrypto115Computeasharedsecretkeyforanewpublickey:527102cyclesforCurve25519;signandverify:368212and650102cyclesforEd25519ThissectionexplainshowtocomputetheCurve25519Die–Hellmanfunction[6],obtaininga32-bytesharedsecretfromAlice's32-bytesecretkeyandBob's32-bytepublickey,in527102CortexA8cycles:e.
g.
,1517/secondonan800MHzcore.
ThissectionalsoexplainshowtosignandverifymessagesintheEd25519public-keysignaturesystem[10]in,respectively,368212and650102CortexA8cycles:e.
g.
,2172/secondand1230/secondonan800MHzcore.
Ed25519publickeysare32bytes,andsignaturesare64bytes.
Forcomparison,opensslspeedonthesamemachinereports424.
2RSA-2048vericationspersecond(1.
9millioncycles),11.
1RSA-2048signaturespersecond(72millioncycles),88.
6NISTP-256Die–Hellmanoperationspersecond(9.
0millioncycles),388.
8NISTP-256signaturespersecond(2.
1millioncycles),and74.
5NISTP-256vericationspersecond(10.
7millioncycles).
Morozov,Tergino,andSchaumont[33]reporttwospeedsfor"secp224r1"Die–Hellman:15609microsecondsona500MHzCortexA8(7.
8millioncycles),and6043microsecondsona360MHzDSP(2millionDSPcycles)includedinthesameCPU,aTIOMAP3530.
Curve25519andEd25519haveahighersecuritylevelthansecp224r1and2048-bitRSA;itisalsonotclearwhichofthepreviousspeedsincludeprotectionagainstside-channelattacks.
ReviewofCurve25519andEd25519.
Curve25519andEd25519areelliptic-curvesystems.
Keygenerationisxed-base-pointsingle-scalarmultiplication:Bob'spublickeyisamultipleB=bPofastandardbasepointPonastandardcurve.
Bob'ssecretkeyistheintegerb.
Curve25519'sDie–Hellmanfunctionisvariable-base-pointsingle-scalarmultiplication:Alice,givenBob'spublickeyB,computesaBwhereaisAlice'ssecretkey.
ThesecretsharedbyAliceandBobissimplyahashofaB;thissecretisused,forexample,asalong-termkeyforSalsa20,whichinturnisusedtogenerateencryptionpadsandPoly1305authenticationkeys.
SigninginEd25519consistsprimarilyofxed-base-pointsingle-scalarmultiplication.
(Wemakethestandardassumptionthatmessagesareshort;hashingtimeisthebottleneckforverylongmessages.
Ourmeasurementsuse59-bytemessages,asin[12].
)SigningismuchfasterthanDie–Hellman:itexploitsprecomputedmultiplesofPinvariousstandardways.
VericationinEd25519isslowerthanDie–Hellman:itconsistsprimarilyofdouble-scalarmultiplication.
TheCurve25519ellipticcurveistheMontgomerycurvey2=x3+486662x2+xmodulo225519,withauniquepointoforder2.
TheEd25519ellipticcurveisthetwistedEdwardscurvex2+y2=1(121665/121666)x2y2modulo225519,alsowithauniquepointoforder2.
Thesetwocurveshavean"ecientbirationalequivalence"andthereforehavethesamesecurity.
Montgomerycurvesarewellknowntoallowecientvariable-base-pointsingle-scalarmultiplica-tion.
Edwardscurvesarewellknowntoallowawidervarietyofecientelliptic-curveoperations,includingdouble-scalarmultiplication.
Thesefastscalar-multiplicationmethodsare"complete":theyaresequencesofadditions,multiplications,etc.
thatalwaysproducetherightanswer,withnoneedforcomparisons,branches,etc.
CompletenesswasprovenbyBernstein[6]forsingle-scalarmultiplicationonanyMontgomerycurvehavingauniquepointoforder2,andbyBernsteinandLange[11]forarbitrarygroupoperationsonanyEdwardscurvehavingauniquepointoforder2.
ThemainloopinCurve25519,executed255times,hasfouradditionsofintegersmodulo225519,foursubtractions,twoconditionalswaps(whichmustbecomputedwitharithmeticratherthanbranchesorvariablearraylookups),foursquarings,onemultiplicationbytheconstant121666,andvegenericmultiplications.
Thereisalsoasmallernalloop(aeldinversion),consistingof254squaringsand11multiplications.
SimilarcommentsapplytoEd25519signingandEd25519verication.
12DanielJ.
BernsteinandPeterSchwabeMultiplicationmod225519onNEON.
Weuseradix225.
5,imitatingtheoating-pointrep-resentationin[6,Section4]butwithunscaledintegersratherthanscaledoating-pointnumbers:werepresentanintegerfmodulo225519asf0+226f1+251f2+277f3+2102f4+2128f5+2153f6+2179f7+2204f8+2230f9where,asinSection4,theallowablerangesoffivarythroughthecomputation.
Weusesignedintegersfiratherthanunsignedintegers:forexample,whenwecarryf0→f1wereducef0totherange[225,225]ratherthan[0,226].
Thiscomplicatescarries,replacingamaskwithashiftandsubtraction,butsavesonebitinproductsofreducedcoecients,allowingustosafelycomputevariousproductsofsumswithoutcarryingthesums.
Thiswasunnecessaryintheprevioussection,inpartbecausethe5in21305issmallerthanthe19in225519,inpartbecause130issmallerthan255,andinpartbecausethesumsofinputsandoutputsnaturallyappearingintheprevioussectionhavefewertermsthanthesumsthatappearintheseelliptic-curvecomputations.
Theproductoff0+226f1+251f2+···andg0+226g1+251g2+···ish0+226h1+251h2+···modulo225519whereh0=f0g0+38f1g9+19f2g8+38f3g7+19f4g6+38f5g5+19f6g4+38f7g3+19f8g2+38f9g1h1=f0g1+f1g0+19f2g9+19f3g8+19f4g7+19f5g6+19f6g5+19f7g4+19f8g3+19f9g2h2=f0g2+2f1g1+f2g0+38f3g9+19f4g8+38f5g7+19f6g6+38f7g5+19f8g4+38f9g3h3=f0g3+f1g2+f2g1+f3g0+19f4g9+19f5g8+19f6g7+19f7g6+19f8g5+19f9g4h4=f0g4+2f1g3+f2g2+2f3g1+f4g0+38f5g9+19f6g8+38f7g7+19f8g6+38f9g5h5=f0g5+f1g4+f2g3+f3g2+f4g1+f5g0+19f6g9+19f7g8+19f8g7+19f9g6h6=f0g6+2f1g5+f2g4+2f3g3+f4g2+2f5g1+f6g0+38f7g9+19f8g8+38f9g7h7=f0g7+f1g6+f2g5+f3g4+f4g3+f5g2+f6g1+f7g0+19f8g9+19f9g8h8=f0g8+2f1g7+f2g6+2f3g5+f4g4+2f5g3+f6g2+2f7g1+f8g0+38f9g9h9=f0g9+f1g8+f2g7+f3g6+f4g5+f5g4+f6g3+f7g2+f8g1+f9g0.
Theextrafactorsof2appearbecause225.
5isnotaninteger.
Weprecompute2f1,2f3,2f5,2f7,2f9and19g1,19g2,19g9;eachhiisthenasumoftenproductsofprecomputedquantities.
Mostmultiplicationsappearasindependentpairs,computingfgandfginparallel,intheelliptic-curveformulasweuse.
Wevectorizeacrossthesemultiplications:westartfrom2064-bitvectorssuchas(f0,f0)and(g0,g0),precompute1464-bitvectorssuchas(2f1,2f1)and(19g1,19g1),andthenaccumulate10128-bitvectorssuchas(h0,h0).
Byschedulingoperationscarefullywetthese5464-bitquantitiesintothe32available64-bitregisterswithamoderatenumberofloadsandstores.
Somemultiplicationsdonotappearaspairs.
Forthosecaseswevectorizewithinonemultipli-cationbythefollowingstrategy.
Accumulatethevectors(f0g0,2f1g1)and(19f2g8,38f3g9)and(19f4g6,38f5g7)and(19f6g4,38f7g5)and(19f8g2,38f9g3)into(h0,h2);accumulate(f0g2,2f1g3)etc.
into(h2,h4);andsoonthrough(h8,h0).
Alsoaccumulate(f1g2,19f8g3),(f3g0,f0g1),etc.
into(h3,h1);accumulate(f1g4,19f8g5)etc.
into(h5,h3);andsoonthrough(h1,h9).
Eachvectoraddedhereisaproductoftwoofthefollowing27precomputedvectors:(f0,2f1),(f2,2f3),(f4,2f5),(f6,2f7),(f8,2f9);(f1,f8),(f3,f0),(f5,f2),(f7,f4),(f9,f6);(g0,g1),(g2,g3),(g4,g5),(g6,g7);(g0,19g1),(g2,19g3),(g4,19g5),(g6,19g7),(g8,19g9);(19g2,19g3),(19g4,19g5),(19g6,19g7),(19g8,19g9);(19g2,g3),(19g4,g5),(19g6,g7),(19g8,g9).
Wetriedseveralotherstrategies,pairinginputsandoutputsinvariousways,beforesettlingonthisstrategy.
Alloftheotherstrategiesusedmoreprecomputedvectors,requiringmoreloadsandstores.
NEONcrypto13Reduction,squaring,etc.
ReductionfollowsananalogousstrategytoSection4.
Onecompli-cationisthateachcarryhasanextraoperation,asmentionedabove.
Anothercomplicationforvectorizingasinglemultiplicationisthattheshiftdistancesaresometimes26bitsandsometimes25bits;wevectorizecarrying(h0,h4)→(h1,h5),forexample,butwouldnothavebeenabletovectorizecarrying(h0,h5)→(h1,h6).
Forsquaring,likemultiplication,wevectorizeacrosstwoindependentoperationswhenpossible,andotherwisevectorizewithinoneoperation.
Squaringsareserializedinsquare-rootcomputations(fordecompressingshortsignatures)andininversions(forconvertingscalar-multiplicationresultstoanecoordinates),butthecriticalbottlenecksareelliptic-curveoperations,andsquaringscomeinconvenientpairsinalloftheelliptic-curveformulasthatweuse.
Intheendarithmeticconsumes150cyclesingenericmultiplication(called1286timesinCurve25519),105cyclesinsquaring(called1274times),67cyclesinmultiplicationby121666(called255times),3cyclesinaddition(called1020times),3cyclesinsubtraction(called1020times),and12cyclesinconditionalswaps(called512times),explainingfewerthan400000cycles.
ThemostimportantsourceofoverheadinourcurrentCurve25519performance,527102cycles,isnon-arithmeticinstructionsatthebeginningandendofeachfunction.
Weareworkingonaddress-ingthisbyinliningallfunctionsintothemainloopandschedulingthemainloopasawhole,andweanticipatethencomingmuchclosertothelowerbound,asinSalsa20andPoly1305.
SimilarcommentsapplytoEd25519.
ManyEd25519cycles(about50000cyclesinsigningand25000inverication)areconsumedbytheSHA-512implementationselectedbySUPERCOP[12].
WeseeampleroomforimprovingtheSHA-512implementationontheCortexA8buthavenotbothereddoingso:theEd25519paper[10]recommendsswitchingtoEd25519-SHA-3.
References[1]—(noeditor),9thIEEEsymposiumonapplicationspecicprocessors,InstituteofElectricalandElectronicsEngineers,2011.
See[33].
[2]OnurAciicmez,BillyBobBrumley,PhilippGrabher,Newresultsoninstructioncacheattacks,inCHES2010[31](2010),110–124.
Citationsinthisdocument:§1.
[3]ARMLimited,Cortex-A8technicalreferencemanual,revisionr3p2,2010.
URL:http://infocenter.
arm.
com/help/index.
jsptopic=/com.
arm.
doc.
ddi0344k/index.
html.
Citationsinthisdocument:§2.
[4]DanielJ.
Bernstein,Floating-pointarithmeticandmessageauthentication(1999).
URL:http://cr.
yp.
to/papers.
html#hash127.
Citationsinthisdocument:§4,§4,§4,§4.
[5]DanielJ.
Bernstein,ThePoly1305-AESmessage-authenticationcode,inFSE2005[20](2005),32–49.
URL:http://cr.
yp.
to/papers.
html#poly1305.
Citationsinthisdocument:§1,§4,§4.
[6]DanielJ.
Bernstein,Curve25519:newDie-Hellmanspeedrecords,inPKC2006[48](2006),207–228.
URL:http://cr.
yp.
to/papers.
html#curve25519.
Citationsinthisdocument:§1,§5,§5,§5.
[7]DanielJ.
Bernstein,qhasmsoftwarepackage(2007).
URL:http://cr.
yp.
to/qhasm.
html.
Citationsinthisdocument:§2.
[8]DanielJ.
Bernstein,Polynomialevaluationandmessageauthentication(2007).
URL:http://cr.
yp.
to/papers.
html#pema.
Citationsinthisdocument:§4.
[9]DanielJ.
Bernstein,TheSalsa20familyofstreamciphers,in[37](2008),84–97.
URL:http://cr.
yp.
to/papers.
html#salsafamily.
Citationsinthisdocument:§1,§3.
[10]DanielJ.
Bernstein,NielsDuif,TanjaLange,PeterSchwabe,Bo-YinYang,High-speedhigh-securitysignatures,inCHES2011[36](2011).
URL:http://eprint.
iacr.
org/2011/368.
Citationsinthisdocument:§1,§5,§5.
[11]DanielJ.
Bernstein,TanjaLange,Fasteradditionanddoublingonellipticcurves,inAsiacrypt2007[30](2007),29–50.
URL:http://eprint.
iacr.
org/2007/286.
Citationsinthisdocument:§5.
[12]DanielJ.
Bernstein,TanjaLange(editors),eBACS:ECRYPTBenchmarkingofCryptographicSystems,ac-cessed5March2012(2012).
URL:http://bench.
cr.
yp.
to.
Citationsinthisdocument:§1,§3,§4,§5,§5.
[13]DanielJ.
Bernstein,TanjaLange,PeterSchwabe,Thesecurityimpactofanewcryptographiclibrary(2011).
URL:http://eprint.
iacr.
org/2011/646.
Citationsinthisdocument:§1.
[14]JohnBlack,MartinCochran,MACreforgeability,inFSE2009[19](2009),345–362.
URL:http://eprint.
iacr.
org/2006/095.
Citationsinthisdocument:§4.
[15]AnneCanteaut,KapaleeViswanathan(editors),Progressincryptology—INDOCRYPT2004,5thinternationalconferenceoncryptologyinIndia,Chennai,India,December20–22,2004,proceedings,LectureNotesinCom-puterScience,3348,Springer,2004.
ISBN3-540-24130-2.
See[32].
[16]ChristopheClavier,KrisGaj(editors),Cryptographichardwareandembeddedsystems—CHES2009,11thinternationalworkshop,Lausanne,Switzerland,September6–9,2009,proceedings,LectureNotesinComputerScience,5747,Springer,2009.
ISBN978-3-642-04137-2.
See[23].
14DanielJ.
BernsteinandPeterSchwabe[17]NeilCostigan,PeterSchwabe,Fastelliptic-curvecryptographyontheCellBroadbandEngine,inAfricacrypt2009[35](2009),368–385.
URL:http://cryptojedi.
org/users/peter/#celldh.
Citationsinthisdocument:§1.
[18]BertdenBoer,Asimpleandkey-economicalunconditionalauthenticationscheme,JournalofComputerSecurity2(1993),65–71.
ISSN0926–227X.
Citationsinthisdocument:§4.
[19]OrrDunkelman(editor),Fastsoftwareencryption,16thinternationalworkshop,FSE2009,Leuven,Belgium,February22–25,2009,revisedselectedpapers,LectureNotesinComputerScience,5665,Springer,2009.
ISBN978-3-642-03316-2.
See[14].
[20]HenriGilbert,HelenaHandschuh(editors),Fastsoftwareencryption:12thinternationalworkshop,FSE2005,Paris,France,February21–23,2005,revisedselectedpapers,LectureNotesinComputerScience,3557,Springer,2005.
ISBN3-540-26541-4.
See[5].
[21]HelenaHandschuh,BartPreneel,Key-recoveryattacksonuniversalhashfunctionbasedMACalgorithms,inCRYPTO2008[46](2008),144–161.
Citationsinthisdocument:§4.
[22]TorHelleseth(editor),Advancesincryptology—EUROCRYPT'93,workshoponthetheoryandapplicationofcryptographictechniques,Lofthus,Norway,May23–27,1993,proceedings,LectureNotesinComputerScience,765,Springer,1994.
ISBN3-540-57600-2.
See[24].
[23]EmiliaK¨asper,PeterSchwabe,Fasterandtiming-attackresistantAES-GCM,inCHES2009[16](2009),1–17.
URL:http://eprint.
iacr.
org/2009/129.
Citationsinthisdocument:§3,§3,§4.
[24]ThomasJohansson,GregoryKabatianskii,BenJ.
M.
Smeets,OntherelationbetweenA-codesandcodescorrectingindependenterrors,inEUROCRYPT'93[22](1994),1–11.
Citationsinthisdocument:§4.
[25]AntoineJoux(editor),Fastsoftwareencryption—18thinternationalworkshop,FSE2011,Lyngby,Denmark,February13–16,2011,revisedselectedpapers,LectureNotesinComputerScience,6733,Springer,2011.
ISBN978-3-642-21701-2.
See[29].
[26]NealKoblitz(editor),Advancesincryptology—CRYPTO'96,LectureNotesinComputerScience,1109,Springer,1996.
See[39].
[27]TadayoshiKohno,JohnViega,DougWhiting,CWC:ahigh-performanceconventionalauthenticatedencryptionmode,inFSE2004[38](2004),408–426.
Citationsinthisdocument:§4.
[28]TedKrovetz,PhillipRogaway,Fastuniversalhashingwithsmallkeysandnopreprocessing:thePolyRcon-struction(2000).
URL:http://www.
cs.
ucdavis.
edu/~rogaway/papers/poly.
htm.
Citationsinthisdocument:§4.
[29]TedKrovetz,PhilipRogaway,Thesoftwareperformanceofauthenticated-encryptionmodes,inFSE2011[25](2011),306–327.
URL:http://www.
cs.
ucdavis.
edu/~rogaway/papers/ae.
pdf.
Citationsinthisdocument:§3,§4,§4.
[30]KaoruKurosawa(editor),Advancesincryptology—ASIACRYPT2007,13thinternationalconferenceonthetheoryandapplicationofcryptologyandinformationsecurity,Kuching,Malaysia,December2–6,2007,pro-ceedings,LectureNotesinComputerScience,4833,Springer,2007.
ISBN978-3-540-76899-9.
See[11].
[31]StefanMangard,Francois-XavierStandaert(editors),Cryptographichardwareandembeddedsystems,CHES2010,12thinternationalworkshop,SantaBarbara,CA,USA,August17–20,2010,proceedings,LectureNotesinComputerScience,6225,Springer,2010.
ISBN978-3-642-15030-2.
See[2].
[32]DavidA.
McGrew,JohnViega,ThesecurityandperformanceoftheGalois/Countermode(GCM)ofopera-tion,inINDOCRYPT2004[15](2004),343–355.
URL:http://eprint.
iacr.
org/2004/193.
Citationsinthisdocument:§4.
[33]SergeyMorozov,ChristianTergino,PatrickSchaumont,SystemintegrationofellipticcurvecryptographyonanOMAPPlatform,in[1](2011),52–57.
URL:http://rijndael.
ece.
vt.
edu/schaum/papers/2011sasp.
pdf.
Citationsinthisdocument:§5.
[34]WimNevelsteen,BartPreneel,Softwareperformanceofuniversalhashfunctions,inEUROCRYPT'99[41](1999),24–41.
Citationsinthisdocument:§4.
[35]BartPreneel(editor),Progressincryptology—AFRICACRYPT2009,secondinternationalconferenceoncryp-tologyinAfrica,Gammarth,Tunisia,June21–25,2009,proceedings,LectureNotesinComputerScience,5580,Springer,2009.
See[17].
[36]BartPreneel,TsuyoshiTakagi(editors),Cryptographichardwareandembeddedsystems—CHES2011,13thinternationalworkshop,Nara,Japan,September28–October1,2011,proceedings,LectureNotesinComputerScience,Springer,2011.
ISBN978-3-642-23950-2.
See[10].
[37]MatthewRobshaw,OlivierBillet(editors),Newstreamcipherdesigns,LectureNotesinComputerScience,4986,Springer,2008.
ISBN978-3-540-68350-6.
See[9].
[38]BimalK.
Roy,WilliMeier(editors),Fastsoftwareencryption,11thinternationalworkshop,FSE2004,Delhi,India,February5–7,2004,revisedpapers,LectureNotesinComputerScience,3017,Springer,2004.
ISBN3-540-22171-9.
See[27].
[39]VictorShoup,Onfastandprovablysecuremessageauthenticationbasedonuniversalhashing,inCRYPTO'96[26](1996),313–328.
URL:http://www.
shoup.
net/papers.
Citationsinthisdocument:§4.
[40]EtienneSobole,CalculateurdecyclepourleCortexA8(2012).
URL:http://pulsar.
webshaker.
net/ccc/index.
php.
Citationsinthisdocument:§2.
[41]JacquesStern(editor),Advancesincryptology—EUROCRYPT'99,LectureNotesinComputerScience,1592,Springer,1999.
ISBN3-540-65889-0.
MR2000i:94001.
See[34].
[42]DouglasR.
Stinson(editor),Advancesincryptology—CRYPTO'93:13thannualinternationalcryptologycon-ference,SantaBarbara,California,USA,August22–26,1993,proceedings,LectureNotesinComputerScience,773,Springer,1994.
ISBN3-540-57766-1,0-387-57766-1.
See[43].
NEONcrypto15[43]RichardTaylor,Anintegritycheckvaluealgorithmforstreamciphers,inCRYPTO'93[42](1994),40–48.
Citationsinthisdocument:§4.
[44]EranTromer,DagArneOsvik,AdiShamir,EcientcacheattacksonAES,andcountermeasures,JournalofCryptology23(2010),37–71.
URL:http://people.
csail.
mit.
edu/tromer/papers/cache-joc-official.
pdf.
Citationsinthisdocument:§1.
[45]DavidUlevitch,DNSCrypt—critical,fundamental,andabouttime(2011).
URL:http://blog.
opendns.
com/2011/12/06/dnscrypt-%E2%80%93-critical-fundamental-and-about-time/.
Citationsinthisdocument:§1.
[46]DavidWagner(editor),Advancesincryptology—CRYPTO2008,28thannualinternationalcryptologyconfer-ence,SantaBarbara,CA,USA,August17–21,2008,proceedings,LectureNotesinComputerScience,5157,Springer,2008.
ISBN978-3-540-85173-8.
See[21].
[47]MichaelWei,BenediktHeinz,FredericStumpf,AcachetimingattackonAESinvirtualizationenvironments,ProceedingsofFinancialCryptography2012,toappear(2012).
Citationsinthisdocument:§1,§3.
[48]MotiYung,YevgeniyDodis,AggelosKiayias,TalMalkin(editors),Publickeycryptography—9thinternationalconferenceontheoryandpracticeinpublic-keycryptography,NewYork,NY,USA,April24–26,2006,pro-ceedings,LectureNotesinComputerScience,3958,Springer,2006.
ISBN978-3-540-33851-2.
See[6].

Hosteons:洛杉矶/纽约/达拉斯免费升级10Gbps端口,KVM年付21美元起

今年1月的时候Hosteons开始提供1Gbps端口KVM架构VPS,目前商家在LET发布消息,到本月30日之前,用户下单洛杉矶/纽约/达拉斯三个地区机房KVM主机可以从1Gbps免费升级到10Gbps端口,最低年付仅21美元起。Hosteons是一家成立于2018年的国外VPS主机商,主要提供VPS、Hybrid Dedicated Servers及独立服务器租用等,提供IPv4+IPv6,支持...

virmach:AMD平台小鸡,赌一把,单车变摩托?$7.2/年-512M内存/1核/10gSSD/1T流量,多机房可选

virmach送来了夏季促销,价格低到爆炸,而且在低价的基础上还搞首年8折,也就是说VPS低至7.2美元/年。不过,这里有一点要说明:你所购买的当前的VPS将会在09/30/2021 ~ 04/30/2022进行服务器转移,而且IP还会改变,当前的Intel平台会换成AMD平台,机房也会变动(目前来看以后会从colocrossing切换到INAP和Psychz),采取的是就近原则,原来的水牛城可能...

阿里云年中活动最后一周 - ECS共享型N4 2G1M年付59元

以前我们在参与到云服务商促销活动的时候周期基本是一周时间,而如今我们会看到无论是云服务商还是电商活动基本上周期都要有超过一个月,所以我们有一些网友习惯在活动结束之前看看商家是不是有最后的促销活动吸引力的,比如有看到阿里云年中活动最后一周,如果我们有需要云服务器的可以看看。在前面的文章中(阿里云新人福利选择共享性N4云服务器年79.86元且送2月数据库),(LAOZUO.ORG)有提到阿里云今年的云...

opendns为你推荐
请务必阅读正文之后的免责条款部分http://www.huajinsc.cn/万家增强收益债券型证券投资基金Intentsandroid支持ipad三星iphone特斯拉苹果5重庆网通重庆联通宽带tracerouteTRACEROUTE的作用是什么重庆宽带测速重庆电信测速我的网速溢出
重庆服务器租用 日本私人vps 仿牌空间 优key 美国仿牌空间 php免费空间 坐公交投2700元 免费测手机号 无限流量 512mb 免费外链相册 丽萨 个人免费邮箱 日本代理ip 北京主机托管 googlevoice zcloud winserver2008 reboot 更多