Šrndić and Laskov, EURASIP Journal on Information Security (2016) 2016:22
DOI 10.1186/s13635-016-0045-0

RESEARCH (Open Access)

Hidost: a static machine-learning-based detector of malicious files

Nedim Šrndić¹* and Pavel Laskov²

Abstract

Malicious software, i.e., malware, has been a persistent threat in the information security landscape since the early days of personal computing. The recent targeted attacks extensively use non-executable malware as a stealthy attack vector. There exists a substantial body of previous work on the detection of non-executable malware, including static, dynamic, and combined methods. While static methods perform orders of magnitude faster, their applicability has been hitherto limited to specific file formats. This paper introduces Hidost, the first static machine-learning-based malware detection system designed to operate on multiple file formats. Extending a previously published, highly effective method, it combines the logical structure of files with their content for even better detection accuracy. Our system has been implemented and evaluated on two formats, PDF and SWF (Flash). Thanks to its modular design and general feature set, it is extensible to other formats whose logical structure is organized as a hierarchy. Evaluated in realistic experiments on timestamped datasets comprising 440,000 PDF and 40,000 SWF files collected during several months, Hidost outperformed all antivirus engines deployed by the website VirusTotal to detect the highest number of malicious PDF files and ranked among the best on SWF malware.
Keywords: Machine learning, Security, Malware detection, File formats, PDF, SWF

*Correspondence: nedim.srndic@uni-tuebingen.de
¹Cognitive Systems, Department of Computer Science, University of Tübingen, Sand 1, 72076, Tübingen, Germany
Full list of author information is available at the end of the article

1 Introduction

One of the most effective tools for breaking into computer systems remains malicious software, i.e., malware. While being a well-known plague since the dawn of personal computing, malware has developed several insidious traits in the recent decade to serve the needs of criminal business. One of them is the infection of files in well-known formats used to exchange documents between businesses and individuals. Such infection offers the following benefits to attackers:

1. It is easier to lure users into opening documents than into launching executable programs.
2. A steady stream of new vulnerabilities has been observed in the recent years in document viewers due to their high complexity caused, in turn, by the complexity of document formats.
3. Flexibility and versatility of document formats offer ample opportunities for obfuscation of embedded malicious content.

The same features also hinder the identification of malicious documents and increase the computational burden on the detection tools.
The favorite formats used by attackers are PDF (targeting Adobe Reader), Flash (targeting Adobe Flash Player), and Microsoft Office files [1, 2]. In 2012, the pioneering exploit kit Blackhole targeted Java, PDF, and Flash files, and its successors have continued this practice [3]. In 2013, the non-executable malware delivered through the web was dominated by PDF and Flash files targeting Adobe Reader and Microsoft Office applications [2]. Flash has seen wide deployment recently for malicious advertising, i.e., placement of malware on legitimate websites by means of advertising networks. Even some of the most prominent websites have fallen victims to such attacks [3]. Although prevalently used for redirection to sites serving exploit kits, it is not uncommon for Flash files to target Flash Player directly.
© 2016 The Author(s). Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Non-executable files are especially popular as a means for targeted attacks. Recent years have brought a range of high-profile targeted attacks against governments and industry, and they are getting more common and ever stealthier. The Miniduke targeted attack campaign against European government agencies used sophisticated PDF files exploiting an Adobe Reader zero-day vulnerability. Four different zero-day vulnerabilities in Microsoft Office were used in the Elderwood attack against the defense industry. The group APT1 or Comment Crew used 0-day vulnerabilities in Adobe Reader and Microsoft Office against government and industry targets [4]. Among the recorded 24 0-days discovered in 2014, 16 targeted Adobe Reader and Flash Player (cf. Fig. 1), while Microsoft Word files dominated the list of file types used for targeted attacks [1, 5]. In the first 9 months of 2015, 8 out of the top 10 vulnerabilities leveraged by exploit kits were reported to be Flash Player vulnerabilities [6].
The main difficulty in detecting malicious non-executable files is the necessity to understand complex formats. While such difficulty is marginalized in the methods based on dynamic analysis, i.e., rendering a file in an instrumented sandbox, such methods are in general rather slow. Static analysis methods, known for their high performance, usually deploy ad hoc, format-specific detection techniques which do not generalize across the formats.

To alleviate this problem, we propose a new static analysis method with the potential of being more portable across formats. Our experiments demonstrate that, with the incorporation of an appropriate format parser, it can be applied to both PDF and Flash files. Before presenting the main features of the proposed method, we review the related work.
Fig. 1 CVEs with the phrases "Adobe Reader" or "Flash Player" in their description (number of assigned CVEs per year, 2002 to 2014, for Adobe Reader and Flash Player)

1.1 Related work

Early work on PDF malware detection focused on n-gram analysis [7, 8] of PDF files on disk. However, PDF is a complex file format [9]. PDF files, especially malicious ones, routinely employ obfuscation in the form of compression, the use of different encodings and even encryption. Thus, only full-fledged PDF parsers can properly deobfuscate them. The n-gram approach is in this regard overly simplistic. The first method utilizing a PDF parser was PJScan [10]. It employed anomaly detection based on lexical properties of JavaScript code embedded inside PDF files. However, it could not handle JavaScript code loaded at runtime or malware that does not use JavaScript in the first place. Two simple learning-based methods were proposed subsequently, Malware Slayer [11] and PDFrate [12], both utilizing heuristic features based on raw bytes of PDF files.

All methods presented so far are commonly referred to as static because they do not perform execution or emulation of any part of a PDF file. They can be divided into deep and shallow methods, depending on whether their parsing of PDF files conforms to the PDF Standard [9] (deep) or not (shallow). PJScan is the only deep method presented so far. A common vulnerability of shallow methods is the relative ease of falsification of the PDF physical structure, demonstrated on the example of PDFrate [13]. A common shortcoming of all pure static methods is their inability to detect dynamically loaded threats, e.g., when the analyzed file does not contain attack code but instead loads it over the network or from another file.
Along with the described static methods, dynamic approaches were developed to leverage the additional information obtained by observing the effects of opening a PDF file at runtime. Not reliant on examining the PDF file at all, dynamic methods are immune to PDF obfuscation and physical structure falsification. Early approaches were based on software emulation [14, 15]. However, software emulation was shown to be susceptible to evasion and computationally intensive. Other popular dynamic approaches include Wepawet [16], based on the sandbox JSand [17], and MalOffice, based on CWSandbox [18]. Snow et al. proposed to employ hardware virtualization and evaluated their system ShellOS [19] on PDF malware. Tang et al. used anomaly detection on low-level hardware features [20]. While the dynamic approaches tend to be more accurate than the static ones, their execution time renders them inadequate for detecting malicious documents on busy networks in real time. Furthermore, building and maintaining a dynamic detector capable of emulating every version of a vulnerable software product in combination with every version of each of its supported operating systems and libraries is a costly and technically challenging task. On the other hand, it suffices to omit one combination of target software from the detector and a threat designed for that specific version will go undetected.
As an attempt to achieve the speed of static approaches with the accuracy of dynamic ones, combined static and dynamic methods were subsequently developed. MDScan performs static JavaScript extraction and dynamic code execution [21], but the complexity of emulation of the PDF JavaScript API with undocumented features prohibits a complete and error-free solution. MPScan, on the other hand, hooks into Adobe Reader for flawless JavaScript extraction and deobfuscation but performs static exploit detection [22]. Due to its design, it is suitable for malware detection only on a single version of Adobe Reader, and its dynamic component takes seconds to run.

In contrast to fully automated methods presented so far, Nissim et al. propose to use an active learning approach, where a human expert manually labels interesting samples for a machine learning algorithm, with the goal of keeping the detector up-to-date with the newest threats [23]. They outline a design with a combination of signature detection and multiple methods described so far but leave its implementation and evaluation for future work. For a more detailed survey of many of the mentioned PDF malware detection methods, we refer the reader to [23].
Compared to PDF, research on detecting Flash malware has been scarce, with only two methods proposed in the recent years. The OdoSwiff system from 2009 used a heuristics-based approach on features obtained with both static and dynamic analysis [24]. It was succeeded in 2012 by FlashDetect, which upgraded its detection from ActionScript 2 to ActionScript 3 exploits and replaced its threshold-based approach with a Naive Bayes classifier [25]. Both methods are based on an empirical approach, striving to encode the knowledge of domain experts, i.e., malware analysts, about existing ways of SWF exploitation. These expert features perform very well. For example, FlashDetect's machine learning classifier was evaluated using a training dataset comprising only 47 samples of each class, but even this small sample size was enough to achieve a high detection accuracy. However, as the authors point out, some employed heuristics-based features are not robust against committed evaders. Furthermore, embedded malware may detect the employed dynamic execution environment based on its difference to Adobe Flash Player and conceal its behavior as a reaction. In contrast, the method proposed herein uses a data-driven approach instead of expert features, and its detection is based on the structural differences between benign and malicious SWF files. By remaining exploit-agnostic, it remains open to novel attacks; its static approach leads to faster execution and is not vulnerable to run-time evasion.
1.2 Contributions

The proposed detection method is based on the analysis of hierarchical document structure and is henceforth abbreviated as Hidost. It is an extension of previous work published by Šrndić and Laskov in [26], herein referred to as SL2013. The novelty introduced in SL2013 was the use of logical structure for characterization of malicious and benign PDF files. PDF logical structure is a high-level construct defined by the PDF Standard that organizes basic PDF building blocks into a functional document. Results published in [26] show that properties of malicious files such as the presence of JavaScript and minimal use of benign content can be accurately determined from their logical structure. As a deep static method, SL2013 is less affected by PDF obfuscation and physical structure falsification that plague shallow methods. Evaluated on a real-world dataset comprising 660,000 PDF files, SL2013 has demonstrated a combination of detection performance and throughput that remains unrivaled among antivirus engines and published scientific work. Nevertheless, in a realistic sliding window experiment on timestamped data, the detection performance of SL2013 was shown to be inconsistent. Its feature definition created a blind spot exploitable by evaders, and its oversized feature set created difficulties for more memory-intensive machine learning classifiers than the employed support vector machine.
Hidost inherits all the advantages of SL2013. It maintains the nearly perfect detection performance and high throughput on PDF files that tailored SL2013 for centralized deployment on busy networks. As a further advantage of a deep static approach, Hidost is immune to PDF obfuscation and physical structure falsification.

Hidost furthermore addresses certain shortcomings of SL2013 we have discovered later. In particular, we developed structural path consolidation (SPC), a technique used to merge similar features. Such consolidated features better preserve the semantics of logical structure and reduce the dependency of the feature set on the specific dataset. The benefits of SPC are threefold: (a) the attack surface for evasion is reduced; (b) changes in feature set over time are limited; and (c) the number of features is drastically reduced. Together, these improvements render Hidost much more secure and practical than SL2013.

Most importantly, however, this paper introduces a novel system design for Hidost that enables its generalization to multiple unrelated file formats. To the best of our knowledge, Hidost is the first static machine-learning-based malware detection system applicable to multiple file formats. Its generality was achieved by extending the feature definition based on the PDF logical structure to a second file format with a hierarchical logical structure, Flash's SWF format. Finally, taking a step further, Hidost not only considers the logical structure of the file but its content as well, enabling a higher degree of precision on formats with less discriminative structure such as SWF.
To demonstrate the excellent detection performance of Hidost, we experimentally evaluate it for two formats: PDF and SWF. Our evaluation protocol is intended to model the practical deployment of a data-driven detection method and to account for a natural evolution of malicious data. In our protocol, a detection model is trained on a fixed-size window of data and is deployed for a limited time period. Once the model is deemed to be too old, it is re-trained on another window of more recent data and again evaluated for a limited time period. Unlike the classical cross-validation methods common in evaluation of machine learning algorithms, our experimental protocol accounts for a temporal nature of data in security applications and never predicts the past data from the future one.
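The sliding-window protocol described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `sliding_windows`, the unit of one "period", and the window lengths are hypothetical and are not taken from the paper, whose actual settings are described in Section 4.

```python
from typing import Iterator, Tuple


def sliding_windows(n_periods: int, train_len: int, test_len: int
                    ) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges over time-ordered periods.

    Each test range starts exactly where its training range ends, so a
    model is never evaluated on data older than its training data.
    """
    start = 0
    while start + train_len + test_len <= n_periods:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        # The model is "too old" after test_len periods: slide and retrain.
        start += test_len


# Hypothetical setting: 8 periods, train on 4, deploy for 2.
splits = list(sliding_windows(n_periods=8, train_len=4, test_len=2))
for train, test in splits:
    print(list(train), "->", list(test))
```

In contrast to k-fold cross-validation, no split produced this way contains training data that is newer than its test data.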
In summary, the main contributions of this paper are as follows:

• A static machine-learning-based malware detector, Hidost, the first such system applicable to different file formats based on their logical structure and content.
• An experimental evaluation of Hidost on two formats, PDF and SWF, designed to reflect the operational environment of a malware detector, performed on a dataset of 440,000 PDF files and an unprecedented 40,000 SWF files. In our evaluation, Hidost outperformed all antivirus engines at VirusTotal on PDF and ranked among the best on SWF files.
• A prototype implementation of Hidost for two file formats, PDF and SWF, released as open source software.
• Source code required to reproduce this work, including experiments and plots, released as open source software.
• Datasets required to reproduce this work.
1.3 Outline

The structure of this paper is as follows. File formats that Hidost is applicable to, i.e., hierarchically structured file formats, are described in Section 2, along with a detailed introduction to logical structures of PDF and SWF. Hidost's system design is described in Section 3, which covers extraction of structural elements from PDF and SWF formats, feature definition, selection, and compaction as well as learning and classification. The experimental evaluation, including the description of datasets and experimental protocols, as well as a discussion of results, is presented in Section 4. We discuss Hidost's extension to other file formats and present a conceptual design for its application to office file formats OOXML and ODF in Section 5. Finally, Section 6 presents conclusions and outlines open questions for future work.
2 Hierarchically structured file formats

File formats are developed as a means to store a physical representation of certain information. Some formats, e.g., text files, do not have any logical structure, but others, e.g., HTML, do. HTML files are a physical representation of logical relationships between HTML elements. As the example in Fig. 2 shows, in an HTML file, a p element might be a descendant of the body element, which in turn has the html element as its parent. HTML elements have a logical structure in the form of a hierarchy.

Work presented in this paper is concerned with the detection of malware in hierarchically structured file formats. The physical layout of the file format, which can substantially deviate from its logical layout, is irrelevant for the operation of the proposed method. Examples of hierarchically structured file formats include:

• Portable Document Format (PDF)
• SWF File Format (SWF)
• Extensible Markup Language (XML)
• Hypertext Markup Language (HTML)
• Open Document Format (ODF), an XML-based format for office documents
• Office Open XML (OOXML), a different XML-based format for office documents
• Scalable Vector Graphics (SVG), an XML-based format for vector graphics

In the following, we describe the hierarchical logical structure of the two file formats implemented in Hidost, PDF and SWF.
2.1 Portable Document Format (PDF)

This section was copied (with adaptation) from [26], copyrighted by the Internet Society.

Portable Document Format (PDF) is an open standard published as ISO 32000-1:2008 [9]. The syntax of PDF comprises these four main elements:

• Objects. These are the basic building blocks in PDF.
• File structure. It specifies how objects are laid out and modified in a PDF file.
• Document structure. It determines how objects are logically organized to represent the contents of a PDF file (text, graphics, etc.).
• Content streams. They provide a means for efficient storage of various parts of the file content.

Fig. 2 A sample HTML file
There are nine basic object types in PDF. Simple object types are Boolean, Numeric, String, and Null. PDF strings have bounded length and are enclosed in parentheses "(" and ")". The type Name is used as an identifier in the description of the PDF document structure. Names are introduced using the character "/" and can contain arbitrary characters except null (0x00). The aforementioned five object types will be referred to as primitive types in this paper. An Array is a one-dimensional ordered collection of PDF objects enclosed in square brackets, "[" and "]". Arrays may contain PDF objects of different type, including nested arrays. A Dictionary is an unordered set of key-value pairs enclosed between the symbols "<<" and ">>". The keys must be name objects and must be unique within a dictionary. The values may be of any PDF object type, including nested dictionaries. A Stream object is a PDF dictionary followed by a sequence of bytes. The bytes represent information that may be compressed or encrypted, and the associated dictionary contains information on whether and how to decode the bytes. These bytes usually contain content to be rendered but may also contain a set of other objects. Finally, an Indirect object is any of the previously defined objects supplied with a unique object identifier and enclosed in the keywords obj and endobj. Due to their unique identifiers, indirect objects can be referenced from other objects via indirect references.
The syntax of PDF objects is illustrated in a simplified exemplary PDF file shown in Fig. 3. It contains four indirect objects denoted by their two-part object identifiers, e.g., 1 0 for the first object, and the obj and endobj keywords. These objects are dictionaries, as they are surrounded with the symbols "<<" and ">>". The first one is the Catalog dictionary, denoted by its Type entry which contains a PDF name with the value Catalog. The Catalog has two additional dictionary entries: Pages and OpenAction. OpenAction is an example of a nested dictionary. It has two entries: S, a PDF name indicating that this is a JavaScript action dictionary, and JS, a PDF string containing the actual JavaScript script to be executed: alert('Hello!');. Pages is an indirect reference to the object with the object identifier 3 0: the Pages dictionary that immediately follows the Catalog. It has an integer, Count, indicating that there are two pages in the document, and an array Kids identifiable by the square brackets, with two references to Page objects. The same object types are used to build the remaining Page objects. Notice that each of the Page objects contains a backward reference to the Pages object in their Parent entry. Altogether, there are three references pointing to the same indirect object, 3 0, the Pages object.

Fig. 3 Raw content of an example PDF file. Formatted for easier reading. Details omitted for brevity. Primitive data types and references are colored green
The relations between various basic objects constitute the logical, tree-like document structure of a PDF file. The nodes in the document structure are objects themselves, and the edges correspond to the names under which child objects reside in a parent object. For arrays, the parent-child relationship is nameless and corresponds to an integer index of individual elements. Notice that the document structure is, strictly speaking, not a tree but rather a directed rooted cyclic graph, as indirect references may point to other objects anywhere in the document structure. This graph can be reduced to a proper tree, called a structural tree, as will be elaborated in Section 3.4, and we will therefore limit ourselves to considering the PDF document structure in its simplified, tree form, as illustrated in Fig. 4.
The root node in the document structure is a special PDF dictionary with the mandatory Type entry containing the name Catalog. Any object of a primitive type constitutes a leaf, i.e., terminal node, in the document structure.

We define a path in the PDF structural tree as a sequence of edges starting in the Catalog dictionary and ending with an object of a primitive type. For example, in Fig. 4 there is a path from the root, i.e., leftmost, node through the edges named /Pages and /Count to the terminal node with the value 2. This definition of a path in the PDF document structure, which we denote a PDF structural path, plays a central role in our approach. We print paths as a sequence of all edge labels encountered during path traversal starting from the root node and ending in the leaf node. The path from our earlier example would be printed as /Pages/Count.

Fig. 4 Structural tree of the PDF file depicted in Fig. 3. Dictionaries are illustrated using the symbol "<<" and arrays using square brackets. Cycles were omitted for simplicity

The following list shows exemplary structural paths from real-world benign PDF files:

/Metadata
/Type
/Pages/Kids
/OpenAction/Contents
/StructTreeRoot/RoleMap
/Pages/Kids/Contents/Length
/OpenAction/D/Resources/ProcSet
/OpenAction/D
/Pages/Count
/PageLayout

Our investigation shows that these are the structural paths whose presence in a file is most indicative that the file is benign or, alternatively, whose absence indicates that a file is malicious. For example, malicious files are not likely to contain metadata in order to minimize file size, they do not jump to a page in the document when it is opened and are not well-formed so they are missing paths such as /Type and /Pages/Count.

The following is a list of structural paths from real-world malicious PDF files:

/AcroForm/XFA
/Names/JavaScript
/Names/EmbeddedFiles
/Names/JavaScript/Names
/Pages/Kids/Type
/StructTreeRoot
/OpenAction/Type
/OpenAction/S
/OpenAction/JS
/OpenAction

We see that malicious files tend to execute JavaScript stored within multiple different locations upon opening the document and make use of Adobe XML Forms Architecture (XFA) forms as malicious code can also be launched from there.
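As a minimal sketch of the path definition above, the following code enumerates structural paths of a document that has already been reduced to tree form. The nested-dict model, the variable `catalog`, and the function `structural_paths` are illustrative assumptions and not part of the published Hidost toolset; the tree only loosely mirrors the example file of Fig. 3.

```python
from typing import Any, Iterator

# Hypothetical tree form of a small PDF: dicts stand for PDF dictionaries,
# lists for PDF arrays, and everything else for primitive objects (leaves).
catalog = {
    "Type": "Catalog",
    "Pages": {
        "Type": "Pages",
        "Count": 2,
        "Kids": [{"Type": "Page"}, {"Type": "Page"}],
    },
    "OpenAction": {"S": "JavaScript", "JS": "alert('Hello!');"},
}


def structural_paths(node: Any, prefix: str = "") -> Iterator[str]:
    """Yield the printed form of every root-to-leaf path below `node`."""
    if isinstance(node, dict):
        for name, child in node.items():
            yield from structural_paths(child, f"{prefix}/{name}")
    elif isinstance(node, list):
        # Array edges are nameless, so they do not extend the printed path.
        for child in node:
            yield from structural_paths(child, prefix)
    else:
        yield prefix  # primitive object: the path ends here


paths = sorted(set(structural_paths(catalog)))
print("\n".join(paths))
```

Both Page objects contribute the same printed path /Pages/Kids/Type, which is why the result is collected into a set here.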
2.2 SWF file format

SWF File Format (SWF, pronounced swiff) is a proprietary binary file format; its specification is published in [27]. SWF files consist of a header and a sequence of tags, i.e., data structures with values for predefined fields. There are 65 different types of tags specified, each defining its own set of fields with different names and data types. Some of the basic SWF data types are [27]:

• 8-, 16-, 32-, and 64-bit integers, both signed and unsigned, arrays of these types and integers with a variable number of bytes
• Fixed- and floating-point numbers of different widths and precisions
• Integer and fixed-point numbers with widths that are not powers of 2
• Strings
• Data structures such as 24- and 32-bit color records, rectangle records, 2D transformation matrices, etc.
Figure 5 shows a very small SWF file used for illustrative purposes. Clearly, the physical layout of SWF is too obscure for direct interpretation. Instead, our description of the SWF logical structure is based on the decoded, human-readable depiction of the same file, illustrated in Fig. 6. This textual description of the original SWF file was produced using the ConsoleDumper class from the SWFRETools toolkit [28], an open-source Java toolkit for reverse-engineering SWF files. The illustration skips the file header as it is not used in our method. It shows 5 SWF tags separated by dotted lines: two SetBackgroundColor tags at bytes 0x14 and 0x1B, two ShowFrame tags at bytes 0x19 and 0x20 and an End tag at byte 0x22.

Fig. 5 Hexadecimal view of a toy SWF file. Left column contains hexadecimal addresses of first bytes of every row

Tags of a SWF file are laid out sequentially. Every tag has a header with an unsigned 16-bit little-endian TagCodeAndLength field that comprises a 10-bit tag type identifier and a 6-bit tag length field, with an optional wider length field for tags longer than 62 bytes.

Fig. 6 SWF file depicted in Fig. 5, decoded. Values are colored green. Every line starts with a hexadecimal number between square brackets, denoting the offset, in bytes, of the tag field specified in the given line from the beginning of the SWF file
The first tag in this file is used to set the background color of the display. It is a simple tag, defining three unsigned 1-byte values of the red (0xAA = 170), green (0xBB = 187), and blue (0xCC = 204) color components. The second tag makes the content of the canvas render on screen for the duration of one frame. Following this, the background color is set to #112233 and the screen is refreshed one more time. The last tag signals the end of the file.
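The tag layout described above can be made concrete with a short parsing sketch. The byte string `TAGS` is an assumption: it is reconstructed from the textual description of the five tags (file header omitted) rather than copied from Fig. 5, and `parse_tags` is an illustrative helper, not the SWFRETools API. The three tag codes used (0, 1, 9) are those assigned by the SWF specification.

```python
import struct
from typing import List, Tuple

# Assumed tag bytes of the toy file, rebuilt from the description above.
TAGS = bytes.fromhex("4302AABBCC" "4000" "4302112233" "4000" "0000")

# Only the tag codes needed for this example.
TAG_NAMES = {0: "End", 1: "ShowFrame", 9: "SetBackgroundColor"}


def parse_tags(data: bytes) -> List[Tuple[str, bytes]]:
    """Split a sequence of SWF tags into (tag name, tag body) pairs."""
    tags, pos = [], 0
    while pos < len(data):
        # 16-bit little-endian field: upper 10 bits type, lower 6 bits length.
        (code_and_length,) = struct.unpack_from("<H", data, pos)
        pos += 2
        code, length = code_and_length >> 6, code_and_length & 0x3F
        if length == 0x3F:
            # Long header: the real length follows in a 32-bit field.
            (length,) = struct.unpack_from("<I", data, pos)
            pos += 4
        tags.append((TAG_NAMES.get(code, f"Tag{code}"), data[pos:pos + length]))
        pos += length
        if code == 0:  # End tag terminates the file
            break
    return tags


tags = parse_tags(TAGS)
for name, body in tags:
    print(name, list(body))
```

The 6-bit length field can express at most 62 bytes directly, because the value 0x3F is reserved to announce the wider length field.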
Figure 7 illustrates a logical view of our example SWF file in which the file is structured as a tree. It closely follows the layout presented in the decoded SWF file of Fig. 6. Every tag is represented by a tree node and is a direct descendant of the abstract root node. The edge from the root to the tag node is labeled by the tag type name, in our case SetBackgroundColor, ShowFrame, and End. Descendants of tag nodes are its header and fields. Headers are connected with an edge simply labeled Header. The edges leading to the fields are labeled by their names, e.g., BackgroundColor. The values of tags' fields are represented as leafs, e.g., the value of the red field of the first SetBackgroundColor tag, 170.

Fig. 7 Logical structure of the SWF file illustrated in Fig. 5

We define a path in the SWF structural tree, analogous to the path in the PDF structural tree, as a series of edges starting in the abstract root node and ending in a leaf node. For example, there is a path from the root node through the edges labeled End, Header, and TagAndLength ending in the leaf node with the value 0. For better readability and consistency with PDF, we prepend the forward slash symbol "/" to every edge label when printing a path; hence, the path in question prints as /End/Header/TagAndLength.

In the following section, we describe how the logical structures of PDF and SWF files are processed for use by learning algorithms and describe the system design of Hidost.
3 System design

Hidost has been designed as a malware detection system capable of learning to discriminate between malicious and benign files based on their logical structure. Due to the semantic heterogeneity of various file formats, it is hard to imagine a single format to act as a "common denominator" for all conceivable hierarchically structured file formats. Yet, our design clearly separates format-specific processing steps from the detection methodology. As a result, our method, currently tested on PDF and SWF formats, can be extended to other formats by implementing the format-specific components without rebuilding its general framework. The proposed method was implemented as a research prototype, and its feature extraction subsystem was published as open source software [29]. The published code comprises a toolset for feature extraction from PDF (implemented in C++) and SWF files (implemented in Python and Java). Experiment reproduction code is published separately, as described in Section 4.
The system design of Hidost is illustrated in Fig. 8. There are six main stages in Hidost: structure extraction, structural path consolidation, feature selection, vectorization, learning, and classification. Structure extraction transforms the structural features of specific formats into a common data structure, the structural multimap, representing paths in the structural hierarchy. Structural path consolidation is intended to transform structural paths into a more general form, removing artifacts. Feature selection is concerned with finding the minimum set of features required for a successful machine learning application. Vectorization transforms structural multimaps into numeric vectors processed by machine learning methods. Learning generates a discriminative model of malicious and benign files based on their properties encoded in feature vectors. Finally, classification makes a decision whether a previously unseen sample is malicious or benign based on the learned model. In the following subsections, the main stages of our approach are presented in detail.

Fig. 8 Hidost system design (stages shown: structure extraction, structural path consolidation, feature selection by thresholding, vectorization, learning, and classification)

3.1 File structure extraction

The first step of our method transforms files into a more abstract representation, their logical structure. This step is essential to our approach because it achieves two key goals: (a) use of logical structure for discrimination between malicious and benign files and (b) applicability to multiple file formats.
A suitable representation for the logical structure of hierarchically structured file formats is a structural multimap. A multimap is a generalization of the common map data structure, also known as a dictionary or associative array. While maps provide a mapping between a key and a corresponding value, multimaps map a key to a set of values. A structural multimap is a multimap that maps every structural path of a structural tree to the set of all leafs that lie on the given path. In map terminology, the structural paths represent the keys and sets of all leafs that a path maps to represent the values of the map. An example structural multimap is illustrated in Fig. 9. Multiple values for the same key are delimited using vertical bar symbols "|". Hidost uses a simplified form of structural multimaps presented later in this section.

The nature of the logical file structure necessitates a multimap instead of a map because multiple leafs may be reachable by a single structural path. With PDFs, this occurs when a path contains an array with more than one element, e.g., the path /Pages/Kids/MediaBox contains two arrays and reaches eight leafs. In case of SWF files, apart from arrays, multiple tags of the same type cause multiple leafs to lie in the same path.
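The complete structural multimap described above can be sketched in a few lines. The nested-dict tree, the MediaBox values, and the function `structural_multimap` are hypothetical and serve only to illustrate why one path can reach several leafs; this is the complete form of the multimap, not the simplified form that Hidost itself uses.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical tree: two pages, each with a four-element MediaBox array.
tree = {
    "Type": "Catalog",
    "Pages": {
        "Count": 2,
        "Kids": [
            {"Type": "Page", "MediaBox": [0, 0, 612, 792]},
            {"Type": "Page", "MediaBox": [0, 0, 612, 792]},
        ],
    },
}


def structural_multimap(root: Any) -> Dict[str, List[Any]]:
    """Map every structural path to the list of leafs reachable through it."""
    multimap: Dict[str, List[Any]] = defaultdict(list)

    def walk(node: Any, path: str) -> None:
        if isinstance(node, dict):
            for name, child in node.items():
                walk(child, f"{path}/{name}")
        elif isinstance(node, list):
            for child in node:  # nameless array edges keep the path unchanged
                walk(child, path)
        else:
            multimap[path].append(node)

    walk(root, "")
    return dict(multimap)


multimap = structural_multimap(tree)
for path, leafs in multimap.items():
    print(path, "|".join(str(leaf) for leaf in leafs))
```

A list rather than a Python set is used for the values so that distinct leafs carrying equal values, such as the two identical MediaBox arrays, are all retained.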
Fig. 9 A complete structural multimap of the PDF file depicted in Fig. 3. This type of structural multimap is not used in Hidost but rather illustrated as an instructive example

Implementation of structure extraction for PDF and SWF is presented in the following two sections.

3.1.1 PDF

The PDF logical structure is organized as a directed rooted cyclic graph. To transform it into a structural multimap, it is first necessary to reduce the graph to a directed rooted tree instead. This is achieved by removing all cycles from the graph.
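There is a cycle in the PDF logical structure when an indirect reference from a tree node at depth d_R points to a tree node lying on the same path at a depth d_T […]

Table 1 PDF structural path consolidation rules (rules 5 to 19)

5. Search regex: Names/(Dests|AP|JavaScript|Pages|Templates|IDS|URLS|EmbeddedFiles|AlternatePresentations|Renditions)/(Kids/|Parent/)*Names
   Substitute regex: Names/\1/Names
6. Search regex: StructTreeRoot/IDTree/(Kids/)*Names
   Substitute regex: StructTreeRoot/IDTree/Names
7. Search regex: (StructTreeRoot/ParentTree|PageLabels)/(Kids/|Parent/)+(Nums|Limits)
   Substitute regex: \1/\3
8. Search regex: StructTreeRoot/ParentTree/Nums/(K/|P/)+
   Substitute regex: StructTreeRoot/ParentTree/Nums/
9. Search regex: (StructTreeRoot|Outlines/SE)/(RoleMap|ClassMap)/[^/]+
   Substitute regex: \1/\2/Name
10. Search regex: (StructTreeRoot|Outlines/SE)/(K/|P/)*
    Substitute regex: \1/
11. Search regex: (Extensions|Dests)/[^/]+
    Substitute regex: \1/Name
12. Search regex: Font/([^/]+)/CharProcs/[^/]+
    Substitute regex: Font/\1/CharProcs/Name
13. Search regex: (AcroForm/(Fields/|C0/)DR/)(ExtGState|ColorSpace|Pattern|Shading|XObject|Font|Properties)/[^/]+
    Substitute regex: \1\3/Name
14. Search regex: /AP/(D|N)/[^/]+
    Substitute regex: /AP/\1/Name
15. Search regex: Threads/F/(V/|N/)*
    Substitute regex: Threads/F
16. Search regex: (StructTreeRoot|Outlines/SE)/Info/[^/]+
    Substitute regex: \1/Info/Name
17. Search regex: ColorSpace/([^/]+)/Colorants/[^/]+
    Substitute regex: ColorSpace/\1/Colorants/Name
18. Search regex: ColorSpace/Colorants/[^/]+
    Substitute regex: ColorSpace/Colorants/Name
19. Search regex: Collection/Schema/[^/]+
    Substitute regex: Collection/Schema/Name

[…] and convert linked lists to shallow sets (4) in order to create a generic, unified view of their elements, all on the same level. We refer the reader to the PDF Standard [9] for a detailed explanation of these branches of the PDF document structure.

Due to the relatively shallow SWF logical file structure and the barring of user-defined path components, only two SPC rules were compiled for this format, listed in Table 2, both for handling repetitive subpaths.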
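Applying consolidation rules of this kind amounts to a sequence of regular expression substitutions. The sketch below uses the two SWF rules from Table 2 verbatim; the function `consolidate`, the order of application, and the example paths are assumptions made for illustration and may differ from the published implementation.

```python
import re
from typing import List, Pattern, Tuple

# The two SWF rules of Table 2 as (search regex, substitute) pairs.
SWF_RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"(DefineSprite/ControlTags/){2,}"), "DefineSprite/ControlTags/"),
    (re.compile(r"(Symbol/Name/){2,}"), "Symbol/Name/"),
]


def consolidate(path: str, rules: List[Tuple[Pattern[str], str]] = SWF_RULES) -> str:
    """Apply every rule in turn and return the consolidated path."""
    for pattern, substitute in rules:
        path = pattern.sub(substitute, path)
    return path


# Hypothetical paths with sprites nested two and three levels deep.
deep = "/DefineSprite/ControlTags/DefineSprite/ControlTags/ShowFrame/Header"
deeper = "/DefineSprite/ControlTags/" * 3
deeper = deeper.replace("//", "/") + "ShowFrame/Header"

print(consolidate(deep))
print(consolidate(deeper))
```

Both inputs collapse to the same consolidated path, so arbitrarily deep nesting no longer produces a previously unseen feature.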
No attempt was made to compile a complete list of SPC rules. Especially for PDF, there is ample opportunity for further anonymization and flattening of hierarchies such as name trees and number trees not covered in our rules. In general, to extend the list, it is advised to read the PDF Standard looking for places in the PDF document structure where user-defined names are allowed or where there is a well-defined list or hierarchy.

Table 2 SWF structural path consolidation rules

1. Search regex: (DefineSprite/ControlTags/){2,}
   Substitute regex: DefineSprite/ControlTags/
2. Search regex: (Symbol/Name/){2,}
   Substitute regex: Symbol/Name/

However, even this limited set of rules provides the following crucial benefits compared to SL2013:

• Reduced attack surface. Without SPC, every distinct path with an occurrence count above a threshold constitutes a feature. An attacker striving to evade detection may in that case perform a hiding attack by concealing a malicious payload at a custom path different from any in the feature set. For example, a path to a font with a long, randomly generated name is highly unlikely to have been encountered before. A malicious payload inserted there would be invisible to the detector that does not have this particular path in its feature set. In case of SWF, where user-defined paths are disallowed, payloads may be concealed in very deep hierarchies, not encountered in "normal" files. PDF also suffers from this vulnerability. Hiding attacks are cheap to implement and the primary weakness of SL2013. This avenue for evasion is closed in Hidost with the use of consolidated paths.
Limited feature set drift in time. In real-world machine learning applications, the problem at hand often changes in time. This is especially true in security applications, where defenders are forced to adapt to unpredictable changes in attacks. This problem is known in machine learning as concept drift [32] and has recently started to attract interest in security literature [33]. The continual change in data renders classifiers ever more outdated as time progresses since their training. Therefore, the need arises for regular updates to the learning model, i.e., periodic classifier retraining. With data-dependent features such as in this work, it is advisable to perform feature selection anew before every retraining in order to better adapt to concept drift. Periodic feature selection causes the obsolescence of existing features and the addition of new ones between two retraining periods. We refer to changes in the feature set caused by periodic feature selection as feature set drift. PDF is more susceptible to feature set drift than SWF due to the flexibility of its structural path definition. As Section 4.3.3 shows, SPC is effectively used to reduce feature set drift in Hidost.
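Feature set drift between two consecutive feature selection events can be quantified with plain set operations. The sketch below is one possible formulation. It assumes that all three fractions are normalized by the size of the previous feature set; the function name and the dictionary keys are illustrative.

```python
def feature_set_drift(previous, current):
    """Return the fractions of new, obsolete, and unchanged features.

    All fractions are relative to the size of the previous feature set
    (an assumption of this sketch).
    """
    previous, current = set(previous), set(current)
    if not previous:
        raise ValueError("previous feature set must not be empty")
    n = len(previous)
    return {
        "new": len(current - previous) / n,
        "obsolete": len(previous - current) / n,
        "unchanged": len(previous & current) / n,
    }


# Toy example with four features per retraining period.
drift = feature_set_drift({"a", "b", "c", "d"}, {"b", "c", "d", "e"})
print(drift)
```

With this normalization, the obsolete and unchanged fractions always sum to one, while the fraction of new features is unbounded.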
Feature space dimensionality reduction. Finally, SPC has a tremendous impact on the total number of features. Feature space dimensionality directly affects the running time and memory requirements of learning algorithms. In our PDF experiments with periodic retraining, the average feature set size was reduced by an impressive 88%, from 10,412.5 to 1,237.4 features per training.

However, there are limits to the effectiveness of SPC against manually crafted paths. Because it has no notion of a semantically valid path, SPC cannot handle unforeseen cases, e.g., arbitrary names in the Catalog dictionary. To tackle this final "blind spot" in the coverage of the PDF logical structure, a whitelisting approach would be required with its complete and up-to-date representation (a model of the entire structure), which is out of scope of this work.

The reduced attack surface and limited feature set drift represent an important contribution to the operational security of Hidost as a machine-learning-based detector. The massively reduced feature count enables its application on even bigger datasets. Together, the described improvements bring Hidost a big step towards applicability in a real-world, operational environment as an accurate, reliable, and secure malicious file detector.
3.3 Feature selection

Despite the reduction of syntactic polymorphism via structural path consolidation, there may still exist paths that occur very infrequently in the observed data. Using such paths for building discriminative models increases the dimensionality of the input space without improving classification accuracy. Therefore, feature selection, as is common in other machine learning applications, has to be carried out to limit the impact of rare features. Before presenting the specific feature selection techniques, we discuss the reasons why rare features occur in the two formats studied in detail in the paper.

The SWF file format specification [27] strictly defines the names of all tags and all their fields, prohibiting customization. Therefore, Hidost's feature set for SWF theoretically comprises every structural path defined by the SWF specification. However, in practice, no effort has been made to enumerate all paths in the SWF logical structure. Instead, the feature set comprises all paths observed in the training dataset, a total of 3177.

In contrast, the PDF file format specification [9] allows the use of user-defined names in any PDF dictionary, essentially enabling an unlimited number of different paths. Our data indicates that this PDF feature is widely used in practice, as we have observed over 9 million distinct PDF structural paths. However, 2/3 of these paths do not occur in more than one file. These and other paths that occur in a small percentage of the dataset are considered anomalous. Therefore, the original SL2013 method selected the PDF paths which occur in more than a fixed number of training files, i.e., 1000, for its feature set. This threshold controls the trade-off between detection accuracy (more paths) and model simplicity (fewer paths) and may be freely adjusted.

After SPC and before every training in our periodic retraining experimental protocol, we applied the same occurrence threshold, i.e., 1000 files, which corresponds to around 1% of the training set size. This is in contrast with SL2013, where feature selection was performed "in hindsight", once for the entire dataset.
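The occurrence-threshold selection described above can be sketched as follows. The sketch assumes that every training file is represented by the collection of its consolidated structural paths. The function name `select_features` and the toy paths are illustrative and not part of the published Hidost code.

```python
from collections import Counter


def select_features(files_paths, min_files=1000):
    """Select paths that occur in more than `min_files` training files.

    `files_paths` is an iterable with one collection of structural paths
    per file. A path is counted at most once per file. The sorted result
    fixes the order of the features.
    """
    counts = Counter()
    for paths in files_paths:
        counts.update(set(paths))
    return sorted(path for path, count in counts.items() if count > min_files)


# Toy training set of three files with a threshold of one file.
training_files = [
    ["Pages/Count", "Pages/Count", "Names/JavaScript"],
    ["Pages/Count", "OpenAction/JS"],
    ["Pages/Count", "Names/JavaScript"],
]
print(select_features(training_files, min_files=1))
```

Running the selection on the training window alone, before every retraining, mirrors the periodic protocol and avoids the "in hindsight" selection over the entire dataset.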
3.4 Vectorization

Structural multimaps are a suitable representation of PDF and SWF logical file structure, but they cannot be directly used by machine learning algorithms. They first need to be transformed into feature vectors, i.e., points in the feature space $\mathbb{R}^N$, in a process called vectorization.

During vectorization, structural multimaps are first replaced by structural maps, ordinary map data structures that map a structural path to a corresponding single numeric value. To this end, every set of values corresponding to one structural path in the multimap is reduced to its median. We selected the median as a more robust statistic than the mean (here, we use the term robust in the statistical sense, denoting that the median provides a better characterization of the set of values in the presence of outliers, and not that it provides any robustness against adversarial evasion). The only exception to this rule is that sets of values in SWF structural multimaps consisting primarily of Booleans are reduced to their means, not medians. The mean preserves more information about Booleans than the median, which can only be 0, 1/2, or 1, and there is no possibility of outliers. This exception is not implemented for PDF as its logical structure has relatively few Boolean values.

Structural maps are transformed into feature vectors $f \in \mathbb{R}^N$ by reserving a separate dimension for each specific structural path and using values from structural maps as values of specific dimensions. The mapping of individual structural paths to dimensions of feature vectors is defined before feature extraction and during feature selection and is applied uniformly to all structural maps prior to both training and classification. Consequently, a specific structural path corresponds to the same dimension in every feature vector, enabling the learning algorithms to make sense of the feature vectors. The ordered collection of all features used by a learning algorithm is its feature set.

Figures 10 and 11 illustrate feature vectors obtained from a PDF and a SWF structural multimap, respectively. They show a simple case when the feature set is identical to the set of keys in the structural multimap and every value of the feature vector is assigned. In practice, however, files usually do not contain all structural paths present in the feature set and the corresponding values in the feature vectors are set to zero. In summary, a feature vector corresponding to a structural multimap $m$ is a point $f = (f_1, f_2, \ldots, f_N)$ in feature space $\mathbb{R}^N$ with specific values defined as

$$f_i = \begin{cases} \operatorname{median}(m[p_i]), & p_i \in m \\ 0, & \text{otherwise} \end{cases} \qquad i \in \{1, \ldots, N\} \tag{1}$$

Here, $p_i$ denotes the $i$th path in the feature set and $m[p_i]$ denotes the value in a multimap $m$ associated with that path.
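Equation (1) translates almost directly into code. The sketch below assumes that a structural multimap is held as a dictionary mapping each path to a non-empty list of numeric values; this representation, the function name, and the example paths are assumptions for illustration. The mean-based exception for SWF Booleans is deliberately left out.

```python
from statistics import median


def vectorize(multimap, feature_set):
    """Map a structural multimap to a feature vector following Eq. (1).

    `multimap` maps a structural path to a non-empty list of numbers.
    `feature_set` is the ordered list of paths p_1, ..., p_N.
    """
    return [
        float(median(multimap[path])) if path in multimap else 0.0
        for path in feature_set
    ]


# Hypothetical multimap of one file and a feature set with three paths.
multimap = {"Pages/Count": [1, 5, 100], "Pages/Kids/Rotate": [2, 4]}
feature_set = ["Pages/Count", "Pages/Kids/Rotate", "OpenAction/JS"]
print(vectorize(multimap, feature_set))
```

The outlier 100 does not influence the first component, which illustrates the statistical robustness that motivated the choice of the median.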
3.5 Learning and classification

The stages presented so far transform samples, i.e., files, into feature vectors suitable as input for machine learning algorithms. The choice of a concrete machine learning classifier depends on a multitude of parameters, e.g., dataset size, feature space dimensionality, available computational resources, robustness against adversarial attacks, etc., and classifiers are tailored for different uses. The published implementation of Hidost therefore does not comprise learning and classification subsystems. Instead, its output can be used with the reader's classifier of choice. For experiments presented in this paper, the Random Forest implementation of the open-source scikit-learn Python machine learning library [34], version 0.15.0b2, was utilized. This part of Hidost was published separately, as part of experimental reproduction code, as detailed in the following section.

Random Forest [35] is an ensemble classifier. It is trained by growing a forest of decision trees using CART methodology. Each of the $t_{RF}$ trees is grown on its own fixed-size random subset of training data drawn with replacement. At every branching of a tree during training, the feature providing the optimal split is selected from a random subset comprising $f_{RF}$ features not previously used for this tree. During classification, the decision of every tree is counted as one vote and the overall outcome is the class with the majority of votes. Random Forests are known for their excellent generalization ability and robustness against data noise. For the experimental evaluation, forest size was set to 200 trees and all other parameters to their scikit-learn defaults.
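Two of the mechanisms described above, drawing a per-tree training subset with replacement and combining tree decisions by majority vote, can be illustrated in a few lines of standard-library Python. This is a didactic sketch only: it is not the scikit-learn implementation used in the experiments, it does not grow any trees, and both function names are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(samples, size, rng):
    """Draw `size` training samples uniformly at random, with replacement."""
    return [rng.choice(samples) for _ in range(size)]


def majority_vote(tree_decisions):
    """Return the class that received the most votes."""
    return Counter(tree_decisions).most_common(1)[0][0]


# Each tree would be grown on its own bootstrap sample of the training data.
rng = random.Random(0)
training_subset = bootstrap_sample(["file_a", "file_b", "file_c"], 3, rng)

# At classification time, every tree contributes one vote.
print(majority_vote(["malicious", "benign", "malicious"]))
```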
4 Experimental evaluation

An extensive experimental evaluation was performed to assess the detection performance of Hidost and compare it to related work. The entire source code and the datasets needed to reproduce all experiments and plots have been published as open-source software [36].

4.1 Experimental datasets

Experiments were run on two datasets, one for each file format. Both were collected from VirusTotal [37], a website that performs an analysis of files uploaded by Internet users using many antivirus engines. VirusTotal provides detection results to researchers, enabling us to compare Hidost's detection performance to that of deployed antivirus engines. In our experiments, we consider those files malicious that were labeled as malicious by at least five antivirus engines and those files benign that were labeled benign by all antivirus engines. The remaining files, labeled by one to four antivirus engines as malicious, are discarded from the experiments because of the high uncertainty of their true class label.
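The ground-truth rule above can be written as a small function of the number of antivirus engines that flagged a file. The function and parameter names are illustrative assumptions.

```python
def ground_truth_label(positives, min_malicious=5):
    """Label a file from the number of engines that flagged it as malicious.

    Returns "malicious" for at least `min_malicious` detections, "benign"
    for zero detections, and None for files that are discarded.
    """
    if positives < 0:
        raise ValueError("detection count must be non-negative")
    if positives >= min_malicious:
        return "malicious"
    if positives == 0:
        return "benign"
    return None  # one to four detections: true class too uncertain


for count in (0, 3, 5):
    print(count, ground_truth_label(count))
```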
Our PDF dataset comprises 439,563 (446 GiB) files, 407,037 (443 GiB) benign and 32,567 (2.7 GiB) malicious. They were collected during 14 weeks, between July 16 and October 21, 2012. This is the same dataset used for the 10Weeks experiment in [26], enabling a direct comparison with that work.

The SWF dataset was collected between August 1, 2013 and March 8, 2014 and comprises 40,816 (14.2 GiB) files, 38,326 (14.1 GiB) benign and 2490 (190 MiB) malicious. The VirusTotal SWF data had a benign-to-malicious ratio of around 52:1 during the collection period; therefore, a random subsampling of benign data was performed to approximately match the ratio with that of PDF data.
4.2 Experimental protocol

Our experimental protocol has the following two main goals: (a) to evaluate the performance of Hidost under realistic conditions and (b) to enable the comparison of Hidost's detection performance on PDF to its predecessor, SL2013. To this end, we adopt the experimental protocol of the 10Weeks experiment from the SL2013 publication [26].

The 10Weeks experiment attempts to approximate a real-world, operational environment of a malware detection system exposed to attacks that change and adapt in time. For that purpose, it employs periodic retraining and evaluation. Training and evaluation datasets are assembled in a sliding window fashion, i.e., for every week of evaluation data, the classifier is trained on the previous 4 weeks of training data.

In order to follow the protocol of the 10Weeks experiment as closely as possible, the two experimental datasets were partitioned as follows. The time period in which the files were collected was divided into 14 smaller, consecutive time periods. For PDF, every period was exactly 1 week long; for SWF, every period was 15 days long, except for the last one, which was 25 days long. Every time period was assigned a bucket, and every file was put into one of the buckets, according to the time period when it was first seen. Then, the sliding window approach was applied, joining four consecutive buckets into a training dataset and using the following bucket as the corresponding evaluation dataset, resulting in 10 data partitions for periodic retraining. Before every retraining, i.e., every week for PDF, every 15 days for SWF, all four steps of feature extraction (i.e., structure extraction, SPC, feature selection, and vectorization) were applied to the training dataset. The PDF file partitioning generated in this way is identical to the one used in [26], and the SWF dataset follows the same design.

The datasets are illustrated in Figs. 12, 13, and 14. Every retraining event is labeled by the date of training, i.e., the day that marks the beginning of data collection for the evaluation period.
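The sliding window partitioning can be sketched as follows, assuming the samples have already been sorted into 14 time-ordered buckets. The function name and the list-of-lists representation are assumptions for illustration.

```python
def sliding_window_partitions(buckets, train_size=4):
    """Build (training, evaluation) pairs from time-ordered buckets.

    Every evaluation bucket is paired with the `train_size` buckets that
    immediately precede it.
    """
    partitions = []
    for i in range(train_size, len(buckets)):
        training = [s for bucket in buckets[i - train_size:i] for s in bucket]
        partitions.append((training, list(buckets[i])))
    return partitions


# 14 buckets with one placeholder sample each yield 10 partitions.
buckets = [[f"sample_{k}"] for k in range(14)]
partitions = sliding_window_partitions(buckets)
print(len(partitions))
```

With 14 buckets and a window of four, the first evaluation bucket is the fifth one, which is why exactly 10 retraining events result.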
Fig. 12 PDF dataset (x-axis: date, 2012; y-axis: samples ×10^5). Legend in Fig. 13

Fig. 13 SWF-Normal dataset (x-axis: date, 2013-2014; y-axis: samples ×10^4; legend: benign training, malicious training, benign evaluation, malicious evaluation). Malicious training samples older than four time periods are discarded

While the benign-malicious class ratio for PDF is approximately equal throughout all time periods, the distribution of malicious and benign SWF files in time is highly skewed, as visible in Figs. 13 and 14.
As much as 70% of malicious SWF files in the SWF-Normal dataset were collected before the first evaluation period, while less than 10% of benign SWF files occur before the fifth evaluation period. The result is a high class imbalance in most training and evaluation datasets. In order to quantify the effect of high class imbalance on detection performance, we generated another data partitioning just for SWF data, labeled SWF-KeepMal and illustrated in Fig. 14, in which malicious training samples older than four periods are not discarded. Instead, they are used for training in all subsequent periods. By discarding old benign samples and keeping malicious ones throughout the experiment, the class imbalance in training datasets is greatly reduced.
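The SWF-KeepMal partitioning differs from the plain sliding window only in how malicious training samples are accumulated. The sketch below assumes benign and malicious samples are kept in two parallel lists of time-ordered buckets; all names are illustrative.

```python
def keepmal_partitions(benign_buckets, malicious_buckets, train_size=4):
    """Build partitions in which malicious training samples never expire.

    Benign training data comes from the previous `train_size` buckets,
    malicious training data from all earlier buckets.
    """
    partitions = []
    for i in range(train_size, len(benign_buckets)):
        benign_train = [
            s for bucket in benign_buckets[i - train_size:i] for s in bucket
        ]
        malicious_train = [s for bucket in malicious_buckets[:i] for s in bucket]
        evaluation = list(benign_buckets[i]) + list(malicious_buckets[i])
        partitions.append((benign_train, malicious_train, evaluation))
    return partitions


benign = [[f"b{k}"] for k in range(14)]
malicious = [[f"m{k}"] for k in range(14)]
last = keepmal_partitions(benign, malicious)[-1]
print(len(last[0]), len(last[1]))
```

In the last partition, the benign training data still spans four buckets while the malicious training data spans all 13 preceding buckets.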
Fig. 14 SWF-KeepMal dataset (x-axis: date, 2013-2014; y-axis: samples ×10^4). Malicious training samples are kept indefinitely. Legend in Fig. 13

Fig. 15 Results on PDF data (panels: area under ROC, accuracy, true positive rate, false positive rate; x-axis: date, 2012; legend: SL2013 reproduction, SL2013 + Random Forest, Hidost binary, Hidost numerical). Performance of SL2013 with an SVM and Random Forest classifier compared to Hidost with binary and numerical features

4.3 Experimental results

Experimental results of different methods operating on PDF and SWF data are illustrated in Figs. 15 and 16, respectively.
The methods are compared in four performance indicators typical for classification tasks: true positive rate (TPR), false positive rate (FPR), accuracy, and area under the receiver operating characteristic (AUROC). AUROC, similar to the area under the precision-recall curve, is a good detection performance indicator for both balanced and unbalanced datasets. Due to the stochastic nature of the algorithm, mean values of 10 independent runs are plotted for all Random Forest experiments. The variance of these experiments was omitted from the plots due to its very low value. SL2013 employs a support vector machine (SVM) classifier, a deterministic algorithm; therefore, its results are obtained from a single experimental run.
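Three of the four indicators follow directly from the confusion matrix. The sketch below assumes labels are encoded as 1 for malicious and 0 for benign, and that both classes occur in the evaluation data. The function name is illustrative, and AUROC is omitted because it requires classifier scores rather than hard decisions.

```python
def detection_metrics(y_true, y_pred):
    """Compute TPR, FPR, and accuracy from binary labels (1 = malicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy evaluation set: three malicious and five benign samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(detection_metrics(y_true, y_pred))
```

Accuracy alone can be misleading on imbalanced data: with a 15:1 benign-to-malicious ratio, a detector that labels every sample benign reaches an accuracy of 15/16 while its true positive rate is zero.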
Fig. 16 Results on SWF data (panels: area under ROC, accuracy, true positive rate, false positive rate; x-axis: date, 2013-2014; legend: SWF-Normal binary, SWF-Normal numerical, SWF-KeepMal binary, SWF-KeepMal numerical). Performance of Hidost with both binary and numerical features on two SWF datasets, SWF-Normal and SWF-KeepMal

Figure 15 shows results for different variants of Hidost and SL2013 on PDF data. Along with the accurate reproduction of SL2013, the same method is evaluated using a Random Forest classifier instead of the SVM. Hidost is shown with both binary and numerical features. These two variants of Hidost are also shown in Fig. 16 on SWF datasets SWF-Normal and SWF-KeepMal.
4.3.1 Classification performance

Figure 15 shows a direct comparison of Hidost to SL2013 on the PDF dataset. The Random Forest variant of SL2013 was introduced to enable the comparison of the two methods' feature sets and classifiers independently. It can be seen that SL2013 results, especially the true positive rate on October 1, can be promptly improved by using a Random Forest instead of an SVM on the same binary unconsolidated features. On the other hand, the expected classification performance indicated by AUROC is effectively equal for all methods, including the SVM. The maximum difference between any two methods in AUROC in a given time period is a mere 0.006, and all methods have an AUROC above 0.99 in every period. It can be concluded that Hidost achieves the excellent classification performance of its predecessor, SL2013, on PDF data.

Hidost's performance on SWF data is not on par with its success on PDF. Although the mean detection accuracy lies above 95%, as seen in Fig. 16, accuracy is not a meaningful performance indicator due to the large class imbalance of 15:1 in favor of benign samples in the SWF dataset. The class imbalance is even greater in the datasets of individual time periods, shown in Figs. 13 and 14, especially in the case of SWF-Normal.

The effect of class imbalance is clearly reflected in the results. Applied on SWF-KeepMal, where malicious training samples are accumulated over time, Hidost has an overall much higher AUROC than on SWF-Normal, where malware is discarded after four periods. The true positive rate on SWF-KeepMal in the early stages, when the classes are more balanced, is 5 to 10% higher than on SWF-Normal. Starting from December 29, after a sharp increase of class imbalance, the advantage jumps to around 20%, a tremendous improvement. Access to more malicious training data also increased the false positive rate, but the increase for the variant with numerical features remained within bounds, except for the last time period. These findings clearly show Hidost's potential for further improvement of detection performance, given a greater availability of malicious SWF training data. However, as the SWF dataset only comprises 2,490 malicious samples, it is impossible to accurately quantify the potential for improvement.
4.3.2 Comparison to antivirus engines

In order to get an estimate of Hidost's detection performance under day-to-day, realistic operational conditions, it is necessary to put it into a wider perspective. A direct comparison with antivirus engines, widely used malware detectors most people rely on for their security, provides such a reality check. We compare the detectors in terms of their true positive count, i.e., the number of malicious samples they have correctly labeled, in the course of our experiments. By definition of our ground truth, samples labeled malicious by at most four antivirus engines are filtered out. Therefore, the antivirus engines do not have false positives and cannot be compared in that sense.

Fig. 17 Comparison of Hidost to antivirus engines on PDF (left) and SWF (right) data (x-axis: true positive count). Bar labels in figure order, PDF: Kingsoft, SUPERAntiSpyware, NANO-Antivirus, ByteHero, TheHacker, Antiy-AVL, Agnitum, eScan, VBA32, Panda, ViRobot, Rising, VirusBuster, CAT-QuickHeal, Jiangmin, TotalDefense, AhnLab-V3, eSafe, ClamAV, K7AntiVirus, MicroWorld-eScan, PCTools, Symantec, Emsisoft, Norman, TrendMicro, VIPRE, Fortinet, F-Prot, McAfee-GW-Edition, TrendMicro-HouseCall, Commtouch, F-Secure, Avast, AVG, McAfee, Ikarus, Microsoft, AntiVir, nProtect, DrWeb, Comodo, BitDefender, Kaspersky, ESET-NOD32, GData, Sophos, Hidost, Total. SWF: Kingsoft, CMC, Qihoo-360, Panda, K7AntiVirus, K7GW, ClamAV, ViRobot, PCTools, Antiy-AVL, Bkav, TotalDefense, Rising, F-Prot, CAT-QuickHeal, AhnLab-V3, ESET-NOD32, McAfee, Jiangmin, Comodo, Commtouch, Agnitum, DrWeb, VBA32, Ad-Aware, McAfee-GW-Edition, TrendMicro-HouseCall, VIPRE, Ikarus, Symantec, TrendMicro, NANO-Antivirus, Norman, Fortinet, Emsisoft, nProtect, F-Secure, BitDefender, AVG, MicroWorld-eScan, Microsoft, GData, Kaspersky, Hidost, Sophos, AntiVir, Avast, Total

Figure 17 shows the results achieved by Hidost (average of 10 experimental runs) and antivirus engines deployed by VirusTotal on both PDF and SWF files. Antivirus detection results were collected after the experiments were over and not immediately after each new file was submitted to VirusTotal. This provided antivirus engines with the opportunity to update their detection mechanisms in the meantime and correctly detect any file resubmitted between its initial submission and the time when the detection results were collected. Nevertheless, Hidost ranks among the best overall. Its PDF detection rate is unsurpassed, and even the SWF true positive count, comparatively much worse than PDF "on paper", ranks among the best when compared to established products under realistic conditions.
4.3.3 Effects of structural path consolidation

SPC is one of the main novelties in Hidost with respect to SL2013; therefore, an evaluation of its effects on the performance of the system is only fitting. Figure 15 demonstrates that SPC does not affect detection performance, either positively or negatively. Results of SL2013 (no SPC) are virtually identical on PDF to those of Hidost (with SPC) when the same Random Forest classifier is utilized. Effects on SWF are negligible because its rigid logical structure disallows user-defined paths, resulting in minimal necessity for SPC.

However, SPC has a strong positive effect on feature set drift. Figure 18 illustrates feature set drift in our experiment with periodic retraining and periodic feature selection on PDF data. It can be observed that in the first half of our 10-week experiment, the feature set was expanded with up to 9% of new features per week, while in the second half, many features were found obsolete and were removed from the feature set. Feature removal is especially high in week 8, when almost a fifth of all features from the previous week were deleted when SPC was not used. On the other hand, when utilizing SPC, the overlap between feature sets of consecutive weeks was well above 90% throughout the entire experiment. Overall, the introduction of SPC in Hidost reduced feature set drift by around 50%.
4.3.4 Effects of numerical features

Another novelty introduced in Hidost is the use of numerical instead of binary features, reflecting the transition from learning on pure structure to learning on structure coupled with content. Here, we evaluate the impact of numerical features on performance.

On PDF, the difference between binary and numerical features is insignificant, as shown in Fig. 15. On the other hand, the effect on SWF is largely positive. As shown in Fig. 16, numerical features consistently outperformed binary features on both SWF datasets. They showed the highest influence on false positive rate, reducing it by as much as 50%. TPR and AUROC also showed a significant overall improvement.

The cause of the discrepancy between results for the two file formats might lie in the nature of attacks against them. Malicious PDF files often use features uncommon in benign files, i.e., their structure is different, while malicious SWF files mostly base their attacks on different values, i.e., content, at specific paths, although these paths are also common among benign files. While binary features suffice to describe logical structure, the added expressive power of numerical features enables the description, and consequently detection, of both structure- and content-based attacks.
4.4 Reproduction of SL2013 results

Results of all performance evaluation experiments published in SL2013 [26] were accurately reproduced using the original RBF SVM with C = 12 and γ = 0.0025. However, when attempting to reproduce the evasion robustness evaluation experiment, we discovered a flaw in the source code of the mimicry attack used against the RBF SVM. Namely, instead of mimicking the most benign sample in the benign set, the flaw caused the most benign sample in the malicious set to be used as the mimicry target. The attack dataset generated in this way was successfully detected as such by the RBF SVM. Upon discovering this flaw, we removed it and performed a corrected experiment. This time, the attack was highly successful and the accuracy fell to 50%. We therefore retract the results of the mimicry attack experiment published in [26] and see the success of the corrected attack as evidence against the robustness of RBF SVM in an adversarial environment. Robustness guarantees for Random Forests remain a topic for future research.

Fig. 18 Feature set drift in the PDF dataset (panels: new features, obsolete features, unchanged features; x-axis: retraining period; legend: without SPC, with SPC). Illustrates change in the feature set between consecutive feature selection events in our experiment, performed before every training, i.e., once per week. For weeks 2 to 10, the percentage of features that have been added to (new), removed from (obsolete), or remained (unchanged) in the feature set is plotted, relative to the previous week
5 Discussion

The main novelty introduced by Hidost is its applicability to multiple file formats, implemented and experimentally confirmed on PDF and SWF. Its application to other hierarchically structured file formats, e.g., XML, HTML, ODF, OOXML, and SVG, requires the instrumentation of an existing parser or the development of a new one, one for each file format. Given the ability to parse a specific file format, incorporating it into Hidost amounts to developing a structure extraction module. It is this step that has to be specialized for every file format. In the following, we discuss file structure and content extraction for various hierarchically structured file formats.

XML and the related HTML and SVG have a very clear and well-defined hierarchical structure that represents one of the cornerstones of these formats. For example, Fig. 2 depicts an HTML file with the path /html/body/p. Furthermore, there exist a number of mature open-source parsers for XML files. We estimate it to be very simple to implement the extraction of both logical document structure and content from XML files.

Although based largely on XML, ODF and OOXML generally combine multiple XML files into a ZIP archive and therefore require some additional processing. Both formats prescribe a set of files and directories in which content, layout, and metadata are separately organized. The formats differentiate between textual and graphical content, with textual content stored alongside logical structure in XML files and graphical content in separate files within the directory hierarchy. We observe that the files and directories are themselves organized hierarchically and that the remaining logical structure is described in XML files.
The following shows a simplified file and directory layout in an ODF file:

.
|-- content.xml
|-- manifest.rdf
|-- META-INF
|   \-- manifest.xml
|-- meta.xml
|-- mimetype
|-- settings.xml
|-- styles.xml
\-- Thumbnails
    \-- thumbnail.png

Fig. 19 Example meta.xml file

We consider the directory hierarchy to be the top level of the logical structure. In it, the root directory of the ZIP archive represents the root node of the entire structural hierarchy. XML files can be viewed as sub-trees rooted at the corresponding nodes in the file system hierarchy. For example, ODF prescribes that the file meta.xml, depicted in Fig. 19, resides within the root directory and has a set of XML tags describing document metadata. Given this structure, the path to the dc:creator tag would be:

/meta.xml/office:document-meta/office:meta/dc:creator

By treating the directory hierarchy as the top level of the logical structure and XML files as sub-trees belonging to it, we ensure the complete and unambiguous extraction of logical structure. In contrast to PDF and SWF, we prepend the file system path of a given XML file, relative to the root of the ZIP archive, to structural paths extracted from the file itself.
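The proposed path construction for ZIP-based formats can be sketched with the Python standard library alone. This is a hypothetical structure extraction module, not part of the published Hidost toolset. The function names, the restriction to files ending in `.xml`, and the placeholder namespace URIs in the example are assumptions.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def xml_paths(xml_bytes, prefix):
    """Return the structural path of every element, prepended by `prefix`."""
    uri_to_prefix = {}  # namespace URI -> prefix declared in the file
    stack, paths = [], []
    events = ("start-ns", "start", "end")
    for event, item in ET.iterparse(io.BytesIO(xml_bytes), events=events):
        if event == "start-ns":
            ns_prefix, uri = item
            uri_to_prefix[uri] = ns_prefix
        elif event == "start":
            tag = item.tag
            if tag.startswith("{"):  # "{uri}local" -> "prefix:local"
                uri, local = tag[1:].split("}", 1)
                ns_prefix = uri_to_prefix.get(uri, "")
                tag = f"{ns_prefix}:{local}" if ns_prefix else local
            stack.append(tag)
            paths.append(prefix + "/" + "/".join(stack))
        else:  # "end": leave the current element
            stack.pop()
    return paths


def archive_paths(zip_bytes):
    """Extract structural paths from all XML files in a ZIP archive."""
    paths = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".xml"):
                paths.extend(xml_paths(archive.read(name), "/" + name))
    return paths


# Build a tiny in-memory archive with a meta.xml file (placeholder URIs).
META_XML = (
    "<office:document-meta xmlns:office='urn:example:office' "
    "xmlns:dc='urn:example:dc'><office:meta><dc:creator>x</dc:creator>"
    "</office:meta></office:document-meta>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("meta.xml", META_XML)
for path in archive_paths(buffer.getvalue()):
    print(path)
```

Files in subdirectories, such as META-INF/manifest.xml, receive their directory as part of the prefix, so the ZIP directory hierarchy forms the top level of every extracted path.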
Multiple parsers for ODF and OOXML exist, of which some are open-source. We believe that it would be possible, with moderate effort, to develop structure extraction modules for both formats. Furthermore, in many cases, completely benign OOXML files are used as containers for embedding malicious SWF files and Hidost can already handle them.

Structural path consolidation is the second and final format-specific step in Hidost and requires some tuning. We expect that different formats have different requirements for SPC and acknowledge the necessity for deeper understanding of file formats for SPC rule development. The variety of SPC rules for PDF versus SWF corroborates this hypothesis.

Finally, Hidost's applicability to a given file format does not imply its effectiveness on it. For example, despite our firm belief that extending Hidost to XML is straightforward, its effectiveness, measured in its ability to detect malware disguised in XML files, can only be evaluated experimentally. However, its use of both structure and content for modeling makes it more likely to be successful.

6 Conclusions

In this paper we introduced Hidost, a machine-learning-based malware detection system. It represents an extension of a previously published method, SL2013 [26]. Hidost is the first static machine-learning-based malware detector designed to operate on multiple file types. It accomplishes this by making a model of malicious and benign samples based on their structure and content. Evaluated on a real-world dataset in a realistic experiment with periodic retraining spanning multiple months, Hidost outperformed all antivirus engines deployed by the website VirusTotal and detected the highest number of malicious PDF files. It also ranked among the best on SWF malware.

Compared to its predecessor, SL2013, it is much less vulnerable to malware hiding in obscured parts of PDF files. Hidost also became more robust against the continual adaptation of malware to updated defense through periodic retraining. Finally, its greatly reduced feature set dimensionality enables its efficient application on very large datasets.

A logical next step for Hidost is its implementation and evaluation on other hierarchically structured file formats. Of particular significance would be an application to formats used by Microsoft Office, as they are widely used for recent targeted attacks. A conceptual design for this application was proposed in this paper. Looking beyond, the development of more advanced string handling methods might prove indispensable to enable detection of malware in formats whose logical structure and numerical content do not provide enough discriminative power.
Acknowledgements
We acknowledge support by Deutsche Forschungsgemeinschaft and Open Access Publishing Fund of University of Tübingen.

Authors' contributions
NS participated in conceiving and designing the study, implemented the method, participated in the data collection, carried out the experiments, and drafted the manuscript. PL participated in conceiving and designing the study, the system design, and the data collection and helped draft the manuscript. Both authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Author details
1 Cognitive Systems, Department of Computer Science, University of Tübingen, Sand 1, 72076, Tübingen, Germany. 2 Munich Office, European Research Center, Huawei Technologies Duesseldorf GmbH, Riessstr. 25 D-3.OG, 80992 München, Germany.
Received: 29 November 2015 Accepted: 7 September 2016

References
1. Symantec, 2014 Internet Security Threat Report, Volume 19 (2014). https://www.symantec.com/content/en/us/enterprise/other_resources/b-istr_main_report_v19_21291018.en-us.pdf. Accessed 13 Apr 2015
2. Cisco, 2014 Annual Security Report (2014). http://www.cisco.com/web/offer/gist_ty2_asset/Cisco_2014_ASR.pdf. Accessed 13 Apr 2015
3. Sophos, Security Threat Report 2014 (2014). http://www.sophos.com/en-us/medialibrary/pdfs/other/sophos-security-threat-report-2014.pdf. Accessed 13 Apr 2015
4. Symantec, 2014 Internet Security Threat Report, Volume 19, Appendix (2014). https://www.symantec.com/content/en/us/enterprise/other_resources/b-istr_appendices_v19_221284438.en-us.pdf. Accessed 13 Apr 2015
5. Symantec, 2015 Internet Security Threat Report, Volume 20 (2015). http://know.symantec.com/LP=1123. Accessed 15 Apr 2015
6. Recorded Future, Gone in a Flash: Top 10 Vulnerabilities Used by Exploit Kits (2015). https://www.recordedfuture.com/top-vulnerabilities-2015/. Accessed 15 Nov 2015
7. W-J Li, SJ Stolfo, A Stavrou, E Androulaki, AD Keromytis, in Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA). A study of malcode-bearing documents (Springer, 2007), pp. 231–250
8. ZM Shafiq, SA Khayam, M Farooq, in Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA). Embedded malware detection using markov n-grams (Springer, 2008), pp. 88–107
9. Document management - Portable document format - Part 1: PDF 1.7 (2008). https://www.adobe.com/devnet/pdf/pdf_reference.html. Accessed 23 Jan 2015
10. P Laskov, N Šrndić, in Annual Computer Security Applications Conference (ACSAC). Static detection of malicious JavaScript-bearing PDF documents (ACM, 2011), pp. 373–382
11. D Maiorca, G Giacinto, I Corona, in International Workshop on Machine Learning and Data Mining in Pattern Recognition. A pattern recognition system for malicious PDF files detection (Springer, 2012), pp. 510–524
12. C Smutz, A Stavrou, in Proceedings of the 28th Annual Computer Security Applications Conference. Malicious PDF detection using metadata and structural features (ACM, 2012), pp. 239–248
13. N Šrndić, P Laskov, in Proceedings of the 2014 IEEE Symposium on Security and Privacy. Practical evasion of a learning-based classifier: a case study (IEEE Computer Society, 2014), pp. 197–211
14. M Polychronakis, KG Anagnostakis, EP Markatos, in Annual Computer Security Applications Conference (ACSAC). Comprehensive shellcode detection using runtime heuristics (ACM, 2010), pp. 287–296
15. P Akritidis, EP Markatos, M Polychronakis, KG Anagnostakis, in 20th International Conference on Information Security. STRIDE: Polymorphic sled detection through instruction sequence analysis (Springer, 2005), pp. 375–392
16. Wepawet (2015). http://wepawet.iseclab.org/. Accessed 16 Apr 2015
17. M Cova, C Kruegel, G Vigna, in International Conference on World Wide Web (WWW). Detection and analysis of drive-by-download attacks and malicious JavaScript code (ACM, 2010), pp. 281–290
18. C Willems, T Holz, F Freiling, CWSandbox: towards automated dynamic binary analysis. IEEE Secur. Privacy. 5(2), 32–39 (2007)
19. KZ Snow, S Krishnan, F Monrose, N Provos, in USENIX Security Symposium. ShellOS: enabling fast detection and forensic analysis of code injection attacks (USENIX Association, 2011)
20. A Tang, S Sethumadhavan, SJ Stolfo, in Research in Attacks, Intrusions and Defenses: 17th International Symposium, RAID 2014, Gothenburg, Sweden, September 17–19, 2014, Proceedings. Unsupervised anomaly-based malware detection using hardware features, vol. 8688 (Springer, 2014), p. 109
21. Z Tzermias, G Sykiotakis, M Polychronakis, EP Markatos, in European Workshop on System Security (EuroSec). Combining static and dynamic analysis for the detection of malicious documents (ACM, 2011)
22. X Lu, J Zhuge, R Wang, Y Cao, Y Chen, in System Sciences (HICSS), 2013 46th Hawaii International Conference On. De-obfuscation and detection of malicious PDF files with high accuracy (IEEE Computer Society, 2013), pp. 4890–4899
23. N Nissim, A Cohen, C Glezer, Y Elovici, Detection of malicious PDF files and directions for enhancements: a state-of-the art survey. Comput. Secur. 48(0), 246–266 (2015)
24. S Ford, M Cova, C Kruegel, G Vigna, in Computer Security Applications Conference, 2009. ACSAC'09. Annual. Analyzing and detecting malicious flash advertisements (IEEE Computer Society, 2009), pp. 363–372
25. TV Overveldt, C Kruegel, G Vigna, in Recent Advances in Intrusion Detection (RAID). FlashDetect: ActionScript 3 malware detection (Springer, 2012), pp. 274–293
26. N Šrndić, P Laskov, in 20th Annual Network and Distributed System Security Symposium, NDSS 2013, San Diego, California, USA, February 24–27, 2013. Detection of malicious PDF files based on hierarchical document structure, (2013). http://internetsociety.org/doc/detection-malicious-pdf-files-based-hierarchical-document-structure
27. SWF File Format Specification (version 19) (2012). https://www.adobe.com/devnet/swf.html. Accessed 23 Jan 2015
28. SWFRETools - A collection of tools for reverse engineering Flash files. https://github.com/sporst/SWFREtools. Accessed 4 Feb 2015
29. Hidost - Toolset for extracting document structures from PDF and SWF files. https://github.com/srndic/hidost. Accessed 29 Nov 2015
30. Poppler. http://poppler.freedesktop.org/. Accessed 10 Feb 2015
31. K Rieck, T Krüger, A Dewald, in Annual Computer Security Applications Conference (ACSAC). Cujo: efficient detection and prevention of drive-by-download attacks, (2010), pp. 31–39
32. G Widmer, M Kubat, Learning in the presence of concept drift and hidden contexts. Mach. Learn. 23(1), 69–101 (1996)
33. A Kantchelian, S Afroz, L Huang, AC Islam, B Miller, MC Tschantz, R Greenstadt, AD Joseph, J Tygar, in Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security.
Approachestoadversarialdrift(ACM,2013),pp.
99–11034.
FPedregosa,GVaroquaux,AGramfort,VMichel,BThirion,OGrisel,MBlondel,PPrettenhofer,RWeiss,VDubourg,JVanderplas,APassos,DCournapeau,MBrucher,MPerrot,EDuchesnay,scikit-learn:machinelearninginPython.
J.
Mach.
Learn.
Res.
12,2825–2830(2011)35.
LBreiman,Randomforests.
Mach.
Learn.
45(1),5–32(2001)36.
HidostReproduction.
https://github.
com/srndic/hidost-reproduction.
Accessed29Nov201537.
VirusTotal-FreeOnlineVirus,MalwareandURLScanner.
https://www.
virustotal.
com/.
Accessed 6 Mar 2015