Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright 2020. All rights reserved. Draft of December 30, 2020.
CHAPTER 23
Question Answering

The quest for knowledge is deeply human, and so it is not surprising that practically as soon as there were computers we were asking them questions. By the early 1960s, systems used the two major paradigms of question answering, information-retrieval-based and knowledge-based, to answer questions about baseball statistics or scientific facts. Even imaginary computers got into the act. Deep Thought, the computer that Douglas Adams invented in The Hitchhiker's Guide to the Galaxy, managed to answer "the Ultimate Question Of Life, The Universe, and Everything".[1] In 2011, IBM's Watson question-answering system won the TV game-show Jeopardy!, surpassing humans at answering questions like:[2]

[Jeopardy! clue figure omitted]

Question answering systems are mainly designed to fill human information needs. Humans ask questions in many situations: when talking to a virtual assistant, when interacting with a search engine, when querying a database. Most question answering systems focus on a particular subset of these information needs: factoid questions, questions that can be answered with simple facts expressed in short texts, like the following:

(23.1) Where is the Louvre Museum located
(23.2) What is the average age of the onset of autism

In this chapter we describe the two major paradigms for factoid question answering. Information-retrieval (IR) based QA, sometimes called open domain QA, relies on the vast amount of text on the web or in collections of scientific papers like PubMed. Given a user question, information retrieval is used to find relevant passages. Then neural reading comprehension algorithms read these retrieved passages and draw an answer directly from spans of text.

In the second paradigm, knowledge-based question answering, a system instead builds a semantic representation of the query, such as mapping What states border Texas to the logical representation λx. state(x) ∧ borders(x, texas), or When was Ada Lovelace born to the gapped relation birth-year(Ada Lovelace, x). These meaning representations are then used to query databases of facts.

We'll also briefly discuss two other paradigms for question answering. One relies on the fact that the huge pretrained language models we use throughout NLP have already encoded a lot of factoids. We'll see how to query a language model directly to answer a question. And we'll also mention the classic pre-neural hybrid question-answering algorithms that combine information from IR-based and knowledge-based sources.

[1] The answer was 42, but unfortunately the details of the question were never revealed.
[2] The answer, of course, is 'Who is Bram Stoker', and the novel was Dracula.

We'll explore the possibilities and limitations of all these approaches, along the way also introducing two technologies that are key for question answering but also relevant throughout NLP: information retrieval (a key component of IR-based QA) and entity linking (similarly key for knowledge-based QA). We'll start in the next section by introducing the task of information retrieval.

A final note: we focus in this chapter only on factoid question answering, but there are many other important QA tasks that the interested reader may want to follow up on. These include long-form question answering (answering why questions, or other questions that require generating a long answer) and community question answering, in which we make use of datasets of community-created question-answer pairs like Quora or Stack Overflow. Finally, question answering is an important benchmark for NLP progress in general, and so researchers have built systems that successfully answer questions on exams like the New York Regents Science Exam as a way to benchmark NLP and AI (Clark et al., 2019).
23.1 Information Retrieval

Information retrieval or IR is the name of the field encompassing the retrieval of all manner of media based on user information needs. The resulting IR system is often called a search engine. Our goal in this section is to give a sufficient overview of IR to see its application to question answering. Readers with more interest specifically in information retrieval should see the Historical Notes section at the end of the chapter and textbooks like Manning et al. (2008).

The IR task we consider is called ad hoc retrieval, in which a user poses a query to a retrieval system, which then returns an ordered set of documents from some collection. A document refers to whatever unit of text the system indexes and retrieves (web pages, scientific papers, news articles, or even shorter passages like paragraphs). A collection refers to a set of documents being used to satisfy user requests. A term refers to a word in a collection, but it may also include phrases. Finally, a query represents a user's information need expressed as a set of terms.

The high-level architecture of an ad hoc retrieval engine is shown in Fig. 23.1.

Figure 23.1  The architecture of an ad hoc IR system.

The basic IR architecture uses the vector space model we introduced in Chapter 6, in which we map queries and documents to vectors based on unigram word counts, and use the cosine similarity between the vectors to rank potential documents (Salton, 1971). This is thus an example of the bag-of-words model introduced in Chapter 4, since words are considered independently of their positions.
23.1.1 Term weighting and document scoring

Let's look at the details of how the match between a document and query is scored. We don't use raw word counts in IR, instead computing a term weight for each document word. Two term weighting schemes are common: the tf-idf weighting introduced in Chapter 6, and a slightly more powerful variant called BM25.

We'll reintroduce tf-idf here so readers don't need to look back at Chapter 6. Tf-idf (the '-' here is a hyphen, not a minus sign) is the product of two terms, the term frequency tf and the inverse document frequency idf.

The term frequency tells us how frequent the word is; words that occur more often in a document are likely to be informative about the document's contents. We usually use the log10 of the word frequency, rather than the raw count. The intuition is that a word appearing 100 times in a document doesn't make that word 100 times more likely to be relevant to the meaning of the document. Because we can't take the log of 0, we normally add 1 to the count:[3]

tf_{t,d} = log10(count(t,d) + 1)    (23.3)

If we use log weighting, terms which occur 0 times in a document would have tf = log10(1) = 0; 10 times in a document, tf = log10(11) = 1.04; 100 times, tf = log10(101) = 2.004; 1000 times, tf = log10(1001) = 3.00044; and so on.

The document frequency df_t of a term t is the number of documents it occurs in. Terms that occur in only a few documents are useful for discriminating those documents from the rest of the collection; terms that occur across the entire collection aren't as helpful. The inverse document frequency or idf term weight (Sparck Jones, 1972) is defined as:

idf_t = log10(N / df_t)    (23.4)

where N is the total number of documents in the collection, and df_t is the number of documents in which term t occurs. The fewer documents in which a term occurs, the higher this weight; the lowest weight of 0 is assigned to terms that occur in every document.

Here are some idf values for some words in the corpus of Shakespeare plays, ranging from extremely informative words that occur in only one play like Romeo, to those that occur in a few like salad or Falstaff, to those that are very common like fool, or so common as to be completely non-discriminative since they occur in all 37 plays like good or sweet.[4]

Word       df    idf
Romeo       1    1.57
salad       2    1.27
Falstaff    4    0.967
forest     12    0.489
battle     21    0.246
wit        34    0.037
fool       36    0.012
good       37    0
sweet      37    0

[3] Or we can use this alternative: tf_{t,d} = 1 + log10 count(t,d) if count(t,d) > 0, and 0 otherwise.
[4] Sweet was one of Shakespeare's favorite adjectives, a fact probably related to the increased use of sugar in European recipes around the turn of the 16th century (Jurafsky, 2014, p. 175).

The tf-idf value for word t in document d is then the product of term frequency tf_{t,d} and idf:

tf-idf(t,d) = tf_{t,d} · idf_t    (23.5)
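Eqs. 23.3-23.5 are easy to check directly. A minimal sketch in Python (the function names are ours, not from the chapter):

```python
import math

def tf(count):
    """Eq. 23.3: log-scaled term frequency, log10(count + 1)."""
    return math.log10(count + 1)

def idf(n_docs, df):
    """Eq. 23.4: inverse document frequency, log10(N / df)."""
    return math.log10(n_docs / df)

def tf_idf(count, n_docs, df):
    """Eq. 23.5: the product of the two."""
    return tf(count) * idf(n_docs, df)

# Spot checks against the values in the text: a term occurring 10 times has
# tf = log10(11) ~ 1.04; Romeo occurs in 1 of the 37 plays, so idf ~ 1.57.
print(round(tf(10), 2), round(idf(37, 1), 2))  # 1.04 1.57
```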
23.1.2 Document Scoring

We score document d by the cosine of its vector d with the query vector q:

score(q,d) = cos(q,d) = (q · d) / (|q| |d|)    (23.6)

Another way to think of the cosine computation is as the dot product of unit vectors; we first normalize both the query and document vector to unit vectors, by dividing by their lengths, and then take the dot product:

score(q,d) = cos(q,d) = (q / |q|) · (d / |d|)    (23.7)

We can spell out Eq. 23.7, using the tf-idf values and spelling out the dot product as a sum of products:

score(q,d) = Σ_{t ∈ q} [ tf-idf(t,q) / √(Σ_{q_i ∈ q} tf-idf²(q_i,q)) ] · [ tf-idf(t,d) / √(Σ_{d_i ∈ d} tf-idf²(d_i,d)) ]    (23.8)

In practice, it's common to approximate Eq. 23.8 by simplifying the query processing. Queries are usually very short, so each query word is likely to have a count of 1. And the cosine normalization for the query (the division by |q|) will be the same for all documents, so won't change the ranking between any two documents D_i and D_j. So we generally use the following simple score for a document d given a query q:

score(q,d) = Σ_{t ∈ q} tf-idf(t,d) / |d|    (23.9)

Let's walk through an example of a tiny query against a collection of 4 nano documents, computing tf-idf values and seeing the rank of the documents. We'll assume all words in the following query and documents are downcased and punctuation is removed:

Query: sweet love
Doc 1: Sweet sweet nurse! Love
Doc 2: Sweet sorrow
Doc 3: How sweet is love
Doc 4: Nurse!

Fig. 23.2 shows the computation of the tf-idf values and the document vector length |d| for the first two documents using Eq. 23.3, Eq. 23.4, and Eq. 23.5 (computations for documents 3 and 4 are left as an exercise for the reader). Fig. 23.3 shows the scores of the 4 documents, ranked according to Eq. 23.9. The ranking follows intuitively from the vector space model. Document 1, which has both terms including two instances of sweet, is the highest ranked, above document 3 which has a larger length |d| in the denominator, and also a smaller tf for sweet. Document 2 is missing one of the terms, and Document 4 is missing both.
            Document 1                          Document 2
word    count   tf     df  idf    tf-idf    count   tf     df  idf    tf-idf
love      1    0.301    2  0.301  0.091       0    0        2  0.301  0
sweet     2    0.477    3  0.125  0.060       1    0.301    3  0.125  0.038
sorrow    0    0        1  0.602  0           1    0.301    1  0.602  0.181
how       0    0        1  0.602  0           0    0        1  0.602  0
nurse     1    0.301    2  0.301  0.091       0    0        2  0.301  0
is        0    0        1  0.602  0           0    0        1  0.602  0

|d1| = √(.091² + .060² + .091²) = .141      |d2| = √(.038² + .181²) = .185

Figure 23.2  Computation of tf-idf for nano-documents 1 and 2, using Eq. 23.3, Eq. 23.4, and Eq. 23.5.
Doc   |d|   tf-idf(sweet)   tf-idf(love)   score
 1   .141       .060            .091        1.07
 3   .274       .038            .091        .471
 2   .185       .038            0           .205
 4   .090       0               0           0

Figure 23.3  Ranking documents by Eq. 23.9.
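The numbers in Fig. 23.2 and the ranking in Fig. 23.3 can be reproduced with a few lines of Python, using the simplified score of Eq. 23.9 (a sketch; the variable names are ours):

```python
import math
from collections import Counter

docs = {
    1: "sweet sweet nurse love",
    2: "sweet sorrow",
    3: "how sweet is love",
    4: "nurse",
}
N = len(docs)
counts = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for c in counts.values() for t in c)  # document frequencies

def tf_idf(term, doc):
    c = counts[doc][term]
    return math.log10(c + 1) * math.log10(N / df[term]) if c else 0.0

def score(query, doc):
    """Eq. 23.9: sum of tf-idf over the query terms, normalized by |d|."""
    length = math.sqrt(sum(tf_idf(t, doc) ** 2 for t in counts[doc]))
    return sum(tf_idf(t, doc) for t in query.split()) / length if length else 0.0

ranking = sorted(docs, key=lambda d: -score("sweet love", d))
print(ranking)  # [1, 3, 2, 4], the order of Fig. 23.3
```

The scores differ from Fig. 23.3 only in the last decimal places, since the figure works with values rounded to three digits.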
A slightly more complex variant in the tf-idf family is the BM25 weighting scheme (sometimes called Okapi BM25 after the Okapi IR system in which it was introduced (Robertson et al., 1995)). BM25 adds two parameters: k, a knob that adjusts the balance between term frequency and idf, and b, which controls the importance of document length normalization. The BM25 score of a document d given a query q is:

score(q,d) = Σ_{t ∈ q} log10(N / df_t) · [ tf_{t,d} / ( k(1 − b + b·(|d| / |d_avg|)) + tf_{t,d} ) ]    (23.10)

where |d_avg| is the length of the average document; the first factor is the idf and the second a length-weighted version of the tf. When k is 0, BM25 reverts to no use of term frequency, just a binary selection of terms in the query (plus idf). A large k results in raw term frequency (plus idf). b ranges from 1 (scaling by document length) to 0 (no length scaling). Manning et al. (2008) suggest reasonable values are k = [1.2, 2] and b = 0.75. Kamphuis et al. (2020) is a useful summary of the many minor variants of BM25.
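Eq. 23.10 can be transcribed directly. A sketch (names are ours; the parameter defaults follow the values suggested by Manning et al. (2008)):

```python
import math

def bm25(query_terms, doc_counts, doc_len, avg_len, N, df, k=1.2, b=0.75):
    """BM25 score of one document for a query (Eq. 23.10).
    doc_counts maps term -> count in this document; df maps term -> document
    frequency in the collection of N documents."""
    score = 0.0
    for t in query_terms:
        tf = doc_counts.get(t, 0)
        if tf == 0 or t not in df:
            continue  # an absent term contributes nothing to the sum
        idf = math.log10(N / df[t])
        length_norm = k * (1 - b + b * doc_len / avg_len)
        score += idf * tf / (length_norm + tf)
    return score

# Doc 1 of the nano collection ("sweet sweet nurse love"), N = 4 documents:
score1 = bm25(["sweet", "love"], {"sweet": 2, "love": 1},
              doc_len=4, avg_len=2.75, N=4, df={"sweet": 3, "love": 2})
```

With k = 0 the term-frequency factor reduces to tf/tf = 1 for any term present in the document, giving the binary-selection-plus-idf behavior described above.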
Stop words  In the past it was common to remove high-frequency words from both the query and document before representing them. The list of such high-frequency words to be removed is called a stop list. The intuition is that high-frequency terms (often function words like the, a, to) carry little semantic weight and may not help with retrieval; removing them can also help shrink the inverted index files we describe below. The downside of using a stop list is that it makes it difficult to search for phrases that contain words in the stop list. For example, common stop lists would reduce the phrase to be or not to be to the phrase not. In modern IR systems, the use of stop lists is much less common, partly due to improved efficiency and partly because much of their function is already handled by idf weighting, which downweights function words that occur in every document. Nonetheless, stop word removal is occasionally useful in various NLP tasks, so is worth keeping in mind.
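The to be or not to be example is easy to see concretely (the stop list below is a tiny illustrative sample, not any standard list):

```python
stop_list = {"the", "a", "to", "be", "or"}  # tiny illustrative sample

def remove_stopwords(tokens):
    # Keep only tokens that are not on the stop list.
    return [w for w in tokens if w not in stop_list]

print(remove_stopwords("to be or not to be".split()))  # ['not']
```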
23.1.3 Inverted Index

In order to compute scores, we need to efficiently find documents that contain words in the query. (As we saw in Fig. 23.3, any document that contains none of the query terms will have a score of 0 and can be ignored.) The basic search problem in IR is thus to find all documents d ∈ C that contain a term q ∈ Q. The data structure for this task is the inverted index, which we use for making this search efficient, and also for conveniently storing useful information like the document frequency and the count of each term in each document.

An inverted index, given a query term, gives a list of documents that contain the term. It consists of two parts, a dictionary and the postings. The dictionary is a list of terms (designed to be efficiently accessed), each pointing to a postings list for the term. A postings list is the list of document IDs associated with each term, which can also contain information like the term frequency or even the exact positions of terms in the document. The dictionary can also store the document frequency for each term. For example, a simple inverted index for our 4 sample documents above, with each word containing its document frequency in {}, and a pointer to a postings list that contains document IDs and term counts in [], might look like the following:

how {1}    → 3 [1]
is {1}     → 3 [1]
love {2}   → 1 [1] → 3 [1]
nurse {2}  → 1 [1] → 4 [1]
sorrow {1} → 2 [1]
sweet {3}  → 1 [2] → 2 [1] → 3 [1]

Given a list of terms in a query, we can very efficiently get lists of all candidate documents, together with the information necessary to compute the tf-idf scores we need.

There are alternatives to the inverted index. For the question-answering domain of finding Wikipedia pages to match a user query, Chen et al. (2017) show that indexing based on bigrams works better than unigrams, and use efficient hashing algorithms rather than the inverted index to make the search efficient.
23.1.4 Evaluation of Information-Retrieval Systems

We measure the performance of ranked retrieval systems using the same precision and recall metrics we have been using. We make the assumption that each document returned by the IR system is either relevant to our purposes or not relevant. Precision is the fraction of the returned documents that are relevant, and recall is the fraction of all relevant documents that are returned. More formally, let's assume a system returns T ranked documents in response to an information request, a subset R of these are relevant, a disjoint subset, N, are the remaining irrelevant documents, and U documents in the collection as a whole are relevant to this request. Precision and recall are then defined as:

Precision = |R| / |T|        Recall = |R| / |U|    (23.11)

Unfortunately, these metrics don't adequately measure the performance of a system that ranks the documents it returns. If we are comparing the performance of two ranked retrieval systems, we need a metric that prefers the one that ranks the relevant documents higher. We need to adapt precision and recall to capture how well a system does at putting relevant documents higher in the ranking.
Rank   Judgment   Precision_Rank   Recall_Rank
  1       R            1.0             .11
  2       N            .50             .11
  3       R            .66             .22
  4       N            .50             .22
  5       R            .60             .33
  6       R            .66             .44
  7       N            .57             .44
  8       R            .63             .55
  9       N            .55             .55
 10       N            .50             .55
 11       R            .55             .66
 12       N            .50             .66
 13       N            .46             .66
 14       N            .43             .66
 15       R            .47             .77
 16       N            .44             .77
 17       N            .41             .77
 18       R            .44             .88
 19       N            .42             .88
 20       N            .40             .88
 21       N            .38             .88
 22       N            .36             .88
 23       N            .35             .88
 24       N            .33             .88
 25       R            .36            1.0

Figure 23.4  Rank-specific precision and recall values calculated as we proceed down through a set of ranked documents (assuming the collection has 9 relevant documents).

Figure 23.5  The precision-recall curve for the data in Fig. 23.4.
Let's turn to an example. Assume the table in Fig. 23.4 gives rank-specific precision and recall values calculated as we proceed down through a set of ranked documents for a particular query; the precisions are the fraction of relevant documents seen at a given rank, and the recalls the fraction of relevant documents found at the same rank. The recall measures in this example are based on this query having 9 relevant documents in the collection as a whole.

Note that recall is non-decreasing; when a relevant document is encountered, recall increases, and when a non-relevant document is found it remains unchanged. Precision, on the other hand, jumps up and down, increasing when relevant documents are found, and decreasing otherwise. The most common way to visualize precision and recall is to plot precision against recall in a precision-recall curve, like the one shown in Fig. 23.5 for the data in Fig. 23.4.

Fig. 23.5 shows the values for a single query. But we'll need to combine values for all the queries, and in a way that lets us compare one system to another. One way of doing this is to plot averaged precision values at 11 fixed levels of recall (0 to 100, in steps of 10). Since we're not likely to have data points at these exact levels, we use interpolated precision values for the 11 recall values from the data points we do have. We can accomplish this by choosing the maximum precision value achieved at any level of recall at or above the one we're calculating. In other words,

IntPrecision(r) = max_{i >= r} Precision(i)    (23.12)

This interpolation scheme not only lets us average performance over a set of queries, but also helps smooth over the irregular precision values in the original data. It is designed to give systems the benefit of the doubt by assigning the maximum precision value achieved at higher levels of recall from the one being measured. Fig. 23.6 and Fig. 23.7 show the resulting interpolated data points from our example.
Interpolated Precision   Recall
        1.0               0.0
        1.0               .10
        .66               .20
        .66               .30
        .66               .40
        .63               .50
        .55               .60
        .47               .70
        .44               .80
        .36               .90
        .36              1.0

Figure 23.6  Interpolated data points from Fig. 23.4.
Figure 23.7  An 11 point interpolated precision-recall curve. Precision at each of the 11 standard recall levels is interpolated for each query from the maximum at any higher level of recall. The original measured precision-recall points are also shown.

Given curves such as that in Fig. 23.7 we can compare two systems or approaches by comparing their curves. Clearly, curves that are higher in precision across all recall values are preferred. However, these curves can also provide insight into the overall behavior of a system. Systems that are higher in precision toward the left may favor precision over recall, while systems that are more geared towards recall will be higher at higher levels of recall (to the right).

A second way to evaluate ranked retrieval is mean average precision (MAP), which provides a single metric that can be used to compare competing systems or approaches. In this approach, we again descend through the ranked list of items, but now we note the precision only at those points where a relevant item has been encountered (for example at ranks 1, 3, 5, 6, but not 2 or 4, in Fig. 23.4). For a single query, we average these individual precision measurements over the return set (up to some fixed cutoff). More formally, if we assume that R_r is the set of relevant documents at or above r, then the average precision (AP) for a single query is

AP = (1 / |R_r|) Σ_{d ∈ R_r} Precision_r(d)    (23.13)

where Precision_r(d) is the precision measured at the rank at which document d was found. For an ensemble of queries Q, we then average over these averages, to get our final MAP measure:

MAP = (1 / |Q|) Σ_{q ∈ Q} AP(q)    (23.14)

The MAP for the single query (hence = AP) in Fig. 23.4 is 0.6.
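The rank-by-rank values of Fig. 23.4, the interpolation of Eq. 23.12, and the AP of 0.6 can all be checked with a short script (a sketch; the judgment sequence is copied from Fig. 23.4):

```python
judgments = list("RNRNRRNRNNRNNNRNNRNNNNNNR")  # Fig. 23.4, ranks 1-25
TOTAL_RELEVANT = 9  # relevant documents in the whole collection

precision_at, recall_at = [], []
relevant_seen = 0
for rank, j in enumerate(judgments, start=1):
    if j == "R":
        relevant_seen += 1
    precision_at.append(relevant_seen / rank)
    recall_at.append(relevant_seen / TOTAL_RELEVANT)

def interpolated(r):
    """Eq. 23.12: max precision at any recall level at or above r."""
    return max(p for p, rec in zip(precision_at, recall_at) if rec >= r)

# Eq. 23.13: average the precisions at the ranks where relevant items occur
# (here all 9 relevant documents appear in the return set by rank 25).
ap = sum(p for p, j in zip(precision_at, judgments) if j == "R") / TOTAL_RELEVANT
print(round(ap, 2))  # 0.6, as stated in the text
```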
23.1.5 IR with Dense Vectors

The classic tf-idf or BM25 algorithms for IR have long been known to have a conceptual flaw: they work only if there is exact overlap of words between the query and document. In other words, the user posing a query (or asking a question) needs to guess exactly what words the writer of the answer might have used to discuss the issue. As Lin et al. (2020) put it, the user might decide to search for a tragic love story but Shakespeare writes instead about star-crossed lovers. This is called the vocabulary mismatch problem (Furnas et al., 1987).

The solution to this problem is to use an approach that can handle synonymy: instead of (sparse) word-count vectors, using (dense) embeddings. This idea was proposed quite early with the LSI approach (Deerwester et al., 1990), but modern methods all make use of encoders like BERT. In what is sometimes called a bi-encoder we use two separate encoder models, one to encode the query and one to encode the document, and use the dot product between these two vectors as the score (Fig. 23.8). For example, if we used BERT, we would have two encoders BERT_Q and BERT_D and we could represent the query and document as the [CLS] token of the respective encoders (Karpukhin et al., 2020):

h_q = BERT_Q(q)[CLS]
h_d = BERT_D(d)[CLS]
score(d,q) = h_q · h_d    (23.15)

Figure 23.8  BERT bi-encoder for computing relevance of a document to a query.

More complex versions can use other ways to represent the encoded text, such as using average pooling over the BERT outputs of all tokens instead of using the CLS token, or can add extra weight matrices after the encoding or dot product steps (Liu et al. 2016, Lee et al. 2019).

Using dense vectors for IR or the retriever component of question answerers is still an open area of research. Among the many areas of active research are how to do the fine-tuning of the encoder modules on the IR task (generally by fine-tuning on query-document combinations, with various clever ways to get negative examples), and how to deal with the fact that documents are often longer than encoders like BERT can process (generally by breaking up documents into passages).

Efficiency is also an issue. At the core of every IR engine is the need to rank every possible document for its similarity to the query. For sparse word-count vectors, the inverted index allows this very efficiently. For dense vector algorithms like those based on BERT or other Transformer encoders, finding the set of dense document vectors that have the highest dot product with a dense query vector is an example of nearest neighbor search. Modern systems therefore make use of approximate nearest neighbor vector search algorithms like Faiss (Johnson et al., 2017).
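The scoring step of the bi-encoder is just a dot product between the two pooled vectors; all the interesting work happens inside the trained encoders. A sketch of the mechanics, with a trivial hashed bag-of-words encoder standing in for BERT_Q and BERT_D (the encoder here is purely illustrative, not a real model):

```python
import hashlib
import math

DIM = 16

def encode(text):
    """Stand-in dense encoder: hash each token into one of DIM dimensions
    and L2-normalize. A real bi-encoder would use trained encoders."""
    v = [0.0] * DIM
    for tok in text.lower().split():
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % DIM
        v[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def score(hq, hd):
    """Eq. 23.15: relevance is the dot product of the two encodings."""
    return sum(a * b for a, b in zip(hq, hd))

docs = ["star-crossed lovers", "beekeeping"]
q = encode("star-crossed lovers tragedy")
ranked = sorted(docs, key=lambda d: -score(q, encode(d)))
print(ranked[0])
```

Of course this bag-of-words stand-in still suffers from the vocabulary mismatch problem; only a trained dense encoder can place tragic love story near star-crossed lovers.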
23.2 IR-based Factoid Question Answering

The goal of IR-based QA (sometimes called open domain QA) is to answer a user's question by finding short text segments from the web or some other large collection of documents. Figure 23.9 shows some sample factoid questions and their answers.

Question                                     Answer
Where is the Louvre Museum located           in Paris, France
What are the names of Odin's ravens          Huginn and Muninn
What kind of nuts are used in marzipan       almonds
What instrument did Max Roach play           drums
What's the official language of Algeria      Arabic

Figure 23.9  Some factoid questions and their answers.

The dominant paradigm for IR-based QA is the retrieve and read model shown in Fig. 23.10. In the first stage of this 2-stage model we retrieve relevant passages from a text collection, usually using a search engine of the type we saw in the previous section. In the second stage, a neural reading comprehension algorithm passes over each passage and finds spans that are likely to answer the question.

Figure 23.10  IR-based factoid question answering has two stages: retrieval, which returns relevant documents from the collection, and reading, in which a neural reading comprehension system extracts answer spans.

Some question answering systems focus only on the second task, the reading comprehension task. Reading comprehension systems are given a factoid question q and a passage p that could contain the answer, and return an answer s (or perhaps declare that there is no answer in the passage, or in some setups make a choice from a set of possible answers). Of course this setup does not match the information need of users who have a question they need answered (after all, if a user knew which passage contained the answer, they could just read it themselves). Instead, this task was originally modeled on children's reading comprehension tests (pedagogical instruments in which a child is given a passage to read and must answer questions about it) as a way to evaluate natural language understanding performance (Hirschman et al., 1999). Reading comprehension systems are still used that way, but have also evolved to function as the second stage of the modern retrieve and read model.

Other question answering systems address the entire retrieve and read task; they are given a factoid question and a large document collection (such as Wikipedia or a crawl of the web) and return an answer, usually a span of text extracted from a document. This task is often called open domain QA.

In the next few sections we'll lay out the various pieces of IR-based QA, starting with some commonly used datasets for both the reading comprehension and full QA tasks.
23.2.1 IR-based QA: Datasets

Datasets for IR-based QA are most commonly created by first developing reading comprehension datasets containing tuples of (passage, question, answer). Reading comprehension systems can use the datasets to train a reader that is given a passage and a question, and predicts a span in the passage as the answer. Including the passage from which the answer is to be extracted eliminates the need for reading comprehension systems to deal with IR.

For example the Stanford Question Answering Dataset (SQuAD) consists of passages from Wikipedia and associated questions whose answers are spans from the passage (Rajpurkar et al. 2016). SQuAD 2.0 in addition adds some questions that are designed to be unanswerable (Rajpurkar et al. 2018), with a total of just over 150,000 questions. Fig. 23.11 shows a (shortened) excerpt from a SQuAD 2.0 passage together with three questions and their gold answer spans.

SQuAD was built by having humans read a given Wikipedia passage, write questions about the passage, and choose a specific answer span. Other datasets are created by similar techniques but try to make the questions more complex. The HotpotQA dataset (Yang et al., 2018) was created by showing crowd workers multiple context documents and asking them to come up with questions that require reasoning about all of the documents.

Beyonce Giselle Knowles-Carter (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyonce's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

Q: "In what city and state did Beyonce grow up"
A: "Houston, Texas"
Q: "What areas did Beyonce compete in when she was growing up"
A: "singing and dancing"
Q: "When did Beyonce release Dangerously in Love"
A: "2003"

Figure 23.11  A (Wikipedia) passage from the SQuAD 2.0 dataset (Rajpurkar et al., 2018) with 3 sample questions and the labeled answer spans.

The fact that questions in datasets like SQuAD or HotpotQA are created by annotators who have first read the passage may make their questions easier to answer, since the annotator may (subconsciously) make use of words from the answer text. A common solution to this possible bias is to make datasets from questions that were not written with a passage in mind. The TriviaQA dataset (Joshi et al., 2017) contains 94K questions written by trivia enthusiasts, together with supporting documents from Wikipedia and the web, resulting in 650K question-answer-evidence triples.

The Natural Questions dataset (Kwiatkowski et al., 2019) incorporates real anonymized queries to the Google search engine. Annotators are presented a query, along with a Wikipedia page from the top 5 search results, and annotate a paragraph-length long answer and a short span answer, or mark null if the text doesn't contain the answer. For example the question "When are hops added to the brewing process" has the short answer the boiling process and a long answer which is the entire surrounding paragraph from the Wikipedia page on Brewing. In using this dataset, a reading comprehension model is given a question and a Wikipedia page and must return a long answer, short answer, or 'no answer' response.

The above datasets are all in English. The TyDi QA dataset contains 204K question-answer pairs from 11 typologically diverse languages, including Arabic, Bengali, Kiswahili, Russian, and Thai (Clark et al., 2020). In the TyDi QA task, a system is given a question and the passages from a Wikipedia article and must (a) select the passage containing the answer (or NULL if no passage contains the answer), and (b) mark the minimal answer span (or NULL). Many questions have no answer. The various languages in the dataset bring up challenges for QA systems like morphological variation between the question and the answer, or complex issues with word segmentation or multiple alphabets.

In the reading comprehension task, a system is given a question and the passage in which the answer should be found. In the full two-stage QA task, however, systems are not given a passage, but are required to do their own retrieval from some document collection. A common way to create open-domain QA datasets is to modify a reading comprehension dataset. For research purposes this is most commonly done by using QA datasets that annotate Wikipedia (like SQuAD or HotpotQA). For training, the entire (question, passage, answer) triple is used to train the reader. But at inference time, the passages are removed and the system is given only the question, together with access to the entire Wikipedia corpus. The system must then do IR to find a set of pages and then read them.
23.2.2 IR-based QA: Reader (Answer Span Extraction)

The first stage of IR-based QA is a retriever, for example of the type we saw in Section 23.1. The second stage of IR-based question answering is the reader. The reader's job is to take a passage as input and produce the answer. In the extractive QA we discuss here, the answer is a span of text in the passage.[5] For example given a question like "How tall is Mt. Everest" and a passage that contains the clause Reaching 29,029 feet at its summit, a reader will output 29,029 feet.

The answer extraction task is commonly modeled by span labeling: identifying in the passage a span (a continuous string of text) that constitutes an answer. Neural span algorithms for reading comprehension are given a question q of n tokens q_1,...,q_n and a passage p of m tokens p_1,...,p_m. Their goal is thus to compute the probability P(a|q,p) that each possible span a is the answer.

If each span a starts at position a_s and ends at position a_e, we make the simplifying assumption that this probability can be estimated as P(a|q,p) = P_start(a_s|q,p) P_end(a_e|q,p). Thus for each token p_i in the passage we'll compute two probabilities: p_start(i) that p_i is the start of the answer span, and p_end(i) that p_i is the end of the answer span.

A standard baseline algorithm for reading comprehension is to pass the question and passage to any encoder like BERT (Fig. 23.12), as strings separated with a [SEP] token, resulting in an encoding token embedding for every passage token p_i.

Figure 23.12  An encoder model (using BERT) for span-based question answering from reading-comprehension-based question answering tasks.

For span-based question answering, we represent the question as the first sequence and the passage as the second sequence. We'll also need to add a linear layer that will be trained in the fine-tuning phase to predict the start and end position of the span. We'll add two new special vectors: a span-start embedding S and a span-end embedding E, which will be learned in fine-tuning. To get a span-start probability for each output token p_i, we compute the dot product between S and p_i and then use a softmax to normalize over all tokens p_i in the passage:

P_start_i = exp(S · p_i) / Σ_j exp(S · p_j)    (23.16)

We do the analogous thing to compute a span-end probability:

P_end_i = exp(E · p_i) / Σ_j exp(E · p_j)    (23.17)

The score of a candidate span from position i to j is S · p_i + E · p_j, and the highest scoring span in which j ≥ i is chosen as the model prediction. The training loss for fine-tuning is the negative sum of the log-likelihoods of the correct start and end positions for each instance:

L = −log P_start_i − log P_end_i    (23.18)

where i here indexes the correct start and end positions.

[5] Here we skip the more difficult task of abstractive QA, in which the system can write an answer which is not drawn exactly from the passage.
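The span-selection computation of Eqs. 23.16-23.17 can be sketched as a pair of softmaxes plus an argmax over spans with j ≥ i. The toy vectors below stand in for the encoder's output embeddings and the learned S and E (all numbers are invented for illustration):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def read_span(S, E, token_vecs):
    """Pick the highest-scoring span (i, j) with j >= i."""
    start_scores = [dot(S, p) for p in token_vecs]
    end_scores = [dot(E, p) for p in token_vecs]
    p_start = softmax(start_scores)  # Eq. 23.16
    p_end = softmax(end_scores)      # Eq. 23.17
    best, best_score = None, float("-inf")
    for i in range(len(token_vecs)):
        for j in range(i, len(token_vecs)):  # enforce j >= i
            if start_scores[i] + end_scores[j] > best_score:
                best, best_score = (i, j), start_scores[i] + end_scores[j]
    return best, p_start, p_end

# Toy 2-dimensional "encoder outputs" for a 4-token passage: token 1 looks
# start-like, token 2 end-like.
tokens = [[0.1, 0.0], [2.0, 0.1], [0.1, 2.0], [0.0, 0.1]]
S, E = [1.0, 0.0], [0.0, 1.0]
span, p_start, p_end = read_span(S, E, tokens)
print(span)  # (1, 2)
```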
Many datasets (like SQuAD 2.0 and Natural Questions) also contain (question, passage) pairs in which the answer is not contained in the passage. We thus also need a way to estimate the probability that the answer to a question is not in the document. This is standardly done by treating questions with no answer as having the [CLS] token as the answer, and hence the answer span start and end index will point at [CLS] (Devlin et al., 2019).

For many datasets we also need to handle the situation where the annotated documents/passages are longer than the maximum 512 input tokens BERT allows. Consider for example cases like Natural Questions, where the gold-labeled passages are full Wikipedia pages. In such cases we can create multiple pseudo-passage observations from the labeled Wikipedia page. Each observation is formed by concatenating [CLS], the question, [SEP], and tokens from the document. We walk through the document, sliding a window of size 512 (or rather, 512 minus the question length n minus special tokens) and packing the window of tokens into each next pseudo-passage. The answer span for the observation is either labeled [CLS] (= no answer in this particular window) or the gold-labeled span is marked. Alberti et al. (2019) suggest also allowing the windows to overlap, by using a stride of 128 tokens.

The same process can be used for inference, breaking up each retrieved document into separate observation passages and labeling each observation. The answer can be chosen as the span with the highest probability (or nil if no span is more probable than [CLS]). Or Alberti et al. (2019) suggest normalizing each score g(s,e) for a span of start s and end e by the nil score:

g(s,e) = (log P_start_s + log P_end_e) − (log P_start_CLS + log P_end_CLS)    (23.19)
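The sliding-window construction of pseudo-passages can be sketched as follows (a simplification that slides over whole tokens and skips the gold-span bookkeeping; names are ours):

```python
def make_windows(question, doc_tokens, max_len=512, stride=128):
    """Build overlapping pseudo-passages of the form
    [CLS] question [SEP] window [SEP], with the overlapping stride
    suggested by Alberti et al. (2019)."""
    window = max_len - len(question) - 3  # room left after [CLS] and two [SEP]s
    passages, start = [], 0
    while True:
        chunk = doc_tokens[start:start + window]
        passages.append(["[CLS]"] + question + ["[SEP]"] + chunk + ["[SEP]"])
        if start + window >= len(doc_tokens):
            break  # this window reached the end of the document
        start += stride
    return passages

doc = [f"w{i}" for i in range(1000)]  # stand-in 1000-token document
ps = make_windows(["when", "was", "ada", "lovelace", "born"], doc)
print(len(ps), len(ps[0]))  # 5 512
```

Each pseudo-passage would then be labeled [CLS] (no answer) unless the gold span falls inside its window.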
23.3 Entity Linking

We've now seen the first major paradigm for question answering, IR-based QA. Before we turn to the second major paradigm, knowledge-based question answering, we introduce the important core technology of entity linking, since it is required for any knowledge-based QA algorithm. Entity linking is the task of associating a mention in text with the representation of some real-world entity in an ontology (Ji and Grishman, 2011).

The most common ontology for factoid question-answering is Wikipedia, since Wikipedia is often the source of the text that answers the question. In this usage, each unique Wikipedia page acts as the unique id for a particular entity. This task of deciding which Wikipedia page corresponding to an individual is being referred to by a text mention has its own name: wikification (Mihalcea and Csomai, 2007).

Since the earliest systems (Mihalcea and Csomai 2007, Cucerzan 2007, Milne and Witten 2008), entity linking is done in (roughly) two stages: mention detection and mention disambiguation. We'll give two algorithms, one simple classic baseline that uses anchor dictionaries and information from the Wikipedia graph structure (Ferragina and Scaiella, 2011) and one modern neural algorithm (Li et al., 2020). We'll focus here mainly on the application of entity linking to questions rather than other genres.
23.
3.
1LinkingbasedonAnchorDictionariesandWebGraphAsasimplebaselineweintroducetheTAGMElinker(FerraginaandScaiella,2011)forWikipedia,whichitselfdrawsonearlieralgorithms(MihalceaandCsomai2007,Cucerzan2007,MilneandWitten2008).
Wikification algorithms define the set of entities as the set of Wikipedia pages, so we'll refer to each Wikipedia page as a unique entity e. TAGME first creates a catalog of all entities (i.e., all Wikipedia pages, removing some disambiguation and other meta-pages) and indexes them in a standard IR engine like Lucene. For each page e, the algorithm computes an in-link count in(e): the total number of in-links from other Wikipedia pages that point to e. These counts can be derived from Wikipedia dumps.

Finally, the algorithm requires an anchor dictionary. An anchor dictionary lists, for each Wikipedia page, its anchor texts: the hyperlinked spans of text on other pages that point to it. For example, the web page for Stanford University, http://www.stanford.edu, might be pointed to from another page using anchor texts like Stanford or Stanford University. We compute a Wikipedia anchor dictionary by including, for each Wikipedia page e, e's title as well as all the anchor texts from all Wikipedia pages that point to e. For each anchor string a we also compute its total frequency freq(a) in Wikipedia (including non-anchor uses), the number of times a occurs as a link (which we'll call link(a)), and its link probability linkprob(a) = link(a)/freq(a). Some cleanup of the final anchor dictionary is required, for example removing anchor strings composed only of numbers or single characters, that are very rare, or that are very unlikely to be useful entities because they have a very low linkprob.
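The dictionary statistics above can be sketched in a few lines; the counts below are hypothetical toy numbers (in practice they would come from a Wikipedia dump), and the 0.01 cutoff is an illustrative threshold, not TAGME's actual value:

```python
from collections import Counter

# Hypothetical toy counts derived from a corpus: total occurrences of each
# string, and how many of those occurrences were hyperlink anchors.
freq = Counter({"stanford": 1000, "ada lovelace": 120, "born": 50000})
link = Counter({"stanford": 300, "ada lovelace": 100, "born": 2})

def linkprob(anchor):
    """linkprob(a) = link(a) / freq(a): how often the string is a hyperlink."""
    return link[anchor] / freq[anchor] if freq[anchor] else 0.0

# Cleanup heuristic: drop anchor strings that are almost never used as links.
anchor_dict = {a for a in freq if linkprob(a) >= 0.01}
```

Here "born" is filtered out (linkprob 2/50000), while "stanford" and "ada lovelace" survive.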
Mention Detection  Given a question (or other text we are trying to link), TAGME detects mentions by querying the anchor dictionary for each token sequence up to 6 words. This large set of sequences is pruned with some simple heuristics (for example pruning substrings if they have small linkprobs). For the question "When was Ada Lovelace born", this process might give rise to the anchor Ada Lovelace and possibly Ada, but a substring span like Lovelace might be pruned as having too low a linkprob, and a span like born has such a low linkprob that it would not be in the anchor dictionary at all.
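A minimal sketch of this detection step, assuming a lowercased anchor dictionary like the one built above:

```python
def detect_mentions(tokens, anchor_dict, max_len=6):
    """Return all token spans, up to max_len words long, whose (lowercased)
    text appears in the anchor dictionary, as (start, end, span) triples."""
    mentions = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            span = " ".join(tokens[i:j]).lower()
            if span in anchor_dict:
                mentions.append((i, j, span))
    return mentions
```

On "When was Ada Lovelace born" with a toy dictionary containing "ada" and "ada lovelace", both candidate anchors are detected; pruning by linkprob would then apply as described above.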
Mention Disambiguation  If a mention span is unambiguous (points to only one entity/Wikipedia page), we are done with entity linking! However, many spans are ambiguous, matching anchors for multiple Wikipedia entities/pages. The TAGME algorithm uses two factors for disambiguating ambiguous spans, which have been referred to as prior probability and relatedness/coherence. The first factor is p(e|a), the probability with which the span refers to a particular entity. For each page e ∈ E(a), the probability p(e|a) that anchor a points to e is the ratio of the number of links into e with anchor text a to the total number of occurrences of a as an anchor:

    prior(a → e) = p(e|a) = count(a → e) / link(a)    (23.20)

Let's see how that factor works in linking entities in the following question:

    What Chinese Dynasty came before the Yuan?

The most common association for the span Yuan in the anchor dictionary is the name of the Chinese currency, i.e., the probability p(Yuan currency | yuan) is very high. Rarer Wikipedia associations for Yuan include the common Chinese last name, a language spoken in Thailand, and the correct entity in this case, the name of the Chinese dynasty. So if we chose based only on p(e|a), we would make the wrong disambiguation and miss the correct link, Yuan dynasty.

To help in just this sort of case, TAGME uses a second factor, the relatedness of this entity to other entities in the input question. In our example, the fact that the question also contains the span Chinese Dynasty, which has a high-probability link to the page Dynasties in Chinese history, ought to help match Yuan dynasty.
Let's see how this works. Given a question q, for each candidate anchor span a detected in q, we assign a relatedness score to each possible entity e ∈ E(a) of a. The relatedness score of the link a → e is the weighted average relatedness between e and all other entities in q. Two entities are considered related to the extent their Wikipedia pages share many in-links. More formally, the relatedness between two entities A and B is computed as

    rel(A, B) = [log(max(|in(A)|, |in(B)|)) - log(|in(A) ∩ in(B)|)] / [log(|W|) - log(min(|in(A)|, |in(B)|))]    (23.21)

where in(x) is the set of Wikipedia pages pointing to x and W is the set of all Wikipedia pages in the collection.

The vote given by anchor b to the candidate annotation a → X is the average, over all the possible entities of b, of their relatedness to X, weighted by their prior probability:

    vote(b, X) = (1/|E(b)|) Σ_{Y ∈ E(b)} rel(X, Y) p(Y|b)    (23.22)

The total relatedness score for a → X is the sum of the votes of all the other anchors detected in q:

    relatedness(a → X) = Σ_{b ∈ X_q \ a} vote(b, X)    (23.23)

To score a → X, we combine relatedness and prior by choosing the entity X that has the highest relatedness(a → X), finding other entities within a small ε of this value, and from this set, choosing the entity with the highest prior p(X|a). The result of this step is a single entity assigned to each span in q.
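The voting scheme of Eqs. 23.21-23.23 can be sketched as follows. This simplified version picks each anchor's entity by total vote alone, omitting the ε/prior tie-break described above; the in-link sets, priors, and collection size are hypothetical toy values, and relatedness is taken to be zero when two pages share no in-links:

```python
import math

def rel(inlinks, W, A, B):
    """Relatedness between entities A and B, following Eq. 23.21;
    defined as 0 when the pages share no in-links."""
    a, b = inlinks[A], inlinks[B]
    common = len(a & b)
    if common == 0:
        return 0.0
    return (math.log(max(len(a), len(b))) - math.log(common)) / \
           (math.log(W) - math.log(min(len(a), len(b))))

def disambiguate(anchors, inlinks, priors, W):
    """Assign each anchor the candidate entity with the highest total vote:
    every other anchor votes with its prior-weighted average relatedness."""
    result = {}
    for a, cands in anchors.items():
        def total(X, a=a):
            score = 0.0
            for b, bcands in anchors.items():
                if b == a:
                    continue
                score += sum(rel(inlinks, W, X, Y) * priors[b][Y]
                             for Y in bcands) / len(bcands)
            return score
        result[a] = max(cands, key=total)
    return result
```

On a toy version of the Yuan example, the vote from the unambiguous anchor "chinese dynasty" pulls the choice to the dynasty entity even though the currency has the higher prior.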
The TAGME algorithm has one further step of pruning spurious anchor/entity pairs, assigning a score averaging link probability with the coherence, where S is the set of entities assigned to the anchors in q:

    coherence(a → X) = (1/(|S| - 1)) Σ_{B ∈ S \ X} rel(B, X)
    score(a → X) = (coherence(a → X) + linkprob(a)) / 2    (23.24)

Finally, pairs are pruned if score(a → X) falls below a threshold.

[Figure: The T5 system is an encoder-decoder architecture. In pretraining it learns to fill in masked spans of text: the deleted spans are marked by sentinel tokens, and the system is trained to have the decoder generate the missing spans (separated by sentinel tokens). It is then fine-tuned on QA datasets, given the question, without adding any additional context or passages. Figure from Roberts et al. (2020).]
Roberts et al. (2020) then fine-tune the T5 system on the question answering task, giving it a question and training it to output the answer text in the decoder. The largest 11-billion-parameter T5 model does competitively, although not quite as well as systems designed specifically for question answering. Language modeling is not yet a complete solution for question answering; in addition to not working quite as well, such models suffer from poor interpretability (unlike standard QA systems, for example, they currently can't give users more context by telling them what passage the answer came from). Nonetheless, the study of extracting answers from language models is an intriguing area for future question answering research.
23.6 Classic QA Models

While neural architectures are the state of the art for question answering, pre-neural architectures using hybrids of rules and feature-based classifiers can sometimes achieve higher performance. Here we summarize one influential classic system, the Watson DeepQA system from IBM that won the Jeopardy! challenge in 2011 (Fig. 23.17). Let's consider how it handles these Jeopardy! examples, each with a category followed by a question:

    POETS AND POETRY: He was a bank clerk in the Yukon before he published "Songs of a Sourdough" in 1907.

    THEATRE: A new play based on this Sir Arthur Conan Doyle canine classic opened on the London stage in 2007.
Question Processing  In this stage the questions are parsed, named entities are extracted (Sir Arthur Conan Doyle identified as a PERSON, Yukon as a GEOPOLITICAL ENTITY, "Songs of a Sourdough" as a COMPOSITION), and coreference is run (he is linked with clerk).

The question focus, shown in bold in both examples, is extracted. The focus is the string of words in the question that co-refers with the answer. It is likely to be replaced by the answer in any answer string found and so can be used to align with a supporting passage.

Figure 23.17  The 4 broad stages of Watson QA: (1) Question Processing, (2) Candidate Answer Generation, (3) Candidate Answer Scoring, and (4) Answer Merging and Confidence Scoring.

In DeepQA the focus is extracted by handwritten rules (made possible by the relatively stylized syntax of Jeopardy! questions), such as a rule extracting any noun phrase with determiner "this" as in the Conan Doyle example, and rules extracting pronouns like she, he, hers, him, as in the poet example.

The lexical answer type (shown in blue above) is a word or words which tell us something about the semantic type of the answer. Because of the wide variety of questions in Jeopardy!, DeepQA chooses a wide variety of words to be answer types, rather than a small set of named entities. These lexical answer types are again extracted by rules: the default rule is to choose the syntactic headword of the focus. Other rules improve this default choice. For example, additional lexical answer types can be words in the question that are coreferent with or have a particular syntactic relation with the focus, such as headwords of appositives or predicative nominatives of the focus. In some cases even the Jeopardy! category can act as a lexical answer type, if it refers to a type of entity that is compatible with the other lexical answer types. Thus in the first case above, he, poet, and clerk are all lexical answer types. In addition to using the rules directly as a classifier, they can instead be used as features in a logistic regression classifier that can return a probability as well as a lexical answer type. These answer types will be used in the later 'candidate answer scoring' phase as a source of evidence for each candidate.

Relations like the following are also extracted:

    authorof(focus, "Songs of a sourdough")
    publish(e1, he, "Songs of a sourdough")
    in(e2, e1, 1907)
    temporallink(publish(...), 1907)

Finally the question is classified by type (definition question, multiple-choice, puzzle, fill-in-the-blank). This is generally done by writing pattern-matching regular expressions over words or parse trees.
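As a sketch of this pattern-matching style, here is a minimal rule-based question typer; the regular expressions are hypothetical illustrations (DeepQA's actual rules are far more elaborate and also match over parse trees):

```python
import re

# Hypothetical patterns in the spirit of DeepQA's rule-based question typing.
QUESTION_TYPES = [
    ("definition", re.compile(r"^(what|who) (is|are|was|were)\b", re.I)),
    ("fill-in-the-blank", re.compile(r"_{2,}|\bblank\b", re.I)),
    ("multiple-choice", re.compile(r"[;(]\s*[A-D]\s*[).]")),
]

def classify_question(q):
    """Return the first matching question type, defaulting to 'factoid'."""
    for label, pattern in QUESTION_TYPES:
        if pattern.search(q):
            return label
    return "factoid"
```

A declarative-form Jeopardy! clue falls through to the "factoid" default, while "What is ..." questions match the definition pattern.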
Candidate Answer Generation  Next we combine the processed question with external documents and other knowledge sources to suggest many candidate answers from both text documents and structured knowledge bases. We can query structured resources like DBpedia or IMDB with the relation and the known entity, just as we saw in Section 23.4. Thus if we have extracted the relation authorof(focus, "Songs of a sourdough"), we can query a triple store with authorof(x, "Songs of a sourdough") to return an author.

To extract answers from text, DeepQA uses simple versions of Retrieve and Read. For example, for the IR stage, DeepQA generates a query from the question by eliminating stop words and then upweighting any terms which occur in any relation with the focus. For example, from this query:

    MOVIE-"ING": Robert Redford and Paul Newman starred in this depression-era grifter flick. (Answer: "The Sting")

the following weighted query might be passed to a standard IR system:

    (2.0 Robert Redford) (2.0 Paul Newman) star depression era grifter (1.5 flick)

DeepQA also makes use of the convenient fact that the vast majority of Jeopardy! answers are the title of a Wikipedia document. To find these titles, we can do a second text retrieval pass specifically on Wikipedia documents. Then instead of extracting passages from the retrieved Wikipedia document, we directly return the titles of the highly ranked retrieved documents as the possible answers.

Once we have a set of passages, we need to extract candidate answers. If the document happens to be a Wikipedia page, we can just take the title, but for other texts, like news documents, we need other approaches. Two common approaches are to extract all anchor texts in the document (anchor text is the hyperlinked text used to point to a URL in an HTML page), or to extract all noun phrases in the passage that are Wikipedia document titles.
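The weighted-query construction above can be sketched as follows; the stopword list and the 2.0 weight are hypothetical illustrations, not DeepQA's actual values:

```python
# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"and", "in", "this", "the", "a", "an", "of", "to"}

def weighted_query(tokens, focus_relation_terms, weight=2.0):
    """Drop stopwords, and upweight terms that occur in a relation with the
    focus; returns a list of (weight, term) pairs for an IR engine."""
    terms = []
    for tok in tokens:
        w = tok.lower()
        if w in STOPWORDS:
            continue
        terms.append((weight, tok) if w in focus_relation_terms else (1.0, tok))
    return terms
```

Applied to the grifter-flick clue with the actor names marked as relation terms, the names come out upweighted, the stopwords are gone, and the remaining content words keep weight 1.0.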
Candidate Answer Scoring  Next DeepQA uses many sources of evidence to score each candidate. This includes a classifier that scores whether the candidate answer can be interpreted as a subclass or instance of the potential answer type. Consider the candidate "difficulty swallowing" and the lexical answer type "manifestation". DeepQA first matches each of these words with possible entities in ontologies like DBpedia and WordNet. Thus the candidate "difficulty swallowing" is matched with the DBpedia entity "Dysphagia", and then that instance is mapped to the WordNet type "Symptom". The answer type "manifestation" is mapped to the WordNet type "Condition". The system then looks for a hyponymy or synonymy link, in this case finding hyponymy between "Symptom" and "Condition".

Other scorers are based on using time and space relations extracted from DBpedia or other structured databases. For example, we can extract temporal properties of the entity (when a person was born, when they died) and then compare these to time expressions in the question. If a time expression in the question occurs chronologically before a person was born, that would be evidence against this person being the answer to the question.

Finally, we can use text retrieval to help retrieve evidence supporting a candidate answer. We can retrieve passages with terms matching the question, then replace the focus in the question with the candidate answer and measure the overlapping words or ordering of the passage with the modified question.

The output of this stage is a set of candidate answers, each with a vector of scoring features.
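The retrieval-evidence scorer just described can be sketched in its simplest bag-of-words form (DeepQA also considers word ordering, which this sketch omits):

```python
def overlap_evidence(question, focus, candidate, passage):
    """Replace the focus token in the question with the candidate answer,
    then count overlapping word types with a retrieved passage."""
    modified = [candidate if tok == focus else tok for tok in question]
    q_set = {t.lower() for t in modified}
    p_set = {t.lower() for t in passage}
    return len(q_set & p_set)
```

For the poet example, substituting a candidate answer for the focus "he" and comparing against a supporting passage yields a high overlap count, which becomes one feature in the candidate's scoring vector.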
Answer Merging and Scoring  DeepQA finally merges equivalent candidate answers. Thus if we had extracted two candidate answers J.F.K. and John F. Kennedy, this stage would merge the two into a single candidate, for example using the anchor dictionaries described above for entity linking, which will list many synonyms for Wikipedia titles (e.g., JFK, John F. Kennedy, Senator John F. Kennedy, President Kennedy, Jack Kennedy). We then merge the evidence for each variant, combining the scoring feature vectors for the merged candidates into a single vector.
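This merging step can be sketched as follows, assuming a hypothetical alias dictionary mapping each surface variant to a canonical name (in DeepQA this mapping comes from the anchor dictionaries), and taking "combining" to mean element-wise summation of the feature vectors:

```python
def merge_candidates(candidates, aliases):
    """Merge equivalent answer variants and sum their feature vectors.

    candidates: {surface_form: feature_vector (list of floats)}
    aliases:    {surface_form: canonical_name}
    """
    merged = {}
    for name, feats in candidates.items():
        canon = aliases.get(name, name)  # unknown forms are their own canon
        if canon in merged:
            merged[canon] = [x + y for x, y in zip(merged[canon], feats)]
        else:
            merged[canon] = list(feats)
    return merged
```

Two Kennedy variants with separate evidence vectors collapse into one candidate with the combined vector.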
Now we have a set of candidates, each with a feature vector. A classifier takes each feature vector and assigns a confidence value to this candidate answer. The classifier is trained on thousands of candidate answers, each labeled for whether it is correct or incorrect, together with their feature vectors, and learns to predict a probability of being a correct answer. Since, in training, there are far more incorrect answers than correct answers, we need to use one of the standard techniques for dealing with very imbalanced data. DeepQA uses instance weighting, assigning an instance weight of .5 for each incorrect answer example in training. The candidate answers are then sorted by this confidence value, resulting in a single best answer.

DeepQA's fundamental intuition is thus to propose a very large number of candidate answers from both text-based and knowledge-based sources and then to use a rich variety of evidence features for scoring these candidates. See the papers mentioned at the end of the chapter for more details.
23.7 Evaluation of Factoid Answers

A common evaluation metric for factoid question answering, introduced in the TREC Q/A track in 1999, is mean reciprocal rank, or MRR. MRR assumes a test set of questions that have been human-labeled with correct answers. MRR also assumes that systems are returning a short ranked list of answers or passages containing answers. Each question is then scored according to the reciprocal of the rank of the first correct answer. For example, if the system returned five answers but the first three are wrong and hence the highest-ranked correct answer is ranked fourth, the reciprocal rank score for that question would be 1/4. Questions with return sets that do not contain any correct answers are assigned a zero. The score of a system is then the average of the score for each question in the set. More formally, for an evaluation of a system returning a set of ranked answers for a test set consisting of N questions, the MRR is defined as

    MRR = (1/N) Σ_{i=1 s.t. rank_i ≠ 0}^{N} 1/rank_i    (23.35)

Reading comprehension systems on datasets like SQuAD are often evaluated using two metrics, both ignoring punctuation and articles (a, an, the) (Rajpurkar et al., 2016):

- Exact match: The percentage of predicted answers that match the gold answer exactly.
- F1 score: The average overlap between predicted and gold answers. Treat the prediction and gold as a bag of tokens, and compute F1, averaging the F1 over all questions.
A number of test sets are available for question answering. Early systems used the TREC QA dataset; questions and handwritten answers for TREC competitions from 1999 to 2004 are publicly available. More recent competitions use the various datasets described in Section 23.2.1.

There are a wide variety of datasets for training and testing reading comprehension/answer extraction in addition to the datasets discussed on page 11. Some take their structure from the fact that reading comprehension tasks designed for children tend to be multiple choice, with the task being to choose among the given answers. The MCTest dataset uses this structure, with 500 fictional short stories created by crowd workers with questions and multiple choice answers (Richardson et al., 2013). The AI2 Reasoning Challenge (ARC) (Clark et al., 2018) has questions that are designed to be hard to answer from simple lexical methods:

    Which property of a mineral can be determined just by looking at it?
    (A) luster [correct]  (B) mass  (C) weight  (D) hardness

This ARC example is difficult because the correct answer luster is unlikely to co-occur frequently on the web with phrases like looking at it, while the word mineral is highly associated with the incorrect answer hardness.
Bibliographical and Historical Notes

Question answering was one of the earliest NLP tasks, and early versions of the text-based and knowledge-based paradigms were developed by the very early 1960s. The text-based algorithms generally relied on simple parsing of the question and of the sentences in the document, and then looking for matches. This approach was used very early on (Phillips, 1960), but perhaps the most complete early system, and one that strikingly prefigures modern relation-based systems, was the Protosynthex system of Simmons et al. (1964). Given a question, Protosynthex first formed a query from the content words in the question, and then retrieved candidate answer sentences in the document, ranked by their frequency-weighted term overlap with the question. The query and each retrieved sentence were then parsed with dependency parsers, and the sentence whose structure best matched the question structure was selected. Thus the question What do worms eat would match worms eat grass: both have the subject worms as a dependent of eat, in the version of dependency grammar used at the time, while birds eat worms has birds as the subject:

    What do worms eat    Worms eat grass    Birds eat worms

The alternative knowledge-based paradigm was implemented in the BASEBALL system (Green et al., 1961). This system answered questions about baseball games like "Where did the Red Sox play on July 7" by querying a structured database of game information. The database was stored as a kind of attribute-value matrix with values for attributes of each game:

    Month = July          Place = Boston
    Day = 7               Game Serial No. = 96
    (Team = Red Sox, Score = 5)
    (Team = Yankees, Score = 3)

Each question was constituency-parsed using the algorithm of Zellig Harris's TDAP project at the University of Pennsylvania, essentially a cascade of finite-state transducers (see the historical discussion in Joshi and Hopely 1999 and Karttunen 1999). Then in a content analysis phase each word or phrase was associated with a program that computed parts of its meaning. Thus the phrase 'Where' had code to assign the semantics Place = ?, with the result that the question "Where did the Red Sox play on July 7" was assigned the meaning

    Place = ?
    Team = Red Sox
    Month = July
    Day = 7

The question is then matched against the database to return the answer. Simmons (1965) summarizes other early QA systems.
Another important progenitor of the knowledge-based paradigm for question answering is work that used predicate calculus as the meaning representation language. The LUNAR system (Woods et al. 1972, Woods 1978) was designed to be a natural language interface to a database of chemical facts about lunar geology. It could answer questions like Do any samples have greater than 13 percent aluminum by parsing them into a logical form:

    (TEST (FOR SOME X16 / (SEQ SAMPLES) : T ;
          (CONTAIN' X16 (NPR* X17 / (QUOTE AL203)) (GREATERTHAN 13 PCT))))

The rise of the web brought the information-retrieval paradigm for question answering to the forefront. The U.S. government-sponsored TREC (Text REtrieval Conference) evaluations, run annually since 1992, provide a testbed for evaluating information-retrieval tasks and techniques. TREC provides large document sets for both training and testing, along with uniform scoring systems (Voorhees and Harman, 2005). Details of all of the meetings can be found at the TREC page on the National Institute of Standards and Technology website. TREC added an influential QA track in 1999, which led to a wide variety of factoid and non-factoid systems competing in annual evaluations.
At that same time, Hirschman et al. (1999) introduced the idea of using children's reading comprehension tests to evaluate machine text comprehension algorithms. They acquired a corpus of 120 passages with 5 questions each, designed for 3rd-6th grade children, built an answer extraction system, and measured how well the answers given by their system corresponded to the answer key from the test's publisher. Their algorithm focused on word overlap as a feature; later algorithms added named entity features and more complex similarity between the question and the answer span (Riloff and Thelen 2000, Ng et al. 2000).

Neural reading comprehension systems drew on the insight of these early systems that answer finding should focus on question-passage similarity. Many of the architectural outlines of modern systems were laid out in early work like Hermann et al. (2015), Chen et al. (2017), and Seo et al. (2017).
TBD: More recent QA history.

The DeepQA component of the Watson system that won the Jeopardy! challenge is described in a series of papers in volume 56 of the IBM Journal of Research and Development; see for example Ferrucci (2012). Other question-answering tasks include Quiz Bowl, which has timing considerations since the question can be interrupted (Boyd-Graber et al., 2018). Question answering is also an important function of modern personal assistant dialog systems; see Chapter 24 for more.
Exercises

Alberti, C., Lee, K., and Collins, M. (2019). A BERT baseline for the Natural Questions. http://arxiv.org/abs/1901.08634.
Berant, J., Chou, A., Frostig, R., and Liang, P. (2013). Semantic parsing on Freebase from question-answer pairs. EMNLP.
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009). DBpedia: A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web 7(3), 154-165.
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. (2008). Freebase: a collaboratively created graph database for structuring human knowledge. SIGMOD 2008.
Bordes, A., Usunier, N., Chopra, S., and Weston, J. (2015). Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.
Boyd-Graber, J., Feng, S., and Rodriguez, P. (2018). Human-computer question answering: The case for quizbowl. In Escalera, S. and Weimer, M. (Eds.), The NIPS '17 Competition: Building Intelligent Systems. Springer.
Chen, D., Fisch, A., Weston, J., and Bordes, A. (2017). Reading Wikipedia to answer open-domain questions. ACL.
Clark, J. H., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V., and Palomaki, J. (2020). TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. arXiv preprint arXiv:2003.05002.
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. (2018). Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
Clark, P., Etzioni, O., Khashabi, D., Khot, T., Mishra, B. D., Richardson, K., Sabharwal, A., Schoenick, C., Tafjord, O., Tandon, N., Bhakthavatsalam, S., Groeneveld, D., Guerquin, M., and Schmitz, M. (2019). From 'F' to 'A' on the NY Regents Science Exams: An overview of the Aristo project. arXiv preprint arXiv:1909.01958.
Cucerzan, S. (2007). Large-scale named entity disambiguation based on Wikipedia data. EMNLP/CoNLL.
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. (1990). Indexing by latent semantic analysis. JASIS 41(6), 391-407.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT.
Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. (2019). DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. NAACL HLT.
Ferragina, P. and Scaiella, U. (2011). Fast and accurate annotation of short texts with Wikipedia pages. IEEE Software 29(1), 70-75.
Ferrucci, D. A. (2012). Introduction to "This is Watson". IBM Journal of Research and Development 56(3/4), 1:1-1:15.
Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. (1987). The vocabulary problem in human-system communication. Communications of the ACM 30(11), 964-971.
Green, B. F., Wolf, A. K., Chomsky, C., and Laughery, K. (1961). Baseball: An automatic question answerer. Proceedings of the Western Joint Computer Conference 19.
Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. (2015). Teaching machines to read and comprehend. NeurIPS.
Hirschman, L., Light, M., Breck, E., and Burger, J. D. (1999). Deep Read: A reading comprehension system. ACL.
Iyer, S., Konstas, I., Cheung, A., Krishnamurthy, J., and Zettlemoyer, L. (2017). Learning a neural semantic parser from user feedback. ACL.
Ji, H. and Grishman, R. (2011). Knowledge base population: Successful approaches and challenges. ACL.
Jiang, K., Wu, D., and Jiang, H. (2019). FreebaseQA: A new factoid QA dataset matching trivia-style question-answer pairs with Freebase. NAACL HLT.
Johnson, J., Douze, M., and Jégou, H. (2017). Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.
Joshi, A. K. and Hopely, P. (1999). A parser from antiquity. In Kornai, A. (Ed.), Extended Finite State Models of Language, 6-15. Cambridge University Press.
Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ACL.
Jurafsky, D. (2014). The Language of Food. W. W. Norton, New York.
Kamphuis, C., de Vries, A. P., Boytsov, L., and Lin, J. (2020). Which BM25 do you mean? A large-scale reproducibility study of scoring variants. European Conference on Information Retrieval.
Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W.-t. (2020). Dense passage retrieval for open-domain question answering. EMNLP.
Karttunen, L. (1999). Comments on Joshi. In Kornai, A. (Ed.), Extended Finite State Models of Language, 16-18. Cambridge University Press.
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q., and Petrov, S. (2019). Natural Questions: A benchmark for question answering research. TACL 7, 452-466.
Lee, K., Chang, M.-W., and Toutanova, K. (2019). Latent retrieval for weakly supervised open domain question answering. ACL.
Li, B. Z., Min, S., Iyer, S., Mehdad, Y., and Yih, W.-t. (2020). Efficient one-pass end-to-end entity linking for questions. EMNLP.
Lin, J., Nogueira, R., and Yates, A. (2020). Pretrained transformers for text ranking: BERT and beyond. arXiv preprint arXiv:2010.06467.
Liu, C.-W., Lowe, R. T., Serban, I. V., Noseworthy, M., Charlin, L., and Pineau, J. (2016). How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. EMNLP.
Lukovnikov, D., Fischer, A., and Lehmann, J. (2019). Pretrained transformers for simple question answering over knowledge graphs. International Semantic Web Conference.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge.
Mihalcea, R. and Csomai, A. (2007). Wikify!: Linking documents to encyclopedic knowledge. CIKM 2007.
Milne, D. and Witten, I. H. (2008). Learning to link with Wikipedia. CIKM 2008.
Ng, H. T., Teo, L. H., and Kwan, J. L. P. (2000). A machine learning approach to answering questions for reading comprehension tests. EMNLP.
Oren, I., Herzig, J., Gupta, N., Gardner, M., and Berant, J. (2020). Improving compositional generalization in semantic parsing. arXiv preprint arXiv:2010.05647.
Phillips, A. V. (1960). A question-answering routine. Tech. rep. 16, MIT AI Lab.
Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don't know: Unanswerable questions for SQuAD. ACL.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. EMNLP.
Richardson, M., Burges, C. J. C., and Renshaw, E. (2013). MCTest: A challenge dataset for the open-domain machine comprehension of text. EMNLP.
Riloff, E. and Thelen, M. (2000). A rule-based question answering system for reading comprehension tests. ANLP/NAACL workshop on reading comprehension tests.
Roberts, A., Raffel, C., and Shazeer, N. (2020). How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910.
Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. (1995). Okapi at TREC-3. Overview of the Third Text REtrieval Conference (TREC-3).
Salton, G. (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall.
Seo, M., Kembhavi, A., Farhadi, A., and Hajishirzi, H. (2017). Bidirectional attention flow for machine comprehension. ICLR.
Simmons, R. F. (1965). Answering English questions by computer: A survey. CACM 8(1), 53-70.
Simmons, R. F., Klein, S., and McConlogue, K. (1964). Indexing and dependency logic for answering English questions. American Documentation 15(3), 196-204.
Sorokin, D. and Gurevych, I. (2018). Mixing context granularities for improved entity linking on question answering data across entity categories. *SEM.
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1), 11-21.
Su, Y., Sun, H., Sadler, B., Srivatsa, M., Gür, I., Yan, Z., and Yan, X. (2016). On generating characteristic-rich question sets for QA evaluation. EMNLP.
Talmor, A. and Berant, J. (2018). The web as a knowledge-base for answering complex questions. NAACL HLT.
Voorhees, E. M. and Harman, D. K. (2005). TREC: Experiment and Evaluation in Information Retrieval. MIT Press.
Vrandečić, D. and Krötzsch, M. (2014). Wikidata: a free collaborative knowledge base. CACM 57(10), 78-85.
Wolfson, T., Geva, M., Gupta, A., Gardner, M., Goldberg, Y., Deutch, D., and Berant, J. (2020). Break it down: A question understanding benchmark. TACL 8, 183-198.
Woods, W. A. (1978). Semantics and quantification in natural language question answering. In Yovits, M. (Ed.), Advances in Computers, 2-64. Academic.
Woods, W. A., Kaplan, R. M., and Nash-Webber, B. L. (1972). The lunar sciences natural language information system: Final report. Tech. rep. 2378, BBN.
Wu, L., Petroni, F., Josifoski, M., Riedel, S., and Zettlemoyer, L. (2019). Zero-shot entity linking with dense entity retrieval. arXiv preprint arXiv:1911.03814.
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. EMNLP.
Yih, W.-t., Richardson, M., Meek, C., Chang, M.-W., and Suh, J. (2016). The value of semantic parse labeling for knowledge base question answering. ACL.
Zelle, J. M. and Mooney, R. J. (1996). Learning to parse database queries using inductive logic programming. AAAI.
