Semi-Automated Prevention and Curation of Duplicate Content in Social Support Systems

Igor A. Podgorny, Intuit, Inc., San Diego, USA, igor_podgorny@intuit.com
Chris Gielow, Intuit, Inc., San Diego, USA, chris_gielow@intuit.com

ABSTRACT
TurboTax AnswerXchange is a popular social Q&A system supporting users working on U.S. federal and state tax returns. Based on a custom-built duplicate scoring model, 35% of AnswerXchange questions have been found to be near-duplicates responsible for 56% of AnswerXchange document views. This degrades the user experience for both the asker, who is unable to find an answer amid duplicates, and the answerer, who is unable to efficiently answer at scale. The duplicate questions tend to form micro-clusters that grow via preferential attachment and, once exceeding some 25 questions in size, start morphing into mega-clusters with a complex network topology. This behavior can be leveraged to design semi-automated content curation systems to detect whether a newly posted question is a duplicate and, if so, which duplicate cluster it belongs to. In order to improve user experience in AnswerXchange, we explore how human and artificial intelligence can be jointly employed and then present several data-driven intelligent user interfaces. The duplicate scoring models can be utilized as elements of question-posting and answering experiences, unanswered question queueing and answer bots. These approaches can be extended to any social support Q&A system where duplicate posting negatively impacts search relevance and content consumption.

Author Keywords
TurboTax; AnswerXchange; CQA; community question answering; social question answering; duplicate clusters; content deduplication.

ACM Classification Keywords
H.5.m. Information Interfaces and Presentation (e.g. HCI): Miscellaneous

INTRODUCTION
Social Q&A systems provide a convenient self-support option for tax and financial software applications where personalized long-tail content generated by the users can supplement curated knowledge base answers. Users often prefer self-help to assisted measures (e.g. phone support or online chat) and are often able to find and apply their solution faster. This also reduces the load on assisted channels, ensuring they remain available to those who need them.

AnswerXchange (http://ttlc.intuit.com) is a social Q&A site where customers can learn and share their knowledge with other TurboTax customers while preparing U.S. federal and state tax returns and also find step-by-step instructions on using the TurboTax application [5, 6]. As the users step through the TurboTax interview pages, they can ask questions about software and tax topics (Figure 1) and receive answers in a matter of minutes. AnswerXchange has generated millions of questions and answers that have helped tens of millions of TurboTax customers since launching in 2007.

Figure 1. AnswerXchange question-posting user experience. Question title (a short summary of the question limited to 255 characters) is mandatory. Question details (not shown) are optional and unlimited in size.

The majority of users can find answers by searching the existing content. The overall quality of a customer self-help system is therefore determined by how well it assists in finding the relevant content. The number of search sessions resulting in assisted support contacts (as large as hundreds of thousands of customers per year) and the fraction of user up or down votes on self-support content provide convenient proxy metrics of content quality and search relevance in TurboTax self-help [5].

2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. ESIDA'18, March 11, Tokyo, Japan.

Figure 2. An example of duplicate AnswerXchange search results. Question titles and answer snippets are shown in purple and in black, respectively.

One problem with the existing question-posting experience (Figure 1) is that searches may result in multiple and often duplicate answers that are relatively close to the intent of the original question, but still do not match the original search intent (Figure 2). This interferes with the user's ability to select from a diverse set of possible answers [5] and often results either in the submission of a duplicate question or in switching to a less-desired support channel. A related problem is that users may submit poor quality questions by not providing all of the relevant information needed for a good quality answer [5].

One solution is a manual review of the user generated content to archive some of the duplicate questions and related answers, if any, while keeping the best performing content in "live" status (i.e. making it available for search). This approach is labor intensive and does not address the problem with the question-posting user experience. Duplicate questions may quickly build up, adding unnecessary burden on community question answering along the way.

The goal of this study is to address the problems of duplicate content prevention in AnswerXchange by combining machine learning and intelligent user interfaces. In what follows, we describe duplicate detection algorithms developed earlier and present a custom model trained on AnswerXchange questions. Next, we introduce the concept of "duplicate clusters" that provide a framework for semi-automated duplicate content prevention. Finally, we present several custom designed data-driven intelligent user interfaces for addressing the duplicate content problem.

RELATED WORK
The task of estimating semantic similarity of text documents has multiple practical applications and is of growing interest to the research community. The areas of research include web page similarity, document similarity, sentence similarity, search query similarity and utterance similarity in conversational user interfaces. These tasks are also related to a more general problem of detecting duplicates in database records [2].

Questions in social Q&A systems are often confined to one or two relatively short sentences and may warrant domain specific approaches to addressing question similarity. For example, two questions in a social Q&A system can be considered semantically identical if a single answer satisfies the needs of both original askers [3]. The answer may not yet exist in the production database but could be generated if needed. The task of duplicate-question detection is also related to the task of re-formulating a newly formed question [6] and automatically finding an answer to a new question [8].

The most recent results in the area of duplicate content scoring came from the 2017 Kaggle "Quora Pair" competition with model submissions from more than 3,000 teams (https://www.kaggle.com/c/quora-question-pairs). In this competition, the participants were tasked to classify whether Quora question pairs are duplicates or not based on 200,000 training instances. Finally, the SemEval 2017 Task on Community Question Answering ("Question–Comment Similarity", "Question–Question Similarity", etc.) resulted in submissions from 23 teams [4].

The problem of duplicate detection and curation is closely related to the task of predicting content quality in social Q&A systems. Content quality metrics may be helpful in selecting the best performing question and answer for the duplicate-question pair. Answer and question quality in social Q&A systems has been the focus of increasing attention from the scientific community [1, 9].

DUPLICATE-SCORING MODEL
AnswerXchange Search
AnswerXchange search is built with Apache Lucene open-source software (http://lucene.apache.org). By default, Lucene uses "tf-idf" (https://en.wikipedia.org/wiki/tf-idf) and "cosine-similarity" as standard methods of ranking search results. Shorter documents with the same set of matching keywords typically rank higher than longer documents with similar semantic meaning. An average AnswerXchange search query is 2-3 terms long (i.e. shorter than a typical AnswerXchange question) and it is often comparable in length with the title of a potentially duplicate question. The question details play a lesser role compared to titles, contributing to extra boosting of duplicate content by Lucene. The AnswerXchange Lucene ranking algorithm tends to boost new content and also accounts for various metadata such as helpfulness votes.

Training Data
The problem of near-duplicate detection can be formulated as an unsupervised or supervised machine learning task [7]. In the unsupervised case, duplicate pairs and clusters can be found based on distance metrics such as cosine-similarity of the weighted tf-idf vectors, Jaccard similarity coefficient, distance in word2vec space, etc. In the supervised case, the problem of finding topical near-duplicate relations can be formulated as follows: given a pair of questions, the machine learnt model has to predict a "duplicate score" and determine if the questions are duplicates based on a pre-defined threshold. In this paper, we employ a "hybrid" approach starting with cosine-similarity metrics for data pre-processing and then adding a more accurate custom-built scoring model to the processing pipeline.

As the fraction of duplicate pairs in AnswerXchange is relatively low, the question pairs ranked by cosine-similarity provide a convenient dataset for labeling based on the importance sampling approach. Towards this goal, we computed bag-of-words cosine-similarity (Appendix A) for 790,000 questions available for search in AnswerXchange at the end of 2017 U.S. Tax Day (April 18). Next, four AnswerXchange moderators added class labels (0 or 1) to a random sample of 4,000 near-duplicate pairs. Instances open to doubt were flagged by moderators and then re-labeled by consensus. 1,000 randomly sampled non-duplicate pairs were added for the final version of the training dataset to make it equally divided between duplicate and non-duplicate pairs.
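
The unsupervised distance metrics mentioned in this section can be sketched in a few lines of Python. This is a minimal illustration, not the AnswerXchange implementation: the tokenizer, the smoothed idf formula and all function names are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_weights(corpus):
    """Smoothed inverse document frequency for every term seen in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse tf-idf vector stored as a term -> weight dictionary."""
    counts = Counter(tokenize(text))
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the joint vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


corpus = [
    "I need a copy of my federal tax return for 2014",
    "I need 2015 Tax Return",
    "how to file state taxes",
]
idf = idf_weights(corpus)
vectors = [tfidf_vector(q, idf) for q in corpus]
sim_01 = cosine_similarity(vectors[0], vectors[1])  # questions sharing several terms
sim_02 = cosine_similarity(vectors[0], vectors[2])  # questions sharing no terms
jac_01 = jaccard_similarity(corpus[0], corpus[1])
```

Ranking all candidate pairs by such a score and labeling a sample from the top of the ranking is one way to realize the importance sampling idea described above.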

Duplicate-Scoring Model Features
The model features can be learnt from training data and/or by knowledge acquisition from AnswerXchange moderators. We have used the following model features:
- Cosine-similarity with tf-idf weighting (see Appendix A).
- Probabilistic topic ID of the question computed with Latent Dirichlet Allocation (see Appendix A).
- U.S. tax year in the question.
- Distinct words in the question pair.
- Common words in the question pair.
- Type of the question (e.g. "closed-ended" questions such as "Can I deduct…" typically account for tax related questions, while "how" questions often account for product related questions).
- First word of the question.
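
Several of the listed features can be derived directly from the question text. The sketch below is illustrative only: the paper does not say how the features were encoded, so the use of word counts, the tax-year pattern, the `CLOSED_ENDED_STARTS` list and the function names are all assumptions.

```python
import re

# Assumed list of first words that signal a closed-ended (yes/no) question.
CLOSED_ENDED_STARTS = {"can", "do", "does", "is", "are", "will", "should"}


def tokenize(text):
    """Lowercase and keep alphanumeric runs (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tax_years(tokens):
    """Four-digit tokens that look like a tax year (assumed range 1990-2029)."""
    return {t for t in tokens if re.fullmatch(r"199\d|20[0-2]\d", t)}


def question_type(tokens):
    """Coarse question type based on the first word only."""
    if not tokens:
        return "other"
    if tokens[0] == "how":
        return "how"
    if tokens[0] in CLOSED_ENDED_STARTS:
        return "closed-ended"
    return "other"


def pair_features(q1, q2):
    """Feature dictionary for one question pair."""
    t1, t2 = tokenize(q1), tokenize(q2)
    s1, s2 = set(t1), set(t2)
    return {
        "common_words": len(s1 & s2),    # words present in both questions
        "distinct_words": len(s1 ^ s2),  # words present in only one question
        "same_tax_year": tax_years(t1) == tax_years(t2),
        "same_first_word": bool(t1) and bool(t2) and t1[0] == t2[0],
        "type_1": question_type(t1),
        "type_2": question_type(t2),
    }


features = pair_features("do i have to file state taxes", "how to file state taxes")
```

For this pair the two question types differ ("closed-ended" versus "how") even though most words are shared, which is the kind of signal a keyword-only similarity cannot provide.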

Duplicate-Scoring Model Performance
Based on the set of 5,000 labeled question pairs, we trained and tested linear (logistic regression) and non-linear (random forest) binary classifiers using the Python machine learning library "scikit-learn". The model predicts the class label (0 for a non-duplicate and 1 for a duplicate pair) and also the duplicate score (i.e. the probability of the question pair belonging to either class, ranging from 0.0 to 1.0) that can be used to select the user experience based on predefined threshold(s). We also trained a separate version of the logistic regression classifier using cosine-similarity as a single model feature. Shown in Table 1 are common metrics used for predictive model evaluation: area under curve (AUC) for receiver operating characteristic, F1 score and logarithmic loss (log loss) function for classification.

Model                 AUC    F1 Score   Log Loss
Logistic Regression   0.95   0.88       0.27
Random Forest         0.94   0.87       0.31
Cosine-similarity     0.83   0.73       0.48

Table 1. Model performance metrics for duplicate-scoring models (details are explained in the text).
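
For readers unfamiliar with the three metrics in Table 1, the following pure-Python sketch shows how each can be computed from class labels and duplicate scores. The labels, scores and the 0.5 threshold are made-up toy values, not AnswerXchange data, and the functions are simplified stand-ins for the library routines one would normally use.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted scores."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the duplicate class (label 1)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, y_prob):
    """Chance that a random duplicate pair outscores a random non-duplicate pair."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0, 1, 0]                # toy labels
y_prob = [0.9, 0.7, 0.4, 0.2, 0.45, 0.6]   # toy duplicate scores
threshold = 0.5                            # hypothetical decision threshold
y_pred = [1 if p >= threshold else 0 for p in y_prob]

toy_f1 = f1_score(y_true, y_pred)
toy_auc = auc(y_true, y_prob)
toy_log_loss = log_loss(y_true, y_prob)
```

AUC and log loss depend only on the scores, while F1 depends on the chosen threshold, which is why the threshold can be tuned separately for each user experience.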

As seen from Table 1, both logistic regression and random forest models achieve performance that is consistent with the goals of this exploratory study. At the same time, the cosine-similarity version underperforms the first two by a wide margin. This can be explained by the inability to find an optimal threshold separating duplicate and non-duplicate pairs using cosine-similarity alone. The following two examples illustrate the relationship between keyword-based cosine-similarity and the duplicate-question score computed with logistic regression.

The first example is an AnswerXchange question pair with a relatively low cosine-similarity of 0.61: (1) "I need a copy of my federal tax return for 2014" and (2) "I need 2015 Tax Return". Both questions can be answered with a single instruction about getting a copy of a prior year tax return filed with TurboTax and hence are duplicates. The second example is a question pair with a high cosine-similarity of 1.0: (1) "do i have to file state taxes" and (2) "how to file state taxes". These questions are not duplicates because they belong to tax and product categories [5], respectively, and would require two different answers.
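
The second example can be reproduced under one plausible assumption: that common function words are removed before the bag-of-words vectors are compared. The paper does not describe its preprocessing, so the `STOP_WORDS` list below is purely hypothetical and serves only to show how two non-duplicate questions can end up with identical keyword vectors.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the actual preprocessing is not described in the text.
STOP_WORDS = {"do", "i", "have", "to", "how"}


def content_words(text):
    """Whitespace tokens that are not in the stop-word list."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def bow_cosine(a, b):
    """Cosine similarity between unweighted bag-of-words vectors."""
    ca, cb = Counter(content_words(a)), Counter(content_words(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(c * c for c in ca.values())) * math.sqrt(sum(c * c for c in cb.values()))
    return dot / norm if norm else 0.0


similarity = bow_cosine("do i have to file state taxes", "how to file state taxes")
```

Under this assumption both questions reduce to the same three keywords, so the similarity is 1.0 up to floating-point rounding, although the first question is about tax obligations and the second about product usage.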

DUPLICATE CLUSTERS
Preferential Attachment and Topology
After identifying 5,597,799 duplicate question pairs in AnswerXchange (Appendix A), we built an undirected graph of 281,031 duplicate questions. Each duplicate pair and duplicate question identified with the model constituted a graph edge and a graph vertex, respectively. The resulting graph consists of 14,616 connected components hereafter referred to as "duplicate clusters."

To explore duplicate-cluster scaling behavior, we ranked clusters by the number of questions and plotted the number of questions per cluster vs. cluster rank in log-log scale (Figure 3). The largest cluster has 23,236 questions and the smallest ones only have two. The plot also includes graph (or edge) density: $D = \frac{2E}{V(V-1)}$, where E is the number of edges (i.e. duplicate pairs) and V is the number of vertices (i.e. questions). Graph density is equal to 1.0 for fully connected graphs. In the latter case, each question in the cluster is connected to all remaining questions in the same duplicate cluster.

Based on both question counts and graph density, the duplicate clusters in Figure 3 can be divided into three distinct groups marked as mega-clusters, transitional clusters and micro-clusters. These groups account for 84%, 2% and 14% of duplicate questions, respectively.
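
The construction of duplicate clusters from duplicate pairs can be sketched as follows. This is a minimal standard-library illustration on a toy edge list; the question IDs and function names are assumptions, and a graph library would be a natural choice for a corpus of realistic size.

```python
from collections import defaultdict


def build_graph(pairs):
    """Undirected adjacency sets: each duplicate pair is an edge, each question a vertex."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_components(graph):
    """Duplicate clusters, largest first, found with an iterative depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def density(component, graph):
    """Graph density 2E / (V(V-1)) of one connected component."""
    v = len(component)
    e = sum(len(graph[node]) for node in component) // 2  # each edge is counted twice
    return 2 * e / (v * (v - 1)) if v > 1 else 0.0


pairs = [(1, 2), (2, 3), (1, 3), (3, 4), (5, 6)]  # toy duplicate pairs
graph = build_graph(pairs)
clusters = connected_components(graph)
densities = [density(c, graph) for c in clusters]
```

In this toy example the four-question cluster has density 2/3 because question 4 is linked only to question 3, while the two-question cluster is fully connected with density 1.0.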

Figure 3. Scaling behavior of duplicate clusters (black dots) in AnswerXchange questions. The clusters are ranked by the number of questions in descending order. Graph density for the clusters is shown in gray. Cyan and red dots refer to the clusters shown in Figures 4 and 5, respectively.

An example of a micro-cluster with 23 vertices is shown in Figure 4. Graph density is 0.54 and most of the vertices are interconnected, with the exception of three vertices connected by bridges to a denser graph core. The corresponding articulation points are marked by blue dots. Note that even if questions 1 and 2 are duplicates and questions 2 and 3 are duplicates, this does not mean that questions 1 and 3 are duplicates as well. This explains why a duplicate-cluster density is typically less than 1.0 unless the graph size is limited to two questions.
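
Articulation points, i.e. vertices whose removal disconnects a cluster, can be found with a depth-first search. The sketch below uses the classic low-link approach on a toy cluster; it is an illustration rather than the method used to produce the figures, and the recursive form is only suitable for small graphs (a mega-cluster would need an iterative version or a graph library).

```python
def articulation_points(graph):
    """Vertices whose removal disconnects their connected component."""
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for neighbor in graph[node]:
            if neighbor == parent:
                continue
            if neighbor in disc:
                # Back edge to an already visited vertex.
                low[node] = min(low[node], disc[neighbor])
            else:
                children += 1
                dfs(neighbor, node)
                low[node] = min(low[node], low[neighbor])
                # A non-root vertex is critical if a child subtree cannot reach above it.
                if parent is not None and low[neighbor] >= disc[node]:
                    points.add(node)
        # The DFS root is critical only if it has more than one DFS child.
        if parent is None and children > 1:
            points.add(node)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return points


# Toy cluster: q1, q2, q3 are mutual duplicates; q4 is a duplicate of q3 only.
graph = {
    "q1": {"q2", "q3"},
    "q2": {"q1", "q3"},
    "q3": {"q1", "q2", "q4"},
    "q4": {"q3"},
}
points = articulation_points(graph)
```

Here q3 is the only articulation point: q4 hangs off the dense core through a single bridge, and q1 and q4 are not directly linked even though both are duplicates of q3.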

As seen from Figure 3, micro-cluster scaling behavior follows a Zipf distribution (https://en.wikipedia.org/wiki/zipf's_law): $n(r) = N r^{-\alpha}$, where $n(r)$ is the number of questions in the cluster of rank $r$, and $r$ ranges from about 100 to the total number of clusters $R$. Accordingly, the growth of $N$ and $R$ would be constrained by the following equation: $\Delta N / N = \alpha \, \Delta R / R$. It is worth mentioning that the Zipf distribution is an asymptotic case of a more general Yule-Simon distribution (https://en.wikipedia.org/wiki/Yule-Simon_distribution) typical for the preferential attachment process, meaning that a newly posted duplicate is more likely to become attached to an existing cluster than to form a new duplicate pair. The scaling parameter for the micro-clusters, $\alpha = \frac{\log n(r_1) - \log n(r_2)}{\log(r_2) - \log(r_1)}$, can be estimated as 0.6. By extrapolating the Zipf distribution to r=1 (that would correspond to a non-existing largest micro-cluster), one can estimate the N value as 400. This value, however, is almost two orders of magnitude less than the number of questions in the top mega-cluster.
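
The estimate of the scaling parameter can be illustrated numerically. The sketch assumes that the Zipf law has the power-law form n(r) = N * r**(-alpha), reads the exponent as the slope between two points on a log-log plot, and extrapolates to rank one. The points are synthetic and generated from the assumed law itself; they are not AnswerXchange measurements.

```python
import math


def zipf_exponent(rank_a, size_a, rank_b, size_b):
    """Magnitude of the slope through two (rank, size) points on a log-log plot."""
    return (math.log(size_a) - math.log(size_b)) / (math.log(rank_b) - math.log(rank_a))


def extrapolate_to_rank_one(rank, size, alpha):
    """Cluster size implied at r = 1 by the assumed law n(r) = N * r**(-alpha)."""
    return size * rank ** alpha


# Synthetic points from n(r) = 400 * r**(-0.6); illustrative values only.
points = [(r, 400 * r ** -0.6) for r in (100, 1000, 10000)]
alpha = zipf_exponent(*points[0], *points[2])
n_at_rank_one = extrapolate_to_rank_one(*points[1], alpha)
```

With real data the ranks and sizes would be read from the micro-cluster region of the plot, and a least-squares fit over many points would be more robust than a two-point slope.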

Figure 4. A micro-cluster marked by a cyan dot in Figure 3. Articulation points are shown by smaller blue dots.

To explain the scale break in the distribution shown in Figure 3, let us examine larger duplicate clusters in more detail. Shown in Figure 5 is a mega-cluster with 4,549 questions. The cluster has a density equal to 0.0017 and 1,048 articulation points. This means that the mega-clusters may consist of multiple sub-clusters that are semantically related to each other but whose elements are not duplicates unless they belong to the same sub-cluster.

Figure 5. Same as in Figure 4, but now for a mega-cluster.

As the number of duplicates reaches a certain level, the clusters start coalescing by establishing bridges with other clusters, duplicate pairs and stand-alone questions, quickly evolving from dense connected graphs to sparse graphs with a complex network topology. The area of transition is marked as transitional clusters in Figure 3.

Semi-Automated Duplicate Content Curation
While the task of duplicate content archiving is straightforward once duplicate pairs are found (Appendix A), the duplicate content can build up again unless question-posting and/or search experiences are modified. Our next goal is therefore to explore how the concept of duplicate clusters discussed in the previous section can be applied to these tasks.

The curation of micro-clusters can be done automatically or semi-automatically (i.e. with minimum human involvement) by retaining one or a few best performing long-tail documents (i.e. documents that include both questions and answers) and assigning them a cluster ID for subsequent re-use.

The curation of mega-clusters represents a more challenging problem. First, a single best performing document in a mega-cluster may simply not exist since the cluster may contain multiple sub-clusters connected by bridges. Second, duplicate curation by a human is a cumbersome task due to the complex topology of the mega-cluster. While the exact solution may simply not exist, approximate solutions may be sufficient to reduce the number of duplicates posted in AnswerXchange to an acceptable level.

One approach would be to break the mega-clusters into smaller parts by deleting bridges in the graph or by employing conventional hierarchical clustering. For example, the duplicate cluster shown in Figure 5 can be split into 1,363 connected components by removing all articulation points (blue dots in Figure 5). Most of the resulting connected components, however, are disconnected documents. A more practical approach is to archive non-performing short-tail content from the mega-cluster and curate the resulting connected components. Shown in Figure 6 is a subset of the mega-cluster from Figure 5 that now only includes documents with at least 100 views. This results in breaking the original mega-cluster into 68 connected components which are easier to curate.

Figure 6. A subset of the mega-cluster shown in Figure 5. Grey dots mark documents used in Figure 7.

The next task is to present duplicate content in a form suitable for semi-automated content curation. Figure 7 shows an example of duplicate content metrics for eight documents with at least 1,000 views. The left column is a sub-cluster ID followed by a post ID identifying an AnswerXchange document consisting of the original question and all accumulated answers (not shown). The text of the question and the type of the question (i.e. user-generated content marked as UGC or knowledge base content labeled as FAQ) are included in the third and fourth columns, respectively. The last two columns are views accumulated over a given period and the percentage of up-votes. The documents can be ranked by views and/or votes, providing a mechanism for identifying and removing non-performing content either manually or automatically based on a set of predefined content quality thresholds.

Figure 7. Duplicate document metrics for the documents marked by grey dots in Figure 6.
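
Threshold-based curation of this kind can be sketched with the metrics listed for Figure 7. The post IDs, types, views and up-vote percentages are taken from that figure; the two thresholds and the `curate` function are hypothetical and would need to be tuned by the content moderators.

```python
from collections import defaultdict

# (sub-cluster ID, post ID, type, views, up-vote %) as listed for Figure 7.
documents = [
    (1, 1899475, "FAQ", 17019, 74.8),
    (1, 2666148, "UGC", 1759, 77.9),
    (1, 3048015, "UGC", 1060, 78.1),
    (1, 3356358, "FAQ", 6727, 70.3),
    (1, 3705028, "UGC", 2999, 67.0),
    (2, 2895188, "FAQ", 25243, 79.9),
    (2, 2899090, "FAQ", 13765, 79.1),
    (2, 2956890, "UGC", 1509, 86.6),
]

MIN_VIEWS = 5000        # hypothetical content quality threshold
MIN_UPVOTE_PCT = 70.0   # hypothetical content quality threshold


def curate(docs, min_views, min_upvote_pct):
    """Per sub-cluster, rank by views and split posts into keep and archive lists."""
    by_cluster = defaultdict(list)
    for doc in docs:
        by_cluster[doc[0]].append(doc)
    plan = {}
    for cluster_id, members in by_cluster.items():
        ranked = sorted(members, key=lambda d: d[3], reverse=True)
        keep = [d[1] for d in ranked if d[3] >= min_views and d[4] >= min_upvote_pct]
        archive = [d[1] for d in ranked if d[1] not in keep]
        plan[cluster_id] = {"keep": keep, "archive": archive}
    return plan


plan = curate(documents, MIN_VIEWS, MIN_UPVOTE_PCT)
```

With these particular thresholds only FAQ documents survive in both sub-clusters; in a semi-automated workflow the archive list would be presented to a trusted user for confirmation rather than applied blindly.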

Duplicate metrics can be operationalized by adding an algorithm to match the best question to the best answer in the sub-cluster. Such a system would include deleting and merging answers manually, or automatically by attaching an automatically generated "best" answer to the "best" duplicate question. The solution can be implemented as a back-end tool for trusted users assigned to the task of duplicate archiving and hidden from the less experienced regular users. The solution goes beyond simple duplicate archiving by providing an option to merge available answers to the existing duplicate questions. The non-human part of the solution includes quality ranking of the existing answers, e.g. up and down vote statistics as shown in Figure 7. In this way, the newly formed question-answer pairs provide better quality content available for search by combining the visually appealing questions and the best ranked answers. This is done by combining artificial and human intelligence since the answer to a related question (that the system recommended) can be confirmed by the contributor if needed. The cluster notes can be edited by trusted users and applied to all articles within the cluster.

Real Time Duplicate Detection
Finding duplicates of a given question requires (N-1) pairwise comparisons to the questions in the database and may not be feasible in real time. The computational time can be reduced by selecting potential duplicate matches with AnswerXchange search. The top performing documents in the clusters can be assigned an ID and indexed separately by the search engine. Once the search engine returns the documents ranked by relevancy to the newly formulated question, the duplicate-scoring model is applied to the top matches to see if the new question is a duplicate and, if so, which duplicate cluster it belongs to.
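
The two-stage lookup described above (search first, then score only the top matches) can be sketched as follows. Everything here is a stand-in: `toy_search` replaces the search engine, `toy_score` replaces the duplicate-scoring model, and the index entries, `top_k` and `threshold` values are hypothetical.

```python
def find_duplicate_cluster(question, search, score_pair, top_k=10, threshold=0.5):
    """Return (cluster_id, score) for the best match at or above threshold, else (None, score)."""
    best_cluster, best_score = None, 0.0
    for candidate in search(question)[:top_k]:
        score = score_pair(question, candidate["title"])
        if score > best_score:
            best_cluster, best_score = candidate["cluster_id"], score
    if best_score >= threshold:
        return best_cluster, best_score
    return None, best_score


# Toy index of top performing documents, one per duplicate cluster (illustrative titles).
INDEX = [
    {"cluster_id": 1, "title": "Can I deduct job-search expenses"},
    {"cluster_id": 2, "title": "Where do I enter my medical expenses"},
]


def _words(text):
    return set(text.lower().split())


def toy_search(question):
    """Stand-in for the search engine: rank indexed titles by keyword overlap."""
    hits = [d for d in INDEX if _words(question) & _words(d["title"])]
    return sorted(hits, key=lambda d: len(_words(question) & _words(d["title"])), reverse=True)


def toy_score(q1, q2):
    """Stand-in for the duplicate-scoring model: Jaccard overlap of the two word sets."""
    a, b = _words(q1), _words(q2)
    return len(a & b) / len(a | b)


match = find_duplicate_cluster("where do i enter medical expenses", toy_search, toy_score)
no_match = find_duplicate_cluster("how do i print", toy_search, toy_score)
```

Only `top_k` pairs are scored per incoming question, so the cost no longer grows with the size of the corpus; the trade-off is that a duplicate missed by the search stage is never seen by the scoring model.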

Data shown in Figure 7:

ID  POST_ID    DOCUMENT                                                       TYPE  VIEWS   UPVOTE
1   1,899,475  Can I deduct job-search expenses                               FAQ   17,019  74.8
1   2,666,148  HI. Where do I enter my job search                             UGC   1,759   77.9
1   3,048,015  Where do I include job search                                  UGC   1,060   78.1
1   3,356,358  Where do I enter my job search                                 FAQ   6,727   70.3
1   3,705,028  Where do I deduct job search                                   UGC   2,999   67
2   2,895,188  Where do I enter my medical                                    FAQ   25,243  79.9
2   2,899,090  Why doesnt my refund change after I enter my medical expenses  FAQ   13,765  79.1
2   2,956,890  where do i enter OUT OF POCKET medical expenses                UGC   1,509   86.6

DATA-DRIVEN USER EXPERIENCES
Accumulation of duplicate content can be prevented by integrating a custom-built duplicate-scoring model with the question-posting experience. Another option is to expose an intelligent interface to the trusted users by providing extra features for answering duplicate questions. Finally, the duplicate question curation can be part of the content moderation process carried out by the AnswerXchange trusted users or trained bots.

Question Deduplication While Posting
The first feature (Figure 8) extends the AnswerXchange "Question Optimizer" system [6]. The system prompts the asker with personalized instructions created dynamically based on real time analysis of the question's semantics and writing style. The "Question Optimizer" has been re-designed to make a duplicate question more difficult to submit without addressing the recommended re-phrasing. The annotations to the concept are presented next.

Figure 8. Question-posting experience reveals the duplicates and helps users re-phrase as a unique question.

A) The "Question Optimizer" technology is envisioned to include duplicate content detection in addition to providing timely advice on how to re-phrase or deflect.
B) If the question falls in a known duplicate cluster, the best matching and most referenced answer matches are shown.
C) Trusted users may attach "cluster notes" to curated duplicate clusters; the notes appear automatically with any question within the cluster. In the example shown in Figure 8, the duplicate cluster is about printing and the message notes that the printing experience recently changed in the product, information which may be useful to anyone with printing-related questions.
D) The suggested answers are deduplicated using duplicate score equalization so the answers are more useful. A "cluster browser" is also added below the results to help refine amongst the most popular variations.

Question Deduplication While Answering
The second feature addresses the situation where a potential duplicate has been submitted and needs to be intercepted as part of the question answering experience. This concept is illustrated in Figures 9-10.

Figure 9. Contributor experience tagging and attaching a curated answer to the question.

Specifically, Figure 9 illustrates the contributor (typically a trusted user) answering experience and includes the following annotation:

[Figure 9 screen text: the question "I need a copy of my 2014 Tax return" (Chris asked 30 minutes ago) with an answer box and a "Suggested answers" panel showing the matching question "I need to get a copy of my 2014 return and I don't have the cd." (92% match, 2,314 duplicates) and the actions "attach" and "attach and mark answered".]

E) The suggested answered question duplicate is presented to the original asker and also displays the duplicate probability. The contributor can easily attach it to their answer, which also tells the system the question was a duplicate and should be archived in favor of the attached one.

Figure 10. Original asker view of deduplicated question with personalized answer.

Once the duplicate question is answered, it becomes available to the original asker (Figure 10).
C) Re-purposing trusted users' notes similar to those used in the question-posting experience (Figure 8).
F) A personalized note introduces the "recommended answer" while explaining it's a duplicate.
G) The duplicate answer is presented with a sense of authority.
H) If the original asker is unsatisfied with the answer, they may revise their question and it will re-enter the answer queue. They also have the option to request a new answer without submitting the question.

Finally, the automatic flagging of an unanswered question as a duplicate may be validated or invalidated by the trusted users and used to update the training dataset for model re-training.

Question Deduplication with Automated Answers
The "AnswerBot" (Figure 11) is a feature driven by artificial intelligence alone. The "AnswerBot" increases self-support efficiency by responding to a customer's questions by e-mail with answers from the matching duplicate cluster if the posted question is flagged by the duplicate-scoring model as a duplicate.
I) "AnswerBots" may automatically answer questions determined to be duplicates. Like the contributor-assisted experience, the bot will recommend the best answer within the duplicate cluster. The user is made aware that a bot answered the question and, if unsatisfied, may request a new answer or revise their question.

Figure 11. Automated deduplication user experience as part of a customized e-mail to the original asker.

Further, the "AnswerBot" attaches the question to the existing duplicate cluster automatically while providing a generic or personalized answer. The bot replies trigger automated archiving of the duplicate content. The question remains visible to the original asker but is not made available to AnswerXchange users and is suppressed from search results.

A related option is to create two separate queues of duplicate questions for answering. The questions in the first queue would be assigned to designated moderators who can customize duplicate content for the original asker and archive it afterwards. The less complicated questions in the second queue can be assigned to the "AnswerBot".
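
One way to realize the two queues is to route each incoming question by its duplicate score. The text does not say how "less complicated" questions are identified, so using the score as the routing criterion, the queue names and both thresholds are assumptions made for this sketch.

```python
BOT_THRESHOLD = 0.9        # hypothetical: very likely duplicates go to the bot
MODERATOR_THRESHOLD = 0.5  # hypothetical: probable duplicates go to moderators


def route_question(duplicate_score):
    """Pick an answering queue from the duplicate score (0.0 to 1.0)."""
    if duplicate_score >= BOT_THRESHOLD:
        return "answer_bot_queue"
    if duplicate_score >= MODERATOR_THRESHOLD:
        return "moderator_queue"
    return "regular_queue"


routes = [route_question(score) for score in (0.95, 0.7, 0.2)]
```

Validation decisions made by moderators in the middle band could then be fed back as new labeled pairs for model re-training.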

[Figure 10 screen text: the note "Your question shares the same answer as this similar question: I need a copy of my 2014 Tax return", a personalized contributor message ("Chris, try this to download a new copy"), the recommended answer with the download steps, the cluster note "Note the printing experience in TurboTax changed in 2016", and the actions "Revise my question" and "Request a new answer".]

[Figure 11 screen text: an AnswerBot message stating "I think your question might share the same answer as this similar question: I need a copy of my 2014 Tax return. I am a bot, and this action was performed automatically. If my answer is unhelpful, you may request a new answer or revise your question.", followed by the recommended answer.]

DISCUSSION AND CONCLUSION
Social Q&A systems often presume that the users comply with recommendations not to replicate the existing content. This is not the case for AnswerXchange where users often avoid consuming existing content by posting a new duplicate question. These users may not realize that AnswerXchange is a social Q&A site or lack the ability to find and apply existing answers to their question. We need to intervene with intelligent user interfaces to alter the duplicate posting behavior. Towards this goal, we present two algorithms for duplicate content curation and for providing real time inputs to the AnswerXchange user interfaces. The first algorithm determines if two questions are near-duplicates and can be combined with a search to detect duplicates in real time. The second algorithm uncovers all duplicate pairs in AnswerXchange and is capable of handling the deduplication task with a corpus of millions of questions. We conclude the paper by presenting three question deduplication user interfaces.

Our hypotheses to validate include: (1) Will askers accept a duplicate when presented with an acceptable answer? (2) Will they accept a duplicate with or without a personalized contributor note? (3) If dissatisfied, will they revise or request a new answer? (4) Will they accept recommended answers from AnswerBots? We are planning to validate these hypotheses with a set of rapid experiments prior to production.

APPENDIX A: DUPLICATE PAIR DETECTION
Detecting duplicates for N=790,000 questions based on a custom-built model would require N(N-1)/2 pairwise computations. The task of finding duplicate pairs becomes computationally expensive once the corpus reaches several hundred thousand documents. At the same time, computing cosine-similarity for a question pair is faster than scoring the same pair with the custom-built model and can be used to reduce the number of potential duplicate pairs from billions to millions of pairs. Further, dividing content by M probabilistic topics can reduce the number of pairwise comparisons by a factor of M, while not necessarily affecting the number of expected near-duplicate pairs.

M    Duplicates   Execution time (min)
50   63,355       13
30   72,920       18.5
10   73,068       36
1    83,773       265

Table A1. Duplicate statistics and computation time vs. number of probabilistic topics (M). Cosine-similarity threshold is 0.7. M=1 means processing N(N-1)/2 pairs.
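
The effect of topic blocking on the number of comparisons can be checked with simple arithmetic. The sketch assumes, as an idealization, that questions are spread evenly across the M topics; real topic sizes are uneven, so the actual savings would be smaller.

```python
def pairwise_comparisons(n_questions, n_topics=1):
    """Comparisons needed when questions are split evenly across topics (an idealization)."""
    per_topic = n_questions // n_topics
    return n_topics * per_topic * (per_topic - 1) // 2


N = 790_000
all_pairs = pairwise_comparisons(N)  # M = 1, i.e. N(N-1)/2
blocked = {m: pairwise_comparisons(N, m) for m in (10, 30, 50)}
reduction = {m: all_pairs / count for m, count in blocked.items()}
```

For M = 1 this gives roughly 3.1 x 10^11 pairs, and each reduction factor is close to M, which matches the statement that dividing content into M topics cuts the work by about a factor of M.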

Shown in Table A1 are results of the numerical experiments conducted on a MacBook Pro laptop with 2.8 GHz processor speed. The processing pipeline included (1) dividing questions into M topics, (2) computing cosine-similarity for all pairs in a topic, and (3) applying the duplicate-scoring model to the pairs with cosine-similarity above a pre-defined threshold. The total number of duplicate pairs was found to be 5,597,799 and contained 281,031 unique questions (or 35% of the AnswerXchange "live" questions). In 2017, they contributed 56% of the AnswerXchange document views.

The documents in the identified duplicate pairs can be ranked by suitable question (and answer) proxy content quality metrics as discussed earlier, for example by the number of views, votes, age of the post, or by a weighted combination thereof. The document with the lower score can be removed consecutively from each pair, resulting in a removal of 217,767 documents (27% of the AnswerXchange "live" questions).
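
The consecutive removal step admits more than one reading. The sketch below shows one possible interpretation, in which pairs are processed in order and a pair is skipped once either of its documents has already been archived. The question IDs and quality scores are toy values, and `archive_lower_scored` is a hypothetical helper rather than the procedure used to obtain the figures above.

```python
def archive_lower_scored(pairs, quality):
    """Walk the duplicate pairs and archive the lower-quality document of each live pair."""
    archived = set()
    for a, b in pairs:
        if a in archived or b in archived:
            continue  # this pair was already resolved by an earlier removal
        archived.add(a if quality[a] < quality[b] else b)
    return archived


pairs = [("q1", "q2"), ("q2", "q3"), ("q1", "q3"), ("q4", "q5")]          # toy duplicate pairs
quality = {"q1": 17019, "q2": 1759, "q3": 1060, "q4": 300, "q5": 25243}   # e.g. views
archived = archive_lower_scored(pairs, quality)
live = set(quality) - archived
```

In this toy run the best document of each cluster stays live, and not every question that appears in a duplicate pair has to be removed.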

ACKNOWLEDGMENTS
We thank anonymous reviewers for valuable comments.

REFERENCES
1. Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, Gilad Mishne. 2008. Finding High-Quality Content in Social Media. In: Proc. of the International Conference on Web Search and Data Mining, 183-193.
2. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19, 1-16.
3. Klemens Muthmann, Alina Petrova. 2014. An automatic approach for identifying topical near-duplicate relations between questions from social media Q/A sites. In: Classifying Big Data from the Web, 1-6.
4. Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, Karin Verspoor. 2017. SemEval-2017 Task 3: Community Question Answering. In: Proc. of the 11th Int. Workshop on Semantic Evaluation, 27-48.
5. Igor A. Podgorny, Matthew Cannon, Todd Goodyear. 2015a. Pro-active detection of content quality in TurboTax AnswerXchange. In: Proc. of ACM Conference Companion on CSCW, 143-146.
6. Igor A. Podgorny, Chris Gielow, Matthew Cannon, Todd Goodyear. 2015b. Real time detection and intervention of poorly phrased questions. In: CHI '15 Extended Abstracts, 2205-2210.
7. R. S. Ramya, K. R. Venugopal, S. S. Iyengar, L. Patnaik. 2016. Feature Extraction and Duplicate Detection for Text Mining: A Survey. Global Journal of Computer Science and Technology 56, 5.
8. Anna Shtok, Gideon Dror, Yoelle Maarek, Idan Szpektor. 2012. Learning from the Past: Answering New Questions with Past Answers. In: WWW, 759-768.
9. Ivan Srba, Mária Bieliková. 2016. A Comprehensive Survey and Classification of Approaches for Community Question Answering. In: TWEB, 10(3), 18:1-18:63.
