Semi-Automated Prevention and Curation of Duplicate Content in Social Support Systems

Igor A. Podgorny
Intuit, Inc.
San Diego, USA
igor_podgorny@intuit.com

Chris Gielow
Intuit, Inc.
San Diego, USA
chris_gielow@intuit.com

ABSTRACT
TurboTax AnswerXchange is a popular social Q&A system supporting users working on U.S. federal and state tax returns. Based on a custom-built duplicate scoring model, 35% of AnswerXchange questions have been found to be near-duplicates responsible for 56% of AnswerXchange document views. This degrades the user experience for both the asker, who is unable to find an answer amid duplicates, and the answerer, who is unable to efficiently answer at scale. The duplicate questions tend to form micro-clusters that grow via preferential attachment and, once exceeding some 25 questions in size, start morphing into mega-clusters with a complex network topology. This behavior can be leveraged to design semi-automated content curation systems to detect whether a newly posted question is a duplicate and, if so, which duplicate cluster it belongs to. In order to improve user experience in AnswerXchange, we explore how human and artificial intelligence can be jointly employed and then present several data-driven intelligent user interfaces. The duplicate scoring models can be utilized as elements of question-posting and answering experiences, unanswered question queueing and answer bots. These approaches can be extended to any social support Q&A system where duplicate posting negatively impacts search relevance and content consumption.

Author Keywords
TurboTax; AnswerXchange; CQA; community question answering; social question answering; duplicate clusters; content deduplication.

ACM Classification Keywords
H.5.m. Information Interfaces and Presentation (e.g. HCI): Miscellaneous

INTRODUCTION
Social Q&A systems provide a convenient self-support option for tax and financial software applications where personalized long-tail content generated by the users can supplement curated knowledge base answers. Users often prefer self-help to assisted measures (e.g. phone support or online chat) and are often able to find and apply their solution faster. This also reduces the load on assisted channels, ensuring they remain available to those who need them.

AnswerXchange (http://ttlc.intuit.com) is a social Q&A site where customers can learn and share their knowledge with other TurboTax customers while preparing U.S. federal and state tax returns and also find step-by-step instructions on using the TurboTax application [5, 6]. As the users step through the TurboTax interview pages, they can ask questions about software and tax topics (Figure 1) and receive answers in a matter of minutes. AnswerXchange has generated millions of questions and answers that have helped tens of millions of TurboTax customers since launching in 2007.

Figure 1. AnswerXchange question-posting user experience. Question title (a short summary of the question limited to 255 characters) is mandatory. Question details (not shown) are optional and unlimited in size.

The majority of users can find answers by searching the existing content. The overall quality of a customer self-help system is therefore determined by how well the self-help system assists in finding the relevant content. The number of search sessions resulting in assisted support contacts (as large as hundreds of thousands of customers per year) and the fraction of user up or down votes on self-support content provide convenient proxy metrics of content quality and search relevance in TurboTax self-help [5].

2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. ESIDA'18, March 11, Tokyo, Japan.

Figure 2. An example of duplicate AnswerXchange search results. Question titles and answer snippets are shown in purple and in black, respectively.

One problem with the existing question-posting experience (Figure 1) is that searches may result in multiple and often duplicate answers that are relatively close to the intent of the original question, but still do not match the original search intent (Figure 2). This interferes with the user's ability to select from a diverse set of possible answers [5] and often results either in the submission of a duplicate question or in switching to a less-desired support channel. A related problem is that users may submit poor quality questions by not providing all of the relevant information needed for a good quality answer [5].

One solution is a manual review of the user generated content to archive some of the duplicate questions and related answers, if any, while keeping the best performing content in "live" status (i.e. making it available for search). This approach is labor intensive and does not address the problem with the question-posting user experience. Duplicate questions may quickly build up, adding unnecessary burden on community question answering along the way.

The goal of this study is to address the problems of duplicate content prevention in AnswerXchange by combining machine learning and intelligent user interfaces. In what follows, we describe duplicate detection algorithms developed earlier and present a custom model trained on AnswerXchange questions. Next, we introduce the concept of "duplicate clusters" that provide a framework for semi-automated duplicate content prevention. Finally, we present several custom designed data-driven intelligent user interfaces for addressing the duplicate content problem.

RELATED WORK
The task of estimating semantic similarity of text documents has multiple practical applications and is of growing interest to the research community. The areas of research include web page similarity, document similarity, sentence similarity, search query similarity and utterance similarity in conversational user interfaces. These tasks are also related to a more general problem of detecting duplicates in database records [2].

Questions in social Q&A systems are often confined to one or two relatively short sentences and may warrant domain specific approaches to addressing question similarity. For example, two questions in a social Q&A system can be considered semantically identical if a single answer satisfies the needs of both original askers [3]. The answer may not yet exist in the production database but could be generated if needed. The task of duplicate-question detection is also related to the task of re-formulating a newly formed question [6] and automatically finding an answer to a new question [8].

The most recent results in the area of duplicate content scoring came from the 2017 Kaggle "Quora Pair" competition with model submissions from more than 3,000 teams (https://www.kaggle.com/c/quora-question-pairs). In this competition, the participants were tasked with classifying whether Quora question pairs are duplicates or not, based on 200,000 training instances. Finally, the SemEval 2017 Task on Community Question Answering ("Question–Comment Similarity", "Question–Question Similarity", etc.) resulted in submissions from 23 teams [4].

The problem of duplicate detection and curation is closely related to the task of predicting content quality in social Q&A systems. Content quality metrics may be helpful in selecting the best performing question and answer for the duplicate-question pair. Answer and question quality in social Q&A systems has been the focus of increasing attention from the scientific community [1, 9].

DUPLICATE-SCORING MODEL
AnswerXchange Search
AnswerXchange search is built with Apache Lucene open-source software (http://lucene.apache.org). By default, Lucene uses "tf-idf" (https://en.wikipedia.org/wiki/tf-idf) and "cosine-similarity" as standard methods of ranking search results. Shorter documents with the same set of matching keywords typically rank higher than longer documents with similar semantic meaning. An average AnswerXchange search query is 2-3 terms long (i.e. shorter than a typical AnswerXchange question) and it is often comparable in length with the title of a potentially duplicate question. The question details play a lesser role compared to titles, contributing to extra boosting of duplicate content by Lucene. The AnswerXchange Lucene ranking algorithm tends to boost new content and also accounts for various metadata such as helpfulness votes.

Training Data
The problem of near-duplicate detection can be formulated as an unsupervised or supervised machine learning task [7]. In the unsupervised case, duplicate pairs and clusters can be found based on distance metrics such as cosine-similarity of the weighted tf-idf vectors, Jaccard similarity coefficient, distance in word2vec space, etc. In the supervised case, the problem of finding topical near-duplicate relations can be formulated as follows: given a pair of questions, the machine learnt model has to predict a "duplicate score" and determine if the questions are duplicates based on a pre-defined threshold. In this paper, we employ a "hybrid" approach starting with cosine-similarity metrics for data pre-processing and then adding a more accurate custom-built scoring model to the processing pipeline.

As the fraction of duplicate pairs in AnswerXchange is relatively low, the question pairs ranked by cosine-similarity provide a convenient dataset for labeling based on the importance sampling approach. Towards this goal, we computed bag-of-words cosine-similarity (Appendix A) for 790,000 questions available for search in AnswerXchange at the end of 2017 U.S. Tax Day (April 18). Next, four AnswerXchange moderators added class labels (0 or 1) to a random sample of 4,000 near-duplicate pairs. Instances open to doubt were flagged by moderators and then re-labeled by consensus. Finally, 1,000 randomly sampled non-duplicate pairs were added to the final version of the training dataset to make it equally divided between duplicate and non-duplicate pairs.
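
The cosine-similarity pre-processing step can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the whitespace tokenizer, the smoothed idf formula and the helper names (`tfidf_vectors`, `candidate_pairs`) are assumptions made for the example, and the 0.7 threshold is simply the value quoted with Table A1.

```python
import math
from collections import Counter
from itertools import combinations

def tokenize(text):
    """Lowercase and split on whitespace (a real pipeline would do more cleaning)."""
    return text.lower().split()

def tfidf_vectors(questions):
    """Return one sparse tf-idf vector (dict: term -> weight) per question."""
    docs = [Counter(tokenize(q)) for q in questions]
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed idf (an assumption for this sketch): always positive.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def candidate_pairs(questions, threshold=0.7):
    """Return (i, j, similarity) for pairs at or above the threshold, best first."""
    vectors = tfidf_vectors(questions)
    pairs = []
    for i, j in combinations(range(len(questions)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Pairs returned by `candidate_pairs` would then be the ones handed to moderators for labeling, or to a more accurate scoring model.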

Duplicate-Scoring Model Features
The model features can be learnt from training data and/or by knowledge acquisition from AnswerXchange moderators. We have used the following model features:
- Cosine-similarity with tf-idf weighting (see Appendix A).
- Probabilistic topic ID of the question computed with Latent Dirichlet Allocation (see Appendix A).
- U.S. tax year in the question.
- Distinct words in the question pair.
- Common words in the question pair.
- Type of the question (e.g. "closed-ended" questions such as "Can I deduct…" typically account for tax related questions, while "how" questions often account for product related questions).
- First word of the question.
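
Several of the listed features are simple to derive from the raw question text. The sketch below shows one possible encoding of the word-overlap, tax-year and first-word features; the function name `pair_features`, the regular expressions and the exact feature definitions are assumptions for illustration, since the paper does not specify how each feature was computed.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tax_years(text):
    """Four-digit years starting with 19 or 20 mentioned in the text."""
    return set(re.findall(r"\b(?:19|20)\d{2}\b", text))

def pair_features(question_a, question_b):
    """Illustrative features for one question pair (definitions are assumptions)."""
    a, b = tokens(question_a), tokens(question_b)
    set_a, set_b = set(a), set(b)
    return {
        "common_words": len(set_a & set_b),      # words shared by both questions
        "distinct_words": len(set_a ^ set_b),    # words found in only one question
        "same_first_word": bool(a) and bool(b) and a[0] == b[0],
        "same_tax_year": tax_years(question_a) == tax_years(question_b),
    }

features = pair_features(
    "I need a copy of my federal tax return for 2014",
    "I need 2015 Tax Return",
)
```

For the example pair taken from the text, the two questions share four words and mention different tax years, which is the kind of signal a trained model can weigh differently from raw cosine-similarity.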

Duplicate-Scoring Model Performance
Based on the set of 5,000 labeled question pairs, we trained and tested linear (logistic regression) and non-linear (random forest) binary classifiers using the Python machine learning library "scikit-learn". The model predicts the class label (0 for a non-duplicate and 1 for a duplicate pair) and also the duplicate score (i.e. the probability of the question pair belonging to either class, ranging from 0.0 to 1.0) that can be used to select the user experience based on predefined threshold(s). We also trained a separate version of the logistic regression classifier using cosine-similarity as a single model feature. Shown in Table 1 are common metrics used for predictive model evaluation: area under curve (AUC) for receiver operating characteristic, F1 score and logarithmic loss (log loss) function for classification.

Model                 AUC    F1 Score   Log Loss
Logistic Regression   0.95   0.88       0.27
Random Forest         0.94   0.87       0.31
Cosine-similarity     0.83   0.73       0.48

Table 1. Model performance metrics for duplicate-scoring models (details are explained in the text).
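
For readers less familiar with the three metrics, the sketch below computes them from class labels and duplicate scores in plain Python. It is only meant to make the definitions concrete; the labels and scores in the example are invented toy values, and the 0.5 decision threshold for F1 is an assumption rather than the threshold used in the study.

```python
import math

def roc_auc(labels, scores):
    """Probability that a random duplicate pair outscores a random non-duplicate pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(labels, scores, threshold=0.5):
    """Harmonic mean of precision and recall at the given score threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def log_loss(labels, scores, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted scores."""
    total = 0.0
    for y, s in zip(labels, scores):
        s = min(max(s, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(s) + (1 - y) * math.log(1 - s)
    return total / len(labels)

# Toy values, not data from the study.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
```

Higher AUC and F1 and lower log loss indicate a better model, which is how the rows of Table 1 should be read.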

As seen from Table 1, both logistic regression and random forest models achieve performance that is consistent with the goals of this exploratory study. At the same time, the cosine-similarity version underperforms the first two by a wide margin. This can be explained by the inability to find an optimal threshold separating duplicate and non-duplicate pairs using cosine-similarity alone.

The following two examples illustrate the relationship between keyword-based cosine-similarity and the duplicate-question score computed with logistic regression. The first example is an AnswerXchange question pair with a relatively low cosine-similarity of 0.61: (1) "I need a copy of my federal tax return for 2014" and (2) "I need 2015 Tax Return". Both questions can be answered with a single instruction about getting a copy of a prior year tax return filed with TurboTax and hence are duplicates. The second example is a question pair with a high cosine-similarity of 1.0: (1) "do i have to file state taxes" and (2) "how to file state taxes". These questions are not duplicates because they belong to tax and product categories [5], respectively, and would require two different answers.

DUPLICATE CLUSTERS
Preferential Attachment and Topology
After identifying 5,597,799 duplicate question pairs in AnswerXchange (Appendix A), we built an undirected graph of 281,031 duplicate questions. Each duplicate pair and each duplicate question identified with the model constituted a graph edge and a graph vertex, respectively. The resulting graph consists of 14,616 connected components, hereafter referred to as "duplicate clusters."

To explore duplicate-cluster scaling behavior, we ranked clusters by the number of questions and plotted the number of questions per cluster vs. cluster rank in log-log scale (Figure 3). The largest cluster has 23,236 questions and the smallest ones only have two. The plot also includes graph (or edge) density: $D = \frac{2E}{V(V-1)}$, where E is the number of edges (i.e. duplicate pairs) and V is the number of vertices (i.e. questions). Graph density is equal to 1.0 for fully connected graphs. In the latter case, each question in the cluster is connected to all remaining questions in the same duplicate cluster.

Based on both question counts and graph density, the duplicate clusters in Figure 3 can be divided into three distinct groups marked as mega-clusters, transitional clusters and micro-clusters. These groups account for 84%, 2% and 14% of duplicate questions, respectively.
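
The construction of duplicate clusters from duplicate pairs can be sketched in a few lines of plain Python. The helper names below are illustrative and the toy pairs are invented; the density function uses the standard undirected-graph definition, 2E / (V(V-1)), which equals 1.0 for a fully connected cluster.

```python
from collections import defaultdict

def build_graph(pairs):
    """Undirected adjacency sets: questions are vertices, duplicate pairs are edges."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def connected_components(adjacency):
    """Return the duplicate clusters (vertex sets), largest first."""
    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(component)
    return sorted(clusters, key=len, reverse=True)

def density(component, adjacency):
    """Edge density 2E / (V(V-1)) of one cluster."""
    v = len(component)
    if v < 2:
        return 0.0
    e = sum(len(adjacency[node]) for node in component) // 2
    return 2 * e / (v * (v - 1))

# Toy duplicate pairs (question IDs are invented).
graph = build_graph([(1, 2), (2, 3), (4, 5)])
clusters = connected_components(graph)
```

Ranking `clusters` by size and plotting size against rank on log-log axes would give a plot of the kind described for Figure 3.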

Figure 3. Scaling behavior of duplicate clusters (black dots) in AnswerXchange questions. The clusters are ranked by the number of questions in descending order. Graph density for the clusters is shown in gray. Cyan and red dots refer to the clusters shown in Figures 4 and 5, respectively.

An example of a micro-cluster with 23 vertices is shown in Figure 4. Graph density is 0.54 and most of the vertices are interconnected, with the exception of three vertices connected by bridges to a denser graph core. The corresponding articulation points are marked by blue dots. Note that even if questions 1 and 2 are duplicates and questions 2 and 3 are duplicates, this does not mean that questions 1 and 3 are duplicates as well. This explains why a duplicate-cluster density is typically less than 1.0 unless the graph size is limited to two questions.
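
Articulation points, i.e. vertices whose removal disconnects a cluster, can be found with the classic depth-first "low-link" method. The sketch below is a textbook recursive version in plain Python, offered as an illustration rather than the tool used in the study; for clusters with thousands of vertices an iterative variant or a graph library would be preferable because of Python's recursion limit.

```python
def articulation_points(adjacency):
    """Articulation points of an undirected simple graph (dict: node -> set of neighbors)."""
    discovery, low, points = {}, {}, set()
    counter = [0]

    def visit(node, parent):
        discovery[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for neighbor in adjacency[node]:
            if neighbor == parent:
                continue
            if neighbor in discovery:
                # Back edge: the node can reach an earlier-discovered vertex.
                low[node] = min(low[node], discovery[neighbor])
            else:
                children += 1
                visit(neighbor, node)
                low[node] = min(low[node], low[neighbor])
                # A non-root node is critical if a child subtree cannot bypass it.
                if parent is not None and low[neighbor] >= discovery[node]:
                    points.add(node)
        # The root is critical only if it has more than one DFS child.
        if parent is None and children > 1:
            points.add(node)

    for node in adjacency:
        if node not in discovery:
            visit(node, None)
    return points

# Questions 1-2 and 2-3 are duplicates, but 1-3 are not: question 2 holds the cluster together.
chain = {1: {2}, 2: {1, 3}, 3: {2}}
# A fully connected triangle with one extra question attached by a bridge.
triangle_with_tail = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```

In the three-question chain from the text, the middle question is the only articulation point, which is also why that cluster's density falls below 1.0.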

As seen from Figure 3, micro-cluster scaling behavior follows Zipf distribution (https://en.wikipedia.org/wiki/zipf's_law): $N(r) = N r^{-\alpha}$, where N(r) is the number of questions in the cluster of rank r, and r ranges from about 100 to the total number of clusters R. Accordingly, the growth of N ($\Delta N$) and R ($\Delta R$) would be constrained by the following equation: $\Delta N / N = \alpha \, \Delta R / R$. It is worth mentioning that Zipf distribution is an asymptotic case of a more general Yule-Simon distribution (https://en.wikipedia.org/wiki/Yule-Simon_distribution) typical for the preferential attachment process, meaning that a newly posted duplicate is more likely to become attached to an existing cluster than to form a new duplicate pair. The scaling parameter for the micro-clusters, $\alpha = \frac{\log N(r_1) - \log N(r_2)}{\log(r_2) - \log(r_1)}$, can be estimated as 0.6. By extrapolating Zipf distribution to r = 1 (that would correspond to a non-existing largest micro-cluster), one can estimate the N value as 400. This value, however, is almost two orders of magnitude less than the number of questions in the top mega-cluster.
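
The scaling-parameter estimate can be illustrated numerically. The sketch assumes the usual power-law form of Zipf's law, N(r) = N · r^(-α), and estimates α from the slope between two points on the log-log plot. The two (rank, size) points are synthetic, generated from the values quoted in the text (α = 0.6, N = 400) rather than read from Figure 3, and the function names are illustrative.

```python
import math

def zipf_exponent(rank_1, size_1, rank_2, size_2):
    """Slope magnitude between two points of a log-log rank/size plot."""
    return (math.log(size_1) - math.log(size_2)) / (math.log(rank_2) - math.log(rank_1))

def extrapolate_to_rank_one(rank, size, alpha):
    """Size predicted at r = 1 for a power law passing through (rank, size)."""
    return size * rank ** alpha

# Synthetic points consistent with alpha = 0.6 and N = 400 (assumed, not measured).
size_at_100 = 400 * 100 ** -0.6     # about 25 questions
size_at_1000 = 400 * 1000 ** -0.6   # about 6 questions

alpha = zipf_exponent(100, size_at_100, 1000, size_at_1000)
n_at_rank_one = extrapolate_to_rank_one(100, size_at_100, alpha)
```

Under these assumptions a cluster at rank 100 holds about 25 questions, which matches the size at which the abstract says micro-clusters start morphing into mega-clusters.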

Figure 4. A micro-cluster marked by the cyan dot in Figure 3. Articulation points are shown by smaller blue dots.

To explain the scale break in the distribution shown in Figure 3, let us examine larger duplicate clusters in more detail. Shown in Figure 5 is a mega-cluster with 4,549 questions. The cluster has a density equal to 0.0017 and 1,048 articulation points. This means that mega-clusters may consist of multiple sub-clusters that are semantically related to each other, but whose elements are not duplicates unless they belong to the same sub-cluster.

Figure 5. Same as in Figure 4, but now for a mega-cluster.

As the number of duplicates reaches a certain level, the clusters start coalescing by establishing bridges with other clusters, duplicate pairs and stand-alone questions, quickly evolving from dense connected graphs to sparse graphs with a complex network topology. The area of transition is marked as transitional clusters in Figure 3.

Semi-Automated Duplicate Content Curation
While the task of duplicate content archiving is straightforward once duplicate pairs are found (Appendix A), the duplicate content can build up again unless question-posting and/or search experiences are modified. Our next goal is therefore to explore how the concept of duplicate clusters discussed in the previous section can be applied to these tasks.

The curation of micro-clusters can be done automatically or semi-automatically (i.e. with minimum human involvement) by retaining one or a few best performing long-tail documents (i.e. documents that include both questions and answers) and assigning them a cluster ID for subsequent re-use. The curation of mega-clusters represents a more challenging problem. First, a single best performing document in a mega-cluster may simply not exist since the cluster may contain multiple sub-clusters connected by bridges. Second, duplicate curation by a human is a cumbersome task due to the complex topology of the mega-cluster. While the exact solution may simply not exist, approximate solutions may be sufficient to reduce the number of duplicates posted in AnswerXchange to an acceptable level.

One approach would be to break the mega-clusters into smaller parts by deleting bridges in the graph or by employing conventional hierarchical clustering. For example, the duplicate cluster shown in Figure 5 can be split into 1,363 connected components by removing all articulation points (blue dots in Figure 5). Most of the resulting connected components, however, are disconnected documents. A more practical approach is to archive non-performing short-tail content from the mega-cluster and curate the resulting connected components. Shown in Figure 6 is a subset of the mega-cluster from Figure 5 that now only includes documents with at least 100 views. This results in breaking the original mega-cluster into 68 connected components, which are easier to curate.

Figure 6. A subset of the mega-cluster shown in Figure 5. Grey dots mark documents used in Figure 7.

The next task is to present duplicate content in a form suitable for semi-automated content curation. Figure 7 shows an example of duplicate content metrics for eight documents with at least 1,000 views. The left column is a sub-cluster ID followed by a post ID identifying an AnswerXchange document consisting of the original question and all accumulated answers (not shown). The text of the question and the type of the question (i.e. user-generated content marked as UGC or knowledge base content labeled as FAQ) are included in the third and fourth columns, respectively. The last two columns are views accumulated over a given period and the percentage of up-votes. The documents can be ranked by views and/or votes, providing a mechanism for identifying and removing non-performing content either manually or automatically based on a set of predefined content quality thresholds.
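
Threshold-based curation of a sub-cluster can be sketched using the eight documents listed for Figure 7. The thresholds (5,000 views, 70% up-votes), the rule of always keeping the most viewed document, and the function name `curate` are assumptions made for illustration; the paper does not state which thresholds or rules were applied.

```python
from collections import defaultdict

# (sub_cluster_id, post_id, doc_type, views, upvote_percent) as listed for Figure 7.
DOCUMENTS = [
    (1, 1899475, "FAQ", 17019, 74.8),
    (1, 2666148, "UGC", 1759, 77.9),
    (1, 3048015, "UGC", 1060, 78.1),
    (1, 3356358, "FAQ", 6727, 70.3),
    (1, 3705028, "UGC", 2999, 67.0),
    (2, 2895188, "FAQ", 25243, 79.9),
    (2, 2899090, "FAQ", 13765, 79.1),
    (2, 2956890, "UGC", 1509, 86.6),
]

def curate(documents, min_views=5000, min_upvote=70.0):
    """Split each sub-cluster into kept posts and archive candidates.

    The most viewed document is always kept; the rest must pass both
    (hypothetical) quality thresholds to stay live.
    """
    by_cluster = defaultdict(list)
    for doc in documents:
        by_cluster[doc[0]].append(doc)
    kept, archive = {}, {}
    for cluster_id, docs in by_cluster.items():
        ranked = sorted(docs, key=lambda d: d[3], reverse=True)  # rank by views
        kept[cluster_id] = [ranked[0][1]]
        archive[cluster_id] = []
        for _, post_id, _, views, upvote in ranked[1:]:
            if views >= min_views and upvote >= min_upvote:
                kept[cluster_id].append(post_id)
            else:
                archive[cluster_id].append(post_id)
    return kept, archive

kept, archive = curate(DOCUMENTS)
```

A trusted user could review the archive candidates before anything is removed, which keeps the process semi-automated rather than fully automatic.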

Figure 7. Duplicate document metrics for the documents marked by grey dots in Figure 6.

Duplicate metrics can be operationalized by adding an algorithm to match the best question to the best answer in the sub-cluster. Such a system would include deleting and merging answers, manually or automatically, by attaching an automatically generated "best" answer to the "best" duplicate question. The solution can be implemented as a back-end tool for trusted users assigned to the task of duplicate archiving and hidden from the less experienced regular users. The solution goes beyond simple duplicate archiving by providing an option to merge available answers to the existing duplicate questions. The non-human part of the solution includes quality ranking of the existing answers, e.g. up and down vote statistics as shown in Figure 7. In this way, the newly formed question-answer pairs provide better quality content available for search by combining the visually appealing questions and the best ranked answers. This is done by combining artificial and human intelligence since the answer to a related question (that the system recommended) can be confirmed by the contributor if needed. The cluster notes can be edited by trusted users and applied to all articles within the cluster.

Real Time Duplicate Detection
Finding duplicates to a given question requires (N-1) pairwise comparisons to the questions in the database and may not be feasible in real time. The computational time can be reduced by selecting potential duplicate matches with AnswerXchange search. The top performing documents in the clusters can be assigned an ID and indexed separately by the search engine. Once the search engine returns the documents ranked by relevancy to the newly formulated question, the duplicate-scoring model is applied to the top matches to see if the new question is a duplicate and, if so, which duplicate cluster it belongs to.
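
The retrieve-then-score flow described above can be sketched as follows. Everything here is a stand-in: `toy_search` replaces the Lucene-based AnswerXchange search, `toy_score` (Jaccard word overlap) replaces the trained duplicate-scoring model, and the index entries, `top_k` and `threshold` values are hypothetical. Only the control flow mirrors the text.

```python
def words(text):
    """Lowercased word set; a stand-in for real text normalization."""
    return set(text.lower().split())

def toy_search(query, index):
    """Hypothetical retrieval step: rank cluster representatives by shared words."""
    q = words(query)
    hits = [(len(q & words(doc["title"])), doc) for doc in index]
    hits = [hit for hit in hits if hit[0] > 0]
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [doc for _, doc in hits]

def toy_score(question_a, question_b):
    """Hypothetical duplicate score: Jaccard overlap instead of the trained model."""
    a, b = words(question_a), words(question_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def find_duplicate_cluster(question, index, top_k=10, threshold=0.5):
    """Score only the top search matches; return (cluster_id, score) or None."""
    best = None
    for doc in toy_search(question, index)[:top_k]:
        score = toy_score(question, doc["title"])
        if score >= threshold and (best is None or score > best[1]):
            best = (doc["cluster_id"], score)
    return best

# Separately indexed top documents, one per duplicate cluster (invented IDs).
INDEX = [
    {"cluster_id": 1, "title": "I need a copy of my 2014 tax return"},
    {"cluster_id": 2, "title": "Where do I enter my medical expenses"},
]
```

Because only `top_k` candidates are scored, the cost per new question no longer grows with the full corpus size, which is the point of putting search in front of the scoring model.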

[Figure 7 data]
ID   POST_ID     DOCUMENT                                                         TYPE   VIEWS    UPVOTE
1    1,899,475   Can I deduct job-search expenses                                 FAQ    17,019   74.8
1    2,666,148   HI. Where do I enter my job search                               UGC    1,759    77.9
1    3,048,015   Where do I include job search                                    UGC    1,060    78.1
1    3,356,358   Where do I enter my job search                                   FAQ    6,727    70.3
1    3,705,028   Where do I deduct job search                                     UGC    2,999    67
2    2,895,188   Where do I enter my medical                                      FAQ    25,243   79.9
2    2,899,090   Why doesnt my refund change after I enter my medical expenses    FAQ    13,765   79.1
2    2,956,890   where do i enter OUT OF POCKET medical expenses                  UGC    1,509    86.6

DATA-DRIVEN USER EXPERIENCES
Accumulation of duplicate content can be prevented by integrating a custom-built duplicate-scoring model and question-posting experience. Another option is to expose an intelligent interface to the trusted users by providing extra features for answering duplicate questions. Finally, the duplicate question curation can be part of the content moderation process carried out by the AnswerXchange trusted users or trained bots.

Question Deduplication While Posting
The first feature (Figure 8) extends the AnswerXchange "Question Optimizer" system [6]. The system prompts the asker with personalized instructions created dynamically based on real time analysis of the question's semantics and writing style. The "Question Optimizer" has been re-designed to make a duplicate question more difficult to submit without addressing the recommended re-phrasing. The annotations to the concept are presented next.

Figure 8. Question-posting experience reveals the duplicates and helps users re-phrase as a unique question.

A) The "Question Optimizer" technology is envisioned to include duplicate content detection in addition to providing timely advice on how to re-phrase or deflect.
B) If the question falls in a known duplicate cluster, the best matching and most referenced answer matches are shown.
C) Trusted users may attach "cluster notes" to curated duplicate clusters; the notes appear automatically with any question within the cluster. In the example shown in Figure 8, the duplicate cluster is about printing and the message notes that the printing experience recently changed in the product, information which may be useful to anyone with printing-related questions.
D) The suggested answers are deduplicated using duplicate score equalization so the answers are more useful. A "cluster browser" is also added below the results to help refine among the most popular variations.

Question Deduplication While Answering
The second feature addresses the situation where a potential duplicate has been submitted and needs to be intercepted as part of the question answering experience. This concept is illustrated in Figures 9-10.

Figure 9. Contributor experience tagging and attaching curated answer to the question.

Specifically, Figure 9 illustrates the contributor (typically a trusted user) answering experience and includes the following annotation:

[Figure 9 screenshot text: for the question "I need a copy of my 2014 Tax return" (asked by Chris 30 minutes ago), the "Suggested answers" panel shows the duplicate "I need to get a copy of my 2014 return and I don't have the cd." with a 92% match and 2,314 duplicates, together with the actions "attach" and "attach and mark answered" (annotation E).]

E) The suggested answered question duplicate is presented to the original asker and also displays the duplicate probability. The contributor can easily attach it to their answer, which also tells the system the question was a duplicate and should be archived in favor of the attached one.

Figure 10. Original asker view of deduplicated question with personalized answer.

Once the duplicate question is answered, it becomes available to the original asker (Figure 10).
C) Re-purposing trusted users' notes similar to those used in the question-posting experience (Figure 8).
F) A personalized note introduces the "recommended answer" while explaining it's a duplicate.
G) The duplicate answer is presented with a sense of authority.
H) If the original asker is unsatisfied with the answer, they may revise their question and it will re-enter the answer queue. They also have the option to request a new answer without submitting the question.

Finally, the automatic flagging of an unanswered question as a duplicate may be validated or invalidated by the trusted users and used to update the training dataset for model re-training.

Question Deduplication with Automated Answers
The "Answer Bot" (Figure 11) is a feature driven by artificial intelligence alone. The "Answer Bot" increases self-support efficiency by responding to a customer's questions by e-mail with answers from the matching duplicate cluster if the posted question is flagged by the duplicate-scoring model as a duplicate.
I) "Answer Bots" may automatically answer questions determined to be duplicates. Like the contributor-assisted experience, the bot will recommend the best answer within the duplicate cluster. The user is made aware that a bot answered the question and, if unsatisfied, may request a new answer or revise their question.

Figure 11. Automated deduplication user experience as part of customized e-mail to the original asker.

Further, the "Answer Bot" attaches the question to the existing duplicate cluster automatically while providing a generic or personalized answer. The bot replies trigger automated archiving of the duplicate content. The question remains visible to the original asker but is not made available to AnswerXchange users and is suppressed from search results.

A related option is to create two separate queues of duplicate questions for answering. The questions in the first queue would be assigned to designated moderators who can customize duplicate content for the original asker and archive it afterwards. The less complicated questions in the second queue can be assigned to the "Answer Bot".

[Figure 10 screenshot text: the asker sees "Your question shares the same answer as this similar question: I need a copy of my 2014 Tax return", a personalized contributor note ("Chris, try this to download a new copy"), the recommended answer with steps for downloading the 2014 return from the Tax Timeline, a trusted-user note that the printing experience in TurboTax changed in 2016, and the actions "Revise my question" and "Request a new answer" (annotations F, G, C, H).]

[Figure 11 screenshot text: the Answer Bot message "I think your question might share the same answer as this similar question: I need a copy of my 2014 Tax return. I am a bot, and this action was performed automatically. If my answer is unhelpful, you may request a new answer or revise your question.", followed by the same recommended answer (annotation I).]

DISCUSSION AND CONCLUSION
Social Q&A systems often presume that the users comply with recommendations not to replicate the existing content. This is not the case for AnswerXchange, where users often avoid consuming existing content by posting a new duplicate question. These users may not realize that AnswerXchange is a social Q&A site or may lack the ability to find and apply existing answers to their question. We need to intervene with intelligent user interfaces to alter the duplicate posting behavior.

Towards this goal, we present two algorithms for duplicate content curation and for providing real time inputs to the AnswerXchange user interfaces. The first algorithm determines if two questions are near-duplicates and can be combined with a search to detect duplicates in real time. The second algorithm uncovers all duplicate pairs in AnswerXchange and is capable of handling the deduplication task with a corpus of millions of questions.

We conclude the paper by presenting three question deduplication user interfaces. The hypotheses we plan to validate are: (1) Will askers accept a duplicate when presented with an acceptable answer? (2) Will they accept a duplicate with or without a personalized contributor note? (3) If dissatisfied, will they revise or request a new answer? (4) Will they accept recommended answers from Answer Bots? We are planning to validate these hypotheses with a set of rapid experiments prior to production.

APPENDIX A: DUPLICATE PAIR DETECTION
Detecting duplicates for N = 790,000 questions based on a custom-built model would require N(N-1)/2 pairwise computations. The task of finding duplicate pairs becomes computationally expensive once the corpus reaches several hundred thousand documents. At the same time, computing cosine-similarity for a question pair is faster than scoring the same pair with the custom-built model and can be used to reduce the number of potential duplicate pairs from billions to millions of pairs. Further, dividing content by M probabilistic topics can reduce the number of pairwise comparisons by a factor of M, while not necessarily affecting the number of expected near-duplicate pairs.
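
The pair counts behind this argument are easy to check. The sketch below computes N(N-1)/2 for the full corpus and for a split into M topics; the assumption of equally sized topics is an idealization made here for the arithmetic, since real topic sizes are uneven and the actual reduction would then be smaller.

```python
def pair_count(n):
    """Number of unordered pairs among n questions: n(n-1)/2."""
    return n * (n - 1) // 2

def pair_count_by_topic(topic_sizes):
    """Pairs compared when questions are only compared within their own topic."""
    return sum(pair_count(size) for size in topic_sizes)

N = 790_000
all_pairs = pair_count(N)                       # every question against every other
M = 10
equal_topics = [N // M] * M                     # idealized: equally sized topics
topic_pairs = pair_count_by_topic(equal_topics)
reduction = all_pairs / topic_pairs             # close to M for equal topics
```

For the full corpus this gives roughly 312 billion pairs, and the split into ten equal topics brings it down to roughly 31 billion.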

M     Duplicates   Execution time (min)
50    63,355       13
30    72,920       18.5
10    73,068       36
1     83,773       265

Table A1. Duplicate statistics and computation time vs. number of probabilistic topics (M). Cosine-similarity threshold is 0.7. M = 1 means processing N(N-1)/2 pairs.

Shown in Table A1 are results of the numerical experiments conducted on a MacBook Pro laptop with 2.8 GHz processor speed. The processing pipeline included (1) dividing questions into M topics, (2) computing cosine-similarity for all pairs in a topic, and (3) applying the duplicate-scoring model to the pairs with cosine-similarity above a pre-defined threshold. The total number of duplicate pairs was found to be 5,597,799 and contained 281,031 unique questions (or 35% of the AnswerXchange "live" questions). In 2017, they contributed 56% to the AnswerXchange document views.

The documents in the identified duplicate pairs can be ranked by suitable question (and answer) proxy content quality metrics as discussed earlier, for example by the number of views, votes, age of the post, or by a weighted combination thereof. The document with the lower score can be removed consecutively from each pair, resulting in a removal of 217,767 documents (27% of the AnswerXchange "live" questions).
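
One way to read the consecutive removal step is sketched below: walk through the duplicate pairs, and for each pair whose two documents are both still live, archive the one with the lower quality score. Skipping pairs that already lost a member is an assumption of this sketch, as are the function name and the toy scores; the paper does not spell out the exact procedure.

```python
def archive_lower_scored(pairs, quality):
    """Return the set of documents archived by pairwise, consecutive removal.

    pairs   -- iterable of (doc_a, doc_b) duplicate pairs
    quality -- dict mapping each document to a proxy quality score
    """
    archived = set()
    for a, b in pairs:
        if a in archived or b in archived:
            continue  # this pair was already resolved by an earlier removal
        loser = a if quality[a] < quality[b] else b
        archived.add(loser)
    return archived

# Toy chain: 1-2 and 2-3 are duplicate pairs, 1-3 is not (scores are invented).
toy_pairs = [(1, 2), (2, 3)]
toy_quality = {1: 10, 2: 5, 3: 8}
removed = archive_lower_scored(toy_pairs, toy_quality)
```

In the toy chain only the middle document is archived, so the two remaining documents, which are not duplicates of each other, both stay live.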

ACKNOWLEDGMENTS
We thank anonymous reviewers for valuable comments.

REFERENCES
1. Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, Gilad Mishne. 2008. Finding High-Quality Content in Social Media. In: Proc. of the International Conference on Web Search and Data Mining, 183-193.
2. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19, 1-16.
3. Klemens Muthmann, Alina Petrova. 2014. An automatic approach for identifying topical near-duplicate relations between questions from social media Q/A sites. In: Classifying Big Data from the Web, 1-6.
4. Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, Karin Verspoor. 2017. SemEval-2017 Task 3: Community Question Answering. In: Proc. of the 11th Int. Workshop on Semantic Evaluation, 27-48.
5. Igor A. Podgorny, Matthew Cannon, Todd Goodyear. 2015a. Pro-active detection of content quality in TurboTax AnswerXchange. In: Proc. of ACM Conference Companion on CSCW, 143-146.
6. Igor A. Podgorny, Chris Gielow, Matthew Cannon, Todd Goodyear. 2015b. Real time detection and intervention of poorly phrased questions. In: CHI'15 Extended Abstracts, 2205-2210.
7. R. S. Ramya, K. R. Venugopal, S. S. Iyengar, L. Patnaik. 2016. Feature Extraction and Duplicate Detection for Text Mining: A Survey. Global Journal of Computer Science and Technology 56, 5.
8. Anna Shtok, Gideon Dror, Yoelle Maarek, Idan Szpektor. 2012. Learning from the Past: Answering New Questions with Past Answers. In: WWW, 759-768.
9. Ivan Srba, Mária Bieliková. 2016. A Comprehensive Survey and Classification of Approaches for Community Question Answering. In: TWEB, 10(3), 18:1-18:63.
