Semi-Automated Prevention and Curation of Duplicate Content in Social Support Systems

Igor A. Podgorny
Intuit, Inc.
San Diego, USA
igor_podgorny@intuit.com

Chris Gielow
Intuit, Inc.
San Diego, USA
chris_gielow@intuit.com

ABSTRACT
TurboTax AnswerXchange is a popular social Q&A system supporting users working on U.S. federal and state tax returns.
Based on a custom-built duplicate scoring model, 35% of AnswerXchange questions have been found to be near-duplicates responsible for 56% of AnswerXchange document views. This degrades the user experience for both the asker who is unable to find an answer amid duplicates, and the answerer who is unable to efficiently answer at scale. The duplicate questions tend to form micro-clusters that grow via preferential attachment and, once exceeding some 25 questions in size, start morphing into mega-clusters with a complex network topology. This behavior can be leveraged to design semi-automated content curation systems to detect whether a newly posted question is a duplicate and, if so, which duplicate cluster it belongs to. In order to improve user experience in AnswerXchange, we explore how human and artificial intelligence can be jointly employed and then present several data-driven intelligent user interfaces. The duplicate scoring models can be utilized as elements of question-posting and answering experiences, unanswered question queueing and answer bots. These approaches can be extended to any social support Q&A system where duplicate posting negatively impacts search relevance and content consumption.

Author Keywords
TurboTax; AnswerXchange; CQA; community question answering; social question answering; duplicate clusters; content deduplication.

ACM Classification Keywords
H.5.m. Information Interfaces and Presentation (e.g. HCI): Miscellaneous

INTRODUCTION
Social Q&A systems provide a convenient self-support option for tax and financial software applications where personalized long-tail content generated by the users can supplement curated knowledge base answers. Users often prefer self-help to assisted measures (e.g. phone support or online chat) and are often able to find and apply their solution faster. This also reduces the load on assisted channels, ensuring they remain available to those who need them.
AnswerXchange (http://ttlc.intuit.com) is a social Q&A site where customers can learn and share their knowledge with other TurboTax customers while preparing U.S. federal and state tax returns and also find step-by-step instructions on using the TurboTax application [5,6]. As the users step through the TurboTax interview pages, they can ask questions about software and tax topics (Figure 1) and receive answers in a matter of minutes. AnswerXchange has generated millions of questions and answers that have helped tens of millions of TurboTax customers since launching in 2007.

Figure 1. AnswerXchange question-posting user experience. Question title (a short summary of question limited to 255 characters) is mandatory. Question details (not shown) are optional and unlimited in size.

The majority of users can find answers by searching the existing content. The overall quality of a customer self-help system is therefore determined by how well the system assists in finding the relevant content. The number of search sessions resulting in assisted support contacts (as large as hundreds of thousands of customers per year) and the fraction of user up or down votes on self-support content provide convenient proxy metrics of content quality and search relevance in TurboTax self-help [5].

2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. ESIDA'18, March 11, Tokyo, Japan.

Figure 2. An example of duplicate AnswerXchange search results. Question titles and answer snippets are shown in purple and in black, respectively.

One problem with the existing question-posting experience (Figure 1) is that searches may result in multiple and often duplicate answers that are relatively close to the intent of the original question, but still do not match the original search intent (Figure 2). This interferes with the user's ability to select from a diverse set of possible answers [5] and often results either in the submission of a duplicate question or in switching to a less-desired support channel. A related problem is that users may submit poor quality questions by not providing all of the relevant information needed for a good quality answer [5].
One solution is a manual review of the user generated content to archive some of the duplicate questions and related answers, if any, and keep the best performing content in "live" status (i.e. making it available for search). This approach is labor intensive and does not address the problem with the question-posting user experience. Duplicate questions may quickly build up, adding unnecessary burden on community question answering along the way.
The goal of this study is to address the problems of duplicate content prevention in AnswerXchange by combining machine learning and intelligent user interfaces. In what follows, we describe duplicate detection algorithms developed earlier and present a custom model trained on AnswerXchange questions. Next, we introduce the concept of "duplicate clusters" that provide a framework for semi-automated duplicate content prevention. Finally, we present several custom designed data-driven intelligent user interfaces for addressing the duplicate content problem.

RELATED WORK
The task of estimating semantic similarity of text documents has multiple practical applications and is of growing interest to the research community. The areas of research include web page similarity, document similarity, sentence similarity, search query similarity and utterance similarity in conversational user interfaces. These tasks are also related to a more general problem of detecting duplicates in database records [2]. Questions in social Q&A systems are often confined to one or two relatively short sentences and may warrant domain specific approaches to addressing question similarity. For example, two questions in a social Q&A system can be considered semantically identical if a single answer satisfies the needs of both original askers [3]. The answer may not yet exist in the production database but could be generated if needed. The task of duplicate-question detection is also related to the task of re-formulating a newly formed question [6] and automatically finding an answer to a new question [8].
The most recent results in the area of duplicate content scoring came from the 2017 Kaggle "Quora Pair" competition with model submissions from more than 3,000 teams (https://www.kaggle.com/c/quora-question-pairs). In this competition, the participants were tasked to classify if Quora question pairs are duplicates or not based on 200,000 training instances. Finally, SemEval 2017 Task on Community Question Answering ("Question–Comment Similarity", "Question–Question Similarity", etc.) resulted in submissions from 23 teams [4].
The problem of duplicate detection and curation is closely related to the task of predicting content quality in social Q&A systems. Content quality metrics may be helpful in selecting the best performing question and answer for the duplicate-question pair. Answer and question quality in the social Q&A systems has been the focus of increasing attention from the scientific community [1,9].

DUPLICATE-SCORING MODEL
AnswerXchange Search
AnswerXchange search is built with Apache Lucene open-source software (http://lucene.apache.org). By default, Lucene uses "tf-idf" (https://en.wikipedia.org/wiki/tf-idf) and "cosine-similarity" as standard methods of ranking search results. Shorter documents with the same set of matching keywords typically rank higher than longer documents with similar semantic meaning. An average AnswerXchange search query is 2-3 terms long (i.e. shorter than a typical AnswerXchange question) and it is often comparable in length with the title of a potentially duplicate question. The question details play a lesser role compared to titles, contributing to extra boosting of duplicate content by Lucene. The AnswerXchange Lucene ranking algorithm tends to boost new content and also accounts for various metadata such as helpfulness votes.

Training Data
The problem of near-duplicate detection can be formulated as an unsupervised or supervised machine learning task [7]. In the unsupervised case, duplicate pairs and clusters can be found based on distance metrics such as cosine-similarity of the weighted tf-idf vectors, Jaccard similarity coefficient, distance in word2vec space, etc. In the supervised case, the problem of finding topical near-duplicate relations can be formulated as follows: given a pair of questions, the machine learnt model has to predict a "duplicate score" and determine if questions are duplicates based on a pre-defined threshold. In this paper, we employ a "hybrid" approach starting with cosine-similarity metrics for data pre-processing and then adding a more accurate custom-built scoring model to the processing pipeline. As the fraction of duplicate pairs in AnswerXchange is relatively low, the question pairs ranked by cosine-similarity provide a convenient dataset for labeling based on the importance sampling approach.
Towards this goal, we computed bag-of-words cosine-similarity (Appendix A) for 790,000 questions available for search in AnswerXchange at the end of 2017 U.S. Tax Day (April 18). Next, four AnswerXchange moderators added class labels (0 or 1) to a random sample of 4,000 near-duplicate pairs. Instances open to doubt were flagged by moderators and then re-labeled by consensus. Finally, 1,000 randomly sampled non-duplicate pairs were added for the final version of the training dataset to make it equally divided between duplicate and non-duplicate pairs.

Duplicate-Scoring Model Features
The model features can be learnt from training data and/or by knowledge acquisition from AnswerXchange moderators. We have used the following model features:
• Cosine-similarity with tf-idf weighting (see Appendix A).
• Probabilistic topic ID of the question computed with Latent Dirichlet Allocation (see Appendix A).
• U.S. tax year in the question.
• Distinct words in the question pair.
• Common words in the question pair.
• Type of the question (e.g. "closed-ended" questions such as "Can I deduct…" are typically tax related, while "how" questions are often product related).
• First word of the question.

Duplicate-Scoring Model Performance
Based on the set of 5,000 labeled question pairs, we trained and tested linear (logistic regression) and non-linear (random forest) binary classifiers using the Python machine learning library "scikit-learn". The model predicts the class label (0 for a non-duplicate and 1 for a duplicate pair) and also the duplicate score (i.e. the class probability of the question pair, ranging from 0.0 to 1.0) that can be used to select the user experience based on predefined threshold(s). We also trained a separate version of the logistic regression classifier using cosine-similarity as a single model feature. Shown in Table 1 are common metrics used for predictive model evaluation: area under curve (AUC) for receiver operating characteristic, F1 score and logarithmic loss (log loss) function for classification.

Model                 AUC    F1 Score   Log Loss
Logistic Regression   0.95   0.88       0.27
Random Forest         0.94   0.87       0.31
Cosine-similarity     0.83   0.73       0.48

Table 1. Model performance metrics for duplicate-scoring models (details are explained in the text).
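
For readers unfamiliar with the three metrics in Table 1, the following is a minimal pure-Python sketch of how each can be computed from class labels and duplicate scores. The labels and scores below are made up for illustration and are not the paper's data; the 0.5 threshold is likewise an assumption. In scikit-learn, `roc_auc_score`, `f1_score` and `log_loss` from `sklearn.metrics` serve the same purpose.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted scores."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def f1_score(y_true, y_prob, threshold=0.5):
    """Harmonic mean of precision and recall after thresholding the scores."""
    pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 1)
    fp = sum(1 for y, q in zip(y_true, pred) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def roc_auc(y_true, y_prob):
    """Probability that a random duplicate pair outscores a random non-duplicate pair."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = duplicate pair) and duplicate scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
print(roc_auc(labels, scores), f1_score(labels, scores), log_loss(labels, scores))
```

Lower log loss is better, while higher AUC and F1 are better, which is the direction of the ordering seen in Table 1.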
As seen from Table 1, both logistic regression and random forest models achieve performance that is consistent with the goals of this exploratory study. At the same time, the cosine-similarity version underperforms the first two by a wide margin. This can be explained by the inability to find an optimal threshold separating duplicate and non-duplicate pairs using the cosine-similarity alone.
The following two examples illustrate the relationship between keyword-based cosine-similarity and duplicate-question score computed with logistic regression. The first example is an AnswerXchange question pair with a relatively low cosine-similarity of 0.61: (1) "I need a copy of my federal tax return for 2014" and (2) "I need 2015 Tax Return". Both questions can be answered with a single instruction about getting a copy of a prior year tax return filed with TurboTax and hence are duplicates. The second example is a question pair with a high cosine-similarity of 1.0: (1) "do i have to file state taxes" and (2) "how to file state taxes". These questions are not duplicates because they belong to tax and product categories [5], respectively, and would require two different answers.
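
The second example can be reproduced with a minimal bag-of-words sketch. The stop-word list below is an assumption made for illustration (the paper does not list its pre-processing steps), and the sketch uses raw term counts rather than the tf-idf weighting used by the authors, so it is not expected to reproduce the 0.61 value of the first example.

```python
import math
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"a", "do", "have", "how", "i", "my", "of", "the", "to"}

def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def cosine_similarity(a, b):
    """Cosine of the angle between two bag-of-words count vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

q1 = "do i have to file state taxes"
q2 = "how to file state taxes"
print(cosine_similarity(q1, q2), jaccard_similarity(q1, q2))
```

Under these assumptions both questions reduce to the same three keywords ("file", "state", "taxes"), so keyword similarity alone cannot separate them; the question-type and first-word features listed earlier carry the distinguishing signal.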

DUPLICATE CLUSTERS
Preferential Attachment and Topology
After identifying 5,597,799 duplicate question pairs in AnswerXchange (Appendix A), we built an undirected graph of 281,031 duplicate questions. Each duplicate pair and duplicate question identified with the model constituted a graph edge and a graph vertex, respectively. The resulting graph consists of 14,616 connected components hereafter referred to as "duplicate clusters." To explore duplicate-cluster scaling behavior, we ranked clusters by the number of questions and plotted the number of questions per cluster vs. cluster rank in log-log scale (Figure 3). The largest cluster has 23,236 questions and the smallest ones only have two.
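
The construction of duplicate clusters from duplicate pairs can be sketched as follows. This is a minimal illustration using a union-find structure; the paper does not state which graph library or algorithm was used, and the question IDs are hypothetical.

```python
def connected_components(pairs):
    """Group question IDs into clusters, treating each duplicate pair as an undirected edge."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for question in list(parent):
        clusters.setdefault(find(question), set()).add(question)
    # Rank clusters by size in descending order, as in the rank plot of Figure 3.
    return sorted(clusters.values(), key=len, reverse=True)

# Hypothetical duplicate pairs.
pairs = [("q1", "q2"), ("q2", "q3"), ("q4", "q5")]
clusters = connected_components(pairs)
print([len(c) for c in clusters])
```

Note that "q1" and "q3" land in the same cluster even though they were never scored as a duplicate pair, which is why cluster membership is a weaker relation than pairwise duplication.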
The plot also includes graph (or edge) density:

$$D = \frac{2E}{V(V-1)},$$

where E is the number of edges (i.e. duplicate pairs) and V is the number of vertices (i.e. questions). Graph density is equal to 1.0 for the fully connected graphs. In the latter case, each question in the cluster is connected to all remaining questions in the same duplicate cluster.
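
As a quick check of the density definition (the number of edges divided by the number of possible vertex pairs), here is a minimal sketch; the function name is illustrative.

```python
def graph_density(num_edges, num_vertices):
    """Edge density of an undirected graph: 2E / (V (V - 1))."""
    if num_vertices < 2:
        raise ValueError("density is undefined for fewer than two vertices")
    return 2.0 * num_edges / (num_vertices * (num_vertices - 1))

# A single duplicate pair is always fully connected.
print(graph_density(1, 2))
# Four questions that are all mutual duplicates: 6 edges out of 6 possible.
print(graph_density(6, 4))
# A chain of three questions (1-2 and 2-3 are duplicates, 1-3 is not).
print(graph_density(2, 3))
```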
Based on both question counts and graph density, the duplicate clusters in Figure 3 can be divided into three distinct groups marked as mega-clusters, transitional clusters and micro-clusters. These groups account for 84%, 2% and 14% of duplicate questions, respectively.

Figure 3. Scaling behavior of duplicate clusters (black dots) in AnswerXchange questions. The clusters are ranked by the number of questions in the descending order. Graph density for the clusters is shown in gray. Cyan and red dots refer to the clusters shown in Figures 4 and 5, respectively.

An example of a micro-cluster with 23 vertices is shown in Figure 4. Graph density is 0.54 and most of the vertices are interconnected with the exception of three vertices connected by bridges to a denser graph core. The corresponding articulation points are marked by blue dots. Note that even if questions 1 and 2 are duplicates and questions 2 and 3 are duplicates, this does not mean that questions 1 and 3 are duplicates as well. This explains why a duplicate-cluster density is typically less than 1.0 unless the graph size is limited to two questions.
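
An articulation point is a vertex whose removal disconnects the graph. A minimal sketch of the standard depth-first search for such vertices is given below; the small graph is hypothetical (a fully connected core of three questions with a fourth question attached by a bridge), and the recursive form is only suitable for small clusters because of Python's recursion limit.

```python
def articulation_points(adjacency):
    """Return vertices whose removal disconnects a simple undirected graph."""
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adjacency[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])  # back edge to an ancestor
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # No back edge from v's subtree reaches above u.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # The DFS root is an articulation point only if it has several subtrees.
        if parent is None and children > 1:
            points.add(u)

    for node in adjacency:
        if node not in disc:
            dfs(node, None)
    return points

# Hypothetical cluster: a, b, c are mutual duplicates; d is a duplicate of c only.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(articulation_points(graph))
```

In this toy cluster the edge between "c" and "d" plays the role of a bridge to the denser core, and "c" is the articulation point.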
As seen from Figure 3, micro-cluster scaling behavior follows the Zipf distribution (https://en.wikipedia.org/wiki/zipf's_law):

$$N(r) = N r^{-\alpha},$$

where N(r) is the number of questions in the cluster of rank r, and r ranges from about 100 to the total number of clusters R. Accordingly, since the smallest clusters at rank R hold a fixed number of questions (two), the growth of N and R would be constrained by the following equation:

$$\frac{\Delta N}{N} = \alpha \frac{\Delta R}{R}.$$

It is worth mentioning that the Zipf distribution is an asymptotic case of a more general Yule-Simon distribution (https://en.wikipedia.org/wiki/Yule-Simon_distribution) typical for the preferential attachment process, meaning that a newly posted duplicate is more likely to become attached to the existing cluster than to form a new duplicate pair. The scaling parameter for the micro-clusters:

$$\alpha = \frac{\log N(r_1) - \log N(r_2)}{\log(r_2) - \log(r_1)}$$

can be estimated as 0.6. By extrapolating the Zipf distribution to r = 1 (that would correspond to a non-existing largest micro-cluster), one can estimate the N value as 400. This value, however, is almost two orders of magnitude less than the number of questions in the top mega-cluster.
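
The quoted estimates can be cross-checked numerically. The sketch below assumes the power-law form N(r) = N r^(-α) with N = 400 and α = 0.6 as quoted in the text; the function names are illustrative. Under that assumption, the cluster at rank 100 (the upper end of the micro-cluster range) would hold about 25 questions, consistent with the size of "some 25 questions" at which clusters are said to start morphing into mega-clusters.

```python
import math

def zipf_size(rank, n1, alpha):
    """Cluster size at a given rank under the assumed form N(r) = n1 * r**(-alpha)."""
    return n1 * rank ** (-alpha)

def estimate_alpha(r1, size1, r2, size2):
    """Slope of the rank-size line in log-log scale, taken between two points."""
    return (math.log(size1) - math.log(size2)) / (math.log(r2) - math.log(r1))

size_at_100 = zipf_size(100, 400, 0.6)
print(round(size_at_100, 1))

# Recover alpha from two points on the same line (synthetic points, not measured data).
print(estimate_alpha(100, zipf_size(100, 400, 0.6), 1000, zipf_size(1000, 400, 0.6)))
```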

Figure 4. A micro-cluster marked by the cyan dot in Figure 3. Articulation points are shown by smaller blue dots.

To explain the scale break in the distribution shown in Figure 3, let us examine larger duplicate clusters in more detail. Shown in Figure 5 is a mega-cluster with 4,549 questions. The cluster has density equal to 0.0017 and 1,048 articulation points. This means that the mega-clusters may consist of multiple sub-clusters that are semantically related to each other but with the elements that are not duplicates unless they belong to the same sub-cluster.

Figure 5. Same as in Figure 4, but now for a mega-cluster.

As the number of duplicates reaches a certain level, the clusters start coalescing by establishing bridges with other clusters, duplicate pairs and stand-alone questions, quickly evolving from dense connected graphs to sparse graphs with a complex network topology. The area of transition is marked as transitional clusters in Figure 3.

Semi-Automated Duplicate Content Curation
While the task of duplicate content archiving is straightforward once duplicate pairs are found (Appendix A), the duplicate content can build up again unless question-posting and/or search experiences are modified. Our next goal is therefore to explore how the concept of duplicate clusters discussed in the previous section can be applied to these tasks. The curation of micro-clusters can be done automatically or semi-automatically (i.e. with minimum human involvement) by retaining one or few best performing long-tail documents (i.e. documents that include both questions and answers) and assigning them a cluster ID for subsequent re-use.
The curation of mega-clusters represents a more challenging problem. First, a single best performing document in a mega-cluster may simply not exist since the cluster may contain multiple sub-clusters connected by bridges. Second, duplicate curation by a human is a cumbersome task due to the mega-cluster complex topology. While the exact solution may simply not exist, approximate solutions may be sufficient to reduce the number of duplicates posted in the AnswerXchange to an acceptable level.
One approach would be to break the mega-clusters into smaller parts by deleting bridges in the graph or by employing a conventional hierarchical clustering. For example, the duplicate cluster shown in Figure 5 can be split into 1,363 connected components by removing all articulation points (blue dots in Figure 5). Most of the resulting connected components, however, are isolated single documents. A more practical approach is to archive non-performing short-tail content from the mega-cluster and curate the resulting connected components. Shown in Figure 6 is a subset of the mega-cluster from Figure 5 that now only includes documents with at least 100 views. This results in breaking the original mega-cluster into 68 connected components which are easier to curate.

Figure 6. A subset of the mega-cluster shown in Figure 5. Grey dots mark documents used in Figure 7.

The next task is to present duplicate content in a form suitable for semi-automated content curation. Figure 7 shows an example of duplicate content metrics for eight documents with at least 1,000 views. The left column is a sub-cluster ID followed by a post ID identifying an AnswerXchange document consisting of the original question and all accumulated answers (not shown). The text of the question and type of the question (i.e. user-generated content marked as UGC or knowledge base content labeled as FAQ) are included in the third and fourth columns, respectively. The last two columns are views accumulated over a given period and percentage of up-votes. The documents can be ranked by views and/or votes providing a mechanism of identifying and removing non-performing content either manually or automatically based on a set of predefined content quality thresholds.
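
Threshold-based selection of non-performing content can be sketched as follows. The rows reuse the sub-cluster IDs, post IDs, views and up-vote percentages reported for Figure 7; the two thresholds (2,000 views and 70% up-votes) are hypothetical values chosen for illustration, not values from the paper.

```python
# (sub-cluster ID, post ID, type, views, up-vote %) as reported for Figure 7.
DOCUMENTS = [
    (1, 1899475, "FAQ", 17019, 74.8),
    (1, 2666148, "UGC", 1759, 77.9),
    (1, 3048015, "UGC", 1060, 78.1),
    (1, 3356358, "FAQ", 6727, 70.3),
    (1, 3705028, "UGC", 2999, 67.0),
    (2, 2895188, "FAQ", 25243, 79.9),
    (2, 2899090, "FAQ", 13765, 79.1),
    (2, 2956890, "UGC", 1509, 86.6),
]

def archive_candidates(documents, min_views=2000, min_upvote=70.0):
    """Flag documents falling below either (hypothetical) quality threshold."""
    return {post_id for _, post_id, _, views, upvote in documents
            if views < min_views or upvote < min_upvote}

def rank_by_views(documents, cluster_id):
    """Post IDs of one sub-cluster, most viewed first."""
    rows = [d for d in documents if d[0] == cluster_id]
    return [d[1] for d in sorted(rows, key=lambda d: d[3], reverse=True)]

print(sorted(archive_candidates(DOCUMENTS)))
print(rank_by_views(DOCUMENTS, 2))
```

In a semi-automated workflow the flagged set would be a suggestion shown to a trusted user rather than an automatic deletion.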

Figure 7. Duplicate document metrics for the documents marked by grey dots in Figure 6.

Duplicate metrics can be operationalized by adding an algorithm to match the best question to the best answer in the sub-cluster. Such a system would include deleting and merging answers manually, or automatically by attaching an automatically generated "best" answer to the "best" duplicate question. The solution can be implemented as a back-end tool for trusted users assigned to the task of duplicate archiving and hidden from the less experienced regular users. The solution goes beyond simple duplicate archiving by providing an option to merge available answers to the existing duplicate questions. The non-human part of the solution includes quality ranking of the existing answers, e.g. up and down vote statistics as shown in Figure 7. In this way, the newly formed question-answer pairs provide better quality content available for search by combining the visually appealing questions and the best ranked answers. This is done by combining artificial and human intelligence since the answer to a related question (that the system recommended) can be confirmed by the contributor if needed. The cluster notes can be edited by trusted users and applied to all articles within the cluster.

Real Time Duplicate Detection
Finding duplicates of a given question requires (N-1) pairwise comparisons to the questions in the database and may not be feasible in real time. The computational time can be reduced by selecting potential duplicate matches with AnswerXchange search. The top performing documents in the clusters can be assigned an ID and indexed separately by the search engine. Once the search engine returns the documents ranked by relevancy to the newly formulated question, the duplicate-scoring model is applied to the top matches to see if the new question is a duplicate and, if so, which duplicate cluster it belongs to.
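
The two-stage lookup described above (retrieve a few candidates with search, then score only those) can be sketched as follows. Everything here is a stand-in: `search` ranks by shared-token count instead of calling Lucene, `duplicate_score` uses Jaccard overlap as a placeholder for the trained duplicate-scoring model, and the cluster IDs, representative questions and the 0.5 threshold are hypothetical.

```python
def tokens(text):
    return set(text.lower().split())

def search(query, index, top_k=3):
    """Stand-in for the search engine: rank cluster representatives by shared tokens."""
    q = tokens(query)
    hits = [(len(q & tokens(text)), cluster_id) for cluster_id, text in index.items()]
    hits = [h for h in hits if h[0] > 0]
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [cluster_id for _, cluster_id in hits[:top_k]]

def duplicate_score(q1, q2):
    """Placeholder for the trained model: Jaccard overlap of the two token sets."""
    a, b = tokens(q1), tokens(q2)
    return len(a & b) / len(a | b) if a | b else 0.0

def find_duplicate_cluster(new_question, index, threshold=0.5, top_k=3):
    """Score only the top search matches; return (cluster ID or None, best score)."""
    best_id, best_score = None, 0.0
    for cluster_id in search(new_question, index, top_k):
        score = duplicate_score(new_question, index[cluster_id])
        if score >= threshold and score > best_score:
            best_id, best_score = cluster_id, score
    return best_id, best_score

# Hypothetical index of one top performing question per duplicate cluster.
index = {
    101: "where do i enter my medical expenses",
    102: "can i deduct job search expenses",
}
print(find_duplicate_cluster("where do i enter medical expenses", index))
print(find_duplicate_cluster("how do i print my return", index))
```

The cost per new question is bounded by `top_k` model evaluations instead of N-1, which is the point of putting search in front of the scoring model.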

ID  POST_ID    DOCUMENT                                                         TYPE  VIEWS   UPVOTE
1   1,899,475  Can I deduct job-search expenses                                 FAQ   17,019  74.8
1   2,666,148  HI. Where do I enter my job search                               UGC   1,759   77.9
1   3,048,015  Where do I include job search                                    UGC   1,060   78.1
1   3,356,358  Where do I enter my job search                                   FAQ   6,727   70.3
1   3,705,028  Where do I deduct job search                                     UGC   2,999   67
2   2,895,188  Where do I enter my medical                                      FAQ   25,243  79.9
2   2,899,090  Why doesnt my refund change after I enter my medical expenses    FAQ   13,765  79.1
2   2,956,890  where do i enter OUT OF POCKET medical expenses                  UGC   1,509   86.6

DATA-DRIVEN USER EXPERIENCES
Accumulation of duplicate content can be prevented by integrating a custom-built duplicate-scoring model into the question-posting experience.
Another option is to expose an intelligent interface to the trusted users by providing extra features for answering duplicate questions. Finally, the duplicate question curation can be part of the content moderation process carried out by the AnswerXchange trusted users or trained bots.

Question Deduplication While Posting
The first feature (Figure 8) extends the AnswerXchange "Question Optimizer" system [6]. The system prompts the asker with personalized instructions created dynamically based on real time analysis of the question's semantics and writing style. The "Question Optimizer" has been re-designed to make a duplicate question more difficult to submit without addressing the recommended re-phrasing. The annotations to the concept are presented next.

Figure 8. Question-posting experience reveals the duplicates and helps users re-phrase as a unique question.

A) The "Question Optimizer" technology is envisioned to include duplicate content detection in addition to providing timely advice on how to re-phrase or deflect.
B) If the question falls in a known duplicate cluster, the best matching and most referenced answer matches are shown.
C) Trusted users may attach "cluster notes" to curated duplicate clusters; the notes appear automatically with any question within the cluster. In the example shown in Figure 8, the duplicate cluster is about printing and the message notes that the printing experience recently changed in the product, information which may be useful to anyone with printing-related questions.
D) The suggested answers are deduplicated using duplicate score equalization so the answers are more useful. A "cluster browser" is also added below the results to help refine amongst the most popular variations.

Question Deduplication While Answering
The second feature addresses the situation where a potential duplicate has been submitted and needs to be intercepted as part of the question answering experience. This concept is illustrated in Figures 9-10.

Figure 9. Contributor experience tagging and attaching curated answer to the question.

Specifically, Figure 9 illustrates the contributor (typically a trusted user) answering experience and includes the following annotation:

[Figure 9 interface text: the question "I need a copy of my 2014 Tax return", asked by Chris 30 minutes ago; a suggested answer from the duplicate "I need to get a copy of my 2014 return and I don't have the cd." (92% match, 2,314 duplicates), with "attach" and "attach and mark answered" actions.]

E) The suggested answered question duplicate is presented to the original asker and also displays the duplicate probability. The contributor can easily attach it to their answer, which also tells the system the question was a duplicate and should be archived in favor of the attached one.

Figure 10. Original asker view of deduplicated question with personalized answer.

Once the duplicate question is answered it becomes available to the original asker (Figure 10).
C) Re-purposing trusted users' notes similar to those used in the question-posting experience (Figure 8).
F) A personalized note introduces the "recommended answer" while explaining it's a duplicate.
G) The duplicate answer is presented with a sense of authority.
H) If the original asker is unsatisfied with the answer, they may revise their question and it will re-enter the answer queue. They also have the option to request a new answer without submitting the question.
Finally, the automatic flagging of an unanswered question as a duplicate may be validated or invalidated by the trusted users and used to update the training dataset for model re-training.

Question Deduplication with Automated Answers
The "Answer Bot" (Figure 11) is a feature driven by artificial intelligence alone. The "Answer Bot" increases self-support efficiency by responding to a customer's questions by e-mail with answers from the matching duplicate cluster if the posted question is flagged by the duplicate-scoring model as a duplicate.
I) "Answer Bots" may automatically answer questions determined to be duplicates. Like the contributor-assisted experience, the bot will recommend the best answer within the duplicate cluster. The user is made aware that a bot answered the question and, if unsatisfied, may request a new answer or revise their question.

Figure 11. Automated deduplication user experience as part of customized e-mail to the original asker.

Further, the "Answer Bot" attaches the question to the existing duplicate cluster automatically while providing a generic or personalized answer. The bot replies trigger automated archiving of the duplicate content. The question remains visible to the original asker but is not made available to AnswerXchange users and is suppressed from search results. A related option is to create two separate queues of duplicate questions for answering. The questions in the first queue would be assigned to designated moderators who can customize duplicate content for the original asker and archive it afterwards. The less complicated questions in the second queue can be assigned to the "Answer Bot".

[Interface text from Figures 10 and 11: the recommended answer to "I need a copy of my 2014 Tax return" (sign back into your TurboTax online account; from the Welcome Back screen, select Visit My Tax Timeline; select 2014 as the year; select Download/Print My Return (PDF)). Figure 10 shows it with a personalized contributor note, the cluster note "Note the printing experience in TurboTax changed in 2016", and the actions "Revise my question" and "Request a new answer". Figure 11 shows it with the Answer Bot notice: "I am a bot, and this action was performed automatically. If my answer is unhelpful, you may request a new answer or revise your question."]

DISCUSSION AND CONCLUSION
Social Q&A systems often presume that the users comply with recommendations not to replicate the existing content.
This is not the case for AnswerXchange where users often avoid consuming existing content by posting a new duplicate question. These users may not realize that AnswerXchange is a social Q&A site or may lack the ability to find and apply existing answers to their question. We need to intervene with intelligent user interfaces to alter the duplicate posting behavior. Towards this goal, we present two algorithms for duplicate content curation and providing real time inputs to the AnswerXchange user interfaces. The first algorithm determines if two questions are near-duplicates and can be combined with a search to detect duplicates in real time. The second algorithm uncovers all duplicate pairs in AnswerXchange and is capable of handling the deduplication task with a corpus of millions of questions. We conclude the paper by presenting three question deduplication user interfaces. Our hypotheses to validate include: (1) Will askers accept a duplicate when presented with an acceptable answer? (2) Will they accept a duplicate with or without a personalized contributor note? (3) If dissatisfied, will they revise or request a new answer? (4) Will they accept recommended answers from Answer Bots? We are planning to validate these hypotheses with a set of rapid experiments prior to production.

APPENDIX A: DUPLICATE PAIR DETECTION
Detecting duplicates for N = 790,000 questions based on a custom-built model would require N(N-1)/2 pairwise computations. The task of finding duplicate pairs becomes computationally expensive once the corpus reaches several hundred thousand documents. At the same time, computing cosine-similarity for a question pair is faster than scoring the same pair with the custom-built model and can be used to reduce the number of potential duplicate pairs from billions to millions of pairs. Further, dividing content by M probabilistic topics can reduce the number of pairwise comparisons by a factor of M, while not necessarily affecting the number of expected near-duplicate pairs.
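
The pair counts behind this argument are easy to verify. The sketch below assumes, for simplicity, that the M topics are of equal size; real topic sizes would be uneven, so the actual reduction would differ somewhat.

```python
def pair_count(n):
    """Number of unordered pairs among n questions: n (n - 1) / 2."""
    return n * (n - 1) // 2

def blocked_pair_count(n, m):
    """Pairs compared when questions are split into m equal-size topics (an assumption)."""
    return m * pair_count(n // m)

N = 790_000
all_pairs = pair_count(N)
within_topic_pairs = blocked_pair_count(N, 50)
print(all_pairs)                       # 312049605000
print(within_topic_pairs)              # 6240605000
print(all_pairs / within_topic_pairs)  # reduction factor close to M = 50
```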

M     Duplicates   Execution time (min)
50    63,355       13
30    72,920       18.5
10    73,068       36
1     83,773       265

Table A1. Duplicate statistics and computation time vs. number of probabilistic topics (M). Cosine-similarity threshold is 0.7. M=1 means processing N(N-1)/2 pairs.

Shown in Table A1 are results of the numerical experiments conducted on a MacBook Pro laptop with 2.8 GHz processor speed. The processing pipeline included (1) dividing questions into M topics, (2) computing cosine-similarity for all pairs in a topic, and (3) applying the duplicate-scoring model to the pairs with cosine-similarity above a pre-defined threshold.
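
The three pipeline steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: topic IDs are supplied as an input (standing in for the Latent Dirichlet Allocation step), cosine-similarity uses raw word counts, `score_model` is any callable standing in for the trained duplicate-scoring model, and the sample questions and the 0.5 duplicate threshold are hypothetical. The 0.7 cosine threshold follows Table A1.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def cosine(a, b):
    """Bag-of-words cosine-similarity on raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def find_duplicate_pairs(questions, topic_of, score_model,
                         cos_threshold=0.7, dup_threshold=0.5):
    # Step 1: divide questions into topics.
    by_topic = defaultdict(list)
    for qid in questions:
        by_topic[topic_of[qid]].append(qid)
    duplicates = []
    for qids in by_topic.values():
        # Step 2: cosine-similarity for all pairs within a topic.
        for a, b in combinations(sorted(qids), 2):
            if cosine(questions[a], questions[b]) >= cos_threshold:
                # Step 3: apply the (more expensive) scoring model to surviving pairs.
                if score_model(questions[a], questions[b]) >= dup_threshold:
                    duplicates.append((a, b))
    return duplicates

# Hypothetical questions and topic assignments.
questions = {
    1: "where do i enter medical expenses",
    2: "where do i enter my medical expenses",
    3: "how to file state taxes",
    4: "where do i enter medical expenses",
}
topic_of = {1: 0, 2: 0, 3: 1, 4: 1}
found = find_duplicate_pairs(questions, topic_of, score_model=lambda a, b: 1.0)
print(found)
```

Question 4 is identical to question 1 but sits in a different topic, so the pair is never compared. This is the trade-off visible in Table A1, where the number of detected duplicates falls as M grows.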
The total number of duplicate pairs was found to be 5,597,799 and contained 281,031 unique questions (or 35% of the AnswerXchange "live" questions). In 2017, they contributed 56% to the AnswerXchange document views. The documents in the identified duplicate pairs can be ranked by suitable question (and answer) proxy content quality metrics as discussed earlier, for example by the number of views, votes, age of the post, or by a weighted combination thereof. The document with the lower score can be removed consecutively from each pair resulting in a removal of 217,767 documents (27% of the AnswerXchange "live" questions).
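
One possible reading of the consecutive removal step is sketched below. The rule for pairs in which one document has already been removed (skip the pair) is an assumption, since the text does not spell it out; the post IDs and view counts are taken from Figure 7, while the duplicate pairs among them are hypothetical.

```python
def archive_lower_scored(pairs, score):
    """Walk through duplicate pairs and remove the lower-scored document of each pair."""
    removed = set()
    for a, b in pairs:
        if a in removed or b in removed:
            continue  # assumption: the pair is already resolved by an earlier removal
        removed.add(a if score[a] < score[b] else b)
    return removed

# Views from Figure 7 used as the quality score; the pairs are hypothetical.
views = {1899475: 17019, 2666148: 1759, 3048015: 1060}
pairs = [(1899475, 2666148), (2666148, 3048015), (1899475, 3048015)]
removed = archive_lower_scored(pairs, views)
print(sorted(removed))
```

Here the most viewed document survives and the two others are archived; a weighted combination of views, votes and post age could be substituted for `views` without changing the procedure.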

ACKNOWLEDGMENTS
We thank anonymous reviewers for valuable comments.

REFERENCES
1. Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, Gilad Mishne. 2008. Finding High-Quality Content in Social Media. In: Proc. of the International Conference on Web Search and Data Mining, 183-193.
2. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19, 1-16.
3. Klemens Muthmann, Alina Petrova. 2014. An automatic approach for identifying topical near-duplicate relations between questions from social media Q/A sites. In: Classifying Big Data from the Web, 1-6.
4. Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, Karin Verspoor. 2017. SemEval-2017 Task 3: Community Question Answering. In: Proc. of the 11th Int. Workshop on Semantic Evaluation, 27-48.
5. Igor A. Podgorny, Matthew Cannon, Todd Goodyear. 2015a. Pro-active detection of content quality in TurboTax AnswerXchange. In: Proc. of ACM Conference Companion on CSCW, 143-146.
6. Igor A. Podgorny, Chris Gielow, Matthew Cannon, Todd Goodyear. 2015b. Real time detection and intervention of poorly phrased questions. In CHI '15 Extended Abstracts, 2205-2210.
7. R. S. Ramya, K. R. Venugopal, S. S. Iyengar, L. Patnaik. 2016. Feature Extraction and Duplicate Detection for Text Mining: A Survey. Global Journal of Computer Science and Technology 56, 5.
8. Anna Shtok, Gideon Dror, Yoelle Maarek, Idan Szpektor. 2012. Learning from the Past: Answering New Questions with Past Answers, WWW, 759-768.
9. Ivan Srba, Mária Bieliková. 2016. A Comprehensive Survey and Classification of Approaches for Community Question Answering. In: TWEB, 10(3), 18:1-18:63.
