Methods and apparatuses for distinguishing topics

US2018366106A1 · US · A1

Patent metadata
Publication number: US-2018366106-A1
Application number: US-201816112623-A
Country: US
Kind code: A1
Filing date: Aug 24, 2018
Priority date: Feb 26, 2016
Publication date: Dec 20, 2018
Grant date: not listed

Abstract

Official abstract text for this publication.

The present disclosure discloses methods and apparatuses for distinguishing topics. One exemplary method for distinguishing topics includes: extracting data from data corresponding to known topics, marking the extracted data, and combining the marked data and data to be trained into a training data set; clustering the training data set to obtain topics to which training data belongs; and distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic. The methods and the apparatuses consistent with the present disclosure reduce the difference between human beings' understanding and machines' understanding of a question, and can increase the accuracy for identifying questions raised by users.
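
The first step described in the abstract (extract, mark, combine) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the applicant's implementation: `Record`, `build_training_set`, and `seeds_per_topic` are hypothetical names, and random sampling is an assumed way of "extracting" data, which the abstract does not specify.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Record:
    text: str
    # Known-topic name for marked seed data; None for data to be trained.
    mark: Optional[str] = None


def build_training_set(
    known: Dict[str, List[str]],
    unlabeled: List[str],
    seeds_per_topic: int = 2,
    rng: Optional[random.Random] = None,
) -> List[Record]:
    """Sample a few texts per known topic, mark them, and append the unlabeled data."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    marked: List[Record] = []
    for topic, texts in known.items():
        k = min(seeds_per_topic, len(texts))
        # Assumption: "extracting" is modeled as uniform sampling without replacement.
        for text in rng.sample(texts, k):
            marked.append(Record(text, mark=topic))
    return marked + [Record(text) for text in unlabeled]
```

Keeping `seeds_per_topic` small mirrors the dependent claims, in which the amount of marked data is much smaller than the amount of data to be trained.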

First claim

Opening claim text (preview).

What is claimed is:

1. A method for distinguishing topics, comprising: extracting data from data corresponding to known topics, marking the extracted data, and combining the marked data and data to be trained into a training data set; clustering the training data set to obtain topics to which training data belongs; and distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic.

2. The method for distinguishing topics of claim 1, wherein clustering the training data set includes using a Latent Dirichlet Allocation (LDA) clustering method for clustering the training data set.

3. The method for distinguishing topics of claim 2, wherein the number of topics obtained by clustering using the LDA clustering method is greater than the number of known topics.

4. The method for distinguishing topics of claim 1, wherein an amount of the marked data is significantly less than an amount of the data to be trained.

5. The method for distinguishing topics of claim 1, wherein the step of distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic comprises: in response to determining that all marked data of a known topic appears in a topic, determining that the topic is a known topic; and in response to determining that no marked data of any known topic appears in a topic, determining that the topic is a new topic.

6. The method for distinguishing topics of claim 5, wherein clustering the training data set to obtain topics to which training data belongs further comprises: obtaining, by clustering, keywords of each topic obtained by clustering and a probability corresponding to each keyword.

7. The method for distinguishing topics of claim 6, wherein distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic further comprises: determining, based on the keywords of each topic obtained by clustering, whether the topic is a known topic or a new topic.

8. An apparatus for distinguishing topics, comprising: a memory storing a set of instructions; and a processor configured to execute the set of instructions to cause the apparatus for distinguishing topics to perform: extracting data from data corresponding to known topics, marking the extracted data, and combining the marked data and data to be trained into a training data set; clustering the training data set to obtain topics to which training data belongs; and distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic.

9. The apparatus for distinguishing topics of claim 8, wherein clustering the training data set includes using a Latent Dirichlet Allocation (LDA) clustering method for clustering the training data set.

10. The apparatus for distinguishing topics of claim 9, wherein the number of topics obtained by clustering using the LDA clustering method is greater than the number of known topics.

11. The apparatus for distinguishing topics of claim 8, wherein an amount of the marked data is significantly less than an amount of the data to be trained.

12. The apparatus for distinguishing topics of claim 8, wherein distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic comprises: in response to determining that all marked data of a known topic appears in a topic, determining the topic as a known topic; and in response to determining that no marked data of any known topic appears in a topic, determining the topic as a new topic.

13. The apparatus for distinguishing topics of claim 12, wherein clustering the training data set to obtain topics to which training data belongs further comprises: obtaining, by clustering, keywords of each topic obtained by clustering and a probability corresponding to each keyword.

14. The apparatus for distinguishing topics of claim 13, wherein distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic further comprises: determining, based on the keywords of each topic obtained by clustering, whether the topic is a known topic or a new topic.

15. A non-transitory computer readable medium that stores a set of instructions that is executable by at least one processor of a computer to cause the computer to perform a method for distinguishing topics, the method comprising: extracting data from data corresponding to known topics, marking the extracted data, and combining the marked data and data to be trained into a training data set; clustering the training data set to obtain topics to which training data belongs; and distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic.

16. The non-transitory computer readable medium of claim 15, wherein clustering the training data set includes using a Latent Dirichlet Allocation (LDA) clustering method for clustering the training data set.

17. The non-transitory computer readable medium of claim 16, wherein the number of topics obtained by clustering using the LDA clustering method is greater than the number of known topics.

18. The non-transitory computer readable medium of claim 15, wherein an amount of the marked data is significantly less than an amount of the data to be trained.

19. The non-transitory computer readable medium of claim 15, wherein distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic comprises: in response to determining that all marked data of a known topic appears in a topic, determining the topic as a known topic; and in response to determining that no marked data of any known topic appears in a topic, determining the topic as a new topic.

20. The non-transitory computer readable medium of claim 19, wherein clustering the training data set to obtain topics to which training data belongs further comprises: obtaining, by clustering, keywords of each topic obtained by clustering and a probability corresponding to each keyword.
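
The decision rule in claims 5, 12, and 19 and the keyword output in claims 6, 13, and 20 can be illustrated with a short sketch. Everything here is hypothetical naming (`label_clusters`, `top_keywords`, the document ids), and the cluster assignments and word weights are assumed to come from some clustering step such as LDA, which is not run here. The claims do not say what happens when only part of a known topic's marked data lands in a cluster; the sketch labels that case "undetermined" as an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def label_clusters(
    assignments: Dict[str, int],  # doc id -> cluster id (from any clustering step)
    marks: Dict[str, str],        # doc id -> known topic name (marked docs only)
) -> Dict[int, Tuple[str, Optional[List[str]]]]:
    # Marked doc ids grouped by the known topic they were sampled from.
    by_topic = defaultdict(set)
    for doc_id, topic in marks.items():
        by_topic[topic].add(doc_id)
    # Marked doc ids grouped by the cluster they ended up in.
    marked_in_cluster = defaultdict(set)
    for doc_id in marks:
        marked_in_cluster[assignments[doc_id]].add(doc_id)

    labels = {}
    for cluster in set(assignments.values()):
        present = marked_in_cluster.get(cluster, set())
        if not present:
            # No marked data of any known topic appears: new topic.
            labels[cluster] = ("new", None)
            continue
        # Known topics whose marked data all appears in this cluster.
        complete = sorted(t for t, ids in by_topic.items() if ids <= present)
        if complete:
            labels[cluster] = ("known", complete)
        else:
            # Assumption: partial overlap is left for keyword-based review.
            labels[cluster] = ("undetermined", None)
    return labels


def top_keywords(word_weights: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Normalize per-topic word weights to probabilities and return the top k."""
    total = sum(word_weights.values())
    ranked = sorted(word_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, weight / total) for word, weight in ranked[:k]]
```

Clusters labeled "undetermined" are the natural candidates for the keyword-based check that claims 7 and 14 describe, for example by inspecting the output of `top_keywords` for each such cluster.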

Assignees

Alibaba Group Holding Ltd

Inventors

Classifications

  • G06F16/353 (Primary)

    into predefined classes · CPC title

  • G10L15/02 (Primary)

    Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Training · CPC title

  • G06F16/36 (Primary)

    Creation of semantic tools, e.g. ontology or thesauri · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Alibaba Group Holding Ltd
What technology area does this patent fall under?
Primary CPC classification G06F16/353. Mapped technology areas include Physics.
When was this patent published?
Published Dec 20, 2018 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).