Methods and systems for document classification using machine learning
US-2019392250-A1 · Dec 26, 2019 · US
US2018366106A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2018366106-A1 |
| Application number | US-201816112623-A |
| Country | US |
| Kind code | A1 |
| Filing date | Aug 24, 2018 |
| Priority date | Feb 26, 2016 |
| Publication date | Dec 20, 2018 |
| Grant date | — |
Official abstract text for this publication.
The present disclosure discloses methods and apparatuses for distinguishing topics. One exemplary method for distinguishing topics includes: extracting data from data corresponding to known topics, marking the extracted data, and combining the marked data and data to be trained into a training data set; clustering the training data set to obtain topics to which training data belongs; and distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic. The methods and the apparatuses consistent with the present disclosure reduce the difference between human beings' understanding and machines' understanding of a question, and can increase the accuracy for identifying questions raised by users.
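The first step described in the abstract (extracting a sample from data of known topics, marking it, and combining it with the data to be trained) can be sketched as below. This is a minimal illustration, not the applicant's implementation: the function name `build_training_set`, the `per_topic` sample size, the fixed random seed, and the example texts are all assumptions made here for illustration.

```python
import random


def build_training_set(known, unlabeled, per_topic=2, seed=0):
    """Combine a small marked sample from known topics with unlabeled data.

    known:     dict mapping a known-topic label -> list of texts
    unlabeled: list of texts whose topics are unknown ("data to be trained")
    Returns (texts, marks), where marks maps an index in texts to the
    known-topic label of that marked text.
    """
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability
    texts, marks = [], {}
    for label, docs in sorted(known.items()):
        # Extract a small sample from each known topic and mark it.
        for doc in rng.sample(docs, min(per_topic, len(docs))):
            marks[len(texts)] = label
            texts.append(doc)
    # Append the unmarked data after the marked data.
    texts.extend(unlabeled)
    return texts, marks


if __name__ == "__main__":
    known = {
        "billing": ["why was I charged twice", "update my card", "invoice copy"],
        "refund": ["how do I get my money back"],
    }
    unlabeled = ["cannot log in", "reset password", "where is my refund"]
    texts, marks = build_training_set(known, unlabeled)
    print(len(texts), marks)
```

Keeping the marks in a separate index-to-label mapping leaves the texts themselves unchanged, so any clustering method can be run on `texts` without knowing which entries are marked.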
Opening claim text (preview).
What is claimed is:
1. A method for distinguishing topics, comprising: extracting data from data corresponding to known topics, marking the extracted data, and combining the marked data and data to be trained into a training data set; clustering the training data set to obtain topics to which training data belongs; and distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic.
2. The method for distinguishing topics of claim 1, wherein clustering the training data set includes using a Latent Dirichlet Allocation (LDA) clustering method for clustering the training data set.
3. The method for distinguishing topics of claim 2, wherein the number of topics obtained by clustering using the LDA clustering method is greater than the number of known topics.
4. The method for distinguishing topics of claim 1, wherein an amount of the marked data is significantly less than an amount of the data to be trained.
5. The method for distinguishing topics of claim 1, wherein the step of distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic comprises: in response to determining that all marked data of a known topic appears in a topic, determining that the topic is a known topic; and in response to determining that no marked data of any known topic appears in a topic, determining that the topic is a new topic.
6. The method for distinguishing topics of claim 5, wherein clustering the training data set to obtain topics to which training data belongs further comprises: obtaining, by clustering, keywords of each topic obtained by clustering and a probability corresponding to each keyword.
7. The method for distinguishing topics of claim 6, wherein distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic further comprises: determining, based on the keywords of each topic obtained by clustering, whether the topic is a known topic or a new topic.
8. An apparatus for distinguishing topics, comprising: a memory storing a set of instructions; and a processor configured to execute the set of instructions to cause the apparatus for distinguishing topics to perform: extracting data from data corresponding to known topics, marking the extracted data, and combining the marked data and data to be trained into a training data set; clustering the training data set to obtain topics to which training data belongs; and distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic.
9. The apparatus for distinguishing topics of claim 8, wherein clustering the training data set includes using a Latent Dirichlet Allocation (LDA) clustering method for clustering the training data set.
10. The apparatus for distinguishing topics of claim 9, wherein the number of topics obtained by clustering using the LDA clustering method is greater than the number of known topics.
11. The apparatus for distinguishing topics of claim 8, wherein an amount of the marked data is significantly less than an amount of the data to be trained.
12. The apparatus for distinguishing topics of claim 8, wherein distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic comprises: in response to determining that all marked data of a known topic appears in a topic, determining the topic as a known topic; and in response to determining that no marked data of any known topic appears in a topic, determining the topic as a new topic.
13. The apparatus for distinguishing topics of claim 12, wherein clustering the training data set to obtain topics to which training data belongs further comprises: obtaining, by clustering, keywords of each topic obtained by clustering and a probability corresponding to each keyword.
14. The apparatus for distinguishing topics of claim 13, wherein distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic further comprises: determining, based on the keywords of each topic obtained by clustering, whether the topic is a known topic or a new topic.
15. A non-transitory computer readable medium that stores a set of instructions that is executable by at least one processor of a computer to cause the computer to perform a method for distinguishing topics, the method comprising: extracting data from data corresponding to known topics, marking the extracted data, and combining the marked data and data to be trained into a training data set; clustering the training data set to obtain topics to which training data belongs; and distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic.
16. The non-transitory computer readable medium of claim 15, wherein clustering the training data set includes using a Latent Dirichlet Allocation (LDA) clustering method for clustering the training data set.
17. The non-transitory computer readable medium of claim 16, wherein the number of topics obtained by clustering using the LDA clustering method is greater than the number of known topics.
18. The non-transitory computer readable medium of claim 15, wherein an amount of the marked data is significantly less than an amount of the data to be trained.
19. The non-transitory computer readable medium of claim 15, wherein distinguishing, based on the marked data, whether a topic obtained by clustering is a known topic or a new topic comprises: in response to determining that all marked data of a known topic appears in a topic, determining the topic as a known topic; and in response to determining that no marked data of any known topic appears in a topic, determining the topic as a new topic.
20. The non-transitory computer readable medium of claim 19, wherein clustering the training data set to obtain topics to which training data belongs further comprises: obtaining, by clustering, keywords of each topic obtained by clustering and a probability corresponding to each keyword.
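The decision rule recited in claims 5, 12, and 19 can be illustrated with a short sketch. It assumes clustering has already produced a cluster id for every document; the function name `distinguish_topics`, the document ids, and the topic labels are hypothetical. The claims do not say what happens when only part of a known topic's marked data lands in a cluster, or when several known topics land in the same cluster, so the `"undetermined"` outcome below is an assumption of this sketch rather than something stated in the publication.

```python
def distinguish_topics(assignments, marks):
    """Label each clustered topic as a known topic or a new topic.

    assignments: dict mapping document id -> cluster (topic) id
    marks:       dict mapping marked document id -> known-topic label
    Returns a dict mapping cluster id -> known-topic label, "new",
    or "undetermined" (the last outcome is an assumption of this sketch).
    """
    # Group the marked document ids by their known-topic label.
    marked_by_label = {}
    for doc_id, label in marks.items():
        marked_by_label.setdefault(label, set()).add(doc_id)

    # Record which marked documents landed in each cluster.
    marked_in_cluster = {}
    for doc_id, cluster in assignments.items():
        present = marked_in_cluster.setdefault(cluster, set())
        if doc_id in marks:
            present.add(doc_id)

    result = {}
    for cluster, present in marked_in_cluster.items():
        if not present:
            # No marked data of any known topic appears: a new topic.
            result[cluster] = "new"
            continue
        # Known topics whose marked data all appears in this cluster.
        full = sorted(
            label for label, docs in marked_by_label.items() if docs <= present
        )
        result[cluster] = full[0] if len(full) == 1 else "undetermined"
    return result


if __name__ == "__main__":
    assignments = {"q1": 0, "q2": 0, "q3": 1, "q4": 1, "q5": 2, "q6": 3, "q7": 3}
    marks = {"q1": "billing", "q2": "billing", "q5": "refund", "q6": "refund"}
    print(distinguish_topics(assignments, marks))
    # prints {0: 'billing', 1: 'new', 2: 'undetermined', 3: 'undetermined'}
```

In the usage example, cluster 0 contains every marked "billing" document and is labeled as that known topic, cluster 1 contains no marked documents and is labeled new, and the marked "refund" documents are split across clusters 2 and 3, which the claims leave unaddressed.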
into predefined classes · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Training · CPC title
Creation of semantic tools, e.g. ontology or thesauri · CPC title