Methods and systems for the analysis of large text corpora
US-9135242-B1 · Sep 15, 2015 · US
US10318552B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10318552-B2 |
| Application number | US-201414278433-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 15, 2014 |
| Priority date | May 15, 2014 |
| Publication date | Jun 11, 2019 |
| Grant date | Jun 11, 2019 |
Official abstract text for this publication.
A computer processor generates a topic-based dataset based on parsing content received from a plurality of information sources, which includes historical data and scientific data, associated with a location of a natural resource. The processor generates a plurality of clusters, respectively corresponding to like-topic data of the topic-based dataset. The processor determines a plurality of hypotheses, respectively corresponding to the plurality of clusters of the like-topic data, wherein the plurality of hypotheses are based on features associated with each of the plurality of clusters of the like-topic data. The processor combines pairs of clusters, based on a similarity heuristic applied to the one or more pairs of clusters, and the processor determines a plurality of probabilities respectively corresponding to a validity of each hypothesis of the plurality of hypotheses, associated with the location of a natural resource.
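The cluster-combining step in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity heuristic is applied, so the cosine similarity between cluster centroids, the greedy merge order, the `threshold` value, and the toy feature vectors below are all assumptions made for illustration, not the patented method.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a cluster's feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def merge_similar(clusters, threshold=0.9):
    """Greedily merge the most similar pair of clusters until no pair of
    centroids reaches the threshold (an assumed similarity heuristic)."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        # Score every pair of clusters and keep the most similar one.
        score, i, j = max(
            (cosine(centroid(clusters[i]), centroid(clusters[j])), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if score < threshold:
            break
        # j > i, so removing cluster j does not shift cluster i.
        clusters[i].extend(clusters.pop(j))
    return clusters


# Hypothetical two-dimensional feature vectors for three like-topic clusters.
clusters = [
    [[1.0, 0.1], [0.9, 0.2]],
    [[1.0, 0.0]],
    [[0.0, 1.0], [0.1, 0.9]],
]
aggregates = merge_similar(clusters)
```

With these toy values the first two clusters point in nearly the same direction and are combined, while the third stays separate. A lower `threshold` would merge more aggressively.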
Opening claim text (preview).
What is claimed is:
1. A method for predicting a location of a natural resource, the method comprising:
   generating, by a computer processor, a topic-based dataset based on parsing content received from a plurality of information sources, identifying topics and relevancy of data within the topic-based dataset by filtering out non-topic related content based on singular value decomposition and N-gram techniques applied to the received content, and annotating the topic-based dataset with numerical values wherein the topic-based dataset that is generated includes data associated with a plurality of locations of a natural resource;
   generating, by the computer processor, a plurality of clusters, respectively corresponding to like-topic data of the content of the topic-based dataset from the plurality of information sources, wherein numerated data of the content of the like-topic data included in each cluster of the plurality of clusters are extracted as features corresponding to characteristics of respective topics of the content associated with each respective cluster, and are represented as feature vectors having one or more dimensions, and stored in a proximity matrix, which is trained by unsupervised learning to determine a threshold of cluster aggregation and separation;
   determining, by the computer processor, a plurality of hypotheses corresponding respectively to the plurality of clusters, each hypothesis associated with a prediction of a particular location of the natural resource, the plurality of hypotheses respectively corresponding to the plurality of clusters of the content of the like-topic data, wherein each hypothesis is based on one or more features that are related and extracted respectively from the plurality of clusters of the like-topic data;
   determining, by the computer processors, a confidence level of each hypothesis based on the one or more features of a respective feature vector serving as dimensions of evidence;
   combining, by the computer processor, two or more clusters of the plurality of clusters into a plurality of aggregate clusters based, at least in part, on a similarity heuristic applied to the clusters;
   generating, by the computer processor, a sequence of regression models, wherein a regression model of the sequence of regression models is based on the proximity matrix storing feature vectors corresponding to respective aggregate clusters of the plurality of aggregate clusters, and the particular sequence of regression models through which the respective feature vectors and respective hypotheses of the plurality of aggregate clusters are routed is based on groups of related features as dimensions of evidence; and
   generating, by the computer processor, a level of validity, respectively, of the plurality of hypotheses associated with the prediction of the particular location of the natural resource by processing the hypotheses through the sequence of regression models and identifying a highest probability hypothesis.
2. The method of claim 1, wherein the topic-based dataset generated from the plurality of information sources includes multimedia data.
3. The method of claim 1, wherein the topic-based dataset generated from the plurality of information sources further includes substantially real-time data.
4. The method of claim 1, further comprising: transforming, by the computer processor, the topic-based dataset into a summarized format output, based, at least in part, on processing by one or more analytic engines.
5. The method of claim 1, wherein the topic-based dataset is weighted based on respective beta distributions of the topic-based data.
6. The method of claim 1, further comprising: combining, by the computer processor, hypotheses corresponding to the one or more clusters that are combined into the plurality of aggregate clusters.
7. The method of claim 1, further comprising:
   generating, by the computer processor, one or more feature vectors from the one or more features of the plurality of clusters;
   generating, by the computer processor, a plurality of cluster spaces, which include the one or more clusters of the plurality of clusters that are combined into the plurality of aggregate clusters, wherein each cluster space is based on a disparate threshold of similarity;
   determining, by the computer processor, a cluster space of the plurality of cluster spaces that is favorable, based on a score of the cluster space and the disparate threshold of similarity; and
   generating, by the computer processor, one or more proximity matrices from the one or more feature vectors, based on the cluster space that is favorable.
8. The method of claim 7, wherein determining the cluster space that is favorable further comprises: determining, by the computer processor, a limit of combining clusters of the plurality of clusters based on the disparate threshold of similarity which produces the score of the cluster space that is favorable.
9. The method of claim 7, wherein generating one or more proximity matrices from the one or more feature vectors, further comprises: generating, by the computer processor, the one or more proximity matrices, based on a probability density function of the one or more features of the plurality of clusters that are combined.
10. The method of claim 1, wherein generating the sequence of regression models associated with location of the natural resource, further comprises: training, by the computer processor, the sequence of regression models based on successive refinement of supervised learning performed on the plurality of aggregate clusters, wherein the sequence of regression models is based, at least in part, on the one or more proximity matrices.
11. The method of claim 1, wherein an output of a previous model of the sequence of regression models is used as input to a subsequent model of the sequence of regression models.
12. The method of claim 1, further comprising: representing, by the computer processor, the plurality of probabilities respectively corresponding to the validity of each hypothesis of the plurality of hypotheses as a heat map, wherein a first element of the heat map corresponds to a first probability of a hypothesis of the plurality of hypotheses, and disparate from a second element of the heat map corresponding to a second probability of a hypothesis of the plurality of hypotheses.
13. A computer program product for predicting a location of a natural resource, the computer program product comprising: a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer processor to cause the computer processor to perform a method comprising:
   generating a topic-based dataset based on parsing content received from a plurality of information sources, identifying topics and relevancy of data within the topic-based dataset by filtering out non-topic related content based on singular value decomposition and N-gram techniques applied to the received content, and annotating the topic-based dataset with numerical values, wherein the topic-based dataset that is generated includes data associated with a plurality of locations of a natural resource;
   generating a plurality of clusters respectively corresponding to like-topic data of the content of the topic-based dataset from the plurality of information sources, wherein numerated data of the content of the like-topic data included in each cluster of the plurality of clusters are extracted as features corresponding to characteristics of respective topics of the content associated with each respective cluster, and are represented as feature vectors having one or more
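Claims 1 and 11 describe routing each hypothesis through a sequence of regression models, with the output of one model feeding the next, and then identifying the highest probability hypothesis. The following is a minimal sketch of that chaining idea only. The claims do not specify the regression family, so logistic models are an assumption here, and the feature names, feature groups, weights, and site labels are all hypothetical values chosen for illustration rather than trained parameters.

```python
from math import exp


def logistic(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def run_chain(stages, features):
    """Route one hypothesis through a sequence of logistic models.

    Each stage scores one group of related features. From the second
    stage on, the previous stage's output is appended as an extra input.
    """
    prev = None
    for names, weights, bias in stages:
        inputs = [features[n] for n in names]
        if prev is not None:
            inputs.append(prev)
        prev = logistic(bias + sum(w * x for w, x in zip(weights, inputs)))
    return prev


# Hypothetical stages: (feature group, weights, bias). The second stage has
# one extra weight for the output of the first stage.
STAGES = [
    (("seismic", "porosity"), (2.0, 1.5), -1.5),
    (("historical_yield",), (1.0, 3.0), -2.0),
]

# Hypothetical hypotheses, one per candidate location.
hypotheses = {
    "site_a": {"seismic": 0.9, "porosity": 0.8, "historical_yield": 0.7},
    "site_b": {"seismic": 0.2, "porosity": 0.3, "historical_yield": 0.4},
}

probabilities = {name: run_chain(STAGES, f) for name, f in hypotheses.items()}
best = max(probabilities, key=probabilities.get)
```

Here `best` names the hypothesis with the highest final-stage probability. In the claimed method the models would be trained rather than hand-set, and the grouping of features into stages would follow the "dimensions of evidence" recited in claim 1.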
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Clustering or classification · CPC title
Machine learning · CPC title
Information retrieval; Database structures therefor; File system structures therefor · CPC title
Physics · mapped topic