Data item clustering and analysis
US-9202249-B1 · Dec 1, 2015 · US
US11714869B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11714869-B2 |
| Application number | US-202117564056-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 28, 2021 |
| Priority date | May 2, 2017 |
| Publication date | Aug 1, 2023 |
| Grant date | Aug 1, 2023 |
Official abstract text for this publication.
Systems and methods are provided for identifying relevant information for an entity, referred to as a seed entity. A plurality of search queries can be generated, each comprising a property of a seed entity or one of the entities associated with the seed entity (seed-linked entities). Preferably, a collection of search queries includes ones representing different properties of the seed entity and properties of different seed-linked entities. Optionally, the collection of search queries is optimized to reduce search burden. Searches can then be conducted with the search queries in one or more data sources to obtain a plurality of search results, wherein each search result comprises a hit entity and one or more entities associated with the hit entity (hit-linked entity). For each of the search results, a score can be determined taking as input (a) likelihood of match between the seed entity and the hit entity or between a seed-linked entity and a hit-linked entity, (b) presence of a new entity in the search result not present in the search queries or a difference between the new entity and an entity present in the search queries, and (c) characteristic of the new entity in the search result. Based on the scores, high priority search results can be presented to a user for further analysis.
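The three scoring inputs named in the abstract can be illustrated with a small sketch. This is not the patented implementation: the weighted sum, the weight values, the `SearchResult` fields, and the `entity_weight` lookup are all assumptions introduced here for illustration, since the abstract names the inputs but not how they are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical container for one search result."""
    hit_entity: str
    hit_linked: frozenset       # entities associated with the hit entity
    match_likelihood: float     # input (a), assumed to lie in [0, 1]


def score_result(result, queried_entities, entity_weight,
                 w_match=0.5, w_new=0.3, w_char=0.2):
    """Combine inputs (a), (b), (c) as a weighted sum (an assumed formula)."""
    # Input (b): does the result contain an entity absent from the queries?
    new_entities = result.hit_linked - queried_entities
    novelty = 1.0 if new_entities else 0.0
    # Input (c): a characteristic of the new entity, modeled here as a
    # per-entity weight looked up in a hypothetical table.
    characteristic = max(
        (entity_weight.get(e, 0.0) for e in new_entities), default=0.0
    )
    return (w_match * result.match_likelihood
            + w_new * novelty
            + w_char * characteristic)


def prioritize(results, queried_entities, entity_weight, top_k):
    """Return the top_k results by descending score."""
    ranked = sorted(
        results,
        key=lambda r: score_result(r, queried_entities, entity_weight),
        reverse=True,
    )
    return ranked[:top_k]
```

Under these assumed weights, a result that closely matches the seed but surfaces nothing new can rank below a weaker match that introduces a highly weighted new entity, which is the behavior the abstract describes as surfacing high priority results for further analysis.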
Opening claim text (preview).
The invention claimed is:
1. A system for identifying relevant information for an entity comprising: one or more processors; and a memory storing instructions that, when executed by the one or more processors, cause the system to: generate a plurality of search queries comprising a seed entity and one or more entities associated with the seed entity, the generation comprising: determining a second entity validated to be linked to the seed entity, the second entity and the seed entity forming a seed cluster; identifying properties associated with the second entity and the seed entity; generating a search query that is associated with a subset of the identified properties; determining that the seed entity is associated with a third entity; and in response to the determination that the seed entity is associated with the third entity: determining degrees of difference between: a first link between the seed entity and the second entity; and a second link between the third entity and a fourth entity validated to be linked to the third entity; determining a probability of a match between one or more types of the identified properties and a particular backend datasource against which the search query is run, selected from different backend datasources; and creating a second search query based on the determined degrees of difference and the determined probability of the match.
2. The system of claim 1, wherein the instructions further cause the system to: determine a frequency at which the third entity appears across one or more backend datasources; and wherein the creating of the second search query is further based on the frequency.
3. The system of claim 2, wherein the creating of the second search query comprises: selecting a highest-scoring query, wherein a score of the highest-scoring query is determined based on the degrees of difference, the determined probability of a match, and the frequency; and in response to selecting a highest-scoring query, selecting a next highest-scoring query.
4. The system of claim 1, wherein the instructions further cause the system to: determine a second degree of difference between: the second entity or the seed entity; and the third entity; and wherein the creating of the second search query is based on the second degree of difference.
5. The system of claim 1, wherein the instructions further cause the system to: conduct the second search query; determine probabilities that respective results of the second search query are spurious based on a number of the results; determine whether to discard a subset of the results based on the determined probabilities; and selectively discard the subset of the results based on the determination of whether to discard the subset.
6. The system of claim 1, wherein the first link indicates a first relationship between the seed entity and the second entity and the second link indicates a second relationship between the third entity and the fourth entity.
7. The system of claim 1, wherein the second search query corresponds to the third entity.
8. The system of claim 1, wherein the instructions, when executed, further cause the system to: create a third search query based on a misspelling of the third entity.
9. The system of claim 1, wherein the seed entity comprises a pseudonym.
10. The system of claim 1, wherein the instructions further cause the system to: determine second degrees of difference between: the seed entity and the second entity; and the third entity and the fourth entity; and wherein the second search query is created based on the determined second degrees of difference.
11. The method of claim 1, further comprising determining a frequency at which the third entity appears across one or more backend datasources; and wherein the creating of the second search query is further based on the frequency.
12. A computer-implemented method comprising: generating a plurality of search queries comprising a seed entity and one or more entities associated with the seed entity, the generation comprising: determining a second entity validated to be linked to the seed entity, the second entity and the seed entity forming a seed cluster; identifying properties associated with the second entity and the seed entity; generating a search query that is associated with a subset of the identified properties; determining that the seed entity is associated with a third entity; and in response to the determination that the seed entity is associated with the third entity: determining degrees of difference between: a first link between the seed entity and the second entity; and a second link between the third entity and a fourth entity validated to be linked to the third entity; determining a probability of a match between one or more types of the identified properties and a particular backend datasource against which the search query is run, selected from different backend datasources; and creating a second search query based on the determined degrees of difference and the determined probability of the match.
13. The method of claim 12, further comprising determining a second degree of difference between: the second entity or the seed entity; and the third entity; and wherein the creating of the second search query is based on the second degree of difference.
14. The method of claim 12, further comprising: conducting the second search query; determining probabilities that respective results of the second search query are spurious based on a number of the results; determining whether to discard a subset of the results based on the determined probabilities; and selectively discarding the subset of the results based on the determination of whether to discard the subset.
15. The method of claim 12, wherein the first link indicates a first relationship between the seed entity and the second entity and the second link indicates a second relationship between the third entity and the fourth entity.
16. The method of claim 12, wherein the second search query corresponds to the third entity.
17. The method of claim 12, further comprising creating a third search query based on a misspelling of the third entity.
18. The method of claim 12, further comprising: determining second degrees of difference between: the seed entity and the second entity; and the third entity and the fourth entity; and wherein the second search query is created based on the determined second degrees of difference.
19. A non-transitory computer readable medium comprising instructions that, when executed, cause one or more processors to perform: generating a plurality of search queries comprising a seed entity and one or more entities associated with the seed entity, the generation comprising: determining a second entity validated to be linked to the seed entity, the second entity and the seed entity forming a seed cluster; identifying properties associated with the second entity and the seed entity; generating a search query that is associated with a subset of the identified properties; determining that the seed entity is associated with a third entity; and in response to the determination that the seed entity is associated with the third entity: determining degrees of difference between: a first link between the seed entity and the second entity; and a second link between the third entity and a fourth entity validated to be linked to the third entity; determining a probability of a match between one or m
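Claims 1 to 3 describe scoring candidate queries from three quantities (degrees of difference between links, probability of a match against a backend datasource, and entity frequency), then selecting the highest-scoring query followed by the next highest. The sketch below is one possible reading, not the claimed implementation: the claims do not state how the quantities are combined or in which direction each one pushes the score, so the formula in `query_score`, the `CandidateQuery` fields, and the `budget` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateQuery:
    """Hypothetical candidate for the 'second search query' of claim 1."""
    text: str
    link_difference: float    # degree of difference between the two links, in [0, 1]
    match_probability: float  # probability the property types fit the datasource
    frequency: int            # how often the third entity appears across datasources


def query_score(q):
    """Assumed scoring rule: favor a good datasource fit, similar links,
    and rare entities. The claims name these inputs but not this formula."""
    return q.match_probability * (1.0 - q.link_difference) / (1 + q.frequency)


def select_queries(candidates, budget):
    """Pick the highest-scoring query, then the next highest, up to budget."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < budget:
        best = max(remaining, key=query_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Selecting queries one at a time in descending score order mirrors the "highest-scoring, then next highest-scoring" wording of claim 3, and capping the number selected is one way to reflect the abstract's goal of reducing search burden.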
Indexing; Web crawling techniques · CPC title
Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title
Presentation of query results · CPC title