Who is the assignee on this patent?

Evernote Corp, Bending Spoons S P A

What technology area does this patent fall under?

Primary CPC classification G06N5/04. Mapped technology areas include Physics.

When was this patent published?

Publication date: June 25, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Building training data and similarity relations for semantic space

US12020175B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12020175-B2
Application number	US-202017001311-A
Country	US
Kind code	B2
Filing date	Aug 24, 2020
Priority date	Jan 28, 2016
Publication date	Jun 25, 2024
Grant date	Jun 25, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and system for selecting data from a source text corpus for training a semantic data analysis system. The method includes selecting an item of the text corpus, wherein the item includes at least one section. The method includes extracting a section of the at least one section of the item. The method also includes determining a length of the section of the at least one section of the item. Based on the length of the section being greater than a predetermined amount, the method includes subdividing the section into a plurality of fragments. Each fragment of the plurality of fragments is deemed to be similar to each other. Further, the method includes building a training set based on the plurality of fragments. The training set is used to train the semantic data analysis system.
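The subdivision step in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the function name `subdivide_section`, the word-based measure of length, and the default sizes are assumptions chosen for illustration (the claims mention fragment sizes only as approximate values, and a threshold of roughly twice the fragment size).

```python
from typing import List

# Assumed values for illustration only.
FRAGMENT_WORDS = 50                      # hypothetical fragment size, in words
MIN_SECTION_WORDS = 2 * FRAGMENT_WORDS   # the "predetermined amount"


def subdivide_section(
    section: str,
    fragment_words: int = FRAGMENT_WORDS,
    min_words: int = MIN_SECTION_WORDS,
) -> List[str]:
    """Split a section into consecutive word-based fragments.

    Sections whose length does not exceed min_words are skipped
    (an empty list is returned). All fragments returned for one
    section are treated as similar to each other.
    """
    words = section.split()
    if len(words) <= min_words:
        return []
    return [
        " ".join(words[i:i + fragment_words])
        for i in range(0, len(words), fragment_words)
    ]
```

In this sketch the last fragment of a section may be shorter than the others; whether to keep or drop such a tail is a design choice the abstract does not address.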

First claim

Opening claim text (preview).

What is claimed is:
1. A method of selecting data from a text corpus for training a semantic data analysis system, comprising: selecting an item of the text corpus, wherein the item includes at least one section; extracting a section of the at least one section of the item; determining a length of the section of the at least one section of the item; based on the length of the section being greater than a predetermined amount, subdividing the section into a plurality of fragments, wherein each fragment of the plurality of fragments is deemed to be similar to each other; and building a training set based on the plurality of fragments, wherein the training set is used to train the semantic data analysis system.
2. The method of claim 1, wherein building the training set includes: updating a similarity relation between each fragment in the training set; and updating a fragment count of the training set.
3. The method of claim 2, further comprising: after updating the fragment count of the training set, comparing the fragment count with a predetermined fragment value; and in accordance with a determination that the fragment count is below the predetermined fragment value, selecting another item of the text corpus.
4. The method of claim 1, wherein the item is a first item and the plurality of fragments is a first plurality of fragments; and the method further comprises: selecting a second item of the text corpus, wherein the second item includes at least one section; extracting a section of the at least one section of the second item; determining a length of the section of the at least one section of the second item; based on the length of the section being greater than a predetermined amount, subdividing the section into a second plurality of fragments, wherein: each fragment of the second plurality of fragments is deemed to be similar to each other, and each fragment of the second plurality of fragments is deemed to be dissimilar to each fragment of the first plurality of fragments, and including the second plurality of fragments in the training set.
5. The method of claim 4, further comprising: after including the second plurality of fragments in the training set, updating a similarity relation between each fragment in the training set; and updating a fragment count of the training set.
6. The method of claim 5, further comprising: after updating the fragment count of the training set, comparing the fragment count with a predetermined fragment value; and in accordance with a determination that the fragment count satisfies the predetermined fragment value, forgo selecting another item of the text corpus.
7. The method of claim 1, further comprising; determining whether the item includes another section; in accordance with a determination that the item includes another section, extracting the other section of the item; based on the length of the other section being greater than a predetermined amount, subdividing the other section into another plurality of fragments, wherein each fragment of the other plurality of fragments is deemed to be similar to each other and deemed to be of undefined similarity with regard to the plurality of fragments; and including the other plurality of fragments in the training set.
8. The method of claim 1, wherein the predetermined amount is approximately twice a size of a fragment.
9. The method of claim 8, wherein a fragment has approximately 100 words.
10. The method of claim 9, wherein a fragment has between 40 and 60 words.
11. The method of claim 1, further comprising ignoring one or more sections of the at least one section having a length less than the predetermined amount.
12. The method of claim 1, wherein the item is an article.
13. The method of claim 1, wherein similar fragments are semantically similar.
14. A computer server system for selecting data from a text corpus for training a semantic data analysis system, the computer server system comprising: one or more processors; and memory storing one or more instructions that, when executed by the one or more processors, cause the computer server system to perform operations including: selecting an item of the text corpus, wherein the item includes at least one section; extracting a section of the at least one section of the item; determining a length of the section of the at least one section of the item; based on the length of the section being greater than a predetermined amount, subdividing the section into a plurality of fragments, wherein each fragment of the plurality of fragments is deemed to be similar to each other; and building a training set based on the plurality of fragments, wherein the training set is used to train the semantic data analysis system.
15. The computer server system of claim 14, wherein building the training set includes: updating a similarity relation between each fragment in the training set; and updating a fragment count of the training set.
16. The computer server system of claim 15, further comprising instructions that, when executed by the one or more processors, cause the computer server system to perform operations including: after updating the fragment count of the training set, comparing the fragment count with a predetermined fragment value; and in accordance with a determination that the fragment count is below the predetermined fragment value, selecting another item of the text corpus.
17. The computer server system of claim 14, further comprising instructions that, when executed by the one or more processors, cause the computer server system to perform operations including: determining whether the item includes another section; in accordance with a determination that the item includes another section, extracting the other section of the item; based on the length of the other section being greater than a predetermined amount, subdividing the other section into another plurality of fragments, wherein each fragment of the other plurality of fragments is deemed to be similar to each other and deemed to be of undefined similarity with regard to the plurality of fragments; and including the other plurality of fragments in the training set.
18. A non-transitory computer readable storage medium configured to select data from a text corpus for training a semantic data analysis system, the non-transitory computer readable storage medium comprising instructions which, when executed on at least one processor, cause the at least one processor to: select an item of the text corpus, wherein the item includes at least one section; extract a section of the at least one section of the item; determine a length of the section of the at least one section of the item; based on the length of the section being greater than a predetermined amount, subdivide the section into a plurality of fragments, wherein each fragment of the plurality of fragments is deemed to be similar to each other; and build a training set based on the plurality of fragments, wherein the training set is used to train the semantic data analysis system.
19. The non-transitory computer readable storage medium of claim 18, wherein building the training set includes: updating a similarity relation between each fragment in the training set; and updating a fragment count of the training set.
20. The non-transitory computer readable storage medium of claim 18, further comprising instructions which, when executed on at least one processor, cause the a
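The relations described across the method claims (similar within one section, dissimilar across different items, undefined across sections of the same item, and a stop condition on the fragment count) can be sketched as follows. This is an illustrative reading of the claim language, not the patentee's code: the names `build_training_set` and `split_words`, the +1 / -1 encoding, the representation of the corpus as a list of items that are each a list of section strings, and the default sizes are all assumptions.

```python
from typing import Dict, List, Tuple


def split_words(text: str, size: int) -> List[str]:
    """Cut text into consecutive fragments of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_training_set(
    corpus: List[List[str]],      # assumed shape: items -> section strings
    fragment_words: int = 50,     # hypothetical fragment size
    target_count: int = 1000,     # hypothetical "predetermined fragment value"
) -> Tuple[List[str], Dict[Tuple[int, int], int]]:
    fragments: List[str] = []
    origin: List[Tuple[int, int]] = []   # (item index, section index) per fragment
    # (i, j) with i < j -> +1 similar, -1 dissimilar; a missing key means undefined.
    relations: Dict[Tuple[int, int], int] = {}

    for item_idx, sections in enumerate(corpus):
        # Select another item only while the fragment count is below the target.
        if len(fragments) >= target_count:
            break
        for sec_idx, section in enumerate(sections):
            # Ignore sections that are not longer than about two fragments.
            if len(section.split()) <= 2 * fragment_words:
                continue
            for frag in split_words(section, fragment_words):
                new_id = len(fragments)
                for old_id, (o_item, o_sec) in enumerate(origin):
                    if o_item != item_idx:
                        relations[(old_id, new_id)] = -1   # different items
                    elif o_sec == sec_idx:
                        relations[(old_id, new_id)] = 1    # same section
                    # same item, different section: left undefined
                fragments.append(frag)
                origin.append((item_idx, sec_idx))
    return fragments, relations
```

The pairwise update is quadratic in the number of fragments, which keeps the sketch short; a larger corpus would likely call for storing relations per section and per item instead of per pair.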

Assignees

Evernote Corp, Bending Spoons S P A

Inventors

Classifications

G06F16/3344: using natural language analysis
G06F40/205: Parsing
G06F16/334: Query execution (filtering based on additional data G06F16/335)
G06N20/00: Machine learning
G06N7/01: Probabilistic graphical models, e.g. probabilistic networks

Patent family

Related publications grouped by family.

View patent family 72140991
