System and method for organizing and processing feature based data structures

US10127219B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10127219-B2
Application number  US-201615374479-A
Country             US
Kind code           B2
Filing date         Dec 9, 2016
Priority date       Dec 9, 2016
Publication date    Nov 13, 2018
Grant date          Nov 13, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for organizing and processing feature based data structures that can be used in linguistic spell checking and auto-correction, comprising: splitting an original dictionary into sub-dictionaries based on different values of a common feature such as high frequency words; receiving an input text that contains errors; determining a sub-dictionary selection feature from the input human-readable text; selecting the sub-dictionary based on the determined sub-dictionary selection feature; executing a first matching in the selected sub-dictionary, wherein a match is found if a similarity between the characters, words, or phrases in proximity of the errors in the input text and a character, word, or phrase in the sub-dictionary is above a threshold; if a unique match is found, the result is returned as an output to correct the errors; otherwise, executing a second matching with a raised threshold, and repeating the second matching until a unique match is found.
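
The matching loop described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `similarity` and `correct`, the starting threshold, and the step size are hypothetical, and `difflib.SequenceMatcher.ratio()` is a stand-in for the similarity measure, which the abstract does not specify. The abstract also does not say what happens when no candidate clears the threshold; the sketch returns `None` in that case.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the abstract does not name a measure."""
    return SequenceMatcher(None, a, b).ratio()


def correct(token, sub_dictionary, threshold=0.6, step=0.05):
    """Return the unique entry whose similarity to `token` exceeds the threshold.

    If several entries match, raise the threshold and match again among them.
    Returns None when no entry matches (an assumption of this sketch).
    """
    candidates = sorted(sub_dictionary)
    while candidates:
        matches = [w for w in candidates if similarity(token, w) > threshold]
        if len(matches) == 1:
            return matches[0]   # unique match: use it as the correction
        candidates = matches    # an empty list ends the loop
        threshold += step       # raised threshold for the next pass
    return None


# Hypothetical sub-dictionary, already selected for this input.
print(correct("helo", {"hello", "help", "held", "halo"}))  # prints: hello
```

In this example all four entries clear the initial threshold, so the threshold is raised until only "hello" remains. Candidates with exactly equal similarity would all drop out together, which is one reason a real system would need a tie-breaking rule that this sketch omits.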

First claim

Opening claim text (preview).

What is claimed is:

1. A system for organizing and processing feature based data structures in linguistic spell checking and auto-correction, comprising: a computer processor configured to: split an original dictionary into two or more sub-dictionaries using an explicit split or an implicit split based on a common feature of high frequency words, wherein each of the sub-dictionaries is smaller in size than the original dictionary and overlapping among the sub-dictionaries is allowed, and wherein contents in each of the sub-dictionaries are organized in a hierarchical tree; receive an input human-readable text that contains one or more errors; determine a sub-dictionary selection feature or selection criteria from the input human-readable text; select the sub-dictionary based on the determined sub-dictionary selection feature or selection criteria; execute a first matching of one or more characters, words, or phrases in proximity of the errors in the input human-readable text against the characters, words, and phrases in the selected sub-dictionary, wherein a match is found if a similarity between the one or more characters, words, or phrases in proximity of the errors in the input human-readable text and a candidate matching character, word, or phrase in the sub-dictionary is above a threshold of degree of similarity; if a unique match is found, return the uniquely matching character, word, or phrase from the selected sub-dictionary as an output to correct the errors; otherwise if more than one candidate matches are found, execute a second matching of one or more characters, words, or phrases in proximity of the errors in the input human-readable text against the character, word, and phrase in the selected sub-dictionary with the threshold of degree of similarity raised; and repeat the second matching until a unique match is found and the uniquely matching character, word, or phrase from the selected sub-dictionary is returned as an output to correct the errors.

2. The system of claim 1, wherein the explicit split comprises: recognizing the common feature of high frequency words among characters, words, and phrases in the original dictionary; and splitting the characters, words, and phrases in the original dictionary into the two or more sub-dictionaries according to different values of the recognized common feature of high frequency words.

3. The system of claim 1, wherein the implicit split comprises: determining a vector space for each character, word, and phrase in the original dictionary using Unicode values of the character, word, and phrase; determining a center value of the vector space for the character, word, and phrase in the original dictionary; and splitting the characters, words, and phrases in the original dictionary into the two or more sub-dictionaries such that each sub-dictionary contains the characters, words, and phrases having their vector-space centers within certain value range.

4. The system of claim 1, wherein the first matching and the second matching are performed by determining a Unicode difference between the one or more characters, words, or phrases in the proximity of the errors in the input human-readable text and the character, word, and phrase in the selected sub-dictionary under comparison.

5. A method for organizing and processing feature based data structures in linguistic spell checking and auto-correction, comprising: splitting an original dictionary into two or more sub-dictionaries using an explicit split or an implicit split based on a common feature of high frequency words, wherein each of the sub-dictionaries is smaller in size than the original dictionary and overlapping among the sub-dictionaries is allowed, and wherein contents in each of the sub-dictionaries are organized in a hierarchical tree; receiving an input human-readable text that contains one or more errors; determining a sub-dictionary selection feature or selection criteria from the input human-readable text; selecting the sub-dictionary based on the determined sub-dictionary selection feature or selection criteria; executing a first matching of one or more characters, words, or phrases in proximity of the errors in the input human-readable text against the characters, words, and phrases in the selected sub-dictionary, wherein a match is found if a similarity between the one or more characters, words, or phrases in proximity of the errors in the input human-readable text and a candidate matching character, word, or phrase in the sub-dictionary is above a threshold of degree of similarity; if a unique match is found, returning the candidate matching character, word, or phrase from the selected sub-dictionary as an output to correct the errors; otherwise if more than one candidate matches are found, executing a second matching of one or more characters, words, or phrases in proximity of the errors in the input human-readable text against the character, word, and phrase in the selected sub-dictionary with the threshold of degree of similarity raised; and repeating the second matching until a unique match is found and the uniquely matching character, word, or phrase from the selected sub-dictionary is returned as an output to correct the errors.

6. The method of claim 5, wherein the explicit split comprises: recognizing the common feature of high frequency words among characters, words, and phrases in the original dictionary; and splitting the characters, words, and phrases in the original dictionary into the two or more sub-dictionaries according to different values of the recognized common feature of high frequency words.

7. The method of claim 5, wherein the implicit split comprises: determining a vector space for each character, word, and phrase in the original dictionary using Unicode values of the character, word, and phrase; determining a center value of the vector space for the character, word, and phrase in the original dictionary; and splitting the characters, words, and phrases in the original dictionary into the two or more sub-dictionaries such that each sub-dictionary contains the characters, words, and phrases having their vector-space centers within certain value range.

8. The method of claim 5, wherein the first matching and the second matching are performed by determining a Unicode difference between the one or more characters, words, or phrases in the proximity of the errors in the input human-readable text and the character, word, and phrase in the selected sub-dictionary under comparison.
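
The implicit split and the Unicode difference named in the dependent claims can be illustrated with a short Python sketch. The claims do not define how the "center value" or the "Unicode difference" is computed, so everything here is an assumed interpretation: the center is taken as the mean code point of an entry, the value ranges are given by a hypothetical sorted `boundaries` list, and the difference is a position-by-position sum of absolute code point differences with the shorter string padded by U+0000. The claims allow overlapping sub-dictionaries; this sketch places each entry in exactly one range for simplicity.

```python
from bisect import bisect_right
from itertools import zip_longest


def unicode_center(entry: str) -> float:
    """Mean code point of a non-empty entry, used here as its center value."""
    return sum(ord(ch) for ch in entry) / len(entry)


def implicit_split(entries, boundaries):
    """Group entries into len(boundaries) + 1 sub-dictionaries by center value.

    `boundaries` must be sorted ascending; each entry lands in one range.
    """
    subs = [[] for _ in range(len(boundaries) + 1)]
    for entry in entries:
        subs[bisect_right(boundaries, unicode_center(entry))].append(entry)
    return subs


def unicode_difference(a: str, b: str) -> int:
    """Sum of absolute code point differences, position by position.

    The shorter string is padded with U+0000 (an assumption of this sketch).
    """
    return sum(abs(ord(x) - ord(y)) for x, y in zip_longest(a, b, fillvalue="\0"))


subs = implicit_split(["abc", "xyz", "中文"], boundaries=[110, 1000])
print(subs)                              # [['abc'], ['xyz'], ['中文']]
print(unicode_difference("cat", "cab"))  # 18
```

A smaller difference means a closer match under this interpretation, so a similarity score for thresholding would have to be derived from it, for example by inverting or normalizing the value.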

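Assignees

Hong Kong Applied Science & Tech Research Inst Co Ltd, Hong Kong Applied Science And Technology Res Institute Company Limited
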
Inventors

Classifications

  • G06F40/232 · Primary

    Orthographic correction, e.g. spell checking or vowelisation · CPC title

  • Processing of non-Latin text (kana-to-kanji conversion G06F40/129; vowelisation G06F40/232) · CPC title

  • Dictionaries · CPC title

  • G06F17/273 · Primary

    Physics · mapped topic

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10127219B2 cover?
A method for organizing and processing feature based data structures that can be used in linguistic spell checking and auto-correction, comprising: splitting an original dictionary into sub-dictionaries based on different values of a common feature such as high frequency words; receiving an input text that contains errors; determining a sub-dictionary selection feature from the input human-read…
Who is the assignee on this patent?
Hong Kong Applied Science & Tech Research Inst Co Ltd, Hong Kong Applied Science And Technology Res Institute Company Limited
What technology area does this patent fall under?
Primary CPC classification G06F40/232. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 13, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).