System and method for performing Unicode matching
US-9275019-B2 · Mar 1, 2016 · US
US10127219B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10127219-B2 |
| Application number | US-201615374479-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 9, 2016 |
| Priority date | Dec 9, 2016 |
| Publication date | Nov 13, 2018 |
| Grant date | Nov 13, 2018 |
Official abstract text for this publication.
A method for organizing and processing feature based data structures that can be used in linguistic spell checking and auto-correction, comprising: splitting an original dictionary into sub-dictionaries based on different values of a common feature such as high frequency words; receiving an input text that contains errors; determining a sub-dictionary selection feature from the input human-readable text; selecting the sub-dictionary based on the determined sub-dictionary selection feature; executing a first matching in the selected sub-dictionary, wherein a match is found if a similarity between the characters, words, or phrases in proximity of the errors in the input text and a character, word, or phrase in the sub-dictionary is above a threshold; if a unique match is found, the result is returned as an output to correct the errors; otherwise, executing a second matching with a raised threshold, and repeating the second matching until a unique match is found.
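The matching loop in the abstract (match above a threshold, then raise the threshold until one candidate remains) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `similarity` uses Python's `difflib.SequenceMatcher` as a stand-in metric, and the `threshold` and `step` defaults are arbitrary. The abstract does not say what happens when no candidate survives, so the sketch returns `None` in that case.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]. SequenceMatcher is a stand-in, not the patent's metric."""
    return SequenceMatcher(None, a, b).ratio()


def correct(token, sub_dictionary, threshold=0.6, step=0.05):
    """Return the single entry whose similarity exceeds the threshold, else None.

    Hypothetical helper: the starting threshold and step size are assumptions.
    """
    candidates = list(sub_dictionary)
    rounds = 0
    while True:
        current = threshold + rounds * step
        if current >= 1.0:
            return None  # threshold cannot be raised further
        # Keep only entries whose similarity is above the current threshold.
        candidates = [w for w in candidates if similarity(token, w) > current]
        if len(candidates) == 1:
            return candidates[0]  # unique match: use it as the correction
        if not candidates:
            return None  # nothing survived (behavior assumed, not in the source)
        rounds += 1  # several matches remain: raise the threshold and repeat
```

For example, `correct("helo", ["hello", "help", "held", "world"])` first keeps three candidates and, once the threshold has been raised past the score shared by "help" and "held", returns `"hello"`.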
Opening claim text (preview).
What is claimed is:
1. A system for organizing and processing feature based data structures in linguistic spell checking and auto-correction, comprising: a computer processor configured to: split an original dictionary into two or more sub-dictionaries using an explicit split or an implicit split based on a common feature of high frequency words, wherein each of the sub-dictionaries is smaller in size than the original dictionary and overlapping among the sub-dictionaries is allowed, and wherein contents in each of the sub-dictionaries are organized in a hierarchical tree; receive an input human-readable text that contains one or more errors; determine a sub-dictionary selection feature or selection criteria from the input human-readable text; select the sub-dictionary based on the determined sub-dictionary selection feature or selection criteria; execute a first matching of one or more characters, words, or phrases in proximity of the errors in the input human-readable text against the characters, words, and phrases in the selected sub-dictionary, wherein a match is found if a similarity between the one or more characters, words, or phrases in proximity of the errors in the input human-readable text and a candidate matching character, word, or phrase in the sub-dictionary is above a threshold of degree of similarity; if a unique match is found, return the uniquely matching character, word, or phrase from the selected sub-dictionary as an output to correct the errors; otherwise if more than one candidate matches are found, execute a second matching of one or more characters, words, or phrases in proximity of the errors in the input human-readable text against the character, word, and phrase in the selected sub-dictionary with the threshold of degree of similarity raised; and repeat the second matching until a unique match is found and the uniquely matching character, word, or phrase from the selected sub-dictionary is returned as an output to correct the errors.
2. The system of claim 1, wherein the explicit split comprises: recognizing the common feature of high frequency words among characters, words, and phrases in the original dictionary; and splitting the characters, words, and phrases in the original dictionary into the two or more sub-dictionaries according to different values of the recognized common feature of high frequency words.
3. The system of claim 1, wherein the implicit split comprises: determining a vector space for each character, word, and phrase in the original dictionary using Unicode values of the character, word, and phrase; determining a center value of the vector space for the character, word, and phrase in the original dictionary; and splitting the characters, words, and phrases in the original dictionary into the two or more sub-dictionaries such that each sub-dictionary contains the characters, words, and phrases having their vector-space centers within certain value range.
4. The system of claim 1, wherein the first matching and the second matching are performed by determining a Unicode difference between the one or more characters, words, or phrases in the proximity of the errors in the input human-readable text and the character, word, and phrase in the selected sub-dictionary under comparison.
5. A method for organizing and processing feature based data structures in linguistic spell checking and auto-correction, comprising: splitting an original dictionary into two or more sub-dictionaries using an explicit split or an implicit split based on a common feature of high frequency words, wherein each of the sub-dictionaries is smaller in size than the original dictionary and overlapping among the sub-dictionaries is allowed, and wherein contents in each of the sub-dictionaries are organized in a hierarchical tree; receiving an input human-readable text that contains one or more errors; determining a sub-dictionary selection feature or selection criteria from the input human-readable text; selecting the sub-dictionary based on the determined sub-dictionary selection feature or selection criteria; executing a first matching of one or more characters, words, or phrases in proximity of the errors in the input human-readable text against the characters, words, and phrases in the selected sub-dictionary, wherein a match is found if a similarity between the one or more characters, words, or phrases in proximity of the errors in the input human-readable text and a candidate matching character, word, or phrase in the sub-dictionary is above a threshold of degree of similarity; if a unique match is found, returning the candidate matching character, word, or phrase from the selected sub-dictionary as an output to correct the errors; otherwise if more than one candidate matches are found, executing a second matching of one or more characters, words, or phrases in proximity of the errors in the input human-readable text against the character, word, and phrase in the selected sub-dictionary with the threshold of degree of similarity raised; and repeating the second matching until a unique match is found and the uniquely matching character, word, or phrase from the selected sub-dictionary is returned as an output to correct the errors.
6. The method of claim 5, wherein the explicit split comprises: recognizing the common feature of high frequency words among characters, words, and phrases in the original dictionary; and splitting the characters, words, and phrases in the original dictionary into the two or more sub-dictionaries according to different values of the recognized common feature of high frequency words.
7. The method of claim 5, wherein the implicit split comprises: determining a vector space for each character, word, and phrase in the original dictionary using Unicode values of the character, word, and phrase; determining a center value of the vector space for the character, word, and phrase in the original dictionary; and splitting the characters, words, and phrases in the original dictionary into the two or more sub-dictionaries such that each sub-dictionary contains the characters, words, and phrases having their vector-space centers within certain value range.
8. The method of claim 5, wherein the first matching and the second matching are performed by determining a Unicode difference between the one or more characters, words, or phrases in the proximity of the errors in the input human-readable text and the character, word, and phrase in the selected sub-dictionary under comparison.
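The implicit split (claims 3 and 7) and the Unicode difference (claims 4 and 8) are stated only in general terms, so the following is a rough sketch under stated assumptions rather than the claimed implementation. It assumes the "center value" is the mean code point of an entry, that sub-dictionaries are defined by a hypothetical sorted list of `boundaries`, and that the "Unicode difference" is a position-wise sum of absolute code point differences with missing positions counted as code point 0.

```python
from itertools import zip_longest
from statistics import mean


def center(entry: str) -> float:
    """Assumed 'center value': the mean of the entry's Unicode code points."""
    return mean(ord(ch) for ch in entry)


def implicit_split(dictionary, boundaries):
    """Group entries into len(boundaries) + 1 sub-dictionaries by center value.

    `boundaries` is a hypothetical sorted list of cut points; an entry goes
    into the bucket whose value range contains its center.
    """
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for entry in dictionary:
        c = center(entry)
        index = sum(c >= b for b in boundaries)  # number of cut points passed
        buckets[index].append(entry)
    return buckets


def unicode_difference(a: str, b: str) -> int:
    """Assumed metric: sum of absolute code point differences per position.

    Positions missing from the shorter string are treated as code point 0.
    """
    return sum(
        abs(ord(x) - ord(y)) for x, y in zip_longest(a, b, fillvalue="\0")
    )
```

Under these assumptions, `unicode_difference("cat", "cot")` is 14 (the distance between "a" and "o"), and a smaller difference would correspond to a higher similarity when candidates in the selected sub-dictionary are compared.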
Orthographic correction, e.g. spell checking or vowelisation · CPC title
Processing of non-Latin text (kana-to-kanji conversion G06F40/129; vowelisation G06F40/232) · CPC title
Dictionaries · CPC title
Physics · mapped topic