Systems and methods for multilingual metadata
US-2017154101-A1 · Jun 1, 2017 · US
US9858258B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9858258-B1 |
| Application number | US-201615282350-A |
| Country | US |
| Kind code | B1 |
| Filing date | Sep 30, 2016 |
| Priority date | Sep 30, 2016 |
| Publication date | Jan 2, 2018 |
| Grant date | Jan 2, 2018 |
Official abstract text for this publication.
Automatic locale determination for documents is described. In an embodiment, a computer server receives an electronic document comprising a plurality of unknown-language data elements each associated with one or more types. Based on a document schema of the document, the computer system selects one or more unknown-language data elements from the plurality of unknown-language data elements and assigns to each of the one or more unknown-language data elements a corresponding weight value based on a respective type of the unknown-language data element. The computer system compares the one or more unknown-language data elements with a plurality of known-language data elements that are associated with the document schema and, based on the comparing, determines a number of unknown-language data elements in the one or more unknown-language data elements that matched any in a subset of the plurality of known-language data elements, wherein the subset of known-language data elements corresponds to a particular language. Based on the number of data elements that matched the subset of known-language data elements and on the corresponding weight value assigned to each unknown-language data element in that number, the computer system determines a language confidence level value specifying a level of machine confidence that the document is expressed in the particular language. Based on the language confidence level value for the particular language exceeding a language threshold value, the computer system automatically processes the document using the particular language.
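The weighted matching described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the weight table `TYPE_WEIGHTS`, the term sets in `KNOWN_TERMS`, the function names, and the `threshold_ratio` parameter are all hypothetical, and taking the threshold as a fraction of the maximum possible score is only one reading of the claim language.

```python
# Hypothetical per-type weights: field names are assumed here to be stronger
# language signals than field values. The numbers are illustrative only.
TYPE_WEIGHTS = {"field_name": 2.0, "field_value": 1.0}

# Hypothetical known-language sets associated with one document schema.
KNOWN_TERMS = {
    "en": {"invoice", "total", "date", "supplier"},
    "de": {"rechnung", "gesamt", "datum", "lieferant"},
}


def language_confidences(elements):
    """Sum the weights of the elements that match each language's known terms.

    `elements` is a list of (text, type) pairs already selected via the schema.
    """
    scores = {lang: 0.0 for lang in KNOWN_TERMS}
    for text, elem_type in elements:
        weight = TYPE_WEIGHTS.get(elem_type, 0.0)
        for lang, terms in KNOWN_TERMS.items():
            if text.lower() in terms:
                scores[lang] += weight
    return scores


def detect_language(elements, threshold_ratio=0.5):
    """Return the best-scoring language if it exceeds the threshold, else None.

    Assumption: the threshold is a fraction of the maximum score the document
    could reach if every selected element matched.
    """
    scores = language_confidences(elements)
    max_possible = sum(TYPE_WEIGHTS.get(t, 0.0) for _, t in elements)
    threshold = threshold_ratio * max_possible
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None


doc = [
    ("Rechnung", "field_name"),
    ("Datum", "field_name"),
    ("Gesamt", "field_name"),
    ("ACME GmbH", "field_value"),
]
print(language_confidences(doc))
print(detect_language(doc))
```

In this sketch three field names match the German set, so German accumulates a weighted score of 6.0 out of a possible 7.0 and clears the assumed 50% threshold. A document with no matches returns `None`, which a caller could treat as "language undetermined".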
Opening claim text (preview).
What is claimed is:
1. A data processing method comprising: receiving, at a server computer, an electronic document comprising a plurality of unknown-language data elements each associated with one or more types; based on a document schema of the document, selecting one or more unknown-language data elements from the plurality of unknown-language data elements; assigning to each of the one or more unknown-language data elements a corresponding weight value based on a respective type of the unknown-language data element; comparing the one or more unknown-language data elements with a plurality of known-language data elements that are associated with the document schema; based on the comparing, determining a number of unknown-language data elements in the one or more unknown-language data elements that matched any in a subset of the plurality of known-language data elements, wherein the subset of known-language data elements corresponds to a particular language; based on the number of unknown-language data elements in the one or more unknown-language data elements that matched to the subset of known-language data elements and based on the corresponding weight value assigned to each unknown-language data element in the number of unknown-language data elements, determining a language confidence level value specifying a level of machine confidence that the document is expressed in the particular language; based on the language confidence level value for the particular language exceeding a language threshold value, automatically processing the document using the particular language.
2. The method of claim 1, further comprising: receiving the document as part of receiving a request to process the document, the request comprising one or more additional data elements; selecting an additional data element that indicates possible language for the request, the additional data element assigned to a particular weight; based on a data value of the additional data element and the particular weight, adjusting the language confidence level value for the document.
3. The method of claim 1, wherein the respective type of the unknown-language data element is a data field name of the unknown-language data element or a data value of the unknown-language data element of the document.
4. The method of claim 1, wherein selecting one or more unknown-language data elements from the plurality of unknown-language data elements is further based on a document type of the document.
5. The method of claim 1, wherein the document schema of the document depends on a type of structured data included in the document, and wherein the type of the structured data is one or more of XML (Extensible Markup Language), JSON (JavaScript Object Notation), cXML (commerce eXtensible Markup Language), IDoc (Intermediate Document), CSV (Comma Separated values), or ODF (Open Document).
6. The method of claim 1, further comprising: storing the plurality of known-language data elements associated with the document schema of the document in a data store in a plurality of language sets of known-language data elements, each set of known-language data elements corresponding to a supported language in a plurality of supported languages that includes the particular language; comparing the one or more unknown-language data elements with one or more known-language data elements in said each set of known-language data elements to determine corresponding number of unknown-language data elements that matched for the corresponding supported language.
7. The method of claim 1, wherein the comparing further comprises stemming the one or more unknown-language data elements to match with the plurality of known-language data elements.
8. The method of claim 1, further comprising: based on the document schema of the document, selecting at least one unknown-language data element of the plurality of unknown-language data elements such that the at least one unknown-language data element has a data value that can vary in formats based on a locale of the document; based on a format of the data value, determining a locale confidence level value for the document.
9. The method of claim 8, wherein the format of the data value is based at least on one of the following: a date format, a number format, or a currency value format.
10. The method of claim 1, further comprising determining the threshold language value based on a maximum language confidence value possible for the document.
11. The method of claim 1, further comprising determining the language threshold value based on a plurality of language confidence level values, for a plurality of languages, determined for the document that includes the language confidence level value.
12. The method of claim 1, further comprising: automatically determining that a file that includes the document is compressed; in response to automatically determining that the file that includes the document is compressed, automatically decompressing the file to extract the document.
13. The method of claim 1, further comprising: automatically determining that the document is encrypted; in response to automatically determining that the document is encrypted, automatically decrypting the document.
14. A data-processing method comprising: using a first computer, obtaining from one or more non-transitory computer-readable data storage media a copy of one or more sequences of instructions that are stored on the media and are arranged, when executed using a second computer among a plurality of other computers to cause the second computer to perform: using a computer, receiving an electronic document comprising a plurality of unknown-language data elements each associated with one or more types; using the computer, based on a document schema of the document, selecting one or more unknown-language data elements from the plurality of unknown-language data elements; using the computer, assigning to each of the one or more unknown-language data elements a corresponding weight value based on a respective type of the unknown-language data element; using the computer, comparing the one or more unknown-language data elements with a plurality of known-language data elements that are associated with the document schema; using the computer, based on the comparing, determining a number of unknown-language data elements in the one or more unknown-language data elements that matched any in a subset of the plurality of known-language data elements, wherein the subset of known-language data elements corresponds to a particular language; using the computer, based on the number of unknown-language data elements in the one or more unknown-language data elements that matched to the subset of known-language data elements and based on the corresponding weight value assigned to each unknown-language data element in the number of unknown-language data elements, determining a language confidence level value specifying a level of machine confidence that the document is expressed in the particular language; using the computer, based on the language confidence level value for the particular language exceeding a language threshold value, automatically processing the document using the particular language.
15. The method of claim 14, further comprising: receiving the document as part of receiving a request to process the document, the request comprising one or more additional data elements; selecting an additional data element that indicates possible language for the request, the additional data element assigned to a particular weight; based on a data value
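Claims 8 and 9 describe deriving a locale confidence level from the format of date, number, or currency values. The sketch below shows one way this could look in Python. The pattern tables, the locale identifiers, and the `locale_confidences` function are hypothetical illustrations and are not taken from the patent. Real locales often share formats, so a count like this narrows candidates rather than deciding the locale on its own.

```python
import re

# Hypothetical format patterns for two locales (illustrative only).
DATE_PATTERNS = {
    "en_US": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),    # e.g. 12/31/2016
    "de_DE": re.compile(r"^\d{1,2}\.\d{1,2}\.\d{4}$"),  # e.g. 31.12.2016
}
NUMBER_PATTERNS = {
    "en_US": re.compile(r"^\d{1,3}(,\d{3})*(\.\d+)?$"),  # e.g. 1,234.56
    "de_DE": re.compile(r"^\d{1,3}(\.\d{3})*(,\d+)?$"),  # e.g. 1.234,56
}
PATTERNS_BY_KIND = {"date": DATE_PATTERNS, "number": NUMBER_PATTERNS}


def locale_confidences(values):
    """Count, per locale, how many values match that locale's format.

    `values` is a list of (text, kind) pairs, where kind is "date" or "number".
    Values of an unknown kind are ignored.
    """
    scores = {"en_US": 0, "de_DE": 0}
    for text, kind in values:
        for locale, pattern in PATTERNS_BY_KIND.get(kind, {}).items():
            if pattern.match(text):
                scores[locale] += 1
    return scores


sample = [("31.12.2016", "date"), ("1.234,56", "number")]
print(locale_confidences(sample))
```

An ambiguous value such as `100` matches both number patterns and raises both counts equally, which is one reason a per-locale score would likely need to be combined with the language confidence level rather than used alone.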
Search customisation based on user profiles and personalisation · CPC title
Recognition of textual entities · CPC title
Indexing, e.g. XML tags; Data structures therefor; Storage structures · CPC title
Coding or compression of tree-structured data · CPC title
Distributed file systems · CPC title