System and method for ontology induction through statistical profiling and reference schema matching
US-2018052870-A1 · Feb 22, 2018 · US
US11249960B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11249960-B2 |
| Application number | US-201816004863-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 11, 2018 |
| Priority date | Jun 11, 2018 |
| Publication date | Feb 15, 2022 |
| Grant date | Feb 15, 2022 |
Official abstract text for this publication.
Embodiments generally relate to transforming data for a target schema. In some embodiments, a method includes receiving input data, where the input data includes a plurality of segments, and where the segments include a plurality of source fields containing target data. The method further includes characterizing the input data based at least in part on a plurality of predetermined metrics, where the predetermined metrics determine a structure of the input data. The method further includes mapping the target data in the source fields of the segments to a plurality of target fields of a target schema based at least in part on the characterizing. The method further includes populating the target fields of the target schema with the target data from the source fields based at least in part on the mapping.
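The four steps in the abstract (receive, characterize, map, populate) can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not name the predetermined metrics, so the two metrics used here (digit ratio and mean length), the `FieldProfile` type, the `TARGET_SCHEMA` contents, and the greedy distance-based matching are all assumptions chosen for illustration. The sketch also assumes every segment has the same number of source fields.

```python
from dataclasses import dataclass


@dataclass
class FieldProfile:
    """Hypothetical per-field metrics; the abstract does not name specific ones."""
    digit_ratio: float   # share of characters that are digits
    mean_length: float   # average value length in characters


# Hypothetical target schema: target field name -> expected profile.
TARGET_SCHEMA = {
    "name": FieldProfile(digit_ratio=0.0, mean_length=8.0),
    "zip_code": FieldProfile(digit_ratio=1.0, mean_length=5.0),
}


def characterize(segments: list[list[str]]) -> list[FieldProfile]:
    """Profile each source-field position across all segments."""
    profiles = []
    for column in zip(*segments):  # one tuple of values per source field
        chars = "".join(column)
        digit_ratio = sum(c.isdigit() for c in chars) / max(len(chars), 1)
        mean_length = sum(len(value) for value in column) / len(column)
        profiles.append(FieldProfile(digit_ratio, mean_length))
    return profiles


def distance(a: FieldProfile, b: FieldProfile) -> float:
    """Ad hoc dissimilarity between two profiles (an assumption, not from the source)."""
    return abs(a.digit_ratio - b.digit_ratio) + abs(a.mean_length - b.mean_length) / 10


def map_fields(profiles: list[FieldProfile]) -> dict[int, str]:
    """Greedily assign each source-field index to the closest unused target field."""
    unused = dict(TARGET_SCHEMA)
    mapping = {}
    for index, profile in enumerate(profiles):
        if not unused:
            break
        best = min(unused, key=lambda name: distance(unused[name], profile))
        mapping[index] = best
        del unused[best]
    return mapping


def populate(segments: list[list[str]], mapping: dict[int, str]) -> list[dict[str, str]]:
    """Fill target fields with the values from the mapped source fields."""
    return [
        {target: segment[index] for index, target in mapping.items()}
        for segment in segments
    ]


segments = [["Alice Wu", "94110"], ["Bob Lee", "10001"]]
mapping = map_fields(characterize(segments))
records = populate(segments, mapping)
```

Profiling by column position keeps the sketch short; the claims instead describe per-token features, which would replace `characterize` with a per-value analysis.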
Opening claim text (preview).
What is claimed is:
1. A system comprising: at least one processor and a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by the at least one processor to cause the at least one processor to perform operations comprising: receiving input data, wherein the input data includes a plurality of segments, and wherein the segments include a plurality of source fields; parsing each of the segments into tokens, wherein each token is data that is contained in a particular source field of the plurality of source fields, and wherein each token includes at least one alphanumeric or numeric value; determining contextual information associated with each token, wherein the contextual information comprises one or more features associated with each token, and wherein determining the contextual information comprises determining one or more features of each token based on metrics and determining whether the numeric value of a given token conforms to an expected range of a first numeric target field versus a second numeric target field based on a type of the target field; mapping the tokens in the source fields of the segments to a plurality of target fields of a target schema based at least in part on the contextual information and confidence values, and wherein the confidence values that meet one or more confidence value thresholds indicate degrees of matching between each token and one or more target fields; and populating the target fields of the target schema with the tokens from the source fields based at least in part on the mapping, wherein the parsing of the segments into tokens, the determining of the contextual information associated the tokens, and the mapping of the tokens in the source fields to the target fields is performed substantially during the populating the target fields of the target schema with the tokens from the source fields.
2. The system of claim 1, wherein the input data is semi-structured data.
3. The system of claim 1, wherein the target schema is a structured schema.
4. The system of claim 1, wherein the structure of each token comprises a relationship between one or more structural features of each token and at least one other token in a same segment.
5. The system of claim 1, wherein, to map the tokens in the source fields of the segments to the target fields of a target schema, the at least one processor further performs operations comprising: comparing each target field to the contextual information associated with each token; and matching the token in each source field to one of the target fields of the target schema based at least in part on the comparing of each target field to the contextual information.
6. A computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by at least one processor to cause the at least one processor to perform operations comprising: receiving input data, wherein the input data includes a plurality of segments, and wherein the segments include a plurality of source fields; parsing each of the segments into tokens, wherein each token is data that is contained in a particular source field of the plurality of source fields, and wherein each token includes at least one alphanumeric or numeric value; determining contextual information associated with each token, wherein the contextual information comprises one or more features associated with each token, and wherein determining the contextual information comprises determining one or more features of each token based on metrics and determining whether the numeric value of a given token conforms to an expected range of a first numeric target field versus a second numeric target field based on a type of the target field; mapping the tokens in the source fields of the segments to a plurality of target fields of a target schema based at least in part on the contextual information and confidence values, and wherein the confidence values that meet one or more confidence value thresholds indicate degrees of matching between each token and one or more target fields; and populating the target fields of the target schema with the tokens from the source fields based at least in part on the mapping, wherein the parsing of the segments into tokens, the determining of the contextual information associated the tokens, and the mapping of the tokens in the source fields to the target fields is performed substantially during the populating the target fields of the target schema with the tokens from the source fields.
7. The computer program product of claim 6, wherein the input data is semi-structured data.
8. The computer program product of claim 6, wherein the target schema is a structured schema.
9. The computer program product of claim 6, wherein the structure of each token comprises a relationship between one or more structural features of each token and at least one other token in a same segment.
10. The computer program product of claim 6, wherein, to map the tokens in the source fields of the segments to the target fields of a target schema, the at least one processor further performs operations comprising: comparing each target field to the contextual information associated with each token; and matching the token in each source field to one of the target fields of the target schema based at least in part on the comparing of each target field to the contextual information.
11. The computer program product of claim 6, wherein, to map the tokens in the source fields of the segments to the target fields of a target schema, the at least one processor further performs operations comprising: determining confidence values, wherein the confidence values indicate degrees of matching between each token and one or more target fields; and matching tokens in the source fields to the target fields of the target schema.
12. A computer-implemented method for transforming data for a target schema, the method comprising: receiving input data, wherein the input data includes a plurality of segments, and wherein the segments include a plurality of source fields; parsing each of the segments into tokens, wherein each token is data that is contained in a particular source field of the plurality of source fields, and wherein each token includes at least one alphanumeric or numeric value; determining contextual information associated with each token, wherein the contextual information comprises one or more features associated with each token, and wherein determining the contextual information comprises determining one or more features of each token based on metrics and determining whether the numeric value of a given token conforms to an expected range of a first numeric target field versus a second numeric target field based on a type of the target field; mapping the tokens in the source fields of the segments to a plurality of target fields of a target schema based at least in part on the contextual information and confidence values, and wherein the confidence values that meet one or more confidence value thresholds indicate degrees of matching between each token and one or more target fields; and populating the target fields of the target schema with the tokens from the source fields based at least in part on the mapping, wherein the parsing of the segments into tokens, the determining of the contextual information associated the tokens, and the mapping of the tokens in the source fields to the target fields is performed substantially during the populating the target fields of the target schema with the
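
The independent claims recite checking whether a token's numeric value conforms to the expected range of one numeric target field versus another, and mapping only where a confidence value meets a threshold. A minimal Python sketch of that idea follows. The field names, the ranges in `EXPECTED_RANGES`, the value of `CONFIDENCE_THRESHOLD`, and the binary scoring rule are all hypothetical; the claims do not specify how confidence values are computed, and a graded score could be substituted.

```python
from typing import Optional

# Hypothetical numeric target fields and their expected (inclusive) ranges.
EXPECTED_RANGES = {
    "age_years": (0.0, 120.0),
    "annual_income": (1_000.0, 10_000_000.0),
}

# Hypothetical threshold a confidence value must meet for a mapping to be made.
CONFIDENCE_THRESHOLD = 0.5


def range_confidence(value: float, low: float, high: float) -> float:
    """Binary confidence (an assumption): 1.0 inside the expected range, else 0.0."""
    return 1.0 if low <= value <= high else 0.0


def match_numeric_token(token: str) -> Optional[tuple[str, float]]:
    """Return (target_field, confidence) for the best match, or None if no match."""
    try:
        value = float(token)
    except ValueError:
        return None  # not a numeric token; other features would have to decide

    scores = {
        name: range_confidence(value, low, high)
        for name, (low, high) in EXPECTED_RANGES.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] < CONFIDENCE_THRESHOLD:
        return None  # no target field meets the confidence threshold
    return best, scores[best]
```

For example, under these assumed ranges `match_numeric_token("42")` selects `age_years` and `match_numeric_token("50000")` selects `annual_income`, while a non-numeric or out-of-range token yields `None`. The ranges are deliberately non-overlapping; overlapping ranges would need a tie-breaking rule that this sketch does not attempt.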
Mapping; Conversion · CPC title
Data format conversion from or to a database · CPC title
Schema design and management · CPC title