Transforming data for a target schema

US11249960B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11249960-B2
Application number  US-201816004863-A
Country             US
Kind code           B2
Filing date         Jun 11, 2018
Priority date       Jun 11, 2018
Publication date    Feb 15, 2022
Grant date          Feb 15, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments generally relate to transforming data for a target schema. In some embodiments, a method includes receiving input data, where the input data includes a plurality of segments, and where the segments include a plurality of source fields containing target data. The method further includes characterizing the input data based at least in part on a plurality of predetermined metrics, where the predetermined metrics determine a structure of the input data. The method further includes mapping the target data in the source fields of the segments to a plurality of target fields of a target schema based at least in part on the characterizing. The method further includes populating the target fields of the target schema with the target data from the source fields based at least in part on the mapping.
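The four steps in the abstract (receive, characterize, map, populate) can be pictured with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the `TARGET_SCHEMA` patterns, the delimiter-consistency metric in `characterize`, and the function names themselves.

```python
import re
from collections import Counter

# Hypothetical target schema: target field name -> pattern a value must match.
TARGET_SCHEMA = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "amount": re.compile(r"\d+\.\d{2}"),
    "name": re.compile(r"[A-Za-z ]+"),
}

def characterize(segments, delimiters=",;|\t"):
    """Infer structure with one assumed metric: the delimiter that most
    consistently splits the segments into more than one field."""
    best = None
    for d in delimiters:
        counts = Counter(len(s.split(d)) for s in segments)
        width, freq = counts.most_common(1)[0]
        if width > 1 and (best is None or freq > best[2]):
            best = (d, width, freq)
    if best is None:
        raise ValueError("no delimiter splits the segments into fields")
    return {"delimiter": best[0], "field_count": best[1]}

def map_fields(segments, structure):
    """Map each source column to the first unused target field whose
    pattern matches every value in that column."""
    rows = [s.split(structure["delimiter"]) for s in segments]
    rows = [r for r in rows if len(r) == structure["field_count"]]
    mapping = {}
    for col in range(structure["field_count"]):
        values = [r[col].strip() for r in rows]
        for target, pattern in TARGET_SCHEMA.items():
            unused = target not in mapping.values()
            if unused and all(pattern.fullmatch(v) for v in values):
                mapping[col] = target
                break
    return mapping

def populate(segments, structure, mapping):
    """Build one target record per well-formed segment."""
    records = []
    for s in segments:
        fields = s.split(structure["delimiter"])
        if len(fields) != structure["field_count"]:
            continue  # skip segments that do not fit the inferred structure
        records.append({t: fields[c].strip() for c, t in mapping.items()})
    return records

segments = ["2018-06-11|Ada Lovelace|12.50", "2018-06-12|Alan Turing|7.25"]
structure = characterize(segments)
mapping = map_fields(segments, structure)
records = populate(segments, structure, mapping)
```

In this sketch the source columns carry no headers, so the mapping is driven entirely by what the values look like. That is one possible reading of "based at least in part on the characterizing", not a statement of how the patented system works.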

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: at least one processor and a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by the at least one processor to cause the at least one processor to perform operations comprising: receiving input data, wherein the input data includes a plurality of segments, and wherein the segments include a plurality of source fields; parsing each of the segments into tokens, wherein each token is data that is contained in a particular source field of the plurality of source fields, and wherein each token includes at least one alphanumeric or numeric value; determining contextual information associated with each token, wherein the contextual information comprises one or more features associated with each token, and wherein determining the contextual information comprises determining one or more features of each token based on metrics and determining whether the numeric value of a given token conforms to an expected range of a first numeric target field versus a second numeric target field based on a type of the target field; mapping the tokens in the source fields of the segments to a plurality of target fields of a target schema based at least in part on the contextual information and confidence values, and wherein the confidence values that meet one or more confidence value thresholds indicate degrees of matching between each token and one or more target fields; and populating the target fields of the target schema with the tokens from the source fields based at least in part on the mapping, wherein the parsing of the segments into tokens, the determining of the contextual information associated the tokens, and the mapping of the tokens in the source fields to the target fields is performed substantially during the populating the target fields of the target schema with the tokens from the source fields.

2. The system of claim 1, wherein the input data is semi-structured data.

3. The system of claim 1, wherein the target schema is a structured schema.

4. The system of claim 1, wherein the structure of each token comprises a relationship between one or more structural features of each token and at least one other token in a same segment.

5. The system of claim 1, wherein, to map the tokens in the source fields of the segments to the target fields of a target schema, the at least one processor further performs operations comprising: comparing each target field to the contextual information associated with each token; and matching the token in each source field to one of the target fields of the target schema based at least in part on the comparing of each target field to the contextual information.

6. A computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by at least one processor to cause the at least one processor to perform operations comprising: receiving input data, wherein the input data includes a plurality of segments, and wherein the segments include a plurality of source fields; parsing each of the segments into tokens, wherein each token is data that is contained in a particular source field of the plurality of source fields, and wherein each token includes at least one alphanumeric or numeric value; determining contextual information associated with each token, wherein the contextual information comprises one or more features associated with each token, and wherein determining the contextual information comprises determining one or more features of each token based on metrics and determining whether the numeric value of a given token conforms to an expected range of a first numeric target field versus a second numeric target field based on a type of the target field; mapping the tokens in the source fields of the segments to a plurality of target fields of a target schema based at least in part on the contextual information and confidence values, and wherein the confidence values that meet one or more confidence value thresholds indicate degrees of matching between each token and one or more target fields; and populating the target fields of the target schema with the tokens from the source fields based at least in part on the mapping, wherein the parsing of the segments into tokens, the determining of the contextual information associated the tokens, and the mapping of the tokens in the source fields to the target fields is performed substantially during the populating the target fields of the target schema with the tokens from the source fields.

7. The computer program product of claim 6, wherein the input data is semi-structured data.

8. The computer program product of claim 6, wherein the target schema is a structured schema.

9. The computer program product of claim 6, wherein the structure of each token comprises a relationship between one or more structural features of each token and at least one other token in a same segment.

10. The computer program product of claim 6, wherein, to map the tokens in the source fields of the segments to the target fields of a target schema, the at least one processor further performs operations comprising: comparing each target field to the contextual information associated with each token; and matching the token in each source field to one of the target fields of the target schema based at least in part on the comparing of each target field to the contextual information.

11. The computer program product of claim 6, wherein, to map the tokens in the source fields of the segments to the target fields of a target schema, the at least one processor further performs operations comprising: determining confidence values, wherein the confidence values indicate degrees of matching between each token and one or more target fields; and matching tokens in the source fields to the target fields of the target schema.

12. A computer-implemented method for transforming data for a target schema, the method comprising: receiving input data, wherein the input data includes a plurality of segments, and wherein the segments include a plurality of source fields; parsing each of the segments into tokens, wherein each token is data that is contained in a particular source field of the plurality of source fields, and wherein each token includes at least one alphanumeric or numeric value; determining contextual information associated with each token, wherein the contextual information comprises one or more features associated with each token, and wherein determining the contextual information comprises determining one or more features of each token based on metrics and determining whether the numeric value of a given token conforms to an expected range of a first numeric target field versus a second numeric target field based on a type of the target field; mapping the tokens in the source fields of the segments to a plurality of target fields of a target schema based at least in part on the contextual information and confidence values, and wherein the confidence values that meet one or more confidence value thresholds indicate degrees of matching between each token and one or more target fields; and populating the target fields of the target schema with the tokens from the source fields based at least in part on the mapping, wherein the parsing of the segments into tokens, the determining of the contextual information associated the tokens, and the mapping of the tokens in the source fields to the target fields is performed substantially during the populating the target fields of the target schema with the
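The independent claims add two ideas that the abstract does not spell out: checking whether a numeric token fits the expected range of one numeric target field versus another, and accepting a mapping only when a confidence value meets a threshold. The following is a rough Python sketch of those two ideas, not the claimed implementation. The `month` and `year` fields, their ranges, the three checks inside `confidence`, and the 0.9 threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NumericField:
    name: str
    low: float        # assumed lower bound of the expected range
    high: float       # assumed upper bound of the expected range
    max_digits: int   # assumed maximum token length

# Hypothetical numeric target fields; the claims name no concrete fields.
TARGET_FIELDS = [
    NumericField("month", 1, 12, 2),
    NumericField("year", 1900, 2100, 4),
]

def features(token):
    """Contextual features of one token (assumed metrics: length,
    digits-only flag, numeric value if the token parses as a number)."""
    text = token.strip()
    try:
        value = float(text)
    except ValueError:
        value = None
    return {"length": len(text), "is_digits": text.isdigit(), "value": value}

def confidence(feat, field):
    """Score in [0, 1]: the share of checks the token passes for a field."""
    if feat["value"] is None:
        return 0.0
    checks = [
        field.low <= feat["value"] <= field.high,  # expected-range check
        feat["is_digits"],                         # looks like an integer
        feat["length"] <= field.max_digits,        # plausible width
    ]
    return sum(checks) / len(checks)

def map_segment(segment, threshold=0.9):
    """Parse, score, map, and populate in a single pass over the tokens."""
    populated = {}
    for token in segment.split():
        feat = features(token)
        best_score, best_name = max(
            (confidence(feat, f), f.name) for f in TARGET_FIELDS
        )
        if best_score >= threshold and best_name not in populated:
            populated[best_name] = token
    return populated
```

Calling `map_segment("6 2018 abc")` places `6` in `month` and `2018` in `year`, and drops `abc` because no field reaches the threshold. The single loop is only a loose nod to the claim language about parsing and mapping being performed "substantially during the populating"; the patent text shown here does not say how that overlap is achieved.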

Assignees

IBM

Inventors

Classifications

  • Mapping; Conversion · CPC title

  • G06F16/258 · Primary

    Data format conversion from or to a database · CPC title

  • G06F16/211 · Primary

    Schema design and management · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11249960B2 cover?
Embodiments generally relate to transforming data for a target schema. In some embodiments, a method includes receiving input data, where the input data includes a plurality of segments, and where the segments include a plurality of source fields containing target data. The method further includes characterizing the input data based at least in part on a plurality of predetermined metrics, where t…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/258. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 15, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).