Calibrated noise for text modification

US11538467B1 · US · B1

Patent metadata
Field	Value
Publication number	US-11538467-B1
Application number	US-202016900283-A
Country	US
Kind code	B1
Filing date	Jun 12, 2020
Priority date	Jun 12, 2020
Publication date	Dec 27, 2022
Grant date	Dec 27, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Devices and techniques are generally described for calibrating noise for natural language data modification. In various examples, first data representing a natural language input may be identified. A first vector representation of a first word of the first data may be determined. Sensitivity data may be determined for the first vector representation based at least in part on a first density of one or more vector representations adjacent to the first vector representation in an embedding space. In some examples, a first noise vector may be determined based at least in part on the sensitivity data. A first modified vector representation may be generated by adding the first noise vector to the first vector representation. A second word may be determined based at least in part on the first modified vector representation. Modified first data may be generated by replacing the first word with the second word.
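The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The toy vocabulary `VOCAB`, the 2-D embeddings `EMB`, the helper names, the use of Euclidean distance, and the choice to scale per-coordinate Laplace noise by `sensitivity / epsilon` are all assumptions made for illustration. The sketch also searches the whole vocabulary for the replacement word, whereas claim 1 restricts the search to the k closest words.

```python
import numpy as np

# Hypothetical toy vocabulary with 2-D embeddings (illustration only);
# real embedding spaces have many more dimensions and words.
VOCAB = ["cat", "dog", "pet", "volcano"]
EMB = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]])


def local_sensitivity(vec, embeddings, k):
    """Average Euclidean distance from `vec` to its k closest neighbours.

    Assumes `vec` is itself a row of `embeddings`, so the smallest
    distance (zero, the word itself) is skipped.
    """
    dists = np.sort(np.linalg.norm(embeddings - vec, axis=1))
    return float(dists[1:k + 1].mean())


def replace_word(word, vocab, embeddings, k, epsilon, rng):
    """Add calibrated Laplace noise to a word vector and return the nearest word."""
    vec = embeddings[vocab.index(word)]
    # Assumption: noise scale grows with local sensitivity and shrinks with epsilon.
    scale = local_sensitivity(vec, embeddings, k) / epsilon
    noisy = vec + rng.laplace(loc=0.0, scale=scale, size=vec.shape)
    nearest = int(np.argmin(np.linalg.norm(embeddings - noisy, axis=1)))
    return vocab[nearest]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(replace_word("cat", VOCAB, EMB, k=2, epsilon=1.0, rng=rng))
```

In this toy data, "volcano" sits in a sparse region, so its average neighbour distance and therefore its noise scale are larger than for "cat". That mirrors the relationship between density and noise magnitude described in claim 2.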

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: identifying first text data representing a natural language input, wherein the first text data is included in a training data set for a first natural language machine learning model; determining a first word of the first text data for input into a randomization function R; determining a first vector representation of the first word in an embedding space, the embedding space comprising representations of a plurality of words; determining a first average distance between the first vector representation of the first word and a first plurality of vector representations of k closest words to the first vector representation in the embedding space; determining a first local sensitivity of the randomization function R that limits a magnitude of modification of the first vector representation, wherein the first local sensitivity is determined using the first average distance; determining a first noise vector N by sampling from a Laplace distribution, wherein the first noise vector N is limited by the first local sensitivity; generating a second vector representation by adding the first noise vector N to the first vector representation of the first word; determining a third vector representation among the first plurality of vector representations that is closest to the second vector representation in the embedding space; determining a second word associated with the third vector representation; generating, by a first computing device, second text data by replacing the first word of the first text data with the second word; sending the second text data to a remote computing device; receiving, by the first computing device from the remote computing device, first data comprising the second text data and a class label; and updating at least one parameter of the first natural language machine learning model using the first data.

2. The method of claim 1, further comprising: determining a third word of the first text data for input into the randomization function R; determining a vector representation of the third word in the embedding space; determining a second average distance between the vector representation of the third word and a second plurality of vector representations of k closest words to the vector representation of the third word in the embedding space, wherein the second average distance is greater than the first average distance; determining a second local sensitivity of the randomization function R that limits a second magnitude of modification of the vector representation of the third word, wherein the second local sensitivity is determined using the second average distance, wherein the second local sensitivity is greater than the first local sensitivity; determining a second noise vector N by sampling from the Laplace distribution, wherein the second noise vector N is limited by the second local sensitivity, wherein a magnitude of the second noise vector N is greater than a magnitude of the first noise vector N; and using the second noise vector N to determine a fourth word to replace the third word of the first text data.

3. The method of claim 1, further comprising: receiving a selection of a number, the number representing a threshold number of queries; receiving, from a second computing device, a first set of queries, wherein the second text data corresponds to a first query of the first set of queries; receiving, from the second computing device, a second set of queries; determining that a total number of queries in the first set of queries and the second set of queries is greater than or equal to the threshold number of queries; generating a third set of queries by at least one of randomly sampling queries or pseudo-randomly sampling queries from at least the first set of queries and the second set of queries; and sending the third set of queries to the remote computing device.

4. A method comprising: determining, by at least one first computing device, a first vector representation of a first word of first data, the first data representing a natural language input for natural language processing; determining second data for the first vector representation based at least in part on a first average distance from the first vector representation to a first plurality of vector representations of k words in an embedding space, wherein the second data controls an amount of noise used to modify the first vector representation; determining a first noise vector based at least in part on the second data; generating a first modified vector representation using the first noise vector and the first vector representation; determining a second word based at least in part on the first modified vector representation; generating modified first data by replacing the first word with the second word in the first data; and updating, using the modified first data, at least one parameter value of a first natural language processing machine learning model.

5. The method of claim 4, further comprising: receiving a first set of natural language input data from a first source device, the first set of natural language input data comprising the modified first data; receiving a second set of natural language input data from a second source device; and generating a third set of natural language input data comprising at least some data from the first source device and at least some data from the second source device.

6. The method of claim 4, further comprising: determining a second distance from the first vector representation to a third vector representation of a third word in the embedding space, wherein the second data is based at least in part on the second distance; and determining the second word based at least in part on the second data.

7. The method of claim 4, further comprising: determining a routing destination for the first data; determining that the first data is modified prior to sending the first data to the routing destination; and generating the modified first data by replacing the first word with the second word in the first data based at least in part on the determination that the first data is modified prior to sending the first data to the routing destination.

8. The method of claim 4, further comprising determining the first noise vector by sampling from an n-dimensional Laplace distribution, wherein a magnitude of the first noise vector is controlled by a magnitude parameter ε.

9. The method of claim 4, further comprising: receiving, by the at least one first computing device, a first utterance representing the natural language input; generating, by the at least one first computing device, the first data representing the natural language input using automatic speech recognition; generating, by the at least one first computing device, the modified first data; and sending, by the at least one first computing device, the modified first data to a remote computing device.

10. The method of claim 4, further comprising: determining a third vector representation of a third word of the first data; determining third data for a second vector representation representing a different word based at least in part on a second distance from the second vector representation to the third vector representation in the embedding space, wherein the second distance is greater than the first average distance; determining a second noise vector based at least in part on the third data, wherein a magnitude of the second noise vector is greater than a magnitude of the first noise vector based at least in part on the second distance being greater than the first average distance; generating a second modified vector representation u…
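The query-batching step recited in claim 3 can be sketched roughly as follows. This is an illustrative reading, not the patented implementation. The `QueryMixer` class and its method names are hypothetical, and the sketch assumes that "sampling" means drawing every buffered query in pseudo-random order once the threshold is reached.

```python
import random


class QueryMixer:
    """Hypothetical helper: buffers query sets until a threshold count is met."""

    def __init__(self, threshold, rng=None):
        self.threshold = threshold          # selected threshold number of queries
        self.buffer = []                    # queries received so far
        self.rng = rng or random.Random()   # pseudo-random source

    def add(self, queries):
        """Buffer a set of queries.

        Returns None while the total is below the threshold. Once the total
        is greater than or equal to the threshold, returns all buffered
        queries in pseudo-random order and empties the buffer.
        """
        self.buffer.extend(queries)
        if len(self.buffer) < self.threshold:
            return None
        # Sampling without replacement over the whole buffer (an assumption).
        batch = self.rng.sample(self.buffer, len(self.buffer))
        self.buffer.clear()
        return batch


if __name__ == "__main__":
    mixer = QueryMixer(threshold=3, rng=random.Random(0))
    print(mixer.add(["play jazz", "weather today"]))  # below threshold
    print(mixer.add(["set a timer", "call home"]))    # threshold reached
```

The returned batch stands in for the "third set of queries" that would then be sent to the remote computing device. Mixing queries from several sets before sending makes it harder to tie a query to its position in the original stream.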

Assignees

Amazon Tech Inc

Inventors

Feyisetan Oluwaseyi Oluwafemi

Classifications

G10L15/1815 · Primary
Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title
G06F40/30 · Primary
Semantic analysis · CPC title
G10L15/1822
Parsing for meaning understanding · CPC title

Patent family

Related publications grouped by family.

View patent family 84689656

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11538467B1 cover? Devices and techniques are generally described for calibrating noise for natural language data modification. In various examples, first data representing a natural language input may be identified. A first vector representation of a first word of the first data may be determined. Sensitivity data may be determined for the first vector representation based at least in part on a first density of …
Who is the assignee on this patent? Amazon Tech Inc
What technology area does this patent fall under? Primary CPC classification G10L15/1815. Mapped technology areas include Physics.
When was this patent published? Publication date Dec 27, 2022 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb? We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).