Training language models and preserving privacy

US12412038B2 · US · B2

Patent metadata
Publication number: US-12412038-B2
Application number: US-202318173199-A
Country: US
Kind code: B2
Filing date: Feb 23, 2023
Priority date: Oct 5, 2022
Publication date: Sep 9, 2025
Grant date: Sep 9, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In implementations of systems for training language models and preserving privacy, a computing device implements a privacy system to predict a next word after a last word in a sequence of words by processing input data using a machine learning model trained on training data to predict next words after last words in sequences of words. The training data describes a corpus of text associated with clients and including sensitive samples and non-sensitive samples. The machine learning model is trained by sampling a client of the clients and using a subset of the sensitive samples associated with the client and a subset of the non-sensitive samples associated with the client to update parameters of the machine learning model. The privacy system generates an indication of the next word after the last word in the sequence of words for display in a user interface.
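The sampling pattern in the abstract (pick a client, take a subset of that client's sensitive samples and a subset of its non-sensitive samples, then update the model) can be sketched as follows. This is only an illustration under stated assumptions: the corpus layout, the function names, and the subset sizes are hypothetical, and a toy bigram counter stands in for the neural language model so the example stays self-contained. It does not include any privacy mechanism.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: client id -> {"sensitive": [...], "non_sensitive": [...]}.
# How samples get their labels is outside the scope of this sketch.
Corpus = Dict[str, Dict[str, List[str]]]
Counts = Dict[str, Dict[str, int]]


def sample_client_batch(corpus: Corpus, rng: random.Random,
                        n_sensitive: int, n_non_sensitive: int) -> Tuple[str, List[str]]:
    """Sample one client, then a subset of that client's samples of each kind."""
    client = rng.choice(sorted(corpus))
    sensitive = corpus[client]["sensitive"]
    non_sensitive = corpus[client]["non_sensitive"]
    batch = (rng.sample(sensitive, min(n_sensitive, len(sensitive)))
             + rng.sample(non_sensitive, min(n_non_sensitive, len(non_sensitive))))
    return client, batch


def update_bigram_counts(counts: Counts, batch: List[str]) -> None:
    """Stand-in for a parameter update: accumulate bigram counts from the batch."""
    for sentence in batch:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            followers = counts.setdefault(prev, {})
            followers[nxt] = followers.get(nxt, 0) + 1


def predict_next_word(counts: Counts, sequence: str) -> Optional[str]:
    """Return the most frequent follower of the last word, or None if unseen."""
    last = sequence.lower().split()[-1]
    followers = counts.get(last)
    if not followers:
        return None
    # Sorting first makes ties resolve alphabetically.
    return max(sorted(followers), key=followers.get)


if __name__ == "__main__":
    toy_corpus: Corpus = {
        "client_a": {"sensitive": ["Alice's invoice is ready"],
                     "non_sensitive": ["the report is ready"]},
        "client_b": {"sensitive": ["Bob's order is ready"],
                     "non_sensitive": ["the build is ready"]},
    }
    rng = random.Random(0)
    counts: Counts = {}
    for _ in range(4):
        _, batch = sample_client_batch(toy_corpus, rng, 1, 1)
        update_bigram_counts(counts, batch)
    print(predict_next_word(counts, "the draft is"))
```

In a real system the counting step would be replaced by a gradient update of the language model; the client-level sampling is the part this sketch is meant to show.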

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, by a processing device, input data describing a sequence of words ending with a last word; predicting, by the processing device, a next word after the last word in the sequence of words by processing the input data using a machine learning model trained on injected Gaussian noise and training data to update parameters of the machine learning model to predict next words after last words in sequences of words, the training data describing a corpus of text associated with clients and including sensitive samples and non-sensitive samples taken from databases that are client-content adjacent as differing in that a client and a sensitive entity are present in one of the client-content adjacent databases and are not present in another one of the client-content adjacent databases; and generating, by the processing device, an indication of the next word after the last word in the sequence of words for display in a user interface.

2. The method as described in claim 1, wherein the machine learning model includes at least one of a Long Short Term Memory model, a Bidirectional Encoder Representations from Transformers model, or a Generative Pretrained Transformer 2 model.

3. The method as described in claim 1, wherein the sensitive samples and the non-sensitive samples are identified by processing the corpus of text using a named entity recognition model.

4. The method as described in claim 3, wherein the non-sensitive samples include a sensitive sample from the corpus of text based on an error rate associated with the named entity recognition model.

5. The method as described in claim 3, wherein the sensitive samples include a non-sensitive sample from the corpus of text based on an identification error rate associated with the named entity recognition model.

6. The method as described in claim 1, wherein the sensitive samples and the non-sensitive samples are sentences included in the corpus of text.

7. The method as described in claim 1, wherein the sensitive samples and the non-sensitive samples are paragraphs included in the corpus of text.

8. A method comprising: forming, by a processing device, client-content adjacent databases that include a client database and a sensitive contents database, the client-content adjacent databases differing in that a client and a sensitive entity are present a corresponding database of the client-content adjacent databases and are not present in another database of the client-content adjacent databases, the forming including: removing samples associated with a client of a plurality of clients from the respective database of the client-content adjacent databases; and removing sensitive samples associated with a particular instance of sensitive content of a plurality of sensitive content regardless of client association from the respective database of the client-content adjacent databases; identifying, by the processing device, a set of clients from the plurality of clients from the client-content adjacent databases; identifying, by the processing device, a set of sensitive samples from the plurality of sensitive content from the client-content adjacent databases; generating training data by applying one or more differential privacy techniques to the samples associated with the set of clients or the set of sensitive samples; and training a machine learning model using the training data by a loss function using an aggregated gradient that is aggregated across the plurality of clients and the plurality of sensitive content, the training including injecting Gaussian noise and updating parameters of the machine learning model.

9. The method as described in claim 8, wherein the sensitive samples and the samples are determined by processing a corpus of text using an additional machine learning model.

10. The method as described in claim 9, wherein the samples include a respective said sensitive sample based on an error rate associated with the additional machine learning model.

11. The method as described in claim 9, wherein the additional machine learning model includes a named entity recognition model.

12. The method as described in claim 8, wherein the machine learning model includes at least one of a Long Short Term Memory model, a Bidirectional Encoder Representations from Transformers model, or a Generative Pretrained Transformer 2 model.

13. The method as described in claim 8, wherein the sensitive samples and the samples are sentences or paragraph included in a corpus of text.

14. A computing device comprising: a processing device; and a computer-readable storage medium storing instructions that, responsive to execution by the processing device, causes the processing device to perform operations including: forming client-content adjacent databases that include a client database and a sensitive contents database, the client-content adjacent databases differing in that a client and a sensitive entity are present a corresponding database of the client-content adjacent databases and are not present in another database of the client-content adjacent databases, the forming including: removing samples associated with a client of a plurality of clients from the respective database of the client-content adjacent databases; and removing sensitive samples associated with a particular instance of sensitive content of a plurality of sensitive content regardless of client association from the respective database of the client-content adjacent databases; identifying a set of clients from the plurality of clients from the client-content adjacent databases; identifying a set of sensitive samples from the plurality of sensitive content from the client-content adjacent databases; generating training data by applying one or more differential privacy techniques to the samples associated with the set of clients or the set of sensitive samples; and training a machine learning model using the training data by a loss function using an aggregated gradient that is aggregated across the plurality of clients and the plurality of sensitive content, the training including injecting Gaussian noise and updating parameters of the machine learning model.

15. The computing device as described in claim 14, wherein the sensitive samples and the samples are determined by processing a corpus of text using an additional machine learning model.

16. The computing device as described in claim 15, wherein the samples include a respective said sensitive sample based on an error rate associated with the additional machine learning model.

17. The computing device as described in claim 15, wherein the additional machine learning model includes a named entity recognition model.

18. The computing device as described in claim 14, wherein the machine learning model includes at least one of a Long Short Term Memory model, a Bidirectional Encoder Representations from Transformers model, or a Generative Pretrained Transformer 2 model.

19. The computing device as described in claim 14, wherein the sensitive samples and the samples are sentences or paragraph included in a corpus of text.
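Two ideas from claims 8 and 14 can be illustrated with a small sketch: forming adjacent databases by removing one client's samples or one sensitive entity's samples, and aggregating gradients with injected Gaussian noise. Everything named here is hypothetical (the `Sample` record, the function names, the `max_norm` and `noise_multiplier` parameters). The per-group clipping step is an assumption borrowed from common differentially private training practice; the claims mention Gaussian noise and an aggregated gradient but do not spell out clipping.

```python
import math
import random
from typing import List, NamedTuple, Optional, Sequence


class Sample(NamedTuple):
    client: str
    text: str
    entity: Optional[str]  # sensitive entity mentioned, or None (hypothetical labelling)


def remove_client(db: Sequence[Sample], client: str) -> List[Sample]:
    """Adjacent database: drop every sample associated with one client."""
    return [s for s in db if s.client != client]


def remove_entity(db: Sequence[Sample], entity: str) -> List[Sample]:
    """Adjacent database: drop every sample mentioning one sensitive entity,
    regardless of which client it belongs to."""
    return [s for s in db if s.entity != entity]


def clip(grad: Sequence[float], max_norm: float) -> List[float]:
    """Scale a gradient down so its L2 norm is at most max_norm (assumed step)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_aggregate(grads: Sequence[Sequence[float]], max_norm: float,
                    noise_multiplier: float, rng: random.Random) -> List[float]:
    """Clip each group's gradient, sum them, add Gaussian noise, then average.

    Each entry of grads is the gradient for one group (a client or a
    sensitive entity in this sketch).
    """
    dim = len(grads[0])
    total = [0.0] * dim
    for grad in grads:
        for i, value in enumerate(clip(grad, max_norm)):
            total[i] += value
    sigma = noise_multiplier * max_norm
    return [(t + rng.gauss(0.0, sigma)) / len(grads) for t in total]


if __name__ == "__main__":
    db = [
        Sample("client_a", "Alice paid the invoice", "Alice"),
        Sample("client_a", "the report is done", None),
        Sample("client_b", "Alice called Bob", "Alice"),
    ]
    print(len(remove_client(db, "client_a")), len(remove_entity(db, "Alice")))
    print(noisy_aggregate([[3.0, 4.0], [0.1, 0.2]], 1.0, 0.5, random.Random(0)))
```

The noise scale is tied to the clipping bound because the bound limits how much any single group can change the sum; choosing the multiplier for a target privacy budget is a separate accounting step that this sketch does not cover.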

Assignees

Adobe Inc

Inventors

Classifications

  • Converting codes to words; Guess-ahead of partial word inputs · CPC title

  • G06F40/295 · Primary

    Named entity recognition · CPC title

  • G06F40/30 · Primary

    Semantic analysis · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12412038B2 cover?
In implementations of systems for training language models and preserving privacy, a computing device implements a privacy system to predict a next word after a last word in a sequence of words by processing input data using a machine learning model trained on training data to predict next words after last words in sequences of words. The training data describes a corpus of text associated with…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/295. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 9, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).