What technology area does this patent fall under?

Primary CPC classification G10L13/033. Mapped technology areas include Physics.

When was this patent published?

Publication date May 27, 2025 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Text-to-speech and speech recognition for noisy environments

US12315490B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12315490-B2
Application number	US-202117565826-A
Country	US
Kind code	B2
Filing date	Dec 30, 2021
Priority date	Dec 31, 2020
Publication date	May 27, 2025
Grant date	May 27, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates generally to speech processing. Humans change their speech patterns in noisy environments. The systems and devices described herein can compensate for noisy environments to be more human-like. Thus, the configurations and implementations herein can determine a sound profile for the sound environment where the user is listening. Based on the sound profile, the devices can determine a transform to apply to output speech from the device. This transform is applied to the wake word, speech recognition, and to the output speech to compensate for the noise level of the environment by mimicking the Lombard effect.
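The profile-then-transform flow described in the abstract can be sketched as a nearest-match lookup. This is a minimal illustration, not the patented implementation: the names `SpeechTransform`, `SoundProfile`, `PROFILES`, `select_profile`, and `transformed_word_timing`, along with every numeric value, are assumptions introduced here and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechTransform:
    # Hypothetical Lombard-style adjustments; the values below are illustrative only.
    rate_scale: float             # below 1.0 slows the speech down
    pitch_shift_semitones: float  # positive values raise the pitch
    pause_scale: float            # above 1.0 lengthens pauses between words


@dataclass(frozen=True)
class SoundProfile:
    name: str
    noise_level_db: float
    transform: SpeechTransform


# Assumed profile table; a real system would derive these from measurements.
PROFILES = [
    SoundProfile("quiet room", 40.0, SpeechTransform(1.0, 0.0, 1.0)),
    SoundProfile("busy cafe", 65.0, SpeechTransform(0.9, 1.0, 1.2)),
    SoundProfile("car on highway", 75.0, SpeechTransform(0.8, 2.0, 1.5)),
]


def select_profile(measured_db, profiles=PROFILES):
    """Return the stored profile whose noise level is closest to the measurement."""
    return min(profiles, key=lambda p: abs(p.noise_level_db - measured_db))


def transformed_word_timing(word_ms, pause_ms, transform):
    """Scale one word's duration and the pause that follows it."""
    return word_ms / transform.rate_scale, pause_ms * transform.pause_scale
```

With this assumed table, `select_profile(68.0)` picks the "busy cafe" profile because 65 dB is the closest stored level. Matching on a single noise level keeps the sketch short; the abstract speaks more broadly of a sound profile, which could involve richer audio characteristics.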

First claim

Opening claim text (preview).

What is claimed is:
1. A method comprising: providing, by a media delivery system, a first audio representation of a first simulated sound environment; receiving, by the media delivery system, first speech from a user speaking subject to audio playout of in the first simulated sound environment; providing, by the media delivery system, a second audio representation of a second simulated sound environment, wherein the second simulated sound environment has different acoustic characteristics than the first simulated sound environment; receiving, by the media delivery system, second speech from the user speaking subject to audio playout of the second simulated sound environment; determining a change in a speech component between the first speech and the second speech; and based on the change in the speech component, creating a transform to adjust the speech component.
2. The method of claim 1, wherein the change in the speech component is associated with one or more of a change to a phoneme, in a speed of the phoneme, in a duration of the phoneme, to a separation between phonemes, to a separation between morphemes, in a pitch of the phoneme, in a frequency range for the phoneme, or in a pause between words.
3. The method of claim 2, wherein the change in the speech component mimics the Lombard Effect.
4. The method of claim 2, wherein a first phoneme is pronounced in a first frequency range and a second phoneme is pronounced in a second frequency range, and wherein the change to the speech component involves a first change to the first phoneme that is different than a second change to the second phoneme based on a difference between the first frequency range and the second frequency range.
5. The method of claim 1, further comprising: assigning a desired voice; receiving a request from the user; determining a current sound environment for the user; determining text to output to the user in response to the request; synthesizing the text, by Text-To-Speech (TTS), to create speech output; applying the transform to the speech output; and playing the transformed speech output.
6. The method of claim 5, wherein the request from the user is a wake word, the method further comprising: retrieving the transform; and adjusting a reception of the wake word based on the transform.
7. The method of claim 5, further comprising: receiving third speech from the user in the request; retrieving the transform; and adjusting a speech recognition of the third speech based on the transform.
8. The method of claim 1, wherein the transform is associated with the second simulated sound environment.
9. The method of claim 1, wherein the first simulated sound environment is quieter than the first simulated sound environment.
10. The method of claim 9, wherein the first simulated sound environment has a sound level of 45 dB or less, and wherein the second simulated sound environment has a sound level of over 50 dB.
11. The method of claim 1, wherein the transform is applied to sounds in a partial band of frequencies.
12. A system comprising: a memory; and a processing unit coupled to the memory, wherein the processing unit is operative to: provide, by a media delivery system, a first audio representation of a first simulated sound environment; receive, by the media delivery system, first speech from a user speaking subject to audio playout of the first simulated sound environment; provide, by the media delivery system, a second audio representation of a second simulated sound environment, wherein the second simulated sound environment has different acoustic characteristics than the first simulated sound environment; receive, by the media delivery system, second speech from the user speaking subject to audio playout of the second simulated sound environment; determine a change in a speech component between the first speech and the second speech; and based on the change in the speech component, create a transform to adjust the speech component.
13. The system of claim 12, wherein the change in the speech component is associated with one or more of a change to a phoneme, in a speed of the phoneme, in a duration of the phoneme, to a separation between phonemes, to a separation between morphemes, in a pitch of the phoneme, in a frequency range for the phoneme, and wherein the change in the speech component mimics the Lombard Effect.
14. The system of claim 12, the processing unit further operative to: assign a desired voice; receive a request from the user; determine a current sound environment for the user; determine text to output to the user in response to the request; synthesize the text, by Text-To-Speech (TTS), to create speech output; apply the transform to the speech output; and play the transformed speech output.
15. The system of claim 14, wherein the request from the user is a wake word, the processing unit further operative to: retrieve the transform; and adjust a reception of the wake word based on the transform.
16. The system of claim 14, the processing unit further operative to: receive third speech from the user in the request; retrieve the transform; and adjust a speech recognition of the third speech based on the transform.
17. A method comprising: determining, by a media-playback device, a sound environment from received background noise; selecting a sound profile with similar audio characteristics as the received background noise, wherein the sound profile is associated with a transform for speech; determining speech output; applying the transform to the speech output to create transformed speech; and playing, by the media-playback device, the transformed speech.
18. The method of claim 17, wherein a characteristic associated with the transform comprises one or more of a change to a phoneme, in a speed of the phoneme, in a duration of the phoneme, to a separation between phonemes, to a separation between morphemes, in a pause between words, in a pitch of the phoneme, in a frequency range for the phoneme, and wherein the transform mimics the Lombard effect.
19. The method of claim 17, wherein a user can understand the transformed speech in the sound environment without changing a volume of the media-playback device.
20. The method of claim 17, wherein determining the sound environment comprises determining a dB (A) or a dB (C).
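
As a rough illustration of the comparison step in claim 1, the sketch below derives a per-phoneme transform from speech recorded in two simulated environments. It is a sketch under stated assumptions, not the claimed method itself: `PhonemeSample`, `summarize`, `create_transform`, and `apply_transform` are hypothetical names, the input is assumed to be already segmented into phonemes with measured duration and pitch, and only duration and pitch are compared out of the speech components listed in claim 2.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PhonemeSample:
    # Hypothetical record for one spoken phoneme (assumed pre-segmented input).
    phoneme: str
    duration_ms: float
    pitch_hz: float


def summarize(samples):
    """Map each phoneme to its mean (duration_ms, pitch_hz)."""
    grouped = {}
    for s in samples:
        grouped.setdefault(s.phoneme, []).append(s)
    return {
        ph: (mean(x.duration_ms for x in group), mean(x.pitch_hz for x in group))
        for ph, group in grouped.items()
    }


def create_transform(first_speech, second_speech):
    """Express the change between the two recordings as per-phoneme ratios."""
    first, second = summarize(first_speech), summarize(second_speech)
    transform = {}
    for ph in first.keys() & second.keys():
        f_dur, f_pitch = first[ph]
        s_dur, s_pitch = second[ph]
        transform[ph] = {
            "duration_scale": s_dur / f_dur if f_dur else 1.0,
            # Unvoiced phonemes may carry no pitch; leave them unchanged.
            "pitch_scale": s_pitch / f_pitch if f_pitch else 1.0,
        }
    return transform


def apply_transform(speech, transform):
    """Scale each phoneme that has an entry; pass the others through unchanged."""
    out = []
    for s in speech:
        t = transform.get(s.phoneme)
        if t is None:
            out.append(s)
        else:
            out.append(PhonemeSample(s.phoneme,
                                     s.duration_ms * t["duration_scale"],
                                     s.pitch_hz * t["pitch_scale"]))
    return out
```

Keeping a separate ratio per phoneme mirrors the idea in claim 4 that different phonemes may change by different amounts; a production system would presumably work on audio features rather than these simplified records.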

Assignees

Spotify AB

Inventors

Classifications

G10L15/08
Speech classification or search · CPC title
G10L25/84
for discriminating voice from noise · CPC title
G10L15/22
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
G10L2015/088
Word spotting · CPC title
G10L13/08
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

Patent family

Related publications grouped by family.

View patent family 82117499

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12315490B2 cover?: The present disclosure relates generally to speech processing. Humans change their speech patterns in noisy environments. The systems and devices described herein can compensate for noisy environments to be more human-like. Thus, the configurations and implementations herein can determine a sound profile for the sound environment where the user is listening. Based on the sound profile, the devi…
Who is the assignee on this patent?: Spotify AB