Context-based device arbitration
US-10546583-B2 · Jan 28, 2020 · US
US12190877B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-12190877-B1 |
| Application number | US-202217685232-A |
| Country | US |
| Kind code | B1 |
| Filing date | Mar 2, 2022 |
| Priority date | Dec 9, 2021 |
| Publication date | Jan 7, 2025 |
| Grant date | Jan 7, 2025 |
Official abstract text for this publication.
Devices and techniques are generally described for nearest device arbitration. In various examples, a first device may receive first audio data representing a wakeword spoken by a first speaker at a first time. In some examples, a second device may receive second audio data representing the wakeword spoken by the first speaker at the first time. In some cases, the first device may generate first feature data representing the first audio data and the second device may generate second feature data representing the second audio data. In various examples, a machine learning model may use the first feature data and the second feature data to generate first prediction data representing a prediction that the first device is closer to the first speaker than the second device.
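The arbitration step in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `encode` is a toy stand-in for a per-device encoder (framewise log energy), `predict_first_closer` is a hypothetical logistic scorer, and the weights are hand-set rather than trained. The patent does not specify any of these details.

```python
import numpy as np

def encode(audio: np.ndarray, n_frames: int = 8) -> np.ndarray:
    """Toy per-device encoder (assumption): log mean energy of n_frames chunks."""
    frames = np.array_split(audio.astype(np.float64), n_frames)
    energies = np.array([np.mean(frame ** 2) for frame in frames])
    return np.log(energies + 1e-12)

def predict_first_closer(feat_a: np.ndarray, feat_b: np.ndarray,
                         weights: np.ndarray, bias: float = 0.0) -> float:
    """Hypothetical classifier: logistic score that device A is closer than B."""
    logit = float(weights @ (feat_a - feat_b)) + bias
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
wakeword = rng.standard_normal(1600)   # stand-in for a wakeword recording
audio_a = 1.00 * wakeword              # device A receives the louder copy
audio_b = 0.25 * wakeword              # device B receives an attenuated copy

weights = np.full(8, 1.0 / 8)          # hand-set, untrained (assumption)
p_ab = predict_first_closer(encode(audio_a), encode(audio_b), weights)
p_ba = predict_first_closer(encode(audio_b), encode(audio_a), weights)
print(f"P(A closer than B) = {p_ab:.3f}, P(B closer than A) = {p_ba:.3f}")
```

Scoring the difference of the two feature vectors keeps the toy model antisymmetric, so swapping the devices gives the complementary probability. A trained model of the kind the abstract describes would learn its features and decision rule from data instead of relying on loudness alone.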
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: determining room size data by sampling a length l from a first distribution and a width w from a second distribution, wherein the room size data represents a room having dimensions including l×w; determining a first location of a speaker in the room, the first location defined by a first two-dimensional coordinate, wherein the first location is sampled from a third distribution; determining a second location of a noise source in the room, the second location defined by a second two-dimensional coordinate, wherein the second location is sampled from a fourth distribution; determining a third location of a first speech processing-enabled device in the room, the third location defined by a third two-dimensional coordinate, wherein the third location is sampled from a fifth distribution; determining a first audio sample representing a spoken wake word for the first speech processing-enabled device, wherein the first audio sample is sampled from a dataset of audio samples; determining a second audio sample representing background noise for the noise source, wherein the second audio sample is sampled from a dataset of noise samples; generating first audio data by convolving the first audio sample with a first impulse response associated with the first speech processing-enabled device, the first impulse response being associated with audio received at the third location from the first location; generating second audio data by convolving the second audio sample with a second impulse response associated with the first speech processing-enabled device, the second impulse response being associated with audio received at the third location from the second location; generating third audio data by adding the first audio data and the second audio data; generating, by a deep learning model, first feature representation data representing the third audio data; and generating, by a classifier of the deep learning model using the first feature representation data, first prediction data indicating that the first speech processing-enabled device is closest to the first location among other speech processing-enabled devices.
2. The method of claim 1, further comprising sending, to a second speech processing-enabled device of the other speech processing-enabled devices a first signal, the first signal effective to cause the second speech processing-enabled device to cease recording of audio data.
3. The method of claim 1, further comprising: determining a first distance between the first location and the third location; determining a second distance between the first location and a fourth location, wherein the fourth location is associated with a second speech processing-enabled device of the other speech processing-enabled devices; and determining that the first distance is less than the second distance, wherein the first prediction data is generated based on the first distance being less than the second distance.
4. A method comprising: receiving, by a first device, first audio data representing sound at a first time, the first audio data being detected by one or more microphones of the first device; receiving, by a second device, second audio data representing the sound at the first time, the second audio data being detected by one or more microphones of the second device; generating, by a first encoder associated with the first device, first feature data representing the first audio data based at least in part on a first impulse response associated with the first device; generating, by a second encoder associated with the second device, second feature data representing the second audio data based at least in part on a second impulse response associated with the second device; and generating, by a machine learning model using the first feature data and the second feature data, first prediction data representing a prediction that the first device is closer to a source of the sound than the second device.
5. The method of claim 4, further comprising sending first data to the second device, the first data effective to cause the second device to cease at least one of additional processing or sending audio data.
6. The method of claim 4, further comprising: determining a first simulated room having a first dimension sampled from a first distribution and a second dimension sampled from a second distribution; determining a first location of a third device in the first simulated room by sampling a first two-dimensional location from a third distribution; determining a second location of a simulated speaker by sampling a second two-dimensional location from a fourth distribution; and determining a third location of a first noise source by sampling a third two-dimensional location from a fifth distribution.
7. The method of claim 6, further comprising: determining a third impulse response for the third device for audio emitted by the simulated speaker at the second location; and determining a fourth impulse response for the third device for audio emitted by the first noise source at the third location.
8. The method of claim 7, further comprising: determining third audio data by convolving a first audio sample of speech with the third impulse response; and determining fourth audio data by convolving a second audio sample with the fourth impulse response.
9. The method of claim 8, further comprising: determining fifth audio data for the third device by mixing the first audio data and the second audio data; and generating a label for the fifth audio data indicating a distance between the first location and the second location.
10. The method of claim 9, further comprising: inputting the fifth audio data into the machine learning model, wherein the machine learning model is effective to generate third feature data representing the fifth audio data; and generating, by a classifier of the machine learning model using the third feature data, first data representing a prediction of the third device being a closest device to a first speaker.
11. The method of claim 4, further comprising: executing the machine learning model by the first device, the second device, or a third device, wherein the first device, the second device, and the third device are configured in communication on a local area network.
12. The method of claim 4, further comprising: generating two-dimensional log-filterbank energy (LFBE) data from the first audio data, wherein the first feature data comprises the two-dimensional LFBE data and the machine learning model comprises a convolutional neural network and a classifier.
13. A system comprising: a first device; a second device; at least one processor; and non-transitory computer-readable memory storing instructions that, when executed by the at least one processor, are effective to: receive, by the first device, first audio data representing sound generated at a first time, the first audio data being detected by one or more microphones of the first device; receive, by the second device, second audio data representing the sound generated at the first time, the second audio data being detected by one or more microphones of the second device; generate, by a first encoder associated with the first device, first feature data representing the first audio data based at least in part on a first impulse response associated with the first device; generate, by a second encoder associated with the second device, second feature data representing the second audio data based at least in part on a second impulse response associated with the second device; and
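The simulated-room steps recited in claims 1, 3, and 6 through 9 (sample a room, sample source and device locations, convolve with impulse responses, mix, label) might be sketched as follows. This is a rough illustration under stated assumptions, not the patented implementation: all distributions are taken to be uniform, the sample rate is assumed to be 16 kHz, the datasets are random arrays, and `direct_path_ir` is a hypothetical single-tap impulse response that ignores reverberation.

```python
import numpy as np

FS = 16_000               # sample rate in Hz (assumption)
SPEED_OF_SOUND = 343.0    # metres per second

def sample_point(rng, length, width):
    """Sample a 2-D location uniformly inside an l x w room (assumed distribution)."""
    return np.array([rng.uniform(0.0, length), rng.uniform(0.0, width)])

def direct_path_ir(src, mic, ir_len=1024):
    """Toy impulse response: one tap delayed by travel time, scaled by 1/distance."""
    dist = max(float(np.linalg.norm(src - mic)), 0.1)
    delay = min(int(round(dist / SPEED_OF_SOUND * FS)), ir_len - 1)
    ir = np.zeros(ir_len)
    ir[delay] = 1.0 / dist
    return ir

def simulate_example(rng, speech_dataset, noise_dataset, n_devices=2):
    """Build one labelled training example for a randomly sampled room."""
    length = rng.uniform(3.0, 8.0)    # room length l in metres (assumed range)
    width = rng.uniform(3.0, 8.0)     # room width w in metres (assumed range)
    speaker = sample_point(rng, length, width)
    noise_src = sample_point(rng, length, width)
    devices = [sample_point(rng, length, width) for _ in range(n_devices)]

    # Draw one wake-word sample and one noise sample, trimmed to equal length.
    speech = speech_dataset[rng.integers(len(speech_dataset))]
    noise = noise_dataset[rng.integers(len(noise_dataset))]
    n = min(len(speech), len(noise))
    speech, noise = speech[:n], noise[:n]

    mixtures = []
    for dev in devices:
        speech_at_dev = np.convolve(speech, direct_path_ir(speaker, dev))
        noise_at_dev = np.convolve(noise, direct_path_ir(noise_src, dev))
        mixtures.append(speech_at_dev + noise_at_dev)   # additive mix

    distances = [float(np.linalg.norm(speaker - dev)) for dev in devices]
    label = int(np.argmin(distances))   # index of the device nearest the speaker
    return np.stack(mixtures), distances, label

rng = np.random.default_rng(0)
speech_dataset = [rng.standard_normal(4000) for _ in range(3)]       # placeholder data
noise_dataset = [0.1 * rng.standard_normal(4000) for _ in range(3)]  # placeholder data
mix, dists, label = simulate_example(rng, speech_dataset, noise_dataset)
print(mix.shape, [round(d, 2) for d in dists], label)
```

A realistic pipeline would replace the single-tap response with measured or simulated room impulse responses that include reflections, and would feed the mixtures to the feature extractor and classifier described in the claims.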
Direction finding using a sum-delay beam-former · CPC title
Word spotting · CPC title
Execution procedure of a spoken command · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
microphones · CPC title