What technology area does this patent fall under?

The primary CPC classification is G10L15/26 (speech to text systems), which falls under CPC section G, Physics.

When was this patent published?

This patent was published on Feb 5, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Approach for processing audio data at network sites

US10198160B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10198160-B2
Application number	US-201615171705-A
Country	US
Kind code	B2
Filing date	Jun 2, 2016
Priority date	Jun 2, 2016
Publication date	Feb 5, 2019
Grant date	Feb 5, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Several approaches are provided for processing audio data to generate transcription data that is supplemented with visual content items. The visual content items may be any type of data that may vary depending upon a particular implementation. Examples of visual content items include, without limitation, images, videos, symbols, etc. Embodiments include adding visual content items to transcription data based upon user input, specialized keywords contained in the transcription data and various correspondences with the audio data, including time-based correspondence and correspondences based upon a common user, storage location or logical entity.
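The time-based and common-user correspondences mentioned in the abstract can be illustrated with a minimal sketch. It assumes each visual content item carries a capture timestamp and an owner; the `VisualItem` class, its field names, and both helper functions are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class VisualItem:
    """Hypothetical record for a visual content item (image, video, symbol)."""
    name: str
    captured_at: datetime
    owner: str


def candidates_by_time(items, audio_start, audio_duration):
    """Keep items whose timestamp falls inside the time range covered by the audio."""
    audio_end = audio_start + audio_duration
    return [item for item in items if audio_start <= item.captured_at <= audio_end]


def candidates_by_user(items, user):
    """Keep items that share a user with the audio recording."""
    return [item for item in items if item.owner == user]
```

Under these assumptions, a recording that starts at 10:00 and runs for 30 minutes would match an image captured at 10:15 but not one captured at 11:00. Correspondence by storage location or logical entity could follow the same filtering pattern with a different field.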

First claim

Opening claim text (preview).

What is claimed is:
1. An apparatus comprising: one or more processors; one or more memories storing instructions which, when processed by the one or more processors, cause the apparatus to: retrieve audio data that represents a plurality of spoken words, cause the audio data to be processed to generate transcription data that provides a textual representation of the audio data, identify one or more specified keywords contained in the transcription data, wherein each specified keyword from the one or more specified keywords indicates a location in the transcription data where a visual content item is to be added to the transcription data, display, via a user interface, the transcription data and visually indicate one or more locations in the transcription data that correspond to the one or more specified keywords, provide user interface controls that allow a user to specify a visual content item or a link to a visual content item for each of the one or more locations in the transcription data, and generate revised transcription data that includes the visual content item or a reference to the visual content item at each of the one or more locations in the transcription data.
2. The apparatus of claim 1, wherein the transcription data contains visual content identification data that identifies a visual content item to be added to the revised transcription data.
3. The apparatus of claim 2, wherein the visual content identification data is adjacent to a specified keyword.
4. The apparatus of claim 2, wherein the visual content identification data is included in the visual content item or was generated by a device that acquired the visual content item.
5. The apparatus of claim 1, wherein a location in the revised transcription data of the added visual content item or link to the visual content item corresponds to one or more locations of the one or more specified keywords.
6. The apparatus of claim 1, wherein the user interface is implemented on a client device that is separate from the apparatus.
7. The apparatus of claim 1, wherein: the audio data corresponds to a plurality of visual content items or a plurality of references to visual content items based upon time, and the user interface controls allow the user to select the visual content item or the reference to the visual content item for each of the one or more locations in the transcription data from the plurality of visual content items or the plurality of references to visual content items.
8. The apparatus of claim 7, wherein the audio data corresponds to the plurality of visual content items or the plurality of references to the visual content items based upon a time for each visual content item from the plurality of visual content items having a specified time that is within a time range covered by the audio data.
9. The apparatus of claim 1, wherein: the audio data corresponds to a plurality of visual content items or a plurality of references to visual content items, and the user interface controls allow the user to select the visual content item or the reference to the visual content item for each of the one or more locations in the transcription data from the plurality of visual content items or the plurality of references to visual content items.
10. The apparatus of claim 9, wherein the audio data corresponds to the plurality of visual content items or the plurality of references to visual content items based upon one or more of a user in common, a storage location in common, or a logical entity in common.
11. The apparatus of claim 1, wherein the one or more memories store additional instructions which, when processed by the one or more processors, cause the apparatus to: remove the one or more specified keywords from the revised transcription data.
12. The apparatus of claim 1, wherein the plurality of visual content items includes one or more of one or more images, or one or more video clips.
13. One or more non-transitory computer-readable media storing instructions which, when processed by one or more processors, cause: at a computing device retrieving audio data that represents a plurality of spoken words, causing the audio data to be processed to generate transcription data that provides a textual representation of the audio data, identifying one or more specified keywords contained in the transcription data, wherein each specified keyword from the one or more specified keywords indicates a location in the transcription data where a visual content item is to be added to the transcription data, displaying, via a user interface, the transcription data and visually indicate one or more locations in the transcription data that correspond to the one or more specified keywords, providing user interface controls that allow a user to specify a visual content item or a link to a visual content item for each of the one or more locations in the transcription data, and generating revised transcription data that includes the visual content item or a reference to the visual content item at each of the one or more locations in the transcription data.
14. The one or more non-transitory computer-readable media of claim 13, wherein the transcription data contains visual content identification data that identifies a visual content item to be added to the revised transcription data.
15. The one or more non-transitory computer-readable media of claim 13, wherein a location in the revised transcription data of the added visual content item or link to the visual content item corresponds to one or more locations of the one or more specified keywords.
16. The one or more non-transitory computer-readable media of claim 13, wherein: the audio data corresponds to a plurality of visual content items or a plurality of references to visual content items based upon time, and the user interface controls allow the user to select the visual content item or the reference to the visual content item for each of the one or more locations in the transcription data from the plurality of visual content items or the plurality of references to visual content items.
17. A computer-implemented method comprising: at a computing device retrieving audio data that represents a plurality of spoken words, causing the audio data to be processed to generate transcription data that provides a textual representation of the audio data, identifying one or more specified keywords contained in the transcription data, wherein each specified keyword from the one or more specified keywords indicates a location in the transcription data where a visual content item is to be added to the transcription data, displaying, via a user interface, the transcription data and visually indicate one or more locations in the transcription data that correspond to the one or more specified keywords, providing user interface controls that allow a user to specify a visual content item or a link to a visual content item for each of the one or more locations in the transcription data, and generating revised transcription data that includes the visual content item or a reference to the visual content item at each of the one or more locations in the transcription data.
18. The computer-implemented method of claim 17, wherein the transcription data contains visual content identification data that identifies a visual content item to be added to the revised transcription data.
19. The computer-implemented method of claim 17, wherein a location in the revised transcription data of the added visual content
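The keyword-driven flow recited in the independent claims (locate specified keywords, let a user pick an item for each location, then produce revised transcription data) might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names, the sample keyword "insert image", and the `[visual: ...]` reference format are all hypothetical. Replacing the keyword with the reference also shows the optional keyword removal described in claim 11.

```python
import re
from dataclasses import dataclass


@dataclass
class KeywordLocation:
    """Character span of one specified keyword inside the transcript."""
    keyword: str
    start: int
    end: int


def find_keyword_locations(transcript, keywords):
    """Return the spans of every specified keyword, in document order."""
    if not keywords:
        return []
    # Longest keywords first so overlapping alternatives match greedily.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    return [KeywordLocation(m.group(0), m.start(), m.end())
            for m in pattern.finditer(transcript)]


def build_revised_transcript(transcript, locations, chosen_items):
    """Replace each keyword span with a reference to the user-chosen item."""
    if len(locations) != len(chosen_items):
        raise ValueError("one chosen item is required per keyword location")
    parts = []
    cursor = 0
    for loc, item in zip(locations, chosen_items):
        parts.append(transcript[cursor:loc.start])
        parts.append(f"[visual: {item}]")  # hypothetical reference format
        cursor = loc.end
    parts.append(transcript[cursor:])
    return "".join(parts)
```

In a user interface, the spans returned by `find_keyword_locations` would be the locations to highlight, and `chosen_items` would hold whatever item or link the user selected for each one.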

Assignees

Ricoh Co Ltd

Inventors

Knodt Kurt

Classifications

G10L15/26 · Primary
Speech to text systems (G10L15/08 takes precedence) · CPC title
G06F3/167
Audio in a user interface, e.g. using voice commands for navigating, audio feedback · CPC title
G10L2015/088
Word spotting · CPC title
G06F3/04842 · Primary
Selection of displayed objects or displayed text elements (G06F3/0482 takes precedence) · CPC title

Patent family

Related publications grouped by family.

View patent family 60483300

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10198160B2 cover?: Several approaches are provided for processing audio data to generate transcription data that is supplemented with visual content items. The visual content items may be any type of data that may vary depending upon a particular implementation. Examples of visual content items include, without limitation, images, videos, symbols, etc. Embodiments include adding visual content items to transcript…
Who is the assignee on this patent?: Knodt Kurt, Ricoh Co Ltd

Related patents

Multi-component viewing tool for contact center agents

Hybrid audio representations for editing audio content

Methods and apparatus for associating dictation with an electronic record

Method for Providing Context-Based Correction of Voice Recognition Results