Training deep neural network acoustic models using distributed hessian-free optimization

US9390370B2 · US · B2

Patent metadata
Publication number: US-9390370-B2
Application number: US-201313783812-A
Country: US
Kind code: B2
Filing date: Mar 4, 2013
Priority date: Aug 28, 2012
Publication date: Jul 12, 2016
Grant date: Jul 12, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for training a neural network includes receiving labeled training data at a master node, generating, by the master node, partitioned training data from the labeled training data and a held-out set of the labeled training data, determining a plurality of gradients for the partitioned training data, wherein the determination of the gradients is distributed across a plurality of worker nodes, determining a plurality of curvature matrix-vector products over the plurality of samples of the partitioned training data, wherein the determination of the plurality of curvature matrix-vector products is distributed across the plurality of worker nodes, and determining, by the master node, a second-order optimization of the plurality of gradients and the plurality of curvature matrix-vector products, producing a trained neural network configured to perform a structured classification task using a sequence-discriminative criterion.
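The data flow described in the abstract can be illustrated with a minimal sketch. It assumes a linear least-squares model, so the curvature matrix is the Gauss-Newton matrix (the sum of x xᵀ over samples), and it runs the "workers" as ordinary function calls in a single process. Neither choice comes from the patent, and every function name below is hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Sample = Tuple[Vector, float]  # (features, target)


def partition(data: List[Sample], num_workers: int,
              held_out_fraction: float = 0.25):
    """Master: set aside a held-out set, then deal the rest round-robin."""
    n_held = max(1, int(len(data) * held_out_fraction))
    held_out, train = data[:n_held], data[n_held:]
    shards = [train[i::num_workers] for i in range(num_workers)]
    return shards, held_out


def worker_gradient(shard: List[Sample], w: Vector) -> Vector:
    """Worker: gradient of sum 0.5 * (w.x - y)^2 over this shard only."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def worker_curvature_vector_product(shard: List[Sample], v: Vector) -> Vector:
    """Worker: G v for G = sum of x x^T, computed without ever forming G."""
    out = [0.0] * len(v)
    for x, _ in shard:
        xv = sum(xi * vi for xi, vi in zip(x, v))
        for j, xj in enumerate(x):
            out[j] += xv * xj
    return out


def master_sum(parts: List[Vector]) -> Vector:
    """Master: add up the per-worker partial results element-wise."""
    return [sum(column) for column in zip(*parts)]
```

Because both quantities are sums over samples, the master recovers the full-batch gradient and curvature matrix-vector product by adding the per-shard results, which is what makes the computation data-parallel.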

First claim

Opening claim text (preview).

What is claimed is:

1. A method for training a neural network, the method comprising: receiving labeled training data at a master node; generating, by the master node, partitioned training data from the labeled training data and a held-out set of the labeled training data; determining a plurality of gradients for the partitioned training data, wherein the determination of the gradients is distributed across a plurality of worker nodes; determining a plurality of curvature matrix-vector products over a plurality of samples of the partitioned training data, wherein the determination of the plurality of curvature matrix-vector products is distributed across the plurality of worker nodes; and determining, by the master node, a second-order optimization of the plurality of gradients and the plurality of curvature matrix-vector products, wherein the second-order optimization forms a plurality of quadratic approximations of a loss function corresponding to the gradients determined by the worker nodes, the plurality of quadratic approximations of the loss function being formed using the curvature matrix-vector products, the second-order optimization selecting, from the plurality of quadratic approximations, a quadratic approximation determined to reduce a loss on the held-out set of the labeled training data, and producing a trained neural network having network parameters corresponding to the quadratic approximation selected, wherein the trained neural network is configured to perform a structured classification task.

2. The method of claim 1, further comprising assigning, by the master node, the partitioned training data to the plurality of worker nodes.

3. The method of claim 1, further comprising coordinating, by the master node, activity of the plurality of worker nodes.

4. The method of claim 1, wherein the second-order optimization comprises a Hessian-free optimization.

5. The method of claim 1, wherein the trained neural network comprises a plurality of nodes connected by a plurality of edges, wherein the second-order optimization determines weights for the plurality of edges, and wherein the weights are the network parameters.

6. The method of claim 1, further comprising generating, by the master node, the held-out set of the labeled training data, wherein determining the second-order optimization further comprises the iterative steps of: determining an actual loss for the gradient of the quadratic approximation selected based on the held-out set of the labeled training data; and adjusting a damping parameter according to a comparison of the actual loss to a predicted loss of the quadratic approximation selected, wherein the damping parameter controls the quadratic approximation.

7. The method of claim 1, wherein the master node and the plurality of worker nodes constitute a computer system configured to produce the trained neural network, wherein the plurality of worker nodes perform data-parallel computations to determine the plurality of gradients and curvature matrix-vector products, and wherein the master node performs a computation to determine the second-order optimization.

8. A computer program product for training a neural network, the computer program product comprising a non-transitory computer readable storage medium having program code embodied therewith, the program code readable by a processor to: receive labeled training data; generate partitioned training data from the labeled training data and a held-out set of the labeled training data; assign the partitioned training data to a plurality of worker nodes; receive a plurality of gradients and a plurality of curvature matrix-vector products from the plurality of worker nodes; and determine a second-order optimization of the plurality of gradients and the plurality of curvature matrix-vector products, using the held-out set and a damping parameter determined using the held-out set, producing a trained neural network configured to perform a structured classification task using a sequence-discriminative criterion.

9. The computer program product of claim 8, wherein the processor coordinates activity of the plurality of worker nodes.

10. The computer program product of claim 8, wherein the second-order optimization comprises a Hessian-free optimization.

11. The computer program product of claim 8, wherein the processor generates the held-out set of the labeled training data, and determines the second-order optimization by determining an actual loss for a current gradient of a quadratic approximation of a loss function based on the held-out set of the labeled training data, and adjusting the damping parameter according to a comparison of the actual loss to a predicted loss of the current quadratic approximation of the loss function, wherein the damping parameter controls the quadratic approximation.

12. The computer program product of claim 8, wherein the trained neural network comprises a plurality of nodes connected by a plurality of edges, wherein the second-order optimization determines weights for the plurality of edges, and wherein the weights are the network parameters.

13. A system for training deep neural network acoustic models comprising: a plurality of distributed worker computing devices configured to perform data-parallel computation of gradients and curvature matrix-vector products for partitioned training data generated from labeled training data; and a master computing device connected to the plurality of distributed worker computing devices by inter-process communication flow, wherein the master computing device is configured to determine a second-order optimization given the gradients and the curvature matrix-vector products and to coordinate activity of the plurality of distributed worker computing devices, wherein the second-order optimization forms a plurality of quadratic approximations of a loss function corresponding to the gradients determined by the distributed worker computing devices, the plurality of quadratic approximations of the loss function being formed using the curvature matrix-vector products, the second-order optimization selecting, from the plurality of quadratic approximations, a quadratic approximation determined to reduce a loss on a held-out set of the labeled training data, and producing a trained neural network having network parameters corresponding to the quadratic approximation selected, wherein the trained neural network is configured to perform a structured classification task.

14. The system of claim 13, wherein the plurality of distributed worker computing devices are each configured to: receive partitioned training data from the master computing device; determine the gradients for the partitioned training data; and determine the curvature matrix-vector products over the partitioned training data.

15. The system of claim 13, wherein the master computing device is configured to: receive the labeled training data; generate the partitioned training data from the labeled training data and the held-out set of the labeled training data; assign the partitioned training data to the plurality of distributed worker computing devices; and receive the gradients and the curvature matrix-vector products from the plurality of distributed worker computing devices.

16. The system of claim 13, wherein the master computing device coordinates activity of the plurality of distributed worker computing devices.
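The master-side steps recited in the claims (forming several quadratic approximations from curvature matrix-vector products, selecting one by held-out loss, and adjusting a damping parameter) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it uses conjugate gradient iterates as the candidate steps, evaluates the predicted change with the undamped quadratic model, and borrows the common Levenberg-Marquardt thresholds (0.25 and 0.75) and factors (3/2 and 2/3). None of those specifics appear in the claims, and all function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
MatVec = Callable[[Vector], Vector]  # v -> G v, e.g. summed over workers


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def cg_candidates(matvec: MatVec, g: Vector, damping: float,
                  iters: int) -> List[Vector]:
    """Run CG on (G + damping*I) d = -g and keep every iterate as a candidate."""
    d = [0.0] * len(g)
    r = [-gi for gi in g]          # residual for the starting point d = 0
    p = list(r)
    rr = dot(r, r)
    candidates: List[Vector] = []
    for _ in range(iters):
        if rr < 1e-20:             # residual is numerically zero: converged
            break
        Ap = axpy(damping, p, matvec(p))   # (G + damping*I) p
        alpha = rr / dot(p, Ap)
        d = axpy(alpha, p, d)
        r = axpy(-alpha, Ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)
        rr = rr_new
        candidates.append(d)
    return candidates


def predicted_change(matvec: MatVec, g: Vector, d: Vector) -> float:
    """Change predicted by the quadratic model: g.d + 0.5 * d.(G d)."""
    return dot(g, d) + 0.5 * dot(d, matvec(d))


def select_step(candidates: List[Vector], w: Vector,
                held_out_loss: Callable[[Vector], float]) -> Vector:
    """Pick the candidate step whose update gives the lowest held-out loss."""
    return min(candidates, key=lambda d: held_out_loss(axpy(1.0, d, w)))


def adjust_damping(damping: float, actual: float, predicted: float) -> float:
    """Compare actual and predicted loss change and rescale the damping."""
    if predicted >= 0.0:           # model predicts no decrease: trust it less
        return damping * 1.5
    rho = actual / predicted
    if rho < 0.25:
        return damping * 1.5
    if rho > 0.75:
        return damping * (2.0 / 3.0)
    return damping
```

In a distributed setting, `matvec` would be the point where the master broadcasts a vector and sums the curvature matrix-vector products returned by the workers, so the full curvature matrix is never stored.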

Assignees

IBM

Inventors

Classifications

  • Combinations of networks · CPC title

  • G06N3/084 · Primary

    Backpropagation, e.g. using gradient descent · CPC title

  • G06N3/09 · Primary

    Supervised learning · CPC title

  • Feedforward networks · CPC title

  • Distributed learning, e.g. federated learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9390370B2 cover?
A method for training a neural network includes receiving labeled training data at a master node, generating, by the master node, partitioned training data from the labeled training data and a held-out set of the labeled training data, determining a plurality of gradients for the partitioned training data, wherein the determination of the gradients is distributed across a plurality of worker no…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N3/084. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 12, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).