Routing to expert subnetworks in mixture-of-experts neural networks

US2025131251A1 · US · A1

Patent metadata
Publication number: US-2025131251-A1
Application number: US-202318834070-A
Country: US
Kind code: A1
Filing date: Jan 30, 2023
Priority date: Jan 28, 2022
Publication date: Apr 24, 2025
Grant date: (not listed)

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing a machine learning task on a network input to generate a network output. In one aspect, one of the systems includes a neural network configured to perform the machine learning task, the neural network including one or more expert neural network blocks that each include a router that performs expert-choice routing between multiple expert neural networks.

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement a neural network that is configured to process a network input and to generate a network output for the network input, the neural network comprising a sequence of one or more network blocks, the sequence comprising one or more expert network blocks configured to perform operations comprising:

    obtaining a block input that represents an intermediate representation of the network input, the block input comprising a plurality of elements;

    determining a plurality of sub-inputs from the block input, each sub-input comprising a respective different subset of the plurality of elements of the block input;

    for each of a plurality of expert subnetworks of the expert network block: processing the plurality of sub-inputs to generate a respective score for each sub-input; selecting one or more of the sub-inputs according to the respective scores; and for each selected sub-input, processing the selected sub-input using the expert subnetwork to generate a respective sub-output;

    for each of the plurality of sub-inputs, processing the sub-outputs corresponding to the sub-input generated by respective expert subnetworks to generate a combined sub-output for the sub-input; and

    generating a block output by combining the respective combined sub-outputs for the plurality of sub-inputs.

2. The system of claim 1, wherein each expert subnetwork is configured to process a same number of sub-inputs.

3. The system of claim 2, wherein the same number $k$ of sub-inputs processed by each expert subnetwork is equal to:

$$k = \frac{l \cdot c}{e}$$

wherein $l$ is a number of sub-inputs in the block input, $e$ is a number of expert subnetworks in the expert network block, and $c$ is a hyperparameter of the neural network representing an average number of sub-inputs to be processed per expert subnetwork.

4. The system of claim 1, wherein, for each expert subnetwork, processing the plurality of sub-inputs to generate a respective score for each sub-input comprises computing:

$$S = \mathrm{Softmax}(X \cdot W_g)$$

wherein $X \in \mathbb{R}^{l \times d}$ is a matrix that includes a respective row corresponding to each sub-input, $l$ is a number of sub-inputs in the block input, $d$ is a dimensionality of each sub-input, $W_g \in \mathbb{R}^{d \times e}$ is a matrix that includes a respective column corresponding to each expert subnetwork, and $e$ is a number of expert subnetworks in the expert network block.

5. The system of claim 4, wherein, for each expert subnetwork, selecting one or more of the sub-inputs according to the respective scores comprises computing:

$$G, I = \mathrm{TopK}(S^\top, k)$$
$$P = \mathrm{Onehot}(I)$$

wherein $k$ is a number of sub-inputs selected by each expert subnetwork, $I \in \mathbb{R}^{e \times k}$ is a matrix whose $(i,j)$-th element identifies the sub-input that has the $j$-th-largest score for the $i$-th expert subnetwork, $G \in \mathbb{R}^{e \times k}$ is a matrix whose $(i,j)$-th element represents the score of the sub-input that has the $j$-th-largest score for the $i$-th expert subnetwork, and $P \in \mathbb{R}^{e \times k \times l}$ is a one-hot matrix whose $(i,j,m)$-th element is equal to one if the $m$-th sub-input has the $j$-th-largest score for the $i$-th expert subnetwork and zero otherwise.

6. The system of claim 1, wherein each sub-input is processed by at most a threshold number $b$ different expert subnetworks.

7. The system of claim 6, wherein for each expert subnetwork, selecting one or more of the sub-inputs according to the respective scores comprises computing:

$$\max_{A} \; \langle S^\top, A \rangle + \lambda H(A)$$
$$\text{s.t.} \quad \forall i: \sum_{j'} A[i, j'] = k, \qquad \forall j: \sum_{i'} A[i', j] \le b, \qquad \forall i, j: 0 \le A[i, j] \le 1$$

wherein $\langle S^\top, A \rangle$ represents an inner product between $S^\top$ and $A$, and wherein $H(A) = \sum_{ij} -A[i,j] \log A[i,j]$; and computing:

$$G, I = \mathrm{TopK}(A, k)$$
$$P = \mathrm{Onehot}(I)$$

wherein $k$ is a number of sub-inputs selected by each expert subnetwork, $I \in \mathbb{R}^{e \times k}$ is a matrix whose $(i,j)$-th element identifies the sub-input that has the $j$-th-largest score for the $i$-th expert subnetwork, and $G \in \mathbb{R}^{e \times k}$ is a matrix whose $(i,j)$-th element represents the score of the sub-input that has the $j$-th-largest score fo…
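The routing recited in claims 1 to 5 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the function name expert_choice_block, taking the softmax across experts for each sub-input, rounding k down to an integer, and combining sub-outputs by a score-weighted sum are all choices made for this example that the claim text shown here does not fix. The toy experts and gating weights are likewise hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def expert_choice_block(X, W_g, experts, c):
    """Each expert picks its own top-k sub-inputs (names mirror the claims).

    X: l x d list of lists, one row per sub-input.
    W_g: d x e list of lists, one column per expert.
    experts: e callables, each mapping a length-d list to a length-d list.
    c: the capacity hyperparameter of claim 3.
    """
    l, d, e = len(X), len(X[0]), len(experts)
    # k = l * c / e (claim 3); flooring and the minimum of 1 are assumptions.
    k = max(1, int(l * c / e))

    # S[m][i]: score of sub-input m for expert i (claim 4).
    # Assumption: the softmax normalizes over experts for each sub-input.
    S = [softmax([sum(X[m][t] * W_g[t][i] for t in range(d)) for i in range(e)])
         for m in range(l)]

    out = [[0.0] * d for _ in range(l)]  # unselected sub-inputs stay zero
    I, G = [], []
    for i, expert in enumerate(experts):
        # TopK over row i of S^T: indices of the k best sub-inputs (claim 5).
        chosen = sorted(range(l), key=lambda m: S[m][i], reverse=True)[:k]
        I.append(chosen)
        G.append([S[m][i] for m in chosen])
        for m in chosen:
            y = expert(X[m])  # sub-output of expert i for sub-input m
            # Assumption: combine sub-outputs with a score-weighted sum.
            for t in range(d):
                out[m][t] += S[m][i] * y[t]
    return out, I, G


# Toy example: l = 4 sub-inputs, d = 2, e = 2 hypothetical experts, c = 1.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
W_g = [[1.0, -1.0], [-1.0, 1.0]]
experts = [lambda x: [2.0 * v for v in x], lambda x: [-v for v in x]]
out, I, G = expert_choice_block(X, W_g, experts, c=1)
print(I)
print(out)
```

Because every expert selects exactly k sub-inputs, the load per expert is fixed by construction, while the number of experts that process any given sub-input can vary (including zero). That variation is what the cap b in claims 6 and 7 constrains.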

Assignees

Google LLC

Inventors

Classifications

  • G06N3/048 (Primary)

    Activation functions

  • Distributed learning, e.g. federated learning

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • Backpropagation, e.g. using gradient descent

  • G06N3/042 (Primary)

    Knowledge-based neural networks; Logical representations of neural networks

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025131251A1 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing a machine learning task on a network input to generate a network output. In one aspect, one of the systems includes a neural network configured to perform the machine learning task, the neural network including one or more expert neural network blocks that each include a router that p…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/048. Mapped technology areas include Physics.
When was this patent published?
Published on April 24, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).