Granular neural network architecture search over low-level primitives
US-2024428071-A1 · Dec 26, 2024 · US
US2025131251A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025131251-A1 |
| Application number | US-202318834070-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jan 30, 2023 |
| Priority date | Jan 28, 2022 |
| Publication date | Apr 24, 2025 |
| Grant date | — |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing a machine learning task on a network input to generate a network output. In one aspect, one of the systems includes a neural network configured to perform the machine learning task, the neural network including one or more expert neural network blocks that each include a router that performs expert-choice routing between multiple expert neural networks.
Opening claim text (preview).
What is claimed is:

1. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement a neural network that is configured to process a network input and to generate a network output for the network input, the neural network comprising a sequence of one or more network blocks, the sequence comprising one or more expert network blocks configured to perform operations comprising: obtaining a block input that represents an intermediate representation of the network input, the block input comprising a plurality of elements; determining a plurality of sub-inputs from the block input, each sub-input comprising a respective different subset of the plurality of elements of the block input; for each of a plurality of expert subnetworks of the expert network block: processing the plurality of sub-inputs to generate a respective score for each sub-input; selecting one or more of the sub-inputs according to the respective scores; and for each selected sub-input, processing the selected sub-input using the expert subnetwork to generate a respective sub-output; for each of the plurality of sub-inputs, processing the sub-outputs corresponding to the sub-input generated by respective expert subnetworks to generate a combined sub-output for the sub-input; and generating a block output by combining the respective combined sub-outputs for the plurality of sub-inputs.

2. The system of claim 1, wherein each expert subnetwork is configured to process a same number of sub-inputs.

3. The system of claim 2, wherein the same number k of sub-inputs processed by each expert subnetwork is equal to:

$$k = \frac{l \cdot c}{e}$$

wherein $l$ is a number of sub-inputs in the block input, $e$ is a number of expert subnetworks in the expert network block, and $c$ is a hyperparameter of the neural network representing an average number of sub-inputs to be processed per expert subnetwork.

4. The system of claim 1, wherein, for each expert subnetwork, processing the plurality of sub-inputs to generate a respective score for each sub-input comprises computing:

$$S = \mathrm{Softmax}(X \cdot W_g)$$

wherein $X \in \mathbb{R}^{l \times d}$ is a matrix that includes a respective row corresponding to each sub-input, $l$ is a number of sub-inputs in the block input, $d$ is a dimensionality of each sub-input, $W_g \in \mathbb{R}^{d \times e}$ is a matrix that includes a respective column corresponding to each expert subnetwork, and $e$ is a number of expert subnetworks in the expert network block.

5. The system of claim 4, wherein, for each expert subnetwork, selecting one or more of the sub-inputs according to the respective scores comprises computing:

$$G, I = \mathrm{TopK}(S^{\top}, k), \qquad P = \mathrm{Onehot}(I)$$

wherein $k$ is a number of sub-inputs selected by each expert subnetwork, $I \in \mathbb{R}^{e \times k}$ is a matrix whose $(i,j)$th element identifies the sub-input that has the $j$th-largest score for the $i$th expert subnetwork, and $G \in \mathbb{R}^{e \times k}$ is a matrix whose $(i,j)$th element represents the score of the sub-input that has the $j$th-largest score for the $i$th expert subnetwork, and $P \in \mathbb{R}^{e \times k \times l}$ is a one-hot matrix whose $(i,j,m)$th element is equal to one if the $m$th sub-input has the $j$th-largest score for the $i$th expert subnetwork and zero otherwise.

6. The system of claim 1, wherein each sub-input is processed by at most a threshold number $b$ different expert subnetworks.

7. The system of claim 6, wherein for each expert subnetwork, selecting one or more of the sub-inputs according to the respective scores comprises: computing:

$$
\begin{aligned}
\max_{A}\quad & \langle S^{\top}, A \rangle + \lambda H(A) \\
\text{s.t.}\quad & \forall i:\ \sum_{j'} A[i, j'] = k, \\
& \forall j:\ \sum_{i'} A[i', j] \le b, \\
& \forall i, j:\ 0 \le A[i, j] \le 1
\end{aligned}
$$

wherein $\langle S^{\top}, A \rangle$ represents an inner product between $S^{\top}$ and $A$, and wherein $H(A) = \sum_{ij} -A[i,j] \log A[i,j]$; and computing:

$$G, I = \mathrm{TopK}(A, k), \qquad P = \mathrm{Onehot}(I)$$

wherein $k$ is a number of sub-inputs selected by each expert subnetwork, $I \in \mathbb{R}^{e \times k}$ is a matrix whose $(i,j)$th element identifies the sub-input that has the $j$th-largest score for the $i$th expert subnetwork, and $G \in \mathbb{R}^{e \times k}$ is a matrix whose $(i,j)$th element represents the score of the sub-input that has the $j$th-largest score fo
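
The routing steps recited in claims 1 to 5 (score every sub-input per expert, let each expert pick its top k, then merge the sub-outputs per sub-input) can be illustrated with a minimal standard-library Python sketch. This is not the applicant's implementation. The function names `expert_choice_route` and `expert_block`, the rounding of k to an integer, taking the softmax across experts for each sub-input, the gate-weighted sum used to combine sub-outputs, and the zero output for a sub-input no expert selects are all assumptions made for illustration; the claims only require that scores are computed, sub-inputs are selected, and sub-outputs are combined.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def expert_choice_route(X, W_g, c):
    """Each expert picks its k highest-scoring sub-inputs.

    X   : l rows of d floats (one row per sub-input)
    W_g : d rows of e floats (one column per expert)
    c   : hyperparameter c from claim 3
    Returns (G, I, P) shaped e x k, e x k, and e x k x l.
    """
    l, d, e = len(X), len(W_g), len(W_g[0])
    # k = l * c / e as in claim 3; integer rounding and clamping are assumptions.
    k = min(l, max(1, int(l * c / e)))
    # S[m][i]: score of sub-input m for expert i, S = Softmax(X . W_g).
    # Assumption: the softmax runs across experts for each sub-input.
    S = [softmax([sum(X[m][t] * W_g[t][i] for t in range(d)) for i in range(e)])
         for m in range(l)]
    G, I = [], []
    for i in range(e):
        # TopK over row i of S transposed: indices of the k largest scores.
        top = sorted(range(l), key=lambda m: S[m][i], reverse=True)[:k]
        I.append(top)
        G.append([S[m][i] for m in top])
    # P[i][j][m] is 1 when sub-input m has the j-th largest score for expert i.
    P = [[[1 if m == I[i][j] else 0 for m in range(l)] for j in range(k)]
         for i in range(e)]
    return G, I, P


def expert_block(X, W_g, experts, c):
    """Combine expert sub-outputs per sub-input (assumed: gate-weighted sum)."""
    G, I, _ = expert_choice_route(X, W_g, c)
    out = [[0.0] * len(X[0]) for _ in X]  # unselected sub-inputs stay at zero
    for i, expert in enumerate(experts):
        for j, m in enumerate(I[i]):
            y = expert(X[m])  # hypothetical expert: list of d floats in and out
            for t, v in enumerate(y):
                out[m][t] += G[i][j] * v
    return out


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
    W_g = [[1.0, 0.0], [0.0, 1.0]]
    toy_experts = [lambda x: x, lambda x: [2.0 * v for v in x]]
    print(expert_choice_route(X, W_g, c=1)[1])
    print(expert_block(X, W_g, toy_experts, c=1))
```

Because the selection runs per expert rather than per sub-input, every expert processes exactly k sub-inputs, which matches the fixed per-expert load of claim 2. A given sub-input may be picked by several experts or by none. Claims 6 and 7 add a cap b on the number of experts per sub-input, which this sketch does not enforce.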
Activation functions · CPC title
Distributed learning, e.g. federated learning · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Backpropagation, e.g. using gradient descent · CPC title
Knowledge-based neural networks; Logical representations of neural networks · CPC title