What technology area does this patent fall under?

The primary CPC classification is G06F40/284 (lexical analysis, e.g. tokenisation or collocates). Mapped technology areas include Physics.

When was this patent published?

Published Jan 22, 2026 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Inference Methods For Word Or Wordpiece Tokenization

US2026023928A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2026023928-A1
Application number	US-202519346824-A
Country	US
Kind code	A1
Filing date	Oct 1, 2025
Priority date	May 18, 2020
Publication date	Jan 22, 2026
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for performing inference for word or wordpiece tokenization are disclosed using a left-to-right longest-match-first greedy process. In some examples, the vocabulary may be organized into a trie structure in which each node includes a precomputed token or token_ID and a fail link, so that the tokenizer can parse the trie in a single pass to generate a list of only those tokens or token_IDs that correspond to the longest matching vocabulary entries in the sample string, without the need for backtracking. In some examples, the vocabulary may be organized into a trie in which each node has a fail link, and any node that would share token(s) or token_ID(s) of a preceding node is instead given a prev_match link that points back to a chain of nodes with those token(s) or token_ID(s).
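The single-pass idea in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the names `Node`, `build`, `tokenize_word`, the `pops` field, the `##` suffix prefix, the `[UNK]` fallback, the use of two separate tries (one for word-initial entries, one for suffix entries), and the toy vocabulary are all assumptions made for illustration. Each node stores a fail link plus the tokens to emit when that link is taken, so the input index never moves backwards.

```python
from collections import deque

UNK = "[UNK]"  # assumed fallback token for words that cannot be tokenized


class Node:
    def __init__(self):
        self.children = {}  # char -> Node
        self.token = None   # vocabulary entry ending at this node, if any
        self.fail = None    # fail link (None means "no way to continue")
        self.pops = []      # precomputed tokens emitted when the fail link is taken


def build(vocab):
    """Build both tries and precompute fail links in breadth-first order."""
    root, suffix_root = Node(), Node()
    for tok in vocab:
        # Assumption: "##" marks a wordpiece that continues a word.
        node, text = (suffix_root, tok[2:]) if tok.startswith("##") else (root, tok)
        for ch in text:
            node = node.children.setdefault(ch, Node())
        node.token = tok
    queue = deque([root, suffix_root])
    while queue:
        parent = queue.popleft()
        for ch, child in parent.children.items():
            if child.token is not None:
                # A complete entry: emit it and restart at the suffix trie root.
                child.fail, child.pops = suffix_root, [child.token]
            else:
                # Follow the parent's fail chain until some node accepts ch.
                z, skipped = parent.fail, []
                while z is not None and ch not in z.children:
                    skipped += z.pops
                    z = z.fail
                if z is not None:
                    child.fail = z.children[ch]
                    child.pops = parent.pops + skipped
            queue.append(child)
    return root, suffix_root


def tokenize_word(word, root, suffix_root):
    """Longest-match-first tokenization of one word in a single left-to-right pass."""
    node, out = root, []
    for ch in word:
        while ch not in node.children:
            if node.fail is None:
                return [UNK]
            out += node.pops
            node = node.fail
        node = node.children[ch]
    # End of word: keep taking fail links to flush the remaining tokens.
    while node.fail is not None:
        out += node.pops
        node = node.fail
    # Success only if everything was consumed (an empty word also yields [UNK]).
    return out if node is suffix_root else [UNK]


vocab = ["a", "abcdx", "##b", "##c", "##cdy", "##dz"]  # toy vocabulary
root, suffix_root = build(vocab)
print(tokenize_word("abcdz", root, suffix_root))  # ['a', '##b', '##c', '##dz']
```

In this toy run the walk reaches the node for "abcd" before failing on "z"; the precomputed `pops` lists then supply "a", "##b", and "##c" without re-reading any input character.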

First claim

Opening claim text (preview).

1. A computer-implemented method comprising: performing, by one or more processors of a processing system, tokenization of a string of text, comprising: analyzing a set of nodes of a vocabulary structure to identify one or more links between nodes of the vocabulary structure corresponding to one or more characters of the string; identifying a fail link between a pair of nodes in the set of nodes; and forming an array of tokens based at least in part on the fail link; and providing, by the one or more processors, the array of tokens to a neural network for natural language processing.
2. The method of claim 1, wherein identifying the fail link between the pair of nodes in the set of nodes includes analyzing one of the pair of nodes to determine whether that node has no link corresponding to a given character of the string.
3. The method of claim 1, wherein a first token of the array of tokens comprises a word or wordpiece including a first character and a second character of the string.
4. The method of claim 3, wherein a second token of the array of tokens includes a third character of the string.
5. The method of claim 1, wherein a first token of the array of tokens identifies an entry in a vocabulary for a word or wordpiece including a first character and a second character of the string.
6. The method of claim 5, wherein a second token of the array of tokens identifies an entry in the vocabulary for a third character of the string.
7. The method of claim 1, wherein the string further comprises a given character that is a symbol representing the end of the string.
8. The method of claim 1, further comprising performing the natural language processing on a segment of text using the neural network.
9. The method of claim 1, further comprising using the fail link to arrive at a next node of the vocabulary structure.
10. A processing system comprising: a memory; and one or more processors coupled to the memory and configured to: perform tokenization of a string of text, comprising: analyze a set of nodes of a vocabulary structure to identify one or more links between nodes of the vocabulary structure corresponding to one or more characters of the string; identify a fail link between a pair of nodes in the set of nodes; and form an array of tokens based at least in part on the fail link; and provide the array of tokens to a neural network for natural language processing.
11. The system of claim 10, wherein identification of the fail link between the pair of nodes in the set of nodes includes analysis of one of the pair of nodes to determine whether that node has no link corresponding to a given character of the string.
12. The system of claim 10, wherein a first token of the array of tokens comprises a word or wordpiece including a first character and a second character of the string.
13. The system of claim 12, wherein a second token of the array of tokens includes a third character of the string.
14. The system of claim 10, wherein a first token of the array of tokens identifies an entry in a vocabulary for a word or wordpiece including a first character and a second character of the string.
15. The system of claim 14, wherein a second token of the array of tokens identifies an entry in the vocabulary for a third character of the string.
16. The system of claim 10, wherein the string further comprises a given character that is a symbol representing the end of the string.
17. The system of claim 10, wherein the one or more processors are further configured to perform the natural language processing on a segment of text via the neural network.
18. The system of claim 10, wherein the one or more processors are further configured to use the fail link to arrive at a next node of the vocabulary structure.
19. A non-transitory recording medium having computer-readable instructions stored thereon, the instructions, when executed by one or more processors of a processing system: performing tokenization of a string of text, comprising: analyzing a set of nodes of a vocabulary structure to identify one or more links between nodes of the vocabulary structure corresponding to one or more characters of the string; identifying a fail link between a pair of nodes in the set of nodes; and forming an array of tokens based at least in part on the fail link; and providing the array of tokens to a neural network for natural language processing.
20. The recording medium of claim 19, wherein identifying the fail link between the pair of nodes in the set of nodes includes analyzing one of the pair of nodes to determine whether that node has no link corresponding to a given character of the string.
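To make the dependent claims about the first and second tokens concrete, here is a small Python sketch of plain left-to-right longest-match-first tokenization that returns both wordpieces and their vocabulary IDs. It is a baseline written for illustration only and does not use fail links; the function name `greedy_token_ids`, the `##` continuation prefix, the `UNK_ID` value, and the three-entry vocabulary are hypothetical.

```python
UNK_ID = -1  # hypothetical id returned when the text cannot be tokenized


def greedy_token_ids(text, vocab_ids):
    """Return (tokens, ids) using left-to-right longest-match-first matching."""
    tokens, start = [], 0
    while start < len(text):
        end = len(text)
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = text[start:end] if start == 0 else "##" + text[start:end]
            if piece in vocab_ids:
                break
            end -= 1
        if end == start:  # no vocabulary entry matches at this position
            return ["[UNK]"], [UNK_ID]
        tokens.append(piece)
        start = end
    return tokens, [vocab_ids[t] for t in tokens]


vocab_ids = {"ab": 0, "##c": 1, "a": 2}  # toy vocabulary: entry -> id
tokens, ids = greedy_token_ids("abc", vocab_ids)
print(tokens, ids)  # ['ab', '##c'] [0, 1]
```

For the string "abc", the first token covers the first and second characters and the second token covers the third character; the ID list is what would be passed on to a downstream model.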

Assignees

Google LLC

Inventors

Classifications

G06F40/40
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
G06F16/322
Trees · CPC title
G06F16/3334
Selection or weighting of terms from queries, including natural language queries · CPC title
G06F40/284 · Primary
Lexical analysis, e.g. tokenisation or collocates · CPC title
G06F40/237 · Primary
Lexical tools · CPC title

Patent family

Related publications grouped by family.

View patent family 78708743

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2026023928A1 cover?: Systems and methods for performing inference for word or wordpiece tokenization are disclosed using a left-to-right longest-match-first greedy process. In some examples, the vocabulary may be organized into a trie structure in which each node includes a precomputed token or token_ID and a fail link, so that the tokenizer can parse the trie in a single pass to generate a list of only those token…
Who is the assignee on this patent?: Google LLC