Hybrid approach to collating unicode text strings consisting primarily of ASCII characters

US10089282B1 · US · B1

Patent metadata
Field: Value
Publication number: US-10089282-B1
Application number: US-201815885646-A
Country: US
Kind code: B1
Filing date: Jan 31, 2018
Priority date: Nov 6, 2016
Publication date: Oct 2, 2018
Grant date: Oct 2, 2018

Abstract

Official abstract text for this publication.

Collating text strings having Unicode encoding includes receiving two text strings S = s_1 s_2 ... s_n and T = t_1 t_2 ... t_m. When the two text strings are not identical, there is a smallest positive integer p for which the two text strings differ. The process looks up the characters s_p and t_p in a predefined lookup table. If either of these characters is missing from the lookup table, the collation of the text strings is determined using the standard Unicode comparison of the text strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m. Otherwise, the lookup table assigns weights v_p and w_p for the characters s_p and t_p. When v_p ≠ w_p, these weights define the collation order of the strings S and T. When v_p = w_p, the collation of S and T is determined recursively using the suffix strings s_{p+1} ... s_n and t_{p+1} ... t_m.
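The comparison described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the table `FAST_WEIGHTS` and its weight values are hypothetical, `slow_compare` is a stand-in that uses plain code point order rather than real Unicode collation weights, the recursion on suffixes is written as a loop, and sorting a proper prefix first is an assumption of this sketch.

```python
from typing import Callable, Dict

# Hypothetical lookup table: small integer weights for a few characters.
# Case and accent variants share a weight here purely for illustration.
FAST_WEIGHTS: Dict[str, int] = {
    "a": 10, "A": 10, "á": 10,
    "b": 11, "B": 11,
    "c": 12, "C": 12,
    "d": 13, "D": 13,
    "e": 14, "E": 14, "é": 14,
}


def slow_compare(s: str, t: str) -> int:
    """Stand-in for the full Unicode comparison (assumption: code point order)."""
    return (s > t) - (s < t)


def hybrid_compare(
    s: str, t: str, fallback: Callable[[str, str], int] = slow_compare
) -> int:
    """Return -1, 0, or 1 as s collates before, with, or after t."""
    n = min(len(s), len(t))
    i = 0
    while True:
        # Find the first position p at which the strings differ.
        while i < n and s[i] == t[i]:
            i += 1
        if i == n:
            # No further difference: equal, or one string is a prefix
            # of the other (assumption: the shorter string sorts first).
            return (len(s) > len(t)) - (len(s) < len(t))
        v = FAST_WEIGHTS.get(s[i])
        w = FAST_WEIGHTS.get(t[i])
        if v is None or w is None:
            # Either character is missing from the table: hand the
            # suffixes starting at p to the slower comparison.
            return fallback(s[i:], t[i:])
        if v != w:
            # The table weights decide the order.
            return -1 if v < w else 1
        # Equal weights: continue with the suffixes after p.
        i += 1


print(hybrid_compare("cab", "cáb"))
print(hybrid_compare("abe", "ABC"))
```

With this table, `"cab"` and `"cáb"` land in the same collation position because `a` and `á` share a weight and the remaining suffixes are identical, while any character outside the table routes the rest of the comparison through `fallback`.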

First claim

Opening claim text (preview).

What is claimed is:

1. A method of collating text strings having Unicode encoding, comprising: at a computer having one or more processors, and memory storing one or more programs configured for execution by the one or more processors:
    receiving a first text string S = s_1 s_2 ... s_n having Unicode encoding and a second text string T = t_1 t_2 ... t_m having Unicode encoding, wherein n and m are positive integers, s_1, s_2, ..., s_n and t_1, t_2, ..., t_m are Unicode characters, and S is not identical to T;
    (1) identifying a positive integer p with s_1 = t_1, s_2 = t_2, ..., s_{p-1} = t_{p-1} and s_p ≠ t_p, wherein at least one of s_p and t_p is a non-ASCII character;
    (2) looking up the characters s_p and t_p in a predefined lookup table to determine a weight v_p for the character s_p and a weight w_p for the character t_p;
    (3) when at least one of s_p and t_p is not found in the lookup table, determining the collation order of the strings S and T using Unicode weights for the corresponding strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m;
    (4) when both s_p and t_p are found in the lookup table and v_p < w_p, determining that S is collated before T;
    (5) when both s_p and t_p are found in the lookup table and w_p < v_p, determining that T is collated before S;
    (6) when both s_p and t_p are found in the lookup table, v_p = w_p, and s_{p+1} ... s_n = t_{p+1} ... t_m, determining that S and T have the same collation position; and
    when both s_p and t_p are found in the lookup table, v_p = w_p, and s_{p+1} ... s_n ≠ t_{p+1} ... t_m, determining the collation order of S and T recursively according to steps (1)-(6) using the suffix strings s_{p+1} ... s_n and t_{p+1} ... t_m.

2. The method of claim 1, wherein the lookup table includes lookup values for each non-control ASCII character plus a plurality of accented Roman characters.

3. The method of claim 1, wherein each weight in the lookup table is encoded as a respective single byte.

4. The method of claim 1, further comprising, when m ≠ n, padding the shorter of the text strings S and T on the right so that the text strings S and T have the same length.

5. The method of claim 4, wherein the padding comprises ASCII null characters.

6. The method of claim 1, wherein the Unicode weights for the strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m are computed, the computation comprising:
    for each character, performing a lookup in a Unicode weight table to identify a respective primary weight, a respective accent weight, and a respective case-weight;
    forming a primary Unicode weight w_p as a concatenation of the identified primary weights;
    forming an accent Unicode weight w_a as a concatenation of the identified accent weights;
    forming a case Unicode weight w_c as a concatenation of the identified case weights; and
    forming the Unicode weight as a concatenation w_p + w_a + w_c of the primary Unicode weight, the accent Unicode weight, and the case Unicode weight.

7. The method of claim 6, wherein the collation order is in accordance with a specified language, and the Unicode weight table is selected according to the specified language.

8. A computing device, comprising: one or more processors; memory; and one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs comprising instructions for:
    receiving a first text string S = s_1 s_2 ... s_n having Unicode encoding and a second text string T = t_1 t_2 ... t_m having Unicode encoding, wherein n and m are positive integers, s_1, s_2, ..., s_n and t_1, t_2, ..., t_m are Unicode characters, and S is not identical to T;
    (1) identifying a positive integer p with s_1 = t_1, s_2 = t_2, ..., s_{p-1} = t_{p-1} and s_p ≠ t_p, wherein at least one of s_p and t_p is a non-ASCII character;
    (2) looking up the characters s_p and t_p in a predefined lookup table to determine a weight v_p for the character s_p and a weight w_p for the character t_p;
    (3) when at least one of s_p and t_p is not found in the lookup table, determining the collation order of the strings S and T using Unicode weights for the corresponding strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m;
    (4) when both s_p and t_p are found in the lookup table and v_p < w_p, determining that S is collated before T;
    (5) when both s_p and t_p are found in the lookup table and w_p < v_p, determining that T is collated before S;
    (6) when both s_p and t_p are found in the lookup table, v_p = w_p, and s_{p+1} ... s_n = t_{p+1} ... t_m, determining that S and T have the same collation position; and
    when both s_p and t_p are found in the lookup table, v_p = w_p, and s_{p+1} ... s_n ≠ t_{p+1} ... t_m, determining the collation order of S and T recursively according to steps (1)-(6) using the suffix strings s_{p+1} ... s_n and t_{p+1} ... t_m.

9. The computing device of claim 8, wherein the lookup table includes lookup values for each non-control ASCII character plus a plurality of accented Roman characters.

10. The computing device of claim 8, wherein each weight in the lookup table is encoded as a respective single byte.

11. The computing device of claim 8, wherein the one or more programs further comprise instructions padding the shorter of the text strings S and T on the right so that the text strings S and T have the same length when m ≠ n.

12. The computing device of claim 11, wherein the padding comprises ASCII null characters.

13. The computing device of claim 8, wherein the one or more programs comprise instructions for computing the Unicode weights for the strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m are computed, the computation comprising:
    for each character, performing a lookup in a Unicode weight table to identify a respective primary weight, a respective accent weight, and a respective case-weight;
    forming a primary Unicode weight w_p as a concatenation of the identified primary weights;
    forming an accent Unicode weight w_a as a concatenation of the identified accent weights;
    forming a case Unicode weight w_c as a concatenation of the identified case weights; and
    forming the Unicode weight as a concatenation w_p + w_a + w_c of the primary Unicode weight, the accent Unicode weight, and the case Unicode weight.

14. The computing device of claim 13, wherein the collation order is in accordance with a specified language, and the Unicode weight table is selected according to the specified language.

15. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computing device having one or more processors and memory, the one or more programs comprising instructions for:
    receiving a first text string S = s_1 s_2 ... s_n having Unicode encoding and a second text string T = t_1 t_2 ... t_m having Unicode encoding, wherein n and m are positive integers, s_1, s_2, ..., s_n and t_1, t_2, ..., t_m are Unicode characters, and S is not identical to T;
    (1) identifying a positive integer p with s_1 = t_1, s_2 = t_2, ..., s_{p-1} = t_{p-1} and s_p ≠ t_p, wherein at least one of s_p and t_p is a non-ASCII character;
    (2) looking up the characters s_p and t_p in a predefined lookup table to determine a weight v_p for the character s_p and a weight w_p for the character t_p;
    (3) when at least one of s_p and t_p is not found in the lookup table, determining the collation order of the strings S and T using Unicode weights for the corresponding strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m;
    (4)
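The weight construction recited in claims 6 and 13 (per-character primary, accent, and case weights, concatenated level by level) can be illustrated with a short Python sketch. The table `WEIGHT_TABLE` and every value in it are hypothetical, each weight is assumed to fit in one byte, and the sketch assumes equal-length inputs (claims 4 and 5 describe padding the shorter string with ASCII null characters).

```python
from typing import Dict, Tuple

# Hypothetical Unicode weight table: character -> (primary, accent, case).
# A language-specific table (claims 7 and 14) would be swapped in here.
WEIGHT_TABLE: Dict[str, Tuple[int, int, int]] = {
    "a": (0x10, 0x01, 0x01),
    "A": (0x10, 0x01, 0x02),
    "á": (0x10, 0x02, 0x01),
    "b": (0x11, 0x01, 0x01),
}


def unicode_weight(text: str) -> bytes:
    """Build the weight w_p + w_a + w_c for a string covered by the table."""
    primary = bytes(WEIGHT_TABLE[ch][0] for ch in text)  # w_p
    accent = bytes(WEIGHT_TABLE[ch][1] for ch in text)   # w_a
    case = bytes(WEIGHT_TABLE[ch][2] for ch in text)     # w_c
    return primary + accent + case


# Equal-length strings can be ordered by comparing their weights as bytes.
words = ["áb", "Ab", "ab"]
print(sorted(words, key=unicode_weight))
```

Because all primary weights come first in the concatenation, a difference in base letter outranks any accent difference, and an accent difference outranks any case difference, which is the usual motivation for concatenating level by level rather than character by character.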

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10089282B1 cover?
Collating text strings having Unicode encoding includes receiving two text strings S = s_1 s_2 ... s_n and T = t_1 t_2 ... t_m. When the two text strings are not identical, there is a smallest positive integer p for which the two text strings differ. The process looks up the characters s_p and t_p in a predefined lookup table. If either of these characters is missing from the lookup table…
Who is the assignee on this patent?
Tableau Software Inc
What technology area does this patent fall under?
Primary CPC classification H03M7/14. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Oct 2, 2018 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).