Hybrid approach to collating unicode text strings consisting primarily of ASCII characters

US10325010B1 · US · B1

Patent metadata
Publication number: US-10325010-B1
Application number: US-201816134919-A
Country: US
Kind code: B1
Filing date: Sep 18, 2018
Priority date: Nov 6, 2016
Publication date: Jun 18, 2019
Grant date: Jun 18, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Collating text strings having Unicode encoding includes receiving two text strings S = s_1 s_2 ... s_n and T = t_1 t_2 ... t_m. When the two text strings are not identical, there is a smallest positive integer p for which the two text strings differ. The process looks up the characters s_p and t_p in a predefined lookup table. If either of these characters is missing from the lookup table, the collation of the text strings is determined using the standard Unicode comparison of the text strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m. Otherwise, the lookup table assigns weights v_p and w_p for the characters s_p and t_p. When v_p ≠ w_p, these weights define the collation order of the strings S and T. When v_p = w_p, the collation of S and T is determined recursively using the suffix strings s_{p+1} ... s_n and t_{p+1} ... t_m.
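The comparison described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `FAST_WEIGHTS`, `codepoint_fallback`, and `hybrid_compare` are hypothetical names, the table contents are invented for the example, and the fallback compares raw code points as a stand-in for a full Unicode collation. The sketch also assumes that a string that is a prefix of the other sorts first, and it uses a loop in place of the recursion on suffix strings.

```python
from typing import Callable, Dict

# Hypothetical fast-path table: one small integer weight per printable ASCII
# character. Upper- and lower-case letters deliberately share a weight so the
# "equal weights" branch is exercised. All weights fit in a single byte.
FAST_WEIGHTS: Dict[str, int] = {
    chr(c): ord(chr(c).lower()) for c in range(0x20, 0x7F)
}


def codepoint_fallback(s: str, t: str) -> int:
    """Stand-in for the slow path: compares code points, NOT real Unicode collation."""
    return (s > t) - (s < t)


def hybrid_compare(
    s: str,
    t: str,
    fallback: Callable[[str, str], int] = codepoint_fallback,
) -> int:
    """Return -1, 0, or 1 according to the hybrid scheme sketched above."""
    start = 0
    limit = min(len(s), len(t))
    while True:
        # Find the first position p >= start where the strings differ.
        p = start
        while p < limit and s[p] == t[p]:
            p += 1
        if p == limit:
            # Assumption: a prefix sorts before the longer string.
            return (len(s) > len(t)) - (len(s) < len(t))
        v = FAST_WEIGHTS.get(s[p])
        w = FAST_WEIGHTS.get(t[p])
        if v is None or w is None:
            # At least one character is missing from the table: slow path
            # on the suffixes that begin at position p.
            return fallback(s[p:], t[p:])
        if v != w:
            # Both characters found and the weights differ: they decide.
            return -1 if v < w else 1
        # Equal weights: continue with the suffixes after position p.
        start = p + 1


print(hybrid_compare("apple", "apricot"))
print(hybrid_compare("caf\u00e9", "cafe"))
```

For mostly-ASCII data the slow path is reached only when the first differing position involves a character outside the table, which is the case the hybrid scheme is meant to make rare.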

First claim

Opening claim text (preview).

What is claimed is:

1. A method of collating text strings having Unicode encoding, comprising: at a computer having one or more processors, and memory storing one or more programs configured for execution by the one or more processors: receiving a first text string S = s_1 s_2 ... s_n having Unicode encoding and a second text string T = t_1 t_2 ... t_m having Unicode encoding, wherein n and m are positive integers, s_1, s_2, ..., s_n and t_1, t_2, ..., t_m are Unicode characters, and S is not identical to T; identifying a positive integer p with s_1 = t_1, s_2 = t_2, ..., s_{p−1} = t_{p−1}, and s_p ≠ t_p; looking up the characters s_p and t_p in a predefined lookup table to determine a weight v_p for the character s_p and a weight w_p for the character t_p; when both s_p and t_p are found in the lookup table, determining a collation order of the text strings S and T according to a comparison of the weights v_p and w_p; and when at least one of s_p and t_p is not found in the lookup table, determining the collation order of the text strings S and T using Unicode weights for the corresponding strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m.

2. The method of claim 1, wherein the predefined lookup table includes lookup values for each non-control ASCII character plus a plurality of accented Roman characters.

3. The method of claim 1, wherein each weight in the predefined lookup table is encoded as a respective single byte.

4. The method of claim 1, further comprising, when m ≠ n, padding the shorter of the text strings S and T on the right so that the text strings S and T have the same length.

5. The method of claim 4, wherein the padding comprises ASCII null characters.

6. The method of claim 1, wherein the Unicode weights for the strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m are computed, the computation comprising: for each character, performing a lookup in a Unicode weight table to identify a respective primary weight, a respective accent weight, and a respective case weight; forming a primary Unicode weight w_a as a concatenation of the identified primary weights; forming an accent Unicode weight w_b as a concatenation of the identified accent weights; forming a case Unicode weight w_c as a concatenation of the identified case weights; and forming the Unicode weight as a concatenation w_a + w_b + w_c of the primary Unicode weight, the accent Unicode weight, and the case Unicode weight.

7. The method of claim 6, wherein the collation order of the text strings S and T is in accordance with a specified language, and the Unicode weight table is selected according to the specified language.

8. A computing device, comprising: one or more processors; memory; and one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs comprising instructions for: receiving a first text string S = s_1 s_2 ... s_n having Unicode encoding and a second text string T = t_1 t_2 ... t_m having Unicode encoding, wherein n and m are positive integers, s_1, s_2, ..., s_n and t_1, t_2, ..., t_m are Unicode characters, and S is not identical to T; identifying a positive integer p with s_1 = t_1, s_2 = t_2, ..., s_{p−1} = t_{p−1}, and s_p ≠ t_p; looking up the characters s_p and t_p in a predefined lookup table to determine a weight v_p for the character s_p and a weight w_p for the character t_p; when both s_p and t_p are found in the lookup table, determining a collation order of the text strings S and T according to a comparison of the weights v_p and w_p; and when at least one of s_p and t_p is not found in the lookup table, determining the collation order of the text strings S and T using Unicode weights for the corresponding strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m.

9. The computing device of claim 8, wherein the predefined lookup table includes lookup values for each non-control ASCII character plus a plurality of accented Roman characters.

10. The computing device of claim 8, wherein each weight in the predefined lookup table is encoded as a respective single byte.

11. The computing device of claim 8, wherein the one or more programs further comprise instructions for padding the shorter of the text strings S and T on the right so that the text strings S and T have the same length when m ≠ n.

12. The computing device of claim 11, wherein the padding comprises ASCII null characters.

13. The computing device of claim 8, wherein the one or more programs comprise instructions for computing the Unicode weights for the strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m, the computation comprising: for each character, performing a lookup in a Unicode weight table to identify a respective primary weight, a respective accent weight, and a respective case weight; forming a primary Unicode weight w_a as a concatenation of the identified primary weights; forming an accent Unicode weight w_b as a concatenation of the identified accent weights; forming a case Unicode weight w_c as a concatenation of the identified case weights; and forming the Unicode weight as a concatenation w_a + w_b + w_c of the primary Unicode weight, the accent Unicode weight, and the case Unicode weight.

14. The computing device of claim 13, wherein the collation order of the text strings S and T is in accordance with a specified language, and the Unicode weight table is selected according to the specified language.

15. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computing device having one or more processors and memory, the one or more programs comprising instructions for: receiving a first text string S = s_1 s_2 ... s_n having Unicode encoding and a second text string T = t_1 t_2 ... t_m having Unicode encoding, wherein n and m are positive integers, s_1, s_2, ..., s_n and t_1, t_2, ..., t_m are Unicode characters, and S is not identical to T; identifying a positive integer p with s_1 = t_1, s_2 = t_2, ..., s_{p−1} = t_{p−1}, and s_p ≠ t_p; looking up the characters s_p and t_p in a predefined lookup table to determine a weight v_p for the character s_p and a weight w_p for the character t_p; when both s_p and t_p are found in the lookup table, determining a collation order of the text strings S and T according to a comparison of the weights v_p and w_p; and when at least one of s_p and t_p is not found in the lookup table, determining the collation order of the text strings S and T using Unicode weights for the corresponding strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m.

16. The computer readable storage medium of claim 15, wherein the predefined lookup table includes lookup values for each non-control ASCII character plus a plurality of accented Roman characters.

17. The computer readable storage medium of claim 15, wherein each weight in the predefined lookup table is encoded as a respective single byte.

18. The computer readable storage medium of claim 15, wherein the one or more programs further comprise instructions for padding the shorter of the text strings S and T on the right so that the text strings S and T have the same length when m ≠ n.

19. The computer readable storage medium of claim 15, wherein the one or more programs comprise instructions for computing the Unicode weights for the strings s_p s_{p+1} ... s_n and t_p t_{p+1} ... t_m, the computation comprising: for each cha
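The three-level weight construction recited in claims 6 and 13, together with the right-padding of claims 4 and 5, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: `TOY_WEIGHTS` is an invented table covering a handful of characters (a real Unicode weight table is far larger and may be language-specific, as claims 7 and 14 note), and `sort_key` and `compare_padded` are hypothetical helper names.

```python
from typing import Dict, Tuple

# Hypothetical toy table: character -> (primary, accent, case) weights.
# The ASCII null used for padding gets the lowest weight at every level.
TOY_WEIGHTS: Dict[str, Tuple[int, int, int]] = {
    "\0": (0, 0, 0),
    "a": (1, 0, 0), "A": (1, 0, 1), "\u00e1": (1, 1, 0),  # a, A, a-acute
    "b": (2, 0, 0), "B": (2, 0, 1),
    "e": (3, 0, 0), "E": (3, 0, 1), "\u00e9": (3, 1, 0),  # e, E, e-acute
}


def sort_key(text: str) -> bytes:
    """Concatenate all primary weights, then all accent weights, then all case weights."""
    primary = bytes(TOY_WEIGHTS[ch][0] for ch in text)
    accent = bytes(TOY_WEIGHTS[ch][1] for ch in text)
    case = bytes(TOY_WEIGHTS[ch][2] for ch in text)
    return primary + accent + case


def compare_padded(s: str, t: str) -> int:
    """Pad the shorter string on the right with ASCII nulls, then compare keys."""
    width = max(len(s), len(t))
    ks = sort_key(s.ljust(width, "\0"))
    kt = sort_key(t.ljust(width, "\0"))
    return (ks > kt) - (ks < kt)


print(sort_key("ab"))
print(compare_padded("a", "ab"))
```

Because the primary weights come first in the concatenated key, a difference in base letters outranks any accent difference, and an accent difference outranks any case difference. Padding to equal length keeps the three segments aligned when the two keys are compared byte by byte.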

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10325010B1 cover?
Collating text strings having Unicode encoding includes receiving two text strings S = s_1 s_2 ... s_n and T = t_1 t_2 ... t_m. When the two text strings are not identical, there is a smallest positive integer p for which the two text strings differ. The process looks up the characters s_p and t_p in a predefined lookup table. If either of these characters is missing from the lookup table, the collation o…
Who is the assignee on this patent?
Tableau Software Inc
What technology area does this patent fall under?
Primary CPC classification H03M7/14. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Jun 18, 2019 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).