Automated identification of start-of-reading location for ebooks

US10042880B1 · US · B1

Patent metadata
Field: Value
Publication number: US-10042880-B1
Application number: US-201614989098-A
Country: US
Kind code: B1
Filing date: Jan 6, 2016
Priority date: Jan 6, 2016
Publication date: Aug 7, 2018
Grant date: Aug 7, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A machine-learning system analyzes electronic books to determine a “start-of-reading location” (SRL) in each book. Based on this location, when an electronic book is opened on a reading device for the first time, the book can be opened to where a reader is likely to want to start reading, automatically skipping past introductory pages. Books are divided into logical blocks (e.g., title page, forward, chapters, etc.), and a title portion and a body-text portion is identified in each block. A title classifier attempts to determine whether or not a block should be marked as the SRL. If the score from the title classifier is indefinite, a body-text classifier is used.
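
The two-stage decision described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Block` type, the threshold values (`low`, `high`, `body_cutoff`), and both toy scoring functions are assumptions made up for the example, since the abstract gives no numeric values and does not say how either classifier is built.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Block:
    """One logical block of a book, already split into title and body text."""
    title: str
    body: str


def find_srl(
    blocks: Sequence[Block],
    title_score: Callable[[str], float],
    body_score: Callable[[str], float],
    low: float = 0.3,          # hypothetical: at or below this, the title says "not the SRL"
    high: float = 0.7,         # hypothetical: at or above this, the title says "SRL"
    body_cutoff: float = 0.5,  # hypothetical cutoff for the body-text classifier
) -> Optional[int]:
    """Return the index of the first block judged to be the SRL, or None."""
    for index, block in enumerate(blocks):
        score = title_score(block.title)
        if score >= high:
            return index          # title alone is decisive
        if score <= low:
            continue              # title alone rules the block out
        # Indefinite title score: fall back to the body-text classifier.
        if body_score(block.body) >= body_cutoff:
            return index
    return None


def toy_title_score(title: str) -> float:
    """Stand-in for a trained title classifier (illustrative rules only)."""
    lowered = title.lower()
    if lowered.startswith("chapter"):
        return 0.9
    if lowered in {"title page", "copyright", "dedication"}:
        return 0.1
    return 0.5


def toy_body_score(body: str) -> float:
    """Stand-in for a trained body-text classifier (illustrative rule only)."""
    return 0.8 if "once upon a time" in body.lower() else 0.2


book = [
    Block("Title Page", "A Novel"),
    Block("Prologue", "Once upon a time, in a small village..."),
    Block("Chapter 1", "The road was long."),
]
srl_index = find_srl(book, toy_title_score, toy_body_score)
```

Here the "Prologue" title gets an indefinite score, so the body-text scorer decides; a block with a clearly introductory title is skipped without looking at its body at all.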

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: processing an electronic-book (“e-book”) to extract a plurality of blocks, each block constituting a logical entity within the e-book; categorizing text within each block, of the plurality of blocks, as corresponding to a title or to body-text of that block; determining a first set of title features relating to a first title of a first block of the plurality of blocks, including features based on a bag-of-words analysis of the e-book; providing the first set of title features to a title classifier; receiving, from the title classifier, a first title score for the first set of title features; determining, based on the first title score, that the first block is unlikely to be where a hypothetical person reading the e-book would begin reading; determining a second set of title features relating to a second title of a second block of the plurality of blocks, including features based on the bag-of-words analysis of the e-book; providing the second set of title features to the title classifier; receiving, from the title classifier, a second title score for the second set of title features; determining, based on the second title score, that further processing is required to determine whether or not the second block is likely to be where the hypothetical person reading the e-book would begin reading; determining a first set of body-text features relating to first body-text of the second block of the plurality of blocks, including features based on name-entity recognition; providing the first set of body-text features to a body-text classifier; receiving, from the body-text classifier, a first body-text score for the first set of body-text features; determining, based on the first body-text score, that the second block is likely to be where the hypothetical person reading the e-book would begin reading; and annotating metadata of the e-book with a start-of-reading location (“SRL”) indicator, wherein the SRL indicator indicates, to a computing device used to access the e-book, to output the second block upon initially opening the e-book.

2. The method of claim 1, further comprising: performing named-entity recognition on the e-book to generate a list of named entities within the e-book, and a frequency of occurrence of each of the named entities in the e-book, wherein named entities include at least names of people and place names; ranking the named entities based on their frequency of occurrence; and selecting a first named entity based on the ranking, wherein determining the first set of body-text features further includes: identifying occurrences of the first named entity in the body-text of the second block; and calculating a ratio of a number of occurrences of the first named entity in the body-text of the second block to a number of named entities in the list of named entities, the first set of body-text features including the ratio.

3. The method of claim 1, further comprising: comparing words in the body-text of the second block with a list of words that indicate that the second block is likely to occur prior to the SRL, each word in the list being associated with a weight; and calculating a sum of the weight of words in the list that correspond to words occurring in the body-text of the second block, wherein the first set of body-text features includes the sum.

4. A computing system comprising: at least one processor; and a memory including instructions operable to be executed by the at least one processor to configure the computing system to: process a first electronic document to determine a first block and a second block, each block constituting a logical entity within the first electronic document; categorize portions of the first block to identify a first title portion and a first body-text portion; determine a first plurality of features from the first block, wherein the first plurality of features relate, at least in part, to the first body-text portion; provide the first plurality of features from the first block to a first classifier, the first classifier to identify whether the first block is likely to be where a hypothetical person would begin reading the first electronic document; determine, based on a first score output by the first classifier in response to the first plurality of features, that the first block is not likely to be where the hypothetical person would begin reading the electronic document; categorize portions of the second block to identify a second title, a second title portion and a second body-text portion; determine a second plurality of features from a second block, wherein the second plurality of features relate, at least in part, to the second body-text portion; provide the second plurality of features to the first classifier; determine, based on a second score output by the first classifier in response to the second plurality of features, that the second block is likely to be where the hypothetical person would begin reading the first electronic document; and generate data for the first electronic document to indicate a start-of-reading location to a document output device, used to access the first electronic document, to output the second block upon initially opening the first electronic document.

5. The computing system of claim 4, the memory further comprises instructions that further configure the computing system to: determine a third plurality of features from the second block, the third plurality of features relating, at least in part, to the second title portion; provide the third plurality of features to a second classifier, the second classifier to identify whether the second block is likely to be where the hypothetical person would begin reading the first electronic document based on the second title portion; and determine, based on a third score output by the second classifier in response to the third plurality of features, that the second block is not likely to be where the hypothetical person would begin reading the first electronic document, wherein the third score is determined prior to the second plurality of features being provided to the first classifier.

6. The computing system of claim 5, the memory further comprises instructions that further configure the computing system to: process each document in a training set of documents to determine training blocks, each training block constituting a logical entity within the training set; categorize a portion of each training block as being a title portion of the training block; determine a frequency of occurrence of each words appearing in title portions of the training blocks; rank the words based on their frequency of occurrence; and select a set of the words based on their ranking, wherein the instructions to determine the third plurality of features further configure the computing device to: determine which of the words in the set occur in the second title portion and which of the words in the set do not occur in the second title portion, wherein the third plurality of features include an indication of how many of the words in the set do occur in the second title portion, and how many of the words in the set do not occur in the second title portion.

7. The computing system of claim 5, the memory further comprises instructions that further configure the computing system to: process each document in a set of documents to determine training blocks, wherein metadata for each document in the set includes an annotation indicating from where the document should be opened upon initially opening the respective document; categorize portions of each training block, categories for the portions including a title portion and a body-text portion; train the first classifi
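
The features recited in claims 2, 3 and 6 are simple enough to sketch in a few lines. The following is one possible reading of the claim language, not the patentee's code: all function names are hypothetical, the tokenizer is a crude regular expression, named-entity recognition is assumed to have already produced a `Counter` of entity frequencies, and "number of named entities in the list" is interpreted as the number of distinct entities.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def tokens(text: str) -> List[str]:
    """Lowercase word tokens; a deliberately simple stand-in tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def top_entity_ratio(entity_counts: Counter, block_body: str) -> float:
    """Claim 2 (one reading): occurrences of the top-ranked named entity in
    the block body, divided by the number of distinct entities in the book."""
    if not entity_counts:
        return 0.0
    top_entity, _ = entity_counts.most_common(1)[0]
    pattern = r"\b" + re.escape(top_entity) + r"\b"
    return len(re.findall(pattern, block_body)) / len(entity_counts)


def front_matter_weight_sum(block_body: str, weighted_words: Dict[str, float]) -> float:
    """Claim 3 (one reading): sum the weights of listed words that appear in
    the block body, counting each listed word once."""
    present = set(tokens(block_body))
    return sum(weight for word, weight in weighted_words.items() if word in present)


def select_title_vocabulary(training_titles: Iterable[str], k: int) -> List[str]:
    """Claim 6: rank words from training-set titles by frequency, keep the top k."""
    counts = Counter(word for title in training_titles for word in tokens(title))
    return [word for word, _ in counts.most_common(k)]


def title_vocabulary_counts(title: str, vocabulary: List[str]) -> Tuple[int, int]:
    """Claim 6: how many vocabulary words do, and do not, occur in the title."""
    present = set(tokens(title))
    hits = sum(1 for word in vocabulary if word in present)
    return hits, len(vocabulary) - hits


# Illustrative inputs only; none of these values come from the patent.
entities = Counter({"Alice": 40, "London": 10, "Bob": 5, "Paris": 5})
ratio = top_entity_ratio(entities, "Alice met Bob. Alice smiled.")
weight = front_matter_weight_sum(
    "All rights reserved. Published by Example Press.",
    {"rights": 1.5, "published": 1.0, "isbn": 2.0},
)
vocab = select_title_vocabulary(
    ["Chapter 1", "Chapter 2", "Prologue", "Chapter 3", "Prologue"], k=2
)
counts = title_vocabulary_counts("Chapter One", vocab)
```

Each function returns a plain number or tuple so that the results could be concatenated into a feature vector for whichever classifier is used; the claims do not specify the classifier type.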

Assignees

Amazon Tech Inc

Inventors

Classifications

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title

  • Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces · CPC title

  • Hierarchical processing, e.g. outlines · CPC title

  • G06F40/163 (Primary)

    Handling of whitespace · CPC title

  • Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10042880B1 cover?
A machine-learning system analyzes electronic books to determine a “start-of-reading location” (SRL) in each book. Based on this location, when an electronic book is opened on a reading device for the first time, the book can be opened to where a reader is likely to want to start reading, automatically skipping past introductory pages. Books are divided into logical blocks (e.g., title page, fo…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/163. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 7, 2018 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).