Scene text detection is typically defined as the process of detecting words or lines of text in digital imagery and estimating coordinates of boxes that tightly bound that text. Subsequently, the boxes are usually cropped from the image and processed by a text recognition engine, which identifies the individual characters in the text.

Recent advances in scene text detection

Detection of text in natural photographic imagery has historically been a challenging task, particularly in contrast to optical character recognition (OCR) for documents. In the latter case, the source documents often have standard fonts, few colors (mostly black and white), and straight rows of text, all of which simplify the task. Text in photographic imagery is far more varied, as demonstrated in the figure above.

Nevertheless, the advent of deep learning has enabled significant progress on this task. Region Proposal Networks (RPNs) trained on monolingual or multilingual text, combined with a text recognition algorithm, can correctly detect and recognize upwards of 80% of words on some benchmark test sets; see, for example, the Deep TextSpotter network [1]. Text detection and recognition capabilities enable a variety of applications, such as advanced street and location mapping (identifying business names, house numbers, etc.), inventory monitoring, security, and situational awareness. However, the aforementioned RPNs require significant processing power.

Implementation for low-power hardware

At Binary Cognition we’ve leveraged an alternative to RPNs for text detection, and made further adaptations that allow the network to be implemented on Google’s Edge TPU, a hardware platform with limited neural network architecture flexibility but exceptional processing speed and low power consumption for its form factor.

Google’s Edge TPU provides extremely energy-efficient neural network processing in a small package. It can be incorporated into embedded hardware, or added to an existing system via a USB dongle.

The Binary Cognition solution is based on the PixelLink [2] approach, in which the model classifies each pixel (or superpixel) as inside or outside of a text-containing region. It also makes separate classifications as to whether each immediately adjacent pixel (left, right, up, or down) belongs to the same text region (“linked”). After these classifications are made, optimal bounding boxes are determined straightforwardly for groups of predicted text pixels that are linked.
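As a rough illustration of that post-processing step, the Python sketch below thresholds per-pixel text scores, merges neighboring text pixels whose link predictions agree, and fits a minimum-area rectangle around each resulting group with OpenCV. This is our illustrative code, not the released PixelLink implementation; the array layouts, link-channel order, and threshold values are assumptions.

```python
import numpy as np
import cv2

def boxes_from_predictions(text_scores, link_scores,
                           text_thresh=0.7, link_thresh=0.5):
    """Group linked text pixels and fit a rotated box around each group.

    text_scores: (H, W) array of per-pixel text probabilities.
    link_scores: (H, W, 4) array of link probabilities toward the
                 left, right, up, and down neighbors (assumed order).
    """
    h, w = text_scores.shape
    text_mask = text_scores >= text_thresh

    # Union-find over the pixel grid.
    parent = np.arange(h * w)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Offsets matching the four link channels: left, right, up, down.
    offsets = [(0, -1), (0, 1), (-1, 0), (1, 0)]
    ys, xs = np.nonzero(text_mask)
    for y, x in zip(ys, xs):
        for c, (dy, dx) in enumerate(offsets):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and text_mask[ny, nx]
                    and link_scores[y, x, c] >= link_thresh):
                union(y * w + x, ny * w + nx)

    # Collect pixels by connected group and fit a minimum-area rectangle.
    groups = {}
    for y, x in zip(ys, xs):
        groups.setdefault(find(y * w + x), []).append((x, y))

    boxes = []
    for pts in groups.values():
        rect = cv2.minAreaRect(np.array(pts, dtype=np.float32))
        boxes.append(cv2.boxPoints(rect))  # four corner points per box
    return boxes
```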

We make three adjustments to the PixelLink model in order to allow for implementation on the Edge TPU.

  1. Replace softmax layers with single-valued outputs that are thresholded.
  2. Remove batch-norm layers after training by folding their parameters into those of the preceding convolutional layers (a folding sketch follows this list).
  3. Leverage depthwise separable convolutions like those of MobileNet [3] for memory and computational efficiency (Google engineers claim that separable convolutions will be supported on the Edge TPU in the near future); see the block sketch after this list.
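The batch-norm folding in adjustment 2 relies on a standard identity: at inference time, a convolution followed by batch norm with parameters (gamma, beta, mean, variance) is equivalent to a single convolution with per-channel rescaled weights and an adjusted bias. A minimal NumPy sketch is shown below; the tensor layout (output channels first) and the function name are assumptions for illustration.

```python
import numpy as np

def fold_batch_norm(conv_w, conv_b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the preceding convolution.

    conv_w: conv weights, shape (out_channels, ...); conv_b: (out_channels,)
    gamma, beta, mean, var: per-channel batch-norm parameters, (out_channels,)

    Uses the identity:
        gamma * (conv(x) + b - mean) / sqrt(var + eps) + beta
          = conv(x; w * scale) + (b - mean) * scale + beta,
      where scale = gamma / sqrt(var + eps).
    """
    scale = gamma / np.sqrt(var + eps)
    # Rescale each output channel's filter.
    folded_w = conv_w * scale.reshape(-1, *([1] * (conv_w.ndim - 1)))
    folded_b = (conv_b - mean) * scale + beta
    return folded_w, folded_b
```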
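For adjustment 3, a MobileNet-style depthwise separable block factors a standard convolution into a per-channel 3x3 depthwise convolution followed by a 1x1 pointwise convolution. The Keras sketch below shows one such block; the filter counts, strides, and activation choices are placeholders rather than our exact network configuration.

```python
import tensorflow as tf

def separable_block(x, out_channels, stride=1):
    """MobileNet-style depthwise separable block: 3x3 depthwise conv,
    then 1x1 pointwise conv, each followed by batch norm and ReLU.
    (The batch norms are folded away after training, as in adjustment 2.)"""
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride,
                                        padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(out_channels, 1, padding="same",
                               use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)
```

Because the pointwise convolution handles the cross-channel mixing, this factorization cuts the multiply-accumulate count by roughly the kernel area (about 8-9x for 3x3 kernels at typical channel widths) relative to a standard convolution of the same shape.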

Voilà! Highly efficient, low-energy text detection can now easily be supported on any device that has a camera and a USB port. The same capability will soon be available for a variety of embedded hardware.

REFERENCES

  1. Busta et al. (2017). Deep TextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework. International Conference on Computer Vision (ICCV) 2017.
  2. Deng et al. (2018). PixelLink: Detecting Scene Text via Instance Segmentation. http://arxiv.org/abs/1801.01315.
  3. Howard et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. http://arxiv.org/abs/1704.04861.