Optical Character Recognition (OCR) Using R

OCR converts images of text into machine-readable text, with applications ranging from receipt scanning to license plate recognition. In this tutorial, we walk through the process, syntax, and examples of performing Optical Character Recognition in the R programming language using the tesseract and magick packages.

Optical Character Recognition

OCR stands for Optical Character Recognition. It is the process of converting an image of text into text that a computer can read. An OCR engine scans the image and extracts the text, which we can then store in a string variable. OCR is used in receipt and cheque readers, code scanners, license plate scanners, and numerous other applications.

The libraries used will be:

  • tesseract: R bindings to Tesseract, a neural network (LSTM) based OCR engine used for text recognition.
  • magick: An image processing library for R. We use it to read and display images, and ocr() accepts the image objects it creates.

The tesseract package provides R bindings to Tesseract, a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable, so its detection algorithms can be tuned to obtain the best possible results.
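As a minimal sketch of what configuring the engine can look like, the snippet below builds an engine restricted to digits and lists the parameters related to whitelisting. It assumes the tesseract package and the English training data are installed; the name digits_engine is only illustrative.

```r
library(tesseract)

# Languages whose training data is installed locally
tesseract_info()$available

# Engine limited to digits (digits_engine is an illustrative name)
digits_engine <- tesseract(
  language = "eng",
  options = list(tessedit_char_whitelist = "0123456789")
)

# Browse tunable parameters whose name matches "whitelist"
tesseract_params("whitelist")
```

The engine can then be passed to an OCR call, for example ocr(img, engine = digits_engine). Training data for other languages can be fetched with tesseract_download(), for instance tesseract_download("fra") for French.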


To perform Optical Character Recognition, we simply call the ocr() function and pass it the image file.

text <- ocr(pngfile)

The ocr() function takes the PNG file and extracts the text using a pre-trained model.

Example 1: Reading text from an Image

Step 1: Install and load the libraries:


install.packages(c("tesseract", "magick"))
library(tesseract)
library(magick)

Step 2: Load an image from a URL or file storage.


# Reading the image
img <- image_read('https://media.w3wiki.net/wp-content/uploads/20190328185307/gfg28.png')
# Display
print(img)


Step 3: Apply the OCR method on it.


text <- ocr(img)
# extracted text
print(text)

[1] "w3wiki\nA computer science portal for Beginner\n"
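The result is a single string in which lines are separated by "\n". A small sketch of splitting it into individual lines with base R follows; the string is copied from the output above so the snippet runs on its own, and text_lines is just an illustrative name.

```r
# Value shown in the output above
text <- "w3wiki\nA computer science portal for Beginner\n"

# Split on newlines; strsplit() returns a list with one element per input string
text_lines <- strsplit(text, "\n", fixed = TRUE)[[1]]

# Trim stray whitespace and drop empty lines
text_lines <- trimws(text_lines)
text_lines <- text_lines[nzchar(text_lines)]

print(text_lines)
length(text_lines)
```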

Example 2: Converting text from PDF.

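Here we need to convert the PDF pages into PNG images and then perform OCR on them. The pdf_convert() function from the pdftools package (install it with install.packages("pdftools") if needed) does the conversion. The syntax is as follows: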

pngfile <- pdftools::pdf_convert('https://www.africau.edu/images/default/sample.pdf', dpi = 600)

Here is the full code:


# fetching text from pdf
pngfile <- pdftools::pdf_convert('https://www.africau.edu/images/default/sample.pdf', dpi = 600)
text <- ocr(pngfile)
cat(text)

Converting page 1 to sample_1.png... done!
Converting page 2 to sample_2.png... done!
This is a small demonstration .pdf file -
just for use in the Virtual Mechanics tutorials. More text. And more
text. And more text. And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. And more text. Boring, zzzzz. And more text. And more text. And
more text. And more text. And more text. And more text. And more text.
And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. And more text. And more text. Even more. Continued on page 2 ...
 Simple PDF File 2
...continued from page 1. Yet more text. And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. Oh, how boring typing this stuff. But not as boring as watching
paint dry. And more text. And more text. And more text. And more text.
Boring. More, a little more text. The end, and just as well.
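Because ocr() returns a character vector with one element per input image, text[1] holds page 1 and text[2] holds page 2. A minimal sketch of saving each page to its own text file follows. The two short strings are stand-ins for the real OCR result so that the snippet runs without the PDF, and the output location (tempdir()) is an arbitrary choice.

```r
# Stand-ins for the objects created above (assumed shapes, shortened content)
pngfile <- c("sample_1.png", "sample_2.png")
text <- c("This is a small demonstration .pdf file\n",
          "...continued from page 1.\n")

# One .txt file per page, named after the PNG it came from
txtfile <- file.path(tempdir(), sub("\\.png$", ".txt", pngfile))
for (i in seq_along(text)) {
  cat(text[i], file = txtfile[i])
}

# Read one page back to check
readLines(txtfile[2])
```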

Text Localization in OCR

Next, we get the position of each recognized word and draw a bounding box around it.

To get the bounding box data, we call the ocr_data() function on the image.

bound_box = ocr_data(img)

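Step 1: Install and load the libraries. The grid package ships with R, so it only needs to be loaded.


install.packages(c("tesseract", "magick", "ggplot2"))
library(tesseract); library(magick); library(ggplot2); library(grid)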

Step 2: Load the image and generate the bounding box data. The ocr_data() function takes an image and returns a data frame with one row per recognized word, giving the word, its confidence score, and its bounding box. The bbox column stores the box as a string of comma-separated coordinates (x1, y1, x2, y2), which we split in the next step. The result is stored in the bound_box variable.


# Load the image
img <- image_read('https://media.w3wiki.net/wp-content/uploads/20190328185307/gfg28.png')
# Get each word with its confidence and bounding box
bound_box <- ocr_data(img)

Step 3: Split the bbox strings on the comma and convert the four values from character to numeric, saving them as xmin, ymin, xmax and ymax respectively.


# Make sure we are working with a plain data frame
bound_box <- as.data.frame(bound_box)
# Split each bbox string into its four coordinates
bound_box$bbox <- strsplit(bound_box$bbox, ",")
# Store the coordinates as numeric columns
bound_box$xmin <- sapply(bound_box$bbox, function(x) as.numeric(x[1]))
bound_box$ymin <- sapply(bound_box$bbox, function(x) as.numeric(x[2]))
bound_box$xmax <- sapply(bound_box$bbox, function(x) as.numeric(x[3]))
bound_box$ymax <- sapply(bound_box$bbox, function(x) as.numeric(x[4]))


      word confidence            bbox
1   w3wiki   92.04797     5,15,661,96
2        A   96.76034   48,124,71,150
3 computer   96.31223  82,126,237,158
4  science   96.52452 248,123,362,150
5   portal   96.56268 376,122,466,158
6      for   96.14149 480,122,524,150
7 Beginner   96.14149 536,122,626,158
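The same parsing can also be done in one vectorized step. The sketch below retypes the first three rows of the table above by hand so that it runs on its own; with real data, bound_box would come from ocr_data(img). The width and height columns are additions for illustration.

```r
# First three rows of the table above, typed in by hand
bound_box <- data.frame(
  word = c("w3wiki", "A", "computer"),
  confidence = c(92.04797, 96.76034, 96.31223),
  bbox = c("5,15,661,96", "48,124,71,150", "82,126,237,158"),
  stringsAsFactors = FALSE
)

# Split every "x1,y1,x2,y2" string and stack the pieces into a matrix
coords <- do.call(rbind, strsplit(bound_box$bbox, ",", fixed = TRUE))
storage.mode(coords) <- "double"
colnames(coords) <- c("xmin", "ymin", "xmax", "ymax")

# Append the numeric columns and derive box sizes in pixels
bound_box <- cbind(bound_box, coords)
bound_box$width <- bound_box$xmax - bound_box$xmin
bound_box$height <- bound_box$ymax - bound_box$ymin

print(bound_box[, c("word", "xmin", "ymin", "xmax", "ymax", "width", "height")])
```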

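Step 4: Plot the image with the bounding boxes. Tesseract measures y from the top of the image, whereas ggplot2 measures it from the bottom, so the y values are flipped using the image height.


# Plot image with bounding boxes
info <- image_info(img)  # image width and height in pixels
ggplot() +
  annotation_custom(rasterGrob(img, width = unit(1, "npc"), height = unit(1, "npc")),
                    xmin = 0, xmax = info$width, ymin = 0, ymax = info$height) +
  geom_rect(data = bound_box, aes(xmin = xmin, xmax = xmax, ymin = info$height - ymax, ymax = info$height - ymin), color = "red", fill = NA) +
  geom_text(data = bound_box, aes(x = (xmin+xmax)/2, y = info$height - ymin + 10, label = word), color = "red", size = 3) +
  coord_fixed(xlim = c(0, info$width), ylim = c(0, info$height), expand = FALSE) +
  theme_void()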
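The boxes can also be drawn without ggplot2. The following is a rough sketch using image_draw() from magick, which opens a graphics device on the image with pixel coordinates measured from the top-left corner, the same convention the bbox values use. A blank white canvas (an assumed 700 x 180 pixels) stands in for the real image so the snippet runs offline, and two rows from the table above are typed in by hand.

```r
library(magick)

# Stand-in canvas; with the real data, use the image returned by image_read()
img <- image_blank(width = 700, height = 180, color = "white")

# Two boxes copied from the table above
boxes <- data.frame(
  word = c("w3wiki", "computer"),
  xmin = c(5, 82), ymin = c(15, 126),
  xmax = c(661, 237), ymax = c(96, 158)
)

# Open a drawing device on top of the image (pixel coordinates, origin top-left)
annotated <- image_draw(img)
rect(boxes$xmin, boxes$ymin, boxes$xmax, boxes$ymax, border = "red", lwd = 2)
text(boxes$xmin, boxes$ymin, labels = boxes$word, col = "red", adj = c(0, 0))
dev.off()

print(annotated)
```

Since the device already uses image coordinates, no y flip is needed here, which makes this route convenient for a quick visual check.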


Advantages of OCR

  1. Makes PDFs and images searchable by their text.
  2. Digitizes paper records.
  3. Converts printed text in images, and to a lesser extent handwritten text, into strings.

Disadvantages of OCR

  1. Not all text is always converted, and the output is prone to errors depending on the quality of the image.
  2. Handwritten text often gives poor results because handwriting varies so much from person to person.
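One way to act on the first point is to use the confidence column that ocr_data() returns and set aside words the engine was unsure about. The sketch below uses hand-typed rows in place of real OCR output; the last row and the cutoff of 80 are made-up values for illustration, not recommendations.

```r
# Hand-typed stand-in for ocr_data() output; the last row is invented
words <- data.frame(
  word = c("w3wiki", "computer", "sc1ence"),
  confidence = c(92.04797, 96.31223, 41.5),
  stringsAsFactors = FALSE
)

min_confidence <- 80   # arbitrary cutoff, tune per document

# Keep confident words; set the rest aside for manual review
reliable <- words[words$confidence >= min_confidence, ]
doubtful <- words[words$confidence < min_confidence, ]

paste(reliable$word, collapse = " ")
doubtful$word
```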


In conclusion, Optical Character Recognition in R makes it possible to extract text from a variety of sources. The tesseract and magick packages work together to read text from images, and with pdftools the same approach extends to PDFs. Accuracy still depends on image quality, and handwritten text remains a challenge.

