Extract data from PDF file using UiPath and Python

Read Data from PDF/Image Using UiPath & Python

In last month’s blog post, we learned how to use different OCR engines with UiPath for Optical Character Recognition (OCR). In that post, we applied six different OCR engines and evaluated their performance on a very small set of example images and PDF files.

As our results demonstrated, most of the cloud providers performed better than the traditionally available OCR tools.

However, many readers have reached out to ask why we can’t use the power of Python to read images/PDFs in UiPath instead of using the cloud variants of ABBYY, Microsoft Vision API or Google Vision API.

Nevertheless, it’s important to understand how OCR works with Python. Almost all cloud OCR engine providers offer an SDK for Python.

Even if you use Python instead of the activities provided by UiPath, it works in a similar fashion:

You need to read the PDF/image
You need to pass the image to the engine
The engine will return data in a structured format

Then why do some people prefer Python for OCR? The reason is preprocessing of the image before it is passed to the engine and post-processing of the data received from the engine. The extracted text can then be subjected to any amount of further processing for additional tasks.

Let’s think about it: OCR as a process consists of several sub-processes that together aim to be as accurate as possible. These sub-processes are:

Preprocessing of the Image
Text Localization
Character Segmentation
Character Recognition
Post Processing

If you wish to read more about how OCR works, see the links provided in the References section.

In the remainder of this blog post, we’ll learn to work with Tesseract OCR + Python and integrate the same Python script into UiPath.

By the end of the tutorial, you’ll be able to convert the text in an image/PDF to a Python string and then use the Python script inside UiPath to post-process the data however you wish.

Just keep reading…

Prerequisites –

Install pytesseract (pip install pytesseract), a Python wrapper on top of Tesseract
OpenCV (the cv2 module, pip install opencv-python) can also be used with Tesseract for better image processing
Add the UiPath.Python.Activities package to your project dependencies in UiPath and set the Python path
Install and configure the Tesseract binary on the same machine where this script will be executed
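
Before wiring anything into UiPath, it can help to confirm that the Python interpreter you plan to point UiPath at can actually see these packages. The snippet below is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical, and the module names are the import names used by the script later in this post.

```python
import importlib.util


def missing_modules(names):
    """Return the import names from `names` that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used by the OCR script: pytesseract, cv2 (OpenCV), PIL (Pillow)
required = ["pytesseract", "cv2", "PIL"]
missing = missing_modules(required)
if missing:
    print("Missing Python packages (import names):", ", ".join(missing))
else:
    print("All required Python packages are importable.")
```

Run it with the same interpreter that the UiPath Python path points to, since packages installed for a different interpreter will not be visible to the workflow.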

Sample Example –

For better understanding, this post has been divided into two parts –

OCR Python code that takes an image as input and returns the relevant data as text for further processing
UiPath workflow that uses the Python activities and the OCR Python code (written in step 1)

Python Code for OCR (Say UiPathOCR.py)

# import the necessary packages
import os

import cv2
import pytesseract
from PIL import Image


def ocr(image_path, preprocess):
    """
    Run Tesseract OCR on an image after optional preprocessing.

    :param image_path: path to the input image
    :param preprocess: "thresh" for Otsu thresholding, "blur" for median
        blurring; any other value applies no extra preprocessing
    :return: the recognized text as a string
    """
    # load the example image and convert it to grayscale
    image = cv2.imread(image_path)
    if image is None:
        # cv2.imread returns None instead of raising when the file cannot be read
        raise FileNotFoundError("Could not read image: {}".format(image_path))
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # apply thresholding to separate text from background, if requested
    if preprocess == "thresh":
        gray = cv2.threshold(gray, 0, 255,
                             cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    # apply median blurring to remove noise, if requested
    elif preprocess == "blur":
        gray = cv2.medianBlur(gray, 3)
    # write the grayscale image to disk as a temporary file so we can
    # apply OCR to it
    filename = "{}.png".format(os.getpid())
    cv2.imwrite(filename, gray)
    # You might need to set the path for tesseract in case it's not in your system path, like below
    # pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    try:
        # load the image as a PIL/Pillow image and apply OCR
        with Image.open(filename) as temp_image:
            text = pytesseract.image_to_string(temp_image)
    finally:
        # delete the temporary file even if OCR fails
        os.remove(filename)
    return text
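
To make the "thresh" option less of a black box, here is a small pure-Python sketch of the idea behind Otsu thresholding: pick the cut-off that best separates dark pixels from light ones, then turn every pixel black or white. It is an illustration only, not OpenCV's implementation; the function names otsu_threshold and binarize are hypothetical, and the input is assumed to be a flat list of 8-bit grayscale values.

```python
def otsu_threshold(pixels):
    """Return the threshold that maximizes between-class variance (Otsu's idea)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their intensities
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is background; nothing left to separate
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to 255 (white) and the rest to 0 (black)."""
    return [255 if p > threshold else 0 for p in pixels]


# Three dark "ink" pixels and three light "paper" pixels
sample = [10, 12, 11, 200, 205, 210]
t = otsu_threshold(sample)
print(t, binarize(sample, t))
```

A clean black-and-white image like this tends to be easier for the recognition step than a grey, unevenly lit scan, which is why thresholding is a common first thing to try.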

UiPath workflow

Drag the “Python Scope” activity into the designer panel and set the required parameters.

Path: the folder where Python is installed, e.g. “C:\Python36”
Version: the Python version, e.g. 3.6

Drag the “Load Python Script” activity into the designer panel to load the Python script (UiPathOCR.py) and supply the parameters below.

File: the path to your Python script
Result: a variable of type PythonObject

Drag the “Invoke Python Method” activity into the designer panel and supply the name of the function we want to invoke, along with its required arguments.

Function Name: ocr
Input Parameters: two arguments, i.e. 1. image_path: “Data\invoice-sample.jpg” and 2. preprocess: “None”, supplied together in braces, e.g. {“Data\invoice-sample.jpg”, “None”}

Step 3 Invoke Python OCR Script With Input

Drag the “Get Python Object” activity into the designer panel to convert the Python object obtained from the above activity to our desired data type (in our case, it’s the “dictionary” type).
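
If the type conversion gives you trouble, one option is to have Python hand back a plain JSON string and unpack it on the UiPath side. The sketch below assumes you add a small helper to the script; build_payload is a hypothetical name, and the idea of reading the result as a String and then deserializing it in UiPath (for example with a JSON deserialization activity) is an assumption about your setup rather than something this workflow requires.

```python
import json


def build_payload(text):
    """Package raw OCR text as a JSON string (hypothetical helper)."""
    # Keep only non-empty lines, trimmed of surrounding whitespace
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Split each line on whitespace to get individual words
    words = [word for line in lines for word in line.split()]
    return json.dumps({"text": text, "lines": lines, "words": words})


# Stand-in for the string returned by the OCR function
sample_text = "Invoice # 00001\n\nAmount Due: $4,170\n"
print(build_payload(sample_text))
```

A string is about the simplest thing to pass between the two environments, and the JSON structure keeps the raw text, the lines and the words available without re-running OCR.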

Step 4 Save Python OCR Result to Your Datatype

Finally, when you execute the workflow, you will receive the result in your desired variable, as below –

[”, ”, ”, ”, ‘’’, ‘Invoice’, ”, ”, ”, ‘Your’, ‘Company’, ‘LLC’, ‘Address’, ‘123,’, ‘State,’, ‘My’, ‘Country’, ‘P’, ‘111-222-333,’, ‘F’, ‘111-222-334’, ”, ”, ”, ‘BILL’, ‘TO:’, ”, ‘John’, ‘Doe’, ”, ”, ‘Alpha’, ‘Bravo’, ‘Road’, ’33’, ”, ”, ‘P:’, ‘111-222-338,’, ‘F:’, ‘111-222-334’, ”, ‘client@example.net’, ”, ”, ”, ‘SHIPPING’, ‘TO:’, ”, ‘John’, ‘Doe’, ‘Office’, ”, ”, ‘Office’, ‘Road’, ’38,’, ”, ”, ‘P:’, ‘111-383-222,’, ‘F:’, ‘122-222-834’, ”, ‘office@example.net’, ”, ”, ”, ‘http://mrsinvoice.com’, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘Invoice’, ‘#’, ‘00001’, ”, ‘Invoice’, ‘Date’, ’12/12/2001′, ”, ‘Name’, ‘of’, ‘Rep.’, ‘Bob’, ”, ”, ‘Contact’, ‘Phone’, ‘101-102-103’, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘Payment’, ‘Terms’, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘Cash’, ‘on’, ‘Delivery’, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘Amount’, ‘Due:’, ‘$4,170’, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘ ‘, ”, ”, ”, ‘NO’, ‘PRODUCTS’, ‘/’, ‘SERVICE’, ‘QUANTITY’, ‘/’, ‘RATE’, ‘/’, ‘UNIT’, ‘AMOUNT’, ”, ‘HOURS.’, ‘PRICE’, ”, ”, ‘1’, ‘tye’, ‘2’, ‘$20’, ‘$40’, ”, ”, ‘2__|’, ‘Steering’, ‘Wheel’, ‘5’, ‘$10’, ‘$50’, ”, ”, ‘3’, ‘|’, ‘Engine’, ‘oil’, ’10’, ‘$15’, ‘$150’, ”, ”, ‘4’, ‘|’, ‘Brake’, ‘Pad’, ’24’, ‘$1000’, ‘$2,400’, ”, ”, ‘Subtotal’, ‘$275’, ”, ”, ‘Tax’, ‘(10%)’, ‘$27.5’, ”, ”, ‘Grand’, ‘Total’, ‘$302.5’, ”, ”, ”, ‘‘THANK’, ‘YOU’, ‘FOR’, ‘YOUR’, ‘BUSINESS.’]
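
A token list like this is noisy, so some post-processing is usually the next step. The following is a minimal sketch of one way to do it in Python: drop the empty tokens, then pick out the value that follows a known label. The helper names (clean_tokens, value_after, parse_amount) are hypothetical, and the short token list is a hand-copied subset of the output above.

```python
import re


def clean_tokens(tokens):
    """Drop empty and whitespace-only tokens and trim the rest."""
    return [t.strip() for t in tokens if t.strip()]


def value_after(tokens, *label):
    """Return the token that follows the given label sequence, or None."""
    n = len(label)
    for i in range(len(tokens) - n):
        if tuple(tokens[i:i + n]) == label:
            return tokens[i + n]
    return None


def parse_amount(token):
    """Convert a token such as '$4,170' to a float; return None otherwise."""
    match = re.fullmatch(r"\$?(\d[\d,]*(?:\.\d+)?)", token)
    return float(match.group(1).replace(",", "")) if match else None


# Subset of the OCR output shown above
tokens = ['', '', 'Invoice', '', 'Invoice', '#', '00001', '', 'Invoice',
          'Date', '12/12/2001', ' ', '', 'Amount', 'Due:', '$4,170', '']

clean = clean_tokens(tokens)
invoice_number = value_after(clean, 'Invoice', '#')
amount_due = parse_amount(value_after(clean, 'Amount', 'Due:'))
print(invoice_number, amount_due)
```

Label-based lookups like this only work when the layout is stable, so for invoices from many different vendors you would likely need something more tolerant.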

Summary

Today we learned how to use Tesseract on our machines with UiPath, the first part in a two-part series on using Tesseract for OCR.

Use a Python script with the Tesseract binary to apply OCR to input images.
Then use UiPath to invoke the Python script and perform the task.

However, if you compare the result of this OCR with that of a cloud-based engine, the result is poor. For better accuracy, we can train a custom machine learning model to recognize characters in our specific use case.

We can also use OpenCV (cv2) with Tesseract for better pre-processing of the image before applying OCR.

Tesseract is best suited to high-resolution inputs, such as the sample invoice PDF, and well-formatted images with a clear background.

Next week we’ll learn how to use UiPath with ServiceNow, so stay tuned.

Notes –

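You need to install the UiPath.Python.Activities package
You might get pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it’s not in your path. This means the Tesseract binary is missing or not on your PATH; install it, or set pytesseract.pytesseract.tesseract_cmd to its full path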
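
To catch this before the workflow runs, you can check up front whether the Tesseract binary can be located. This is a minimal sketch using only the standard library; resolve_tesseract_cmd is a hypothetical helper, and the fallback path is the Windows install location already mentioned in the script's comments, which may differ on your machine.

```python
import os
import shutil


def resolve_tesseract_cmd(fallback_paths=()):
    """Return a usable path to the tesseract binary, or None if not found."""
    # First look on the system PATH
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise try explicit install locations supplied by the caller
    for path in fallback_paths:
        if os.path.isfile(path):
            return path
    return None


cmd = resolve_tesseract_cmd([r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
if cmd is None:
    print("Tesseract binary not found; install it or add it to PATH.")
else:
    print("Tesseract found at:", cmd)
    # With pytesseract installed, point it at the binary explicitly:
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = cmd
```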

References –

Tesseract installer binary for Windows – https://digi.bib.uni-mannheim.de/tesseract/
OCR with Tesseract – https://nanonets.com/blog/ocr-with-tesseract/
Deep Learning based Text Recognition (OCR) using Tesseract and OpenCV – https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/