Extract data from PDF file using UiPath and Python

Satish Prasad

Read Data from PDF/Image Using UiPath & Python

In last month's blog post, we learned how to use different OCR engines with UiPath for Optical Character Recognition (OCR). In the same post, we applied six different OCR engines to a very small set of example images and PDF files to test and evaluate their performance.

As our results demonstrated, most of the cloud providers performed better than the traditional OCR tools.

However, many readers have reached out to me and asked why we can't use the power of Python to read an image/PDF in UiPath instead of using the cloud variants of ABBYY, the Microsoft Vision API, or the Google Vision API.

Nevertheless, it's important to understand how OCR works with Python. You will see that almost all cloud OCR engine providers offer an SDK for Python.

Even if you use Python instead of the activities provided by UiPath, it works in a similar fashion:

  • You need to read the PDF/image
  • You need to pass the image to the engine
  • The engine will return data in a structured format

Then why do some people prefer Python for OCR? The reason is the preprocessing of the image before it is passed to the engine and the post-processing of the data received from the engine. The extracted text can then be subjected to any amount of further processing for additional tasks.

Let's think about it: OCR as a process consists of several sub-processes, each of which should perform as accurately as possible. These sub-processes are:

  • Preprocessing of the Image
  • Text Localization
  • Character Segmentation
  • Character Recognition
  • Post Processing
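To make the preprocessing step more concrete, here is a minimal pure-Python sketch of Otsu thresholding, one common way to turn a grayscale image into black and white before recognition. The function names otsu_threshold and binarize are illustrative, and the image is assumed to be a flat list of gray values from 0 to 255; in practice a library such as OpenCV would do this work.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark from light pixels.

    Pixels with a value <= the returned level count as the dark class.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no pixels in the dark class yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when the two classes are far apart.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map every pixel above the threshold to white (255), the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]


# Hypothetical image: dark ink values around 10 and light paper values around 200.
sample_pixels = [10] * 50 + [12] * 50 + [200] * 60 + [210] * 40
level = otsu_threshold(sample_pixels)
black_and_white = binarize(sample_pixels, level)
```

The threshold is chosen from the image's own histogram, so no hand-tuned cut-off is needed for each scan.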

If you wish to read more about how OCR works, you can follow the links provided in the References section.

In the remainder of this blog post, we'll learn to work with Tesseract OCR + Python and to integrate the same Python script into UiPath.

By the end of the tutorial, you'll be able to convert the text in an image/PDF to a Python string and then use the Python script inside UiPath to post-process the data as you wish!

Just keep reading...

Prerequisites -

  1. Install pytesseract (pip install pytesseract), a Python wrapper on top of Tesseract.
  2. OpenCV (the cv2 module, pip install opencv-python) can also be used with Tesseract for better image processing; the script in this post imports it.
  3. Add the UiPath.Python.Activities package to your project dependencies in UiPath and set the Python path.
  4. Install and configure the Tesseract binary on the same machine where this script will be executed.
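Before wiring anything into UiPath, it can help to confirm that the pieces listed above are actually visible to Python. The following is a minimal sketch using only the standard library; the function name check_prerequisites and the report layout are illustrative, and it only checks that modules and the binary can be found, not that they work.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("pytesseract", "cv2", "PIL"), binary="tesseract"):
    """Report which Python modules and which OCR binary can be located."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    # shutil.which looks the executable up on the system PATH.
    report[binary + " (binary)"] = shutil.which(binary) is not None
    return report


if __name__ == "__main__":
    for item, found in check_prerequisites().items():
        print("{:<20} {}".format(item, "OK" if found else "MISSING"))
```

Run it with the same interpreter that UiPath will use, since each Python installation has its own set of installed packages.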

Sample Example -

For better understanding, this post has been divided into two parts -

  1. OCR Python code that takes an image as input and returns the relevant data as text for further processing
  2. UiPath workflow that uses the Python activities and the OCR Python code (written in step 1)

Python Code for OCR (say, UiPathOCR.py)

# import the necessary packages
import os

import cv2
import pytesseract
from PIL import Image


def ocr(image_path, preprocess):
    """
    Run Tesseract OCR on an image after optional preprocessing.

    :param image_path: path to the input image file
    :param preprocess: "thresh" to apply Otsu thresholding, "blur" to apply
        a median blur; any other value leaves the grayscale image unchanged
    :return: the recognized text as a Python string
    """
    # load the example image and convert it to grayscale
    image = cv2.imread(image_path)
    if image is None:
        # cv2.imread returns None instead of raising when it cannot read the file
        raise FileNotFoundError("Could not read image: {}".format(image_path))
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # check to see if we should apply thresholding to preprocess the image
    if preprocess == "thresh":
        gray = cv2.threshold(gray, 0, 255,
                             cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    # make a check to see if median blurring should be done to remove noise
    elif preprocess == "blur":
        gray = cv2.medianBlur(gray, 3)

    # write the grayscale image to disk as a temporary file so we can
    # apply OCR to it
    filename = "{}.png".format(os.getpid())
    cv2.imwrite(filename, gray)

    # You might need to set the path for tesseract in case it is not in your
    # system path, like below
    # pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

    # load the image as a PIL/Pillow image, apply OCR, and then delete
    # the temporary file even if OCR fails
    try:
        with Image.open(filename) as temp_image:
            text = pytesseract.image_to_string(temp_image)
    finally:
        os.remove(filename)
    return text
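The ocr function above reads its input with cv2.imread, which opens image files but not PDFs. One possible way to cover PDFs is to render each page to an image first and then reuse the same function. The sketch below assumes the third-party pdf2image package (which relies on the Poppler tools) is installed; that package is not part of the original script, and the name ocr_pdf and its parameters are illustrative.

```python
import os
import tempfile


def ocr_pdf(pdf_path, ocr_func, render_pages=None, dpi=300):
    """OCR every page of a PDF by rendering the pages to PNG files first.

    ocr_func is any callable that takes an image path and returns text,
    for example: lambda path: ocr(path, "thresh")
    render_pages is a callable returning page images with a save() method;
    it defaults to pdf2image and exists mainly so the renderer can be swapped.
    """
    if render_pages is None:
        # Assumption: pdf2image and Poppler are installed on this machine.
        from pdf2image import convert_from_path

        def render_pages(path):
            return convert_from_path(path, dpi=dpi)

    page_texts = []
    with tempfile.TemporaryDirectory() as work_dir:
        for number, page in enumerate(render_pages(pdf_path), start=1):
            image_path = os.path.join(work_dir, "page_{}.png".format(number))
            page.save(image_path, "PNG")  # pdf2image pages are PIL images
            page_texts.append(ocr_func(image_path))
    # The temporary directory and its page images are removed at this point.
    return "\n".join(page_texts)
```

A call such as ocr_pdf("Data\\invoice-sample.pdf", lambda path: ocr(path, "thresh")) would then return the text of all pages joined by newlines; the file name here is a hypothetical example.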

UiPath workflow

Drag the "Python Scope" activity into the designer panel and set the required parameters.

  • Path: the folder where Python is installed on the system, e.g. "C:\Python36"
  • Version: the Python version, e.g. 3.6

Step 1: Use Python Scope Activity
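If you are unsure which values to enter for Path and Version, one option is to ask the interpreter itself. This is a small sketch using only the standard library; the function name python_scope_settings is illustrative, and it assumes you run it with the same Python installation you intend to point UiPath at.

```python
import os
import sys


def python_scope_settings():
    """Return the folder and major.minor version of the running interpreter."""
    return {
        "path": os.path.dirname(sys.executable),
        "version": "{}.{}".format(sys.version_info.major, sys.version_info.minor),
    }


if __name__ == "__main__":
    settings = python_scope_settings()
    print("Path:   ", settings["path"])
    print("Version:", settings["version"])
```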

Drag the "Load Python Script" activity into the designer panel to load the Python script (UiPathOCR.py) and supply the parameters below.

  • File: your Python script path
  • Result: create a PythonObject type variable

Step 2: Load Python OCR Script

Drag the "Invoke Python Method" activity into the designer panel and supply the function name that we want to invoke along with its required arguments.

  • Function Name: ocr
  • Arguments: two values, image_path ("Data\invoice-sample.jpg") and preprocess (""), supplied as a collection {"", "None"}. Any preprocess value other than "thresh" or "blur" skips the extra preprocessing.

Step 3: Invoke Python OCR Script With Input

Drag the "Get Python Object" activity into the designer panel to convert the Python object obtained from the above activity to our desired data type (in our case, a "dictionary" type).

Step 4: Save Python OCR Result to Your Datatype

Finally, when you execute the workflow, you will receive the result in your desired variable as below -

['', '', '', '', ''', 'Invoice', '', '', '', 'Your', 'Company', 'LLC', 'Address', '123,', 'State,', 'My', 'Country', 'P', '111-222-333,', 'F', '111-222-334', '', '', '', 'BILL', 'TO:', '', 'John', 'Doe', '', '', 'Alpha', 'Bravo', 'Road', '33', '', '', 'P:', '111-222-338,', 'F:', '111-222-334', '', 'client@example.net', '', '', '', 'SHIPPING', 'TO:', '', 'John', 'Doe', 'Office', '', '', 'Office', 'Road', '38,', '', '', 'P:', '111-383-222,', 'F:', '122-222-834', '', 'office@example.net', '', '', '', 'http://mrsinvoice.com', '', '', '', ' ', '', '', '', ' ', '', '', '', ' ', '', '', '', ' ', '', '', '', ' ', '', '', '', 'Invoice', '#', '00001', '', 'Invoice', 'Date', '12/12/2001', '', 'Name', 'of', 'Rep.', 'Bob', '', '', 'Contact', 'Phone', '101-102-103', '', '', '', ' ', '', '', '', 'Payment', 'Terms', '', '', '', ' ', '', '', '', ' ', '', '', '', 'Cash', 'on', 'Delivery', '', '', '', ' ', '', '', '', ' ', '', '', '', 'Amount', 'Due:', '$4,170', '', '', '', ' ', '', '', '', ' ', '', '', '', ' ', '', '', '', ' ', '', '', '', 'NO', 'PRODUCTS', '/', 'SERVICE', 'QUANTITY', '/', 'RATE', '/', 'UNIT', 'AMOUNT', '', 'HOURS.', 'PRICE', '', '', '1', 'tye', '2', '$20', '$40', '', '', '2__|', 'Steering', 'Wheel', '5', '$10', '$50', '', '', '3', '|', 'Engine', 'oil', '10', '$15', '$150', '', '', '4', '|', 'Brake', 'Pad', '24', '$1000', '$2,400', '', '', 'Subtotal', '$275', '', '', 'Tax', '(10%)', '$27.5', '', '', 'Grand', 'Total', '$302.5', '', '', '', ''THANK', 'YOU', 'FOR', 'YOUR', 'BUSINESS.']
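The list above contains many empty and whitespace-only entries, so some post-processing is usually the next step. Below is a minimal sketch of what that could look like: it drops the blank tokens and pulls a few labelled values out with regular expressions. The function names, the field names, and the patterns are illustrative and tuned only to this sample invoice; the sample variable is a shortened excerpt of the list above.

```python
import re


def clean_tokens(tokens):
    """Drop empty and whitespace-only entries from the OCR token list."""
    return [token.strip() for token in tokens if token.strip()]


def extract_invoice_fields(tokens):
    """Pull a few labelled values out of the cleaned tokens with regexes."""
    text = " ".join(clean_tokens(tokens))
    patterns = {
        "invoice_number": r"Invoice # (\S+)",
        "invoice_date": r"Invoice Date (\d{1,2}/\d{1,2}/\d{4})",
        "amount_due": r"Amount Due: (\$[\d,.]+)",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        # A label that OCR misread simply yields None instead of an error.
        fields[name] = match.group(1) if match else None
    return fields


# Shortened excerpt of the OCR output shown above.
sample = ['', 'Invoice', '#', '00001', '', 'Invoice', 'Date', '12/12/2001', '',
          'Amount', 'Due:', '$4,170', '', ' ', '']
fields = extract_invoice_fields(sample)
print(fields)
# {'invoice_number': '00001', 'invoice_date': '12/12/2001', 'amount_due': '$4,170'}
```

Returning None for a missing field keeps the workflow running when the OCR engine garbles a label, and lets the UiPath side decide how to handle the gap.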

Summary

Today we learned how to use Tesseract on our machines with UiPath, the first part in a two-part series on using Tesseract for OCR. We covered two steps:

  1. Use a Python script with the Tesseract binary to apply OCR to input images.
  2. Then use UiPath to invoke the Python script and perform the task.

However, compared with the results of the cloud-based engines, the result of this OCR is poor. For better accuracy, we can train a custom machine learning model to recognize characters in our specific use case.

We can also use CV2 (OpenCV) with Tesseract to preprocess the image better and then apply OCR.

Tesseract is best suited for high-resolution inputs, such as the sample invoice PDF, and well-formatted images with a clear background.

Next week we'll learn how to use UiPath with ServiceNow, so stay tuned.

Notes -

  1. You need to install the UiPath.Python.Activities package.
  2. You might get pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path
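One way to deal with the TesseractNotFoundError is to look for the executable yourself and hand the result to pytesseract. The sketch below first checks the system PATH and then falls back to a list of candidate locations; the function name find_tesseract is illustrative, and the default candidate is the install path mentioned in the script's comment, which may differ on your machine.

```python
import os
import shutil


def find_tesseract(binary="tesseract",
                   candidates=(r"C:\Program Files\Tesseract-OCR\tesseract.exe",)):
    """Return a path to the Tesseract executable, or None if none is found."""
    on_path = shutil.which(binary)
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


# Typical use inside the OCR script (requires pytesseract to be installed):
#
#   import pytesseract
#   tesseract_path = find_tesseract()
#   if tesseract_path is None:
#       raise RuntimeError("Tesseract was not found; install it or add it to PATH")
#   pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

Failing early with a clear message is easier to diagnose from a UiPath log than an exception raised deep inside the OCR call.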

References -

  1. Tesseract installer binary for Windows - https://digi.bib.uni-mannheim.de/tesseract/
  2. OCR with Tesseract - https://nanonets.com/blog/ocr-with-tesseract/
  3. Deep learning based text recognition (OCR) using Tesseract and OpenCV - https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/