PDF Data Extraction using UiPath

This blog explains how to use the following activities:

  • Read PDF Text
  • Read PDF with OCR

In UiPath, we can extract plain text as well as text that appears inside images, using the activities in the PDF package provided by UiPath.

First, install the PDF package in UiPath by following the steps below:

  • Go to Manage Packages and select All Packages

  • Type PDF and select UiPath.PDF.Activities

  • Install the UiPath.PDF.Activities package.
  • Once the installation is complete, the PDF automation activities will be available in UiPath.

Read PDF Text

This section illustrates reading text from a PDF and writing it to a text file.

  1. Under Activities, start with a Sequence. Then type PDF in the search box to list the PDF activities, and drag and drop “Read PDF Text” into the sequence.
  2. Provide the input file path of the PDF to be read, and store the output text in a String variable.
  3. Write the output to a text file using the Write Text File activity.
  4. In Write Text File, set the input text to the variable that holds the output from the earlier step, and provide the file path where the output text file should be created.
  5. Once we run the process, we can inspect the result in the output text file.

We can see that all the text from our input PDF has been successfully extracted to our output file, except for the text contained in the image. To recognize and extract text from an image, we have to use the Read PDF with OCR activity.
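For readers who think in code, the workflow above has roughly the following shape. This is only a Python sketch of the data flow, not UiPath code: `read_pdf_text` is a hypothetical stand-in for the Read PDF Text activity, and the function and parameter names are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable


def extract_to_text_file(
    pdf_path: str,
    output_path: str,
    read_pdf_text: Callable[[str], str],
) -> str:
    """Mirror the workflow: read the PDF, keep the text in a variable, write it out."""
    # The reader's output is stored in a string variable
    # (the "output text" variable of the read step).
    extracted_text = read_pdf_text(pdf_path)

    # The same variable is the input of the write step,
    # together with the path of the text file to create.
    Path(output_path).write_text(extracted_text, encoding="utf-8")
    return extracted_text
```

The reader is passed in as a parameter so that the same flow can be reused with a different reading step, such as an OCR-based one.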

Read PDF with OCR

Our input PDF contains an image that includes text, which we also want to extract along with the rest of the text.

OCR (optical character recognition) is the technology used to recognize text characters inside digital images. UiPath offers multiple ways to read text from an image; here we use the Read PDF with OCR activity.

For this we have to use an OCR engine. UiPath OCR is a proprietary OCR technology of UiPath, supporting characters used by the following Latin script languages: English, French, German, Italian, Portuguese, Romanian and Spanish. Text in other languages will be recognized but without accents.

The default engines provided by UiPath are Google Cloud Vision OCR, Microsoft Azure Computer Vision OCR, Microsoft OCR, Microsoft Project Oxford Online OCR, and Tesseract OCR.

Which engine to use depends on the document being processed.

The steps below show how to use OCR to read a PDF:

  1. Under Activities, start with a Sequence. Then type PDF in the search box to list the PDF activities, and drag and drop “Read PDF with OCR” into the sequence.
  2. Provide the input file path of the PDF to be read, and store the output text in a String variable.
  3. Search for OCR in Activities and select one of the OCR engines.
  4. Enter the output variable for the OCR engine used.
  5. Write the output to a text file using the Write Text File activity.
  6. In Write Text File, set the input text to the variable that holds the output from the earlier steps, and provide the file path where the output text file should be created.
  7. Once we run the process, we can inspect the result in the output text file.

We can see that all the text from our input PDF has been successfully extracted to our output file, including the text present in the image. Both activities are self-contained: they can read and extract data even if the PDF is not open.
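One way to check what OCR added is to compare the two output text files. The following is a minimal Python sketch, assuming both outputs have already been loaded into strings; the function name is hypothetical, and the whitespace and case normalization is an assumption made because OCR spacing often differs from the embedded text.

```python
def lines_only_in_ocr(text_output: str, ocr_output: str) -> list[str]:
    """Return non-empty OCR lines that the plain-text extraction did not produce."""

    def normalize(line: str) -> str:
        # Collapse runs of whitespace and ignore case before comparing.
        return " ".join(line.split()).lower()

    seen = {normalize(line) for line in text_output.splitlines()}
    return [
        line.strip()
        for line in ocr_output.splitlines()
        if normalize(line) and normalize(line) not in seen
    ]
```

Lines returned by this function are candidates for text that came from the image. OCR misreads of ordinary text would also show up here, so the result is a hint rather than a proof.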
