Convert PDF to Text in Python

Python is a feature-rich programming language, and its modules and libraries let us perform many different operations on files. In this article, we will discuss different ways to convert a PDF file to text in Python.

Convert PDF to text using PyPDF2

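To convert a PDF to text in Python, we can use the PyPDF2 module. The names used in this section (PdfFileReader, numPages, getPage(), and extractText()) belong to PyPDF2 releases before 3.0; in version 3.0 and later they were replaced by PdfReader, len(reader.pages), reader.pages[i], and extract_text(). To follow the steps exactly as written, install a release before 3.0 (for example, by giving pip the requirement "PyPDF2<3.0"). Otherwise, you can install the latest release with pip by executing the following command in the command prompt.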

pip3 install PyPDF2

To convert the PDF file to text, we will first open the file using the open() function in “rb” mode, i.e., we will read the file in binary mode instead of text mode. The open() function takes the filename as its first input argument and the mode as its second input argument. After opening the file, it returns a file object that we assign to the variable myFile.
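As a minimal sketch of this step, the snippet below opens a file in “rb” mode and shows that reading from it yields bytes rather than text. The file name sample.pdf and its placeholder contents are assumptions made only so that the snippet runs on its own; in practice, point sample_path at a real PDF file.

```python
import os
import tempfile

# Hypothetical stand-in for a real PDF, created only so this sketch is runnable.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.pdf")
with open(sample_path, "wb") as placeholder:
    placeholder.write(b"%PDF-1.4 placeholder content")

# Open the file in binary read mode, as described above.
myFile = open(sample_path, "rb")

# Data read from a binary-mode file object is bytes, not str.
first_bytes = myFile.read(5)
print(type(first_bytes).__name__, first_bytes)  # bytes b'%PDF-'

myFile.close()
```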

After getting the file object, we will create a PdfFileReader object using the PdfFileReader() constructor defined in the PyPDF2 module. PdfFileReader() accepts the file object containing the PDF file as its input argument and returns a PdfFileReader object. Using the PdfFileReader object, we can convert the PDF file to text.
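The step above might look like the following sketch. The helper name open_pdf_reader is hypothetical and not part of PyPDF2, and the code assumes a PyPDF2 release before 3.0, where the PdfFileReader name is still available.

```python
def open_pdf_reader(file_object):
    """Return a PyPDF2 reader for an already opened binary file object."""
    # Imported inside the function so this sketch can be loaded
    # even on a machine where PyPDF2 is not installed yet.
    import PyPDF2

    # PdfFileReader is the class named in this article (PyPDF2 before 3.0).
    return PyPDF2.PdfFileReader(file_object)
```

With a PDF opened as myFile in “rb” mode, you would call pdfReader = open_pdf_reader(myFile) and keep myFile open for as long as the reader is in use.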

We will also open a file “output.txt” in write mode using the open() function, so that we can save the text data extracted from the PDF file. We will assign this file object to the variable output_file.

To create the text file from the PDF file, we will first find the number of pages in the PDF file. For this, we will use the numPages attribute of the PdfFileReader object.
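A small sketch of reading the page count is shown below. The function name report_page_count is hypothetical; it expects a PdfFileReader object created as described above and assumes a PyPDF2 release before 3.0, where numPages is available.

```python
def report_page_count(pdf_reader):
    """Print and return the number of pages in a PdfFileReader object."""
    # numPages is the attribute described above (PyPDF2 before 3.0).
    total_pages = pdf_reader.numPages
    print("Number of pages:", total_pages)
    return total_pages
```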

You can view the pdf file used in this example here.

After getting the number of pages in the PDF file, we will use a for loop to process all the pages of the PDF file. In the for loop, we will extract each page from the PDF file using the getPage() method. The getPage() method, when invoked on a PdfFileReader object, accepts the zero-based page number as an input argument and returns a PageObject containing data from the specified page of the PDF file.

After getting the PageObject, we will use the extractText() method to extract text from the current page. After that, we will write the extracted text to the output text file.
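The loop described above can be sketched as follows. The function name write_pdf_text is hypothetical; it takes a PdfFileReader object and a text file opened in write mode, and it assumes a PyPDF2 release before 3.0, where numPages, getPage(), and extractText() are available.

```python
def write_pdf_text(pdf_reader, output_file):
    """Write the text of every page of pdf_reader to output_file."""
    for page_number in range(pdf_reader.numPages):
        # getPage() takes a zero-based page number and returns a page object.
        page = pdf_reader.getPage(page_number)
        # extractText() returns the text found on that page as a string.
        output_file.write(page.extractText())
```

The quality of the extracted text depends on how the PDF was produced, so it is worth checking the output file for a few sample documents.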

After extracting the text from all the pages in the PDF file, we will close both the text file and the PDF file. Otherwise, text that is still waiting in the write buffer might not be saved to the output file.
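One way to make sure both files are always closed is Python's with statement, which closes a file automatically when its block ends. This is offered as an alternative to the explicit close() calls described above, not as the approach this article follows. The file locations and placeholder contents below are assumptions made so that the sketch runs on its own.

```python
import os
import tempfile

# Hypothetical file locations, created in a temporary directory for this sketch.
work_dir = tempfile.mkdtemp()
pdf_path = os.path.join(work_dir, "sample.pdf")
text_path = os.path.join(work_dir, "output.txt")
with open(pdf_path, "wb") as placeholder:
    placeholder.write(b"%PDF-1.4 placeholder content")

# Both files are closed automatically when the with block ends,
# even if an exception is raised inside it.
with open(pdf_path, "rb") as myFile, open(text_path, "w") as output_file:
    # The page-by-page extraction described above would go here.
    output_file.write("extracted text goes here")

print(myFile.closed, output_file.closed)  # True True
```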

Convert PDF to text using pdfminer.six

Instead of using PyPDF2, we can use the pdfminer.six module to convert a PDF file to a text file. You can install the pdfminer.six module as follows.

pip3 install pdfminer.six

The pdfminer.six module provides us with the extract_text() function, which we can import from pdfminer.high_level and use to convert the PDF file to a text file. The extract_text() function accepts either a file path or a file object opened in binary mode as an input argument and returns the text data in the file as a string.

After opening the PDF file and the output text file, we can extract the text from the PDF file using the extract_text() function. Then, we will save the text data to the output file. Don’t forget to close the files before the termination of the program.
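The steps above can be sketched as a single function. The name pdf_to_text_file and its two path parameters are hypothetical; extract_text() is imported from pdfminer.high_level, and the files are opened with the with statement so that they are closed automatically.

```python
def pdf_to_text_file(pdf_path, text_path):
    """Extract all text from pdf_path with pdfminer.six and save it to text_path."""
    # Imported inside the function so this sketch can be loaded
    # even on a machine where pdfminer.six is not installed yet.
    from pdfminer.high_level import extract_text

    # Open the PDF in binary mode and let extract_text() read the whole document.
    with open(pdf_path, "rb") as pdf_file:
        text = extract_text(pdf_file)

    # Save the extracted text; the file is closed when the with block ends.
    with open(text_path, "w", encoding="utf-8") as output_file:
        output_file.write(text)

    # Return the number of characters written, as a quick sanity check.
    return len(text)
```

For example, pdf_to_text_file("sample.pdf", "output.txt") would write the text of a file named sample.pdf in the current directory to output.txt.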

Conclusion

In this article, we have discussed two ways to convert a PDF to a text file in Python. Of the two, the approach using the PyPDF2 module is the faster in terms of execution speed. However, you can use either approach at your convenience.

I hope you enjoyed reading this article. Stay tuned for more informative articles. 

Happy Learning! 

