Get HTML from URL in Python

HTML pages with Python

Web pages are written in HTML, the markup language that defines a page's structure and content. It is at the core of nearly every website on the internet.

We can use Python to retrieve content from web pages. Python can fetch many kinds of data from URLs, such as JSON, HTML, and XML, and several libraries are available for working with HTML.

Get HTML from URL in Python

We will now discuss how to get HTML from URL in Python.

Using the urllib library to get HTML from URL in Python

The urllib library in Python handles fetching and working with URLs. We can use the functionality of this library to get HTML from URL in Python.

First, we need to access the URL. For this, we use the urllib.request module. The urllib.request.urlopen() function opens a connection to the URL passed to it and returns a response object.

Then, to get HTML from URL in Python, we use the read() function with this object. In Python 3, this returns a bytes object. So, we need to convert this object to a string by decoding it.

We use the decode() function to get the HTML as a string and display it. We should also close the response object with the close() function once we are done with it.

We will now use this in the code below.
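The steps above can be sketched as follows. This is a minimal sketch rather than a definitive listing: the helper name get_html() is hypothetical, the URL in the usage comment is a placeholder and not necessarily the page that produced the output shown, and utf-8 is an assumption about the page's encoding.

```python
import urllib.request


def get_html(url, encoding="utf-8"):
    """Fetch url and return its body decoded to a str (hypothetical helper)."""
    # urlopen() opens the connection; the with block calls close() on exit.
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # a bytes object in Python 3
    # Decode the bytes to text. utf-8 is an assumption about the page.
    return raw.decode(encoding)


# Example usage (needs network access; the URL is a placeholder):
# html = get_html("https://example.com")
# print(html[:50])
```

The with statement takes the place of an explicit close() call, so the connection is released even if read() raises an error.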

Output:

<title>A very simple webpage</title>

In the above example,

  • We open the URL using the urllib.request.urlopen() function.
  • We read the data and confirm that it is read as bytes.
  • We decode this bytes object to string using the decode() function.
  • We specify the utf-8 encoding in the decode() function to get the string.
  • The HTML is stored as a string and the first 50 characters are displayed.

There is a slight difference in using this library with Python 2. The urllib library was introduced in Python 1.2. Python 2 also included urllib2, which was intended to replace urllib. Python 3 then introduced a new urllib package that merged the two, with the functionality of urllib2 split across modules such as urllib.request and urllib.error. As a result, the urlopen() function lives in urllib.request in Python 3, while Python 2 code calls it from urllib (or urllib2) directly.

While using urllib in Python 2, we do not need to import the urllib.request module since the urlopen() function is available in urllib itself. Also, the read() function returns the HTML directly as a string (str), so no decoding is needed.

Notice the changes in the code below.
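The sketch below is one way to show the difference, written so that the same file runs on both versions. It is an illustration under assumptions, not the original listing: the helper name get_html_compat() is hypothetical, and the URL in the usage comment is a placeholder.

```python
try:
    # Python 2: urlopen() lives directly in the urllib module.
    from urllib import urlopen
except ImportError:
    # Python 3: the same function is found in urllib.request.
    from urllib.request import urlopen


def get_html_compat(url, encoding="utf-8"):
    """Return the page body as a str on Python 2 and Python 3 (hypothetical helper)."""
    response = urlopen(url)
    try:
        data = response.read()
    finally:
        # Close the response explicitly, as Python 2 code would.
        response.close()
    if isinstance(data, str):
        # Python 2: read() already returned a str, so no decoding is needed.
        return data
    # Python 3: read() returned bytes, so decode them.
    return data.decode(encoding)


# Example usage (needs network access; the URL is a placeholder):
# print(get_html_compat("https://example.com")[:50])
```

On Python 2 only the first import and the early return are exercised, which matches the shorter form described above: import urllib, call urllib.urlopen(), and use the result of read() as is.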

Output:

<title>A very simple webpage</title>

Using the requests library to get HTML from URL in Python

The requests library in Python is a simple, efficient library that provides an easy API for sending HTTP requests. It is based on the urllib3 library, which is a third-party package and not part of the standard library. The requests library itself is also a third-party package and needs to be installed separately.

We can use this library to get HTML from URL in Python. The requests.get() function is used to send a GET request to the URL specified within the function. It returns a Response object.

We can get the content from the response using its text attribute. This returns the HTML content as a string.

For example,
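A minimal sketch, assuming the requests package is installed. The helper name get_html_with_requests(), the timeout value, and the URL in the usage comment are illustrative choices rather than anything fixed by the library or by this article.

```python
import requests


def get_html_with_requests(url, timeout=10):
    """Send a GET request to url and return the body as a str (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    # text is an attribute, not a function: it holds the body decoded
    # with the encoding that requests inferred for the response.
    return response.text


# Example usage (needs network access; the URL is a placeholder):
# print(get_html_with_requests("https://example.com")[:50])
```

Unlike urllib, no explicit decode() call is needed here, because the text attribute already holds a decoded string.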

Output:

<title>A very simple webpage</title>

Using the urllib3 library to get HTML from URL in Python

As discussed earlier, the urllib3 library is a third-party library that is also used internally by the requests library. We can also use this library to get HTML from URL in Python.

First, we need to create a PoolManager object using the urllib3.PoolManager() constructor. This object manages connection pooling for the requests and ensures thread safety.

We use the request() function with this object to send the GET request for the given URL. We read its contents using the data attribute of the response.

This attribute also holds the HTML as bytes, so we need to decode the content using the decode() function.

See the following example,
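A minimal sketch, assuming the urllib3 package is installed. The helper name get_html_with_urllib3() and the URL in the usage comment are hypothetical, and utf-8 is an assumption about the page's encoding.

```python
import urllib3


def get_html_with_urllib3(url, encoding="utf-8"):
    """Fetch url with urllib3 and return the body as a str (hypothetical helper)."""
    # PoolManager manages pooled connections and is safe to share between threads.
    http = urllib3.PoolManager()
    # The request method is passed first, followed by the URL.
    response = http.request("GET", url)
    # data is an attribute holding the raw body as bytes, so decode it.
    return response.data.decode(encoding)


# Example usage (needs network access; the URL is a placeholder):
# print(get_html_with_urllib3("https://example.com")[:50])
```

The sketch creates a new PoolManager on every call to stay short. In a larger program, one PoolManager would usually be created once and reused so that connections can be shared.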

Output:

<title>A very simple webpage</title>

In the above example,

  • The http object belongs to the PoolManager class.
  • We use the request() function with this object.
  • We pass the request method, GET, to this function along with the URL.
  • The HTML is then retrieved and decoded using the decode() function.

Conclusion

In this article, we discussed three libraries to get HTML from URL in Python. The first was the urllib library, one of the most popular and commonly used libraries for this task. Using it in Python 2 differs slightly because of the library's history. We also used the requests and urllib3 libraries. The requests library is based on the urllib3 library, which is not part of the standard Python library and was developed as a third-party package.

That’s all about how to get HTML from URL in Python.
