HTML pages with Python
Web pages are written in HTML, the markup language that defines a page's structure and content. It is at the core of every website on the internet.
We can access and retrieve content from web pages using Python. Python allows us to access different types of data from URLs like JSON, HTML, XML, and more. We can use different libraries for working with HTML in Python.
Get HTML from URL in Python
We will now discuss how to get HTML from URL in Python.
Using the urllib library to get HTML from URL in Python
The urllib package in Python's standard library handles fetching and working with URLs. We can use its functionality to get HTML from a URL in Python.
First, we need to open the URL. For this, we use the urllib.request module. The urllib.request.urlopen() function opens a connection to the URL passed to it and returns a response object.
Then, to get the HTML, we call the read() method of this response object. In Python 3, this returns a bytes object, so we need to decode it to get a string.
We use the decode() method to convert the bytes to a string and display it. The response object should also be closed with its close() method once we are done with it.
We will now use this in the code below.
    import urllib.request

    # Open the URL; urlopen() returns a response object
    o = urllib.request.urlopen("https://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html")
    b = o.read()  # raw body as bytes
    print(type(b))
    s = b.decode("utf-8")  # bytes -> str
    print(s[:50])
    o.close()  # release the connection
In the above example,
- We open the URL using the urllib.request.urlopen() function.
- We read the data and confirm that it is read as bytes.
- We decode this bytes object to a string using the decode() function.
- We specify the utf-8 encoding in the decode() function to get the string.
- The HTML is stored as a string and the first 50 characters are displayed.
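The steps above leave closing the response to the caller and assume the page is UTF-8. As a minimal sketch, the same fetch can be wrapped in a with statement so the response is closed automatically, and the charset can be taken from the Content-Type header when the server sends one. The helper name fetch_html and its default_encoding fallback are assumptions made for illustration, not part of urllib.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Return the body at `url` decoded to a string (hypothetical helper)."""
    # The with-block closes the response even if read() raises.
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Prefer the charset declared in the Content-Type header, if any;
        # otherwise fall back to the assumed default.
        encoding = response.headers.get_content_charset() or default_encoding
    return raw.decode(encoding)


# Example call (requires network access):
# html = fetch_html("https://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html")
# print(html[:50])
```

If the server declares no charset and the page is not actually UTF-8, decode() may raise UnicodeDecodeError, so the fallback is only a reasonable guess.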
There is a slight difference in using this library with Python 2. The urllib module dates back to the early 1.x releases of Python, and urllib2 was added later as a more capable interface for opening URLs, so Python 2 shipped both modules side by side. Python 3 reorganized them into a single urllib package: the functionality of urllib2 was split across submodules such as urllib.request and urllib.error, and urllib2 no longer exists as a separate module. As a result, code written for Python 2 uses different imports than the Python 3 example above.
While using urllib in Python 2, we do not need to import the urllib.request module since the urlopen() function is available directly in urllib. Also, read() returns the HTML directly as a string (a Python 2 str), which removes the need for any decoding.
Notice the changes in the code below.
    # Python 2 only
    import urllib

    o = urllib.urlopen("https://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html")
    s = o.read()  # already a str in Python 2
    print type(s), s[:50]
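If the same script has to run under both Python 2 and Python 3, one possible approach is to try the Python 3 import first and fall back to the Python 2 one. The sketch below assumes the page is UTF-8 encoded, and fetch_html is a hypothetical helper name used only for illustration.

```python
from contextlib import closing

try:
    # Python 3: urlopen() lives in the urllib.request module.
    from urllib.request import urlopen
except ImportError:
    # Python 2: urlopen() is available directly in urllib.
    from urllib import urlopen


def fetch_html(url, encoding="utf-8"):
    """Return decoded text on either major version (hypothetical helper)."""
    # closing() calls close() for us; Python 2 responses cannot be used
    # in a with statement on their own.
    with closing(urlopen(url)) as response:
        raw = response.read()
    # Python 3 bytes and Python 2 str both provide decode().
    return raw.decode(encoding)
```

Decoding explicitly on both versions means the function returns text (str in Python 3, unicode in Python 2) rather than raw bytes in one case and text in the other.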
Using the requests library to get HTML from URL in Python
The requests library in Python is a simple, efficient third-party library that provides a straightforward API for sending HTTP requests. It is built on the urllib3 library, which is also a third-party package and not part of the standard library.
We can use this library to get HTML from a URL in Python. The requests.get() function sends a GET request to the URL passed to it and returns a Response object.
We can get the content from the response using its text attribute. This returns the HTML content as a string.
For example,
    import requests

    # Send a GET request; the result is a Response object
    o = requests.get("https://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html")
    s = o.text  # body decoded to a string
    print(s[:50])
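In practice, a request can stall or fail, and requests.get() does not raise an error for a 404 or 500 response on its own. The following sketch adds a timeout and a status check before reading the text attribute. The helper name fetch_html and the 10-second timeout are illustrative assumptions.

```python
import requests


def fetch_html(url, timeout=10):
    """GET `url` and return the body as text (hypothetical helper)."""
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx instead of returning an error page.
    response.raise_for_status()
    # text is an attribute, not a method; it decodes the body using
    # response.encoding, which requests infers from the HTTP headers.
    return response.text


# Example call (requires network access):
# html = fetch_html("https://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html")
# print(html[:50])
```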
Using the urllib3 library to get HTML from URL in Python
As discussed earlier, urllib3 is a third-party library that is also used internally by the requests library. We can use this library as well to get HTML from a URL in Python.
First, we need to create a PoolManager object using the urllib3.PoolManager() constructor. This object manages connection pooling for the requests and is thread-safe.
We call the request() method of this object to send a GET request to the given URL, and read the response body from its data attribute.
The body is returned as bytes, so we need to decode it using the decode() function.
See the following example,
    import urllib3

    # PoolManager handles connection pooling
    http = urllib3.PoolManager()
    o = http.request('GET', "https://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html")
    b = o.data  # response body as bytes
    print(type(b))
    s = b.decode("utf-8")
    print(s[:50])
In the above example,
- The http object belongs to the PoolManager class.
- We use the request() function with this object.
- We need to specify that we are sending a GET request within the function with the URL.
- The HTML is then retrieved and decoded using the decode() function.
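By default, urllib3 returns a response for error statuses such as 404 rather than raising an exception, so it can be worth checking the status attribute before decoding the body. The sketch below shows one way to do that. The helper name fetch_html, the 10-second timeout, the choice of RuntimeError, and the UTF-8 default are all assumptions made for illustration.

```python
import urllib3


def fetch_html(url, encoding="utf-8", timeout=10.0):
    """GET `url` with urllib3 and return decoded text (hypothetical helper)."""
    http = urllib3.PoolManager()
    response = http.request("GET", url, timeout=timeout)
    # urllib3 hands back 4xx/5xx responses like any other, so check explicitly.
    if response.status != 200:
        raise RuntimeError("unexpected HTTP status %d for %s" % (response.status, url))
    # data holds the body as bytes; decode it to get a string.
    return response.data.decode(encoding)


# Example call (requires network access):
# html = fetch_html("https://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html")
# print(html[:50])
```

In a longer-running program, a single PoolManager would normally be created once and reused so that connections can be pooled across requests.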
Conclusion
In this article, we discussed three libraries to get HTML from a URL in Python. The first was urllib, which is part of the standard library and one of the most commonly used ways to get HTML from a URL in Python. Its usage differs in Python 2 because of how the module was reorganized. We also used the requests and urllib3 libraries. The requests library is built on urllib3, which is not part of the standard Python library and was developed as a third-party package.
That's all about how to get HTML from a URL in Python.