1. Introduction
In Python, one often encounters the need to save complex objects, like instances of a class, to a file for purposes such as data persistence. Let’s take a Student
class as an example, which includes attributes like name, email, age, city, courses, and address. The goal is to efficiently serialize this object into a file and then deserialize it back into a class instance while maintaining the integrity of the data.
2. Using pickle.dump() Function
To save an object to a file in Python:
- Use the open() function with the with statement to open the specified file.
- Use the pickle.dump() function to convert the object to a binary format and write it to the file.
```python
import pickle

class Student(object):
    def __init__(self, name, email, age, city, courses, address):
        self.name = name
        self.email = email
        self.age = age
        self.city = city
        self.courses = courses
        self.address = address

std1 = Student(
    name='Anonymous',
    email='@gmail.com',
    age=36,
    city='London',
    courses=['Web', 'OOP'],
    address={
        'streetAddress': '100A',
        'town': 'My Town'
    }
)

file_name = 'std1.pkl'

# Writing the student object to a file using pickle
with open(file_name, 'wb') as file:
    pickle.dump(std1, file)
    print(f'Object successfully saved to "{file_name}"')

# Reading the student object back from the file
with open("std1.pkl", "rb") as file:
    loaded_student = pickle.load(file)

print(f"Deserialized Student Object: {loaded_student.name}, {loaded_student.email}")
```

Output:

```
Object successfully saved to "std1.pkl"
Deserialized Student Object: Anonymous, @gmail.com
```
Python’s pickle is a built-in module for serializing and deserializing objects. It transforms an object into a sequence of bytes that can be stored in memory or on disk and restored later in the same state. It can handle complete data structures such as lists, dictionaries, and custom classes. So, first, we imported the pickle module. Then we created a class Student() and an object std1 from it.
The with statement in Python is a control structure that simplifies code by handling setup and cleanup automatically. It creates a context manager object responsible for managing the enclosed block:
- On entering the block, the context manager performs setup work, such as opening a file or allocating a resource.
- On exiting the block, it performs cleanup work, such as closing the file or releasing the resource, even if an exception was raised inside the block.
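As a minimal illustration, the with statement behaves roughly like opening the file and closing it in a finally block (demo.txt is just a throwaway example file):

```python
# The manual version: close must happen in finally so it runs
# even if write() raises an exception
f = open('demo.txt', 'w')
try:
    f.write('hello')
finally:
    f.close()

# The with statement does the same setup and cleanup automatically
with open('demo.txt', 'w') as f:
    f.write('hello')

print(f.closed)  # the file is already closed once the block exits
```

This is why the pickle examples in this post never call file.close() explicitly.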
We used the with statement to open the file via open(file_name, 'wb'), which takes the filename and mode as arguments. The pickle module provides the dump() function, which converts a Python data structure into a byte stream and writes it to a binary file on disk. For example, we used this function to convert std1 to a binary stream and write it to the file std1.pkl.
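When no file is involved, pickle can also serialize to an in-memory bytes object via dumps() and loads(); a minimal sketch:

```python
import pickle

data = {'name': 'Anonymous', 'courses': ['Web', 'OOP']}

# dumps() returns the pickled bytes instead of writing to a file
blob = pickle.dumps(data)
print(type(blob))  # <class 'bytes'>

# loads() reconstructs an equal object from those bytes
restored = pickle.loads(blob)
print(restored == data)  # True
```

This is handy for sending pickled objects over a socket or storing them in a database column rather than a file.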
3. Using dill.dump() Function
Suppose we already have the Python library dill installed. If we don’t, we can install it with pip install dill. To save an object to a file using the dill module:
- Use the open() function wrapped in the with statement to open the file.
- Use the dill.dump() function to save the object to the file.
```python
import dill

class Student(object):
    def __init__(self, name, email, age, city, courses, address):
        self.name = name
        self.email = email
        self.age = age
        self.city = city
        self.courses = courses
        self.address = address

std1 = Student(
    name='Anonymous',
    email='@gmail.com',
    age=36,
    city='London',
    courses=['Web', 'OOP'],
    address={
        'streetAddress': '100A',
        'town': 'My Town'
    }
)

file_name = 'std1.pkl'

with open(file_name, 'wb') as file:
    dill.dump(std1, file)
    print(f'Object successfully saved to "{file_name}"')

# Reading the student object back from the file
with open("std1.pkl", "rb") as file:
    loaded_student = dill.load(file)

print(f"Deserialized Student Object: {loaded_student.name}, {loaded_student.email}")
```

Output:

```
Object successfully saved to "std1.pkl"
Deserialized Student Object: Anonymous, @gmail.com
```
Dill is an extension of Python’s pickle module, which we discussed in the pickle.dump() section. It can serialize many object types that pickle cannot handle, such as lambdas, nested functions, and classes defined interactively, which makes it useful in distributed applications such as web services.
It can also save an entire interpreter session to a single file via dill.dump_session(). We imported the module. Then, after creating the object std1 from the class Student(), we used the with statement to open the file std1.pkl via the open() function.
The dump() function converts the object into binary format and writes it to the file. This can be used to create backups of important data or to exchange data between programs. For example, we used dump() to store the object std1 in the file std1.pkl.
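One concrete case where dill goes beyond pickle is serializing a lambda; the sketch below assumes dill is installed (pip install dill):

```python
import pickle
import dill

square = lambda x: x * x

# The standard pickle module cannot serialize a lambda...
try:
    pickle.dumps(square)
except Exception as exc:
    print('pickle failed:', type(exc).__name__)

# ...but dill serializes the function body itself
blob = dill.dumps(square)
restored = dill.loads(blob)
print(restored(4))  # 16
```

The same applies to closures and locally defined functions, which is why dill is popular in multiprocessing and distributed-computing libraries.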
4. Using pandas.to_pickle() Function
To save an object to a file using the pandas.to_pickle() function:
- Use the with statement with the open() function to open the file.
- Use the pandas.to_pickle() function to store the object in a binary file.
```python
import pandas as pd

class Student(object):
    def __init__(self, name, email, age, city, courses, address):
        self.name = name
        self.email = email
        self.age = age
        self.city = city
        self.courses = courses
        self.address = address

std1 = Student(
    name='Anonymous',
    email='@gmail.com',
    age=36,
    city='London',
    courses=['Web', 'OOP'],
    address={
        'streetAddress': '100A',
        'town': 'My Town'
    }
)

file_name = 'std1.pkl'

with open(file_name, 'wb') as file:
    pd.to_pickle(std1, file)
    print(f'Object successfully saved to "{file_name}"')

# Reading the student object back from the file
loaded_student = pd.read_pickle("std1.pkl")

print(f"Deserialized Student Object: {loaded_student.name}, {loaded_student.email}")
```

Output:

```
Object successfully saved to "std1.pkl"
Deserialized Student Object: Anonymous, @gmail.com
```
Python’s pandas is an open-source module that provides high-performance data structures and analysis tools.
The primary purpose of this library is to provide efficient manipulation, filtering, reshaping, and merging operations on numerical tables or data frames. It aims to be the fundamental high-level building block for Python’s practical, real-world data analysis.
Once we created the object std1 from the class Student(), we opened the file std1.pkl using the built-in open() function wrapped in the with statement, as discussed in the pickle.dump() section.
The pandas library provides a to_pickle() function that converts an object into a series of bytes and saves it in a file for later use, enabling users to transfer data between applications.
It uses the pickle module internally, abstracting away the code needed to properly serialize an object before writing it to disk. For example, we used this function to write the std1 object to the file std1.pkl.
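Note that to_pickle() and its counterpart read_pickle() also accept a file path directly, so the explicit open() call is optional; a sketch with a DataFrame, pandas’ primary use case (the file name students.pkl is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Anonymous'], 'age': [36]})

# to_pickle accepts a path directly; no explicit open() needed
pd.to_pickle(df, 'students.pkl')

# read_pickle is the matching deserializer
restored = pd.read_pickle('students.pkl')
print(restored.equals(df))  # True
```

This path-based form is the idiomatic way to persist DataFrames between analysis sessions.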
5. Using JSON Module
JSON is a widely-used format for data interchange, but it requires custom handling for class instances.
Example:
```python
import json

# Assumes the Student class and the std1 instance defined in the
# pickle example above

class StudentJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Student):
            return obj.__dict__
        return json.JSONEncoder.default(self, obj)

# Function to decode JSON back into a Student object. object_hook is
# called for every JSON object, including the nested address dict,
# so only dicts that look like a Student are converted.
def decode_student(dct):
    if 'name' in dct and 'email' in dct:
        return Student(**dct)
    return dct

# Writing the student object to a JSON file
with open("student.json", "w") as file:
    json.dump(std1, file, cls=StudentJSONEncoder)

# Reading the student object back from the JSON file
with open("student.json", "r") as file:
    loaded_student = json.load(file, object_hook=decode_student)

print(f"Deserialized Student Object: {loaded_student.name}, {loaded_student.email}")
```
Explanation:
- A custom StudentJSONEncoder is defined to convert Student objects into a serializable format.
- The decode_student function reconstructs the Student object from the JSON data.
- json.dump() and json.load() are used with these custom handlers for serialization and deserialization.
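The same encoder/object_hook pattern works for any simple class. A self-contained sketch with a hypothetical Point class, using dumps()/loads() to avoid files and show that the serialized form is plain, readable text:

```python
import json

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Point):
            return obj.__dict__   # serialize the instance as a plain dict
        return super().default(obj)

# The serialized form is human-readable JSON text
text = json.dumps(Point(1, 2), cls=PointEncoder)
print(text)  # {"x": 1, "y": 2}

# object_hook turns each decoded dict back into a Point
restored = json.loads(text, object_hook=lambda d: Point(**d))
print(restored.x, restored.y)  # 1 2
```

Because the attribute names match the constructor parameters, Point(**d) rebuilds the instance directly, the same trick decode_student uses above.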
6. Using YAML Module
YAML is a user-friendly serialization format but requires custom handling for class instances.
Example:
```python
import yaml

# Assumes the Student class and the std1 instance defined in the
# pickle example above

# Custom representation for Student objects in YAML
def student_representer(dumper, data):
    return dumper.represent_mapping('!Student', data.__dict__)

yaml.add_representer(Student, student_representer)

# Matching constructor so yaml.load can rebuild Student objects
# from the '!Student' tag
def student_constructor(loader, node):
    return Student(**loader.construct_mapping(node, deep=True))

yaml.add_constructor('!Student', student_constructor)

# Writing the student object to a YAML file
with open("student.yaml", "w") as file:
    yaml.dump(std1, file)

# Reading the student object back from the YAML file
with open("student.yaml", "r") as file:
    loaded_student = yaml.load(file, Loader=yaml.Loader)

print(f"Deserialized Student Object: {loaded_student.name}, {loaded_student.email}")
```
Explanation:
- A custom representer for the Student class is defined for YAML serialization, with a matching constructor for deserialization.
- yaml.dump() and yaml.load() are used for writing to and reading from the YAML file.
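An alternative that avoids custom representers entirely is to dump the object’s __dict__ with yaml.safe_dump() and rebuild the instance yourself. This sketch assumes PyYAML is installed and that the class accepts its attributes as keyword arguments:

```python
import yaml

class Student:
    def __init__(self, name, age):
        self.name = name
        self.age = age

std = Student('Anonymous', 36)

# safe_dump only serializes plain data types, so we pass the
# instance's attribute dict rather than the instance itself
text = yaml.safe_dump(std.__dict__)
print(text)

# safe_load returns a dict, which we feed back to the constructor
restored = Student(**yaml.safe_load(text))
print(restored.name, restored.age)  # Anonymous 36
```

The safe_* functions refuse arbitrary object tags, which makes this variant preferable when loading YAML from untrusted sources.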
7. Comparing Performance
Comparing the performance of different serialization libraries used in Python, such as Pickle, JSON, Dill, Pandas’ to_pickle(), and YAML, involves evaluating several factors: speed, file size, and ease of use. Each library has its strengths and weaknesses, which I’ll outline below:
7.1. Pickle
- Speed: Pickle is generally very fast, especially for Python-specific data structures and objects.
- File Size: Creates relatively small files due to binary serialization, but the size can grow with complex objects.
- Ease of Use: Very straightforward for Python users, no additional setup required.
- Use Case Suitability: Best for Python-specific applications where data doesn’t need to be shared with other languages.
7.2. JSON
- Speed: JSON serialization can be slower than Pickle, particularly for complex or deeply nested objects.
- File Size: JSON files are larger compared to binary formats like Pickle, as they are text-based. However, they are easily compressible.
- Ease of Use: Straightforward, especially with the standard library. Custom serialization and deserialization can add complexity.
- Use Case Suitability: Ideal for web applications, data interchange between different languages, or when human readability is important.
7.3. Dill
- Speed: Similar to Pickle in terms of speed, but can vary based on the complexity of the objects being serialized.
- File Size: Comparable to Pickle, with file sizes increasing for more complex objects.
- Ease of Use: As easy as Pickle but with extended capabilities to serialize more complex Python objects.
- Use Case Suitability: Suitable for scenarios where Pickle falls short in terms of object complexity.
7.4. Pandas’ to_pickle()
- Speed: Offers good performance, especially for DataFrame objects. Might be slower for highly complex data structures.
- File Size: Since to_pickle() uses pickle internally, file sizes are comparable to Pickle for the same object; DataFrame files can still be large simply because DataFrames hold a lot of data.
- Ease of Use: Very convenient for users already working within the Pandas ecosystem.
- Use Case Suitability: Best for serializing DataFrame objects, especially in data analysis and scientific computing contexts.
7.5. YAML
- Speed: YAML serialization and deserialization are slower compared to binary formats like Pickle and Dill.
- File Size: YAML files are larger due to their text-based, human-readable format.
- Ease of Use: YAML is easy to read and write, but serializing and deserializing custom objects requires additional setup.
- Use Case Suitability: Ideal for configuration files, applications requiring human readability, and complex data structures.
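As a rough way to check the speed and size claims yourself, timeit can compare serializers on the same payload. Absolute numbers depend on the machine; this minimal sketch uses only the standard library and an arbitrary synthetic payload:

```python
import json
import pickle
import timeit

# Synthetic payload: 1000 small student-like records
payload = {'students': [{'name': f'student{i}', 'age': 20 + i % 30,
                         'courses': ['Web', 'OOP']} for i in range(1000)]}

# Time 200 serialization rounds for each format
pickle_time = timeit.timeit(lambda: pickle.dumps(payload), number=200)
json_time = timeit.timeit(lambda: json.dumps(payload), number=200)
print(f'pickle: {pickle_time:.4f}s  json: {json_time:.4f}s')

# Compare the serialized sizes of the same payload
print('pickle bytes:', len(pickle.dumps(payload)))
print('json bytes:  ', len(json.dumps(payload).encode()))
```

Extending the comparison to dill and YAML is a matter of adding dill.dumps and yaml.dump calls to the same harness.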
8. Conclusion
In summary, the choice of serialization library in Python depends on the specific requirements of your application. If speed and file size are critical, and the data is Python-specific, Pickle or Dill are excellent choices. For interoperability and human readability, JSON and YAML are better suited, though they come with a performance trade-off. Pandas’ to_pickle() is particularly useful when working with DataFrame objects within the Pandas ecosystem.