Serialization with pickle and json
Serialization is the process of converting a data structure or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted (for example, across a network connection) and reconstructed later in the same or another computer environment.
When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object.
This process of serializing an object is also called deflating or marshalling an object. The opposite operation, extracting a data structure from a series of bytes, is deserialization (also called inflating or unmarshalling). (Source: Wikipedia)
In Python, we have the pickle module. The bulk of the pickle module is written in C, like the Python interpreter itself. It can store arbitrarily complex Python data structures. Its format is cross-version and customisable, but unsafe: it is not secure against erroneous or malicious data, so we should never unpickle data from a source we do not trust.
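One common way to reduce the risk is to subclass pickle.Unpickler and override find_class(), which is called whenever a pickle asks for a class or function. The following is a minimal sketch of that idea; the names RestrictedUnpickler and restricted_loads are made up for this illustration, and refusing every global is only one possible policy.

```python
import io
import os
import pickle


class RestrictedUnpickler(pickle.Unpickler):
    """Illustrative unpickler that refuses to look up any global."""

    def find_class(self, module, name):
        # Called whenever the pickle stream references a class or function.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Hypothetical helper: like pickle.loads(), but for plain data only."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers and scalars need no globals, so they load normally.
safe = restricted_loads(pickle.dumps({'id': 1024, 'tags': ('a', 'b')}))

# A pickle that references a function (here os.system) is rejected.
try:
    restricted_loads(pickle.dumps(os.system))
    rejected = False
except pickle.UnpicklingError:
    rejected = True
```

This only limits what a pickle may import. It does not make unpickling untrusted data safe in general.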
The standard library also includes modules serializing to standard data formats:
- json with built-in support for basic scalar and collection types and able to support arbitrary types via encoding and decoding hooks.
- plistlib for XML-encoded property lists, limited to plist-supported types (numbers, strings, booleans, tuples, lists, dictionaries, datetime and binary blobs).
Finally, it is recommended that an object's __repr__ be evaluable in the right environment, making it a rough match for Common Lisp's print-object. (Source: Wikipedia)
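As a small illustration of an evaluable __repr__, the sketch below uses a hypothetical Point class (not part of any library) whose repr text is itself a valid constructor call. Passing that text to eval() rebuilds an equal object, which is reasonable only for text we produced ourselves.

```python
class Point:
    """Hypothetical class used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # Produce text that is itself a valid constructor call.
        return f"Point({self.x!r}, {self.y!r})"

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)


p = Point(1, 2.5)
text = repr(p)        # the textual form of the object
clone = eval(text)    # rebuild an equal object from that text
```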
Here are the things that the pickle module can store:
- All the native datatypes that Python supports: booleans, integers, floating point numbers, complex numbers, strings, bytes objects, byte arrays, and None.
- Lists, tuples, dictionaries, and sets containing any combination of native datatypes.
- Lists, tuples, dictionaries, and sets containing any combination of lists, tuples, dictionaries, and sets containing any combination of native datatypes (and so on, to the maximum nesting level that Python supports).
- Functions, classes, and instances of classes (with caveats).
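One of the caveats is that functions and classes are pickled by reference, that is, by their module and name, not by their code. The sketch below assumes only the standard library: math.sqrt can be pickled because it can be found again by name, while an anonymous lambda cannot.

```python
import math
import pickle

# A named, importable function is stored as a reference to "math.sqrt".
payload = pickle.dumps(math.sqrt)
restored = pickle.loads(payload)

# A lambda has no importable name, so pickle refuses it.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
```

Because only the name is stored, the function or class must be importable under the same name in the environment that loads the pickle.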
We will use two Python Shells, 'A' & 'B':
>>> shell = 'A'
Open another Shell:
>>> shell = 'B'
Here is the dictionary type data for Shell 'A':
>>> shell
'A'
>>> book = {}
>>> book['title'] = 'Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition'
>>> book['page_link'] = 'http://www.amazon.com/Fil-Hunter/e/B001ITTV7A'
>>> book['comment_link'] = None
>>> book['id'] = b'\xAC\xE2\xC1\xD7'
>>> book['tags'] = ('Photography', 'Kindle', 'Light')
>>> book['published'] = True
>>> import time
>>> book['published_time'] = time.strptime('Mon Sep 10 23:18:32 2012')
>>> book['published_time']
time.struct_time(tm_year=2012, tm_mon=9, tm_mday=10, tm_hour=23, tm_min=18, tm_sec=32, tm_wday=0, tm_yday=254, tm_isdst=-1)
>>>
Here, we're trying to use as many data types as possible.
The time module contains a data structure, struct_time, to represent a point in time, and functions to manipulate time structs. The strptime() function takes a formatted string and converts it to a struct_time. The string above is in the default format, but we can control the format with format codes. For more details, see the documentation of the time module.
Now we have a dictionary that holds all the information about the book. Let's save it as a pickle file:
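For instance, a timestamp in a different layout can be parsed by passing format codes as the second argument. This is a small sketch; the timestamp layout is chosen only for illustration.

```python
import time

# Parse an ISO-like timestamp by spelling out its layout with format codes.
parsed = time.strptime('2012-09-10 23:18:32', '%Y-%m-%d %H:%M:%S')

# strftime() goes the other way, from a struct_time back to text.
text = time.strftime('%d/%m/%Y %H:%M', parsed)
```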
>>> import pickle
>>> with open('book.pickle', 'wb') as f:
...     pickle.dump(book, f)
...
We set the file mode to wb to open the file for writing in binary mode. We wrap it in a with statement to ensure the file is closed automatically when we're done with it. The dump() function in the pickle module takes a serializable Python data structure, serializes it into a binary, Python-specific format using the default version of the pickle protocol, and saves it to an open file.
- The pickle module takes a Python data structure and saves it to a file.
- It serializes the data structure using a data format called the pickle protocol.
- The pickle protocol is Python-specific; there is no guarantee of cross-language compatibility.
- Not every Python data structure can be serialized by the pickle module. The pickle protocol has changed several times as new data types have been added to the Python language, but there are still limitations.
- So, there is no guarantee of compatibility between different versions of Python itself.
- Unless we specify otherwise, the functions in the pickle module use the default version of the pickle protocol (pickle.DEFAULT_PROTOCOL), which is not necessarily the newest one (pickle.HIGHEST_PROTOCOL).
- The default pickle protocol is a binary format. Be sure to open pickle files in binary mode, or the data will get corrupted during writing.
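The protocol version can also be chosen explicitly through the protocol argument. The sketch below uses a made-up record and pins protocol 2 as an example of an older version that more interpreters can read; which version to pin is an assumption that depends on who has to load the file.

```python
import pickle

record = {'title': 'Light Science and Magic', 'tags': ('Photography', 'Kindle')}

# The protocol used when none is given, and the newest one available.
default_version = pickle.DEFAULT_PROTOCOL
newest_version = pickle.HIGHEST_PROTOCOL

# Pin an older protocol so that older interpreters can read the result.
portable = pickle.dumps(record, protocol=2)
newest = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

# Either payload is bytes and loads back to an equal object.
same = pickle.loads(portable) == pickle.loads(newest) == record
```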
Let's load the saved data from a pickle file on another Python Shell B.
>>> shell
'B'
>>> import pickle
>>> with open('book.pickle', 'rb') as f:
...     b = pickle.load(f)
...
>>> b
{'published_time': time.struct_time(tm_year=2012, tm_mon=9, tm_mday=10, tm_hour=23, tm_min=18, tm_sec=32, tm_wday=0, tm_yday=254, tm_isdst=-1), 'title': 'Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition', 'tags': ('Photography', 'Kindle', 'Light'), 'page_link': 'http://www.amazon.com/Fil-Hunter/e/B001ITTV7A', 'published': True, 'id': b'\xac\xe2\xc1\xd7', 'comment_link': None}
- There is no book variable defined here, since we defined book in Python Shell A, not in this shell.
- We opened the book.pickle file we created in Python Shell A. The pickle module uses a binary data format, so we should always open pickle files in binary mode.
- The pickle.load() function takes a stream object, reads the serialized data from the stream, creates a new Python object, recreates the serialized data in the new Python object, and returns the new Python object.
- The pickle.dump()/pickle.load() cycle creates a new data structure that is equal to the original data structure.
Let's switch back to Python Shell A.
>>> shell
'A'
>>> with open('book.pickle', 'rb') as f:
...     book2 = pickle.load(f)
...
>>> book2 == book
True
>>> book2 is book
False
- We opened the book.pickle file, and loaded the serialized data into a new variable, book2.
- The two dictionaries, book and book2, are equal.
- We serialized this dictionary, stored it in the book.pickle file, then read the serialized data back from that file and created a perfect replica of the original data structure.
- Equality is not the same as identity. We've created a perfect replica of the original data structure, so the two compare equal, but the replica is still a separate copy.
If we don't want to use a file, we can still serialize an object in memory.
>>> shell
'A'
>>> m = pickle.dumps(book)
>>> type(m)
<class 'bytes'>
>>> book3 = pickle.loads(m)
>>> book3 == book
True
- The pickle.dumps() function (note the s at the end of the function name; this is not dump()) performs the same serialization as the pickle.dump() function. Instead of taking a stream object and writing the serialized data to a file on disk, it simply returns the serialized data.
- Since the pickle protocol uses a binary data format, the pickle.dumps() function returns a bytes object.
- The pickle.loads() function (again, note the s at the end of the function name) performs the same deserialization as the pickle.load() function. Instead of taking a stream object and reading the serialized data from a file, it takes a bytes object containing serialized data, such as the one returned by the pickle.dumps() function.
- The end result is the same: a perfect replica of the original dictionary.
The data format used by the pickle module is Python-specific. It makes no attempt to be compatible with other programming languages. If cross-language compatibility is one of our requirements, we need to look at other serialization formats. One such format is json.
JSON (JavaScript Object Notation) is a text-based open standard designed for human-readable data interchange. It is derived from the JavaScript scripting language for representing simple data structures and associative arrays, called objects. Despite its relationship with JavaScript, it is language-independent, with parsers available for many languages. json is explicitly designed to be usable across multiple programming languages. The JSON format is often used for serializing and transmitting structured data over a network connection. It is used primarily to transmit data between a server and web application, serving as an alternative to XML. (Source: Wikipedia)
Python 3 includes a json module in the standard library. Like the pickle module, the json module has functions for serializing data structures, storing the serialized data on disk, loading serialized data from disk, and unserializing the data back into a new Python object. But there are some important differences, too.
- The json data format is text-based, not binary. All json values are case-sensitive.
- As with any text-based format, there is the issue of whitespace. json allows arbitrary amounts of whitespace (spaces, tabs, carriage returns, and line feeds) between values. This whitespace is insignificant, which means that json encoders can add as much or as little whitespace as they like, and json decoders are required to ignore the whitespace between values. This allows us to pretty-print our json data, nicely nesting values within values at different indentation levels so we can read it in a standard browser or text editor. Python's json module has options for pretty-printing during encoding.
- There's the perennial problem of character encoding. json encodes values as plain text, but as we know, there is no such thing as plain text. json must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, UTF-8). For details on json and encodings, see RFC 4627.
We're going to create a new data structure instead of re-using the existing book data structure. json is a text-based format, which means we need to open the file in text mode and specify a character encoding. We can never go wrong with UTF-8.
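The encoding point can be seen directly with json.dumps(). In this sketch (the sample title is made up), the default output escapes non-ASCII characters, while ensure_ascii=False keeps them as-is, in which case the text has to be encoded, here as UTF-8, before it is stored.

```python
import json

data = {'title': 'Café Lumière'}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters; the str must then be encoded.
literal = json.dumps(data, ensure_ascii=False)
raw = literal.encode('utf-8')

# Both spellings decode to the same Python object.
same = json.loads(escaped) == json.loads(raw.decode('utf-8')) == data
```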
try:
    import simplejson as json
except ImportError:
    import json

book = {}
book['title'] = 'Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition'
book['tags'] = ('Photography', 'Kindle', 'Light')
book['published'] = True
book['comment_link'] = None
book['id'] = 1024

# Text mode with an explicit character encoding
with open('ebook.json', 'w', encoding='utf-8') as f:
    json.dump(book, f)
Like the pickle module, the json module defines a dump() function which takes a Python data structure and a writable stream object. The dump() function serializes the Python data structure and writes it to the stream object. Doing this inside a with statement will ensure that the file is closed properly when we're done.
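As with pickle, there are also in-memory variants, json.dumps() and json.loads(). A minimal sketch with a shortened book dictionary; note that the serialized form is a str, not a bytes object.

```python
import json

book = {'title': 'Light Science and Magic', 'published': True, 'comment_link': None}

text = json.dumps(book)       # serialize to a str
restored = json.loads(text)   # parse the str back into a new dict
```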
Let's see what's in the ebook.json file:
$ cat ebook.json
{"published": true, "tags": ["Photography", "Kindle", "Light"], "id": 1024, "comment_link": null, "title": "Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition"}
It's clearly more readable than a pickle file. But json can contain arbitrary whitespace between values, and the json module provides an easy way to take advantage of this to create even more readable json files:
>>> with open('book_more_friendly.json', mode='w', encoding='utf-8') as f:
...     json.dump(book, f, indent=3)
...
We passed an indent parameter to the json.dump() function, and it made the resulting json file more readable, at the expense of larger file size. The indent parameter is an integer.
$ cat book_more_friendly.json
{
   "published": true,
   "tags": [
      "Photography",
      "Kindle",
      "Light"
   ],
   "id": 1024,
   "comment_link": null,
   "title": "Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition"
}
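Because whitespace between values is insignificant, the same data can be written compactly or pretty-printed and still decode to the same object. A small sketch with a shortened book dictionary; the separators and sort_keys arguments are optional parameters of json.dumps() that the text above does not use.

```python
import json

book = {'id': 1024, 'tags': ['Photography', 'Kindle', 'Light']}

# No optional whitespace at all.
compact = json.dumps(book, separators=(',', ':'))

# Three spaces per nesting level, keys in alphabetical order.
pretty = json.dumps(book, indent=3, sort_keys=True)

# Whitespace does not change what the text decodes to.
same = json.loads(compact) == json.loads(pretty) == book
```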
Here is another example for json:
#!/usr/bin/python
import json
import subprocess
import time
import xml.etree.ElementTree as ET

import psutil

procs_id = 0
procs = {}
procs_data = []

ismvInfo = {
    'baseName': ' ',
    'video': {
        'src': [],
        'TrackIDvalue': [],
        'Duration': 0,
        'QualityLevels': 1,
        'Chunks': 0,
        'Url': '',
        'index': [],
        'bitrate': [],
        'fourCC': [],
        'width': [],
        'height': [],
        'codecPrivateData': [],
        'fragDurations': []
    },
    'audio': {
        'src': [],
        'TrackIDvalue': [],
        'QualityLevels': 1,
        'index': [],
        'bitrate': [],
        'fourCC': [],
        'samplingRate': [],
        'channels': [],
        'bitsPerSample': [],
        'packetSize': [],
        'audioTag': [],
        'codecPrivateData': [],
        'fragDurations': [],
    }
}


def runCommand(cmd, use_shell=False, return_stdout=False, busy_wait=True, poll_duration=0.5):
    global procs_id
    global procs
    global procs_data

    # Sanitize cmd: make sure every argument is a string
    cmd = [str(x) for x in cmd]
    if use_shell:
        command = ' '.join(cmd)
    else:
        command = cmd

    if return_stdout:
        proc = psutil.Popen(command, shell=use_shell,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    else:
        proc = psutil.Popen(command, shell=use_shell,
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    proc_id = procs_id
    procs[proc_id] = proc
    procs_id += 1
    data = {}

    # Optionally poll the process and record its resource usage
    while busy_wait:
        returncode = proc.poll()
        if returncode is None:
            try:
                # attribute names as used by psutil 2.0 and later
                data = proc.as_dict(attrs=['io_counters', 'cpu_times'])
            except Exception:
                pass
            time.sleep(poll_duration)
        else:
            break

    (stdout, stderr) = proc.communicate()
    returncode = proc.returncode
    del procs[proc_id]

    if returncode != 0:
        raise Exception(stderr)
    else:
        if data:
            procs_data.append(data)
        return stdout


# server manifest
def ismParse(data):
    # need to remove the namespace below to make xml parse work
    # (data is read in binary mode, hence the bytes literals)
    data = data.replace(b' xmlns="http://www.w3.org/2001/SMIL20/Language"', b'')
    root = ET.fromstring(data)

    # head
    for m in root.iter('head'):
        for p in m.iter('meta'):
            ismvInfo['baseName'] = (p.attrib['content']).split('.')[0]

    # videoAttributes
    for v in root.iter('video'):
        ismvInfo['video']['src'].append(v.attrib['src'])
        for p in v.iter('param'):
            ismvInfo['video']['TrackIDvalue'].append(p.attrib['value'])

    # audioAttributes
    for a in root.iter('audio'):
        ismvInfo['audio']['src'].append(a.attrib['src'])
        for p in a.iter('param'):
            ismvInfo['audio']['TrackIDvalue'].append(p.attrib['value'])


# client manifest
def ismcParse(data):
    root = ET.fromstring(data)

    # duration
    ismvInfo['video']['Duration'] = root.attrib['Duration']

    for s in root.iter('StreamIndex'):
        if s.attrib['Type'] == 'video':
            ismvInfo['video']['QualityLevels'] = s.attrib['QualityLevels']
            ismvInfo['video']['Chunks'] = s.attrib['Chunks']
            ismvInfo['video']['Url'] = s.attrib['Url']
            for q in s.iter('QualityLevel'):
                ismvInfo['video']['index'].append(q.attrib['Index'])
                ismvInfo['video']['bitrate'].append(q.attrib['Bitrate'])
                ismvInfo['video']['fourCC'].append(q.attrib['FourCC'])
                ismvInfo['video']['width'].append(q.attrib['MaxWidth'])
                ismvInfo['video']['height'].append(q.attrib['MaxHeight'])
                ismvInfo['video']['codecPrivateData'].append(q.attrib['CodecPrivateData'])
            # video frag duration
            for c in s.iter('c'):
                ismvInfo['video']['fragDurations'].append(c.attrib['d'])

        elif s.attrib['Type'] == 'audio':
            ismvInfo['audio']['QualityLevels'] = s.attrib['QualityLevels']
            ismvInfo['audio']['Url'] = s.attrib['Url']
            for q in s.iter('QualityLevel'):
                ismvInfo['audio']['index'].append(q.attrib['Index'])
                ismvInfo['audio']['bitrate'].append(q.attrib['Bitrate'])
                ismvInfo['audio']['fourCC'].append(q.attrib['FourCC'])
                ismvInfo['audio']['samplingRate'].append(q.attrib['SamplingRate'])
                ismvInfo['audio']['channels'].append(q.attrib['Channels'])
                ismvInfo['audio']['bitsPerSample'].append(q.attrib['BitsPerSample'])
                ismvInfo['audio']['packetSize'].append(q.attrib['PacketSize'])
                ismvInfo['audio']['audioTag'].append(q.attrib['AudioTag'])
                ismvInfo['audio']['codecPrivateData'].append(q.attrib['CodecPrivateData'])
            # audio frag duration
            for c in s.iter('c'):
                ismvInfo['audio']['fragDurations'].append(c.attrib['d'])


def populateManifestMetadata(base):
    try:
        # parse server manifest and populate ismv info data
        with open(base + '.ism', 'rb') as manifest:
            ismData = manifest.read()
            ismParse(ismData)

        # parse client manifest and populate ismv info data
        with open(base + '.ismc', 'rb') as manifest:
            ismcData = manifest.read()
            ismcParse(ismcData)

    except Exception as e:
        raise RuntimeError("issue opening ismv manifest file") from e


# input
# ismvFiles - list of ismv files
# base - basename of ismv files
def setManifestMetadata(ismvFiles, base):
    # e.g. ['ismindex', '-n', base, 'bunny_400.ismv', 'bunny_894.ismv', 'bunny_2000.ismv']
    cmd = ['ismindex', '-n', base]
    for ism in ismvFiles:
        cmd.append(ism)
    stdout = runCommand(cmd, return_stdout=True, busy_wait=False)
    populateManifestMetadata(base)


if __name__ == '__main__':
    ismvFiles = ['bunny_400.ismv', 'bunny_894.ismv', 'bunny_2000.ismv']
    base = 'bunny'
    setManifestMetadata(ismvFiles, base)

    # save to json file
    with open('ismvInfo.json', 'w', encoding='utf-8') as f:
        json.dump(ismvInfo, f)
The output is ismvInfo.json.
There are some mismatches in JSON's coverage of Python datatypes. Some of them are simply naming differences, but there are two important Python datatypes that are completely missing: tuples and bytes.
Python 3 | JSON |
---|---|
dictionary | object |
list | array |
tuple | N/A |
bytes | N/A |
float | real number |
True | true |
False | false |
None | null |
>>> import json
>>> with open('ebook.json', 'r', encoding='utf-8') as f:
...     data_from_json = json.load(f)
...
>>> data_from_json
{'title': 'Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition', 'tags': ['Photography', 'Kindle', 'Light'], 'id': 1024, 'comment_link': None, 'published': True}
>>>
The following code makes a list of dictionary items and then saves it to json. The input used in the code is semicolon-separated with three columns, like this:
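The session above shows that the tuple of tags comes back as a list. Types that json does not know at all, such as bytes, can be handled with the encoding and decoding hooks. The following is a sketch under stated assumptions: the helper names encode_extra and decode_extra and the '__bytes__' marker key are made up for this example, and base64 is just one possible text representation for binary data.

```python
import base64
import json


def encode_extra(obj):
    """Hypothetical `default` hook: called for objects json cannot encode."""
    if isinstance(obj, bytes):
        return {'__bytes__': base64.b64encode(obj).decode('ascii')}
    raise TypeError(f'{type(obj).__name__} is not JSON serializable')


def decode_extra(dct):
    """Hypothetical `object_hook`: called for every decoded JSON object."""
    if '__bytes__' in dct:
        return base64.b64decode(dct['__bytes__'])
    return dct


book = {'id': b'\xac\xe2\xc1\xd7', 'tags': ('Photography', 'Kindle')}
text = json.dumps(book, default=encode_extra)
restored = json.loads(text, object_hook=decode_extra)
```

The bytes value survives the round trip, while the tuple still comes back as a list, because json encodes tuples as arrays without calling the default hook.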
protocol;service;plugin
Before turning each line into a dictionary, we add an additional field, 'value':
try:
    import simplejson as json
except ImportError:
    import json


def get_data(dat):
    # Text mode, so that each line is a str that can be split on ';'
    with open('input.txt', 'r') as f:
        for l in f:
            d = {}
            line = l.rstrip().split(';')
            # extra column that becomes the 'value' field
            line.append(0)
            d['protocol'] = line[0]
            d['service'] = line[1]
            d['plugin'] = line[2]
            d['value'] = line[3]
            dat.append(d)
    return dat


def convert_to_json(data):
    with open('data.json', 'w') as f:
        json.dump(data, f)


if __name__ == '__main__':
    data = []
    data = get_data(data)
    convert_to_json(data)
The output json file looks like this:
[{"protocol": "pro1", "value": 0, "service": "service1", "plugin": "check_wmi_plus.pl -H 10.6.88.72 -m checkfolderfilecount -u administrator -p c0c1c -w 1000 -c 2000 -a 's:' -o 'error/' --nodatamode"}, {"protocol": "proto2", "value": 1, "service": "service2", "plugin": "check_wmi_plus.pl -H 10.6.88.72 -m checkdirage -u administrator -p a23aa8 --nodatamode -c :1 -a s -o input/ -3 `date --utc --date '-30 mins' +\"%Y%m%d%H%M%S.000000+000\" `"},...]
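The same conversion can also be sketched with the csv module, which handles the splitting for us. This is an alternative to the code above, not a replacement for it: the two input rows are made-up sample data held in a StringIO object that stands in for input.txt, and real plugin strings containing quote characters may need the reader's quoting options adjusted.

```python
import csv
import io
import json

# Stand-in for input.txt; the two rows are made-up sample data.
sample = io.StringIO('pro1;service1;plugin_a\npro2;service2;plugin_b\n')

fields = ['protocol', 'service', 'plugin']
rows = []
for row in csv.DictReader(sample, fieldnames=fields, delimiter=';'):
    record = dict(row)
    record['value'] = 0    # the extra field added to every record
    rows.append(record)

text = json.dumps(rows, indent=2)
```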
Some sections of this chapter (the pickle part) are largely based on http://getpython3.com/diveintopython3/serializing.html