Serialization with pickle and json


Serialization

Serialization is the process of converting a data structure or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and resurrected later in the same or another computer environment.

When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object.

This process of serializing an object is also called deflating or marshalling an object. The opposite operation, extracting a data structure from a series of bytes, is deserialization (which is also called inflating or unmarshalling). wiki.

In Python, we have the pickle module. The bulk of the pickle module is written in C, like the Python interpreter itself. It can store arbitrarily complex Python data structures. It is a cross-version customisable but unsafe (not secure against erroneous or malicious data) serialization format.

The standard library also includes modules serializing to standard data formats:

json with built-in support for basic scalar and collection types and able to support arbitrary types via encoding and decoding hooks.
XML-encoded property lists. (plistlib), limited to plist-supported types (numbers, strings, booleans, tuples, lists, dictionaries, datetime and binary blobs)
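As a minimal sketch of the plistlib option, the snippet below round-trips a small dictionary through an XML property list held in memory. The record dictionary is made-up sample data, not something from this chapter, and the sketch assumes Python 3.4 or later, where plistlib.dumps() and plistlib.loads() are available.

```python
import plistlib

# Hypothetical sample data, limited to plist-supported types
record = {
    'title': 'Light Science and Magic',
    'published': True,
    'tags': ['Photography', 'Kindle'],
    'id': b'\xac\xe2\xc1\xd7',
}

xml_bytes = plistlib.dumps(record)      # XML property list, returned as bytes
restored = plistlib.loads(xml_bytes)    # parse it back into Python objects

print(type(xml_bytes))
print(restored == record)
```

A list is used for the tags here because a property list has a single array type, so a tuple would not come back as a tuple.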

Finally, it is recommended that an object's __repr__ be evaluable in the right environment, making it a rough match for Common Lisp's print-object. wiki
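The idea of an evaluable __repr__ can be illustrated with a built-in type. This is only a sketch: the tags tuple is sample data, and ast.literal_eval() is shown next to eval() because eval() will run arbitrary code and should only ever see trusted input.

```python
import ast

tags = ('Photography', 'Kindle', 'Light')

text = repr(tags)                     # a string holding valid Python source
clone = ast.literal_eval(text)        # safe: accepts literals only
clone_via_eval = eval(text)           # works too, but only for trusted text

print(text)
print(clone == tags, clone_via_eval == tags)
```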

Pickle

What data type can pickle store?

Here are the things that the pickle module can store:

All the native datatypes that Python supports: booleans, integers, floating point numbers, complex numbers, strings, bytes objects, byte arrays, and None.

Lists, tuples, dictionaries, and sets containing any combination of native datatypes.

Lists, tuples, dictionaries, and sets containing any combination of lists, tuples, dictionaries, and sets containing any combination of native datatypes (and so on, to the maximum nesting level that Python supports).

Functions, classes, and instances of classes (with caveats).
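The list above can be checked with a short experiment. The sketch below uses made-up sample values. It also illustrates one of the caveats: functions and classes are pickled by reference (module name plus object name), so an object without an importable name, such as a lambda, cannot be pickled.

```python
import pickle

# Made-up sample mixing native types and nested containers
nested = {
    'numbers': [1, 2.5, 3 + 4j],
    'flags': (True, False, None),
    'raw': b'\x00\x01',
    'names': {'a', 'b'},
    'deep': [[('x', {'y': [1, 2, 3]})]],
}
round_trip = pickle.loads(pickle.dumps(nested))
print(round_trip == nested)

# A function is stored as a reference to its name, not as code,
# so loading it gives back the very same function object.
same_function = pickle.loads(pickle.dumps(len))
print(same_function is len)

# A lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda x: x)
    lambda_ok = True
except (pickle.PicklingError, AttributeError):
    lambda_ok = False
print(lambda_ok)
```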

Constructing Pickle data

We will use two Python Shells, 'A' & 'B':

>>> shell = 'A'

Open another Shell:

>>> shell = 'B'

Here is the dictionary type data for Shell 'A':

>>> shell
'A'
>>> book = {}
>>> book['title'] = 'Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition'
>>> book['page_link'] = 'http://www.amazon.com/Fil-Hunter/e/B001ITTV7A'
>>> book['comment_link'] = None
>>> book['id'] = b'\xAC\xE2\xC1\xD7'
>>> book['tags'] = ('Photography', 'Kindle', 'Light')
>>> book['published'] = True
>>> import time
>>> book['published_time'] = time.strptime('Mon Sep 10 23:18:32 2012')
>>> book['published_time']
time.struct_time(tm_year=2012, tm_mon=9, tm_mday=10, tm_hour=23,
 tm_min=18, tm_sec=32, tm_wday=0, tm_yday=254, tm_isdst=-1)
>>>

Here, we're trying to use as many data types as possible.
The time module contains a data structure, struct_time, to represent a point in time, and functions to manipulate time structs. The strptime() function takes a formatted string and converts it to a struct_time. The string above is in the default format, but we can parse other layouts by passing format codes. For more details, see the documentation of the time module.
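To illustrate the format codes, here is a small sketch that parses the same moment written in a different layout. The ISO-like input string is an assumption chosen for the example; the format codes themselves (%Y, %m, %d, %H, %M, %S) are standard.

```python
import time

# Same point in time as above, written as year-month-day
parsed = time.strptime('2012-09-10 23:18:32', '%Y-%m-%d %H:%M:%S')

print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)
print(parsed.tm_hour, parsed.tm_min, parsed.tm_sec)

# strftime() goes the other way, from a struct_time to a string
as_text = time.strftime('%Y/%m/%d', parsed)
print(as_text)
```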

Saving data as a pickle file

Now we have a dictionary that holds all the information about the book. Let's save it as a pickle file:

>>> import pickle
>>> with open('book.pickle', 'wb') as f:
	pickle.dump(book, f)

We set the file mode to wb to open the file for writing in binary mode, and wrap it in a with statement to ensure the file is closed automatically when we're done with it. The dump() function in the pickle module takes a serializable Python data structure, serializes it into a binary, Python-specific format using the default version of the pickle protocol, and saves it to an open file.

The pickle module takes a Python data structure and saves it to a file.
To do this, it serializes the data structure using a data format called the pickle protocol.
The pickle protocol is Python-specific; there is no guarantee of cross-language compatibility.
Not every Python data structure can be serialized by the pickle module. The pickle protocol has changed several times as new data types have been added to the Python language, but there are still limitations.
Because of these changes, there is no guarantee of compatibility between different versions of Python itself: an older version cannot read data written with a protocol it does not know.
Unless we specify otherwise, the functions in the pickle module use the default protocol (pickle.DEFAULT_PROTOCOL), which is not necessarily the highest one available (pickle.HIGHEST_PROTOCOL).
The current protocols are binary formats. Be sure to open pickle files in binary mode, or writing will fail or corrupt the data.
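The protocol can be inspected and chosen explicitly. The following is a minimal sketch with a made-up data dictionary. Which numbers DEFAULT_PROTOCOL and HIGHEST_PROTOCOL hold depends on the Python version, so no values are shown here.

```python
import pickle

data = {'title': 'example', 'id': b'\xac\xe2'}   # made-up sample data

# Both constants depend on the running Python version
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# Pick an older protocol when an older Python has to read the result
old_style = pickle.dumps(data, protocol=2)
newest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# The protocol is recorded in the stream, so loads() needs no hint
from_old = pickle.loads(old_style)
from_new = pickle.loads(newest)
print(from_old == from_new == data)
```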

Loading data from a pickle file

Let's load the saved data from a pickle file on another Python Shell B.

>>> shell
'B'
>>> import pickle
>>> with open('book.pickle', 'rb') as f:
	b = pickle.load(f)

>>> b
{'published_time': time.struct_time(tm_year=2012, tm_mon=9, 
tm_mday=10, tm_hour=23, tm_min=18, tm_sec=32, tm_wday=0, tm_yday=254, tm_isdst=-1), 
'title': 'Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition', 
'tags': ('Photography', 'Kindle', 'Light'), 
'page_link': 'http://www.amazon.com/Fil-Hunter/e/B001ITTV7A', 
'published': True, 'id': b'\xac\xe2\xc1\xd7', 'comment_link': None}

There is no book variable defined here since we defined a book variable in Python Shell A.
We opened the book.pickle file we created in Python Shell A. The pickle module uses a binary data format, so we should always open pickle files in binary mode.
The pickle.load() function takes a stream object, reads the serialized data from the stream, creates a new Python object, recreates the serialized data in the new Python object, and returns the new Python object.
The pickle.dump()/pickle.load() cycle creates a new data structure that is equal to the original data structure.

Let's switch back to Python Shell A.

>>> shell
'A'
>>> with open('book.pickle', 'rb') as f:
	book2 = pickle.load(f)
	
>>> book2 == book
True
>>> book2 is book
False

We opened the book.pickle file, and loaded the serialized data into a new variable, book2.
The two dictionaries, book and book2, are equal.
We serialized this dictionary, stored it in the book.pickle file, then read the serialized data back from that file and created a perfect replica of the original data structure.
Equality is not the same as identity. We've created a perfect replica of the original data structure, but it is still a copy: book2 == book is True, while book2 is book is False.

Serializing data in memory with pickle

If we don't want to use a file, we can still serialize an object in memory.

>>> shell
'A'
>>> m = pickle.dumps(book)
>>> type(m)
<class 'bytes'>
>>> book3 = pickle.loads(m)
>>> book3 == book
True

The pickle.dumps() function (note that we're using the s at the end of the function name, not the dump()) performs the same serialization as the pickle.dump() function. Instead of taking a stream object and writing the serialized data to a file on disk, it simply returns the serialized data.
Since the pickle protocol uses a binary data format, the pickle.dumps() function returns a bytes object.
The pickle.loads() function (again, note the s at the end of the function name) performs the same deserialization as the pickle.load() function. Instead of taking a stream object and reading the serialized data from a file, it takes a bytes object containing serialized data, such as the one returned by the pickle.dumps() function.
The end result is the same: a perfect replica of the original dictionary.
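Since pickle.dump() and pickle.load() only need a binary stream object, an in-memory buffer works as well as a file on disk. The sketch below assumes io.BytesIO as that stream and uses a shortened, made-up book dictionary.

```python
import io
import pickle

book = {'title': 'Light Science and Magic', 'tags': ('Photography', 'Kindle')}

buffer = io.BytesIO()          # behaves like a file opened in binary mode
pickle.dump(book, buffer)      # write the serialized bytes into the buffer

buffer.seek(0)                 # rewind before reading
copy = pickle.load(buffer)

print(copy == book)
print(copy is book)
```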

Python serialized object and JSON

The data format used by the pickle module is Python-specific. It makes no attempt to be compatible with other programming languages. If cross-language compatibility is one of our requirements, we need to look at other serialization formats. One such format is json.

JSON (JavaScript Object Notation) is a text-based open standard designed for human-readable data interchange. It is derived from the JavaScript scripting language for representing simple data structures and associative arrays, called objects. Despite its relationship with JavaScript, it is language-independent, with parsers available for many languages. json is explicitly designed to be usable across multiple programming languages. The JSON format is often used for serializing and transmitting structured data over a network connection. It is used primarily to transmit data between a server and a web application, serving as an alternative to XML (from wiki).

Python 3 includes a json module in the standard library. Like the pickle module, the json module has functions for serializing data structures, storing the serialized data on disk, loading serialized data from disk, and unserializing the data back into a new Python object. But there are some important differences, too.

The json data format is text-based, not binary. All json values are case-sensitive.
As with any text-based format, there is the issue of whitespace. json allows arbitrary amounts of whitespace (spaces, tabs, carriage returns, and line feeds) between values. This whitespace is insignificant, which means that json encoders can add as much or as little whitespace as they like, and json decoders are required to ignore the whitespace between values. This allows us to pretty-print our json data, nicely nesting values within values at different indentation levels so we can read it in a standard browser or text editor. Python's json module has options for pretty-printing during encoding.
There's the perennial problem of character encoding. json encodes values as plain text, but as we know, there is no such thing as plain text. json must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, UTF-8). For details on encodings in json, see RFC 4627.
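Both points, insignificant whitespace and character encoding, can be seen in a few lines. This is a sketch with made-up sample strings. The ensure_ascii parameter of json.dumps() controls whether non-ASCII characters are written as \u escapes or left as they are.

```python
import json

compact = '{"id":1024,"tags":["Kindle","Light"]}'
spaced = '''
{
    "id"   : 1024,
    "tags" : [ "Kindle",
               "Light" ]
}
'''
# Whitespace between values is ignored by the decoder
same = json.loads(compact) == json.loads(spaced)
print(same)

# By default non-ASCII characters are escaped, so the output is pure ASCII
escaped = json.dumps({'city': 'Bogotá'})
# With ensure_ascii=False the character is kept; the file encoding then matters
literal = json.dumps({'city': 'Bogotá'}, ensure_ascii=False)
print(escaped)
print(literal)
```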

Saving data to JSON

We're going to create a new data structure instead of re-using the existing book dictionary, because it contains types that json cannot store. json is a text-based format, which means we need to open the file in text mode and specify a character encoding. We can never go wrong with utf-8.

try:
    import simplejson as json
except ImportError:
    import json

book = {}
book['title'] = 'Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition'
book['tags'] = ('Photography', 'Kindle', 'Light')
book['published'] = True
book['comment_link'] = None
book['id'] = 1024

# JSON is text: open the file in text mode and name the encoding explicitly
with open('ebook.json', 'w', encoding='utf-8') as f:
    json.dump(book, f)

Like the pickle module, the json module defines a dump() function which takes a Python data structure and a writable stream object. The dump() function serializes the Python data structure and writes it to the stream object. Doing this inside a with statement will ensure that the file is closed properly when we're done.

Let's see what's in ebook.json file:

$ cat ebook.json
{"published": true, "tags": ["Photography", "Kindle", "Light"], "id": 1024, "comment_link": null, "title": "Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition"}

It's clearly more readable than a pickle file. But json can contain arbitrary whitespace between values, and the json module provides an easy way to take advantage of this to create even more readable json files:

>>> with open('book_more_friendly.json', mode='w', encoding='utf-8') as f:
	json.dump(book, f, indent=3)

We passed an indent parameter to the json.dump() function, and it made the resulting json file more readable, at the expense of larger file size. The indent parameter is an integer.

$ cat book_more_friendly.json
{
   "published": true,
   "tags": [
      "Photography",
      "Kindle",
      "Light"
   ],
   "id": 1024,
   "comment_link": null,
   "title": "Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition"
}
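The trade-off between readability and size can be sketched with json.dumps(), which returns the text instead of writing it to a file. The shortened book dictionary is sample data. sort_keys and separators are further optional parameters of the json module; they are not used elsewhere in this chapter.

```python
import json

book = {'title': 'Light Science and Magic',
        'tags': ['Photography', 'Kindle'],
        'id': 1024}

# Readable: one value per line, three spaces per nesting level
pretty = json.dumps(book, indent=3, sort_keys=True)
# Small: no whitespace at all after the separators
compact = json.dumps(book, separators=(',', ':'), sort_keys=True)

print(pretty)
print(compact)
print(len(pretty) > len(compact))
```

Both strings decode to the same data, so the choice only affects file size and how pleasant the file is to read.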

Here is another example for json:

#!/usr/bin/python
import psutil
import subprocess
import time
import json
import codecs

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

procs_id = 0
procs = {}
procs_data = []

ismvInfo = {
   'baseName':' ',
   'video': {
      'src':[],
      'TrackIDvalue':[],
      'Duration': 0,
      'QualityLevels': 1,
      'Chunks': 0,
      'Url': '',
      'index':[],
      'bitrate':[],
      'fourCC':[],
      'width': [],
      'height':[],
      'codecPrivateData': [],
      'fragDurations':[]
   },
   'audio': {
      'src':[],
      'TrackIDvalue':[],
      'QualityLevels': 1,
      'index':[],
      'bitrate':[],
      'fourCC':[],
      'samplingRate':[],
      'channels':[],
      'bitsPerSample':[],
      'packetSize':[],
      'audioTag': [],
      'codecPrivateData': [],
      'fragDurations': [],
   }
}

def runCommand(cmd, use_shell = False, return_stdout = False, busy_wait = True, poll_duration = 0.5):
    # Convert every argument to a string
    cmd = ['%s' % x for x in cmd]
    if use_shell:
        # the shell expects a single command string
        command = ' '.join(cmd)
    else:
        command = cmd

    if return_stdout:
        proc = psutil.Popen(command, shell = use_shell, stdout = subprocess.PIPE, stderr = subprocess.PIPE)
    else:
        # discard the output of the child process
        proc = psutil.Popen(command, shell = use_shell,
                                stdout = subprocess.DEVNULL,
                                stderr = subprocess.DEVNULL)

    global procs_id
    global procs
    global procs_data
    proc_id = procs_id
    procs[proc_id] = proc
    procs_id += 1
    data = { }

    while busy_wait:
        returncode = proc.poll()
        if returncode is None:
            try:
                # attribute names as of psutil 2.0 (older releases used a get_ prefix)
                data = proc.as_dict(attrs = ['io_counters', 'cpu_times'])
            except Exception:
                pass
            time.sleep(poll_duration)
        else:
            break

    (stdout, stderr) = proc.communicate()
    returncode = proc.returncode
    del procs[proc_id]

    if returncode != 0:
        raise Exception(stderr)
    else:
        if data:
            procs_data.append(data)
        return stdout

# server manifest
def ismParse(data):
    # remove the default namespace so the tag names below match without a prefix
    # (data is bytes here because the manifest is opened in binary mode)
    data = data.replace(b' xmlns="http://www.w3.org/2001/SMIL20/Language"', b'')
    root = ET.fromstring(data)

    # head 
    for m in root.iter('head'):
        for p in m.iter('meta'):
            ismvInfo['baseName'] = (p.attrib['content']).split('.')[0]

    # videoAttributes
    for v in root.iter('video'):
        ismvInfo['video']['src'].append(v.attrib['src'])
        for p in v.iter('param'):
            ismvInfo['video']['TrackIDvalue'].append(p.attrib['value'])

    # audioAttributes
    for a in root.iter('audio'):
        ismvInfo['audio']['src'].append(a.attrib['src'])
        for p in a.iter('param'):
            ismvInfo['audio']['TrackIDvalue'].append(p.attrib['value'])

# client manifest
def ismcParse(data):
    root = ET.fromstring(data)

    # duration
    # streamDuration = root.attrib['Duration']
    ismvInfo['video']['Duration'] = root.attrib['Duration']

    for s in root.iter('StreamIndex'):
        if(s.attrib['Type'] == 'video'):
            ismvInfo['video']['QualityLevels'] = s.attrib['QualityLevels']
            ismvInfo['video']['Chunks'] = s.attrib['Chunks']
            ismvInfo['video']['Url'] = s.attrib['Url']
            for q in s.iter('QualityLevel'):
                ismvInfo['video']['index'].append(q.attrib['Index'])
                ismvInfo['video']['bitrate'].append(q.attrib['Bitrate'])
                ismvInfo['video']['fourCC'].append(q.attrib['FourCC'])
                ismvInfo['video']['width'].append(q.attrib['MaxWidth'])
                ismvInfo['video']['height'].append(q.attrib['MaxHeight'])
                ismvInfo['video']['codecPrivateData'].append(q.attrib['CodecPrivateData'])

            # video frag duration
            for c in s.iter('c'):
                ismvInfo['video']['fragDurations'].append(c.attrib['d'])

        elif(s.attrib['Type'] == 'audio'):
            ismvInfo['audio']['QualityLevels'] = s.attrib['QualityLevels']
            ismvInfo['audio']['Url'] = s.attrib['Url']
            for q in s.iter('QualityLevel'):
                #ismvInfo['audio']['index'] = q.attrib['Index'] 
                ismvInfo['audio']['index'].append(q.attrib['Index'])
                ismvInfo['audio']['bitrate'].append(q.attrib['Bitrate'])
                ismvInfo['audio']['fourCC'].append(q.attrib['FourCC'])
                ismvInfo['audio']['samplingRate'].append(q.attrib['SamplingRate'])
                ismvInfo['audio']['channels'].append(q.attrib['Channels'])
                ismvInfo['audio']['bitsPerSample'].append(q.attrib['BitsPerSample'])
                ismvInfo['audio']['packetSize'].append(q.attrib['PacketSize'])
                ismvInfo['audio']['audioTag'].append(q.attrib['AudioTag'])
                ismvInfo['audio']['codecPrivateData'].append(q.attrib['CodecPrivateData'])
            # audio frag duration
            for c in s.iter('c'):
                #audioFragDuration.append(c.attrib['d'])
                ismvInfo['audio']['fragDurations'].append(c.attrib['d'])

def populateManifestMetadata(base):
    try:
        # parse server manifest and populate ismv info data
        with open(base+'.ism', 'rb') as manifest:
            ismData = manifest.read()
            ismParse(ismData)

        # parse client manifest and populate ismv info data
        with open(base+'.ismc', 'rb') as manifest:
            ismcData = manifest.read()
            ismcParse(ismcData)

    except Exception as e:
        raise RuntimeError("issue opening ismv manifest file") from e

# input
# ismvFiles - list of ismv files
# base      - basename of ismv files
def setManifestMetadata(ismvFiles, base):
    #cmd = ['ismindex','-n', ismTmpName,'bunny_400.ismv','bunny_894.ismv','bunny_2000.ismv' ] 
    cmd = ['ismindex','-n', base]
    for ism in ismvFiles:
        cmd.append(ism)
    stdout = runCommand(cmd, return_stdout = True, busy_wait = False)
    populateManifestMetadata(base)

if __name__ == '__main__':

   ismvFiles = ['bunny_400.ismv','bunny_894.ismv','bunny_2000.ismv']
   base = 'bunny'

   setManifestMetadata(ismvFiles, base)

   # save to json file
   with codecs.open('ismvInfo.json', 'w', encoding='utf-8') as f:
        json.dump(ismvInfo, f)

The output is ismvInfo.json.

Data type mapping

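There are some mismatches in JSON's coverage of Python datatypes. Some of them are simply naming differences, but there are two important Python datatypes that are completely missing: tuples and bytes. The json module writes a tuple as an array, so it comes back as a list, and it refuses to serialize a bytes object unless we supply our own encoding hook.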

Python3	Json
dictionary	object
list	array
tuple	N/A
bytes	N/A
float	real number
True	true
False	false
None	null
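The two missing types behave differently in practice, as the sketch below shows. The encode_bytes and decode_bytes helpers and the '__bytes__' marker key are hypothetical names invented for this example, and base64 is just one possible way to represent bytes as text. The default= and object_hook= parameters are the json module's encoding and decoding hooks.

```python
import base64
import json

book = {'tags': ('Photography', 'Kindle'), 'id': b'\xac\xe2\xc1\xd7'}

# A tuple is written as an array and therefore comes back as a list
tags_back = json.loads(json.dumps({'tags': book['tags']}))
print(tags_back)

# A bytes object is rejected outright
try:
    json.dumps(book)
    bytes_rejected = False
except TypeError:
    bytes_rejected = True
print(bytes_rejected)

def encode_bytes(obj):
    """default= hook: called only for objects json cannot serialize itself."""
    if isinstance(obj, bytes):
        return {'__bytes__': base64.b64encode(obj).decode('ascii')}
    raise TypeError(type(obj).__name__ + ' is not JSON serializable')

def decode_bytes(dct):
    """object_hook= hook: called for every decoded JSON object."""
    if '__bytes__' in dct:
        return base64.b64decode(dct['__bytes__'])
    return dct

text = json.dumps(book, default=encode_bytes)
restored = json.loads(text, object_hook=decode_bytes)
print(restored)
```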

Loading data from a JSON file

>>> import json
>>> with open('ebook.json', 'r', encoding='utf-8') as f:
	data_from_json = json.load(f)

>>> data_from_json
{'title': 'Light Science and Magic: An Introduction to Photographic Lighting, Kindle Edition', 
'tags': ['Photography', 'Kindle', 'Light'], 'id': 1024, 'comment_link': None, 'published': True}
>>>

List to JSON file

The following code builds a list of dictionaries and then saves it as json. The input used in the code is semicolon-separated, with three columns like this:

protocol;service;plugin

Before building the list of dictionaries, we add an extra field, 'value', to each line:

try:
    import simplejson as json
except ImportError:
    import json

def get_data(dat):
    # open in text mode so each line is a str and can be split on ';'
    with open('input.txt', 'r') as f:
        for l in f:
            d = {}
            line = l.rstrip().split(';')
            line.append(0)    # the extra 'value' field, initialised to 0
            d['protocol'] = line[0]
            d['service'] = line[1]
            d['plugin'] = line[2]
            d['value'] = line[3]
            dat.append(d)
    return dat

def convert_to_json(data):
    with open('data.json', 'w') as f:
        json.dump(data, f)

if __name__ == '__main__':
    data = []
    data = get_data(data)
    convert_to_json(data)

The output json file looks like this:

[{"protocol": "pro1", "value": 0, "service": "service1", "plugin": "check_wmi_plus.pl -H 10.6.88.72 -m checkfolderfilecount -u administrator -p c0c1c -w 1000 -c 2000 -a 's:' -o 'error/' --nodatamode"}, {"protocol": "proto2", "value": 1, "service": "service2", "plugin": "check_wmi_plus.pl -H 10.6.88.72 -m checkdirage -u administrator -p a23aa8 --nodatamode -c :1 -a s -o input/ -3 `date --utc --date '-30 mins' +\"%Y%m%d%H%M%S.000000+000\" `"},...]
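The same conversion can also be sketched with the csv module, which takes care of splitting the columns and naming them. This is an alternative to the code above, not a replacement for it. The two input lines are made up for the example, and io.StringIO stands in for input.txt so the snippet runs on its own.

```python
import csv
import io
import json

# Two made-up lines in the protocol;service;plugin layout
sample = io.StringIO(
    'pro1;service1;check_a -w 1000 -c 2000\n'
    'proto2;service2;check_b --nodatamode\n'
)

# QUOTE_NONE keeps any quote characters inside the plugin column untouched
reader = csv.DictReader(sample,
                        fieldnames=['protocol', 'service', 'plugin'],
                        delimiter=';',
                        quoting=csv.QUOTE_NONE)

rows = []
for row in reader:
    record = dict(row)
    record['value'] = 0        # the extra field, as in the code above
    rows.append(record)

print(json.dumps(rows, indent=2))
```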

Some sections of this chapter (the pickle sections in particular) are largely based on http://getpython3.com/diveintopython3/serializing.html