AWS : S3 (Simple Storage Service) IV - Uploading a large file
The code below is based on An Introduction to boto's S3 interface - Storing Data.
To set up boto on a Mac:
$ sudo easy_install pip
$ sudo pip install boto
Because S3 requires AWS credentials, we need to provide our keys: AWS_ACCESS_KEY and AWS_ACCESS_SECRET_KEY. The code reads them from boto's configuration file, /etc/boto.cfg or ~/.boto:
[Credentials]
AWS_ACCESS_KEY_ID = A...3
AWS_SECRET_ACCESS_KEY = W...9
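As a quick sanity check before running the upload script, we can confirm that boto actually picks the keys up from ~/.boto (or /etc/boto.cfg). This is just a minimal sketch using the same boto.config.get() calls the script relies on; the file name check_boto_config.py is hypothetical and not part of the upload script:

#!/bin/python
# check_boto_config.py - hypothetical helper, not part of the upload script.
# boto.config reads /etc/boto.cfg and ~/.boto automatically when boto is imported.
import boto

access_key = boto.config.get('Credentials', 'aws_access_key_id')
secret_key = boto.config.get('Credentials', 'aws_secret_access_key')

if access_key and secret_key:
    print 'Credentials found (key id starts with %s...)' % access_key[:4]
else:
    print 'No credentials found - check ~/.boto or /etc/boto.cfg'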
Here is our Python code (s3upload.py):
#!/bin/python
import os
import argparse
import boto
import sys
from boto.s3.key import Key

AWS_ACCESS_KEY = boto.config.get('Credentials', 'aws_access_key_id')
AWS_ACCESS_SECRET_KEY = boto.config.get('Credentials', 'aws_secret_access_key')

def check_arg(args=None):
    parser = argparse.ArgumentParser(description='args : bucket name, file to upload')
    parser.add_argument('-b', '--bucket',
                        help='bucket name',
                        required=True,
                        default='')
    parser.add_argument('-f', '--filename',
                        help='file to upload',
                        required=True,
                        default='')

    results = parser.parse_args(args)
    return (results.bucket, results.filename)

def upload_to_s3(aws_access_key_id, aws_secret_access_key, file, bucket, key,
                 callback=None, md5=None, reduced_redundancy=False, content_type=None):
    """
    Uploads the given file to the AWS S3 bucket and key specified.

    callback is a function of the form:

        def callback(complete, total)

    The callback should accept two integer parameters: the first is the
    number of bytes successfully transmitted to S3, and the second is the
    total size of the object being transmitted.

    Returns a boolean indicating success/failure of the upload.
    """
    try:
        size = os.fstat(file.fileno()).st_size
    except:
        # Not all file objects implement fileno(),
        # so we fall back on this
        file.seek(0, os.SEEK_END)
        size = file.tell()

    conn = boto.connect_s3(aws_access_key_id, aws_secret_access_key)

    # List all buckets visible to these credentials
    rs = conn.get_all_buckets()
    for b in rs:
        print b

    # Warn if the target bucket does not exist
    nonexistent = conn.lookup(bucket)
    if nonexistent is None:
        print 'Not there!'

    bucket = conn.get_bucket(bucket, validate=True)
    k = Key(bucket)
    k.key = key
    if content_type:
        k.set_metadata('Content-Type', content_type)
    sent = k.set_contents_from_file(file, cb=callback, md5=md5,
                                    reduced_redundancy=reduced_redundancy,
                                    rewind=True)

    # Rewind for later use
    file.seek(0)

    if sent == size:
        return True
    return False

if __name__ == '__main__':

    bucket, filename = check_arg(sys.argv[1:])
    file = open(filename, 'r+')

    print 'ACCESS_KEY=', AWS_ACCESS_KEY
    print 'ACCESS_SECRET_KEY=', AWS_ACCESS_SECRET_KEY

    key = file.name
    print 'key=', key
    print 'bucket=', bucket

    if upload_to_s3(AWS_ACCESS_KEY, AWS_ACCESS_SECRET_KEY, file, bucket, key):
        print 'It worked!'
    else:
        print 'The upload failed...'
To run, use the following syntax:
python s3upload.py -b bucket-name -f file-name
A real run looks like this:
$ python s3upload.py -b s3-sample-bucket -f sample-file
ACCESS_KEY= A...
ACCESS_SECRET_KEY= W...
key= sample-file
bucket= s3-sample-bucket
<Bucket: s3-sample-bucket>
It worked!
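Note that upload_to_s3() accepts an optional progress callback, as described in its docstring, but the script never passes one. If you want to watch the transfer, a callback along these lines should work; boto invokes it periodically with the bytes sent so far and the total object size (the function name and output format here are our own choice):

def progress_callback(complete, total):
    # complete: bytes transmitted so far; total: size of the object being sent
    if total > 0:
        print 'Transferred %d of %d bytes (%.1f%%)' % (complete, total,
                                                       100.0 * complete / total)

# Hypothetical usage - pass it through when calling upload_to_s3():
# upload_to_s3(AWS_ACCESS_KEY, AWS_ACCESS_SECRET_KEY, file, bucket, key,
#              callback=progress_callback)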
The code below is based on An Introduction to boto's S3 interface - Storing Large Data.
To make the code work, we need to install both boto and FileChunkIO.
To upload a big file, we split it into smaller chunks and upload each chunk in turn; S3 then combines them into the final object. The Python code below makes use of the FileChunkIO module, so we may want to run
$ pip install FileChunkIO
if it isn't already installed.
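Before diving into the full script, it may help to see the chunk arithmetic on its own. With a 50 MiB chunk size, a roughly 524 MB file splits into ten full parts plus one smaller tail part, which is why the sample run below reports chunk_count=10 but sends eleven parts. The sketch below mirrors the offset/size calculation in the script; the file size is an assumption chosen to match the sample run, not any particular file of yours:

# Hypothetical sketch of the chunking arithmetic used by s3upload2.py below
import math

source_size = 524 * 1024 * 1024      # assumed ~524 MB, matching the sample run
chunk_size = 52428800                # 50 MiB

# Python 2 integer division truncates here, so the script loops
# chunk_count + 1 times to cover the final partial chunk.
# (If the file size were an exact multiple of chunk_size, the last
# iteration would produce a 0-byte part.)
chunk_count = int(math.ceil(source_size / chunk_size))

for i in range(chunk_count + 1):
    offset = chunk_size * i
    bytes = min(chunk_size, source_size - offset)
    print 'part %2d: offset=%d, bytes=%d' % (i + 1, offset, bytes)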
Here is our Python code (s3upload2.py):
#!/bin/python
# s3upload2.py
# Can be used to upload a large file to S3

import os
import sys
import argparse
import math
import boto
from boto.s3.key import Key
from filechunkio import FileChunkIO

def check_arg(args=None):
    parser = argparse.ArgumentParser(description='args : bucket name, file to upload')
    parser.add_argument('-b', '--bucket',
                        help='bucket name',
                        required=True,
                        default='')
    parser.add_argument('-f', '--filename',
                        help='file to upload',
                        required=True,
                        default='')

    results = parser.parse_args(args)
    return (results.bucket, results.filename)

def upload_to_s3(file, bucket):
    source_size = 0
    source_path = file.name
    try:
        source_size = os.fstat(file.fileno()).st_size
    except:
        # Not all file objects implement fileno(),
        # so we fall back on this
        file.seek(0, os.SEEK_END)
        source_size = file.tell()
    print 'source_size=%s MB' % (source_size / (1024 * 1024))

    aws_access_key = boto.config.get('Credentials', 'aws_access_key_id')
    aws_secret_access_key = boto.config.get('Credentials', 'aws_secret_access_key')

    conn = boto.connect_s3(aws_access_key, aws_secret_access_key)
    bucket = conn.get_bucket(bucket, validate=True)
    print 'bucket=%s' % (bucket)

    # Create a multipart upload request
    mp = bucket.initiate_multipart_upload(os.path.basename(source_path))

    # Use a chunk size of 50 MiB (feel free to change this)
    chunk_size = 52428800
    # Integer division truncates here, so the loop below runs
    # chunk_count + 1 times to pick up the final partial chunk.
    chunk_count = int(math.ceil(source_size / chunk_size))
    print 'chunk_count=%s' % (chunk_count)

    # Send the file parts, using FileChunkIO to create a file-like object
    # that points to a certain byte range within the original file. We
    # set bytes to never exceed the original file size.
    sent = 0
    for i in range(chunk_count + 1):
        offset = chunk_size * i
        bytes = min(chunk_size, source_size - offset)
        sent = sent + bytes
        with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
            mp.upload_part_from_file(fp, part_num=i + 1)
        print '%s: sent = %s MBytes' % (i, sent / 1024 / 1024)

    # Finish the upload
    mp.complete_upload()

    if sent == source_size:
        return True
    return False

if __name__ == '__main__':
    '''
    Usage: python s3upload2.py -b s3-sample-bucket -f sample-file2
    '''
    bucket, filename = check_arg(sys.argv[1:])
    file = open(filename, 'r+')

    if upload_to_s3(file, bucket):
        print 'It works!'
    else:
        print 'The upload failed...'
The script takes two arguments, the bucket name and the file name:
/Users/kihyuckhong/DATABACKUP_From_EC2$ python s3upload2.py -b s3-sample-bucket -f sample-file2
source_size=524 MB
bucket=<Bucket: s3-sample-bucket>
chunk_count=10
0: sent = 50 MBytes
1: sent = 100 MBytes
2: sent = 150 MBytes
3: sent = 200 MBytes
4: sent = 250 MBytes
5: sent = 300 MBytes
6: sent = 350 MBytes
7: sent = 400 MBytes
8: sent = 450 MBytes
9: sent = 500 MBytes
10: sent = 524 MBytes
It works!
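One caveat: if a run dies before mp.complete_upload(), the parts already uploaded remain in the bucket (and keep accruing storage charges) until the multipart upload is completed or aborted. A minimal hedged sketch for listing and cancelling stale multipart uploads with boto follows; the bucket name is the one from the sample above, and the script name is hypothetical:

# abort_stale_uploads.py - hypothetical cleanup helper for incomplete multipart uploads
import boto

conn = boto.connect_s3(boto.config.get('Credentials', 'aws_access_key_id'),
                       boto.config.get('Credentials', 'aws_secret_access_key'))
bucket = conn.get_bucket('s3-sample-bucket')

# Each entry is an in-progress multipart upload; cancel_upload()
# discards its parts so they no longer take up storage.
for mp in bucket.get_all_multipart_uploads():
    print 'aborting %s (started %s)' % (mp.key_name, mp.initiated)
    mp.cancel_upload()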