Importing multiple files from Google Cloud Bucket to Datalab instance - python-3.x

I have a bucket set up on Google Cloud containing a few hundred JSON files and am trying to work with them in a Datalab instance running Python 3.
So, I can easily see them as objects using
gcs list --objects gs://<BUCKET_NAME>
Further, I can read in an individual file/object using
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO
myBucket = storage.Bucket('<BUCKET_NAME>')
data_csv = myBucket.object('<FILE_NAME.json>')
uri = data_csv.uri
%gcs read --object $uri --variable data
df = pd.read_csv(BytesIO(data))
df.head()
(FYI, I understand that my example reads a JSON file as a CSV, but let's ignore that; I'll cross that bridge on my own.)
What I can't figure out is how to loop through the bucket and pull all of the JSON files into pandas. How do I do that? Is that the right way to think about this, or is there a way to call the files in the bucket from pandas directly (since they're already treated as objects)?
As an extra bit: what if a file is saved as a .json but isn't actually that structure? How can I handle that?
Essentially, I guess, I'm looking for the functionality of the blob package, but using cloud buckets + Datalab.
Any help is greatly appreciated.

This can be done using Bucket.objects, which returns an iterator over all matching files. Specify a prefix, or leave it empty to match all files in the bucket. Here is an example with two files, countries1.csv and countries2.csv:
$ cat countries1.csv
id,country
1,sweden
2,spain
$ cat countries2.csv
id,country
3,italy
4,france
And used the following Datalab snippet:
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO
myBucket = storage.Bucket('BUCKET_NAME')
object_list = myBucket.objects(prefix='countries')
df_list = []
for obj in object_list:  # 'obj' rather than 'object' to avoid shadowing the built-in name
    %gcs read --object $obj.uri --variable data
    df_list.append(pd.read_csv(BytesIO(data)))
concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()
which will output the combined csv:
id country
0 1 sweden
1 2 spain
2 3 italy
3 4 france
Take into account that this approach combines all the csv files into a single pandas DataFrame, but you might want to load them into separate ones depending on the use case. If you want to retrieve all files in the bucket, just use this instead:
object_list = myBucket.objects()
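For the JSON files in the original question, the same loop can feed pd.read_json() instead of pd.read_csv(), and a try/except around the parse step deals with objects that carry a .json name but don't actually contain valid JSON. This is a rough sketch built on the same %gcs pattern as above; the choice to simply skip bad files and record their URIs is my own assumption, not part of the original answer, and 'BUCKET_NAME' is a placeholder:
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO

myBucket = storage.Bucket('BUCKET_NAME')
df_list = []
bad_objects = []
for obj in myBucket.objects():  # no prefix, so every object in the bucket
    %gcs read --object $obj.uri --variable data
    try:
        df_list.append(pd.read_json(BytesIO(data)))
    except ValueError:
        bad_objects.append(obj.uri)  # named .json, but the contents don't parse as JSON
combined_df = pd.concat(df_list, ignore_index=True)
As with the csv case above, concatenating only makes sense if the files share a common structure; otherwise keep the per-file DataFrames in df_list.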

Related

Read n rows from csv in Google Cloud Storage to use with Python csv module

I have a variety of very large (~4 GB each) csv files in different formats. These come from data recorders from over 10 different manufacturers. I am attempting to consolidate all of these into BigQuery. In order to load them on a daily basis I want to first load the files into Cloud Storage, determine the schema, and then load into BigQuery. Because some of the files have additional header information (from 2 to ~30 lines), I have written my own functions to determine the most likely header row and the schema from a sample of each file (~100 lines), which I can then use in the job_config when loading the files to BQ.
This works fine when I am working with files from local storage directly to BQ, as I can use a context manager and then Python's csv module, specifically the Sniffer and reader objects. However, there does not seem to be an equivalent way to use a context manager directly from Cloud Storage. I do not want to bypass Cloud Storage in case any of these files are interrupted when loading into BQ.
What I can get to work:
import csv

# initialise variables (csv_file, encoding and chunk_size are defined earlier)
with open(csv_file, newline='', encoding=encoding) as datafile:
    dialect = csv.Sniffer().sniff(datafile.read(chunk_size))
    reader = csv.reader(datafile, dialect)
    sample_rows = []
    row_num = 0
    for row in reader:
        sample_rows.append(row)
        row_num += 1
        if row_num > 100:
            break
sample_rows
# Carry out schema and header investigation...
With Google Cloud Storage I have attempted to use download_as_string and download_to_file, which provide binary representations of the data, but then I cannot get the csv module to work with any of it. I have attempted to use .decode('utf-8'), which returns a long string full of \r\n's. I then used splitlines() to get a list of the data, but the csv functions still give a dialect and reader that split the data into single characters as each entry.
Has anyone managed to get a work around to use the csv module with files stored in Cloud Storage without downloading the whole file?
After having a look at the csv source code on GitHub, I have managed to use the io module and csv module in Python to solve this problem. The io.BytesIO and TextIOWrapper were the two key functions to use. Probably not a common use case but thought I would post the answer here to save some time for anyone that needs it.
import io
import csv
from google.cloud import storage

# Set up the storage client and create a blob object for the csv file you are trying to read from GCS
# ('BUCKET_NAME' and 'FILE_NAME.csv' are placeholders; encoding and newline are defined elsewhere).
blob = storage.Client().get_bucket('BUCKET_NAME').blob('FILE_NAME.csv')
content = blob.download_as_string(start=0, end=10240)  # Read a chunk of bytes that will include all header data and the start of the recorded data.
bytes_buffer = io.BytesIO(content)
wrapped_text = io.TextIOWrapper(bytes_buffer, encoding=encoding, newline=newline)
dialect = csv.Sniffer().sniff(wrapped_text.read())
wrapped_text.seek(0)
reader = csv.reader(wrapped_text, dialect)
# Do what you will with the reader object

How to pull only certain csv's and concat the data from s3?

I have a bucket with various files. I am only interested in pulling files that begin with the word 'member' and storing each member file in a list to be concatenated into a dataframe later.
Currently I am pulling data like this:
import io
import boto3
import pandas as pd

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my-bucket')
obj = s3.Object('my-bucket', 'member')
file_content = obj.get()['Body'].read().decode('utf-8')
df = pd.read_csv(io.StringIO(file_content))
However, this only pulls the single 'member' file. I have member files with names like 'member_1229013', 'member_2321903', etc.
How can I read in all the 'member' files and save the data in a list so I can concatenate it later? All column names are the same in every csv.
You can only download/access one object per API call.
I normally recommend downloading the objects to a local directory, and then accessing them as normal local files. Here is an example of how to download an object from Amazon S3:
import boto3
s3 = boto3.client('s3')
s3.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
See: download_file() documentation
If you want to read multiple files, you will first need to obtain a listing of the files (eg with list_objects_v2()), and then access each object individually.
One tip for boto3... There are two ways to make calls: via a Resource (eg using s3.Object() or s3.Bucket()) or via a Client, which passes everything as parameters.
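For the 'member' files in the question above, that listing-then-reading flow could look roughly like the sketch below. It uses the client interface with a list_objects_v2 paginator; the bucket name 'my-bucket' and the 'member' prefix come from the question, while the paginator and the StringIO handling are assumptions on my part, not part of the original answer:
import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
frames = []

# List every object whose key starts with 'member', then fetch and parse each one.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='member'):
    for item in page.get('Contents', []):
        obj = s3.get_object(Bucket='my-bucket', Key=item['Key'])
        body = obj['Body'].read().decode('utf-8')
        frames.append(pd.read_csv(io.StringIO(body)))

df = pd.concat(frames, ignore_index=True)
Reading into memory like this avoids writing temporary files, which suits small-to-medium CSVs; for very large files the download-to-disk approach above is usually safer.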

How to list S3 objects in parallel in PySpark using flatMap()?

I have a dataframe where each row contains a prefix that points to a location in S3. I want to use flatMap() to iterate over each row, list the S3 objects in each prefix and return a new dataframe that contains a row per file that was listed in S3.
I've got this code:
import boto3
from pyspark.sql import Row

s3 = boto3.resource('s3')

def flatmap_list_s3_files(row):
    bucket = s3.Bucket(row.bucket)
    s3_files = []
    for obj in bucket.objects.filter(Prefix=row.prefix):
        s3_files.append(obj.key)
    rows = []
    for f in s3_files:
        row_dict = row.asDict()
        row_dict['s3_obj'] = f
        rows.append(Row(**row_dict))
    return rows

df = <code that loads the dataframe>
df.rdd.flatMap(lambda x: flatmap_list_s3_files(x)).toDF()
The only problem is that the s3 object isn't pickleable I guess? So I'm getting this error and I'm not sure what to try next:
PicklingError: Cannot pickle files that are not opened for reading
I'm a spark noob so I'm hoping there's some other API or some way to parallelize the listing of files in S3 and join that together with the original dataframe. To be clear, I'm not trying to READ any of the data in the S3 files themselves, I'm building a table that is essentially a metadata catalogue of all the files in S3. Any tips would be greatly appreciated.
You can't send an S3 client around your Spark cluster; you need to share all the information needed to create one and instantiate it at the far end. I don't know about the Python side, but in the Java APIs you'd just pass the path around as a string, convert it to a Path object, call Path.getFileSystem(), and work from there. The Spark workers will cache the FileSystem instances for fast reuse.
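In Python, "instantiate it at the far end" would mean creating the boto3 resource inside the function that flatMap() calls, so nothing unpicklable is captured in the closure. A rough sketch along those lines, reusing the function from the question; the per-row resource creation, and df coming from the question's own loading code, are the assumptions here:
import boto3
from pyspark.sql import Row

def flatmap_list_s3_files(row):
    # Build the S3 resource on the worker instead of pickling one from the driver.
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(row.bucket)
    rows = []
    for obj in bucket.objects.filter(Prefix=row.prefix):
        row_dict = row.asDict()
        row_dict['s3_obj'] = obj.key
        rows.append(Row(**row_dict))
    return rows

files_df = df.rdd.flatMap(flatmap_list_s3_files).toDF()
Creating a resource per row is wasteful; mapPartitions() would let one client serve a whole partition, but the idea is the same: construct the client on the worker rather than shipping it from the driver.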

Using boto to delete all buckets

I've been tasked with creating a script to delete all the current S3 buckets and create some new ones. This is something that they want to do on an ongoing basis. So far I have all the preliminaries:
from __future__ import print_function  # __future__ imports must come before any other code
import boto
import boto.s3.connection
from boto.s3.key import Key

conn = boto.s3.connect_to_region('us-east-1',
    aws_access_key_id='my_access_key',
    aws_secret_access_key='my_secret_key')
ls = conn.get_all_buckets()
print(*ls, sep='\n')
This gives me a list of all the current buckets. Now if I want to remove the buckets my understanding is that they have to be emptied first, using a method something like:
db = conn.get_bucket('bucket_name')
for key in db.list():
    key.delete()
And then I could do:
conn.delete_bucket('bucket_name')
I want to set it up such that it pulls each bucket name from 'ls', but I'm not sure how to go about this. I tried this:
for i in ls:
    db = conn.get_bucket('i')
    for key in db.list():
        key.delete()
But I get an error "S3ResponseError: 400 Bad Request". I'm getting a sneaking suspicion that it's not pulling the separate elements from the list. Do I maybe have to get data frames involved? As far as I know, boto doesn't have an option to just nuke all the folders outright.
I'd recommend using boto3
The following should do the trick, though it's untested (I don't want to delete all my buckets :))
import boto3

client = boto3.client('s3')
s3 = boto3.resource('s3')

buckets = client.list_buckets()
for bucket in buckets['Buckets']:
    s3_bucket = s3.Bucket(bucket['Name'])
    s3_bucket.objects.all().delete()  # a bucket must be empty before it can be deleted
    s3_bucket.delete()

load csv and set parameters in jupyter notebook on Azure ML

I'm using a Python 3.4 Jupyter notebook to load a dataset in Azure ML which is stored in the cloud as a dataset in the Azure ML project environment. But using the default template created by Azure ML, I can't load the data due to a mixed datatypes error.
from azureml import Workspace
import pandas as pd
ws = Workspace()
ds = ws.datasets['rossmann-train.csv']
df = ds.to_dataframe()
/home/nbuser/anaconda3_23/lib/python3.4/site-packages/IPython/kernel/main.py:6: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
In my local environment I just import the dataset as follows:
df = pd.read_csv('train.csv',low_memory=False)
But I'm not sure how to do this in Azure using the ds object.
df = pd.read_csv(ds)
and
pd.DataFrame.from_csv(ds)
raise the error:
OSError: Expected file path name or file-like object, got type
*edit: more info on the ds object:
In [1]: type(ds)
Out [1]: azureml.SourceDataset
In [2]: print (ds)
Out [2]: rossmann-train.csv
First of all, I am not sure, from your question, what the ds object is. But I'm pretty sure it is not a csv file, since, if it were, you'd have processed it yourself and you wouldn't be asking this question.
Now, I am not sure whether pandas has a native way of dealing with Azure, but this piece of documentation indicates that you must first download the data from Azure, using their package, and save it to your local file system.
But for that, they assume that the data you downloaded is already in csv format. If not, use the appropriate reader (or parse it by hand) to tabulate the data for a pandas.DataFrame.
According to the docs on the azureml library, one workaround would be to import the file as text and then parse it into csv, but this seems unnecessary since the data is already recognised as having a csv structure.
text_data = ds.read_as_text()
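Building on that read_as_text() workaround, the returned text can be handed to pandas through a StringIO buffer, which also gives back the low_memory / dtype parameters the question was after. A minimal sketch, assuming read_as_text() returns the whole CSV as a single string:
import io
import pandas as pd
from azureml import Workspace

ws = Workspace()
ds = ws.datasets['rossmann-train.csv']

# Pull the raw text from Azure ML and let pandas parse it with explicit options.
text_data = ds.read_as_text()
df = pd.read_csv(io.StringIO(text_data), low_memory=False)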
