Loading an 8.9 GB dataset from Google Drive to Google Colab? - python-3.x

I am working on a huge laboratory dataset and want to know how to load an 8.9 GB dataset from my Google Drive into my Google Colab notebook. The error it shows is "Runtime stopped, restarting it."
I've already tried chunksize, nrows, na_filter, and dask, but I may be implementing them incorrectly; if so, please explain how to use them. I am attaching my original code below.
import pandas as pd

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download the CSV from Drive by file id, then read it with pandas.
id = '1M4tregypJ_HpXaQCIykyG2lQtAMR9nPe'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('Filename.csv')

df = pd.read_csv('Filename.csv')
df.head()
If you suggest any of the methods I've already tried, please do so with appropriate, working code.

The problem is most likely pd.read_csv('Filename.csv').
An 8.9 GB CSV file can need well over 13 GB of RAM once parsed, which is more than a standard Colab runtime provides, so the runtime crashes and restarts. Don't load the whole file into memory; process it incrementally.
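For instance, a minimal sketch of chunked reading with pandas; the chunk size and the per-chunk work are placeholders to adapt to your columns and available RAM:

import pandas as pd

results = []
# Read roughly one million rows at a time instead of the whole file at once.
for chunk in pd.read_csv('Filename.csv', chunksize=10**6):
    # Replace this with your real per-chunk work (filtering, aggregation,
    # writing partial results to disk, ...). Here we only count rows.
    results.append(len(chunk))

print('total rows:', sum(results))

dask.dataframe.read_csv('Filename.csv') works in a similar spirit: it reads lazily and only pulls data through memory when you call .compute() on a result.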

Related

How do I import the saved model in Google Colab?

I built a NER model using spaCy in Google Colab and saved it to disk using the nlp.to_disk() function:
nlp.to_disk("RCM.model")
The model is saved under the Colab files. How should I load the RCM model back for testing purposes?
I have tried the code below, but it didn't work.
from google.colab import drive
my_module = drive.mount('/content/RCM.model', force_remount=True)
If you saved the model with nlp.to_disk(), you can load it back with spacy.load().
import spacy
spacy.load("RCM.model") # the argument should be the path to the directory

Import Image from Google Drive to Google Colab

I have mounted my Google drive to my colab notebook:
from google.colab import drive
drive.mount("/content/gdrive")
I can import csv files through it, in my case:
df = pd.read_csv("/content/gdrive/MyDrive/colab/heart-disease.csv")
But when I try to embed an image in a markdown/text cell in Colab, nothing happens:
![](/content/gdrive/MyDrive/colab/6-step-ml-framework.png)
Here's my directory on Google Drive: [screenshot of the Drive folder omitted]
You can use OpenCV in Colab: import cv2 as cv, read the image with img = cv.imread('/content/gdrive/MyDrive/colab/6-step-ml-framework.png'), and convert the image to a float NumPy array with img = np.float32(img).
@Mr. For Example: you should not need np.float32(img); OpenCV's imread already returns a NumPy array.
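If the goal is simply to see the image in the notebook output (Colab text cells generally cannot load images from local /content paths), a minimal sketch using OpenCV and Colab's display helper, assuming the Drive path from the question:

import cv2 as cv
from google.colab.patches import cv2_imshow   # Colab replacement for cv.imshow

# Read the image from the mounted Drive path used in the question.
img = cv.imread('/content/gdrive/MyDrive/colab/6-step-ml-framework.png')
if img is None:
    print('Image not found; check the path and that Drive is mounted.')
else:
    cv2_imshow(img)   # renders the array inline in the cell output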

How to search for Tensorflow Files in Google Drive?

I'm following the docs here: https://colab.research.google.com/github/google/earthengine-api/blob/master/python/examples/ipynb/TF_demo1_keras.ipynb#scrollTo=43-c0JNFI_m6 to learn how to use Tensorflow with GEE. One part of this tutorial is checking the existence of exported files. In the docs, the example code is:
fileNameSuffix = '.tfrecord.gz'
trainFilePath = 'gs://' + outputBucket + '/' + trainFilePrefix + fileNameSuffix
testFilePath = 'gs://' + outputBucket + '/' + testFilePrefix + fileNameSuffix
print('Found training file.' if tf.gfile.Exists(trainFilePath)
      else 'No training file found.')
print('Found testing file.' if tf.gfile.Exists(testFilePath)
      else 'No testing file found.')
In my case, I'm just exporting the files to Google Drive instead of Google Cloud bucket. How would I change trainFilePath and testFilePath to point to the Google Drive folder? FWIW, when I go into the Google Drive Folder, I do actually see files.
I would say you could use the Google Drive API to list files in your Google Drive instead of a GCS bucket; you can find the documentation here.
You can also use PyDrive, which is pretty easy to understand. This is an example; you only have to adjust the query "q" to your needs:
from pydrive.drive import GoogleDrive
from pydrive.auth import GoogleAuth

gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)

file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file in file_list:
    print(f"title: {file['title']}, id: {file['id']}")
Solution
You can use the great PyDrive library to access your Drive files easily from Google Colab and thus check which files you have, which have been exported, and so on.
The following piece of code is an example that lists all the files in the root directory of your Google Drive. It was found in this answer (yes, I am making this answer a community wiki post):
# Install the library
!pip install -U -q PyDrive

# Import the rest of the services/libraries needed
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Choose a local (Colab) directory to store the data.
local_download_path = os.path.expanduser('~/data')
try:
    os.makedirs(local_download_path)
except OSError:
    pass  # the directory may already exist

# 2. Auto-iterate using the query syntax; 'root' is the main directory of Drive.
#    https://developers.google.com/drive/v2/web/search-parameters
file_list = drive.ListFile({'q': "'root' in parents"}).GetList()
for f in file_list:
    # 3. Print the name and id of each file
    print('title: %s, id: %s' % (f['title'], f['id']))
NOTE: when you run this, Colab will take you to another page to authenticate and ask you to paste in the secret key. Just follow what the service tells you to do; it is pretty straightforward.
I hope this has helped you. Let me know if you need anything else or if you did not understand something. :)
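If you only need the existence check from the tutorial, another option is to mount Drive and test plain file paths; a sketch in which the folder name and prefixes are placeholders for whatever you used in your export tasks:

import os
from google.colab import drive as colab_drive   # aliased to avoid clashing with the PyDrive `drive` object above

colab_drive.mount('/content/gdrive')

# Placeholders: adjust to your export folder and file prefixes.
export_dir = '/content/gdrive/MyDrive/ee_exports'
train_file_prefix = 'Training_demo'
test_file_prefix = 'Testing_demo'
file_name_suffix = '.tfrecord.gz'

train_file_path = os.path.join(export_dir, train_file_prefix + file_name_suffix)
test_file_path = os.path.join(export_dir, test_file_prefix + file_name_suffix)

print('Found training file.' if os.path.exists(train_file_path)
      else 'No training file found.')
print('Found testing file.' if os.path.exists(test_file_path)
      else 'No testing file found.')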

Error in joblib.load when reading file from s3

When trying to read a file from S3 with joblib.load() I get ValueError: embedded null byte.
The files were created by joblib and can be successfully loaded from local copies (that were made locally before uploading to s3), so the error is presumably in storage and retrieval protocols from S3.
Minimal code:
#### Imports (AWS credentials assumed)
import boto3
from sklearn.externals import joblib

s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"

# Fails with ValueError: embedded null byte
joblib.load(s3.Bucket(bucket_str).Object(bucket_key).get()['Body'].read())
The following code reconstructs a local copy of the file in memory before feeding into joblib.load(), enabling a successful load.
from io import BytesIO
import boto3
from sklearn.externals import joblib

s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"

with BytesIO() as data:
    s3.Bucket(bucket_str).download_fileobj(bucket_key, data)
    data.seek(0)    # move back to the beginning after writing
    df = joblib.load(data)
I assume, but am not certain, that something in how boto3 chunks files for download creates a null byte that breaks joblib, and that BytesIO fixes this before letting joblib.load() see the data stream.
PS: with this method the file never touches the local disk, which is helpful in some circumstances (e.g., a node with lots of RAM but tiny disk space...).
You can do it like this using the s3fs package.
from sklearn.externals import joblib
import s3fs

fs = s3fs.S3FileSystem()
with fs.open('s3://my-aws-bucket/some-pseudo/folder-set/my-filename.joblib', 'rb') as f:
    df = joblib.load(f)
I guess everybody has their own preference but I really like s3fs because it makes the code look very familiar to people who haven't worked with s3 before.

Access Google Sheets on Google Colaboratory

Hi, I am using Google Colaboratory (similar to a Jupyter notebook). Does anyone know how to access data from Google Sheets in a Google Colaboratory notebook?
Loading data from Google Sheets is covered in the I/O example notebook:
https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/io.ipynb&scrollTo=sOm9PFrT8mGG
!pip install --upgrade -q gspread
import gspread
import pandas as pd
from google.colab import auth
auth.authenticate_user()
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)
worksheet = gc.open('data_set.csv').sheet1
rows = worksheet.get_all_values()
pd.DataFrame.from_records(rows)
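If the first spreadsheet row holds the column names, you can use it as the header (a small follow-up, assuming that layout):

# Use the first row as column names and the remaining rows as data.
df = pd.DataFrame(rows[1:], columns=rows[0])
df.head()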
