How to upload Cloud files into Python - excel

I usually load Excel files into a pandas DataFrame with pd.ExcelFile when the files are on my local drive.
How can I do the same if I have an Excel file in Google Drive or Microsoft OneDrive and I want to connect remotely?

You can use read_csv() on a StringIO object:
from io import StringIO  # StringIO lived in its own module in Python 2
import pandas as pd
import requests

r = requests.get('Your google drive link')  # must be a direct-download link
df = pd.read_csv(StringIO(r.text))  # r.text is str; r.content (bytes) would need BytesIO
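For an actual Excel file, a minimal sketch along the same lines, assuming you already have a direct-download URL (the URL below is a placeholder, and pd.read_excel needs openpyxl or xlrd installed for the format in question):
from io import BytesIO
import pandas as pd
import requests

# Placeholder URL: share links from Google Drive / OneDrive usually have to be
# converted to a direct-download form before requests can fetch the raw file.
url = 'https://example.com/direct-download/myfile.xlsx'

r = requests.get(url)
r.raise_for_status()
df = pd.read_excel(BytesIO(r.content))  # read_excel wants a bytes buffer, not text
print(df.head())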

Related

Read a utf-16LE file directly in cloud function -python/GCP

I have a CSV file with UTF-16LE encoding, and I tried to open it in a Cloud Function using
import pandas as pd
from io import StringIO as sio
with open("gs://bucket_name/my_file.csv", "r", encoding="utf16") as f:
read_all_once = f.read()
read_all_once = read_all_once.replace('"', "")
file_like = sio(read_all_once)
df = pd.read_csv(file_like, sep=";", skiprows=5)
I get an error that the file is not found at that location. What is the issue? When I run the same code locally with a local path, it works.
Also, when the file is UTF-8 encoded I can read it directly with
df = pd.read_csv("gs://bucket_name/my_file.csv", delimiter=";", encoding="utf-8", skiprows=0, low_memory=False)
Can I read the UTF-16 file directly with pd.read_csv()? If not, how do I make open() recognize the gs:// path?
Thanks in advance!
Yes, you can read the UTF-16 CSV file directly with the pd.read_csv() method.
For the method to work, please make sure that the service account attached to your function has access to read the CSV file in the Cloud Storage bucket.
Check whether the encoding of the CSV file is "utf-16", "utf-16le", or "utf-16be", and pass the appropriate one to the method.
I used the Python 3.7 runtime.
My main.py and requirements.txt files look as below. You can modify main.py according to your use case.
main.py
import pandas as pd

def hello_world(request):
    # please change the file's URI
    data = pd.read_csv('gs://bucket_name/file.csv', encoding='utf-16le')
    print(data)
    return 'check the results in the logs'
requirements.txt
pandas==1.1.0
gcsfs==0.6.2
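To address the second part of the question (making open() recognize the gs:// path): the built-in open() only understands local paths, but gcsfs, which is already in the requirements, provides an open() that does. A hedged sketch under that assumption, with the bucket and file names as placeholders:
import gcsfs
import pandas as pd
from io import StringIO

fs = gcsfs.GCSFileSystem()  # uses the function's service account credentials

# Unlike the built-in open(), gcsfs understands gs:// URIs
with fs.open("gs://bucket_name/my_file.csv", "rb") as f:
    text = f.read().decode("utf-16le").replace('"', "")

df = pd.read_csv(StringIO(text), sep=";", skiprows=5)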

Read excel sheets from gcp bucket

I am currently trying to read data into my GCP notebook from a shared GCP storage bucket. I am an admin, so restrictions shouldn't apply as far as I know, but I am getting an error before I can even read the data in with pandas. Is this possible? Or am I going about this in the wrong way?
This is the code I have tried:
from google.cloud import storage
from io import BytesIO
import pandas as pd
client = storage.Client()
bucket = "our_data/deid"
blob = storage.blob.Blob("B_ACTIVITY.xlsx",bucket)
content = blob.download_as_string()
df = pd.read_excel(BytesIO(content))
I was hoping for the data to simply be brought in once the bucket was specified, but I get an error "'str' object has no attribute 'path'".
bucket needs to be a bucket object, not just a string.
Try changing that line to:
bucket = client.bucket(<BUCKET_URL>)
Here's a link to the constructor:
https://googleapis.dev/python/storage/latest/client.html#google.cloud.storage.client.Client.bucket
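Putting it together, a minimal corrected sketch of the snippet from the question, assuming our_data is the bucket name and deid/ is a folder inside it (adjust the names to your real layout):
from io import BytesIO
import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("our_data")           # a bucket object, not a plain string
blob = bucket.blob("deid/B_ACTIVITY.xlsx")   # object path inside the bucket

content = blob.download_as_bytes()           # download_as_string() is the older alias
df = pd.read_excel(BytesIO(content))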

Dash app deployed on Heroku cannot read CSV file

I'm trying to use Heroku to deploy my Dash app, which is supposed to read data from a local CSV file. The deployment was successful, but if I open the URL of the app, it gives me an Application Error.
I checked the Heroku logs and found a FileNotFoundError saying the CSV file the app reads from does not exist, yet it works when I run the app locally. The CSV file does exist in my directory, so I want to know if there's another way to go about this.
EDIT: Actually, this is how my app.py code starts. The FileNotFoundError points to the part where I read the CSV file with pandas.
How can I get my app to read the CSV file?
import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_table as table
from dash.dependencies import Input, Output
import plotly as py
import plotly.graph_objs as go
import numpy as np
import pandas as pd
filepath='C:\\Users\\DELL\\Desktop\\EDUCATE\\DATA CSV\\crop_prod_estimates_GH.csv'
data=pd.read_csv(filepath,sep=',',thousands=',')
data.dropna(inplace=True)
data[['REGION','DISTRICT','CROP']]=data[['REGION','DISTRICT','CROP']].astype('category')
data.CROP=data.CROP.str.strip()
data.drop(data.columns[0],axis=1,inplace=True)
Solved it!
I uploaded my CSV data file to my GitHub repository and had app.py read the data from there, like this:
url = 'https://raw.githubusercontent.com/your_account_name/repository_name/master/file.csv'
df = pd.read_csv(url,sep=",")
df.head()
You can store the CSV file in the same location as your app.py.
Change from:
filepath='C:\\Users\\DELL\\Desktop\\EDUCATE\\DATA CSV\\crop_prod_estimates_GH.csv'
To:
filepath='crop_prod_estimates_GH.csv'
It should work.
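If the working directory on the dyno ever differs from the app directory, a path built relative to app.py itself is more robust. A small sketch, assuming the CSV is committed alongside app.py:
from pathlib import Path
import pandas as pd

# Resolve the CSV next to this script rather than the current working directory
filepath = Path(__file__).resolve().parent / 'crop_prod_estimates_GH.csv'
data = pd.read_csv(filepath, sep=',', thousands=',')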
Upload your CSV file to Cloudinary:
urlfile = 'https://res.cloudinary.com/hmmpyq8rf/raw/upload/v1604671300/localisationDigixpress_n8s98k.csv'
df = pd.read_csv(urlfile,sep=",")
df.head()

using GCS path in PyPDF2 PdfFileReader

I am using the Python library PyPDF2 and trying to read a PDF file using PdfFileReader. It works fine for a local PDF file. Is there a way to access my PDF file from a Google Cloud Storage bucket (gs://bucket_name/object_name)?
from PyPDF2 import PdfReader

with open('testpdf.pdf', 'rb') as f1:
    reader = PdfReader(f1)
    number_of_pages = len(reader.pages)
Instead of 'testpdf.pdf', how can I provide my Google Cloud Storage object location? Please let me know if anyone tried this.
You can use the gcsfs library to access files in a GCS bucket. For example:
import gcsfs
from pypdf import PdfReader
gcs_file_system = gcsfs.GCSFileSystem(project="PROJECT_ID")
gcs_pdf_path = "gs://bucket_name/object.pdf"
f_object = gcs_file_system.open(gcs_pdf_path, "rb")
# Open our PDF file with the PdfReader
reader = PdfReader(f_object)
# Get number of pages
num = len(reader.pages)
f_object.close()
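As a usage note, the same sketch can be written with a context manager so the GCS file handle is closed even if parsing raises:
import gcsfs
from pypdf import PdfReader

gcs_file_system = gcsfs.GCSFileSystem(project="PROJECT_ID")

with gcs_file_system.open("gs://bucket_name/object.pdf", "rb") as f_object:
    reader = PdfReader(f_object)
    num = len(reader.pages)  # read the page count while the file is still open

print(num)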

Importing multiple files from Google Cloud Bucket to Datalab instance

I have a bucket set up on Google Cloud containing a few hundred JSON files, and I am trying to work with them in a Datalab instance running Python 3.
So, I can easily see them as objects using
gcs list --objects gs://<BUCKET_NAME>
Further, I can read in an individual file/object using
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO
myBucket = storage.Bucket('<BUCKET_NAME>')
data_csv = myBucket.object('<FILE_NAME.json>')
uri = data_csv.uri
%gcs read --object $uri --variable data
df = pd.read_csv(BytesIO(data))
df.head()
(FYI, I understand that my example is reading a JSON file as a CSV, but let's ignore that; I'll cross that bridge on my own.)
What I can't figure out is how to loop through the bucket and pull all of the json files into pandas...how do I do that? Is that the way I should be thinking of this- is there a way to call the files in the bucket from pandas directly (since they're already treated as objects)?
As an extra bit- what if a file is saved as a json, but isn't actually that structure? How can I handle that?
Essentially, I guess, I'm looking for the functionality of the blob package, but using cloud buckets + datalab.
Any help is greatly appreciated.
This can be done using Bucket.objects(), which returns an iterator over all matching files. Specify a prefix, or leave it empty to match all files in the bucket. Here is an example with two files, countries1.csv and countries2.csv:
$ cat countries1.csv
id,country
1,sweden
2,spain
$ cat countries2.csv
id,country
3,italy
4,france
And used the following Datalab snippet:
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO

myBucket = storage.Bucket('BUCKET_NAME')
object_list = myBucket.objects(prefix='countries')

df_list = []
for object in object_list:
    %gcs read --object $object.uri --variable data
    df_list.append(pd.read_csv(BytesIO(data)))

concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()
which will output the combined csv:
id country
0 1 sweden
1 2 spain
2 3 italy
3 4 france
Take into account that I combined all the CSV files into a single pandas DataFrame using this approach, but you might want to load them into separate ones depending on the use case. If you want to retrieve all files in the bucket, just use this instead:
object_list = myBucket.objects()
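Outside Datalab, a hedged alternative for the same loop is to list the objects with gcsfs and let pandas read the gs:// paths directly (pandas uses gcsfs under the hood when it is installed); BUCKET_NAME and the .json suffix are placeholders for the real files:
import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem()

# Every object in the bucket ending in .json; glob() returns "bucket/name" paths
paths = fs.glob("BUCKET_NAME/*.json")

df_list = []
for path in paths:
    # switch to pd.read_json if the files really are JSON-structured
    df_list.append(pd.read_csv("gs://" + path))

combined = pd.concat(df_list, ignore_index=True)
combined.head()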

Resources