Read a UTF-16LE file directly in a Cloud Function (Python/GCP)

I have a CSV file with UTF-16LE encoding. I tried to open it in a Cloud Function using:
import pandas as pd
from io import StringIO as sio
with open("gs://bucket_name/my_file.csv", "r", encoding="utf16") as f:
    read_all_once = f.read()
    read_all_once = read_all_once.replace('"', "")
    file_like = sio(read_all_once)
df = pd.read_csv(file_like, sep=";", skiprows=5)
I get an error saying that the file is not found at that location. What is the issue? When I run the same code locally with a local path, it works.
Also, when the file is in UTF-8 encoding, I can read it directly with:
df = pd.read_csv("gs://bucket_name/my_file.csv", delimiter=";", encoding="utf-8", skiprows=0, low_memory=False)
I need to know whether I can read the UTF-16 file directly with pd.read_csv(). If not, how do I make with open() recognize the path?
Thanks in advance!

Yes, you can read the utf-16 csv file directly with the pd.read_csv() method.
For the method to work, please make sure that the service account attached to your function has permission to read the CSV file in the Cloud Storage bucket.
Please check whether the encoding of your CSV file is "utf-16", "utf-16le", or "utf-16be", and pass the matching value to the method.
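To illustrate why the exact codec name can matter, here is a minimal standard-library sketch. The sample text and variable names are hypothetical, and it assumes the file starts with a byte order mark (BOM), which may or may not be true for your file.

```python
import codecs

# Hypothetical sample content standing in for the real CSV file.
text = "col_a;col_b\n1;2\n"

# Bytes as some tools write them: a byte order mark followed by little-endian data.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# "utf-16" reads the BOM to detect the byte order and then drops it.
decoded_auto = raw.decode("utf-16")

# "utf-16-le" assumes little-endian and keeps the BOM as a leading U+FEFF character.
decoded_le = raw.decode("utf-16-le")

print(repr(decoded_auto[:5]))
print(repr(decoded_le[:5]))
```

If the first column name comes back with a stray leading character, switching between "utf-16" and "utf-16le" is one thing worth trying.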
I used the Python 3.7 runtime.
My main.py and requirements.txt files look as shown below. You can modify main.py according to your use case.
main.py
import pandas as pd
def hello_world(request):
    # Please change the file's URI.
    data = pd.read_csv('gs://bucket_name/file.csv', encoding='utf-16le')
    print(data)
    return 'check the results in the logs'
requirements.txt
pandas==1.1.0
gcsfs==0.6.2

Related

How to upload downloaded telegram media directly on google drive?

I'm working on the telethon download_media method for downloading images and videos. It is working fine (as expected). Now, I want to directly upload the download_media to my google drive folder.
Sample code looks something like:
from telethon import TelegramClient, events, sync
from telethon.tl.types import PeerUser, PeerChat, PeerChannel
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
gauth = GoogleAuth()
drive = GoogleDrive(gauth)
gfile = drive.CreateFile({'parents': [{'id': 'drive_directory_path'}]})
api_id = #####
api_hash = ##########
c = client.get_entity(PeerChannel(1234567)) # some random channel id
for m in client.iter_messages(c):
    if m.photo:
        # below is the one way and it works
        # m.download_media("Media/")
        # I want to try something like this - below code
        gfile.SetContentFile(m.media)
        gfile.Upload()
This code is not working. How can I define the Google Drive object for download_media?
Thanks in advance. Kindly assist!
The main problem is that, according to PyDrive's documentation, SetContentFile() expects a string with the file's local path and then simply calls open() on it, so it is meant to be used with local files. Your code passes it the media object instead, so it won't work.
To upload in-memory bytes with PyDrive, you need to wrap them in a BytesIO object and assign that as the content. An example with a local file would look like this:
import io
drive = GoogleDrive(gauth)
file = drive.CreateFile({'mimeType': 'image/jpeg', 'title': 'example.jpg'})
# Read the raw bytes, then wrap them in a file-like object.
with open('example.jpg', 'rb') as f:
    filebytes = f.read()
file.content = io.BytesIO(filebytes)
file.Upload()
Normally you don't need to do it this way, because SetContentFile() does the opening and conversion for you. It should show, though, that once you have the media as bytes, you can wrap it, assign it to file.content, and upload it.
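The wrapping step on its own can be sketched with the standard library only. The payload below is a hypothetical stand-in for downloaded media bytes, and no PyDrive or Telethon call is involved.

```python
import io

# Hypothetical payload standing in for downloaded media bytes.
payload = b"\xff\xd8\xff\xe0example"

# Wrap the raw bytes so they can be read like an opened binary file.
stream = io.BytesIO(payload)

header = stream.read(4)     # reading advances the stream position
stream.seek(0)              # rewind before handing the stream to an uploader
everything = stream.read()  # the full payload is available again
```

Rewinding matters only if something has already read from the stream; a freshly created BytesIO starts at position 0.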
Now, if you look at the Telethon documentation, you will see that download_media() takes a file argument which you can set to bytes:
file (str | file, optional):
The output file path, directory, or stream-like object. If the path exists and is a file, it will be overwritten. If file is the type bytes, it will be downloaded in-memory as a bytestring (e.g. file=bytes).
So you should be able to call m.download_media(file=bytes) to get a bytes object. Looking deeper at the Telethon source code, it appears that this does return a bytes object, which can then be wrapped in BytesIO. With this in mind, you can try the following change in your loop:
for m in client.iter_messages(c):
    if m.photo:
        gfile.content = io.BytesIO(m.download_media(file=bytes))
        gfile.Upload()
Note that I only tested the PyDrive side since I currently don't have access to the Telegram API, but looking at the docs I believe this should work. Let me know what happens.
Sources:
PyDrive docs and source
Telethon docs and source

Python on Chrome OS through Linux: Cannot Export DataFrame [duplicate]

I am trying to write a DataFrame to a .csv file:
now = datetime.datetime.now()
date = now.strftime("%Y-%m-%d")
enrichedDataDir = "/export/market_data/temp"
enrichedDataFile = enrichedDataDir + "/marketData_optam_" + date + ".csv"
dbutils.fs.ls(enrichedDataDir)
df.to_csv(enrichedDataFile, sep='; ')
This throws me the following error
IOError: [Errno 2] No such file or directory:
'/export/market_data/temp/marketData_optam_2018-10-12.csv'
But when I run
dbutils.fs.ls(enrichedDataDir)
Out[72]: []
there is no error. When I go one directory level higher:
enrichedDataDir = "/export/market_data"
dbutils.fs.ls(enrichedDataDir)
Out[74]:
[FileInfo(path=u'dbfs:/export/market_data/temp/', name=u'temp/', size=0L),
 FileInfo(path=u'dbfs:/export/market_data/update/', name=u'update/', size=0L)]
This works, too. To me, this means that all the folders I want to access really exist, but I don't know why .to_csv throws the error. I have also checked the permissions, which are fine.
The main problem was that I am using Microsoft Azure Data Lake Store to store those .csv files, and for whatever reason it is not possible to write to Azure Data Lake Store through df.to_csv.
Because I was trying to use df.to_csv, I was using a Pandas DataFrame instead of a Spark DataFrame.
I changed to
from pyspark.sql import *
df = spark.createDataFrame(result,['CustomerId', 'SalesAmount'])
and then write to csv via the following lines
from pyspark.sql import *
df.coalesce(2).write.format("csv").option("header", True).mode("overwrite").save(enrichedDataFile)
And it works.
Here is a more general answer.
If you want to load a file from DBFS into a Pandas DataFrame, you can use this trick.
Copy the file from DBFS to the local file system:
%fs cp dbfs:/FileStore/tables/data.csv file:/FileStore/tables/data.csv
Read the data from the local directory:
data = pd.read_csv('file:/FileStore/tables/data.csv')
Thanks
Have you tried opening the file first? (Replace the last line of your first example with the code below.)
from os import makedirs
makedirs(enrichedDataDir)
with open(enrichedDataFile, 'w') as output_file:
    # to_csv expects a single-character separator
    df.to_csv(output_file, sep=';')
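The idea of creating the directory before opening the file can be sketched with the standard library alone. The paths below are hypothetical and live in a throwaway temp folder. This shows only local-filesystem behavior; on Databricks, a path such as /export/market_data/temp may refer to DBFS rather than the driver's local disk, so this is an illustration under that assumption and not a confirmed fix.

```python
import os
import tempfile

# Hypothetical directory layout inside a throwaway temp folder.
base = tempfile.mkdtemp()
target_dir = os.path.join(base, "export", "market_data", "temp")
target_file = os.path.join(target_dir, "marketData.csv")

# Opening a file inside a missing directory fails with "No such file or directory".
try:
    open(target_file, "w").close()
    failed = False
except OSError:
    failed = True

# Create the directory tree if needed; after that the same open() succeeds.
if not os.path.isdir(target_dir):
    os.makedirs(target_dir)
with open(target_file, "w") as output_file:
    output_file.write("CustomerId;SalesAmount\n")

print(failed, os.path.exists(target_file))
```

The isdir check avoids the error that makedirs raises when the directory already exists.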
Check the permissions on the SAS token you used for the container when you mounted this path. If it starts with "sp=racwdlmeopi", then you have a SAS token with immutable storage; your token should start with "sp=racwdlmeop".

Python: Can you upload a .xlsx file from xlsxwriter without creating one locally?

I hope this question is as straightforward as I think it is.
Here is some background:
I'm helping out on a Python backend that receives messy data as a CSV. In its current state, it just reroutes the URL given by the API and triggers a download on the client computer. I wrote a utility using Pandas and xlsxwriter that cleans up this data, separates it into multiple tabs, makes some graphs, and then writes everything to a .xlsx file. Basically like this:
import pandas as pd
writer = pd.ExcelWriter(output_file_name, engine = 'xlsxwriter')
#Do a bunch of stuff and save each tab to writer
writer.save() #Writes the file
This .xlsx file would be created locally and there would need to be additional backend stuff that uploads it and cleans up the local file.
Seeing as the file is created all at once by the .save() method at the end, I was thinking it's probably possible to trigger an upload directly without creating the local file at all, but I'm not seeing anything in the xlsxwriter documentation about it. Is there any way to avoid saving a local file, within or outside of xlsxwriter?
Assuming df is your DataFrame variable:
import io
import pandas as pd
buffer = io.BytesIO()
writer = pd.ExcelWriter(buffer, engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.close()  # writes the workbook into the buffer; older pandas versions also offered writer.save()
data = buffer.getvalue()
data contains the binary content of the Excel file. For instance, you can use the requests module to upload it somewhere.
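The same in-memory pattern can be tried without pandas or xlsxwriter. The sketch below uses the standard zipfile module as a stand-in writer (an .xlsx file is itself a ZIP container); the member name and contents are hypothetical.

```python
import io
import zipfile

# An in-memory buffer takes the place of a file on disk.
buffer = io.BytesIO()

# Any writer that accepts a file-like object can target the buffer.
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hello")  # hypothetical archive member

# The finished bytes are ready to be sent elsewhere; nothing touched the disk.
data = buffer.getvalue()
print(len(data) > 0)
```

The bytes in data can then be passed to whatever upload call the backend already uses.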

pandas : read_csv not accepting relative path

I have Python code in a Jupyter notebook and the accompanying data in the same folder. I will be bundling both the code and the data into a zip file and submitting it for evaluation. I am trying to read the data inside the notebook with pandas.read_csv using a relative path, and that's not working. The API doesn't seem to work with relative paths. What is the correct way to handle this?
Update:
My findings so far suggest that I should use os.chdir() to set the current working directory. But I won't know where the zip file will be extracted. The code is supposed to be read-only, so I cannot expect the receiver to update the path as appropriate.
You could join the current working directory with the relative path to avoid this kind of problem:
import os
import pandas as pd
BASE_DIR = os.getcwd()
csv_path = "csvname.csv"
df = pd.read_csv(os.path.join(BASE_DIR, csv_path))
where csv_path is the relative path.
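As a quick check of what this join does, here is a standard-library sketch that uses a throwaway temp directory as a hypothetical working directory. It suggests that a relative path is already resolved against the current working directory, so joining it with os.getcwd() points at the same file; if reading still fails, the notebook's working directory probably differs from the data folder.

```python
import os
import tempfile

# Hypothetical working directory containing the data file.
workdir = tempfile.mkdtemp()
os.chdir(workdir)
with open("csvname.csv", "w") as f:
    f.write("a,b\n1,2\n")

relative_path = "csvname.csv"
joined_path = os.path.join(os.getcwd(), relative_path)

# Both spellings open the same file.
with open(relative_path) as f:
    via_relative = f.read()
with open(joined_path) as f:
    via_joined = f.read()

print(via_relative == via_joined)
```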
I think you should first unzip the file; then you can run the code.
You can use the code below to unzip the file:
from zipfile import ZipFile
file_name = "folder_name.zip"
with ZipFile(file_name, 'r') as zf:
    zf.extractall()
    print("Done!")

Convert pickle file from protocol 3 to protocol 2

I dumped a pickle file using protocol 3, the default in Python 3, but I am deploying it on Google Cloud, which runs Python 2, so I need to convert the pickle file to protocol 2. How can I directly convert this protocol 3 pickle file to a protocol 2 pickle file?
Can you try something like the following?
I did not find any direct converter in the standard library. Maybe someone else can.
Load the file into an object obj and then do the following:
pickle.dump(obj, fileObject, 2)
The dump function takes the protocol as an optional argument:
https://docs.python.org/3.1/library/pickle.html#pickle.dump
Rough code:
import pickle
with open('data1.pickle', 'rb') as f1:
    data = pickle.load(f1)
with open('data2.pickle', 'wb') as f2:
    pickle.dump(data, f2, 2)
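The same round trip can be checked in memory. In this sketch the sample object is hypothetical; the second byte of a pickle written with protocol 2 or higher records the protocol number, which gives a quick way to confirm the conversion.

```python
import pickle

# Hypothetical object standing in for the real data.
obj = {"name": "example", "values": [1, 2, 3]}

# Serialize with protocol 3, as Python 3 did by default at the time.
blob_v3 = pickle.dumps(obj, protocol=3)

# Load it back and re-serialize with protocol 2.
restored = pickle.loads(blob_v3)
blob_v2 = pickle.dumps(restored, protocol=2)

# The protocol number is stored right after the leading 0x80 opcode.
print(blob_v3[1], blob_v2[1])
```

Note that this changes only the serialization format. Objects whose classes or types exist only in Python 3 may still fail to load under Python 2.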
