using GCS path in PyPDF2 PdfFileReader - python-3.x

I am using the python library PyPDF2 and trying to read a pdf file using PdfFileReader. It works fine for a local pdf file. Is there a way to access my pdf file from Google Cloud Storage bucket (gs://bucket_name/object_name)?
from PyPDF2 import PdfReader
with open('testpdf.pdf','rb') as f1:
    reader = PdfReader(f1)
    number_of_pages = len(reader.pages)
Instead of 'testpdf.pdf', how can I provide my Google Cloud Storage object location? Please let me know if anyone has tried this.

You can use the gcsfs library to access files in a GCS bucket. For example:
import gcsfs
from pypdf import PdfReader
gcs_file_system = gcsfs.GCSFileSystem(project="PROJECT_ID")
gcs_pdf_path = "gs://bucket_name/object.pdf"
f_object = gcs_file_system.open(gcs_pdf_path, "rb")
# Open our PDF file with the PdfReader
reader = PdfReader(f_object)
# Get number of pages
num = len(reader.pages)
f_object.close()
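If you'd prefer not to add gcsfs, a similar sketch with the official google-cloud-storage client should also work (this is not from the original answer; the project, bucket, and object names are placeholders, and the code assumes the runtime's credentials can read the object):
import io

from google.cloud import storage
from pypdf import PdfReader

# Placeholder project/bucket/object names.
client = storage.Client(project="PROJECT_ID")
blob = client.bucket("bucket_name").blob("object.pdf")

# Download the object into memory and give PdfReader a file-like object.
with io.BytesIO(blob.download_as_bytes()) as f_object:
    reader = PdfReader(f_object)
    print(len(reader.pages))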

Related

How to upload downloaded telegram media directly on google drive?

I'm working on the telethon download_media method for downloading images and videos. It is working fine (as expected). Now, I want to upload the downloaded media directly to my Google Drive folder.
Sample code looks something like:
from telethon import TelegramClient, events, sync
from telethon.tl.types import PeerUser, PeerChat, PeerChannel
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
drive = GoogleDrive(gauth)
gfile = drive.CreateFile({'parents': [{'id': 'drive_directory_path'}]})

api_id = #####
api_hash = ##########
# assumed setup: the snippet uses `client` without showing its creation
client = TelegramClient('session_name', api_id, api_hash)
client.start()

c = client.get_entity(PeerChannel(1234567))  # some random channel id
for m in client.iter_messages(c):
    if m.photo:
        # below is one way and it works
        # m.download_media("Media/")
        # I want to try something like this - below code
        gfile.SetContentFile(m.media)
        gfile.Upload()
This code is not working. How can I define the Google Drive object for download_media?
Thanks in advance. Kindly assist!
The main problem is that, according to PyDrive's documentation, SetContentFile() expects a string with the file's local path, and then it just uses open(), so it's meant to be used with local files. In your code you're feeding it the media file directly, so it won't work.
To upload a bytes file with PyDrive you'll need to convert it to a BytesIO object and assign it as the content. An example with a local file would look like this:
import io

drive = GoogleDrive(gauth)
file = drive.CreateFile({'mimeType': 'image/jpeg', 'title': 'example.jpg'})
filebytes = open('example.jpg', 'rb').read()
file.content = io.BytesIO(filebytes)
file.Upload()
Normally you don't need to do it this way because SetContentFile() does the opening and conversion for you, but it should give you the idea: if you have the media as bytes, you can wrap it in BytesIO, assign it to file.content, and then upload it.
Now, if you look at the Telethon documentation, you will see that download_media() takes a file argument which you can set to bytes:
file (str | file, optional):
The output file path, directory, or stream-like object. If the path exists and is a file, it will be overwritten. If file is the type bytes, it will be downloaded in-memory as a bytestring (e.g. file=bytes).
So you should be able to call m.download_media(file=bytes) to get a bytes object. Looking deeper at the Telethon source code, this appears to return the downloaded media as bytes, which you can then wrap in io.BytesIO. With this in mind, you can try the following change in your loop:
for m in client.iter_messages(c):
    if m.photo:
        gfile.content = io.BytesIO(m.download_media(file=bytes))
        gfile.Upload()
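One caveat worth double-checking: after the first Upload() call, PyDrive stores the created file's id on the gfile object, so reusing the same gfile in the loop will likely keep updating that one Drive file. If you want one Drive file per photo, a small variation (untested, same assumptions as above) would be to create the file inside the loop:
for m in client.iter_messages(c):
    if m.photo:
        # Create a fresh Drive file per message so each photo becomes its own upload.
        # 'drive_directory_path' is the placeholder parent folder id from the question.
        gfile = drive.CreateFile({'parents': [{'id': 'drive_directory_path'}]})
        gfile.content = io.BytesIO(m.download_media(file=bytes))
        gfile.Upload()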
Note that I only tested the PyDrive side since I currently don't have access to the Telegram API, but looking at the docs I believe this should work. Let me know what happens.
Sources:
PyDrive docs and source
Telethon docs and source

Can you retrieve and access netcdf files from an FTP server using tempfile and xarray? [duplicate]

I would like to retrieve the data inside a compressed .gz file stored on an FTP server, without writing the file to local storage.
At the moment I have done
from ftplib import FTP
import gzip

ftp = FTP('ftp.server.com')
ftp.login()
ftp.cwd('/a/folder/')
fileName = 'aFile.gz'
with open(fileName, 'wb') as localfile:
    ftp.retrbinary('RETR ' + fileName, localfile.write, 1024)
f = gzip.open(fileName, 'rb')
data = f.read()
This, however, writes the file to local storage.
I tried to change this to:
from ftplib import FTP
import zlib
ftp = FTP('ftp.server.com')
ftp.login()
ftp.cwd('/a/folder/')
fileName = 'aFile.gz'
data = ftp.retrbinary('RETR '+fileName, zlib.decompress, 1024)
but ftp.retrbinary does not return the output of its callback.
Is there a way to do this?
A simple implementation is to:
download the file to an in-memory file-like object, like BytesIO;
pass that to the fileobj parameter of the GzipFile constructor.
import gzip
import shutil
from ftplib import FTP
from io import BytesIO

ftp = FTP('ftp.example.com')
ftp.login('username', 'password')

flo = BytesIO()
ftp.retrbinary('RETR /remote/path/archive.tar.gz', flo.write)
flo.seek(0)

with open('archive.tar', 'wb') as fout, gzip.GzipFile(fileobj=flo) as gz:
    shutil.copyfileobj(gz, fout)
The above loads the whole .gz file into memory, which can be inefficient for large files. A smarter implementation would stream the data instead; one way is sketched below.
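For reference, a minimal sketch of the streaming idea using zlib.decompressobj instead of a custom file-like object (the server name and paths are placeholders):
import zlib
from ftplib import FTP

ftp = FTP('ftp.example.com')
ftp.login('username', 'password')

# wbits = 16 + MAX_WBITS tells zlib to expect a gzip header.
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)

with open('archive.tar', 'wb') as fout:
    def handle_chunk(chunk):
        # Decompress each chunk as it arrives, so the whole .gz
        # never has to fit in memory at once.
        fout.write(decompressor.decompress(chunk))

    ftp.retrbinary('RETR /remote/path/archive.tar.gz', handle_chunk)
    # Flush any data still buffered inside the decompressor.
    fout.write(decompressor.flush())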
See also Get files names inside a zip file on FTP server without downloading whole archive.

Reading zip files from Amazon S3 using pre-signed url without knowing object key and bucket name

I have a password-protected zip file stored in Amazon S3 which I need to read from a Python program, extract the CSV file from it, and read it into a dataframe. Initially, I was doing it using the object key and bucket name.
import io
import zipfile

import boto3
import pandas as pd

s3 = boto3.client('s3', aws_access_key_id="<access_key>",
                  aws_secret_access_key="<secret_key>", region_name="<region>")
s3_resource = boto3.resource('s3', aws_access_key_id="<access_key>",
                             aws_secret_access_key="<secret_key>", region_name="<region>")

obj = s3.get_object(Bucket="<bucket_name>", Key="<obj_key>")
with io.BytesIO(obj["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    with zipfile.ZipFile(tf, mode='r') as zipf:
        df = pd.read_csv(zipf.open('<file_name.csv>', pwd=b'<password>'), sep='|')
        print(df)
But due to some security concerns, I won't be able to do this anymore. That is, I won't have the object key or the bucket name. And since I won't have the key, I won't have the file_name.csv either. All I will have is a pre-signed URL. Is it possible to read the zip file using a pre-signed URL? How do I do that?
A pre-signed URL contains all the information you need to download the file, and you don't need boto3 for that. Instead, you can use regular Python tools for downloading files from the internet (such as requests or urllib), where the URL is your pre-signed URL.
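For example, a rough sketch with requests (the URL and password below are placeholders, and the inner CSV name is taken from the archive listing since the object key isn't known):
import io
import zipfile

import pandas as pd
import requests

presigned_url = "<your pre-signed URL>"

resp = requests.get(presigned_url)
resp.raise_for_status()

with io.BytesIO(resp.content) as tf:
    with zipfile.ZipFile(tf, mode='r') as zipf:
        # Pick the first .csv entry, since the inner file name isn't known upfront.
        csv_name = next(n for n in zipf.namelist() if n.endswith('.csv'))
        df = pd.read_csv(zipf.open(csv_name, pwd=b'<password>'), sep='|')
print(df)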

Read a utf-16LE file directly in cloud function -python/GCP

I have a CSV file with utf-16le encoding, and I tried to open it in a Cloud Function using:
import pandas as pd
from io import StringIO as sio

with open("gs://bucket_name/my_file.csv", "r", encoding="utf16") as f:
    read_all_once = f.read()

read_all_once = read_all_once.replace('"', "")
file_like = sio(read_all_once)
df = pd.read_csv(file_like, sep=";", skiprows=5)
I get an error that the file is not found at that location. What is the issue? When I run the same code locally with a local path, it works.
Also, when the file is in utf-8 encoding I can read it directly with:
df = pd.read_csv("gs://bucket_name/my_file.csv", delimiter=";", encoding="utf-8", skiprows=0, low_memory=False)
I need to know: can I read the utf-16 file directly with pd.read_csv()? If not, how do I make open() recognize the gs:// path?
Thanks in advance!
Yes, you can read the utf-16 CSV file directly with the pd.read_csv() method; pandas resolves gs:// paths through gcsfs, whereas the built-in open() only understands local paths, which is why your first snippet fails.
For the method to work, please make sure that the service account attached to your function has access to read the CSV file in the Cloud Storage bucket.
Also check whether the encoding of the CSV file you are using is "utf-16", "utf-16le", or "utf-16be", and pass the appropriate one to the method.
I used the Python 3.7 runtime.
My main.py and requirements.txt files look as below. You can modify main.py according to your use case.
main.py
import pandas as pd

def hello_world(request):
    # please change the file's URI
    data = pd.read_csv('gs://bucket_name/file.csv', encoding='utf-16le')
    print(data)
    return f'check the results in the logs'
requirements.txt
pandas==1.1.0
gcsfs==0.6.2
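If you still want the with open() style from your original snippet, a sketch that uses gcsfs directly should also work (gcsfs is already in requirements.txt above; the project and bucket names are placeholders):
import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem(project="PROJECT_ID")

# The built-in open() only understands local paths; fs.open() accepts gs:// URIs.
with fs.open("gs://bucket_name/my_file.csv", "r", encoding="utf-16le") as f:
    df = pd.read_csv(f, sep=";", skiprows=5)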

How to upload Cloud files into Python

I usually load Excel files into a pandas DataFrame with
pd.ExcelFile when the files are on my local drive.
How can I do the same if I have an Excel file in Google Drive or Microsoft OneDrive and I want to connect remotely?
You can use read_csv() on a StringIO object:
from io import StringIO  # in Python 2 this was the StringIO module
import pandas as pd
import requests

r = requests.get('Your google drive link')
data = r.text  # r.content returns bytes; StringIO needs str
df = pd.read_csv(StringIO(data))
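Since the question is about Excel files, a variation using pd.read_excel on the downloaded bytes may be closer to what you need (a sketch, assuming the link is a direct-download link such as Google Drive's uc?export=download form):
import io

import pandas as pd
import requests

# Placeholder direct-download link.
url = 'https://drive.google.com/uc?export=download&id=<file_id>'
r = requests.get(url)
r.raise_for_status()

# pd.read_excel accepts a file-like object, so wrap the bytes in BytesIO.
# Reading .xlsx files requires openpyxl to be installed.
df = pd.read_excel(io.BytesIO(r.content))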
