I am trying to fetch a .wav file from Amazon S3 and modify it with the AudioSegment library. To fetch the file from S3 I use boto3 and the io module; for the audio operations I use pydub's AudioSegment.
When I fetch the file from S3 using BytesIO and pass it to AudioSegment, I get a "System can not find the file specified" error. Below is my code:
import boto3
from pydub import AudioSegment
import io
client = boto3.client('s3')
obj = client.get_object(Bucket='<BucketName>', Key='<FileName>')
data = io.BytesIO(obj['Body'].read())
sound1 = AudioSegment.from_file(data)
I get the error at AudioSegment.from_file(data):
System can not find the file specified
Try specifying the format argument for AudioSegment, since there is no file name to infer it from. For your .wav file:
sound1 = AudioSegment.from_file(data, format='wav')
Note also that pydub shells out to the ffmpeg/ffprobe executables, and a missing ffmpeg produces this same "system cannot find the file specified" error on Windows, so make sure ffmpeg is installed and on your PATH.
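A quick way to convince yourself that the BytesIO part is not the problem is a stdlib-only sketch (no S3, no pydub) that writes and re-reads a minimal WAV entirely in memory; the wave module, like AudioSegment.from_file, accepts any seekable file object:

```python
import io
import wave

# Build a one-second mono WAV entirely in memory.
buf = io.BytesIO()
with wave.open(buf, 'wb') as w:
    w.setnchannels(1)     # mono
    w.setsampwidth(2)     # 16-bit samples
    w.setframerate(8000)  # 8 kHz
    w.writeframes(b'\x00\x00' * 8000)  # one second of silence

buf.seek(0)  # rewind before handing the buffer to a reader
with wave.open(buf, 'rb') as r:
    info = (r.getnchannels(), r.getframerate(), r.getnframes())
print(info)  # (1, 8000, 8000)
```

If this works but AudioSegment still fails, the issue is in the decoding step (format detection or ffmpeg), not in the in-memory buffer.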
I would like to retrieve the data inside a compressed .gz file stored on an FTP server, without writing the file to the local disk.
At the moment I have done
from ftplib import FTP
import gzip
ftp = FTP('ftp.server.com')
ftp.login()
ftp.cwd('/a/folder/')
fileName = 'aFile.gz'
localfile = open(fileName, 'wb')
ftp.retrbinary('RETR ' + fileName, localfile.write, 1024)
localfile.close()
f = gzip.open(fileName, 'rb')
data = f.read()
This, however, writes the file "localfile" on the current storage.
I tried to change this in
from ftplib import FTP
import zlib
ftp = FTP('ftp.server.com')
ftp.login()
ftp.cwd('/a/folder/')
fileName = 'aFile.gz'
data = ftp.retrbinary('RETR '+fileName, zlib.decompress, 1024)
but ftp.retrbinary does not return the output of its callback.
Is there a way to do this?
A simple implementation is to:
download the file to an in-memory file-like object, like BytesIO;
pass that to the fileobj parameter of the GzipFile constructor.
import gzip
from io import BytesIO
import shutil
from ftplib import FTP

ftp = FTP('ftp.example.com')
ftp.login('username', 'password')

flo = BytesIO()
ftp.retrbinary('RETR /remote/path/archive.tar.gz', flo.write)
flo.seek(0)  # rewind before reading

with open('archive.tar', 'wb') as fout, gzip.GzipFile(fileobj=flo) as gz:
    shutil.copyfileobj(gz, fout)
The above loads the whole .gz file into memory, which can be inefficient for large files. A smarter implementation would stream the data instead, but that would probably require implementing a custom file-like object.
See also Get files names inside a zip file on FTP server without downloading whole archive.
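A rough sketch of the streaming variant, using zlib's incremental decompressor instead of GzipFile (the chunk size and callback wiring are illustrative, and the FTP transfer is simulated here with an in-memory loop so the sketch is self-contained):

```python
import gzip
import zlib

# wbits = 16 + MAX_WBITS tells zlib to expect a gzip header/trailer.
decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
chunks = []

def on_chunk(block):
    # This is what you would pass as the callback to ftp.retrbinary(...):
    # each compressed block is decompressed as soon as it arrives.
    chunks.append(decomp.decompress(block))

payload = b'hello world ' * 1000
compressed = gzip.compress(payload)

# Simulate the FTP transfer arriving in 1024-byte blocks.
for i in range(0, len(compressed), 1024):
    on_chunk(compressed[i:i + 1024])
chunks.append(decomp.flush())

data = b''.join(chunks)
assert data == payload
```

With a real connection you would replace the loop with ftp.retrbinary('RETR ' + fileName, on_chunk, 1024), so only one compressed block is held in memory at a time.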
I have a CSV file with UTF-16LE encoding, and I tried to open it in a Cloud Function using
import pandas as pd
from io import StringIO as sio
with open("gs://bucket_name/my_file.csv", "r", encoding="utf16") as f:
    read_all_once = f.read()
read_all_once = read_all_once.replace('"', "")
file_like = sio(read_all_once)
df = pd.read_csv(file_like, sep=";", skiprows=5)
I get an error that the file is not found at that location. What is the issue? When I run the same code locally with a local path, it works.
Also, when the file is in UTF-8 encoding I can read it directly with
df = pd.read_csv("gs://bucket_name/my_file.csv", delimiter=";", encoding="utf-8", skiprows=0, low_memory=False)
I need to know: can I read the UTF-16 file directly with pd.read_csv()? If not, how do I make open() recognize the path?
Thanks in advance!
Yes, you can read the UTF-16 CSV file directly with the pd.read_csv() method. (The built-in open() cannot resolve gs:// paths; pandas hands such URLs off to gcsfs, which is why the direct call works.)
For the method to work, make sure the service account attached to your function has permission to read the CSV file in the Cloud Storage bucket.
Also check whether the encoding of your CSV file is "utf-16", "utf-16le", or "utf-16be", and pass the appropriate one to the method.
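One way to check which variant a file uses is to look at its first two bytes, the byte-order mark, assuming the file carries one (the helper name below is illustrative):

```python
def guess_utf16_variant(first_bytes):
    # A BOM of FF FE means little-endian, FE FF means big-endian.
    if first_bytes.startswith(b'\xff\xfe'):
        return 'utf-16le (BOM)'  # pass encoding='utf-16' to strip the BOM
    if first_bytes.startswith(b'\xfe\xff'):
        return 'utf-16be (BOM)'
    return 'no BOM detected'

# Python's 'utf-16' codec writes a little-endian BOM by default.
sample = 'col1;col2'.encode('utf-16')
print(guess_utf16_variant(sample[:2]))  # utf-16le (BOM)
```

If the file has a BOM, encoding="utf-16" is usually the safest choice, because the decoder then consumes the BOM instead of leaking it into the first column name.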
I used the Python 3.7 runtime. My main.py and requirements.txt files look as below; you can modify main.py according to your use case.
main.py
import pandas as pd

def hello_world(request):
    # please change the file's URI
    data = pd.read_csv('gs://bucket_name/file.csv', encoding='utf-16le')
    print(data)
    return 'check the results in the logs'
requirements.txt
pandas==1.1.0
gcsfs==0.6.2
When trying to read a file from S3 with joblib.load() I get ValueError: embedded null byte.
The files were created by joblib and load successfully from local copies (made locally before uploading to S3), so the error presumably lies in how the files are stored and retrieved from S3.
Minimal code:
# Imports (AWS credentials assumed)
import boto3
from sklearn.externals import joblib

s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"
joblib.load(s3.Bucket(bucket_str).Object(bucket_key).get()['Body'].read())
The following code reconstructs a local copy of the file in memory before feeding it into joblib.load(), enabling a successful load.
from io import BytesIO
import boto3
from sklearn.externals import joblib

s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"
with BytesIO() as data:
    s3.Bucket(bucket_str).download_fileobj(bucket_key, data)
    data.seek(0)  # move back to the beginning after writing
    df = joblib.load(data)
I assume, but am not certain, that joblib.load() interprets the raw bytes returned by read() as a file name rather than file contents, so the null bytes in the binary data break the path lookup; wrapping the stream in BytesIO hands joblib.load() a proper file-like object instead.
PS: with this method the file never touches the local disk, which is helpful in some circumstances (e.g. a node with lots of RAM but tiny disk space...).
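The data.seek(0) step above is the part that is easy to miss. The same in-memory pattern can be exercised with only the standard library, using pickle as a stand-in (joblib's on-disk format is pickle-based):

```python
import pickle
from io import BytesIO

obj = {'weights': [0.1, 0.2, 0.3]}

# Serialize into an in-memory buffer, as download_fileobj would fill one.
data = BytesIO()
pickle.dump(obj, data)

data.seek(0)  # without rewinding, load() would start at EOF and fail
restored = pickle.load(data)
assert restored == obj
```

The buffer's position sits at the end after writing, so any loader that reads from the current position sees an empty stream unless you rewind first.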
You can do it like this using the s3fs package:
import joblib
import s3fs

fs = s3fs.S3FileSystem()
with fs.open('s3://my-aws-bucket/some-pseudo/folder-set/my-filename.joblib', 'rb') as f:
    df = joblib.load(f)
I guess everybody has their own preference but I really like s3fs because it makes the code look very familiar to people who haven't worked with s3 before.
I am using the Python library PyPDF2 and trying to read a PDF file using PdfReader. It works fine for a local PDF file. Is there a way to access my PDF file from a Google Cloud Storage bucket (gs://bucket_name/object_name)?
from PyPDF2 import PdfReader

with open('testpdf.pdf', 'rb') as f1:
    reader = PdfReader(f1)
    number_of_pages = len(reader.pages)
Instead of 'testpdf.pdf', how can I provide my Google Cloud Storage object location? Please let me know if anyone has tried this.
You can use the gcsfs library to access files in a GCS bucket. For example:
import gcsfs
from pypdf import PdfReader

gcs_file_system = gcsfs.GCSFileSystem(project="PROJECT_ID")
gcs_pdf_path = "gs://bucket_name/object.pdf"

# Open the object in binary mode and hand the file object to PdfReader
with gcs_file_system.open(gcs_pdf_path, "rb") as f_object:
    reader = PdfReader(f_object)
    num = len(reader.pages)  # number of pages
Some tutorials explained that you should use StringIO with Pillow's save() method, but when I use this test code:
from PIL import Image
from io import StringIO, BytesIO
photo = Photo.objects.get(pk=1)
bytes = BytesIO()
string = StringIO()
image = Image.open(photo.image)
image.save(string, 'PNG')
then I get the error:
string argument expected, got 'bytes'
But when I use BytesIO like this:
image.save(bytes, 'PNG')
it works fine. That is strange, because the error message says a string is expected and bytes is wrong, yet obviously the opposite is correct. It also contradicts the information I found in those tutorials.
Maybe the behavior of save() was changed in the Pillow fork and the error message was not updated? Or is it different because I use Python 3 with the io module instead of the StringIO module?
Edit: examples where StringIO is proposed:
How do you convert a PIL `Image` to a Django `File`?
https://djangosnippets.org/snippets/10473/
Django - Rotate image and save
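For what it's worth, the distinction can be reproduced without Pillow at all: io.StringIO accepts only str and io.BytesIO only bytes, and encoded PNG output is binary data, so the "string argument expected, got 'bytes'" message is StringIO itself rejecting the image bytes:

```python
from io import BytesIO, StringIO

png_bytes = b'\x89PNG\r\n'  # the start of any PNG stream: binary, not text

# A text buffer rejects bytes with the same message Pillow surfaces.
try:
    StringIO().write(png_bytes)
    rejected = False
except TypeError as exc:
    rejected = True
    print(exc)  # string argument expected, got 'bytes'

# A binary buffer accepts them without complaint.
buf = BytesIO()
buf.write(png_bytes)
```

So the error message reads backwards only at first glance: it is the StringIO target (not Pillow) demanding a string, which in Python 3 an encoded image can never be; the old tutorials date from Python 2, where the StringIO module accepted byte strings.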