ipywidgets - widgets.FileUpload, read the uploaded CSV file - python-3.x

I am using JupyterHub, with the .ipynb file hosted on a server. My use case is to upload a CSV file from the local drive and read it for other dataframe tasks.
uploader = widgets.FileUpload(
    accept='*.csv',  # Accepted file extension e.g. '.txt', '.pdf', 'image/*', 'image/*,.pdf'
    multiple=False   # True to accept multiple files upload else False
)
display(uploader)
[input_file] = uploader.value
print(input_file)
pd.read_csv(input_file)
print(input_file) prints Test.csv, which is the CSV file name.
I am able to print input_file, but pd.read_csv(input_file) throws the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'Test.csv'
I am not sure where the CSV is uploaded; how can I read that data? Please help.

I don't have your exact ipywidgets version, but can you try this:
import io
import pandas as pd

# Take the first (and only) uploaded file's entry and read its raw bytes
input_file = list(uploader.value.values())[0]
content = input_file['content']
content = io.StringIO(content.decode('utf-8'))
df = pd.read_csv(content)
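For reference, the structure of uploader.value differs between ipywidgets versions (a dict keyed by filename in 7.x, a tuple of dicts in 8.x), so a version-tolerant sketch along these lines may help; treat the branch details as assumptions to check against your installed version:
import io
import pandas as pd

# Sketch: handle both ipywidgets 7.x (dict keyed by filename) and 8.x (tuple of dicts).
value = uploader.value
if isinstance(value, dict):      # ipywidgets 7.x
    uploaded = next(iter(value.values()))
else:                            # ipywidgets 8.x
    uploaded = value[0]

raw = bytes(uploaded['content'])       # content may be bytes or a memoryview
df = pd.read_csv(io.BytesIO(raw))
print(df.head())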

Related

Compress multiple files with zip in AWS Glue

I'm trying to compress multiple CSV files from a folder in an S3 bucket using AWS Glue. I have the script below.
from datetime import datetime
import pathlib
from zipfile import ZipFile, ZIP_DEFLATED

currentdate = datetime.now().strftime("%Y%m%d")
directory = pathlib.Path("s3://mys3bucket/folder1/" + currentdate + "/")
zippy = "s3://mys3bucket/folder1/" + currentdate + "-abc.zip"

with ZipFile(zippy, "w", ZIP_DEFLATED) as archive:
    for file_path in directory.rglob("*csv"):
        archive.write(file_path, arcname=file_path.relative_to(directory))
It ends in the error below.
FileNotFoundError: [Errno 2] No such file or directory: 's3://mys3bucket/folder1/20220420-abc.zip'
Can anyone advise how to solve this error?
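No answer is recorded here, but the error itself points at the cause: ZipFile and pathlib only understand local filesystem paths, not s3:// URLs. A rough sketch of one common workaround, assuming boto3 is available in the Glue job and reusing the bucket and prefix names from the question, is to build the zip in memory and upload it explicitly:
import io
import boto3
from zipfile import ZipFile, ZIP_DEFLATED

# Sketch: list the CSV objects under the prefix, zip them in memory, upload the zip.
s3 = boto3.client("s3")
bucket = "mys3bucket"
prefix = "folder1/20220420/"

buffer = io.BytesIO()
with ZipFile(buffer, "w", ZIP_DEFLATED) as archive:
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        key = obj["Key"]
        if key.endswith(".csv"):
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            archive.writestr(key[len(prefix):], body)

buffer.seek(0)
s3.upload_fileobj(buffer, bucket, "folder1/20220420-abc.zip")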

FileNotFoundError when using shutil.copy()

I want to copy a source file to a destination folder.
Inside the destination folder I am creating n folders (which may be nested to any depth), and I want to paste file1.pdf into one of those random folders inside the destination folder.
I have written the following code:
import shutil
destination_folder = "path_to_the_destination_folder"
source_file = "path_to_source_file\file1.pdf"
destination_file = "f{destination_folder}\any_random_folder_from_n_nested_folders\file1.pdf"
new_file= shutil.copy(source_file, destination_file )
print(new_file)
FYI: "destination_folder\any_random_folders_from_n_nested_folders" This path is present, means it is getting created successfully , checked using os.path.isdir()
And facing this issue for some files, not for all files..
But it is giving me an error:
File "\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 420, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
File "\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 265, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory
I believe you have incorrectly specified the destination path; try this:
import shutil

destination_folder = "path_to_the_destination_folder"
source_file = r"path_to_source_file\file1.pdf"
# The f prefix belongs before the opening quote; a raw string (rf"...") also
# keeps backslashes such as \a and \f from being treated as escape sequences.
destination = rf"{destination_folder}\any_random_folders_from_n_nested_folders\file1.pdf"
dest = shutil.copy(source_file, destination)
print(dest)
If this doesn't work, then check if all the directories exist and also check the permissions.
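As a side note, here is a small sketch (with hypothetical folder names) that uses os.path.join instead of hand-built backslash strings and creates any missing directories first, which avoids both the escape-sequence pitfall and the FileNotFoundError when an intermediate folder does not exist:
import os
import shutil

destination_folder = "path_to_the_destination_folder"
source_file = os.path.join("path_to_source_file", "file1.pdf")
destination = os.path.join(destination_folder, "any_random_folder", "file1.pdf")

# shutil.copy fails with FileNotFoundError if the destination directory is missing,
# so make sure it exists before copying.
os.makedirs(os.path.dirname(destination), exist_ok=True)
print(shutil.copy(source_file, destination))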
This issue is resolved.
I just changed the folder names, which I was creating based on user input.
Previous folder names: f"Fld+{datetimestamp}"
Current folder names: f"Fld+{folder_counter}"
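A plausible explanation, assuming the old datetimestamp contained colons or similar characters: Windows rejects : * ? " < > | (and a few others) in path components, so a plain counter works, and so would a timestamp formatted without them. A minimal sketch:
from datetime import datetime

# Sketch: a folder-name timestamp with no characters that Windows rejects.
safe_stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
folder_name = f"Fld_{safe_stamp}"    # hypothetical name pattern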

Upload Gzip file using Boto3

I am trying to gzip files before uploading them to S3. If you look at the code below, the files uploaded to S3 show no change in size, so I am trying to figure out if I have missed something.
import gzip
import shutil
from io import BytesIO

def upload_gzipped(bucket, key, fp, compressed_fp=None, content_type='text/plain'):
    """Compress and upload the contents from fp to S3.

    If compressed_fp is None, the compression is performed in memory.
    """
    if not compressed_fp:
        compressed_fp = BytesIO()
    with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
        shutil.copyfileobj(fp, gz)
    compressed_fp.seek(0)
    bucket.upload_fileobj(
        compressed_fp,
        key,
        {'ContentType': content_type, 'ContentEncoding': 'gzip'})
Courtesy Link for the source
And this is how I am using this function: basically reading files as a stream from SFTP, then trying to gzip them and write them to S3.
with pysftp.Connection(host_name, username=user, password=password, cnopts=cnopts, port=int(port)) as sftp:
    list_of_files = sftp.listdir('{}{}'.format(base_path, file_path))
    is_file_found = False
    for file_name in list_of_files:
        if entity_name in str(file_name.lower()):
            is_file_found = True
            flo = BytesIO()
            # Step 1: Read file using SFTP as an input stream
            sftp.getfo('{}{}/{}'.format(base_path, file_path, file_name), flo)
            s3_destination_key = '{}/{}'.format(s3_path, file_name)
            # Step 2: Write files to destination S3
            logger.info('Moving file to S3 {} '.format(s3_destination_key))
            # Creating a bucket resource to use bucket object for file upload
            input_bucket_object = S3.Bucket(environment_config['S3_INBOX_BUCKET'])
            flo.seek(0)
            upload_gzipped(input_bucket_object, s3_destination_key, flo)
It seems like the upload_gzipped function uses shutil.copyfileobj incorrectly.
Looking at https://docs.python.org/3/library/shutil.html#shutil.copyfileobj shows that you put the source first and the destination second.
Also, you're just writing your object to a gzipped object without ever actually compressing it.
You need to compress fp into a gzip object, then upload that specific object to S3.
I'd recommend not using that gist from GitHub as it seems wrong.
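A minimal sketch of that advice, reusing the flo, input_bucket_object and s3_destination_key names from the question: compress the source stream into its own in-memory buffer, then upload that buffer with the gzip content headers.
import gzip
import shutil
from io import BytesIO

def upload_gzipped_stream(bucket, key, fp, content_type='text/plain'):
    """Gzip the readable file object fp into an in-memory buffer and upload it."""
    compressed_fp = BytesIO()
    with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
        shutil.copyfileobj(fp, gz)   # copyfileobj(src, dst): read fp, write the gzip stream
    compressed_fp.seek(0)
    bucket.upload_fileobj(
        compressed_fp,
        key,
        ExtraArgs={'ContentType': content_type, 'ContentEncoding': 'gzip'},
    )

# Usage with the names from the question:
# flo.seek(0)
# upload_gzipped_stream(input_bucket_object, s3_destination_key, flo)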

Google cloud function with wand stopped working

I have set up 3 Google Cloud Storage buckets and 3 functions (one for each bucket) that trigger when a PDF file is uploaded to a bucket. The functions convert the PDF to a PNG image and do further processing.
When I try to create a 4th bucket and a similar function, strangely it does not work. Even if I copy one of the existing 3 functions, it still does not work and I am getting this error:
Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 333, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 199, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 196, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 27, in pdf_to_img
    with Image(filename=tmp_pdf, resolution=300) as image:
  File "/env/local/lib/python3.7/site-packages/wand/image.py", line 2874, in __init__
    self.read(filename=filename, resolution=resolution)
  File "/env/local/lib/python3.7/site-packages/wand/image.py", line 2952, in read
    self.raise_exception()
  File "/env/local/lib/python3.7/site-packages/wand/resource.py", line 222, in raise_exception
    raise e
wand.exceptions.PolicyError: not authorized `/tmp/tmphm3hiezy' @ error/constitute.c/ReadImage/412
It is baffling me why the same functions work on the existing buckets but not on the new one.
UPDATE:
Even this is not working (getting "cache resources exhausted" error):
In requirements.txt:
google-cloud-storage
wand
In main.py:
import tempfile
from google.cloud import storage
from wand.image import Image

storage_client = storage.Client()

def pdf_to_img(data, context):
    file_data = data
    pdf = file_data['name']
    if pdf.startswith('v-'):
        return

    bucket_name = file_data['bucket']
    blob = storage_client.bucket(bucket_name).get_blob(pdf)
    _, tmp_pdf = tempfile.mkstemp()
    _, tmp_png = tempfile.mkstemp()
    tmp_png = tmp_png + ".png"
    blob.download_to_filename(tmp_pdf)
    with Image(filename=tmp_pdf) as image:
        image.save(filename=tmp_png)
    print("Image created")
    new_file_name = "v-" + pdf.split('.')[0] + ".png"
    blob.bucket.blob(new_file_name).upload_from_filename(tmp_png)
The above code is supposed to just create an image copy of the PDF file that is uploaded to the bucket.
Because the vulnerability has been fixed in Ghostscript but the fix has not yet landed in ImageMagick, the workaround for converting PDFs to images in Google Cloud Functions is to use this ghostscript wrapper and request the PDF-to-PNG conversion directly from Ghostscript (bypassing ImageMagick).
requirements.txt
google-cloud-storage
ghostscript==0.6
main.py
import locale
import tempfile

import ghostscript
from google.cloud import storage

storage_client = storage.Client()

def pdf_to_img(data, context):
    file_data = data
    pdf = file_data['name']
    if pdf.startswith('v-'):
        return

    bucket_name = file_data['bucket']
    blob = storage_client.bucket(bucket_name).get_blob(pdf)
    _, tmp_pdf = tempfile.mkstemp()
    _, tmp_png = tempfile.mkstemp()
    tmp_png = tmp_png + ".png"
    blob.download_to_filename(tmp_pdf)

    # Use Ghostscript to export the downloaded PDF as a PNG in the temp location
    args = [
        "pdf2png",  # actual value doesn't matter
        "-dSAFER",
        "-sDEVICE=pngalpha",
        "-o", tmp_png,
        "-r300", tmp_pdf
    ]
    # The arguments have to be bytes, so encode them
    encoding = locale.getpreferredencoding()
    args = [a.encode(encoding) for a in args]

    # Run the request through Ghostscript
    ghostscript.Ghostscript(*args)

    print("Image created")
    new_file_name = "v-" + pdf.split('.')[0] + ".png"
    blob.bucket.blob(new_file_name).upload_from_filename(tmp_png)
Anyway, this gets you around the issue and keeps all the processing in GCF. Hope it helps. (Your code works for single-page PDFs, though; my use case was multipage PDF conversion, and the Ghostscript code and solution are in this question.)
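For what it's worth, here is a sketch of the multipage case (not from the answer above): Ghostscript writes one PNG per page when the output pattern contains a %d placeholder, so something along these lines should work with the same ghostscript wrapper; the local input path is a placeholder.
import glob
import locale
import tempfile

import ghostscript

# Sketch: one PNG per page via a %d pattern in the output path.
tmp_dir = tempfile.mkdtemp()
output_pattern = tmp_dir + "/page-%03d.png"   # page-001.png, page-002.png, ...
args = [
    "pdf2png",          # actual value doesn't matter
    "-dSAFER",
    "-sDEVICE=pngalpha",
    "-o", output_pattern,
    "-r300", "input.pdf",   # placeholder path to the downloaded PDF
]
encoding = locale.getpreferredencoding()
ghostscript.Ghostscript(*[a.encode(encoding) for a in args])
page_files = sorted(glob.glob(tmp_dir + "/page-*.png"))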
This actually seems to be a showstopper for ImageMagick-related functionality using the PDF format. Similar code deployed by us on Google App Engine via a custom Docker image is failing with the same error about missing authorization.
I am not sure how to edit the policy.xml file on GAE or GCF, but a line there has to be changed to:
<policy domain="coder" rights="read|write" pattern="PDF" />
@Dustin: Do you have a bug link where we can see the progress?
Update:
I fixed it on my Google App Engine container by adding a line to the Docker image. This directly changes the policy.xml file content after ImageMagick gets installed.
RUN sed -i 's/rights="none"/rights="read|write"/g' /etc/ImageMagick-6/policy.xml
This is an upstream bug in Ubuntu; we are working on a workaround for App Engine and Cloud Functions.
While we wait for the issue to be resolved in Ubuntu, I followed @DustinIngram's suggestion and created a virtual machine in Compute Engine with an ImageMagick installation. The downside is that I now have a second API that my API in App Engine has to call just to generate the images. Having said that, it's working fine for me. This is my setup:
Main API:
When a pdf file is uploaded to Cloud Storage, I call the following:
response = requests.post('http://xx.xxx.xxx.xxx:5000/makeimages', data=data)
Where data is a JSON string with the format {"file_name": file_name}
On the API that is running on the VM, the POST request gets processed as follows:
@app.route('/makeimages', methods=['POST'])
def pdf_to_jpg():
    file_name = request.form['file_name']
    blob = storage_client.bucket(bucket_name).get_blob(file_name)
    _, temp_local_filename = tempfile.mkstemp()
    temp_local_filename_jpeg = temp_local_filename + '.jpg'

    # Download file from bucket.
    blob.download_to_filename(temp_local_filename)
    print('Image ' + file_name + ' was downloaded to ' + temp_local_filename)

    with Image(filename=temp_local_filename, resolution=300) as img:
        pg_num = 0
        image_files = {}
        image_files['pages'] = []
        for img_page in img.sequence:
            img_page_2 = Image(image=img_page)
            img_page_2.format = 'jpeg'
            img_page_2.compression_quality = 70
            img_page_2.save(filename=temp_local_filename_jpeg)
            new_file_name = file_name.replace('.pdf', 'p') + str(pg_num) + '.jpg'
            new_blob = blob.bucket.blob(new_file_name)
            new_blob.upload_from_filename(temp_local_filename_jpeg)
            print('Page ' + str(pg_num) + ' was saved as ' + new_file_name)
            image_files['pages'].append({'page': pg_num, 'file_name': new_file_name})
            pg_num += 1

    try:
        os.remove(temp_local_filename)
    except (ValueError, PermissionError):
        print('Could not delete the temp file!')

    return jsonify(image_files)
This will download the PDF from Cloud Storage, create an image for each page, and save the images back to Cloud Storage. The API then returns a JSON response with the list of image files created.
So, not the most elegant solution, but at least I don't need to convert the files manually.

Adding timestamp to a file in PYTHON

I am able to rename a file without any problem/error using os.rename().
But the moment I try to rename a file with a timestamp added to it, it throws a WinError 3 or WinError 123. I have tried all combinations but no luck; could anyone help?
Successfully running code:
#!/usr/bin/python
import datetime
import os
import shutil
import json
import re

maindir = "F:/Protocols/"
os.chdir(maindir)
maindir = os.getcwd()
print("Working Directory : " + maindir)

path_4_all_iter = os.path.abspath("all_iteration.txt")
now = datetime.datetime.now()
timestamp = str(now.strftime("%Y%m%d_%H:%M:%S"))
print(type(timestamp))
archive_name = "all_iteration_" + timestamp + ".txt"
print(archive_name)
print(os.getcwd())

if os.path.exists("all_iteration.txt"):
    print("File Exists")
    os.rename(path_4_all_iter, "F:/Protocols/archive/archive.txt")
print(os.listdir("F:/Protocols/archive/"))
print(os.path.abspath("all_iteration.txt"))
Log:
E:\python.exe C:/Users/SPAR/PycharmProjects/Sample/debug.py
Working Directory : F:\Protocols
<class 'str'>
all_iteration_20180409_20:25:51.txt
F:\Protocols
File Exists
['archive.txt']
F:\Protocols\all_iteration.txt
Process finished with exit code 0
Failing code:
print(os.getcwd())

if os.path.exists("all_iteration.txt"):
    print("File Exists")
    os.rename(path_4_all_iter, "F:/Protocols/archive/" + archive_name)
print(os.listdir("F:/Protocols/archive/"))
print(os.path.abspath("all_iteration.txt"))
Error log:
E:\python.exe C:/Users/SPAR/PycharmProjects/Sample/debug.py
Working Directory : F:\Protocols
<class 'str'>
all_iteration_20180409_20:31:16.txt
F:\Protocols
File Exists
Traceback (most recent call last):
  File "C:/Users/SPAR/PycharmProjects/Sample/debug.py", line 22, in <module>
    os.rename(path_4_all_iter, "F:/Protocols/archive/"+archive_name)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'F:\\Protocols\\all_iteration.txt' -> 'F:/Protocols/archive/all_iteration_20180409_20:31:16.txt'
Process finished with exit code 1
Your timestamp format has colons in it, which are not allowed in Windows filenames. See this answer on that subject:
How to get a file in Windows with a colon in the filename?
If you change your timestamp format to something like:
timestamp = str(now.strftime("%Y%m%d_%H-%M-%S"))
it should work.
You can't have : characters as part of the filename, so change
timestamp = str(now.strftime("%Y%m%d_%H:%M:%S"))
to
timestamp = str(now.strftime("%Y%m%d_%H%M%S"))
and you'll be able to rename your file.
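Putting that together, here is a minimal sketch of the failing block with a Windows-safe timestamp (same paths as in the question):
import datetime
import os

now = datetime.datetime.now()
timestamp = now.strftime("%Y%m%d_%H-%M-%S")   # no colons, valid in Windows filenames
archive_name = "all_iteration_" + timestamp + ".txt"

path_4_all_iter = os.path.abspath("all_iteration.txt")
if os.path.exists("all_iteration.txt"):
    print("File Exists")
    os.rename(path_4_all_iter, "F:/Protocols/archive/" + archive_name)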
