Compress multiple files with zip in AWS Glue - apache-spark

I'm trying to compress multiple CSV files from a folder in an S3 bucket using AWS Glue. I have the script below.
from datetime import datetime
import pathlib
from zipfile import ZipFile, ZIP_DEFLATED

currentdate = datetime.now().strftime("%Y%m%d")
directory = pathlib.Path("s3://mys3bucket/folder1/" + currentdate + "/")
zippy = "s3://mys3bucket/folder1/" + currentdate + "-abc.zip"

with ZipFile(zippy, "w", ZIP_DEFLATED) as archive:
    for file_path in directory.rglob("*csv"):
        archive.write(file_path, arcname=file_path.relative_to(directory))
It fails with the error below.
FileNotFoundError: [Errno 2] No such file or directory: 's3://mys3bucket/folder1/20220420-abc.zip'
Can anyone advise how to solve this error?
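For what it's worth, zipfile and pathlib only understand local filesystem paths, not s3:// URLs, which is why opening the .zip fails. A minimal sketch of one workaround, assuming boto3 is available in the Glue job and reusing the bucket and prefix names from the question: stream the CSV objects out of S3, build the archive in memory, then upload the result.

import io
import zipfile
from datetime import datetime

import boto3

s3 = boto3.client("s3")
bucket = "mys3bucket"
currentdate = datetime.now().strftime("%Y%m%d")
prefix = "folder1/" + currentdate + "/"

# Build the archive in memory, streaming each CSV object out of S3.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith(".csv"):
                body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
                archive.writestr(key[len(prefix):], body)  # name relative to the date prefix

# Upload the finished zip back to S3.
buffer.seek(0)
s3.put_object(Bucket=bucket, Key="folder1/" + currentdate + "-abc.zip", Body=buffer.getvalue())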

Related

How to upload folder from local to GCP bucket using python

I am following this link and getting an error:
How to upload folder on Google Cloud Storage using Python API
I have saved a model in the container environment and from there I want to copy it to a GCP bucket.
Here is my code:
storage_client = storage.Client(project='*****')

def upload_local_directory_to_gcs(local_path, bucket, gcs_path):
    bucket = storage_client.bucket(bucket)
    assert os.path.isdir(local_path)
    for local_file in glob.glob(local_path + '/**'):
        print(local_file)
        print("this is bucket", bucket)
        blob = bucket.blob(gcs_path)
        print("here")
        blob.upload_from_filename(local_file)
        print("done")

path = "/pythonPackage/trainer/model_mlm_demo"  # this is local absolute path where my folder is. Folder name is **model_mlm_demo**
buc = "py*****"  # this is my GCP bucket address
gcs = "model_mlm_demo2/"  # this is the new folder that I want to store files in GCP
upload_local_directory_to_gcs(local_path=path, bucket=buc, gcs_path=gcs)
/pythonPackage/trainer/model_mlm_demo has 3 files in it: config, model.bin and arguments.bin
ERROR
The code doesn't throw any error, but no files are uploaded to the GCP bucket. It just creates an empty folder.
From what I can see, you don't need to pass the gs:// prefix as the bucket parameter. Here is an example you may want to check out:
https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The path to your file to upload
    # source_file_name = "local/path/to/file"
    # The ID of your GCS object
    # destination_blob_name = "storage-object-name"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    blob.upload_from_filename(source_file_name)

    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )
I have reproduced your issue and the code snippet below works fine. I have updated the code based on the folders and names you mentioned in the question. Let me know if you have any issues.
import os
import glob
from google.cloud import storage

storage_client = storage.Client(project='')

def upload_local_directory_to_gcs(local_path, bucket, gcs_path):
    bucket = storage_client.bucket(bucket)
    assert os.path.isdir(local_path)
    for local_file in glob.glob(local_path + '/**'):
        print(local_file)
        print("this is bucket", bucket)
        filename = local_file.split('/')[-1]
        blob = bucket.blob(gcs_path + filename)
        print("here")
        blob.upload_from_filename(local_file)
        print("done")

# this is local absolute path where my folder is. Folder name is **model_mlm_demo**
path = "/pythonPackage/trainer/model_mlm_demo"
buc = "py*****"  # this is my GCP bucket address
gcs = "model_mlm_demo2/"  # this is the new folder that I want to store files in GCP
upload_local_directory_to_gcs(local_path=path, bucket=buc, gcs_path=gcs)
I just came across the gcsfs library, which also seems to offer a nicer interface for this.
You could copy an entire directory into a GCS location like this:
import gcsfs

def upload_to_gcs(src_dir: str, gcs_dst: str):
    fs = gcsfs.GCSFileSystem()
    fs.put(src_dir, gcs_dst, recursive=True)
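A possible call using the paths from the question (the exact destination layout under model_mlm_demo2 is an assumption):

upload_to_gcs("/pythonPackage/trainer/model_mlm_demo", "py*****/model_mlm_demo2")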
I figured out a way using subprocess to upload the model artefacts to the GCP bucket.
import subprocess
subprocess.call('gsutil cp -r source_folder_in_local gs://*****/folder_name', shell=True, stdout=subprocess.PIPE)
If gsutil is not installed, you can install it using this link:
https://cloud.google.com/storage/docs/gsutil_install

FileNotFoundError when using shutil.copy()

I want to copy the source file to the destination folder.
In the destination folder, I am creating n folders (which may be nested to any depth) and I want to paste file1.pdf into a random folder inside the destination folder.
I have written the following code:
import shutil

destination_folder = "path_to_the_destination_folder"
source_file = "path_to_source_file\file1.pdf"
destination_file = "f{destination_folder}\any_random_folder_from_n_nested_folders\file1.pdf"
new_file = shutil.copy(source_file, destination_file)
print(new_file)
FYI: "destination_folder\any_random_folders_from_n_nested_folders" This path is present, means it is getting created successfully , checked using os.path.isdir()
And facing this issue for some files, not for all files..
But it is giving me an error:
FileNotFoundError: [Errno 2] No such file or directory:
  File "\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 420, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "\AppData\Local\Programs\Python\Python39\lib\shutil.py", line 265, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory
I believe you have incorrectly specified the destination folder; try this:
import shutil

destination_folder = r"path_to_the_destination_folder"
source_file = r"path_to_source_file\file1.pdf"
destination = rf"{destination_folder}\any_random_folders_from_n_nested_folders\file1.pdf"
dest = shutil.copy(source_file, destination)
print(dest)
If this doesn't work, then check if all the directories exist and also check the permissions.
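If the nested folder might not exist yet, here is a small sketch (paths are the placeholders from the question) that creates the missing parent directories before copying:

import os
import shutil

source_file = r"path_to_source_file\file1.pdf"
destination = r"path_to_the_destination_folder\any_random_folders_from_n_nested_folders\file1.pdf"

# Create any missing parent folders before copying.
os.makedirs(os.path.dirname(destination), exist_ok=True)
dest = shutil.copy(source_file, destination)
print(dest)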
This issue is resolved.
I just changed the folder names which I was creating according to user input.
Previous folder names: f"Fld+{datetimestamp}"
Current folder names: f"Fld+{folder_counter}"

ipywidgets - widgets.FileUpload: upload a CSV file and read the CSV file

I am using JupyterHub and hosting the .ipynb file on a server. My use case is to upload a CSV file from the local drive and read it for other dataframe tasks.
uploader = widgets.FileUpload(
    accept='*.csv',  # Accepted file extension e.g. '.txt', '.pdf', 'image/*', 'image/*,.pdf'
    multiple=False   # True to accept multiple files upload else False
)
display(uploader)

[input_file] = uploader.value
print(input_file)
pd.read_csv(input_file)
print(input_file) prints Test.csv, which is the CSV file name.
I am able to print [input_file], but pd.read_csv(input_file) throws the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'Test.csv'
I am not sure where the CSV is uploaded or how I can read that data. Please help.
I don't have your exact ipywidgets version, but can you try this:
import io
import pandas as pd

input_file = list(uploader.value.values())[0]
content = input_file['content']
content = io.StringIO(content.decode('utf-8'))
df = pd.read_csv(content)
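The reason pd.read_csv('Test.csv') fails is that FileUpload keeps the uploaded bytes in uploader.value; nothing is written to the server's disk. A sketch, assuming the same dict-shaped uploader.value as in the snippet above, that persists the upload and then reads it by name:

import pandas as pd

# Write each uploaded file's bytes to disk, then read the CSV by its name.
for filename, upload in uploader.value.items():
    with open(filename, 'wb') as f:
        f.write(upload['content'])

df = pd.read_csv('Test.csv')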

How to get csv file name in amazon s3 bucket using python?

I am trying to extract the CSV file name from a directory in an Amazon S3 bucket, but it is not working. Can you please help me with how to do this?
example:
s3://itx-acm-medaff-dev-sourcefiles/Raw_Layer/PCYC_IMBRUVICA/PCYC_VOC_Data_Load.csv
expected result:
PCYC_VOC_Data_Load.csv
Using the ls command we can list the files in the S3 bucket directory.
import os

file = os.popen('aws s3 ls s3://itx-acm-medaff-dev-sourcefiles/Raw_Layer/PCYC_IMBRUVICA/').read()
print(file)

list_file = list(file.split("\n"))
s = str(list_file[0])
s1 = s.split(' ')
print(s1[-1])
output:
'PCYC_VOC_Data_Load.csv'
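If you would rather not shell out to the AWS CLI, here is a sketch with boto3 (bucket and prefix taken from the question) that lists the keys and keeps only the CSV file names:

import boto3

s3 = boto3.client('s3')
response = s3.list_objects_v2(
    Bucket='itx-acm-medaff-dev-sourcefiles',
    Prefix='Raw_Layer/PCYC_IMBRUVICA/',
)
csv_names = [obj['Key'].split('/')[-1]
             for obj in response.get('Contents', [])
             if obj['Key'].endswith('.csv')]
print(csv_names)  # e.g. ['PCYC_VOC_Data_Load.csv']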

Upload Gzip file using Boto3

I am trying to upload files to S3, and before uploading I am trying to gzip them. As you can see in the code below, the files uploaded to S3 have no change in size, so I am trying to figure out if I have missed something.
import gzip
import shutil
from io import BytesIO

def upload_gzipped(bucket, key, fp, compressed_fp=None, content_type='text/plain'):
    """Compress and upload the contents from fp to S3.

    If compressed_fp is None, the compression is performed in memory.
    """
    if not compressed_fp:
        compressed_fp = BytesIO()
    with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
        shutil.copyfileobj(fp, gz)
    compressed_fp.seek(0)
    bucket.upload_fileobj(
        compressed_fp,
        key,
        {'ContentType': content_type, 'ContentEncoding': 'gzip'})
Courtesy link for the source.
And this is how I am using this function: basically reading files as a stream from SFTP, then trying to gzip them and write them to S3.
with pysftp.Connection(host_name, username=user, password=password, cnopts=cnopts, port=int(port)) as sftp:
    list_of_files = sftp.listdir('{}{}'.format(base_path, file_path))
    is_file_found = False
    for file_name in list_of_files:
        if entity_name in str(file_name.lower()):
            is_file_found = True
            flo = BytesIO()
            # Step 1: Read file using SFTP as an input stream
            sftp.getfo('{}{}/{}'.format(base_path, file_path, file_name), flo)
            s3_destination_key = '{}/{}'.format(s3_path, file_name)
            # Step 2: Write files to destination S3
            logger.info('Moving file to S3 {} '.format(s3_destination_key))
            # Creating a bucket resource to use bucket object for file upload
            input_bucket_object = S3.Bucket(environment_config['S3_INBOX_BUCKET'])
            flo.seek(0)
            upload_gzipped(input_bucket_object, s3_destination_key, flo)
It seems like the upload_gzipped function uses shutil.copyfileobj incorrectly.
Looking at https://docs.python.org/3/library/shutil.html#shutil.copyfileobj shows that you put the source first, and destination second.
Also, you're just writing your object to a gzipped object without ever actually compressing it.
You need to compress fp into a Gzip object, then upload that specific object to S3.
I'd recommend not using that gist from GitHub, as it seems wrong.
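A minimal sketch of the approach described above, assuming a boto3 Bucket resource like the one in the question: compress the bytes read from the source stream first, then upload that compressed payload.

import gzip
from io import BytesIO

def upload_gzipped(bucket, key, fp, content_type='text/plain'):
    """Gzip the bytes read from fp and upload the compressed payload to S3."""
    fp.seek(0)
    compressed_fp = BytesIO(gzip.compress(fp.read()))  # actually gzip the data
    bucket.upload_fileobj(
        compressed_fp,
        key,
        ExtraArgs={'ContentType': content_type, 'ContentEncoding': 'gzip'},
    )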
