How to get the CSV file name in an Amazon S3 bucket using Python?

I am trying to extract the CSV file name from a directory in an Amazon S3 bucket, but it is not working. Can you please help me with how to do this?
Example:
s3://itx-acm-medaff-dev-sourcefiles/Raw_Layer/PCYC_IMBRUVICA/PCYC_VOC_Data_Load.csv
Expected result:
PCYC_VOC_Data_Load.csv

Using the aws s3 ls command we can list the files in the S3 bucket directory:
import os

file = os.popen('aws s3 ls s3://itx-acm-medaff-dev-sourcefiles/Raw_Layer/PCYC_IMBRUVICA/').read()
print(file)
list_file = file.split("\n")   # one entry per line of the listing
s = list_file[0]               # first line: date, time, size, file name
s1 = s.split(' ')
print(s1[-1])                  # the file name is the last field
Output:
'PCYC_VOC_Data_Load.csv'
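A minimal sketch of the same listing done directly with boto3 instead of shelling out to the AWS CLI (the bucket and prefix are the ones from the question, and it assumes AWS credentials are already configured):
import boto3

s3 = boto3.client('s3')
response = s3.list_objects_v2(
    Bucket='itx-acm-medaff-dev-sourcefiles',
    Prefix='Raw_Layer/PCYC_IMBRUVICA/'
)
for obj in response.get('Contents', []):
    file_name = obj['Key'].split('/')[-1]   # e.g. PCYC_VOC_Data_Load.csv
    if file_name.endswith('.csv'):
        print(file_name)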

Related

Not able to copy file from S3 bucket with special character in name

I have to copy a file from one S3 bucket to another. The file name contains some special characters, which prevent it from appearing as a single file name. While fetching the list from the bucket we are able to get the file name, but while copying we get a "file not found" error (while copying, it tries to read the file name but, due to the special characters, is not able to read it).
Code I am using:
import os
import boto3

def copy_object():
    s3 = boto3.client('s3', region_name='us-east-1')
    response = s3.list_objects_v2(Bucket=os.environ.get('bucket'), Prefix='v2/abc/date='+date)
    for s3_objects in response["Contents"]:
        key = str(s3_objects["Key"]).split("/")[4]  # this will give the file name
        print(key)
        copy_source = {
            'Bucket': os.environ.get('bucket'),
            'Key': key
        }
        s3_dist = boto3.resource(
            service_name=os.environ.get('serviceName'),
            region_name=os.environ.get('regionName'),
            aws_access_key_id=os.environ.get('awsAccessKeyId'),
            aws_secret_access_key=os.environ.get('awsSecretAccessKey')
        )
        ack = s3_dist.meta.client.copy(copy_source, 'dest_bucket', file_to_copy)
Filename: Stat#2022-12-28#nameflyer##_1672185644223109701__i-008f78f00fd9d9bfb.parquet
Error I get while copying the file:
{'Error': {'Code': '404', 'Message': 'Not Found'},
How can we read and copy a file in an S3 bucket whose name contains special characters, or is there a way to omit the special characters and read the file in the bucket?
You have to add your Prefix to the key name:
key = 'v2/abc/date=' + date + '/' + key
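Equivalently (a sketch, not from the original thread), you can keep the full object key from the listing for the copy source and only strip the prefix for the destination name; file_name below stands in for the question's undefined file_to_copy:
# Sketch: keep the full key from the listing for the copy source.
for s3_object in response["Contents"]:
    full_key = s3_object["Key"]            # e.g. 'v2/abc/date=.../Stat#2022-12-28#...parquet'
    file_name = full_key.split("/")[-1]    # file name only, used as the destination key
    copy_source = {
        'Bucket': os.environ.get('bucket'),
        'Key': full_key                    # full key, so S3 can actually locate the object
    }
    s3_dist.meta.client.copy(copy_source, 'dest_bucket', file_name)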

Compress multiple files with zip in AWS Glue

I'm trying to compress multiple CSV files from a folder in an S3 bucket using AWS Glue. I have the script below.
import pathlib
from datetime import datetime
from zipfile import ZipFile, ZIP_DEFLATED

currentdate = datetime.now().strftime("%Y%m%d")
directory = pathlib.Path("s3://mys3bucket/folder1/" + currentdate + "/")
zippy = "s3://mys3bucket/folder1/" + currentdate + "-abc.zip"
with ZipFile(zippy, "w", ZIP_DEFLATED) as archive:
    for file_path in directory.rglob("*csv"):
        archive.write(file_path, arcname=file_path.relative_to(directory))
It ends with the error below.
FileNotFoundError: [Errno 2] No such file or directory: 's3://mys3bucket/folder1/20220420-abc.zip'
Can anyone advise how to solve this error?
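One thing worth noting (not from the thread): pathlib.Path and ZipFile only understand local file paths, not s3:// URLs, which is why the script cannot find the file. A rough sketch of a workaround that downloads the CSVs with boto3, zips them in the job's local /tmp scratch space (an assumption about the environment), and uploads the archive back, using the bucket and folder names from the question:
import pathlib
from datetime import datetime
from zipfile import ZipFile, ZIP_DEFLATED

import boto3

s3 = boto3.client("s3")
currentdate = datetime.now().strftime("%Y%m%d")
prefix = "folder1/" + currentdate + "/"

# Work in local scratch space; ZipFile and pathlib cannot open s3:// URLs directly.
local_dir = pathlib.Path("/tmp") / currentdate
local_dir.mkdir(parents=True, exist_ok=True)
zip_path = pathlib.Path("/tmp") / (currentdate + "-abc.zip")

# Step 1: download every CSV under the prefix.
response = s3.list_objects_v2(Bucket="mys3bucket", Prefix=prefix)
for obj in response.get("Contents", []):
    if obj["Key"].endswith(".csv"):
        target = local_dir / obj["Key"].split("/")[-1]
        s3.download_file("mys3bucket", obj["Key"], str(target))

# Step 2: zip locally, then upload the archive back to S3.
with ZipFile(zip_path, "w", ZIP_DEFLATED) as archive:
    for file_path in local_dir.rglob("*.csv"):
        archive.write(file_path, arcname=file_path.relative_to(local_dir))
s3.upload_file(str(zip_path), "mys3bucket", "folder1/" + currentdate + "-abc.zip")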

How to upload folder from local to GCP bucket using python

I am following this link and getting an error:
How to upload folder on Google Cloud Storage using Python API
I have saved a model in the container environment and from there I want to copy it to a GCP bucket.
Here is my code:
import os
import glob
from google.cloud import storage

storage_client = storage.Client(project='*****')

def upload_local_directory_to_gcs(local_path, bucket, gcs_path):
    bucket = storage_client.bucket(bucket)
    assert os.path.isdir(local_path)
    for local_file in glob.glob(local_path + '/**'):
        print(local_file)
        print("this is bucket", bucket)
        blob = bucket.blob(gcs_path)
        print("here")
        blob.upload_from_filename(local_file)
        print("done")

path = "/pythonPackage/trainer/model_mlm_demo"  # this is the local absolute path where my folder is. Folder name is **model_mlm_demo**
buc = "py*****"  # this is my GCP bucket address
gcs = "model_mlm_demo2/"  # this is the new folder that I want to store files in GCP
upload_local_directory_to_gcs(local_path=path, bucket=buc, gcs_path=gcs)
/pythonPackage/trainer/model_mlm_demo has 3 files in it: config, model.bin and arguments.bin.
ERROR
The code doesn't throw any error, but no files are uploaded to the GCP bucket. It just creates an empty folder.
From what I can see, the issue is that you don't need to pass the gs:// prefix as the bucket parameter. Here is an example you may want to check out:
https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The path to your file to upload
    # source_file_name = "local/path/to/file"
    # The ID of your GCS object
    # destination_blob_name = "storage-object-name"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )
I have reproduced your issue and the code snippet below works fine. I have updated the code based on the folders and names you mentioned in the question. Let me know if you have any issues.
import os
import glob
from google.cloud import storage

storage_client = storage.Client(project='')

def upload_local_directory_to_gcs(local_path, bucket, gcs_path):
    bucket = storage_client.bucket(bucket)
    assert os.path.isdir(local_path)
    for local_file in glob.glob(local_path + '/**'):
        print(local_file)
        print("this is bucket", bucket)
        filename = local_file.split('/')[-1]
        blob = bucket.blob(gcs_path + filename)
        print("here")
        blob.upload_from_filename(local_file)
        print("done")

# this is the local absolute path where my folder is. Folder name is **model_mlm_demo**
path = "/pythonPackage/trainer/model_mlm_demo"
buc = "py*****"  # this is my GCP bucket address
gcs = "model_mlm_demo2/"  # this is the new folder that I want to store files in GCP
upload_local_directory_to_gcs(local_path=path, bucket=buc, gcs_path=gcs)
I just came across the gcsfs library, which also seems to offer a nicer interface.
You could copy an entire directory into a GCS location like this:
import gcsfs

def upload_to_gcs(src_dir: str, gcs_dst: str):
    fs = gcsfs.GCSFileSystem()
    fs.put(src_dir, gcs_dst, recursive=True)
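For example, with the folder and (masked) bucket name from the question:
upload_to_gcs("/pythonPackage/trainer/model_mlm_demo", "py*****/model_mlm_demo2")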
I figured out a way using subprocess to upload model artefacts to a GCP bucket.
import subprocess
subprocess.call('gsutil cp -r source_folder_in_local gs://*****/folder_name', shell=True, stdout=subprocess.PIPE)
If gsutil is not installed, you can install it using this link:
https://cloud.google.com/storage/docs/gsutil_install

ipywidgets - widgets.FileUpload: upload a CSV file and read the CSV file

I am using JupyterHub, and the .ipynb file is hosted on a server. My use case is to upload a CSV file from the local drive and read it for other DataFrame tasks.
uploader = widgets.FileUpload(
    accept='*.csv',  # Accepted file extension e.g. '.txt', '.pdf', 'image/*', 'image/*,.pdf'
    multiple=False   # True to accept multiple files upload else False
)
display(uploader)
[input_file] = uploader.value
print(input_file)
pd.read_csv(input_file)
print(input_file) prints Test.csv, which is the CSV file name.
I am able to print input_file, but pd.read_csv(input_file) throws the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'Test.csv'
I am not sure where the CSV is uploaded; how can I read that data? Please help.
I don't have your exact ipywidgets version, but can you try this:
import io
import pandas as pd

input_file = list(uploader.value.values())[0]
content = input_file['content']
content = io.StringIO(content.decode('utf-8'))
df = pd.read_csv(content)
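If that fails because uploader.value is a tuple rather than a dict, you are probably on ipywidgets 8.x; a hedged sketch for that case (the 8.x value layout is an assumption about your version):
import io
import pandas as pd

# In ipywidgets 8.x, uploader.value is a tuple of dicts; 'content' is a memoryview.
input_file = uploader.value[0]
df = pd.read_csv(io.BytesIO(input_file['content']))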

Upload Gzip file using Boto3

I am trying to gzip files before uploading them to S3. If you look at the code below, the files uploaded to S3 show no change in size, so I am trying to figure out whether I have missed something.
import gzip
import shutil
from io import BytesIO

def upload_gzipped(bucket, key, fp, compressed_fp=None, content_type='text/plain'):
    """Compress and upload the contents from fp to S3.
    If compressed_fp is None, the compression is performed in memory.
    """
    if not compressed_fp:
        compressed_fp = BytesIO()
    with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
        shutil.copyfileobj(fp, gz)
    compressed_fp.seek(0)
    bucket.upload_fileobj(
        compressed_fp,
        key,
        {'ContentType': content_type, 'ContentEncoding': 'gzip'})
Courtesy Link for the source
And this is how I am using this function: basically reading files as a stream from SFTP, then trying to gzip them and write them to S3.
with pysftp.Connection(host_name, username=user, password=password, cnopts=cnopts, port=int(port)) as sftp:
    list_of_files = sftp.listdir('{}{}'.format(base_path, file_path))
    is_file_found = False
    for file_name in list_of_files:
        if entity_name in str(file_name.lower()):
            is_file_found = True
            flo = BytesIO()
            # Step 1: Read the file using SFTP as an input stream
            sftp.getfo('{}{}/{}'.format(base_path, file_path, file_name), flo)
            s3_destination_key = '{}/{}'.format(s3_path, file_name)
            # Step 2: Write the file to the destination S3 bucket
            logger.info('Moving file to S3 {} '.format(s3_destination_key))
            # Create a bucket resource to use the bucket object for the file upload
            input_bucket_object = S3.Bucket(environment_config['S3_INBOX_BUCKET'])
            flo.seek(0)
            upload_gzipped(input_bucket_object, s3_destination_key, flo)
It seems like the upload_gzipped function uses shutil.copyfileobj incorrectly.
Looking at https://docs.python.org/3/library/shutil.html#shutil.copyfileobj shows that you put the source first, and destination second.
Also, you're just writing your object to a gzipped object without ever actually compressing it.
You need to compress fp into a Gzip object, then upload that specific object to S3.
I'd recommend not using that gist from GitHub, as it seems wrong.
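A rough sketch of what this answer describes, making the compression step explicit before uploading; it uses put_object on the bucket resource and is not the original gist (the function name here is just for illustration):
import gzip

def upload_gzipped_explicit(bucket, key, fp, content_type='text/plain'):
    """Gzip the contents of fp in memory, then upload the compressed bytes to S3."""
    fp.seek(0)
    compressed = gzip.compress(fp.read())  # explicit compression step
    bucket.put_object(
        Key=key,
        Body=compressed,
        ContentType=content_type,
        ContentEncoding='gzip',
    )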
