Read file metadata along with data in Python - python-3.x

I am using Python 3.7 along with a library for AES CBC on Windows 10 to encrypt files and it works perfectly. Except, after decrypting them, they lose their metadata like the date they were created. Because I want the user to feel like they never 'deleted' or 'lost' the original file, I need to preserve that data.
This is what I'm doing to read the data:
with open(file_name, "rb") as f:
    data = f.read()
After I encrypt the data, I write the encrypted bytes into a new file. When I decrypt this new file, I would like the metadata to be preserved so that the file (an image, for example) is exactly as it was before encryption. (P.S. Overwriting the original file with the new data might help, but I want to avoid that if possible.)
How do I include metadata of the file in a variable WITH the data that I am encrypting, such that when I decrypt, I get the exact same file with the same date created etc.?
EDIT:
I found a way to get the file creation time, but I STILL NEED to get all the metadata, since the file can be in any format: for example an image or a video, or a doc file with its author. I also want to store this metadata in the decrypted file, which I don't know how to do.
os.path.getctime(file_name)
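One possible approach, sketched below under some assumptions: metadata embedded in the file itself (EXIF tags in an image, the author field of a document) is part of the file's bytes, so a byte-exact encrypt/decrypt round trip already preserves it. Filesystem timestamps are stored separately, so they can be packed in front of the data before encryption and reapplied after decryption. The helper names pack_with_metadata and unpack_with_metadata and the header layout are hypothetical, not part of any library. os.utime restores only the access and modification times; restoring the creation time on Windows would need a Win32 call such as SetFileTime, which this sketch does not attempt.

```python
import json
import os
import struct
import tempfile


def pack_with_metadata(path):
    """Return the file's bytes prefixed with a small JSON header of its timestamps."""
    st = os.stat(path)
    header = json.dumps({
        "name": os.path.basename(path),
        "atime_ns": st.st_atime_ns,
        "mtime_ns": st.st_mtime_ns,
        # Creation time on Windows; recorded here but not restored by os.utime.
        "ctime_ns": st.st_ctime_ns,
    }).encode("utf-8")
    with open(path, "rb") as f:
        data = f.read()
    # Layout (an assumption of this sketch): 4-byte length, JSON header, raw bytes.
    return struct.pack(">I", len(header)) + header + data


def unpack_with_metadata(blob, out_path):
    """Write the original bytes to out_path and reapply access/modification times."""
    (header_len,) = struct.unpack(">I", blob[:4])
    meta = json.loads(blob[4:4 + header_len].decode("utf-8"))
    with open(out_path, "wb") as f:
        f.write(blob[4 + header_len:])
    os.utime(out_path, ns=(meta["atime_ns"], meta["mtime_ns"]))
    return meta


# Round-trip demo on a temporary file; encryption and decryption would act on `blob`.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "original.bin")
    dst = os.path.join(tmp, "restored.bin")
    with open(src, "wb") as f:
        f.write(b"example bytes")
    os.utime(src, ns=(1_500_000_000 * 10**9, 1_500_000_000 * 10**9))
    blob = pack_with_metadata(src)
    meta = unpack_with_metadata(blob, dst)
    restored_mtime_ns = os.stat(dst).st_mtime_ns
    with open(dst, "rb") as f:
        restored_data = f.read()
```

In real use, `blob` is what would be passed to the AES routine, and the decrypted bytes are what would be handed to unpack_with_metadata.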

Related

How can I convert a Pyspark dataframe to a CSV without sending it to a file?

I have a dataframe which I need to convert to a CSV file, and then I need to send this CSV to an API. As I'm sending it to an API, I do not want to save it to the local filesystem and need to keep it in memory. How can I do this?
Easy way: convert your dataframe to Pandas dataframe with toPandas(), then save to a string. To save to a string, not a file, you'll have to call to_csv with path_or_buf=None. Then send the string in an API call.
From to_csv() documentation:
Parameters
path_or_buf : str or file handle, default None
File path or object, if None is provided the result is returned as a string.
So your code would likely look like this:
csv_string = df.toPandas().to_csv(path_or_buf=None)
Alternatives: use tempfile.SpooledTemporaryFile with a large buffer to create an in-memory file. Or you can even use a regular file, just make your buffer large enough and don't flush or close the file. Take a look at Corey Goldberg's explanation of why this works.
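A minimal sketch of the SpooledTemporaryFile alternative, using only the standard library. The rows list is a made-up stand-in for data pulled from the dataframe, and the 10 MB max_size is an arbitrary assumption; the file stays in memory until that many bytes have been written.

```python
import csv
import tempfile

# Hypothetical stand-in for rows collected from the dataframe.
rows = [["id", "name"], [1, "alice"], [2, "bob"]]

# Text-mode spooled file: held in memory unless it grows past max_size.
with tempfile.SpooledTemporaryFile(
    max_size=10 * 1024 * 1024, mode="w+", newline=""
) as buf:
    csv.writer(buf).writerows(rows)
    buf.seek(0)  # rewind before reading back what was written
    csv_string = buf.read()
```

The resulting csv_string can then be sent as the body of the API request without touching the local filesystem.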

nodejs write files which do not open

let dataCer = '0�\u0007\u00060�\u0006��\u0003\u0002\u0001\u0002\u0002\u0010Q��\u0000����K��Z�Q��0\n\u0006\b*�\u0003\u0007\u0001\u0001\u0003\u00020�\u0001l1\u001e0\u001c\u0006\.............'
fs.writeFile('111.cer', dataCer);
let dataPdf = '%PDF-1.4\r\n1 0 obj\r\n<< \r\n/Length 9947\r\n/Filter /FlateDecode\r\n>>\r\nstream\r\nX��]�n#9p}���\u000f���\u0005\b\u0002X��<\'X \u001f�\u001b\u0010 \u0001���H�,6�R�Z�\u0014�N`�\n�T�t�ڼT\u0015���?ԋz��_�{IN_Bz�����O.............'
fs.writeFile('111.pdf', dataPdf);
The data dataCer and dataPdf I get from the application using the GET requests. I can only get this data in this encoding.
And now I need to save them as files.
Also, I will need to then save any data to the file in the same way (zip, rar, png, jpeg, ...).
When I use fs.writeFile, I get files that do not open.
fs.writeFile does not preserve the original data, and ignoring the encoding does not give me the desired result.
Please tell me how to get around this error?
Or which library can save data to any file in node.js, while ignoring the encoding?

Read n rows from csv in Google Cloud Storage to use with Python csv module

I have a variety of very large (~4GB each) csv files that contain different formats. These come from data recorders from over 10 different manufacturers. I am attempting to consolidate all of these into BigQuery. In order to load these up on a daily basis I want to first load these files into Cloud Storage, determine the schema, and then load into BigQuery. Due to the fact that some of the files have additional header information (from 2 - ~30 lines) I have produced my own functions to determine the most likely header row and the schema from a sample of each file (~100 lines), which I can then use in the job_config when loading the files to BQ.
This works fine when I am working with files from local storage direct to BQ as I can use a context manager and then Python's csv module, specifically the Sniffer and reader objects. However, there does not seem to be an equivalent method of using a context manager direct from Storage. I do not want to bypass Cloud Storage in case any of these files are interrupted when loading into BQ.
What I can get to work:
# initialise variables
with open(csv_file, newline='', encoding=encoding) as datafile:
    dialect = csv.Sniffer().sniff(datafile.read(chunk_size))
    # Rewind so the reader starts at the first line, not after the sniffed chunk
    datafile.seek(0)
    reader = csv.reader(datafile, dialect)
    sample_rows = []
    row_num = 0
    for row in reader:
        sample_rows.append(row)
        row_num += 1
        if row_num > 100:
            break
sample_rows
# Carry out schema and header investigation...
With Google Cloud Storage I have attempted to use download_as_string and download_to_file, which provide the data as binary objects, but I cannot get the csv module to work with either. I tried .decode('utf-8'), which returns one long string containing \r\n sequences. I then used splitlines() to get a list of lines, but the csv functions still give me a dialect and reader that split the data into single characters, one per entry.
Has anyone managed to get a work around to use the csv module with files stored in Cloud Storage without downloading the whole file?
After having a look at the csv source code on GitHub, I managed to solve this problem with Python's io and csv modules. The two key classes are io.BytesIO and io.TextIOWrapper. It is probably not a common use case, but I thought I would post the answer here to save time for anyone who needs it.
# Set up storage client and create a blob object from csv file that you are trying to read from GCS.
content = blob.download_as_string(start = 0, end = 10240) # Read a chunk of bytes that will include all header data and the recorded data itself.
bytes_buffer = io.BytesIO(content)
wrapped_text = io.TextIOWrapper(bytes_buffer, encoding = encoding, newline = newline)
dialect = csv.Sniffer().sniff(wrapped_text.read())
wrapped_text.seek(0)
reader = csv.reader(wrapped_text, dialect)
# Do what you will with the reader object
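The same technique can be tried without a storage client. The sketch below uses a bytes literal as a stand-in for the downloaded chunk and a hypothetical helper, sample_rows_from_bytes. It also drops the trailing partial line that a byte-range download usually ends with; this assumes an ASCII-compatible encoding such as UTF-8, where a newline byte never falls inside a multi-byte character.

```python
import csv
import io
from itertools import islice


def sample_rows_from_bytes(chunk, encoding="utf-8", max_rows=100):
    """Sniff the dialect of a partial CSV download and return its first rows."""
    # Cut the chunk back to the last complete line.
    last_newline = chunk.rfind(b"\n")
    if last_newline != -1:
        chunk = chunk[: last_newline + 1]

    # newline="" leaves line endings untouched so the csv module can handle them.
    text = io.TextIOWrapper(io.BytesIO(chunk), encoding=encoding, newline="")
    dialect = csv.Sniffer().sniff(text.read())
    text.seek(0)
    return list(islice(csv.reader(text, dialect), max_rows))


# Stand-in for a byte-range download; the final line is deliberately cut off.
chunk = b"time,temp,status\r\n1,20.5,ok\r\n2,20.7,ok\r\n3,21.0,ok\r\n4,21"
rows = sample_rows_from_bytes(chunk)
```

Here rows holds the header and the three complete data rows; the truncated fourth data row is discarded before sniffing.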

Use images in s3 with SageMaker without .lst files

I am trying to create (what I thought was) a simple image classification pipeline between s3 and SageMaker.
Images are stored in an s3 bucket with their class labels in their file names currently, e.g.
My-s3-bucket-dir
cat-1.jpg
dog-1.jpg
cat-2.jpg
..
I've been trying to leverage several related example .py scripts, but most seem to download data sets already in .rec format or to contain special manifest or annotation files I don't have.
All I want is to pass the images from s3 to the SageMaker image classification algorithm that's located in the same region, IAM account, etc. I suppose this means I need a .lst file.
When I create the .lst manually, SageMaker doesn't seem to accept it, and the manual work takes too long to be good practice.
How can I automatically generate the .lst file (or otherwise send the images/classes for training)?
Things I read made it sound like im2rec.py was a solution, but I don't see how. The example I'm working with now is
Image-classification-fulltraining-highlevel.ipynb
but it seems to download the data as .rec,
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
which just skips working with the .jpeg files. I found another that converts them to .rec but again it has essentially the .lst already as .json and just converts it.
I have mostly been working in a Python Jupyter notebook within the AWS console (in my browser) but I have also tried using their GUI.
How can I simply and automatically generate the .lst or otherwise get the data/class info into SageMaker without manually creating a .lst file?
Update
It looks like im2rec.py can't be run against s3. You'd have to download everything from all s3 buckets into the notebook's storage...
"Please note that [...] im2rec.py is running locally, therefore cannot take input from the S3 bucket. To generate the list file, you need to download the data and then use the im2rec tool." - AWS SageMaker Team
There are 3 options to provide annotated data to the Image Classification algo: (1) packing labels in recordIO files, (2) storing labels in a JSON manifest file ("augmented manifest" option), (3) storing labels in a list file. All options are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html.
The augmented manifest and .lst options are quick to set up, since each only requires an annotation file that a short for loop can usually generate. RecordIO requires the im2rec.py tool, which is a little more work.
For .lst files, you can write the annotations with a quick for loop, like this:
# assuming train_index, train_class, train_pics store the pic index, class and path
with open('train.lst', 'a') as file:
    for index, cl, pic in zip(train_index, train_class, train_pics):
        file.write(str(index) + '\t' + str(cl) + '\t' + pic + '\n')
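For file names like cat-1.jpg, the three lists can be derived from the names themselves. The sketch below assumes the class is everything before the last hyphen in the file name and that class indices are assigned alphabetically starting at 0; build_lst_lines is a hypothetical helper, and the keys list stands in for the object keys you would get by listing the bucket.

```python
def label_of(key):
    """Class name taken from the file name: 'imgs/cat-1.jpg' -> 'cat'."""
    return key.rsplit("/", 1)[-1].rsplit("-", 1)[0]


def build_lst_lines(keys):
    """Return tab-separated .lst lines (index, class index, path) and the class map."""
    class_index = {name: i for i, name in enumerate(sorted({label_of(k) for k in keys}))}
    lines = [
        f"{index}\t{class_index[label_of(key)]}\t{key}"
        for index, key in enumerate(keys)
    ]
    return lines, class_index


# Stand-in for the keys returned by listing the bucket.
keys = ["cat-1.jpg", "dog-1.jpg", "cat-2.jpg"]
lines, class_index = build_lst_lines(keys)
lst_text = "\n".join(lines) + "\n"  # contents to write to train.lst
```

Only the object keys are needed to build the file, so the images themselves do not have to be downloaded for this step.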

Appending to a text file in S3

I know how to write and read from a file in S3 using boto. I'm wondering if there is a way to append to a file without having to download the file and re-upload an edited version?
There is no way to append data to an existing object in S3. You would have to grab the data locally, add the extra data, and then write it back to S3.
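A minimal sketch of that download, extend, re-upload cycle. It assumes a client object with boto3-style get_object and put_object methods (the question mentions boto, so the boto3 call style is an assumption); FakeS3Client is a hypothetical in-memory stand-in so the example runs without AWS credentials.

```python
import io


def append_to_object(s3_client, bucket, key, extra):
    """Emulate an append: download the object, add bytes, upload the result."""
    current = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3_client.put_object(Bucket=bucket, Key=key, Body=current + extra)
    return len(current) + len(extra)


class FakeS3Client:
    """Hypothetical in-memory stand-in for a real S3 client."""

    def __init__(self):
        self.objects = {}

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)])}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body


client = FakeS3Client()
client.put_object(Bucket="my-bucket", Key="log.txt", Body=b"line 1\n")
new_size = append_to_object(client, "my-bucket", "log.txt", b"line 2\n")
```

This read-modify-write cycle is not atomic: if two writers run it at the same time, one update can overwrite the other.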
