How can I rename all the existing CarrierWave uploaded files?

I have been using Carrierwave for file uploads for some time. I did not try to rename the files as they got uploaded. Now I want to give each file a random name and a file extension that's consistent with the content type. I read the wiki and other sites, and it was recommended that in the uploader, I could:
def filename
  "#{secure_token}.#{file.extension}" if original_filename.present?
end

private

def secure_token
  # e.g. memoize a random token per model instance (one common approach from the wiki)
  var = :"@#{mounted_as}_secure_token"
  model.instance_variable_get(var) or model.instance_variable_set(var, SecureRandom.uuid)
end
It worked fine for files uploaded after these additions to the uploader. But I have many files that were uploaded before this change. I was wondering if someone could tell me how to migrate the old files.
I tried adding a method to the uploader:
def rename_file!
  model.update_attribute mounted_as, "#{secure_token}.#{file.extension}"
  recreate_versions!
end
Then, in the Rails console, I tried calling this on a model with an attachment. On the file system, the various versions of the file were created with the new name, but the mounted_as column of the model never got updated. The log actually said the column was updated with the old value.
How can I get the mounted_as column on the model updated?
In addition, it seemed like the old files with the old names were still on the file system. Is there a way to remove them? I tried adding a line:
file.move_to File.join(File.dirname(file.path), "#{secure_token}.#{file.extension}")
in the rename_file! method. It renamed the files, but did not update the mounted_as column on the model. So accessing its URL resulted in a 404.

I know this is a little old now, but perhaps useful for others.
After updating your uploader with the filename method, like you have, you could run this from the Rails console:
Post.all.each do |p|
  p.avatar.recreate_versions!
  p.save!
end
In the current version of CarrierWave, this will both rename the file and update the model record.
Post, of course, is the model name and avatar is the column on which you are mounting the uploader, so change those as required.

Related

How to read the most recent Excel export into a Pandas dataframe without specifying the file name?

I frequent a real estate website that shows recent transactions, from which I download data to parse in a Pandas dataframe. Everything about this dataset remains identical every time I download it (regarding the column names, that is).
The name of the Excel output may change, though. For example, if I already have a few older "Generic_File" exports in my Downloads folder from previous exports, the newly exported file may read "Generic_File_(3)" or "Generic_File_(21)".
Ideally, I'd like my workflow to look like this: export this Excel file of real estate sales, then run a Python script to read in the most recent export as a Pandas dataframe. The catch is, I don't want to have to go in and change the filename in the script to match the appended number of the Excel export every time. I want the pd.read_excel method to simply read the "Generic_File" that is appended with the largest number (which will obviously correspond to the most recent export).
I suppose I could always just delete old exports out of my Downloads folder so the newest, freshest export is always named the same ("Generic_File", in this case), but I'm looking for a way to ensure I don't have to do this. Are wildcards the best path forward, or is there some other method to always read in the most recently downloaded Excel file from my Downloads folder?
I would use the os package and create a method to read the file names in the Downloads folder. By parsing the filename strings, you could then find the file that follows your specified format and has the highest copy number. Something like the following might help you get started.
import os
import re

downloads = os.listdir('C:/Users/[username here]/Downloads/')
files = [item for item in downloads if '.' in item]  # keep only names with an extension
# hypothetical pattern: "Generic_File.xlsx", "Generic_File_(3).xlsx", ... (Python 3.8+)
pattern = re.compile(r'^Generic_File(?:_\((\d+)\))?\.xlsx$')
matched = [(int(m.group(1) or 0), name) for name in files if (m := pattern.match(name))]
newest = max(matched)[1]  # the export with the highest copy number
Regex might be the best way to find matches if you have a diverse listing of files in your downloads folder.
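Alternatively, if you simply want the most recently created export rather than the highest copy number, you can let the file system do the work. A minimal sketch, assuming the exports all land in the same Downloads folder and share the "Generic_File" prefix (the folder and extension are assumptions to adapt):
import glob
import os
import pandas as pd

# Pick the most recently created "Generic_File*" export by timestamp
# (folder and pattern are assumptions; adapt to your setup).
folder = os.path.expanduser('~/Downloads')
candidates = glob.glob(os.path.join(folder, 'Generic_File*.xlsx'))
newest = max(candidates, key=os.path.getctime)
df = pd.read_excel(newest)
On Windows, os.path.getctime returns the creation time, which for a fresh download is effectively the download time.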

Google Drive File Stream files creation date

I know this has been asked a few times already, but not in this context, I think, and the other questions were asked a few years ago, so I'm hoping maybe something has changed.
So my issue is: I am uploading files to Google Drive using Google Drive File Stream. While the uploading goes smoothly, I have a problem with the files' creation date: it is always changed to the timestamp of the time the file got uploaded, not the actual, local file creation date. It is a serious problem, as I am going to use this to back up huge amounts of data and preserve all the metadata I can, and the creation date is crucial. Is there a way to either upload files with the creation date intact, or to change it after the upload? From what I've seen this seems not to be possible, but I have to try and make it work. Any help and insight will be appreciated. I'm using Drive File Stream with Python.
EDIT: I didn't make it clear enough - the issue here is that I don't want to use Google Drive API at all, but rather deal with this using only Google Drive File Stream interface if it's possible.
Create
If you check the documentation for files.create, you will find that the acceptable metadata for file creation does include a createdTime.
You should then just add this to the metadata you use when uploading the file. As you did not post your code, I have grabbed the standard example from the documentation and added the created time as follows.
# 'THETIME' must be an RFC 3339 timestamp, e.g. '2015-01-01T12:00:00.000Z'
file_metadata = {'name': 'photo.jpg', 'createdTime': 'THETIME'}
media = MediaFileUpload('files/photo.jpg', mimetype='image/jpeg')
file = drive_service.files().create(body=file_metadata,
                                    media_body=media,
                                    fields='id').execute()
print('File ID: %s' % file.get('id'))
Update
In the event that you want to update the files you have already created, you could use the following method.
If you check the documentation for files.create, you will find that the response is just a File resource.
If you check the File resource, you will see that createdTime is writable.
You should run a files.update call and reset createdTime to the proper time.
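Untested, but going by the docs' claim that createdTime is writable, a minimal sketch of that update call might look like this (the file ID and timestamp below are placeholders):
# Sketch: reset createdTime on an already-uploaded file
# (file_id and the timestamp are placeholders).
file_id = 'YOUR_FILE_ID'
updated = drive_service.files().update(
    fileId=file_id,
    body={'createdTime': '2015-01-01T12:00:00.000Z'},
    fields='id, createdTime'
).execute()
print('New createdTime: %s' % updated.get('createdTime'))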

Autorenaming duplicate filename downloads in chrome/puppeteer/ubuntu

I'm downloading pdf files using headFULL chromium & puppeteer. I call a javascript function in the browser context and the download starts. The file name comes as is from the server. Issue: many of the files I download into a directory have the same names coming from the server, and Chrome, instead of auto-suffixing an index like (1) to the file, overwrites the existing one.
Since the file is downloaded by calling a JS function, and I have inspected the function as well, I don't have access to the pdf url. The download is triggered by the function call, and thus I have no control over the file names.
I have a list of the file names, but that in no way helps in changing the filename on the fly if a file with a duplicate name already exists on the machine.
Config: Ubuntu 18.04, Puppeteer 1.18.1
I know it's a config issue with either the Nautilus file manager or Chrome. Is it possible to configure either of the two?
I cannot find an option within nodejs to rename the file before it's downloaded. A workaround is to download each file into a temp folder, then move it to the required folder while checking whether it already exists, and rename it if so (sketched below). But it adds a lot of extra time. It would be great to have Chrome or Nautilus do the task.
Function which triggers the download:
await page.evaluate(
  (doc_index, arg1, arg2) => openDocument(String(doc_index), String(arg1), String(arg2), 'ABC', '', '', 'XYZ'),
  doc_index, arg1, arg2
)
Expected behaviour: When the above function is called and pdf starts downloading in the set folder, if a pdf of the same name exists, the new pdf should be renamed to something like pdf_name.pdf(1) or the like.
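For the temp-folder workaround mentioned above, the renaming step itself is simple; here is a minimal sketch of the logic in Python (the helper name and paths are hypothetical, and the same approach ports directly to nodejs):
import os
import shutil

def move_with_suffix(src_path, dest_dir):
    # Move a downloaded file out of the temp folder, appending (1), (2), ...
    # if the target name is already taken (hypothetical helper).
    base, ext = os.path.splitext(os.path.basename(src_path))
    candidate = os.path.join(dest_dir, base + ext)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(dest_dir, '%s (%d)%s' % (base, counter, ext))
        counter += 1
    shutil.move(src_path, candidate)
    return candidate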

Use images in s3 with SageMaker without .lst files

I am trying to create (what I thought was) a simple image classification pipeline between s3 and SageMaker.
Images are currently stored in an s3 bucket with their class labels in their file names, e.g.:
My-s3-bucket-dir
  cat-1.jpg
  dog-1.jpg
  cat-2.jpg
  ..
I've been trying to leverage several related example .py scripts, but most seem to download data sets that are already in .rec format or that contain special manifest or annotation files I don't have.
All I want is to pass the images from s3 to the SageMaker image classification algorithm that's located in the same region, IAM account, etc. I suppose this means I need a .lst file.
When I try to create the .lst manually, SageMaker doesn't seem to like it, and it also takes too long doing manual work to be good practice.
How can I automatically generate the .lst file (or otherwise send the images/classes for training)?
Things I read made it sound like im2rec.py was a solution, but I don't see how. The example I'm working with now is
Image-classification-fulltraining-highlevel.ipynb
but it seems to download the data as .rec,
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
which just skips working with the .jpeg files. I found another example that converts them to .rec, but again it essentially has the .lst already as .json and just converts it.
I have mostly been working in a Python Jupyter notebook within the AWS console (in my browser) but I have also tried using their GUI.
How can I simply and automatically generate the .lst or otherwise get the data/class info into SageMaker without manually creating a .lst file?
Update
It looks like im2rec.py can't be run against s3. You'd have to completely download everything from all s3 buckets into the notebook's storage...
Please note that [...] im2rec.py is running locally, therefore cannot take input from the S3 bucket. To generate the list file, you need to download the data and then use the im2rec tool. - AWS SageMaker Team
There are 3 options for providing annotated data to the Image Classification algorithm: (1) packing labels in recordIO files, (2) storing labels in a JSON manifest file (the "augmented manifest" option), (3) storing labels in a list file. All options are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html.
The augmented manifest and .lst file options are quick to set up, since they just require you to create an annotation file, usually with a quick for loop. RecordIO requires you to use the im2rec.py tool, which is a little more work.
Using .lst files is reasonably easy: you just need to create the annotations with a quick for loop, like this:
# assuming train_index, train_class, train_pics store the pic index, class and path
with open('train.lst', 'a') as file:
    for index, cl, pic in zip(train_index, train_class, train_pics):
        file.write(str(index) + '\t' + str(cl) + '\t' + pic + '\n')
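Since your class labels are already encoded in the file names, that for loop can be driven straight from an S3 listing. A minimal sketch, assuming boto3 is available and using hypothetical bucket and class names:
import boto3

# Sketch: build train.lst from S3 keys whose names encode the class,
# e.g. cat-1.jpg, dog-1.jpg (bucket name and class mapping are hypothetical).
s3 = boto3.client('s3')
classes = {'cat': 0, 'dog': 1}

with open('train.lst', 'w') as out:
    index = 0
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket='my-s3-bucket-dir'):
        for obj in page.get('Contents', []):
            key = obj['Key']
            label = key.split('/')[-1].split('-')[0]  # "cat-1.jpg" -> "cat"
            if label in classes:
                out.write('%d\t%d\t%s\n' % (index, classes[label], key))
                index += 1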

Can't get exif data from .JPG image

I'm trying to read the exif data from a .JPG image. I've tried different solutions found here and there (PIL, piexif, exifread...) and none of them worked for this set of images. They worked for images taken with another camera, but not for this one; all these different methods return empty dictionaries. It seems that there is no exif data, but (I apologize for my newbieness) when I right-click + Properties (I use Windows), I do see what looks like exif data to me: date of creation, etc...
Here is one image:
image.JPG
If another of the thousands of anonymous heroes could help me on this one, I would be very grateful...
Alright, so I found a solution, which I share now.
The problem is that the libraries that read metadata don't handle all possible image file configurations, so they can handle some files and not others. I finally made it work using exiftool, an executable that I downloaded on my Windows machine from this link:
https://sno.phy.queensu.ca/~phil/exiftool/
Then I put the executable in a folder and added exiftool.py to that folder, which I got from:
https://github.com/smarnach/pyexiftool/find/master
Then, using this small piece of code (for example):
import exiftool

# files: a list of image paths, e.g. ['image.JPG']
with exiftool.ExifTool("exiftool.exe") as et:
    metadata = et.get_metadata_batch(files)
for d in metadata:
    print("{:20.20} {:20.20}".format(d["SourceFile"],
                                     d["File:FileCreateDate"]))
Of course, this is just to show that you can indeed access the metadata; after that you can do whatever you want with it. Here is the documentation of the pyexiftool library: http://smarnach.github.io/pyexiftool/
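As a side note, the wrapper above essentially shells out to exiftool.exe, so you can also call it directly and parse its JSON output. A minimal sketch ('image.JPG' is a hypothetical path):
import json
import subprocess

# Sketch: call exiftool.exe directly and parse its JSON output
# (-G prefixes each tag with its group, e.g. "File:FileCreateDate").
result = subprocess.run(['exiftool.exe', '-G', '-json', 'image.JPG'],
                        capture_output=True, text=True, check=True)
metadata = json.loads(result.stdout)[0]
print(metadata.get('File:FileCreateDate'))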
Cheers, JM
