Google Drive File Stream files creation date - python-3.x

I know this has been asked a few times already, but not in this context, I think, and the other questions were asked a few years ago, so I'm hoping maybe something has changed.
So my issue is: I am uploading files to Google Drive using Google Drive File Stream. While the uploading goes smoothly, I have a problem with the files' creation date - it is always changed to the timestamp of when the file was uploaded, not the actual, local file creation date. This is a serious problem, as I am going to use this to back up huge amounts of data and preserve all the metadata I can, and the creation date is crucial. Is there a way to either upload the files with the creation date intact, or to change it after the upload? From what I've seen this seems not to be possible, but I have to try and make it work. Any help and insight will be appreciated. I'm using Drive File Stream with Python.
EDIT: I didn't make it clear enough - the issue here is that I don't want to use the Google Drive API at all, but rather deal with this using only the Google Drive File Stream interface, if that's possible.

create
If you check the documentation for files.create, you will find that the acceptable metadata for file creation does include a createdTime.
You should then just add this to the metadata you use when uploading the file. As you did not post your code, I have grabbed the standard example from the documentation and added the created time as follows.
file_metadata = {'name': 'photo.jpg', 'createdTime': 'THETIME'}  # 'THETIME' is a placeholder for an RFC 3339 timestamp
media = MediaFileUpload('files/photo.jpg', mimetype='image/jpeg')
file = drive_service.files().create(body=file_metadata,
                                    media_body=media,
                                    fields='id').execute()
print('File ID: %s' % file.get('id'))
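Since the goal here is to preserve the local creation timestamp, here is a minimal sketch (my own addition, not part of the answer) of filling that placeholder in from the local file; note that os.path.getctime is the creation time on Windows but generally the inode change time on Linux:
import datetime
import os

def rfc3339_created_time(path):
    # Creation time on Windows; inode change time on most Linux filesystems.
    ts = os.path.getctime(path)
    return datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc).strftime('%Y-%m-%dT%H:%M:%S.000Z')

file_metadata = {'name': 'photo.jpg',
                 'createdTime': rfc3339_created_time('files/photo.jpg')}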
Update
In the event that you want to update the files you have already created, you could use the following method.
If you check the documentation for files.create you will find that the response is just a File resource.
If you check the File resource you will see that createdTime is listed as writable.
You should run a files.update and reset the createdTime to the proper time.
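A minimal sketch of that update call with the Python client, assuming (as the answer states) that createdTime is accepted on update; the file ID and timestamp are placeholders:
# Assumes drive_service is an authorized Drive v3 client.
file_id = 'FILE_ID'                          # placeholder
original_time = '2019-03-04T10:15:00.000Z'   # RFC 3339 timestamp taken from the local file
updated = drive_service.files().update(
    fileId=file_id,
    body={'createdTime': original_time},
    fields='id, createdTime').execute()
print('createdTime is now %s' % updated.get('createdTime'))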

Related

Reading GeoJSON in databricks, no mount point set

We have recently made changes to how we connect to ADLS from Databricks, which have removed the mount points that were previously established within the environment. We are using Databricks to find points in polygons, as laid out in the Databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read a GeoJSON file from ADLS into the notebook and then broadcast it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the new changes that have been made have removed the mount point and we are now reading files in using the string:
"wasbs://Z#Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
Why did we previously have to use "/dbfs/" when reading in GeoJSON but not CSV files, before the changes to our environment?
What is the correct way to read in GeoJSON files into databricks without a mount point set?
Why does our process fail upon trying to read the second created temp GeoJSON file?
Thanks in advance for any assistance - very new to Databricks...!
Pandas (and GeoPandas) uses the local file API for accessing files, and you previously accessed files on DBFS via /dbfs, which exposes that local file API. In your specific case, the problem is that even though you use dbutils.fs.cp, you didn't specify that you want to copy the file locally, so by default it was copied onto DBFS with the path /dbfs/tmp/temp_nights (actually dbfs:/dbfs/tmp/temp_nights), and as a result the local file API doesn't see it - you would need to use /dbfs/dbfs/tmp/temp_nights instead, or copy the file into /tmp/temp_nights.
But the better way would be to copy the file locally - you just need to specify that the destination is local, which is done with the file:// prefix, like this:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/...Nights_new.geojson",
"file:///tmp/temp_nights")
and then read the file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")
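Putting the pieces together, here is a rough sketch of that pattern applied per file; the wasbs:// path is the placeholder from the question and the helper name is my own, so treat it as illustrative rather than a drop-in:
import os
import geopandas as gpd

def read_geojson_from_adls(remote_path, local_name):
    local_path = "/tmp/" + local_name
    # Copy to the driver's local disk - note the file:// prefix on the destination.
    dbutils.fs.cp(remote_path, "file://" + local_path)
    gdf = gpd.read_file(local_path)
    os.remove(local_path)  # clean up the local copy so the next file can reuse the pattern
    return gdf

nights = read_geojson_from_adls("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Hex_Nights_400Buffer.geojson", "temp_nights")
a_nights = sc.broadcast(nights)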

Pygsheets and AWS Lambda

I am sure the suggestion here will be to use an S3 bucket, and I am aware of this. My question is a bit more difficult, from what I am gathering: I want to use pygsheets, a Python library, to write to a Google Sheet. However, after getting through all the deployment and layer steps, what is stopping me is a pesky .json file that needs to be read by one of the functions in pygsheets. I do believe it also reads and writes something else on the fly, which may not be allowed in and of itself, but I am asking regardless.
Link directly to the function that needs to be used in conjunction with the secret.json from Google: Pygsheets Github
Sample code:
print("-->Using the library pygsheets to update...")
print(f"-->Accessing client_secret.json")
gc = pygsheets.authorize(service_file='client_secret.json')
print(f"-->Opening Google Sheets")
#open the google spreadsheet
sh = gc.open_by_url('https://...')
print(f"-->Accessing")
#select the first sheet
wks = sh[0]
print(f"-->Updating selected cells... ")
#update the first sheet with df, starting at cell J14.
wks.set_dataframe(df, 'J14')
Again, I am so close to my final product of automating my sheets using this script/library/lambda that I can taste it :). If the absolute best workaround is S3, please be gentle; I am a first-year analyst trying to get my feet wet. My superior tells me it would take a while to hook up a connection to S3, so that's also a reason to avoid it. Thanks!
Fixed. I simply added the .json creds to the deployment package. I had run into an issue with pandas, so I have a blend of layers and a deployment package with my .py script (and, again, with the secret .json). Thanks!
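For anyone following along, a minimal sketch of what that handler can look like, assuming client_secret.json sits next to the handler file in the deployment package; the sheet URL, target cell, and event shape are placeholders:
import os
import pandas as pd
import pygsheets

# client_secret.json ships inside the deployment package, next to this handler file.
SERVICE_FILE = os.path.join(os.path.dirname(__file__), 'client_secret.json')

def lambda_handler(event, context):
    gc = pygsheets.authorize(service_file=SERVICE_FILE)
    sh = gc.open_by_url('https://docs.google.com/spreadsheets/d/...')  # placeholder URL
    wks = sh[0]
    df = pd.DataFrame(event.get('rows', []))  # hypothetical: rows passed in the event payload
    wks.set_dataframe(df, 'J14')
    return {'status': 'ok'}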

NetSuite SuiteScript: how to bridge the 10MB limit?

Hi, and thanks for any help. Is there a way to work with files larger than 10MB? I have to check for updates on items in a file that would be uploaded, but the file contains all items in the system and is approximately 20MB. This 10MB limit is killing me. I see streaming for file save and appending, but not for file reading, so I am open to any suggestions. The provider in this instance doesn't offer the facility to chunk the files. Thanks in advance for your help.
If you are using SuiteScript 2.0 to process a file from the file cabinet and you use file.lines.iterator() to process it, the size limit is 10MB per line.
I believe returning a file object from a Map/Reduce script's getInputData stage automatically parses the file into lines.
The 10MB file size limit comes into play if you try to create a file larger than 10MB.
If you are trying to read in an external file via script, then one approach that I've used is to proxy the call via an external service, e.g. query an AWS Lambda function that checks for and saves the file to S3. Return the file path and size to your SuiteScript. The SuiteScript then asks for "pages" of the file that are less than 10MB and saves those. If you are uploading something like a .csv then the Lambda function can send the header with each paged request.
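A rough sketch of what the Lambda side of that proxy could look like in Python; the bucket/key/event shape and page size are made up for illustration, and the SuiteScript side would call it repeatedly with an increasing page number:
import base64
import boto3

s3 = boto3.client('s3')
PAGE_SIZE = 5 * 1024 * 1024  # 5MB pages, comfortably under the 10MB limit

def lambda_handler(event, context):
    bucket = event['bucket']          # hypothetical event shape
    key = event['key']
    page = int(event.get('page', 0))
    start = page * PAGE_SIZE
    # Fetch just one byte range of the object from S3.
    byte_range = 'bytes=%d-%d' % (start, start + PAGE_SIZE - 1)
    resp = s3.get_object(Bucket=bucket, Key=key, Range=byte_range)
    body = resp['Body'].read()
    return {
        'page': page,
        'size': len(body),
        'data': base64.b64encode(body).decode('ascii'),  # keep the payload JSON-safe
    }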

Backup Google Drive to .zip file with file conversion

I keep going in circles on this topic, and can't find an automated method that works for mass data on a Google Drive. Here is the goal I'm looking to achieve:
My company uses an unlimited Google Drive to store shared documents, and we are looking to back up the contents automatically. But we can't have the data stored in a backup as Google documents like ".gdoc" and ".gsheet"... we need to have the documents backed up in Microsoft/OpenOffice format (".docx" and ".xlsx").
We currently use Google's Takeout page to zip all the contents of the Drive and save it on our Linux server (which has redundant storage), and it does zip and export the files to the correct formats.
Here: https://takeout.google.com/settings/takeout
Now that works... but requires a bit of manual work on our part, and babysitting the zip, download, and upload processes is becoming wasteful. I have searched and read that the Google API for Takeout is unavailable to use through gscript, so that seems to be out of the question.
Using Google Apps Script, I have been able to convert single files... but can't, for instance, convert a folder of ".gsheet" files to ".xlsx" format. Maybe copying and converting all the Google files into a new folder on the Drive could be possible. Having access to the Drive and the converted "backup", we could then back up the collection of converted files via the server...
So here is the gist of it all:
Can you mass-convert all of a Google Drive and/or a specific folder on the Drive from ".gdoc" to ".docx", and ".gsheet" to ".xlsx"? Can this be done with gscript?
If it is not possible via the method in question one, is anyone familiar with a Linux or Mac app that could do such a directory conversion? (I don't believe there is one, because of Google's proprietary file types.)
I'm stuck in a bit of a hole, and any insight into this problem could help. I really wish Google would allow users to convert and export Drive folders via a script selection.
#1) Can you mass-convert all of a Google Drive and/or a specific folder on the Drive from ".gdoc" to ".docx", and ".gsheet" to ".xlsx"? Can this be done with gscript?
You can try this:
How To Automatically Convert Files in Google Apps Script
Converting a file in Google Apps Script into a blob:
var documentId = DocumentApp.getActiveDocument().getId();
function getBlob(documentId) {
  var file = Drive.Files.get(documentId);
  var url = file.exportLinks['application/vnd.openxmlformats-officedocument.wordprocessingml.document'];
  var oauthToken = ScriptApp.getOAuthToken();
  var response = UrlFetchApp.fetch(url, {
    headers: {
      'Authorization': 'Bearer ' + oauthToken
    }
  });
  return response.getBlob();
}
Saving file as docx in Drive
function saveFile(blob) {
  var file = {
    title: 'Converted_into_MS_Word.docx',
    mimeType: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
  };
  file = Drive.Files.insert(file, blob);
  Logger.log('ID: %s, File size (bytes): %s', file.id, file.fileSize);
  return file;
}
Time-driven triggers
A time-driven trigger (also called a clock trigger) is similar to a cron job in Unix. Time-driven triggers let scripts execute at a particular time or on a recurring interval, as frequently as every minute or as infrequently as once per month. (Note that an add-on can use a time-driven trigger once per hour at most.) The time may be slightly randomized — for example, if you create a recurring 9 a.m. trigger, Apps Script chooses a time between 9 a.m. and 10 a.m., then keeps that timing consistent from day to day so that 24 hours elapse before the trigger fires again.
function createTimeDrivenTriggers() {
  // Trigger every 6 hours.
  ScriptApp.newTrigger('myFunction')
      .timeBased()
      .everyHours(6)
      .create();
  // Trigger every Monday at 09:00.
  ScriptApp.newTrigger('myFunction')
      .timeBased()
      .onWeekDay(ScriptApp.WeekDay.MONDAY)
      .atHour(9)
      .create();
}
Process:
List all file IDs inside a folder
Convert the files
Insert the code into a time-driven trigger
2) If it is not possible via the method in question one, is anyone familiar with a Linux or Mac app that could do such a directory conversion? (I don't believe there is one, because of Google's proprietary file types.)
If you want to save it locally, try setting up a cron job and using Download files:
The Drive API allows you to download files that are stored in Google Drive. Also, you can download exported versions of Google Documents (Documents, Spreadsheets, Presentations, etc.) in formats that your app can handle. Drive also supports providing users direct access to a file via the URL in the webViewLink property.
Depending on the type of download you'd like to perform — a file, a Google Document, or a content link — you'll use one of the following URLs:
Download a file — files.get with alt=media file resource
Download and export a Google Doc — files.export
Link a user to a file — webContentLink from the file resource
Sample code:
$fileId = '0BwwA4oUTeiV1UVNwOHItT0xfa2M';
$content = $driveService->files->get($fileId, array(
    'alt' => 'media'));
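If you would rather run the cron job in Python instead of PHP, a rough equivalent with the Drive v3 Python client might look like this; the file ID is a placeholder, and files().export_media is used because Google-native files have to be exported rather than downloaded directly:
import io
from googleapiclient.http import MediaIoBaseDownload

# Export a Google Sheet as .xlsx; assumes drive_service is an authorized Drive v3 client.
file_id = 'FILE_ID'  # placeholder
request = drive_service.files().export_media(
    fileId=file_id,
    mimeType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
buf = io.BytesIO()
downloader = MediaIoBaseDownload(buf, request)
done = False
while not done:
    status, done = downloader.next_chunk()
with open('backup.xlsx', 'wb') as f:
    f.write(buf.getvalue())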
Hope this helps and answers all your questions.

How can I rename all the existing carrierwave uploaded files?

I have been using CarrierWave for file uploads for some time. I did not try to rename the files as they were uploaded. Now I want to give each file a random name and a file extension that's consistent with the content type. I read the wiki and other sites, and it was recommended that in the uploader I could do:
def filename
  "#{secure_token}.#{file.extension}" if original_filename.present?
end

private

def secure_token
  # implement the secure token
end
It worked fine on files uploaded after these additions to the uploader, but I have many files that were uploaded before this change. I was wondering if someone could tell me how to migrate the old files.
I tried adding a method to the uploader:
def rename_file!
  model.update_attribute mounted_as, "#{secure_token}.#{file.extension}"
  recreate_versions!
end
Then, in the Rails console, I tried calling this on a model with an attachment. However, I found that the mounted_as column of the model never got updated, even though various versions of the file were created with the new name on the file system. When I inspected the mounted_as field of the model, it had not changed; the log actually said the column was updated with the old value.
How can I get the mounted_as column on the model updated?
In addition, it seemed like the old files with the old names were still on the file system. Is there a way to remove them? I tried adding a line:
file.move_to File.join(File.dirname(file.path), "#{secure_token}.#{file.extension}")
in the rename_file! method. It renamed the files but did not update the mounted_as column on the model, so accessing its URL resulted in a 404.
I know this is a little old now, but perhaps it will be useful for others.
After updating your Uploader with the filename method, like you have, you could run this from the Rails console:
Post.all.each do |p|
  p.avatar.recreate_versions!
  p.save!
end
In the current version of CarrierWave, this will both rename the file and update the model record.
Post, of course, is the model name and avatar the column on which you are mounting the uploader, so change those as required.
