Is there a way to check if batches of files exist in GCS? - python-3.x

I have a list of file paths that I want to check in GCS. It looks like this:
for path in all_paths:
    try:
        gcs.blob(GCS_BUCKET_NAME, path).exists()
    except google.api_core.exceptions.NotFound:
        missing_paths.append(path)
This works fine but it takes a lot of time, as the requests are sent one by one for each path. Is there a way to send batches of requests to the Google Cloud Storage API? Or any other way to speed up this check?

With Cloud Storage you can only filter by path prefix (path/to/file.xxx). You will then receive all the files matching this prefix, including those under sub-paths (path/to/sub/path/file.xxx). The rest of the processing has to be done on your side.
And yes, if you have a lot of files, it will take a lot of time.
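If the paths you are checking share a common prefix, one way to speed this up is to list everything under that prefix in a few paginated requests and do the membership check locally. A minimal sketch with the google-cloud-storage client, where COMMON_PREFIX is an assumption (a prefix shared by every entry in all_paths):

from google.cloud import storage

client = storage.Client()
COMMON_PREFIX = "path/to/"  # assumption: a prefix covering every entry in all_paths

# One paginated listing instead of one request per path.
existing = {blob.name for blob in client.list_blobs(GCS_BUCKET_NAME, prefix=COMMON_PREFIX)}
missing_paths = [path for path in all_paths if path not in existing]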

Related

How to work with local tmp files in a Node.js API?

I am currently making a little API which returns JSON according to an input. This API needs to run some local programs on the server and also needs to place some temporary files. It all works as long as I query the API one request at a time and with just one "user".
The problem is that I only have one temp folder to store the temporary files, so whenever there are multiple API queries the tmp folder gets messed up - the data in there gets mixed up.
What would be a good way to have an API using temp files - and still keep it working if every run needs its own temp data?
The current process is:
server/api/getGeometry?lat=52.5167776&lon=13.4092091&bboxSize=2000&output=glb
server queries zips according to lat/lon
unpacks
converts
does voodoo
generates json
sends back the json
cleans up the tmp folder
next run same story...
So it's the same tmp folder every time, which I guess is not the way to go.
Thanks a lot for ideas!
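For what it's worth, a common pattern here is to give every request its own temporary directory instead of sharing one tmp folder. A rough sketch of the idea in Python (the question is about Node.js, but the pattern is the same; handle_request and its steps are placeholders):

import tempfile

def handle_request(lat, lon, bbox_size, output):
    # Each call gets its own isolated scratch directory, so concurrent
    # requests can no longer mix up each other's files. The directory and
    # everything in it is removed automatically when the block exits.
    with tempfile.TemporaryDirectory(prefix="getGeometry-") as workdir:
        # ... unpack the zips into workdir, convert, do voodoo, build the json ...
        return {"lat": lat, "lon": lon, "bboxSize": bbox_size, "output": output}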

Is it possible to download a file nested in a zip file, without downloading the entire zip file?

Is it possible to download a file nested in a zip file, without downloading the entire zip archive?
For example from a url that could look like:
https://www.any.com/zipfile.zip?dir1\dir2\ZippedFileName.txt
Depending on if you are asking whether there is a simple way of implementing this on the server-side or a way of using standard protocols so you can do it from the client-side, there are different answers:
Doing it with the server's intentional support
Optimally, you implement a handler on the server that accepts a query string on any file download, similar to your suggestion (I would however include a variable name, for example: ?download_partial=dir1/dir2/file). Then the server can just extract the file from the ZIP archive and serve just that (maybe via a compressed stream if the file is large).
If this is the path you are going and you update the question with the technology used on the server, someone may be able to answer with suggested code.
But on with the slightly more fun way...
Doing it opportunistically if the server cooperates a little
There are two things that conspire to make this a bit feasible, but only worth it if the ZIP file is massive in comparison to the file you want from it.
ZIP files have a directory that says where in the archive each file is. This directory is present at the end of the archive.
HTTP servers optionally allow download of only a range of a response.
So, if we issue a HEAD request for the URL of the ZIP file (HEAD /path/file.zip), we may get back an Accept-Ranges: bytes header and a Content-Length header that tells us the length of the ZIP file. If we have those, then we can issue a GET request with a header such as Range: bytes=1000000-1024000, which would give us part of the file.
The directory of files is towards the end of the archive, so if we request a reasonable block from the end of the file then we will likely get the central directory included. We then look up the file we want, and know where it is located in the large ZIP file.
We can then request just that range from the server, and decompress the result...
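To make the steps above concrete, here is a rough, untested sketch of the opportunistic approach in Python with the requests library. The URL and member name are placeholders, and it ignores ZIP64 archives, encrypted entries and servers that don't honour Range headers:

import struct
import zlib
import requests

ZIP_URL = "https://www.any.com/zipfile.zip"    # placeholder
MEMBER = "dir1/dir2/ZippedFileName.txt"        # placeholder

def fetch_range(url, start, end):
    # Download only bytes [start, end] of the resource.
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    resp.raise_for_status()
    return resp.content

# 1. Does the server support ranges, and how long is the archive?
head = requests.head(ZIP_URL, allow_redirects=True)
assert head.headers.get("Accept-Ranges") == "bytes"
size = int(head.headers["Content-Length"])

# 2. Grab a block from the end of the archive; it should contain the
#    end-of-central-directory record and, usually, the central directory.
tail_start = max(0, size - 64 * 1024)
tail = fetch_range(ZIP_URL, tail_start, size - 1)
eocd = tail.rfind(b"PK\x05\x06")                       # EOCD signature
assert eocd != -1
cd_size, cd_offset = struct.unpack("<II", tail[eocd + 12:eocd + 20])
if cd_offset >= tail_start:
    cd = tail[cd_offset - tail_start:cd_offset - tail_start + cd_size]
else:                                                  # directory not in our block
    cd = fetch_range(ZIP_URL, cd_offset, cd_offset + cd_size - 1)

# 3. Walk the central directory until we find the member we want.
entry, pos = None, 0
while pos + 46 <= len(cd) and cd[pos:pos + 4] == b"PK\x01\x02":
    method, = struct.unpack("<H", cd[pos + 10:pos + 12])
    comp_size, = struct.unpack("<I", cd[pos + 20:pos + 24])
    name_len, extra_len, comment_len = struct.unpack("<HHH", cd[pos + 28:pos + 34])
    local_offset, = struct.unpack("<I", cd[pos + 42:pos + 46])
    name = cd[pos + 46:pos + 46 + name_len].decode("utf-8")
    if name == MEMBER:
        entry = (method, comp_size, local_offset)
        break
    pos += 46 + name_len + extra_len + comment_len
if entry is None:
    raise FileNotFoundError(MEMBER)
method, comp_size, local_offset = entry

# 4. Fetch the local file header to find where the compressed data starts,
#    then fetch just that data and inflate it.
local_header = fetch_range(ZIP_URL, local_offset, local_offset + 29)
lh_name_len, lh_extra_len = struct.unpack("<HH", local_header[26:30])
data_start = local_offset + 30 + lh_name_len + lh_extra_len
data = fetch_range(ZIP_URL, data_start, data_start + comp_size - 1)
content = zlib.decompressobj(-15).decompress(data) if method == 8 else data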

Copy files from AWS s3 sub folder to Azure Blob

I am trying to copy files out of an S3 bucket using Azure Data Factory. Firstly I want a list of the directories.
Using the CLI I would use aws s3 ls.
From there I can take that list, loop over it in a ForEach and push each item into a variable.
In ADF I have tried to use 'Get Metadata', and although this works in theory, in practice there are 76 files in each directory and the loop runs over 1.5m times. This just isn't worth it; it takes far too long, especially as listing just the directories only takes about 20 seconds for 20,000 directories.
Is there a method to do this listing? When creating the dataset we get a 'no permissions' error, however when we use a specific location it works.
Many thanks
I have found another way of completing this task.
To begin with, I am using Get Metadata with the Child Items option, which produces an array.
I push this into a string variable. With this variable you can then create a stored procedure to pick it apart, using OPENJSON to get just the values. These can then be pulled apart further to get the directory names.
I then merge these into a table.
Using a Lookup I can then run another stored procedure to return the value I require from the table. This whole process runs in a couple of minutes.
Anyone who wants a further explanation, please ask; I will try and create a walkthrough to assist.
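For context on what that stored procedure has to pick apart: the Get Metadata childItems output is just a JSON array of objects with name and type fields. A tiny illustration of the parsing in Python (the original answer does this in T-SQL with OPENJSON, which isn't shown in the post; the sample array below is hypothetical):

import json

# Hypothetical sample of what Get Metadata's childItems looks like.
child_items = '[{"name": "2020-04-01", "type": "Folder"}, {"name": "readme.txt", "type": "File"}]'
directories = [item["name"] for item in json.loads(child_items) if item["type"] == "Folder"]
print(directories)  # ['2020-04-01']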

Filter recent files in Logic Apps' SFTP when files are added/modified trigger

I have this Logic App that connects to an SFTP server and it's triggered by the "files are added or modified" trigger. It's set to run every 10 minutes, looking for new/modified files and copying them to an Azure storage account.
The problem is that this SFTP server path is set to overwrite a set of files every X minutes (I have no control over this) and so, pretty often the Logic App overlaps with the update process of these files and downloads files that are still being written. The result is corrupted files.
Is there a way to add a filter to the When files are added or modified (properties only) so that it only takes into consideration files with a modified date of, at least, 1 minute old?
That way, files that are currently being written won't be added to the list of files to download. The next run of the Logic App would then fetch these ignored files, and so on.
UPDATE
I've found a Trigger Conditions option in the trigger's settings, but I can't find any documentation about it.
From testing the "When files are added or modified" trigger, it seems we cannot add a filter in the trigger itself to only pick up records modified at least 1 minute ago. We can only get the list of files with their LastModified datetime, loop over them, and use an "If" condition to judge whether we should download each one.
Update:
The expression in the screenshot is:
sub(ticks(utcNow()), ticks(triggerBody()?['LastModified']))
Update workaround
Is it possible to add a "Delay" action when the last modified time is less than 1 minute old? For example, if the file was modified less than 60 seconds ago, use "Delay" to wait 5 minutes until the overwrite operation is complete, then do the download.
I checked the sample @equals(triggers().code, 'InternalServerError'); it uses the condition functions from the logical comparison functions, so the key point is to make sure the property you want to filter on is available in the trigger or triggerBody, or you will get the error below.
So I changed the expression to something like @greater(triggerBody().LastModified, '2020-04-20T11:23:00Z'); with this, files modified before 2020-04-20T11:23:00Z do not trigger the flow.
You could also use other functions such as less, greaterOrEquals, etc. from the logical comparison functions.
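Combining the two expressions above, a trigger condition along these lines should (untested) only fire the trigger for files whose LastModified time is at least one minute in the past, since ticks are 100-nanosecond units and one minute is 600000000 ticks:
@greaterOrEquals(sub(ticks(utcNow()), ticks(triggerBody()?['LastModified'])), 600000000)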

Using Logic Apps to get specific files from all sub(sub)folders, load them to SQL-Azure

I'm quite new to Data Factory and Logic Apps (but I have been working with SSIS for many years).
I succeeded in loading a folder with 100 text files into SQL Azure with Data Factory, but the files themselves are untouched.
Now, another requirement is that I loop through the folders to get all files with a certain file extension.
In the end I should move (= copy & delete) all the files from the 'To_be_processed' folder to the 'Processed' folder.
I cannot find where to put 'wildcards' and such:
For example, get all files with file extensions .001, .002, .003, .004, .005, ... up to ..., .996, .997, .998, .999 (a thousand files)
--> also searching in the subfolders.
Is it possible to call a Data Factory from within a Logic App ? (although this seems unnecessary)
Please find some more detailed information in this screenshot:
Thanks in advance helping me out exploring this new technology!
Interesting situation.
I agree that using Logic Apps just for this additional layer of file handling seems unnecessary, but Azure Data Factory may currently be unable to deal with exactly what you need...
In terms of adding wildcards to your Azure Data Factory datasets, you have 3 attributes available within the JSON typeProperties block, as follows.
Folder Path - to specify the directory; this can work with a partition by clause for a time slice start and end. Required.
File Name - to specify the file; again, this can work with a partition by clause for a time slice start and end. Not required.
File Filter - this is where wildcards can be used for single and multiple characters: (*) for multiple and (?) for single. Not required.
More info here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-onprem-file-system-connector
I have to say that separately none of the above are ideal for what you require and I've already fed back to Microsoft that we need a more flexible attribute that combines the 3 above values into 1, allowing wildcards in various places and a partition by condition that works with more than just date time values.
That said. Try something like the below.
"typeProperties": {
"folderPath": "TO_BE_PROCESSED",
"fileFilter": "17-SKO-??-MD1.*" //looks like 2 middle values in image above
}
On a side note: there is already a Microsoft feedback item that's been raised for a file move activity, which is currently under review.
See here: https://feedback.azure.com/forums/270578-data-factory/suggestions/13427742-move-activity
Hope this helps
We have used a C# application which we call through App Services -> WebJobs.
That makes it much easier to iterate through folders. To get the data into SQL we used bulk insert.
