In my organization we have an application that receives events and stores them on S3, partitioned by day. Some of the events are offline, which means that while writing we append the files to the proper folder (according to the date of the offline event).
We get the events by reading folder paths from a queue (SQS) and then reading the data from the folders we received. Each folder can contain data from several different event dates.
The problem is that if the application fails after one of the stages has completed, I have no idea what was already written to the output folder, and I can't simply delete everything because other data is already there.
Our current solution is to write to HDFS and, after the application finishes, run a script that copies the files to S3 (using s3-dist-cp).
But that doesn't seem very elegant.
My current approach is to write my own FileOutputCommitter that adds an applicationId prefix to all written files, so that in case of an error I know what to delete.
So what I'm actually asking is: is there an existing solution for this within Spark, and if not, what do you think of my approach?
--edit--
After chatting with @Yuval Itzchakov I decided to have the application write to a temporary path and push that path to an AWS SQS queue. An independent process is triggered every x minutes, reads folders from SQS, and copies them with s3-dist-cp from the temporary path to the final destination. In the application I wrapped the main method with a try-catch; if I catch an exception I delete the temp folder.
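For reference, a minimal sketch of that pattern in PySpark/boto3 (the bucket, queue URL, input path and partition column are placeholders, and the real application may well be written in Scala): write to a per-application temporary prefix, enqueue the prefix on SQS only after a successful write, and wipe the prefix if anything throws.

    # Minimal sketch of the adopted pattern; bucket, queue URL, input path and
    # partition column are placeholders, not the real application's values.
    import boto3
    from pyspark.sql import SparkSession

    BUCKET = "my-events-bucket"                                                     # placeholder
    QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/pending-copies"   # placeholder

    spark = SparkSession.builder.appName("events-writer").getOrCreate()
    tmp_prefix = f"tmp/{spark.sparkContext.applicationId}/"

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    try:
        df = spark.read.json(f"s3://{BUCKET}/incoming/some-folder/")   # placeholder input
        df.write.partitionBy("event_date").parquet(f"s3://{BUCKET}/{tmp_prefix}")
        # Hand the prefix to the independent s3-dist-cp process only after a
        # fully successful write.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=tmp_prefix)
    except Exception:
        # The temp prefix belongs to this run only, so it is safe to wipe it.
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=tmp_prefix):
            for obj in page.get("Contents", []):
                s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
        raise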
I have a bucket with multiple users, and I would like to pre-sign URLs so the client can upload to S3 directly (some files can be large, so I'd rather they not pass through the Node server). My question is this: until the Mongo database is hit, there is no Mongo ObjectId to use as a prefix for the file. (I'm separating the files in the structure UserID/PostID/resource, so you can see all of a user's pictures by looking under /UserID, and you can target a specific post by also adding the PostID.) Conversely, there is no object URL until the client uploads the file, so I'm at a bit of an impasse.
Is it bad practice to rename files after they touch the bucket? I just can't know the ObjectID in advance (the post has to be created in Mongo first), but the user has to select which files they want to upload before the object is created. I was thinking the best flow could be one of two situations:
Client selects files -> Mongo creates the document -> server responds to the client with the ObjectID and pre-signed URLs for each file, with the key set to UserID/PostID/name (sketched below). After a successful upload, the client triggers an update function on the server to edit the URLs of the post. After the update, send success to the client.
Client uploads files to the root of the bucket -> Mongo document is created, storing the URLs of the uploaded S3 files -> iterate over the list and prepend the UserID and the newly created PostID, updating the Mongo document -> success response to the client.
Is there another approach that I don't know about?
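For illustration, here is a rough sketch of how the pre-signed URL in option 1 could be generated once the Mongo document exists. The question mentions a Node server; this uses Python/boto3 purely to show the call, and the bucket name, IDs and expiry are made up.

    # Hypothetical sketch: generate a pre-signed PUT URL for the key
    # UserID/PostID/filename once the PostID is known. Bucket, IDs and expiry
    # are placeholders.
    import boto3

    s3 = boto3.client("s3")

    def presign_upload(user_id: str, post_id: str, filename: str, expires: int = 900) -> str:
        key = f"{user_id}/{post_id}/{filename}"
        return s3.generate_presigned_url(
            "put_object",
            Params={"Bucket": "my-uploads-bucket", "Key": key},
            ExpiresIn=expires,
        )

    # The server would return one such URL per file the client selected.
    url = presign_upload("user123", "post456", "photo.jpg")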
Answering your question:
Is it bad practice to rename files after they touch the server?
If you are planning to use S3 to store your files, there is no server, so there is no problem with changing these files after you upload them.
The only thing you need to understand is that renaming an object takes two requests:
copy the object with a new name
delete the old object with the old name
This means that cost/latency could become a problem if you have a huge number of changes, but for most cases it will not be an issue (a sketch of the two requests follows below).
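A minimal sketch of those two requests with boto3, using placeholder bucket and key names:

    # "Rename" = copy to the new key, then delete the old key. Bucket and keys
    # are placeholders.
    import boto3

    s3 = boto3.client("s3")

    def rename_object(bucket: str, old_key: str, new_key: str) -> None:
        # 1) copy the object under the new name
        s3.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": old_key},
        )
        # 2) delete the object under the old name
        s3.delete_object(Bucket=bucket, Key=old_key)

    rename_object("my-uploads-bucket", "tmp/photo.jpg", "user123/post456/photo.jpg")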
I think the first option would be a good fit for you; the only thing I would change is adding serverless processing for your objects/files, and the AWS Lambda service is a good option for that.
In this case, instead of updating the files on the server, you update them with a Lambda function. You only need to add a trigger on your bucket for the PutObject event on S3; this way you can rename your files with the best processing time for your client and at low cost.
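A hedged sketch of what such a Lambda could look like. It assumes the client attaches the target UserID/PostID as object metadata when uploading through the pre-signed URL, which is an assumption of mine rather than something stated above, and in practice you would scope the trigger (e.g. to an uploads/ prefix) so the copied object does not re-trigger the function.

    # Hypothetical Lambda handler for the S3 PutObject trigger: it moves the
    # uploaded object under UserID/PostID/ via copy + delete. The "userid" and
    # "postid" metadata keys are assumptions, not part of the original answer.
    import urllib.parse
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            meta = s3.head_object(Bucket=bucket, Key=key).get("Metadata", {})
            user_id, post_id = meta.get("userid"), meta.get("postid")
            if not user_id or not post_id:
                continue  # skip objects without the expected metadata

            new_key = f"{user_id}/{post_id}/{key.rsplit('/', 1)[-1]}"
            s3.copy_object(Bucket=bucket, Key=new_key,
                           CopySource={"Bucket": bucket, "Key": key})
            s3.delete_object(Bucket=bucket, Key=key)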
I have created a Python Lambda function that gets executed as soon as a .zip file lands in a particular folder in an S3 bucket. Now there may be a situation where no file is uploaded to S3 within a certain time period (for example, by 10 AM). How can I get an alert when no file arrives?
You can use CloudWatch alarms. You can set an alarm that fires when no event (e.g. a Lambda invocation) is recorded for a metric.
It has only basic options to configure, but IMHO it's the simplest solution.
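For example, a sketch of such an alarm on the Lambda's Invocations metric with boto3; the function name, SNS topic and one-hour period are placeholders, and TreatMissingData="breaching" is what makes "no invocations" raise the alarm.

    # Sketch: alarm when the Lambda records no invocations in the period.
    # Function name, SNS topic ARN and period are placeholders.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="no-zip-file-arrived",
        Namespace="AWS/Lambda",
        MetricName="Invocations",
        Dimensions=[{"Name": "FunctionName", "Value": "my-zip-processor"}],
        Statistic="Sum",
        Period=3600,                    # evaluate hourly; adjust to your window
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="LessThanThreshold",
        TreatMissingData="breaching",   # no invocations -> no datapoints -> alarm
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:file-arrival-alerts"],
    )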
I'm using Azure ADF and my current setup is as follows:
An event-based trigger on an input blob fires on file upload. The upload triggers a copy activity to an output blob, and this action is followed by a delete operation on the input blob. The input blob can receive one or many files at once (I'm not sure how often the files are scanned / how quickly the event triggers the pipeline). Reading the Delete activity documentation, it says:
Make sure you are not deleting files that are being written at the same time.
Would my current setup delete files that are being written?
Event based trigger on file upload >> Write from input Blob to Output Blob >> Delete input Blob
I've made an alternative solution that does a Get Metadata activity at the beginning of the event-triggered pipeline and then a ForEach loop that deletes the files at the end, but I'm not sure whether this is necessary. Would my original solution suffice in the unlikely event that I'm receiving files every 15 seconds or so?
Also, while I'm at it: in a Get Metadata activity, how can I get the actual path to the file, not just the file name?
Thank you for the help.
The Delete activity documentation says:
Make sure you are not deleting files that are being written at the same time.
Your settings are:
Event based trigger on file upload >> Write from input Blob to Output Blob >> Delete input Blob
The Delete input Blob step only runs after the Write from input Blob to Output Blob activity has finished, so by the time the delete happens the files are no longer being written.
Your question: would my current setup delete files that are being written?
Have you tested these steps? You should test them yourself and you will get the answer.
Please notice:
The Delete activity does not support deleting a list of folders described by a wildcard.
Another suggestion:
You don't need to use a Delete activity to remove the input blob after Write from input Blob to Output Blob finishes.
You can use a Data Flow instead: its Source settings support deleting the source file (the input blob) after the copy has completed.
Hope this helps.
I could not use Leon Yue's solution because my source dataset was an SFTP one, which is not supported by Azure Data Flows.
To deal with this problem, I used the dataset's Filter by last modified setting and set the End Time to the time the pipeline started.
With this solution, only the files added to the source before the pipeline started will be consumed by both the copy and delete activities.
I have 40 million blobs totaling 10 TB in blob storage. I am using DML CopyDirectory to copy them into another storage account for backup purposes. It took nearly two weeks to complete. Now I am wondering up to which date the blobs were copied to the target directory: is it the date the job started or the date the job finished?
Does DML use anything like data slices?
Now I am wondering up to which date the blobs were copied to the target directory. Is it the date the job started or the date the job finished?
As far as I know, when you start the CopyDirectory method, it just sends requests telling the Azure storage account to copy files from the other storage account. The copy operation itself is performed by Azure Storage.
When you run the method to start copying the directory, Azure Storage first creates each target file with size 0. After the job finishes, you will find the size has changed.
So once the job has started, the file exists in the target directory, but with size 0; its last modified time at that point reflects when it was created.
Azure Storage then continues copying the file content to the target directory, and when the copy finishes it updates the file's last modified time.
So the DML SDK just tells the storage account to copy the files, and then keeps sending requests to Azure Storage to check each file's copy status.
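The original answer illustrated this with a screenshot of those requests. As an illustration only, here is how the same per-file copy status can be read with the Python azure-storage-blob SDK rather than the .NET DML library; the connection string and blob names are placeholders.

    # Illustrative sketch (Python SDK, not DML): read a target blob's
    # server-side copy status and last-modified time. Names are placeholders.
    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        "<target-connection-string>",
        container_name="backup-container",
        blob_name="folder/file1.dat",
    )

    props = blob.get_blob_properties()
    # copy.status is "pending" while content is still copying and "success"
    # once done; last_modified is updated when the copy completes.
    print(props.copy.status, props.copy.progress, props.last_modified)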
Thanks. But what happens if files are added to the source directory during this copy operation? Do the new files get copied to the target directory as well?
The short answer is yes.
The DML doesn't fetch the whole blob list and send requests to copy every file in one go.
It first gets a portion of your file name list and sends requests telling the storage to copy those files.
The list is sorted by file name.
For example, suppose the DML has already copied past the files whose names start with 0. (Screenshots omitted here: the target blob folder, the source blob folder, and the fully copied blob folder.) If you now add a file starting with 0 to your source folder, it will not be copied.
If you add a file whose name sorts after the point the DML has not yet scanned (toward the end of your blob folder), it will be copied to the new folder.
So during those two weeks, at least a million blobs must have been added to the container, with very random names. So I think DML doesn't work well for large containers?
As far as I know, DML is designed for high-performance uploading, downloading and copying of Azure Storage blobs and files.
When you use DML CopyDirectoryAsync to copy blob files, it first sends a request to list the folder's current files, then sends requests to copy them.
By default, one listing request returns 250 file names.
After getting a list page, DML records a marker pointing at the next blob names to search. It then lists the next batch of file names in the folder and starts copying again.
By default, the .NET HTTP connection limit is 2, which means only two concurrent connections can be maintained.
So if you don't raise the .NET HTTP connection limit, CopyDirectoryAsync will only fetch 500 records (2 connections × 250 names) before it starts copying.
After those copies complete, the operation moves on to the next files.
I suggest first raising the max HTTP connection limit so that more blob files can be listed and copied concurrently, for example:
ServicePointManager.DefaultConnectionLimit = Environment.ProcessorCount * 8;
Besides that, I suggest creating multiple folders to store the files.
For example, you could create a folder that stores one week's files.
The next week, you could start a new folder.
Then you can back up the old folder's files without new files being stored into it.
Finally, you could also write your own code to achieve this (a sketch follows below): first get the list of the folder's files.
The maximum number of results for one list request is 5000.
Then send requests to tell the storage account to copy each file.
Any file uploaded to the folder after you retrieved the list will not be copied to the new folder.
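As an illustration of that do-it-yourself approach, a sketch using the Python azure-storage-blob SDK instead of the .NET DML library; connection strings, container and prefix names are placeholders, and the source blobs must be readable by the copy request (e.g. via a SAS URL).

    # Sketch of "list once, then ask the service to copy each blob":
    # anything uploaded after the listing is simply not in `names`.
    from azure.storage.blob import BlobServiceClient

    src_service = BlobServiceClient.from_connection_string("<source-connection-string>")
    dst_service = BlobServiceClient.from_connection_string("<target-connection-string>")

    src_container = src_service.get_container_client("source-container")
    dst_container = dst_service.get_container_client("backup-container")

    # Snapshot of the blob names at this moment.
    names = [b.name for b in src_container.list_blobs(name_starts_with="folder/")]

    for name in names:
        src_blob = src_container.get_blob_client(name)
        dst_blob = dst_container.get_blob_client(name)
        # Server-side copy; append a SAS token to src_blob.url if the source
        # container is private.
        dst_blob.start_copy_from_url(src_blob.url)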
We have an XPages application, and we serialize all pages to disk for this specific application. We already use the gzip option, but it seems the serialized files are removed from disk only when the HTTP task is stopped or restarted.
As this application is used by many customers from different places around the globe, we try to avoid restarting the server or the HTTP task as much as possible. The drawback is that the serialized files are never deleted, so sooner or later we face a disk space problem, even though the gzipped serialized files are not that big.
A secondary issue is that the HTTP task takes quite a long time to stop because it has to remove all the serialized files.
Is there any way to have the Domino server "clean" old/unused serialized files without restarting the HTTP task?
We currently have an OS script that cleans serialized files older than two days, which works, but I would prefer a solution within Domino.
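For reference, the kind of age-based cleanup that OS script performs might look roughly like this; the persistence directory is a placeholder and the actual script isn't shown in the question.

    # Rough sketch of an "older than two days" cleanup; the directory is a
    # placeholder and should point at the configured persistence folder.
    import os
    import time

    PERSIST_DIR = "/temp/xspstate"          # placeholder path
    MAX_AGE_SECONDS = 2 * 24 * 60 * 60      # two days

    cutoff = time.time() - MAX_AGE_SECONDS
    for root, _dirs, files in os.walk(PERSIST_DIR):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)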
Thanks in advance for your answers/suggestions !
Renaud
I believe the httpSessionId is used to store the file(s) on disk. You could try the following:
Alter xsp.persistence.dir.xspstate to a friendlier location on the server (e.g. /temp/xspstate)
Register a SessionListener with your XPages application
Inside the SessionListener's sessionDestroyed method, recursively search through the folders to find the file or folder that matches the sessionId and delete it
When the sessionDestroyed method is called in the listener, any file locks should have been removed. Also note that, as of right now, the sessionDestroyed method is not called immediately after a user logs out (see my question here: SessionListener sessionDestroyed not called).
hope this helps...