DataLake locks on read and write for the same file - azure

I have 2 different applications that handle data from Data Lake Storage Gen1.
The first application uploads files; if there are multiple uploads on the same day, the existing file is overwritten (there is always one file per day, named using the YYYY-MM-dd format).
The second application reads the data from the files.
Is there an option to lock these operations: while a write is in progress no read should take place, and likewise, while a read is in progress any write should wait until the read is finished?
I did not find any option using the AdlsClient.
Thanks.

As far as I know, ADLS Gen1 is an Apache Hadoop file system compatible with the Hadoop Distributed File System (HDFS). So I searched the HDFS documentation, and I'm afraid you can't control mutual exclusion of reads and writes directly. Please see the documents below:
1. Link 1: https://www.raviprak.com/research/hadoop/leaseManagement.html
Writers must obtain an exclusive lock for a file before they’d be allowed to write / append / truncate data in those files. Notably, this exclusive lock does NOT prevent other clients from reading the file, (so a client could be writing a file, and at the same time another could be reading the same file).
2. Link 2: https://blog.cloudera.com/understanding-hdfs-recovery-processes-part-1/
Before a client can write an HDFS file, it must obtain a lease, which is essentially a lock. This ensures the single-writer semantics. The lease must be renewed within a predefined period of time if the client wishes to keep writing. If a lease is not explicitly renewed or the client holding it dies, then it will expire. When this happens, HDFS will close the file and release the lease on behalf of the client so that other clients can write to the file. This process is called lease recovery.
I can offer a workaround for your reference: add a Redis database in front of your writes and reads.
Whenever you do a read or write operation, first check whether a specific key exists in Redis. If it does not, write a key-value pair into Redis, do your business logic, and finally delete the key. If the key already exists, another operation is in progress, so wait and retry.
Although this may be a little cumbersome and may affect performance, I think it can meet your needs. By the way, since the business logic may fail or crash so that the key is never released, set a TTL when creating the key to avoid that situation.
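For illustration, here is a minimal sketch of that idea, assuming the Jedis client; the lock key named after the daily file and the 300-second TTL are illustrative choices, not part of the original answer.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class DailyFileLock {
    public static void main(String[] args) {
        // Hypothetical lock key per daily file, e.g. "lock:2019-07-30".
        String lockKey = "lock:" + java.time.LocalDate.now();

        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // SET key value NX EX 300: succeeds only if the key does not exist yet,
            // and the TTL guarantees the lock expires even if the process crashes.
            String acquired = jedis.set(lockKey, "locked", SetParams.setParams().nx().ex(300));

            if ("OK".equals(acquired)) {
                try {
                    // ... do the read or write against Data Lake here ...
                } finally {
                    jedis.del(lockKey); // release the lock when the work is done
                }
            } else {
                // Another application currently holds the lock: wait and retry, or skip this run.
            }
        }
    }
}
```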

Related

Is it possible to turn off the lock on files set up by the SFTP Session Factory?

I'm struggling with the cached SFTP session factory. Namely, I suffered from "session unavailable" errors because I used too many sessions in my application. Currently I have one default, non-cached session, which writes files to the SFTP server but sets up locks on them, so they can't be read by any other user. I'd like to avoid that; ideally, the lock would be released after a single file is uploaded. Is that possible?
Test structure
Start polling adapter
Upload file to remote
Check whether files are uploaded
Stop polling adapter
Clean up remote
When you deal with data transferred over the network, you need to be sure that you release the resources you use to do that. For example, be sure to close the InputStream after sending data to the SFTP server; it really isn't the framework's responsibility to close it automatically. Moreover, you could hand over a plain byte[] instead of an InputStream. An unclosed stream is the only reason I can think of for this locking-like behavior.
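As a small illustration of that point (SftpUploadGateway and its send method are hypothetical names, not part of Spring Integration), either hand over a byte[] so there is no stream to keep open, or close the stream yourself once the send returns:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class UploadExample {
    // Hypothetical gateway whose implementation sends the payload to the SFTP server.
    public interface SftpUploadGateway {
        void send(Object payload);
    }

    private final SftpUploadGateway sftpUploadGateway;

    public UploadExample(SftpUploadGateway sftpUploadGateway) {
        this.sftpUploadGateway = sftpUploadGateway;
    }

    // Option 1: pass a byte[] so no file handle stays open after the upload.
    public void uploadAsBytes(Path file) throws Exception {
        sftpUploadGateway.send(Files.readAllBytes(file));
    }

    // Option 2: if you must pass a stream, close it deterministically yourself.
    public void uploadAsStream(Path file) throws Exception {
        try (InputStream in = Files.newInputStream(file)) {
            sftpUploadGateway.send(in);
        }
    }
}
```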

Is there a way to find out which partition a message is written to when using EventHub.SendAsync(EventData)?

Is there a way to find out which partition a message is written to when using EventHub.SendAsync(EventData) from the Azure Event Hubs client SDK?
We intentionally do not provide a partition key so the Event Hubs service can do its internal load balancing, but we want to find out which partition the data is eventually written to, for diagnosing issues with the end-to-end data flow.
Ivan's answer is correct in the context of the legacy SDK (Microsoft.Azure.EventHubs), but the current generation (Azure.Messaging.EventHubs) is slightly different. You don't mention a specific language, but conceptually the answer is the same across them. I'll use .NET to illustrate.
If you're not using a call that requires specifying a partition directly when reading events, then you'll always have access to an object that represents the partition an event was read from. For example, if you're using the EventHubConsumerClient method ReadEventsAsync to explore, you'll see a PartitionEvent whose Partition property tells you the partition that the Data was read from.
When using the EventProcessorClient, your ProcessEventAsync handler will be invoked with a set of ProcessEventArgs where the Partition property tells you the partition that the Data was read from.
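Since the answer is conceptually the same across languages, here is a hedged sketch of the same shape in the current Java package (azure-messaging-eventhubs); the connection strings, hub name and checkpoint container are placeholders, and BlobCheckpointStore comes from the separate azure-messaging-eventhubs-checkpointstore-blob dependency.

```java
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class PartitionLoggingProcessor {
    public static void main(String[] args) {
        // Placeholder connection details.
        String eventHubsConnectionString = "<event-hubs-namespace-connection-string>";
        String eventHubName = "<event-hub-name>";
        String storageConnectionString = "<storage-connection-string>";

        BlobContainerAsyncClient checkpointContainer = new BlobContainerClientBuilder()
                .connectionString(storageConnectionString)
                .containerName("checkpoints")
                .buildAsyncClient();

        EventProcessorClient processor = new EventProcessorClientBuilder()
                .connectionString(eventHubsConnectionString, eventHubName)
                .consumerGroup("$Default")
                .checkpointStore(new BlobCheckpointStore(checkpointContainer))
                .processEvent(eventContext -> {
                    // The partition the event was read from is always available here.
                    String partitionId = eventContext.getPartitionContext().getPartitionId();
                    System.out.printf("Partition %s: %s%n",
                            partitionId, eventContext.getEventData().getBodyAsString());
                })
                .processError(errorContext ->
                        System.err.println("Error: " + errorContext.getThrowable()))
                .buildEventProcessorClient();

        processor.start();
        // ... later: processor.stop();
    }
}
```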
There is no direct way to do this, but there are two workarounds.
1. Use Event Hubs Capture to store the incoming events, then check the events in the specified blob storage. When the events are stored in blob storage, the path contains the partition ID, so you can read it from there.
2. Use code. Create a new consumer group and follow this article to read events. In that approach there is a method public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages); you can make use of the PartitionContext parameter to get the event's partition ID (via context.PartitionId).
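For reference, the legacy Java equivalent of that handler lives in IEventProcessor.onEvents from the (now deprecated) azure-eventhubs-eph package. This is only a hedged sketch of the processor class, with the host registration omitted:

```java
import com.microsoft.azure.eventhubs.EventData;
import com.microsoft.azure.eventprocessorhost.CloseReason;
import com.microsoft.azure.eventprocessorhost.IEventProcessor;
import com.microsoft.azure.eventprocessorhost.PartitionContext;

public class PartitionLoggingEventProcessor implements IEventProcessor {
    @Override
    public void onOpen(PartitionContext context) {
        System.out.println("Opened partition " + context.getPartitionId());
    }

    @Override
    public void onClose(PartitionContext context, CloseReason reason) {
        System.out.println("Closed partition " + context.getPartitionId() + ": " + reason);
    }

    @Override
    public void onEvents(PartitionContext context, Iterable<EventData> events) {
        // context.getPartitionId() tells you which partition these events were written to.
        for (EventData event : events) {
            System.out.printf("Partition %s: %s%n",
                    context.getPartitionId(), new String(event.getBytes()));
        }
    }

    @Override
    public void onError(PartitionContext context, Throwable error) {
        System.err.println("Error on partition " + context.getPartitionId() + ": " + error);
    }
}
```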

spring batch design advice for processing 50k files

We have more than 50k files coming in every day that need to be processed. For that we have developed POC apps with a design like this:
A polling app continuously picks up files from the FTP zone.
It validates each file and creates metadata in a DB table.
Another poller picks 10-20 files from the DB (only file ID and status) and delivers them to the slave apps as messages.
A slave app takes a message and launches a Spring Batch job, which reads the data, does business validation in processors, and writes the validated data to the DB / another file.
We used Spring Integration and Spring Batch for this POC.
Is it a good idea to launch a Spring Batch job in the slaves, or to implement the read, process and write logic directly as plain Java or Spring bean objects?
We need some insight on launching this job, where a slave can have 10-25 MDPs (Spring message-driven POJOs) and each of these MDPs launches a job.
Note: each file will have at most 30-40 thousand records.
Generally, using Spring Integration and Spring Batch for such tasks is a good idea; this is what they are intended for.
With regard to SpringBatch, you get the whole retry, skip and restart handling out of the box. Moreover, you have all these readers and writers that are optimised for bulk operations. This works very well and you only have to concentrate on writing the appropriate mappers and such stuff.
If you want to use plain java or spring bean objects, you will probably end up developing such infrastructure code by yourself... incl. all the needed effort for testing and so on.
Concerning your design:
Besides validating the file and creating the metadata entry, you could consider loading the entries directly into a database table. This would give you better "transactional" control if something fails. Your load job could look something like this (a code sketch follows at the end of this answer):
Step 1: a tasklet creates an entry in the metadata table, with columns like
    FILE_TO_PROCESS: XY.txt
    STATE: START_LOADING
    DATE: ...
    ATTEMPT: ... (first attempt)
Step 2: read and validate each line of the file and store it in a data table with columns like
    DATA: ........
    STATE: ...
    FK_META_TABLE: foreign key to the meta table
Step 3: update the metadata table entry
    STATE: LOAD_COMPLETED
So, as soon as your metadata table entry gets the state LOAD_COMPLETED, you know that all entries of the file have been validated and are ready for further processing.
If something fails, you can just fix the file and reload it.
Then, to process further, you could just have jobs which poll periodically and check whether there is new data in the database that should be processed. If more than one file has been loaded during the last period, simply process all files that are ready.
You could even have several slave processes polling from time to time. Just do a select-for-update on the state column of the metadata table, or use an optimistic locking approach, to prevent several slaves from trying to process the same entries.
With this solution, you don't need a message infrastructure and you can still scale the whole application without any problems.
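To make the three-step load job above a bit more concrete, here is a minimal sketch in Spring Batch 4 style (JobBuilderFactory/StepBuilderFactory); the table names, the FileRecord class and the injected reader/writer beans are placeholders I made up for illustration, not a complete implementation:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;

@Configuration
@EnableBatchProcessing
public class FileLoadJobConfig {

    @Bean
    public Job fileLoadJob(JobBuilderFactory jobs, StepBuilderFactory steps,
                           JdbcTemplate jdbcTemplate,
                           ItemReader<FileRecord> fileReader,        // e.g. a FlatFileItemReader
                           ItemWriter<FileRecord> dataTableWriter) { // e.g. a JdbcBatchItemWriter

        // Step 1: tasklet that registers the file in META_TABLE with STATE = START_LOADING.
        // In a real job the file name would come from job parameters instead of being hard-coded.
        Step registerFile = steps.get("registerFile")
                .tasklet((contribution, chunkContext) -> {
                    jdbcTemplate.update(
                        "INSERT INTO META_TABLE (FILE_TO_PROCESS, STATE, LOAD_DATE, ATTEMPT) " +
                        "VALUES (?, 'START_LOADING', CURRENT_TIMESTAMP, 1)",
                        "XY.txt");
                    return RepeatStatus.FINISHED;
                })
                .build();

        // Step 2: chunk-oriented step that reads/validates each line and writes it to DATA_TABLE.
        Step loadData = steps.get("loadData")
                .<FileRecord, FileRecord>chunk(1000)
                .reader(fileReader)
                .writer(dataTableWriter)
                .build();

        // Step 3: tasklet that flips the metadata entry to LOAD_COMPLETED.
        Step markCompleted = steps.get("markCompleted")
                .tasklet((contribution, chunkContext) -> {
                    jdbcTemplate.update(
                        "UPDATE META_TABLE SET STATE = 'LOAD_COMPLETED' WHERE FILE_TO_PROCESS = ?",
                        "XY.txt");
                    return RepeatStatus.FINISHED;
                })
                .build();

        return jobs.get("fileLoadJob")
                .start(registerFile)
                .next(loadData)
                .next(markCompleted)
                .build();
    }

    // Placeholder domain class for one validated line of the file.
    public static class FileRecord {
        public String data;
    }
}
```

The reader in step 2 would typically be a FlatFileItemReader and the writer a JdbcBatchItemWriter, with a processor in between for the business validation; skip, retry and restart handling then come for free from Spring Batch, as described above.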

Azure Queue Storage: Send files in messages

I am assessing Azure Queue Storage to communicate between two decoupled applications.
My requirement is to send a file (flat file, size: small to large) in the queue message.
From my reading, an individual message in a queue cannot exceed 64 KB, so sending a file of variable size in the message is out of the question.
Another solution I can think of is using a combination of Queue Storage and Blob storage, i.e. adding a reference to the file (in blob storage) to the queue message, and then, when required, reading the file from the blob (using the reference/address in the queue message).
My question is: is this the right approach, or are there other, more elegant ways of achieving this?
Thanks,
Sandeep
While there's no right approach, since you can put anything you want in a queue message (within size limits), consider this: If your file sizes can go over 64K, you simply cannot store these within a queue message, so you will have no other choice but to store your content somewhere else (e.g. blobs). For files under 64K, you'll need to decide whether you want two different methods for dealing with files, or just use blobs as your file source across the board and have a consistent approach.
Also remember that message-passing will eat up bandwidth and processing. If you store your files in queue messages, you'll need to account for this with high-volume message-passing, and you'll also need to extract your file content from your queue messages.
One more thing: If you store content in blobs, you can use any number of tools to manipulate these files, and your files remain in blob storage permanently (until you explicitly delete them). Queue messages must be deleted after processing, giving you no option to keep your file around. This is probably an important aspect to consider.
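For what it's worth, the "reference in the message, content in a blob" pattern from the question is only a few lines with the current Java storage SDKs (azure-storage-blob and azure-storage-queue); the connection string, container, queue and file names below are placeholders, and the container and queue are assumed to exist already.

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobContainerClientBuilder;
import com.azure.storage.queue.QueueClient;
import com.azure.storage.queue.QueueClientBuilder;

public class FileMessageProducer {
    public static void main(String[] args) {
        String connectionString = "<storage-connection-string>"; // placeholder

        BlobContainerClient container = new BlobContainerClientBuilder()
                .connectionString(connectionString)
                .containerName("incoming-files")
                .buildClient();

        QueueClient queue = new QueueClientBuilder()
                .connectionString(connectionString)
                .queueName("file-notifications")
                .buildClient();

        // 1. Upload the file content to blob storage (any size the blob service allows).
        String blobName = "2020-05-01/report.csv";
        BlobClient blob = container.getBlobClient(blobName);
        blob.uploadFromFile("/data/report.csv", true); // overwrite if it already exists

        // 2. Put only a small reference to the blob into the queue message (well under 64 KB).
        queue.sendMessage(blobName);

        // The consumer receives the message, reads the blob name from it, downloads the
        // content with container.getBlobClient(blobName).downloadToFile(...), and deletes
        // the queue message after successful processing.
    }
}
```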

Non-Idempotent Actions & Blob Leases on Azure

I have a multi-instance worker role.
It needs to do 2 things.
Download emails from a pop-inbox, save them, create a DB entry & then delete the emails
Download files from an FTP Server, save them, create a DB entry & then delete the files
Both these operations are time-sensitive and in a multi-instance environment, it's possible the second instance could pull duplicate copies of files/emails before the first instance goes back and deletes them.
I'm planning to implement a sync-lock mechanism around the main download method, which acquires a lease on a blob file. The goal is that it would act as a lock, preventing another instance from interfering for the duration of the download-save-delete operation. If anything goes wrong with instance 1 (i.e. it crashes), then the lease will eventually expire, the second instance will pick up where it left off on the next loop, and I can maintain my SLA.
Just wondering if this is a viable solution or if there are any gotchas I should be aware of?
Blob leases are a viable locking strategy across multiple servers.
However, I'd still be cautious and record the download of each individual email as a separate record, so that you minimize accidental double-downloading of the same email.
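As a rough sketch of the lease-as-lock idea with the Java storage SDK (azure-storage-blob): the lock blob name and the 60-second duration below are assumptions; leases can be 15-60 seconds or infinite, and a finite lease can be renewed while the work is running.

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClientBuilder;
import com.azure.storage.blob.models.BlobStorageException;
import com.azure.storage.blob.specialized.BlobLeaseClient;
import com.azure.storage.blob.specialized.BlobLeaseClientBuilder;

public class DownloadLock {
    public static void main(String[] args) {
        // Placeholder blob used purely as a lock object shared by all worker instances.
        // The blob must exist before a lease can be acquired on it (create it once up front).
        BlobClient lockBlob = new BlobContainerClientBuilder()
                .connectionString("<storage-connection-string>")
                .containerName("locks")
                .buildClient()
                .getBlobClient("download-lock");

        BlobLeaseClient lease = new BlobLeaseClientBuilder()
                .blobClient(lockBlob)
                .buildClient();

        try {
            // Acquire a 60-second lease; this fails if another instance already holds it.
            lease.acquireLease(60);
        } catch (BlobStorageException e) {
            // Another instance holds the lock: skip this cycle and try again on the next loop.
            return;
        }

        try {
            // ... download emails/files, save them, create DB entries, delete the originals ...
            // For long-running work, call lease.renewLease() periodically to keep the lock.
        } finally {
            lease.releaseLease(); // let the next instance in; if this process crashes, the lease simply expires
        }
    }
}
```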
