Azure Queue Storage: Send files in messages - azure

I am assessing Azure Queue Storage to communicate between two decoupled applications.
My requirement is to send a file (flat file, size: small to large) in the queue message.
From what I have read, an individual message in a queue cannot exceed 64 KB, so sending a file of variable size in the message itself is out of the question.
Another solution I can think of is using a combination of Queue Storage and blob storage, i.e. in the queue message add a reference to the file (on blob storage) and then when required read the file from the blob (using the reference/address in the queue message).
My question is: is this the right approach, or are there more elegant ways of achieving this?
Thanks,
Sandeep

While there's no single right approach, since you can put anything you want in a queue message (within size limits), consider this: if your file sizes can go over 64 KB, you simply cannot store them in a queue message, so you will have no choice but to store your content somewhere else (e.g. blobs). For files under 64 KB, you'll need to decide whether you want two different methods for dealing with files, or whether to use blobs as your file source across the board and have a consistent approach.
Also remember that message-passing will eat up bandwidth and processing. If you store your files in queue messages, you'll need to account for this with high-volume message-passing, and you'll also need to extract your file content from your queue messages.
One more thing: If you store content in blobs, you can use any number of tools to manipulate these files, and your files remain in blob storage permanently (until you explicitly delete them). Queue messages must be deleted after processing, giving you no option to keep your file around. This is probably an important aspect to consider.
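The blob-plus-queue approach described above can be sketched roughly as follows. This is only an illustration: `blob_store`, `queue`, `send_file` and `receive_file` are hypothetical in-memory stand-ins, not Azure SDK calls, and the message layout is an assumption.

```python
import json
import uuid

MAX_QUEUE_MESSAGE_BYTES = 64 * 1024  # queue message size limit discussed above

# Hypothetical in-memory stand-ins for a blob container and a storage queue.
blob_store = {}
queue = []


def send_file(file_name: str, content: bytes) -> str:
    """Store the file as a 'blob' and enqueue only a small reference to it."""
    blob_name = f"{uuid.uuid4()}/{file_name}"
    blob_store[blob_name] = content
    message = json.dumps({"blob_name": blob_name, "size": len(content)})
    # The reference stays tiny regardless of how large the file is.
    if len(message.encode("utf-8")) > MAX_QUEUE_MESSAGE_BYTES:
        raise ValueError("reference message exceeds the queue size limit")
    queue.append(message)
    return blob_name


def receive_file() -> bytes:
    """Dequeue the next reference and read the file from the 'blob' store."""
    message = json.loads(queue.pop(0))
    return blob_store[message["blob_name"]]
```

Note that after `receive_file` the message is gone from the queue, but the content is still in `blob_store` until something deletes it, which mirrors the point about blobs outliving queue messages.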

Related

Azure storage queue triggered function starts multiple times

All,
I have a storage-queue-triggered Azure Function. It loads various data into a database from files. I specify the input file in the message sent to the input queue.
However, when I send a message to the queue, my function starts in multiple instances and tries to insert the same file into the db. If I log msg.dequeue_count I see it rising.
What should I do to start only one function for each message? Please note I'd like to keep the possibility of starting multiple instances for multiple messages, so that different files can be loaded in parallel.
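This question has also been asked elsewhere, and the answer there was to check out the chart comparing Storage queues and Service Bus queues.
Bottom line: storage queues offer 'at least once' delivery, so the same message can be delivered more than once (the rising dequeue count reflects this). If you want 'at most once', you should use Service Bus in ReceiveAndDelete mode; PeekLock, like storage queues, is 'at least once'.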
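If you stay on storage queues, one common way to live with 'at least once' delivery is to make the handler idempotent, so a repeated delivery of the same message does no harm. A minimal sketch, assuming the file name identifies the work; `processed_files`, `loaded_rows` and `handle_message` are hypothetical names, and in practice the "already processed" record would live in a durable store (for example a database table with a unique constraint) so that the check and the mark are atomic across instances.

```python
# Hypothetical stand-ins: a durable "already processed" record and the target database.
processed_files = set()
loaded_rows = []


def handle_message(file_name: str) -> bool:
    """Load a file at most once; return False when the delivery is a duplicate."""
    if file_name in processed_files:
        return False  # duplicate delivery: skip the load
    loaded_rows.append(f"rows from {file_name}")  # placeholder for the real load
    processed_files.add(file_name)
    return True
```

Messages for different files still run independently, so parallel loading of different files is unaffected.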

DataLake locks on read and write for the same file

I have 2 different applications that handle data from Data Lake Storage Gen1.
The first application uploads files: if there are multiple uploads on the same day, the existing file is overwritten (there is always one file per day, named using the YYYY-MM-dd format).
The second application reads the data from the files.
Is there an option to lock these operations, so that no read takes place while a write is being performed and, likewise, a write waits until any read in progress has finished?
I did not find any option using the AdlsClient.
Thanks.
As far as I know, ADLS Gen1 is an Apache Hadoop file system that is compatible with the Hadoop Distributed File System (HDFS). So I searched some HDFS documentation, and I'm afraid you can't directly enforce mutual exclusion between reads and writes. Please see the documents below:
1. link1: https://www.raviprak.com/research/hadoop/leaseManagement.html
writers must obtain an exclusive lock for a file before they’d be
allowed to write / append / truncate data in those files. Notably,
this exclusive lock does NOT prevent other clients from reading the
file, (so a client could be writing a file, and at the same time
another could be reading the same file).
2. link2: https://blog.cloudera.com/understanding-hdfs-recovery-processes-part-1/
Before a client can write an HDFS file, it must obtain a lease, which is essentially a lock. This ensures the single-writer semantics. The lease must be renewed within a predefined period of time if the client wishes to keep writing. If a lease is not explicitly renewed or the client holding it dies, then it will expire. When this happens, HDFS will close the file and release the lease on behalf of the client so that other clients can write to the file. This process is called lease recovery.
Here is a workaround for your reference: put a Redis database in front of your reads and writes.
Before any read or write operation, first check whether a specific key exists in Redis. If it does not, write the key-value pair into Redis, then run your business logic, and finally don't forget to delete the key.
Although this may be a little cumbersome and may affect performance, I think it can meet your needs. By the way, since the business logic may fail or crash so that the key is never released, you can set a TTL when creating the key to avoid this situation.
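The key-with-TTL idea can be sketched like this. To keep it runnable without a server, `TtlLockStore` is a hypothetical in-memory stand-in for Redis, and `run_exclusively` is an illustrative helper name. With Redis itself, the check and the write should be one atomic step, which is what SET with the NX and EX options provides.

```python
import time


class TtlLockStore:
    """In-memory stand-in for a key-value store whose keys expire (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expires_at = {}

    def acquire(self, key: str, ttl_seconds: float) -> bool:
        """Set the key only if it is absent or expired; return True on success."""
        now = self._clock()
        expires_at = self._expires_at.get(key)
        if expires_at is not None and expires_at > now:
            return False  # key is still held by someone else
        self._expires_at[key] = now + ttl_seconds
        return True

    def release(self, key: str) -> None:
        """Delete the key so the next reader or writer can proceed."""
        self._expires_at.pop(key, None)


def run_exclusively(store, file_name, action, ttl_seconds=30.0) -> bool:
    """Run action() while holding the lock for file_name; False if the lock is busy."""
    key = f"lock:{file_name}"
    if not store.acquire(key, ttl_seconds):
        return False
    try:
        action()
    finally:
        store.release(key)  # always delete the key, even if the action fails
    return True
```

Both the reader and the writer would call `run_exclusively` with the same file name and retry when it returns False. The TTL should be longer than the slowest expected read or write, otherwise the key can expire while the operation is still running.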

How do I remove events from Eventhub

I might be confused about how Event Hubs is supposed to be used, or I need guidance on how to reliably process events posted to an event hub. I export the Azure Activity Log to Event Hubs and am currently just using a console application to read those messages. What I don't understand is what I'm supposed to do with events that I have already read and processed. Say I want to write content of all messages into Storage account AppendLog. For this I need to delete messages which I already processed (as would be done with a message queue). How do I do that with Event Hubs?
You cannot delete them. From the docs:
Event Hubs retains data for a configured retention time that applies across all partitions in the event hub. Events expire on a time basis; you cannot explicitly delete them.
Back to your question:
Say I want to write content of all messages into Storage account AppendLog. For this I need to delete messages which I already processed
I am not sure why you need this, though. You can keep a pointer to the last message you read, so that you process only new messages. Why would you need to delete the older ones? You can read about offsets and checkpointing in the Event Hubs documentation.
What technique are you using for reading the messages?
If you need a pattern where messages are popped off as they are consumed, you need a Queue or Topic from Azure Service Bus.
When you acknowledge a message, it is removed from the queue.
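The "pointer to the last read message" idea looks roughly like this. It is a sketch only: offsets are modeled as increasing integers per partition, `checkpoints` is a hypothetical stand-in for a durable checkpoint store, and `process_new_events` is an illustrative name rather than an Event Hubs SDK call.

```python
# Hypothetical durable store: partition id -> offset of the last processed event.
checkpoints = {}


def process_new_events(partition_id, events, handler) -> int:
    """Handle only events past the checkpoint; events are (offset, body) in offset order."""
    last = checkpoints.get(partition_id, -1)
    handled = 0
    for offset, body in events:
        if offset <= last:
            continue  # already processed on an earlier read; nothing to delete
        handler(body)
        checkpoints[partition_id] = offset  # advance the pointer after success
        last = offset
        handled += 1
    return handled
```

Re-reading the same partition is harmless here: events at or before the checkpoint are skipped, and the hub itself drops them once the retention time passes.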

Parallelize Azure Logic App executions when copying a file from SFTP to Blob Storage

I have an Azure Logic App which gets triggered when a new file is added or modified in an SFTP server. When that happens the file is copied to Azure Blob Storage and then gets deleted from the SFTP server. This operation takes approximately 2 seconds per file.
The only problem I have is that these files (on average 500 KB) are processed one by one. Given that I'm looking to transfer around 30,000 files daily, this approach becomes very slow (something around 18 hours).
Is there a way to scale out/parallelize these executions?
I am not sure whether Azure Logic Apps offers a way to scale out or parallelize executions. But based on my experience, if the timeliness requirements are not very high, you could use a For each loop: its parallelism limit is 50 and the default is 20.
In your case, my suggestion is to use a loop around the trigger: whenever a new file is added or modified on the SFTP server, insert a queue message with the file path as its content into an Azure Storage queue, and end the loop based on elapsed time or queue length. You then have a collection of queue messages. Finally, in the For each action, fetch each queue message, fetch the corresponding file from the SFTP server, and create the blob.
If you're using C#, use Parallel.ForEach like Tom Sun said. If you go that way, I also recommend using the async/await pattern for the I/O operation (saving to blob). It frees up the executing thread while the file is being saved, so it can serve other requests.
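For a rough idea of what bounded parallelism buys: at 2 seconds per file, 30,000 files take about 60,000 seconds (roughly 17 hours) one by one, while 20 copies in flight would ideally bring that down to about 50 minutes, assuming the SFTP server and the network keep up. A minimal sketch of the same idea outside Logic Apps, where `copy_file` is a hypothetical stand-in for "download from SFTP, upload to blob, delete from SFTP":

```python
import time
from concurrent.futures import ThreadPoolExecutor


def copy_file(path: str) -> str:
    """Hypothetical stand-in for the SFTP-to-blob copy of one file."""
    time.sleep(0.01)  # simulates waiting on network I/O
    return f"copied:{path}"


def copy_all(paths, max_workers=20):
    """Copy files with at most max_workers transfers in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input paths.
        return list(pool.map(copy_file, paths))
```

The `max_workers` value plays the same role as the For each concurrency setting: it caps how many transfers run at the same time.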

Azure Storage Performance Queue vs Table

I've got a nice logging system I've set up that writes to Azure Table Storage and it has worked well for a long time. However, there are certain places in my code where I need to now write a lot of messages to the log (50-60 msgs) instead of just a couple. It is also important enough that I can't start a new thread to finish writing to the log and return the MVC action before I know the log is successful because theoretically that thread could die. I have to write to the log before I return data to the web user.
According to the Azure dashboard, Table Storage transactions take ~37ms to commit, end to end (E2E), while queues only take ~6ms E2E to commit.
I'm now considering not logging directly to table storage, and instead log to an Azure Queue, then have a batch job run that reads off the queue and then puts them in their proper place in table storage. That way I can still index them properly via their partition and row keys. I can also write just a single queue message with all of the log entries. So it should only take 6ms instead of (37 * 50) ms.
I know that there are Table Storage batch operations. However, each of the log entries typically goes to a different partition, and batch ops need to stay within a single partition.
I know that queue messages only live for 7 days, so I'll make sure I store queue messages in a new mechanism if they're older than a day (if it doesn't work the first 50 times, it just isn't going to work).
My question, then, is: what am I not thinking about? How could this completely kick me in the balls 4 months down the road?
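One thing worth checking in this design is message size: 50-60 log entries in a single queue message still have to fit under the queue's 64 KB message limit. A rough sketch of the packing step and of the batch job's grouping step, assuming entries are JSON-serializable dicts with a `PartitionKey` field; `pack_log_entries` and `group_by_partition` are hypothetical helper names.

```python
import json
from collections import defaultdict

MAX_QUEUE_MESSAGE_BYTES = 64 * 1024  # queue message size limit


def pack_log_entries(entries):
    """Serialize entries into as few JSON messages as fit under the size limit."""
    messages, current = [], []
    for entry in entries:
        candidate = current + [entry]
        too_big = len(json.dumps(candidate).encode("utf-8")) > MAX_QUEUE_MESSAGE_BYTES
        if current and too_big:
            messages.append(json.dumps(current))  # flush what fits so far
            current = [entry]  # a single oversized entry is not split here
        else:
            current = candidate
    if current:
        messages.append(json.dumps(current))
    return messages


def group_by_partition(message):
    """Batch-job side: group entries so each group can go into one table batch."""
    groups = defaultdict(list)
    for entry in json.loads(message):
        groups[entry["PartitionKey"]].append(entry)
    return dict(groups)
```

With small entries everything lands in one message, so the web request pays for a single queue write; the batch job then writes one table batch per partition key.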
