Preparing archive data for Stream Analytics import - Azure

Before I had time to get an ingestion strategy & process set up, I started collecting data that will eventually go through a Stream Analytics job. Now I'm sitting on an Azure blob storage container with over 500,000 blobs in it (no folder organization), another with 300,000, and a few others with 10,000 - 90,000.
The production collection process now writes these blobs to different containers in the YYYY-MM-DD/HH format, but that's only great going forward. This archived data is critical to get into my system, and I'd like to just modify the inputs a bit on the existing production ASA job so I can leverage the same logic in the query, functions, and other dependencies.
I know ASA doesn't like batches of more than a few hundred / thousand, so I'm trying to figure out a way to stage my data so it works well under ASA. This would be a one-time run...
One idea was to write a script that reads every blob, looks at the timestamp within the blob, and re-creates the YYYY-MM-DD/HH folder setup (a rough sketch of that idea is below), but in my experience the ASA job will fail when the blob's lastModified time doesn't match the folder it's in...
Any suggestions how to tackle this?
EDIT: Failed to mention that (1) there are no folders in these containers... all blobs live at the root of the container, and (2) the LastModifiedTime on the blobs is no longer useful or meaningful. The reason for the latter is that these blobs were collected from multiple other containers and merged together using the Azure CLI copy-batch command.
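To illustrate, the kind of script I had in mind would be something like the following minimal sketch, assuming the azure-storage-blob v12 Python SDK, that each blob is JSON with an "eventTime" field (a placeholder name), and placeholder container names. Note that re-uploading the blobs also resets their LastModified time to "now", which is part of the problem above:

import json
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
source = service.get_container_client("archive-flat")         # placeholder: the flat archive container
target = service.get_container_client("archive-partitioned")  # placeholder: the re-staged container

for props in source.list_blobs():
    data = source.download_blob(props.name).readall()
    event_time = json.loads(data)["eventTime"]                 # placeholder field, e.g. "2016-05-01T13:22:04Z"
    prefix = f"{event_time[:10]}/{event_time[11:13]}"          # -> "2016-05-01/13"
    target.upload_blob(f"{prefix}/{props.name}", data, overwrite=True)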

Can you please try the below?
Do this processing in two different jobs: one for the folders with date partitioning (say, partitionedJob), and another for the old blobs without any date partitioning (say, RefillJob).
Since RefillJob has a fixed number of blobs, put a predicate on System.Timestamp to make sure that it only processes old events. Start this job with at least 6 SUs and run it until all the events have been processed. You can confirm this by looking at LastOutputProcessedTime, by looking at the input event count, or by inspecting your output source. After this check, stop the job; it is no longer needed.
Start the partitionedJob with a timestamp greater than the RefillJob's cutoff. This assumes the folders for those timestamps exist.

Related

Best option for storage in Spark

A third party is producing a complete daily snapshot of their database table (Authors) and is storing it as a Parquet file in S3. Currently there are around 55 million+ records. This will increase daily. There are 12 columns.
Initially I want to take this whole dataset, do some processing on the records, normalise them, and then block them into groups of authors based on some specific criteria. I will then need to repeat this process daily and filter it to only include authors that have been added or updated since the previous day.
I am using AWS EMR on EKS (Kubernetes) as my Spark cluster. My current thought is that I can save my blocks of authors on HDFS.
The main use for the blocks of data will be a separate Spark Streaming job that will be deployed onto the same EMR cluster. It will read events from a Kafka topic, do a quick search to see which blocks of data are related to that event, and then do some matching (pairwise) against each item in that block.
I have two main questions:
Is using HDFS a performant and viable option for this use case?
The third party database table dump is only the initial goal. Later on there will quite possibly be 10s or even 100s of other sources that I would need to match against, which means trillions of records to block, and those blocks need to be stored somewhere. Would this option still be viable at that stage?
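For reference, a minimal PySpark sketch of the daily flow described above; the S3 path, the "updated_at" column, and the surname-based blocking key are all placeholder assumptions, and deduplication of authors updated across days is left out:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("author-blocking").getOrCreate()

# Read today's full snapshot (placeholder path) and keep only rows added or
# updated since the previous run.
snapshot = spark.read.parquet("s3a://thirdparty-bucket/authors/2024-01-02/")
delta = snapshot.filter(F.col("updated_at") > F.lit("2024-01-01"))

# Normalise and derive a blocking key (placeholder: first letter of the surname).
blocked = delta.withColumn("block_key", F.upper(F.substring("surname", 1, 1)))

# Persist the blocks partitioned by key so the streaming job can read only the
# partitions relevant to an incoming Kafka event.
blocked.write.mode("append").partitionBy("block_key").parquet("hdfs:///author_blocks/")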

Batch Copy/Delete some blobs in container

I have many thousands of containers and each container has up to 10k blobs inside. I have a list of (container, blob) tuples to
copy to another storage account
delete later from the original storage account
The blobs in the containers are not related to each other: random creation dates, random names (GUIDs), nothing in common.
Q: Is there any efficient way to do these operations?
I have already looked at az-cli and azcopy and haven't found a good way.
I tried, e.g., calling azcopy repeatedly for each tuple, but this would take ages. One call to copy a blob took 2 seconds on average. So it's nice that it starts the operation in the background, but if this "starting the operation" takes about 2 seconds, it's pretty useless for my case.
I'm assuming, based on the comments, that within each container it's an arbitrary number (and naming) of blobs to copy and delete, and that the delete applies only to the blobs that were copied (not the full container). If so, and if you want to use something besides REST, one suggestion would be a PowerShell script that reads the list of blobs to copy from a file, does a service-side copy, and then separately does the delete (it's more efficient to do the copy and, only if it succeeds, the delete), e.g. https://learn.microsoft.com/en-us/powershell/module/az.storage/get-azstorageblobcopystate?view=azps-4.7.0#example-4--start-copy-and-pipeline-to-get-the-copy-status
Cheers, Klaas [Microsoft]
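If PowerShell isn't a requirement, roughly the same copy-then-delete flow can be sketched with the azure-storage-blob v12 Python SDK. The connection strings and the tuple list are placeholders, the target containers are assumed to exist, and the source blob URL is assumed to be reachable by the copy service (same account or a SAS URL):

import time
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import BlobServiceClient

src = BlobServiceClient.from_connection_string("<source-connection-string>")
dst = BlobServiceClient.from_connection_string("<target-connection-string>")

def move(container: str, blob: str) -> None:
    source = src.get_blob_client(container, blob)
    target = dst.get_blob_client(container, blob)
    target.start_copy_from_url(source.url)   # server-side copy; assumes the URL is accessible (e.g. via SAS)
    while target.get_blob_properties().copy.status == "pending":
        time.sleep(1)
    if target.get_blob_properties().copy.status == "success":
        source.delete_blob()                 # delete only after the copy has succeeded

pairs = [("container-a", "blob-1"), ("container-b", "blob-2")]   # the (container, blob) tuples
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(lambda p: move(*p), pairs))                    # start many copies in parallel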

Incremental Data Storage

I have time series daily data which I run a model on. The model runs in Spark.
I only want to run the model daily, and append the results to the historic results. It is important to have a 'merged single data source' containing historical data for the model to run successfully.
I have to use an AWS service to store the results. If I store them in S3, I will end up storing the backfill plus one file per day (too many files). If I store them in Redshift, it doesn't do merge + upsert, so it becomes complicated. The customer-facing data is in Redshift, so dropping the table and reloading it daily is not an option.
I am not sure how to cleverly (defined as minimal cost and minimal subsequent processing) store the incremental data without re-processing everything daily to get a single file.
S3 is still your best shot. Since your job doesn't seem to need to be accessed in a real-time fashion, it's more of a rolling data set.
If you are worried about the number of files it generates, there are at least 2 things you can do:
S3 object lifecycle management
You can define your objects to be removed, or transitioned to another (cheaper) storage class, after x days.
More examples: https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-configuration-examples.html
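As a rough boto3 illustration of such a rule (the bucket name, prefix, storage class, and day counts are placeholders):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-results-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-daily-results",
                "Filter": {"Prefix": "daily/"},
                "Status": "Enabled",
                # Move daily files to a cheaper class after 30 days, delete after a year.
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)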
S3 notification
Basically you can set up a listener on your S3 bucket, 'listening for' all the objects that match your specified prefix and suffix, to trigger other AWS services. One easy thing you can do is trigger a Lambda, do your processing, and then do whatever you would like.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-event-notifications.html
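A hedged boto3 sketch of that wiring; the bucket name, Lambda ARN, prefix, and suffix are placeholders, and the Lambda's resource policy must already allow S3 to invoke it:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-results-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-daily-file",
                "Events": ["s3:ObjectCreated:*"],
                # Only fire for the daily Parquet drops.
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "daily/"},
                            {"Name": "suffix", "Value": ".parquet"},
                        ]
                    }
                },
            }
        ]
    },
)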
Use S3 as your database whenever possible. It's damn cheap and it's AWS's backbone.
You can also switch to an ETL tool. A very efficient one, which is open source, specialized in big data, fully automatable, and easy to use, is the Pentaho Data Integrator.
It comes equipped with ready-made plugins for S3, Redshift (and others), and there is a single step to compare with previous values. In my experience it runs pretty fast. Plus, it works for you during the night and sends you a morning mail saying everything went OK (or not).
Note to the moderators: this is an agnostic point of view; I could have recommended many others, but this one seems the most suited to the OP's need.

Data Lake Analytics U-SQL EXTRACT speed (Local vs Azure)

I've been looking into using Azure Data Lake Analytics to manipulate some gzipped XML data I have stored in Azure Blob Storage, but I'm running into an interesting issue. Essentially, when using U-SQL locally to process 500 of these XML files the processing time is extremely quick, roughly 40 seconds using 1 AU locally (which appears to be the limit). However, when we run this same functionality from within Azure using 5 AUs, the processing takes 17+ minutes.
We eventually want to scale this up to ~20,000 files and more, but have reduced the set to try and measure the speed.
Each file contains a collection of 50 XML objects (with varying amounts of detail within child elements); the files are roughly 1 MB when gzipped and between 5 MB and 10 MB when not. 99% of the processing time is spent within the EXTRACT section of the U-SQL script.
Things tried:
Unzipped the files before processing; this took roughly the same time as the zipped version, certainly nowhere near the 40 seconds I was seeing locally.
Moved the data from Blob Storage to Azure Data Lake Storage; this took exactly the same length of time.
Temporarily removed about half of the data from the files and re-ran; surprisingly this didn't take more than a minute off either.
Added more AUs to speed up the processing; this worked extremely well but isn't a long-term solution due to the costs that would be incurred.
It seems to me as if there is a major bottleneck when getting the data from Azure Blob Storage/Azure Data Lake. Am I missing something obvious?
P.S. Let me know if you need any more information.
Thanks,
Nick.
See slide 31 of https://www.slideshare.net/MichaelRys/best-practices-and-performance-tuning-of-usql-in-azure-data-lake-sql-konferenz-2018. There is a preview option
SET @@FeaturePreviews = "InputFileGrouping:on";
which groups small files into a limited number of vertices.

update 40+ million entities in azure table with many instances how to handle concurrency issues

So here is the problem. I need to update about 40 million entities in an Azure table. Doing this with a single instance (select -> delete original -> insert with new PartitionKey) will take until about Christmas.
My thought is to use an Azure worker role with many instances running. The problem here is that the query grabs the top 1000 records. That's fine with one instance, but with 20 running, their selects will obviously overlap... a lot. This would result in a lot of wasted compute trying to delete records that were already deleted by another instance and updating records that have already been updated.
I've run through a few ideas, but the best option I have is to have the roles fill up a queue with partition and row keys, then have the workers dequeue and do the actual processing?
Any better ideas?
Very interesting question!!! Extending @Brian Reischl's answer (and a lot of it is thinking out loud, so please bear with me :))
Assumptions:
Your entities are serializable in some shape or form. I would assume that you'll get raw data in XML format.
You have one separate worker role which is doing all the reading of entities.
You know how many worker roles would be needed to write modified entities. For the sake of argument, let's assume it is 20 as you mentioned.
Possible Solution:
First you will create 20 blob containers. Let's name them container-00, container-01, ... container-19.
Then you start reading entities - 1000 at a time. Since you're getting raw data in XML format out of table storage, you create an XML file and store those 1000 entities in container-00. You fetch the next set of entities and save them in XML format in container-01, and so on and so forth until you hit container-19. Then the next set of entities goes into container-00. This way you're evenly distributing your entities across all 20 containers.
Once all the entities are written, your worker role for processing these entities comes into the picture. Since we know that instances in Windows Azure are sequentially ordered, you get instance names like WorkerRole_IN_0, WorkerRole_IN_1, ... and so on.
What you would do is take the instance name and get the number: "0", "1", etc. Based on this you would determine which worker role instance reads from which blob container... WorkerRole_IN_0 will read files from container-00, WorkerRole_IN_1 will read files from container-01, and so on.
Now each individual worker role instance will read an XML file, create the entities from that XML file, update those entities, and save them back into table storage. Once this process is done, it deletes the XML file and moves on to the next file in that container. Once all files are read and processed, you can just delete the container.
As I said earlier, this is very much a "thinking out loud" kind of solution, and some things must still be considered, like what happens when the "reader" worker role goes down.
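To make the staging step concrete, here is a rough sketch of the round-robin fan-out using today's Python SDKs (azure-data-tables and azure-storage-blob) rather than the worker-role-era libraries; the table name, container names, and XML shape are all illustrative:

from azure.data.tables import TableClient
from azure.storage.blob import BlobServiceClient

conn = "<storage-connection-string>"
table = TableClient.from_connection_string(conn, table_name="Entities")   # placeholder table
blobs = BlobServiceClient.from_connection_string(conn)
containers = [blobs.get_container_client(f"container-{i:02d}") for i in range(20)]

batch, batch_index = [], 0
for entity in table.list_entities():
    batch.append(entity)
    if len(batch) == 1000:
        # Only the keys are serialized here; a real staging file would carry every property.
        xml = "<entities>" + "".join(
            f'<entity pk="{e["PartitionKey"]}" rk="{e["RowKey"]}"/>' for e in batch
        ) + "</entities>"
        # Round-robin across the 20 containers so each writer instance gets an even share.
        containers[batch_index % 20].upload_blob(f"batch-{batch_index:06d}.xml", xml)
        batch, batch_index = [], batch_index + 1
# (A final partial batch would still need to be flushed the same way.)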
If your PartitionKeys and/or RowKeys fall into a known range, you could attempt to divide them into disjoint sets of roughly equal size for each worker to handle. eg, Worker1 handles keys starting with 'A' through 'C', Worker2 handles keys starting with 'D' through 'F', etc.
If that's not feasible, then your queuing solution would probably work. But again, I would suggest that each queue message represent a range of keys if possible. eg, a single queue message specifies deleting everything in the range 'A' through 'C', or something like that.
In any case, if you have multiple entities in the same PartitionKey then use batch transactions to your advantage for both inserting and deleting. That could cut down the number of transactions by almost a factor of ten in the best case. You should also use parallelism within each worker role. Ideally use the async methods (either Begin/End or *Async) to do the writing, and run several transactions (12 is probably a good number) in parallel. You can also run multiple threads, but that's somewhat less efficient. In either case, a single worker can push a lot of transactions with table storage.
As a side note, your process should go "Select -> Insert New -> Delete Old". Going "Select -> Delete Old -> Insert New" could result in permanent data loss if a failure occurs between steps 2 & 3.
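For the batching part, a minimal sketch with the current azure-data-tables Python SDK, assuming placeholder table and key names, and keeping the safer Select -> Insert New -> Delete Old order:

from azure.data.tables import TableClient

table = TableClient.from_connection_string("<storage-connection-string>", table_name="Entities")

def move_partition(old_entities, new_partition_key):
    # Insert copies under the new PartitionKey first, then delete the originals,
    # so a failure in between cannot lose data. Each transaction holds at most
    # 100 operations and all of its entities share one PartitionKey.
    new_entities = [{**e, "PartitionKey": new_partition_key} for e in old_entities]
    for i in range(0, len(new_entities), 100):
        table.submit_transaction([("create", e) for e in new_entities[i:i + 100]])
    for i in range(0, len(old_entities), 100):
        table.submit_transaction([("delete", e) for e in old_entities[i:i + 100]])

old = list(table.query_entities("PartitionKey eq 'old-key'"))   # placeholder keys
move_partition(old, "new-key")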
I think you should mark your question as the answer ;) I can't think of a better solution since I don't know what your partition and row keys look like. But to enhance your solution, you may choose to pump multiple partition/row keys into each queue message to save on transaction costs. Also, when consuming from the queue, get messages in batches of 32 and process them asynchronously. I was able to transfer 170 million records from SQL Server (Azure) to Table storage in less than a day.
