Why does Stream Analytics create separate files when writing to Azure Data Lake or Azure Blob Storage? Sometimes the stream writes to the same file for days, while at other times a couple of new files are created every day. It seems rather random.
I output the data to CSV, the query stays the same, and every now and then a new file is generated.
I would prefer to have one large CSV file, because I want to run long-term statistics on the data with Power BI, but this seems impossible when the data is split across separate files with seemingly random names.
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-define-outputs - this page has details about when a new file is created. In your case, it is most likely due to an internal restart.
I'm working with Azure Data Factory to copy .txt files from an FTP site. I'm using a binary transfer approach with binary format datasets, but ADF is showing incredibly slow throughput (90 KB/s), so it takes hours to transfer a 4 GB file, which isn't particularly large.
The FTP site is in the US while ADF runs in Europe, but from a VM in the same Europe data center I can download the file from the FTP site in a few minutes. Something seems off; any idea why ADF cannot retrieve a 4 GB .txt file in a reasonable time? I'm copying to Blob Storage and using an Azure IR for compute.
The pipeline runs for 6-7 hours, which seems absurd for a reasonably sized file. I have tried different formats (reading directly as delimited text), but it remains absurdly slow. I assume the FTP server has reasonable download speeds, given that I can retrieve the file on a desktop in 4-5 minutes. In the ADF monitoring view I can see the activity is continually "reading from source", yet the amount of data read does not change, so I wonder whether it keeps dropping the connection.
Any thoughts would help!
I solved this by reading the file in its native format (delimited text) instead of doing a binary transfer, and writing it out as chunked Parquet files in Blob Storage. I set a maximum number of rows per output file, which forces physical files to be committed as the copy progresses.
For whatever reason, ADF was struggling to read the entire 4 GB file and then write it out in one go. It seems counterintuitive to move away from a binary copy, but it appears to be the only option in this case.
I am trying to design a timer-triggered processor (all in Azure) that will process a set of records laid out for it to consume. It will group them based on a column, create files from the groups, and dump them in a blob container. The records it consumes are generated based on an event: the event is raised with a key, which can be used to query the different services that hold the data for the record.
This is what I am currently thinking:
1. An event is raised to an Event Grid topic.
2. An Azure Function (ConsumerApp) is event-triggered; it reads the key, calls a service API to get all the data, and stores the record in a storage table with a flag marking it ready to be consumed.
3. An Azure Function (ProcessorApp) is timer-triggered; it reads from the storage table, groups the records based on another column, and creates and dumps them as files. It can then mark the records as processed, if not already updated by ConsumerApp.
Apart from any suggestions for a different or better way to do this, some of my questions are:
The table storage is going to fill up quickly, which will in turn slow down reading the 'ready' records. Is there a better way to store this intermediate, temporary data? One thing I considered was to regularly flush the table, or to have the consumer app delete each record instead of marking it as 'processed'.
The service API is called once per event, which might increase the strain on that service and its database. Should I group the records into a single API call, since the processor only runs after a set interval, or is there a better approach here?
Any feedback on this approach or a new design will be appreciated.
If you don't have to process the data from Step 2 individually, you can save it to a blob as well and store only the blob path in Azure Table Storage, keeping the row count minimal.
Azure Table Storage has partitions that you can use to partition your data and keep read operations fast: a partition scan is much faster than a full table scan. Azure Table Storage is also cheap, but if pricing is a concern, you can write a cleanup function that periodically removes processed rows. Keeping processed rows around for a reasonable time is usually a good idea, though, because you may need them when debugging issues.
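For illustration, here is a rough Java sketch of that pattern (the connection string variable, container, table, and property names are placeholders): the consumer uploads the payload to a blob and only writes a small pointer row, partitioned by the grouping column so the processor can do a partition scan instead of a table scan.

```java
import com.azure.data.tables.TableClient;
import com.azure.data.tables.TableClientBuilder;
import com.azure.data.tables.models.TableEntity;
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobClientBuilder;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class ConsumerSketch {
    // Stores one record: the payload goes to a blob, Table Storage only
    // keeps a pointer row plus a "processed" flag.
    public static void storeRecord(String groupColumn, String eventKey, String payloadJson) {
        String blobName = groupColumn + "/" + eventKey + ".json";

        BlobClient blob = new BlobClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .containerName("pending-records")   // placeholder container name
                .blobName(blobName)
                .buildClient();
        byte[] bytes = payloadJson.getBytes(StandardCharsets.UTF_8);
        blob.upload(new ByteArrayInputStream(bytes), bytes.length, true);

        TableClient table = new TableClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .tableName("PendingRecords")        // placeholder table name
                .buildClient();

        // Partitioning by the grouping column keeps the processor's read
        // to a single partition rather than the whole table.
        TableEntity row = new TableEntity(groupColumn, eventKey)
                .addProperty("BlobPath", blobName)
                .addProperty("Processed", false);
        table.createEntity(row);
    }
}
```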
By batching multiple calls into a single call, you can reduce network I/O delay, but resource contention will remain at the service level. If possible, you can move that API to a separate service so it can be scaled independently.
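As a sketch of what that batching could look like, assuming the service exposes (or can expose) a hypothetical /records/batch endpoint that accepts a JSON array of keys, the processor would collect the pending keys and issue one request instead of one call per event:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.stream.Collectors;

public class BatchedServiceCall {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Sends all pending keys in one request instead of one call per event.
    // The endpoint URL and payload shape are hypothetical -- adapt to the service.
    public static String fetchRecords(List<String> keys) throws Exception {
        String body = keys.stream()
                .map(k -> "\"" + k + "\"")
                .collect(Collectors.joining(",", "[", "]"));

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://the-service.example.com/records/batch"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```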
We have an XML file arriving in an Azure file storage path every day, and we load it into an Azure SQL database using an ADF Copy activity. The source is an XML dataset referring to the file and the sink is a table in the database. The copy activity completes in less than 3 minutes when the file is around 500 MB, but when we tried a 680 MB file it ran for nearly 5 hours. We cannot find the reason behind this huge increase in time. We tried changing the DIU and parallelism settings, but it didn't help.
Any idea why the loading time increases so dramatically?
Does the ADF XML copy activity have a file size limit?
Is there any way to reduce the processing time, apart from rewriting the logic in an Azure Function?
Any help or suggestion is appreciated! Thanks
I am developing a log-enrichment Kafka Streams job. The plan is to use a cache file on Azure Blob Storage to enrich log entries coming from a Kafka KStream. My understanding is that I have to load the cache file from Azure Blob Storage into a KTable, and then I can join the KStream with the KTable.
As a newbie, I've run into two difficulties; can anyone give me a hint?
It looks like Kafka Connect doesn't have a connector for Azure Blob Storage. Do I have to write a separate job that continuously reads from Azure and writes to the KTable? Is there a quicker way?
The cache is updated four to five times a day, and the job needs to detect changes in the cache file and reflect them in the KTable. To detect whether entries have been deleted from the cache file, do I have to periodically compare every entry in the KTable against the file? Is there a more efficient way?
Thanks
There are multiple ways to approach this. The first thing you need to realize is that you have to get the data into a Kafka topic before you can read it into a KTable.
If there is no connector, you could write your own connector (https://docs.confluent.io/current/connect/devguide.html). An alternative would be to write a small application that reads the data from Azure and uses a KafkaProducer to write it into a topic.
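To make the second option concrete, here is a minimal Java sketch of such a loader. It assumes the cache file is a simple "key,value" CSV; the connection string variable, container, blob, and topic names (cache-container, cache.csv, cache-topic) are placeholders, and scheduling the reload when the file changes is left out.

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobClientBuilder;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BlobCacheLoader {
    public static void main(String[] args) {
        // Placeholder connection details -- adjust to your environment.
        BlobClient blob = new BlobClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .containerName("cache-container")
                .blobName("cache.csv")
                .buildClient();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Download the whole cache file; assumes one "key,value" entry per line.
            String content = blob.downloadContent().toString();
            for (String line : content.split("\\R")) {
                if (line.isBlank()) continue;
                String[] parts = line.split(",", 2);
                if (parts.length < 2) continue;
                // Write each entry keyed by the lookup key, so the topic can be
                // compacted and read back into a KTable.
                producer.send(new ProducerRecord<>("cache-topic", parts[0], parts[1]));
            }
            producer.flush();
        }
    }
}
```

Using a compacted topic keyed by the lookup key means that re-running the loader after the file changes simply overwrites the existing entries.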
As for updating the KTable, you don't need to worry about this from a Kafka Streams perspective: when new data is written into the topic, the KTable is updated automatically. If you write your own connector, it will also ensure that updates to the file are propagated into the topic automatically. If you write your own application, you will need to make sure that it writes the changes into the topic.
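The enrichment job itself then only needs to build the KTable from that topic and join it with the log stream. A minimal Kafka Streams sketch, where the topic names (logs-topic, cache-topic, enriched-logs-topic) are again placeholders and the log records are assumed to be keyed by the lookup key:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class LogEnrichmentApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // The KTable is backed by the topic the loader/connector writes to;
        // whenever a new entry is produced there, the table is updated.
        KTable<String, String> cache = builder.table("cache-topic");
        KStream<String, String> logs = builder.stream("logs-topic");

        // Left join keeps log entries even when there is no cache hit.
        logs.leftJoin(cache, (logLine, cacheValue) ->
                        cacheValue == null ? logLine : logLine + " | " + cacheValue)
            .to("enriched-logs-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}
```

For entries that disappear from the cache file, the loader (or connector) can produce a record with the same key and a null value; such a tombstone removes the entry from the KTable, so you don't have to diff the table against the file yourself.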
I have found quite a few answers about copying blobs between Azure storage accounts, and I know of the Start-AzureStorageBlobCopy cmdlet. However, I have more than 20 million files to copy between two storage accounts in the same data center, and it seems to take forever (it has been copying for more than a week) because it starts a separate copy operation for each file.
Furthermore, I found that in the most current version of the Azure tools (7.4), the cmdlet downloads the full file list into memory and only then starts the copy process. So it not only takes forever but also uses a large amount of memory. The same is true when I use AzCopy.
Hence my question: what is a good approach (one that actually works!) for copying a large number of files, none of them particularly big, between two storage accounts in the same data center? Or perhaps you know of parameters to set when using the cmdlets (the documentation is poor and out of date)?