How to join a Kafka KStream with a plain file cache on Azure Cloud?

I am developing a log enrichment Kafka Streams job. The plan is to use a file cache on Azure Blob to enrich the log entries coming through the Kafka KStream. My understanding is that I have to load the cache file from Azure Blob into a KTable, and then I can join the KStream with the KTable.
As a newbie, I've run into two difficulties; can anyone give me some hints?
It looks like Kafka Connect doesn't have a connector for Azure Blob. Do I have to write a separate job that keeps reading from Azure and writing back to the KTable? Is there any quicker way?
The cache gets updated four or five times every day, and the job needs to detect changes to the cache file and reflect them in the KTable. To detect that some entries were deleted from the cache file, do I have to periodically compare every entry in the KTable against the file? Is there a more efficient way?
Thanks

There are multiple ways to approach this. The first thing you need to realize is that you have to put the data into a Kafka topic first if you want to read it into a KTable.
If there is no connector, you could write your own connector (https://docs.confluent.io/current/connect/devguide.html). An alternative would be to write a small application that reads the data from Azure and uses a KafkaProducer to write it into a topic.
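For illustration, a minimal sketch of such a loader application, assuming the cache file is a simple "key,value" CSV and that a (preferably compacted) topic named cache-topic already exists; the connection string, container, blob and topic names are placeholders:

import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobClientBuilder;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class CacheFileLoader {
    public static void main(String[] args) throws Exception {
        // Read the cache file from Azure Blob Storage (connection string, container and blob are placeholders).
        BlobClient blob = new BlobClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .containerName("caches")
                .blobName("enrichment-cache.csv")
                .buildClient();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(blob.openInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assume "key,value" lines; each record is keyed, so a compacted topic
                // keeps only the latest value per key.
                String[] parts = line.split(",", 2);
                producer.send(new ProducerRecord<>("cache-topic", parts[0], parts[1]));
            }
        }
    }
}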
As for updating the KTable, you don't need to worry about this from a Kafka Streams perspective: if new data is written into the topic, the KTable will be updated automatically. If you write your own connector, it will also ensure that updates to the file are propagated into the topic automatically. If you write your own application, you will need to make sure that the application writes the changes into the topic.
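And a rough sketch of the Streams side, assuming the log stream is keyed (or re-keyed) by the same id as the cache entries; the topic names and String value types are placeholders. Regarding deleted cache entries: if the loader produces a record with a null value (a tombstone) for a removed key, that key is dropped from the KTable as well, so a loader that diffs the old and new file and emits tombstones for removed keys avoids a full KTable comparison.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class LogEnrichment {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Raw log entries, keyed by the enrichment id (placeholder topic name).
        KStream<String, String> logs = builder.stream("raw-logs");
        // Cache entries loaded from the Azure Blob file (placeholder topic name).
        KTable<String, String> cache = builder.table("cache-topic");

        // Left join: logs without a matching cache entry pass through unenriched.
        KStream<String, String> enriched = logs.leftJoin(cache,
                (logValue, cacheValue) -> cacheValue == null
                        ? logValue
                        : logValue + " | " + cacheValue);
        enriched.to("enriched-logs");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}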

Related

Designing a timer-triggered processor which relies on data from events

I am trying to design a timer-triggered processor (all in Azure) which will process a set of records that are set out for it to consume. It will group them based on a column, create files out of them, and dump those files into a blob container. The records it consumes are generated based on an event: when the event is raised it carries a key, which can be used to query the data for the record (the data/record being generated is pulled from different services).
This is what I am thinking currently:
Event is raised to an event-grid-topic.
Azure Function (ConsumerApp) is event triggered: it reads the key, calls a service API to get all the data, and stores that record in a storage table with a flag marking it as ready to be consumed.
Azure Function (ProcessorApp) is timer triggered: it reads from the storage table, groups by another column, creates the files and dumps them. It can then mark the records as processed, if not already updated by ConsumerApp.
Some of my questions on this, apart from any better way to do it altogether, are:
The table storage is going to fill up quickly, which will in turn slow down reading the 'ready' cases, so is there any better approach for storing this intermediate and temporary data? One thing I thought of was to regularly flush the table, or to delete the record in the consumer app instead of marking it as 'processed'.
The service API is called for each event, which might increase the strain on that service and its database. Should I group the calls for multiple records into a single API call, since the processor will only run after a set interval, or is there a better approach here?
Any feedback on this approach or a new design will be appreciated.
If you don't have to process the data from step 2 individually, you can try saving it in a blob too and only add a record with the blob path to Azure Table Storage, to keep the row count minimal.
Azure Table Storage has partitions that you can use to partition your data and keep your read operations fast; a partition scan is faster than a table scan. In addition, Azure Table Storage is cheap, but if you have pricing concerns you can write a clean-up function to periodically remove the processed rows. Keeping the processed rows around for a reasonable time is usually a good idea, though, because you may need them for debugging issues.
By batching multiple calls into a single call you can decrease the network I/O delay, but resource contention will remain at the service level. You can try moving that API to a separate service, if possible, so it can be scaled separately.
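As a rough illustration of the "blob path in a table row" idea combined with partition-scoped reads, a sketch using the azure-data-tables Java SDK; the table name, date-based partition key, property names and status values are all assumptions:

import com.azure.data.tables.TableClient;
import com.azure.data.tables.TableClientBuilder;
import com.azure.data.tables.models.ListEntitiesOptions;
import com.azure.data.tables.models.TableEntity;

public class PendingRecords {
    private final TableClient table = new TableClientBuilder()
            .connectionString(System.getenv("STORAGE_CONNECTION_STRING"))
            .tableName("PendingRecords")            // placeholder table name
            .buildClient();

    // ConsumerApp side: store only a pointer to the blob that holds the full payload.
    public void markReady(String date, String recordId, String blobPath) {
        TableEntity entity = new TableEntity(date, recordId); // partition by date, row key = record id
        entity.addProperty("BlobPath", blobPath);
        entity.addProperty("Status", "Ready");
        table.upsertEntity(entity);
    }

    // ProcessorApp side: a partition scan over one date's partition instead of a full table scan.
    public void processPartition(String date) {
        ListEntitiesOptions options = new ListEntitiesOptions()
                .setFilter(String.format("PartitionKey eq '%s' and Status eq 'Ready'", date));
        for (TableEntity e : table.listEntities(options, null, null)) {
            String blobPath = (String) e.getProperty("BlobPath");
            // ... download the blob, group the records, write the output files ...
            table.deleteEntity(e.getPartitionKey(), e.getRowKey()); // or flip Status and clean up later
        }
    }
}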

Is writing to Google Cloud Storage with the v2 algorithm safe?

Recommended settings for writing to object stores says:
For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety.
Is it safe to use the v2 algorithm to write out to Google Cloud Storage?
What, exactly, does it mean for the algorithm to be "not safe"? What are the concrete set of criteria to use to decide if I am in a situation where v2 is not safe?
aah. I wrote that bit of the docs. And one of the papers you cite.
GCP implements rename() non-atomically, so v1 isn't really any more robust than v2. And v2 can be a lot faster.
The Azure "abfs" connector has O(1) atomic renames, so all good there.
S3 has suffered from both performance and safety problems. As it is now consistent there's less risk, but it's still horribly slow on production datasets. Use a higher-performance committer (EMR Spark committer, S3A committers).
Or look at cloud-first formats like Iceberg, Hudi and Delta Lake. This is where the focus is these days.
Update October 2022
Apache Hadoop 3.3.5 added, in MAPREDUCE-7341, the Intermediate Manifest Committer for correctness, performance and scalability on abfs and gcs (it also works on hdfs, FWIW). It commits tasks by listing the output directory trees of task attempts and saving the list of files to rename into a manifest file, which is committed atomically. Job commit is a simple series of:
list the manifest files to commit, loading them as the listing results are paged in
create the output directory tree
rename all source files to the destination via a thread pool
task attempt cleanup, which again can be done in a thread pool for gcs performance
save the summary to the _SUCCESS JSON file and, if you want, another directory; the summary includes statistics on all store IO done during task and job commit
This is correct for GCS as it relies on a single file rename as the sole atomic operation.
For ABFS it adds support for rate limiting of IOPS and resilience to the way abfs fails when you try a few thousand renames in the same second. One of those examples of a problem which only surfaces in production, not in benchmarking.
This committer ships with Hadoop 3.3.5 and will not be backported; use Hadoop binaries of this or a later version if you want to use it.
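For reference, a hedged sketch of binding the manifest committer to the abfs and gs filesystem schemes through a Hadoop Configuration; the property names follow the MAPREDUCE-7341 documentation, so double-check them against the docs shipped with your Hadoop 3.3.5+ build:

import org.apache.hadoop.conf.Configuration;

public class ManifestCommitterConfig {
    public static Configuration withManifestCommitter(Configuration conf) {
        // Bind the manifest committer factory to the abfs and gs schemes,
        // so jobs writing to those stores pick it up instead of FileOutputCommitter.
        String factory = "org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory";
        conf.set("mapreduce.outputcommitter.factory.scheme.abfs", factory);
        conf.set("mapreduce.outputcommitter.factory.scheme.gs", factory);
        return conf;
    }
}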
https://databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html
We see empirically that while v2 is faster, it also leaves behind partial results on job failures, breaking transactionality requirements. In practice, this means that with chained ETL jobs, a job failure — even if retried successfully — could duplicate some of the input data for downstream jobs. This requires careful management when using chained ETL jobs.
It's safe as long as you manage partial writes on failure. To elaborate, they mean safe with regard to rename safety in the part you quote. Of Azure, AWS and GCP, only AWS S3 is eventually consistent and unsafe to use with the v2 algorithm even when no job failures happen. But none of GCP, Azure or AWS is safe with regard to partial writes.
FileOutputCommitter V1 vs V2
1. mapreduce.fileoutputcommitter.algorithm.version=1
The AM will do mergePaths() at the end, after all reducers complete.
If the MR job has many reducers, the AM will first wait for all reducers to finish and then use a single thread to merge the output files.
So this algorithm has some performance concerns for large jobs.
2. mapreduce.fileoutputcommitter.algorithm.version=2
Each reducer will do mergePaths() to move its output files into the final output directory concurrently.
So this algorithm saves a lot of AM time when the job is committing.
http://www.openkb.info/2019/04/what-is-difference-between.html
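For completeness, a small sketch of how the algorithm version is selected on a plain Hadoop Configuration (in Spark the same key is typically passed with a spark.hadoop. prefix):

import org.apache.hadoop.conf.Configuration;

public class CommitterAlgorithm {
    public static Configuration useV2(Configuration conf) {
        // 1 = the AM merges all task output single-threaded at job commit;
        // 2 = each task moves its own output into the final directory at task commit.
        conf.setInt("mapreduce.fileoutputcommitter.algorithm.version", 2);
        return conf;
    }
}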
If you look at the Apache Spark documentation, Google Cloud is marked as safe with the v1 version, so it is the same for v2.
What, exactly, does it mean for the algorithm to be "not safe"?
In S3 there is no concept of renaming, so once the data is written to an S3 temp location it is copied again to a new S3 location, but the Azure and Google Cloud stores do have directory renames.
AWS S3 has eventual consistency, meaning that if you delete a bucket and immediately list all buckets, the deleted bucket might still appear in the list. Eventual consistency causes file-not-found exceptions during partial writes, so it is not safe.
What are the concrete set of criteria to use to decide if I am in a situation where v2 is not safe?
What is the best practice writing massive amount of files to s3 using Spark
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel
https://spark.apache.org/docs/3.1.1/cloud-integration.html#recommended-settings-for-writing-to-object-stores
https://databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html
https://github.com/steveloughran/zero-rename-committer/files/1604894/a_zero_rename_committer.pdf

Why does Azure Stream Analytics output data to separate files?

Why does Stream Analytics create separate files when outputting to Azure Data Lake or Azure Blob Storage? Sometimes the stream runs for days writing to one file, while at other times a couple of new files are made every day. It seems rather random.
I output the data to CSV and the query stays the same, yet every now and then a new file is generated.
I would prefer to have one large CSV file, because I want to be able to run long-term statistics on the data using Power BI, but this seems impossible when it is all separate files with seemingly random names.
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-define-outputs - this page has details about when a new file is created. In your case, it is most likely due to an internal restart.

Is it good to create a Spark batch job for every new use case?

I run hundreds of computers in a network, and hundreds of users access those machines. Every day, thousands or more syslogs are generated from all those machines. A syslog could be any log, including system failures, network, firewall and application errors, etc.
A sample log would look like this:
May 11 11:32:40 scrooge SG_child[1829]: [ID 748625 user.info] m:WR-SG-BLOCK-111-
00 c:Y th:BLOCK , no allow rule matched for request with entryurl:http:url on
mapping:bali [ rid:T6zcuH8AAAEAAGxyAqYAAAAQ sid:a6bbd3447766384f3bccc3ca31dbd50n ip:192.24.61.1]
From the logs, I extract fields like Timestamp, loghost, msg, process, facility etc. and store them in HDFS. The logs are stored in JSON format. Now I want to build a system where I can type a query in a web application and do analysis on those logs. I would like to be able to do queries like:
get logs where the message contains "Firewall blocked" keywords.
get logs generated for the User Jason
get logs containing "Access denied" msg.
get log count grouped by user, process, loghost etc.
There could be thousands of different types of analytics I want to do. On top of that, I want the combined results of historical data and real-time data, i.e. combining batch and real-time results.
Now my questions are:
To get the batch results, I need to run batch Spark jobs. Should I be creating a batch job for every unique query a user makes? If I do so, I will end up creating thousands of batch jobs. If not, what kind of batch jobs should I run so that I can get results for any type of analytics?
Am I thinking about this the right way? If my approach itself is wrong, then do share what the correct procedure should be.
While it's possible (via the Thrift server, for example), Apache Spark's main objective is not to be a query engine but to build data pipelines for stream and batch data sources.
If in your transformation you are only projecting fields and you want to enable ad-hoc queries, it sounds like you need another data store, such as Elasticsearch for example. The additional benefit is that it comes with Kibana, which enables analytics to some extent.
Another option is to use a SQL engine such as Apache Drill.
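If you do stay with Spark, the "possible" route usually looks like one generic job (or a long-running Thrift/SQL endpoint) that loads the JSON logs and answers arbitrary SQL, rather than one batch job per query. A rough sketch, where the HDFS path is a placeholder and the field names are taken from the question:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LogQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("log-query").getOrCreate();

        // The extracted syslog fields stored as JSON in HDFS (path is a placeholder).
        Dataset<Row> logs = spark.read().json("hdfs:///logs/syslog/");
        logs.createOrReplaceTempView("logs");

        // Any of the example queries becomes a SQL string instead of a dedicated batch job.
        Dataset<Row> countsByUser = spark.sql(
                "SELECT user, process, loghost, count(*) AS n " +
                "FROM logs WHERE msg LIKE '%Access denied%' " +
                "GROUP BY user, process, loghost");
        countsByUser.show();

        spark.stop();
    }
}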
Spark is probably not the right tool to use unless the size of these logs justifies the choice.
Are these logs in the order of a few gigabytes? Then use Splunk.
Are these logs in the order of hundreds of gigabytes? Then use Elasticsearch, with maybe Kibana on top.
Are they in the order of terabytes? Then you should think about a more powerful analytical architecture; there are many alternatives here that basically do batch jobs in the same way as you would with Spark, but usually in a smarter way.

Running HDInsight jobs howto

A few questions regarding the HDInsight jobs approach.
1) How do I schedule an HDInsight job? Is there any ready-made solution for it? For example, if my system constantly collects a large number of new input files that we need to run a map/reduce job upon, what is the recommended way to implement ongoing processing?
2) From the price perspective, it is recommended to remove the HDInsight cluster when there is no job running. As I understand it, there is no way to automate this process if we decide to run the job daily? Any recommendations here?
3) Is there a way to ensure that the same files are not processed more than once? How do you solve this issue?
4) I might be mistaken, but it looks like every HDInsight job requires a new output storage folder to store the reducer results in. What is the best practice for merging those results so that reporting always works on the whole data set?
OK, there are a lot of questions in there! Here are, I hope, a few quick answers.
There isn't really a way of scheduling job submission in HDInsight, though of course you can schedule a program to run the job submissions for you. Depending on your workflow, it may be worth taking a look at Oozie, which can be a little awkward to get going on HDInsight, but should help.
On the price front, I would recommend that if you're not using the cluster, you destroy it and bring it back again when you need it (those compute hours can really add up!). Note that this will lose anything you have in HDFS, which should be mainly intermediate results; any output or input data held in the asv storage will persist in an Azure Storage account. You can certainly automate this by using the CLI tools, or the REST interface used by the CLI tools (see my answer on Hadoop on Azure Create New Cluster; the first one is out of date).
I would do this by making sure I only submitted the job once for each file, and rely on Hadoop to handle the retry and reliability side, so removing the need to manage any retries in your application.
Once you have the outputs from your initial processes, if you want to reduce them to a single output for reporting, the best bet is probably a secondary MapReduce job with those outputs as its inputs.
If you don't care about the individual intermediate jobs, you can just chain these directly in the one MapReduce job (which can contain as many map and reduce steps as you like) through job chaining; see Chaining multiple MapReduce jobs in Hadoop for a Java-based example, and the sketch further below. Sadly, the .NET API does not currently support this form of job chaining.
However, you may be able to just use the ReducerCombinerBase class if your case allows for a Reducer->Combiner approach.
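As a rough Java-side illustration of chaining a processing job into a merge job from the driver (the identity Mapper/Reducer classes stand in for your real ones, and the paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path finalOutput = new Path(args[2]);

        // First job: the per-file processing (identity mapper/reducer as placeholders).
        Job first = Job.getInstance(conf, "process-files");
        first.setJarByClass(ChainedJobs.class);
        first.setMapperClass(Mapper.class);
        first.setReducerClass(Reducer.class);
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) {
            System.exit(1);
        }

        // Second job: merge the intermediate outputs into a single data set for reporting.
        Job merge = Job.getInstance(conf, "merge-results");
        merge.setJarByClass(ChainedJobs.class);
        merge.setMapperClass(Mapper.class);
        merge.setReducerClass(Reducer.class);
        merge.setNumReduceTasks(1); // single reducer => single merged output file
        FileInputFormat.addInputPath(merge, intermediate);
        FileOutputFormat.setOutputPath(merge, finalOutput);
        System.exit(merge.waitForCompletion(true) ? 0 : 1);
    }
}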
