One stream analytics job vs multiple jobs - azure

Performance-wise, is there a big difference between having one stream analytics job with multiple queries and outputs (Azure tables in my case) Vs. splitting these queries and outputs into multiple stream analytics jobs?
And if the difference is significant, how to determine which scenario best suits my needs?
The doc states that:
Stream Analytics can handle up to 1 GB of incoming data per second.
but my concern is more about the processing of the data.

There are two things that will govern how far you can scale a single Stream Analytics job:
The documented limits
How many Streaming Units you need to process your workload - see the FAQ at the bottom of this page.
Smaller windows processing lower volumes will consume fewer units. Like many other PaaS services, the measured unit is esoteric and your mileage may vary; it's best to build a small sample for your use case and then extrapolate how it will scale.
I certainly would not build a separate streaming job for every case, but you may need to find a way to "partition" your requirement, for example by use case/feature, business domain, or consuming system.
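For reference, a single Stream Analytics job can run multiple queries and fan out to multiple outputs by using several SELECT ... INTO statements over the same input. A minimal sketch (the input and output alias names here are hypothetical, configured on the job):

```sql
-- One job, one input, two queries, two Azure Table outputs.
-- [telemetry-input], [errors-table] and [hourly-avg-table] are
-- hypothetical aliases defined in the job's inputs/outputs.
SELECT deviceId, message, EventEnqueuedUtcTime
INTO [errors-table]
FROM [telemetry-input]
WHERE level = 'Error'

SELECT deviceId, AVG(temperature) AS avgTemp
INTO [hourly-avg-table]
FROM [telemetry-input]
GROUP BY deviceId, TumblingWindow(hour, 1)
```

Both queries count toward the same job's Streaming Units, which is why measuring a small sample first matters.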

Related

Apache Pulsar - use cases for infinite retention of a topic

I am currently planning the next version of our telemetry system architecture, and I am strongly considering Pulsar as the messaging solution.
To better understand what this technology is best suited for, can someone share use cases where they rely on infinite retention of a topic, other than an audit trail?
My main goal is to see whether our telemetry data could simply be stored in a Pulsar topic and queried for analytics purposes, instead of using a time-series database like Apache Druid.
Thanks!
The use-case I've had for infinite retention is when you want to store the history going back to the beginning: e.g. in an event-sourcing style approach, the longer you're keeping the events archived, the more able you are to remix your state.
With durable-log style storage, remember that it is heavily optimized for reading the log sequentially from some starting point. For higher-volume queries or queries with strict latency requirements, it is generally poorly suited, and even more so if you can't limit reads to a single partition (remember too that with multiple partitions, even the ordering of messages in the log may be difficult to reconstruct). For infrequent queries with loose latency requirements, though, storing the data in Pulsar might not be that bad, especially if you would be using Pulsar anyway to feed data into the time-series store, as you could then dispense with the time-series store entirely.
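The "remix your state" idea above can be sketched without any Pulsar specifics: if the full event history is retained, any new derived view can be rebuilt later by replaying the log from the beginning (a plain-Python illustration, with a toy in-memory list standing in for an infinitely retained topic):

```python
from collections import defaultdict

# Toy event log standing in for an infinitely retained topic.
events = [
    {"user": "a", "action": "deposit", "amount": 100},
    {"user": "b", "action": "deposit", "amount": 50},
    {"user": "a", "action": "withdraw", "amount": 30},
]

def replay(log, fold, initial):
    """Fold the entire retained log into a derived state."""
    state = initial()
    for event in log:
        fold(state, event)
    return state

# View 1: current balance per user.
def balances(state, e):
    sign = 1 if e["action"] == "deposit" else -1
    state[e["user"]] += sign * e["amount"]

# View 2, added later: action counts. Because the full history is
# still there, no migration is needed -- just replay from offset 0.
def action_counts(state, e):
    state[e["action"]] += 1

print(replay(events, balances, lambda: defaultdict(int)))
# {'a': 70, 'b': 50}
print(replay(events, action_counts, lambda: defaultdict(int)))
# {'deposit': 2, 'withdraw': 1}
```

In Pulsar the same pattern would mean creating a new subscription at the earliest message and folding forward, which is exactly what infinite retention enables.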

Apache Flink - Run same job multiple times for multi-tenant applications

We have a multi-tenant application where we maintain a message queue for each tenant. We have implemented a Flink job to process the stream data from these queues; each message queue is a source in the job. Is this the recommended way to do it? Or is it OK to run the same job (with one source) multiple times, once per tenant? We expect the tenants to produce data at different volumes. Would the multi-job approach have any scalability advantages?
Approaches:
1. Single job with multiple sources
2. Duplicates of the same job, each with one source
I think these approaches apply equally to Storm, Spark, or any other streaming platform.
Thank you
Performance-wise, approach 1 has the greatest potential: resources are better utilized across the different sources. Since the sources are different, though, the optimization potential of the query itself is limited.
However, if we are really talking multi-tenant, I'd go with the second approach. You can assign much more fine-grained permissions to each application (e.g., which Kafka topic it can consume, which S3 bucket it can write to). Since most application developers tend to build GDPR-compliant workflows (even if their current countries might not be affected), I'd go this route to stay on the safe side. This approach also has the advantage that you don't need to restart the job for everyone when you add or remove a tenant.
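The isolation benefit of approach 2 can be sketched abstractly (all names here are hypothetical; this is not Flink API code): each tenant gets its own job instance, scoped to its own source and output, so adding a tenant launches one new job and never touches the others.

```python
# Hypothetical sketch of approach 2: one job instance per tenant,
# each scoped to exactly one source topic and one output bucket.
def job_config(tenant):
    return {
        "job_name": f"stream-job-{tenant}",
        "source_topic": f"{tenant}-events",       # only topic this job may read
        "output_bucket": f"s3://{tenant}-output", # only bucket this job may write
    }

def add_tenant(running_jobs, tenant):
    """Adding a tenant starts one new job; existing jobs are untouched."""
    running_jobs[tenant] = job_config(tenant)

jobs = {}
for t in ["acme", "globex"]:
    add_tenant(jobs, t)

print(sorted(jobs))  # ['acme', 'globex']
```

The per-tenant config is also the natural place to attach tenant-scoped credentials, which is what makes the fine-grained permissions mentioned above possible.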

Spark: Is there a way to use resources in each of the local machine where is distributed?

I have to admit that I don't know how to formulate properly a title question for this (any help is appreciated), but I'll try to be more clear here:
I would like to distribute a task with Spark, but I need certain resources to be used exclusively. There are no restrictions on the order in which the dataset is processed, but I want each batch, distributed and analysed on different nodes of the cluster, to use a different resource.
I will give an example that, hopefully, will make the question clearer:
Imagine I have to analyse 10 million text messages for a sentiment analysis task. The sentiment analysis is provided by a web server whose API, accessible with credentials, can analyse a batch of 100 messages in 100ms. Since I don't want to spend weeks analysing them all, the idea is to distribute the task. But I cannot distribute the SAME credential everywhere, because I would hit rate limits or overload the server.
The desirable solution would be to use ONE credential per partition in Spark, or per node. How can I do that, given that the credentials might change and so are not fixed per node?
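One pattern that would fit (a sketch under the assumption that a fixed pool of credentials can be broadcast to the cluster) is to index the credential pool by partition id. In real Spark this selection would run inside `rdd.mapPartitionsWithIndex` (or via `pyspark.TaskContext.get().partitionId()`), so each task picks its credential locally; the selection logic is simulated here without a cluster, and the key names are made up:

```python
# Hypothetical credential pool, broadcast (or shipped in the closure)
# to all executors. If credentials rotate, rebroadcast and resubmit.
CREDENTIALS = ["key-A", "key-B", "key-C"]

def process_partition(partition_id, messages):
    # Deterministic partition -> credential mapping; with more
    # partitions than keys, partitions share a key round-robin.
    cred = CREDENTIALS[partition_id % len(CREDENTIALS)]
    # In the real job, each batch of 100 messages would be sent to the
    # sentiment API here, authenticated with `cred`, so each key's
    # rate limit is only consumed by its own partitions.
    return [(cred, msg) for msg in messages]

# Simulated partitions of the dataset (no cluster needed for the sketch).
partitions = [["m1", "m2"], ["m3"], ["m4"], ["m5"]]
results = [process_partition(i, p) for i, p in enumerate(partitions)]
print(results[0][0][0], results[3][0][0])  # key-A key-A (partition 3 wraps around)
```

Repartitioning the RDD to exactly `len(CREDENTIALS)` partitions would give each credential at most one concurrent consumer, which is the strictest way to respect per-key rate limits.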

Troubleshooting Azure Search poor performance

I am seeing erratic performance with an Azure Search Basic instance. Our index only has 1,544 documents and is 28MB in size, so I would expect searches to be very fast.
Azure Application Insights is reporting 4.7K calls to Azure Search from our app within the last 12 hours, with an average response time of 2.1s and a standard deviation of 35.8s(!).
I am personally seeing erratic performance during my manual testing. A query can take 20+ seconds at one moment, and then just a bit later the same query will take less than 100ms.
The queries are very simple. Here's an example query string:
api-version=2015-02-28&api-key=removed&search=&%24count=true&%24top=10&%24skip=0&searchMode=all&scoringProfile=FieldBoost&%24orderby=sortableTitle
What can I do to further troubleshoot this issue?
First off, I assume you have a fairly even distribution of queries, which based on your numbers means well under one query per second (4.7K calls over 12 hours is roughly 0.1 QPS). Does that sound correct? If not, and you are seeing large spikes of queries, it is very possible that you do not have enough replicas (copies of the index) to handle the query load. Please note that a single-replica Basic service is targeted to handle low single-digit QPS (although this can vary widely based on the complexity or simplicity of the queries). If you go beyond the limits of the service, latency can certainly become an issue. A good way to drill into this is to use Azure Search Traffic Analytics, which can expose search metrics including the number of queries per second over various timeframes, as well as the latency metrics that we see internally.
Also, most importantly, please try to reuse HTTP connections as much as possible and leverage HTTP connection pooling where you can. For example, in .NET you should reuse a single HttpClient instance, or a single SearchIndexClient instance if using our Azure Search SDK.
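The effect of reusing one client can be sketched in a language-agnostic way (the client class below is a stand-in counter, not the real SDK type): creating a client per request pays the connection setup cost every time, while a shared instance pays it once.

```python
# Stand-in client that counts how many underlying "connections" are
# opened. Real HTTP clients behave similarly in that each new instance
# sets up its own connection pool (TCP + TLS handshakes).
class FakeSearchClient:
    connections_opened = 0

    def __init__(self):
        FakeSearchClient.connections_opened += 1  # simulated connection setup

    def search(self, q):
        return f"results for {q}"

# Anti-pattern: a new client per request -> N connection setups.
FakeSearchClient.connections_opened = 0
for q in ["a", "b", "c"]:
    FakeSearchClient().search(q)
print(FakeSearchClient.connections_opened)  # 3

# Pattern: one shared client reused -> 1 connection setup total.
FakeSearchClient.connections_opened = 0
shared = FakeSearchClient()
for q in ["a", "b", "c"]:
    shared.search(q)
print(FakeSearchClient.connections_opened)  # 1
```

With per-request latencies in the tens of milliseconds, repeated handshakes can easily dominate the measured response time, which is why the shared-instance pattern matters here.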
I gathered more data and posted my results over at the Azure Search forum.
The slowdowns are due to the fact that we're running a single basic instance and code deployments by the Azure Search team cause a brief (a few minutes in my experience) interruption / degradation in service.
I find running two basic instances too expensive. Our search traffic doesn't warrant two instances except for availability purposes.
It's my understanding from the forum that the free tier has generally higher availability than a single basic instance. As a result, I have submitted a feedback item suggesting a paid shared tier that would provide more storage than the free tier while retaining higher availability than a single dedicated instance.

What does "streaming" mean in Apache Spark and Apache Flink?

As I went to Apache Spark Streaming Website, I saw a sentence:
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
And in Apache Flink Website, there is a sentence:
Apache Flink is an open source platform for scalable batch and stream data processing.
What do "streaming application", "batch data processing", and "stream data processing" mean? Can you give some concrete examples? Are they designed for sensor data?
Streaming data analysis (in contrast to "batch" data analysis) refers to a continuous analysis of a typically infinite stream of data items (often called events).
Characteristics of Streaming Applications
Stream data processing applications are typically characterized by the following points:
Streaming applications run continuously, for a very long time, and consume and process events as soon as they appear. In contrast, batch applications gather data in files or databases and process it later.
Streaming applications frequently concern themselves with the latency of results. The latency is the delay between the creation of an event and the point when the analysis application has taken that event into account.
Because streams are infinite, many computations cannot refer to the entire stream, but instead to a "window" over the stream. A window is a view of a sub-sequence of the stream's events (such as the last 5 minutes). A real-world example of a window statistic is the "average stock price over the past 3 days".
In streaming applications, the time of an event often plays a special role. Interpreting events with respect to their order in time is very common. While certain batch applications may do that as well, it is not a core concept there.
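The window concept above can be sketched with a simple time-based sliding window (a pure-Python illustration, not a Flink or Spark API): keep only the events whose timestamp falls within the last `window` time units, and aggregate over that view.

```python
from collections import deque

def window_average(events, window):
    """events: iterable of (timestamp, value) pairs in event-time order.
    Returns, for each event, the average of all values whose timestamp
    lies within the last `window` time units."""
    buf = deque()
    out = []
    for ts, value in events:
        buf.append((ts, value))
        # Evict events that have fallen out of the window.
        while buf and buf[0][0] <= ts - window:
            buf.popleft()
        out.append((ts, sum(v for _, v in buf) / len(buf)))
    return out

stream = [(1, 10.0), (2, 20.0), (3, 30.0), (10, 40.0)]
print(window_average(stream, window=5))
# [(1, 10.0), (2, 15.0), (3, 20.0), (10, 40.0)]
```

Note how the event at time 10 sees an empty window apart from itself: the earlier events have expired, which is exactly the "view of a sub-sequence" behaviour described above.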
Examples of Streaming Applications
Typical examples of stream data processing applications are:
Fraud Detection: The application tries to figure out whether a transaction fits with the behavior that has been observed before. If it does not, the transaction may indicate attempted misuse. This is typically a very latency-critical application.
Anomaly Detection: The streaming application builds a statistical model of the events it observes. Outliers indicate anomalies and may trigger alerts. Sensor data may be one source of events that one wants to analyze for anomalies.
Online Recommenders: If not a lot of past behavior information is available on a user that visits a web shop, it is interesting to learn from her behavior as she navigates the pages and explores articles, and to start generating some initial recommendations directly.
Up-to-date Data Warehousing: There are interesting articles on how to model a data warehousing infrastructure as a streaming application, where the event stream is the sequence of changes to the database, and the streaming application computes various warehouses as specialized "aggregate views" of the event stream.
There are many more ...
