Apache Flink - Run same job multiple times for multi-tenant applications

We have a multi-tenant application where we maintain a message queue for each tenant. We have implemented a Flink job to process the stream data from these message queues. Basically, each message queue is a source in the Flink job. Is this the recommended way to do it? Or is it OK to run the same job (with one source) multiple times, based on the number of tenants? We expect each tenant to produce data at a different volume. Would the multi-job approach have any scalability advantages?
Approaches:
1. A single job with multiple sources
2. Run a duplicate of the same job per tenant, each with a single source
I think these approaches apply equally to Storm, Spark, or any other streaming platform.
Thank you

Performance-wise, approach 1 has the greatest potential: resources are shared and therefore better utilized across the different sources. Since the sources are different, though, the optimization potential within the query itself is limited.
However, if we are really talking multi-tenant, I'd go with the second approach. You can assign much more fine-grained rights to each application (e.g., which Kafka topic may be consumed, which S3 bucket may be written to). Since most application developers tend to build GDPR-compliant workflows (even if the countries they currently operate in might not be affected), I'd go this route to stay on the safe side. This approach also has the advantage that you don't need to restart the job for everyone when you add or remove a tenant.
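To make approach 2 concrete, here is a minimal sketch of a per-tenant job, assuming the per-tenant message queues are Kafka topics; the class name, parameter names, and processing step are hypothetical placeholders. The job body stays tenant-agnostic, and each deployment's credentials can be scoped to just that tenant's topic and sink:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PerTenantJob {
    public static void main(String[] args) throws Exception {
        // Hypothetical parameters: --tenant acme --topic acme-events --brokers broker:9092
        ParameterTool params = ParameterTool.fromArgs(args);
        String tenant = params.getRequired("tenant");
        String topic  = params.getRequired("topic");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // One source per deployment; ACLs can be restricted to this tenant's topic only.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers(params.get("brokers", "localhost:9092"))
                .setTopics(topic)
                .setGroupId("job-" + tenant)
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "source-" + tenant)
           .map(value -> "processed: " + value)   // placeholder for real, tenant-agnostic logic
           .returns(Types.STRING)
           .print();                              // placeholder for the tenant's own sink

        env.execute("stream-job-" + tenant);
    }
}
```

Submitting this same artifact once per tenant with different parameters gives you the "duplicate job" model without duplicating code, and adding or removing a tenant only touches that tenant's deployment.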

Related

What is the industry standard for number of clusters for a development team in Databricks?

I am part of a team of 5 developers that gathers, transforms, analyzes, and makes predictions on data in Azure Databricks (basically a combination of Data Science and Data Engineering).
Up until now we have been working on relatively small data, so the team of 5 could easily work with a single cluster with 8 worker nodes in development. Even though we are 5 developers, usually we're at maximum 3 developers in Databricks at the same time.
Recently we started working with "Big Data", and thus we need to make use of Databricks' Apache Spark parallelization to improve the run times of our code. However, a problem that quickly came to light is that with more than one developer running parallelized code on a single cluster, there are queues that slow us down. Because of this, we have been thinking about increasing the number of clusters in our dev environment so that multiple developers can work on code that makes use of Spark's parallelization.
My question is this: What is the industry standard for the number of clusters in a development environment? Do teams usually have a cluster per developer? That sounds like it could easily become quite expensive.
Usually I see the following pattern:
There is a shared cluster that many people use for ad-hoc experimenting, "small" data processing, etc. Note that current Databricks runtimes try to split resources fairly between all users.
If some people need to run something "heavyweight" that is closer to production workloads, such as integration tests, they are allowed to create their own clusters. But to control costs, it's recommended to use cluster policies to limit the cluster size, node types, auto-termination times, and so on (a minimal policy sketch follows after this list).
For development clusters it's OK to use spot instances, because the Databricks cluster manager will pull in new instances if existing ones are evicted.
SQL queries may be more efficient to run on SQL warehouses, which are optimized for BI workloads.
P.S. Integration tests and similar things can often be run as jobs, which are less expensive.
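As a rough illustration of the cluster-policy point above, here is a sketch that creates a policy capping worker count, node types, and auto-termination time. The endpoint path and the policy-definition syntax are written from memory of the Cluster Policies REST API, so verify both against the current Databricks docs; the policy name, node type, and limits are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateDevClusterPolicy {
    public static void main(String[] args) throws Exception {
        // Assumed: a workspace URL such as https://adb-123.azuredatabricks.net and a PAT token.
        String workspaceUrl = System.getenv("DATABRICKS_HOST");
        String token = System.getenv("DATABRICKS_TOKEN");

        // Policy definition: cap worker count, restrict node types, force auto-termination.
        // Attribute names and the range/allowlist syntax should be checked against the docs.
        String definition = """
            {
              "num_workers":             { "type": "range", "maxValue": 8 },
              "node_type_id":            { "type": "allowlist", "values": ["Standard_DS3_v2"] },
              "autotermination_minutes": { "type": "range", "maxValue": 60, "defaultValue": 30 }
            }""";

        // The API expects the definition as a JSON-encoded string inside the request body.
        String body = "{\"name\": \"dev-heavyweight\", \"definition\": " + asJsonString(definition) + "}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(workspaceUrl + "/api/2.0/policies/clusters/create")) // assumed endpoint
                .header("Authorization", "Bearer " + token)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }

    // Minimal JSON string escaping so the nested definition survives as a string value.
    private static String asJsonString(String raw) {
        return "\"" + raw.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }
}
```

The same policy can also be created by hand in the workspace UI; the point is that developers then create clusters only from a policy whose limits you control.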

Which tools to use when migrating bounded data?

I recently started working on a content repository migration project between two different content management systems.
We have around 11 petabytes of documents in a source repository. We want to migrate all of them one document at a time by querying with source system API and saving through destination system API.
We will have a single standalone machine for this migration and should be able to manage (start, stop, resume) the whole process.
What platforms and tools would you suggest for such a task? Is Flink's DataSet API for bounded data suitable for this job?
Flink's DataStream API is probably a better choice than the DataSet API because the streaming API can be stopped/resumed and can recover from failures. By contrast, the DataSet API reruns failed jobs from the beginning, which isn't a good fit for a job that might run for days (or weeks).
While Flink's streaming API is designed for unbounded data streams, it also works very well for bounded datasets.
If the underlying CMSes can support doing the migration in parallel, Flink would easily accommodate this. The Async I/O feature would be helpful in that context. But if you are going to do the migration serially, then I'm not sure you'll get much benefit from a framework like Flink or Spark.
Basically what David said above. The main challenge I think you'll run into is tracking progress such that checkpointing/savepointing (and thus restarting) works properly.
This assumes you have some reasonably efficient and stable way to enumerate the unique IDs for all 1B documents in the source system. One approach we've used in a previous migration project (though not with Flink) was to use the document creation timestamp as the "event time".
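To sketch what that could look like with Flink's DataStream API: the ID source, timeout, and CMS client below are hypothetical placeholders (the real job would enumerate IDs from the source system, e.g. ordered by creation timestamp), but the shape shows how Async I/O keeps many documents in flight while checkpointing tracks progress so the job can be stopped and resumed.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class CmsMigrationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);  // progress survives restarts via checkpoints

        // Hypothetical: a source that enumerates document IDs from the source CMS.
        DataStream<String> documentIds = env.fromElements("doc-1", "doc-2", "doc-3");

        // Copy each document asynchronously: read from the source CMS, write to the destination.
        DataStream<String> migrated = AsyncDataStream.unorderedWait(
                documentIds, new CopyDocument(), 30, TimeUnit.SECONDS, 100 /* max in flight */);

        migrated.print();  // or route per-document statuses to a report sink
        env.execute("cms-migration");
    }

    /** Hypothetical async copy step; CmsClients stands in for the two systems' real APIs. */
    public static class CopyDocument extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String docId, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> CmsClients.fetchAndCopy(docId)) // blocking client calls off the operator thread
                    .thenAccept(status -> resultFuture.complete(Collections.singleton(docId + ":" + status)));
        }
    }

    /** Placeholder for the real source/destination API calls. */
    static class CmsClients {
        static String fetchAndCopy(String docId) { return "ok"; }
    }
}
```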

One stream analytics job vs multiple jobs

Performance-wise, is there a big difference between having one Stream Analytics job with multiple queries and outputs (Azure tables in my case) vs. splitting these queries and outputs into multiple Stream Analytics jobs?
And if the difference is significant, how to determine which scenario best suits my needs?
The doc states that:
Stream Analytics can handle up to 1 GB of incoming data per second.
but my concern is more about the processing of the data.
There are two things that will govern how far you can scale one stream analytics account:
The documented limits
How many Streaming Units you need to process your workload - see the FAQ at the bottom of this page.
Smaller windows processing lower volumes will consume fewer units. Like many of the other PaaS services, the unit of measure is esoteric and your mileage may vary; it's best to build a small sample for your use case and then make a prediction on how it will scale.
I certainly would not build a separate streaming job for every case, but you may need to find a way to "partition" your requirement, maybe by use case/feature, business domain, consuming system, etc.

Need architecture hint: Data replication into the cloud + data cleansing

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open-source framework or cloud solution I can use for this. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint which frameworks you would use for such a task?
From my quick read on Apex, it requires Hadoop underneath, which couples you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs such as Streams and Connect which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics. I have had to add a bit on top to get exactly-once delivery (Kafka 0.11.0 should solve this).
Overall, think of Kafka as a more low-level solution with logical message domains backed by queues, and, from what I skimmed, Apex as a more heavily packaged library with a lot more to explore.
Kafka would allow you to swap in the underlying analytical system of your choice via its consumer API.
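As a small illustration of the delivery-semantics point, here is a minimal producer sketch with the stronger settings available since Kafka 0.11.0; the broker address, topic, and log line are placeholders, and true end-to-end exactly-once still depends on how the consuming side handles reprocessing.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableLogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Defaults give you little more than fire-and-forget; these settings tighten delivery:
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // no duplicates on retry (Kafka >= 0.11)
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("client-logs", "host-1", "INFO service started"); // placeholder payload
            // send() is asynchronous; the callback reports per-record success or failure.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("wrote to %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```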
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which in the cloud can quickly build up. Of course, the size of the data is also important.
These are a few things you should consider:
Batch vs. streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
What's the required latency? That is, what's the maximum time it should take an update to propagate through the system? The answer to this influences question 1.
How much data are we talking about? Are you at the gigabyte, terabyte, or petabyte scale? Different tools have different 'maximum altitudes'.
And in what format? Do you have text files, or are you pulling from relational DBs?
Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on question 3 (data size), deduping usually requires a join by ID, which is done in constant time in a key-value store but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.); see the sketch at the end of this answer.
So, while you ponder all these questions, if you're not sure, I'd recommend you start your cloud work with an elastic, pay-as-you-go solution rather than setting up entire clusters in the cloud, which can quickly become expensive.
One cloud solution you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You dump your data in S3, where it's read by Athena, and you pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system in basically SQL.
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. There are also similar products from Google (BigQuery, etc.) and Microsoft (Azure).
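For the dedup sketch promised above, here is a minimal Spark (Java API) example of de-duplicating by ID; the input path, column names, and "latest record wins" rule are placeholder assumptions about what the customer data might look like.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class DedupCustomers {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("dedup-customers").getOrCreate();

        // Placeholder input: exported customer rows landed in cloud storage.
        Dataset<Row> customers = spark.read().parquet("s3://my-bucket/customers/raw/");

        // Exact duplicates by ID: one line.
        Dataset<Row> exactDedup = customers.dropDuplicates("customer_id");

        // "Latest record wins": rank rows per ID by update time and keep the newest.
        WindowSpec latestFirst = Window.partitionBy("customer_id").orderBy(col("updated_at").desc());
        Dataset<Row> latestPerId = customers
                .withColumn("rn", row_number().over(latestFirst))
                .filter(col("rn").equalTo(1))
                .drop("rn");

        latestPerId.write().mode("overwrite").parquet("s3://my-bucket/customers/clean/");
        spark.stop();
    }
}
```

Fuzzy de-duplication (different spellings of the same customer, etc.) is a harder problem and usually needs a blocking/matching step on top of something like this.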
Yes, you can use Apache Apex for your use case. Apache Apex comes with Apache Malhar, which can help you quickly build an application that loads data using the JDBC input operator and then either stores it to your cloud storage (maybe S3) or de-duplicates it before storing it to any sink. It also provides a Dedup operator for this kind of operation. But as mentioned in a previous reply, Apex does need Hadoop underneath to function.

Azure Service Bus - Multiple Topics vs Filtered Topic

I have written an Azure Service Bus integration for our application using topics that are subscribed to by a number of applications. One of the discussions in our team is whether we stick with a single topic and filter via the properties of the message, or alternatively create a topic for each particular need.
Our scenario is that we wish to filter by a priority and an environment variable (test and uat environments share a connection).
So do we have Topics (something like):
TestHigh
TestMedium
TestLow
UatHigh
UatMedium
UatLow
OR, just a single topic with these values set as two properties?
My preference is that we create separate topics, as we'd be utilising the functionality available, and I would imagine that under high load this would scale better? I've read that peeking large queues can be inefficient. It also seems cleaner to subscribe to a single topic.
Any advice would be appreciated.
I would go with separate topics for each environment. It's cleaner. Message counts in topics can be monitored separately for each environment. It's marginally more scalable (e.g. topic size limits won't be shared) - but the limits are generous and won't matter much in testing.
But my main argument: that's how production will (hopefully) go. As in, production will have its own connection (and namespace) in ASB, and will have separate topics. Thus you would not be filtering messages via properties in production, so why do it differently in testing?
Last tip: to make topic provisioning easier, I'd recommend having your app auto-create the topics on startup. It's easy to do: check if they exist, and create them if they don't.
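A small sketch of that provisioning tip, using the azure-messaging-servicebus administration client (a newer SDK than this discussion likely used, so verify the client and method names against the SDK version you're on); the connection string and topic names are placeholders.

```java
import java.util.List;

import com.azure.messaging.servicebus.administration.ServiceBusAdministrationClient;
import com.azure.messaging.servicebus.administration.ServiceBusAdministrationClientBuilder;

public class TopicProvisioner {
    // Placeholder topic names matching the per-environment/per-priority layout above.
    private static final List<String> TOPICS = List.of("test-high", "test-medium", "test-low");

    public static void main(String[] args) {
        ServiceBusAdministrationClient admin = new ServiceBusAdministrationClientBuilder()
                .connectionString(System.getenv("SERVICEBUS_CONNECTION_STRING")) // needs Manage rights
                .buildClient();

        for (String topic : TOPICS) {
            // Check-then-create on startup; idempotent enough for app boot.
            if (!admin.getTopicExists(topic)) {
                admin.createTopic(topic);
                System.out.println("Created topic " + topic);
            }
        }
    }
}
```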
Either approach works. More topics and subscriptions mean you have more entities to manage at deployment time. If High/Medium/Low reflect priorities, then multiple topics may be a better choice, since you can pull from the highest-priority subscription first.
From a scalability perspective there really isn't much of a difference that you would notice, since Service Bus already spreads the load across multiple logs internally, so whether you use six topics or two will not make a material difference.
What does impact performance predictability is the choice of service class. If you choose "Standard", throughput and latency are best effort over a shared multi-tenant infrastructure. Other tenants on the same cluster may impact your throughput. If you choose "Premium", you get ringfenced resources that give you predictable performance, and your two or six Topics get processed out of that resource pool.
