KUDU for JDBC replication purposes, but not for Off-loaded Analytics

Given this quote from the official Apache Kudu documentation (https://kudu.apache.org/overview.html):
Kudu isn't designed to be an OLTP system, but if you have some subset
of data which fits in memory, it offers competitive random access
performance. We've measured 99th percentile latencies of 6ms or below
using YCSB with a uniform random access workload over a billion rows.
Being able to run low-latency online workloads on the same storage as
back-end data analytics can dramatically simplify application
architecture.
Does this statement imply that Kudu can be used for replication from a JDBC source, the simplest form possible?

Elsewhere I have used Kudu for replicating data from SAP and other COTS systems, so that reports could run against the Kudu tables as opposed to HANA. That was an architecture decided upon by others.
For pure replication of data, primarily for subsequent extraction from a Data Lake, and for data with embellished history and a size under 1 TB, this is feasible as well. Cloudera confirmed this after discussion. Even though Kudu has a columnar format and a row format would be desirable here, it simply works just as well.
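To make the "simplest form possible" concrete, here is a minimal sketch of what such a replication job could look like with Spark and the kudu-spark connector, assuming the target Kudu table already exists; the JDBC URL, table names, and credentials are placeholders, not a definitive implementation:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcToKuduReplication {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jdbc-to-kudu-replication")
                .getOrCreate();

        // Read a snapshot of the source table over JDBC (placeholder connection details).
        Dataset<Row> source = spark.read()
                .format("jdbc")
                .option("url", "jdbc:sap://hana-host:30015")   // hypothetical HANA endpoint
                .option("dbtable", "SCHEMA.ORDERS")            // hypothetical source table
                .option("user", "replication_user")
                .option("password", "secret")
                .load();

        // Append the snapshot into an existing Kudu table via the kudu-spark connector.
        source.write()
                .format("org.apache.kudu.spark.kudu")
                .option("kudu.master", "kudu-master-1:7051")
                .option("kudu.table", "impala::staging.orders")
                .mode("append")
                .save();

        spark.stop();
    }
}
```

Scheduling a job like this on an interval (or feeding it from a CDC stream instead of full snapshots) is what turns it into replication; Kudu itself only provides the fast, mutable storage side.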

Related

HBase to Delta Tables

We are in the process of migrating a Hadoop workload to Azure Databricks. In the existing Hadoop ecosystem, we have some HBase tables which contain some data (not big). Since Azure Databricks does not support HBase, we were planning to see if we can replace the HBase tables with Delta tables.
Is this technically feasible? If yes, are there any challenges or issues we might face during the migration or in the target system?
It all comes down to the access patterns. HBase is an OLTP system where you usually operate on individual records (read/insert/update/delete) and expect subsecond (or millisecond) response times. Delta Lake, on the other hand, is an OLAP system designed for efficient processing of many records together, but it can be slower when you read individual records, and especially when you update or delete them.
If your application needs subsecond queries, especially with updates, then it makes sense to set up a test to check whether Delta Lake is the right choice for that; you may also want to look into Databricks SQL, which does a lot of optimization for fast data access.
If it doesn't fulfill your requirements, then you may look into other products in the Azure ecosystem, such as Azure Cache for Redis or Azure Cosmos DB, that are designed for OLTP-style data processing.
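If you do test Delta Lake for this, the HBase-style upsert pattern maps onto MERGE. A rough sketch with the Delta Lake Java API, where the paths and the row_key join column are hypothetical:

```java
import io.delta.tables.DeltaTable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaUpsertSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("delta-upsert-sketch")
                .getOrCreate();

        // A batch of changed records exported from HBase (placeholder path and format).
        Dataset<Row> updates = spark.read()
                .format("parquet")
                .load("/mnt/landing/hbase_export");

        // Upsert into the target Delta table, keyed on the hypothetical row_key column.
        DeltaTable target = DeltaTable.forPath(spark, "/mnt/delta/migrated_hbase_table");
        target.as("t")
              .merge(updates.as("s"), "t.row_key = s.row_key")
              .whenMatched().updateAll()
              .whenNotMatched().insertAll()
              .execute();

        spark.stop();
    }
}
```

Even with MERGE, each upsert batch rewrites data files rather than touching individual rows, which is exactly the latency difference described above.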

Kubernetes vs Spark vs Spark on Kubernetes

So I have a use case where I will stream about 1000 records per minute from Kafka. I just need to dump these records in raw form into a NoSQL DB or something like a data lake, for that matter.
I ran this through two approaches
Approach 1
——————————
Create Kafka consumers in Java and run them as three different containers in Kubernetes. Since all the containers are in the same Kafka consumer group, they all contribute towards reading from the same Kafka topic and dumping data into the data lake. This works pretty quickly for the volume of workload I have.
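For reference, approach 1 essentially boils down to a plain consumer loop like the sketch below; the broker address, topic, group id, and the sink call are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RawDumpConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // placeholder broker
        props.put("group.id", "raw-dump");              // shared by all three containers
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));      // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    writeToDataLake(record.value());    // hypothetical sink call
                }
            }
        }
    }

    private static void writeToDataLake(String payload) {
        // Placeholder: append the raw record to object storage / a NoSQL store.
    }
}
```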
Approach 2
——————————-
I then created a Spark cluster and used the same Java logic to read from Kafka and dump the data into the data lake.
Observations
———————————-
The performance of the Kubernetes approach was, if not better, at least equal to that of a Spark job running in cluster mode.
So my question is: what is the real use case for using Spark over Kubernetes the way I am using it, or even Spark on Kubernetes?
Is Spark only going to rise and shine with much heavier workloads, say on the order of 50,000 records per minute, or in cases where some real-time processing needs to be done on the data before dumping it to the sink?
Spark has more cost associated with it, so I need to make sure I use it only if it would scale better than the Kubernetes solution.
If your case is only to archive/snapshot/dump records, I would recommend looking into Kafka Connect.
If you need to process the records you stream, e.g. aggregate or join streams, then Spark comes into play. For this case you may also look into Kafka Streams.
Each of these frameworks has its own tradeoffs and performance overheads, but in any case you save a lot of development effort by using tools made for the job rather than developing your own consumers. These frameworks already handle most of the failure handling, scaling, and configurable delivery semantics, and they have enough configuration options to tune the behaviour for most of the cases you can imagine. Just choose the available integration and you're good to go! And of course, beware of open-source bugs ;).
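For the light-processing case, a Kafka Streams topology stays very small. A rough sketch, with broker and topic names assumed, that filters and transforms records and forwards them to an output topic (which a Connect sink could then land in the data lake):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EnrichAndForward {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrich-and-forward");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");        // placeholder input topic

        // Example in-flight processing: drop empty payloads and normalise the value,
        // then forward to an output topic for a downstream sink connector.
        events.filter((key, value) -> value != null && !value.isEmpty())
              .mapValues(String::toUpperCase)
              .to("events-clean");                                        // placeholder output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```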
Hope it helps.
Running Spark inside Kubernetes is only recommended when you have a lot of expertise doing it: Kubernetes doesn't know it's hosting Spark, and Spark doesn't know it's running inside Kubernetes, so you will need to double-check every feature you decide to run.
For your workload, I'd recommend sticking with Kubernetes. The elasticity, performance, monitoring tools and scheduling features, plus the huge community support, add up well in the long run.
Spark is an open-source, scalable, massively parallel, in-memory execution engine for analytics applications, so it will really shine when your load becomes more processing-heavy. It simply doesn't have much room to rise and shine if you are only dumping data, so keep it simple.

Does Cassandra replication degrade Analytics performance in the other DC, or vice versa?

We have proposed a solution utilizing a Cassandra-Spark combo realized through a workload-separation architecture. That is, the Operations DC mainly undergoes heavy write operations while the Analytics DC processes analytics jobs. I have read here that:
"Once these asynchronous hints are received on the additional clusters, they undergo the normal write procedures and are assimilated into that data center. This way, any analytical jobs that are running can easily and simply access this new data without a time-consuming ETL process."
Our concern is as all the data is replicated near real-time from Operations DC to Analytics DC, how can we be sure that the replication process will not impact the Analytical processing happening on the Analytics DC?
Alternatively, will the heavy processing of Analytic jobs impact the replication of data between DCs?
I understand that I may be missing something, but a pointer in the right direction will help. I will also appreciate any related documentation on benchmarking or theoretical analysis to address this concern.
It really depends on the type of data processing you'll have in the Analytics DC. You need to size the servers so that they can handle the standard write traffic replicated from the transactional DC, plus the load from your analytical jobs. But you can use a smaller replication factor for the Analytics DC, so there will be slightly fewer writes to the servers there.
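The per-data-center replication factor is set at the keyspace level. A minimal sketch with the DataStax Java driver, where the keyspace name and the DC names 'Operations' and 'Analytics' are assumptions:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class CreateWorkloadSeparatedKeyspace {
    public static void main(String[] args) {
        // Connects to localhost:9042 by default; point it at your cluster in practice.
        try (CqlSession session = CqlSession.builder().build()) {
            // RF 3 in the transactional DC, RF 2 in the Analytics DC,
            // so the analytical nodes receive fewer replica writes.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS app_data "
              + "WITH replication = {"
              + " 'class': 'NetworkTopologyStrategy',"
              + " 'Operations': 3,"
              + " 'Analytics': 2 }");
        }
    }
}
```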
The DSE architecture is described in the corresponding guide. You need to look through the information on data replication and the read/write paths.
I would suggest performing load testing of your cluster and measuring the load on the servers in the Analytics DC, for example the 99th percentile latencies for reads and writes there.
You can emulate the load on the transactional DC using the DSE Gatling plugin or related projects (search for 'gatling' in the DataStax repositories). With Gatling it's easier to develop more real-world-like load simulators.

What is the impact of writes to a cassandra cluster that is hosting analytics jobs?

I am considering a heavily analytics-based Spark cluster that also has to consume some writes from a response-time-sensitive UI.
Will the analytics jobs impact or hinder my REST response time?
Will it have any other impact ?
Your best bet would be to keep multiple separate data centers. I couldn't find the blog post about the newest version of DSE, but all the principles from the article below still apply, and they are applicable to community Cassandra too.
https://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/deploy/deployWkLdSep.html

Apache Spark vs Apache Ignite [closed]

Currently I'm studying the Apache Spark and Apache Ignite frameworks.
Some principal differences between them are described in this article: Ignite vs. Spark. But I realized that I still don't understand their purposes.
I mean, for which problems is Spark preferable to Ignite, and vice versa?
I would say that Spark is a good product for interactive analytics, while Ignite is better for real-time analytics and high performance transactional processing. Ignite achieves this by providing efficient and scalable in-memory key-value storage, as well as rich capabilities for indexing, querying the data and running computations.
Another common use for Ignite is distributed caching, which is often used to improve performance of applications that interact with relational databases or any other data sources.
Apache Ignite is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real time. Ignite is a data-source-agnostic platform and can distribute and cache data across multiple servers in RAM to deliver unprecedented processing speed and massive application scalability.
Apache Spark (a cluster computing framework) is a fast, in-memory data processing engine with expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.
By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited for high-performance computing and machine learning algorithms.
Some conceptual differences:
Spark doesn’t store data, it loads data for processing from other storages, usually disk-based, and then discards the data when the processing is finished. Ignite, on the other hand, provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities.
Spark is for non-transactional, read-only data (RDDs don’t support in-place mutation), while Ignite supports both non-transactional (OLAP) payloads as well as fully ACID compliant transactions (OLTP)
Ignite fully supports pure computational payloads (HPC/MPP) that can be “dataless”. Spark is based on RDDs and works only on data-driven payloads.
Conclusion:
Ignite and Spark are both in-memory computing solutions but they target different use cases.
In many cases, they are used together to achieve superior results:
Ignite can provide shared storage, so state can be passed from one Spark application or job to another (see the sketch after this list).
Ignite can provide SQL with indexing, so Spark SQL can be accelerated over 1,000x (Spark doesn't index the data).
When working with files instead of RDDs, the Apache Ignite In-Memory File System (IGFS) can also share state between Spark jobs and applications.
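A rough sketch of the shared-storage point, using the ignite-spark module's JavaIgniteContext and JavaIgniteRDD; the cache name and the Ignite configuration path are placeholders:

```java
import java.util.Arrays;
import org.apache.ignite.spark.JavaIgniteContext;
import org.apache.ignite.spark.JavaIgniteRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class SharedStateSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ignite-shared-state")
                .getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // Attach this Spark application to an existing Ignite cluster (placeholder config path).
        JavaIgniteContext<Integer, String> ic =
                new JavaIgniteContext<>(sc, "ignite-config.xml");

        // One job writes its intermediate state into an Ignite cache...
        JavaIgniteRDD<Integer, String> shared = ic.fromCache("sharedState");
        shared.savePairs(sc.parallelizePairs(Arrays.asList(
                new Tuple2<>(1, "intermediate result A"),
                new Tuple2<>(2, "intermediate result B"))));

        // ...and a later job or a different Spark application can read it back.
        shared.collect().forEach(pair ->
                System.out.println(pair._1() + " -> " + pair._2()));

        spark.stop();
    }
}
```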
Do Spark and Ignite work together?
Yes, Spark and Ignite work together.
In short
Ignite vs. Spark
Ignite is an in-memory distributed database, more focused on data storage, that handles transactional updates on data and then serves client requests. Apache Spark is an MPP compute engine which is more inclined towards analytics, ML, graph, and ETL-specific payloads.
In detail
Apache Spark is an OLAP tool
Apache Spark is a general-purpose cluster computing system. It's an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
[Figures: Spark with other components; Spark on YARN deployment topology]
Apache Ignite is an OLTP tool
Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale. Ignite also includes first-class support for cluster management and operations, cluster-aware messaging, and zero-deployment technologies. Ignite also provides support for full ACID transactions spanning memory and optional data sources.
[Figures: Ignite SQL overview; deployment topology]
Apache Spark is a processing framework. You tell it where to get data, provide some code about how to process that data, and then tell it where to put the results. It's a way to easily reliably run computing logic across a bunch of nodes in a cluster on data from any source (which is then kept in-memory during processing). It's primarily meant for large-scale analysis on data from various sources (even from multiple databases at once), or from streaming sources like Kafka. It can also be used for ETL, like transforming and joining data together before putting the final results in some other database system.
Apache Ignite is more of an in-memory distributed database, at least that's how it started. It has a key/value and SQL API, so you can store and read data in various ways, and run queries like you would any other SQL database. It also supports running your own code (similar to Spark) so you can do processing that wouldn't really work with SQL, while also reading and writing the data all in the same system. It also can read/write data to other database systems while acting as a cache layer in the middle. Eventually, as of 2018, it also supports on-disk storage so now you can use it as an all-in-one distributed database, cache, and processing framework.
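A minimal sketch of that key/value plus SQL combination, assuming a single local Ignite node started in the same JVM and the ignite-indexing module on the classpath:

```java
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.configuration.CacheConfiguration;

public class IgniteKvAndSqlSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Key/value API: a cache whose entries are also queryable through SQL
            // (requires the ignite-indexing module on the classpath).
            CacheConfiguration<Integer, String> cfg =
                    new CacheConfiguration<Integer, String>("cities")
                            .setIndexedTypes(Integer.class, String.class);
            IgniteCache<Integer, String> cities = ignite.getOrCreateCache(cfg);

            cities.put(1, "Lisbon");
            cities.put(2, "Porto");
            System.out.println(cities.get(1));   // -> Lisbon

            // SQL API over the same data: _key/_val are the implicit columns
            // for the indexed String value type.
            List<List<?>> rows = cities.query(
                    new SqlFieldsQuery("SELECT _key, _val FROM String ORDER BY _key")).getAll();
            rows.forEach(System.out::println);
        }
    }
}
```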
Apache Spark is still better for more complex analytics, and you can have Spark read data from Apache Ignite, but for many scenarios it's now possible to consolidate processing and storage into a single system with Apache Ignite.
Although Apache Spark and Apache Ignite utilize the power of in-memory computing, they address different use cases. Spark processes but doesn’t store data. It loads the data, processes it, then discards it. Ignite, on the other hand, can be used to process data and it also provides a distributed in-memory key-value store with ACID compliant transactions and SQL support.
Spark is also for non-transactional, read-only data while Ignite supports non-transactional and transactional workloads. Finally, Apache Ignite also supports purely computational payloads for HPC and MPP use cases while Spark works only on data-driven payloads.
Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.
By Nikita Ivanov: http://www.odbms.org/blog/2017/06/on-apache-ignite-apache-spark-and-mysql-interview-with-nikita-ivanov/
Although both Apache Spark and Apache Ignite utilize the power of in-memory computing, they address somewhat different use cases and rarely “compete” for the same task. Some conceptual differences:
Spark doesn’t store data, it loads data for processing from other storages, usually disk-based, and then discards the data when the processing is finished. Ignite, on the other hand, provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities.
Spark is for non-transactional, read-only data (RDDs don’t support in-place mutation), while Ignite supports both non-transactional (OLAP) payloads as well as fully ACID compliant transactions (OLTP)
Ignite fully supports pure computational payloads (HPC/MPP) that can be “dataless”. Spark is based on RDDs and works only on data-driven payloads.
I am late to answer this question, but let me try to share my view on this.
Ignite may not be ready for production use in enterprise applications, as some important features such as security are only available in GridGain (a commercial wrapper over Ignite).
A complete list of features can be found at the link below:
https://www.gridgain.com/products/gridgain-vs-ignite
