How to handle backpressure on databases when using Apache Spark? - apache-spark

We are using Apache Spark to perform ETL every 2 hours.
Sometimes Spark puts significant pressure on our databases when read/write operations are performed.
For Spark Streaming, I can see a backpressure configuration for Kafka.
Is there a way to handle this issue in batch processing?

Backpressure is really just a fancy word for setting the maximum receiving rate, so it doesn't work the way you think it does.
What should actually be tuned here is the reading end.
In classical JDBC usage, JDBC connectors have a fetchSize property for PreparedStatements, so you can consider configuring that fetchSize with regard to what is said in the following answers:
Spark JDBC fetchsize option
What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?
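For illustration, here is a minimal sketch of how a Spark JDBC read can combine fetchsize with a bounded numPartitions so the database only sees a limited number of concurrent connections. The URL, table name, and partition column are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

// Read in a bounded number of partitions so the database only sees a
// handful of concurrent connections, and fetch rows in batches.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb") // hypothetical URL
  .option("dbtable", "events")                          // hypothetical table
  .option("user", "etl")
  .option("password", sys.env("DB_PASSWORD"))
  .option("partitionColumn", "id")   // integer column used to split the read
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")      // caps concurrent connections at 8
  .option("fetchsize", "10000")      // rows fetched per round trip
  .load()
```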
Unfortunately, this might not solve all of your performance issues with your RDBMS.
What you must know is that, compared to the basic JDBC reader, which runs on a single worker, partitioning the data using an integer column or a sequence of predicates loads it in a distributed mode, but introduces a couple of problems. In your case, a high number of concurrent reads can easily throttle the database.
To deal with this, I suggest the following:
If available, consider using specialized data sources over JDBC connections.
Consider using specialized or generic bulk import/export tools like Postgres COPY or Apache Sqoop.
Be sure to understand the performance implications of the different JDBC data source variants, especially when working with a production database.
Consider using a separate replica for Spark jobs.
If you wish to know more about reading data using the JDBC source, I suggest you read the following:
Spark SQL and Dataset API.
Disclaimer: I'm the co-author of that repo.

Related

Which framework should be used to aggregate and join the data of Kafka topics and store it into MySQL

I have data in two Kafka topics, coming from MySQL using debezium-connector-mysql-plugin.
Now I want to aggregate this data at a daily level and store it into another MySQL table.
Please suggest.
Thanks.
You've not really laid out your requirements, other than commenting that you don't want to use Confluent Platform (but not said why).
In general, with data in Kafka (regardless of where it comes from) you have different options for processing it:
Bespoke consumer (probably a bad idea, given the availability of stream processing frameworks)
KSQL (use SQL to do your joins etc) - part of Confluent Platform
Kafka Streams - a Java library for doing stream processing. Part of Apache Kafka.
Flink, Spark Streaming, Samza, Heron, etc etc etc
It's up to you which you use, and it's going to come down to factors such as
Existing technology in use (no point deploying a Spark cluster if you don't need to; conversely, if you already use Spark and have lots of developers trained on it then it could make sense to use it)
Language familiarity of developers - does it have to be a Java API, or is SQL more accessible?
Capabilities of the framework/tool - do you need tight security integration, exactly-once processing, CEP, etc etc. Some of these will rule in or out the tool that you use.
Once you've joined and aggregated your data, a good pattern to follow is to write it back to Kafka (thus more loosely decoupling your design, and enabling separation of responsibilities of the components) and from there write it to MySQL using Kafka Connect and the JDBC Sink. Kafka Connect is part of Apache Kafka.
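If you did go the Spark route, a minimal Structured Streaming sketch of this pattern might look like the following. The broker, topic, payload schema, and table names are all hypothetical, and for brevity this sketch writes the aggregate straight to MySQL over JDBC inside foreachBatch instead of going back through Kafka and the JDBC Sink:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("daily-agg").getOrCreate()
import spark.implicits._

// Schema of the (hypothetical) JSON payload in the topic.
val schema = StructType(Seq(
  StructField("ts", TimestampType),
  StructField("amount", DoubleType)))

// Read the topic as a stream and parse the JSON value.
val orders = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "orders")                    // hypothetical topic
  .load()
  .select(from_json(col("value").cast("string"), schema).as("o"))
  .select($"o.ts", $"o.amount")

// Aggregate at a daily level; the watermark lets windows finalize.
val daily = orders
  .withWatermark("ts", "1 day")
  .groupBy(window($"ts", "1 day"))
  .agg(sum($"amount").as("total"))
  .select($"window.start".as("day"), $"total")

// Write each finalized micro-batch to MySQL over JDBC.
daily.writeStream
  .outputMode("append")
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.write
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/analytics") // hypothetical DB
      .option("dbtable", "daily_totals")
      .option("user", "etl")
      .option("password", sys.env("DB_PASSWORD"))
      .mode("append")
      .save()
  }
  .start()
```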
One final consideration: if you're taking data from MySQL, to process it and then write it back into MySQL… do you even need Kafka? Is there an appropriate reason to be using it and not just doing this processing in MySQL itself?
Disclaimer: I work for Confluent.

Spark goodness with Cassandra?

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially where a time-series-based database is needed.
However, I've started to notice that Apache Spark is often mentioned when people talk about Cassandra too.
The question is: as long as I can use a Cassandra cluster of nodes to serve my app, to store and read data, why would I need Apache Spark? Any useful use cases are appreciated!
The answer is broad, but to summarize: Cassandra is highly scalable and there are a lot of scenarios where it fits, but CQL syntax has some limitations if you don't have your schema ready for certain queries.
If you want to make use of your data without restrictions, run analytical workloads over your Cassandra data, or join it with other tables, Spark is the most appropriate complement. Spark has tight integration with Cassandra.
I recommend you check these slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
Cassandra is for storing data, whereas Spark is for performing computation on top of it. An analogy with Hadoop: Cassandra is like HDFS, whereas Spark is like MapReduce.
Especially with computations, when using the DataStax Cassandra connector, data locality can be exploited. If you need to do some computation that modifies a row (but doesn't really depend on anything else), that operation is optimized to run locally on each machine in the cluster without any data movement over the network.
The same goes for a lot of other Spark workloads: the actions (functions that modify the data) are done locally and only the result is sent to the client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is a well-supported and popular choice. If you don't need to do any operations on the data, you can still use Spark for other purposes, as I mention below.
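As a concrete illustration of this kind of locality-aware computation, here is a minimal sketch using the DataStax spark-cassandra-connector. The keyspace, table, and column names are hypothetical:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-analytics")
  .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical host

val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD; the connector schedules each Spark
// partition on a node that owns the matching token range, so reads are
// local where possible.
val readings = sc.cassandraTable("iot", "sensor_readings") // hypothetical keyspace/table

// A simple aggregation that plain CQL could not express without a
// schema designed for it.
val avgBySensor = readings
  .map(row => (row.getString("sensor_id"), (row.getDouble("value"), 1L)))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }

// Write the result back to another table.
avgBySensor.saveToCassandra("iot", "sensor_averages", SomeColumns("sensor_id", "avg_value"))
```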
Spark Streaming can be used to ingest or export data from Cassandra (I've used it a lot personally). The same data import/export can be achieved with small hand-written JDBC agents, but the Spark Streaming code I wrote for ingesting 10 GB of data from Cassandra is less than 20 lines, with multi-machine multi-threading built in and an admin UI where I can see the job progress.
With Spark+Zeppelin, we can visualize Cassandra data using Spark, we can build beautiful UIs with little Spark code where users can even enter input and see the result as graph/table etc.
Note: visualization can actually be better with Kibana/Elasticsearch or Solr/Banana when used with Cassandra, but they are very hard to set up, and indexing has its own issues to deal with.
There are a lot of other use cases, but personally I used Spark as a Swiss army knife for multiple tasks.
Apache Cassandra has features like fast reads and writes, so you can use it with Apache Spark Streaming to write your data directly into Cassandra with low latency.
As a use case, you can consider any video application that uploads video with the help of streaming and stores it directly into a Cassandra blob.

Apache Spark vs Apache Ignite [closed]

Currently I'm studying the Apache Spark and Apache Ignite frameworks.
Some principal differences between them are described in this article: Ignite vs. Spark. But I realized that I still don't understand their purposes.
I mean, for which problems is Spark preferable to Ignite, and vice versa?
I would say that Spark is a good product for interactive analytics, while Ignite is better for real-time analytics and high performance transactional processing. Ignite achieves this by providing efficient and scalable in-memory key-value storage, as well as rich capabilities for indexing, querying the data and running computations.
Another common use for Ignite is distributed caching, which is often used to improve performance of applications that interact with relational databases or any other data sources.
Apache Ignite is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real time. Ignite is a data-source-agnostic platform and can distribute and cache data across multiple servers in RAM to deliver unprecedented processing speed and massive application scalability.
Apache Spark(cluster computing framework) is a fast, in-memory data processing engine with expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.
By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited for high-performance computing and machine learning algorithms.
Some conceptual differences:
Spark doesn’t store data, it loads data for processing from other storages, usually disk-based, and then discards the data when the processing is finished. Ignite, on the other hand, provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities.
Spark is for non-transactional, read-only data (RDDs don’t support in-place mutation), while Ignite supports both non-transactional (OLAP) payloads as well as fully ACID compliant transactions (OLTP)
Ignite fully supports pure computational payloads (HPC/MPP) that can be “dataless”. Spark is based on RDDs and works only on data-driven payloads.
Conclusion:
Ignite and Spark are both in-memory computing solutions but they target different use cases.
In many cases, they are used together to achieve superior results (a minimal sketch of the shared-storage integration follows this list):
Ignite can provide shared storage, so the state can be passed from one Spark application or job to another.
Ignite can provide SQL with indexing so Spark SQL can be accelerated over 1,000x (Spark doesn't index the data)
When working with files instead of RDDs, the Apache Ignite In-Memory File System (IGFS) can also share state between Spark jobs and applications
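For illustration, a minimal sketch of the ignite-spark integration, where an IgniteRDD acts as shared state between Spark jobs. The cache name and values are hypothetical:

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ignite-shared-state"))

// IgniteContext is the bridge between Spark and the Ignite cluster.
val ic = new IgniteContext(sc, () => new IgniteConfiguration())

// An IgniteRDD is a live view of an Ignite cache: pairs written here by
// one Spark job are visible to other Spark jobs and applications.
val sharedRdd = ic.fromCache[String, Double]("sharedState") // hypothetical cache name

// Job A writes intermediate state...
sharedRdd.savePairs(sc.parallelize(Seq("modelScore" -> 0.93)))

// ...and Job B (or another application) can read it back later. SQL
// queries are also possible if the cache has indexed types configured.
val scores = sharedRdd.filter { case (k, _) => k == "modelScore" }.collect()
```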
Do Spark and Ignite work together?
Yes, Spark and Ignite work together.
In short
Ignite vs. Spark
Ignite is an in-memory distributed database more focused on data storage; it handles transactional updates on data and then serves client requests. Apache Spark is an MPP compute engine that is more inclined towards analytics, ML, graph, and ETL-specific payloads.
In detail
Apache Spark is an OLAP tool
Apache Spark is a general-purpose cluster computing system. It's an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
[Figure: Spark with other components]
[Figure: Deployment topology]
The Spark-on-YARN topology is discussed here.
Apache Ignite is an OLTP tool
Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale. Ignite also includes first-class support for cluster management and operations, cluster-aware messaging, and zero-deployment technologies. Ignite also provides support for full ACID transactions spanning memory and optional data sources.
[Figure: SQL overview]
[Figure: Deployment topology]
Apache Spark is a processing framework. You tell it where to get data, provide some code about how to process that data, and then tell it where to put the results. It's a way to easily reliably run computing logic across a bunch of nodes in a cluster on data from any source (which is then kept in-memory during processing). It's primarily meant for large-scale analysis on data from various sources (even from multiple databases at once), or from streaming sources like Kafka. It can also be used for ETL, like transforming and joining data together before putting the final results in some other database system.
Apache Ignite is more of an in-memory distributed database, at least that's how it started. It has a key/value and SQL API, so you can store and read data in various ways, and run queries like you would any other SQL database. It also supports running your own code (similar to Spark) so you can do processing that wouldn't really work with SQL, while also reading and writing the data all in the same system. It also can read/write data to other database systems while acting as a cache layer in the middle. Eventually, as of 2018, it also supports on-disk storage so now you can use it as an all-in-one distributed database, cache, and processing framework.
Apache Spark is still better for more complex analytics, and you can have Spark read data from Apache Ignite, but for many scenarios it's now possible to consolidate processing and storage into a single system with Apache Ignite.
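For illustration, a minimal sketch of Ignite's key/value and SQL APIs in one program. The cache name and data are hypothetical:

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery
import org.apache.ignite.configuration.CacheConfiguration

// Start (or join) an Ignite node in this JVM with default settings.
val ignite = Ignition.start()

// Key/value API: a distributed cache used like a map, with SQL enabled
// for the (Integer, String) pair via indexed types.
val cfg = new CacheConfiguration[Integer, String]("people") // hypothetical cache name
cfg.setIndexedTypes(classOf[Integer], classOf[String])
val cache = ignite.getOrCreateCache(cfg)

cache.put(1, "Alice")
println(cache.get(1)) // Alice

// SQL API over the same data; the value type name becomes the table name.
val rows = cache.query(new SqlFieldsQuery("SELECT _key, _val FROM String")).getAll
```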
Although Apache Spark and Apache Ignite utilize the power of in-memory computing, they address different use cases. Spark processes but doesn’t store data. It loads the data, processes it, then discards it. Ignite, on the other hand, can be used to process data and it also provides a distributed in-memory key-value store with ACID compliant transactions and SQL support.
Spark is also for non-transactional, read-only data while Ignite supports non-transactional and transactional workloads. Finally, Apache Ignite also supports purely computational payloads for HPC and MPP use cases while Spark works only on data-driven payloads.
Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.
By Nikita Ivanov: http://www.odbms.org/blog/2017/06/on-apache-ignite-apache-spark-and-mysql-interview-with-nikita-ivanov/
I am late to answer this question, but let me try to share my view on it.
Ignite may not be ready for production use in enterprise applications, as some important features such as security are only available in GridGain (a wrapper over Ignite).
The complete list of features can be found at the link below:
https://www.gridgain.com/products/gridgain-vs-ignite

stack for loading log files into cassandra

I would like to periodically (hourly) load my application logs into Cassandra for analysis using Pig.
How is this typically done? Are there projects that focus on this?
I see mumakil is commonly used to bulk-load data. I could write a cron job built around that, but was hoping for something more robust than the job I would whip up.
I'm also willing to modify the applications to store the data in another format (like syslog or directly to Cassandra) if that is preferable. Though in that case I would be worried about data loss should Cassandra be unavailable.
If you are set on using Flume, you'll need to write a custom Flume sink (not hard). You can model it on https://github.com/geminitech/logprocessing.
If you are wanting to use Pig, I agree with the other poster that you should use HDFS (or S3). Hadoop is designed to work very well with block storage where the blocks are huge. This prevents the terrible IO performance you get from doing lots of disk seeks and network IO. While you CAN use Pig with Cassandra, you're going to have trouble with the Cassandra data model and you're going to have much worse performance.
However, if you really want to use Cassandra and you aren't dead set on Flume, I would recommend using Kafka and Storm.
My workflow for loading log files into Cassandra with Storm is:
Kafka collects the logs (e.g. with the log4j appender)
Logs enter the storm cluster using storm-kafka
The log line is parsed and inserted into Cassandra using custom Storm bolts (it's extremely easy to write Storm bolts; see the sketch after this list). There is also a storm-cassandra bolt already available.
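To give a feel for how little code a bolt needs, here is a minimal sketch of a terminal bolt that parses a log line and inserts it into Cassandra via the DataStax Java driver. The contact point, keyspace, table, and log format are hypothetical:

```scala
import com.datastax.driver.core.{Cluster, Session}
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.Tuple

// A terminal bolt: parses a raw log line and inserts it into Cassandra.
class CassandraLogBolt extends BaseBasicBolt {

  // Lazy so the connection is created on the worker, not serialized.
  @transient private lazy val session: Session =
    Cluster.builder().addContactPoint("127.0.0.1").build().connect("logs")

  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val line = input.getStringByField("line")
    // Naive parse of a hypothetical "<timestamp> <level> <message>" format;
    // malformed lines would need real error handling.
    val Array(ts, level, msg) = line.split(" ", 3)
    session.execute(
      "INSERT INTO entries (ts, level, message) VALUES (?, ?, ?)",
      ts, level, msg)
  }

  // Terminal bolt: emits nothing downstream.
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = ()
}
```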
You should consider loading them into HDFS using Flume, since these projects were designed for this purpose. You can then use Pig directly against your unstructured/semi-structured log data.

how to integrate cassandra with zookeeper to support transactions

I have a Cassandra cluster and a Zookeeper server installed. Now I want to support transactions in Cassandra using Zookeeper. How do I do that?
Zookeeper creates znodes to perform read and write operations, and data flows to and fro through znodes in Zookeeper. I want to know how to support rollback and commit features in Cassandra using Zookeeper. Is there any way by which we can specify Cassandra configuration in Zookeeper, or Zookeeper configuration in Cassandra?
I know how data is read and written in Cassandra and Zookeeper individually, but I don't know how to integrate the two using Java.
How can we do transactions in Cassandra using Zookeeper?
Thanks.
I have a Cassandra cluster and a Zookeeper server installed. Now I want to support transactions in Cassandra using Zookeeper. How do I do that?
With great difficulty. Cassandra does not work well as a transactional system. Writes to multiple rows are not atomic, there is no way to roll back writes if some writes fail, and there is no way to ensure readers see a consistent view when reading.
I want to know how to support rollback and commit features in Cassandra using Zookeeper.
Zookeeper won't help you with this, especially the commit feature. You may be able to write enough information to Zookeeper to roll back in case of failure, but if you are doing that, you might as well store the rollback info in Cassandra.
Zookeeper and Cassandra work well together when you use Zookeeper as a locking service. Look at the Cages library. Use Zookeeper to coordinate reads/writes to Cassandra.
Trying to use Cassandra as a transactional system with atomic commits to multiple rows and rollbacks is going to be very frustrating.
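For the locking approach, here is a minimal sketch using Apache Curator, a ZooKeeper client (Cages offers similar primitives). The connection string and znode path are hypothetical:

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.locks.InterProcessMutex
import org.apache.curator.retry.ExponentialBackoffRetry

// Connect to ZooKeeper (hypothetical connection string).
val client = CuratorFrameworkFactory.newClient(
  "zk-host:2181", new ExponentialBackoffRetry(1000, 3))
client.start()

// A distributed lock on a znode path; every writer touching the same
// logical record agrees to take this lock first.
val lock = new InterProcessMutex(client, "/locks/accounts/42") // hypothetical path

lock.acquire()
try {
  // Read-modify-write against Cassandra goes here. Note the lock only
  // serializes cooperating writers; it does not make the writes atomic
  // or give readers a consistent snapshot.
} finally {
  lock.release()
}
```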
There are ways to implement transactions in Cassandra without ZooKeeper.
Cassandra itself has a feature called lightweight transactions, which provides per-key linearizability and compare-and-set. With such primitives you can implement serializable transactions at the application level yourself.
Please see the Visualization of serializable cross shard client-side transactions post for details and a step-by-step visualization.
Variants of this approach are used in Google's Percolator system and in CockroachDB.
By the way, if you're fine with the Read Committed isolation level, then it makes sense to take a look at the RAMP transactions paper by Peter Bailis.
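For illustration, a minimal compare-and-set sketch using a lightweight transaction through the DataStax Java driver. The keyspace, table, and values are hypothetical:

```scala
import com.datastax.driver.core.Cluster

val session = Cluster.builder()
  .addContactPoint("127.0.0.1") // hypothetical contact point
  .build()
  .connect("bank")              // hypothetical keyspace

// Compare-and-set via a lightweight transaction: the UPDATE only applies
// if the current balance still matches the value we previously read.
val rs = session.execute(
  "UPDATE accounts SET balance = ? WHERE id = ? IF balance = ?",
  Long.box(90L), "acct-42", Long.box(100L))

// The [applied] column of the result tells us whether the CAS won.
val applied = rs.one().getBool("[applied]")
if (!applied) {
  // Someone else changed the row first; re-read and retry.
}
```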
There is a BATCH feature in Cassandra's CQL3 (Cassandra 1.2 is the formal version that released CQL3), which allegedly can atomically apply all the updates in the BATCH as one all-or-nothing unit.
This does not mean you can roll back a successfully executed BATCH as an RDBMS could do; that would have to be done manually.
Depending on the consistency and preferences you provide to the BATCH statement, the atomicity guarantees of the updates can be strengthened or weakened to some degree, for example with the UNLOGGED option.
http://www.datastax.com/docs/1.2/cql_cli/cql/BATCH
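For illustration, a minimal logged-batch sketch, reusing a connected com.datastax.driver.core.Session as in the earlier sketch. The tables and values are hypothetical:

```scala
// A logged batch: both inserts are applied as one all-or-nothing unit,
// though without isolation, and with no rollback once applied.
session.execute(
  """BEGIN BATCH
    |  INSERT INTO users (id, name) VALUES ('u1', 'Alice');
    |  INSERT INTO users_by_name (name, id) VALUES ('Alice', 'u1');
    |APPLY BATCH""".stripMargin)
```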
Well, I'm not an expert at this (far from it, actually), but the way I see it, either you deploy some middleware of your own making in order to guarantee the specific properties you are looking for, or you can just have Cassandra write data to auxiliary files and then copy them through the file system, since the copy function in Java works as an atomic operation.
I don't know anything about the size of the data files you are considering, so I don't really know if it is doable; however, there might be a way to use this property with smaller bits of information and then combine them as a whole.
Just my 2 cents...
