Is it right to access an external cache in Apache Spark applications?

We have many microservices (Java), and data is written to a Hazelcast cache for better performance. Now the same data needs to be made available to a Spark application for data analysis. I am not sure whether accessing an external cache from Apache Spark is the right design approach. I cannot make database calls to get the data, as there would be many database hits, which might affect the microservices (currently we don't have HTTP caching).
I thought about pushing the latest data into Kafka and reading it in Spark. However, each message might be big (sometimes > 1 MB), which does not seem right for Kafka.
If it's OK to use an external cache in Apache Spark, is it better to use the Hazelcast client or to read the Hazelcast cached data over a REST service?
Also, please let me know if there are any other recommended ways of sharing data between Apache Spark and microservices.
Please let me know your thoughts. Thanks in advance.

Related

Spark Application as a Rest Service

I have a question regarding a specific Spark application usage.
I want our Spark application to run as a REST API server, like a Spring Boot application. It would not be a batch process; instead we would load the application, keep it alive (no call to spark.close()), and use it as a real-time query engine via an API that we will define. I am targeting deployment to Databricks.
I have checked Apache Livy, but I am not sure whether it is a good option.
Any suggestions will be helpful.
Spark isn't designed to run like this; it has no built-in REST API server framework other than the HistoryServer and Worker UI.
If you wanted a long-running Spark action, you could use Spark Streaming and issue actions to it via raw sockets, Kafka, etc. rather than HTTP methods.
Good question; let's discuss it step by step.
You can create it and it works fine; here is an example:
https://github.com/vaquarkhan/springboot-microservice-apache-spark
I am sure you must be thinking of creating a Dataset or DataFrame, keeping it in memory, and using it as a cache (like Redis, Gemfire, etc.), but here is the catch:
i) If your data is only a few hundred thousand records, you don't really need the power of Apache Spark; a Java app is good enough to return responses really fast.
ii) If your data is in the petabyte range, loading it into memory as a Dataset or DataFrame will not help, because Apache Spark doesn't support indexing: Spark is not a data management system but a fast batch data processing engine. With Gemfire, you have the flexibility to add indexes for fast retrieval of data.
Workaround:
Using Apache Ignite's (https://ignite.apache.org/) in-memory indexes (refer to Fast Apache Spark SQL Queries)
Using data formats that support indexing, like ORC, Parquet, etc.
So why not use a Spring application with Apache Spark without calling spark.close()?
A Spring application deployed as a microservice needs other services, either in containers or on PCF/Bluemix/AWS/Azure/GCP etc., whereas Apache Spark lives in its own world and needs compute power that is not available on PCF.
Spark is not a database, so it cannot "store data". It processes data and stores it temporarily in memory, but that's not persistent storage.
Once a Spark job is submitted, you have to wait for the results; you cannot fetch data in between.
How to use Spark with a Spring application via REST API calls:
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.
https://livy.apache.org/
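As a rough sketch of what submitting a job through Livy could look like: the snippet below builds a POST /batches request with the Python standard library. The host name, jar path, class name, and argument are made up for illustration, and the port assumes Livy's default of 8998; nothing is sent unless you call submit() against a real Livy server.

```python
import json
import urllib.request

# Assumption: Livy listens on its default port 8998 on a host named "livy-host".
LIVY_URL = "http://livy-host:8998"


def build_batch_request(jar_path, class_name, args):
    """Build (but do not send) a POST /batches request that submits a Spark job."""
    payload = {"file": jar_path, "className": class_name, "args": list(args)}
    return urllib.request.Request(
        url=LIVY_URL + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request):
    """Send the request and return Livy's JSON reply (needs a reachable Livy server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical jar path, class name and argument, for illustration only.
req = build_batch_request(
    "hdfs:///jobs/report-job.jar", "com.example.ReportJob", ["2024-01-01"]
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

A Spring (or any other) service would send such a request and then poll the batch state over the same REST interface, instead of embedding a SparkSession in the web application itself.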

How to handle backpressure on databases when using Apache Spark?

We are using Apache Spark to perform ETL every 2 hours.
Sometimes Spark puts a lot of pressure on the databases when read/write operations are performed.
For Spark Streaming, I can see a backpressure configuration for Kafka.
Is there a way to handle this issue in batch processing?
Backpressure is actually just a fancy word to refer to setting up the max receiving rate. So actually it doesn't work the way you think it does.
What should be done here is actually on the reading end.
Now in classical JDBC usage, JDBC connectors have a fetchSize property for PreparedStatements. So basically you can consider configuring that fetchSize in light of what is said in the following answers:
Spark JDBC fetchsize option
What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?
Unfortunately, this might not solve all of your performance issues with your RDBMS.
What you must know is that, compared to the basic JDBC reader, which runs on a single worker, partitioning the data using an integer column or a sequence of predicates loads data in a distributed mode but introduces a couple of problems. In your case, a high number of concurrent reads can easily throttle the database.
To deal with this, I suggest the following :
If available, consider using specialized data sources over JDBC connections.
Consider using specialized or generic bulk import/export tools like Postgres COPY or Apache Sqoop.
Be sure to understand the performance implications of different JDBC data source variants, especially when working with a production database.
Consider using a separate replica for Spark jobs.
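On the Spark side, the number of concurrent reads and the fetch size mentioned above can both be bounded through options of the JDBC data source. A minimal sketch, assuming a PostgreSQL replica and an integer order_id column (the URL, table, column, and bounds are hypothetical); numPartitions caps the number of parallel JDBC connections and fetchsize sets the rows fetched per round trip. The read_table helper is only defined here, not executed, since it needs pyspark and a JDBC driver.

```python
def throttled_jdbc_options(url, table, partition_column, lower, upper,
                           max_connections=4, fetch_size=1000):
    """Return Spark JDBC reader options that bound the load on the database.

    numPartitions caps the number of concurrent JDBC connections;
    fetchsize sets how many rows the driver fetches per round trip.
    """
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(max_connections),
        "fetchsize": str(fetch_size),
    }


def read_table(spark, options):
    """Load the table with an existing SparkSession (requires pyspark and a JDBC driver)."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, for illustration only.
opts = throttled_jdbc_options(
    "jdbc:postgresql://replica-host:5432/shop", "orders", "order_id", 1, 1_000_000
)
print(opts)
```

Keeping max_connections small trades read speed for a lighter footprint on the database; the right value depends on what the database can tolerate during the ETL window.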
If you wish to know more about Reading data using the JDBC source, I suggest you read the following :
Spark SQL and Dataset API.
Disclaimer: I'm the co-author of that repo.

Securing Spark Job on Cluster

I am dealing with a Spark requirement where the client (a banking client, for whom security is a major concern) needs all Spark processing to happen securely.
For example, all communication between the Spark client and server (driver and executor communication) should be over a secure channel. Even when Spark spills to disk based on the storage level (memory and disk), data should not be written to local disk unencrypted, or there should be some workaround to prevent the spill.
I did some research but could not find any concrete solution. Let me know if someone has done this.
Any guidance would be a great help. Thanks in advance.
Sounds like the right job for Apache Commons Crypto.
Instead of preventing the spill, which usually happens during the shuffle phase, you can use the Crypto library to encrypt the output that is spilled.
Here are a few recommended reads:
Securing Apache Spark Shuffle using Apache Commons Crypto
Spark Encryption
Java-based examples:
CipherByteBuffer
Stream Example
These examples are in Java; I don't think you should have any problem implementing them with Spark/Scala as well. I haven't implemented them myself, so I am not sure of any underlying issues.
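Before wiring in the library by hand, it may be worth checking whether your Spark version already exposes this as configuration. The sketch below assumes a Spark release that ships the built-in security properties shown (they are not mentioned in the question, so verify them against the security documentation of your version) and simply turns them into spark-submit arguments; the jar name is a placeholder.

```python
# Built-in Spark security properties (check the security docs for your Spark version).
SECURE_CONF = {
    "spark.authenticate": "true",            # shared-secret authentication between Spark processes
    "spark.network.crypto.enabled": "true",  # AES-based encryption of RPC traffic
    "spark.io.encryption.enabled": "true",   # encrypt shuffle files, spills and blocks on local disk
}


def to_submit_args(conf):
    """Turn a property dict into the --conf arguments that spark-submit accepts."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(SECURE_CONF)
# "your-app.jar" is a placeholder for the actual application jar.
print(" ".join(["spark-submit"] + submit_args + ["your-app.jar"]))
```

Depending on the cluster manager, the authentication secret may also have to be configured explicitly, so treat this as a starting point rather than a complete hardening recipe.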

How to share data from Spark RDD between two applications

What is the best way to share Spark RDD data between two Spark jobs?
I have a case where job 1, a Spark sliding-window streaming app, consumes data at regular intervals and creates an RDD. We do not want to persist this to storage.
Job 2: a query job that will access the same RDD created in job 1 and generate reports.
I have seen a few questions suggesting Spark Job Server, but as it is open source I am not sure whether it is a viable solution. Any pointers would be a great help.
Thank you!
The short answer is that you can't share RDDs between jobs. The only way you can share data is to write that data to HDFS and then pull it within the other job. If speed is an issue and you want to maintain a constant stream of data, you can use HBase, which will allow for very fast access and processing from the second job.
To get a better idea you should look here:
Serializing RDD
You can share RDDs across different applications using Apache Ignite.
Apache Ignite provides an abstraction for sharing RDDs, through which applications can access RDDs belonging to different applications. In addition, Ignite supports SQL indexes, whereas native Spark doesn't.
Please refer to https://ignite.apache.org/features/igniterdd.html for more details.
The official documentation says:
Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.
http://spark.apache.org/docs/latest/job-scheduling.html
You can save to a global temporary view. The table will be available to other sessions of the same Spark application until the application that created it terminates.

Stack for loading log files into Cassandra

I would like to periodically (hourly) load my application logs into Cassandra for analysis using Pig.
How is this typically done? Are there project(s) that focus on this?
I see mumakil is commonly used to bulk-load data. I could write a cron job built around that, but was hoping for something more robust than the job I would whip up.
I'm also willing to modify the applications to store the data in another format (like syslog or directly to Cassandra) if that is preferable. Though in that case I would be worried about data loss should Cassandra be unavailable.
If you are set on using Flume, you'll need to write a custom Flume sink (not hard). You can model it on https://github.com/geminitech/logprocessing.
If you want to use Pig, I agree with the other poster that you should use HDFS (or S3). Hadoop is designed to work very well with block storage where the blocks are huge. This prevents the terrible IO performance you get from doing lots of disk seeks and network IO. While you CAN use Pig with Cassandra, you're going to have trouble with the Cassandra data model and you're going to have much worse performance.
However, if you really want to use Cassandra and you aren't dead set on Flume, I would recommend using Kafka and Storm.
My workflow for loading log files into Cassandra with Storm is:
Kafka collects the logs (e.g. with the log4j appender)
Logs enter the Storm cluster using storm-kafka
Log lines are parsed and inserted into Cassandra using custom Storm bolts (it's extremely easy to write Storm bolts). There is also a storm-cassandra bolt already available.
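To give an idea of the parsing step in that workflow, here is a small sketch of the logic such a bolt might contain, written in plain Python rather than as an actual Storm bolt. The log layout and the resulting column names (day, ts, level, logger, message) are assumptions for illustration, not something prescribed by Storm or Cassandra.

```python
import re
from datetime import datetime

# Assumed log4j-style layout:
# "2024-01-01 12:00:00,123 ERROR [main] com.example.App - message"
LINE_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) "
    r"(?P<level>[A-Z]+) +\[(?P<thread>[^\]]+)\] (?P<logger>\S+) - (?P<message>.*)$"
)


def parse_log_line(line):
    """Parse one log line into a row dict, or return None if it does not match."""
    match = LINE_PATTERN.match(line.rstrip("\n"))
    if match is None:
        return None
    ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
    ts = ts.replace(microsecond=int(match["ms"]) * 1000)
    return {
        "day": ts.strftime("%Y-%m-%d"),  # hypothetical partition key: one partition per day
        "ts": ts,                        # hypothetical clustering column
        "level": match["level"],
        "logger": match["logger"],
        "message": match["message"],
    }


row = parse_log_line("2024-01-01 12:00:00,123 ERROR [main] com.example.App - disk full")
print(row)
```

Lines that do not match come back as None, so the caller can decide whether to drop them or route them elsewhere instead of failing the whole stream.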
You should consider loading them into HDFS using Flume, since these projects were designed for this purpose. You can then use Pig directly against your unstructured/semi-structured log data.
