I am dealing with a Spark requirement where the client (a banking client, for whom security is a major concern) needs all Spark processing to happen securely.
For example, all communication between the Spark client and server (driver and executor communication) should go over a secure channel. And when Spark spills to disk based on the storage level (memory + disk), the data should not be written to local disk in unencrypted form, or there should be some workaround to prevent the spill altogether.
I did some research but could not find any concrete solution. Let me know if someone has done this.
Any guidance would be a great help. Thanks in advance.
This sounds like the right job for Apache Commons Crypto.
Instead of preventing the spill, which usually happens during the shuffle phase, you can use the Crypto library to encrypt the output that is spilled.
Here are a few recommended reads:
Securing Apache Spark Shuffle using Apache Commons Crypto
Spark Encryption
Java-based examples
CipherByteBuffer
Stream Example
These examples are in Java, but I don't think you should have any problem implementing them with Spark/Scala as well. I haven't implemented them myself, so I can't speak to any underlying issues.
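For completeness, recent Spark releases (2.1+) also ship built-in switches for encrypting driver/executor RPC traffic and local spill/shuffle files, which use Commons Crypto under the hood. A minimal sketch, assuming Spark 2.2 or later (the key size and the exact set of options should be checked against the version in use):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .set("spark.authenticate", "true")            // shared-secret authentication between driver and executors
      .set("spark.network.crypto.enabled", "true")  // AES-based encryption of RPC traffic (driver <-> executor)
      .set("spark.io.encryption.enabled", "true")   // encrypt shuffle files and blocks spilled to local disk
      .set("spark.io.encryption.keySizeBits", "256")

    val spark = SparkSession.builder()
      .appName("secure-processing-sketch")
      .config(conf)
      .getOrCreate()

With spark.io.encryption.enabled set, data that lands on local disk because of a MEMORY_AND_DISK storage level or a shuffle spill should no longer be written in plain text.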
Related
I am working on a project where I need to share execution state across different Spark applications.
I decided to go with Apache Ignite as shared in-memory storage between the different Spark applications.
I was thinking of going with embedded Ignite mode with static allocation in Spark, where Ignite nodes start inside the Spark executor processes, so that tasks are executed in the same process where the data is present. But this mode is deprecated.
I could go with a standalone Ignite deployment, but then there would be inter-process communication to get and save the state, which I want to avoid.
Is there any way to tell Spark to create its executors in already-running processes (in this case, Ignite node processes)?
Can ExternalClusterManager be implemented to achieve this?
Is Ignite planning to introduce such a mode in the future?
Well, yes, your general direction is reasonable. Ignite's deprecated embedded deployment is, so to speak, embedded "backwards": when you embed Ignite into Spark it works poorly, but if Spark were embedded into Ignite, it would work better.
Yes, I assume it would be possible to implement. It probably could even be implemented outside of Ignite.
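To make that concrete, a bare skeleton of what such a cluster manager might look like is sketched below. Note that ExternalClusterManager is private[spark], so the implementation has to live under an org.apache.spark package and be registered through META-INF/services/org.apache.spark.scheduler.ExternalClusterManager; the package name, master-URL scheme and backend wiring here are purely illustrative:

    package org.apache.spark.scheduler.ignite  // hypothetical; must sit under org.apache.spark

    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend, TaskScheduler, TaskSchedulerImpl}

    class IgniteClusterManager extends ExternalClusterManager {

      // Spark selects this manager when the master URL matches, e.g. "ignite://...".
      override def canCreate(masterURL: String): Boolean =
        masterURL.startsWith("ignite://")

      override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
        new TaskSchedulerImpl(sc)

      override def createSchedulerBackend(sc: SparkContext, masterURL: String,
                                          scheduler: TaskScheduler): SchedulerBackend = {
        // A real implementation would return a backend that registers executors
        // running inside the already-started Ignite node processes.
        ???
      }

      override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
        scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
    }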
I don't think there are any open issues for that in the Ignite backlog, but you can share your suggestion on the Ignite dev mailing list.
And now the main part: all you're going to achieve with this is replacing inter-process communication with intra-process communication. Communication on the same host usually isn't that expensive. You might see some performance gain, but I would only go ahead with implementing this if there were solid evidence that it solves a real problem.
We have many microservices (Java), and data is written to a Hazelcast cache for better performance. Now the same data needs to be made available to a Spark application for data analysis. I am not sure whether accessing an external cache from Apache Spark is the right design approach. I cannot make database calls to get the data, because the resulting database hits might affect the microservices (we currently don't have HTTP caching).
I thought about pushing the latest data into Kafka and reading it in Spark. However, each message might be big (sometimes > 1 MB), which is not ideal.
If it is OK to use an external cache from Apache Spark, is it better to use the Hazelcast client or to read the cached data over a REST service?
Also, please let me know if there are any other recommended ways of sharing data between Apache Spark and microservices.
Please let me know your thoughts. Thanks in advance.
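For reference, the "Hazelcast client" option would roughly amount to opening a client per partition inside the Spark job, as sketched below; the cluster address, map name and key type are placeholders, and this only illustrates the shape of that approach:

    import com.hazelcast.client.HazelcastClient
    import com.hazelcast.client.config.ClientConfig
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hazelcast-read-sketch").getOrCreate()

    // Hypothetical: the keys the analysis needs, obtained from some other source.
    val keys = spark.sparkContext.parallelize(Seq("order-1", "order-2", "order-3"))

    val values = keys.mapPartitions { it =>
      // One client per partition (not per record) keeps connection overhead low.
      val cfg = new ClientConfig()
      cfg.getNetworkConfig.addAddress("hz-host:5701")     // placeholder cluster address
      val client = HazelcastClient.newHazelcastClient(cfg)
      val map = client.getMap[String, String]("orders")   // placeholder map name
      val result = it.map(k => (k, map.get(k))).toList    // materialise before shutting the client down
      client.shutdown()
      result.iterator
    }

    values.collect().foreach(println)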
I have a use case where I launch local (embedded) Spark inside an application server rather than going for a Spark REST job server or kernel, because embedded Spark has much lower latency. I am interested in:
Drawbacks of this approach, if there are any.
Whether the same can be used in production.
P.S. Low latency is the priority here.
EDIT: The size of the data being processed will be less than 100 MB in most cases.
I don't think it is a drawback at all. If you look at the implementation of the Hive Thrift Server within the Spark project itself, it also manages the SQLContext, etc., inside the HiveServer process. This is especially viable if the amount of data is small and the driver can handle it easily. So I would take that as a hint that this is okay for production use.
But I totally agree that the documentation, and advice in general on how to integrate Spark into interactive, customer-facing applications, lags behind the information available for big-data pipelines.
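For a concrete reference point, an embedded setup is usually just a long-lived local-mode SparkSession owned by the server process and reused across requests. A minimal sketch (the holder object, config values and request handler are illustrative, not a prescribed pattern):

    import org.apache.spark.sql.SparkSession

    // Held by the application server for its whole lifetime.
    object EmbeddedSpark {
      // local[*] runs the driver and executors inside this JVM, which is what keeps latency low.
      lazy val session: SparkSession = SparkSession.builder()
        .master("local[*]")
        .appName("embedded-analytics")
        .config("spark.ui.enabled", "false")          // the Spark UI is usually unwanted inside an app server
        .config("spark.sql.shuffle.partitions", "8")  // small data (< 100 MB), so few partitions
        .getOrCreate()
    }

    // Example request handler: reuse the shared session instead of building one per request.
    def handleRequest(rows: Seq[(String, Double)]): Long = {
      val spark = EmbeddedSpark.session
      import spark.implicits._
      rows.toDF("key", "value").filter($"value" > 0).count()
    }

The main design point is that SparkSession creation is expensive, so it should happen once at server start-up rather than per request.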
We have a huge existing PHP application which:
Accepts a log file
Initialises all the database and in-memory store resources
Processes every line
Creates a set of output files
The above process happens per input file.
Input files are written by a Kafka consumer. Is it possible to fit this application into Spark Streaming without porting all the code to Java? For example, in the following manner:
Get a message from a Kafka topic
Pass this message to Spark Streaming
Spark Streaming somehow interacts with the legacy app and generates output
Spark then writes the output back to Kafka
What I have just described is very high level. I just want to know whether it is possible to do this without recoding the existing app in Java, and if so, roughly how it could be done.
I don't think there is any way to use PHP in Spark directly. According to the documentation (http://spark.apache.org/) and to my knowledge, it supports only Java, Scala, R and Python.
However, you can change the architecture of your app, expose some external services (web services, REST, etc.) and call them from Spark (using whichever library you want); not all modules of the old app have to be rewritten in Java. I would try to go that way :)
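A rough sketch of that idea, assuming the PHP logic is wrapped behind a hypothetical REST endpoint (http://legacy-app/process) and Spark Streaming's Kafka 0.10 integration; topic names and broker addresses are placeholders:

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets

    import org.apache.kafka.clients.consumer.ConsumerConfig
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val conf = new SparkConf().setAppName("legacy-bridge-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "kafka:9092",  // placeholder broker
      ConsumerConfig.GROUP_ID_CONFIG -> "legacy-bridge",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("input-logs"), kafkaParams))

    // Each log line is POSTed to the (hypothetical) HTTP wrapper around the PHP code;
    // a real job would read the response and publish it to an output Kafka topic here.
    stream.map(_.value).foreachRDD { rdd =>
      rdd.foreachPartition { lines =>
        lines.foreach { line =>
          val conn = new URL("http://legacy-app/process").openConnection().asInstanceOf[HttpURLConnection]
          conn.setRequestMethod("POST")
          conn.setDoOutput(true)
          conn.getOutputStream.write(line.getBytes(StandardCharsets.UTF_8))
          conn.getResponseCode   // forces the request; check/handle errors in real code
          conn.disconnect()
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()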
I think Storm is an excellent choice in this case because it offers non-JVM language integration through Thrift. Also, I am sure there is a PHP Thrift client.
So basically, what you have to do is find a ShellSpout and ShellBolt implementation for PHP (this is the integration piece needed to interact with Storm from your application) and then write your own spouts and bolts that consume Kafka and process each line.
You can use this library for that:
https://github.com/Lazyshot/storm-php
Then you will also have to find a PHP Thrift client to interact with the Storm cluster.
The Storm Thrift definition can be found here:
https://github.com/apache/storm/blob/master/storm-core/src/storm.thrift
And a PHP Thrift client example can be found here:
https://thrift.apache.org/tutorial/php
Putting these things together, you can write your own Apache Storm app in PHP.
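To show the JVM-side wiring, the part of a topology that delegates per-line processing to a PHP script via Storm's multilang protocol might look like the sketch below (the script name, topic, ZooKeeper address and field names are placeholders):

    import org.apache.storm.{Config, StormSubmitter}
    import org.apache.storm.kafka.{KafkaSpout, SpoutConfig, ZkHosts}
    import org.apache.storm.task.ShellBolt
    import org.apache.storm.topology.{IRichBolt, OutputFieldsDeclarer, TopologyBuilder}
    import org.apache.storm.tuple.Fields

    // Hands each tuple to a PHP script speaking Storm's multilang protocol.
    // "process_line.php" is a placeholder and must ship in the topology's multilang resources.
    class PhpLineProcessor extends ShellBolt("php", "process_line.php") with IRichBolt {
      override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
        declarer.declare(new Fields("result"))
      override def getComponentConfiguration: java.util.Map[String, AnyRef] = null
    }

    object LegacyBridgeTopology {
      def main(args: Array[String]): Unit = {
        // storm-kafka spout reading the topic the existing Kafka consumer writes to.
        val spoutConf = new SpoutConfig(new ZkHosts("zookeeper:2181"), "input-logs", "/kafka-spout", "php-bridge")
        val builder = new TopologyBuilder()
        builder.setSpout("lines", new KafkaSpout(spoutConf))
        builder.setBolt("php-processor", new PhpLineProcessor(), 4).shuffleGrouping("lines")

        val conf = new Config()
        conf.setNumWorkers(2)
        StormSubmitter.submitTopology("php-legacy-bridge", conf, builder.createTopology())
      }
    }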
Information sources:
http://storm.apache.org/about/multi-language.html
http://storm.apache.org/releases/current/Using-non-JVM-languages-with-Storm.html
I would like to periodically (hourly) load my application logs into Cassandra for analysis using Pig.
How is this typically done? Are there project(s) that focus on this?
I see that mumakil is commonly used to bulk-load data. I could write a cron job built around that, but I was hoping for something more robust than the job I would whip up.
I'm also willing to modify the applications to store the data in another format (such as syslog, or writing directly to Cassandra) if that is preferable, though in that case I would be worried about data loss should Cassandra become unavailable.
If you are set on using Flume, you'll need to write a custom Flume sink (not hard). You can model it on https://github.com/geminitech/logprocessing.
If you want to use Pig, I agree with the other poster that you should use HDFS (or S3). Hadoop is designed to work very well with block storage where the blocks are huge, which avoids the terrible I/O performance you get from lots of disk seeks and network I/O. While you CAN use Pig with Cassandra, you're going to have trouble with the Cassandra data model and you're going to get much worse performance.
However, if you really want to use Cassandra and you aren't dead set on Flume, I would recommend using Kafka and Storm.
My workflow for loading log files into Cassandra with Storm is:
Kafka collects the logs (e.g. with the log4j appender)
Logs enter the Storm cluster using storm-kafka
Each log line is parsed and inserted into Cassandra using custom Storm bolts (it's extremely easy to write Storm bolts; see the sketch after this list). There is also a storm-cassandra bolt already available.
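As an illustration of that last step, a minimal parsing bolt that writes each line into Cassandra with the DataStax Java driver might look like this; the contact point, keyspace, table and assumed "timestamp level message" log format are all placeholders:

    import java.util.{Map => JMap}

    import com.datastax.driver.core.{Cluster, Session}
    import org.apache.storm.task.{OutputCollector, TopologyContext}
    import org.apache.storm.topology.OutputFieldsDeclarer
    import org.apache.storm.topology.base.BaseRichBolt
    import org.apache.storm.tuple.Tuple

    // Parses "timestamp level message" lines and inserts them into a logs table.
    class CassandraLogBolt extends BaseRichBolt {
      private var collector: OutputCollector = _
      private var cluster: Cluster = _
      private var session: Session = _

      override def prepare(conf: JMap[_, _], context: TopologyContext, collector: OutputCollector): Unit = {
        this.collector = collector
        cluster = Cluster.builder().addContactPoint("cassandra-host").build() // placeholder host
        session = cluster.connect("logs_keyspace")                            // placeholder keyspace
      }

      override def execute(tuple: Tuple): Unit = {
        val Array(ts, level, msg) = tuple.getString(0).split(" ", 3)          // assumed log format
        session.execute("INSERT INTO app_logs (ts, level, message) VALUES (?, ?, ?)", ts, level, msg)
        collector.ack(tuple)
      }

      override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = () // terminal bolt, no output

      override def cleanup(): Unit = {
        session.close()
        cluster.close()
      }
    }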
You should consider loading them into HDFS using Flume, since these projects were designed for this purpose. You can then use Pig directly against your unstructured/semi-structured log data.