Port existing php application in spark streaming - apache-spark

We have a huge existing application in php which
Accepts a log file
Initialises all the database, in-memory store resources
Processes every line
Creates a set of output files
Above process happens per input file.
Input files are written by a kafka consumer. Is it possible to fit this application in spark streaming by somehow not porting all the code in java? For example in following manner
get a message from kafka topic
Pass this message to spark streaming
Spark streaming somehow interacts with legacy app and generates output
spark then writes output again in kafka
Whatever I have just mentioned is too high level. I just want to know whether there's a possibility of doing this by not recoding existing app in java? And can anyone please tell me roughly how this can be done?

I think there is no possibility to use PHP in Spark directly. According to documentation (http://spark.apache.org/) and my knowledge it supports only Java, Scala, R and Python.
However you can change an architecture of your app and create some external services (ws, rest etc) and use them from Spark (you can use whichever library you want) - not all modules from old app must be rewritten to Java. I would try to go in that way :)

I think Storm is an excellent choice in this case because it offers non-jvm language integration through Thrift. Also I am sure that there is a PHP Thrift client.
So basically what you have to do is finding a ShellSpout and ShellBolt written in PHP (this is the integration part needed to interact with Storm in your application) and then write your own spouts and bolts which are consuming Kafka and processing each line.
You can use this library for your need:
https://github.com/Lazyshot/storm-php
Then you will also have to find a PHP Thrift client to interact with the Storm cluster.
The Storm Thrift definition can be found here:
https://github.com/apache/storm/blob/master/storm-core/src/storm.thrift
And a PHP Thrift client example can be found here:
https://thrift.apache.org/tutorial/php
Now putting these things together you can write your own Apache Storm app in PHP.
Information sources:
http://storm.apache.org/about/multi-language.html
http://storm.apache.org/releases/current/Using-non-JVM-languages-with-Storm.html

Related

Spark Application as a Rest Service

I have a question regarding a specific spark application usage.
So I want our Spark application to run as a REST API Server, like Spring Boot Applications, therefore it will not be a batch process, instead we will load the application and then we want to keep the application live (no call to spark.close()) and to use the application as Realtime query engine via some API which we will define. I am targeting to deploy it to Databricks. Any suggestions will be good.
I have checked Apache Livy, but not sure whether it will be good option or not.
Any suggestions will be helpful.
Spark isn't designed to run like this; it has no REST API server frameworks other than the HistoryServer and Worker UI built-in
If you wanted a long-running Spark action, then you could use Spark Streaming and issue actions to it via raw sockets, Kafka, etc. rather than HTTP methods
Good question let's discuss step by step
You can create it and it's working fine , following is example :
https://github.com/vaquarkhan/springboot-microservice-apache-spark
I am sure you must be thinking to create Dataset or Data frame and keep into memory and use as Cache (Redis,Gemfire etc ) but here is catch
i) If you have data in few 100k then you really not needed Apache Spark power Java app is good to return response really fast.
ii) If you have data in petabyte then loading into memory as dataset or data frame will not help as Apache Spark doesn’t support indexing since Spark is not a data management system but a fast batch data processing engine, and Gemfire you have flexibility to add index to fast retrieval of data.
Work Around :
Using Apache Ignite’s(https://ignite.apache.org/) In-memory indexes (refer Fast
Apache Spark SQL Queries)
Using data formats that supports indexing like ORC, Parquet etc.
So Why not use Sparing application with Apache Spark without using spark.close().
Spring application as micro service you need other services either on container or PCF/Bluemix/AWS /Azure/GCP etc and Apache Spark has own world and need compute power which is not available on PCF.
Spark is not a database so it cannot "store data". It processes data and stores it temporarily in memory, but that's not presistent storage.
Once Spark job submit you have to wait results in between you cannot fetch data.
How to use Spark with Spring application as Rest API call :
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.
https://livy.apache.org/

Which framework should be used to aggregate and joining the data of Kafka topics and store in to MySQL

I have data in two kafka topics from mysql using debezium-connector-mysql-plugin.
now i want to aggregate this data at daily level and store in to another mysql table.
please suggest.
Thanks.
You've not really laid out your requirements, other than commenting that you don't want to use Confluent Platform (but not said why).
In general, with data in Kafka (regardless of where it comes from) you have different options for processing it:
Bespoke consumer (probably a bad idea, given the availability of stream processing frameworks)
KSQL (use SQL to do your joins etc) - part of Confluent Platform
Kafka Streams - a Java library for doing stream processing. Part of Apache Kafka.
Flink, Spark Streaming, Samza, Heron, etc etc etc
It's up to you which you use, and it's going to come down to factors such as
Existing technology in use (no point deploying a Spark cluster if you don't need to; conversely, if you already use Spark and have lots of developers trained on it then it could make sense to use it)
Language familiarity of developers - does it have to be a Java API, or is SQL more accessible
Capabilities of the framework/tool - do you need tight security integration, exactly-once processing, CEP, etc etc. Some of these will rule in or out the tool that you use.
Once you've joined and aggregated your data, a good pattern to follow is to write it back to Kafka (thus more loosely decoupling your design, and enabling separation of responsibilities of the components) and from there write it to MySQL using Kafka Connect and the JDBC Sink. Kafka Connect is part of Apache Kafka.
One final consideration : if you're taking data from MySQL, to process it and then write it back into MySQL… do you even need Kafka? Is there an appropriate reason to be using it and not just doing this processing in mySQL itself?
Disclaimer: I work for Confluent.

Connecting Bluemix virtual sensors to an instance of Spark service

I am new to bluemix and also Apache Spark. I just wanted to do a small task using IBM analytics for Apache Spark where I want to create a virtual sensor using Bluemix's virtual sensors (https://virtualsensors.mybluemix.net/) and use that generated data as input to the spark streaming service and do some analytics based on the input data. But, I don't know exactly how to connect the instances of those two application and I am stuck. It would be great if someone could help me.
Thanks,
From the documentation the Virtual Sensors just emit their sensor data using MQTT, so I imagine this would be as easy as importing an MQTT library in your language of choice and simply connecting that to the Virtual Sensors.
You haven't really specified what language you're working with on the Spark side, but they'll probably all shake out to either:
Paho (Python, Java, Scala)
Scala-MQTT-client (specifically Scala)
For how to use it, the Paho project also includes some basic documentation about how MQTT works.
Some of the other basics are covered in the MQTT FAQ and this youtube video.
If you need to add the JAR to your notebook, you should be able to use the %AddJar command. You can read about that here -- scroll down to the section titled "Deploy your custom library jar to a Jupyter Notebook" for the instructions and example use.
I would like you to go through this recipe that shows how to configure the Apache Spark Streaming running in IBM Bluemix to get data from the actual sensor devices. I believe, you can just tweak the topic id to get the data from virtual sensor as well.
Also, look at the Github project that shows how to create the Spark-mqtt-connector Dstream such that the Spark service can consume the events in real-time.

Bluemix Spark Service

Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try out my hands with Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, then I want to process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only Python boilerplate is supported? There are few a pieces of JNI as well beneath my Java API.
Can I perform the batch operation with the Bluemix Spark service or it is just for interactive purposes?
Can I create something like a pipeline (output of one stage goes to another) with Bluemix, do I need to code for it ?
I will appreciate any and all help coming my way with respect to above queries.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark sevices is now available and it allow you to submit a java code/batch program with spark-submit along with notebook interface for both python/scala.
Earlier, the beta code was limited to notebook interactive interface.
Regards
Anup

stack for loading log files into cassandra

I would like to periodically (hourly) load my application logs into Cassandra for analysis using pig.
How is this typically done? Are there project(s) that focus on this?
I see mumakil is commonly used to bulk-load data. I could write a cron job built around that, but was hoping for something more robust than the job I would whip-up.
I'm also willing to modify the applications to store the data in another format (like syslog or directly to Cassandra) if that is preferable. Though in that case I would be worried about data-loss should Cassandra be unavailable.
If you are set on using Flume, you'll need to write a custom Flume sink (not hard). You can model it on https://github.com/geminitech/logprocessing.
If you are wanting to use Pig, I agree with the other poster that you should use HDFS (or S3). Hadoop is designed to work very well with block storage where the blocks are huge. This prevents the terrible IO performance you get from doing lots of disk seeks and network IO. While you CAN use Pig with Cassandra, you're going to have trouble with the Cassandra data model and you're going to have much worse performance.
However, if you really want to use Cassandra and you aren't dead set on Flume, I would recommend using Kafka and Storm.
My workflow for loading log files into Cassandra with Storm is:
Kafka collects the logs (e.g. with the log4j appender)
Logs enter the storm cluster using storm-kafka
Log line is parsed and inserted into Cassandra using custom Storm bolts (It's extremely easy to write Storm bolts). There is also a storm-cassandra bolt already available.
You should consider loading them into HDFS using Flume, since these projects were designed for this purpose. You can then use Pig directly against your unstructured/semi-structured log data.

Resources