I have a simple Java program that reads messages from a Google Cloud Pubsub topic and prints them. It works correctly when I run the program as a standalone, but it fails to receive messages when run using spark-submit.
Do programs launched using spark-submit have to follow a different structure in general? My program doesn't do any Spark related stuff as of now, but I'll be adding it later.
Are there working examples of Spark + Cloud Pubsub integration? I came across a library called Apache Bahir, but is it a must to use a library like that?
The code for my example can be found here: https://github.com/pramodbiligiri/pubsub-spark
I was able to get it working. The program needed to instantiate a SparkSession and then wait for a termination signal from the user. In my case I used a StreamingContext: https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/streaming/StreamingContext.html
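For anyone hitting the same problem, here is a minimal sketch of the pattern in plain Python rather than the Java of the linked repository. It assumes the subscriber delivers messages on a background thread, so the main thread has to block until a termination signal arrives; otherwise the process may exit before anything is received. The names `fake_subscriber` and `run_until_terminated` are hypothetical stand-ins, not Pub/Sub or Spark APIs. In Spark Streaming the blocking call would be `awaitTermination()` on the streaming context.

```python
import threading


def fake_subscriber(messages, on_message, done):
    """Hypothetical stand-in for a Pub/Sub subscriber callback loop."""
    for message in messages:
        on_message(message)
    # Signal that no more messages will arrive (stands in for a user stop signal).
    done.set()


def run_until_terminated(messages, timeout=5.0):
    """Start the subscriber in the background and block until it signals termination."""
    received = []
    done = threading.Event()
    worker = threading.Thread(
        target=fake_subscriber,
        args=(messages, received.append, done),
        daemon=True,  # a daemon thread dies with the main thread, like an exiting driver
    )
    worker.start()
    # Without this wait, the main thread could return before any message is handled.
    done.wait(timeout)
    return received


if __name__ == "__main__":
    print(run_until_terminated(["hello", "world"]))
```

The `done.wait(...)` line is the part that matters: it plays the role of waiting for termination in the driver.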
Related
I am working with Spark and Cassandra, and in general things are straightforward and working as intended; in particular the spark-shell and running .scala processes to get results.
I'm now looking at using the Spark Job Server. I have the Job Server up and running, and it works as expected for both the test items and some simple initial .scala programs I developed.
However I now want to take one of the .scala programs that works in spark-shell and get it onto the Spark Job Server to access via that mechanism. The issue I have is that the Job Server doesn't seem to recognise the import statements around cassandra and fails to build (sbt compile; sbt package) a jar for upload to the Job Server.
At some level it just looks like I need the Job Server equivalent to the spark shell package switch (--packages datastax:spark-cassandra-connector:2.0.1-s_2.11) on the Spark Job Server so that import com.datastax.spark.connector._ and similar code in the .scala files will work.
Currently when I attempt to build (sbt compile) I get messages such as:
[error] /home/SparkCassandraTest.scala:10: object datastax is not a member of package com
[error] import com.datastax.spark.connector._
I have added various entries to the build.sbt file based on searches and message board advice, but nothing has really changed. If that is the right approach, what should be added to the base Job Server to enable use of the Cassandra connector?
I think that you need spark-submit to do this. I am also working with Spark and Cassandra, but only for the past month, so I have had to read a lot of information. I have compiled this information in a repository that might help you; it is still an alpha version, sorry about that.
I'm building a RESTful API on top of Apache Spark. Serving the following Python script with spark-submit seems to work fine:
import cherrypy
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myApp').getOrCreate()
sc = spark.sparkContext

class doStuff(object):
    @cherrypy.expose
    def compute(self, user_input):
        # do something spark-y with the user input
        return user_output  # placeholder for the result computed above

cherrypy.quickstart(doStuff())
But googling around I see things like Livy and spark-jobserver. I read these projects' documentation and a couple of tutorials but I still don't fully understand the advantages of Livy or spark-jobserver over a simple script with CherryPy or Flask or any other web framework. Is it about scalability? Context management? What am I missing here? If what I want is a simple RESTful API with not many users, are Livy or spark-jobserver worth the trouble? If so, why?
If you use spark-submit, you must manually upload the JAR file to the cluster and run the command. Everything must be prepared before the run.
If you use Livy or spark-jobserver, you can programmatically upload the file and run the job. You can also add other applications that connect to the same cluster and upload a JAR with the next job.
What's more, Livy and spark-jobserver allow you to use Spark in interactive mode, which is hard to do with spark-submit ;)
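To make "programmatically run a job" a bit more concrete, here is a rough sketch that builds (but does not send) a batch submission for Livy's REST API, which accepts `POST /batches` with a JSON body containing at least `file`. The host name, port, file path and class name below are assumptions for illustration, and `build_livy_batch_request` is a hypothetical helper, not part of Livy.

```python
import json
import urllib.request


def build_livy_batch_request(livy_url, file_path, class_name=None, args=None):
    """Build (but do not send) a POST /batches request for Livy's REST API."""
    payload = {"file": file_path}
    if class_name is not None:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host and file path; the file must be readable by the cluster.
    request = build_livy_batch_request(
        "http://livy.example.com:8998", "hdfs:///jobs/my-job.jar", "com.example.MyJob"
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually submit: urllib.request.urlopen(request)
```

Any application that can make an HTTP call can submit work this way, which is the difference from running spark-submit by hand.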
I won't comment on using Livy or spark-jobserver specifically, but there are at least three reasons to avoid embedding a Spark context directly in your application:
Security, with the main focus on reducing the exposure of your cluster to the outside world. An attacker who gains control over your application can do anything from accessing your data to executing arbitrary code on your cluster if the cluster is not correctly configured.
Stability. Spark is a complex framework and there are many factors which can affect its long-term performance and stability. Decoupling the Spark context from the application allows you to handle Spark issues gracefully, without full downtime of your application.
Responsiveness. The user-facing Spark API is mostly (in PySpark, exclusively) synchronous. Using an external service basically solves this problem for you.
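The responsiveness point can be sketched with the standard library alone. In this hypothetical example the web handler would call `submit`, get a job id back immediately, and let the client poll `status` instead of blocking on the computation. `JobRunner` and `slow_job` are made-up names for illustration; this is not how Livy or spark-jobserver are implemented, only the general shape of the idea.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobRunner:
    """Hypothetical job front end: submit returns at once, results are polled later."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def submit(self, func, *args):
        """Queue func(*args) and return a job id without waiting for the result."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(func, *args)
        return job_id

    def status(self, job_id):
        """Return a small status dict, similar to what a REST endpoint might serve."""
        future = self._jobs[job_id]
        if not future.done():
            return {"state": "running"}
        if future.exception() is not None:
            return {"state": "failed", "error": str(future.exception())}
        return {"state": "finished", "result": future.result()}

    def wait(self, job_id, timeout=None):
        """Block until the job finishes; useful in tests and batch callers."""
        return self._jobs[job_id].result(timeout=timeout)


def slow_job(user_input):
    """Stand-in for a blocking Spark action such as collect() or count()."""
    return sum(len(word) for word in user_input.split())


if __name__ == "__main__":
    runner = JobRunner()
    job_id = runner.submit(slow_job, "spark job server")
    print(runner.wait(job_id, timeout=5))
    print(runner.status(job_id))
```

A failed job shows up as a `failed` status rather than taking the web application down, which also touches on the stability point.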
We have a huge existing application in PHP which:
Accepts a log file
Initialises all the database, in-memory store resources
Processes every line
Creates a set of output files
Above process happens per input file.
Input files are written by a Kafka consumer. Is it possible to fit this application into Spark Streaming without porting all the code to Java? For example, in the following manner:
get a message from kafka topic
Pass this message to spark streaming
Spark streaming somehow interacts with legacy app and generates output
spark then writes output again in kafka
Whatever I have just mentioned is too high level. I just want to know whether there's a possibility of doing this by not recoding existing app in java? And can anyone please tell me roughly how this can be done?
I think there is no possibility to use PHP in Spark directly. According to documentation (http://spark.apache.org/) and my knowledge it supports only Java, Scala, R and Python.
However, you can change the architecture of your app and create some external services (WS, REST, etc.) and use them from Spark (you can use whichever library you want). Not all modules from the old app must be rewritten in Java. I would try to go that way :)
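As a rough illustration of keeping the legacy code outside the JVM, here is a sketch of a simpler variant of that idea: instead of a web service, the old application is wrapped as a command-line filter that reads lines on stdin and writes results to stdout. This is an assumption about how the PHP app could be exposed, not something stated in the question. `process_with_external_app` is a hypothetical helper, and a small Python one-liner stands in for the PHP script so the example can run anywhere. Spark's `RDD.pipe` works on a similar stdin/stdout principle.

```python
import subprocess
import sys


def process_with_external_app(lines, command):
    """Feed lines to an external program on stdin and return its stdout lines."""
    completed = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the external program exits with an error
    )
    return completed.stdout.splitlines()


# Stand-in for the legacy app: a tiny filter that upper-cases each input line.
# In practice this might be something like ["php", "process_line.php"] (hypothetical).
STAND_IN_COMMAND = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())",
]

if __name__ == "__main__":
    print(process_with_external_app(["first log line", "second log line"], STAND_IN_COMMAND))
```

Starting a process per batch has a cost, so a long-running service is probably the better fit if the app's initialisation (databases, in-memory stores) is expensive.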
I think Storm is an excellent choice in this case because it offers non-jvm language integration through Thrift. Also I am sure that there is a PHP Thrift client.
So basically what you have to do is find a ShellSpout and ShellBolt written in PHP (this is the integration part needed to interact with Storm in your application) and then write your own spouts and bolts which consume from Kafka and process each line.
You can use this library for your need:
https://github.com/Lazyshot/storm-php
Then you will also have to find a PHP Thrift client to interact with the Storm cluster.
The Storm Thrift definition can be found here:
https://github.com/apache/storm/blob/master/storm-core/src/storm.thrift
And a PHP Thrift client example can be found here:
https://thrift.apache.org/tutorial/php
Now putting these things together you can write your own Apache Storm app in PHP.
Information sources:
http://storm.apache.org/about/multi-language.html
http://storm.apache.org/releases/current/Using-non-JVM-languages-with-Storm.html
I am developing a spark streaming application which basically reads data off kafka and saves it periodically to HDFS.
I am running pyspark on YARN.
My question is more for production purpose. Right now, I run my application like this:
spark-submit stream.py
Imagine you are going to deliver this spark streaming application (in python) to a client, what would you do in order to keep it running forever? You wouldn't just give this file and say "Run this on the terminal". It's too unprofessional.
What I want to do is submit the job to the cluster (or to local processors) and never have to watch logs on the console, or resort to a solution like Linux screen to run it in the background (because that seems too unprofessional).
What is the most professional and efficient way to permanently submit a spark-streaming job to the cluster ?
I hope I was unambiguous. Thanks!
You could use spark-jobserver, which provides a REST interface for uploading your jar and running it. The documentation is available from the spark-jobserver project.
Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try out my hands with Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, then I want to process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only Python boilerplate is supported? There are a few pieces of JNI as well beneath my Java API.
Can I perform the batch operation with the Bluemix Spark service, or is it just for interactive purposes?
Can I create something like a pipeline (output of one stage goes to another) with Bluemix, or do I need to code it myself?
I will appreciate any and all help coming my way with respect to above queries.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark service is now available, and it allows you to submit Java code or a batch program with spark-submit, alongside the notebook interface for both Python and Scala.
Earlier, the beta was limited to the interactive notebook interface.
Regards
Anup