I'm learning Spark, and while reading about "Static Configuration" I came across this explanation: "used when properties relate to the application, not the deployment".
What I'm struggling to understand is: what exactly is the difference between a property related to the application and one related to the deployment?
Thanks for your help!
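For what it's worth, here is a minimal Java sketch of how that distinction usually plays out in practice. The property names come from the standard Spark documentation, not from the text being quoted, so treat them purely as illustrations:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class StaticConfSketch {
    public static void main(String[] args) {
        // Application-level properties control how your job behaves and can be set in code.
        SparkConf conf = new SparkConf()
                .setAppName("my-app")
                .set("spark.task.maxFailures", "8")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

        // Deployment-level properties describe how and where the driver and executors are
        // launched (memory, master, deploy mode). By the time this code runs, those decisions
        // have already been made, so they belong on spark-submit or in spark-defaults.conf:
        //   spark-submit --master yarn --deploy-mode cluster --driver-memory 4g ...
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
    }
}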
I could not find a Hazelcast Jet source connector for Apache Pulsar. Has anybody tried this? I'd appreciate any directions, pointers, sources, or considerations in case I have to write a custom stream connector to use Pulsar as a source for Jet.
An initial version of the Jet connector for Apache Pulsar has recently been implemented here. It hasn't been extensively tested yet. For details, you can look at the design document, which states the connector's capabilities and limitations, and at the tutorial. If anything about these is confusing, feel free to ask again.
Hazelcast Jet doesn't have any connector for Apache Pulsar as of now (version 4.0). If you'd like to contribute one, you can have a look at the Source Builder class and its section in the reference manual as a starting point.
Also, please check out the existing implementations of various connectors in the Hazelcast Jet extension modules repository, which use the Source Builder API, and consider contributing yours there.
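If you do end up writing one yourself, below is a very rough sketch of what a Source Builder based Pulsar source could look like. It assumes Jet 4.0's SourceBuilder API and the standard Pulsar Java client; the service URL, topic and subscription name are placeholders, and a real connector would still need fault tolerance, batching and proper shutdown of the PulsarClient:

import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.SourceBuilder;
import com.hazelcast.jet.pipeline.StreamSource;

public class PulsarSourceSketch {

    // Builds a simple (non-distributed) stream source that polls a single Pulsar consumer.
    static StreamSource<String> pulsarSource() {
        return SourceBuilder
                .stream("pulsar-source", ctx -> {
                    // Created once on the Jet member that runs the source.
                    PulsarClient client = PulsarClient.builder()
                            .serviceUrl("pulsar://localhost:6650")  // placeholder URL
                            .build();
                    return client.newConsumer(Schema.STRING)
                            .topic("my-topic")                      // placeholder topic
                            .subscriptionName("jet-sub")            // placeholder subscription
                            .subscribe();
                })
                .<String>fillBufferFn((consumer, buf) -> {
                    // Poll with a short timeout; if nothing arrives, return and Jet calls again.
                    Message<String> msg = consumer.receive(50, TimeUnit.MILLISECONDS);
                    if (msg != null) {
                        buf.add(msg.getValue());
                        consumer.acknowledge(msg);
                    }
                })
                // A real connector should also close the PulsarClient it created.
                .destroyFn(Consumer::close)
                .build();
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.readFrom(pulsarSource())
         .withoutTimestamps()
         .writeTo(Sinks.logger());
        Jet.bootstrappedInstance().newJob(p).join();
    }
}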
I have been doing research on configuring the Spark JobServer backend (SharedDb) with Cassandra.
I saw in the SJS documentation that Cassandra is cited as one of the shared DBs that can be used.
Here is the documentation part:
Spark Jobserver offers a variety of options for backend storage such as:
H2/PostgreSQL or other SQL databases
Cassandra
Combination of SQL DB or Zookeeper with HDFS
But I didn't find any configuration example for this.
Would anyone have an example, or could someone help me to configure it?
Edited:
I want to use Cassandra to store the metadata and jobs from Spark JobServer, so that I can hit any of the servers through a proxy sitting in front of them.
Cassandra was supported in previous versions of Jobserver. You just needed to have Cassandra running, add the correct settings to your Jobserver configuration file (https://github.com/spark-jobserver/spark-jobserver/blob/0.8.0/job-server/src/main/resources/application.conf#L60), and specify spark.jobserver.io.JobCassandraDAO as the DAO.
But the Cassandra DAO was recently deprecated and removed from the project, because it was not really used or maintained by the community.
I am using HDInsight Spark clusters on Azure, and Jupyter fails to add external dependencies when I try to load a package with %%configure. However, if I make an intentional mistake in the version:
%%configure
{ "packages":["com.websudos:phantom_2.10:1.27.111111111111"] }
then I do get an error saying the package cannot be resolved. So it is trying to resolve packages, just not loading them?
Is there any other way to make this thing work?
The package you are using is not the right one. The intentional mistake is actually telling you that it cannot resolve that package.
It seems the package you might actually want is com.websudos:phantom-spark, since that's what they built their Spark support on? Link
%%configure -f
{ "packages":["com.websudos:phantom-spark_2.10:1.8.0"] }
and then you can import
import com.websudos.phantom.spark._
However, if what you want is a Spark-Cassandra connector, the datastax connector seems to be the one to use.
I should say I've never used Spark with Cassandra before, so please do follow tutorials online on how to set them up.
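If you do go the datastax route, loading it should presumably look just like the blocks above, only with the connector's own Maven coordinates. The version below is purely illustrative, so check the connector's compatibility matrix against your cluster's Spark and Scala versions:
%%configure -f
{ "packages":["com.datastax.spark:spark-cassandra-connector_2.10:1.6.0"] }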
This article from HDInsight site might help you:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-jupyter-notebook-use-external-packages/
I am working on a Twitter data analysis project using Apache Spark with Java, and Cassandra as the NoSQL database.
In this project I want to maintain an ArrayList of LinkedLists (using Java's built-in ArrayList and LinkedList) that is common to all mapper nodes. I mean, if one mapper writes some data into the ArrayList, it should be reflected on all other mapper nodes.
I am aware of broadcast variables, but those are read-only shared variables; what I want is a shared writable structure where changes made by one mapper are reflected in all the others.
Any advice on how to achieve this in Apache Spark with Java will be of great help.
Thanks in advance.
The short, and most likely disappointing, answer is that it is not possible given Spark's architecture. Worker nodes don't communicate with each other, and neither broadcast variables nor accumulators (which are write-only from the workers' side) are really shared variables in that sense. You can try various workarounds, like using an external service or a shared file system to communicate, but they introduce all kinds of issues such as idempotency and synchronization.
As far as I can tell, the best you can get is updating state between batches, or using tools like StreamingContext.remember.
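To make the accumulator point concrete, here is a minimal Java sketch (assuming Spark 2.x's LongAccumulator API): tasks can only add to it, and the merged value only becomes visible on the driver once the action has finished, so it cannot serve as the shared writable list described in the question:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

public class AccumulatorSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("accumulator-sketch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        LongAccumulator seen = sc.sc().longAccumulator("records-seen");

        sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
          .foreach(x -> seen.add(1));   // tasks can only write (add), never read each other's updates

        // Only the driver sees the combined value, and only after the action completes.
        System.out.println("records seen = " + seen.value());

        // Truly shared, mutable state visible to running tasks needs an external store
        // (e.g. a database), with all the consistency issues mentioned above.
        sc.stop();
    }
}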
Firstly, I have to admit that I am new to Bluemix and Spark; I just want to try my hand at the Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, and then process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only a Python boilerplate is supported. There are also a few pieces of JNI beneath my Java API.
Can I perform the batch operation with the Bluemix Spark service, or is it just for interactive use?
Can I create something like a pipeline (where the output of one stage goes to another) with Bluemix, or do I need to code that myself?
I will appreciate any and all help with respect to the above queries.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark service is now available, and it allows you to submit Java code/batch programs with spark-submit, along with a notebook interface for both Python and Scala.
Earlier, the beta was limited to the interactive notebook interface.
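For the batch and pipeline questions, here is a rough Java sketch of the kind of program spark-submit can run on the service. The process method stands in for your own Java APIs (it is purely hypothetical), and the input/output paths are passed as arguments at submission time:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class BatchPipeline {

    // Hypothetical stand-in for your own Java API.
    static String process(String record) {
        return record.trim().toLowerCase();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("batch-pipeline");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Stage 1: read the large text file (args[0] is the input path given to spark-submit).
        JavaRDD<String> raw = sc.textFile(args[0]);

        // Stage 2: run every record through your own processing code.
        JavaRDD<String> processed = raw.map(BatchPipeline::process);

        // Stage 3: the output of one stage simply becomes the input of the next transformation.
        JavaRDD<String> kept = processed.filter(r -> !r.isEmpty());

        kept.saveAsTextFile(args[1]);   // args[1] is the output path
        sc.stop();
    }
}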
Regards
Anup