How to connect Spark Streaming to standalone Solr on Windows? - apache-spark

I want to integrate Spark Streaming with standalone Solr. I am using Spark 1.6.1 and Solr 5.2 standalone on Windows with no ZooKeeper configuration. I was able to find some solutions where they connect to Solr from Spark by passing the ZooKeeper config.
How can I connect my Spark program to standalone Solr?

Please see if this example is helpful: http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
Following that example, you will need to write your own Connection class that wraps an HttpSolrClient or ConcurrentUpdateSolrClient object. You also need to write your own ConnectionPool class that implements a pool of your Connection objects (or, if the client is thread safe, simply returns the same singleton object).
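A minimal sketch of that pattern, assuming SolrJ 5.x is on the classpath; the Solr core URL, the field names, and the DStream's (id, text) element type are placeholders:

import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument
import org.apache.spark.streaming.dstream.DStream

// Lazily initialized singleton so every task in an executor JVM reuses the same
// client; HttpSolrClient is thread safe, so a "pool" of one instance is enough here.
object SolrConnectionPool {
  lazy val client: HttpSolrClient =
    new HttpSolrClient("http://localhost:8983/solr/mycollection")
}

def indexStream(stream: DStream[(String, String)]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val client = SolrConnectionPool.client
      partition.foreach { case (id, text) =>
        val doc = new SolrInputDocument()
        doc.addField("id", id)
        doc.addField("text_txt", text)
        client.add(doc)
      }
      client.commit() // or rely on autoCommit settings in solrconfig.xml
    }
  }
}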

Related

Understanding kappa architecture with Apache Superset

There is a lot of information about the kappa architecture on the internet, and after going through some of the conceptual aspects I am trying to drill down to something more concrete. As my main source I used this website.
Let's imagine you want to implement a kappa architecture involving the following tech stack:
Apache Kafka
Apache Spark
Apache Superset
Now imagine the application you want to do data analytics against has a PostgreSQL database. Of course you can easily connect Apache Superset directly to the PostgreSQL database and create charts.
But now you want to see how you would do this with a kappa architecture, so you add Kafka and Spark.
You can emit events to Kafka and you can read such events in Apache Spark. Kafka will retain messages for topics for a certain period, as pointed out in the answers to this question. When I read about connecting Superset with Spark in the docs, it says Hive should be used as a connector (the project website also states the tool is unsupported, and if you look at this issue on PyHive you find that Impyla could be an alternative). But Apache Hive is a completely different project for a storage system. So how would this connection work?
Assume you have Kafka nodes running (with ZooKeeper, obviously), you also have Spark running, and you then connect Apache Superset through this Hive connector to Spark.
How can you write queries against the data that is in Kafka (which is in fact the live data)?
On the Spark side itself you can easily write a Scala program that reads data from Kafka and does something with it, but how can you achieve this from Apache Superset?
Or is this not the intended way of connecting these things?
If I understood your question, you'd need to use Spark Structured Streaming to register a streaming SQL table in the Hive metastore, which could then be queried from Superset via the Spark Thrift Server.
Hive itself doesn't store any of the data. Hive also has a built-in Kafka query handler, so Spark isn't strictly necessary.
But Hive/Spark isn't the only option. You could use Spark to write to HDFS/S3 and have Presto query that from Superset.
Or you can remove Spark and use Kafka Connect to write to anything else that a dashboarding tool (Tableau is another popular one) can support - a JDBC database (e.g. Postgres), Mongo, Cassandra, etc. Then you'd just refresh the panels to run a new query.
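One rough sketch of the Structured Streaming leg, assuming a Spark 2.x+ build with Hive support and the spark-sql-kafka-0-10 package on the classpath; the broker address, topic, paths, and table name are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-to-queryable-table")
  .enableHiveSupport()
  .getOrCreate()

// Read the live events from Kafka and keep only the string payload.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "app-events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Continuously append the stream to a Parquet directory...
val query = events.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/app_events_chk")
  .start("/tmp/app_events")

// ...and register that directory in the Hive metastore so the Spark Thrift
// Server (and therefore Superset) can query it with plain SQL.
spark.sql(
  "CREATE TABLE IF NOT EXISTS app_events (key STRING, value STRING) " +
  "USING parquet LOCATION '/tmp/app_events'")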

Getting "AssertionError("Unknown application type")" when Connecting to DSE 5.1.0 Spark

I am connecting to DSE (Spark) using this:
new SparkConf()
.setAppName(name)
.setMaster("spark://localhost:7077")
With DSE 5.0.8 (Spark 1.6.3) this works fine, but it now fails with DSE 5.1.0 with this error:
java.lang.AssertionError: Unknown application type
at org.apache.spark.deploy.master.DseSparkMaster.registerApplication(DseSparkMaster.scala:88) ~[dse-spark-5.1.0.jar:2.0.2.6]
After checking the dse-spark jar, I've come up with this:
if(rpcendpointref instanceof DseAppProxy)
And within Spark, it seems to be an RpcEndpointRef (NettyRpcEndpointRef).
How can I fix this problem?
I had a similar issue, and fixed it by following this:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/sparkRemoteCommands.html
Then you need to run your job using dse spark-submit, without specifying any master.
Resource Manager Changes
The DSE Spark Resource Manager is different from the OSS Spark Standalone Resource Manager. The DSE method uses a different URI, "dse://", because under the hood it is actually performing a CQL-based request. This has a number of benefits over the Spark RPC, but as noted it does not match some of the submission mechanisms possible in OSS Spark.
There are several articles on this on the DataStax blog as well as documentation notes:
Network Security with DSE 5.1 Spark Resource Manager
Process Security with DSE 5.1 Spark Resource Manager
Instructions on the URL Change
Programmatic Spark Jobs
While it is still possible to launch an application using "setJars", you must also add the DSE-specific jars and config options to talk to the resource manager. In DSE 5.1.3+ there is a provided class, DseConfiguration, which can be applied to your SparkConf via DseConfiguration.enableDseSupport(conf) (or invoked via an implicit) and will set these options for you.
Example
Docs
This is of course for advanced users only and we strongly recommend using dse spark-submit if at all possible.
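For reference, a minimal sketch of the programmatic route described above; the host below is a placeholder and the exact import for DseConfiguration is not shown here:

import org.apache.spark.SparkConf

// Address the DSE resource manager with a dse:// URI instead of
// spark://host:7077 (the host is a placeholder). On DSE 5.1.3+ you would
// additionally apply DseConfiguration.enableDseSupport(conf), as described
// above, so the DSE-specific jars and options are set for you.
val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("dse://127.0.0.1")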
I found a solution.
First of all, I think it is impossible to run a Spark job from within an application in DSE 5.1; it has to be submitted with dse spark-submit.
Once submitted, it works perfectly. To communicate with the job I used Apache Kafka.
If you don't want to use a job, you can always go back to plain Apache Spark.

Spark integration

I am a newbie to Apache Spark.
My requirement is: when a user clicks on the web UI, a query needs to be passed to the Spark cluster, and the data returned from the cluster should update the UI.
I want to know how to pass the Spark SQL query and get the result set.
Spark has a Thrift server for this (running SQL queries through JDBC/ODBC). If Java is your middle layer, use JDBC to connect to the Spark Thrift server just like a database and pass/run whatever SQL Spark supports.
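A minimal sketch of that approach, assuming the Spark Thrift Server is already running on its default port; the host, credentials, table, and query are placeholders:

import java.sql.DriverManager

// The Thrift Server speaks the HiveServer2 protocol, so the Hive JDBC driver
// and a jdbc:hive2:// URL are used.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM my_table")
while (rs.next()) {
  println(rs.getLong(1))
}
rs.close(); stmt.close(); conn.close()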
Usually you would have to write a web application, typically with a REST interface, and implement the Spark SQL inside the server-side REST handler.
You can use Apache Livy.
Details : https://livy.incubator.apache.org/

Spark SQL CLI vs Thriftserver/Beeline

Can someone spell out the differences between using the Spark SQL CLI vs. the Thrift Server/Beeline to query/modify data in Hive? The Spark SQL documentation mentions both of them, but when would you use one or the other, or are they equivalent alternatives from a functional point of view?
For clarification:
spark-sql is a program that runs a single instance of Spark, and you interact with it as if it were a MySQL-like shell prompt; it makes use of the spark-warehouse and those types of features.
Spark with the Thrift Server is an application that exposes a connection to a running instance of Spark over a JDBC connection.
https://community.hortonworks.com/questions/33715/why-do-we-need-to-setup-spark-thrift-server.html
Beeline is a query/consumer tool that one uses to connect to a running hive2 JDBC endpoint (and thus, in the Spark documentation, they use Beeline to test that the JDBC connection is in fact working). Note: query/connector programs like SQL Workbench can be made to connect to Spark with the Thrift Server if they import the proper Hive2 JDBC drivers and jars.

Connecting to Cassandra Cluster instead of specific node

I am trying to learn Cassandra and have set up a 2-node Cassandra cluster. I have written a client in Java using the Cassandra JDBC driver, which currently connects to a hard-coded single node in the cluster. Ideally, I would like my client to connect to the "cluster" rather than a specific node,
so that the client code automatically connects to another node if the first node is down.
Is this possible using the Cassandra JDBC driver? I am currently using the code below to create the connection:
DriverManager.getConnection("jdbc:cassandra://localhost:9160/testdb");
Yes. If you're using the DataStax Java driver, you can get all of these benefits and more. From the documentation:
The driver has the following features:
connection pooling
node discovery
automatic failover
load balancing
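A minimal sketch of what that looks like with the DataStax Java driver (3.x assumed); the contact points and keyspace are placeholders. You only list a few seed nodes, and the driver discovers the rest, load balances, and fails over automatically:

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder()
  .addContactPoints("10.0.0.1", "10.0.0.2") // seed nodes, not the whole cluster
  .build()
val session = cluster.connect("testdb")

val rs = session.execute("SELECT release_version FROM system.local")
println(rs.one().getString("release_version"))

session.close()
cluster.close()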
What is your language? If you're using Java, I suggest the Hector framework.
http://hector-client.github.io/hector/build/html/index.html
I think it's very good for working with the Cassandra DB.
