SparkSQL over REST API not JDBC - apache-spark

What is the successor to https://github.com/VeritoneAlpha/jaws-spark-sql-rest?
It does not support Spark 2. I would like to pass a query in a curl request and get back the rows retrieved by the query in JSON format.

Apache Livy works well with Spark 2.3
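As a rough illustration of how a query could be sent through Livy's REST API, here is a minimal Python sketch. It assumes a Livy version that supports the "sql" session kind; the host name is a placeholder, and the helper names (build_request, call, wait_for_state, run_sql) are made up for this example. The exact shape of the returned output object should be checked against your Livy version.

```python
import json
import time
import urllib.request

LIVY_URL = "http://livy-server-IP:8998"  # assumption: replace with your Livy endpoint


def build_request(url, payload=None, method=None):
    """Build a JSON request: POST when a payload is given, GET otherwise."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    if method is None:
        method = "POST" if payload is not None else "GET"
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def call(url, payload=None, method=None):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(url, payload, method)) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_state(url, wanted, failed, poll_seconds=1.0):
    """Poll a Livy resource until its "state" is the wanted one, or fail."""
    while True:
        body = call(url)
        if body["state"] == wanted:
            return body
        if body["state"] in failed:
            raise RuntimeError(f"{url} ended in state {body['state']}")
        time.sleep(poll_seconds)


def run_sql(query, base_url=LIVY_URL):
    """Create a SQL session, run one statement, and return Livy's output object."""
    session = call(f"{base_url}/sessions", {"kind": "sql"})
    session_url = f"{base_url}/sessions/{session['id']}"
    try:
        wait_for_state(session_url, "idle", {"error", "dead", "killed"})
        stmt = call(f"{session_url}/statements", {"code": query})
        stmt_url = f"{session_url}/statements/{stmt['id']}"
        done = wait_for_state(stmt_url, "available", {"error", "cancelled"})
        return done["output"]
    finally:
        # Sessions hold cluster resources, so remove the session when finished.
        call(session_url, method="DELETE")
```

Calling run_sql("SELECT * FROM some_table LIMIT 10") would then return the statement output as a Python dict. Starting a session per query is slow, so a real service would probably keep one session open and reuse it.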

Related

how can spark read / write from azurite

I am trying to read (and eventually write) from Azurite (version 3.18.0) using Spark (3.1.1).
I can't work out which Spark configurations and file URI I need to set to make this work properly.
For example, these are the containers and files I have inside Azurite:
/devstoreaccount1/container1/file1.avro
/devstoreaccount1/container2/file2.avro
This is the code that I'm running. The uri val is one of the values listed below:
val uri = ...
val spark = SparkSession.builder()
  .appName(appName)
  .master("local")
  .config("spark.driver.host", "127.0.0.1")
  .getOrCreate()
spark.conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(s"spark.hadoop.fs.azure.account.auth.type.devstoreaccount1.blob.core.windows.net", "SharedKey")
spark.conf.set(s"spark.hadoop.fs.azure.account.key.devstoreaccount1.blob.core.windows.net", <azurite account key>)
spark.read.format("avro").load(uri)
Which uri value is the correct one?
http://127.0.0.1:10000/container1/file1.avro
I get an UnsupportedOperationException from spark.read.format("avro").load(uri), because Spark uses the HttpFileSystem implementation, which doesn't support listStatus.
wasb://container1@devstoreaccount1.blob.core.windows.net/file1.avro
Spark tries to authenticate against the Azure servers (and fails, for obvious reasons).
I have tried to follow this Stack Overflow post without success.
I have also tried removing the blob.core.windows.net suffix from the configuration keys, but then I don't know how to give Spark the endpoint for the Azurite container.
So my question is: what are the correct configurations to give Spark so it can read from Azurite, and what are the correct file path formats to pass as the URI?

How to use spark structured streaming to simultaneously write to parquet and call REST API

How can I use Spark Structured Streaming to write to Parquet and call a REST API at the same time? This is what I have so far:
Through Spark SQL Structured Streaming, I am able to consume from Kafka topics.
The messages are in Avro format, and I am able to write them to a Parquet filesystem.
I am also able to read the Parquet filesystem and run any SQL query as needed.
This is the integration I am stuck on; can anyone please help?
I now have to integrate a REST call: I should be able to write to the Parquet filesystem and call the REST API at the same time.
To call the REST API, I should also convert the Dataset to an Avro object first and then prepare the request object for the REST API.
The streaming implementation above is written in Java, so a Java-based API or approach would be a great help.
FYI, I am using the latest version of Spark Streaming:
spark-streaming-kafka-0-10_2.12 -> 2.4.0
spark-streaming_2.12 -> 3.0.1
{
    // dataSet -> dataset holding the Kafka messages
    Dataset<Row> output = dataSet
        .select(package$.MODULE$.from_avro(col("value"), avroSchema).as("EventMessage"))
        .select("EventMessage.*");
    output
        .writeStream()
        .outputMode(OutputMode.Append().toString())
        .format("console")
        .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (df, batchId) -> {
            df.write().mode(OutputMode.Append().toString()).format("parquet").partitionBy("action").parquet(STREAM_PARQUET_OUTPUT_PATH);
            // REST API CALL BLOCK
            // df -> Avro object -> API request object -> REST call
        })
        .start()
        .awaitTermination();
}

How to pull Spark jobs client logs submitted using Apache Livy batches POST method using AirFlow

I am working on submitting a Spark job using the Apache Livy batches POST method.
The HTTP request is sent from Airflow. After submitting the job, I track its status using the batch ID.
I want to show the driver (client) logs in the Airflow logs, to avoid going to multiple places (Airflow and Apache Livy/Resource Manager).
Is this possible using the Apache Livy REST API?
Livy has endpoints to get logs: /sessions/{sessionId}/log and /batches/{batchId}/log.
Documentation:
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-sessionssessionidlog
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-batchesbatchidlog
You can create Python functions like the ones shown below to get the logs. They are written as methods of an operator or helper class that holds the hook in self.http:
# Created once, e.g. in __init__:
#     self.http = HttpHook("GET", http_conn_id=http_conn_id)
def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
    if not extra_options:
        extra_options = {}
    self.http.method = method
    # Send a payload only when there is one; json.dumps(None) would send the string "null".
    payload = json.dumps(data) if data is not None else None
    response = self.http.run(endpoint, payload, headers, extra_options=extra_options)
    return response

def _get_batch_session_logs(self, batch_id):
    method = "GET"
    endpoint = "batches/" + str(batch_id) + "/log"
    response = self._http_rest_call(method=method, endpoint=endpoint)
    # return response.json()
    return response
Livy exposes its REST API in two ways: sessions and batches. Since you are not using a session, you are submitting through batches. You can post your batch with a curl command such as:
curl -X POST -H "Content-Type: application/json" -d '{"file": "<path to your application jar or py file>"}' http://livy-server-IP:8998/batches
Once you have submitted the job, you get the batch ID in the response. You can then fetch its logs with:
curl http://livy-server-IP:8998/batches/{batchId}/log
You can find the documentation at:
https://livy.incubator.apache.org/docs/latest/rest-api.html
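The two curl calls above can be combined into a polling loop that forwards new log lines until the batch finishes. The following is only a sketch: the function names are invented for this example, the host is the same placeholder as in the curl commands, and it relies on the from and size query parameters and the /batches/{batchId}/state endpoint described in the Livy REST documentation. Livy may keep only a limited number of log lines, so very chatty jobs might not be captured completely.

```python
import json
import time
import urllib.request

LIVY_URL = "http://livy-server-IP:8998"  # same placeholder host as the curl commands


def get_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def fetch_log_lines(batch_id, offset, page_size, fetch=get_json, base_url=LIVY_URL):
    """Return (lines, next_offset) for the log lines starting at offset."""
    url = f"{base_url}/batches/{batch_id}/log?from={offset}&size={page_size}"
    lines = fetch(url).get("log", [])
    return lines, offset + len(lines)


def stream_batch_logs(batch_id, fetch=get_json, base_url=LIVY_URL,
                      poll_seconds=30, page_size=100, emit=print):
    """Emit log lines until the batch reaches a terminal state; return that state."""
    terminal = {"success", "dead", "killed", "error"}
    offset = 0
    while True:
        # Read the state first so that lines written before it are not missed.
        state = fetch(f"{base_url}/batches/{batch_id}/state")["state"]
        lines, offset = fetch_log_lines(batch_id, offset, page_size, fetch, base_url)
        for line in lines:
            emit(line)
        if len(lines) == page_size:
            continue  # a full page suggests more lines are waiting
        if state in terminal:
            return state
        time.sleep(poll_seconds)
```

Inside an Airflow operator, passing emit=self.log.info would be one way to make the driver output appear in the task log; the fetch parameter could likewise be swapped for a function built on HttpHook.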
If you want to avoid the steps above, you can use a ready-made AMI (namely, LightningFlow) from AWS Marketplace, which provides Airflow with a custom Livy operator. The Livy operator submits the job and checks its status every 30 seconds (configurable), and it also shows the Spark logs in the Airflow UI logs at the end of the Spark job.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and a local Spark cluster.
Link for AWS Marketplace:
https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V
This lets you view consolidated logs in one place, instead of switching between Airflow and EMR/Spark logs (Ambari/Resource Manager).

How to configure SSL between Spark and Cassandra?

I'm trying to configure SSL for the Cassandra Spark connector, but I couldn't find an example of how to do it.
I'm trying to configure it like this:
SparkConf conf = new SparkConf().setAppName("someApp")
.set("spark.cassandra.connection.host", "111.111.111.111")
.set("spark.cassandra.connection.ssl.enabled", "true")
.set("spark.cassandra.connection.ssl.trustStore.path", "/some/tfile.jks")
.set("spark.cassandra.connection.ssl.trustStore.password", "apassword")
.set("spark.cassandra.connection.ssl.trustStore.type", "JKS")
.set("spark.cassandra.connection.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA")
.set("spark.cassandra.connection.ssl.keyStore.path", "/some/kfile.jks")
.set("spark.cassandra.connection.ssl.keyStore.password", "anotherpassword")
.set("spark.cassandra.connection.ssl.keyStore.type", "JKS")
.set("spark.cassandra.connection.ssl.protocol", "TLS");
When I try to submit the spark job, I get these errors:
Exception in thread "main" com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.ssl.keyStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabled is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.protocol is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabledAlgorithms is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
So I'm not sure if this is supported or I'm just using the wrong property names.
I saw this ticket for release 1.2.3 of the connector, but I couldn't find an example of how to use it and it sounded like it may not support keystores. I'm using version 1.4.0-M1 of the connector.
Can anyone show me an example of how to configure SSL for the Spark Cassandra connector? Thanks.
I don't see any keystore settings, but I can see the config variables below, and they work fine for me.
Note: I am using version 1.5.0-M1. I am not sure whether there is a bug in the version you are using.
sparkConf.set("spark.cassandra.connection.ssl.enabled", "true");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.password", "password");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.path", "jks file path");

Accessing Spark RDDs from a web browser via thrift server - java

We have processed our data using Spark 1.2.1 with Java and stored it in Hive tables. We want to access this data as RDDs from a web browser.
I read the documentation and understood the steps needed for the task.
I am unable to find a way to interact with Spark SQL RDDs via the Thrift server. The examples I found have the line below in the code, and I cannot find the class for it in the Spark 1.2.1 Java API docs.
HiveThriftServer2.startWithContext
On GitHub I saw Scala examples using
import org.apache.spark.sql.hive.thriftserver, but I don't see this in the Java API docs. I am not sure if I am missing something.
Has anybody had luck accessing Spark SQL RDDs from a browser via Thrift? Can you post a code snippet? We are using Java.
I've got most of this working. Let's dissect each part of it (references at the bottom of the post):
HiveThriftServer2.startWithContext is defined in Scala. I was never able to access it from Java, or from Python using Py4j. I am no JVM expert, but I ended up switching to Scala. This may have something to do with the @DeveloperApi annotation. This is how I imported it in Scala in Spark 1.6.1:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
For anyone reading this and not using Hive: a Spark SQL context won't do, you need a Hive context. The HiveContext constructor used below takes a Scala SparkContext, so if sc is a JavaSparkContext, convert it first:
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.hive.HiveContext
var hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc))
Now start the thrift server
HiveThriftServer2.startWithContext(hiveContext)
// Yay
Next, we need to make our RDDs available as SQL tables. First, we have to convert them into Spark SQL DataFrames:
val someDF = hiveContext.createDataFrame(someRDD)
Then, we need to turn them into Spark SQL tables. You do this by persisting them to Hive, or making the RDD available as a temporary table.
Persist to Hive:
// Deprecated since Spark 1.4, to be removed in Spark 2.0:
someDF.saveAsTable("someTable")
// Up-to-date at time of writing (in Scala, write takes no parentheses)
someDF.write.saveAsTable("someTable")
Or, use a temporary table:
// Use the Data Frame as a Temporary Table
// Introduced in Spark 1.3.0
someDF.registerTempTable("someTable")
Note: temporary tables are isolated to an SQL session. Spark's Hive Thrift server is multi-session by default in version 1.6 (one session per connection). Therefore, for clients to access temporary tables you've registered, you'll need to set the option spark.sql.hive.thriftServer.singleSession to true.
You can test this by querying the tables in beeline, a command line utility for interacting with the hive thrift server. It ships with Spark.
Finally, you need a way of accessing the hive thrift server from the browser. Thanks to its awesome developers, it has an HTTP mode, so if you want to build a web app, you can use the thrift protocol over AJAX requests from the browser. A simpler strategy might be to create an IPython notebook, and use pyhive to connect to the thrift server.
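As an illustration of the PyHive route, here is a minimal sketch that runs one query against the Thrift server and returns the rows as JSON, which is the form a browser front end would typically consume. The host, port 10000 (the usual default for the binary Thrift transport), and the function names are assumptions made for this example; PyHive is a third-party package and must be installed separately.

```python
import json


def rows_to_json(column_names, rows):
    """Turn DB-API rows into a JSON array of objects keyed by column name."""
    records = [dict(zip(column_names, row)) for row in rows]
    # default=str covers values such as dates and decimals that JSON cannot encode.
    return json.dumps(records, default=str)


def query_thrift_server(sql, host="localhost", port=10000, username=None):
    """Run one query against the Thrift server and return the rows as JSON."""
    from pyhive import hive  # third-party dependency, imported only when needed

    conn = hive.connect(host=host, port=port, username=username)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        # cursor.description is the standard DB-API list of column descriptors.
        names = [col[0] for col in cursor.description]
        return rows_to_json(names, cursor.fetchall())
    finally:
        conn.close()
```

For example, query_thrift_server("SELECT * FROM someTable LIMIT 10") would query the table registered earlier. Remember that a temporary table is only visible here if the single-session option is enabled.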
Data Frame Reference:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/DataFrame.html
singleSession option pull request:
https://mail-archives.apache.org/mod_mbox/spark-commits/201511.mbox/%3Cc2bd1313f7ca4e618ec89badbd8f9f31@git.apache.org%3E
HTTP mode and beeline howto:
https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
Pyhive:
https://github.com/dropbox/PyHive
HiveThriftServer2 startWithContext definition:
https://github.com/apache/spark/blob/6b1a6180e7bd45b0a0ec47de9f7c7956543f4dfa/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56-73
The Thrift server is a JDBC/ODBC server.
You can connect to it via JDBC/ODBC connections and access content through the HiveDriver.
You cannot get RDDs back from it, because HiveContext is not available.
What you referred to is an experimental feature that is not available for Java.
As a workaround, you could re-parse the results and create your own structures for your client.
For example:
// Class-level constants
private static String driverName = "org.apache.hive.jdbc.HiveDriver";
private static String hiveConnectionString = "jdbc:hive2://YourHiveServer:Port";
private static String tableName = "SOME_TABLE";

// Inside a method that declares the checked exceptions (ClassNotFoundException, SQLException)
Class.forName(driverName); // load and register the Hive JDBC driver
Connection con = DriverManager.getConnection(hiveConnectionString, "user", "pwd");
Statement stmt = con.createStatement();
String sql = "select * from " + tableName;
ResultSet res = stmt.executeQuery(sql);
parseResultsToObjects(res); // your own method that maps rows to client-side objects
