How to work with temporary tables in foreachBatch? - apache-spark

We are building a streaming platform where it is essential to work with SQL's in batches.
val query = streamingDataSet.writeStream.option("checkpointLocation", checkPointLocation).foreachBatch { (df, batchId) => {
df.createOrReplaceTempView("events")
val df1 = ExecutionContext.getSparkSession.sql("select * from events")
df1.limit(5).show()
// More complex processing on dataframes
}}.trigger(trigger).outputMode(outputMode).start()
query.awaitTermination()
Error thrown is :
org.apache.spark.sql.streaming.StreamingQueryException: Table or view not found: events
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'events' not found in database 'default';
Streaming source is Kafka with watermarking and without using Spark-SQL we are able to execute dataframe transformations. Spark version is 2.4.0 and Scala is 2.11.7. Trigger is ProcessingTime every 1 minute and OutputMode is Append.
Is there any other approach to facilitate use of spark-sql within foreachBatch ? Would it work with upgraded version of Spark - in which case to version do we upgrade ?
Kindly help. Thank you.

tl;dr Replace ExecutionContext.getSparkSession with df.sparkSession.
The reason of the StreamingQueryException is that the streaming query tries to access the events temporary table in a SparkSession that knows nothing about it, i.e. ExecutionContext.getSparkSession.
The only SparkSession that has this events temporary table registered is exactly the SparkSession the df dataframe is created within, i.e. df.sparkSession.

Please check the code snippet below. Here, I have created two separate DataFrames, responseDF1 and responseDF2 from resultDF and shown the output in the console. responseDF2 is created using a temporary table. You can try the same.
resultDF.writeStream.foreachBatch {(batchDF: DataFrame, batchId: Long) =>
batchDF.persist()
val responseDF1 = batchDF.selectExpr("ResponseObj.type","ResponseObj.key", "ResponseObj.activity", "ResponseObj.price")
responseDF1.show()
responseDF1.createTempView("responseTbl1")
val responseDF2 = batchDF.sparkSession.sql("select activity, key from responseTbl1")
responseDF2.show()
batchDF.sparkSession.catalog.dropTempView("responseTbl1")
batchDF.unpersist()
()}.start().awaitTermination()
Code Snippet

Related

EMR: How to integrate Spark with Hive?

Using a EMR cluster, I created an external Hive table (over 800 millions of rows) that maps to a DynamoDB table. It works well and I can do queries and inserts through hive.
IF I try a query with a condition by the hash_key in Hive, I get the results in seconds. But doing the same query through spark-submit using SparkSQL and enableHiveSupport (accesing Hive) it doesn't finish.It seems that from Spark it's doing a full scan to the table.
I tried several configurations(different hive-site.xml for example) but it doesn't seem to work well from Spark. How should I do it through Spark? Any suggestions?
Thanks
Just make sure to use the dynamo connector opensource by AWS. By default it is available on EMR AFAIK.
Syntax to create a table using the DynamoDBStorageHandler class:
CREATE EXTERNAL TABLE hive_tablename (
hive_column1_name column1_datatype,
hive_column2_name column2_datatype
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "dynamodb_tablename",
"dynamodb.column.mapping" =
"hive_column1_name:dynamodb_attribute1_name,hive_column2_name:dynamodb_attribute2_name"
);
For any Spark Job, you need to have the followings confs :
$ spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
...
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.input.tableName", "myDynamoDBTable")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
orders.count()
References :
https://github.com/awslabs/emr-dynamodb-connector

Spark Cassandra Connector Issue

I am trying to integrate Cassandra with Spark and facing the below issue.
Issue:
com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches:
spark.cassandra.sql.keyspace
spark.cassandra.output.batch.grouping.key
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:50)
at com.datastax.spark.connector.cql.CassandraConnectorConf$.apply(CassandraConnectorConf.scala:253)
at org.apache.spark.sql.cassandra.CassandraSourceRelation$.apply(CassandraSourceRelation.scala:263)
at org.apache.spark.sql.cassandra.CassandraCatalog.org$apache$spark$sql$cassandra$CassandraCatalog$$buildRelation(CasandraCatalog.scala:41)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:26)
at org.apache.spark.sql.cassandra.CassandraCatalog$$anon$1.load(CassandraCatalog.scala:23)
Please find the below versions of spark Cassandra and connector I am using.
Spark : 1.6.0
Cassandra : 2.1.17
Connector Used : spark-cassandra-connector_2.10-1.6.0-M1.jar
Below is the code snippet I am using to connect Cassandra from spark.
val conf: org.apache.spark.SparkConf = new SparkConf(true) \
.setAppName("Spark Cassandra") \
.set"spark.cassandra.connection.host", "abc.efg.lkh") \
.set("spark.cassandra.auth.username", "xyz") \
.set("spark.cassandra.auth.password", "1234") \
.set("spark.cassandra.keyspace","abcded")
val sc = new SparkContext("local[*]", "Spark Cassandra",conf)
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("abcded")
val my_df = csc.sql("select * from table")
Here when I try to create DF, I am getting above posted error. I tried without passing schema in conf but it is trying to access in default schema where mentioned user doesn't have access.
Already a JIRA was opened and closed.
https://datastax-oss.atlassian.net/browse/SPARKC-102
yet I am getting this issue. Please let me know whether I need to use lastest connector to resolve this issue.
Thanks in advance.
The important information is in the error message you posted [formatted for readability]:
Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.keyspace is not a valid Spark Cassandra Connector variable.
Possible matches: spark.cassandra.sql.keyspace
spark.cassandra.keyspace is not an available property for the connector. A full list of the available properties can be found here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
You may have some luck using the suggested spark.cassandra.sql.keyspace; otherwise you may just need to explicitly specify the keyspace for every Cassandra interaction you perform using the connector.

spark-sql Table or view not found error

I'm trying to run a basic java program using spark-sql & JDBC. I'm running into the following error. Not sure what's wrong here. Most of the material I have read does not talk on what needs to be done to fix this problem.
It will also be great if someone can point me to some good material to read on Spark-sql (Spark-2.1.1). I'm planning to use spark to implement ETL's, connecting to MySQL and other datasources.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: myschema.mytable; line 1 pos 21;
String MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/myschema";
String MYSQL_USERNAME = "root";
String MYSQL_PWD = "root";
Properties connectionProperties = new Properties();
connectionProperties.put("user", MYSQL_USERNAME);
connectionProperties.put("password", MYSQL_PWD);
Dataset<Row> jdbcDF2 = spark.read()
.jdbc(MYSQL_CONNECTION_URL, "myschema.mytable", connectionProperties);
spark.sql("SELECT COUNT(*) FROM myschema.mytable").show();
It's because Spark is not registering any tables from any schemas from connection by default in Spark SQL Context. You must register it by yourself:
jdbcDF2.createOrReplaceTempView("mytable");
spark.sql("select count(*) from mytable");
Your jdbcDF2 has a source in myschema.mytable from MySQL and will load data from this table on some action.
Remember that MySQL table is not the same as Spark table or view. You are telling Spark to read data from MySQL, but you must register this DataFrame or Dataset as table or view in current Spark SQL Context or Spark Session

Query Spark SQL from Node.js server

I'm currently using npm's cassandra-driver to query my Cassandra database from a Node.js server. Since I want to be able to write more complex queries, I'd like to use Spark SQL instead of CQL. Is there any way to create a RESTful API (or something else) so that I can use Spark SQL the same way that I currently use CQL?
In other words, I want to be able to send a Spark SQL query from my Node.js server to another server and get a result back.
Is there any way to do this? I've been searching for solutions to this problem for a while and haven't found anything yet.
Edit: I'm able to query my database with Scala and Spark SQL from the Spark shell, so that bit is working. I just need to connect Spark and my Node.js server somehow.
I had a similar problem, and I solved by using Spark-JobServer.
The main approach with Spark-Jobserver (SJS) usually is to create a special job that extends their SparkSQLJob such as in the following example:
object ExecuteQuery extends SparkSQLJob {
override def validate(sqlContext: SQLContext, config: Config): SparkJobValidation = {
// Code to validate the parameters received in the request body
}
override def runJob(sqlContext: SQLContext, jobConfig: Config): Any = {
// Assuming your request sent a { "query": "..." } in the body:
val df = sqlContext.sql(config.getString("query"))
createResponseFromDataFrame(df) // You should implement this
}
}
However, for this approach to work well with Cassandra, you have to use the spark-cassandra-connector and then, to load the data you will have two options:
1) Before calling this ExecuteQuery via REST, you have to transfer the full data you want to query from Cassandra to Spark. For that, you would do something like (code adapted from the spark-cassandra-connector documentation):
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "words", "keyspace" -> "test"))
.load()
And then register it as a table in order to SparkSQL be able to access it:
df.registerAsTempTable("myTable") // As a temporary table
df.write.saveAsTable("myTable") // As a persistent Hive Table
Only after that you would be able to use the ExecuteQuery to query from myTable.
2) As the first option can be inefficient in some use cases, there is another option.
The spark-cassandra-connector has a special CassandraSQLContext that can be used to query C* tables directly from Spark. It can be used like:
val cc = new CassandraSQLContext(sc)
val df = cc.sql("SELECT * FROM keyspace.table ...")
However, to use a different type of context with Spark-JobServer, you need to extend SparkContextFactory and use it in the moment of context creation (which can be done by a POST request to /contexts). An example of a special context factory can be seen on SJS Gitub. You also have to create a SparkCassandraJob, extending SparkJob (but this part is very easy).
Finally, the ExecuteQuery job have to be adapted to use the new classes. It would be something like:
object ExecuteQuery extends SparkCassandraJob {
override def validate(cc: CassandraSQLContext, config: Config): SparkJobValidation = {
// Code to validate the parameters received in the request body
}
override def runJob(cc: CassandraSQLContext, jobConfig: Config): Any = {
// Assuming your request sent a { "query": "..." } in the body:
val df = cc.sql(config.getString("query"))
createResponseFromDataFrame(df) // You should implement this
}
}
After that, the ExecuteQueryjob can be executed via REST with a POST request.
Conclusion
Here I use the first option because I need the advanced queries available in the HiveContext (window functions, for example), which are not available in the CassandraSQLContext. However, if you don't need those kind of operations, I recommend the second approach, even if it needs some extra coding to create a new ContextFactory for SJS.

MemSQL Spark Job

I am trying to read a CSV file in Spark job using MemSQL Extractor and do some enrichment using Transformer and load to MemSQL Database using Java.
I see there is memsql-spark interface jar but not finding any useful Java API documentation or example.
I have started writing extractor to read from CSV but I dont know how to move further.
public Option<RDD<byte[]>> nextRDD(SparkContext sparkContext, UserExtractConfig config, long batchInterval, PhaseLogger logger) {
RDD<String> inputFile = sparkContext.textFile(filePath, minPartitions);
RDD<String> inputFile = sparkContext.textFile(filePath, minPartitions);
RDD<byte[]> bytes = inputFile.map(ByteUtils.utf8StringToBytes(filePath), String.class); //compilation error
return bytes; //compilation error
}
Would appreciate if someone can point me to some direction to get started...
thanks...
First configure Spark connector in java using following code:
SparkConf conf = new SparkConf();
conf.set("spark.datasource.singlestore.clientEndpoint", "singlestore-host")
spark.conf.set("spark.datasource.singlestore.user", "admin")
spark.conf.set("spark.datasource.singlestore.password", "s3cur3-pa$$word")
After running the above code spark will be connected to java then you can read csv in spark dataframe. You can transform and manipulate data according to requirements then you can write this dataframe to Database table.
Also attaching link for your reference.
spark-singlestore.

Resources