Is it possible to do an update using SparkSQL? - apache-spark

I have a DataFrame, that I created out of parquet file.
val file = "/user/spark/pagecounts-20160713-150000.parquet"
val df = sqlContext.read.parquet(file)
df.registerTempTable("wikipedia")
And now I want to do an update:
// just a dummy update statement
val sqlDF = sqlContext.sql("update wikipedia set requests=0 where article='!K7_Records'")
But I'm getting an error:
java.lang.RuntimeException: [1.1] failure: ``with'' expected but
identifier update found
update wikipediaEnTemp set requests=0 where article='!K7_Records'
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:137)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:237)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:249)
at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:197)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:249)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:217)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:882)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:881)
at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:208)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:208)
at org.apache.spark.sql.execution.datasources.DDLParser.parse(DDLParser.scala:43)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:231)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
... 57 elided

Spark tables are immutable so direct updates are not possible. However, if you can change your schema and queries, you can perform the equivalent of updates using append-only operations. The general problem is known in the data warehousing community as a Type II Slowly Changing Dimension. There is a Spark package for this, which I have not worked with.

RDD and Dataframes are immutable because the underlying data is immutable. So DML option is not included as part of sparkSQL.

Related

spark GroupBy throws StateSchemaNotCompatible exception with different "Existing key schema"

I am reading and writing events from EventHub in spark after trying to aggregated based on few keys like this:
val df1 = df0
.groupBy(
colKey,
colTimestamp
)
.agg(
collect_list(
struct(
colCreationTimestamp,
colRecordId
)
).as("Records")
)
But i am getting this error at runtime:
Error
Caused by: org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema doesn't match to the schema for existing state! Please note that Spark allow difference of field name: check count of fields and data type of each field.
- Provided key schema: StructType(StructField(Key,StringType,true), StructField(Timestamp,TimestampType,true)
- Provided value schema: StructType(StructField(buf,BinaryType,true))
- Existing key schema: StructType(StructField(_1,StringType,true), StructField(_2,TimestampType,true))
- Existing value schema: StructType(StructField(buf,BinaryType,true))
If you want to force running query without schema validation, please set spark.sql.streaming.stateStore.stateSchemaCheck to false.
Please note running query with incompatible schema could cause indeterministic behavior.
at org.apache.spark.sql.execution.streaming.state.StateSchemaCompatibilityChecker.check(StateSchemaCompatibilityChecker.scala:60)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$2(StateStore.scala:487)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$1(StateStore.scala:487)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
The exception doesnt contains the exact line number to reference my code, so i narrowed down to this code based on the provided key schema columns, and also if i change the groupBy key columns the error changes accordingly.
I tried different things like explicit df0.select() before group by for the required column to ensure that incoming data had the given column. but got the same error.
can someone suggest how its picking the Existing key schema, or what should i look for to resolve this?
update [Solved for me]
While uploading the records to eventHub, EventHubSpark library stores the states in checkpoint directory, where it had old state and causing the StateSchemaNotCompatible issue, pointing to new Checkpoint dir solved the issue for me.

Databricks Autoloader throws IllegalArgumentException

I'm trying the simplest auto loader example included in the databricks website
https://databricks.com/notebooks/Databricks-Data-Integration-Demo.html
df = (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.load(input_data_path))
(df.writeStream.format("delta")
.option("checkpointLocation", chkpt_path)
.table("iot_stream"))
I keep getting this message:
IllegalArgumentException: cloudFiles.schemaLocation Could not find required option: schemaLocation. Please provide a schema location using cloudFiles.schemaLocation for storing inferred schema and supporting schema evolution.
If providing cloudFiles.schemaLocation is required, why do the examples everywhere are missing it? what's the underlying issue here?
I suspect what is going on is that you are not explicitly setting the .option("cloudFiles.schemaEvolutionMode")
Which means it is being set to the default which is "addNewColumns" as per https://docs.databricks.com/ingestion/auto-loader/options.html
And that requires you set the .option("cloudFiles.schemaLocation", path) in the reader.
Thus you are inadvertently requiring it and not setting it.

FileNotFoundException: Spark save fails. Cannot clear cache from Dataset[T] avro

I get the following error when saving a dataframe in avro for a second time. If I delete sub_folder/part-00000-XXX-c000.avro after saving, and then try to save the same dataset, I get the following:
FileNotFoundException: File /.../main_folder/sub_folder/part-00000-3e7064c0-4a82-424c-80ca-98ce75766972-c000.avro does not exist. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
If I delete not only from sub_folder, but also from main_folder, then the problem doesn't happen, but I can't afford that.
The problem actually doesnt happen when trying to save the dataset in any
other format.
Saving an empty dataset does not cause an error.
The example suggests that the tables need to be refreshed, but as the output of sparkSession.catalog.listTables().show() there are no tables to refresh.
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+
The previously saved dataframe looks like this. The application is supposed to update it:
+--------------------+--------------------+
| Col1 | Col2 |
+--------------------+--------------------+
|[123456, , ABC, [...|[[v1CK, RAWNAME1_,..|
|[123456, , ABC, [...|[[BG8M, RAWNAME2_...|
+--------------------+--------------------+
For me this is a clear cache problem. However, all attemps of clearing the cache have failed:
dataset.write
.format("avro")
.option("path", path)
.mode(SaveMode.Overwrite) // Any save mode gives the same error
.save()
// Moving this either before or after saving doesnt help.
sparkSession.catalog.clearCache()
// This will not un-persist any cached data that is built upon this Dataset.
dataset.cache().unpersist()
dataset.unpersist()
And this is how I read the dataset:
private def doReadFromPath[T <: SpecificRecord with Product with Serializable: TypeTag: ClassTag](path: String): Dataset[T] = {
val df = sparkSession.read
.format("avro")
.load(path)
.select("*")
df.as[T]
}
Finally the stack trace is this one. Thanks a lot for your help!:
ERROR [task-result-getter-3] (Logging.scala:70) - Task 0 in stage 9.0 failed 1 times; aborting job
ERROR [main] (Logging.scala:91) - Aborting job 150de02a-ac6a-4d42-824d-5db44a98c19a.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 11, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/DATA/XXX/main_folder/sub_folder/part-00000-3e7064c0-4a82-424c-80ca-98ce75766972-c000.avro does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:241)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245)
... 10 more
*Reading from the same location and writing in to same location will give this issue. it was also discussed in this forum. along with my answer there *
and the below message in the error will mis lead. but actual issue is read/write from/in the same location.
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL
I am giving another example other than yours (used parquet in your case avro).
I have 2 options for you.
Option 1 (cache and show will work like below...) :
import org.apache.spark.sql.functions._
val df = Seq((1, 10), (2, 20), (3, 30)).toDS.toDF("sex", "date")
df.show(false)
df.repartition(1).write.format("parquet").mode("overwrite").save(".../temp") // save it
val df1 = spark.read.format("parquet").load(".../temp") // read back again
val df2 = df1.withColumn("cleanup" , lit("Rod want to cleanup")) // like you said you want to clean it.
//BELOW 2 ARE IMPORTANT STEPS LIKE `cache` and `show` forcing a light action show(1) with out which FileNotFoundException will come.
df2.cache // cache to avoid FileNotFoundException
df2.show(2, false) // light action to avoid FileNotFoundException
// or println(df2.count) // action
df2.repartition(1).write.format("parquet").mode("overwrite").save(".../temp")
println("Rod saved in same directory where he read it from final records he saved after clean up are ")
df2.show(false)
Option 2 :
1) save the DataFrame with a different avro folder.
2) Delete the old avro folder.
3) Finally rename this newly created avro folder to the old name, will work.
Thanks a lot Ram Ghadiyaram!
The solution had 2 solved my problem but only in my local Ubuntu. When I tested in HDFS, the problem remained.
The solution 1 was the definite fix. This is how my code looks now:
private def doWriteToPath[T <: Product with Serializable: TypeTag: ClassTag](dataset: Dataset[T], path: String): Unit = {
// clear any previously cached avro
sparkSession.catalog.clearCache()
// update the cache for this particular dataset, and trigger an action
dataset.cache().show(1)
dataset.write
.format("avro")
.option("path", path)
.mode(SaveMode.Overwrite)
.save()
}
Some remarks:
I had indeed checked that post, and attempted unsuccessfully the solution. I discarded that to be my problem, for the following reasons:
I had created a /temp under 'main_folder', called 'sub_folder_temp', and saving still failed.
Saving the same non-empty dataset in the same path but in json format actually works without the workaround discussed here.
Saving an empty dataset with the same type [T] in the same path actually works without the workaround discussed here.

How to work with temporary tables in foreachBatch?

We are building a streaming platform where it is essential to work with SQL's in batches.
val query = streamingDataSet.writeStream.option("checkpointLocation", checkPointLocation).foreachBatch { (df, batchId) => {
df.createOrReplaceTempView("events")
val df1 = ExecutionContext.getSparkSession.sql("select * from events")
df1.limit(5).show()
// More complex processing on dataframes
}}.trigger(trigger).outputMode(outputMode).start()
query.awaitTermination()
Error thrown is :
org.apache.spark.sql.streaming.StreamingQueryException: Table or view not found: events
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'events' not found in database 'default';
Streaming source is Kafka with watermarking and without using Spark-SQL we are able to execute dataframe transformations. Spark version is 2.4.0 and Scala is 2.11.7. Trigger is ProcessingTime every 1 minute and OutputMode is Append.
Is there any other approach to facilitate use of spark-sql within foreachBatch ? Would it work with upgraded version of Spark - in which case to version do we upgrade ?
Kindly help. Thank you.
tl;dr Replace ExecutionContext.getSparkSession with df.sparkSession.
The reason of the StreamingQueryException is that the streaming query tries to access the events temporary table in a SparkSession that knows nothing about it, i.e. ExecutionContext.getSparkSession.
The only SparkSession that has this events temporary table registered is exactly the SparkSession the df dataframe is created within, i.e. df.sparkSession.
Please check the code snippet below. Here, I have created two separate DataFrames, responseDF1 and responseDF2 from resultDF and shown the output in the console. responseDF2 is created using a temporary table. You can try the same.
resultDF.writeStream.foreachBatch {(batchDF: DataFrame, batchId: Long) =>
batchDF.persist()
val responseDF1 = batchDF.selectExpr("ResponseObj.type","ResponseObj.key", "ResponseObj.activity", "ResponseObj.price")
responseDF1.show()
responseDF1.createTempView("responseTbl1")
val responseDF2 = batchDF.sparkSession.sql("select activity, key from responseTbl1")
responseDF2.show()
batchDF.sparkSession.catalog.dropTempView("responseTbl1")
batchDF.unpersist()
()}.start().awaitTermination()
Code Snippet

Saving / exporting transformed DataFrame back to JDBC / MySQL

I'm trying to figure out how to use the new DataFrameWriter to write data back to a JDBC database. I can't seem to find any documentation for this, although looking at the source code it seems like it should be possible.
A trivial example of what I'm trying looks like this:
sqlContext.read.format("jdbc").options(Map(
"url" -> "jdbc:mysql://localhost/foo", "dbtable" -> "foo.bar")
).select("some_column", "another_column")
.write.format("jdbc").options(Map(
"url" -> "jdbc:mysql://localhost/foo", "dbtable" -> "foo.bar2")
).save("foo.bar2")
This doesn't work — I end up with this error:
java.lang.RuntimeException: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
I'm not sure if I'm doing something wrong (why is it resolving to DefaultSource instead of JDBCRDD for example?) or if writing to an existing MySQL database just isn't possible using Spark's DataFrames API.
Update
Current Spark version (2.0 or later) supports table creation on write.
The original answer
It is possible to write to an existing table but it looks like at this moment (Spark 1.5.0) creating table using JDBC data source is not supported yet*. You can check SPARK-7646 for reference.
If table already exists you can simply use DataFrameWriter.jdbc method:
val prop: java.util.Properties = ???
df.write.jdbc("jdbc:mysql://localhost/foo", "foo.bar2", prop)
* What is interesting PySpark seems to support table creation using jdbc method.

Resources