Spark: Use Persistent Table as Streaming Source for Spark Structured Streaming - apache-spark

I stored data in a table:
spark.table("default.student").show()
(1) Spark Jobs
+---+----+---+
| id|name|age|
+---+----+---+
| 1| bob| 34|
+---+----+---+
I would like to make a read stream using that table as source. I tried
newDF=spark.read.table("default.student")
newDF.isStreaming
Which returns False.
Is there a way to use a table as Streaming Source?

Need to use delta table. Like this on Databricks Notebook:
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").saveAsTable("T1")
stream = spark.readStream.format("delta").table("T1").writeStream.format("console").start()
// In another cell, execute:
data = spark.range(6, 10)
In DriverLogs can see 2 sets of data, then.

Related

Refresh metadata for Dataframe while reading parquet file

I am trying to read a parquet file as a dataframe which will be updated periodically(path is /folder_name. whenever a new data comes the old parquet file path(/folder_name) will be renamed to a temp path and then we union both new data and old data and will store in the old path(/folder_name)
What happens is suppose we have a parquet file as hdfs://folder_name/part-xxxx-xxx.snappy.parquet before updation and then after updation it is changed to hdfs://folder_name/part-00000-yyyy-yyy.snappy.parquet
The issue is happening is when I try to read the parquet file while the update is being done
sparksession.read.parquet("filename") => it takes the old path hdfs://folder_name/part-xxxx-xxx.snappy.parquet(path exists)
when an action is called on the dataframe it is trying to read the data from hdfs://folder_name/part-xxxx-xxx.snappy.parquet but because of updation the filename changed and I am getting the below issue
java.io.FileNotFoundException: File does not exist: hdfs://folder_name/part-xxxx-xxx.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I am using Spark 2.2
Can anyone help me how to refresh the metadata?
That error occurs when you are trying to read a file that doesn't exists.
Correct me if I'm wrong but I suspect you are overwriting all the files when you save the new dataframe (using .mode("overwrite")). While this process is running you are trying to read a file that was deleted and that exception is thrown - this makes the table unavailable for a period of time (during the update).
As far as I know there is no direct way of "refreshing the metadata" as you want.
Two (of several possible) ways of solving this:
1 - Use append mode
If you just want to append the new dataframe to the old one there is no need of creating a temporary folder and overwriting the old one. You can just change the save mode from overwrite to append. This way you can add partitions to an existing Parquet file without having to rewrite existing ones.
df.write
.mode("append")
.parquet("/temp_table")
This is by far the simplest solution and there is no need to read the data that was already stored. This, however, won't work if you have to update the old data (ex: if you are doing an upsert). For that you have option 2:
2 - Use a Hive view
You can create hive tables and use a view to point to the most recent (and available) one.
Here is an example on the logic behind this approach:
Part 1
If the view <table_name> does not exist we create a new table called
<table_name>_alpha0 to store the new data
After creating the table
we create a view <table_name> as select * from
<table_name>_alpha0
Part 2
If the view <table_name> exists we need to see to which table it is pointing (<table_name>_alphaN)
You do all the operations you want with the new data save it as a table named <table_name>_alpha(N+1)
After creating the table we alter the view <table_name> to select * from <table_name>_alpha(N+1)
And a code example:
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._
import spark.implicits._
//This method verifies if the view exists and returns the table it is pointing to (using the query 'describe formatted')
def getCurrentTable(spark: SparkSession, databaseName:String, tableName: String): Option[String] = {
if(spark.catalog.tableExists(s"${databaseName}.${tableName}")) {
val rdd_desc = spark.sql(s"describe formatted ${databaseName}.${tableName}")
.filter("col_name == 'View Text'")
.rdd
if(rdd_desc.isEmpty()) {
None
}
else {
Option(
rdd_desc.first()
.get(1)
.toString
.toLowerCase
.stripPrefix("select * from ")
)
}
}
else
None
}
//This method saves a dataframe in the next "alpha table" and updates the view. It maintains 'rounds' tables (default=3). I.e. if the current table is alpha2, the next one will be alpha0 again.
def saveDataframe(spark: SparkSession, databaseName:String, tableName: String, new_df: DataFrame, rounds: Int = 3): Unit ={
val currentTable = getCurrentTable(spark, databaseName, tableName).getOrElse(s"${databaseName}.${tableName}_alpha${rounds-1}")
val nextAlphaTable = currentTable.replace(s"_alpha${currentTable.last}",s"_alpha${(currentTable.last.toInt + 1) % rounds}")
new_df.write
.mode("overwrite")
.format("parquet")
.option("compression","snappy")
.saveAsTable(nextAlphaTable)
spark.sql(s"create or replace view ${databaseName}.${tableName} as select * from ${nextAlphaTable}")
}
//An example on how to use this:
//SparkSession: spark
val df = Seq((1,"I"),(2,"am"),(3,"a"),(4,"dataframe")).toDF("id","text")
val new_data = Seq((5,"with"),(6,"new"),(7,"data")).toDF("id","text")
val dbName = "test_db"
val tableName = "alpha_test_table"
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
println("Saving dataframe")
saveDataframe(spark, dbName, tableName, df)
println("Dataframe saved")
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
spark.read.table(s"${dbName}.${tableName}").show
val processed_df = df.unionByName(new_data) //Or other operations you want to do
println("Saving new dataframe")
saveDataframe(spark, dbName, tableName, processed_df)
println("Dataframe saved")
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
spark.read.table(s"${dbName}.${tableName}").show
Result:
Current table: Table does not exist
Saving dataframe
Dataframe saved
Current table: test_db.alpha_test_table_alpha0
+---+---------+
| id| text|
+---+---------+
| 3| a|
| 4|dataframe|
| 1| I|
| 2| am|
+---+---------+
Saving new dataframe
Dataframe saved
Current table: test_db.alpha_test_table_alpha1
+---+---------+
| id| text|
+---+---------+
| 3| a|
| 4|dataframe|
| 5| with|
| 6| new|
| 7| data|
| 1| I|
| 2| am|
+---+---------+
By doing this you can guarantee that a version of the view <table_name> will always be available. This also has the advantage (or not, depending on your case) of maintaining the previous versions of the table. i.e. the previous version of <table_name_alpha1> will be <table_name_alpha0>
3 - A bonus
If upgrading your Spark version is an option, take a look at Delta Lake (minimum Spark version: 2.4.2)
Hope this helps :)
Cache the parquet first, then do overwrite.
var tmp = sparkSession.read.parquet("path/to/parquet_1").cache()
tmp.write.mode(SaveMode.Overwrite).parquet("path/to/parquet_1") // same path
Error is thrown because spark does lazy evaluation. When the DAG is executed on "write" command, it starts to read the parquet and write/overwrite at the same time.
Spark doesn't have a transaction manager like Zookeeper to do locks on files hence doing concurrent read/writes is a challenge which needs to be take care of separately.
To refresh the catalog you can do the following:-
spark.catalog.refreshTable("my_table")
OR
spark.sql(s"REFRESH TABLE $tableName")
A simple solution would be to use df.cache.count to bring in memory first, then do union with new data and write to /folder_name with mode overwrite. You won't have to use temp path in this case.
You mentioned that you are renaming the /folder_name to some temp path. So you should read the old data from that temp path rather than hdfs://folder_name/part-xxxx-xxx.snappy.parquet.
Example
From reading your question, I think this might be your issue if so you should be able to run your code without using DeltaLake. In the below use-case Spark will run the code as such: (1) load the inputDF a store locally the file names of the folder location [in this case the explicit part file names] ; (2a) reach line 2 and overwrite the files within the tempLocation; (2b) load the contents from the inputDF and output it to the tempLocation; (3) follow the same steps as 1 but on the tempLocation; (4a) delete the files within the inputLocation folder; and (4b) try to load the part files cached in 1 to load the data from the inputDF to run the union and break because the file does not exist.
val inputDF = spark.read.format("parquet").load(inputLocation)
inputDF.write.format("parquet").mode("overwrite").save(tempLocation)
val tempDF = spark.read.foramt("parquet").load(tempLocation)
val outputDF = inputDF.unionAll(tempDF)
outputDF.write.format("parquet").mode("overwrite").save(inputLocation)
From my experience you can follow two pathways persistence or temporarily output everything used for the overwrite.
Persistence
In the below use case we are going to load the inputDF and immediately save it as another element and persist it. When following with the action the persist will be on the data and not the file paths within the folder.
Else you can do the persistence on the outputDF, which will have, relatively, the same effect. Because the persistence is tethered to the data and not the file paths, the destruction of the inputs will not, cause the file paths to be missing during overwrite.
val inputDF = spark.read.format("parquet").load(inputLocation)
val inputDF2 = inputDF.persist
inputDF2.count
inputDF2.write.format("parquet").mode("overwrite").save(tempLocation)
val tempDF = spark.read.foramt("parquet").load(tempLocation)
val outputDF = inputDF2.unionAll(tempDF) outputDF.write.format("parquet").mode("overwrite").save(inputLocation)
Temporary load
Instead of loading the temporary output for the union input, if you instead entirely load the outputDF to a temporary file and reload that file for the output then you shouldn't see the file not found error.

How to write a table to hive from spark without using the warehouse connector in HDP 3.1

when trying to use spark 2.3 on HDP 3.1 to write to a Hive table without the warehouse connector directly into hives schema using:
spark-shell --driver-memory 16g --master local[3] --conf spark.hadoop.metastore.catalog.default=hive
val df = Seq(1,2,3,4).toDF
spark.sql("create database foo")
df.write.saveAsTable("foo.my_table_01")
fails with:
Table foo.my_table_01 failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional
but a:
val df = Seq(1,2,3,4).toDF.withColumn("part", col("value"))
df.write.partitionBy("part").option("compression", "zlib").mode(SaveMode.Overwrite).format("orc").saveAsTable("foo.my_table_02")
Spark with spark.sql("select * from foo.my_table_02").show works just fine.
Now going to Hive / beeline:
0: jdbc:hive2://hostname:2181/> select * from my_table_02;
Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)
A
describe extended my_table_02;
returns
+-----------------------------+----------------------------------------------------+----------+
| col_name | data_type | comment |
+-----------------------------+----------------------------------------------------+----------+
| value | int | |
| part | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| part | int | |
| | NULL | NULL |
| Detailed Table Information | Table(tableName:my_table_02, dbName:foo, owner:hive/bd-sandbox.t-mobile.at#SANDBOX.MAGENTA.COM, createTime:1571201905, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:value, type:int, comment:null), FieldSchema(name:part, type:int, comment:null)], location:hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{path=hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, compression=zlib, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:part, type:int, comment:null)], parameters:{numRows=0, rawDataSize=0, spark.sql.sources.schema.partCol.0=part, transient_lastDdlTime=1571201906, bucketing_version=2, spark.sql.create.version=2.3.2.3.1.0.0-78, totalSize=740, spark.sql.sources.schema.numPartCols=1, spark.sql.sources.schema.part.0={\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"part\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}, numFiles=4, numPartitions=4, spark.sql.partitionProvider=catalog, spark.sql.sources.schema.numParts=1, spark.sql.sources.provider=orc, transactional=true}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, rewriteEnabled:false, catName:hive, ownerType:USER, writeId:-1) |
How can I use spark to write to hive without using the warehouse connector but still writing to the same metastore which can later on be read by hive?
To my best knowledge external tables should be possible (thy are not managed, not ACID not transactional), but I am not sure how to tell the saveAsTable how to handle these.
edit
related issues:
https://community.cloudera.com/t5/Support-Questions/In-hdp-3-0-can-t-create-hive-table-in-spark-failed/td-p/202647
Table loaded through Spark not accessible in Hive
setting the properties there proposed in the answer do not solve my issue
seems also to be a bug: https://issues.apache.org/jira/browse/HIVE-20593
Might be a workaround like the https://github.com/qubole/spark-acid like https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html but I do not like the idea of using more duct tape where I have not seen any large scale performance tests just yet. Also, this means changing all existing spark jobs.
In fact Cant save table to hive metastore, HDP 3.0 reports issues with large data frames and the warehouse connector.
edit
I just found https://community.cloudera.com/t5/Support-Questions/Spark-hive-warehouse-connector-not-loading-data-when-using/td-p/243613
And:
execute() vs executeQuery()
ExecuteQuery() will always use the Hiveserver2-interactive/LLAP as it
uses the fast ARROW protocol. Using it when the jdbc URL point to the
non-LLAP Hiveserver2 will yield an error.
Execute() uses JDBC and does not have this dependency on LLAP, but has
a built-in restriction to only return 1.000 records max. But for most
queries (INSERT INTO ... SELECT, count, sum, average) that is not a
problem.
But doesn't this kill any high-performance interoperability between hive and spark? Especially if there are not enough LLAP nodes available for large scale ETL.
In fact, this is true. This setting can be configured at https://github.com/hortonworks-spark/spark-llap/blob/26d164e62b45cfa1420d5d43cdef13d1d29bb877/src/main/java/com/hortonworks/spark/sql/hive/llap/HWConf.java#L39, though I am not sure of the performance impact of increasing this value
Did you try
data.write \
.mode("append") \
.insertInto("tableName")
Inside Ambari simply disabling the option of creating transactional tables by default solves my problem.
set to false twice (tez, llap)
hive.strict.managed.tables = false
and enable manually in each table property if desired (to use a transactional table).
Creating an external table (as a workaround) seems to be the best option for me.
This still involves HWC to register the column metadata or update the partition information.
Something along these lines:
val df:DataFrame = ...
val externalPath = "/warehouse/tablespace/external/hive/my_db.db/my_table"
import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()
dxx.write.partitionBy("part_col").option("compression", "zlib").mode(SaveMode.Overwrite).orc(externalPath)
val columns = dxx.drop("part_col").schema.fields.map(field => s"${field.name} ${field.dataType.simpleString}").mkString(", ")
val ddl =
s"""
|CREATE EXTERNAL TABLE my_db.my_table ($columns)
|PARTITIONED BY (part_col string)
|STORED AS ORC
|Location '$externalPath'
""".stripMargin
hive.execute(ddl)
hive.execute(s"MSCK REPAIR TABLE $tablename SYNC PARTITIONS")
Unfortunately, this throws a:
java.sql.SQLException: The query did not generate a result set!
from HWC
"How can I use spark to write to hive without using the warehouse connector but still writing to the same metastore which can later on be read by hive?"
We are working on the same setting (HDP 3.1 with Spark 2.3). Using below code we were getting the same error messages as you got "bucketId out of range: -1". The solution was to run set hive.fetch.task.conversion=none; in Hive shell before trying to query the table.
The code to write data into Hive without the HWC:
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
case class Record(key: Int, value: String)
val spark = SparkSession.builder()
.master("yarn")
.appName("SparkHiveExample")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
spark.sql("USE databaseName")
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.mode(SaveMode.Overwrite).format("orc").saveAsTable("sparkhive_records")
[Example from https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html]

Spark: Transforming multiple dataframes in parallel

Understanding how to achieve best parallelism while transforming multiple dataframes in parallel
I have an array of paths
val paths = Array("path1", "path2", .....
I am loading dataframe from each path then transforming and writing to destination path
paths.foreach(path => {
val df = spark.read.parquet(path)
df.transform(processData).write.parquet(path+"_processed")
})
The transformation processData is independent of dataframe I am loading.
This limits to processing one dataframe at a time and most of my cluster resources are idle. As processing each dataframe is independent, I converted Array to ParArray of scala.
paths.par.foreach(path => {
val df = spark.read.parquet(path)
df.transform(processData).write.parquet(path+"_processed")
})
Now it is using more resources in cluster. I am still trying to understand how it works and how to fine tune the parallel processing here
If I increase the default scala parallelism using ForkJoinPool to higher number, can it lead to more threads spawning at driver side and will be in lock state waiting for foreach function to finish and eventually kill the driver?
How does it effect the centralized spark things like EventLoggingListnener which needs to handle more inflow of events as multiple dataframes are processed in parallel.
What parameters do I consider for optimal resource utilization.
Any other approach
Any resources I can go through to understand this scaling would be very helpful
The reason why this is slow is that spark is very good at parallelizing computations on lots of data, stored in one big dataframe. However, it is very bad at dealing with lots of dataframes. It will start the computation on one using all its executors (even though they are not all needed) and wait for it to finish before starting the next one. This results in a lot of inactive processors. This is bad but that's not what spark was designed for.
I have a hack for you. There might need to refine it a little, but you would have the idea. Here is what I would do. From a list of paths, I would extract all the schemas of the parquet files and create a new big schema that gathers all the columns. Then, I would ask spark to read all the parquet files using this schema (the columns that are not present will be set to null automatically). I would then union all the dataframes and perform the transformation on this big dataframe and finally use partitionBy to store the dataframes in separate files, while still doing all of it in parallel. It would look like this.
// let create two sample datasets with one column in common (id)
// and two different columns x != y
val d1 = spark.range(3).withColumn("x", 'id * 10)
d1.show
+---+----+
| id| x |
+---+----+
| 0| 0|
| 1| 10|
| 2| 20|
+---+----+
val d2 = spark.range(2).withColumn("y", 'id cast "string")
d2.show
+---+---+
| id| y|
+---+---+
| 0| 0|
| 1| 1|
+---+---+
// And I store them
d1.write.parquet("hdfs:///tmp/d1.parquet")
d2.write.parquet("hdfs:///tmp/d2.parquet")
// Now let's create the big schema
val paths = Seq("hdfs:///tmp/d1.parquet", "hdfs:///tmp/d2.parquet")
val fields = paths
.flatMap(path => spark.read.parquet(path).schema.fields)
.toSet //removing duplicates
.toArray
val big_schema = StructType(fields)
// and let's use it
val dfs = paths.map{ path =>
spark.read
.schema(big_schema)
.parquet(path)
.withColumn("path", lit(path.split("/").last))
}
// The we are ready to create one big dataframe
dfs.reduce( _ unionAll _).show
+---+----+----+----------+
| id| x| y| file|
+---+----+----+----------+
| 1| 1|null|d1.parquet|
| 2| 2|null|d1.parquet|
| 0| 0|null|d1.parquet|
| 0|null| 0|d2.parquet|
| 1|null| 1|d2.parquet|
+---+----+----+----------+
Yet, I do not recommend using unionAll on lots of dataframes. Because of spark's analysis of the execution plan, it can be very slow with many dataframes. I would use the RDD version although it is more verbose.
val rdds = sc.union(dfs.map(_.rdd))
// let's not forget to add the path to the schema
val big_df = spark.createDataFrame(rdds,
big_schema.add(StructField("path", StringType, true)))
transform(big_df)
.write
.partitionBy("path")
.parquet("hdfs:///tmp/processed.parquet")
And having a look at my processed directory, I get this:
hdfs:///tmp/processed.parquet/_SUCCESS
hdfs:///tmp/processed.parquet/path=d1.parquet
hdfs:///tmp/processed.parquet/path=d2.parquet
You should play with some variables here. Most important are: CPU cores, the size of each DF and a little use of futures. The propose is decide the priority of each DF to be processed. You can use FAIR configuration but that don't be enough and process all in parallel could consume a big part of your cluster. You have to assign priorities to DFs and use Future pooll to control the number of parallel Jobs running in your app.

Vizualization with zeppelin using cassandra and spark

I'm new with zeppelin, but it look like interesting.
I'd like to do some visualization with cassandra's data reading with spark within zeppelin. But I can't do it, yet!
This is my code:
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql
val createDDL = """CREATE TEMPORARY VIEW keyspaces9
USING org.apache.spark.sql.cassandra
OPTIONS (
table "foehis",
keyspace "tfm",
pushdown "true")"""
spark.sql(createDDL)
spark.sql("SELECT hoclic,hodtac,hohrac,hotpac FROM keyspaces").show
And I get:
res41: org.apache.spark.sql.DataFrame = []
+------+--------+------+------+
|hoclic| hodtac|hohrac|hotpac|
+------+--------+------+------+
| 1011|10180619| 510| ENPR|
| 1011|20140427| 800| ANDE|
| 1011|20140427| 800| ANDE|
| 1011|20170522| 1100| ANDE|
| 1011|20170522| 1100| ANDE|
....
But I don't have the ability to make a viz
How do I convert that data into a table for zeppelin?
Register the DataFrame as a Table using df.registerTempTable.
In your case, register 'keyspaces' dataframe as table and then you can execute the SQL queries on the table and create visualizations.
Sample code:

Overwrite Hive table with Spark SQL [duplicate]

I have a test table in MySQL with id and name like below:
+----+-------+
| id | name |
+----+-------+
| 1 | Name1 |
+----+-------+
| 2 | Name2 |
+----+-------+
| 3 | Name3 |
+----+-------+
I am using Spark DataFrame to read this data (using JDBC) and modifying the data like this
Dataset<Row> modified = sparkSession.sql("select id, concat(name,' - new') as name from test");
modified.write().mode("overwrite").jdbc(AppProperties.MYSQL_CONNECTION_URL,
"test", connectionProperties);
But my problem is, if I give overwrite mode, it drops the previous table and creates a new table but not inserting any data.
I tried the same program by reading from a csv file (same data as test table) and overwriting. That worked for me.
Am I missing something here ?
Thank You!
The problem is in your code. Because you overwrite a table from which you're trying to read you effectively obliterate all data before Spark can actually access it.
Remember that Spark is lazy. When you create a Dataset Spark fetches required metadata, but doesn't load the data. So there is no magic cache which will preserve original content. Data will be loaded when it is actually required. Here it is when you execute write action and when you start writing there is no more data to be fetched.
What you need is something like this:
Create a Dataset.
Apply required transformations and write data to an intermediate MySQL table.
TRUNCATE the original input and INSERT INTO ... SELECT from the intermediate table or DROP the original table and RENAME intermediate table.
Alternative, but less favorable approach, would be:
Create a Dataset.
Apply required transformations and write data to a persistent Spark table (df.write.saveAsTable(...) or equivalent)
TRUNCATE the original input.
Read data back and save (spark.table(...).write.jdbc(...))
Drop Spark table.
We cannot stress enough that using Spark cache / persist is not the way to go. Even in with the conservative StorageLevel (MEMORY_AND_DISK_2 / MEMORY_AND_DISK_SER_2) cached data can be lost (node failures), leading to silent correctness errors.
I believe all the steps above are unnecessary. Here's what you need to do:
Create a dataset A like val A = spark.read.parquet("....")
Read the table to be updated, as dataframe B. Make sure enable caching is enabled for dataframe B. val B = spark.read.jdbc("mytable").cache
Force a count on B - this will force execution and cache the table depending on the chosen StorageLevel - B.count
Now, you can do a transformation like val C = A.union(B)
And, then write C back to the database like C.write.mode(SaveMode.Overwrite).jdbc("mytable")
Reading and writing to same table.
cols_df = df_2.columns
broad_cast_var = spark_context.broadcast(df_2.collect())
df_3 = sqlContext.createDataFrame(broad_cast_var.value, cols_df)
Reading and writing to same table with some modification.
cols_df = df_2.columns
broad_cast_var = spark_context.broadcast(df_2.collect())
def update_x(x):
y = (x[0] + 311, *x[1:])
return y
rdd_2_1 = spark_context.parallelize(broad_cast_var.value).map(update_x)
df_3 = sqlContext.createDataFrame(rdd_2_1, cols_df)

Resources