Overwrite Hive table with Spark SQL [duplicate] - apache-spark

I have a test table in MySQL with id and name like below:
+----+-------+
| id | name |
+----+-------+
| 1 | Name1 |
+----+-------+
| 2 | Name2 |
+----+-------+
| 3 | Name3 |
+----+-------+
I am using Spark DataFrame to read this data (using JDBC) and modifying the data like this
Dataset<Row> modified = sparkSession.sql("select id, concat(name,' - new') as name from test");
modified.write().mode("overwrite").jdbc(AppProperties.MYSQL_CONNECTION_URL,
"test", connectionProperties);
But my problem is, if I give overwrite mode, it drops the previous table and creates a new table but not inserting any data.
I tried the same program by reading from a csv file (same data as test table) and overwriting. That worked for me.
Am I missing something here ?
Thank You!

The problem is in your code. Because you overwrite a table from which you're trying to read you effectively obliterate all data before Spark can actually access it.
Remember that Spark is lazy. When you create a Dataset Spark fetches required metadata, but doesn't load the data. So there is no magic cache which will preserve original content. Data will be loaded when it is actually required. Here it is when you execute write action and when you start writing there is no more data to be fetched.
What you need is something like this:
Create a Dataset.
Apply required transformations and write data to an intermediate MySQL table.
TRUNCATE the original input and INSERT INTO ... SELECT from the intermediate table or DROP the original table and RENAME intermediate table.
Alternative, but less favorable approach, would be:
Create a Dataset.
Apply required transformations and write data to a persistent Spark table (df.write.saveAsTable(...) or equivalent)
TRUNCATE the original input.
Read data back and save (spark.table(...).write.jdbc(...))
Drop Spark table.
We cannot stress enough that using Spark cache / persist is not the way to go. Even in with the conservative StorageLevel (MEMORY_AND_DISK_2 / MEMORY_AND_DISK_SER_2) cached data can be lost (node failures), leading to silent correctness errors.

I believe all the steps above are unnecessary. Here's what you need to do:
Create a dataset A like val A = spark.read.parquet("....")
Read the table to be updated, as dataframe B. Make sure enable caching is enabled for dataframe B. val B = spark.read.jdbc("mytable").cache
Force a count on B - this will force execution and cache the table depending on the chosen StorageLevel - B.count
Now, you can do a transformation like val C = A.union(B)
And, then write C back to the database like C.write.mode(SaveMode.Overwrite).jdbc("mytable")

Reading and writing to same table.
cols_df = df_2.columns
broad_cast_var = spark_context.broadcast(df_2.collect())
df_3 = sqlContext.createDataFrame(broad_cast_var.value, cols_df)
Reading and writing to same table with some modification.
cols_df = df_2.columns
broad_cast_var = spark_context.broadcast(df_2.collect())
def update_x(x):
y = (x[0] + 311, *x[1:])
return y
rdd_2_1 = spark_context.parallelize(broad_cast_var.value).map(update_x)
df_3 = sqlContext.createDataFrame(rdd_2_1, cols_df)

Related

Problem in reading string NULL values from BigQuery

Currently I am using spark to read data from bigqiery tables and write it to storage bucket as csv. One issue that i am facing is that the null string values are not being read properly by spark from bq. It reads the null string values but in the csv it writes that value as an empty string with double quotes (i.e. like this "").
# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
.option('table', <bq_dataset> + <bq_table>) \
.load()
bqdf.createOrReplaceTempView('bqdf')
# Select required data into another df
bqdf2 = spark.sql(
'SELECT * FROM bqdf')
# write to GCS
bqdf2.write.csv(<gcs_data_path> + <bq_table> + '/' , mode='overwrite', sep= '|')
I have tried emptyValue='' and nullValue options with df.write.csv() while writing to csv but dosen't work.
I needed a solution for this problem, if anyone else faced this issue and could help. Thanks!
I was able to reproduce your case and I found a solution that worked with a sample table I created in BigQuery. The data is as follows:
According to the PySpark documentation, in the class pyspark.sql.DataFrameWriter(df), there is an option called nullValue:
nullValue – sets the string representation of a null value. If None is
set, it uses the default value, empty string.
Which is what you are looking for. Then, I just implemented nullValue option below.
sc = SparkContext()
spark = SparkSession(sc)
# Read the data from BigQuery as a Spark Dataframe.
data = spark.read.format("bigquery").option(
"table", "dataset.table").load()
# Create a view so that Spark SQL queries can be run against the data.
data.createOrReplaceTempView("data_view")
# Select required data into another df
data_view2 = spark.sql(
'SELECT * FROM data_view')
df=data_view2.write.csv('gs://bucket/folder', header=True, nullValue='')
data_view2.show()
Notice that I have used data_view2.show() to print out the view in order to check if it was correctly read. The output was:
+------+---+
|name |age|
+------+---+
|Robert| 25|
|null | 23|
+------+---+
Therefore, the null value was precisely interpreted. In addition, I also checked the .csv file:
name,age
Robert,25
,23
As you can see the null value is correct and not represented as between double quotes as an empty String. Finally, just as a final inspection I created a load job from this .csv file to BigQuery. The table was created and the null value was interpreted accurately.
Note: I ran the pyspark job from the DataProc job's console in a DataProc cluster, previously created. Also, the cluster was at the same location as the dataset in BigQuery.

Refresh metadata for Dataframe while reading parquet file

I am trying to read a parquet file as a dataframe which will be updated periodically(path is /folder_name. whenever a new data comes the old parquet file path(/folder_name) will be renamed to a temp path and then we union both new data and old data and will store in the old path(/folder_name)
What happens is suppose we have a parquet file as hdfs://folder_name/part-xxxx-xxx.snappy.parquet before updation and then after updation it is changed to hdfs://folder_name/part-00000-yyyy-yyy.snappy.parquet
The issue is happening is when I try to read the parquet file while the update is being done
sparksession.read.parquet("filename") => it takes the old path hdfs://folder_name/part-xxxx-xxx.snappy.parquet(path exists)
when an action is called on the dataframe it is trying to read the data from hdfs://folder_name/part-xxxx-xxx.snappy.parquet but because of updation the filename changed and I am getting the below issue
java.io.FileNotFoundException: File does not exist: hdfs://folder_name/part-xxxx-xxx.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I am using Spark 2.2
Can anyone help me how to refresh the metadata?
That error occurs when you are trying to read a file that doesn't exists.
Correct me if I'm wrong but I suspect you are overwriting all the files when you save the new dataframe (using .mode("overwrite")). While this process is running you are trying to read a file that was deleted and that exception is thrown - this makes the table unavailable for a period of time (during the update).
As far as I know there is no direct way of "refreshing the metadata" as you want.
Two (of several possible) ways of solving this:
1 - Use append mode
If you just want to append the new dataframe to the old one there is no need of creating a temporary folder and overwriting the old one. You can just change the save mode from overwrite to append. This way you can add partitions to an existing Parquet file without having to rewrite existing ones.
df.write
.mode("append")
.parquet("/temp_table")
This is by far the simplest solution and there is no need to read the data that was already stored. This, however, won't work if you have to update the old data (ex: if you are doing an upsert). For that you have option 2:
2 - Use a Hive view
You can create hive tables and use a view to point to the most recent (and available) one.
Here is an example on the logic behind this approach:
Part 1
If the view <table_name> does not exist we create a new table called
<table_name>_alpha0 to store the new data
After creating the table
we create a view <table_name> as select * from
<table_name>_alpha0
Part 2
If the view <table_name> exists we need to see to which table it is pointing (<table_name>_alphaN)
You do all the operations you want with the new data save it as a table named <table_name>_alpha(N+1)
After creating the table we alter the view <table_name> to select * from <table_name>_alpha(N+1)
And a code example:
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._
import spark.implicits._
//This method verifies if the view exists and returns the table it is pointing to (using the query 'describe formatted')
def getCurrentTable(spark: SparkSession, databaseName:String, tableName: String): Option[String] = {
if(spark.catalog.tableExists(s"${databaseName}.${tableName}")) {
val rdd_desc = spark.sql(s"describe formatted ${databaseName}.${tableName}")
.filter("col_name == 'View Text'")
.rdd
if(rdd_desc.isEmpty()) {
None
}
else {
Option(
rdd_desc.first()
.get(1)
.toString
.toLowerCase
.stripPrefix("select * from ")
)
}
}
else
None
}
//This method saves a dataframe in the next "alpha table" and updates the view. It maintains 'rounds' tables (default=3). I.e. if the current table is alpha2, the next one will be alpha0 again.
def saveDataframe(spark: SparkSession, databaseName:String, tableName: String, new_df: DataFrame, rounds: Int = 3): Unit ={
val currentTable = getCurrentTable(spark, databaseName, tableName).getOrElse(s"${databaseName}.${tableName}_alpha${rounds-1}")
val nextAlphaTable = currentTable.replace(s"_alpha${currentTable.last}",s"_alpha${(currentTable.last.toInt + 1) % rounds}")
new_df.write
.mode("overwrite")
.format("parquet")
.option("compression","snappy")
.saveAsTable(nextAlphaTable)
spark.sql(s"create or replace view ${databaseName}.${tableName} as select * from ${nextAlphaTable}")
}
//An example on how to use this:
//SparkSession: spark
val df = Seq((1,"I"),(2,"am"),(3,"a"),(4,"dataframe")).toDF("id","text")
val new_data = Seq((5,"with"),(6,"new"),(7,"data")).toDF("id","text")
val dbName = "test_db"
val tableName = "alpha_test_table"
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
println("Saving dataframe")
saveDataframe(spark, dbName, tableName, df)
println("Dataframe saved")
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
spark.read.table(s"${dbName}.${tableName}").show
val processed_df = df.unionByName(new_data) //Or other operations you want to do
println("Saving new dataframe")
saveDataframe(spark, dbName, tableName, processed_df)
println("Dataframe saved")
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
spark.read.table(s"${dbName}.${tableName}").show
Result:
Current table: Table does not exist
Saving dataframe
Dataframe saved
Current table: test_db.alpha_test_table_alpha0
+---+---------+
| id| text|
+---+---------+
| 3| a|
| 4|dataframe|
| 1| I|
| 2| am|
+---+---------+
Saving new dataframe
Dataframe saved
Current table: test_db.alpha_test_table_alpha1
+---+---------+
| id| text|
+---+---------+
| 3| a|
| 4|dataframe|
| 5| with|
| 6| new|
| 7| data|
| 1| I|
| 2| am|
+---+---------+
By doing this you can guarantee that a version of the view <table_name> will always be available. This also has the advantage (or not, depending on your case) of maintaining the previous versions of the table. i.e. the previous version of <table_name_alpha1> will be <table_name_alpha0>
3 - A bonus
If upgrading your Spark version is an option, take a look at Delta Lake (minimum Spark version: 2.4.2)
Hope this helps :)
Cache the parquet first, then do overwrite.
var tmp = sparkSession.read.parquet("path/to/parquet_1").cache()
tmp.write.mode(SaveMode.Overwrite).parquet("path/to/parquet_1") // same path
Error is thrown because spark does lazy evaluation. When the DAG is executed on "write" command, it starts to read the parquet and write/overwrite at the same time.
Spark doesn't have a transaction manager like Zookeeper to do locks on files hence doing concurrent read/writes is a challenge which needs to be take care of separately.
To refresh the catalog you can do the following:-
spark.catalog.refreshTable("my_table")
OR
spark.sql(s"REFRESH TABLE $tableName")
A simple solution would be to use df.cache.count to bring in memory first, then do union with new data and write to /folder_name with mode overwrite. You won't have to use temp path in this case.
You mentioned that you are renaming the /folder_name to some temp path. So you should read the old data from that temp path rather than hdfs://folder_name/part-xxxx-xxx.snappy.parquet.
Example
From reading your question, I think this might be your issue if so you should be able to run your code without using DeltaLake. In the below use-case Spark will run the code as such: (1) load the inputDF a store locally the file names of the folder location [in this case the explicit part file names] ; (2a) reach line 2 and overwrite the files within the tempLocation; (2b) load the contents from the inputDF and output it to the tempLocation; (3) follow the same steps as 1 but on the tempLocation; (4a) delete the files within the inputLocation folder; and (4b) try to load the part files cached in 1 to load the data from the inputDF to run the union and break because the file does not exist.
val inputDF = spark.read.format("parquet").load(inputLocation)
inputDF.write.format("parquet").mode("overwrite").save(tempLocation)
val tempDF = spark.read.foramt("parquet").load(tempLocation)
val outputDF = inputDF.unionAll(tempDF)
outputDF.write.format("parquet").mode("overwrite").save(inputLocation)
From my experience you can follow two pathways persistence or temporarily output everything used for the overwrite.
Persistence
In the below use case we are going to load the inputDF and immediately save it as another element and persist it. When following with the action the persist will be on the data and not the file paths within the folder.
Else you can do the persistence on the outputDF, which will have, relatively, the same effect. Because the persistence is tethered to the data and not the file paths, the destruction of the inputs will not, cause the file paths to be missing during overwrite.
val inputDF = spark.read.format("parquet").load(inputLocation)
val inputDF2 = inputDF.persist
inputDF2.count
inputDF2.write.format("parquet").mode("overwrite").save(tempLocation)
val tempDF = spark.read.foramt("parquet").load(tempLocation)
val outputDF = inputDF2.unionAll(tempDF) outputDF.write.format("parquet").mode("overwrite").save(inputLocation)
Temporary load
Instead of loading the temporary output for the union input, if you instead entirely load the outputDF to a temporary file and reload that file for the output then you shouldn't see the file not found error.

spark: No of records in DataFrame is different in different runs

I am running a spark job that reads data from teradata. The query looks like
select * from db_name.table_name sample 5000000;
I'm trying to pull sample of 5 million rows of data. When I tried to print the number of rows in the result DataFrame, it is giving different results each time I run. Sometimes it is 4999937 and sometimes it is 5000124. Is there any particular reason for this kind of behaviour?
EDIT #1:
The code I'm using:
val query = "(select * from db_name.table_name sample 5000000) as data"
var teradataConfig = Map("url"->"jdbc:teradata://HOSTNAME/DATABASE=db_name,DBS_PORT=1025,MAYBENULL=ON",
"TMODE"->"TERA",
"user"->"username",
"password"->"password",
"driver"->"com.teradata.jdbc.TeraDriver",
"dbtable" -> query)
var df = spark.read.format("jdbc").options(teradataConfig).load()
df.count
Try caching the resultant dataframe and perform count action on the dataframe
df.cache()
println(s"Record count: ${df.count()}
From here on when you reuse the df to create new dataframe or any other transformation you don't get mismatched counts since it is already in cache.
Make sure you have given enough memory to hold the cached dataframe in memory.

Filtering Dataframe with predicate pushdown from another dataframe

How can I push down a filter to a dataframe reading based on another dataframe I have? Basically want to avoid reading the second dataframe entirely and then doing an inner join. Instead I would like to just submit a filter on the reading to filter at source. Even if I use an inner joined wrapped up with the read, the plan doesn't show that it is getting filtered. I feel like there is definitely a better way to set this up. Using Spark 2.x I have this so far but I want to avoid collecting a List as below:
// Don't want to do this collect...too slow
val idFilter = df1.select("id").distinct().map(r => r.getLong(0)).collect.toList
val df2: DataFrame = spark.read.format("parquet").load("<path>")
.filter($"id".isin(idFilter: _*))
You cannot directly use predicate pushdown unless you are implementing a DataSource yourself. Predicate Pushdown is a mechanism provided by the Spark Datasources which must be implemented by each Datasources individually.
For file based Datasources there is already a simple mechanism in place based on partitioning on Disk.
consider the following DataFrame:
val df = Seq(("test", "day1"), ("test2", "day2")).toDF("data", "day")
If we save that DataFrame to disk the following way:
df.write.partitionBy("day").save("/tmp/data")
The result will be the following folder structure
tmp -
|
| - data - |
|
|--day=day1 -|- part1....parquet
| |- part2....parquet
|
|--day=day2 -|- part1....parquet
|- part2....parquet
If you are now using this datasource like this:
spark.read.load("/tmp/data").filter($"day" = "day1").show()
Spark doesn't even bother loading the data of folder day2 as there is no need for it.
This is one type of predicate pushdown which works for every standard file format spark supports.
A more specific mechanism would be parquet. Parquet is a columnar based file format which means its quit easy to filter out columns. If you have parquet based files with 3 columns a, b, c in a file /tmp/myparquet.parquet the following query:
spark.read.parquet("/tmp/myparquet.parquet").select("a").show()
will result in an internal predicate pushdown where spark is only fetching data for column a without reading data for the columns b or c.
If someone is interested that mechanisms are established by implementing this trait:
/**
* A BaseRelation that can eliminate unneeded columns and filter using selected
* predicates before producing an RDD containing all matching tuples as Row objects.
*
* The actual filter should be the conjunction of all `filters`,
* i.e. they should be "and" together.
*
* The pushed down filters are currently purely an optimization as they will all be evaluated
* again. This means it is safe to use them with methods that produce false positives such
* as filtering partitions based on a bloom filter.
*
* #since 1.3.0
*/
#Stable
trait PrunedFilteredScan {
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
to be found in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

How to divide a column by its sum in a Spark DataFrame

How can I divide a column by its own sum in a Spark DataFrame, efficiently and without immediately triggering a computation?
Suppose we have some data:
import pyspark
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as spf
spark = SparkSession.builder.master('local').getOrCreate()
data = spark.range(0, 100)
data # --> DataFrame[id: bigint]
I’d like to create a new column on this data frame called “normalized” that contains id / sum(id). One way to do it is to pre-compute the sum, like this:
s = data.select(spf.sum('id')).collect()[0][0]
data2 = data.withColumn('normalized', spf.col('id') / s)
data2 # --> DataFrame[id: bigint, normalized: double]
That works fine, but it immediately triggers a computation; if you're defining something similar for many columns it will cause multiple redundant passes over the data.
Another way to do it is with a windowing specification that includes the whole table:
w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
data3 = data.withColumn('normalized', spf.col('id') / spf.sum('id').over(w))
data3 # --> DataFrame[id: bigint, normalized: double]
In this case, it's fine to define data3, but once you try to actually compute it, Spark 2.2.0 will move all the data into a single partition, which typically causes the job to fail for large data sets.
What other approaches are there to solving this problem, that don't trigger an immediate computation and that will work with large data sets? I'm interested in any solutions, not necessarily solutions based on pyspark.
crossJoin with aggregate is one approach:
data.crossJoin(
data.select(spf.sum('id').alias("sum_id"))
).withColumn("normalized", spf.col("id") / spf.col("sum_id"))
but I wouldn't worry to much:
That works fine, but it immediately triggers a computation; if you're defining something similar for many columns it will cause multiple redundant passes over the data.
Just compute multiple statistics at once:
data2 = data.select(spf.rand(42).alias("x"), spf.randn(42).alias("y"))
mean_x, mean_y = data2.groupBy().mean().first()
and the rest is just an operation on local expressions:
data2.select(spf.col("x") - mean_x, spf.col("y") - mean_y)

Resources