Problem in reading string NULL values from BigQuery - apache-spark

Currently I am using Spark to read data from BigQuery tables and write it to a storage bucket as CSV. One issue I am facing is that null string values are not handled properly when Spark reads them from BigQuery. Spark reads the null string values, but in the CSV it writes each one as an empty string in double quotes (i.e. like this: "").
# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
.option('table', <bq_dataset> + <bq_table>) \
.load()
bqdf.createOrReplaceTempView('bqdf')
# Select required data into another df
bqdf2 = spark.sql(
'SELECT * FROM bqdf')
# write to GCS
bqdf2.write.csv(<gcs_data_path> + <bq_table> + '/' , mode='overwrite', sep= '|')
I have tried the emptyValue='' and nullValue options with df.write.csv() when writing the CSV, but it doesn't work.
I need a solution for this problem. If anyone else has faced this issue and could help, thanks!

I was able to reproduce your case and found a solution that worked with a sample table I created in BigQuery (its contents appear in the show() output below).
According to the PySpark documentation, in the class pyspark.sql.DataFrameWriter(df), there is an option called nullValue:
nullValue – sets the string representation of a null value. If None is
set, it uses the default value, empty string.
This is what you are looking for, so I simply set the nullValue option in the code below.
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)
# Read the data from BigQuery as a Spark DataFrame.
data = spark.read.format("bigquery").option(
    "table", "dataset.table").load()
# Create a view so that Spark SQL queries can be run against the data.
data.createOrReplaceTempView("data_view")
# Select required data into another df
data_view2 = spark.sql(
    'SELECT * FROM data_view')
# Write nulls as empty, unquoted fields. write.csv() returns None, so nothing is assigned.
data_view2.write.csv('gs://bucket/folder', header=True, nullValue='')
data_view2.show()
Notice that I have used data_view2.show() to print out the view in order to check if it was correctly read. The output was:
+------+---+
|name |age|
+------+---+
|Robert| 25|
|null | 23|
+------+---+
So the null value was read correctly. In addition, I also checked the .csv file:
name,age
Robert,25
,23
As you can see, the null value is written correctly as an empty field rather than as an empty string in double quotes. As a final check, I created a load job from this .csv file into BigQuery. The table was created and the null value was interpreted correctly.
Note: I ran the PySpark job from the Dataproc jobs console on a previously created Dataproc cluster. The cluster was in the same location as the BigQuery dataset.
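One possible reason for seeing "" in the output is that the column holds empty strings rather than real nulls. The sketch below is a simplified pure-Python model of that distinction, not Spark's implementation. It assumes the writer renders nulls with nullValue (default: empty) and empty strings with emptyValue (default: ""), which is worth checking against the documentation for your Spark version. The function names are hypothetical.

```python
def format_field(value, null_value="", empty_value='""'):
    """Render one field under the assumed null/empty-string rules."""
    if value is None:      # a real null
        return null_value
    if value == "":        # an empty, non-null string
        return empty_value
    return str(value)


def to_csv_line(row, sep=",", null_value="", empty_value='""'):
    """Join the rendered fields of one row with the separator."""
    return sep.join(format_field(v, null_value, empty_value) for v in row)


rows = [("Robert", 25), (None, 23), ("", 40)]
lines = [to_csv_line(r) for r in rows]
print(lines)  # ['Robert,25', ',23', '"",40']
```

Under this model, passing empty_value="" makes the empty string come out as an unquoted empty field as well, which is the effect the emptyValue option is meant to give.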

Related

pyspark insertInto in overwrite mode is appending and not overwriting partitions

I'm a data engineer working with Spark 2.3, and I'm running into a problem:
the insertInto call below does not overwrite but appends, even though I changed spark.conf to 'dynamic'.
spark = spark_utils.getSparkInstance()
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
df\
.write\
.mode('overwrite')\
.format('orc')\
.option("compression","snappy")\
.insertInto("{0}.{1}".format(hive_database , src_table ))
Each time I run the job, rows are appended to the partition instead of overwriting it.
Has anyone run into this problem?
Thank you.
I tried to reproduce the error. According to the documentation, you must set overwrite to True in insertInto, because the method sets the write mode itself from that argument:
def insertInto(self, tableName, overwrite=False):
    """Inserts the content of the :class:`DataFrame` to the specified table.
    It requires that the schema of the class:`DataFrame` is the same as the
    schema of the table.
    Optionally overwriting any existing data.
    """
    self._jwrite.mode("overwrite" if overwrite else "append").insertInto(tableName)
So applying this to your code will be:
df\
.write\
.mode('overwrite')\
.format('orc')\
.option("compression","snappy")\
.insertInto("{0}.{1}".format(hive_database , src_table ), overwrite=True)
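To see why the earlier .mode('overwrite') had no effect, here is a tiny stand-in class that mirrors only the one line of the quoted insertInto source that sets the mode. WriterModel is a hypothetical name for illustration and is not part of PySpark.

```python
class WriterModel:
    """Minimal stand-in for a DataFrameWriter, tracking only the save mode."""

    def __init__(self):
        self._mode = "errorifexists"

    def mode(self, save_mode):
        self._mode = save_mode
        return self

    def insertInto(self, table_name, overwrite=False):
        # Same rule as the quoted source: the argument decides the mode,
        # replacing whatever an earlier .mode(...) call had set.
        self.mode("overwrite" if overwrite else "append")
        return (table_name, self._mode)


without_flag = WriterModel().mode("overwrite").insertInto("db.tbl")
with_flag = WriterModel().mode("overwrite").insertInto("db.tbl", overwrite=True)
print(without_flag)  # ('db.tbl', 'append')
print(with_flag)     # ('db.tbl', 'overwrite')
```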

Replace Null values with no value in spark sql

I am writing a CSV file to a data lake from a DataFrame that has null values. Spark SQL explicitly writes the value null for them. I want to replace these nulls with nothing at all, not with another string.
When I write the CSV file from Databricks, it looks like this:
ColA,ColB,ColC
null,ABC,123
ffgg,DEF,345
null,XYZ,789
I tried replacing nulls with '' using na.fill, but when I do that, the file gets written like this:
ColA,ColB,ColC
'',ABC,123
ffgg,DEF,345
'',XYZ,789
I want my CSV file to look like the following. How do I achieve this with Spark SQL? I am using Databricks. Any help is highly appreciated.
ColA,ColB,ColC
,ABC,123
ffgg,DEF,345
,XYZ,789
Thanks!
I think we need to use .saveAsTextFile for this case instead of csv.
Example:
df.show()
//+----+----+----+
//|col1|col2|col3|
//+----+----+----+
//|null| ABC| 123|
//| dd| ABC| 123|
//+----+----+----+
//extract header from dataframe
val header=spark.sparkContext.parallelize(Seq(df.columns.mkString(",")))
//union header with data, strip the enclosing [ ] that Row.toString adds, blank out fields that are exactly null, then save
header.union(df.rdd.map(x => x.toString.stripPrefix("[").stripSuffix("]").replaceAll("(?<![^,])null(?![^,])", ""))).coalesce(1).saveAsTextFile("<path>")
//content of file
//col1,col2,col3
//,ABC,123
//dd,ABC,123
If the first field in your data is not null, then you can use the csv option:
df.write.option("nullValue", null).mode("overwrite").csv("<path>")
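When null is stripped from a Row.toString-style line with a regular expression, it matters whether the pattern is a character class or matches whole fields. A character class such as [\[|\]|null] deletes every one of those characters wherever it occurs, including inside real values. The sketch below uses Python's re module on a made-up row to show the difference. It assumes fields contain no embedded commas.

```python
import re

row_text = "[null,full,123]"  # made-up example of Row.toString-style output

# Character class: removes every '[', '|', ']', 'n', 'u' and 'l' in the line.
char_class = re.sub(r"[\[|\]|null]", "", row_text)

# Drop the enclosing brackets, then blank only fields that are exactly "null".
whole_field = re.sub(r"(?<![^,])null(?![^,])", "", row_text[1:-1])

print(char_class)   # ,f,123
print(whole_field)  # ,full,123
```

The lookarounds require each match to be bounded by a comma or by the start or end of the line, so a value like full is left alone.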

Refresh metadata for Dataframe while reading parquet file

I am trying to read a Parquet file as a DataFrame that is updated periodically (the path is /folder_name). Whenever new data arrives, the old Parquet path (/folder_name) is renamed to a temp path, then we union the new data with the old data and store the result in the old path (/folder_name).
Suppose the Parquet file is hdfs://folder_name/part-xxxx-xxx.snappy.parquet before the update; after the update it becomes hdfs://folder_name/part-00000-yyyy-yyy.snappy.parquet.
The issue occurs when I try to read the Parquet file while the update is in progress.
sparksession.read.parquet("filename") => it picks up the old path hdfs://folder_name/part-xxxx-xxx.snappy.parquet (the path exists at that point)
When an action is called on the DataFrame, Spark tries to read the data from hdfs://folder_name/part-xxxx-xxx.snappy.parquet, but because of the update the file name has changed and I get the error below:
java.io.FileNotFoundException: File does not exist: hdfs://folder_name/part-xxxx-xxx.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I am using Spark 2.2
Can anyone help me refresh the metadata?
That error occurs when you try to read a file that doesn't exist.
Correct me if I'm wrong but I suspect you are overwriting all the files when you save the new dataframe (using .mode("overwrite")). While this process is running you are trying to read a file that was deleted and that exception is thrown - this makes the table unavailable for a period of time (during the update).
As far as I know there is no direct way of "refreshing the metadata" as you want.
Two (of several possible) ways of solving this:
1 - Use append mode
If you just want to append the new dataframe to the old one, there is no need to create a temporary folder and overwrite the old one. You can just change the save mode from overwrite to append. This way you add new files to the existing Parquet dataset without having to rewrite the existing ones.
df.write
.mode("append")
.parquet("/temp_table")
This is by far the simplest solution and there is no need to read the data that was already stored. This, however, won't work if you have to update the old data (ex: if you are doing an upsert). For that you have option 2:
2 - Use a Hive view
You can create hive tables and use a view to point to the most recent (and available) one.
Here is an example on the logic behind this approach:
Part 1
If the view <table_name> does not exist, we create a new table called <table_name>_alpha0 to store the new data.
After creating the table, we create a view <table_name> as select * from <table_name>_alpha0.
Part 2
If the view <table_name> exists, we need to see which table it is pointing to (<table_name>_alphaN).
You do all the operations you want with the new data and save it as a table named <table_name>_alpha(N+1).
After creating the table, we alter the view <table_name> to select * from <table_name>_alpha(N+1).
And a code example:
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._
import spark.implicits._
//This method verifies if the view exists and returns the table it is pointing to (using the query 'describe formatted')
def getCurrentTable(spark: SparkSession, databaseName:String, tableName: String): Option[String] = {
if(spark.catalog.tableExists(s"${databaseName}.${tableName}")) {
val rdd_desc = spark.sql(s"describe formatted ${databaseName}.${tableName}")
.filter("col_name == 'View Text'")
.rdd
if(rdd_desc.isEmpty()) {
None
}
else {
Option(
rdd_desc.first()
.get(1)
.toString
.toLowerCase
.stripPrefix("select * from ")
)
}
}
else
None
}
//This method saves a dataframe in the next "alpha table" and updates the view. It maintains 'rounds' tables (default=3). I.e. if the current table is alpha2, the next one will be alpha0 again.
def saveDataframe(spark: SparkSession, databaseName:String, tableName: String, new_df: DataFrame, rounds: Int = 3): Unit ={
val currentTable = getCurrentTable(spark, databaseName, tableName).getOrElse(s"${databaseName}.${tableName}_alpha${rounds-1}")
val nextAlphaTable = currentTable.replace(s"_alpha${currentTable.last}",s"_alpha${(currentTable.last.asDigit + 1) % rounds}") //asDigit yields the numeric suffix; toInt would yield the character code
new_df.write
.mode("overwrite")
.format("parquet")
.option("compression","snappy")
.saveAsTable(nextAlphaTable)
spark.sql(s"create or replace view ${databaseName}.${tableName} as select * from ${nextAlphaTable}")
}
//An example on how to use this:
//SparkSession: spark
val df = Seq((1,"I"),(2,"am"),(3,"a"),(4,"dataframe")).toDF("id","text")
val new_data = Seq((5,"with"),(6,"new"),(7,"data")).toDF("id","text")
val dbName = "test_db"
val tableName = "alpha_test_table"
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
println("Saving dataframe")
saveDataframe(spark, dbName, tableName, df)
println("Dataframe saved")
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
spark.read.table(s"${dbName}.${tableName}").show
val processed_df = df.unionByName(new_data) //Or other operations you want to do
println("Saving new dataframe")
saveDataframe(spark, dbName, tableName, processed_df)
println("Dataframe saved")
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
spark.read.table(s"${dbName}.${tableName}").show
Result:
Current table: Table does not exist
Saving dataframe
Dataframe saved
Current table: test_db.alpha_test_table_alpha0
+---+---------+
| id| text|
+---+---------+
| 3| a|
| 4|dataframe|
| 1| I|
| 2| am|
+---+---------+
Saving new dataframe
Dataframe saved
Current table: test_db.alpha_test_table_alpha1
+---+---------+
| id| text|
+---+---------+
| 3| a|
| 4|dataframe|
| 5| with|
| 6| new|
| 7| data|
| 1| I|
| 2| am|
+---+---------+
By doing this you can guarantee that a version of the view <table_name> will always be available. This also has the advantage (or not, depending on your case) of keeping the previous versions of the table, i.e. the previous version of <table_name>_alpha1 will be <table_name>_alpha0.
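The rotation of table names in this approach can be sketched on its own, without Spark. This is a minimal illustration in Python with hypothetical function names. It parses the numeric suffix as a number and wraps it around after rounds tables.

```python
def next_alpha_table(current_table, rounds=3):
    """Return the name of the table that follows current_table in the rotation."""
    prefix, _, suffix = current_table.rpartition("_alpha")
    return f"{prefix}_alpha{(int(suffix) + 1) % rounds}"


def first_alpha_table(database, table, rounds=3):
    """When no view exists yet, start from the last slot so the next one is alpha0."""
    return next_alpha_table(f"{database}.{table}_alpha{rounds - 1}", rounds)


first = first_alpha_table("test_db", "alpha_test_table")
second = next_alpha_table(first)
wrapped = next_alpha_table("test_db.alpha_test_table_alpha2")
print(first)    # test_db.alpha_test_table_alpha0
print(second)   # test_db.alpha_test_table_alpha1
print(wrapped)  # test_db.alpha_test_table_alpha0
```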
3 - A bonus
If upgrading your Spark version is an option, take a look at Delta Lake (minimum Spark version: 2.4.2)
Hope this helps :)
Cache the parquet first, force it to materialize, then do the overwrite.
import org.apache.spark.sql.SaveMode
val tmp = sparkSession.read.parquet("path/to/parquet_1").cache()
tmp.count() // cache() is lazy; an action is needed to actually load the data
tmp.write.mode(SaveMode.Overwrite).parquet("path/to/parquet_1") // same path
The error is thrown because Spark evaluates lazily. When the DAG is executed on the "write" command, it starts to read the parquet and write/overwrite at the same time.
Spark doesn't have a transaction manager (like Zookeeper) to lock files, so concurrent reads and writes are a challenge that needs to be taken care of separately.
To refresh the catalog you can do the following:
spark.catalog.refreshTable("my_table")
OR
spark.sql(s"REFRESH TABLE $tableName")
A simple solution would be to use df.cache.count to bring the data into memory first, then union it with the new data and write to /folder_name with mode overwrite. You won't have to use a temp path in this case.
You mentioned that you are renaming the /folder_name to some temp path. So you should read the old data from that temp path rather than hdfs://folder_name/part-xxxx-xxx.snappy.parquet.
Example
From reading your question, I think this might be your issue; if so, you should be able to run your code without using Delta Lake. In the use case below, Spark runs the code as follows:
(1) load the inputDF and store locally the file names in the folder location [in this case, the explicit part file names];
(2a) reach line 2 and overwrite the files within the tempLocation;
(2b) load the contents of the inputDF and output them to the tempLocation;
(3) follow the same steps as 1, but on the tempLocation;
(4a) delete the files within the inputLocation folder; and
(4b) try to load the part files recorded in 1 to read the data of the inputDF for the union, and break because the files no longer exist.
val inputDF = spark.read.format("parquet").load(inputLocation)
inputDF.write.format("parquet").mode("overwrite").save(tempLocation)
val tempDF = spark.read.format("parquet").load(tempLocation)
val outputDF = inputDF.union(tempDF)
outputDF.write.format("parquet").mode("overwrite").save(inputLocation)
From my experience, you can follow one of two pathways: persistence, or temporarily outputting everything used for the overwrite.
Persistence
In the use case below, we load the inputDF, immediately assign it to another DataFrame and persist it. When the subsequent action runs, the persisted copy holds the data itself rather than the file paths within the folder.
Alternatively, you can persist the outputDF, which has roughly the same effect. Because persistence is tied to the data and not to the file paths, deleting the inputs will not cause missing-file errors during the overwrite.
val inputDF = spark.read.format("parquet").load(inputLocation)
val inputDF2 = inputDF.persist
inputDF2.count
inputDF2.write.format("parquet").mode("overwrite").save(tempLocation)
val tempDF = spark.read.format("parquet").load(tempLocation)
val outputDF = inputDF2.union(tempDF)
outputDF.write.format("parquet").mode("overwrite").save(inputLocation)
Temporary load
Instead of loading the temporary output as the union input, you can write the entire outputDF to a temporary location and reload that location for the final output; then you shouldn't see the file-not-found error.

spark Dataframe string to Hive varchar

I read data from Oracle via spark JDBC connection to a DataFrame. I have a column which is obviously StringType in dataframe.
Now I want to persist this in Hive, but as datatype Varchar(5). I know the string would be truncated but it is ok.
I tried using UDFs which didn't work since dataframe does not have varchar or char types. I also created a temporary view in Hive using:
val tv = df.createOrReplaceTempView("t_name")
val df = spark.sql("select cast(col_name as varchar(5)) from tv")
But then when I call printSchema, I still see a string type.
How can I save it as a varchar column in the Hive table?
Try creating the Hive table ("dbName.tableName") with the required schema (varchar(5) in this case) and insert into the table directly from the DataFrame, like below.
df.write.insertInto("dbName.tableName" ,overwrite = False)

Overwrite Hive table with Spark SQL [duplicate]

I have a test table in MySQL with id and name like below:
+----+-------+
| id | name |
+----+-------+
| 1 | Name1 |
+----+-------+
| 2 | Name2 |
+----+-------+
| 3 | Name3 |
+----+-------+
I am using Spark DataFrame to read this data (using JDBC) and modifying the data like this
Dataset<Row> modified = sparkSession.sql("select id, concat(name,' - new') as name from test");
modified.write().mode("overwrite").jdbc(AppProperties.MYSQL_CONNECTION_URL,
"test", connectionProperties);
But my problem is that if I use overwrite mode, it drops the previous table and creates a new table but does not insert any data.
I tried the same program by reading from a csv file (same data as test table) and overwriting. That worked for me.
Am I missing something here ?
Thank You!
The problem is in your code. Because you overwrite a table from which you're trying to read you effectively obliterate all data before Spark can actually access it.
Remember that Spark is lazy. When you create a Dataset, Spark fetches the required metadata but doesn't load the data. So there is no magic cache which will preserve the original content. Data is loaded only when it is actually required. Here, that happens when you execute the write action, and by the time writing starts there is no more data to be fetched.
What you need is something like this:
Create a Dataset.
Apply required transformations and write data to an intermediate MySQL table.
TRUNCATE the original input and INSERT INTO ... SELECT from the intermediate table or DROP the original table and RENAME intermediate table.
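The lazy-read problem and the intermediate-table steps above can be modeled without Spark or MySQL. The sketch below is a toy illustration only: a dict stands in for the database, LazyTable is a hypothetical class that reads its source only when collect() is called, and overwrite() empties the target before reading, as the answer describes.

```python
class LazyTable:
    """Remembers where to read from and what to apply; reads only on collect()."""

    def __init__(self, storage, name, transform=None):
        self.storage = storage
        self.name = name
        self.transform = transform or (lambda row: row)

    def map(self, fn):
        prev = self.transform
        return LazyTable(self.storage, self.name, lambda row: fn(prev(row)))

    def collect(self):
        return [self.transform(row) for row in self.storage[self.name]]


def overwrite(storage, target, lazy_table):
    storage[target] = []                    # the target is dropped and recreated first
    storage[target] = lazy_table.collect()  # the source is only read at this point


def add_suffix(row):
    return (row[0], row[1] + " - new")


# Overwriting the table being read: its rows are gone before they are fetched.
db = {"test": [(1, "Name1"), (2, "Name2")]}
overwrite(db, "test", LazyTable(db, "test").map(add_suffix))
direct_result = db["test"]

# Writing to an intermediate table first, then replacing the original.
db = {"test": [(1, "Name1"), (2, "Name2")]}
overwrite(db, "test_staging", LazyTable(db, "test").map(add_suffix))
db["test"] = db.pop("test_staging")         # stands in for DROP + RENAME
staged_result = db["test"]

print(direct_result)  # []
print(staged_result)  # [(1, 'Name1 - new'), (2, 'Name2 - new')]
```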
An alternative, but less favorable, approach would be:
Create a Dataset.
Apply required transformations and write data to a persistent Spark table (df.write.saveAsTable(...) or equivalent)
TRUNCATE the original input.
Read data back and save (spark.table(...).write.jdbc(...))
Drop Spark table.
We cannot stress enough that using Spark cache / persist is not the way to go. Even with a conservative StorageLevel (MEMORY_AND_DISK_2 / MEMORY_AND_DISK_SER_2), cached data can be lost (node failures), leading to silent correctness errors.
I believe all the steps above are unnecessary. Here's what you need to do:
Create a dataset A like val A = spark.read.parquet("....")
Read the table to be updated as dataframe B. Make sure caching is enabled for dataframe B: val B = spark.read.jdbc("mytable").cache
Force a count on B - this will force execution and cache the table depending on the chosen StorageLevel - B.count
Now, you can do a transformation like val C = A.union(B)
And, then write C back to the database like C.write.mode(SaveMode.Overwrite).jdbc("mytable")
Reading and writing to same table.
cols_df = df_2.columns
broad_cast_var = spark_context.broadcast(df_2.collect())
df_3 = sqlContext.createDataFrame(broad_cast_var.value, cols_df)
Reading and writing to same table with some modification.
cols_df = df_2.columns
broad_cast_var = spark_context.broadcast(df_2.collect())
def update_x(x):
    y = (x[0] + 311, *x[1:])
    return y
rdd_2_1 = spark_context.parallelize(broad_cast_var.value).map(update_x)
df_3 = sqlContext.createDataFrame(rdd_2_1, cols_df)

Resources