Refresh metadata for Dataframe while reading parquet file - apache-spark

I am trying to read a parquet file as a dataframe, and it is updated periodically (the path is /folder_name). Whenever new data arrives, the old parquet path (/folder_name) is renamed to a temp path, then we union the new data with the old data and store the result back in the old path (/folder_name).
For example, suppose we have a parquet file hdfs://folder_name/part-xxxx-xxx.snappy.parquet before the update; after the update it becomes hdfs://folder_name/part-00000-yyyy-yyy.snappy.parquet.
The issue happens when I try to read the parquet file while the update is in progress:
sparksession.read.parquet("filename") => it picks up the old path hdfs://folder_name/part-xxxx-xxx.snappy.parquet (which exists at that point)
When an action is called on the dataframe, it tries to read the data from hdfs://folder_name/part-xxxx-xxx.snappy.parquet, but because of the update the filename has changed and I get the error below:
java.io.FileNotFoundException: File does not exist: hdfs://folder_name/part-xxxx-xxx.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I am using Spark 2.2
Can anyone tell me how to refresh the metadata?

That error occurs when you are trying to read a file that doesn't exist.
Correct me if I'm wrong, but I suspect you are overwriting all the files when you save the new dataframe (using .mode("overwrite")). While this process is running you are trying to read a file that was already deleted, and that exception is thrown; this makes the table unavailable for a period of time (during the update).
As far as I know there is no direct way of "refreshing the metadata" as you want.
Two (of several possible) ways of solving this:
1 - Use append mode
If you just want to append the new dataframe to the old one, there is no need to create a temporary folder and overwrite the old one. You can just change the save mode from overwrite to append. This way you can add new files to an existing Parquet directory without having to rewrite the existing ones.
df.write
  .mode("append")
  .parquet("/temp_table")
This is by far the simplest solution and there is no need to read the data that was already stored. This, however, won't work if you have to update the old data (ex: if you are doing an upsert). For that you have option 2:
2 - Use a Hive view
You can create hive tables and use a view to point to the most recent (and available) one.
Here is an example of the logic behind this approach:
Part 1
If the view <table_name> does not exist, create a new table called <table_name>_alpha0 to store the new data.
After creating the table, create a view <table_name> as select * from <table_name>_alpha0.
Part 2
If the view <table_name> exists, check which table it is pointing to (<table_name>_alphaN).
Do all the operations you want with the new data and save it as a table named <table_name>_alpha(N+1).
After creating the table, alter the view <table_name> to select * from <table_name>_alpha(N+1).
And a code example:
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._
import spark.implicits._
//This method verifies if the view exists and returns the table it is pointing to (using the query 'describe formatted')
def getCurrentTable(spark: SparkSession, databaseName:String, tableName: String): Option[String] = {
  if(spark.catalog.tableExists(s"${databaseName}.${tableName}")) {
    val rdd_desc = spark.sql(s"describe formatted ${databaseName}.${tableName}")
      .filter("col_name == 'View Text'")
      .rdd
    if(rdd_desc.isEmpty()) {
      None
    }
    else {
      Option(
        rdd_desc.first()
          .get(1)
          .toString
          .toLowerCase
          .stripPrefix("select * from ")
      )
    }
  }
  else
    None
}
//This method saves a dataframe in the next "alpha table" and updates the view. It maintains 'rounds' tables (default=3). I.e. if the current table is alpha2, the next one will be alpha0 again.
def saveDataframe(spark: SparkSession, databaseName:String, tableName: String, new_df: DataFrame, rounds: Int = 3): Unit = {
  val currentTable = getCurrentTable(spark, databaseName, tableName).getOrElse(s"${databaseName}.${tableName}_alpha${rounds-1}")
  // asDigit converts the trailing digit character to its numeric value (e.g. '2' -> 2) before cycling through the rounds
  val nextAlphaTable = currentTable.replace(s"_alpha${currentTable.last}", s"_alpha${(currentTable.last.asDigit + 1) % rounds}")
  new_df.write
    .mode("overwrite")
    .format("parquet")
    .option("compression", "snappy")
    .saveAsTable(nextAlphaTable)
  spark.sql(s"create or replace view ${databaseName}.${tableName} as select * from ${nextAlphaTable}")
}
//An example on how to use this:
//SparkSession: spark
val df = Seq((1,"I"),(2,"am"),(3,"a"),(4,"dataframe")).toDF("id","text")
val new_data = Seq((5,"with"),(6,"new"),(7,"data")).toDF("id","text")
val dbName = "test_db"
val tableName = "alpha_test_table"
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
println("Saving dataframe")
saveDataframe(spark, dbName, tableName, df)
println("Dataframe saved")
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
spark.read.table(s"${dbName}.${tableName}").show
val processed_df = df.unionByName(new_data) //Or other operations you want to do
println("Saving new dataframe")
saveDataframe(spark, dbName, tableName, processed_df)
println("Dataframe saved")
println(s"Current table: ${getCurrentTable(spark, dbName, tableName).getOrElse("Table does not exist")}")
spark.read.table(s"${dbName}.${tableName}").show
Result:
Current table: Table does not exist
Saving dataframe
Dataframe saved
Current table: test_db.alpha_test_table_alpha0
+---+---------+
| id| text|
+---+---------+
| 3| a|
| 4|dataframe|
| 1| I|
| 2| am|
+---+---------+
Saving new dataframe
Dataframe saved
Current table: test_db.alpha_test_table_alpha1
+---+---------+
| id| text|
+---+---------+
| 3| a|
| 4|dataframe|
| 5| with|
| 6| new|
| 7| data|
| 1| I|
| 2| am|
+---+---------+
By doing this you can guarantee that a version of the view <table_name> will always be available. This also has the advantage (or not, depending on your case) of keeping the previous versions of the table, i.e. the previous version of <table_name>_alpha1 will be <table_name>_alpha0.
3 - A bonus
If upgrading your Spark version is an option, take a look at Delta Lake (minimum Spark version: 2.4.2)
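As a rough illustration of why this helps, here is a sketch assuming the Delta Lake package is on the classpath and the path is your /folder_name; Delta commits are atomic, so readers keep seeing the last committed snapshot while an overwrite is running.
// Sketch only: dataframe names and paths are illustrative
mergedDf.write.format("delta").mode("overwrite").save("/folder_name")
val currentSnapshot = spark.read.format("delta").load("/folder_name") // readers always see a consistent version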
Hope this helps :)

Cache the parquet first, then do overwrite.
var tmp = sparkSession.read.parquet("path/to/parquet_1").cache()
tmp.write.mode(SaveMode.Overwrite).parquet("path/to/parquet_1") // same path
The error is thrown because Spark does lazy evaluation. When the DAG is executed by the "write" command, it starts to read the parquet and write/overwrite it at the same time.
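Note that cache() is lazy on its own; here is a minimal sketch of forcing materialization with an action before the overwrite (whether this fully avoids the race depends on available memory, since evicted partitions would be recomputed from the already-deleted files):
import org.apache.spark.sql.SaveMode

var tmp = sparkSession.read.parquet("path/to/parquet_1").cache()
tmp.count()  // action: populate the cache before the files are touched
tmp.write.mode(SaveMode.Overwrite).parquet("path/to/parquet_1") // same path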

Spark doesn't have a transaction manager like Zookeeper to take locks on files, so concurrent reads/writes are a challenge that needs to be taken care of separately.
To refresh the catalog you can do the following:
spark.catalog.refreshTable("my_table")
OR
spark.sql(s"REFRESH TABLE $tableName")

A simple solution would be to use df.cache.count to bring the data into memory first, then union it with the new data and write to /folder_name with mode overwrite. You won't have to use the temp path in this case.
You mentioned that you are renaming /folder_name to some temp path, so you should read the old data from that temp path rather than from hdfs://folder_name/part-xxxx-xxx.snappy.parquet.
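A minimal sketch of the cache-and-count approach, with illustrative names (new_df stands for the incoming data):
val oldDf = spark.read.parquet("/folder_name").cache()
oldDf.count()                     // action: bring the old data into memory
val merged = oldDf.union(new_df)  // combine with the incoming data
merged.write.mode("overwrite").parquet("/folder_name")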

Example
From reading your question, I think this might be your issue; if so, you should be able to run your code without using Delta Lake. In the use case below, Spark runs the code as follows: (1) load the inputDF and store locally the file names in the folder location (in this case the explicit part file names); (2a) reach line 2 and overwrite the files within the tempLocation; (2b) load the contents from the inputDF and output them to the tempLocation; (3) follow the same steps as 1 but on the tempLocation; (4a) delete the files within the inputLocation folder; and (4b) try to load the part files cached in 1 to load the data from the inputDF for the union, and break because the files no longer exist.
val inputDF = spark.read.format("parquet").load(inputLocation)
inputDF.write.format("parquet").mode("overwrite").save(tempLocation)
val tempDF = spark.read.format("parquet").load(tempLocation)
val outputDF = inputDF.unionAll(tempDF)
outputDF.write.format("parquet").mode("overwrite").save(inputLocation)
From my experience you can follow two pathways: persistence, or temporarily outputting everything used for the overwrite.
Persistence
In the use case below we load the inputDF, immediately save it as another element, and persist it. When an action follows, the persist applies to the data and not to the file paths within the folder.
Alternatively, you can do the persistence on the outputDF, which will have roughly the same effect (a sketch of that variant follows the snippet below). Because the persistence is tethered to the data and not to the file paths, destroying the inputs will not cause the file paths to be missing during the overwrite.
val inputDF = spark.read.format("parquet").load(inputLocation)
val inputDF2 = inputDF.persist
inputDF2.count
inputDF2.write.format("parquet").mode("overwrite").save(tempLocation)
val tempDF = spark.read.format("parquet").load(tempLocation)
val outputDF = inputDF2.unionAll(tempDF)
outputDF.write.format("parquet").mode("overwrite").save(inputLocation)
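A sketch of the outputDF-persistence variant mentioned above, under the same assumptions as the snippet in this answer:
val inputDF = spark.read.format("parquet").load(inputLocation)
val tempDF = spark.read.format("parquet").load(tempLocation)
val outputDF = inputDF.unionAll(tempDF).persist
outputDF.count                    // force materialization before the inputs are overwritten
outputDF.write.format("parquet").mode("overwrite").save(inputLocation)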
Temporary load
Instead of loading the temporary output for the union input, if you write the outputDF entirely to a temporary location and reload that data for the final output, you shouldn't see the file-not-found error.
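A sketch of this variant; tempOutputLocation is an illustrative extra path, not something from the original code:
val inputDF = spark.read.format("parquet").load(inputLocation)
val tempDF = spark.read.format("parquet").load(tempLocation)
val outputDF = inputDF.unionAll(tempDF)
outputDF.write.format("parquet").mode("overwrite").save(tempOutputLocation) // stage the union somewhere else first
val stagedDF = spark.read.format("parquet").load(tempOutputLocation)        // reload, detached from the input files
stagedDF.write.format("parquet").mode("overwrite").save(inputLocation)      // now safe to overwrite the original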

Related

How to query all the versions from delta lake table at once to track changes made to a specific ID

I have an employee table with salary of all the employees managed using delta lake.
I can query the table based on the version or the timestamp using the time travel feature Delta Lake supports, like this:
SELECT *
FROM DELTA.`EMPLOYEE`
VERSION AS OF 3
But I want to know the history of all the changes done to an employee across all the versions of the delta table. Something like this
SELECT *
, timestamp -- From delta table
, version -- From delta table
FROM DELTA.`EMPLOYEE`
WHERE EMPLOYEE = 'George'
WITHIN ALL VERSIONS --Never exists but just for understanding
It is an old question, but today I stumbled on it since I had a problem of my own to solve. I don't think Delta (delta.io) provides a method for this, because Delta revolves around Time Travel to a specific point in time rather than over a period.
But if I had to get this, I guess one way would be to directly read the parquet files (ignoring the delta logs), which will return all the past versions/states of each record (leaving Vacuum etc. aside).
Now if the requirement is to get the exact version of each record creation (which is my requirement), use something like
dataframe.withColumn("input_file", input_file_name())
which will show the exact file name the record is coming from.
Now query the .json _delta_log transaction files, which tell us which version added which file, something like this:
>>> from pyspark.sql.functions import col, substring, input_file_name
>>> details = spark.read.json('/data/gcs/delta/ingest/bigtable/_delta_log/*.json')
>>> details = details.select(col('add')['path'].alias("file_path")).withColumn("version",substring(input_file_name(),-6,1)).filter("file_path is not NULL")
>>> details.show(5,100)
+-------------------------------------------------------------------+-------+
| file_path|version|
+-------------------------------------------------------------------+-------+
|part-00000-148c98cc-0db1-495e-bb67-0ba1cc4fd45e-c000.snappy.parquet| 4|
|part-00001-2caa89b7-c990-47e0-b7b0-92430b15b141-c000.snappy.parquet| 4|
|part-00002-1f900af7-d819-48e9-a048-ad22e5c7ce65-c000.snappy.parquet| 4|
|part-00003-e043f466-861b-47f0-a1cf-4b67e75a5ed2-c000.snappy.parquet| 4|
|part-00000-93cc0747-ca0b-46ef-ada4-b3fb18e48925-c000.snappy.parquet| 0|
+-------------------------------------------------------------------+-------+
only showing top 5 rows
Join both of these dataframes on file_path and you will see each state/version of the record along with the delta version it was created in. My example:
parquet_table = spark.read.parquet('/data/gcs/delta/ingest/bigtable/*.parquet')
>>> parquet_table.printSchema()
root
|-- Region: string (nullable = true)
|-- Country: string (nullable = true)
|-- Item_Type: string (nullable = true)
|-- Sales_Channel: string (nullable = true)
parquet_table = parquet_table.where(col("Order_ID")==913712584).\
withColumn("input_file",substring(input_file_name(),38,1000)).\
select(["Order_ID","Region","Country","Sales_Channel","input_file"]).\
orderBy("Country")
>>> parquet_table.join(details,parquet_table.input_file == details.file_path).select("Order_ID","Region","Country","Sales_Channel","version").orderBy("version").show(100)
+---------+------------------+-------+-------------+-------+
| Order_ID| Region|Country|Sales_Channel|version|
+---------+------------------+-------+-------------+-------+
|913712584|Sub-Saharan Africa|Lesotho| Online| 0|
|913712584|Sub-Saharan Africa|Lesotho| Online| 0|
|913712584|Sub-Saharan Africa|Lesotho| Online| 0|
On Databricks, starting with Databricks Runtime 8.2, there is functionality called Change Data Feed that tracks what changes were made to the table, and you can pull that feed of changes either as a batch or as a stream for analysis or for implementing change-data-capture-style processing.
After change data feed is enabled on the table, you can read data using batch or stream APIs, something like this:
spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 0) \
.table("myDeltaTable")
and you'll get all changed records with additional columns that describe what kind of change was done (insert/update/delete), when it happened (timestamp), and in which version.
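For completeness, enabling the feed on an existing Delta table is done through a table property; a sketch (check the documentation for your runtime/Delta version):
spark.sql("ALTER TABLE myDeltaTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")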
Update, September 2022: Change Data Feed is now available in the open source version as well - starting with version 2.0.0.

SPARK Parquet file loading with different schema

We have parquet files generated with two different schemas where we have ID and Amount fields.
File:
file1.snappy.parquet
ID: INT
AMOUNT: DECIMAL(15,6)
Content:
1,19500.00
2,198.34
file2.snappy.parquet
ID: INT
AMOUNT: DECIMAL(15,2)
Content:
1,19500.00
3,198.34
When I load both files together with df3 = spark.read.parquet("output/") and try to get the data, it applies the Decimal(15,6) schema to the file whose amount is Decimal(15,2), and that file's data is read incorrectly. Is there a way that I can retrieve the data properly in this case?
Final output I could see after executing df3.show():
+---+------------+
| ID|      AMOUNT|
+---+------------+
|  1|    1.950000|
|  3|    0.019834|
|  1|19500.000000|
|  2|  198.340000|
+---+------------+
Here, if you look at the 1st and 2nd rows, the amount was read incorrectly.
I am looking for suggestions on this. I know that if we regenerate the files with the same schema this issue will go away, but that requires regenerating and replacing the files which were already delivered. Is there any temporary workaround we can use in the meantime while we work on regenerating those files?
You can try setting the mergeSchema property to true.
So instead of
df3 = spark.read.parquet("output/")
Try this:
df3 = spark.read.option("mergeSchema","true").parquet("output/")
But this can give inconsistent records if the two parquet files were written by different Spark versions. In that case, the newer Spark version should set the below property to true:
spark.sql.parquet.writeLegacyFormat
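For reference, that property can be set on the session before writing; a sketch (whether it is needed depends on the Spark versions involved):
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")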
Try to read the column as a string and provide the schema manually while reading the file:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("flag_piece", StringType(), True)
])
spark.read.format("parquet").schema(schema).load(path)
The following worked for me:
df = spark.read.parquet("data_file/")
for col in df.columns:
    df = df.withColumn(col, df[col].cast("string"))

Problem in reading string NULL values from BigQuery

Currently I am using Spark to read data from BigQuery tables and write it to a storage bucket as CSV. One issue that I am facing is that null string values are not being handled properly by Spark when read from BQ: it reads the null string values, but in the CSV it writes each such value as an empty string with double quotes (i.e. like this "").
# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
.option('table', <bq_dataset> + <bq_table>) \
.load()
bqdf.createOrReplaceTempView('bqdf')
# Select required data into another df
bqdf2 = spark.sql(
'SELECT * FROM bqdf')
# write to GCS
bqdf2.write.csv(<gcs_data_path> + <bq_table> + '/' , mode='overwrite', sep= '|')
I have tried the emptyValue='' and nullValue options with df.write.csv() while writing to CSV, but it doesn't work.
I need a solution for this problem; if anyone else has faced this issue and can help, thanks!
I was able to reproduce your case and I found a solution that worked with a sample table I created in BigQuery (a two-column name/age table containing a null name).
According to the PySpark documentation, in the class pyspark.sql.DataFrameWriter(df), there is an option called nullValue:
nullValue – sets the string representation of a null value. If None is
set, it uses the default value, empty string.
This is what you are looking for. I then implemented the nullValue option as shown below.
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)
# Read the data from BigQuery as a Spark Dataframe.
data = spark.read.format("bigquery").option(
"table", "dataset.table").load()
# Create a view so that Spark SQL queries can be run against the data.
data.createOrReplaceTempView("data_view")
# Select required data into another df
data_view2 = spark.sql(
'SELECT * FROM data_view')
data_view2.write.csv('gs://bucket/folder', header=True, nullValue='')
data_view2.show()
Notice that I have used data_view2.show() to print out the view in order to check if it was correctly read. The output was:
+------+---+
|  name|age|
+------+---+
|Robert| 25|
|  null| 23|
+------+---+
Therefore, the null value was precisely interpreted. In addition, I also checked the .csv file:
name,age
Robert,25
,23
As you can see, the null value is correct and not represented between double quotes as an empty string. Finally, as a final inspection, I created a load job from this .csv file to BigQuery. The table was created and the null value was interpreted accurately.
Note: I ran the pyspark job from the DataProc job's console in a DataProc cluster, previously created. Also, the cluster was at the same location as the dataset in BigQuery.

hive external table on parquet not fetching data

I am trying to create a data pipeline where the incoming data is stored as parquet; I create an external Hive table and users can query the Hive table and retrieve data. I am able to save the parquet data and retrieve it directly, but when I query the Hive table it does not return any rows. I did the following test setup.
--CREATE EXTERNAL HIVE TABLE
create external table emp (
id double,
hire_dt timestamp,
user string
)
stored as parquet
location '/test/emp';
Now I created a dataframe on some data and saved it to parquet.
---Create dataframe and insert DATA
val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
val schema = List(("id", "double"), ("hire_dt", "date"), ("user", "string"))
val newCols= schema.map ( x => col(x._1).cast(x._2))
val newDf = employeeDf.select(newCols:_*)
newDf.write.mode("append").parquet("/test/emp")
newDf.show
--read the contents directly from parquet
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
--read from the external hive table
spark.sql("select id,hire_dt,user from emp").show(false)
+---+-------+----+
|id |hire_dt|user|
+---+-------+----+
+---+-------+----+
As shown above, I am able to see the data if I read from parquet directly but not from Hive. The question is, what am I doing wrong here? Why isn't Hive getting the data? I thought msck repair might be needed, but I get an error saying the table is not partitioned if I try to run msck repair table.
Based on your create table statement, you have used location as /test/emp but while writing data, you are writing at /tenants/gwm/idr/emp. So you will not have data at /test/emp.
create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/test/emp';
Please re-create the external table as:
create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/tenants/gwm/idr/emp';
Apart from the answer given by Ramdev below, you also need to be careful to use the correct datatype around date/timestamp, as the 'date' type is not supported by parquet when creating a Hive table.
For that you can change the 'date' type for column 'hire_dt' to 'timestamp'.
Otherwise there will be a mismatch between the data you persist through Spark and what you try to read through Hive (or Hive SQL). Keeping it as 'timestamp' in both places will resolve the issue. I hope it helps.
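For example, reusing the schema list from the question and changing only the hire_dt type (a sketch of the suggestion above):
val schema = List(("id", "double"), ("hire_dt", "timestamp"), ("user", "string"))
val newCols = schema.map(x => col(x._1).cast(x._2))
val newDf = employeeDf.select(newCols:_*)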
Do you have enableHiveSupport() in your SparkSession builder() statement? Are you able to connect to the Hive metastore? Try running show tables/databases in your code to see if you can display the tables present at your Hive location.
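A minimal sketch of a Hive-enabled session plus a quick metastore check (the app name is illustrative):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-check") // illustrative name
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()
spark.sql("show tables").show()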
I got this working with the below change.
import org.apache.spark.sql.types.{DoubleType, TimestampType}
val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
  .withColumn("hire_dt", employeeDf.col("hire_dt").cast(TimestampType))
So basically the issue was a datatype mismatch; somehow the cast in the original code didn't seem to work. With an explicit cast, the write goes fine and I am able to query the data back as well. Logically both are doing the same thing; I am not sure why the original code was not working.
val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
import org.apache.spark.sql.types.{DoubleType, TimestampType}
val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
  .withColumn("hire_dt", employeeDf.col("hire_dt").cast(TimestampType))
dfTransformed.write.mode("append").parquet("/test/emp")
dfTransformed.show
--read the contents directly from parquet
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
--read from the external hive table
spark.sql("select id,hire_dt,user from emp").show(false)
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+

Overwrite Hive table with Spark SQL [duplicate]

I have a test table in MySQL with id and name like below:
+----+-------+
| id | name  |
+----+-------+
|  1 | Name1 |
|  2 | Name2 |
|  3 | Name3 |
+----+-------+
I am using Spark DataFrame to read this data (using JDBC) and modifying the data like this
Dataset<Row> modified = sparkSession.sql("select id, concat(name,' - new') as name from test");
modified.write().mode("overwrite").jdbc(AppProperties.MYSQL_CONNECTION_URL,
"test", connectionProperties);
But my problem is, if I use overwrite mode, it drops the previous table and creates a new table but does not insert any data.
I tried the same program by reading from a csv file (same data as test table) and overwriting. That worked for me.
Am I missing something here ?
Thank You!
The problem is in your code. Because you overwrite the table from which you're trying to read, you effectively obliterate all the data before Spark can actually access it.
Remember that Spark is lazy. When you create a Dataset Spark fetches required metadata, but doesn't load the data. So there is no magic cache which will preserve original content. Data will be loaded when it is actually required. Here it is when you execute write action and when you start writing there is no more data to be fetched.
What you need is something like this:
Create a Dataset.
Apply required transformations and write data to an intermediate MySQL table.
TRUNCATE the original input and INSERT INTO ... SELECT from the intermediate table, or DROP the original table and RENAME the intermediate table (a rough sketch follows below).
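A rough sketch of that flow in Scala; the staging table name test_staging is an assumption, and the TRUNCATE/INSERT step goes through a plain JDBC connection because Spark does not run DML statements against the external database:
import java.sql.DriverManager
import org.apache.spark.sql.SaveMode

val modified = sparkSession.sql("select id, concat(name, ' - new') as name from test")
modified.write.mode(SaveMode.Overwrite)
  .jdbc(AppProperties.MYSQL_CONNECTION_URL, "test_staging", connectionProperties) // intermediate table

val conn = DriverManager.getConnection(AppProperties.MYSQL_CONNECTION_URL, connectionProperties)
try {
  val stmt = conn.createStatement()
  stmt.execute("TRUNCATE TABLE test")
  stmt.execute("INSERT INTO test SELECT * FROM test_staging")
} finally {
  conn.close()
}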
Alternative, but less favorable approach, would be:
Create a Dataset.
Apply required transformations and write data to a persistent Spark table (df.write.saveAsTable(...) or equivalent)
TRUNCATE the original input.
Read data back and save (spark.table(...).write.jdbc(...))
Drop Spark table.
We cannot stress enough that using Spark cache / persist is not the way to go. Even with a conservative StorageLevel (MEMORY_AND_DISK_2 / MEMORY_AND_DISK_SER_2) cached data can be lost (node failures), leading to silent correctness errors.
I believe all the steps above are unnecessary. Here's what you need to do:
Create a dataset A like val A = spark.read.parquet("....")
Read the table to be updated as dataframe B. Make sure caching is enabled for dataframe B: val B = spark.read.jdbc("mytable").cache
Force a count on B - this will force execution and cache the table depending on the chosen StorageLevel - B.count
Now, you can do a transformation like val C = A.union(B)
And, then write C back to the database like C.write.mode(SaveMode.Overwrite).jdbc("mytable")
Reading and writing to same table.
cols_df = df_2.columns
broad_cast_var = spark_context.broadcast(df_2.collect())
df_3 = sqlContext.createDataFrame(broad_cast_var.value, cols_df)
Reading and writing to same table with some modification.
cols_df = df_2.columns
broad_cast_var = spark_context.broadcast(df_2.collect())
def update_x(x):
    y = (x[0] + 311, *x[1:])
    return y
rdd_2_1 = spark_context.parallelize(broad_cast_var.value).map(update_x)
df_3 = sqlContext.createDataFrame(rdd_2_1, cols_df)
