(PySpark) Update a delta table based on conditional expression while iterating over a lookup df and extract values to insert from a nested dict? - python-3.x

I have a mapping/lookup table (a DataFrame) that tells me which values to extract from a highly nested JSON document/dictionary. These values have to be inserted as column values into a Delta table. How do I do this while leveraging PySpark's parallelism?
I know I can collect() the mapping DataFrame, open the JSON file, update each column of a row in a temporary DataFrame and append it to the Delta table, but that will not run in parallel.
Alternatively, I could broadcast the dict/JSON, iterate over the mapping DataFrame using foreach(), and upsert my Delta table according to a when() condition. But Column.when() does not let me update a Delta table, nor does merge() from delta.tables allow me to compare a DataFrame with a dict.
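The extraction step on its own can be sketched in plain Python. This is only an illustration under assumptions: the lookup rows are taken to be (target column, dotted path) pairs, and the names extract_by_path, build_row, mapping and document are hypothetical, not part of the question. It does not touch Delta at all; it only shows how a small, collected mapping could be resolved against the nested dict to produce one row of values.

```python
def extract_by_path(nested, path, default=None):
    """Follow a dotted path such as 'order.items.0.sku' through dicts and lists."""
    current = nested
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            # Path does not exist in this document: fall back to the default.
            return default
    return current


def build_row(nested, mapping):
    """Build {target_column: value} from (target_column, dotted_path) pairs."""
    return {column: extract_by_path(nested, path) for column, path in mapping}


# Hypothetical lookup rows and nested document, for illustration only.
mapping = [
    ("customer_id", "customer.id"),
    ("first_item", "order.items.0.sku"),
    ("coupon", "order.coupon.code"),
]
document = {
    "customer": {"id": 42},
    "order": {"items": [{"sku": "A-1"}, {"sku": "B-2"}]},
}

row = build_row(document, mapping)
```

If the mapping is small, resolving it on the driver like this and then turning the resulting dict into a one-row DataFrame (for example with spark.createDataFrame) to append or merge may be simpler than iterating with foreach(); whether that fits depends on how many documents have to be processed.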

Related

Why would dataframe.write.mode("overwrite").saveAsTable("table") command be dropping data?

%python
dataframe.count()
#output 1179
%python
dataframe.write.mode("overwrite").saveAsTable("tablename")
%sql
select count(*) from tablename
--output 1069
What could I be doing wrong? (These are different cells in Databricks.)
I want to overwrite the data. The DataFrame has more rows, but some rows are dropped while writing into the table.
In the code above, the existing data in the table is overwritten with the data from the DataFrame. If you want to keep the existing table data alongside the DataFrame data, you have to append the DataFrame to the table:
dataframe.write.mode("append").saveAsTable("tablename")
This code will append your dataframe into the table.
Thank You!

iterating complex dataframe with array of structfield

I have data in one of dataframe's column with the following schema
<type 'list'>: [StructField(data,StructType(List(StructField(account,StructType(List(StructField(Id,StringType,true),StructField(Name,StringType,true),StructField(books,ArrayType(StructType(List(StructField(bookTile,StringType,true),StructField(bookId,StringType,true),StructField(bookName,StringType,true))),true),true)))))))]
I want to iterate over them, extract each value and create a new DataFrame. Are there any built-in functions in PySpark that support this, or should I iterate myself? Is there an efficient way?
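PySpark does have a built-in for this: pyspark.sql.functions.explode produces one output row per array element, and nested struct fields can be addressed with dotted names such as data.account.books. The plain-Python sketch below only illustrates the shape of the result such an explode-and-select would give for the schema above; the function name explode_books and the sample values are made up for the example.

```python
def explode_books(record):
    """Return one flat row per book, repeating the account fields on each row."""
    account = record["data"]["account"]
    rows = []
    # Like explode, an account with no books contributes no rows.
    for book in account.get("books") or []:
        rows.append({
            "Id": account.get("Id"),
            "Name": account.get("Name"),
            "bookTile": book.get("bookTile"),  # field name spelled as in the schema
            "bookId": book.get("bookId"),
            "bookName": book.get("bookName"),
        })
    return rows


# Hypothetical record matching the schema in the question.
record = {
    "data": {
        "account": {
            "Id": "a1",
            "Name": "Alice",
            "books": [
                {"bookTile": "T1", "bookId": "b1", "bookName": "First"},
                {"bookTile": "T2", "bookId": "b2", "bookName": "Second"},
            ],
        }
    }
}

rows = explode_books(record)
```

On a real DataFrame the same flattening would be expressed with explode and column selection, so that it runs on the executors instead of in a Python loop.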

Is there a way to slice dataframe based on index in pyspark?

In python or R, there are ways to slice DataFrame using index.
For example, in pandas:
df.iloc[5:10,:]
Is there a similar way in pyspark to slice data based on location of rows?
Short Answer
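If you already have an index column (suppose it is called 'id'), you can filter using pyspark.sql.Column.between. Note that between is inclusive at both ends, so the call below keeps ids 5 through 10, whereas the pandas slice 5:10 stops at position 9: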
from pyspark.sql.functions import col
df.where(col("id").between(5, 10))
If you don't already have an index column, you can add one yourself and then use the code above. You should have some ordering built in to your data based on some other columns (orderBy("someColumn")).
Full Explanation
No, it is not easily possible to slice a Spark DataFrame by index, unless the index is already present as a column.
Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas.) Each row is treated as an independent collection of structured data, and that is what allows for distributed parallel processing. Thus, any executor can take any chunk of the data and process it without regard for the order of the rows.
It is of course possible to perform operations that do involve ordering (lead, lag, etc.), but these will be slower because they require Spark to shuffle data between the executors. (Shuffling data is typically one of the slowest components of a Spark job.)
Related/Further Reading
PySpark DataFrames - way to enumerate without converting to Pandas?
PySpark - get row number for each row in a group
how to add Row id in pySpark dataframes
You can convert your Spark DataFrame to a Koalas DataFrame.
Koalas is a library by Databricks that gives an almost pandas-like interface to Spark DataFrames (since Spark 3.2 the same API ships with PySpark as pyspark.pandas). See https://pypi.org/project/koalas/
import databricks.koalas as ks
kdf = ks.DataFrame(your_spark_df)
kdf[0:500] # your indexes here

How to pass multiple column in partitionby method in Spark

I am a newbie in Spark. I want to write DataFrame data into a Hive table. The Hive table is partitioned on multiple columns. Through the Hive metastore client I get the partition columns and pass them as a variable to the partitionBy clause of the DataFrame's write method.
var1="country","state" (the partition column names of the Hive table)
dataframe1.write.partitionBy(s"$var1").mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
When I execute the code above, it gives me an error saying that partition "country","state" does not exist.
I think it is treating "country","state" as a single string.
Can you please help me out?
The partitionBy function takes varargs, not a list. You can call it as
dataframe1.write.partitionBy("country","state").mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
Or, in Scala, you can expand a list into varargs like this:
val columns = Seq("country","state")
dataframe1.write.partitionBy(columns:_*).mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
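For PySpark users the same idea applies, with *columns playing the role of columns:_*. If the metastore really hands back one string like the var1 above, it first has to be split into individual names. The helper below is a minimal sketch under that assumption; parse_partition_columns is a hypothetical name, and the exact string format returned by the metastore client is not known from the question.

```python
def parse_partition_columns(raw):
    """Split a string such as '"country","state"' into ['country', 'state']."""
    names = []
    for part in raw.split(","):
        # Drop surrounding whitespace and any single or double quotes.
        name = part.strip().strip("\"'")
        if name:
            names.append(name)
    return names


columns = parse_partition_columns('"country","state"')

# With a real DataFrame the list would then be unpacked into separate arguments:
#   df.write.partitionBy(*columns).mode("overwrite").save(path)
```

Passing the names separately is what lets Spark see two partition columns instead of one column literally named "country","state".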

On saveAsTable from Spark

We are trying to write into a Hive table from Spark using the saveAsTable function. I want to know whether saveAsTable drops and recreates the Hive table every time. If it does, is there another Spark function that just truncates and loads the table instead of dropping and recreating it?
It depends on which .mode value you specify:
overwrite --> Spark drops the table first, then recreates it
append --> Spark inserts the new data into the existing table
1. Drop if exists/create if not exists the default.spark1 table in parquet format
>>> df.write.mode("overwrite").saveAsTable("default.spark1")
2. Drop if exists/create if not exists the default.spark1 table in orc format
>>> df.write.format("orc").mode("overwrite").saveAsTable("default.spark1")
3. Append the new data to the existing data in the table (doesn't drop/recreate the table)
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
Achieving truncate and load using Spark:
Method 1:
You can register your DataFrame as a temp table, then execute an insert overwrite statement to overwrite the target table.
>>> df.registerTempTable("temp")  # register df as a temp table
>>> spark.sql("insert overwrite table default.spark1 select * from temp")  # overwrite the target table
This method works for both internal and external tables.
Method 2:
For internal tables, we can truncate the table first and then append the data. This way we do not recreate the table; we only append data to it.
>>> spark.sql("truncate table default.spark1")
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
This method only works for internal tables.
Even for external tables there is a workaround: truncate the table by temporarily changing its table properties.
Let's assume default.spark1 is an external table:
# change the external table to an internal table
>>> spark.sql("alter table default.spark1 set tblproperties('EXTERNAL'='FALSE')")
# once the table is internal, we can run the truncate table statement
>>> spark.sql("truncate table default.spark1")
# change the table back to an external table
>>> spark.sql("alter table default.spark1 set tblproperties('EXTERNAL'='TRUE')")
# then append data to the table
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
You can also use insertInto("table"), which doesn't recreate the table.
The main difference from saveAsTable is that insertInto expects the table to already exist and matches columns by position rather than by name.
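Because insertInto matches by position, a DataFrame whose columns are in a different order than the table's can silently land values in the wrong columns. One way to guard against that is to reorder the DataFrame's columns to the table's order before writing. The helper below is a hypothetical sketch of that check in plain Python; columns_in_table_order and the sample column names are assumptions made for the example.

```python
def columns_in_table_order(df_columns, table_columns):
    """Return the table's column order after checking both sides have the same columns."""
    missing = [c for c in table_columns if c not in df_columns]
    extra = [c for c in df_columns if c not in table_columns]
    if missing or extra:
        raise ValueError(
            "column mismatch: missing={}, extra={}".format(missing, extra)
        )
    return list(table_columns)


# Hypothetical column lists, for illustration only.
df_columns = ["state", "id", "country"]
table_columns = ["id", "country", "state"]
ordered = columns_in_table_order(df_columns, table_columns)

# With real objects this could be used roughly as:
#   table_columns = spark.table("default.spark1").columns
#   df.select(*ordered).write.insertInto("default.spark1")
```

Raising on a mismatch is a deliberate choice here: a loud failure is usually preferable to a positional insert that succeeds with shifted columns.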
