PySpark - parallel processing - apache-spark

I believe this is a basic question about data processing in Spark.
Let's assume there is a data frame:
PartitionColumn | ColumnB | ColumnC
First           | value1  | ...
First           | value2  | ...
Second          | row     | ...
...             | ...     | ...
I want to process this data in parallel using the PartitionColumn, so all rows with the First value go to the First table, rows with the Second value go to the Second table, etc.
Could I ask for a tip on how to achieve this in PySpark (2.x)?

Please refer to the partitionBy() section in this documentation:
df.write \
    .partitionBy("PartitionColumn") \
    .mode("overwrite") \
    .parquet("/path")
Your partitioned data will be saved under folders named after the partition column and its values:
/path/PartitionColumn=First
/path/PartitionColumn=Second
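If you then want to process each partition value separately, the partitioned layout makes that cheap: reading back with a filter on PartitionColumn only scans the matching folder. A minimal sketch, reusing the path and column name from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partition discovery restores PartitionColumn as a column, and each
# filter below is pruned down to its /path/PartitionColumn=<value> folder.
df = spark.read.parquet("/path")
first_df = df.where(df["PartitionColumn"] == "First")
second_df = df.where(df["PartitionColumn"] == "Second")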

Related

How to specify nested partitions in merge query while trying to merge incremental data with a base table?

I am trying to merge a dataframe that contains incremental data into my base table as per the Databricks documentation.
base_delta.alias('base') \
    .merge(source=kafka_df.alias('inc'),
           condition='base.key1 = inc.key1 and base.key2 = inc.key2') \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
The above operation works fine, but it takes a lot of time, as expected, since a lot of unwanted partitions are being scanned.
I came across a Databricks documentation page here that shows a merge query with partitions specified in it.
Code from that link:
spark.sql(s"""
|MERGE INTO $targetTableName
|USING $updatesTableName
|ON $targetTableName.par IN (1,0) AND $targetTableName.id = $updatesTableName.id
|WHEN MATCHED THEN
| UPDATE SET $targetTableName.ts = $updatesTableName.ts
|WHEN NOT MATCHED THEN
| INSERT (id, par, ts) VALUES ($updatesTableName.id, $updatesTableName.par, $updatesTableName.ts)
""".stripMargin)
The partitions are specified in the IN condition as 1, 2, 3... But in my case, the table is first partitioned on COUNTRY values (USA, UK, NL, FR, IND) and then every country is partitioned by YYYY-MM, e.g. 2020-01, 2020-02, 2020-03.
How can I specify the partition values if I have a nested structure like the one mentioned above?
Any help is massively appreciated.
Yes, you can do that, and it's really recommended, because otherwise Delta Lake needs to scan all the data that match the ON condition. If you're using the Python API, you just need to use the correct SQL expression as the condition, and you can put restrictions on the partition columns into it, something like this in your case (where date is the partition column coming from the update data):
base.country = 'country1' and base.date = inc.date and
base.key1=inc.key1 and base.key2=inc.key2
If you have multiple countries, then you can use IN ('country1', 'country2'), but it would be easier to have the country inside your update dataframe and match using base.country = inc.country.
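Applied to the layout in the question (COUNTRY plus a YYYY-MM partition), and reusing the question's base_delta and kafka_df, the condition could look like the sketch below; month is an assumed name for the YYYY-MM partition column, and the country list is just an example of restricting that level:
# Restricting both partition levels in the condition lets Delta Lake prune
# everything except the matching COUNTRY / month partitions.
# NOTE: 'month' is an assumed column name for the YYYY-MM partition.
base_delta.alias('base') \
    .merge(source=kafka_df.alias('inc'),
           condition="base.country IN ('USA', 'UK') and base.month = inc.month and "
                     "base.key1 = inc.key1 and base.key2 = inc.key2") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()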

writing pyspark data frame to text file

I have a PySpark data frame which I created from a table in SQL Server, and I did some transformations on it. Now I am going to convert it to a dynamic data frame in order to be able to save it as a text file in an S3 bucket. When writing the data frame to a text file, I am going to add another header to that file.
This is my dynamic data frame that will be saved as a file:
AT_DATE    | AMG_INS | MONTHLY_AVG
2021-03-21 | MT.0000 | 234.543
2021_02_12 | MT.1002 | 34.567
I want to add another header on top of that: while saving my text file, I need to add another row like this:
HDR,FTP,PC
AT_DATE,AMG_INS,MONTHLY_AVG
2021-03-21,MT.0000,234.543
2021_02_12,MT.1002,34.567
This is a separate row that I need to add on top of my text file.
To save your dataframe as a text file with additional header lines, you have to perform the following steps:
Prepare your data dataframe
since you can only write single-column dataframes to text, you first concatenate all values into one value column, using the concat_ws Spark SQL function
then you keep only the value column, using the select dataframe method
you add an order column with literal value 2; it will be used later to ensure that the headers are at the top of the output text file
Prepare your headers dataframe
You create a headers dataframe, containing one row per desired header. Each row has two columns:
a value column containing the header as a string
an order column containing the header order as an int (0 for the first header and 1 for the second header)
Write the union of the headers and data dataframes
you union your data dataframe with the headers dataframe using the union dataframe method
you use the coalesce(1) dataframe method to get only one text file as output
you order your dataframe by the order column using the orderBy dataframe method
you drop the order column
and you write the resulting dataframe
Complete code
Translated into code, it gives you the code snippet below. I call your dynamic dataframe output_dataframe and your Spark session spark, and I write to /tmp/to_text_file:
from pyspark.sql import functions as F

data = output_dataframe \
    .select(F.concat_ws(',', F.col("AT_DATE"), F.col("AMG_INS"), F.col("MONTHLY_AVG")).alias('value')) \
    .withColumn('order', F.lit(2))

headers = spark.createDataFrame(
    [('HDR,FTP,PC', 0), ('AT_DATE,AMG_INS,MONTHLY_AVG', 1)],
    ['value', 'order']
)

headers.union(data) \
    .coalesce(1) \
    .orderBy('order') \
    .drop('order') \
    .write.text("/tmp/to_text_file")

Same query resulting in different outputs in Hive vs Spark

Hive 2.3.6-mapr
Spark v2.3.1
I am running the same query:
select count(*)
from TABLE_A a
left join TABLE_B b
  on a.key = b.key
  and b.date > '2021-01-01'
  and date_add(last_day(add_months(a.create_date, -1)), 1) < '2021-03-01'
where cast(a.TIMESTAMP as date) >= '2021-01-20'
  and cast(a.TIMESTAMP as date) < '2021-03-01'
But I am getting 1B rows as output in Hive, while 1.01B in Spark SQL.
From some initial analysis, it seems like all the extra rows in Spark have the TIMESTAMP column set to 2021-02-28 00:00:00.000000.
Both the TIMESTAMP and create_date columns have data type string.
What could be the reason behind this?
I will give you one possibility, but I need more information.
If you drop an external table, the data remains on storage and Spark can still read it, but the metadata in Hive says the table doesn't exist, so Hive doesn't read it.
That's why you have a difference.
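Another quick check, given that the extra rows all carry 2021-02-28 00:00:00.000000 in a string-typed column: compare how each engine casts that exact value to a date, since the where clause relies on that cast. A minimal sketch on the Spark side, assuming a session named spark (run the equivalent SELECT in Hive and compare):
# If one engine returns NULL here and the other returns 2021-02-28, the
# where-clause filter treats those rows differently, which could account
# for the extra rows on the Spark side.
spark.sql("SELECT CAST('2021-02-28 00:00:00.000000' AS DATE) AS d").show()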

Databricks/Spark data write to SQL DW is dropping the table and recreating it

In Azure SQL DW, I have an empty table (say table T1).
Suppose T1 has 4 columns: C1, C2, C3 and C4 (C4 is NOT NULL).
I have a dataframe in Databricks (say df1) which has data for C1, C2 and C3.
I am performing the write operation on the dataframe using a code snippet like the following:
df1.write
  .format("com.databricks.spark.sqldw")
  .option("url", jdbcURL)
  .option("dbtable", "T1")
  .option("forward_spark_azure_storage_credentials", "True")
  .option("tempDir", tempDir)
  .mode("overwrite")
  .save()
What I see is that instead of getting any error, the table T1 gets dropped and a new table T1 gets created with only 3 columns: C1, C2 and C3.
Is that expected behavior, or should some exception have been thrown while trying to insert data, since the data corresponding to C4 was missing?
You've set the mode to overwrite; dropping and recreating the table in question matches my experience there too. Maybe try append instead?
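If the intention is to keep T1's existing definition (including the NOT NULL C4) and just insert into it, a minimal sketch is the same write with append mode; any problem with the missing C4 data should then surface as an error from the database instead of the table being silently recreated:
# Same options as in the question, but mode("append") inserts into the
# existing T1 rather than overwriting (dropping and recreating) it.
df1.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", jdbcURL) \
    .option("dbtable", "T1") \
    .option("forward_spark_azure_storage_credentials", "True") \
    .option("tempDir", tempDir) \
    .mode("append") \
    .save()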

Why are columns renamed as c0, c1 in the Spark partitioned data?

Following is my source data:
+-----+----------+
|Name |Date      |
+-----+----------+
|Azure|2018-07-26|
|AWS  |2018-07-27|
|GCP  |2018-07-28|
|GCP  |2018-07-28|
+-----+----------+
I have partitioned the data using the Date column:
udl_file_df_read.write.format("csv").partitionBy("Date").mode("append").save(outputPath)
val events = spark.read.format("com.databricks.spark.csv").option("inferSchema","true").load(outputPath)
events.show()
The output column names are (c0, Date). I am not sure why the original column names are missing, and how do I retain them?
Note: this is not a duplicate question, for the following reasons: here, columns other than the partition column are renamed as c0, and specifying the base path in option doesn't work.
You get column names like c0 because the CSV format, as used in the question, doesn't preserve column names (no header is written by default).
You can try writing with
udl_file_df_read
  .write
  .option("header", "true")
  ...
and similarly read
spark
  .read
  .option("header", "true")
I was able to retain the schema by setting the header option to true when writing my file; I had earlier thought I could use this option only to read the data.
udl_file_df_read.write.option("header", "true").format("csv").partitionBy("Date").mode("append").save(outputPath)
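For reference, the same round trip written as PySpark, a minimal sketch reusing udl_file_df_read, outputPath and a session named spark from the question: write the CSV with a header, partitioned by Date, then read it back with the header option enabled so the original column names are kept:
# Write with a header so the CSV part files carry the column names.
udl_file_df_read.write \
    .option("header", "true") \
    .format("csv") \
    .partitionBy("Date") \
    .mode("append") \
    .save(outputPath)

# Read back with the header option; Date comes back via partition discovery.
events = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(outputPath)
events.show()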
