I believe this is a basic question about data processing in Spark.
Let's assume there is a data frame:
PartitionColumn | ColumnB | ColumnC
First           | value1  | ...
First           | value2  | ...
Second          | row     | ...
...             | ...     | ...
I want to process this data in parallel using the PartitionColumn, so that all rows with the First value go to the First table, rows with the Second value go to the Second table, and so on.
Could I ask for a tip on how to achieve this in PySpark (2.x)?
Please refer to the partitionBy() section in this documentation:
df.write \
.partitionBy("PartitionColumn") \
.mode("overwrite") \
.parquet("/path")
Your partitioned data will be saved under folders named after the partition values:
/path/PartitionColumn=First
/path/PartitionColumn=Second
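If it helps, here is a minimal PySpark sketch of reading that output back (assuming the same path and column as in the snippet above, and a SparkSession named spark); filtering on PartitionColumn lets Spark prune directories so only the matching partition is scanned:
# Read the partitioned output back; the filter on PartitionColumn allows
# partition pruning, so only /path/PartitionColumn=First is scanned.
df_all = spark.read.parquet("/path")
df_first = df_all.filter(df_all.PartitionColumn == "First")
df_first.show()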
Does it make sense to drop columns that are not required before joining Spark data frames?
For example:
DF1 has 10 columns, DF2 has 15 columns, DF3 has 25 columns.
I want to join them, select the 10 columns I need, and save the result to Parquet.
Does it make sense to transform the DFs by selecting only the needed columns before the join, or will the Spark engine optimize the join by itself and avoid operating on all 50 columns during the join operation?
Yes, it makes perfect sense because it reduces the amount of data shuffled between executors. It's also better to select only the necessary columns as early as possible: in most cases, if the file format allows it (Parquet, Delta Lake), Spark will read data only for the necessary columns, not for all of them. For example:
df1 = spark.read.parquet("file1") \
.select("col1", "col2", "col3")
df2 = spark.read.parquet("file2") \
.select("col1", "col5", "col6")
joined = df1.join(df2, "col1")
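One way to sanity-check this, using the joined DataFrame from the sketch above, is to inspect the physical plan: the Parquet scan nodes should list only the selected columns in their ReadSchema.
# The scan nodes in the extended plan output should show a ReadSchema
# containing only the selected columns, confirming column pruning.
joined.explain(True)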
I am writing a dataframe to a delta table using the following code:
(df
.write
.format("delta")
.mode("overwrite")
.partitionBy("date")
.saveAsTable("table"))
I have 32 distinct dates in the format yyyy-mm, and I am expecting to have 32 partitions, but if I run print(df.rdd.getNumPartitions()), I get only 15. What am I missing?
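For context, a small PySpark sketch of the two numbers being compared here (assuming df from the snippet above): partitionBy controls the layout of the written table, while getNumPartitions() reports the DataFrame's in-memory partitions, so the two counts need not match.
# Count of distinct partition values that will become output directories
print(df.select("date").distinct().count())   # e.g. 32
# Number of in-memory partitions of the DataFrame before the write
print(df.rdd.getNumPartitions())              # e.g. 15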
To load and partition the incoming data in Spark, I am using the following syntax.
val dataframe = spark.read.format("jdbc")
.option("url", url)
.option("driver", driver)
.option("user", user)
.option("password", password)
.option("dbtable", query)
.option("partitionColumn", partitionColumn)
.option("lowerBound", lowerBound_value)
.option("upperBound", upperBound_value)
.option("numPartitions", numPartitions)
.option("fetchsize", 15000)
.load()
The parameters partitionColumn, lowerBound, upperBound, numPartitions are used to optimise the performance of the job.
I have a table of 1000 records & an integer column that has serial numbers from 1 to 1000.
I am first running min and max on that column to assign min value to lowerBound and max value to upperBound. The numPartitions parameter is given as 3 so that the incoming data is split into 3 different partitions evenly (or close to being even).
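To make the splitting concrete, here is a rough Python sketch (not Spark's exact internals) of how a 1-1000 range with numPartitions = 3 turns into per-partition WHERE clauses; the column name serial_no is just a placeholder:
# Illustrative only: carve the [lower, upper] range into numPartitions
# contiguous predicates, one per JDBC partition query.
lower, upper, num_partitions = 1, 1000, 3
stride = (upper - lower) // num_partitions            # 333
bounds = [lower + i * stride for i in range(1, num_partitions)]  # [334, 667]
predicates = (
    [f"serial_no < {bounds[0]} OR serial_no IS NULL"]
    + [f"serial_no >= {bounds[i]} AND serial_no < {bounds[i + 1]}"
       for i in range(len(bounds) - 1)]
    + [f"serial_no >= {bounds[-1]}"]
)
for p in predicates:
    print(p)
# serial_no < 334 OR serial_no IS NULL
# serial_no >= 334 AND serial_no < 667
# serial_no >= 667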
The above design works well when the data volume is small. But I have the following scenario.
I have a table of 203 billion records with no integer column that contains unique/serial integers. There is a date column whose data is spread across 5 years, namely 2016-2021.
In order to move the data faster, I am moving one month of each year's data at a time.
This is the query I am using:
val query = s"(select * from table where date_column >= '${YearMonth.of(year.toInt, month).atDay(1).toString} and date_time <= '${YearMonth.of(year.toInt, month).atEndOfMonth().toString} 23:59:59.999') as datadf"
So the above query becomes:
select * from table where date_column >= '2016-01-01' and date_time <= '2016-01-31 23:59:59.999'
and so on, with the first and last day of each month for every year.
This is a rough description of what my loop looks like:
(2016 to 2021).foreach { year =>
  (1 to 12).foreach { month =>
    val query = s"(select * from table where date_column >= '${YearMonth.of(year.toInt, month).atDay(1).toString}' and date_time <= '${YearMonth.of(year.toInt, month).atEndOfMonth().toString} 23:59:59.999') as datadf"
    val dataframe = spark.read.format("jdbc")
      .option("url", url)
      .option("driver", driver)
      .option("user", user)
      .option("password", password)
      .option("dbtable", query)
      .option("partitionColumn", partitionColumn)
      .option("lowerBound", lowerBound_value)
      .option("upperBound", upperBound_value)
      .option("numPartitions", numPartitions)
      .option("fetchsize", 15000)
      .load()
  }
}
To find the bounds, I am using the same month and year filters, as below:
val bounds = spark.read.format("jdbc")
.option("url", url)
.option("driver", driver)
.option("user", user)
.option("password", password)
.option("dbtable", "(select min(partitionColumn) as mn, max(partitionColum) as from tablename where date_column >= '${YearMonth.of(year.toInt, month).atDay(1).toString} and date_time <= '${YearMonth.of(year.toInt, month).atEndOfMonth().toString} 23:59:59.999') as boundsDF")
.load()
val lowerBound_value = bounds.select("mn").head.getInt(0)
val upperBound_value = bounds.select("mx").head.getInt(0)
The issue comes with finding the lower and upper bounds of the filtered data.
Because of the huge data size, the query that runs min & max on the partitionColumn with the given filters takes far more time than writing the actual dataframe into HDFS.
I tried giving arbitrary values there instead, but observed data skew in the partitions while the tasks were running.
Is it mandatory to give the min and max of the partitionColumn as lower and upper bounds for good data distribution?
If not, is there any way to specify lower and upper bounds without running a min & max query on the data?
Any help is much appreciated.
With 200+ billion rows, I do hope your table is partitioned in your DB on the same date column on which you are accessing the data. Without that, queries will be quite hopeless.
But have you tried the integer equivalent of date/timestamp values in the lower and upper bounds? Check this reference for Spark's conversion of integer values to timestamps:
The JDBC options lowerBound and upperBound are converted to
TimestampType/DateType values in the same way as casting strings to
TimestampType/DateType values. The conversion is based on Proleptic
Gregorian calendar, and time zone defined by the SQL config
spark.sql.session.timeZone. In Spark version 2.4 and below, the
conversion is based on the hybrid calendar (Julian + Gregorian) and on
default system time zone.
As you mentioned, there is no pre-existing integer column that can be used here. So with your loops, the upper and lower bounds are static and hence convertible to static numeric values. Based on Spark's internals, the lower and upper bound values are divided into numeric ranges, and multiple queries are issued to the DB, each fetching a single partition's data. This also means that partitioning the table on the relevant column, or having appropriate indices in the source DB, is really significant for performance.
You will need to ensure that the placeholders for the upper and lower bounds are appropriately placed in your provided query. As a heads up, the actual numeric values may vary depending on the database system in use. If that scenario pops up, i.e. the database system's integer-to-date conversion differs from Spark's, then you will need to provide the values accepted by the database rather than by Spark. From the same docs:
Parameters:
connectionFactory - a factory that returns an open Connection. The RDD takes care of closing the connection.
sql - the text of the query. The query must contain two ? placeholders for parameters used to partition the results. For
example,
select title, author from books where ? <= id and id <= ?
lowerBound - the minimum value of the first placeholder
upperBound - the maximum value of the second placeholder. The lower and upper bounds are inclusive.
...
From the same, it is also clear that <= and >= are used, so both upper and lower bounds are inclusive; a point of confusion I have observed in other questions.
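For what it's worth, here is a hedged PySpark-style sketch of that suggestion, assuming Spark 2.4+ where date/timestamp partition columns are supported; table and column names are placeholders. Per the docs quoted above, the lowerBound/upperBound options can be given as date/timestamp strings and are converted to DateType/TimestampType, so no min/max query is needed:
# Sketch with date-string bounds (placeholder names); Spark splits the
# [lowerBound, upperBound) range into numPartitions predicates on date_column.
dataframe = (spark.read.format("jdbc")
    .option("url", url)
    .option("driver", driver)
    .option("user", user)
    .option("password", password)
    .option("dbtable", "(select * from tablename where date_column >= '2016-01-01' and date_column < '2016-02-01') as datadf")
    .option("partitionColumn", "date_column")
    .option("lowerBound", "2016-01-01")
    .option("upperBound", "2016-02-01")
    .option("numPartitions", 8)
    .option("fetchsize", 15000)
    .load())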
In Azure SQL DW, I have an empty table (say table T1).
Suppose T1 has 4 columns: C1, C2, C3 and C4 (C4 is not null).
I have a dataframe in Databricks (say df1) which has data for C1, C2 and C3.
I am performing the write operation on the dataframe using a code snippet like the following:
df1.write
.format("com.databricks.spark.sqldw")
.option("url", jdbcURL)
.option("dbtable", "T1")
.option( "forward_spark_azure_storage_credentials","True")
.option("tempDir", tempDir)
.mode("overwrite")
.save()
What I see is that instead of getting any error, the existing table T1 is dropped and a new table T1 gets created with only 3 columns: C1, C2 and C3.
Is that expected behavior, or should some exception have been thrown while trying to insert the data, since the data corresponding to C4 was missing?
You've set the mode to overwrite; dropping and recreating the table in question is my experience there too. Maybe try append instead?
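In case it's useful, a sketch of that alternative in PySpark-style syntax, reusing the placeholder options from the question:
# Append inserts into the existing T1 instead of dropping and recreating it;
# with C4 declared NOT NULL and absent from df1, I would expect the database
# to reject the load rather than silently redefine the table.
(df1.write
    .format("com.databricks.spark.sqldw")
    .option("url", jdbcURL)
    .option("dbtable", "T1")
    .option("forward_spark_azure_storage_credentials", "True")
    .option("tempDir", tempDir)
    .mode("append")
    .save())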