I need to process the data partition by partition (long story).
Using mapPartitions works fine with RDDs. In the example below, rdd.mapPartitions(mapper).collect() works as expected.
But when converting to a DataFrame, one partition is processed twice.
Why is this happening, and how can I avoid it?
Below is the output of the simple example that follows. The function is executed 3 times even though there are only two partitions: the partition [Row(id=1), Row(id=2)] is processed twice.
Curiously, one of the executions is ignored, as the resulting DataFrame shows.
size: 2 > values: [Row(id=1), Row(id=2)]
size: 2 > values: [Row(id=1), Row(id=2)]
size: 2 > values: [Row(id=3), Row(id=4)]
+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
+---+
> Mapper executions: 3
Simple example used:
from typing import Iterator
from pyspark import Row
from pyspark.sql import SparkSession

def gen_random_row(id: int):
    return Row(id=id)

if __name__ == '__main__':
    spark = SparkSession.builder.master("local[1]").appName("looking for the error").getOrCreate()
    # accumulator used to count how many times the mapper is started
    executions_counter = spark.sparkContext.accumulator(0)
    rdd = spark.sparkContext.parallelize([
        gen_random_row(1),
        gen_random_row(2),
        gen_random_row(3),
        gen_random_row(4),
    ], 2)

    def mapper(iterator: Iterator[Row]) -> Iterator[Row]:
        executions_counter.add(1)
        lst = list(iterator)
        print(f"size: {len(lst)} > values: {lst}")
        for r in lst:
            yield r

    # rdd.mapPartitions(mapper).collect()
    rdd.mapPartitions(mapper).toDF().show()
    print(f"> Mapper executions: {executions_counter.value}")
    spark.stop()
The solution is to pass the schema to toDF.
It looks like Spark processes one partition an extra time just to infer the schema.
To solve it:
from pyspark.sql.types import IntegerType, StructField, StructType
schema = StructType([StructField("id", IntegerType(), True)])
rdd.mapPartitions(mapper).toDF(schema).show()
With this code, every partition is processed exactly once.
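To make the extra execution easier to see, here is a toy model in plain Python (no Spark involved). It is only a sketch of the behaviour described above, not Spark's actual code: to_df is a made-up stand-in for toDF, and the global counter stands in for the accumulator. The assumption is that, without a schema, one row is read from the first partition before the real job runs.

```python
from typing import Iterable, Iterator, List, Optional

executions = 0  # stands in for the Spark accumulator used in the question


def mapper(partition: Iterable[int]) -> Iterator[int]:
    """Generator started once per pass over a partition, like the question's mapper."""
    global executions
    executions += 1
    yield from list(partition)


def to_df(partitions: List[List[int]], schema: Optional[str] = None) -> List[int]:
    """Hypothetical stand-in for toDF(): with no schema, peek at one row first."""
    if schema is None:
        # Schema inference (assumed): start the mapper on the first partition
        # only to look at its first row, then throw that pass away.
        next(mapper(partitions[0]))
    # The real job: run the mapper over every partition.
    return [row for part in partitions for row in mapper(part)]


partitions = [[1, 2], [3, 4]]

executions = 0
rows_inferred = to_df(partitions)
runs_inferred = executions  # one extra pass over the first partition

executions = 0
rows_explicit = to_df(partitions, schema="id: int")
runs_explicit = executions  # one pass per partition

print(runs_inferred, runs_explicit)
```

In this model the rows returned are the same in both cases; only the number of mapper starts differs, which matches the "ignored" execution seen in the question.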
I am using pyspark.sql.dataframe.DataFrame in pyspark.
I have one driver and 3 executors/workers.
When I apply a function to each row, the work is spread across the 3 executors for a normal DataFrame, but if I have done a groupBy and agg on the DataFrame, everything runs on the same executor/worker.
data = [('James','Smith','apples','a'),('James','Smith','oranges','b'),('James','Smith','lemons','a'),('Anna','Rose','apples','a'),('Anna','Rose','lemons','b'), ('Robert','Williams','oranges','v'), ]
columns = ["firstname","lastname","fuits","letter"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()
+---------+--------+-------+------+
|firstname|lastname|  fuits|letter|
+---------+--------+-------+------+
|    James|   Smith| apples|     a|
|    James|   Smith|oranges|     b|
|    James|   Smith| lemons|     a|
|     Anna|    Rose| apples|     a|
|     Anna|    Rose| lemons|     b|
|   Robert|Williams|oranges|     v|
+---------+--------+-------+------+
dfagg = df.groupBy("firstname","lastname").agg(functions.collect_list("fuits"), functions.collect_list("letter"))
dfagg.show()
+---------+--------+--------------------+--------------------+
|firstname|lastname| collect_list(fuits)|collect_list(letter)|
+---------+--------+--------------------+--------------------+
|    James|   Smith|[apples, oranges,...|           [a, b, a]|
|     Anna|    Rose|    [lemons, apples]|              [b, a]|
|   Robert|Williams|           [oranges]|                 [v]|
+---------+--------+--------------------+--------------------+
I then apply the foreach with a simple function :
# Foreach example
def f(x):
    print(x)
    print(' ===== > this is in the simple foreach')

def f2(x):
    print(x)
    print(' ===== > this is in the aggregated foreach')

# foreach applied to the normal dataframe
df.foreach(f)
# foreach applied to the dataframe that was aggregated
dfagg.foreach(f2)
For the normal dataframe and foreach I get the expected outcome:
The prints of the 6 rows are distributed on the 3 executors/workers
On the executor 1
Row(firstname='James', lastname='Smith', fuits='lemons', letter='a')
===== > this is in the simple foreach
Row(firstname='Anna', lastname='Rose', fuits='apples', letter='a')
===== > this is in the simple foreach
On the executor 2
Row(firstname='Anna', lastname='Rose', fuits='lemons', letter='b')
===== > this is in the simple foreach
Row(firstname='Robert', lastname='Williams', fuits='oranges', letter='v')
===== > this is in the simple foreach
On the executor 3
Row(firstname='James', lastname='Smith', fuits='apples', letter='a')
===== > this is in the simple foreach
Row(firstname='James', lastname='Smith', fuits='oranges', letter='b')
===== > this is in the simple foreach
But for the foreach on the aggregated dataframe:
Everything goes to the same executor/worker
Row(firstname='James', lastname='Smith', collect_list(fuits)=['lemons', 'apples', 'oranges'], collect_list(letter)=['a', 'a', 'b'])
===== > this is in the aggregated foreach
Row(firstname='Anna', lastname='Rose', collect_list(fuits)=['apples', 'lemons'], collect_list(letter)=['a', 'b'])
===== > this is in the aggregated foreach
Row(firstname='Robert', lastname='Williams', collect_list(fuits)=['oranges'], collect_list(letter)=['v'])
===== > this is in the aggregated foreach
How can I distribute the function over the rows across the 3 executors?
I need to work on an aggregated DataFrame and run fairly long functions on each aggregated row, so the work has to be distributed; otherwise it takes too long.
I have tried with more data to see whether a minimum amount of data is needed: no change.
Both DataFrames are of the same type:
print( 'Type of df : ', type(df) )
print( 'Type of dfagg : ', type(dfagg) )
Type of df : <class 'pyspark.sql.dataframe.DataFrame'>
Type of dfagg : <class 'pyspark.sql.dataframe.DataFrame'>
Thank you very much
Sounds like you have data skew. I can't think of another reason why all the data would end up on one node.
If you are on a recent version (Spark 3.0 or later), you could use adaptive query execution to automatically adjust your shuffle partitions.
If you aren't so lucky (to be on Spark 3.0), you could try DISTRIBUTE BY, setting an appropriate shuffle partitions value, or repartition. This should redistribute the data, hopefully in a way that lets you use your whole cluster.
I am facing an issue where I get the error
TypeError: cannot pickle '_thread.RLock' object
when I try to apply the following code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
data_1 = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Williams','M',62),
]
data_2 = [('Junior','Smith','M',15),('Helga','Rose','F',33),
('Mike','Williams','M',77),
]
columns = ["firstname","lastname","gender","age"]
df_1 = spark.createDataFrame(data=data_1, schema = columns)
df_2 = spark.createDataFrame(data=data_2, schema = columns)
def find_n_people_with_higher_age(x):
    return df_2.filter(df_2['age']>=x).count()
find_n_people_with_higher_age_udf = udf(find_n_people_with_higher_age, IntegerType())
df_1.select(find_n_people_with_higher_age_udf(col('category_id')))
Here's a good article on Python UDFs.
I use it as a reference because I suspected that you were running into a serialization issue. I'm quoting the entire paragraph for context, but really it's the serialization that's the issue.
Performance Considerations
It’s important to understand the performance implications of Apache
Spark’s UDF features. Python UDFs for example (such as our CTOF
function) result in data being serialized between the executor JVM and
the Python interpreter running the UDF logic – this significantly
reduces performance as compared to UDF implementations in Java or
Scala. Potential solutions to alleviate this serialization bottleneck
include:
If you consider what you are asking, maybe you'll see why this isn't working. You are asking for all the data in your DataFrame (df_2) to be shipped (serialized) to an executor, which then serializes it again and ships it to Python to be interpreted. DataFrames don't serialize. So that's your issue; but even if they did, you would be sending an entire DataFrame to each executor. With your sample data that wouldn't be a problem, but with trillions of records it would blow up the JVM.
What you're asking is doable; I just need to figure out how to do it. A window or a group by would likely do the trick.
Add additional data:
from pyspark.sql import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *
data_1 = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Williams','M',62),
]
# add more data to make it more interesting.
data_2 = [('Junior','Smith','M',15),('Helga','Rose','F',33),('Gia','Rose','F',34),
('Mike','Williams','M',77), ('John','Williams','M',77), ('Bill','Williams','F',79),
]
columns = ["firstname","lastname","gender","age"]
df_1 = spark.createDataFrame(data=data_1, schema = columns)
df_2 = spark.createDataFrame(data=data_2, schema = columns)
# dataframe to help fill in missing ages
ref = spark.range( 1, 110, 1).toDF("numbers").withColumn("count", lit(0)).withColumn("rolling_Count", lit(0))
countAges = df_2.groupby("age").count()
#this actually give you the short list of ages
rollingCounts = countAges.withColumn("rolling_Count", sum(col("count")).over(Window.partitionBy().orderBy(col("age").desc())))
#fill in missing ages and remove duplicates
filled = rollingCounts.union(ref).groupBy("age").agg(sum("count").alias("count"))
#add a rolling count across all ages
allAgeCounts = filled.withColumn("rolling_Count", sum(col("count")).over(Window.partitionBy().orderBy(col("age").desc())))
#do inner join because we've filled in all ages.
df_1.join(allAgeCounts, df_1.age == allAgeCounts.age, "inner").show()
+---------+--------+------+---+---+-----+-------------+
|firstname|lastname|gender|age|age|count|rolling_Count|
+---------+--------+------+---+---+-----+-------------+
|     Anna|    Rose|     F| 41| 41|    0|            3|
|   Robert|Williams|     M| 62| 62|    0|            3|
|    James|   Smith|     M| 30| 30|    0|            5|
+---------+--------+------+---+---+-----+-------------+
I wouldn't normally want to use a window over an entire table, but here the data it iterates over has no more than 110 rows, so this is reasonable.
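The idea behind the rolling count can be checked without Spark. The sketch below, in plain Python, does the same thing by hand: count people per age, fill in the missing ages with zero, then take a running total from the oldest age downwards. The ages are the ones from data_1 and data_2 above; the variable names and MAX_AGE are my own, chosen to mirror the helper range.

```python
from collections import Counter
from itertools import accumulate

ages_2 = [15, 33, 34, 77, 77, 79]  # ages in data_2 (the table being counted)
ages_1 = [30, 41, 62]              # ages in data_1 (the lookups)

MAX_AGE = 110  # assumed upper bound, mirroring the helper range above

counts = Counter(ages_2)
# per_age[a] = number of people aged exactly a (0 for ages that never occur)
per_age = [counts.get(a, 0) for a in range(MAX_AGE + 1)]
# Running total from the oldest age down: at_least[a] = people with age >= a
at_least = list(accumulate(reversed(per_age)))[::-1]

result = {age: at_least[age] for age in ages_1}
print(result)  # {30: 5, 41: 3, 62: 3}
```

These are the same rolling_Count values as in the join output above, which is a quick sanity check that the window is counting "age >= x" and not "age > x".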
Understanding how to achieve best parallelism while transforming multiple dataframes in parallel
I have an array of paths
val paths = Array("path1", "path2", .....
I load a DataFrame from each path, transform it, and write it to a destination path:
paths.foreach(path => {
  val df = spark.read.parquet(path)
  df.transform(processData).write.parquet(path+"_processed")
})
The transformation processData is independent of the DataFrame I am loading.
This limits me to processing one DataFrame at a time, and most of my cluster resources sit idle. Since each DataFrame is processed independently, I converted the Array to a Scala ParArray:
paths.par.foreach(path => {
  val df = spark.read.parquet(path)
  df.transform(processData).write.parquet(path+"_processed")
})
Now it uses more of the cluster's resources. I am still trying to understand how it works and how to fine-tune the parallel processing here.
If I increase the default Scala parallelism by using a ForkJoinPool with a higher number, can that lead to more threads being spawned on the driver side, sitting in a locked state waiting for the foreach function to finish, and eventually kill the driver?
How does it affect centralized Spark components such as EventLoggingListener, which have to handle a larger inflow of events when multiple DataFrames are processed in parallel?
Which parameters should I consider for optimal resource utilization?
Is there any other approach?
Any resources I can go through to understand this scaling would be very helpful.
The reason this is slow is that Spark is very good at parallelizing computations on lots of data stored in one big DataFrame, but very bad at dealing with lots of DataFrames. It starts the computation on one of them using all its executors (even if they are not all needed) and waits for it to finish before starting the next one. This leaves a lot of processors idle. That is bad, but it is not what Spark was designed for.
I have a hack for you. It may need some refinement, but you will get the idea. Here is what I would do. From the list of paths, I would extract the schemas of all the parquet files and create one big schema that gathers all the columns. Then I would ask Spark to read all the parquet files using this schema (columns that are not present are set to null automatically). I would then union all the DataFrames, perform the transformation on this big DataFrame, and finally use partitionBy to store the DataFrames in separate files, while still doing all of it in parallel. It would look like this.
// let's create two sample datasets with one column in common (id)
// and two different columns x != y
val d1 = spark.range(3).withColumn("x", 'id * 10)
d1.show
+---+----+
| id|   x|
+---+----+
|  0|   0|
|  1|  10|
|  2|  20|
+---+----+
val d2 = spark.range(2).withColumn("y", 'id cast "string")
d2.show
+---+---+
| id|  y|
+---+---+
|  0|  0|
|  1|  1|
+---+---+
// And I store them
d1.write.parquet("hdfs:///tmp/d1.parquet")
d2.write.parquet("hdfs:///tmp/d2.parquet")
// Now let's create the big schema
val paths = Seq("hdfs:///tmp/d1.parquet", "hdfs:///tmp/d2.parquet")
val fields = paths
  .flatMap(path => spark.read.parquet(path).schema.fields)
  .toSet // removing duplicates
  .toArray
val big_schema = StructType(fields)
// and let's use it
val dfs = paths.map { path =>
  spark.read
    .schema(big_schema)
    .parquet(path)
    .withColumn("path", lit(path.split("/").last))
}
// Then we are ready to create one big dataframe
dfs.reduce( _ unionAll _).show
+---+----+----+----------+
| id|   x|   y|      path|
+---+----+----+----------+
|  1|  10|null|d1.parquet|
|  2|  20|null|d1.parquet|
|  0|   0|null|d1.parquet|
|  0|null|   0|d2.parquet|
|  1|null|   1|d2.parquet|
+---+----+----+----------+
Yet, I do not recommend using unionAll on lots of dataframes. Because of spark's analysis of the execution plan, it can be very slow with many dataframes. I would use the RDD version although it is more verbose.
val rdds = sc.union(dfs.map(_.rdd))
// let's not forget to add the path to the schema
val big_df = spark.createDataFrame(rdds,
  big_schema.add(StructField("path", StringType, true)))
transform(big_df)
  .write
  .partitionBy("path")
  .parquet("hdfs:///tmp/processed.parquet")
And having a look at my processed directory, I get this:
hdfs:///tmp/processed.parquet/_SUCCESS
hdfs:///tmp/processed.parquet/path=d1.parquet
hdfs:///tmp/processed.parquet/path=d2.parquet
You should play with a few variables here. The most important are the number of CPU cores, the size of each DataFrame, and some use of futures. The purpose is to decide the priority with which each DataFrame is processed. You can use the FAIR scheduler configuration, but that won't be enough, and processing everything in parallel could consume a large part of your cluster. You have to assign priorities to the DataFrames and use a pool of Futures to control the number of parallel jobs running in your application.
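The "pool of futures" part can be sketched independently of Spark. The example below is in plain Python rather than Scala, purely to show the shape of the idea: a fixed-size thread pool caps how many jobs are in flight at once. process_path is a hypothetical stand-in for the read/transform/write of one DataFrame, MAX_PARALLEL_JOBS is an assumed value you would tune for your cluster, and the extra path names are placeholders.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_JOBS = 2  # assumed cap on concurrent jobs; tune for your cluster
paths = ["path1", "path2", "path3", "path4"]  # placeholder paths

_lock = threading.Lock()
_running = 0
peak_running = 0  # highest number of jobs observed running at the same time


def process_path(path: str) -> str:
    """Hypothetical stand-in for: read one dataframe, transform it, write it."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    time.sleep(0.05)  # pretend the job takes a while
    with _lock:
        _running -= 1
    return path + "_processed"


# The pool size, not the number of paths, bounds the concurrency.
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_JOBS) as pool:
    # map() submits every path and returns the results in input order
    results = list(pool.map(process_path, paths))

print(results, peak_running)
```

To express priorities, one simple option is to sort the list of paths (for example, largest DataFrame first) before submitting it, since the pool picks up work in submission order.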
The problem arises when I call describe function on a DataFrame:
val statsDF = myDataFrame.describe()
Calling describe function yields the following output:
statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]
I can show statsDF normally by calling statsDF.show()
+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             53173|
|   mean|104.76128862392568|
| stddev|3577.8184333911513|
|    min|                 1|
|    max|            558407|
+-------+------------------+
I would now like to get the standard deviation and the mean from statsDF, but when I try to collect the values by doing something like:
val temp = statsDF.where($"summary" === "stddev").collect()
I get a Task not serializable exception.
I am also facing the same exception when I call:
statsDF.where($"summary" === "stddev").show()
It looks like we cannot filter DataFrames generated by describe() function?
I tried this on a toy dataset of mine containing some health/disease data:
val stddev_tobacco = rawData.describe().rdd.map {
  case r: Row => (r.getAs[String]("summary"), r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect
You can select from the dataframe:
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
You can also register it as a table and query the table:
val t = x.describe()
t.registerTempTable("dt")
%sql
select * from dt
Another option would be to use selectExpr() which also runs optimized, e.g. to obtain the min:
myDataFrame.selectExpr('MIN(count)').head()[0]
myDataFrame.describe().filter($"summary"==="stddev").show()
This worked quite nicely on Spark 2.3.0
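Whichever way the row is selected, keep in mind that the describe() output in the question is typed [summary: string, count: string], so the collected values are strings and need converting before you do arithmetic with them. A small sketch in plain Python, using the values from the question's table as a stand-in for the collected rows:

```python
# Rows as printed by statsDF.show() in the question: every value is a string.
rows = [
    ("count", "53173"),
    ("mean", "104.76128862392568"),
    ("stddev", "3577.8184333911513"),
    ("min", "1"),
    ("max", "558407"),
]

stats = dict(rows)  # summary name -> value, still as strings

# Convert only what is needed, and only after the lookup.
mean = float(stats["mean"])
stddev = float(stats["stddev"])

print(mean, stddev)
```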
Is there any way to get the number of elements in a spark RDD partition, given the partition ID? Without scanning the entire partition.
Something like this:
Rdd.partitions().get(index).size()
Except I don't see such an API for spark. Any ideas? workarounds?
Thanks
The following gives you a new RDD with elements that are the sizes of each partition:
rdd.mapPartitions(iter => Array(iter.size).iterator, true)
PySpark:
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), num_partitions)
l = a.glom().map(len).collect() # get length of each partition
print(min(l), max(l), sum(l)/len(l), len(l)) # check if skewed
Spark/scala:
val numPartitions = 20000
val a = sc.parallelize(0 until 1e6.toInt, numPartitions )
val l = a.glom().map(_.length).collect() // get length of each partition
print(l.min, l.max, l.sum/l.length, l.length) // check if skewed
The same is possible for a dataframe, not just for an RDD.
Just add DF.rdd.glom... into the code above.
Notice that glom() converts elements of each partition into a list, so it's memory-intensive. A less memory-intensive version (pyspark version only):
import statistics
def get_table_partition_distribution(table_name: str):

    def get_partition_len(iterator):
        yield sum(1 for _ in iterator)

    l = spark.table(table_name).rdd.mapPartitions(get_partition_len, True).collect()  # get length of each partition
    num_partitions = len(l)
    min_count = min(l)
    max_count = max(l)
    avg_count = sum(l)/num_partitions
    stddev = statistics.stdev(l)
    print(f"{table_name} each of {num_partitions} partition's counts: min={min_count:,} avg±stddev={avg_count:,.1f} ±{stddev:,.1f} max={max_count:,}")

get_table_partition_distribution('someTable')
outputs something like
someTable each of 1445 partition's counts:
min=1,201,201 avg±stddev=1,202,811.6 ±21,783.4 max=2,030,137
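The difference between the two counting styles can be tried without a cluster. In this plain-Python sketch, each "partition" is just an iterator over made-up data; partition_len counts rows one at a time, while partition_len_glom (my name for the glom-style variant) first builds a list of the whole partition. Both give the same counts; only the memory behaviour differs.

```python
import statistics
from typing import Iterable, Iterator


def partition_len(rows: Iterable) -> Iterator[int]:
    """Count rows one by one; no row is kept in memory."""
    yield sum(1 for _ in rows)


def partition_len_glom(rows: Iterable) -> Iterator[int]:
    """glom-style count: materialize the whole partition, then measure it."""
    yield len(list(rows))


def make_partitions():
    # Made-up partitions of 3, 5 and 4 rows; fresh iterators on every call.
    return [iter(range(3)), iter(range(5)), iter(range(4))]


lazy_counts = [n for part in make_partitions() for n in partition_len(part)]
glom_counts = [n for part in make_partitions() for n in partition_len_glom(part)]

summary = (
    min(lazy_counts),
    max(lazy_counts),
    sum(lazy_counts) / len(lazy_counts),
    statistics.stdev(lazy_counts),
)
print(lazy_counts, glom_counts, summary)
```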
I know I'm a little late here, but I have another approach to get the number of elements in each partition, using Spark's built-in functions. It works for Spark versions above 2.1.
Explanation:
We are going to create a sample dataframe (df), get the partition id, do a group by on partition id, and count each record.
Pyspark:
>>> from pyspark.sql.functions import spark_partition_id, count as _count
>>> df = spark.sql("set -v").unionAll(spark.sql("set -v")).repartition(4)
>>> df.rdd.getNumPartitions()
4
>>> df.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").agg(_count("key")).orderBy("partition_id").show()
+------------+----------+
|partition_id|count(key)|
+------------+----------+
|           0|        48|
|           1|        44|
|           2|        32|
|           3|        48|
+------------+----------+
Scala:
scala> val df = spark.sql("set -v").unionAll(spark.sql("set -v")).repartition(4)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [key: string, value: string ... 1 more field]
scala> df.rdd.getNumPartitions
res0: Int = 4
scala> df.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").agg(count("key")).orderBy("partition_id").show()
+------------+----------+
|partition_id|count(key)|
+------------+----------+
|           0|        48|
|           1|        44|
|           2|        32|
|           3|        48|
+------------+----------+
pzecevic's answer works, but conceptually there's no need to construct an array and then convert it to an iterator. I would just construct the iterator directly and then get the counts with a collect call.
rdd.mapPartitions(iter => Iterator(iter.size), true).collect()
P.S. Not sure if his answer is actually doing more work since Iterator.apply will likely convert its arguments into an array.