How to get the number of elements in a partition? [duplicate] - apache-spark

This question already has answers here:
Apache Spark: Get number of records per partition
(6 answers)
Is there any way to get the number of elements in a spark RDD partition, given the partition ID? Without scanning the entire partition.
Something like this:
Rdd.partitions().get(index).size()
Except I don't see such an API in Spark. Any ideas or workarounds?
Thanks

The following gives you a new RDD with elements that are the sizes of each partition:
rdd.mapPartitions(iter => Array(iter.size).iterator, true)
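If you need to look up the count for a specific partition ID (as in the original question), you can keep the index next to the size with mapPartitionsWithIndex. A minimal PySpark sketch, assuming an existing SparkContext sc; the sample RDD is only illustrative:
rdd = sc.parallelize(range(100), 4)
# build {partition_id: element_count} without materializing partition contents
sizes = dict(
    rdd.mapPartitionsWithIndex(
        lambda idx, it: [(idx, sum(1 for _ in it))]
    ).collect()
)
print(sizes[2])  # number of elements in partition 2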

PySpark:
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), num_partitions)
l = a.glom().map(len).collect() # get length of each partition
print(min(l), max(l), sum(l)/len(l), len(l)) # check if skewed
Spark/Scala:
val numPartitions = 20000
val a = sc.parallelize(0 until 1e6.toInt, numPartitions)
val l = a.glom().map(_.length).collect()  // get length of each partition
println((l.min, l.max, l.sum.toDouble / l.length, l.length))  // check if skewed
The same is possible for a DataFrame, not just for an RDD; just add DF.rdd.glom... into the code above.
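For instance, a minimal PySpark sketch, assuming an existing DataFrame df:
l = df.rdd.glom().map(len).collect()  # one list per partition of the DataFrame
print(min(l), max(l), sum(l)/len(l), len(l))  # check if skewed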
Notice that glom() converts elements of each partition into a list, so it's memory-intensive. A less memory-intensive version (pyspark version only):
import statistics

def get_table_partition_distribution(table_name: str):

    def get_partition_len(iterator):
        yield sum(1 for _ in iterator)

    l = spark.table(table_name).rdd.mapPartitions(get_partition_len, True).collect()  # get length of each partition
    num_partitions = len(l)
    min_count = min(l)
    max_count = max(l)
    avg_count = sum(l)/num_partitions
    stddev = statistics.stdev(l)
    print(f"{table_name} each of {num_partitions} partition's counts: min={min_count:,} avg±stddev={avg_count:,.1f} ±{stddev:,.1f} max={max_count:,}")

get_table_partition_distribution('someTable')
outputs something like
someTable each of 1445 partition's counts:
min=1,201,201 avg±stddev=1,202,811.6 ±21,783.4 max=2,030,137

I know I'm a little late here, but I have another approach to get the number of elements in a partition by leveraging Spark's built-in functions. It works for Spark versions above 2.1.
Explanation:
We are going to create a sample DataFrame (df), get the partition ID, group by partition ID, and count the records in each partition.
PySpark:
>>> from pyspark.sql.functions import spark_partition_id, count as _count
>>> df = spark.sql("set -v").unionAll(spark.sql("set -v")).repartition(4)
>>> df.rdd.getNumPartitions()
4
>>> df.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").agg(_count("key")).orderBy("partition_id").show()
+------------+----------+
|partition_id|count(key)|
+------------+----------+
|           0|        48|
|           1|        44|
|           2|        32|
|           3|        48|
+------------+----------+
Scala:
scala> val df = spark.sql("set -v").unionAll(spark.sql("set -v")).repartition(4)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [key: string, value: string ... 1 more field]
scala> df.rdd.getNumPartitions
res0: Int = 4
scala> import org.apache.spark.sql.functions.{spark_partition_id, count}
import org.apache.spark.sql.functions.{spark_partition_id, count}
scala> df.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").agg(count("key")).orderBy("partition_id").show()
+------------+----------+
|partition_id|count(key)|
+------------+----------+
|           0|        48|
|           1|        44|
|           2|        32|
|           3|        48|
+------------+----------+

pzecevic's answer works, but conceptually there's no need to construct an array and then convert it to an iterator. I would just construct the iterator directly and then get the counts with a collect call.
rdd.mapPartitions(iter => Iterator(iter.size), true).collect()
P.S. Not sure if his answer is actually doing more work since Iterator.apply will likely convert its arguments into an array.

Related

Why are some partitions processed twice if mapPartitions is used with toDF()?

I need to process the data partition by partition (long story).
Using mapPartitions works fine with plain RDDs: in the example below, rdd.mapPartitions(mapper).collect() works as expected.
But when transforming to a DataFrame, one partition is processed twice.
Why is this happening, and how can I avoid it?
Below is the output of the simple example that follows. You can see that the function is executed 3 times, even though there are only two partitions. One of the partitions, [Row(id=1), Row(id=2)], is processed twice.
Curiously, one of those executions is discarded, as we can see in the resulting DataFrame.
size: 2 > values: [Row(id=1), Row(id=2)]
size: 2 > values: [Row(id=1), Row(id=2)]
size: 2 > values: [Row(id=3), Row(id=4)]
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
+---+
> Mapper executions: 3
Simple example used:
from typing import Iterator

from pyspark.sql import Row, SparkSession

def gen_random_row(id: str):
    return Row(id=id)

if __name__ == '__main__':
    spark = SparkSession.builder.master("local[1]").appName("looking for the error").getOrCreate()
    executions_counter = spark.sparkContext.accumulator(0)

    rdd = spark.sparkContext.parallelize([
        gen_random_row(1),
        gen_random_row(2),
        gen_random_row(3),
        gen_random_row(4),
    ], 2)

    def mapper(iterator: Iterator[Row]) -> Iterator[Row]:
        executions_counter.add(1)
        lst = list(iterator)
        print(f"size: {len(lst)} > values: {lst}")
        for r in lst:
            yield r

    # rdd.mapPartitions(mapper).collect()
    rdd.mapPartitions(mapper).toDF().show()

    print(f"> Mapper executions: {executions_counter.value}")
    spark.stop()
The solution is passing the schema to toDF.
Looks like Spark is processing one partition to infer the schema.
To solve it:
from pyspark.sql.types import IntegerType, StructField, StructType

schema = StructType([StructField("id", IntegerType(), True)])
rdd.mapPartitions(mapper).toDF(schema).show()
With this code, every partition is processed exactly once.
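An equivalent alternative (a sketch only, reusing rdd, mapper, and schema from the example above) is to build the DataFrame directly with createDataFrame; with an explicit schema there is likewise no partition evaluation for schema inference:
df = spark.createDataFrame(rdd.mapPartitions(mapper), schema)  # same effect as .toDF(schema)
df.show()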

Calculate a new column in a Spark DataFrame, crossing a token-list column in df1 with a text column in df2, using PySpark

I am using Spark 2.4.5 and I need to calculate a sentiment score from a token-list column (the MeaningfulWords column) of df1, according to the words in df2 (a Spanish sentiment dictionary). In df1 I must create a new column with the list of token scores and another column with the mean score (sum of scores / word count) of each record. If any token in the list (df1) is not in the dictionary (df2), it scores zero.
The DataFrames look like this:
df1.select("ID","MeaningfulWords").show(truncate=True, n=5)
+------------------+------------------------------+
|                ID|               MeaningfulWords|
+------------------+------------------------------+
|abcde00000qMQ00001|[casa, alejado, buen, gusto...|
|abcde00000qMq00002|[clientes, contentos, servi...|
|abcde00000qMQ00003|                 [resto, bien]|
|abcde00000qMQ00004|[mal, servicio, no, antiend...|
|abcde00000qMq00005|[gestion, adecuada, proble ...|
+------------------+------------------------------+
df2.show(5)
+-----+----------+
|score|      word|
+-----+----------+
| 1.68|abandonado|
| 3.18|    abejas|
|  2.8|    aborto|
| 2.46| abrasador|
| 8.13|    abrazo|
+-----+----------+
The new columns to add to df1 should look like this:
+------------------+---------------------+
|         MeanScore|            ScoreList|
+------------------+---------------------+
|              2.95|[3.10, 2.50, 1.28,...|
|              2.15|[1.15, 3.50, 2.75,...|
|              2.75|[4.20, 1.00, 1.75,...|
|              3.25|[3.25, 2.50, 3.20,...|
|              3.15|[2.20, 3.10, 1.28,...|
+------------------+---------------------+
I have reviewed some options using .join, but using columns with different data types gives an error.
I have also tried converting the DataFrames to RDDs and calling a function:
def map_words_to_values(review_words, dict_df):
    return [dict_df[word] for word in review_words if word in dict_df]

RDD1 = swRemoved.rdd.map(list)
RDD2 = Dict_df.rdd.map(list)
reviewsRDD_dict_values = RDD1.map(lambda tuple: (tuple[0], map_words_to_values(tuple[1], RDD2)))
reviewsRDD_dict_values.take(3)
But with this option I get the error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
I have found some examples that score text using the afinn library, but it doesn't work with Spanish text.
If possible, I want to use native PySpark functions instead of UDFs to avoid hurting performance. But I'm a beginner in Spark and I would like to find the Spark way to do this.
Welcome to SO. You could do this by first joining on array_contains(MeaningfulWords, word), then grouping by ID with first, collect_list, and mean aggregations (Spark 2.4+).
df1.show()
#+------------------+----------------------------+
#|ID |MeaningfulWords |
#+------------------+----------------------------+
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|
#|abcde00000qMq00002|[clientes, contentos, servi]|
#|abcde00000qMQ00003|[resto, bien] |
#+------------------+----------------------------+
df2.show()
#+-----+---------+
#|score| word|
#+-----+---------+
#| 1.68| casa|
#| 2.8| alejado|
#| 1.03| buen|
#| 3.68| gusto|
#| 0.68| clientes|
#| 2.1|contentos|
#| 2.68| servi|
#| 1.18| resto|
#| 1.98| bien|
#+-----+---------+
from pyspark.sql import functions as F

df1.join(df2, F.expr("array_contains(MeaningfulWords, word)"), 'left')\
   .groupBy("ID").agg(F.first("MeaningfulWords").alias("MeaningfullWords"),
                      F.collect_list("score").alias("ScoreList"),
                      F.mean("score").alias("MeanScore"))\
   .show(truncate=False)
#+------------------+----------------------------+-----------------------+------------------+
#|ID |MeaningfullWords |ScoreList |MeanScore |
#+------------------+----------------------------+-----------------------+------------------+
#|abcde00000qMQ00003|[resto, bien] |[1.18, 1.98] |1.58 |
#|abcde00000qMq00002|[clientes, contentos, servi]|[0.68, 2.1, 2.68] |1.8200000000000003|
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|[1.68, 2.8, 1.03, 3.68]|2.2975 |
#+------------------+----------------------------+-----------------------+------------------+
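If tokens that are missing from the dictionary must count as zero, as the question states, one hedged variation (same df1/df2 as above) is to divide the summed score by the total number of tokens instead of the number of matched tokens; coalesce covers records with no matches at all:
from pyspark.sql import functions as F

df1.join(df2, F.expr("array_contains(MeaningfulWords, word)"), 'left')\
   .groupBy("ID").agg(F.first("MeaningfulWords").alias("MeaningfulWords"),
                      F.collect_list("score").alias("ScoreList"),
                      (F.coalesce(F.sum("score"), F.lit(0.0))
                       / F.size(F.first("MeaningfulWords"))).alias("MeanScore"))\
   .show(truncate=False)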

Spark distribute tasks over several executors

I'd like to run a SQL query in parallel and be able to control the level of parallelism, up to 8 concurrent queries. Right now, I am using this piece of code.
The idea is to create 8 partitions and let the executors run them in parallel.
(1 to 8).toSeq.toDF.repartition(8) // 8 partitions
  .rdd.mapPartitions(
    x => {
      val conn = createConnection()
      x.foreach {
        s => { // expect the query below to be run concurrently
          execute(s"SELECT * FROM myTable WHERE col = ${s.get(0)}")
        }
      }
      conn.close()
      x
    }).take(1)
The problem is that the 8 queries run one by one.
How should I proceed to get the queries to run 8 at a time?
When you do
val df = (1 to 8).toSeq.toDF.repartition(8)
This will not create 8 partitions with 1 record each. If you inspect this dataframe (see e.g. https://stackoverflow.com/a/46032600/1138523), then you get:
+----------------+-----------------+
|partition_number|number_of_records|
+----------------+-----------------+
|               0|                0|
|               1|                0|
|               2|                0|
|               3|                0|
|               4|                0|
|               5|                0|
|               6|                4|
|               7|                4|
+----------------+-----------------+
So you will have only 2 non-empty partitions, and therefore at most 2-fold parallelism (I've asked about this here: How does Round Robin partitioning in Spark work?).
To get equal-sized partitions, you are better off using
spark.sparkContext.parallelize((0 to 7), numSlices = 8)
instead of
(1 to 8).toSeq.toDF.repartition(8).rdd
The first option gives you 1 record per partition, the second one does not, because it uses round-robin partitioning.
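You can verify this quickly; a PySpark sketch of the same check, assuming a SparkContext sc (parallelize slices the range the same way in Scala):
print(sc.parallelize(range(8), 8).glom().map(len).collect())  # [1, 1, 1, 1, 1, 1, 1, 1]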
As a side note, when you do x.foreach, x will be consumed (iterators are only traversable once), so if you return x you will always get an empty iterator.
So your final code can look like this:
spark.sparkContext.parallelize((0 to 7), numSlices = 8)
  .mapPartitions(
    x => {
      val xL = x.toList    // convert to List
      assert(xL.size == 1) // make sure the partition has only 1 record
      val conn = createConnection()
      xL.foreach {
        s => { // expect the query below to be run concurrently
          execute(s"SELECT * FROM myTable WHERE col = ${s}")
        }
      }
      conn.close()
      xL.toIterator
    })
  .collect // trigger all queries
Instead of using mapPartitions (which is lazy), you could also use foreachPartition, which is non-lazy.
As you have only 1 record per partition, iterating the partitions isn't really beneficial; you could also just use a plain foreach:
spark.sparkContext.parallelize((0 to 7), numSlices = 8)
  .foreach(s => {
    val conn = createConnection()
    execute(s"SELECT * FROM myTable WHERE col = ${s}")
    conn.close()
  })
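For completeness, a PySpark sketch of the same plain-foreach idea; createConnection and execute stand in for the question's own (hypothetical) helpers:
def run_query(s):
    # runs on an executor, one query per record (and hence per partition)
    conn = createConnection()
    execute(f"SELECT * FROM myTable WHERE col = {s}")
    conn.close()

sc.parallelize(range(8), 8).foreach(run_query)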

UDF not working

Can you help me optimize this code and make it work?
This is the original data:
+--------------------+-------------+
|       original_name|medicine_name|
+--------------------+-------------+
|         Venlafaxine|  Venlafaxine|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|    Lacrifilm 5mg/ml|         null|
|         Venlafaxine|         null|
|Vitamin D10,000IU...|         null|
|         paracetamol|         null|
|            mucolite|         null|
+--------------------+-------------+
I expect to get data like this:
+--------------------+-------------+
|       original_name|medicine_name|
+--------------------+-------------+
|         Venlafaxine|  Venlafaxine|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|    Lacrifilm 5mg/ml|    Lacrifilm|
|         Venlafaxine|  Venlafaxine|
|Vitamin D10,000IU...|         null|
|         paracetamol|         null|
|            mucolite|         null|
+--------------------+-------------+
This is the code:
distinct_df = spark.sql("select distinct medicine_name as medicine_name from medicine where medicine_name is not null")
distinct_df.createOrReplaceTempView("distinctDF")

def getMax(num1, num2):
    pmax = (num1 >= num2)*num1 + (num2 > num1)*num2
    return pmax

def editDistance(s1, s2):
    ed = (getMax(length(s1), length(s2)) - levenshtein(s1, s2)) / \
         getMax(length(s1), length(s2))
    return ed

editDistanceUdf = udf(lambda x, y: editDistance(x, y), FloatType())

def getSimilarity(str):
    res = spark.sql("select medicine_name, editDistanceUdf('str', medicine_name) from distinctDf where editDistanceUdf('str', medicine_name)>=0.85 order by 2")
    res['medicine_name'].take(1)
    return res

getSimilarityUdf = udf(lambda x: getSimilarity(x), StringType())

res_df = df.withColumn('m_name', when((df.medicine_name.isNull)|(df.medicine_name.=="null")),getSimilarityUdf(df.original_name)
    .otherwise(df.medicine_name)).show()
Now I'm getting this error:
command_part = REFERENCE_TYPE + parameter._get_object_id()
AttributeError: 'function' object has no attribute '_get_object_id'
There are a bunch of problems with your code:
You cannot use a SparkSession or distributed objects inside a udf, so getSimilarity just cannot work. If you want to compare objects like this, you have to join.
If length and levenshtein come from pyspark.sql.functions, they cannot be used inside UserDefinedFunctions. They are designed to generate SQL expressions, mapping from Column to Column (see the sketch after this list).
Column.isNull is a method, not a property, so it should be called:
df.medicine_name.isNull()
The following:
df.medicine_name.=="null"
is not syntactically valid Python (it looks like a Scala calque) and will raise a SyntaxError.
Even if SparkSession access were allowed inside a UserDefinedFunction, this wouldn't be a valid substitution:
spark.sql("select medicine_name, editDistanceUdf('str', medicine_name) from distinctDf where editDistanceUdf('str', medicine_name)>=0.85 order by 2")
You should use string formatting methods instead:
spark.sql("select medicine_name, editDistanceUdf({str}, medicine_name) from distinctDf where editDistanceUdf({str}, medicine_name)>=0.85 order by 2".format(str=str))
There may be other problems, but since you didn't provide an MCVE, anything else would be pure guessing.
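For example, the editDistance above (really a normalized similarity) can be rewritten as a plain Python function over Column expressions, with no udf involved; this is a sketch of the kind of expression the join-based options below would use:
from pyspark.sql import functions as F

def edit_similarity(s1, s2):
    # s1 and s2 are Column expressions; the result is another Column expression,
    # so it can go straight into select/withColumn/filter
    longest = F.greatest(F.length(s1), F.length(s2))
    return (longest - F.levenshtein(s1, s2)) / longest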
Once you fix the smaller mistakes, you have two choices:
Use crossJoin (sketched below):
combined = df.alias("left").crossJoin(spark.table("distinctDf").alias("right"))
Then apply the udf, filter, and use one of the methods listed in Find maximum row per group in Spark DataFrame to pick the closest match in each group.
Or use built-in approximate matching tools, as explained in Efficient string matching in Apache Spark.
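A hedged sketch of the crossJoin route, using the built-in levenshtein instead of a Python udf and a window to keep the closest match per name; df and the distinctDf temp view are the ones from the question, and 0.85 is the threshold used there:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# normalized similarity between the raw name and each dictionary entry
longest = F.greatest(F.length(F.col("left.original_name")),
                     F.length(F.col("right.medicine_name")))
combined = (df.alias("left")
              .crossJoin(spark.table("distinctDf").alias("right"))
              .withColumn("similarity",
                          (longest - F.levenshtein(F.col("left.original_name"),
                                                   F.col("right.medicine_name"))) / longest))

# keep only the best-scoring dictionary entry per original_name, above the threshold
best = Window.partitionBy(F.col("left.original_name")).orderBy(F.desc("similarity"))
matches = (combined
           .withColumn("rank", F.row_number().over(best))
           .filter((F.col("rank") == 1) & (F.col("similarity") >= 0.85)))
matches.show(truncate=False)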

Spark 1.6: filtering DataFrames generated by describe()

The problem arises when I call the describe function on a DataFrame:
val statsDF = myDataFrame.describe()
Calling the describe function yields the following output:
statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]
I can show statsDF normally by calling statsDF.show()
+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             53173|
|   mean|104.76128862392568|
| stddev|3577.8184333911513|
|    min|                 1|
|    max|            558407|
+-------+------------------+
I would now like to get the standard deviation and the mean from statsDF, but when I try to collect the values by doing something like:
val temp = statsDF.where($"summary" === "stddev").collect()
I get a Task not serializable exception.
I am also facing the same exception when I call:
statsDF.where($"summary" === "stddev").show()
It looks like we cannot filter DataFrames generated by the describe() function?
I used a toy dataset I had, containing some health disease data:
val stddev_tobacco = rawData.describe().rdd.map {
  case r: Row => (r.getAs[String]("summary"), r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect
You can select from the dataframe:
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
You can also register it as a table and query the table:
val t = x.describe()
t.registerTempTable("dt")
%sql
select * from dt
Another option would be to use selectExpr() which also runs optimized, e.g. to obtain the min:
myDataFrame.selectExpr('MIN(count)').head()[0]
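Another way to avoid filtering describe()'s string-typed output altogether is to aggregate directly. A short sketch, assuming the column of interest is named count as in the output above:
from pyspark.sql import functions as F

# returns native numeric values, no string parsing of describe() needed
stddev_value, mean_value = (myDataFrame
                            .agg(F.stddev("count"), F.mean("count"))
                            .first())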
myDataFrame.describe().filter($"summary" === "stddev").show()
This worked quite nicely on Spark 2.3.0.
