Simplify a complex PySpark aggregation over multiple (30) columns in one go - apache-spark

I have a sample spark df as below:
df = spark.createDataFrame([[1, 'a', 'b', 'c'],
                            [1, 'b', 'c', 'b'],
                            [1, 'b', 'a', 'b'],
                            [2, 'c', 'a', 'a'],
                            [3, 'b', 'b', 'a']], ['id', 'field1', 'field2', 'field3'])
What I need next is to run multiple aggregations to summarize the a, b, c values for each field. I have a working but tedious process below:
from pyspark.sql.functions import col, when, sum

agg_table = (
    df
    .groupBy('id')
    .agg(
        # field1
        sum(when(col('field1') == 'a', 1).otherwise(0)).alias('field1_a_count')
        ,sum(when(col('field1') == 'b', 1).otherwise(0)).alias('field1_b_count')
        ,sum(when(col('field1') == 'c', 1).otherwise(0)).alias('field1_c_count')
        # field2
        ,sum(when(col('field2') == 'a', 1).otherwise(0)).alias('field2_a_count')
        ,sum(when(col('field2') == 'b', 1).otherwise(0)).alias('field2_b_count')
        ,sum(when(col('field2') == 'c', 1).otherwise(0)).alias('field2_c_count')
        # field3
        ,sum(when(col('field3') == 'a', 1).otherwise(0)).alias('field3_a_count')
        ,sum(when(col('field3') == 'b', 1).otherwise(0)).alias('field3_b_count')
        ,sum(when(col('field3') == 'c', 1).otherwise(0)).alias('field3_c_count')
    ))
What I am expecting to get is this:
agg_table = (['id':'1','2','3'],
['field1_a_count':1,0,0],
['field1_b_count':2,0,1],
['field1_c_count':0, 1, 0],
['field2_a_count':1,1,0],
['field2_b_count':1,0,1],
['field2_c_count':1,0,0],
['field3_a_count':0,1,1],
['field3_b_count':2,0,0],
['field3_c_count':1,0,0])
This is fine if I only have 3 fields, but I have 30 fields with varying/custom names. Maybe somebody can help me avoid the repetitive task of coding the aggregated sum per field. I tried playing around with a suggestion from:
https://danvatterott.com/blog/2018/09/06/python-aggregate-udfs-in-pyspark/
I can make it work if I only pull one column and one value, but I get varying errors; one of them is:
AnalysisException: cannot resolve '`value`' given input columns: ['field1','field2','field3']
One last line I tried is using:
validated_cols = ['field1','field2','field3']
df.select(validated_cols).groupBy('id').agg(collect_list($'field1_a_count',$'field1_b_count',$'field1_c_count', ...
$'field30_c_count')).show()
Output: SyntaxError: invalid syntax
I tried pivot too, but from my searches so far it seems it is only good for one column. I tried this for multiple columns:
df.withColumn("p", concat($"p1", $"p2"))
.groupBy("a", "b")
.pivot("p")
.agg(...)
I still get a syntax error.
Another link I tried: https://danvatterott.com/blog/2019/02/05/complex-aggregations-in-pyspark/
I also tried the exprs approach: exprs1 = {x: "sum" for x in df.columns if x != 'id'}
Any suggestions will be appreciated. Thanks.

Let me answer your question in two steps. First, you are wondering if it is possible to avoid hard coding all your aggregations. It is. I would do it like this:
from pyspark.sql import functions as f

# let's assume that this is known, but we could compute it as well
values = ['a', 'b', 'c']
# All the columns except the id
cols = [c for c in df.columns if c != 'id']

def count_values(column, value):
    return f.sum(f.when(f.col(column) == value, 1).otherwise(0))\
            .alias(f"{column}_{value}_count")

# And this gives you the result of your hard coded aggregations:
df\
    .groupBy('id')\
    .agg(*[count_values(c, value) for c in cols for value in values])\
    .show()
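The snippet above assumes the set of values is known up front (see the comment in the code). If it is not, one way to compute it, assuming the distinct values are few enough to collect to the driver, is to union the single-column projections:

from functools import reduce

# Collect the distinct values that appear in any of the non-id columns.
# union() resolves columns by position, so the differing column names are fine.
values = sorted(
    reduce(lambda a, b: a.union(b), [df.select(c) for c in cols])
    .distinct()
    .rdd.flatMap(lambda row: row)
    .collect())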
But that is not what you expect, right? You are actually trying to compute a kind of pivot on the id column. To do this, I would not use the previous result, but just work the data differently. I would start by replacing all the columns of the dataframe except id (which is renamed to x) by an array of values of the form {column_name}_{value}_count, and then explode that array. From there, we just need a simple pivot on the former id column (now x), grouped by the values contained in the exploded array.
df\
    .select(f.col('id').alias('x'), f.explode(
        f.array(
            [f.concat_ws('_', f.lit(c), f.col(c), f.lit('count')).alias(c)
             for c in cols]
        )
    ).alias('id'))\
    .groupBy('id')\
    .pivot('x')\
    .count()\
    .na.fill(0)\
    .orderBy('id')\
    .show()
which yields:
+--------------+---+---+---+
| id| 1| 2| 3|
+--------------+---+---+---+
|field1_a_count| 1| 0| 0|
|field1_b_count| 2| 0| 1|
|field1_c_count| 0| 1| 0|
|field2_a_count| 1| 1| 0|
|field2_b_count| 1| 0| 1|
|field2_c_count| 1| 0| 0|
|field3_a_count| 0| 1| 1|
|field3_b_count| 2| 0| 0|
|field3_c_count| 1| 0| 0|
+--------------+---+---+---+

Update
Based on discussion in the comments, I think this question is a case of an X-Y problem. The task at hand is something that is seen very frequently in the world of Data Engineering and ETL development: how to partition and then quantify good and bad records.
In the case where the data is being prepared to load to a data warehouse / hadoop ecosystem, the usual pattern is to take the raw input and load it to a dataframe, then apply transformations & validations that partition the data into "The Good, The Bad, and The Ugly":
The first, and hopefully largest, partition contains records that are successfully transformed and which pass validation. These will go on to be persisted in durable storage and certified for use in analytics.
The second partition contains records that were successfully transformed but which failed during QA. The QA rules should include checks for illegal nulls, string pattern matching (like phone number format), etc...
The third partition is for records that are rejected early in the process because they failed on a transformation step. Examples include fields containing non-numeric values that fail to cast to numeric types, text fields that exceed the maximum length, or strings that contain control characters not supported by the database.
The goal should not be to generate counts for each of these 3 classifications across every column and for every row. Trying to do that is counterproductive. Why? Because when a transformation step or QA check fails for a given record, that entire record should be rejected immediately and sent to a separate output stream to be analyzed later. Each row in the data set should be treated as just that: a single record. It isn't possible for a single field to fail and still have the complete record pass, which makes metrics at this granularity unnecessary. What action will you take knowing that 100 rows passed on the "address" field? For valid records, all that matters is the total number that passed for every column. Otherwise, it wouldn't be a valid record.
With that said, remember that the goal is to build a usable and cleansed data set; analyzing the rejected records is a secondary task and can be done offline.
It is common practice to add a field to the rejected data to indicate which column caused the failure. That makes it easy to troubleshoot any malformed data, so there is really no need to generate counts across all columns, even for bad records. Instead, just review the rejected data after the main job finishes, and address the problems. Continue doing that iteratively until the number of rejected records is below whatever threshold you think is reasonable, and then continue to monitor it going forward.
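For illustration only, here is a minimal PySpark sketch of that practice; raw_df, the age and phone rules, and the output paths are all hypothetical placeholders, not the asker's schema:

from pyspark.sql import functions as F

# Hypothetical QA rules: record the first column that fails validation.
reject_reason = (
    F.when(F.col("age").isNull(), F.lit("age"))
     .when(~F.col("phone").rlike(r"^\d{3}-\d{3}-\d{4}$"), F.lit("phone")))
# no .otherwise(): records that pass every check get a null reason

tagged = raw_df.withColumn("reject_reason", reject_reason)

good = tagged.where(F.col("reject_reason").isNull()).drop("reject_reason")
rejected = tagged.where(F.col("reject_reason").isNotNull())

good.write.mode("overwrite").parquet("/warehouse/clean")        # certified for analytics
rejected.write.mode("overwrite").parquet("/warehouse/rejects")  # reviewed offline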
Old answer
This is a sign of a design flaw in the data. Whatever the "field1", "field2", etc... columns actually represent, it appears they are all related, in the sense that the values quantify some attribute (maybe each one is a count for a specific merchandise ID, or the number of people with a certain property...). The problem is that these fields are being added as individual columns on a fact table [1], which then needs to be aggregated, resulting in the situation that you're facing.
A better design would be to collapse those "field1", "field2", etc... columns into a single code field that can be used as the GROUP BY field when doing the aggregation. You might want to consider creating a separate table to do this if the existing one has many other columns and making this change would alter the grain in a way that might cause other problems.
[1]: It's usually a big red flag to have a table with a bunch of enumerated columns with the same name and purpose. I've even seen cases where someone has created tables with "spare" columns for when they want to add more attributes later. Not good.
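A hedged sketch of that redesign, using the toy field1/field2/field3 columns from the question: stack unpivots the enumerated columns into a single code-like field that can be grouped on directly.

# Unpivot field1..field3 into (field_name, code) pairs, then group on them.
long_df = df.selectExpr(
    "id",
    "stack(3, 'field1', field1, 'field2', field2, 'field3', field3) "
    "as (field_name, code)")

long_df.groupBy("field_name", "code").count().show()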

Related

Spark SQL alternatives to groupby/pivot/agg/collect_list using foldLeft & withColumn so as to improve performance

I have a Spark DataFrame consisting of three columns:
id | col1 | col2
-----------------
x | p1 | a1
-----------------
x | p2 | b1
-----------------
y | p2 | b2
-----------------
y | p2 | b3
-----------------
y | p3 | c1
After applying df.groupBy("id").pivot("col1").agg(collect_list("col2")) I am getting the following dataframe (aggDF):
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x|[a1]| [b1]| []|
| y| []|[b2, b3]|[c1]|
+---+----+--------+----+
Then I find the name of columns except the id column.
val cols = aggDF.columns.filter(x => x != "id")
After that I am using cols.foldLeft(aggDF)((df, x) => df.withColumn(x, when(size(col(x)) > 0, col(x)).otherwise(lit(null)))) to replace empty array with null. The performance of this code becomes poor when the number of columns increases. Additionally, I have the name of string columns val stringColumns = Array("p1","p3"). I want to get the following final dataframe:
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x| a1 | [b1]|null|
| y|null|[b2, b3]| c1 |
+---+----+--------+----+
Is there any better solution to this problem in order to achieve the final dataframe?
Your current code pays two performance costs as structured:
As mentioned by Alexandros, you pay one Catalyst analysis per DataFrame transform, so if you loop over a few hundred or a few thousand columns, you'll notice some time spent on the driver before the job is actually submitted. If this is a critical issue for you, you can use a single select statement instead of your foldLeft on withColumn, but this won't change the execution time much because of the next point.
When you use an expression such as when().otherwise() on columns in what can be optimized as a single select statement, the code generator will produce a single large method processing all the columns. If you have more than a couple hundred columns, it's likely that the resulting method won't be JIT-compiled by default by the JVM, resulting in very slow execution performance (the maximum JIT-able method size is 8k of bytecode in HotSpot).
You can detect whether you hit the second issue by inspecting the executor logs and checking for a WARNING about a method that is too large to be JIT-compiled.
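To make the first point concrete, here is a hedged PySpark sketch of the two shapes (the question itself is in Scala, but the idea is the same): the loop triggers one analysis per column, while the single select triggers one analysis overall.

from pyspark.sql import functions as F

cols = [c for c in aggDF.columns if c != "id"]

# One withColumn per column: one Catalyst analysis per transform.
df_loop = aggDF
for c in cols:
    df_loop = df_loop.withColumn(c, F.when(F.size(F.col(c)) > 0, F.col(c)))

# Single select: one Catalyst analysis for all columns at once.
df_select = aggDF.select(
    "id",
    *[F.when(F.size(F.col(c)) > 0, F.col(c)).alias(c) for c in cols])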
How to try and solve this?
1 - Changing the logic
You can filter the empty cells before the pivot by using a window transform:
import org.apache.spark.sql.expressions.Window

val finalDf = df
  .withColumn("count", count('col2) over Window.partitionBy('id, 'col1))
  .filter('count > 0)
  .groupBy("id").pivot("col1").agg(collect_list("col2"))
This may or may not be faster depending on the actual dataset, as the pivot also generates a large select statement expression by itself, so it may hit the large-method threshold if you have more than approximately 500 distinct values for col1.
You may want to combine this with option 2 as well.
2 - Try and finesse the JVM
You can add an extraJavaOption on your executors to ask the JVM to try and JIT hot methods larger than 8k.
For example, add the option
--conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods"
on your spark-submit and see how it impacts the pivot execution time.
It's difficult to guarantee a substantial speed increase without more details on your real dataset but it's definitely worth a shot.
If you look at https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015 you will see that withColumn with a foldLeft has known performance issues. Select is an alternative, as shown below, using varargs.
I'm not convinced collect_list is an issue. I kept the first set of logic as well. pivot kicks off a job to get the distinct values for pivoting. It is an accepted approach imo. Trying to roll your own seems pointless to me, but the other answers may prove me wrong, or Spark 2.4 may have improved things.
import spark.implicits._
import org.apache.spark.sql.functions._

// Your code & assuming id is the only col of interest as in THIS question. More elegant than 1st posting.
val df = Seq(("x", "p1", "a1"), ("x", "p2", "b1"), ("y", "p2", "b2"), ("y", "p2", "b3"), ("y", "p3", "c1")).toDF("id", "col1", "col2")
val aggDF = df.groupBy("id").pivot("col1").agg(collect_list("col2"))
//aggDF.show(false)

val colsToSelect = aggDF.columns  // All in this case; the 1st col (id) is handled by head & tail

val aggDF2 = aggDF.select((col(colsToSelect.head) +: colsToSelect.tail.map(
  col => when(size(aggDF(col)) === 0, lit(null)).otherwise(aggDF(col)).as(s"$col"))):_*)
aggDF2.show(false)
returns:
+---+----+--------+----+
|id |p1 |p2 |p3 |
+---+----+--------+----+
|x |[a1]|[b1] |null|
|y |null|[b2, b3]|[c1]|
+---+----+--------+----+
Also a nice read BTW: https://lansalo.com/2018/05/13/spark-how-to-add-multiple-columns-in-dataframes-and-how-not-to/. The effects become more noticeable with a higher number of columns. At the end, a reader makes a relevant point:
I think that performance is better with the select approach when a higher number of columns prevails.
UPD: Over the holidays I trialed both approaches with Spark 2.4.x and observed little difference up to 1000 columns. That has puzzled me.

Spark Scala sort PIVOT column

The following:
val pivotDF = df.groupBy("Product").pivot("Country").sum("Amount")
pivotDF.show()
I cannot recall seeing the ability to sort the pivoted columns. What is the assumed sort order? Always ascending? I cannot find it documented. Or is it non-deterministic?
Tips welcome.
According to the Scala docs:
There are two versions of pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
Taking a look at how the latter one works:
// This is to prevent unintended OOM errors when the number of distinct values is large
val maxValues = df.sparkSession.sessionState.conf.dataFramePivotMaxValues
// Get the distinct values of the column and sort them so its consistent
val values = df.select(pivotColumn)
  .distinct()
  .limit(maxValues + 1)
  .sort(pivotColumn) // ensure that the output columns are in a consistent logical order
  .collect()
  .map(_.get(0))
  .toSeq
and values is passed to the former version. So when using the version that auto-detects the values, the columns are always sorted using the natural ordering of values. If you need another sorting, it is easy enough to replicate the auto-detection mechanism and then call the version with explicit values:
val df = Seq(("Foo", "UK", 1), ("Bar", "UK", 1), ("Foo", "FR", 1), ("Bar", "FR", 1))
.toDF("Product", "Country", "Amount")
df.groupBy("Product")
.pivot("Country", Seq("UK", "FR")) // natural ordering would be "FR", "UK"
.sum("Amount")
.show()
Output:
+-------+---+---+
|Product| UK| FR|
+-------+---+---+
| Bar| 1| 1|
| Foo| 1| 1|
+-------+---+---+
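For PySpark users, the same explicit-values trick applies; a minimal sketch, assuming a SparkSession named spark:

df = spark.createDataFrame(
    [("Foo", "UK", 1), ("Bar", "UK", 1), ("Foo", "FR", 1), ("Bar", "FR", 1)],
    ["Product", "Country", "Amount"])

(df.groupBy("Product")
   .pivot("Country", ["UK", "FR"])  # explicit list keeps UK before FR
   .sum("Amount")
   .show())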

how to get first value and last value from dataframe column in pyspark?

I have a DataFrame, and I want to get the first and last values from a DataFrame column.
+----+-----+--------------------+
|test|count| support|
+----+-----+--------------------+
| A| 5| 0.23809523809523808|
| B| 5| 0.23809523809523808|
| C| 4| 0.19047619047619047|
| G| 2| 0.09523809523809523|
| K| 2| 0.09523809523809523|
| D| 1|0.047619047619047616|
+----+-----+--------------------+
The expected output is the first and last value of the support column, i.e. x = [0.23809523809523808, 0.047619047619047616].
You may use collect but the performance is going to be terrible since the driver will collect all the data, just to keep the first and last items. Worse than that, it will most likely cause an OOM error and thus not work at all if you have a big dataframe.
Another idea would be to use agg with the first and last aggregation function. This does not work! (because the reducers do not necessarily get the records in the order of the dataframe)
Spark offers a head function, which makes getting the first element very easy. However, Spark does not offer any last function. A straightforward approach would be to sort the dataframe backward and use the head function again.
import pyspark.sql.functions as F

first = df.head().support
last = df.orderBy(F.monotonically_increasing_id().desc()).head().support
Finally, since it is a shame to sort a dataframe simply to get its first and last elements, we can use the RDD API and zipWithIndex to index the dataframe and only keep the first and the last elements.
size = df.count()
df.rdd.zipWithIndex()\
.filter(lambda x : x[1] == 0 or x[1] == size-1)\
.map(lambda x : x[0].support)\
.collect()
You can try indexing the data frame; see the example below:
df = <your dataframe>
first_record = df.collect()[0]
last_record = df.collect()[-1]
EDIT:
You have to pass the column name as well.
df = <your dataframe>
first_record = df.collect()[0]['column_name']
last_record = df.collect()[-1]['column_name']
Since version 3.0.0, Spark also has a DataFrame function called .tail() to get the last value.
This will return a list of Row objects:
last=df.tail(1)[0].support
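Putting the pieces together for the sample dataframe in the question, assuming Spark >= 3.0:

first = df.head().support       # 0.23809523809523808
last = df.tail(1)[0].support    # 0.047619047619047616
x = [first, last]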

Spark: refer to a column dynamically based on the value of another column

I have a dataset with many fields, and one of the fields, "valuefieldname", is a reference/pointer column which contains the field_name of the field that should be processed. How can I dynamically fetch that column's value based on the "valuefieldname" column?
I need something similar to the below code (which doesn't work)
val dataSet2 = dataSet1.withColumn("targetoutput", col(col("valuefieldname")))
There is no way to refer to a column dynamically directly in a Spark plan. Therefore, the dynamic access has to happen either via a data structure that is part of the plan or via more than one plan. This leads to three strategies for solving the problem:
Use a UDF to dynamically address a field in a Row. This is the most general and easiest approach. It works best when there aren't too many columns and/or when the data is sparse.
Build a MapType column and reference it. In some cases, this can be more efficient than (1); see the sketch after the example for strategy (1) below.
Make multiple (light) passes through the data and union the results. Best used when the number of columns is small and the data in each column is "heavy", e.g., deeply structured data, and dense.
Here is how to do (1):
def getColumnAs[A](colName: String, row: Row): Option[A] =
  if (row == null) None
  else {
    val idx = row.fieldIndex(colName)
    if (row.isNullAt(idx)) None else Some(row.getAs[A](idx))
  }

case class Data(col_name: String, x: Option[Int], y: Option[Int])

val df = spark.createDataset(Seq(
  Data("x", Some(1), None),
  Data("x", Some(2), Some(20)),
  Data("y", None, Some(30))
)).toDF

val colValue = udf(getColumnAs[Int] _)

df.select(
    'col_name,
    colValue('col_name, struct('*)).as("col_value")
  )
  .show
The output is
+--------+---------+
|col_name|col_value|
+--------+---------+
| x| 1|
| x| 2|
| y| 30|
+--------+---------+
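Here is a hedged PySpark sketch of strategy (2), building a MapType column from the candidate columns and indexing it with col_name; the toy x/y columns mirror the Scala example above, and a SparkSession named spark is assumed:

from itertools import chain
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("x", 1, None), ("x", 2, 20), ("y", None, 30)],
    "col_name string, x int, y int")

# Map every candidate column name to its value, then look it up by col_name.
value_map = F.create_map(
    *chain.from_iterable((F.lit(c), F.col(c)) for c in ["x", "y"]))

df.select("col_name", value_map[F.col("col_name")].alias("col_value")).show()

Because it avoids a Python UDF, this keeps the lookup inside Catalyst, which is part of why it can be more efficient than (1).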

Is there a way to generate rownumber without converting the dataframe into rdd in pyspark 1.3.1? [duplicate]

I have a very big pyspark.sql.dataframe.DataFrame named df.
I need some way of enumerating records, and thus being able to access a record with a certain index (or select a group of records within an index range).
In pandas, I could just do:
indexes=[2,3,6,7]
df[indexes]
Here I want something similar (without converting the dataframe to pandas).
The closest I can get to is:
Enumerating all the objects in the original dataframe by:
indexes=np.arange(df.count())
df_indexed=df.withColumn('index', indexes)
Searching for values I need using where() function.
QUESTIONS:
Why doesn't it work, and how can I make it work? How do I add a row to a dataframe?
Would it work later to do something like:
indexes=[2,3,6,7]
df1.where("index in indexes").collect()
Any faster and simpler way to deal with it?
It doesn't work because:
the second argument for withColumn should be a Column not a collection. np.array won't work here
when you pass "index in indexes" as a SQL expression to where, indexes is out of scope and is not resolved as a valid identifier
PySpark >= 1.4.0
You can add row numbers using the respective window function and query using the Column.isin method or a properly formatted query string:
from pyspark.sql.functions import col, rowNumber
from pyspark.sql.window import Window
w = Window.orderBy()
indexed = df.withColumn("index", rowNumber().over(w))
# Using DSL
indexed.where(col("index").isin(set(indexes)))
# Using SQL expression
indexed.where("index in ({0})".format(",".join(str(x) for x in indexes)))
It looks like window functions called without a PARTITION BY clause move all data to a single partition, so the above may not be the best solution after all.
Any faster and simpler way to deal with it?
Not really. Spark DataFrames don't support random row access.
A PairedRDD can be accessed using the lookup method, which is relatively fast if the data is partitioned using a HashPartitioner. There is also the indexed-rdd project, which supports efficient lookups.
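A minimal PySpark sketch of the lookup idea (the index key and the partition count of 16 are illustrative):

# Key each row by an index, hash-partition the pairs, and cache them;
# lookup(key) then only scans the partition that can contain the key.
indexed_rdd = (df.rdd
    .zipWithIndex()
    .map(lambda pair: (pair[1], pair[0]))   # (index, Row)
    .partitionBy(16)                        # hash-partitioned by key
    .cache())

indexed_rdd.lookup(2)   # returns a list containing the Row at index 2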
Edit:
Independent of the PySpark version, you can try something like this:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType
row = Row("char")
row_with_index = Row("char", "index")
df = sc.parallelize(row(chr(x)) for x in range(97, 112)).toDF()
df.show(5)
## +----+
## |char|
## +----+
## | a|
## | b|
## | c|
## | d|
## | e|
## +----+
## only showing top 5 rows
# This part is not tested but should work and save some work later
schema = StructType(
    df.schema.fields[:] + [StructField("index", LongType(), False)])

indexed = (df.rdd  # Extract rdd
    .zipWithIndex()  # Add index
    .map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]]))  # Map to rows
    .toDF(schema))  # It will work without schema but will be more expensive
# inSet in Spark < 1.3
indexed.where(col("index").isin(indexes))
If you want a number range that's guaranteed not to collide but does not require a .over(partitionBy()) then you can use monotonicallyIncreasingId().
from pyspark.sql.functions import monotonicallyIncreasingId
df.select(monotonicallyIncreasingId().alias("rowId"),"*")
Note though that the values are not particularly "neat". Each partition is given a value range and the output will not be contiguous. E.g. 0, 1, 2, 8589934592, 8589934593, 8589934594.
This was added to Spark on Apr 28, 2015 here: https://github.com/apache/spark/commit/d94cd1a733d5715792e6c4eac87f0d5c81aebbe2
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("Atr4", monotonically_increasing_id())
If you only need incremental values (like an ID) and if there is no constraint that the numbers need to be consecutive, you could use monotonically_increasing_id(). The only guarantee when using this function is that the values will be increasing for each row; however, the values themselves can differ on each execution.
You can certainly add an array for indexing, indeed an array of your choice:
In Scala, first we need to create an indexing Array:
val index_array=(1 to df.count.toInt).toArray
index_array: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
You can now append this column to your DF. For that, you need to collect the DF as an array, zip it with your index_array, convert the new array back into an RDD, and finally turn it back into a DF:
val final_df = sc.parallelize((df.collect.map(
    x => (x(0), x(1))) zip index_array).map(
    x => (x._1._1.toString, x._1._2.toString, x._2)))
  .toDF("col1", "col2", "index")  // placeholder column names for the 3-tuple
The indexing would be more clear after that.
monotonicallyIncreasingId() - this will assign row numbers in increasing order, but not in sequence.
sample output with 2 columns:
|---------------------|------------------|
| RowNo | Heading 2 |
|---------------------|------------------|
| 1 | xy |
|---------------------|------------------|
| 12 | xz |
|---------------------|------------------|
If you want to assign row numbers, use the following trick.
Tested in Spark 2.0.1 and greater versions.
df.createOrReplaceTempView("df")
dfRowId = spark.sql("select *, row_number() over (partition by 0) as rowNo from df")
sample output with 2 columns:
|---------------------|------------------|
| RowNo | Heading 2 |
|---------------------|------------------|
| 1 | xy |
|---------------------|------------------|
| 2 | xz |
|---------------------|------------------|
Hope this helps.
To select a single row n of a PySpark DataFrame, try:
df.where(df.id == n).show()
Given a Pyspark DataFrame:
df = spark.createDataFrame([(1, 143.5, 5.6, 28, 'M', 100000),\
(2, 167.2, 5.4, 45, 'M', None),\
(3, None , 5.2, None, None, None),\
], ['id', 'weight', 'height', 'age', 'gender', 'income'])
To select the 3rd row, try:
df.where('id == 3').show()
Or:
df.where(df.id == 3).show()
To select multiple rows by their ids (the 2nd and 3rd rows in this case), try:
id = {"2", "3"}
df.where(df.id.isin(id)).show()
