Shifting a PySpark dataframe column by a variable value in another column - apache-spark

I have a PySpark dataframe that looks like this
Date
Value
Shift_Index
2021/02/11
50.12
0
2021/02/12
72.30
4
2021/02/15
81.87
1
2021/02/16
90.12
2
2021/02/17
91.31
1
2021/02/18
81.23
2
2021/02/19
73.45
1
2021/02/22
87.17
0
I want to lead the offset (On the basis of the values in the Shift_Index column here) which I have to pass depends on a particular Column of type Integer.
Can we somehow use an offset value that depends on the column value in lead/lag function in spark SQL ?
I wanted somewhat like this, which works fine in SQL server, but unfortunately throws exception in Spark SQL.
Create table test_table(ID int identity(1,1), Value float, shift_col int, New_Value float)
SELECT Value, shift_col,
ISNULL(LEAD(Value, shift_col) OVER(ORDER BY ID ASC), Value) AS New_Value
FROM test_table
The final result that I need looks something similar to this :
Date
Value
Shift_Index
New_Value
2021/02/11
50.12
0
50.12
2021/02/12
72.30
4
81.23
2021/02/15
81.87
1
90.12
2021/02/16
90.12
2
81.23
2021/02/17
91.31
1
81.23
2021/02/18
81.23
2
87.17
2021/02/19
73.45
1
87.17
2021/02/22
87.17
0
87.17
The following exceptions are encountered
Py4JJavaError: An error occurred while calling o77.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve 'lead(sample_data_temp.shift_col, NULL)' due to data type mismatch: Offset expression 'shift_col#2835' must be a literal
Any help will be really appreciated.
Thanks in advance.

You could do so with a window and lead. If you have very spread out values for Shift_index you could do a select distinct to determine which shifts you need instead of everything up to the max shift.
Ideally you have something to partition your window by otherwise this can be very heavy for large datasets. Spark provides a warning:
WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Edit: solution without join, still not partitioning means this doesn't parallelize well.
from pyspark.sql import functions as f
from pyspark.sql.window import Window
w = Window().orderBy(f.col('Date'))
max_shift = df.agg(f.max('Shift_index')).collect()[0][0]
for shift in range(1, max_shift+1):
df = df.withColumn('Value' + str(shift), f.lead(f.col('Value'), shift).over(w))
case_shift = 'CASE Shift_index WHEN 0 THEN Value ' + ' '.join([f'WHEN {i} THEN Value{i}' for i in range(1, max_shift + 1)]) + ' ELSE NULL END'
df = df.select(
f.col('Date'),
f.col('Shift_index'),
f.col('Value'),
f.expr(case_shift).alias('New_Value')
)
df.show()
+----------+-----+-----------+---------+
| Date|Value|Shift_index|New_Value|
+----------+-----+-----------+---------+
|2021/02/11|50.12| 0| 50.12|
|2021/02/12| 72.3| 4| 81.23|
|2021/02/15|81.87| 1| 90.12|
|2021/02/16|90.12| 2| 81.23|
|2021/02/17|91.31| 1| 81.23|
|2021/02/18|81.23| 2| 87.17|
|2021/02/19|73.45| 1| 87.17|
|2021/02/22|87.17| 0| 87.17|
+----------+-----+-----------+---------+

Related

pivoting a single row dataframe where groupBy can not be applied

I have a dataframe like this:
inputRecordSetCount
inputRecordCount
suspenseRecordCount
166
1216
10
I am trying to make it look like
operation
value
inputRecordSetCount
166
inputRecordCount
1216
suspenseRecordCount
10
I tried pivot, but it needs a groupBy field. I dont have any groupBy field. I found some reference of Stack in Scala. But not sure, how to use it in PySpark. Any help would be appreciated. Thank you.
You can use the stack() operation as mentioned in this tutorial.
Since there are 3 unique values, pass the size, and pair of label and column name:
stack(3, "inputRecordSetCount", inputRecordSetCount, "inputRecordCount", inputRecordCount, "suspenseRecordCount", suspenseRecordCount) as (operation, value)
Full example:
df = spark.createDataFrame(data=[[166,1216,10]], schema=['inputRecordSetCount','inputRecordCount','suspenseRecordCount'])
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (operation, value)"
df = df.selectExpr(exprs)
df.show()
+-------------------+-----+
| operation|value|
+-------------------+-----+
|inputRecordSetCount| 166|
| inputRecordCount| 1216|
|suspenseRecordCount| 10|
+-------------------+-----+

Conditional Counting in Spark SQL

I'm trying to count conditionally on one column, here's my code.
spark.sql(
s"""
|SELECT
| date,
| sea,
| contract,
| project,
| COUNT(CASE WHEN type = 'ABC' THEN 1 ELSE 0 END) AS abc,
| COUNT(CASE WHEN type = 'DEF' THEN 1 ELSE 0 END) AS def,
| COUNT(CASE WHEN type = 'ABC' OR type = 'DEF' OR type = 'GHI' THEN 1 ELSE 0 END) AS all
|FROM someTable
|GROUP BY date, seat, contract, project
""".stripMargin).createOrReplaceTempView("something")
This throws up a weird error.
Diagnostic messages truncated, showing last 65536 chars out of 124764:
What am I doing wrong here?
Any help appreciated.
It seems you want to get count of type = 'ABC', type = 'DEF' etc per grouping condition.
If this is the case then COUNT will not give you desired results but would give same result for each case for a group.
It seems you can use SUM instead of COUNT.
SUM will add all 0 and 1 will give you correct count.
Still if you want to resolve the error you are getting, please paste the error and if possible some data you are using to create data frame.

Spark SQL alternatives to groupby/pivot/agg/collect_list using foldLeft & withColumn so as to improve performance

I have a Spark DataFrame consisting of three columns:
id | col1 | col2
-----------------
x | p1 | a1
-----------------
x | p2 | b1
-----------------
y | p2 | b2
-----------------
y | p2 | b3
-----------------
y | p3 | c1
After applying df.groupBy("id").pivot("col1").agg(collect_list("col2")) I am getting the following dataframe (aggDF):
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x|[a1]| [b1]| []|
| y| []|[b2, b3]|[c1]|
+---+----+--------+----+
Then I find the name of columns except the id column.
val cols = aggDF.columns.filter(x => x != "id")
After that I am using cols.foldLeft(aggDF)((df, x) => df.withColumn(x, when(size(col(x)) > 0, col(x)).otherwise(lit(null)))) to replace empty array with null. The performance of this code becomes poor when the number of columns increases. Additionally, I have the name of string columns val stringColumns = Array("p1","p3"). I want to get the following final dataframe:
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x| a1 | [b1]|null|
| y|null|[b2, b3]| c1 |
+---+----+--------+----+
Is there any better solution to this problem in order to achieve the final dataframe?
You current code pays 2 performance costs as structured:
As mentioned by Alexandros, you pay 1 catalyst analysis per DataFrame transform so if you loop other a few hundreds or thousands columns, you'll notice some time spent on the driver before the job is actually submitted. If this is a critical issue for you, you can use a single select statement instead of your foldLeft on withColumns but this won't really change a lot the execution time because of the next point
When you use an expression such as when().otherwise() on columns in what can be optimized as a single select statement, the code generator will produce a single large method processing all the columns. If you have more than a couple hundred columns, it's likely that the resulting method won't be JIT-compiled by default by the JVM, resulting in very slow execution performance (max JIT-able method is 8k bytecode in Hotspot).
You can detect if you hit the second issue by inspecting the executor logs and check if you see a WARNING on a too large method that can't be JITed.
How to try and solve this ?
1 - Changing the logic
You can filter the empty cells before the pivot by using a window transform
import org.apache.spark.sql.expressions.Window
val finalDf = df
.withColumn("count", count('col2) over Window.partitionBy('id,'col1))
.filter('count > 0)
.groupBy("id").pivot("col1").agg(collect_list("col2"))
This may or may not be faster depending on actual dataset as the pivot also generates a large select statement expression by itself so it may hit the large method threshold if you encounter more than approximately 500 values for col1.
You may want to combine this with option 2 as well.
2 - Try and finesse the JVM
You can add an extraJavaOption on your executors to ask the JVM to try and JIT hot methods larger than 8k.
For example, add the option
--conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods"
on your spark-submit and see how it impacts the pivot execution time.
It's difficult to guarantee a substantial speed increase without more details on your real dataset but it's definitely worth a shot.
If you look at https://medium.com/#manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015 then you see that withColumn with a foldLeft has known performance issues. Select is an alternative, as shown below - using varargs.
Not convinced collect_list is an issue. 1st set of logic I kept as well. pivot kicks off a Job to get distinct values for pivoting. It is an accepted approach imo. Trying to roll your own seems pointless to me, but the other answers may prove me wrong or Spark 2.4 has been improved.
import spark.implicits._
import org.apache.spark.sql.functions._
// Your code & assumig id is only col of interest as in THIS question. More elegant than 1st posting.
val df = Seq( ("x","p1","a1"), ("x","p2","b1"), ("y","p2","b2"), ("y","p2","b3"), ("y","p3","c1")).toDF("id", "col1", "col2")
val aggDF = df.groupBy("id").pivot("col1").agg(collect_list("col2"))
//aggDF.show(false)
val colsToSelect = aggDF.columns // All in this case, 1st col id handled by head & tail
val aggDF2 = aggDF.select((col(colsToSelect.head) +: colsToSelect.tail.map
(col => when(size(aggDF(col)) === 0,lit(null)).otherwise(aggDF(col)).as(s"$col"))):_*)
aggDF2.show(false)
returns:
+---+----+--------+----+
|id |p1 |p2 |p3 |
+---+----+--------+----+
|x |[a1]|[b1] |null|
|y |null|[b2, b3]|[c1]|
+---+----+--------+----+
Also a nice read BTW: https://lansalo.com/2018/05/13/spark-how-to-add-multiple-columns-in-dataframes-and-how-not-to/. The effects become more noticable with a higher number of columns. At the end a reader makes a relevant point.
I think that performance is better with select approach when higher number of columns prevail.
UPD: Over the holidays I trialed both approaches with Spark 2.4.x with little observable difference up to 1000 columns. That has puzzled me.

How to fill out nulls according to another dataframe pyspark

I currently started using pyspark. I have a two columns dataframe with one column containing some nulls, e.g.
df1
A B
1a3b 7
0d4s 12
6w2r null
6w2r null
1p4e null
and another dataframe has the correct mapping, i.e.
df2
A B
1a3b 7
0d4s 12
6w2r 0
1p4e 3
so I want to fill out the nulls in df1 using df2 s.t. the result is:
A B
1a3b 7
0d4s 12
6w2r 0
6w2r 0
1p4e 3
in pandas, I would first create a lookup dictionary from df2 then use apply on the df1 to populate the nulls. But I'm not really sure what functions to use in pyspark, most of replacing nulls I saw is based on simple conditions, for example, filling all the nulls to be a single constant value for certain column.
What I have tried is:
from pyspark.sql.functions import when, col
df1.withColumn('B', when(df.B.isNull(), df2.where(df2.B== df1.B).select('A')))
although I was getting AttributeError: 'DataFrame' object has no attribute '_get_object_id'. The logic is to first filter out the nulls then replace it with the column B's value from df2, but I think df.B.isNull() evaluates the whole column instead of single value, which is probably not the right way to do it, any suggestions?
left join on common column A and selecting appropriate columns should get you your desired output
df1.join(df2, df1.A == df2.A, 'left').select(df1.A, df2.B).show(truncate=False)
which should give you
+----+---+
|A |B |
+----+---+
|6w2r|0 |
|6w2r|0 |
|1a3b|7 |
|1p4e|3 |
|0d4s|12 |
+----+---+

Is there a way to generate rownumber without converting the dataframe into rdd in pyspark 1.3.1? [duplicate]

I have a very big pyspark.sql.dataframe.DataFrame named df.
I need some way of enumerating records- thus, being able to access record with certain index. (or select group of records with indexes range)
In pandas, I could make just
indexes=[2,3,6,7]
df[indexes]
Here I want something similar, (and without converting dataframe to pandas)
The closest I can get to is:
Enumerating all the objects in the original dataframe by:
indexes=np.arange(df.count())
df_indexed=df.withColumn('index', indexes)
Searching for values I need using where() function.
QUESTIONS:
Why it doesn't work and how to make it working? How to add a row to a dataframe?
Would it work later to make something like:
indexes=[2,3,6,7]
df1.where("index in indexes").collect()
Any faster and simpler way to deal with it?
It doesn't work because:
the second argument for withColumn should be a Column not a collection. np.array won't work here
when you pass "index in indexes" as a SQL expression to where indexes is out of scope and it is not resolved as a valid identifier
PySpark >= 1.4.0
You can add row numbers using respective window function and query using Column.isin method or properly formated query string:
from pyspark.sql.functions import col, rowNumber
from pyspark.sql.window import Window
w = Window.orderBy()
indexed = df.withColumn("index", rowNumber().over(w))
# Using DSL
indexed.where(col("index").isin(set(indexes)))
# Using SQL expression
indexed.where("index in ({0})".format(",".join(str(x) for x in indexes)))
It looks like window functions called without PARTITION BY clause move all data to the single partition so above may be not the best solution after all.
Any faster and simpler way to deal with it?
Not really. Spark DataFrames don't support random row access.
PairedRDD can be accessed using lookup method which is relatively fast if data is partitioned using HashPartitioner. There is also indexed-rdd project which supports efficient lookups.
Edit:
Independent of PySpark version you can try something like this:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType
row = Row("char")
row_with_index = Row("char", "index")
df = sc.parallelize(row(chr(x)) for x in range(97, 112)).toDF()
df.show(5)
## +----+
## |char|
## +----+
## | a|
## | b|
## | c|
## | d|
## | e|
## +----+
## only showing top 5 rows
# This part is not tested but should work and save some work later
schema = StructType(
df.schema.fields[:] + [StructField("index", LongType(), False)])
indexed = (df.rdd # Extract rdd
.zipWithIndex() # Add index
.map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]])) # Map to rows
.toDF(schema)) # It will work without schema but will be more expensive
# inSet in Spark < 1.3
indexed.where(col("index").isin(indexes))
If you want a number range that's guaranteed not to collide but does not require a .over(partitionBy()) then you can use monotonicallyIncreasingId().
from pyspark.sql.functions import monotonicallyIncreasingId
df.select(monotonicallyIncreasingId().alias("rowId"),"*")
Note though that the values are not particularly "neat". Each partition is given a value range and the output will not be contiguous. E.g. 0, 1, 2, 8589934592, 8589934593, 8589934594.
This was added to Spark on Apr 28, 2015 here: https://github.com/apache/spark/commit/d94cd1a733d5715792e6c4eac87f0d5c81aebbe2
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("Atr4", monotonically_increasing_id())
If you only need incremental values (like an ID) and if there is no
constraint that the numbers need to be consecutive, you could use
monotonically_increasing_id(). The only guarantee when using this
function is that the values will be increasing for each row, however,
the values themself can differ each execution.
You certainly can add an array for indexing, an array of your choice indeed:
In Scala, first we need to create an indexing Array:
val index_array=(1 to df.count.toInt).toArray
index_array: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
You can now append this column to your DF. First, For that, you need to open up our DF and get it as an array, then zip it with your index_array and then we convert the new array back into and RDD. The final step is to get it as a DF:
final_df = sc.parallelize((df.collect.map(
x=>(x(0),x(1))) zip index_array).map(
x=>(x._1._1.toString,x._1._2.toString,x._2))).
toDF("column_name")
The indexing would be more clear after that.
monotonicallyIncreasingId() - this will assign row numbers in incresing order but not in sequence.
sample output with 2 columns:
|---------------------|------------------|
| RowNo | Heading 2 |
|---------------------|------------------|
| 1 | xy |
|---------------------|------------------|
| 12 | xz |
|---------------------|------------------|
If you want assign row numbers use following trick.
Tested in spark-2.0.1 and greater versions.
df.createOrReplaceTempView("df")
dfRowId = spark.sql("select *, row_number() over (partition by 0) as rowNo from df")
sample output with 2 columns:
|---------------------|------------------|
| RowNo | Heading 2 |
|---------------------|------------------|
| 1 | xy |
|---------------------|------------------|
| 2 | xz |
|---------------------|------------------|
Hope this helps.
Selecting a single row n of a Pyspark DataFrame, try:
df.where(df.id == n).show()
Given a Pyspark DataFrame:
df = spark.createDataFrame([(1, 143.5, 5.6, 28, 'M', 100000),\
(2, 167.2, 5.4, 45, 'M', None),\
(3, None , 5.2, None, None, None),\
], ['id', 'weight', 'height', 'age', 'gender', 'income'])
Selecting the 3rd row, try:
df.where('id == 3').show()
Or:
df.where(df.id == 3).show()
Selecting multiple rows with rows' ids (the 2nd & the 3rd rows in this case), try:
id = {"2", "3"}
df.where(df.id.isin(id)).show()

Resources