Reorder PySpark dataframe columns on specific sort logic - apache-spark

I have a PySpark dataframe with the below column order. I need to reorder the columns by branch. How do I do it? df.select(sorted(df.columns)) doesn't give the order I want.
Existing column order:
store_id,
store_name,
month_1_branch_A_profit,
month_1_branch_B_profit,
month_1_branch_C_profit,
month_1_branch_D_profit,
month_2_branch_A_profit,
month_2_branch_B_profit,
month_2_branch_C_profit,
month_2_branch_D_profit,
.
.
month_12_branch_A_profit,
month_12_branch_B_profit,
month_12_branch_C_profit,
month_12_branch_D_profit
Desired column order:
store_id,
store_name,
month_1_branch_A_profit,
month_2_branch_A_profit,
month_3_branch_A_profit,
month_4_branch_A_profit,
.
.
month_12_branch_A_profit,
month_1_branch_B_profit,
month_2_branch_B_profit,
month_3_branch_B_profit,
.
.
month_12_branch_B_profit,
..

You could manually build your list of columns.
col_fmt = 'month_{}_branch_{}_profit'
cols = ['store_id', 'store_name']
for branch in ['A', 'B', 'C', 'D']:
    for i in range(1, 13):
        cols.append(col_fmt.format(i, branch))
df.select(cols)
Alternatively, I'd recommend building a better dataframe that takes advantage of array + struct/map datatypes. E.g.
months: array (size 12)
  - branches: map<string, struct>
    - key: string (branch name)
    - value: struct
      - profit: float
This way, the array is already "sorted" by month. Map order doesn't really matter, and it makes SQL queries targeting specific months and branches easier to read (and probably faster, thanks to predicate pushdown).
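A rough sketch of how that nested layout could be built from the existing flat columns (the field names and the query at the end are just illustrative, not the only possible design):
import pyspark.sql.functions as F
# Collapse the flat month_X_branch_Y_profit columns into an array of 12 entries,
# each a map from branch name to a struct of metrics.
branches = ['A', 'B', 'C', 'D']
months_col = F.array(*[
    F.create_map(*[
        x
        for b in branches
        for x in (F.lit(b), F.struct(F.col(f'month_{m}_branch_{b}_profit').alias('profit')))
    ])
    for m in range(1, 13)
])
nested_df = df.select('store_id', 'store_name', months_col.alias('months'))
# Reading month 3, branch B profit then becomes (arrays are 0-indexed):
nested_df.select(F.col('months')[2]['B']['profit'].alias('month_3_branch_B_profit'))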

You may need some Python code. In the following script I split the column names on the underscore _ and then sort by elements [3] (branch name) and [1] (month value).
Input df:
cols = ['store_id',
        'store_name',
        'month_1_branch_A_profit',
        'month_1_branch_B_profit',
        'month_1_branch_C_profit',
        'month_1_branch_D_profit',
        'month_2_branch_A_profit',
        'month_2_branch_B_profit',
        'month_2_branch_C_profit',
        'month_2_branch_D_profit',
        'month_12_branch_A_profit',
        'month_12_branch_B_profit',
        'month_12_branch_C_profit',
        'month_12_branch_D_profit']
df = spark.createDataFrame([], ','.join([f'{c} int' for c in cols]))
Script:
branch_cols = [c for c in df.columns if c not in {'store_id', 'store_name'}]
# map the split name, e.g. ('month', '1', 'branch', 'A', 'profit'), to the original column name
d = {tuple(c.split('_')): c for c in branch_cols}
df = df.select(
    'store_id', 'store_name',
    # sort key: branch letter first, then zero-padded month number
    *[d[c] for c in sorted(d, key=lambda x: f'{x[3]}_{int(x[1]):02}')]
)
df.printSchema()
# root
# |-- store_id: integer (nullable = true)
# |-- store_name: integer (nullable = true)
# |-- month_1_branch_A_profit: integer (nullable = true)
# |-- month_2_branch_A_profit: integer (nullable = true)
# |-- month_12_branch_A_profit: integer (nullable = true)
# |-- month_1_branch_B_profit: integer (nullable = true)
# |-- month_2_branch_B_profit: integer (nullable = true)
# |-- month_12_branch_B_profit: integer (nullable = true)
# |-- month_1_branch_C_profit: integer (nullable = true)
# |-- month_2_branch_C_profit: integer (nullable = true)
# |-- month_12_branch_C_profit: integer (nullable = true)
# |-- month_1_branch_D_profit: integer (nullable = true)
# |-- month_2_branch_D_profit: integer (nullable = true)
# |-- month_12_branch_D_profit: integer (nullable = true)

Related

How to convert DataFrame columns from struct<value:double> to struct<values:array<double>> in pyspark?

I have a DataFrame with this structure:
root
|-- features: struct (nullable = true)
|    |-- value: double (nullable = true)
and I want to convert the double-typed value into an array-typed values field.
How can I do that?
You can specify the conversion explicitly using struct and array:
import pyspark.sql.functions as F
df.printSchema()
# root
# |-- features: struct (nullable = false)
# |    |-- value: double (nullable = false)
df2 = df.withColumn(
    'features',
    F.struct(
        F.array(F.col('features')['value']).alias('values')
    )
)
df2.printSchema()
# root
# |-- features: struct (nullable = false)
# |    |-- values: array (nullable = false)
# |    |    |-- element: double (containsNull = false)
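As a quick check with made-up sample data (the two rows below are purely illustrative), the original double ends up wrapped in a single-element array:
import pyspark.sql.functions as F
# hypothetical input mirroring the struct<value: double> layout
df = spark.createDataFrame([(1.0,), (2.5,)], 'value double') \
    .select(F.struct(F.col('value')).alias('features'))
df2 = df.withColumn(
    'features',
    F.struct(F.array(F.col('features')['value']).alias('values'))
)
df2.select('features.values').show()
# roughly:
# +------+
# |values|
# +------+
# | [1.0]|
# | [2.5]|
# +------+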

Pyspark Cannot modify a column based on a condition when a column values are in other list

I am using PySpark 3.0.1.
I want to modify the value of a column when another column's value is in a list.
df.printSchema()
root
|-- ID: decimal(4,0) (nullable = true)
|-- Provider: string (nullable = true)
|-- Principal: float (nullable = false)
|-- PRINCIPALBALANCE: float (nullable = true)
|-- STATUS: integer (nullable = true)
|-- Installment Rate: float (nullable = true)
|-- Yearly Percentage: float (nullable = true)
|-- Processing Fee Percentage: double (nullable = true)
|-- Disb Date: string (nullable = true)
|-- ZOHOID: integer (nullable = true)
|-- UPFRONTPROCESSINGFEEBALANCE: float (nullable = true)
|-- WITHHOLDINGTAXBALANCE: float (nullable = true)
|-- UPFRONTPROCESSINGFEEPERCENTAGE: float (nullable = true)
|-- UPFRONTPROCESSINGFEEWHTPERCENTAGE: float (nullable = true)
|-- PROCESSINGFEEWHTPERCENTAGE: float (nullable = true)
|-- PROCESSINGFEEVATPERCENTAGE: float (nullable = true)
|-- BUSINESSSHORTCODE: string (nullable = true)
|-- EXCTRACTIONDATE: timestamp (nullable = true)
|-- fake Fee: double (nullable = false)
|-- fake WHT: string (nullable = true)
|-- fake Fee_WHT: string (nullable = true)
|-- Agency Fee CP: string (nullable = true)
|-- Agency VAT CP: string (nullable = true)
|-- Agency WHT CP: string (nullable = true)
|-- Agency Fee_VAT_WHT CP: string (nullable = true)
|-- write_offs: integer (nullable = false)
df.head(1)
[Row(ID=Decimal('16'), Provider='fake', Principal=2000.01, PRINCIPALBALANCE=0.2, STATUS=4, Installment Rate=0.33333333, Yearly Percentage=600.0, Processing Fee Percentage=0.20, Disb Date=None, ZOHOID=3000, UPFRONTPROCESSINGFEEBALANCE=None, WITHHOLDINGTAXBALANCE=None, UPFRONTPROCESSINGFEEPERCENTAGE=None, UPFRONTPROCESSINGFEEWHTPERCENTAGE=None, PROCESSINGFEEWHTPERCENTAGE=None, PROCESSINGFEEVATPERCENTAGE=16.0, BUSINESSSHORTCODE='20005', EXCTRACTIONDATE=datetime.datetime(2020, 11, 25, 5, 7, 58, 6000), fake Fee=1770.7, fake WHT='312.48', fake Fee_WHT='2,083.18', Agency Fee CP='566.62', Agency VAT CP='566.62', Agency WHT CP='186.39', Agency Fee_VAT_WHT CP='5,394.41')]
The value of the column 'write_offs' is 0 for all rows, and I want to set it to 1 if the column ID is in the following list: list1 = [299, 570, 73, 401]
Then I am doing:
df.withColumn('write_offs', when((df.filter(df['ID'].isin(list1))),1).otherwise(df['ID']))
and I am getting this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-de9f9cd49ea5> in <module>
----> 1 df.withColumn('write_offs', when((df.filter(df['ID'].isin(userinput_write_offs_ids))),lit(1)).otherwise(df['ID']))
/usr/local/spark/python/pyspark/sql/functions.py in when(condition, value)
789 sc = SparkContext._active_spark_context
790 if not isinstance(condition, Column):
--> 791 raise TypeError("condition should be a Column")
792 v = value._jc if isinstance(value, Column) else value
793 jc = sc._jvm.functions.when(condition._jc, v)
TypeError: condition should be a Column
I don't know why it's giving this error, because I did a similar operation where the condition returned a dataframe and it worked.
I read how to use this isin function here:
Pyspark isin function
You need a Boolean column as the condition for when, not a dataframe:
import pyspark.sql.functions as F
df.withColumn(
    'write_offs',
    F.when(F.col('ID').isin(list1), 1)
     .otherwise(F.col('ID'))
)
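Note that .otherwise(F.col('ID')) just mirrors the code in the question; if the intent is to leave the existing write_offs value (0) untouched for the other rows, the otherwise branch would presumably reference that column instead:
df.withColumn(
    'write_offs',
    F.when(F.col('ID').isin(list1), 1).otherwise(F.col('write_offs'))
)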

Convert DataFrame Format

I have my dataframe in the below format:
|-- id: string (nullable = true)
|-- epoch: string (nullable = true)
|-- data: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)
and I want to convert it so that each map entry becomes its own row:
|-- id: string (nullable = true)
|-- epoch: string (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Example:
From:
1,12345, [pq -> r, ab -> c]
To:
1,12345, pq ,r
1,12345, ab ,c
I am trying this code but it doesn't work:
val array2Df = array1Df.flatMap(line =>
  line.getMap[String, String](2).map(
    (line.getString(0), line.getString(1), _)
  ))
Try the following:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{MapType, StringType, StructType}

val arrayData = Seq(
  Row("1", "epoch_1", Map("epoch_1_key1" -> "epoch_1_val1", "epoch_1_key2" -> "epoch_1_Val2")),
  Row("2", "epoch_2", Map("epoch_2_key1" -> "epoch_2_val1", "epoch_2_key2" -> "epoch_2_Val2"))
)
val arraySchema = new StructType()
  .add("Id", StringType)
  .add("epoch", StringType)
  .add("data", MapType(StringType, StringType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData), arraySchema)
df.printSchema()
df.show(false)
After that you need to explode based on the data column. Don't forget to
import org.apache.spark.sql.functions.explode
df.select($"Id", $"epoch", explode($"data")).show(false)
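For comparison, a minimal PySpark sketch of the same map explode (with made-up data matching the question's example) could look like this:
import pyspark.sql.functions as F
df = spark.createDataFrame(
    [('1', '12345', {'pq': 'r', 'ab': 'c'})],
    'id string, epoch string, data map<string,string>',
)
# exploding a map column yields one row per entry, with default columns key and value
df.select('id', 'epoch', F.explode('data')).show()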

Aggregate one column, but show all columns in select

I am trying to show the maximum value of a column while grouping rows by a date column.
So I tried this code:
maxVal = dfSelect.select('*')\
    .groupBy('DATE')\
    .agg(max('CLOSE'))
But the output looks like this:
+----------+----------+
| DATE|max(CLOSE)|
+----------+----------+
|1987-05-08| 43.51|
|1987-05-29| 39.061|
+----------+----------+
I want the output to look like below:
+------+---+----------+------+------+------+------+------+---+----------+
|TICKER|PER| DATE| TIME| OPEN| HIGH| LOW| CLOSE|VOL|max(CLOSE)|
+------+---+----------+------+------+------+------+------+---+----------+
| CDG| D|1987-01-02|000000|50.666|51.441|49.896|50.666| 0| 50.666|
| ABC| D|1987-01-05|000000|51.441| 52.02|51.441|51.441| 0| 51.441|
+------+---+----------+------+------+------+------+------+---+----------+
So my question is: how do I change the code to get output with all the columns plus the aggregated 'CLOSE' column?
The schema of my data looks like below:
root
|-- TICKER: string (nullable = true)
|-- PER: string (nullable = true)
|-- DATE: date (nullable = true)
|-- TIME: string (nullable = true)
|-- OPEN: float (nullable = true)
|-- HIGH: float (nullable = true)
|-- LOW: float (nullable = true)
|-- CLOSE: float (nullable = true)
|-- VOL: integer (nullable = true)
|-- OPENINT: string (nullable = true)
If you want the same aggregation for all the columns in the original dataframe, then you can do something like this:
import pyspark.sql.functions as F
expr = [F.max(coln).alias(coln) for coln in df.columns if 'date' not in coln]  # df is your dataframe
df_res = df.groupby('date').agg(*expr)
If you want multiple aggregations, then you can do something like this:
sub_col1 = [...]  # define your first subset of columns
sub_col2 = [...]  # define your second subset of columns
expr1 = [F.max(coln).alias(coln) for coln in sub_col1 if 'date' not in coln]
expr2 = [F.first(coln).alias(coln) for coln in sub_col2 if 'date' not in coln]
expr=expr1+expr2
df_res = df.groupby('date').agg(*expr)
If you want only one of the columns aggregated and added to your original dataframe, then you can do a self-join after aggregating:
df_agg = df.groupby('date').agg(F.max('close').alias('close_agg')).withColumn("dummy", F.lit("dummy"))  # dummy column is a workaround for Spark self-join issues
df_join = df.join(df_agg,on='date',how='left')
Or you can use a window function:
from pyspark.sql import Window
w = Window.partitionBy('date')
df_res = df.withColumn("max_close",F.max('close').over(w))
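For illustration, a minimal self-contained sketch of the window approach (with made-up sample data covering only a few of the question's columns) might look like this:
import pyspark.sql.functions as F
from pyspark.sql import Window

# hypothetical sample rows
df = spark.createDataFrame(
    [('CDG', '1987-01-02', 50.666),
     ('CDG', '1987-01-02', 49.896),
     ('ABC', '1987-01-05', 51.441)],
    ['TICKER', 'DATE', 'CLOSE'],
)
w = Window.partitionBy('DATE')
# all original columns are kept, plus the per-DATE maximum of CLOSE
df.withColumn('max_close', F.max('CLOSE').over(w)).show()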

Incorrect nullability of column after saving pyspark dataframe

When saving a PySpark dataframe with a new column added via the withColumn function, the nullability changes from false to true.
Version info: Python 3.7.3 / Spark 2.4.0-cdh6.1.1
>>> l = [('Alice', 1)]
>>> df = spark.createDataFrame(l)
>>> df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
>>> from pyspark.sql.functions import lit
>>> df = df.withColumn('newCol', lit('newVal'))
>>> df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
|-- newCol: string (nullable = false)
>>> df.write.saveAsTable('default.withcolTest', mode='overwrite')
>>> spark.sql("select * from default.withcolTest").printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
|-- newCol: string (nullable = true)
Why does the nullable flag of the column newCol, added with the withColumn function, change when the dataframe is persisted?
