How to add a column to an exploded struct in Spark? - apache-spark

Say I have the following data:
{"id":1, "payload":[{"foo":1, "lol":2},{"foo":2, "lol":2}]}
I would like to explode the payload and add a column to it, like this:
df = df.select('id', F.explode('payload').alias('data'))
df = df.withColumn('data.bar', F.col('data.foo') * 2)
However this results in a dataframe with three columns:
id
data
data.bar
I expected the data.bar to be part of the data struct...
How can I add a column to the exploded struct, instead of adding a top-level column?

df = df.withColumn('data', f.struct(
    df['data']['foo'].alias('foo'),
    (df['data']['foo'] * 2).alias('bar')
))
This will result in:
root
|-- id: long (nullable = true)
|-- data: struct (nullable = false)
| |-- foo: long (nullable = true)
| |-- bar: long (nullable = true)
UPDATE:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType

def func(x):
    tmp = x.asDict()
    tmp['foo'] = tmp.get('foo', 0) * 100
    res = list(zip(*tmp.items()))  # [(field names...), (values...)]
    return Row(*res[0])(*res[1])

df = df.withColumn('data', f.udf(func, StructType(
    [StructField('foo', LongType()), StructField('lol', LongType())]))(df['data']))
P.S.
Spark generally does not support in-place operations, so whenever you want to modify a column "in place" you actually have to replace it.
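On Spark 3.1+ the same result can also be obtained without rebuilding the whole struct, using Column.withField; a minimal sketch, assuming the id/payload/data names from the question:
import pyspark.sql.functions as F

df = df.select('id', F.explode('payload').alias('data'))
# withField adds (or replaces) a single field inside an existing struct column
df = df.withColumn('data', F.col('data').withField('bar', F.col('data.foo') * 2))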

Related

Reorder PySpark dataframe columns on specific sort logic

I have a PySpark dataframe with the below column order. I need to order it as per the 'branch'. How do I do it? df.select(sorted(df.columns)) doesn't seem to work the way I want.
Existing column order:
store_id,
store_name,
month_1_branch_A_profit,
month_1_branch_B_profit,
month_1_branch_C_profit,
month_1_branch_D_profit,
month_2_branch_A_profit,
month_2_branch_B_profit,
month_2_branch_C_profit,
month_2_branch_D_profit,
.
.
month_12_branch_A_profit,
month_12_branch_B_profit,
month_12_branch_C_profit,
month_12_branch_D_profit
Desired column order:
store_id,
store_name,
month_1_branch_A_profit,
month_2_branch_A_profit,
month_3_branch_A_profit,
month_4_branch_A_profit,
.
.
month_12_branch_A_profit,
month_1_branch_B_profit,
month_2_branch_B_profit,
month_3_branch_B_profit,
.
.
month_12_branch_B_profit,
..
You could manually build your list of columns.
col_fmt = 'month_{}_branch_{}_profit'
cols = ['store_id', 'store_name']
for branch in ['A', 'B', 'C', 'D']:
    for i in range(1, 13):
        cols.append(col_fmt.format(i, branch))
df.select(cols)
Alternatively, I'd recommend building a better dataframe that takes advantage of array + struct/map datatypes. E.g.
months - array (size 12)
  - branches: map<string, struct>
      - key: string (branch name)
      - value: struct
          - profit: float
This way, arrays would already be "sorted". Map order doesn't really matter, and it makes SQL queries specific to certain months and branches easier to read (and probably faster with predicate pushdowns)
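A minimal sketch of how such a schema could be declared explicitly (the field names here are illustrative, not taken from the question):
from pyspark.sql.types import (ArrayType, FloatType, MapType,
                               StringType, StructField, StructType)

# one array element per month; each element maps branch name -> a struct of metrics
schema = StructType([
    StructField('store_id', StringType()),
    StructField('store_name', StringType()),
    StructField('months', ArrayType(
        MapType(StringType(), StructType([
            StructField('profit', FloatType()),
        ]))
    )),
])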
You may need some plain Python. In the following script I split the column names on underscore _ and then sorted them by element [3] (branch name) and element [1] (month value).
Input df:
cols = [
    'store_id',
    'store_name',
    'month_1_branch_A_profit',
    'month_1_branch_B_profit',
    'month_1_branch_C_profit',
    'month_1_branch_D_profit',
    'month_2_branch_A_profit',
    'month_2_branch_B_profit',
    'month_2_branch_C_profit',
    'month_2_branch_D_profit',
    'month_12_branch_A_profit',
    'month_12_branch_B_profit',
    'month_12_branch_C_profit',
    'month_12_branch_D_profit',
]
df = spark.createDataFrame([], ','.join([f'{c} int' for c in cols]))
Script:
branch_cols = [c for c in df.columns if c not in {'store_id', 'store_name'}]
d = {tuple(c.split('_')): c for c in branch_cols}
df = df.select(
    'store_id', 'store_name',
    *[d[c] for c in sorted(d, key=lambda x: f'{x[3]}_{int(x[1]):02}')]
)
df.printSchema()
# root
# |-- store_id: integer (nullable = true)
# |-- store_name: integer (nullable = true)
# |-- month_1_branch_A_profit: integer (nullable = true)
# |-- month_2_branch_A_profit: integer (nullable = true)
# |-- month_12_branch_A_profit: integer (nullable = true)
# |-- month_1_branch_B_profit: integer (nullable = true)
# |-- month_2_branch_B_profit: integer (nullable = true)
# |-- month_12_branch_B_profit: integer (nullable = true)
# |-- month_1_branch_C_profit: integer (nullable = true)
# |-- month_2_branch_C_profit: integer (nullable = true)
# |-- month_12_branch_C_profit: integer (nullable = true)
# |-- month_1_branch_D_profit: integer (nullable = true)
# |-- month_2_branch_D_profit: integer (nullable = true)
# |-- month_12_branch_D_profit: integer (nullable = true)

UDF to cast a map<bigint,struct<in1:bigint,in2:string>> column to add more fields to inner struct

I have a Hive table which, when read into Spark as spark.table(<table_name>), has the structure below:
scala> df.printSchema
root
|-- id: long (nullable = true)
|-- info: map (nullable = true)
| |-- key: long
| |-- value: struct (valueContainsNull = true)
| | |-- in1: long (nullable = true)
| | |-- in2: string (nullable = true)
I want to cast the map column to add more fields to the inner struct, e.g. in3, in4;
in this example: map<bigint,struct<in1:bigint,in2:string,in3:decimal(18,5),in4:string>>
I have tried a normal cast but that doesn't work, so I am checking whether I can achieve this through a UDF.
I will assign defaults to the new fields, like 0 for the decimal and "" for the string.
Below is what I tried, but I can't get it to work. Can anyone please suggest how to achieve this?
val origStructType = new StructType().add("in1", LongType, nullable = true).add("in2", StringType, nullable = true)
val newStructType = origStructType.add("in1", LongType, nullable = true).add("in2", StringType, nullable = true).add("in3", DecimalType(18,5), nullable = true).add("in4", StringType, nullable = true)
val newColSchema = MapType(LongType, newStructType)
val m = Map(101L->(101L,"val2"),102L->(102L,"val3"))
val df = Seq((100L,m)).toDF("id","info")
val typeUDFNewRet = udf((col1: Map[Long, Seq[(Long, String)]]) => {
  col1.mapValues(v => Seq(v(0), v(1), null, "")) // Forced to use null here for another issue
}, newColSchema)
spark.udf.register("typeUDFNewRet",typeUDFNewRet)
df.registerTempTable("op1")
val df2 = spark.sql("select id, typeUDFNewRet(info) from op1")
scala> val df2 = spark.sql("select id, typeUDFNewRet(info) from op1")
df2: org.apache.spark.sql.DataFrame = [id: bigint, UDF(info): map<bigint,struct<in1:bigint,in2:string,in1:bigint,in2:string,in3:decimal(18,5),in4:string>>]
scala> df2.show(false)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.collection.Seq
at $anonfun$1$$anonfun$apply$1.apply(<console>:43)
at scala.collection.MapLike$MappedValues$$anonfun$iterato
I have also tried returning a Row using this answer, but that gives a different issue.
Try this:
val origStructType = new StructType().add("in1", LongType, nullable = true).add("in2", StringType, nullable = true)
val newStructType = origStructType.add("in3", DecimalType(18,5), nullable = true).add("in4", StringType, nullable = true)
val newColSchema = MapType(LongType, newStructType)
val m = Map(101L->(101L,"val2"),102L->(102L,"val3"))
val df = Seq((100L,m)).toDF("id","info")
df.show(false)
df.printSchema()
val typeUDFNewRet = udf((col1: Map[Long, Row]) => {
  col1.mapValues(r => Row.merge(r, Row(null, ""))) // Forced to use null here for another issue
}, newColSchema)
spark.udf.register("typeUDFNewRet",typeUDFNewRet)
df.registerTempTable("op1")
val df2 = spark.sql("select id, typeUDFNewRet(info) from op1")
df2.show(false)
df2.printSchema()
/**
* +---+----------------------------------------------+
* |id |UDF(info) |
* +---+----------------------------------------------+
* |100|[101 -> [101, val2,, ], 102 -> [102, val3,, ]]|
* +---+----------------------------------------------+
*
* root
* |-- id: long (nullable = false)
* |-- UDF(info): map (nullable = true)
* | |-- key: long
* | |-- value: struct (valueContainsNull = true)
* | | |-- in1: long (nullable = true)
* | | |-- in2: string (nullable = true)
* | | |-- in3: decimal(18,5) (nullable = true)
* | | |-- in4: string (nullable = true)
*/
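As a side note, on Spark 3.1+ the same widening can be done without a UDF at all; a minimal PySpark sketch (assuming the id/info column names from the question) using transform_values together with withField:
import pyspark.sql.functions as F

df2 = df.withColumn('info', F.transform_values(
    'info',
    # extend each struct value with the new fields and their defaults
    lambda k, v: v.withField('in3', F.lit(0).cast('decimal(18,5)'))
                  .withField('in4', F.lit(''))
))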

Aggregate one column, but show all columns in select

I am trying to show the maximum value of a column while grouping rows by a date column, so I tried this code:
maxVal = dfSelect.select('*')\
.groupBy('DATE')\
.agg(max('CLOSE'))
But the output looks like this:
+----------+----------+
| DATE|max(CLOSE)|
+----------+----------+
|1987-05-08| 43.51|
|1987-05-29| 39.061|
+----------+----------+
I want the output to look like the one below:
+------+---+----------+------+------+------+------+------+---+----------+
|TICKER|PER| DATE| TIME| OPEN| HIGH| LOW| CLOSE|VOL|max(CLOSE)|
+------+---+----------+------+------+------+------+------+---+----------+
| CDG| D|1987-01-02|000000|50.666|51.441|49.896|50.666| 0| 50.666|
| ABC| D|1987-01-05|000000|51.441| 52.02|51.441|51.441| 0| 51.441|
+------+---+----------+------+------+------+------+------+---+----------+
So my question is: how do I change the code so that the output keeps all columns plus the aggregated 'CLOSE' column?
The schema of my data looks like this:
root
|-- TICKER: string (nullable = true)
|-- PER: string (nullable = true)
|-- DATE: date (nullable = true)
|-- TIME: string (nullable = true)
|-- OPEN: float (nullable = true)
|-- HIGH: float (nullable = true)
|-- LOW: float (nullable = true)
|-- CLOSE: float (nullable = true)
|-- VOL: integer (nullable = true)
|-- OPENINT: string (nullable = true)
If you want the same aggregation for all of the columns in the original dataframe, you can do something like:
import pyspark.sql.functions as F
expr = [F.max(coln).alias(coln) for coln in df.columns if 'date' not in coln]  # df is your dataframe
df_res = df.groupby('date').agg(*expr)
If you want multiple aggregations, you can do something like:
sub_col1 = ...  # define
sub_col2 = ...  # define
expr1 = [F.max(coln).alias(coln) for coln in sub_col1 if 'date' not in coln]
expr2 = [F.first(coln).alias(coln) for coln in sub_col2 if 'date' not in coln]
expr = expr1 + expr2
df_res = df.groupby('date').agg(*expr)
If you want only one of the columns aggregated and added to your original dataframe, you can do a self-join after aggregating:
df_agg = df.groupby('date').agg(F.max('close').alias('close_agg')).withColumn("dummy", F.lit("dummy"))  # the dummy column is a workaround for Spark self-join issues
df_join = df.join(df_agg, on='date', how='left')
Or you can use a window function:
from pyspark.sql import Window
w = Window.partitionBy('date')
df_res = df.withColumn("max_close",F.max('close').over(w))
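Applied to the column names from the question, the window version would look roughly like this (a sketch, assuming dfSelect is the original dataframe):
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.partitionBy('DATE')
# every original column is kept; max(CLOSE) is repeated on each row of its DATE group
result = dfSelect.withColumn('max(CLOSE)', F.max('CLOSE').over(w))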

Pyspark : Pass dynamic Column in UDF

I am trying to pass a list of columns one by one to a UDF using a for loop, but I get an error saying the dataframe cannot find col_name. Currently the list list_col has two columns, but it can change, so I want to write code that works for any list of columns. In this code I concatenate one row of a column at a time; the row value is in struct format, i.e. a list inside a list. For every null I have to insert a space.
list_col=['pcxreport','crosslinediscount']
def struct_generater12(row):
    list3 = []
    main_str = ''
    if row is None:
        list3.append(' ')
    else:
        for i in row:
            temp = ''
            if i is None:
                temp += ' '
            else:
                for j in i:
                    if j is None:
                        temp += ' '
                    else:
                        temp += str(j)
            list3.append(temp)
    for k in list3:
        main_str += k
    return main_str

A = udf(struct_generater12, returnType=StringType())
# z = addlinterestdetail_FDF1.withColumn("Concated_pcxreport", A(addlinterestdetail_FDF1.pcxreport))
for i in range(0, len(list_col) - 1):
    struct_col = 'Concate_'
    struct_col += list_col[i]
    col_name = list_col[i]
    z = addlinterestdetail_FDF1.withColumn(struct_col, A(addlinterestdetail_FDF1.col_name))
    struct_col = ''
    z.show()
addlinterestdetail_FDF1.col_name implies the column is literally named "col_name"; you're not accessing the string contained in the variable col_name.
When calling a UDF on a column, you can either
use its string name directly: A(col_name)
or use the pyspark sql function col:
import pyspark.sql.functions as psf
z = addlinterestdetail_FDF1.withColumn(struct_col,A(psf.col(col_name)))
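Putting that together, the loop from the question could be written along these lines (a sketch; iterating over list_col directly also avoids the range(0, len(list_col)-1) off-by-one that skips the last column):
import pyspark.sql.functions as psf

for col_name in list_col:
    struct_col = 'Concate_' + col_name
    # pass the actual column object, not an attribute literally named "col_name"
    z = addlinterestdetail_FDF1.withColumn(struct_col, A(psf.col(col_name)))
    z.show()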
You should consider using pyspark sql functions for concatenation instead of writing a UDF. First let's create a sample dataframe with nested structures:
import json
j = {'pcxreport':{'a': 'a', 'b': 'b'}, 'crosslinediscount':{'c': 'c', 'd': None, 'e': 'e'}}
jsonRDD = sc.parallelize([json.dumps(j)])
df = spark.read.json(jsonRDD)
df.printSchema()
df.show()
root
|-- crosslinediscount: struct (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- e: string (nullable = true)
|-- pcxreport: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
+-----------------+---------+
|crosslinediscount|pcxreport|
+-----------------+---------+
| [c,null,e]| [a,b]|
+-----------------+---------+
We'll write a dictionary with nested column names:
list_col=['pcxreport','crosslinediscount']
list_subcols = dict()
for c in list_col:
    list_subcols[c] = df.select(c + '.*').columns
Now we can "flatten" the StructType, replace None with ' ', and concatenate:
import itertools
import pyspark.sql.functions as psf
df.select([c + '.*' for c in list_col])\
    .na.fill({c: ' ' for c in list(itertools.chain.from_iterable(list_subcols.values()))})\
    .select([psf.concat(*sc).alias(c) for c, sc in list_subcols.items()])\
    .show()
+---------+-----------------+
|pcxreport|crosslinediscount|
+---------+-----------------+
| ab| c e|
+---------+-----------------+

How to convert Timestamp to Date format in DataFrame?

I have a DataFrame with a Timestamp column, which I need to convert to Date format.
Are there any Spark SQL functions available for this?
You can cast the column to date:
Scala:
import org.apache.spark.sql.types.DateType
val newDF = df.withColumn("dateColumn", df("timestampColumn").cast(DateType))
Pyspark:
df = df.withColumn('dateColumn', df['timestampColumn'].cast('date'))
In SparkSQL:
SELECT
CAST(the_ts AS DATE) AS the_date
FROM the_table
Imagine the following input:
val dataIn = spark.createDataFrame(Seq(
(1, "some data"),
(2, "more data")))
.toDF("id", "stuff")
.withColumn("ts", current_timestamp())
dataIn.printSchema
root
|-- id: integer (nullable = false)
|-- stuff: string (nullable = true)
|-- ts: timestamp (nullable = false)
You can use the to_date function:
val dataOut = dataIn.withColumn("date", to_date($"ts"))
dataOut.printSchema
root
|-- id: integer (nullable = false)
|-- stuff: string (nullable = true)
|-- ts: timestamp (nullable = false)
|-- date: date (nullable = false)
dataOut.show(false)
+---+---------+-----------------------+----------+
|id |stuff |ts |date |
+---+---------+-----------------------+----------+
|1 |some data|2017-11-21 16:37:15.828|2017-11-21|
|2 |more data|2017-11-21 16:37:15.828|2017-11-21|
+---+---------+-----------------------+----------+
I would recommend preferring these methods over casting and plain SQL.
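For completeness, the PySpark equivalent of the to_date approach would look something like this (a sketch, assuming a dataframe df with a timestamp column named ts):
import pyspark.sql.functions as F

df = df.withColumn('date', F.to_date('ts'))  # truncates the timestamp to a date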
For Spark 2.4+:
import spark.implicits._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DateType
val newDF = df.withColumn("dateColumn", $"timestampColumn".cast(DateType))
OR
val newDF = df.withColumn("dateColumn", col("timestampColumn").cast(DateType))
Best thing to use, tried and tested:
df_join_result.withColumn('order_date', df_join_result['order_date'].cast('date'))
