How to split a dictionary in a Pyspark dataframe into multiple rows? - python-3.x

I have the following dataframe that is extracted with the following command:
extract = data.select('properties.id', 'flags')
| id | flags |
|-------| ---------------------------|
| v_001 | "{"93":true,"83":true}" |
| v_002 | "{"45":true,"76":true}" |
The desired result I want is:
| id | flags |
|-------| ------|
| v_001 | 93 |
| v_001 | 83 |
| v_002 | 45 |
| v_002 | 76 |
I tried to apply explode as the following:
extract = data.select('properties.id', explode(col('flags')))
But I encountered the following:
cannot resolve 'explode(flags)' due to data type mismatch: input to function explode should be array or map type, not struct<93:boolean,83:boolean,45:boolean,76:boolean>
This makes sense as the schema of the column is not compatible with the explode function. How can I adjust the function to get my desired result? Is there a better way to solve this problem?
P.D.: The desired table schema is not the best design but this is out of my scope since this will involve another topic discussion.

As you may have already found, explode requires an array or map type as input, and it seems you only need the keys from the dict in flags.
So you can first convert flags to a MapType and use map_keys to extract all the keys into a list.
df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
This results in the following:
+-----+--------+
| id| flags|
+-----+--------+
|v_001|[93, 83]|
|v_002|[45, 76]|
+-----+--------+
Then you can use explode on the flags.
.select('id', F.explode('flags'))
+-----+---+
| id|col|
+-----+---+
|v_001| 93|
|v_001| 83|
|v_002| 45|
|v_002| 76|
+-----+---+
The whole code:
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, BooleanType
df = (df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
      .select('id', F.explode('flags')))
Update
It is probably better to supply the schema and read flags as a MapType in the first place, but if your JSON is complex and writing the schema is hard, you can convert the struct to a JSON string first and then convert that to a MapType.
# Add this line before `from_json`
df = df.select('id', F.to_json('flags').alias('flags'))
# Or you can do it in one shot.
df = (df.withColumn('flags', F.map_keys(F.from_json(F.to_json('flags'), MapType(StringType(), BooleanType()))))
.select('id', F.explode('flags')))
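As a side note, explode also works directly on a MapType column (it returns key and value columns), so under the same assumptions you could skip map_keys and pick the key afterwards. A minimal sketch:
df = (df.withColumn('flags', F.from_json(F.to_json('flags'), MapType(StringType(), BooleanType())))
      .select('id', F.explode('flags'))
      .select('id', F.col('key').alias('flags')))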

Related

How to identify if a particular string/pattern exists in a column using PySpark

Below is my sample dataframe for household items.
Here W represents Wooden, G represents Glass, and P represents Plastic; the different items are classified into those categories.
I want to identify which items fall into the W, G, and P categories. As an initial step, I tried classifying it for Chair:
M = sqlContext.createDataFrame([('W-Chair-Shelf;G-Vase;P-Cup', ''),
                                ('W-Chair', ''),
                                ('W-Shelf;G-Cup;P-Chair', ''),
                                ('G-Cup;P-ShowerCap;W-Board', '')],
                               ['Household_chores_arrangements', 'Chair'])
M.createOrReplaceTempView('M')
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| |
| W-Chair| |
| W-Shelf;G-Cup;P-Chair| |
| G-Cup;P-ShowerCap;W-Board| |
+-----------------------------+-----+
I tried it for one condition where I can mark it as W, but I am not getting the expected results; maybe my condition is wrong.
df = sqlContext.sql("select * from M where Household_chores_arrangements like '%W%Chair%'")
display(df)
Is there a better way to do this in PySpark?
Expected output
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| W|
| W-Chair| W|
| W-Shelf;G-Cup;P-Chair| P|
| G-Cup;P-ShowerCap;W-Board| NULL|
+-----------------------------+-----+
Thanks @mck for the solution.
Update
In addition to that, I was trying to analyse the regexp_extract option further, so I altered the sample set:
M = sqlContext.createDataFrame([('Wooden|Chair', ''),
                                ('Wooden|Cup;Glass|Chair', ''),
                                ('Wooden|Cup;Glass|Showercap;Plastic|Chair', '')],
                               ['Household_chores_arrangements', 'Chair'])
M.createOrReplaceTempView('M')
df = spark.sql("""
    select
        Household_chores_arrangements,
        nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)(|Chair)', 1), '') as Chair
    from M
""")
display(df)
Result:
+----------------------------------------+------+
|Household_chores_arrangements           |Chair |
+----------------------------------------+------+
|Wooden|Chair                            |Wooden|
|Wooden|Cup;Glass|Chair                  |Wooden|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Wooden|
+----------------------------------------+------+
I changed the delimiter to | instead of - and changed the query as well. I was expecting the results below, but got a wrong result:
+----------------------------------------+-------+
|Household_chores_arrangements           |Chair  |
+----------------------------------------+-------+
|Wooden|Chair                            |Wooden |
|Wooden|Cup;Glass|Chair                  |Glass  |
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Plastic|
+----------------------------------------+-------+
If only the delimiter is changed, do any other values need to change?
Update 2
I have found the solution to the update mentioned above.
For a pipe delimiter, the pipe has to be escaped with four backslashes in the regex.
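Based on that note, the query presumably ends up looking something like the sketch below (only the pattern changes; the four backslashes in the Python source reach the regex engine as an escaped pipe):
df = spark.sql("""
    select
        Household_chores_arrangements,
        nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)\\\\|Chair', 1), '') as Chair
    from M
""")
display(df)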
You can use regexp_extract to extract the categories, and if no match is found, replace the empty string with null using nullif.
df = spark.sql("""
    select
        Household_chores_arrangements,
        nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair
    from M
""")
df.show(truncate=False)
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|W-Chair-Shelf;G-Vase;P-Cup |W |
|W-Chair |W |
|W-Shelf;G-Cup;P-Chair |P |
|G-Cup;P-ShowerCap;W-Board |null |
+-----------------------------+-----+
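For reference, the same logic can also be written with the DataFrame API instead of SQL; a rough sketch, using when in place of SQL's nullif (column name and pattern as above):
from pyspark.sql import functions as F

extracted = F.regexp_extract('Household_chores_arrangements', '([A-Z])-Chair', 1)
M.withColumn('Chair', F.when(extracted != '', extracted)).show(truncate=False)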

Spark SQL - providing a list parameter for the sum function

I'm using the Spark DataFrame API.
I'm trying to give sum() a list parameter containing column names as strings.
When I put the column names directly into the function, the script works.
When I try to provide them to the function as a parameter of type list, I get the error:
"py4j.protocol.Py4JJavaError: An error occurred while calling o155.sum.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String"
Using the same kind of list parameter for groupBy() works.
This is my script:
groupBy_cols = ['date_expense_int', 'customer_id']
agged_cols_list = ['total_customer_exp_last_m','total_customer_exp_last_3m']
df = df.groupBy(groupBy_cols).sum(agged_cols_list)
When I write it like this, it works:
df = df.groupBy(groupBy_cols).sum('total_customer_exp_last_m', 'total_customer_exp_last_3m')
I also tried to give sum() a list of columns by using
agged_cols_list2 = []
for i in agged_cols_list:
    agged_cols_list2.append(col(i))
but that didn't work either.
Unpack your list using the asterisk notation:
df = df.groupBy(groupBy_cols).sum(*agged_cols_list)
If you have a df like the one below and want to sum a list of fields:
df.show(5,truncate=False)
+---+---------+----+
|id |subject |mark|
+---+---------+----+
|100|English |45 |
|100|Maths |63 |
|100|Physics |40 |
|100|Chemistry|94 |
|100|Biology |74 |
+---+---------+----+
only showing top 5 rows
agged_cols_list=['subject', 'mark']
df.groupBy("id").agg(*[sum(col(c)) for c in agged_cols_list]).show(5,truncate=False)
+---+------------+---------+
|id |sum(subject)|sum(mark)|
+---+------------+---------+
|125|null |330.0 |
|124|null |332.0 |
|155|null |304.0 |
|132|null |382.0 |
|154|null |300.0 |
+---+------------+---------+
Note that sum(subject) becomes null as it is a string column.
In this case you may want to apply count to subject and sum to mark, so you can use a dictionary:
summary={ "subject":"count","mark":"sum" }
df.groupBy("id").agg(summary).show(5,truncate=False)
+---+--------------+---------+
|id |count(subject)|sum(mark)|
+---+--------------+---------+
|125|5 |330.0 |
|124|5 |332.0 |
|155|5 |304.0 |
|132|5 |382.0 |
|154|5 |300.0 |
+---+--------------+---------+
only showing top 5 rows
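If you need friendlier names than count(subject) and sum(mark), one option is to rename the generated columns afterwards; a small sketch, assuming the generated names match the output above:
agged = (df.groupBy("id").agg(summary)
           .withColumnRenamed("count(subject)", "subject_count")
           .withColumnRenamed("sum(mark)", "mark_total"))
agged.show(5, truncate=False)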

How to combine and sort different dataframes into one?

Given two dataframes, which may have completely different schemas except for an index column (timestamp in this case), such as df1 and df2 below:
df1:
timestamp | length | width
1 | 10 | 20
3 | 5 | 3
df2:
timestamp | name | length
0 | "sample" | 3
2 | "test" | 6
How can I combine these two dataframes into one that would look something like this:
df3:
timestamp | df1 | df2
| length | width | name | length
0 | null | null | "sample" | 3
1 | 10 | 20 | null | null
2 | null | null | "test" | 6
3 | 5 | 3 | null | null
I am extremely new to Spark, so this might not actually make a lot of sense. But the problem I am trying to solve is: I need to combine these dataframes so that later I can convert each row to a given object. However, they have to be ordered by timestamp, so that when I write these objects out, they are in the correct order.
So for example, given the df3 above, I would be able to generate the following list of objects:
objs = [
ObjectType1(timestamp=0, name="sample", length=3),
ObjectType2(timestamp=1, length=10, width=20),
ObjectType1(timestamp=2, name="test", length=6),
ObjectType2(timestamp=3, length=5, width=3)
]
Perhaps combining the dataframes does not make sense, but how could I sort the dataframes individually and somehow grab the Rows from each one of them ordered by timestamp globally?
P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns of same name and type, but represent completely different data, so merging schema is not a possibility.
What you need is a full outer join, possibly renaming one of the columns; something like df1.join(df2.withColumnRenamed("length", "length2"), Seq("timestamp"), "full_outer").
See this example, built from yours (just less typing):
// data shaped as your example
case class t1(ts:Int, width:Int, l:Int)
case class t2(ts:Int, name:String, l:Int)
// create data frames
val df1 = Seq(t1(1,10,20),t1(3,5,3)).toDF
val df2 = Seq(t2(0,"sample",3),t2(2,"test",6)).toDF
df1.join(df2.withColumnRenamed("l","l2"),Seq("ts"),"full_outer").sort("ts").show
+---+-----+----+------+----+
| ts|width| l| name| l2|
+---+-----+----+------+----+
| 0| null|null|sample| 3|
| 1| 10| 20| null|null|
| 2| null|null| test| 6|
| 3| 5| 3| null|null|
+---+-----+----+------+----+
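If you are working in PySpark rather than Scala, a roughly equivalent sketch (column names taken from the original df1/df2) would be:
df3 = (df1.join(df2.withColumnRenamed("length", "length2"),
                on="timestamp", how="full_outer")
          .sort("timestamp"))
df3.show()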

Is spark smart enough to avoid redundant values while performing aggregation?

I have the following Dataset
case class Department(deptId:String,locations:Seq[String])
// using spark 2.0.2
// I have a Dataset `ds` of type Department
+-------+--------------------+
|deptId | locations |
+-------+--------------------+
| d1|[delhi,kerala] |
| d1|[] |
| dp2|[] |
| dp2|[hyderabad] |
+-------+--------------------+
I intended to convert it to
// Dataset `result` of type Department itself
+-------+--------------------+
|deptId | locations |
+-------+--------------------+
| d1|[delhi,kerala] |
| dp2|[hyderabad] |
+-------+--------------------+
I do the following:
val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten)
val result = ds.groupBy("deptId")
  .agg(flatten(collect_list("locations")).as("locations"))
My question is: is Spark smart enough not to shuffle around the empty locations, i.e. []?
PS: I am not sure if this is a stupid question.
Yes and no:
Yes - collect_list performs map-side aggregation, so if there are multiple values per grouping key, the data will be merged before the shuffle.
No - because an empty list is not the same as missing data. If that's not the desired behavior, you should filter the data first:
ds.filter(size($"locations") > 0).groupBy("deptId").agg(...)
but keep in mind that this will yield a different result if a deptId has only empty arrays (that deptId will be dropped).
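As a side note, in PySpark on Spark 2.4+ there is a built-in flatten, so the filter-then-aggregate version needs no UDF; a rough sketch, assuming a DataFrame ds with the same columns:
from pyspark.sql import functions as F

result = (ds.filter(F.size("locations") > 0)
            .groupBy("deptId")
            .agg(F.flatten(F.collect_list("locations")).alias("locations")))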

pyspark; check if an element is in collect_list [duplicate]

I am working on a dataframe df, for instance the following dataframe:
df.show()
Output:
+----+------+
|keys|values|
+----+------+
| aa| apple|
| bb|orange|
| bb| desk|
| bb|orange|
| bb| desk|
| aa| pen|
| bb|pencil|
| aa| chair|
+----+------+
I use collect_set to aggregate and get a set of objects with duplicate elements eliminated (or collect_list to get a list of objects).
df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))
The resulting dataframe is then as follows:
df_new.show()
Output:
+----+----------------------+
|keys|collectedSet_values |
+----+----------------------+
|bb |[orange, pencil, desk]|
|aa |[apple, pen, chair] |
+----+----------------------+
I am struggling to find a way to see whether a specific keyword (like 'chair') is in the resulting set of objects (in the column collectedSet_values). I do not want to go with a udf solution.
Please comment with your solutions/ideas.
Kind regards.
Actually, there is a nice function array_contains which does that for us. It works the same way on a collected set as on a collected list. To know whether the word 'chair' exists in each set of objects, we can simply do the following:
df_new.withColumn('contains_chair', array_contains(df_new.collectedSet_values, 'chair')).show()
Output:
+----+----------------------+--------------+
|keys|collectedSet_values |contains_chair|
+----+----------------------+--------------+
|bb |[orange, pencil, desk]|false |
|aa |[apple, pen, chair] |true |
+----+----------------------+--------------+
The same applies to the result of collect_list.
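And if you only want to keep the groups whose set contains 'chair', the same function works inside a filter; for example:
from pyspark.sql.functions import array_contains

df_new.filter(array_contains(df_new.collectedSet_values, 'chair')).show(truncate=False)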
