pyspark nested columns in a string - apache-spark

I am working with PySpark. I have a DataFrame loaded from CSV that has the following schema:
root
|-- id: string (nullable = true)
|-- date: date (nullable = true)
|-- users: string (nullable = true)
If I show the first two rows it looks like:
+---+----------+---------------------------------------------------+
| id| date|users |
+---+----------+---------------------------------------------------+
| 1|2017-12-03|{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]} |
| 2|2017-12-04|{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]} |
+---+----------+---------------------------------------------------+
I would like to create a new DataFrame that contains the 'users' string broken out into one row per element. I would like something similar to
id user_id user_product
1 1 xxx
1 1 yyy
1 1 zzz
1 2 aaa
1 2 bbb
1 3 <null>
2 1 uuu
etc...
I have tried many approaches but can't seem to get it working.
The closest I can get is defining a schema like the following and creating a new DataFrame by applying it with from_json:
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

userSchema = StructType([
    StructField("user_id", StringType()),
    StructField("product_list", StructType([
        StructField("product", StringType())
    ]))
])

user_df = in_csv.select('id', from_json(in_csv.users, userSchema).alias("test"))
This returns the correct schema:
root
|-- id: string (nullable = true)
|-- test: struct (nullable = true)
| |-- user_id: string (nullable = true)
| |-- product_list: struct (nullable = true)
| | |-- product: string (nullable = true)
but when I show any part of the 'test' struct it returns nulls instead of values, e.g.
user_df.select('test.user_id').show()
returns test.user_id :
+-------+
|user_id|
+-------+
| null|
| null|
+-------+
Maybe I shouldn't be using from_json, since the users string is not pure JSON. Any suggestions on an approach I could take?

The schema should conform to the shape of the data. Unfortunately from_json supports only StructType(...) or ArrayType(StructType(...)), which won't be useful here unless you can guarantee that all records have the same set of keys.
Instead, you can use a user-defined function (UDF):
import json
from pyspark.sql.functions import explode, udf

df = spark.createDataFrame([
    (1, "2017-12-03", """{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]}"""),
    (2, "2017-12-04", """{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]}""")],
    ("id", "date", "users")
)

@udf("map<string, array<string>>")
def parse(s):
    # Return None (null) for rows that cannot be parsed as JSON.
    try:
        return json.loads(s)
    except (TypeError, ValueError):
        return None

(df
    .select("id", "date",
            explode(parse("users")).alias("user_id", "user_product"))
    .withColumn("user_product", explode("user_product"))
    .show())
# +---+----------+-------+------------+
# | id| date|user_id|user_product|
# +---+----------+-------+------------+
# | 1|2017-12-03| 1| xxx|
# | 1|2017-12-03| 1| yyy|
# | 1|2017-12-03| 1| zzz|
# | 1|2017-12-03| 2| aaa|
# | 1|2017-12-03| 2| bbb|
# | 2|2017-12-04| 1| uuu|
# | 2|2017-12-04| 1| yyy|
# | 2|2017-12-04| 1| zzz|
# | 2|2017-12-04| 2| aaa|
# +---+----------+-------+------------+
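As a side note, in more recent Spark releases (2.3 and later) from_json also accepts a MapType schema, including the DDL string form, so a UDF-free sketch along these lines should work with the same df as above:

from pyspark.sql.functions import explode, from_json

# Parse the users string into a map<string, array<string>> column,
# then explode the map and the inner arrays as in the UDF version.
parsed = df.withColumn("users", from_json("users", "map<string, array<string>>"))

(parsed
    .select("id", "date", explode("users").alias("user_id", "user_product"))
    .withColumn("user_product", explode("user_product"))
    .show())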

You don't need to use from_json. You have to explode twice: once for user_id and once for the list of products. Note that this example creates users directly as a map column; if users is a string (as in the question), you would still need to parse it first.
import pyspark.sql.functions as F

df = spark.createDataFrame([
    (1, '2017-12-03', {"1": ["xxx", "yyy", "zzz"], "2": ["aaa", "bbb"], "3": []}),
    (2, '2017-12-04', {"1": ["uuu", "yyy", "zzz"], "2": ["aaa"], "3": []})],
    ['id', 'date', 'users']
)

df = df.select('id', 'date', F.explode('users').alias('user_id', 'users'))\
       .select('id', 'date', 'user_id', F.explode('users').alias('users'))
df.show()
df.show()
+---+----------+-------+-----+
| id| date|user_id|users|
+---+----------+-------+-----+
| 1|2017-12-03| 1| xxx|
| 1|2017-12-03| 1| yyy|
| 1|2017-12-03| 1| zzz|
| 1|2017-12-03| 2| aaa|
| 1|2017-12-03| 2| bbb|
| 2|2017-12-04| 1| uuu|
| 2|2017-12-04| 1| yyy|
| 2|2017-12-04| 1| zzz|
| 2|2017-12-04| 2| aaa|
+---+----------+-------+-----+

Related

To convert struct column names into rows in a parquet file

I have a sample JSON data file as shown below:
{"data_id":"1234","risk_characteristics":{"indicators":{"alcohol":true,"house":true,"business":true,"familyname":true,"swimming_pool":true}}}
{"data_id":"6789","risk_characteristics":{"indicators":{"alcohol":true,"house":true,"business":false,"familyname":true}}}
{"data_id":"5678","risk_characteristics":{"indicators":{"alcohol":false}}}
I converted the JSON file to Parquet and loaded it into Hive using the code below:
dataDF = spark.read.json("path/Datasmall.json")
dataDF.write.parquet("data.parquet")
parqFile = spark.read.parquet("data.parquet")
parqFile.write.saveAsTable("indicators_table", format='parquet', mode='append', path='/externalpath/indicators_table/')
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
fromHiveDF = hive_context.table("default.indicators_table")
fromHiveDF.show()
indicatorsDF = fromHiveDF.select('data_id', 'risk_characteristics.indicators')
indicatorsDF.printSchema()
root
|-- data_id: string (nullable = true)
|-- indicators: struct (nullable = true)
| |-- alcohol: boolean (nullable = true)
| |-- house: boolean (nullable = true)
| |-- business: boolean (nullable = true)
| |-- familyname: boolean (nullable = true)
indicatorsDF.show()
+-------+--------------------+
|data_id| indicators|
+-------+--------------------+
| 1234|[true, true, true...|
| 6789|[true, false, tru...|
| 5678| [false,,,,]|
+-------+--------------------+
Instead of retrieving the data as select data_id, indicators.alcohol, indicators.house, etc.,
I simply want to get a Parquet data file with only the three columns below. That is, the struct fields are converted to rows under an indicators_type column.
data_id indicators_type indicators_value
1234 alcohol T
1234 house T
1234 business T
1234 familyname T
1234 swimming_pool T
6789 alcohol T
6789 house F
6789 business T
6789 familyname F
5678 alcohol F
May I ask how to do that? I am trying to accomplish this using PySpark. Also, is there a way to achieve this without hardcoding the field names? In my actual data the struct can extend beyond familyname, possibly to 100 or more fields.
Thanks a lot
Use stack to stack the columns:
df.show()
+-------+--------------------------+
|data_id|indicators |
+-------+--------------------------+
|1234 |[true, true, false, true] |
|6789 |[true, false, true, false]|
+-------+--------------------------+
cols = df.select('indicators.*').columns
stack_expr = 'stack({}, {}) as (indicators_type, indicators_value)'.format(
    len(cols),
    ', '.join("'{0}', indicators.{0}".format(c) for c in cols)
)
df2 = df.selectExpr(
    'data_id',
    stack_expr
)
df2.show()
+-------+---------------+----------------+
|data_id|indicators_type|indicators_value|
+-------+---------------+----------------+
| 1234| alcohol| true|
| 1234| house| true|
| 1234| business| false|
| 1234| familyname| true|
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
+-------+---------------+----------------+
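For reference, stack(n, label1, value1, label2, value2, ...) emits n rows per input row, so for the four fields above the generated expression is roughly equivalent to the following, written out by hand here for illustration:

df.selectExpr(
    "data_id",
    # One (label, value) pair per struct field; stack turns them into 4 rows.
    "stack(4, "
    "'alcohol', indicators.alcohol, "
    "'house', indicators.house, "
    "'business', indicators.business, "
    "'familyname', indicators.familyname) as (indicators_type, indicators_value)"
).show()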
Another solution using explode:
val df = spark.sql(""" with t1 as (
select 1234 data_id, named_struct('alcohol',true, 'house',false, 'business', true, 'familyname', false) as indicators
union
select 6789 data_id, named_struct('alcohol',true, 'house',false, 'business', true, 'familyname', false) as indicators
)
select * from t1
""")
df.show(false)
df.printSchema
+-------+--------------------------+
|data_id|indicators |
+-------+--------------------------+
|6789 |[true, false, true, false]|
|1234 |[true, false, true, false]|
+-------+--------------------------+
root
|-- data_id: integer (nullable = false)
|-- indicators: struct (nullable = false)
| |-- alcohol: boolean (nullable = false)
| |-- house: boolean (nullable = false)
| |-- business: boolean (nullable = false)
| |-- familyname: boolean (nullable = false)
val df2 = df.withColumn("x", explode(array(
  map(lit("alcohol"), col("indicators.alcohol")),
  map(lit("house"), col("indicators.house")),
  map(lit("business"), col("indicators.business")),
  map(lit("familyname"), col("indicators.familyname"))
)))
df2.select(col("data_id"),map_keys(col("x"))(0), map_values(col("x"))(0)).show
+-------+--------------+----------------+
|data_id|map_keys(x)[0]|map_values(x)[0]|
+-------+--------------+----------------+
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
| 1234| alcohol| true|
| 1234| house| false|
| 1234| business| true|
| 1234| familyname| false|
+-------+--------------+----------------+
Update-1:
To get the indicators struct columns dynamically, use the below approach.
val colsx = df.select("indicators.*").columns
colsx: Array[String] = Array(alcohol, house, business, familyname)
val exp1 = colsx.map( x => s""" map("${x}", indicators.${x}) """ ).mkString(",")
val exp2 = " explode(array( " + exp1 + " )) "
val df2 = df.withColumn("x",expr(exp2))
df2.select(col("data_id"),map_keys(col("x"))(0).as("indicator_key"), map_values(col("x"))(0).as("indicator_value")).show
+-------+-------------+---------------+
|data_id|indicator_key|indicator_value|
+-------+-------------+---------------+
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
| 1234| alcohol| true|
| 1234| house| false|
| 1234| business| true|
| 1234| familyname| false|
+-------+-------------+---------------+
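Since the question asks for PySpark, a rough Python translation of the dynamic approach might look like the sketch below (assuming a PySpark DataFrame df with the same data_id / indicators schema):

import pyspark.sql.functions as F

cols = df.select("indicators.*").columns

# Build one single-entry map per struct field, explode the array of maps,
# then pull the key and value out of each map.
exploded = df.withColumn(
    "x",
    F.explode(F.array(*[
        F.create_map(F.lit(c), F.col("indicators." + c)) for c in cols
    ]))
)

exploded.select(
    "data_id",
    F.map_keys("x")[0].alias("indicators_type"),
    F.map_values("x")[0].alias("indicators_value"),
).show()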

apply threshold on column values in a pyspark dataframe and convert the values to binary 0 or 1

I have a PySpark dataframe
simpleData = [("person0", 10, 10),
              ("person1", 1, 1),
              ("person2", 1, 0),
              ("person3", 5, 1)]
columns = ["persons_name", "A", "B"]
exp = spark.createDataFrame(data=simpleData, schema=columns)
exp.printSchema()
exp.show()
It looks like
root
|-- persons_name: string (nullable = true)
|-- A: long (nullable = true)
|-- B: long (nullable = true)
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 10| 10|
| person1| 1| 1|
| person2| 1| 0|
| person3| 5| 1|
+------------+---+---+
Now I want to apply a threshold of 2 to the values of columns A and B, such that any value less than the threshold becomes 0 and any value greater than the threshold becomes 1.
The final result should look something like:
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 1| 1|
| person1| 0| 0|
| person2| 0| 0|
| person3| 1| 0|
+------------+---+---+
How can I achieve this?
import pyspark.sql.functions as F

threshold = 2
exp.select(
    'persons_name',
    *[(F.col(c) > F.lit(threshold)).cast('int').alias(c) for c in ['A', 'B']]
).show()
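An equivalent variant using when/otherwise, in case the boolean-cast form reads as too terse (a sketch using the same exp DataFrame as above):

import pyspark.sql.functions as F

threshold = 2
exp.select(
    'persons_name',
    # 1 if the value exceeds the threshold, otherwise 0.
    *[F.when(F.col(c) > threshold, 1).otherwise(0).alias(c) for c in ['A', 'B']]
).show()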

Adding values of two columns whose datatype is string in pyspark

The log files are in JSON format; I extracted the data into a PySpark DataFrame.
There are two columns whose values are integers, but the datatype of the columns is string.
cola|colb
45|10
10|20
Expected Output
newcol
55
30
but I am getting output like
4510
1020
The code I have used is like:
df.select(F.concat("cola", "colb").alias("newcol")).show()
Kindly help me understand how I can get the correct output.
>>> from pyspark.sql.functions import col
>>> df.show()
+----+----+
|cola|colb|
+----+----+
| 45| 10|
| 10| 20|
+----+----+
>>> df.printSchema()
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
>>> df.withColumn("newcol", col("cola") + col("colb")).show()
+----+----+------+
|cola|colb|newcol|
+----+----+------+
| 45| 10| 55.0|
| 10| 20| 30.0|
+----+----+------+
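If you want integer results instead of the 55.0 / 30.0 doubles produced by the implicit cast, you could cast the columns explicitly first (a small variation, not part of the original answer):

from pyspark.sql.functions import col

# Cast both string columns to int before adding, so newcol is an integer.
df.withColumn("newcol", col("cola").cast("int") + col("colb").cast("int")).show()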

Pyspark - Add Rows By Group

In Pyspark 2.2 I am essentially trying to add rows by user.
If I have my main Dataframe that looks like:
import pandas as pd

main_list = [["a", "bb", 5], ["d", "cc", 10], ["d", "bb", 11]]
main_pd = pd.DataFrame(main_list, columns=['user', 'group', 'value'])
main_df = spark.createDataFrame(main_pd)
main_df.show()
+----+-----+-----+
|user|group|value|
+----+-----+-----+
| a| bb| 5|
| d| cc| 10|
| d| bb| 11|
+----+-----+-----+
I then have a key DataFrame where I would like every user to have every group value.
User d has a row for groups bb and cc; I would like user a to have the same.
key_list = [["bb",10],["cc",17]]
key_pd = pd.DataFrame(key_list, columns = ['group', 'value'])
key_df = spark.createDataFrame(key_pd)
main_df.join(key_df, ["group"], how ="outer").show()
But my result returns:
+-----+----+-----+-----+
|group|user|value|value|
+-----+----+-----+-----+
| cc| d| 10| 17|
| bb| a| 5| 10|
| bb| d| 11| 10|
+-----+----+-----+-----+
Here are the schemas of each Dataframe:
main_df.printSchema()
root
|-- user: string (nullable = true)
|-- group: string (nullable = true)
|-- value: long (nullable = true)
key_df.printSchema()
root
|-- group: string (nullable = true)
|-- value: long (nullable = true)
Essentially I would like the result to be:
+-----+----+-----+-----+
|group|user|value|value|
+-----+----+-----+-----+
| cc| d| 10| 17|
| bb| a| 5| 10|
| cc| a| Null| 17|
| bb| d| 11| 10|
+-----+----+-----+-----+
I don't think a full outer join will accomplish this, even with a coalesce, so I have also experimented with row_number/rank.
Get all the user-group combinations with a cross join, then left join main_df onto them to generate the missing rows, and finally left join the result with key_df.
users = main_df.select("user").distinct()
groups = main_df.select("group").distinct()
user_group = users.crossJoin(groups)

all_combs = user_group.join(
    main_df,
    (main_df.user == user_group.user) & (main_df.group == user_group.group),
    "left"
).select(user_group.user, user_group.group, main_df.value)

all_combs.join(key_df, key_df.group == all_combs.group, "left").show()
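To avoid ending up with two columns both named value, one option is to rename the key value column before the final join (a small variation on the same idea, not part of the original answer):

# Rename key_df's value column, then join on the shared group column.
key_df2 = key_df.withColumnRenamed("value", "group_value")
all_combs.join(key_df2, "group", "left").show()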

Explode Maptype column in pyspark

I have a dataframe like this
data = [("ID1", {'A': 1, 'B': 2})]
df = spark.createDataFrame(data, ["ID", "Coll"])
df.show()
+---+----------------+
| ID| Coll|
+---+----------------+
|ID1|[A -> 1, B -> 2]|
+---+----------------+
df.printSchema()
root
|-- ID: string (nullable = true)
|-- Coll: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = true)
I want to explode the 'Coll' column such that
+---+---+-----+
| ID|Key|Value|
+---+---+-----+
|ID1|  A|    1|
|ID1|  B|    2|
+---+---+-----+
I am trying to do this in PySpark.
I am successful if I use only one column; however, I want the ID column as well.
df.select(explode("Coll").alias("x", "y")).show()
+---+---+
| x| y|
+---+---+
| A| 1|
| B| 2|
+---+---+
Simply add the ID column to the select and it should work:
df.select("id", explode("Coll").alias("x", "y"))
