I have a sample json datafile as shown below:
{"data_id":"1234","risk_characteristics":{"indicators":{"alcohol":true,"house":true,"business":true,"familyname":true,"swimming_pool":true}}}
{"data_id":"6789","risk_characteristics":{"indicators":{"alcohol":true,"house":true,"business":false,"familyname":true}}}
{"data_id":"5678","risk_characteristics":{"indicators":{"alcohol":false}}}
I converted the JSON file to Parquet and loaded it into Hive using the code below:
dataDF = spark.read.json("path/Datasmall.json")
dataDF.write.parquet("data.parquet")
parqFile = spark.read.parquet("data.parquet")
parqFile.write.saveAsTable("indicators_table", format='parquet', mode='append', path='/externalpath/indicators_table/')
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
fromHiveDF = hive_context.table("default.indicators_table")
fromHiveDF.show()
indicatorsDF = fromHiveDF.select('data_id', 'risk_characteristics.indicators')
indicatorsDF.printSchema()
root
|-- data_id: string (nullable = true)
|-- indicators: struct (nullable = true)
| |-- alcohol: boolean (nullable = true)
| |-- house: boolean (nullable = true)
| |-- business: boolean (nullable = true)
| |-- familyname: boolean (nullable = true)
indicatorsDF.show()
+-------+--------------------+
|data_id| indicators|
+-------+--------------------+
| 1234|[true, true, true...|
| 6789|[true, false, tru...|
| 5678| [false,,,,]|
+-------+--------------------+
Instead of retrieving the data as select data_id, indicators.alcohol, indicators.house, etc.,
I simply want to get a Parquet data file with only the 3 columns below. That is, the struct fields are converted to rows under an indicators_type column:
data_id indicators_type indicators_value
1234 alcohol T
1234 house T
1234 business T
1234 familyname T
1234 swimming_pool T
6789 alcohol T
6789 house T
6789 business F
6789 familyname T
5678 alcohol F
How can I do that? I am trying to accomplish this using PySpark. Also, is there a way to achieve this without hardcoding the field names? In my actual data the struct fields can extend beyond familyname, and there could be 100 or more of them.
Thanks a lot
Use stack to stack the columns:
df.show()
+-------+--------------------------+
|data_id|indicators |
+-------+--------------------------+
|1234 |[true, true, false, true] |
|6789 |[true, false, true, false]|
+-------+--------------------------+
# build the stack() expression from the struct's field names, so nothing is hardcoded
cols = df.select('indicators.*').columns
stack_expr = 'stack({}, {}) as (indicators_type, indicators_value)'.format(
    len(cols), ', '.join("'{0}', indicators.{0}".format(c) for c in cols))
df2 = df.selectExpr(
'data_id',
stack_expr
)
df2.show()
+-------+---------------+----------------+
|data_id|indicators_type|indicators_value|
+-------+---------------+----------------+
| 1234| alcohol| true|
| 1234| house| true|
| 1234| business| false|
| 1234| familyname| true|
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
+-------+---------------+----------------+
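One thing to note: for rows where some indicator fields are missing (like data_id 5678 after the Parquet round trip), stack still emits a row per field with a null indicators_value. If you only want the populated indicators, as in your expected output, filter them out before writing the result back to Parquet; a minimal sketch (the output path is just an example):
df2 = df2.where('indicators_value is not null')
df2.write.mode('overwrite').parquet('/externalpath/indicators_long/')  # example path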
Another solution using explode:
val df = spark.sql(""" with t1(
select 1234 data_id, named_struct('alcohol',true, 'house',false, 'business', true, 'familyname', false) as indicators
union
select 6789 data_id, named_struct('alcohol',true, 'house',false, 'business', true, 'familyname', false) as indicators
)
select * from t1
""")
df.show(false)
df.printSchema
+-------+--------------------------+
|data_id|indicators |
+-------+--------------------------+
|6789 |[true, false, true, false]|
|1234 |[true, false, true, false]|
+-------+--------------------------+
root
|-- data_id: integer (nullable = false)
|-- indicators: struct (nullable = false)
| |-- alcohol: boolean (nullable = false)
| |-- house: boolean (nullable = false)
| |-- business: boolean (nullable = false)
| |-- familyname: boolean (nullable = false)
val df2 = df.withColumn("x", explode(array(
map(lit("alcohol") ,col("indicators.alcohol")),
map(lit("house"), col("indicators.house")),
map(lit("business"), col("indicators.business")),
map(lit("familyname"), col("indicators.familyname"))
)))
df2.select(col("data_id"),map_keys(col("x"))(0), map_values(col("x"))(0)).show
+-------+--------------+----------------+
|data_id|map_keys(x)[0]|map_values(x)[0]|
+-------+--------------+----------------+
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
| 1234| alcohol| true|
| 1234| house| false|
| 1234| business| true|
| 1234| familyname| false|
+-------+--------------+----------------+
Update-1:
To get the indicators struct columns dynamically, use the below approach.
val colsx = df.select("indicators.*").columns
colsx: Array[String] = Array(alcohol, house, business, familyname)
val exp1 = colsx.map( x => s""" map("${x}", indicators.${x}) """ ).mkString(",")
val exp2 = " explode(array( " + exp1 + " )) "
val df2 = df.withColumn("x",expr(exp2))
df2.select(col("data_id"),map_keys(col("x"))(0).as("indicator_key"), map_values(col("x"))(0).as("indicator_value")).show
+-------+-------------+---------------+
|data_id|indicator_key|indicator_value|
+-------+-------------+---------------+
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
| 1234| alcohol| true|
| 1234| house| false|
| 1234| business| true|
| 1234| familyname| false|
+-------+-------------+---------------+
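The same dynamic approach translates to PySpark roughly as below (a sketch; map_keys/map_values require Spark 2.3+, and df is assumed to be the DataFrame holding the indicators struct):
from pyspark.sql import functions as F

cols = df.select("indicators.*").columns
# one single-entry map per indicator field, exploded into separate rows
df2 = df.withColumn("x", F.explode(F.array(
    *[F.create_map(F.lit(c), F.col("indicators." + c)) for c in cols]
)))
df2.select(
    F.col("data_id"),
    F.map_keys("x")[0].alias("indicator_key"),
    F.map_values("x")[0].alias("indicator_value")
).show()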
Related
I have a PySpark dataframe
simpleData = [("person0",10, 10), \
("person1",1, 1), \
("person2",1, 0), \
("person3",5, 1), \
]
columns= ["persons_name","A", 'B']
exp = spark.createDataFrame(data = simpleData, schema = columns)
exp.printSchema()
exp.show()
It looks like
root
|-- persons_name: string (nullable = true)
|-- A: long (nullable = true)
|-- B: long (nullable = true)
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 10| 10|
| person1| 1| 1|
| person2| 1| 0|
| person3| 5| 1|
+------------+---+---+
Now I want a threshold of 2 to be applied to the values of columns A and B, such that any value less than the threshold becomes 0 and any value greater than the threshold becomes 1.
The final result should look something like:
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 1| 1|
| person1| 0| 0|
| person2| 0| 0|
| person3| 1| 0|
+------------+---+---+
How can I achieve this?
import pyspark.sql.functions as F

threshold = 2
exp.select(
    'persons_name',
    *[(F.col(c) > F.lit(threshold)).cast('int').alias(c) for c in ['A', 'B']]
).show()
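If you prefer withColumn, the same thresholding can be written with when/otherwise; a quick sketch of the equivalent logic:
import pyspark.sql.functions as F

threshold = 2
result = exp
for c in ['A', 'B']:
    # 1 if the value exceeds the threshold, otherwise 0
    result = result.withColumn(c, F.when(F.col(c) > threshold, 1).otherwise(0))
result.show()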
I have a schema:
root (original)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = false)
| | |-- col2: string (nullable = true)
How can I flatten it?
root (derived)
|-- col1: string (nullable = false)
|-- col2: string (nullable = true)
|-- col3: string (nullable = false)
|-- col4: string (nullable = true)
|-- ...
where the column names col1...n come from [col1 in the original] and the values of col1...n come from [col2 in the original].
Example:
+--------------------------------------------+
|entries |
+--------------------------------------------+
|[[a1, 1], [a2, P], [a4, N]]                  |
|[[a1, 1], [a2, O], [a3, F], [a4, 1], [a5, 1]]|
+--------------------------------------------+
I want to create the next dataset:
+---+---+----+---+----+
| a1| a2|  a3| a4|  a5|
+---+---+----+---+----+
|  1|  P|null|  N|null|
|  1|  O|   F|  1|   1|
+---+---+----+---+----+
You can do it with a combination of explode and pivot; to do so, you need to create a row_id first:
val df = Seq(
Seq(("a1", "1"), ("a2", "P"), ("a4", "N")),
Seq(("a1", "1"), ("a2", "O"), ("a3", "F"), ("a4", "1"), ("a5", "1"))
).toDF("arr")
.select($"arr".cast("array<struct<col1:string,col2:string>>"))
df
.withColumn("row_id", monotonically_increasing_id())
.select($"row_id", explode($"arr"))
.select($"row_id", $"col.*")
.groupBy($"row_id").pivot($"col1").agg(first($"col2"))
.drop($"row_id")
.show()
gives:
+---+---+----+---+----+
| a1| a2| a3| a4| a5|
+---+---+----+---+----+
| 1| P|null| N|null|
| 1| O| F| 1| 1|
+---+---+----+---+----+
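A roughly equivalent PySpark version, assuming df has the entries column of array<struct<col1,col2>> from the original schema (a sketch):
from pyspark.sql import functions as F

(df
 .withColumn("row_id", F.monotonically_increasing_id())
 .select("row_id", F.explode("entries").alias("e"))
 .select("row_id", "e.*")
 .groupBy("row_id").pivot("col1").agg(F.first("col2"))
 .drop("row_id")
 .show())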
In PySpark 2.2, I am essentially trying to add rows by user.
I have a main DataFrame that looks like:
import pandas as pd

main_list = [["a","bb",5], ["d","cc",10],["d","bb",11]]
main_pd = pd.DataFrame(main_list, columns = ['user',"group", 'value'])
main_df = spark.createDataFrame(main_pd)
main_df.show()
+----+-----+-----+
|user|group|value|
+----+-----+-----+
| a| bb| 5|
| d| cc| 10|
| d| bb| 11|
+----+-----+-----+
I then have a key DataFrame; I would like every user to have a row for every group value.
User d has rows for groups bb and cc; I would like user a to have the same.
key_list = [["bb",10],["cc",17]]
key_pd = pd.DataFrame(key_list, columns = ['group', 'value'])
key_df = spark.createDataFrame(key_pd)
main_df.join(key_df, ["group"], how ="outer").show()
But my result returns:
+-----+----+-----+-----+
|group|user|value|value|
+-----+----+-----+-----+
| cc| d| 10| 17|
| bb| a| 5| 10|
| bb| d| 11| 10|
+-----+----+-----+-----+
Here are the schemas of each Dataframe:
main_df.printSchema()
root
|-- user: string (nullable = true)
|-- group: string (nullable = true)
|-- value: long (nullable = true)
key_df.printSchema()
root
|-- group: string (nullable = true)
|-- value: long (nullable = true)
Essentially I would like the result to be:
+-----+----+-----+-----+
|group|user|value|value|
+-----+----+-----+-----+
| cc| d| 10| 17|
| bb| a| 5| 10|
| cc| a| Null| 17|
| bb| d| 11| 10|
+-----+----+-----+-----+
I don't think a full outer join with a coalesce will accomplish this, so I had also experimented with row_number/rank.
Get all the user-group combinations with a cross join, then left join main_df onto them to generate the missing rows, and then left join the result with key_df.
users = main_df.select("user").distinct()
groups = main_df.select("group").distinct()
user_group = users.crossJoin(groups)
all_combs = user_group.join(
    main_df,
    (main_df.user == user_group.user) & (main_df.group == user_group.group),
    "left"
).select(user_group.user, user_group.group, main_df.value)
all_combs.join(key_df, key_df.group == all_combs.group, "left").show()
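Since all_combs and key_df both expose group and value columns, the joined result has duplicate names. If you want unambiguous columns, renaming key_df's value before the join is one option; a sketch (key_value is just an illustrative name):
key_renamed = key_df.withColumnRenamed("value", "key_value")
all_combs.join(key_renamed, "group", "left").show()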
I have a dataframe like this
data = [(("ID1", {'A': 1, 'B': 2}))]
df = spark.createDataFrame(data, ["ID", "Coll"])
df.show()
+---+----------------+
| ID| Coll|
+---+----------------+
|ID1|[A -> 1, B -> 2]|
+---+----------------+
df.printSchema()
root
|-- ID: string (nullable = true)
|-- Coll: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = true)
I want to explode the 'Coll' column such that
+---+---+-----+
| ID|Key|Value|
+---+---+-----+
|ID1|  A|    1|
|ID1|  B|    2|
+---+---+-----+
I am trying to do this in PySpark.
I can do it if I use only one column; however, I also want the ID column:
df.select(explode("Coll").alias("x", "y")).show()
+---+---+
| x| y|
+---+---+
| A| 1|
| B| 2|
+---+---+
Simply add the ID column to the select and it should work:
df.select("ID", explode("Coll").alias("Key", "Value")).show()
I am working with PySpark. I have a DataFrame loaded from csv that contains the following schema:
root
|-- id: string (nullable = true)
|-- date: date (nullable = true)
|-- users: string (nullable = true)
If I show the first two rows it looks like:
+---+----------+---------------------------------------------------+
| id| date|users |
+---+----------+---------------------------------------------------+
| 1|2017-12-03|{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]} |
| 2|2017-12-04|{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]} |
+---+----------+---------------------------------------------------+
I would like to create a new DataFrame that contains the 'users' string broken out into one row per element. I would like something similar to:
id user_id user_product
1 1 xxx
1 1 yyy
1 1 zzz
1 2 aaa
1 2 bbb
1 3 <null>
2 1 uuu
etc...
I have tried many approaches but can't seem to get it working.
The closest I can get is defining a schema such as the following and creating a new df by applying it with from_json:
userSchema = StructType([
StructField("user_id", StringType()),
StructField("product_list", StructType([
StructField("product", StringType())
]))
])
user_df = in_csv.select('id',from_json(in_csv.users, userSchema).alias("test"))
This returns the correct schema:
root
|-- id: string (nullable = true)
|-- test: struct (nullable = true)
| |-- user_id: string (nullable = true)
| |-- product_list: struct (nullable = true)
| | |-- product: string (nullable = true)
but when I show any part of the 'test' struct it returns nulls instead of values, e.g.
user_df.select('test.user_id').show()
returns:
+-------+
|user_id|
+-------+
| null|
| null|
+-------+
Maybe I shouldn't be using from_json, as the users string is not pure JSON. Any suggestions on an approach I could take?
The schema should conform to the shape of the data. Unfortunately from_json supports only StructType(...) or ArrayType(StructType(...)), which won't be useful here unless you can guarantee that all records have the same set of keys.
Instead, you can use a UserDefinedFunction:
import json
from pyspark.sql.functions import explode, udf
df = spark.createDataFrame([
(1, "2017-12-03", """{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]}"""),
(2, "2017-12-04", """{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]}""")],
("id", "date", "users")
)
@udf("map<string, array<string>>")
def parse(s):
try:
return json.loads(s)
except:
pass
(df
.select("id", "date",
explode(parse("users")).alias("user_id", "user_product"))
.withColumn("user_product", explode("user_product"))
.show())
# +---+----------+-------+------------+
# | id| date|user_id|user_product|
# +---+----------+-------+------------+
# | 1|2017-12-03| 1| xxx|
# | 1|2017-12-03| 1| yyy|
# | 1|2017-12-03| 1| zzz|
# | 1|2017-12-03| 2| aaa|
# | 1|2017-12-03| 2| bbb|
# | 2|2017-12-04| 1| uuu|
# | 2|2017-12-04| 1| yyy|
# | 2|2017-12-04| 1| zzz|
# | 2|2017-12-04| 2| aaa|
# +---+----------+-------+------------+
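Note that explode drops rows whose array is empty, so the "3" groups from the desired output do not appear above. If you want them kept as rows with a null user_product, explode_outer (available since Spark 2.2) can replace the inner explode; a sketch using the same df and parse udf:
from pyspark.sql.functions import explode, explode_outer

(df
    .select("id", "date",
            explode(parse("users")).alias("user_id", "user_product"))
    .withColumn("user_product", explode_outer("user_product"))
    .show())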
You don't need to use from_json. You have to explode twice: once for user_id and once for users.
import pyspark.sql.functions as F
df = sql.createDataFrame([
(1,'2017-12-03',{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]} ),
(2,'2017-12-04',{"1":["uuu","yyy","zzz"],"2":["aaa"], "3":[]} )],
['id','date','users']
)
df = df.select('id','date',F.explode('users').alias('user_id','users'))\
.select('id','date','user_id',F.explode('users').alias('users'))
df.show()
+---+----------+-------+-----+
| id| date|user_id|users|
+---+----------+-------+-----+
| 1|2017-12-03| 1| xxx|
| 1|2017-12-03| 1| yyy|
| 1|2017-12-03| 1| zzz|
| 1|2017-12-03| 2| aaa|
| 1|2017-12-03| 2| bbb|
| 2|2017-12-04| 1| uuu|
| 2|2017-12-04| 1| yyy|
| 2|2017-12-04| 1| zzz|
| 2|2017-12-04| 2| aaa|
+---+----------+-------+-----+