PySpark - How to name partitions in ISO 8601 format over integer columns

I have a DataFrame that looks like this:
+--------------------+----+-----+---+----+
|action              |year|month|day|hour|
+--------------------+----+-----+---+----+
|check in            |2022|    4|  3|  23|
|go shopping         |2022|    4|  4|  11|
|go eat              |2022|    4|  5|  12|
|go watch a movie    |2022|    4|  6|  14|
|go out for a drink  |2022|    4|  6|  18|
+--------------------+----+-----+---+----+
root
|-- action: string (nullable = false)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- hour: integer (nullable = true)
When I save this DataFrame to disk, I would like the table to be partitioned by the datetime fields, i.e.:
df.write.format("parquet").partitionBy(
"year", "month", "day", "hour"
).save(path)
I would also like to zero-pad the month, day, and hour partition folder names, so that they look like this:
table/year=2022/month=09/...
and not this:
table/year=2022/month=9/...
Is it possible to do this in PySpark in a way that isn't intrusive to the table and doesn't involve changing the table schema?
Thank you for your time.
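One possible approach, offered as a minimal sketch rather than a definitive answer: format the month, day, and hour columns as zero-padded strings just before writing, so that partitionBy produces the padded folder names. The helper names partition_dir and write_padded are hypothetical. Note the caveat that the columns are strings at write time; when reading the table back, Spark's partition column type inference (spark.sql.sources.partitionColumnTypeInference.enabled, on by default as far as I know) should parse values like 09 as integers again, but that is worth verifying on your Spark version.

```python
def partition_dir(year, month, day, hour):
    """Return the zero-padded partition directory expected for one row (pure Python)."""
    return f"year={year:04d}/month={month:02d}/day={day:02d}/hour={hour:02d}"


def write_padded(df, path):
    """Write df as parquet, partitioned by zero-padded copies of the date columns.

    Assumes df has integer columns named year, month, day and hour.
    """
    # Imported lazily so partition_dir can be used without a Spark installation.
    from pyspark.sql import functions as F

    padded = df
    for name in ("month", "day", "hour"):
        # format_string applies printf-style formatting, e.g. 9 becomes "09".
        padded = padded.withColumn(name, F.format_string("%02d", F.col(name)))
    padded.write.format("parquet").partitionBy("year", "month", "day", "hour").save(path)


print(partition_dir(2022, 9, 3, 7))
```

Because the padding is applied to a derived DataFrame, the original df and its integer schema are left untouched.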

Related

Spark: group data by equal chunks (using a non-time related criteria)

When analysing a data series, is it possible to group data into equal chunks on the basis of a non-time-related column?
Is there a way of splitting a single row whenever necessary (when an individual value is higher than the chunk size)?
For example:
root
|-- Datetime: timestamp (nullable = true)
|-- Quantity: integer (nullable = true)
+-------------------+--------+
| Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11| 200|
|2021-09-10 10:08:16| 300|
|2021-09-11 08:05:11| 200|
|2021-09-11 08:07:25| 100|
|2021-09-11 10:28:14| 700|
|2021-09-12 09:24:11| 1500|
|2021-09-12 09:25:00| 100|
|2021-09-13 09:25:00| 400|
+-------------------+--------+
Desired output (every 500 units):
root
|-- Starting Datetime: timestamp (nullable = true)
|-- Ending Datetime: timestamp (nullable = true)
|-- Quantity: integer (nullable = true)
|-- Duration(seconds): integer (nullable = true)
+-------------------+-------------------+--------+-----------+
| Starting Datetime | Ending Datetime |Quantity|Duration(s)|
+-------------------+-------------------+--------+-----------+
|2021-09-10 10:08:11|2021-09-10 10:08:16| 500| 5|
|2021-09-11 08:05:11|2021-09-11 10:28:14| 500| 8583|
|2021-09-11 10:28:14|2021-09-11 10:28:14| 500| 0|
|2021-09-12 09:24:11|2021-09-12 09:24:11| 500| 0|
|2021-09-12 09:24:11|2021-09-12 09:24:11| 500| 0|
|2021-09-12 09:24:11|2021-09-12 09:24:11| 500| 0|
|2021-09-12 09:25:00|2021-09-13 09:25:00| 500| 86400|
+-------------------+-------------------+--------+-----------+
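The chunking rule implied by the desired output can be sketched in plain Python before worrying about how to express it in Spark. This is only an illustration of the logic, under the assumptions that rows are already sorted by Datetime and that a trailing partial chunk is dropped; chunk_rows is a hypothetical helper name.

```python
from datetime import datetime


def chunk_rows(rows, chunk_size):
    """Split (timestamp, quantity) rows into chunks of exactly chunk_size units.

    Returns (start, end, quantity, duration_seconds) tuples. A row larger than
    the space left in the current chunk is split across several chunks.
    """
    chunks = []
    filled = 0      # units already placed in the current chunk
    start = None    # timestamp of the first row contributing to the current chunk
    for ts, qty in rows:
        remaining = qty
        while remaining > 0:
            if filled == 0:
                start = ts
            take = min(chunk_size - filled, remaining)
            filled += take
            remaining -= take
            if filled == chunk_size:
                duration = int((ts - start).total_seconds())
                chunks.append((start, ts, chunk_size, duration))
                filled = 0
    return chunks


rows = [
    (datetime(2021, 9, 10, 10, 8, 11), 200),
    (datetime(2021, 9, 10, 10, 8, 16), 300),
    (datetime(2021, 9, 11, 8, 5, 11), 200),
    (datetime(2021, 9, 11, 8, 7, 25), 100),
    (datetime(2021, 9, 11, 10, 28, 14), 700),
    (datetime(2021, 9, 12, 9, 24, 11), 1500),
    (datetime(2021, 9, 12, 9, 25, 0), 100),
    (datetime(2021, 9, 13, 9, 25, 0), 400),
]
chunks = chunk_rows(rows, 500)
for chunk in chunks:
    print(chunk)
```

The logic is inherently sequential, since each chunk depends on how much of the previous rows was consumed, so in Spark it would presumably have to run over an ordered, single-partition view of the data or be re-expressed with a running cumulative sum.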

pyspark split array type column to multiple columns

After running the ALS algorithm in PySpark over a dataset, I ended up with a final DataFrame in which the recommendation column is an array type.
I want to split this column into multiple columns, one per ID.
Can anyone suggest which PySpark function can be used to build this DataFrame?
Schema of the dataframe
root
|-- person: string (nullable = false)
|-- recommendation: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- rating: float (nullable = true)
Assuming ID doesn't duplicate in each array, you can try the following:
import pyspark.sql.functions as f
df.withColumn('recommendation', f.explode('recommendation'))\
.withColumn('ID', f.col('recommendation').getItem('ID'))\
.withColumn('rating', f.col('recommendation').getItem('rating'))\
.groupby('person')\
.pivot('ID')\
.agg(f.first('rating')).show()
+------+---+---+---+
|person| a| b| c|
+------+---+---+---+
| xyz|0.4|0.3|0.3|
| abc|0.5|0.3|0.2|
| def|0.3|0.2|0.5|
+------+---+---+---+
Or transform with RDD:
from pyspark.sql import Row
# Build one Row per person, with one keyword field per recommended ID.
df.rdd.map(lambda r: Row(
    person=r.person, **{s.ID: s.rating for s in r.recommendation})
).toDF().show()
+------+-------------------+-------------------+-------------------+
|person| a| b| c|
+------+-------------------+-------------------+-------------------+
| abc| 0.5|0.30000001192092896|0.20000000298023224|
| def|0.30000001192092896|0.20000000298023224| 0.5|
| xyz| 0.4000000059604645|0.30000001192092896|0.30000001192092896|
+------+-------------------+-------------------+-------------------+
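The two outputs hold the same values and differ only in displayed precision. The rating field is a 32-bit float in the schema, while Python floats are 64-bit, so in the RDD version the ratings are presumably re-inferred as doubles by toDF(). A small sketch with the standard struct module (as_float32 is a hypothetical helper) reproduces the long digits:

```python
import struct


def as_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it back."""
    return struct.unpack("f", struct.pack("f", x))[0]


print(as_float32(0.3))  # 0.30000001192092896
print(as_float32(0.5))  # 0.5 (exactly representable in binary)
```

If the shorter display matters, casting the pivoted columns back to float or rounding them should restore it.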

Apply a threshold on column values in a PySpark DataFrame and convert the values to binary 0 or 1

I have a PySpark dataframe
simpleData = [("person0",10, 10), \
("person1",1, 1), \
("person2",1, 0), \
("person3",5, 1), \
]
columns= ["persons_name","A", 'B']
exp = spark.createDataFrame(data = simpleData, schema = columns)
exp.printSchema()
exp.show()
It looks like
root
|-- persons_name: string (nullable = true)
|-- A: long (nullable = true)
|-- B: long (nullable = true)
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 10| 10|
| person1| 1| 1|
| person2| 1| 0|
| person3| 5| 1|
+------------+---+---+
Now I want to apply a threshold of 2 to the values of columns A and B, such that any value below the threshold becomes 0 and any value above it becomes 1.
The final result should look something like-
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 1| 1|
| person1| 0| 0|
| person2| 0| 0|
| person3| 1| 0|
+------------+---+---+
How can I achieve this?
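import pyspark.sql.functions as F
threshold = 2
# Compare each column to the threshold and cast the boolean result to 0/1.
exp.select(
    'persons_name',
    *[(F.col(col) > F.lit(threshold)).cast('int').alias(col) for col in ['A', 'B']]
).show()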

Pyspark - Add Rows By Group

In Pyspark 2.2 I am essentially trying to add rows by user.
If I have my main Dataframe that looks like:
import pandas as pd
main_list = [["a","bb",5], ["d","cc",10],["d","bb",11]]
main_pd = pd.DataFrame(main_list, columns = ['user',"group", 'value'])
main_df = spark.createDataFrame(main_pd)
main_df.show()
+----+-----+-----+
|user|group|value|
+----+-----+-----+
| a| bb| 5|
| d| cc| 10|
| d| bb| 11|
+----+-----+-----+
I then have a key DataFrame, and I would like every user to have a row for every group value.
User d has rows for groups bb and cc. I would like user a to have the same.
key_list = [["bb",10],["cc",17]]
key_pd = pd.DataFrame(key_list, columns = ['group', 'value'])
key_df = spark.createDataFrame(key_pd)
main_df.join(key_df, ["group"], how ="outer").show()
But my result returns:
+-----+----+-----+-----+
|group|user|value|value|
+-----+----+-----+-----+
| cc| d| 10| 17|
| bb| a| 5| 10|
| bb| d| 11| 10|
+-----+----+-----+-----+
Here are the schemas of each Dataframe:
main_df.printSchema()
root
|-- user: string (nullable = true)
|-- group: string (nullable = true)
|-- value: long (nullable = true)
key_df.printSchema()
root
|-- group: string (nullable = true)
|-- value: long (nullable = true)
Essentially I would like the result to be:
+-----+----+-----+-----+
|group|user|value|value|
+-----+----+-----+-----+
| cc| d| 10| 17|
| bb| a| 5| 10|
| cc| a| Null| 17|
| bb| d| 11| 10|
+-----+----+-----+-----+
I don't think the full outer join will accomplish this with a coalesce, so I had also experimented with row_number/rank.
Get all the user-group combinations with a cross join, then left join that with main_df to generate the missing rows, and finally left join the result with key_df.
users = main_df.select("user").distinct()
groups = main_df.select("group").distinct()
user_group = users.crossJoin(groups)
all_combs = user_group.join(main_df, (main_df.user == user_group.user) & (main_df.group == user_group.group), "left").select(user_group.user,user_group.group,main_df.value)
all_combs.join(key_df, key_df.group == all_combs.group, "left").show()

pyspark nested columns in a string

I am working with PySpark. I have a DataFrame loaded from csv that contains the following schema:
root
|-- id: string (nullable = true)
|-- date: date (nullable = true)
|-- users: string (nullable = true)
If I show the first two rows it looks like:
+---+----------+---------------------------------------------------+
| id| date|users |
+---+----------+---------------------------------------------------+
| 1|2017-12-03|{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]} |
| 2|2017-12-04|{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]} |
+---+----------+---------------------------------------------------+
I would like to create a new DataFrame that contains the 'users' string broken out by each element. I would like something similar to
id user_id user_product
1 1 xxx
1 1 yyy
1 1 zzz
1 2 aaa
1 2 bbb
1 3 <null>
2 1 uuu
etc...
I have tried many approaches but can't seem to get it working.
The closest I can get is defining the schema such as the following and creating a new df applying schema using from_json:
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

userSchema = StructType([
    StructField("user_id", StringType()),
    StructField("product_list", StructType([
        StructField("product", StringType())
    ]))
])
user_df = in_csv.select('id', from_json(in_csv.users, userSchema).alias("test"))
This returns the correct schema:
root
|-- id: string (nullable = true)
|-- test: struct (nullable = true)
| |-- user_id: string (nullable = true)
| |-- product_list: struct (nullable = true)
| | |-- product: string (nullable = true)
but when I show any part of the 'test' struct it returns nulls instead of values e.g.
user_df.select('test.user_id').show()
returns test.user_id :
+-------+
|user_id|
+-------+
| null|
| null|
+-------+
Maybe I shouldn't be using from_json, as the users string is not pure JSON. Any suggestions on an approach I could take?
The schema should conform to the shape of the data. Unfortunately, from_json supports only StructType(...) or ArrayType(StructType(...)), which won't be useful here unless you can guarantee that all records have the same set of keys.
Instead, you can use a UserDefinedFunction:
import json
from pyspark.sql.functions import explode, udf
df = spark.createDataFrame([
(1, "2017-12-03", """{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]}"""),
(2, "2017-12-04", """{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]}""")],
("id", "date", "users")
)
@udf("map<string, array<string>>")
def parse(s):
    # Return None (null in Spark) when the string is missing or not valid JSON.
    try:
        return json.loads(s)
    except (TypeError, ValueError):
        return None

(df
    .select("id", "date",
            explode(parse("users")).alias("user_id", "user_product"))
    .withColumn("user_product", explode("user_product"))
    .show())
# +---+----------+-------+------------+
# | id| date|user_id|user_product|
# +---+----------+-------+------------+
# | 1|2017-12-03| 1| xxx|
# | 1|2017-12-03| 1| yyy|
# | 1|2017-12-03| 1| zzz|
# | 1|2017-12-03| 2| aaa|
# | 1|2017-12-03| 2| bbb|
# | 2|2017-12-04| 1| uuu|
# | 2|2017-12-04| 1| yyy|
# | 2|2017-12-04| 1| zzz|
# | 2|2017-12-04| 2| aaa|
# +---+----------+-------+------------+
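Note that user 3 is absent from the output above because explode skips empty arrays, whereas the question asked for a row with a null product. As far as I know, explode_outer (in pyspark.sql.functions since Spark 2.3) keeps such rows. The intended row shape can be sketched in plain Python; flatten_users is a hypothetical helper, and it assumes each users string decodes to an object whose values are lists.

```python
import json


def flatten_users(row_id, users_json):
    """Return (id, user_id, user_product) tuples for one row.

    A user with an empty product list yields a single tuple with None,
    mirroring what an outer explode would produce.
    """
    try:
        users = json.loads(users_json)
    except (TypeError, ValueError):
        # Missing or malformed JSON produces no rows.
        return []
    rows = []
    for user_id, products in users.items():
        for product in products or [None]:
            rows.append((row_id, user_id, product))
    return rows


flat = flatten_users(1, '{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]}')
for row in flat:
    print(row)
```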
You don't need to use from_json. You have to explode twice: once for user_id and once for users.
import pyspark.sql.functions as F
df = sql.createDataFrame([
(1,'2017-12-03',{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]} ),
(2,'2017-12-04',{"1":["uuu","yyy","zzz"],"2":["aaa"], "3":[]} )],
['id','date','users']
)
df = df.select('id','date',F.explode('users').alias('user_id','users'))\
.select('id','date','user_id',F.explode('users').alias('users'))
df.show()
+---+----------+-------+-----+
| id| date|user_id|users|
+---+----------+-------+-----+
| 1|2017-12-03| 1| xxx|
| 1|2017-12-03| 1| yyy|
| 1|2017-12-03| 1| zzz|
| 1|2017-12-03| 2| aaa|
| 1|2017-12-03| 2| bbb|
| 2|2017-12-04| 1| uuu|
| 2|2017-12-04| 1| yyy|
| 2|2017-12-04| 1| zzz|
| 2|2017-12-04| 2| aaa|
+---+----------+-------+-----+
