When analysing a data series, is it possible to group the data into equal-sized chunks on the basis of a non-time-related column?
Is there a way of splitting a single row whenever necessary (i.e. when an individual value is higher than the chunk size)?
For example:
root
|-- Datetime: timestamp (nullable = true)
|-- Quantity: integer (nullable = true)
+-------------------+--------+
| Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11| 200|
|2021-09-10 10:08:16| 300|
|2021-09-11 08:05:11| 200|
|2021-09-11 08:07:25| 100|
|2021-09-11 10:28:14| 700|
|2021-09-12 09:24:11| 1500|
|2021-09-12 09:25:00| 100|
|2021-09-13 09:25:00| 400|
+-------------------+--------+
Desired output (every 500 units):
root
|-- Starting Datetime: timestamp (nullable = true)
|-- Ending Datetime: timestamp (nullable = true)
|-- Quantity: integer (nullable = true)
|-- Duration(seconds): integer (nullable = true)
+-------------------+-------------------+--------+-----------+
| Starting Datetime | Ending Datetime |Quantity|Duration(s)|
+-------------------+-------------------+--------+-----------+
|2021-09-10 10:08:11|2021-09-10 10:08:16| 500| 5|
|2021-09-11 08:05:11|2021-09-11 10:28:14| 500| 8583|
|2021-09-11 10:28:14|2021-09-11 10:28:14| 500| 0|
|2021-09-12 09:24:11|2021-09-12 09:24:11| 500| 0|
|2021-09-12 09:24:11|2021-09-12 09:24:11| 500| 0|
|2021-09-12 09:24:11|2021-09-12 09:24:11| 500| 0|
|2021-09-12 09:25:00|2021-09-13 09:25:00| 500| 86400|
+-------------------+-------------------+--------+-----------+
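For reference, a minimal sketch of one possible approach (not from the original post; the dataframe below simply recreates the example above): explode every row into one record per unit, number the units in Datetime order, and bucket each run of 500 consecutive units into a chunk. The per-unit explode and the global window make this suitable only for modest quantities.
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
chunk_size = 500

# Recreate the example input
df = spark.createDataFrame(
    [("2021-09-10 10:08:11", 200), ("2021-09-10 10:08:16", 300),
     ("2021-09-11 08:05:11", 200), ("2021-09-11 08:07:25", 100),
     ("2021-09-11 10:28:14", 700), ("2021-09-12 09:24:11", 1500),
     ("2021-09-12 09:25:00", 100), ("2021-09-13 09:25:00", 400)],
    ["Datetime", "Quantity"],
).withColumn("Datetime", F.to_timestamp("Datetime"))

# One row per unit, numbered globally in Datetime order (the window has no
# partitioning, so all rows are pulled into a single partition)
units = (df
    .withColumn("unit", F.explode(F.sequence(F.lit(1), F.col("Quantity"))))
    .withColumn("unit_idx",
                F.row_number().over(Window.orderBy("Datetime", "unit")) - 1)
    .withColumn("chunk", F.floor(F.col("unit_idx") / chunk_size)))

# Every block of chunk_size consecutive units becomes one output row
result = (units.groupBy("chunk")
    .agg(F.min("Datetime").alias("Starting Datetime"),
         F.max("Datetime").alias("Ending Datetime"),
         F.count(F.lit(1)).alias("Quantity"))
    .withColumn("Duration(s)",
                F.col("Ending Datetime").cast("long")
                - F.col("Starting Datetime").cast("long"))
    .orderBy("chunk"))

result.show(truncate=False)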
I have a DataFrame that looks like this:
+--------------------+----+-----+---+----+
| action |year|month|day|hour|
+--------------------+----+-----+---+----+
|check in |2022| 4| 3| 23|
|go shopping |2022| 4| 4| 11|
|go eat |2022| 4| 5| 12|
|go watch a movie |2022| 4| 6| 14|
|go out for a drink |2022| 4| 6| 18|
+--------------------+----+-----+---+----+
root
|-- action: string (nullable = false)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- hour: integer (nullable = true)
When I save this dataframe to disk, I would like the table to be partitioned over the datetime fields, i.e.:
df.write.format("parquet").partitionBy(
    "year", "month", "day", "hour"
).save(path)
I would also like to be able to add a padded zero to the day, month, and hour partition folder names, so that they look like this:
table/year=2022/month=09/...
and not this:
table/year=2022/month=9/...
Is it possible to do so in Pyspark, in a way that isn't intrusive to the table, and doesn't involve changing the table schema?
Thank you for your time.
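For what it's worth, a commonly seen workaround (a hedged sketch, reusing df and path from the question; it is not necessarily compatible with the no-schema-change requirement above, since it swaps the integer partition columns for strings) is to derive zero-padded string columns before writing:
import pyspark.sql.functions as F

# Replace the integer partition columns with zero-padded string versions so
# the folder names come out as month=09 rather than month=9
padded = (df
    .withColumn("month", F.format_string("%02d", "month"))
    .withColumn("day", F.format_string("%02d", "day"))
    .withColumn("hour", F.format_string("%02d", "hour")))

padded.write.format("parquet").partitionBy(
    "year", "month", "day", "hour"
).save(path)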
There is a dataframe, shown as follows, which has two columns.
df.show()
| Time| MinTime|
|2019-11-19 23:00:...|2019-11-19 23:00:...|
|2019-11-19 23:15:...|2019-11-19 23:00:...|
|2019-11-19 23:30:...|2019-11-19 23:00:...|
root
|-- Time: string (nullable = true)
|-- MinTime: string (nullable = true)
df.show(truncate=False)
| Time| MinTime|
|2019-11-19 23:00:000000|2019-11-19 23:00:000000|
|2019-11-19 23:15:000000|2019-11-19 23:00:000000|
|2019-11-19 23:30:000000|2019-11-19 23:00:000000|
After I use the following line of code to process the above columns, the values in the Offset column are all null. Based on the values in Time and MinTime, the difference should not be null for any of the rows. May I know the reason for this?
df= df.withColumn('Offset',((col('Time').cast('long') - col('MinTime').cast('long'))))
df.show()
| Time| MinTime| Offset|
|2019-11-19 23:00:...|2019-11-19 23:00:...| null|
|2019-11-19 23:15:...|2019-11-19 23:00:...| null|
|2019-11-19 23:30:...|2019-11-19 23:00:...| null|
df.printSchema()
root
|-- Time: string (nullable = true)
|-- MinTime: string (nullable = true)
|-- Offset: long (nullable = true)
df.show(truncate=False)
| Time| MinTime| Offset|
|2019-11-19 23:00:000000|2019-11-19 23:00:000000| null|
|2019-11-19 23:15:000000|2019-11-19 23:00:000000| null|
|2019-11-19 23:30:000000|2019-11-19 23:00:000000| null|
Please check the schema of your df: if the columns are of type string, they have to be converted to timestamp first.
You can use the to_timestamp function to convert them:
from pyspark.sql import functions as f

date_format = 'yyyy-MM-dd HH:mm:ss'
df.withColumn('Offset',
              (f.to_timestamp('Time', date_format).cast('long')
               - f.to_timestamp('MinTime', date_format).cast('long'))) \
  .show(truncate=False)
Result:
+-------------------+-------------------+------+
|Time |MinTime |Offset|
+-------------------+-------------------+------+
|2019-11-19 23:00:00|2019-11-19 23:00:00|0 |
|2019-11-19 23:15:00|2019-11-19 23:00:00|900 |
|2019-11-19 23:30:00|2019-11-19 23:00:00|1800 |
+-------------------+-------------------+------+
Please make sure to use the correct date format.
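As a quick sanity check (not part of the original answer): if the pattern does not match the raw strings, to_timestamp itself returns null, and that is what propagates into Offset. Something like the following, using the same df and date_format as above, shows whether the parse succeeds:
# If 'parsed' comes back null, the date_format pattern does not match the
# strings, and the Offset computation will be null as well
df.select('Time', f.to_timestamp('Time', date_format).alias('parsed')).show(truncate=False)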
I have a PySpark df whose schema looks like this:
root
|-- company: struct (nullable = true)
| |-- 0: long (nullable = true)
| |-- 1: long (nullable = true)
| |-- 10: long (nullable = true)
| |-- 100: long (nullable = true)
| |-- 101: long (nullable = true)
| |-- 102: long (nullable = true)
| |-- 103: long (nullable = true)
| |-- 104: long (nullable = true)
| |-- 105: long (nullable = true)
| |-- 106: long (nullable = true)
| |-- 107: long (nullable = true)
| |-- 108: long (nullable = true)
| |-- 109: long (nullable = true)
How do I convert all these fields to String in PySpark?
I have tried this with my own test dataset; check if it works for you. The answer is inspired by Pyspark - Looping through structType and ArrayType to do typecasting in the structfield; refer to it for more details.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Create a test data frame
tst = sqlContext.createDataFrame(
    [(1,1,2,11),(1,3,4,12),(1,5,6,13),(1,7,8,14),(2,9,10,15),(2,11,12,16),(2,13,14,17)],
    schema=['col1','col2','x','y'])
tst_struct = tst.withColumn("str_col", F.struct('x','y'))
old_schema = tst_struct.schema

# Function to transform a struct schema so that every field becomes a string
def transform(schema):
    res = []
    for f in schema.fields:
        res.append(StructField(f.name, StringType(), f.nullable))
    return StructType(res)

# Traverse the existing schema and rewrite it wherever a struct type is encountered
new_schema = []
for f in old_schema.fields:
    if isinstance(f.dataType, StructType):
        new_schema.append(StructField(f.name, transform(f.dataType), f.nullable))
    else:
        new_schema.append(StructField(f.name, f.dataType, f.nullable))

# Cast the dataframe columns to the new schema
tst_trans = tst_struct.select([F.col(f.name).cast(f.dataType) for f in new_schema])
This is the schema of the test dataset:
tst_struct.printSchema()
root
|-- col1: long (nullable = true)
|-- col2: long (nullable = true)
|-- x: long (nullable = true)
|-- y: long (nullable = true)
|-- str_col: struct (nullable = false)
| |-- x: long (nullable = true)
| |-- y: long (nullable = true)
This is the transformed schema
tst_trans.printSchema()
root
|-- col1: long (nullable = true)
|-- col2: long (nullable = true)
|-- x: long (nullable = true)
|-- y: long (nullable = true)
|-- str_col: struct (nullable = false)
| |-- x: string (nullable = true)
| |-- y: string (nullable = true)
If you need to explode the struct column into separate columns, you can do the below (refer: How to unwrap nested Struct column into multiple columns?):
tst_exp = tst_trans.select(tst_trans.columns + [F.col('str_col.*')])
So, finally:
tst_exp.show()
+----+----+---+---+--------+---+---+
|col1|col2| x| y| str_col| x| y|
+----+----+---+---+--------+---+---+
| 1| 1| 2| 11| [2, 11]| 2| 11|
| 1| 3| 4| 12| [4, 12]| 4| 12|
| 1| 5| 6| 13| [6, 13]| 6| 13|
| 1| 7| 8| 14| [8, 14]| 8| 14|
| 2| 9| 10| 15|[10, 15]| 10| 15|
| 2| 11| 12| 16|[12, 16]| 12| 16|
| 2| 13| 14| 17|[14, 17]| 14| 17|
+----+----+---+---+--------+---+---+
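As a side note (not part of the original answer): when the struct's field names are known up front, the same conversion can, at least in recent Spark versions, be written as a single cast with a DDL type string, e.g.:
# Hypothetical one-liner equivalent for the test struct above
tst_str = tst_struct.withColumn("str_col", F.col("str_col").cast("struct<x:string,y:string>"))
tst_str.printSchema()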
I am new to PySpark and I need to explode my array of values in such a way that each value gets assigned to a new column. I tried using explode but I couldn't get the desired output. Below is my output:
+---------------+----------+------------------+----------+---------+------------+--------------------+
|account_balance|account_id|credit_Card_Number|first_name|last_name|phone_number| transactions|
+---------------+----------+------------------+----------+---------+------------+--------------------+
| 100000| 12345| 12345| abc| xyz| 1234567890|[1000, 01/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[1100, 02/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[6146, 02/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[253, 03/06/2020,...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[4521, 04/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[955, 05/06/2020,...|
+---------------+----------+------------------+----------+---------+------------+--------------------+
Below is the schema of the dataframe:
root
|-- account_balance: long (nullable = true)
|-- account_id: long (nullable = true)
|-- credit_Card_Number: long (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- phone_number: long (nullable = true)
|-- transactions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- amount: long (nullable = true)
| | |-- date: string (nullable = true)
| | |-- shop: string (nullable = true)
| | |-- transaction_code: string (nullable = true)
I want an output in which I have additional columns amount, date, shop, and transaction_code with their respective values:
amount date shop transaction_code
1000 01/06/2020 amazon buy
1100 02/06/2020 amazon sell
6146 02/06/2020 ebay buy
253 03/06/2020 ebay buy
4521 04/06/2020 amazon buy
955 05/06/2020 amazon buy
Use explode, then split out the struct fields, and finally drop the newly exploded col column and the transactions array column.
Example:
from pyspark.sql.functions import *
#got only some columns from json
df.printSchema()
#root
# |-- account_balance: long (nullable = true)
# |-- transactions: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- amount: long (nullable = true)
# | | |-- date: string (nullable = true)
df.selectExpr("*","explode(transactions)").select("*","col.*").drop(*['col','transactions']).show()
#+---------------+------+--------+
#|account_balance|amount| date|
#+---------------+------+--------+
#| 10| 1000|20200202|
#+---------------+------+--------+
I am working with PySpark. I have a DataFrame loaded from csv that contains the following schema:
root
|-- id: string (nullable = true)
|-- date: date (nullable = true)
|-- users: string (nullable = true)
If I show the first two rows it looks like:
+---+----------+---------------------------------------------------+
| id| date|users |
+---+----------+---------------------------------------------------+
| 1|2017-12-03|{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]} |
| 2|2017-12-04|{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]} |
+---+----------+---------------------------------------------------+
I would like to create a new DataFrame that contains the users string broken out by each element. I would like something similar to
id user_id user_product
1 1 xxx
1 1 yyy
1 1 zzz
1 2 aaa
1 2 bbb
1 3 <null>
2 1 uuu
etc...
I have tried many approaches but can't seem to get it working.
The closest I can get is defining a schema such as the following and creating a new df by applying it with from_json:
userSchema = StructType([
    StructField("user_id", StringType()),
    StructField("product_list", StructType([
        StructField("product", StringType())
    ]))
])
user_df = in_csv.select('id', from_json(in_csv.users, userSchema).alias("test"))
This returns the correct schema:
root
|-- id: string (nullable = true)
|-- test: struct (nullable = true)
| |-- user_id: string (nullable = true)
| |-- product_list: struct (nullable = true)
| | |-- product: string (nullable = true)
but when I show any part of the 'test' struct it returns nulls instead of values, e.g.
user_df.select('test.user_id').show()
returns:
+-------+
|user_id|
+-------+
| null|
| null|
+-------+
Maybe I shouldn't be using from_json, as the users string is not pure JSON. Any suggestions on an approach I could take?
The schema should conform to the shape of the data. Unfortunately from_json supports only StructType(...) or ArrayType(StructType(...)), which won't be useful here unless you can guarantee that all records have the same set of keys.
Instead, you can use a UserDefinedFunction:
import json
from pyspark.sql.functions import explode, udf
df = spark.createDataFrame([
    (1, "2017-12-03", """{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]}"""),
    (2, "2017-12-04", """{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]}""")],
    ("id", "date", "users")
)
@udf("map<string, array<string>>")
def parse(s):
    try:
        return json.loads(s)
    except:
        pass

(df
    .select("id", "date",
            explode(parse("users")).alias("user_id", "user_product"))
    .withColumn("user_product", explode("user_product"))
    .show())
# +---+----------+-------+------------+
# | id| date|user_id|user_product|
# +---+----------+-------+------------+
# | 1|2017-12-03| 1| xxx|
# | 1|2017-12-03| 1| yyy|
# | 1|2017-12-03| 1| zzz|
# | 1|2017-12-03| 2| aaa|
# | 1|2017-12-03| 2| bbb|
# | 2|2017-12-04| 1| uuu|
# | 2|2017-12-04| 1| yyy|
# | 2|2017-12-04| 1| zzz|
# | 2|2017-12-04| 2| aaa|
# +---+----------+-------+------------+
You don't need to use from_json. You have to explode twice: once for user_id and once for users.
import pyspark.sql.functions as F
df = spark.createDataFrame([
    (1, '2017-12-03', {"1": ["xxx", "yyy", "zzz"], "2": ["aaa", "bbb"], "3": []}),
    (2, '2017-12-04', {"1": ["uuu", "yyy", "zzz"], "2": ["aaa"], "3": []})],
    ['id', 'date', 'users']
)
df = df.select('id', 'date', F.explode('users').alias('user_id', 'users'))\
       .select('id', 'date', 'user_id', F.explode('users').alias('users'))
df.show()
+---+----------+-------+-----+
| id| date|user_id|users|
+---+----------+-------+-----+
| 1|2017-12-03| 1| xxx|
| 1|2017-12-03| 1| yyy|
| 1|2017-12-03| 1| zzz|
| 1|2017-12-03| 2| aaa|
| 1|2017-12-03| 2| bbb|
| 2|2017-12-04| 1| uuu|
| 2|2017-12-04| 1| yyy|
| 2|2017-12-04| 1| zzz|
| 2|2017-12-04| 2| aaa|
+---+----------+-------+-----+