Multiple criteria for aggregation on a PySpark DataFrame - apache-spark

I have a pySpark dataframe that looks like this:
+-------------+----------+
|          sku|      date|
+-------------+----------+
|MLA-603526656|02/09/2016|
|MLA-603526656|01/09/2016|
|MLA-604172009|02/10/2016|
|MLA-605470584|02/09/2016|
|MLA-605502281|02/10/2016|
|MLA-605502281|02/09/2016|
+-------------+----------+
I want to group by sku, and then calculate the min and max dates. If I do this:
df_testing.groupBy('sku') \
    .agg({'date': 'min', 'date': 'max'}) \
    .limit(10) \
    .show()
I only get the sku and max(date) columns, because the duplicate 'date' key in the dict overwrites the first entry (Pandas behaves the same way with a plain dict). In Pandas I would normally do the following to get the results I want:
df_testing.groupBy('sku') \
    .agg({'date': ['min', 'max']}) \
    .limit(10) \
    .show()
However, in PySpark this does not work, and I get a java.util.ArrayList cannot be cast to java.lang.String error. Could anyone please point me to the correct syntax?
Thanks.

You cannot use a dict here: a Python dict cannot hold the same key twice, and the dict form of agg maps each column name to a single aggregate function name, not a list. Use explicit aggregate expressions instead:
>>> from pyspark.sql import functions as F
>>>
>>> df_testing.groupBy('sku').agg(F.min('date'), F.max('date'))
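If you also want readable column names instead of min(date) and max(date), each aggregate can be aliased. A minimal sketch (the alias names here are just illustrative):
>>> (df_testing
...     .groupBy('sku')
...     .agg(F.min('date').alias('min_date'),   # earliest date per sku
...          F.max('date').alias('max_date'))   # latest date per sku
...     .show())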

Related

Pyspark: Long to wide Format and format based on Column Value

I want to bring a pyspark DataFrame from long to wide format and cast the resulting columns based on the data type named in a given column.
Example:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
from pyspark.sql.functions import *
data = [
    ("BMK", "++FT001+TL001-MA11", "String", "2021-06-07"),
    ("RPM", "0", "Int16", "2021-06-07"),
    ("ACT_CURRENT", "-1330", "Int16", "2021-06-07")
]
schema = StructType([
    StructField("key", StringType(), True),
    StructField("value", StringType(), True),
    StructField("dataType", StringType(), True),
    StructField("timestamp", StringType(), True)])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
+-----------+------------------+--------+----------+
|        key|             value|dataType| timestamp|
+-----------+------------------+--------+----------+
|        BMK|++FT001+TL001-MA11|  String|2021-06-07|
|        RPM|                 0|   Int16|2021-06-07|
|ACT_CURRENT|             -1330|   Int16|2021-06-07|
+-----------+------------------+--------+----------+
Column dataType holds the desired datatype.
Outcome should look like this:
+----------+-----------+------------------+---+
| timestamp|ACT_CURRENT|               BMK|RPM|
+----------+-----------+------------------+---+
|2021-06-07|      -1330|++FT001+TL001-MA11|  0|
+----------+-----------+------------------+---+
The fields "ACT_CURRENT", "BMK" and "RPM" should have the right datatypes (Int16/String/Int16).
There is only one entry per timestamp.
What I have so far only widens the DF - it does not cast the columns to the right data types:
df_wide = (df.groupBy("timestamp").pivot("key").agg(first('value')))
Help is much appreciated!
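One possible way to also get the casting, sketched under the assumption that the key/dataType pairs are few enough to collect to the driver and that Int16 should map to Spark's smallint:
from pyspark.sql import functions as F

# Pivot first, then cast each resulting column according to the dataType recorded for its key.
df_wide = df.groupBy("timestamp").pivot("key").agg(F.first("value"))

# Look up each key's declared type and translate it to a Spark SQL type name.
type_map = {"String": "string", "Int16": "smallint"}
key_types = {r["key"]: r["dataType"]
             for r in df.select("key", "dataType").distinct().collect()}

df_wide = df_wide.select(
    "timestamp",
    *[F.col(k).cast(type_map.get(t, "string")).alias(k) for k, t in key_types.items()])
df_wide.show()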

Select a next or previous record on a dataframe (PySpark)

I have a spark dataframe that has a list of timestamps (partitioned by uid, ordered by timestamp). Now, I'd like to query the dataframe to get either previous or next record.
df = myrdd.toDF().repartition("uid").sort(desc("timestamp"))
df.show()
+------------+-------------------+
|         uid|          timestamp|
+------------+-------------------+
|Peter_Parker|2020-09-19 02:14:40|
|Peter_Parker|2020-09-19 01:07:38|
|Peter_Parker|2020-09-19 00:04:39|
|Peter_Parker|2020-09-18 23:02:36|
|Peter_Parker|2020-09-18 21:58:40|
So for example if I were to query:
ts=datetime.datetime(2020, 9, 19, 0, 4, 39)
I want to get the previous record on (2020-09-18 23:02:36), and only that one.
How can I get the previous one?
It's possible to do this using withColumn() and a diff, but is there a smarter, more efficient way? I really don't need to calculate a diff for ALL events, since the data is already ordered; I just want the previous/next record.
You can use a filter and order by, and then limit the results to 1 row:
df2 = (df.filter("uid = 'Peter_Parker' and timestamp < timestamp('2020-09-19 00:04:39')")
         .orderBy('timestamp', ascending=False)
         .limit(1))
df2.show()
+------------+-------------------+
|         uid|          timestamp|
+------------+-------------------+
|Peter_Parker|2020-09-18 23:02:36|
+------------+-------------------+
Or by using row_number after filtering:
from pyspark.sql import Window
from pyspark.sql import functions as F
df1 = df.filter("timestamp < '2020-09-19 00:04:39'") \
    .withColumn("rn", F.row_number().over(Window.orderBy(F.desc("timestamp")))) \
    .filter("rn = 1").drop("rn")
df1.show()
#+------------+-------------------+
#|         uid|          timestamp|
#+------------+-------------------+
#|Peter_Parker|2020-09-18 23:02:36|
#+------------+-------------------+
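The next record works the same way: flip the comparison and the sort order in the first approach, for example:
df_next = (df.filter("uid = 'Peter_Parker' and timestamp > timestamp('2020-09-19 00:04:39')")
             .orderBy('timestamp', ascending=True)
             .limit(1))
df_next.show()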

How do I explode this column of type array json in a pyspark dataframe?

I am trying to explode this column into multiple columns, but it seems there is an issue with the datatype even though I have specified it to be an array datatype.
This is what the column looks like:
Column_x
[[{"Key":"a","Value":"40000.0"},{"Key":"b","Value":"0.0"},{"Key":"c","Value":"0.0"},{"Key":"f","Value":"false"},{"Key":"e","Value":"ADB"},{"Key":"d","Value":"true"}]]
[[{"Key":"a","Value":"100000.0"},{"Key":"b","Value":"1.5"},{"Key":"c","Value":"1.5"},{"Key":"d","Value":"false"},{"Key":"e","Value":"Rev30"},{"Key":"f","Value":"true"},{"Key":"g","Value":"48600.0"},{"Key":"g","Value":"0.0"},{"Key":"h","Value":"0.0"}],[{"Key":"i","Value":"100000.0"},{"Key":"j","Value":"1.5"},{"Key":"k","Value":"1.5"},{"Key":"l","Value":"false"},{"Key":"m","Value":"Rev30"},{"Key":"n","Value":"true"},{"Key":"o","Value":"48600.0"},{"Key":"p","Value":"0.0"},{"Key":"q","Value":"0.0"}]]
To something like this:
Key Value
a 10000
b 200000
.
.
.
.
a 100000.0
b 1.5
This is my work so far:
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col

schema = ArrayType(ArrayType(StructType([StructField("Key", StringType()),
                                         StructField("Value", StringType())])))
kn_sx = kn_s\
    .withColumn("Keys", F.explode((F.from_json("Column_x", schema))))\
    .withColumn("Key", col("Keys.Key"))\
    .withColumn("Values", F.explode((F.from_json("Column_x", schema))))\
    .withColumn("Value", col("Values.Value"))\
    .drop("Values")
Here is the error:
AnalysisException: u"cannot resolve 'jsontostructs(`Column_x`)' due to data type mismatch: argument 1 requires string type, however, '`Column_x`' is of array<array<struct<Key:string,Value:string>>> type
Really appreciate the help.
Refer to the documentation of get_json_object:
>>> data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
>>> df = spark.createDataFrame(data, ("key", "jstring"))
>>> df.select(df.key, get_json_object(df.jstring, '$.f1').alias("c0"), \
... get_json_object(df.jstring, '$.f2').alias("c1") ).collect()
[Row(key=u'1', c0=u'value1', c1=u'value2'), Row(key=u'2', c0=u'value12', c1=None)]
This is what I did to make it work:
# Took out a single array element
df = df.withColumn('Column_x', F.col('Column_x.MetaData.Parameters').getItem(0))
# can be modified for additional array elements
# Used Explode on the dataframe to make it work
df = df\
    .withColumn("Keys", F.explode(F.col("Column_x")))\
    .withColumn("Key", col("Keys.Key"))\
    .withColumn("Value", col("Keys.Value"))\
    .drop("Keys")\
    .dropDuplicates()
I hope this helps anyone looking for a solution to this problem.
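If the outer array can contain more than one inner list (as in the second sample row), getItem(0) keeps only the first one. A hedged alternative, assuming Spark 2.4+ (for flatten) and that Column_x is already of type array<array<struct<Key,Value>>> rather than a JSON string:
from pyspark.sql import functions as F

df_all = (df
          .withColumn("Keys", F.explode(F.flatten(F.col("Column_x"))))  # merge the nested arrays, then one row per struct
          .withColumn("Key", F.col("Keys.Key"))
          .withColumn("Value", F.col("Keys.Value"))
          .drop("Keys"))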

How to split a spark dataframe column of ArrayType(StructType) to multiple columns in pyspark?

I am reading XML using databricks spark-xml with the schema below. The subelement X_PAT can occur more than once; to handle this I have used ArrayType(StructType). The next transformation is to create multiple columns out of this single column.
<root_tag>
<id>fff9</id>
<X1000>
<X_PAT>
<X_PAT01>IC</X_PAT01>
<X_PAT02>EDISUPPORT</X_PAT02>
<X_PAT03>TE</X_PAT03>
</X_PAT>
<X_PAT>
<X_PAT01>IC1</X_PAT01>
<X_PAT02>EDISUPPORT1</X_PAT02>
<X_PAT03>TE1</X_PAT03>
</X_PAT>
</X1000>
</root_tag>
from pyspark.sql import SparkSession
from pyspark.sql.types import *
jar_path = "/Users/nsrinivas/com.databricks_spark-xml_2.10-0.4.1.jar"
spark = SparkSession.builder.appName("Spark - XML read").master("local[*]") \
    .config("spark.jars", jar_path) \
    .config("spark.executor.extraClassPath", jar_path) \
    .config("spark.executor.extraLibrary", jar_path) \
    .config("spark.driver.extraClassPath", jar_path) \
    .getOrCreate()
xml_schema = StructType()
xml_schema.add("id", StringType(), True)
x1000 = StructType([
    StructField("X_PAT",
                ArrayType(StructType([
                    StructField("X_PAT01", StringType()),
                    StructField("X_PAT02", StringType()),
                    StructField("X_PAT03", StringType())]))),
])
xml_schema.add("X1000", x1000, True)
df = spark.read.format("xml").option("rowTag", "root_tag").option("valueTag", False) \
.load("root_tag.xml", schema=xml_schema)
df.select("id", "X1000.X_PAT").show(truncate=False)
I get the output as below:
+------------+--------------------------------------------+
|id          |X_PAT                                       |
+------------+--------------------------------------------+
|fff9        |[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
+------------+--------------------------------------------+
but I want X_PAT to be flattened into multiple columns like below; then I will rename the columns.
+----+-------+--------+-------+-------+--------+-------+
|id  |X_PAT01|X_PAT02 |X_PAT03|X_PAT01|X_PAT02 |X_PAT03|
+----+-------+--------+-------+-------+--------+-------+
|fff9|IC1    |SUPPORT1|TE1    |IC2    |SUPPORT2|TE2    |
+----+-------+--------+-------+-------+--------+-------+
Then I would rename the new columns as below:
id|XPAT_1_01|XPAT_1_02|XPAT_1_03|XPAT_2_01|XPAT_2_02|XPAT_2_03|
I tried using X1000.X_PAT.* but it throws the error below:
pyspark.sql.utils.AnalysisException: 'Can only star expand struct data types. Attribute: ArrayBuffer(L_1000A, S_PER);'
Any ideas please?
Try this:
df = spark.createDataFrame([('1',[['IC1', 'SUPPORT1', 'TE1'],['IC2', 'SUPPORT2', 'TE2']]),('2',[['IC1', 'SUPPORT1', 'TE1'],['IC2','SUPPORT2', 'TE2']])],['id','X_PAT01'])
Define a function to parse the data
import pyspark.sql.functions as F

def create_column(df):
    # Collect the nested array from the first row, then add one literal column per element
    data = df.select('X_PAT01').collect()[0][0]
    for each_list in range(len(data)):
        for each_item in range(len(data[each_list])):
            df = df.withColumn('X_PAT_' + str(each_list) + '_0' + str(each_item),
                               F.lit(data[each_list][each_item]))
    return df
Calling it:
df = create_column(df)
This is a simple approach to horizontally explode array elements as per your requirement:
from pyspark.sql.functions import col

df2 = (df1
       .select('id',
               *(col('X_PAT')
                 .getItem(i)  # fetch the i-th nested array element
                 .getItem(j)  # fetch the individual string elements from each nested array element
                 .alias(f'X_PAT_{i+1}_{str(j+1).zfill(2)}')  # format the column alias
                 for i in range(2)    # outer loop over the nested arrays
                 for j in range(3)))  # inner loop over the elements of each nested array
      )
Input vs Output:
Input(df1):
+----+--------------------------------------------+
|id  |X_PAT                                       |
+----+--------------------------------------------+
|fff9|[[IC1, SUPPORT1, TE1], [IC2, SUPPORT2, TE2]]|
+----+--------------------------------------------+
Output(df2):
+----+----------+----------+----------+----------+----------+----------+
| id|X_PAT_1_01|X_PAT_1_02|X_PAT_1_03|X_PAT_2_01|X_PAT_2_02|X_PAT_2_03|
+----+----------+----------+----------+----------+----------+----------+
|fff9|       IC1|  SUPPORT1|       TE1|       IC2|  SUPPORT2|       TE2|
+----+----------+----------+----------+----------+----------+----------+
Although this involves for loops, the operations are performed directly on the DataFrame (without collecting to the driver or converting to an RDD), so you should not run into any issues.
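If the shape of X_PAT is not fixed, the hard-coded range(2)/range(3) bounds could be derived from the data instead. A small sketch (note the assumption that pulling one row to the driver is acceptable):
from pyspark.sql import functions as F

n_outer = df1.select(F.size("X_PAT").alias("n")).first()["n"]               # number of nested arrays
n_inner = df1.select(F.size(F.col("X_PAT")[0]).alias("n")).first()["n"]     # elements per nested array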

How to convert column of MapType(StringType, StringType) into StringType?

So I have this streaming dataframe and I'm trying to cast this 'customer_ids' column to a simple string.
schema = StructType()\
    .add("customer_ids", MapType(StringType(), StringType()))\
    .add("date", TimestampType())
original_sdf = spark.readStream.option("maxFilesPerTrigger", 800)\
    .load(path=source, format="parquet", schema=schema)\
    .select('customer_ids', 'date')
The intent of this conversion is to group by this column and aggregate by max(date), like this:
original_sdf.groupBy('customer_ids')\
    .agg(max('date')) \
    .writeStream \
    .trigger(once=True) \
    .format("memory") \
    .queryName('query') \
    .outputMode("complete") \
    .start()
but I got this exception
AnalysisException: u'expression `customer_ids` cannot be used as a grouping expression because its data type map<string,string> is not an orderable data type.
How can I cast this kind of streaming DataFrame column, or is there any other way to group by this column?
TL;DR Use the getItem method to access the values per key in a MapType column.
The real question is which key(s) you want to group by, since a MapType column can have a variety of keys. Every key can become a column holding the values from the map column.
You can access keys using the Column.getItem method (or similar Python voodoo):
getItem(key: Any): Column An expression that gets an item at position ordinal out of an array, or gets a value by key key in a MapType.
(I use Scala and leave converting it to pyspark as an exercise; a rough translation is sketched after the example.)
val ds = Seq(Map("hello" -> "world")).toDF("m")
scala> ds.show(false)
+-------------------+
|m                  |
+-------------------+
|Map(hello -> world)|
+-------------------+
scala> ds.select($"m".getItem("hello") as "hello").show
+-----+
|hello|
+-----+
|world|
+-----+
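A rough PySpark equivalent of the Scala snippet above (the tiny DataFrame is only for illustration):
>>> from pyspark.sql import functions as F
>>> ds = spark.createDataFrame([({"hello": "world"},)], ["m"])
>>> ds.select(F.col("m").getItem("hello").alias("hello")).show()
+-----+
|hello|
+-----+
|world|
+-----+
For the streaming question above, that means grouping by one or more extracted keys, e.g. original_sdf.groupBy(F.col('customer_ids').getItem('some_key')), where 'some_key' stands in for whichever map key you actually care about.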
