Given a column of dates, how can I create a column containing the ISO week date?
The ISO week date is composed of the year, the week number and the weekday.
The year is not necessarily the same as the year obtained using the year function.
The week number is the easy part: it can be obtained using weekofyear.
The weekday should be 1 for Monday and 7 for Sunday, which Spark's dayofweek cannot produce (it numbers the days from 1 for Sunday to 7 for Saturday).
Example dataframe:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
('1977-12-31',),
('1978-01-01',),
('1978-01-02',),
('1978-12-31',),
('1979-01-01',),
('1979-12-30',),
('1979-12-31',),
('1980-01-01',)],
['my_date']
).select(F.col('my_date').cast('date'))
df.show()
#+----------+
#| my_date|
#+----------+
#|1977-12-31|
#|1978-01-01|
#|1978-01-02|
#|1978-12-31|
#|1979-01-01|
#|1979-12-30|
#|1979-12-31|
#|1980-01-01|
#+----------+
Desired result:
+----------+-------------+
| my_date|iso_week_date|
+----------+-------------+
|1977-12-31| 1977-W52-6|
|1978-01-01| 1977-W52-7|
|1978-01-02| 1978-W01-1|
|1978-12-31| 1978-W52-7|
|1979-01-01| 1979-W01-1|
|1979-12-30| 1979-W52-7|
|1979-12-31| 1980-W01-1|
|1980-01-01| 1980-W01-2|
+----------+-------------+
Spark SQL extract makes this much easier.
iso_year = F.expr("EXTRACT(YEAROFWEEK FROM my_date)")
iso_weekday = F.expr("EXTRACT(DAYOFWEEK_ISO FROM my_date)")
So, building off of the other answers with the use of concat_ws:
import pyspark.sql.functions as F
df.withColumn(
'iso_week_date',
F.concat_ws(
"-",
F.expr("EXTRACT(YEAROFWEEK FROM my_date)"),
F.lpad(F.weekofyear('my_date'), 3, "W0"),
F.expr("EXTRACT(DAYOFWEEK_ISO FROM my_date)")
)
).show()
#+----------+-------------+
#| my_date|iso_week_date|
#+----------+-------------+
#|1977-12-31| 1977-W52-6|
#|1978-01-01| 1977-W52-7|
#|1978-01-02| 1978-W01-1|
#|1978-12-31| 1978-W52-7|
#|1979-01-01| 1979-W01-1|
#|1979-12-30| 1979-W52-7|
#|1979-12-31| 1980-W01-1|
#|1980-01-01| 1980-W01-2|
#+----------+-------------+
Your solution is already nice, maybe you could shorten it by simplifying the calculations:
iso_weekday = (dayofweek(my_date) + 5)%7 + 1
iso_year= year(date_add(my_date, 4 - iso_weekday))
Which gives you:
import pyspark.sql.functions as F
df.withColumn(
'iso_week_date',
F.concat_ws(
"-",
F.year(F.expr("date_add(my_date, 4 - ((dayofweek(my_date) + 5) % 7 + 1))")),
F.lpad(F.weekofyear('my_date'), 3, "W0"),
(F.dayofweek('my_date') + 5) % 7 + 1
)
).show()
#+----------+-------------+
#| my_date|iso_week_date|
#+----------+-------------+
#|1977-12-31| 1977-W52-6|
#|1978-01-01| 1977-W52-7|
#|1978-01-02| 1978-W01-1|
#|1978-12-31| 1978-W52-7|
#|1979-01-01| 1979-W01-1|
#|1979-12-30| 1979-W52-7|
#|1979-12-31| 1980-W01-1|
#|1980-01-01| 1980-W01-2|
#+----------+-------------+
First, one can define conditional rules for the ISO year and ISO weekday columns. Then, concatenate them with the week number using concat_ws and lpad.
week_from_prev_year = (F.month('my_date') == 1) & (F.weekofyear('my_date') > 9)
week_from_next_year = (F.month('my_date') == 12) & (F.weekofyear('my_date') == 1)
iso_year = F.when(week_from_prev_year, F.year('my_date') - 1) \
.when(week_from_next_year, F.year('my_date') + 1) \
.otherwise(F.year('my_date'))
iso_weekday = F.when(F.dayofweek('my_date') != 1, F.dayofweek('my_date')-1).otherwise(7)
iso_week_date = F.concat_ws('-', iso_year, F.lpad(F.weekofyear('my_date'), 3, 'W0'), iso_weekday)
df2 = df.withColumn('iso_week_date', iso_week_date)
df2.show()
#+----------+-------------+
#| my_date|iso_week_date|
#+----------+-------------+
#|1977-12-31| 1977-W52-6|
#|1978-01-01| 1977-W52-7|
#|1978-01-02| 1978-W01-1|
#|1978-12-31| 1978-W52-7|
#|1979-01-01| 1979-W01-1|
#|1979-12-30| 1979-W52-7|
#|1979-12-31| 1980-W01-1|
#|1980-01-01| 1980-W01-2|
#+----------+-------------+
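As an optional sanity check (a sketch, not part of the answers above; the helper name iso_week_date_py is just for illustration), the same value can be computed with a Python UDF around datetime.date.isocalendar() and compared against any of the solutions on a small sample:
from pyspark.sql import functions as F, types as T

@F.udf(T.StringType())
def iso_week_date_py(d):
    # isocalendar() returns (ISO year, ISO week, ISO weekday)
    y, w, wd = d.isocalendar()
    return f"{y}-W{w:02d}-{wd}"

df.withColumn('iso_check', iso_week_date_py('my_date')).show()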
I have a simple operation to do in PySpark, but I need to run it with many different parameters. It just filters on one column, then groups by a different column and aggregates a third column. In Python, the function is:
from pyspark.sql.functions import col, max

def filter_gby_reduce(df, filter_col=None, filter_value=None):
    return df.filter(col(filter_col) == filter_value).groupby('ID').agg(max('Value'))
Let's say the different configurations are
func_params = spark.createDataFrame([('Day', 'Monday'), ('Month', 'January')], ['feature', 'filter_value'])
I could of course just run the functions one by one:
filter_gby_reduce(df, filter_col = 'Day', filter_value = 'Monday')
filter_gby_reduce(df, filter_col = 'Month', filter_value = 'January')
But my actual collection of parameters is much larger. Lastly, I also need to union all of the function results into one dataframe. So is there a way in Spark to write this more succinctly, in a way that fully takes advantage of parallelization?
One way of doing this is to generate the desired values as columns using when and max and pass these to agg. Since you want the values unioned, you then have to unpivot the result using stack (there is no DataFrame API for that, so a selectExpr is used). Depending on your dataset you might get null if a filter excludes all data; those rows can be dropped if needed.
I'd recommend testing this against the 'naive' approach of simply unioning a large number of filtered dataframes (a sketch of that approach follows the output below).
import pyspark.sql.functions as f
func_params = [('Day', 'Monday'), ('Month', 'January')]
df = spark.createDataFrame([
('Monday', 'June', 1, 5),
('Monday', 'January', 1, 2),
('Monday', 'June', 1, 5),
('Monday', 'June', 2, 10)],
['Day', 'Month', 'ID', 'Value'])
cols = []
for column, flt in func_params:
    name = f'{column}_{flt}'
    val = f.when(f.col(column) == flt, f.col('Value')).otherwise(None)
    cols.append(f.max(val).alias(name))
stack = f"stack({len(cols)}," + ','.join(f"'{column}_{flt}', {column}_{flt}" for column, flt in func_params) + ')'
(df
.groupby('ID')
.agg(*cols)
.selectExpr('ID', stack)
.withColumnRenamed('col0', 'param')
.withColumnRenamed('col1', 'Value')
.show()
)
+---+-------------+-----+
| ID| param|Value|
+---+-------------+-----+
| 1| Day_Monday| 5|
| 1|Month_January| 2|
| 2| Day_Monday| 10|
| 2|Month_January| null|
+---+-------------+-----+
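For comparison, here is a minimal sketch of the 'naive' approach mentioned above: run filter_gby_reduce (as defined in the question) once per parameter pair, tag each result with a literal column so the rows stay distinguishable, and union everything with functools.reduce. The names results and unioned are just for this sketch.
from functools import reduce
import pyspark.sql.functions as f

results = [
    filter_gby_reduce(df, filter_col=column, filter_value=flt)
        .withColumn('param', f.lit(f'{column}_{flt}'))
    for column, flt in func_params
]
unioned = reduce(lambda a, b: a.unionByName(b), results)
unioned.show()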
I use this udf:
mincol = F.udf(lambda row: cols[row.index(min(row))], StringType())
df = df.withColumn("mycol", mincol(F.struct([df[x] for x in cols])))
to get the name of the column holding the least value per row, as the value of another column called 'mycol'.
But this code is very slow.
Any suggestions to improve performance?
I am using PySpark 2.3.
Here is another solution for Spark 2.3 which uses only built-in functions:
from sys import float_info
from pyspark.sql.functions import array, least, col, lit, concat_ws, expr
cols = df.columns
col_names = array(list(map(lit, cols)))
set_cols = list(map(col, cols))
# replace null with largest python float
df.na.fill(float_info.max) \
.withColumn("min", least(*cols)) \
.withColumn("cnames", col_names) \
.withColumn("set", concat_ws(",", *set_cols)) \
.withColumn("min_col", expr("cnames[find_in_set(min, set) - 1]")) \
.select(*[cols + ["min_col"]]) \
.show()
Steps:
Fill all nulls with the largest possible float number. This is a good candidate for null replacement since it is hard to find a larger value.
Find min column using least.
Create the column cnames for storing the column names.
Create the column set, which contains all the values as a comma-separated string.
Create the column min_col using find_in_set. The function handles each string item separately and returns the index of the found item. Finally, we use the index with cnames[index - 1] to retrieve the column name.
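As a quick illustration of the find_in_set step (with hypothetical values, not taken from the question's data), it returns the 1-based position of the string within the comma-separated list:
spark.sql("SELECT find_in_set('1.0', '3.0,2.0,1.0') AS idx").show()
# +---+
# |idx|
# +---+
# |  3|
# +---+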
Here is an approach without udf, for Spark 2.4+ (it relies on arrays_zip, array_sort and the filter higher-order function). The idea is to create an array containing the value and name of each column and then sort this array.
import pyspark.sql.functions as F

df1 = spark.createDataFrame([
    (1., 2., 3.), (3., 2., 1.), (9., 8., -1.), (1.2, 1.2, 9.1), (3., None, 1.0)],
    ["col1", "col2", "col3"])
cols = df1.columns
col_string = ', '.join("'{0}'".format(c) for c in cols)
df1 = df1.withColumn("vals", F.array(cols)) \
.withColumn("cols", F.expr("Array(" + col_string + ")")) \
.withColumn("zipped", F.arrays_zip("vals", "cols")) \
.withColumn("without_nulls", F.expr("filter(zipped, x -> not x.vals is null)")) \
.withColumn("sorted", F.expr("array_sort(without_nulls)")) \
.withColumn("min", F.col("sorted")[0].cols) \
.drop("vals", "cols", "zipped", "without_nulls", "sorted")
df1.show(truncate=False)
prints
+----+----+----+----+
|col1|col2|col3|min |
+----+----+----+----+
|1.0 |2.0 |3.0 |col1|
|3.0 |2.0 |1.0 |col3|
|9.0 |8.0 |-1.0|col3|
|1.2 |1.2 |9.1 |col1|
|3.0 |null|1.0 |col3|
+----+----+----+----+
I have a table of records as shown below.
Id Indicator Date
1 R 2018-01-20
1 R 2018-10-21
1 P 2019-01-22
2 R 2018-02-28
2 P 2018-05-22
2 P 2019-03-05
I need to pick the Ids that had more than one R indicator in the last year and derive a new column called Marked_Flag as Y, otherwise N. So the expected output should look like below:
Id Marked_Flag
1 Y
2 N
What I have done so far: I took the records into a Dataset and then built another Dataset from that. The code looks like below.
Dataset<Row> getIndicators = spark.sql("select id, count(indicator) as indi_count from source group by id having indicator = 'R'");
Dataset<Row> getFlag = spark.sql("select id, case when indi_count > 1 then 'Y' else 'N' end as Marked_Flag from getIndicators");
But my lead what this to be done using a single dataset and using Spark transformations. I am pretty new to Spark, any guidance or code snippet on this regard would be highly helpful.
Try out the following. Note that I am using a PySpark DataFrame here:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
[1, "R", "2018-01-20"],
[1, "R", "2018-10-21"],
[1, "P", "2019-01-22"],
[2, "R", "2018-02-28"],
[2, "P", "2018-05-22"],
[2, "P", "2019-03-05"]], ["Id", "Indicator","Date"])
gr = df.filter(F.col("Indicator")=="R").groupBy("Id").agg(F.count("Indicator"))
gr = gr.withColumn("Marked_Flag", F.when(F.col("count(Indicator)") > 1, "Y").otherwise('N')).drop("count(Indicator)")
gr.show()
# +---+-----------+
# | Id|Marked_Flag|
# +---+-----------+
# | 1| Y|
# | 2| N|
# +---+-----------+
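The snippet above counts every R row regardless of date. If the 'last one year' condition from the question also has to be enforced, one possible sketch is to filter on Date first; this assumes Date is an ISO-formatted string and, since the sample data is historical, uses the latest date in the data as the reference point:
ref_date = df.agg(F.max(F.to_date("Date")).alias("ref")).first()["ref"]
recent = df.filter(F.to_date("Date") >= F.add_months(F.lit(ref_date), -12))
# then apply the same groupBy / when logic to `recent` instead of `df`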
I'm looking for a way to aggregate my data by month. I first want to keep only the month of my visitdate. My DataFrame looks like this:
Row(visitdate = 1/1/2013,
patientid = P1_Pt1959,
amount = 200,
note = jnut,
)
My objective is then to group by visitdate and calculate the sum of amount. I tried this:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
file_path = "G:/Visit Data.csv"
patients = spark.read.csv(file_path,header = True)
patients.createOrReplaceTempView("visitdate")
sqlDF = spark.sql("SELECT visitdate,SUM(amount) as totalamount from visitdate GROUP BY visitdate")
sqlDF.show()
This is the result:
+----------+-----------+
| visitdate|totalamount|
+----------+-----------+
|  9/1/2013|    10800.0|
|25/04/2013|    12440.0|
|27/03/2014|    16930.0|
|26/03/2015|    18560.0|
|14/05/2013|    13770.0|
|30/06/2013|    13880.0|
+----------+-----------+
My objective is to get something like this:
+---------+-----------+
|visitdate|totalamount|
+---------+-----------+
| 1/1/2013|    10800.0|
| 1/2/2013|    12440.0|
| 1/3/2013|    16930.0|
| 1/4/2014|    18560.0|
| 1/5/2015|    13770.0|
| 1/6/2015|    13880.0|
+---------+-----------+
You need to truncate your dates down to months so they group properly, then do a groupBy/sum. There is a Spark function to do this for you called date_trunc. For example:
from datetime import date
from pyspark.sql.functions import date_trunc, sum
data = [
(date(2000, 1, 2), 1000),
(date(2000, 1, 2), 2000),
(date(2000, 2, 3), 3000),
(date(2000, 2, 4), 4000),
]
df = spark.createDataFrame(sc.parallelize(data), ["date", "amount"])
df.groupBy(date_trunc("month", df.date)).agg(sum("amount")).show()
+-----------------------+-----------+
|date_trunc(month, date)|sum(amount)|
+-----------------------+-----------+
| 2000-01-01 00:00:00| 3000|
| 2000-02-01 00:00:00| 7000|
+-----------------------+-----------+
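Applied to the question's data, this would look roughly like the sketch below. The visitdate strings appear to be day/month/year, so the 'd/M/yyyy' format (and the visit_month/totalamount names) are assumptions you may need to adjust:
from pyspark.sql.functions import to_date, date_trunc, sum as sum_

monthly = (
    patients
    .withColumn("visit_month", date_trunc("month", to_date("visitdate", "d/M/yyyy")))
    .groupBy("visit_month")
    .agg(sum_("amount").alias("totalamount"))
)
monthly.show()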
import numpy as np
data = [
(1, 1, None),
(1, 2, float(5)),
(1, 3, np.nan),
(1, 4, None),
(1, 5, float(10)),
(1, 6, float("nan")),
(1, 6, float("nan")),
]
df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))
Expected output: a dataframe with the count of NaN/null values for each column.
Note: the previous questions I found on Stack Overflow only check for null, not NaN. That's why I have created a new question.
I know I can use the isnull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark dataframe?
You can use the method shown here and replace isNull with isnan:
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 3|
+-------+----------+---+
or
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 5|
+-------+----------+---+
For null values in a PySpark dataframe:
Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null
# The output in dict where key is column name and value is null values in that column
{'#': 0,
'Name': 0,
'Type 1': 0,
'Type 2': 386,
'Total': 0,
'HP': 0,
'Attack': 0,
'Defense': 0,
'Sp_Atk': 0,
'Sp_Def': 0,
'Speed': 0,
'Generation': 0,
'Legendary': 0}
To make sure it does not fail for string, date and timestamp columns:
import pyspark.sql.functions as F

def count_missings(spark_df, sort=True):
    """
    Counts the number of nulls and nans in each column
    """
    df = spark_df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c) for (c, c_type) in spark_df.dtypes if c_type not in ('timestamp', 'string', 'date')]).toPandas()
    if len(df) == 0:
        print("There are no missing values!")
        return None
    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)
    return df
If you want to see the columns sorted by the number of NaNs and nulls in descending order:
count_missings(spark_df)
# | Col_A | 10 |
# | Col_C | 2 |
# | Col_B | 1 |
If you don't want ordering and want to see them as a single row:
count_missings(spark_df, False)
# | Col_A | Col_B | Col_C |
# | 10 | 1 | 2 |
An alternative to the already provided ways is to simply filter on the column like so
import pyspark.sql.functions as F
df = df.where(F.col('columnNameHere').isNull())
This has the added benefit that you don't have to add another column to do the filtering and it's quick on larger data sets.
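To turn that filter into the count the question asks for, just append count() (keeping the placeholder column name from above):
null_count = df.where(F.col('columnNameHere').isNull()).count()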
Here is my one-liner. Here 'c' is the name of the column:
import pyspark.sql.functions as F
df.select('c').withColumn('isNull_c', F.col('c').isNull()).where('isNull_c = True').count()
I prefer this solution:
from pyspark.sql.functions import count

df = spark.table(selected_table).filter(condition)
counter = df.count()
df = df.select([(counter - count(c)).alias(c) for c in df.columns])
Use the following code to identify the null values in every column using PySpark.
import pandas as pd
from pyspark.sql.functions import count, when, isnull

def check_nulls(dataframe):
    '''
    Check null values and return the null values in pandas Dataframe
    INPUT: Spark Dataframe
    OUTPUT: Null values
    '''
    # Create pandas dataframe
    nulls_check = pd.DataFrame(
        dataframe.select([count(when(isnull(c), c)).alias(c) for c in dataframe.columns]).collect(),
        columns=dataframe.columns).transpose()
    nulls_check.columns = ['Null Values']
    return nulls_check

# Check null values
null_df = check_nulls(raw_df)
null_df
from pyspark.sql import DataFrame
import pyspark.sql.functions as fn

# compatible with fn.isnan. Sourced from
# https://github.com/apache/spark/blob/13fd272cd3/python/pyspark/sql/functions.py#L4818-L4836
NUMERIC_DTYPES = (
    'decimal',
    'double',
    'float',
    'int',
    'bigint',
    'smallint',
    'tinyint',
)

def count_nulls(df: DataFrame) -> DataFrame:
    isnan_compat_cols = {c for (c, t) in df.dtypes if any(t.startswith(num_dtype) for num_dtype in NUMERIC_DTYPES)}
    return df.select(
        [fn.count(fn.when(fn.isnan(c) | fn.isnull(c), c)).alias(c) for c in isnan_compat_cols]
        + [fn.count(fn.when(fn.isnull(c), c)).alias(c) for c in set(df.columns) - isnan_compat_cols]
    )
This builds off of gench's and user8183279's answers, but checks only via isnull for columns where isnan is not possible, rather than just ignoring them.
The source code of pyspark.sql.functions seemed to be the only documentation I could really find enumerating these names; if others know of some public docs, I'd be delighted.
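A hypothetical usage example on the DataFrame built earlier in this question (column order in the output may vary, because the implementation iterates over Python sets):
count_nulls(df).show()
# session and timestamp1 contain no missing values; id2 has 5 (3 NaNs + 2 nulls)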
If you are writing Spark SQL, the following will also work to find null values; you can count them subsequently.
spark.sql('select * from table where isNULL(column_value)')
Yet another alternative (improving upon Vamsi Krishna's solutions above):
from pyspark.sql.functions import isnan, isnull

def check_for_null_or_nan(df):
    null_or_nan = lambda x: isnan(x) | isnull(x)
    func = lambda x: df.filter(null_or_nan(x)).count()
    print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i) != 0], sep='\n')

check_for_null_or_nan(df)
id2 has 5 nans/nulls
Here is a readable solution, because code is for people as much as computers ;-)
df.selectExpr('sum(int(isnull(<col_name>) or isnan(<col_name>))) as null_or_nan_count')
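If you want that expression for every column at once, a minimal sketch, assuming all columns are numeric (isnan is not usable on string, date and timestamp columns, as noted in an earlier answer), is:
df.selectExpr(*[f'sum(int(isnull({c}) or isnan({c}))) as {c}' for c in df.columns]).show()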