How to aggregate strings into dictionary-like results in PySpark?

I have a dataframe and I want to aggregate it to daily values.
data = [
(125, '2012-10-10','good'),
(20, '2012-10-10','good'),
(40, '2012-10-10','bad'),
(60, '2012-10-10','NA')]
df = spark.createDataFrame(data, ["temperature", "date","performance"])
I can aggregate numerical values using Spark built-in functions like max, min, and avg. How can I aggregate strings?
I expect something like:
date        max_temp  min_temp  performance_frequency
2012-10-10  125       20        "good": 2, "bad": 1, "NA": 1

We can use MapType and a UDF with Counter to return the value counts:
from pyspark.sql import functions as F
from pyspark.sql.types import MapType,StringType,IntegerType
from collections import Counter
data = [(125, '2012-10-10','good'),(20, '2012-10-10','good'),(40, '2012-10-10','bad'),(60, '2012-10-10','NA')]
df = spark.createDataFrame(data, ["temperature", "date","performance"])
udf1 = F.udf(lambda x: dict(Counter(x)),MapType(StringType(),IntegerType()))
df.groupby('date').agg(F.min('temperature'),F.max('temperature'),udf1(F.collect_list('performance')).alias('performance_frequency')).show(1,False)
+----------+----------------+----------------+---------------------------------+
|date |min(temperature)|max(temperature)|performance_frequency |
+----------+----------------+----------------+---------------------------------+
|2012-10-10|20 |125 |Map(NA -> 1, bad -> 1, good -> 2)|
+----------+----------------+----------------+---------------------------------+
df.groupby('date').agg(F.min('temperature'),F.max('temperature'),udf1(F.collect_list('performance')).alias('performance_frequency')).collect()
[Row(date='2012-10-10', min(temperature)=20, max(temperature)=125, performance_frequency={'bad': 1, 'good': 2, 'NA': 1})]
Hope this helps!
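If you would rather avoid a Python UDF, here is a rough sketch of the same idea with built-in functions only (assuming Spark 2.4+ for map_from_entries): count per (date, performance) first, then collect the pairs into a map column.
from pyspark.sql import functions as F
# count each performance value per date, then fold the pairs into a map column
freq = (df.groupBy('date', 'performance').count()
          .groupBy('date')
          .agg(F.map_from_entries(F.collect_list(F.struct('performance', 'count')))
               .alias('performance_frequency')))
# join the temperature aggregates back on date
result = (df.groupBy('date')
            .agg(F.min('temperature').alias('min_temp'),
                 F.max('temperature').alias('max_temp'))
            .join(freq, 'date'))
result.show(truncate=False)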

Related

iterate over a column check condition and carry calculations with values of other data frames

import pandas as pd
import numpy as np
I have three dataframes: df1, df2, and df3.
df1:
data = {'Period': ['2024-04-01', '2024-07-01', '2024-10-01', '2025-01-01', '2025-04-01',
                   '2025-07-01', '2025-10-01', '2026-01-01', '2026-04-01', '2026-07-01',
                   '2026-10-01', '2027-01-01', '2027-04-01', '2027-07-01', '2027-10-01',
                   '2028-01-01', '2028-04-01', '2028-07-01', '2028-10-01'],
        'Price': [np.nan] * 19,
        'years': [2024, 2024, 2024, 2025, 2025, 2025, 2025, 2026, 2026, 2026, 2026,
                  2027, 2027, 2027, 2027, 2028, 2028, 2028, 2028],
        'quarters': [2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
        }
df1 = pd.DataFrame(data=data)
df2:
data = {'price': [473.26, 244, 204, 185, 152, 157],
        'year': [2023, 2024, 2025, 2026, 2027, 2028]
        }
df2 = pd.DataFrame(data=data)
df3=
data = {'quarters': [1,2,3,4],
'weights': [1.22, 0.81, 0.83, 1.12]
}
df2 = pd.DataFrame(data=data)
My aim is to compute the Price column of df1. For each row of df1, check the condition and carry out the calculation accordingly. For example, for the first row, check whether df1['years'] == 2024 and df1['quarters'] == 2; then df1['Price'] = df2.loc[df2['year'] == 2024, 'price'] * df3.loc[df3['quarters'] == 2, 'weights'].
===>>> df1['Price'][0] = 473.26 * 0.81.
df1['Price'][1] = 473.26 * 0.83.
...
...
...
and so on.
I could have used this method, but I want to write the code in a more efficient way. I would like to use the following code structure:
for i in range(len(df1)):
    if (df1['year']==2024) & (df1['quarter']==2):
        df1['Price'] = df2.loc[df2['year']==2024, 'price'] * df3.loc[df3['quarters']==2, 'weights']
    elif (df1['year']==2024) & (df1['quarter']==3):
        df1['price'] = df2.loc[df2['year']=='2024', 'price'] * df3.loc[df3['quarters']==3, 'weights']
    elif (df1['year']==2024) & (df1['quarters']==4):
        df1['Price'] = df2.loc[df2['year']=='2024', 'price'] * df3.loc[df3['quarters']==4, 'weights']
    ...
    ...
    ...
Thanks!!!
I think, if I understand correctly, you can use pd.merge to bring these fields together first:
df1 = df1.merge(df2, how='left' , left_on='years', right_on='year')
df1 = df1.merge(df3, how='left' , left_on='quarters', right_on='quarters')
df1['Price'] = df1['price']*df1['weights']
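If you want to tidy up afterwards (an optional sketch, assuming the merges above), you can drop the helper columns that the merges brought in:
df1 = df1.drop(columns=['year', 'price', 'weights'])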

Groupby and transpose or unstack in Pandas

I have the following Python pandas dataframe:
There are more EventNames than shown for this date.
Each will have Race_Number = 'Race 1', 'Race 2', etc.
After a while the date increments.
I'm trying to create a dataframe that looks like this:
Each race has a different number of runners.
Is there a way to do this in pandas?
Thanks
I assumed the output would be another DataFrame.
import pandas as pd
import numpy as np
from nltk import flatten
import copy
df = pd.DataFrame({'EventName': ['sydney', 'sydney', 'sydney', 'sydney', 'sydney', 'sydney'],
'Date': ['2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01'],
'Race_Number': ['Race1', 'Race1', 'Race1', 'Race2', 'Race2', 'Race3'],
'Number': [4, 7, 2, 9, 5, 10]
})
print(df)
dic = {}
for rows in df.itertuples():
    if rows.Race_Number in dic:
        dic[rows.Race_Number] = flatten([dic[rows.Race_Number], rows.Number])
    else:
        dic[rows.Race_Number] = rows.Number

copy_dic = copy.deepcopy(dic)
seq = np.arange(0, len(dic.keys()))
for key, n_key in zip(copy_dic, seq):
    dic[n_key] = dic.pop(key)
df = pd.DataFrame([dic])
print(df)
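If the goal is just one row per race with the runner numbers spread across columns, a shorter sketch (using the same sample df as above) with groupby/cumcount and unstack would be:
# number each runner within its race, then pivot those positions into columns
wide = (df.set_index(['EventName', 'Date', 'Race_Number',
                      df.groupby(['EventName', 'Date', 'Race_Number']).cumcount()])['Number']
          .unstack())
print(wide)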

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np
data = [
(1, 1, None),
(1, 2, float(5)),
(1, 3, np.nan),
(1, 4, None),
(1, 5, float(10)),
(1, 6, float("nan")),
(1, 6, float("nan")),
]
df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))
Expected output
dataframe with count of nan/null for each column
Note:
The previous questions I found on Stack Overflow only check for null, not NaN.
That's why I have created a new question.
I know I can use the isNull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark dataframe?
You can use the method shown here and replace isNull with isnan:
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 3|
+-------+----------+---+
or
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 5|
+-------+----------+---+
For null values in a PySpark dataframe:
Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null
# The output is a dict where the key is the column name and the value is the number of nulls in that column
{'#': 0,
'Name': 0,
'Type 1': 0,
'Type 2': 386,
'Total': 0,
'HP': 0,
'Attack': 0,
'Defense': 0,
'Sp_Atk': 0,
'Sp_Def': 0,
'Speed': 0,
'Generation': 0,
'Legendary': 0}
To make sure it does not fail for string, date and timestamp columns:
import pyspark.sql.functions as F
def count_missings(spark_df, sort=True):
    """
    Counts the number of nulls and NaNs in each column
    """
    df = spark_df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
                          for (c, c_type) in spark_df.dtypes
                          if c_type not in ('timestamp', 'string', 'date')]).toPandas()
    if len(df) == 0:
        print("There are no missing values!")
        return None
    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)
    return df
If you want to see the columns sorted by the number of NaNs and nulls in descending order:
count_missings(spark_df)
# | Col_A | 10 |
# | Col_C | 2 |
# | Col_B | 1 |
If you don't want ordering and want to see them as a single row:
count_missings(spark_df, False)
# | Col_A | Col_B | Col_C |
# | 10 | 1 | 2 |
An alternative to the already provided ways is to simply filter on the column, like so:
import pyspark.sql.functions as F
df = df.where(F.col('columnNameHere').isNull())
This has the added benefit that you don't have to add another column to do the filtering and it's quick on larger data sets.
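The same filter idea can count nulls and NaNs together for a single numeric column (a small sketch; 'columnNameHere' is the placeholder from above):
import pyspark.sql.functions as F
n_missing = df.where(F.col('columnNameHere').isNull() | F.isnan('columnNameHere')).count()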
Here is my one-liner, where 'c' is the name of the column:
from pyspark.sql import functions as F
df.select('c').withColumn('isNull_c', F.col('c').isNull()).where('isNull_c = True').count()
I prefer this solution:
from pyspark.sql.functions import count

df = spark.table(selected_table).filter(condition)
counter = df.count()
# count(c) skips nulls, so counter - count(c) gives the number of nulls per column
df = df.select([(counter - count(c)).alias(c) for c in df.columns])
Use the following code to identify the null values in every column using PySpark.
import pandas as pd
from pyspark.sql.functions import count, when, isnull

def check_nulls(dataframe):
    '''
    Check null values and return the null counts in a pandas DataFrame
    INPUT: Spark DataFrame
    OUTPUT: Null values per column
    '''
    # Collect the per-column null counts into a pandas dataframe
    nulls_check = pd.DataFrame(
        dataframe.select([count(when(isnull(c), c)).alias(c) for c in dataframe.columns]).collect(),
        columns=dataframe.columns).transpose()
    nulls_check.columns = ['Null Values']
    return nulls_check
#Check null values
null_df = check_nulls(raw_df)
null_df
from pyspark.sql import DataFrame
import pyspark.sql.functions as fn
# compatible with fn.isnan. Sourced from
# https://github.com/apache/spark/blob/13fd272cd3/python/pyspark/sql/functions.py#L4818-L4836
NUMERIC_DTYPES = (
    'decimal',
    'double',
    'float',
    'int',
    'bigint',
    'smallint',
    'tinyint',
)

def count_nulls(df: DataFrame) -> DataFrame:
    isnan_compat_cols = {c for (c, t) in df.dtypes
                         if any(t.startswith(num_dtype) for num_dtype in NUMERIC_DTYPES)}
    return df.select(
        [fn.count(fn.when(fn.isnan(c) | fn.isnull(c), c)).alias(c) for c in isnan_compat_cols]
        + [fn.count(fn.when(fn.isnull(c), c)).alias(c) for c in set(df.columns) - isnan_compat_cols]
    )
This builds on gench's and user8183279's answers, but checks via isNull alone for columns where isnan is not possible, rather than just ignoring them.
The source code of pyspark.sql.functions seemed to have the only documentation I could really find enumerating these names — if others know of some public docs I'd be delighted.
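For reference, a quick usage sketch with the example dataframe from the question:
count_nulls(df).show()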
If you are writing Spark SQL, the following will also work to find the null values; you can count them subsequently.
spark.sql('select * from table where isNULL(column_value)')
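A sketch of the follow-up count in the same style (same placeholder table and column as above):
spark.sql('select count(*) as null_count from table where isnull(column_value)').show()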
Yet another alternative (improving on Vamsi Krishna's solution above):
from pyspark.sql.functions import isnan, isnull

def check_for_null_or_nan(df):
    null_or_nan = lambda x: isnan(x) | isnull(x)
    func = lambda x: df.filter(null_or_nan(x)).count()
    print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i) != 0], sep='\n')

check_for_null_or_nan(df)
id2 has 5 nans/nulls
Here is a readable solution because code is for people as much as computers ;-)
df.selectExpr('sum(int(isnull(<col_name>) or isnan(<col_name>))) as null_or_nan_count')
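To apply the same expression to every column at once, here is a sketch using the numeric example dataframe from the question:
exprs = [f'sum(int(isnull({c}) or isnan({c}))) as {c}' for c in df.columns]
df.selectExpr(*exprs).show()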

Append to dataframe with for loop. Python3

I'm trying to loop through a list (y) and build my output by appending a row to a dataframe for each item.
y=[datetime.datetime(2017, 3, 29), datetime.datetime(2017, 3, 30), datetime.datetime(2017, 3, 31)]
Desired Output:
Index Mean Last
2017-03-29 1.5 .76
2017-03-30 2.3 .4
2017-03-31 1.2 1
Here is the first and last part of the code I currently have:
import pandas as pd
import datetime

df5 = pd.DataFrame(columns=['Mean', 'Last'], index=index)
for item0 in y:
    .........
    .........
    df = df.rename(columns={0: 'Mean'})
    df4 = pd.concat([df, df3], axis=1)
    print(df4)
    df5.append(df4)
print(df5)
My code only puts one row into the dataframe, as opposed to a row for each item in y:
Index Mean Last
2017-03-29 1.5 .76
Try:
from datetime import datetime
import pandas as pd

y = [datetime(2017, 3, 29), datetime(2017, 3, 30), datetime(2017, 3, 31)]
m = [1.5, 2.3, 1.2]
l = [0.76, .4, 1]

df = pd.DataFrame([], columns=['time', 'mean', 'last'])
for y0, m0, l0 in zip(y, m, l):
    data = {'time': y0, 'mean': m0, 'last': l0}
    df = df.append(data, ignore_index=True)  # note: DataFrame.append was removed in pandas 2.0
and if you want y to be the index:
df.index = df.time
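On pandas 2.0+, where DataFrame.append no longer exists, a sketch of the same idea that collects the rows first and builds the frame once (which is also faster) would be:
import pandas as pd
from datetime import datetime

y = [datetime(2017, 3, 29), datetime(2017, 3, 30), datetime(2017, 3, 31)]
m = [1.5, 2.3, 1.2]
l = [0.76, .4, 1]

# build the rows in a plain list and construct the DataFrame in one go
rows = [{'time': y0, 'mean': m0, 'last': l0} for y0, m0, l0 in zip(y, m, l)]
df = pd.DataFrame(rows).set_index('time')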
There are a few ways to skin this, and it's hard to know which approach makes the most sense with the limited info given. But one way is to start with a dataframe that has only the index, iterate through the dataframe by row and populate the values from some other process. Here's an example of that approach:
import datetime
import numpy as np
import pandas as pd
y=[datetime.datetime(2017, 3, 29), datetime.datetime(2017, 3, 30), datetime.datetime(2017, 3, 31)]
main_df = pd.DataFrame(y, columns=['Index'])
#pop in the additional columns you want, but leave them blank
main_df['Mean'] = None
main_df['Last'] = None
#set the index
main_df.set_index(['Index'], inplace=True)
that gives us the following:
Mean Last
Index
2017-03-29 None None
2017-03-30 None None
2017-03-31 None None
Now let's loop and plug in some made up random values:
## loop through main_df and add values
for (index, row) in main_df.iterrows():
    # use .loc rather than the removed .ix indexer
    main_df.loc[index, 'Mean'] = np.random.rand()
    main_df.loc[index, 'Last'] = np.random.rand()
this results in the following dataframe which has the None values filled:
Mean Last
Index
2017-03-29 0.174714 0.718738
2017-03-30 0.983188 0.648549
2017-03-31 0.07809 0.47031
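As an aside, if the values do not actually depend on the individual row, a vectorized assignment (a sketch using the main_df above) avoids the per-row loop entirely:
main_df['Mean'] = np.random.rand(len(main_df))
main_df['Last'] = np.random.rand(len(main_df))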

pyspark corr for each group in DF (more than 5K columns)

I have a DF with 100 million rows and 5000+ columns. I am trying to find the correlation between colx and the remaining 5000+ columns.
aggList1 = [mean(col).alias(col + '_m') for col in df.columns]  # exclude keys
df21 = df.groupBy('key1', 'key2', 'key3', 'key4').agg(*aggList1)
df = df.join(broadcast(df21), ['key1', 'key2', 'key3', 'key4'])
df= df.select([func.round((func.col(colmd) - func.col(colmd + '_m')), 8).alias(colmd)\
for colmd in all5Kcolumns])
aggCols= [corr(colx, col).alias(col) for col in colsall5K]
df2 = df.groupBy('key1', 'key2', 'key3').agg(*aggCols)
Right now it is not working because of the Spark 64KB codegen issue (even on Spark 2.2). So I am looping over batches of 300 columns and merging everything at the end, but it is taking more than 30 hours on a cluster with 40 nodes (10 cores each, 100 GB per node). Any help to tune this?
Things already tried:
- Re partition DF to 10,000
- Checkpoint in each loop
- cache in each loop
You can try with a bit of NumPy and RDDs. First a bunch of imports:
from operator import itemgetter
import numpy as np
from pyspark.statcounter import StatCounter
Let's define a few variables:
keys = ["key1", "key2", "key3"] # list of key column names
xs = ["x1", "x2", "x3"] # list of column names to compare
y = "y" # name of the reference column
And some helpers:
def as_pair(keys, y, xs):
    """ Given key names, y name, and xs names
    return a tuple of key, array-of-values"""
    key = itemgetter(*keys)
    value = itemgetter(y, *xs)  # Python 3 syntax

    def as_pair_(row):
        return key(row), np.array(value(row))
    return as_pair_
def init(x):
    """ Init function for combineByKey
    Initialize new StatCounter and merge first value"""
    return StatCounter().merge(x)

def center(means):
    """Center a row value given a
    dictionary of mean arrays
    """
    def center_(row):
        key, value = row
        return key, value - means[key]
    return center_

def prod(arr):
    return arr[0] * arr[1:]

def corr(stddev_prods):
    """Scale the row to get 1 stddev
    given a dictionary of stddevs
    """
    def corr_(row):
        key, value = row
        return key, value / stddev_prods[key]
    return corr_
and convert DataFrame to RDD of pairs:
pairs = df.rdd.map(as_pair(keys, y, xs))
Next let's compute statistics per group:
stats = (pairs
    .combineByKey(init, StatCounter.merge, StatCounter.mergeStats)
    .collectAsMap())
means = {k: v.mean() for k, v in stats.items()}
Note: with 5000 features and 7000 groups there should be no issue with keeping this structure in memory. With larger datasets you may have to use an RDD and join, but this will be slower.
Center the data:
centered = pairs.map(center(means))
Compute covariance:
covariance = (centered
    .mapValues(prod)
    .combineByKey(init, StatCounter.merge, StatCounter.mergeStats)
    .mapValues(StatCounter.mean))
And finally correlation:
stddev_prods = {k: prod(v.stdev()) for k, v in stats.items()}
correlations = covariance.map(corr(stddev_prods))
Example data:
df = sc.parallelize([
("a", "b", "c", 0.5, 0.5, 0.3, 1.0),
("a", "b", "c", 0.8, 0.8, 0.9, -2.0),
("a", "b", "c", 1.5, 1.5, 2.9, 3.6),
("d", "e", "f", -3.0, 4.0, 5.0, -10.0),
("d", "e", "f", 15.0, -1.0, -5.0, 10.0),
]).toDF(["key1", "key2", "key3", "y", "x1", "x2", "x3"])
Results with DataFrame:
df.groupBy(*keys).agg(*[corr(y, x) for x in xs]).show()
+----+----+----+-----------+------------------+------------------+
|key1|key2|key3|corr(y, x1)| corr(y, x2)| corr(y, x3)|
+----+----+----+-----------+------------------+------------------+
| d| e| f| -1.0| -1.0| 1.0|
| a| b| c| 1.0|0.9972300220940342|0.6513360726920862|
+----+----+----+-----------+------------------+------------------+
and the method provided above:
correlations.collect()
[(('a', 'b', 'c'), array([ 1. , 0.99723002, 0.65133607])),
(('d', 'e', 'f'), array([-1., -1., 1.]))]
This solution, while a bit involved, is quite elastic and can easily be adjusted to handle different data distributions. It should also be possible to get a further boost with JIT compilation.
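A minimal sketch of that JIT idea, assuming Numba is available on the executors and the value arrays are float64 (prod_jit is a hypothetical drop-in for the prod helper above):
from numba import njit

@njit
def prod_jit(arr):
    # same per-row product used in the covariance step, compiled by Numba
    return arr[0] * arr[1:]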
