I have been building my application in Python, but I now need to run it in a distributed environment, so I'm trying to build an application
using Spark. However, I can't come up with code that is as fast as shift in Pandas.
mask = (df['name_x'].shift(0) == df['name_y'].shift(0)) & \
(df['age_x'].shift(0) == df['age_y'].shift(0))
df = df[~mask]
Where
mask.tolist()
gives
[True, False, True, False]
The final result df will contain only two rows (2nd and 4th).
Basically, I'm trying to remove the rows where the [name_x, age_x] values duplicate the [name_y, age_y] values.
The code above runs on a Pandas DataFrame. What would be the closest PySpark code that is as efficient, without importing Pandas?
I have looked at Window in Spark but am not sure about it.
shift plays no role in your code. This
import pandas as pd
df = pd.DataFrame({
"name_x" : ["ABC", "CDF", "DEW", "ABC"],
"age_x": [20, 20, 22, 21],
"name_y" : ["ABC", "CDF", "DEW", "ABC"],
"age_y" : [20, 21, 22, 19],
})
mask1 = (df['name_x'].shift(0) == df['name_y'].shift(0)) & \
(df['age_x'].shift(0) == df['age_y'].shift(0))
df[~mask1]
# name_x age_x name_y age_y
# 1 CDF 20 CDF 21
# 3 ABC 21 ABC 19
is just equivalent to
mask2 = (df['name_x'] == df['name_y']) & (df['age_x'] == df['age_y'])
df[~mask2]
# name_x age_x name_y age_y
# 1 CDF 20 CDF 21
# 3 ABC 21 ABC 19
Therefore all you need is filter:
sdf = spark.createDataFrame(df)
smask = ~((sdf["name_x"] == sdf["name_y"]) & (sdf["age_x"] == sdf["age_y"]))
sdf.filter(smask).show()
# +------+-----+------+-----+
# |name_x|age_x|name_y|age_y|
# +------+-----+------+-----+
# | CDF| 20| CDF| 21|
# | ABC| 21| ABC| 19|
# +------+-----+------+-----+
which, by De Morgan's laws, can be simplified to
(sdf["name_x"] != sdf["name_y"]) | (sdf["age_x"] != sdf["age_y"])
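As a quick sanity check, the equivalence can be verified in plain Python on the four sample rows, with no Spark needed. The `rows` list below simply restates the example data, and the variable names are made up for illustration. Note that this check ignores NULLs: in Spark a comparison with NULL yields NULL, and filter drops such rows under either form.

```python
# Sample rows restated from the example: (name_x, age_x, name_y, age_y)
rows = [
    ("ABC", 20, "ABC", 20),
    ("CDF", 20, "CDF", 21),
    ("DEW", 22, "DEW", 22),
    ("ABC", 21, "ABC", 19),
]

# Negated conjunction: not (names equal and ages equal)
negated_and = [not (nx == ny and ax == ay) for nx, ax, ny, ay in rows]

# De Morgan form: names differ or ages differ
either_differs = [(nx != ny) or (ax != ay) for nx, ax, ny, ay in rows]

print(negated_and)     # [False, True, False, True]
print(either_differs)  # [False, True, False, True]
```

Both lists keep only the 2nd and 4th rows, matching the filtered output above.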
In general, shift can be expressed with Window functions.
I have a Spark question. For each entity k, the input is a sequence of probabilities p_i, each with an associated value v_i. For example, the data can look like this:
entity | Probability | value
A | 0.8 | 10
A | 0.6 | 15
A | 0.3 | 20
B | 0.8 | 10
Then, for entity A, I'm expecting the avg value to be 0.8*10 + (1-0.8)*0.6*15 + (1-0.8)*(1-0.6)*0.3*20 + (1-0.8)*(1-0.6)*(1-0.3)*MAX_VALUE_DEFINED.
How could I achieve this in Spark using a DataFrame agg function? I find it challenging, given the complexity, to groupBy entity and compute the sequence of results.
You can use a UDF to perform such custom calculations. The idea is to use collect_list to gather all probabilities and values of an entity into one place so that you can loop through them. However, collect_list does not preserve the order of your records, which might lead to a wrong calculation. One way to fix this is to generate an ID for each row using monotonically_increasing_id and sort on it:
import pyspark.sql.functions as F
@F.pandas_udf('double')
def markov_udf(values):
    def markov(lst):
        # you can implement your markov logic here
        s = 0
        for i, prob, val in lst:
            s += prob
        return s
    return values.apply(markov)
(df
.withColumn('id', F.monotonically_increasing_id())
.groupBy('entity')
.agg(F.array_sort(F.collect_list(F.array('id', 'probability', 'value'))).alias('values'))
.withColumn('markov', markov_udf('values'))
.show(10, False)
)
+------+------------------------------------------------------+------+
|entity|values |markov|
+------+------------------------------------------------------+------+
|B |[[3.0, 0.8, 10.0]] |0.8 |
|A |[[0.0, 0.8, 10.0], [1.0, 0.6, 15.0], [2.0, 0.3, 20.0]]|1.7 |
+------+------------------------------------------------------+------+
There may be a better solution, but I think this does what you needed.
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
[('A', 0.8, 10),
('A', 0.6, 15),
('A', 0.3, 20),
('B', 0.8, 10)],
['entity', 'Probability', 'value']
)
w_desc = W.partitionBy('entity').orderBy(F.desc('value'))
w_asc = W.partitionBy('entity').orderBy('value')
df = df.withColumn('_ent_max_val', F.max('value').over(w_desc))
df = df.withColumn('_prob2', 1 - F.col('Probability'))
df = df.withColumn('_cum_prob2', F.product('_prob2').over(w_asc) / F.col('_prob2'))
df = (df.groupBy('entity')
.agg(F.round((F.max('_ent_max_val') * F.product('_prob2')
+ F.sum(F.col('_cum_prob2') * F.col('Probability') * F.col('value'))
),2).alias('mean_value'))
)
df.show()
# +------+----------+
# |entity|mean_value|
# +------+----------+
# | A| 11.4|
# | B| 10.0|
# +------+----------+
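Those figures can be cross-checked with a few lines of plain Python. This is only a sketch of the formula from the question: `expected_value` and `max_value` are hypothetical names, and it assumes, as the window-based answer does, that MAX_VALUE_DEFINED is the entity's largest value and that rows are processed in the order shown.

```python
def expected_value(pairs, max_value):
    """pairs: ordered (probability, value) tuples for one entity."""
    total = 0.0
    remaining = 1.0  # probability that none of the earlier rows applied
    for prob, value in pairs:
        total += remaining * prob * value
        remaining *= 1.0 - prob
    # Whatever probability mass is left falls back to max_value
    return total + remaining * max_value

entity_a = [(0.8, 10), (0.6, 15), (0.3, 20)]
entity_b = [(0.8, 10)]

print(round(expected_value(entity_a, max_value=20), 2))  # 11.4
print(round(expected_value(entity_b, max_value=10), 2))  # 10.0
```

For entity A this expands to 8 + 1.8 + 0.48 + 1.12 = 11.4, the same terms as in the question.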
I have a table with the records shown below.
Id Indicator Date
1 R 2018-01-20
1 R 2018-10-21
1 P 2019-01-22
2 R 2018-02-28
2 P 2018-05-22
2 P 2019-03-05
I need to pick the Ids that had more than two R indicators in the last year and derive a new column called Marked_Flag, set to Y for those Ids and N otherwise. The expected output should look like this:
Id Marked_Flag
1 Y
2 N
What I did so far: I loaded the records into a dataset and then built another dataset from it. The code looks like this:
Dataset<Row> getIndicators = spark.sql("select id, count(indicator) as indi_count from source where indicator = 'R' group by id");
getIndicators.createOrReplaceTempView("getIndicators");
Dataset<Row> getFlag = spark.sql("select id, case when indi_count > 1 then 'Y' else 'N' end as Marked_Flag from getIndicators");
But my lead wants this to be done using a single dataset and Spark transformations. I am pretty new to Spark, so any guidance or code snippet would be very helpful.
Try the following. Note that I am using a PySpark DataFrame here:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
[1, "R", "2018-01-20"],
[1, "R", "2018-10-21"],
[1, "P", "2019-01-22"],
[2, "R", "2018-02-28"],
[2, "P", "2018-05-22"],
[2, "P", "2019-03-05"]], ["Id", "Indicator","Date"])
gr = df.filter(F.col("Indicator")=="R").groupBy("Id").agg(F.count("Indicator"))
gr = gr.withColumn("Marked_Flag", F.when(F.col("count(Indicator)") > 1, "Y").otherwise('N')).drop("count(Indicator)")
gr.show()
# +---+-----------+
# | Id|Marked_Flag|
# +---+-----------+
# | 1| Y|
# | 2| N|
# +---+-----------+
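The counting rule itself can be sketched in plain Python. This is an illustration only: the `records` list restates the sample data plus one made-up Id (3) that has no R rows, and the date condition from the question is left out, as it is in the code above. It shows one thing to watch for: filtering on "R" before grouping drops Ids that have no R rows at all, whereas counting per Id keeps them with flag N.

```python
from collections import Counter

# (Id, Indicator) pairs from the sample data, plus a hypothetical Id 3 with no R rows
records = [
    (1, "R"), (1, "R"), (1, "P"),
    (2, "R"), (2, "P"), (2, "P"),
    (3, "P"),
]

# Count R indicators per Id; Ids without any R simply have a count of 0
r_counts = Counter(rec_id for rec_id, indicator in records if indicator == "R")
all_ids = sorted({rec_id for rec_id, _ in records})

marked_flag = {rec_id: "Y" if r_counts[rec_id] > 1 else "N" for rec_id in all_ids}
print(marked_flag)  # {1: 'Y', 2: 'N', 3: 'N'}
```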
I have the following PySpark DataFrame:
+------+----------------+
| id| data |
+------+----------------+
| 1| [10, 11, 12]|
| 2| [20, 21, 22]|
| 3| [30, 31, 32]|
+------+----------------+
At the end, I want to have the following DataFrame
+--------+----------------------------------+
| id | data |
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+
In order to do this, I first extract the data arrays as follows:
tmp_array = df_test.select("data").rdd.flatMap(lambda x: x).collect()
a0 = tmp_array[0]
a1 = tmp_array[1]
a2 = tmp_array[2]
samples = zip(a0, a1, a2)
samples1 = sc.parallelize(samples)
In this way, I have in samples1 an RDD with the content
[[10,20,30],[11,21,31],[12,22,32]]
Question 1: Is that a good way to do it?
Question 2: How to include that RDD back into the dataframe?
Here is a way to get your desired output without serializing to rdd or using a udf. You will need two constants:
The number of rows in your DataFrame (df.count())
The length of data (given)
Use pyspark.sql.functions.collect_list() and pyspark.sql.functions.array() in a double list comprehension to pick out the elements of "data" in the order you want using pyspark.sql.Column.getItem():
import pyspark.sql.functions as f
dataLength = 3
numRows = df.count()
df.select(
    f.collect_list("id").alias("id"),
    f.array(
        [
            f.array(
                [f.collect_list("data").getItem(j).getItem(i)
                 for j in range(numRows)]
            )
            for i in range(dataLength)
        ]
    ).alias("data")
)\
    .show(truncate=False)
#+---------+------------------------------------------------------------------------------+
#|id |data |
#+---------+------------------------------------------------------------------------------+
#|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
#+---------+------------------------------------------------------------------------------+
You can simply wrap zip in a udf, but before that you will have to gather the arrays with collect_list:
from pyspark.sql import functions as f
from pyspark.sql import types as t
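def zipUdf(array):
    # materialise the zip as nested lists (in Python 3, zip returns a lazy iterator)
    return [list(group) for group in zip(*array)]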
zipping = f.udf(zipUdf, t.ArrayType(t.ArrayType(t.IntegerType())))
df.select(
f.collect_list(df.id).alias('id'),
zipping(f.collect_list(df.data)).alias('data')
).show(truncate=False)
which would give you
+---------+------------------------------------------------------------------------------+
|id |data |
+---------+------------------------------------------------------------------------------+
|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
+---------+------------------------------------------------------------------------------+
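What the zip step does can be seen in plain Python, without Spark. In this small sketch the `arrays` list stands in for the result of collect_list on the data column, assuming the rows arrive in id order:

```python
# Arrays as collect_list("data") would gather them, one per row (order assumed)
arrays = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]

# zip(*arrays) groups the i-th elements of every array; convert each tuple to a list
transposed = [list(column) for column in zip(*arrays)]

print(transposed)  # [[10, 20, 30], [11, 21, 31], [12, 22, 32]]
```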
import numpy as np
data = [
(1, 1, None),
(1, 2, float(5)),
(1, 3, np.nan),
(1, 4, None),
(1, 5, float(10)),
(1, 6, float("nan")),
(1, 6, float("nan")),
]
df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))
Expected output
dataframe with count of nan/null for each column
Note:
The previous questions I found on Stack Overflow only check for null and not nan.
That's why I have created a new question.
I know I can use isnull() function in Spark to find number of Null values in Spark column but how to find Nan values in Spark dataframe?
You can use the method shown here and replace isNull with isnan:
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 3|
+-------+----------+---+
or
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 5|
+-------+----------+---+
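The two counts can be reproduced in plain Python on the id2 column of the sample data, which makes the difference between the two kinds of missing value explicit. This is a sketch only: None stands in for a SQL NULL, and NaN is an ordinary float value that happens to be "not a number".

```python
import math

# The id2 column from the sample data; None plays the role of SQL NULL
id2 = [None, 5.0, float("nan"), None, 10.0, float("nan"), float("nan")]

# NaN is a float, so it has to be tested separately from None
nan_count = sum(1 for v in id2 if v is not None and math.isnan(v))
null_count = sum(1 for v in id2 if v is None)

print(nan_count, null_count, nan_count + null_count)  # 3 2 5
```

The 3 matches the isnan-only query and the 5 matches the combined isnan-or-isNull query.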
To count the null values in each column of a PySpark DataFrame:
Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null
# The output in dict where key is column name and value is null values in that column
{'#': 0,
'Name': 0,
'Type 1': 0,
'Type 2': 386,
'Total': 0,
'HP': 0,
'Attack': 0,
'Defense': 0,
'Sp_Atk': 0,
'Sp_Def': 0,
'Speed': 0,
'Generation': 0,
'Legendary': 0}
To make sure it does not fail for string, date and timestamp columns:
import pyspark.sql.functions as F
def count_missings(spark_df, sort=True):
    """
    Counts number of nulls and nans in each column
    """
    df = spark_df.select([
        F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
        for (c, c_type) in spark_df.dtypes
        if c_type not in ('timestamp', 'string', 'date')
    ]).toPandas()
    if len(df) == 0:
        print("There are no missing values!")
        return None
    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)
    return df
If you want to see the columns sorted by the number of nans and nulls in descending order:
count_missings(spark_df)
# | Col_A | 10 |
# | Col_C | 2 |
# | Col_B | 1 |
If you don't want ordering and want to see them as a single row:
count_missings(spark_df, False)
# | Col_A | Col_B | Col_C |
# | 10 | 1 | 2 |
An alternative to the already provided ways is to simply filter on the column like so
import pyspark.sql.functions as F
df = df.where(F.col('columnNameHere').isNull())
This has the added benefit that you don't have to add another column to do the filtering and it's quick on larger data sets.
Here is my one liner.
Here 'c' is the name of the column
import pyspark.sql.functions as F
df.select('c').withColumn('isNull_c',F.col('c').isNull()).where('isNull_c = True').count()
I prefer this solution:
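from pyspark.sql.functions import count
df = spark.table(selected_table).filter(condition)
counter = df.count()
# count(c) skips nulls, so total rows minus count(c) is the number of nulls in c
df = df.select([(counter - count(c)).alias(c) for c in df.columns])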
Use the following code to identify the null values in every column using PySpark.
import pandas as pd
from pyspark.sql.functions import count, isnull, when
def check_nulls(dataframe):
    '''
    Check null values and return the null values in pandas Dataframe
    INPUT: Spark Dataframe
    OUTPUT: Null values
    '''
    # Create pandas dataframe
    nulls_check = pd.DataFrame(
        dataframe.select([count(when(isnull(c), c)).alias(c) for c in dataframe.columns]).collect(),
        columns=dataframe.columns).transpose()
    nulls_check.columns = ['Null Values']
    return nulls_check
# Check null values
null_df = check_nulls(raw_df)
null_df
from pyspark.sql import DataFrame
import pyspark.sql.functions as fn
# compatible with fn.isnan. Sourced from
# https://github.com/apache/spark/blob/13fd272cd3/python/pyspark/sql/functions.py#L4818-L4836
NUMERIC_DTYPES = (
    'decimal',
    'double',
    'float',
    'int',
    'bigint',
    'smallint',
    'tinyint',
)
def count_nulls(df: DataFrame) -> DataFrame:
    isnan_compat_cols = {c for (c, t) in df.dtypes if any(t.startswith(num_dtype) for num_dtype in NUMERIC_DTYPES)}
    return df.select(
        [fn.count(fn.when(fn.isnan(c) | fn.isnull(c), c)).alias(c) for c in isnan_compat_cols]
        + [fn.count(fn.when(fn.isnull(c), c)).alias(c) for c in set(df.columns) - isnan_compat_cols]
    )
Builds off of gench and user8183279's answers, but checks via only isnull for columns where isnan is not possible, rather than just ignoring them.
The source code of pyspark.sql.functions seemed to have the only documentation I could really find enumerating these names — if others know of some public docs I'd be delighted.
If you are writing Spark SQL, the following will also find the rows with null values, which you can then count.
spark.sql('select * from table where isNULL(column_value)')
Yet another alternative (improved upon Vamsi Krishna's solutions above):
from pyspark.sql.functions import isnan, isnull
def check_for_null_or_nan(df):
    null_or_nan = lambda x: isnan(x) | isnull(x)
    func = lambda x: df.filter(null_or_nan(x)).count()
    print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i)!=0],sep='\n')
check_for_null_or_nan(df)
id2 has 5 nans/nulls
Here is a readable solution because code is for people as much as computers ;-)
df.selectExpr('sum(int(isnull(<col_name>) or isnan(<col_name>))) as null_or_nan_count')
I don't understand the behaviour of this simple PySpark code snippet:
# Create simple test dataframe
l = [('Alice', 1),('Pierre', 3),('Jack', 5), ('Paul', 2)]
df_test = sqlcontext.createDataFrame(l, ['name', 'age'])
# Perform filter then Take 2 oldest
df_test = df_test.sort('age', ascending=False)\
.filter('age < 4') \
.limit(2)
df_test.show(2)
# This outputs as expected :
# +------+---+
# | name|age|
# +------+---+
# |Pierre| 3|
# | Paul| 2|
# +------+---+
df_test.collect()
# This outputs unexpectedly :
# [Row(name=u'Pierre', age=3), Row(name=u'Alice', age=1)]
Is this the expected behaviour of the collect() function? How can I retrieve my column as a list that keeps the right order?
Thanks
I had to use a sorter UDF to resolve this issue:
from pyspark.sql.functions import udf
def sorter(l):
    import operator
    # sort the (key, value) pairs by key, then keep only the values
    res = sorted(l, key=operator.itemgetter(0))
    L1 = [item[1] for item in res]
    # return " ".join(str(x) for x in L)
    return "".join(L1)
sort_udf = udf(sorter)