I have a PySpark df:
Store_ID  Category  ID   Sales
1         A         123  23
2         A         123  45
1         A         234  67
1         B         567  78
2         B         567  34
3         D         789  12
1         A         890  12
Expected:
Store_ID  A_ID  B_ID  C_ID  D_ID  Sales_A  Sales_B  Sales_C  Sales_D
1         3     1     0     0     102      78       0        0
2         1     1     0     0     45       34       0        0
3         0     0     0     1     0        0        0        12
I am able to transform this way using SQL (created a temp view):
SELECT Store_Id,
SUM(IF(Category='A',Sales,0)) AS Sales_A,
SUM(IF(Category='B',Sales,0)) AS Sales_B,
SUM(IF(Category='C',Sales,0)) AS Sales_C,
SUM(IF(Category='D',Sales,0)) AS Sales_D,
COUNT(DISTINCT NULLIF(IF(Category='A',ID,0),0)) AS A_ID,
COUNT(DISTINCT NULLIF(IF(Category='B',ID,0),0)) AS B_ID,
COUNT(DISTINCT NULLIF(IF(Category='C',ID,0),0)) AS C_ID,
COUNT(DISTINCT NULLIF(IF(Category='D',ID,0),0)) AS D_ID
FROM df
GROUP BY Store_Id;
How do we achieve the same in PySpark using native functions, as it's much faster?
This operation is called pivoting. A few things to note, all shown in the script below:
- a couple of aggregations, since you need both the count of ID and the sum of Sales;
- aliases for the aggregations, to control the resulting column names;
- explicit values in pivot, for cases where you want numbers for Category C even though C doesn't exist in the data. Providing values also boosts performance.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(1, 'A', 123, 23),
(2, 'A', 123, 45),
(1, 'A', 234, 67),
(1, 'B', 567, 78),
(2, 'B', 567, 34),
(3, 'D', 789, 12),
(1, 'A', 890, 12)],
['Store_ID', 'Category', 'ID', 'Sales'])
Script:
df = (df
.groupBy('Store_ID')
.pivot('Category', ['A', 'B', 'C', 'D'])
.agg(
F.countDistinct('ID').alias('ID'),
F.sum('Sales').alias('Sales'))
.fillna(0))
df.show()
# +--------+----+-------+----+-------+----+-------+----+-------+
# |Store_ID|A_ID|A_Sales|B_ID|B_Sales|C_ID|C_Sales|D_ID|D_Sales|
# +--------+----+-------+----+-------+----+-------+----+-------+
# | 1| 3| 102| 1| 78| 0| 0| 0| 0|
# | 3| 0| 0| 0| 0| 0| 0| 1| 12|
# | 2| 1| 45| 1| 34| 0| 0| 0| 0|
# +--------+----+-------+----+-------+----+-------+----+-------+
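If you need the exact column names from the expected output (Sales_A rather than A_Sales), a small rename step can follow the pivot. A minimal sketch, assuming the <Category>_ID and <Category>_Sales names produced above:
# Rename only the Sales columns; the <Category>_ID names already match the expected output.
df = df.select(
    'Store_ID',
    *[f'{c}_ID' for c in ['A', 'B', 'C', 'D']],
    *[F.col(f'{c}_Sales').alias(f'Sales_{c}') for c in ['A', 'B', 'C', 'D']],
)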
I have a table structure like this:
unique_id | group | value_1 | value_2 | value_3
abc_xxx   | 1     | 200     | null    | 100
def_xxx   | 1     | 0       | 3       | 40
ghi_xxx   | 2     | 300     | 1       | 2
that I need to extract the following information from:
Total number of rows per group
Count number of rows per group that contain null values.
Count number of rows per group with zero values.
I can do the first one with a simple groupBy and count:
df.groupBy('group').count()
I'm not so sure how to approach the next two, which I need in order to compute the null and zero rates from the total rows per group.
from pyspark.sql.functions import col, count, isnull, lit, sum, when

data = [
    ('abc_xxx', 1, 200, None, 100),
    ('def_xxx', 1, 0, 3, 40),
    ('ghi_xxx', 2, 300, 1, 2),
]
df = spark.createDataFrame(data, ['unique_id', 'group', 'value_1', 'value_2', 'value_3'])

# Flag rows that contain any null / any zero, then aggregate per group
df = df\
    .withColumn('contains_null', when(isnull(col('value_1')) | isnull(col('value_2')) | isnull(col('value_3')), lit(1)).otherwise(lit(0)))\
    .withColumn('contains_zero', when((col('value_1')==0) | (col('value_2')==0) | (col('value_3')==0), lit(1)).otherwise(lit(0)))

df.groupBy('group')\
    .agg(count('unique_id').alias('total_rows'), sum('contains_null').alias('null_value_rows'), sum('contains_zero').alias('zero_value_rows')).show()
+-----+----------+---------------+---------------+
|group|total_rows|null_value_rows|zero_value_rows|
+-----+----------+---------------+---------------+
| 1| 2| 1| 1|
| 2| 1| 0| 0|
+-----+----------+---------------+---------------+
# total_count = (count('value_1') + count('value_2') + count('value_3'))
# null_count = (sum(when(isnull(col('value_1')), lit(1)).otherwise(lit(0)) + when(isnull(col('value_2')), lit(1)).otherwise(lit(0)) + when(isnull(col('value_3')), lit(1)).otherwise(lit(0))))
# zero_count = (sum(when(col('value_1')==0, lit(1)).otherwise(lit(0)) + when(col('value_2')==0, lit(1)).otherwise(lit(0)) + when(col('value_3')==0, lit(1)).otherwise(lit(0))))
# df.groupBy('group')\
# .agg(total_count.alias('total_numbers'), null_count.alias('null_values'), zero_count.alias('zero_values')).show()
#+-----+-------------+-----------+-----------+
#|group|total_numbers|null_values|zero_values|
#+-----+-------------+-----------+-----------+
#| 1| 5| 1| 1|
#| 2| 3| 0| 0|
#+-----+-------------+-----------+-----------+
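To get the null and zero rates mentioned in the question, you can divide the aggregated counts by the total per group. A minimal sketch building on the aggregation above (agg_df and rates_df are just illustrative names):
agg_df = df.groupBy('group')\
    .agg(count('unique_id').alias('total_rows'),
         sum('contains_null').alias('null_value_rows'),
         sum('contains_zero').alias('zero_value_rows'))
# Divide each flag count by the row count of its group
rates_df = agg_df\
    .withColumn('null_rate', col('null_value_rows') / col('total_rows'))\
    .withColumn('zero_rate', col('zero_value_rows') / col('total_rows'))
rates_df.show()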
Input dataframe:
Item  L  W  H
I1    3  5  8
I2    2  1  2
I3    6  9  1
I4    7  3  4
The output dataframe should be as below. Create 3 new columns, L_n, W_n and H_n, from the values in the L, W, H columns: L_n is the longest dimension, W_n the middle one, and H_n the shortest.
Item  L  W  H  L_n  W_n  H_n
I1    3  5  8  8    5    3
I2    2  1  2  2    2    1
I3    6  9  1  9    6    1
I4    7  3  4  7    4    3
I suggest creating an array (array), sorting it (array_sort) and selecting elements one-by-one (element_at).
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('I1', 3, 5, 8),
('I2', 2, 1, 2),
('I3', 6, 9, 1),
('I4', 7, 3, 4)],
['Item', 'L', 'W', 'H']
)
arr = F.array_sort(F.array('L', 'W', 'H'))
df = df.select(
'*',
F.element_at(arr, 3).alias('L_n'),
F.element_at(arr, 2).alias('W_n'),
F.element_at(arr, 1).alias('H_n'),
)
df.show()
# +----+---+---+---+---+---+---+
# |Item| L| W| H|L_n|W_n|H_n|
# +----+---+---+---+---+---+---+
# | I1| 3| 5| 8| 8| 5| 3|
# | I2| 2| 1| 2| 2| 2| 1|
# | I3| 6| 9| 1| 9| 6| 1|
# | I4| 7| 3| 4| 7| 4| 3|
# +----+---+---+---+---+---+---+
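As an alternative sketch (not part of the answer above): greatest and least pick the longest and shortest of the three dimensions directly, and the middle value falls out as the remainder of the sum. This only works because there are exactly three numeric columns:
# Starting from the original Item/L/W/H dataframe (before the select above)
df_alt = df.select(
    'Item', 'L', 'W', 'H',
    F.greatest('L', 'W', 'H').alias('L_n'),
    (F.col('L') + F.col('W') + F.col('H')
     - F.greatest('L', 'W', 'H') - F.least('L', 'W', 'H')).alias('W_n'),
    F.least('L', 'W', 'H').alias('H_n'),
)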
I want to compare an input DataFrame with a main DataFrame and return the Point value of the matching row alongside the input data. Consider the example below.
Input DataFrame
A  B  C
1  0  1
0  0  0
1  1  1
0  1  1
Main DataFrame
A  B  C  Point
1  1  1  P1
1  0  1  P2
After comparing the input with the main DataFrame, the result should be as below.
Output DataFrame
A  B  C  Point
1  0  1  P2
0  0  0  NA
1  1  1  P1
0  1  1  NA
You can use a left join:
from pyspark.sql import functions as F
result_df = input_df.join(main_df, ["A", "B", "C"], "left") \
.withColumn("Point", F.coalesce(F.col("Point"), F.lit("NA")))
result_df.show()
#+---+---+---+-----+
#| A| B| C|Point|
#+---+---+---+-----+
#| 0| 0| 0| NA|
#| 1| 0| 1| P2|
#| 1| 1| 1| P1|
#| 0| 1| 1| NA|
#+---+---+---+-----+
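For reference, a minimal sketch of the two example DataFrames assumed by the join above (the names input_df and main_df match the snippet; the data mirrors the tables in the question):
input_df = spark.createDataFrame(
    [(1, 0, 1), (0, 0, 0), (1, 1, 1), (0, 1, 1)],
    ['A', 'B', 'C'])
main_df = spark.createDataFrame(
    [(1, 1, 1, 'P1'), (1, 0, 1, 'P2')],
    ['A', 'B', 'C', 'Point'])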
Say I have a list of column names, and they all exist in the dataframe:
Cols = ['A', 'B', 'C', 'D']
I am looking for a quick way to get a table/dataframe like:
   NA_counts  min  max
A  5          0    100
B  10         0    120
C  8          1    99
D  2          0    500
TIA
You can calculate each metric separately and then union them all, like this:
from pyspark.sql.functions import col, lit, max, min, sum, when

cols = ['A', 'B', 'C', 'D']
nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
max_cols = [max(col(c)).alias(c) for c in cols]
min_cols = [min(col(c)).alias(c) for c in cols]
nulls_df = df.select(lit("NA_counts").alias("count"), *nulls_cols)
max_df = df.select(lit("Max").alias("count"), *max_cols)
min_df = df.select(lit("Min").alias("count"), *min_cols)
nulls_df.unionAll(max_df).unionAll(min_df).show()
Output example:
+---------+---+---+----+----+
| count| A| B| C| D|
+---------+---+---+----+----+
|NA_counts| 1| 0| 3| 1|
| Max| 9| 5|Test|2017|
| Min| 1| 0|Test|2010|
+---------+---+---+----+----+
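If you want the transposed layout from the question (one row per column, metrics as columns), one option is to collect a single aggregate row and rebuild a small DataFrame from it. A hedged sketch, assuming the columns are numeric and reusing cols and the imports above:
# One global aggregation row with per-column null count, min and max
row = df.select(
    *[sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c + '_na') for c in cols],
    *[min(col(c)).alias(c + '_min') for c in cols],
    *[max(col(c)).alias(c + '_max') for c in cols],
).first()
# Rebuild a small summary DataFrame, one row per original column
summary_df = spark.createDataFrame(
    [(c, row[c + '_na'], row[c + '_min'], row[c + '_max']) for c in cols],
    ['column', 'NA_counts', 'min', 'max'])
summary_df.show()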
I work with a DataFrame in PySpark.
I have the following task: for each row, check how many of the column values are > 2 and store that count in times. For u1 it is 0, for u2 it is 2, etc.
user a b c d times
u1 1 0 1 0 0
u2 0 1 4 3 2
u3 2 1 7 0 1
My solution is below. It works, but I'm not sure it is the best way and I haven't tried it on really big data yet. I don't like the transform to RDD and back to a DataFrame. Is there anything better? My first thought was to calculate per column with a UDF, but I didn't find a way to accumulate and sum all the results per row:
def calculate_times(row):
    times = 0
    for index, item in enumerate(row):
        if not isinstance(item, basestring):
            if item > 2:
                times = times + 1
    return times

def add_column(pair):
    return dict(pair[0].asDict().items() + [("is_outlier", pair[1])])

def calculate_times_for_all(df):
    rdd_with_times = df.rdd.map(lambda row: calculate_times(row))
    rdd_final = df.rdd.zip(rdd_with_times).map(add_column)
    df_final = sqlContext.createDataFrame(rdd_final)
    return df_final
For this solution I used this topic:
How do you add a numpy.array as a new column to a pyspark.SQL DataFrame?
Thanks!
It is just a simple one-liner. Example data:
df = sc.parallelize([
("u1", 1, 0, 1, 0), ("u2", 0, 1, 4, 3), ("u3", 2, 1, 7, 0)
]).toDF(["user", "a", "b", "c", "d"])
withColumn:
df.withColumn("times", sum((df[c] > 2).cast("int") for c in df.columns[1:]))
and the result:
+----+---+---+---+---+-----+
|user| a| b| c| d|times|
+----+---+---+---+---+-----+
| u1| 1| 0| 1| 0| 0|
| u2| 0| 1| 4| 3| 2|
| u3| 2| 1| 7| 0| 1|
+----+---+---+---+---+-----+
Note:
If the columns are nullable you should correct for that, for example using coalesce:
from pyspark.sql.functions import coalesce, lit
sum(coalesce((df[c] > 2).cast("int"), lit(0)) for c in df.columns[1:])
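Putting the nullable-safe variant together (a sketch, same approach as the one-liner above):
# Count, per row, how many of the non-user columns exceed 2, treating nulls as 0
df = df.withColumn(
    "times",
    sum(coalesce((df[c] > 2).cast("int"), lit(0)) for c in df.columns[1:])
)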