Count rows of a dataset based on a column value in PySpark - apache-spark

I'm working with PySpark. I have a dataset (reproduced as temp in the answer below) and I want to count the rows of the dataset grouped by the values of my "Column3" column.
For example, I want to get a result with one row per distinct "Column3" value and the corresponding row count.

pyspark.sql.functions.count(col):
Aggregate function: returns the number of items in a group.
from pyspark.sql.functions import count

temp = spark.createDataFrame([
    (0, 11, 'A'),
    (1, 12, 'B'),
    (2, 13, 'B'),
    (0, 14, 'A'),
    (1, 15, 'c'),
    (2, 16, 'A'),
], ["column1", "column2", "column3"])

temp.groupBy('column3').agg(count('*').alias('count')).sort('column3').show(10, False)
# +-------+-----+
# |column3|count|
# +-------+-----+
# |A      |3    |
# |B      |2    |
# |c      |1    |
# +-------+-----+

df.groupBy('column3').count()
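Applied to the temp DataFrame from the answer above, this shorthand produces the same counts; a quick sketch of the expected output:
temp.groupBy('column3').count().sort('column3').show()
# +-------+-----+
# |column3|count|
# +-------+-----+
# |      A|    3|
# |      B|    2|
# |      c|    1|
# +-------+-----+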

Related

How do I create a pivot this way in PySpark?

I have a PySpark dataframe df:
| STORE | COL_APPLE_BB | COL_APPLE_NONBB | COL_PEAR_BB | COL_PEAR_NONBB | COL_ORANGE_BB | COL_ORANGE_NONBB | COL_GRAPE_BB | COL_GRAPE_NONBB |
|-------|--------------|-----------------|-------------|----------------|---------------|------------------|--------------|-----------------|
| 1     | 28           | 24              | 24          | 32             | 26            | 54               | 60           | 36              |
| 2     | 19           | 12              | 24          | 13             | 10            | 24               | 29           | 10              |
I have another PySpark dataframe df2:
| STORE | PDT | FRUIT  | TYPE  |
|-------|-----|--------|-------|
| 1     | 1   | APPLE  | BB    |
| 1     | 2   | ORANGE | NONBB |
| 1     | 3   | PEAR   | BB    |
| 1     | 4   | GRAPE  | BB    |
| 1     | 5   | APPLE  | BB    |
| 1     | 6   | ORANGE | BB    |
| 2     | 1   | PEAR   | NONBB |
| 2     | 2   | ORANGE | NONBB |
| 2     | 3   | APPLE  | NONBB |
Expected output: df2 with a COL_VALUE column for the respective store, fruit, and type:
| STORE | PDT | FRUIT  | TYPE  | COL_VALUE |
|-------|-----|--------|-------|-----------|
| 1     | 1   | APPLE  | BB    | 28        |
| 1     | 2   | ORANGE | NONBB | 54        |
| 1     | 3   | PEAR   | BB    | 24        |
| 1     | 4   | GRAPE  | BB    | 60        |
| 1     | 5   | APPLE  | BB    | 28        |
| 1     | 6   | ORANGE | BB    | 26        |
| 2     | 1   | PEAR   | NONBB | 13        |
| 2     | 2   | ORANGE | NONBB | 24        |
| 2     | 3   | APPLE  | NONBB | 12        |
from pyspark.sql.functions import expr, concat_ws, col

df = spark.createDataFrame(
    [
        (1, 28, 24, 24, 32, 26, 54, 60, 36),
        (2, 19, 12, 24, 13, 10, 24, 29, 10)
    ],
    ["STORE", "COL_APPLE_BB", "COL_APPLE_NONBB", "COL_PEAR_BB", "COL_PEAR_NONBB",
     "COL_ORANGE_BB", "COL_ORANGE_NONBB", "COL_GRAPE_BB", "COL_GRAPE_NONBB"]
)
df2 = spark.createDataFrame(
    [
        (1, 1, "APPLE", "BB"),
        (1, 2, "ORANGE", "NONBB"),
        (1, 3, "PEAR", "BB"),
        (1, 4, "GRAPE", "BB"),
        (1, 5, "APPLE", "BB"),
        (1, 6, "ORANGE", "BB"),
        (2, 1, "PEAR", "NONBB"),
        (2, 2, "ORANGE", "NONBB"),
        (2, 3, "APPLE", "NONBB")
    ],
    ["STORE", "PDT", "FRUIT", "TYPE"]
)

# Unpivot df: stack pairs of ('FRUIT_TYPE' label, column value) into long form
unPivot_df = df.select(
    "STORE",
    expr("""stack(8,
        'APPLE_BB', COL_APPLE_BB,
        'APPLE_NONBB', COL_APPLE_NONBB,
        'PEAR_BB', COL_PEAR_BB,
        'PEAR_NONBB', COL_PEAR_NONBB,
        'ORANGE_BB', COL_ORANGE_BB,
        'ORANGE_NONBB', COL_ORANGE_NONBB,
        'GRAPE_BB', COL_GRAPE_BB,
        'GRAPE_NONBB', COL_GRAPE_NONBB) as (Appended, COL_VALUE)""")
)

# Build the matching FRUIT_TYPE key on df2 and join
df2 = df2.withColumn("Appended", concat_ws('_', col("FRUIT"), col("TYPE")))
df2 = df2.join(unPivot_df, ['STORE', "Appended"], "left")
df2.show()
+-----+------------+---+------+-----+---------+
|STORE| Appended|PDT| FRUIT| TYPE|COL_VALUE|
+-----+------------+---+------+-----+---------+
| 1|ORANGE_NONBB| 2|ORANGE|NONBB| 54|
| 1| PEAR_BB| 3| PEAR| BB| 24|
| 1| GRAPE_BB| 4| GRAPE| BB| 60|
| 1| APPLE_BB| 1| APPLE| BB| 28|
| 2|ORANGE_NONBB| 2|ORANGE|NONBB| 24|
| 2| APPLE_NONBB| 3| APPLE|NONBB| 12|
| 1| ORANGE_BB| 6|ORANGE| BB| 26|
| 1| APPLE_BB| 5| APPLE| BB| 28|
| 2| PEAR_NONBB| 1| PEAR|NONBB| 13|
+-----+------------+---+------+-----+---------+
If you have Spark 3.2 or higher you could use something like:
data = data.melt(
    id_vars=['STORE'],
    value_vars=data.columns[1:],
    var_name="variable",
    value_name="value"
)
to get a "long" form of the dataset, and then use regex_extract twice to get the required information from the variable column.
For earlier versions of Spark, use the following:
def process_row(row):
    # Turn one wide row into (STORE, column index, FRUIT, TYPE, value) tuples
    output = []
    for index, key in enumerate(row.asDict()):
        if key == "STORE":
            store = row[key]
        else:
            _, fruit, type = key.split("_")
            output.append((store, index, fruit, type, row[key]))
    return output

data = data.rdd.flatMap(process_row).toDF(
    schema=["STORE", "PDT", "FRUIT", "TYPE", "COLUMN_VALUE"]
)
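To reach the expected output, this long-form frame still has to be joined back to df2; a minimal sketch (the PDT produced by process_row is just a column index, not the PDT from df2, so it is dropped before the join):
# Rename to the expected column name and drop the synthetic PDT index
long_df = data.withColumnRenamed("COLUMN_VALUE", "COL_VALUE").drop("PDT")
result = df2.join(long_df, on=["STORE", "FRUIT", "TYPE"], how="left")
result.show()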
Alternatively to melt, you can use stack in earlier Spark versions:
df = spark.createDataFrame(
    [
        (1, 28, 24),
        (2, 19, 12),
    ],
    ["STORE", "COL_APPLE_BB", "COL_APPLE_NONBB"]
)
df2 = spark.createDataFrame(
    [
        (1, 1, "APPLE", "BB"),
        (1, 2, "ORANGE", "NONBB"),
        (1, 2, "APPLE", "NONBB"),
        (2, 3, "APPLE", "NONBB")
    ],
    ["STORE", "PDT", "FRUIT", "TYPE"]
)
Create a column that matches the "COL_FRUIT_TYPE" in df:
from pyspark.sql import functions as F

df3 = df2.withColumn("fruit_type", F.concat(F.lit("COL_"), F.col("FRUIT"), F.lit("_"), F.col("TYPE")))
df3.show(10, False)
gives:
+-----+---+------+-----+----------------+
|STORE|PDT|FRUIT |TYPE |fruit_type |
+-----+---+------+-----+----------------+
|1 |1 |APPLE |BB |COL_APPLE_BB |
|1 |2 |ORANGE|NONBB|COL_ORANGE_NONBB|
|1 |2 |APPLE |NONBB|COL_APPLE_NONBB |
+-----+---+------+-----+----------------+
Then "unpivot" the first df:
from pyspark.sql.functions import expr
unpivotExpr = "stack({}, {}) as (fruit_type, COL_VALUE)".format(len(df.columns) - 1, ','.join( [("'{}', {}".format(c, str(c))) for c in df.columns[1:]] ) )
print(unpivotExpr)
unPivotDF = df.select("STORE", expr(unpivotExpr)) \
.where("STORE is not null")
unPivotDF.show(truncate=False)
The stack function takes as arguments the number of "columns" it will be unpivoting (here derived as len(df.columns) - 1, since we skip the STORE column), followed by a list of label, value pairs in the form 'col_name', col_name.
The list comprehension ["'{}', {}".format(c, c) for c in df.columns[1:]] takes every column of df except the first one (STORE) and produces such a pair for each, e.g. 'COL_APPLE_BB', COL_APPLE_BB. These pairs are then joined into a comma-separated string with ",".join(...) and substituted into the {} placeholder.
Example how stack function is usually called:
"stack(2, 'COL_APPLE_BB', COL_APPLE_BB, 'COL_APPLE_NONBB', COL_APPLE_NONBB) as (fruit_type, COL_VALUE)"
The unPivotDF.show(truncate=False) outputs:
+-----+---------------+---------+
|STORE|fruit_type |COL_VALUE|
+-----+---------------+---------+
|1 |COL_APPLE_BB |28 |
|1 |COL_APPLE_NONBB|24 |
|2 |COL_APPLE_BB |19 |
|2 |COL_APPLE_NONBB|12 |
+-----+---------------+---------+
and join the two:
df3.join(unPivotDF, ["fruit_type", "STORE"], "left")\
.select("STORE", "PDT", "FRUIT", "TYPE", "COL_VALUE").show(40, False)
result:
+-----+---+------+-----+---------+
|STORE|PDT|FRUIT |TYPE |COL_VALUE|
+-----+---+------+-----+---------+
|1 |2 |ORANGE|NONBB|null |
|1 |2 |APPLE |NONBB|24 |
|1 |1 |APPLE |BB |28 |
|2 |3 |APPLE |NONBB|12 |
+-----+---+------+-----+---------+
The drawback is that you need to enumerate the column names in stack; if I figure out a way to do this automatically, I will update the answer.
EDIT: I have updated the use of the stack function, so that it can derive the columns by itself.

PySpark max value for multiple columns

I have the below table:
df = spark.createDataFrame(
    [('a', 1, 11, 44),
     ('b', 2, 21, 33),
     ('a', 2, 10, 40),
     ('c', 5, 55, 45),
     ('b', 4, 22, 35),
     ('a', 3, 9, 45)],
    ['id', 'left', 'right', 'centre'])
I need to find and display only the max values as shown below:
(The expected output was posted as a screenshot: https://i.stack.imgur.com/q8bGq.png)
Simple groupBy and agg:
from pyspark.sql import functions as F
df = df.groupBy('id').agg(
    F.max('left').alias('max_left'),
    F.max('right').alias('max_right'),
    F.max('centre').alias('max_centre'),
)
df.show()
# +---+--------+---------+----------+
# | id|max_left|max_right|max_centre|
# +---+--------+---------+----------+
# | b| 4| 22| 35|
# | a| 3| 11| 45|
# | c| 5| 55| 45|
# +---+--------+---------+----------+
Or slightly more advanced:
df = df.groupBy('id').agg(
    *[F.max(c).alias(f'max_{c}') for c in df.columns if c != 'id']
)
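The same comprehension extends to several aggregate functions per column; a sketch, assuming df is still the original input frame and that minimums are also wanted (the min part is only an illustration, not from the question):
from pyspark.sql import functions as F

aggs = []
for c in df.columns:
    if c != 'id':
        # one max and one min column per original column
        aggs += [F.max(c).alias(f'max_{c}'), F.min(c).alias(f'min_{c}')]

df.groupBy('id').agg(*aggs).show()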

How to pivot and rename columns based on several grouped columns

I am having trouble using the agg function and renaming the results properly. So far I have a table in the following format:
| sheet | equipment | chamber | time | value1 | value2 |
|-------|-----------|---------|------|--------|--------|
| a     | E1        | C1      | 1    | 11     | 21     |
| a     | E1        | C1      | 2    | 12     | 22     |
| a     | E1        | C1      | 3    | 13     | 23     |
| b     | E1        | C1      | 1    | 14     | 24     |
| b     | E1        | C1      | 2    | 15     | 25     |
| b     | E1        | C1      | 3    | 16     | 26     |
I would like to create a statistical table like this:
| sheet | E1_C1_value1_mean | E1_C1_value1_min | E1_C1_value1_max | E1_C1_value2_mean | E1_C1_value2_min | E1_C1_value2_max |
|-------|-------------------|------------------|------------------|-------------------|------------------|------------------|
| a     | 12                | 11               | 13               | 22                | 21               | 23               |
| b     | 15                | 14               | 16               | 25                | 24               | 26               |
I would like to groupBy "sheet", "equipment", and "chamber" to aggregate the mean, min, and max values.
I also need to rename the columns by the rule: equipment + chamber + aggregation function.
There are multiple "equipment" names and "chamber" names.
Since pivot in Spark only accepts a single column, you have to concatenate the columns you want to pivot on:
from pyspark.sql import functions as func

df = spark.createDataFrame(
    [
        ('a', 'E1', 'C1', 1, 11, 21),
        ('a', 'E1', 'C1', 2, 12, 22),
        ('a', 'E1', 'C1', 3, 13, 23),
        ('b', 'E1', 'C1', 1, 14, 24),
        ('b', 'E1', 'C1', 2, 15, 25),
        ('b', 'E1', 'C1', 3, 16, 26),
    ],
    schema=['sheet', 'equipment', 'chamber', 'time', 'value1', 'value2']
)
df.printSchema()
df.show(10, False)
+-----+---------+-------+----+------+------+
|sheet|equipment|chamber|time|value1|value2|
+-----+---------+-------+----+------+------+
|a |E1 |C1 |1 |11 |21 |
|a |E1 |C1 |2 |12 |22 |
|a |E1 |C1 |3 |13 |23 |
|b |E1 |C1 |1 |14 |24 |
|b |E1 |C1 |2 |15 |25 |
|b |E1 |C1 |3 |16 |26 |
+-----+---------+-------+----+------+------+
Assuming there are lots of columns you want to aggregate, you can build the aggregation list in a loop to avoid bulky code:
aggregation = []
for col in df.columns[-2:]:
    aggregation += [
        func.min(col).alias(f"{col}_min"),
        func.max(col).alias(f"{col}_max"),
        func.avg(col).alias(f"{col}_mean"),
    ]

df.withColumn('new_col', func.concat_ws('_', func.col('equipment'), func.col('chamber')))\
    .groupby('sheet')\
    .pivot('new_col')\
    .agg(*aggregation)\
    .orderBy('sheet')\
    .show(100, False)
+-----+----------------+----------------+-----------------+----------------+----------------+-----------------+
|sheet|E1_C1_value1_min|E1_C1_value1_max|E1_C1_value1_mean|E1_C1_value2_min|E1_C1_value2_max|E1_C1_value2_mean|
+-----+----------------+----------------+-----------------+----------------+----------------+-----------------+
|a |11 |13 |12.0 |21 |23 |22.0 |
|b |14 |16 |15.0 |24 |26 |25.0 |
+-----+----------------+----------------+-----------------+----------------+----------------+-----------------+
First, create a column out of those which you want to pivot.
Then, pivot and aggregate as usual.
Input dataframe:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('a', 'E1', 'C1', 1, 11, 21),
     ('a', 'E1', 'C1', 2, 12, 22),
     ('a', 'E1', 'C1', 3, 13, 23),
     ('b', 'E1', 'C1', 1, 14, 24),
     ('b', 'E1', 'C1', 2, 15, 25),
     ('b', 'E1', 'C1', 3, 16, 26)],
    ['sheet', 'equipment', 'chamber', 'time', 'value1', 'value2'])
Script:
df = df.withColumn('_temp', F.concat_ws('_', 'equipment', 'chamber'))
df = (df
    .groupBy('sheet')
    .pivot('_temp')
    .agg(
        F.mean('value1').alias('value1_mean'),
        F.min('value1').alias('value1_min'),
        F.max('value1').alias('value1_max'),
        F.mean('value2').alias('value2_mean'),
        F.min('value2').alias('value2_min'),
        F.max('value2').alias('value2_max'),
    )
)
df.show()
# +-----+-----------------+----------------+----------------+-----------------+----------------+----------------+
# |sheet|E1_C1_value1_mean|E1_C1_value1_min|E1_C1_value1_max|E1_C1_value2_mean|E1_C1_value2_min|E1_C1_value2_max|
# +-----+-----------------+----------------+----------------+-----------------+----------------+----------------+
# | b| 15.0| 14| 16| 25.0| 24| 26|
# | a| 12.0| 11| 13| 22.0| 21| 23|
# +-----+-----------------+----------------+----------------+-----------------+----------------+----------------+

Spark: Create dataframe from arrays in a column

I have a Spark dataframe (using Scala) with a column arrays that contains Array[Array[Int]], i.e.
var data = Seq(
  ((1, 2, 3), (3, 4, 5), (6, 7, 8)),
  ((1, 5, 7), (3, 4, 5), (6, 3, 0)),
  ...
).toDF("arrays")
I want to create a new dataframe in which each row contains one Array[Int] and there should be no repetitions. For example, the dataframe above would become:
+-----------+
| array |
+-----------+
| (1, 2, 3) |
| (3, 4, 5) |
| (6, 7, 8) |
| (1, 5, 7) |
| (6, 3, 0) |
+-----------+
where (3, 4, 5) appears only once.
Try exploding the arrays column and dropping duplicates on the exploded values:
df.select(explode(df.arrays).alias("array")).dropDuplicates()
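For reference, a PySpark sketch of the same idea end to end, since the question's snippet is only illustrative; the array<array<int>> column name "arrays" mirrors the question:
from pyspark.sql import functions as F

data = spark.createDataFrame(
    [([[1, 2, 3], [3, 4, 5], [6, 7, 8]],),
     ([[1, 5, 7], [3, 4, 5], [6, 3, 0]],)],
    ["arrays"],
)

# One row per inner array; dropDuplicates then removes the repeated [3, 4, 5]
data.select(F.explode("arrays").alias("array")).dropDuplicates().show()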

how to deal with hundreds of colums data from textfile when training a model using spark ml

I have a textfile with hundreds of columns, but the columns don't have column names.
The first column is the label and the others are features. I've read some examples that require specifying column names for the training data, but it is quite troublesome to specify all the names since there are too many columns.
How can I deal with this situation?
You can use VectorAssembler in combination with a list comprehension to structure your data for model training. Consider this example data with two feature columns (x1 and x2) and a response variable y.
df = sc.parallelize([(5, 1, 6),
                     (6, 9, 4),
                     (5, 3, 3),
                     (4, 4, 2),
                     (4, 5, 1),
                     (2, 2, 2),
                     (1, 7, 3)]).toDF(["y", "x1", "x2"])
First, we create a list of column names that are not "y":
colsList = [x for x in df.columns if x != 'y']
Now, we can use VectorAssembler:
from pyspark.ml.feature import VectorAssembler
vectorizer = VectorAssembler()
vectorizer.setInputCols(colsList)
vectorizer.setOutputCol("features")
output = vectorizer.transform(df)
output.select("features", "y").show()
+---------+---+
| features| y|
+---------+---+
|[1.0,6.0]| 5|
|[9.0,4.0]| 6|
|[3.0,3.0]| 5|
|[4.0,2.0]| 4|
|[5.0,1.0]| 4|
|[2.0,2.0]| 2|
|[7.0,3.0]| 1|
+---------+---+
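To tie this back to the original question (a headerless text file with hundreds of columns), here is a hedged sketch; the file path, the comma delimiter, and the assumption that the label sits in the first column are placeholders rather than details from the question:
from pyspark.ml.feature import VectorAssembler

# Read without a header; Spark auto-names the columns _c0, _c1, ...
# "data.txt" and sep="," are assumed here, not taken from the question.
raw = spark.read.csv("data.txt", sep=",", inferSchema=True, header=False)

# Treat the first auto-generated column as the label
df = raw.withColumnRenamed(raw.columns[0], "label")

assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "label"],  # every remaining column is a feature
    outputCol="features",
)
train_data = assembler.transform(df).select("label", "features")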
