how can I pair rows with respect to a group? - apache-spark

how can I pair the elements of each row with respect to a group?
id  title  comp
1   'A'    45
1   'B'    32
1   'C'    1
2   'D'    5
2   'F'    6
I want to pair rows if they have the same 'id'. Expected output:
id  title    comp
1   'A','B'  45,32
1   'B','C'  32,1
2   'D','F'  5,6

Use a window function: for each row, collect a list of that row's value and the immediately following one in each target column, then convert the resulting arrays into comma-separated strings with array_join. The last row of each group collects only itself, so filter out rows whose list has a single element.
from pyspark.sql import functions as F
from pyspark.sql import Window

# Pair each row with the next row in the same id group, ordered by title
win = Window.partitionBy('id').orderBy(F.asc('title')).rowsBetween(0, 1)
df.select(
    "id",
    *[F.array_join(F.collect_list(c).over(win), ',').alias(c)
      for c in df.drop('id').columns]
).filter(F.instr(F.col('title'), ',') > 0).show()
+---+-----+-----+
| id|title| comp|
+---+-----+-----+
|  1|  A,B|45,32|
|  1|  B,C| 32,1|
|  2|  D,F|  5,6|
+---+-----+-----+
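The sliding-window pairing can be sketched in plain Python as well (a minimal illustration, assuming the rows are already sorted by id and title, and keeping only pairs within one id, as the question asks):

```python
# Rows as (id, title, comp), already sorted by id and title
rows = [(1, 'A', 45), (1, 'B', 32), (1, 'C', 1), (2, 'D', 5), (2, 'F', 6)]

# Pair each row with its immediate successor, keeping only pairs whose
# two rows share the same id (the last row of a group has no partner)
paired = [
    (a[0], f"{a[1]},{b[1]}", f"{a[2]},{b[2]}")
    for a, b in zip(rows, rows[1:])
    if a[0] == b[0]
]
print(paired)  # [(1, 'A,B', '45,32'), (1, 'B,C', '32,1'), (2, 'D,F', '5,6')]
```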

Related

transition matrix from pyspark dataframe

I have two columns (such as):
from  to
1     2
1     3
2     4
4     2
4     2
4     3
3     3
And I want to create a transition matrix (where each column sums to 1):
      1     2     3    4
1     0     0     0    0
2     0.5*  0     0    2/3
3     0.5   0.5   1    1/3
4     0     0.5   0    0
where the 1 -> 2 entry (marked *) would be: (the number of times 1 in 'from' transitions to 2 in 'to') / (the total number of times 1 points to any value).
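That normalisation rule (count of a given from→to pair divided by all transitions leaving that from value) can be sketched in plain Python; the seven pairs below are my reading of the example table, taking its values as alternating from/to rows:

```python
from collections import Counter

pairs = [(1, 2), (1, 3), (2, 4), (4, 2), (4, 2), (4, 3), (3, 3)]

transitions = Counter(pairs)              # count of each (from, to) pair
outgoing = Counter(f for f, _ in pairs)   # total transitions leaving each 'from'

# P(from -> to) = count(from, to) / count(from, anything)
probs = {p: n / outgoing[p[0]] for p, n in transitions.items()}
print(probs[(1, 2)])  # 0.5 : one 1->2 out of two transitions leaving 1
```

Each column of the resulting matrix (all entries sharing one 'from' value) then sums to 1 by construction.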
You can create this kind of transition matrix using a window and pivot.
First some dummy data:
import pandas as pd
import numpy as np
np.random.seed(42)
x = np.random.randint(1,5,100)
y = np.random.randint(1,5,100)
df = spark.createDataFrame(pd.DataFrame({'from': x, 'to': y}))
df.show()
+----+---+
|from| to|
+----+---+
| 3| 3|
| 4| 2|
| 1| 2|
...
To create a pct column, first group the data by unique combinations of from/to and get the counts. With that aggregated dataframe, create a new column, pct, that uses a Window to find the total number of records for each from group, which is used as the denominator.
Lastly, pivot the table to make the to values columns and the pct data the values of the matrix.
from pyspark.sql import functions as F, Window
w = Window().partitionBy('from')
grp = df.groupBy('from', 'to').count().withColumn('pct', F.col('count') / F.sum('count').over(w))
res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2))
res.show()
+----+----+----+----+----+
|from| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 0.2| 0.2|0.25|0.35|
| 2|0.27|0.31|0.19|0.23|
| 3|0.46|0.17|0.21|0.17|
| 4|0.13|0.13| 0.5|0.23|
+----+----+----+----+----+

pyspark value of column when other column has first nonmissing value

Suppose I have the following pyspark dataframe df:
id date var1 var2
1 1 NULL 2
1 2 b 3
2 1 a NULL
2 2 a 1
I want the first non-missing observation for all var* columns, and additionally the value of date it comes from, i.e. the final result should look like:
id var1 dt_var1 var2 dt_var2
1 b 2 2 1
2 a 1 1 2
Getting the values is straightforward using
df.orderBy(['id', 'date']).groupby('id').agg(
    *[F.first(x, ignorenulls=True).alias(x) for x in ['var1', 'var2']]
)
But I fail to see how I could get the respective dates. I could loop variable for variable, drop missing, and keep the first row. But this sounds like a poor solution that will not scale well, as it would require a separate dataframe for each variable.
I would prefer a solution that scales to many columns (var3, var4,...)
You should not use groupby if you want to get the first non-null according to date ordering. The order is not guaranteed after a groupBy operation even if you called orderBy just before.
You need to use window functions instead. To get the date associated with each var value you can use this trick with structs:
from pyspark.sql import Window, functions as F

w = (Window.partitionBy("id").orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df1 = df.select(
    "id",
    *[F.first(
        F.when(F.col(x).isNotNull(), F.struct(x, F.col("date").alias(f"dt_{x}"))),
        ignorenulls=True
      ).over(w).alias(x)
      for x in ["var1", "var2"]]
).distinct().select("id", "var1.*", "var2.*")
df1.show()
#+---+----+-------+----+-------+
#| id|var1|dt_var1|var2|dt_var2|
#+---+----+-------+----+-------+
#| 1| b| 2| 2| 1|
#| 2| a| 1| 1| 2|
#+---+----+-------+----+-------+
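The logic of that struct trick — take, per id and per var column, the (date, value) of the earliest row whose value is non-null — can be checked locally in plain Python (column names and sample rows copied from the question):

```python
from collections import defaultdict

rows = [
    (1, 1, None, 2),   # (id, date, var1, var2)
    (1, 2, 'b', 3),
    (2, 1, 'a', None),
    (2, 2, 'a', 1),
]

result = defaultdict(dict)
for idx, var in ((2, 'var1'), (3, 'var2')):
    non_null = defaultdict(list)
    for r in rows:
        if r[idx] is not None:
            non_null[r[0]].append((r[1], r[idx]))  # (date, value)
    for id_, cand in non_null.items():
        date, value = min(cand)  # earliest date, like F.first over the ordered window
        result[id_][var] = value
        result[id_][f'dt_{var}'] = date

print(dict(result))
# {1: {'var1': 'b', 'dt_var1': 2, 'var2': 2, 'dt_var2': 1},
#  2: {'var1': 'a', 'dt_var1': 1, 'var2': 1, 'dt_var2': 2}}
```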

how can I make a column pair with respect to a group?

I have a dataframe and an id column as a group. For each id I want to pair its elements in the following way:
title   id
sal     1
summer  1
fada    1
row     2
winter  2
gole    2
jack    3
noway   3
output
title   id  pair
sal     1   None
summer  1   summer,sal
fada    1   fada,summer
row     2   None
winter  2   winter,row
gole    2   gole,winter
jack    3   None
noway   3   noway,jack
As you can see in the output, each element is paired with the element above it within its id group. Since the first element of each group has no pair, I put None. I should also mention that this can be done in pandas by the following code, but I need PySpark code since my data is big.
df=data.assign(pair=data.groupby('id')['title'].apply(lambda x: x.str.cat(x.shift(1),sep=',')))
I can't emphasise enough that a Spark dataframe is an unordered collection of rows, so saying something like "the element above it" is undefined without a column to order by. You can fake an ordering using F.monotonically_increasing_id(), but I'm not sure if that's what you wanted.
from pyspark.sql import functions as F, Window

w = Window.partitionBy('id').orderBy(F.monotonically_increasing_id())
df2 = df.withColumn(
    'pair',
    F.when(
        F.lag('title').over(w).isNotNull(),
        F.concat_ws(',', 'title', F.lag('title').over(w))
    )
)
df2.show()
+------+---+-----------+
| title| id| pair|
+------+---+-----------+
| sal| 1| null|
|summer| 1| summer,sal|
| fada| 1|fada,summer|
| jack| 3| null|
| noway| 3| noway,jack|
| row| 2| null|
|winter| 2| winter,row|
| gole| 2|gole,winter|
+------+---+-----------+
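Setting the ordering caveat aside for a moment, the lag-and-concat logic itself is easy to sketch in plain Python, assuming the rows arrive in the order shown in the question:

```python
data = [('sal', 1), ('summer', 1), ('fada', 1), ('row', 2),
        ('winter', 2), ('gole', 2), ('jack', 3), ('noway', 3)]

last_seen = {}   # most recent title per id, i.e. a per-group lag(1)
pairs = []
for title, gid in data:
    pairs.append(f"{title},{last_seen[gid]}" if gid in last_seen else None)
    last_seen[gid] = title

print(pairs)
# [None, 'summer,sal', 'fada,summer', None, 'winter,row', 'gole,winter', None, 'noway,jack']
```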

How to calculate the number of distinct values for all columns in Apache Spark DataFrame [duplicate]

This question already has answers here:
Spark DataFrame: count distinct values of every column
(6 answers)
Closed 5 years ago.
I want to calculate the number of distinct values for all columns in a DataFrame.
Say, I have a DataFrame like this:
x y z
-----
0 0 0
0 1 1
0 1 2
And I want another DataFrame (or any other structure) of format:
col | num
---------
'x' | 1
'y' | 2
'z' | 3
What would be the most efficient way of doing that?
You can use countDistinct to count distinct values; to apply it to all columns, map over the columns to construct a list of expressions, then pass them to the agg function with the varargs syntax:
val exprs = df.columns.map(x => countDistinct(x).as(x))
df.agg(exprs.head, exprs.tail: _*).show
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 3|
+---+---+---+
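The answer above is Scala and returns one wide row; the col/num layout the question asked for is an unpivot of that row. The underlying computation is just a distinct count per column, sketched here in plain Python on the sample data (in PySpark the wide aggregation would be `df.agg(*[F.countDistinct(c).alias(c) for c in df.columns])`):

```python
data = {'x': [0, 0, 0], 'y': [0, 1, 1], 'z': [0, 1, 2]}

# One (column, distinct count) pair per column -- the long layout from the question
counts = [(col, len(set(vals))) for col, vals in data.items()]
print(counts)  # [('x', 1), ('y', 2), ('z', 3)]
```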

how to index categorical features in another way when using spark ml

The VectorIndexer in Spark indexes categorical features according to the frequency of their values. But I want to index the categorical features in a different way.
For example, with the dataset below, "a","b","c" will be indexed as 0,1,2 if I use the VectorIndexer in Spark. But I want to index them according to the label.
There are 4 rows with label 1, and among them 3 rows have feature 'a' and 1 row has feature 'c'. So here I will index 'a' as 0, 'c' as 1 and 'b' as 2.
Is there any convenient way to implement this?
label|feature
-----------------
1 | a
1 | c
0 | a
0 | b
1 | a
0 | b
0 | b
0 | c
1 | a
If I understand your question correctly, you are looking to replicate the behaviour of StringIndexer() on grouped data. To do so (in PySpark), we first define a udf that operates on a list column containing all the values per group. Note that elements with equal counts will be ordered arbitrarily.
from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def encoder(col):
    # Generate count per letter
    x = Counter(col)
    # Create a dictionary mapping each letter to its rank (most frequent first)
    ranking = {pair[0]: rank
               for rank, pair in enumerate(x.most_common())}
    # Use the dictionary to replace letters by rank
    return [ranking[i] for i in col]

encoder_udf = udf(encoder, ArrayType(IntegerType()))
Now we can aggregate the feature column into a list grouped by the column label using collect_list(), and apply our udf row-wise:
from pyspark.sql.functions import collect_list, explode

df1 = (df.groupBy("label")
       .agg(collect_list("feature").alias("features"))
       .withColumn("index", encoder_udf("features")))
Finally, you can explode the index column to get the encoded values instead of the letters:
df1.select("label", explode(df1.index).alias("index")).show()
+-----+-----+
|label|index|
+-----+-----+
| 0| 1|
| 0| 0|
| 0| 0|
| 0| 0|
| 0| 2|
| 1| 0|
| 1| 1|
| 1| 0|
| 1| 0|
+-----+-----+
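The ranking inside encoder can be exercised without Spark; on the grouped feature lists from the sample data it reproduces the output above. (Ties in Counter.most_common() fall back to insertion order in CPython, which is why 'a' outranks 'c' in the label-0 group here, but that tie-break is an implementation detail.)

```python
from collections import Counter

def encoder(col):
    # Rank each value by descending frequency, then replace values by rank
    ranking = {pair[0]: rank
               for rank, pair in enumerate(Counter(col).most_common())}
    return [ranking[i] for i in col]

print(encoder(['a', 'c', 'a', 'a']))       # label 1 group -> [0, 1, 0, 0]
print(encoder(['a', 'b', 'b', 'b', 'c']))  # label 0 group -> [1, 0, 0, 0, 2]
```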
