Collect Column names in Spark without using case when - apache-spark

I have a dataframe with the following columns (A, A_1, B, B_1, C, C_1, D, D_1, E, E_1).
For example, if A = A_1 and B <> B_1 and C <> C_1 and D <> D_1 and E <> E_1, then I want to create a column and mark it as 'B,C,D,E' for that record. All other combinations are possible as well.
I tried writing case when, but there are too many combinations.

I'm unsure whether this is possible without looping over the columns. So here's an easy approach that loops over your columns and checks whether each column pair is unequal; if it is, it outputs the column name. We can then concatenate the column names we get from the loop.
Example below:
from pyspark.sql import functions as func
data_ls = [
(1, 1, 1, 3, 4, 4, 5, 6),
(1, 0, 1, 3, 4, 4, 5, 6)
]
data_sdf = spark.sparkContext.parallelize(data_ls). \
toDF(['a', 'a_1', 'b', 'b_1', 'c', 'c_1', 'd', 'd_1'])
# +---+---+---+---+---+---+---+---+
# | a|a_1| b|b_1| c|c_1| d|d_1|
# +---+---+---+---+---+---+---+---+
# | 1| 1| 1| 3| 4| 4| 5| 6|
# | 1| 0| 1| 3| 4| 4| 5| 6|
# +---+---+---+---+---+---+---+---+
# create condition statements for each column pair in a list
col_conditions = [func.when(func.col(k) != func.col(k+'_1'), func.lit(k)) for k in data_sdf.columns if not k.endswith('_1')]
# [Column<'CASE WHEN (NOT (a = a_1)) THEN a END'>,
# Column<'CASE WHEN (NOT (b = b_1)) THEN b END'>,
# Column<'CASE WHEN (NOT (c = c_1)) THEN c END'>,
# Column<'CASE WHEN (NOT (d = d_1)) THEN d END'>]
# concatenate the case when statements with concat_ws
data_sdf. \
withColumn('ineq_cols', func.concat_ws(',', *col_conditions)). \
show(truncate=False)
# +---+---+---+---+---+---+---+---+---------+
# |a |a_1|b |b_1|c |c_1|d |d_1|ineq_cols|
# +---+---+---+---+---+---+---+---+---------+
# |1 |1 |1 |3 |4 |4 |5 |6 |b,d |
# |1 |0 |1 |3 |4 |4 |5 |6 |a,b,d |
# +---+---+---+---+---+---+---+---+---------+
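To make the row-level logic concrete, here is a minimal plain-Python sketch of what the when/concat_ws combination computes for a single row. It is not Spark code: the helper name ineq_cols and the dict-based rows are assumptions made for illustration, and it assumes there are no null values (in Spark, a comparison with null yields null, so when() returns null and concat_ws skips that column).

```python
def ineq_cols(row, suffix="_1"):
    """Return the comma-joined base column names whose value differs from the paired column."""
    # Base columns are the ones without the suffix, in their original order.
    base_cols = [name for name in row if not name.endswith(suffix)]
    # Keep a name only when the pair (name, name + suffix) is unequal.
    return ",".join(name for name in base_cols if row[name] != row[name + suffix])


columns = ["a", "a_1", "b", "b_1", "c", "c_1", "d", "d_1"]
rows = [
    dict(zip(columns, values))
    for values in [(1, 1, 1, 3, 4, 4, 5, 6), (1, 0, 1, 3, 4, 4, 5, 6)]
]
result = [ineq_cols(row) for row in rows]
print(result)
```

The list comprehension in the Spark answer plays the same role as base_cols here: one condition per column pair, built once and evaluated for every row.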

Related

Using PySpark, how to determine the frequency of each event and its event-by-event frequency

I have a dataset like:
Data
a
a
a
a
a
b
b
b
a
a
b
I would like to add a column like the one below. Each value has the form a1,1: the first element (a1) is the event frequency, i.e. which run of "a" in the field this is, and the second element (,1) is the frequency within the event, i.e. how many times "a" has repeated so far before any other element (b) appears. Can this be done with PySpark?
Data Frequency
a a1,1
a a1,2
a a1,3
a a1,4
a a1,5
b b1,1
b b1,2
b b1,3
a a2,1
a a2,2
b b2,1
You can achieve your desired result by doing this,
from pyspark.sql import Window
import pyspark.sql.functions as F
df = spark.createDataFrame(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b'], 'string').toDF("Data")
print("Original Data:")
df.show()
print("Result:")
df.withColumn("ID", F.monotonically_increasing_id()) \
.withColumn("group",
F.row_number().over(Window.orderBy("ID"))
- F.row_number().over(Window.partitionBy("Data").orderBy("ID"))
) \
.withColumn("element_freq", F.when(F.col('Data') != 'abcd', F.row_number().over(Window.partitionBy("Data", "group").orderBy("ID"))).otherwise(F.lit(0)))\
.withColumn("event_freq", F.when(F.col('Data') != 'abcd', F.dense_rank().over(Window.partitionBy("Data").orderBy("group"))).otherwise(F.lit(0)))\
.withColumn("Frequency", F.concat_ws(',', F.concat(F.col("Data"), F.col("event_freq")), F.col("element_freq"))) \
.orderBy("ID")\
.drop("ID", "group", "event_freq", "element_freq")\
.show()
Original Data:
+----+
|Data|
+----+
| a|
| a|
| a|
| a|
| a|
| b|
| b|
| b|
| a|
| a|
| b|
+----+
Result:
+----+---------+
|Data|Frequency|
+----+---------+
| a| a1,1|
| a| a1,2|
| a| a1,3|
| a| a1,4|
| a| a1,5|
| b| b1,1|
| b| b1,2|
| b| b1,3|
| a| a2,1|
| a| a2,2|
| b| b2,1|
+----+---------+
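As a cross-check of the expected labels, the same run numbering can be sketched in plain Python with itertools.groupby. This is only an illustration of the logic on an in-memory list, not a Spark implementation; the function name run_frequencies is an assumption made for this sketch.

```python
from collections import defaultdict
from itertools import groupby


def run_frequencies(values):
    """Label each element as '<value><run number>,<position inside the run>'."""
    runs_seen = defaultdict(int)  # how many runs of each value have started so far
    labels = []
    # groupby yields one group per run of consecutive equal values.
    for value, run in groupby(values):
        runs_seen[value] += 1
        for position, _ in enumerate(run, start=1):
            labels.append(f"{value}{runs_seen[value]},{position}")
    return labels


data = ["a"] * 5 + ["b"] * 3 + ["a"] * 2 + ["b"]
labels = run_frequencies(data)
print(labels)
```

The window-function version has to recover the run boundaries from row numbers because Spark rows have no inherent order, which is why it first materialises an ID column.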
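Use Window functions. I give you two options just in case.
Option 1, separating groups and Frequency
from pyspark.sql import Window
# note: sum here is pyspark's sum and shadows the Python builtin
from pyspark.sql.functions import array, array_join, col, concat, lag, monotonically_increasing_id, rank, sum, when
#Window spec used to look at the previous row
k=Window.partitionBy().orderBy('index')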
(
#Create an index of df to order by
df1.withColumn('index', monotonically_increasing_id())
#Create a column that puts a consecutive and previous Data in a row
.withColumn('group', lag('Data').over(k))
#Where the current and previous values don't match, assign 1, else 0
.withColumn('group', when(col('data')!=col('group'),1).otherwise(0))
# Concat Data and sum of outcome from above per group and ordered by index
.withColumn('group', concat('Data',sum('group').over(Window.partitionBy('Data').orderBy('index'))+1))
#rank outcome above in the order in which they appeared in initial df
.withColumn('Frequency', rank().over(Window.partitionBy('group').orderBy('index')))
).sort('index').drop('index').show(truncate=False)
+----+-----+---------+
|Data|group|Frequency|
+----+-----+---------+
|a |a1 |1 |
|a |a1 |2 |
|a |a1 |3 |
|a |a1 |4 |
|a |a1 |5 |
|b |b2 |1 |
|b |b2 |2 |
|b |b2 |3 |
|a |a2 |1 |
|a |a2 |2 |
|b |b3 |1 |
+----+-----+---------+
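Option 2 gives the output in the format you wanted. Note that in both options the b runs are numbered from 2, because the first b row already differs from the row before it.
#Window spec used to look at the previous row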
k=Window.partitionBy().orderBy('index')
(
#Create an index of df to order by
df1.withColumn('index', monotonically_increasing_id())
#Create a column that puts a consecutive and previous Data in a row
.withColumn('Frequency', lag('Data').over(k))
#Where the current and previous values don't match, assign 1, else 0
.withColumn('Frequency', when(col('data')!=col('Frequency'),1).otherwise(0))
# Concat Data and sum of outcome from above per group and ordered by index
.withColumn('Frequency', concat('Data',sum('Frequency').over(Window.partitionBy('Data').orderBy('index'))+1))
#rank outcome above in the order in which they appeared in initial df
.withColumn('Frequency', array_join(array('Frequency',rank().over(Window.partitionBy('Frequency').orderBy('index'))),','))
).sort('index').drop('index').show(truncate=False)
+----+---------+
|Data|Frequency|
+----+---------+
|a |a1,1 |
|a |a1,2 |
|a |a1,3 |
|a |a1,4 |
|a |a1,5 |
|b |b2,1 |
|b |b2,2 |
|b |b2,3 |
|a |a2,1 |
|a |a2,2 |
|b |b3,1 |
+----+---------+
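The lag-and-running-sum idea behind both options can be traced in plain Python. The sketch below is an illustration only (the name change_labels is an assumption, and it works on a list rather than a DataFrame); it also shows why the first run of "b" is labelled b2 rather than b1.

```python
def change_labels(values):
    """Mimic lag + running sum: flag rows that differ from the previous row,
    then keep a cumulative count of those flags per distinct value."""
    changes = {}  # running sum of change flags, kept separately per value
    previous = None
    labels = []
    for value in values:
        # The first row has no previous row, so its flag is 0 (the otherwise(0) branch).
        flag = 1 if previous is not None and value != previous else 0
        changes[value] = changes.get(value, 0) + flag
        labels.append(f"{value}{changes[value] + 1}")
        previous = value
    return labels


data = ["a"] * 5 + ["b"] * 3 + ["a"] * 2 + ["b"]
groups = change_labels(data)
print(groups)
```

The very first "b" row already carries a change flag of 1, so its label starts at 2, while the first "a" row has no predecessor and starts at 1.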

Spark DataFrame: get row wise sorted column names based on column values

For every row in the dataframe below, I want to find the column names (as an array, tuple or something else) ordered by descending column value. So, for the dataframe
+---+---+---+---+---+
| ID|key| a| b| c|
+---+---+---+---+---+
| 0| 1| 5| 2| 1|
| 1| 1| 3| 4| 5|
+---+---+---+---+---+
I want to find
+---+---+---+---+---+------------------+
| ID|key| a| b| c|descending_columns|
+---+---+---+---+---+------------------+
| 0| 1| 5| 2| 1| [a,b,c]|
| 1| 1| 3| 4| 5| [c,b,a]|
+---+---+---+---+---+------------------+
Ideally and in general, I want to be able to iterate through pre-specified columns and apply a function based on those column entries. This could look like:
import pyspark.sql.functions as f
name_cols = ["a","b","c"]
values_ls = []
for col in name_cols:
    ...schema specification....
    values_ls.append(f.col(col) ...get column value... )
df1 = df.withColumn("descending_columns", values_ls)
The question is rather simple, but seems to be quite challenging to implement efficiently in pyspark.
I am using pyspark version 2.3.3.
For Spark Versions < 2.4 you can achieve this without a udf using sort_array and struct.
First get a list of the columns to sort
cols_to_sort = df.columns[2:]
print(cols_to_sort)
#['a', 'b', 'c']
Now build a struct with two elements - a "value" and a "key". The "key" is the column name and the "value" is the column value. If you ensure that the "value" comes first in the struct, you can use sort_array to sort this array of structs in the manner you want.
After the array is sorted, you just need to iterate over it and extract the "key" part, which contains the column names.
from pyspark.sql.functions import array, col, lit, sort_array, struct
df.withColumn(
    "descending_columns",
    array(
        *[
            sort_array(
                array(
                    *[
                        struct([col(c).alias("value"), lit(c).alias("key")])
                        for c in cols_to_sort
                    ]
                ),
                asc=False
            )[i]["key"]
            for i in range(len(cols_to_sort))
        ]
    )
).show(truncate=False)
#+---+---+---+---+---+------------------+
#|ID |key|a |b |c |descending_columns|
#+---+---+---+---+---+------------------+
#|0 |1 |5 |2 |1 |[a, b, c] |
#|1 |1 |3 |4 |5 |[c, b, a] |
#+---+---+---+---+---+------------------+
Even though this looks complicated, it should offer better performance than the udf solution.
Update: To sort by the original column order in the case of a tie in the value, you could insert another value in the struct which contains the index. Since the sort is descending, we use the negative of the index.
For example, if your input dataframe were the following:
df.show()
#+---+---+---+---+---+
#| ID|key| a| b| c|
#+---+---+---+---+---+
#| 0| 1| 5| 2| 1|
#| 1| 1| 3| 4| 5|
#| 2| 1| 4| 4| 5|
#+---+---+---+---+---+
The last row above has a tie in value between a and b. We want a to sort before b in this case.
df.withColumn(
    "descending_columns",
    array(
        *[
            sort_array(
                array(
                    *[
                        struct(
                            [
                                col(c).alias("value"),
                                lit(-j).alias("index"),
                                lit(c).alias("key")
                            ]
                        )
                        for j, c in enumerate(cols_to_sort)
                    ]
                ),
                asc=False
            )[i]["key"]
            for i in range(len(cols_to_sort))
        ]
    )
).show(truncate=False)
#+---+---+---+---+---+------------------+
#|ID |key|a |b |c |descending_columns|
#+---+---+---+---+---+------------------+
#|0 |1 |5 |2 |1 |[a, b, c] |
#|1 |1 |3 |4 |5 |[c, b, a] |
#|2 |1 |4 |4 |5 |[c, a, b] |
#+---+---+---+---+---+------------------+
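The ordering that sort_array applies to the structs is field by field, from left to right, which is the same way Python orders tuples. The following plain-Python sketch (the helper name descending_columns and the dict rows are assumptions for illustration) reproduces the tie-breaking behaviour on the three example rows.

```python
def descending_columns(row, cols):
    """Order column names by descending value; ties keep the original column order."""
    # (value, -index, name) mirrors the struct(value, index, key) built in Spark.
    triples = [(row[name], -index, name) for index, name in enumerate(cols)]
    # A descending sort puts the larger value first, then the larger -index,
    # i.e. the column that appears earlier in cols.
    return [name for _, _, name in sorted(triples, reverse=True)]


cols_to_sort = ["a", "b", "c"]
rows = [
    {"a": 5, "b": 2, "c": 1},
    {"a": 3, "b": 4, "c": 5},
    {"a": 4, "b": 4, "c": 5},
]
ordered = [descending_columns(row, cols_to_sort) for row in rows]
print(ordered)
```

Without the negated index, the tie between a and b in the last row would be broken by the column name instead, and a descending sort would put b before a.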
You could insert the columns into a single struct and process that in a udf.
from pyspark.sql import functions as F
from pyspark.sql import types as T
name_cols = ['a', 'b', 'c']
def ordered_columns(row):
    # pair each value with its column name, sort descending by value, keep the names
    return [x for _,x in sorted(zip(row.asDict().values(), name_cols), reverse=True)]
udf_ordered_columns = F.udf(ordered_columns, T.ArrayType(T.StringType()))
df1 = (
df
.withColumn(
'row',
F.struct(*name_cols)
)
.withColumn(
'descending_columns',
udf_ordered_columns('row')
)
)
Something like this should work, if above doesn't, then let me know.

Spark struct represented by OneHotEncoder

I have a data frame with two columns,
+---+-------+
| id| fruit|
+---+-------+
| 0| apple|
| 1| banana|
| 2|coconut|
| 1| banana|
| 2|coconut|
+---+-------+
I also have a universal list with all the items,
fruitList: Seq[String] = WrappedArray(apple, coconut, banana)
Now I want to create a new column in the dataframe holding an array of 1s and 0s, where 1 means the item is present in that row and 0 means it is not.
Desired Output
+---+-----------+
| id|  fruitlist|
+---+-----------+
|  0|    [1,0,0]|
|  1|    [0,1,0]|
|  2|    [0,0,1]|
|  1|    [0,1,0]|
|  2|    [0,0,1]|
+---+-----------+
This is something I tried,
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val df = spark.createDataFrame(Seq(
(0, "apple"),
(1, "banana"),
(2, "coconut"),
(1, "banana"),
(2, "coconut")
)).toDF("id", "fruit")
df.show
import org.apache.spark.sql.functions._
val fruitList = df.select(collect_set("fruit")).first().getAs[Seq[String]](0)
print(fruitList)
I tried to solve this with OneHotEncoder but the result was something like this after converting to dense vector, which is not what I needed.
+---+-------+----------+-------------+---------+
| id| fruit|fruitIndex| fruitVec| vd|
+---+-------+----------+-------------+---------+
| 0| apple| 2.0| (2,[],[])|[0.0,0.0]|
| 1| banana| 1.0|(2,[1],[1.0])|[0.0,1.0]|
| 2|coconut| 0.0|(2,[0],[1.0])|[1.0,0.0]|
| 1| banana| 1.0|(2,[1],[1.0])|[0.0,1.0]|
| 2|coconut| 0.0|(2,[0],[1.0])|[1.0,0.0]|
+---+-------+----------+-------------+---------+
If you have a collection as
val fruitList: Seq[String] = Array("apple", "coconut", "banana")
Then you can do it using either built-in functions or a udf function
inbuilt functions (array, when and lit)
import org.apache.spark.sql.functions._
df.withColumn("fruitList", array(fruitList.map(x => when(lit(x) === col("fruit"),1).otherwise(0)): _*)).show(false)
udf function
import org.apache.spark.sql.functions._
def containedUdf = udf((fruit: String) => fruitList.map(x => if(x == fruit) 1 else 0))
df.withColumn("fruitList", containedUdf(col("fruit"))).show(false)
which should give you
+---+-------+---------+
|id |fruit |fruitList|
+---+-------+---------+
|0 |apple |[1, 0, 0]|
|1 |banana |[0, 0, 1]|
|2 |coconut|[0, 1, 0]|
|1 |banana |[0, 0, 1]|
|2 |coconut|[0, 1, 0]|
+---+-------+---------+
udf functions are easy to understand and straightforward when dealing with primitive datatypes, but they should be avoided when optimized, fast built-in functions are available to do the same task.
I hope the answer is helpful
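For readers who want to check the expected arrays by hand, the per-row mapping that both variants implement can be sketched in a few lines of plain Python (Python rather than Scala, and without Spark; the function name one_hot is an assumption for this sketch). The position of the 1 follows the order of the list, which is why banana maps to [0, 0, 1] here.

```python
def one_hot(fruit, fruit_list):
    """Return a list with 1 at the position of fruit in fruit_list and 0 elsewhere."""
    return [1 if item == fruit else 0 for item in fruit_list]


fruit_list = ["apple", "coconut", "banana"]
rows = [(0, "apple"), (1, "banana"), (2, "coconut"), (1, "banana"), (2, "coconut")]
encoded = [(row_id, fruit, one_hot(fruit, fruit_list)) for row_id, fruit in rows]
for row in encoded:
    print(row)
```

A value that is not in the list simply yields an all-zero array, which matches what the when/otherwise version does.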

Given primary key, compare other columns of two data frames and output diff columns in the vertical way

I want to compare two dataframes that have the same schema, and have a primary key column.
For each primary key, if other columns have any difference (could be multiple columns, so need to use some dynamic way to scan all other columns), I want to output the column name and values of both dataframes.
Also, I want to output the result if one primary key doesn't exist in another dataframe (so "full outer join" will be used). Here is some example:
dataframe1:
+-----------+------+------+
|primary_key|book |number|
+-----------+------+------+
|1 |book1 | 1 |
|2 |book2 | 2 |
|3 |book3 | 3 |
|4 |book4 | 4 |
+-----------+------+------+
dataframe2:
+-----------+------+------+
|primary_key|book |number|
+-----------+------+------+
|1 |book1 | 1 |
|2 |book8 | 8 |
|3 |book3 | 7 |
|5 |book5 | 5 |
+-----------+------+------+
The result would be:
+-----------+----------------+----------+----------+
|primary_key|diff_column_name|dataframe1|dataframe2|
+-----------+----------------+----------+----------+
|2          |book            |book2     |book8     |
|2          |number          |2         |8         |
|3          |number          |3         |7         |
|4          |book            |book4     |null      |
|4          |number          |4         |null      |
|5          |book            |null      |book5     |
|5          |number          |null      |5         |
+-----------+----------------+----------+----------+
I know the first step is to join both dataframes on the primary key:
// joining the two DFs on primary_key
val result = df1.as("l")
.join(df2.as("r"), "primary_key", "fullouter")
But I am not sure how to proceed. Can someone give me some advice? Thanks
Data:
val df1 = Seq(
(1, "book1", 1), (2, "book2", 2), (3, "book3", 3), (4, "book4", 4)
).toDF("primary_key", "book", "number")
val df2 = Seq(
(1, "book1", 1), (2, "book8", 8), (3, "book3", 7), (5, "book5", 5)
).toDF("primary_key", "book", "number")
Imports
import org.apache.spark.sql.functions._
Define list of columns:
val cols = Seq("book", "number")
Join as you do right now:
val joined = df1.as("l").join(df2.as("r"), Seq("primary_key"), "fullouter")
Define:
val comp = explode(array(cols.map(c => struct(
lit(c).alias("diff_column_name"),
// Value left
col(s"l.${c}").cast("string").alias("dataframe1"),
// Value right
col(s"r.${c}").cast("string").alias("dataframe2"),
// Differs
not(col(s"l.${c}") <=> col(s"r.${c}")).alias("diff")
)): _*))
Select and filter:
joined
.withColumn("comp", comp)
.select($"primary_key", $"comp.*")
// Keep only the mismatches and drop the now obsolete diff column
.where($"diff").drop("diff")
.orderBy("primary_key").show
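// +-----------+----------------+----------+----------+
// |primary_key|diff_column_name|dataframe1|dataframe2|
// +-----------+----------------+----------+----------+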
// | 2| book| book2| book8|
// | 2| number| 2| 8|
// | 3| number| 3| 7|
// | 4| book| book4| null|
// | 4| number| 4| null|
// | 5| book| null| book5|
// | 5| number| null| 5|
// +-----------+----------------+----------+----------+
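The join, explode and filter steps can be mirrored on small in-memory data to see which rows should come out. The sketch below is plain Python (not Scala and not Spark); the name diff_rows and the dict-of-dicts layout are assumptions made for illustration. Python's None == None is True, which plays the role of the null-safe <=> comparison.

```python
def diff_rows(left, right, cols):
    """Full-outer-join two {primary_key: {column: value}} mappings and list differing columns."""
    result = []
    for key in sorted(set(left) | set(right)):  # union of keys = full outer join
        left_row = left.get(key, {})
        right_row = right.get(key, {})
        for name in cols:
            left_value = left_row.get(name)
            right_value = right_row.get(name)
            if left_value != right_value:  # None == None, so missing on both sides is not a diff
                result.append((
                    key,
                    name,
                    None if left_value is None else str(left_value),
                    None if right_value is None else str(right_value),
                ))
    return result


df1 = {k: {"book": f"book{k}", "number": k} for k in (1, 2, 3, 4)}
df2 = {1: {"book": "book1", "number": 1}, 2: {"book": "book8", "number": 8},
       3: {"book": "book3", "number": 7}, 5: {"book": "book5", "number": 5}}
diffs = diff_rows(df1, df2, ["book", "number"])
for row in diffs:
    print(row)
```

Casting both sides to string, as the Spark answer does, is what allows values of different column types to share the dataframe1 and dataframe2 columns.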

How to find longest sequence of consecutive dates?

I have a database with visit times stored as timestamps, like this
ID, time
1, 1493596800
1, 1493596900
1, 1493432800
2, 1493596800
2, 1493596850
2, 1493432800
I use Spark SQL and I need the longest sequence of consecutive dates for each ID, like
ID, longest_seq (days)
1, 2
2, 5
3, 1
I tried to adapt the answer to "Detect consecutive dates ranges using SQL" to my case, but I didn't manage to get what I expect.
SELECT ID, MIN (d), MAX(d)
FROM (
SELECT ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) AS d,
ROW_NUMBER() OVER(
PARTITION BY ID ORDER BY cast(from_utc_timestamp(cast(time as timestamp), 'CEST')
as date)) rn
FROM purchase
where ID is not null
GROUP BY ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date)
)
GROUP BY ID, rn
ORDER BY ID
If someone has a clue on how to fix this query, or what's wrong with it, I would appreciate the help.
Thanks
[EDIT] A more explicit input /output
ID, time
1, 1
1, 2
1, 3
2, 1
2, 3
2, 4
2, 5
2, 10
2, 11
3, 1
3, 4
3, 9
3, 11
The result would be :
ID, MaxSeq (in days)
1,3
2,3
3,1
All the visits are timestamps, but I need consecutive days, so multiple visits on the same day count only once.
My answer below is adapted from https://dzone.com/articles/how-to-find-the-longest-consecutive-series-of-even for use in Spark SQL. You'll have to wrap the SQL queries with:
spark.sql("""
SQL_QUERY
""")
So, for the first query:
CREATE TABLE intermediate_1 AS
SELECT
id,
time,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS rn,
time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS grp
FROM purchase
This will give you:
id, time, rn, grp
1, 1, 1, 0
1, 2, 2, 0
1, 3, 3, 0
2, 1, 1, 0
2, 3, 2, 1
2, 4, 3, 1
2, 5, 4, 1
2, 10, 5, 5
2, 11, 6, 5
3, 1, 1, 0
3, 4, 2, 2
3, 9, 3, 6
3, 11, 4, 7
We can see that consecutive rows share the same grp value. Then we use GROUP BY and COUNT to get the length of each consecutive run.
CREATE TABLE intermediate_2 AS
SELECT
id,
grp,
COUNT(*) AS num_consecutive
FROM intermediate_1
GROUP BY id, grp
This will return:
id, grp, num_consecutive
1, 0, 3
2, 0, 1
2, 1, 3
2, 5, 2
3, 0, 1
3, 2, 1
3, 6, 1
3, 7, 1
Now we just use MAX and GROUP BY to get the length of the longest consecutive run for each id.
CREATE TABLE final AS
SELECT
id,
MAX(num_consecutive) as max_consecutive
FROM intermediate_2
GROUP BY id
Which will give you:
id, max_consecutive
1, 3
2, 3
3, 1
Hope this helps!
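The "value minus row number" trick used in the queries above can be checked on small lists in plain Python. This is a sketch under stated assumptions, not part of the original answer: the function name longest_run is hypothetical, days are plain integers, and epoch seconds are turned into day numbers with integer division by 86400 (UTC days, ignoring the time zone handling in the question). It also removes duplicate days first, since several visits on one day should count once.

```python
from collections import Counter


def longest_run(days):
    """Length of the longest run of consecutive integers in days (duplicates count once)."""
    unique_days = sorted(set(days))
    # Within a run of consecutive days, day - row_number stays constant.
    groups = Counter(day - rn for rn, day in enumerate(unique_days, start=1))
    return max(groups.values(), default=0)


visits = {
    1: [1, 2, 3],
    2: [1, 3, 4, 5, 10, 11],
    3: [1, 4, 9, 11],
}
max_consecutive = {visitor: longest_run(days) for visitor, days in visits.items()}
print(max_consecutive)

# Epoch seconds from the question, reduced to UTC day numbers (an assumption).
timestamps = [1493596800, 1493596900, 1493432800]
run_from_timestamps = longest_run(ts // 86400 for ts in timestamps)
print(run_from_timestamps)
```

If the purchase table can contain several rows for the same id and day, the SQL version would need a DISTINCT (or GROUP BY) on id and day before the ROW_NUMBER step for the same reason.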
That's the case for my beloved window aggregate functions!
I think the following example could help you out (at least to get started).
The following is the dataset I use. I translated your time (in longs) to a numeric day index (to avoid messing around with timestamps in Spark SQL, which could make the solution harder to comprehend).
In the visits dataset below, the time column holds the day number, so values that increase by 1 represent consecutive days.
scala> visits.show
+---+----+
| ID|time|
+---+----+
| 1| 1|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 3|
| 1| 3|
| 2| 1|
| 3| 1|
| 3| 2|
| 3| 2|
+---+----+
Let's define the window specification to group id rows together.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val idsSortedByTime = Window.
partitionBy("id").
orderBy("time")
With that you rank the rows and count rows with the same rank.
val answer = visits.
select($"id", $"time", rank over idsSortedByTime as "rank").
groupBy("id", "time", "rank").
agg(count("*") as "count")
scala> answer.show
+---+----+----+-----+
| id|time|rank|count|
+---+----+----+-----+
| 1| 1| 1| 2|
| 1| 2| 3| 1|
| 1| 3| 4| 3|
| 3| 1| 1| 1|
| 3| 2| 2| 2|
| 2| 1| 1| 1|
+---+----+----+-----+
That appears (very close?) to a solution. You seem done!
Using spark.sql and with intermediate tables
scala> val df = Seq((1, 1),(1, 2),(1, 3),(2, 1),(2, 3),(2, 4),(2, 5),(2, 10),(2, 11),(3, 1),(3, 4),(3, 9),(3, 11)).toDF("id","time")
df: org.apache.spark.sql.DataFrame = [id: int, time: int]
scala> df.createOrReplaceTempView("tb1")
scala> spark.sql("""
with tb2(select id, time, time - row_number() over(partition by id order by time) rw1 from tb1),
tb3(select id, count(rw1) rw2 from tb2 group by id, rw1)
select id, rw2 from tb3
where (id, rw2) in (select id, max(rw2) from tb3 group by id)
group by id, rw2
""").show(false)
+---+---+
|id |rw2|
+---+---+
|1 |3 |
|3 |1 |
|2 |3 |
+---+---+
Solution using DataFrame API:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq((1, 1),(1, 2),(1, 3),(2, 1),(2, 3),(2, 4),(2, 5),(2, 10),(2, 11),(3, 1),(3, 4),(3, 9),(3, 11)).toDF("ID","time")
df1.show(false)
df1.printSchema()
val w = Window.partitionBy("ID").orderBy("time")
val df2 = df1.withColumn("rank", col("time") - row_number().over(w))
.groupBy("ID", "rank")
.agg(count("rank").alias("count"))
.groupBy("ID")
.agg(max("count").alias("time"))
.orderBy("ID")
df2.show(false)
Console output:
+---+----+
|ID |time|
+---+----+
|1 |1 |
|1 |2 |
|1 |3 |
|2 |1 |
|2 |3 |
|2 |4 |
|2 |5 |
|2 |10 |
|2 |11 |
|3 |1 |
|3 |4 |
|3 |9 |
|3 |11 |
+---+----+
root
|-- ID: integer (nullable = false)
|-- time: integer (nullable = false)
+---+----+
|ID |time|
+---+----+
|1 |3 |
|2 |3 |
|3 |1 |
+---+----+
