Pyspark udf to detect "Actors" - apache-spark

I have a matrix (dataframe) and I want to find all the rows where the row and column intersect with a '1' (the 'character' row value matches the column name).
Example: Sam is an actor. He has a '1' in the 'actor' column, and his 'character' value is 'actor'. This is a row I would want returned.
df = spark.createDataFrame(
    [
        ("actor", "sam", "1", "0", "0", "0", "0"),
        ("villan", "jack", "0", "0", "0", "0", "0"),
        ("actress", "rose", "0", "0", "0", "1", "0"),
        ("comedian", "mike", "0", "1", "1", "0", "1"),
        ("musician", "young", "1", "1", "1", "1", "0")
    ],
    ["character", "name", "actor", "villan", "comedian", "actress", "musician"]
)
+---------+-----+-----+------+--------+-------+--------+
|character| name|actor|villan|comedian|actress|musician|
+---------+-----+-----+------+--------+-------+--------+
| actor| sam| 1| 0| 0| 0| 0|
| villan| jack| 0| 0| 0| 0| 0|
| actress| rose| 0| 0| 0| 1| 0|
| comedian| mike| 0| 1| 1| 0| 1|
| musician|young| 1| 1| 1| 1| 0|
+---------+-----+-----+------+--------+-------+--------+

# imports needed for the UDF
from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# create function
def myMatch(needle, haystack):
    return haystack[needle]

# create udf
matched = udf(myMatch, StringType())  # your existing data is strings

# apply udf; f.struct(*...) is a shortcut to pack all columns into a struct
# so the whole row can be passed to the udf
df.select(
    df.name,
    matched(
        df.character,
        f.struct(*[df[col] for col in df.columns])
    ).alias("IsPlayingCharacter")
).show()
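For what it's worth, the same lookup can also be done without a UDF by chaining a when() per role column and coalescing the results. A minimal sketch, assuming the role columns are every column except character and name:
from pyspark.sql import functions as F

role_cols = [c for c in df.columns if c not in ("character", "name")]

# for each row, pick the value of the column whose name matches `character`
is_playing = F.coalesce(*[F.when(F.col("character") == c, F.col(c)) for c in role_cols])

df.withColumn("IsPlayingCharacter", is_playing) \
  .filter(F.col("IsPlayingCharacter") == "1") \
  .show()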

Related

Spark add duplicate only when one column is same and other is different

I have data like this
[
{"uuid":"fdkhflds","key": "A", "id": "1"},
{"uuid":"ieuieiue","key": "A", "id": "2"},
{"uuid":"qwtriqrr","key": "A", "id": "3"},
{"uuid":"dhgfsddd","key": "A", "id": "1"},
{"uuid":"sdjhfdjh","key": "E", "id": "4"}
]
I want to add a flag to those rows where the key is the same but the id is different.
Expected output:
[
{"uuid":"fdkhflds","key": "A", "id": "1","de_dupe_required": 0},
{"uuid":"ieuieiue","key": "A", "id": "2","de_dupe_required": 1},
{"uuid":"qwtriqrr","key": "A", "id": "3","de_dupe_required": 1},
{"uuid":"dhgfsddd","key": "A", "id": "1","de_dupe_required": 0},
{"uuid":"sdjhfdjh","key": "E", "id": "4","de_dupe_required": 0}
]
Explanation:
Since the first and fourth records have the same key and id, no flag is needed.
Since the fifth record shares neither its key nor its id with any other record, no flag is needed for it either.
Since the second and third records have the same key as the others but a different id, their flag should be 1.
You could achieve this with pyspark.sql.Window by generating a rank() for the keys ordered by id, then marking de_dupe_required wherever the rank() is not 1.
from pyspark.sql import functions as F, Window

window_spec = Window.partitionBy("key").orderBy("id")

df = (df.withColumn("dupe_rank", F.rank().over(window_spec))
        .withColumn("de_dupe_required",
                    F.when(F.col("dupe_rank") == 1, F.lit(0)).otherwise(F.lit(1)))
        .drop("dupe_rank"))
df.show()
Output is:
+--------+---+---+----------------+
| uuid|key| id|de_dupe_required|
+--------+---+---+----------------+
|fdkhflds| A| 1| 0|
|dhgfsddd| A| 1| 0|
|ieuieiue| A| 2| 1|
|qwtriqrr| A| 3| 1|
|sdjhfdjh| E| 4| 0|
+--------+---+---+----------------+
Note this will still work if there are some combinations like having two (A,3) rows (as noted by @thebluephantom): since we order by id, the rank will be greater than 1 for these rows.
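For reference, a sketch of how that duplicated input could be built (the extra row simply repeats the existing (A, 3) record for illustration):
df = spark.createDataFrame(
    [
        ("fdkhflds", "A", "1"),
        ("ieuieiue", "A", "2"),
        ("qwtriqrr", "A", "3"),
        ("qwtriqrr", "A", "3"),  # duplicated (key, id) combination
        ("dhgfsddd", "A", "1"),
        ("sdjhfdjh", "E", "4")
    ],
    ["uuid", "key", "id"]
)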
Output for two (A,3):
+--------+---+---+----------------+
| uuid|key| id|de_dupe_required|
+--------+---+---+----------------+
|fdkhflds| A| 1| 0|
|dhgfsddd| A| 1| 0|
|ieuieiue| A| 2| 1|
|qwtriqrr| A| 3| 1|
|qwtriqrr| A| 3| 1|
|sdjhfdjh| E| 4| 0|
+--------+---+---+----------------+
The question is vague. This is my solution, whereby we consider two (A,3) rows as possible, and thus it does not behave the same as the first answer.
%python
from pyspark.sql.functions import col, lit

df = spark.createDataFrame(
    [
        ("A", 1, "xyz"),
        ("A", 2, "xyz"),
        ("A", 3, "xyz"),
        ("A", 3, "xyz"),
        ("A", 1, "xyz"),
        ("E", 4, "xyz"),
        ("A", 9, "xyz")
    ],
    ["c1", "c2", "c3"]
)

# (c1, c2) pairs that occur exactly once
df2 = df.groupBy("c1", "c2").count().filter(col('count') == 1)
# keys that have exactly one such single-occurrence pair
df3 = df2.groupBy("c1").count().filter(col('count') == 1)
# single-occurrence pairs whose key has more than one of them -> flag 1
df4 = (df2.join(df3, df3.c1 == df2.c1, "leftanti")
          .select("c1", "c2", lit(1))
          .toDF("c1", "c2", "ddr"))
# everything else (multiset difference of the original pairs and the flagged ones) -> flag 0
dfA = df.select("c1", "c2")
dfB = df4.select("c1", "c2")
df5 = dfA.exceptAll(dfB)
res = df4.withColumn("ddr", lit(1)).unionAll(df5.withColumn("ddr", lit(0)))
res.show()
returns:
+---+---+---+
| c1| c2|ddr|
+---+---+---+
| A| 2| 1|
| A| 9| 1|
| A| 1| 0|
| A| 1| 0|
| A| 3| 0|
| A| 3| 0|
| E| 4| 0|
+---+---+---+
It's about the algorithm; you can do the rest. It needs to be a step-wise approach.
You can do this using the count window function.
Using your input data, I've added a new row for id=3. AFAIU, in this case id=3 should also be marked 0, as there are now 2 occurrences of it.
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

data_sdf. \
    withColumn('num_key_occurs', func.count('key').over(wd.partitionBy('key'))). \
    withColumn('num_id_occurs_inkey', func.count('id').over(wd.partitionBy('key', 'id'))). \
    withColumn('samekey_diffid',
               ((func.col('num_key_occurs') > 1) & (func.col('num_id_occurs_inkey') == 1)).cast('int')
               ). \
    show()
# +---+---+--------+--------------+-------------------+--------------+
# | id|key| uuid|num_key_occurs|num_id_occurs_inkey|samekey_diffid|
# +---+---+--------+--------------+-------------------+--------------+
# | 4| E|sdjhfdjh| 1| 1| 0|
# | 1| A|fdkhflds| 5| 2| 0|
# | 1| A|dhgfsddd| 5| 2| 0|
# | 2| A|ieuieiue| 5| 1| 1|
# | 3| A|qwtriqrr| 5| 2| 0|
# | 3| A|blahbleh| 5| 2| 0|
# +---+---+--------+--------------+-------------------+--------------+
Feel free to drop the count columns at the end.
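For example (a sketch assuming the chained result above is assigned to a variable, say flagged_sdf):
flagged_sdf = flagged_sdf.drop('num_key_occurs', 'num_id_occurs_inkey')
flagged_sdf.show()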

Pyspark groupBy multiple columns and aggregate using multiple udf functions

I want to group on multiple columns and then aggregate various columns using user-defined functions (UDFs) that calculate the mode for each of those columns. I demonstrate my problem with this sample code:
import pandas as pd
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, IntegerType
df = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df["A"] = ["Mon", "Mon", "Mon", "Fri", "Fri", "Fri", "Fri"]
df["B"] = ["Feb", "Feb", "Feb", "May", "May", "May", "May"]
df["C"] = ["x", "y", "y", "m", "n", "r", "r"]
df["D"] = [3, 3, 5, 1, 1, 1, 9]
df_sdf = spark.createDataFrame(df)
df_sdf.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
|Mon|Feb| x| 3|
|Mon|Feb| y| 3|
|Mon|Feb| y| 5|
|Fri|May| m| 1|
|Fri|May| n| 1|
|Fri|May| r| 1|
|Fri|May| r| 9|
+---+---+---+---+
# Custom mode function to get mode value for string list and integer list
def custom_mode(lst): return(max(lst, key=lst.count))
custom_mode_str = udf(custom_mode, StringType())
custom_mode_int = udf(custom_mode, IntegerType())
grp_columns = ["A", "B"]
df_sdf.groupBy(grp_columns).agg(custom_mode_str(col("C")).alias("C"), custom_mode_int(col("D")).alias("D")).distinct().show()
However, I am getting the following error on last line of above code:
AnalysisException: expression '`C`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
The expected output for this code is:
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
|Mon|Feb| y| 3|
|Fri|May| r| 1|
+---+---+---+---+
I searched a lot but couldn't find something similar to this problem in pyspark. Thanks for your time.
Your UDF requires a list, but you're providing a Spark dataframe column. You can pass a list to the function, which will generate your desired result.
from pyspark.sql import functions as func

# sdf is the question's dataframe (df_sdf above)
sdf.groupBy(['A', 'B']). \
    agg(custom_mode_str(func.collect_list('C')).alias('C'),
        custom_mode_int(func.collect_list('D')).alias('D')
        ). \
    show()
# +---+---+---+---+
# | A| B| C| D|
# +---+---+---+---+
# |Mon|Feb| y| 3|
# |Fri|May| r| 1|
# +---+---+---+---+
The collect_list() is the key here as it will generate a list which will work with your UDF. See collection outputs below.
sdf.groupBy(['A', 'B']). \
agg(func.collect_list('C').alias('C_collected'),
func.collect_list('D').alias('D_collected')
). \
show()
# +---+---+------------+------------+
# | A| B| C_collected| D_collected|
# +---+---+------------+------------+
# |Mon|Feb| [x, y, y]| [3, 3, 5]|
# |Fri|May|[m, n, r, r]|[1, 1, 1, 9]|
# +---+---+------------+------------+
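As a side note, here is a sketch assuming Spark 3.4+, where a built-in mode aggregate exists, so neither the UDF nor collect_list is needed (tie-breaking may differ from max(lst, key=lst.count)):
from pyspark.sql import functions as F

sdf.groupBy('A', 'B').agg(
    F.mode('C').alias('C'),
    F.mode('D').alias('D')
).show()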

Spark: how to process certain column content individually in dataframe?

The data structure is like this:
+---+----+----------------+
| id|name|            data|
+---+----+----------------+
|001| aaa|true,false,false|
|002| bbb|  true,true,true|
|003| ccc| false,true,true|
+---+----+----------------+
I want to map the false values in data to names by their corresponding positions in the mapping table. In detail, the first step is to get the positions of the false values in data, and then get the names at those positions in the mapping table.
For example, the first record has two false values at positions 2 and 3, so the mapping result is code2,code3. Also, the second record is all true, so the mapping result is an empty string.
the mapping table: ("code1","code2","code3")
the expected result:
+---+----+-----------+
| id|name|       data|
+---+----+-----------+
|001| aaa|code2,code3|
|002| bbb|           |
|003| ccc|      code1|
+---+----+-----------+
Is it possible to achieve this in the dataframe?
If you are using Spark 3+, you can use the filter and transform functions as follows:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("001", "aaa", "true,false,false"),
  ("002", "bbb", "true,true,true"),
  ("003", "ccc", "false,true,true")
).toDF("id", "name", "data")

val cols = Seq("col1", "col2", "col3")

val dfNew = df.withColumn("data", split($"data", ","))
  .withColumn("mapping", arrays_zip($"data", typedLit(cols)))
  .withColumn("new1", filter($"mapping", (c: Column) => c.getField("data") === "false"))
  .withColumn("data", transform($"new1", (c: Column) => c.getField("1")))
  .drop("new1", "mapping")

dfNew.show(false)
Output:
+---+----+------------+
|id |name|data |
+---+----+------------+
|001|aaa |[col2, col3]|
|002|bbb |[] |
|003|ccc |[col1] |
+---+----+------------+
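For reference, a roughly equivalent PySpark sketch (assuming the same data in a PySpark dataframe df, and Spark 3.1+ for the Python filter/transform lambdas) that also joins the result back into the comma-separated string from the expected output; flags, codes and zipped are just scratch column names:
from pyspark.sql import functions as F

code_names = ["code1", "code2", "code3"]  # the mapping table

result = (
    df.withColumn("flags", F.split("data", ","))
      .withColumn("codes", F.array(*[F.lit(c) for c in code_names]))
      .withColumn("zipped", F.arrays_zip("flags", "codes"))
      .withColumn(
          "data",
          F.array_join(
              F.transform(
                  F.filter("zipped", lambda s: s["flags"] == "false"),
                  lambda s: s["codes"]
              ),
              ","
          )
      )
      .drop("flags", "codes", "zipped")
)
result.show(truncate=False)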
The following should work, but be aware that it features a posexplode (explode an array with positional index), which can be a costly operation, especially if you have a huge dataset.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("001", "aaa", "true,false,false"),
  ("002", "bbb", "true,true,true"),
  ("003", "ccc", "false,true,true")
).toDF("id", "name", "data")

val codes = Seq(
  (0, "code1"),
  (1, "code2"),
  (2, "code3")
).toDF("code_id", "codes")

val df1 = df.select($"*", posexplode(split($"data", ",")))
  .join(codes, $"pos" === $"code_id")
  .withColumn("codes", when($"col" === "false", $"codes").otherwise(null))
//+---+----+----------------+---+-----+-------+-----+
//| id|name| data|pos| col|code_id|codes|
//+---+----+----------------+---+-----+-------+-----+
//|001| aaa|true,false,false| 0| true| 0| null|
//|001| aaa|true,false,false| 1|false| 1|code2|
//|001| aaa|true,false,false| 2|false| 2|code3|
//|002| bbb| true,true,true| 0| true| 0| null|
//|002| bbb| true,true,true| 1| true| 1| null|
//|002| bbb| true,true,true| 2| true| 2| null|
//|003| ccc| false,true,true| 0|false| 0|code1|
//|003| ccc| false,true,true| 1| true| 1| null|
//|003| ccc| false,true,true| 2| true| 2| null|
//+---+----+----------------+---+-----+-------+-----+
val finalDf = df1.groupBy($"id", $"name").agg(concat_ws(",", collect_list($"codes")).as("data"))
//+---+----+-----------+
//| id|name| data|
//+---+----+-----------+
//|002| bbb| |
//|001| aaa|code2,code3|
//|003| ccc| code1|
//+---+----+-----------+

Check multiple columns for any column greater than zero using a regex

I need to apply a when function on multiple columns. I want to check if at least one of the columns has a value greater than 0.
This is my solution:
df.withColumn("any value", F.when(
(col("col1") > 0) |
(col("col2") > 0) |
(col("col3") > 0) |
...
(col("colX") > 0)
, "any greater than 0").otherwise(None))
Is it possible to do the same task with a regex, so I don't have to write all the column names?
So let's create sample data:
df = spark.createDataFrame(
[(0, 0, 0, 0), (0, 0, 2, 0), (0, 0, 0, 0), (1, 0, 0, 0)],
['a', 'b', 'c', 'd']
)
Then, you can build your condition from a list of columns (say all the columns of the dataframe) using map and reduce like this:
from functools import reduce
from pyspark.sql import functions as F

cols = df.columns
condition = reduce(lambda a, b: a | b, map(lambda c: F.col(c) > 0, cols))
df.withColumn("any value", F.when(condition, "any greater than 0")).show()
which yields:
+---+---+---+---+------------------+
| a| b| c| d| any value|
+---+---+---+---+------------------+
| 0| 0| 0| 0| null|
| 0| 0| 2| 0|any greater than 0|
| 0| 0| 0| 0| null|
| 1| 0| 0| 0|any greater than 0|
+---+---+---+---+------------------+
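If you literally want to pick the columns by a regex rather than taking all of them, one option is to filter df.columns with Python's re module before building the condition. A sketch: the ^col\d+$ pattern matches the col1 … colX names from the question; for the a/b/c/d sample columns you would use a different pattern.
import re
from functools import reduce
from pyspark.sql import functions as F

pattern = re.compile(r"^col\d+$")  # adjust to your actual column naming scheme
cols = [c for c in df.columns if pattern.match(c)]

condition = reduce(lambda a, b: a | b, [F.col(c) > 0 for c in cols])
df.withColumn("any value", F.when(condition, "any greater than 0")).show()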
Another way you could have this done is to create an array, use forall (Spark 3.1+) to check it, and conditionally assign values. Code below
from pyspark.sql.functions import array, forall, when

df = df.withColumn('any value', array(df.columns)) \
       .withColumn('any value',
                   when(forall('any value', lambda x: x == 0), None)
                   .otherwise("any greater than 0"))
df.show()
+---+---+---+---+------------------+
| a| b| c| d| any value|
+---+---+---+---+------------------+
| 0| 0| 0| 0| null|
| 0| 0| 2| 0|any greater than 0|
| 0| 0| 0| 0| null|
| 1| 0| 0| 0|any greater than 0|
+---+---+---+---+------------------+

How can I use the literal value of a spark dataframe column?

I have this simple dataframe that looks like this,
+---+---+---+---+
|nm | ca| cb| cc|
+---+---+---+---+
| a|123| 0| 0|
| b| 1| 2| 3|
| c| 0| 1| 0|
+---+---+---+---+
What I want to do is,
+---+---+---+---+---+
|nm |ca |cb |cc |p |
+---+---+---+---+---+
|a |123|0 |0 |1 |
|b |1 |2 |3 |1 |
|c |0 |1 |0 |0 |
+---+---+---+---+---+
Basically, I added a new column p such that: if the value of column nm is 'a', check whether column ca is > 0; if yes, put 1 in column p, else 0.
My code,
def purchaseCol: UserDefinedFunction =
udf((brand: String) => s"c$brand")
val a = ss.createDataset(List(
("a", 123, 0, 0),
("b", 1, 2, 3),
("c", 0, 1, 0)))
.toDF("nm", "ca", "cb", "cc")
a.show()
a.withColumn("p", when(lit(DataFrameUtils.purchaseCol($"nm")) > 0, 1).otherwise(0))
.show(false)
It doesn't seem to be working and is returning 0 for all rows in column 'p'.
PS: The number of columns is over 100 and they are dynamically generated.
Map over the RDD, calculate p, and add it to each row:
val a = sc.parallelize(
List(("a", 123, 0, 0),
("b", 1, 2, 3),
("c", 0, 1, 0))
).toDF("nm", "ca", "cb", "cc")
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val b = a.rdd.map(r => {
val s = r.getAs[String]("nm")
val v = r.getAs[Int](s"c$s")
val p = if(v > 0) 1 else 0
Row.fromSeq(r.toSeq :+ p)
})
val new_schema = StructType(a.schema :+ StructField("p", IntegerType, true))
val df_new = spark.createDataFrame(b, new_schema)
df_new.show
+---+---+---+---+---+
| nm| ca| cb| cc| p|
+---+---+---+---+---+
| a|123| 0| 0| 1|
| b| 1| 2| 3| 1|
| c| 0| 1| 0| 0|
+---+---+---+---+---+
If "c*" columns number is limited, UDF with all values can be used:
val nameMatcherFunct = (nm: String, ca: Int, cb: Int, cc: Int) => {
val value = nm match {
case "a" => ca
case "b" => cb
case "c" => cc
}
if (value > 0) 1 else 0
}
def purchaseValueUDF = udf(nameMatcherFunct)
val result = a.withColumn("p", purchaseValueUDF(col("nm"), col("ca"), col("cb"), col("cc")))
If you have many "c*" columns, function with Row as parameter can be used:
How to pass whole Row to UDF - Spark DataFrame filter
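Alternatively, for a dynamic number of "c*" columns, here is a PySpark sketch that avoids a UDF entirely, assuming the same data in a PySpark dataframe named df and that every such column is named "c" followed by a possible nm value:
from pyspark.sql import functions as F

# pick, per row, the value of the column whose name is "c" + nm
c_cols = [c for c in df.columns if c != "nm"]  # ca, cb, cc, ...
picked = F.coalesce(*[F.when(F.col("nm") == c[1:], F.col(c)) for c in c_cols])

df.withColumn("p", F.when(picked > 0, 1).otherwise(0)).show()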
Looking at your logic:
if the value of column nm is 'a', check whether column ca is > 0; if yes, put 1 in column p, else 0.
you can simply do:
import org.apache.spark.sql.functions._
a.withColumn("p", when((col("nm") === lit("a")) && (col("ca") > 0), lit(1)).otherwise(lit(0)))
but looking at your output dataframe, you would require an || instead of &&
import org.apache.spark.sql.functions._
a.withColumn("p", when((col("nm") === lit("a")) || (col("ca") > 0), lit(1)).otherwise(lit(0)))
val a1 = sc.parallelize(
List(("a", 123, 0, 0),
("b", 1, 2, 3),
("c", 0, 1, 0))
).toDF("nm", "ca", "cb", "cc")
a1.show()
+---+---+---+---+
| nm| ca| cb| cc|
+---+---+---+---+
| a|123| 0| 0|
| b| 1| 2| 3|
| c| 0| 1| 0|
+---+---+---+---+
val newDf = a1.withColumn("P", when($"ca" > 0, 1).otherwise(0))
newDf.show()
+---+---+---+---+---+
| nm| ca| cb| cc| P|
+---+---+---+---+---+
| a|123| 0| 0| 1|
| b| 1| 2| 3| 1|
| c| 0| 1| 0| 0|
+---+---+---+---+---+
