How to match 2 columns with each other in Apache Spark - Pyspark - apache-spark

I have a DataFrame; assume my data is in tabular format.
|ID | Serial                      | Updated           |
-------------------------------------------------------
|10 | pers1                       |                   |
|20 |                             |                   |
|30 | entity_1, entity_2, entity_3| entity_1, entity_3|
Now, using withColumn("Serial", explode(split("Serial", ","))), I have broken the column into multiple rows, as below. This was the 1st part of the requirement.
|ID | Serial   | Updated           |
-------------------------------------------------------
|10 | pers1    |                   |
|20 |          |                   |
|30 | entity_1 | entity_1, entity_3|
|30 | entity_2 | entity_1, entity_3|
|30 | entity_3 | entity_1, entity_3|
Now, for the rows where there are no values, the result should be 0.
Each value present in the 'Serial' column should be searched for in the 'Updated' column. If the value is present in the 'Updated' column, it should display '1', else '2'.
So in this case, entity_1 and entity_3 --> 1 must be displayed, and entity_2 --> 2 should be displayed.
How can I achieve this?

AFAIK, there is no way to check if one column is contained within or is a substring of another column directly without using a udf.
However, if you want to avoid using a udf, one way is to explode the "Updated" column. Then you can check for equality between the "Serial" column and the exploded "Updated" column and apply your conditions (1 if they match, 2 otherwise); call this column "contains".
Finally, you can then groupBy("ID", "Serial", "Updated") and select the minimum of the "contains" column.
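To make the snippets below runnable, here is a minimal sketch of the input DataFrame; the variable name df and the exact values are assumptions taken from the question's table, and the empty cells are modeled as empty strings so that those rows survive the explode:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Empty cells are modeled as "" (not null) so explode(split(...)) keeps those rows
df = spark.createDataFrame(
    [
        (10, "pers1", ""),
        (20, "", ""),
        (30, "entity_1,entity_2,entity_3", "entity_1,entity_3"),
    ],
    ["ID", "Serial", "Updated"],
)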
For example, after the two calls to explode() and checking your condition, you will have a DataFrame like this:
import pyspark.sql.functions as f

df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
    .withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
    .withColumn(
        "contains",
        f.when(
            f.isnull("Serial") |
            f.isnull("Updated") |
            (f.col("Serial") == "") |
            (f.col("Updated") == ""),
            0
        ).when(
            f.col("Serial") == f.col("updatedExploded"),
            1
        ).otherwise(2)
    )\
    .show(truncate=False)
#+---+--------+-----------------+---------------+--------+
#|ID |Serial |Updated |updatedExploded|contains|
#+---+--------+-----------------+---------------+--------+
#|10 |pers1 | | |0 |
#|20 | | | |0 |
#|30 |entity_1|entity_1,entity_3|entity_1 |1 |
#|30 |entity_1|entity_1,entity_3|entity_3 |2 |
#|30 |entity_2|entity_1,entity_3|entity_1 |2 |
#|30 |entity_2|entity_1,entity_3|entity_3 |2 |
#|30 |entity_3|entity_1,entity_3|entity_1 |2 |
#|30 |entity_3|entity_1,entity_3|entity_3 |1 |
#+---+--------+-----------------+---------------+--------+
The "trick" of grouping by ("ID", "Serial", "Updated") and taking the minimum of "contains" works because:
If either "Serial" or "Updated" is null (or equal to empty string in this case), the value will be 0.
If at least one of the values in "Updated" matches with "Serial", one of the columns will have a 1.
If there are no matches, you will have only 2's
The final output:
df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
    .withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
    .withColumn(
        "contains",
        f.when(
            f.isnull("Serial") |
            f.isnull("Updated") |
            (f.col("Serial") == "") |
            (f.col("Updated") == ""),
            0
        ).when(
            f.col("Serial") == f.col("updatedExploded"),
            1
        ).otherwise(2)
    )\
    .groupBy("ID", "Serial", "Updated")\
    .agg(f.min("contains").alias("contains"))\
    .sort("ID")\
    .show(truncate=False)
#+---+--------+-----------------+--------+
#|ID |Serial |Updated |contains|
#+---+--------+-----------------+--------+
#|10 |pers1 | |0 |
#|20 | | |0 |
#|30 |entity_3|entity_1,entity_3|1 |
#|30 |entity_2|entity_1,entity_3|2 |
#|30 |entity_1|entity_1,entity_3|1 |
#+---+--------+-----------------+--------+
I'm chaining calls to pyspark.sql.functions.when() to check the conditions. The first part checks to see if either column is null or equal to the empty string. I believe that you probably only need to check for null in your actual data, but I put in the check for empty string based on how you displayed your example DataFrame.
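As an aside, if exact membership (rather than substring matching) is all you need, a variant that skips the second explode and the groupBy is to split "Updated" into an array and test membership with array_contains. A minimal sketch, assuming the same df as above and a Spark version that provides the array_contains SQL function:
import pyspark.sql.functions as f

# One row per (ID, Serial); the membership test replaces the explode + groupBy + min trick
df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
    .withColumn(
        "contains",
        f.when(
            f.isnull("Serial") | f.isnull("Updated") |
            (f.col("Serial") == "") | (f.col("Updated") == ""),
            0
        ).when(
            f.expr("array_contains(split(Updated, ','), Serial)"),
            1
        ).otherwise(2)
    )\
    .show(truncate=False)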

Related

Get Column which contains a specific value

Using pyspark, I have a data frame like this:
| col1  | col2  | col3 |
|-------|-------|------|
| 1     | [3,7] | 5    |
| hello | 4     | 666  |
| 4     | world | 4    |
Now I want to get the column name where the number 666 is included.
So the result should be "col3".
Thanks
Edit
Added values other than int. The great answers focused on only int values. Sorry.
Deleted: while we are at it, I guess the index can also be retrieved easily.
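For reference, a hedged sketch of the demo DataFrame the answers below appear to use (they stick to integer-only values; the variable name df is an assumption, and the second answer uses a similar frame named data_sdf with columns c1..c3 and one extra row):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Integer-only demo rows matching the first answer's output below
df = spark.createDataFrame([(1, 3, 5), (2, 4, 666), (4, 6, 4)], ['col1', 'col2', 'col3'])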
import pyspark.sql.functions as F
from pyspark.sql.functions import col, expr

(df.withColumn('result', F.array(*[F.array(F.lit(x).alias('y'), col(x).alias('y')) for x in df.columns]))  # create an array of [column name, value] pairs
   .withColumn('result', expr("transform(filter(result, (c, i) -> (c[1] == 666)), (c, i) -> c[0])"))  # keep pairs whose value is 666 and extract the column name
   .show(truncate=False))
+----+----+----+------+
|col1|col2|col3|result|
+----+----+----+------+
|1   |3   |5   |[]    |
|2   |4   |666 |[col3]|
|4   |6   |4   |[]    |
+----+----+----+------+
Here's an approach that creates an array of (column name, value) structs from the columns and then filters it.
import pyspark.sql.functions as func

data_sdf. \
    withColumn('allcols',
               func.array(*[func.struct(func.lit(c).alias('name'), func.col(c).cast('string').alias('value'))
                            for c in data_sdf.columns])
               ). \
    withColumn('cols_w_666_arr',
               func.expr('transform(filter(allcols, x -> x.value = "666"), c -> c.name)')
               ). \
    drop('allcols'). \
    show(truncate=False)
# +---+---+---+--------------+
# |c1 |c2 |c3 |cols_w_666_arr|
# +---+---+---+--------------+
# |1 |3 |5 |[] |
# |2 |4 |666|[c3] |
# |4 |666|4 |[c2] |
# |666|666|4 |[c1, c2] |
# +---+---+---+--------------+

Spark, return multiple rows on group?

So, I have a Kafka topic containing the following data, and I'm working on a proof-of-concept to see whether we can achieve what we're trying to do. I was previously trying to solve it within Kafka, but it seems that Kafka wasn't the right tool, so I'm looking at Spark now :)
The data in its basic form looks like this:
+--+------------+-------+---------+
|id|serialNumber|source |company |
+--+------------+-------+---------+
|1 |123ABC |system1|Acme |
|2 |3285624 |system1|Ajax |
|3 |CDE567 |system1|Emca |
|4 |XX |system2|Ajax |
|5 |3285624 |system2|Ajax&Sons|
|6 |0147852 |system2|Ajax |
|7 |123ABC |system2|Acme |
|8 |CDE567 |system2|Xaja |
+--+------------+-------+---------+
The main grouping column is serialNumber, and the result should be that ids 1 and 7 match, as it's a full match on the company. Ids 2 and 5 should match because the company in id 2 is a partial match of the company in id 5. Ids 3 and 8 should not match, as the companies don't match.
I expect the end result to be something like this. Note that sources are not fixed to just one or two and in the future it will contain more sources.
+------+-----+------------+-----------------+---------------+
|uuid |id |serialNumber|source |company |
+------+-----+------------+-----------------+---------------+
|<uuid>|[1,7]|123ABC |[system1,system2]|[Acme] |
|<uuid>|[2,5]|3285624     |[system1,system2]|[Ajax,Ajax&Sons]|
|<uuid>|[3] |CDE567 |[system1] |[Emca] |
|<uuid>|[4] |XX |[system2] |[Ajax] |
|<uuid>|[6] |0147852 |[system2] |[Ajax] |
|<uuid>|[8] |CDE567 |[system2] |[Xaja] |
+------+-----+------------+-----------------+---------------+
I was looking at groupByKey().mapGroups() but having problems finding examples. Can mapGroups() return more than one row?
You can simply groupBy the serialNumber column and collect_list all the other columns.
Code:
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF on a Seq

val ds = Seq((1, "123ABC", "system1", "Acme"),
             (7, "123ABC", "system2", "Acme"))
  .toDF("id", "serialNumber", "source", "company")

ds.groupBy("serialNumber")
  .agg(
    collect_list("id").alias("id"),
    collect_list("source").alias("source"),
    collect_list("company").alias("company")
  )
  .show(false)
Output:
+------------+------+------------------+------------+
|serialNumber|id |source |company |
+------------+------+------------------+------------+
|123ABC |[1, 7]|[system1, system2]|[Acme, Acme]|
+------------+------+------------------+------------+
If you don't want duplicate values, use collect_set:
ds.groupBy("serialNumber")
  .agg(
    collect_list("id").alias("id"),
    collect_list("source").alias("source"),
    collect_set("company").alias("company")
  )
  .show(false)
Output with collect_set on company column:
+------------+------+------------------+-------+
|serialNumber|id |source |company|
+------------+------+------------------+-------+
|123ABC |[1, 7]|[system1, system2]|[Acme] |
+------------+------+------------------+-------+
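If you are working from PySpark rather than Scala, a hedged sketch of the same grouping (the uuid() SQL function is assumed to be available in your Spark version):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ds = spark.createDataFrame(
    [(1, "123ABC", "system1", "Acme"), (7, "123ABC", "system2", "Acme")],
    ["id", "serialNumber", "source", "company"],
)

(ds.groupBy("serialNumber")
   .agg(
       F.collect_list("id").alias("id"),
       F.collect_list("source").alias("source"),
       F.collect_set("company").alias("company"),
   )
   .withColumn("uuid", F.expr("uuid()"))  # one generated uuid per group
   .show(truncate=False))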

How can I make a query that returns column headers if the sum of the values in the column is > 0

I have a data range with dummy data and I want to make a query that returns only the headers when the sum of the columns is higher than 0. My first attempt has been to at least try to make a query that returns the columns for which Sum(Column)>0 by using this formula:
=query(A1:D,"Select A, B, C, D, WHERE SUM(A)>0 AND SUM(B)>0 AND SUM(C)>0 AND SUM(D)>0",0)
But I haven't had any luck. Here is a sample of a dummy table. I would very much appreciate any pointers in this matter.
| Dog| Cow | Cat|Horse|
|:---|:---:|:--:|----:|
| 1 | 0 |2 |3 |
| 2 | 0 |4 |6 |
| 3 | 0 |6 |9 |
| 4 | 0 |8 |12 |
A simple approach would be:
=query({A:D},"Select "&if(sum(A2:A)>0,"Col1",)&if(sum(B2:B)>0,",Col2",)&if(sum(C2:C)>0,",Col3",)&if(sum(D2:D)>0,",Col4",)&" ",1)

PySpark: Check if value in col is like a key in a dict

I would like to take my dictionary which contains keywords and check a column in a pyspark df to see if that keyword exists and if so then return the value from the dictionary in a new column.
The problem looks like this:
myDict = {
    'price': 'Pricing Issue',
    'support': 'Support Issue',
    'android': 'Left for Competitor'
}
df = sc.parallelize([('1','Needed better Support'),('2','Better value from android'),('3','Price was to expensive')]).toDF(['id','reason'])
+-----+-------------------------+
| id |reason |
+-----+-------------------------+
|1 |Needed better support |
|2 |Better value from android|
|3 | Price was to expensive |
|4 | Support problems |
+-----+-------------------------+
The end result that I am looking for is this:
+-----+-------------------------+---------------------+
| id |reason |new_reason |
+-----+-------------------------+---------------------+
|1 |Needed better support | Support Issue |
|2 |Better value from android| Left for Competitor |
|3 |Price was to expensive | Pricing Issue |
|4 |Support issue | Support Issue |
+-----+-------------------------+---------------------+
What's the best way to build an efficient function to do this in pyspark?
You can use when expressions to check whether the column reason matches the dict keys. You can dynamically generate the when expressions with Python's functools.reduce function, folding over myDict.keys():
from functools import reduce
from pyspark.sql import functions as F

df2 = df.withColumn(
    "new_reason",
    reduce(
        lambda c, k: c.when(F.lower(F.col("reason")).rlike(rf"\b{k.lower()}\b"), myDict[k]),
        myDict.keys(),
        F
    )
)
df2.show(truncate=False)
#+---+-------------------------+-------------------+
#|id |reason |new_reason |
#+---+-------------------------+-------------------+
#|1 |Needed better Support |Support Issue |
#|2 |Better value from android|Left for Competitor|
#|3 |Price was to expensive |Pricing Issue |
#|4 |Support problems |Support Issue |
#+---+-------------------------+-------------------+
You can create a keywords dataframe and join it to the original dataframe using an rlike condition. I added \\b before and after the keywords so that only words between word boundaries will be matched, and there won't be partial word matches (e.g. "pineapple" matching "apple").
import pyspark.sql.functions as F

keywords = spark.createDataFrame([[k, v] for (k, v) in myDict.items()]).toDF('key', 'new_reason')

result = df.join(
    keywords,
    F.expr("lower(reason) rlike '\\\\b' || lower(key) || '\\\\b'"),
    'left'
).drop('key')
result.show(truncate=False)
+---+-------------------------+-------------------+
|id |reason |new_reason |
+---+-------------------------+-------------------+
|1 |Needed better Support |Support Issue |
|2 |Better value from android|Left for Competitor|
|3 |Price was to expensive |Pricing Issue |
|4 |Support problems |Support Issue |
+---+-------------------------+-------------------+
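One follow-up note: with the left join above, any reason that matches none of the keys ends up with a null new_reason. A small hedged sketch of filling in a default label (the 'Other' label is an assumption):
# Replace the nulls left by unmatched rows with a fallback label
result = result.withColumn("new_reason", F.coalesce("new_reason", F.lit("Other")))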

Excel: How to group by Number and search for occurrence

I want to count the occurrence of different values grouped by a Reference#.
Given is the Excel sheet below. The function should search for the same Reference# in column A,
count the distinct values in column B, and write the result in column C.
How can I achieve this functionality?
|-----A------|-------B-----|-------C------|
|Reference | Value | Result |
|------------|-------------|--------------|
|1 |0815 |1 |
|1 |0815 |1 |
|1 |0815 |1 |
|2 |0816 |2 |
|2 |0817 |2 |
|2 |0817 |2 |
|3 |2020 |3 |
|3 |2021 |3 |
|3 |2022 |3 |
|-----------------------------------------|
If you want to count unique numeric values, then try in C2:
=SUM(--(FREQUENCY(IF(A$2:A$10=A2,B$2:B$10),B$2:B$10)>0))
Note: enter through Ctrl+Shift+Enter
More info here
If they are text values then:
=SUM(--(FREQUENCY(IF(A$2:A$10=A2,MATCH(B$2:B$10,B$2:B$10,0)),ROW(B$2:B$10)+1)>0))
Note: enter through Ctrl+Shift+Enter
More info here
If one has Excel O365:
=COUNT(UNIQUE(FILTER(B$2:B$10,A$2:A$10=A2)))
More info here
Use INDEX/MATCH to bring the number if a match is found; if not, add 1 to the max:
=IFERROR(INDEX($C$1:C1,MATCH(A2,$A$1:A1,0)),MAX($C$1:C1)+1)
