Spark, return multiple rows on group? - apache-spark

So, I have a Kafka topic containing the following data, and I'm working on a proof-of-concept to see whether we can achieve what we're trying to do. I was previously trying to solve it within Kafka, but it seems that Kafka wasn't the right tool, so I'm looking at Spark now :)
The data in its basic form looks like this:
+--+------------+-------+---------+
|id|serialNumber|source |company  |
+--+------------+-------+---------+
|1 |123ABC      |system1|Acme     |
|2 |3285624     |system1|Ajax     |
|3 |CDE567      |system1|Emca     |
|4 |XX          |system2|Ajax     |
|5 |3285624     |system2|Ajax&Sons|
|6 |0147852     |system2|Ajax     |
|7 |123ABC      |system2|Acme     |
|8 |CDE567      |system2|Xaja     |
+--+------------+-------+---------+
The main grouping column is serialNumber, and the result should be that ids 1 and 7 match, as the company is an exact match. Ids 2 and 5 should match because the company in id 2 is a partial match of the company in id 5. Ids 3 and 8 should not match, as the companies don't match.
I expect the end result to be something like this. Note that the sources are not fixed to just one or two; in the future the data will contain more sources.
+------+-----+------------+-----------------+----------------+
|uuid  |id   |serialNumber|source           |company         |
+------+-----+------------+-----------------+----------------+
|<uuid>|[1,7]|123ABC      |[system1,system2]|[Acme]          |
|<uuid>|[2,5]|3285624     |[system1,system2]|[Ajax,Ajax&Sons]|
|<uuid>|[3]  |CDE567      |[system1]        |[Emca]          |
|<uuid>|[4]  |XX          |[system2]        |[Ajax]          |
|<uuid>|[6]  |0147852     |[system2]        |[Ajax]          |
|<uuid>|[8]  |CDE567      |[system2]        |[Xaja]          |
+------+-----+------------+-----------------+----------------+
I was looking at groupByKey().mapGroups(), but I'm having trouble finding examples. Can mapGroups() return more than one row?

You can simply groupBy on the serialNumber column and collect_list all the other columns.
code:
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for .toDF; assumes an active SparkSession named `spark`

val ds = Seq(
    (1, "123ABC", "system1", "Acme"),
    (7, "123ABC", "system2", "Acme"))
  .toDF("id", "serialNumber", "source", "company")

ds.groupBy("serialNumber")
  .agg(
    collect_list("id").alias("id"),
    collect_list("source").alias("source"),
    collect_list("company").alias("company")
  )
  .show(false)
Output:
+------------+------+------------------+------------+
|serialNumber|id    |source            |company     |
+------------+------+------------------+------------+
|123ABC      |[1, 7]|[system1, system2]|[Acme, Acme]|
+------------+------+------------------+------------+
If you don't want duplicate values, use collect_set:
ds.groupBy("serialNumber")
.agg(
collect_list("id").alias("id"),
collect_list("source").alias("source"),
collect_set("company").alias("company")
)
.show(false)
Output with collect_set on company column:
+------------+------+------------------+-------+
|serialNumber|id    |source            |company|
+------------+------+------------------+-------+
|123ABC      |[1, 7]|[system1, system2]|[Acme] |
+------------+------+------------------+-------+
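The uuid column in the desired output isn't covered above. Assuming Spark 2.3+ (where the built-in uuid() SQL function is available), one possible sketch is to attach it after the aggregation:

import org.apache.spark.sql.functions.{collect_list, collect_set, expr}

// Group as above, then attach a random UUID per resulting row.
// uuid() is Spark SQL's built-in non-deterministic UUID generator.
val grouped = ds.groupBy("serialNumber")
  .agg(
    collect_list("id").alias("id"),
    collect_list("source").alias("source"),
    collect_set("company").alias("company")
  )
  .withColumn("uuid", expr("uuid()"))

grouped.show(false)

As for the original question: mapGroups() emits exactly one row per group; if you need several output rows per group with the typed API, KeyValueGroupedDataset.flatMapGroups is the usual alternative.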

Related

PySpark: Check if value in col is like a key in a dict

I would like to take my dictionary, which contains keywords, and check a column in a PySpark df to see if a keyword exists; if so, return the corresponding value from the dictionary in a new column.
The problem looks like this:
myDict = {
    'price': 'Pricing Issue',
    'support': 'Support Issue',
    'android': 'Left for Competitor'
}
df = sc.parallelize([('1', 'Needed better Support'), ('2', 'Better value from android'), ('3', 'Price was to expensive')]).toDF(['id', 'reason'])
+-----+-------------------------+
| id  |reason                   |
+-----+-------------------------+
|1    |Needed better support    |
|2    |Better value from android|
|3    |Price was to expensive   |
|4    |Support problems         |
+-----+-------------------------+
The end result that I am looking for is this:
+-----+-------------------------+---------------------+
| id  |reason                   |new_reason           |
+-----+-------------------------+---------------------+
|1    |Needed better support    |Support Issue        |
|2    |Better value from android|Left for Competitor  |
|3    |Price was to expensive   |Pricing Issue        |
|4    |Support issue            |Support Issue        |
+-----+-------------------------+---------------------+
What's the best way to build an efficient function to do this in pyspark?
You can use when expressions to check whether the column reason matches the dict keys. You can dynamically generate the when expressions using Python's functools.reduce function over myDict.keys():
from functools import reduce
from pyspark.sql import functions as F

df2 = df.withColumn(
    "new_reason",
    reduce(
        lambda c, k: c.when(F.lower(F.col("reason")).rlike(rf"\b{k.lower()}\b"), myDict[k]),
        myDict.keys(),
        F,  # start from the functions module so the first step calls F.when(...)
    )
)
df2.show(truncate=False)
#+---+-------------------------+-------------------+
#|id |reason                   |new_reason         |
#+---+-------------------------+-------------------+
#|1  |Needed better Support    |Support Issue      |
#|2  |Better value from android|Left for Competitor|
#|3  |Price was to expensive   |Pricing Issue      |
#|4  |Support problems         |Support Issue      |
#+---+-------------------------+-------------------+
You can create a keywords dataframe and join it to the original dataframe using an rlike condition. I added \\\\b before and after the keywords so that only words between word boundaries will be matched, and there won't be partial word matches (e.g. "pineapple" matching "apple").
import pyspark.sql.functions as F

keywords = spark.createDataFrame([[k, v] for (k, v) in myDict.items()]).toDF('key', 'new_reason')

result = df.join(
    keywords,
    F.expr("lower(reason) rlike '\\\\b' || lower(key) || '\\\\b'"),
    'left'
).drop('key')
result.show(truncate=False)
+---+-------------------------+-------------------+
|id |reason                   |new_reason         |
+---+-------------------------+-------------------+
|1  |Needed better Support    |Support Issue      |
|2  |Better value from android|Left for Competitor|
|3  |Price was to expensive   |Pricing Issue      |
|4  |Support problems         |Support Issue      |
+---+-------------------------+-------------------+

Get all rows after doing GroupBy in SparkSQL

I tried to do a group by in Spark SQL, which works, but most of the rows seem to have gone missing.
spark.sql(
  """
    | SELECT
    |   website_session_id,
    |   MIN(website_pageview_id) as min_pv_id
    | FROM website_pageviews
    | GROUP BY website_session_id
    | ORDER BY website_session_id
    |""".stripMargin).show(10, truncate = false)
I am getting output like this:
+------------------+---------+
|website_session_id|min_pv_id|
+------------------+---------+
|1                 |1        |
|10                |15       |
|100               |168      |
|1000              |1910     |
|10000             |20022    |
|100000            |227964   |
|100001            |227966   |
|100002            |227967   |
|100003            |227970   |
|100004            |227973   |
+------------------+---------+
The same query in MySQL gives the desired result.
What is the best way to do this, so that all rows are fetched in my query?
Please note that I already checked other answers related to this (like joining to get all rows, etc.), but I want to know if there is another way by which we can get the result like we get in MySQL.
It looks like it is ordered alphabetically, in which case 10 comes before 2.
You might want to check that the column's type is a number, not a string.
What datatypes do the columns have (printSchema())?
I think website_session_id is of string type. Cast it to an integer type and see what you get:
spark.sql(
  """
    | SELECT
    |   CAST(website_session_id AS int) as website_session_id,
    |   MIN(website_pageview_id) as min_pv_id
    | FROM website_pageviews
    | GROUP BY website_session_id
    | ORDER BY website_session_id
    |""".stripMargin).show(10, truncate = false)

how to match 2 column with each other in Apache Spark - Pyspark

I have a dataframe; assume my data is in tabular format.
|ID | Serial                      | Updated           |
-------------------------------------------------------
|10 | pers1                       |                   |
|20 |                             |                   |
|30 | entity_1, entity_2, entity_3| entity_1, entity_3|
Now, using withColumn("Serial", explode(split("Serial", ","))), I have achieved breaking the column into multiple rows as below. This was the 1st part of the requirement.
|ID | Serial  | Updated           |
-----------------------------------
|10 | pers1   |                   |
|20 |         |                   |
|30 | entity_1| entity_1, entity_3|
|30 | entity_2| entity_1, entity_3|
|30 | entity_3| entity_1, entity_3|
Now, for the rows where there are no values, the result should be 0.
Each value present in the 'Serial' column should be searched for in the 'Updated' column. If the value is present in the 'Updated' column then it should display '1', else '2'.
So in this case, for entity_1 && entity_3 --> 1 must be displayed, and for entity_2 --> 2 should be displayed.
How to achieve this?
AFAIK, there is no way to check if one column is contained within or is a substring of another column directly without using a udf.
However, if you wanted to avoid using a udf, one way is to explode the "Updated" column. Then you can check for equality between the "Serial" column and the exploded "Updated" column and apply your conditions (1 if match, 2 otherwise); call this "contains".
Finally, you can then groupBy("ID", "Serial", "Updated") and select the minimum of the "contains" column.
For example, after the two calls to explode() and checking your condition, you will have a DataFrame like this:
df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
.withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
.withColumn(
"contains",
f.when(
f.isnull("Serial") |
f.isnull("Updated") |
(f.col("Serial") == "") |
(f.col("Updated") == ""),
0
).when(
f.col("Serial") == f.col("updatedExploded"),
1
).otherwise(2)
)\
.show(truncate=False)
#+---+--------+-----------------+---------------+--------+
#|ID |Serial  |Updated          |updatedExploded|contains|
#+---+--------+-----------------+---------------+--------+
#|10 |pers1   |                 |               |0       |
#|20 |        |                 |               |0       |
#|30 |entity_1|entity_1,entity_3|entity_1       |1       |
#|30 |entity_1|entity_1,entity_3|entity_3       |2       |
#|30 |entity_2|entity_1,entity_3|entity_1       |2       |
#|30 |entity_2|entity_1,entity_3|entity_3       |2       |
#|30 |entity_3|entity_1,entity_3|entity_1       |2       |
#|30 |entity_3|entity_1,entity_3|entity_3       |1       |
#+---+--------+-----------------+---------------+--------+
The "trick" of grouping by ("ID", "Serial", "Updated") and taking the minimum of "contains" works because:
If either "Serial" or "Updated" is null (or equal to empty string in this case), the value will be 0.
If at least one of the values in "Updated" matches "Serial", one of the rows in the group will have a 1.
If there are no matches, you will have only 2's.
The final output:
df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
.withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
.withColumn(
"contains",
f.when(
f.isnull("Serial") |
f.isnull("Updated") |
(f.col("Serial") == "") |
(f.col("Updated") == ""),
0
).when(
f.col("Serial") == f.col("updatedExploded"),
1
).otherwise(2)
)\
.groupBy("ID", "Serial", "Updated")\
.agg(f.min("contains").alias("contains"))\
.sort("ID")\
.show(truncate=False)
#+---+--------+-----------------+--------+
#|ID |Serial  |Updated          |contains|
#+---+--------+-----------------+--------+
#|10 |pers1   |                 |0       |
#|20 |        |                 |0       |
#|30 |entity_3|entity_1,entity_3|1       |
#|30 |entity_2|entity_1,entity_3|2       |
#|30 |entity_1|entity_1,entity_3|1       |
#+---+--------+-----------------+--------+
I'm chaining calls to pyspark.sql.functions.when() to check the conditions. The first part checks to see if either column is null or equal to the empty string. I believe that you probably only need to check for null in your actual data, but I put in the check for empty string based on how you displayed your example DataFrame.

Spark - match states inside a row of dataframe

Below is my dataframe, which I was able to wrangle and extract from multi-struct JSON files.
-------------------------------------------
Col1 | Col2| Col3               | Col4
-------------------------------------------
A    | 1   | 2018-03-28T19:03:39| Active
A    | 1   | 2018-03-28T19:03:40| Clear
A    | 1   | 2018-03-28T19:11:21| Active
A    | 1   | 2018-03-28T20:13:06| Active
A    | 1   | 2018-03-28T20:13:07| Clear
-------------------------------------------
This is what I came up with by grouping by keys
A|1|[(2018-03-28T19:03:39,Active),(2018-03-28T19:03:40,Clear),(2018-03-28T19:11:21,Active),(2018-03-28T20:13:06,Active),(2018-03-28T20:13:07,Clear)]
and this is my desired output:
--------------------------------------------------------
Col1 | Col2| Active time        | Clear Time
--------------------------------------------------------
A    | 1   | 2018-03-28T19:03:39| 2018-03-28T19:03:40
A    | 1   | 2018-03-28T20:13:06| 2018-03-28T20:13:07
--------------------------------------------------------
I am kind of stuck at this step and not sure how to proceed further to get the desired output. Any direction is appreciated.
Spark version - 2.1.1
Scala version - 2.11.8
You can use a window function for the grouping and ordering to get the consecutive active and clear times. Since you want to filter out the rows that don't have a consecutive clear or active status, you need a filter too.
So if you have a dataframe like this:
+----+----+-------------------+------+
|Col1|Col2|Col3               |Col4  |
+----+----+-------------------+------+
|A   |1   |2018-03-28T19:03:39|Active|
|A   |1   |2018-03-28T19:03:40|Clear |
|A   |1   |2018-03-28T19:11:21|Active|
|A   |1   |2018-03-28T20:13:06|Active|
|A   |1   |2018-03-28T20:13:07|Clear |
+----+----+-------------------+------+
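(For reference, this sample dataframe can be built as in the sketch below; it assumes an active SparkSession and spark.implicits._ for toDF.)

import spark.implicits._  // needed for .toDF on a Seq of tuples

val df = Seq(
  ("A", "1", "2018-03-28T19:03:39", "Active"),
  ("A", "1", "2018-03-28T19:03:40", "Clear"),
  ("A", "1", "2018-03-28T19:11:21", "Active"),
  ("A", "1", "2018-03-28T20:13:06", "Active"),
  ("A", "1", "2018-03-28T20:13:07", "Clear")
).toDF("Col1", "Col2", "Col3", "Col4")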
You can simply do as explained above:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

def windowSpec = Window.partitionBy("Col1", "Col2").orderBy("Col3")

df.withColumn("active", lag(struct(col("Col3"), col("Col4")), 1).over(windowSpec))
  .filter(col("active.Col4") === "Active" && col("Col4") === "Clear")
  .select(col("Col1"), col("Col2"), col("active.Col3").as("Active Time"), col("Col3").as("Clear Time"))
  .show(false)
and you should get
+----+----+-------------------+-------------------+
|Col1|Col2|Active Time        |Clear Time         |
+----+----+-------------------+-------------------+
|A   |1   |2018-03-28T19:03:39|2018-03-28T19:03:40|
|A   |1   |2018-03-28T20:13:06|2018-03-28T20:13:07|
+----+----+-------------------+-------------------+

How to add column and records on a dataset given a condition

I'm working on a program that flags data as OutOfRange based on the values present in certain columns.
I have three columns: Age, Height, and Weight. I want to create a fourth column called OutOfRange and assign it a value of 0 (false) or 1 (true) if the values in those three columns exceed a specific threshold.
If Age is lower than 18 or higher than 60, that row will be assigned a value of 1 (0 otherwise). If Height is lower than 5, that row will be assigned a value of 1 (0 otherwise), and so on.
Is it possible to create a column and then add/overwrite values to that column? It would be awesome if I can do that with Spark. I know SQL so if there is anything I can do with the dataset.SQL() function please let me know.
Given a dataframe as
+---+------+------+
|Age|Height|Weight|
+---+------+------+
|20 |3     |70    |
|17 |6     |80    |
|30 |5     |60    |
|61 |7     |90    |
+---+------+------+
You can apply the when function to implement the logic explained in the question:
import org.apache.spark.sql.functions._
df.withColumn("OutOfRange", when(col("Age") <18 || col("Age") > 60 || col("Height") < 5, 1).otherwise(0))
which would result in the following dataframe:
+---+------+------+----------+
|Age|Height|Weight|OutOfRange|
+---+------+------+----------+
|20 |3     |70    |1         |
|17 |6     |80    |1         |
|30 |5     |60    |0         |
|61 |7     |90    |1         |
+---+------+------+----------+
Is it possible to create a column and then add/overwrite values to that column? It would be awesome if I can do that with Spark. I know SQL so if there is anything I can do with the dataset.SQL() function please let me know.
This is not possible without recreating the Dataset altogether, since Datasets are inherently immutable.
However you can save the Dataset as a Hive table, which will allow you to do what you want to do. Saving the Dataset as a Hive table will write the contents of your Dataset to disk under the default spark-warehouse directory.
df.write.mode("overwrite").saveAsTable("my_table")
// Add a row
spark.sql("insert into my_table (Age, Height, Weight, OutOfRange) values (20, 30, 70, 1)")
// Update a row
spark.sql("update my_table set OutOfRange = 1 where Age > 30")
....
Hive support must be enabled for Spark at the time of instantiation in order to do this.
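A minimal sketch of creating such a session (the appName is a placeholder). Note also that, depending on the Spark version and table format, the UPDATE statement above may not be supported on a plain Hive table; it generally needs an ACID-capable format.

import org.apache.spark.sql.SparkSession

// Hive support must be enabled when the SparkSession is first created.
val spark = SparkSession.builder()
  .appName("out-of-range-demo")  // placeholder application name
  .enableHiveSupport()
  .getOrCreate()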
