Spark - match states inside a row of dataframe - apache-spark

Below is my dataframe, which I was able to wrangle and extract from multi-struct JSON files:
-------------------------------------------
Col1 | Col2| Col3 | Col4
-------------------------------------------
A | 1 |2018-03-28T19:03:39| Active
-------------------------------------------
A | 1 |2018-03-28T19:03:40| Clear
-------------------------------------------
A | 1 |2018-03-28T19:11:21| Active
-------------------------------------------
A | 1 |2018-03-28T20:13:06| Active
-------------------------------------------
A | 1 |2018-03-28T20:13:07| Clear
-------------------------------------------
This is what I came up with by grouping by keys
A|1|[(2018-03-28T19:03:39,Active),(2018-03-28T19:03:40,Clear),(2018-03-28T19:11:21,Active),(2018-03-28T20:13:06,Active),(2018-03-28T20:13:07,Clear)]
and this is my desired output..
--------------------------------------------------------
Col1 | Col2| Active time | Clear Time
--------------------------------------------------------
A | 1 |2018-03-28T19:03:39| 2018-03-28T19:03:40
--------------------------------------------------------
A | 1 |2018-03-28T20:13:06| 2018-03-28T20:13:07
--------------------------------------------------------
I am kind of stuck at this step and not sure how to proceed further to get the desired output. Any direction is appreciated.
Spark version - 2.1.1
Scala version - 2.11.8

You can use a window function to group and order the rows, so that each row can see the state that comes immediately before it. Since you want to drop the rows that are not part of a consecutive Active/Clear pair, you need a filter too.
So if you have a dataframe as
+----+----+-------------------+------+
|Col1|Col2|Col3 |Col4 |
+----+----+-------------------+------+
|A |1 |2018-03-28T19:03:39|Active|
|A |1 |2018-03-28T19:03:40|Clear |
|A |1 |2018-03-28T19:11:21|Active|
|A |1 |2018-03-28T20:13:06|Active|
|A |1 |2018-03-28T20:13:07|Clear |
+----+----+-------------------+------+
you can simply do as I explained above
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// One window per (Col1, Col2) key, ordered by the timestamp column
val windowSpec = Window.partitionBy("Col1", "Col2").orderBy("Col3")
// lag(..., 1) attaches the previous row's timestamp and state to each row
df.withColumn("active", lag(struct(col("Col3"), col("Col4")), 1).over(windowSpec))
  // keep only Clear rows that directly follow an Active row
  .filter(col("active.Col4") === "Active" && col("Col4") === "Clear")
  .select(col("Col1"), col("Col2"), col("active.Col3").as("Active Time"), col("Col3").as("Clear Time"))
  .show(false)
and you should get
+----+----+-------------------+-------------------+
|Col1|Col2|Active Time |Clear Time |
+----+----+-------------------+-------------------+
|A |1 |2018-03-28T19:03:39|2018-03-28T19:03:40|
|A |1 |2018-03-28T20:13:06|2018-03-28T20:13:07|
+----+----+-------------------+-------------------+
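To make the lag-and-filter logic easier to follow, here is a minimal plain-Python sketch of the same idea, without Spark. The function name pair_active_clear and the tuple layout are illustrative assumptions, not part of the answer above. Sorting stands in for partitionBy/orderBy, and the prev variable stands in for lag(..., 1).

```python
from typing import List, Optional, Tuple

# (Col1, Col2, Col3 timestamp, Col4 state); layout assumed for this sketch
Row = Tuple[str, int, str, str]


def pair_active_clear(rows: List[Row]) -> List[Tuple[str, int, str, str]]:
    """Return (Col1, Col2, active_time, clear_time) for every Clear row
    whose immediately preceding row in the same (Col1, Col2) group is Active."""
    # Sorting by key and then timestamp mimics partitionBy + orderBy.
    # ISO-8601 timestamps sort correctly as plain strings.
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    pairs = []
    prev: Optional[Row] = None
    for row in ordered:
        same_group = prev is not None and prev[:2] == row[:2]
        # prev plays the role of lag(..., 1) inside the partition
        if same_group and prev[3] == "Active" and row[3] == "Clear":
            pairs.append((row[0], row[1], prev[2], row[2]))
        prev = row
    return pairs


rows = [
    ("A", 1, "2018-03-28T19:03:39", "Active"),
    ("A", 1, "2018-03-28T19:03:40", "Clear"),
    ("A", 1, "2018-03-28T19:11:21", "Active"),
    ("A", 1, "2018-03-28T20:13:06", "Active"),
    ("A", 1, "2018-03-28T20:13:07", "Clear"),
]
result = pair_active_clear(rows)
for pair in result:
    print(pair)
```

The Active row at 19:11:21 is dropped because the row that follows it is another Active, not a Clear.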

Related

Split large dataframe into small ones Spark

I have a DF that has 200 million lines. I can't group this DF, and I have to split it into 8 smaller DFs (approx. 30 million lines each). I've tried this approach but with no success. Without caching the DF, the counts of the split DFs do not add up to the count of the larger DF. If I use cache, I run out of disk space (my config is 64 GB RAM and a 512 GB SSD).
Considering this, I thought about the following approach:
Load the entire DF
Give 8 random numbers to this DF
Distribute the random number evenly in the DF
Consider the following DF as example:
+------+--------+
| val1 | val2 |
+------+--------+
|Paul | 1.5 |
|Bostap| 1 |
|Anna | 3 |
|Louis | 4 |
|Jack | 2.5 |
|Rick | 0 |
|Grimes| null|
|Harv | 2 |
|Johnny| 2 |
|John | 1 |
|Neo | 5 |
|Billy | null|
|James | 2.5 |
|Euler | null|
+------+--------+
The DF has 14 lines. I thought of using a window to create the following DF:
+------+--------+----+
| val1 | val2 | sep|
+------+--------+----+
|Paul | 1.5 |1 |
|Bostap| 1 |1 |
|Anna | 3 |1 |
|Louis | 4 |1 |
|Jack | 2.5 |1 |
|Rick | 0 |1 |
|Grimes| null|1 |
|Harv | 2 |2 |
|Johnny| 2 |2 |
|John | 1 |2 |
|Neo | 5 |2 |
|Billy | null|2 |
|James | 2.5 |2 |
|Euler | null|2 |
+------+--------+----+
Given this last DF, I will filter by sep. My question is: how can I use a window function to generate the sep column of the last DF?
Since you are randomly splitting the dataframe into 8 parts, you could use randomSplit():
split_weights = [1.0] * 8
splits = df.randomSplit(split_weights)
for df_split in splits:
    # do what you want with the smaller df_split
    pass
Note that this will not ensure same number of records in each df_split. There may be some fluctuation but with 200 million records it will be negligible.
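To see why the split sizes fluctuate only slightly, here is a simplified plain-Python model of a weighted random split. It is not Spark's implementation. The function name random_split and the fixed seed are assumptions made for this sketch. Each row gets one random draw and lands in the bucket whose cumulative weight range contains that draw, so every row ends up in exactly one split.

```python
import random
from typing import List, Sequence


def random_split(rows: Sequence, weights: List[float], seed: int = 42) -> List[list]:
    """Assign every row to exactly one split, with probability proportional
    to the split's weight (a simplified model, not Spark's implementation)."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. 0.125, 0.25, ..., 1.0 for 8 equal weights
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0.0, 1.0)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


rows = list(range(100_000))
splits = random_split(rows, [1.0] * 8)
sizes = [len(s) for s in splits]
print(sizes)       # close to 12500 each, but not exactly equal
print(sum(sizes))  # always equals len(rows)
```

The sizes differ by chance, but the relative difference shrinks as the number of rows grows, which is why it becomes negligible at 200 million records.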
If you want to process the splits and store them as files, number the output paths so that they do not get mixed up:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('parquet-files')
split_w = [1.0] * 5
splits = df.randomSplit(split_w)
for count, df_split in enumerate(splits, start=1):
    df_split.write.parquet(f'split-files/split-file-{count}', mode='overwrite')
The files will be roughly the same size on average, with slight differences between them.
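If equal-sized groups are required rather than random ones, the sep column from the question matches what the SQL ntile window function computes: ordered rows are divided into n buckets whose sizes differ by at most one, with the larger buckets first. The plain-Python sketch below only illustrates that bucketing rule. The helper name ntile_labels is an assumption for this sketch. In Spark, a window with an orderBy but no partitionBy moves all rows to a single partition, which may be a problem at 200 million rows.

```python
from typing import List


def ntile_labels(num_rows: int, n: int) -> List[int]:
    """Bucket labels (1..n) for num_rows ordered rows, following the ntile rule:
    bucket sizes differ by at most one and the larger buckets come first."""
    base, extra = divmod(num_rows, n)
    labels: List[int] = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        labels.extend([bucket] * size)
    return labels


names = ["Paul", "Bostap", "Anna", "Louis", "Jack", "Rick", "Grimes",
         "Harv", "Johnny", "John", "Neo", "Billy", "James", "Euler"]
sep = ntile_labels(len(names), 2)
for name, label in zip(names, sep):
    print(name, label)
```

With 14 rows and 2 buckets, the first 7 rows get sep = 1 and the last 7 get sep = 2, as in the question's example.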

Spark, return multiple rows on group?

So, I have a Kafka topic containing the following data, and I'm working on a proof of concept to see whether we can achieve what we're trying to do. I was previously trying to solve it within Kafka, but it seems that Kafka wasn't the right tool, so I'm looking at Spark now :)
The data in its basic form looks like this:
+--+------------+-------+---------+
|id|serialNumber|source |company |
+--+------------+-------+---------+
|1 |123ABC |system1|Acme |
|2 |3285624 |system1|Ajax |
|3 |CDE567 |system1|Emca |
|4 |XX |system2|Ajax |
|5 |3285624 |system2|Ajax&Sons|
|6 |0147852 |system2|Ajax |
|7 |123ABC |system2|Acme |
|8 |CDE567 |system2|Xaja |
+--+------------+-------+---------+
The main grouping column is serialNumber. Ids 1 and 7 should match because their companies match exactly. Ids 2 and 5 should match because the company in id 2 is fully contained in the company in id 5 (a partial match). Ids 3 and 8 should not match because their companies don't match.
I expect the end result to be something like this. Note that sources are not fixed to just one or two and in the future it will contain more sources.
+------+-----+------------+-----------------+---------------+
|uuid |id |serialNumber|source |company |
+------+-----+------------+-----------------+---------------+
|<uuid>|[1,7]|123ABC |[system1,system2]|[Acme] |
|<uuid>|[2,5]|3285624 |[system1,system2]|[Ajax,Ajax&Sons]|
|<uuid>|[3] |CDE567 |[system1] |[Emca] |
|<uuid>|[4] |XX |[system2] |[Ajax] |
|<uuid>|[6] |0147852 |[system2] |[Ajax] |
|<uuid>|[8] |CDE567 |[system2] |[Xaja] |
+------+-----+------------+-----------------+---------------+
I was looking at groupByKey().mapGroups() but having problems finding examples. Can mapGroups() return more than one row?
You can simply groupBy based on serialNumber column and collect_list of all other columns.
code:
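import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._
// toDF on a Seq needs the session implicits; assumes a SparkSession named `spark` (as in spark-shell)
import spark.implicits._
val ds = Seq((1,"123ABC", "system1", "Acme"),
(7,"123ABC", "system2", "Acme"))
.toDF("id", "serialNumber", "source", "company")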
ds.groupBy("serialNumber")
.agg(
collect_list("id").alias("id"),
collect_list("source").alias("source"),
collect_list("company").alias("company")
)
.show(false)
Output:
+------------+------+------------------+------------+
|serialNumber|id |source |company |
+------------+------+------------------+------------+
|123ABC |[1, 7]|[system1, system2]|[Acme, Acme]|
+------------+------+------------------+------------+
If you don't want duplicate values, use collect_set:
ds.groupBy("serialNumber")
.agg(
collect_list("id").alias("id"),
collect_list("source").alias("source"),
collect_set("company").alias("company")
)
.show(false)
Output with collect_set on company column:
+------------+------+------------------+-------+
|serialNumber|id |source |company|
+------------+------+------------------+-------+
|123ABC |[1, 7]|[system1, system2]|[Acme] |
+------------+------+------------------+-------+
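The difference between the two aggregations can be sketched in plain Python: collect_list keeps every value, while collect_set keeps each distinct value once. The dictionary layout below is only an illustration, not Spark's API. In Spark itself, the order of the collected elements is not guaranteed.

```python
from collections import defaultdict

# (id, serialNumber, source, company), same rows as the example above
rows = [
    (1, "123ABC", "system1", "Acme"),
    (7, "123ABC", "system2", "Acme"),
]

# One entry per serialNumber, mirroring groupBy("serialNumber").agg(...)
groups = defaultdict(
    lambda: {"id": [], "source": [], "company_list": [], "company_set": set()}
)
for id_, serial, source, company in rows:
    group = groups[serial]
    group["id"].append(id_)                 # like collect_list("id")
    group["source"].append(source)          # like collect_list("source")
    group["company_list"].append(company)   # like collect_list("company")
    group["company_set"].add(company)       # like collect_set("company")

for serial, group in groups.items():
    print(serial, group)
```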

What is the best way to fill missing info on all columns with Null\0 for missing records in Spark dataframe while groupby?

Let's say I have the following Spark frame:
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-13|1 |1 |0 |
+--------+----------+-----------+-------------------+-------------------+
Now I want to fill in the missing dates in the date column, so that the dataframe keeps its continuous, evenly spaced time-series nature, and also fill the other columns with Null or 0 for those dates (preferably during the groupBy).
My code is below:
import time
import datetime as dt
from pyspark.sql import SQLContext, functions as F
from pyspark.sql.functions import *
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, TimestampType, DateType
dict2 = [("2021-08-11 04:05:06", "A"),
("2021-08-11 04:15:06", "B"),
("2021-08-11 09:15:26", "A"),
("2021-08-11 11:04:06", "B"),
("2021-08-11 14:55:16", "A"),
("2021-08-13 04:12:11", "B"),
]
schema = StructType([
StructField("timestamp", StringType(), True), \
StructField("UserName", StringType(), True), \
])
#create a Spark dataframe
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(data=dict2,schema=schema)
#sdf.printSchema()
#sdf.show(truncate=False)
#+-------------------+--------+
#|timestamp |UserName|
#+-------------------+--------+
#|2021-08-11 04:05:06|A |
#|2021-08-11 04:15:06|B |
#|2021-08-11 09:15:26|A |
#|2021-08-11 11:04:06|B |
#|2021-08-11 14:55:16|A |
#|2021-08-13 04:12:11|B |
#+-------------------+--------+
#Generate date and timestamp
sdf1 = sdf.withColumn('timestamp', F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss").cast(TimestampType())) \
.withColumn('date', F.to_date("timestamp", "yyyy-MM-dd").cast(DateType())) \
.select('timestamp', 'date', 'UserName')
#sdf1.show(truncate = False)
#+-------------------+----------+--------+
#|timestamp |date |UserName|
#+-------------------+----------+--------+
#|2021-08-11 04:05:06|2021-08-11|A |
#|2021-08-11 04:15:06|2021-08-11|B |
#|2021-08-11 09:15:26|2021-08-11|A |
#|2021-08-11 11:04:06|2021-08-11|B |
#|2021-08-11 14:55:16|2021-08-11|A |
#|2021-08-13 04:12:11|2021-08-13|B |
#+-------------------+----------+--------+
#Aggregate record counts per UserName at a given time resolution: PerDay (24 hrs), HalfDay (2x12 hrs)
df = sdf1.groupBy("UserName", "date").agg(
F.sum(F.hour("timestamp").between(0, 24).cast("int")).alias("NoLogPerDay"),
F.sum(F.hour("timestamp").between(0, 11).cast("int")).alias("NoLogPer-1st-12-hrs"),
F.sum(F.hour("timestamp").between(12, 23).cast("int")).alias("NoLogPer-2nd-12-hrs"),
).sort('date')
df.show(truncate = False)
The problem is that when I groupBy on date and UserName, I miss the dates on which user B had activity but user A did not, or vice versa. So I'd like to reflect these days without activity in the Spark dataframe by adding those dates (no timestamp needed) and setting those columns to 0. I'm not sure whether I can do this while grouping, or before or after it!
I already checked some related posts and noted that PySpark offers window functions. Inspired by this answer, so far I've tried this:
# compute the list of all dates from available dates
max_date = sdf1.select(F.max('date')).first()['max(date)']
min_date = sdf1.select(F.min('date')).first()['min(date)']
print(min_date) #2021-08-11
print(max_date) #2021-08-13
#compute list of available dates based on min_date & max_date from available data
dates_list = [max_date - dt.timedelta(days=x) for x in range((max_date - min_date).days +1)]
print(dates_list)
#create a temporary Spark dataframe for the date column, including missing dates, with an interval of 1 day
sqlCtx = SQLContext(sc)
df2 = sqlCtx.createDataFrame(data=dates_list)
#Apply leftouter join on date column
dff = df2.join(sdf1, ["date"], "leftouter")
#dff.sort('date').show(truncate = False)
#possible to use .withColumn().otherwise()
#.withColumn('date',when(col('date').isNull(),to_date(lit('01.01.1900'),'dd.MM.yyyy')).otherwise(col('date')))
#Replace 0 for null for all integer columns
dfff = dff.na.fill(value=0).sort('date')
dfff.select('date','Username', 'NoLogPerDay','NoLogPer-1st-12-hrs','NoLogPer-2nd-12-hrs').sort('date').show(truncate = False)
Please note that I'm not interested in using UDF or hacking it via toPandas()
so expected results should be like below after groupBy:
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-12|0 |0 |0 | <--
|A |2021-08-12|0 |0 |0 | <--
|B |2021-08-13|1 |1 |0 |
|A |2021-08-13|0 |0 |0 | <--
+--------+----------+-----------+-------------------+-------------------+
Here's one way of doing it:
First, generate a new dataframe all_dates_df that contains the sequence of dates from the min to the max date in your grouped df. For this you can use the sequence function:
import pyspark.sql.functions as F
all_dates_df = df.selectExpr(
"sequence(min(date), max(date), interval 1 day) as date"
).select(F.explode("date").alias("date"))
all_dates_df.show()
#+----------+
#| date|
#+----------+
#|2021-08-11|
#|2021-08-12|
#|2021-08-13|
#+----------+
Now, you need to duplicate each date for all the users using a cross join with distinct UserName dataframe and finally join with the grouped df to get the desired output:
result_df = all_dates_df.crossJoin(
df.select("UserName").distinct()
).join(
df,
["UserName", "date"],
"left"
).fillna(0)
result_df.show()
#+--------+----------+-----------+-------------------+-------------------+
#|UserName| date|NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
#+--------+----------+-----------+-------------------+-------------------+
#| A|2021-08-11| 3| 2| 1|
#| B|2021-08-11| 2| 2| 0|
#| A|2021-08-12| 0| 0| 0|
#| B|2021-08-12| 0| 0| 0|
#| B|2021-08-13| 1| 1| 0|
#| A|2021-08-13| 0| 0| 0|
#+--------+----------+-----------+-------------------+-------------------+
Essentially, you can generate all the possible combinations and left join onto them to fill in your missing dates.
The sequence SQL function is helpful here for generating all possible dates. You pass it your min and max dates along with the interval to increment by. The following examples continue from the code in your Google Colab notebook.
Using the functions min, max and collect_set and the table-generating function explode, you can achieve the following:
possible_user_dates=(
# Step 1 - Get all possible UserNames and desired dates
df.select(
F.collect_set("UserName").alias("UserName"),
F.expr("sequence(min(date),max(date), interval 1 day)").alias("date")
)
# Step 2 - Use explode to split the collected arrays into rows (output immediately below)
.withColumn("UserName",F.explode("UserName"))
.withColumn("date",F.explode("date"))
.distinct()
)
possible_user_dates.show(truncate=False)
+--------+----------+
|UserName|date |
+--------+----------+
|B |2021-08-11|
|A |2021-08-11|
|B |2021-08-12|
|A |2021-08-12|
|B |2021-08-13|
|A |2021-08-13|
+--------+----------+
Then perform your left join:
final_df = (
possible_user_dates.join(
df,
["UserName","date"],
"left"
)
# Since the left join will place NULLs where values are missing.
# Eg. where a User was not active on a particular date
# We use `fill` to replace the null values with `0`
.na.fill(0)
)
final_df.show(truncate=False)
+--------+----------+-----------+-------------------+-------------------+
|UserName|date |NoLogPerDay|NoLogPer-1st-12-hrs|NoLogPer-2nd-12-hrs|
+--------+----------+-----------+-------------------+-------------------+
|B |2021-08-11|2 |2 |0 |
|A |2021-08-11|3 |2 |1 |
|B |2021-08-12|0 |0 |0 |
|A |2021-08-12|0 |0 |0 |
|B |2021-08-13|1 |1 |0 |
|A |2021-08-13|0 |0 |0 |
+--------+----------+-----------+-------------------+-------------------+
For debugging purposes, I've included the output of a few intermediary steps
Step 1 Output:
df.select(
F.collect_set("UserName").alias("UserName"),
F.expr("sequence(min(date),max(date), interval 1 day)").alias("date")
).show(truncate=False)
+--------+------------------------------------+
|UserName|date |
+--------+------------------------------------+
|[B, A] |[2021-08-11, 2021-08-12, 2021-08-13]|
+--------+------------------------------------+
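Both answers rely on the same idea: build every (UserName, date) combination, then left join the grouped counts onto that grid and replace the gaps with 0. The plain-Python sketch below shows the idea without Spark. The dictionary layout and variable names are assumptions made for illustration.

```python
from datetime import date, timedelta
from itertools import product

# Grouped counts keyed by (UserName, date):
# (NoLogPerDay, NoLogPer-1st-12-hrs, NoLogPer-2nd-12-hrs)
counts = {
    ("B", date(2021, 8, 11)): (2, 2, 0),
    ("A", date(2021, 8, 11)): (3, 2, 1),
    ("B", date(2021, 8, 13)): (1, 1, 0),
}

# Equivalent of sequence(min(date), max(date), interval 1 day)
days = [d for _, d in counts]
start, end = min(days), max(days)
all_dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Equivalent of the distinct UserName dataframe
users = sorted({user for user, _ in counts})

# Cross join of dates and users, then a left join with a default of zeros
filled = {
    (user, day): counts.get((user, day), (0, 0, 0))
    for day, user in product(all_dates, users)
}

for (user, day), values in sorted(filled.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(user, day.isoformat(), *values)
```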

how to match 2 column with each other in Apache Spark - Pyspark

I have a dataframe so assume my data is in Tabular format.
|ID | Serial | Updated
-------------------------------------------------------
|10 |pers1 | |
|20 | | |
|30 |entity_1, entity_2, entity_3|entity_1, entity_3|
Using withColumn("Serial", explode(split("Serial", ","))), I have split the column into multiple rows as shown below. This was the first part of the requirement.
|ID | Serial | Updated
-------------------------------------------------------
|10 |pers1 | |
|20 | | |
|30 |entity_1 |entity_1, entity_3|
|30 |entity_2 |entity_1, entity_3|
|30 |entity_3 |entity_1, entity_3|
Now, for rows where there are no values, the result should be 0.
Each value present in the 'Serial' column should be searched for in the 'Updated' column. If the value is present in the 'Updated' column, the result should be '1', otherwise '2'.
So in this case, 1 must be displayed for entity_1 and entity_3, and 2 for entity_2.
How can I achieve this?
AFAIK, there is no way to check if one column is contained within or is a substring of another column directly without using a udf.
However, if you wanted to avoid using a udf, one way is to explode the "Updated" column. Then you can check for equality between the "Serial" column and the exploded "Updated" column and apply your conditions (1 if match, 2 otherwise)- call this "contains".
Finally, you can then groupBy("ID", "Serial", "Updated") and select the minimum of the "contains" column.
For example, after the two calls to explode() and checking your condition, you will have a DataFrame like this:
import pyspark.sql.functions as f
df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
.withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
.withColumn(
"contains",
f.when(
f.isnull("Serial") |
f.isnull("Updated") |
(f.col("Serial") == "") |
(f.col("Updated") == ""),
0
).when(
f.col("Serial") == f.col("updatedExploded"),
1
).otherwise(2)
)\
.show(truncate=False)
#+---+--------+-----------------+---------------+--------+
#|ID |Serial |Updated |updatedExploded|contains|
#+---+--------+-----------------+---------------+--------+
#|10 |pers1 | | |0 |
#|20 | | | |0 |
#|30 |entity_1|entity_1,entity_3|entity_1 |1 |
#|30 |entity_1|entity_1,entity_3|entity_3 |2 |
#|30 |entity_2|entity_1,entity_3|entity_1 |2 |
#|30 |entity_2|entity_1,entity_3|entity_3 |2 |
#|30 |entity_3|entity_1,entity_3|entity_1 |2 |
#|30 |entity_3|entity_1,entity_3|entity_3 |1 |
#+---+--------+-----------------+---------------+--------+
The "trick" of grouping by ("ID", "Serial", "Updated") and taking the minimum of "contains" works because:
If either "Serial" or "Updated" is null (or equal to the empty string in this case), the value will be 0.
If at least one of the values in "Updated" matches "Serial", one of the rows will have a 1.
If there are no matches, you will have only 2s.
The final output:
df.withColumn("Serial", f.explode(f.split("Serial", ",")))\
.withColumn("updatedExploded", f.explode(f.split("Updated", ",")))\
.withColumn(
"contains",
f.when(
f.isnull("Serial") |
f.isnull("Updated") |
(f.col("Serial") == "") |
(f.col("Updated") == ""),
0
).when(
f.col("Serial") == f.col("updatedExploded"),
1
).otherwise(2)
)\
.groupBy("ID", "Serial", "Updated")\
.agg(f.min("contains").alias("contains"))\
.sort("ID")\
.show(truncate=False)
#+---+--------+-----------------+--------+
#|ID |Serial |Updated |contains|
#+---+--------+-----------------+--------+
#|10 |pers1 | |0 |
#|20 | | |0 |
#|30 |entity_3|entity_1,entity_3|1 |
#|30 |entity_2|entity_1,entity_3|2 |
#|30 |entity_1|entity_1,entity_3|1 |
#+---+--------+-----------------+--------+
I'm chaining calls to pyspark.sql.functions.when() to check the conditions. The first part checks to see if either column is null or equal to the empty string. I believe that you probably only need to check for null in your actual data, but I put in the check for empty string based on how you displayed your example DataFrame.
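The min-over-exploded-rows trick can be checked in plain Python. The helper name contains_flag is hypothetical, and stripping whitespace around each value is an assumption added here (the question's raw data has spaces after the commas). Each exploded comparison yields 1 or 2, and taking the minimum collapses them to 1 as soon as any comparison matches.

```python
from typing import Optional


def contains_flag(serial: Optional[str], updated: Optional[str]) -> int:
    """0 if either value is missing or empty, 1 if serial appears in the
    comma-separated updated list, 2 otherwise."""
    if not serial or not updated:
        return 0
    # One candidate per "exploded" row; strip() is an assumption of this sketch
    candidates = [u.strip() for u in updated.split(",")]
    # 1 for a match, 2 for a mismatch; min() keeps the 1 if any row matched
    return min(1 if serial == u else 2 for u in candidates)


rows = [
    (10, "pers1", ""),
    (20, "", ""),
    (30, "entity_1", "entity_1, entity_3"),
    (30, "entity_2", "entity_1, entity_3"),
    (30, "entity_3", "entity_1, entity_3"),
]
flags = [(id_, serial, contains_flag(serial, updated)) for id_, serial, updated in rows]
for flag in flags:
    print(flag)
```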

Calculating sum,count of multiple top K values spark

I have an input dataframe of the format
+---------------------------------+
|name| values |score |row_number|
+---------------------------------+
|A |1000 |0 |1 |
|B |947 |0 |2 |
|C |923 |1 |3 |
|D |900 |2 |4 |
|E |850 |3 |5 |
|F |800 |1 |6 |
+---------------------------------+
I need to get sum(values) when score > 0 and row_number < K, i.e. the sum of all values with score > 0 among the top K rows of the dataframe.
I am able to achieve this by running the following query for top 100 values
val top_100_data = df.select(
count(when(col("score") > 0 and col("row_number")<=100, col("values"))).alias("count_100"),
sum(when(col("score") > 0 and col("row_number")<=100, col("values"))).alias("sum_filtered_100"),
sum(when(col("row_number") <=100, col("values"))).alias("total_sum_100")
)
However, I need to fetch this data for the top 100, 200, 300, ..., 2500, meaning I would need to run this query 25 times and finally union 25 dataframes.
I'm new to Spark and still figuring lots of things out. What would be the best approach to solve this problem?
Thanks!!
You can create an Array of limits as
val topFilters = Array(100, 200, 300) // you can add more
Then you can loop through the topFilters array and create the dataframe you require. I suggest using join rather than union, since join gives you separate columns while union gives you separate rows. You can do the following.
Given your dataframe as
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A |1000 |0 |1 |
|B |947 |0 |2 |
|C |923 |1 |3 |
|D |900 |2 |200 |
|E |850 |3 |150 |
|F |800 |1 |250 |
+----+------+-----+----------+
You can do by using the topFilters array defined above as
import sqlContext.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
var finalDF : DataFrame = Seq("1").toDF("rowNum")
for(k <- topFilters) {
val top_100_data = df.select(lit("1").as("rowNum"), sum(when(col("score") > 0 && col("row_number") < k, col("values"))).alias(s"total_sum_$k"))
finalDF = finalDF.join(top_100_data, Seq("rowNum"))
}
finalDF.show(false)
Which should give you final dataframe as
+------+-------------+-------------+-------------+
|rowNum|total_sum_100|total_sum_200|total_sum_300|
+------+-------------+-------------+-------------+
|1 |923 |1773 |3473 |
+------+-------------+-------------+-------------+
You can do the same for your 25 limits that you have.
If you intend to use union, then the idea is similar to above.
I hope the answer is helpful
Updated
If you require union then you can apply following logic with the same limit array defined above
var finalDF : DataFrame = Seq((0, 0, 0, 0)).toDF("limit", "count", "sum_filtered", "total_sum")
for(k <- topFilters) {
val top_100_data = df.select(lit(k).as("limit"), count(when(col("score") > 0 and col("row_number")<=k, col("values"))).alias("count"),
sum(when(col("score") > 0 and col("row_number")<=k, col("values"))).alias("sum_filtered"),
sum(when(col("row_number") <=k, col("values"))).alias("total_sum"))
finalDF = finalDF.union(top_100_data)
}
finalDF.filter(col("limit") =!= 0).show(false)
which should give you
+-----+-----+------------+---------+
|limit|count|sum_filtered|total_sum|
+-----+-----+------------+---------+
|100 |1 |923 |2870 |
|200 |3 |2673 |4620 |
|300 |4 |3473 |5420 |
+-----+-----+------------+---------+
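The per-limit aggregation in the union variant can be reproduced in plain Python to check the numbers above. The function name top_k_stats and the tuple layout are assumptions for this sketch. It also shows that the 25 limits from the question can be generated rather than typed out.

```python
from typing import Dict, Iterable, List, Tuple

# (name, values, score, row_number), same rows as the answer's example
Row = Tuple[str, int, int, int]

rows: List[Row] = [
    ("A", 1000, 0, 1),
    ("B", 947, 0, 2),
    ("C", 923, 1, 3),
    ("D", 900, 2, 200),
    ("E", 850, 3, 150),
    ("F", 800, 1, 250),
]


def top_k_stats(rows: List[Row], limits: Iterable[int]) -> List[Dict[str, int]]:
    """One result row per limit k, using row_number <= k as in the union variant."""
    stats = []
    for k in limits:
        in_top = [r for r in rows if r[3] <= k]    # row_number <= k
        scored = [r for r in in_top if r[2] > 0]   # additionally score > 0
        stats.append({
            "limit": k,
            "count": len(scored),
            "sum_filtered": sum(r[1] for r in scored),
            "total_sum": sum(r[1] for r in in_top),
        })
    return stats


stats = top_k_stats(rows, [100, 200, 300])
for row in stats:
    print(row)

# The 25 limits from the question (100, 200, ..., 2500) can be generated
all_limits = list(range(100, 2501, 100))
print(len(all_limits))
```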

Resources