In my PySpark code I have a DataFrame populated with sensor data, where each row has a timestamp, an event_id and an event_value.
Each sensor event is composed of measurements defined by an id and a value. The only guarantee I have is that all the "phases" related to a single event fall between two EV_SEP rows (the rows are unsorted).
Inside each event "block" there is an event label, which is the value associated with EV_CODE.
+-------------------------+------------+-------------+
| timestamp | event_id | event_value |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:12.540 | EV_SEP | ----- |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:14.201 | EV_2 | 10 |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:13.331 | EV_1 | 11 |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:15.203 | EV_CODE | ABC |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:16.670 | EV_SEP | ----- |
+-------------------------+------------+-------------+
I would like to create a new column containing that label, so that I know that all the rows in the block are associated with that label:
+-------------------------+----------+-------------+------------+
| timestamp | event_id | event_value | event_code |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:12.540 | EV_SEP | ----- | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:14.201 | EV_2 | 10 | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:13.331 | EV_1 | 11 | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:15.203 | EV_CODE | ABC | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:16.670 | EV_SEP | ----- | ABC |
+-------------------------+----------+-------------+------------+
With pandas I can easily get the indexes of the EV_SEP rows, split the table into blocks, take the EV_CODE from each block and create an event_code column with that value.
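For context, a minimal pandas sketch of that approach (assuming a pandas DataFrame pdf with the same three columns and exactly one EV_CODE per block; here each EV_SEP row starts a new block, as in the answer's output further down):
pdf = pdf.sort_values('timestamp').reset_index(drop=True)
pdf['block'] = (pdf['event_id'] == 'EV_SEP').cumsum()                        # block number per row
codes = pdf[pdf['event_id'] == 'EV_CODE'].set_index('block')['event_value']  # block -> EV_CODE value
pdf['event_code'] = pdf['block'].map(codes)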
A possible solution would be (a rough sketch of the indexing steps follows below):
Sort the DataFrame by timestamp
Convert the DataFrame to an RDD and call zipWithIndex
Get the indexes of the rows containing EV_SEP
Calculate the block ranges (start_index, end_index)
Process the single "chunks" (filtering on indexes) to extract the EV_CODE
Finally, create the wanted column
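For reference, a rough sketch of steps 2-4 of that outline (assuming df has already been sorted by timestamp):
rdd = df.rdd.zipWithIndex()                        # (Row, index) pairs
sep_idx = (rdd.filter(lambda x: x[0]['event_id'] == 'EV_SEP')
              .map(lambda x: x[1])
              .collect())                          # indexes of the EV_SEP rows
block_ranges = list(zip(sep_idx, sep_idx[1:]))     # (start_index, end_index) per block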
Is there any better way to solve this problem?
from pyspark.sql import functions as f
from pyspark.sql.window import Window
Sample data:
df.show()
+-----------------------+--------+-----------+
|timestamp |event_id|event_value|
+-----------------------+--------+-----------+
|2017-01-01 00:00:12.540|EV_SEP |null |
|2017-01-01 00:00:14.201|EV_2 |10 |
|2017-01-01 00:00:13.331|EV_1 |11 |
|2017-01-01 00:00:15.203|EV_CODE |ABC |
|2017-01-01 00:00:16.670|EV_SEP |null |
|2017-01-01 00:00:20.201|EV_2 |10 |
|2017-01-01 00:00:24.203|EV_CODE |DEF |
|2017-01-01 00:00:31.670|EV_SEP |null |
+-----------------------+--------+-----------+
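For reproducibility, the sample above could be built roughly as follows (a sketch assuming an existing SparkSession named spark; the timestamps are kept as plain strings):
df = spark.createDataFrame([
    ('2017-01-01 00:00:12.540', 'EV_SEP', None),
    ('2017-01-01 00:00:14.201', 'EV_2', '10'),
    ('2017-01-01 00:00:13.331', 'EV_1', '11'),
    ('2017-01-01 00:00:15.203', 'EV_CODE', 'ABC'),
    ('2017-01-01 00:00:16.670', 'EV_SEP', None),
    ('2017-01-01 00:00:20.201', 'EV_2', '10'),
    ('2017-01-01 00:00:24.203', 'EV_CODE', 'DEF'),
    ('2017-01-01 00:00:31.670', 'EV_SEP', None),
], schema='timestamp string, event_id string, event_value string')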
Add index:
# number the EV_SEP rows (1, 2, 3, ...) in timestamp order
df_idx = df.filter(df['event_id'] == 'EV_SEP') \
    .withColumn('idx', f.row_number().over(Window.partitionBy().orderBy(df['timestamp'])))
# all other rows start with idx = 0
df_block = df.filter(df['event_id'] != 'EV_SEP').withColumn('idx', f.lit(0))
'Spread' index:
# each row takes the highest EV_SEP index seen so far (running max in timestamp order)
df = df_idx.union(df_block).withColumn('idx', f.max('idx').over(
    Window.partitionBy().orderBy('timestamp').rowsBetween(Window.unboundedPreceding, Window.currentRow))).cache()
Add EV_CODE:
# one row per block holding its EV_CODE, joined back on the block index
df_code = df.filter(df['event_id'] == 'EV_CODE').withColumnRenamed('event_value', 'event_code')
df = df.join(df_code, on=[df['idx'] == df_code['idx']]) \
    .select(df['timestamp'], df['event_id'], df['event_value'], df_code['event_code'])
Finally:
+-----------------------+--------+-----------+----------+
|timestamp |event_id|event_value|event_code|
+-----------------------+--------+-----------+----------+
|2017-01-01 00:00:12.540|EV_SEP |null |ABC |
|2017-01-01 00:00:13.331|EV_1 |11 |ABC |
|2017-01-01 00:00:14.201|EV_2 |10 |ABC |
|2017-01-01 00:00:15.203|EV_CODE |ABC |ABC |
|2017-01-01 00:00:16.670|EV_SEP |null |DEF |
|2017-01-01 00:00:20.201|EV_2 |10 |DEF |
|2017-01-01 00:00:24.203|EV_CODE |DEF |DEF |
+-----------------------+--------+-----------+----------+
Creating a new Hadoop InputFormat would be a more computationally efficient way to accomplish your goal here (although it is arguably the same amount of gymnastics, or more, in terms of code). You can specify alternative Hadoop input formats with sc.hadoopFile in the Python API, but you must take care of the conversion from the Java format to Python. The converters available in PySpark are relatively few, but this reference proposes using the Avro converter as an example. You might also simply find it convenient to let your custom Hadoop input format output text, which you then parse further in Python, to avoid the issue of implementing a converter.
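As a rough, hedged sketch of the PySpark side only (the EventBlockInputFormat class name is hypothetical and stands for the custom format described next, assumed to emit one whole event block per Text value):
blocks = sc.hadoopFile(
    'hdfs:///path/to/sensor_logs',                  # input path (placeholder)
    'com.example.sensor.EventBlockInputFormat',     # hypothetical custom InputFormat class
    'org.apache.hadoop.io.LongWritable',            # key class
    'org.apache.hadoop.io.Text')                    # value class: one event block as text
rows = blocks.values().map(lambda block: block.splitlines())   # parse each block further in Python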
Once you have that in place, you would create a special input format (in Java or Scala, using the Hadoop APIs) that treats the sequences of rows delimited by EV_SEP as records, instead of using the newline character as the delimiter. You could do this quite simply by collecting rows into a buffer as they are read (a plain ArrayList would do as a proof of concept) and emitting the accumulated list of records whenever you find two EV_SEP rows in a row.
I would point out that using TextInputFormat as a basis for such a design might be tempting, but that format splits files at arbitrary newline characters, so you would need custom logic to support file splitting properly. Alternatively, you can avoid the problem by simply not allowing file splitting (overriding isSplitable to return false in the input format is enough).
If you do need to split files, the basic idea is:
Pick a split offset by evenly dividing the file into parts
Seek to the offset
Seek back character by character from the offset until the delimiter sequence is found (in this case, two consecutive rows of type EV_SEP).
Detecting these sequences for the edge case around file splitting would be a challenge. I would suggest establishing the largest byte width of a row and reading sliding-window chunks of an appropriate width (roughly 2x the row size) backwards from your starting point, then matching against those windows with a precompiled Java regex and Matcher. This is similar to how SequenceFiles find their sync marks, but it uses a regex to detect the sequence instead of strict equality.
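To illustrate the backward scan itself, here is a hedged, Hadoop-free sketch in Python over a local file opened in binary mode; the pattern for "two consecutive EV_SEP rows" is a placeholder:
import re

DELIM = re.compile(rb'EV_SEP[^\n]*\n[^\n]*EV_SEP')       # placeholder: two consecutive rows containing EV_SEP

def seek_back_to_delimiter(f, split_offset, max_row_bytes):
    # scan backwards in overlapping windows of roughly two rows' width
    window = 2 * max_row_bytes
    pos = split_offset
    while pos > 0:
        start = max(0, pos - window)
        f.seek(start)
        chunk = f.read(min(pos + max_row_bytes, split_offset) - start)   # small overlap to catch straddling matches
        match = None
        for match in DELIM.finditer(chunk):
            pass                                         # keep the match closest to the split offset
        if match:
            return start + match.end()                   # record boundary just after the delimiter
        pos = start
    return 0                                             # no delimiter found: fall back to start of file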
As a side note, given some of the other background you mention here, I would be concerned that sorting the DataFrame by timestamp could alter the contents of events that happen in the same time period in different files.
I tried to do a group by in Spark SQL, which works fine, but most of the rows went missing.
spark.sql(
  """
    | SELECT
    |   website_session_id,
    |   MIN(website_pageview_id) as min_pv_id
    | FROM website_pageviews
    | GROUP BY website_session_id
    | ORDER BY website_session_id
    |""".stripMargin).show(10, truncate = false)
I am getting output like this:
+------------------+---------+
|website_session_id|min_pv_id|
+------------------+---------+
|1 |1 |
|10 |15 |
|100 |168 |
|1000 |1910 |
|10000 |20022 |
|100000 |227964 |
|100001 |227966 |
|100002 |227967 |
|100003 |227970 |
|100004 |227973 |
+------------------+---------+
The same query in MySQL gives the desired result.
What is the best way to do this, so that all rows are fetched by my query?
Please note that I have already checked other answers related to this (like joining to get all rows, etc.), but I want to know whether there is any other way by which we can get the result like we get in MySQL.
It looks like the result is ordered alphabetically, in which case 10 comes before 2.
You might want to check that the column's type is a number, not a string.
What datatypes do the columns have (printSchema())?
I think website_session_id is of string type. Cast it to an integer type and see what you get:
spark.sql(
  """
    | SELECT
    |   CAST(website_session_id AS int) as website_session_id,
    |   MIN(website_pageview_id) as min_pv_id
    | FROM website_pageviews
    | GROUP BY website_session_id
    | ORDER BY website_session_id
    |""".stripMargin).show(10, truncate = false)
I have this DataFrame:
DataFrame[visitors: int, beach: string, Date: date]
With the following data:
+----------+-----------+--------+
|date      |beach      |visitors|
+----------+-----------+--------+
|2020-03-02|Bondi Beach|205     |
|2020-03-02|Nissi Beach|218     |
|2020-03-03|Bar Beach  |201     |
|2020-03-04|Navagio    |102     |
|2020-03-04|Champangne |233     |
|2020-03-05|Lighthouse |500     |
|2020-03-06|Mazo       |318     |
+----------+-----------+--------+
And I'm looking to compute, for each row, the delta between its visitors value and the next row's visitors value.
Expected output:
+----------+-----------+--------+-----+
|date      |beach      |visitors|Delta|
+----------+-----------+--------+-----+
|2020-03-02|Bondi Beach|205     |-13  | (205-218)
|2020-03-02|Nissi Beach|218     |17   | (218-201)
|2020-03-03|Bar Beach  |201     |99   | (201-102)
|2020-03-04|Navagio    |102     |-131 | (102-233)
|2020-03-04|Champangne |233     |-267 | (233-500)
|2020-03-05|Lighthouse |500     |182  | (500-318)
|2020-03-06|Mazo       |318     |318  | (318-0)
+----------+-----------+--------+-----+
You can use the lead function for your problem. Since the lead of the last row is null, I'm using the coalesce function to replace that null with the value of the visitors column.
from pyspark.sql.window import Window
from pyspark.sql.functions import *

w = Window.orderBy("date")   # see the note below about ordering and partitioning

df.withColumn("delta", col("visitors") - lead("visitors").over(w)) \
  .withColumn("delta", coalesce("delta", "visitors")).show()
+----------+-----------+--------+-----+
| date| beach|visitors|delta|
+----------+-----------+--------+-----+
|2020-03-02|Bondi Beach| 205| -13|
|2020-03-02|Nissi Beach| 218| 17|
|2020-03-03| Bar Beach| 201| 99|
|2020-03-04| Navagio| 102| -131|
|2020-03-04| Champangne| 233| -267|
|2020-03-05| Lighthouse| 500| 182|
|2020-03-06| Mazo| 318| 318|
+----------+-----------+--------+-----+
Note: I'm only ordering by the date field. It would be good to have another column, such as an id, to include in the ORDER BY clause so that the order within the same date is deterministic. Also, using a window without partitions can have a performance impact, since all rows are pulled into a single partition.
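For example (a hedged sketch; the id column here is an assumption and does not exist in the sample data above):
w = Window.orderBy("date", "id")   # ties on the same date are broken by a unique id, making the order deterministic
# adding Window.partitionBy(...) would also avoid the single-partition shuffle, but the delta would then be computed per partition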
I am trying to implement a custom explode in PySpark. I have 4 columns that are arrays of structs with virtually the same schema (one column's structs contain one field less than the other three's).
For each row in my DataFrame, I have 4 columns that are arrays of structs. The columns are students, teaching_assistants, teachers, administrators.
The students, teaching_assistants and teachers columns are arrays of structs with the fields id, student_level and name.
For example, consider a single sample row in the DataFrame.
In that row, the students, teaching_assistants and teachers structs all have the same schema ("id", "student_level", "name"), while the administrators structs have the "id" and "name" fields but are missing the student_level.
I want to perform a custom explode such that for every row I get one output row per student, teaching assistant, teacher and administrator, along with the original column name, in case I have to search by "person type".
For that sample row, the output would be these 8 rows:
+-----------+---------------------+----+---------------+----------+
| School_id | type | id | student_level | name |
+-----------+---------------------+----+---------------+----------+
| 1999 | students | 1 | 0 | Brian |
| 1999 | students | 9 | 2 | Max |
| 1999 | teaching_assistants | 19 | 0 | Xander |
| 1999 | teachers | 21 | 0 | Charlene |
| 1999 | teachers | 12 | 2 | Rob |
| 1999 | administrators | 23 | None | Marsha |
| 1999 | administrators | 11 | None | Ryan |
| 1999 | administrators | 14 | None | Bob |
+-----------+---------------------+----+---------------+----------+
For the administrators, the student_level column would just be null. The problem is that if I use the explode function, I end up with all of these items in different columns.
Is it possible to accomplish this in PySpark? One thought I had was to figure out how to combine the 4 array columns into 1 array and then do an explode on that array. However, I am not sure whether combining arrays of structs and getting the column names as a field is feasible (I've tried various things), and I also don't know whether it would work given that the administrators are missing a field.
In the past, I have done this by converting to RDD and using a flatmap/custom udf but it was very inefficient for millions of rows.
The idea is to use stack to transform the columns students, teaching_assistants, teachers and administrators into separate rows with the correct value for each type. After that, the column containing the data can be exploded and then the elements of the single structs can be transformed into separate columns.
Using stack requires that all columns that are stacked have the same type. This means that all columns must contain arrays of the same struct type and that the nullability of all fields of the struct must match. Therefore, the administrators column has to be converted into the correct struct type first.
df.withColumn("administrators", F.expr("transform(administrators, " +
"a -> if(1<2,named_struct('id', a.id, 'name', a.name, 'student_level', "+
"cast(null as long)),null))"))\
.select("School_id", F.expr("stack(4, 'students', students, "+
"'teaching_assistants', teaching_assistants, 'teachers', teachers, "+
"'administrators', administrators) as (type, temp1)")) \
.withColumn("temp2", F.explode("temp1"))\
.select("School_id", "type", "temp2.id", "temp2.name", "temp2.student_level")\
.show()
prints
+---------+-------------------+---+--------+-------------+
|School_id| type| id| name|student_level|
+---------+-------------------+---+--------+-------------+
| 1999| students| 1| Brian| 0|
| 1999| students| 9| Max| 2|
| 1999|teaching_assistants| 19| Xander| 0|
| 1999| teachers| 21|Charlene| 0|
| 1999| teachers| 12| Rob| 2|
| 1999| administrators| 23| Marsha| null|
| 1999| administrators| 11| Ryan| null|
| 1999| administrators| 14| Bob| null|
+---------+-------------------+---+--------+-------------+
The strange-looking if(1<2, named_struct(...), null) in the first line is necessary to set the correct nullability for the elements of the administrators array.
This solution works for Spark 2.4+. If it was possible to transform the administrators struct in a previous step, this solution would also work for earlier versions.
Given two dataframes which may have completely different schemas except for an index column (timestamp in this case), such as df1 and df2 below:
df1:
timestamp | length | width
1 | 10 | 20
3 | 5 | 3
df2:
timestamp | name | length
0 | "sample" | 3
2 | "test" | 6
How can I combine these two dataframes into one that would look something like this:
df3:
timestamp | df1 | df2
| length | width | name | length
0 | null | null | "sample" | 3
1 | 10 | 20 | null | null
2 | null | null | "test" | 6
3 | 5 | 3 | null | null
I am extremely new to spark, so this might not actually make a lot of sense. But the problem I am trying to solve is: I need to combine these dataframes so that later I can convert each row to a given object. However, they have to be ordered by timestamp, so when I write these objects out, they are in the correct order.
So for example, given the df3 above, I would be able to generate the following list of objects:
objs = [
ObjectType1(timestamp=0, name="sample", length=3),
ObjectType2(timestamp=1, length=10, width=20),
ObjectType1(timestamp=2, name="test", length=6),
ObjectType2(timestamp=3, length=5, width=3)
]
Perhaps combining the dataframes does not make sense, but how could I sort the dataframes individually and somehow grab the Rows from each one of them ordered by timestamp globally?
P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns of same name and type, but represent completely different data, so merging schema is not a possibility.
What you need is a full outer join, possibly renaming one of the columns, something like df1.join(df2.withColumnRenamed("length","length2"), Seq("timestamp"), "full_outer").
See this example, built from yours (just with less typing):
// data shaped like your example (shorter names to save typing)
case class t1(ts: Int, width: Int, l: Int)
case class t2(ts: Int, name: String, l: Int)

// create the data frames
val df1 = Seq(t1(1, 10, 20), t1(3, 5, 3)).toDF
val df2 = Seq(t2(0, "sample", 3), t2(2, "test", 6)).toDF

df1.join(df2.withColumnRenamed("l", "l2"), Seq("ts"), "full_outer").sort("ts").show
+---+-----+----+------+----+
| ts|width| l| name| l2|
+---+-----+----+------+----+
| 0| null|null|sample| 3|
| 1| 10| 20| null|null|
| 2| null|null| test| 6|
| 3| 5| 3| null|null|
+---+-----+----+------+----+
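For completeness, a hedged PySpark equivalent of the same idea, using the column names from the question (df1 and df2 as defined there):
df3 = (df1.join(df2.withColumnRenamed('length', 'length2'),
                on='timestamp', how='full_outer')
          .sort('timestamp'))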
I'm working on some data where I'm required to work with clusters.
I know the Spark framework won't let me have one single cluster; the minimum number of clusters is two.
I created some dummy random data to test my program, and my program is displaying wrong results because my KMeans function is generating ONE cluster! How come? I don't understand. Is it because my data is random? I have not specified any parameters for my k-means. This is the part of the code that handles the K-Means:
BisectingKMeans kmeans = new BisectingKMeans();
BisectingKMeansModel model = kmeans.fit(dataset); // trains the k-means on the dataset to create a model
Vector[] clusterCenters = model.clusterCenters();

dataset.show(false);

for (Vector v : clusterCenters) {
    System.out.println(v);
}
The output is the following:
+----+----+------+
|File|Size|Volume|
+----+----+------+
|F1 |13 |1689 |
|F2 |18 |1906 |
|F3 |16 |1829 |
|F4 |14 |1726 |
|F5 |10 |1524 |
|F6 |16 |1844 |
|F7 |15 |1752 |
|F8 |12 |1610 |
|F9 |10 |1510 |
|F10 |11 |1554 |
|F11 |12 |1632 |
|F12 |13 |1663 |
|F13 |18 |1901 |
|F14 |13 |1686 |
|F15 |18 |1910 |
|F16 |19 |1986 |
|F17 |11 |1585 |
|F18 |10 |1500 |
|F19 |13 |1665 |
|F20 |13 |1664 |
+----+----+------+
only showing top 20 rows
[-1.7541523789077474E-16,2.0655699373151038E-15] //only one cluster center!!! why??
Why does this happen? What do I need to fix to solve this? Having only one cluster ruins my program.
On random data, the correct output of bisecting k-means is often just a single cluster.
With bisecting k-means you only give a maximum number of clusters. But it can stop early if the results do not improve. In your case, splitting the data into two clusters apparently did not improve the quality, so the bisection was not accepted.
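As a hedged PySpark illustration of that point (k only sets an upper bound; the fitted model can come back with fewer clusters):
from pyspark.ml.clustering import BisectingKMeans

bkm = BisectingKMeans(k=4, featuresCol='features', seed=1)
model = bkm.fit(dataset)                  # `dataset` is assumed to have a 'features' vector column
print(len(model.clusterCenters()))        # can be less than 4 if no accepted split is found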