pyspark: check if an element is in collect_list [duplicate]

This question already has answers here:
How to filter based on array value in PySpark?
(2 answers)
Closed 4 years ago.
I am working with a dataframe df, for instance the following one:
df.show()
Output:
+----+------+
|keys|values|
+----+------+
| aa| apple|
| bb|orange|
| bb| desk|
| bb|orange|
| bb| desk|
| aa| pen|
| bb|pencil|
| aa| chair|
+----+------+
I use collect_set to aggregate and get a set of objects with duplicate elements eliminated (or collect_list to get a list of objects):
from pyspark.sql.functions import collect_set
df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))
The resulting dataframe is then as follows:
df_new.show()
Output:
+----+----------------------+
|keys|collectedSet_values |
+----+----------------------+
|bb |[orange, pencil, desk]|
|aa |[apple, pen, chair] |
+----+----------------------+
I am struggling to find a way to see if a specific keyword (like 'chair') is in the resulting set of objects (in the collectedSet_values column). I do not want to go with a UDF solution.
Please comment your solutions/ideas.
Kind Regards.

Actually there is a nice function, array_contains, which does exactly that. It works the same way on a collected set as on any other array column. To check whether the word 'chair' exists in each set, we can simply do the following:
from pyspark.sql.functions import array_contains
df_new.withColumn('contains_chair', array_contains(df_new.collectedSet_values, 'chair')).show()
Output:
+----+----------------------+--------------+
|keys|collectedSet_values |contains_chair|
+----+----------------------+--------------+
|bb |[orange, pencil, desk]|false |
|aa |[apple, pen, chair] |true |
+----+----------------------+--------------+
The same applies to the result of collect_list.
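For example, a minimal sketch using collect_list instead (the column name collectedList_values is just an illustrative choice; duplicates are kept, but array_contains behaves the same):
from pyspark.sql.functions import array_contains, collect_list

df_list = df.groupby('keys').agg(collect_list(df.values).alias('collectedList_values'))
df_list.withColumn('contains_chair', array_contains(df_list.collectedList_values, 'chair')).show()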

Related

How to split a dictionary in a Pyspark dataframe into multiple rows?

I have the following dataframe, which is extracted with this command:
extract = data.select('properties.id', 'flags')
| id | flags |
|-------| ---------------------------|
| v_001 | "{"93":true,"83":true}" |
| v_002 | "{"45":true,"76":true}" |
The desired result I want is:
| id | flags |
|-------| ------|
| v_001 | 93 |
| v_001 | 83 |
| v_002 | 45 |
| v_002 | 76 |
I tried to apply explode as follows:
extract = data.select('properties.id', explode(col('flags')))
But I encountered the following:
cannot resolve 'explode(flags)' due to data type mismatch: input to function explode should be array or map type, not struct<93:boolean,83:boolean,45:boolean,76:boolean>
This makes sense as the schema of the column is not compatible with the explode function. How can I adjust the function to get my desired result? Is there a better way to solve this problem?
P.S.: The desired table schema is not the best design, but that is out of my scope since it would involve a separate discussion.
As you may have already seen, explode requires an ArrayType (or MapType) input, and it seems you only want the keys from the dict in flags.
So you can first convert flags to a MapType with from_json and use map_keys to extract all the keys into a list.
df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
This will result in something like this:
+-----+--------+
| id| flags|
+-----+--------+
|v_001|[93, 83]|
|v_002|[45, 76]|
+-----+--------+
Then you can use explode on the flags.
.select('id', F.explode('flags'))
+-----+---+
| id|col|
+-----+---+
|v_001| 93|
|v_001| 83|
|v_002| 45|
|v_002| 76|
+-----+---+
The whole code:
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, BooleanType

df = (df.withColumn('flags', F.map_keys(F.from_json('flags', MapType(StringType(), BooleanType()))))
      .select('id', F.explode('flags')))
Update
It is probably better to supply the schema when reading so that flags comes in as a MapType, but if your JSON is complex and a schema is hard to write, you can convert the struct to a string once and then convert it to a MapType.
# Add this line before `from_json`
df = df.select('id', F.to_json('flags').alias('flags'))
# Or you can do it in one shot.
df = (df.withColumn('flags', F.map_keys(F.from_json(F.to_json('flags'), MapType(StringType(), BooleanType()))))
.select('id', F.explode('flags')))
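For the first option (supplying the schema at read time), a rough sketch might look like this; the file path, the JSON format and the outer properties struct layout are assumptions rather than details from the question:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, MapType

# Hypothetical read: declaring flags as a map up front avoids the
# struct -> string -> map round trip entirely.
schema = StructType([
    StructField('properties', StructType([StructField('id', StringType())])),
    StructField('flags', MapType(StringType(), BooleanType())),
])
data = spark.read.schema(schema).json('path/to/source.json')  # illustrative path
extract = (data.select('properties.id', F.map_keys('flags').alias('flags'))
               .select('id', F.explode('flags')))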

How to implement a custom Pyspark explode (for array of structs), 4 columns in 1 explode?

I am trying to implement a custom explode in PySpark. I have 4 columns that are arrays of structs with virtually the same schema (one column's structs contain one less field than the other three's).
For each row in my DataFrame, I have 4 columns that are arrays of structs. The columns are students, teaching_assistants, teachers, administrators.
The students, teaching_assistants and teachers columns are arrays of structs with the fields "id", "student_level" and "name", while the administrators structs have only the "id" and "name" fields and are missing student_level.
I want to perform a custom explode such that for every row I get one entry for each student, teaching assistant, teacher and administrator, along with the original column name, in case I need to search by "person type".
For example, for a sample row with School_id 1999, the output would be these 8 rows:
+-----------+---------------------+----+---------------+----------+
| School_id | type | id | student_level | name |
+-----------+---------------------+----+---------------+----------+
| 1999 | students | 1 | 0 | Brian |
| 1999 | students | 9 | 2 | Max |
| 1999 | teaching_assistants | 19 | 0 | Xander |
| 1999 | teachers | 21 | 0 | Charlene |
| 1999 | teachers | 12 | 2 | Rob |
| 1999 | administrators | 23 | None | Marsha |
| 1999 | administrators | 11 | None | Ryan |
| 1999 | administrators | 14 | None | Bob |
+-----------+---------------------+----+---------------+----------+
For the administrators, the student_level column would just be null. The problem is if I use the explode function, I end up with all of these items in different columns.
Is it possible to accomplish this in Pyspark? One thought I had was to figure out how to combine the 4 array columns into 1 array and then do an explode on the array, although I am not sure if combining arrays of structs and getting the column names as a field is feasible (I've tried various things) and I also don't know if it would work if the administrators were missing a field.
In the past, I have done this by converting to RDD and using a flatmap/custom udf but it was very inefficient for millions of rows.
The idea is to use stack to transform the columns students, teaching_assistants, teachers and administrators into separate rows with the correct value for each type. After that, the column containing the data can be exploded and then the elements of the single structs can be transformed into separate columns.
Using stack requires that all columns that are stacked have the same type. This means that all columns must contain arrays of the same struct, and the nullability of all elements of the struct must match. Therefore the administrators column has to be converted into the correct struct type first.
df.withColumn("administrators", F.expr("transform(administrators, " +
"a -> if(1<2,named_struct('id', a.id, 'name', a.name, 'student_level', "+
"cast(null as long)),null))"))\
.select("School_id", F.expr("stack(4, 'students', students, "+
"'teaching_assistants', teaching_assistants, 'teachers', teachers, "+
"'administrators', administrators) as (type, temp1)")) \
.withColumn("temp2", F.explode("temp1"))\
.select("School_id", "type", "temp2.id", "temp2.name", "temp2.student_level")\
.show()
prints
+---------+-------------------+---+--------+-------------+
|School_id| type| id| name|student_level|
+---------+-------------------+---+--------+-------------+
| 1999| students| 1| Brian| 0|
| 1999| students| 9| Max| 2|
| 1999|teaching_assistants| 19| Xander| 0|
| 1999| teachers| 21|Charlene| 0|
| 1999| teachers| 12| Rob| 2|
| 1999| administrators| 23| Marsha| null|
| 1999| administrators| 11| Ryan| null|
| 1999| administrators| 14| Bob| null|
+---------+-------------------+---+--------+-------------+
The strange-looking if(1<2, named_struct(...), null) in the first line is necessary to set the correct nullability for the elements of the administrators array.
This solution works for Spark 2.4+. If it were possible to transform the administrators struct in a previous step, this solution would also work for earlier versions.
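If you happen to be on Spark 3.1+, a roughly equivalent way to pad the administrators column with the Python API instead of the SQL expression string might look like the sketch below (an assumption on my part, not part of the original answer; the nullability of the resulting struct fields may still need the same kind of adjustment before stack accepts it):
from pyspark.sql import functions as F

# Sketch: build the padded administrators structs with F.transform (Spark 3.1+).
df_padded = df.withColumn(
    "administrators",
    F.transform(
        "administrators",
        lambda a: F.struct(
            a["id"].alias("id"),
            a["name"].alias("name"),
            F.lit(None).cast("long").alias("student_level"),
        ),
    ),
)
The stack, explode and final select steps from above stay the same.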

How to combine and sort different dataframes into one?

Given two dataframes, which may have completely different schemas except for an index column (timestamp in this case), such as df1 and df2 below:
df1:
timestamp | length | width
1 | 10 | 20
3 | 5 | 3
df2:
timestamp | name | length
0 | "sample" | 3
2 | "test" | 6
How can I combine these two dataframes into one that would look something like this:
df3:
timestamp | df1            | df2
          | length | width | name     | length
0         | null   | null  | "sample" | 3
1         | 10     | 20    | null     | null
2         | null   | null  | "test"   | 6
3         | 5      | 3     | null     | null
I am extremely new to spark, so this might not actually make a lot of sense. But the problem I am trying to solve is: I need to combine these dataframes so that later I can convert each row to a given object. However, they have to be ordered by timestamp, so when I write these objects out, they are in the correct order.
So for example, given the df3 above, I would be able to generate the following list of objects:
objs = [
ObjectType1(timestamp=0, name="sample", length=3),
ObjectType2(timestamp=1, length=10, width=20),
ObjectType1(timestamp=2, name="test", length=6),
ObjectType2(timestamp=3, length=5, width=3)
]
Perhaps combining the dataframes does not make sense, but how could I sort the dataframes individually and somehow grab the Rows from each one of them ordered by timestamp globally?
P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns of the same name and type but represent completely different data, so merging schemas is not an option.
What you need is a full outer join, possibly renaming one of the columns, something like df1.join(df2.withColumnRenamed("length","length2"), Seq("timestamp"), "full_outer").
See this example, built from yours (just less typing):
// data shaped as your example
case class t1(ts: Int, l: Int, width: Int)
case class t2(ts: Int, name: String, l: Int)
// create data frames
val df1 = Seq(t1(1, 10, 20), t1(3, 5, 3)).toDF
val df2 = Seq(t2(0, "sample", 3), t2(2, "test", 6)).toDF
df1.join(df2.withColumnRenamed("l", "l2"), Seq("ts"), "full_outer").sort("ts").show
+---+----+-----+------+----+
| ts|   l|width|  name|  l2|
+---+----+-----+------+----+
|  0|null| null|sample|   3|
|  1|  10|   20|  null|null|
|  2|null| null|  test|   6|
|  3|   5|    3|  null|null|
+---+----+-----+------+----+
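For completeness, a rough PySpark equivalent against the original column names from the question (a sketch, untested on your data):
df3 = (df1.join(df2.withColumnRenamed('length', 'length2'),
                on='timestamp', how='full_outer')
          .sort('timestamp'))
df3.show()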

Divide spark dataframe into chunks using row values as separators

In my PySpark code I have a DataFrame populated with data coming from a sensor; each row has a timestamp, an event_id and an event_value.
Each sensor event is composed of measurements defined by an id and a value. The only guarantee I have is that all the "phases" related to a single event are included between two EV_SEP rows (unsorted).
Inside each event "block" there is an event label, which is the value associated with EV_CODE.
+-------------------------+------------+-------------+
| timestamp | event_id | event_value |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:12.540 | EV_SEP | ----- |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:14.201 | EV_2 | 10 |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:13.331 | EV_1 | 11 |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:15.203 | EV_CODE | ABC |
+-------------------------+------------+-------------+
| 2017-01-01 00:00:16.670 | EV_SEP | ----- |
+-------------------------+------------+-------------+
I would like to create a new column containing that label, so that every row is associated with the label of its block:
+-------------------------+----------+-------------+------------+
| timestamp | event_id | event_value | event_code |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:12.540 | EV_SEP | ----- | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:14.201 | EV_2 | 10 | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:13.331 | EV_1 | 11 | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:15.203 | EV_CODE | ABC | ABC |
+-------------------------+----------+-------------+------------+
| 2017-01-01 00:00:16.670 | EV_SEP | ----- | ABC |
+-------------------------+----------+-------------+------------+
With pandas I can easily get the indexes of the EV_SEP rows, split the table into blocks, take the EV_CODE from each block and create an event_code column with that value.
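For reference, the pandas version is roughly the following (a sketch that labels blocks with a running count of EV_SEP rows instead of explicit index splitting):
pdf = df.toPandas().sort_values('timestamp').reset_index(drop=True)
# A running count of separators labels each block, then each block gets
# the EV_CODE value found inside it.
pdf['block'] = (pdf['event_id'] == 'EV_SEP').cumsum()
codes = (pdf[pdf['event_id'] == 'EV_CODE']
         .set_index('block')['event_value']
         .rename('event_code'))
pdf = pdf.join(codes, on='block').drop(columns='block')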
A possible solution would be:
Sort the DataFrame by timestamp
Convert the DataFrame to an RDD and call zipWithIndex
Get the indexes of the rows containing EV_SEP
Calculate the block ranges (start_index, end_index)
Process each "chunk" (filtering on indexes) to extract its EV_CODE
Finally, create the wanted column
Is there any better way to solve this problem?
from pyspark.sql import functions as f
from pyspark.sql.window import Window
Sample data:
df.show()
+-----------------------+--------+-----------+
|timestamp |event_id|event_value|
+-----------------------+--------+-----------+
|2017-01-01 00:00:12.540|EV_SEP |null |
|2017-01-01 00:00:14.201|EV_2 |10 |
|2017-01-01 00:00:13.331|EV_1 |11 |
|2017-01-01 00:00:15.203|EV_CODE |ABC |
|2017-01-01 00:00:16.670|EV_SEP |null |
|2017-01-01 00:00:20.201|EV_2 |10 |
|2017-01-01 00:00:24.203|EV_CODE |DEF |
|2017-01-01 00:00:31.670|EV_SEP |null |
+-----------------------+--------+-----------+
Add index:
df_idx = df.filter(df['event_id'] == 'EV_SEP') \
.withColumn('idx', f.row_number().over(Window.partitionBy().orderBy(df['timestamp'])))
df_block = df.filter(df['event_id'] != 'EV_SEP').withColumn('idx', f.lit(0))
'Spread' index:
df = df_idx.union(df_block).withColumn('idx', f.max('idx').over(
Window.partitionBy().orderBy('timestamp').rowsBetween(Window.unboundedPreceding, Window.currentRow))).cache()
Add EV_CODE:
df_code = df.filter(df['event_id'] == 'EV_CODE').withColumnRenamed('event_value', 'event_code')
df = df.join(df_code, on=[df['idx'] == df_code['idx']]) \
.select(df['timestamp'], df['event_id'], df['event_value'], df_code['event_code'])
Finally:
+-----------------------+--------+-----------+----------+
|timestamp |event_id|event_value|event_code|
+-----------------------+--------+-----------+----------+
|2017-01-01 00:00:12.540|EV_SEP |null |ABC |
|2017-01-01 00:00:13.331|EV_1 |11 |ABC |
|2017-01-01 00:00:14.201|EV_2 |10 |ABC |
|2017-01-01 00:00:15.203|EV_CODE |ABC |ABC |
|2017-01-01 00:00:16.670|EV_SEP |null |DEF |
|2017-01-01 00:00:20.201|EV_2 |10 |DEF |
|2017-01-01 00:00:24.203|EV_CODE |DEF |DEF |
+-----------------------+--------+-----------+----------+
Creating a new Hadoop InputFormat would be a more computationally efficient way to accomplish your goal here (although it is arguably the same or more gymnastics in terms of code). You can specify alternative Hadoop input formats using sc.hadoopFile in the Python API, but you must take care of the conversion from the Java format to Python. The converters available in PySpark are relatively few; the Avro converter is the usual example. You might also simply find it convenient to let your custom Hadoop input format output text, which you then additionally parse in Python, to avoid the issue of implementing a converter.
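Loading such a custom format from PySpark would then look roughly like the sketch below; com.example.EvSepInputFormat is a hypothetical class that you would still have to write in Java or Scala (as described next) and ship on the classpath:
# Hypothetical usage of a custom InputFormat that emits each
# EV_SEP-delimited block as a single Text value.
rdd = sc.newAPIHadoopFile(
    'hdfs:///path/to/sensor/data',       # illustrative path
    'com.example.EvSepInputFormat',      # hypothetical class, not an existing library class
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
)
# Each value is the raw text of one block; parse it further in Python.
blocks = rdd.values().map(lambda block: block.splitlines())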
Once you have that in place, you would create a special input format (in Java or Scala, using the Hadoop APIs) that treats the special sequences of rows having EV_SEP as record delimiters instead of the newline character. You could do this quite simply by collecting rows as they are read into an accumulator (a simple ArrayList would do as a proof of concept) and then emitting the accumulated list of records when you find two EV_SEP rows in a row.
I would point out that using TextInputFormat as a basis for such a design might be tempting, but that input format will split files arbitrarily at newline characters, so you will need custom logic to properly support splitting the files. Alternatively, you can avoid the problem by simply not implementing file splitting, which is a simple change to the input format (have isSplitable return false).
If you do need to split files, the basic idea is:
Pick a split offset by evenly dividing the file into parts
Seek to the offset
Seek back character-by-character from the offset to where the delimiter sequence is found (in this case, two rows in a row with type EV_SEP).
Detecting these sequences for the edge case around file splitting would be a challenge. I would suggest establishing the largest byte-width of rows and reading sliding-window chunks of an appropriate width (basically 2x the size of the rows) backwards from your starting point, then matching against those windows using a precompiled Java regex and Matcher. This is similar to how Sequence Files find their sync marks, but uses a regex to detect the sequence instead of strict equality.
As a side note, I would be concerned given some of the other background you mention here that sorting the DataFrame by timestamp could alter the contents of events that happen in the same time period in different files.

How to apply different aggregation functions to the same column when grouping a spark dataframe? [duplicate]

This question already has answers here:
Multiple Aggregate operations on the same column of a spark dataframe
(6 answers)
Closed 5 years ago.
To group a Spark DataFrame with PySpark I use a command like this:
df2 = df.groupBy('_c1','_c3').agg({'_c4':'max', '_c2' : 'avg'})
As a result I get output like this:
+-----------------+-------------+------------------+--------+
| _c1| _c3| avg(_c2)|max(_c4)|
+-----------------+-------------+------------------+--------+
| Local-gov| HS-grad| 644952.5714285715| 9|
| Local-gov| Assoc-acdm|365081.64285714284| 12|
| Never-worked| Some-college| 462294.0| 10|
| Local-gov| Bachelors| 398296.35| 13|
| Federal-gov| HS-grad| 493293.0| 9|
| Private| 12th| 632520.5454545454| 8|
| State-gov| Assoc-voc| 412814.0| 11|
| ?| HS-grad| 545870.9230769231| 9|
| Private| Prof-school|340322.89130434784| 15|
+-----------------+-------------+------------------+--------+
This is nice, but there are two things that I am missing:
I would like to have control over the names of the columns. For example, I want the new column to be named avg_c2 instead of avg(_c2).
I want to aggregate the same column in different ways. For example, I might want to know the minimum and maximum of column _c4. I tried the following and it does not work:
df2 = df.groupBy('_c1','_c3').agg({'_c4':('min','max'), '_c2' : 'avg'})
Is there a way to achieve what I need?
You have to use the withColumn API and generate new columns or replace the old ones.
Or you can use alias to get the required column name instead of the default avg(_c2).
I haven't used PySpark yet, but in Scala I do something like this:
import org.apache.spark.sql.functions._
val df2 = df.groupBy("_c1","_c3").agg(max(col("_c4")).alias("max_c4"), min(col("_c4")).alias("min_c4"), avg(col("_c2")).alias("avg_c2"))
