Can anyone please explain why case Row and Seq[Row] are used after exploding a DataFrame field that holds a collection of elements?
And can you also please explain why asInstanceOf is required to get the values from the exploded field?
Here is the syntax:
val explodedDepartmentWithEmployeesDF = departmentWithEmployeesDF.explode(departmentWithEmployeesDF("employees")) {
case Row(employee: Seq[Row]) =>
employee.map(employee =>
Employee(employee(0).asInstanceOf[String],
employee(1).asInstanceOf[String], employee(2).asInstanceOf[String]) ) }
First, I will note that I cannot explain why your explode() turns into Row(employee: Seq[Row]), as I don't know the schema of your DataFrame. I have to assume it has to do with the structure of your data.
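For illustration only, here is a sketch of the kind of schema that would produce such a match: a column that is an array of structs. The names Department, Employee and departmentWithEmployeesDF below are assumptions based on your question, not your actual code.
case class Employee(firstName: String, lastName: String, email: String)
case class Department(name: String, employees: Seq[Employee])

val departmentWithEmployeesDF = sc.parallelize(Seq(
  Department("Engineering", Seq(
    Employee("Ann", "Lee", "ann@example.com"),
    Employee("Bob", "Kay", "bob@example.com")))
)).toDF()

// employees is an array<struct<...>> column, so when it is passed to explode()
// each input arrives as a Row whose single field is a Seq[Row], which is why
// the pattern Row(employee: Seq[Row]) matches.
departmentWithEmployeesDF.printSchema()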
Not knowing your original data, I have created a small data set to work from.
scala> val df = sc.parallelize( Array( (1, "dsfds dsf dasf dsf dsf d"), (2, "2344 2353 24 23432 234"))).toDF("id", "text")
df: org.apache.spark.sql.DataFrame = [id: int, text: string]
If I now map over it, you can see that it returns rows containing data of type Any.
scala> df.map {case row: Row => (row(0), row(1)) }
res21: org.apache.spark.rdd.RDD[(Any, Any)] = MapPartitionsRDD[17] at map at <console>:33
You have basically lost the type information, which is why you need to explicitly specify the type when you want to use the data in the row.
scala> df.map {case row: Row => (row(0).asInstanceOf[Int], row(1).asInstanceOf[String]) }
res22: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[18] at map at <console>:33
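As an aside, Row also has typed getters, so the same map can be written without asInstanceOf. A minimal sketch against the same df:
import org.apache.spark.sql.Row

df.map { case row: Row => (row.getInt(0), row.getString(1)) }
// same RDD[(Int, String)] as above, just without the explicit casts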
So, in order to explode it, I have to do the following
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.Row
df.explode(col("id"), col("text")) {case row: Row =>
val id = row(0).asInstanceOf[Int]
val words = row(1).asInstanceOf[String].split(" ")
words.map(word => (id, word))
}
// Exiting paste mode, now interpreting.
import org.apache.spark.sql.Row
res30: org.apache.spark.sql.DataFrame = [id: int, text: string, _1: int, _2: string]
scala> res30 show
+---+--------------------+---+-----+
| id| text| _1| _2|
+---+--------------------+---+-----+
| 1|dsfds dsf dasf ds...| 1|dsfds|
| 1|dsfds dsf dasf ds...| 1| dsf|
| 1|dsfds dsf dasf ds...| 1| dasf|
| 1|dsfds dsf dasf ds...| 1| dsf|
| 1|dsfds dsf dasf ds...| 1| dsf|
| 1|dsfds dsf dasf ds...| 1| d|
| 2|2344 2353 24 2343...| 2| 2344|
| 2|2344 2353 24 2343...| 2| 2353|
| 2|2344 2353 24 2343...| 2| 24|
| 2|2344 2353 24 2343...| 2|23432|
| 2|2344 2353 24 2343...| 2| 234|
+---+--------------------+---+-----+
If you want named columns, you can define a case class to hold your exploded data:
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.Row
case class ExplodedData(word: String)
df.explode(col("id"), col("text")) {case row: Row =>
val words = row(1).asInstanceOf[String].split(" ")
words.map(word => ExplodedData(word))
}
// Exiting paste mode, now interpreting.
import org.apache.spark.sql.Row
defined class ExplodedData
res35: org.apache.spark.sql.DataFrame = [id: int, text: string, word: string]
scala> res35.select("id","word").show
+---+-----+
| id| word|
+---+-----+
| 1|dsfds|
| 1| dsf|
| 1| dasf|
| 1| dsf|
| 1| dsf|
| 1| d|
| 2| 2344|
| 2| 2353|
| 2| 24|
| 2|23432|
| 2| 234|
+---+-----+
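One more note: in Spark 2.x and later this DataFrame.explode method is deprecated in favour of the explode function in org.apache.spark.sql.functions. A rough equivalent of the word-splitting example above, assuming the same df:
import org.apache.spark.sql.functions.{col, explode, split}

df.select(col("id"), col("text"), explode(split(col("text"), " ")).as("word"))
  .show()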
Hope this brings some clarity.
I think you should read the documentation and run a quick test first.
explode on a DataFrame still returns a DataFrame, and it accepts a lambda function f: (Row) ⇒ TraversableOnce[A] as its parameter.
Inside the lambda you match the input by case. You already know that your input will be a Row wrapping the employees field, which is itself a Seq of Row, so the case for the input is Row(employee: Seq[Row]). If you don't understand this part, you can read up on the unapply function in Scala, which is what makes this pattern match work.
Then employee (I believe you should call it employees here), being a Seq of Row, is mapped to turn each inner Row into an Employee. You use Scala's apply method to get the i-th value of the row, but its return type is Any, so you have to use asInstanceOf to cast it to the type you expect.
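To add to that: you can make the casts a bit more readable with Row.getAs[T], which performs the same conversion under the hood. Here is a sketch of your snippet rewritten that way (the three String fields of Employee are taken from your question; the rest is unchanged):
import org.apache.spark.sql.Row

val explodedDepartmentWithEmployeesDF =
  departmentWithEmployeesDF.explode(departmentWithEmployeesDF("employees")) {
    case Row(employees: Seq[Row]) =>
      employees.map(e =>
        Employee(e.getAs[String](0), e.getAs[String](1), e.getAs[String](2)))
  }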
Related
I have two columns in my DataFrame, name1 and name2.
I want to combine them and count the occurrences (without null values!).
df = spark.createDataFrame([
["Luc Krier","Jeanny Thorn"],
["Jeanny Thorn","Ben Weller"],
[ "Teddy E Beecher","Luc Krier"],
["Philippe Schauss","Jeanny Thorn"],
["Meindert I Tholen","Liam Muller"],
["Meindert I Tholen",""]
]).toDF("name1", "name2")
Desired result:
+-----------------+----------+
|name             |Occurrence|
+-----------------+----------+
|Luc Krier        |2         |
|Jeanny Thorn     |3         |
|Teddy E Beecher  |1         |
|Philippe Schauss |1         |
|Meindert I Tholen|2         |
|Liam Muller      |1         |
|Ben Weller       |1         |
+-----------------+----------+
How can I achieve this?
You can use explode with the array function to merge the columns into one, then simply group by and count, like this:
from pyspark.sql.functions import col, array, explode, count
df.select(explode(array("name1", "name2")).alias("name")) \
.filter("nullif(name, '') is not null") \
.groupBy("name") \
.agg(count("*").alias("Occurrence")) \
.show()
#+-----------------+----------+
#| name|Occurrence|
#+-----------------+----------+
#|Meindert I Tholen| 2|
#| Jeanny Thorn| 3|
#| Luc Krier| 2|
#| Teddy E Beecher| 1|
#|Philippe Schauss| 1|
#| Ben Weller| 1|
#| Liam Muller| 1|
#+-----------------+----------+
Another way is to select each column, union them, then group by and count:
df.select(col("name1").alias("name")).union(df.select(col("name2").alias("name"))) \
.filter("nullif(name, '') is not null")\
.groupBy("name") \
.agg(count("name").alias("Occurrence")) \
.show()
Many fancy answers out there, but the easiest solution should be to do a union and then aggregate the count:
df2 = (df.select('name1')
.union(df.select('name2'))
.filter("name1 != ''")
.groupBy('name1')
.count()
.toDF('name', 'Occurrence')
)
df2.show()
+-----------------+----------+
| name|Occurrence|
+-----------------+----------+
|Meindert I Tholen| 2|
| Jeanny Thorn| 3|
| Luc Krier| 2|
| Teddy E Beecher| 1|
|Philippe Schauss| 1|
| Ben Weller| 1|
| Liam Muller| 1|
+-----------------+----------+
There are better ways to do it. One naive way of doing it is as follows
from collections import Counter
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("OccurenceCount").getOrCreate()
df = spark.createDataFrame([
["Luc Krier","Jeanny Thorn"],
["Jeanny Thorn","Ben Weller"],
[ "Teddy E Beecher","Luc Krier"],
["Philippe Schauss","Jeanny Thorn"],
["Meindert I Tholen","Liam Muller"],
["Meindert I Tholen",""]
]).toDF("name1", "name2")
counter_dict = dict(Counter(df.select("name1", "name2").rdd.flatMap(lambda x: x).collect()))
counter_list = list(map(list, counter_dict.items()))
frequency_df = spark.createDataFrame(counter_list, ["name", "Occurrence"])
frequency_df.show()
Output:
+-----------------+----------+
| name|Occurrence|
+-----------------+----------+
| | 1|
| Liam Muller| 1|
| Teddy E Beecher| 1|
| Ben Weller| 1|
| Jeanny Thorn| 3|
| Luc Krier| 2|
|Philippe Schauss| 1|
|Meindert I Tholen| 2|
+-----------------+----------+
Does this work?
# Groupby & count both dataframes individually to reduce size.
df_name1 = (df.groupby(['name1']).count()
.withColumnRenamed('name1', 'name')
.withColumnRenamed('count', 'count1'))
df_name2 = (df.groupby(['name2']).count()
.withColumnRenamed('name2', 'name')
.withColumnRenamed('count', 'count2'))
# Join the two dataframes containing frequency counts
# Any null value in the 'count' column can be correctly interpreted as zero.
df_count = (df_name1.join(df_name2, on=['name'], how='outer')
.fillna(0, subset=['count1', 'count2']))
# Sum the two counts and drop the useless columns
df_count = (df_count.withColumn('count', df_count['count1'] + df_count['count2'])
.drop('count1').drop('count2').dropna(subset=['name']))
# (Optional) While any rows with a null name have been removed, rows with an
# empty string ("") for a name are still there. We can drop the empty name
# rows like this.
df_count = df_count[df_count['name'] != '']
df_count.show()
# +-----------------+-----+
# | name|count|
# +-----------------+-----+
# |Meindert I Tholen| 2|
# | Jeanny Thorn| 3|
# | Luc Krier| 2|
# | Teddy E Beecher| 1|
# |Philippe Schauss| 1|
# | Ben Weller| 1|
# | Liam Muller| 1|
# +-----------------+-----+
You can get the required output as follows in Scala:
import org.apache.spark.sql.functions._
val df = Seq(
("Luc Krier","Jeanny Thorn"),
("Jeanny Thorn","Ben Weller"),
( "Teddy E Beecher","Luc Krier"),
("Philippe Schauss","Jeanny Thorn"),
("Meindert I Tholen","Liam Muller"),
("Meindert I Tholen","")
).toDF("name1", "name2")
val df1 = df.filter($"name1".isNotNull).filter($"name1" !==
"").groupBy("name1").agg(count("name1").as("count1"))
val df2 = df.filter($"name2".isNotNull).filter($"name2" !==
"").groupBy("name2").agg(count("name2").as("count2"))
val newdf = df1.join(df2, $"name1" === $"name2","outer").withColumn("count1",
when($"count1".isNull,0).otherwise($"count1")).withColumn("count2",
when($"count2".isNull,0).otherwise($"count2")).withColumn("Count",$"count1" +
$"count2")
val finalDF =newdf.withColumn("name",when($"name1".isNull,$"name2")
.when($"name2".isNull,$"name1").otherwise($"name1")).select("name","Count")
display(finalDF)
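For comparison, here is a more compact Scala sketch mirroring the explode(array(...)) approach from the PySpark answer above, assuming the same df as in this answer:
import org.apache.spark.sql.functions.{array, col, count, explode}

df.select(explode(array(col("name1"), col("name2"))).as("name"))
  .filter("name is not null and name != ''")
  .groupBy("name")
  .agg(count(col("name")).as("Occurrence"))
  .show()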
I have data like:
id,ts_start,ts_end,foo_start,foo_end
1,1,2,f_s,f_e
2,3,4,foo,bar
3,3,6,foo,f_e
I.e. a single record with all the start and end information aggregated.
Using a flatMap, these could be transformed to:
id,ts,foo
1,1,f_s
1,2,f_e
How can I do the same using the optimized SQL DSL with explode or maybe pivot?
Edit:
Obviously, I do not want to read the data in twice and union the result.
Or is this the only option if I do not want to use flatMap + serde + custom code?
given:
val df = Seq(
(1,1,2,"f_s","f_e"),
(2,3,4,"foo","bar"),
(3,3,6,"foo","f_e")
).toDF("id","ts_start","ts_end","foo_start","foo_end")
you can do:
df
.select($"id",
explode(
array(
struct($"ts_start".as("ts"),$"foo_start".as("foo")),
struct($"ts_end".as("ts"),$"foo_end".as("foo"))
)
).as("tmp")
)
.select(
$"id",
$"tmp.*"
)
.show()
which gives:
+---+---+---+
| id| ts|foo|
+---+---+---+
| 1| 1|f_s|
| 1| 2|f_e|
| 2| 3|foo|
| 2| 4|bar|
| 3| 3|foo|
| 3| 6|f_e|
+---+---+---+
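An alternative sketch uses the SQL stack generator through selectExpr, which unpivots the column pairs without building the structs by hand (this assumes a Spark version recent enough to provide stack):
df.selectExpr(
    "id",
    "stack(2, ts_start, foo_start, ts_end, foo_end) as (ts, foo)")
  .show()
This produces the same id/ts/foo rows as the explode-on-array version above.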
I can flatMap the 2nd element of the RDD, fine.
val rdd = sc.parallelize( Seq( (1, "Hello how are you"),
(1, "I am fine"),
(2, "Yes you are")
)
)
val rdd2 = rdd.flatMap(x => x._2.split(" "))
However, I would like to append x._1 to each split item of x._2 immediately to form a tuple (String, Int). For some reason I cannot see it - and I do not want to convert to a DF array and explode. Any ideas?
Just iterate over the array (split result) and append the value you need:
val rdd = sc.parallelize( Seq( (1, "Hello how are you"),
(1, "I am fine"),
(2, "Yes you are")
)
)
val rdd2 = rdd.flatMap(x => x._2.split(" ").map(item => s"${item}+${x._1}"))
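If you literally want a (String, Int) tuple rather than a concatenated string, a minimal variant of the same flatMap is shown below (the order of the tuple fields is an assumption based on the question):
val rdd3 = rdd.flatMap { case (id, text) =>
  text.split(" ").map(word => (word, id))
}
// rdd3: RDD[(String, Int)]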
You can get the same results at the DataFrame abstraction as well. Check this out:
val df = Seq( (1, "Hello how are you"),(1, "I am fine"),(2, "Yes you are")).toDF("a","b")
df.show(false)
df.flatMap { r =>
  val words = r.getString(1).split(" ")
  (0 until words.size).map(i => (r.getInt(0), words(i)))
}.show
Results:
+---+-----------------+
|a |b |
+---+-----------------+
|1 |Hello how are you|
|1 |I am fine |
|2 |Yes you are |
+---+-----------------+
+---+-----+
| _1| _2|
+---+-----+
| 1|Hello|
| 1| how|
| 1| are|
| 1| you|
| 1| I|
| 1| am|
| 1| fine|
| 2| Yes|
| 2| you|
| 2| are|
+---+-----+
I have the following Hive table:
select * from employee;
OK
abc 19 da
xyz 25 sa
pqr 30 er
suv 45 dr
When I read this in Spark (PySpark):
df = hiveCtx.sql('select* from spark_hive.employee')
df.show()
+----+----+-----+
|name| age| role|
+----+----+-----+
|name|null| role|
| abc| 19| da|
| xyz| 25| sa|
| pqr| 30| er|
| suv| 45| dr|
+----+----+-----+
I end up getting the header in my Spark DataFrame. Is there a simple way to remove it?
Also, am I missing something while reading the table into the DataFrame (ideally I shouldn't be getting the header, right)?
You have to remove the header from the result. You can do it like this:
scala> val df = sql("select * from employee")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> df.show
+----+----+----+
| id|name| age|
+----+----+----+
|null|name|null|
| 1| abc| 19|
| 2| xyz| 25|
| 3| pqr| 30|
| 4| suv| 45|
+----+----+----+
scala> val header = df.first()
header: org.apache.spark.sql.Row = [null,name,null]
scala> val data = df.filter(row => row != header)
data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, name: string ... 1 more field]
scala> data.show
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 19|
| 2| xyz| 25|
| 3| pqr| 30|
| 4| suv| 45|
+---+----+---+
Thanks.
You can use skip.header.line.count to skip this header. You could also specify it while creating the table. For example:
create external table testtable ( id int,name string, age int)
row format delimited .............
tblproperties ("skip.header.line.count"="1");
After that, load the data and then check your query; I hope you will get the expected output.
Not the most elegant way, but this worked with PySpark:
header = dfemp.first()  # the stray header row (cf. the Scala answer above)
rddWithoutHeader = dfemp.rdd.filter(lambda line: line != header)
dfnew = sqlContext.createDataFrame(rddWithoutHeader)
I have a dataset in the following way:
FieldA FieldB ArrayField
1 A {1,2,3}
2 B {3,5}
I would like to explode the data on ArrayField so the output will look in the following way:
FieldA FieldB ExplodedField
1 A 1
1 A 2
1 A 3
2 B 3
2 B 5
I mean I want to generate an output line for each item in the array in ArrayField while keeping the values of the other fields.
How would you implement this in Spark?
Notice that the input dataset is very large.
The explode function should get that done.
pyspark version:
>>> df = spark.createDataFrame([(1, "A", [1,2,3]), (2, "B", [3,5])],["col1", "col2", "col3"])
>>> from pyspark.sql.functions import explode
>>> df.withColumn("col3", explode(df.col3)).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| A| 1|
| 1| A| 2|
| 1| A| 3|
| 2| B| 3|
| 2| B| 5|
+----+----+----+
Scala version
scala> val df = Seq((1, "A", Seq(1,2,3)), (2, "B", Seq(3,5))).toDF("col1", "col2", "col3")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]
scala> df.withColumn("col3", explode($"col3")).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| A| 1|
| 1| A| 2|
| 1| A| 3|
| 2| B| 3|
| 2| B| 5|
+----+----+----+
You can use the explode function.
Below is a simple example for your case:
import org.apache.spark.sql.functions._
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
(1, "A", List(1,2,3)),
(2, "B", List(3, 5))
)).toDF("FieldA", "FieldB", "FieldC")
data.withColumn("ExplodedField", explode($"FieldC")).drop("FieldC")
Hope this helps!
explode does exactly what you want. Docs:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode
Also, here is an example from a different question using it:
https://stackoverflow.com/a/44418598/1461187