DataFrame and DataSet - converting values to <k,v> pair - apache-spark

Sample Input (black coloured text) and Output (red coloured text)
I have a DataFrame (one in black), how can I transform it to one like in red?
(column number, value)
[Image is attached]
val df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("file:/home/hduser/Desktop/Demo.csv")
case class Employee(EmpId: String, Experience: Double, Salary: Double)
val ds = df.as[Employee]
I need the solution in both DataFrame and DataSet way.
Thank you in advance! :-)

I believe it's a structure you want when you say pair. check if below code gives your expected output.
With DataFrame:
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
val data = Seq(("111",5,50000),("222",6,60000),("333",7,60000))
val df = data.toDF("EmpId","Experience","Salary")
val newdf = df.withColumn("EmpId", struct(lit("1").as("key"),col("EmpId").as("value")))
.withColumn("Experience", struct(lit("2").as("key"),col("Experience").as("value")))
.withColumn("Salary", struct(lit("3").as("key"),col("Salary").as("value")))
.show(false)
output:
+--------+----------+----------+
|EmpId |Experience|Salary |
+--------+----------+----------+
|[1, 111]|[2, 5] |[3, 50000]|
|[1, 222]|[2, 6] |[3, 60000]|
|[1, 333]|[2, 7] |[3, 60000]|
+--------+----------+----------+
With Dataset:
First you need to define case class for new structure otherwise you can't create a dataset
case class Employee2(EmpId: EmpData, Experience: EmpData, Salary: EmpData)
case class EmpData(key: String,value:String)
val ds = df.as[Employee]
val newDS = ds.map(rec=>{
(EmpData("1",rec.EmpId), EmpData("2",rec.Experience.toString),EmpData("3",rec.Salary.toString))
})
val finalDS = newDS.toDF("EmpId","Experience","Salary").as[Employee2]
finalDS.show(false)
Output:
+--------+--------+------------+
|EmpId |Experience|Salary |
+--------+--------+------------+
|[1, 111]|[2, 5] |[3, 50000] |
|[1, 222]|[2, 6] |[3, 60000] |
|[1, 333]|[2, 7] |[3, 60000] |
+--------+--------+------------+
Thanks

Related

Loading Data into Spark Dataframe without delimiters in source

I have a dataset with no delimiters:
111222333444
555666777888
Desired output:
|_c1_|_c2_|_c3_|_c4_|
|111 |222 |333 |444 |
|555 |666 |777 |888 |
i have tried this to attain the output
val myDF = spark.sparkContext.textFile("myFile").toDF()
val myNewDF = myDF.withColumn("c1", substring(col("value"), 0, 3))
.withColumn("c2", substring(col("value"), 3, 6))
.withColumn("c3", substring(col("value"), 6, 9)
.withColumn("c4", substring(col("value"), 9, 12))
.drop("value")
.show()
but i need to manipulate c4 (multiply 100) but the datatype is string not double.
Update: I encountered a scenarios
when i execute this,
val myNewDF = myDF.withColumn("c1", expr("substring(value, 0, 3)"))
.withColumn("c2", expr("substring(value, 3, 6"))
.withColumn("c3", expr("substring(value, 6, 9)"))
.withColumn("c4", (expr("substring(value, 9, 12)").cast("double") * 100))
.drop("value")
myNewDF.show(5,false) // it only shows "value" column (which i dropped) and "c1" column
myNewDF.printSchema // only showing 2 rows. why is it not showing all the newly created 4 columns?
Create test dataframe:
scala> var df = Seq(("111222333444"),("555666777888")).toDF("s")
df: org.apache.spark.sql.DataFrame = [s: string]
Split column s into an array of 3-character chunks:
scala> var res = df.withColumn("temp",split(col("s"),"(?<=\\G...)"))
res: org.apache.spark.sql.DataFrame = [s: string, temp: array<string>]
Map array elements to new columns:
scala> res = res.select((1 until 5).map(i => col("temp").getItem(i-1).as("c"+i)):_*)
res: org.apache.spark.sql.DataFrame = [c1: string, c2: string ... 2 more fields]
scala> res.show(false)
+---+---+---+---+
|c1 |c2 |c3 |c4 |
+---+---+---+---+
|111|222|333|444|
|555|666|777|888|
+---+---+---+---+
Leaving a little to puzzle for yourself, like 1) reading the file and naming your dataset / dataframe columns explicitly, this simulated approach with RDD should help you on your way:
val rdd = sc.parallelize(Seq(("111222333444"),
("555666777888")
)
)
val df = rdd.map(x => (x.slice(0,3), x.slice(3,6), x.slice(6,9), x.slice(9,12))).toDF()
df.show(false)
returns:
+---+---+---+---+
|_1 |_2 |_3 |_4 |
+---+---+---+---+
|111|222|333|444|
|555|666|777|888|
+---+---+---+---+
OR
using DF's:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(("111222333444"),
("555666777888"))
).toDF()
val df2 = df.withColumn("c1", expr("substring(value, 1, 3)")).withColumn("c2", expr("substring(value, 4, 3)")).withColumn("c3", expr("substring(value, 7, 3)")).withColumn("c4", expr("substring(value, 10, 3)"))
df2.show(false)
returns:
+------------+---+---+---+---+
|value |c1 |c2 |c3 |c4 |
+------------+---+---+---+---+
|111222333444|111|222|333|444|
|555666777888|555|666|777|888|
+------------+---+---+---+---+
you can drop the value, leave that up to you.
Like the answer above but gets complicated if not all 3 size chunks.
Your updated question for double times 100:
val df2 = df.withColumn("c1", expr("substring(value, 1, 3)")).withColumn("c2", expr("substring(value, 4, 3)")).withColumn("c3", expr("substring(value, 7, 3)"))
.withColumn("c4", (expr("substring(value, 10, 3)").cast("double") * 100))

Spark Aggregating multiple columns (possible to array) from join output

I've below datasets
Table1
Table2
Now I would like to get below dataset. I've tried with left outer join Table1.id == Table2.departmentid but, I am not getting the desired output.
Later, I need to use this table to get several counts and convert the data into an xml . I will be doing this convertion using map.
Any help would be appreciated.
Only joining is not enough to get the desired output. Probably You are missing something and last element of each nested array might be departmentid. Assuming the last element of nested array is departmentid, I've generated the output by the following way:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.collect_list
case class department(id: Integer, deptname: String)
case class employee(employeid:Integer, empname:String, departmentid:Integer)
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val department_df = Seq(department(1, "physics")
,department(2, "computer") ).toDF()
val emplyoee_df = Seq(employee(1, "A", 1)
,employee(2, "B", 1)
,employee(3, "C", 2)
,employee(4, "D", 2)).toDF()
val result = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
selectExpr("id", "deptname", "employeid", "empname").
rdd.map {
case Row(id:Integer, deptname:String, employeid:Integer, empname:String) => (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp").
groupBy("id", "deptname").
agg(collect_list("arrayemp").as("emplist")).
orderBy("id", "deptname")
The output looks like this:
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
Explanation: If i break down the last dataframe transformation into multiple steps, it'll probably make clear how the output is generated.
left outer join between department_df and employee_df
val df1 = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
selectExpr("id", "deptname", "employeid", "empname")
df1.show()
+---+--------+---------+-------+
| id|deptname|employeid|empname|
+---+--------+---------+-------+
| 1| physics| 2| B|
| 1| physics| 1| A|
| 2|computer| 4| D|
| 2|computer| 3| C|
+---+--------+---------+-------+
creating array using some column's values from the df1 dataframe
val df2 = df1.rdd.map {
case Row(id:Integer, deptname:String, employeid:Integer, empname:String) => (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp")
df2.show()
+---+--------+---------+
| id|deptname| arrayemp|
+---+--------+---------+
| 1| physics|[2, B, 1]|
| 1| physics|[1, A, 1]|
| 2|computer|[4, D, 2]|
| 2|computer|[3, C, 2]|
+---+--------+---------+
create new list aggregating multiple arrays using df2 dataframe
val result = df2.groupBy("id", "deptname").
agg(collect_list("arrayemp").as("emplist")).
orderBy("id", "deptname")
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq(
(1,"Physics"),
(2,"Computer"),
(3,"Maths")
)).toDF("ID","Dept")
val schema = List(
StructField("EMPID", IntegerType, true),
StructField("EMPNAME", StringType, true),
StructField("DeptID", IntegerType, true)
)
val data = Seq(
Row(1,"A",1),
Row(2,"B",1),
Row(3,"C",2),
Row(4,"D",2) ,
Row(5,"E",null)
)
val df_emp = spark.createDataFrame(
spark.sparkContext.parallelize(data),
StructType(schema)
)
val newdf = df_emp.withColumn("CONC",array($"EMPID",$"EMPNAME",$"DeptID")).groupBy($"DeptID").agg(expr("collect_list(CONC) as emplist"))
df.join(newdf,df.col("ID") === df_emp.col("DeptID")).select($"ID",$"Dept",$"emplist").show()
---+--------+--------------------+
| ID| Dept| listcol|
+---+--------+--------------------+
| 1| Physics|[[1, A, 1], [2, B...|
| 2|Computer|[[3, C, 2], [4, D...|

Flatten rows in sliding window using Spark

I'm processing a large number of rows from either a database or a file using Apache Spark. Part of the processing creates a sliding window of 3 rows where the rows need to flattened and additional calculations performed on the flattened rows. Below is a simplified example of what is trying to be done.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.expressions.Window
object Main extends App {
val ss = SparkSession.builder().appName("DataSet Test")
.master("local[*]").getOrCreate()
import ss.implicits._
case class Foo(a:Int, b:String )
// rows from database or file
val foos = Seq(Foo(-18, "Z"),
Foo(-11, "G"),
Foo(-8, "A"),
Foo(-4, "C"),
Foo(-1, "F")).toDS()
// work on 3 rows
val sliding_window_spec = Window.orderBy(desc("a")).rowsBetween( -2, 0)
// flattened object with example computations
case class FooResult(a1:Int, b1:String, a2:Int, b2:String, a3:Int, b3:String, computation1:Int, computation2:String )
// how to convert foo to fooResult???
// flatten 3 rows into 1 and do additional computations on flattened rows
// expected results
val fooResults = Seq(FooResult( -1, "F", -4, "C", -8, "A", -5, "FCA" ),
FooResult( -4, "C", -8, "A", -11, "G", -12, "CAG" ),
FooResult( -8, "A", -11, "G", -18, "Z", -19, "AGZ" )).toDS()
ss.stop()
}
How can I convert the foos into the fooResults? I'm using Apache Spark 2.3.0
// how to convert foo to fooResult???
// flatten 3 rows into 1 and do additional computations on flattened rows
You can simply use collect_list inbuilt function using the window function you've already defined and then by defining a udf function, you can do the computation part and flattening part. finally you can filter and expand the struct column to get your final desired result as
def slidingUdf = udf((list1: Seq[Int], list2:Seq[String])=> {
if(list1.size < 3) null
else {
val zipped = list1.zip(list2)
FooResult(zipped(0)._1, zipped(0)._2, zipped(1)._1, zipped(1)._2, zipped(2)._1, zipped(2)._2, zipped(0)._1+zipped(1)._1, zipped(0)._2+zipped(1)._2+zipped(2)._2)
}
})
foos.select(slidingUdf(collect_list("a").over(sliding_window_spec), collect_list("b").over(sliding_window_spec)).as("test"))
.filter(col("test").isNotNull)
.select(col("test.*"))
.show(false)
which should give you
+---+---+---+---+---+---+------------+------------+
|a1 |b1 |a2 |b2 |a3 |b3 |computation1|computation2|
+---+---+---+---+---+---+------------+------------+
|-1 |F |-4 |C |-8 |A |-5 |FCA |
|-4 |C |-8 |A |-11|G |-12 |CAG |
|-8 |A |-11|G |-18|Z |-19 |AGZ |
+---+---+---+---+---+---+------------+------------+
Note: Remember that the case classes should be defined outside the scope of the current session

pyspark. zip arrays in a dataframe

I have the following PySpark DataFrame:
+------+----------------+
| id| data |
+------+----------------+
| 1| [10, 11, 12]|
| 2| [20, 21, 22]|
| 3| [30, 31, 32]|
+------+----------------+
At the end, I want to have the following DataFrame
+--------+----------------------------------+
| id | data |
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+
I order to do this. First I extract the data arrays as follow:
tmp_array = df_test.select("data").rdd.flatMap(lambda x: x).collect()
a0 = tmp_array[0]
a1 = tmp_array[1]
a2 = tmp_array[2]
samples = zip(a0, a1, a2)
samples1 = sc.parallelize(samples)
In this way, I have in samples1 an RDD with the content
[[10,20,30],[11,21,31],[12,22,32]]
Question 1: Is that a good way to do it?
Question 2: How to include that RDD back into the dataframe?
Here is a way to get your desired output without serializing to rdd or using a udf. You will need two constants:
The number of rows in your DataFrame (df.count())
The length of data (given)
Use pyspark.sql.functions.collect_list() and pyspark.sql.functions.array() in a double list comprehension to pick out the elements of "data" in the order you want using pyspark.sql.Column.getItem():
import pyspark.sql.functions as f
dataLength = 3
numRows = df.count()
df.select(
f.collect_list("id").alias("id"),
f.array(
[
f.array(
[f.collect_list("data").getItem(j).getItem(i)
for j in range(numRows)]
)
for i in range(dataLength)
]
).alias("data")
)\
.show(truncate=False)
#+---------+------------------------------------------------------------------------------+
#|id |data |
#+---------+------------------------------------------------------------------------------+
#|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
#+---------+------------------------------------------------------------------------------+
You can simply use a udf function for the zip function but before that you will have to use collect_list function
from pyspark.sql import functions as f
from pyspark.sql import types as t
def zipUdf(array):
return zip(*array)
zipping = f.udf(zipUdf, t.ArrayType(t.ArrayType(t.IntegerType())))
df.select(
f.collect_list(df.id).alias('id'),
zipping(f.collect_list(df.data)).alias('data')
).show(truncate=False)
which would give you
+---------+------------------------------------------------------------------------------+
|id |data |
+---------+------------------------------------------------------------------------------+
|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
+---------+------------------------------------------------------------------------------+

SparkSQL — collect_set and sort_array does not sort integer column properly

I want to generate a sorted, collected set in SparkSQL, like so:
spark.sql("SELECT id, col_2, sort_array(collect_set(value)) AS collected
FROM my_table GROUP BY id, col_2").show()
where value is an integer.
But it fails to sort the array in proper numeric order — and does something rather ad hoc (sort on beginning of the first number in the value instead? Is sort_array operating on a string?).
So instead of:
+----+-------+------------+
| id | col_2 | collected |
+----+-------+------------+
| 1 | 2 | [456,1234]|
+----+-------+------------+
I get:
+----+-------+------------+
| id | col_2 | collected |
+----+-------+------------+
| 1 | 2 | [1234,456]|
+----+-------+------------+
EDIT:
Looking at what spark.sql(…) returns it is obvious that this query returns strings instead:
DataFrame[id: string, col_2: string, collected: array<string>]
How can that be when the original dataframe is all integers.
EDIT 2:
This seems to be a problem related to pyspark, as I'm not experiencing the problem with spark-shell and writing the same stuff in scala
I tested with Apache Spark 2.0.0.
It works for me. To make sure I tested with data [(1, 2, 1234), (1, 2, 456)] and [(1, 2, 456), (1, 2, 1234)]. The result is same.
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(1, 2, 1234), (1, 2, 456)], ['id', 'col_2', 'value'])
# test with reversed order, too
#df = sqlContext.createDataFrame([(1, 2, 456), (1, 2, 1234)], ['id', 'col_2', 'value'])
df.createOrReplaceTempView("my_table")
sqlContext.sql("SELECT id, col_2, sort_array(collect_set(value)) AS collected FROM my_table GROUP BY id, col_2").show()
Result
+---+-----+-----------+
| id|col_2| collected|
+---+-----+-----------+
| 1| 2|[456, 1234]|
+---+-----+-----------+
Some observations
when a value is None it appears as null e.g. [null, 456, 1234]
when there is a string value, Spark throws error "TypeError: Can not merge type LongType and StringType"
I think the problem is not the SQL but in the earlier steps where DataFrame was created.

Resources