Spark Aggregating multiple columns (possible to array) from join output - apache-spark

I have the datasets below:
Table1
Table2
Now I would like to get the dataset below. I've tried a left outer join on Table1.id == Table2.departmentid, but I am not getting the desired output.
Later, I need to use this table to get several counts and convert the data into XML. I will be doing this conversion using map.
Any help would be appreciated.

Joining alone is not enough to get the desired output. You are probably missing something, and the last element of each nested array looks like it should be departmentid. Assuming the last element of each nested array is departmentid, I've generated the output in the following way:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.collect_list
case class department(id: Integer, deptname: String)
case class employee(employeid:Integer, empname:String, departmentid:Integer)
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val department_df = Seq(
  department(1, "physics"),
  department(2, "computer")
).toDF()
val employee_df = Seq(
  employee(1, "A", 1),
  employee(2, "B", 1),
  employee(3, "C", 2),
  employee(4, "D", 2)
).toDF()
val result = department_df
  .join(employee_df, department_df("id") === employee_df("departmentid"), "left")
  .selectExpr("id", "deptname", "employeid", "empname")
  .rdd.map {
    case Row(id: Integer, deptname: String, employeid: Integer, empname: String) =>
      (id, deptname, Array(employeid.toString, empname, id.toString))
  }.toDF("id", "deptname", "arrayemp")
  .groupBy("id", "deptname")
  .agg(collect_list("arrayemp").as("emplist"))
  .orderBy("id", "deptname")
The output looks like this:
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
Explanation: If I break down the last dataframe transformation into multiple steps, it will probably become clear how the output is generated.
left outer join between department_df and employee_df
val df1 = department_df
  .join(employee_df, department_df("id") === employee_df("departmentid"), "left")
  .selectExpr("id", "deptname", "employeid", "empname")
df1.show()
+---+--------+---------+-------+
| id|deptname|employeid|empname|
+---+--------+---------+-------+
| 1| physics| 2| B|
| 1| physics| 1| A|
| 2|computer| 4| D|
| 2|computer| 3| C|
+---+--------+---------+-------+
creating an array from some columns' values of the df1 dataframe
val df2 = df1.rdd.map {
  case Row(id: Integer, deptname: String, employeid: Integer, empname: String) =>
    (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp")
df2.show()
+---+--------+---------+
| id|deptname| arrayemp|
+---+--------+---------+
| 1| physics|[2, B, 1]|
| 1| physics|[1, A, 1]|
| 2|computer|[4, D, 2]|
| 2|computer|[3, C, 2]|
+---+--------+---------+
creating a new list by aggregating multiple arrays from the df2 dataframe
val result = df2.groupBy("id", "deptname")
  .agg(collect_list("arrayemp").as("emplist"))
  .orderBy("id", "deptname")
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
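As a side note, the same result can probably be produced without the round trip through the RDD API. Here is a minimal sketch under the same assumptions (the department_df and employee_df defined above, Spark 2.x), where array builds each employee entry as a column expression and collect_list aggregates them:
import org.apache.spark.sql.functions.{array, col, collect_list}

// Build the nested array directly as a column expression; casting the integer
// columns to string keeps the array elements a single type, matching the
// Array[String] produced by the RDD version above.
val resultNoRdd = department_df
  .join(employee_df, department_df("id") === employee_df("departmentid"), "left")
  .withColumn("arrayemp",
    array(col("employeid").cast("string"), col("empname"), col("id").cast("string")))
  .groupBy("id", "deptname")
  .agg(collect_list("arrayemp").as("emplist"))
  .orderBy("id", "deptname")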

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq(
  (1, "Physics"),
  (2, "Computer"),
  (3, "Maths")
)).toDF("ID", "Dept")
val schema = List(
  StructField("EMPID", IntegerType, true),
  StructField("EMPNAME", StringType, true),
  StructField("DeptID", IntegerType, true)
)
val data = Seq(
  Row(1, "A", 1),
  Row(2, "B", 1),
  Row(3, "C", 2),
  Row(4, "D", 2),
  Row(5, "E", null)
)
val df_emp = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
val newdf = df_emp
  .withColumn("CONC", array($"EMPID", $"EMPNAME", $"DeptID"))
  .groupBy($"DeptID")
  .agg(expr("collect_list(CONC) as emplist"))
df.join(newdf, df.col("ID") === newdf.col("DeptID"))
  .select($"ID", $"Dept", $"emplist")
  .show()
+---+--------+--------------------+
| ID|    Dept|             emplist|
+---+--------+--------------------+
|  1| Physics|[[1, A, 1], [2, B...|
|  2|Computer|[[3, C, 2], [4, D...|
+---+--------+--------------------+
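One thing worth noting about this sketch (my observation, not part of the original answer): the default inner join drops department 3 (Maths), which has no employees, and employee E never matches because its DeptID is null. If every department should still appear, a left join keeps the unmatched ones with a null emplist:
// Hypothetical variation of the join above: keep departments without employees.
df.join(newdf, df.col("ID") === newdf.col("DeptID"), "left")
  .select($"ID", $"Dept", $"emplist")
  .show(false)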

Related

Efficient way to add UUID in pyspark [duplicate]

I have a DataFrame to which I want to add a column of distinct uuid4() values. My code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StringType
from uuid import uuid4
spark_session = SparkSession.builder.getOrCreate()
df = spark_session.createDataFrame([
[1, 1, 'teste'],
[2, 2, 'teste'],
[3, 0, 'teste'],
[4, 5, 'teste'],
],
list('abc'))
df = df.withColumn("_tmp", f.lit(1))
uuids = [str(uuid4()) for _ in range(df.count())]
df1 = spark_session.createDataFrame(uuids, StringType())
df1 = df1.withColumn("_tmp", f.lit(1))
df2 = df.join(df1, "_tmp", "inner").drop("_tmp")
df2.show()
But I've got this ERROR:
Py4JJavaError: An error occurred while calling o1571.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
I already tried with an alias and with monotonically_increasing_id as the join column, but I see
here that I cannot trust monotonically_increasing_id as a merge column.
I'm expecting:
+---+---+-----+------+
| a| b| c| value|
+---+---+-----+------+
| 1| 1|teste| uuid4|
| 2| 2|teste| uuid4|
| 3| 0|teste| uuid4|
| 4| 5|teste| uuid4|
+---+---+-----+------+
what's the correct approach in this case?
I used row_number as #Tetlanesh suggested. I had to create an ID column to ensure that row_number counts every row of the Window.
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from uuid import uuid4
from pyspark.sql.window import Window
from pyspark.sql.types import StringType
from pyspark.sql.functions import row_number
spark_session = SparkSession.builder.getOrCreate()
df = spark_session.createDataFrame([
[1, 1, 'teste'],
[1, 2, 'teste'],
[2, 0, 'teste'],
[2, 5, 'teste'],
],
list('abc'))
df = df.alias("_tmp")
df.registerTempTable("_tmp")
df2 = spark_session.sql("select *, uuid() as uuid from _tmp")
df2.show()
Another approach is using windows, but it's not as efficient as the first one:
df = df.withColumn("_id", f.lit(1))
df = df.withColumn("_tmp", row_number().over(Window.orderBy('_id')))
uuids = [(str(uuid4()), 1) for _ in range(df.count())]
df1 = spark_session.createDataFrame(uuids, ['uuid', '_id'])
df1 = df1.withColumn("_tmp", row_number().over(Window.orderBy('_id')))
df2 = df.join(df1, "_tmp", "inner").drop('_id').drop('_tmp')
df2.show()
Both approaches output:
+---+---+-----+------+
| a| b| c| uuid|
+---+---+-----+------+
| 1| 1|teste| uuid4|
| 2| 2|teste| uuid4|
| 3| 0|teste| uuid4|
| 4| 5|teste| uuid4|
+---+---+-----+------+

How do I replace a full stop with a zero in PySpark?

I am trying to replace a full stop in my raw data with the value 0 in PySpark.
I tried to use a .when and .otherwise statement.
I tried to use regexp_replace to change the '.' to 0.
Code tried:
from pyspark.sql import functions as F
#For #1 above:
dataframe2 = dataframe1.withColumn("test_col", F.when(((F.col("test_col") == F.lit(".")), 0).otherwise(F.col("test_col")))
#For #2 above:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', 0))
Instead of "." it should rewrite the column with numbers only (i.e. there is a number in non full stop rows, otherwise, it's a full stop that should be replaced with 0).
PySpark version:
from pyspark.sql import SparkSession
from pyspark.sql.types import (StringType, IntegerType, StructField, StructType)
from pyspark.sql import functions
column_schema = StructType([StructField("num", IntegerType()), StructField("text", StringType())])
data = [[3, 'r1'], [9, 'r2.'], [27, '.']]
spark = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("spark.executor.memory", '1g')
spark.conf.set('spark.executor.cores', '1')
spark.conf.set('spark.cores.max', '2')
spark.conf.set("spark.driver.memory", '1g')
spark_context = spark.sparkContext
data_frame = spark.createDataFrame(data, schema=column_schema)
data_frame.show()
filtered_data_frame = data_frame.withColumn(
    'num', functions.when(data_frame['num'] == 3, -3).otherwise(data_frame['num']))
filtered_data_frame.show()
filtered_data_frame = data_frame.withColumn(
    'text', functions.when(data_frame['text'] == '.', '0').otherwise(data_frame['text']))
filtered_data_frame.show()
output
+---+----+
|num|text|
+---+----+
| 3| r1|
| 9| r2.|
| 27| .|
+---+----+
+---+----+
|num|text|
+---+----+
| -3| r1|
| 9| r2.|
| 27| .|
+---+----+
+---+----+
|num|text|
+---+----+
| 3| r1|
| 9| r2.|
| 27| 0|
+---+----+
The equivalent Scala sample code does the query properly:
package otz.scalaspark
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
object ValueReplacement {
  def main(args: Array[String]) {
    val sparkConfig = new SparkConf()
      .setAppName("Value-Replacement")
      .setMaster("local[*]")
      .set("spark.executor.memory", "1g")
    val sparkContext = new SparkContext(sparkConfig)
    val someData = Seq(
      Row(3, "r1"),
      Row(9, "r2"),
      Row(27, "r3"),
      Row(81, "r4")
    )
    val someSchema = List(
      StructField("number", IntegerType, true),
      StructField("word", StringType, true)
    )
    val sqlContext = new SQLContext(sparkContext)
    val dataFrame = sqlContext.createDataFrame(
      sparkContext.parallelize(someData),
      StructType(someSchema)
    )
    val filteredDataFrame = dataFrame.withColumn("number", when(col("number") === 3, -3).otherwise(col("number")))
    filteredDataFrame.show()
  }
}
output
+------+----+
|number|word|
+------+----+
| -3| r1|
| 9| r2|
| 27| r3|
| 81| r4|
+------+----+
Your attempt #2 was almost correct. If you have a dataframe1 like:
+--------+
|test_col|
+--------+
| 1.0|
| 2.0|
| 2|
+--------+
Your attempt would be yielding:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', 0))
dataframe2.show()
+--------+
|test_col|
+--------+
| 000|
| 000|
| 0|
+--------+
Here the . in the regex matches any character, so every character gets replaced, not just the '.'.
However, if you add an escape (\) before the dot, things work as expected:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '\.', '0'))
dataframe2.show()
+--------+
|test_col|
+--------+
| 100|
| 200|
| 2|
+--------+

spark dynamically create struct/json per group

I have a spark dataframe like
+-----+---+---+---+------+
|group| a| b| c|config|
+-----+---+---+---+------+
| a| 1| 2| 3| [a]|
| b| 2| 3| 4|[a, b]|
+-----+---+---+---+------+
val df = Seq(("a", 1, 2, 3, Seq("a")),("b", 2, 3,4, Seq("a", "b"))).toDF("group", "a", "b","c", "config")
How can I add an additional column i.e.
df.withColumn("select_by_config", <<>>).show
as a struct or JSON which combines a number of columns (specified by config) into something similar to a Hive named struct / Spark struct / JSON column? Note that this struct is specific per group and not constant for the whole dataframe; it is specified in the config column.
I can imagine that a df.map could do the trick, but the serialization overhead does not seem efficient. How can this be achieved via SQL-only expressions? Maybe as a map-type column?
edit
a possible but really clumsy solution for 2.2 is:
val df = Seq((1,"a", 1, 2, 3, Seq("a")),(2, "b", 2, 3,4, Seq("a", "b"))).toDF("id", "group", "a", "b","c", "config")
df.show
import spark.implicits._
final case class Foo(id:Int, c1:Int, specific:Map[String, Int])
df.map(r => {
  val config = r.getAs[Seq[String]]("config")
  print(config)
  val others = config.map(elem => (elem, r.getAs[Int](elem))).toMap
  Foo(r.getAs[Int]("id"), r.getAs[Int]("c"), others)
}).show
are there any better ways to solve the problem for 2.2?
If you use a recent build (Spark 2.4.0 RC 1 or later), a combination of higher-order functions should do the trick. Create a map of columns:
import org.apache.spark.sql.functions.{
array, col, expr, lit, map_from_arrays, map_from_entries
}
val cols = Seq("a", "b", "c")
val dfm = df.withColumn(
"cmap",
map_from_arrays(array(cols map lit: _*), array(cols map col: _*))
)
and transform the config:
dfm.withColumn(
"config_mapped",
map_from_entries(expr("transform(config, k -> struct(k, cmap[k]))"))
).show
// +-----+---+---+---+------+--------------------+----------------+
// |group| a| b| c|config| cmap| config_mapped|
// +-----+---+---+---+------+--------------------+----------------+
// | a| 1| 2| 3| [a]|[a -> 1, b -> 2, ...| [a -> 1]|
// | b| 2| 3| 4|[a, b]|[a -> 2, b -> 3, ...|[a -> 2, b -> 3]|
// +-----+---+---+---+------+--------------------+----------------+
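If the final shape should be a JSON string rather than a map column, to_json can serialize the mapped result. A small follow-up sketch on top of the snippet above (to_json accepts map input in the Spark versions this answer targets):
import org.apache.spark.sql.functions.to_json

// Serialize the per-row map of selected columns to a JSON string.
dfm.withColumn(
  "config_json",
  to_json(map_from_entries(expr("transform(config, k -> struct(k, cmap[k]))")))
).select("group", "config_json").show(false)
// roughly: {"a":1} for group a and {"a":2,"b":3} for group b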

Using Spark find completeness from multiple datasets

I have a requirement to verify certain data present in multiple datasets (or CSVs) using Spark 2. This is defined as matching two or more keys across all datasets and generating a report of all matching and non-matching keys from all datasets.
For example, there are four datasets, and each dataset shares two matching keys with the others. This means all the datasets need to be matched on the two defined matching keys. Let's say userId and userName are the two matching keys in the datasets below:
Dataset A: userId,userName,age,contactNumber
Dataset B: orderId,orderDetails,userId,userName
Dataset C: departmentId,userId,userName,departmentName
Dataset D: userId,userName,address,pin
Dataset A:
userId,userName,age,contactNumber
1,James,29,1111
2,Ferry,32,2222
3,Lotus,21,3333
Dataset B:
orderId,orderDetails,userId,userName
DF23,Chocholate,1,James
DF45,Gifts,3,Lotus
Dataset C:
departmentId,userId,userName,departmentName
N99,1,James,DE
N100,2,Ferry,AI
Dataset D:
userId,userName,address,pin
1,James,minland street,cvk-dfg
I need to generate a report (or a similar one) like:
------------------------------------------
userId,userName,status
------------------------------------------
1,James,MATCH
2,Ferry,MISSING IN B, MISSING IN D
3,Lotus,MISSING IN B, MISSING IN C, MISSING IN D
I have tried joining the datasets as follows:
DatsetA-B:
userId,userName,age,contactNumber,orderId,orderDetails,userId,userName,status
1,James,29,1111,DF23,Chocholate,1,James,MATCH
2,Ferry,32,2222,,,,,Missing IN Left
3,Lotus,21,3333,DF45,Gifts,3,Lotus,MATCH
DatsetC-D:
departmentId,userId,userName,departmentName,userId,userName,address,pin,status
N99,1,James,DE,1,James,minland street,cvk-dfg,MATCH
N100,2,Ferry,AI,,,,,Missing IN Right
DatsetAB-CD:
Joining criteria: userId and userName of A with C, userId and userName of B with D
userId,userName,age,contactNumber,orderId,orderDetails,userId,userName,status,departmentId,userId,userName,departmentName,userId,userName,address,pin,status,status
1,James,29,1111,DF23,Chocholate,1,James,MATCH,N99,1,James,DE,1,James,minland street,cvk-dfg,MATCH,MATCH
2,Ferry,32,2222,,,,,Missing IN Left,N100,2,Ferry,AI,,,,,Missing IN Right,Missing IN Right
No row is produced for userId 3.
If data is defined as:
val dfA = Seq((1, "James", 29, 1111), (2, "Ferry", 32, 2222),(3, "Lotus", 21, 3333)).toDF("userId,userName,age,contactNumber".split(","): _*)
val dfB = Seq(("DF23", "Chocholate", 1, "James"), ("DF45", "Gifts", 3, "Lotus")).toDF("orderId,orderDetails,userId,userName".split(","): _*)
val dfC = Seq(("N99", 1, "James", "DE"), ("N100", 2, "Ferry", "AI")).toDF("departmentId,userId,userName,departmentName".split(","): _*)
val dfD = Seq((1, "James", "minland street", "cvk-dfg")).toDF("userId,userName,address,pin".split(","): _*)
Define keys:
import org.apache.spark.sql.functions._
val keys = Seq("userId", "userName")
Combine Datasets:
val dfs = Map("A" -> dfA, "B" -> dfB, "C" -> dfC, "D" -> dfD)
val combined = dfs.map {
case (key, df) => df.withColumn("df", lit(key)).select("df", keys: _*)
}.reduce(_ unionByName _)
Pivot and convert result to booleans:
val pivoted = combined
.groupBy(keys.head, keys.tail: _*)
.pivot("df", dfs.keys.toSeq)
.count()
val result = dfs.keys.foldLeft(pivoted)(
(df, c) => df.withColumn(c, col(c).isNotNull.alias(c))
)
// +------+--------+----+-----+-----+-----+
// |userId|userName| A| B| C| D|
// +------+--------+----+-----+-----+-----+
// | 1| James|true| true| true| true|
// | 3| Lotus|true| true|false|false|
// | 2| Ferry|true|false| true|false|
// +------+--------+----+-----+-----+-----+
Use the resulting boolean matrix to generate the final report, for example as sketched below.
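One possible way to derive the status column from that matrix (my sketch, not part of the original answer; it relies on concat_ws skipping null values):
import org.apache.spark.sql.functions.{col, concat_ws, length, lit, when}

// One "MISSING IN X" fragment per dataset column (null when the key is present),
// joined with concat_ws; fall back to MATCH when nothing is missing.
val missing = dfs.keys.toSeq.sorted.map(c => when(col(c) === false, lit(s"MISSING IN $c")))
val report = result.withColumn(
  "status",
  when(length(concat_ws(", ", missing: _*)) === 0, lit("MATCH"))
    .otherwise(concat_ws(", ", missing: _*))
).select("userId", "userName", "status")
report.orderBy("userId").show(false)
// e.g. 2,Ferry -> MISSING IN B, MISSING IN D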
The exact approach above can become pretty expensive when the datasets grow large. If you know that one dataset contains all possible keys, and you don't require exact results (some false negatives are acceptable), you can use a Bloom filter.
Here we can use dfA as a reference:
val expectedNumItems = dfA.count
val fpp = 0.00001
val key = struct(keys map col: _*).cast("string").alias("key")
val filters = dfs.filterKeys(_ != "A").mapValues(df => {
  val f = df.select(key).stat.bloomFilter("key", expectedNumItems, fpp)
  udf((s: String) => f.mightContain(s))
})
filters.foldLeft(dfA.select(keys map col: _*)) {
  case (df, (c, f)) => df.withColumn(c, f(key))
}.show
// +------+--------+-----+-----+-----+
// |userId|userName| B| C| D|
// +------+--------+-----+-----+-----+
// | 1| James| true| true| true|
// | 2| Ferry|false| true|false|
// | 3| Lotus| true|false|false|
// +------+--------+-----+-----+-----+

How to change case of whole pyspark dataframe to lower or upper

I am trying to apply the PySpark SQL hash function to every row in two dataframes to identify the differences. The hash algorithm is case sensitive, i.e. 'APPLE' and 'Apple' are considered two different values, so I want to change the case of both dataframes to either upper or lower. I am able to achieve this only for the dataframe headers, not for the dataframe values. Please help.
#Code for Dataframe column headers
self.df_db1 =self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
Assuming df is your dataframe, this should do the work:
from pyspark.sql import functions as F
for col in df.columns:
    df = df.withColumn(col, F.lower(F.col(col)))
Both answers seem to be OK, with one exception: if you have a numeric column, it will be converted to a string column. To avoid this, try:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val fields = df.schema.fields
val stringFields = df.schema.fields.filter(f => f.dataType == StringType)
val nonStringFields = df.schema.fields.filter(f => f.dataType != StringType).map(f => f.name).map(f => col(f))
val stringFieldsTransformed = stringFields.map(f => f.name).map(f => upper(col(f)).as(f))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)
Now the types are also correct when you have non-string fields (i.e. numeric fields).
If you know that each column is of String type, use one of the other answers - they are correct in that case :)
Python code in PySpark:
from pyspark.sql.functions import *
from pyspark.sql.types import *
sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])
fields = sourceDF.schema.fields
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))
stringFieldsTransformed = map(lambda f: upper(col(f.name)), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
You can generate an expression using a list comprehension:
from pyspark.sql import functions as psf
expression = [ psf.lower(psf.col(x)).alias(x) for x in df.columns ]
And then just call it on your existing dataframe:
>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
>>> df.select(*expression).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
