How to use window specification and join condition per column values? - apache-spark

Here is my DF1
OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber|^|FFAction
4295858898|^|204|^|205|^|1|^|I|!|
4295858898|^|204|^|208|^|2|^|I|!|
4295858898|^|204|^|209|^|2|^|I|!|
4295858898|^|204|^|211|^|3|^|I|!|
4295858898|^|204|^|212|^|3|^|I|!|
4295858898|^|204|^|214|^|4|^|I|!|
4295858898|^|204|^|215|^|4|^|I|!|
4295858898|^|206|^|207|^|1|^|I|!|
4295858898|^|206|^|210|^|2|^|I|!|
4295858898|^|206|^|213|^|3|^|I|!|
Here is my DF2
DataPartition|^|PartitionYear|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber|^|FFAction|!|
SelfSourcedPublic|^|2002|^|1511224917595|^|4295858941|^|24|^|25|^|4|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917596|^|4295858941|^|24|^|25|^|4|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917597|^|4295858941|^|30|^|31|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917598|^|4295858941|^|30|^|31|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917599|^|4295858941|^|30|^|32|^|1|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917600|^|4295858941|^|30|^|32|^|1|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917601|^|4295858941|^|24|^|33|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917602|^|4295858941|^|24|^|33|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917603|^|4295858941|^|24|^|34|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917604|^|4295858941|^|24|^|34|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917605|^|4295858941|^|1|^|2|^|4|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917606|^|4295858941|^|1|^|3|^|4|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917607|^|4295858941|^|5|^|6|^|4|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917608|^|4295858941|^|5|^|7|^|4|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917609|^|4295858941|^|12|^|10|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917610|^|4295858941|^|12|^|11|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917611|^|4295858941|^|1|^|13|^|1|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917612|^|4295858941|^|12|^|14|^|1|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917613|^|4295858941|^|5|^|15|^|3|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917614|^|4295858941|^|5|^|16|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917615|^|4295858941|^|1|^|17|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917616|^|4295858941|^|1|^|18|^|3|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917617|^|4295858941|^|5|^|19|^|1|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917618|^|4295858941|^|5|^|20|^|2|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917619|^|4295858941|^|5|^|21|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917620|^|4295858941|^|1|^|22|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917621|^|4295858941|^|1|^|23|^|2|^|O|!|
SelfSourcedPublic|^|2016|^|1511224917622|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1511224917642|^|4295858941|^|null|^|35|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1511224917643|^|4295858941|^|null|^|36|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1511224917644|^|4295858941|^|null|^|37|^|null|^|D|!|
I want to choose the join based on the value of a column.
This is what I am trying to achieve in Spark/Scala, but I don't know how to implement it.
If FFAction_1 = I in DF2, then apply the condition below
(join and partitionBy on the three columns "OrganizationId", "AnnualPeriodId", "InterimPeriodId"):
val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("OrganizationId","AnnualPeriodId","InterimPeriodId"), "outer")
.select($"OrganizationId", $"AnnualPeriodId",$"InterimPeriodId",
when($"FFAction_1".isNotNull, concat(col("FFAction_1"),
lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
If FFAction_1 = O or D, then apply the condition below
(join and partitionBy on the two columns "OrganizationId", "InterimPeriodId"):
val windowSpec = Window.partitionBy("OrganizationId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("OrganizationId","AnnualPeriodId","InterimPeriodId"), "outer")
.select($"OrganizationId", $"AnnualPeriodId",$"InterimPeriodId",
when($"FFAction_1".isNotNull, concat(col("FFAction_1"),
lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
Below is my full code
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val get_cus_YearPartition = spark.udf.register("get_cus_YearPartition", (filePath: String) => filePath.split("\\.")(4))
val rdd = sc.textFile("s3://trfsmallfffile/Interim2Annual/MAIN")
val header = rdd.filter(_.contains("OrganizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", "."), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)
val df1resultFinal=data.withColumn("DataPartition", get_cus_val(input_file_name))
val df1resultFinalWithYear=df1resultFinal.withColumn("PartitionYear", get_cus_YearPartition(input_file_name))
//Loading Incremental
val rdd1 = sc.textFile("s3://trfsmallfffile/Interim2Annual/INCR")
val header1 = rdd1.filter(_.contains("OrganizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data1 = sqlContext.createDataFrame(rdd1.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema1)
//------------------------------- filtering only the latest from incremental ------------------------------
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("OrganizationId","AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey1 = data1.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank")
val windowSpec2 = Window.partitionBy("OrganizationId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = latestForEachKey1.withColumn("tobefiltered", first("FFAction|!|").over(windowSpec2))
.filter($"tobefiltered" === "I|!|" || $"tobefiltered" === "O|!|" || ($"tobefiltered" === "D|!|" && $"FFAction|!|" === "D|!|"))
.drop("tobefiltered", "TimeStamp")
//-----------------separating the incremental df for insert, deletion and overwrite----------------
//---------------insert rows are selected -------------------------------
//insert a row if I is detected; if O is found, first delete and then insert
val insertdf = latestForEachKey.filter($"FFAction|!|" === "I|!|" || $"FFAction|!|" === "O|!|").select(df1resultFinalWithYear.schema.fieldNames.map(col):_*)
//------------------deleted rows with primary key "OrganizationId", "InterimPeriodId"------------------
// delete rows from the parent if either D or O is found in the incremental
val deletedf = latestForEachKey.filter($"FFAction|!|" === "D|!|" || $"FFAction|!|" === "O|!|").select($"OrganizationId", $"InterimPeriodId", lit("delete").as("Delete"))
//join by two primary keys for deletion and delete from the parent dataframe
val dfMainOutput = df1resultFinalWithYear.join(deletedf, Seq("OrganizationId", "InterimPeriodId"), "left").filter($"Delete".isNull).drop("Delete")
val dfToSave=dfMainOutput.union(insertdf).withColumn("FFAction|!|", when($"FFAction|!|" === "O|!|" || $"FFAction|!|" === "I|!|", lit("I|!|")))
val dfMainOutputFinal = dfToSave.na.fill("").select($"DataPartition", $"PartitionYear",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").filter(_ != "PartitionYear").map(c => col(c)): _*).as("concatenated"))
val headerColumn = dataHeader.columns.toSeq
val header = headerColumn.mkString("", "|^|", "|!|").dropRight(3)
val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "null", "")).withColumnRenamed("concatenated", header)
dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition","PartitionYear")
.format("csv")
.option("nullValue", "")
.option("delimiter", "\t")
.option("quote", "\u0000")
.option("header", "true")
.option("codec", "gzip")
.save("s3://trfsmallfffile/Interim2Annual/output")
val FFRowCount =dfMainOutputFinalWithoutNull.groupBy("DataPartition","PartitionYear").count
FFRowCount.coalesce(1).write.format("com.databricks.spark.xml")
.option("rootTag", "FFFileType")
.option("rowTag", "FFPhysicalFile")
.save("s3://trfsmallfffile/Interim2Annual/Descr")

DISCLAIMER Somehow this and the other question I've just answered seem to be duplicates, so one is going to get marked as such soon, or we'll find out the difference between them and the disclaimer goes away. Time will tell.
Given the requirement to select the final window specification and join condition based on the values of the FFAction_1 column, I'd filter first and decide which window aggregation and join to use.
val df1 = spark.
  read.
  option("header", true).
  option("sep", "|").
  csv("df1.csv").
  select("OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber", "FFAction")
scala> df1.show
+--------------+--------------+---------------+-------------+--------+
|OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber|FFAction|
+--------------+--------------+---------------+-------------+--------+
|    4295858898|           204|            205|            1|       I|
|    4295858898|           204|            208|            2|       I|
|    4295858898|           204|            209|            2|       I|
|    4295858898|           204|            211|            3|       I|
|    4295858898|           204|            212|            3|       I|
|    4295858898|           204|            214|            4|       I|
|    4295858898|           204|            215|            4|       I|
|    4295858898|           206|            207|            1|       I|
|    4295858898|           206|            210|            2|       I|
|    4295858898|           206|            213|            3|       I|
+--------------+--------------+---------------+-------------+--------+
The right-hand side of the join is fairly similar in "shape".
val df2 = spark.
  read.
  option("header", true).
  option("sep", "|").
  csv("df2.csv").
  select("DataPartition_1", "PartitionYear_1", "TimeStamp", "OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber_1", "FFAction_1")
scala> df2.show
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
|  DataPartition_1|PartitionYear_1|    TimeStamp|OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber_1|FFAction_1|
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
|SelfSourcedPublic|           2002|1510725106270|    4295858941|            24|             25|              4|         O|
|SelfSourcedPublic|           2002|1510725106271|    4295858941|            24|             25|              5|         O|
|SelfSourcedPublic|           2003|1510725106272|    4295858941|            30|             31|              2|         O|
|SelfSourcedPublic|           2003|1510725106273|    4295858941|            30|             31|              3|         O|
|SelfSourcedPublic|           2001|1510725106293|    4295858941|             5|             20|              2|         O|
|SelfSourcedPublic|           2001|1510725106294|    4295858941|             5|             21|              3|         O|
|SelfSourcedPublic|           2002|1510725106295|    4295858941|             1|             22|              4|         O|
|SelfSourcedPublic|           2002|1510725106296|    4295858941|             1|             23|              5|         O|
|SelfSourcedPublic|           2016|1510725106297|    4295858941|            35|             36|              1|         I|
|SelfSourcedPublic|           2016|1510725106297|    4295858941|            35|             36|              1|         D|
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
With the above datasets, I'd check whether there is at least one I in df2's FFAction_1 column and select the corresponding window specification and join condition.
The trick is to use the join operator followed by the where (or filter) operator so you can decide which join condition to use.
val noIs = df2.filter($"FFAction_1" === "I").take(1).isEmpty
val (windowSpec, joinCond) = if (noIs) {
  (windowSpecForOs, joinForOs)
} else {
  (windowSpecForIs, joinForIs)
}
val latestForEachKey = df2result.withColumn("rank", rank() over windowSpec)
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey).where(joinCond)
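For completeness, here is a minimal self-contained variant of the idea above, reusing df1 and df2 as loaded earlier. The keysForIs, keysForOs and windowFor helpers are illustrative, not part of the original answer, and it uses the column-name form of join so the join keys and the window partitioning keys stay in sync:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType

// three key columns when FFAction_1 = I, two when it is O or D
val keysForIs = Seq("OrganizationId", "AnnualPeriodId", "InterimPeriodId")
val keysForOs = Seq("OrganizationId", "InterimPeriodId")

def windowFor(keys: Seq[String]) = Window
  .partitionBy(keys.map(col): _*)
  .orderBy($"TimeStamp".cast(LongType).desc)

val noIs = df2.filter($"FFAction_1" === "I").take(1).isEmpty
val keys = if (noIs) keysForOs else keysForIs

// keep only the latest record per key, then join on the very same key columns
val latestForEachKey = df2
  .withColumn("rank", rank().over(windowFor(keys)))
  .filter($"rank" === 1)
  .drop("rank", "TimeStamp")
val dfMainOutput = df1.join(latestForEachKey, keys, "outer")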

Related

Second highest value by department using Apache Spark DataFrame

I have a table with department and value columns. If we use a SQL query to get the second-highest value for each department, we would write:
select * from (SELECT department, asset_value, DENSE_RANK() over (partition by DEPARTMENT order by ASSET_VALUE desc) as x from [User1].[dbo].[UsersRecord]) y where x = 2;
But in Apache Spark, if we are restricted from using Spark SQL and have to achieve the output using the DataFrame API only, how should we write the Scala code?
I tried with the help of the documentation but could not figure it out.
My Code:
package com.tg.testpack1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object class2 {
  def main(args: Array[String]) {
    val sparksessionobject = SparkSession.builder()
      .master("local")
      .config("spark.sql.warehouse.dir", "C:/Users/ox/spark/spark/spark-warehouse")
      .getOrCreate()
    sparksessionobject.conf.set("spark.sql.shuffle.partitions", 4)
    sparksessionobject.conf.set("spark.executor.memory", "2g")
    import sparksessionobject.sqlContext.implicits._
    val df = Seq(("DEPT1", 1000), ("DEPT1", 500), ("DEPT1", 700), ("DEPT2", 400), ("DEPT2", 200), ("DEPT3", 500), ("DEPT3", 200))
      .toDF("department", "assetValue")
    df.withColumn("col3", dense_rank().over(Window.partitionBy($"department").orderBy($"assetValue".desc))).show
  }
}
I am getting the output:
+-------------+----------+----+
|accountNumber|assetValue|col3|
+-------------+----------+----+
|        DEPT3|       500|   1|
|        DEPT3|       200|   2|
|        DEPT1|      1000|   1|
|        DEPT1|       700|   2|
|        DEPT1|       500|   3|
|        DEPT2|       400|   1|
|        DEPT2|       200|   2|
+-------------+----------+----+
I am expecting the output:
+-------------+----------+----+
|accountNumber|assetValue|col3|
+-------------+----------+----+
|        DEPT1|       700|   2|
|        DEPT2|       200|   2|
|        DEPT3|       200|   2|
+-------------+----------+----+
Modify your code as below:
val byDeptOrderByAssetDesc = Window
  .partitionBy($"department")
  .orderBy($"assetValue".desc)
df.withColumn("col3", dense_rank() over byDeptOrderByAssetDesc)
  .filter("col3 = 2")

How do I replace a full stop with a zero in PySpark?

I am trying to replace a full stop in my raw data with the value 0 in PySpark.
I tried to use a .when and .otherwise statement.
I tried to use regexp_replace to change the '.' to 0.
Code tried:
from pyspark.sql import functions as F
#For #1 above:
dataframe2 = dataframe1.withColumn("test_col", F.when(((F.col("test_col") == F.lit(".")), 0).otherwise(F.col("test_col")))
#For #2 above:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', 0))
Instead of "." it should rewrite the column with numbers only (i.e. there is a number in non full stop rows, otherwise, it's a full stop that should be replaced with 0).
PySpark version
from pyspark.sql import SparkSession
from pyspark.sql.types import (StringType, IntegerType, StructField, StructType)
from pyspark.sql import functions
column_schema = StructType([StructField("num", IntegerType()), StructField("text", StringType())])
data = [[3, 'r1'], [9, 'r2.'], [27, '.']]
spark = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("spark.executor.memory", '1g')
spark.conf.set('spark.executor.cores', '1')
spark.conf.set('spark.cores.max', '2')
spark.conf.set("spark.driver.memory", '1g')
spark_context = spark.sparkContext
data_frame = spark.createDataFrame(data, schema=column_schema)
data_frame.show()
filtered_data_frame = data_frame.withColumn('num',
    functions.when(data_frame['num'] == 3, -3).otherwise(data_frame['num']))
filtered_data_frame.show()
filtered_data_frame = data_frame.withColumn('text',
    functions.when(data_frame['text'] == '.', '0').otherwise(data_frame['text']))
filtered_data_frame.show()
output
+---+----+
|num|text|
+---+----+
|  3|  r1|
|  9| r2.|
| 27|   .|
+---+----+
+---+----+
|num|text|
+---+----+
| -3|  r1|
|  9| r2.|
| 27|   .|
+---+----+
+---+----+
|num|text|
+---+----+
|  3|  r1|
|  9| r2.|
| 27|   0|
+---+----+
This sample code does the query properly:
package otz.scalaspark
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
object ValueReplacement {
  def main(args: Array[String]) {
    val sparkConfig = new SparkConf().setAppName("Value-Replacement").setMaster("local[*]").set("spark.executor.memory", "1g")
    val sparkContext = new SparkContext(sparkConfig)
    val someData = Seq(
      Row(3, "r1"),
      Row(9, "r2"),
      Row(27, "r3"),
      Row(81, "r4")
    )
    val someSchema = List(
      StructField("number", IntegerType, true),
      StructField("word", StringType, true)
    )
    val sqlContext = new SQLContext(sparkContext)
    val dataFrame = sqlContext.createDataFrame(
      sparkContext.parallelize(someData),
      StructType(someSchema)
    )
    val filteredDataFrame = dataFrame.withColumn("number", when(col("number") === 3, -3).otherwise(col("number")))
    filteredDataFrame.show()
  }
}
output
+------+----+
|number|word|
+------+----+
|    -3|  r1|
|     9|  r2|
|    27|  r3|
|    81|  r4|
+------+----+
Your attempt #2 was almost correct. If you have a dataframe1 like:
+--------+
|test_col|
+--------+
|     1.0|
|     2.0|
|       2|
+--------+
Then your attempt would yield:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', 0))
dataframe2.show()
+--------+
|test_col|
+--------+
|     000|
|     000|
|       0|
+--------+
Here the . matches any character, so every character gets replaced, not just the '.'.
However, if you add an escape (\) before the dot, then things work fine.
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '\.', '0'))
dataframe2.show()
+--------+
|test_col|
+--------+
|     100|
|     200|
|       2|
+--------+
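The same escaping applies if you do this from the Scala API; a minimal sketch, reusing the dataframe1/test_col names from this question purely for illustration:
import org.apache.spark.sql.functions.{col, regexp_replace}

// "\\." in a Scala string literal is the regex \., i.e. a literal dot
val dataframe2 = dataframe1.withColumn("test_col",
  regexp_replace(col("test_col"), "\\.", "0"))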

Issue with adding Column to a DataFrame

The code below fails with an AnalysisException (sc.version returns String = 1.6.0):
case class Person(name: String, age: Long)
val caseClassDF = Seq(Person("Andy", 32)).toDF()
caseClassDF.count()
val seq = Seq(1)
val rdd = sqlContext.sparkContext.parallelize(seq)
val df2 = rdd.toDF("Counts")
df2.count()
val withCounts = caseClassDF.withColumn("duration", df2("Counts"))
For some reason, it works with UDF:
import org.apache.spark.sql.functions.udf
case class Person(name: String, age: Long, day: Int)
val caseClassDF = Seq(Person("Andy", 32, 1), Person("Raman", 22, 1), Person("Rajan", 40, 1), Person("Andy", 42, 2), Person("Raman", 42, 2), Person("Rajan", 50, 2)).toDF()
val calculateCounts = udf((x: Long, y: Int) => x + y)
val df1 = caseClassDF.withColumn("Counts", calculateCounts($"age", $"day"))
df1.show
+-----+---+---+------+
| name|age|day|Counts|
+-----+---+---+------+
| Andy| 32|  1|    33|
|Raman| 22|  1|    23|
|Rajan| 40|  1|    41|
| Andy| 42|  2|    44|
|Raman| 42|  2|    44|
|Rajan| 50|  2|    52|
+-----+---+---+------+
caseClassDF.withColumn("duration", df2("Counts")), Here the column should be of the same dataframe (in your case caseClassDF). AFAIK, Spark does not allow column of a different DataFrame in withColumn.
PS: I am a user of Spark 1.6.x, not sure whether this has come up in Spark 2.x
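If the goal is simply to attach the single value held in df2 to every row of caseClassDF, one possible workaround (a sketch, not part of the original answer) is to collect that one value and add it as a literal column:
import org.apache.spark.sql.functions.lit

// df2 holds exactly one row, so pull the value out and add it as a literal column
val counts = df2.first().getInt(0)
val withCounts = caseClassDF.withColumn("duration", lit(counts))
withCounts.show()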

How to aggregate one hot encoded values in pipeline?

I am able to use string indexers and one-hot encoders to create the features column on the far right. Notice how id 1 has multiple rows. I was wondering how to aggregate the sparse vectors in features, using a pipeline or some other alternative, so that the feature for id 1 = (7,[0,3,5],[1.0,1.0,1.0]).
I want to take this input:
+---+------+----+-----+
| id|houses|cars|label|
+---+------+----+-----+
|  0|     M|   A|  1.0|
|  1|     M|   C|  1.0|
|  1|     M|   B|  1.0|
|  2|     F|   A|  0.0|
|  3|     F|   D|  0.0|
|  4|     Z|   B|  1.0|
|  5|     Z|   C|  0.0|
+---+------+----+-----+
then one-hot encode the houses column and the cars column, combine them, aggregate by id,
and generate this output:
+-------------------------+
|                 features|
+-------------------------+
|      (7,[0,4],[1.0,1.0])|
|(7,[0,3,5],[1.0,1.0,1.0])|
|      (7,[2,4],[1.0,1.0])|
|      (7,[2,6],[1.0,1.0])|
|      (7,[1,3],[1.0,1.0])|
|      (7,[1,5],[1.0,1.0])|
+-------------------------+
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SQLContext

def oneHotEncoderExample(sqlContext: SQLContext): Unit = {
  // define data
  val df = sqlContext.createDataFrame(Seq(
    (0, "M", "A", 1.0),
    (1, "M", "C", 1.0),
    (1, "M", "B", 1.0),
    (2, "F", "A", 0.0),
    (3, "F", "D", 0.0),
    (4, "Z", "B", 1.0),
    (5, "Z", "C", 0.0)
  )).toDF("id", "houses", "cars", "label")
  df.show()
  // define stages of pipeline
  val indexerHouse = new StringIndexer()
    .setInputCol("houses")
    .setOutputCol("housesIndex")
  val encoderHouse = new OneHotEncoder()
    .setDropLast(false)
    .setInputCol("housesIndex")
    .setOutputCol("typeHouses")
  val indexerCar = new StringIndexer()
    .setInputCol("cars")
    .setOutputCol("carsIndex")
  val encoderCar = new OneHotEncoder()
    .setDropLast(false)
    .setInputCol("carsIndex")
    .setOutputCol("typeCars")
  val assembler = new VectorAssembler()
    .setInputCols(Array("typeHouses", "typeCars"))
    .setOutputCol("features")
  val lr = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.01)
  // define pipeline
  val pipeline = new Pipeline()
    .setStages(Array(
      indexerHouse, encoderHouse,
      indexerCar, encoderCar,
      assembler, lr))
  // Fit the pipeline to training documents.
  val pipelineModel = pipeline.fit(df)
}
// helper code to simulate and aggregate current pipeline (generates table below)
val indexedHouse = indexerHouse.fit(df).transform(df)
indexedHouse.show()
val encodedHouse = encoderHouse.transform(indexedHouse)
encodedHouse.show()
val indexedCar = indexerCar.fit(df).transform(df)
indexedCar.show()
val encodedCar = encoderCar.transform(indexedCar)
encodedCar.show()
val assembledFeature = assembler.transform(encodedHouse.join(encodedCar, usingColumns = Seq("id", "houses", "cars")))
assembledFeature.show()
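The question is left open above, but for reference: assuming Spark 2.4+ (where org.apache.spark.ml.stat.Summarizer is available), one way to collapse the multiple rows per id is an element-wise max over the assembled vectors. This is a sketch under that assumption, not a tested answer to the original question:
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// element-wise max of the one-hot vectors: the result has a 1.0 wherever any row for that id had a 1.0
val aggregated = assembledFeature
  .groupBy("id")
  .agg(Summarizer.max(col("features")).alias("features"))
aggregated.show(truncate = false)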

An error about Dataset.filter in Spark SQL

I want to filter the dataset so that it only contains records that can be found in MySQL.
Here is the Dataset:
dataset.show()
+---+-----+
| id| name|
+---+-----+
|  1|    a|
|  2|    b|
|  3|    c|
+---+-----+
And here is the table in MySQL:
+---+-----+
| id| name|
+---+-----+
|  1|    a|
|  3|    c|
|  4|    d|
+---+-----+
This is my code (running in spark-shell):
import java.util.Properties
case class App(id: Int, name: String)
val data = sc.parallelize(Array((1, "a"), (2, "b"), (3, "c")))
val dataFrame = data.map { case (id, name) => App(id, name) }.toDF
val dataset = dataFrame.as[App]
val url = "jdbc:mysql://ip:port/tbl_name"
val table = "my_tbl_name"
val user = "my_user_name"
val password = "my_password"
val properties = new Properties()
properties.setProperty("user", user)
properties.setProperty("password", password)
dataset.filter((x: App) =>
  0 != sqlContext.read.jdbc(url, table, Array("id = " + x.id.toString), properties).count).show()
But I get "java.lang.NullPointerException"
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:362)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
I have tested
val x = App(1, "aa")
sqlContext.read.jdbc(url, table, Array("id = " + x.id.toString), properties).count
val y = App(5, "aa")
sqlContext.read.jdbc(url, table, Array("id = " + y.id.toString), properties).count
and I get the correct results, 1 and 0.
What's the problem with filter?
What's the problem with filter?
You get an exception because you're trying to execute an action (count on a DataFrame) inside a transformation (filter). Neither nested actions nor nested transformations are supported in Spark.
The correct solution is, as usual, either to join on compatible data structures, to look up against a local data structure, or to query the external system directly (without using Spark data structures).
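For example, a sketch of the join route (reading the MySQL table once over JDBC and keeping only matching ids with a left-semi join; url, table and properties as defined in the question):
// load the MySQL table once as a DataFrame instead of querying it per row
val mysqlDF = sqlContext.read.jdbc(url, table, properties).select("id")

// left-semi join keeps only the rows of `dataset` whose id exists in MySQL
val filtered = dataset.join(mysqlDF, Seq("id"), "leftsemi")
filtered.show()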
