Issue with adding Column to a DataFrame

Issue with adding Column to a DataFrame - apache-spark

Below code fails with AnalysisException: sc.version String = 1.6.0
case class Person(name: String, age: Long)
val caseClassDF = Seq(Person("Andy", 32)).toDF()
caseClassDF.count()
val seq = Seq(1)
val rdd = sqlContext.sparkContext.parallelize(seq)
val df2 = rdd.toDF("Counts")
df2.count()
val withCounts = caseClassDF.withColumn("duration", df2("Counts"))

For some reason, it works with UDF:
import org.apache.spark.sql.functions.udf
case class Person(name: String, age: Long, day: Int)
val caseClassDF = Seq(Person("Andy", 32, 1), Person("Raman", 22, 1), Person("Rajan", 40, 1), Person("Andy", 42, 2), Person("Raman", 42, 2), Person("Rajan", 50, 2)).toDF()
val calculateCounts= udf((x: Long, y: Int) =>
x+y)
val df1 = caseClassDF.withColumn("Counts", calculateCounts($"age", $"day"))
df1.show
+-----+---+---+------+
| name|age|day|Counts|
+-----+---+---+------+
| Andy| 32| 1| 33|
|Raman| 22| 1| 23|
|Rajan| 40| 1| 41|
| Andy| 42| 2| 44|
|Raman| 42| 2| 44|
|Rajan| 50| 2| 52|
+-----+---+---+------+

caseClassDF.withColumn("duration", df2("Counts")), Here the column should be of the same dataframe (in your case caseClassDF). AFAIK, Spark does not allow column of a different DataFrame in withColumn.
PS: I am a user of Spark 1.6.x, not sure whether this has come up in Spark 2.x

Related

How do I replace a full stop with a zero in PySpark?

I am trying to replace a full stop in my raw data with the value 0 in PySpark.
I tried to use a .when and .otherwise statement.
I tried to use regexp_replace to change the '.' to 0.
Code tried:
from pyspark.sql import functions as F
#For #1 above:
dataframe2 = dataframe1.withColumn("test_col", F.when(((F.col("test_col") == F.lit(".")), 0).otherwise(F.col("test_col")))
#For #2 above:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', 0))
Instead of "." it should rewrite the column with numbers only (i.e. there is a number in non full stop rows, otherwise, it's a full stop that should be replaced with 0).

pyspark version
from pyspark.sql import SparkSession
from pyspark.sql.types import (StringType, IntegerType, StructField, StructType)
from pyspark.sql import functions
column_schema = StructType([StructField("num", IntegerType()), StructField("text", StringType())])
data = [[3, 'r1'], [9, 'r2.'], [27, '.']]
spark = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("spark.executor.memory", '1g')
spark.conf.set('spark.executor.cores', '1')
spark.conf.set('spark.cores.max', '2')
spark.conf.set("spark.driver.memory", '1g')
spark_context = spark.sparkContext
data_frame = spark.createDataFrame(data, schema=column_schema)
data_frame.show()
filtered_data_frame = data_frame.withColumn('num',
functions.when(data_frame['num'] == 3, -3).otherwise(data_frame['num']))
filtered_data_frame.show()
filtered_data_frame = data_frame.withColumn('text',
functions.when(data_frame['text'] == '.', '0').otherwise(
data_frame['text']))
filtered_data_frame.show()
output
+---+----+
|num|text|
+---+----+
| 3| r1|
| 9| r2.|
| 27| .|
+---+----+
+---+----+
|num|text|
+---+----+
| -3| r1|
| 9| r2.|
| 27| .|
+---+----+
+---+----+
|num|text|
+---+----+
| 3| r1|
| 9| r2.|
| 27| 0|
+---+----+

sample code does query properly
package otz.scalaspark
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
object ValueReplacement {
def main(args: Array[String]) {
val sparkConfig = new SparkConf().setAppName("Value-Replacement").setMaster("local[*]").set("spark.executor.memory", "1g");
val sparkContext = new SparkContext(sparkConfig)
val someData = Seq(
Row(3, "r1"),
Row(9, "r2"),
Row(27, "r3"),
Row(81, "r4")
)
val someSchema = List(
StructField("number", IntegerType, true),
StructField("word", StringType, true)
)
val sqlContext = new SQLContext(sparkContext)
val dataFrame = sqlContext.createDataFrame(
sparkContext.parallelize(someData),
StructType(someSchema)
)
val filteredDataFrame = dataFrame.withColumn("number", when(col("number") === 3, -3).otherwise(col("number")));
filteredDataFrame.show()
}
}
output
+------+----+
|number|word|
+------+----+
| -3| r1|
| 9| r2|
| 27| r3|
| 81| r4|
+------+----+

You attempt #2 was almost correct, if you have a dataframe1 like:
+--------+
|test_col|
+--------+
| 1.0|
| 2.0|
| 2|
+--------+
Your attempt must be yielding:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', 0))
dataframe2.show()
+--------+
|test_col|
+--------+
| 000|
| 000|
| 0|
+--------+
Here the . means all the letter are to be replaced and not just '.'.
However, if you add an escape sequence (\) before the dot then things should work fine.
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '\.', '0'))
dataframe2.show()
+--------+
|test_col|
+--------+
| 100|
| 200|
| 2|
+--------+

Efficiently calculate top-k elements in spark

I have a dataframe similarly to:
+---+-----+-----+
|key|thing|value|
+---+-----+-----+
| u1| foo| 1|
| u1| foo| 2|
| u1| bar| 10|
| u2| foo| 10|
| u2| foo| 2|
| u2| bar| 10|
+---+-----+-----+
And want to get a result of:
+---+-----+---------+----+
|key|thing|sum_value|rank|
+---+-----+---------+----+
| u1| bar| 10| 1|
| u1| foo| 3| 2|
| u2| foo| 12| 1|
| u2| bar| 10| 2|
+---+-----+---------+----+
Currently, there is code similarly to:
val df = Seq(("u1", "foo", 1), ("u1", "foo", 2), ("u1", "bar", 10), ("u2", "foo", 10), ("u2", "foo", 2), ("u2", "bar", 10)).toDF("key", "thing", "value")
// calculate sums per key and thing
val aggregated = df.groupBy("key", "thing").agg(sum("value").alias("sum_value"))
// get topk items per key
val k = lit(10)
val topk = aggregated.withColumn("rank", rank over Window.partitionBy("key").orderBy(desc("sum_value"))).filter('rank < k)
However, this code is very inefficient. A window function generates a total order of items and causes a gigantic shuffle.
How can I calculate top-k items more efficiently?
Maybe using approximate functions i.e. sketches similarly to https://datasketches.github.io/ or https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html

This is a classical algorithm of recommender systems.
case class Rating(thing: String, value: Int) extends Ordered[Rating] {
def compare(that: Rating): Int = -this.value.compare(that.value)
}
case class Recommendation(key: Int, ratings: Seq[Rating]) {
def keep(n: Int) = this.copy(ratings = ratings.sorted.take(n))
}
val TOPK = 10
df.groupBy('key)
.agg(collect_list(struct('thing, 'value)) as "ratings")
.as[Recommendation]
.map(_.keep(TOPK))
You can also check the source code at:
Spotify Big Data Rosetta Code / TopItemsPerUser.scala, several solutions here for Spark or Scio
Spark MLLib / TopByKeyAggregator.scala, considered the best practice when using their recommendation algorithm, it looks like their examples still uses RDD though.
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
sc.parallelize(Array(("u1", ("foo", 1)), ("u1", ("foo", 2)), ("u1", ("bar", 10)), ("u2", ("foo", 10)),
("u2", ("foo", 2)), ("u2", ("bar", 10))))
.topByKey(10)(Ordering.by(_._2))

RDD`s to the rescue
aggregated.as[(String, String, Long)].rdd.groupBy(_._1).map{ case (thing, it) => (thing, it.map(e=> (e._2, e._3)).toList.sortBy(sorter => sorter._2).take(1))}.toDF.show
+---+----------+
| _1| _2|
+---+----------+
| u1| [[foo,3]]|
| u2|[[bar,10]]|
+---+----------+
This can most likely be improved using the suggestion from the comment. I.e. when not starting out from aggregated, but rather df. This could look similar to:
df.as[(String, String, Long)].rdd.groupBy(_._1).map{case (thing, it) => {
val aggregatedInner = it.groupBy(e=> (e._2)).mapValues(events=> events.map(value => value._3).sum)
val topk = aggregatedInner.toArray.sortBy(sorter=> sorter._2).take(1)
(thing, topk)
}}.toDF.show

How to use window specification and join condition per column values?

Here is my DF1
OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber|^|FFAction
4295858898|^|204|^|205|^|1|^|I|!|
4295858898|^|204|^|208|^|2|^|I|!|
4295858898|^|204|^|209|^|2|^|I|!|
4295858898|^|204|^|211|^|3|^|I|!|
4295858898|^|204|^|212|^|3|^|I|!|
4295858898|^|204|^|214|^|4|^|I|!|
4295858898|^|204|^|215|^|4|^|I|!|
4295858898|^|206|^|207|^|1|^|I|!|
4295858898|^|206|^|210|^|2|^|I|!|
4295858898|^|206|^|213|^|3|^|I|!|
Here is my DF2
DataPartition|^|PartitionYear|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber|^|FFAction|!|
SelfSourcedPublic|^|2002|^|1511224917595|^|4295858941|^|24|^|25|^|4|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917596|^|4295858941|^|24|^|25|^|4|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917597|^|4295858941|^|30|^|31|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917598|^|4295858941|^|30|^|31|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917599|^|4295858941|^|30|^|32|^|1|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917600|^|4295858941|^|30|^|32|^|1|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917601|^|4295858941|^|24|^|33|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917602|^|4295858941|^|24|^|33|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917603|^|4295858941|^|24|^|34|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917604|^|4295858941|^|24|^|34|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917605|^|4295858941|^|1|^|2|^|4|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917606|^|4295858941|^|1|^|3|^|4|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917607|^|4295858941|^|5|^|6|^|4|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917608|^|4295858941|^|5|^|7|^|4|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917609|^|4295858941|^|12|^|10|^|2|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917610|^|4295858941|^|12|^|11|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917611|^|4295858941|^|1|^|13|^|1|^|O|!|
SelfSourcedPublic|^|2003|^|1511224917612|^|4295858941|^|12|^|14|^|1|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917613|^|4295858941|^|5|^|15|^|3|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917614|^|4295858941|^|5|^|16|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917615|^|4295858941|^|1|^|17|^|3|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917616|^|4295858941|^|1|^|18|^|3|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917617|^|4295858941|^|5|^|19|^|1|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917618|^|4295858941|^|5|^|20|^|2|^|O|!|
SelfSourcedPublic|^|2001|^|1511224917619|^|4295858941|^|5|^|21|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917620|^|4295858941|^|1|^|22|^|2|^|O|!|
SelfSourcedPublic|^|2002|^|1511224917621|^|4295858941|^|1|^|23|^|2|^|O|!|
SelfSourcedPublic|^|2016|^|1511224917622|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1511224917642|^|4295858941|^|null|^|35|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1511224917643|^|4295858941|^|null|^|36|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1511224917644|^|4295858941|^|null|^|37|^|null|^|D|!|
I want to implement join based on the value of the column.
This is what I am trying to achieve in Spark-Scala for example but don't know how to implement it
If the FFAction_1 =I in the DF2 then below condition
(join and partitionBy on three columns "OrganizationId", "AnnualPeriodId","InterimPeriodId")
val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("OrganizationId","AnnualPeriodId","InterimPeriodId"), "outer")
.select($"OrganizationId", $"AnnualPeriodId",$"InterimPeriodId",
when($"FFAction_1".isNotNull, concat(col("FFAction_1"),
lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
If the FFAction_1 =O or D then below condition
(join and partitionBy on two columns "OrganizationId","InterimPeriodId")
val windowSpec = Window.partitionBy("OrganizationId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("OrganizationId","AnnualPeriodId","InterimPeriodId"), "outer")
.select($"OrganizationId", $"AnnualPeriodId",$"InterimPeriodId",
when($"FFAction_1".isNotNull, concat(col("FFAction_1"),
lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
Below is my full code
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val get_cus_YearPartition = spark.udf.register("get_cus_YearPartition", (filePath: String) => filePath.split("\\.")(4))
val rdd = sc.textFile("s3://trfsmallfffile/Interim2Annual/MAIN")
val header = rdd.filter(_.contains("OrganizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema)
val schemaHeader = StructType(header.map(cols => StructField(cols.replace(".", "."), StringType)).toSeq)
val dataHeader = sqlContext.createDataFrame(rdd.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schemaHeader)
val df1resultFinal=data.withColumn("DataPartition", get_cus_val(input_file_name))
val df1resultFinalWithYear=df1resultFinal.withColumn("PartitionYear", get_cus_YearPartition(input_file_name))
//Loading Incremental
val rdd1 = sc.textFile("s3://trfsmallfffile/Interim2Annual/INCR")
val header1 = rdd1.filter(_.contains("OrganizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
val data1 = sqlContext.createDataFrame(rdd1.filter(!_.contains("OrganizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema1)
//------------------------------- filtering only the latest from increamental ------------------------------
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("OrganizationId","AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey1 = data1.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank")
val windowSpec2 = Window.partitionBy("OrganizationId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = latestForEachKey1.withColumn("tobefiltered", first("FFAction|!|").over(windowSpec2))
.filter($"tobefiltered" === "I|!|" || $"tobefiltered" === "O|!|" || ($"tobefiltered" === "D|!|" && $"FFAction|!|" === "D|!|"))
.drop("tobefiltered", "TimeStamp")
//-----------------separating the increamental df for insert, deletion and overwrite----------------
//---------------insert rows are selected -------------------------------
//insert a row if I is detected and if O is found then first delete and then insert
val insertdf = latestForEachKey.filter($"FFAction|!|" === "I|!|" || $"FFAction|!|" === "O|!|").select(df1resultFinalWithYear.schema.fieldNames.map(col):_*)
//------------------deleted rows with primary key "OrganizationId", "InterimPeriodId"------------------
// delete rows from parent if both D or O is found in increamental
val deletedf = latestForEachKey.filter($"FFAction|!|" === "D|!|" || $"FFAction|!|" === "O|!|").select($"OrganizationId", $"InterimPeriodId", lit("delete").as("Delete"))
//join by two primary keys for deletion and delete from the parent dataframe
val dfMainOutput = df1resultFinalWithYear.join(deletedf, Seq("OrganizationId", "InterimPeriodId"), "left").filter($"Delete".isNull).drop("Delete")
val dfToSave=dfMainOutput.union(insertdf).withColumn("FFAction|!|", when($"FFAction|!|" === "O|!|" || $"FFAction|!|" === "I|!|", lit("I|!|")))
val dfMainOutputFinal = dfToSave.na.fill("").select($"DataPartition", $"PartitionYear",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").filter(_ != "PartitionYear").map(c => col(c)): _*).as("concatenated"))
val headerColumn = dataHeader.columns.toSeq
val header = headerColumn.mkString("", "|^|", "|!|").dropRight(3)
val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "null", "")).withColumnRenamed("concatenated", header)
dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition","PartitionYear")
.format("csv")
.option("nullValue", "")
.option("delimiter", "\t")
.option("quote", "\u0000")
.option("header", "true")
.option("codec", "gzip")
.save("s3://trfsmallfffile/Interim2Annual/output")
val FFRowCount =dfMainOutputFinalWithoutNull.groupBy("DataPartition","PartitionYear").count
FFRowCount.coalesce(1).write.format("com.databricks.spark.xml")
.option("rootTag", "FFFileType")
.option("rowTag", "FFPhysicalFile")
.save("s3://trfsmallfffile/Interim2Annual/Descr")

DISCLAIMER Somehow this and the other question I've just answered seem duplicates so one is going to get marked as such soon or we find out the difference between them and the disclaimer goes away. Time will tell.
Given the requirement to select the final window specification and join condition based on the values of FFAction_1 column, I'd do filter first and decide what window aggregation and join to use.
val df1 = spark.
read.
option("header", true).
option("sep", "|").
csv("df1.csv").
select("OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber", "FFAction")
scala> df1.show
+--------------+--------------+---------------+-------------+--------+
|OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber|FFAction|
+--------------+--------------+---------------+-------------+--------+
| 4295858898| 204| 205| 1| I|
| 4295858898| 204| 208| 2| I|
| 4295858898| 204| 209| 2| I|
| 4295858898| 204| 211| 3| I|
| 4295858898| 204| 212| 3| I|
| 4295858898| 204| 214| 4| I|
| 4295858898| 204| 215| 4| I|
| 4295858898| 206| 207| 1| I|
| 4295858898| 206| 210| 2| I|
| 4295858898| 206| 213| 3| I|
+--------------+--------------+---------------+-------------+--------+
The right-hand side of the join is fairly similar in "shape".
val df2 = spark.
read.
option("header", true).
option("sep", "|").
csv("df2.csv").
select("DataPartition_1", "PartitionYear_1", "TimeStamp", "OrganizationId", "AnnualPeriodId", "InterimPeriodId", "InterimNumber_1", "FFAction_1")
scala> df2.show
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
| DataPartition_1|PartitionYear_1| TimeStamp|OrganizationId|AnnualPeriodId|InterimPeriodId|InterimNumber_1|FFAction_1|
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
|SelfSourcedPublic| 2002|1510725106270| 4295858941| 24| 25| 4| O|
|SelfSourcedPublic| 2002|1510725106271| 4295858941| 24| 25| 5| O|
|SelfSourcedPublic| 2003|1510725106272| 4295858941| 30| 31| 2| O|
|SelfSourcedPublic| 2003|1510725106273| 4295858941| 30| 31| 3| O|
|SelfSourcedPublic| 2001|1510725106293| 4295858941| 5| 20| 2| O|
|SelfSourcedPublic| 2001|1510725106294| 4295858941| 5| 21| 3| O|
|SelfSourcedPublic| 2002|1510725106295| 4295858941| 1| 22| 4| O|
|SelfSourcedPublic| 2002|1510725106296| 4295858941| 1| 23| 5| O|
|SelfSourcedPublic| 2016|1510725106297| 4295858941| 35| 36| 1| I|
|SelfSourcedPublic| 2016|1510725106297| 4295858941| 35| 36| 1| D|
+-----------------+---------------+-------------+--------------+--------------+---------------+---------------+----------+
With the above datasets, I'd filter out to see if there's at least one I in df2 in FFAction_1 column and select the correct window specification and join condition.
The trick is to use join operator followed by where (or filter) operator so you can decide on what join condition to use.
val noIs = df2.filter($"FFAction_1" === "I").take(1).isEmpty
val (windowSpec, joinCond) = if (noIs) {
(windowSpecForOs, joinForOs)
} else {
(windowSpecForIs, joinForIs)
}
val latestForEachKey = df2result.withColumn("rank", rank() over windowSpec)
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey).where(joinCond)

How to aggregate one hot encoded values in pipeline?

I am able to use string indexers and one hot encoders to create the features column on the far right. Notice how id 1 has multiple rows. I was wondering how to aggregate the sparse vector in features using a pipeline
or some other alternative so that the feature for id 1 = (7,[0,3,5],[1.0,1.0,1.0]).
I want to take this input:
+---+------+----+-----+
| id|houses|cars|label|
+---+------+----+-----+
| 0| M| A| 1.0|
| 1| M| C| 1.0|
| 1| M| B| 1.0|
| 2| F| A| 0.0|
| 3| F| D| 0.0|
| 4| Z| B| 1.0|
| 5| Z| C| 0.0|
+---+------+----+-----+
then one hot encode the houses column, the cars column, combine them and aggregate by id
and generate this output:
+-------------------+
| features|
+-------------------+
|(7,[0,4],[1.0,1.0])|
|(7,[0,3,5],[1.0,1.0,1.0])|
|(7,[2,4],[1.0,1.0])|
|(7,[2,6],[1.0,1.0])|
|(7,[1,3],[1.0,1.0])|
|(7,[1,5],[1.0,1.0])|
+-------------------+
def oneHotEncoderExample(sqlContext: SQLContext): Unit = {
// define data
val df = sqlContext.createDataFrame(Seq(
(0, "M", "A", 1.0),
(1, "M", "C", 1.0),
(1, "M", "B", 1.0),
(2, "F", "A", 0.0),
(3, "F", "D", 0.0),
(4, "Z", "B", 1.0),
(5, "Z", "C", 0.0)
)).toDF("id", "houses", "cars", "label")
df.show()
// define stages of pipeline
val indexerHouse = new StringIndexer()
.setInputCol("houses")
.setOutputCol("housesIndex")
val encoderHouse = new OneHotEncoder()
.setDropLast(false)
.setInputCol("housesIndex")
.setOutputCol("typeHouses")
val indexerCar = new StringIndexer()
.setInputCol("cars")
.setOutputCol("carsIndex")
val encoderCar = new OneHotEncoder()
.setDropLast(false)
.setInputCol("carsIndex")
.setOutputCol("typeCars")
val assembler = new VectorAssembler()
.setInputCols(Array("typeHouses", "typeCars"))
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
// define pipeline
val pipeline = new Pipeline()
.setStages(Array(
indexerHouse, encoderHouse,
indexerCar, encoderCar,
assembler, lr))
// Fit the pipeline to training documents.
val pipelineModel = pipeline.fit(df)
}
// helper code to simulate and aggregate current pipeline (generates table below)
val indexedHouse = indexerHouse.fit(df).transform(df)
indexedHouse.show()
val encodedHouse = encoderHouse.transform(indexedHouse)
encodedHouse.show()
val indexedCar = indexerCar.fit(df).transform(df)
indexedCar.show()
val encodedCar = encoderCar.transform(indexedCar)
encodedCar.show()
val assembledFeature = assembler.transform(encodedHouse.join(encodedCar, usingColumns = Seq("id", "houses", "cars")))
assembledFeature.show()

An error about Dataset.filter in Spark SQL

I want to filter the dataset only to contain the record which can be found in MySQL.
Here is the Dataset:
dataset.show()
+---+-----+
| id| name|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
And here is the table in MySQL:
+---+-----+
| id| name|
+---+-----+
| 1| a|
| 3| c|
| 4| d|
+---+-----+
This is my code (running in spark-shell):
import java.util.Properties
case class App(id: Int, name: String)
val data = sc.parallelize(Array((1, "a"), (2, "b"), (3, "c")))
val dataFrame = data.map { case (id, name) => App(id, name) }.toDF
val dataset = dataFrame.as[App]
val url = "jdbc:mysql://ip:port/tbl_name"
val table = "my_tbl_name"
val user = "my_user_name"
val password = "my_password"
val properties = new Properties()
properties.setProperty("user", user)
properties.setProperty("password", password)
dataset.filter((x: App) =>
0 != sqlContext.read.jdbc(url, table, Array("id = " + x.id.toString), properties).count).show()
But I get "java.lang.NullPointerException"
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:362)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
I have tested
val x = App(1, "aa")
sqlContext.read.jdbc(url, table, Array("id = " + x.id.toString), properties).count
val y = App(5, "aa")
sqlContext.read.jdbc(url, table, Array("id = " + y.id.toString), properties).count
and I can get the right result 1 and 0.
What's the problem with filter?

What's the problem with filter?
You get an exception because you're trying to execute an action (count on a DataFrame) inside a transformation (filter). Neither nested actions nor transformations are supported in Spark.
Correct solution is as usual either join on compatible data structures, lookup using local data structure or query directly against external system (without using Spark data structures).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Issue with adding Column to a DataFrame - apache-spark

caseClassDF.withColumn("duration", df2("Counts")), Here the column should be of the same dataframe (in your case caseClassDF). AFAIK, Spark does not allow column of a different DataFrame in withColumn. PS: I am a user of Spark 1.6.x, not sure whether this has come up in Spark 2.x

Related

How do I replace a full stop with a zero in PySpark?

Efficiently calculate top-k elements in spark

How to use window specification and join condition per column values?

How to aggregate one hot encoded values in pipeline?

An error about Dataset.filter in Spark SQL

Categories

Resources