I am able to use string indexers and one-hot encoders to create the features column shown in the output below. Notice how id 1 has multiple rows. I was wondering how to aggregate the sparse vectors in features, using a pipeline or some other alternative, so that the feature for id 1 becomes (7,[0,3,5],[1.0,1.0,1.0]).
I want to take this input:
+---+------+----+-----+
| id|houses|cars|label|
+---+------+----+-----+
|  0|     M|   A|  1.0|
|  1|     M|   C|  1.0|
|  1|     M|   B|  1.0|
|  2|     F|   A|  0.0|
|  3|     F|   D|  0.0|
|  4|     Z|   B|  1.0|
|  5|     Z|   C|  0.0|
+---+------+----+-----+
then one-hot encode the houses column and the cars column, combine them, aggregate by id, and generate this output:
+-------------------------+
|                 features|
+-------------------------+
|      (7,[0,4],[1.0,1.0])|
|(7,[0,3,5],[1.0,1.0,1.0])|
|      (7,[2,4],[1.0,1.0])|
|      (7,[2,6],[1.0,1.0])|
|      (7,[1,3],[1.0,1.0])|
|      (7,[1,5],[1.0,1.0])|
+-------------------------+
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SQLContext

def oneHotEncoderExample(sqlContext: SQLContext): Unit = {
  // define data
  val df = sqlContext.createDataFrame(Seq(
    (0, "M", "A", 1.0),
    (1, "M", "C", 1.0),
    (1, "M", "B", 1.0),
    (2, "F", "A", 0.0),
    (3, "F", "D", 0.0),
    (4, "Z", "B", 1.0),
    (5, "Z", "C", 0.0)
  )).toDF("id", "houses", "cars", "label")
  df.show()

  // define the stages of the pipeline
  val indexerHouse = new StringIndexer()
    .setInputCol("houses")
    .setOutputCol("housesIndex")
  val encoderHouse = new OneHotEncoder()
    .setDropLast(false)
    .setInputCol("housesIndex")
    .setOutputCol("typeHouses")
  val indexerCar = new StringIndexer()
    .setInputCol("cars")
    .setOutputCol("carsIndex")
  val encoderCar = new OneHotEncoder()
    .setDropLast(false)
    .setInputCol("carsIndex")
    .setOutputCol("typeCars")
  val assembler = new VectorAssembler()
    .setInputCols(Array("typeHouses", "typeCars"))
    .setOutputCol("features")
  val lr = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.01)

  // define the pipeline
  val pipeline = new Pipeline()
    .setStages(Array(
      indexerHouse, encoderHouse,
      indexerCar, encoderCar,
      assembler, lr))

  // fit the pipeline to the training data
  val pipelineModel = pipeline.fit(df)
}
// helper code to run the pipeline stages individually and inspect the intermediate tables
val indexedHouse = indexerHouse.fit(df).transform(df)
indexedHouse.show()
val encodedHouse = encoderHouse.transform(indexedHouse)
encodedHouse.show()
val indexedCar = indexerCar.fit(df).transform(df)
indexedCar.show()
val encodedCar = encoderCar.transform(indexedCar)
encodedCar.show()
val assembledFeature = assembler.transform(
  encodedHouse.join(encodedCar, usingColumns = Seq("id", "houses", "cars")))
assembledFeature.show()
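One way to get the aggregated per-id features (a sketch, assuming Spark 2.x with org.apache.spark.ml.linalg vectors; on 1.6 the same idea applies with org.apache.spark.mllib.linalg) is to collect the row-level vectors for each id and merge them with a small UDF, keeping the maximum value per index, which amounts to a logical OR for one-hot features. The mergeVectors helper below is hypothetical, not part of the pipeline API:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, collect_list, udf}

// merge a list of sparse vectors by keeping the maximum value per index
// (a logical OR for one-hot encoded features)
val mergeVectors = udf { (vectors: Seq[Vector]) =>
  val size = vectors.head.size
  val merged = scala.collection.mutable.Map.empty[Int, Double]
  vectors.foreach { v =>
    v.foreachActive { (i, value) =>
      merged(i) = math.max(merged.getOrElse(i, 0.0), value)
    }
  }
  Vectors.sparse(size, merged.toSeq.sortBy(_._1))
}

val aggregatedFeatures = assembledFeature
  .groupBy("id")
  .agg(collect_list("features").as("featureList"))
  .withColumn("features", mergeVectors(col("featureList")))
  .select("features")

aggregatedFeatures.show(false)
For the sample data this yields one row per id, with the two rows of id 1 collapsed into a single vector (the exact indices depend on the StringIndexer frequency ordering).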
Related
I have a dataframe similar to:
+---+-----+-----+
|key|thing|value|
+---+-----+-----+
| u1|  foo|    1|
| u1|  foo|    2|
| u1|  bar|   10|
| u2|  foo|   10|
| u2|  foo|    2|
| u2|  bar|   10|
+---+-----+-----+
And want to get a result of:
+---+-----+---------+----+
|key|thing|sum_value|rank|
+---+-----+---------+----+
| u1|  bar|       10|   1|
| u1|  foo|        3|   2|
| u2|  foo|       12|   1|
| u2|  bar|       10|   2|
+---+-----+---------+----+
Currently, the code is similar to:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("u1", "foo", 1), ("u1", "foo", 2), ("u1", "bar", 10),
  ("u2", "foo", 10), ("u2", "foo", 2), ("u2", "bar", 10)
).toDF("key", "thing", "value")

// calculate sums per key and thing
val aggregated = df.groupBy("key", "thing").agg(sum("value").alias("sum_value"))

// get the top-k items per key
val k = lit(10)
val topk = aggregated
  .withColumn("rank", rank().over(Window.partitionBy("key").orderBy(desc("sum_value"))))
  .filter('rank < k)
However, this code is very inefficient. The window function generates a total order of the items and causes a gigantic shuffle.
How can I calculate the top-k items more efficiently?
Maybe using approximate functions, i.e. sketches similar to https://datasketches.github.io/ or https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html?
This is a classic algorithm from recommender systems.
case class Rating(thing: String, value: Int) extends Ordered[Rating] {
  // order by value, descending
  def compare(that: Rating): Int = -this.value.compare(that.value)
}

case class Recommendation(key: String, ratings: Seq[Rating]) {
  def keep(n: Int) = this.copy(ratings = ratings.sorted.take(n))
}

val TOPK = 10

df.groupBy('key)
  .agg(collect_list(struct('thing, 'value)) as "ratings")
  .as[Recommendation]
  .map(_.keep(TOPK))
You can also check the source code at:
Spotify Big Data Rosetta Code / TopItemsPerUser.scala, with several solutions for Spark and Scio
Spark MLlib / TopByKeyAggregator.scala, considered the best practice when using their recommendation algorithm; it looks like their examples still use RDDs though.
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
sc.parallelize(Array(("u1", ("foo", 1)), ("u1", ("foo", 2)), ("u1", ("bar", 10)), ("u2", ("foo", 10)),
("u2", ("foo", 2)), ("u2", ("bar", 10))))
.topByKey(10)(Ordering.by(_._2))
RDDs to the rescue
aggregated
  .as[(String, String, Long)]
  .rdd
  .groupBy(_._1)
  .map { case (key, it) =>
    (key, it.map(e => (e._2, e._3)).toList.sortBy(_._2).take(1))
  }
  .toDF
  .show
+---+----------+
| _1|        _2|
+---+----------+
| u1| [[foo,3]]|
| u2|[[bar,10]]|
+---+----------+
This can most likely be improved by following the suggestion from the comments, i.e. by starting from df rather than from aggregated. That could look similar to:
df.as[(String, String, Long)].rdd.groupBy(_._1).map { case (key, it) =>
  val aggregatedInner = it.groupBy(e => e._2).mapValues(events => events.map(_._3).sum)
  val topk = aggregatedInner.toArray.sortBy(_._2).take(1)
  (key, topk)
}.toDF.show
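For completeness, a window-free, DataFrame-only variant is also possible (a sketch, assuming Spark 2.4+ for slice and the descending flag of sort_array): aggregate per (key, thing), collect the pairs per key, then sort and slice the resulting array.
import org.apache.spark.sql.functions.{collect_list, expr, struct, sum}

val k = 10

val topkNoWindow = df
  .groupBy("key", "thing")
  .agg(sum("value").as("sum_value"))
  // collect (sum_value, thing) pairs so that sort_array orders by sum_value first
  .groupBy("key")
  .agg(collect_list(struct($"sum_value", $"thing")).as("ratings"))
  // sort descending and keep the first k entries of each array
  .withColumn("topk", expr(s"slice(sort_array(ratings, false), 1, $k)"))

topkNoWindow.show(false)
This still shuffles the data to group by key, but each per-key array is sorted independently instead of running a partition-wide window sort.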
I have the datasets below:
Table1
Table2
Now I would like to get the dataset below. I've tried a left outer join on Table1.id == Table2.departmentid, but I am not getting the desired output.
Later, I need to use this table to get several counts and convert the data into XML. I will be doing this conversion using map.
Any help would be appreciated.
Joining alone is not enough to get the desired output. You are probably missing something: the last element of each nested array might be departmentid. Assuming the last element of the nested array is departmentid, I've generated the output in the following way:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.collect_list

case class department(id: Integer, deptname: String)
case class employee(employeid: Integer, empname: String, departmentid: Integer)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val department_df = Seq(
  department(1, "physics"),
  department(2, "computer")).toDF()

val employee_df = Seq(
  employee(1, "A", 1),
  employee(2, "B", 1),
  employee(3, "C", 2),
  employee(4, "D", 2)).toDF()

val result = department_df.join(employee_df, department_df("id") === employee_df("departmentid"), "left").
  selectExpr("id", "deptname", "employeid", "empname").
  rdd.map {
    case Row(id: Integer, deptname: String, employeid: Integer, empname: String) =>
      (id, deptname, Array(employeid.toString, empname, id.toString))
  }.toDF("id", "deptname", "arrayemp").
  groupBy("id", "deptname").
  agg(collect_list("arrayemp").as("emplist")).
  orderBy("id", "deptname")
The output looks like this:
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist               |
+---+--------+----------------------+
|1  |physics |[[2, B, 1], [1, A, 1]]|
|2  |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
Explanation: if I break down the last dataframe transformation into multiple steps, it will probably become clear how the output is generated.
left outer join between department_df and employee_df
val df1 = department_df.join(employee_df, department_df("id") === employee_df("departmentid"), "left").
  selectExpr("id", "deptname", "employeid", "empname")
df1.show()
+---+--------+---------+-------+
| id|deptname|employeid|empname|
+---+--------+---------+-------+
|  1| physics|        2|      B|
|  1| physics|        1|      A|
|  2|computer|        4|      D|
|  2|computer|        3|      C|
+---+--------+---------+-------+
creating an array column from some of the df1 dataframe's columns
val df2 = df1.rdd.map {
  case Row(id: Integer, deptname: String, employeid: Integer, empname: String) =>
    (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp")
df2.show()
+---+--------+---------+
| id|deptname| arrayemp|
+---+--------+---------+
|  1| physics|[2, B, 1]|
|  1| physics|[1, A, 1]|
|  2|computer|[4, D, 2]|
|  2|computer|[3, C, 2]|
+---+--------+---------+
creating a new list by aggregating the arrays of the df2 dataframe
val result = df2.groupBy("id", "deptname").
  agg(collect_list("arrayemp").as("emplist")).
  orderBy("id", "deptname")
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist               |
+---+--------+----------------------+
|1  |physics |[[2, B, 1], [1, A, 1]]|
|2  |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq(
  (1, "Physics"),
  (2, "Computer"),
  (3, "Maths")
)).toDF("ID", "Dept")

val schema = List(
  StructField("EMPID", IntegerType, true),
  StructField("EMPNAME", StringType, true),
  StructField("DeptID", IntegerType, true)
)

val data = Seq(
  Row(1, "A", 1),
  Row(2, "B", 1),
  Row(3, "C", 2),
  Row(4, "D", 2),
  Row(5, "E", null)
)

val df_emp = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
val newdf = df_emp
  .withColumn("CONC", array($"EMPID", $"EMPNAME", $"DeptID"))
  .groupBy($"DeptID")
  .agg(expr("collect_list(CONC) as emplist"))

df.join(newdf, df.col("ID") === df_emp.col("DeptID")).select($"ID", $"Dept", $"emplist").show()
+---+--------+--------------------+
| ID|    Dept|             emplist|
+---+--------+--------------------+
|  1| Physics|[[1, A, 1], [2, B...|
|  2|Computer|[[3, C, 2], [4, D...|
+---+--------+--------------------+
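Note that this inner join drops departments without matching employees (Maths in this sample). If those should be kept as well, a left join with the same condition is a possible variant (a sketch; emplist is then null for departments with no employees):
// keep departments with no employees; their emplist will be null
df.join(newdf, df.col("ID") === newdf.col("DeptID"), "left")
  .select($"ID", $"Dept", $"emplist")
  .show()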
I have a spark dataframe like
+-----+---+---+---+------+
|group|  a|  b|  c|config|
+-----+---+---+---+------+
|    a|  1|  2|  3|   [a]|
|    b|  2|  3|  4|[a, b]|
+-----+---+---+---+------+
val df = Seq(("a", 1, 2, 3, Seq("a")), ("b", 2, 3, 4, Seq("a", "b"))).toDF("group", "a", "b", "c", "config")
How can I add an additional column i.e.
df.withColumn("select_by_config", <<>>).show
as a struct or JSON which combines a number of columns (specified by config) into something similar to a Hive named struct / Spark struct / JSON column? Note that this struct is specific per group and not constant for the whole dataframe; it is specified in the config column.
I can imagine that a df.map could do the trick, but the serialization overhead does not seem efficient. How can this be achieved via SQL-only expressions? Maybe as a map-type column?
edit
a possible but really clumsy solution for 2.2 is:
val df = Seq(
  (1, "a", 1, 2, 3, Seq("a")),
  (2, "b", 2, 3, 4, Seq("a", "b"))).toDF("id", "group", "a", "b", "c", "config")
df.show

import spark.implicits._
final case class Foo(id: Int, c1: Int, specific: Map[String, Int])

df.map { r =>
  val config = r.getAs[Seq[String]]("config")
  print(config)
  val others = config.map(elem => (elem, r.getAs[Int](elem))).toMap
  Foo(r.getAs[Int]("id"), r.getAs[Int]("c"), others)
}.show
Are there any better ways to solve the problem for 2.2?
If you use a recent build (Spark 2.4.0 RC 1 or later), a combination of higher-order functions should do the trick. Create a map of the columns:
import org.apache.spark.sql.functions.{
  array, col, expr, lit, map_from_arrays, map_from_entries
}

val cols = Seq("a", "b", "c")

val dfm = df.withColumn(
  "cmap",
  map_from_arrays(array(cols map lit: _*), array(cols map col: _*))
)
and transform the config:
dfm.withColumn(
  "config_mapped",
  map_from_entries(expr("transform(config, k -> struct(k, cmap[k]))"))
).show
// +-----+---+---+---+------+--------------------+----------------+
// |group|  a|  b|  c|config|                cmap|   config_mapped|
// +-----+---+---+---+------+--------------------+----------------+
// |    a|  1|  2|  3|   [a]|[a -> 1, b -> 2, ...|        [a -> 1]|
// |    b|  2|  3|  4|[a, b]|[a -> 2, b -> 3, ...|[a -> 2, b -> 3]|
// +-----+---+---+---+------+--------------------+----------------+
The code below fails with an AnalysisException (sc.version: String = 1.6.0):
case class Person(name: String, age: Long)
val caseClassDF = Seq(Person("Andy", 32)).toDF()
caseClassDF.count()
val seq = Seq(1)
val rdd = sqlContext.sparkContext.parallelize(seq)
val df2 = rdd.toDF("Counts")
df2.count()
val withCounts = caseClassDF.withColumn("duration", df2("Counts"))
For some reason, it works with UDF:
import org.apache.spark.sql.functions.udf
case class Person(name: String, age: Long, day: Int)
val caseClassDF = Seq(Person("Andy", 32, 1), Person("Raman", 22, 1), Person("Rajan", 40, 1), Person("Andy", 42, 2), Person("Raman", 42, 2), Person("Rajan", 50, 2)).toDF()
val calculateCounts = udf((x: Long, y: Int) => x + y)
val df1 = caseClassDF.withColumn("Counts", calculateCounts($"age", $"day"))
df1.show
+-----+---+---+------+
| name|age|day|Counts|
+-----+---+---+------+
| Andy| 32|  1|    33|
|Raman| 22|  1|    23|
|Rajan| 40|  1|    41|
| Andy| 42|  2|    44|
|Raman| 42|  2|    44|
|Rajan| 50|  2|    52|
+-----+---+---+------+
In caseClassDF.withColumn("duration", df2("Counts")), the column should come from the same dataframe (in your case caseClassDF). AFAIK, Spark does not allow a column from a different DataFrame in withColumn.
PS: I am a user of Spark 1.6.x, so I am not sure whether this has changed in Spark 2.x.
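If the goal is simply to append df2's single column to caseClassDF and there is no join key, one workaround (a sketch, assuming both DataFrames have the same number of rows and the same partitioning, which RDD.zip requires) is to zip the underlying RDDs and rebuild a DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// pair the rows positionally and concatenate their fields
val zippedRows = caseClassDF.rdd.zip(df2.rdd).map {
  case (left, right) => Row.fromSeq(left.toSeq ++ right.toSeq)
}

val combinedSchema = StructType(caseClassDF.schema ++ df2.schema)
val withCounts = sqlContext.createDataFrame(zippedRows, combinedSchema)
  .withColumnRenamed("Counts", "duration")

withCounts.show()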
I want to filter the dataset only to contain the record which can be found in MySQL.
Here is the Dataset:
dataset.show()
+---+-----+
| id| name|
+---+-----+
|  1|    a|
|  2|    b|
|  3|    c|
+---+-----+
And here is the table in MySQL:
+---+-----+
| id| name|
+---+-----+
|  1|    a|
|  3|    c|
|  4|    d|
+---+-----+
This is my code (running in spark-shell):
import java.util.Properties
case class App(id: Int, name: String)
val data = sc.parallelize(Array((1, "a"), (2, "b"), (3, "c")))
val dataFrame = data.map { case (id, name) => App(id, name) }.toDF
val dataset = dataFrame.as[App]
val url = "jdbc:mysql://ip:port/tbl_name"
val table = "my_tbl_name"
val user = "my_user_name"
val password = "my_password"
val properties = new Properties()
properties.setProperty("user", user)
properties.setProperty("password", password)
dataset.filter((x: App) =>
  0 != sqlContext.read.jdbc(url, table, Array("id = " + x.id.toString), properties).count).show()
But I get a java.lang.NullPointerException:
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:362)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
I have tested
val x = App(1, "aa")
sqlContext.read.jdbc(url, table, Array("id = " + x.id.toString), properties).count
val y = App(5, "aa")
sqlContext.read.jdbc(url, table, Array("id = " + y.id.toString), properties).count
and I can get the right results, 1 and 0.
What's the problem with filter?
What's the problem with filter?
You get an exception because you're trying to execute an action (count on a DataFrame) inside a transformation (filter). Nested actions and transformations are not supported in Spark.
The correct solution is, as usual, either a join on compatible data structures, a lookup using a local data structure, or a query issued directly against the external system (without using Spark data structures).
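A minimal sketch of the join-based approach (assuming the whole MySQL table can be read through the JDBC source and that id is the lookup column): read the table once and keep only the matching rows with a left semi join.
// read the MySQL table once through the JDBC data source
val mysqlDF = sqlContext.read.jdbc(url, table, properties).select("id")

// keep only the rows whose id exists in MySQL
val filtered = dataFrame.join(mysqlDF, Seq("id"), "leftsemi")
filtered.show()
For the sample data this keeps the rows with id 1 and 3.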