best way to write Spark Dataset[U] - apache-spark

OK, so I am aware that Dataset.as[U] only changes the view of the DataFrame for typed operations,
as seen in this example:
case class One(one: Int)
val df = Seq(
(1,2,3),
(11,22,33),
(111,222,333)
).toDF("one", "two", "three")
val ds : Dataset[One] = df.as[One]
ds.show
prints
+----+----+-----+
| one| two|three|
+----+----+-----+
| 1| 2| 3|
| 11| 22| 33|
| 111| 222| 333|
+----+----+-----+
This is totally fine and works in my favor most of the time. But now I need to write that ds to disk with only column one.
To enforce the schema I could call
.map(x => x). Because this is a typed operation, the case class schema takes effect: the result is still a Dataset[One], but the underlying data is reduced to column one. Judging by the execution plan, though, this looks awfully expensive:
== Physical Plan ==
*SerializeFromObject [assertnotnull(input[0, $line2012488405320.$read$$iw$$iw$One, true]).one AS one#9408]
+- *MapElements <function1>, obj#9407: $line2012488405320.$read$$iw$$iw$One
+- *DeserializeToObject newInstance(class $line2012488405320.$read$$iw$$iw$One), obj#9406: $line2012488405320.$read$$iw$$iw$One
+- LocalTableScan [one#9391]
What alternative implementations would achieve the following?
ds.show
+----+
| one|
+----+
| 1|
| 11|
| 111|
+----+
UPDATE 1
I was thinking of a general solution to the problem. Maybe something along these lines?:
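import scala.reflect.runtime.universe._
import org.apache.spark.sql.{Column, DataFrame, Dataset}
import org.apache.spark.sql.functions.col
// Names of the case class accessors of T, i.e. its constructor fields.
def caseClassAccessorNames[T <: Product](implicit tag: TypeTag[T]) = {
  typeOf[T]
    .members
    .collect {
      case m: MethodSymbol if m.isCaseAccessor => m.name
    }
    .map(m => m.toString)
}
// Select only the columns that T declares, then restore the typed view.
def project[T <: Product](ds: Dataset[T])(implicit tag: TypeTag[T]): Dataset[T] = {
  import ds.sparkSession.implicits._
  val columnsOfT: Seq[Column] =
    caseClassAccessorNames[T]
      .map(col)(scala.collection.breakOut)
  val t: DataFrame = ds.select(columnsOfT: _*)
  t.as[T]
}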
I managed to get this working for the trivial example, but need to evaluate it further. I wonder if there are alternative, maybe built-in, ways to achieve something like this?
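One alternative that avoids reflection, sketched here in spark-shell style rather than taken from the post, is to read the column names from the encoder's schema and select them before calling as[T]. The helper name projectToSchema, the app name and the local SparkSession are assumptions made for this example, and the sketch only handles top-level columns.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.col

case class One(one: Int)

// Hypothetical helper: keep only the columns named in T's encoder schema.
// select is an untyped projection, so no map over deserialized objects is needed.
def projectToSchema[T](ds: Dataset[T])(implicit enc: Encoder[T]): Dataset[T] =
  ds.select(enc.schema.fieldNames.map(col): _*).as[T]

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("projection-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, 2, 3), (11, 22, 33), (111, 222, 333)).toDF("one", "two", "three")
val projected = projectToSchema(df.as[One])

projected.show() // only column "one" remains
```

Calling projected.explain() should show whether the plan is a plain projection without the SerializeFromObject / DeserializeToObject steps seen above.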

Related

Tree/nested structures in Spark from relational data model

If I understand correctly, I can think of a Spark Dataset as a list of objects of type T. How can two datasets be joined so that each parent contains a list of its children, where each child in turn has a list of its own children?
One approach would be to groupBy the children on the key, but collect_list returns only one column, and I suppose there is a better way to do this.
The desired result is basically a dataset (a list of customer objects?) of type Customer, with additions:
Each customer would have a list of invoices.
Each invoice would have its own attributes, but also a list of items inside...
...and this can continue (a tree)
The end result would then be something like
case class Customer(customer_id: Int, name: String, address: String, age: Int, invoices: List[Invoice])
case class Invoice(invoice_id: Int, customer_id: Int, invoice_num:String, date: Int, invoice_type: String, items: List[InvoiceItem])
I need to get to that result from the following inputs:
case class Customer(customer_id: Int, name: String, address: String, age: Int)
case class Invoice(invoice_id: Int, customer_id: Int, invoice_num:String, date: Int, invoice_type: String)
case class InvoiceItem(item_id: Int, invoice_id: Int, num_of_items: Int, price: Double, total: Double)
val customers_df = Seq(
(11,"customer1", "address1", 10, "F")
,(12,"customer2", "address2", 20, "M")
,(13,"customer3", "address3", 30, "F")
).toDF("customer_id", "name", "address", "age", "sex")
val customers_ds = customers_df.as[Customer].as("c")
customers_ds.show
val invoices_df = Seq(
(21,11, "10101/1", 20181105, "manual")
,(22,11, "10101/2", 20181105, "manual")
,(23,11, "10101/3", 20181105, "manual")
,(24,12, "10101/4", 20181105, "generated")
,(25,12, "10101/5", 20181105, "pos")
).toDF("invoice_id", "customer_id", "invoice_num", "date", "invoice_type")
val invoices_ds = invoices_df.as[Invoice].as("i")
invoices_ds.show
val invoice_items_df = Seq(
(31, 21, 5, 10.0, 50.0)
,(32, 21, 3, 15.0, 45.0)
,(33, 22, 6, 11.0, 66.0)
,(34, 22, 7, 2.0, 14.0)
,(35, 23, 1, 100.0, 100.0)
,(36, 24, 4, 4.0, 16.0)
).toDF("item_id", "invoice_id", "num_of_items", "price", "total")
val invoice_items_ds = invoice_items_df.as[InvoiceItem].as("ii")
invoice_items_ds.show
In tables it looks like this:
+-----------+---------+--------+---+---+
|customer_id| name| address|age|sex|
+-----------+---------+--------+---+---+
| 11|customer1|address1| 10| F|
| 12|customer2|address2| 20| M|
| 13|customer3|address3| 30| F|
+-----------+---------+--------+---+---+
+----------+-----------+-----------+--------+------------+
|invoice_id|customer_id|invoice_num| date|invoice_type|
+----------+-----------+-----------+--------+------------+
| 21| 11| 10101/1|20181105| manual|
| 22| 11| 10101/2|20181105| manual|
| 23| 11| 10101/3|20181105| manual|
| 24| 12| 10101/4|20181105| generated|
| 25| 12| 10101/5|20181105| pos|
+----------+-----------+-----------+--------+------------+
+-------+----------+------------+-----+-----+
|item_id|invoice_id|num_of_items|price|total|
+-------+----------+------------+-----+-----+
| 31| 21| 5| 10.0| 50.0|
| 32| 21| 3| 15.0| 45.0|
| 33| 22| 6| 11.0| 66.0|
| 34| 22| 7| 2.0| 14.0|
| 35| 23| 1|100.0|100.0|
| 36| 24| 4| 4.0| 16.0|
+-------+----------+------------+-----+-----+
It seems you are trying to read normalized data into a tree of Scala objects. You can certainly do this with Spark, but Spark may not be the optimal tool for it. If the data is small enough to fit in memory, which I assume is true from your question, object-relational mapping (ORM) libraries may be better suited to the job.
If you still want to use Spark, you are on the right path with groupBy and collect_list. What you are missing is the struct() function.
case class Customer(id: Int)
case class Invoice(id: Int, customer_id: Int)
val customers = spark.createDataset(Seq(Customer(1))).as("customers")
val invoices = spark.createDataset(Seq(Invoice(1, 1), Invoice(2, 1)))
case class CombinedCustomer(id: Int, invoices: Option[Seq[Invoice]])
customers
.join(
invoices
.groupBy('customer_id)
.agg(collect_list(struct('*)).as("invoices"))
.withColumnRenamed("customer_id", "id"),
Seq("id"), "left_outer")
.as[CombinedCustomer]
.show
struct('*) builds a StructType column from the entire row. You can also pick any columns, e.g., struct('x.as("colA"), 'colB).
This produces
+---+----------------+
| id| invoices|
+---+----------------+
| 1|[[1, 1], [2, 1]]|
+---+----------------+
Now, in the case where the customer data is expected not to fit in memory, i.e., using a simple collect is not an option, there are a number of different strategies you can take.
The simplest, and one you should consider instead of collecting to the driver, requires that independent processing of each customer's data is acceptable. In that case, try using map and distribute the per-customer processing logic to the workers.
If independent processing by customer is not acceptable, the general strategy is as follows:
1. Aggregate the data into structured rows as needed using the above approach.
2. Repartition the data to ensure that everything you need for processing is in a single partition.
3. (Optionally) sortWithinPartitions to ensure that the data within a partition is ordered as you need it.
4. Use mapPartitions.
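The repartition, sortWithinPartitions and mapPartitions steps could look roughly like the following spark-shell style sketch. The reduced InvoiceKey case class, the sample rows and the per-customer grouping logic are assumptions made for illustration; real processing would replace the body of mapPartitions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical, reduced invoice type used only for this sketch.
case class InvoiceKey(invoice_id: Int, customer_id: Int)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("map-partitions-sketch")
  .getOrCreate()
import spark.implicits._

val invoices = Seq(
  InvoiceKey(21, 11), InvoiceKey(22, 11), InvoiceKey(23, 11),
  InvoiceKey(24, 12), InvoiceKey(25, 12)
).toDS()

val invoiceIdsPerCustomer = invoices
  .repartition($"customer_id")                         // all rows of one customer share a partition
  .sortWithinPartitions($"customer_id", $"invoice_id") // a customer's rows become contiguous and ordered
  .mapPartitions { rows =>
    val it = rows.buffered
    // Walk the sorted rows once, emitting one (customer, invoice ids) pair per customer.
    new Iterator[(Int, Seq[Int])] {
      def hasNext: Boolean = it.hasNext
      def next(): (Int, Seq[Int]) = {
        val customer = it.head.customer_id
        val ids = scala.collection.mutable.ArrayBuffer.empty[Int]
        while (it.hasNext && it.head.customer_id == customer) ids += it.next().invoice_id
        (customer, ids.toList)
      }
    }
  }

invoiceIdsPerCustomer.show()
```

Because the iterator is consumed lazily, only one customer's rows are buffered at a time, which is the point of sorting within partitions first.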
Alternatively, you can use Spark SQL with one dataset each for customers, invoices and items.
You can then use joins and aggregate functions across these datasets to get the desired output.
The performance difference between SQL-style queries and the programmatic API in Spark SQL is negligible.

Spark: DataFrame Aggregation (Scala)

I have the following requirement to aggregate data in a Spark DataFrame in Scala.
I have two datasets.
Dataset 1 contains values (val1, val2, ...) for each "t" type, spread across several columns (t1, t2, ...).
val data1 = Seq(
("1","111",200,"221",100,"331",1000),
("2","112",400,"222",500,"332",1000),
("3","113",600,"223",1000,"333",1000)
).toDF("id1","t1","val1","t2","val2","t3","val3")
data1.show()
+---+---+----+---+----+---+----+
|id1| t1|val1| t2|val2| t3|val3|
+---+---+----+---+----+---+----+
| 1|111| 200|221| 100|331|1000|
| 2|112| 400|222| 500|332|1000|
| 3|113| 600|223|1000|333|1000|
+---+---+----+---+----+---+----+
Dataset 2 represents the same thing with a separate row for each "t" type.
val data2 = Seq(("1","111",200),("1","221",100),("1","331",1000),
("2","112",400),("2","222",500),("2","332",1000),
("3","113",600),("3","223",1000), ("3","333",1000)
).toDF("id*","t*","val*")
data2.show()
+---+---+----+
|id*| t*|val*|
+---+---+----+
| 1|111| 200|
| 1|221| 100|
| 1|331|1000|
| 2|112| 400|
| 2|222| 500|
| 2|332|1000|
| 3|113| 600|
| 3|223|1000|
| 3|333|1000|
+---+---+----+
Now I need to group by the (id, t, t*) fields and print the balances sum(val) and sum(val*) as separate records.
Both balances should be equal.
My output should look like this:
+---+---+--------+---+---------+
|id1| t |sum(val)| t*|sum(val*)|
+---+---+--------+---+---------+
| 1|111| 200|111| 200|
| 1|221| 100|221| 100|
| 1|331| 1000|331| 1000|
| 2|112| 400|112| 400|
| 2|222| 500|222| 500|
| 2|332| 1000|332| 1000|
| 3|113| 600|113| 600|
| 3|223| 1000|223| 1000|
| 3|333| 1000|333| 1000|
+---+---+--------+---+---------+
I'm thinking of exploding dataset 1 into multiple records, one for each "t" type, and then joining with dataset 2.
Could you suggest a better approach that would not hurt performance as the datasets grow?
The simplest solution is to do sub-selects and then union the datasets:
val ts = Seq(1, 2, 3)
// One (t, val) projection per column pair: (t1, val1), (t2, val2), (t3, val3)
val dfs = ts.map(t => data1.select(col("t" + t) as "t", col("val" + t) as "val"))
val unioned = dfs.reduce((l, r) => l.union(r))
val ds = unioned.join(data2, 't === col("t*"))
// aggregate here
You can also try array with explode:
// Keep all original columns so that id1 and val1..val3 remain available below
val df1 = data1.withColumn("colList", array('t1, 't2, 't3))
.withColumn("t", explode('colList))
val ds = df1.withColumn("val",
when('t === 't1, 'val1)
.when('t === 't2, 'val2)
.when('t === 't3, 'val3)
.otherwise(0))
The last step is to join this Dataset with data2:
ds.join(data2, 't === col("t*"))
.groupBy("t", "t*")
.agg(first("id1") as "id1", sum("val"), sum("val*"))
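Another way to reshape data1 into one row per "t" type is Spark SQL's built-in stack generator, which does the unpivot in a single projection. This is a spark-shell style sketch and not part of the answer above; the output column names sum_val, long1 and sums, and the local SparkSession, are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("stack-unpivot-sketch")
  .getOrCreate()
import spark.implicits._

val data1 = Seq(
  ("1", "111", 200, "221", 100, "331", 1000),
  ("2", "112", 400, "222", 500, "332", 1000),
  ("3", "113", 600, "223", 1000, "333", 1000)
).toDF("id1", "t1", "val1", "t2", "val2", "t3", "val3")

// stack(3, ...) emits three (t, val) rows for every input row.
val long1 = data1.selectExpr(
  "id1",
  "stack(3, t1, val1, t2, val2, t3, val3) as (t, val)"
)

// Balance per (id1, t); this has the same shape as data2 and can be joined to it.
val sums = long1.groupBy("id1", "t").agg(sum("val").as("sum_val"))
sums.show()
```

The resulting sums DataFrame can then be joined with the aggregated data2 on the id and t columns to compare the two balances.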

create column with a running total in a Spark Dataset

Suppose we have a Spark Dataset with two columns, say Index and Value, sorted by the first column (Index).
((1, 100), (2, 110), (3, 90), ...)
We'd like to have a Dataset with a third column with a running total of the values in the second column (Value).
((1, 100, 100), (2, 110, 210), (3, 90, 300), ...)
Any suggestions how to do this efficiently, with one pass through the data? Or are there any canned CDF type functions out there that could be utilized for this?
If need be, the Dataset can be converted to a Dataframe or an RDD to accomplish the task, but it will have to remain a distributed data structure. That is, it cannot be simply collected and turned to an array or sequence, and no mutable variables are to be used (val only, no var).
Regarding "but it will have to remain a distributed data structure":
Unfortunately what you've said you seek to do isn't possible in Spark. If you are willing to repartition the data set to a single partition (in effect consolidating it on a single host) you could easily write a function to do what you wish, keeping the incremented value as a field.
Since Spark functions don't share state across the network when they execute, there's no way to create the shared state you would need to keep the data set completely distributed.
If you're willing to relax your requirement and allow the data to be consolidated and read through in a single pass on one host then you may do what you wish by repartitioning to a single partition and applying a function. This does not pull the data onto the driver (keeping it in HDFS/the cluster) but does still compute the output serially, on a single executor. For example:
package com.github.nevernaptitsa
import java.io.Serializable
import java.util
import org.apache.spark.sql.{Encoders, SparkSession}
object SparkTest {
class RunningSum extends Function[Int, Tuple2[Int, Int]] with Serializable {
private var runningSum = 0
override def apply(v1: Int): Tuple2[Int, Int] = {
runningSum+=v1
return (v1, runningSum)
}
}
def main(args: Array[String]): Unit ={
val session = SparkSession.builder()
.appName("runningSumTest")
.master("local[*]")
.getOrCreate()
import session.implicits._
session.createDataset(Seq(1,2,3,4,5))
.repartition(1)
.map(new RunningSum)
.show(5)
session.createDataset(Seq(1,2,3,4,5))
.map(new RunningSum)
.show(5)
}
}
The two statements here show different output, the first providing the correct output (serial, because repartition(1) is called), and the second providing incorrect output because the result is computed in parallel.
Results from first statement:
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 2| 3|
| 3| 6|
| 4| 10|
| 5| 15|
+---+---+
Results from second statement:
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 9|
+---+---+
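For comparison, a running total can also be written with a built-in window function. The sketch below is spark-shell style and uses the Index and Value column names from the question; the RunningTotal name and the local SparkSession are assumptions. Note the caveat: a window with orderBy but no partitionBy makes Spark move all rows into a single partition, so it carries the same trade-off as repartition(1) above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("running-total-sketch")
  .getOrCreate()
import spark.implicits._

val ds = Seq((1, 100), (2, 110), (3, 90)).toDF("Index", "Value")

// Frame: every row from the start of the ordering up to and including the current row.
val upToCurrentRow = Window
  .orderBy($"Index")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val withTotal = ds.withColumn("RunningTotal", sum($"Value").over(upToCurrentRow))
withTotal.show()
```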
A colleague suggested the following which relies on the RDD.mapPartitionsWithIndex() method.
(To my knowledge, the other data structures do not provide this kind of access to their partitions' indices.)
val data = sc.parallelize((1 to 5)) // sc is the SparkContext
val partialSums = data.mapPartitionsWithIndex{ (i, values) =>
Iterator((i, values.sum))
}.collect().toMap // will in general have size other than data.count
val cumSums = data.mapPartitionsWithIndex{ (i, values) =>
val prevSums = (0 until i).map(partialSums).sum
values.scanLeft(prevSums)(_+_).drop(1)
}
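To see why the two passes above give the right answer, here is the same idea in plain Scala collections, with each inner Seq standing in for one partition. The partition layout is made up for illustration; in Spark the layout is whatever the RDD happens to have.

```scala
// Each inner Seq stands in for one RDD partition (illustrative layout).
val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3), Seq(4, 5))

// Pass 1: one partial sum per partition (the small result collected to the driver).
val partialSums: Seq[Int] = partitions.map(_.sum)

// The offset for partition i is the sum of all earlier partitions' partial sums.
val offsets: Seq[Int] = partialSums.scanLeft(0)(_ + _).init

// Pass 2: scan inside each partition, starting from that partition's offset.
val cumSums: Seq[Seq[Int]] = partitions.zip(offsets).map { case (values, offset) =>
  values.scanLeft(offset)(_ + _).drop(1)
}

println(cumSums.flatten.mkString(", ")) // prints 1, 3, 6, 10, 15
```

Only the per-partition sums travel to the driver, so the data itself stays distributed; the price is two passes over the data instead of one.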

Does GraphFrames api support creation of Bipartite graphs?

Does GraphFrames api support creation of Bipartite graphs in the current version?
Current version: 0.1.0
Spark version : 1.6.1
As pointed out in the comments to this question, neither GraphFrames nor GraphX has built-in support for bipartite graphs. However, both are flexible enough to let you create bipartite graphs. For a GraphX solution, see this previous answer. That solution uses a shared trait between the different vertex / object types. While that works with RDDs, it is not going to work for DataFrames. A row in a DataFrame has a fixed schema -- it can't sometimes contain a price column and sometimes not. It can have a price column that's sometimes null, but the column has to exist in every row.
Instead, the solution for GraphFrames seems to be that you need to define a DataFrame that's essentially a linear sub-type of both types of objects in your bipartite graph -- it has to contain all of the fields of both types of objects. This is actually pretty easy -- a join with full_outer is going to give you that. Something like this:
val players = Seq(
(1,"dave", 34),
(2,"griffin", 44)
).toDF("id", "name", "age")
val teams = Seq(
(101,"lions","7-1"),
(102,"tigers","5-3"),
(103,"bears","0-9")
).toDF("id","team","record")
You could then create a super-set DataFrame like this:
val teamPlayer = players.withColumnRenamed("id", "l_id").join(
teams.withColumnRenamed("id", "r_id"),
$"r_id" === $"l_id", "full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
teamPlayer.show
+---+-------+----+------+------+
| id| name| age| team|record|
+---+-------+----+------+------+
|101| null|null| lions| 7-1|
|102| null|null|tigers| 5-3|
|103| null|null| bears| 0-9|
| 1| dave| 34| null| null|
| 2|griffin| 44| null| null|
+---+-------+----+------+------+
You could possibly do it a little cleaner with structs:
val tpStructs = players.select($"id" as "l_id", struct($"name", $"age") as "player").join(
teams.select($"id" as "r_id", struct($"team",$"record") as "team"),
$"l_id" === $"r_id",
"full_outer"
).withColumn("l_id", coalesce($"l_id", $"r_id"))
.drop($"r_id")
.withColumnRenamed("l_id", "id")
tpStructs.show
+---+------------+------------+
| id| player| team|
+---+------------+------------+
|101| null| [lions,7-1]|
|102| null|[tigers,5-3]|
|103| null| [bears,0-9]|
| 1| [dave,34]| null|
| 2|[griffin,44]| null|
+---+------------+------------+
I'll also point out that more or less the same solution would work in GraphX with RDDs. You could always create a vertex via joining two case classes that don't share any traits:
case class Player(name: String, age: Int)
val playerRdd = sc.parallelize(Seq(
(1L, Player("date", 34)),
(2L, Player("griffin", 44))
))
case class Team(team: String, record: String)
val teamRdd = sc.parallelize(Seq(
(101L, Team("lions", "7-1")),
(102L, Team("tigers", "5-3")),
(103L, Team("bears", "0-9"))
))
playerRdd.fullOuterJoin(teamRdd).collect foreach println
(101,(None,Some(Team(lions,7-1))))
(1,(Some(Player(date,34)),None))
(102,(None,Some(Team(tigers,5-3))))
(2,(Some(Player(griffin,44)),None))
(103,(None,Some(Team(bears,0-9))))
With all respect to the previous answer, this seems like a more flexible way to handle it -- without having to share a trait between the combined objects.
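To turn a combined vertex DataFrame into an actual graph, GraphFrames also needs an edges DataFrame with src and dst columns. The following spark-shell style sketch assumes the graphframes package is on the classpath; the kind column, the plays_for edges and the player-to-team assignments are invented for illustration and do not come from the answer.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("bipartite-sketch")
  .getOrCreate()
import spark.implicits._

// Vertices need an "id" column; "kind" marks which side of the bipartite graph a vertex is on.
val vertices = Seq(
  (1, "player"), (2, "player"),
  (101, "team"), (102, "team"), (103, "team")
).toDF("id", "kind")

// Edges need "src" and "dst" columns; these assignments are made up.
val edges = Seq(
  (1, 101, "plays_for"),
  (2, 102, "plays_for")
).toDF("src", "dst", "relationship")

val graph = GraphFrame(vertices, edges)

// The graph is bipartite in the intended sense if every edge runs from a player to a team.
val crossingEdges = graph.triplets
  .filter($"src.kind" === "player" && $"dst.kind" === "team")
  .count()
println(crossingEdges == graph.edges.count())
```

Nothing in GraphFrames enforces the two-sided structure, so a check like the one at the end has to be run by the application itself.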

Spark, DataFrame: apply transformer/estimator on groups

I have a DataFrame that looks as follows:
+-----------+-----+------------+
| userID|group| features|
+-----------+-----+------------+
|12462563356| 1| [5.0,43.0]|
|12462563701| 2| [1.0,8.0]|
|12462563701| 1| [2.0,12.0]|
|12462564356| 1| [1.0,1.0]|
|12462565487| 3| [2.0,3.0]|
|12462565698| 2| [1.0,1.0]|
|12462565698| 1| [1.0,1.0]|
|12462566081| 2| [1.0,2.0]|
|12462566081| 1| [1.0,15.0]|
|12462566225| 2| [1.0,1.0]|
|12462566225| 1| [9.0,85.0]|
|12462566526| 2| [1.0,1.0]|
|12462566526| 1| [3.0,79.0]|
|12462567006| 2| [11.0,15.0]|
|12462567006| 1| [10.0,15.0]|
|12462567006| 3| [10.0,15.0]|
|12462586595| 2| [2.0,42.0]|
|12462586595| 3| [2.0,16.0]|
|12462589343| 3| [1.0,1.0]|
+-----------+-----+------------+
The column types are: userID: Long, group: Int, and features: Vector.
This is already a grouped DataFrame, i.e. a userID appears in a particular group at most once.
My goal is to scale the features column per group.
Is there a way to apply a feature transformer (in my case a StandardScaler) per group instead of applying it to the full DataFrame?
P.S. Using ML is not mandatory, so a solution based on MLlib is fine too.
Compute statistics
Spark >= 3.0
Summarizer now supports standard deviations, so:
val summary = data
.groupBy($"group")
.agg(Summarizer.metrics("mean", "std")
.summary($"features").alias("stats"))
.as[(Int, (Vector, Vector))]
.collect.toMap
Spark >= 2.3
In Spark 2.3 or later you could also use Summarizer:
import org.apache.spark.ml.stat.Summarizer
val summaryVar = data
.groupBy($"group")
.agg(Summarizer.metrics("mean", "variance")
.summary($"features").alias("stats"))
.as[(Int, (Vector, Vector))]
.collect.toMap
and adjust downstream code to handle variances instead of standard deviations.
Spark < 2.0, Spark < 2.3 with adjustments for conversions between ml and mllib Vectors.
You can compute statistics by group using almost the same code as default Scaler:
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
// Compute Multivariate Statistics
val summary = data.select($"group", $"features")
.rdd
.map {
case Row(group: Int, features: Vector) => (group, features)
}
.aggregateByKey(new MultivariateOnlineSummarizer)( /* zero value: an empty summarizer */
(agg, v) => agg.add(v), /* seqOp: add a sample vector to the summarizer and update its statistics */
(agg1, agg2) => agg1.merge(agg2)) /* combOp: merge two summarizers and update the statistics */
.mapValues {
s => (
s.variance.toArray.map(math.sqrt(_)), /* standard deviation: square root of the variance, per key */
s.mean.toArray /* mean, per key */
)
}.collectAsMap
Transformation
If expected number of groups is relatively low you can broadcast these:
val summaryBd = sc.broadcast(summary)
and transform your data:
val scaledRows = data.rdd.map { case Row(userID, group: Int, features: Vector) =>
val (stdev, mean) = summaryBd.value(group)
val vs = features.toArray.clone()
for (i <- 0 until vs.size) {
// Guard against zero variance; otherwise standardize: (x - mean) / stdev
vs(i) = if (stdev(i) == 0.0) 0.0 else (vs(i) - mean(i)) * (1 / stdev(i))
}
Row(userID, group, Vectors.dense(vs))
}
val scaledDf = sqlContext.createDataFrame(scaledRows, data.schema)
Otherwise you can simply join. It shouldn't be hard to wrap this as an ML transformer with the group column as a param.
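A join-based variant might look like the spark-shell style sketch below. It assumes Spark 3.0 or later (for the "std" metric of Summarizer) and ml rather than mllib vectors; the sample rows, the standardize UDF and the intermediate column names are assumptions made for this example.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-scaling-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up sample in the shape of the question's DataFrame.
val data = Seq(
  (1L, 1, Vectors.dense(1.0, 10.0)),
  (2L, 1, Vectors.dense(3.0, 10.0)),
  (3L, 2, Vectors.dense(5.0, 7.0))
).toDF("userID", "group", "features")

// Per-group mean and standard deviation, kept as a DataFrame instead of a broadcast map.
val stats = data
  .groupBy($"group")
  .agg(Summarizer.metrics("mean", "std").summary($"features").alias("stats"))
  .select($"group", $"stats.mean".alias("mean"), $"stats.std".alias("std"))

// Hypothetical UDF: (x - mean) / std per component, with 0.0 where std is zero.
val standardize = udf { (v: Vector, mean: Vector, std: Vector) =>
  Vectors.dense(Array.tabulate(v.size) { i =>
    if (std(i) == 0.0) 0.0 else (v(i) - mean(i)) / std(i)
  })
}

val scaled = data
  .join(stats, Seq("group"))
  .withColumn("features", standardize($"features", $"mean", $"std"))
  .drop("mean", "std")

scaled.show(false)
```

Compared with the broadcast version, the join keeps the statistics distributed, which matters mainly when the number of groups is large.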

Resources