Spark dataset alias column on-the-fly like for a dataframe - apache-spark

May be a really silly question, but for:
val ds3 = ds.groupBy($"ip")
.avg("humidity")
it is not clear how for a dataset, not dataframe, how I can rename the column like using alias on-the-fly. I tried a few things but to no avail. No errors when trying, but no effect.
I would like "avg_humidity" as col name.
Extending the question, what if I issue:
val ds3 = ds.groupBy($"ip")
.avg()
How to handle that?

avg does not provide an alias func you might need an extra withColumnRenamed
val ds3 = ds.groupBy($"ip")
.avg("humidity")
.withColumnRenamed("avg(humidity)","avg_humidity")
instead you can use .agg(avg("humidity").as("avg_humidity"))
val ds3 = ds.groupBy($"ip").agg(avg("humidity").as("avg_humidity"))

groupBy(cols: Column*) returns a RelationalGroupedDataset.
The return type for avg(colNames: String*) on it is a DataFrame, so by using as(alias: String) you're simply assigning alias to a new DataFrame, not to a column(s).
SO discussion about renaming columns in a DataFrame is here.

Related

When to use map function on spark in transforming values

I'm new to spark and working with it. Previously I worked with python and pandas, pandas has a map function which is often used to apply transformation on columns. I found out that spark also have map function as well but until now I haven't used it at all except for extracting values like this df.select("id").map(r => r.getString(0)).collect.toList
import spark.implicits._
val df3 = df2.map(row=>{
val util = new Util()
val fullName = row.getString(0) +row.getString(1) +row.getString(2)
(fullName, row.getString(3),row.getInt(5))
})
val df3Map = df3.toDF("fullName","id","salary")
my questions are,
is it common to use map function to transform dataframe columns?
is it common to use map like block of code above? source from sparkbyexamples
when do people usually use map?

How to convert from dataframe to RDD and back with a case class [duplicate]

I am trying to convert a dataframe of multiple case classes to an rdd of these multiple cases classes. I cant find any solution. This wrappedArray has drived me crazy :P
For example, assuming I am having the following:
case class randomClass(a:String,b: Double)
case class randomClass2(a:String,b: Seq[randomClass])
case class randomClass3(a:String,b:String)
val anRDD = sc.parallelize(Seq(
(randomClass2("a",Seq(randomClass("a1",1.1),randomClass("a2",1.1))),randomClass3("aa","aaa")),
(randomClass2("b",Seq(randomClass("b1",1.2),randomClass("b2",1.2))),randomClass3("bb","bbb")),
(randomClass2("c",Seq(randomClass("c1",3.2),randomClass("c2",1.2))),randomClass3("cc","Ccc"))))
val aDF = anRDD.toDF()
Assuming that I am having the aDF how can I get the anRDD???
I tried something like this just to get the second column but it was giving an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
Spark DatataFrame / Dataset[Row] represents data as the Row objects using mapping described in Spark SQL, DataFrames and Datasets Guide Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As commented by Tzach Zohar to get back a full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
I don't know the scala API but have you considered the rdd value?
Maybe something like :
aDR.rdd.map { case r:Row => r.getAs[randomClass3]("_2")}

Read parquet into spark dataset ignoring missing fields [duplicate]

This question already has an answer here:
Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)
(1 answer)
Closed 5 years ago.
Lets assume I create a parquet file as follows :
case class A (i:Int,j:Double,s:String)
var l1 = List(A(1,2.0,"s1"),A(2,3.0,"S2"))
val ds = spark.createDataset(l1)
ds.write.parquet("/tmp/test.parquet")
Is it possible to read it into a Dataset of a type with a different schema, where the only difference is few additional fields?
Eg:
case class B (i:Int,j:Double,s:String,d:Double=1.0) // d is extra and has a default value
Is there a way that i can make this work? :
val ds2 = spark.read.parquet("/tmp/test.parquet").as[B]
In Spark, if the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required. It means for the following code to work:
val ds2 = spark.read.parquet("/tmp/test.parquet").as[B]
Following modifications needs to be done:
val ds2 = spark.read.parquet("/tmp/test.parquet").withColumn("d", lit(1D)).as[B]
Or, if creating additional column is not possible, then following can be done:
val ds2 = spark.read.parquet("/tmp/test.parquet").map{
case row => B(row.getInt(0), row.getDouble(1), row.getString(2))
}

Spark Rename Dataframe Columns

I have 2 files in HDFS - one is a csv file with no header and one is a list of column names. I'm wondering if it's possible to assign the column names to the other data frame without actually typing them out like described here.
I'm looking for something like this:
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "\t").load("/user/training_data.txt")
val header = sqlContext.read.format("com.databricks.spark.csv").option("delimiter", ",").load("/user/col_names.txt")
df.columns(header)
Is this possible?
One way could be to read the header file using scala.io like this:
import scala.io.Source
val header = Source.fromFile("/user/col_names.txt").getLines.map(_.split(","))
val newNames = header.next
Then, read the CSV file using spark-csv as you do, specifying no header and converting the names like:
val df = spark.read.format("com.databricks.spark.csv")
.option("header", "false").option("delimiter", "\t")
.load("/user/training_data.txt").toDF(newNames: _*)
notice the _* type annotation.
The _* is type ascription in Scala (meaning that we can give a list as argument, and it will still work, applying the same function to each member of the-said list)
more here: What is the purpose of type ascriptions in Scala?

How to read a nested collection in Spark

I have a parquet table with one of the columns being
, array<struct<col1,col2,..colN>>
Can run queries against this table in Hive using LATERAL VIEW syntax.
How to read this table into an RDD, and more importantly how to filter, map etc this nested collection in Spark?
Could not find any references to this in Spark documentation. Thanks in advance for any information!
ps. I felt might be helpful to give some stats on the table.
Number of columns in main table ~600. Number of rows ~200m.
Number of "columns" in nested collection ~10. Avg number of records in nested collection ~35.
There is no magic in the case of nested collection. Spark will handle the same way a RDD[(String, String)] and a RDD[(String, Seq[String])].
Reading such nested collection from Parquet files can be tricky, though.
Let's take an example from the spark-shell (1.3.1):
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> case class Inner(a: String, b: String)
defined class Inner
scala> case class Outer(key: String, inners: Seq[Inner])
defined class Outer
Write the parquet file:
scala> val outers = sc.parallelize(List(Outer("k1", List(Inner("a", "b")))))
outers: org.apache.spark.rdd.RDD[Outer] = ParallelCollectionRDD[0] at parallelize at <console>:25
scala> outers.toDF.saveAsParquetFile("outers.parquet")
Read the parquet file:
scala> import org.apache.spark.sql.catalyst.expressions.Row
import org.apache.spark.sql.catalyst.expressions.Row
scala> val dataFrame = sqlContext.parquetFile("outers.parquet")
dataFrame: org.apache.spark.sql.DataFrame = [key: string, inners: array<struct<a:string,b:string>>]
scala> val outers = dataFrame.map { row =>
| val key = row.getString(0)
| val inners = row.getAs[Seq[Row]](1).map(r => Inner(r.getString(0), r.getString(1)))
| Outer(key, inners)
| }
outers: org.apache.spark.rdd.RDD[Outer] = MapPartitionsRDD[8] at map at DataFrame.scala:848
The important part is row.getAs[Seq[Row]](1). The internal representation of a nested sequence of struct is ArrayBuffer[Row], you could use any super-type of it instead of Seq[Row]. The 1 is the column index in the outer row. I used the method getAs here but there are alternatives in the latest versions of Spark. See the source code of the Row trait.
Now that you have a RDD[Outer], you can apply any wanted transformation or action.
// Filter the outers
outers.filter(_.inners.nonEmpty)
// Filter the inners
outers.map(outer => outer.copy(inners = outer.inners.filter(_.a == "a")))
Note that we used the spark-SQL library only to read the parquet file. You could for example select only the wanted columns directly on the DataFrame, before mapping it to a RDD.
dataFrame.select('col1, 'col2).map { row => ... }
I'll give a Python-based answer since that's what I'm using. I think Scala has something similar.
The explode function was added in Spark 1.4.0 to handle nested arrays in DataFrames, according to the Python API docs.
Create a test dataframe:
from pyspark.sql import Row
df = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3]), Row(a=2, intlist=[4,5,6])])
df.show()
## +-+--------------------+
## |a| intlist|
## +-+--------------------+
## |1|ArrayBuffer(1, 2, 3)|
## |2|ArrayBuffer(4, 5, 6)|
## +-+--------------------+
Use explode to flatten the list column:
from pyspark.sql.functions import explode
df.select(df.a, explode(df.intlist)).show()
## +-+---+
## |a|_c0|
## +-+---+
## |1| 1|
## |1| 2|
## |1| 3|
## |2| 4|
## |2| 5|
## |2| 6|
## +-+---+
Another approach would be using pattern matching like this:
val rdd: RDD[(String, List[(String, String)]] = dataFrame.map(_.toSeq.toList match {
case List(key: String, inners: Seq[Row]) => key -> inners.map(_.toSeq.toList match {
case List(a:String, b: String) => (a, b)
}).toList
})
You can pattern match directly on Row but it is likely to fail for a few reasons.
Above answers are all great answers and tackle this question from different sides; Spark SQL is also quite useful way to access nested data.
Here's example how to use explode() in SQL directly to query nested collection.
SELECT hholdid, tsp.person_seq_no
FROM ( SELECT hholdid, explode(tsp_ids) as tsp
FROM disc_mrt.unified_fact uf
)
tsp_ids is a nested of structs, which has many attributes, including person_seq_no which I'm selecting in the outer query above.
Above was tested in Spark 2.0. I did a small test and it doesn't work in Spark 1.6. This question was asked when Spark 2 wasn't around, so this answer adds nicely to the list of available options to deal with nested structures.
Have a look also on following JIRAs for Hive-compatible way to query nested data using LATERAL VIEW OUTER syntax, since Spark 2.2 also supports OUTER explode (e.g. when a nested collection is empty, but you still want to have attributes from a parent record):
SPARK-13721: Add support for LATERAL VIEW OUTER explode()
Noticable not resolved JIRA on explode() for SQL access:
SPARK-7549: Support aggregating over nested fields

Resources