When to use map function on spark in transforming values - apache-spark

I'm new to spark and working with it. Previously I worked with python and pandas, pandas has a map function which is often used to apply transformation on columns. I found out that spark also have map function as well but until now I haven't used it at all except for extracting values like this df.select("id").map(r => r.getString(0)).collect.toList
import spark.implicits._
val df3 = df2.map(row=>{
val util = new Util()
val fullName = row.getString(0) +row.getString(1) +row.getString(2)
(fullName, row.getString(3),row.getInt(5))
})
val df3Map = df3.toDF("fullName","id","salary")
my questions are,
is it common to use map function to transform dataframe columns?
is it common to use map like block of code above? source from sparkbyexamples
when do people usually use map?

Related

Higher Order functions in Spark SQL

Can anyone please explain the transform() and filter() in Spark Sql 2.4 with some advanced real-world use-case examples ?
In a sql query, is this only to be used with array columns or it can also be applied to any column type in general. It would be great if anyone could demonstrate with a sql query for an advanced application.
Thanks in advance.
Not going down the .filter road as I cannot see the focus there.
For .transform
dataframe transform at DF-level
transform on an array of a DF in v 2.4
transform on an array of a DF in v 3
The following:
dataframe transform
From the official docs https://kb.databricks.com/data/chained-transformations.html transform on DF can end up like spaghetti. Opinion can differ here.
This they say is messy:
...
def inc(i: Int) = i + 1
val tmp0 = func0(inc, 3)(testDf)
val tmp1 = func1(1)(tmp0)
val tmp2 = func2(2)(tmp1)
val res = tmp2.withColumn("col3", expr("col2 + 3"))
compared to:
val res = testDf.transform(func0(inc, 4))
.transform(func1(1))
.transform(func2(2))
.withColumn("col3", expr("col2 + 3"))
transform with lambda function on an array of a DF in v 2.4 which needs the select and expr combination
import org.apache.spark.sql.functions._
val df = Seq(Seq(Array(1,999),Array(2,9999)),
Seq(Array(10,888),Array(20,8888))).toDF("c1")
val df2 = df.select(expr("transform(c1, x -> x[1])").as("last_vals"))
transform with lambda function new array function on a DF in v 3 using withColumn
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
val df = Seq(
(Array("New York", "Seattle")),
(Array("Barcelona", "Bangalore"))
).toDF("cities")
val df2 = df.withColumn("fun_cities", transform(col("cities"),
(col: Column) => concat(col, lit(" is fun!"))))
Try them.
Final note and excellent point raised (from https://mungingdata.com/spark-3/array-exists-forall-transform-aggregate-zip_with/):
transform works similar to the map function in Scala. I’m not sure why
they chose to name this function transform… I think array_map would
have been a better name, especially because the Dataset#transform
function is commonly used to chain DataFrame transformations.
Update
If wanting to use %sql or display approach for Higher Order Functions, then consult this: https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html

spark override the dataframe variable without using var

I have one API which perform delete operation on dataframe like below
def deleteColmns(df:DataFrame,clmList :List[org.apache.spark.sql.Column]):DataFrame{
var ddf:DataFrame = null
for(clm<-clmList){
ddf.drop(clm)
}
return ddf
}
Since it is not good practice to use the var in functional programming , how to avoid this situation
With Spark >2.0, you can drop multiple columns using a sequence of column name :
val clmList: Seq[Column] = _
val strList: Seq[String] = clmList.map(c => s"$c")
df.drop(strList: _*)
Otherwise, you can always use foldLeft to fold left on the DataFrame and drop your columns :
clmList.foldLeft(df)((acc, c) => acc.drop(c))
I hope this helps.

HBase batch get with spark scala

I am trying to fetch data from HBase based on a list of row keys, in the API document there is a method called get(List gets), I am trying to use that, however the compiler is complaining something like this, does anyone had this experiance
overloaded method value get with alternatives: (x$1: java.util.List[org.apache.hadoop.hbase.client.Get])Array[org.apache.hadoop.hbase.client.Result] <and> (x$1: org.apache.hadoop.hbase.client.Get)org.apache.hadoop.hbase.client.Result cannot be applied to (List[org.apache.hadoop.hbase.client.Get])
The code I tried.
val keys: List[String] = df.select("id").rdd.map(r => r.getString(0)).collect.toList
val gets:List[Get]=keys.map(x=> new Get(Bytes.toBytes(x)))
val results = hTable.get(gets)
I ended up using JavaConvert to make it java.util.List, then it worked
val gets:List[Get]=keys.map(x=> new Get(Bytes.toBytes(x)))
import scala.collection.JavaConverters._
val getJ=gets.asJava
val results = hTable.get(getJ).toList
your gets is of type List[Get]. Here List is of Scala type. However, HBase get request expects Java List type. You can use Seq[Get] instead of List[Get] as Scala Seq is more closer to Java List.
So, you can try with below code:
val keys: List[String] = df.select("id").rdd.map(r => r.getString(0)).collect.toList
val gets:Seq[Get]=keys.map(x=> new Get(Bytes.toBytes(x)))
val results = hTable.get(gets)

How should I convert an RDD of org.apache.spark.ml.linalg.Vector to Dataset?

I'm struggling to understand how the conversion among RDDs, DataSets and DataFrames works.
I'm pretty new to Spark, and I get stuck every time I need to pass from a data model to another (especially from RDDs to Datasets and Dataframes).
Could anyone explain me the right way to do it?
As an example, now I have a RDD[org.apache.spark.ml.linalg.Vector] and I need to pass it to my machine learning algorithm, for example a KMeans (Spark DataSet MLlib). So, I need to convert it to Dataset with a single column named "features" which should contain Vector typed rows. How should I do this?
All you need is an Encoder. Imports
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.ml.linalg
RDD:
val rdd = sc.parallelize(Seq(
linalg.Vectors.dense(1.0, 2.0), linalg.Vectors.sparse(2, Array(), Array())
))
Conversion:
val ds = spark.createDataset(rdd)(ExpressionEncoder(): Encoder[linalg.Vector])
.toDF("features")
ds.show
// +---------+
// | features|
// +---------+
// |[1.0,2.0]|
// |(2,[],[])|
// +---------+
ds.printSchema
// root
// |-- features: vector (nullable = true)
To convert a RDD to a dataframe, the easiest way is to use toDF() in Scala. To use this function, it is necessary to import implicits which is done using the SparkSession object. It can be done as follows:
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val df = rdd.toDF("features")
toDF() takes an RDD of tuples. When the RDD is built up of common Scala objects they will be implicitly converted, i.e. there is no need to do anything, and when the RDD has multiple columns there is no need to do anything either, the RDD already contains a tuple. However, in this special case you need to first convert RDD[org.apache.spark.ml.linalg.Vector] to RDD[(org.apache.spark.ml.linalg.Vector)]. Therefore, it is necessary to do a convertion to tuple as follows:
val df = rdd.map(Tuple1(_)).toDF("features")
The above will convert the RDD to a dataframe with a single column called features.
To convert to a dataset the easiest way is to use a case class. Make sure the case class is defined outside the Main object. First convert the RDD to a dataframe, then do the following:
case class A(features: org.apache.spark.ml.linalg.Vector)
val ds = df.as[A]
To show all possible convertions, to access the underlying RDD from a dataframe or dataset can be done using .rdd:
val rdd = df.rdd
Instead of converting back and forth between RDDs and dataframes/datasets it's usually easier to do all the computations using the dataframe API. If there is no suitable function to do what you want, usually it's possible to define an UDF, user defined function. See for example here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html

Create RDD from RDD entry inside foreach loop

I have some custom logic that looks at elements in an RDD and would like to conditionally write to a TempView via the UNION approach using foreach, as per below:
rddX.foreach{ x => {
// Do something, some custom logic
...
val y = create new RDD from this RDD element x
...
or something else
// UNION to TempView
...
}}
Something really basic that I do not get:
How can convert the nth entry (x) of the RDD to an RDD itself of length 1?
Or, convert the nth entry (x) directly to a DF?
I get all the set based cases, but here I want to append when I meet a condition immediately for the sake of simplicity. I.e. at the level of the item entry in the RDD.
Now, before getting a -1 as SO 41356419, I am only suggesting this as I have a specific use case and to mutate a TempView in SPARK SQL, I do need such an approach - at least that is my thinking. Not a typical SPARK USE CASE, but that is what we are / I am facing.
Thanks in advance
First of all - you can't create RDD or DF inside foreach() of another RDD or DF/DS function. But you can get nth element from RDD and create new RDD with that single element.
EDIT:
The solution, however is much simplier:
import org.apache.spark.{SparkConf, SparkContext}
object Main {
val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(conf)
def main(args: Array[String]): Unit = {
val n = 534 // This is input value (index of the element we'ŗe interested in)
sc.setLogLevel("ERROR")
// Creating dummy rdd
val rdd = sc.parallelize(0 to 999).cache()
val singletonRdd = rdd.zipWithIndex().filter(pair => pair._1 == n)
}
}
Hope that helps!

Resources