HBase batch get with spark scala - apache-spark

I am trying to fetch data from HBase based on a list of row keys. The API documentation lists a method get(List<Get> gets), which I am trying to use, but the compiler complains as shown below. Has anyone run into this?
overloaded method value get with alternatives: (x$1: java.util.List[org.apache.hadoop.hbase.client.Get])Array[org.apache.hadoop.hbase.client.Result] <and> (x$1: org.apache.hadoop.hbase.client.Get)org.apache.hadoop.hbase.client.Result cannot be applied to (List[org.apache.hadoop.hbase.client.Get])
The code I tried.
val keys: List[String] = df.select("id").rdd.map(r => r.getString(0)).collect.toList
val gets:List[Get]=keys.map(x=> new Get(Bytes.toBytes(x)))
val results = hTable.get(gets)

I ended up using JavaConverters to turn it into a java.util.List, and then it worked:
val gets:List[Get]=keys.map(x=> new Get(Bytes.toBytes(x)))
import scala.collection.JavaConverters._
val getJ=gets.asJava
val results = hTable.get(getJ).toList
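The conversion above can be tried without an HBase cluster. The following is a minimal sketch, not HBase code: fetchAll is a hypothetical stand-in for a Java method such as get(java.util.List[Get]), and the keys are made up. It only shows that a Scala List has to be converted with asJava before a Java API will accept it. On Scala 2.13 and later, scala.jdk.CollectionConverters provides the same asJava/asScala methods.

```scala
import scala.collection.JavaConverters._

object BatchGetSketch {
  // Hypothetical stand-in for a Java API that takes a java.util.List,
  // in the same way the HBase batch get does. Not part of HBase.
  def fetchAll(keys: java.util.List[String]): Array[String] = {
    val out = new Array[String](keys.size)
    var i = 0
    while (i < keys.size) {
      out(i) = "row:" + keys.get(i)
      i += 1
    }
    out
  }

  // Made-up row keys held in a Scala List.
  val keys: List[String] = List("a", "b", "c")

  // fetchAll(keys) would not compile: a Scala List is not a java.util.List.
  // asJava wraps the Scala list in a java.util.List view without copying.
  val results: List[String] = fetchAll(keys.asJava).toList
}
```

The same pattern applies to the real call: build the Scala collection of Get objects first, then call asJava once at the boundary with the Java API.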

Your gets is of type List[Get], where List is the Scala List. The HBase get method, however, expects a java.util.List. Scala collections, including Seq, do not implement the Java collection interfaces, so the collection has to be converted explicitly, for example with asJava from JavaConverters.
So, you can try the code below:
import scala.collection.JavaConverters._
val keys: List[String] = df.select("id").rdd.map(r => r.getString(0)).collect.toList
val gets: Seq[Get] = keys.map(x => new Get(Bytes.toBytes(x)))
val results = hTable.get(gets.asJava)

Related

When to use map function on spark in transforming values

I'm new to Spark. Previously I worked with Python and pandas; pandas has a map function that is often used to apply transformations to columns. I found that Spark also has a map function, but so far I haven't used it at all except for extracting values, like this: df.select("id").map(r => r.getString(0)).collect.toList
import spark.implicits._
val df3 = df2.map(row => {
  val util = new Util()
  val fullName = row.getString(0) + row.getString(1) + row.getString(2)
  (fullName, row.getString(3), row.getInt(5))
})
val df3Map = df3.toDF("fullName", "id", "salary")
My questions are:
Is it common to use the map function to transform DataFrame columns?
Is it common to use map as in the block of code above (taken from sparkbyexamples)?
When do people usually use map?

Spark - SparkSession access issue

I have a problem similar to one in
Spark java.lang.NullPointerException Error when filter spark data frame on inside foreach iterator
String_Lines.foreachRDD { line ->
    line.foreach { x ->
        // JSON to DF Example
        val sparkConfig = SparkConf().setAppName("JavaKinesisWordCountASL").setMaster("local[*]").
            set("spark.sql.warehouse.dir", "file:///C:/tmp")
        val spark = SparkSession.builder().config(sparkConfig).orCreate
        val outer_jsonData = Arrays.asList(x)
        val outer_anotherPeopleDataset = spark.createDataset(outer_jsonData, Encoders.STRING())
        spark.read().json(outer_anotherPeopleDataset).createOrReplaceTempView("jsonInnerView")
        spark.sql("select name, address.city, address.state from jsonInnerView").show(false)
        println("Current String #" + x)
    }
}
@thebluephantom explained it to the point. I have my code in foreachRDD now, but it still doesn't work. This is Kotlin, and I am running it on my local laptop with IntelliJ. From what I understand after reading all the blogs, it is somehow not picking up the SparkSession. If I delete "spark.read and spark.sql", everything else works OK. What should I do to fix this?
You wrote: "If I delete spark.read and spark.sql, everything else works OK."
If you delete those, you're not actually making Spark do anything, only defining what should happen (Spark transformations are lazy; only actions trigger work).
You also wrote: "Somehow it's not picking sparksession as I understand."
It's "picking it up" just fine. The error is happening because it's picking up a brand new SparkSession. You should already have defined one of these outside of the foreachRDD call, but if you try to reuse it, you might run into different issues.
Assuming String_Lines is already a DataFrame, there's no point in looping over all of its RDD data and trying to create a brand new SparkSession. Or, if it's a DStream, convert it to a streaming DataFrame instead...
That being said, you should be able to select data from it immediately:
// unclear what the schema of this is
val selected = String_Lines.selectExpr("name", "address.city", "address.state")
selected.show(false)
You may need to add a get_json_object function in there if you're trying to parse strings to JSON
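For the case where each record is a raw JSON string rather than an already-parsed struct, the get_json_object suggestion could look roughly like the sketch below. It assumes Spark SQL is on the classpath, uses a local batch DataFrame with one made-up record instead of the real stream, and assumes the string column is named value; JsonSelectSketch and selectFields are illustrative names, not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, get_json_object}

object JsonSelectSketch {
  // Pull three fields out of a column of raw JSON strings.
  // The column name "value" is an assumption for this sketch.
  def selectFields(lines: DataFrame): DataFrame =
    lines.select(
      get_json_object(col("value"), "$.name").as("name"),
      get_json_object(col("value"), "$.address.city").as("city"),
      get_json_object(col("value"), "$.address.state").as("state"))

  def main(args: Array[String]): Unit = {
    // One session, created once on the driver and reused.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("json-select-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample record; real input would come from the stream.
    val lines = Seq("""{"name":"Ann","address":{"city":"Austin","state":"TX"}}""").toDF("value")
    selectFields(lines).show(false)
    spark.stop()
  }
}
```

get_json_object returns each extracted field as a string, so numeric fields would still need a cast afterwards.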
I was finally able to solve it.
I modified the code like this. It's clean and working.
This is the data type of String_Lines:
val String_Lines: JavaDStream<String>
String_Lines.foreachRDD { x ->
    val df = spark.read().json(x)
    df.printSchema()
    df.show(2, false)
}
Thanks,
Chandra

Kotlin with spark create dataframe from POJO which has pojo classes within

I have a kotlin data class as shown below
data class Persona_Items(
    val key1: Int = 0,
    val key2: String = "Hello")
data class Persona(
    val persona_type: String,
    val created_using_algo: String,
    val version_algo: String,
    val createdAt: Long,
    val listPersonaItems: List<Persona_Items>)
data class PersonaMetaData(
    val user_id: Int,
    val persona_created: Boolean,
    val persona_createdAt: Long,
    val listPersona: List<Persona>)
fun main() {
    val personalItemList1 = listOf(Persona_Items(1), Persona_Items(key2 = "abc"), Persona_Items(10, "rrr"))
    val personalItemList2 = listOf(Persona_Items(10), Persona_Items(key2 = "abcffffff"), Persona_Items(20, "rrr"))
    val persona1 = Persona("HelloWorld", "tttAlgo", "1.0", 10L, personalItemList1)
    val persona2 = Persona("HelloWorld", "qqqqAlgo", "1.0", 10L, personalItemList2)
    val personMetaData = PersonaMetaData(884, true, 1L, listOf(persona1, persona2))
    val spark = SparkSession
        .builder()
        .master("local[2]")
        .config("spark.driver.host", "127.0.0.1")
        .appName("Simple Application").orCreate
    val rdd1: RDD<PersonaMetaData> = spark.toDS(listOf(personMetaData)).rdd()
    val df = spark.createDataFrame(rdd1, PersonaMetaData::class.java)
    df.show(false)
}
When I try to create a dataframe I get the below error.
Exception in thread main java.lang.UnsupportedOperationException: Schema for type src.Persona is not supported.
Does this mean that creating a dataframe is not supported for a list of data classes? Please help me understand what is missing in the above code.
It could be much easier for you to use the Kotlin API for Apache Spark (Full disclosure: I'm the author of the API). With it your code could look like this:
withSpark {
    val ds = dsOf(Persona_Items(1), Persona_Items(key2 = "abc"), Persona_Items(10, "rrr"))
    // rest of the logic here
}
The thing is, Spark does not support data classes out of the box, and there is nothing like import spark.implicits._ in Kotlin, so we had to take an extra step to make it work automatically.
In Scala, import spark.implicits._ is required to serialize and deserialize your entities automatically; in the Kotlin API we do this almost at compile time.
The error means that Spark doesn't know how to serialize the Persona class.
Well, it works for me out of the box. I've created a simple app to demonstrate it; check it out here: https://github.com/szymonprz/kotlin-spark-simple-app/blob/master/src/main/kotlin/CreateDataframeFromRDD.kt
You can just run this main and you will see that the correct content is displayed.
Maybe you need to fix your build tool configuration. If you see something Scala-specific in a Kotlin project, you can check my build.gradle inside this project, or you can read more about it here: https://github.com/JetBrains/kotlin-spark-api/blob/main/docs/quick-start-guide.md

How to convert from dataframe to RDD and back with a case class [duplicate]

I am trying to convert a dataframe of multiple case classes to an RDD of these case classes. I can't find any solution. This WrappedArray has driven me crazy :P
For example, assuming I have the following:
case class randomClass(a:String,b: Double)
case class randomClass2(a:String,b: Seq[randomClass])
case class randomClass3(a:String,b:String)
val anRDD = sc.parallelize(Seq(
(randomClass2("a",Seq(randomClass("a1",1.1),randomClass("a2",1.1))),randomClass3("aa","aaa")),
(randomClass2("b",Seq(randomClass("b1",1.2),randomClass("b2",1.2))),randomClass3("bb","bbb")),
(randomClass2("c",Seq(randomClass("c1",3.2),randomClass("c2",1.2))),randomClass3("cc","Ccc"))))
val aDF = anRDD.toDF()
Assuming that I have aDF, how can I get anRDD back?
I tried something like this just to get the second column but it was giving an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
A Spark DataFrame / Dataset[Row] represents data as Row objects, using the mapping described in the Spark SQL, DataFrames and Datasets Guide. Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As commented by Tzach Zohar to get back a full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
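As a rough end-to-end check of that round trip, the sketch below goes from an RDD of case class tuples to a DataFrame and back. It assumes Spark SQL is on the classpath and runs in local mode with a single made-up record; RoundTripSketch and roundTrip are illustrative names, while the case classes are the ones from the question.

```scala
import org.apache.spark.sql.SparkSession

// Case classes are declared at the top level so Spark can derive encoders for them.
case class randomClass(a: String, b: Double)
case class randomClass2(a: String, b: Seq[randomClass])
case class randomClass3(a: String, b: String)

object RoundTripSketch {
  // RDD of tuples -> DataFrame -> typed Dataset -> RDD of tuples again.
  def roundTrip(spark: SparkSession): Array[(randomClass2, randomClass3)] = {
    import spark.implicits._
    val anRDD = spark.sparkContext.parallelize(Seq(
      (randomClass2("a", Seq(randomClass("a1", 1.1), randomClass("a2", 1.1))),
        randomClass3("aa", "aaa"))))
    val aDF = anRDD.toDF() // two struct columns named _1 and _2
    // as[...] maps the struct columns back onto the case classes by field name.
    aDF.as[(randomClass2, randomClass3)].rdd.collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("round-trip-sketch")
      .getOrCreate()
    roundTrip(spark).foreach(println)
    spark.stop()
  }
}
```

Going through as[...] lets the encoders rebuild the nested Seq of case classes, so there is no need to unpack WrappedArray values from Row objects by hand.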
I don't know the Scala API, but have you considered the rdd value?
Maybe something like:
aDF.rdd.map { case r:Row => r.getAs[randomClass3]("_2")}

Kotlin and Spark - SAM issues

Maybe I'm doing something that is not quite supported, but I really want to use Kotlin as I learn Apache Spark with this book
Here is the Scala code sample I'm trying to run. The flatMap() accepts a FlatMapFunction SAM type:
val conf = new SparkConf().setAppName("wordCount")
val sc = new SparkContext(conf)
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
Here is my attempt to do this in Kotlin, but it has a compilation error on the fourth line:
val conf = SparkConf().setMaster("local").setAppName("Line Counter")
val sc = SparkContext(conf)
val input = sc.textFile("C:\\spark_workspace\\myfile.txt",1)
val words = input.flatMap{ s:String -> s.split(" ") } //ERROR
When I hover over it I get this compile error:
Am I doing anything unreasonable or unsupported? I don't see any suggestions to autocomplete with lambdas either :(
Even though the problem is solved, I would like to explain the reason for the compilation problem. In this example input has the type RDD, whose flatMap() method accepts a function that should return TraversableOnce[U]. As Scala has its own collections framework, Java collection types are not TraversableOnce and are not converted to it automatically.
Moreover, I'm not so sure Scala Functions are really SAMs. As far as I can see from the screenshots, Kotlin doesn't offer to replace a Function instance with a lambda.
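The collection mismatch described above can be reproduced in plain Scala without Spark. This is only a sketch under assumptions: flatMapLines is a hypothetical stand-in for the Scala RDD.flatMap signature, and Iterable is used in place of TraversableOnce so the same code compiles on Scala 2.12 and 2.13.

```scala
import scala.collection.JavaConverters._

object FlatMapSketch {
  // Hypothetical stand-in for a Scala-side flatMap: the function must
  // return a Scala collection, not a java.util one.
  def flatMapLines(lines: Seq[String])(f: String => Iterable[String]): Seq[String] =
    lines.flatMap(f)

  val lines: Seq[String] = Seq("a b", "c")

  // An Array is accepted because Scala implicitly wraps it as a Scala collection.
  val fromArray: Seq[String] = flatMapLines(lines)(line => line.split(" "))

  // A java.util.List is not a Scala collection, so returning it directly would
  // not compile; asScala wraps it explicitly.
  val fromJavaList: Seq[String] =
    flatMapLines(lines)(line => java.util.Arrays.asList(line.split(" "): _*).asScala)
}
```

From Kotlin there is no such implicit wrapping, which is consistent with the Scala-facing flatMap rejecting the lambda while the Java-facing API accepts it.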
Ah, I figured it out. I knew there was a way since Spark supports both Java and Scala. The key to this particular problem was to use a JavaSparkContext instead of the Scala-based SparkContext.
For some reason Scala and Kotlin don't always get along with SAM conversions. But Java and Kotlin do...
fun main(args: Array<String>) {
    val conf = SparkConf().setMaster("local").setAppName("Line Counter")
    val sc = JavaSparkContext(conf)
    val input = sc.textFile("C:\\spark_workspace\\myfile.txt", 1)
    val words = input.flatMap { it.split(" ") }
}
See my comment at @Michael for my fix. However, can I recommend the open source Kotlin Spark API by JetBrains for future reference? It solves many lambda errors, especially when using the Dataset API, and can also make working with Spark from Kotlin generally easier:
withSpark(appName = "Line Counter", master = "local") {
    val input = sc.textFile("C:\\spark_workspace\\myfile.txt", 1)
    val words = input.flatMap { s: String -> s.split(" ").iterator() }
}
