Spark - SparkSession access issue - apache-spark

I have a problem similar to the one in
Spark java.lang.NullPointerException Error when filter spark data frame on inside foreach iterator
String_Lines.foreachRDD{line ->
line.foreach{x ->
// JSON to DF Example
val sparkConfig = SparkConf().setAppName("JavaKinesisWordCountASL").setMaster("local[*]").
set("spark.sql.warehouse.dir", "file:///C:/tmp")
val spark = SparkSession.builder().config(sparkConfig).orCreate
val outer_jsonData = Arrays.asList(x)
val outer_anotherPeopleDataset = spark.createDataset(outer_jsonData, Encoders.STRING())
spark.read().json(outer_anotherPeopleDataset).createOrReplaceTempView("jsonInnerView")
spark.sql("select name, address.city, address.state from jsonInnerView").show(false)
println("Current String #"+ x)
}
}
@thebluephantom explained it to the point. I have my code in foreachRDD now, but it still doesn't work. This is Kotlin, and I am running it on my local laptop with IntelliJ. Somehow it's not picking sparksession as I understand after reading all blogs. If I delete "spark.read and spark.sql", everything else works OK. What should I do to fix this?

If I delete "spark.read and spark.sql", everything else works OK
If you delete those, you're not actually making Spark do anything; you're only defining what Spark should do (Spark transformations are lazy and run only when an action is called).
Somehow it's not picking sparksession as I understand
It's "picking it up" just fine. The error is happening because it's picking up a brand new SparkSession. You should already have defined one of these outside of the foreachRDD method, but if you try to reuse it, you might run into different issues.
Assuming String_Lines is already a DataFrame, there's no point in looping over all of its RDD data and trying to create a brand new SparkSession. Or, if it's a DStream, convert it to a streaming DataFrame instead.
That being said, you should be able to immediately select data from it:
// unclear what the schema of this is
val selected = String_Lines.selectExpr("name", "address.city", "address.state")
selected.show(false)
You may need to add a get_json_object function in there if you're trying to parse JSON strings.

I was finally able to solve it.
I modified the code like this. It's clean and working.
This is the data type of String_Lines:
val String_Lines: JavaDStream<String>
String_Lines.foreachRDD { x ->
val df = spark.read().json(x)
df.printSchema()
df.show(2,false)
}
Thanks,
Chandra

Related

Spark - Java - Filter Streaming Queries

I've a Spark application that receives data in a dataframe:
Dataset<Row> df = spark.readStream().format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "topic").load().selectExpr("CAST(key AS STRING) as key");
String my_key = df.select("key").first().toString();
if (my_key == "a")
{
do_stuff
}
Basically, if the value is a, I need to apply some transformations to the DataFrame; otherwise, I apply other transformations.
However, I am dealing with streaming queries and when I tried to apply my code above I got:
Queries with streaming sources must be executed with writeStream.start()
The error happens when I perform the first operation.
Does anyone have any ideas?
Thanks in advance :)
I was able to solve my problem using:
Dataset<Row> df = spark.readStream().format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "topic").load().selectExpr("CAST(key AS STRING) as key").filter(functions.col("key").contains("a"));
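Two details in the snippets above are easy to miss. In Java, my_key == "a" compares object references rather than string contents, and contains("a") matches any key that includes the letter a, not only the key a itself. The following plain-Java sketch (no Spark involved; the class and method names are hypothetical) shows the difference. If exact equality is what you want in the streaming filter, Column.equalTo is the equality counterpart of Column.contains.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
final class KeyMatchSketch {

    // Exact match on content; null-safe because the literal is the receiver.
    static boolean isExactlyA(String key) {
        return "a".equals(key);
    }

    // Substring match, analogous to what a contains("a") filter checks.
    static boolean containsA(String key) {
        return key != null && key.contains("a");
    }

    // Keeps only the keys accepted by the given predicate.
    static List<String> keep(List<String> keys, Predicate<String> test) {
        return keys.stream().filter(test).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "ab", "b", "banana");
        System.out.println(keep(keys, KeyMatchSketch::isExactlyA)); // [a]
        System.out.println(keep(keys, KeyMatchSketch::containsA));  // [a, ab, banana]

        // A string built at runtime is a different object from the literal.
        String built = new String("a");
        System.out.println(built == "a");      // false: reference comparison
        System.out.println(built.equals("a")); // true: content comparison
    }
}
```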

HBase batch get with spark scala

I am trying to fetch data from HBase based on a list of row keys. The API documentation has a method called get(List gets), which I am trying to use; however, the compiler complains with the message below. Has anyone experienced this?
overloaded method value get with alternatives: (x$1: java.util.List[org.apache.hadoop.hbase.client.Get])Array[org.apache.hadoop.hbase.client.Result] <and> (x$1: org.apache.hadoop.hbase.client.Get)org.apache.hadoop.hbase.client.Result cannot be applied to (List[org.apache.hadoop.hbase.client.Get])
The code I tried:
val keys: List[String] = df.select("id").rdd.map(r => r.getString(0)).collect.toList
val gets:List[Get]=keys.map(x=> new Get(Bytes.toBytes(x)))
val results = hTable.get(gets)
I ended up using JavaConverters to convert it to a java.util.List, and then it worked:
val gets:List[Get]=keys.map(x=> new Get(Bytes.toBytes(x)))
import scala.collection.JavaConverters._
val getJ=gets.asJava
val results = hTable.get(getJ).toList
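Your gets is of type List[Get], where List is the Scala type. However, the HBase get request expects a java.util.List. Declaring it as Seq[Get] alone is not enough, since a Scala Seq is not a java.util.List either; the collection still has to be converted, for example with asJava from scala.collection.JavaConverters.
So, you can try the code below:
import scala.collection.JavaConverters._
val keys: List[String] = df.select("id").rdd.map(r => r.getString(0)).collect.toList
val gets:Seq[Get]=keys.map(x=> new Get(Bytes.toBytes(x)))
// convert the Scala Seq to a java.util.List before calling the Java API
val results = hTable.get(gets.asJava)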

Deleting specific column in Cassandra from Spark

I was able to delete a specific column with the RDD API:
sc.cassandraTable("books_ks", "books")
.deleteFromCassandra("books_ks", "books",SomeColumns("book_price"))
I am struggling to do this with the DataFrame API.
Can someone please share an example?
You cannot delete via the DF API, and it's unnatural via the RDD API. RDDs and DFs are immutable, meaning no modification. You can filter them to cut them down, but this generates a new RDD / DF.
Having said that, what you can do is filter out the rows that you wish to delete and then just build a C* client to carry out that deletion:
// imports for Spark and C* connection
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf
spark.setCassandraConf("Test Cluster", CassandraConnectorConf.ConnectionHostParam.option("localhost"))
val df = spark.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "books_ks", "table" -> "books")).load()
val dfToDelete = df.filter($"price" < 3).select($"price");
dfToDelete.show();
// import for C* client
import com.datastax.driver.core._
// build a C* client (part of the dependency of the scala driver)
val clusterBuilder = Cluster.builder().addContactPoints("127.0.0.1");
val cluster = clusterBuilder.build();
val session = cluster.connect();
// loop over everything that you filtered in the DF and delete specified row.
for(price <- dfToDelete.collect())
session.execute("DELETE FROM books_ks.books WHERE price=" + price.get(0).toString);
A few warnings: this won't work well if you're trying to delete a large portion of rows. Using collect here means that this work will be done in Spark's driver program, which is a single point of failure and a bottleneck.
A better way to do this would be either a) to define a DataFrame UDF to carry out the delete, with the benefit that you get parallelization, or b) to go down to the RDD level and do the delete as you've shown above.
Moral of the story: just because it can be done doesn't mean it should be done.

How to check if a DataFrame was already cached/persisted before?

For Spark's RDD object this is quite trivial, as it exposes a getStorageLevel method, but DataFrame does not seem to expose anything similar. Anyone?
You can check whether a DataFrame is cached using the Catalog (org.apache.spark.sql.catalog.Catalog), which comes with Spark 2.
Code example :
val sparkSession = SparkSession.builder
  .master("local")
  .appName("example")
  .getOrCreate()
val df = sparkSession.read.csv("src/main/resources/sales.csv")
df.createTempView("sales")
//interacting with catalog
val catalog = sparkSession.catalog
//print the databases
catalog.listDatabases().select("name").show()
// print all the tables
catalog.listTables().select("name").show()
// is cached
println(catalog.isCached("sales"))
df.cache()
println(catalog.isCached("sales"))
Using the above code you can list all the tables and check whether a table is cached or not.

NotSerializableException: org.apache.hadoop.io.LongWritable

I know this question has been answered many times, but I have tried everything and still cannot find a solution. I have the following code, which raises a NotSerializableException:
val ids : Seq[Long] = ...
ids.foreach{ id =>
sc.sequenceFile("file", classOf[LongWritable], classOf[MyWritable]).lookup(new LongWritable(id))
}
With the following exception
Caused by: java.io.NotSerializableException: org.apache.hadoop.io.LongWritable
Serialization stack:
...
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
When creating the SparkContext, I do
val sparkConfig = new SparkConf().setAppName("...").setMaster("...")
sparkConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConfig.registerKryoClasses(Array(classOf[BitString[_]], classOf[MinimalBitString], classOf[org.apache.hadoop.io.LongWritable]))
sparkConfig.set("spark.kryoserializer.classesToRegister", "org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable")
and looking at the environment tab, I can see these entries. However, I do not understand why (1) the Kryo serializer does not seem to be used (the stack does not mention Kryo), and (2) LongWritable is not serialized.
I'm using Apache Spark v. 1.5.1
Repeatedly loading the same data inside a loop is extremely inefficient. If you perform actions against the same data, load it once and cache it:
val rdd = sc
.sequenceFile("file", classOf[LongWritable], classOf[MyWritable])
rdd.cache()
Spark doesn't consider Hadoop Writables to be serializable. There is an open JIRA (SPARK-2421) for this. To handle LongWritables, a simple get should be enough:
rdd.map{case (k, v) => k.get()}
Regarding your custom class, it is your responsibility to deal with this problem.
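To see why unwrapping the key with get helps, note that the stack trace in the question goes through JavaSerializerInstance, so the closure is being checked with plain Java serialization. The sketch below reproduces that behaviour without Spark or Hadoop. FakeLongWritable is a hypothetical stand-in for a Writable (a value holder that does not implement java.io.Serializable), and canJavaSerialize is an illustrative helper, not a Spark API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class, for illustration only.
final class WritableSerializationSketch {

    // Stand-in for a Hadoop Writable: wraps a long, but is NOT Serializable.
    static final class FakeLongWritable {
        private final long value;

        FakeLongWritable(long value) {
            this.value = value;
        }

        long get() {
            return value;
        }
    }

    // Returns true if plain Java serialization accepts the object.
    static boolean canJavaSerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FakeLongWritable key = new FakeLongWritable(42L);
        System.out.println(canJavaSerialize(key));       // false: wrapper is not Serializable
        System.out.println(canJavaSerialize(key.get())); // true: the boxed Long is Serializable
    }
}
```

The same reasoning applies to a custom value class: either make it serializable or map it to a serializable representation before it has to cross a serialization boundary.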
Effective lookup requires a partitioned RDD. Otherwise it has to search every partition in your RDD.
import org.apache.spark.HashPartitioner
val numPartitions: Int = ???
val partitioned = rdd.partitionBy(new HashPartitioner(numPartitions))
Generally speaking, RDDs are not designed for random access. Even with a defined partitioner, lookup has to linearly search the candidate partition. With 5000 uniformly distributed keys and 10M objects in an RDD, it most likely means a repeated search over the whole RDD. You have a few options to avoid that:
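filter:
val idsSet = sc.broadcast(ids.toSet)
// compare on the unwrapped Long, since ids holds Long values rather than LongWritables
rdd.filter{case (k, v) => idsSet.value.contains(k.get())}
join:
val idsRdd = sc.parallelize(ids).map((_, null))
// key the RDD by the unwrapped Long so both sides share the same key type
idsRdd.join(rdd.map{case (k, v) => (k.get(), v)}).map{case (k, (_, v)) => (k, v)}
IndexedRDD: it doesn't look like a particularly active project, though.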
With 10M entries you'll probably be better off searching locally in memory than using Spark. For larger data you should consider using a proper key-value store.
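As a rough illustration of the "search locally in memory" option, the sketch below keeps the data in an ordinary hash map and looks up a batch of ids against it. It assumes the whole data set fits in the memory of a single JVM; the class name, method name and sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class, for illustration only.
final class LocalLookupSketch {

    // Returns the entries of index whose keys appear in ids, in the order of ids.
    // Each lookup is a constant-time hash probe on average, not a scan.
    static <V> Map<Long, V> lookupAll(Map<Long, V> index, Collection<Long> ids) {
        Map<Long, V> found = new LinkedHashMap<>();
        for (Long id : ids) {
            V value = index.get(id);
            if (value != null) {
                found.put(id, value);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<Long, String> index = new HashMap<>();
        index.put(1L, "one");
        index.put(2L, "two");
        index.put(3L, "three");

        // Id 7 is not present and is simply skipped.
        System.out.println(lookupAll(index, Arrays.asList(3L, 7L, 1L))); // {3=three, 1=one}
    }
}
```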
I'm new to Apache Spark, but I tried to solve your problem; please evaluate whether it helps with the serialization issue. It occurs because Spark does not treat Hadoop's LongWritable and other Writables as serializable.
val temp_rdd = sc.parallelize(ids.map(id =>
sc.sequenceFile("file", classOf[LongWritable], classOf[LongWritable]).toArray.toSeq
)).flatMap(identity)
ids.foreach(id =>temp_rdd.lookup(new LongWritable(id)))
Try this solution. It worked fine for me.
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkMapReduceApp");
conf.registerKryoClasses(new Class<?>[]{
LongWritable.class,
Text.class
});

Resources