Spark job on hbase data - apache-spark

I am new to Spark and am trying to read my Facebook data from an HBase table with the following schema:
I want to run a Spark job on it as explained below. The following is my code to get the JavaPairRDD.
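import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Spark context with Kryo serialization
SparkConf sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[2]");
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryoserializer.buffer.mb", "256");
sparkConf.set("spark.kryoserializer.buffer.max", "512");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// HBase connection settings and the table to scan
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "localhost:2181");
conf.set("hbase.regionserver.port", "60010");
String tableName = "fbData";
conf.set("hbase.master", "localhost:60010");
conf.set(TableInputFormat.INPUT_TABLE, tableName);
// Each record is a (row key, Result) pair
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);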
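Now, using map() on the RDD, I am able to get a JavaRDD for posts/comments/replies based on the type column:
JavaRDD<Post> results = hBaseRDD.map(new Function<Tuple2<ImmutableBytesWritable, Result>, Post>() {
    @Override
    public Post call(Tuple2<ImmutableBytesWritable, Result> row) throws Exception {
        // fetching posts (the code that builds `post` from row._2() is elided here)
        return post;
    }
});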
Now I have 3 JavaRDDs for posts, comments and replies. The Post POJO has fields for comments and replies, so I want to add the comments and replies to each post using the parent post ID. How can I accomplish this with Spark? The approach I thought of was to iterate through all posts, then iterate through all the comments and replies. Thanks in advance.

One way to do this is to turn your 3 RDDs into JavaPairRDDs, keyed by the field they have in common (the parent post ID). You can then use the join method.
Assuming that the results and comments RDDs are pair RDDs, you can just do:
JavaPairRDD<??> aggregatedResults = results.join(comments)
I do not know what type you would use for the combined objects.
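To make the keyed-join idea concrete without a cluster, here is a minimal plain-Java sketch of the same logic: key the comments by parent post ID, group them, then attach each group to its post with one lookup per post. The Post and Comment classes, their fields, and the class name AttachByParentId are hypothetical stand-ins, not the asker's actual POJOs. In Spark the corresponding steps would be mapToPair to key each RDD, then groupByKey on the comments followed by a join with the posts (or a single cogroup).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AttachByParentId {
    // Hypothetical stand-ins for the POJOs in the question.
    static class Comment {
        final String parentPostId;
        final String text;

        Comment(String parentPostId, String text) {
            this.parentPostId = parentPostId;
            this.text = text;
        }
    }

    static class Post {
        final String id;
        final List<Comment> comments = new ArrayList<>();

        Post(String id) {
            this.id = id;
        }
    }

    // Step 1: key the children by parent post ID and group them.
    static Map<String, List<Comment>> groupByParent(List<Comment> comments) {
        Map<String, List<Comment>> byParent = new HashMap<>();
        for (Comment c : comments) {
            byParent.computeIfAbsent(c.parentPostId, k -> new ArrayList<>()).add(c);
        }
        return byParent;
    }

    // Step 2: one keyed lookup per post; posts without comments are kept as-is.
    static List<Post> attach(List<Post> posts, List<Comment> comments) {
        Map<String, List<Comment>> byParent = groupByParent(comments);
        for (Post p : posts) {
            List<Comment> matches = byParent.get(p.id);
            if (matches != null) {
                p.comments.addAll(matches);
            }
        }
        return posts;
    }

    public static void main(String[] args) {
        List<Post> posts = new ArrayList<>();
        posts.add(new Post("p1"));
        posts.add(new Post("p2"));

        List<Comment> comments = new ArrayList<>();
        comments.add(new Comment("p1", "first"));
        comments.add(new Comment("p1", "second"));

        for (Post p : attach(posts, comments)) {
            System.out.println(p.id + " -> " + p.comments.size() + " comment(s)");
        }
    }
}
```

This avoids the nested iteration over all posts and all comments that the question proposes. One design point carries over to Spark: join on pair RDDs is an inner join that pairs the two values in a Tuple2, so posts with no comments would be dropped; leftOuterJoin or cogroup keeps them, which is what the sketch above mimics.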

Related

Spark - SparkSession access issue

I have a problem similar to the one in "Spark java.lang.NullPointerException Error when filter spark data frame on inside foreach iterator":
String_Lines.foreachRDD{line ->
line.foreach{x ->
// JSON to DF Example
val sparkConfig = SparkConf().setAppName("JavaKinesisWordCountASL").setMaster("local[*]").
set("spark.sql.warehouse.dir", "file:///C:/tmp")
val spark = SparkSession.builder().config(sparkConfig).orCreate
val outer_jsonData = Arrays.asList(x)
val outer_anotherPeopleDataset = spark.createDataset(outer_jsonData, Encoders.STRING())
spark.read().json(outer_anotherPeopleDataset).createOrReplaceTempView("jsonInnerView")
spark.sql("select name, address.city, address.state from jsonInnerView").show(false)
println("Current String #"+ x)
}
}
@thebluephantom explained it well. I have my code in foreachRDD now, but it still doesn't work. This is Kotlin, and I am running it on my local laptop with IntelliJ. From what I understand after reading blogs, it somehow isn't picking up the SparkSession. If I delete "spark.read and spark.sql", everything else works OK. What should I do to fix this?
If I delete "spark.read and spark.sql", everything else works OK
If you delete those, you're not actually making Spark do anything; you're only defining what should happen, since Spark transformations are lazy and nothing runs until an action is called.
Somehow it's not picking sparksession as I understand
It's "picking it up" just fine. The error happens because it's picking up a brand new SparkSession. You should already have defined one outside of the foreachRDD method, but if you try to reuse it, you might run into different issues.
Assuming String_Lines is already a DataFrame, there's no point in looping over all of its RDD data and trying to create a brand new SparkSession. If it's a DStream, convert it to a streaming DataFrame instead.
That being said, you should be able to immediately select data from it
// unclear what the schema of this is
val selected = String_Lines.selectExpr("name", "address.city", "address.state")
selected.show(false)
You may need to add a get_json_object function in there if you're trying to parse strings to JSON
I was finally able to solve it.
I modified the code like this; it's clean and working.
This is the data type of String_Lines:
val String_Lines: JavaDStream<String>
String_Lines.foreachRDD { x ->
val df = spark.read().json(x)
df.printSchema()
df.show(2,false)
}
Thanks,
Chandra

How to check if a DataFrame was already cached/persisted before?

For Spark's RDD object this is trivial, as it exposes a getStorageLevel method, but DataFrame does not seem to expose anything similar. Anyone?
You can check whether a DataFrame is cached using the Catalog (org.apache.spark.sql.catalog.Catalog), which is available from Spark 2.
Code example :
val sparkSession = SparkSession.builder
  .master("local")
  .appName("example")
  .getOrCreate()
val df = sparkSession.read.csv("src/main/resources/sales.csv")
df.createTempView("sales")
//interacting with catalog
val catalog = sparkSession.catalog
//print the databases
catalog.listDatabases().select("name").show()
// print all the tables
catalog.listTables().select("name").show()
// is cached
println(catalog.isCached("sales"))
df.cache()
println(catalog.isCached("sales"))
Using the above code you can list all the tables and check whether a table is cached.

How to broadcast data from MySQL and use it in streaming batches?

// How do I get attributes from MYSQL DB during each streaming batch and broadcast it.
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext (sc, Seconds(streamingBatchSizeinSeconds))
val eventDStream=getDataFromKafka(ssc)
val eventDtreamFiltered=eventFilter(eventDStream,eventType)
Whatever you do in getDataFromKafka and eventFilter, I think you end up with a DStream to work with. That's how your future computations are described, and every batch interval you have an RDD to work with.
The answer to your question depends greatly on what exactly you want to do, but let's assume that you're done with the stream processing of Kafka records and you want to do something with them.
If foreachRDD were acceptable, you could do the following:
// I use Spark 2.x here
// Read attributes from MySQL
val myAttrs = spark.read.jdbc([mysql-url-here]).collect
// Broadcast the attributes so they're available on executors
val attrs = sc.broadcast(myAttrs) // do it once OR move it as part of foreachRDD below
eventDtreamFiltered.foreachRDD { rdd =>
  // for each RDD reach out to attrs broadcast
  val _attrs = attrs.value
  // do something here with the rdd and _attrs
}
And that's all!

How to create a DStream from a List of string?

I have a list of strings, but I can't find a way to turn the list into a Spark Streaming DStream.
I tried this:
val tmpList = List("hi", "hello")
val rdd = sqlContext.sparkContext.parallelize(Seq(tmpList))
val rowRdd = rdd.map(v => Row(v: _*))
But Eclipse says sparkContext is not a member of sqlContext. How can I do this?
I would appreciate your help.
A DStream is a sequence of RDDs, and it is created when you register a receiver for some streaming source such as Kafka. For testing, if you want to create a DStream from a list of RDDs, you can do so as follows:
import scala.collection.mutable
// ssc is your StreamingContext; tmpList1 is a second List[String], defined like tmpList
val rdd1 = ssc.sparkContext.parallelize(tmpList)
val rdd2 = ssc.sparkContext.parallelize(tmpList1)
val stream = ssc.queueStream[String](mutable.Queue(rdd1, rdd2))
Hope it answers your question.

Cassandra Spark Connector

My Cassandra column family has date and id as the partition key.
When querying I only know the date, so I loop over a range of ids.
My question is about how the connector executes the following code.
The Spark driver code looks like this:
SparkConf conf = new SparkConf().setAppName("DemoApp")
    .setMaster("local[*]")
    .set("spark.cassandra.connection.host", "10.*.*.*")
    .set("spark.cassandra.connection.port", "*");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkContextJavaFunctions javaFunctions = CassandraJavaUtil.javaFunctions(sc);
String date = "23012017";
// One RDD (and so one Cassandra query) per id; idlist is defined elsewhere
List<JavaRDD<CassandraRow>> cassandraRowsRDDList = new ArrayList<>();
for (String id : idlist) {
    JavaRDD<CassandraRow> cassandraRowsRDD =
        javaFunctions.cassandraTable("datakeyspace", "sample2")
            .where("date = ?", date)
            .where("id = ?", id)
            .select("data");
    cassandraRowsRDDList.add(cassandraRowsRDD);
}
List<CassandraRow> collectAllRows = new ArrayList<CassandraRow>();
for (JavaRDD<CassandraRow> rdd : cassandraRowsRDDList) {
    // do transformations
    collectAllRows.addAll(rdd.collect());
}
1) First of all, if I loop over idlist, which has, say, 1000 elements and may keep growing, will this be efficient? How will each select query be distributed across the cluster? In particular, how will the Cassandra DB connections be maintained?
2) In my driver program, after looping, I put all the rows in a List, then apply transformations to each row and filter out the duplicates. Will this also be distributed by Spark across the cluster, or will it take place on the driver?
Kindly help!
There is a better way of doing this, provided by the Spark Cassandra connector.
You can create an RDD of (date, id) pairs and then call the joinWithCassandraTable function on the date and id columns. The connector handles this smartly: all the data is fetched by the workers only, and without a shuffle, meaning each worker fetches data only for the (date, id) pairs it holds.
