Save Kafka messages into HBase through Spark. Session never closes

Save Kafka messages into HBase through Spark. Session never closes - apache-spark

I am trying to use Spark streaming to receive messages from Kafka, convert them to Put and insert into HBase.
I create a inputDstream to receive messages from Kafka and then create a JobConf and finally use saveAsHadoopDataset(JobConf) to save records into HBase.
Every time record inserted into HBase, a session from Hbase to zookeeper would be set up but never closes. If number of connections increases more than max client Connections of zookeeper, spark streaming crashes.
my codes are shown below:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder
object ReceiveKafkaAsDstream {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName("ReceiveKafkaAsDstream")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val topics = "test"
val brokers = "10.0.2.15:6667"
val topicSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val tableName = "KafkaTable"
val conf = HBaseConfiguration.create()
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.set("mapreduce.output.fileoutputformat", "/user/root/out")
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val records = messages
.map(_._2)
.map(SampleKafkaRecord.parseToSampleRecord)
records.print()
records.foreachRDD{ stream => stream.map(SampleKafkaRecord.SampleToHbasePut).saveAsHadoopDataset(jobConfig) }
ssc.start()
ssc.awaitTermination()
}
case class SampleKafkaRecord(id: String, name: String)
object SampleKafkaRecord extends Serializable {
def parseToSampleRecord(line: String): SampleKafkaRecord = {
val values = line.split(";")
SampleKafkaRecord(values(0), values(1))
}
def SampleToHbasePut(CSVData: SampleKafkaRecord): (ImmutableBytesWritable, Put) = {
val rowKey = CSVData.id
val putOnce = new Put(rowKey.getBytes)
putOnce.addColumn("cf1".getBytes, "column-Name".getBytes, CSVData.name.getBytes)
return (new ImmutableBytesWritable(rowKey.getBytes), putOnce)
}
}
}
I set duration of SSC (SparkStreamingContext) as 1s and set maxClientCnxns as 10 in zookeeper conf file zoo.cfg, so there are at most 10 connections allowed from one client to zookeeper.
After 10 seconds (10 sessions set up from HBase to zookeeper), I got the error shown below:
16/08/24 14:59:30 WARN RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/hbaseid
16/08/24 14:59:31 INFO ClientCnxn: Opening socket connection to server localhost.localdomain/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
16/08/24 14:59:31 INFO ClientCnxn: Socket connection established to localhost.localdomain/127.0.0.1:2181, initiating session
16/08/24 14:59:31 WARN ClientCnxn: Session 0x0 for server localhost.localdomain/127.0.0.1:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
In my understanding, this error exists because number of connections is more than max connection of zookeeper. If I set maxClientCnxn as 20, streaming processing is able to last 20s. I know I can set maxClientCnxn as unlimited, but I really dont think it is a good way solving this problem.
Another thing is if I use TextFileStream to get text files as DStream and save them into hbase using saveAsHadoopDataset(jobConf), it runs pretty well. If I just read data from kafka using val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers) and simply print info out, there is no issue either. Problem comes when I receive kafka messages and then save them into HBase in the application.
My environment is HDP 2.4 sandbox. versions spark: 1.6, hbase:1.1.2, kafka:2.10.0, zookeeper: 3.4.6.
Any help is appreciated.

Well, finally I get it work.
Attribute set:
There is a attributes called "zookeeper.connection.timeout.ms". This attribute should be set as 1s.
Change to new API:
Change method saveAsHadoopDataset(JobConf) to saveAsNewAPIHadoopDataset(JobConf). I still dont know why the old API is not working.
Change import org.apache.hadoop.hbase.mapred.TableOutputFormat to import org.apache.hadoop.hbase.mapreduce.TableOutputFormat

Related

How to receive multiple messages of kafka topics in parallel with spark streaming DStream

i used a mysql CDC(changing data capturing) system which can capture new records inserted into mysql tables and sent to kafka with one topic per one table. Then in spark streaming, i want to receive multiple messages of kafka topics in parallel with spark streaming DStream API for further processing of these changing data from these mysql tables.
CDC is setup just fine and messages are tested received OK for all tables by kafka-consume-topic.sh. But in spark streaming, i can only got 1 table's messages received. but if only 1 topic/stream created in the application one by one for all tables' test, all tables seperately are tested ok to be read by spark streaming. i searched long time for relevant questions, articles and examples in spark github project, no lucky to get solution. there are examples to union non Direct stream, but those spark streaming APIs are old and i hesitate to adopt them doubting later maybe lots of wheel re-inventing works to do.
below are my codes:
package com.fm.data.fmtrade
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
object TestKafkaSparkStreaming {
def main(args: Array[String]): Unit = {
var conf=new SparkConf()//.setMaster("spark://192.168.177.120:7077")
.setAppName("SparkStreamKaflaWordCount Demo")
.set("spark.streaming.concurrentJobs", "8")
val ss = SparkSession
.builder()
.config(conf)
.appName(args.mkString(" "))
.getOrCreate()
val topicsArr: Array[String] = Array(
"betadbserver1.copytrading.t_trades",
"betadbserver1.copytrading.t_users",
"betadbserver1.account.s_follower",
"betadbserver1.copytrading.t_followorder",
"betadbserver1.copytrading.t_follow",
"betadbserver1.copytrading.t_activefollow",
"betadbserver1.account.users",
"betadbserver1.account.user_accounts"
)
var group="con-consumer-group111" + (new util.Random).nextInt(10000)
val kafkaParam = Map(
"bootstrap.servers" -> "beta-hbase02:9092,beta-hbase03:9092,beta-hbase04:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> group,
"auto.offset.reset" -> "earliest",//"latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val ssc = new StreamingContext(ss.sparkContext, Seconds(4))
//val streams =
topicsArr.foreach{//.slice(0,1)
topic =>
val newTopicsArr=Array(topic)
val stream=
KafkaUtils.createDirectStream[String,String](ssc, PreferConsistent, Subscribe[String,String](newTopicsArr,kafkaParam))
stream.map(s =>(s.key(),s.value())).print();
}
/*
val unifiedStream = ssc.union(streams)
unifiedStream.repartition(2)
unifiedStream.map(s =>(s.key(),s.value())).print()
*/
/*
unifiedStream.foreachRDD{ rdd =>
rdd.foreachPartition{ partitionOfRecords =>
partitionOfRecords.foreach{ record =>
}
}
}
*/
ssc.start();
ssc.awaitTermination();
}
}

How to interact with different cassandra cluster from the same spark context

I want to migrate my old cassandra cluster data to a new cluster and thinking to write some spark jobs to do that. Is there any way to interact with multiple cassandra cluster from the same SparkContext. So that i can read the data from one cluster and write to another cluster using saveToCassandra function inside the same sparkJob.
val products = sc.cassandraTable("first_cluster","products").cache()
products.saveToCassandra("diff_cluster","products2")
Can we save the data into a different cluster ?

Example from spark-cassandra-connector
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
import org.apache.spark.SparkContext
def twoClusterExample ( sc: SparkContext) = {
val connectorToClusterOne = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "127.0.0.1"))
val connectorToClusterTwo = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "127.0.0.2"))
val rddFromClusterOne = {
// Sets connectorToClusterOne as default connection for everything in this code block
implicit val c = connectorToClusterOne
sc.cassandraTable("ks","tab")
}
{
//Sets connectorToClusterTwo as the default connection for everything in this code block
implicit val c = connectorToClusterTwo
rddFromClusterOne.saveToCassandra("ks","tab")
}
}

Serialization of transform function in checkpointing

I'm trying to understand Spark Streaming's RDD transformations and checkpointing in the context of serialization. Consider the following example Spark Streaming app:
private val helperObject = HelperObject()
private def createStreamingContext(): StreamingContext = {
val conf = new SparkConf()
.setAppName(Constants.SparkAppName)
.setIfMissing("spark.master", Constants.SparkMasterDefault)
implicit val streamingContext = new StreamingContext(
new SparkContext(conf),
Seconds(Constants.SparkStreamingBatchSizeDefault))
val myStream = StreamUtils.createStream()
myStream.transform(transformTest(_)).print()
streamingContext
}
def transformTest(rdd: RDD[String]): RDD[String] = {
rdd.map(str => helperObject.doSomething(str))
}
val ssc = StreamingContext.getOrCreate(Settings.progressDir,
createStreamingContext)
ssc.start()
while (true) {
helperObject.setData(...)
}
From what I've read in other SO posts, transformTest will be invoked on the driver program once for every batch after streaming starts. Assuming createStreamingContext is invoked (no checkpoint is available), I would expect that the instance of helperObject defined up top would be serialized out to workers once per batch, hence picking up the changes applied to it via helperObject.setData(...). Is this the case?
Now, if createStreamingContext is not invoked (a checkpoint is available), then I would expect that the instance of helperObject cannot possibly be picked up for each batch, since it can't have been captured if createStreamingContext is not executed. Spark Streaming must have serialized helperObject as part of the checkpoint, correct?
Is it possible to update helperObject throughout execution from the driver program when using checkpointing? If so, what's the best approach?

If helperObject is going to be serialized to each executors?
Ans: Yes.
val helperObject = Instantiate_SomeHow()
rdd.map{_.SomeFunctionUsing(helperObject)}
Spark Streaming must have serialized helperObject as part of the checkpoint, correct?
Ans Yes.
If you wish to refresh your helperObject behaviour for each RDD operation you can still do that by making your helperObject more intelligent and not sending the helperObject directly but via a function which has the following signature () => helperObject_Class.
Since it is a function it is serializable. It is a very common design pattern used for sending objects that are not serializable e.g. database connection object or for your fun use case.
An example is given from Kafka Exactly once semantics using database
package example
import kafka.serializer.StringDecoder
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import scalikejdbc._
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkContext, SparkConf, TaskContext}
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{KafkaUtils, HasOffsetRanges, OffsetRange}
/** exactly-once semantics from kafka, by storing offsets in the same transaction as the results
Offsets and results will be stored per-batch, on the driver
*/
object TransactionalPerBatch {
def main(args: Array[String]): Unit = {
val conf = ConfigFactory.load
val kafkaParams = Map(
"metadata.broker.list" -> conf.getString("kafka.brokers")
)
val jdbcDriver = conf.getString("jdbc.driver")
val jdbcUrl = conf.getString("jdbc.url")
val jdbcUser = conf.getString("jdbc.user")
val jdbcPassword = conf.getString("jdbc.password")
val ssc = setupSsc(kafkaParams, jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)()
ssc.start()
ssc.awaitTermination()
}
def setupSsc(
kafkaParams: Map[String, String],
jdbcDriver: String,
jdbcUrl: String,
jdbcUser: String,
jdbcPassword: String
)(): StreamingContext = {
val ssc = new StreamingContext(new SparkConf, Seconds(60))
SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
// begin from the the offsets committed to the database
val fromOffsets = DB.readOnly { implicit session =>
sql"select topic, part, off from txn_offsets".
map { resultSet =>
TopicAndPartition(resultSet.string(1), resultSet.int(2)) -> resultSet.long(3)
}.list.apply().toMap
}
val stream: InputDStream[(String,Long)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, Long)](
ssc, kafkaParams, fromOffsets,
// we're just going to count messages per topic, don't care about the contents, so convert each message to (topic, 1)
(mmd: MessageAndMetadata[String, String]) => (mmd.topic, 1L))
stream.foreachRDD { rdd =>
// Note this block is running on the driver
// Cast the rdd to an interface that lets us get an array of OffsetRange
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
// simplest possible "metric", namely a count of messages per topic
// Notice the aggregation is done using spark methods, and results collected back to driver
val results = rdd.reduceByKey {
// This is the only block of code running on the executors.
// reduceByKey did a shuffle, but that's fine, we're not relying on anything special about partitioning here
_+_
}.collect
// Back to running on the driver
// localTx is transactional, if metric update or offset update fails, neither will be committed
DB.localTx { implicit session =>
// store metric results
results.foreach { pair =>
val (topic, metric) = pair
val metricRows = sql"""
update txn_data set metric = metric + ${metric}
where topic = ${topic}
""".update.apply()
if (metricRows != 1) {
throw new Exception(s"""
Got $metricRows rows affected instead of 1 when attempting to update metrics for $topic
""")
}
}
// store offsets
offsetRanges.foreach { osr =>
val offsetRows = sql"""
update txn_offsets set off = ${osr.untilOffset}
where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
""".update.apply()
if (offsetRows != 1) {
throw new Exception(s"""
Got $offsetRows rows affected instead of 1 when attempting to update offsets for
${osr.topic} ${osr.partition} ${osr.fromOffset} -> ${osr.untilOffset}
Was a partition repeated after a worker failure?
""")
}
}
}
}
ssc
}
}

Spark: Read HBase in secured cluster

I have an easy task: I want to read HBase data in a Kerberos secured cluster.
So far I tried 2 approaches:
sc.newAPIHadoopRDD(): here I don't know how to handle the kerberos authentication
create a HBase connection from the HBase API: Here I don't really know how to convert the result into RDDs
Furthermore there seem to be some HBase-Spark connectors. But somehow I didn't really manage to find them as Maven artifact and/or they require a fixed structure of the result (but I just need to have the HBase Result object since the columns in my data are not fixed).
Do you have any example or tutorials or ....?
I appreciate any help and hints.
Thanks in advance!

I assume that you are using spark + scala +Hbase
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.HTable;
object SparkWithMyTable {
def main(args: Array[String]) {
//Initiate spark context with spark master URL. You can modify the URL per your environment.
val sc = new SparkContext("spark://ip:port", "MyTableTest")
val tableName = "myTable"
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "list of cluster ip's")
conf.set("hbase.zookeeper"+ ".property.clientPort","2181");
conf.set("hbase.master", "masterIP:60000");
conf.set("hadoop.security.authentication", "kerberos");
conf.set("hbase.security.authentication", "kerberos");
UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab("user#---", keyTabPath);
// Add local HBase conf
// conf.addResource(new Path("file://hbase/hbase-0.94.17/conf/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
// create my table with column family
val admin = new HBaseAdmin(conf)
if(!admin.isTableAvailable(tableName)) {
print("Creating MyTable")
val tableDesc = new HTableDescriptor(tableName)
tableDesc.addFamily(new HColumnDescriptor("cf1".getBytes()));
admin.createTable(tableDesc)
}else{
print("Table already exists!!")
val columnDesc = new HColumnDescriptor("cf1");
admin.disableTable(Bytes.toBytes(tableName));
admin.addColumn(tableName, columnDesc);
admin.enableTable(Bytes.toBytes(tableName));
}
//first put data into table
val myTable = new HTable(conf, tableName);
for (i <- 0 to 5) {
var p = new Put();
p = new Put(new String("row" + i).getBytes());
p.add("cf1".getBytes(), "column-1".getBytes(), new String(
"value " + i).getBytes());
myTable.put(p);
}
myTable.flushCommits();
//how to create rdd
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
//get the row count
val count = hBaseRDD.count()
print("HBase RDD count:"+count)
System.exit(0)
}
}
Maven Artifact
<dependency>
<groupId>it.nerdammer.bigdata</groupId>
<artifactId>spark-hbase-connector_2.10</artifactId>
<version>1.0.3</version> // Version can be changed as per your Spark version, I am using Spark 1.6.x
</dependency>
Can also have a look at
Spark play with HBase's Result object: handling HBase KeyValue and ByteArray in Scala with Spark -- Real World Examples
scan-that-works-on-kerberos
HBaseScanRDDExample.scala

How to work with real time streaming data/logs using spark streaming?

I am newbie to Spark and Scala.
I want to implement a REAL TIME Spark Consumer which could read the network logs on per minute basis [fetching around 1GB of JSON log lines/minute] from Kafka Publisher and finally store the aggregated values in ElasticSearch.
Aggregations is based on few values [like bytes_in, bytes_out etc] using composite key [like : client MAC, client IP, server MAC, Server IP etc].
Spark Consumer which I have written is:
object LogsAnalyzerScalaCS{
def main(args : Array[String]) {
val sparkConf = new SparkConf().setAppName("LOGS-AGGREGATION")
sparkConf.set("es.nodes", "my ip address")
sparkConf.set("es.port", "9200")
sparkConf.set("es.index.auto.create", "true")
sparkConf.set("es.nodes.discovery", "false")
val elasticResource = "conrec_1min/1minute"
val ssc = new StreamingContext(sparkConf, Seconds(30))
val zkQuorum = "my zk quorum IPs:2181"
val consumerGroupId = "LogsConsumer"
val topics = "Logs"
val topicMap = topics.split(",").map((_,3)).toMap
val json = KafkaUtils.createStream(ssc, zkQuorum, consumerGroupId, topicMap)
val logJSON = json.map(_._2)
try{
logJSON.foreachRDD( rdd =>{
if(!rdd.isEmpty()){
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._
val df = sqlContext.read.json(rdd)
val groupedData =
((df.groupBy("id","start_time_formated","l2_c","l3_c",
"l4_c","l2_s","l3_s","l4_s")).agg(count("f_id") as "total_f", sum("p_out") as "total_p_out",sum("p_in") as "total_p_in",sum("b_out") as "total_b_out",sum("b_in") as "total_b_in", sum("duration") as "total_duration"))
val dataForES = groupedData.withColumnRenamed("start_time_formated", "start_time")
dataForES.saveToEs(elasticResource)
dataForES.show();
}
})
}
catch{
case e: Exception => print("Exception has occurred : "+e.getMessage)
}
ssc.start()
ssc.awaitTermination()
}
object SQLContextSingleton {
#transient private var instance: org.apache.spark.sql.SQLContext = _
def getInstance(sparkContext: SparkContext): org.apache.spark.sql.SQLContext = {
if (instance == null) {
instance = new org.apache.spark.sql.SQLContext(sparkContext)
}
instance
}
}
}
First of all I would like to know if at all my approach is correct or not [considering I need 1 min logs aggregation]?
There seems to be an issue using this code:
This Consumer will pull data from the Kafka broker every 30 seconds
and saving the final aggregation to Elasticsearch for that 30
sec data, hence increasing the number of rows in Elasticsearch for
unique key [at least 2 entries per one minute]. UI tool [
let's say Kibana] needs to do further aggregation. If I increase the
polling time from 30 sec to 60 sec then it takes a lot of time to
aggregate and hence not at all remains real time.
I want to implement it in such a way that in ElasticSearch only one
row per key should get saved. Hence I want to perform aggregation
till the time I am not getting new key values in my dataset which is
getting pulled from Kafka broker [per minute basis]. After doing
some googling I have found that this could be achieved using
groupByKey() and updateStateByKey() functions but I am not able to
make out how I could use this in my case [should I convert the JSON
Log lines into a string of log line with flat values and then use
these functions there]? If I will use these functions then when
should I save the final aggregated values into ElasticSearch?
Is there any other way of achieving it?
Your quick help will be appreciated.
Regards,
Bhupesh

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object Main {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(15))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group1",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)//,localhost:9094,localhost:9095"
val topics = Array("test")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
val out = stream.map(record =>
record.value
)
val words = out.flatMap(_.split(" "))
val count = words.map(word => (word, 1))
val wdc = count.reduceByKey(_+_)
val sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
wdc.foreachRDD{rdd=>
val es = sqlContext.createDataFrame(rdd).toDF("word","count")
import org.elasticsearch.spark.sql._
es.saveToEs("wordcount/testing")
es.show()
}
ssc.start()
ssc.awaitTermination()
}
}
To see full example and sbt
apache-sparkscalahadoopkafkaapache-spark-sql spark-streamingapache-spark-2.0elastic

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Save Kafka messages into HBase through Spark. Session never closes - apache-spark

Related

How to receive multiple messages of kafka topics in parallel with spark streaming DStream

How to interact with different cassandra cluster from the same spark context

Serialization of transform function in checkpointing

Spark: Read HBase in secured cluster

How to work with real time streaming data/logs using spark streaming?

Categories

Resources