I need to connect multiple Flume sinks to Spark Streaming.
This is my Flume configuration:
agent1.sinks.sink1a.type = avro
agent1.sinks.sink1a.hostname = localhost
agent1.sinks.sink1a.port = 9091
agent1.sinks.sink1b.type = avro
agent1.sinks.sink1b.hostname = localhost
agent1.sinks.sink1b.port = 9092
But only port 9091 is connecting; port 9092 is not able to connect.
Here is my Spark code that creates multiple Flume streams:
val sparkConf = new SparkConf().setAppName("WordCount")
val ssc = new StreamingContext(sparkConf, Seconds(20))
// One receiver-based Flume stream per Avro sink
val rawLines = FlumeUtils.createStream(ssc, "localhost", 9091)
val rawLines1 = FlumeUtils.createStream(ssc, "localhost", 9092)
// Extract each Flume event body as a String
val lines = rawLines.map(record => new String(record.event.getBody().array()))
val lines1 = rawLines1.map(record1 => new String(record1.event.getBody().array()))
// Merge the two streams and split into words
val lines_combined = lines.union(lines1)
val words = lines_combined.flatMap(_.split(" "))
What am I doing wrong? Any help regarding this would be great.
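One hedged observation, not stated in the question: each call to FlumeUtils.createStream registers a receiver, and every receiver occupies one core for the lifetime of the application, so the job needs more cores than it has receivers, plus at least one for processing. A minimal sketch under that assumption (the master setting and the reduceByKey/print/start tail are added purely for illustration):
// Minimal sketch (assumptions added for illustration, not from the question):
// two receiver-based Flume streams need two dedicated cores, plus at least
// one more for the actual processing.
val sparkConf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("local[4]") // at least number of receivers (2) + 1

val ssc = new StreamingContext(sparkConf, Seconds(20))
val rawLines = FlumeUtils.createStream(ssc, "localhost", 9091)
val rawLines1 = FlumeUtils.createStream(ssc, "localhost", 9092)

val words = rawLines.union(rawLines1)
  .map(record => new String(record.event.getBody().array()))
  .flatMap(_.split(" "))

// A streaming job also needs an output operation and an explicit start.
words.map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()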
I want to migrate data from my old Cassandra cluster to a new cluster and am thinking of writing some Spark jobs to do that. Is there any way to interact with multiple Cassandra clusters from the same SparkContext, so that I can read the data from one cluster and write it to another using the saveToCassandra function inside the same Spark job?
val products = sc.cassandraTable("first_cluster","products").cache()
products.saveToCassandra("diff_cluster","products2")
Can we save the data into a different cluster?
Example from spark-cassandra-connector
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
import org.apache.spark.SparkContext
def twoClusterExample(sc: SparkContext) = {
val connectorToClusterOne = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "127.0.0.1"))
val connectorToClusterTwo = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", "127.0.0.2"))
val rddFromClusterOne = {
// Sets connectorToClusterOne as default connection for everything in this code block
implicit val c = connectorToClusterOne
sc.cassandraTable("ks","tab")
}
{
//Sets connectorToClusterTwo as the default connection for everything in this code block
implicit val c = connectorToClusterTwo
rddFromClusterOne.saveToCassandra("ks","tab")
}
}
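A hedged usage sketch for the example above (the object name, app name, and contact point are assumptions for illustration):
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver that wires a SparkContext to twoClusterExample above.
object TwoClusterCopy {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("TwoClusterCopy")
      // Default contact point; the per-block implicit CassandraConnector
      // inside twoClusterExample overrides it for each read and write.
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    twoClusterExample(sc) // assumes the function above is in scope
    sc.stop()
  }
}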
In my Spark Streaming application I am creating two DStreams from two Kafka topics. I am doing so because I need to process the two DStreams differently. Below is the code example:
object KafkaConsumerTest3 {
var sc:SparkContext = null
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
val Array(zkQuorum, group, topics1, topics2, numThreads) = Array("localhost:2181", "group3", "test_topic4", "test_topic5","5")
val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
val topicMap1 = topics1.split(",").map((_, numThreads.toInt)).toMap
val topicMap2 = topics2.split(",").map((_, numThreads.toInt)).toMap
val lines2 = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap2).map(_._2)
val lines1 = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap1).map(_._2)
lines2.foreachRDD{rdd =>
rdd.foreach { println }}
lines1.foreachRDD{rdd =>
rdd.foreach { println }}
ssc.start()
ssc.awaitTermination()
}
}
Both topics may or may not have data. In my case the first topic is currently not getting data but the second one is, yet my Spark application is not printing any data, and there is no exception either.
Is there anything I am missing, or how do I resolve this issue?
Found out the issue with the above code. The problem is that the master is set to local[2] while two receivers are registered, so both threads are taken by the receivers and none is left for processing. Increasing the number of threads solves the problem.
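For illustration, a hedged sketch of the change (the exact core count is an assumption; anything greater than the number of receivers should do):
// Two Kafka receivers occupy two threads, so local[2] leaves nothing for
// processing the received data; give the app more threads than receivers.
val sparkConf = new SparkConf()
  .setAppName("SparkConsumer")
  .setMaster("local[4]") // or "local[*]" to use all available cores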
I am trying to consume messages from a Kafka producer through a Spark Streaming program.
Here is my program:
val Array(zkQuorum, group, topics, numThreads) = args
val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(5))
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
// lines.print()
lines.foreachRDD(rdd=>{
rdd.foreach(message=>
println(message))
})
The above program runs successfully, but I cannot see any messages getting printed.
Set your master URL using "local[*]". With master "local" there is only a single thread; the receiver occupies it and no thread is left to process the received data.
val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[*]")
You can also try to call collect() and see if you get messages.
lines.foreachRDD { rdd =>
rdd.collect().foreach(println)
}
I have deployed my Spark jar on the cluster and submit the Spark job using the spark-submit command followed by my project jar.
I have many SparkConf objects in my project; which one is used depends on the class I am running, but every time I run the Spark job I get this warning:
7/01/09 07:32:51 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Query: Does it mean a SparkContext already exists and my job is picking it up?
Query: Why are some configurations not taking effect?
Code:
private val conf = new SparkConf()
.setAppName("ELSSIE_Ingest_Cassandra")
.setMaster(sparkIp)
.set("spark.sql.shuffle.partitions", "8")
.set("spark.cassandra.connection.host", cassandraIp)
.set("spark.sql.crossJoin.enabled", "true")
object SparkJob extends Enumeration {
val Program1, Program2, Program3, Program4, Program5 = Value
}
object ElssieCoreContext {
def getSparkSession(sparkJob: SparkJob.Value = SparkJob.RnfIngest): SparkSession = {
val sparkSession = sparkJob match {
case SparkJob.Program1 => {
val updatedConf = conf.set("spark.cassandra.output.batch.size.bytes", "2048").set("spark.sql.broadcastTimeout", "2000")
SparkSession.builder().config(updatedConf).getOrCreate()
}
case SparkJob.Program2 => {
val updatedConf = conf.set("spark.sql.broadcastTimeout", "2000")
SparkSession.builder().config(updatedConf).getOrCreate()
}
}
}
And in Program1.scala I call it with:
val spark = ElssieCoreContext.getSparkSession()
val sc = spark.sparkContext
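A hedged illustration of what that warning points at (the session names are made up, and exact builder behavior varies a bit across Spark versions): SparkSession.builder().getOrCreate() returns any session that already exists in the JVM, so context-level settings passed to a later builder cannot take effect once the first SparkContext is running.
import org.apache.spark.sql.SparkSession

val first = SparkSession.builder()
  .appName("FirstSession")
  .master("local[2]")
  .getOrCreate()

// A later builder gets the same session back; context-level settings such as
// the app name, master or executor memory can no longer be changed.
val second = SparkSession.builder()
  .appName("SecondSession")
  .config("spark.executor.memory", "4g")
  .getOrCreate()

println(first eq second)             // true: one shared SparkSession
println(second.sparkContext.appName) // still "FirstSession"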
I am trying to use Spark Streaming to receive messages from Kafka, convert them to Puts and insert them into HBase.
I create an input DStream to receive messages from Kafka, then create a JobConf, and finally use saveAsHadoopDataset(JobConf) to save the records into HBase.
Every time a record is inserted into HBase, a session from HBase to ZooKeeper is set up but never closed. If the number of connections grows beyond ZooKeeper's maxClientCnxns, the Spark Streaming job crashes.
My code is shown below:
import kafka.serializer.StringDecoder
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka._
object ReceiveKafkaAsDstream {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName("ReceiveKafkaAsDstream")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val topics = "test"
val brokers = "10.0.2.15:6667"
val topicSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val tableName = "KafkaTable"
val conf = HBaseConfiguration.create()
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.set("mapreduce.output.fileoutputformat", "/user/root/out")
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val records = messages
.map(_._2)
.map(SampleKafkaRecord.parseToSampleRecord)
records.print()
records.foreachRDD{ stream => stream.map(SampleKafkaRecord.SampleToHbasePut).saveAsHadoopDataset(jobConfig) }
ssc.start()
ssc.awaitTermination()
}
case class SampleKafkaRecord(id: String, name: String)
object SampleKafkaRecord extends Serializable {
def parseToSampleRecord(line: String): SampleKafkaRecord = {
val values = line.split(";")
SampleKafkaRecord(values(0), values(1))
}
def SampleToHbasePut(CSVData: SampleKafkaRecord): (ImmutableBytesWritable, Put) = {
val rowKey = CSVData.id
val putOnce = new Put(rowKey.getBytes)
putOnce.addColumn("cf1".getBytes, "column-Name".getBytes, CSVData.name.getBytes)
return (new ImmutableBytesWritable(rowKey.getBytes), putOnce)
}
}
}
I set the batch duration of the StreamingContext to 1 s and set maxClientCnxns to 10 in the ZooKeeper config file zoo.cfg, so at most 10 connections are allowed from one client to ZooKeeper.
After 10 seconds (10 sessions set up from HBase to ZooKeeper), I get the error shown below:
16/08/24 14:59:30 WARN RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase-unsecure/hbaseid
16/08/24 14:59:31 INFO ClientCnxn: Opening socket connection to server localhost.localdomain/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
16/08/24 14:59:31 INFO ClientCnxn: Socket connection established to localhost.localdomain/127.0.0.1:2181, initiating session
16/08/24 14:59:31 WARN ClientCnxn: Session 0x0 for server localhost.localdomain/127.0.0.1:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
In my understanding, this error occurs because the number of connections exceeds ZooKeeper's maximum. If I set maxClientCnxns to 20, the streaming job is able to last 20 s. I know I can make maxClientCnxns unlimited, but I really don't think that is a good way to solve this problem.
Another thing: if I use textFileStream to read text files as a DStream and save them into HBase using saveAsHadoopDataset(jobConf), it runs fine. If I just read data from Kafka with val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers) and simply print the messages, there is no issue either. The problem only appears when I receive Kafka messages and then save them into HBase in the same application.
My environment is the HDP 2.4 sandbox. Versions: Spark 1.6, HBase 1.1.2, Kafka 2.10.0, ZooKeeper 3.4.6.
Any help is appreciated.
Well, finally I got it working.
Attribute set:
There is an attribute called "zookeeper.connection.timeout.ms". This attribute should be set to 1 s.
Change to the new API:
Change the method saveAsHadoopDataset(JobConf) to saveAsNewAPIHadoopDataset (which takes a Hadoop Configuration). I still don't know why the old API is not working.
Change import org.apache.hadoop.hbase.mapred.TableOutputFormat to import org.apache.hadoop.hbase.mapreduce.TableOutputFormat.
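For reference, a hedged sketch of what the new-API write could look like, reusing records and SampleKafkaRecord from the code above (the Job-based setup is a common pattern and an assumption here, not something shown in the original answer):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat // new-API package
import org.apache.hadoop.mapreduce.Job

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("zookeeper.znode.parent", "/hbase-unsecure")
hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "KafkaTable")

// The new API is configured through a mapreduce Job instead of a JobConf.
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
job.setOutputKeyClass(classOf[ImmutableBytesWritable])
job.setOutputValueClass(classOf[Put])

records.foreachRDD { rdd =>
  rdd.map(SampleKafkaRecord.SampleToHbasePut)
     .saveAsNewAPIHadoopDataset(job.getConfiguration)
}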