I followed the process at https://azure.microsoft.com/en-in/documentation/articles/hdinsight-apache-spark-eventhub-streaming/ step by step; I only modified the Spark receiver code to suit my requirement. When I spark-submit the job, the Spark Streaming consumer API fetches the data from Event Hub as a DStream[Array[Byte]], on which I call foreachRDD and convert each batch into an RDD[String]. The issue I am facing is that the statements below the streaming line are not executed until I stop the program by pressing Ctrl+C.
package com.onerm.spark
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.eventhubs.EventHubsUtils
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark._
import org.apache.spark.sql.hive.HiveContext
import java.util.concurrent.{Executors, ExecutorService}
object HiveEvents {
  def b2s(a: Array[Byte]): String = new String(a)

  def main(args: Array[String]): Unit = {

    val ehParams = Map[String, String](
      "eventhubs.policyname" -> "myreceivepolicy",
      "eventhubs.policykey" -> "jgrH/5yjdMjajQ1WUAQsKAVGTu34=",
      "eventhubs.namespace" -> "SparkeventHubTest-ns",
      "eventhubs.name" -> "SparkeventHubTest",
      "eventhubs.partition.count" -> "4",
      "eventhubs.consumergroup" -> "$default",
      "eventhubs.checkpoint.dir" -> "/EventCheckpoint_0.1",
      "eventhubs.checkpoint.interval" -> "10"
    )

    val conf = new SparkConf().setAppName("Eventhubs Onerm")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val pool: ExecutorService = Executors.newFixedThreadPool(5)
    val ssc = new StreamingContext(sc, Seconds(120))
    var dataString: RDD[String] = sc.emptyRDD

    val stream = EventHubsUtils.createUnionStream(ssc, ehParams)

    // lines below are not getting executed until I stop the execution
    stream.print()

    stream.foreachRDD { rdd =>
      if (rdd.isEmpty()) {
        println("RDD IS EMPTY ")
      } else {
        dataString = rdd.map(line => b2s(line))
        println("COUNT" + dataString.count())
        sqlContext.read.json(dataString).registerTempTable("jsoneventdata")
        val filterData = sqlContext.sql("SELECT id,ClientProperties.PID,ClientProperties.Program,ClientProperties.Platform,ClientProperties.Version,ClientProperties.HWType,ClientProperties.OffVer,ContentID,Data,Locale,MappedSources,MarketingMessageContext.ActivityInstanceID,MarketingMessageContext.CampaignID,MarketingMessageContext.SegmentName,MarketingMessageContext.OneRMInstanceID,MarketingMessageContext.DateTimeSegmented,Source,Timestamp.Date,Timestamp.Epoch,TransactionID,UserAction,EventProcessedUtcTime,PartitionId,EventEnqueuedUtcTime from jsoneventdata")
        filterData.show(10)
        filterData.saveAsParquetFile("EventCheckpoint_0.1/ParquetEvent")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
Related
I am trying to stream Twitter data using Apache Spark in IntelliJ; however, when I use the coalesce function, it says that it cannot resolve the symbol coalesce. Here is my main code:
val spark = SparkSession.builder().appName("twitterStream").master("local[*]").getOrCreate()
import spark.implicits._
val sc: SparkContext = spark.sparkContext
val streamContext = new StreamingContext(sc, Seconds(5))
val filters = Array("Singapore")
val filtered = TwitterUtils.createStream(streamContext, None, filters)
val englishTweets = filtered.filter(_.getLang() == "en")
//englishTweets.print()
englishTweets.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  val tweets = rdd.map(field =>
    (
      field.getId,
      field.getUser.getScreenName,
      field.getCreatedAt.toInstant.toString,
      field.getText.toLowerCase.split(" ").filter(_.matches("^[a-zA-Z0-9 ]+$")).fold("")((a, b) => a + " " + b).trim,
      sentiment(field.getText)
    )
  )
  val tweetsdf = tweets.toDF("userID", "user", "createdAt", "text", "sentimentType")
  tweetsdf.printSchema()
  tweetsdf.show(false)
}.coalesce(1).write.csv("hdfs://localhost:9000/usr/sparkApp/test/testing.csv")
I have tried this with my own dataset: I read a dataset and applied the coalesce function while writing, and it gives the expected results. Please refer to the code below; it may help you.
import org.apache.spark.sql.SparkSession
import com.spark.Rdd.DriverProgram
import org.apache.log4j.{ Logger, Level }
import org.apache.spark.sql.SaveMode
import java.sql.Date
object JsonDataDF {
  System.setProperty("hadoop.home.dir", "C:\\hadoop") // This is the system property which is useful to find the winutils.exe
  Logger.getLogger("org").setLevel(Level.WARN) // This will remove Logs

  case class AOK(appDate: Date, arr: String, base: String, Comments: String)

  val dp = new DriverProgram
  val spark = dp.getSparkSession()

  def main(args: Array[String]): Unit = {
    import spark.implicits._
    val jsonDf = spark.read.option("multiline", "true").json("C:\\Users\\34979\\Desktop\\Work\\Datasets\\JSONdata.txt").as[AOK]

    jsonDf.coalesce(1) // Refer here
      .write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .format("csv")
      .save("C:\\Users\\34979\\Desktop\\Work\\Datasets\\JsonToCsv")
  }
}
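Applying the same idea to the streaming question above: foreachRDD returns Unit, so coalesce cannot be chained onto it; it would instead be called on the per-batch DataFrame before the write. Below is a minimal sketch, assuming a reduced set of columns and an illustrative HDFS output directory (the path and the append mode are assumptions, not the original poster's code):

englishTweets.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    // Build the per-batch DataFrame first, as in the question ...
    val tweetsdf = rdd.map(field => (field.getId, field.getUser.getScreenName, field.getText))
      .toDF("userID", "user", "text")

    // ... then reduce it to a single partition and write one CSV part file per batch.
    tweetsdf.coalesce(1)
      .write
      .mode("append") // assumption: append each micro-batch
      .csv("hdfs://localhost:9000/usr/sparkApp/test/output") // assumption: a directory, not a single .csv file
  }
}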
I have streamed data from Kafka topics using Spark. This is the code that I have tried; here I am just displaying the streaming data in the console. I want to store this data as a text file in HDFS.
import _root_.kafka.serializer.DefaultDecoder
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.storage.StorageLevel
object StreamingDataNew {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Kafka").setMaster("local[*]")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    val kafkaConf = Map(
      "metadata.broker.list" -> "localhost:9092",
      "zookeeper.connect" -> "localhost:2181",
      "group.id" -> "kafka-streaming-example",
      "zookeeper.connection.timeout.ms" -> "200000"
    )

    val lines = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](
      ssc,
      kafkaConf,
      Map("topic-one" -> 1), // subscribe to the topic, consuming it with one thread
      StorageLevel.MEMORY_ONLY
    )

    println("printing" + lines.toString())

    val words = lines.flatMap { case (x, y) => y.split(" ") }
    words.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
I found that we can write the DStream using saveAsTextFiles. But can someone clearly mention the steps for connecting to the Hortonworks cluster and storing the data in HDFS using the above Scala code?
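For reference, a minimal sketch of the saveAsTextFiles route on the words DStream from the code above; the namenode host, port, and output prefix are assumptions and have to match the cluster's fs.defaultFS:

// Writes each micro-batch as a set of text files; one directory per batch
// is created, named <prefix>-<batch time in ms>.<suffix>.
words.saveAsTextFiles("hdfs://sandbox.hortonworks.com:8020/user/spark/kafka-words", "txt")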
I found the answer; this code worked for me.
package com.spark.streaming
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
object MessageStreaming {

  def main(args: Array[String]): Unit = {
    println("Message streaming")
    val conf = new org.apache.spark.SparkConf().setMaster("local[*]").setAppName("kafka-streaming")
    val context = new SparkContext(conf)
    val ssc = new StreamingContext(context, org.apache.spark.streaming.Seconds(10))

    val kafkaParams = Map(
      "bootstrap.servers" -> "kafka.kafka-cluster.com:9092",
      "group.id" -> "kafka-streaming-example",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "auto.offset.reset" -> "latest",
      "zookeeper.connection.timeout.ms" -> "200000"
    )

    val topics = Array("cdc-classic")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams))

    val content = stream.filter(x => x.value() != null)
    val sqlContext = new org.apache.spark.sql.SQLContext(context)
    import sqlContext.implicits._

    stream.map(_.value).foreachRDD(rdd => {
      rdd.foreach(println)
      if (!rdd.isEmpty()) {
        rdd.toDF("value").coalesce(1).write.mode(SaveMode.Append).json("hdfs://dev1a/user/hg5tv0/hadoop/MessagesFromKafka")
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
How can a file from HDFS be read in a Spark function without using the SparkContext inside the function?
Example:
val filedata_rdd = rdd.map { x => ReadFromHDFS(x.getFilePath) }
The question is how ReadFromHDFS can be implemented. Usually, to read from HDFS we would call sc.textFile, but in this case sc cannot be used inside the function.
You don't necessarily need the SparkContext to interact with HDFS. You can simply broadcast the Hadoop configuration from the driver and use the broadcast configuration value on the executors to construct a hadoop.fs.FileSystem. Then the world is yours. :)
Following is the code:
import java.io.StringWriter
import com.sachin.util.SparkIndexJobHelper._
import org.apache.commons.io.IOUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SerializableWritable, SparkConf}
object Test {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[15]")
      .setAppName("TestJob")
    val sc = createSparkContext(conf)

    // Broadcast the Hadoop configuration so each executor can build its own FileSystem.
    val confBroadcast = sc.broadcast(new SerializableWritable(sc.hadoopConfiguration))

    val rdd: RDD[String] = ??? // your existing RDD of HDFS paths
    val filedata_rdd = rdd.map { x => readFromHDFS(confBroadcast.value.value, x) }
  }

  def readFromHDFS(configuration: Configuration, path: String): String = {
    val fs: FileSystem = FileSystem.get(configuration)
    val inputStream = fs.open(new Path(path))
    try {
      val writer = new StringWriter()
      IOUtils.copy(inputStream, writer, "UTF-8")
      writer.toString
    } finally {
      inputStream.close() // avoid leaking the HDFS stream
    }
  }
}
I am using the following code to consume and print messages from Kafka; however, the messages are not printing. I can confirm that the issue is not with Kafka itself. How do I print the messages?
package com.paras.KafkaSpark
import java.util.HashMap
import java.util.Properties
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object StreamKafkaLearn {

  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }

    val Array(zkQuorum, group, topics, numThreads) = args

    // Create context with a 2 second batch interval
    val sparkConf = new SparkConf().setAppName("DirectKafkaTest")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create a receiver-based Kafka stream for the given topics
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

    // Get the lines and print them
    //lines.print()
    lines.foreachRDD { rdd =>
      val testString = rdd.collect()
      println(testString.mkString("\n")) // prints to the stdout of the driver
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
I have encountered an error: RDD actions appear to be suspended inside the DStream foreachRDD function.
Please refer to the following code.
import _root_.kafka.common.TopicAndPartition
import _root_.kafka.message.MessageAndMetadata
import _root_.kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object StreamingTest {

  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topicOffset = Map(TopicAndPartition("test_log", 0) -> 200000L)
    val messageHandler = (mmd: MessageAndMetadata[String, String]) => mmd.message
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](ssc, kafkaParams, topicOffset, messageHandler)

    kafkaStream.foreachRDD(rdd => {
      println(rdd.count())
      val collected = rdd.collect()
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
Error:
The calls rdd.count() and rdd.collect() appear to hang and never return.
I am using Spark version 1.4.1.
Am I using it in the wrong way?
Thanks in advance.
If we don't set maxRatePerPartition for the Kafka direct stream, it will try to read all of the available data in the first batch, so it looks as if it is suspended, while it is actually busy reading the data.
After I set the following configuration:
spark.streaming.kafka.maxRatePerPartition=1000
it prints the log output as expected.
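For reference, a minimal sketch of where this property could be set in the code above; the value 1000 caps the number of records read per Kafka partition per second, and the right value is workload-dependent:

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("NetworkWordCount")
  // Cap records per partition per second so the first batches do not try
  // to pull the entire Kafka backlog at once.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

The same setting can also be passed at submit time with --conf spark.streaming.kafka.maxRatePerPartition=1000.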