dstream parse JSON and save to textFile : SparkStreaming - apache-spark

i have a Kakfa topic in which data is stored in a JSON format. I have written a spark streaming code and I want to save just the values from the Kafka topic to a file in HDFS .
This is how the data in my kafka topic looks like :
{"group_city":"\"Washington\"","group_country":"\"us\"","event_name":"\"Outdoor Afro Goes Ziplining\""}
Below, is the code I have written. When i print it, I get the parsed JSON, but my problem comes when i try to save just the values to text file.
val dstream = KafkaUtils.createDirectStream[String, String](ssc,preferredHosts,ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))
//___PRINTING RECORDS________
val output= dstream.foreachRDD { rdd =>
rdd.foreach { record =>
val values = record.value()
val tweet = scala.util.parsing.json.JSON.parseFull(values)
val map:Map[String,String] = tweet.get.asInstanceOf[Map[String, String]]
map.foreach(p => println(p._2))
}
}

You can save the rdd with saveAsTextFile, But since you only want to save the values you can convert to dataframe and write as a csv
dstream.foreachRDD(rawRDD => {
// get the data
val rdd = rawRDD.map(_._2)
rdd.saveAsTextFile("file path")
// or read the json String to dataframe and write as a csv
spark.read.json(rdd).write.mode(SaveMode.Append).csv("path for output")
})
Hope this helps!

Related

Spark job reading the sorted AVRO files in dataframe, but writing to kafka without order

I have AVRO files sorted with ID and each ID has folder called "ID=234" and data inside the folder is in AVRO format and sorted on the basis of date.
I am running spark job which takes input path and reads avro in dataframe. This dataframe then writes to kafka topic with 5 partition.
val properties: Properties = getProperties(args)
val spark = SparkSession.builder().master(properties.getProperty("master"))
.appName(properties.getProperty("appName")).getOrCreate()
val sqlContext = spark.sqlContext
val sourcePath = properties.getProperty("sourcePath")
val dataDF = sqlContext.read.avro(sourcePath).as("data")
val count = dataDF.count();
val schemaRegAdd = properties.getProperty("schemaRegistry")
val schemaRegistryConfs = Map(
SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> schemaRegAdd,
SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> SchemaManager.SchemaStorageNamingStrategies.TOPIC_NAME
)
val start = Instant.now
dataDF.select(functions.struct(properties.getProperty("message.key.name")).alias("key"), functions.struct("*").alias("value"))
.toConfluentAvroWithPlainKey(properties.getProperty("topic"), properties.getProperty("schemaName"),
properties.getProperty("schemaNamespace"))(schemaRegistryConfs)
.write.format("kafka")
.option("kafka.bootstrap.servers",properties.getProperty("kafka.brokers"))
.option("topic",properties.getProperty("topic")).save()
}
My use case is to write all messages from each ID (sorted on date) sequencially such as all sorted data from one ID 1 should be added first then from ID 2 and so on. Kafka message has key as ID.
Dont forgot that the data inside a RDD/dataset is shuffle when you do transformations so you lose the order.
the best way to achieve this is to read file one by one and send it to kafka instead of read a full directory in your val sourcePath = properties.getProperty("sourcePath")

Spark dataframe is NULL (Invalid Tree)

I have a spark (spark 2.1) job which processes the stream data using Kafka direct stream. I enriched the stream data with the data files stored in HDFS. I first read the data files(*.parquet) and stored them in a data frame, then enrich one record each time with this data frame.
The code ran without any error, but the enrichment did not occur. I ran the codes in debug mode and found the data frame (eg. df) is shown as an invalid tree. Why the data frame is null inside rdd.foreachPartition? how to correct this problem? Thanks!
val kafkaSinkVar = ssc.sparkContext.broadcast(KafkaSink(kafkaServers, outputTopic))
Service.aggregate(kafkaInputStream).foreachRDD(rdd => {
val df =ss.read.parquet( filePath + "/*.parquet" )
println("Record Count in DF: " + df.count()) ==> the console shows the files were loaded successfully with the record count = 1300
rdd.foreachPartition(partition => {
val futures = partition.map(event => {
sentMsgsNo.add(1L)
val eventEnriched = someEnrichmen1(event,df) ==> df is shown as invalid tree here
kafkaSinkVar.value.sendCef(eventEnriched)
})
})
})
})

How to parse the streaming XML into dataframe?

I'm consuming the XML file from kafka topic .Can anyone tell me how to parse the XML into dataframe.
val df = sqlContext.read
.format("com.databricks.spark.xml")
//.option("rowTag","ns:header")
// .options(Map("rowTag"->"ntfyTrns:payloadHeader","rowTag"->"ns:header"))
.option("rowTag","ntfyTrnsDt:notifyTransactionDetailsReq")
.load("/home/ubuntu/SourceXML.xml")
df.show
df.printSchema()
df.select(col("ns:header.ns:captureSystem")).show()
I able to exact the information information from XML .I dont know how to pass or convert or load the RDD[String] from kafka topic to sql read API.
Thanks!
I am facing the same situation, doing some research I found that some people is using this method to convert the RDD to a DataFrame using the following code as shown here:
val wrapped = rdd.map(xml => s"""<a>$xml</a>""")
val df = new XmlReader().xmlRdd(sqlContext, wrapped)
You just have to obtain the RDD from the DStream, I am doing this using pyspark
streamElement = ssc.textFileStream("s3n://your_path")
streamElement.foreachRDD(process)
where process method has the following structure, so you can do everything with your rdds
def process(time, rdd):
return value

RDD toDF() : Erroneous Behavior

I built a SparkStreaming App that fetches content from A Kafka Queue and intends to put the data into a MySQL table after some pre-processing and structuring.
I call the 'foreachRDD' method on the SparkStreamingContext. The issue that I'm facing is that there's dataloss between when I call saveAsTextFile on the RDD and DataFrame's write method with format("csv"). I can't seem to pin point why this is happening.
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val stream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
stream.foreachRDD {
rdd => {
rdd.saveAsTextFile("/Users/jarvis/rdds/"+new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date)+"_rdd")
import spark.implicits._
val messagesDF = rdd.map(_.split("\t")).map( w => { Record ( w(0), autoTag( w(1),w(4) ) , w(2), w(3), w(4), w(5).substring(w(5).lastIndexOf("http://")), w(6).split("\n")(0) )}).toDF("recordTS","tag","channel_url","title","description","link","pub_TS")
messagesDF.write.format("csv").save(dumpPath+new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date)+"_DF")
}
}
ssc.start()
ssc.awaitTermination()
There's data loss ie Many rows don't make it to the DataFrame from the RDD.
There's also replication: Many rows that do reach the Dataframe are replicated many times.
Found the error. Actually there was a wrong understanding about the ingested data format.
The intended data was "\t\t\t..." and hence the Row was supposed be split at "\n".
However the actual data was :
"\t\t\t...\n\t\t\t...\n"
So the rdd.map(...) operation needed another map for splitting at every "\n"

Add a header before text file on save in Spark

I have some spark code to process a csv file. It does some transformation on it. I now want to save this RDD as a csv file and add a header. Each line of this RDD is already formatted correctly.
I am not sure how to do it. I wanted to do a union with the header string and my RDD but the header string is not an RDD so it does not work.
You can make an RDD out of your header line and then union it, yes:
val rdd: RDD[String] = ...
val header: RDD[String] = sc.parallelize(Array("my,header,row"))
header.union(rdd).saveAsTextFile(...)
Then you end up with a bunch of part-xxxxx files that you merge.
The problem is that I don't think you're guaranteed that the header will be the first partition and therefore end up in part-00000 and at the top of your file. In practice, I'm pretty sure it will.
More reliable would be to use Hadoop commands like hdfs to merge the part-xxxxx files, and as part of the command, just throw in the header line from a file.
Some help on writing it without Union(Supplied the header at the time of merge)
val fileHeader ="This is header"
val fileHeaderStream: InputStream = new ByteArrayInputStream(fileHeader.getBytes(StandardCharsets.UTF_8));
val output = IOUtils.copyBytes(fileHeaderStream,out,conf,false)
Now loop over you file parts to write the complete file using
val in: DataInputStream = ...<data input stream from file >
IOUtils.copyBytes(in, output, conf, false)
This made sure for me that header always comes as first line even when you use coalasec/repartition for efficient writing
def addHeaderToRdd(sparkCtx: SparkContext, lines: RDD[String], header: String): RDD[String] = {
val headerRDD = sparkCtx.parallelize(List((-1L, header))) // We index the header with -1, so that the sort will put it on top.
val pairRDD = lines.zipWithIndex()
val pairRDD2 = pairRDD.map(t => (t._2, t._1))
val allRDD = pairRDD2.union(headerRDD)
val allSortedRDD = allRDD.sortByKey()
return allSortedRDD.values
}
Slightly diff approach with Spark SQL
From Question: I now want to save this RDD as a CSV file and add a header. Each line of this RDD is already formatted correctly.
With Spark 2.x you have several options to convert RDD to DataFrame
val rdd = .... //Assume rdd properly formatted with case class or tuple
val df = spark.createDataFrame(rdd).toDF("col1", "col2", ... "coln")
df.write
.format("csv")
.option("header", "true") //adds header to file
.save("hdfs://location/to/save/csv")
Now we can even use Spark SQL DataFrame to load, transform and save CSV file
spark.sparkContext.parallelize(Seq(SqlHelper.getARow(temRet.columns,
temRet.columns.length))).union(temRet.rdd).map(x =>
x.mkString("\x01")).coalesce(1, true).saveAsTextFile(retPath)
object SqlHelper {
//create one row
def getARow(x: Array[String], size: Int): Row = {
var columnArray = new Array[String](size)
for (i <- 0 to (size - 1)) {
columnArray(i) = x(i).toString()
}
Row.fromSeq(columnArray)
}
}

Resources