How to decompress Gzip files from EventHub using Spark Structured Streaming - apache-spark

Is there a way to read gzip files from EventHub and decompress them using Spark Structured Streaming? I want to store the uncompressed JSON in ADLS using Spark Structured Streaming with Trigger once.
I'm getting NULL data when I try to read the EventHub data, which is currently compressed, via Spark Structured Streaming. I need some logic for decompressing the EventHub data while reading it.
Any help would be greatly appreciated.

I was able to achieve this by writing a Scala UDF. Hope this may help somebody in the future.
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream
import org.apache.spark.sql.functions.udf

// Decompress a gzip-compressed byte array into its original string payload
val decompress = udf { compressed: Array[Byte] =>
  val inputStream = new GZIPInputStream(new ByteArrayInputStream(compressed))
  scala.io.Source.fromInputStream(inputStream).mkString
}
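For context, here is a sketch of how that UDF could be wired into the stream, assuming the Azure Event Hubs connector for Spark (the "eventhubs" format, which exposes the payload as a binary body column); the connection string, ADLS paths, and checkpoint location below are placeholders:
import org.apache.spark.eventhubs.EventHubsConf
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

val ehConf = EventHubsConf("Endpoint=sb://...;EntityPath=...") // placeholder connection string

val compressed = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

// Decompress the binary `body` column and write the resulting JSON text to ADLS in a single run
compressed
  .select(decompress(col("body")).as("value"))
  .writeStream
  .format("text")
  .option("path", "abfss://container@account.dfs.core.windows.net/json/") // placeholder output path
  .option("checkpointLocation", "abfss://container@account.dfs.core.windows.net/_checkpoint/") // placeholder
  .trigger(Trigger.Once())
  .start()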

Related

How to append kafka consumer output to a file(parquet) in HDFS using Java/Scala?

This is a Kafka batch process. I want to read a local CSV file and write it into a Kafka topic.
Then a consumer has to get the data from the topic it subscribed to.
Expected: I want the consumed data to be appended to a file in Parquet format in HDFS.
Please help me to achieve this in an efficient manner.
Kafka Producer input:
Kafka Consumer output:
I want the value to be appended to a file in HDFS.
Doing that from scratch would be quite complicated.
You can use the Kafka Connect HDFS sink connector, which handles Parquet output out of the box (this would need a bit of preprocessing of your records, though, to put them into a suitable format such as JSON with a schema).
More info here :
https://docs.confluent.io/current/connect/kafka-connect-hdfs/index.html
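For illustration only, a minimal HDFS sink configuration along those lines might look like the following; the topic name and HDFS URL are placeholders, and the exact option names should be checked against the linked documentation for your connector version:
# Illustrative HDFS sink connector properties; topics and hdfs.url are placeholders
name=hdfs-parquet-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my-csv-topic
hdfs.url=hdfs://namenode:8020
flush.size=1000
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat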

spark structured streaming producing .c000.csv files

I am trying to fetch data from a Kafka topic and push it to an HDFS location, and I am facing the following issue.
After every Kafka message, the HDFS location is updated with part files in .c000.csv format. I have created a Hive table on top of the HDFS location, but Hive is not able to read whatever data was written by Spark Structured Streaming.
Below is the file name format produced by Spark Structured Streaming:
part-00001-abdda104-0ae2-4e8a-b2bd-3cb474081c87.c000.csv
Here is my code to insert:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

val kafkaDatademostr = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "ttt.tt.tt.tt.com:8092").option("subscribe", "demostream").option("kafka.security.protocol", "SASL_PLAINTEXT").load
val interval = kafkaDatademostr.select(col("value").cast("string")).alias("csv").select("csv.*")
val interval2 = interval.selectExpr("split(value,',')[0] as rog", "split(value,',')[1] as vol", "split(value,',')[2] as agh", "split(value,',')[3] as aght", "split(value,',')[4] as asd")
// interval2.writeStream.outputMode("append").format("console").start()
interval2.writeStream.outputMode("append").partitionBy("rog").format("csv").trigger(Trigger.ProcessingTime("30 seconds")).option("path", "hdfs://vvv/apps/hive/warehouse/area.db/test_kafcsv/").start()
Can someone help me understand why it is creating files like this?
If I do hdfs dfs -cat /part-00001-ad35a3b6-8485-47c8-b9d2-bab2f723d840.c000.csv I can see my values, but Hive is not reading them due to a format issue.
These c000 files are the part files into which the streaming query writes its data. Since you are in append mode, the Spark executor holds that writer thread open, which is why you are not able to read the data at run time through the Hive SerDe, even though hadoop fs -cat works.

Spark - How to create a RDD from Kinesis input without using streaming libraries

I'm wondering how to create an RDD reading data from Kinesis with a specific offset, with a non-streaming Spark job.
For Kafka I know this is possible with KafkaUtils.createRDD.
But I can't find an equivalent library for Kinesis. Any suggestions or workarounds?
Thanks!
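For reference, the Kafka batch read mentioned above looks roughly like this with spark-streaming-kafka-0-10 (broker address, topic, and offset values are placeholders); the question is effectively asking for a Kinesis equivalent of this pattern:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}
import scala.collection.JavaConverters._

// Consumer settings; the broker address and group id are placeholders
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "batch-read"
).asJava

// Read a fixed offset range from partition 0 of a placeholder topic as an RDD
// (sc is an existing SparkContext)
val offsetRanges = Array(OffsetRange("mytopic", 0, 0L, 1000L))
val rdd = KafkaUtils.createRDD[String, String](sc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent)
rdd.map(_.value).take(10).foreach(println)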

Spark - Reading JSON from Partitioned Folders using Firehose

Kinesis firehose manages the persistence of files, in this case time series JSON, into a folder hierarchy that is partitioned by YYYY/MM/DD/HH (down to the hour in 24 numbering)...great.
How using Spark 2.0 then can I read these nested sub folders and create a static Dataframe from all the leaf json files? Is there an 'option' to the dataframe reader?
My next goal is for this to be a streaming DF, where new files persisted by Firehose into S3 naturally become part of the streaming DataFrame using the new Structured Streaming in Spark 2.0. I know this is all experimental - hoping someone has used S3 as a streaming file source before, where the data is partitioned into folders as described above. Of course I would prefer to read straight off a Kinesis stream, but there is no date on this connector for 2.0, so Firehose->S3 is the interim.
NB: I am using Databricks, which mounts S3 into DBFS, but it could easily be EMR of course, or other Spark providers. It would be great to see a notebook too, if one that gives an example is shareable.
Cheers!
Can I read nested subfolders and create a static DataFrame from all the leaf JSON files? Is there an option to the DataFrame reader?
Yes, as your directory structure is regular (YYYY/MM/DD/HH), you can give the path down to the leaf node with wildcard characters, like below:
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val jsonDf = spark.read.json("base/path/*/*/*/*/*.json")
// Here */*/*/*/*.json maps to YYYY/MM/DD/HH/filename.json
Of course, I would prefer to read straight off a Kinesis stream, but there is no date on this connector for 2.0, so Firehose->S3 is the interim.
I could see there is a library for Kinesis integration with Spark Streaming. So, you can read the streaming data directly and perform SQL operations on it without reading from S3.
groupId = org.apache.spark
artifactId = spark-streaming-kinesis-asl_2.11
version = 2.0.0
Sample code with Spark Streaming and SQL
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis._

val kinesisStream = KinesisUtils.createStream(
  streamingContext, [Kinesis app name], [Kinesis stream name], [endpoint URL],
  [region name], [initial position], [checkpoint interval], StorageLevel.MEMORY_AND_DISK_2)

kinesisStream.foreachRDD { rdd =>
  // Get the singleton instance of SparkSession
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // Convert RDD[String] to DataFrame
  val jsonDf = rdd.toDF() // or rdd.toDF("specify schema/columns here")

  // Create a temporary view with the DataFrame
  jsonDf.createOrReplaceTempView("json_data_tbl")

  // As we have the DataFrame and SparkSession, we can perform most
  // of the Spark SQL operations here
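  // For example (illustrative query against the view registered above):
  val counts = spark.sql("SELECT COUNT(*) AS events FROM json_data_tbl")
  counts.show()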
}
Full disclosure: I work for Databricks but I do not represent them on Stack Overflow.
How using Spark 2.0 then can I read these nested sub folders and create a static Dataframe from all the leaf json files? Is there an 'option' to the dataframe reader?
DataFrameReader supports loading a sequence of paths. See the documentation for def json(paths: String*): DataFrame. You can specify the sequence explicitly, use a globbing pattern, or build it programmatically (recommended):
val inputPathSeq = Seq[String]("/mnt/myles/structured-streaming/2016/12/18/02", "/mnt/myles/structured-streaming/2016/12/18/03")
val inputPathGlob = "/mnt/myles/structured-streaming/2016/12/18/*"
val basePath = "/mnt/myles/structured-streaming/2016/12/18/0"
val inputPathList = (2 to 4).toList.map(basePath+_+"/*.json")
I know this is all experimental - hoping someone has used S3 as a streaming file source before, where the data is partitioned into folders as described above. Of course I would prefer to read straight off a Kinesis stream, but there is no date on this connector for 2.0, so Firehose->S3 is the interim.
Since you're using DBFS, I'm going to assume the S3 buckets where data are streaming from Firehose are already mounted to DBFS. Check out Databricks documentation if you need help mounting your S3 bucket to DBFS. Once you have your input path described above, you can simply load the files into a static or streaming dataframe:
Static
val staticInputDF =
spark
.read
.schema(jsonSchema)
.json(inputPathSeq : _*)
staticInputDF.isStreaming
res: Boolean = false
Streaming
val streamingInputDF =
spark
.readStream // `readStream` instead of `read` for creating streaming DataFrame
.schema(jsonSchema) // Set the schema of the JSON data
.option("maxFilesPerTrigger", 1) // Treat a sequence of files as a stream by picking one file at a time
.json(inputPathSeq : _*)
streamingInputDF.isStreaming
res: Boolean = true
Most of this is taken straight from Databricks documentation on Structured Streaming. There is even a notebook example you can import into Databricks directly.
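From there, a minimal sketch of writing the streaming DataFrame back out; the sink path and checkpoint location below are placeholders rather than part of the original example:
val query = streamingInputDF.writeStream
  .format("parquet") // persist the stream as Parquet files
  .option("path", "/mnt/myles/structured-streaming/output") // placeholder sink path
  .option("checkpointLocation", "/mnt/myles/structured-streaming/_checkpoint") // placeholder
  .outputMode("append")
  .start()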

Convert Xml to Avro from Kafka to hdfs via spark streaming or flume

I want to convert XML files to Avro. The data will be in XML format and will hit the Kafka topic first. Then I can use either Flume or Spark Streaming to ingest it, convert it from XML to Avro, and land the files in HDFS. I have a Cloudera environment.
When the Avro files hit HDFS, I want the ability to read them into Hive tables later.
I was wondering what the best method to do this is. I have tried automated schema conversion such as spark-avro (this was without Spark Streaming), but the problem is that spark-avro converts the data and Hive cannot read it. spark-avro converts the XML to a DataFrame and then writes the DataFrame to Avro. The Avro file can only be read by my Spark application. I am not sure if I am using it correctly.
I think I will need to define an explicit Avro schema. I'm not sure how to go about this for the XML file; it has multiple namespaces and is quite massive.
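For reference, a minimal sketch of the spark-avro flow described above, assuming the spark-xml and spark-avro packages are available; the row tag and paths are placeholders:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Parse the XML files into a DataFrame; spark-xml infers a schema from the documents
val xmlDf = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record") // placeholder row tag
  .load("hdfs:///landing/xml/") // placeholder input path

// Write the DataFrame out as Avro files
xmlDf.write
  .format("com.databricks.spark.avro")
  .save("hdfs:///warehouse/avro/") // placeholder output path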
If you are on Cloudera (since you have Flume, you may well have it), you can use Morphlines to do the conversion at the record level. You can use it in batch or streaming. You can see here for more info.
