For a given scenario, I want to filter the datasets in structured streaming in a combination of continuous and batch triggers.
I know it sounds unrealistic or maybe not feasible. Below is what I am trying to achieve.
Let the processing time-interval set in the app be 5 minutes.
Let the record be of below schema:
{
"type":"record",
"name":"event",
"fields":[
{ "name":"Student", "type":"string" },
{ "name":"Subject", "type":"string" }
]}
My streaming app is supposed to write the result into the sink by considering either of the below two criteria.
If a student is having more than 5 subjects. (Priority to be given to this criteria.)
Processing time provided in the trigger expired.
private static Injection<GenericRecord, byte[]> recordInjection;
private static StructType type;
public static final String USER_SCHEMA = "{"
+ "\"type\":\"record\","
+ "\"name\":\"alarm\","
+ "\"fields\":["
+ " { \"name\":\"student\", \"type\":\"string\" },"
+ " { \"name\":\"subject\", \"type\":\"string\" }"
+ "]}";
private static Schema.Parser parser = new Schema.Parser();
private static Schema schema = parser.parse(USER_SCHEMA);
static {
recordInjection = GenericAvroCodecs.toBinary(schema);
type = (StructType) SchemaConverters.toSqlType(schema).dataType();
}
sparkSession.udf().register("deserialize", (byte[] data) -> {
GenericRecord record = recordInjection.invert(data).get();
return RowFactory.create(record.get("student").toString(), record.get("subject").toString());
}, DataTypes.createStructType(type.fields()));
Dataset<Row> ds2 = ds1
.select("value").as(Encoders.BINARY())
.selectExpr("deserialize(value) as rows")
.select("rows.*")
.selectExpr("student","subject");
StreamingQuery query1 = ds2
.writeStream()
.foreachBatch(
new VoidFunction2<Dataset<Row>, Long>() {
#Override
public void call(Dataset<Row> rowDataset, Long aLong) throws Exception {
rowDataset.select("student,concat(',',subject)").alias("value").groupBy("student");
}
}
).format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "new_in")
.option("checkpointLocation", "checkpoint")
.outputMode("append")
.trigger(Trigger.ProcessingTime(10000))
.start();
query1.awaitTermination();
Kafka Producer console:
Student:Test, Subject:x
Student:Test, Subject:y
Student:Test, Subject:z
Student:Test1, Subject:x
Student:Test2, Subject:x
Student:Test, Subject:w
Student:Test1, Subject:y
Student:Test2, Subject:y
Student:Test, Subject:v
In the Kafka consumer console, I am expecting like below.
Test:{x,y,z,w,v} =>This should be the first response
Test1:{x,y} => second
Test2:{x,y} => Third by the end of processing time
Related
I am new to Spark.
I want to keep getting message from kafka, and then save to S3 once the message size over 100000.
I implemented it by Dataset.collectAsList(), but it throw error with Total size of serialized results of 3 tasks (1389.3 MiB) is bigger than spark.driver.maxResultSize
So I turned to use foreach, and it said null point exception when used SparkSession to createDataFrame.
Any idea about it? Thanks.
---Code---
SparkSession spark = generateSparkSession();
registerUdf4AddPartition(spark);
Dataset<Row> dataset = spark.readStream().format("kafka")
.option("kafka.bootstrap.servers", args[0])
.option("kafka.group.id", args[1])
.option("subscribe", args[2])
.option("kafka.security.protocol", SecurityProtocol.SASL_PLAINTEXT.name)
.load();
DataStreamWriter<Row> console = dataset.toDF().writeStream().foreachBatch((rawDataset, time) -> {
Dataset<Row> rowDataset = rawDataset.selectExpr("CAST(value AS STRING)");
//using foreach
rowDataset.foreach(row -> {
List<Span> rawDataList = new CsvToBeanBuilder(new StringReader(row.getString(0))).withType(Span.class).build().parse();
spans.addAll(rawDataList);
batchSave(spark);
});
// using collectAsList
List<Row> rows = rowDataset.collectAsList();
for (Row row : rows) {
List<Span> rawDataList = new CsvToBeanBuilder(new StringReader(row.getString(0))).withType(Span.class).build().parse();
spans.addAll(rawDataList);
batchSave(spark);
}
});
StreamingQuery start = console.start();
start.awaitTermination();
public static void batchSave(SparkSession spark){
synchronized (spans){
if(spans.size() == 100000){
System.out.println(spans.isEmpty());
Dataset<Row> spanDataSet = spark.createDataFrame(spans, Span.class);
Dataset<Row> finalResult = addCustomizedTimeByUdf(spanDataSet);
StringBuilder pathBuilder = new StringBuilder("s3a://fwk-dataplatform-np/datalake/log/FWK/ART2/test/leftAndRight");
finalResult.repartition(1).write().partitionBy("year","month","day","hour").format("csv").mode("append").save(pathBuilder.toString());
spans.clear();
}
}
}
Since the main SparkSession is running in driver, and tasks in foreach... is running distributed in executors, so the spark is not defined to all other executors.
BTW, there is no meaning to use synchronized inside foreach task since everything is distributed.
Before sending an Avro GenericRecord to Kafka, a Header is inserted like so.
ProducerRecord<String, byte[]> record = new ProducerRecord<>(topicName, key, message);
record.headers().add("schema", schema);
Consuming the record.
When using Spark Streaming, the header from the ConsumerRecord is intact.
KafkaUtils.createDirectStream(streamingContext, LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, byte[]>Subscribe(topics, kafkaParams)).foreachRDD(rdd -> {
rdd.foreach(record -> {
System.out.println(new String(record.headers().headers("schema").iterator().next().value()));
});
});
;
But when using Spark SQL Streaming, the header seems to be missing.
StreamingQuery query = dataset.writeStream().foreach(new ForeachWriter<>() {
...
#Override
public void process(Row row) {
String topic = (String) row.get(2);
int partition = (int) row.get(3);
long offset = (long) row.get(4);
String key = new String((byte[]) row.get(0));
byte[] value = (byte[]) row.get(1);
ConsumerRecord<String, byte[]> record = new ConsumerRecord<String, byte[]>(topic, partition, offset, key,
value);
//I need the schema to decode the Avro!
}
}).start();
Where can I find the custom header value when using Spark SQL Streaming approach?
Version:
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.5</version>
UPDATE
I tried 3.0.0-preview2 of spark-sql_2.12 and spark-sql-kafka-0-10_2.12. I added
.option("includeHeaders", true)
But I still only get these columns from the Row.
+---+-----+-----+---------+------+---------+-------------+
|key|value|topic|partition|offset|timestamp|timestampType|
+---+-----+-----+---------+------+---------+-------------+
Kafka headers in Structured Streaming supported only from 3.0: https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html
Please look for includeHeaders for more details.
I need to send data from Kafka to Kinesis Firehose. I am processing Kafka data using Spark Structured Streaming. I am not sure how to process the dataset of streaming query into an ArrayList variable - say, recordList - of e.g. 100 records (could be any other value) and then call Firehose API's putRecordBatch(recordList) to put the records into Firehose.
I think you want to check out Foreach and ForeachBatch depending on your Spark Version. ForeachBatch comes in V2.4.0 and foreach is available < V2.4.0. If there is no streaming sink implementation available for Kinesis Firehouse then you should make your own implementation of the ForeachWriter. Databricks has some nice examples of using foreach to create the custom writers.
I haven't ever used Kinesis but here is an example of what your custom sink might look like.
case class MyConfigInfo(info1: String, info2: String)
class KinesisSink(configInfo: MyConfigInfo) extends ForeachWriter[(String, String)] {
val kinesisProducer = _
def open(partitionId: Long,version: Long): Boolean = {
kinesisProducer = //set up the kinesis producer using MyConfigInfo
true
}
def process(value: (String, String)): Unit = {
//ask kinesisProducer to send data
}
def close(errorOrNull: Throwable): Unit = {
//close the kinesis producer
}
}
if you're using that AWS kinesisfirehose API you might do something like this
case class MyConfigInfo(info1: String, info2: String)
class KinesisSink(configInfo: MyConfigInfo) extends ForeachWriter[(String, String)] {
val firehoseClient = _
val req = putRecordBatchRequest = new PutRecordBatchRequest()
val records = 0
val recordLimit = //maybe you need to set this?
def open(partitionId: Long,version: Long): Boolean = {
firehoseClient = //set up the firehose client using MyConfigInfo
true
}
def process(value: (String, String)): Unit = {
//ask fireHose client to send data or batch the request
val record: Record = //create Record out of value
req.setRecords(record)
records = records + 1
if(records >= recordLimit) {
firehoseClient.putRecordBatch(req)
records = 0
}
}
def close(errorOrNull: Throwable): Unit = {
//close the firehose client
//or instead you could put the batch request to the firehose client here but i'm not sure if that's good practice
}
}
Then you'd use it as such
val writer = new KinesisSink(configuration)
val query =
streamingSelectDF
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(ProcessingTime("25 seconds"))
.start()
I have spark 2.0 code which would read .gz(text) files and writes them to the HIVE table.
Can i know How do i ignore the first two lines from all of my files. Just want to skip the first two lines.
SparkSession spark = SparkSession
.builder()
.master("local")
.appName("SparkSessionFiles")
.config("spark.some.config.option", "some-value")
.enableHiveSupport()
.getOrCreate();
JavaRDD<mySchema> peopleRDD = spark.read()
.textFile("file:///app/home/emm/zipfiles/myzips/")
.javaRDD()
.map(new Function<String, mySchema>()
{
#Override
public mySchema call(String line) throws Exception
{
String[] parts = line.split(";");
mySchema mySchema = new mySchema();
mySchema.setCFIELD1 (parts[0]);
mySchema.setCFIELD2 (parts[1]);
mySchema.setCFIELD3 (parts[2]);
mySchema.setCFIELD4 (parts[3]);
mySchema.setCFIELD5 (parts[4]);
return mySchema;
}
});
// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> myDF = spark.createDataFrame(peopleRDD, mySchema.class);
myDF.createOrReplaceTempView("myView");
spark.sql("INSERT INTO myHIVEtable SELECT * from myView");
UPDATE: Modified code
Lambdas are not working on my eclipse. So used regular java syntax. I am getting an exceception now.
.....
Function2 removeHeader= new Function2<Integer, Iterator<String>, Iterator<String>>(){
public Iterator<String> call(Integer ind, Iterator<String> iterator) throws Exception {
System.out.println("ind="+ind);
if((ind==0) && iterator.hasNext()){
iterator.next();
iterator.next();
return iterator;
}else
return iterator;
}
};
JavaRDD<mySchema> peopleRDD = spark.read()
.textFile(path) //file:///app/home/emm/zipfiles/myzips/
.javaRDD()
.mapPartitionsWithIndex(removeHeader,false)
.map(new Function<String, mySchema>()
{
........
Java.util.NoSuchElementException
at java.util.LinkedList.removeFirst(LinkedList.java:268)
at java.util.LinkedList.remove(LinkedList.java:683)
at org.apache.spark.sql.execution.BufferedRowIterator.next(BufferedRowIterator.java:49)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.next(WholeStageCodegenExec.scala:374)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.next(WholeStageCodegenExec.scala:368)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at com.comcast.emm.vodip.SparkSessionFiles.SparkSessionFiles$1.call(SparkSessionFiles.java:2480)
at com.comcast.emm.vodip.SparkSessionFiles.SparkSessionFiles$1.call(SparkSessionFiles.java:2476)
You could do something like that :
JavaRDD<mySchema> peopleRDD = spark.read()
.textFile("file:///app/home/emm/zipfiles/myzips/")
.javaRDD()
.mapPartitionsWithIndex((index, iter) -> {
if (index == 0 && iter.hasNext()) {
iter.next();
if (iter.hasNext()) {
iter.next();
}
}
return iter;
}, true);
...
In Scala, it the syntax is simpler. For example :
rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(2) else iter }
EDIT :
I modified the code to avoid the Exception.
This code will only delete the first 2 lines of the RDD, not of every files.
If you want to remove the first 2 lines of every file, I suggest you do a RDD for each file, apply the .mapPartitionWithIndex(...) for each RDD, then do a union of each RDD.
I want to create an API which looks like this
public Dataset<Row> getDataFromKafka(SparkContext sc, String topic, StructType schema);
here
topic - is Kafka topic name from which the data is going to be consumed.
schema - is schema information for Dataset
so my function contains following code :
JavaStreamingContext jsc = new JavaStreamingContext(javaSparkContext, Durations.milliseconds(2000L));
JavaPairInputDStream<String, String> directStream = KafkaUtils.createDirectStream(
jsc, String.class, String.class,
StringDecoder.class, StringDecoder.class,
kafkaConsumerConfig(), topics
);
Dataset<Row> dataSet = sqlContext.createDataFrame(javaSparkContext.emptyRDD(), schema);
DataSetHolder holder = new DataSetHolder(dataSet);
LongAccumulator stopStreaming = sc.longAccumulator("stop");
directStream.foreachRDD(rdd -> {
RDD<Row> rows = rdd.values().map(value -> {
//get type of message from value
Row row = null;
if (END == msg) {
stopStreaming.add(1);
row = null;
} else {
row = new GenericRow(/*row data created from values*/);
}
return row;
}).filter(row -> row != null).rdd();
holder.union(sqlContext.createDataFrame(rows, schema));
holder.get().count();
});
jsc.start();
//stop stream if stopStreaming value is greater than 0 its spawned as new thread.
return holder.get();
Here DatasetHolder is a wrapper class around Dataset to combine the result of all the rdds.
class DataSetHolder {
private Dataset<Row> df = null;
public DataSetHolder(Dataset<Row> df) {
this.df = df;
}
public void union(Dataset<Row> frame) {
this.df = df.union(frame);
}
public Dataset<Row> get() {
return df;
}
}
This doesn't looks good at all but I had to do it. I am wondering what is the good way to do it. Or is there any provision for this by Spark?
Update
So after consuming all the data from stream i.e. from kafka topic, we create a dataframe out of it so that the data analyst can register it as a temp table and can fire any query to get the meaningful result.