Split values from a Kafka consumer record using grok in Logstash

I need to split the values from this Kafka consumer record using grok so that I can send them to Elasticsearch with Logstash:
ConsumerRecord(topic='testtopic', partition=0, offset=31748, timestamp=1675771249797, timestamp_type=0, key=None, value=b'{"Device-SID":"b3b2b4b-ccfe-2345-b325-cdvg567nbhgye","Device-SN":"Serial Number","File-Class":"arcus_data_file","File-Name":"test.230207.050006.8660.pdf","File-Size":"1186","File-URL":"https://something.com","File-UUID":"8cb14264-cvgfrt-werty-8599-2345671","Product-Class":"arcus","Product-ID":"tj56uubxxxx","Upload-Date":"2023-02-07T12:00:28Z"}', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=451, serialized_header_size=-1)
Expected fields:
topic
partition
offset
timestamp
Device-SID
Device-SN
File-Class
File-Name
File-Size
File-URL
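One way to approach this (a rough sketch, not tested against your pipeline; it assumes the whole ConsumerRecord(...) text arrives in the message field, and the field name kafka_value is made up) is to let grok capture the record metadata plus the embedded JSON, then let the json filter expand the payload:

filter {
  grok {
    # capture topic, partition, offset, timestamp and the raw JSON payload from the ConsumerRecord(...) text
    match => {
      "message" => "topic='%{DATA:topic}', partition=%{NUMBER:partition}, offset=%{NUMBER:offset}, timestamp=%{NUMBER:timestamp},.*value=b'%{DATA:kafka_value}'"
    }
  }
  json {
    # turn the captured JSON into top-level fields (Device-SID, Device-SN, File-Class, File-Name, File-Size, File-URL, ...)
    source => "kafka_value"
  }
}

Adjust the pattern to however the record string actually reaches Logstash; the essential idea is grok for the ConsumerRecord metadata and the json filter for the Device-*/File-* fields.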

Related

Stream writes having multiple identical keys to Delta Lake

I am writing streams to Delta Lake through Spark Structured Streaming. Each streaming batch contains key-value pairs (and also a timestamp as one column). Delta Lake doesn't support updates when the source (the streaming batch) has multiple rows with the same key, so I want to update Delta Lake with only the record that has the latest timestamp for each key. How can I do this?
This is the code snippet I am trying:
def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {
  println(s"Executing batch $batchId ...")
  microBatchOutputDF.show()

  deltaTable.as("t")
    .merge(
      microBatchOutputDF.as("s"),
      "s.key = t.key")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}
Thanks in advance.
You can eliminate records with older timestamps from your "microBatchOutputDF" DataFrame and keep only the record with the latest timestamp for each key.
You can use Spark's reduceByKey operation and implement a custom reduce function as below.
import java.sql.Timestamp
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

def getLatestEvents(input: DataFrame): RDD[Row] = {
  input.rdd.map(x => (x.getAs[String]("key"), x)).reduceByKey(reduceFun).map(_._2) }

def reduceFun(x: Row, y: Row): Row = {
  if (x.getAs[Timestamp]("timestamp").getTime > y.getAs[Timestamp]("timestamp").getTime) x else y }
It is assumed that key is of type string and timestamp is of type timestamp. Call "getLatestEvents" on your streaming batch 'microBatchOutputDF'; it ignores events with older timestamps and keeps only the latest one per key.
val latestRecordsDF = spark.createDataFrame(getLatestEvents(microBatchOutputDF), <schema of DF>)
Then call the Delta Lake merge operation on top of 'latestRecordsDF'.
In streaming, a microbatch might contain more than one record for a given key. In order to update the target table, you have to figure out the latest record for each key in the microbatch. In your case you can use the max of the timestamp column (together with the value column) to find the latest record and use that one for the merge operation.
You can refer to this link for more details on finding the latest record for a given key.
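As a sketch of that "latest record per key" idea (the key and timestamp column names and the deltaTable reference are taken from the snippets above; the rest is an assumption, not tested against your table), you could also deduplicate the microbatch with a window function before merging:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long): Unit = {
  println(s"Executing batch $batchId ...")

  // rank the rows within each key by timestamp, newest first, and keep only the top row
  val byKeyNewestFirst = Window.partitionBy("key").orderBy(col("timestamp").desc)
  val latestRecordsDF = microBatchOutputDF
    .withColumn("rn", row_number().over(byKeyNewestFirst))
    .filter(col("rn") === 1)
    .drop("rn")

  // now each key matches at most one source row, so the merge no longer hits duplicate-key errors
  deltaTable.as("t")
    .merge(latestRecordsDF.as("s"), "s.key = t.key")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}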

Stream Analytics join query not working - returns 0 rows

I have two inputs for my Stream Analytics job: one is CSV and the other is JSON. I am running a join query in Stream Analytics but it is not working.
This is my query:
SELECT
    i1.Serial_No, i2.Customer_Id
FROM input1 i1
JOIN input2 i2
    ON i1.Serial_No = i2.Serial_No
Sample data:
JSON:
{
  "Serial_No": "12345",
  "Device_type": "Owned"
}
CSV:
"Serial_No,Customer_Id"
12345,12345
Can anyone please help me with this?
When using a join query in Stream Analytics, the comma delimiter did not work for the CSV input, but the tab delimiter does work.

Kafka message mapping (using Java Spark)

I have a Kafka topic that contains JSON messages, for example:
{"jsonCode":"1234", "jsonData":{.....}}
{"jsonCode":"1234", "jsonData":{.....}}
{"jsonCode":"1235", "jsonData":{.....}}
{"jsonCode":"1235", "jsonData":{.....}}
{"jsonCode":"1236", "jsonData":{.....}}
My question is whether I can create the following hash map while reading from the topic:
["1234", [list of JSONs with jsonCode 1234]]
["1235", [list of JSONs with jsonCode 1235]]
["1236", [list of JSONs with jsonCode 1236]]
Is this possible? How can I do this mapping?
I want to read from Kafka using Spark Streaming, get all the unread messages on the topic, and build the hash map.
Thanks.
Do you have any consumer configuration set up in your code? The consumer config typically requires key and value deserializer settings.
Check that, while reading from the topic, you are able to read the messages as key-value pairs. Typically your consumer should look like this:
final Consumer<String, String> consumer; // consumer created with your consumer config
final ConsumerRecords<String, String> consumerRecords = consumer.poll(pollvals);
consumerRecords.forEach(record -> {
    System.out.printf("[Consumer Record:(key - %s, value - %s, partition - %d, offset - %d)]%n",
            record.key(), record.value(), record.partition(), record.offset());
    // parse your JSON from either the key or the value
    String value = null;
    // ...
    value = jsonparser(record.value()); // let's parse from the value
    // ...
});
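If you would rather do the grouping in Spark itself instead of a plain Kafka consumer, a minimal Structured Streaming sketch could look like the following (shown in Scala for brevity, the Java API is analogous; the broker address, topic name and console sink are assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("jsonCodeGrouping").getOrCreate()
import spark.implicits._

// read the topic as a stream; broker and topic are placeholders
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "mytopic")
  .option("startingOffsets", "earliest") // start from older, still-unread messages
  .load()

// pull jsonCode out of each JSON payload and collect the raw messages under it
val grouped = raw
  .selectExpr("CAST(value AS STRING) AS json")
  .withColumn("jsonCode", get_json_object($"json", "$.jsonCode"))
  .groupBy($"jsonCode")
  .agg(collect_list($"json").as("messages"))

// collect_list is a streaming aggregation, so complete (or update) output mode is needed
grouped.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()

Materialising the result as an actual hash map on the driver only makes sense for small topics; for anything larger, keeping it as the grouped DataFrame above is the safer design.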

Spark Streaming design question

I am new to Spark. I want to set up Spark Streaming to retrieve key-value pairs from files of the format below:
file: info1
Note: Each info file will have around 1000 of these records, and our system is continuously generating these info files. Through Spark Streaming I want to map line numbers to info files and get an aggregate result.
Can we feed these kinds of files as input to the Spark cluster? I am interested in the "SF" and "DA" delimiters only: "SF" corresponds to the source file and "DA" corresponds to (line number, count).
Since this input data is not in a line-per-record format, is it a good idea to use these files directly as Spark input, or do I need an intermediate stage where I clean these files and generate new files that have each record's information on a single line instead of in blocks?
Or can we achieve this in Spark itself? What should be the right approach?
What I want to achieve:
I want to get line-level information, i.e. the line (as a key) and the info files (as values).
The final output I want is like below:
line178 -> (info1, info2, info7.................)
line 2908 -> (info3, info90, ..., ... ,)
Do let me know if my explanation is not clear or if I am missing something.
Thanks & Regards,
Vinti
You could do something like this, given your DStream stream:
// this gives you DA & FP lines, with the line number as the key
val validLines = stream.map(_.split(":"))
  .filter(parts => Seq("DA", "FP").contains(parts(0)))
  .map(parts => parts(1).split(","))
  .map(fields => (fields(0), fields(1)))

// now you should accumulate values
val state = validLines.updateStateByKey[Seq[String]](updateFunction _)

def updateFunction(newValues: Seq[String], runningValues: Option[Seq[String]]): Option[Seq[String]] = {
  // append the new values to whatever has been accumulated for this key
  val newVals = runningValues match {
    case Some(list) => list ++ newValues
    case _ => newValues
  }
  Some(newVals)
}
This should accumulate, for each key, a sequence of the associated values, storing it in state.
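One practical detail worth adding (a sketch under the assumption that the info files land in a watched directory; the paths, app name and batch interval below are made up): updateStateByKey carries state from batch to batch, so the StreamingContext needs a checkpoint directory, roughly like this:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("info-file-aggregation")
val ssc = new StreamingContext(conf, Seconds(30))

// required for updateStateByKey, which keeps state across batches
ssc.checkpoint("/tmp/info-checkpoint")

// watch the directory where the system keeps dropping new info files
val stream = ssc.textFileStream("/data/incoming-info-files")

// ... build validLines and state as in the snippet above ...

ssc.start()
ssc.awaitTermination()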

Logstash: maintaining the order in which data is read

I have a single Logstash instance reading data from multiple files. I want to maintain the order in which data is updated in Elasticsearch, since the _id field is the key.
So if there are two records in the input files with the same key, they have to be applied in order.
How do I enforce ordering from a source in Logstash?
Input file 1:
Key = A1 , Data = abc , time=5:51 PM
Key = B1 , Data = efg , time=5:52 PM
Key = C1 , Data = hij , time=5:53 PM
Input file 2:
Key = A1 , Data = klm, time=5:50 PM
These files will be read by two threads in Logstash, and there may be two filter threads formatting the data.
The output goes to Elasticsearch with the _id set from the Key field:
output {
  elasticsearch {
    embedded => true
    index => "samples6"
    index_type => "sample"
    document_id => "%{Key}"
  }
}
How do I ensure that Key=A1 ends up with Data=abc, not "klm"?
If the data that needs to be processed in order is read from different files there's no way of doing this since Logstash doesn't maintain an ordered queue of events. If you have more than one filter worker (i.e. start Logstash with -w/--filterworkers greater than one) there's no order guarantee even if you read from a single file.
You'll have to write something yourself to get the ordering right. Perhaps Apache Samza could be of use.
