I'm using: NiFi v1.8.0 and Logstash v7.1.1
I've been tasked with moving all our Logstash configurations over to NiFi. I am trying to understand how NiFi's ExtractGrok works, but I can't find any examples. How is it intended to be used? And how can you set a NiFi attribute with this grok processor? By examples, I mean actual examples that show a before and after, so people can understand what's going on. I've read the NiFi ExtractGrok documentation, but it's very limited and seems to assume you already understand how it works.
This is the only example I've been able to find: How to fetch multiline with ExtractGrok processor in ApacheNifi?
Based on what you describe, the processor you need is ConvertRecord rather than ExtractGrok. ExtractGrok will only extract certain fields into FlowFile attributes or content.
If you want to transform your log files into a workable format (like JSON, if you want to send those files to Elasticsearch), use ConvertRecord with GrokReader as the Record Reader and JsonRecordSetWriter as the Record Writer.
Then configure your Schema Text (or use a Schema Registry) in both the Record Reader and the Record Writer to be your schema, and set Grok Expression in your GrokReader to be your grok expression.
For example:
My log messages look like this:
2019-12-09 07:59:59,136 this is the first log message
2019-12-09 09:59:59,136 this is the first log message with a stack trace: org.springframework.boot.actuate.jdbc.DataSourceHealthIndicator - DataSource health check failed
org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory (Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:81)......
So, my grok would be:
%{TIMESTAMP_ISO8601:timestamp}\s+%{GREEDYDATA:log_message}
and my schema would be:
{
  "name": "MyClass",
  "type": "record",
  "namespace": "com.acme.avro",
  "fields": [
    {
      "name": "timestamp",
      "type": "string"
    },
    {
      "name": "log_message",
      "type": "string"
    },
    {
      "name": "stackTrace",
      "type": "string"
    }
  ]
}
Note the stackTrace field I've added to the schema. GrokReader automatically maps stack traces into their own field, so you have to add a stackTrace field if you want to capture them too. Then, if you want, you can fold it into the log_message field using Jolt (see the sketch after the output below).
The output of this ConvertRecord would be:
[ {
  "timestamp" : "2019-12-09 07:59:59,136",
  "log_message" : "this is the first log message",
  "stackTrace" : null
}, {
  "timestamp" : "2019-12-09 09:59:59,136",
  "log_message" : "this is the first log message with a stack trace: org.springframework.boot.actuate.jdbc.DataSourceHealthIndicator - DataSource health check failed",
  "stackTrace" : "org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory (Communications link failure\nThe last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.)\nat org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:81)......"
} ]
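If you do want to fold stackTrace into log_message afterwards, a JoltTransformJSON processor with a chain spec roughly like the one below could do it. Treat it as an untested sketch: it assumes stackTrace is always present (you would want to guard for the null case, for example by routing records without a stack trace around the Jolt step), and it simply concatenates the two fields with a space before dropping stackTrace.

[
  {
    "operation": "modify-overwrite-beta",
    "spec": {
      "*": {
        "log_message": "=concat(@(1,log_message),' ',@(1,stackTrace))"
      }
    }
  },
  {
    "operation": "remove",
    "spec": {
      "*": {
        "stackTrace": ""
      }
    }
  }
]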
Related
I am using Alteryx to take an Excel file and convert it to JSON. The JSON output I'm getting looks different from what I was expecting, and each object starts with "JSON":, which I don't want. I would also like to know which components I would use later in the flow, if needed, to map fields to specific JSON fields instead of key-value pairs.
I have attached my sample workflow and Excel file:
Excel screenshot
Alteryx test flow
JSON output I am seeing:
[
  {
    "JSON": "{\"email\":\"test123#test.com\",\"startdate\":\"2020-12-01\",\"isEnabled\":\"0\",\"status\":\"active\"}"
  },
  {
    "JSON": "{\"email\":\"myemail#emails.com\",\"startdate\":\"2020-12-02\",\"isEnabled\":\"1\",\"status\":\"active\"}"
  }
]
What I expected:
[
  {
    "email": "test123#test.com",
    "startdate": "2020-12-01",
    "isEnabled": "0",
    "status": "active"
  },
  {
    "email": "myemail#emails.com",
    "startdate": "2020-12-02",
    "isEnabled": "1",
    "status": "active"
  }
]
Also, what component would I use if I wanted to map the structure above to another JSON structure similar to this one:
[
  {
    "name": "MyName",
    "accounType": "array",
    "contactDetails": {
      "email": "test123#test.com",
      "startDate": "2020-12-01"
    }
  }
]
Thanks
In the workflow you have built, you are essentially creating the JSON twice. The JSON Build tool creates the JSON structure, so if you then want to output it, select your file in the Output tool and change the format dropdown to CSV with delimiter \0 and no headers.
However, if you put an Output tool straight after your Excel file and output to JSON, the Output tool will build the JSON for you.
In answer to your second question, build the JSON for Contact Details first as a field (remember to rename JSON to contactDetails). Then build from there with one of the above options.
Sample message from Azure Event Hubs logstash plugin:
https://pastebin.com/b8WnQHug
I would like the output to be:
{
  "operationName": "Microsoft.ContainerService/managedClusters/diagnosticLogs/Read",
  "category": "kube-apiserver",
  "ccpNamespace": "5d764286d7481f0001d4b054",
  "resourceId": "/SUBSCRIPTIONS/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/RESOURCEGROUPS/MY-RG/PROVIDERS/MICROSOFT.CONTAINERSERVICE/MANAGEDCLUSTERS/MY-AKS",
  "properties": {
    "log": "First line from record\n Second line from another record\n Third line from another record \n etc. from another record",
    "stream": "stderr",
    "pod": "kube-apiserver-8b5b9cd44-khjfk",
    "containerID": "4c2ddb8ba9639ae9c88f728d850d550473eb36f4eb3e1d99c3f052b87cff9357"
  },
  "time": "2019-10-16T13:44:16.0000000Z",
  "Cloud": "Public",
  "Environment": "prod"
}
Main fields:
time (as timestamp)
pod (name of the pod)
stream (event type)
log (the worst part: the log field should be concatenated from other message.records[] entries that share the same time and containerID)
Elasticsearch has an experimental Azure module; here is the source code/filter for Logstash:
https://github.com/elastic/logstash/blob/master/x-pack/modules/azure/configuration/logstash/azure.conf.erb
I don't need that much complexity.
I guess I need:
split filter for new fields
date filter for message.records[].timestamp
"something" to find all message.records with same message.records[].time and message.records[].properties.containerID fields and concatenate message.records[].properties.log field
Can anyone help?
Thanks
EDIT: I think I will also have to consider this:
https://www.elastic.co/guide/en/logstash/current/plugins-filters-aggregate.html
Although in probably 90% of cases all the multiline logs will arrive in a single event, there is a chance they will be split across multiple events.
Another problem is that aggregate does not scale (the Azure Event Hub plugin can), so aggregate will be a bottleneck.
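For reference, this is roughly the shape of filter I have in mind (an untested sketch; the field paths are guesses based on the pastebin sample, and aggregate forces a single pipeline worker):

filter {
  # Each Event Hubs message carries a records[] array; split it so every
  # record becomes its own event. Adjust the field path if your input
  # nests it differently (e.g. [message][records]).
  split {
    field => "records"
  }

  # Use the record's own time as @timestamp.
  date {
    match => [ "[records][time]", "ISO8601" ]
    target => "@timestamp"
  }

  # Group records that share time + containerID and concatenate their
  # log lines. Note: aggregate only works with pipeline.workers set to 1.
  aggregate {
    task_id => "%{[records][time]}_%{[records][properties][containerID]}"
    code => "
      map['time'] ||= event.get('[records][time]')
      map['containerID'] ||= event.get('[records][properties][containerID]')
      map['log'] ||= ''
      map['log'] += event.get('[records][properties][log]').to_s
      event.cancel
    "
    push_map_as_event_on_timeout => true
    timeout => 5
  }
}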
I have a Spring Cloud Stream (Elmhurst Release) Splitter with a RabbitMQ binder. I'm trying to update it so that it uses an Avro schema for the payloads. I'm getting a conversion exception which makes it look like the Avro converter isn't being invoked and the JSON converter picks up the message and trips on it.
Caused by: org.springframework.messaging.converter.MessageConversionException: Could not write JSON: Not a map: {"type":"record","name":"SkinnyMessage","namespace":"com.example.avro","doc":"Light message for passing references to Message objects","fields":[{"name":"id","type":"string"},{"name":"guid","type":"string"}]} (through reference chain: com.example.avro.SkinnyMessage["schema"]->org.apache.avro.Schema$RecordSchema["valueType"]); nested exception is com.fasterxml.jackson.databind.JsonMappingException: Not a map: {"type":"record","name":"SkinnyMessage","namespace":"com.example.avro","doc":"Light message for passing references to Message objects","fields":[{"name":"id","type":"string"},{"name":"guid","type":"string"}]} (through reference chain: com.example.avro.SkinnyMessage["schema"]->org.apache.avro.Schema$RecordSchema["valueType"])
I've confirmed that I can create objects and serialize them to disk using the generated Avro class (avro-maven-plugin), so that part seems right. I've converted the project from Groovy to Java and still get the same error, so I think that's ruled out, too.
Here's the Avro Schema:
{
  "namespace": "com.example.avro",
  "type": "record",
  "name": "SkinnyMessage",
  "doc": "Light message for passing references to Message objects",
  "fields": [
    {
      "name": "id",
      "type": "string"
    },
    {
      "name": "guid",
      "type": "string"
    }
  ]
}
And the relevant part of the class:
@EnableBinding(Processor.class)
@EnableSchemaRegistryClient
class PagingQueryProcessorApplication {
    @Timed(value = 'paging.query')
    @Splitter(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
    List<SkinnyMessage> queryExecutor(def trigger) {
        log.debug 'Building query'
        def query = queryConfiguration.buildQuery().toUriString()
        log.info "Executing query: ${query}"
        def response = service.getRecordings(query)
        log.info "Returning response collection: ${response.body.content.size()}"
        // We build a slim notification on each of the query responses
        def skinnyMessages = response.body.content.collect {
            new SkinnyMessage(it.getLink('self').getHref(), it.content.guid)
        }
        skinnyMessages
    }
    ...
}
Edit: When I step through with a debugger, I can see that the AvroSchemaRegistryClientMessageConverter fails the canConvertTo(payload, headers) call because the mimeType in the headers is application/json and not application/*+avro, so it continues trying the rest of the converter chain.
If I build a message and set an Avro content-type header on it, that works, but it feels like a hack.
List<Message<SkinnyMessage>> skinnyMessages = response.body.content.collect {
    MessageBuilder.withPayload(
            new SkinnyMessage(it.getLink('self').getHref(), it.content.recordingGuid))
        .setHeader('contentType', 'application/*+avro')
        .build()
}
This creates messages which look right in the RabbitMQ UI:
contentType: application/vnd.skinnymessage.v1+avro
correlationId: f8be74d6-f780-efcc-295d-338a8b7f2ea0
content_type: application/octet-stream
Payload
96 bytes
Encoding: string
thttps://example.com/message/2597061H9a688e40-3e30-4b17-80e9-cf4f897e8a91
If I understand the docs correctly, though, this should happen transparently from the setting in application.yml (as in the Schema Registry samples):
spring:
  cloud:
    stream:
      bindings:
        output:
          contentType: application/*+avro
I am doing a spike in which we want to publish data to a Kafka topic as it is written to a Cassandra table. We are looking at using Kafka Connect and the Stream Reactor connectors.
I am using Kafka 0.10.0.1
I am using DataMountaineer Stream Reactor 0.2.4
I placed the jar file for Stream Reactor into the Kafka libs folder and am running Kafka Connect in distributed mode
bin/connect-distributed.sh config/connect-distributed.properties
I added the Cassandra Source connector as follows:
curl -X POST -H "Content-Type: application/json" -d @config/connect-idoc-cassandra-source.json.txt localhost:8083/connectors
When I add data to the Cassandra table I see it being added to the topic using the Kafka command line consumer
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic idocs-topic --from-beginning
Here is a sample of what is being written to the Topic right now:
{
  "schema": {
    "type": "struct",
    "fields": [
      {
        "type": "string",
        "optional": true,
        "field": "idoc_id"
      },
      {
        "type": "string",
        "optional": true,
        "field": "idoc_event_ts"
      },
      {
        "type": "string",
        "optional": true,
        "field": "json_doc"
      }
    ],
    "optional": false,
    "name": "idoc.idocs_events"
  },
  "payload": {
    "idoc_id": "dc4ab8a0-fdf8-11e6-8285-1bce55915fdd",
    "idoc_event_ts": "dc4ab8a1-fdf8-11e6-8285-1bce55915fdd",
    "json_doc": "{\"foo\":\"bar\"}"
  }
}
What I would like written to the topic is the value of the json_doc column.
Here is what I have in my config for the Cassandra source
{
  "name": "cassandra-idocs",
  "config": {
    "tasks.max": "1",
    "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.source.CassandraSourceConnector",
    "connect.cassandra.key.space": "idoc",
    "connect.cassandra.source.kcql": "INSERT INTO idocs-topic SELECT json_doc FROM idocs_events PK idoc_event_ts",
    "connect.cassandra.import.mode": "incremental",
    "connect.cassandra.contact.points": "localhost",
    "connect.cassandra.port": 9042,
    "connect.cassandra.import.poll.interval": 10000
  }
}
How do I change the way the Kafka Connect Cassandra Source is configured so that only the value of json_doc is written to the topic, like this:
{"foo":"bar"}
The Kafka Connect Query Language (KCQL) seemed to be the way to go, but it isn't limiting what is written to the topic to just the column specified in the KCQL.
UPDATE
I saw this answer on Stack Overflow and changed the converters in the connect-distributed.properties file from JsonConverter to StringConverter.
The result is that this is now written to the topic:
Struct{idoc_id=74597cf0-fdf7-11e6-8285-1bce55915fdd,idoc_event_ts=74597cf1-fdf7-11e6-8285-1bce55915fdd,json_doc={"foo":"bar"}}
UPDATE 2
Changed the converters in the connect-distributed.properties file back to JsonConverter. Then also disabled the schemas.
key.converter.schemas.enable=false
value.converter.schemas.enable=false
The result is that this is now written to the topic:
{
  "idoc_id": "dc4ab8a0-fdf8-11e6-8285-1bce55915fdd",
  "idoc_event_ts": "dc4ab8a1-fdf8-11e6-8285-1bce55915fdd",
  "json_doc": "{\"foo\":\"bar\"}"
}
Note
Using code from the snapshot release and changing the KCQL to
INSERT INTO idocs-topic
SELECT json_doc, idoc_event_ts
FROM idocs_events
IGNORE idoc_event_ts
PK idoc_event_ts
Yields this result on the Topic
{"json_doc": "{\"foo\":\"bar\"}"}
Thanks
Turns out what I was attempting to do was not possible in the Cassandra Source in DataMountaineer Stream Reactor 0.2.4. However, the snapshot release (of what I assume will become release 0.2.5) will support this.
Here is how it will work:
1) Set the converters in the connect-distributed.properties file to StringConverter.
2) Set the KCQL in the JSON configuration for the Cassandra Source connector to
INSERT INTO idocs-topic
SELECT json_doc, idoc_event_ts
FROM idocs_events
IGNORE idoc_event_ts
PK idoc_event_ts
WITHUNWRAP
This will result in the value of the json_doc column being published to the Kafka Topic without any schema information or the column name itself.
So if the column json_doc contained the value {"foo":"bar"} then this is what would appear on the Topic:
{"foo":"bar"}
Here is some background information on how the KCQL works in the snapshot release.
The SELECT will now retrieve only the columns in that table that are specified in the KCQL. Originally it was always retrieving all of the columns. It is important to note that the PK column must be part of the SELECT statement when using the incremental import mode. If the value of the PK column is not something that should be included in the message published to the Kafka Topic then add it to the IGNORE statement (as in the example above).
WITHUNWRAP is a new KCQL feature that tells the Cassandra Source connector to create a SourceRecord using the String schema type (instead of Struct). In this mode, only the values of the columns in the SELECT statement are stored as the value of the SourceRecord. If more than one column remains in the SELECT statement after applying the IGNORE statement, the values are appended together, separated by a comma.
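Putting it together, the connector configuration from the question would look roughly like this with the new KCQL (same keyspace, table, and topic names as above, and with the converters in connect-distributed.properties set to StringConverter):

{
  "name": "cassandra-idocs",
  "config": {
    "tasks.max": "1",
    "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.source.CassandraSourceConnector",
    "connect.cassandra.key.space": "idoc",
    "connect.cassandra.source.kcql": "INSERT INTO idocs-topic SELECT json_doc, idoc_event_ts FROM idocs_events IGNORE idoc_event_ts PK idoc_event_ts WITHUNWRAP",
    "connect.cassandra.import.mode": "incremental",
    "connect.cassandra.contact.points": "localhost",
    "connect.cassandra.port": 9042,
    "connect.cassandra.import.poll.interval": 10000
  }
}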
I am trying to write a logstash filter for my Java logs so that I can insert them into my database cleanly.
Below is an example of my log format:
FINE 2016-01-28 22:20:42.614+0000 net.myorg.crypto.CryptoFactory:getInstance:73:v181328
AppName : MyApp AssocAppName:
Host : localhost 127.000.000.001 AssocHost:
Thread : http-bio-8080-exec-5[23]
SequenceId: -1
Logger : net.myorg.crypto.CryptoFactory
Message : ENTRY
---
FINE 2016-01-28 22:20:42.628+0000 net.myorg.crypto.CryptoFactory:getInstance:75:v181328
AppName : MyApp AssocAppName:
Host : localhost 127.000.000.001 AssocHost:
Thread : http-bio-8080-exec-5[23]
SequenceId: -1
Logger : net.myorg.crypto.CryptoFactory
Message : RETURN
---
My logstash-forwarder config is pretty simple. It just includes all the logs in the directory (they all have the same format as above):
"files": [
{
"paths": [ "/opt/logs/*.log" ],
"fields": { "type": "javaLogs" }
}
]
The trouble I'm having is on the logstash side. How can I write a filter in logstash to match this log format?
Using something like this gets me close:
filter {
  if [type] == "javaLogs" {
    multiline {
      pattern => "^%{TIMESTAMP_ISO8601}"
      negate => true
      what => "previous"
    }
  }
}
But I want to break each line in the log down to its own mapping in logstash. For example, creating fields like AppName, AssocHost, Host, Thread, etc.
I think the answer is using grok.
Joining them with multiline (the codec or filter, depending on your needs) is a great first step.
Unfortunately, your pattern says "if the log entry doesn't start with a timestamp, join it with the previous entry".
Note that none of your log lines start with a timestamp: they start with a level word (FINE) followed by the timestamp, so the pattern never matches and every line gets joined to the previous one.
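Something along these lines might work as a starting point. It is an untested sketch: the multiline pattern accounts for the leading level word, and the grok patterns (one per line, with break_on_match disabled so all of them are applied) and field names are guesses you will need to adapt:

filter {
  if [type] == "javaLogs" {
    # A new entry starts with a level word followed by the timestamp,
    # e.g. "FINE 2016-01-28 22:20:42.614+0000 ..."
    multiline {
      pattern => "^%{WORD}\s+%{TIMESTAMP_ISO8601}"
      negate => true
      what => "previous"
    }

    # Pull one field per line out of the joined event.
    grok {
      break_on_match => false
      match => {
        "message" => [
          "^%{WORD:level}\s+%{TIMESTAMP_ISO8601:timestamp}\s+%{NOTSPACE:source}",
          "^AppName\s*:\s*%{WORD:AppName}",
          "^Host\s*:\s*%{HOSTNAME:Host}\s+%{IP:HostIP}",
          "^Thread\s*:\s*%{NOTSPACE:Thread}",
          "^SequenceId\s*:\s*%{INT:SequenceId}",
          "^Logger\s*:\s*%{NOTSPACE:Logger}",
          "^Message\s*:\s*%{GREEDYDATA:Message}"
        ]
      }
    }
  }
}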