Logstash: replace field values matching pattern - linux

I'm using Elasticsearch + Logstash + Kibana for Windows event log analysis, and I get the following log entry:
{
"_index": "logstash-2015.04.16",
"_type": "logs",
"_id": "Ov498b0cTqK8W4_IPzZKbg",
"_score": null,
"_source": {
"EventTime": "2015-04-16 14:12:45",
"EventType": "AUDIT_FAILURE",
"EventID": "4656",
"Message": "A handle to an object was requested.\r\n\r\nSubject:\r\n\tSecurity ID:\t\tS-1-5-21-2832557239-2908104349-351431359-3166\r\n\tAccount Name:\t\ts.tekotin\r\n\tAccount Domain:\t\tIAS\r\n\tLogon ID:\t\t0x88991C8\r\n\r\nObject:\r\n\tObject Server:\t\tSecurity\r\n\tObject Type:\t\tFile\r\n\tObject Name:\t\tC:\\Folders\\Общая (HotSMS)\\Test_folder\\3\r\n\tHandle ID:\t\t0x0\r\n\tResource Attributes:\t-\r\n\r\nProcess Information:\r\n\tProcess ID:\t\t0x4\r\n\tProcess Name:\t\t\r\n\r\nAccess Request Information:\r\n\tTransaction ID:\t\t{00000000-0000-0000-0000-000000000000}\r\n\tAccesses:\t\tReadData (or ListDirectory)\r\n\t\t\t\tReadAttributes\r\n\t\t\t\t\r\n\tAccess Reasons:\t\tReadData (or ListDirectory):\tDenied by\tD:(D;OICI;CCDCLCSWRPWPLOCRSDRC;;;S-1-5-21-2832557239-2908104349-351431359-3166)\r\n\t\t\t\tReadAttributes:\tGranted by ACE on parent folder\tD:(A;OICI;0x1200a9;;;S-1-5-21-2832557239-2908104349-351431359-3166)\r\n\t\t\t\t\r\n\tAccess Mask:\t\t0x81\r\n\tPrivileges Used for Access Check:\t-\r\n\tRestricted SID Count:\t0",
"ObjectServer": "Security",
"ObjectName": "C:\\Folders\\Общая (HotSMS)\\Test_folder\\3",
"HandleId": "0x0",
"PrivilegeList": "-",
"RestrictedSidCount": "0",
"ResourceAttributes": "-",
"#timestamp": "2015-04-16T11:12:45.802Z"
},
"sort": [
1429182765802,
1429182765802
]
}
I get many log messages with different EventIDs, and when I receive a log entry with EventID 4656, I want to replace the value "4656" with the string "Access Failure". Is there a way to do so?

You can do it when you are loading with logstash -- just do something like this:
filter {
  if [EventID] == "4656" {
    mutate {
      replace => [ "EventID", "Access Failure" ]
    }
  }
}

If you have a lot of values, look at translate{}:
translate {
  dictionary => [
    "4656", "Access Failure",
    "1234", "Another Value"
  ]
  field => "EventID"
  destination => "EventName"
}
I don't think translate{} will let you replace the original field. You could remove it, though, in favor of the new field.
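For example, a rough sketch combining the two (the EventName destination and the mutate remove_field step are my additions, not something the translate docs mandate):
filter {
  translate {
    field       => "EventID"
    destination => "EventName"
    dictionary  => [
      "4656", "Access Failure",
      "1234", "Another Value"
    ]
  }
  # Once the readable EventName exists, drop the numeric EventID.
  mutate {
    remove_field => [ "EventID" ]
  }
}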

Use the mutate filter's replace option:
Replace a field with a new value. The new value can include %{foo} strings to help you build a new value from other parts of the event.
Example:
filter {
  if [EventID] == "4656" {
    mutate {
      replace => { "message" => "%{source_host}: My new message" }
    }
  }
}

Related

Searching after indexing in ElasticSearch

I want to index 1 billion records. Each record has 2 attributes (attribute1 and attribute2).
Each record that has the same value in attribute1 must be merged. For example, I have these two records:
attribute1  attribute2
1           4
1           6
My Elasticsearch document must then be:
{
"attribute1": "1"
"attribute2": "4,6"
}
Due to the huge amount of data, I have to read a bulk of records (about 1000), merge them based on the above rule (in memory), then search for them in Elasticsearch, merge them with the search results, and then index/reindex them.
In summary, I have to search and then index for each bulk.
I implemented this rule, but in some cases Elasticsearch does not return all results and some documents end up indexed in duplicate.
After each index operation I refresh Elasticsearch so that it is ready for the next search, but in some cases it doesn't work.
My index settings are as follows:
{
"test_index": {
"settings": {
"index": {
"refresh_interval": "-1",
"translog": {
"flush_threshold_size": "1g"
},
"max_result_window": "1000000",
"creation_date": "1464577964635",
"store": {
"throttle": {
"type": "merge"
}
}
},
"number_of_replicas": "0",
"uuid": "TZOse2tLRqGk-vHRMGc2GQ",
"version": {
"created": "2030199"
},
"warmer": {
"enabled": "false"
},
"indices": {
"memory": {
"index_buffer_size": "40%"
}
},
"number_of_shards": "5",
"merge": {
"policy": {
"max_merge_size": "2g"
}
}
}
}
How can I resolve this problem?
Is there any other setting to handle this situation?
In your bulk commands, you need to use the index operation for the first occurrence and then an update with a script to append to your attribute2 property:
{ "index" : { "_index" : "test_index", "_type" : "test_type", "_id" : "1" } }
{ "attribute1" : "1", "attribute2": [4] }
{ "update" : { "_index" : "test_index", "_type" : "test_type", "_id" : "1" } }
{ "script" : { "inline": "ctx._source.attribute2 += attr2", "params" : {"attr2" : 6}}}
After the first index operation your document will look like
{
"attribute1": "1"
"attribute2": [4]
}
After the second update operation, your document will look like
{
"attribute1": "1"
"attribute2": [4, 6]
}
Note that it is also possible to only use update operations with doc_as_upsert and script.
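For completeness, a rough sketch of that single-operation variant using scripted_upsert (note this uses scripted_upsert rather than doc_as_upsert, and assumes the bulk update action accepts the same options as the update API in your version):
{ "update" : { "_index" : "test_index", "_type" : "test_type", "_id" : "1" } }
{ "scripted_upsert" : true, "script" : { "inline" : "ctx._source.attribute2 += attr2", "params" : { "attr2" : 4 } }, "upsert" : { "attribute1" : "1", "attribute2" : [] } }
{ "update" : { "_index" : "test_index", "_type" : "test_type", "_id" : "1" } }
{ "scripted_upsert" : true, "script" : { "inline" : "ctx._source.attribute2 += attr2", "params" : { "attr2" : 6 } }, "upsert" : { "attribute1" : "1", "attribute2" : [] } }
With scripted_upsert the script also runs when the document does not exist yet (against the upsert document), so the first operation produces attribute2: [4] and the second appends 6, without having to know which occurrence is the first.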

"stop" filter behaving differently in Elasticsearch when using "_all"

I'm trying to implement a match search in Elasticsearch, and I noticed that the behavior is different depending on whether I use _all or a specific field name in my query.
To give some context, I've created an index with the following settings:
{
"settings": {
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"stop",
"kstem",
"word_delimiter"
]
}
}
}
}
}
If I create a document like:
{
"name": "Hello.World"
}
And I execute a search using _all like:
curl -d '{"query": { "match" : { "_all" : "hello" } }}' http://localhost:9200/myindex/mytype/_search
It will correctly match the document (since I'm using the stop filter to split the words at the dot), but if I execute this query instead:
curl -d '{"query": { "match" : { "name" : "hello" } }}' http://localhost:9200/myindex/mytype/_search
Nothing is returned at all. How is this possible?
Issue a GET for /myindex/mytype/_mapping and see if your index is configured the way you think it is. Meaning, check whether that "name" field is not_analyzed, for example.
Better yet, run the following query to see how the name field is actually indexed:
{
"query": {
"match": {
"name": "hello"
}
},
"fielddata_fields": ["name"]
}
You should see something like this in the result:
"fields": {
"name": [
"hello",
"world"
]
}
If you don't, then you know something's wrong with your mapping for the name field.
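For reference, the mapping check mentioned above is just a GET (index and type names as in the question):
curl -XGET 'http://localhost:9200/myindex/mytype/_mapping?pretty'
If name shows up as "index": "not_analyzed", or without the custom default analyzer you defined, that would explain why the field-level match for "hello" returns nothing while _all matches.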

Logstash: how to add file name as a field?

I'm using Logstash + Elasticsearch + Kibana to have an overview of my Tomcat log files.
For each log entry I need to know the name of the file from which it came. I'd like to add it as a field. Is there a way to do it?
I've googled a little and I've only found this SO question, but the answer is no longer up-to-date.
So far the only solution I see is to specify separate configuration for each possible file name with different "add_field" like so:
input {
file {
type => "catalinalog"
path => [ "/path/to/my/files/catalina**" ]
add_field => { "server" => "prod1" }
}
}
But then I need to reconfigure logstash each time there is a new possible file name.
Any better ideas?
Hi, I added a grok filter to do just this. I only wanted the filename, not the full path, but you can adapt this to your needs.
filter {
  grok {
    match => ["path", "%{GREEDYDATA}/%{GREEDYDATA:filename}\.log"]
  }
}
In case you would like to combine the message and file name in one event:
filter {
  grok {
    match => {
      "message" => "ERROR (?<function>[\S]*)"
    }
  }
  grok {
    match => {
      "path" => "%{GREEDYDATA}/%{GREEDYDATA:filename}\.log"
    }
  }
}
The result in ElasticSearch (focus on 'filename' and 'function' fields):
"_index": "logstash-2016.08.03",
"_type": "logs",
"_id": "AVZRyEI49-A6kyBCq6Yt",
"_score": 1,
"_source": {
"message": "27/07/16 12:16:18,321 ERROR blaaaaaaaaa.internal.com",
"#version": "1",
"#timestamp": "2016-08-03T19:01:33.083Z",
"path": "/home/admin/mylog.log",
"host": "my-virtual-machine",
"function": "blaaaaaaaaa.internal.com",
"filename": "mylog"
}

Elasticsearch term filter on inner object field not matching

I have just reorganized my document structure to have a more OO design (e.g. moved top-level properties like venueId and venueName into a venue object with id and name fields).
However, I now cannot get a simple term filter working for fields on the venue inner object.
Here is my mapping:
{
"deal": {
"properties": {
"textId": {"type":"string","name":"textId","index":"no"},
"displayId": {"type":"string","name":"displayId","index":"no"},
"active": {"name":"active","type":"boolean","index":"not_analyzed"},
"venue": {
"type":"object",
"path":"full",
"properties": {
"textId": {"type":"string","name":"textId","index":"not_analyzed"},
"regionId": {"type":"string","name":"regionId","index":"not_analyzed"},
"displayId": {"type":"string","name":"displayId","index":"not_analyzed"},
"name": {"type":"string","name":"name"},
"address": {"type":"string","name":"address"},
"area": {
"type":"multi_field",
"fields": {
"area": {"type":"string","index":"not_analyzed"},
"area_search": {"type":"string","index":"analyzed"}}},
"location": {"type":"geo_point","lat_lon":true}}},
"tags": {
"type":"multi_field",
"fields": {
"tags":{"type":"string","index":"not_analyzed"},
"tags_search":{"type":"string","index":"analyzed"}}},
"days": {
"type":"multi_field",
"fields": {
"days":{"type":"string","index":"not_analyzed"},
"days_search":{"type":"string","index":"analyzed"}}},
"value": {"type":"string","name":"value"},
"title": {"type":"string","name":"title"},
"subtitle": {"type":"string","name":"subtitle"},
"description": {"type":"string","name":"description"},
"time": {"type":"string","name":"time"},
"link": {"type":"string","name":"link","index":"no"},
"previewImage": {"type":"string","name":"previewImage","index":"no"},
"detailImage": {"type":"string","name":"detailImage","index":"no"}}}
}
Here is an example document:
GET /production/deals/wa-au-some-venue-weekends-some-deal
{
"_index":"some-index-v1",
"_type":"deals",
"_id":"wa-au-some-venue-weekends-some-deal",
"_version":1,
"exists":true,
"_source" : {
"id":"921d5fe0-8867-4d5c-81b4-7c1caf11325f",
"textId":"wa-au-some-venue-weekends-some-deal",
"displayId":"some-venue-weekends-some-deal",
"active":true,
"venue":{
"id":"46a7cb64-395c-4bc4-814a-a7735591f9de",
"textId":"wa-au-some-venue",
"regionId":"wa-au",
"displayId":"some-venue",
"name":"Some Venue",
"address":"sdgfdg",
"area":"Swan Valley & Surrounds"},
"tags":["Lunch"],
"days":["Saturday","Sunday"],
"value":"$1",
"title":"Some Deal",
"subtitle":"",
"description":"",
"time":"5pm - Late"
}
}
And here is an 'explain' test on that same document:
POST /production/deals/wa-au-some-venue-weekends-some-deal/_explain
{
"query": {
"filtered": {
"filter": {
"term": {
"venue.regionId": "wa-au"
}
}
}
}
}
{
"ok":true,
"_index":"some-index-v1",
"_type":"deals",
"_id":"wa-au-some-venue-weekends-some-deal",
"matched":false,
"explanation":{
"value":0.0,
"description":"ConstantScore(cache(venue.regionId:wa-au)) doesn't match id 0"
}
}
Is there any way to get more useful debugging info?
Is there something wrong with the explain result description? Simply saying "doesn't match id 0" does not make sense to me: the field is called 'regionId' (not 'id') and its value is definitely not 0.
That happens because the type you submitted the mapping for is called deal, while the type you indexed the document in is called deals.
If you look at the mapping for your type deals, you'll see that it was automatically generated and the field venue.regionId is analyzed, so you most likely have two tokens in your index: wa and au. Only by searching for those tokens on that type would you get that document back.
Everything else looks just fine; only a single character is off ;)
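As a quick sanity check, the _analyze API can show how the value is actually tokenized for that field (a sketch, assuming the 1.x-style field parameter):
curl -XGET 'http://localhost:9200/production/_analyze?field=venue.regionId&pretty' -d 'wa-au'
Against the auto-generated, analyzed mapping on deals this should return the tokens wa and au; against the intended not_analyzed mapping it would return the single token wa-au.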

Query all unique values of a field with Elasticsearch

How do I search for all unique values of a given field with Elasticsearch?
I have something like select full_name from authors in mind, so I can display the list to users on a form.
You could make a terms facet on your 'full_name' field. But in order to do that properly you need to make sure you're not tokenizing it while indexing; otherwise every entry in the facet would be a different term from the field content. You most likely need to configure it as 'not_analyzed' in your mapping. If you are also searching on it and you still want to tokenize it, you can index it in two different ways using a multi-field.
You also need to take into account that depending on the number of unique terms that are part of the full_name field, this operation can be expensive and require quite some memory.
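For reference, a terms facet request for this case might look roughly like the following (facets were later replaced by aggregations, so treat this as a sketch for older clusters; full_name is the field from the question):
{
  "query": { "match_all": {} },
  "facets": {
    "full_name": {
      "terms": { "field": "full_name", "size": 100 }
    }
  }
}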
For Elasticsearch 1.0 and later, you can leverage a terms aggregation to do this.
Query DSL:
{
"aggs": {
"NAME": {
"terms": {
"field": "",
"size": 10
}
}
}
}
A real example:
{
"aggs": {
"full_name": {
"terms": {
"field": "authors",
"size": 0
}
}
}
}
Then you can get all unique values of the authors field.
size=0 means no limit on the number of terms (this requires ES 1.1.0 or later).
Response:
{
...
"aggregations" : {
"full_name" : {
"buckets" : [
{
"key" : "Ken",
"doc_count" : 10
},
{
"key" : "Jim Gray",
"doc_count" : 10
}
]
}
}
}
see Elasticsearch terms aggregations.
Intuition:
In SQL parlance:
SELECT DISTINCT full_name FROM authors;
is equivalent to
SELECT full_name FROM authors GROUP BY full_name;
So, we can use the grouping/aggregate syntax in ElasticSearch to find distinct entries.
Assume the following is the structure stored in Elasticsearch:
[{
"author": "Brian Kernighan"
},
{
"author": "Charles Dickens"
}]
What did not work: Plain aggregation
{
"aggs": {
"full_name": {
"terms": {
"field": "author"
}
}
}
}
I got the following error:
{
"error": {
"root_cause": [
{
"reason": "Fielddata is disabled on text fields by default...",
"type": "illegal_argument_exception"
}
]
}
}
What worked like a charm: appending .keyword to the field name
{
"aggs": {
"full_name": {
"terms": {
"field": "author.keyword"
}
}
}
}
And the sample output could be:
{
"aggregations": {
"full_name": {
"buckets": [
{
"doc_count": 372,
"key": "Charles Dickens"
},
{
"doc_count": 283,
"key": "Brian Kernighan"
}
],
"doc_count": 1000
}
}
}
Bonus tip:
Let us assume the field in question is nested as follows:
[{
"authors": [{
"details": [{
"name": "Brian Kernighan"
}]
}]
},
{
"authors": [{
"details": [{
"name": "Charles Dickens"
}]
}]
}
]
Now the correct query becomes:
{
"aggregations": {
"full_name": {
"aggregations": {
"author_details": {
"terms": {
"field": "authors.details.name"
}
}
},
"nested": {
"path": "authors.details"
}
}
},
"size": 0
}
This works for Elasticsearch 5.2.2:
curl -XGET http://localhost:9200/articles/_search?pretty -d '
{
"aggs" : {
"whatever" : {
"terms" : { "field" : "yourfield", "size":10000 }
}
},
"size" : 0
}'
The "size":10000 means get (at most) 10000 unique values. Without this, if you have more than 10 unique values, only 10 values are returned.
The "size":0 means that in result, "hits" will contain no documents. By default, 10 documents are returned, which we don't need.
Reference: bucket terms aggregation
Also note, according to this page, facets have been replaced by aggregations in Elasticsearch 1.0, which are a superset of facets.
The existing answers did not work for me in Elasticsearch 5.X, for the following reasons:
I needed to tokenize my input while indexing.
"size": 0 failed to parse because "[size] must be greater than 0."
"Fielddata is disabled on text fields by default." This means by default you cannot search on the full_name field. However, an unanalyzed keyword field can be used for aggregations.
Solution 1: use the Scroll API. It works by keeping a search context and making multiple requests, each time returning subsequent batches of results. If you are using Python, the elasticsearch module has the scan() helper function to handle scrolling for you and return all results.
Solution 2: use the Search After API. It is similar to Scroll, but provides a live cursor instead of keeping a search context. Thus it is more efficient for real-time requests.
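A minimal sketch of Solution 1 in Python, assuming the official elasticsearch client and hypothetical index/field names (authors, full_name); uniqueness is computed client-side:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()  # defaults to localhost:9200

# Scroll through every document, fetch only the field we need,
# and deduplicate on the client.
unique_names = set()
query = {"query": {"match_all": {}}, "_source": ["full_name"]}
for hit in scan(es, index="authors", query=query):
    name = hit["_source"].get("full_name")
    if name is not None:
        unique_names.add(name)

print(sorted(unique_names))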
