When storing a batch of documents, Elasticsearch should store the ones that don't exist yet and ignore the rest (or should this be done at the application level, e.g. by checking whether a document's id already exists?).
Here is what the documentation states:
Operation Type
The index operation also accepts an op_type that can be used to force a create operation, allowing for “put-if-absent” behavior. When create is used, the index operation will fail if a document by that id already exists in the index.
Here is an example of using the op_type parameter:
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1?op_type=create' -d '{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elastic Search"
}'
Another option to specify create is to use the following uri:
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1/_create' -d '{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elastic Search"
}'
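For the original batch use case (store the documents that don't exist, ignore the rest), the same create semantics also apply per item in the _bulk API: items whose id already exists come back with a conflict error while the rest are indexed, and no single conflict fails the batch as a whole. A sketch in the same 1.x-era style as the examples above:
$ curl -XPOST 'http://localhost:9200/_bulk' --data-binary '{ "create" : { "_index" : "twitter", "_type" : "tweet", "_id" : "1" } }
{ "user" : "kimchy", "message" : "id 1 already exists, so this item fails with a conflict" }
{ "create" : { "_index" : "twitter", "_type" : "tweet", "_id" : "2" } }
{ "user" : "kimchy", "message" : "id 2 is new, so this item is indexed" }
'
Each item's success or failure is reported separately in the bulk response, so no application-level existence check is needed.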
I have a set of log files and I need to find which tables from a specific database are being used by searching for them in the log files. The search keyword should start with 'database.' and I should be able to grep/parse all table names together with the preceding database name. Can anyone advise how this can be achieved using a shell script?
Thanks
This is very easy:
grep "database" * : this shows the filename + the entire line
grep -l "database" * : this shows only the filename
grep -o "database [a-z]*" * : this shows the filename + the database name
grep -h -o "database [a-z]*" * : this shows the database name
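Since the question mentions that the keyword starts with 'database.' (with a dot), a variant like this might match the dotted form directly and deduplicate the results; the [A-Za-z0-9_]+ character class is an assumption about what your table names look like:
grep -Eho 'database\.[A-Za-z0-9_]+' *.log | sort -u
Here -E enables extended regexes, -h suppresses the filenames, -o prints only the match, and sort -u removes duplicates.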
I am registering a patch baseline with a Patch Group via the AWS CLI and I am getting the error mentioned in the subject.
The command used to create the patch baseline is below:
baselineid=$(aws ssm create-patch-baseline --name "Test-NonProd-Baseline" --operating-system "WINDOWS" --tags "Key=Environment,Value=Production" --approval-rules "PatchRules=[{PatchFilterGroup={PatchFilters=[{Key=MSRC_SEVERITY,Values=[Critical,Important]},{Key=CLASSIFICATION,Values=[SecurityUpdates,Updates,ServicePacks,UpdateRollups,CriticalUpdates]}]},ApproveAfterDays=7}]" --description "Baseline containing all updates approved for production systems" --query BaselineId)
I then used the above id to register the patch baseline with the patch group, like below:
aws ssm register-patch-baseline-for-patch-group --baseline-id $baselineid --patch-group "Group A"
Unfortunately, I get the error below:
An error occurred (ValidationException) when calling the RegisterPatchBaselineForPatchGroup operation: 1 validation error detected: Value '"pb-08a507ce98777b410"' at 'baselineId' failed to satisfy constraint: Member must satisfy regular expression pattern: ^[a-zA-Z0-9_\-:/]{20,128}$.
Note: Even if I use double quotes around "$baselineid" I still get the same error.
Variable $baselineid has a valid value, please see below:
[ec2-user@ip-172-31-40-59 ~]$ echo $baselineid
"pb-08a507ce98777b410"
I want to understand what the issue is when I'm getting a legit value, and how to resolve it.
You can perhaps use this in your second command, basically using the tr utility to strip the double quotes:
echo $baselineid | tr -d '"'
The solution to the double quote issue has many possible fixes in this SO post:
Shell script - remove first and last quote (") from a variable
So a possible solution command could be something like this:
aws ssm register-patch-baseline-for-patch-group --baseline-id $(echo $baselineid | tr -d '"') --patch-group "Group A"
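Alternatively, the quotes can be avoided at the source: --query returns JSON-encoded output by default, so string values arrive wrapped in double quotes. Adding --output text to the create command makes the CLI print the raw id (the unchanged flags from your original command are elided here as ...):
baselineid=$(aws ssm create-patch-baseline ... --query BaselineId --output text)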
I want to know the search base and filter (i.e. the LDAP query) to get the surname (sn) of Hugo Williams.
Thanks,
Samuel
basedn : cn=Hugo Williams,ou=users,o=mojo
scope : base
filter : (objectClass=*)
attribute to retrieve : sn
With an ldapsearch command line on a Linux client it would be (don't forget to add the host/credentials if needed):
ldapsearch -s base -b "cn=Hugo Williams,ou=users,o=mojo" "(objectClass=*)" sn
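If the server does require connection details, a filled-in version might look like this; the host, bind DN, and the use of simple authentication (-x/-D/-W) are placeholders for your environment:
ldapsearch -H ldap://ldap.example.com -x -D "cn=admin,o=mojo" -W \
  -s base -b "cn=Hugo Williams,ou=users,o=mojo" "(objectClass=*)" sn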
I'm trying to come up with the best way to purge the logs from a logstash server that are more than two weeks old.
For those that aren't aware, Logstash stores its logs inside of Elasticsearch. We have a really great, stable ELK stack (Elasticsearch/Logstash/Kibana) where I work.
The typical way of deleting a logstash index is with a curl command like this one:
#curl --user admin -XDELETE http://localhost:9200/logstash-2015.06.06
Enter host password for user 'admin':
{"acknowledged":true}
Now what I'm looking for is a programmatic way of working with the dates in the logstash index names so I can automatically purge any index that's more than two weeks old.
I'm thinking of using bash to get this done.
I'd appreciate any examples of how to do this or advice you may have!
Thanks
Thanks!! But do you think you can help me get this going using auth?
This is what I tried so far:
[root@logs:~] #curator --help | grep -i auth
--http_auth TEXT Use Basic Authentication ex: user:pass
[root@logs:~] #curator delete indices --older-than 14 --time-unit days --timestring %Y.%m.%d --regex '^logstash-' --http_auth admin:secretsauce
Error: no such option: --http_auth
[root@logs:~] #curator delete indices --http_auth admin:secretsauce --older-than 14 --time-unit days --timestring %Y.%m.%d --regex '^logstash-'
Error: no such option: --http_auth
Use Curator. To delete indexes older than 14 days you can run this command:
curator delete indices --older-than 14 --time-unit days --timestring %Y.%m.%d --regex '^logstash-'
If curator doesn't work for you for one reason or another, here's a bash script you can run:
#!/bin/bash
: ${2?"Usage: $0 [number of days] [base url of elastic]"}
days=${1}
baseURL=${2}
curl "${baseURL}/_cat/indices?v&h=i" | grep logstash | sort --key=1 | awk -v n=${days} '{if(NR>n) print a[NR%n]; a[NR%n]=$0}' | awk -v baseURL="$baseURL" '{printf "curl -XDELETE '\''%s/%s'\''\n", baseURL, $1}' | while read x ; do eval $x ; done
The online documentation for Curator explains many of these details. The URL is handily provided at the top of the --help output:
$ curator --help
Usage: curator [OPTIONS] COMMAND [ARGS]...
Curator for Elasticsearch indices.
See http://elastic.co/guide/en/elasticsearch/client/curator/current
There's an entire sub-section on flags. In the documentation for the --http_auth flag it says:
This flag must come before any command.
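Applied to your command, that means the flag has to move in front of delete:
curator --http_auth admin:secretsauce delete indices --older-than 14 --time-unit days --timestring %Y.%m.%d --regex '^logstash-'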
A better way to do it on newer versions (7.x+) of ElasticSearch/OpenSearch is to use state management policies; you can read more about it here: https://aws.amazon.com/blogs/big-data/automating-index-state-management-for-amazon-opensearch-service-successor-to-amazon-elasticsearch-service/
Here is how to use it to delete old indexes:
{
  "policy": {
    "description": "Removes old indexes",
    "default_state": "active",
    "states": [
      {
        "name": "active",
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "14d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "delete": {}
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": [
        "mylogs-*"
      ]
    }
  }
}
It will automatically apply the policy for any new mylogs-* indexes, but you'll need to apply it manually for existing ones (under "Index Management" -> "Indices").
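If you'd rather use the API than the console, the policy can be created with a PUT to the ISM endpoint, assuming the JSON above is saved as policy.json; the policy id delete_old_logs and the domain endpoint are placeholders (on Amazon OpenSearch the path is _plugins/_ism, while older Open Distro versions used _opendistro/_ism):
curl -XPUT "https://your-domain-endpoint/_plugins/_ism/policies/delete_old_logs" \
  -H 'Content-Type: application/json' -d @policy.json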
The ElasticSearch X-Pack lets you set a policy to automatically delete indexes based on age. This is a fairly complex solution, and also a paid feature: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
Curator seems to be well maintained, supports recent versions of ElasticSearch and does what you want.
Alternatively, here's a BASH script. However, it won't work on BSD or Mac, because of my use of non-POSIX date -ud.
I use systemd to run this daily.
#!/usr/bin/env bash
elasticsearchURL="http://localhost:9200"
date_format="%Y.%m.%d"
today_seconds=$(date +"%s")
let seconds_per_day=24*60*60
let delay_seconds=$seconds_per_day*7
let cutoff_seconds=$today_seconds-$delay_seconds
cutoff_date=$(date -ud "@$cutoff_seconds" +"$date_format")
indices=$(curl -XGET "${elasticsearchURL}/_cat/indices" | cut -d ' ' -f 3 | grep -P "\d{4}\.\d{2}\.\d{2}")
echo "Deleting indexes created before the cutoff date $cutoff_date."
for index in $indices; do
  index_date=$(echo "$index" | grep -P --only-matching "\d{4}\.\d{2}\.\d{2}")
  if [[ $index_date < $cutoff_date ]]; then
    echo "Deleting old index $index"
    curl -XDELETE "${elasticsearchURL}/$index"
    echo ""
  fi
done
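For reference, the daily systemd scheduling mentioned above could be a oneshot service plus a timer, roughly like this (the unit names and the script path are assumptions):
# /etc/systemd/system/es-purge.service
[Unit]
Description=Delete old Elasticsearch indices

[Service]
Type=oneshot
ExecStart=/usr/local/bin/es-purge.sh

# /etc/systemd/system/es-purge.timer
[Unit]
Description=Run es-purge daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
Enable it with systemctl enable --now es-purge.timer.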
To do this, there is a dedicated utility from Elastic called Curator. It must be installed as described in the documentation.
Then you need to set the address of the Elasticsearch server in the "hosts" parameter of the configuration file. On Windows, this file should be located in the user folder, for example: c:\Users\yourUserName\.curator\curator.yml
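A minimal curator.yml might look like this (the host and port are placeholders for your environment):
client:
  hosts:
    - 127.0.0.1
  port: 9200
logging:
  loglevel: INFO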
Then, following the documentation, create a file with the actions, "curatorRotateLogs.yml", for example like this:
---
# Remember, leave a key empty if there is no value. None will be a string,
# not a Python "NoneType"
actions:
  1:
    action: delete_indices
    description: >-
      Delete indices older than 14 days (based on index name), for logstash-
      prefixed indices. Ignore the error if the filter does not result in an
      actionable list of indices (ignore_empty_list) and exit cleanly.
    options:
      ignore_empty_list: True
      disable_action: False
    filters:
    - filtertype: pattern
      kind: prefix
      value: logstash-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 14
Then run it via the task scheduler: "C:\Program Files\elasticsearch-curator\curator.exe" c:\MyCoolFolder\curatorRotateLogs.yml
So, I should preface this by saying that I understand * is a special character that should be escaped for elasticsearch queries. Here's the setup and the trouble I'm facing. The basic problem boils down to this: I'm unable to search fields containing only '*'.
curl -XPUT 'http://localhost:9200/test_index/test_item/1' -d '{
"some_text" : "*"
}'
curl -XPUT 'http://localhost:9200/test_index/test_item/2' -d '{
"some_text" : "1+*"
}'
curl -XPUT 'http://localhost:9200/test_index/test_item/3' -d '{
"some_text" : "asterisk"
}'
curl -XGET 'http://localhost:9200/test_index/_search?q=some_text:*'
Results:
"hits":{"total":2,"max_score":1.0,"hits":[
"_source":{"some_text" : "1+*"},
"_source":{"some_text" : "asterisk"}
]
curl -XGET 'http://localhost:9200/test_index/_search?q=some_text:\*'
Results:
"hits":{"total":0,"max_score":null,"hits":[]}
Using python elasticsearch:
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch()
>>> es.search(index='test_index', doc_type='test_item', body={"query":{"match":{"some_text":"*"}}})
No hits
>>> es.search(index='test_index', doc_type='test_item', body={"query":{"match":{"some_text":"asterisk"}}})
One hit ('asterisk')
>>> es.search(index='test_index', doc_type='test_item', body={"query":{"match":{"some_text":"\*"}}})
No hits
Using pyelasticsearch:
>>> es.search('some_text:*', index='test_index')
2 hits, '1+*' and 'asterisk'
>>> es.search('some_text:\*', index='test_index')
No hits
How can I get the first item to show up in a search? Despite the inconsistencies between the various search methods, all of them seem to agree that I'm not allowed to get '*' back, but why? Also, escaping * seems to make the problem worse, which is kind of unusual. (I assume there is some autoescaping in the libraries perhaps, but that doesn't really explain the direct ES query).
Edit: I should mention that it is definitely indexed.
>>>es.get('test_index', 'test_item', 1)
{'_index': 'test_index', '_version': 1, '_id': '1', 'found': True, '_type': 'test_item', '_source': {'some_text': '*'}}
It may be possible that it's stored rather than indexed, though, which is a distinct concept in Elasticsearch as far as I know?
Edit2:
There are ElasticSearch docs that talk about escaping some characters.
Ended up solving this by changing the analyzer to a whitespace analyzer. (It was a Lucene issue, not an Elasticsearch one, which is why it was tough to find!)
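For reference, a sketch of what that fix might look like when creating the index, reusing the field and type names from the question and the same 1.x-era API (the whitespace analyzer doesn't strip punctuation-only tokens the way the standard analyzer does, so a bare '*' survives indexing):
curl -XPUT 'http://localhost:9200/test_index' -d '{
  "mappings": {
    "test_item": {
      "properties": {
        "some_text": { "type": "string", "analyzer": "whitespace" }
      }
    }
  }
}'
After reindexing the documents, a match query for "*" should return the first item.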