How to delete Elasticsearch indices periodically?

I have created indices on a daily basis to store the search history, and I am using those indices for the suggestions in my application, which helps me suggest based on history as well.
Now I have to keep only the last 10 days of history. So is there any feature in Elasticsearch that allows me to create and delete indices periodically?

I don't know if Elasticsearch has a built-in feature like that, but you can achieve what you want with Curator and a cron job.
An example curator command is:
Curator 3.x syntax [deprecated]:
curator --host <IP> delete indices --older-than 10 --prefix "index-prefix-" --time-unit days --timestring '%Y-%m-%d'
Curator 5.1.1 syntax:
curator_cli --host <IP> --port <PORT> delete_indices --filter_list '[{"filtertype":"age","source":"creation_date","direction":"older","unit":"days","unit_count":10},{"filtertype":"pattern","kind":"prefix","value":"index-prefix-"}]'
Run this command daily with a cron job to delete indices older than 10 days whose names start with index-prefix- and that live on the Elasticsearch instance at <IP>:<PORT>.
For more curator command-line options, see: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/singleton-cli.html
For more on cron usage, see:
http://kvz.io/blog/2007/07/29/schedule-tasks-on-linux-using-crontab/
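For example, a minimal crontab entry could look like the sketch below (the curator_cli path, log file, and the 01:00 schedule are assumptions to adapt):
# m h dom mon dow  command -- run every night at 01:00
0 1 * * * /usr/local/bin/curator_cli --host <IP> --port <PORT> delete_indices --filter_list '[{"filtertype":"age","source":"creation_date","direction":"older","unit":"days","unit_count":10},{"filtertype":"pattern","kind":"prefix","value":"index-prefix-"}]' >> /var/log/curator.log 2>&1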

The only thing I can think of is using date math:
https://www.elastic.co/guide/en/elasticsearch/reference/current/date-math-index-names.html
In Sense you can do this:
DELETE <logs-{now%2Fd-10d}>
This does not work nicely in curl though, due to URL encoding. You can do something like this in curl:
curl -XDELETE 'localhost:9200/<logs-%7Bnow%2Fd-10d%7D>'
Both examples remove the index that is exactly 10 days old. It does not help you delete indices older than 10 days; I don't think that is possible. And there is no trigger or scheduler built into Elasticsearch.
So I would stick to a cron job in combination with Curator, but you do have this option to go with as well.
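If you prefer to avoid the date-math escaping entirely, a rough equivalent sketch is to compute the ten-day-old index name in the shell and delete it with a plain curl call (this assumes GNU date and daily indices named like index-prefix-YYYY-MM-DD on localhost:9200):
# Build the name of the index created 10 days ago and delete it
OLD_INDEX="index-prefix-$(date -d '10 days ago' +%Y-%m-%d)"
curl -XDELETE "localhost:9200/${OLD_INDEX}"
Like the date-math example, this only removes the single index that is exactly 10 days old, so it still relies on being run every day.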

Related

Databricks Jobs API "INVALID_PARAMETER_VALUE" when trying to get job

I'm just starting to explore the Databricks API. I've created a .netrc file as described in this doc and am able to get the API to work with this for other operations like "list clusters" and "list jobs". But when I try to query details of a particular job, it fails:
$ curl --netrc -X GET https://<my_workspace>.cloud.databricks.com/api/2.0/jobs/get/?job_id=job-395565384955064-run-12345678
{"error_code":"INVALID_PARAMETER_VALUE","message":"Job 0 does not exist."}
What am I doing wrong here?
The job ID should be a numeric identifier, but you're providing the job cluster name instead. You need to use the first number (395565384955064) from that name as the job ID in the REST API. Also, remove the / after get - it should be /api/2.0/jobs/get?job_id=<job-ID>
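Putting that together, the corrected request should look something like this (the numeric ID is taken from the run name in the question):
$ curl --netrc -X GET https://<my_workspace>.cloud.databricks.com/api/2.0/jobs/get?job_id=395565384955064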
$ curl --netrc -X GET https://<my_workspace>.cloud.databricks.com/api/2.0/jobs/get/?job_id=job-395565384955064-run-12345678
In this request, it looks like a job name has been passed as an alphanumeric value instead of the numeric job_id. You can find the job_id in the Jobs list of your workspace.

How to see key count of matching pattern in Azure Redis Cache Console

I want to see just the total number of keys in the Azure Redis cache that match a given pattern. I tried the following command, but it shows the count only after displaying all the keys (which caused server load), and I need only the count.
>SCAN 0 COUNT 10000000 MATCH "{UID}*"
Besides the command SCAN, the command KEYS pattern can return the same result as your current command SCAN 0 COUNT 10000000 MATCH "{UID}*".
However, for your real need, getting the number of keys matching a pattern, there is an issue "add COUNT command" in the official Redis GitHub repo, which was answered by the author antirez as quoted below.
Hi, KEYS is only intended for debugging since it is O(N) and performs a full keyspace scan. COUNT has the same problem but without the excuse of being useful for debugging... (since you can simply use redis-cli keys ... | grep ...). So feature not accepted. Thanks for your interest.
So you cannot directly get the count of keys matching a pattern, but there are some possible solutions for you.
1. Count the keys returned by the command KEYS pattern in your programming language, or for a small number of keys do something like redis-cli KEYS "{UID}*" | wc -l on the Redis host server.
2. Use the command EVAL script numkeys key [key ...] arg [arg ...] to run a Lua script that counts the keys matching a pattern; there are two scripts you can try.
2.1. Script 1
return #redis.call("keys", "{UID}*")
2.2. Script 2
return table.getn(redis.call('keys', ARGV[1]))
The complete command in redis-cli is EVAL "return table.getn(redis.call('keys', ARGV[1]))" 0 {UID}*
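If the server load caused by KEYS is a concern (as antirez notes, it is O(N) and performs a full keyspace scan), a non-blocking sketch is to let redis-cli iterate with cursor-based SCAN and count on the client side; host, port, and password are placeholders to fill in for your Azure cache:
# Iterate the keyspace with SCAN instead of KEYS and count matches client-side
redis-cli -h <host> -p <port> -a <access-key> --scan --pattern "{UID}*" | wc -l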

tailLines and sinceTime in the logging API do not work simultaneously

I am using Container Engine, and my pods are hosted there.
I am trying to fetch logs using the log API:
http://localhost:8000/api/v1/namespaces/app-test/pods/designer-0/log?tailLines=100&sinceTime=2017-09-17T10:47:58Z
If I use either of the query params separately, it works and shows the proper result, but if I use them simultaneously only the last 100 lines are returned; the sinceTime param gets ignored.
My scenario is: I need the logs from a specific time onward, in chunks of 100 lines at a time.
I am not sure whether it is a bug or just not implemented.
I found this in the API reference manual:
https://kubernetes.io/docs/api-reference/v1.6/
tailLines - If set, the number of lines from the end of the logs to
show. If not specified, logs are shown from the creation of the
container or sinceSeconds or sinceTime
So that means if you specify tailLines, it starts from the end. I don't see any option explicitly mentioned other than limitBytes, but you will have to play around with it as it does not guarantee a number of lines.
tailLines=X tells the server to start that many lines from the end
sinceTime tells the server to start from the specified time
the options are mutually exclusive
Thanks all,
I have since realized that it is not ignoring sinceTime; the intended functionality of tailLines is to return lines from the end.
So if I specify sinceTime = 10 PM yesterday, it will return the records from that time onward, and if tailLines is also specified, it will return the most recent lines from that window.
So it was working as expected. I need to play with limitBytes to get the logs in chunks from that time, instead of the full logs.
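For example, a sketch of fetching a byte-limited chunk from a given time with the same log endpoint (limitBytes is a standard query parameter of the pod log API; the 102400-byte cap is just an illustration):
# First chunk: everything after the given time, capped at roughly 100 KB
curl "http://localhost:8000/api/v1/namespaces/app-test/pods/designer-0/log?sinceTime=2017-09-17T10:47:58Z&limitBytes=102400"
To page through, you could also add timestamps=true and use the timestamp of the last line received as the sinceTime of the next request.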

percolate:synced-cron shows only 2 records all the time

I am using Meteor 1.5 and the package percolate:synced-cron to run a task once a day. After some days I noticed that my previous records inside the "cronHistory" collection had been deleted automatically (without me personally deleting them), and it shows only the past 2 days of history.
I am not sure what is wrong with the cronHistory collection. Any suggestions would be deeply appreciated.
I would recommend that you do a little research yourself and read the docs of the packages you use. Even better, read the source to understand what kind of code you accept into your codebase. From the docs:
SyncedCron.config({
...
/*
TTL in seconds for history records in collection to expire
NOTE: Unset to remove expiry but ensure you remove the index from
mongo by hand
ALSO: SyncedCron can't use the `_ensureIndex` command to modify
the TTL index. The best way to modify the default value of
`collectionTTL` is to remove the index by hand (in the mongo shell
run `db.cronHistory.dropIndex({startedAt: 1})`) and re-run your
project. SyncedCron will recreate the index with the updated TTL.
*/
collectionTTL: 172800
});
Note that the collectionTTL option is set to 172800 seconds, i.e. 2 days, which is why only the past 2 days of history are kept.

Compare two websites and see if they are "equal?"

We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure to see if the rendered pages are the same on the new server as the old server. I was just wondering if anyone knew of anything to assist in this task?
Get the formatted output of both sites (here we use w3m, but lynx can also work):
w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html
Then use wdiff; it can give you a percentage of how similar the two texts are.
wdiff -nis /tmp/1.html /tmp/2.html
It can also be easier to see the differences using colordiff.
wdiff -nis /tmp/1.html /tmp/2.html | colordiff
Excerpt of output:
Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
[-iGoogle |-]
Paramètres | Connexion
Google [hp1] [hp2]
[hp3] [-Français-] {+Deutschland+}
[ ] Recherche
avancéeOutils
[Recherche Google][J'ai de la chance] linguistiques
/tmp/1.html: 43 words 39 90% common 3 6% deleted 1 2% changed
/tmp/2.html: 49 words 39 79% common 9 18% inserted 1 2% changed
(it actually put google.com into French... funny)
The common % values show how similar the two texts are. Plus you can easily see the differences word by word (instead of line by line, which can be cluttered).
The catch is how to check the 'rendered' pages. If the pages don't have any dynamic content, the easiest way is to generate hashes for the files using the md5 or sha1 commands and check them against the new server.
If the pages have dynamic content you will have to download the site using a tool like wget
wget --mirror http://thewebsite/thepages
and then use diff as suggested by Warner, or do the hash thing again. I think diff may be the best way to go, since even a change of one character will mess up the hash.
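A minimal sketch of the hash approach (directory and file names are assumptions; it presumes both sites were already mirrored locally with matching relative paths):
# Hash every file in each mirror, sort by path, and diff the two lists
(cd /tmp/oldsite && find . -type f -exec md5sum {} +) | sort -k 2 > /tmp/old.md5
(cd /tmp/newsite && find . -type f -exec md5sum {} +) | sort -k 2 > /tmp/new.md5
diff /tmp/old.md5 /tmp/new.md5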
I've created the following PHP code that does what Weboide suggests here. Thanks Weboide!
The paste is here:
http://pastebin.com/0V7sVNEq
Using the open source tool recheck-web (https://github.com/retest/recheck-web), there are two possibilities:
Create a Selenium test that checks all of your URLs on the old server, creating Golden Masters. Then run that test against the new server to see how they differ.
Use the free and open-source Chrome extension (https://github.com/retest/recheck-web-chrome-extension), which internally uses recheck-web to do the same: https://chrome.google.com/webstore/detail/recheck-web-demo/ifbcdobnjihilgldbjeomakdaejhplii
For both solutions you currently need to manually list all relevant URLs. In most situations, this shouldn't be a big problem. recheck-web will compare the rendered website and show you exactly where they differ (i.e. different font, different meta tags, even different link URLs). And it gives you powerful filters to let you focus on what is relevant to you.
Disclaimer: I have helped create recheck-web.
Copy the files to the same server in /tmp/directory1 and /tmp/directory2 and run the following command:
diff -r /tmp/directory1 /tmp/directory2
For all intents and purposes, you can put them in your preferred location with your preferred naming convention.
Edit 1
You could potentially use lynx -dump or wget and run a diff on the results.
Short of rendering each page, taking screen captures, and comparing those screenshots, I don't think it's possible to compare the rendered pages.
However, it is certainly possible to compare the downloaded website after downloading recursively with wget.
wget [option]... [URL]...
-m
--mirror
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
The next step would then be to do the recursive diff that Warner recommended.
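Putting the two steps together, roughly (hostnames and target directories are placeholders):
# Mirror both servers side by side; -nH drops the hostname directory so the
# two trees end up with comparable relative paths, then diff them recursively
wget --mirror -nH -P /tmp/oldsite http://old.example.com/thepages
wget --mirror -nH -P /tmp/newsite http://new.example.com/thepages
diff -r /tmp/oldsite /tmp/newsite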
