Right way to delete and then reindex ES documents - python-3.x

I have a Python 3 script that attempts to reindex certain documents in an existing Elasticsearch index. I can't update the documents in place because I'm changing from an autogenerated id to an explicitly assigned id.
I'm currently attempting to do this by deleting existing documents using delete_by_query and then indexing once the delete is complete:
self.elasticsearch.delete_by_query(
    index='%s_*' % base_index_name,
    doc_type='type_a',
    conflicts='proceed',
    wait_for_completion=True,
    refresh=True,
    body={}
)
However, the index is massive, and so the delete can take several hours to finish. I'm currently getting a ReadTimeoutError, which is causing the script to crash:
WARNING:elasticsearch:Connection <Urllib3HttpConnection: X> has failed for 2 times in a row, putting on 120 second timeout.
WARNING:elasticsearch:POST X:9200/base_index_name_*/type_a/_delete_by_query?conflicts=proceed&wait_for_completion=true&refresh=true [status:N/A request:140.117s]
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='X', port=9200): Read timed out. (read timeout=140)
Is my approach correct? If so, how can I make my script wait long enough for the delete_by_query to complete? There are two timeout parameters that can be passed to delete_by_query, search_timeout and timeout, but search_timeout defaults to no timeout (which I think is what I want) and timeout doesn't seem to do what I want. Is there some other parameter I can pass to delete_by_query to make it wait as long as it takes for the delete to finish, or do I need to make my script wait some other way?
Or is there some better way to do this using the ElasticSearch API?

You should set wait_for_completion to False. In that case you'll get the task details back and will be able to track the task's progress using the corresponding Task API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html#docs-delete-by-query-task-api

To expand on Random's answer with some code, for ES/Python newbies like me:

from elasticsearch import Elasticsearch

ES = Elasticsearch(['http://localhost:9200'])
query = {'query': {'match_all': {}}}
# with wait_for_completion=False, the call returns a task descriptor
# instead of blocking until the delete finishes
response = ES.delete_by_query(index='index_name', doc_type='sample_doc',
                              wait_for_completion=False, body=query,
                              ignore=[400, 404])
task_id = response['task']
response_task = ES.tasks.get(task_id)      # check on the task
is_completed = response_task['completed']  # True once the task has finished
You can write a small helper that polls the task at some interval in a while loop until it completes; a sketch follows. I have used Python 3.x and Elasticsearch 6.x.
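A minimal polling sketch, assuming the ES client from above (the 10-second interval is an arbitrary choice):

import time

def wait_for_task(es, task_id, poll_interval=10):
    # ask the Tasks API for the task's current state until it completes
    while True:
        task = es.tasks.get(task_id)
        if task['completed']:
            return task
        time.sleep(poll_interval)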

You can use the request_timeout per-request parameter. This overrides the connection's timeout settings for that call.
For example -
es.delete_by_query(index=<index_name>, body=<query>, request_timeout=300)
Or set it at connection level, for example
es = Elasticsearch(**get_es_connection_parms(), timeout=60)

Related

Update a parameter value in Brightway

It seems like a simple question, but I'm having a hard time finding an answer. I already have a project with several parameters (project and database parameters). I would like to obtain the LCA results for several scenarios, with my parameters having different values each time. I was thinking of the following simple procedure:
change the parameters' value,
update the exchanges in my project,
calculate the LCA results.
I know that the answer should be in the documentation somewhere, but I have a hard time understanding how I should apply it to my ProjectParameters, DatabaseParameters and ActivityParameters.
Thanks in advance!
EDIT: Thanks to @Nabla, I was able to come up with this:
For ProjectParameter:
for pjparam in ProjectParameter.select():
    if pjparam.name == 'my_param_name':
        break
pjparam.amount = 3
pjparam.save()
bw.parameters.recalculate()
For DatabaseParameter:
for dbparam in DatabaseParameter.select():
    if dbparam.name == 'my_param_name':
        break
dbparam.amount = 3
dbparam.save()
bw.parameters.recalculate()
For ActivityParameter:
for param in ActivityParameter.select():
    if param.name == 'my_param_name':
        break
param.amount = 3
param.save()
param.recalculate_exchanges(param.group)
You can import DatabaseParameter and ActivityParameter, iterate until you find the parameter you want to change, update its value, save it, and recalculate the exchanges. I think you need to do it in tiers: first update the project parameters (if any), then the database parameters that may depend on project parameters, and then the activity parameters that depend on them.
A simplified case without project parameters:

from bw2data.parameters import ActivityParameter, DatabaseParameter

# find the database parameter to be updated
for dbparam in DatabaseParameter.select():
    if (dbparam.database == uncertain_db.name) and (dbparam.name == 'foo'):
        break
dbparam.amount = 3
dbparam.save()
# there is also this method, if formulas depend on something else
# dbparam.recalculate(uncertain_db.name)

# here, updating the exchanges of a particular activity (act)
for param in ActivityParameter.select():
    if param.group == ":".join(act.key):
        param.recalculate_exchanges(param.group)
You may want to update all the activities in the project instead of a single one as in the example; you just need to change the condition when looping through the activity parameters, as in the sketch below.
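A sketch of that variation, recalculating each activity parameter group once (reusing recalculate_exchanges from the example above):

from bw2data.parameters import ActivityParameter

done = set()
for param in ActivityParameter.select():
    # each parameter group only needs to be recalculated once
    if param.group not in done:
        done.add(param.group)
        param.recalculate_exchanges(param.group)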

Group together celery results

TL;DR
I want to label results in the backend.
I have a flask/celery project and I'm new to celery.
A user sends in a batch of tasks for celery to work on.
Celery saves the results to a backend SQL database (table automatically created by Celery, named celery_taskmeta).
I want to let the user see the status of his batch, and request the results from the backend.
My problem is that all the results land in one table. What are my options for labelling a batch, so the user can differentiate the batches?
My ideas:
Can I add a label to each task, e.g. "Bob's batch no. 12", and then query celery_taskmeta for it?
Can I put each batch in a named backend table, i.e. ask Celery to save results to a table named task_12?
Trying with groups
I've tried the following code to group the results
job_group = group(api_get.delay(url) for url in urllist)
But I don't see any way to identify the group in the backend/results DB
Trying with task name
In the backend I see an empty column header 'name' so I thought I could add an arbitrary string there:
@app.task(name="an amazing vegetable")
def api_get(url: str) -> tuple:
    ...
But then the celery worker throws an error when I run the task:
KeyError: 'an amazing vegetable'
[2020-12-18 12:07:22,713: ERROR/MainProcess] Received unregistered task of type 'an amazing vegetable'.
Probably the simplest solution is to use a group and then use its GroupResult to periodically poll for the group's state, as in the sketch below.
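A minimal sketch of that approach, reusing api_get and urllist from the question. Note the signatures are built with .s() rather than .delay(), so nothing runs until the group is applied, and saving/restoring a GroupResult requires a result backend that supports it (e.g. Redis or the database backend):

from celery import group
from celery.result import GroupResult

# build and launch the batch
job = group(api_get.s(url) for url in urllist)
result = job.apply_async()
result.save()  # persist the group id -> task ids mapping in the backend

# later, e.g. in a status endpoint, look the batch up by its id
restored = GroupResult.restore(result.id)
print(restored.completed_count(), 'of', len(restored.results), 'tasks done')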
A1: As for the label question: yes, you can "label" your tasks by using the custom state feature; see the sketch below.
A2: You can hack around it and put each batch of tasks in its own backend table, but I strongly advise against messing with Celery's tables. If you really want to go that route, use a separate database for this particular purpose.
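A rough illustration of A1; the broker/backend URLs and the batch_label parameter are made up for the example. The metadata lands in the result backend and can be read back via app.AsyncResult(task_id).info:

import requests
from celery import Celery

app = Celery('tasks', broker='redis://localhost', backend='redis://localhost')

@app.task(bind=True)
def api_get(self, url: str, batch_label: str = 'unlabelled') -> tuple:
    # publish a custom state whose metadata carries the batch label
    self.update_state(state='PROGRESS', meta={'batch': batch_label})
    response = requests.get(url)
    return url, response.status_code

Keep in mind that once the task finishes, its final state overwrites the custom one, so query the label while the task runs, or include it in the return value as well.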

'Delay until' finish time of 'Queue a new build' not working in Azure Logic App

I'm triggering an Azure Logic App from an https webhook for a docker image in Azure Container Registry.
The workflow is roughly:
When a HTTP request is received
Queue a new build
Delay until
FinishTime of Queue a new build
See: Workflow image
The Delay until action doesn't work, in that the queried FinishTime is 0001-01-01T00:00:00.
It complains about the wrong format, so I manually added a Z after the FinishTime keyword.
Now the time stamp is in the right format, however, the timestamp 0001-01-01T00:00:00Z obviously doesn't make sense and subsequent steps are executed without delay.
Is there anything I'm missing?
edit: Queue a new build queues an Azure pipeline build. I.e. the FinishTime property comes from the pipeline.
You need to set a timestamp in the future; the timestamp 0001-01-01T00:00:00Z you set in the "Delay until" action is not a future time. If you set a timestamp such as 2020-04-02T07:30:00Z, the "Delay until" action will take effect.
Update:
I don't think the "Delay until" action can do what you expect, but maybe you can refer to the operations below. Just add a "Condition" action to judge whether the FinishTime is greater than the current time.
The expression in the "Condition" is:
sub(ticks(variables('FinishTime')), ticks(utcNow()))
In a word: if the FinishTime is greater than the current time, do the "Delay until" action; if the FinishTime is less than the current time, do whatever else you want. (By the way, you need to pay attention to the time zone of your timestamps; you may need to convert them all to UTC.)
I've been in touch with an Azure support engineer, who has confirmed that the Delay until action should work as I intended to use it, but that the FinishTime property will not hold a value I can use.
In the meantime, I have found a workaround, where I'm using some logic and quite a few additional steps. Inconvenient but at least it does what I want.
Here are the most important steps that are executed after the workflow gets triggered from a webhook (docker base image update in Azure Container Registry).
Essentially, I'm initializing the following variables and queuing a new build:
buildStatusCompleted: String value containing the target value completed
jarsBuildStatus: String value containing the initial value notStarted
jarsBuildResult: String value containing the default value failed
Then, I'm using an Until action to monitor when the jarsBuildStatus value switches to completed.
In the Until action, I'm repeating the following steps until jarsBuildStatus changes its value to buildStatusCompleted:
Delay for 15 seconds
HTTP request to Azure DevOps build, authenticating with personal access token
Parse JSON body of previous raw HTTP output for status and result keywords
Set jarsBuildStatus = status
After breaking out of the Until action (loop), the jarsBuildResult is set to the parsed result.
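For illustration, here is a rough Python equivalent of that polling loop against the Azure DevOps Builds REST API (the organization, project, build id and personal access token are placeholders):

import time
import requests

pat = '<personal-access-token>'
url = ('https://dev.azure.com/<org>/<project>'
       '/_apis/build/builds/<buildId>?api-version=6.0')

status = 'notStarted'
while status != 'completed':
    time.sleep(15)  # mirror the Logic App's 15-second delay
    body = requests.get(url, auth=('', pat)).json()  # PAT as basic-auth password
    status = body['status']  # notStarted | inProgress | completed
result = body.get('result')  # e.g. succeeded or failed, once completed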
All these steps are part of a larger build orchestration workflow, where I'm repeating the given steps multiple times for several different Azure DevOps build pipelines.
The final action in the workflow is sending all the status, result and other relevant data as a build summary to Azure DevOps.
To me, this is only a workaround and I'll leave this question open to see if others have suggestions as well or in case the Azure support engineers can give more insight into the Delay until action.
Here's an image of the final workflow (at least, the part where I implemented the Delay until action):
edit: It turns out I can simplify the workflow, because there's a dedicated Azure DevOps action in Logic Apps called Send an HTTP request to Azure DevOps, which removes the need for manual authentication (the Azure support engineer pointed this out).
The workflow now looks like this:
That is, I can query the build status directly and set the jarsBuildStatus as
@{body('Send_an_HTTP_request_to_Azure_DevOps:_jar''s')['status']}
The code snippet above is automagically converted to a value for the Set variable action. Thus, no need to use an additional Parse JSON action.

Sphinx search crashes when there are no indexes

I want Sphinx's searchd to start up, but there are no indexes populated yet. I have a separate cron job that pulls data from a data source and then calls the indexer to generate the indexes.
So the first time searchd starts, the cron job has not yet run, hence there are no indexes, and searchd fails with errors like:
FATAL: no valid indexes to serve
Is there any way to get around this? E.g. to start searchd even when there are no indexes, so that if someone searches against it during that time, it just returns no docids. Later, when the cron job runs, the indexes will be populated and searchd can query those indexes.
if someone searches against it during that time, it just returns no docids.
That would require an actual index to search against.
Just create an empty index. Then, when indexer runs, it recreates the index (with data this time) and notifies searchd using the --rotate switch.
An example of a way to produce an 'empty' index, as provided by @ctx (added Dec 2014):
source force {
    type = xmlpipe2
    xmlpipe_command = cat /tmp/test.xml
}

index force {
    source = force
    path = /path/to/sphinx/datadir/filename
    charset_type = utf-8
}
/tmp/test.xml:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="subject"/>
  </sphinx:schema>
</sphinx:docset>
Run indexer force, and now searchd should be able to run.
Alternatively, you can use something like sql_query = SELECT 1, '', but that does require a connection to a real database server.

node-postgres: how to prepare a statement without executing the query?

I want to create a "prepared statement" in postgres using the node-postgres module. I want to create it without binding it to parameters because the binding will take place in a loop.
In the documentation I read:
query(object config, optional function callback) : Query
If _text_ and _name_ are provided within the config, the query will result in the creation of a prepared statement.
I tried
client.query({"name":"mystatement", "text":"select id from mytable where id=$1"});
but when I try passing only the text and name keys in the config object, I get an exception:
(translated) the message is: binding 0 parameters but the prepared statement expects 1
Is there something I am missing? How do you create/prepare a statement without binding it to specific values, in order to avoid re-preparing the statement in every step of a loop?
I just found an answer to this issue from the author of node-postgres:
With node-postgres the first time you issue a named query it is
parsed, bound, and executed all at once. Every subsequent query issued
on the same connection with the same name will automatically skip the
"parse" step and only rebind and execute the already planned query.
Currently node-postgres does not support a way to create a named,
prepared query and not execute the query. This feature is supported
within libpq and the client/server protocol (used by the pure
javascript bindings), but I've not directly exposed it in the API. I
thought it would add complexity to the API without any real benefit.
Since named statements are bound to the client in which they are
created, if the client is disconnected and reconnected or a different
client is returned from the client pool, the named statement will no
longer work (it requires a re-parsing).
You can use pg-prepared for that:

var prep = require('pg-prepared')

// first, prepare the statement without binding parameters
var item = prep('select id from mytable where id=${id}')

// then execute the query, binding parameters inside the loop
for (var id of [1, 2, 3]) {
  client.query(item({id: id}), function (err, result) {...})
}
Update: reading your question again, here's what I believe you need to do: you need to pass a values array as well.
Just to clarify: where you would normally "prepare" your query, just build the config object you pass to query(), without the values array. Then, where you would normally "execute" the query, set the values array on that object and pass it to query(). The first time around, the driver does the actual prepare for you; for the rest of the iterations it simply binds and executes.
