Group together celery results - python-3.x

TL;DR
I want to label results in the backend.
I have a flask/celery project and I'm new to celery.
A user sends in a batch of tasks for celery to work on.
Celery saves the results to a backend SQL database (table automatically created by Celery, named celery_taskmeta).
I want to let the user see the status of his batch, and request the results from the backend.
My problem is that all the results are in one table. What are my options for labelling this batch, so the user can tell the batches apart?
My ideas:
Can I add a label to each task, e.g. "Bob's batch no. 12" and then query celery_taskmeta for that?
Can I put each batch in named backend tables, so ask Celery to save results to a table named task_12?
Trying with groups
I've tried the following code to group the results
job_group = group(api_get.delay(url) for url in urllist)
But I don't see any way to identify the group in the backend/results DB
Trying with task name
In the backend I see an empty column header 'name' so I thought I could add an arbitrary string there:
@app.task(name="an amazing vegetable")
def api_get(url: str) -> tuple:
    ...
But then the celery worker throws an error when I run the task:
KeyError: 'an amazing vegetable'
[2020-12-18 12:07:22,713: ERROR/MainProcess] Received unregistered task of type 'an amazing vegetable'.

Probably the simplest solution is to use a group and periodically poll its GroupResult for the group's state.
A1: As for the label question - yes, you can "label" your task by using the custom state feature.
A2: You could hack around this to put each batch of tasks in its own backend table, but I strongly advise against messing with it. If you really want to go this route, make a separate database for this particular use.
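Another way to get a label like "Bob's batch no. 12" without touching Celery's own table is to keep a small side table that maps a batch label to the task IDs it produced, and join it against celery_taskmeta when the user asks for status. This is a different technique from the group approach above, shown only as a minimal sketch: it uses an in-memory SQLite database, and the batch_tasks table, the helper names, and the simplified stand-in for celery_taskmeta (only task_id, status, and result columns) are assumptions for illustration. In a real project the task IDs would come from the id of each result returned when the task is queued.

```python
import sqlite3

def make_demo_db():
    """Create an in-memory DB with a simplified stand-in for Celery's result table."""
    conn = sqlite3.connect(":memory:")
    # Assumption: the real celery_taskmeta has more columns than these three.
    conn.execute(
        "CREATE TABLE celery_taskmeta (task_id TEXT PRIMARY KEY, status TEXT, result TEXT)"
    )
    # Hypothetical side table owned by the application, not by Celery.
    conn.execute("CREATE TABLE batch_tasks (batch_label TEXT, task_id TEXT)")
    return conn

def register_batch(conn, batch_label, task_ids):
    """Remember which task IDs belong to a user-visible batch label."""
    conn.executemany(
        "INSERT INTO batch_tasks (batch_label, task_id) VALUES (?, ?)",
        [(batch_label, tid) for tid in task_ids],
    )

def batch_status(conn, batch_label):
    """Return {task_id: status}; tasks with no result row yet count as PENDING."""
    rows = conn.execute(
        "SELECT b.task_id, COALESCE(t.status, 'PENDING') "
        "FROM batch_tasks b LEFT JOIN celery_taskmeta t ON t.task_id = b.task_id "
        "WHERE b.batch_label = ? ORDER BY b.task_id",
        (batch_label,),
    ).fetchall()
    return dict(rows)

conn = make_demo_db()
register_batch(conn, "Bob's batch no. 12", ["t1", "t2"])
register_batch(conn, "Alice's batch no. 3", ["t3"])
# Simulate the worker having finished only one task so far.
conn.execute("INSERT INTO celery_taskmeta VALUES ('t1', 'SUCCESS', '42')")
statuses = batch_status(conn, "Bob's batch no. 12")
print(statuses)
```

Because the label lives in the application's own table, Celery's schema stays untouched and the same query works no matter how the tasks were queued.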

Automating/Tracking Knex Migrations and Lucid Models

The Situation
I recently started working on a new project using Node.js. I have a background in Python/Django and C#/.NET (not a huge fan of the latter). Node is awesome, but I must say I miss the ease of building models and automating migrations in Django. I am currently using the AdonisJS framework, which leverages Knex. Knex is a powerful library, but the migrations all need to be built manually. Additionally, the AdonisJS ORM that manages the Models is independent of Knex (the migration manager). You also do not define field attributes on the Models, which can have benefits for doing things dynamically in the front and back end. All things considered, there is a lot of room for human error and miscommunication, and a boatload more typing is required. I know the hot thing these days is to keep it loose and fast, but for this specific project, I am looking for a bit more structure than loosely defined models.
Current State
What I have landed on is building a new Class called tableModel and a field class to define the fields within table model. I have already completed this and I am successfully writing the migration files leveraging mustache. I plan on also automatically writing the Models which I shouldn't have a problem with (fingers crossed).
The Problem
Here is where it gets a little tough and where I need help...I need to track what has been added or removed via migration so I can effectively write ups and downs as the tableModels change over time.
So let's say I add a "tableModel" which creates a migration to create table Foo with fields {id (bigint), user_id(int), name(string255)}
Later I want to add a field called description so I would simply add it to my "tableModel" and then run a build command which would build out the migration.
How do I check what has already been created though so I only do an up() for description?
Then I want to remove the name field, so I comment it out in my "tableModel" and run a build migration command. How do I check what has been migrated that now needs to be added to the down()?
Edit: I would add a remove field to the up and the corresponding roll back to the down.
Bonus Round
Let's say I want to change user_id from an int to a bigint, because who makes a foreign key just an int? How do I check not just what needs to be added to the up and down, but also whether I need to change a property on a field?
Edit: would just write the up. and a corresponding roll back to the down
The Big Question
Basically, how do I define dirty "tableModels" classes
Possible Solution?
I am thinking that maybe I should capture some type of registry or snapshot and then run the comparison when building the migrations and/or models, then recapture the snapshot. If this is the route, should I store it in a JSON file, write it to the DB itself, or is there another/better option?
If I create the tableModel instances as constants, could I actually write back to the JS file and capture the snapshot as an attribute? If this is an option, is Node's file system module the way to go, and what's the best way to do this? Node keeps surprising me, so I wouldn't be baffled if any of these are an option.
Help!
If anyone has gone down this path before or knows of any tools I could leverage, I would greatly appreciate it and thank you in advance. Also, if I am headed in a completely wrong direction, then please let me know, I both handle and appreciate all types of feedback.
Example
Something to note, when I define the "tableModel" for a given migration or model, it is an instance of the class, I am not creating an extended class since this is not my orm.
class tableModel {
  constructor(tableName, modelName = tableName, fields = []) {
    this.tableName = tableName
    this.modelName = modelName
    this.fields = fields
  }
  // Bunch of other stuff
}
// Arguments are positional; JS has no keyword arguments, so the array is passed directly.
const fooTableModel = new tableModel('fooTable', 'fooModel', [
  new tableField.stringField('title'),
  new tableField.bigIntField('related_user_id'),
  new tableField.textField('description', 'Testing Default', false, true)
])
which equates to:
tableModel {
  tableName: 'fooTable',
  modelName: 'fooModel',
  fields:
   [ stringField {
       name: 'title',
       type: 'string',
       _unique: false,
       allow_null: null,
       fieldAttributes: {},
       default_value: null },
     bigIntField {
       name: 'related_user_id',
       type: 'bigInteger',
       _unique: false,
       allow_null: null,
       fieldAttributes: {},
       default_value: 0 },
     textField {
       name: 'description',
       type: 'text',
       _unique: false,
       allow_null: true,
       fieldAttributes: {},
       default_value: 'Testing Default' } ] }
You have the up and down notation mixed up. Those are for migrating to the "latest" (runs the up function) and doing rollbacks (runs the down function). Up and down do not relate to dropping or adding table columns.
A migration's up is for any change, and the down is for reversing that change. So if you wanted to drop a column from some table, you write the command in the up, then write the opposite in the down (you'd add it back in...), such that you can "rollback" and the change is effectively reversed. You have to be careful with such things though, as you can put yourself in a situation where you actually lose data.
Want to add a column? Write it in the up, and drop the column in the down.
One of the major points behind the migrations mechanism is to track the state of changes of your database, as time goes forward. So generally, if you created a table in some migration, then a day or so later you realize you need to drop/add columns, you normally don't go back and edit the existing migration, especially if the migration has already been run. You'd just write a new migration to drop/add your column.
Since you're using knex, a couple of "knex" tables get created. By default the one you're looking for is knex_migrations, unless someone changed the settings to rename it. This table records every migration that has run against your DB, along with its batch. From the CLI, assuming you have knex installed globally, you can run knex migrate:latest, which applies every migration in your directory that has not yet been run against the target database. It works this out by examining the knex_migrations table. If you roll out a change and don't like it, and assuming you've written the down function properly, you can invoke knex migrate:rollback to reverse the change. If there are 3 migration files that have NOT yet been run, invoking knex migrate:latest will run all 3 of them under a new batch number, which is 1 higher than the most recent batch number. Conversely, if you invoke knex migrate:rollback, it will find the highest batch number (there could be more than 1 migration in a batch...) and invoke the down function of every file in that batch, effectively rolling back those changes.
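The batch bookkeeping described above can be illustrated with a toy model. This is not knex's implementation, only a sketch of the behaviour as described: the `applied` list stands in for the rows of knex_migrations as (name, batch) pairs, and the function and file names are made up for the example.

```python
def migrate_latest(applied, available):
    """Apply every not-yet-run migration under one new batch number."""
    done = {name for name, _ in applied}
    pending = [name for name in sorted(available) if name not in done]
    if not pending:
        return []
    # New batch number is one higher than the most recent batch (1 if none yet).
    next_batch = max((batch for _, batch in applied), default=0) + 1
    applied.extend((name, next_batch) for name in pending)
    return pending

def migrate_rollback(applied):
    """Undo every migration in the highest batch, newest first."""
    if not applied:
        return []
    last_batch = max(batch for _, batch in applied)
    rolled_back = [name for name, batch in reversed(applied) if batch == last_batch]
    # Drop the rolled-back rows, keeping earlier batches untouched.
    applied[:] = [(n, b) for n, b in applied if b != last_batch]
    return rolled_back

applied = []  # stand-in for the rows of knex_migrations: (name, batch)
files = ["001_create_foo.js", "002_add_description.js"]
migrate_latest(applied, files)          # both files run as batch 1
files.append("003_drop_name.js")
files.append("004_widen_user_id.js")
second_run = migrate_latest(applied, files)  # only the two new files run, as batch 2
undone = migrate_rollback(applied)           # batch 2 is reversed as a unit
print(second_run, undone, applied)
```

The point of the model is that a rollback works on a whole batch, not on a single file, so two migrations applied together are also reversed together.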
All that said, knex is a "query builder" tool. It's got a ton of helper functions to build the SQL for you. Personally, I find this to be a major distraction. Why spend hours upon hours figuring out all the helper functions when I can just crank out raw SQL and run that? That's what we've done in our system: we use knex.raw('') and write our own DDL and DML. It works great and does exactly what we need it to. We don't need to figure out the magic of the query building.
The short answer is that knex will automatically know what has and has not been run for you (again, via that knex_migrations table it creates for you...).
Things can get weird though when it starts involving git and different branches. I recommend that if you're writing migrations on some branch, and you need to go do other work, always remember to first roll back any migrations you've run in that branch BEFORE switching branches. Otherwise you will end up in weird DB states that don't match the application code.
I would personally just deal with updating models independently of writing migrations. For example, if I'm adding a description column to some table, then I probably want to manually update the ORM to reflect the change of the new db schema. Generally, I've found trying to use a tool that automagically does that for you (rather, if I change the orm, stuff happens to write all the underlying sql...) usually winds me up in a heap of trouble and I just spend more time trying to un-fudge stuff. But, that's just my 2 cents :)
Here is where it gets a little tough and where I need help...I need to track what has been added or removed via migration so I can effectively write ups and downs as the tableModels change over time.
You could store changes in a DB/txt file and those can act as snapshots. So when you want to rollback to a particular migration, you would find the changes (up/down) made for that mutation and adjust accordingly.
Later I want to add a field called description so I would simply add it to my "tableModel" and then run a build command which would build out the migration. How do I check what has already been created though so I only do an up() for description?
Here you can query the database directly and check which fields have already been created. If a field is already there and its attributes are the same, you can either ignore it or stop the transaction altogether.
Bonus Round Let's say I want to change user_id from an int to a bigint, because who makes a foreign key just an int? How do I check not just what needs to be added to the up and down, but also checks if I need to change a property on a field.
Again, call the DB itself on the table in question. I know the SQL call would be:
describe [table_name];
After reading to the end, I think you answered this yourself, but I think capturing these changes would work best in a NoSQL database, since you're using Node, or in Postgres with its JSON field.
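The snapshot comparison discussed in this thread can be sketched as a plain dictionary diff. This is only an illustration, written in Python rather than Node: the snapshot layout ({field_name: attributes}) and the function name are assumptions, and the field names are borrowed from the Foo example in the question.

```python
import json

def diff_snapshots(old, new):
    """Compare two {field_name: attributes} snapshots of one table."""
    added = {name: new[name] for name in new if name not in old}
    removed = {name: old[name] for name in old if name not in new}
    changed = {
        name: {"from": old[name], "to": new[name]}
        for name in new
        if name in old and old[name] != new[name]
    }
    return {"added": added, "removed": removed, "changed": changed}

# Snapshot saved after the last build (here parsed from a JSON string,
# as it would be after reading a snapshot file).
previous = json.loads(
    '{"id": {"type": "bigInteger"}, "user_id": {"type": "integer"}, "name": {"type": "string"}}'
)
# Snapshot derived from the current table model definition.
current = {
    "id": {"type": "bigInteger"},
    "user_id": {"type": "bigInteger"},
    "description": {"type": "text"},
}
changes = diff_snapshots(previous, current)
print(json.dumps(changes, indent=2))
```

Each entry maps naturally onto a migration: "added" becomes an add-column in the up and a drop in the down, "removed" the reverse, and "changed" an alter in the up with the old attributes restored in the down.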

Right way to delete and then reindex ES documents

I have a python3 script that attempts to reindex certain documents in an existing ElasticSearch index. I can't update the documents because I'm changing from an autogenerated id to an explicitly assigned id.
I'm currently attempting to do this by deleting existing documents using delete_by_query and then indexing once the delete is complete:
self.elasticsearch.delete_by_query(
index='%s_*' % base_index_name,
doc_type='type_a',
conflicts='proceed',
wait_for_completion=True,
refresh=True,
body={}
)
However, the index is massive, and so the delete can take several hours to finish. I'm currently getting a ReadTimeoutError, which is causing the script to crash:
WARNING:elasticsearch:Connection <Urllib3HttpConnection: X> has failed for 2 times in a row, putting on 120 second timeout.
WARNING:elasticsearch:POST X:9200/base_index_name_*/type_a/_delete_by_query?conflicts=proceed&wait_for_completion=true&refresh=true [status:N/A request:140.117s]
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='X', port=9200): Read timed out. (read timeout=140)
Is my approach correct? If so, how can I make my script wait long enough for the delete_by_query to complete? There are 2 timeout parameters that can be passed to delete_by_query - search_timeout and timeout, but search_timeout defaults to no timeout (which is I think what I want), and timeout doesn't seem to do what I want. Is there some other parameter I can pass to delete_by_query to make it wait as long as it takes for the delete to finish? Or do I need to make my script wait some other way?
Or is there some better way to do this using the ElasticSearch API?
You should set wait_for_completion to False. In this case you'll get task details and will be able to track task progress using corresponding API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html#docs-delete-by-query-task-api
To expand on Random's answer with code, for others who are new to ES/Python like me:
from elasticsearch import Elasticsearch

ES = Elasticsearch(['http://localhost:9200'])
query = {'query': {'match_all': dict()}}
# With wait_for_completion=False the call returns right away with a dict holding the task id
response = ES.delete_by_query(index='index_name', doc_type='sample_doc', wait_for_completion=False, body=query, ignore=[400, 404])
task_id = response['task']
response_task = ES.tasks.get(task_id=task_id)  # fetch the current state of the task
isCompleted = response_task["completed"]  # if the "completed" key is true, the task has finished
You can write a custom function that checks at some interval, in a while loop, whether the task has completed.
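A polling loop of that kind might look like the sketch below. The function name, its parameters, and the fake task lookup are assumptions made for the example; the only thing taken from the answer above is that the task response carries a "completed" key. The lookup is passed in as a callable so the loop can be tried without a cluster.

```python
import time

def wait_for_task(get_task, task_id, interval=5.0, timeout=None, sleep=time.sleep):
    """Poll get_task(task_id) until its "completed" key is true.

    Returns the final response, or raises TimeoutError once at least
    `timeout` seconds of sleeping have accumulated (None waits forever).
    """
    waited = 0.0
    while True:
        response = get_task(task_id)
        if response.get("completed"):
            return response
        if timeout is not None and waited >= timeout:
            raise TimeoutError(f"task {task_id} still running after {waited:.0f}s")
        sleep(interval)
        waited += interval

# Stand-in for the real client so the sketch runs without a cluster:
# it reports "not completed" twice, then "completed".
_fake_responses = iter([{"completed": False}, {"completed": False}, {"completed": True}])
calls = []

def fake_get_task(task_id):
    calls.append(task_id)
    return next(_fake_responses)

final = wait_for_task(fake_get_task, "node:123", interval=0.01)
print(final, len(calls))
```

With the real client, the lookup would be something like `lambda tid: ES.tasks.get(task_id=tid)`, with an interval of a minute or more for a delete that runs for hours.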
I have used python 3.x and ElasticSearch 6.x
You can use the global 'request_timeout' parameter. This overrides the connection's timeout setting for that call, as mentioned here
For example -
es.delete_by_query(index=<index_name>, body=<query>,request_timeout=300)
Or set it at connection level, for example
es = Elasticsearch(**(get_es_connection_parms()),timeout=60)

Determine if a cucumber scenario has pending steps

I would like to retrieve the scenario state in the "After" scenario hook. I noticed that the .failed? method does not consider pending steps as failed steps.
So how can I determine that a scenario did not execute completely, either because it failed OR because some steps were not implemented/defined?
You can use status method. The default value of status is :skipped, the failed one is :failed and the passed step is :passed. So you can write something like this:
do sth if step.status != :passed
Also, if you use !step.passed? it does the same thing because it only checks for the :passed status.
http://cukes.info/api/cucumber/ruby/yardoc/Cucumber/Ast/Scenario.html#failed%3F-instance_method
On that subject, you can also take a look at this post about demoing your feature specs to your customers: http://multifaceted.io/2013/demo-feature-tests/
LiohAu, you can use the 'status' method on a scenario itself rather than on individual steps. Try this: In hooks, add
After do |scenario|
  p scenario.status
end
This will give the statuses as follows:
Any step not implemented / defined, it'll give you :undefined
Scenario fails (when all steps are defined) :failed
Scenario passes :passed
Using the same hook, it'll give you the status for scenario outline, but for each example row (since for each example row, it is an individual scenario). So if at all you want the result of an entire outline, you'll need to capture result for all example rows and compute the final result accordingly.
Hope this helps.

Implementing Multithreading in Oracle Procedures

I am working on Oracle 10gR2.
And here is my problem -
I have a procedure, let's call it proc_parent (inside a package), which is supposed to call another procedure, let's call it user_creation. I have to call user_creation inside a loop, which reads some columns from a table, and these column values are passed as parameters to the user_creation procedure.
The code is like this:
FOR i IN (SELECT community_id,
password,
username
FROM customer
WHERE community_id IS NOT NULL
AND created_by = 'SRC_GLOB'
)
LOOP
user_creation (i.community_id,i.password,i.username);
END LOOP;
COMMIT;
The user_creation procedure invokes a web service for some business logic, and then, based on the response, updates a table.
I need to find a way to use multi-threading here, so that I can run multiple instances of this procedure to speed things up. I know I can use DBMS_SCHEDULER and probably DBMS_ALERT, but I am not able to figure out how to use them inside a loop.
Can someone guide me in the right direction?
Thanks,
Ankur
What you can do is submit lots of jobs at the same time. See Example 28-2 Creating a Set of Lightweight Jobs in a Single Transaction
This fills a pl/sql table with all jobs you want to submit in one tx, all at the same time. As soon as they are submitted (enabled) they will start running, as many as the system can handle, or as many as are allowed by a resource manager plan.
The overhead that the Lightweight jobs have is very ... minimal/light.
I would like to close this question. DBMS_SCHEDULER as well as DBMS_JOB (though DBMS_SCHEDULER is preferred) can be used inside the loop to submit and execute the job.
For instance, here's a sample code, using DBMS_JOB which can be invoked inside a loop:
...
FOR i IN (SELECT community_id,
password,
username
FROM customer
WHERE community_id IS NOT NULL
AND created_by = 'SRC_GLOB'
)
LOOP
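-- The job runs in its own session, so the loop values must be embedded in the WHAT string
DBMS_JOB.SUBMIT(JOB => jobnum,
WHAT => 'BEGIN user_creation (''' || i.community_id || ''',''' || i.password || ''',''' || i.username || '''); END;');
COMMIT;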
END LOOP;
Using a commit after SUBMIT will kick off the job (and hence the procedure) in parallel.

Strict control over the statement_timeout variable in PostgreSQL

Does anybody know how to limit a user's ability to set variables? Specifically statement_timeout?
Regardless of whether I alter the user to have this variable set to a minute, or set it to a minute in the postgresql.conf file, a user can always just type SET statement_timeout TO 0; to disable the timeout completely for that session.
Does anybody know a way to stop this? I know some variables can only be changed by a superuser but I cannot figure out if there is a way to force this to be one of those controlled variables. Alternatively, is there a way to revoke SET from their role?
In my application, this variable is used to stop random users (user registration is open to the public) from using up all the CPU time with (near) infinite queries. If they can disable it, then I must find a new method for limiting resources per user. If there is no way to secure this variable, are there other ways of achieving the same goal that you can suggest?
Edit 2011-03-02
The reason the database is open to the public and arbitrary SQL is allowed is because this project is for a game played directly in the database. Every player is a database user. Data is locked down behind views, rules and triggers, CREATE is revoked from public and the player role to prevent most alterations to the schema and SELECT on pg_proc is removed to secure game-sensitive function code.
This is not some mission critical system I have opened up to the world. It is a weird proof of concept that puts an abnormal amount of trust in the database in an attempt to maintain the entire CIA security triangle within it.
Thanks for your help,
Abstrct
There is no way to override this. If you allow the user to run arbitrary SQL commands, changing the statement_timeout is just the tip of the iceberg anyway... If you don't trust your users, you shouldn't let them run arbitrary SQL - or accept that they can run, well, arbitrary SQL. And have some sort of external monitor that cancels the queries.
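An external monitor of the kind mentioned above could be as small as one query run on a schedule. The sketch below is an assumption-laden illustration, not a tested production tool: it assumes the pg_stat_activity column names used since PostgreSQL 9.2 (pid, state, query_start; older releases used different names), a DB-API connection that uses the %s parameter style (psycopg2 does), and a hypothetical function name. It asks the server to cancel any active query from a non-superuser role that has been running longer than the limit.

```python
# Assumes PostgreSQL 9.2+ column names in pg_stat_activity.
CANCEL_SQL = """
SELECT pid, pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND usename NOT IN (SELECT rolname FROM pg_roles WHERE rolsuper)
  AND now() - query_start > %s * interval '1 second'
"""

def cancel_long_queries(conn, max_seconds):
    """Request cancellation of active non-superuser queries older than max_seconds.

    `conn` is any DB-API connection using the %s parameter style.
    Returns the backend pids a cancel request was sent to.
    """
    cur = conn.cursor()
    try:
        cur.execute(CANCEL_SQL, (max_seconds,))
        return [row[0] for row in cur.fetchall()]
    finally:
        cur.close()
```

Such a monitor would typically connect as a superuser and be run every few seconds from a loop or a scheduler. Because it lives outside the players' sessions, a player's SET statement_timeout TO 0 has no effect on it.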
Basically you can't do this in plain postgres.
To accomplish your goal in the meantime, you can put a proxy in front of the database and rewrite or forbid certain queries.
There are several solutions for that, for example:
db-query-proxy - an article on how it came about (in Russian).
PgBouncer + pgbouncer-rr-patch
The latter comes with very useful examples, and the rewriting is very simple to do in Python:
import re

def rewrite_query(username, query):
    # Raw strings so the escaped parentheses reach the regex engine unchanged
    q1 = r"SELECT storename, SUM\(total\) FROM sales JOIN store USING \(storeid\) GROUP BY storename ORDER BY storename"
    q2 = r"SELECT prodname, SUM\(total\) FROM sales JOIN product USING \(productid\) GROUP BY prodname ORDER BY prodname"
    if re.match(q1, query):
        new_query = "SELECT storename, SUM(total) FROM store_sales GROUP BY storename ORDER BY storename;"
    elif re.match(q2, query):
        new_query = "SELECT prodname, SUM(total) FROM product_sales GROUP BY prodname ORDER BY prodname;"
    else:
        # Anything that matches neither pattern passes through untouched
        new_query = query
    return new_query
