So I'm new to PyMongo & MongoDB, and I'm confused about how best to go about this problem. I have two collections:
Raw_collection
Processed_collection
Basically, I have raw documents that go into Raw_collection, after which I process them by dropping some documents based on filters etc., and store the remaining documents in Processed_collection. On top of that, I plan to periodically update the records in Raw_collection as well.
Given that, what would be the best way to process only the newly inserted documents in Raw_collection on each successive update? I looked into bulk methods, but I'm not sure that's what I want... this seems like a simple-ish problem to solve, but because of my inexperience I'm not sure what the solution would be. Any help is greatly appreciated, thanks!
So I ended up doing this with PyMongo's insert_many method, like so:
import pymongo

client = pymongo.MongoClient()
db = client["mydb"]  # database name assumed

def insert_raw_collection(documents):  # call first
    result = db["Raw_collection"].insert_many(documents)
    obj_id_list = result.inserted_ids
    # e.g. [ObjectId('54f113fffba522406c9cc20e'), ObjectId('54f113fffba522406c9cc20f')]
    return obj_id_list

def insert_processed_collection(obj_id_list):  # call second
    cursor = db["Raw_collection"].find({"_id": {"$in": obj_id_list}})
    for doc in cursor:
        if passes_filter(doc):  # passes_filter: your own keep/drop predicate (renamed so it doesn't shadow the filter builtin)
            # do something, e.g. keep the document:
            db["Processed_collection"].insert_one(doc)
Basically, I return the list of inserted ObjectIds from the previous insert step and perform a filtering operation so I know which documents I want to keep.
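For completeness, a minimal sketch of how the two steps chain together (the sample documents are made up; collection wiring as in the snippet above):

new_docs = [{"value": 1}, {"value": 2}]         # whatever the raw feed produces
inserted_ids = insert_raw_collection(new_docs)  # step 1: insert the raw documents
insert_processed_collection(inserted_ids)       # step 2: filter only the new documents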
I am currently getting a bunch of records for formsets in my Django application with the method below...
line_items = BudgetLineItem.objects.filter(budget_pk=dropdown)

line_item_listofdicts = []
for line_item in line_items:
    line_item_dict = model_to_dict(line_item)
    del line_item_dict['id']
    del line_item_dict['budget']
    del line_item_dict['archive_budget']
    del line_item_dict['new_budget']
    del line_item_dict['update_budget']
    line_item_listofdicts.append(line_item_dict)

UpdateBudgetLineItemFormSet = inlineformset_factory(UpdateBudget,
                                                    UpdateBudgetLineItem,
                                                    form=UpdateBudgetLineItemForm,
                                                    extra=len(line_item_listofdicts),
                                                    can_delete=True,
                                                    can_order=True)
The good news is that it works and does what I want it to. However, it's super slow: it takes about 13 seconds to render the data back to my app. Not optimal. I've spent the morning trying various prefetch_related and select_related calls, but nothing has improved the time it takes to render these fields back to the screen. The fields in question are largely DecimalFields, and I've read that they can be a bit slower. I'm trying to use this data as "input" to my formsets in a CreateView. Again, it works... but it's slow. Any ideas on how to make this approach more performant?
Thanks in advance for any thoughts.
Instead of retrieving the model instances and deleting the fields you don't need, you could just query the models specifying only the fields you want, using the queryset's values(*fields) method, which takes the field names you need as strings. It will automatically return a list of dictionaries with only the specified fields.
Taking your code as an example, and assuming your BudgetLineItem model has the fields title and added_date, your code could look like this:
line_items = BudgetLineItem.objects.filter(budget_pk=dropdown).values('title', 'added_date')
UpdateBudgetLineItemFormSet = inlineformset_factory(UpdateBudget,
                                                    UpdateBudgetLineItem,
                                                    form=UpdateBudgetLineItemForm,
                                                    extra=line_items.count(),
                                                    can_delete=True,
                                                    can_order=True)
Looking at your code, you are doing operations that you don't need:
- Since all you need is the number of items, you don't have to evaluate the queryset; you can just call the queryset's count() method.
- If some piece of code still needs your line_item_listofdicts variable, you can replace it with line_items, as it is already a list of dictionaries containing only the fields you need. Converting your model queryset to a list of model instances, deleting the fields you don't need, and then building yet another list are all expensive operations.
You can check out the documentation on values().
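For illustration, values() yields plain dicts rather than model instances; a quick sketch of what the queryset above would evaluate to (the sample values are made up):

line_items = BudgetLineItem.objects.filter(budget_pk=dropdown).values('title', 'added_date')
# e.g. <QuerySet [{'title': 'Travel', 'added_date': datetime.date(2022, 1, 15)}, ...]>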
Suppose we had these models:
class Publication(models.Model):
    title = models.CharField(max_length=30)

class Article(models.Model):
    headline = models.CharField(max_length=100)
    publications = models.ManyToManyField(Publication)
According to https://docs.djangoproject.com/en/4.0/topics/db/examples/many_to_many/, we must have both objects saved before we can create the relation:
p1 = Publication(title='The Python Journal')
p1.save()
a1 = Article(headline='Django lets you build web apps easily')
a1.save()
a1.publications.add(p1)
Now, if we called delete() on either of those objects, the object would be removed from the DB along with the relation between the two objects. Up to this point I understand.
But is there any way to arrange things so that, when an Article is removed, all the Publications no longer related to any Article are deleted from the DB too? Or is the only way to achieve that to first query the Publications and then iterate through them, like:
to_delete = []
qset = a1.publications.all()
for publication in qset:
    if publication.article_set.count() == 1:
        to_delete.append(publication.id)
a1.delete()
Publication.objects.filter(id__in=to_delete).delete()
But this has lots of problems, especially a concurrency one: a publication might get used by another article between the call to .count() and the final .delete().
Is there any way of doing this automatically, like doing a "conditional" on_delete=models.CASCADE when creating the model or something?
Thanks!
I tried @Ersain's answer:
a1.publications.annotate(article_count=Count('article_set')).filter(article_count=1).delete()
Couldn't make it work. First of all, I couldn't resolve the article_set keyword in the lookup:
django.core.exceptions.FieldError: Cannot resolve keyword 'article_set' into field. Choices are: article, id, title
And then, running the count filter on the QuerySet after filtering by article returned ALL the publications from the article, instead of just the ones with article_count=1. So finally, this is the code I managed to make it work with:
Publication.objects.annotate(article_count=Count('article')).filter(article_count=1).filter(article=a1).delete()
I'm definitely not an expert, and I'm not sure whether this is the best approach or how expensive it really is, so I'm open to suggestions. But as of now it's the only solution I've found that performs this operation atomically.
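For what it's worth, since my concern was atomicity, here is a minimal sketch wrapping the cleanup and the article deletion in one transaction (transaction.atomic is standard Django; the models are the ones defined above):

from django.db import transaction
from django.db.models import Count

with transaction.atomic():
    # delete publications whose only article is a1...
    Publication.objects.annotate(article_count=Count('article')).filter(article_count=1).filter(article=a1).delete()
    # ...then delete the article itself
    a1.delete()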
You can remove the related objects using this query:
a1.publications.annotate(article_count=Count('article_set')).filter(article_count=1).delete()
annotate creates a temporary field for the queryset (an alias) which, using the Count aggregation function, holds the number of related Article objects for each instance in the queryset of Publication objects. COUNT is a standard SQL aggregate function that returns the number of rows matched by a query (here, the number of related instances). Then we filter out those results where article_count equals 1 and delete them.
Based on this post:
Get the latest record from mongodb collection
I got the following to work:
docs = dbCollectionQuotes.find({}).limit(1).sort([('$natural', pymongo.DESCENDING)])
for doc in docs:
    pprint.pprint(doc)
But since we know there is only going to be one row coming back, is there any way to get that one row without looping through the cursor that is returned? I don't think we can use find_one() because of the limit and the sort.
Use next(). This works for me:
doc = dbCollectionQuotes.find().limit(1).sort([('$natural', -1)]).next()
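As a side note, find_one() accepts the same arguments as find(), including sort, so the sort doesn't actually rule it out; limit isn't needed because find_one() returns at most one document. A minimal sketch (same collection as above):

doc = dbCollectionQuotes.find_one(sort=[('$natural', -1)])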
I'm using Objection.js as my ORM for a simple rainfall application. I need to be able to dynamically update an entry in one table when a lower-level table's entry has been updated. To do this I need the whole entry I am updating, so I can use that data to correctly update the dynamically updated entry.
I'm using the $afterUpdate hook for the lower-level table entry. The issue I am having is that when I log this within the $afterUpdate hook function, it only contains the properties for the parts of the entry I want to update. How can I get the entire entry? I'm sure I could get the record by running an additional query to the DB, but I was hoping there would be a way to avoid this. Any help would be appreciated.
I think, as of right now, you can only get the whole model with an extra query.
If you are doing the update with an instance query ($query) you can get the other properties from options.old.
Query:
const user = await User.query().findById(userId);
await user.$query()
  .patch({ name: 'Tom Jane' });
Hook:
$afterUpdate(opt, queryContext) {
  console.log(opt.old);
}
Patch
If you don't need to do this in the hook, you might want to use the patch function chained with .first().returning('*') to get the whole model in a single query; in PostgreSQL this is more efficient than patchAndFetchById, as stated in the documentation:
Because PostgreSQL (and some others) support returning('*') chaining, you can actually insert a row, or update / patch / delete (an) existing row(s), and receive the affected row(s) as Model instances in a single query, thus improving efficiency. See the examples for more clarity.
const jennifer = await Person
  .query()
  .patch({firstName: 'Jenn', lastName: 'Lawrence'})
  .where('id', 1234)
  .returning('*')
  .first();
References:
http://vincit.github.io/objection.js/#postgresql-quot-returning-quot-tricks
https://github.com/Vincit/objection.js/issues/185
https://github.com/Vincit/objection.js/issues/695
I have lots of records in my Postgres database (using Sequelize to communicate with it).
I want to have a migration script, but due to locking, I have to make each change as atomic as possible.
So I don't want to selectAll, then modify, then saveAll.
In Mongo I have a forEach cursor which allows me to update a record, save it, and only then move on to the next one.
Is there anything similar in Sequelize/Postgres?
Currently, I am doing that in my code: getting the IDs, then performing a query for each one.
return migration.runOnAllUpdates((record) => {
    record.change = 'new value';
    return record.save();
});
where runOnAllUpdates will simply give me records one by one.