I've written two different functions that populate different attributes of the same document. One function updates attributes A1, A2 and A3 of document X, and the other updates attributes A4 and A5 of the same document X, given the document id. Both use the MergeOrUpload operation to update the document.
I'm trying to evaluate the cost of the MergeOrUpload operation. How does it work on the backend? Does it read the entire document, merge in the new attribute values and push the document back to the index, or something else? How cost-effective is it?
It would be more efficient to include all attributes in a single call, rather than first creating the document with A1, A2 and A3, and then sending a mergeOrUpload call to add A4 and A5. In the backend, all index files are "read-only" to facilitate the distributed and concurrent nature of the service, so when you update a document, we actually need to create a copy of the original document that includes the newly updated attributes.
Your best bet for an accurate cost estimate is to run your own tests with your own data and usage pattern. However, it's likely you will find that creating the document with all the attributes in one call is the most efficient approach.
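As a rough sketch of the "single call" approach, the two partial updates can be folded into one mergeOrUpload action before anything is sent to the service. The key field name `id`, the attribute values and the helper `combine_updates` are assumptions made up for illustration; only the `@search.action` property and the `value` array follow the batch payload shape used by the indexing REST API.

```python
import json

def combine_updates(key_field, doc_id, *partials):
    """Fold several partial attribute updates for one document into a single action."""
    action = {"@search.action": "mergeOrUpload", key_field: doc_id}
    for partial in partials:
        # Later partials win if two updates touch the same attribute.
        action.update(partial)
    return action

# Hypothetical outputs of the two functions from the question.
first = {"A1": "red", "A2": 10, "A3": True}
second = {"A4": "large", "A5": 3.5}

action = combine_updates("id", "X", first, second)
batch = {"value": [action]}  # one action, so the document is rewritten once

print(json.dumps(batch, indent=2))
```

Whether this is practical depends on whether both functions can run before the document is pushed; if they run at different times, two separate merges may be unavoidable.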
I have a Content Model X with 200+ entries.
One of the fields inside my Content Model X is a reference field referencing the Content Model A.
However, I realized that this Content Model A is no longer suitable for my use case so I created another Content Model B.
I wonder if there is an efficient way to update the reference field of every entry from Content Model A to Content Model B?
Or do I have to update the reference field manually 200+ times?
Thank you!
Before actually modifying 200+ content objects, another important question to answer is at what stage you will change your model. For instance, you cannot change the model while it has entries (and you have 200+). Therefore, it is important to version the Content Model, for instance v1 with a reference to Model-A and v2 with a reference to Model-B. Once you have finalized the structure of v2, use the Content Management API to switch the entries from Model-A to Model-B
(https://www.contentful.com/developers/docs/references/content-management-api/)
Of course we should try to freeze the model structure, but these changes still happen and are a very common scenario. Therefore, changing the entries manually is definitely not an option.
Also, the Content Management API is the solution suggested by Contentful, so it should scale.
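To give an idea of what "switching entries" involves, here is a minimal sketch of the data transformation only. It works on plain dictionaries shaped like entry payloads (localized fields holding link objects) and does not call the Content Management API; fetching and saving each entry is left out. The field name `related`, the locale `en-US`, the entry ids and the `id_map` from Model-A entries to Model-B entries are all hypothetical.

```python
import copy

def repoint_reference(entry, field_name, id_map):
    """Return a copy of the entry whose reference field points at the new entry ids."""
    updated = copy.deepcopy(entry)
    # Fields are keyed by locale; each value is a link object.
    for locale, link in updated["fields"].get(field_name, {}).items():
        old_id = link["sys"]["id"]
        if old_id in id_map:
            link["sys"]["id"] = id_map[old_id]
    return updated

# Hypothetical mapping from Model-A entry ids to their Model-B replacements.
id_map = {"a-entry-1": "b-entry-1"}

entry = {
    "sys": {"id": "x-entry-7"},
    "fields": {
        "related": {
            "en-US": {"sys": {"type": "Link", "linkType": "Entry", "id": "a-entry-1"}}
        }
    },
}

migrated = repoint_reference(entry, "related", id_map)
print(migrated["fields"]["related"]["en-US"]["sys"]["id"])
```

Looping this over all 200+ entries and saving each result through the API would replace the manual edits.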
I have worked on Azure Search service previously where I created an indexer directly on a SQL DB in the Azure Portal.
Now I have a use case where I want to ingest from multiple data sources, each with a different data schema. Assume these data sources are the 3 search APIs of teams X, Y and Z. All of them take a search term and return results in their own schema. I want my Azure Search service to be a proxy for these, so that I have one search API that a user can call to get results from multiple sources, ordered correctly.
How should I go about doing it? I assume that I might have to create a common schema and, whenever a user searches for something, call these 3 APIs, get the results, map them to the common schema and then index this data into an Azure Search index. Finally, I would call the Azure Search API to return the results to the caller.
I would appreciate any help! If you can point me to better documentation for this kind of work, that would be great as well.
Your assumption is correct. You can work with 3 different indexes and fire queries against them, or you can try to combine all of them in the same index. The benefit of the second approach is a better way to implement ordering / paging as all the information will be stored in the same index.
It really depends on what you mean by ordered correctly. Should team X be able to see results from teams Y and Z? The only way you can get ranked results like this is to maintain a single index with a common schema containing data from all teams.
One potential pitfall with this approach is conflicts in the schema, for example if one team requires a field to be of a specific datatype or use a specific analyzer, while another team has different requirements. We do this in our indexes, but with some carefully selected common fields and then dedicated fields prefixed according to our own naming convention to avoid conflicts.
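A minimal sketch of that mapping, assuming three common fields (`id`, `title`, `body`) plus a `team` field, and a `<team>_` prefix for everything else. The field names, the sample records and the helper `to_common` are invented for illustration and are not our actual naming convention.

```python
def to_common(team, raw, mapping):
    """Map one team's result to the common schema; extra fields get a team prefix."""
    doc = {
        "id": f"{team}-{raw[mapping['id']]}",  # prefix keeps keys unique across teams
        "team": team,
        "title": raw[mapping["title"]],
        "body": raw[mapping["body"]],
    }
    mapped_source_fields = set(mapping.values())
    for name, value in raw.items():
        if name not in mapped_source_fields:
            doc[f"{team}_{name}"] = value  # dedicated, prefixed field
    return doc

# Hypothetical results from two of the team APIs, each in its own schema.
x_result = {"docId": 42, "headline": "Pump manual", "text": "How to...", "rating": 4}
y_result = {"key": "k9", "name": "Pump FAQ", "summary": "Common...", "rating": "high"}

docs = [
    to_common("x", x_result, {"id": "docId", "title": "headline", "body": "text"}),
    to_common("y", y_result, {"id": "key", "title": "name", "body": "summary"}),
]
for doc in docs:
    print(doc)
```

Here both teams have a `rating` field with different types, and the prefix turns them into `x_rating` and `y_rating` so they can coexist in one index.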
One thing to consider is the need to reset the index. You can add new fields to an existing index, but if you need to change or remove fields you will have to delete the index and create it again with a new schema. If you have a common index and team X needs to change one of its fields, you would need to reset (delete and create) the common index, which affects all teams.
So, creating separate indexes per team has its benefits. Each team can have their own schema without risk of conflicts and they can reset their index without affecting the other teams.
I have followed through Bookshelf App tutorial (in node.js) by google and instead of books catalogue I would like to model a production part catalogue.
Where a part consists of "sub"-parts and tasks.
Every "sub"-part can have again "sub"-parts and tasks (manufacturing steps).
Current implementation: At the moment I have only two kinds Parts and Tasks.
The relation between parts is managed via a property on the child part that stores the unique key (parentId) of its parent part. A bigger headache I have at the moment is, for example, that a price change of a deeply nested sub-part would need to recursively update all parent parts...
Question: What would be the recommended datastore design for such an application?
It should solve or be more efficient doing:
If I change the price of a "sub-sub-sub"-part, this needs to change the price of all parent parts according to the chosen calculation methodology.
It should not be limited in the depth of sub-parts (I read that the limit on datastore "nested entity values" is 20, but I probably did not understand it correctly).
It should not be limited to 1 write per second per "entity group" (a part and all its sub-parts). I've read about this limit, but I am not sure whether it also applies to so-called Transactions (which I think you can do on entity groups).
One potential solution is to avoid storing aggregate prices in Datastore entirely. Instead, the "price" on each part or task should only include the cost of that thing itself, not of its sub-parts.
Then calculate the price on the fly when needed, adding up the entire tree of parts/sub-parts/tasks. Store the result in memcache if you want to speed up the calculation (but make sure to delete the memcache key when updating prices).
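A minimal sketch of this idea, using plain Python dictionaries as stand-ins for Datastore and memcache so that it runs on its own. The entity layout (`own_price`, `parent_id`), the sample parts and the function names are assumptions; in a real app `children_of` would be a query filtering on parentId, and tasks could be stored as leaf entries in the same way.

```python
# Stand-in for Datastore: each entity stores only its own price and its parent's key.
parts = {
    "bike":  {"own_price": 50.0,  "parent_id": None},
    "wheel": {"own_price": 20.0,  "parent_id": "bike"},
    "frame": {"own_price": 100.0, "parent_id": "bike"},
    "spoke": {"own_price": 0.5,   "parent_id": "wheel"},
}
cache = {}  # stand-in for memcache: part id -> computed total

def children_of(part_id):
    """Equivalent of a query filtering on parentId."""
    return [pid for pid, p in parts.items() if p["parent_id"] == part_id]

def total_price(part_id):
    """Own price plus the totals of all sub-parts, computed on the fly."""
    if part_id in cache:
        return cache[part_id]
    total = parts[part_id]["own_price"] + sum(total_price(c) for c in children_of(part_id))
    cache[part_id] = total
    return total

def set_price(part_id, new_price):
    """One entity write; only cache keys of the part and its ancestors are deleted."""
    parts[part_id]["own_price"] = new_price
    while part_id is not None:
        cache.pop(part_id, None)
        part_id = parts[part_id]["parent_id"]

before = total_price("bike")
set_price("spoke", 1.0)
after = total_price("bike")
print(before, after)
```

A price change then touches a single entity, no matter how deeply the part is nested, and the depth of the tree is only limited by how long the recursive read takes.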
I am still debating which way to go, and possibly storing certain information in its own doc. For example, a customer can have addresses, where each address would be its own doc, and the customer doc would hold an array of ref keys under addresses. The benefit would be that I could update these docs simply based on the key value, instead of having to get the customer doc first, find the array index of the address and then either modify the whole doc or use a subdoc operation to replace the content of the array at that index.
Where I am stuck is how to retrieve those referenced docs. Is N1QL the only way to go, or does the KV API offer a way to do this, short of retrieving the whole customer doc, then looping through the address array and retrieving all referenced docs that way? I know Ottoman offers something like that, but I am having an issue with the latest version of SDK 2.6 and Ottoman, as it's not very well maintained. So hopefully someone can share some insight into what the best way is and why.
If you want to rely on key/value, then you'll need to do the multiple lookup as you've described. I'm not very familiar with Ottoman: it might do this for you, but behind the scenes it will still be multiple key/value operations and/or N1QL.
With N1QL, you can perform JOINs, but again, behind the scenes it's going to eventually be pulling documents out by key/value. It just does those extra steps for you. Direct key/value is always going to be the fastest route.
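The multiple-lookup pattern looks roughly like the sketch below. A plain dictionary stands in for the bucket so that the example runs on its own; `kv_get`, `kv_replace`, the key format and the sample documents are all made up for illustration and are not SDK calls.

```python
# Stand-in for the bucket: key -> JSON document.
bucket = {
    "customer::1": {"name": "Ada", "addresses": ["address::10", "address::11"]},
    "address::10": {"city": "London", "street": "1 Main St"},
    "address::11": {"city": "Paris", "street": "2 Rue A"},
}

def kv_get(key):
    """Placeholder for a key/value get."""
    return bucket[key]

def kv_replace(key, doc):
    """Placeholder for a key/value replace."""
    bucket[key] = doc

def customer_with_addresses(customer_key):
    """One get for the customer, then one get per referenced address."""
    customer = dict(kv_get(customer_key))
    customer["addresses"] = [kv_get(k) for k in customer["addresses"]]
    return customer

def update_address(address_key, **changes):
    """An address changes by key alone; the customer doc is never touched."""
    kv_replace(address_key, {**kv_get(address_key), **changes})

update_address("address::11", city="Lyon")
full = customer_with_addresses("customer::1")
print(full)
```

The read costs 1 + N lookups for N addresses, which is the price paid for being able to update each address directly by its key.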
If you are still in the process of deciding whether to split the data among multiple documents or "denormalize" the data into a single doc, one thing you should think about is how often you're going to access customer+addresses together and how often you're going to access customer and address separately. If you're reading/writing customer+address together often, consider putting them in one document. Otherwise, consider putting them in multiple documents.
The third option is to store it both places, or rather "cache" the address data in the customer document. This is tricky, because it could get out of sync if you're not careful. So make sure it's worth it before you go down that road.
I have a mongo database used to represent spreadsheets with three collections representing respectively cell values (row, col, value), cell formatting (row, col, object representing the format) and cell sizes (whether it's a row or column size, its index and the size).
Every document in all the collections also has a field to identify the table it refers to (containing the table's name) and I'm using upserts (mongoose's findOneAndReplace method with upsert:true) for all insertions/updates.
I was thinking of "pulling the schema inside out", by keeping a single collection representing the table and having the documents previously contained in the three collections as subdocuments inside it, as I thought it would make it more organized.
However, reading up on the subject of subdocuments, it looks like two queries would be needed for every insertion/update in any case (e.g., see this question). Therefore, I was wondering whether the changes I had in mind would lead to a performance hit, so that in trying to simplify the schema I would not only complicate the insertion/update procedures but also get lower performance. (I guess upserts still need to do a search and then either update or insert, so that would still be two queries behind the scenes, but there might be some optimization I'm not aware of.) Thanks!
Yes, there can be a performance hit. Depending on the version and storage engine, MongoDB locks writes at the collection level (MMAPv1) or the document level (WiredTiger). By keeping everything in a single collection, and especially inside a single document per table, you ultimately limit the number of concurrent update operations your application can perform, which decreases performance. The caveat is that this depends entirely on how your application does its writes.
On the flip side, you could potentially save on read operations, as you'd need to query a single collection rather than 3. However, scaling reads is easy compared to scaling writes, and writes are typically the bottleneck, so it's hard to say whether that's worth it.