My previous question: Errors saving data to Google Datastore
We're running into issues writing to Datastore. Based on the previous question, we think the issue is that we're indexing a "SeenTime" attribute with YYYY-MM-DDTHH:MM:SSZ (e.g. 2021-04-29T17:42:58Z) and this is creating a hotspot (see: https://cloud.google.com/datastore/docs/best-practices#indexes).
We need to index this because we're querying the data by date and need the time for each observation in the end application. Is there a way around this issue where we can still query by date?
This answer is a bit late but:
On your previous question, before even writing a query, it feels like the main issue is "running into issues writing" (DEADLINE_EXCEEDED/UNAVAILABLE) -> it's happening on "some saves" -- so, it's not completely clear if it's due to data hot-spotting or from "ingesting more data in shorter bursts", which causes contention (see "Designing for scale").
A single entity in Datastore mode should not be updated too rapidly. If you are using Datastore mode, design your application so that it will not need to update an entity more than once per second. If you update an entity too rapidly, then your Datastore mode writes will have higher latency, timeouts, and other types of error. This is known as contention.
You would need to add a prefix to the key to index monotonically increasing timestamps (as mentioned in the best-practices doc). Then you can test your queries using GQL interface in the console. However, since you most likely want "all events", I don't think it would be possible, and so will result in hot-spotting & read-latency.
The impression is that the latency might be unavoidable. If so, then you would need to decide if it's acceptable, depending on the frequency of your query/number-of-elements returned, along with the amount of latency (performance impact).
Consider switching to Firestore Native Mode. It has a different architecture under the hood and is the next version of Datastore. While Firestore is not perfect, it can be more forgiving about hot-spotting and contention, so it's possible that you'll have fewer issues than in Datastore.
Related
I've read through this excellent feedback on Azure Search. However, I have to be a bit more explicit in questioning one the answers to question #1 from that list...
...When you index data, it is not available for querying immediately.
...Currently there is no mechanism to control concurrent updates to the same document in an index.
Eventual consistency is fine - I perform a few updates and eventually I will see my updates on read/query.
However, no guarantee on ordering of updates is really problematic. Perhaps I'm misunderstanding Let's assume this basic scenario:
1) update index entry E.fieldX w/ foo at time 12:00:01
2) update index entry E.fieldX w/ bar at time 12:00:02
From what I gather, it's entirely possible that E.fieldX will contain "foo" after all updates have been processed?
If that is true, it seems to severely limit the applicability of this product.
Currently, Azure Search does not provide document-level optimistic concurrency, primarily because overwhelming majority of scenarios don't require it. Please vote for External Version UserVoice suggestion to help us prioritize this ask.
One way to manage data ingress concurrency today is to use Azure Search indexers. Indexers guarantee that they will process only the current version of a source document at each point of time, removing potential for races.
Ordering is unknown if you issue multiple concurrent requests, since you cannot predict in which order they'll reach the server.
If you issue indexing batches in sequence (that is, start the second batch only after you saw an ACK from the service from the first batch) you shouldn't see reordering.
Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, distribution and how you access it; you were right before when you compared this process to RDMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query where it is known exactly which node needs to be quired). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that Datastax have built, to quickly generate profile yamls:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.
I'm not a mongodb expert, so I'm a little unsure about server setup now.
I have a single instance running mongo3.0.2 with wiredtiger, accepting both read and write ops. It collects logs from client, so write load is decent. Once a day I want to process this logs and calculate some metrics using aggregation framework, data set to process is something like all logs from last month and all calculation takes about 5-6 hours.
I'm thinking about splitting write and read to avoid locks on my collections (server continues to write logs while i'm reading, newly written logs may match my queries, but i can skip them, because i don't need 100% accuracy).
In other words, i want to make a setup with a secondary for read, where replication is not performing continuously, but starts in a configured time or better is triggered before all read operations are started.
I'm making all my processing from node.js so one option i see here is to export data created in some period like [yesterday, today] and import it to read instance by myself and make calculations after import is done. I was looking on replica set and master/slave replication as possible setups but i didn't get how to config it to achieve the described scenario.
So maybe i wrong and miss something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondaries optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.
Posting here as I could not find any forums for lmdb key-value store.
Is there a limit for sub-databases? What is a reasonable number of sub-databases concurrently open?
I would like to have ~200 databases which seems like a lot and clearly indicates my model is wrong.
I suppose could remodel and embed id of each db in key itself and keep one db only but then I have longer keys and also I cannot drop database if needed.
I'm interested though if LMDB uses some sort of internal prefixes for keys already.
Any suggestions how to address that problem appreciated.
Instead of calling mdb_dbi_open each time, keep your own map with database names to database handles returned from mdb_dbi_open. Reuse these handles for the lifetime of your program. This will allow you to have multiple databases within an environment and prevent the overhead with mdb_dbi_open.
If you read the documentation for mdb_env_set_maxdbs.
Currently a moderate number of slots are cheap but a huge number gets expensive: 7-120 words per transaction, and every mdb_dbi_open() does a linear search of the opened slots.
http://www.lmdb.tech/doc/group__mdb.html#gaa2fc2f1f37cb1115e733b62cab2fcdbc
The best way to know is to test the function call mdb_dbi_open performance to see if it is acceptable.
It may be too much turkey over the holidays, but I've been thinking about a potential problem that we could have with Couchbase.
Currently we paginate based on time, but I'm thinking a similar issue could occur with other values used for paging for example the atomic counter. I'll try to explain best I can, this would only occur in a load balanced environment.
For example say we have 4 servers load balanced and storing data to our Couchbase cluster. We sort our records based on timestamps currently. If any of the 4 servers writing the data starts to lag behind the others than our pagination would possibly be missing records when retrieving client side. A SQL DB auto-increment and timestamps for example can be created when the record is stored to the DB which will avoid similar issues. Using a NoSql DB like Couchbase you define the data you need to retrieve on before it is stored to the DB. So what I am getting at is if there is a delay in storing to the DB and you are retrieving in a pagination fashion while this delay has occurred, you run the real possibility of missing data. Since we are paging that data may never be viewed.
Interested in what other thoughts people have on this.
EDIT**
Response to Andrew:
Example a facebook or pintrest type app is storing data to a DB, they have many load balanced servers from the frontend writing to the db. If for some reason writing is delayed its a non issue with a SQL DB because a timestamp or auto increment happens when the data is actually stored to the DB. There will be no missing data when paging. asking for 1-7 will give you data that is only stored in the DB, 7-* will contain anything that is delayed because an auto-increment value has not been created for that record becuase it is not actually stored.
In Couchbase its different, you actually get your auto increment value (atomic counter) and then save it. So for example say a record is going to be stored as atomic counter number 4. For some reasons this is delayed in storing to the DB. Other servers are grabbing 5, 6, 7 and storing that data just fine. The client now asks for all data between 1 and 7, 4 is still not stored. Then the next paging request is 7 to *. 4 will never be viewed.
Is there a way around this? Can it be modelled differently in CB, or is this just a potential weakness in CB when needing to page results. As I mentioned are paging is timestamp sensitive.
Michael,
Couchbase is an eventually consistent database with respect to views. It is ACID with respect to documents. There are durability interfaces that let you manage this. This means that you can rest assured you won't lose data and that indexes will catch up eventually.
In my experience with Couchbase, you need to expect that the nodes will never be in-sync. There are many things the database is doing, such as compaction and replication. The most important thing you can do to enhance performance is to put your views on a separate spindle from the data. And you need to ensure that your main data spindles across your cluster can sustain between 3-4 times your ingestion bandwidth. Also, make sure your main document key hashes appropriately to distribute the load.
It sounds like you are discussing a situation where the data exists in your system for less time than it takes to be processed through the view system. If you are removing data that fast, you need either a bigger cluster or faster disk arrays. Of the two choices, I would expand the size of your cluster. I like to think of Couchbase as building a RAIS, Redundant Array of Independent Servers. By expanding the cluster, you reduce the coincidence of hotspots and gain disk bandwidth. My ideal node has two local drives, one each for data and views, and enough RAM for my working set.
Anon,
Andrew