Checking to see if a RethinkDB shard is in use? - node.js

I am wondering if there is a way to check whether a RethinkDB shard is in use before performing a ReQL query on it.
I am currently calling two functions back to back: the first creates a RethinkDB table and inserts data, the second reads the data from that newly created table. This works fine when the data being inserted is minimal, but once the size of the inserted data set increases, I start getting:
Unhandled rejection RqlRuntimeError: Cannot perform write: Primary replica for shard ["", +inf) not available
This is because the primary shard is still busy with the write from the previous function. I am wondering if there is some RethinkDB-specific way of avoiding this, or if I need to emit/listen for events or something?

You can probably use the wait command for this. From the docs:
Wait for a table or all the tables in a database to be ready. A table may be temporarily unavailable after creation, rebalancing or reconfiguring. The wait command blocks until the given table (or database) is fully up to date.
http://rethinkdb.com/api/javascript/wait/
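For example, with the official rethinkdb Node.js driver it could look roughly like the sketch below (the table name and bigDataSet are placeholders, not taken from the question):

const r = require('rethinkdb');

async function createAndRead(bigDataSet) {
  const conn = await r.connect({ host: 'localhost', port: 28015 });
  await r.tableCreate('metrics').run(conn);
  await r.table('metrics').insert(bigDataSet).run(conn);
  // Block until every shard/replica of the new table is ready before reading from it
  await r.table('metrics').wait({ waitFor: 'all_replicas_ready' }).run(conn);
  const cursor = await r.table('metrics').run(conn);
  return cursor.toArray();
}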

Related

Is it possible to insert or update without providing all primary keys in a Cassandra database?

I have an application which uses Cassandra as its database. A row of the table is filled at three separate moments (by three inputs). The table has four primary key columns, and not all of them are available at the moment an insert or update has to happen.
The error is:
"Some partition key parts are missing" when trying to insert or update.
Please consider that my application has to perform a lot of writes (near 300,000) into the database in a short interval of time, so I want to get the maximum write throughput the DB can offer.
Maybe one approach could solve the issue: first read from the DB, then write into the DB, using dummy values for a primary key column if it is not available at the moment of inserting or updating. But that would add another ~300,000 reads against the DB and slow down both the DB and my application.
So I am looking for another solution.
The table has four primary key columns, and not all of them are available at the moment an insert or update has to happen.
As you are finding out, that is not possible. For partition keys in particular, they are used (hashed) to determine which node in the cluster is primarily responsible for the data. As it is a fundamental part of the Cassandra write path, it must be complete at write-time and cannot be changed/updated later.
The same is true with clustering keys (keys which determine the on-disk sort order of all data within a partition). Omitting one or more will yield this message:
Some clustering keys are missing
Unfortunately, there isn't a good way around this. A row's keys must all be known before writing. It's worth mentioning that keys in Cassandra are unique, so any attempt to update them will result in a new row.
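To make that concrete, here is a minimal sketch with the DataStax Node.js driver; the schema and names are hypothetical, not taken from the question:

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'app',
});

// Assume: PRIMARY KEY ((tenant_id, day), event_id, step)
// -> two partition key parts plus two clustering columns, four key values in total.
async function writeEvent(tenantId, day, eventId, step, payload) {
  const insert =
    'INSERT INTO events (tenant_id, day, event_id, step, payload) VALUES (?, ?, ?, ?, ?)';
  // All four key values must be bound here; leaving any out fails at write time with
  // "Some partition key parts are missing" or "Some clustering keys are missing".
  await client.execute(insert, [tenantId, day, eventId, step, payload], { prepare: true });
}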
Perhaps writing the data to a streaming topic or message broker (like Pulsar or Kafka) beforehand would be a better option? Then, once all keys are known, the data (message) can be consumed from the topic and written to Cassandra.

How to achieve locking across multiple table updates in Cassandra so as to attain isolation and avoid the dirty read problem

I am using Cassandra as the NoSQL database in my solution and have a data model with two tables: a parent table and a child table.
Here is the scenario:
Client A updates a parent table record as well as child table records.
At the same time, client B issues a select request (which hits both the parent and the child table).
Client B receives the latest record from the parent table but gets an older record from the child table.
I can use a logged batch operation to achieve atomicity for updating both tables, but I am not sure how to isolate or lock the read request from client B so as to avoid the dirty read problem.
I have also tried evaluating lightweight transactions, but they don't seem to work in this case.
I am thinking about using some middleware application to implement locking functionality, since there seems to be nothing available in Cassandra out of the box.
Please help me understand how to achieve read/write synchronization in this regard.
As you mentioned, Cassandra batches only give you atomicity. They do provide isolation when the batch touches a single partition, which unfortunately is not your case.
To respond to your question: if you really need transactions, I would think about the problem and possible solutions once again. Either eliminate the need for locking, or change the technology stack.

Keep denormalized table consistent (synchronized) in Cassandra DB

I am exploring Cassandra and I've come across this video which explains how to denormalize tables to store comments. It goes like this:
Instead of having a single comments table with a user_id column that points to the user who wrote the comment and a video_id column that points to the video that was commented on,
we have two comment tables, comments_by_user and comments_by_video. The question is: how do we keep these two tables synchronized?
When a user comments on a video, we insert the comment into comments_by_video and comments_by_user, keyed by video and by user respectively. However, what if the second insert fails?
We would then have a comment on a video that cannot be found when we select all comments for that user.
You can use a batch statement for this purpose. Be warned, though, that a logged batch is slower and puts roughly 30% of overhead on the coordinator node compared to the regular operations.
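For instance, with the DataStax Node.js driver a logged batch over the two denormalized tables could look roughly like this (the column names are assumptions):

const cassandra = require('cassandra-driver');

async function addComment(client, c) {
  const queries = [
    {
      query: 'INSERT INTO comments_by_video (video_id, comment_id, user_id, comment) VALUES (?, ?, ?, ?)',
      params: [c.videoId, c.commentId, c.userId, c.text],
    },
    {
      query: 'INSERT INTO comments_by_user (user_id, comment_id, video_id, comment) VALUES (?, ?, ?, ?)',
      params: [c.userId, c.commentId, c.videoId, c.text],
    },
  ];
  // logged: true routes the batch through the batchlog, so either both inserts
  // eventually apply or neither does (atomicity, at the cost of extra coordinator work).
  await client.batch(queries, { prepare: true, logged: true });
}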
One option is to use a batch statement, as mentioned in the previous answer, which will have a performance impact.
The other option I could think of trying would be to change the consistency level to ANY when a write fails. With consistency ANY, your data will be saved on the coordinator node for the configured amount of time and replicated to the responsible node once it comes back up.
For any failed writes, handle the failure in your code, change the consistency to ANY for that insert, and execute the insert again. Of course, you won't be able to read that data until one of the responsible nodes actually has it. A rough sketch with the DataStax Node.js driver (statement1/statement2 stand in for the two denormalized inserts):
const cassandra = require('cassandra-driver');

async function retryWrites(client, query, params) {
  try {
    await client.execute(query, params, { prepare: true });
  } catch (err) {
    // On a write failure, retry with consistency ANY; the coordinator stores a hint
    // and replays it once the responsible replica comes back up.
    await client.execute(query, params, {
      prepare: true,
      consistency: cassandra.types.consistencies.any,
    });
  }
}

await retryWrites(client, statement1, params1);
await retryWrites(client, statement2, params2);

MongoDB, can I trigger secondary replication only at a given time or manually?

I'm not a MongoDB expert, so I'm a little unsure about the server setup now.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write operations. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework; the data set to process is roughly all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously but starts at a configured time, or better yet, is triggered right before all the read operations start.
I'm doing all my processing from node.js, so one option I see is to export the data created in some period like [yesterday, today] and import it into the read instance myself, running the calculations after the import is done. I looked at replica sets and master/slave replication as possible setups, but I didn't figure out how to configure them to achieve the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up to date. It will take a while until it has processed the changes since its last contact with the master, and there is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status(), comparing the secondary's optimeDate with its lastHeartbeat date).
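For reference, a rough sketch of inspecting those fields from node.js, where client is assumed to be an already connected MongoClient:

const status = await client.db('admin').command({ replSetGetStatus: 1 });
status.members
  .filter(m => m.stateStr === 'SECONDARY')
  .forEach(m => console.log(m.name, m.optimeDate, m.lastHeartbeat));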
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match stage, which selects the documents from the last month, followed by an $out stage. The $out operator specifies that the results of the aggregation are not sent back to the application/shell but written to a new collection (which is automatically emptied before this happens). You can then run your reporting on the new collection without locking the actual one. You also get the advantage of operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Finally, your data won't change between aggregations, so your reports won't have inconsistencies caused by data changing while they run.
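For example, from node.js that snapshot step could look roughly like the following sketch (database, collection, and field names are assumptions, using a current version of the official driver):

const { MongoClient } = require('mongodb');

async function snapshotLastMonth(uri) {
  const client = await MongoClient.connect(uri);
  try {
    const since = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
    await client.db('logging').collection('logs').aggregate([
      { $match: { createdAt: { $gte: since } } }, // only last month's documents
      { $out: 'logs_report' },                    // write results to a separate collection
    ]).toArray(); // iterating the cursor is what actually runs the pipeline
  } finally {
    await client.close();
  }
}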
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend building a proper replica set (consisting of a primary, a secondary and an arbiter) and leaving replication active at all times. Not only will that make sure your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

SQL Azure Elastic Scale - reference table

I am not sure I understand the idea of reference tables correctly. In my mind it is a table that contains the same data in every shard. Am I wrong? I am asking because I have no idea how I should insert data into the reference table so that the data ends up in every shard. Or maybe that is impossible? Can anyone clarify this issue?
Yes, the idea of a reference table is that the same data is contained on every shard. If you have a small number of shards and data changes are rare, you can open multiple connections in your application and apply the changes to all the databases simultaneously. Or you can construct a management script that periodically iterates across all shards to update the reference data or perform a bulk insert of a fresh image of the rows.
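A rough node.js sketch of that kind of management script, using the mssql package (the connection strings, table, and statement are hypothetical):

const sql = require('mssql');

const shardConnectionStrings = [ /* one connection string per shard database */ ];

async function updateReferenceTable(statement) {
  for (const connectionString of shardConnectionStrings) {
    const pool = await new sql.ConnectionPool(connectionString).connect();
    try {
      await pool.request().query(statement); // e.g. an INSERT/UPDATE on the reference table
    } finally {
      await pool.close();
    }
  }
}

await updateReferenceTable("INSERT INTO CountryCodes (Code, Name) VALUES ('FR', 'France')");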
There is also a new feature in preview for Azure SQL Database called Elastic DB Jobs, which allows you to define a SQL script for operations that you want to run on all shards, and then runs the script asynchronously with eventual completion guarantees. You can potentially use this to update reference tables. Details on the feature are here.
