Get size of a collection in bytes - arangodb

Is there a query to get the size in bytes of a collection? I would like to know how much storage space a certain collection needs.

You may call the collection API on a single server or on a cluster's coordinator like so:
<endpoint>/_db/<database>/_api/collection/<collection>/figures
In arangosh, connected again to either a single server or a cluster's coordinator endpoint:
> db._useDatabase("<database>");
> db.<collection>.figures();
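For example, a minimal arangosh sketch (the database and collection names are placeholders; the exact size-related fields inside the returned figures object depend on the ArangoDB version and storage engine):

db._useDatabase("mydb");          // switch to the database that holds the collection
var fig = db.users.figures();     // ask the server for the collection's figures
print(fig);                       // inspect the output for the byte counts you need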

Related

How does a MongoDB replica set work with nodejs-mongoose?

Tech stack used: Node.js, Mongoose, MongoDB
I'm working on a product that handles many DB requests. At the beginning of every month the DB requests are high due to heavy read/write traffic (bulk data processing). The number of records in each collection targeted by these read/write requests is quite high. Reads are high but writes are not that high.
So the CPU utilization on the instance running MongoDB reaches the danger zone (above 90%) during these times. The only thing that gets me through these times is HOPE (yes, hoping that the instance will not crash).
Rather than scaling vertically, I'm looking for solutions to scale horizontally (not a revolutionary thought). I looked at replica sets and sharding. This question is only related to replica sets.
I went through the documents and I feel like my understanding of replica sets is not really the way they might work.
I have configured my replica set with the configuration below. I simply want to add one more instance because, as per my current understanding, if I add one more instance then my database can handle more read requests by distributing the load, which could reduce the CPU utilization on the primary node by at least 30%. Is this understanding correct or wrong? Please share your thoughts.
var configuration = {
  _id: "testReplicaDB",
  members: [
    { _id: 0, host: "localhost:12017" },
    { _id: 1, host: "localhost:12018", arbiterOnly: true, buildIndexes: false },
    { _id: 2, host: "localhost:12019" }
  ]
}
When I brought up the replica set with the above config and ran my nodejs-mongoose code, I ran into this issue. The resolution they propose is to change the above config into:
var configuration = {
  _id: "testReplicaDB",
  members: [
    { _id: 0, host: "validdomain.com:12017" },
    { _id: 1, host: "validdomain.com:12018", arbiterOnly: true, buildIndexes: false },
    { _id: 2, host: "validdomain.com:12019" }
  ]
}
Question 1 (related to the code written in the Node.js project with the mongoose library (for handling the DB), which connects to the replica set)
const URI = `mongodb://167.99.21.9:12017,167.99.21.9:12019/${DB}`;
I have to specify both URIs of my MongoDB instances in the mongoose connection URI string.
When I look at my nodejs-mongoose code that will connect to the replica set, I have many doubts about how it might handle the multiple nodes.
How does mongoose know which IP is the primary node?
Let's assume 167.99.21.9:12019 is the primary node and rs.slaveOk(false) is set on the secondary replica, so the secondary node cannot serve read requests.
In this situation, does mongoose send the request to the first URI (167.99.21.9:12017) and this instance redirects it to the primary node, or will the request come back to mongoose and then mongoose will trigger another request to 167.99.21.9:12019?
Question 2
This doc link mentions that data redundancy enables handling high read requests. Let's assume read is enabled for the secondary node, and
let's assume the case when mongoose sends a request to the primary node while the primary node is getting bombarded with read/write requests but the secondary node is free (doing nothing). Will MongoDB automatically redirect the request to the secondary node, or will this request fail and come back to mongoose, so that the burden is on mongoose to trigger another request to the next available node?
Can mongoose automatically know which node in the replica set is free?
Question 3
Assuming both 167.99.21.9:12017 & 167.99.21.9:12019 instances are available for read requests with ReadPreference.SecondaryPreferred or ReadPreference.nearest, will the load get distributed when the secondary node gets bombarded with read requests while the primary node is at around 20% utilization? Is this the case, or is my understanding wrong? Can the replica set act as a load balancer? If not, how do I make it balance the load?
Question 4
var configuration = {
  _id: "testReplicaDB",
  members: [
    { _id: 0, host: "validdomain.com:12017" },
    { _id: 1, host: "validdomain.com:12018", arbiterOnly: true, buildIndexes: false },
    { _id: 2, host: "validdomain.com:12019" }
  ]
}
You can see the DNS name in the configuration. Does this mean that when the primary node redirects a request to the secondary node, DNS resolution will happen and then, using the IP that corresponds to the secondary node, the request will be redirected to it? Is my understanding correct or wrong? (If my understanding is correct, this is going to fire up another set of questions.)
:|
I could've missed many details while reading the docs. This is my last hope of getting answers, so please share if you know the answers to any of these.
If this is the case, then how does mongoose know which IP is the primary replica set?
There is no "primary replica set"; there can, however, be a primary in a replica set.
Each MongoDB driver queries all of the hosts specified in the connection string to discover the members of the replica set (in case one or more of the hosts is unavailable for whatever reason). When any member of the replica set responds, it does so with the full list of current members of the replica set. The driver then knows what the replica set members are, and which of them is currently primary (if any).
the secondary node cannot serve read requests
This is not at all true. Any data-bearing node can fulfill read requests, IF the application provides a suitable read preference.
In this situation, does mongoose send the request to the first URI (167.99.21.9:12017) and this instance redirects it to the primary, or will the request come back to mongoose and then mongoose will trigger another request to 167.99.21.9:12019?
mongoose does not directly talk to the database. It uses the driver (node driver for MongoDB) to do so. The driver has connections to all replica set members, and sends the requests to the appropriate node.
For example, if you specified a primary read preference, the driver would send that query to the primary if one exists. If you specified a secondary read preference, the driver would send that query to a secondary if one exists.
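As a hedged sketch of how this looks from mongoose (host names, database name and the replica set name are placeholders taken from the question's config):

const mongoose = require('mongoose');

// the two hosts are only seed addresses; the driver discovers the full member list
// and the current primary from whichever member answers first
const URI = 'mongodb://validdomain.com:12017,validdomain.com:12019/mydb'
  + '?replicaSet=testReplicaDB&readPreference=secondaryPreferred';

mongoose.connect(URI, { useNewUrlParser: true, useUnifiedTopology: true })
  .then(() => console.log('connected'))      // writes still go to the primary
  .catch(err => console.error(err));         // reads may be served by a secondary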
I'm assuming that both 167.99.21.9:12017 & 167.99.21.9:12019 instances are available for read requests with ReadPreference.SecondaryPreferred or ReadPreference.nearest
Correct, any node can fulfill those.
the load could get distributed across
Yes and no. In general replicas may have stale data. If you require current data, you must read from the primary. If you do not require current data, you may read from secondaries.
how to make it balance the load?
You can make your application balance the load by using secondary or nearest reads, assuming it is OK for your application to receive stale data.
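For example, a per-query override in mongoose might look like this (a rough sketch; the "Order" model is hypothetical, and this only makes sense for queries that can tolerate stale data):

Order.find({ status: 'open' })
  .read('secondaryPreferred')   // route just this query to a secondary when one is available
  .exec()
  .then(docs => console.log(docs.length));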
if mongoose sends a request to the primary replica and the primary replica is bombarded with read/write requests while the secondary replica is free (doing nothing), will MongoDB automatically redirect the request to the secondary replica?
No, a primary read will not be changed to a secondary read.
Especially in the scenario you are describing, the secondary is likely to be stale, thus a secondary read is likely to produce wrong results.
can mongoose automatically know which replica is free?
mongoose does not track deployment state, the driver is responsible for this. There is limited support in drivers for choosing a "less loaded" node, although this is measured based on network latency and not CPU/memory/disk load and only applies to the nearest read preference.
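If you want that latency-based behaviour, a connection string along these lines should work (a sketch; localThresholdMS controls the latency window the driver uses when picking the "nearest" member):

const URI = 'mongodb://validdomain.com:12017,validdomain.com:12019/mydb'
  + '?replicaSet=testReplicaDB&readPreference=nearest&localThresholdMS=20';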

How is Monitor - Cosmos DB (preview) Requests calculated?

Azure provides monitoring of the incoming requests to Cosmos DB. When I was working on my Cosmos DB alone, I ran a simple select vertex statement (e.g., g.V('id')). Then I monitored the incoming requests, and the count showed around 10. But I know for sure I was the only person who accessed it. I also tried traversing through the graph in a single select query, and the request count was huge (around 100).
Has anybody else noticed these metrics? We assume the request count is huge for an hour in production, causing the performance slowness. Is the metric trustworthy, or how else can I find the incoming requests to Cosmos?

How do ArangoDB Graph Traversal Queries Execute in a Cluster?

The description of SmartGraphs here seems to imply that graph traversal queries actually follow edges from machine to machine until the query finishes executing. Is that how it actually works? For example, suppose you have the following query that retrieves the 1-hop, 2-hop, and 3-hop friends starting from the person with id 12345:
FOR p IN Person
  FILTER p._key == "12345"
  FOR friend IN 1..3 OUTBOUND p knows
    RETURN friend
Can someone please walk me through the lifetime of this query starting from the client and ending with the results on the client?
What actually happens can be a bit different from the diagrams on our website. What we show there is kind of a "worst case" where the data cannot be sharded perfectly (just to make it a bit more fun). But let's take a quick step back first and describe the different roles within an ArangoDB cluster. If you are already aware of our cluster lingo/architecture, please skip the next paragraph.
You have the Coordinator which, as the name says, coordinates the query execution and is also the place where the final result set gets built up before being sent back to the client. Coordinators are stateless, host a query engine, and are the place where Foxx services live. The actual data is stored on the DBservers in a stateful fashion, but DBservers also have a distributed query engine which plays a vital role in all our distributed query processing. The brain of the cluster is the agency, with at least three agents running the RAFT consensus protocol.
When you have sharded your graph data set as a SmartGraph, the following happens when a query is sent to a Coordinator:
- The Coordinator knows which data needed for the query resides on which machine and distributes the query accordingly to the respective DBservers.
- Each DBserver has its own query engine and processes the incoming query from the Coordinator locally, then sends the intermediate result back to the Coordinator where the final result set gets put together. This runs in parallel.
- The Coordinator then sends the result back to the client.
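If you want to see how the Coordinator splits such a traversal, you can inspect the execution plan in arangosh (a sketch based on the query above; look for scatter/gather and remote nodes in the plan, which mark the points where data crosses the network):

db._explain(`
  FOR p IN Person
    FILTER p._key == "12345"
    FOR friend IN 1..3 OUTBOUND p knows
      RETURN friend
`);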
In case you have a perfectly shardable graph (e.g. a hierarchy with its branches being the shards // use cases could be e.g. Bill of Materials or Network Analytics), you can achieve performance close to that of a single instance, because queries can be sent to the right DBservers and no network hops are required.
If you have a much more "unstructured" graph, like a social network where connections can occur between any two given vertices, sharding becomes an optimization question and, depending on the query, it is more likely that network hops between servers occur. This latter case is what is shown in the diagrams on our website. In this case, the SmartGraph feature can reduce the network hops to a minimum, but not eliminate them completely.
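For reference, creating such a SmartGraph from arangosh might look roughly like this (Enterprise Edition only; the graph name, the "region" smartGraphAttribute and the shard count are made-up values for illustration):

var smartGraphs = require("@arangodb/smart-graph");
smartGraphs._create(
  "knowsGraph",                                            // graph name
  [smartGraphs._relation("knows", "Person", "Person")],    // edge definition
  [],                                                      // no orphan collections
  { smartGraphAttribute: "region", numberOfShards: 9 }     // sharding hints
);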
Hope this helped a bit.

Unable to understand why N1QL queries in Couchbase hang

I have a Couchbase cluster setup (Couchbase version 4.1) with N data nodes, 1 query node and 1 index node. The data nodes have roughly 1 million key-value pairs in a single bucket. The whole setup is hosted in Microsoft Azure within a virtual network. I can assure you that each node has enough resources, so RAM, CPU or disk is not an issue.
I can GET/SET JSON documents in my Couchbase server without any issue. I am just testing, so ports are not an issue, as I have opened all ports between the machines for now.
But when I try to run N1QL queries (from the Couchbase shell or using the Python SDK), they do not work. The query just hangs and I don't get any reply from the server. On the other hand, once in a while the query just works without any issue, and then after a minute it stops working again.
I have created a PRIMARY index on my bucket and any other required global secondary indexes.
I also installed the sample buckets provided by Couchbase. The same problems exist.
Does anyone have a clue what the issue could be?
Your query probably hangs because you are straining the server too much. I don't know how many N1QL ops you are pushing each second, but for that type of query you will benefit the most from a few tweaks, which lower CPU usage and increase efficiency.
Create a specific covering index such as:
create index inx_id_email on clients(id,email) where transaction_successful=false
Use the explain keyword to check whether your query is using the index:
explain SELECT id, email FROM clients WHERE transaction_successful = false LIMIT 100 OFFSET 200
I believe that your query/index nodes are over-utilized because you are actually doing the equivalent of a primary scan (a full table scan in relational-database terms).

How to handle read/write requests in Cassandra

I have a 5-node cluster with 2 Cassandra, 2 Solr and 1 Hadoop node, on EC2 with DSE 4.5.
My requirement is that I don't want to hard-code node IP addresses when reading from or writing to the cluster. I have to develop a web service through which a requester can send read/write requests to my cluster, and the web service has to do the following:
1) route read requests to the appropriate node.
2) route write requests to the appropriate node.
If there is a write request, it should be directed to a Cassandra node on the basis of the keyspace and replication factor. If it is a read request, it should be routed to a Solr node (as I have done the indexing in Solr), and if it is an analytics query, it should be routed to Hadoop.
And if any node goes down, the response should not be affected.
Apart from dedicated requests, is there any way to request the cluster?
By dedicated I mean giving a specific IP address for read and for write.
Does any method or algorithm exist in DSE? Or is there any tool available for this?
The Java driver should take care of all of that for you:
http://www.datastax.com/documentation/developer/java-driver/2.0/common/drivers/introduction/introArchOverview_c.html
For example:
Nodes discovery: the driver automatically discovers and uses all nodes of the Cassandra cluster, including newly bootstrapped ones
Configurable load balancing: the driver allows for custom routing and load balancing of queries to Cassandra nodes. Out of the box, round robin is provided, with optional data-center awareness (only nodes from the local data center are queried and have connections maintained to them) and optional token awareness (that is, the ability to prefer a replica for the query as coordinator).
Transparent failover: if Cassandra nodes fail or become unreachable, the driver automatically and transparently tries other nodes and schedules reconnection to the dead nodes in the background.
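The answer links the Java driver; as a rough sketch of the same ideas in JavaScript, the DataStax Node.js driver (cassandra-driver) behaves similarly (contact points, data center and keyspace names here are placeholders):

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['10.0.0.1', '10.0.0.2'],  // seed nodes only; the rest of the cluster is discovered
  localDataCenter: 'datacenter1',           // data-center awareness for routing
  keyspace: 'my_keyspace'
});

// the default policy is token-aware, DC-aware round robin, so requests reach a replica
// for the partition key without the application hard-coding node IPs
const userId = '42';
client.execute('SELECT * FROM users WHERE id = ?', [userId], { prepare: true })
  .then(result => console.log(result.rows))
  .catch(err => console.error(err));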
On the Solr query side you can use the SolrJ load balancer; you have to hard-wire the list of nodes to be used as coordinator nodes, but SolrJ will round-robin across them for you.
