I have a couchbase cluster setup (couchbase version 4.1) where there are N data nodes, 1 Query Node and 1 Index Node. Data nodes have roughly 1 million key value pairs in a single bucket. This whole setup is hosted in Microsoft Azure within a virtual network. And can assure you that each node has enough resources that RAM, CPU or Disk is not an issue.
Now i can GET/SET JSON documents in my couchbase server without any issue. I am just testing, so ports are not issue as i have opened all ports between machines for now.
But when i try to run N1QL queries (from couchbase shell or using python SDK) it does not work. The query just hangs and i don't get any reply from server. On the other hand, once in a while the query just works without any issue and then after a minute it again stops working.
I have created PRIMARY index on my bucket and any other required Global Secondary Index if needed.
I also installed sample buckets provided by couchbase. Same problems exist.
Does anyone have a clue what the issue could be?
Your query hangs probably because you are straining the server too much, I don't know how many N1QL ops you are push each second, but for that type of query you will benefit the most with several tweaks, which lower cpu usage and increase efficiency.
Create a specific covering index such as:
create index inx_id_email on clients(id,email) where transaction_successful=false
use explain keyword to check if your query is using the index.
(explain SELECT id, email FROM clients where transaction_successful = false LIMIT 100 OFFSET 200)
I believe that your query/index nodes are utilized too much because you actually doing the equivalent to primary scan in relational databases.
Related
I am re-designing a project I built a year ago when I was just starting to learn how to code. I used MEAN stack, back then and want to convert it to a PERN stack now. My AWS knowledge has also grown a bit and I'd like to expand on these new skills.
The application receives real-time data from an api which I clean up to write to a database as well as broadcast that data to connected clients.
To better conceptualize this question I will refer to the following items:
api-m1 : this receives the incoming data and passes it to my schema I then send it to my socket-server.
socket-server: handles the WSS connection to the application's front-end clients. It also will write this data to a postgres database which it gets from Scraper and api-m1. I would like to turn this into clusters eventually as I am using nodejs and will incorporate Redis. Then I will run it behind an ALB using sticky-sessions etc.. for multiple EC2 instances.
RDS: postgres table which socket-server writes incoming scraper and api-m1 data to. RDS is used to fetch the most recent data stored along with user profile config data. NOTE: RDS main data table will have max 120-150 UID records with 6-7 columns
To help better visualize this see img below.
From a database perspective, what would be the quickest way to write my data to RDS.
Assuming we have during peak times 20-40 records/s from the api-m1 + another 20-40 records/s from the scraper? After each day I tear down the database using a lambda function and start again (as the data is only temporary and does not need to be saved for any prolonged period of time).
1.Should I INSERT each record using a SERIAL id, then from the frontend fetch the most recent rows based off of the uid?
2.a Should I UPDATE each UID so i'd have a fixed N rows of data which I just search and update? (I can see this bottlenecking with my Postgres client.
2.b Still use UPDATE but do BATCHED updates (what issues will I run into if I make multiple clusters i.e will I run into concurrency problems where table record XYZ will have an older value overwrite a more recent value because i'm using BATCH UPDATE with Node Clusters?
My concern is UPDATES are slower than INSERTS and I don't want to make it as fast as possible. This section of the application isn't CPU heavy, and the rt-data isn't that intensive.
To make my comments an answer:
You don't seem to need SQL semantics for anything here, so I'd just toss RDS and use e.g. Redis (or DynamoDB, I guess) for that data store.
I am trying to write some synchronization code for a java app that runs on each of the cassandra servers in our cluster (so each server has 1 cassandra instance + our app). For this I wanted to make a method that will return the 'local' cassandra node, using the java driver.
Every process creates a cqlSession using the local address as contactPoint. The driver will figure out the rest of the cluster from that. But my assumption was that the local address would be its 'primary' node, at least for requesting things from the system.local table. This seems not so, when trying to run the code.
Is there a way in the Java driver to determine which of the x nodes the process its running on?
I tried this code:
public static Node getLocalNode(CqlSession cqlSession) {
Metadata metadata = cqlSession.getMetadata();
Map<UUID, Node> allNodes = metadata.getNodes();
Row row = cqlSession.execute("SELECT host_id FROM system.local").one();
UUID localUUID = row.getUuid("host_id");
Node localNode = null;
for (Node node : allNodes.values()) {
if (node.getHostId().equals(localUUID)) {
localNode = node;
break;
}
}
return localNode;
}
But it seems to return random nodes - which makes sense if it just sends the query to one of the nodes in the cluster. I was hoping to find a way without providing hardcoded configuration to determine what node the app is running on.
my assumption was that the local address would be its 'primary' node, at least for requesting things from the system.local table. This seems not so, when trying to run the code.
Correct. When running a query where token range ownership cannot be determined, a coordinator is "selected." There is a random component to that selection. But it does take things like network distance and resource utilization into account.
I'm going to advise reading the driver documentation on Load Balancing. This does a great job of explaining how the load balancing policies work with the newer drivers (>= 4.10).
In that doc you will find that query routing plans:
are different for each query, in order to balance the load across the cluster;
only contain nodes that are known to be able to process queries, i.e. neither ignored nor down;
favor local nodes over remote ones.
As far as being able to tell which apps are connected to which nodes, try using the execution information returned by the result set. You should be able to get the coordinator's endpoint and hostId that way.
ResultSet rs = session.execute("select host_id from system.local");
Row row = rs.one();
System.out.println(row.getUuid("host_id"));
System.out.println();
System.out.println(rs.getExecutionInfo().getCoordinator());
Output:
9788de64-08ee-4ab6-86a6-fdf387a9e4a2
Node(endPoint=/127.0.0.1:9042, hostId=9788de64-08ee-4ab6-86a6-fdf387a9e4a2, hashCode=2625653a)
You are correct. The Java driver connects to random nodes by design.
The Cassandra drivers (including the Java driver) are configured with a load-balancing policy (LBP) which determine which nodes the driver contacts and in which order when it runs a query against the cluster.
In your case, you didn't configure a load-balancing policy so it defaults to the DefaultLoadBalancingPolicy. The default policy calculates a query plan (list of nodes to contact) for every single query so each plan is different across queries.
The default policy gets a list of available nodes (down or unresponsive nodes are not included in the query plan) that will "prioritise" query replicas (replicas which own the data) in the local DC over non-replicas meaning replicas will be contacted as coordinators before other nodes. If there are 2 or more replicas available, they are ordered based on "healthiest" first. Also, the list in the query plan are shuffled around for randomness so the driver avoids contacting the same node(s) all the time.
Hopefully this clarifies why your app doesn't always hit the "local" node. For more details on how it works, see Load balancing with the Java driver.
I gather from your post that you want to circumvent the built-in load-balancing behaviour of the driver. It seems like you have a very edge case that I haven't come across and I'm not sure what outcome you're after. If you tell us what problem you are trying to solve, we might be able to provide a better answer. Cheers!
In the description of SmartGraphs here it seems to imply that graph traversal queries actually follow edges from machine to machine until the query finishes executing. Is that how it actually works? For example, suppose that you have the following query that retrieves 1-hop, 2-hop, and 3-hop friends starting from the person with id 12345:
FOR p IN Person
FILTER p._key == 12345
FOR friend IN 1..3 OUTBOUND p knows
RETURN friend
Can someone please walk me through the lifetime of this query starting from the client and ending with the results on the client?
what actually happens can be a bit different compared to the schemas on our website. What we show there is kind of a "worst case" where the data can not be sharded perfectly (just to make it a bit more fun). But let's take a quick step back first to describe the different roles within an ArangoDB cluster. If you are already aware of our cluster lingo/architecture, please skip the next paragraph.
You have the coordinator which, as the name says, coordinates the query execution and is also the place where the final result set gets built up to send it back to the client. Coordinators are stateless, host a query engine and is are the place where Foxx services live. The actual data is stored on the DBservers in a stateful fashion but DBservers also have a distributed query engine which plays a vital role in all our distributed query processing. The brain of the cluster is the agency with at least three agents running the RAFT consensus protocol.
When you sharded your graph data set as a SmartGraph, then the following happens when a query is being sent to a Coordinator.
- The Coordinator knows which data needed for the query resides on which machine
and distributes the query accordingly to the respective DBservers.
- Each DBserver has its own query engine and processes the incoming query from the Coordinator locally and then sends the intermediate result back to the coordinator where the final result set gets put together. This runs in parallel.
- The Coordinator sends then result back to the client.
In case you have a perfectly shardable graph (e.g. a hierarchy with its branches being the shards //Use Case could be e.g. Bill of Materials or Network Analytics) then you can achieve the performance close to a single instance because queries can be sent to the right DBservers and no network hops are required.
If you have a much more "unstructured" graph like a social network where connections can occur among any two given vertices, sharding becomes an optimization question and, depending on the query, it is more likely that network hops between servers occur. This latter case is shown in the schemas on our website. In his case, the SmartGraph feature can minimize the network hops needed to a minimum but not completely.
Hope this helped a bit.
I have neo4j instance run on 10$/mo digitalocean vps:
1GB / 1 CPU,
30GB SSD DISK,
2TB TRANSFER
I'm using awesome node library seraph. But while I tried to save a bit of data (about 10 properties per node) - (around 100 root nodes + 3-5 related nodes for each) after about half of it I start to get econnreset, missing ids and stuff. I guess it's because of wrong configuration/not enough resources.
So how to check what is wrong? What kind of logs to read?
I actually found this issue, and added this to configuration:
neostore.relationshipgroupstore.db.mapped_memory=10M
neostore.nodestore.db.mapped_memory=250M
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M
and there are less errors, but still there is a lot of stuff that is lost due to econnreset.
What should I change? Thank you very much!
Use Neo4j 2.2.2
Configure in neo4j.properties dbms.pagecache.memory=250M
Configure in neo4j-wrapper.conf the wrapper.java.maxmemory=512
I'm not sure if seraph uses the REST API to create data or Cyphers remote endpoint, the latter would be preferred.
Please provide your code !!
I presume you can do your whole operation with 500 cypher statements each parametrized with your 10 properties passed to neo as map parameter that you can assign directly.
I am having NodeJS service which talks to couchbase cluster to fetch the data. The couchbase cluster has 4 nodes(running on ip1, ip2, ip3, ip4) and service also is running on same 4 servers. On all the NodeJS services my connection string looks like this:
couchbase://ip1,ip2,ip3,ip4
but whenever I try to fetch some document from bucket X, console shows node on ip4 is doing that operation. No matter which NodeJS application is making request the same ip4 is serving all the request.
I want each NodeJS server to use their couchbase node so that RAM and CPU consumption on all the servers are equal so I changed the order of IPs in connection string but every time request is being served by same ip4.
I created another bucket and put my data in it and try to fetch it but again it went to same ip4. Can someone explain why is this happening and can it cause high load on one of the node?
What do you mean by "I want each NodeJS server to use their couchbase node"?
In Couchbase, part of the active dataset is on each node in the cluster. The sharding is automatic. When you have a cluster, the 1024 active vBuckets (shards) for each Bucket are spread out across all the nodes of the cluster. So with your 4 nodes, there will be 256 vBuckets on each node. Given the consistent hashing algorithm used by the Couchbase SDK, it will be able to tell from the key which vBucket that object goes into and combined with the cluster map it got from the cluster, know which node that vBucket lives in the cluster. So an app will be getting data from each of the nodes in the cluster if you have it configured correctly as the data is evenly spread out.
On the files system there will be as part of the Couchbase install a CLI tool call vbuckettool that takes an objectID and clustermap as arguments. All it does is the consistent hashing algorithm + the clustermap. So you can actually predict where an object will go even if it does not exist yet.
On a different note, best practice in production is to not run your application on the same nodes as Couchbase. It really is supposed to be separate to get the most out of its shared nothing architecture among other reasons.