I am a newbie trying to get going with ArangoDB. I want to run a batch of AQL queries that are interdependent on each other, much like what we do in PL/SQL. I tried clubbing two or more queries into one POST/GET request through Foxx, but it didn't work. Can someone suggest a better way to do this, or a tutorial for it?
It all depends on which client is accessing the database.
For example, we use Java and the Java driver to access ArangoDB. With it you can either make a transaction call or run an AQL query followed by further AQL queries.
The question is: if the AQL queries are interdependent on each other, why would you run them in one request? How would you get the results of each one?
Take a look at the Gremlin language (a graph query language): it uses WebSockets, and the result of each query is returned in binary over the WebSocket, so batching such queries wouldn't make sense. (Just a note: ArangoDB also has a provider for the Gremlin API.)
I expect you are accessing ArangoDB over HTTP and are now trying to save HTTP requests. If that is the case, I would recommend writing your own API layer that exposes an interface where you can batch the requests. Under the hood, that API layer would make two calls to ArangoDB (e.g. in parallel), collect the results, and merge them into the final output.
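Since the question mentions Foxx: two interdependent AQL queries can also be run server-side inside a single Foxx route. A minimal sketch, assuming hypothetical "users" and "orders" collections and a "userId" attribute:

'use strict';
// Hypothetical Foxx route: two dependent AQL queries served by one HTTP request.
const createRouter = require('@arangodb/foxx/router');
const { db, aql } = require('@arangodb');

const router = createRouter();
module.context.use(router);

router.get('/user-with-orders/:key', function (req, res) {
  // First query: look up the user by key
  const user = db._query(aql`
    FOR u IN users
      FILTER u._key == ${req.pathParams.key}
      RETURN u
  `).toArray()[0];
  if (!user) {
    res.throw(404, 'user not found');
  }

  // Second query depends on the result of the first
  const orders = db._query(aql`
    FOR o IN orders
      FILTER o.userId == ${user._id}
      RETURN o
  `).toArray();

  // Both results come back in a single response
  res.json({ user: user, orders: orders });
});

Because Foxx code runs inside the database server, both queries execute within one request; if they must succeed or fail together, they can additionally be wrapped in db._executeTransaction.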
Related
I'm talking to Cosmos DB via the (SQL) REST API, so existing questions that refer to various SDKs are of limited use.
When I run a simple query on a partitioned container, like
select value count(1) from foo
I run into an HTTP 400 error:
The provided cross partition query can not be directly served by the gateway. This is a first chance (internal) exception that all newer clients will know how to handle gracefully. This exception is traced, but unless you see it bubble up as an exception (which only happens on older SDK clients), then you can safely ignore this message.
How can I get rid of this error? Is it a matter of running separate queries by partition key? If so, would I have to keep track of what the existing key values are?
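For reference, this is roughly how the query is shaped against the SQL REST API when cross-partition execution is requested (a sketch only, not a guaranteed fix for the 400; the account, database and collection names are placeholders, Node 18+'s global fetch is assumed, and building the signed Authorization header is omitted):

async function runQuery() {
  const res = await fetch('https://myaccount.documents.azure.com/dbs/mydb/colls/foo/docs', {
    method: 'POST',
    headers: {
      'Authorization': '<signed master-key token>',          // HMAC signature, omitted here
      'x-ms-date': new Date().toUTCString(),
      'x-ms-version': '2018-12-31',
      'Content-Type': 'application/query+json',
      'x-ms-documentdb-isquery': 'true',
      // ask the service to fan the query out across partitions
      'x-ms-documentdb-query-enablecrosspartition': 'true',
      'x-ms-max-item-count': '100'
    },
    body: JSON.stringify({ query: 'select value count(1) from foo', parameters: [] })
  });
  console.log(res.status, await res.json());
}

runQuery().catch(console.error);

If the gateway still refuses to aggregate across partitions, the fallback the question hints at is to run the query per partition key range (or per known partition key value) and sum the partial counts yourself.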
I have a Spark dataframe that I need to send as the body of an HTTP POST request. The storage system is Apache Solr; we create the Spark dataframe by reading a Solr collection. I can use the Jackson library to create JSON and send it over HTTP POST. Also, the dataframe may have millions of records, so the preferred way is to send them in batches over HTTP.
Below are the two approaches I can think of.
We can use the foreach/foreachPartition operations of the Spark dataframe and call HTTP POST from there, which means the HTTP call will happen within each executor (if I am not wrong). Is this approach right? Also, does it mean that if I have 3 executors, there will be 3 HTTP calls made in parallel? But won't opening and closing the HTTP connection so many times cause issues?
After getting the Spark dataframe, we can save it to some other Solr collection (using Spark), and the data from that collection can then be read in batches using the Solr API (using the rows and start parameters); we create JSON out of each batch and send it over an HTTP request.
I would like to know which of the above two approaches is preferred.
Of your two approaches, the second one is best, since you get a paging feature with solrj:
1) Save your dataframe as Solr documents, with indexes.
2) Use the solrj API to interact with your Solr collections; it will return Solr documents based on your criteria.
3) Convert those documents to JSON using any parser and present them to UIs or user queries.
In fact, this is not a new approach: people who use HBase with Solr do the same thing (since querying HBase is really slow compared to querying Solr collections), where each HBase table is a Solr collection that can be queried via solrj and presented to dashboards, e.g. ones built with AngularJS.
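The batching loop itself is simple. A minimal sketch using Solr's HTTP select endpoint with rows/start instead of solrj (collection name, query, batch size and the downstream endpoint are assumptions; Node 18+'s global fetch is assumed):

const SOLR = 'http://localhost:8983/solr/mycollection/select';
const TARGET = 'https://example.com/ingest';   // hypothetical downstream API
const BATCH = 1000;

async function relayInBatches() {
  for (let start = 0; ; start += BATCH) {
    // rows/start page through the collection one batch at a time
    const url = `${SOLR}?q=*:*&wt=json&rows=${BATCH}&start=${start}`;
    const page = await (await fetch(url)).json();
    const docs = page.response.docs;
    if (docs.length === 0) break;              // no more documents

    // forward this batch as a single JSON POST
    await fetch(TARGET, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(docs)
    });
  }
}

relayInBatches().catch(console.error);

For very deep paging, Solr's cursorMark parameter scales better than large start offsets.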
I am creating a REST API in Node.js that connects to MongoDB, runs a MapReduce, and stores the results in a different collection.
The code is pretty simple. It takes a user ID, gets all other users who are related to this user according to some algorithm, and then calculates a likeness percentage for each one. With 50k users in the test database, this MapReduce takes around 200-800 ms, which is ideal for me. If this were to get popular and receive hundreds of concurrent requests like this, I'm pretty sure that would no longer be the case. I understand that MongoDB might need to be sharded as needed.
The other scenario is to just do a normal find(), loop over the cursor, and run the same logic. It takes the same amount of time as the MapReduce, mind you. However, I thought about this as a way to put the heavy lifting of the calculations on the client side (Node.js) rather than on the server side like MapReduce. Does this idea even have merit? I thought that this way I could scale the APIs horizontally behind a load balancer or something.
It would be better to keep the heavy lifting off of the server that processes each request and put it onto the database.
If you have 1000 requests and 200 of them require you to perform the calculation, the other 800 requests can be processed as normal by the server, so long as Mongo does the calculation with mapReduce or aggregation.
If you instead run the calculations manually on your node server, all requests will be affected by the server having to do the heavy lifting.
Mongo is also quite efficient at aggregation for sure, and I would imagine at mapReduce as well.
I recently moved a ton of logic from my server onto MongoDB where I could, and it made a world of difference.
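As an illustration of pushing the work into the database, here is a minimal sketch (not the poster's actual algorithm) that computes a likeness score inside MongoDB and writes the results to a separate collection with $out; the collection and field names ("users", "similarity", "interests", "relatedTo") are assumptions:

const { MongoClient } = require('mongodb');

async function computeLikeness(userId) {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  try {
    const users = client.db('app').collection('users');
    const me = await users.findOne({ _id: userId });

    await users.aggregate([
      // users related to the given user (placeholder criterion)
      { $match: { relatedTo: userId } },
      // likeness = shared interests / my total interests, computed by the DB
      { $project: {
          userId: '$_id',
          likeness: {
            $divide: [
              { $size: { $setIntersection: ['$interests', me.interests] } },
              me.interests.length
            ]
          }
      } },
      // persist the results into a separate collection
      { $out: 'similarity' }
    ]).toArray();   // the pipeline only runs when the cursor is consumed
  } finally {
    await client.close();
  }
}

The Node.js process only dispatches the pipeline; the scanning and scoring happen on the MongoDB server, which is the point made above.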
I went through this article, and the following passage raised a question:
QUEUED INPUTS: If you're receiving a high amount of concurrent data, your database can become a bottleneck. As depicted above, Node.js can easily handle the concurrent connections themselves. But because database access is a blocking operation (in this case), we run into trouble.
Isn't DB access an asynchronous operation in Node.js? For example, I usually perform all possible data transformations using MongoDB aggregation to minimize the impact on Node.js. Or am I getting things wrong?
That is why callbacks came into the picture. That is their actual use, since we don't know how much time the DB will take to process the aggregation. DB access is asynchronous precisely because of callbacks.
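To make the distinction concrete, here is a minimal sketch (collection and field names are assumptions) showing that the Node.js event loop is not blocked while MongoDB processes an aggregation; the result is simply delivered later via the promise/callback:

const { MongoClient } = require('mongodb');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const orders = client.db('shop').collection('orders');

  // Dispatch the aggregation; the event loop stays free while MongoDB works.
  const totalsPromise = orders.aggregate([
    { $group: { _id: '$customerId', total: { $sum: '$amount' } } }
  ]).toArray();

  console.log('aggregation dispatched, Node.js can serve other requests');

  const totals = await totalsPromise;   // result arrives once MongoDB is done
  console.log(totals.length, 'customers aggregated');

  await client.close();
}

main().catch(console.error);

The database itself can still be the bottleneck the article warns about, but Node.js is not stuck waiting in the meantime.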
We are using DynamoDB with Node.js and Express to create REST APIs. We have decided to go with Dynamo on the backend for simplicity of operations.
We have started to use the DynamoDB Document SDK from AWS Labs to simplify usage, and make it easy to work with JSON documents. To instantiate a client to use, we need to do the following:
// Load the AWS SDK and the AWS Labs document SDK
var AWS = require('aws-sdk');
var Doc = require('dynamodb-doc');

// Low-level DynamoDB client, then the document-style wrapper around it
var Dynamodb = new AWS.DynamoDB();
var DocClient = new Doc.DynamoDB(Dynamodb);
My question is: where do those last two steps need to take place in order to ensure data integrity? I'm concerned about an object that is waiting for something to happen in Dynamo being taken over by another process and getting its data swapped, resulting in incorrect data being sent back to a client, or incorrect data being written to the database.
We have three parts to our REST API. There is the main server.js file, which starts Express and the HTTP server, assigns resources to it, sets up logging, etc. We do the first two steps of creating the connection to Dynamo, the AWS and Doc requires, at that point, and those vars are global in the app. Then, depending on the route being followed through the API, we call a controller that parses the input from the REST call. It in turn calls a model file, which does the interacting with Dynamo and provides the response back to the controller, which formats the return package along with any errors and sends it to the client. The model is simply a group of methods that cover the same area of the app. We would have a user model, for instance, that covers things like login and account creation in an app.
I have done the last two steps above, creating the Dynamo objects, in two places. In one variant, I simply place them in one spot, at the top of each model file; I do not re-instantiate them in the methods below, I simply use them. In the other, I instantiate them within the methods, when we are preparing to make the call to Dynamo, making them entirely local to the method and passing them to a secondary function if needed. This second method has always struck me as the safest way to do it. However, under load testing I have run into situations where we seem to have overwhelmed the outgoing network connections, and I start getting errors telling me that the DynamoDB endpoint is unavailable in the region I'm running in. I believe this comes from the additional calls required to make the connections.
So, the question is: is creating those objects local to the model file safe, or do they need to be created locally in the method that uses them? Any thoughts would be much appreciated.
You should be safe creating just one instance of those clients and sharing them in your code, but that isn't related to your underlying concern.
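One common way to do that in Node.js is to create the clients once in their own module; since require() caches modules, every model file that requires it gets the same instances. A sketch (the file name dynamo.js is hypothetical):

// dynamo.js -- instantiate the clients exactly once per process
var AWS = require('aws-sdk');
var Doc = require('dynamodb-doc');

var dynamodb = new AWS.DynamoDB();
var docClient = new Doc.DynamoDB(dynamodb);

module.exports = docClient;

// in any model file:
// var docClient = require('./dynamo');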
Concurrent access to various records in DynamoDB is still something you have to deal with. It is possible to have different requests attempt writes to the object at the same time. This is possible if you have concurrent requests on a single server, but is especially true when you have multiple servers.
Writes to DynamoDB are atomic only at the level of an individual item. This means that if your logic requires multiple updates to separate items, potentially in separate tables, there is no way to guarantee that all or none of those changes are made; it is possible that only some of them are made.
DynamoDB natively supports conditional writes, so it is possible to ensure specific conditions are met (for example, that specific attributes still have certain values); otherwise the write will fail.
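For illustration, here is a conditional write sketch using the DocumentClient that ships with aws-sdk (table and attribute names are assumptions; the same condition can be expressed through the client instantiated in the question):

var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient();

var params = {
  TableName: 'Users',
  Item: { userId: 'u-123', email: 'new@example.com', version: 4 },
  // Only write if the stored version is still the one we read earlier,
  // so a concurrent writer cannot silently overwrite our change.
  ConditionExpression: '#v = :expected',
  ExpressionAttributeNames: { '#v': 'version' },
  ExpressionAttributeValues: { ':expected': 3 }
};

docClient.put(params, function (err, data) {
  if (err && err.code === 'ConditionalCheckFailedException') {
    // Another request changed the item first: re-read and retry, or report a conflict.
    console.error('conditional write failed, item was modified concurrently');
  } else if (err) {
    console.error('put failed', err);
  } else {
    console.log('conditional write succeeded');
  }
});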
With respect to making too many requests to DynamoDB: unless you are overwhelming your own machine, there shouldn't be any way to overwhelm the DynamoDB API. If you are performing more reads/writes than you have provisioned, you will receive errors indicating that the provisioned throughput has been exceeded, but the API itself still functions as intended under those conditions.