Pagination in Gremlin

I'm trying to do pagination in Gremlin. I've followed the solution provided in the Gremlin recipes, so right now I am using the range() step to truncate results. This works well because when a user asks for x results it only queries for x of them rather than performing a full search, which is significantly faster.
However, the Gremlin docs state that:
A Traversal’s result are never ordered unless explicitly by means of order()-step. Thus, never rely on the iteration order between TinkerPop3 releases and even within a release (as traversal optimizations may alter the flow).
And order is really important for pagination (we don't want a user to see the same results on different pages).
Adding order() would really slow down querying, since the traversal would have to fetch all vertices that match the search and only then truncate them with range().
Any ideas how this can be solved so that results are consistent between queries, while still keeping the shorter query time of a partial search?

The documentation is correct, but could use some additional context. Without order() you aren't guaranteed a specific order, unless the underlying graph database guarantees that order. TinkerPop should preserve that guarantee if it exists. So, ultimately you need to consider what the underlying graph will do.
As you are using JanusGraph, I'm pretty sure that it does not guarantee order of result iteration but there are caveats depending on the type of indexing that you do. You can read more about that here but may want to ask specific questions on the JanusGraph user list.
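For illustration, the order-then-range pattern looks roughly like this with gremlin-javascript; the 'person' label, the 'name' property and the endpoint are made-up examples, not from the question. Without the order() step the page boundaries are only stable if the underlying graph happens to guarantee an iteration order.

// Minimal sketch, gremlin-javascript; names and endpoint are assumptions.
const gremlin = require('gremlin');
const traversal = gremlin.process.AnonymousTraversalSource.traversal;
const { DriverRemoteConnection } = gremlin.driver;

const g = traversal().withRemote(
  new DriverRemoteConnection('ws://localhost:8182/gremlin'));

async function page(pageNumber, pageSize) {
  const low = pageNumber * pageSize;
  // order() makes the page boundaries deterministic; range() then truncates,
  // at the cost of materializing everything that matches before the cut.
  return g.V().hasLabel('person')
    .order().by('name')
    .range(low, low + pageSize)
    .values('name')
    .toList();
}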

Related

MongoDB most efficient Query Strategy

Let me say up front that I have already looked through the Mongo documentation, but I have not found what I am looking for. I've also read similar questions, but they always talk about very simple queries. I'm working with Node's native Mongo driver. This is a scalability problem, so the collections I am talking about can have millions of records or just a few dozen.
Basically I have a query and I need to validate all results (which have a complex structure). Two possible solutions come to mind:
I create a query as specific as possible and try to validate the result directly on the server
I use the cursor to go through the documents one by one from the client (this would also allow me to stop if I am looking for only one result)
Here is the question: what is the most efficient way, in terms of latency, overall time, bandwidth use and computational load on server and client? There is probably no single answer; in fact, I'd like to understand the pros and cons of the different approaches (and which approach you would recommend). I know the solution should be determined on a case-by-case basis, but I am trying to figure out what would best cover most cases.
Also, to be more specific:
A) Since the query is complex (several nested objects with ranges of values and lists of allowed values), performing the validation on the server would certainly save bandwidth, but is it always possible? And in terms of computation, could it be more efficient to do it on the client?
B) I don't understand the cursor behaviour: is it a continuously open stream until it is closed by the server/client? Also, does the result of next() already take up resources on the server/client, or is it only fetched when the call is made?
If anyone knows, I'd also like to know how Mongoose solved these "problems", for example in the case of custom validators.
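To make the two options concrete, here is a rough sketch with the Node.js native driver; the database, collection, field names and the validate() helper are all hypothetical.

const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const coll = client.db('test').collection('items');   // placeholder names

  // Option 1: push as much of the validation as possible into the query,
  // so the server only returns matching documents.
  const matches = await coll.find({
    'config.level': { $gte: 2, $lte: 5 },     // hypothetical nested range
    'config.mode': { $in: ['a', 'b'] }        // hypothetical list of allowed values
  }).toArray();
  console.log(matches.length);

  // Option 2: stream documents with a cursor and validate on the client;
  // this allows stopping early once one acceptable document is found.
  const cursor = coll.find({ 'config.level': { $gte: 2 } });
  for await (const doc of cursor) {
    if (validate(doc)) { break; }   // validate() is application code
  }

  await client.close();
}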

"ignoreDocumentNotFound", "readCompleteInput" query options in explain

When I use "explain" on an insert query I get two query options that seem not to be documented:
ignoreDocumentNotFound
readCompleteInput
What are these options for and what do they do?
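For context, the kind of call that produces such output looks roughly like this in arangosh; the collection name is made up.

// arangosh sketch; 'docs' is a placeholder collection name.
db._explain("INSERT { value: 1 } INTO docs");
// Among the printed plan and option details you may see flags such as
// ignoreDocumentNotFound and readCompleteInput.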
Nice to see you like our db._explain() facility ;-)
To answer your question, one has to know that explain re-uses backend functionality that is also used for other purposes:
distribute AQL queries in ArangoDB clusters
analyse what the optimizer did with queries in unit tests
The latter will explain queries and check whether certain assumptions about the query plan are still valid.
The ignoreDocumentNotFound and readCompleteInput flags are intended for exactly that purpose, so the unit tests can revalidate whether certain assumptions about the query still hold.
Since they add no value for the end user, they're not documented. One could argue whether explain should hide them to avoid irritation.

Using Cassandra to store immutable data?

We're investigating options to store and read a lot of immutable data (events) and I'd like some feedback on whether Cassandra would be a good fit.
Requirements:
We need to store about 10 events per second (but the rate will increase). Each event is small, about 1 KB.
A really important requirement is that we need to be able to replay all events in order. For us it would be fine to read all data in insertion order (like a table scan), so an explicit sort might not be necessary.
Querying the data in any other way is not a prime concern, and since Cassandra is a schema-based db I don't suppose that's even possible when the events come in many different forms? Would Cassandra be a good fit for this? If so, is there something one should be aware of?
I had the exact same requirements for a "project" (rather, a tool) a year ago; I used Cassandra and I didn't regret it. In general it fits very well. You can fit quite a lot of data into a Cassandra cluster, the performance is impressive (although you might need some tweaking), and the natural ordering is a nice thing to have.
Rather than listing the benefits of using it, I'll concentrate on possible pitfalls you might not consider before starting.
You have to think about your schema. The data is naturally ordered within one row by the clustering key; in your case that will be the timestamp. However, you cannot order data between different rows. They might come back ordered after the query, but that is not guaranteed in any way, so don't rely on it. There was some way to write such a query before 2.1, I believe (using ORDER BY, disabling paging and allowing filtering), but it performed badly and I don't think it is even possible now. So you should order data between rows on the querying side.
This might be an issue if you have multiple variable types (such as temperature and pressure) that have to be replayed at the same time and you put them in different rows. You have to fetch the rows with the different variable types and then do your re-sorting on the querying side. Another way to do it is to put all variable types in one row, but then filtering for only a subset becomes an issue to solve.
Row length is limited to 2 billion elements, and although that seems like a lot, it really is not unreachable with time-series data. Especially so because you don't want to get near those two billion; keep it in the hundreds of millions at most. If you introduce some parameter on which you split the rows (an increasing index, or rounding by day/month/year), you will have to implement that in your query logic as well (see the sketch below).
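A rough sketch of that kind of day-bucketed schema, using the Node.js cassandra-driver; the keyspace, table and column names are all made up.

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'dc1',
  keyspace: 'events_ks'          // hypothetical keyspace
});

// One partition per day keeps partitions far below the ~2 billion cell limit;
// within a partition, rows are stored and read back in timestamp order.
const ddl = `
  CREATE TABLE IF NOT EXISTS events_by_day (
    day text,
    ts timestamp,
    event_id timeuuid,
    payload blob,
    PRIMARY KEY ((day), ts, event_id)
  ) WITH CLUSTERING ORDER BY (ts ASC, event_id ASC)`;

async function createTable() {
  await client.execute(ddl);
}

async function replayDay(day) {
  // Replaying one day is a single-partition scan in clustering (time) order;
  // replaying a longer span means iterating the day buckets in query logic.
  const rs = await client.execute(
    'SELECT ts, payload FROM events_by_day WHERE day = ?', [day], { prepare: true });
  return rs.rows;
}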
Experiment with your queries first on a dummy example. You cannot arbitrarily use <, > or = in queries; there are specific rules in CQL about filtering and the WHERE clause.
All in all these things might seem important, but they are really not too much of a hassle once you get to know Cassandra a bit. I'm underlining them just to give you a heads-up. If something seems illogical at first, fall back to understanding why it is that way and to the general theory about data distribution and the ring topology.
Don't expect too much from collections within the columns; their length is limited to ~65,000 elements.
Don't fall into the misconception that batched statements are faster (this one is a classic :) )
Based on the requirements you describe, Cassandra could be a good fit, as it's a write-optimized data store. Time series are quite a common pattern, and you can define a clustering order, for example on the timestamp of the events, in order to retrieve all the events in time order. I found this article on DataStax Academy very useful when I wanted to learn about time series.
A variable data structure is not a problem: you can store the data in a BLOB and then parse it in your application (i.e. store it as JSON and read it into your model), or you could even store the data in a map, although collections in Cassandra have some caveats it's good to be aware of. Here you can find docs about collections in Cassandra 2.0/2.1.
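A minimal sketch of the JSON-in-a-blob approach (Node.js cassandra-driver; the events_by_day table and its columns are assumptions): serialize the event when writing, parse it again when reading.

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'], localDataCenter: 'dc1', keyspace: 'events_ks'
});

async function storeEvent(day, event) {
  // Store the whole event as a JSON blob; the schema doesn't need to know its shape.
  await client.execute(
    'INSERT INTO events_by_day (day, ts, event_id, payload) VALUES (?, ?, now(), ?)',
    [day, new Date(), Buffer.from(JSON.stringify(event))],
    { prepare: true });
}

async function readEvents(day) {
  const rs = await client.execute(
    'SELECT ts, payload FROM events_by_day WHERE day = ?', [day], { prepare: true });
  // Parse the blobs back into the application's own model.
  return rs.rows.map(r => JSON.parse(r.payload.toString()));
}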
Cassandra is quite different from a SQL database, and although CQL has some similarities there are fundamental differences in usage patterns. It's very important to know how Cassandra works and how to model your data in order to pursue efficiency - a great article from Datastax explains the basics of data modelling.
In a nutshell: Cassandra may be a good fit for you, but before using it take some time to understand its internals as it could be a bad beast if you use it poorly.

CouchDB a fit for a query like this?

I have been looking for an escape from GAE as the datastore does not support a lot of the things I want to do with it.
So I have looked at CouchDB (among others) and I really like the REST interface and the hosting option I found at Cloudant.
But for all my googling and reading any docs I could find, I still am not sure if it is a good fit.
So I come here in the hope that someone might have more insight.
I write web apps and a lot of the projects I want to do will involve a query that looks like this:
Find all entries that are within a user-input-lat/long bounding box and where start-time is less than user-input-time-1 and end-time is greater than user-input-time-2 and has all tags in user-input-list-of-tags.
That's not even pseudocode, but I hope it makes sense anyway.
I am not just looking for a "You cannot do that in CouchDB". Some kind of explanation and perhaps something like "If you can live without the tags then you can do this:"
I would like to use the Cloudant service, so GeoCouch is apparently out of the question, but they offer something that should work like Lucene; does that mean the queries are slow?
As you can tell, I am a bit confused here, so just do your best to straighten me out and I'll be grateful :)
Setting aside the tags (which are a problem in themselves), what you describe is a multi-dimensional query: you have several "coordinates" (lat, long, start-time, end-time) and provide a range for each of these coordinates.
On its own, CouchDB cannot perform multi-dimensional queries at all — you only get single-dimension queries across one coordinate.
Tags certainly are possible, but it depends on whether you need documents that have at least one tag in the list, or documents that have all tags in the list. The first case is easy (run one query per tag using the bulk API), the second might require excessive amounts of memory: if a document has N tags, it needs to emit 2^N − 1 tag-sets (one for every non-empty subset of its tags) in order to match all possible tag combinations involving it, so you should place an upper bound on either the number of tags in a document or the number of tags in a query.
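A sketch of the "at least one tag" case with a simple view; the design document and view names are invented.

// Map function for a view keyed by tag: one row per (tag, document) pair.
function (doc) {
  if (Array.isArray(doc.tags)) {
    doc.tags.forEach(function (tag) {
      emit(tag, null);
    });
  }
}

Querying that view with keys=["tag1","tag2",...] (or one request per tag) and unioning the document ids on the client gives the "at least one tag" behaviour; the "all tags" case is where the subset explosion described above comes in.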
Lucene does allow multi-dimensional and keyword-based queries, though I cannot vouch for their performance.

Transform MongoDB Data on Find

Is it possible to transform the returned data from a Find query in MongoDB?
As an example, I have a first and last field to store a user's first and last name. In certain queries, I wish to return the first name and last initial only (e.g. 'Joe Smith' returned as 'Joe S'). In MySQL a SUBSTRING() function could be used on the field in the SELECT statement.
Are there data transformations or string functions in Mongo like there are in SQL? If so can you please provide an example of usage. If not, is there a proposed method of transforming the data aside from looping through the returned object?
It is possible to do just about anything server-side with MongoDB. The reason you will usually hear "no" is that you sacrifice too much speed for it to make sense under ordinary circumstances. One of the main forces behind PyMongo, Mike Dirolf of 10gen, has a good blog post on using server-side JavaScript with PyMongo here: http://dirolf.com/2010/04/05/stored-javascript-in-mongodb-and-pymongo.html. His example stores a JavaScript function that returns the sum of two fields, but you could easily modify it to return the first letter of your user name field. The gist would be something like:
db.system_js.first_letter = "function (x) { return x.charAt(0); }"
Understand first, though, that MongoDB is made to be really good at retrieving your data, not really good at processing it. The recommendation (see for example 50 Tips and Tricks for MongoDB Developers by Kristina Chodorow, from O'Reilly) is to do what Andrew tersely alluded to above: store a first-letter field and return that instead. Any processing can be done more efficiently in the application.
But if you feel that even querying for the full name before returning fullname[0] from your 'view' is too much of a security risk, you don't need to do everything the fastest possible way. I'd avoided map-reduce in MongoDB for a while because of all the public concerns about speed. Then I ran my first map-reduce and twiddled my thumbs for 0.1 seconds as it processed 80,000 10 KB documents. I realize that, in the scheme of things, that's tiny. But it illustrates that just because a massive website can't afford a performance hit from some server-side processing doesn't mean it would matter to you. In my case, I imagine it would take me slightly longer to migrate to Hadoop than to just eat that 0.1 seconds every now and then. Good luck with your site.
The question you should ask yourself is why you need that data. If you need it for display purposes, do that in your view code. If you need it for query purposes, then do as Andrew suggested, and store it as an extra field on the object. Mongo doesn't provide server-side transformations (usually, and where it does, you usually don't want to use them); the answer is usually to not treat your data as you would in a relational DB, but to use the more flexible nature of the data store to pre-bake your data into the formats that you're going to be using.
If you can provide more information on how this data should be used, then we might be able to answer a little more usefully.
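As a rough sketch of the "pre-bake" approach described above (mongo shell; the collection and field names are made up), the derived form is written once and reads become a plain projection:

// Compute the display form in application code at write time...
db.users.insertOne({
  first: 'Joe',
  last: 'Smith',
  displayName: 'Joe S'   // derived from first name + last initial
});

// ...so reads need no server-side transformation, only a projection.
db.users.find({}, { _id: 0, displayName: 1 });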
