Fastest Search and Insertion

I've got a million objects. What is the fastest way to look up a particular object with its name as the key, and also the fastest way to perform insertion? Would hashing be sufficient?

Probably a hash table, assuming you don't need anything other than key-based access. Make sure that the hashing of the key is good enough (to minimise collisions) and that the table is large enough (for the same reason).
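A minimal sketch of the hash-table approach in Java, assuming the objects are looked up by a string name (the Item record and the sizing numbers are purely illustrative):
import java.util.HashMap;
import java.util.Map;

public class NameIndex {
    // Hypothetical record standing in for the stored object.
    record Item(String name, String payload) {}

    public static void main(String[] args) {
        // A HashMap gives expected O(1) insertion and lookup when the hash
        // function spreads keys well and the table starts large enough to
        // keep collisions (and rehashing) rare.
        Map<String, Item> byName = new HashMap<>(2_000_000);

        byName.put("alice", new Item("alice", "payload-1"));
        byName.put("bob", new Item("bob", "payload-2"));

        Item hit = byName.get("alice"); // expected O(1) lookup by name
        System.out.println(hit);
    }
}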

It will depend on how often you need to do a lookup and how often you need to insert elements.
If you often have to insert elements, then a linked list would perform better.
If you often have to search for elements, a hash table is more efficient. Perhaps you can have both: your main data as a linked list, and a hash table that serves as an index into the list.
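In Java, LinkedHashMap is roughly that combination out of the box: a hash table for fast key lookup, with a doubly linked list threaded through the entries. A small sketch with made-up keys and values:
import java.util.LinkedHashMap;
import java.util.Map;

public class ListPlusIndex {
    public static void main(String[] args) {
        // LinkedHashMap = hash table (fast key lookup) + doubly linked list
        // threaded through the entries (cheap traversal in insertion order).
        Map<String, String> items = new LinkedHashMap<>();

        items.put("key-1", "first");
        items.put("key-2", "second");
        items.put("key-3", "third");

        System.out.println(items.get("key-2"));        // hash-indexed lookup

        // Linked-list traversal in insertion order:
        items.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}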

You can also use a binary search tree. A BST has the advantage of fast search and fast insertion. Use the key to route your way through the tree and have each tree node hold the value.
Prefer a BST over a hash table if you are not sure about the balance of operations (i.e. looking up key/value pairs versus insertions, etc.) or if your analysis shows that keys may collide frequently in the hash table (which would hurt the hash table's performance).
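For instance, Java's TreeMap is a red-black tree (a self-balancing BST), so it gives O(log n) lookup and insertion without depending on hash quality. A minimal sketch with arbitrary keys:
import java.util.NavigableMap;
import java.util.TreeMap;

public class BstIndex {
    public static void main(String[] args) {
        // TreeMap is a red-black tree: guaranteed O(log n) insertion and lookup,
        // and the keys stay sorted, which also enables ordered/range queries.
        NavigableMap<String, Integer> tree = new TreeMap<>();

        tree.put("carol", 3);
        tree.put("alice", 1);
        tree.put("bob", 2);

        System.out.println(tree.get("bob"));       // O(log n) lookup -> 2
        System.out.println(tree.ceilingKey("b"));  // smallest key >= "b" -> "bob"
    }
}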

Several structures exist here that you can use. Each has its advantages and disadvantages.
A hash table will have great lookup and insertion time, provided you have a table that minimizes collisions. If not, lookup and insertion can take considerably longer.
A binary search tree has O(log n) insertion and lookup, provided that it is balanced. Sometimes the rebalancing can cause insertion to take a bit longer than O(log n), depending on the BST variant you go with.

You can also go with a B+ tree; it guarantees lower search complexity, since you reach the leaf nodes quickly (the height is log n to base k, where k is the degree of the nodes). Databases have similar requirements, and they use B+ trees to maintain and retrieve data.

Related

Are dummy partition keys always bad?

I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether. By dummy, I mean a column whose only purpose is to contain the same value for all rows, thereby putting all data on 1 node and giving the lowest possible cardinality. For example:
dummy | id | name
-------------------------
0 | 01 | 'Oliver'
0 | 02 | 'James'
0 | 03 | 'Nicholls'
The two main points in regards to why you should avoid dummy partition keys are:
1) You end up with data "hot-spots". There is a lot of data stored on 1 node so there's more traffic around that node and you have poor distribution around the cluster.
2) Partition space is finite. If you put all data on one partition, it will eventually be incapable of storing any more data.
I can understand these points and I agree that you definitely want to avoid those situations, so I put this idea out of my mind and tried to think of a good partition key for my table. The table in question stores sites and there are two common ways that table gets queried in our system. Either a single site is requested or all sites are requested.
This puts me in a bit of an awkward situation, because the table is either queried on nothing or the site ID, and making a unique field the partition key would give me very high cardinality and high latency on queries that request all sites.
So I decided that I'd just choose an arbitrary field that would give relatively low cardinality, even though it doesn't reflect how the data will actually be queried, just because it's better than having a cardinality that is either excessively high or excessively low. This approach also has problems though.
I could partition my data on column x, but we have numerous clients, all of whom use our system differently, so x for 1 client could give the results I'm after, but could give awful results for another.
At this point I'm running out of options. I need a field in my table that will be consistent for all clients, however this field doesn't exist, so I'm now considering having a new field that will contain a random number from 1-3 and then partitioning on that field, which is essentially just a dummy field. The only difference is that I want to randomise the values a little bit so as to avoid hot-spots and unbounded row growth.
I know this is a data-modelling question and it varies from system to system, and of course there are going to be situations where you have to choose the lesser of two evils (there is no perfect solution), but what I'm really focussed on with this question is:
Are dummy partition keys something that should outright never be a consideration in Cassandra, or are there situations in which they're seen as acceptable? If you think the former, then how would you approach this situation?
I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether.
I'm going to go out on a limb and guess that your search has yielded my article We Shall Have Order!, where I made my position on the use of "dummy" partition keys quite clear. Bearing that in mind, I'll try to provide some alternate solutions.
I see two potential problems to solve here. The first:
I need a field in my table that will be consistent for all clients, however this field doesn't exist
Typically this is solved by duplicating your data into another query table. That's the best way to serve multiple, varying query patterns. If you have one client (service?) that needs to query that table by site id, then you could have that table duplicated into a table called sites_by_id.
CREATE TABLE sites_by_id (
    id BIGINT,
    name TEXT,
    PRIMARY KEY (id));
The other problem is this query pattern:
all sites are requested
Another common Cassandra anti-pattern is that of unbound SELECTs (SELECT query without a WHERE clause). I am sure you understand why these are bad, as they require all nodes/partitions to be read for completion (which is probably why you are looking into a "dummy" key). But as the table supporting these types of queries increases in size, they will only get slower and slower over time...regardless of whether you execute an unbound SELECT or use a "dummy" key.
The solution here is to perform a re-examination of your data model, and business requirements. Perhaps your data can be split up into sites by region or country? Maybe your client really only needs the sites that have been updated for this year? Obtaining some more details on the client's query requirements may help you find a good partitioning key for them to use. Otherwise, if they really do need all of them all of the time, then doanduyhai's suggestion of using Spark will better fit your use case.
or all sites are requested
So basically you have a full table scan scenario. Isn't Apache Spark over Cassandra a better fit for this use case? I suspect it's an analytics use case, isn't it?
As far as I understand, you want to access a single site by its id, in which case lookup by partition key is ideal. The other use case, which requires fetching all the sites, is better suited to Spark.

Is intensive updating of a Map-type column in Cassandra an anti-pattern?

Friends,
I am modeling a table in Cassandra which contains a Map column. This Map will contain dynamic values and will be updated very frequently for that row (I will update by the primary key).
Is this an anti-pattern? What other options should I consider?
What you're trying to do is possibly what I described here.
The first big limitations that come to mind are the ones given by the specification:
64KB is the max size of an item in a collection
65536 is the max number of queryable elements inside a collection
Beyond those, there are the problems described in another post:
you cannot retrieve part of a collection: even if internally each entry of a map is stored as a column, you can only retrieve the whole collection (this can lead to very poor performance)
you have to choose between creating an index on keys or on values; both simultaneously are not supported
since maps are typed, you can't put mixed values inside: you have to represent everything as a string or as bytes and then transform your data client-side
I personally consider this approach an anti-pattern for all these reasons: it provides a schema-less solution but reduces performance and introduces lots of limitations, like the ones on secondary indexes and typing.
HTH, Carlo

Readable row key in Cassandra

We are investigating migrating a system from an RDBMS to Cassandra and are having trouble finding a way to convert an auto-increment column to Cassandra. We actually have no need for this to be sequential at all; it can even contain characters, but it must be short (ideally under 8 chars) and globally unique. An ideal value would look something like
AB123456
First part of the question is should we be generating this key in application code or in Cassandra?
Second part:
If Cassandra, how?
If application code, is it an acceptable pattern to generate a candidate code, attempt an insert, and if a collision occurs, regenerate the candidate key and retry?
The common way to do this in Cassandra is to use a uuid (or timeuuid if the IDs should be time ordered). But these must be long to get uniqueness - they are 16 bytes long. (uuids are unique because the probability of a collision is so low; timeuuids are guaranteed unique since they contain information about the generating host and include time.)
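For illustration, generating a random (type 4) UUID takes one line with the JDK; time-based (type 1) UUIDs, which Cassandra's timeuuid type corresponds to, are not generated by the JDK itself, so that part is only noted in a comment:
import java.util.UUID;

public class KeyGeneration {
    public static void main(String[] args) {
        // A type 4 (random) UUID: 16 bytes, collision probability negligible.
        UUID id = UUID.randomUUID();
        System.out.println(id);

        // A Cassandra timeuuid is a type 1 (time-based) UUID. The JDK has no
        // generator for those, so in practice you would use a driver/library
        // or let Cassandra create one server-side with the now() CQL function.
    }
}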
If you need a shorter key, you can't reliably avoid collisions by checking for the key before inserting. There will always be race conditions where this fails without external coordination. Coming in Cassandra 2.0 is compare-and-set, which will let you do this, but at a performance cost.
If you use a random 8-character string containing only numbers and letters, there are 36^8 possible keys, with collisions becoming very likely after about sqrt(36^8) ≈ 1.7 million operations. You can improve this by allowing any byte value, so there are 256^8 possible keys, with collisions likely after about sqrt(256^8) ≈ 4 billion operations. This is probably still too low, so it would be better to use longer IDs.
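A rough sketch of both the key generation and the birthday-bound arithmetic described above (the randomKey helper and the 8-character alphabet are just the values from this answer):
import java.security.SecureRandom;

public class ShortKeys {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; // 36 symbols
    private static final SecureRandom RNG = new SecureRandom();

    // Generate a random key over [A-Z0-9], e.g. an "AB123456"-style key for length 8.
    static String randomKey(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RNG.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(randomKey(8));

        // Birthday bound: collisions become likely after roughly sqrt(keyspace) keys.
        double keyspace = Math.pow(36, 8);  // ~2.8e12 possible keys
        System.out.printf("~%.0f keys before a collision is likely%n", Math.sqrt(keyspace)); // ~1,679,616
    }
}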

Which search algorithm to prefer?

The binary search algorithm has a big-O value of O(log n) and a sequential search has a big-O value of O(n). But we need a sorting algorithm before a binary search, and the best big-O value for a sorting algorithm is O(n log n). So, effectively, the big-O value for binary search is O(n log n), which is greater than that of the sequential search. So which one is preferred as a searching algorithm?
In practice it depends on how often you search. If you have to search millions of times, you want binary search, even if you have to pay the upfront cost of sorting. With binary search you also have to ensure that your inserts keep the list sorted, so inserts become slower as well.
If you need to do a lot of inserts, and very few searches, sequential search might be faster.
Keep in mind that a lot of this won't even be noticeable until you are working with a lot of data.
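For reference, both options are a few lines in Java with the standard library; the array contents here are arbitrary:
import java.util.Arrays;

public class SearchComparison {
    public static void main(String[] args) {
        int[] data = {42, 7, 19, 3, 88, 56};
        int target = 19;

        // Sequential (linear) search: O(n) per lookup, no preprocessing needed.
        int linearIndex = -1;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == target) { linearIndex = i; break; }
        }
        System.out.println("linear search found index " + linearIndex);

        // Binary search: O(log n) per lookup, but requires the O(n log n) sort first.
        Arrays.sort(data);
        int binaryIndex = Arrays.binarySearch(data, target);
        System.out.println("binary search found index " + binaryIndex);
    }
}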
Sequential search is rarely used in practice in optimised applications, because it is usually much better to pick an appropriate data structure than to use one that performs a frequently needed search in O(n).
For example, a red-black tree is a special kind of balanced binary tree which provides insert, delete, and search all in O(log n). So it is fast to create, fill, and search.

Performance of Long IDs

I've been wondering about this for some time. In CouchDB we have some fairly long IDs, e.g.:
"000ab56cb24aef9b817ac98d55695c6a"
Now if we're searching for this item and going through the tree structure created by the view, it seems a simple integer as an ID would be much faster. If we used 64-bit integers it would be a simple CMP followed by a JMP (assuming that the Erlang code was using a JIT, but you get my point).
For strings, I assume we generate a hash of the ID or something, but at some point we have to do a character compare on all 32 characters... won't that affect performance?
The short answer is, yes, of course it will affect performance, because the key length directly impacts the time it takes to walk down the tree.
It also affects storage, as longer keys take more space, and space takes time.
However, the nuance you are missing is that while Couch CAN (and does) allocate new IDs for you, it is not required to. It will be more than happy to accept your own IDs rather than generate its own. So, if the key length bothers you, you are free to use shorter keys.
However, given the "json" nature of Couch, it's pretty much a "text"-based database. There isn't a lot of binary data stored in a normal Couch instance (attachments notwithstanding, but even those I think are stored in Base64; I may be wrong).
So, while, yes, a 64-bit key would be the most efficient, the simple fact is that Couch is designed to work with any key, and "any key" is most readily expressed in text.
Finally, truth be told, the cost of the key compare is dwarfed by the disk I/O fetch times and the JSON marshaling of data (especially on writes). Any real gain achieved by converting to such a system would likely have no "real world" impact on overall performance.
If you want to really speed up the Couch key system, code the key routine to block the key into 64-bit longs and compare those (like you said). 8 bytes of text is the same as a 64-bit "long int". That would give you, in theory, an 8x performance boost on key compares. Whether Erlang can generate such code, I can't say.
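As a rough sketch of that blocking idea in Java rather than Erlang (firstBlock and compareKeys are hypothetical helpers, not CouchDB code):
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BlockedKeyCompare {
    // Pack the first 8 bytes of a key into a long (big-endian), padding short keys with zeros.
    static long firstBlock(String key) {
        byte[] raw = key.getBytes(StandardCharsets.US_ASCII);
        byte[] block = new byte[8];
        System.arraycopy(raw, 0, block, 0, Math.min(raw.length, 8));
        return ByteBuffer.wrap(block).getLong();
    }

    // Compare two keys 8 bytes at a time; fall back to a full string compare on a tie.
    static int compareKeys(String a, String b) {
        int byBlock = Long.compareUnsigned(firstBlock(a), firstBlock(b));
        return byBlock != 0 ? byBlock : a.compareTo(b);
    }

    public static void main(String[] args) {
        System.out.println(compareKeys("000ab56cb24aef9b817ac98d55695c6a",
                                       "000ab56cb24aef9b817ac98d55695c6b"));
    }
}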
From the book CouchDB: The Definitive Guide:
I need to draw a picture of this at some point, but the reason is that if you think of the idealized btree, when you use UUIDs you might be hitting any number of root nodes in that tree, so with the append-only nature you have to write each of those nodes and everything above them in the tree. But if you use monotonically increasing IDs, then you're invalidating the same path down the right-hand side of the tree, thus minimizing the number of nodes that need to be rewritten. It would be just the same for monotonically decreasing IDs as well. And it should technically work if your updates can be guaranteed to hit one or two nodes in the inside of the tree, though that's much harder to prove.
So sequential IDs offer a performance benefit; however, you must remember that this isn't maintainable when you have more than one database, as the IDs will collide.
