How slow is a call to a local database? - node.js

In general, say you have a small (<16 MB) table in a database running on the same machine as your server. If you need to do lots of querying against this table (100% reads), is it better to:
Fetch the entire table once and do all the searching/querying in the server code.
Make lots of queries against the local database.
If the database is local, can I take advantage of the DBMS's highly efficient internal data structures for querying, or is the per-query overhead such that it's faster to map the tables returned by the database into my own data structures?
Thanks.

This is going to depend heavily on what kind of searches you're doing.
If your data is all ID lookups, it's probably faster to have it in RAM.
If your data is all full scans (no indexes), it's probably faster to have it in RAM.
If your data uses indexes, it's probably faster to have it in the DB.
Of course, much of the appeal of a database is indexes and a common query interface, so you have to weigh how valuable those are versus raw speed.
There's no way to really answer this without knowing exactly the nature of the data and queries to be done on it. Over-the-wire time has its cost, as does BSON <-> native marshalling, but indexed searches can be O(log n) as opposed to a dumb O(n) (or worse) search over a simple in-memory data structure.
Have you tried benchmarking?
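If you do want to benchmark it, a rough comparison is easy to set up. A minimal sketch, assuming a local MongoDB instance with a hypothetical `test.items` collection keyed by `_id` and the official `mongodb` Node.js driver (collection name and iteration count are made up for illustration):

```js
// Compare N indexed lookups against a local MongoDB instance with
// the same lookups against an in-memory Map built from one bulk fetch.
const { MongoClient } = require('mongodb');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  const items = client.db('test').collection('items');

  // One bulk fetch, then build an in-memory index.
  const all = await items.find({}).toArray();
  const byId = new Map(all.map(doc => [String(doc._id), doc]));
  const ids = all.map(doc => doc._id);

  const N = 10000;

  let t = process.hrtime.bigint();
  for (let i = 0; i < N; i++) {
    await items.findOne({ _id: ids[i % ids.length] }); // one round trip per lookup
  }
  console.log('db lookups :', Number(process.hrtime.bigint() - t) / 1e6, 'ms');

  t = process.hrtime.bigint();
  for (let i = 0; i < N; i++) {
    byId.get(String(ids[i % ids.length])); // pure in-process lookup
  }
  console.log('map lookups:', Number(process.hrtime.bigint() - t) / 1e6, 'ms');

  await client.close();
}

main().catch(console.error);
```

Run it against your real data and query patterns; the answer will differ a lot between simple key lookups and filtered or indexed searches.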

Related

I have more than 3k rows of data to retrieve from Cassandra using an API. I have an index on the table, but it's causing a connection reset

I have more than 3k rows of data to retrieve from Cassandra using an API. I have an index on the table, but the query still causes a connection reset.
Should I look for another database instead?
Is there a workaround in Cassandra?
Would adding a limit, or filtering on a date range, to the query help?
(That would put a restriction on the API; is that standard practice?)
So there's a lot missing here that is needed to help diagnose what is going on. Specifically, it'd be great to see the underlying table definition and the actual CQL query that the API is trying to run.
Without that, I can say that to me, it sounds like the API is trying to aggregate the 3000 rows from multiple partitions with a specific date range in the cluster (and is probably using the ALLOW FILTERING directive to accomplish this). Most multi-partition queries will time-out, just because of all the extra network time being introduced while polling each node in the cluster.
As with all queries in Cassandra, a table needs to be built to support a specific query. If it's not, this is generally what happens.
Would adding a limit, or filtering on a date range, to the query help?
Yes, breaking this query up into smaller pieces will help. If you can look at the underlying table definition, that might give you a clue as to the right way to properly query the table. But in this case, making 10 queries for 300 rows probably has a higher chance for success than 1 query for 3000 rows.
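Purely as an illustration (the real table definition isn't shown in the question), suppose the table were partitioned by a day bucket. A sketch of issuing one small prepared query per partition with the Node.js `cassandra-driver`, instead of one multi-partition query with ALLOW FILTERING; the schema, keyspace, and limits here are hypothetical:

```js
// Hypothetical schema:
//   CREATE TABLE events (day text, created_at timestamp, payload text,
//     PRIMARY KEY ((day), created_at));
// Querying one partition (one day) at a time keeps each request small and
// local to a single partition, instead of one big ALLOW FILTERING scan.
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'demo',
});

async function fetchRange(days) {
  const query = 'SELECT * FROM events WHERE day = ? LIMIT 500';
  const rows = [];
  for (const day of days) {
    // prepare: true lets the driver reuse the prepared statement for each partition key
    const result = await client.execute(query, [day], { prepare: true });
    rows.push(...result.rows);
  }
  return rows;
}

fetchRange(['2024-05-01', '2024-05-02', '2024-05-03'])
  .then(rows => console.log('fetched', rows.length, 'rows'))
  .catch(console.error)
  .finally(() => client.shutdown());
```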

ArangoDB: what's the better way to perform queries?

What's better to retrieve complex data from ArangoDB: A big query with all collection joins and graph traversal or multiple queries for each piece of data?
I think it depends on several aspects, e.g. the operation(s) you want to perform, the scenario in which the query or queries will be executed, and whether you favor performance over maintainability.
AQL lets you write a single non-trivial query that spans the entire dataset and performs complex operations. Breaking a big query into multiple smaller ones can improve maintainability and code readability, but separate queries for each piece of data may hurt performance because of the network latency added by each request. You should also consider whether the scenario allows you to work with partial results returned from the database while the next batch of queries is still being processed.
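A minimal sketch of the two styles with the arangojs driver, assuming hypothetical `users` and `orders` collections; the database name, collections, and join field are made up for illustration:

```js
const { Database, aql } = require('arangojs');

const db = new Database({ url: 'http://localhost:8529', databaseName: 'shop' });

// Style 1: one bigger AQL query that joins users to their orders in a single round trip.
async function usersWithOrdersSingleQuery() {
  const cursor = await db.query(aql`
    FOR u IN users
      LET userOrders = (
        FOR o IN orders
          FILTER o.userId == u._key
          RETURN o
      )
      RETURN { user: u, orders: userOrders }
  `);
  return cursor.all();
}

// Style 2: several smaller queries, one extra round trip per user.
// Simpler to read and test, but each request adds network latency.
async function usersWithOrdersMultipleQueries() {
  const users = await (await db.query(aql`FOR u IN users RETURN u`)).all();
  const result = [];
  for (const user of users) {
    const orders = await (
      await db.query(aql`FOR o IN orders FILTER o.userId == ${user._key} RETURN o`)
    ).all();
    result.push({ user, orders });
  }
  return result;
}
```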

How can search results be cached?

How do I implement a caching mechanism for search results, like on Stack Overflow?
How do Elasticsearch and Lucene deal with caching?
As of now, you can cache in two different ways within Elasticsearch:
Filter cache - If you offload as many constraints as possible into filters that take no part in scoring, Elasticsearch can keep segment-level caches for each of those filters. Together with the warmer API, this provides a decent amount of in-memory caching for the filtered part of the query.
Shard request cache* - You can cache results (other than hits) at the query level. This is a fairly new feature and should provide a good amount of caching, although _source still has to be fetched from the shards.
Within Elasticsearch you can exploit these features to attain a good amount of caching.
You can also explore caching options external to Elasticsearch, such as memcached or other in-memory caches.
* previously called the shard query cache
Note: as of Elasticsearch 5.4+, warmers have been removed; there have been significant improvements to the index that make warmers unnecessary.
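A minimal sketch of both ideas with the official @elastic/elasticsearch Node.js client; the index and field names are hypothetical, and the request shape assumes an 8.x-style client (older 7.x clients wrap the search parameters in a `body` property instead):

```js
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

// Non-scoring constraints go in the bool `filter` clause: they don't affect
// relevance and their results can be cached per segment.
async function searchPosts(text) {
  return client.search({
    index: 'posts', // hypothetical index name
    query: {
      bool: {
        must: [{ match: { title: text } }],             // scored, not cached
        filter: [
          { term: { status: 'published' } },            // cacheable filter
          { range: { created_at: { gte: 'now-30d/d' } } },
        ],
      },
    },
  });
}

// The shard request cache applies to size:0 requests (e.g. aggregations);
// request_cache=true opts this request in explicitly.
async function countByTag() {
  return client.search({
    index: 'posts',
    size: 0,
    request_cache: true,
    aggs: { by_tag: { terms: { field: 'tags' } } },
  });
}

searchPosts('caching').then(res => console.log(res.hits)).catch(console.error);
countByTag().then(res => console.log(res.aggregations)).catch(console.error);
```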

Measuring precision and recall

We are building a text search solution and want a way to measure the precision and recall of the system every time we add new document types. From reading some of the posts here, it sounds like a machine-learning-based solution is the way to go. Can an expert comment on this? We will then look to add machine learning folks to our team.
Computing the F1-score requires knowing the correct class and rank of every sample returned for your evaluation queries, and you also need those evaluation queries in the first place.
Any machine-learning approach will need a large quantity of manual work to provide those samples and/or queries; so large, in fact, that it won't save you any time.
Another drawback of this kind of evaluation is the intrinsic error introduced by the learning itself, which grows with the size of the search engine's index and the number of examples required. You never get a clean evaluation out of it.
So forget machine learning for evaluating the search engine itself.
Build your test queries and samples by hand; over time the set will become large and reliable.
If you really want machine learning in your system, look at query pre-processing instead. Obtaining some meta-information about the query by other means (you mention SVM, why not?) is generally good for performance, and as long as it doesn't change the result set, you can reuse the same samples for an end-to-end evaluation.
That's what I did a few years ago, with a naive Bayes classifier for natural-language analysis.
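A minimal sketch of the hand-built evaluation suggested above: a few labeled queries, each with its set of known-relevant document IDs, scored against whatever your search function returns (`searchFn` is a placeholder for your own system, and the documents and queries are made up for illustration):

```js
// Evaluate a search function against hand-labeled queries.
// Each test case lists the document IDs a human judged relevant.
const testCases = [
  { query: 'node database latency',  relevant: new Set(['doc1', 'doc4']) },
  { query: 'elasticsearch caching',  relevant: new Set(['doc2', 'doc7', 'doc9']) },
];

function evaluate(cases, searchFn, k = 10) {
  let sumPrecision = 0;
  let sumRecall = 0;
  for (const { query, relevant } of cases) {
    const retrieved = searchFn(query).slice(0, k);            // top-k result IDs
    const hits = retrieved.filter(id => relevant.has(id)).length;
    sumPrecision += retrieved.length ? hits / retrieved.length : 0;
    sumRecall += relevant.size ? hits / relevant.size : 0;
  }
  const precision = sumPrecision / cases.length;
  const recall = sumRecall / cases.length;
  const f1 = precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}

// Usage with a stand-in search function:
const fakeSearch = q => ['doc1', 'doc2', 'doc3'];
console.log(evaluate(testCases, fakeSearch));
```

Rerun the same labeled set every time you add a new document type and you get a regression test for retrieval quality, no machine learning required.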

NoSQL - what's the best option for IP tracking?

I'm trying to implement a system using node.js in which a number of sites would contain js loaded from a common host, and trigger an action when some user visits n+ sites.
I suppose a NoSQL solution storing a mapping of IP address => array of sites visited would be preferable to an RDBMS, both in terms of performance and simplicity. The operations I need are "add to the array if not already there" and getting the length of the array. Also, I'd rather it not all sit in memory all the time, since the database might get large some day.
What system fits these requirements best? MongoDB seems like a nice option given that $addToSet exists, but maybe there's something better in terms of RAM usage?
When I hear about working with lists or sets, the first choice is Redis.
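A minimal sketch with the node-redis client, using one Redis set per IP: SADD is the "add if not already there" operation and SCARD gives the count. The key naming scheme and threshold are made up for illustration.

```js
const { createClient } = require('redis');

const redis = createClient({ url: 'redis://localhost:6379' });

// Record a visit and fire the action once the IP has been seen on `threshold` sites.
async function recordVisit(ip, site, threshold = 3) {
  const key = `visits:${ip}`;           // hypothetical key naming scheme
  await redis.sAdd(key, site);          // set semantics: duplicates are ignored
  const count = await redis.sCard(key); // number of distinct sites for this IP
  if (count >= threshold) {
    console.log(`trigger action for ${ip} (${count} sites)`);
  }
}

async function main() {
  await redis.connect();
  await recordVisit('203.0.113.7', 'site-a.example');
  await recordVisit('203.0.113.7', 'site-b.example');
  await recordVisit('203.0.113.7', 'site-c.example'); // reaches the threshold
  await redis.quit();
}

main().catch(console.error);
```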
