performance of parameterized queries for different db's - parameterized

A lot of people know that it is important to use parameterized queries to prevent sql injection attacks.
Parameterized queries are also much faster in sqlite and oracle when doing online transaction processing because the query optimizer doesn't have to reparse every parameterized sql statement before executing. I've seen sqlite becoming 3 times faster when you use parameterized queries, oracle can become 10 times faster when you use parameterized queries in some extreme cases with a lot of concurrency.
How about other db's like mysql, ms sql, db2 and postgresql?
Is there an equal difference in performance between parameterized queries and literal queries?

With respect to MySQL, MySQLPerformanceBlog reported some benchmarks of queries per second with non-prepared statements, prepared statements, and query cached statements. Their conclusion is that prepared statements is actually 14.5% faster than non-prepared statements on MySQL. Follow the link for details.
Of course the ratio varies based on the query.
Some people suppose that there's some overhead because you're making an extra round-trip from the client to the RDBMS -- one to prepare the query, the second to pass parameters and execute the query.
But the reality is that these are false assumptions made without actually measuring. I've never heard of prepared statements being slower in any brand of database.

I've nearly always seen an increase in speed - but only the first time generally. After the plans are loaded and cached I would have surmised that the various db engines will behave the same for either type.

Related

Are Objection.js WHERE IN queries slowing my Node.js application down?

Ok, so this is basically a yes/no answer. I have a node application running on Heroku with a postgres plan. I use Objection.js as ORM. I'm facing 300+ms response times on most of the endpoints, and I have a theory about why this is. I would like to have this confirmed.
Most of my api endpoints do around 5-10 eager loads. Objection.js handles eager loading by doing additional WHERE IN queries and not by doing one big query with a lot of JOINS. The reason for this is that it was easier to build that way and that it shouldn't harm performance too much.
But that made me thinking: Heroku Postgres doesn't run on the same heroku dyno as the node application I assume, so that means there is latency for every query. Could it be that all these latencies add up, causing a total of 300ms delay?
Summarizing: would it be speedier to build your own queries with Knex instead of generating them through Objection.js, in case you have a separately hosted database?
Speculating why this kind of slowdowns happen is mostly useless (I'll speculate a bit in the end anyways ;) ). First thing to do in this kind of situation would be to measure where that 300ms are used.
From database you should be able to see query times to spot if there are any slow queries causing the problems.
Also knex outputs some information about performance to console when you run it with DEBUG=knex:* environment variable set.
Nowadays also node has builtin profiling support which you can enable by setting --inspect flag when starting node. Then you will be able to connect your node process with chrome dev tools and see where node is using its time. From that profile you will be able to see for example if database query result parsing is dominating execution time.
The best way to figure out the slowness is to isolate the slow part of the app and inspect that carefully or even post that example to stackoverflow that other people can tell you why it might be slow. General explanation like this doesn't give much tool for other to help to resolve the real issue.
Objection.js handles eager loading by doing additional WHERE IN queries and not by doing one big query with a lot of JOINS. The reason for this is that it was easier to build that way and that it shouldn't harm performance too much.
With objection you can select which eager algorithm you like to use. In most of the cases (when there are one-to-many or many-to-many relations) making multiple queries is actually more performant compared to using join, because with join amount of data explodes and transfer time + result parsing on node side will take too much time.
would it be speedier to build your own queries with Knex instead of generating them through Objection.js, in case you have a separately hosted database?
Usually no.
Speculation part:
You mentioned that Most of my api endpoints do around 5-10 eager loads.. Most of the times when I have encountered this kind of slowness in queries the reason has been that app is querying too big chunks of data from database. When query returns for example tens of thousands rows, it will be some megabytes of JSON data. Only parsing that amount of data from database to JavaScript objects takes hundreds of milliseconds. If your queries are causing also high CPU load during that 300ms then this might be your problem.
Slow queries. Somtimes database doesn't have indexes set correctly so queries just will have to scan all the tables linearly to get the results. Checking slow queries from DB logs will help to find these. Also if getting response is taking long, but CPU load is low on node process, this might be the case.
Hereby I can confirm that every query takes approx. 30ms, regardless the complexity of the query. Objection.js indeed does around 10 separate queries because of the eager loading, explaining the cumulative 300ms.
Just FYI; I'm going down this road now ⬇
I've started to delve into writing my own more advanced SQL queries. It seems like you can do pretty advanced stuff, achieving similar results to the eager loading of Objection.js
select
"product".*,
json_agg(distinct brand) as brand,
case when count(shop) = 0 then '[]' else json_agg(distinct shop) end as shops,
case when count(category) = 0 then '[]' else json_agg(distinct category) end as categories,
case when count(barcode) = 0 then '[]' else json_agg(distinct barcode.code) end as barcodes
from "product"
inner join "brand" on "product"."brand_id" = "brand"."id"
left join "product_shop" on "product"."id" = "product_shop"."product_id"
left join "shop" on "product_shop"."shop_code" = "shop"."code"
left join "product_category" on "product"."id" = "product_category"."product_id"
left join "category" on "product_category"."category_id" = "category"."id"
left join "barcode" on "product"."id" = "barcode"."product_id"
group by "product"."id"
This takes 19ms on 1000 products, but usually the limit is 25 products, so very performant.
As mentioned by other people, objection uses multiple queries instead of joins by default to perform eager loading. It's a safer default than completely join based loading which can become really slow in some cases. You can read more about the default eager algorithm here.
You can choose to use the join based algoritm simply by calling the joinEager instead method of eager. joinEager executes one single query.
Objection also used to have a (pretty stupid) default of 1 parallel queries per operation, which meant that all queries inside an eager call were executed sequentially. Now that default has been removed and one should get a better performance even in cases like yours.
Your trick of using json_agg is pretty clever and actually avoids the slowness problems I referred to that can arise in some cases when using joinEager. However, that cannot be easily used for nested loading or with other database engines.

Nodejs and Sqlite. Perform long queries

I have to perform 2 queries: query A is long (20 seconds) and query B is fast (1 second).
I want to guarantee that query B is performed fast also if query A is running.
How can i achive this behaviour?
It may not be easy to do because of how SQLite does locking.
From official Appropriate Uses For SQLite documentation:
SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writer queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
[...]
SQLite only supports one writer at a time per database file. But in most cases, a write transaction only takes milliseconds and so multiple writers can simply take turns. SQLite will handle more write concurrency that many people suspect. Nevertheless, client/server database systems, because they have a long-running server process at hand to coordinate access, can usually handle far more write concurrency than SQLite ever will.
It may not be the best way to use SQLite, as the SQLite documentation states, when you have so many data that a single query takes so long.
There is no easy solution to fix that, other than using a real RDBMS like PostgreSQL.
And since you didn't include those queries that take so long it's also impossible to tell you anything more than that. Of course maybe your queries could be optimized but we don't know that.

When to use Materialized Views?

I'm learning Cassandra now and I understand I should make a table for each query. I'm not sure when I should make separate tables or materialized views. For example, I have the following queries for users and posts:
users_by_id
users_by_email
users_by_session_key
posts_by_id
posts_by_category
posts_by_user
Should I always use materialized views?
It seems to me that if you want to keep the Posts or Users consistent across queries, then I have to use materialized views. However materialized views I read have a read before write latency.
On the other hand, if I use different tables, am I supposed to make 3 Inserts every time a new post is created? I noticed that I get the error batch with conditions cannot span multiple tables, which means I have to insert it one at a time into each separate table, which can cause consistency problems if one of the queries fails. (A batch statement, would fail all 3 if one of them failed).
So, since it makes sense to have consistency, then it seems to me that I will always want to use materialized views, and have to take the read before write penalty.
I guess my other question is when would it ever be okay for data to be inconsistent?
So hoping someone can provide more clarity for me for how to handle multiple queries in cassandra on a 'theoretical model` like Users or Posts. Should I be using materialized views? If I use 3 different tables for each model, how do I keep them consistent? Just hope that all 3 inserts don't fail? Doesn't seem right.
Read my deep dive blog post for all the trade-offs when using materialized views. Once you understand the trade-offs, choose wisely: http://www.doanduyhai.com/blog/?p=1930
No, you shouldn't always use materialized views. The perfect solution is a interface for your database. In this application, you handle all your different tables. But there's are also some use case for the materialized views: If you haven't the time for this application but you need this feature, use materialized views. You have a performance trade off but in this scenario, the time is more important. If you also need real updates instead of upserts on all tables: use materialized views.
Batch is useful for buffering or putting data-sets with the same partition key together. For example: You have a high data troughput application. Between your heartbeats or between execution another query with QUORUM, you got 10 other events with the same partition key. But you won't execute them because you're waiting for a successful response. If a success comes back, you execute a batch query. But please keep in mind: Use only a batch for the same partition keys.
Generally, remember one important thing: Cassandra has an eventually consistency model. That means: If you use qourum, you will have consistency but not every time. If your application needs a full consistency, not only eventually use another solution. E.g. SQL with sharding. Cassandra is optimized for writes and you will only get happy when you're using the cassandra features.
Some performance tips:
If you need a better consistency: Use QUORUM, never use ALL. And, generally, write you queries standalone. Sometimes batch is useful. Don't execute queries with ALLOW FILTERING. Don't use token ranges or IN operator on partition keys :)

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure that works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seems to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase however, but I'm not sure what they would be in this case.
I also think crate.io looks like it could be interesting, but I wonder if there might be unforseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?
I currently work with Cassandra and I might help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. Normally you use:
ONE - Only one node has to get or accept the change. This means fast reads/writes, but low consistency (You can have other machine delivering the older information while consistency was not achieved).
QUORUM - 51% of your nodes must get or accept the change. This means not as fast reads and writes, but you get FULL consistency IF you use it in BOTH reads and writes. That's because if more than half of your nodes have your data after you inserted/updated/deleted, then, when reading from more than half your nodes, at least one node will have the most recent information, which would be the one to be delivered.
Both this options are the ones recommended because they avoid single points of failure. If all machines had to accept, if one node was down or busy, you wouldn't be able to query.
Pros
Cassandra is the solution for performance, linear scalability and avoid single points of failure (You can have machines down, the others will take the work). And it does most of its management work automatically. You don't need to manage the data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't really care about what queries will be made and you work to normalize it.
With Cassandra the strategy is different. You model the tables to serve the queries. And that happens because you can't join and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores and you want to make a query that returns all products of a certain store (Ex.: New York City), and another query to return all products of a certain department (Ex.: Computers), you would have two tables "ProductsByStore" and "ProductsByDepartment" with the same data, but organized differently to serve the query.
Materialized Views can help with this, avoiding the need to change in multiple tables, but it is to show how things work differently with Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.

Neo4j Optimization Questions for Server Plug-in Queries

I'm trying to optimize a fuzzy search query. It's fairly large, as it searches most properties in the database for a single word. I have some questions about some things I've been doing to improve the search speed.
Test Info: I added about 10,000 nodes and I'm searching on about 40 properties. My query times are about 3-30 seconds depending on the criteria.
MATCH (n) WHERE
(n:Type__Exercise and ( n.description =~ '(?i).*criteria.*' or n.name =~ '(?i).*criteria.*' )) or
(n:Type__Fault and ( n.description =~ '(?i).*criteria.*' or n.name =~ '(?i).*criteria.*' ))
with n LIMIT 100
return count(n)
This is basically my query, but with a lot more OR clauses. I also use parameters when sending the query to the execution engine. I realize it's very expensive to use the regular expressions on every single property. I'm hoping I can get good enough performance without doing exact matches up to a certain amount of data (This application will only have 1-10 users querying at a time). This is a possible interim effort we're investigating until the new label indexes support full text queries.
First of all, how do I tell if my query was cached? I make a call to my server plug-in via the curl command and the times I'm seeing are almost identical each time I pass the same criteria (The time is for the entire curl command to finish). I'm using a single instance of the execution engine that was created by using the GraphDatabaseService that is passed in to the plug-in via a #Source parameter. How much of an improvement should I see if a query is cached?
Is there a query size where Neo4j doesn't bother caching the query?
How effective is the LIMIT clause at speeding up queries? I added one, but didn't see a great performance boost (for queries that do have results). Does the execution engine stop once it finds enough nodes?
My queries are ready-only, do I still have to wrap my calls with a transaction?
I could split up my query so I only search one property at a time or say 4 properties at a time. Then I could run the whole set of queries via the execution engine. It seems like this would be better for caching, but is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worth while?
Is there a way to use parameters when using PROFILE in the Neo4j console? I've been trying to use this to see how many db hits I'm getting on my queries.
How effective is the Neo4j browser for comparing times it takes to execute a query?
Does caching happen here?
If I want to warm up Neo4j data for queries - can I run the exact queries I'm expecting? Does the query need to return data, or will a count type query warm the cache? As an alternative, should I just iterate over all the nodes? I'd rather just pull in the nodes that are likely to be searched vs all of them.
I think for the time being you'd be better served using the fulltext-legacy indexing facilities, I recently wrote a blog post about it: http://jexp.de/blog/2014/03/full-text-indexing-fts-in-neo4j-2-0/
If you don't want to do that:
I would probably also rewrite your query to turn it around:
MATCH (n)
WHERE
(n:Type__Exercise OR n:Type__Fault) AND
(n.description =~ '(?i).*criteria.*' OR n.name =~ '(?i).*criteria.*' )
You can probably also benefit a bit more by having a secondary "search" field that is just the concatenation of your description and name fields. You probably also want to improve your regexp like adding a word boundary \b left and right.
Regarding your questions:
First of all, how do I tell if my query was cached?
Your query will be cached if you use parameters (for the regexps) there is a configurable query-caches size (defaulting to 100 queries)
Is there a query size where Neo4j doesn't bother caching the query?
Neo4j currently caches all queries that come in regardless of size
My queries are ready-only, do I still have to wrap my calls with a transaction?
Cypher will create its own transaction. In general read transactions are mandatory. For cypher you need outer transactions if you want multiple queries to participate in the same tx-scope.
is there an added cost to running multiple small queries rather than one large one? What if I kicked off 10 threads? Would there be enough of a performance increase to make this worth while?
It depends smaller queries are executed more quickly (if they touch less of the total dataset) but you have to combine their results in the client.
If they touch the same nodes you do double work.
For bigger queries you have to watch out when you span up cross products or exponential path explosions.
Regarding running smaller queries with many threads
Good question, it should be faster there are currently some bottlenecks that we're about to remove. Just try it out.
Is there a way to use parameters when using PROFILE in the Neo4j console?
You can use the shell variables for that, with export name=value and list them with env
e.g.
export name=Lisa
profile match (n:User {name:{name}}) return n;
How effective is the Neo4j browser for comparing times it takes to execute a query?
The browser measures the complete roundtrip with potentially more data loading, so it's timing is not very accurate.
Warmup
The exact queries would make sense
You don't have to return data, it is enough to return count(*) but you should access the properties you want to access to make sure they are loaded.

Resources