Multithreaded DBMSs?

I was wondering what DBMSs actually use multithreading in their query plans/executions?

Oracle supports this, as do SQL Server and DB2. I do not believe that MySQL or PostgreSQL support parallel queries, though.

I believe most databases that support table partitioning will support querying each partition at the same time if the need arises rather than just pruning unneeded partitions. Oracle can do this. Teradata definitely does this.

MySQL uses only one thread per query (in the standard engines); this is true even when the tables are partitioned.

Multi-threading is used in many areas of a DBMS, for example in query evaluation.
*) Parallel query execution uses multi-threading to optimize the performance of query evaluation.
*) Database backups can be parallelized, e.g. by creating a separate backup thread for each available tape drive to carry out the server backup. Oracle does this.
*) Table reorganization - over time the database becomes bulky, and the DBA reorganizes the tables with the intention of improving database performance.
---- In Oracle, POSIX threads and C++ are used to achieve multi-threading. ----
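As a concrete illustration of the parallel-query point, Oracle lets you request a degree of parallelism per statement (via a hint) or declare it on the table. This is only a minimal sketch; the sales table and the degree of 4 are invented for the example.

-- Ask Oracle to scan this (hypothetical) table with up to 4 parallel execution servers.
SELECT /*+ PARALLEL(s 4) */ COUNT(*)
FROM sales s;

-- Alternatively, declare a default degree of parallelism on the table itself.
ALTER TABLE sales PARALLEL 4;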

Related

Performance difference between YSQL and YCQL

Based on the document "Benchmarking Distributed SQL Databases", it appears that throughput is higher in YCQL than in YSQL.
If we use the same table structure, the tool doing the inserts is the same, and I am not using any SQL-specific features, then why does YCQL perform better than YSQL?
This could be because of a few differences between YCQL and YSQL. Note that while these differences are not fundamental to the architecture, they manifest because YSQL started with the PostgreSQL code for the upper half of the DB. Many of these are being enhanced.
One hop optimization: YCQL is shard-aware and knows how the underlying DB (called DocDB) shards and distributes data across nodes. This means it can “hop” directly to the node that contains the data when using PREPARE-BIND statements. YSQL today cannot do this, since it requires a JDBC-level protocol change; this work is being done in the jdbc-yugabytedb project.
Threads instead of processes: YCQL uses threads to handle incoming client queries/statements, while YSQL (and the PostgreSQL code) uses processes. These processes are heavier weight, which could affect throughput in certain scenarios (and connection scalability in certain others as well). This is another enhancement that is planned.
Upsert vs insert: in YCQL, each insert is treated as an upsert (update or insert, without having to check the existing value) by default, and special syntax is needed to perform pure inserts (see the CQL sketch after these points). In YSQL, each insert needs to read the data before performing the write, because an insert on a key that already exists is treated as a failure.
More work has gone into YCQL performance: currently (end of 2019) the focus for YSQL has been only on correctness + functionality, while YCQL performance has been worked on quite a bit. Note that while the work on YSQL performance has just started, it should be possible to improve it relatively quickly because of the underlying architecture.
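A small CQL sketch of the upsert point above; the keyspace, table, and values are invented for illustration.

-- In YCQL an INSERT is an upsert by default: it blindly writes the row,
-- overwriting any existing value for the same key, with no prior read.
INSERT INTO app.users (id, name) VALUES (1, 'alice');

-- A "pure" insert that refuses to overwrite an existing key needs the extra
-- clause below, which forces a read before the write.
INSERT INTO app.users (id, name) VALUES (1, 'alice') IF NOT EXISTS;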

How to create slowness in Cassandra?

I want to create slowness in Cassandra to test my application. Are there any specific ways to induce slowness in Cassandra? In an RDBMS we use locking, so one operation waits until another releases the lock. As Cassandra doesn't have locking, is there any other way to create deadlocks, slowness, etc.?
You could use the cassandra-stress tool.
You could check out our project, Simulacron: https://github.com/datastax/simulacron
This is a C*/DSE simulator that was written specifically to test things like race conditions and error conditions. You would have to prime all your relevant queries ahead of time, but it allows you to introduce a wait time or errors into your responses. You can also simulate a large cluster on your local machine.
There is also a similar tool called scassandra, which does much of the same thing.
http://www.scassandra.org/
There are many ways to do it; I'll list two:
Create a UDF with a sleep/wait function inside, if your version of Cassandra supports it (a hedged sketch follows after this list).
Link to the docs:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateUDF.html
Create a large table (the larger it is, the slower it will run), and run:
select some_column from table where other_column = 'something' allow filtering;
where other_column is not a partition key of the table. This will result in a full table scan, and since Cassandra isn't built for that, it will take some time (and consume I/O and CPU as well).
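A minimal sketch of the UDF idea, with made-up keyspace, function, table, and column names. Note that user-defined functions have to be enabled in cassandra.yaml (enable_user_defined_functions; the option name varies by version), and Cassandra sandboxes UDFs and enforces execution-time limits (user_defined_function_warn_timeout / user_defined_function_fail_timeout), so the per-call delay has to stay small or those limits must be relaxed. The body busy-waits because the sandbox may reject calls like Thread.sleep.

CREATE OR REPLACE FUNCTION myks.slow_identity (val int)
  CALLED ON NULL INPUT
  RETURNS int
  LANGUAGE java
  AS $$
    // Busy-wait for roughly 50 ms per invocation, then return the input unchanged.
    long end = System.nanoTime() + 50L * 1000000L;
    while (System.nanoTime() < end) { /* spin */ }
    return val;
  $$;

-- Wrapping a column in the UDF slows an otherwise cheap query down by ~50 ms per row.
SELECT id, myks.slow_identity(some_int_column) FROM myks.some_table;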
It may be easier to just limit the network on the nodes. Depending on the OS you're using, there are different options.

Apache Calcite Data Federation Use Case

I just want to check whether Apache Calcite can be used for the "data federation" use case (a query spanning multiple databases).
The idea is that I have a master query over 5 tables, where some of the tables come from one database (say Hive) and 3 tables come from another database (say MySQL).
Can I execute the master query across multiple databases from one JDBC client interface?
If this is possible, where does the query execution (particularly the inter-database join) happen?
Also, can I get a physical plan from Calcite that I can execute explicitly in another execution engine?
I read in the Calcite documentation that it can push down Join and GroupBy, but I could not understand it. Can anyone help me understand this?
I will try to answer. You can also send questions to the mailing list (dev@calcite.apache.org); you are more likely to get an answer there.
Can I execute the master query across multiple databases from one JDBC client interface? If this is possible, where does the query execution (particularly the inter-database join) happen?
Yes, you can. The inter-database join happens in memory, in the process where Calcite runs.
Can I get a physical plan from Calcite that I can execute explicitly in another execution engine?
Yes, you can; a lot of Calcite consumers do it this way, but you will have to wrap around the Calcite rule system.
I read in the Calcite documentation that it can push down Join and GroupBy, but I could not understand it. Can anyone help me understand this?
These are SQL optimisations that the engine performs. Imagine a GROUP BY that could have been evaluated on a tiny table but is specified after a join with a huge table; pushing it down lets the aggregation run closer to the data instead of on the large joined result.
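For a concrete picture, assuming a Calcite model that registers the Hive source under a schema named hive and the MySQL source under a schema named mysql (the schema, table, and column names here are invented for illustration), a single statement submitted through Calcite's JDBC driver can join across both sources, with Calcite executing the cross-source join itself:

SELECT c.customer_name, SUM(o.amount) AS total_amount
FROM hive.orders AS o
JOIN mysql.customers AS c ON o.customer_id = c.id
GROUP BY c.customer_name;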

Cassandra bulk insert operation, internally

I am looking for Cassandra/CQL's cousin of the common SQL idiom of INSERT INTO ... SELECT ... FROM ... and have been unable to find anything to do such an operation programmatically or in CQL. Is it just not supported?
My use case is to do a reasonably bulky copy from one table to another. I don't need any particular concurrent guarantees, but it's a lot of data so I'd like to avoid the additional network overhead of writing a client that retrieves data from one table, then issues batches of inserts into the other table. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (Another alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.
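As a hedged aside: one pragmatic workaround is cqlsh's COPY command, which round-trips the data through the client as CSV rather than doing a server-side copy, so it does not remove the network overhead the question hopes to avoid, but it does spare you writing a custom copier. Keyspace, table, and column names below are made up.

COPY src_ks.source_table (id, col_a, col_b) TO 'rows.csv' WITH HEADER = TRUE;
COPY dst_ks.target_table (id, col_a, col_b) FROM 'rows.csv' WITH HEADER = TRUE;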

Performance of parameterized queries for different DBs

A lot of people know that it is important to use parameterized queries to prevent SQL injection attacks.
Parameterized queries are also much faster in SQLite and Oracle when doing online transaction processing, because the query optimizer doesn't have to reparse and re-plan the statement before every execution. I've seen SQLite become 3 times faster with parameterized queries, and Oracle can become 10 times faster in some extreme cases with a lot of concurrency.
How about other DBs like MySQL, MS SQL, DB2 and PostgreSQL?
Is there a comparable performance difference between parameterized queries and literal queries?
With respect to MySQL, MySQLPerformanceBlog reported some benchmarks of queries per second with non-prepared statements, prepared statements, and query-cached statements. Their conclusion is that prepared statements are actually 14.5% faster than non-prepared statements on MySQL. Follow the link for details.
Of course the ratio varies based on the query.
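For reference, this is roughly what the prepared/parameterized form looks like at the SQL level in MySQL; the table, column, and values are invented for the example. The statement text is parsed once and then executed repeatedly with different bound values.

-- Parse the statement once; '?' is a placeholder for a bound parameter.
PREPARE fetch_orders FROM 'SELECT * FROM orders WHERE customer_id = ?';

SET @cid = 42;
EXECUTE fetch_orders USING @cid;   -- reuses the prepared statement

SET @cid = 43;
EXECUTE fetch_orders USING @cid;   -- only the parameter value changes

DEALLOCATE PREPARE fetch_orders;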
Some people suppose that there's some overhead because you're making an extra round-trip from the client to the RDBMS -- one to prepare the query, the second to pass parameters and execute the query.
But the reality is that these are false assumptions made without actually measuring. I've never heard of prepared statements being slower in any brand of database.
I've nearly always seen an increase in speed, but generally only the first time. After the plans are loaded and cached, I would surmise that the various DB engines behave the same for either type.

Resources