There are 1,00,000 UPDATE statements stored in a SQL table, EexecuteQueue.
Below are the steps I am planning to take.
Identify the number of logical processors on the database server.
Split the queries in the EexecuteQueue table into (logical processors - 2) groups and execute each group on a separate thread.
My assumption is that instead of executing 1,00,000 UPDATE statements sequentially, each thread will execute 25,000 UPDATE statements in parallel (if we have 4 threads).
My questions:
Is my assumption correct?
Is it good to use threads in SQL CLR?
Thanks in advance.
My assumption is that instead of executing 1,00,000 UPDATE statements sequentially, each thread will execute 25,000 UPDATE statements in parallel (if we have 4 threads). Is my assumption correct?
Yes, but it is completely irrelevant. Doing 25k operations on each of 4 threads by no means implies that it will be faster than doing 100k operations on a single thread. Such an assumption is, at best, naive. You need to identify your bottlenecks and address them accordingly, depending on your findings. Read How to analyse SQL Server performance.
Is it good to use threads in SQL CLR?
No.
To speed up batch updates, use set-based operations, reduce the number of round trips, and commit in batches.
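A minimal sketch of the difference between row-by-row updates and a set-based update with a single commit. It uses Python's built-in `sqlite3` module as a stand-in for the real database server, and the `items` and `staged` tables are hypothetical names chosen for illustration; the exact set-based syntax on SQL Server would differ.

```python
import sqlite3

def apply_row_by_row(conn, changes):
    """Baseline: one UPDATE statement and one commit per row."""
    for item_id, price in changes:
        conn.execute("UPDATE items SET price = ? WHERE id = ?", (price, item_id))
        conn.commit()

def apply_set_based(conn, changes):
    """Stage all changes, then apply them with one UPDATE and one commit."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS staged (id INTEGER PRIMARY KEY, price REAL)"
    )
    conn.execute("DELETE FROM staged")
    conn.executemany("INSERT INTO staged (id, price) VALUES (?, ?)", changes)
    # A single statement updates every staged row at once.
    conn.execute(
        """
        UPDATE items
        SET price = (SELECT s.price FROM staged AS s WHERE s.id = items.id)
        WHERE id IN (SELECT id FROM staged)
        """
    )
    conn.commit()

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)", [(i, 1.0) for i in range(1, 1001)]
    )
    conn.commit()
    # Hypothetical workload: set the price of the first 500 items to their id.
    changes = [(i, float(i)) for i in range(1, 501)]
    apply_set_based(conn, changes)
    return conn.execute("SELECT SUM(price) FROM items").fetchone()[0]
```

Whether the set-based form is actually faster for a given workload should still be measured, in line with the advice above to identify the bottleneck first.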
I understand that Scylla allows batch statements like these.
BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH
These statements have performance implications because they ensure atomicity. However, I simply have many insert statements that I want to send from my Node client in a single IO. Atomicity among these inserts is not needed. Any idea how I can do that? I can't find anything.
Batching multiple inserts is usually an antipattern in the Cassandra world (except when they all go into one partition; see the docs). When you send inserts for multiple partitions in one batch, the coordinator node has to take the data from the batch and send it to the nodes that own it. This puts additional load on the coordinator, which first has to back up the content of the batch so that it is not lost if the coordinator crashes in the middle of execution, then execute all the operations, and then wait for their results before responding to the caller (see this diagram to understand how a so-called logged batch works).
When you don't need atomicity, the best performance comes from sending multiple inserts in parallel and waiting for them to complete. This is faster, puts less load on the nodes, and lets the driver use a token-aware load balancing policy so that requests are sent to the nodes that own the data (if you're using prepared statements). In Node.js you can achieve this with the Concurrent Execution API; there are several ways to use it, so it's best to consult the documentation to choose the one that fits your use case.
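The general pattern of sending many independent inserts in parallel, with a cap on how many are in flight, can be sketched as follows. This is not the Node.js driver's Concurrent Execution API; it is a language-neutral illustration written with Python's `asyncio`, and `fake_insert` is a hypothetical stand-in for a driver call that executes a prepared INSERT statement.

```python
import asyncio

async def run_concurrently(rows, insert_one, max_in_flight=32):
    """Run insert_one(row) for every row, at most max_in_flight at a time."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def guarded(row):
        async with limiter:
            return await insert_one(row)

    # gather returns results in input order; return_exceptions=True collects
    # failures in the result list instead of raising the first one.
    return await asyncio.gather(
        *(guarded(row) for row in rows), return_exceptions=True
    )

async def fake_insert(row):
    """Hypothetical stand-in for executing one prepared INSERT via a driver."""
    await asyncio.sleep(0)
    return row["id"]

def demo():
    rows = [{"id": i} for i in range(100)]
    return asyncio.run(run_concurrently(rows, fake_insert, max_in_flight=8))
```

The concurrency cap matters: an unbounded burst of requests can overload the cluster or the client, so the limit is a value to tune rather than a fixed recommendation.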
In the doc we can find a query hint named USE_ADDITIONAL_PARALLELISM here: https://cloud.google.com/spanner/docs/query-syntax#statement-hints
However, the documentation for it is very short.
From my understanding it will spread a single query to be executed on multiple nodes; is that correct?
In what scenario would we use it?
What is its impact on the infrastructure?
How does it scale with number of nodes?
Does it need a query that picks data from different splits, or does it work on a single split?
Any meaningful information about it is welcome.
PS: I was originally introduced to the hint in this thread
A Spanner query may be executed on multiple remote servers.
Source: An illustration of the life of a query from the Cloud Spanner "Query execution plans" documentation
The root node coordinates the query execution.
If the execution plan expects rows on multiple splits to satisfy the query predicate(s), multiple subplans are executed on the respective remote servers.
Due to the distributed nature of Spanner, these subplans can sometimes be executed in parallel; for example, when the execution of the right subplan does not depend on the results of the left subplan.
If the USE_ADDITIONAL_PARALLELISM query hint is provided, the root node may choose to increase the number of parallel remote executions, if the execution plan includes multiple subplans.
To answer the original questions:
From my understanding it will spread a single query to be executed on multiple nodes; is that correct?
This hint does not change how a query is executed; it only makes it possible for subplans of that execution to be initiated with increased parallelism.
In what scenario would we use it?
Especially in cases where a full table scan is required, this may lead to faster query completion in wall-clock time, but the trade-offs concerning resource allocation, and the effects on other operations running in parallel, should also be considered.
What is its impact on the infrastructure?
If an increased number of remote executions are run in parallel, the average CPU utilization for the instance may increase.
How does it scale with number of nodes?
An increased number of nodes provides additional capacity for parallel operations.
Does it need a query that picks data from different splits, or does it work on a single split?
Benefits will likely be significantly higher for queries which require data that resides on multiple splits.
A Cloud Spanner query may have multiple levels of distribution. The USE_ADDITIONAL_PARALLELISM query hint will cause a node executing a query to try to prefetch the results of subqueries further up in the distribution queue. This can be useful for queries doing full table scans, or full table scans with aggregations such as COUNT(), MAX, MIN, etc., where identical subqueries can be distributed to many splits and where the individual subqueries to the splits return relatively little data (such as aggregation state). However, if the individual subqueries return significant data, then using this hint can cause memory usage on the consuming node to go up significantly due to prefetching.
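As a hedged illustration of where the hint goes: Spanner statement hints are written as an `@{NAME=VALUE}` prefix on the statement. The helper function and the `Orders` table below are hypothetical, and the example only builds the query text; it does not talk to a Spanner instance.

```python
def with_additional_parallelism(sql: str) -> str:
    """Prefix a query with the USE_ADDITIONAL_PARALLELISM statement hint."""
    return "@{USE_ADDITIONAL_PARALLELISM=TRUE} " + sql.strip()

# Hypothetical table name. A full-scan aggregation like this returns little
# data per split, which is the scenario described above.
query = with_additional_parallelism("SELECT COUNT(*) FROM Orders")
```

A query whose subqueries return many rows per split would be a poorer candidate, for the memory reasons given above.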
Is there any way to execute VoltDB stored procedures at a regular interval, or to schedule a stored procedure to run at a specific time?
I am exploring VoltDB in order to move our product from an RDBMS to VoltDB. Our product is written in Java.
Most of the queries can be migrated into VoltDB stored procedures. But in our product we have a cron job in Oracle that executes at a regular interval, and I cannot find such a feature in VoltDB.
I know VoltDB stored procedures can be called from the application at a regular interval, but our product is deployed in Active-Active mode. In that case, every application instance would call the stored procedure at the regular interval, which is not a good solution; otherwise, we would have to develop some mechanism to run the procedure from one instance only.
So it would be good if VoltDB provided a cron job feature.
I work at VoltDB. There isn't currently a feature like this in VoltDB, for example like DBMS_JOB in Oracle.
You could certainly use a cron job on one of the servers in your cluster, or on some other server within your network, that invokes sqlcmd to run a script, or that echoes individual SQL statements or execute procedure commands through sqlcmd to the database. Making cron jobs highly available is a general problem. You might find these other discussions helpful:
How to convert Linux cron jobs to "the Amazon way"?
https://www.reddit.com/r/linuxadmin/comments/3j3bz4/run_cronjob_only_on_one_node_in_cluster/
You could also look into something like rcron.
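One common way to make a scheduled job run on only one of several active instances is to have each instance try to claim the run in a shared store that enforces uniqueness; only the instance whose claim succeeds does the work. The sketch below is an assumption for illustration, not a VoltDB feature: it uses Python's `sqlite3` as a stand-in for the shared store, and the `job_runs` table, the job name, and the slot format are all hypothetical.

```python
import sqlite3

def try_claim(conn, job_name, slot):
    """Return True if this instance claimed the (job_name, slot) run."""
    try:
        # The primary key lets only the first insert for a slot succeed.
        conn.execute(
            "INSERT INTO job_runs (job_name, slot) VALUES (?, ?)", (job_name, slot)
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()
        return False

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_runs (job_name TEXT, slot TEXT, "
        "PRIMARY KEY (job_name, slot))"
    )
    # Two instances wake up for the same scheduled slot; only one wins.
    first = try_claim(conn, "cleanup", "2024-01-01T00:00")
    second = try_claim(conn, "cleanup", "2024-01-01T00:00")
    return first, second
```

Each instance would call `try_claim` when its timer fires and invoke the procedure only if the call returns True.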
One thing to be careful of when converting from an RDBMS to VoltDB is that VoltDB is optimized for processing many small transactions in parallel across many partitions. While the architecture of serialized execution per partition excels for many operational and streaming workloads, it is not designed to perform bulk operations on many rows at a time, especially transactions that need to perform writes on many rows that may be in different partitions within one transaction.
If you have a periodic job that does something like "process all the new rows that meet some criteria" you may find this transaction is slow and every time it runs it could delay other parts of the workload, especially if many rows have accumulated. It would be more the "VoltDB Way" to replace a simple INSERT statement that you may be using to ingest data (to be processed later by a scheduled job) with a procedure that inserts and immediately processes the row of data. You might even need a procedure that checks for other records and processes small sets of rows as a group, for example stitching together segments of data that go together but may have arrived out of order. By operating on fewer records at a time within one partition at a time, this type of procedure would be more scalable and would keep the data closer to your desired finished state in real time, rather than always having some data waiting to be processed.
I have to execute an operation that launches a lot of Map/Reduce jobs (~400), but every Map/Reduce runs on a different collection, so there can't be any concurrent writes.
To improve the performance of this operation I parallelized it by creating a thread on the application side (I use the Java driver) for each Map/Reduce (note that I don't use sharding).
But when I compared the results, I ended up with worse results than with sequential (single-threaded) execution.
To be more precise: 341 sec for sequential execution, 904 sec for the parallel one.
So instead of getting a better execution time, it takes nearly three times as long.
Does anyone know why MongoDB doesn't like parallelization of Map/Reduce processes?
I found an article about it (link), but now that MongoDB uses the V8 engine I thought it should be OK.
First, run the Map/Reduce jobs on different databases; there is a lock per database (as of version 2.6).
Second, you may need more RAM and faster disk IO, since those may be the bottleneck.
Here is an example of how to make use of multiple cores: http://edgystuff.tumblr.com/post/54709368492/how-to-speed-up-mongodb-map-reduce-by-20x
"The issue is that there is too much lock contention between the threads. MR is not very altruistic when locking (it yields every 1000 reads), and since MR jobs does a lot of writing too, threads end up waiting on each other. Since MongoDB has individual locks per database"
I have a process (C++ code) that reads from and writes to a database (Oracle).
But the process takes a long time to finish.
I was thinking of creating partitions in the tables that this process queries.
And then making the process multi-threaded so that each thread (one per partition) can read/write the data in parallel.
I will be creating a DB connection per thread.
Will the writes slow it down?
Will this work?
Is there any other way of improving performance (all queries are tuned and optimized already)?
Thanks,
Nikhil
If the current bottleneck is writing the data to the database then creating more threads to write more data may or may not help, depending on how the data is partitioned, and whether or not the writes can occur concurrently, or whether they interfere with each other (either at the database lock level, or at the database disk IO level).
Creating more threads will instead allow the application to process more data, and queue it up for writing to the database, assuming that there is sufficient hardware concurrency (e.g. on a multicore machine) to handle the additional threads.
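The idea of processing on several threads while queuing results for the database can be sketched as a small producer/consumer pipeline. This is an illustration in Python rather than the asker's C++, and `process` and `write_batch` are hypothetical callables standing in for the real computation and the real database write.

```python
import queue
import threading

_STOP = object()  # sentinel telling the writer that all workers are done

def run_pipeline(items, process, write_batch, workers=4, batch_size=100):
    """Process items on worker threads; one writer thread writes in batches."""
    todo = queue.Queue()
    done = queue.Queue()
    for item in items:
        todo.put(item)

    def worker():
        # Pull items until the input queue is empty, queuing each result.
        while True:
            try:
                item = todo.get_nowait()
            except queue.Empty:
                return
            done.put(process(item))

    def writer():
        # Drain results and hand them to write_batch in fixed-size groups.
        batch = []
        while True:
            result = done.get()
            if result is _STOP:
                break
            batch.append(result)
            if len(batch) >= batch_size:
                write_batch(batch)
                batch = []
        if batch:
            write_batch(batch)

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    done.put(_STOP)
    writer_thread.join()

def demo():
    written = []  # each element is one batch passed to the writer
    run_pipeline(range(250), lambda x: x * 2, written.append, batch_size=100)
    return written
```

With a single writer, the database sees one stream of batched writes regardless of how many worker threads there are, which separates the question of processing throughput from the question of write contention.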
Partitioning may improve the database performance, as may changing the indexes on the relevant tables. If you can put separate partitions on separate physical disks then that can improve IO when only one partition needs to be accessed by a given SQL statement.
Dropping indexes that aren't needed, changing the order of index columns to match the queries, and even changing the index type can also improve performance.
As with everything: profile it before and after every proposed change.