I have a process (C++ code) that reads from and writes to a database (Oracle).
But it takes a long time for the process to finish.
I was thinking of creating partitions in the tables that this process queries.
And then making the process multi-threaded so that each thread (one per partition) can read/write the data in parallel.
I will be creating a DB connection per thread.
Will the writes slow it down?
Will this work?
Is there any other way of improving performance (all queries are already tuned and optimized)?
Thanks,
Nikhil
If the current bottleneck is writing the data to the database, then creating more threads to write more data may or may not help, depending on how the data is partitioned and on whether the writes can occur concurrently or interfere with each other (either at the database lock level or at the database disk I/O level).
Creating more threads will instead allow the application to process more data, and queue it up for writing to the database, assuming that there is sufficient hardware concurrency (e.g. on a multicore machine) to handle the additional threads.
Partitioning may improve the database performance, as may changing the indexes on the relevant tables. If you can put separate partitions on separate physical disks then that can improve IO when only one partition needs to be accessed by a given SQL statement.
Dropping indexes that aren't needed, changing the order of index columns to match the queries, and even changing the index type can also improve performance.
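As an illustration only (table, column, and partition names here are made up, not taken from the question), a range-partitioned Oracle table with a LOCAL index might look like the sketch below; each worker thread could then confine its statements to one partition:
-- Hypothetical layout: range-partition by date; the LOCAL index is partitioned
-- the same way, so each partition's index segment is maintained independently.
CREATE TABLE orders_part (
    order_id    NUMBER,
    customer_id NUMBER,
    order_date  DATE,
    amount      NUMBER
)
PARTITION BY RANGE (order_date) (
    PARTITION p_2023 VALUES LESS THAN (DATE '2024-01-01'),
    PARTITION p_2024 VALUES LESS THAN (DATE '2025-01-01')
);

CREATE INDEX orders_part_cust_ix ON orders_part (customer_id) LOCAL;

-- A worker thread can restrict itself to a single partition:
SELECT * FROM orders_part PARTITION (p_2024) WHERE customer_id = :id;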
As with everything: profile it before and after every proposed change.
I have a backend in NestJS using TypeORM and Postgres. This backend saves and reads data frequently from the database. In this database we are dealing with row counts of 10k+ at times that need to get updated, saved, or created.
In this particular case where I need some brain juice, I have a table (let's call it table A):
the backend fetches data from table a every few seconds
the content in table A needs to get updated frequently (properties and values overwritten). I am doing this updating task from a separate application backend solely for this use case.
Example case
Table A holds 100K records
The update-service splits these 100K records into chunks of 5 and, in parallel, updates 25K records each. While doing so, the main application that retrieves data from the backend slows down.
What is the best way to have performant reads and writes in parallel? I am assuming the slowdown comes from locks (the main backend retrieves data while the update service tries to update), but I am not sure, as I don't have much experience working with databases.
Don't assume, assert.
While you are experiencing bad performance, check how the operating system's resources are doing; in this case, mostly CPU and disk. If one of them is maxed out, you know what is going on, and you either have to reduce the degree of parallelism or make the system stronger.
It is also interesting to look at wait events in PostgreSQL:
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event;
That will show I/O related events if you are running out of disk bandwidth, but it will also show database-internal contention that you can potentially hit with very high degrees of parallelism.
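If you specifically suspect lock waits (as guessed in the question), PostgreSQL 9.6 and later can also show which sessions are blocked and by whom; a minimal sketch:
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;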
We are exploring using Cassandra as a way to store time-series-type data, so this may be somewhat of a noob question. One of the use cases is to read data from a Kafka stream, look for matches, and increment a counter (e.g. 5 customers have clicked through link alpha on page beta, so increment (beta, alpha) by 5). However, we expect a very wide degree of parallelism to keep up with the load, so there may be more than one consumer reading from Kafka at the same time.
My question is: How would Cassandra resolve multiple simultaneous writes to a given counter from multiple sources?
It's my understanding that multiple writes to the counter with different timestamps will be added to the counter in the timestamp order received. However, if there were a simultaneous write with the exact same timestamp, would the LWW model of Cassandra throw out one of those counter increments?
If we were to have a large cluster (100+ nodes), ALL or QUORUM writes may not be sufficiently performant to keep up with the message traffic. Writes with THREE would seem likely to result in a situation where process #1 writes to nodes A, B, and C, but process #2 might write to X, Y, and Z. Would LWTs work here, or do they not play well with counter activity?
I would try out a proof of concept and benchmark it; it will most likely work just fine. Counters are not super performant in Cassandra, though, especially if there will be a lot of contention.
Counters are not like normal writes with a simple LWW; they use Paxos with some pessimistic locking and specialized caches. The partition lock contention will slow it down some, and Paxos is an expensive multiple-network-hop process with reads before writes.
Use QUORUM; don't try to do something funky with CLs for counters, especially before benchmarking to know whether you need it. A 100-node cluster should be able to handle a lot as long as you're not trying to update all the same partitions constantly.
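For reference, a minimal CQL sketch of the click-counter use case from the question (table and column names are made up). Counter updates are increments rather than plain overwrites, so concurrent consumers incrementing the same cell are not resolved last-write-wins:
CREATE TABLE page_link_clicks (
    page text,
    link text,
    hits counter,
    PRIMARY KEY (page, link)
);

-- Counter columns can only be modified with increment/decrement UPDATEs;
-- run these at QUORUM, as suggested above.
UPDATE page_link_clicks SET hits = hits + 5 WHERE page = 'beta' AND link = 'alpha';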
Query 1: Event data from devices is stored in a Cassandra table. Obviously this is time series data. If we need to store older-dated events (if cached in the device due to some issue) at the current time, are we going to get a performance issue? If yes, what is the solution to avoid that?
Query 2: Is it good practice to write the event into the Cassandra table as soon as the event comes in? Or shall we queue events for some time and write multiple events in one go, if that improves Cassandra write performance significantly?
Q1: This all depends on the table design. Usually this shouldn't be an issue, but it may depend on your access patterns & compaction strategy. If you have the table structure, please share it.
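For illustration only (the actual schema isn't shown in the question), a common time-series layout buckets the partition key by device and day and picks a compaction strategy suited to time-ordered data:
CREATE TABLE device_events (
    device_id  text,
    event_day  date,
    event_time timestamp,
    payload    text,
    PRIMARY KEY ((device_id, event_day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy'};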
Q2: Individual writes shouldn't be a problem, but it really depends on your throughput requirements. If you write several data points that belong to the same partition key, you could potentially use unlogged batches; in that case Cassandra will perform only one write for the several inserts in the batch. Please read this document.
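A sketch of such an unlogged batch, reusing the illustrative device_events table above; all rows share the same partition key (device_id, event_day), which is what lets Cassandra apply the batch as a single write:
BEGIN UNLOGGED BATCH
    INSERT INTO device_events (device_id, event_day, event_time, payload)
    VALUES ('sensor-1', '2024-05-01', '2024-05-01 10:00:00+0000', 'evt-1');
    INSERT INTO device_events (device_id, event_day, event_time, payload)
    VALUES ('sensor-1', '2024-05-01', '2024-05-01 10:00:05+0000', 'evt-2');
APPLY BATCH;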
I have a NodeJS application that needs to stream data from an RDS Postgres, perform some relatively expensive CPU operations on the data, and insert it into another database. The CPU intensive portion I've offloaded into an AWS Lambda, such that the Node application will get a batch of rows and immediately pass them to the Lambda for processing. The bottleneck appears to be the speed in which the data can be received from Postgres.
In order to utilize multiple connections to the DB, I have an algorithm which is effectively leapfrogging on sorted IDs, so that many concurrent connections can be maintained. Ex: one connection fetches ids 1-100, a second fetches ids 101-200, etc., and then when the first returns it might fetch ids 1001-1100. Is this relatively standard practice? Is there a faster method for pulling the data out for processing?
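A simplified sketch of one worker's query (table and column names are illustrative):
SELECT id, payload
FROM source_table
WHERE id > $1     -- last id this worker has already processed
ORDER BY id
LIMIT 100;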
So long as I am below the database's max_connections, would it be arguably beneficial to add more, possibly as additional concurrent applications streaming data out of it? Both the application and the RDS are currently in the VPC, and the CPU utilization on the RDS gets to about 30%, with memory at 60%.
It would likely be MUCH faster to dump your Postgres database into a CSV file or export it directly to flat files, dump the flat files into S3 after splitting them up, then have workers process each batch of files on their own.
Streaming data out of Postgres (particularly if you're doing it for millions of items) will take a LOT of IO and a very long time.
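As an illustration of that export step (names and ranges are made up): RDS gives you no direct filesystem access, so a client-side \copy from psql is one way to produce the files before splitting and uploading them to S3, e.g. one id range per file:
\copy (SELECT * FROM source_table WHERE id BETWEEN 1 AND 1000000 ORDER BY id) TO 'source_table_0001.csv' WITH (FORMAT csv, HEADER)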
I did a test, and it seems there is no obvious performance improvement:
to insert the same number of rows, 2 concurrent threads take almost the same time as a single thread.
Is there any way to improve LOAD DATA INFILE performance? Is multi-threading the wrong approach?
Multithreading improves performance if it can increase the parallel use of computing resources. If the file and the DB are on the same hard disk, you are probably out of luck. If you load data, compute something heavy and then write to the DB, you might be able to use the CPU and the disk in parallel. For a test, create one thread to read the input file and a second thread to write to the DB (using fake data for the DB). If that is faster than reading a bit and writing a bit, then multithreading can improve performance. I mention that because it is far from clear what your test actually did.
If you are confident about the consistency of the data you are bulk loading, you might want to use the following:
SET FOREIGN_KEY_CHECKS=0;
-- do the bulk load
SET FOREIGN_KEY_CHECKS=1;
This will temporarily disable foreign key checks, making the data insertion way faster.
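For completeness, a sketch of what the bulk load between those two statements might look like (file path, table name, and format clauses are assumptions, not from the question); the file has to live in a directory the server allows, typically governed by secure_file_priv:
LOAD DATA INFILE '/var/lib/mysql-files/events.csv'
INTO TABLE events
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;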
Regarding multithreading, try moving to the lowest isolation level for both your threads:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
-- bulk load
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ; -- default setting
Change the last line to whatever your normal isolation level is. Alternatively, you could use a temporary variable to store the previous level:
SET @tx_isolation_orig = @@tx_isolation;
SET @@tx_isolation = 'READ-UNCOMMITTED';
-- bulk load
SET @@tx_isolation = @tx_isolation_orig;
Further info:
InnoDB foreign key constraints
Isolation levels