I have a backend in nestjs using typeorm and postgres. This backend saves and reads data frequently from the database. In this database we are dealing with row counts of 10k + at times that needs to get updated and saved or created.
In this particular case where I need some brain juice I have a table (lets call it table a)
the backend fetches data from table a every few seconds
the content in table A needs to get updated frequently (properties and values overwritten). I am doing this updating task from a several application backend solely for this use-case.
Example case
Table A holds 100K records
update-service splits these 100K records into chunks of 5 and parallell updates 25K records each. While doing so, the main application that retrieves data from the backend slows down.
What is the best way to have performant read and write in parallel? I am assuming the slow down comes from locks (main backend retrieves data while update service tries to update) but I am not sure as I have not that much experience working with databases.
Don't assume, assert.
While you experiencing bad performance, check how the operating system's resources are doing; in this case, mostly CPU and disk. If one of them is maxed out, you know what is going on, and you either have to reduce the degree of parallelism or make the system stronger.
It is also interesting to look at wait events in PostgreSQL:
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event;
That will show I/O related events if you are running out of disk bandwidth, but it will also show database-internal contention that you can potentially hit with very high degrees of parallelism.
Related
We are running an elastic pool in Azure running multiple databases, when running 1 of our larger imports this seems to take longer than we are used to. During these imports we ran at 6 cores as a test. All databases are allowed to use all cores.
On our local enviroment, it inserts about 100k records per second, however, the same dataset on Azure does about 1k per second (our vm) to 4k per second (dev laptop).
During this insert, the database only uses 14% log IO, 5% CPU and 0% DataIO.
When setting up a new database using DTU model in P2 we are noticing the same experience. So we are not even hitting the limits of the database
The table contains about 36 columns which are all required.
We have tried this using BulkInsert in the following way using different batchsizes
BulkConfig b = new BulkConfig();
b.BatchSize = 100000;
await dbcontext.BulkInsertAsync(entities, b);
As well as using standard EntityFramework addranges using smaller batches. We even went as far as using the manually written SqlBulkCopy methods, however all with no dice.
Now the question is mainly, is this a software issue? Are we running into issues in our AzureDB? Do we need to change the way we do Bulk imports?
Edit:
Attempted to run the import using the TempDB Setting in BulkInsert, however this also does not increase performance. LogIO is still at 14%.
Iterate through the dataset on the application layer, invoking a
stored procedure for each row that will perform an INSERT/UPDATE
action based on the existence of a record with a certain key. If the
number of records to upsert is limited, this strategy may work well;
otherwise, roundtrips and log writes will have a major influence on
speed.
To minimise roundtrips and log writes and increase throughput, use
bulk insert approaches like the SqlBulkCopy class in ADO.NET to
upload the full dataset to Azure SQL Database and then execute all
the INSERT/UPDATE (or MERGE) operations in a single batch. Overall
execution times may be reduced from hours to minutes/seconds using
this method.
Here, is a discussion related to same scenario: Optimize Azure SQL Database Bulk Upsert scenarios - link.
I'm using Cassandra Java driver with a fetch size set to 1k. I need to query all records in a table and perform some time consuming action for a every row.
What will happen if I'll keep the ResultSet open (not fully iterated) for a one day?
What I don't care about:
consistency. If some new record will be written in the meantime, I'm ok to fetch it. However, I'm fine if I won't get it
fault tolerance. If during that process some node will fail, I'm fine if the query will fail too. However, I would like to detect that from the client perspective.
What I care about:
Cassandra resource utilization - I don't want to cause cluster outage due to some blocked resources
lateness - I don't want to block (or slow down much) cluster for other consumers of that table
I would like to get all records which existed when I started the query (assuming no deletions). However, they don't have to be up to date
The paging state is the information about the last read data (literally serialized partition key, clustering, and remaining). When sent to coordinator it will look for everything greater than that. So there are no resources in the server spent for this and no performance impact vs a normal read.
Cassandra does not have any features to allow isolation even within a single query. If data has changed from when the first query was made and the second, you will get the up to date information.
I have a NodeJS application that needs to stream data from an RDS Postgres, perform some relatively expensive CPU operations on the data, and insert it into another database. The CPU intensive portion I've offloaded into an AWS Lambda, such that the Node application will get a batch of rows and immediately pass them to the Lambda for processing. The bottleneck appears to be the speed in which the data can be received from Postgres.
In order to utilize multiple connections to the DB, I have an algorithm which is effectively leapfrogging on sorted IDs, so that many concurrent connections can be maintained. Ex: 1 connection fetches ids 1-100, second one fetches ids 101-200, etc, and then when the first returns maybe it fetches ids 1001-1100. Is this relatively standard practice? Is there a faster method for pulling the data out for processing?
So long as I am below the database's max_connections, would it be arguably beneficial to add more, possibly as additional concurrent applications streaming data out of it? Both the application and the RDS are currently in the VPC, and the CPU utilization on the RDS gets to about 30%, with memory at 60%.
It would likely be MUCH faster to dump your Postgres database into a CSV file or export it directly to flat files, dump the flat files into S3 after splitting them up, then have workers process each batch of files on their own.
Streaming data out of Postgres (particularly if you're doing it for millions of items) will take a LOT of IO and a very long time.
It may be too much turkey over the holidays, but I've been thinking about a potential problem that we could have with Couchbase.
Currently we paginate based on time, but I'm thinking a similar issue could occur with other values used for paging for example the atomic counter. I'll try to explain best I can, this would only occur in a load balanced environment.
For example say we have 4 servers load balanced and storing data to our Couchbase cluster. We sort our records based on timestamps currently. If any of the 4 servers writing the data starts to lag behind the others than our pagination would possibly be missing records when retrieving client side. A SQL DB auto-increment and timestamps for example can be created when the record is stored to the DB which will avoid similar issues. Using a NoSql DB like Couchbase you define the data you need to retrieve on before it is stored to the DB. So what I am getting at is if there is a delay in storing to the DB and you are retrieving in a pagination fashion while this delay has occurred, you run the real possibility of missing data. Since we are paging that data may never be viewed.
Interested in what other thoughts people have on this.
EDIT**
Response to Andrew:
Example a facebook or pintrest type app is storing data to a DB, they have many load balanced servers from the frontend writing to the db. If for some reason writing is delayed its a non issue with a SQL DB because a timestamp or auto increment happens when the data is actually stored to the DB. There will be no missing data when paging. asking for 1-7 will give you data that is only stored in the DB, 7-* will contain anything that is delayed because an auto-increment value has not been created for that record becuase it is not actually stored.
In Couchbase its different, you actually get your auto increment value (atomic counter) and then save it. So for example say a record is going to be stored as atomic counter number 4. For some reasons this is delayed in storing to the DB. Other servers are grabbing 5, 6, 7 and storing that data just fine. The client now asks for all data between 1 and 7, 4 is still not stored. Then the next paging request is 7 to *. 4 will never be viewed.
Is there a way around this? Can it be modelled differently in CB, or is this just a potential weakness in CB when needing to page results. As I mentioned are paging is timestamp sensitive.
Michael,
Couchbase is an eventually consistent database with respect to views. It is ACID with respect to documents. There are durability interfaces that let you manage this. This means that you can rest assured you won't lose data and that indexes will catch up eventually.
In my experience with Couchbase, you need to expect that the nodes will never be in-sync. There are many things the database is doing, such as compaction and replication. The most important thing you can do to enhance performance is to put your views on a separate spindle from the data. And you need to ensure that your main data spindles across your cluster can sustain between 3-4 times your ingestion bandwidth. Also, make sure your main document key hashes appropriately to distribute the load.
It sounds like you are discussing a situation where the data exists in your system for less time than it takes to be processed through the view system. If you are removing data that fast, you need either a bigger cluster or faster disk arrays. Of the two choices, I would expand the size of your cluster. I like to think of Couchbase as building a RAIS, Redundant Array of Independent Servers. By expanding the cluster, you reduce the coincidence of hotspots and gain disk bandwidth. My ideal node has two local drives, one each for data and views, and enough RAM for my working set.
Anon,
Andrew
I am evaluating sensor data collection systems with the following requirements,
1 million endpoints sending in 100 bytes of data every minute (as a time series).
Basically millions of small writes to the storage.
This data is write-once, so basically it never gets updated.
Access requirements
a. Full data for a user needs to be accessed periodically (less frequent)
b. Partial data for a user needs to be access periodically (more frequent). For e.g I need sensor data collected over the last hour/day/week/month for analysis/reporting.
Have started looking at Hive/HDFS as an option. Can someone comments on the applicability of Hive in such a use case? I am concerned that while the distributed storage needs would work, it seems more suited to data warehousing applications than real time data collection/storage.
Do HBase/Cassandra make more sense in this scenario?
I think HBase can be a good option for you. In fact there's already an open/source implementation in HBase which solves similar problem that you might want to use. Take a look at openTSB which is an open source implementation for solving similar problems. Here's a short excerpt from their blurb:
OpenTSDB is a distributed, scalable Time Series Database (TSDB)
written on top of HBase. OpenTSDB was written to address a common
need: store, index and serve metrics collected from computer systems
(network gear, operating systems, applications) at a large scale, and
make this data easily accessible and graphable. Thanks to HBase's
scalability, OpenTSDB allows you to collect many thousands of metrics
from thousands of hosts and applications, at a high rate (every few
seconds). OpenTSDB will never delete or downsample data and can easily
store billions of data points. As a matter of fact, StumbleUpon uses
it to keep track of hundred of thousands of time series and collects
over 600 million data points per day in their main production
datacenter.
There are actually quite a few people collecting sensor data in a time-series fashion with Cassandra. It's a very good fit. I recommend you read this article on basic time series in Cassandra for an idea of what your data model would be like.
Writes in Cassandra are extremely cheap, so even a moderately sized cluster could easily handle one million writes per minute.
Both of your read queries could be answered very efficiently. For the second type of query, where you're reading data for a slice of time for a single sensor, you would end up reading a contiguous slice from a single row; this should take about 10ms for a completely cold read. For the first type of query, you would simply be running several of the per-sensor queries in parallel. Assuming you store a basic map of users to sensor IDs, you would lookup all of the sensor IDs for a user with one query, and then your second query would fetch the data for all of those sensors (although you might break up this query if the number of sensors is high).
Hive and HDFS don't really make sense when you're talking about real-time queries, as they're more suited for long-running batch jobs.