I am trying to do some big bulk inserts to Postgres via node-postgres
When the bindings array exceeds 65536 values, only the remainder of the values gets passed to Postgres, and when the query runs I get the error:
[error: bind message supplies 4 parameters, but prepared statement "" requires 65540]
Any thoughts?
Thanks in advance.
Prepared statements in node-postgres are not suitable for bulk inserts, because they do not support multi-query statements. You also shouldn't stretch a single array of bind variables across all of the inserts at once; that won't scale well and has its own limits, like the one you hit there.
Instead, you should use multi-value inserts, in the format of:
INSERT INTO table(col1, col2, col3) VALUES
(val-1-1, val-1-2, val-1-3),
(val-2-1, val-2-2, val-2-3),
...etc
Split your bulk insert into queries like this, with up to 1,000 - 10,000 records each, depending on the size of each record, and execute each one as a simple query.
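For example, here's a minimal sketch of that batching pattern with the pg module (my_table, the column names and the batch size are placeholders, and rows is assumed to be an array of objects):

// Hedged sketch: build multi-value INSERTs in batches with node-postgres.
const { Pool } = require('pg');
const pool = new Pool(); // connection settings come from the environment

async function bulkInsert(rows, batchSize = 1000) {
  for (let i = 0; i < rows.length; i += batchSize) {
    const batch = rows.slice(i, i + batchSize);
    const params = [];
    const values = batch.map((row, r) => {
      params.push(row.col1, row.col2, row.col3);
      return `($${r * 3 + 1}, $${r * 3 + 2}, $${r * 3 + 3})`;
    });
    await pool.query(
      `INSERT INTO my_table (col1, col2, col3) VALUES ${values.join(', ')}`,
      params
    );
  }
}

Each batch of 1,000 rows with 3 columns binds only 3,000 parameters, which stays well below the limit you hit in the error above.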
See also the Performance Boost article to better understand INSERT scalability.
Related
Google's Spanner supports SQL "bulk" inserts, e.g. from the docs:
INSERT INTO Singers (SingerId, FirstName, LastName)
VALUES(1, 'Marc', 'Richards'),
(2, 'Catalina', 'Smith'),
(3, 'Alice', 'Trentor');
However, I cannot find any support for this in the Go client. The Go client's Statement type supports single-row inserts, and I have used the BatchUpdate() function to execute a batch of single-row inserts, but I cannot find any support for bulk inserts.
Does the Spanner client support bulk inserts?
Yes, there are a number of ways that you can do that:
The one mentioned by yourself: Use the BatchUpdate method to execute a collection of individual INSERT statements. An example can be found here.
You can execute an INSERT statement that inserts multiple rows by calling the Update method with a SQL string that inserts multiple rows. An example can be found here.
The most efficient way to insert a bulk of rows is to use mutations instead of DML. Use the Apply method to insert a collection of (insert) mutations. An example can be found here.
Right now my workflow looks like this:
get a list of rows from a Postgres database (let's say 10,000)
for each row, call an API endpoint and get a value back, so 10,000 values returned from the API
for each row that got a value back, update a field in the database: 10,000 rows updated
Right now I am doing an update after each API fetch, but as you can imagine this isn't the most optimized way.
What other option do I have?
The bottleneck in that code is probably fetching the data from the API. The trick below only lets you send many small queries to the DB faster, without having to wait a round-trip time between each update.
To do multiple updates in a single query, you can use common table expressions and pack multiple small queries into a single CTE query:
https://runkit.com/embed/uyx5f6vumxfy
// Each .with() wraps one UPDATE in a named CTE; .select(1) turns it into a single query.
const knex = require('knex')({ client: 'pg' /* plus your connection config */ });

knex
  .with('firstUpdate', knex.raw('?', [knex('table').update({ colName: 'foo' }).where('id', 1)]))
  .with('secondUpdate', knex.raw('?', [knex('table').update({ colName: 'bar' }).where('id', 2)]))
  .select(1)
The knex.raw trick there is a workaround, since the .with(string, function) implementation has a bug.
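To scale this to thousands of rows, you could generate the CTEs from an array instead of writing each .with() by hand. A rough sketch (batchUpdate is a hypothetical helper; updates is assumed to be an array of { id, value } objects, and the table/column names are placeholders):

// Hypothetical helper: packs one chunk of row updates into a single CTE query.
function batchUpdate(knex, updates) {
  let query = knex.queryBuilder();
  updates.forEach((u, i) => {
    query = query.with(
      `update_${i}`,
      knex.raw('?', [knex('table').update({ colName: u.value }).where('id', u.id)])
    );
  });
  return query.select(1);
}

You would then run this for chunks of a few hundred updates at a time rather than one statement per row, which is where the round-trip savings come from.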
I have a table with two binary columns used to store strings that are at most 64 bytes long, and two integer columns. The table has 10+ million rows and uses 2GB of the 7GB of available memory, so there is plenty of memory left. I also configured MemSQL based on http://docs.memsql.com/latest/setup/best_practices/.
For simple SELECT SQL where the binary columns are compared to certain values, MemSQL is about 3 times faster than SQL Server, so we can rule out issues such as configuration or hardware on the MemSQL side.
For complex SQLs that use
substring operations in the SELECT clause and
substring and length operations in the WHERE clause,
MemSQL is about 10 times slower than SQL Server. The performance of these SQLs on MemSQL was measured after the first few runs, to make sure SQL compilation time was not included. It looks like MemSQL's performance issue has to do with how it handles binary columns and substring and string-length operations.
Has anyone seen similar performance issues with MemSQL? If so, what were the column types and SQL operations?
Has anyone seen similar performance issues with MemSQL for substring and length operations on varchar columns?
Thanks.
Michael,
My recommendation: go back to rowstore, but with VARBINARY instead of BINARY, consider putting indexes on the columns or creating persisted columns, and try rewriting your predicate with LIKE.
If you paste an example query, I can help you transform it.
The relevant docs are here:
dev.mysql.com/doc/refman/5.0/en/pattern-matching.html
docs.memsql.com/4.0/concepts/computed_columns
Good luck.
It's hard to give general answers to perf questions, but in your case I would try a MemSQL columnstore table as opposed to an in-memory rowstore table. Since you are doing full scans anyway, you'll get the benefit of having the column data stored right next to each other.
http://docs.memsql.com/4.0/concepts/columnar/
I am bypassing the ORM and using the Model.query function to query and return a "large" result set from PostgreSQL. The query returns around 2 million rows. When running the query directly against Postgres, it returns in around 20s. The query fails silently when executed from sails. Is there a limit on the number of rows that can be returned?
Is there a limit on the number of rows that can be returned?
No, there is no limit.
The query fails silently when executed from sails
What does "fails silently" mean? How do you know that it's failed? It may still be processing; or the adapter might have a connection timeout that you're breaching.
Two million rows serialized out of the database, translated to JSON, and piped down to the client is much different than just running SQL directly on the database. It could take 20x longer, depending on your system resource situation. I strongly recommend using sails.js's paging features to pull the data out in chunks. Pulling two million rows in one operation from a web server doesn't make a lot of sense.
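As a rough sketch of that chunked approach on a recent Sails/Waterline setup (MyModel, the id column and the page size are placeholders):

// Hedged example: pull the rows out in pages instead of one 2-million-row query.
const PAGE_SIZE = 10000;

async function processAllRows() {
  for (let page = 0; ; page++) {
    const rows = await MyModel.find()
      .sort('id ASC')
      .skip(page * PAGE_SIZE)
      .limit(PAGE_SIZE);
    if (rows.length === 0) break;
    // ...process or stream this chunk to the client...
  }
}

For very deep pages, filtering on the last seen id instead of using skip avoids large OFFSETs in the database, but the idea is the same: never materialize all two million rows in one go.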
I have the following table (using CQL3):
create table test (
    shard text,
    tuuid timeuuid,
    some_data text,
    status text,
    primary key (shard, tuuid, some_data, status)
);
I would like to get rows ordered by tuuid, but this is only possible when I restrict shard - I gather this is due to performance.
I have shard purely for sharding, and I can potentially restrict its range of values to some small range, say [0, 16). Then I could run a query like this:
select * from test where shard in (0,...,15) order by tuuid limit L;
I may have millions of rows in the table, so I would like to understand the performance characteristics of such an ORDER BY query. It would seem like the performance could be pretty bad in general, BUT with a LIMIT of some reasonable number (on the order of 10K) this may not be so bad - i.e. a 16-way merge, but with a fairly low limit.
Any tips, advice or pointers into the code on where to look would be appreciated.
Your data is sorted according to your column key, so the performance issue in the merge in your query above does not come from the WHERE clause but from the LIMIT clause, afaik.
Your columns are inserted IN ORDER according to tuuid so there is no performance issue there.
If you are fetching too many rows at once, I recommend creating a test_meta table where you store the latest timeuuid every X inserts, to get an upper bound on the rows your query will fetch. Then you can change your query to:
select * from test where shard in (0,...,15) and tuuid > x and tuuid < y;
In short: make use of your column keys and get rid of the LIMIT. Alternatively, in Cassandra 2.0 there will be pagination, which will help here too.
Another issue I stumbled over: you say that
I may have millions of rows in the table
but according to your data model, you will have exactly as many rows as you have shard values. shard is your row key and - together with the partitioner - will determine the distribution/sharding of your data.
Hope that helps!
UPDATE
From my personal experience, Cassandra performs quite well during heavy reads as well as writes. When result sets became too large, I experienced memory issues on the receiving/client side rather than timeouts on the server side. Still, to prevent either, I recommend having a look at the upcoming (2.0) pagination feature.
In the meantime:
Try to investigate using the trace functionality in 1.2.
If you are mostly reading the "latest" data, try adding a reversed type.
For general optimizations like caches etc., first read how Cassandra handles reads on a node, and then see this tuning guide.