psycopg2: command is too large when storing large values in CockroachDB

I'm looking to store a ~0.5G value into a single field, but psycopg2 is not cooperating:
crdb_cursor.execute(sql.SQL("UPSERT INTO my_db.my_table (field1, field2) VALUES (%s, %s)"), ['static_key', 'VERY LARGE STRING'])
psycopg2.InternalError: command is too large: 347201019 bytes (max: 67108864)
I've already run SET CLUSTER SETTING sql.conn.max_read_buffer_message_size = '1 GiB';
Is there any (better) way to store this large a string into CockroachDB?
Clients will be requesting this entire string at a time, and no intra-string search or match operations will be performed.
I understand that there will be performance implications to storing large singular fields in a SQL database.

It seems that, at the moment, psycopg2 isn't capable of handling strings that large, and neither is CockroachDB. CockroachDB recommends keeping values around 1MB, and with the default configuration the limit is somewhere between 1MB and 20MB.
For storing a string that is several hundred megabytes, I would suggest using some kind of object store and then storing a reference to the object in the database. Here is an example of a blob store built on top of CockroachDB that may give you some ideas.
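A minimal sketch of that approach, assuming an S3-compatible object store accessed with boto3; the bucket name, the field2_ref column, and the helper names are illustrative rather than taken from the question's schema:

import boto3
import psycopg2

def store_large_value(conn, key, payload):
    # Put the large payload in the object store...
    s3 = boto3.client("s3")
    object_key = "my_table/" + key
    s3.put_object(Bucket="my-bucket", Key=object_key, Body=payload)
    # ...and keep only the small reference in CockroachDB.
    with conn.cursor() as cur:
        cur.execute(
            "UPSERT INTO my_db.my_table (field1, field2_ref) VALUES (%s, %s)",
            (key, object_key),
        )
    conn.commit()

def load_large_value(conn, key):
    s3 = boto3.client("s3")
    with conn.cursor() as cur:
        cur.execute("SELECT field2_ref FROM my_db.my_table WHERE field1 = %s", (key,))
        (object_key,) = cur.fetchone()
    return s3.get_object(Bucket="my-bucket", Key=object_key)["Body"].read()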

Related

How does Cassandra store variable data types like text

My assumption is that Cassandra stores fixed-length data in a column family, e.g. a column family with id (bigint), age (int), description (text), picture (blob). Now description and picture have no length limit. How does Cassandra store those? Does it externalize them through an ID -> location mapping?
For example, it looks like relational databases use a pointer to point to the actual location of large texts. See how it is done.
Also, it looks like MySQL recommends CHAR over VARCHAR for better performance, I guess simply because there is no need for an "id lookup". See: mysql char vs varchar
Cassandra stores individual cells (column values) in its on-disk files ("sstables") as a 32-bit length followed by the data bytes. So string values do not need to have a fixed size, nor are they stored as pointers to other locations - the complete string appears as-is inside the data file.
The 32-bit length limit means that each "text" or "blob" value is limited to 2GB in length, but in practice you shouldn't use anything even close to that - the Cassandra documentation suggests you shouldn't go above 1MB. There are several problems with having very large values:
Because values are not stored as pointers to some other storage, but rather inline in the sstable files, these large strings get copied around every time sstable files get rewritten, namely during compaction. It would be more efficient to keep the huge string on disk in a separate file and just copy around pointers to it - but Cassandra doesn't do this.
The Cassandra query language (CQL) does not have any mechanism for storing or retrieving a partial cell. So if you have a 2GB string, you have to retrieve it entirely - there is no way to "page" through it, nor a way to write it incrementally.
In Scylla, large cells will result in large latency spikes because Scylla will handle the very large cell atomically and not context-switch to do other work. In Cassandra this problem will be less pronounced but will still likely cause problems (the thread stuck on the large cell will monopolize the CPU until preempted by the operating system).
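To make the 32-bit length point above concrete, here is a conceptual sketch (not Cassandra's actual sstable code) of a length-prefixed cell encoding and the 2GB ceiling it implies:

import struct

def encode_cell(value: bytes) -> bytes:
    # Conceptual illustration only: a cell is written as a 32-bit length
    # followed by the raw bytes, so a single value cannot exceed
    # 2**31 - 1 bytes (about 2GB).
    if len(value) > 2**31 - 1:
        raise ValueError("cell value exceeds the 2GB length limit")
    return struct.pack(">i", len(value)) + value

encoded = encode_cell(b"hello")  # b'\x00\x00\x00\x05hello'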

How can I add data larger than 2 MB as a single entry in Cosmos DB?

As there is a size limit in Cosmos DB for a single entry, how can I add data larger than 2 MB as a single entry?
The 2MB limit is a hard-limit, not expandable. You'll need to work out a different model for your storage. Also, depending on how your data is encoded, it's likely that the actual limit will be under 2MB (since data is often expanded when encoded).
If you have content within an array (the typical reason why a document would grow so large), consider refactoring this part of your data model (perhaps store references to other documents, within the array, vs the subdocuments themselves). Also, with arrays, you have to deal with an "unbounded growth" situation: even with documents under 2MB, if the array can keep growing, then eventually you'll run into a size limit issue.
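A sketch of that refactoring, using hypothetical document shapes (the field names and ids are illustrative):

# Instead of embedding an ever-growing array of subdocuments in one item...
order_with_embedded_items = {
    "id": "order-1",
    "items": [
        {"sku": "a", "qty": 1, "notes": "..."},  # this array can grow without bound
    ],
}

# ...store only references in the parent and keep each subdocument as its own item.
order_with_references = {
    "id": "order-1",
    "itemIds": ["order-1-item-0001", "order-1-item-0002"],
}

item_document = {
    "id": "order-1-item-0001",
    "orderId": "order-1",  # lets you query all items belonging to an order
    "sku": "a",
    "qty": 1,
}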

Hive/Impala performance with string partition key vs Integer partition key

Are numeric columns recommended for partition keys? Will there be any performance difference when we do a select query on numeric column partitions vs string column partitions?
Well, it makes a difference if you look up the official Impala documentation.
Instead of elaborating, I will paste the section from the doc, as I think it states it quite well:
"Although it might be convenient to use STRING columns for partition keys, even when those columns contain numbers, for performance and scalability it is much better to use numeric columns as partition keys whenever practical. Although the underlying HDFS directory name might be the same in either case, the in-memory storage for the partition key columns is more compact, and computations are faster, if partition key columns such as YEAR, MONTH, DAY and so on are declared as INT, SMALLINT, and so on."
Reference: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_string.html
No, there is no such recommendation. Consider this:
The thing is that a partition in Hive is represented as a folder with a name like 'key=value' (or sometimes just 'value'), but either way it is a string folder name. So the partition key value is stored as a string and cast during read/write. It is not packed inside the data files and is not compressed.
Due to the distributed/parallel nature of MapReduce and Impala, you will never notice the difference in query processing performance. Also, all data is serialized to be passed between processing stages, then deserialized and cast to some type again; this can happen many times for the same query.
There is a lot of overhead created by distributed processing and by serializing/deserializing data. Practically, only the size of the data matters. The smaller the table (its file size), the faster it works. But you will not improve performance by restricting types.
Big string values used as partition keys can affect metadata DB performance, and the number of partitions being processed can also affect performance. Again, the same point applies: only the size of the data matters here, not the types.
1 and 0 can be better than 'Yes' and 'No' just because of size, and compression and parallelism can make even this difference negligible in many cases.
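To illustrate the 'key=value' folder naming mentioned above, a small sketch (illustrative paths only, not Hive internals): whatever type the partition column is declared as, its value ends up in the directory name as a string.

def partition_path(table_root, year, month):
    # The resulting path looks the same whether year/month were declared
    # INT or STRING in the table definition.
    return "{}/year={}/month={}".format(table_root, year, month)

print(partition_path("/warehouse/sales", 2023, 7))      # /warehouse/sales/year=2023/month=7
print(partition_path("/warehouse/sales", "2023", "7"))  # same path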

Does CouchDB distinguish between floats/ints/strings?

I need to optimize disk usage and the amount of data transferred during replication with my CouchDB instance. Does storing numerical data as ints/floats instead of as strings make a difference to file storage and/or during HTTP requests? I've read that JSON treats everything as strings, but newer JSON specs make use of different datatypes (float/int/boolean). What about PouchDB?
CouchDB stores JSON data using native JSON types, so ints and floats are actual number types when serialised to disk. But I doubt you save much disk space compared with storing them as strings. The replication protocol uses JSON, and the internal encoding has no effect on it.
PouchDB on WebSQL and SQLite stores your documents as strings (I don't know about IndexedDB).
So to optimize disk usage, just keep less data. :)
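To see why the disk-space difference is small, a quick sketch: the only difference on the wire between a JSON number and the same value as a JSON string is the surrounding quote characters.

import json

as_numbers = json.dumps({"temp": 21.5, "count": 42})
as_strings = json.dumps({"temp": "21.5", "count": "42"})
print(len(as_numbers), len(as_strings))  # the string form is 4 bytes larger: 2 quotes per value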

sqlite database design with millions of 'url' strings - slow bulk import from csv

I'm trying to create an sqlite database by importing a csv file with urls. The file has about 6 million strings. Here are the commands I've used
create table urltable (url text primary key);
.import csvfile urldatabase
After about 3 million urls the speed slows down a lot and my hard disk keeps spinning continuously. I've tried splitting the csv file into 1/4th chunks but I run into the same problem.
I read similar posts on stackoverflow and tried using BEGIN...COMMIT blocks and PRAGMA synchronous=OFF but none of them helped. The only way I was able to create the database was by removing the primary key constraint from url. But then, when I run a select command to find a particular url, it takes 2-3 seconds which won't work for my application.
With the primary key set on url, the select is instantaneous. Please advise me on what I am doing wrong.
[Edit]
Summary of suggestions that helped :
Reduce the number of transactions
Increase page size & cache size
Add the index later
Remove redundancy from url
Still, with a primary index, the database size is more than double the original csv file that I was trying to import. Any way to reduce that?
Increase your cache size to something large enough to contain all of the data in memory. The default values for page size and cache size are relatively small and if this is a desktop application then you can easily increase the cache size many times.
PRAGMA page_size = 4096;
PRAGMA cache_size = 72500;
That will give you a cache size of just under 300 MB. Remember that the page size must be set before the database is created. The default page size is 1024 and the default cache size is 2000.
Alternatively (or almost equivalently really) you can create the database entirely in an in-memory database and then use the backup API to move it to an on-disk database.
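A sketch of that in-memory-then-backup approach using Python's sqlite3 module (its Connection.backup() wraps SQLite's backup API; the file and sample urls are illustrative):

import sqlite3

mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE urltable (url TEXT PRIMARY KEY)")
with mem:  # one transaction for the whole bulk insert
    mem.executemany(
        "INSERT OR IGNORE INTO urltable (url) VALUES (?)",
        ((u,) for u in ["http://example.com/a", "http://example.com/b"]),
    )

disk = sqlite3.connect("urls.db")
mem.backup(disk)  # copy the fully built database to disk in one pass
disk.close()
mem.close()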
A PRIMARY KEY or UNIQUE constraint will automatically generate an index. An index will dramatically speed up SELECTs, at the expense of slowing down INSERTs.
Try importing your data into a non-indexed table, and then explicitly CREATE UNIQUE INDEX _index_name ON urltable(url). It may be faster to build the index all at once than one row at a time.
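A sketch combining the suggestions above using Python's sqlite3 module (file, table and index names are illustrative, and it assumes the CSV has one url per row with no duplicates):

import csv
import sqlite3

conn = sqlite3.connect("urls.db")
conn.execute("PRAGMA page_size = 4096")   # must be set before the first table is created
conn.execute("PRAGMA cache_size = 72500")
conn.execute("PRAGMA synchronous = OFF")
conn.execute("CREATE TABLE IF NOT EXISTS urltable (url TEXT)")  # no index yet

with conn, open("csvfile", newline="") as f:  # one transaction for the whole load
    conn.executemany(
        "INSERT INTO urltable (url) VALUES (?)",
        ((row[0],) for row in csv.reader(f)),
    )

# Build the unique index once, after the bulk load.
conn.execute("CREATE UNIQUE INDEX url_idx ON urltable(url)")
conn.commit()
conn.close()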
