Is there any alternative to 32K string limits?

I want to store WKT strings that can be quite large, but I'm running into the 32K limit when storing them in object values.
create table A (id integer, wkt object);

So there is a way to store longer strings in objects:
CREATE TABLE IF NOT EXISTS A (
"id" INTEGER,
"wkt" OBJECT (IGNORED)
)
By using IGNORED, the entire object is not indexed, which also prevents it from being used efficiently in other parts of a SQL statement (queries against it will always do a full table scan).
However, subscripts work just fine.
For other readers: WKT can also be stored as the geo_shape type, or used directly with match.
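For completeness, here is a minimal sketch of the pattern from Python, assuming this is CrateDB (OBJECT (IGNORED), geo_shape and match are CrateDB features) and its crate client; the host, table name and object key are illustrative:

from crate import client  # assumption: the CrateDB Python client (pip install crate)

conn = client.connect("localhost:4200")  # host is a placeholder
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE IF NOT EXISTS A (
        "id" INTEGER,
        "wkt" OBJECT (IGNORED)
    )
""")

# The WKT string can exceed 32K because the object's contents are not indexed.
big_wkt = "POLYGON ((" + ", ".join(f"{i} {i}" for i in range(10000)) + "))"
cursor.execute("INSERT INTO A (id, wkt) VALUES (?, ?)", (1, {"geometry": big_wkt}))
cursor.execute("REFRESH TABLE A")  # make the row visible to the next query

# Subscript access still works; filtering on wkt['geometry'] would force a full table scan.
cursor.execute("SELECT wkt['geometry'] FROM A WHERE id = 1")
print(len(cursor.fetchone()[0]))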

Related

Storing arrays in Cassandra

I have lots of fast incoming data that is organised as follows:
Lots of 1D arrays, one per logical object, where the position of each element in the array is important and each element is calculated and produced individually in parallel, and so not necessarily in order.
The data arrays themselves are not necessarily written in order.
The length of the arrays may vary.
The data is read an entire array at a time, so it makes sense to store the whole thing together.
The way I see it, the issue is primarily caused by the way the data is made available for writing. If it was all available together I'd just store the entire lot together at the same time and be done with it.
For smaller data loads I can get away with the postgres array datatype. One row per logical object with a key and an array column. This allows me to scale by having one writer per array, writing the elements in any order without blocking any other writer. This is limited by the rate of a single postgres node.
In Cassandra/Scylla it looks like I have the following options:
Storing each element as its own row, which would be very fast for writing; reads would be more cumbersome but doable, involving potentially lots of very wide scans,
or converting the array to JSON/a string, reading the cell, tweaking the value and then re-writing it, which would be horribly slow and lead to lots of compaction overhead,
or having the writer buffer until it receives all the array values and then writing the array in one go, except the writer won't know how long the array should be and will need a timeout after which it writes down whatever it has, which ultimately means I'll need to update the row at some point in the future if late data turns up.
What other options do I have?
Thanks
Option 1 seems to be a good match:
I assume each logical object has a unique id (or better, a uuid).
In such a case, you can create something like
CREATE TABLE tbl (id uuid, ord int, v text, PRIMARY KEY (id, ord));
Where id (the uuid) is the partition key and ord is the clustering (ordering) key, storing each "array" as a partition and each value as a row.
This allows:
fast retrieval of the entire "array", even a big one, using paging
fast retrieval of a single index in the array
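A minimal sketch of this pattern with the Python cassandra-driver; the contact point and keyspace name are assumptions, and the schema is the one above:

import uuid
from cassandra.cluster import Cluster  # assumption: the DataStax Python driver

cluster = Cluster(["127.0.0.1"])          # contact point is a placeholder
session = cluster.connect("my_keyspace")  # keyspace name is a placeholder

obj_id = uuid.uuid4()

# Elements can be written in any order, by independent writers.
insert = session.prepare("INSERT INTO tbl (id, ord, v) VALUES (?, ?, ?)")
for ord_, value in [(2, "c"), (0, "a"), (1, "b")]:
    session.execute(insert, (obj_id, ord_, value))

# Read the whole "array" back: rows arrive ordered by the clustering key,
# with paging handled by the driver.
rows = session.execute("SELECT ord, v FROM tbl WHERE id = %s", (obj_id,))
array = [row.v for row in rows]

# Or fetch a single index directly.
single = session.execute("SELECT v FROM tbl WHERE id = %s AND ord = %s", (obj_id, 1)).one()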

Azure Cosmos Db as key value store indexing mode

What indexing mode / policy should I use when using cosmos db as a simple key/value store?
From https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy :
None: Indexing is disabled on the container. This is commonly used when a container is used as a pure key-value store without the need for secondary indexes.
Is this because the property used as the partition key is indexed even when indexingMode is set to “none”? I would expect to need to turn indexing on but specify just the partition key’s path as the only included path.
If it matters, I’m planning to use the SQL API.
EDIT:
Here's the information I was missing to understand this:
The item must have an id property; otherwise Cosmos DB will assign one. https://learn.microsoft.com/en-us/azure/cosmos-db/account-databases-containers-items#properties-of-an-item
Since I'm using Azure Data Factory to load the items, I can tell ADF to duplicate the column that has the value I want to use as my id into a new column called id: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview#add-additional-columns-during-copy
I need to use ReadItemAsync (or better yet, ReadItemStreamAsync, since it doesn't deserialize the response) to get the item without using a query.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.cosmos.container.readitemasync?view=azure-dotnet
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.cosmos.container.readitemstreamasync?view=azure-dotnet
When you set indexingMode to "none", the only way to efficiently retrieve a document is by id (e.g. ReadDocumentAsync() or read_item()). This is akin to a key/value store, since you wouldn't be performing queries against other properties; you'd be specifically looking up a document by some known id, and returning the entire document. Cost-wise, this would be ~1RU for a 1K document, just like point-reads with an indexed collection.
You could still run queries, but without indexes, you'll see unusually-high RU cost.
You would still specify the partition key's value with your point-reads, as you'd normally do.
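A minimal sketch of that key/value pattern with the Python azure-cosmos SDK; the endpoint, key, database, container and item values are placeholders, and note that disabling indexing also requires automatic to be false:

from azure.cosmos import CosmosClient, PartitionKey  # assumption: the azure-cosmos Python SDK

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("mydb")

# A container used as a pure key/value store: indexing disabled entirely.
container = database.create_container_if_not_exists(
    id="kvstore",
    partition_key=PartitionKey(path="/id"),
    indexing_policy={"indexingMode": "none", "automatic": False},
)

container.upsert_item({"id": "user:42", "value": "some payload"})

# Point read by id + partition key (~1 RU for a 1K document); no index needed.
item = container.read_item(item="user:42", partition_key="user:42")

# A query would still run, but without an index it is charged a much higher RU cost.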

psycopg2: command is too large when storing large values in CockroachDB

I'm looking to store a ~0.5G value into a single field, but psycopg2 is not cooperating:
crdb_cursor.execute(sql.SQL("UPSERT INTO my_db.my_table (field1, field2) VALUES (%s, %s)"), ['static_key', 'VERY LARGE STRING'])
psycopg2.InternalError: command is too large: 347201019 bytes (max: 67108864)
I've already set SET CLUSTER SETTING sql.conn.max_read_buffer_message_size='1 GiB';
Is there any (better) way to store this large a string into CockroachDB?
Clients will be requesting this entire string at a time, and no intra-string search or match operations will be performed.
I understand that there will be performance implications to storing large singular fields in a SQL database.
It seems that, at the moment, psycopg2 isn't capable of handling strings that large, and neither is CockroachDB. CockroachDB recommends keeping values around 1 MB; with the default configuration the limit is somewhere between 1 MB and 20 MB.
For storing a string that is several hundred megabytes, I would suggest some kind of object store, and then storing a reference to the object in the database. Here is an example of a blob store built on top of CockroachDB that may give you some ideas.
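A minimal sketch of the object-store approach, assuming psycopg2 for CockroachDB and an S3-compatible store via boto3; the bucket, DSN and column names mirror the question but are otherwise placeholders:

import uuid

import boto3     # assumption: an S3-compatible object store is available
import psycopg2

crdb = psycopg2.connect("postgresql://root@localhost:26257/my_db")  # placeholder DSN
s3 = boto3.client("s3")

def store_large_value(key: str, payload: bytes, bucket: str = "my-bucket") -> str:
    # Put the ~0.5 GB payload in the object store...
    object_key = f"my_table/{key}/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=object_key, Body=payload)

    # ...and keep only a small reference to it in CockroachDB.
    reference = f"s3://{bucket}/{object_key}"
    with crdb.cursor() as cur:
        cur.execute(
            "UPSERT INTO my_db.my_table (field1, field2) VALUES (%s, %s)",
            (key, reference),
        )
    crdb.commit()
    return reference

Clients would then resolve the reference and fetch the value straight from the object store, keeping row sizes well under CockroachDB's recommended ~1 MB.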

Hive/Impala performance with string partition key vs Integer partition key

Are numeric columns recommended for partition keys? Will there be any performance difference when we do a select query on numeric column partitions vs string column partitions?
Well, it makes a difference if you look up the official Impala documentation.
Instead of elaborating, I will paste the section from the doc, as I think it states it quite well:
"Although it might be convenient to use STRING columns for partition keys, even when those columns contain numbers, for performance and scalability it is much better to use numeric columns as partition keys whenever practical. Although the underlying HDFS directory name might be the same in either case, the in-memory storage for the partition key columns is more compact, and computations are faster, if partition key columns such as YEAR, MONTH, DAY and so on are declared as INT, SMALLINT, and so on."
Reference: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_string.html
No, there is no such recommendation. Consider this:
The thing is that a partition in Hive is represented as a folder named like 'key=value' (or sometimes just 'value'), and that folder name is a string in any case. So the partition key is stored as a string and cast during read/write; its value is not packed inside the data files and is not compressed.
Due to the distributed/parallel nature of MapReduce and Impala, you will never notice the difference in query processing performance. All data is also serialized to be passed between processing stages, then deserialized again and cast to some type, and this can happen many times for the same query.
There is a lot of overhead created by distributed processing and by serializing/deserializing data. Practically, only the size of the data matters: the smaller the table (its file size), the faster it works. You will not improve performance just by restricting types.
Big string values used as partition keys can affect metadata DB performance, and the number of partitions being processed can also affect performance. Again, the same applies: only the size of the data matters here, not the types.
1 and 0 can be better than 'Yes' and 'No' simply because of size, and compression and parallelism can make even this difference negligible in many cases.
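To make the documentation quote above concrete, here is a sketch of the two layouts being compared, driven from Python with impyla (an assumption); host, table and column names are illustrative:

from impala.dbapi import connect  # assumption: the impyla client

conn = connect(host="impala-host", port=21050)  # placeholders
cur = conn.cursor()

# Numeric partition key columns, as the Impala documentation recommends.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales_by_int (id BIGINT, amount DOUBLE)
    PARTITIONED BY (year INT, month INT)
    STORED AS PARQUET
""")

# The same layout with STRING partition keys. The HDFS directory names
# (e.g. year=2024/month=1) look the same either way; the answers above
# disagree on how much the declared type itself matters in practice.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales_by_string (id BIGINT, amount DOUBLE)
    PARTITIONED BY (year STRING, month STRING)
    STORED AS PARQUET
""")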

When to use Blobs in a Cassandra (and CQL) table and what are blobs exactly?

I was trying to better understand the design choices involved when creating tables in Cassandra, and when the blob type is a good choice.
I realized I didn't really know when to choose blob as a data type, because I was not sure what a blob really was (or what the acronym stood for). So I decided to read the following documentation for the blob data type:
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/blob_r.html
Blob
Cassandra 1.2.3 still supports blobs as string constants for input (to allow smoother transition to blob constant). Blobs as strings are now deprecated and will not be supported in the near future. If you were using strings as blobs, update your client code to switch to blob constants. A blob constant is an hexadecimal number defined by 0[xX](hex)+ where hex is an hexadecimal character, such as [0-9a-fA-F]. For example, 0xcafe.
Blob conversion functions
A number of functions convert the native types into binary data (blob). For every <native-type> nonblob type supported by CQL3, the typeAsBlob function takes an argument of type type and returns it as a blob. Conversely, the blobAsType function takes a 64-bit blob argument and converts it to a bigint value. For example, bigintAsBlob(3) is 0x0000000000000003 and blobAsBigint(0x0000000000000003) is 3.
What I got out of it is that it's just a long hexadecimal/binary value. However, I still don't really see when I would use it as a column type for a table, or how it is better or worse than other types. Going through some of its properties might be a good way to figure out which situations blobs are good for.
Blobs (Binary Large OBjects) are the solution for when your data doesn't fit into the standard types provided by C*. For example, say you wanted to make a forum where users were allowed to upload files of any type. To store these in C* you would use a blob column (or possibly several blob columns, since you don't want individual cells to become too large).
Another example might be a table where users are allowed to have a current photo, this photo could be added as a blob and be stored along with the rest of the user information.
According to the 3.x documentation, the blob type is suitable for storing a small image or a short string.
In my case I used it to store a hashed value, since the hash function returns binary data and storing it as binary is the best option from the point of view of table data size.
(Converting it to a string and storing it as text would also be fine, if size is not a concern.)
The results below show my test on a local machine (inserting 1 million records); the resulting data sizes are 52,626,907 bytes (binary) and 72,879,839 bytes (base64-converted data stored as a string).
CREATE TABLE IF NOT EXISTS testks.bin_data (
bin_data blob,
PRIMARY KEY(bin_data)
);
CREATE TABLE IF NOT EXISTS testks.base64_data (
base64_data text,
PRIMARY KEY(base64_data)
);
cqlsh> select * from testks.base64_data limit 10;
base64_data
------------------------------
W0umEPMzL5O81v+tTZZPKZEWpkI=
bGUzPm4zRvcqK1ogwTvPNPNImvk=
Nsr0GKx6LjXaiZSwATU38Ffo7fA=
A6lBV69DbFz/UFWbxolb+dlLcLc=
R919DvcyqBUup+NrpRyRvzJD+/E=
63LEynDKE5RoEDd1M0VAnPPUtIg=
FPkOW9+iPytFfhjdvoqAzbBfcXo=
uHvtEpVIkKivS130djPO2f34WSM=
fzEVf6a5zk/2UEIU8r8bZDHDuEg=
fiV4iKgjuIjcAUmwGmNiy9Y8xzA=
(10 rows)
cqlsh> select * from testks.bin_data limit 10;
bin_data
--------------------------------------------
0xb2af015062e9aba22be2ab0719ddd235a5c0710f
0xb1348fa7353e44a49a3363822457403761a02ba8
0x4b3ecfe764cbb0ba1e86965576d584e6e616b03e
0x4825ef7efb86bbfd8318fa0b0ac80eaa2ece9ced
0x37bdad7db721d040dcc0b399f6f81c7fd2b5cea6
0x3de4ca634e3a053a1b0ede56641396141a75c965
0x596ec12d9d9afeb5b1b0bb42e42ad01b84302811
0xbf51709a8d1a449e1eea09ef8a45bdd2f732e8ec
0x67dcb3b6e58d8a13fcdc6cf0b5c1e7f71b416df6
0x7e6537033037cc5c028bc7c03781882504bdbd65
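For reference, a rough sketch of how such a comparison could be driven from Python with the cassandra-driver, assuming a local node and the testks tables above; the payload and the hash choice (SHA-1, 20 bytes) are illustrative:

import base64
import hashlib
from cassandra.cluster import Cluster  # assumption: the DataStax Python driver

cluster = Cluster(["127.0.0.1"])  # contact point is a placeholder
session = cluster.connect("testks")

digest = hashlib.sha1(b"some payload").digest()  # 20 raw bytes

# Python bytes map to the CQL blob type.
session.execute("INSERT INTO bin_data (bin_data) VALUES (%s)", (digest,))

# The same digest base64-encoded and stored as text is roughly a third larger.
encoded = base64.b64encode(digest).decode("ascii")
session.execute("INSERT INTO base64_data (base64_data) VALUES (%s)", (encoded,))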

Resources