I need to optimize disk usage and the amount of data transferred during replication with my CouchDB instance. Does storing numerical data as ints/floats instead of as strings make a difference to file storage and/or during HTTP requests? I've read that JSON treats everything as strings, but newer JSON specs make use of different datatypes (float/int/boolean). What about PouchDB?
CouchDB stores JSON data using native JSON types, so ints and floats are actual number types when serialised to disk. But I doubt you would save much disk space compared to storing them as strings. The replication protocol uses JSON either way, so the internal encoding has no effect on what is transferred.
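To get a feel for how small the difference is, you can compare the serialized size of the same document with numeric fields stored as numbers versus as strings. A minimal sketch using Python's json module; the field names and values are made up for illustration:

```python
import json

# Hypothetical document with numeric fields stored as numbers...
as_numbers = {"_id": "reading-1", "temperature": 21.5, "count": 42}
# ...and the same document with the numbers stored as strings.
as_strings = {"_id": "reading-1", "temperature": "21.5", "count": "42"}

# The length of the UTF-8 encoding approximates what travels over HTTP
# during replication; the only difference is the extra quote characters.
print(len(json.dumps(as_numbers).encode("utf-8")))
print(len(json.dumps(as_strings).encode("utf-8")))
```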
PouchDB on WebSQL and SQLite stores your documents as strings (I don't know about IndexedDB).
So to optimize disk usage, just keep less data. :)
We have a map from a custom key object to a custom value object (a complex object). We set the in-memory-format to OBJECT, but IMap.get takes more time to return the value when the retrieved object is big. We cannot afford latency here, and the value is required for further processing. IMap.get is called in the JVM where the cluster member is started. Is there a way to get the objects quickly irrespective of their size?
This is partly the price you pay for in-memory-format==OBJECT.
To confirm, try in-memory-format==BINARY and compare the difference.
Stores and retrieves are slower with OBJECT, but some queries will be faster. If you run enough of those queries, the penalty is justified.
If you do get(X) and the value is stored deserialized (OBJECT), the following sequence occurs:
1 - the object is serialized from object to byte[]
2 - the byte array is sent to the caller, possibly across the network
3 - the object is deserialized by the caller, byte[] to object.
If you change to store serialized (BINARY), step 1 isn't needed.
If the caller is the same process, step 2 isn't needed.
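As a rough way to compare the two formats, you can time get() against a map configured with each format in turn. A minimal sketch using the Hazelcast Python client; the map name "big-values" and key "some-key" are placeholders, and the in-memory-format itself is set in the member configuration, not from the client:

```python
import time

import hazelcast

# Connects to a cluster on localhost by default; the member's map config
# (not shown here) sets in-memory-format to OBJECT or BINARY.
client = hazelcast.HazelcastClient()
big_values = client.get_map("big-values").blocking()

start = time.perf_counter()
value = big_values.get("some-key")  # whole value is serialized, shipped, deserialized
print(f"get() took {(time.perf_counter() - start) * 1000:.1f} ms")

client.shutdown()
```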
If you can, it's worth upgrading (the latest is 5.1.3), as there are some newer options that may perform better; see this blog post for an explanation.
You also don't necessarily have to return the entire object to the caller. A read-only EntryProcessor can extract just the part of the data you need and return that across the network. A smaller network packet will help, but if the cost is in the serialization then the difference may not be dramatic.
If you're retrieving a non-local map entry (either because you're using a client-server deployment model, or an embedded deployment with multiple nodes so that some retrievals are remote), then a retrieval is going to require moving data across the network. There is no way to move data across the network that isn't affected by object size, so the solution is to find a way to make the objects more compact.
You don't mention what serialization method you're using, but the default Java serialization is horribly inefficient ... any other option would be an improvement. If your code is all Java, IdentifiedDataSerializable is the most performant. See the following blog for some numbers:
https://hazelcast.com/blog/comparing-serialization-options/
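For reference, the equivalent interface also exists in the Hazelcast Python client if any non-Java code touches the map. A minimal sketch, assuming a hypothetical value class with two fields; the factory and class IDs are arbitrary and must match whatever the Java side registers:

```python
import hazelcast
from hazelcast.serialization.api import IdentifiedDataSerializable

FACTORY_ID = 1
CLASS_ID = 1


class Measurement(IdentifiedDataSerializable):
    """Hypothetical value object with two fields."""

    def __init__(self, name=None, value=None):
        self.name = name
        self.value = value

    def write_data(self, object_data_output):
        object_data_output.write_string(self.name)
        object_data_output.write_double(self.value)

    def read_data(self, object_data_input):
        self.name = object_data_input.read_string()
        self.value = object_data_input.read_double()

    def get_factory_id(self):
        return FACTORY_ID

    def get_class_id(self):
        return CLASS_ID


# Register the factory so the client can (de)serialize values of this class.
client = hazelcast.HazelcastClient(
    data_serializable_factories={FACTORY_ID: {CLASS_ID: Measurement}}
)
```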
Also, if your data is stored in BINARY format, then it's stored in serialized form (whatever serialization option you've chosen), so at retrieval time the data is ready to be put on the wire. By storing in OBJECT form, you'll have to perform the serialization at retrieval time. This will make your GET operation slower. The trade-off is that if you're doing server-side compute (using the distributed executor service, EntryProcessors, or Jet pipelines), the server-side compute is faster if the data is in OBJECT format because it doesn't have to deserialize the data to access the data fields. So if you aren't using those server-side compute capabilities, you're better off with BINARY storage format.
Finally, if your objects are large, do you really need to be retrieving the entire object? Using the SQL API, you can do a SELECT of just certain fields in the object, rather than retrieving the entire object. (You can also do this with Projections and the older Predicate API but the SQL method is the preferred way to do this). If the client code doesn't need the entire object, selecting certain fields can save network bandwidth on the object transfer.
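As a sketch of the field-selection idea, here is the Projections route from the Hazelcast Python client; the map name "big-values", key "some-key", and attribute "status" are assumptions for illustration, and the value's fields must be queryable by the cluster:

```python
import hazelcast
from hazelcast.predicate import equal
from hazelcast.projection import single_attribute

client = hazelcast.HazelcastClient()
big_values = client.get_map("big-values").blocking()

# Ships only the "status" attribute of the matching entry across the
# network instead of the whole (large) value object.
statuses = big_values.project(single_attribute("status"), equal("__key", "some-key"))
print(statuses)  # list with the projected attribute value(s)

client.shutdown()
```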
I'm looking to store a ~0.5G value into a single field, but psycopg2 is not cooperating:
crdb_cursor.execute(sql.SQL("UPSERT INTO my_db.my_table (field1, field2) VALUES (%s, %s)"), ['static_key', 'VERY LARGE STRING'])
psycopg2.InternalError: command is too large: 347201019 bytes (max: 67108864)
I've already set SET CLUSTER SETTING sql.conn.max_read_buffer_message_size='1 GiB';
Is there any (better) way to store this large a string into CockroachDB?
Clients will be requesting this entire string at a time, and no intra-string search or match operations will be performed.
I understand that there will be performance implications to storing large singular fields in a SQL database.
It seems that, at the moment, psycopg2 isn't capable of handling strings that large, and neither is CockroachDB. CockroachDB recommends keeping values around 1 MB, and with the default configuration the limit is somewhere between 1 MB and 20 MB.
For storing a string that is several hundred megabytes, I would suggest some kind of object store and then storing a reference to the object in the database. Here is an example of a blob store built on top of CockroachDB that may give you some ideas.
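A minimal sketch of the store-a-reference pattern, assuming an S3-compatible object store via boto3; the bucket, connection string, and table names are placeholders, and any object store (or the linked blob-store example) would work the same way:

```python
import uuid

import boto3
import psycopg2
from psycopg2 import sql

very_large_string = "..."  # placeholder for the ~0.5 GB payload from the question

# Upload the large string to the object store.
object_key = f"payloads/{uuid.uuid4()}"
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key=object_key, Body=very_large_string.encode("utf-8"))

# Store only the (tiny) reference in CockroachDB.
conn = psycopg2.connect("postgresql://root@localhost:26257/my_db")
with conn.cursor() as cur:
    cur.execute(
        sql.SQL("UPSERT INTO my_db.my_table (field1, field2) VALUES (%s, %s)"),
        ["static_key", object_key],
    )
conn.commit()
```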
I have a task to create a metadata table for my time-series Cassandra DB. This metadata table needs to store over 500 PDF files, each of which is 5-10 MB of data.
I have thought of storing them as blobs. Is Cassandra able to do that?
Cassandra isn't a perfect fit for such blobs, and DataStax recommends keeping them smaller than 1 MB for best performance.
But just try it for yourself and do some testing. Problems arise when partitions become large and there are updates in them, so the coordinator has a lot of work to do joining them.
A simple way to go: store your blob separately as a uuid key-value pair in its own table and only store the uuid with your data. When the blob is updated, insert a new one with a new uuid and update your records. With this trick you never have different (and possibly large) versions of your blob in the same partition and will not suffer as much performance-wise. I think I read that Walmart did this successfully with images, some of which were around 10 MB as well as smaller ones.
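A minimal sketch of that layout with the Python Cassandra driver; the keyspace, table, column, and file names are made up for illustration:

```python
import uuid

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# Blobs live in their own table, keyed by uuid.
session.execute(
    "CREATE TABLE IF NOT EXISTS pdf_blobs (id uuid PRIMARY KEY, content blob)"
)

# Store the PDF bytes under a fresh uuid and keep only that uuid
# in the metadata/time-series table.
pdf_id = uuid.uuid4()
with open("report.pdf", "rb") as f:  # hypothetical file
    session.execute(
        "INSERT INTO pdf_blobs (id, content) VALUES (%s, %s)",
        (pdf_id, f.read()),
    )

# On update: insert a new blob with a new uuid, then point the metadata
# row at the new uuid instead of overwriting the old blob in place.
```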
Just try it out - if you have Cassandra already.
If not, you might have a look at Ceph or something similar - but that needs its own deployment.
You can serialize the file and store it as a blob. The cost is deserialization when reading the file back; there are many libraries that do serialization/deserialization efficiently. Another way is to do what #jasim waheed suggested, but that will result in network IO. So you can decide where you want to pay the cost.
I am curious whether VoltDB compresses the data on disk/at rest.
If it does, what algorithm is used, and are there options for third-party compression methods (e.g. a lossy, proprietary video-stream compression algorithm)?
VoltDB uses Snappy compression when writing Snapshots to disk. Snappy is an algorithm optimized for speed, but it still has pretty good compression. There aren't any options for configuring or customizing a different compression method.
Data stored in VoltDB (e.g. when you insert records) is stored 100% in RAM and is not compressed. There is a sizing worksheet built in to the web interface that can help estimate the RAM required based on the specific datatypes of the tables, and whatever indexes you may define.
One of the datatypes that is supported is VARBINARY which stores byte arrays, i.e. any binary data. You could store pre-compressed data in VARBINARY columns, or use a third-party java compression library within stored procedures to compress and decompress inputs. There is a maximum size limit of 1MB per column, and 2MB per record, however a procedure could store larger sized binary data by splitting it across multiple records. There is a maximum size of 50MB for the inputs to or the results from a stored procedure. You could potentially store and retrieve larger sized binary data using multiple transactions.
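A sketch of the pre-compress-and-split idea in plain Python; the 1 MB chunk size follows the column limit mentioned above, and the actual inserts would go through whichever VoltDB client or stored procedure you use (not shown here):

```python
import zlib

MAX_COLUMN_BYTES = 1 * 1024 * 1024  # 1 MB VARBINARY column limit


def to_chunks(data: bytes):
    """Compress a payload and split it into column-sized chunks."""
    compressed = zlib.compress(data, 6)
    return [
        compressed[i : i + MAX_COLUMN_BYTES]
        for i in range(0, len(compressed), MAX_COLUMN_BYTES)
    ]


def from_chunks(chunks):
    """Reassemble and decompress chunks read back from VARBINARY columns."""
    return zlib.decompress(b"".join(chunks))


# Each chunk would be stored as one record, e.g. (object_id, chunk_index, chunk),
# keeping every insert under the per-record and per-procedure size limits.
```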
I saw you posted the same question in our forum, if you'd like to discuss more back and forth, that is the best place. We'd also like to talk to you about your solution, so if you like I can contact you at the email address from your VoltDB Forum account.
I understand that CouchDB views are pre-computed, now I'm wondering what the storage cost is per view. How can this be estimated? Is it the raw JSON size of the emitted data?
To be more specific, it's BigCouch (Cloudant).
I can't give you a rule of thumb for estimation, but you have to consider several factors here:
- CouchDB uses append-only storage, so your database (and view) files will also grow when you update data. To free unused space again, compaction is needed.
- The data vs. on-disk sizes can be read from the _info endpoint of a design document (see the sketch at the end of this answer).
- CouchDB uses a B-tree data structure for indexing, so a view requires the space of the serialized JSON plus some overhead for the tree.
- Since version 1.2, CouchDB by default compresses the database and view files with the Snappy algorithm.
If you are interested in the internals, there are discussions here, here, here and here.
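Here is a minimal sketch of reading those sizes over HTTP with Python requests; the database name "mydb" and design document "mydesign" are placeholders, and the exact size field names inside view_index vary between CouchDB/BigCouch versions:

```python
import requests

BASE = "http://localhost:5984"  # or your Cloudant/BigCouch endpoint
DB = "mydb"        # hypothetical database name
DDOC = "mydesign"  # hypothetical design document name

# Index size information for all views in the design document.
info = requests.get(f"{BASE}/{DB}/_design/{DDOC}/_info").json()
print(info["view_index"])  # data size vs. size on disk

# View compaction reclaims space left behind by the append-only storage.
requests.post(
    f"{BASE}/{DB}/_compact/{DDOC}",
    headers={"Content-Type": "application/json"},
)
```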