Efficient way to store a JSON string in a Cassandra column? - cassandra

Cassandra newbie question. I'm collecting some data from a social networking site using REST calls. So I end up with the data coming back in JSON format.
The JSON is only one of the columns in my table. I'm trying to figure out what the "best practice" is for storing the JSON string.
First I thought of using the map type, but the JSON contains a mix of strings, numerical types, etc. It doesn't seem like I can declare wildcard types for the map key/value. The JSON string can be quite large, probably over 10KB in size. I could potentially store it as a string, but it seems like that would be inefficient. I would assume this is a common task, so I'm sure there are some general guidelines for how to do this.
I know Cassandra has native support for JSON, but from what I understand, that's mostly used when the entire JSON map matches 1-1 with the database schema. That's not the case for me. The schema has a bunch of columns and the JSON string is just a sort of "payload". Is it better to store the JSON string as a blob or as text? BTW, the Cassandra version is 2.1.5.
Any hints appreciated. Thanks in advance.

In the Cassandra storage engine there's really not a big difference between a blob and text, since Cassandra essentially stores text as blobs. And yes, the "native" JSON support you speak of is only for when your data model matches your JSON model, and it's only available in Cassandra 2.2+.
I would store it as a text type. You shouldn't have to implement anything to compress your JSON data when sending it (or to handle decompressing it), since Cassandra's binary protocol supports transport compression. Also make sure your table stores its data compressed (I suggest LZ4, since it's the fastest algorithm implemented). Table compression and transport compression are separate settings, but if you configure both, you don't have to implement either yourself.
You didn't say which client driver you're using, but here's the documentation on how to set up transport compression for the DataStax Java driver.

It depends on how you want to query your JSON. There are 3 possible strategies:
1. Store as a string
2. Store as a compressed blob
3. Store as a blob
Option 1 has the advantage of being human readable when you query your data on the command line with cqlsh, or if you want to debug data directly live. The drawback is the size of this JSON column (10k).
Option 2 has the advantage of keeping the JSON payload small, because text elements have a pretty decent compression ratio. The drawbacks are: a. you need to take care of compression/decompression client side, and b. it's not directly human readable.
Option 3 has the drawbacks of both option 1 (size) and option 2 (not human readable).
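To make the size trade-off between the three options concrete, here is a minimal sketch in Python using only the standard library. The payload is made up for illustration, and zlib is just one possible client-side codec (an assumption, not something the answer prescribes); the resulting values would be bound to a text or blob column by whatever driver you use.

```python
import json
import zlib

# Hypothetical payload standing in for the JSON "payload" column from the question.
payload = {
    "user": "alan",
    "posts": [{"id": i, "text": "hello world " * 5, "likes": i % 7} for i in range(100)],
}

# Option 1: plain text (would go into a CQL text column).
as_text = json.dumps(payload)

# Option 2: client-side compressed bytes (would go into a CQL blob column).
as_compressed_blob = zlib.compress(as_text.encode("utf-8"))

# Option 3: uncompressed bytes (blob column); same size as the text, but opaque in cqlsh.
as_blob = as_text.encode("utf-8")


def restore(blob: bytes) -> dict:
    """Reverse option 2: decompress, decode, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


print(len(as_text), len(as_blob), len(as_compressed_blob))
```

With repetitive JSON like this, the compressed blob comes out much smaller than the text, at the cost of the extra `restore` step on every read.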

Spotify gives corrupted data in HTTP Toolkit

I'm trying to inspect Spotify's packets on the emulator, but the data sent and received looks corrupted. How can I solve this problem?
I tried reading the data many times but it always looks like this. I want to see the data properly in JSON form.
This data is not corrupted, it's just not in the format you want.
You can see the format by looking at the content-type header, which says application/protobuf.
This is not JSON data. Instead it's Protobuf, which is a general-purpose serialization format, similar to JSON or XML, but designed to be faster to process and smaller to transfer, in part by being sent as raw binary data, instead of readable strings.
To deserialize this 100% correctly, you will need the Protobuf schema for the API you're talking to (a .proto file). In many cases, unless this is your own or a publicly documented API, that's not going to be available.
You can still try to decode the data into raw data types though, although that might not allow you to decode all information immediately. There's more info on that here: raw decoder for protobufs format. Decoding data like this works best using the protoc command-line tool, but you may also be able to decode this data using https://protobuf-decoder.netlify.app/. Note that that takes hex data, not a raw string like you have here, so you'll have to pick the 'Hex' option for the body in HTTP Toolkit to copy the hex codes over instead.
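As a rough illustration of what "decoding into raw data types" means, here is a minimal Python sketch of a schema-less decoder for the Protobuf wire format. It is not the protoc tool or the linked web decoder, just a simplified stand-in: it only splits a message into field numbers, wire types and raw values, and it does not recurse into nested messages or handle the deprecated group wire types. The sample bytes are a small made-up message, not Spotify data.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return result, pos
        shift += 7


def decode_raw(buf: bytes) -> list[tuple[int, int, object]]:
    """Split a message into (field_number, wire_type, raw_value) triples."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint (ints, bools, enums)
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit (double, fixed64)
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (string, bytes, nested message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit (float, fixed32)
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields


# Hex text as you would copy it from a 'Hex' body view (sample bytes only).
sample = bytes.fromhex("089601120774657374696e67")
print(decode_raw(sample))  # [(1, 0, 150), (2, 2, b'testing')]
```

Without the .proto file you still have to guess what each field number means and whether a length-delimited value is a string, raw bytes, or a nested message.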
I can only extract data from here. How can I use it with Python requests? I want to convert it to a dictionary. Or can we solve this using https://github.com/spotify/proto-requests? I'm trying to write a program for Spotify.

Is RedisJSON better than plain Redis for keeping boardgame session data?

I just loaded up a Redis server for my backend with ioredis.
I'm learning that if I want to store data as JSON, I have to use the RedisJSON module, since hashes are flat and only hold string values. However, if I'm only storing one object per user instance, containing fewer than 10 fields typed as string, number, or array, is it better to just go without RedisJSON? On one hand, RedisJSON lets me fetch an object in one query. On the other, I can store multiple data types and query across those sets/hashes with a consistent naming convention.
Does anyone know which is the better usage, or the pitfalls of either approach?
The backend serves a websocket for a multiplayer boardgame.
The answer is that it depends, and several trade-offs have to be made for each project.
Performance: RedisJSON uses a tree structure for storing all elements in a document.
Comparing to a string: the advantage is that updating sub-elements of a document will be faster than manipulating a string containing a serialised JSON object. But retrieving (reassembling) and writing the entire document will be more expensive compared to Strings. Read more here.
Comparing to Hash: when manipulating a flat document (1 level deep), RedisJSON and HSET performance are comparable.
Maintainability: using several native data types in Redis to represent your object can perform really well, but the code will be more complex to maintain. There can be additional migration/refactoring work when the structure of the document is altered.
Querying: RediSearch supports indexing and querying the content of RedisJSON documents. This matters, of course, only if your use case requires secondary indexing and querying documents other than by their key. You can still build your own secondary indexing with Redis data structures, but this is also a trade-off in maintainability.
Disclaimer: I work at Redis, creator and maintainer of RediSearch and RedisJSON.

Best approach in converting and storing json objects into Apache Cassandra

I have an API service where users push arbitrary JSON objects. Some of these objects are nested and others are flat. I am facing some challenges in converting the incoming JSON objects into something more suitable for storage in Cassandra. Advice on how to handle this is highly appreciated.
Thanks.
As suggested by #omnibear you should take a look at the linked answer. Basically what you need to figure out before deciding on a solution is the answer to the question: "how do you process each JSON after storing it?". Some possible scenarios:
if you process it as it is, then you can store it as a blob
if you have situations where you need to modify a predefined subset of the attributes of the JSON, then you might want to store those as columns and the rest of the JSON as a blob
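The second scenario can be sketched in Python roughly as follows. The field names in COLUMN_FIELDS are hypothetical; in practice they would be whichever attributes you need to modify or filter on. The returned columns and payload would then be bound to regular columns and a blob column by your driver.

```python
import json

# Hypothetical attributes that get their own columns because they are updated individually.
COLUMN_FIELDS = ("user_id", "status")


def split_document(doc: dict) -> tuple[dict, bytes]:
    """Separate the predefined attributes from the rest of the JSON."""
    columns = {name: doc.get(name) for name in COLUMN_FIELDS}
    rest = {key: value for key, value in doc.items() if key not in COLUMN_FIELDS}
    payload = json.dumps(rest, sort_keys=True).encode("utf-8")  # goes into the blob column
    return columns, payload


def join_document(columns: dict, payload: bytes) -> dict:
    """Rebuild the original JSON object from the columns and the blob."""
    doc = json.loads(payload.decode("utf-8"))
    doc.update({key: value for key, value in columns.items() if value is not None})
    return doc


incoming = {"user_id": 42, "status": "active", "profile": {"tags": ["a", "b"]}}
columns, payload = split_document(incoming)
print(columns, payload)
```

Nested parts of the object stay inside the blob untouched, so this works for both flat and nested input.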

redis performance, store json object as a string

I need to save a User model, something like:
{ "nickname": "alan",
"email": ...,
"password":...,
...} // and a couple of other fields
Today, I use a Set: users
In this Set, I have a member like user:alan
In this member I have the hash above
This is working fine, but I was wondering whether, instead of the above approach, it would make sense to use the following one:
Still use users Set (to easily get the users (members) list)
In this set only use a key / value storage like:
key: alan
value: the stringified version of the above user hash
Retrieving a record would then be easier (I will then have to Parse it with JSON).
I'm very new to Redis and I am not sure which would be best. What do you think?
You can use the Redis hash data structure to store your JSON object's fields and values. For example, your "users" set can still be used as a list of all users, and each individual JSON object can be stored in a hash like this:
// hmset expects field/value pairs (or a plain object), not a JSON string
db.hmset("user:id", jsonObj);
Now you can get all users by key, or only a specific one (from which you get/set only the fields/values you need). Also these two questions are probably related to your scenario.
EDIT: (sorry I didn't realize that we talked about this earlier)
Retrieving a record would then be easier (I will then have to Parse it with JSON).
This is true, but with the hash data structure you can get/set only the field/value you need to work with. Retrieving the entire JSON object can decrease performance (depending on how often you do it) if you only want to change part of the object. The other thing is that you will need to stringify/parse the object every time.
One additional merit of JSON over hashes is that it maintains types. In a hash, 123.3 becomes the string "123.3", and depending on the library, Null/None can accidentally be cast to "null".
Handling both is a bit tedious, as it requires writing a transformer that extracts the strings and converts them back to their expected types.
For space/memory consumption, I've started leaning towards storing just the values as a JSON list, e.g. ["my_type_version", 123.5, null, ...], so I don't have the overhead of N * sum(len(key names)) for N records, which in my case was +60% of Redis's used memory footprint.
Bear in mind: hashes cannot store nested objects; JSON can.
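A minimal Python sketch of that values-only layout, assuming a fixed field order agreed on in code. The field names and the "user_v1" version tag are hypothetical. The resulting string would be stored as a plain Redis string value.

```python
import json

# Hypothetical schema: the first list element identifies the version of the field order.
SCHEMA_VERSION = "user_v1"
FIELDS = ("nickname", "email", "score", "last_login")


def pack(record: dict) -> str:
    """Serialize only the values, in FIELDS order, prefixed by the schema version."""
    values = [SCHEMA_VERSION] + [record.get(name) for name in FIELDS]
    return json.dumps(values, separators=(",", ":"))


def unpack(packed: str) -> dict:
    """Rebuild the record; JSON keeps numbers and null typed, unlike hash fields."""
    version, *values = json.loads(packed)
    if version != SCHEMA_VERSION:
        raise ValueError(f"unknown schema version: {version}")
    return dict(zip(FIELDS, values))


record = {"nickname": "alan", "email": "alan@example.com", "score": 123.5, "last_login": None}
packed = pack(record)
as_object = json.dumps(record, separators=(",", ":"))
print(len(packed), len(as_object))
```

The saving is the key names that are no longer repeated in every record; the price is that the field order becomes part of your code and has to be versioned.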
Truthfully, either way works fine. The way you store it is a design decision you will need to make. It depends on how you want to retrieve the user information, etc.
In terms of performance, storing the JSON encoded version of the user object will use less memory and take less time for storage/retrieval. That is, JSON parsing is probably faster than retrieving each field from Redis. And, even if not, it is probably more memory efficient. The difference in performance is probably minimal anyway.

Arrays in Azure Table Storage Entities (Non-Byte)

I'm looking to store arrays in Azure Table entities. At present, the only type of array supported natively is byte-array, limited to 64k length. The size is enough, but I'd like to store arrays of longs, doubles and timestamps in an entity.
I can obviously cast multiple bytes to the requested type myself, but I was wondering if there's any best-practice to achieve that.
To clarify, these are fixed length arrays (e.g. 1000 cells) associated with a single key.
I have written an Azure table storage client, called Lucifure Stash, which supports arrays, enums, large data, serialization, public and private properties and fields, and more.
You can get it at https://github.com/hocho/LucifureStash
I've been trying to think of a nice way to do this other than the method you've already mentioned, and I'm at a loss. The simplest solution I can come up with is to take the array, binary serialize it and store in a binary array property.
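The binary-serialize idea can be sketched like this. It is written in Python rather than .NET purely for illustration, uses a fixed little-endian layout of 8-byte doubles (an assumption; any agreed layout works), and the 64 KB constant reflects the byte-array limit mentioned in the question. For longs you would use the "q" format code instead of "d".

```python
import struct

MAX_PROPERTY_BYTES = 64 * 1024  # byte-array property limit mentioned in the question


def pack_doubles(values: list[float]) -> bytes:
    """Serialize a list of floats as consecutive little-endian 8-byte doubles."""
    data = struct.pack(f"<{len(values)}d", *values)
    if len(data) > MAX_PROPERTY_BYTES:
        raise ValueError(f"{len(data)} bytes does not fit in one binary property")
    return data


def unpack_doubles(data: bytes) -> list[float]:
    """Inverse of pack_doubles: 8 bytes per element."""
    count = len(data) // 8
    return list(struct.unpack(f"<{count}d", data))


cells = [i * 0.5 for i in range(1000)]  # fixed-length array, e.g. 1000 cells
blob = pack_doubles(cells)
print(len(blob))  # 8000
```

A 1000-cell array of doubles takes 8000 bytes, so it fits comfortably in a single binary property.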
Other options I've come up with but dismissed:
If storing it natively is important, you could keep this information in another child table (I know Azure Tables don't technically have relationships, but that doesn't mean you can't represent this type of thing). The downside of this being that it will be considerably slower than your original.
Take the array, XML serialize it and store it in a string property. This would mean that you could see the contents of your array when using 3rd party data explorer tools and you could run (inefficient) queries that look for an exact match on the contents of the array.
Use Lokad Cloud fat entities to store your data. This essentially takes your whole object, binary serializes it and splits the result into 64 KB blocks across the properties of the table entity. This does solve problems like the one you're experiencing, but you will only be able to access your data using tools that support this framework.
If you have just a key-value collection to store, then you can also check out Azure BLOBs. They can rather efficiently store arrays of up to 25M time-value points per single blob (with a random access within the dataset).
If you choose to store your object in blob storage and need more than one "key" to get it, you can just create an azure table or two or n where you store the key you want to look up and the reference to the exact blob item.
Why don't you store the values as csv strings?
You could serialize your array as a JSON string using the .NET JavaScript serializer:
http://msdn.microsoft.com/en-us/library/system.web.script.serialization.javascriptserializer.aspx
This class has a "MaxJsonLength" property you could use to ensure your arrays didn't exceed 64K when you were serializing them. And you can use the same class to deserialize your stored objects.
