Cassandra 3.0 updated SSTable format

According to this issue, Cassandra's storage format was updated in 3.0.
Previously I could use cassandra-cli to see how an SSTable is built and get something like this:
[default@test] list phonelists;
-------------------
RowKey: scott
=> (column=, value=, timestamp=1374684062860000)
=> (column=phonenumbers:bill, value='555-7382', timestamp=1374684062860000)
=> (column=phonenumbers:jane, value='555-8743', timestamp=1374684062860000)
=> (column=phonenumbers:patricia, value='555-4326', timestamp=1374684062860000)
-------------------
RowKey: john
=> (column=, value=, timestamp=1374683971220000)
=> (column=phonenumbers:doug, value='555-1579', timestamp=1374683971220000)
=> (column=phonenumbers:patricia, value='555-4326', timestamp=1374683971220000)
What would the internal format look like in the latest version of Cassandra? Could you provide an example?
What utility can I use to see the internal representation of a table in Cassandra, similar to the listing above, but with the new SSTable format?
All that I have found on the internet is that the partition header now stores column names, rows store clustering values, and that there are no duplicated values.
How can I look into it?

Prior to 3.0, sstable2json was a useful utility for getting an understanding of how data is organized in SSTables. This tool is not currently present in Cassandra 3.0, but there will be an alternative eventually. Until then, Chris Lohfink and I have developed an alternative to sstable2json (sstable-tools) for Cassandra 3.0 which you can use to understand how data is organized. There is some talk about bringing this into Cassandra proper in CASSANDRA-7464.
A key differentiator between the storage format of older versions of Cassandra and that of Cassandra 3.0 is that an SSTable previously represented partitions and their cells (identified by their clustering and column name), whereas in Cassandra 3.0 an SSTable represents partitions and their rows.
You can read about these changes in more detail in this blog post by the primary developer of the changes, who does a great job explaining them.
The largest benefit you will see is that in the general case your data size will shrink (in some cases by a large factor), as a lot of the overhead introduced by CQL has been eliminated by some key enhancements.
Here's an example showing the difference between C* 2 and 3.
Schema:
create keyspace demo with replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
use demo;
create table phonelists (user text, person text, phonenumbers text, primary key (user, person));
insert into phonelists (user, person, phonenumbers) values ('scott', 'bill', '555-7382');
insert into phonelists (user, person, phonenumbers) values ('scott', 'jane', '555-8743');
insert into phonelists (user, person, phonenumbers) values ('scott', 'patricia', '555-4326');
insert into phonelists (user, person, phonenumbers) values ('john', 'doug', '555-1579');
insert into phonelists (user, person, phonenumbers) values ('john', 'patricia', '555-4326');
sstable2json C* 2.2 output:
[
{"key": "scott",
"cells": [["bill:","",1451767903101827],
["bill:phonenumbers","555-7382",1451767903101827],
["jane:","",1451767911293116],
["jane:phonenumbers","555-8743",1451767911293116],
["patricia:","",1451767920541450],
["patricia:phonenumbers","555-4326",1451767920541450]]},
{"key": "john",
"cells": [["doug:","",1451767936220932],
["doug:phonenumbers","555-1579",1451767936220932],
["patricia:","",1451767945748889],
["patricia:phonenumbers","555-4326",1451767945748889]]}
]
sstable-tools toJson C* 3.0 output:
[
{
"partition" : {
"key" : [ "scott" ]
},
"rows" : [
{
"type" : "row",
"clustering" : [ "bill" ],
"liveness_info" : { "tstamp" : 1451768259775428 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-7382" }
]
},
{
"type" : "row",
"clustering" : [ "jane" ],
"liveness_info" : { "tstamp" : 1451768259793653 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-8743" }
]
},
{
"type" : "row",
"clustering" : [ "patricia" ],
"liveness_info" : { "tstamp" : 1451768259796202 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-4326" }
]
}
]
},
{
"partition" : {
"key" : [ "john" ]
},
"rows" : [
{
"type" : "row",
"clustering" : [ "doug" ],
"liveness_info" : { "tstamp" : 1451768259798802 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-1579" }
]
},
{
"type" : "row",
"clustering" : [ "patricia" ],
"liveness_info" : { "tstamp" : 1451768259908016 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-4326" }
]
}
]
}
]
While the output is larger, that is more a consequence of the tool than of the format. The key differences you can see are:
Data is now a collection of Partitions and their Rows (which include cells) instead of a collection of Partitions and their Cells.
Timestamps are now at the row level (liveness_info) instead of at the cell level. If some cells in a row differ in their timestamps, the new storage engine delta-encodes the difference and associates it with the cell to save space. This also applies to TTLs. As you can imagine, this saves a lot of space if you have many non-key columns, since the timestamp does not need to be repeated (a rough sketch of the delta-encoding idea follows after this list).
The clustering information (in this case we are clustering on 'person') is now present at the row level instead of the cell level, which saves a lot of overhead since the clustering column values don't have to be repeated for every cell.
I should note that in this particular example the benefits of the new storage engine aren't completely realized, since there is only one non-clustering column.
There are a number of other improvements not shown here (like the ability to store row-level range tombstones).
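To build intuition for the delta encoding mentioned above, here is a rough Python sketch. This is only an illustration of the idea, not Cassandra's actual on-disk encoding: the row stores one full timestamp, and each cell encodes only its difference from it, which a varint packs into far fewer bytes than repeating a full 8-byte timestamp per cell.

# Illustration of delta-encoding cell timestamps against the row's liveness_info
# (simplified; not Cassandra's real serialization code).

def encode_varint(n: int) -> bytes:
    """Unsigned LEB128-style varint: small values take fewer bytes."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

row_tstamp = 1451768259775428                        # stored once, at the row level
cell_tstamps = [1451768259775428, 1451768259793653]  # per-cell write times

full_bytes = 8 * len(cell_tstamps)                   # a fixed 8-byte long per cell (pre-3.0 style)
delta_bytes = sum(len(encode_varint(t - row_tstamp)) for t in cell_tstamps)

print(f"full: {full_bytes} bytes, delta-encoded: {delta_bytes} bytes")  # full: 16, delta-encoded: 4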

Related

PySpark: Formatting JSON before input to DataFrame

I want to create a Spark dataframe which contains a list of labelled tweets from a number of separate JSON files. I've tried simply using spark.read.json(files, multiLine=True), but I end up with a _corrupt_record for some files; there's something Spark doesn't seem to like about the format (the JSON is valid, I've checked).
The following is a representation of the format of each JSON object per file that I'm dealing with:
{"annotator": {
"eventsAnnotated" : [ {...} ],
"id" : "0939"
},
"events": [
{"eventid": "039393",
"tweets": [
{
"postID" : "111",
"timestamp" : "01/01/01",
"categories" : [ "Category" ],
"indicatorTerms" : [ ],
"priority" : "Low",
"text" : "text"
},
...]
However, I'm only interested in the tweets section of the JSON and can disregard eventid, or anything included in annotator:
"tweets": [
{
"postID" : "111",
"timestamp" : "01/01/01",
"categories" : [ "Category" ],
"indicatorTerms" : [ ],
"priority" : "Low",
"text" : "text"
},
...]
I'd like that to end up in a Spark dataframe in which postID, timestamp, categories, indicatorTerms, priority, and text are my columns and each row corresponds to one of these JSON entries.
I guess what I'm asking is: how can I read these files into some sort of temporary structure where I can stream each tweet, line by line, and then transform that into a Spark dataframe? I've seen some posts about RDDs but only managed to confuse myself; I am pretty new to Spark as a whole.
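One way to do roughly that is sketched below. It is only a sketch under assumptions: each file is a single JSON document shaped like the sample, the "data/*.json" path is a placeholder, and only the six tweet fields listed in the question are wanted. The files are pre-parsed with Python's json module (sidestepping the _corrupt_record issue) and the tweet objects are handed to Spark with an explicit schema.

import glob
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.appName("tweets").getOrCreate()

# Pull just the tweet objects out of every file, ignoring "annotator" and "eventid".
tweets = []
for path in glob.glob("data/*.json"):      # placeholder location of the JSON files
    with open(path) as f:
        doc = json.load(f)
    for event in doc.get("events", []):
        tweets.extend(event.get("tweets", []))

# Explicit schema so empty lists such as "indicatorTerms": [] don't break inference.
schema = StructType([
    StructField("postID", StringType()),
    StructField("timestamp", StringType()),
    StructField("categories", ArrayType(StringType())),
    StructField("indicatorTerms", ArrayType(StringType())),
    StructField("priority", StringType()),
    StructField("text", StringType()),
])

df = spark.createDataFrame(tweets, schema=schema)
df.show()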

Alias value in avsc does not display value on par with avro file

I have updated the avsc file to rename a column, like so:
"fields" : [ {
"name" : "department_id",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "office_name",
"type" : [ "null", "string" ],
"default" : null,
"aliases" : [ "department_name" ],
"columnName" : "department_name"
}
However, in my avro file the columns are like department_id: 10, department_name: "maths".
Now when I query like below,
select office_name from t
it always returns null values. Will it not return the value from department_name in the avro file? Is there a way to have multiple names for a column in avsc?
From the Cloudera community: "we recommend to use the original name rather than the aliased name of the field in the table, as the Avro aliases are stripped during loading into Spark."
Schema with aliases,
val schema = new Schema.Parser().parse(new File("../spark-2.4.3-bin-hadoop2.7/examples/src/main/resources/user.avsc"))
schema: org.apache.avro.Schema = {"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string","aliases":["customer_name"],"columnName":"customer_name"},{"name":"favorite_color","type":["string","null"],"aliases":["color"],"columnName":"color"}]}
Spark stripping the aliases,
val usersDF = spark.read.format("avro").option("avroSchema",schema.toString).load("../spark-2.4.3-bin-hadoop2.7/examples/src/main/resources/users.avro")
usersDF: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string]
I guess you can go with Spark's built-in features to rename a column, but if you find any other workaround, let me know as well.
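For what it's worth, here is a PySpark sketch of that rename-after-load workaround (the path is a placeholder and the spark-avro package is assumed to be available):

# Spark strips the Avro aliases, so load with the original field name
# and rename the column afterwards.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("avro")
      .load("/path/to/departments.avro")                  # data still has department_name
      .withColumnRenamed("department_name", "office_name"))

df.select("department_id", "office_name").show()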

How are counters in Cassandra stored on the disk?

I am unable to understand how Cassandra counters are stored on the disk.
Create test table
create table testcounter (
id text,
count counter,
PRIMARY KEY(id))
WITH compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
Add data
update testcounter set count = count + 10 where id = 'testrow';
Check sstable
nodetool flush test testcounter
sstabledump /usr/local/var/lib/cassandra/data/test/testcounter-87d6ae20908e11e9a5779f988085883a/mc-1-big-Data.db
Response from sstabledump
[
{
"partition" : {
"key" : [ "testrow" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 63,
"cells" : [
{ "name" : "count", "value" : 422215477737628, "tstamp" : "2019-06-16T23:30:34.423470Z" }
]
}
]
}
Update existing data
update testcounter set count = count + 10 where id = 'testrow';
update testcounter set count = count + 10 where id = 'testrow';
Flush
nodetool flush test testcounter
At this point, there are two sets of db files.
ls /usr/local/var/lib/cassandra/data/test/testcounter-87d6ae20908e11e9a5779f988085883a/
backups mc-1-big-Digest.crc32 mc-1-big-Statistics.db mc-2-big-CompressionInfo.db mc-2-big-Filter.db mc-2-big-Summary.db
mc-1-big-CompressionInfo.db mc-1-big-Filter.db mc-1-big-Summary.db mc-2-big-Data.db mc-2-big-Index.db mc-2-big-TOC.txt
mc-1-big-Data.db mc-1-big-Index.db mc-1-big-TOC.txt mc-2-big-Digest.crc32 mc-2-big-Statistics.db
sstabledump for mc-1
[
{
"partition" : {
"key" : [ "testrow" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 63,
"cells" : [
{ "name" : "count", "value" : 422215477737628, "tstamp" : "2019-06-16T23:30:34.423470Z" }
]
}
]
}
sstabledump for mc-2
[
{
"partition" : {
"key" : [ "testrow" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 65,
"cells" : [
{ "name" : "count", "value" : 422215477737628, "tstamp" : "2019-06-16T23:34:37.245893Z" }
]
}
]
}
It looks like there are no tombstones and even the counter values are not stored. What is happening behind-the-scenes?
After 2.1 it's actually a read-before-write, and it then stores essentially a packed tuple which isn't very obvious or easy to deserialize. It might be worth opening a JIRA to have sstabledump deserialize the counter context and make it more readable.
For more details see: https://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters
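My rough mental model of that packed value (heavily simplified, and not the real on-disk layout): it is a "counter context", a list of (counter id, clock, value) shards. The value a client sees is the sum of the shard values, and merging keeps, per counter id, the shard with the highest clock, which is why the later flush simply supersedes the earlier shard rather than leaving tombstones. A toy Python illustration of that idea:

# Toy model of a counter context: a list of (counter_id, clock, value) shards.
# Only illustrates the read/merge semantics, not Cassandra's actual format.

def read_value(context):
    """The counter value a client sees: sum of all shard values."""
    return sum(value for _, _, value in context)

def merge(a, b):
    """Per counter id, keep the shard with the highest clock (as compaction would)."""
    best = {}
    for counter_id, clock, value in a + b:
        if counter_id not in best or clock > best[counter_id][0]:
            best[counter_id] = (clock, value)
    return [(cid, c, v) for cid, (c, v) in sorted(best.items())]

mc1 = [("host-a", 1, 10)]   # after the first  "count = count + 10"
mc2 = [("host-a", 3, 30)]   # after two more increments (each one is a read-then-write)

print(read_value(merge(mc1, mc2)))   # 30 -- the newer shard supersedes the older one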

cassandra-cli 'list' in cassandra 3.0

I want to view the "rowkey" with its stored data in Cassandra 3.0. I know the deprecated cassandra-cli had the 'list' command. However, in Cassandra 3.0, I cannot find the replacement for the 'list' command. Does anyone know the new CLI command for 'list'?
You can use the sstabledump utility, as Chris Lohfink suggested. How to use it? Create a keyspace and a table in it, and populate some data:
cqlsh> CREATE KEYSPACE IF NOT EXISTS minetest WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
cqlsh> CREATE TABLE object_coordinates (
... object_id int PRIMARY KEY,
... coordinate text
... );
cqlsh> use minetest;
cqlsh:minetest> insert into object_coordinates (object_id, coordinate) values (564682,'59.8505,34.0035');
cqlsh:minetest> insert into object_coordinates (object_id, coordinate) values (1235,'61.7814,40.3316');
cqlsh:minetest> select object_id, coordinate, writetime(coordinate) from object_coordinates;
object_id | coordinate | writetime(coordinate)
-----------+-----------------+-----------------------
1235 | 61.7814,40.3316 | 1480436931275615
564682 | 59.8505,34.0035 | 1480436927707627
(2 rows)
object_id is the primary (partition) key; coordinate is a regular column.
Flush changes to disk:
# nodetool flush
Find sstable on disk and analyze it:
# cd /var/lib/cassandra/data/minetest/object_coordinates-e19d4c40b65011e68563f1a7ec2d3d77
# ls
backups mc-1-big-CompressionInfo.db mc-1-big-Data.db mc-1-big-Digest.crc32 mc-1-big-Filter.db mc-1-big-Index.db mc-1-big-Statistics.db mc-1-big-Summary.db mc-1-big-TOC.txt
# sstabledump mc-1-big-Data.db
[
{
"partition" : {
"key" : [ "1235" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 18,
"liveness_info" : { "tstamp" : "2016-11-29T16:28:51.275615Z" },
"cells" : [
{ "name" : "coordinate", "value" : "61.7814,40.3316" }
]
}
]
},
{
"partition" : {
"key" : [ "564682" ],
"position" : 43
},
"rows" : [
{
"type" : "row",
"position" : 61,
"liveness_info" : { "tstamp" : "2016-11-29T16:28:47.707627Z" },
"cells" : [
{ "name" : "coordinate", "value" : "59.8505,34.0035" }
]
}
]
}
]
Or with -d flag:
# sstabledump mc-1-big-Data.db -d
[1235]#0 Row[info=[ts=1480436931275615] ]: | [coordinate=61.7814,40.3316 ts=1480436931275615]
[564682]#43 Row[info=[ts=1480436927707627] ]: | [coordinate=59.8505,34.0035 ts=1480436927707627
The output shows the two partitions, 1235 and 564682, and the coordinates saved in those partitions.
Link to doc http://www.datastax.com/dev/blog/debugging-sstables-in-3-0-with-sstabledump
PS: sstabledump is provided by the cassandra-tools package on Ubuntu.

Using near with elemMatch in Mongoose

I am searching within a collection of Stores. Stores have an embedded collection of outlets with locations. My goal is to return the set of stores that have outlets near a geolocation, and also to return only those outlets near that location.
I can successfully restrict the query to only return Stores that have an Outlet at a particular location using 'near':
Store
.where('isActive').equals(true)
.where('outlets.location')
.near({ center: [153.027117, -27.468515], maxDistance: 1000 / 6378137, spherical: true })
.where('outlets.isActive').equals(true)
.where('products.productType').equals('53433f1f3e02e39addde1954')
.where('products.isActive').equals(true)
.select('name outlets')
.select({'products': {$elemMatch: {'isActive': true, 'productType': '53433f1f3e02e39addde1954'}}})
.select('name outlets')
.execQ()
.then(function (results) {
console.log(results);
})
.fail(function (err) {
console.log(err);
})
.done();
The problem I have is that the store document returns all the outlets, not just the outlet that matched the geolocation. I've tried using elemMatch within a select, like I did with the products:
.select({'outlets': {$elemMatch: {'location': {near:{ center: [153.027117, -27.468515], maxDistance: 10000 / 6378137, spherical: true }}}}})
However, it returns an empty array. Can you use the near operator in an elemMatch clause? Am I doing it incorrectly? Is there a more efficient/fast/better way to achieve the goal?
I see what you are trying to do here, but there seem to be a few flaws in this sort of design. Though it is not exactly your document structure, I see you are trying to do something like this:
{
"_id" : ObjectId("5344badd519563414f23fdf8"),
"store" : "Mine",
"outlets" : [
{
"name" : "somewhere",
"loc" : {
"type" : "Point",
"coordinates" : [
150.975131,
-33.8440366
]
}
},
{
"name" : "else",
"loc" : {
"type" : "Point",
"coordinates" : [
151.3651524,
-33.8389783
]
}
}
]
}
{
"_id" : ObjectId("5344be6f519563414f23fdf9"),
"store" : "Another",
"outlets" : [
{
"name" : "else",
"loc" : {
"type" : "Point",
"coordinates" : [
151.3651524,
-33.8389783
]
}
},
{
"name" : "somewhere",
"loc" : {
"type" : "Point",
"coordinates" : [
150.975131,
-33.8440366
]
}
}
]
}
So basically you appear to be attempting to nest the outlet locations within an array in a top level document.
What I am referring to as a flaw here is that, by design, any type of "near"-based query is going to return more than one result. That does seem logical when you look at the purpose. You can of course modify this to restrict the results by "maxDistance", but generally it will be more than one.
So the only way is to .limit() the results returned by the cursor to a single "nearest" response. Also note that with some operations those results are not necessarily sorted with the nearest response first.
Now as these results are actually contained within an array of the document, remember that .find() itself does not actually "filter" the results of an array, so of course the document will contain all of the array contents.
If you tried to "project" with a positional $ operator, then the problem falls back to the original point because there is no singular actual match, so it is not possible to return an "index" value for the matching element. If you in fact did try this, you would always get the default index value of 0, so just returning the first element.
If you then thought you could run off to aggregate and try to actually "de-normalize" the array entries, you would be out of luck, due to the need to use the index at the first stage of any aggregation pipeline statement.
So the summary of this is that embedded entries like this are not suited to this design where you need to do geo-spatial matching on those store locations. The locations are better off in a separate collection:
{
"_id" : ObjectId("5344bec7519563414f23fdfa"),
"store": "Mine"
"name" : "else",
"loc" : {
"type" : "Point",
"coordinates" : [
151.3651524,
-33.8389783
]
}
}
{
"_id" : ObjectId("5344bed5519563414f23fdfb"),
"store": "Mine"
"name" : "somewhere",
"loc" : {
"type" : "Point",
"coordinates" : [
150.975131,
-33.8440366
]
}
}
So that would allow you to "limit" the result to the "nearest" match by setting the limit to 1. You can also include any necessary values, such as the "store", to be used in your filtering. If you need to, you can include other information aside from what you need to filter on, or otherwise just put the ObjectId values within the array of the original object, or possibly even duplicate the data in both collections.
But since the very nature of these queries is to return more than one match, there is no way you are going to get this to work on embedded documents. So your solution will require some changes to your overall schema.
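To illustrate the separate-collection approach, here is a sketch in pymongo rather than Mongoose; the collection and field names follow the example documents above, and a 2dsphere index on "loc" is assumed:

# With outlets in their own collection and a 2dsphere index on "loc",
# $near returns documents ordered by distance; limit(1) keeps only the
# closest outlet, and you can still filter on the copied-in "store" field.
from pymongo import MongoClient, GEOSPHERE

client = MongoClient()
outlets = client.test.outlets
outlets.create_index([("loc", GEOSPHERE)])

nearest = outlets.find({
    "store": "Mine",
    "loc": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [153.027117, -27.468515]},
            "$maxDistance": 1000,          # metres when querying GeoJSON points
        }
    }
}).limit(1)

for outlet in nearest:
    print(outlet["name"], outlet["loc"]["coordinates"])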
