How to efficiently store this document structure in Cassandra? - cassandra

I want to migrate this complex document structure to cassandra:
foo = {
  1 => {
    :some => :data,
  },
  2 => {
    :some => :data
  },
  ...
  99 => {
    :some => :data
  },
  'seen' => { 1 => 1347682901, 2 => 1347682801 }
}
The problem:
It has to be retrievable (readable) as one row/record in under ~5 milliseconds.
So far I have been serializing the data, but that is not optimal, as I always need to update the whole thing.
Another thing is that I would like to use Cassandra's TTL feature for the values in the 'seen' hash.
Any ideas on how the sub-structures (1..n) could work in Cassandra? They are totally dynamic but should all be readable with one query.

Create a column family and store the data as follows:
rowKey = foo
columnName Value
-----------------------------------
1 {:some => :data,..}
2 {:some => :data,..}
...
...
99 {:some => :data,..}
seen {1 => 1347682901, 2 => 1347682801}
1, 2, ..., and "seen" are all dynamic columns.
If you are worried about updating just one of these columns, it is the same as inserting a new column into the column family. See Cassandra update column:
$column_family->insert('foo', array('42' => '{:some => :newdata,..}'));
I haven't had to use TTL yet, but it is simple to do. See Expiring Columns in Cassandra 0.7+ for an easy way to achieve this.
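For illustration, here is a minimal sketch using the Python pycassa client (the keyspace and column family names are hypothetical); the ttl argument on insert makes the written columns expire automatically. Note that the TTL applies per column, so if you want each entry of the 'seen' hash to expire independently you would store them as separate columns (e.g. seen:1, seen:2, a naming convention made up for this sketch) rather than one serialized blob:
import json
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('MyKeyspace')     # hypothetical keyspace
cf = ColumnFamily(pool, 'Foo')          # hypothetical column family

# Update a single dynamic column without rewriting the rest of the row.
cf.insert('foo', {'42': json.dumps({'some': 'newdata'})})

# Write the 'seen' entries as individual columns with a 30-day TTL,
# so each one expires on its own.
cf.insert('foo',
          {'seen:1': '1347682901', 'seen:2': '1347682801'},
          ttl=30 * 24 * 3600)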
Update
Q1. Just for my understanding: Do you suggest creating 99 columns? Or is it possible to keep that dynamic?
A column family, unlike an RDBMS table, has a flexible structure. You can have an unlimited number of dynamically created columns for a row key. For example:
myColumnFamily {
  "rowKey1": {
    "attr1": "some_values",
    "attr2": "other_value",
    "seen" : 823648223
  },
  "rowKey2": {
    "attr1": "some_values",
    "attr3": "other_value1",
    "attr5": "other_value2",
    "attr7": "other_value3",
    "attr9": "other_value4",
    "seen" : 823648223
  },
  "rowKey3": {
    "name" : "naishe",
    "log" : "s3://bucket42.aws.com/naishe/logs",
    "status" : "UNKNOWN",
    "place" : "Varanasi"
  }
}
This is an old article, worth reading: WTF is a SuperColumn? Here is a typical quote that will answer your query (emphasis mine):
One thing I want to point out is that there’s no schema enforced at this [ColumnFamily] level. The Rows do not have a predefined list of Columns that they contain. In our example above you see that the row with the key “ieure” has Columns with names “age” and “gender” whereas the row identified by the key “phatduckk” doesn’t. It’s 100% flexible: one Row may have 1,989 Columns whereas the other has 2. One Row may have a Column called “foo” whereas none of the rest do. This is the schemaless aspect of Cassandra.
. . . .
Q2. And you suggest serializing the sub-structure?
It's up to you. If you do not want to serialize, you should probably use a SuperColumn. My rule of thumb is this: if the value in a column represents a unit whose parts cannot be accessed independently, use a Column (that is, serialize the value). If the column has fragmented sub-parts that may need to be accessed directly, use a SuperColumn.
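As a rough illustration of that rule of thumb, here is a pycassa sketch (hypothetical keyspace and column family names; it assumes 'FooSuper' was created as a super column family), contrasting a serialized Column with a SuperColumn whose sub-parts stay individually addressable:
import json
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('MyKeyspace')

# Option 1: serialize the whole sub-structure into one Column value.
plain_cf = ColumnFamily(pool, 'Foo')
plain_cf.insert('foo', {'1': json.dumps({'some': 'data'})})

# Option 2: a SuperColumn keeps the sub-parts addressable on their own.
super_cf = ColumnFamily(pool, 'FooSuper')
super_cf.insert('foo', {'1': {'some': 'data'}})

# Read just one sub-column of super column '1', no deserialization needed.
value = super_cf.get('foo', super_column='1', columns=['some'])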

Related

Creating Test data for ArangoDB

Hi, I would like to insert random test data into an edge collection called Transaction, with the fields _id, Amount and TransferType filled with random data. I have written the code below, but it shows a syntax error.
FOR i IN 1..30000
INSERT {
_id: CONCAT('Transaction/', i),
Amount:RAND(),
Time:Rand(DATE_TIMESTAMP),
i > 1000 || u.Type_of_Transfer == "NEFT" ? u.Type_of_Transfer == "IMPS"
} INTO Transaction OPTIONS { ignoreErrors: true }
Your code has multiple issues:
When you are creating a new document, you can either omit the _key attribute and ArangoDB will create one for you, or you can specify one as a string. An _id attribute supplied in the document will be ignored.
RAND() produces a random number between 0 and 1, so it needs to be multiplied to bring it into the range you want, and you might need to round it if you need integer values.
DATE_TIMESTAMP is a function, and you have passed it as a parameter to RAND(), which takes no parameters. But since it generates a numerical timestamp (milliseconds since 1970-01-01 00:00 UTC), it is not actually needed; the only thing you need is the random number generation shifted to a range that makes sense (i.e. not in the 1970s).
The i > 1000 ... line is something I could only guess at. The key for the JSON object is missing, you are referencing a u variable that is not defined anywhere, and I see the first two parts of a ternary expression (cond ? true_value : false_value) but the : is missing. My best guess is that you wanted to create a Type_of_Transfer key with the value "NEFT" when i > 1000 and "IMPS" when i <= 1000.
So I rewrote your AQL and tested it:
FOR i IN 1..30000
INSERT {
_key: TO_STRING(i),
Amount: RAND()*1000,
Time: ROUND(RAND()*100000000+1603031645000),
Type_of_Transfer: i > 1000 ? "NEFT" : "IMPS"
} INTO Transaction OPTIONS { ignoreErrors: true }

Atomic way of inserting a row if not exist in bigtable

We would like to insert a row into Bigtable if it does not exist. Our idea is to use the CheckAndMutateRow API with an onNoMatch insert. We are using the Node.js SDK; the idea would be to do the following (it seems to work, but we don't know about the atomicity of the operation):
const row = table.row('phone#4c410523#20190501');
const filter = [];
const config = {
  onNoMatch: [
    {
      method: 'insert',
      data: {
        stats_summary: {
          os_name: 'android',
          timestamp,
        },
      },
    },
  ],
};
await row.filter(filter, config);
CheckAndMutateRow is atomic. Based on the API definition:
Mutations are applied atomically and in order, meaning that earlier mutations can be masked / negated by later ones. Cells already present in the row are left unchanged unless explicitly changed by a mutation.
After committing the accumulated mutations, resets the local mutations.
MutateRow does an upsert. So if you give it a rowkey, column name and timestamp it will create a new cell if it doesn't exist, or overwrite it otherwise. You can achieve this behavior with a "simple write".
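For instance, a plain upsert with the Python google-cloud-bigtable client looks roughly like this (project, instance and table IDs are hypothetical, and the 'stats_summary' column family is assumed to exist):
from google.cloud import bigtable

client = bigtable.Client(project='my-project')   # hypothetical project
table = client.instance('my-instance').table('my-table')

# MutateRow semantics: creates the cell if missing, overwrites it otherwise.
row = table.direct_row(b'phone#4c410523#20190501')
row.set_cell('stats_summary', b'os_name', b'android')
row.commit()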
Conditional writes are used, for example, when you want to check a value before you overwrite it. Let's say you want to set column A to X only if column B is Y, or overwrite column A only if column A's current value is Z.
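And a rough Python sketch of the insert-if-not-exists pattern from the question, using a conditional row (again with hypothetical IDs; the predicate matches when the row already contains any cell, and the mutation runs only when it does not match):
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project='my-project')   # hypothetical project
table = client.instance('my-instance').table('my-table')

# Predicate: does the row already have at least one cell?
predicate = row_filters.PassAllFilter(True)
cond_row = table.conditional_row(b'phone#4c410523#20190501', filter_=predicate)

# state=False => apply this mutation only when the predicate does NOT match,
# i.e. only when the row does not exist yet. The check and the mutation are
# a single atomic CheckAndMutateRow call.
cond_row.set_cell('stats_summary', b'os_name', b'android', state=False)
cond_row.commit()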

Is it possible to do a Lookup use Kiba

Is it possible to do a "lookup" with Kiba?
It's quite a normal process in an ETL.
Could you show a demo if so? Thanks.
Yes, a lookup can be done with Kiba!
For a tutorial, see this live coding session I recorded, in which I create a lookup transform that looks up extra fields from given fields by tapping into the MovieDB database.
Leveraging this example, you could for instance implement a simple ActiveRecord lookup using a block transform:
# assuming you have a 'country_iso_2' field in the row above
transform do |row|
  country = Country.where(iso_2: row['country_iso_2']).first
  row['country_name'] = country.try(:name) || 'Unknown'
  row
end
or you could extract a more reusable class transform that you would call like this:
transform ActiveRecordLookup, model: Country,
lookup_on: 'country_iso_2',
fetch_fields: { 'name' => 'country_name' }
transform DefaultValue, 'name' => 'Unknown'
Obviously, if you need to handle large volumes, you will have to implement some improvements (e.g. caching, bulk reading).
Hope this helps!

cassandra data model for web logging

I've been playing around with Cassandra and I am trying to evaluate the best data model for storing things like views or hits for unique page ids. Would it be best to have a single column family per page id, or one super column (logs) with the page ids as columns? Each page has a unique id, and I would like to store the date and some other metrics for each view.
I am just not sure which solution scales better: lots of column families, or one giant super column?
page-92838 { date:sept 2, browser:IE }
page-22939 { date:sept 2, browser:IE5 }
OR
logs {
page-92838 {
date:sept 2,
browser:IE
}
page-22939 {
date:sept 2,
browser:IE5
}
}
And secondly, how do I handle lots of different date: entries for page-92838?
You don't need a column-family per pageid.
One solution is to have a row for each page, keyed on the pageid.
You could then have a column for each page-view or hit, keyed and sorted on time-UUID (assuming having the views in time-sorted order would be useful) or other unique, always-increasing counter. Note that all Cassandra columns are time-stamped anyway, so you would have a precise timestamp 'for free' regardless of what other time- or date- stamps you use. Using a precise time-UUID as the key also solves the problem of storing many hits on the same date.
The value of each column could then be a textual value or JSON document containing any other metadata you want to store (such as browser).
page-12345 -> {timeuuid1:metadata1}{timeuuid2:metadata2}{timeuuid3:metadata3}...
page-12346 -> ...
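A rough pycassa sketch of that layout (hypothetical keyspace and column family names; it assumes the column family's comparator is TimeUUIDType so the hit columns sort chronologically):
import json
import uuid
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('MyKeyspace')
views_cf = ColumnFamily(pool, 'PageViews')   # comparator_type: TimeUUIDType

# One column per hit: the name is a time-UUID, the value is the metadata.
views_cf.insert('page-92838', {uuid.uuid1(): json.dumps({'browser': 'IE'})})

# Fetch the 100 most recent hits for the page, newest first.
recent = views_cf.get('page-92838', column_count=100, column_reversed=True)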
With Cassandra, it is best to start from the queries you need to run and model your schema to support those queries.
Assuming you want to query hits on a page, and hits by browser, you can have a counter column for each page, like:
stats { #cf
  page-id { #key
    hits : # counter column for hits
    browser-ie : #counts of views with ie
    browser-firefox : ....
  }
}
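Incrementing those counters from a client is then a one-line write per hit; here is a minimal pycassa sketch, assuming 'stats' was created as a counter column family (the keyspace and names are hypothetical):
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('MyKeyspace')
stats_cf = ColumnFamily(pool, 'stats')   # default_validation_class: CounterColumnType

# Bump the per-page counters on every hit.
stats_cf.add('page-92838', 'hits')
stats_cf.add('page-92838', 'browser-ie')

# Read all the counters for a page back with one query.
print(stats_cf.get('page-92838'))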
If you need to do time-based queries, look at how Twitter's Rainbird denormalizes data as it writes to Cassandra.

SubSonic 3 / ActiveRecord - Easy way to compare two records?

With SubSonic 3 / ActiveRecord, is there an easy way to compare two records without having to compare them column by column? For example, I'd like a function that does something like this (without having to write a custom comparer for each table in my database):
public partial class MyTable
{
    public IList<SubSonic.Schema.IColumn> Compare(MyTable m)
    {
        IList<SubSonic.Schema.IColumn> columnsThatDontMatch = new...;
        if (this.Field1 != m.Field1)
        {
            columnsThatDontMatch.Add(Field1_Column);
        }
        if (this.Field2 != m.Field2)
        {
            columnsThatDontMatch.Add(Field2_Column);
        }
        ...
        return columnsThatDontMatch;
    }
}
In the end, what I really need is a function that tests for equality between two rows, excluding the primary key columns. The pseudo-code above is a more general form of this. I believe that once I get the columns that don't match, I'll be able to check if any of the columns are primary key fields.
I've looked through the Columns property without finding anything that I can use. Ideally, the solution would be something I can toss into the t4 file and generate for all the tables in my database.
The best way, if you are using SQL Server as your backend (since this can be auto-populated), is to create a derived column whose definition uses CHECKSUM to hash the values of the "selected" columns, forming a uniqueness check outside of the primary key.
EDIT: if you are not using SQL Server, then this hashing will need to be done in code as you save or edit the row.
