Atomic way of inserting a row if not exist in bigtable - node.js

We would like to insert a row only if it does not already exist in Bigtable. Our idea is to use the CheckAndMutateRow API with an onNoMatch insert. We are using the Node.js SDK, and the idea would be to do the following (it seems to work, but we don't know about the atomicity of the operation):
const timestamp = new Date();
const row = table.row('phone#4c410523#20190501');
const filter = []; // empty filter: intended so that onNoMatch applies when the row does not exist yet
const config = {
  onNoMatch: [
    {
      method: 'insert',
      data: {
        stats_summary: {
          os_name: 'android',
          timestamp,
        },
      },
    },
  ],
};
await row.filter(filter, config);

CheckAndMutateRow is atomic. Based on the API definition:
Mutations are applied atomically and in order, meaning that earlier mutations can be masked / negated by later ones. Cells already present in the row are left unchanged unless explicitly changed by a mutation.
After committing the accumulated mutations, resets the local mutations.

MutateRow does an upsert. So if you give it a row key, column name and timestamp, it will create a new cell if one doesn't exist, or overwrite it otherwise. You can achieve this behavior with a "simple write".
Conditional writes are used, e.g., if you want to check a value before you overwrite it. Let's say you want to set column A to X only if column B is Y, or overwrite column A only if column A's current value is Z.
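For illustration, here is a rough sketch of that kind of conditional write with the Node.js SDK, loosely modeled on the official writeConditionally sample; the column names and values here are made up, so check the SDK docs for the exact filter shape:

// Hypothetical example: write stats_summary:os_name only if stats_summary:os_build
// currently starts with "PQ2A" (i.e. "check column B before writing column A").
const timestamp = new Date();
const row = table.row('phone#4c410523#20190501');
const filter = [
  {
    column: 'os_build',
    value: {
      start: 'PQ2A',
      end: 'PQ2A',
    },
  },
];
const config = {
  onMatch: [
    {
      method: 'insert',
      data: {
        stats_summary: {
          os_name: 'android',
          timestamp,
        },
      },
    },
  ],
};
// The check and the mutations are applied as a single atomic CheckAndMutateRow call.
await row.filter(filter, config);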

Related

bigQuery: PartialFailureError on table insert

I'm trying to insert a data row into a BigQuery table as follows:
await bigqueryClient
  .dataset(DATASET_ID)
  .table(TABLE_ID)
  .insert(row);
But I get a PartialFailureError when deploying the cloud function.
The table schema has a name (string) and campaigns (record/repeated) fields, which I created manually from the console:
hotel_name STRING NULLABLE
campaigns RECORD REPEATED
campaign_id STRING NULLABLE
platform_id NUMERIC NULLABLE
platform_name STRING NULLABLE
reporting_id STRING NULLABLE
And the data I'm inserting is an object like this:
const row = {
  hotel_name: hotel_name, // string
  campaigns: {
    id: item.id, // string
    platform_id: item.platform_id, // int
    platform_name: item.platform_name, // string
    reporting_id: item.reporting_id, // string
  },
};
The errors logged don't give much clue about the issue.
These errors suck. The actual info about what went wrong can be found in the errors property on the PartialFailureError. In https://www.atdatabases.org we reformat the error to make this easier using: https://github.com/ForbesLindesay/atdatabases/blob/0e1a033264aac33deaff2ab753796049e623ab86/packages/bigquery/src/implementation/BigQueryDriver.ts#L211
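As a quick illustration (not specific to atdatabases), you can inspect that errors property yourself when the insert rejects; the exact shape can vary, so treat this as a sketch:

try {
  await bigqueryClient
    .dataset(DATASET_ID)
    .table(TABLE_ID)
    .insert(row);
} catch (err) {
  // On a PartialFailureError, err.errors is an array of { row, errors } entries
  // describing each rejected row and the reasons it was rejected.
  if (err.name === 'PartialFailureError' && Array.isArray(err.errors)) {
    for (const failure of err.errors) {
      console.error('Rejected row:', JSON.stringify(failure.row));
      console.error('Reasons:', JSON.stringify(failure.errors));
    }
  }
  throw err;
}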
According to my tests, there are two errors here. The first is that you have campaign_id in the schema but id in the JSON.
The second is related to the format of REPEATED mode data in JSON. The documentation mentions the following:
Notice that the addresses column contains an array of values (indicated by [ ]). The multiple addresses in the array are the repeated data. The multiple fields within each address are the nested data.
It isn't stated that directly in the mentioned document (it can probably be found somewhere else), but when you use REPEATED mode you should use brackets [].
I tested it briefly on my side and it seems that it should work like this:
const row = {
  hotel_name: hotel_name, // string
  campaigns: [
    {
      campaign_id: item.id, // string
      platform_id: item.platform_id, // int
      platform_name: item.platform_name, // string
      reporting_id: item.reporting_id, // string
    },
  ],
};

Specifying that a column should just be inserted not upserted

I have some data I want to insert via insertGraph, like this:
ModelName
  .query(trx)
  .insertGraph(data)
The problem is I have a guard with allowInsert that specifies which columns should be populated. I have a column holding a foreign key to another table, and I don't want this column to be populated. I keep getting "trying to upsert an unallowed relation". I'm at a loss on how to specify that foreignId shouldn't be populated.
My code looks like this with the allowInsert guard:
ModelName
  .query(trx)
  .allowInsert('[subrelation2.[columnToPopulate1, columnToPopulate2]]')
  .insertGraph(data)
P.S. I've tried specifying foreignId in the allowInsert condition to no avail. Specifying relation2.* allows the insertion, but I want to retain the sanity checks.
It seems like you are specifying columns of the relation2 model instead of subrelations. Is columnToPopulate1 a subrelation of the model of relation2? By the name it looks like a column of the model, which is wrong.
I think you want to use a relation to insert two columns into the 'relation2' model. Something like:
let data = {
  modelNameColumn: 'value',
  relation2: {
    columnToPopulate1: 'value',
    columnToPopulate2: 'value',
  },
};

await ModelName
  .query(trx)
  .allowInsert('[relation2]')
  .insertGraph(data);
In the allowInsert method you can only specify which relations are allowed to be inserted; you can't define which columns.
In case you want to remove the possibility of a column being updated, you can use a beforeUpdate() trigger:
class Model2 extends Model {
  async $beforeUpdate(opt, queryContext) {
    await super.$beforeUpdate(opt, queryContext);
    if (this.columnName) throw new Error('columnName should not be updated');
  }
}

Paginating a mongoose mapReduce, for a ranking algorithm

I'm using a MongoDB mapReduce to implement a ranking feed algorithm. It almost works, but the last thing to implement is pagination. The mapReduce supports limiting the results, but how could I implement the offset (skipping), e.g. based on the last viewed _id of the results, knowing that I'm using Mongoose?
This is the procedure I wrote:
var o = {};
o.map = function() {
  // log10(likes + comments) / elapsed hours since the post creation
  emit(Math.log(this.likes + this.comments + 1) / Math.LN10 / Math.abs((now - this.createdAt) / 6e7 + 1), this);
};
o.reduce = function(key, values) {
  // sort the values when they have the same score
  values.sort(function(a, b) {
    return a.createdAt - b.createdAt;
  });
  // serialize the values, because mongoose does not support multiple returned values
  return JSON.stringify(values);
};
o.scope = { now: new Date() };
o.limit = 15;
Posts.mapReduce(o, function(err, results) {
  if (err) return console.log(err);
  console.log(results);
});
Also, if mapReduce isn't the way to go, can you suggest another approach to implement something like this?
What you need is a page delimiter, which is not the id of the latest viewed item as you say, but your sorting property. In this case, it seems to be the formula Math.log(this.likes + this.comments + 1) / Math.LN10 / Math.abs((now - this.createdAt) / 6e7 + 1).
So your mapReduce query needs to hold a where condition on the value of that formula, specifically formula >= the last value served. It also needs to hold the value of createdAt at the last page, since you don't sort by that (assuming createdAt is unique). So your mapReduce query would say where: theFormulaExpression, createdAt: { $lt: lastCreatedAt }.
If you do allow multiple identical createdAt values, you have to play a little outside of the database itself.
So you just search by formula.
Ideally, that gives you one element with exactly that value, and the next ones sorted after it. So in the reply to the module caller, remove this first element off the array (and make sure you actually ask for more results than you need because of this).
Now, since you allow for multiple similar values, you need another identifying prop, say, the object id or created_at. Your consumer (the caller of this module) will have to provide both (the last value of the score and the createdAt of the last object). Say you have a page split exactly in the middle: one or more objects are on the previous page, another set on the next. You'd have to remove not simply the top value (because that same score was already served on the previous page), but possibly several values from the top.
Then it gets really tricky, because potentially your whole page was already served: compare the _ids and look for the first one after the one your module caller has provided you with. Or look into the data, determine how many matching values like that there are, and try to get at least that many more values from mapReduce than your actual page size.
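To make that idea concrete, here is a rough sketch only; the names lastCreatedAt and servedIds are hypothetical values the caller would remember from the previous page, and you would adjust the comparisons to your sort direction:

// Keyset-style pagination around the mapReduce above (illustrative only).
// lastCreatedAt / servedIds are provided by the caller from the previous page.
o.query = { createdAt: { $lte: lastCreatedAt } }; // narrow the input by the tiebreaker
o.limit = 15 + servedIds.length;                  // over-fetch to cover boundary ties

Posts.mapReduce(o, function(err, results) {
  if (err) return console.log(err);
  // With inline output, each entry is { _id: <emitted score>, value: <doc or JSON array> }.
  // Drop whatever the previous page already delivered, then cut the page to size.
  var page = results.filter(function(r) {
    var docs = typeof r.value === 'string' ? JSON.parse(r.value) : [r.value];
    return docs.some(function(d) {
      return servedIds.indexOf(String(d._id)) === -1;
    });
  }).slice(0, 15);
  console.log(page);
});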
Aside from that, I would do this with the aggregation framework instead; it should be much more performant.

Dynamodb querying for list count

I have a DynamoDB table which has the following columns:
id, name, events, deadline
events is a list which contains a number of events.
I want to scan/query for all the rows, with the following items as the result:
id, name, number of events.
I tried the following, but didn't receive any value for the number of events. Can someone show me where I went wrong?
var params = {
  TableName: 'table_name',
  ExpressionAttributeNames: {
    '#name': 'name',
    '#even': 'events.length'
  },
  ProjectionExpression: 'id, #name, #even'
};
You cannot achieve what you want in this way. The entries in "ExpressionAttributeNames" are not evaluated as expressions.
The definition of "#even": "events.length" in "ExpressionAttributeNames" does not evaluate the expression events.length and assign it to the variable "#even". Instead, it specifies "#even" as referring to a column named "events.length", or to a "length" attribute nested inside an "events" object. Since your table has neither, you get nothing back.
From the DynamoDB documentation:
In an expression, a dot (".") is interpreted as a separator character in a document path. However, DynamoDB also allows you to use a dot character as part of an attribute name.
To achieve what you want, you will have to return the "events" column and calculate the length outside of the query, or define a new "eventsLength" column and populate and maintain that value yourself if you are concerned about returning "events" in each query.
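For the first option, a minimal sketch (assuming the AWS SDK for JavaScript v2 DocumentClient; the table and attribute names are taken from the question) could look like this:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const params = {
  TableName: 'table_name',
  // "name" is a reserved word in DynamoDB, so it still needs an alias.
  ExpressionAttributeNames: { '#name': 'name' },
  ProjectionExpression: 'id, #name, events',
};

docClient.scan(params, (err, data) => {
  if (err) return console.error(err);
  // Compute the length client-side after projecting the whole list.
  const rows = data.Items.map(item => ({
    id: item.id,
    name: item.name,
    numberOfEvents: (item.events || []).length,
  }));
  console.log(rows);
});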

How to efficiently store this document structure in Cassandra?

I want to migrate this complex document structure to Cassandra:
foo = {
  1: {
    :some => :data,
  },
  2: {
    :some => :data
  },
  ...
  99: {
    :some => :data
  },
  'seen' => {1 => 1347682901, 2 => 1347682801}
}
The problem:
It has to be retrievable (readable) as one row/record in under ~5 milliseconds.
So far I have been serializing the data, but that is not optimal, as I always need to update the whole thing.
Another thing is that I would like to use Cassandra's TTL feature for the values in the 'seen' hash.
Any ideas on how the sub-structures (1..n) could work in Cassandra, given that they are totally dynamic but should all be readable with one query?
Create a column family and store the data as follows:
rowKey = foo
columnName    Value
-----------------------------------------------
1             {:some => :data,..}
2             {:some => :data,..}
...
...
99            {:some => :data,..}
seen          {1 => 1347682901, 2 => 1347682801}
1,2,... "seen" are all dynamic.
If you are worried about updating just one of these columns: it is the same as inserting a new column into a column family. See here: Cassandra update column
$column_family->insert('foo', array('42' => '{:some => :newdata,..}'));
I haven't had to use TTL yet, but it is quite simple. See an easy way to achieve this here: Expiring Columns in Cassandra 0.7+
Update
Q1. Just for my understanding: Do you suggest creating 99 columns? Or is it possible to keep that dynamic?
A column family, unlike an RDBMS table, has a flexible structure. You can have an unlimited number of columns for a row key, created dynamically. For example:
myColumnFamily {
  "rowKey1": {
    "attr1": "some_values",
    "attr2": "other_value",
    "seen": 823648223
  },
  "rowKey2": {
    "attr1": "some_values",
    "attr3": "other_value1",
    "attr5": "other_value2",
    "attr7": "other_value3",
    "attr9": "other_value4",
    "seen": 823648223
  },
  "rowKey3": {
    "name": "naishe",
    "log": "s3://bucket42.aws.com/naishe/logs",
    "status": "UNKNOWN",
    "place": "Varanasi"
  }
}
This is an old article, worth reading: WTF is a SuperColumn? Here is a typical quote that will answer your query (emphasis mine):
One thing I want to point out is that there’s no schema enforced at this [ColumnFamily] level. The Rows do not have a predefined list of Columns that they contain. In our example above you see that the row with the key “ieure” has Columns with names “age” and “gender” whereas the row identified by the key “phatduckk” doesn’t. It’s 100% flexible: one Row may have 1,989 Columns whereas the other has 2. One Row may have a Column called “foo” whereas none of the rest do. This is the schemaless aspect of Cassandra.
. . . .
Q2. And you suggest serializing the sub-structure?
It's up to you. If you do not want to serialize, you should probably use a SuperColumn. My rule of thumb is this: if the value in a column represents a unit whose parts cannot be accessed independently, use a Column (that means serializing the value). If the column has fragmented subparts that may need to be accessed directly, use a SuperColumn.
