I am very new to AWS. I have been reading the DynamoDB SDK documentation, and the properties you can specify when creating a table with the SDK are far more numerous than the ones you pass when creating a table with the AWS CDK.
SDK example:
var AWS = require("aws-sdk");
AWS.config.update({
  region: "us-west-2",
  endpoint: "http://localhost:8000"
});
var dynamodb = new AWS.DynamoDB();
var params = {
  TableName: "Movies",
  KeySchema: [
    { AttributeName: "year", KeyType: "HASH" },  // Partition key
    { AttributeName: "title", KeyType: "RANGE" } // Sort key
  ],
  AttributeDefinitions: [
    { AttributeName: "year", AttributeType: "N" },
    { AttributeName: "title", AttributeType: "S" }
  ],
  ProvisionedThroughput: {
    ReadCapacityUnits: 10,
    WriteCapacityUnits: 10
  }
};
dynamodb.createTable(params, function(err, data) {
  if (err) {
    console.error("Unable to create table. Error JSON:", JSON.stringify(err, null, 2));
  } else {
    console.log("Created table. Table description JSON:", JSON.stringify(data, null, 2));
  }
});
CDK example:
import * as dynamodb from '@aws-cdk/aws-dynamodb';
const table = new dynamodb.Table(this, 'Hits', {
  partitionKey: { name: 'path', type: dynamodb.AttributeType.STRING }
});
Here are all the props you can set, which are more high-level, table-related settings:
https://docs.aws.amazon.com/cdk/api/latest/docs/@aws-cdk_aws-dynamodb.Table.html
So, for example, if I want to set the provisioned throughput in CDK, how do I do it? Or set AttributeDefinitions, or indexes?
Do I wait until table creation is done and then modify the table properties via the SDK UpdateTable call?
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB.html#updateTable-property
Billing Mode
DynamoDB supports two billing modes:
PROVISIONED - the default mode where the table and global secondary
indexes have configured read and write capacity.
PAY_PER_REQUEST - on-demand pricing and scaling. You only pay for what
you use and there is no read and write capacity for the table or its
global secondary indexes.
See the billingMode property in the CDK docs.
DynamoDB is pretty much entirely implemented in CDK, but some of the properties are a bit harder to find if you aren't very familiar with it.
Billing Mode is the property that selects provisioned or on-demand read/write capacity. It is an enum constant, so it would be used something like
billingMode: aws_dynamodb.BillingMode.PAY_PER_REQUEST
With CDK you often have to dig a little into the documentation to find what you want, but the mainstream services (Lambda, S3, DynamoDB) are fully implemented in CDK.
In any case, for other services that may not be, you can use any of the constructs whose names start with Cfn. These are escape hatches that let you write what is essentially the raw CloudFormation template from CDK.
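To make the correspondence concrete, here is a minimal sketch that maps the SDK createTable parameters from the question onto the shape of the CDK Table props. The helper name toCdkTableProps is hypothetical and not part of either library. The prop names it emits (partitionKey, sortKey, billingMode, readCapacity, writeCapacity) are the ones the CDK Table construct accepts, and the string values are assumed to match the CDK AttributeType and BillingMode enums. CDK has no separate AttributeDefinitions prop because attribute types are declared on the keys themselves, and indexes are added with methods such as addGlobalSecondaryIndex on the table rather than through a follow-up UpdateTable call.

```javascript
// Hypothetical helper (not part of the SDK or CDK): translate SDK createTable
// params into a plain object shaped like the CDK Table props.
function toCdkTableProps(params) {
  // Look up an attribute's type ("S", "N" or "B") in AttributeDefinitions.
  const typeOf = (name) =>
    params.AttributeDefinitions.find((d) => d.AttributeName === name).AttributeType;

  // Build a CDK-style { name, type } attribute for the given key type.
  const keyOf = (keyType) => {
    const entry = params.KeySchema.find((k) => k.KeyType === keyType);
    return entry
      ? { name: entry.AttributeName, type: typeOf(entry.AttributeName) }
      : undefined;
  };

  const props = {
    tableName: params.TableName,
    partitionKey: keyOf("HASH"),
    // DynamoDB defaults to provisioned capacity when no billing mode is given.
    billingMode: params.BillingMode || "PROVISIONED",
  };

  const sortKey = keyOf("RANGE");
  if (sortKey) props.sortKey = sortKey;

  if (params.ProvisionedThroughput) {
    props.readCapacity = params.ProvisionedThroughput.ReadCapacityUnits;
    props.writeCapacity = params.ProvisionedThroughput.WriteCapacityUnits;
  }
  return props;
}

// The "Movies" table from the SDK example in the question.
const movieProps = toCdkTableProps({
  TableName: "Movies",
  KeySchema: [
    { AttributeName: "year", KeyType: "HASH" },
    { AttributeName: "title", KeyType: "RANGE" },
  ],
  AttributeDefinitions: [
    { AttributeName: "year", AttributeType: "N" },
    { AttributeName: "title", AttributeType: "S" },
  ],
  ProvisionedThroughput: { ReadCapacityUnits: 10, WriteCapacityUnits: 10 },
});
console.log(movieProps);
```

In a real stack, an object like this would be passed as the third argument to new dynamodb.Table(this, 'Movies', ...), with the enum constants used in place of the raw strings.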
I am playing around with DynamoDB. I am not sure what the purpose of StreamSpecification is, or why we should or shouldn't use it. I have read the AWS StreamSpecification documentation, but it does not explain much about what it does.
MovieTable:
  Type: AWS::DynamoDB::Table
  Properties:
    BillingMode: PAY_PER_REQUEST
    AttributeDefinitions:
      - AttributeName: "Name"
        AttributeType: "S"
      - AttributeName: "Genre"
        AttributeType: "S"
      - AttributeName: "Rating"
        AttributeType: "N"
      - AttributeName: "DateReleased"
        AttributeType: "S"
    KeySchema:
      - AttributeName: "Name"
        KeyType: "HASH"
      - AttributeName: "Genre"
        KeyType: "RANGE"
      - AttributeName: "Rating"
        KeyType: "RANGE"
      - AttributeName: "DateReleased"
        KeyType: "RANGE"
    TimeToLiveSpecification:
      AttributeName: ExpireAfter
      Enabled: false
    SSESpecification:
      SSEEnabled: true
The StreamSpecification allows you to enable the optional DynamoDB Streams support for this table. DynamoDB Streams allow you to read all the changes to a table as a stream - which you can use for various reasons such as replicating the same changes to another table, checking for suspicious activity, and so on. You can read an introduction to the DynamoDB Streams feature here.
If you don't want to enable a stream on your table (and since you didn't know what this was, you probably don't :-)), you can just ignore StreamSpecification.
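As a rough illustration of what "reading the changes as a stream" looks like, here is a minimal sketch of a function that summarizes a batch of stream records the way a Lambda consumer would receive them. It assumes the stream is enabled with a view type that includes both old and new images and is wired to a Lambda function. The function name summarizeStreamRecords and the sample event are made up for the example, while the record fields (eventName, dynamodb.Keys, dynamodb.OldImage, dynamodb.NewImage) follow the DynamoDB Streams record format.

```javascript
// Summarize a batch of DynamoDB Streams records (hypothetical helper).
function summarizeStreamRecords(event) {
  return event.Records.map((record) => ({
    change: record.eventName,                 // "INSERT", "MODIFY" or "REMOVE"
    keys: record.dynamodb.Keys,               // primary key, in DynamoDB JSON
    before: record.dynamodb.OldImage || null, // item before the change, if any
    after: record.dynamodb.NewImage || null,  // item after the change, if any
  }));
}

// Made-up sample batch: one item inserted, one item deleted.
const sampleEvent = {
  Records: [
    {
      eventName: "INSERT",
      dynamodb: {
        Keys: { Name: { S: "Alien" } },
        NewImage: { Name: { S: "Alien" }, Rating: { N: "8" } },
      },
    },
    {
      eventName: "REMOVE",
      dynamodb: {
        Keys: { Name: { S: "Jaws" } },
        OldImage: { Name: { S: "Jaws" }, Rating: { N: "7" } },
      },
    },
  ],
};

const changes = summarizeStreamRecords(sampleEvent);
console.log(JSON.stringify(changes, null, 2));
```

A replication job would write the after image to a second table, while an auditing job would compare before and after.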
I've been trying to find an explanation for this situation but I didn't find any.
I have two DynamoDb tables, both with two key indexes, one is a HASH key and the other is a RANGE key.
In the table where both keys are strings, I can query the database with just the HASH key like this (using the node sdk):
const params = {
  TableName: process.env.DYNAMODB_TABLE,
  Key: { id: sessionId },
};
const { Item } = await dynamoDb.get(params);
However, the same operation on the other table throws an error: The number of conditions on the keys is invalid
Here are the two table schemas:
This table definition allows me to use the query above.
SessionsDynamoDbTable:
  Type: 'AWS::DynamoDB::Table'
  DeletionPolicy: Retain
  Properties:
    AttributeDefinitions:
      -
        AttributeName: userId
        AttributeType: S
      -
        AttributeName: id
        AttributeType: S
      -
        AttributeName: startDate
        AttributeType: S
    KeySchema:
      -
        AttributeName: userId
        KeyType: HASH
      -
        AttributeName: id
        KeyType: RANGE
    LocalSecondaryIndexes:
      - IndexName: byDate
        KeySchema:
          - AttributeName: userId
            KeyType: HASH
          - AttributeName: startDate
            KeyType: RANGE
        Projection:
          NonKeyAttributes:
            - endDate
            - name
          ProjectionType: INCLUDE
    BillingMode: PAY_PER_REQUEST
    TableName: ${self:provider.environment.DYNAMODB_TABLE}
This one does not allow me to make a query like the one above.
SessionsTable:
  Type: 'AWS::DynamoDB::Table'
  TimeToLiveDescription:
    AttributeName: expiresAt
    Enabled: true
  Properties:
    AttributeDefinitions:
      -
        AttributeName: id
        AttributeType: S
      -
        AttributeName: expiresAt
        AttributeType: N
    KeySchema:
      -
        AttributeName: id
        KeyType: HASH
      -
        AttributeName: expiresAt
        KeyType: RANGE
    BillingMode: PAY_PER_REQUEST
    TableName: ${self:provider.environment.DYNAMODB_TABLE}
I'm including the entire table definition because I don't know if secondary indexes can have an impact or not on this problem.
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value. Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
get(params, callback) ⇒ AWS.Request
Returns a set of attributes for the item with the given primary key by delegating to AWS.DynamoDB.getItem().
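In SessionsTable, id is the HASH key, while in SessionsDynamoDbTable id is the RANGE key. For SessionsDynamoDbTable you should provide the HASH key in addition to the RANGE key. In general, get() needs the complete primary key (HASH and RANGE together), whereas query() can match on the HASH key alone.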
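The difference between the two calls can be sketched as follows. This is a minimal example, assuming dynamoDb is an AWS.DynamoDB.DocumentClient from SDK v2. The helper name buildPartitionQuery and the sample table name and value are hypothetical. It builds query parameters that match on the partition (HASH) key only, which is the case get() cannot handle when the table also has a sort (RANGE) key.

```javascript
// Hypothetical helper: build DocumentClient.query params that match on the
// partition (HASH) key only, so the sort (RANGE) key can be left out.
function buildPartitionQuery(tableName, partitionKeyName, partitionKeyValue) {
  return {
    TableName: tableName,
    KeyConditionExpression: "#pk = :pk",
    // Placeholders avoid clashes with DynamoDB reserved words.
    ExpressionAttributeNames: { "#pk": partitionKeyName },
    ExpressionAttributeValues: { ":pk": partitionKeyValue },
  };
}

// Sample values for illustration only.
const queryParams = buildPartitionQuery("sessions", "id", "abc-123");
console.log(queryParams);

// With the SDK v2 DocumentClient this would be used roughly as:
//   const { Items } = await dynamoDb.query(queryParams).promise();
// Note that query returns a list (Items), not a single Item.
```

If exactly one item is wanted and the sort key value is known, passing both attributes in Key to get() remains the cheaper option.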
In my serverless.yml file, I have specified a DynamoDB resource, something to this effect (see below). I'd like to know two things:
Why is it that I'm not seeing these tables get created when they don't exist, forcing me to manually enter AWS console and do so myself?
In my source code (nodejs), I'm not sure how I'd reference a table specified in the yml file instead of hardcoding it.
The two questions above roll up into a singular problem, which is that I'd like to be able to specify the tables in the yml and then when doing a "deploy", have a different table set created per environment.
i.e. If I deploy to "--stage Prod", then table would be "MyTable_Prod". If I deploy to "--stage Dev", then table would be "MyTable_Dev", etc...
Figuring this out would go a long way to making deployments much smoother :).
The serverless.yml section of interest is as follows:
resources:
  Resources:
    DynamoDbTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: MyHappyFunTable
        AttributeDefinitions:
          - AttributeName: id
            AttributeType: S
        KeySchema:
          - AttributeName: id
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 5
          WriteCapacityUnits: 5
    DynamoDBIamPolicy:
      Type: AWS::IAM::Policy
      DependsOn: DynamoDbTable
      Properties:
        PolicyName: lambda-dynamodb
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - dynamodb:Query
                - dynamodb:Scan
                - dynamodb:GetItem
                - dynamodb:PutItem
                - dynamodb:UpdateItem
                - dynamodb:DeleteItem
              Resource: "arn:aws:dynamodb:${opt:region, self:provider.region}:*:table/${self:provider.environment.DYNAMODB_TABLE}"
        Roles:
          - Ref: IamRoleLambdaExecution
A sample of my horrid 'hardcoded' table names is as follows:
dbParms = {
  TableName: "MyTable_Dev",
  FilterExpression: "#tid = :tid and #owner = :owner",
  ProjectionExpression: "#id, #name",
  ExpressionAttributeNames: {
    "#tid": "tenantid",
    "#id": "id",
    "#name": "name",
    "#owner": "owner"
  },
  ExpressionAttributeValues: {
    ":tid": tenantId,
    ":owner": owner
  }
};
Note the "MyTable_Dev" ... ideally I'd like that to be something like "MyTable_" + {$opt.stage} ... or something to that effect.
In my source code (nodejs), I'm not sure how I'd reference a table specified in the yml file instead of hardcoding it.
I would put your stage in an environment variable that your Lambda function has access to.
In your serverless.yml,
provider:
  ...
  environment:
    STAGE: ${opt:stage}
Then, in your code you can access it through process.env.STAGE.
const tableName = 'MyTable_' + process.env.STAGE
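A slightly more defensive version of the same idea is sketched below. The helper name tableNameForStage is hypothetical. It assumes the STAGE variable is set as in the serverless.yml snippet above, and fails loudly when it is missing rather than producing a name like "MyTable_undefined".

```javascript
// Hypothetical helper: derive the per-stage table name from the STAGE variable.
// The env parameter defaults to process.env but can be replaced in tests.
function tableNameForStage(baseName, env = process.env) {
  const stage = env.STAGE;
  if (!stage) {
    throw new Error("STAGE environment variable is not set");
  }
  return `${baseName}_${stage}`;
}

// Passing the environment explicitly keeps the example independent of the shell.
console.log(tableNameForStage("MyTable", { STAGE: "Dev" }));  // prints "MyTable_Dev"
console.log(tableNameForStage("MyTable", { STAGE: "Prod" })); // prints "MyTable_Prod"
```

The result can then be used as the TableName value in request parameters such as dbParms.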
According to this issue, Cassandra's storage format was updated in 3.0.
If previously I could use cassandra-cli to see how the SSTable is built, to get something like this:
[default@test] list phonelists;
-------------------
RowKey: scott
=> (column=, value=, timestamp=1374684062860000)
=> (column=phonenumbers:bill, value='555-7382', timestamp=1374684062860000)
=> (column=phonenumbers:jane, value='555-8743', timestamp=1374684062860000)
=> (column=phonenumbers:patricia, value='555-4326', timestamp=1374684062860000)
-------------------
RowKey: john
=> (column=, value=, timestamp=1374683971220000)
=> (column=phonenumbers:doug, value='555-1579', timestamp=1374683971220000)
=> (column=phonenumbers:patricia, value='555-4326', timestamp=137468397122
What would the internal format look like in the latest version of Cassandra? Could you provide an example?
What utility can I use to see the internal representation of the table in Cassandra in the way listed above, but with the new SSTable format?
All that I have found on the internet is that the partition header now stores column names, that each row stores its clustering values, and that there are no duplicated values.
How can I look into it?
Prior to 3.0, sstable2json was a useful utility for understanding how data is organized in SSTables. This feature is not currently present in Cassandra 3.0, but there will be an alternative eventually. Until then, Chris Lohfink and I have developed an alternative to sstable2json (sstable-tools) for Cassandra 3.0, which you can use to understand how data is organized. There is some talk about bringing this into Cassandra proper in CASSANDRA-7464.
A key difference between the storage format of older versions of Cassandra and that of Cassandra 3.0 is that an SSTable was previously a representation of partitions and their cells (identified by their clustering and column name), whereas with Cassandra 3.0 an SSTable represents partitions and their rows.
You can read about these changes in more detail in this blog post by their primary developer, who does a great job of explaining them.
The largest benefit you will see is that in the general case your data size will shrink (in some cases by a large factor), as a lot of the overhead introduced by CQL has been eliminated by some key enhancements.
Here's an example showing the difference between C* 2 and 3.
Schema:
create keyspace demo with replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
use demo;
create table phonelists (user text, person text, phonenumbers text, primary key (user, person));
insert into phonelists (user, person, phonenumbers) values ('scott', 'bill', '555-7382');
insert into phonelists (user, person, phonenumbers) values ('scott', 'jane', '555-8743');
insert into phonelists (user, person, phonenumbers) values ('scott', 'patricia', '555-4326');
insert into phonelists (user, person, phonenumbers) values ('john', 'doug', '555-1579');
insert into phonelists (user, person, phonenumbers) values ('john', 'patricia', '555-4326');
sstable2json C* 2.2 output:
[
{"key": "scott",
"cells": [["bill:","",1451767903101827],
["bill:phonenumbers","555-7382",1451767903101827],
["jane:","",1451767911293116],
["jane:phonenumbers","555-8743",1451767911293116],
["patricia:","",1451767920541450],
["patricia:phonenumbers","555-4326",1451767920541450]]},
{"key": "john",
"cells": [["doug:","",1451767936220932],
["doug:phonenumbers","555-1579",1451767936220932],
["patricia:","",1451767945748889],
["patricia:phonenumbers","555-4326",1451767945748889]]}
]
sstable-tools toJson C* 3.0 output:
[
{
"partition" : {
"key" : [ "scott" ]
},
"rows" : [
{
"type" : "row",
"clustering" : [ "bill" ],
"liveness_info" : { "tstamp" : 1451768259775428 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-7382" }
]
},
{
"type" : "row",
"clustering" : [ "jane" ],
"liveness_info" : { "tstamp" : 1451768259793653 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-8743" }
]
},
{
"type" : "row",
"clustering" : [ "patricia" ],
"liveness_info" : { "tstamp" : 1451768259796202 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-4326" }
]
}
]
},
{
"partition" : {
"key" : [ "john" ]
},
"rows" : [
{
"type" : "row",
"clustering" : [ "doug" ],
"liveness_info" : { "tstamp" : 1451768259798802 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-1579" }
]
},
{
"type" : "row",
"clustering" : [ "patricia" ],
"liveness_info" : { "tstamp" : 1451768259908016 },
"cells" : [
{ "name" : "phonenumbers", "value" : "555-4326" }
]
}
]
}
]
The output is larger, but that is more a consequence of the tool than of the format. The key differences you can see are:
Data is now a collection of Partitions and their Rows (which include cells) instead of a collection of Partitions and their Cells.
Timestamps are now at the row level (liveness_info) instead of at the cell level. If some cells in a row differ in their timestamps, the new storage engine uses delta encoding to save space and associates the difference at the cell level. This also applies to TTLs. As you can imagine, this saves a lot of space if you have many non-key columns, as the timestamp does not need to be repeated.
The clustering information (in this case we are clustered on 'person') is now present at the Row level instead of cell level, which saves a bunch of overhead as the clustering column values don't have to be at the cell level.
I should note that in this particular example data case the benefits of the new storage engine aren't completely realized since there is only 1 non-clustering column.
There are a number of other improvements not shown here (like the ability to store row-level range tombstones).
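To make the regrouping concrete, here is a small sketch that takes cells in the shape of the 2.2 sstable2json output above and regroups them into rows shaped like the 3.0 output. It only illustrates the logical change. It is not how Cassandra or sstable-tools work internally, and the function name cellsToRows is made up. It assumes each cell name has the form "clustering:column" with no ":" inside the clustering value, and it takes the row timestamp from the first cell seen for that row.

```javascript
// Illustration only: regroup 2.x-style cells into 3.0-style rows.
function cellsToRows(partitions) {
  return partitions.map(({ key, cells }) => {
    const rows = new Map(); // clustering value -> row, in first-seen order
    for (const [name, value, timestamp] of cells) {
      const sep = name.indexOf(":");
      const clustering = name.slice(0, sep);
      const column = name.slice(sep + 1);
      if (!rows.has(clustering)) {
        rows.set(clustering, {
          type: "row",
          clustering: [clustering],
          liveness_info: { tstamp: timestamp }, // stored once per row
          cells: [],
        });
      }
      // An empty column name is the 2.x row marker; it has no cell of its own.
      if (column !== "") {
        rows.get(clustering).cells.push({ name: column, value });
      }
    }
    return { partition: { key: [key] }, rows: [...rows.values()] };
  });
}

// Two of scott's rows, copied from the 2.2 output above.
const regrouped = cellsToRows([
  {
    key: "scott",
    cells: [
      ["bill:", "", 1451767903101827],
      ["bill:phonenumbers", "555-7382", 1451767903101827],
      ["jane:", "", 1451767911293116],
      ["jane:phonenumbers", "555-8743", 1451767911293116],
    ],
  },
]);
console.log(JSON.stringify(regrouped, null, 2));
```

Four cells, each repeating a clustering value and a timestamp, become two rows that each state them once.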