I have enabled Kinesis Data Streams on a DynamoDB table and configured a Delivery Stream to store the stream as audit logs in an S3 bucket.
I then query the S3 bucket from Amazon Athena.
Everything seems to be working, but the userIdentity property is always empty (null), which makes the audit pointless to me if I cannot capture who performed the transaction. Is this property only populated when a record is deleted from DynamoDB with TTL enabled?
Questions:
How do I capture the user id / name of the user responsible for adding, updating, or deleting a record via the application or directly via DynamoDB in AWS console?
(Less important question) How do I format the stream before it hits the s3 bucket so I can include the record id being updated?
Also, please note that I have a Lambda function on the Delivery Stream that simply appends a newline to each record as a delimiter. If I wanted to do more processing/formatting of the stream, should I do it in this Lambda when the records hit the Delivery Stream? Or should I do it in a trigger on the DynamoDB table itself, before they reach the Delivery Stream?
DynamoDB does not include the user's details in the data stream. This needs to be implemented by the application: write the user's identity as an item attribute, and you can then read it from the NewImage if the stream provides it.
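A minimal sketch of that idea: the application stamps each write with the acting user as a regular item attribute, and the consumer pulls it back out of the stream record's NewImage. The attribute name `updatedBy` is invented here for illustration, and the low-level `{ S: ... }` attribute-value shape is what DynamoDB stream images use.

```javascript
// Hypothetical helper: add the acting user's id as a normal attribute
// ("updatedBy" is a name made up for this example) when putting the item.
function putItemParams(tableName, item, userId) {
  return {
    TableName: tableName,
    Item: { ...item, updatedBy: { S: userId } }, // low-level attribute-value format
  };
}

// Downstream (e.g. in a Firehose transformation Lambda), read the user
// back out of the stream record's NewImage.
function userFromStreamImage(newImage) {
  return newImage && newImage.updatedBy ? newImage.updatedBy.S : null;
}

// Example image, shaped the way DynamoDB delivers it in a stream record:
const image = { pk: { S: 'order#1' }, updatedBy: { S: 'alice' } };
console.log(userFromStreamImage(image)); // 'alice'
```

Note this only covers writes that go through your application; edits made directly in the AWS console will not carry this attribute unless whoever edits sets it by hand.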
Related
I have a Node + Express application running on an EC2 server and am trying to add a new search feature to it. I am thinking about using a Lambda function and Elasticsearch. When the client fires a request to update a table in DynamoDB, the Lambda function will react to this event and update the Elasticsearch index.
I know lambda runs serverless whereas my original application runs within a server. Can anybody give me some hints about how to do it or let me know if it's even possible?
The link between a DynamoDB update and a Lambda is "DynamoDB Streams".
The documentation says, in part,
Amazon DynamoDB is integrated with AWS Lambda so that you can create
triggers—pieces of code that automatically respond to events in
DynamoDB Streams. With triggers, you can build applications that react
to data modifications in DynamoDB tables.
If you enable DynamoDB Streams on a table, you can associate the
stream Amazon Resource Name (ARN) with an AWS Lambda function that you
write. Immediately after an item in the table is modified, a new
record appears in the table's stream. AWS Lambda polls the stream and
invokes your Lambda function synchronously when it detects new stream
records.
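A sketch of what such a stream-triggered function might look like. The Elasticsearch client, the index name `my-index`, and the assumption that items have an `id` attribute are all placeholders for illustration; the unmarshalling covers only a few DynamoDB attribute types.

```javascript
// Convert a low-level DynamoDB stream image ({ S: ... }, { N: ... }, ...)
// into a plain JS document for indexing.
function unmarshal(image) {
  const doc = {};
  for (const [key, value] of Object.entries(image)) {
    if ('S' in value) doc[key] = value.S;
    else if ('N' in value) doc[key] = Number(value.N);
    else if ('BOOL' in value) doc[key] = value.BOOL;
    // other DynamoDB types (M, L, SS, NS, ...) omitted for brevity
  }
  return doc;
}

// Index inserts/updates; the ES client is injected so the flow is testable.
async function processEvent(event, esClient) {
  for (const record of event.Records) {
    if (record.eventName === 'INSERT' || record.eventName === 'MODIFY') {
      const doc = unmarshal(record.dynamodb.NewImage);
      await esClient.index({ index: 'my-index', id: doc.id, body: doc });
    }
  }
}

// The real Lambda entry point would be roughly:
// exports.handler = (event) => processEvent(event, elasticsearchClient);
```

Your Express application keeps writing to DynamoDB as it does today; nothing server-side needs to know about the Lambda, which is wired up entirely through the stream's event source mapping.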
Let's suppose I have a database on DynamoDB, and I am currently using streams and lambda functions to send that data to Elasticsearch.
Here's the thing, supposing the data is saved successfully on DynamoDB, is there a way for me to be 100% sure that the data has been saved on Elasticsearch as well?
Considering I have a function to save that data to DDB, is there a way for me to communicate with the Lambda function triggered by DDB before returning a status code, so I can receive confirmation first?
I want to do that in order to return ok both from my function and the lambda function at the same time.
This doesn't look like the correct approach for this problem. We generally use DynamoDB Streams + Lambda for operations that are async in nature and when we don't have to communicate the status of this Lambda execution to the client.
So I suggest the following two approaches that are the closest to what you are trying to achieve -
Make the operation completely synchronous. i.e., do the DynamoDB insert and ElasticSearch insert in the same call (without any Ddb Stream and Lambda triggers). This will ensure that you return the correct status of both writes to the client. Also, in case the ES insert fails, you have an option to revert the Ddb write and then return the complete status as failed.
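A sketch of that first approach, with the DynamoDB and Elasticsearch clients injected so the write-then-revert flow is easy to see. The client method names (`put`, `delete`, `index`) and the `id` field are illustrative, not any particular SDK's API.

```javascript
// Synchronous dual write: DynamoDB first, then Elasticsearch, reverting
// the DynamoDB write if the ES insert fails.
async function saveRecord(ddb, es, record) {
  await ddb.put(record);            // 1. DynamoDB write
  try {
    await es.index(record);         // 2. Elasticsearch write
  } catch (err) {
    await ddb.delete(record.id);    // 3. revert DDB so the stores stay in sync
    return { ok: false, reason: 'es-insert-failed' };
  }
  return { ok: true };
}
```

The API handler can then return the combined status of both writes to the client in a single response.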
The first approach obviously adds to the latency of the call. So you can continue with your original approach, but let the client know about it. It will work as follows -
Client calls your API.
API inserts record into Ddb and returns to the client.
The client receives the status and displays a message to the user that their request is being processed.
The client then starts polling for the status of the ES insert via another API.
Meanwhile, the Ddb stream triggers the ES insert Lambda fn and completes the ES write.
The poller on the client comes to know about the successful insert into ES and displays a final success message to the user.
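The steps above can be sketched as three small functions. The in-memory `Map` is a stand-in for wherever you actually persist the status (a DynamoDB status attribute, a cache, etc.); in the real flow, the stream-triggered ES Lambda is what would call the "mark done" step.

```javascript
// Stand-in status store; replace with a real persistent store.
const status = new Map();

// Step 2-3: API inserts into DDB and immediately tells the client
// the request is being processed.
function acceptWrite(id) {
  status.set(id, 'processing');
  return { id, status: 'processing' };
}

// Step 6: the stream-triggered Lambda records that the ES write finished.
function markIndexed(id) {
  status.set(id, 'done');
}

// Step 5/7: the status API the client polls.
function pollStatus(id) {
  return { id, status: status.get(id) || 'unknown' };
}
```

The client simply calls the poll endpoint on an interval until it sees `done` (with some timeout or retry cap, since the stream processing is eventually consistent).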
I am using an AWS Lambda function to stream the query data directly to S3 using nodejs pipeline API, the mysql dependency to create a read stream and the "s3-streams" dependency to create a write s3 stream.
And it works fine for small queries, but when I'm querying more than 100k rows it shows me that there are too many connections:
Error: ER_CON_COUNT_ERROR: Too many connections
pseudo-code:
await pipeline(connection.query(sql).stream(), transformData, writeStreamToS3)
I want simply to disable lambda retries when it's launched by a kinesis trigger. If the lambda fails or exit, I don't want it to retry.
From AWS Lambda Retry Behavior - AWS Lambda:
Poll-based (or pull model) event sources that are stream-based: These consist of Kinesis Data Streams or DynamoDB. When a Lambda function invocation fails, AWS Lambda attempts to process the erring batch of records until the time the data expires, which can be up to seven days.
The exception is treated as blocking, and AWS Lambda will not read any new records from the shard until the failed batch of records either expires or is processed successfully. This ensures that AWS Lambda processes the stream events in order.
There does not appear to be any configuration options to change this behaviour.
How about handling your error properly so that the invocation will still succeed and Lambda will not retry it anymore?
In NodeJS, it would be something like this...
export const handler = (event, context) => {
  return doWhateverAsync()
    .then(() => someSuccessfulValue)
    .catch((err) => {
      // Log the error at least.
      console.log(err)
      // But still return something so Lambda won't retry.
      return someSuccessfulValue
    })
}
If you are using a Lambda Event Source Mapping to trigger your Lambda with a batch of records from kinesis stream shard then you can configure the maximum number of retries that will be made by the event source mapping.
Another option is to configure the maximum age of the record that is sent to the function.
Retry attempts – The maximum number of times that Lambda retries when the function returns an error. This doesn't apply to service errors or throttles where the batch didn't reach the function.
Maximum age of record – The maximum age of a record that Lambda sends to your function.
A good practice is to configure a failure destination. This is usually an SQS queue or SNS topic; details of the batch that caused the invocation to fail are stored there.
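A sketch of setting all three of the above via the Lambda `UpdateEventSourceMapping` API. The mapping UUID and queue ARN are placeholders, and the retry/age numbers are example values; building the params in a separate function just makes the shape easy to check.

```javascript
// Params for UpdateEventSourceMapping: cap retries, cap record age,
// and send failed batch details to an on-failure destination.
function retryConfigParams(mappingUuid, failureArn) {
  return {
    UUID: mappingUuid,
    MaximumRetryAttempts: 2,         // stop retrying a failing batch after 2 attempts
    MaximumRecordAgeInSeconds: 3600, // skip records older than an hour
    DestinationConfig: {
      OnFailure: { Destination: failureArn }, // e.g. an SQS queue ARN
    },
  };
}

// With the AWS SDK for JavaScript (v2), roughly:
// const lambda = new AWS.Lambda();
// await lambda.updateEventSourceMapping(
//   retryConfigParams('esm-uuid-here', 'arn:aws:sqs:...:failed-batches')
// ).promise();
```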
See https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html#services-kinesis-errors for more info.
I have a dataset in parquet in S3 partitioned by date (dt) with oldest date stored in AWS Glacier to save some money. For instance, we have...
s3://my-bucket/my-dataset/dt=2017-07-01/ [in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-09/ [in glacier]
s3://my-bucket/my-dataset/dt=2017-07-10/ [not in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-24/ [not in glacier]
I want to read this dataset, but only the subset of dates that are not yet in Glacier, e.g.:
val from = "2017-07-15"
val to = "2017-08-24"
val path = "s3://my-bucket/my-dataset/"
val X = spark.read.parquet(path).where(col("dt").between(from, to))
Unfortunately, I get the following exception:
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
It seems that Spark does not like a partitioned dataset when some partitions are in Glacier. I could always read each date individually, add the column with the current date, and reduce(_ union _) at the end, but that is ugly as hell and it should not be necessary.
Is there any tip to read available data in the datastore even with old data in glacier?
The error you are getting is not related to Apache Spark; it comes from the Glacier service. In short, S3 objects in the Glacier storage class are not accessible in the same way as normal objects: they need to be restored from Glacier before they can be read.
Apache Spark cannot directly handle a table or partition mapped to S3 objects in Glacier storage.
java.io.IOException:
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
The operation is not valid for the object's storage class (Service:
Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request
ID: C444D508B6042138)
When S3 transitions an object from the STANDARD, STANDARD_IA, or REDUCED_REDUNDANCY storage class to GLACIER, the object's data is stored in Glacier, where it is not directly readable by you, and S3 bills only Glacier storage rates.
It is still an S3 object, but it has the GLACIER storage class.
When you need to access one of these objects, you initiate a restore, which creates a temporary copy in S3.
Restoring the data into the S3 bucket before reading it into Apache Spark will resolve your issue.
https://aws.amazon.com/s3/storage-classes/
Note: Apache Spark, AWS Athena, etc. cannot read objects directly from Glacier; if you try, you will get a 403 error.
If you archive objects using the Glacier storage option, you must
inspect the storage class of an object before you attempt to retrieve
it. The customary GET request will work as expected if the object is
stored in S3 Standard or Reduced Redundancy (RRS) storage. It will
fail (with a 403 error) if the object is archived in Glacier. In this
case, you must use the RESTORE operation (described below) to make
your data available in S3.
https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/
The 403 error is due to the fact that you cannot read an object that is archived in Glacier (source).
Reading Files from Glacier
If you want to read files from Glacier, you need to restore them to S3 before using them in Apache Spark. A copy will be available on S3 for the duration specified in the restore command (for details see here). You can use the S3 console, the CLI, or any language SDK to do that.
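A sketch of issuing that restore from code rather than the console, using the S3 `RestoreObject` API. The bucket and dataset prefix follow the question; the part-file name, the `Days` value, and the `Standard` retrieval tier are example assumptions.

```javascript
// Params for S3 restoreObject: make a temporary readable copy of one
// Glacier-archived object for the given number of days.
function restoreParams(bucket, key) {
  return {
    Bucket: bucket,
    Key: key,
    RestoreRequest: {
      Days: 7,                                    // keep the temporary S3 copy for 7 days
      GlacierJobParameters: { Tier: 'Standard' }, // or 'Expedited' / 'Bulk'
    },
  };
}

// With the AWS SDK for JavaScript (v2), roughly (part-file name invented):
// const s3 = new AWS.S3();
// await s3.restoreObject(
//   restoreParams('my-bucket', 'my-dataset/dt=2017-07-01/part-00000.parquet')
// ).promise();
```

You would issue one restore per archived object under each partition prefix, then wait for the restores to complete before pointing Spark at the data.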
Discarding some Glacier files that you do not want to restore
Let's say you do not want to restore all the files from Glacier and would rather discard them during processing. From Spark 2.1.1 and 2.2.0, you can ignore those files (which fail with an IO/Runtime exception) by setting spark.sql.files.ignoreCorruptFiles to true (source).
If you define your table through Hive and use the Hive metastore catalog to query it, it won't try to read the non-selected partitions.
Take a look at the spark.sql.hive.metastorePartitionPruning setting
try this setting:
ss.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
or
add the spark-defaults.conf config:
spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER
The S3 connectors from Amazon (s3://) and the ASF (s3a://) don't work with Glacier. Certainly nobody tests s3a against Glacier, and if there were problems, you'd be left to fix them yourself. Just copy the data into S3 or onto local HDFS and then work with it there.