Streaming query results directly to S3 - node.js

I am using an AWS Lambda function to stream query data directly to S3 with the Node.js pipeline API, the mysql dependency to create a read stream, and the s3-streams dependency to create an S3 write stream.
This works fine for small queries, but when I query more than 100k rows I get a "too many connections" error:
Error: ER_CON_COUNT_ERROR: Too many connections
pseudo-code:
await pipeline(connection.query(sql).stream(), transformData, writeStreamToS3)
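For reference, here is a runnable sketch of the same flow (a sketch only, not the asker's code): it assumes the mysql package and swaps the s3-streams dependency for the Upload helper from @aws-sdk/lib-storage; the query, transform, bucket, and key are illustrative.

import { PassThrough, Transform } from "stream";
import { pipeline } from "stream/promises";
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import * as mysql from "mysql";

// One connection created outside the handler so warm invocations reuse it
// instead of opening a new connection every time.
const connection = mysql.createConnection({
  host: process.env.DB_HOST,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
});
const s3 = new S3Client({});

export const handler = async (): Promise<void> => {
  // Row stream from the query; rows are emitted as they arrive, not buffered.
  const rows = connection.query("SELECT * FROM my_table").stream();

  // Illustrative transform: serialize each row as one JSON line.
  const transformData = new Transform({
    objectMode: true,
    transform(row, _enc, cb) {
      cb(null, JSON.stringify(row) + "\n");
    },
  });

  // The upload consumes whatever the pipeline writes into `body`.
  const body = new PassThrough();
  const upload = new Upload({
    client: s3,
    params: { Bucket: "my-bucket", Key: "exports/result.jsonl", Body: body },
  });

  await Promise.all([pipeline(rows, transformData, body), upload.done()]);
};

One design note: keeping the connection outside the handler lets warm Lambda invocations reuse it rather than open a new one on every call.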

Related

Why is the userIdentity property always empty in AWS’ Kinesis DataStream?

I have enabled a Kinesis Data Stream on DynamoDB and have configured a Delivery Stream to store the stream as audit logs in an S3 bucket.
I then query the S3 bucket from Amazon Athena.
Everything seems to be working, but the userIdentity property is always empty (null), which makes the audit pointless if I cannot capture who performed the transaction. Is this property only populated when a record is deleted from DynamoDB and TTL is enabled?
Questions:
How do I capture the user id / name of the user responsible for adding, updating, or deleting a record via the application or directly via DynamoDB in AWS console?
(Less important question) How do I format the stream before it hits the S3 bucket so I can include the record id being updated?
Also note that I have a Lambda function on the Delivery Stream that simply adds a newline to each record as a delimiter. If I wanted to do more processing/formatting of the stream, should I be doing it in that Lambda on the Delivery Stream, or in a trigger on the DynamoDB table itself before it reaches the Delivery Stream?
DynamoDB does not include the user details in the data stream. This needs to be implemented by the application; then you can get the values from the NewImage if provided by the stream.
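As a sketch of that suggestion (assuming a Node.js stack; the table name, the updatedBy attribute, and the handler shape are illustrative): the application stamps the caller's identity onto the item, and the stream consumer reads it back from NewImage.

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import type { DynamoDBStreamEvent } from "aws-lambda";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Application side: stamp the authenticated user onto every write so the
// identity travels with the item into the stream.
export async function saveOrder(order: { id: string; total: number }, callerUserId: string) {
  await ddb.send(
    new PutCommand({
      TableName: "Orders", // illustrative table name
      Item: { ...order, updatedBy: callerUserId, updatedAt: new Date().toISOString() },
    })
  );
}

// Stream side: the identity is now available in the NewImage of each record.
export const streamHandler = async (event: DynamoDBStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    const newImage = record.dynamodb?.NewImage;
    const updatedBy = newImage?.updatedBy?.S; // who performed the write
    console.log(record.eventName, record.dynamodb?.Keys, "by", updatedBy);
  }
};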
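Regarding the delimiter Lambda mentioned in the question: one option is to keep that kind of formatting in the Kinesis Data Firehose transformation Lambda on the Delivery Stream, since it receives each record base64-encoded and returns it with a result status, so you can append the newline and surface fields such as the record keys there rather than on the DynamoDB table. A hedged sketch of that shape (the payload fields are illustrative):

import type { FirehoseTransformationEvent, FirehoseTransformationResult } from "aws-lambda";

export const handler = async (
  event: FirehoseTransformationEvent
): Promise<FirehoseTransformationResult> => {
  const records = event.records.map((record) => {
    // Each record arrives base64-encoded; decode, reshape, re-encode.
    const payload = JSON.parse(Buffer.from(record.data, "base64").toString("utf8"));

    // Illustrative reshaping: surface the keys and append the newline delimiter.
    const out = {
      keys: payload?.dynamodb?.Keys,       // record id being updated, if present
      eventName: payload?.eventName,
      newImage: payload?.dynamodb?.NewImage,
    };

    return {
      recordId: record.recordId,
      result: "Ok" as const,
      data: Buffer.from(JSON.stringify(out) + "\n", "utf8").toString("base64"),
    };
  });

  return { records };
};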

How to build search functionality with ElasticSearch and a Lambda function into your existing project

I have a Node + Express application running on an EC2 server and I am trying to add a new search feature to it. I am thinking about using a Lambda function and ElasticSearch: when the client fires a request to update a table in DynamoDB, the Lambda function will react to this event and update the ElasticSearch index.
I know Lambda runs serverless whereas my original application runs within a server. Can anybody give me some hints about how to do it, or let me know if it's even possible?
The link between a DynamoDB update and a Lambda is "DynamoDB Streams".
The documentation says, in part,
Amazon DynamoDB is integrated with AWS Lambda so that you can create triggers—pieces of code that automatically respond to events in DynamoDB Streams. With triggers, you can build applications that react to data modifications in DynamoDB tables.
If you enable DynamoDB Streams on a table, you can associate the stream Amazon Resource Name (ARN) with an AWS Lambda function that you write. Immediately after an item in the table is modified, a new record appears in the table's stream. AWS Lambda polls the stream and invokes your Lambda function synchronously when it detects new stream records.
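To make that concrete, here is a hedged sketch of such a trigger, assuming a Node.js Lambda, the v8 @elastic/elasticsearch client, and an "id" key attribute; the index name and endpoint are illustrative.

import type { DynamoDBStreamEvent } from "aws-lambda";
import { Client } from "@elastic/elasticsearch";
import { unmarshall } from "@aws-sdk/util-dynamodb";

const es = new Client({ node: process.env.ES_ENDPOINT }); // illustrative endpoint

export const handler = async (event: DynamoDBStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    const keys = unmarshall(record.dynamodb!.Keys as any);
    const id = String(keys.id); // assumes an "id" key attribute

    if (record.eventName === "REMOVE") {
      await es.delete({ index: "my-table", id }, { ignore: [404] });
    } else {
      // INSERT and MODIFY both upsert the latest image into the index.
      const doc = unmarshall(record.dynamodb!.NewImage as any);
      await es.index({ index: "my-table", id, document: doc });
    }
  }
};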

Troubleshoot DynamoDB to Elastic Search

Let's suppose I have a database on DynamoDB, and I am currently using streams and lambda functions to send that data to Elasticsearch.
Here's the thing: supposing the data is saved successfully on DynamoDB, is there a way for me to be 100% sure that the data has been saved on Elasticsearch as well?
Considering I have a function that saves the data on DDB, is there a way for me to communicate with the Lambda function triggered by DDB before returning a status code, so I can receive confirmation before returning?
I want to do that in order to return OK from both my function and the Lambda function at the same time.
This doesn't look like the correct approach for this problem. We generally use DynamoDB Streams + Lambda for operations that are async in nature and when we don't have to communicate the status of this Lambda execution to the client.
So I suggest the following two approaches that are the closest to what you are trying to achieve -
Make the operation completely synchronous, i.e. do the DynamoDB insert and the ElasticSearch insert in the same call (without any DDB Stream and Lambda triggers). This ensures you return the correct status of both writes to the client. Also, in case the ES insert fails, you have the option to revert the DDB write and then report the whole operation as failed. (A minimal sketch of this appears after the list below.)
The first approach obviously adds to the latency of the call. So you can continue with your original approach, but let the client know about it. It works as follows:
Client calls your API.
API inserts record into Ddb and returns to the client.
The client receives the status and displays a message to the user that their request is being processed.
The client then starts polling for the status of the ES insert via another API.
Meanwhile, the Ddb stream triggers the ES insert Lambda fn and completes the ES write.
The poller on the client comes to know about the successful insert into ES and displays a final success message to the user.
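For reference, a minimal sketch of the first (synchronous) approach, assuming a Node.js handler, the AWS SDK v3 document client, and the v8 @elastic/elasticsearch client; the table, index, and key names are illustrative.

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, DeleteCommand } from "@aws-sdk/lib-dynamodb";
import { Client } from "@elastic/elasticsearch";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const es = new Client({ node: process.env.ES_ENDPOINT });

export async function saveItem(item: { id: string; [k: string]: unknown }) {
  // 1. Write to DynamoDB first.
  await ddb.send(new PutCommand({ TableName: "Items", Item: item }));

  try {
    // 2. Write to Elasticsearch in the same call, so the caller gets one combined status.
    await es.index({ index: "items", id: item.id, document: item });
  } catch (err) {
    // 3. ES failed: revert the DynamoDB write and report the whole operation as failed.
    await ddb.send(new DeleteCommand({ TableName: "Items", Key: { id: item.id } }));
    throw new Error("Indexing failed, DynamoDB write reverted");
  }

  return { status: "ok" };
}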

How to know in AWS Lambda code that a Redshift table loaded successfully

I have a requirement to load data from one Redshift cluster to another, each in a different region.
For this, I first create the table in the target schema and then run an UNLOAD command on the source, which puts files in an S3 bucket.
To load the data from S3 into the target Redshift schema, I will use Python code in AWS Lambda. The Lambda function will be triggered on the S3 path where the files land when the UNLOAD command is executed.
After ingestion I need to run a transformation query that puts data into fact and dimension tables. How can I know that a particular target table loaded successfully?
The Lambda Python code will be triggered for each file and will copy the data to the target table. Once I know the status, I will issue the transformation query.
Is there any feature in AWS Lambda that will help me to know the status of the COPY command?
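For illustration only (this is not answered in the thread): one way to observe the COPY outcome from Lambda is to issue it through the Redshift Data API and poll the statement status before running the transformation query. A sketch, shown in TypeScript even though the question plans to use Python; the cluster, database, role, and table names are placeholders.

import {
  RedshiftDataClient,
  ExecuteStatementCommand,
  DescribeStatementCommand,
} from "@aws-sdk/client-redshift-data";
import type { S3Event } from "aws-lambda";

const client = new RedshiftDataClient({});

export const handler = async (event: S3Event): Promise<void> => {
  const { bucket, object } = event.Records[0].s3; // object.key may need URL-decoding

  // Kick off the COPY for the file that triggered this invocation.
  const { Id } = await client.send(
    new ExecuteStatementCommand({
      ClusterIdentifier: "target-cluster", // placeholder
      Database: "analytics",               // placeholder
      DbUser: "loader",                    // placeholder
      Sql: `COPY staging.my_table FROM 's3://${bucket.name}/${object.key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' FORMAT AS PARQUET;`,
    })
  );

  // Poll until the statement finishes, then decide whether to run the transformation.
  while (true) {
    const { Status, Error: err } = await client.send(new DescribeStatementCommand({ Id }));
    if (Status === "FINISHED") break;
    if (Status === "FAILED" || Status === "ABORTED") throw new Error(`COPY failed: ${err}`);
    await new Promise((r) => setTimeout(r, 2000));
  }

  // ...issue the fact/dimension transformation query here in the same way.
};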

Spark read partitioned data in S3 partly in Glacier

I have a dataset in Parquet on S3, partitioned by date (dt), with the oldest dates stored in AWS Glacier to save some money. For instance, we have...
s3://my-bucket/my-dataset/dt=2017-07-01/ [in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-09/ [in glacier]
s3://my-bucket/my-dataset/dt=2017-07-10/ [not in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-24/ [not in glacier]
I want to read this dataset, but only a subset of dates that are not yet in Glacier, e.g.:
val from = "2017-07-15"
val to = "2017-08-24"
val path = "s3://my-bucket/my-dataset/"
val X = spark.read.parquet(path).where(col("dt").between(from, to))
Unfortunately, I get the exception:
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
It seems that Spark does not like partitioned datasets when some partitions are in Glacier. I could always read each date's path specifically, add the date column back, and reduce(_ union _) at the end, but that is ugly as hell and should not be necessary.
Is there any tip to read the available data in the datastore even though the old data is in Glacier?
The error you are getting is not related to Apache Spark; you are getting the exception because of the Glacier service. In short, S3 objects in the Glacier storage class are not accessible in the same way as normal objects; they need to be restored from Glacier before they can be read.
Apache Spark cannot directly handle a table/partition mapped to S3 objects in Glacier storage.
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
When S3 moves an object from the STANDARD, STANDARD_IA, or REDUCED_REDUNDANCY storage classes to the GLACIER storage class, the object is stored in Glacier, its content is not directly visible to you, and S3 bills only Glacier storage rates.
It is still an S3 object, but it has the GLACIER storage class.
When you need to access one of these objects, you initiate a restore, which creates a temporary copy in S3.
Restoring the data into the S3 bucket and then reading it with Apache Spark will resolve your issue.
https://aws.amazon.com/s3/storage-classes/
Note: Apache Spark, AWS Athena, etc. cannot read objects directly from Glacier; if you try, you will get a 403 error.
If you archive objects using the Glacier storage option, you must inspect the storage class of an object before you attempt to retrieve it. The customary GET request will work as expected if the object is stored in S3 Standard or Reduced Redundancy (RRS) storage. It will fail (with a 403 error) if the object is archived in Glacier. In this case, you must use the RESTORE operation (described below) to make your data available in S3.
https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/
The 403 error is due to the fact that you cannot read an object that is archived in Glacier (source).
Reading Files from Glacier
If you want to read files from Glacier, you need to restore them to S3 before using them in Apache Spark. A copy will be available on S3 for the number of days specified in the restore request (for details see here); you can use the S3 console, the CLI, or any SDK to do that.
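Since the restore can be done from any SDK, here is a hedged sketch using the AWS SDK for JavaScript v3; the bucket, prefix, number of days, and retrieval tier are illustrative (pagination is omitted).

import {
  S3Client,
  ListObjectsV2Command,
  RestoreObjectCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Request a temporary copy of every archived object under one partition prefix.
async function restorePartition(bucket: string, prefix: string): Promise<void> {
  const listed = await s3.send(new ListObjectsV2Command({ Bucket: bucket, Prefix: prefix }));

  for (const obj of listed.Contents ?? []) {
    if (obj.StorageClass !== "GLACIER") continue; // already readable
    await s3.send(
      new RestoreObjectCommand({
        Bucket: bucket,
        Key: obj.Key!,
        RestoreRequest: {
          Days: 7,                                    // how long the temporary copy stays in S3
          GlacierJobParameters: { Tier: "Standard" }, // Expedited | Standard | Bulk
        },
      })
    );
  }
}

// e.g. restorePartition("my-bucket", "my-dataset/dt=2017-07-01/")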
Discarding some Glacier files that you do not want to restore
Let's say you do not want to restore all the files from Glacier and would rather discard them during processing. From Spark 2.1.1 / 2.2.0 onward you can ignore those files (which fail with IO/Runtime exceptions) by setting spark.sql.files.ignoreCorruptFiles to true (source).
If you define your table through Hive and use the Hive metastore catalog to query it, it won't try to read the non-selected partitions.
Take a look at the spark.sql.hive.metastorePartitionPruning setting.
Try this setting:
ss.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
or add the config to spark-defaults.conf:
spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER
The S3 connectors from Amazon (s3://) and the ASF (s3a://) don't work with Glacier. Certainly nobody tests s3a against Glacier, and if there were problems, you'd be left to fix them yourself. Just copy the data into S3 or onto local HDFS and then work with it there.
