I'm developing a AWS Lambda in python which will trigger by API Gateway and lambda will connect my snowflake. I'll process few CSV files via API Gateway to get some data from snowflake. Currently I'm using Python connector to connect Snowflake.
My issue is, if my csv has 100 records so it process the records recursively and it connects snowflake from lambda every time to process each record and its impacting on the performance.
Is there any method or mechanism that lambda can create a session for certain period of time and process all records in single connection.
As far as I know, connect() will automatically create a session that will last for a period of time. Once connected, you can use the cursor to execute multiple commands without needing to call connect() every time. Docs here. But I'm guessing you know this, and what you want is a single command instead of having to call multiple INSERT.
This is also possible, using a STAGE and COPY INTO command instead of INSERT. You can find an example from Snowflake documentation for bulk loading from AWS S3 here.
Related
I am re-designing a project I built a year ago when I was just starting to learn how to code. I used MEAN stack, back then and want to convert it to a PERN stack now. My AWS knowledge has also grown a bit and I'd like to expand on these new skills.
The application receives real-time data from an api which I clean up to write to a database as well as broadcast that data to connected clients.
To better conceptualize this question I will refer to the following items:
api-m1 : this receives the incoming data and passes it to my schema I then send it to my socket-server.
socket-server: handles the WSS connection to the application's front-end clients. It also will write this data to a postgres database which it gets from Scraper and api-m1. I would like to turn this into clusters eventually as I am using nodejs and will incorporate Redis. Then I will run it behind an ALB using sticky-sessions etc.. for multiple EC2 instances.
RDS: postgres table which socket-server writes incoming scraper and api-m1 data to. RDS is used to fetch the most recent data stored along with user profile config data. NOTE: RDS main data table will have max 120-150 UID records with 6-7 columns
To help better visualize this see img below.
From a database perspective, what would be the quickest way to write my data to RDS.
Assuming we have during peak times 20-40 records/s from the api-m1 + another 20-40 records/s from the scraper? After each day I tear down the database using a lambda function and start again (as the data is only temporary and does not need to be saved for any prolonged period of time).
1.Should I INSERT each record using a SERIAL id, then from the frontend fetch the most recent rows based off of the uid?
2.a Should I UPDATE each UID so i'd have a fixed N rows of data which I just search and update? (I can see this bottlenecking with my Postgres client.
2.b Still use UPDATE but do BATCHED updates (what issues will I run into if I make multiple clusters i.e will I run into concurrency problems where table record XYZ will have an older value overwrite a more recent value because i'm using BATCH UPDATE with Node Clusters?
My concern is UPDATES are slower than INSERTS and I don't want to make it as fast as possible. This section of the application isn't CPU heavy, and the rt-data isn't that intensive.
To make my comments an answer:
You don't seem to need SQL semantics for anything here, so I'd just toss RDS and use e.g. Redis (or DynamoDB, I guess) for that data store.
I am using an eventarc trigger to send messages to my cloud run instances. However, the issue is that I am unable to set an ACK deadline since there is no way to set any attributes. I have also tried to use just purely Pubsub however, I am unable to get the endpoint URL since it is only retrievable after it is created. I want to avoid having to split this into two different modules since it adds additional complexity, but I am unsure what else to do at this point.
Any help would be greatly appreciated!
Eventarc is backed on PubSub, and the max timeout of PubSub is 10 minutes. Therefore, even if the timeout was customizable on Eventarc, it wouldn't match your requirements.
You have several solutions
You can invoke a Cloud Run/Functions to create a Cloud Task, and the task call your Cloud Run (async) to perform the process in Background
You can invoke a Cloud Workflow that your Cloud Run (async) to perform the process in Background
You can invoke a Cloud Run/Functions that invoke the brand new Cloud Run job (with your processing code) to perform the process in Background
We are using microservicse approach in our backend
We have a nodejs service which provide a REST endpoint that grap some data from mongodb and apply some business logic to it.
We would need to add a schedule job every 15 min to sync the mongodb data with some 3rd party data source.
The question here is - dose adding to this microservicse a schedule job that would do that, consider anti pattern?
I was thinking from the other point of having a service that just do the sync job will create some over engineering for simple thing, another repo, build cycle deployment etc hardware, complicated maintenance etc
Would love to hear more thoughts around it
You can use an AWS CloudWatch event rule to schedule CloudWatch to generate an event every 15 minutes. Make a Lambda function a target of the CloudWatch event so it executes every 15 minutes to sync your data. Be aware of the VPC/NAT issues if calling your 3rd party resources from Lambda if they are external to your VPC/account.
Ideally, if it is like an ETL job, you can offload it to a Lambda function (if you are using AWS) or a serverless function to do the same.
Also, look into MongoDB Stitch that can do something similar.
I'm working on a NodeJS project and using pretty common AWS setup it seems. My ApiGateway receives call, triggers lambda A, then this lambda A triggers other lambdas, say B or C depending on params passed from ApiGateway.
Lambda A needs to access MongoDB and to avoid hassle with running MongoDB myself I decided to use mLab. ATM Lambda A is accessing MongoDB using NodeJS driver.
Now, not to start connection with every Lambda A execution I use connection pool, again, inside of Lambda A code, outside of handler I keep connection pool that allows me to reuse connections when Lambda A is invoked multiple times.
This seems to work fine.
However, I'm not sure how to deal with connections when Lambda A is invoking Lambda B and Lambda B needs to access mLab's MongoDB database.
Is it possible to pass connection pool somehow or Lambda B would have to keep its own connection pool?
I was thinking of using mLab's Data API that exposes most of the operations of MongoDB driver and so I could use HTTP calls e.g. GET and POST to run commands against database. It seems similar to RESTHeart it seems.
I'm leaning towards option 2 but on mLab's Data API it clearly states to avoid using REST api unless cannot connect using MongoDB driver directly:
The first method—the one we strongly recommend whenever possible for
added performance and functionality—is to connect using one of the
available MongoDB drivers. You do not need to use our API if you use
the driver. The second method, documented in this article, is to
connect via mLab’s RESTful Data API. Use this method only if you
cannot connect using a MongoDB driver.
Given all this how would it be best to approach it? 1 or 2 or is there any other option I should consider?
Unfortunately you won't be able to 'share' a mongo connection across lambdas because ultimately there's a 'physical' socket to the connection which is specific to that instance.
I think both of your solutions are good depending on usage.
If you tend to have steady average concurrency on both lambda A and B across an hour period (which is a bit of a rule of thumb as to how long AWS keeps a lambda instance alive), then having them both own their own static connections is a good solution. This is because the chances are that a request will reach an already started and connected lambda. I would also guess that node drivers for 'vanilla' mongo are more mature than those for the RESTFul Data API.
However if you get spikey or uneven load, then you might use the RESTFul Data API. This is because you'll be centralising the responsibility for managing the number of open connections to your instances to a single point, which under these conditions means you're less likely to be opening unneeded connections, or using all of your current capacity and having to wait for a new connection to be established.
Ultimately it's a game of probabilistic load balancing- either you 'pool' all your connections in a central place (the Data API) and become less affected by the usage of a single function at the expense of greater latency on individual operations, or you pool at a function level but are more exposed to cold-starts opening connections under uneven concurrency.
We would like to design/develop a DataConnector service using SparkStreaming.
The data connector service will allow the user to specify a source and destination database, and
the connection parameters for the same. Therefore, dynamic data connection requests need to be handled by the system. We would like to have one streaming job that can handle all connection requests made to the Data Connector service.
We propose to handle this in the following manner:
All connection requests made to the DataConnector service, will be updated in our "meta-store" database.
We will have a custom receiver which fetches all active connection requests from the meta-store and use the DStream results (database names, connection strings, etc) of the receiver, to fetch data from multiple databases and process the same.
Any problems with this approach?