Is it possible to build a data pipeline in AWS to transfer data between two different RDS MySQL instances? The transfer would take place once per day (although not necessarily at the same time every day).
I am interested in copying full tables from one instance to another, but the documentation for the Data Pipeline service doesn't seem to cover this use case.
Thanks in advance.
If one is a copy of the other, you can use AWS Database Migration Service (DMS), a different Amazon service.
If you choose "ongoing replication", the service will keep updating your target database throughout the day with changes from the source database.
I suspect that if you start making changes to the target database that make it diverge from the source database, you will run into problems.
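As a rough illustration, here is what setting up and kicking off such a copy could look like with boto3. It assumes the source/target endpoints and a replication instance already exist in DMS; all ARNs, names and the schema are placeholders:

```python
import json
import boto3

dms = boto3.client("dms")

# One-time setup: a full-load task that copies every table in the schema.
task = dms.create_replication_task(
    ReplicationTaskIdentifier="daily-mysql-copy",
    SourceEndpointArn="arn:aws:dms:region:account:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:region:account:endpoint:TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:region:account:rep:INSTANCE",  # placeholder
    MigrationType="full-load",  # "full-load-and-cdc" would give ongoing replication
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "all-tables",
            "object-locator": {"schema-name": "mydb", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)

# Once the task reaches the "ready" state it can be started. A once-a-day
# rerun (e.g. from a scheduled Lambda) would use "reload-target" instead of
# "start-replication" to re-copy the full tables.
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```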
In Synapse I've set up 3 different pipelines. They all gather data from different sources (SQL, REST and CSV) and sink it into the same SQL database.
Currently they all run during the night, but I already know the question of running them more frequently is coming. I want to avoid having the pipelines churn through all the sources when nothing has changed in them.
Therefore I would like to store the last successful sync run of each pipeline (or pipeline activity). Before each run I want a new, 4th pipeline to check whether anything has changed in the sources; if so, it triggers one, two or all three of the other pipelines.
I still see some complications in doing this, so I'm not fully convinced of how to approach it. All help and thoughts are welcome; does anyone have experience with this?
This is (at least in part) the subject of the following Microsoft tutorial:
Incrementally load data from Azure SQL Database to Azure Blob storage using the Azure portal
You're on the correct path - the crux of the issue is creating and persisting "watermarks" for each source from which you can determine if there have been any changes. The approach you use may be different for different source types. In the above tutorial, they create a stored procedure that can store and retrieve a "last run date", and use this to intelligently query tables for only rows modified after this last run date. Of course this requires the cooperation of the data source to take note of when data is inserted or modified.
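Stripped of the ADF machinery, the watermark logic from the tutorial boils down to something like the following Python/pyodbc sketch; the watermark table, source table and column names are assumptions for illustration:

```python
import pyodbc

conn = pyodbc.connect("DSN=source_db")  # placeholder connection string
cur = conn.cursor()

# 1. Read the last successful watermark for this source.
cur.execute("SELECT WatermarkValue FROM etl.Watermark WHERE SourceName = ?", "orders")
last_run = cur.fetchone()[0]

# 2. Pull only the rows modified since that watermark.
cur.execute("SELECT * FROM dbo.Orders WHERE LastModifiedDate > ?", last_run)
changed_rows = cur.fetchall()

# ... copy changed_rows to the sink here ...

# 3. On success, advance the watermark to the newest modified date just loaded.
if changed_rows:
    new_watermark = max(row.LastModifiedDate for row in changed_rows)
    cur.execute(
        "UPDATE etl.Watermark SET WatermarkValue = ? WHERE SourceName = ?",
        new_watermark, "orders",
    )
    conn.commit()
```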
If you have a source that cannot be intelligently queried in part (e.g. a CSV file), you still have options, such as using the Get Metadata activity to query e.g. the lastModified property of a source file (or even its contentMD5 if using Blob Storage or ADLS Gen2) and comparing it to a value saved during your last run (you would have to pick a place to store this, e.g. an operational DB, Azure Table storage or a small blob file) to determine whether the source needs to be reprocessed.
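For what it's worth, the same check is easy to prototype outside ADF with the azure-storage-blob SDK. The connection string, container and blob names below are placeholders, and the previous value is kept in a local file purely to keep the sketch self-contained; in practice you'd persist it in an operational DB, Azure Table or a small state blob:

```python
from pathlib import Path
from azure.storage.blob import BlobClient

STATE_FILE = Path("last_md5.txt")  # stand-in for wherever you persist last run's value

blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",  # placeholder
    container_name="landing",
    blob_name="daily_extract.csv",
)

# Compare the blob's content MD5 (last_modified works too) with the value
# saved during the previous run to decide whether to reprocess the file.
props = blob.get_blob_properties()
current = bytes(props.content_settings.content_md5 or b"").hex()
previous = STATE_FILE.read_text() if STATE_FILE.exists() else ""

if current and current != previous:
    print("source changed - trigger the CSV pipeline here")
    STATE_FILE.write_text(current)
else:
    print("no change - skip this source")
```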
If you want to go crazy, you can look into streaming patterns (which might require dabbling in HDInsight or getting your hands dirty with Azure Event Hubs to trigger ADF) to move from a scheduled trigger to automatic ingestion as new data appears at the sources.
I have a web server hosted on Cloud Run that loads a TensorFlow model from a cloud file store on start. To know which model to load, it looks up the latest reference in a PostgreSQL database.
Occasionally a retraining script runs using Google Cloud Functions. This stores a new model in the file store and a new reference in the PostgreSQL database.
Currently, in order to use this new model I would need to redeploy the Cloud Run instance so it grabs the new model on start. How can I automate using the newest model instead? Something elegant, robust and scalable is ideal, but if something hacky/clunky yet functional is much easier, that would be preferred. This is a throw-away prototype, but it needs to be available and usable.
I have considered a few options, but I'm not sure how feasible any of them are:
Create some sort of Postgres trigger/notification that the Cloud Run server listens to. I guess this would require another thread, which ups the complexity, and I'm unsure how multiple threads work with Cloud Run.
Similar, but using HTTP pub/sub: make an endpoint on the server that re-looks up and loads the latest model, and publish to it when the retrainer finishes.
Deploy a new instance and remove the old one after the retrainer runs. Simple in some regards, but it seems riskier and might be hard to accomplish programmatically.
Your current pattern needs cache management, because you are caching a model in the instance. How can you invalidate that cache?
Restart the instance? Cloud Run doesn't let you control the instances. The easiest way is to deploy a new revision, which forces the current instances to stop and new ones to start.
Set a TTL? It's an option: load a model for XX hours, then reload it from the source. Problem: you can have glitches (some instances serving the new model and some the old one, until the cache TTL expires on all instances).
Offer a cache-invalidation mechanism? As said before, this is hard because Cloud Run doesn't let you talk to all the instances directly, so a push mechanism is very hard and tricky to implement (not impossible, but I don't recommend wasting time on it). A pull mechanism is an option: check a "latest updated date" somewhere (a record in Firestore, a file in Cloud Storage, an entry in Cloud SQL, ...) and compare it with the updated date of the model you have in memory. If they match, great; if not, reload the latest model (see the sketch below).
You have several solutions; it all depends on what you want.
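A minimal sketch of that pull check, assuming the model lives as an object in Cloud Storage; the bucket and object names are placeholders, and the actual TensorFlow loading is left as a comment:

```python
from google.cloud import storage

_client = storage.Client()
_state = {"model": None, "loaded_at": None}

def get_model():
    # Before serving, compare the object's "updated" timestamp in Cloud Storage
    # with the timestamp of the model currently in memory; reload only if newer.
    blob = _client.bucket("my-models").get_blob("latest/model.h5")  # placeholders
    if _state["model"] is None or blob.updated > _state["loaded_at"]:
        blob.download_to_filename("/tmp/model.h5")
        # _state["model"] = tf.keras.models.load_model("/tmp/model.h5")
        _state["model"] = "/tmp/model.h5"  # placeholder for the loaded model
        _state["loaded_at"] = blob.updated
    return _state["model"]
```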
But there is another solution, my preference: every time you have a new model, build a new container with the model already baked into it (with Cloud Build) and deploy that new container to Cloud Run.
That solution removes the cache-management issue entirely, and you will get better cold-start latency for all your new instances (in addition to easier rollback, A/B testing or canary-release capability, version management and control, portability, testing locally or in other environments, ...).
We have two nodes with CouchDB installed. One of the nodes has data on it, and we want to copy the data from that instance to the other CouchDB instance. We want to avoid the replicator because of the volume of the data.
We tried copying the data from %couchdb%/data/shards and %couchdb%/data/.shards to the corresponding locations on the target node, as per one of the suggestions in CouchDB backups and cloning the database,
but we are not able to see the data in the server's Fauxton UI. Can someone suggest what is missing?
Couchtransform lets you convert or just clone data from one DB to another; it's multi-threaded and you won't need to deal with massive files.
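Alternatively, if you'd rather script the copy yourself over CouchDB's HTTP API instead of using a tool or file-level shard copies, a batched _all_docs to _bulk_docs loop along these lines clones a database while preserving revision ids; the URLs, credentials and batch size are placeholders:

```python
import json
import requests

SOURCE = "http://admin:pass@source-node:5984/mydb"  # placeholder
TARGET = "http://admin:pass@target-node:5984/mydb"  # placeholder
BATCH = 1000

requests.put(TARGET)  # create the target database (a 412 just means it already exists)

start_key = None
while True:
    params = {"include_docs": "true", "limit": BATCH + 1}
    if start_key is not None:
        params["startkey"] = json.dumps(start_key)
    rows = requests.get(SOURCE + "/_all_docs", params=params).json()["rows"]

    docs = [r["doc"] for r in rows[:BATCH]]
    if docs:
        # new_edits=false keeps the original revision ids so the copy matches
        # the source; attachments and deleted docs are not handled here.
        requests.post(TARGET + "/_bulk_docs",
                      json={"docs": docs, "new_edits": False})

    if len(rows) <= BATCH:
        break
    start_key = rows[BATCH]["id"]
```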
I have a web application (using a MongoDB database, AngularJS on the front end and NodeJS on the back end) that is deployed in 2 places. The first is on a static IP so that it can be accessed from anywhere, and the second is on a local machine so that users can keep working when no internet connection is available. In both places data can be inserted by users. My requirement is to sync the two databases when an internet connection is available on the local machine, i.e. from the local database to the remote database and vice versa, without losing any data on either side.
One way I am thinking of is to provide a sync button in the application and sync the databases using insert/update queries, roughly as sketched below. I am not sure whether there is a better, automated way to do this so that the databases sync automatically, the way data is copied in a replica set.
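A rough sketch of what that manual sync could look like with pymongo; the database/collection names and the updatedAt field are assumptions, and it ignores deletes and conflicting edits, which would need extra handling:

```python
from pymongo import MongoClient, ReplaceOne

def push_changes(src_uri, dst_uri, last_sync):
    # Copy every document changed since the last sync and upsert it on the other side.
    src = MongoClient(src_uri)["app"]["items"]  # placeholder db/collection names
    dst = MongoClient(dst_uri)["app"]["items"]

    ops = [
        ReplaceOne({"_id": doc["_id"]}, doc, upsert=True)
        for doc in src.find({"updatedAt": {"$gt": last_sync}})  # assumed field
    ]
    if ops:
        dst.bulk_write(ops)

# Run in both directions when the connection is back:
# push_changes(LOCAL_URI, REMOTE_URI, last_sync)
# push_changes(REMOTE_URI, LOCAL_URI, last_sync)
```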
Please provide the best solution to do this task. Thanks in advance.
I have a MongoDB instance running on a Windows Azure Linux VM.
The database is located on the system drive, and I wish to move it to another hard drive since there is not enough space there.
I found this post:
Changing MongoDB data store directory
There seems to be a fine solution suggested there, yet another person mentioned something about copying the files.
My database is live and receiving data all the time. How can I do this while losing as little data as possible?
Thanks,
First, if this is a production system you really need to be running it as a replica set. Running production databases on singleton MongoDB instances is not a best practice. I would consider 2 full members plus 1 arbiter the minimum production setup.
If you want to go the replica set route, you can first convert this instance to a replica set:
http://docs.mongodb.org/manual/tutorial/convert-standalone-to-replica-set/
This should involve minimal downtime.
Then add 2 new instances with the correct storage setup. After they sync you will have a full 3-member set. You can then fail over to one of the new instances and remove the bad instance from the replica set. Finally I'd add an arbiter instance to get you back up to 3 members of the replica set while keeping costs down.
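Once mongod on the existing instance has been restarted with a --replSet name (per the tutorial above), the member changes can be scripted; here is a rough pymongo sketch that mirrors what rs.initiate()/rs.add() do in the shell, with placeholder host names and assuming mongod is already running on each new host:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://existing-host:27017")  # placeholder

# Initiate the set with the existing instance as the only member.
client.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [{"_id": 0, "host": "existing-host:27017"}],
})

# Later, add the two new data-bearing instances and an arbiter by pushing
# an updated replica-set config.
cfg = client.admin.command("replSetGetConfig")["config"]
cfg["members"] += [
    {"_id": 1, "host": "new-host-1:27017"},
    {"_id": 2, "host": "new-host-2:27017"},
    {"_id": 3, "host": "arbiter-host:27017", "arbiterOnly": True},
]
cfg["version"] += 1
client.admin.command("replSetReconfig", cfg)
```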
If, on the other hand, you do not want to run a replica set, I'd shut down mongod on this instance, copy the files over to the new directory structure on an appropriate volume, change the config to point to it (either by changing dbpath or using a symlink) and then start it up again. Downtime will largely be a function of the size of your existing database, so the sooner you do this the better.
However - I will stress this again - if you are looking for little to no downtime with MongoDB, you need to use a replica set.