I am inserting a huge amount of data from EC2 into RDS Postgres.
The EC2 instance reads data from S3, formats it, and then inserts it into RDS.
I'm using Python 3.8, Flask and flask_sqlalchemy.
The EC2 instance is in Sydney (ap-southeast-2), while the RDS instance is in us-west-2.
Each insert takes around 30 seconds, so completing all the inserts could take 1~2 days.
When I run the same job locally against a local Postgres, it finishes in about 5 minutes.
Is there any way I can improve the performance? For example, by increasing the EC2 instance size?
I googled and found that putting EC2 and RDS in the same region may improve performance, but I'd like more opinions.
I was reading an article, Inserting a billion rows in SQLite under a minute, that may help you.
Personally I have not used EC2, but if you can change your database configuration, that article can still help. It is based on optimizing the database configuration for bulk inserts.
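Beyond configuration, check how the inserts are issued. If each row is sent (and committed) in its own statement, every row pays the Sydney-to-us-west-2 round trip. A minimal sketch of batching with SQLAlchemy Core, assuming a made-up table and placeholder connection string:

```python
# Hypothetical sketch: batch the inserts instead of committing one row at a time.
# "records" and the table/column names are placeholders for whatever the S3 data looks like.
from sqlalchemy import Table, Column, Integer, String, MetaData, create_engine

engine = create_engine("postgresql+psycopg2://user:pass@my-rds-host:5432/mydb")  # placeholder

metadata = MetaData()
events = Table(
    "events", metadata,
    Column("id", Integer, primary_key=True),
    Column("payload", String),
)

def insert_batched(records, batch_size=1000):
    # One executemany-style INSERT and one commit per batch,
    # instead of a round trip and a commit per row.
    for start in range(0, len(records), batch_size):
        with engine.begin() as conn:
            conn.execute(events.insert(), records[start:start + batch_size])
```

For even larger loads, Postgres COPY (e.g. via psycopg2's copy_expert) is usually faster still, and moving the EC2 instance into the same region as the RDS instance removes most of the per-round-trip latency.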
I'm having some issues when trying to export binlog information and a mysqldump with --master-data=1 from my Aurora MySQL instance. The error I'm receiving is:
"mysqldump: Couldn't execute 'FLUSH TABLES WITH READ LOCK': Access denied for user 'user'@'%' (using password: YES) (1045)"
After some digging I found that one way to do it is to create a read replica from the master, stop replication, and then perform the dump.
Sadly this does not work as I expected. All the AWS guides I've found say to create a read replica from the "Actions" button, but I have no such option; it doesn't even appear in the dropdown.
One option that does appear is "Add a reader", which I tried. After connecting to it, it seems like it's not a replica but more like a master with read-only permissions, even though in the AWS console the "Replica latency" column for that instance has a value attached to it.
It's a replica, but it's not really a replica?
My main question here is: how can I perform a dump of an Aurora MySQL database in order to start replication on another instance?
I have tried following most of the AWS guides on MySQL replication, as well as lots of other Stack Overflow questions.
There is an unfortunate overloading of "replica" here.
read replica = some non-Aurora MySQL server (RDS or on-premises) that you replicate to.
Aurora replica = the 2nd, 3rd, and so on DB instance in an Aurora cluster. Read-only. Gets notified of data changes by an entirely different mechanism than binlog replication. This term is being phased out in favor of "reader instance", but you will still see it lurking in documentation that compares and contrasts different ways to do replication.
Aurora read replica = a read-only Aurora cluster that gets data replicated into it via binlog from some other source.
If you select an RDS MySQL instance in the console, it has an option to create a read replica, because that's the only kind of replication it can do. Aurora MySQL clusters only have "Add reader", because that is the most common / fastest / most efficient way for Aurora. The instructions here cover all the different permutations:
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Replication.MySQL.html
That page recommends using a snapshot as the initial "dump" from the Aurora cluster.
There is also an option "Create cross-Region read replica" for Aurora clusters. But for that capability, it's preferable to do "Add AWS Region" instead - that uses an Aurora-specific mechanism (Aurora global database) to do low-latency replication between AWS Regions.
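For the binlog-replication case, here is a rough boto3 sketch of the snapshot step that page recommends as the initial "dump" (cluster and snapshot identifiers are placeholders; binlog retention still has to be enabled separately inside MySQL, e.g. with the mysql.rds_set_configuration stored procedure, before relying on the snapshot's binlog position):

```python
# Hypothetical sketch: take a manual cluster snapshot to seed binlog replication,
# as the linked Aurora documentation recommends, instead of a mysqldump.
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # placeholder region

rds.create_db_cluster_snapshot(
    DBClusterIdentifier="my-aurora-cluster",               # placeholder
    DBClusterSnapshotIdentifier="my-aurora-cluster-seed",  # placeholder
)

# Wait until the snapshot is available before restoring it on the target side.
waiter = rds.get_waiter("db_cluster_snapshot_available")
waiter.wait(DBClusterSnapshotIdentifier="my-aurora-cluster-seed")
```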
Let's say I need to transfer data between two S3 buckets as an ETL process, performing a simple transformation on the data during transfer (taking only some of the columns and filtering by ID).
The data is Parquet files whose size ranges from 1GB to 100GB.
Which would be more efficient in terms of speed and cost - an Apache Spark Glue job, or Spark on a Hadoop cluster with X machines?
The answer to this is basically the same for any serverless (Glue) / non-serverless (EMR) pair of equivalent services.
The first is faster to set up, but less configurable and probably more expensive. The second gives you more options for optimization (performance and cost), but you should not forget to include the cost of managing the service yourself. You can use the AWS Pricing Calculator if you need a price estimate upfront.
I would definitely start with Glue and move to something more complicated if problems arise. Also, don't forget that EMR Serverless is now available as well.
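Whichever service you choose, the transformation itself is the same Spark code. A minimal PySpark sketch of the column selection and ID filter described in the question (bucket names, column names and the ID list are placeholders):

```python
# Hypothetical sketch of the S3-to-S3 transform: read Parquet, keep a subset
# of columns, filter by ID, and write Parquet back out. The same body runs
# inside a Glue Spark job or on an EMR cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("s3-to-s3-transform").getOrCreate()

wanted_ids = [101, 102, 103]  # placeholder filter values

df = spark.read.parquet("s3://source-bucket/input/")   # placeholder bucket
(df.select("id", "col_a", "col_b")                     # placeholder columns
   .filter(col("id").isin(wanted_ids))
   .write.mode("overwrite")
   .parquet("s3://target-bucket/output/"))             # placeholder bucket
```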
I read this question when determining whether it was worthwhile to switch from AWS Glue to AWS EMR.
With configurable EC2 Spot Instances on EMR, we drastically cut the cost of a previous Glue job that read 1GB-4TB of uncompressed CSV data. Spot Instances let us use much larger and faster Graviton EC2 instances that could load more data into RAM, reducing spills to disk. Another benefit was that we got rid of dynamic frames, which are very useful when you do not know the schema, but were overhead we did not need. The larger instances (bigger than anything AWS Glue provides) reduced our runtime somewhat, but more importantly we cut our costs by 40-75% - yes, even including the EC2 + EBS + EMR overhead cost per instance. We went from $25-$250 a day on Glue to $2-$60 on EMR; the monthly cost for this process was $1,600 on AWS Glue and is now under $500. We run EMR via run_job_flow and TERMINATE when idle, so it essentially acts like a serverless Glue.
We did not go with EMR Serverless because it has no Spot Instances, which for us was probably the biggest benefit.
The only problem is that we did not switch earlier. We are now moving all AWS Glue jobs to AWS EMR.
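A rough boto3 sketch of that transient-cluster pattern (release label, instance types, roles and the script location are placeholders; the key parts are the SPOT market, KeepJobFlowAliveWhenNoSteps=False for auto-termination, and a single Spark step):

```python
# Hypothetical sketch: launch a transient EMR cluster on Spot Instances that
# runs one Spark step and terminates itself, approximating "serverless" Glue.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

emr.run_job_flow(
    Name="transient-spark-job",
    ReleaseLabel="emr-6.10.0",                      # placeholder release
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",                  # placeholder service role
    JobFlowRole="EMR_EC2_DefaultRole",              # placeholder instance profile
    Instances={
        "InstanceGroups": [
            {"Name": "driver", "InstanceRole": "MASTER",
             "InstanceType": "m6g.xlarge", "InstanceCount": 1, "Market": "SPOT"},
            {"Name": "workers", "InstanceRole": "CORE",
             "InstanceType": "m6g.4xlarge", "InstanceCount": 4, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,       # terminate when the step finishes
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],  # placeholder script
        },
    }],
)
```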
I'm trying to use DynamoDB locally, using the container https://hub.docker.com/r/amazon/dynamodb-local combined with NoSQL Workbench (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/workbench.html).
I have successfully created a table in my local container but now I wish to delete it. I'm assuming NoSQL Workbench has this functionality(?) and I'm just being dumb... can anybody point me at the button I need to press?
Many thanks.
In case anybody else is looking: at the time of writing, NoSQL Workbench does not support deleting a table. Got my answer straight from the DynamoDB team.
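As a workaround you can delete the table outside of Workbench, straight against the local endpoint. A minimal boto3 sketch (the table name is a placeholder; the port matches the default docker run command):

```python
# Hypothetical sketch: delete a table on DynamoDB Local, which NoSQL Workbench
# itself cannot do. The endpoint matches the default "docker run -p 8000:8000".
import boto3

dynamodb = boto3.client(
    "dynamodb",
    endpoint_url="http://localhost:8000",
    region_name="localhost",            # DynamoDB Local accepts any region
    aws_access_key_id="fake",           # and any credentials
    aws_secret_access_key="fake",
)

dynamodb.delete_table(TableName="MyTable")   # placeholder table name
print(dynamodb.list_tables()["TableNames"])  # confirm it is gone
```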
Came across this question while trying to update data from NoSQL Workbench in my local DDB table.
My issue was not knowing how to re-commit/update the data after my first commit to my local Docker DDB server, as I was getting this error:
Error
Could not commit to DynamoDB as table creation failed: ResourceInUseException: Cannot create preexisting table
What worked for me was to:
stop my local instance (Ctrl + C)
restart my local DDB server (docker run -p 8000:8000 amazon/dynamodb-local) - restarting drops the previously created table, so the commit can recreate it
commit my changes to my local DDB again from NoSQL Workbench
Just in case anyone else is trying to solve the same problem and hasn't tried this yet:
You can now use PartiQL with NoSQL Workbench to query, insert, update, and delete table data in Amazon DynamoDB.
Posted On: Dec 21, 2020
However, you still cannot delete the table itself from NoSQL Workbench.
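For completeness, the same PartiQL statements can also be run programmatically. A small boto3 sketch against the local endpoint (table, key and value are placeholders):

```python
# Hypothetical sketch: run a PartiQL statement against DynamoDB Local with boto3.
# This deletes one item; dropping the table itself still needs delete_table or the CLI.
import boto3

dynamodb = boto3.client("dynamodb", endpoint_url="http://localhost:8000",
                        region_name="localhost",
                        aws_access_key_id="fake", aws_secret_access_key="fake")

dynamodb.execute_statement(
    Statement='DELETE FROM "MyTable" WHERE "pk" = ?',  # placeholder table/key
    Parameters=[{"S": "some-id"}],                     # placeholder key value
)
```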
I have an Aurora table that has 500 million records.
I need to perform big data analysis on it, like finding the diff between two tables.
Until now I have been doing this with Hive on files, but now we have inserted all the file rows into the Aurora DB.
Still, every month I need to do the same thing: find the diff.
So what could be the best option for this?
Exporting the Aurora data back to S3 as files and then running a Hive query on that (how much time might it take to export all the Aurora rows to S3)?
Can I run a Hive query on an Aurora table? (I guess Hive on Aurora is not supported.)
Running Spark SQL on Aurora (how would the performance be)?
Or is there a better way to do this?
In my opinion Aurora MySQL isn't a good option for big data analysis. This follows from the limitations of MySQL InnoDB and from the additional restrictions Aurora places on top of MySQL InnoDB. For instance, you won't find features such as data compression or a columnar format.
With Aurora you can use Aurora Parallel Query, for instance, but it doesn't support partitioned tables.
https://aws.amazon.com/blogs/aws/new-parallel-query-for-amazon-aurora/
Another option is to connect directly to Aurora using AWS Glue and perform the analysis there, but in that case the database performance can become the bottleneck.
https://docs.aws.amazon.com/glue/latest/dg/populate-add-connection.html
I suggest importing/exporting the data to S3 using LOAD DATA FROM S3 / SELECT ... INTO OUTFILE S3 and then performing the analysis with Glue or EMR. You should also consider using Redshift instead of Aurora.
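Once both table versions are exported to S3, the monthly diff itself is simple in Spark on Glue or EMR. A small PySpark sketch, assuming the exports are stored as Parquet and that rows compare by whole-row equality (paths are placeholders):

```python
# Hypothetical sketch: diff two exported snapshots of the table with Spark.
# exceptAll keeps duplicates, so it behaves like a true multiset difference.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("monthly-diff").getOrCreate()

current = spark.read.parquet("s3://my-bucket/exports/2023-02/")   # placeholder path
previous = spark.read.parquet("s3://my-bucket/exports/2023-01/")  # placeholder path

added = current.exceptAll(previous)    # rows only in the new export
removed = previous.exceptAll(current)  # rows only in the old export

added.write.mode("overwrite").parquet("s3://my-bucket/diffs/2023-02/added/")
removed.write.mode("overwrite").parquet("s3://my-bucket/diffs/2023-02/removed/")
```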
We have an Aurora database (AWS) that we use for production. We would like to have a clone database that is updated on a daily basis and used for QA (a one-way sync from the production DB to the QA DB). What is the best way to do it?
Thanks
There's an open source Python Library that can do this for you, or you could take a look at that approach and do the same:
https://github.com/blacklocus/aurora-echo
You can run the following steps daily (a rough boto3 sketch follows the list):
Convert the production automatic snapshot to a manual one: aws rds copy-db-cluster-snapshot
Share the manual snapshot with the test account: aws rds modify-db-cluster-snapshot-attribute --attribute-name restore --values-to-add dev-account-id
Restore the snapshot to a new cluster with aws rds restore-db-cluster-from-snapshot
Add an instance to the new cluster
Rename the old QA cluster out of the way (takes about 10 seconds)
Rename the new cluster to the QA cluster's name (also about 10 seconds)
If the new cluster works, you can delete the old cluster and its instances.
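A rough boto3 sketch of the same flow, assuming the snapshot is copied and shared from the production account and restored in the QA account (identifiers, account IDs, region and engine are placeholders; in practice the two clients would use different credentials):

```python
# Hypothetical sketch of the daily QA refresh: copy the latest automated cluster
# snapshot to a manual one, share it with the QA account, restore it there,
# add an instance, then swap cluster names. All identifiers are placeholders.
import boto3

prod_rds = boto3.client("rds", region_name="us-east-1")  # production-account credentials
qa_rds = boto3.client("rds", region_name="us-east-1")    # QA-account credentials

# 1. Copy the automated snapshot to a manual one (automated snapshots cannot be shared).
prod_rds.copy_db_cluster_snapshot(
    SourceDBClusterSnapshotIdentifier="rds:prod-cluster-2023-02-01-00-00",  # placeholder
    TargetDBClusterSnapshotIdentifier="prod-cluster-daily-copy",
)
prod_rds.get_waiter("db_cluster_snapshot_available").wait(
    DBClusterSnapshotIdentifier="prod-cluster-daily-copy")

# 2. Share the manual snapshot with the QA account.
prod_rds.modify_db_cluster_snapshot_attribute(
    DBClusterSnapshotIdentifier="prod-cluster-daily-copy",
    AttributeName="restore",
    ValuesToAdd=["123456789012"],  # placeholder QA account id
)

# 3. In the QA account: restore a new cluster from the shared snapshot and add an instance.
qa_rds.restore_db_cluster_from_snapshot(
    DBClusterIdentifier="qa-cluster-new",
    SnapshotIdentifier="arn:aws:rds:us-east-1:111111111111:cluster-snapshot:prod-cluster-daily-copy",
    Engine="aurora-mysql",  # placeholder engine
)
qa_rds.create_db_instance(
    DBInstanceIdentifier="qa-cluster-new-instance-1",
    DBClusterIdentifier="qa-cluster-new",
    DBInstanceClass="db.r6g.large",  # placeholder instance class
    Engine="aurora-mysql",
)

# 4. Swap names: move the old QA cluster aside, give the new cluster the QA name.
qa_rds.modify_db_cluster(DBClusterIdentifier="qa-cluster",
                         NewDBClusterIdentifier="qa-cluster-old",
                         ApplyImmediately=True)
qa_rds.modify_db_cluster(DBClusterIdentifier="qa-cluster-new",
                         NewDBClusterIdentifier="qa-cluster",
                         ApplyImmediately=True)
# Once the renamed cluster checks out, delete qa-cluster-old and its instances.
```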