Cannot export data from an Aurora MySQL DB instance for replication - amazon-rds

I'm having some issues when trying to export binlog information and a mysqldump with --master-data=1 from my Aurora MySQL instance. The error I'm receiving is
"mysqldump: Couldn't execute 'FLUSH TABLES WITH READ LOCK': Access denied for user 'user'@'%' (using password: YES) (1045)"
After some digging I found out that one way to do it is to create a read replica from the master, stop replication, and then perform the dump.
Sadly this does not work as I expected. All the AWS guides I've found say to create a read replica from the "Actions" button, but I have no such option; it doesn't even appear in the dropdown.
One option does appear, "Add a reader", which I used, and after connecting to it, it seems like it's not a replica but more like a master with read-only permissions, even though in the AWS console the "Replica latency" column for that instance has a value attached to it.
It's a replica but it's not really a replica?
My main question here is: how can I perform a dump of an Aurora MySQL instance in order to start replication on another instance?
I tried following most of the guides available from AWS regarding MySQL replication, as well as lots of other Stack Overflow questions.

There is an unfortunate overloading of "replica" here.
read replica = some non-Aurora MySQL server (RDS or on-premises) that you replicate to.
Aurora replica = the 2nd, 3rd, and so on DB instance in an Aurora cluster. Read-only. Gets notified of data changes by an entirely different mechanism than binlog replication. Thus this term is being phased out in favor of "reader instance", but you will still see it lurking in documentation that compares and contrasts different ways to do replication.
Aurora read replica = a read-only Aurora cluster that gets data replicated into it via binlog from some other source.
If you select an RDS MySQL instance in the console, it has an option to create a read replica, because that's the only kind of replication it can do. Aurora MySQL clusters only have "Add reader" (with one exception noted below), because that is the most common, fastest, and most efficient way for Aurora. The instructions here cover all the different permutations:
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Replication.MySQL.html
That page recommends using a snapshot as the initial "dump" from the Aurora cluster.
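To make that concrete, here is a minimal source-side sketch of the prep work that guide describes, assuming you go the dump route: binary logging is enabled through the DB cluster parameter group, binlog retention is extended, and the binlog coordinates are recorded for the CHANGE MASTER TO statement on the target. The hostname, user, and retention value below are placeholders, not values from the question.

```bash
# Rough source-side sketch (placeholder endpoint/user/retention).
# binlog_format itself is set in the DB *cluster* parameter group, not via SQL.
mysql -h my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com -u admin -p <<'SQL'
-- Keep binlogs around long enough to set up the external replica.
CALL mysql.rds_set_configuration('binlog retention hours', 144);
CALL mysql.rds_show_configuration;
-- Record the current binlog file/position to use in CHANGE MASTER TO
-- on the target after loading the dump (or snapshot).
SHOW MASTER STATUS;
SQL
```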
There is also an option "Create cross-Region read replica" for Aurora clusters. But for that capability, it's preferable to do "Add AWS Region" instead - that uses an Aurora-specific mechanism (Aurora global database) to do low-latency replication between AWS Regions.

Related

Does ScyllaDB have migration support to GKE similar to K8ssandra's Zero Downtime Migration feature?

We are trying to migrate our ScyllaDB cluster deployed on GCE machines to a GKE cluster in Google Cloud. We came across one approach for Cassandra migration (linked below) and want to implement the same for our ScyllaDB migration. Can you please suggest whether this is possible in Scylla,
or whether Scylla hasn't introduced such a migration technique with the Scylla K8s operator?
https://k8ssandra.io/blog/tutorials/cassandra-database-migration-to-kubernetes-zero-downtime/
Adding a new "destination" DC to your existing cluster's "source" DC is a very common technique for migrating to a new DC (a command sketch follows the steps below).
Add the new "destination" DC
Change replication factor settings accordingly
nodetool rebuild --> stream data from the "source" DC to the "destination" DC
nodetool repair the new DC.
Update your application clients to connect to the new DC once it's ready to serve (all data streamed + repaired)
Decommission the "old" (source) DC
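As a rough sketch of steps 2-4, assuming placeholder names (keyspace my_ks, datacenters dc_source and dc_destination) and replication factors that you would adjust to your own topology:

```bash
# 2. Include the new DC in the keyspace's replication settings
#    (placeholder keyspace/DC names and replication factors).
cqlsh node1.example.com -e "ALTER KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'dc_source': 3, 'dc_destination': 3};"

# 3. On every node in the new (destination) DC, stream the existing data
#    from the source DC.
nodetool rebuild -- dc_source

# 4. Repair each node in the new DC once streaming has finished.
nodetool repair -pr
```

Repeat the ALTER KEYSPACE for every keyspace you are migrating (including system_auth and similar keyspaces if you use them).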
For the gory details see here:
https://docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/add-dc-to-existing-dc.html
https://docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/decommissioning-data-center.html
If you prefer to go the full-scan route - CQL reads on the source and CQL writes on the destination, with some ability for data manipulation and save points to resume from - then the Scylla Spark Migrator is a good option.
https://github.com/scylladb/scylla-code-samples/tree/master/spark-scylla-migrator-demo
You can also use the Scylla Spark Migrator to migrate Parquet files:
https://www.scylladb.com/2020/06/10/migrate-parquet-files-with-the-scylla-migrator/
Remember not to migrate materialized views (MVs); you can always re-create them from the base tables after the migration.
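If you do go the Migrator route, the launch is essentially a spark-submit of the migrator assembly jar with a YAML config describing the source and target clusters. The sketch below is indicative only: the paths and Spark master URL are placeholders, and the exact class name and config key should be checked against the project's README.

```bash
# Indicative launch of the Scylla Spark Migrator (placeholder paths/URLs;
# source/target clusters, tables and credentials live in config.yaml).
spark-submit \
  --class com.scylladb.migrator.Migrator \
  --master spark://spark-master.example.com:7077 \
  --conf spark.scylla.config=/path/to/config.yaml \
  /path/to/scylla-migrator-assembly.jar
```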
We use an Apache Spark-based Migrator: https://github.com/scylladb/scylla-migrator
Here's the blog we wrote on how to do this back in 2019: https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla-migrator/
Though in this case, you aren't moving from Cassandra to ScyllaDB, just moving from one ScyllaDB instance to another. If this makes sense to you, it should be straightforward. If you have questions, feel free to join our Slack community to get more interactive assistance:
http://slack.scylladb.com/

How to check which Cassandra node the client reads data from / writes data to?

I'm working with Cassandra 3.x and Phantom driver (scala), and modifying my Cassandra deployment from a simple, three nodes cluster to a multi datacenter Cassandra deployment that consists of two datacenters:
Transactional - the "main" datacenter, to which all reads/writes occur (except for reads/writes done by some analytics job).
Analytics - a datacenter used for analytics purposes only. The analytics job should operate (i.e. read/write to) on this datacenter.
I configured the client on the analytics job to read/write to the Analytics datacenter, and all other services to read/write from the Transactional datacenter.
How can I check the client actually behaves as expected - and reads/writes the data to the correct data-center?
The driver has an option to turn on request tracking, which should let you see which nodes are involved with each query.
There's a short description of how to do this on the driver documentation page: https://docs.datastax.com/en/developer/java-driver/4.2/manual/core/logging/
The query logger reference API has a lot more detail on the available methods, and can even show the values of bind variables, if needed.
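Independently of the driver, a quick sanity check (plain cqlsh tracing rather than the query logger described above) is to run a traced query through one of the analytics contact points and look at which coordinator and replicas show up in the trace. The host, keyspace, and table names below are placeholders; if your cqlsh version doesn't accept the special commands via -e, run them interactively instead.

```bash
# Run a traced query through an analytics-DC contact point and confirm that
# only analytics-DC nodes appear as coordinator/replicas in the trace output.
cqlsh analytics-node-1.internal -e "
  CONSISTENCY LOCAL_QUORUM;
  TRACING ON;
  SELECT * FROM my_keyspace.my_table LIMIT 1;"
```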

JanusGraph components that scale

If I understand right, multiple Gremlin Servers don't communicate with each other; the scaling happens only in Cassandra/ES.
If that is true, how many vertices can each Gremlin Server support?
When the graph is updated by one Gremlin Server, when will the other Gremlin Servers see that change?
Thanks!
The maximum number of vertices supported is 2^59 (roughly half a quintillion).
The storage backend is the sole source of state between multiple Gremlin servers. The number of vertices will not be increased by adding additional Gremlin servers.
The limitations on the number of vertices are outlined on the Technical Limitations page of the JanusGraph manual.
When one Gremlin Server sees changes made by another is determined by the storage backend choice, and it's still a bit tricky to answer. If you are using a strongly consistent data backend, the answer will generally be: as soon as Gremlin finishes its transaction.
But Cassandra is a different beast.
Using an eventually consistent storage backend
Cassandra is what's known as an eventually-consistent database. This means that it trades transactional consistency for availability and partition tolerance; even if you started to lose nodes in the cluster, it will continue to function and serve requests.
The downside to this is that mutations in Cassandra do not instantly become available to consumers; you can even have the case where a client writes a change to Cassandra and that very same client doesn't see the change if they immediately try to read that data.
Chapter 31 in the JanusGraph Manual covers dealing with an eventually consistent storage backend like Cassandra.
Realistically, the amount of time between a mutation and all clients being able to see that mutation in Cassandra depends entirely upon the data load, the nature of the write, and the read/write consistency levels that JanusGraph is configured to use with Cassandra.
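As a concrete illustration of that last point, with the CQL storage backend the consistency levels are ordinary JanusGraph properties. The fragment below is a hypothetical example (hostname and file path are placeholders); raising both levels to QUORUM shrinks the window in which one Gremlin Server cannot yet read a change committed through another.

```bash
# Hypothetical fragment of a JanusGraph properties file (placeholder host).
# QUORUM reads and writes overlap, so a value written through one Gremlin
# Server becomes visible to reads issued through another much sooner than
# with weaker consistency levels.
cat >> conf/janusgraph-cql.properties <<'EOF'
storage.backend=cql
storage.hostname=cassandra-1.example.com
storage.cql.read-consistency-level=QUORUM
storage.cql.write-consistency-level=QUORUM
EOF
```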

Big data analysis on Amazon Aurora RDS

I have an Aurora table that has 500 million records.
I need to perform big data analysis, like finding the diff between two tables.
Until now I have been doing this using Hive on the file system, but now we have inserted all the files' rows into the Aurora DB.
I still need to do the same diff monthly.
So what could be the best option for this?
Exporting Aurora data back to S3 as files and then running the Hive query on that (how much time might it take to export all Aurora rows to S3)?
Can I run a Hive query on an Aurora table? (I guess Hive on Aurora isn't supported.)
Running Spark SQL on Aurora (how will the performance be)?
Or is there any better way to do this?
In my opinion, Aurora MySQL isn't a good option for big data analysis. This stems from the limitations of MySQL InnoDB and also from additional restrictions that Aurora places on top of MySQL InnoDB. For instance, you won't find features such as data compression or a columnar format there.
When it comes to Aurora, you can use, for instance, Aurora Parallel Query, but it doesn't support partitioned tables.
https://aws.amazon.com/blogs/aws/new-parallel-query-for-amazon-aurora/
Another option is to connect directly to Aurora using AWS Glue and perform the analysis there, but in that case you can run into problems with database performance; it can become a bottleneck.
https://docs.aws.amazon.com/glue/latest/dg/populate-add-connection.html
I suggest exporting the data to S3 using SELECT INTO OUTFILE S3 (and importing with LOAD DATA FROM S3 where needed) and then performing the analysis with Glue or EMR. You should also consider using Redshift instead of Aurora.
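As a sketch of that export step (hostname, bucket, schema, and table names are placeholders; the cluster also needs an IAM role that allows writing to the bucket, as described in the Aurora documentation for SELECT INTO OUTFILE S3):

```bash
# Export the table from Aurora MySQL straight to S3, then point Glue/EMR at
# the resulting files. All names below are placeholders; depending on your
# setup the plain s3:// URI form may be used instead of s3-<region>://.
mysql -h my-aurora-cluster.cluster-xxxx.eu-west-1.rds.amazonaws.com -u admin -p <<'SQL'
SELECT * FROM mydb.big_table
INTO OUTFILE S3 's3-eu-west-1://my-analytics-bucket/exports/big_table'
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';
SQL
```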

How to run Couchbase on multiple servers or multiple AWS instances?

I am trying to evaluate Couchbase's performance on multiple nodes. I have a client that generates data for me based on some schema (currently for 1 node, local). But I want to know how I can horizontally scale Couchbase and how it works. If I have multiple machines, AWS instances, or Windows Azure instances, how can I configure Couchbase to shard the data so that I can evaluate its performance on multiple nodes? Any suggestions and details as to how I can do this?
I am not (yet) familiar with Azure but you can find a very good white paper about Couchbase on AWS:
Running Couchbase on AWS
Let's talk about the cluster itself; you just need to:
install Couchbase on multiple nodes
create a "cluster" on one of them
then simply add the other nodes to the cluster and rebalance (see the command sketch below).
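With couchbase-cli, those steps look roughly like the sketch below. Hostnames, credentials, and the RAM quota are placeholders, and the exact flag names vary somewhat between Couchbase Server versions, so check the CLI reference for your release.

```bash
# Initialize the cluster on the first node (placeholder credentials/quota).
couchbase-cli cluster-init -c node1.example.com:8091 \
  --cluster-username Administrator --cluster-password password \
  --cluster-ramsize 1024

# Add another node to the cluster, then rebalance so the data (vBuckets)
# is sharded across all nodes.
couchbase-cli server-add -c node1.example.com:8091 \
  -u Administrator -p password \
  --server-add node2.example.com:8091 \
  --server-add-username Administrator --server-add-password password

couchbase-cli rebalance -c node1.example.com:8091 \
  -u Administrator -p password
```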
I have created an Ansible script that uses exactly these steps to create a cluster from the command line; see:
Create a Couchbase cluster with Ansible
Once you have done that your application will leverage all the nodes automatically, and you can add/remove nodes as you need.
Finally, if you want to learn more about Couchbase architecture - how sharding, failover, data consistency, and indexing work - I invite you to look at this white paper:
Couchbase Server: An Architectural Overview
