I have a Cassandra database and want to use another server with this data. How can I move all of the keyspaces' data?
I have snapshots, but I don't know whether I can restore them on another server.
Thanks for your help.
Unfortunately, you have limited options for moving data across clouds: primarily the COPY command or sstableloader (https://docs.datastax.com/en/cassandra/2.1/cassandra/migrating.html). Alternatively, if you plan to maintain a like-for-like setup (same number of nodes) across clouds, then simply copying the snapshots under the data directory would work.
If you are moving to IBM Softlayer, you may be able to use software-defined storage solutions that are deployed on bare metal and provide features like thin clones, which allow you to create clones of Cassandra clusters in a matter of minutes with considerable space savings. This is rather useful for creating clones for dev/test purposes. Check out Robin Systems; you may find them interesting.
The cleanest way to migrate your data from one cluster to another is the sstableloader tool. It lets you stream the contents of your sstables from a local directory to a remote cluster. The new cluster can even be configured differently, and you don't have to worry about assigned tokens.
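For illustration, a minimal sketch of an sstableloader invocation. The keyspace, table, and destination node addresses below are placeholders; adjust them for your cluster:

```shell
# Placeholder names: my_keyspace, my_table, and the target IPs are assumptions.
KEYSPACE=my_keyspace
TABLE=my_table
# sstableloader expects a directory laid out as <keyspace>/<table>/ containing
# sstables, e.g. a copy of a node's data directory or an unpacked snapshot:
DATA_DIR="/var/lib/cassandra/data/$KEYSPACE/$TABLE"

# -d takes one or more contact points in the *destination* cluster.
# Guard on the tool being present so this sketch degrades gracefully:
if command -v sstableloader >/dev/null; then
    sstableloader -d 10.0.1.10,10.0.1.11 "$DATA_DIR"
fi
```

Run this once per table directory you want to stream; the destination cluster handles distributing the data according to its own token ranges.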
We have recently set up Greenplum. Now the major concern is a strategy for PITR. Postgres provides PITR capability, but I am a little confused about how it will work in Greenplum, as each segment has its own log directory and config file.
We recently introduced the concept of a named restore point to serve as a building block for PITR in Greenplum. To use it, you call the catalog function gp_create_restore_point(), which internally creates a cluster-wide consistency point across all the segments. The function returns the restore point location for each segment and the master. Using these recovery points, you can configure the recovery.conf in your PITR cluster.
To demonstrate how Greenplum named restore points work, a new test directory src/test/gpdb_pitr has been added. The test showcases WAL archiving in conjunction with named restore points to do point-in-time recovery.
If you are interested in the details, please refer to the following two commits, which discuss this functionality in depth:
https://github.com/greenplum-db/gpdb/commit/47896cc89b4935199aa7d97043f2b7572a71042b
https://github.com/greenplum-db/gpdb/commit/40e0fd9ce6c7da3921f0b12e55118320204f0f6d
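A rough sketch of the flow, assuming psql can reach the Greenplum master and that the function takes the restore point's name as its argument (verify the exact signature against the commits above):

```shell
# Create a cluster-wide named restore point (guarded so this sketch
# only runs where psql is actually installed):
if command -v psql >/dev/null; then
    psql -d postgres -c "SELECT * FROM gp_create_restore_point('before_upgrade');"
fi

# Each segment of the PITR cluster would then get a recovery.conf that
# replays archived WAL up to that named point (paths are placeholders):
RECOVERY_CONF="restore_command = 'cp /wal_archive/%f %p'
recovery_target_name = 'before_upgrade'"
echo "$RECOVERY_CONF"
```

The key idea is that the same restore point name is consistent across the master and every segment, so each segment's recovery stops at a mutually consistent point.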
I'm having trouble figuring out how to take a backup of a JanusGraph database backed by Apache Cassandra persistent storage.
I'm looking for the correct methodology for performing backup and restore tasks. I'm very new to this concept and have no idea how to do it. It would be highly appreciated if someone could explain the correct approach or point me to the right documentation to safely execute these tasks.
Thanks a lot for your time.
Cassandra can be backed up a few ways. One way is called a "snapshot", which you issue via the "nodetool snapshot" command. Cassandra will create a "snapshots" subdirectory, if it doesn't already exist, under each table being backed up (each table has its own directory where it stores its data), and then create a specific directory for this particular occurrence of the snapshot (you can name the directory via a "nodetool snapshot" parameter or let it default). Cassandra then creates hard links to all of the sstables that exist for that particular table, looping through each table or keyspace depending on your "nodetool snapshot" parameters. It's very fast, since creating hard links takes almost no time.

You will have to run this command on each node in the Cassandra cluster to back up all of the data, and each node's data is backed up to the local host. I know DSE, and possibly Apache Cassandra, are adding functionality to back up to object storage as well (I don't know if this is an OpsCenter-only capability or if it can be done via the snapshot command too). Watch the space consumption, as there is no automatic process to clean old snapshots up.
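As a toy illustration of that layout (plain shell standing in for what Cassandra does; the paths and file names are made up), a snapshot is just a subdirectory of hard links next to the live sstables:

```shell
# Fake table directory standing in for /var/lib/cassandra/data/<keyspace>/<table>/
TABLE_DIR="$(mktemp -d)/demo_ks/users"
mkdir -p "$TABLE_DIR"
echo "sstable contents" > "$TABLE_DIR/nb-1-big-Data.db"

# "nodetool snapshot" creates snapshots/<tag>/ and hard-links the sstables in:
SNAP_DIR="$TABLE_DIR/snapshots/backup_2020_01_01"
mkdir -p "$SNAP_DIR"
ln "$TABLE_DIR/nb-1-big-Data.db" "$SNAP_DIR/"

# The link shares the original file's data blocks, so it costs almost no space
# or time, and it keeps the sstable alive even after compaction removes the
# live copy, which is why snapshots accumulate disk usage over time.
ls "$SNAP_DIR"
```

To turn this into a real backup you would then copy the snapshot directories off-node (e.g. with rsync or to object storage) on every node in the cluster.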
Like many database systems, you can also purchase/use third-party software to perform backups (e.g. Cohesity (formerly Talena), Rubrik, etc.). We use one such product in our environments and it works well (graphical interface, easy-to-use point-in-time recovery, etc.). They also offer easy-to-use "refresh" capabilities (e.g. refreshing your test environment from, say, production backups).
Those are probably the two best options.
Good luck.
I am looking for the best solution to a problem where an application has to access a CSV file (say employee.csv) and perform operations such as getEmployee or updateEmployee.
Which volume type is best suited for this, and why?
Please note that employee.csv will already contain some pre-loaded data.
Also, to be precise, we are using azure-cli to manage Kubernetes.
Please help!
My first question would be: is your application meant to be scalable (i.e. have multiple instances running at the same time)? If so, you should choose a volume that can be written by multiple instances at the same time (ReadWriteMany, https://kubernetes.io/docs/concepts/storage/persistent-volumes/). As you are on Azure, the AzureFile volume could fit your case. However, I am concerned that multiple concurrent writers could conflict (and some data may be lost). My advice would be to use a database system instead, so you avoid this kind of situation.
If you only need one writer, then you could use pretty much any of them. However, if you use local volumes you could have problems when a pod gets rescheduled onto another host (it would not be able to retrieve the data). Given your requirements (a simple CSV file), the reason to pick one PersistentVolume provider over another is simply whichever is least painful to set up. In that sense, as before, since you are on Azure you could simply use an AzureFile volume type, as it should be the most straightforward to configure in that cloud: https://learn.microsoft.com/en-us/azure/aks/azure-files
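For illustration, a minimal PersistentVolumeClaim sketch for the AzureFile case. The claim name and size are placeholders, and it assumes an `azurefile` storage class exists in your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: employee-data        # placeholder name
spec:
  accessModes:
    - ReadWriteMany          # multiple pods may mount it read/write
  storageClassName: azurefile
  resources:
    requests:
      storage: 1Gi           # plenty for a small CSV file
```

You would then reference this claim from your pod spec under `volumes` with `persistentVolumeClaim.claimName: employee-data` and mount it wherever the application expects employee.csv.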
I'm currently managing a cluster of PHP-FPM servers, all of which tend to get out of sync with each other. The application that I'm using on top of the app servers (Magento) allows for admins to modify various files on the system, but now that the site is in a clustered set up modifying a file only modifies it on a single instance (on one of the app servers) of the various machines in the cluster.
Is there an open-source application for Linux that may allow me to keep all of these servers in sync? I have no problem with creating a small VM instance that can listen for changes from machines to sync. In theory, the perfect application would have small clients that run on each machine to be synced, which would talk to the master server which would then decide how/what to sync from each machine.
I have already examined the possibility of running a centralized file server, but unfortunately my app servers are spread out between EC2 and physical machines, which makes this unfeasible. As there are multiple app servers (some of which are created dynamically depending on the load of the site), simply setting up an rsync cron job is not efficient: the cron job would have to be modified on each machine to send files to every other machine in the cluster, and that would just be a whole bunch of unnecessary data transfers/SSH connections.
I'm setting up a similar solution and am halfway there. I would recommend lsyncd, which monitors the disk for changes and then immediately (or at whatever interval you want) syncs files to a list of servers using rsync.
The only issue I'm having is keeping the server lists up to date, since I can spin up additional servers at any time, I would need to have each machine in the cluster notified whenever a machine is added or removed from the cluster.
I think lsyncd is a great solution that you should look into. The issue I'm having may turn out to be a problem for you as well, and that remains to be solved.
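For reference, a minimal lsyncd configuration sketch (hosts and paths are placeholders), syncing a local directory to one target over rsync+ssh; you would repeat the sync block, or generate the file, for each server in your list:

```lua
-- /etc/lsyncd/lsyncd.conf.lua (illustrative; adjust paths and hosts)
settings {
    logfile    = "/var/log/lsyncd.log",
    statusFile = "/var/log/lsyncd-status.log",
}

sync {
    default.rsyncssh,                 -- rsync over ssh
    source    = "/var/www/magento/media",
    host      = "app2.example.com",   -- placeholder target host
    targetdir = "/var/www/magento/media",
    delay     = 1,                    -- batch changes for 1 second
}
```

Because the config is plain Lua, regenerating it from your current server inventory (e.g. from an EC2 API query) and reloading lsyncd is one way to handle the dynamic-membership problem mentioned above.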
Instead of keeping tens or hundreds of servers cross-synchronized, it would be much more efficient, reliable, and above all simpler to maintain just one "admin node" and replicate changes from it to all your "worker nodes".
For instance, at our company we use a Development server -> Staging server -> Live backends workflow, where all the changes are transferred across servers using a custom PHP+rsync front end. That allows the developers to push updates to a Staging server in the live environment, test the changes, and roll them out to the Live backends incrementally.
A similar approach could very well work in your case as well. Obviously it's not a plug-and-play solution, but I see it as the easiest way to go - both in terms of maintainability and scalability.
I need to design a Clustered application which runs separate instances on 2 nodes. These nodes are both Linux VM's running on VMware. Both application instances need to access a database & a set of files.
My intention is that a shared storage disk (external to both nodes) should contain the database & files. The applications would co-ordinate (via RPC-like mechanism) to determine which instance is master & which one is slave. The master would have write-access to the shared storage disk & the slave will have read-only access.
I'm having problems determining the file system for the shared storage device, since it would need to support concurrent access across 2 nodes. Going for a proprietary clustered file system (like GFS) is not a viable alternative owing to costs. Is there any way this can be accomplished in Linux (EXT3) via other means?
Desired behavior is as follows:
Instance A writes to file foo on shared disk
Instance B can read whatever A wrote into file foo immediately.
I also tried using SCSI PGR3 but it did not work.
Q: Are both VM's co-located on the same physical host?
If so, why not use VMware shared folders?
Otherwise, if both are co-located on the same LAN, what about good old NFS?
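If NFS fits, the setup is only a couple of configuration lines (the addresses and paths below are placeholders):

```
# /etc/exports on the machine holding the data (then run: exportfs -ra)
/srv/shared  10.0.0.0/24(rw,sync,no_root_squash)

# /etc/fstab on each VM
10.0.0.1:/srv/shared  /mnt/shared  nfs  defaults,_netdev  0  0
```

With the share mounted on both nodes, a write by instance A to a file under /mnt/shared becomes visible to instance B on the other node, which matches the desired behavior described above (subject to NFS attribute-caching settings).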
Try using Heartbeat + Pacemaker; it has a couple of built-in options for monitoring the cluster, and should have something for handling the data too.
You might look at an active/passive setup with DRBD + (Heartbeat or Pacemaker).
DRBD gives you a distributed block device over 2 nodes, on top of which you can deploy an ext3 filesystem.
Heartbeat/Pacemaker gives you a way to manage which node is active and which is passive, plus some monitoring/repair functions.
If you need read access on the "passive" node too, consider also configuring a NAS on the nodes, which the passive node can mount, e.g. via NFS or CIFS.
Note that running a database like PostgreSQL or MySQL on network-attached storage might not work well.
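A sketch of what the DRBD side could look like: a two-node resource file (hostnames, disks, and addresses are placeholders), with Pacemaker then deciding on which node the resource is promoted and the ext3 filesystem mounted:

```
# /etc/drbd.d/r0.res (illustrative)
resource r0 {
    protocol C;                  # synchronous replication between the nodes
    on node1 {
        device    /dev/drbd0;
        disk      /dev/sdb1;     # backing partition on node1
        address   10.0.0.1:7788;
        meta-disk internal;
    }
    on node2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;     # backing partition on node2
        address   10.0.0.2:7788;
        meta-disk internal;
    }
}
```

Only the node where /dev/drbd0 is promoted to primary mounts the filesystem, which is what enforces the single-writer rule described in the question.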
Are you going to be writing the applications from scratch? If so, you can consider using Zookeeper for the coordination between master and slave. This will put the coordination logic purely into the application code.
GPFS is inherently a clustered filesystem.
You set up your servers to see the same LUN(s), build the GPFS filesystem on the LUN(s), and mount the GPFS filesystem on the machines.
If you are familiar with NFS, it looks like NFS, but it's GPFS, a clustered filesystem by nature.
And if one of your GPFS servers goes down, provided you defined your environment correctly, no one is the wiser and things continue to run.