How to delete one of several git-annex replicas? - git-annex

Say I have several (normal) git-annex replicas. Now, for some reason I want to give one of the machines or hard drives away, so I want to reduce the number of replicas by one, i.e. delete one replica.
The first thing I can do is run git annex copy . -t other to ensure that all the content is present in at least one other replica. Then I can run git annex drop . followed by git annex sync to remove all the content from the replica I want to delete.
But, what do I have to do to tell the other replicas that this one is gone? Should I just remove the git remote? Or do I have to invoke a special git annex command?

You need to tell one of your other repositories that this repository is dead, using git annex dead. git annex sync will then propagate this information to all other repositories, so all of them will eventually know that its data is no longer accessible.
After marking it as dead, git annex info should no longer list it, and git annex sync will no longer try to sync with it.
For more information, see:
https://git-annex.branchable.com/tips/what_to_do_when_you_lose_a_repository/
https://git-annex.branchable.com/git-annex-dead/
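Putting the whole workflow together, a minimal sketch might look like this (the remote name otherremote and the repository name oldreplica are placeholders for your actual names):
# On the replica that is going away: make sure its content exists elsewhere, then drop it
git annex copy . --to otherremote
git annex drop .
git annex sync
# On one of the remaining replicas: declare the departing repository dead
git annex dead oldreplica          # use the name or uuid shown by 'git annex info'
git annex sync                     # propagates the information to the other replicas
git remote remove oldreplica       # optionally also remove the git remote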

Related

Surprising behavior with replicated, deleted CouchDb documents

We have two CouchDB servers, let's call them A and B. There's one-way replication from A to B, and documents are only created, modified, or deleted on A - basically you can think of B as just a backup. There was a document on A that was deleted. When I tried to retrieve the revision prior to deletion from A, I got {"error":"not_found","reason":"missing"}, but that DB hasn't been compacted (as I understand it, compaction only happens if you start it manually, and that wasn't done). However, while B knew the document had been deleted, the old revision was still available on B.
My understanding is that if we haven't manually run compaction, the old revision should always be available on A. Furthermore, when B replicates, if there were multiple revisions since the last replication, it'll pull metadata for the old revisions but might not pull the documents. Thus, in this setup, the set of revisions available on B should always be a proper subset of those available on A. So how could B have a revision that A does not?
We're on CouchDb 2.3.0.
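The lookups on both servers were roughly of this form (hostnames, database name, document ID, and the revision value are placeholders):
curl 'http://serverA:5984/mydb/mydoc?rev=2-abc123'    # A: {"error":"not_found","reason":"missing"}
curl 'http://serverB:5984/mydb/mydoc?rev=2-abc123'    # B: returns the old document body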

How to rollback using Datastax driver 3.0

I have 2 repositories and I want a piece of information to be added to both of them. If the 2nd one fails, then I want to roll back the operation in the first repository. Is there a way to do operations atomically in Cassandra in code, not CQL?
I suppose I could try to delete the entry in the first repository if the operation in the 2nd repository fails, but I'd prefer a way in which Cassandra does it so that I don't have to maintain the logic in my application.

A quick guide on Salt-based install of Spark cluster

I tried asking this on the official Salt user forum, but for some reason I did not get any assistance there. I am hoping I might get help here.
I am a new user of Salt. I am still evaluating the framework as a candidate for our SCM tool (as opposed to Ansible).
I went through the tutorial, and I am able to successfully manage master-minion/s relationship as covered in the first half of the tutorial.
The tutorial now forks into many different, intricate areas.
What I need is relatively straightforward, so I am hoping that perhaps someone can guide me on how to accomplish it.
I am looking to install Spark and HDFS on 20 RHEL 7 machines (let's say in the range 168.192.10.0-20, where .0 is the name node).
I see:
https://github.com/saltstack-formulas/hadoop-formula
and I found a third-party Spark formula:
https://github.com/beauzeaux/spark-formula
Could someone be kind enough to suggest a set of instructions on how to go about this install in a most straightforward way?
Disclaimer: This answer describes only the rough process of what you need to do. I've distilled it from the respective documentation chapters and added the sources for reference. I'm assuming that you are familiar with the basic workings of Salt (states and pillars and whatnot) and also with Hadoop (I'm not).
1. Configure GitFS
The typical way to install Salt formulas is using GitFS. See the respective chapter from the Salt manual for an in-depth documentation.
This needs to be done on your Salt master node.
Enable GitFS in the master configuration file (typically /etc/salt/master, or a separate file in /etc/salt/master.d):
fileserver_backend:
- git
Add the two Salt formulas that you need as remotes (same file). This is also covered in the documentation:
gitfs_remotes:
- https://github.com/saltstack-formulas/hadoop-formula.git
- https://github.com/beauzeaux/spark-formula
(optional): Note the following warning from the Formula documentation:
We strongly recommend forking a formula repository into your own GitHub account to avoid unexpected changes to your infrastructure.
Many Salt Formulas are highly active repositories so pull new changes with care. Plus any additions you make to your fork can be easily sent back upstream with a quick pull request!
Fork the formulas into your own Git repository (using GitHub or otherwise) and use your private Git URL as remote in order to prevent unexpected changes to your configuration.
Restart Salt master.
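To make the master pick up the GitFS configuration and fetch the formulas, something along these lines should do (assuming a systemd-based RHEL 7 setup):
systemctl restart salt-master                     # reload the master configuration
salt-run fileserver.update                        # fetch the gitfs remotes right away
salt-run fileserver.file_list | grep -i spark     # sanity check: formula files should be served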
2. Install Hadoop
This is documented in depth in the formula's README file. From a cursory reading, the formula can set up both Hadoop masters and slaves; the role is determined using a Salt grain.
Configure the Hadoop role in the file /etc/salt/grains. This needs to be done on each Salt minion node (use hadoop_master and hadoop_slave appropriately):
roles:
- hadoop_master
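As an alternative to editing /etc/salt/grains by hand on every box, the grains can also be set remotely from the master using the grains.append execution function (the targets namenode* and datanode* below are placeholders for your actual minion IDs):
salt 'namenode*' grains.append roles hadoop_master
salt 'datanode*' grains.append roles hadoop_slave
salt '*' grains.get roles          # verify the roles were set as expected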
Configure the Salt mine on your Salt minion (typically /etc/salt/minion or a separate file in /etc/salt/minion.d):
mine_functions:
  network.interfaces: []
  network.ip_addrs: []
  grains.items: []
Have a look at additional configuration grains and set them as you see fit.
Add the required pillar data for configuring your Hadoop setup. We're back on the Salt master node for this (I'm assuming you are familiar with states and pillars; see the manual or this walkthrough otherwise). Have a look at the example pillar for possible configuration options.
Use the hadoop and hadoop.hdfs states in your top.sls:
base:
  'your-hadoop-hostname*':
    - hadoop
    - hadoop.hdfs
3. Install Spark
According to the Formula's README, there's nothing to configure via grains or pillars, so all that's left is to use the spark state in your top.sls:
base:
  'your-hadoop-hostname*':
    - hadoop
    - hadoop.hdfs
    - spark
4. Fire!
Apply all states:
salt 'your-hadoop-hostname*' state.highstate
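If you'd like a dry run first, state.highstate accepts test=True and will only report what would change without applying anything:
salt 'your-hadoop-hostname*' state.highstate test=True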

Cassandra - Delete Old Versions of Tables and Backup Database

Looking in my keyspace directory I see several versions of most of my tables. I am assuming this is because I dropped them at some point and recreated them as I was refining the schema.
table1-b3441432142142sdf02328914104803190
table1-ba234143018dssd810412asdfsf2498041
These generated table names are very cumbersome to work with. Try changing to one of the directories without copy-pasting the directory name from the terminal window... Painful. So easy to mistype something.
That side note aside, how do I tell which directory is the most current version of the table? Can I automatically delete the old versions? I am not clear on whether these are considered snapshots or not, since each directory can also contain snapshots. I read in another post that you can stop autosnapshot, but I'm not sure I want that. I'd rather just automatically delete any tables not currently in use (i.e., those that are not the latest version).
I stumbled across this while trying to do a backup. I realized I am forced to go to every table directory and copy out the snapshot files (there are like 50 directories, not including all the old table versions), which seems like a terrible design (maybe I'm missing something??).
I assumed I could do a snapshot of the whole keyspace and get one file back or at least output all the files to a single directory that represents the snapshot of the entire keyspace. At the very least it would be nice knowing what the current versions are so I can grab the correct files and offload them to storage somewhere.
DataStax Enterprise has a backup feature but it only supports AWS and I am using Azure.
So to clarify:
1. How do I automatically delete old table versions and know which is the current version?
2. How can I back up the most recent versions of the tables and output the files to a single directory that I can offload somewhere? I only have two nodes, so simply relying on repair is not a good option for me if a node goes down.
You can see the active version of a table by looking in the system keyspace and checking the cf_id field. For example, to see the version for a table in the 'test' keyspace with table name 'temp', you could do this:
cqlsh> SELECT cf_id FROM system.schema_columnfamilies WHERE keyspace_name='test' AND columnfamily_name='temp' allow filtering;
cf_id
--------------------------------------
d8ea9830-20e9-11e5-afc0-c381f961c62a
As far as I know, it is safe to delete (rm -r) outdated table version directories that are no longer active. I imagine they don't delete them automatically so that you can recover the data if you dropped them by mistake. I don't know of a way to have them removed automatically even if auto snapshot is disabled.
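As a rough illustration (the data path below assumes a default install under /var/lib/cassandra/data), the directory suffix is just the cf_id with the dashes stripped, so you can match the active directory against the query result like this:
ACTIVE_ID=$(echo 'd8ea9830-20e9-11e5-afc0-c381f961c62a' | tr -d '-')   # cf_id from the query above
ls -d /var/lib/cassandra/data/test/temp-* | grep    "$ACTIVE_ID"       # the current table directory
ls -d /var/lib/cassandra/data/test/temp-* | grep -v "$ACTIVE_ID"       # stale directories, removal candidates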
I don't think there is a command to write all the snapshot files to a single directory. According to the documentation on snapshot, "After the snapshot is complete, you can move the backup files to another location if needed, or you can leave them in place." So it's left up to the application developer how they want to handle archiving the snapshot files.
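That said, if the goal is just to end up with one directory per keyspace that you can offload, a rough shell sketch like the following could gather the files after a snapshot (keyspace name, snapshot tag, destination, and the default data path are all assumptions):
KEYSPACE=test
TAG=offload-20160101                          # tag passed to 'nodetool snapshot -t'
DEST=/mnt/offload/$KEYSPACE-$TAG
nodetool snapshot -t "$TAG" "$KEYSPACE"       # snapshots every table in the keyspace
for dir in /var/lib/cassandra/data/$KEYSPACE/*/snapshots/$TAG; do
  table=$(basename "$(dirname "$(dirname "$dir")")")   # e.g. temp-d8ea9830...
  mkdir -p "$DEST/$table"
  cp "$dir"/* "$DEST/$table/"
done
From there the single directory can be tarred up and shipped to Azure storage or wherever it needs to go.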

Best practices for cleaning up Cassandra incremental backup folders

We have incremental backup on our Cassandra cluster. The "backups" folders under the data folders now contain a lot of data and some of them have millions of files.
According to the documentation: "DataStax recommends setting up a process to clear incremental backup hard-links each time a new snapshot is created."
It's not clear to me what the best way is to clear out these files. Can they all just be deleted when a snapshot is created, or should we delete files that are older than a certain period?
My thought was, just to be on the safe side, to run a regular script to delete files more than 30 days old:
find [Cassandra data root]/*/*/backups -type f -mtime +30 -delete
Am I being too careful? We're not concerned about having a long backup history.
Thanks.
You are probably being too careful, though that's not always a bad thing, but there are a number of considerations. A good pattern is to keep multiple snapshots (for example, weekly snapshots going back some period) plus all incremental backups taken during that window, so you can restore to known states. For example, if your most recent snapshot doesn't work for whatever reason, but you still have your previous snapshot plus all sstables since then, you can use that.
You can delete all the backups that have been created once your snapshot completes, as the act of taking the snapshot flushes memtables and hard-links all sstables into a snapshots directory. Just make sure your snapshots are actually happening and completing (it's a pretty solid process since it hard-links) before getting rid of old snapshots and deleting backups.
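A hedged sketch of that routine (the keyspace name mykeyspace and the default data path are assumptions):
TAG=weekly-$(date +%Y%m%d)
nodetool snapshot -t "$TAG" mykeyspace        # flushes memtables and hard-links the current sstables
nodetool listsnapshots | grep "$TAG"          # confirm the snapshot is really there
find /var/lib/cassandra/data/mykeyspace/*/backups -type f -delete   # then clear the incremental backups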
You should also make sure to test your restore process, as that'll give you a good idea of what you will need. You should be able to restore from your last snapshot plus the sstables backed up since that time. It would be a good idea to fire up a new cluster and try restoring data from your snapshots + backups, or maybe to try out this process in place in a test environment.
I like to point to this article, 'Cassandra and Backups', as a good rundown of backing up and restoring Cassandra.
