ClickHouse shared dictionaries - columnstore

Is there any way to "share" or "replicate" a dictionary among multiple machine in the same shared and/or cluster using clickhouse.
Currently I have ~10 files for external dictionaries that clickhouse loads up (and a few csvs where the data is loaded from). All the dictionaries are quite small and critical for a lot of queries, so I'd like to find a way to distribute them instead of having to maintain up to date copies on every cluster.
Is there any way to do this ?

You can try using configuration management tools like Puppet or Ansible to roll out your dictionaries (together with the ClickHouse configs) across multiple nodes,
or just use rsync or an NFS share and reload the dictionaries in ClickHouse as necessary.
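For example, a minimal sketch of the rsync-and-reload approach (hostnames and paths below are placeholders; the dictionary config and source locations depend on your install):

    # push dictionary configs and CSV sources to each node, then ask ClickHouse to reload them
    for host in ch-node-1 ch-node-2 ch-node-3; do
      rsync -az /etc/clickhouse-server/dict/      "$host":/etc/clickhouse-server/dict/
      rsync -az /var/lib/clickhouse/dict_sources/ "$host":/var/lib/clickhouse/dict_sources/
      ssh "$host" 'clickhouse-client --query "SYSTEM RELOAD DICTIONARIES"'
    done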

Related

How to perform backup and restore of a JanusGraph database which is backed by Apache Cassandra?

I'm having trouble figuring out how to take a backup of a JanusGraph database that uses Apache Cassandra as its persistent storage.
I'm looking for the correct methodology for performing backup and restore tasks. I'm very new to this concept and have no idea how to do it. It would be highly appreciated if someone could explain the correct approach or point me to the right documentation to safely execute these tasks.
Thanks a lot for your time.
Cassandra can be backed up a few ways. One way is called a "snapshot", which you issue via the "nodetool snapshot" command. What Cassandra does is create a "snapshots" sub-directory, if it doesn't already exist, under each table being backed up (each table has its own directory where it stores its data), and then create a specific directory for this particular occurrence of the snapshot (you can either name the directory with a "nodetool snapshot" parameter or let it default). Cassandra then creates soft links to all of the sstables that exist for that particular table, looping through each table, keyspace or database depending on your "nodetool snapshot" parameters. It's very fast, as creating soft links takes almost no time.
You will have to perform this command on each node in the Cassandra cluster to back up all of the data, and each node's data will be backed up to the local host. I know DSE, and possibly Apache, are adding functionality to back up to object storage as well (I don't know if this is an OpsCenter-only capability or if it can be done via the snapshot command too). You will have to watch the space consumption, as there are no processes that clean these snapshots up.
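As a rough illustration of that workflow (keyspace and tag names here are hypothetical, and exact flags vary a little between Cassandra versions):

    # run on every node in the cluster; the snapshot lands under each table's data directory,
    # e.g. /var/lib/cassandra/data/mykeyspace/<table>-<id>/snapshots/pre_upgrade/
    nodetool snapshot -t pre_upgrade mykeyspace

    # snapshots are never cleaned up automatically, so remove them once no longer needed
    nodetool clearsnapshot -t pre_upgrade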
Like many database systems, you can also purchase/use 3rd party software to perform backups (e.g. Cohesity (formerly Talena), Rubrik, etc.). We use one such product in our environments and it works well (graphical interface, easy-to-use point-in-time recovery, etc.). They also offer easy-to-use "refresh" capabilities (e.g. refreshing your PT environment from, say, production backups).
Those are probably the two best options.
Good luck.

Cloning CouchDB data from one server to another through the file system (without the replicator)

We have two nodes with CouchDB installed. One of the nodes has data on it, and we want to copy that data to the other CouchDB instance. We want to avoid the replicator due to the volume of the data.
We tried copying the data from %couchdb%/data/shards and %couchdb%/data/.shards to the corresponding locations on the target node, as per one of the suggestions in CouchDB backups and cloning the database,
but we are not able to see the data in that server's Fauxton UI. Can someone suggest what is missing?
Couchtransform lets you convert or just clone data from one database to another; it's multi-threaded and won't need to deal with massive files.
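For reference, the file-level copy the question describes would look roughly like this (paths assume a default CouchDB 2.x install and are placeholders; note that copying only the shard directories may not be enough, since database metadata also lives outside them, which is likely why the data did not appear in Fauxton):

    # stop CouchDB on the target node before copying
    sudo systemctl stop couchdb
    # pull the shard files from the source node (default data dir shown; adjust to yours)
    rsync -az source-node:/opt/couchdb/data/shards/  /opt/couchdb/data/shards/
    rsync -az source-node:/opt/couchdb/data/.shards/ /opt/couchdb/data/.shards/
    sudo chown -R couchdb:couchdb /opt/couchdb/data
    sudo systemctl start couchdb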

Use Microsoft Azure as a computing cluster

My lab just got a sponsorship from Microsoft Azure and I'm exploring how to utilize it. I'm new to industrial-level cloud services and pretty confused by the tons of terminologies and concepts. In short, here is my scenario:
I want to run the same algorithm on multiple datasets, i.e. data parallelism.
The algorithm is implemented in C++ on Linux (Ubuntu 16.04). I did my best to use static linking, but it still depends on some dynamic libraries. However, these dynamic libraries can easily be installed via apt.
Each dataset is structured, meaning the data (images, other files...) are organized in folders.
The ideal system configuration would be a bunch of identical VMs and a shared file system. Then I could submit my jobs with 'qsub' from a script or something. Is there a way to do this on Azure?
I investigated the Batch service, but I'm having trouble installing dependencies after creating a compute node. I also had trouble with storage: so far I've only seen examples of using Batch with Blob storage, which is unstructured.
So are there any other services in Azure that can meet my requirements?
I somehow figured it out myself based on this article: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-classic-hpcpack-cluster/. Here is my solution:
Create an HPC Pack cluster with a Windows head node and a set of Linux compute nodes. There are several useful templates in the Marketplace.
From the head node we can execute commands on the Linux compute nodes, either inside HPC Cluster Manager or using "clusrun" in PowerShell. We can easily install dependencies via apt-get on the compute nodes.
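Installing packages across the node group might look something like this (the node group name and package list are placeholders, and the exact clusrun options depend on your HPC Pack version):

    # run from PowerShell on the Windows head node against all Linux compute nodes
    clusrun /nodegroup:LinuxNodes apt-get update
    clusrun /nodegroup:LinuxNodes apt-get install -y libboost-all-dev libopencv-dev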
Create a file share inside one of the storage accounts. This can be mounted by all machines inside the cluster.
One glitch here is that, for some encryption reason, you cannot mount the file share on Linux machines outside Azure. There are two solutions in my head: (1) mount the file share on the Windows head node and share files from there, either by FTP or SSH; (2) create another Linux VM (as a bridge), mount the file share on that VM, and use "scp" to communicate with it from outside. Since I'm not familiar with Windows, I adopted the latter solution.
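With the bridge approach, moving a dataset in from outside Azure is just a copy through that VM (names and paths below are hypothetical):

    # from the machine outside Azure: push a dataset to the bridge VM, which has the
    # Azure file share mounted at /mnt/azurefiles
    scp -r ./dataset-01 azureuser@<bridge-vm-public-ip>:/mnt/azurefiles/datasets/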
For the executable, I simply uploaded the binary compiled on my local machine. Most dependencies are statically linked, but there are still a few dynamic objects. I uploaded these dynamic objects to Azure as well and set LD_LIBRARY_PATH when executing the program on the compute nodes.
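Something along these lines, assuming the shared objects were uploaded to the mounted file share (binary name, flags and paths are placeholders):

    # point the dynamic loader at the uploaded .so files before launching the binary
    export LD_LIBRARY_PATH=/mnt/azurefiles/libs:$LD_LIBRARY_PATH
    /mnt/azurefiles/bin/my_algorithm --input /mnt/azurefiles/datasets/dataset-01 \
                                     --output /mnt/azurefiles/results/dataset-01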
Job submission is done on the Windows head node. To make it more flexible, I wrote a Python script which writes XML files; the Job Manager can then load these XML files to create jobs. Here are some instructions: https://msdn.microsoft.com/en-us/library/hh560266(v=vs.85).aspx
I believe there should be a more elegant solution with the Azure Batch service, but so far my small cluster runs pretty well with HPC Pack. Hope this post can help somebody.
Azure Files could provide you with a shared file solution for your Ubuntu boxes - details are here:
https://azure.microsoft.com/en-us/documentation/articles/storage-how-to-use-files-linux/
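As a sketch, mounting such a share over SMB on Ubuntu looks roughly like this (storage account, share name and key are placeholders; see the linked article for the exact options):

    sudo apt-get install -y cifs-utils
    sudo mkdir -p /mnt/azurefiles
    sudo mount -t cifs //<storage-account>.file.core.windows.net/<share-name> /mnt/azurefiles \
        -o vers=3.0,username=<storage-account>,password=<storage-account-key>,dir_mode=0777,file_mode=0777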
Again, depending on your requirements, you can create a pseudo folder structure in Blob storage by using containers and "/" separators in the naming strategy of your blobs.
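For instance, with the Azure CLI (account and container names are placeholders), a "/" in the blob name behaves like a folder:

    az storage blob upload --account-name <storage-account> --container-name datasets \
        --name "dataset-01/images/img_0001.png" --file ./img_0001.png
    az storage blob list --account-name <storage-account> --container-name datasets \
        --prefix "dataset-01/images/" --output table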
To David's point, whilst Batch is generally looked at for these kinds of workloads, it may not fit your solution. VM Scale Sets (https://azure.microsoft.com/en-us/documentation/articles/virtual-machine-scale-sets-overview/) would allow you to scale your compute capacity either by load or on a schedule, depending on your workload's behavior.

Cassandra data clone to another Cassandra database (different servers)

My question is as mentioned above: I have a Cassandra database and want to use this data on another server. How can I move all of a keyspace's data?
I have snapshots, but I don't know whether I can open them on another server.
Thanks for your help.
Unfortunately, you have limited options to move data across clouds: primarily the COPY command or sstableloader (https://docs.datastax.com/en/cassandra/2.1/cassandra/migrating.html), or, if you plan to maintain a like-for-like setup (same number of nodes) across clouds, simply copying the snapshots under the data directory would work.
If you are moving to IBM Softlayer, you may be able to use software-defined storage solutions that get deployed on bare metal and provide features like thin clones, which allow you to create clones of Cassandra clusters in a matter of minutes with incredible space savings. This is rather useful for creating clones for dev/test purposes. Check out Robin Systems, you may find them interesting.
The cleanest way to migrate your data from one cluster to another is using the sstableloader tool. This will allow you to stream the contents of your sstables from a local directory to a remote cluster. In this case the new cluster can also be configured differently and you also don't have to worry about assigned tokens.
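A rough sketch of that workflow (keyspace, table and host names are hypothetical; sstableloader expects the sstables staged in a <keyspace>/<table> directory layout):

    # on a source node: snapshot the table, stage the sstables, then stream them to the new cluster
    nodetool snapshot -t migrate mykeyspace
    mkdir -p /tmp/load/mykeyspace/mytable
    cp /var/lib/cassandra/data/mykeyspace/mytable-*/snapshots/migrate/* /tmp/load/mykeyspace/mytable/
    sstableloader -d target-node-1,target-node-2 /tmp/load/mykeyspace/mytable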

Processing speed over mounted path

I have two scenarios.
Scenario 1: Machine A contains 1000 documents, as folders. This folder on machine A is mounted on machine B. I process the documents within these folders on machine B and store the output results in the mounted path on machine B.
Scenario 2: The documents on machine A are copied directly to machine B and processed there.
Scenario 2 is much faster than scenario 1. My guess is that this is because no data transfer is happening over the network between the two machines. Is there a way I can keep using the mount and still achieve better performance?
Did you try enabling a cache?
- for NFS: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/fscachenfs.html
- CIFS should have caching enabled by default (unless you disabled it)
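For the NFS case, a minimal sketch of enabling FS-Cache (export path and mount point are placeholders; see the linked guide for details):

    # install and start the cache daemon, then mount the export with the 'fsc' option
    sudo apt-get install -y cachefilesd      # on Debian/Ubuntu you may also need RUN=yes in /etc/default/cachefilesd
    sudo systemctl enable --now cachefilesd
    sudo mount -t nfs -o fsc machineA:/export/documents /mnt/documents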
The other option would be to use something like Windows' offline files, which copies files and folders between client and server in the background so you don't need to deal with it. The only thing I've found for Linux is OFS.
But the performance depends on the size of the files and whether you read them randomly or sequentially. For instance, when I am encoding videos, I access the file directly over the network from my NFS share, because that takes about as much time as reading and writing the file locally would. This way no additional time is "wasted" on the encoding, as the application can encode the stream coming in from the network.
So for large files you might want to change the algorithm to sequential reads; small files, on the other hand, which are copied within seconds, could also be synced between server and client using rsync, BitTorrent Sync, Dropbox or one of the other hundreds of tools. And this is actually quite commonly done.
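A minimal example of the sync-based approach for small files (host and paths are placeholders):

    # keep a local working copy on machine B in sync with machine A, then process the local copy
    rsync -az --delete machineA:/export/documents/ /data/documents/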
