How can I add extra disk space to my elasticsearch nodes - linux

The setup is 3 nodes: two warm nodes with 5 TB of storage and a hot node with 2 TB. I want to add 2 TB of storage to each of the two warm nodes.
Each node runs as a Docker container on a Linux server, which will be shut down while the disks are added. I do not know how to make Elasticsearch utilize the extra space after adding the disks.
No docker-compose files are used.
The Elasticsearch image is started without specifying volumes; only the elasticsearch.yml file is passed in. That file does not set any of the path properties.

You can use multiple data paths by editing the YAML configuration file:
path:
  data:
    - /mnt/disk_1
    - /mnt/disk_2
    - /mnt/disk_3
In recent Elasticsearch versions this option is deprecated.
See the official documentation to migrate to an alternative configuration.
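Note that because the container is started without volumes, the new disks also have to be made visible inside the container before Elasticsearch can use them. A minimal sketch, in which the host mount points, config path and image version are hypothetical:

# bind-mount the new disks and the config file into the container
docker run -d --name elasticsearch \
  -v /mnt/disk_1:/usr/share/elasticsearch/data1 \
  -v /mnt/disk_2:/usr/share/elasticsearch/data2 \
  -v /path/to/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
  docker.elastic.co/elasticsearch/elasticsearch:<version>

The mounted elasticsearch.yml would then list the container-side directories:

path:
  data:
    - /usr/share/elasticsearch/data1
    - /usr/share/elasticsearch/data2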

Related

Kibana 6.5 pointing to multiple elasticsearch nodes

I was wondering if Kibana 6.5.0 supports the option of pointing to multiple Elasticsearch nodes.
I have 5 Elasticsearch nodes in a cluster setup and I want to point a single Kibana instance at those nodes (I do not want to use a query node or similar).
I tried the "elasticsearch.host" setting in the YAML file, but it only supports one Elasticsearch URL.
I also tried the "elasticsearch.url" and "elasticsearch.urls" settings as specified in a specific section on elastic.io, but they do not work... basically Kibana crashes.
Any idea whether I can point this specific version of Kibana to multiple cluster nodes? If so, any example of how to use the setting?
Thank you.
It is not possible; this version does not support multiple hosts.
This feature was implemented in version 6.6.
To be able to point Kibana to multiple hosts you will need to upgrade your stack.
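For reference, from 6.6 onward the setting was renamed to elasticsearch.hosts and accepts a list. A minimal kibana.yml sketch (the node addresses are hypothetical):

elasticsearch.hosts:
  - "http://es-node1:9200"
  - "http://es-node2:9200"
  - "http://es-node3:9200"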

How to forward logs to s3 from yarn container?

I am setting up Spark on a Hadoop YARN cluster on AWS EC2 machines.
This cluster will be ephemeral (running for a few hours within a day), hence I want to forward the generated container logs to S3.
I have seen that Amazon EMR supports this feature by forwarding logs to S3 every 5 minutes.
Is there any built-in configuration inside Hadoop/Spark that I can leverage?
Any other solution to this issue will also be helpful.
Sounds like you're looking for YARN log aggregation.
I haven't tried changing it myself, but you can configure yarn.nodemanager.remote-app-log-dir to point at an S3 filesystem, assuming you've set up your core-site.xml accordingly.
yarn.log-aggregation.retain-seconds and yarn.log-aggregation.retain-check-interval-seconds control how long the aggregated logs are kept and how often the retention check runs.
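A rough sketch of the relevant yarn-site.xml settings, assuming the S3A filesystem and credentials are already configured in core-site.xml (the bucket name is hypothetical):

<!-- yarn-site.xml -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <!-- aggregated container logs are written here when an application finishes -->
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>s3a://my-yarn-logs/app-logs</value>
</property>
<property>
  <!-- keep aggregated logs for one day -->
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>86400</value>
</property>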
An alternative solution would be to build your own AMI that has Fluentd or Filebeat pointing at the local YARN log directories, then set up those log forwarders to write to a remote location. For example, Elasticsearch (or one of the AWS logging solutions) would be a better choice than just S3.
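If you go the Filebeat route, a minimal filebeat.yml sketch might look like this (the YARN log path and Elasticsearch endpoint are hypothetical and depend on your NodeManager configuration):

filebeat.inputs:
  - type: log
    paths:
      # local NodeManager container log directory (hypothetical layout)
      - /var/log/hadoop-yarn/containers/*/*/*.log
output.elasticsearch:
  # remote cluster that outlives the ephemeral EC2 nodes
  hosts: ["http://logging-es:9200"]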

Cassandra Snapshot running on kubernetes

I'm using Kubernetes (via minikube) to deploy my Lagom services and my Cassandra DB.
After a lot of work, I succeeded in deploying my services and my DB on Kubernetes.
Now I'm about to manage my data, and I need to generate a backup each day.
Is there any solution to generate and restore a snapshot (backup) for Cassandra running on Kubernetes?
Cassandra StatefulSet image:
gcr.io/google-samples/cassandra:v12
Cassandra node:
svc/cassandra ClusterIP 10.97.86.33 <none> 9042/TCP 1d
Any help, please?
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsBackupRestore.html
That link contains all the information you need. Basically, you use the nodetool snapshot command to create hard links of your SSTables. Then it's up to you to decide what to do with the snapshots.
I would define a new disk in the StatefulSet and mount it to a folder, e.g. /var/backup/cassandra; the backup disk is network storage. Then I would create a simple script (sketched after this list) that:
Runs 'nodetool snapshot'
Gets the snapshot id from the output of the command.
Copies all the files in the snapshot folder to /var/backup/cassandra
Deletes the snapshot folder
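A rough sketch of such a script, assuming the default data directory /var/lib/cassandra/data and the /var/backup/cassandra mount described above:

#!/bin/bash
# tag the snapshot with today's date
TAG=$(date +%Y%m%d)
nodetool snapshot -t "$TAG"

# copy every <keyspace>/<table>/snapshots/<TAG> directory to the backup mount
find /var/lib/cassandra/data -type d -path "*/snapshots/$TAG" | while read -r dir; do
  # keep the keyspace/table structure under the backup folder
  rel=$(echo "$dir" | sed 's|^/var/lib/cassandra/data/||; s|/snapshots/.*$||')
  mkdir -p "/var/backup/cassandra/$TAG/$rel"
  cp -r "$dir"/. "/var/backup/cassandra/$TAG/$rel/"
done

# remove the on-disk snapshot once it has been copied
nodetool clearsnapshot -t "$TAG"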
Now all I have to do is make sure the backups on the network drive are also stored somewhere else for the long term.
Disclaimer: I haven't actually done this, so there might be a step missing, but this would be the first thing I would try based on the DataStax documentation.

Cassandra 2+ HPC Deployment

I am trying to deploy Cassandra on a Linux-based HPC cluster and I need some guidelines if possible. Specifically, what is the difference between running Cassandra locally and in a cluster?
When managing it locally (in which case it runs smoothly), we duplicate the original files for every node inside our Cassandra directory and apply the appropriate changes for IP address, rpc, JMX, etc. However, when managing a network, which files do we need to install on each node: the whole package with all the files, or just some of the required ones,
like bin/cassandra.in.sh, conf/cassandra.yaml, and bin/cassandra?
I am a little confused about what to store on each node separately in order to start working on the cluster.
You need to install Cassandra on each node (VM), i.e. the whole package, and then update the config files as necessary. As described here, to configure a cluster in a single data center you need to do the following (see the cassandra.yaml sketch after this list):
Install Cassandra on each node
Configure the cluster name
Configure the seeds
Configure the snitch, if needed
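These settings map to a handful of lines in each node's cassandra.yaml. A minimal sketch, where the cluster name, IPs and snitch are placeholders for your HPC environment:

cluster_name: 'MyHPCCluster'
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2"   # a couple of stable nodes, identical on every node
listen_address: 10.0.0.1             # this node's own IP, different on every node
rpc_address: 10.0.0.1
endpoint_snitch: GossipingPropertyFileSnitch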

Datastax Cluster Storage Amazon Ec2 - Production

I have a Datastax Enterprise cluster in production with the following configuration:
3 Hadoop Nodes
2 Cassandra Nodes
2 Solr Nodes
There are a few tables in Cassandra with a few million rows.
Every night I process a few million records using Pig.
All the search done on our website uses Solr.
Basically we are 100% based on DSE.
This setup runs on Amazon EC2, and all the instances are:
m3.xlarge
80 GB SSD
15 GB RAM
13 ECUs (4 cores x 3.25 units)
I want to add an extra 1 TB hard disk to each node and use it in the cluster.
How can I do that? Which config files do I need to change when I attach a new hard disk?
After attaching the new hard drive storage to the EC2 instance, edit the cassandra.yaml file and add the new storage location to the data_file_directories configuration option. (Cassandra supports multiple entries for data storage, and will spread the data out.)
The config file location depends on your installation method: either /etc/dse/cassandra/cassandra.yaml or {install_location}/resources/cassandra/conf/cassandra.yaml.
After making the config file change, DSE will need to be restarted on each node (you could do a rolling restart).
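For example, assuming the new 1 TB disk is mounted at /mnt/data2 (a hypothetical mount point that must be writable by the user DSE runs as), the cassandra.yaml entry would look like:

data_file_directories:
  - /var/lib/cassandra/data
  - /mnt/data2/cassandra/data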
Reference: https://stackoverflow.com/a/23121664/9965
