Apache ZooKeeper: ask a server for its RAM / resources

Is it possible to use zookeeper to monitor the server resources or to retrieve a number representing the total RAM available?

Apache ZooKeeper is not a monitoring tool; it's a distributed manager for configuration sharing, naming and synchronization.
Here is a good description of what Apache ZooKeeper is and how it can be used: https://stackoverflow.com/a/8864303/2453586
As a workaround, you could write a script that checks the amount of available RAM every N minutes/seconds (via cron or as a daemonized process) on each machine and writes the value to a ZooKeeper znode, such as zookeeper_host:2181/ram/host1/available. But it's not a good idea.
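For illustration, here is a minimal sketch of that workaround using the ZooKeeper Java client. The connect string, the znode path and the /proc/meminfo parsing are placeholders, and the parent znodes (/ram/host1) are assumed to already exist:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RamToZnode {
    public static void main(String[] args) throws Exception {
        // Read the "MemAvailable" line from /proc/meminfo (Linux only).
        String memAvailable = Files.lines(Paths.get("/proc/meminfo"))
                .filter(line -> line.startsWith("MemAvailable"))
                .findFirst()
                .orElse("MemAvailable: unknown");

        // Znode layout from the answer above: /ram/<host>/available.
        // "zookeeper_host" and "host1" are placeholders; /ram and /ram/host1
        // must already exist (or be created first).
        String path = "/ram/host1/available";
        ZooKeeper zk = new ZooKeeper("zookeeper_host:2181", 30000, event -> { });
        byte[] data = memAvailable.getBytes(StandardCharsets.UTF_8);

        if (zk.exists(path, false) == null) {
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData(path, data, -1); // -1 = ignore the version check
        }
        zk.close();
    }
}

Run from cron, each host keeps its own znode up to date.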
You're better off using dedicated tools like Zabbix, Nagios or Ganglia for memory monitoring purposes.

You are probably looking for something like JMX.
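If all you need is that one number and there is already a JVM running on the box, the JDK itself can report physical memory through its platform MBean. A minimal sketch, assuming a HotSpot/OpenJDK JVM (the com.sun.management extension):

import java.lang.management.ManagementFactory;

public class PhysicalMemory {
    public static void main(String[] args) {
        // The com.sun.management variant of OperatingSystemMXBean exposes
        // physical memory figures; the cast fails on JVMs that don't ship it.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

        System.out.println("total RAM: " + os.getTotalPhysicalMemorySize() + " bytes");
        System.out.println("free RAM:  " + os.getFreePhysicalMemorySize() + " bytes");
    }
}

The same MBean is reachable remotely over JMX if the process is started with the usual com.sun.management.jmxremote flags.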

Related

Planning Graphite components for monitoring a big Cassandra cluster

I am planning to set up an 80-node Cassandra cluster (currently version 2.1, but it will be upgraded to 3 in the future).
I have gone through http://graphite.readthedocs.io/en/latest/tools.html, which lists the tools that Graphite supports.
I want to decide which tools to choose as the listener and storage components so that the setup can scale.
As the listener, should I use the default carbon or should I choose graphite-ng?
As for the storage component, I am confused about whether the default whisper is enough, or whether I should look at other options (like InfluxDB, Cyanite or some RDBMS such as Postgres/MySQL)?
As the GUI component, I have settled on Grafana for better visualization.
I think Datadog + Grafana would work fine, but Datadog is not open source, so please suggest an open-source alternative that scales up to 100 Cassandra nodes.
I have 35 Cassandra nodes (in different clusters) monitored without any problems with graphite + carbon + whisper + grafana. But I have to say that re-configuring collection and aggregation windows with whisper is a pain.
There are many alternatives for this job today; you can use the InfluxDB (+ Telegraf) stack, for example.
Also, with Datadog you don't need Grafana; it is a visualization platform as well. I worked with it some time ago, but it had misleading names for some metrics in its plugin, and some metrics were just missing. On the plus side, the platform is really easy to install and use.
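For context on the write path of the graphite + carbon stack mentioned above: carbon's plaintext listener accepts one "metric.path value timestamp" line per datapoint, by default on TCP port 2003. A minimal sketch; the host name and metric path below are placeholders:

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class CarbonSender {
    public static void main(String[] args) throws Exception {
        // One "path value timestamp\n" line per datapoint (timestamp in epoch seconds).
        long now = System.currentTimeMillis() / 1000L;
        String line = "cassandra.host1.jvm.heap_used 123456789 " + now + "\n";

        // "graphite-host" is a placeholder; 2003 is carbon's default plaintext port.
        try (Socket socket = new Socket("graphite-host", 2003);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.US_ASCII)) {
            out.write(line);
            out.flush();
        }
    }
}

Collectors such as Telegraf can speak this protocol for you, so hand-written senders are usually only needed for custom application metrics.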
We have a Cassandra cluster of 36 nodes in production right now (we had 51, but we migrated the instance type since then, so we need fewer C* servers now), monitored using a single graphite server. We are also saving data for 30 days, but at a 60s resolution. We excluded the internode metrics (e.g. open connections from a to b) because of how they inflate the metric count, but keep all the others. This totals ~510k metrics, each whisper file being ~500kb in size => ~250GB. iostat tells me that we have write peaks of ~70k writes/s. This all runs on a single AWS i3.2xlarge instance, which includes 1.9TB of NVMe instance storage and 61GB of RAM. To fully utilize this disk type we increased the number of carbon caches. The CPU usage is very low (<20%) and so is the iowait (<1%).
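For anyone who wants to reproduce that sizing, the numbers line up with whisper's roughly 12 bytes per datapoint. A rough back-of-the-envelope sketch (ignoring whisper's small per-file header):

public class WhisperSizing {
    public static void main(String[] args) {
        long metrics = 510_000L;                       // metric count from the setup above
        long retentionSeconds = 30L * 24 * 60 * 60;    // 30 days of retention
        long resolutionSeconds = 60;                   // one datapoint per minute
        long bytesPerPoint = 12;                       // whisper stores ~12 bytes per datapoint

        long pointsPerMetric = retentionSeconds / resolutionSeconds;  // 43,200 points
        long bytesPerMetric = pointsPerMetric * bytesPerPoint;        // ~506 KB per whisper file
        long totalBytes = metrics * bytesPerMetric;                   // ~246 GB in total

        System.out.printf("%,d points/metric, ~%,d KB/file, ~%,d GB total%n",
                pointsPerMetric, bytesPerMetric / 1024, totalBytes / (1024L * 1024 * 1024));
    }
}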
I guess we could get away with a less beefy machine, but this gives us a lot of headroom for growing the cluster, and we are constantly adding new servers. For the monitoring host: be prepared that AWS will terminate these machines more often than others, so backup and restore are more likely to be a regular operation.
I hope this little insight helped you.

Suggest free tools to monitor Cassandra cluster performance

I want to monitor a Cassandra cluster on CentOS machines. Please suggest free tools to monitor performance in terms of disks, RAM, nodetool commands and other parameters.
thanks
Have a look at Prometheus.io and the various exporters.
Here's a good blog post on it: http://www.robustperception.io/monitoring-cassandra-with-prometheus/
As for the nodetool commands etc., it would probably be worth automating those via tools like Ansible and/or Rundeck.
You can also have a look at axonops.com.
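Much of what nodetool prints is also exposed over Cassandra's JMX interface (port 7199 by default), so a small JMX client is another free option. A minimal sketch; the MBean name follows Cassandra's org.apache.cassandra.metrics naming and may differ between versions:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CassandraJmxProbe {
    public static void main(String[] args) throws Exception {
        // Cassandra listens for JMX on port 7199 by default (adjust host/port as needed).
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");

        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // "Load" is the node's on-disk data size; pick whatever metric you care about.
            ObjectName load = new ObjectName("org.apache.cassandra.metrics:type=Storage,name=Load");
            System.out.println("data load (bytes): " + mbs.getAttribute(load, "Count"));
        } finally {
            connector.close();
        }
    }
}

Exporters such as the Prometheus JMX exporter do essentially this, but for every MBean and on a scrape schedule.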
There are many host-side monitoring tools.
I personally use nmon for my host-side monitoring of a Cassandra cluster.
See also: Installing NMon
You can easily create graphs out of the nmon results using NMONVisualizer.

How can I retrieve worker information for a running application in Spark?

I want to get information about the workers that are being used by an application in a Spark cluster. I need their IP addresses, CPU cores, available memory, etc.
Is there any API in Spark for this purpose?
The image above shows the same info in the Spark UI, but I am not able to figure out how to get it from Java code.
This is specific to Java.
I want information for all worker nodes.
Thanks.
There are multiple ways to do this:
Parse the output log messages and see what workers are started on each machine in your cluster. You can get the names/IPs of all the hosts, when tasks are started and where, how much memory each worker gets, etc. If you want to see the exact HW configuration, you will then need to log in to the worker nodes or use different tools.
The same information as in the web frontend is contained in the eventLogs of the Spark applications (this is actually where the data you see comes from). I prefer to use the eventLog rather than the log messages, as it is very easy to parse in Python.
If you want real-time monitoring of the cluster, you can use either ganglia (which gives nice graphical displays of CPU/memory/network/disks) or colmux, which gives you the same data but in text format. I personally prefer colmux (easier to set up, you get immediate stats, etc.).
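If the lookup has to happen inside the application itself (the question asks for Java), Spark's status tracker exposes part of what the UI shows. A minimal sketch, assuming a Spark version (2.0+) where SparkStatusTracker provides getExecutorInfos():

import org.apache.spark.SparkConf;
import org.apache.spark.SparkExecutorInfo;
import org.apache.spark.api.java.JavaSparkContext;

public class ExecutorList {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("executor-list");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // One entry per executor (the driver shows up as well).
        for (SparkExecutorInfo info : jsc.sc().statusTracker().getExecutorInfos()) {
            System.out.printf("host=%s port=%d cachedMemory=%d runningTasks=%d%n",
                    info.host(), info.port(), info.cacheSize(), info.numRunningTasks());
        }
        jsc.stop();
    }
}

Core counts and per-executor memory limits are not in this API; the REST endpoint that backs the web UI (http://<driver>:4040/api/v1/applications/<app-id>/executors) returns them as totalCores and maxMemory.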

Linux HA vs Apache Hadoop

I'm using Cloudera (Apache Hadoop), so I have a pretty good idea about it.
However, I just found out about the Linux-HA project, and
I cannot figure out what the difference is between Linux HA and Apache Hadoop.
When should we use Apache Hadoop and when should we use Linux HA?
Thank you!
Linux HA is software-based high-availability clustering, used to improve the availability of many kinds of services. That means Linux HA is used to keep the desired services up and running with no downtime. It uses the concept of a heartbeat to identify service state in the cluster. For example, if you have a web server running on hostA, it is replicated to run on hostB as well. Whenever hostA goes down, hostB takes over and serves requests, i.e. there is no downtime for the service.
Apache Hadoop, on the other hand, is a framework that solves the problem of storing and processing large amounts of data.

Can HBase, MapReduce and HDFS work on a single machine that has Hadoop installed and running on it?

I am working on a search engine design, which is to be run in the cloud.
We have just started, and do not have much of an idea about Hadoop.
Can anyone tell me if HBase, MapReduce and HDFS can work on a single machine that has Hadoop installed and running on it?
Yes, you can. You can even create a virtual machine and run it there on a single "computer" (which is what I have :) ).
The key is simply to install Hadoop in "Pseudo-Distributed Mode", which is described in the Hadoop Quickstart.
If you use the Cloudera distribution, they have even created the configs needed for that in an RPM. Look here for more info on that.
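Once the pseudo-distributed daemons are up, a tiny HDFS client makes a handy smoke test. A minimal sketch, assuming the quickstart's usual NameNode address of hdfs://localhost:9000 (adjust to your own core-site.xml):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        // hdfs://localhost:9000 is the NameNode address used in the quickstart's
        // pseudo-distributed config; change it to match your fs.default.name / fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());

        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}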
HTH
Yes. In my development environment, I run
NameNode (HDFS)
SecondaryNameNode (HDFS)
DataNode (HDFS)
JobTracker (MapReduce)
TaskTracker (MapReduce)
Master (HBase)
RegionServer (HBase)
QuorumPeer (ZooKeeper - needed for HBase)
In addition, I run my applications, and map and reduce tasks launched by the task tracker.
Running so many processes on the same machine results in a lot of contention for CPU cores, memory, and disk I/O, so it's definitely not great for high performance, but there is no limitation other than the amount of resources available.
Same here; I am running Hadoop/HBase/Hive on a single computer.
If you really, really want to see distributed computing on a single computer, grab lots of RAM and some hard disk space and go like this:
make one or two virtual machines (use VirtualBox)
install Hadoop on each of them; make your real installation (not a virtual one) the master and the rest slaves
configure Hadoop for a real distributed environment
now when Hadoop starts, you should actually have a cluster of multiple computers (one real, the rest virtual)
this should just be an experiment, because unless you have a decent multi-CPU or multi-core system, such a configuration will actually spend more resources maintaining itself than giving you any performance.
Good luck.
--l4l
