I am trying to install and deploy a Ceph cluster. As I don't have enough physical servers, I created 4 VMs on my OpenStack using the official Ubuntu 14.04 image. I want to deploy a cluster with 1 monitor node and 3 OSD nodes with ceph version 0.80.7-0ubuntu0.14.04.1. I followed the steps from the manual deployment document, and successfully installed the monitor node. However, after the installation of the OSD nodes, it seems that the OSD daemons are running but do not correctly report to the monitor node. The osd tree always shows the OSDs as down when I run the command ceph --cluster cephcluster1 osd tree.
Following are the commands and corresponding results that may be related to my problem.
root#monitor:/home/ubuntu# ceph --cluster cephcluster1 osd tree
# id weight type name up/down reweight
-1 3 root default
-2 1 host osd1
0 1 osd.0 down 1
-3 1 host osd2
1 1 osd.1 down 1
-4 1 host osd3
2 1 osd.2 down 1
root#monitor:/home/ubuntu# ceph --cluster cephcluster1 -s
cluster fd78cbf8-8c64-4b12-9cfa-0e75bc6c8d98
health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 3/3 in osds are down
monmap e1: 1 mons at {monitor=172.26.111.4:6789/0}, election epoch 1, quorum 0 monitor
osdmap e21: 3 osds: 0 up, 3 in
pgmap v22: 192 pgs, 3 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
192 creating
The configuration file /etc/ceph/cephcluster1.conf on all nodes:
[global]
fsid = fd78cbf8-8c64-4b12-9cfa-0e75bc6c8d98
mon initial members = monitor
mon host = 172.26.111.4
public network = 10.5.0.0/16
cluster network = 172.26.111.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 1024
filestore xattr use omap = true
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 333
osd pool default pgp num = 333
osd crush chooseleaf type = 1
[osd]
osd journal size = 1024
[osd.0]
osd host = osd1
[osd.1]
osd host = osd2
[osd.2]
osd host = osd3
Logs when I start one of the osd daemons through start ceph-osd cluster=cephcluster1 id=x where x is the OSD ID:
/var/log/ceph/cephcluster1-osd.0.log on the OSD node #1:
2015-02-11 09:59:56.626899 7f5409d74800 0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-osd, pid 11230
2015-02-11 09:59:56.646218 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: FIEMAP ioctl is supported and appears to work
2015-02-11 09:59:56.646372 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-02-11 09:59:56.658227 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-02-11 09:59:56.679515 7f5409d74800 0 filestore(/var/lib/ceph/osd/cephcluster1-0) limited size xattrs
2015-02-11 09:59:56.699721 7f5409d74800 0 filestore(/var/lib/ceph/osd/cephcluster1-0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-02-11 09:59:56.700107 7f5409d74800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2015-02-11 09:59:56.700454 7f5409d74800 1 journal _open /var/lib/ceph/osd/cephcluster1-0/journal fd 20: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-02-11 09:59:56.704025 7f5409d74800 1 journal _open /var/lib/ceph/osd/cephcluster1-0/journal fd 20: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-02-11 09:59:56.704884 7f5409d74800 1 journal close /var/lib/ceph/osd/cephcluster1-0/journal
2015-02-11 09:59:56.725281 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: FIEMAP ioctl is supported and appears to work
2015-02-11 09:59:56.725397 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-02-11 09:59:56.736445 7f5409d74800 0 genericfilestorebackend(/var/lib/ceph/osd/cephcluster1-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-02-11 09:59:56.756912 7f5409d74800 0 filestore(/var/lib/ceph/osd/cephcluster1-0) limited size xattrs
2015-02-11 09:59:56.776471 7f5409d74800 0 filestore(/var/lib/ceph/osd/cephcluster1-0) mount: WRITEAHEAD journal mode explicitly enabled in conf
2015-02-11 09:59:56.776748 7f5409d74800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2015-02-11 09:59:56.776848 7f5409d74800 1 journal _open /var/lib/ceph/osd/cephcluster1-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-02-11 09:59:56.777069 7f5409d74800 1 journal _open /var/lib/ceph/osd/cephcluster1-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-02-11 09:59:56.783019 7f5409d74800 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2015-02-11 09:59:56.783584 7f5409d74800 0 osd.0 11 crush map has features 1107558400, adjusting msgr requires for clients
2015-02-11 09:59:56.783645 7f5409d74800 0 osd.0 11 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
2015-02-11 09:59:56.783687 7f5409d74800 0 osd.0 11 crush map has features 1107558400, adjusting msgr requires for osds
2015-02-11 09:59:56.783750 7f5409d74800 0 osd.0 11 load_pgs
2015-02-11 09:59:56.783831 7f5409d74800 0 osd.0 11 load_pgs opened 0 pgs
2015-02-11 09:59:56.792167 7f53f9b57700 0 osd.0 11 ignoring osdmap until we have initialized
2015-02-11 09:59:56.792334 7f53f9b57700 0 osd.0 11 ignoring osdmap until we have initialized
2015-02-11 09:59:56.792838 7f5409d74800 0 osd.0 11 done with init, starting boot process
/var/log/ceph/ceph-mon.monitor.log on the monitor node:
2015-02-11 09:59:56.593494 7f24cc41d700 0 mon.monitor#0(leader) e1 handle_command mon_command({"prefix": "osd crush create-or-move", "args": ["host=osd1", "root=default"], "id": 0, "weight": 0.05} v 0) v1
2015-02-11 09:59:56.593955 7f24cc41d700 0 mon.monitor#0(leader).osd e21 create-or-move crush item name 'osd.0' initial_weight 0.05 at location {host=osd1,root=default}
Any suggestion is appreciated. Many thanks!
Related
I added a QOS to my Slurm partition but it isn't working. How can I solve this problem? If anyone knows, please let me know. The following steps show my operation.
1.Create QoS
$sacctmgr show qos format="Name,MaxWall,MaxTRESPerUser%30,MaxJob,MaxSubmit,Priority,Preempt"
Name MaxWall MaxTRESPU MaxJobs MaxSubmit Priority Preempt
---------- ----------- ------------------------------ ------- --------- ---------- ----------
normal 0
batchdisa+ 0 0 10
2.Attach QOS to partition
$scontrol show partition
PartitionName=sample01
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=batchdisable
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=computenode0[1-2]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=2 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
3.Run Jobs
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
67109044 sample01 testjob test R 1:42 1 computenode01
67109045 sample01 testjob test R 1:39 1 computenode02
I was able to solve the problem by adding the following setting to slurm.conf.
AccountingStorageEnforce=associations
I needed to get the throughput of Heron Cluster for some reasons, but there is no metric in the Heron UI. So do you have any ideas about how to monitor the throughput of Heron Cluster? Thanks.
The result of running heron-explorer as follows:
yitian#heron01:~$ heron-explorer metrics aurora/yitian/devel SentenceWordCountTopology
[2018-08-03 21:02:09 +0000] [INFO]: Using tracker URL: http://127.0.0.1:8888
'spout' metrics:
container id jvm-uptime-secs jvm-process-cpu-load jvm-memory-used-mb emit-count ack-count fail-count
------------------- ----------------- ---------------------- -------------------- ------------ ----------- ------------
container_3_spout_6 2053 0.253257 146 1.13288e+07 1.13278e+07 0
container_4_spout_7 2091 0.150625 137.5 1.1624e+07 1.16228e+07 231
'count' metrics:
container id jvm-uptime-secs jvm-process-cpu-load jvm-memory-used-mb emit-count execute-count ack-count fail-count
-------------------- ----------------- ---------------------- -------------------- ------------ --------------- ----------- ------------
container_6_count_12 2092 0.184742 155.167 0 4.6026e+07 4.6026e+07 0
container_5_count_9 2091 0.387867 146 0 4.60069e+07 4.60069e+07 0
container_6_count_11 2092 0.184488 157.833 0 4.58158e+07 4.58158e+07 0
container_4_count_8 2091 0.443688 129.833 0 4.58722e+07 4.58722e+07 0
container_5_count_10 2091 0.382577 118.5 0 4.60091e+07 4.60091e+07 0
'split' metrics:
container id jvm-uptime-secs jvm-process-cpu-load jvm-memory-used-mb emit-count execute-count ack-count fail-count
------------------- ----------------- ---------------------- -------------------- ------------ --------------- ----------- ------------
container_1_split_2 2091 0.143034 75.3333 4.59453e+07 4.59453e+06 4.59453e+06 0
container_3_split_5 2042 1.12248 79.1667 4.64862e+07 4.64862e+06 4.64862e+06 0
container_2_split_3 2150 0.139837 83.6667 4.59443e+07 4.59443e+06 4.59443e+06 0
container_1_split_1 2091 0.145702 104.167 4.59454e+07 4.59454e+06 4.59454e+06 0
container_2_split_4 2150 0.138453 106.333 4.59443e+07 4.59443e+06 4.59443e+06 0
[2018-08-03 21:02:09 +0000] [INFO]: Elapsed time: 0.031s.
You can use the execute-count of your sink component to measure the output of your topology. If each of your components has a 1:1 input:output ratio then this will be your throughput.
However, if you are windowing tuples into batches or splitting tuples (like separating sentences into individual words) then things get a little more complicated. You can get the input into your topology by looking at the emit-count of your spout components. You could then use this in comparison to your bolt execute-counts to create your own throughput metric.
An easy way to get programmatic access to these metrics is via the Heron Tracker REST API. You can use your chosen language's HTTP library (like Requests for Python) to query the last 3 hours of data for a running topology. If you require more than 3 hours of data (the maximum stored by the topology TMaster) you will need to use one of the other metrics sinks to send metrics to an external database. Heron currently provides sinks for saving to local files, Graphite or Prometheus. InfluxDB support is in the works.
I am observing higher load on a Cassandra node (compared to other nodes in the ring) and I am looking for help interpreting this data. I have anonymized my IPs, but the snippet below shows a comparison of the "good" node .199 (load 14G) and the "bad" node .159 (load 25G):
nodetool status|grep -E '199|159'
UN XXXXX.159 25.2 GB 256 ? ffda4798-tokentoken XXXXXX
UN XXXXX.199 13.37 GB 256 ? c3a49dca-tokentoken YYYY
Note load is almost 2x on .159. Yet neither memory nor disk usage explain/support this:
.199 (low load box) data -- memory at about 34%, disk 50-60G:
top|grep apache_cassan
28950 root 20 0 24.353g 0.010t 1.440g S 225.3 34.2 25826:35 apache_cassandr
28950 root 20 0 24.357g 0.010t 1.448g S 212.4 34.2 25826:41 apache_cassandr
28950 root 20 0 24.357g 0.010t 1.452g S 219.7 34.3 25826:48 apache_cassandr
28950 root 20 0 24.357g 0.011t 1.460g S 250.5 34.3 25826:55 apache_cassandr
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 559G 47G 513G 9% /cassandra/data_dir_a
/dev/sdf1 559G 63G 497G 12% /cassandra/data_dir_b
/dev/sdg1 559G 54G 506G 10% /cassandra/data_dir_c
/dev/sdh1 559G 57G 503G 11% /cassandra/data_dir_d
.159 (high load box) data -- memory at about 28%, disk 20-40G:
top|grep apache_cassan
25354 root 20 0 36.297g 0.017t 8.608g S 414.7 27.8 170:42.81 apache_cassandr
25354 root 20 0 36.302g 0.017t 8.608g S 272.2 27.8 170:51.00 apache_cassandr
25354 root 20 0 36.302g 0.017t 8.612g S 129.7 27.8 170:54.90 apache_cassandr
25354 root 20 0 36.354g 0.017t 8.625g S 94.1 27.8 170:57.73 apache_cassandr
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 838G 17G 822G 2% /cassandra/data_dir_a
/dev/sdf1 838G 11G 828G 2% /cassandra/data_dir_b
/dev/sdg1 838G 35G 804G 5% /cassandra/data_dir_c
/dev/sdh1 838G 26G 813G 4% /cassandra/data_dir_d
TL;DR version -- what does the nodetool status 'load' column actually measure/report?
The nodetool status command provides the following information:
Status - U (up) or D (down)
Indicates whether the node is functioning or not.
Load - updates every 90 seconds
The amount of file system data under the cassandra data directory after excluding all content in the snapshots subdirectories. Because all SSTable data files are included, any data that is not cleaned up (such as TTL-expired cells or tombstoned data) is counted.
For more information go to nodetool status output description
We were working on a Spark HA setup in a multi-node cluster. What we have seen is that after the standby Spark machine becomes "Alive", the ZooKeeper data store for the master is not updated; it still has the old machine's IP address. We thought of building an application which would use this info and return the current master in a typical HA setup, but because of this bug in ZooKeeper, we are not able to build that application. For example, ZooKeeper has the data:
[zk: localhost:2181(CONNECTED) 2] get /sparkha/master_status
10.49.1.112
cZxid = 0x10000007f
ctime = Wed Sep 14 23:52:52 PDT 2016
mZxid = 0x10000007f
mtime = Wed Sep 14 23:52:52 PDT 2016
pZxid = 0x10000007f
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 11
numChildren = 0
After failover the IP should be changed to the new machine's IP. We wanted to know: is it a bug, or are we missing something?
We know about spark-daemon.sh status, but we were more focused on the ZK store.
Thanks
Hi, I just installed Sphinx on my CentOS VPS, but for some reason whenever I search, it gives me no results. I'm using SSH for searching. Here is the command:
search --index sphinx_index_cc_post -a Introducing The New Solar Train Tunnel
This is the output of command
Sphinx 2.0.5-release (r3308)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/usr/local/etc/sphinx.conf'...
index 'sphinx_index_cc_post': query 'Introducing The New Solar Train Tunnel ': returned 0 matches of 0 total in 0.000 sec
words:
1. 'introducing': 0 documents, 0 hits
2. 'the': 0 documents, 0 hits
3. 'new': 0 documents, 0 hits
4. 'solar': 0 documents, 0 hits
5. 'train': 0 documents, 0 hits
6. 'tunnel': 0 documents, 0 hits
This is my index in config file
source sphinx_index_cc_post
{
type = mysql
sql_host = localhost
sql_user = user
sql_pass = password
sql_db = database
sql_port = 3306
sql_query_range = SELECT MIN(postid),MAX(postid) FROM cc_post
sql_range_step = 1000
sql_query = SELECT postedby, category, totalvotes, trendvalue, featured, isactive, postingdate \
FROM cc_post \
WHERE postid BETWEEN $start AND $end
}
index sphinx_index_cc_post
{
source = sphinx_index_cc_post
path = /usr/local/sphinx/data/sphinx_index_cc_post
charset_type = utf-8
min_word_len = 2
}
The index seems to work fine, when I rotate the index, I successfully get the documents. Here is the result of my indexer
[root#server1 data]# indexer --rotate sphinx_index_cc_post
Sphinx 2.0.5-release (r3308)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/usr/local/etc/sphinx.conf'...
indexing index 'sphinx_index_cc_post'...
WARNING: Attribute count is 0: switching to none docinfo
WARNING: source sphinx_index_cc_post: skipped 1 document(s) with zero/NULL ids
collected 2551 docs, 0.1 MB
sorted 0.0 Mhits, 100.0% done
total 2551 docs, 61900 bytes
total 0.041 sec, 1474933 bytes/sec, 60784.40 docs/sec
total 2 reads, 0.000 sec, 1.3 kb/call avg, 0.0 msec/call avg
total 6 writes, 0.000 sec, 1.0 kb/call avg, 0.0 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=17888).
I also tried removing attributes but no luck! I'm guessing it's either some config problem or a query issue.
Your query is:
SELECT postedby, category, totalvotes, trendvalue, featured, isactive, postingdate \
FROM cc_post
From the column names, I guess you don't have full text in any of those columns. Are you missing the column that contains the text?