How to batch Cassandra dsbulk loading version 1.7 - cassandra

I'm trying to load a large CSV file (30 GB) into my cluster. I suspect I am overloading the Cassandra driver, which causes it to crash at some point during loading. While the data loads I keep getting a repeated message, until at a certain point loading stops with an error that kills the process.
My current loading command is:
dsbulk load -url data.csv -k hotels -t reviews -delim '|' -header true -h '' -port 9042 -maxConcurrentQueries 128
Using -maxConcurrentQueries 128 did not change anything in terms of errors.
Any idea how I can modify my command to make it work?
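One thing worth trying is throttling the load rate rather than only capping concurrency. A minimal sketch, assuming DSBulk 1.7 honours the executor.maxPerSecond setting (the value of 5000 is an arbitrary starting point to tune against your cluster):
dsbulk load -url data.csv -k hotels -t reviews -delim '|' -header true -h '' -port 9042 \
  -maxConcurrentQueries 32 \
  --dsbulk.executor.maxPerSecond 5000
If the errors persist, lower both values further and look at the exact driver error in DSBulk's operation log to see whether the cluster or the client is the bottleneck.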

Related

hdfs + what would be the main cause of UNDER REPLICA blocks

I'm just curious to know what the main cause of under-replicated blocks would be.
We have an Ambari cluster with HDP version 2.6.5.
The number of DataNode machines is 5.
Since every block should always have at least three copies, I thought this would be hard to happen (but it happens).
If HDFS can't create one copy or detects corruption, wouldn't it try to recover by copying a good replica onto another DataNode?
Or, once a file has been properly created in HDFS, does it never check whether the file is corrupted until HDFS is restarted?
To fix the under-replicated blocks we can use the following steps:
su hdfs
hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files
for hdfsfile in `cat /tmp/under_replicated_files`; do echo "Fixing $hdfsfile :" ; hadoop fs -setrep 3 $hdfsfile; done
How does hadoop fs -setrep 3 work?
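Roughly speaking, hadoop fs -setrep 3 <path> records a new target replication factor of 3 for the file, and the NameNode then schedules extra block copies (or deletions) in the background until the live replica count matches. Adding the -w flag makes the command wait until replication has actually finished, for example:
# wait until every block of the file reaches 3 replicas before returning
hadoop fs -setrep -w 3 /path/to/hdfsfile
The path here is just a placeholder for one of the files collected in /tmp/under_replicated_files.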

sstablesplit not working on cassandra 2.1.12

I have an SSTable of about 40 GB which I was trying to split using the following command:
bin/sstablesplit --no-snapshot -s 10 keyspace-columnfamily-ka-2466-Data.db
But it deletes the current 40 GB file and doesn't even split it, without giving any error. What could be the possible reason, or am I doing something wrong here?
Try running it with -v (verbose) and share the output:
bin/sstablesplit -v --no-snapshot -s 10 keyspace-columnfamily-ka-2466-Data.db
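Note also that sstablesplit is an offline tool: it should only be run while Cassandra is stopped, and running it against a live node can produce exactly this kind of silent misbehaviour. A hedged sequence, assuming a packaged install managed with service:
nodetool drain                 # flush memtables and stop accepting writes
sudo service cassandra stop    # sstablesplit must not run against a live node
bin/sstablesplit -v --no-snapshot -s 10 keyspace-columnfamily-ka-2466-Data.db
sudo service cassandra start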

To find Cassandra disk space usage

I am using JConsole for monitoring Cassandra. I can get values such as how much load each keyspace has.
I want to find out the disk space usage for each node in the cluster remotely.
Is there any way to do so?
A shell script can do the trick:
# collect the size of the Cassandra data directory from every node
for i in node1_ip node2_ip ... nodeN_ip
do
  ssh user@$i "du -sh /var/lib/cassandra/data" >> /tmp/disk_usage.txt
done
Replace /var/lib/cassandra/data if your data directory is located somewhere else.
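Alternatively, if you can SSH to any single node, nodetool status already reports each node's live data size in its Load column (note this is the size Cassandra itself tracks, which can differ slightly from raw du output):
ssh user@node1_ip "nodetool status"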

Too many open files - KairosDB

on running this query:
{ "start_absolute":1359695700000, "end_absolute":1422853200000,
"metrics":[{"tags":{"Building_id":["100"]},"name":"meterreadings","group_by":[{"name":"time","group_count":"12","range_size":{"value":"1","unit":"MONTHS"}}],"aggregators":[{"name":"sum","align_sampling":true,"sampling":{"value":"1","unit":"Months"}}]}]}
I am getting the following response:
500 {"errors":["Too many open files"]}
In this link it is written that the size of file-max should be increased.
My file-max output is:
cat /proc/sys/fs/file-max
382994
It is already very large; do I still need to increase the limit?
What version are you using? Are you using a lot of group-by in your queries?
You may need to restart KairosDB as a workaround.
Can you check whether you have deleted (ghost) file handles (replace <PID> with the KairosDB process ID in the command line below)?
ls -l /proc/<PID>/fd | grep kairos_cache | grep '(deleted)' | wc -l
There was a fix in 0.9.5 for unclosed file handles.
There's a fix pending for next release (1.0.1).
cf. https://github.com/kairosdb/kairosdb/pull/180, https://github.com/kairosdb/kairosdb/issues/132, and https://github.com/kairosdb/kairosdb/issues/175.
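Also note that file-max is the system-wide limit; the per-process limit is usually the one actually being hit. A quick check on a Linux host (replace <PID> with the KairosDB process ID):
# per-process descriptor limit for the KairosDB JVM
grep 'Max open files' /proc/<PID>/limits
# number of descriptors the process currently holds
ls /proc/<PID>/fd | wc -l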

RRD print the timestamp of the last valid data

I have an RRD database storing ping responses from a wide range of network equipment.
How can I print on the graph the timestamp of the last valid entry in the RRD database, so that when a host is down I can see when it went down?
I use the following to create the RRD file:
rrdtool create terminal_1.rrd -s 60 \
DS:ping:GAUGE:120:0:65535 \
RRA:AVERAGE:0.5:1:2880
Use the lastupdate option of rrdtool.
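For example, this prints the timestamp and values of the most recent update stored in the file:
rrdtool lastupdate terminal_1.rrd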
Another solution exists if you only have one file per host: don't update your RRD when the host is down. You can then see the last update time with a plain ls or stat, as in:
ls -l terminal_1.rrd
stat --format %Y terminal_1.rrd
If you plan to use the RRD caching daemon, you have to use the last command so that pending updates get flushed:
rrdtool last terminal_1.rrd
