How to control the number of child processes in Cassandra server

I am trying to load a CSV file into a Cassandra table using the COPY command. By default, the COPY process starts with 16 child processes ("Using 16 child processes"). I have given the Cassandra container 3 vCPUs.
I suspect that 3 vCPUs are not enough to spread the load across 16 processes: a lot of CPU throttling occurs, which causes intermittent "NoHostAvailable" errors during the COPY process, although the job eventually completes.
I think that if I limit the child processes to half (i.e. 8), I will not get this intermittent "NoHostAvailable" issue. I know that I can limit this with the parameter numprocesses = x when submitting the query, but I want to limit it on the server.
I tried to set this in the jvm.options file, but it didn't work for me.
# For systems with > 8 cores, the default ParallelGCThreads is 5/8 the number of logical cores.
# Otherwise equal to the number of cores when 8 or less.
# Machines with > 10 cores should try setting these to <= full cores.
#-XX:ParallelGCThreads=8
# By default, ConcGCThreads is 1/4 of ParallelGCThreads.
# Setting both to the same value can reduce STW durations.
#-XX:ConcGCThreads=8
Please let me know how to control numprocesses from the server config.
Thank you !!

Run your command with NUMPROCESSES=8.
COPY table_name [ ( column_list ) ]
FROM 'file_name'[ , 'file2_name', ... ] | STDIN
[ WITH option = 'value' [ AND ... ] ]
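According to the Apache documentation:
NUMPROCESSES The number of child worker processes to create for COPY
tasks. Defaults to a max of 4 for COPY FROM and 16 for COPY TO.
However, at most (num_cores - 1) processes will be created.
It seems that you can't set this at the server level, but you can override it at the query level. COPY is a cqlsh command, so its child processes are started by the cqlsh client rather than by the Cassandra server, which is why JVM settings in jvm.options have no effect on them.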
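As a rough illustration of the query-level override, the sketch below builds a COPY FROM statement whose NUMPROCESSES is capped by the number of vCPUs. The function name, the keyspace/table name and the file name are hypothetical, and the "leave one core free" rule simply mirrors the (num_cores - 1) ceiling quoted above; it is an assumption, not something cqlsh requires.

```python
def build_copy_from(table: str, csv_path: str, vcpus: int, requested: int = 8) -> str:
    """Build a cqlsh COPY FROM statement with a capped NUMPROCESSES.

    Hypothetical helper: it only formats the statement, it does not run cqlsh.
    """
    # Leave one core free (mirrors the documented num_cores - 1 ceiling)
    # and never go below a single worker process.
    workers = max(1, min(requested, vcpus - 1))
    return f"COPY {table} FROM '{csv_path}' WITH NUMPROCESSES = {workers}"


# Hypothetical table and file names; 3 vCPUs as in the question.
statement = build_copy_from("my_keyspace.my_table", "data.csv", vcpus=3)
print(statement)
```

With 3 vCPUs the requested value of 8 is reduced to 2, which may be a more realistic starting point than 8 for a container that is already being throttled.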

Related

Dataflow exceeds number_of_worker_harness_threads

I deployed a Dataflow job with the parameter --number_of_worker_harness_threads=5 (streaming mode).
Next I sent 20 PubSub messages, each triggering the loading and processing of a big CSV file from GCS.
In the logs I see that the job took 10 messages and processed them in parallel on 6-8 threads (I checked several times; sometimes it was 6, sometimes 8).
In every case it was more than 5.
Any idea how this works? It does not seem to be the expected behavior.
Judging from the flag name, you are using the Beam Python SDK.
For Python streaming, in the current implementation the total number of threads running DoFns on one worker VM may be up to the value provided in --number_of_worker_harness_threads times the number of SDK processes running on the worker, which by default is the number of vCPU cores. There is a way to limit the number of processes to 1 regardless of the number of vCPUs. To do so, set --experiments=no_use_multiple_sdk_containers.
For example, if you are using --machine_type=n1-standard-2 and --number_of_worker_harness_threads=5, you may have up to 10 DoFn instances in different threads running concurrently on the same machine.
If --number_of_worker_harness_threads is not specified, up to 12 threads per process are used. See also: https://cloud.google.com/dataflow/docs/resources/faq#how_many_instances_of_dofn_should_i_expect_dataflow_to_spin_up_
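The arithmetic described above can be written out as a small sketch. The function and constant names are made up for illustration; the numbers (one SDK process per vCPU by default, 12 threads per process when the flag is unset, 2 vCPUs for n1-standard-2) are the ones stated in this answer.

```python
DEFAULT_THREADS_PER_PROCESS = 12  # used when the flag is not specified (per the answer)


def max_dofn_threads(vcpus, harness_threads=None, single_sdk_process=False):
    """Upper bound on concurrently running DoFn threads on one worker VM."""
    # One SDK process per vCPU by default; a single process when
    # --experiments=no_use_multiple_sdk_containers is set.
    processes = 1 if single_sdk_process else vcpus
    threads = DEFAULT_THREADS_PER_PROCESS if harness_threads is None else harness_threads
    return processes * threads


# n1-standard-2 (2 vCPUs) with --number_of_worker_harness_threads=5
print(max_dofn_threads(vcpus=2, harness_threads=5))
# Same machine, limited to a single SDK process
print(max_dofn_threads(vcpus=2, harness_threads=5, single_sdk_process=True))
```

This is only an upper bound, which would be consistent with observing 6 to 8 active threads rather than exactly 10.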

Slurm: can I create a sub-queue using a subset of resources in a single node?

I have a use case with slurm and I wonder if there is a way to handle it.
Constraints:
I would like to run several jobs (say 60 jobs).
Each one takes a few hours, e.g. 3h/job.
In the cluster managed by slurm, I use a queue with 2 nodes with 4 gpus each (so I can restrict my batch script to one node).
Each job takes 1 gpu.
Problem: if I put everything in the queue, I will block 4 gpus even if I specify only 1 node.
Desired solution: avoid blocking a whole machine by taking, say, 2 gpus only.
How can I put them in the queue without them taking all 4 gpus?
Could I create a kind of sub-queue that would be limited to a subset of the resources of a node, for example?
You can use the Slurm consumable trackable resources plug-in (cons_tres enabled in your slurm.conf file- more info here: https://slurm.schedmd.com/cons_res.html#using_cons_tres) to:
Specify the --gpus-per-task=X
-or-
Bind a specific number of gpus to the task with --gpus=X
-or-
Bind the task to a specific gpu by its ID with --gpu-bind=GPUID
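One possible way to combine these options with the "at most 2 GPUs at a time" goal is a job array with a concurrency limit, which is a different technique from the binding options listed above and is only suggested here as an assumption about what fits the use case. The sketch below generates such a batch script as text; the helper name and the command `python train.py` are hypothetical, while `--array=<range>%<limit>`, `--ntasks`, `--gpus-per-task` and `SLURM_ARRAY_TASK_ID` are standard Slurm features.

```python
def make_array_script(n_jobs: int, max_running: int, command: str) -> str:
    """Return the text of an sbatch script for a throttled job array."""
    lines = [
        "#!/bin/bash",
        # "%<max_running>" caps how many array tasks run at the same time.
        f"#SBATCH --array=0-{n_jobs - 1}%{max_running}",
        "#SBATCH --ntasks=1",
        # One GPU per array task (needs the cons_tres plug-in mentioned above).
        "#SBATCH --gpus-per-task=1",
        # Each array task receives its own index in SLURM_ARRAY_TASK_ID.
        f"{command} $SLURM_ARRAY_TASK_ID",
    ]
    return "\n".join(lines) + "\n"


# 60 jobs, at most 2 running (and so at most 2 GPUs in use) at once.
# "python train.py" is a placeholder for the real job command.
script = make_array_script(60, 2, "python train.py")
print(script)
```

The generated text would be saved to a file and submitted with sbatch; whether the two running tasks land on the same node still depends on the partition and scheduler configuration.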

Execute (or queue) set of tasks simultaneously

I have the following situation:
between 5 and 20 test environments, split into groups of 5 VMs (1 set = 5 VMs, usually)
hundreds of test cases that should be executed simultaneously on 1 VM set
Celery with 5 workers (one worker per VM in the set: alpha, beta, charlie, delta, echo)
Test sets can run in a different order and take different amounts of time to execute.
Each worker should execute only one test case at a time, without overlapping or concurrency.
Each worker runs tasks only from its own queue/consumer.
In the previous version I had a solution based on multiprocessing, and it worked fine. But with Celery I can't add all 100 test cases for all 5 VMs of one set at once: it only starts adding tasks for VM alpha and waits until they have all finished before starting the tasks for the next VM, beta, and so on.
When I tried to use multiprocessing to create separate threads for each worker, I got: AssertionError: daemonic processes are not allowed to have children
The problem is: how do I add 100 tests for 5 workers at the same time?
Each worker (alpha, beta, ...) should then run its own set of 100 test cases simultaneously.
This problem can be solved using task keys based on each consumer, like:
app.control.add_consumer(
    queue='alpha',
    exchange='local',
    exchange_type='direct',
    routing_key='alpha.*',
    # Worker node names have the form name@hostname
    destination=['worker_for_alpha@HOSTNAME'])
So now you can send any task to this consumer for separate worker using key and queue name:
@app.task(queue='alpha', routing_key='alpha.task_for_something')
def any_task(arg_1, arg_2):
    # do something with arg_1 and arg_2
    ...
Now you can scale this to any number of workers, or to several consumers for a single worker. Just put them in a collection and iterate over it for multiple workers/consumers.
The other issue can be solved with the --concurrency option of each worker.
You can set concurrency to 5 to have 5 simultaneous threads on one worker, or split the task flow into separate threads for each worker, each with a unique key and consumer (queue).
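The "make a collection and iterate" step might look like the sketch below. It is plain Python with no Celery dependency: it only builds the order in which tasks would be submitted, interleaved so that every queue receives work immediately. The helper name and the test names are hypothetical, and the commented-out `run_test` task is an assumed placeholder for a real Celery task.

```python
from itertools import product


def build_dispatch_plan(vm_names, test_cases):
    """Return (queue, test_case) pairs, interleaved across the VM queues."""
    # Tests form the outer loop, so alpha, beta, ... each get the first
    # test before any queue gets the second one.
    return [(vm, test) for test, vm in product(test_cases, vm_names)]


vms = ["alpha", "beta", "charlie", "delta", "echo"]
tests = [f"test_{i}" for i in range(100)]  # hypothetical test names
plan = build_dispatch_plan(vms, tests)
print(len(plan), plan[:2])

# With a real Celery app, each pair could be submitted without blocking:
#   for queue, test in plan:
#       run_test.apply_async(args=[test], queue=queue)  # run_test is hypothetical
```

Since apply_async returns immediately, all 500 tasks can be queued up front; running each worker with --concurrency=1 would then keep it to one test case at a time.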

YCSB low read throughput cassandra

The YCSB Endpoint benchmark would have you believe that Cassandra is the golden child of NoSQL databases. However, when recreating the results on our own boxes (8 cores with hyperthreading, 60 GB memory, two 500 GB SSDs), we get dismal read throughput for workload b (read mostly, i.e. 95% read, 5% update).
The cassandra.yaml settings are exactly the same as the Endpoint settings, apart from the different IP addresses and our disk configuration (1 SSD for data, 1 for the commit log). While their throughput is ~38,000 operations per second, ours is ~16,000, more or less regardless of the number of threads or client nodes. That is, one worker node with 256 threads reports ~16,000 ops/sec, while 4 nodes each report ~4,000 ops/sec.
I've set the readahead value to 8KB for the SSD data drive. I'll put the custom workload file below.
When analyzing disk I/O and CPU usage with iostat, the read throughput is consistently ~200,000 KB/s, which seems to suggest that the YCSB cluster throughput should be higher (records are 100 bytes). ~25-30% of CPU time is spent in %iowait, and 10-25% is used by user processes.
top and nload show no obvious bottleneck (<50% memory usage, and 10-50 Mbits/sec on a 10 Gb/s link).
# The name of the workload class to use
workload=com.yahoo.ycsb.workloads.CoreWorkload
# There is no default setting for recordcount but it is
# required to be set.
# The number of records in the table to be inserted in
# the load phase or the number of records already in the
# table before the run phase.
recordcount=2000000000
# There is no default setting for operationcount but it is
# required to be set.
# The number of operations to use during the run phase.
operationcount=9000000
# The offset of the first insertion
insertstart=0
insertcount=500000000
core_workload_insertion_retry_limit = 10
core_workload_insertion_retry_interval = 1
# The number of fields in a record
fieldcount=10
# The size of each field (in bytes)
fieldlength=10
# Should read all fields
readallfields=true
# Should write all fields on update
writeallfields=false
fieldlengthdistribution=constant
readproportion=0.95
updateproportion=0.05
insertproportion=0
readmodifywriteproportion=0
scanproportion=0
maxscanlength=1000
scanlengthdistribution=uniform
insertorder=hashed
requestdistribution=zipfian
hotspotdatafraction=0.2
hotspotopnfraction=0.8
table=usertable
measurementtype=histogram
histogram.buckets=1000
timeseries.granularity=1000
The key was increasing native_transport_max_threads in the cassandra.yaml file.
Along with the increased settings in the comment (increasing connections in ycsb client as well as concurrent read/writes in cassandra), Cassandra jumped to ~80,000 ops/sec.
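For context, the figures from the question can be put side by side in a back-of-the-envelope calculation. This is only a rough sketch: it assumes 1 KB = 1000 bytes, counts every operation as a read, and ignores keys and storage overhead, so the result is an order-of-magnitude estimate at best.

```python
# Figures taken from the question
disk_read_bytes_per_s = 200_000 * 1000  # ~200,000 KB/s reported by iostat
ops_per_s = 16_000                      # observed YCSB throughput
record_bytes = 10 * 10                  # fieldcount * fieldlength

# Bytes read from disk for each YCSB operation, and how that compares
# with the size of the record the operation actually returns.
bytes_read_per_op = disk_read_bytes_per_s / ops_per_s
amplification = bytes_read_per_op / record_bytes
print(bytes_read_per_op, amplification)
```

Under these assumptions each operation pulls far more data from disk than one record contains, so raw disk bandwidth and ops/sec are not directly comparable; that is consistent with the fix being on the request-handling side rather than the disk.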

Requesting integer multiple of "M" cores per node on SGE

I want to submit a multi-threaded MPI job to SGE, and the cluster I am running on has nodes with different numbers of cores. Let's say the number of threads per process is M (M == OMP_NUM_THREADS for OpenMP). How can I request that a job submitted to an SGE queue be run in such a way that, on every node, an integer multiple of M cores is allocated to my job?
Let's say M=8, and the number of MPI tasks is 5 (so a total of 40 cores requested). And in this cluster, there are nodes with 4, 8, 12, and 16 cores. Then this combination is OK:
2*(8-core nodes) + 1*(16-core nodes) + 0.5*(16-core nodes)
but of course not any of these ones:
2*(4-core nodes) + 2*(8-core nodes) + 1*(16-core node)
2*(12-core nodes) + 1*(16-core node)
(3/8)*(8-core nodes) + (5/8)*(8-core nodes) + 2*(16-core node)
PS: There was another similar question, like this one: ( MPI & pthreads: nodes with different numbers of cores ), but mine is different since I have to run exactly M threads per MPI process (think hybrid MPI+OpenMP).
The best scenario is to run this job exclusively on the same kind of nodes. But to speed up the start time, I want to allow this job to run on different kinds of nodes, provided that each node has integer*M cores allocated to the job.
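The constraint can be stated as a small check, sketched here with a hypothetical function name. Each entry of the list is the number of cores allocated to the job on one node, so the "0.5*(16-core node)" in the valid example becomes an allocation of 8 cores.

```python
def is_valid_allocation(cores_per_node, threads_per_process, n_processes):
    """True if every node gets a positive multiple of M cores and the total matches."""
    total_ok = sum(cores_per_node) == threads_per_process * n_processes
    multiples_ok = all(
        cores > 0 and cores % threads_per_process == 0 for cores in cores_per_node
    )
    return total_ok and multiples_ok


M, TASKS = 8, 5  # 40 cores in total, as in the example above

print(is_valid_allocation([8, 8, 16, 8], M, TASKS))      # the accepted combination
print(is_valid_allocation([4, 4, 8, 8, 16], M, TASKS))   # 4-core nodes cannot host 8 threads
print(is_valid_allocation([12, 12, 16], M, TASKS))       # 12 is not a multiple of 8
print(is_valid_allocation([3, 5, 16, 16], M, TASKS))     # one process split across two nodes
```

All four example allocations add up to 40 cores; only the first gives every node a whole number of 8-thread processes.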
The allocation policy in SGE is specified on per parallel environment (PE) basis. Each PE could be configured to fill the slots available on the cluster nodes in a specific way. One requests a specific PE with the -pe pe_name num_slots parameter and then SGE tries to find num_slots slots following the allocation policy of the pe_name PE. Unfortunately, there is no easy way to request slots in integer multiples per node.
In order to be able to request exactly M slots per host (and not a multiple of M), your SGE administrator (or you, in case you are the SGE administrator) must first create a new PE, let's call it mpi8ppn, set its allocation_rule to 8, and then assign the PE to each cluster queue. Then you have to submit the job to that PE with -pe mpi8ppn 40 and instruct the MPI runtime to start only one process per host, e.g. with -npernode 1 for Open MPI.
If the above is unlikely to happen, your other (unreliable) solution would be to request a very high amount of memory per slot, close to what each node has, e.g. -l h_vmem=23.5G. Assuming that the nodes are configured with h_vmem of 24 GiB, this request will ensure that SGE won't be able to fit more than one slot on each host. So, if you would like to start a hybrid job on 5 nodes, you will simply ask SGE for 5 slots and 23.5G vmem for each slot with:
qsub -pe whatever 5 -l h_vmem=23.5G <other args> jobscript
or
#$ -pe whatever 5
#$ -l h_vmem=23.5G
This method is unreliable since it does not allow you to select cluster nodes that have a specific number of cores and only works if all nodes are configured with h_vmem of less than 47 GB. h_vmem serves just as an example here - any other per-slot consumable attribute should do. The following command should give you an idea of what host complexes are defined and what their values are across the cluster nodes:
qhost -F | egrep '(^[^ ])|(hc:)'
The method works best for clusters where node_mem = k * #cores with k being constant across all nodes. If a node provides twice the number of cores but also has twice the memory, e.g. 48 GiB, then the above request will give you two slots on such nodes.
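The reasoning behind the memory trick is simple division, sketched below with a hypothetical helper name. It treats the 23.5G request and the 24/48 GiB node limits from the text as being in the same unit, which is a simplification.

```python
import math


def slots_per_host(node_h_vmem, requested_per_slot):
    """How many slots with the given per-slot memory request fit on one host."""
    return math.floor(node_h_vmem / requested_per_slot)


REQUEST = 23.5  # per-slot h_vmem request from the example above

print(slots_per_host(24, REQUEST))  # node configured with 24 GiB
print(slots_per_host(48, REQUEST))  # node with twice the memory
print(slots_per_host(47, REQUEST))  # the threshold mentioned above
```

A 24 GiB node fits one slot, while nodes at or above 47 fit two, which is why the method only pins one slot per host when every node is configured below that threshold.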
I don't claim to fully understand SGE and my knowledge dates back from the SGE 6.2u5 era, so simpler solutions might exist nowadays.

Resources