How can I set Slurm Partition QoS?

I created a QoS and attached it to my Slurm partition, but it has no effect. How can I solve this problem? If anyone knows, please let me know. These are the steps I followed.
1. Create QoS
$ sacctmgr show qos format="Name,MaxWall,MaxTRESPerUser%30,MaxJob,MaxSubmit,Priority,Preempt"
Name MaxWall MaxTRESPU MaxJobs MaxSubmit Priority Preempt
---------- ----------- ------------------------------ ------- --------- ---------- ----------
normal 0
batchdisa+ 0 0 10
2. Attach QoS to partition
$ scontrol show partition
PartitionName=sample01
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=batchdisable
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=computenode0[1-2]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=2 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
3. Run jobs
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
67109044 sample01 testjob test R 1:42 1 computenode01
67109045 sample01 testjob test R 1:39 1 computenode02

I was able to solve the problem by adding the following setting to slurm.conf.
AccountingStorageEnforce=associations
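
To confirm which enforcement flags a given configuration actually sets, the file can be checked with a few lines of Python. This is only a minimal sketch: the function name `enforce_flags` and the sample text are illustrative assumptions, not part of Slurm, and the sketch simply keeps the last value it sees.

```python
def enforce_flags(conf_text):
    """Return the AccountingStorageEnforce flags found in slurm.conf-style text."""
    flags = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() == "accountingstorageenforce":
            # Assumption of this sketch: keep the last value seen
            flags = [f.strip().lower() for f in value.split(",") if f.strip()]
    return flags


# Hypothetical excerpt of a slurm.conf file
sample = """
ClusterName=cluster
AccountingStorageEnforce=associations
"""

print(enforce_flags(sample))
```

An empty list, or a list containing only `none`, would suggest that no accounting-based limits are being enforced.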

Default partitioning in spark

I have a question about default partitioning in an RDD.
case class Animal(id:Int, name:String)
val myRDD = session.sparkContext.parallelize( (Array( Animal(1, "Lion"), Animal(2,"Elephant"), Animal(3,"Jaguar"), Animal(4,"Tiger"), Animal(5, "Chetah") ) ))
Console println myRDD.getNumPartitions
I am running the above code on my laptop, which has 12 logical cores.
Hence I see that 12 partitions are created.
My understanding is that hash partitioning is used to determine which object goes to which partition. So in this case, the formula would be: hashCode() % 12
But when I examine further, I see that all the objects are put in the last partition.
myRDD.foreachPartition( e => { println("----------"); e.foreach(println) } )
The above code prints the following (the first eleven partitions are empty and the last one has all the objects; the dashed line separates the contents of each partition):
----------
----------
----------
----------
----------
----------
----------
----------
----------
----------
----------
----------
Animal(2,Elephant)
Animal(4,Tiger)
Animal(3,Jaguar)
Animal(5,Chetah)
Animal(1,Lion)
I don't know why this happens. Can you please help?
Thanks!
I don't think that means all your data are in the last partition. Rather, since foreachPartition is executed in parallel, it might be that the dashed lines have already been printed from all the executors, before the values are printed. The order of the printed lines does not indicate the order of execution.
If you try the code below, you can see that the data is evenly partitioned between the executors (at least on my machine):
myRDD.mapPartitionsWithIndex((index, itr) => itr.toList.map(x => x + "#" + index).iterator).collect
// res6: Array[String] = Array(Animal(1,Lion)#1, Animal(2,Elephant)#2, Animal(3,Jaguar)#3, Animal(4,Tiger)#4, Animal(5,Chetah)#5)
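
As far as I can tell, `parallelize` on a local collection does not hash the elements at all; it slices the sequence by position, with slice i covering indices from (i * length) / numSlices up to ((i + 1) * length) / numSlices. The Python sketch below mimics that rule outside Spark. The function names are made up for illustration, and the code is a model of the behaviour rather than Spark's own implementation.

```python
def slice_positions(length, num_slices):
    """Yield (start, end) index ranges, one per slice, using integer division."""
    for i in range(num_slices):
        start = (i * length) // num_slices
        end = ((i + 1) * length) // num_slices
        yield start, end


def slice_collection(items, num_slices):
    """Split a list into num_slices contiguous chunks (some may be empty)."""
    return [items[start:end] for start, end in slice_positions(len(items), num_slices)]


animals = ["Lion", "Elephant", "Jaguar", "Tiger", "Chetah"]

# 12 slices, as on the asker's 12-core laptop
partitions = slice_collection(animals, 12)
for index, part in enumerate(partitions):
    print(index, part)

# 6 slices, for comparison
six = slice_collection(animals, 6)
```

Under this rule, five elements over 12 slices land one each in slices 2, 4, 7, 9 and 11. With 6 slices they land in slices 1 through 5, which would be consistent with the indices in the output above if that machine used 6 partitions (an inference, not something the answer states).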

How to monitor the throughput of Heron Cluster

I need to get the throughput of a Heron cluster, but there is no such metric in the Heron UI. Do you have any ideas about how to monitor the throughput of a Heron cluster? Thanks.
The result of running heron-explorer is as follows:
yitian@heron01:~$ heron-explorer metrics aurora/yitian/devel SentenceWordCountTopology
[2018-08-03 21:02:09 +0000] [INFO]: Using tracker URL: http://127.0.0.1:8888
'spout' metrics:
container id jvm-uptime-secs jvm-process-cpu-load jvm-memory-used-mb emit-count ack-count fail-count
------------------- ----------------- ---------------------- -------------------- ------------ ----------- ------------
container_3_spout_6 2053 0.253257 146 1.13288e+07 1.13278e+07 0
container_4_spout_7 2091 0.150625 137.5 1.1624e+07 1.16228e+07 231
'count' metrics:
container id jvm-uptime-secs jvm-process-cpu-load jvm-memory-used-mb emit-count execute-count ack-count fail-count
-------------------- ----------------- ---------------------- -------------------- ------------ --------------- ----------- ------------
container_6_count_12 2092 0.184742 155.167 0 4.6026e+07 4.6026e+07 0
container_5_count_9 2091 0.387867 146 0 4.60069e+07 4.60069e+07 0
container_6_count_11 2092 0.184488 157.833 0 4.58158e+07 4.58158e+07 0
container_4_count_8 2091 0.443688 129.833 0 4.58722e+07 4.58722e+07 0
container_5_count_10 2091 0.382577 118.5 0 4.60091e+07 4.60091e+07 0
'split' metrics:
container id jvm-uptime-secs jvm-process-cpu-load jvm-memory-used-mb emit-count execute-count ack-count fail-count
------------------- ----------------- ---------------------- -------------------- ------------ --------------- ----------- ------------
container_1_split_2 2091 0.143034 75.3333 4.59453e+07 4.59453e+06 4.59453e+06 0
container_3_split_5 2042 1.12248 79.1667 4.64862e+07 4.64862e+06 4.64862e+06 0
container_2_split_3 2150 0.139837 83.6667 4.59443e+07 4.59443e+06 4.59443e+06 0
container_1_split_1 2091 0.145702 104.167 4.59454e+07 4.59454e+06 4.59454e+06 0
container_2_split_4 2150 0.138453 106.333 4.59443e+07 4.59443e+06 4.59443e+06 0
[2018-08-03 21:02:09 +0000] [INFO]: Elapsed time: 0.031s.
You can use the execute-count of your sink component to measure the output of your topology. If each of your components has a 1:1 input:output ratio, then this will be your throughput.
However, if you are windowing tuples into batches or splitting tuples (like separating sentences into individual words), then things get a little more complicated. You can get the input into your topology by looking at the emit-count of your spout components. You could then compare this to your bolt execute-counts to create your own throughput metric.
An easy way to get programmatic access to these metrics is via the Heron Tracker REST API. You can use your chosen language's HTTP library (like Requests for Python) to query the last 3 hours of data for a running topology. If you require more than 3 hours of data (the maximum stored by the topology TMaster) you will need to use one of the other metrics sinks to send metrics to an external database. Heron currently provides sinks for saving to local files, Graphite or Prometheus. InfluxDB support is in the works.
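
The comparison of spout emit-counts and bolt counts described above could be sketched as follows. The numbers are copied from the heron-explorer output in the question. The sketch assumes the counts are cumulative since each JVM started, which the output does not confirm, so treat the resulting rates as rough estimates. The variable and function names are illustrative only.

```python
# (emit_count, jvm_uptime_secs) per spout instance, from the heron-explorer output
spouts = {
    "container_3_spout_6": (1.13288e7, 2053),
    "container_4_spout_7": (1.1624e7, 2091),
}


def rate_per_sec(count, uptime_secs):
    """Average tuples per second, assuming count accumulated over the whole uptime."""
    return count / uptime_secs


# Approximate input rate of the topology: sum of the per-spout rates
input_rate = sum(rate_per_sec(count, uptime) for count, uptime in spouts.values())
print(f"approx. topology input: {input_rate:.0f} tuples/s")

# Fan-out of one 'split' instance: tuples emitted per tuple executed
split_emit, split_execute = 4.59453e7, 4.59453e6
fan_out = split_emit / split_execute
print(f"split fan-out: {fan_out:.1f} tuples out per tuple in")
```

The fan-out of about 10 for the split bolt shows why a 1:1 assumption would not hold for this topology: each sentence becomes roughly ten word tuples.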

cassandra-stress: Unable to find table <keyspace name>.<tablename> at maybeLoadSchemaInfo (StressProfile.java)

Configurations:-
3 node cluster
Node 1 --> 172.27.21.16(Seed Node)
Node 2 --> 172.27.21.18
Node 3 --> 172.27.21.19
cassandra.yaml parameters for all the nodes:-
1) seeds: "172.27.21.16"
2) write_request_timeout_in_ms: 5000
3) listen_address: 172.27.21.1(6,8,9)
4) rpc_address: 172.27.21.1(6,8,9)
This is my code.yaml file:-
keyspace: prutorStress3node
keyspace_definition: |
  CREATE KEYSPACE prutorStress3node WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
table: code
table_definition: |
  CREATE TABLE code (
    id timeuuid,
    assignment_id bigint,
    user_id text,
    contents text,
    save_type smallint,
    save_time timeuuid,
    PRIMARY KEY(assignment_id, id)
  ) WITH CLUSTERING ORDER BY (id DESC)
columnspec:
  - name: assignment_id
    population: uniform(1..100000) # Assignment IDs from range 1-100000
  - name: id
    cluster: fixed(100) # Each assignment id will have at most 100 ids (maximum 100 codes allowed per student)
  - name: user_id
    size: fixed(50) # user_id comes from account table varchar(50)
  - name: contents
    size: uniform(1..2000)
  - name: save_type
    size: fixed(1) # Generated values between 1 to 4
    population: uniform(1..4)
  - name: save_time
insert:
  partitions: fixed(1) # The number of partitions to update per batch
  select: fixed(1)/100 # The ratio of rows each partition should insert as a proportion of the total possible rows for the partition (as defined by the clustering distribution columns)
  batchtype: UNLOGGED # The type of CQL batch to use. Either LOGGED/UNLOGGED
queries:
  query1:
    cql: select id,contents,save_type,save_time from code where assignment_id = ?
    fields: samerow
  query2:
    cql: select id,contents,save_type,save_time from code where assignment_id = ? LIMIT 10
    fields: samerow
  query3:
    cql: SELECT contents FROM code WHERE id = ? # Create index for this query
    fields: samerow
The following is the script, which I run from 172.27.21.17 (a node on the same network but in a different cluster):-
#!/bin/bash
echo '***********************************'
echo Deleting old data from prutorStress3node
echo '***********************************'
cqlsh -e "DROP KEYSPACE prutorStress3node" 172.27.21.16
sleep 120
echo '***********************************'
echo Creating fresh data
echo '***********************************'
/bin/cassandra-stress user profile=./code.yaml n=1000000 no-warmup cl=quorum ops\(insert=1\) \
-rate threads=25 -node 172.27.21.16
sleep 120
Issue:-
The issue is that when I run the script right after deleting the previous data and bringing the nodes up, it works fine, but on each subsequent run I hit the following error:-
Unable to find prutorStress3node.code
at org.apache.cassandra.stress.StressProfile.maybeLoadSchemaInfo(StressProfile.java:306)
at org.apache.cassandra.stress.StressProfile.maybeCreateSchema(StressProfile.java:273)
at org.apache.cassandra.stress.StressProfile.newGenerator(StressProfile.java:676)
at org.apache.cassandra.stress.StressProfile.printSettings(StressProfile.java:129)
at org.apache.cassandra.stress.settings.StressSettings.printSettings(StressSettings.java:383)
at org.apache.cassandra.stress.Stress.run(Stress.java:95)
at org.apache.cassandra.stress.Stress.main(Stress.java:62)
In StressProfile.java, I saw that the table metadata is being populated as null. I tried to make sense of the stack trace but could not make anything of it. Could you give me some direction as to what might have gone wrong?
Giving the keyspace and table names in lowercase solved the issue for me. Previously I was using mixed case and got the same error.
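
A likely explanation, though not confirmed in this thread, is that CQL folds unquoted identifiers to lowercase and preserves case only inside double quotes. A keyspace created as prutorStress3node would then be stored as prutorstress3node, and a lookup using the mixed-case name would miss it. The Python sketch below is a simplified model of that rule; `cql_identifier` is a made-up helper, not a driver or Cassandra function.

```python
def cql_identifier(name):
    """Approximate CQL identifier resolution.

    Double-quoted names keep their case (with "" unescaped to ");
    unquoted names are folded to lowercase.
    """
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name[1:-1].replace('""', '"')
    return name.lower()


print(cql_identifier('prutorStress3node'))    # prutorstress3node
print(cql_identifier('"prutorStress3node"'))  # prutorStress3node
```

Using all-lowercase names in the profile, as the answer suggests, sidesteps the mismatch entirely.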

How to debug why hints don't get processed after all nodes are up again

Today I did some extended maintenance on node d1r1n3, one of 14 nodes in a DSC 2.1.15 cluster, and finished well within the cluster's max hint window.
After bringing the node back up, the hints on most other nodes disappeared within minutes, except on two nodes (d1r1n4 and d1r1n7), where only part of the hints went away.
After a few hours with 1 active hinted handoff task still showing, I restarted node d1r1n7, after which d1r1n4 quickly emptied its hint table.
How can I see which node the hints stored on d1r1n7 are destined for?
And, if possible, how can I get the hints processed?
Update:
Later, at a time corresponding to the end of the max hint window after d1r1n3 was taken offline for maintenance, we found that d1r1n7's hints had vanished. This left us unsure whether that was okay: had the hints been processed properly, or had they simply expired at the end of the max hint window?
If the latter, would we need to run a repair on node d1r1n3 after its maintenance (this takes quite some time and I/O... :/)? What if we now switched reads to [LOCAL_]QUORUM instead of the current ONE, with one DC and RF=3? Could this trigger read-path repairs on an as-needed basis and maybe spare us a full repair in this case?
Answer: it turned out that hinted_handoff_throttle_in_kb was at the default of 1024 on these two nodes, while the rest of the cluster was set to 65536 :)
In Cassandra 2.1.15, hints are stored in the system.hints table:
cqlsh> describe table system.hints;
CREATE TABLE system.hints (
    target_id uuid,
    hint_id timeuuid,
    message_version int,
    mutation blob,
    PRIMARY KEY (target_id, hint_id, message_version)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (hint_id ASC, message_version ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = 'hints awaiting delivery'
    AND compaction = {'enabled': 'false', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 3600000
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
The target_id corresponds to the host ID of the destination node.
For example, in my sample 2-node cluster with RF=2:
nodetool status
Datacenter: datacenter1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 71.47 KB 256 100.0% d00c4b10-2997-4411-9fc9-f6d9f6077916 rack1
DN 127.0.0.2 75.4 KB 256 100.0% 1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa rack1
I executed the following while node2 was down:
cqlsh> insert into ks.cf (key,val) values (1,1);
cqlsh> select * from system.hints;
target_id | hint_id | message_version | mutation
--------------------------------------+--------------------------------------+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa | e80a6230-ec8c-11e6-a1fd-d743d945c76e | 8 | 0x0004000000010000000101cfb4fba0ec8c11e6a1fdd743d945c76e7fffffff80000000000000000000000000000002000300000000000547df7ba68692000000000006000376616c0000000547df7ba686920000000400000001
(1 rows)
As can be seen, system.hints.target_id matches the Host ID shown in nodetool status (1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa).
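
To answer "which node are these hints for" without matching UUIDs by eye, the Host ID column of nodetool status can be turned into a lookup table. The Python sketch below parses the sample output shown above. The helper name `host_ids` is hypothetical, and the parsing assumes the row layout seen in this 2.1-era output.

```python
import re

UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
STATE_RE = re.compile(r"[UD][NLJM]")  # e.g. UN = Up/Normal, DN = Down/Normal


def host_ids(nodetool_status):
    """Map host ID -> (address, state) from `nodetool status` text."""
    mapping = {}
    for line in nodetool_status.splitlines():
        fields = line.split()
        match = UUID_RE.search(line)
        # Node rows start with a two-letter state followed by the address
        if match and len(fields) >= 2 and STATE_RE.fullmatch(fields[0]):
            mapping[match.group(0)] = (fields[1], fields[0])
    return mapping


# Node rows copied from the nodetool status output above
status = """\
UN 127.0.0.1 71.47 KB 256 100.0% d00c4b10-2997-4411-9fc9-f6d9f6077916 rack1
DN 127.0.0.2 75.4 KB 256 100.0% 1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa rack1
"""

# target_id taken from the system.hints row above
target_id = "1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa"
print(host_ids(status)[target_id])
```

Here the lookup returns the address and state of the down node, 127.0.0.2, which is the node the stored hint is waiting for.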

pending slurm jobs not showing up in sacct

I am experiencing an issue with slurm where sacct does not show pending jobs. Below, you can see that job 110061 doesn't show up in sacct but is clearly pending in squeue. Any ideas as to why this would happen?
[plcmp14evs:/sim/dev/ash/projects/full-trees] 153% sacct -j 110061
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[plcmp14evs:/sim/dev/ash/projects/full-trees] 154% squeue -j 110061
JOBID PARTITION STATE USER TIME TIMELIMIT NODES NODELIST(REASON) NAME
110061 eventual PENDING andrew 0:00 UNLIMITED 1 (Priority) [rf] script.8.R
-- Edit --
This is the output of scontrol show config | grep Acc
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = localhost
AccountingStorageLoc = /disks/linux/tmp/slurm_accounting.txt
AccountingStoragePort = 0
AccountingStorageType = accounting_storage/filetxt
AccountingStorageUser = root
AccountingStoreJobComment = YES
JobAcctGatherFrequency = 3 sec
JobAcctGatherType = jobacct_gather/linux
At this point there is nothing to record, since the job has not even started yet.
