I am experiencing an issue with slurm where sacct does not show pending jobs. Below, you can see that job 110061 doesn't show up in sacct but is clearly pending in squeue. Any ideas as to why this would happen?
[plcmp14evs:/sim/dev/ash/projects/full-trees] 153% sacct -j 110061
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[plcmp14evs:/sim/dev/ash/projects/full-trees] 154% squeue -j 110061
JOBID PARTITION STATE USER TIME TIMELIMIT NODES NODELIST(REASON) NAME
110061 eventual PENDING andrew 0:00 UNLIMITED 1 (Priority) [rf] script.8.R
-- Edit --
This is the output of scontrol show config | grep Acc
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = localhost
AccountingStorageLoc = /disks/linux/tmp/slurm_accounting.txt
AccountingStoragePort = 0
AccountingStorageType = accounting_storage/filetxt
AccountingStorageUser = root
AccountingStoreJobComment = YES
JobAcctGatherFrequency = 3 sec
JobAcctGatherType = jobacct_gather/linux
At this point there is nothing for the accounting_storage/filetxt plugin to record, since the job has not even started yet; with the file-based accounting backend a job only gets an entry once it starts running, so sacct will not list jobs that are still pending.
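For a job that is still pending, the live scheduler is the place to look, for example (using the job ID from the question):
scontrol show job 110061
Once the job has started it should appear in sacct as well; switching to the accounting_storage/slurmdbd backend, which typically records jobs from submission onward, is another option if you need pending jobs in sacct.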
I am trying to debug an issue with an Azure Alert not firing. This alert should run every 30 minutes and find any devices that have not emitted a heartbeat in the last 30 minutes, but have emitted one within the last hour. In addition, an alert should only be fired once for each device until it becomes healthy again.
The Kusto query is:
let missedHeartbeatsFrom30MinsAgo = traces
| where message == "Heartbeat"
| summarize arg_max(timestamp, *) by tostring(customDimensions.id)
| project Id = customDimensions_id, LastHeartbeat = timestamp
| where LastHeartbeat < ago(30m);
let missedHeartbeatsFrom1HourAgo = traces
| where message == "Heartbeat"
| summarize arg_max(timestamp, *) by tostring(customDimensions.id)
| project Id = customDimensions_id, LastHeartbeat = timestamp
| where LastHeartbeat <= ago(1h);
let unhealthyIds = missedHeartbeatsFrom30MinsAgo
| join kind=leftanti missedHeartbeatsFrom1HourAgo on Id;
let deviceDetails = customEvents
| where name == "Heartbeat"
| distinct tostring(customDimensions.deviceId), tostring(customDimensions.fullName)
| project Id = customDimensions_deviceId, FullName = customDimensions_fullName;
unhealthyIds
| join kind=leftouter deviceDetails on Id
| project Id, FullName, LastHeartbeat
| order by FullName asc
The alert rule is configured with the rules described above (evaluate every 30 minutes, fire once per device until it becomes healthy again).
When I pull the plug on a device, wait ~30 minutes, and run the query manually in App Insights, I see the device in the result set. However, no alert gets generated (nothing shows up in the Alerts history page and no one in the Action Group gets notified). Any help in this matter would be greatly appreciated!
I can see that your KQL query takes a long time to execute and consumes a lot of resources each time it runs.
Optimize the query so that it uses fewer resources and returns results quickly.
Make sure the alert rule's Status is set to Enabled.
Once that is done, make sure the alert condition is set to fire when the number of query results is greater than or equal to 1; the rule then evaluates that threshold on each run and fires the alert when the condition matches.
If the alert still does not fire, delete the alert rule, run the query in the query editor to confirm it returns results, and create a new alert rule.
I created a partition QoS for my Slurm partition, but it isn't working. How can I solve this problem? If anyone knows, please let me know. The following steps show what I did.
1. Create QoS
$sacctmgr show qos format="Name,MaxWall,MaxTRESPerUser%30,MaxJob,MaxSubmit,Priority,Preempt"
Name MaxWall MaxTRESPU MaxJobs MaxSubmit Priority Preempt
---------- ----------- ------------------------------ ------- --------- ---------- ----------
normal 0
batchdisa+ 0 0 10
2. Attach QoS to partition
$scontrol show partition
PartitionName=sample01
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=batchdisable
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=computenode0[1-2]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=2 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
3. Run jobs
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
67109044 sample01 testjob test R 1:42 1 computenode01
67109045 sample01 testjob test R 1:39 1 computenode02
I was able to solve the problem by adding the following setting to slurm.conf.
AccountingStorageEnforce=associations
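For reference, a minimal sketch of the relevant slurm.conf line (some sites also enforce limits and QoS explicitly; whether you need those extra flags is an assumption to verify against the AccountingStorageEnforce documentation for your setup):
AccountingStorageEnforce=associations
# optionally: AccountingStorageEnforce=associations,limits,qos
Note that slurm.conf changes only take effect after running scontrol reconfigure or restarting slurmctld.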
I have a question with respect to default partitioning in RDD.
case class Animal(id:Int, name:String)
val myRDD = session.sparkContext.parallelize(Array(Animal(1, "Lion"), Animal(2, "Elephant"), Animal(3, "Jaguar"), Animal(4, "Tiger"), Animal(5, "Chetah")))
println(myRDD.getNumPartitions)
I am running the above piece of code in my laptop which has 12 logical cores.
Hence I see that there are 12 partitions created.
My understanding is that hash partitioning is used to determine which object needs to go to which partition. So in this case, the formula would be: hashCode() % 12
But when I examine further, I see that all the objects are placed in the last partition.
myRDD.foreachPartition( e => { println("----------"); e.foreach(println) } )
The above code prints the output below (the first eleven partitions appear to be empty and the last one has all the objects; each dashed line separates the contents of one partition):
----------
----------
----------
----------
----------
----------
----------
----------
----------
----------
----------
----------
Animal(2,Elephant)
Animal(4,Tiger)
Animal(3,Jaguar)
Animal(5,Chetah)
Animal(1,Lion)
I don't know why this happens. Can you please help?
Thanks!
I don't think that means all your data are in the last partition. Rather, since foreachPartition is executed in parallel, it might be that the dashed lines have already been printed from all the executors, before the values are printed. The order of the printed lines does not indicate the order of execution.
If you try the code below (source), you can see that the data is spread evenly across the partitions (at least on my machine):
myRDD.mapPartitionsWithIndex((index, itr) => itr.toList.map(x => x + "#" + index).iterator).collect
// res6: Array[String] = Array(Animal(1,Lion)#1, Animal(2,Elephant)#2, Animal(3,Jaguar)#3, Animal(4,Tiger)#4, Animal(5,Chetah)#5)
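For what it's worth, parallelize does not hash the elements at all: it slices the input collection into numSlices contiguous ranges by position, so with 5 elements and 12 partitions several partitions are simply empty. A small sketch, reusing the same myRDD from the question, that shows the actual element-to-partition assignment via glom:
// Collect each partition as an array so the partition boundaries stay visible.
myRDD.glom().collect().zipWithIndex.foreach { case (elements, partitionIndex) =>
  println(s"partition $partitionIndex: " + elements.mkString(", "))
}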
I need to measure the throughput of a Heron cluster, but there is no such metric in the Heron UI. Does anyone have ideas about how to monitor the throughput of a Heron cluster? Thanks.
The result of running heron-explorer is as follows:
yitian#heron01:~$ heron-explorer metrics aurora/yitian/devel SentenceWordCountTopology
[2018-08-03 21:02:09 +0000] [INFO]: Using tracker URL: http://127.0.0.1:8888
'spout' metrics:
container id jvm-uptime-secs jvm-process-cpu-load jvm-memory-used-mb emit-count ack-count fail-count
------------------- ----------------- ---------------------- -------------------- ------------ ----------- ------------
container_3_spout_6 2053 0.253257 146 1.13288e+07 1.13278e+07 0
container_4_spout_7 2091 0.150625 137.5 1.1624e+07 1.16228e+07 231
'count' metrics:
container id jvm-uptime-secs jvm-process-cpu-load jvm-memory-used-mb emit-count execute-count ack-count fail-count
-------------------- ----------------- ---------------------- -------------------- ------------ --------------- ----------- ------------
container_6_count_12 2092 0.184742 155.167 0 4.6026e+07 4.6026e+07 0
container_5_count_9 2091 0.387867 146 0 4.60069e+07 4.60069e+07 0
container_6_count_11 2092 0.184488 157.833 0 4.58158e+07 4.58158e+07 0
container_4_count_8 2091 0.443688 129.833 0 4.58722e+07 4.58722e+07 0
container_5_count_10 2091 0.382577 118.5 0 4.60091e+07 4.60091e+07 0
'split' metrics:
container id jvm-uptime-secs jvm-process-cpu-load jvm-memory-used-mb emit-count execute-count ack-count fail-count
------------------- ----------------- ---------------------- -------------------- ------------ --------------- ----------- ------------
container_1_split_2 2091 0.143034 75.3333 4.59453e+07 4.59453e+06 4.59453e+06 0
container_3_split_5 2042 1.12248 79.1667 4.64862e+07 4.64862e+06 4.64862e+06 0
container_2_split_3 2150 0.139837 83.6667 4.59443e+07 4.59443e+06 4.59443e+06 0
container_1_split_1 2091 0.145702 104.167 4.59454e+07 4.59454e+06 4.59454e+06 0
container_2_split_4 2150 0.138453 106.333 4.59443e+07 4.59443e+06 4.59443e+06 0
[2018-08-03 21:02:09 +0000] [INFO]: Elapsed time: 0.031s.
You can use the execute-count of your sink components to measure the output of your topology. If each of your components has a 1:1 input:output ratio, then this will be your throughput.
However, if you are windowing tuples into batches or splitting tuples (like separating sentences into individual words), then things get a little more complicated. You can get the input into your topology by looking at the emit-count of your spout components. You could then compare this with your bolt execute-counts to create your own throughput metric.
An easy way to get programmatic access to these metrics is via the Heron Tracker REST API. You can use your chosen language's HTTP library (like Requests for Python) to query the last 3 hours of data for a running topology. If you require more than 3 hours of data (the maximum stored by the topology TMaster) you will need to use one of the other metrics sinks to send metrics to an external database. Heron currently provides sinks for saving to local files, Graphite or Prometheus. InfluxDB support is in the works.
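As a rough illustration, here is a minimal sketch in Scala of pulling those counts from the Tracker; the endpoint path, the parameter names and the __execute-count/default metric name are assumptions to verify against the Heron Tracker REST API documentation for your version, and the URL is the tracker address from the heron-explorer output above:
import scala.io.Source

// Ask the Tracker for the last 3 hours (10800 s) of execute-counts of the "count" bolt.
// Endpoint and parameter names are assumptions -- check the Tracker REST API docs.
val trackerUrl = "http://127.0.0.1:8888/topologies/metrics" +
  "?cluster=aurora&environ=devel&topology=SentenceWordCountTopology" +
  "&component=count&metricname=__execute-count/default&interval=10800"

val source = Source.fromURL(trackerUrl, "UTF-8")
try println(source.mkString)   // raw JSON response from the Tracker
finally source.close()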
We have a Percona XtraDB Cluster with 5 nodes and an arbitrator. One of our PHP developers ran a bad query on the cluster, crashing all the nodes. After the crash, we could not collect any error log to tell us what really went wrong, as the entire cluster crashed without performing any logging.
I have always thought that when a single query is executed on the cluster, it is processed by only one of the nodes in the cluster. So if the query is bad (to the point of killing a DB server), it should only crash the one node that's processing it, leaving the cluster running with the remaining 4 nodes.
This behavior has puzzled us, and we would like to understand what is really going on, especially as this is the second time it has happened. Why would a query that is processed by only one of the nodes cause the other nodes in the cluster to crash when something goes wrong while it is being processed?
Below is our my.cnf config:
#
# Default values.
[mysqld_safe]
flush_caches
numa_interleave
#
#
[mysqld]
back_log = 65535
binlog_format = ROW
character_set_server = utf8
collation_server = utf8_general_ci
datadir = /var/lib/mysql
default_storage_engine = InnoDB
expand_fast_index_creation = 1
expire_logs_days = 7
innodb_autoinc_lock_mode = 2
innodb_buffer_pool_instances = 16
innodb_buffer_pool_populate = 1
innodb_buffer_pool_size = 32G # XXX 64GB RAM, 80%
innodb_data_file_path = ibdata1:64M;ibdata2:64M:autoextend
innodb_file_format = Barracuda
innodb_file_per_table
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_io_capacity = 1600
innodb_large_prefix
innodb_locks_unsafe_for_binlog = 1
innodb_log_file_size = 64M
innodb_print_all_deadlocks = 1
innodb_read_io_threads = 64
innodb_stats_on_metadata = FALSE
innodb_support_xa = FALSE
innodb_write_io_threads = 64
log-bin = mysqld-bin
log-queries-not-using-indexes
log-slave-updates
long_query_time = 1
max_allowed_packet = 64M
max_connect_errors = 4294967295
max_connections = 4096
min_examined_row_limit = 1000
port = 3306
relay-log-recovery = TRUE
skip-name-resolve
slow_query_log = 1
slow_query_log_timestamp_always = 1
table_open_cache = 4096
thread_cache = 1024
tmpdir = /db/tmp
transaction_isolation = REPEATABLE-READ
updatable_views_with_limit = 0
user = mysql
wait_timeout = 60
#
# Galera Variable config
wsrep_cluster_address = gcomm://ip_1, ip_2, ip_3,ip_4,ip_4,ip_5
wsrep_cluster_name = cluster_db
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_provider_options = "gcache.size=4G"
wsrep_slave_threads = 32
wsrep_sst_auth = "user:password"
wsrep_sst_donor = "db1"
#wsrep_sst_method = xtrabackup_throttle
wsrep_sst_method = xtrabackup-v2
#
# XXX You *MUST* change!
server-id = 1
Can you post the query? SELECT queries execute only on the node they are sent to, but all write queries are replicated and applied on every node. What's in your error log?
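Since you mention there was no error log to collect, it may also be worth setting the error log path explicitly in the [mysqld] section of my.cnf so the next crash leaves a trace (the path below is just an example):
log_error = /var/log/mysql/error.log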