I have a question about Slurm sacct output.
       JobID  ReqCPUS    ResvCPU  AllocCPUS ResvCPURAW  SystemCPU CPUTimeRAW   TotalCPU    UserCPU
------------ -------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
6089463_1          10   00:03:30         10        210  00:02.727       4250  06:38.627  06:35.899
6089463_1.b+        4                     4             00:02.727       1700  06:38.627  06:35.899
Why does one job have two different lines?
And why do the two lines have different ReqCPUS values? The user requested 10 CPUs but used only 4?
Thank you,
Related
I run sacct with the -j switch, for a specific job id. Depending on the other command-line switches, two completely different results are reported for the same job. Here are three examples; the second one shows a different result than the other two.
attar#lh> sacct -a -s CA,CD,F,NF,PR,TO -S 2020-07-26T00:00:00 -E 2020-07-27T23:59:59 --format=JobId,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus -j 1401
JobID State Timelimit Start End Elapsed MaxRSS MaxVMSize NNodes NCPUS
------------ ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ----------
1401 CANCELLED+ UNLIMITED 2020-07-26T20:45:31 2020-07-27T08:36:10 11:50:39 1 2
1401.batch COMPLETED 2020-07-26T20:45:31 2020-07-27T08:36:17 11:50:46 103856K 619812K 1 2
attar#lh> sacct -a -s CA,CD,F,NF,PR,TO -S 2020-07-26T00:00:00 -E 2020-07-26T23:59:59 --format=JobId,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus -j 1401
JobID State Timelimit Start End Elapsed MaxRSS MaxVMSize NNodes NCPUS
------------ ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ----------
1401 NODE_FAIL UNLIMITED 2020-06-15T09:38:38 2020-07-26T00:17:26 40-14:38:48 1 2
attar#lh> sacct -a -s CA,CD,F,NF,PR,TO --format=JobId,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus -j 1401
JobID State Timelimit Start End Elapsed MaxRSS MaxVMSize NNodes NCPUS
------------ ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ----------
1401 CANCELLED+ UNLIMITED 2020-07-26T20:45:31 2020-07-27T08:36:10 11:50:39 1 2
1401.batch COMPLETED 2020-07-26T20:45:31 2020-07-27T08:36:17 11:50:46 103856K 619812K 1 2
Why are the start/end times different for the same job? One reports 11 hours of run time and the other 40 days!
Any insight is highly appreciated!
This would typically happen when two jobs have the same JobId. The sacct documentation says:
If Slurm job ids are reset, some job numbers will probably appear more than once in the accounting log file but refer to different jobs. Such jobs can be distinguished by the "submit" time stamp in the data records.
Try running the sacct command with the --duplicates option.
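A minimal invocation (reusing the job id from the examples above) would be:

```
sacct -j 1401 --duplicates --format=JobID,Submit,Start,End,State
```

Each record sharing the id then shows its own Submit timestamp, which is how you tell the recycled job ids apart.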
I can add two tensors x and y in place like this:
x = x.add(y)
Is there a way of doing the same with three or more tensors, given that all tensors have the same dimensions?
result = torch.sum(torch.stack([x, y, ...]), dim=0)
Without stack:
from functools import reduce
result = reduce(torch.add, [x, y, ...])
EDIT
As #LudvigH pointed out, the second method is not as memory-efficient as in-place addition. So it's better like this:
from functools import reduce

result = reduce(
    torch.Tensor.add_,
    [x, y, ...],
    torch.zeros_like(x)  # optionally set an initial element to avoid changing `x`
)
How important is it that the operations occur in place?
I believe the only way to do addition in place is with the add_ function.
For example:
a = torch.randn(5)
b = torch.randn(5)
c = torch.randn(5)
d = torch.randn(5)
a.add_(b).add_(c).add_(d) # in place addition of a+b+c+d
In general, in-place operations in PyTorch are tricky. Their use is discouraged. I think that stems from the fact that it is easy to mess up and ruin the computation graph, giving you unexpected results. Also, there are many GPU optimizations that will be done anyway, and forcing in-place operations can end up slowing down your performance. But assuming that you really know what you are doing, and you want to sum a lot of tensors with compatible shapes, I would use the following pattern:
import functools
import operator
list_of_tensors = [a, b, c] # some tensors previously defined
functools.reduce(operator.iadd, list_of_tensors)
### now tensor a is the in-place sum of all the tensors
It builds on the pattern of reduce, which means "do this to all elements in the list/iterable", and operator.iadd, which means +=. There are many caveats with +=, since it can mess up scoping and behaves unexpectedly with immutable types such as strings. But in the context of PyTorch it does what we want: it calls add_.
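To see the mechanics without torch, here is a minimal sketch (the Acc class is a made-up stand-in whose += mutates the object in place, the way Tensor.add_ does):

```python
from functools import reduce
from operator import iadd

class Acc:
    """Tiny mutable accumulator standing in for a tensor."""
    def __init__(self, v):
        self.v = v
    def __iadd__(self, other):  # invoked by +=; mutates self, like add_
        self.v += other.v
        return self

tensors = [Acc(1), Acc(2), Acc(3)]
total = reduce(iadd, tensors)

print(total.v)              # 6
print(total is tensors[0])  # True: the first element was mutated in place
```

The same holds for the PyTorch version: after reduce(iadd, list_of_tensors), the first tensor in the list holds the sum.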
Below, you can see a simple benchmark.
from functools import reduce
from operator import iadd
import torch
def make_tensors():
    return [torch.randn(5, 5) for _ in range(1000)]

def profile_action(label, action):
    print(label)
    list_of_tensors = make_tensors()
    with torch.autograd.profiler.profile(
        profile_memory=True, record_shapes=True
    ) as prof:
        action(list_of_tensors)
    print(prof.key_averages().table(sort_by="self_cpu_memory_usage"))
profile_action("Case A:", lambda tensors: torch.sum(torch.stack(tensors), dim=0))
profile_action("Case B:", lambda tensors: sum(tensors))
profile_action("Case C:", lambda tensors: reduce(torch.add, tensors))
profile_action("Case D:", lambda tensors: reduce(iadd, tensors))
The results vary between runs, of course, but this copy-paste was somewhat representative on my machine. Try it on yours! It probably changes a bit with the PyTorch version as well.
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::resize_ 0.14% 14.200us 0.14% 14.200us 14.200us 97.66 Kb 97.66 Kb 1
aten::empty 0.06% 5.800us 0.06% 5.800us 2.900us 100 b 100 b 2
aten::stack 17.38% 1.751ms 98.71% 9.945ms 9.945ms 97.66 Kb 0 b 1
aten::unsqueeze 30.55% 3.078ms 78.55% 7.914ms 7.914us 0 b 0 b 1000
aten::as_strided 48.02% 4.837ms 48.02% 4.837ms 4.833us 0 b 0 b 1001
aten::cat 0.73% 73.800us 2.78% 280.000us 280.000us 97.66 Kb 0 b 1
aten::_cat 1.87% 188.900us 2.05% 206.200us 206.200us 97.66 Kb 0 b 1
aten::sum 1.09% 109.400us 1.29% 130.100us 130.100us 100 b 0 b 1
aten::fill_ 0.17% 16.700us 0.17% 16.700us 16.700us 0 b 0 b 1
[memory] 0.00% 0.000us 0.00% 0.000us 0.000us -97.75 Kb -97.75 Kb 2
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 10.075ms
----------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
----------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::add 99.32% 14.711ms 100.00% 14.812ms 14.812us 97.66 Kb 97.65 Kb 1000
aten::empty_strided 0.07% 10.400us 0.07% 10.400us 10.400us 4 b 4 b 1
aten::to 0.37% 54.900us 0.68% 100.400us 100.400us 4 b 0 b 1
aten::copy_ 0.24% 35.100us 0.24% 35.100us 35.100us 0 b 0 b 1
[memory] 0.00% 0.000us 0.00% 0.000us 0.000us -97.66 Kb -97.66 Kb 1002
----------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 14.812ms
------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::add 100.00% 10.968ms 100.00% 10.968ms 10.979us 97.56 Kb 97.56 Kb 999
[memory] 0.00% 0.000us 0.00% 0.000us 0.000us -97.56 Kb -97.56 Kb 999
------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 10.968ms
-------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
-------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::add_ 100.00% 5.184ms 100.00% 5.184ms 5.190us 0 b 0 b 999
-------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 5.184ms
I allocate 1000 tensors, each holding 25 32-bit floats (100 B per tensor, 100 kB = 97.66 Kb in total). The difference in runtime and memory footprint is quite stunning.
Case A, torch.sum(torch.stack(list_of_tensors), dim=0) allocates 100kb for the stack, and 100b for the result, taking 10 ms.
Case B, sum, takes 14 ms, mostly because of Python overhead I guess. It allocates roughly 100 kB in total for the intermediary results of each addition.
Case C uses reduce with torch.add, which gets rid of some overhead, gaining runtime performance (11 ms) but still allocating intermediary results. This time it does not start from a 0 initialization, which sum does, so we only do 999 additions instead of 1000 and allocate one intermediate result less. The difference from Case B is minute, and in most runs they had the same runtime.
Case D is my recommended way for in-place addition of an iterable/list of tensors. It takes roughly half the time and allocates no extra memory. Efficient. But you clobber the first tensor in the list, since you perform the operation in place.
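The memory numbers above can be sanity-checked with a little arithmetic (plain Python, no torch needed):

```python
n_tensors = 1000
floats_per_tensor = 5 * 5      # each tensor is 5x5
bytes_per_float = 4            # 32-bit floats

per_tensor_bytes = floats_per_tensor * bytes_per_float  # 100 B per tensor
total_bytes = n_tensors * per_tensor_bytes              # 100 000 B in total

print(per_tensor_bytes)      # 100
print(total_bytes / 1024)    # 97.65625 -> the 97.66 Kb in the profiler tables
```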
Is it possible to expand the number of characters used in the JobName column of the command sacct in SLURM?
For example, I currently have:
       JobID    JobName    Elapsed      NCPUS   NTasks      State
------------ ---------- ---------- ---------- -------- ----------
       12345 lengthy_na+  00:00:01          4        1     FAILED
and I would like:
       JobID      JobName    Elapsed      NCPUS   NTasks      State
------------ ------------ ---------- ---------- -------- ----------
       12345 lengthy_name   00:00:01          4        1     FAILED
You should use the format option. With:
sacct --helpformat
you'll see the fields you can show. For instance:
sacct --format="JobID,JobName%30"
will print the job id and the name, up to 30 characters:
JobID JobName
------------ ------------------------------
19009 bash
19010 launch.sh
19010.0 hydra_pmi_proxy
19010.1 hydra_pmi_proxy
Now you can customize your own output.
To retrieve my list of running SLURM jobs, I use the default format with 30 characters shown for job names, using the bash command below:
squeue --format="%.18i %.9P %.30j %.8u %.8T %.10M %.9l %.6D %R" --me
If you want to show more job-name characters, simply change the number in %.30j.
A simple table join is usually done in 0.0XX seconds and sometimes in 2.0XX seconds (according to PL/SQL Developer SQL execution). It still happens when running from SQL*Plus.
If I run the SQL 10 times, 8 times it runs fine and 2 times in 2+ seconds.
It's a clean install of Oracle 11.2.0.4 for Linux x86_64 on CentOS 7.
I've installed Oracle recommended patches:
Patch 19769489 - Database Patch Set Update 11.2.0.4.5 (Includes CPUJan2015)
Patch 19877440 - Oracle JavaVM Component 11.2.0.4.2 Database PSU (Jan2015)
No change after patching.
The 2 tables have:
LNK_PACK_REP: 13 rows
PACKAGES: 6 rows
In SQL*Plus I've enabled all statistics and ran the SQL multiple times. Only the elapsed time changes, from 0.1 to 2.1 seconds, from time to time. No other statistic changes when I compare a 0.1-second run with a 2.1-second run. The server has 16 GB RAM and 8 CPU cores. Server load is under 0.1 (no user is using the server at the moment).
Output:
SQL> select PACKAGE_ID, id, package_name from LNK_PACK_REP LNKPR INNER JOIN PACKAGES P ON LNKPR.PACKAGE_ID = P.ID;
PACKAGE_ID ID PACKAGE_NAME
3 3 RAPOARTE
3 3 RAPOARTE
121 121 VANZARI
121 121 VANZARI
121 121 VANZARI
2 2 PACHETE
2 2 PACHETE
1 1 DEPARTAMENTE
1 1 DEPARTAMENTE
81 81 ROLURI
81 81 ROLURI
PACKAGE_ID ID PACKAGE_NAME
101 101 UTILIZATORI
101 101 UTILIZATORI
13 rows selected.
Elapsed: 00:00:02.01
Execution Plan
Plan hash value: 2671988802
--------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 13 | 351 | 3 (0)| 00:00:01 | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 13 | 351 | 3 (0)| 00:00:01 | Q1,02 | P->S | QC (RAND) |
|* 3 | HASH JOIN | | 13 | 351 | 3 (0)| 00:00:01 | Q1,02 | PCWP | |
| 4 | PX RECEIVE | | 6 | 84 | 2 (0)| 00:00:01 | Q1,02 | PCWP | |
| 5 | PX SEND HASH | :TQ10001 | 6 | 84 | 2 (0)| 00:00:01 | Q1,01 | P->P | HASH |
| 6 | PX BLOCK ITERATOR | | 6 | 84 | 2 (0)| 00:00:01 | Q1,01 | PCWC | |
| 7 | TABLE ACCESS FULL| PACKAGES | 6 | 84 | 2 (0)| 00:00:01 | Q1,01 | PCWP | |
| 8 | BUFFER SORT | | | | | | Q1,02 | PCWC | |
| 9 | PX RECEIVE | | 13 | 169 | 1 (0)| 00:00:01 | Q1,02 | PCWP | |
| 10 | PX SEND HASH | :TQ10000 | 13 | 169 | 1 (0)| 00:00:01 | | S->P | HASH |
| 11 | INDEX FULL SCAN | UNQ_PACK_REP | 13 | 169 | 1 (0)| 00:00:01 | | | |
--------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
3 - access("LNKPR"."PACKAGE_ID"="P"."ID")
Note
dynamic sampling used for this statement (level=2)
Statistics
24 recursive calls
0 db block gets
10 consistent gets
0 physical reads
0 redo size
923 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
4 sorts (memory)
0 sorts (disk)
13 rows processed
Table 1 structure:
-- Create table
create table PACKAGES
(
id NUMBER(3) not null,
package_name VARCHAR2(150),
position NUMBER(3),
activ NUMBER(1)
)
tablespace UM
pctfree 10
initrans 1
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate primary, unique and foreign key constraints
alter table PACKAGES
add constraint PACKAGES_ID primary key (ID)
using index
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate indexes
create index PACKAGES_ACTIV on PACKAGES (ID, ACTIV)
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
Table 2 structure:
-- Create table
create table LNK_PACK_REP
(
package_id NUMBER(3) not null,
report_id NUMBER(3) not null
)
tablespace UM
pctfree 10
initrans 1
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate primary, unique and foreign key constraints
alter table LNK_PACK_REP
add constraint UNQ_PACK_REP primary key (PACKAGE_ID, REPORT_ID)
using index
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
-- Create/Recreate indexes
create index LNK_PACK_REP_REPORT_ID on LNK_PACK_REP (REPORT_ID)
tablespace UM
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 64K
next 1M
minextents 1
maxextents unlimited
);
In Oracle Enterprise Manager, under SQL Monitor, I can see the SQL that is run multiple times. All runs show a "Database Time" of 0.0s (under 10 microseconds if I hover over the list) and a "Duration" of 0.0s for normal runs and 2.0s for those with the delay.
If I go to Monitored SQL Executions for that run of 2.0s I have:
Duration: 2.0s
Database Time: 0.0s
PL/SQL & Java: 0.0
Wait activity: % (no number here)
Buffer gets: 10
IO Requests: 0
IO Bytes: 0
Fetch calls: 2
Parallel: 4
These numbers are consistent with a fast run, except for Duration; on a fast run, Duration is even smaller than Database Time (10,163 microseconds Database Time versus 3,748 microseconds Duration), both displayed as 0.0s unless I hover the mouse.
I don't know what else to check.
Parallel queries cannot be meaningfully tuned to within a few seconds. They are designed for queries that process large amounts of data for a long time.
The best way to optimize parallel statements with small data sets is to temporarily disable it:
alter system set parallel_max_servers=0;
(This is a good example of the advantages of developing on workstations instead of servers. On a server, this change affects everyone and you probably don't even have the privilege to run the command.)
The query may be simple but parallelism adds a lot of complexity in the background.
It's hard to say exactly why it's slower. If you have the SQL Monitoring report the wait events may help. But even those numbers may just be generic waits like "CPU". Parallel queries have a lot of overhead, in expectation of a resource-intensive, long-running query. Here are some types of overhead that may explain where those 2 seconds come from:
Dynamic sampling - Parallelism may automatically trigger dynamic sampling, which reads data from the tables. Although "dynamic sampling used for this statement (level=2)" may just imply missing optimizer statistics.
OS thread startup - The SQL statement probably needs to start up 8 additional OS threads and prepare a large amount of memory to hold all the intermediate data. Perhaps the parameter PARALLEL_MIN_SERVERS could help avoid some of the time spent creating those threads.
Additional monitoring - Parallel statements are automatically monitored, which requires recursive SELECTs and INSERTs.
Caching - Parallel queries often read directly from disk and skip reading and writing into the buffer cache. The rules for when it caches data are complicated and undocumented.
Downgrading - Finding the correct degree of parallelism is complicated. For example, I've compiled a list of 39 factors that influence the DOP. It's possible that one of those is causing downgrading, making some queries fast and others slow.
And there are probably dozens of other types of overhead I can't think of. Parallelism is great for massively improving the run-time of huge operations. But it doesn't work well for tiny queries.
The delay is due to parallelism, as suggested by David Aldridge and Jon Heller, but I don't agree with Jon Heller's proposed solution of disabling parallelism for all queries (at the system level). You can play with "alter session" to disable it, and re-enable it before running big queries. The exact reason for the delay is still unknown, as the query finishes fast in 8 out of 10 runs and I would expect 10/10 fast runs.
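For reference, the session-level switch alluded to above uses standard Oracle syntax (a sketch; run it in the session that executes the small queries):

```sql
-- Turn parallel query off just for this session ...
ALTER SESSION DISABLE PARALLEL QUERY;

-- ... run the small, latency-sensitive statements here ...

-- ... and turn it back on before the big queries.
ALTER SESSION ENABLE PARALLEL QUERY;
```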
I have a markdown document I'm processing with the pandoc tool to generate HTML and PDF documents. I'm trying to include a table in the document. Regular markdown doesn't support tables, but pandoc does. I've tried copy-pasting the definition of a table from the pandoc documentation into my source document, but when running it through the pandoc program the resulting document is all crammed into one big table.
Can anyone show me a pandoc table that renders properly?
# Points about Tweedledee and Tweedledum
Much has been made of the curious features of
Tweedledee and Tweedledum. We propose here to
set some of the controversy to rest and to uproot
all of the more outlandish claims.
          Tweedledee      Tweedledum
--------  --------------  ----------------
Age       14              14
Height    3'2"            3'2"
Politics  Conservative    Conservative
Religion  "New Age"       Syrian Orthodox
--------  --------------  ----------------

Table: T.-T. Data
# Mussolini's role in my downfall
--------------------------------------------------------------------
             *Drugs*       *Alcohol*        *Tobacco*
---------- ------------- ----------------- --------------------
Monday     3 Xanax       2 pints           3 cigars,
                                           1 hr at hookah bar

Tuesday    14 Adderall   1 Boone's Farm,   1 packet Drum
                         2 Thunderbird

Wednesday  2 aspirin     Tall glass water  (can't remember)
---------------------------------------------------------------------

Table: *Tableau des vices*, deluxe edition
# Points about the facts
In recent years, more and more attention has been
paid to opinion, less and less to what were formerly
called the cold, hard facts. In a spirit of traditionalism,
we propose to reverse the trend. Here are some of our results.
------- ------ ---------- -------
     12     12         12      12
    123    123        123     123
      1      1          1       1
---------------------------------------

Table: Crucial Statistics
# Recent innovations (1): False presentation
Some, moved by opinion and an irrational lust for novelty,
would introduce a non-factual element into the data,
perhaps moving all the facts to the left:
------- ------ ---------- -------
12      12     12         12
123     123    123        123
1       1      1          1
---------------------------------------

Table: Crucial "Statistics"
# Recent innovations (2): Illegitimate decoration
Others, preferring their facts to be *varnished*,
as we might say, will tend to 'label' the columns
Variable  Before During     After
--------- ------ ---------- -------
12        12     12         12
123       123    123        123
1000      1000   1000       1000
----------------------------------------
# Recent innovations (3): "Moderate" decoration
Or, maybe, to accompany this 'spin' with a centered or centrist representation:
 Variable  Before   During   After
---------- ------- ---------- -------
    12       12        12       12
   123       123      123      123
     1        1         1        1
-----------------------------------------
# The real enemy
Some even accompany these representations with a bit of leftwing
clap-trap, suggesting the facts have drifted right:
------------------------------------------------------
  Variable      Before     During   After
---------- ----------- ---------- -------
        12          12         12      12
 -- Due to
    baleful
  bourgeois
  influence

       123         123        123     123
 -- Thanks
    to the
  renegade
   Kautsky

         1           1          1       1
  -- All a
 matter of
sound Party
discipline
-------------------------------------------------------

Table: *"The conditions are not ripe, comrades; they are **overripe**!"*
# The Truth
If comment be needed, let it be thus: the facts have drifted left.
--------------------------------------------------------------------------
Variable       Before        During     After
-------------- ------------- ---------- ----------------------
12             12            12         12
(here's        (due to                  (something to do
where the rot  lapse of                 with Clinton and
set in )       traditional              maybe the '60's)
               values)

123            123           123        123
(too much                    (A=440?)
strong drink)

1              1             1          1
                                        (Trilateral Commission?)
--------------------------------------------------------------------------

Table: *The Decline of Western Civilization*