SGE jobs terminating after 3 hours - qsub

I'm used to Torque so I'm hoping that some Sun SGE guru can help. I can't figure out why my jobs are terminating at, near as I can tell 3 hours. The jobs are being submitted to an empty queue with no competition on a new install of ROCKS 6.2. Jobs shorter than 3 hours do not have any problems.
Here's the Server configuration (I think)
[user#machine]$ qconf -ssconf
algorithm default
schedule_interval 0:0:05
maxujobs 0
queue_sort_method load
job_load_adjustments np_load_avg=0.50
load_adjustment_decay_time 0:10:30
load_formula slots
schedd_job_info false
flush_submit_sec 1
flush_finish_sec 10
params none
reprioritize_interval 0:0:0
halftime 168
usage_weight_list cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor 5.000000
weight_user 0.250000
weight_project 0.250000
weight_department 0.250000
weight_job 0.250000
weight_tickets_functional 0
weight_tickets_share 0
share_override_tickets TRUE
share_functional_shares TRUE
max_functional_jobs_to_schedule 200
report_pjob_tickets TRUE
max_pending_tasks_per_job 50
halflife_decay_list none
policy_hierarchy OFS
weight_ticket 0.010000
weight_waiting_time 0.000000
weight_deadline 3600000.000000
weight_urgency 0.100000
weight_priority 1.000000
max_reservation 0
default_duration INFINITY
The queue configuration
[user#machine]$ qconf qconf -sql
all.q
[user#machine]$ qconf qconf -sq all.q
qname all.q
hostlist #allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make mpi mpich orte
rerun FALSE
slots 1,[compute-0-0.local=4],[compute-1-0.local=4], \
[compute-1-1.local=4],[compute-1-2.local=4], \
[compute-1-3.local=4],[compute-1-4.local=4], \
[compute-2-0.local=4],[compute-2-1.local=4], \
[compute-2-2.local=4],[compute-2-4.local=4]
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
And here's a sample in progress job status.
[user#machine]$ qstat -j 255
==============================================================
job_number: 255
exec_file: job_scripts/255
submission_time: Mon Jul 3 11:19:07 2017
owner: <user>
uid: 500
group: <user>
gid: 502
sge_o_home: /home/<user>
sge_o_log_name: <user>
sge_o_path: /home/<user>/<programDIR>/bin/linux:/home/<user>/<gitRootDIR>/<codeDIR>/bin/linux:/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/gridengine/bin/linux-x64:/home/<user>/scripts/:/home/<user>/bin
sge_o_shell: /bin/bash
sge_o_workdir: <CORRECT working DIR>
sge_o_host: <HOSTNAME>
account: sge
merge: y
mail_options: ae
mail_list: <user#domain.tld>
notify: FALSE
job_name: STDIN
stdout_path_list: NONE:NONE:test3_w5.eo
jobshare: 0
env_list: HOSTNAME=<HOSTNAME>.FQDN,SHELL=/bin/bash,TERM=xterm,HISTSIZE=1000,EGS_HOME=/home/<user>/<programDIR>/,SSH_CLIENT=172.24.56.106 56512 22,SGE_ARCH=linux-x64,SGE_CELL=default,MPICH_PROCESS_GROUP=no,QTDIR=/usr/lib64/qt-3.3,QTINC=/usr/lib64/qt-3.3/include,SSH_TTY=/dev/pts/0,ROCKSROOT=/opt/rocks/share/devel,ANT_HOME=/opt/rocks,USER=<user>,LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:,LD_LIBRARY_PATH=/opt/gridengine/lib/linux-x64:/opt/openmpi/lib,ROCKS_ROOT=/opt/rocks,<DEFAULT_BATCH_SYSTEM>=sge,PATH=/home/<user>/<programDIR>/bin/linux:/home/<user>/<gitRootDIR>/<codeDIR>/bin/linux:/opt/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/java/latest/bin:/opt/maven/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/gridengine/bin/linux-x64:/home/<user>/scripts/:/home/<user>/bin,MAIL=/var/spool/mail/<user>,MAVEN_HOME=/opt/maven,<codeDIR>=/home/<user>/<gitRootDIR>/<codeDIR>/,PWD=/home/<user>/<programDIR>/dosxyznrc,JAVA_HOME=/usr/java/latest,_LMFILES_=/usr/share/Modules/modulefiles/rocks-openmpi,SGE_EXECD_PORT=537,LANG=en_US.iso885915,MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles,SGE_QMASTER_PORT=536,LOADEDMODULES=rocks-openmpi,SGE_ROOT=/opt/gridengine,HISTCONTROL=ignoredups,SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,HOME=/home/<user>,SHLVL=2,ROLLSROOT=/opt/rocks/share/devel/src/roll,MPIHOME=/opt/openmpi,LOGNAME=<user>,CVS_RSH=ssh,QTLIB=/usr/lib64/qt-3.3/lib,SSH_CONNECTION=172.24.56.106 56512 172.24.59.111 22,MODULESHOME=/usr/share/Modules,LESSOPEN=||/usr/bin/lesspipe.sh %s,OMEGA_HOME=/home/<user>/<gitRootDIR>/<codeDIR>/omega,EGS_CONFIG=/home/<user>/<gitRootDIR>/<codeDIR>/specs/linux.conf,G_BROKEN_FILENAMES=1,OMPI_MCA_btl=self,sm,tcp,BASH_FUNC_module()=() { eval `/usr/bin/modulecmd bash $*`
},_=/opt/gridengine/bin/linux-x64/qsub
script_file: STDIN
usage 1: cpu=00:13:50, mem=833.52093 GBs, io=0.05473, vmem=1.109G, maxvmem=1.109G
scheduling info: (Collecting of scheduler job information is turned off)

It sounds like the complex has a default run time. You can inspect these w/:
qconf -sc | grep -e '^#' -e _rt
If the default is not 0:0:0, then that value is enforced on the run time.
You can override this default w/:
qsub -l s_rt=8:0:0 ...
for 8 hours, for example. Set s_rt or h_rt (soft/hard limits) based on which is set at the complex level.

Related

How to display days in a different language?

I downloaded a conky script from gnome-look which shows the day in english and would like to change the day language to Arabic. the script is :
background yes
update_interval 1
cpu_avg_samples 2
net_avg_samples 2
temperature_unit celsius
double_buffer yes
no_buffers yes
text_buffer_size 2048
alignment bottom_left
gap_x 20
gap_y 7000
minimum_size 550 550
maximum_width 550
own_window yes
own_window_type normal
own_window_transparent yes
own_window_hints undecorated,below,sticky,skip_taskbar,skip_pager
own_window_argb_visual yes
own_window_argb_value 0
border_inner_margin 0
border_outer_margin 0
draw_shades no
draw_outline no
draw_borders no
draw_graph_borders no
default_shade_color 112422
override_utf8_locale yes
use_xft yes
xftfont Feena Casual:size=10
xftalpha 1
uppercase yes
default_color D6D5D4
#E87E3C
own_window_colour 000000
TEXT
${font NotoSansArabic:size=15}${color D6D5D4}${LANG=AR_EG.UTF-8 time %A}\
${font Anurati:size=45}${color D6D5D4}${time %A}#${color yellow}\
${font Finlandica:size=25}${color D6D5D4}${time %H:%M}#${color yellow}\
${font Finlandica:size=25}${color D6D5D4}${time %d %B}#${color yellow}\
${font Anurati:size=25}${color D6D5D4}${time %d %B %Y}#${color yellow}\
I've added this line of code hoping it is going to work :
${font Noto Sans Arabic:size=15}${color D6D5D4}${LANG=AR_EG.UTF-8 time %A}\
However, It doesn't work as expected. Any idea how to edit this line of code ?

snakemake allocates memory twice

I am noticing that all my rules request memory twice, one at a lower maximum than what I requested (mem_mb) and then what I actually requested (mem_gb). If I run the rules as localrules they do run faster. How can I make sure the default settings do not interfere?
resources: mem_mb=100, disk_mb=8620, tmpdir=/tmp/pop071.54835, partition=h24, qos=normal, mem_gb=100, time=120:00:00
The rules are as follows:
rule bwa_mem2_mem:
input:
R1 = "data/results/qc/{species}.{population}.{individual}_1.fq.gz",
R2 = "data/results/qc/{species}.{population}.{individual}_2.fq.gz",
R1_unp = "data/results/qc/{species}.{population}.{individual}_1_unp.fq.gz",
R2_unp = "data/results/qc/{species}.{population}.{individual}_2_unp.fq.gz",
idx= "data/results/genome/genome",
ref = "data/results/genome/genome.fa"
output:
bam = "data/results/mapped_reads/{species}.{population}.{individual}.bam",
log:
bwa ="logs/bwa_mem2/{species}.{population}.{individual}.log",
sam ="logs/samtools_view/{species}.{population}.{individual}.log",
benchmark:
"benchmark/bwa_mem2_mem/{species}.{population}.{individual}.tsv",
resources:
time = parameters["bwa_mem2"]["time"],
mem_gb = parameters["bwa_mem2"]["mem_gb"],
params:
extra = parameters["bwa_mem2"]["extra"],
tag = compose_rg_tag,
threads:
parameters["bwa_mem2"]["threads"],
shell:
"bwa-mem2 mem -t {threads} -R '{params.tag}' {params.extra} {input.idx} {input.R1} {input.R2} | "
"samtools sort -l 9 -o {output.bam} --reference {input.ref} --output-fmt CRAM -# {threads} /dev/stdin 2> {log.sam}"
and the config is:
cluster:
mkdir -p logs/{rule} && # change the log file to logs/slurm/{rule}
sbatch
--partition={resources.partition}
--time={resources.time}
--qos={resources.qos}
--cpus-per-task={threads}
--mem={resources.mem_gb}
--job-name=smk-{rule}-{wildcards}
--output=logs/{rule}/{rule}-{wildcards}-%j.out
--parsable # Required to pass job IDs to scancel
default-resources:
- partition=h24
- qos=normal
- mem_gb=100
- time="04:00:00"
restart-times: 3
max-jobs-per-second: 10
max-status-checks-per-second: 1
local-cores: 1
latency-wait: 60
jobs: 100
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
use-conda: True # Required to run with local conda enviroment
cluster-status: status-sacct.sh # Required to monitor the status of the submitted jobs
cluster-cancel: scancel # Required to cancel the jobs with Ctrl + C
cluster-cancel-nargs: 50
Cheers,
Angel
Right now there are two separate memory resource requirements:
mem_mb
mem_gb
From the perspective of snakemake these are different, so both will be passed to the cluster. A quick fix is to use the same units, e.g. if the resource really requires only 100 mb, then the default resource should be changed to:
default-resources:
- partition=h24
- qos=normal
- mem_mb=100

Error generating the report: org.apache.jmeter.report.core.SampleException: Could not read metadata

I'm trying to run jmeter load testing scripts in non GUI mode to generate HTML report with below command
./jmeter.sh -n -t "/home/dsbloadtest/DSB_New_21_01_2022/apache-jmeter-5.4.3/dsb_test_plans/SERVICE_BOOKING.jmx" -l /home/dsbloadtest/DSB_New_21_01_2022/apache-jmeter-5.4.3/dsb_test_results/testresults.csv -e -o /home/dsbloadtest/DSB_New_21_01_2022/apache-jmeter-5.4.3/dsb_test_results/HTMLReports
It was working fine, but now not getting the result as im getting as below
summary = 0 in 00:00:00 = \*\*\*\*\*\*/s Avg: 0 Min: 9223372036854775807 Max: -9223372036854775808 Err: 0 (0.00%)
Tidying up ... # Fri Apr 01 11:22:40 IST 2022 (1648792360414)
Error generating the report: org.apache.jmeter.report.core.SampleException: Could not read metadata !
... end of run
I have tried to generate HTML report in J meter non GUI mode.
summary = 0 in 00:00:00 = ******/s Avg: 0 Min: 9223372036854775807 Max: -9223372036854775808 Err: 0 (0.00%)
it means that JMeter didn't execute any Sampler, your testresults.csv is empty and you don't have any data to generate the dashboard from.
The reason for test failure normally can be figured out from jmeter.log file, the most common mistakes are:
the file referenced in the CSV Data Set Config doesn't exist
the JMeter Plugins used in the test are not installed for this particular JMeter instance

How to enable CUDA Aware OpenMPI?

I'm using OpenMPI and I need to enable CUDA aware MPI. Together with MPI I'm using OpenACC with the hpc_sdk software.
Following https://www.open-mpi.org/faq/?category=buildcuda I downloaded and installed UCX (not gdrcopy, I haven't managed to install it) with
./contrib/configure-release --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/11.0 CC=pgcc CXX=pgc++ --disable-fortran
and it prints:
checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
checking cuda_runtime.h usability... yes
checking cuda_runtime.h presence... yes
checking for cuda_runtime.h... yes
So UCX seems to be ok.
After this I re-configured OpenMPI with:
./configure --with-ucx=/home/marco/Downloads/ucx-1.9.0/install --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/11.0 CC=pgcc CXX=pgc++ --disable-mpi-fortran
and it prints:
CUDA support: yes
Open UCX: yes
If I try to run the application with: mpirun -np 2 -mca pml ucx -x ./a.out (as suggested on openucx.org) I get the errors:
match_arg (utils/args/args.c:163): unrecognized argument mca
HYDU_parse_array (utils/args/args.c:178): argument matching returned error
parse_args (ui/mpich/utils.c:1642): error parsing input array
HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments
main (ui/mpich/mpiexec.c:148): error parsing parameters
I see that the directories the compilers is looking for are not the OpenMPI ones but the ones of MPICH, I don't know why. If i type which mpicc ,which mpiexec and which mpirun I get the ones of OpenMPI.
If i run with: mpiexec -n 2 ./a.out I get:
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
EDITED:
Doing the same but using the OpenMPI-4.0.5 that comes with NVIDIA HPC SDK it compiles and runs fine.
I get:
[marco-Inspiron-7501:1356251:0:1356251] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f05cfafa000)
==== backtrace (tid:1356251) ====
0 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(ucs_handle_error+0x67) [0x7f060ae06dc7]
1 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ab87) [0x7f060ae06b87]
2 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ace4) [0x7f060ae06ce4]
3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f060c7433c0]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x18e885) [0x7f060befb885]
5 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x379e6) [0x7f060b2bd9e6]
6 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_dt_pack+0xa5) [0x7f060b2bd775]
7 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d5b5) [0x7f060b2d35b5]
8 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b11d) [0x7f060b2d111d]
9 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(+0x1b577) [0x7f060b055577]
10 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x75) [0x7f060b054725]
11 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d614) [0x7f060b2d3614]
12 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4c2c7) [0x7f060b2d22c7]
13 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b5b1) [0x7f060b2d15b1]
14 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x625bd) [0x7f060b2e85bd]
15 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x61d15) [0x7f060b2e7d15]
16 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x6121a) [0x7f060b2e721a]
17 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_tag_send_nbx+0x5ec) [0x7f060b2e65ac]
18 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libmpi.so.40(mca_pml_ucx_send+0x1a3) [0x7f060dfc3b33]
=================================
[marco-Inspiron-7501:1356252:0:1356252] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd7f7afa000)
==== backtrace (tid:1356252) ====
0 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(ucs_handle_error+0x67) [0x7fd82a711dc7]
1 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ab87) [0x7fd82a711b87]
2 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ace4) [0x7fd82a711ce4]
3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fd82c04e3c0]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x18e885) [0x7fd82b806885]
5 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x379e6) [0x7fd82abc89e6]
6 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_dt_pack+0xa5) [0x7fd82abc8775]
7 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d5b5) [0x7fd82abde5b5]
8 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b11d) [0x7fd82abdc11d]
9 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(+0x1b577) [0x7fd82a960577]
10 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x75) [0x7fd82a95f725]
11 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d614) [0x7fd82abde614]
12 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4c2c7) [0x7fd82abdd2c7]
13 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b5b1) [0x7fd82abdc5b1]
14 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x625bd) [0x7fd82abf35bd]
15 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x61d15) [0x7fd82abf2d15]
16 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x6121a) [0x7fd82abf221a]
17 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_tag_send_nbx+0x5ec) [0x7fd82abf15ac]
18 /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libmpi.so.40(mca_pml_ucx_send+0x1a3) [0x7fd82d8ceb33]
=================================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node marco-Inspiron-7501 exited on signal 11 (Segmentation fault).
The error it's caused by pragma acc host_data use_device(send_buf, recv_buf)
double send_buf[NX_GLOB + 2*NGHOST];
double recv_buf[NX_GLOB + 2*NGHOST];
#pragma acc enter data create(send_buf[:NX_GLOB+2*NGHOST], recv_buf[NX_GLOB+2*NGHOST])
// Top buffer
j = jend;
#pragma acc parallel loop present(phi[:ny_tot][:nx_tot], send_buf[:NX_GLOB+2*NGHOST])
for (i = ibeg; i <= iend; i++) send_buf[i] = phi[j][i];
#pragma acc host_data use_device(send_buf, recv_buf)
{
MPI_Sendrecv (send_buf, iend+1, MPI_DOUBLE, procR[1], 0,
recv_buf, iend+1, MPI_DOUBLE, procR[1], 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
This was an issue in the 20.7 release when adding UCX support. You can lower the optimization level to -O1 work around the problem, or update your NV HPC compiler version to 20.9 where we've resolved the issue.
https://developer.nvidia.com/nvidia-hpc-sdk-version-209-downloads

OpenCV Error: Assertion failed (_img.rows * _img.cols == vecSize)

I keep getting this error
OpenCV Error: Assertion failed (_img.rows * _img.cols == vecSize) in get, file /build/opencv-SviWsf/opencv-2.4.9.1+dfsg/apps/traincascade/imagestorage.cpp, line 157
terminate called after throwing an instance of 'cv::Exception'
what(): /build/opencv-SviWsf/opencv-2.4.9.1+dfsg/apps/traincascade/imagestorage.cpp:157: error: (-215) _img.rows * _img.cols == vecSize in function get
Aborted (core dumped)
when running opencv_traincascade. I run with these arguments: opencv_traincascade -data data -vec positives.vec -bg bg.txt -numPos 1600 -numNeg 800 -numStages 10 -w 20 -h 20.
My project build is as follows:
workspace
|__bg.txt
|__data/ # where I plan to put cascade
|__info/
|__ # all samples
|__info.lst
|__jersey5050.jpg
|__neg/
|__ # neg images
|__opencv/
|__positives.vec
before I ran opencv_createsamples -img jersey5050.jpg -bg bg.txt -info info/info.lst -maxxangle 0.5 - maxyangle 0.5 -maxzangle 0.5 -num 1800
Not quite sure why I'm getting this error. The images are all converted to greyscale as well. The neg's are sized at 100x100 and jersey5050.jpg is sized at 50x50. I saw someone had a the same error on the OpenCV forums and someone suggested deleting the backup .xml files that are created b OpenCV in case the training is "interrupted". I deleted those and nothing. Please help! I'm using python 3 on mac. I'm also running these commands on an ubuntu server from digitalocean with 2GB of ram but I don't think that's part of the problem.
EDIT
Forgot to mention, after the opencv_createsamples command, i then ran opencv_createsamples -info info/info.lst -num 1800 -w 20 -h20 -vec positives.vec
I solved it haha. Even though I specified in the command the width and height to be 20x20, it changed it to 20x24. So the opencv_traincascade command was throwing an error. Once I changed the width and height arguments in the opencv_traincascade command it worked.
This error is observed when the parameters passed is not matching with the vec file generated, as rightly put by the terminal in this line
Assertion failed (_img.rows * _img.cols == vecSize)
opencv_createsamples displays the parameters passed to it for training. Please verify of the parameters used for creating samples are the same that you passed. I have attached the terminal log for reference.
mayank#mayank-Aspire-A515-51G:~/programs/opencv/CSS/homework/HAAR_classifier/dataset$ opencv_createsamples -info pos.txt -num 235 -w 40 -h 40 -vec positives_test.vec
Info file name: pos.txt
Img file name: (NULL)
Vec file name: positives_test.vec
BG file name: (NULL)
Num: 235
BG color: 0
BG threshold: 80
Invert: FALSE
Max intensity deviation: 40
Max x angle: 1.1
Max y angle: 1.1
Max z angle: 0.5
Show samples: FALSE
Width: 40 <--- confirm
Height: 40 <--- confirm
Max Scale: -1
RNG Seed: 12345
Create training samples from images collection...
Done. Created 235 samples

Resources