Error while using ABySS to assemble SRA sequence - linux

I'm having some trouble using ABySS to assemble reads downloaded from the NCBI SRA.
The command I've used is:
abyss-pe name=SRR530529_1 k=27 in=/home/bilalm/H_glaber_quality_filtering/AfterQC/good_reads/SRR530529.good.fq
But the assembly process stopped with the error message 'abyss-fixmate: error: All reads are mateless.'
The tail of the error message is:
Building the Burrows-Wheeler transform...
Building the character occurrence table...
Mateless 193637763 100%
Unaligned 0
Singleton 0
FR 0
RF 0
FF 0
Different 0
Total 193637763
abyss-fixmate: error: All reads are mateless. This can happen when first and second read IDs do not match.
error: `SRR530529-3.hist': No such file or directory
/usr/bin/abyss-pe:561: recipe for target 'SRR530529-3.dist' failed
make: *** [SRR530529-3.dist] Error 1
make: *** Deleting file 'SRR530529-3.dist'
The ABySS version: 2.0.1
The fastq file size is: 52G
What has happened? What does 'mateless' mean? My intention was to clean the data and then assemble the naked mole rat genome using four fastq runs (SRR530529, SRR530530, SRR530531 & SRR530532) which I have downloaded from the NCBI SRA.
Cheers, Billy.

You are running abyss-pe in paired-end mode, but you are feeding it only one file. With paired-end reads, each mate pair must share the same read ID so the two reads can be recognized as a pair. Normally you would have the forward and reverse reads in two separate files, and would feed them like:
abyss-pe name=SRR530529_1 k=27 in='reads1.fa reads2.fa'
If you have single-end reads (i.e. mateless), then I'm not sure ABySS will work as well, but you can try this:
abyss-pe name=SRR530529_1 k=27 se=/home/bilalm/H_glaber_quality_filtering/AfterQC/good_reads/SRR530529.good.fq
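If the run is actually paired-end, it may be easiest to re-download it with the mates split into two files. A minimal sketch using the SRA Toolkit (assuming fastq-dump is available and SRR530529 is a paired-end run):
# download the run with mates split into _1/_2 files
fastq-dump --split-files SRR530529
# pass both files so abyss-fixmate can pair reads by ID
abyss-pe name=SRR530529 k=27 in='SRR530529_1.fastq SRR530529_2.fastq'
Note that any quality filtering then needs to keep the two files in sync; if reads are dropped from one file only, the first and second read IDs will stop matching, which is exactly what abyss-fixmate complains about.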

Related

Slurm not running job generated by snakemake because file, directory or other parameter too long

I have a snakemake rule that has 630k input file dependencies. This rule concatenates the files together with an R script. The R script doesn't take any input files on the command line; it grabs them from within the script itself. When I run this on our HPC via slurm, I'm getting the following error message...
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 4950
Job stats:
job                 count  min threads  max threads
------------------  -----  -----------  -----------
all_targets             1            1            1
simA_pool_clusters      1            1            1
total                   2            1            1
Select jobs to execute...
[Sun Feb 12 13:30:05 2023]
rule simA_pool_clusters:
input: workflow/scripts/simA_pool_clusters.R, data/sim_a/s1_1000_1.nofilter.deseq.bray.clusters.tsv, data/sim_a/s1_1000_1.nofilter.deseq.euclidean.clusters.tsv, [snip...]
output: data/simulation_cluster_accuracy.tsv
jobid: 194145
reason: Missing output files: data/simulation_cluster_accuracy.tsv
resources: mem_mb=2000, disk_mb=1000, tmpdir=<TBD>, cores=1, partition=standard, time_min=120, job_name=rare
sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long
Error submitting jobscript (exit code 1):
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-02-11T093752.352072.snakemake.log
Here is the snakemake rule:
rule simA_pool_clusters:
    input:
        R="workflow/scripts/simA_pool_clusters.R",
        tsv=expand("data/sim_a/{frac}_{depth}_{rep}.{preproc}.{transform}.{distance}.clusters.tsv",
                   frac = fracs, depth = depths, rep = reps, preproc = preprocs,
                   transform = transforms, distance = distances)
    conda:
        "envs/nr-modern.yml"
    output:
        "data/simulation_cluster_accuracy.tsv"
    shell:
        """
        {input.R}
        """
The input.tsv expands to 630k small files; I've shortened the list for ease of posting. The input.R is an executable R script with a shebang line - like I said, it finds the *clusters.tsv files through its own logic.
I'm wondering if it's possible that snakemake is sending the entire value of input.tsv to slurm rather than just the R script. Any suggestions to try before I run the R script manually outside of snakemake?
"The R script doesn't take any input files but will grab them from within the R script"
So you have an R script with 630k file paths hardcoded in it? If so, this link says the slurm script has a limit of 4 MB and you are probably over it; in that case it is a limitation of slurm rather than snakemake.
If the above is about right, you could write the 630k file paths to a text file first (one rule), and then use that file as the input to another rule, as in the sketch below.
Loosely related, I wonder whether the file system is going to struggle with moving around so many files. Perhaps you could consider some refactoring...
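A minimal sketch of that two-rule idea (the rule and file names here are hypothetical, and it assumes the *.clusters.tsv files already exist or are produced elsewhere in the workflow):
rule list_cluster_files:
    output:
        "data/cluster_file_list.txt"
    shell:
        """
        # echo is a shell builtin, so globbing 630k paths here avoids
        # the argument-size limit an external command would hit
        for f in data/sim_a/*.clusters.tsv; do echo "$f"; done > {output}
        """

rule simA_pool_clusters:
    input:
        R="workflow/scripts/simA_pool_clusters.R",
        list="data/cluster_file_list.txt"
    output:
        "data/simulation_cluster_accuracy.tsv"
    shell:
        """
        {input.R}
        """
With only two input paths, the job script snakemake hands to sbatch stays far below any length limit.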

CYGWIN, GDAL: batch processing not working; gdalwarp on Cygwin error: "too many points"

I'm attempting to mass-convert hundreds of MODIS single-band files in a single directory (which were extracted from HDF-EOS format using gdal_translate) from their Sinusoidal projection to WGS84, using GDAL in Cygwin.
When I execute the gdalwarp command on a single file, it executes the command successfully and the result is my desired output:
$ gdalwarp -s_srs "+proj=sinu +R=6371007.181 +nadgrids=#null +wktext" -t_srs EPSG:4326 testfile.tif testfile_proj_1.tif
Processing testfile.tif [1/1] : 0Using internal nodata values (e.g. -1) for image testfile.tif.
...10...20...30...40...50...60...70...80...90...100 - done.
However, when I try to process the aforementioned hundreds of files using a for loop:
$ for f in *.tif; do
> gdalwarp -s_srs "+proj=sinu +R=6371007.181 +nadgrids=#null +wktext" -t_srs EPSG:4326 "$f" "${f%.*}.tif"
> done
it returns errors:
ERROR 1: Too many points (529 out of 529) failed to transform, unable to compute output bounds.
Warning 1: Unable to compute source region for output window 0,0,2400,2400, skipping.
I am baffled as to why the command will work on a single file, but won't in a mass conversion.
How do I get this to work? Are there any alternatives to GDAL that I can use to mass reproject TIFFs if this is a wall I hit?
Thank you in advance.
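One thing worth checking: for an input name already ending in .tif, "${f%.*}.tif" expands back to the very same name as "$f", so the loop above asks gdalwarp to write over its own input. A sketch of the same loop with a distinct output path instead (the warped/ directory and _proj suffix are just placeholder names):
mkdir -p warped
for f in *.tif; do
  # the output name must differ from "$f"; warped/${f%.*}_proj.tif is hypothetical
  gdalwarp -s_srs "+proj=sinu +R=6371007.181 +nadgrids=#null +wktext" -t_srs EPSG:4326 "$f" "warped/${f%.*}_proj.tif"
done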

String method tutorial in abinit - no output file produced

I'm trying to follow this tutorial in abinit: https://docs.abinit.org/tutorial/paral_images/
When trying to run abinit for any of the tstring files, no output file is produced. For instance, I copy the files tstring_01.in and tstring.files into the subdirectory work_paral_string, edit tstring.files with the appropriate file names, and run the command mpirun -n 20 abinit < tstring.files > log 2> err. No error message is shown, but no output is produced either. (The expected output file would be tstring_01.out.)
Any suggestions?

Oracle Pro*C 12.1 - standard output content not redirected to file when generating core dump

I am facing a rather strange issue with the Oracle Pro*C precompiler on linux: as part of our build process, we invoke the proc utility to generate .cxx files that later get compiled as C++ source files. This proc utility is called through a python script, which captures both stdout and stderr and prints them both in case of a non-zero return code. Whenever the precompiler encounters a compilation error, it reports the errors on standard output, which get correctly printed out, and returns a non-zero return code.
However, in our CI environment the precompiler systematically crashes with a negative return code, and nothing gets printed on either standard output or standard error.
My ultimate goal is to understand this crash and fix it, but I am unable to reproduce that particular crash outside our CI environment. However, I managed to trigger a different crash of the proc utility on a linux VM by passing bogus include folders as arguments. And I witness a strange behaviour in my bash terminal, which explains why I get no output at all from my python script. When calling proc directly, the error message is correctly printed out in my terminal:
$> /path/to/proc option_1=foo option2=bar
Pro*C/C++: Release 12.1.0.2.0 - Production on Tue Dec 4 08:13:31 2018
Copyright (c) 1982, 2014, Oracle and/or its affiliates. All rights reserved.
System default option values taken from: /usr/lib/oracle/12.1/client64/lib/precomp/admin/pcscfg.cfg
Error at line 3, column 10 in file /usr/include/c++/8/x86_64-redhat-linux/bits/c++config.h
#include <bits/wordsize.h>
.........1
PCC-S-02015, unable to open include file
Error at line 39, column 10 in file /usr/include/c++/8/x86_64-redhat-linux/bits/os_defines.h
#include <features.h>
.........1
PCC-S-02015, unable to open include file
Syntax error at line 44, column 21, file /usr/include/c++/8/x86_64-redhat-linux/bits/os_defines.h:
Error at line 44, column 21 in file /usr/include/c++/8/x86_64-redhat-linux/bits/os_defines.h
#if __GLIBC_PREREQ(2,15) && defined(_GNU_SOURCE)
....................1
PCC-S-02201, Encountered the symbol "," when expecting one of the following:
)
Syntax error at line -1741187720, column 0, file p�:
INTERNAL ERROR: Failed assertion [PGE Code=90105]
Segmentation fault (core dumped)
$>
When I redirect standard output to a file, no error message gets printed except the last line about the core dump being generated, and the file containing the redirected output is empty:
$> /path/to/proc option_1=foo option2=bar > test.txt
Segmentation fault (core dumped)
$> more test.txt
$> ls -al test.txt
-rw-r--r-- 1 me staff 0 3 déc 20:27 test.txt
$>
Piping the output to cat also results in nothing being printed at all:
$> /path/to/proc option_1=foo option2=bar | cat
$>
Based on that, I have two questions:
How is it possible that the output does not make it to a file when redirected?
How else could I attempt to capture it?
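For what it's worth, these symptoms are consistent with stdio buffering: stdout is usually line-buffered when attached to a terminal but fully buffered when redirected to a file or pipe, and a segmentation fault kills the process before the buffer is flushed, so whatever was queued never reaches the file. Assuming GNU coreutils is available, forcing unbuffered output with stdbuf might let the messages through:
stdbuf -o0 -e0 /path/to/proc option_1=foo option2=bar > test.txt 2>&1
(stdbuf only affects programs that use the default C stdio buffering, so this is a guess worth trying rather than a guaranteed fix.)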

cscript - Invalid procedure call or argument when running a vbs file

I've been trying to use check_time.vbs to check the Windows time.
Here's the script: http://pastebin.com/NfUrCAqU
The help message can be displayed:
C:\Program Files\NSClient++\scripts>cscript //NoLogo check_time.vbs /?
check_time.vbs V1.01
Usage: cscript /NoLogo check_time.vbs serverlist warn crit [biggest]
Options:
serverlist (required): one or more server names, coma-separated
warn (required): warning offset in seconds, can be partial
crit (required): critical offset in seconds, can be partial
biggest (optional): if multiple servers, else use default least offset
Example:
cscript /NoLogo check_time.vbs myserver1,myserver2 0.4 5 biggest
But I get the following error when running:
C:\Program Files\NSClient++\scripts>cscript //NoLogo check_time.vbs 0.asia.pool.ntp.org 20 50
C:\Program Files\NSClient++\scripts\check_time.vbs(53, 1) Microsoft VBScript runtime error: Invalid procedure call or argument
Manually executing w32tm still works fine.
What might be the cause of this?
IIRC the (53, 1) indicates that the error is on line 53. At this point the script is expecting an array of regexp matches with at least one item (index 0) and one sub-match (i.e. the object in position 0 in the array has an array property called SubMatches with at least one item in it).
It is not checking to make sure this structure is present and correct before trying to use it.
My assumption is that the regexp call is failing to find anything to match, presumably because the input string is not in the expected format. You could output the content of strOutput before that line to see what it contains; it could be a date/time representation in a different localized form than the regexp is designed for. You could also output the content of input after each call to objProc.StdOut.ReadLine. This would show you whether the call to w32tm.exe returned a useful error message that the script is skipping over, since it only looks for the value returned when all is well and ignores the possibility of different output.
The culprit is the /nowarn argument:
w32tm /monitor /nowarn /computers:0.asia.pool.ntp.org
The following arguments were unexpected: /nowarn
Exiting with error 0x80070057
Remove it from the script, and now it works:
cscript //NoLogo check_time.vbs 0.uk.pool.ntp.org 20 50
NTP OK: Offset -2.4262131 secs|'offset'=-2.4262131s;20;50;
