Why are null characters randomly written in some of my output files? - linux

I have some scripts on my RedHat server that run Microfocus COBOL programs, each of which generates a huge file of approximately 3GB in about 3 hours on average. The programs write their output files directly to the directory /my_test/files/.
The problem is that sometimes (randomly) some of the generated files contain sections of null characters in the middle of the file. When I check, and re-execute the script (with the same input parameters), the output file is generated perfectly (it doesn't contain any null characters). I've checked this many times and I'm pretty sure it is not the fault of the COBOL programs (they use quite simple operations). That folder's filesystem is 40% full.
Some programs update the database, and if they finish with return code 0, the changes are committed. I don't have any backup, which is why I'm investigating this.
This is an example of a file declaration of one of the problematic COBOL programs:
FILE-CONTROL.
SELECT MYFILE
ASSIGN TO MYFILE
ORGANIZATION IS SEQUENTIAL
ACCESS MODE IS SEQUENTIAL
FILE STATUS IS FILE-STATUS.
DATA DIVISION.
FILE SECTION.
FD MYFILE
LABEL RECORD STANDARD
RECORDING MODE F.
01 REG-OUTPUT PIC X(400).
I've also checked for nulls inside the COBOL programs before the records are written to the files, but unfortunately no nulls were spotted.
Then I thought about creating a crontab entry which executes the following script every 5 seconds:
if [[ -f /tmp/sorry_im_working ]]; then
exit
fi
trap 'rm -rf /tmp/sorry_im_working' EXIT
touch /tmp/sorry_im_working
lsof | awk 'BEGIN{
sfiles="";
} {
if($1=="PROGRAM" && $9~/my_test\/files/){
sfiles=sfiles" "$9
}
}END{
comm="find "sfiles" -newermt \x27-2 seconds\x27 -exec env LC_ALL=C bash -c \x27grep -Pq \x22\x5Cx00{200}\x22 <(tail -c 1000 {}) && echo {}\x27 \x5C\x3B";
while(comm | getline sout){
print sout;
};
close(comm);
}' >> /home/ouhma/nullfiles.txt
Therefore, I would like to ask you the following questions:
Any idea of what's going on here?
Do you know any other way to catch the latest modified files?
What other information of interest could I add to my log?

If you construct a file d containing only the literal text \x00 (a backslash, an x and two zeros):
hexdump -C d
00000000 5c 78 30 30 0a |\x00.|
00000005
and you run:
grep -Faq '\x00' d;echo $?
0
But there is no null character inside d; grep -F matched the literal string.
Maybe it is better to use grep -Paq '\x00'
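If you only need to know how many real NUL bytes a file holds, a regex-free check avoids the literal-vs-escape confusion altogether. This is a minimal sketch, assuming a POSIX tr and wc; the function name count_nuls is made up for illustration:

```shell
#!/bin/bash
# count_nuls FILE: print the number of NUL (0x00) bytes in FILE.
# tr -cd '\000' deletes every byte except NUL, so wc -c counts only NULs.
count_nuls() {
    local n
    n=$(tr -cd '\000' < "$1" | wc -c)
    echo "$((n))"
}

demo=$(mktemp)
printf '\\x00\n' > "$demo"    # the five literal characters from the hexdump above
count_nuls "$demo"            # prints 0
printf 'a\000b\n' > "$demo"   # one real NUL byte
count_nuls "$demo"            # prints 1
rm -f -- "$demo"
```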

Depending on the configuration and the record structure that is used for the file, MF will pad different characters with hex null.
Please copy the 'ASSIGN' clause and the 'FD' clause of the COBOL program.
BTW: if your COBOL programs run three hours to do some calculations and write three GB of data back, you should investigate the storage and/or get a COBOL programmer to check the programs; that sounds much too slow.

I suspect you have non-printable characters in your file. The null inserts can be controlled; take a look at the INSERTNULL file configuration option.

Related

Is it possible to display a file's contents and delete that file in the same command?

I'm trying to display the output of an AWS lambda that is being captured in a temporary text file, and I want to remove that file as I display its contents. Right now I'm doing:
... && cat output.json && rm output.json
Is there a clever way to combine those last two commands into one command? My goal is to make the full combined command string as short as possible.
For cases where it is possible to control the name of the temporary text file, and the file is not used by other code, it is possible to pass "/dev/stdout" as the name of the output.
Regarding portability: see stack exchange how portable ... /dev/stdout
POSIX 7 says they are extensions.
Base Definitions,
Section 2.1.1 Requirements:
The system may provide non-standard extensions. These are features not required by POSIX.1-2008 and may include, but are not limited to:
[...]
• Additional character special files with special properties (for example,  /dev/stdin, /dev/stdout,  and  /dev/stderr)
Using /dev/tty, which POSIX does require, will force the output onto the "current" terminal, making it impossible to pipe the output of the whole command into a different program (or log file), or to use the program when there is no connected terminal (cron job, or other automation tools).
No, you cannot easily remove the lines of a file while displaying them. It would be highly inefficient as it would require removing characters from the beginning of a file each time you read a line. Current filesystems are pretty good at truncating lines at the end of a file, but not at the beginning.
A simple but extremely slow method would look like this:
while [ -s output.json ]
do
head -1 output.json
sed -i 1d output.json
done
While this algorithm is plain and simple, you should know that each time you remove the first line with sed -i 1d it will copy the whole content of the file but the first line into a temporary file, resulting in approximately 0.5*n² lines written in total (where n is the number of lines in your file).
In theory you could avoid this by doing something like this:
while [ -s output.json ]
do
line=$(head -1 output.json)
printf -- '%s\n' "$line"
fallocate -c -o 0 -l $((${#line}+1)) output.json
done
But this does not account for variable newline characters (namely DOS-formatted newlines) and fallocate does not always work on xfs, among other issues.
Since you are trying to consume a file alongside its creation without leaving a trace of its existence on disk, you are essentially asking for a pipe functionality. In my opinion you should look into how your output.json file is produced and hopefully you can pipe it to a script of your own.
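If the file name cannot be changed and a pipe is not an option, wrapping the two commands from the question in a shell function at least keeps the call site short. A minimal sketch; the name show_and_rm is hypothetical:

```shell
#!/bin/bash
# show_and_rm FILE: print FILE, then delete it only if printing succeeded.
show_and_rm() {
    cat -- "$1" && rm -- "$1"
}
```

With this defined, the tail of the command in the question becomes ... && show_and_rm output.json. The file is kept if cat fails, the same as with the original && chain.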

balancing the bash calculations

We have a tool for cutting adaptors https://github.com/vsbuffalo/scythe/blob/master/README.md and we wanted it to be used on all the files in the raw folder and make an output of each file separately as OUT+File Name.
Something is wrong with this script I wrote, because it doesn't take each file separately, and the whole thing doesn't work properly. It generates an empty file named OUT+files.
The expected operation looks like this:
take file1, use scythe on it, write output as OUTfile1
take file2 etc.
#!/bin/bash
FILES=/home/dave/raw/*
for f in $FILES
do
echo "Processing the $f file..."
/home/deve/scythe/scythe -a /home/dev/scythe/illumina_adapters.fa -o "OUT"+$f $f
done
Additionally, I noticed (testing for a single file) that the script uses only one core out of 130 available. Is there any way to improve it?
There is no string concatenation operator in shell. Use juxtaposition instead; it's "OUT$f", not "OUT"+$f.
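One more thing to watch: $f holds the full path, so "OUT$f" expands to something like OUT/home/dave/raw/file1 and the prefix lands in front of the directory. Below is a sketch of the loop that puts the prefix on the file name only. The SCYTHE, ADAPTERS and RAW_DIR defaults and the out_name helper are assumptions for illustration (the paths in the question are not consistent), so adjust them to your setup:

```shell
#!/bin/bash
# Assumed locations; override them from the environment or edit them here.
SCYTHE=${SCYTHE:-scythe}
ADAPTERS=${ADAPTERS:-illumina_adapters.fa}
RAW_DIR=${RAW_DIR:-/home/dave/raw}

# out_name FILE: print "<dir>/OUT<basename>" for FILE.
out_name() {
    printf '%s/OUT%s\n' "$(dirname -- "$1")" "$(basename -- "$1")"
}

for f in "$RAW_DIR"/*; do
    [ -f "$f" ] || continue    # skips directories and an unmatched glob
    case $(basename -- "$f") in
        OUT*) continue ;;      # do not reprocess outputs of an earlier run
    esac
    echo "Processing the $f file..."
    "$SCYTHE" -a "$ADAPTERS" -o "$(out_name "$f")" "$f"
done
```

This still processes one file at a time. To use more cores, one option is to append & to the scythe line and put a wait after done, though that starts all jobs at once.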

How to use sed command to delete lines without backup file?

I have large file with size of 130GB.
# ls -lrth
-rw-------. 1 root root 129G Apr 20 04:25 syslog.log
I need to reduce the file size by deleting the lines that start with "Nov 2", so I ran the following command:
sed -i '/Nov 2/d' syslog.log
I can't edit the file with the Vim editor either.
When I run the sed command, it also creates a backup file, but I don't have much space in root. Please suggest an alternative way to delete particular lines from this file without using more space on the server.
It does not create a real backup file. sed is a stream editor. When applied to a file with option -i it will stream that file through the sed process, write the output to a new file (a temporary one), when everything is done, it will rename the new file to the original name.
(There are options to create backup files also, but you didn't give them, so I won't mention that further.)
In your case you have a very large file and don't want to create any copy, however temporary. For this you need to open the file for reading and writing at the same time, then your sed process can overwrite the original. After this, you will have to truncate the file at the end of the writing.
To demonstrate how this can be done, we first perform a test case.
Create a test file, containing lots of lines:
seq 0 999999 > x
Now, let's say we want to remove all lines containing the digit 4:
grep -v 4 1<>x <x
This will open the file for reading and writing as STDOUT (1), and for reading as STDIN. The grep command will read all lines and will output only the lines not containing a 4 (option -v).
This will effectively overwrite the beginning of the original file.
You will not know how long the output is, so after the output the original contents of the file will appear:
…
999991
999992
999993
999995
999996
999997
999998
999999
537824
537825
537826
537827
537828
537829
…
You can use the Unix tool truncate to shorten your file manually afterwards. In a real scenario you will have trouble finding the right spot for this, so it makes sense to count the number of bytes written (using wc):
(Don't forget to recreate the original x for this test.)
(grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c
This will perform the step above and additionally print the number of bytes written to the terminal; in this example case the output will be 3653658. Now use truncate:
truncate -s 3653658 x
Now you have the result you want.
If you want to do this in a script, i.e. without interaction, you can use this:
length=$( (grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c)
truncate -s "$length" x
I cannot guarantee that this will work for files >2GB or >4GB on your machine; depending on your operating system (32bit?) and the versions of the installed tools you might run into largefile issues. I'd perform tests with large files first (>4GB as this is typically a limit for many things) and then cross your fingers and give it a try :)
Some caveats you have to keep in mind:
Of course, nobody is supposed to append log entries to that log file while the procedure is running.
Also, any abort during the running of the process (power failure, signal caught, etc.) will leave the file in an undefined state. But re-running the command again after such a mishap will in most cases produce the correct output; some lines might be doubled, but not more than a single line should be corrupted then.
The output must be smaller than the input, of course, otherwise the writing will overtake the reading, corrupting the whole result so that lines which should be there will be missing (or truncated at the start).
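The two steps above (overwrite from the start, then truncate at the byte count) can be wrapped in a function. This is only a sketch under the same caveats, assuming bash, GNU truncate and a Linux-style /dev/stderr; the name filter_in_place is made up:

```shell
#!/bin/bash
# filter_in_place PATTERN FILE
# Removes every line matching PATTERN from FILE without a temporary copy.
filter_in_place() {
    local pattern=$1 file=$2 length
    [ -f "$file" ] || return 1
    # grep reads FILE; tee overwrites FILE from the start (1<> opens it
    # read-write without truncating) and copies the same bytes to stderr,
    # which is redirected into wc so the bytes written get counted.
    length=$( (grep -v -- "$pattern" <"$file" | tee /dev/stderr 1<>"$file") 2>&1 | wc -c)
    # Cut off the stale tail left over from the original contents.
    truncate -s "$((length))" "$file"
}
```

For the log file in the question this would be called as filter_in_place 'Nov 2' syslog.log, ideally after trying it on a scratch copy of a smaller file first.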

change multiple files commandline

I have separated some tracks from mp3 mixes using mp3splt.
BASH: mp3splt -c [cuefile.cue] [nonstopmix.mp3] (-c means: use a .cue file; for anyone interested, mp3splt is in the Ubuntu repos)
I ended up with filenames like "Antares" - 01 - "Xibalba".mp3, which is not a format I prefer. I've made it a little project to change them with a shell script, but it's more difficult than I anticipated.
I want to change the filename from:
"Antares" - 01 - "Xibalba".mp
to:
01-Antares_-_Xibalba.mp3
so far I've used :
for var in *.mp3; do mv $var {var/"/}; done
and I could repeat that until I'm through, delete the 0x number and add one, but I'd like to do it more efficiently.
Could anyone give me a pointer (!not a script!)?
I'd still like to write it myself, but there are so many options that I'm a bit lost.
so far I thought to use this program flow:
read all the filenames containing .mp3 and declare as variable $var
strip $var from quotes
select 0x number, append delimiter _ (0x_)
move 0x_ to the beginning of the string
select remaining ' - - ' and change to '-'
done
Which bash programs should I use? Changing the 0x number especially puzzles me, because I need a loop which increments this number, tests whether it is present in the filename variable, and then changes it.
It is easy to do in python 2.x. You can use this logic in any language you want.
import string
a=raw_input('Enter the name of song')
a=a.replace('"', "")
a=a.replace('.mp', ' .mp3')
words = a.split()
print words[2]+'-'+words[0]+'_-_'+words[4]+words[5]
Logic:
I removed the ", then changed .mp to .mp3, then split the string, which created a list (array), and then printed the elements in the order needed.
Try doing this :
rename -n 's/"(\w+)"\s+-\s*(\d+)\s*-\s*"(\w+)"\.mp/$2-$1_-_$3.mp3/' *mp
from the shell prompt. It's very useful; you can put Perl tricks in it, like the substitution I use here.
You can remove the -n (dry-run mode switch) once your tests look valid.
There are other tools with the same name which may or may not be able to do this, so be careful.
If you run the following command (linux)
$ file $(readlink -f $(type -p rename))
and you have a result like
.../rename: Perl script, ASCII text executable
then this seems to be the right tool =)
If not, to make it the default (usually already the case) on Debian and derivatives like Ubuntu:
$ sudo update-alternatives --set rename /path/to/rename
Last but not least, this tool was originally written by Larry Wall, the creator of Perl.
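If you would rather stay in pure bash, the pointer is the [[ string =~ regex ]] test together with the BASH_REMATCH array, which holds the captured groups. The number is captured straight from the name, so no incrementing loop is needed. A dry-run sketch, assuming the names always follow the "Artist" - NN - "Title" pattern; the helper name new_name is hypothetical:

```shell
#!/bin/bash
# new_name OLD: print the rearranged name, or fail if OLD does not match.
# The extension is accepted as either .mp or .mp3.
new_name() {
    local re='^"([^"]+)" - ([0-9]+) - "([^"]+)"\.mp3?$'
    [[ $1 =~ $re ]] || return 1
    printf '%s-%s_-_%s.mp3\n' "${BASH_REMATCH[2]}" "${BASH_REMATCH[1]}" "${BASH_REMATCH[3]}"
}

# Dry run: only prints the mv commands it would execute.
for f in *.mp3 *.mp; do
    [ -e "$f" ] || continue
    if new=$(new_name "$f"); then
        printf 'mv -- %q %q\n' "$f" "$new"
    fi
done
```

Once the printed commands look right, replacing the printf line with mv -- "$f" "$new" performs the rename.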

How to tell binary from text files in linux

The linux file command does a very good job in recognising file types and gives very fine-grained results. The diff tool is able to tell binary files from text files, producing a different output.
Is there a way to tell binary files from text files? All I want is a yes/no answer whether a given file is binary. Because it's difficult to define binary, let's say I want to know if diff will attempt a text-based comparison.
To clarify the question: I do not care if it's ASCII text or XML as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.
file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".
If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).
The diff manual specifies that
diff determines whether a file is text
or binary by checking the first few
bytes in the file; the exact number of
bytes is system dependent, but it is
typically several thousand. If every
byte in that part of the file is
non-null, diff considers the file to
be text; otherwise it considers the
file to be binary.
A quick-and-dirty way is to look for a NUL character (a zero byte) in the first K or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.
Update: According to the diff manual, this is exactly what diff does.
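A rough sketch of that quick-and-dirty check in shell follows. The 4096-byte default is an arbitrary choice for illustration, not the number diff uses, and the function name looks_binary is made up:

```shell
#!/bin/bash
# looks_binary FILE [BYTES]
# Succeeds (exit 0) if a NUL byte occurs in the first BYTES bytes of FILE.
# An empty file has no NUL, so it counts as text here.
looks_binary() {
    local n
    n=$(head -c "${2:-4096}" -- "$1" | tr -cd '\000' | wc -c)
    [ "$((n))" -gt 0 ]
}
```

Usage would look like: if looks_binary somefile; then echo binary; else echo text; fi. As noted above, UTF-16 and UTF-32 text will be reported as binary by this test.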
This approach defers to the grep command in determining whether a file is binary or text:
is_text_file() { grep -qIF '' "$1"; }
grep options used:
-q Quiet; Exit immediately with zero status if any match is found
-I Process a binary file as if it did not contain matching data
-F Interpret PATTERNS as fixed strings, not regular expressions.
grep pattern used:
'' Empty string. All files (except an empty file)
will match this pattern.
Notes
An empty file is not considered a text file according to this test. (The GNU file command agrees with this assessment.)
A file with one printable character, say a, is considered a text file according to this test. (Makes sense to me.) (The file command disagrees with this assessment. (Tested with GNU file))
This approach requires only one child process to test whether a file is text or binary.
Test
# cd into a temp directory
cd "$(mktemp -d)"
# Create 3 corner-case test files
touch empty_file # An empty file
echo -n a >one_byte_a # A file containing just `a`
echo a >one_line_a # A file containing just `a` and a newline
# Another test case: a 96KiB text file that ends with a NUL
head -c 98303 /usr/share/dict/words > file_with_a_null_96KiB
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB
# Last test case: a 96KiB text file plus a NUL added at the end
head -c 98304 /usr/share/dict/words > file_with_a_null_96KiB_plus1
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB_plus1
# Defer to grep to determine if a file is a text file
is_text_file() { grep -qI '^' "$1"; }
# Test harness
do_test() {
printf '%22s ... ' "$1"
if is_text_file "$1"; then
echo "is a text file"
else
echo "is a binary file"
fi
}
# Test each of our test cases
do_test empty_file
do_test one_byte_a
do_test one_line_a
do_test file_with_a_null_96KiB
do_test file_with_a_null_96KiB_plus1
Output
empty_file ... is a binary file
one_byte_a ... is a text file
one_line_a ... is a text file
file_with_a_null_96KiB ... is a binary file
file_with_a_null_96KiB_plus1 ... is a text file
On my machine, it seems grep checks the first 96 KiB of a file for a NUL. (Tested with GNU grep). The exact crossover point depends on your machine's page size.
Relevant source code: https://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c?h=v3.6#n1550
You could try to give a
strings yourfile
command and compare the size of the result with the file size. I'm not totally sure, but if they are the same, the file is really a text file.
These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.
See here for how Subversion does it.
A fast way to do this in Ubuntu is to use Nautilus in the "list" view. The type column will show you whether it's text or binary.
Commands like less and grep detect it quite easily (and fast). You can have a look at their source.

Resources