Read characters from a text file using bash - linux

Does anyone know how I can read the first two characters from a file using a bash script? The file in question is actually an I/O driver; it has no newline characters in it and is, in effect, infinitely long.

The read builtin supports the -n parameter:
$ echo "Two chars" | while read -n 2 i; do echo $i; done
Tw
o
ch
ar
s
$ cat /proc/your_driver | (read -n 2 i; echo $i;)
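If your bash is 4.1 or newer (an assumption on my part), read -N may fit even better: unlike -n, it reads exactly N characters and does not stop at a delimiter. A minimal sketch using the device path from above:
read -r -N 2 two_chars < /proc/your_driver
printf '%s\n' "$two_chars"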

I think
dd if=your_file ibs=2 count=1 will do the trick
Looking at it with strace shows it is effectively doing a two-byte read from the file.
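If you want to verify that yourself, something like this should show the two-byte read among dd's startup reads (assuming strace is installed; exact output varies):
strace -e trace=read dd if=your_file ibs=2 count=1 of=/dev/null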
Here is an example reading from /dev/zero, piped into hd to display the zeros:
dd if=/dev/zero bs=2 count=1 | hd
1+0 records in
1+0 records out
2 bytes (2 B) copied, 2.8497e-05 s, 70.2 kB/s
00000000 00 00 |..|
00000002

echo "Two chars" | sed 's/../&\n/g'

G'day,
Why not use od to get the slice that you need?
od --read-bytes=2 my_driver
Edit: You can't use head for this because head just writes the raw bytes to stdout; if the first two chars are not printable, you won't see anything.
The od command has several options to format the bytes as you want.
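For example, something along these lines (assuming GNU od) prints the two bytes as hex with no address column:
od -An -N2 -tx1 my_driver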

Related

get a part of a binary file using gnu-coreutils, bash

I want to get a part of a binary file, from byte #480161397 to #480170447 (inclusive, 9051 bytes in total).
I used cut -b and expected the size of trunk1.gz to be 9051 bytes, but I get a different result.
$ wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-07/segments/1454701152097.59/warc/CC-MAIN-20160205193912-00264-ip-10-236-182-209.ec2.internal.warc.gz
$ cut -b480161397-480170447 CC-MAIN-20160205193912-00264-ip-10-236-182-209.ec2.internal.warc.gz >trunk1.gz
$ echo $((480170447-480161397+1))
9051
$ ls -l trunk1.gz
-rw-r--r-- 1 david staff 3400324 Sep 8 10:28 trunk1.gz
What is wrong?
cut -bN-M copies bytes N through M from every line of the input.
Example:
$ cut -b4-7 <<END
0123456789
abcdefghij
ABCDEFGHIJ
END
Output:
3456
defg
DEFG
Consider using dd for your purposes.
If you work with binary data, I advise you to use the dd command.
dd if=CC-MAIN-20160205193912-00264-ip-10-236-182-209.ec2.internal.warc.gz bs=1 skip=480161397 count=9051 of=output.bin
bs is the block size and is set to 1 byte.
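bs=1 can be slow for a skip this large; if your dd is from GNU coreutils (8.16 or later, an assumption), iflag=skip_bytes,count_bytes lets you keep a bigger block size while still interpreting skip and count as bytes:
dd if=CC-MAIN-20160205193912-00264-ip-10-236-182-209.ec2.internal.warc.gz of=output.bin bs=64K skip=480161397 count=9051 iflag=skip_bytes,count_bytes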

Real-time monitoring of intermittent binary data

Context: monitor a low-volume, intermittent stream from a program
When debugging a program, one sometimes has to monitor its output. When the output is ASCII, no problem: just run it in a terminal (the program itself, or nc if it uses a TCP or UDP interface, or cat /dev/somedevice, or socat ..., whatever).
Need: monitor a binary stream real-time... and half-solutions
But sometimes the output is binary. One can pipe it into various incantations of od or hd, e.g. od -t d1 for decimal numbers, od -t a1 for augmented ASCII display (non-printable characters shown explicitly using printable ones), etc.
The trouble is: those buffer the input until they have a complete line to print (a line typically covering 16 input bytes). So basically, until the program sends 16 characters, the monitoring does not show anything. When the stream is low volume and/or intermittent, this defeats the purpose of real-time monitoring. Many protocols indeed only send a handful of bytes at a time.
It would be nice to tell it "ok buffer the input if you wish, but don't wait more than delay x before printing it, even if it won't fill one line".
man od, man hd don't mention any related option.
Non-solution
Heavy programs like wireshark are not really an option: they cover only part of the needs and are not combinable. I often do things like this:
{ while read a ; do { echo -n -e 'something' ; } | tee >(od -t d1 >&2) ; done ; } | socat unix-connect:/somesocket stdio | od -t d1
This monitors the output and, each time I press Enter in the terminal, injects the sequence "something". It works very well, but the terminal output is buffered in 16-byte chunks and thus delayed a lot.
Summary
How do you simply monitor binary output from a program without byte-alignment-dependent delay?
I don't know which distribution you're using, but check to see whether you have, or can install, most. From the manpage:
OPTIONS
-b Binary mode. Use this switch when you want
to view files containing 8 bit characters.
most will display the file 16 bytes per line
in hexadecimal notation. A typical line
looks like:
01000000 40001575 9C23A020 4000168D ....@..u.#. @...
When used with the -v option, the same line
looks like:
^A^@^@^@ @^@^U u 9C #A0 @^@^V8D ....@..u.#. @...
Not in the manpage, but essential for your task, is the keystroke F (N.B. upper-case), which puts most into 'tail mode'. In this mode, most updates whenever new input is present.
On the downside, most can't be told to begin in tail mode, so you can't just pipe to its stdin (it will try to read it all before showing anything). So you'll need to
<your_command> >/tmp/output
in the background, or in its own terminal, as appropriate. Then
most -b /tmp/output
and press F.
Here's the best thing I found so far, hope someone knows something better.
Use the -w1 option in od, or one of the example format strings in man hd that eat one byte at a time. It kind of works, though it makes a column-based display, which does not use the terminal area efficiently.
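For the od route, a minimal sketch of the monitoring end (assuming GNU od, where -w1 means one input byte per output line and -v keeps repeated lines):
socat unix-connect:/somesocket stdio | od -An -v -tx1 -w1
The hexdump variant I ended up with is: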
{ while read a ; do { echo -n -e 'something' ; } | tee >(od -t d1 >&2) ; done ; } | socat unix-connect:/somesocket stdio | hexdump -v -e '/1 "%_ad# "' -e '/1 " _%_u\_\n"'
This displays things like this:
0# _nul_ (plenty of wasted space on the right... )
1# _1_ (no overview of what's happening...)
2# _R_
3# _ _
4# _+_
5# _D_
6# _w_
7# _d8_
8# _ht_
9# _nak_
The good thing is that one can configure it to suit their context and taste:
{ while read a ; do { echo -n -e 'something' ; } | tee >(od -t d1 >&2) ; done ; } | socat unix-connect:/somesocket stdio | hexdump -v -e '/1 "%_ad# "' -e '/1 "%02X "' -e '/1 " _%_u\_\n"'
This displays things like this:
0# 6B _k_
1# 21 _!_
2# F6 _f6_
3# 7D _}_
4# 07 _bel_
5# 07 _bel_
6# 60 _`_
7# CA _ca_
8# CC _cc_
9# AB _ab_
But still, a regular hd display would use screen area more efficiently:
hd
00000000 f2 76 5d 82 db b6 88 1b 43 bf dd ab 53 cb e9 19 |.v].....C...S...|
00000010 3b a8 12 01 3c 3b 7a 18 b1 c0 ef 76 ce 28 01 07 |;...<;z....v.(..|

Add line feed every 2391 byte

I am using Redhat Linux 6.
I have a file which comes from a mainframe (MVS) with EBCDIC-to-ASCII conversion.
(But I suspect some of the conversion may be wrong.)
Anyway, I know that the record length is 2391 bytes. There are 10 records and the file size is 23910 bytes.
In each 2391-byte record, there are many 0a or 0d chars (not CRLF). I want to replace them with, say, # and #.
Also, I want to add an LF (i.e. 0a) every 2391 bytes so as to make the file a normal Unix text file for further processing.
I have tried to use
dd ibs=2391 obs=2391 if=emyfile of=myfile.new
But this does not work; both files are the same.
I also try
dd ibs=2391 obs=2391 if=myfile | awk '{print $0}'
But this does not work either.
Can anyone help with this?
Something like this:
#!/bin/bash
for i in {0..9}; do
    dd if=emyfile bs=2391 count=1 skip=$i | LC_CTYPE=C tr '\r\n' '##'
    echo
done > newfile
If your files are longer, you will need more than 10 iterations. I would handle that by running an infinite loop and exiting the loop on error, like this:
#!/bin/bash
i=0
while :; do
    dd if=emyfile bs=2391 count=1 skip=$i | LC_CTYPE=C tr '\r\n' '##'
    [ ${PIPESTATUS[0]} -ne 0 ] && break
    echo
    ((i++))
done > newfile
However, on my iMac under OSX, dd doesn't seem to exit with an error when you go past end of file - maybe try your luck on your OS.
You could try
$ dd bs=2391 cbs=2391 conv=ascii,unblock if=emyfile of=myfile.new
conv=ascii converts from EBCDIC to ASCII. conv=unblock inserts a newline at the end of each cbs-sized block (after removing trailing spaces).
If you already have a file in ASCII and just want to replace some characters in it before splitting the blocks, you could use tr(1). For example, the following will replace each carriage return with '#' and each newline (linefeed) with '#':
$ tr '\r\n' '##' < emyfile | dd bs=2391 cbs=2391 conv=unblock of=myfile.new

Binary grep on Linux?

Say I have generated the following binary file:
# generate file:
python -c 'import sys;[sys.stdout.write(chr(i)) for i in (0,0,0,0,2,4,6,8,0,1,3,0,5,20)]' > mydata.bin
# get file size in bytes
stat -c '%s' mydata.bin
# 14
And say, I want to find the locations of all zeroes (0x00), using a grep-like syntax.
The best I can do so far is:
$ hexdump -v -e "1/1 \" %02x\n\"" mydata.bin | grep -n '00'
1: 00
2: 00
3: 00
4: 00
9: 00
12: 00
However, this implicitly converts each byte in the original binary file into a multi-byte ASCII representation, on which grep operates; not exactly the prime example of optimization :)
Is there something like a binary grep for Linux? Possibly, also, something that would support a regular expression-like syntax, but also for byte "characters" - that is, I could write something like 'a(\x00*)b' and match 'zero or more' occurrences of byte 0 between bytes 'a' (97) and 'b' (98)?
EDIT: The context is that I'm working on a driver, where I capture 8-bit data; something goes wrong in the data, which can be kilobytes up to megabytes, and I'd like to check for particular signatures and where they occur. (so far, I'm working with kilobyte snippets, so optimization is not that important - but if I start getting some errors in megabyte long captures, and I need to analyze those, my guess is I would like something more optimized :) . And especially, I'd like something where I can "grep" for a byte as a character - hexdump forces me to search strings per byte)
EDIT2: same question, different forum :) grepping through a binary file for a sequence of bytes
EDIT3: Thanks to the answer by @tchrist, here is also an example with 'grepping' and matching, and displaying the results (although not quite the same question as the OP's):
$ perl -ln0777e 'print unpack("H*",$1), "\n", pos() while /(.....\0\0\0\xCC\0\0\0.....)/g' /path/to/myfile.bin
ca000000cb000000cc000000cd000000ce # Matched data (hex)
66357 # Offset (dec)
To have the matched data grouped as one byte (two hex characters) each, "H2 H2 H2 ..." needs to be specified for as many bytes as there are in the matched string; as my match '.....\0\0\0\xCC\0\0\0.....' covers 17 bytes, I can write "H2"x17 in Perl. Each of these "H2" will return a separate value (as in a list), so join also needs to be used to add spaces between them; eventually:
$ perl -ln0777e 'print join(" ", unpack("H2 "x17,$1)), "\n", pos() while /(.....\0\0\0\xCC\0\0\0.....)/g' /path/to/myfile.bin
ca 00 00 00 cb 00 00 00 cc 00 00 00 cd 00 00 00 ce
66357
Well... Perl is indeed a very nice 'binary grepping' facility, I must admit :) As long as one learns the syntax properly :)
This seems to work for me:
grep --only-matching --byte-offset --binary --text --perl-regexp "<\x-hex pattern>" <file>
Short form:
grep -obUaP "<\x-hex pattern>" <file>
Example:
grep -obUaP "\x01\x02" /bin/grep
Output (Cygwin binary):
153: <\x01\x02>
33210: <\x01\x02>
53453: <\x01\x02>
So you can grep this again to extract offsets. But don't forget to use binary mode again.
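For example, to keep just the offsets, one option (a sketch; cut simply takes the field before the first colon, which avoids a second grep) is:
grep -obUaP "\x01\x02" /bin/grep | cut -d: -f1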
Someone else appears to have been similarly frustrated and wrote their own tool to do it (or at least something similar): bgrep.
One-Liner Input
Here’s the shorter one-liner version:
% perl -ln0e 'print tell' < inputfile
And here's a slightly longer one-liner:
% perl -e '($/,$\) = ("\0","\n"); print tell while <STDIN>' < inputfile
The way to connect those two one-liners is by uncompiling the first one’s program:
% perl -MO=Deparse,-p -ln0e 'print tell'
BEGIN { $/ = "\000"; $\ = "\n"; }
LINE: while (defined(($_ = <ARGV>))) {
    chomp($_);
    print(tell);
}
Programmed Input
If you want to put that in a file instead of calling it from the command line, here's a somewhat more explicit version:
#!/usr/bin/env perl
use English qw[ -no_match_vars ];
$RS = "\0"; # input separator for readline, chomp
$ORS = "\n"; # output separator for print
while (<STDIN>) {
    print tell();
}
And here’s the really long version:
#!/usr/bin/env perl
use strict;
use autodie; # for perl5.10 or better
use warnings qw[ FATAL all ];
use IO::Handle;
IO::Handle->input_record_separator("\0");
IO::Handle->output_record_separator("\n");
binmode(STDIN); # just in case
while (my $null_terminated = readline(STDIN)) {
    # this is just *past* the null we just read:
    my $seek_offset = tell(STDIN);
    print STDOUT $seek_offset;
}
close(STDIN);
close(STDOUT);
One-Liner Output
BTW, to create the test input file, I didn’t use your big, long Python script; I just used this simple Perl one-liner:
% perl -e 'print 0.0.0.0.2.4.6.8.0.1.3.0.5.20' > inputfile
You’ll find that Perl often winds up being 2-3 times shorter than Python to do the same job. And you don’t have to compromise on clarity; what could be simpler than the one-liner above?
Programmed Output
I know, I know. If you don’t already know the language, this might be clearer:
#!/usr/bin/env perl
@values = (
    0, 0, 0, 0, 2,
    4, 6, 8, 0, 1,
    3, 0, 5, 20,
);
print pack("C*", @values);
although this works, too:
print chr for @values;
as does
print map { chr } @values;
Although for those who like everything all rigorous and careful and all, this might be more what you would see:
#!/usr/bin/env perl
use strict;
use warnings qw[ FATAL all ];
use autodie;
binmode(STDOUT);
my @octet_list = (
    0, 0, 0, 0, 2,
    4, 6, 8, 0, 1,
    3, 0, 5, 20,
);
my $binary = pack("C*", @octet_list);
print STDOUT $binary;
close(STDOUT);
TMTOWTDI
Perl supports more than one way to do things so that you can pick the one that you’re most comfortable with. If this were something I planned to check in as a school or work project, I would certainly select the longer, more careful versions, or at least put a comment in the shell script if I were using the one-liners.
You can find documentation for Perl on your own system. Just type
% man perl
% man perlrun
% man perlvar
% man perlfunc
etc at your shell prompt. If you want pretty-ish versions on the web instead, get the manpages for perl, perlrun, perlvar, and perlfunc from http://perldoc.perl.org.
The bbe program is a sed-like editor for binary files. See documentation.
Example with bbe:
bbe -b "/\x00\x00\xCC\x00\x00\x00/:17" -s -e "F d" -e "p h" -e "A \n" mydata.bin
11:x00 x00 xcc x00 x00 x00 xcd x00 x00 x00 xce
Explanation
-b search pattern between //; each byte is written in hex notation as \xNN
-b works like this: /pattern/:length, with the length (in bytes) taken after the matched pattern
-s similar to 'grep -o'; suppresses unmatched output
-e similar to 'sed -e'; gives commands
-e 'F d' displays the offset before each result, here '11:'
-e 'p h' prints the results in hexadecimal notation
-e 'A \n' appends an end-of-line to each result
You can also pipe it to sed to have a cleaner output:
bbe -b "/\x00\x00\xCC\x00\x00\x00/:17" -s -e "F d" -e "p h" -e "A \n" mydata.bin | sed -e 's/x//g'
11:00 00 cc 00 00 00 cd 00 00 00 ce
Your Perl solution from your EDIT3 gives me an 'Out of memory' error with large files. The same problem occurs with bgrep.
The only downside to bbe is that I don't know how to print context that precedes a matched pattern.
One way to solve your immediate problem using only grep is to create a file containing a single null byte. After that, grep -abo -f null_byte_file target_file will produce the following output.
0:
1:
2:
3:
8:
11:
That is, of course, each byte offset as requested by "-b", followed by a null byte as requested by "-o".
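One way to create that single-null-byte pattern file in the first place (a sketch, assuming GNU coreutils; mydata.bin is the file generated in the question):
head -c1 /dev/zero > null_byte_file
grep -abo -f null_byte_file mydata.bin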
I'd be the first to advocate perl, but in this case there's no need to bring in the extended family.
What about grep -a? Not sure how it works on truly binary files, but it works well on text files that the OS thinks are binary.

Linux shell command to read/print file chunk by chunk

Is there a standard Linux command I can use to read a file chunk by chunk?
For example, I have a file whose size is 6kB. I want to read/print the first 1kB, and then the 2nd 1kB ...
It seems cat/head/tail won't work in this case.
Thanks very much.
You could do this with read -n in a loop:
while read -r -d '' -n 1024 BYTES; do
echo "$BYTES"
echo "---"
done < file.dat
dd will do it
dd if=your_file of=output_tmp_file bs=1024 count=1 skip=0
And then skip=1 for the second chunk, and so on.
You then just need to read the output_tmp_file to get the chunk.
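Put into a loop, a minimal sketch for the 6kB example from the question (file names are illustrative):
for i in 0 1 2 3 4 5; do
    dd if=your_file bs=1024 count=1 skip=$i of="chunk_$i" 2>/dev/null
done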
split can split a file into pieces by a given byte count.
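For example (a sketch; -b 1024 matches the 1kB chunks from the question, and chunk_ is just a prefix for the output pieces):
split -b 1024 your_file chunk_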
Are you trying to actually read a text file? Like with your eyes? Try less or more
You can use fmt, e.g. for 10-byte chunks:
$ cat file
a quick brown fox jumps over the lazy dog
good lord , oh my gosh
$ tr '\n' ' '<file | fmt -w10 file
a quick
brown fox
jumps
over
the lazy
dog good
lord , oh
my gosh
Each line is at most 10 characters. If you want to read the 2nd chunk, pass it to tools like awk, e.g.
$ tr '\n' ' '<file | fmt -w10 | awk 'NR==2' # print 2nd chunk
brown fox
To save each chunk to a file (or you can use split with -b):
$ tr '\n' ' '<file | fmt -w10 | awk '{print $0 > "file_"NR}'
