grepping for a large binary value from an even larger binary file - linux

As the title suggests, I would like to grep a reasonably large (about 100MB) binary file for a binary string; the binary string is just under 5K.
I've tried grep using the -P option, but this only seems to return matches when the pattern is only a few bytes - when I go up to about 100 bytes it no longer finds any matches.
I've also tried bgrep. This worked well originally; however, when I needed to extend the pattern to the length I have now, I just get "invalid/empty search string" errors.
The irony is, on Windows I can use HxD to search the file and it finds it in an instant. What I really need, though, is a Linux command line tool.
Thanks for your help,
Simon

Say we have a couple of big binary data files. For a big one that shouldn't match, we create a 100MB file whose contents are all NUL bytes.
dd ibs=1 count=100M if=/dev/zero of=allzero.dat
For the one we want to match, create a hundred random megabytes.
#! /usr/bin/env perl
# mkrand: write 100MB of random bytes to stdout
use strict;
use warnings;
binmode STDOUT or die "$0: binmode: $!";
for (1 .. 100 * 1024 * 1024) {
    print chr rand 256;
}
Execute it as ./mkrand >myfile.dat.
Finally, extract a known match into a file named pattern.
dd skip=42 count=10 if=myfile.dat of=pattern
I assume you want only the files that match (-l) and want your pattern to be treated literally (-F or --fixed-strings). I suspect you may have been running into a length limit with -P.
You may be tempted to use the --file=PATTERN-FILE option, but grep interprets the contents of PATTERN-FILE as newline-separated patterns, so in the likely case that your 5KB pattern contains newline bytes, it will be split into several shorter patterns rather than treated as one literal string.
So hope your system's ARG_MAX is big enough and go for it. Be sure to quote the contents of pattern. Note also that the shell's command substitution cannot carry NUL bytes (and strips trailing newlines), so this assumes your pattern contains neither. For example:
$ grep -l --fixed-strings "$(cat pattern)" allzero.dat myfile.dat
myfile.dat

Try using grep -U, which treats files as binary.
Also, how are you specifying the search pattern? It might just need escaping to survive shell parameter expansion.

Since the string you are searching for is pretty long, you could benefit from an implementation of the Boyer-Moore search algorithm, which is very efficient when the search string is long:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
The wiki article also links to some sample code.
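For illustration, here is a minimal Python sketch (my own, not taken from the linked sample code) of the Boyer-Moore-Horspool simplification, which keeps only the bad-character rule; it reads the whole file into memory, which is fine at 100MB, and reuses the myfile.dat/pattern test files created in the answer above purely as example names.
# Boyer-Moore-Horspool: skip ahead based on the byte under the last
# pattern position, instead of advancing one byte at a time.
def horspool_find(haystack, needle):
    n, m = len(haystack), len(needle)
    if m == 0 or m > n:
        return 0 if m == 0 else -1
    # For each byte of the pattern except its last, record how far the
    # window may slide when that byte lines up with the end of the window.
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos                          # offset of the first match
        pos += shift.get(haystack[pos + m - 1], m)
    return -1                                   # no match

with open("myfile.dat", "rb") as f, open("pattern", "rb") as p:
    print(horspool_find(f.read(), p.read()))
This is the Horspool simplification rather than full Boyer-Moore (it drops the good-suffix rule), but it keeps the long-pattern skip behaviour that makes the approach fast.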

You might want to look at a simple Python script.
match= (b"..."
b"...."
b"..." ) # Some byte string literal of immense proportions
with open("some_big_file","rb") as source:
block= read(len(match))
while block != match:
byte= read(1)
if not byte: break
block= block[1:]+read(1)
This might work reliably under Linux as well as Windows.
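If the byte-at-a-time loop turns out to be slow on a 100MB file, a quicker variant (a sketch of my own, assuming the file fits in your address space) is to memory-map the file and let the find method do the scan:
import mmap

match = b"..."  # the same byte string literal as above
with open("some_big_file", "rb") as source:
    with mmap.mmap(source.fileno(), 0, access=mmap.ACCESS_READ) as data:
        offset = data.find(match)   # byte offset of the first match, or -1
        print(offset)
The find method runs an optimized substring search implemented in C, so it will typically beat a pure Python loop by a wide margin.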

Related

concatenate two strings and one variable using bash

I need to generate filename from three parts, two strings, and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I got the below files:
_1.fastq.gze_11
_1.fastq.gze_12
the string after the variable deletes the string before it.
I appreciate any help
Regards
By the way, your idiom for f in `cat files.csv` should be avoided. Refer: Dangerous Backticks
while read f
do
echo "fastq/${f}/_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs runs this command with as many arguments as it can fit onto the command line (splitting the work into multiple invocations if the input file is too large to fit all the arguments onto a single command line, subject to the ARG_MAX constant in your kernel).
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo fastq/${f}_1.fastq.gze
See this answer for some details about the general concept, as well.
Edit: An additional thought: looking at the now-provided output makes me think that this isn't a coding problem at all, but rather a conflict between line endings and the terminal/console program.
Specifically, if the lines in the CSV file end with a carriage return (ASCII/Unicode 13), as Windows-style CRLF files do, the carriage return left at the end of Sample_11 "rewinds" the cursor to the start of the line, and the text printed after it overwrites what came before.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < files.csv)
do
echo fastq/${f}_1.fastq.gze
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
Of course, at that point, the better fix is to find whatever produces the file and ask for reasonable line endings in it...

Search ill encoded characters in a file on Linux

I have a lot of huge CSV files, some of which contain ill encoded characters: in vi, I see things like "<8f>" or "<8e>", for example.
First, I wanted to search and replace (:%s) all the characters, but it would be a very long process because I would have to do this every time I handle a file, and I'm not always sure whether new bad characters have appeared.
Is it possible to detect such characters, so that I can extract lines containing ill encoded characters?
A simple command may exist, taking a file for argument and creating a file containing only the lines with a problem.
I don't know if I'm explaining myself very well...
Thanks in advance!
You could use :g/char/p in vim to print all the matching lines in a given file, or use grep:
grep -lr 'char1\|char2\|char3' .
This will output all the files in the directory containing any of the chars you have listed (the -r makes it recursive, and the -l lists only the filenames rather than all the line matches). Drop -l if you want to see the matching lines themselves rather than just the file names.
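If you'd rather not enumerate the offending characters by hand, here is a minimal Python sketch (my own, not from the answer above; the file names are placeholders) that copies every line containing a byte outside tab/newline/printable ASCII into a separate file:
# Keep only the "problem" lines: those with a byte outside
# tab, CR, LF, or the printable ASCII range 0x20-0x7e.
with open("input.csv", "rb") as src, open("bad_lines.csv", "wb") as dst:
    for line in src:
        if any(b not in (0x09, 0x0a, 0x0d) and not 0x20 <= b <= 0x7e for b in line):
            dst.write(line)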

Generate file names for proper sequential sorting under shell globing

I am generating a sequence of PNG images in my program, which I then plan to pass through some tool that converts them into a video file. I am generating the files one by one, in the proper sequence that I want them in. I want to name them in such a way that the subsequent video conversion tool will take them in that sequence under the file name globbing used by the shell (I am using bash on Linux). I tried adding a numeric suffix like scene1.png, scene10.png, scene12.png, but the shell doesn't sort globs numerically. I could pass a sorted list like this:
convert -antialias -delay 1x10 $(ls povs/*.png | sort -V) mymovie.mp4
But some programs do their own globbing and don't use the shell's globbing (like FFmpeg), so this approach does not always work. So I am looking for a scheme of naming the files that guarantees they are in sequence as per shell globbing rules.
You may prefix your files with a zero-padded integer.
This script emulates what ls * would output after such renaming:
$ for i in {0..11}; do
>   printf '%05d_%s\n' ${i} file${i}
> done
00000_file0
00001_file1
00002_file2
00003_file3
00004_file4
00005_file5
00006_file6
00007_file7
00008_file8
00009_file9
00010_file10
00011_file11
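Since you are already generating the images from your own program, you can do the zero padding there as well; a minimal sketch (assuming Python; the povs/ path and padding width are just examples):
# Pad the frame number so that lexicographic (glob) order equals numeric order.
def frame_name(index, width=5):
    return f"povs/scene{index:0{width}d}.png"

print(frame_name(1))    # povs/scene00001.png
print(frame_name(10))   # povs/scene00010.png -- sorts after 00001, as desired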

Is there any software that allows me to read text binary files in windows?

I have some binary files that contain chapters of a book (the files have no extension). Is there any software that allows me to access their content? In other words, to convert the binary to English?
I tried several solutions...
Thank you
for a very "basic" try,
first of all, try to :
file the_file
to see if it is a known format. Then rename the file accordingly and open it using the proper program.
If this fails, you could use a very low level approach:
strings the_file > the_strings_of_the_file
to create a new file (the_strings_of_the_file) containing "strings" from the_file.
It will probably be cluttered with meaningless content, and the (few?) sentences may appear in whatever order...
You may try to narrow down the potentially good ones with some filters:
strings the_file | grep some_regexp > the_strings_of_the_file.filtered_in
strings the_file | grep -v some_regexp > the_strings_of_the_file.filtered_out
and adjust some_regexp until the "filtered_out" file does not contain anything of value...
(I could help with the regexp if you show us some of the output of "strings", both meaningful lines and some non-meaningful ones)
(and if you specify what kind of language is used: ASCII? accented letters? etc.)
Another approach: delete the "non-useful" characters:
tr -cd ' -~\n\t' < the_file > the_file_without_weird_letters
# note that if your file contains accented letters, you'll need to change the range.
# The range above is good for "everything printable in regular ASCII"
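And if you end up on Windows without these tools, here is a small Python sketch (my own; the_file is the same placeholder name as above) that does roughly what strings does, i.e. pulls out runs of printable ASCII of a minimum length:
import re

MIN_LEN = 4  # same default minimum run length as strings

with open("the_file", "rb") as f:
    data = f.read()

# Runs of MIN_LEN or more printable ASCII bytes (space through tilde, plus tab).
for run in re.findall(rb"[ -~\t]{%d,}" % MIN_LEN, data):
    print(run.decode("ascii"))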

How to extract the first x-megabyte from a large file in unix/linux?

I have a large file of which I am only interested in the first couple of megabytes at the head.
How do I extract the first x megabytes from a large file in unix/linux and put them into a separate file?
(I know the split command can split files into many pieces, and using bash scripts I can erase the pieces I don't want. I would prefer an easier way.)
head works with binary files, and the syntax is neater than dd's.
head -c 2M input.file > output.file
Tail works the same way if you want the end of a file.
E.g.
dd if=largefile count=6 bs=1M > largefile.6megsonly
The 1M spelling assumes GNU dd. Otherwise, you could do
dd if=largefile count=$((6*1024)) bs=1024 > largefile.6megsonly
This again assumes bash-style arithmetic evaluation.
On a Mac (Catalina) the head and tail commands don't seem to take modifiers like m (mega) and g (giga) in upper or lower case, but will take a large integer byte count like this one for 50 MB
head -c50000000 inputfile.txt > outputfile.txt
Try the dd command. You can use "man dd" to get the main ideas of it.
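If you'd rather not depend on head or dd options, a minimal Python sketch (my own; the file names just reuse the example above) that copies the first x megabytes in chunks:
# Copy the first `megabytes` MB of src to dst without loading everything at once.
def copy_head(src, dst, megabytes, chunk=1024 * 1024):
    remaining = megabytes * 1024 * 1024
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while remaining > 0:
            data = fin.read(min(chunk, remaining))
            if not data:
                break                  # source was shorter than requested
            fout.write(data)
            remaining -= len(data)

copy_head("largefile", "largefile.6megsonly", 6)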
