How to tell binary from text files in Linux

The Linux file command does a very good job of recognising file types and gives very fine-grained results. The diff tool is able to tell binary files from text files, producing a different output.
Is there a way to tell binary files from text files? All I want is a yes/no answer for whether a given file is binary. Because it's difficult to define binary, let's say I want to know whether diff will attempt a text-based comparison.
To clarify the question: I do not care whether it's ASCII text or XML, as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.

file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".
If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).
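If you want a yes/no answer in a script, here is a minimal sketch that defers to those heuristics (the function name is my own):
is_text() {
    # file -b prints just the description; text types include the word "text"
    file -b "$1" | grep -qw text
}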

The diff manual specifies that:
diff determines whether a file is text or binary by checking the first few bytes in the file; the exact number of bytes is system dependent, but it is typically several thousand. If every byte in that part of the file is non-null, diff considers the file to be text; otherwise it considers the file to be binary.

A quick-and-dirty way is to look for a NUL character (a zero byte) in the first kilobyte or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.
Update: According to the diff manual, this is exactly what diff does.
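A rough sketch of that check (the function name is my own; it goes through od because shell variables cannot hold NUL bytes):
is_probably_text() {
    # Report binary if any 00 byte appears in a hex dump of the first 2 KiB
    ! head -c 2048 "$1" | od -An -tx1 | grep -qw 00
}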

This approach defers to the grep command in determining whether a file is binary or text:
is_text_file() { grep -qIF '' "$1"; }
grep options used:
-q  Quiet; exit immediately with zero status if any match is found
-I  Process a binary file as if it did not contain matching data
-F  Interpret patterns as fixed strings, not regular expressions
grep pattern used:
'' Empty string. All files (except an empty file)
will match this pattern.
Notes
An empty file is not considered a text file according to this test. (The GNU file command agrees with this assessment.)
A file with one printable character, say a, is considered a text file according to this test. (Makes sense to me, though the file command disagrees with this assessment; tested with GNU file.)
This approach requires only one child process to test whether a file is text or binary.
Test
# cd into a temp directory
cd "$(mktemp -d)"
# Create 3 corner-case test files
touch empty_file # An empty file
echo -n a >one_byte_a # A file containing just `a`
echo a >one_line_a # A file containing just `a` and a newline
# Another test case: a 96KiB text file that ends with a NUL
head -c 98303 /usr/share/dict/words > file_with_a_null_96KiB
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB
# Last test case: a 96KiB text file plus a NUL added at the end
head -c 98304 /usr/share/dict/words > file_with_a_null_96KiB_plus1
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB_plus1
# Defer to grep to determine if a file is a text file
is_text_file() { grep -qIF '' "$1"; }
# Test harness
do_test() {
    printf '%22s ... ' "$1"
    if is_text_file "$1"; then
        echo "is a text file"
    else
        echo "is a binary file"
    fi
}
# Test each of our test cases
do_test empty_file
do_test one_byte_a
do_test one_line_a
do_test file_with_a_null_96KiB
do_test file_with_a_null_96KiB_plus1
Output
            empty_file ... is a binary file
            one_byte_a ... is a text file
            one_line_a ... is a text file
file_with_a_null_96KiB ... is a binary file
file_with_a_null_96KiB_plus1 ... is a text file
On my machine, it seems grep checks the first 96 KiB of a file for a NUL. (Tested with GNU grep). The exact crossover point depends on your machine's page size.
Relevant source code: https://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c?h=v3.6#n1550

You could try running
strings yourfile
and comparing the size of the result with the file size. I'm not totally sure, but if they are the same, the file is really a text file.
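A rough sketch of that comparison (assuming GNU strings; -n 1 keeps one-character runs, and a file without a trailing newline will still be off by a byte):
# Equal byte counts suggest the whole file consists of printable runs
[ "$(strings -n 1 yourfile | wc -c)" -eq "$(wc -c < yourfile)" ] && echo "probably text"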

These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.
See here for how Subversion does it.

A fast way to do this in Ubuntu is to use Nautilus in the "list" view. The Type column will show you whether a file is text or binary.

Commands like less and grep detect it quite easily (and quickly). You can have a look at their source.

Related

Linux Header Removal from a ppm file

Does anybody know the command to remove the header from a PPM file in Linux? I've tried this already:
head -n 4 Example.ppm > header.txt
tail -n 5+ Example.ppm > body.bin
It tells me that "Tail" could not be found.
Most PPM files use newlines in the header, so your first command is fine. However, the rest of the file is binary, so:
head -n 4 Example.ppm > header.txt
filesize=$(wc -c < header.txt)
dd if=Example.ppm of=body.bin bs=1 skip="$filesize"
(Note that wc -c < file prints only the byte count; wc -c file would also print the file name and break the dd invocation.)
You should have /bin/tail if you have /bin/head; both are in the coreutils RPM package.
The format of a ppm(5) file (http://netpbm.sourceforge.net/doc/ppm.html) is awkward to use with the line-based head/tail/sed family. The documentation describes fields separated by whitespace that is not necessarily a line break.
You will need to: 1) ignore comments from '#' to end of line; and 2) process the remainder one field (not column, not line) at a time. Using awk(1) could be an option here.
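A hedged sketch of that field-based approach (assuming GNU awk for the regex RS, a header without '#' comments, and a header shorter than 64 bytes):
# Count header bytes: magic, width, height, maxval, each followed by one
# whitespace byte; stop as soon as the fourth field is complete
hdrlen=$(head -c 64 Example.ppm | awk 'BEGIN { RS = "[ \t\r\n]" }
    { len += length($0) + 1; if ($0 != "") n++ }
    n == 4 { print len; exit }')
dd if=Example.ppm of=body.bin bs=1 skip="$hdrlen"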
Check the documentation (http://netpbm.sourceforge.net/doc/directory.html) for a list of conversion programs. You may find one that converts the PPM file into a form better suited to whatever usage is your ultimate goal.

Is it possible to display a file's contents and delete that file in the same command?

I'm trying to display the output of an AWS lambda that is being captured in a temporary text file, and I want to remove that file as I display its contents. Right now I'm doing:
... && cat output.json && rm output.json
Is there a clever way to combine those last two commands into one command? My goal is to make the full combined command string as short as possible.
This works for cases where:
• it is possible to control the name of the temporary text file;
• the file is not used by other code;
• it is possible to pass "/dev/stdout" as the name of the output.
Regarding portability, see the Stack Exchange question "how portable ... /dev/stdout".
POSIX 7 says they are extensions.
Base Definitions,
Section 2.1.1 Requirements:
The system may provide non-standard extensions. These are features not required by POSIX.1-2008 and may include, but are not limited to:
[...]
• Additional character special files with special properties (for example,  /dev/stdin, /dev/stdout,  and  /dev/stderr)
Using the mandatorily supported /dev/tty will force output to the "current" terminal, making it impossible to pipe the output of the whole command into a different program (or a log file), or to use the program when there is no connected terminal (cron jobs or other automation tools).
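A sketch of that idea (produce_report is a stand-in for whatever writes the temporary file; jq is just an example consumer):
# Pointing the producer at /dev/stdout keeps the data off the disk entirely
produce_report /dev/stdout | jq .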
No, you cannot easily remove the lines of a file while displaying them. It would be highly inefficient, as it would require removing characters from the beginning of the file each time you read a line. Current filesystems are pretty good at truncating files at the end, but not at the beginning.
A simple but extremely slow method would look like this:
while [ -s output.json ]
do
    head -1 output.json
    sed -i 1d output.json
done
While this algorithm is plain and simple, you should know that each time you remove the first line with sed -i 1d, it copies the whole content of the file except the first line into a temporary file, resulting in approximately 0.5*n² lines written in total (where n is the number of lines in your file).
In theory you could avoid this by doing something like this:
while [ -s output.json ]
do
    line=$(head -1 output.json)
    printf -- '%s\n' "$line"
    fallocate -c -o 0 -l $((${#line}+1)) output.json
done
But this does not account for variable newline characters (namely DOS-formatted newlines), and fallocate does not always work on XFS, among other issues.
Since you are trying to consume a file alongside its creation without leaving a trace of its existence on disk, you are essentially asking for a pipe functionality. In my opinion you should look into how your output.json file is produced and hopefully you can pipe it to a script of your own.
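If you cannot change how output.json is produced but you can create the file name in advance, a named pipe gives you the pipe behaviour under the existing name (a sketch; produce_report is a stand-in for the producer):
mkfifo output.json            # the name now refers to a pipe, not a regular file
produce_report output.json &  # the writer blocks until a reader attaches
cat output.json               # display the contents; the data never hits the disk
rm output.json                # removes only the pipe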

Grep show filename and found line for binary files (PDF)

I have a folder with lots of PDF files. I need to get the file names of the files whose content matches, as well as the specific text in them: Rotate 270, which defines a page rotation. grep's -anH arguments or the /dev/null method seem not to work, nor can pdftotext or pdfgrep help, as what I need is not visible or searchable text on the page.
I can either get the "Binary file aaa.pdf matches" or a line like this (which is not text visible on a page!):
<</Filter/FlateDecode/Length 61>>stream4 595.19995]/MediaBox[0 0 841.92004 595.19995]/Parent 5 0 R/Resources<</ProcSet[/PDF/Text/ImageB/ImageC/ImageI]/XObject<</img3 11 0 R>>>>/Rotate 270/Type/Page>>
I suspect there is a way to lose the non-printable bytes before grep gets them, or to split off the filename before the grep part and reassemble it after grep has found the line; or maybe sed has an easy way to achieve this?
How do I get both the filename and the found line, approximately like grep does on regular text files?
I don't have a PDF file with that string inside, but you can try
identify -verbose somefile.pdf | grep 'Rotate 270'
identify is part of the ImageMagick package.
You can also try a brute-force method :-)
strings somefile.pdf | grep 'Rotate 270'
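One way to get both the file name and the matched line is to strip the non-printable bytes with strings and re-attach the name using GNU grep's --label option (a sketch):
for f in *.pdf; do
    # strings drops the binary noise; --label restores the file-name prefix
    strings "$f" | grep -H --label="$f" 'Rotate 270'
done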

"grep" offset of ascii string from binary file

I'm generating binary data files that are simply a series of records concatenated together. Each record consists of a (binary) header followed by binary data. Within the binary header is an ASCII string 80 characters long. Somewhere along the way, my process of writing the files got a little messed up, and I'm trying to debug the problem by inspecting how long each record actually is.
This seems extremely related, but I don't understand Perl, so I haven't been able to get the accepted answer there to work. The other answer points to bgrep, which I've compiled, but it wants me to feed it a hex string, and I'd rather just have a tool where I can give it the ASCII string and have it find it in the binary data, printing the string and the byte offset where it was found.
In other words, I'm looking for some tool which acts like this:
tool foobar filename
or
tool foobar < filename
and its output is something like this:
foobar:10
foobar:410
foobar:810
foobar:1210
...
i.e. the string which matched and the byte offset in the file where the match started. In this example, I can infer that each record is 400 bytes long.
Other constraints:
the ability to search by regex would be nice, but I don't need it for this problem
my binary files are big (3.5 GB), so I'd like to avoid reading the whole file into memory if possible
grep --byte-offset --only-matching --text foobar filename
The --byte-offset option prints the offset of each matching line.
The --only-matching option makes it print offset for each matching instance instead of each matching line.
The --text option makes grep treat the binary file as a text file.
You can shorten it to:
grep -oba foobar filename
It works in the GNU version of grep, which comes with Linux by default. It won't work in BSD grep (which comes with macOS by default).
You could use strings for this:
strings -a -t x filename | grep foobar
Tested with GNU binutils.
For example, where in /bin/ls does --help occur:
strings -a -t x /bin/ls | grep -- --help
Output:
14938 Try `%s --help' for more information.
162f0 --help display this help and exit
I wanted to do the same task. Though strings | grep worked, I found gsar was the very tool I needed.
http://tjaberg.com/
The output looks like:
>gsar.exe -bic -sfoobar filename.bin
filename.bin: 0x34b5: AAA foobar BBB
filename.bin: 0x56a0: foobar DDD
filename.bin: 2 matches found

encoding problem?

I work with TXT files, and I recently found e.g. these characters in a few of them:
http://pastebin.com/raw.php?i=Bdj6J3f4
What could these characters be? Wrong character encoding? I just want to use normal UTF-8 TXT files, but when I use:
iconv -t UTF-8 input.txt > output.txt
it's still the same.
When I open the files in gedit and copy+paste them into other TXT files, there are no characters like the ones in the pastebin. So gedit can solve this problem; it encodes the TXT files well. But there are too many TXT files.
Why are there http://pastebin.com/raw.php?i=Bdj6J3f4 -like chars in the text files? Can they be converted to "normal chars"? I can't see e.g. the "Ì" char when I open the files with vim, only after I "work with them" (e.g. with awk, etc.).
It would help if you posted the actual binary content of your file (perhaps by using the output of od -t x1). The pastebin returns this as HTML:
"Ì"
"Ã"
"é"
The first line corresponds to U+00C3 U+0152. The last line corresponds to U+00C3 U+00A9, which is the character U+00E9 ("é") encoded in UTF-8 as "\xc3\xa9", with the UTF-8 bytes reinterpreted as Latin-1.
From man iconv:
The iconv program converts text from one encoding to another encoding. More precisely, it converts from the encoding given for the -f option to the encoding given for the -t option. Either of these encodings defaults to the encoding of the current locale.
Because you didn't specify the -f option, it assumes the file is encoded in your current locale's encoding (probably UTF-8), which apparently is not true. Your text editors (gedit, vim) do some encoding detection; you can check which encoding they detect (I don't know how, as I don't use either of them) and use that as the -f option to iconv (or save the open file with your desired encoding using one of those text editors).
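For example, if your editor reports the file as ISO-8859-1 (an assumption here; substitute whatever encoding it actually detects), the explicit conversion would be:
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt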
You can also use a tool for encoding detection, like the Python chardet module:
$ python -c "import chardet as c; print(c.detect(open('file.txt', 'rb').read(4096)))"
{'confidence': 0.7331842298102511, 'encoding': 'ISO-8859-2'}
Solved!
How: I just right-clicked on the folders containing the TXT files and pasted them into another folder.. :O and presto, there are no more ugly chars.
