Grep show filename and found line for binary files (PDF) - linux

I have a folder with lots of PDF files. I need the filenames of the files whose content matches, together with the matching text itself: Rotate 270, which defines a page rotation. Grep's -a/-H arguments and the /dev/null trick don't seem to work, nor can pdftotext or pdfgrep help, as the text I need is not visible or searchable on any page.
I can either get the "Binary file aaa.pdf matches" or the line like this (which is not a text visible on a page!):
<</Filter/FlateDecode/Length 61>>stream4 595.19995]/MediaBox[0 0 841.92004 595.19995]/Parent 5 0 R/Resources<</ProcSet[/PDF/Text/ImageB/ImageC/ImageI]/XObject<</img3 11 0 R>>>>/Rotate 270/Type/Page>>
I suspect there is a way to lose the non-printable bytes before grep gets them, or to split off the filename before the grep step and reassemble it after grep has found the line; or maybe sed has an easy way to achieve this?
How do I get both filename and found line, approximately like grep does on regular text files?

I don't have a PDF file with that string inside, but you can try
identify -verbose somefile.pdf | grep 'Rotate 270'
identify is part of ImageMagick package.
You can also try a brute-force method :-)
strings somefile.pdf | grep 'Rotate 270'
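One way to get both the filename and the matching line is to run strings on each file and label grep's stdin with the filename (a sketch using GNU grep's --label option; the demo file and its contents are made up):

```shell
# Demo setup: a fake "PDF" with the target string buried in binary junk
printf 'junk\0<</MediaBox[0 0 595 842]/Rotate 270/Type/Page>>\0' > demo.pdf

# Run strings on each PDF so grep sees only printable text,
# and label the stdin stream with the original file name.
for f in *.pdf; do
  strings "$f" | grep -H --label="$f" 'Rotate 270'
done
```

With GNU grep alone, grep -a -H 'Rotate 270' *.pdf also prints file:line, but the matched "line" may still contain non-printable bytes; the strings pipeline avoids that.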

Related

Linux Header Removal from a ppm file

Does anybody know the command to remove the header from a ppm file in Linux? I've tried this already
head -n 4 Example.ppm > header.txt
tail -n 5+ Example.ppm > body.bin
It tells me that "Tail" could not be found.
Most ppm files use newlines in the header so your first command is fine. However, the rest of the file is binary, so:
head -n 4 Example.ppm > header.txt
filesize=$(wc -c < header.txt)
dd if=Example.ppm of=body.bin bs=1 skip="$filesize"
You should have /bin/tail if you have /bin/head; both are in the coreutils RPM package.
The format of a ppm(5) file (http://netpbm.sourceforge.net/doc/ppm.html) is awkward to use with the line-based head/tail/sed family. The documentation describes fields separated by whitespace that is not necessarily a line break.
You will need to: 1) Ignore comments from '#' to end of line; and 2) process the remainder one field (not column, not line) at a time. Using awk(1) could be an option here.
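A sketch of the field-at-a-time idea in awk (the header must be ASCII; the demo file below is made up to stand in for a real PPM):

```shell
# Demo header: whitespace-separated fields plus a comment
printf 'P6\n# made by hand\n2 2 255\n' > Example.ppm

# Print the four header fields (magic, width, height, maxval),
# stripping '#' comments and stopping before any binary body.
awk '{ sub(/#.*/, "")
       for (i = 1; i <= NF; i++) { print $i; if (++n == 4) exit } }' Example.ppm
# → P6, 2, 2, 255 (one per line)
```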
Check the documentation (http://netpbm.sourceforge.net/doc/directory.html) for a list of conversion programs. You may find one that converts the PPM file into a form better suited to whatever usage is your ultimate goal.

How do I grep data out of a .txt document in Linux?

I am trying to run a grep command on files on my server to find data from e.g. error logs and databases.
grep -r text file
I am running this command and it is not working; it responds with an error saying
Binary file "FILE" matches
The file you are searching is not a normal text file, it's a binary file.
If you have a binary file, most of it will be gibberish when you print it out, but some of it may be ASCII text.
To search for a string in a binary file, you first need to print the content of the binary file and then pipe the output to grep:
cat binaryfile | grep searchstring
You can also give it to grep directly:
grep searchstring binaryfile
If the given string is found, grep replies with a "Binary file ... matches" message. However, unlike a normal text file, a binary file doesn't have line numbers as such for the ASCII text found in it, so grep only reports whether it matched and doesn't give a line number, which makes sense.
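With GNU grep you can also force text processing with -a (--text), which prints the matching line instead of the "Binary file ... matches" message (the file name and contents below are made up):

```shell
# Simulate a binary log: text lines mixed with NUL bytes
printf 'start\0\nerror: disk full\nend\n' > app.bin
grep -a 'disk full' app.bin
# → error: disk full
```

Adding -n gives line numbers too, since -a makes grep treat the file as ordinary text.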

finding number of occurrences in large text file in linux

I have a 17 GB txt file and I cannot seem to load it via vim. I researched the solutions provided here, but I do not understand them very well, and I am not good with Linux or Perl.
I understand I would have to use grep or something.
grep -oP "/^2" file
I have tried up to this code, but I cannot seem to find a way to output the number of occurrences without printing all the lines to the screen.
I would like to find the number of lines that start with the digit 2 in the file and output that number to the shell.
If you want to continue using PCRE:
grep -cP ^2 file
Using grep's "basic regular expressions":
grep -c ^2 file
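For example, with a tiny made-up sample file:

```shell
# Two of the four lines start with the digit 2
printf '2 start\n123\n2nd line\nend\n' > sample.txt
grep -c '^2' sample.txt
# → 2
```

-c makes grep print only the count of matching lines, so nothing else is written to the screen.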

Why can't I detect this file?

I have this file in a directory say test.php whose contents are below
< ? php $XZKsyG=’as’;
I want to pick up the file test.php with a search based on its content. So from the directory containing it I do:
grep 'php \$[a-zA-Z]*=.as.;'
However, I get no result... what am I doing wrong?
Thanks
It works for me:
$ cat file
< ? php $XZKsyG=’as’;
$ grep 'php \$[a-zA-Z]*=.as.;' file
< ? php $XZKsyG=’as’;
Are you sure the contents of the file are exactly what you showed us?
Try cat -A file or od -c file to see whether the file really looks the way you think it does.
(Note that you don't need to escape the $ character; it's only a metacharacter at the end of a line. But escaping it should be ok.)
EDIT :
The characters around the as in your file are not ASCII apostrophes; they're Unicode RIGHT SINGLE QUOTATION MARK characters (0x2019). If the file is stored in UTF-8, each of them is represented as a 3-byte sequence. The grep command works for me because my locale settings "en_US.UTF-8" are such that a UTF-8 character is matched by . in a regexp, even if it has a multi-byte representation. I suspect your locale is such that it would be matched by ....
Probably the simplest solution is to edit the file to use ASCII apostrophes.
You might also want to play around with your locale settings. Try the grep command with $LANG set to "en_US.UTF-8".
What's the output of the locale command?
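To see why od helps with the diagnosis: in the sketch below (the file name and contents are assumptions), the RIGHT SINGLE QUOTATION MARK shows up in od -c output as the three-byte octal sequence 342 200 231, while an ASCII apostrophe would appear as a single ' character:

```shell
# \342\200\231 is the UTF-8 byte sequence (in octal) for U+2019
printf '< ? php $x=\342\200\231as\342\200\231;\n' > test.php
od -c test.php
```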
That works fine for me, though you may want to look into those "funny" single quotes you have around as:
pax$ cat testfile
< ? php $XZKsyG='as';
pax$ grep 'php \$[a-zA-Z]*=.as.;' testfile
< ? php $XZKsyG='as';
Failing that, there are some things you can look at. Some of these may sound silly, but I'm really just covering all bases.
Are you sure the file contains only what you think it does? Executing od -xcb file will give you a hex dump of it for better checking.
Are you sure you're accessing the right file, in the right directory?
Have you done something silly like aliasing grep to be something else?
That's if you're looking for a file containing that string. If instead you're looking for a file named like that, you can use something like:
ls -1 | grep 'php \$[a-zA-Z]*=.as.;'
The ls -1 command gives you one file per line, and piping that through grep will filter out those not matching the pattern.
I suppose I should mention that I'm not really a big fan of file names with spaces in them, but I'm violently opposed to file names made up of PHP scripts :-)

How to tell binary from text files in linux

The linux file command does a very good job in recognising file types and gives very fine-grained results. The diff tool is able to tell binary files from text files, producing a different output.
Is there a way to tell binary files from text files? All I want is a yes/no answer whether a given file is binary. Because it's difficult to define binary, let's say I want to know if diff will attempt a text-based comparison.
To clarify the question: I do not care if it's ASCII text or XML as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.
file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".
If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).
The diff manual specifies that
diff determines whether a file is text or binary by checking the first few bytes in the file; the exact number of bytes is system dependent, but it is typically several thousand. If every byte in that part of the file is non-null, diff considers the file to be text; otherwise it considers the file to be binary.
A quick-and-dirty way is to look for a NUL character (a zero byte) in the first K or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.
Update: According to the diff manual, this is exactly what diff does.
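A minimal sketch of that NUL check (od makes the NUL bytes greppable; the 2 KiB window is an arbitrary assumption, whereas diff scans a larger, system-dependent window):

```shell
# Sketch: call a file "text" if no NUL byte appears in its first 2 KiB
is_text() { ! head -c 2048 "$1" | od -An -tx1 | grep -qw 00; }

printf 'hello world\n' > t.txt
printf 'abc\0def'      > b.bin
is_text t.txt && echo "t.txt looks like text"
is_text b.bin || echo "b.bin looks binary"
```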
This approach defers to the grep command in determining whether a file is binary or text:
is_text_file() { grep -qIF '' "$1"; }
grep options used:
-q Quiet; Exit immediately with zero status if any match is found
-I Process a binary file as if it did not contain matching data
-F Interpret PATTERNS as fixed strings, not regular expressions.
grep pattern used:
'' Empty string. All files (except an empty file)
will match this pattern.
Notes
An empty file is not considered a text file according to this test. (The GNU file command agrees with this assessment.)
A file with one printable character, say a, is considered a text file according to this test. (Makes sense to me.) (The file command disagrees with this assessment. (Tested with GNU file))
This approach requires only one child process to test whether a file is text or binary.
Test
# cd into a temp directory
cd "$(mktemp -d)"
# Create 3 corner-case test files
touch empty_file # An empty file
echo -n a >one_byte_a # A file containing just `a`
echo a >one_line_a # A file containing just `a` and a newline
# Another test case: a 96KiB text file that ends with a NUL
head -c 98303 /usr/share/dict/words > file_with_a_null_96KiB
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB
# Last test case: a 96KiB text file plus a NUL added at the end
head -c 98304 /usr/share/dict/words > file_with_a_null_96KiB_plus1
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB_plus1
# Defer to grep to determine if a file is a text file
is_text_file() { grep -qIF '' "$1"; }
# Test harness
do_test() {
    printf '%22s ... ' "$1"
    if is_text_file "$1"; then
        echo "is a text file"
    else
        echo "is a binary file"
    fi
}
# Test each of our test cases
do_test empty_file
do_test one_byte_a
do_test one_line_a
do_test file_with_a_null_96KiB
do_test file_with_a_null_96KiB_plus1
Output
            empty_file ... is a binary file
            one_byte_a ... is a text file
            one_line_a ... is a text file
file_with_a_null_96KiB ... is a binary file
file_with_a_null_96KiB_plus1 ... is a text file
On my machine, it seems grep checks the first 96 KiB of a file for a NUL. (Tested with GNU grep). The exact crossover point depends on your machine's page size.
Relevant source code: https://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c?h=v3.6#n1550
You could try running
strings yourfile
and comparing the size of the result with the file size... I'm not totally sure, but if they are the same, the file is really a text file.
These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.
See here for how Subversion does it.
A fast way to do this in Ubuntu is to use Nautilus in the "list" view. The type column will show you whether it's text or binary.
Commands like less and grep detect it quite easily (and fast). You can have a look at their source.
