bulk rename pdf files with name from specific line of its content in linux

I have multiple pdf files which I want to rename. The new name should be taken from a specific line (let's say the 5th) of each pdf's content. For example, if a file's 5th line is some string, that string should become the file's name. The same goes for the rest of the files: each one should be renamed after the 5th line of its content. I tried this in the terminal:
for pdf in *.pdf
do
    filename=$(basename -s .pdf "${pdf}")
    newname=$(awk 'NR==5' "${filename}.pdf")
    mv "${pdf}" "${newname}"
done
It renames the files, but the resulting name is an invalid string. I know the system doesn't see the file as plain text and images; there are metadata, xml tags and so on. But is there a way to take the content from that line?

Out of the box, bash and its usual utilities are not able to read pdf files. However, less is able to recover the text from a pdf file. You could change your script as follows:
for pdf in *.pdf
do
    mv "$pdf" "$(less "$pdf" | sed '5q;d').pdf"
done
Explanation:
less "$pdf": displays the text part of the pdf file, taking spacing into account
run a few tests first to check that less returns the desired output
sed '5q;d': extracts the 5th line of its input
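You can see what that sed expression does in isolation with some throwaway input:
printf 'one\ntwo\nthree\nfour\nfive\nsix\n' | sed '5q;d'
# prints: five
The d deletes every line before the 5th, and 5q prints the 5th line and quits without reading the rest.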
Optionally, you could use the following to also remove blank lines and extra spaces:
mv "$pdf" "$(less "$pdf" | sed -e '/^\s*$/d' -e 's/ \+/ /g' | sed '5q;d').pdf"

Related

How to remove 1st empty column from file using bash commands?

I have an output file that can be read by a visualizing tool. The problem is that at the beginning of each line there is one space; because of this, the visualizing tool isn't able to recognize the scripts and hence crashes. When I manually remove the first column of spaces, it works fine. Is there a way to remove the 1st empty column of spaces using a bash command?
What I want is to remove that excess column of empty space using a bash command.
At present I use the Vim editor to remove the 1st column manually, but I would like to do it using a bash command so that I can automate the process. My file is not just full of columns; it has some independent data lines.
Using cut or sed would be two simple solutions.
cut
cut -c2- file > file.cut && mv file.cut file
cut cannot modify a file in place, therefore you need to redirect its output to a different file and then overwrite the old file with the new one.
sed
sed -i 's/^.//' file
-i modifies the file in-place.
I would use
sed -i -e 's/^ //' file
to remove only a leading space, so lines that do not start with one are left unchanged. (Note that -i and -e must be kept separate here: GNU sed parses -ie as -i with the backup suffix e.)
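A quick check on a throwaway file (hypothetical sample data, assuming GNU sed for -i) shows that lines without a leading space are left alone:
printf ' one\n two\nthree\n' > sample
sed -i -e 's/^ //' sample
cat sample
# one
# two
# three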

How to print only txt files on linux terminal?

In my Linux directory I have 6 files: 5 are txt files and 1 is a .tar.gz file. How can I print only the names of the txt files to the terminal?
directory: dir
content:
ex1, ex2, ex3, ex4, ex5, ex6.tar.gz
Because the text files do not have a file extension (.txt), I would try to do it by exclusion.
ls | grep -v tar.gz
If you have multiple types then use extensions.
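For instance, to exclude more than one archive type at once (a hypothetical set of extensions), pass several patterns to grep -v:
ls | grep -v -e '\.tar\.gz$' -e '\.zip$'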
The command 'file', followed by the name of a file, will return the type of the file.
You can loop over the files in your directory, use each filename as input to the 'file' command, and if it is a text file, print that filename.
The following includes some extra output from the file command, which I'm not sure how to remove yet, but it does give you the filenames you want:
#!/bin/bash
for f in *
do
    file "$f" | grep text
done
You can put this into a shell script in the directory you want to get the filenames from, and run it from the command line.
The suggestions to use the file command are correct. The problem here is parsing the output of this command, because (1) file names can contain pretty much any character, and (2) the concrete output of the file command is a bit unpredictable, because it depends on which so-called magic files are installed.
If we rely on the fact that the explanation text in the output of the file command - i.e. the part which explains what kind of file it is - always contains the word text for a text file, and never contains a colon, we can process it as follows:
The last colon in the output must separate the filename from the explanation. Everything to its left is the filename, and if the word text (note the leading space before text!) occurs in the right part, we have a text file.
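As a minimal sketch of that last-colon rule (my own illustration, assuming file prints one result per line):
file * | awk '{
    n = match($0, /:[^:]*$/)           # position of the last colon
    name = substr($0, 1, n - 1)        # everything to its left is the filename
    desc = substr($0, n + 1)           # the explanation text
    if (desc ~ / text/) print name     # note the leading space before "text"
}'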
This still leaves us with those (hopefully rare) cases where a file name contains a non-printable character, they would be translated to their octal equivalent, which might or might not be what you want to see. You can suppress this by passing the -r option to the file command. This is useful if you want to process this filename further instead of just displaying it to the user, but it might corrupt your parsing logic, especially if the filename contains a newline.
Finally, don't forget that in any case, you see what the system considers a text file. This is not necessarily the same what you define to be a text file.
Updated Answer
As @hek2mgl points out in the comments, a more robust solution is to separate the filenames with nul characters (which cannot occur in filenames); that deals with filenames containing newlines and colons:
file -0 * | awk -F'\0' '$2 ~ /text/{print $1}'
Original Answer
I would do this:
file * | awk -F: '$2~/text/{print $1}'
That runs file to see the type of each file and passes the names and types to awk separated by a colon. awk then looks for the word text in the second field and if it finds it, prints the first field - which is the filename.
Try running the following simpler command on its own to see how it works:
file *
Given this directory of files:
$ file *
1.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
2.pdf: PDF document, version 1.5
3.pdf: PDF document, version 1.5
4.dat: data
5.txt: ASCII text
6.jpg: JPEG image data, JFIF standard 1.02, aspect ratio, density 100x100, segment length 16, baseline, precision 8, 2833x972, frames 3
7.html: HTML document text, UTF-8 Unicode text, with very long lines, with no line terminators
8.js: UTF-8 Unicode text
9.xml: XML 1.0 document text
A.pl: a /opt/local/bin/perl script text executable, ASCII text
B.Makefile: makefile script text, ASCII text
C.c: c program text, ASCII text
D.docx: Microsoft Word 2007+
You can see the only files that are pure ascii are 5.txt, 9.xml, and A-C. The rest are either binary or UTF according to file.
You can use a Bash glob to loop through the files and use file to test each one. This saves having to parse the output of file for the file names, but relies on file to accurately identify what you consider to be 'text':
for fn in *; do
    [ -f "$fn" ] || continue                 # skip anything that is not a regular file
    fo=$(file "$fn")
    [[ $fo =~ ^"$fn":.*text ]] || continue   # keep only files whose description mentions "text"
    echo "$fn"
done
If you cannot use file, which is certainly the easiest way, you can open the file and look for binary characters. Use Perl for that:
for fn in *; do
    [ -f "$fn" ] || continue
    # exit 1 if more than 3% of the first 2000 bytes are non-ascii
    # (the $tot guard avoids a division-by-zero error on empty files)
    head -c 2000 "$fn" | perl -lne '$tot+=length; $cnt+=s/[^[:ascii:]]//g; END{exit 1 if $tot && $cnt/$tot>0.03;}'
    [ $? -eq 0 ] || continue
    echo "$fn"
done
In this case, I am looking at the ratio of ascii to non-ascii characters in the first 2000 bytes of a file. YMMV, but that allows finding a file that file would report as UTF (since it has a binary BOM) even though most of its content is ascii.
For that directory, the two Bash scripts report (with my comments on each file):
1.txt # UTF file with a binary BOM but no UTF characters -- all ascii
4.dat # text based configuration file for a router. file does not report this
5.txt # Pure ascii file
7.html # html file
8.js # Javascript sourcecode
9.xml # xml file all text
A.pl # Perl file
B.Makefile # Unix make file
C.c # C source file
Since file does not consider the all-ascii file 4.dat to be text, it is not reported by the first Bash script but is by the second. Otherwise, the two scripts give the same output.

Using grep to overwrite its current file

I have a list of directories within directories and this is what I am trying to attempt:
find a specific file format which is .xml
within all these .xml files, read the contents in the files and remove line 3
For line 3, its string is as follows: dxflib <Name of whatever folder it is in>.dxb
I tried using find -name "*.xml" | xargs grep -v "dxflib" in the terminal (I am using linux), and I found that while the command runs and displays the results, it does not write the changes back to the files.
From what I found while googling, I would need to append something like >> output.txt to redirect the output.
Hence, is there any way to make it save/overwrite its own file?
This removes the third line of a file in place:
sed -i '3d' file
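To apply that to every .xml file that find locates (assuming GNU sed for -i), you can combine the two commands:
find . -name "*.xml" -exec sed -i '3d' {} +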

grep -f on files in a zipped folder

I am performing a recursive fgrep/grep -f search on a zipped up folder using the following command in one of my programs:
The command I am using:
grep -r -i -z -I -f /path/to/pattern/file /home/folder/TestZipFolder.zip
Inside the pattern file is the string "Dog" that I am trying to search for.
In the zipped up folder there are a number of text files containing the string "Dog".
The grep -f command successfully finds the string "Dog" in 3 text files inside the zipped folder, but it prints the output all on one line, and some strange characters appear at the end, i.e. PK (as shown below). And when I try to print the output to a file in my program, other characters appear at the end, such as ^B^T^#
Output from the grep -f command:
TestZipFolder/test.txtThis is a file containing the string DogPKtest1.txtDog, is found again in this file.PKTestZipFolder/another.txtDog is written in this file.PK
How would I get each of the files where the string "Dog" has been found to print on a new line so they are not all grouped together on one line like they are now?
Also where are the "PK" and other strange characters appearing from in the output and how do i prevent them from appearing?
Desired output
TestZipFolder/test.txt:This is a file containing the string Dog
TestZipFolder/test1.txt:Dog, is found again in this file
TestZipFolder/another.txt:Dog is written in this file
Something along these lines, whereby the user is able to see where the string can be found in each file (you actually get output in this format if you run the grep command on a file that is not a zip file).
If you need multiline output, it is better to use zipgrep:
zipgrep -s "pattern" TestZipFolder.zip
The -s suppresses error messages (optional). This command will print every matched line along with the file name. If you want to remove the duplicate names when a file contains more than one match, some further processing with loops/grep or awk or sed is needed. As for the stray PK characters: PK is the two-byte signature that starts each member header inside a zip archive, so it leaks into the output because grep is reading the raw archive bytes rather than the decompressed members.
Actually, zipgrep is a combination of egrep and unzip, and its usage is as follows:
zipgrep [egrep_options] pattern file[.zip] [file(s) ...] [-x xfile(s) ...]
so you can pass any egrep options to it.
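For example, since -i is an egrep option, a case-insensitive search is simply:
zipgrep -i "dog" TestZipFolder.zip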

linux split file with incremental characters before extension

I want to split a large collection of text files. This is the script that I currently use:
for file in *.txt
do
    split -b 120k "$file" "$file"_
done
When the input file is
hello_world.txt
the split files would be
hello_world.txt_AA
hello_world.txt_AB
hello_world.txt_AC
I want it to be like
hello_world_AA
hello_world_AB
hello_world_AC
How do I do it in Linux?
In bash, an expression of the form ${variable%suffix} will expand to the contents of variable with the suffix removed from the end.
You can use this when you specify the new file prefix to split like this:
for file in *.txt
do
    split -b 120k "$file" "${file%.txt}"_
done
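To see the expansion on its own:
file=hello_world.txt
echo "${file%.txt}_"    # prints: hello_world_
split then appends its two-letter suffixes to that prefix, producing the hello_world_AA style names you wanted.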
