I am working on a shell script that takes a single command line parameter, a file path (might be relative or absolute). The script should examine that file and print a single line consisting of the phrase:
Windows ASCII
if the file is an ASCII text file with CR/LF line terminators, or
Something else
if the file is binary or ASCII with “Unix” LF line terminators.
Currently, I have the following code.
#!/bin/sh
file=$1
if grep -q "\r\n" $file;then
echo Windows ASCII
else
echo Something else
fi
It displays the information properly for .txt files: when I pass a .txt file it prints "Something else" as expected. But when I pass something that is not a Windows ASCII file, such as /bin/cat or a directory, the script still identifies it as Windows ASCII. I think I am not handling it properly, but I am unsure. Any pointers on how to fix this issue?
As you specify that you only need to differentiate between two cases, this should work:
#!/bin/sh
file="$1"
case $(file "$file") in
    *"ASCII text, with CRLF line terminators" )
        echo "Windows ASCII"
        ;;
    * )
        echo "Something else"
        ;;
esac
As you have specified #!/bin/sh, or if your goal is total backward compatibility, you may need to replace
$(file "$file")
with
`file "$file"`
To use your script with filenames that include spaces, note that all $ variables are now surrounded with double quotes. You'll also have to quote the space characters in the filename when you call the script, e.g.
myFileTester.sh "file w space.txt"
OR
myFileTester.sh 'file w space.txt'
OR
myFileTester.sh file\ w\ space.txt
Also, if you have to start distinguishing between all the possible cases that file can analyze, you'll have a rather large case statement on your hands. And file is notorious for the different messages it returns, depending on the contents of /etc/file/magic, the OS, versions, etc.
IHTH
Use the file command to find out the file type:
$ file /etc/passwd
/etc/passwd: ASCII English text
$ file /bin/cat
/bin/cat: Mach-O 64-bit executable x86_64
$ file test.txt
test.txt: ASCII text, with CRLF line terminators
Related
I want to create a shell script that checks whether a file is a Unix or a DOS file type. Using an if statement, I want to decide after the check whether the file needs to be converted using "dos2unix" or not. I know the "file" command, but its return value is not a BOOLEAN data type, it's a string.
So is there any way to set a BOOLEAN bit to true if the file is a unix file type?
thanks in advance...!
You could parse the output of the file command. For text files with \n line endings, it outputs ASCII text ..., while for text files with \r\n line endings, it outputs ASCII text ... with CRLF line terminators. Note that depending on the actual file contents, there can be additional information in place of the "...". Hence, you could do something like
file YOURFILE | grep -q '^ASCII text.*with CRLF'
((is_dos_text_file=1-$?))
The variable is_dos_text_file contains the value 1, if YOURFILE was judged by file as a text file with CRLF endings. It is 0 if YOURFILE either has Unix line endings, or was not judged as textfile.
UPDATE: I just noticed that you have used the shell tag in your posting and hence are looking for a POSIX shell solution. In this case, the ((...)) construct can't be used and you would have to do something like
if file YOURFILE | grep -q '^ASCII text.*with CRLF'
then
is_dos_text_file=1 # true
else
is_dos_text_file=0 # false
fi
to get the same effect.
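Either way, the resulting flag can then drive the dos2unix conversion mentioned in the question, for example (illustrative):
if [ "$is_dos_text_file" -eq 1 ]; then
    dos2unix YOURFILE    # convert only when CRLF endings were detected
fi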
You can convert the file to a Unix file and check if it is still the same. In that case it is a Unix file. Otherwise it is a DOS file.
echo unix > unix-file
echo dos | unix2dos > dos-file
for file in {dos,unix}-file; do
if cmp -s $file <(dos2unix < $file); then
echo $file is a unix file
else
echo $file is a dos file
fi
done
I am new to Linux (not my own server) and I want to split some Windows txt files by calling a bash script from a third-party application:
So far I have it working in two ways up to a point:
split -l 5000 LargeFile.txt SmallFile
for file in LargeFile.*
do
mv "$file" "$file.txt"
done
awk '{filename = "wrd." int((NR-1)/5000) ".txt"; print >> filename}' LargeFile.txt
But both give me txt files with the result:
line1line2line3line4
I found some topics about putting LargeFile.txt like this $ (LargeFile.txt) but it is not working for me. (Also, I found a switch to let the split command produce txt files directly, but this is also not working.)
I hope some one can help me out on this one.
Explanation: Line terminators
As explained by various answers to this question, the standard line terminators differ between OS's:
Linux uses LF (line feed, 0x0a)
Windows uses CRLF (carriage return and line feed 0x0d 0x0a)
Mac, pre-OS X, used CR (carriage return, 0x0d)
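If you want to inspect the raw bytes of a line yourself, od -c makes these terminators visible, for instance:
printf 'line\r\n' | od -c
The \r \n pair at the end of the output marks a Windows-style line; a Unix-style line ends in \n alone.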
To solve your problem, it would be important to figure out what line terminators your LargeFile.txt uses. The simplest way would be the file command:
file LargeFile.txt
The output will indicate if line terminators are CR or CRLF and otherwise just state that it is an ASCII file.
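For a file with old Mac-style CR terminators, for example, the output would typically look something like:
$ file LargeFile.txt
LargeFile.txt: ASCII text, with CR line terminators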
Since LF and CRLF line terminators are recognized properly on Linux, and lines should not appear merged together (no matter how you view the file) unless you specifically configure an editor so that they do, I will assume that your file has CR line terminators.
Example solution to your problem (assuming CR line terminators)
If you want to split the file in the shell and with shell commands, you will potentially face the problem that the likes of cat, split, awk, etc will not recognize line endings in the first place. If your file is very large, this may additionally lead to memory issues (?).
Therefore, the best way to handle this may be to translate the line terminators first (using the tr command) so that they are understood in Linux (i.e. to LF) and then apply your split or awk code before translating the line terminators back (if you believe you need to do this).
cat LargeFile.txt | tr "\r" "\n" > temporary_file.txt
split -l 5000 temporary_file.txt SmallFile
rm temporary_file.txt
for file in `ls SmallFile*`; do filex=$file.txt; cat $file | tr "\n" "\r" > $filex; rm $file; done
Note that the last line is actually a for loop:
for file in `ls SmallFile*`
do
filex=$file.txt
cat $file | tr "\n" "\r" > $filex
rm $file
done
This loop will again use tr to restore the CR line terminators and additionally give the resulting files a txt filename ending.
Some Remarks
Of course, if you would like to keep the LF line terminators you should not execute this line.
And finally, if you find that you have a different type of line terminators, you may need to adapt the tr command in the first line.
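For example, if file reports CRLF line terminators instead, it is usually enough to drop the CR characters in that first step (a sketch):
tr -d '\r' < LargeFile.txt > temporary_file.txt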
Both tr and split (and also cat and rm) are part of GNU coreutils and should be installed on your system unless you are in a very untypical environment (a rescue shell of an initial RAM disk, perhaps). The same goes for the file command, which should also typically be available.
I need to read a file into an array and concatenate a string at the end of each line. Here is my bash script:
#!/bin/bash
IFS=$'\n' read -d '' -r -a lines < ./file.list
for i in "${lines[@]}"
do
tmp="$i"
tmp="${tmp}stuff"
echo "$tmp"
done
However, when I do this, the string seems to replace the beginning of each line instead of being concatenated.
For example, in the file.list, we have:
http://www.example1.com
http://www.example2.com
What I need is:
http://www.example1.comstuff
http://www.example2.comstuff
But after executing the script above, I get things as below on the terminal:
stuff//www.example1.com
stuff//www.example2.com
Btw, my PC runs macOS.
The problem also occurs when concatenating strings via awk, printf, and echo commands, for example echo $tmp"stuff" or echo "${tmp}""stuff".
The file ./file.list was, most probably, generated on a Windows system or, at least, it was saved using the Windows convention for end of line.
Windows uses a sequence of two characters to mark the end of lines in a text file. These characters are CR (\r) followed by LF (\n). Unix-like systems (Linux and macOS starting with version 10) use LF as the end-of-line character.
The assignment IFS=$'\n' in front of read in your code tells read to use LF as line separator. read doesn't store the LF characters in the array it produces (lines[]) but each entry from lines[] ends with a CR character.
The line tmp="${tmp}stuff" does what it is supposed to do, i.e. it appends the word stuff to the content of the variable tmp (a line read from the file).
The first line read from the input file contains the string http://www.example1.com followed by the CR character. After the string stuff is appended, the content of variable tmp is:
http://www.example1.com$'\r'stuff
The CR character is not printable. It has a special interpretation when it is printed on the terminal: it sends the cursor to the start of the line (column 1) without advancing to a new line.
When echo prints the line above, it prints (starting on a new line) http://www.example1.com, then the CR character that sends the cursor back to the start of the line, where it prints the string stuff. The stuff fragment overwrites the first 5 characters already printed on that line (http:) and the result, as it is visible on screen, is:
stuff//www.example1.com
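You can reproduce this effect directly in a terminal, for example:
printf 'http://www.example1.com\rstuff\n'
The terminal displays stuff//www.example1.com, even though all the characters were written.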
The solution is to get rid of the CR characters from the input file. There are several ways to accomplish this goal.
A simple way to remove the CR characters from the input file is to use the command:
sed -i.bak s/$'\r'//g file.list
It removes all the CR characters from the content of file.list, saves the updated content back into file.list, and stores the original file as file.list.bak (a backup copy in case the command doesn't produce the output you expect).
Another way to get rid of the CR character is to ask the shell to remove it in the command where stuff is appended:
tmp="${tmp/$'\r'/}stuff"
When a variable is expanded in a construct like ${tmp/a/b}, all the appearances of a in $tmp are replaced with b. In this case we replace \r with nothing.
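A quick illustration of this expansion with a throwaway value:
tmp=$'www.example1.com\r'
printf '%s\n' "${tmp/$'\r'/}stuff"    # prints: www.example1.comstuff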
I'm guessing it has something to do with the carriage return character.
Was your file.list created on Windows? If so, try using dos2unix before running the script.
Edit
You can check your files using the file command.
Example:
file file.list
If you saved the file in Windows Notepad, then it will probably come up as something like:
file.list: ASCII text, with no line terminators
You can use built-in tools like iconv to convert the encodings. However, for a simple use case like this, you can just use a command that works for multiple encodings without any conversion necessary.
You could simply buffer the file through cat, and use a regular expression that applies to either:
Carriage return followed by line terminator, or
Line terminator on its own
Then append the string.
Example:
cat file.list | grep -E -v "^$" | sed -E -e "s/(\r?$)/stuff/g"
Will work with ASCII text, and ASCII text with no line terminators.
If you need to modify a stream to append a fixed string, you can use sed or awk, for instance:
sed 's/$/stuff/'
to append stuff to the end of each line.
using "dos2unix file.list" would also solve the problem
On my Linux directory I have 6 files. 5 files are txt files and 1 file a .tar.gz type file. How can I print to the terminal only the name of the txt files?
directory: dir
content:
ex1, ex2, ex3, ex4, ex5, ex6.tar.gz
Because you do not have a file extension (.txt) I would try to do it with exclusion.
ls | grep -v tar.gz
If you have multiple types then use extensions.
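For instance, if the text files did carry a .txt extension, you could select by extension, or exclude several extensions at once (hypothetical names):
ls *.txt
ls | grep -v -e '\.tar\.gz$' -e '\.zip$'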
The command 'file', followed by the name of a file, will return the type of the file.
You can loop over the files in your directory, use each filename as input to the 'file' command, and if it is a text file, print that filename.
The following includes some extra output from the file command, which I'm not sure how to remove yet, but it does give you the filenames you want:
#!/bin/bash
for f in *
do
file "$f" | grep text
done
You can put this into a shell script in the directory you want to get the filenames from, and run it from the command line.
The suggestions of using the file command are correct. The problem here is parsing the output of this command, because (1) file names can contain pretty much any character, and (2) the concrete output of the file command is a bit unpredictable, because it depends on which so-called magic files are present.
If we rely on the fact that the explanation text in the output of the file command - i.e. the part which explains what kind of file it is - always contains the word text if it is a text file, and that it never contains a colon, we can process it as follows:
The last colon in the output must separate the filename from the explanation. Everything to the left is the filename, and if the word text (note the leading space before text!) occurs in the right part, we have a text file.
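A minimal sketch of that idea in plain shell (illustrative, not hardened against every filename):
#!/bin/sh
for f in *; do
    out=$(file "$f")
    name=${out%:*}      # everything before the last colon: the filename
    desc=${out##*:}     # everything after the last colon: the explanation
    case $desc in
        *" text"*) printf '%s\n' "$name" ;;
    esac
done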
This still leaves us with those (hopefully rare) cases where a file name contains a non-printable character; such characters would be translated to their octal equivalents, which might or might not be what you want to see. You can suppress this by passing the -r option to the file command. This is useful if you want to process the filename further instead of just displaying it to the user, but it might corrupt your parsing logic, especially if the filename contains a newline.
Finally, don't forget that in any case, you see what the system considers a text file. This is not necessarily the same as what you define to be a text file.
Updated Answer
As #hek2mgl points out in the comments, a more robust solution is to separate filenames using NUL characters (which cannot occur in filenames); that will deal with filenames containing newlines and colons:
file -0 * | awk -F'\0' '$2 ~ /text/{print $1}'
Original Answer
I would do this:
file * | awk -F: '$2~/text/{print $1}'
That runs file to see the type of each file and passes the names and types to awk separated by a colon. awk then looks for the word text in the second field and if it finds it, prints the first field - which is the filename.
Try running the following simpler command on its own to see how it works:
file *
Given this directory of files:
$ file *
1.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
2.pdf: PDF document, version 1.5
3.pdf: PDF document, version 1.5
4.dat: data
5.txt: ASCII text
6.jpg: JPEG image data, JFIF standard 1.02, aspect ratio, density 100x100, segment length 16, baseline, precision 8, 2833x972, frames 3
7.html: HTML document text, UTF-8 Unicode text, with very long lines, with no line terminators
8.js: UTF-8 Unicode text
9.xml: XML 1.0 document text
A.pl: a /opt/local/bin/perl script text executable, ASCII text
B.Makefile: makefile script text, ASCII text
C.c: c program text, ASCII text
D.docx: Microsoft Word 2007+
You can see the only files that are pure ascii are 5.txt, 9.xml, and A-C. The rest are either binary or UTF according to file.
You can use a Bash glob to loop through files and use file to test each one. This saves having to parse the output of file for the file names, but relies on file to accurately identify what you consider to be 'text':
for fn in *; do
[ -f "$fn" ] || continue
fo=$(file "$fn")
[[ $fo =~ ^"$fn":.*text ]] || continue
echo "$fn"
done
If you cannot use file, which is certainly the easiest way, you can open the file and look for binary characters. Use Perl for that:
for fn in *; do
[ -f "$fn" ] || continue
head -c 2000 "$fn" | perl -lne '$tot+=length; $cnt+=s/[^[:ascii:]]//g; END{exit 1 if($cnt/$tot>0.03);}'
[ $? -eq 0 ] || continue
echo "$fn"
done
In this case, I am looking for a percentage of ascii vs non ascii in the first 2000 bytes of a file. YMMV but that allows finding a file that file would report as UTF (since it has a binary BOM) but most of the file is ascii.
For that directory, the two Bash scripts report (with my comments on each file):
1.txt # UTF file with a binary BOM but no UTF characters -- all ascii
4.dat # text based configuration file for a router. file does not report this
5.txt # Pure ascii file
7.html # html file
8.js # Javascript sourcecode
9.xml # xml file all text
A.pl # Perl file
B.Makefile # Unix make file
C.c # C source file
Since file does not consider the all ascii file 4.dat to be text, it is not reported by the first Bash script but is by the second. Otherwise -- same output.
I have created an sh script in Unix and it basically compiles a program specified via argument, for example: sh shellFile cprogram.c, and feeds the compiled program a variety of input files.
These input files were created in Unix and I have tested them; I am happy with those results. I have even tested it with input files, made in Unix, that had an extra carriage return at the end, and I still get good results; let's name that test file 00ecr for reference. Is it possible that if I created a test file in Windows and transferred it over to Unix, this new test file, let's call it 00wind, would produce bad results in my shell program?
This is just a theoretical question overall. I am just wondering if it will muck things up, even though I tested my shell script with Unix-made files that accounted for extra carriage returns.
How about using the Linux file command in your script to identify whether the file has Windows-style line terminators:
$ file test.txt
test.txt: ASCII text, with CRLF line terminators
So your script could have a converting function like this:
#!/bin/bash
windows_file=`file "$1" | grep CRLF`
if [[ -z "$windows_file" ]]; then
echo "File already in unix format"
else
# file need converting
dos2unix $1
echo "Converted a windows file"
fi
So here we first use the file utility to output the file type and grep for the CRLF string to see if the file needs converting. The grep will return nothing if the file is not in Windows format, and we can test for a null string with the if [[ -z "$var" ]] statement.
And then just dos2unix utility to convert it:
$ dos2unix test.txt
dos2unix: converting file test.txt to Unix format ...
$ file test.txt
test.txt: ASCII text
This way you could ensure that your input is always "sanitized".