I have some subtitle files in UTF-8. Sometimes there are sporadic multibyte characters in these files which cause problems in some applications.
How do I check in Linux whether a certain file contains any multibyte characters (and possibly locate them)?
You can use the file command:
chalet16$ echo test > a.txt
chalet16$ echo testก > b.txt # contains a Thai character
chalet16$ file *.txt
a.txt: ASCII text
b.txt: UTF-8 Unicode text
You can use the file or chardet command.
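If you also need to locate the offending characters, one possible sketch is GNU grep with a Perl-compatible pattern that matches any non-ASCII byte (the -P option is a GNU extension; the file name is a placeholder):
grep -nP '[^\x00-\x7F]' subtitles.srt
This prints every line containing at least one non-ASCII character, prefixed with its line number.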
Related
I have a file coded as UTF-16 and I want to split each line into fields (at fixed positions) separated by commas.
I have tried the following:
Option 1)
sed -i 's/./&,/400;s/./&,/360;<and so on for several positions in the line>' FILE
This seems to work, but when editing the file with vim it is obvious that something is wrong, since the commas are displayed as single-byte characters while the other symbols are displayed as two-byte characters.
BEFORE: 2^@A^@U^@W^@2^@0^@1^@9^@0^@1^@0^@1^@0^@0^@0^@1^@0^@0^@
AFTER: 2^@,A^@U^@,W^@,2^@0^@1^@9^@0^@1^@0^@1^@,0^@0^@0^@1^@0^@0^@,
Option 2)
Then I tried to use sed again but, instead of "," I typed the UTF-16 code of the comma, that is 002c:
sed -i "s/./&\U002c/400...
or even
s/./&$(echo -ne '\u002c')/400...
Neither of these options worked; the results are exactly the same as in Option 1.
The problem is not the insertion of a new character with sed.
I noticed that the file is UTF-16LE (I was assuming it was UTF-16BE). I converted it directly to ASCII with
iconv -f UTF-16LE -t ASCII sourcefile >destinationfile
and after that I could handle the file as usual with sed, grep, awk, etc.
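A sketch of the full round trip under that approach (file names are placeholders, and the comma positions are just examples): convert to a single-byte encoding, edit with the usual tools, then convert back if the consumer expects UTF-16LE:
iconv -f UTF-16LE -t ASCII sourcefile > work.txt
sed 's/./&,/400;s/./&,/360' work.txt > edited.txt
iconv -f ASCII -t UTF-16LE edited.txt > destinationfile
Note that the conversion back to UTF-16LE does not re-create a BOM, so if the original file started with one you may need to prepend it yourself.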
I have a file which contains the letter ö. Except that it doesn't. When I open the file in gedit, I see:
\u00f6
I tried to convert the file, applying code that I found on other threads:
$ file blb.txt
blb.txt: ASCII text
$ iconv -f ISO-8859-15 -t UTF-8 blb.txt > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: ASCII text
What am I missing?
EDIT
I found this solution:
echo -e "$(cat blb.txt)" > blb_tmp.txt
$ file blb_tmp.txt
blb_tmp.txt: UTF-8 Unicode text
The -e "enables interpretation of backslash escapes".
Still not sure why iconv didn't make it happen. My guess is that iconv only converts the byte encoding and doesn't interpret escape sequences: the file literally contains the six ASCII characters \u00f6, which are already valid UTF-8, so there was nothing for it to re-encode. Why did the Unicode people make this world such a mess? :D
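One way to see what actually changed is to look at the bytes on both sides of the conversion (a sketch; xxd ships with vim):
xxd blb.txt      # the original only holds the six literal ASCII characters \u00f6
xxd blb_tmp.txt  # the new file holds the two-byte UTF-8 sequence c3 b6, i.e. a real ö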
I have a file in UTF-8 encoding with a BOM and want to remove the BOM. Are there any Linux command-line tools to remove the BOM from the file?
$ file test.xml
test.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines
Using VIM
Open file in VIM:
vi test.xml
Remove the BOM:
:set nobomb
Save and quit:
:wq
For a non-interactive solution, try the following command line:
vi -c ":set nobomb" -c ":wq" text.xml
That should remove the BOM, save the file and quit, all from the command line.
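To confirm that it worked, you can run file again on the file from the question; the "(with BOM)" part should be gone from the output:
file test.xml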
A BOM is the Unicode codepoint U+FEFF; its UTF-8 encoding consists of the three bytes 0xEF, 0xBB, 0xBF.
With bash, you can create a UTF-8 BOM with the $'' special quoting form, which implements Unicode escapes: $'\uFEFF'. So with bash, a reliable way of removing a UTF-8 BOM from the beginning of a text file would be:
sed -i $'1s/^\uFEFF//' file.txt
This will leave the file unchanged if it does not start with a UTF-8 BOM, and otherwise remove the BOM.
If you are using some other shell, you might find that "$(printf '\ufeff')" produces the BOM character (that works with zsh as well as with any shell without a printf builtin, provided that /usr/bin/printf is the GNU version), but if you want a POSIX-compatible version you could use:
sed "$(printf '1s/^\357\273\277//')" file.txt
(The -i in-place edit flag is also a GNU extension; this version writes the possibly-modified file to stdout.)
Well, I just dealt with this today and my preferred way was dos2unix:
dos2unix will remove the BOM and also take care of other idiosyncrasies coming from other operating systems:
$ sudo apt install dos2unix
$ dos2unix test.xml
It's also possible to remove BOM only (-r, --remove-bom):
$ dos2unix -r test.xml
Note: tested with dos2unix 7.3.4
If you are certain that a given file starts with a BOM, then it is possible to remove the BOM from the file with the tail command:
tail --bytes=+4 withBOM.txt > withoutBOM.txt
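If you are not certain, a small guard is possible (a sketch, assuming a UTF-8 BOM and that xxd is installed): only skip the first three bytes when they really are EF BB BF:
if [ "$(head -c 3 withBOM.txt | xxd -p)" = "efbbbf" ]; then
    tail --bytes=+4 withBOM.txt > withoutBOM.txt
else
    cp withBOM.txt withoutBOM.txt
fi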
Joshua Pinter's answer works correctly on macOS, so I wrote a script that removes the BOM from all files in a given folder, see here.
It can be used as follows:
Remove BOM from all files in current directory: rmbom .
Print all files with a BOM in the current directory: rmbom . -a
Only remove BOM from all files in current directory with extension txt or cs: rmbom . -e txt -e cs
If you want to work on many files at once, building on Reginaldo Santos's answer, there is a quick way:
find . -name "*.java" | grep java$ | xargs -n 1 dos2unix
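If the tree can contain file names with spaces or other unusual characters, a null-delimited variant is safer (a sketch, assuming GNU find and xargs):
find . -name "*.java" -print0 | xargs -0 dos2unix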
In a directory on my Linux machine I have 6 files. 5 of them are text files and 1 is a .tar.gz file. How can I print to the terminal only the names of the text files?
directory: dir
content: ex1, ex2, ex3, ex4, ex5, ex6.tar.gz
Because the text files do not have a file extension (.txt), I would try to do it with exclusion.
ls | grep -v tar.gz
If you have multiple types then use extensions.
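For example, to exclude several archive extensions at once (a sketch, assuming GNU grep; the extensions are placeholders):
ls | grep -v -E '\.(tar\.gz|zip|bz2)$'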
The command 'file', followed by the name of a file, will return the type of the file.
You can loop over the files in your directory, use each filename as input to the 'file' command, and if it is a text file, print that filename.
The following includes some extra output from the file command, which I'm not sure how to remove yet, but it does give you the filenames you want:
#!/bin/bash
# print the file(1) line for every entry whose type description contains "text"
for f in *
do
    file "$f" | grep text
done
You can put this into a shell script in the directory you want to get the filenames from, and run it from the command line.
The suggestions of using the file command are correct. The problem here is parsing the output of this command, because (1) file names can contain pretty much any character, and (2) the concrete output of the file command is a bit unpredictable, because it depends on which so-called magic files are present.
If we rely on the fact that the explanation text of the output of the file command - i.e. that part which explains what file it is - always contains the word text if it is a text file, and that it never contains a colon, we can process it as follows:
The last colon in the output separates the filename from the explanation. Everything to the left is the filename, and if the word text (note the leading space before text!) occurs in the right-hand part, we have a text file (see the sketch at the end of this answer).
This still leaves us with those (hopefully rare) cases where a file name contains a non-printable character; such characters would be translated to their octal equivalent, which might or might not be what you want to see. You can suppress this by passing the -r option to the file command. This is useful if you want to process the filename further instead of just displaying it to the user, but it might corrupt your parsing logic, especially if the filename contains a newline.
Finally, don't forget that in any case you see what the system considers a text file, which is not necessarily the same as what you define to be a text file.
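A minimal sketch of that rule in Bash, under the assumptions above (the description part never contains a colon and always contains the word text for text files):
for f in *; do
    [ -f "$f" ] || continue
    out=$(file "$f")
    name=${out%:*}    # everything left of the last colon: the file name
    desc=${out##*:}   # everything right of the last colon: the description
    case $desc in
        *" text"*) printf '%s\n' "$name" ;;
    esac
done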
Updated Answer
As @hek2mgl points out in the comments, a more robust solution is to separate the filenames using NUL characters (which cannot occur in filenames); that deals with filenames containing newlines and colons:
file -0 * | awk -F'\0' '$2 ~ /text/{print $1}'
Original Answer
I would do this:
file * | awk -F: '$2~/text/{print $1}'
That runs file to see the type of each file and passes the names and types to awk separated by a colon. awk then looks for the word text in the second field and if it finds it, prints the first field - which is the filename.
Try running the following simpler command on its own to see how it works:
file *
Given this directory of files:
$ file *
1.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
2.pdf: PDF document, version 1.5
3.pdf: PDF document, version 1.5
4.dat: data
5.txt: ASCII text
6.jpg: JPEG image data, JFIF standard 1.02, aspect ratio, density 100x100, segment length 16, baseline, precision 8, 2833x972, frames 3
7.html: HTML document text, UTF-8 Unicode text, with very long lines, with no line terminators
8.js: UTF-8 Unicode text
9.xml: XML 1.0 document text
A.pl: a /opt/local/bin/perl script text executable, ASCII text
B.Makefile: makefile script text, ASCII text
C.c: c program text, ASCII text
D.docx: Microsoft Word 2007+
You can see that the only files that are pure ASCII are 5.txt, 9.xml, and A-C. The rest are either binary or UTF according to file.
You can use a Bash glob to loop through the files and use file to test each one. This saves having to parse the output of file for the file names, but relies on file to accurately identify what you consider to be 'text':
for fn in *; do
[ -f "$fn" ] || continue
fo=$(file "$fn")
[[ $fo =~ ^"$fn":.*text ]] || continue
echo "$fn"
done
If you cannot use file, which is certainly the easiest way, you can open the file and look for binary characters. Use Perl for that:
for fn in *; do
[ -f "$fn" ] || continue
head -c 2000 "$fn" | perl -lne '$tot+=length; $cnt+=s/[^[:ascii:]]//g; END{exit 1 if($cnt/$tot>0.03);}'
[ $? -eq 0 ] || continue
echo "$fn"
done
In this case, I am looking at the percentage of ASCII vs non-ASCII characters in the first 2000 bytes of a file. YMMV, but that allows finding a file that file would report as UTF (since it has a binary BOM) even though most of the file is ASCII.
For that directory, the two Bash scripts report (with my comments on each file):
1.txt # UTF file with a binary BOM but no UTF characters -- all ascii
4.dat # text based configuration file for a router. file does not report this
5.txt # Pure ascii file
7.html # html file
8.js # Javascript sourcecode
9.xml # xml file all text
A.pl # Perl file
B.Makefile # Unix make file
C.c # C source file
Since file does not consider the all-ASCII file 4.dat to be text, it is not reported by the first Bash script but it is by the second. Otherwise, the output is the same.
Does anyone know how to replace a string containing \u2015 in a sed command like the example below?
sed -ie "s/some text \u2015 some more text/new text/" inputFileName
You just need to escape the backslash. The example below works fine in GNU sed version 4.2.1:
$ echo "some text \u2015 some more text" | sed -e "s/some text \\\u2015 some more text/abc/"
abc
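If, on the other hand, the file contains the actual U+2015 character (HORIZONTAL BAR) rather than the six literal characters \u2015, one possible sketch is Bash's $'...' quoting, which was also used in the BOM answer above (requires bash 4.2+ and a UTF-8 locale; the file name is taken from the question):
sed -i $'s/some text \u2015 some more text/new text/' inputFileName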
Also, you don't have to use the -i flag, which according to the man page is only for editing files in place:
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if extension supplied). The default operation mode is to break symbolic and hard links. This can be changed with --follow-symlinks and
--copy.
Not sure if this is exactly what you need, but maybe you should take a look at the native2ascii tool to convert such Unicode escapes.
Normally it replaces all characters that cannot be represented in ASCII with their Unicode escapes (\uNNNN), but it also supports the reverse conversion. Assuming you have some UTF-8 file named "input" containing \u00abSome \u2015 string\u00bb, then executing
native2ascii -encoding UTF-8 -reverse input output
will result in an "output" file containing «Some ― string».