Unicode character not visible while doing cat - linux

I have a CSV file generated by a windows system. The file is then moved to linux. The linux environment is NAME="Red Hat Enterprise Linux Server".VERSION="7.3 (Maipo)".ID="rhel".
When I use vi editor, all characters are visible. For example, one line is given :"Sarah--bitte nicht löschen".
But when i cat the file, i get something like "Sarah--bitte nicht l▒schen".
This file is consumed by datastage application and this unicode characters are coming as "?" in datastage. Since cat is not showing the character properly, I believe the issue is at the linux server. Any help is appreciated.

vi reads the file using encoding according fenc setting and show the content using your locales setting ($LANG env). If fenc is different from LANG, vi can handle the translate.
But cat doesn't handle the translate, it always output the exact byte stream without any convert.
Your terminal will show the output content of both vi and cat using your local PC locale setting.

Related

Preserving accentuated letters when running a PERL script from linux terminal

I want to get a plain text file from the French Wikipedia dump XML file.
To that end, I am applying a Perl script
I can give the full file if necessary, I only added the line
tr/a-zàâééèëêîôûùç-/ /cs;
to the script here: http://mattmahoney.net/dc/textdata.html
However, when I run on linux terminal:
perl filterwikifr.pl frwiki.xml > frwikiplaintext.txt
the output text file does not print accentuated letters correctly. For example, I get catégorie instead of catégorie...
I also tried:
perl -CS filterwikifr.pl frwiki.xml > frwikiplaintext.txt
without better success (and other variants instead of -CS...)
the problem is with the text editor gedit.
If, instead of opening the file directly, I open gedit, and then go to "open" and down, in "Character encoding", I choose UTF-8 instead of "Automatically Detected", then the accents are printed correctly.

syntax error near unexpected token ' - bash

I have a written a sample script on my Mac
#!/bin/bash
test() {
echo "Example"
}
test
exit 0
and this works fine by displaying Example
When I run this script on a RedHat machine, it says
syntax error near unexpected token '
I checked that bash is available using
cat /etc/shells
which bash shows /bin/bash
Did anyone come across the same issue ?
Thanks in advance !
It could be a file encoding issue.
I have encountered file type encoding issues when working on files between different operating systems and editors - in my case particularly between Linux and Windows systems.
I suggest checking your file's encoding to make sure it is suitable for the target linux environment. I guess an encoding issue is less likely given you are using a MAC than if you had used a Windows text editor, however I think file encoding is still worth considering.
--- EDIT (Add an actual solution as recommended by #Potatoswatter)
To demonstrate how file type encoding could be this issue, I copy/pasted your example script into Notepad in Windows (I don't have access to a Mac), then copied it to a linux machine and ran it:
jdt#cookielin01:~/windows> sh ./originalfile
./originalfile: line 2: syntax error near unexpected token `$'{\r''
'/originalfile: line 2: `test() {
In this case, Notepad saved the file with carriage returns and linefeeds, causing the error shown above. The \r indicates a carriage return (Linux systems terminate lines with linefeeds \n only).
On the linux machine, you could test this theory by running the following to strip carriage returns from the file, if they are present:
cat originalfile | tr -d "\r" > newfile
Then try to run the new file sh ./newfile . If this works, the issue was carriage returns as hidden characters.
Note: This is not an exact replication of your environment (I don't have access to a Mac), however it seems likely to me that the issue is that an editor, somewhere, saved carriage returns into the file.
--- /EDIT
To elaborate a little, operating systems and editors can have different file encoding defaults. Typically, applications and editors will influence the filetype encoding used, for instance, I think Microsoft Notepad and Notepad++ default to Windows-1252. There may be newline differences to consider too (In Windows environments, a carriage return and linefeed is often used to terminate lines in files, whilst in Linux and OSX, only a Linefeed is usually used).
A similar question and answer that references file encoding is here: bad character showing up in bash script execution
try something like
$ sudo apt-get install dos2unix
$ dos2unix offendingfile
Easy way to convert example.sh file to UNIX if you are working in Windows is to use NotePad++ (Edit>EOL Conversion>UNIX/OSX Format)
You can also set the default EOL in notepad++ (Settings>Preferences>New Document/Default Directory>select Unix/OSX under the Format box)
Thanks #jdt for your answer.
Following that, and since I keep having this issue with carriage return, I wrote that small script. Only run carriage_return and you'll be prompted for the file to "clean".
https://gist.github.com/kartonnade/44e9842ed15cf21a3700
alias carriage_return=remove_carriage_return
remove_carriage_return(){
# cygwin throws error like :
# syntax error near unexpected token `$'{\r''
# due to carriage return
# this function runs the following
# cat originalfile | tr -d "\r" > newfile
read -p "File to clean ? "
file_to_clean=$REPLY
temp_file_to_clean=$file_to_clean'_'
# file to clean => temporary clean file
remove_carriage_return_one='cat '$file_to_clean' | tr -d "\r" > '
remove_carriage_return_one=$remove_carriage_return_one$temp_file_to_clean
# temporary clean file => new clean file
remove_carriage_return_two='cat '$temp_file_to_clean' | tr -d "\r" > '
remove_carriage_return_two=$remove_carriage_return_two$file_to_clean
eval $remove_carriage_return_one
eval $remove_carriage_return_two
# remove temporary clean file
eval 'rm '$temp_file_to_clean
}
I want to add to the answer above is how to check if it is carriage return issue in Unix like environment (I tested in MacOS)
1) Using cat
cat -e my_file_name
If you see the lines ended with ^M$, then yes, it is the carriage return issue.
2) Find first line with carriage return character
grep -r $'\r' Grader.sh | head -1
3) Using vim
vim my_file_name
Then in vim, type
:set ff
If you see fileformat=dos, then the file is from a dos environment which contains a carriage return.
After finding out, you can use the above mentioned methods by other people to correct your file.
I had the same problem when i was working with armbian linux and Windows .
i was trying to coppy my codes from windows to armbian and when i run it this Error Pops Up. My problem Solved this way :
1- try to Coppy your files from windows using WinSCP .
2- make sure that your file name does not have () characters

Is there any way to handle ISO-8859-16 / latin10 encoded files in vim?

From the help page section encoding-values:
Supported 'encoding' values are: *encoding-values*
1 latin1 8-bit characters (ISO 8859-1, also used for cp1252)
1 iso-8859-n ISO_8859 variant (n = 2 to 15)
[...]
Somehow, it seems that ISO-8859-16 / latin10 was left out? I fail to read files with that encoding correctly. Am I overlooking anything? If not, can I somehow add support for this character encoding to vim through a plugin or so?
On Windows, my version of Vim is compiled with +iconv/dyn. According to the Vim documentation:
On MS-Windows Vim can be compiled with the +iconv/dyn feature. This
means Vim will search for the "iconv.dll" and "libiconv.dll"
libraries. When neither of them can be found Vim will still work but
some conversions won't be possible.
The most recent version from the DLL from here http://sourceforge.net/projects/gettext/files/libiconv-win32/ seems to do job for me. Without it I could not convert most iso-8859 encodings other than iso-8859-1. Having iconv.dll installed I can load the files easily with:
:e ++enc=iso-8859-16 file.txt
If Vim cannot handle it, you can convert to (for example) UTF-8) with the iconv tool:
$ iconv --from-code ISO-8859-16 --to-code UTF-8 -o outputfile inputfile

Linux replace ^M$ with $ in csv

I have received a csv file from a ftp server which I am ingesting into a table.
While ingesting the file I am receiving the error "File was a truncated file"
The actual reason is the data in a file contains $ and ^M$ in end of the line.
e.g :
ACT_RUN_TM, PROG_RUN_TM, US_HE_DT*^M$*
"CONFIRMED","","3600"$
How can I remove these $ and ^M$ from end of the line using linux command.
The ultimately correct solution is to transfer the file from the FTP server in text mode rather than binary mode, which does the appropriate end-of-line conversion for you. Change your download scripts or FTP application configuration to enable text transfers to fix this in future.
Assuming this is a one-shot transfer and you have already downloaded the file and just want to fix it, you can use tr(1) to translate characters. So to remove all control-M characters from a file, you can pipe through tr -d '\r'. Or if you want to replace them with control-J instead – for example you would do this if the file came from a pre-OSX Mac system — do tr '\r' '\n'.
It's odd to see ^M as not-the-last character, but:
sed -e 's/^M*\$$//g' <badfile >goodfile
Or use "sed -i" to update in-place.
(Note that "^M" is entered on the command line by pressing CTRL-V CTRL_M).
Update: It's been established that the question is wrong as the "^M$" are not in the file but displayed with VI. He actually wants to change CRLF pairs to just LF.
sed -e 's/^M$//g' <badfile >goodfile

How to determine encoding table of a text file

I have .txt and .java files and I don't know how to determine the encoding table of the files (Unicode, UTF-8, ISO-8525, …). Does there exist any program to determine the file encoding or to see the encoding?
If you're on Linux, try file -i filename.txt.
$ file -i vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
For reference, here is my environment:
$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic
Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:
$ file -I vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
Also, have a look here.
Open the file with Notepad++ and will see on the right down corner the encoding table name. And in the menu encoding you can change the encoding table and save the file.
You can't reliably detect the encoding from a textfile - what you can do is make an
educated guess by searching for a non-ascii char and trying to determine if it is a
unicode combination that makes sens in the languages you are parsing.
See this question and the selected answer. There’s no sure-fire way of doing it. At most, you can rule things out. The UTF encodings you’re unlikely to get false positives on, but the 8-bit encodings are tough, especially if you don’t know the starting language. No tool out there currently handles all the common 8-bit encodings from Macs, Windows, Unix, but the selected answer provides an algorithmic approach that should work adequately for a certain subset of encodings.
In a text file there is no header that saves the encoding or so. You can try the linux/unix command find which tries to guess the encoding:
file -i unreadablefile.txt
or on some systems
file -I unreadablefile.txt
But that often gives you text/plain; charset=iso-8859-1 although the file is unreadable (cryptic glyphs).
This is what I did to find the correct file encoding for an unreadable file and then translate it to utf8 was, after installing iconv. First I tried all encodings, displaying (grep) a line that contained the word www. (a website address):
for ENCODING in $(iconv -l); do echo -n "$ENCODING "; iconv -f $ENCODING -t utf-8 unreadablefile.txt 2>/dev/null| grep 'www'; done | less
This last commandline shows the the tested file encoding and then the translated/transcoded line.
There were some lines which showed readable and consistent (one language at a time) results. I tried manually some of them, for example:
ENCODING=WINDOWS-936; iconv -f $ENCODING -t utf-8 unreadablefile.txt -o test_with_${ENCODING}.txt
In my case it was a chinese windows encoding, which is now readable (if you know chinese).
Does there exist any program to determine the file encoding or to see the encoding?
This question is 10 years old as I write this, and the answer is still, "No" - at least not reliably. There's not been much improvement unfortunately. My recent experience suggests the file -I command is very much "hit-or-miss". For example, when checking a text file on macOS 10.15.6:
% file -i somefile.asc
somefile.asc: application/octet-stream; charset=binary
somefile.asc was a text file. All charcters in it were encoded in UTF-16 Little Endian. How did I know this? I used BBedit - a competent text editor. Determining the encoding used in a file is certainly a tough problem, but...?
if you are using python, the chardet package is a good option, for example
from chardet.universaldetector import UniversalDetector
files = ['a-1.txt','a-2.txt']
detector = UniversalDetector()
for filename in files:
print(filename.ljust(20), end='')
detector.reset()
for line in open(filename, 'rb'):
detector.feed(line)
if detector.done: break
detector.close()
print(detector.result)
gives me as a result:
a-1.txt {'encoding': 'Windows-1252', 'confidence': 0.7255358182877111, 'language': ''}
a-2.txt {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Resources