How to insert UTF-16 character with sed in a file? - linux

I have a file coded as UTF-16 and I want to split each line into fields (at fixed positions) separated by commas.
I have tried the following:
Option 1)
sed -i 's/./&,/400;s/./&,/360;*<and so on for several positions in the line>* FILE
This seems to work, but when I open the file in vim it is obvious that something is wrong: the inserted commas are displayed as single-byte characters, while every other symbol is displayed as a two-byte character.
BEFORE: 2^@A^@U^@W^@2^@0^@1^@9^@0^@1^@0^@1^@0^@0^@0^@1^@0^@0^@
AFTER: 2^@,A^@U^@,W^@,2^@0^@1^@9^@0^@1^@0^@1^@,0^@0^@0^@1^@0^@0^@,
Option 2)
Then I tried to use sed again but, instead of "," I typed the UTF-16 code of the comma, that is 002c:
sed -i "s/./&\U002c/400...
or even
s/./&$(echo -ne '\u002c')/400...
Neither of these options worked; the results are exactly the same as in Option 1.

The problem is not the insertion of a new character with sed.
The real issue is the encoding: I noticed that the file is UTF-16LE (I had assumed it was UTF-16BE). I converted it directly to ASCII with
iconv -f UTF-16LE -t ASCII sourcefile >destinationfile
and after that I could handle the file as usual with sed, grep, awk, etc.
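A minimal sketch of the same idea as a round trip, in case the result has to stay in UTF-16LE. The file names and the sample record are made up for illustration, and the intermediate encoding is UTF-8 rather than ASCII so that non-ASCII characters would survive. The substitutions run from the highest position to the lowest, so an inserted comma does not shift the positions still to be processed:

```shell
# Hypothetical sample: one fixed-width record, stored as UTF-16LE.
printf 'AUW20190101\n' | iconv -f UTF-8 -t UTF-16LE > sample.utf16le

# Decode, insert a comma after the 7th and then after the 3rd character,
# and re-encode (the last iconv is optional if UTF-8 output is acceptable).
iconv -f UTF-16LE -t UTF-8 sample.utf16le \
  | sed 's/./&,/7;s/./&,/3' \
  | iconv -f UTF-8 -t UTF-16LE > sample.csv.utf16le

# Show the result in a readable encoding.
iconv -f UTF-16LE -t UTF-8 sample.csv.utf16le
```

Because sed only ever sees the decoded text, every character, including the inserted commas, ends up as a proper two-byte unit after re-encoding.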

Related

How to replace non printable characters in file like <97> on linux [duplicate]

I am trying to remove non-printable characters (e.g. ^@) from the records in my file. Because the file contains a very large number of records, looping over it with cat is not an option: the loop takes too much time.
I tried using
sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILENAME
but still the ^@ characters are not removed.
Also I tried using
awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' FILENAME > NEWFILE
but it also did not help.
Can anybody suggest some alternative way to remove non-printable characters?
I also tried tr -cd, but it removes accented characters, which I need to keep in the file.
Perhaps you could go with the complement of [:print:], which contains all printable characters:
tr -cd '[:print:]' < file > newfile
If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):
sed 's/[^[:print:]]//g' file
Remove all control characters first:
tr -dc '\007-\011\012-\015\040-\376' < file > newfile
Then try your string:
sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' newfile
I believe that what you see as ^@ is in fact a zero byte (\0, NUL).
The tr filter from above will remove those as well.
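If the NUL bytes are the only problem, a narrower filter may be enough. The following is a sketch under that assumption, with made-up file names and sample data; it deletes only byte 0 and leaves every other byte alone, so UTF-8 accented characters pass through untouched:

```shell
# Hypothetical sample: "café ok" with NUL bytes mixed in
# (é is written as its two UTF-8 bytes, octal 303 251).
printf 'c\000a\000f\303\251\000 ok\n' > nul_sample.txt

# Delete only the NUL bytes; all other bytes are copied unchanged.
tr -d '\000' < nul_sample.txt > nul_clean.txt

cat nul_clean.txt
```

Since tr works on single bytes here and NUL never occurs inside a UTF-8 multibyte sequence, this does not depend on tr having multibyte support.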
strings -1 file... > outputfile
seems to work. The strings program will take all printable characters, in this case of length 1 (the -1 argument) and print them. It effectively is removing all the non-printable characters.
"man strings" will provide the documentation.
Was searching for this for a while & found a rather simple solution:
The package ansifilter does exactly this. All you need to do is just pipe the output through it.
On Mac:
brew install ansifilter
Then:
cat file.txt | ansifilter

concatenation of strings in bash results in substitution

I need to read a file into an array and concatenate a string at the end of each line. Here is my bash script:
#!/bin/bash
IFS=$'\n' read -d '' -r -a lines < ./file.list
for i in "${lines[@]}"
do
tmp="$i"
tmp="${tmp}stuff"
echo "$tmp"
done
However, when I do this, the beginning of each line is replaced instead of the string being appended.
For example, in the file.list, we have:
http://www.example1.com
http://www.example2.com
What I need is:
http://www.example1.comstuff
http://www.example2.comstuff
But after executing the script above, I get things as below on the terminal:
stuff//www.example1.com
stuff//www.example2.com
Btw, my PC is Mac OS.
The problem also occurs while concatenating strings via awk, printf, and echo commands. For example echo $tmp"stuff" or echo "${tmp}""stuff"
The file ./file.list was most probably generated on a Windows system or, at least, saved using the Windows convention for line endings.
Windows uses a sequence of two characters to mark the end of a line in a text file: CR (\r) followed by LF (\n). Unix-like systems (Linux and macOS starting with version 10) use a single LF as the end-of-line character.
The assignment IFS=$'\n' in front of read in your code tells read to use LF as line separator. read doesn't store the LF characters in the array it produces (lines[]) but each entry from lines[] ends with a CR character.
The line tmp="${tmp}stuff" does what it is supposed to do, i.e. it appends the word stuff to the content of the variable tmp (a line read from the file).
The first line read from the input file contains the string http://www.example1.com followed by the CR character. After the string stuff is appended, the content of variable tmp is:
http://www.example1.com$'\r'stuff
The CR character is not printable. It has a special interpretation when it is printed on the terminal: it sends the cursor to the start of the line (column 1) without moving to a new line.
When echo prints the line above, it prints (starting on a new line) http://www.example1.com, then the CR character, which sends the cursor back to the start of the line, where it prints the string stuff. The stuff fragment overwrites the first 5 characters already printed on that line (http:) and the result, as it is visible on screen, is:
stuff//www.example1.com
The solution is to get rid of the CR characters from the input file. There are several ways to accomplish this goal.
A simple way to remove the CR characters from the input file is to use the command:
sed -i.bak s/$'\r'//g file.list
It removes all the CR characters from the content of file file.list, saves the updated string back into the file.list file and stores the original file.list file as file.list.bak (a backup copy in case it doesn't produce the output you expect).
Another way to get rid of the CR character is to ask the shell to remove it in the command where stuff is appended:
tmp="${tmp/$'\r'/}stuff"
When a variable is expanded in a construct like ${tmp/a/b}, the first occurrence of a in $tmp is replaced with b (use ${tmp//a/b} to replace every occurrence). In this case we replace \r with nothing.
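A small sketch of the effect and the fix, using a made-up variable instead of reading file.list. It swaps in suffix removal (${line%...}) for the pattern substitution above, because the CR always sits at the end of the line and this form also works in a plain POSIX sh:

```shell
# Build a line that ends with CR, as read would produce from a CRLF file.
cr=$(printf '\r')
line="http://www.example1.com$cr"

# Printing the raw concatenation to a terminal shows "stuff//www.example1.com",
# because the CR moves the cursor back to column 1 before "stuff" is written.
broken="${line}stuff"

# Strip one trailing CR (if present) before appending.
tmp="${line%"$cr"}stuff"
printf '%s\n' "$tmp"
```

The suffix form is a no-op on lines without a trailing CR, so the same script keeps working once the input file has been converted.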
I'm guessing it has something to do with the carriage return character.
Was your file.list created on Windows? If so, try running dos2unix on it before running the script.
Edit
You can check your files using the file command.
Example:
file file.list
If you saved the file in Windows Notepad, it will probably come up like this:
file.list: ASCII text, with no line terminators
You can use built in tools like iconv to convert the encodings. However for a simple use like this, you can just use a command that works for multiple encodings without any conversion necessary.
You could simply buffer the file through cat, and use a regular expression that applies to either:
Carriage return followed by line terminator, or
Line terminator on its own
Then append the string.
Example:
cat file.list | grep -E -v "^$" | sed -E -e "s/(\r?$)/stuff/g"
Will work with ASCII text, and ASCII text with no line terminators.
If you need to modify a stream to append a fixed string, you can use sed or awk, for instance:
sed 's/$/stuff/'
to append stuff to the end of each line.
Running "dos2unix file.list" would also solve the problem.

Remove lines with japanese characters from a file

First question on here- I've searched around to put together an answer to this but have come up empty thus far.
I have a multi-line text file that I am cleaning up. Part of this is to remove lines that include Japanese characters. I have been using sed for my other operations but it is not working in this instance.
I was under the impression that using the -r switch and the \p{Han} regular expression would work (from looking at other questions of this kind), but it is not working in this case.
Here is my test string - running this returns the full string, and does not filter out the JP characters as I was expecting.
echo 80岁返老还童的处女: 第3话 | sed -r "s/\\p\{Han\}//g"
Am I missing something? Is there another command I should be using instead?
I think this might work for you:
echo "80岁返老还童的处女: 第3话" | tr -cd '[:print:]\n'
As far as I know, sed supports neither Unicode character classes such as \p{Han} nor multibyte ranges.
-d deletes characters in SET1, and -c reverses it.
[:print:] matches all printable characters including space.
\n is a newline
The above will not only remove Japanese characters but all multibyte characters, including control characters.
Perl can also be used:
PERLIO=:utf8 perl -pe 's/\p{Han}//g' file
PERLIO=:utf8 tells Perl to treat input and output as UTF-8

Sed is not writing to file

I simply want to change the delimiter in my CSV.
The file comes from an outside server, and the delimiter looks something like this: ^A.
name^Atype^Avalue^A
john^Ab^A500
mary^Ac^A400
jack^Ad^A200
I want to get this:
name,type,value
john,b,500
mary,c,400
jack,d,200
I need to change it to a comma (,) or a tab, but my sed command, despite producing the correct output, does not write to the file.
cat -v CSVFILE | sed -i "s/\^A/,/g"
When I use the line above, it correctly outputs the file delimited by commas instead of ^A, but it doesn't write to the file.
I also tried like this:
sed -i "s/\^A/,/g" CSVFILE
That does not work either.
What am I doing wrong?
Literal ^A (two characters, ^ and A) is how cat -v visualizes control character 0x1 (ASCII code 1, named SOH (start of heading)). ^A is an example of caret notation to represent unprintable ASCII characters:
^A stands for the keyboard combination Control-A, which, when preceded by the generic escape sequence Control-V, is how you can create the actual control character in your terminal; in other words, Control-V Control-A will insert an actual 0x1 character.
Incidentally, the logic of caret notation (^<letter>) is: the letter corresponds to the ASCII value of the control character represented; e.g., A corresponds to 0x1, and D corresponds to 0x4 (^D, EOT).
To put it differently: you add 0x40 to the ASCII value of the control character to get the ASCII value of its letter representation in caret notation.
^@ to represent NUL (0x0) and ^? to represent DEL (0x7f) are consistent with this notation, because @ has ASCII value 0x40 (i.e., it comes just before A (0x41) in the ASCII table), and 0x40 + 0x7f constrained to 7 bits (bit-ANDed with the maximum ASCII value 0x7f) yields 0x3f, which is the ASCII value of ?.
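The arithmetic can be sketched as a small shell function. The name caret is made up for illustration; it takes the numeric value of a control character and prints its caret notation by adding 0x40 and masking to 7 bits, exactly as described above:

```shell
# caret CODE: print the caret notation for a control character value.
caret() {
  # (code + 0x40) constrained to 7 bits is the ASCII code of the symbol.
  letter_code=$(( ($1 + 0x40) & 0x7f ))
  # Turn the code into a character via an octal escape in the printf format.
  printf "^\\$(printf '%03o' "$letter_code")\n"
}

caret 0     # NUL
caret 1     # SOH, the delimiter in the question
caret 4     # EOT
caret 127   # DEL
```

The four calls print ^@, ^A, ^D and ^? respectively.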
To inspect a given file for the ASCII values of exotic control characters, you can pipe it to od -c, which represents 0x1 as (octal) 001.
This implies that, when passing the file to sed directly, you cannot use caret notation and must instead use the actual control character in your s call.
Note that when you use Control-V Control-A to create an actual 0x1 character, it will also appear in caret notation, as ^A, but in that case it is just the terminal's visualization of the true control character; while it may look like the two printable characters ^ and A, it is not. Visually you cannot tell the difference, which is why using an escape sequence or an ANSI C-quoted string to represent the control character is the better choice (see below).
Assuming your shell is bash, ksh, or zsh, the better alternative to using Control-V Control-A is to use an ANSI C-quoted string to generate the 0x1 character: $'\1'
However, as Lars Fischer points out in a comment on the question, GNU sed also recognizes escape sequence \x01 for 0x1.
Thus, your command should be:
sed -i 's/\x01/,/g' CSVFILE # \x01 only recognized by GNU sed
or, using an ANSI C-quoted string:
sed -i $'s/\1/,/g' CSVFILE
Note: While this form can in principle be used with BSD/OSX sed, the -i syntax is slightly different: you'd have to use sed -i '' $'s/\1/,/g' CSVFILE
The only reason to use sed for your task is to take advantage of in-place updating (-i); otherwise, tr is the better choice - see Ed Morton's answer.
This is the job tr was created to do:
tr '<control-A>' ',' < file > tmp && mv tmp file
Replace <control-A> with a literal control-A obviously.
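To avoid typing a literal control character at all, tr also accepts an octal escape for it. A minimal sketch with a made-up sample file, following the same temp-file-then-rename pattern:

```shell
# Hypothetical sample: two records delimited by Control-A (octal 001).
printf 'name\001type\001value\njohn\001b\001500\n' > CSVFILE.sample

# Translate every 0x01 byte to a comma, then replace the original file.
tr '\001' ',' < CSVFILE.sample > CSVFILE.tmp && mv CSVFILE.tmp CSVFILE.sample

cat CSVFILE.sample
```

The && ensures the original is only overwritten if tr succeeded.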
If your sed supports the -i option, you could use it like this:
sed -i.bak -e "s/\^A/,/g" CSVFILE
(This assumes the delimiter in the source file consists of the two characters ^ and A; if ^A is supposed to refer to Control-A, then you will have to make adjustments accordingly, e.g. using 's/\x01/,/g'.)
Otherwise, assuming you want to keep a copy of the original file (e.g. in case the result is not what you expect -- see below), an incantation such as the following can be used:
mv CSVFILE CSVFILE.bak && sed "s/\^A/,/g" CSVFILE.bak > CSVFILE
As pointed out elsewhere, if the source-file separator is Control-A, you could also use tr '\001' , (or tr '\001' '\t' for a tab).
The caution is that the delimiter in the source file might well be used precisely because commas might appear in the "values" that the separator-character is separating. If that is a possibility, then a different approach will be needed. (See e.g. https://www.rfc-editor.org/rfc/rfc4180)
In case it's run under OS X:
Add an extension to -i to edit the file in place and keep a backup copy of the original:
sed -i.bak "s/^A/,/g" CSVFILE
Or to edit in place without a backup:
sed -i '' "s/^A/,/g" CSVFILE
You can also write the output to a file by redirecting it, without -i on your sed command (and without -v on cat, so that the control character reaches sed unchanged):
cat CSVFILE | sed "s/^A/,/g" > output
Make sure you type the ^A this way:
Ctrl+V, then Ctrl+A

How to replace a <85> to a new line in bash script

I'm running out of ideas on how to replace the character "<85>" with a newline (please treat this as one character only; I think it is a non-printable character).
I tried this one in my script:
cat file | awk '{gsub("<85>",RS);print}' > /tmp/file.txt
but it didn't work.
I hope someone can help.
Thanks.
With sed: sed -e $'s/\302\205/\\n/g' file > file.txt
Or awk: awk '{gsub("\302\205","\n")}7'
The magic here was in converting the <85> character to its octal byte values: \302\205 is the UTF-8 encoding of U+0085.
I used hexdump -b on a file I manually inserted that character into.
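A self-contained sketch of the same replacement that does not rely on $'...' quoting or on sed understanding \n in the replacement. The variable name nel and the sample text are made up; LC_ALL=C is an added assumption that makes sed compare raw bytes regardless of the locale:

```shell
# U+0085 encoded in UTF-8 is the two bytes C2 85, i.e. octal 302 205.
nel=$(printf '\302\205')

# The replacement is a backslash followed by a real newline,
# which is the portable way to insert a line break with sed.
printf 'first%ssecond\n' "$nel" | LC_ALL=C sed "s/$nel/\\
/g"
```

This prints first and second on separate lines. If the file holds a lone 0x85 byte instead of the two-byte sequence, the pattern will not match and a single-byte approach is needed.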
tr '\205' '\n' <file > file.txt
tr is the transliterate command; it translates one character to another (or deletes it, or …). The version of tr on Mac OS X doesn't recognize hexadecimal escapes, so you have to use octal, and octal 205 is hex 85.
I am assuming that the file contains a single byte '\x85', rather than some combination of bytes that is being presented as <85>. tr is not good for recognizing multibyte sequences that need to be transliterated.
