gsed does not recognize SHIFT_JIS characters


I'm writing a program that uses gsed to extract multibyte characters from a CSV file.
It works well with a CSV file encoded in UTF-8, but it doesn't work with a CSV file encoded in SHIFT_JIS.
test % cat sjis_sample.csv | iconv -f shift_jis -t utf-8
"こんにちは","hello"%
test % cat sjis_sample.csv | iconv -f shift_jis -t utf-8 | gsed -r 's/"(.*)","(.*)"/\1 \2/'
こんにちは hello%
test % cat sjis_sample.csv | gsed -r 's/"(.*)","(.*)"/\1 \2/' | iconv -f shift_jis -t utf-8
"こんにちは","hello"%
Command 1:
Reads the file after converting it to UTF-8.
Command 2:
Extracts the text fields from the CSV file after converting the encoding from SHIFT_JIS to UTF-8.
-> Works well.
Command 3:
Extracts the text fields from the CSV file without converting the encoding first.
-> It seems that `gsed` fails to match the pattern, so nothing is captured.
Does anybody know how to use gsed with a SHIFT_JIS encoded file?
Thank you.
% gsed --version
gsed (GNU sed) 4.8
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Jay Fenlason, Tom Lord, Ken Pizzini,
Paolo Bonzini, Jim Meyering, and Assaf Gordon.
This sed program was built without SELinux support.
GNU sed home page: <https://www.gnu.org/software/sed/>.
General help using GNU software: <https://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed@gnu.org>.
test % locale
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
Solved
Thanks to @KamilCuk:
GNU sed is locale aware. If you want to work with raw bytes (i.e. you can check what bytes represent " in Shift_JIS and feed that to sed) use:
LC_ALL=C sed ....
I set LANG=C instead of LC_ALL=C because setting LC_ALL=C inside the script had no effect (see the appendix).
test % cat sjis_convert.sh
#!/bin/bash
LANG=C
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
test % ./sjis_convert.sh
こんにちは hello%
Appendix
Setting LC_ALL=C inside the script had no effect:
test % cat sjis_convert.sh
#!/bin/bash
LC_ALL=C
locale
echo ''
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
echo ''
locale
test % ./sjis_convert.sh
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
"こんにちは","hello"
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
Setting LANG=C instead worked:
test % cat ./sjis_convert.sh
#!/bin/bash
LANG=C
locale
echo ''
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
echo ''
locale
test % ./sjis_convert.sh
LANG="C"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
こんにちは hello
LANG="C"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
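A likely explanation for the appendix, assuming LC_ALL was not already exported in the calling environment (the locale output shows it empty): a plain LC_ALL=C assignment inside a script only creates a shell variable, which child processes such as gsed and locale never see, whereas LANG was already exported, so reassigning it changes the environment. The sketch below uses sh -c as a stand-in child process to show the difference. If that is indeed the cause, `export LC_ALL=C` or a per-command `LC_ALL=C gsed ...` should work as well.

```shell
#!/bin/bash
# Assumption: LC_ALL starts out unset, as in the locale output above.
unset LC_ALL

LC_ALL=C           # plain assignment: visible to this shell only
plain=$(sh -c 'printf "%s" "${LC_ALL-unset}"')

export LC_ALL=C    # exported: inherited by child processes
exported=$(sh -c 'printf "%s" "${LC_ALL-unset}"')

echo "plain assignment -> child sees: $plain"
echo "after export     -> child sees: $exported"
```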

GNU sed is locale aware. If you want to work with raw bytes (i.e. you can check what bytes represent " in Shift_JIS and feed that to sed) use:
LC_ALL=C sed ....
If you want to work with UTF-8, set a UTF-8 locale, which most probably is your default:
LC_ALL=en_US.UTF-8 sed ...
And if you want to work with any other locale, tell it to sed:
LC_ALL=ja_JP.Shift_JIS sed ...
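To illustrate the raw-byte route, here is a small sketch that builds the question's sample line from its Shift_JIS byte values with printf (the octal escapes are my own transcription of こんにちは, so treat them as an assumption) and runs the same substitution under LC_ALL=C. The delimiters " and , are ASCII bytes that never occur as the second byte of a Shift_JIS character, so matching them bytewise is safe here; that would not hold for a backslash (0x5C).

```shell
# "こんにちは","hello" encoded as Shift_JIS, written as octal byte escapes
sjis_line=$(printf '"\202\261\202\361\202\311\202\277\202\315","hello"')

# In the C locale every byte is a character, so .* can span the Shift_JIS bytes
extracted=$(printf '%s\n' "$sjis_line" | LC_ALL=C sed -E 's/"(.*)","(.*)"/\1 \2/')

# Show the result as hex; pipe it through iconv -f shift_jis -t utf-8 to read it
printf '%s\n' "$extracted" | od -An -tx1
```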

Related

How to search for a string in multiple files and display the file name if it does not exist?

I have a scenario where I want to search for a keyword in multiple files and, if the keyword does not exist in a file, display that file's name.
The keyword to search for is '$$DEMO_STUDENT_NAME'.
This command is not working:
grep "$$DEMO_STUDENT_NAME" /d/demo/

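Assuming $$DEMO_STUDENT_NAME is literally the word you're looking for and not a misspelled environment variable name, use this instead (pay attention to the single quotes, which stop the shell from expanding $$ to its process ID):
grep -vrl '$$DEMO_STUDENT_NAME' /d/demo/
or this:
grep -rlL '$$DEMO_STUDENT_NAME' /d/demo/
Note that -v combined with -l lists every file that has at least one line without the keyword, so the first form is only reliable when a matching file matches on every line. -L lists exactly the files that contain no matching line at all.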
Here is a proof of concept:
$ mkdir temp
$ echo 'teste' > temp/file1
$ echo 'teste' > temp/file2
$ echo 'teste' > temp/file3
$ echo 'work' > temp/file4
$ grep -vrl teste temp/
temp/file4
$ grep -rlL teste temp/
temp/file4
$ grep -V
grep (GNU grep) 2.20
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
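The proof of concept above uses one-line files, where both commands agree. A small sketch with a two-line file (file names and contents are made up for illustration) shows where they differ: -v with -l reports any file containing at least one non-matching line, while -L reports only files with no match at all. -F is added so that the keyword is taken as a fixed string.

```shell
demo_dir=$(mktemp -d)
printf 'teste\nsomething else\n' > "$demo_dir/file1"   # contains the keyword
printf 'work\n' > "$demo_dir/file2"                    # does not contain it

# Files without any match: only file2
without_match=$(grep -rLF 'teste' "$demo_dir")

# Files with at least one non-matching line: file1 AND file2
some_line_differs=$(grep -rlvF 'teste' "$demo_dir" | sort)

echo "grep -rL : $without_match"
echo "grep -rlv:" $some_line_differs
rm -rf "$demo_dir"
```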

Assuming you have the file list in FLIST, you can find the files without a match using -L/--files-without-match:
grep --files-without-match '$$DEMO_STUDENT_NAME' $FLIST
If you have an older grep without the --files-without-match option, you can use the following loop:
for file in $FLIST ; do
    if ! grep -F -q '$$DEMO_STUDENT_NAME' "$file" ; then
        echo "$file"
    fi
done

I like to use grep -c (then checking for a trailing ':0') for this kind of task:
grep -c "$PATTERN" $FILE_LIST | grep ':0$' | sed -e 's/:0$//'

Alternatively, compute the complement of the set of matching files:
(grep -l "$PATTERN" $FILE_LIST ; ls $FILE_LIST ) | sort | uniq -u

Below is the code that works as expected.
It is slightly modified from the answer by @dash-o:
_Dir='/d/d1/Demo'
_P='/d/files'
find $_Dir -type f -iname "*.txt" >> $_P/Demo.csv
for file in `cat $_P/Demo.csv`
do
if ! fgrep -q '$$DEMO_STUDENT_NAME' $file
then
echo "$file" >> $_P/keyword_not_found.csv
fi
done
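A more compact variant of the same idea, sketched under the assumption that grep supports -L (as in the earlier answers): let find pass the .txt files straight to grep instead of going through an intermediate list, which also copes with spaces in file names. The function name and the throwaway files are illustrative only.

```shell
# Print every *.txt file under directory $1 that does not contain keyword $2
list_txt_without_keyword() {
    find "$1" -type f -iname '*.txt' -exec grep -LF -e "$2" {} +
}

# Usage sketch with throwaway files
work_dir=$(mktemp -d)
printf 'name: $$DEMO_STUDENT_NAME\n' > "$work_dir/has keyword.txt"
printf 'nothing here\n' > "$work_dir/missing.txt"
not_found=$(list_txt_without_keyword "$work_dir" '$$DEMO_STUDENT_NAME')
echo "$not_found"
rm -rf "$work_dir"
```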

sed can't change a file when called in postinstall [duplicate]

Is there an invocation of sed to do in-place editing without backups that works both on Linux and Mac? While the BSD sed shipped with OS X seems to need sed -i '' …, the GNU sed that Linux distributions usually come with interprets the quotes as an empty input file name (instead of the backup extension), and needs sed -i … instead.
Is there any command line syntax which works with both flavors, so I can use the same script on both systems?

If you really want to just use sed -i the 'easy' way, the following DOES work on both GNU and BSD/Mac sed:
sed -i.bak 's/foo/bar/' filename
Note the lack of space and the dot.
Proof:
# GNU sed
% sed --version | head -1
GNU sed version 4.2.1
% echo 'foo' > file
% sed -i.bak 's/foo/bar/' ./file
% ls
file file.bak
% cat ./file
bar
# BSD sed
% sed --version 2>&1 | head -1
sed: illegal option -- -
% echo 'foo' > file
% sed -i.bak 's/foo/bar/' ./file
% ls
file file.bak
% cat ./file
bar
Obviously you could then just delete the .bak files.

This works with GNU sed, but not on OS X:
sed -i -e 's/foo/bar/' target.file
sed -i'' -e 's/foo/bar/' target.file
This works on OS X, but not with GNU sed:
sed -i '' -e 's/foo/bar/' target.file
On OS X you can't use sed -i -e, since the extension of the backup file would be set to -e, and you can't use sed -i'' -e for the same reason: it needs a space between -i and ''.

When on OS X, I always install the GNU version of sed via Homebrew to avoid problems in scripts, because most scripts were written for GNU sed.
brew install gnu-sed --with-default-names
Then GNU sed will be used in place of your BSD sed.
Alternatively, you can install without default-names, but then:
Change your PATH as instructed after installing gnu-sed
Check in your scripts to choose between gsed and sed depending on your system
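The last point can be sketched as follows. This assumes that a Homebrew install without default names exposes GNU sed as gsed, and that plain sed is already GNU sed everywhere else; the variable name SED is my own.

```shell
# Prefer gsed when it is installed (e.g. macOS + Homebrew), otherwise use sed
if command -v gsed >/dev/null 2>&1; then
    SED=gsed
else
    SED=sed
fi

# GNU-style in-place edit through whichever binary was selected
scratch=$(mktemp)
echo 'foo' > "$scratch"
"$SED" -i -e 's/foo/bar/' "$scratch"
result=$(cat "$scratch")
rm -f "$scratch"
echo "$result"
```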

As Noufal Ibrahim asks, why can't you use Perl? Any Mac will have Perl, and there are very few Linux or BSD distributions that don't include some version of Perl in the base system. One of the only environments that might actually lack Perl would be BusyBox (which works like GNU/Linux for -i, except that no backup extension can be specified).
As ismail recommends,
Since perl is available everywhere I just do perl -pi -e s,foo,bar,g target.file
and this seems like a better solution in almost any case than scripts, aliases, or other workarounds to deal with the fundamental incompatibility of sed -i between GNU/Linux and BSD/Mac.

Answer: No.
The originally accepted answer actually doesn't do what is requested (as noted in the comments). (I found this answer when looking for the reason a file-e was appearing "randomly" in my directories.)
There is apparently no way of getting sed -i to work consistently on both MacOS and Linuces.
My recommendation, for what it is worth, is not to update-in-place with sed (which has complex failure modes), but to generate new files and rename them afterwards. In other words: avoid -i.

There is no way to make a single sed -i invocation work on both.
One way around it is to use a temporary file:
TMP_FILE=$(mktemp /tmp/config.XXXXXXXXXX)
sed -e "s/abc/def/" some/file > "$TMP_FILE"
mv "$TMP_FILE" some/file
This works on both.
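Wrapped into a function, the temporary-file approach could look like the sketch below. The function name is hypothetical. Compared with the snippet above it creates the temporary file next to the target and copies the result back with cat instead of mv, so the original file keeps its permissions, at the cost of the replacement no longer being a single rename.

```shell
# sed_inplace EXPR FILE: apply a sed expression to FILE without using -i
sed_inplace() {
    _expr=$1
    _file=$2
    _tmp=$(mktemp "${_file}.XXXXXX") || return 1
    if sed -e "$_expr" "$_file" > "$_tmp"; then
        cat "$_tmp" > "$_file"    # overwrite contents, keep mode and owner
        _status=0
    else
        _status=1                 # leave the original untouched on failure
    fi
    rm -f "$_tmp"
    return "$_status"
}

# Usage sketch on a throwaway file
sample=$(mktemp)
echo 'abc' > "$sample"
sed_inplace 's/abc/def/' "$sample"
edited=$(cat "$sample")
rm -f "$sample"
echo "$edited"
```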

Here's another version that works on Linux and macOS without using eval and without having to delete backup files. It uses Bash arrays for storing the sed parameters, which is cleaner than using eval:
# Default case for Linux sed, just use "-i"
sedi=(-i)
case "$(uname)" in
# For macOS, use two parameters
Darwin*) sedi=(-i "")
esac
# Expand the parameters in the actual call to "sed"
sed "${sedi[@]}" -e 's/foo/bar/' target.file
This creates neither a backup file nor a file with appended quotes.

The -i option is not part of POSIX Sed. A more portable method would be
to use Vim in Ex mode:
ex -sc '%s/alfa/bravo/|x' file
% select all lines
s replace
x save and close

Steve Powell's answer is quite correct; consulting the man pages for sed on OS X and Linux (Ubuntu 12.04) highlights the incompatibility of 'in-place' sed usage across the two operating systems.
JFYI, with the Linux version of sed there should be no space between the -i and any quotes (which denote an empty file extension), thus
sed Linux Man Page
#Linux
sed -i""
and
sed OSX Man page
#OSX (notice the space after the '-i' argument)
sed -i ""
I got around this in a script by using an aliased command and the OS-name output of 'uname' within a bash 'if'. Trying to store OS-dependent command strings in variables was hit and miss when interpreting the quotes. The use of 'shopt -s expand_aliases' is necessary in order to expand/use the aliases defined within your script.

Portable script for both GNU systems and OSX:
if [[ $(uname) == "Darwin" ]]; then
SP=" " # Needed for portability with sed
fi
sed -i${SP}'' -e "s/foo/bar/g" -e "s/ping/pong/g" foobar.txt

I ran into this problem. The only quick solution was to replace the sed on the Mac with the GNU version:
brew install gnu-sed

If you need to do sed in-place in a bash script, and you do NOT want the in-place to result with .bkp files, and you have a way to detect the os (say, using ostype.sh), -- then the following hack with the bash shell built-in eval should work:
OSTYPE="$(bash ostype.sh)"
cat > myfile.txt <<"EOF"
1111
2222
EOF
if [ "$OSTYPE" == "osx" ]; then
ISED='-i ""'
else # $OSTYPE == linux64
ISED='-i""'
fi
eval sed $ISED 's/2222/bbbb/g' myfile.txt
ls
# GNU and OSX: still only myfile.txt there
cat myfile.txt
# GNU and OSX: both print:
# 1111
# bbbb
# NOTE:
# if you just use `sed $ISED 's/2222/bbbb/g' myfile.txt` without `eval`,
# then you will get a backup file with quotations in the file name,
# - that is, `myfile.txt""`

The problem is that sed is a stream editor, therefore in-place editing is a non-POSIX extension and everybody may implement it differently. That means for in-place editing you should use ed for best portability. E.g.
ed -s foobar.txt <<<$',s/foo/bar/g\nw'
Also see https://wiki.bash-hackers.org/howto/edit-ed.

You can use sponge. Sponge is an old Unix program, found in the moreutils package (in Ubuntu and probably Debian, and in Homebrew on the Mac).
It buffers all the content from the pipe, waits until the pipe is closed (which probably means the input file has already been closed as well) and then overwrites the file:
From the man page:
Synopsis
sed '...' file | grep '...' | sponge file

The following works for me on Linux and OS X:
sed -i' ' <expr> <file>
e.g. for a file f containing aaabbaaba
sed -i' ' 's/b/c/g' f
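yields aaaccaaca on both Linux and Mac. Note there is a quoted string containing a space, with no space between the -i and the string. Single or double quotes both work. Be aware that this is the backup-suffix form with a single space as the suffix, so both implementations also leave a backup file named "f " (f followed by a space) behind.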
On Linux I am using bash version 4.3.11 under Ubuntu 14.04.4 and on the Mac version 3.2.57 under OS X 10.11.4 El Capitan (Darwin 15.4.0).

Hexdump reverse command

The hexdump command converts any file to hex values.
But what if I have hex values and I want to reverse the process, is this possible?

There is a similar tool called xxd. If you run xxd with just a file name it dumps the data in a fairly standard hex dump format:
# xxd bdata
0000000: 0001 0203 0405
......
Now if you pipe the output back to xxd with the -r option and redirect that to a new file, you can convert the hex dump back to binary:
# xxd bdata | xxd -r >bdata2
# cmp bdata bdata2
# xxd bdata2
0000000: 0001 0203 0405

I've written a short AWK script which reverses hexdump -C output back to the
original data. Use like this:
reverse-hexdump.sh hex.txt > data
Handles '*' repeat markers and generating original data even if binary.
hexdump -C and reverse-hexdump.sh make a data round-trip pair. It is
available here:
GitHub reverse-hexdump repo
Direct to reverse-hexdump.sh

Restore file, given only the output of hexdump file
If you only have the output of hexdump file and want to restore the original file, first note that hexdump's default output depends on the endianness of the system you ran hexdump on!
If you have access to the system that created the dump, you can determine its endianness using the command below:
[[ "$(printf '\01\03' | hexdump)" == *0103* ]] && echo big || echo little
Reversing little-endian hexdump
This is the most common case. All x86/x64 systems are little-endian. If you don't know the endianness of the system that ran hexdump file, try this.
sed 's/ \(..\)\(..\)/ \2\1/g;$d' dump | xxd -r
The sed part converts hexdump's format into xxd's format, at least so far that xxd -r works.
Reversing big-endian hexdump
sed '$d' dump | xxd -r
Known Bugs (see comment section)
A trailing null byte is added if the original file was of odd length (e.g. 1, 3, 5, 7, ..., byte long).
Repeating sections of the original file are not restored correctly if they were hexdumped using a *.
You can check your dump for above problematic cases by running below command:
grep -qE '^\*|^[0-9a-f]*[13579bdf] *$' dump && echo bug || echo ok
Better alternative to create hexdumps in the first place
Besides the non-POSIX (and therefore not so portable) xxd, there is od (octal dump), which should be available on all Unix-like systems as it is specified by POSIX:
od -tx1 -An -v
This prints a hexadecimal dump, grouping digits as single bytes (-tx1), with no address prefixes (-An, similar to xxd -p) and without abbreviating repeated sections as * (-v). You can reverse such a dump using xxd -r -p.

As someone who sucks at bash, I could not understand the examples already posted.
Here is what would have helped me when I was originally searching:
Take your text file "AYE.TXT" and convert it into a hex dump called "BEE.TXT"
xxd -p "AYE.TXT" > "BEE.TXT"
Take your hex dump file ("BEE.TXT") and convert it back to the ASCII file "CEE.TXT"
xxd -r -p "BEE.TXT" > "CEE.TXT"
Now that you have some simple working code, feel free to check out
"xxd -help" on the command line for an explanation of what all those flags do.
(That part is the easy part, the hard part is the specifics of the bash syntax)

There is a tonne of more elegant ways to get this done, but I've quickly hacked something together that Works for Me (tm) when regenerating a binary file from a hex dump generated by hexdump -C some_file.bin:
sed 's/\(.\{8\}\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\)/\1: \2\3 \4\5 \6\7 \8\9/g' some_file.hexdump | sed 's/\(.*\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\) \(..\) |/\1 \2\3 \4\5 \6\7 \8\9 /g' | sed 's/.$//g' | xxd -r > some_file.restored
Basically, it uses 2 sed processes, each handling its part of each line. Ugly, but someone might find it useful.

If you don't have xxd, use hexdump, od, perl or python:
The following all give the same output:
# If you only have hexdump
hexdump -ve '1/1 "%.2x"' mybinaryfile > mydump
# This gives exactly the same output as:
xxd -p mybinaryfile > mydump
# Or, much slower:
od -v -t x1 -An < mybinaryfile | tr -d "\n " > mydump
# Or, the fastest:
perl -pe 'BEGIN{$/=\1e6} $_=unpack "H*"' < mybinaryfile > mydump
# Or, if you somehow have Python, and not Perl:
python -c "print(open('mybinaryfile','rb').read().hex())" > mydump
Then you can copy and paste, or pipe the output, and convert back with:
xxd -r -p mydump mybinaryfileagain
# Or
xxd -r -p < mydump > mybinaryfileagain
The hexdump command is available almost everywhere, and is usually part of the default busybox - if it's not linked, you can try running busybox hexdump or busybox xxd.
If xxd is not available to reverse the data, then you can try awk
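If neither xxd nor awk is at hand, a plain hex dump can also be decoded with printf alone. The following is a minimal and slow sketch (the function name is mine) that assumes the input consists only of whitespace-separated two-digit hex bytes, such as the output of od -An -v -tx1:

```shell
# Decode whitespace-separated two-digit hex bytes from stdin into raw bytes
hex_to_bin() {
    while read -r line; do
        for byte in $line; do
            # hex -> octal, then let %b expand the \0ooo escape into one byte
            printf '%b' "\\0$(printf '%o' "0x$byte")"
        done
    done
}

# Round trip, including a NUL byte and a non-ASCII byte
original=$(mktemp)
restored=$(mktemp)
printf 'a\000b\377\n' > "$original"
od -An -v -tx1 "$original" | hex_to_bin > "$restored"
cmp "$original" "$restored" && roundtrip=ok || roundtrip=differs
rm -f "$original" "$restored"
echo "$roundtrip"
```

It forks one printf per byte, so it is only practical for small files.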
The old days: Zmodem
In the old days we used to use X/Y/Zmodem which is available in the package lrzsz which can tolerate lossy comms - but it's a bidirectional protocol so the binaries need to be running at the same time and there needs to be bidirectional comms:
# Demo on local machine, using FIFOs
mkfifo /tmp/fifo-in
mkfifo /tmp/fifo-out
sz -b mybinaryfile > /tmp/fifo-out < /tmp/fifo-in
mkdir out; cd out
rz -b < /tmp/fifo-out > /tmp/fifo-in
Luckily, screen supports receiving Zmodem, so if you're in a screen session:
screen
telnet somehost
Then type Ctrl+A and : and then zmodem catch and Enter. Then inside the screen on the remote host, issue:
# sz -b mybinaryfile
Press Enter when you see the string starting with "!!!".
When you see "Transfer Complete", you may want to run reset if you want to continue the terminal session normally.

This program reverses hexdump -C output back to the original data.
Usage:
make
make test
./unhexdump -i inputfile -o outputfile
see https://github.com/zhouzq-thu/unhexdump!

I found a simpler solution:
bin2hex
echo -n "abc" | hexdump -ve '1/1 "%02x"'
hex2bin
echo -n "616263" | grep -Eo ".{2}" | sed 's/\(.*\)/\\x\1/' | tr -d '\n' | xargs -0 echo -ne

LF --> CR/LF conversion for UTF-16 file

I have a UTF-16 encoded file and I want to replace UNIX line endings with Windows line endings. I don't want to touch anything else.
Is there a Linux command line tool that can search for two bytes "0A 00" and replace it with four bytes "0D 00 0A 00"?

Perl to the rescue:
perl -we 'binmode STDIN, ":encoding(UTF-16le)";
binmode STDOUT, ":encoding(UTF-16le):crlf";
print while <STDIN>;
' < input.txt > output.txt

You may use unix2dos, but you have to convert the file to an 8-bit encoding first, and back to UTF-16 afterwards. The obvious intermediate candidate is UTF-8:
$ cat in.txt | iconv -f UTF-16 -t UTF-8 | unix2dos | iconv -f UTF-8 -t UTF-16 > out.txt
You can wrap these three piped commands in a handy script, if you wish.
#!/bin/sh
iconv -f UTF-16 -t UTF-8 | unix2dos | iconv -f UTF-8 -t UTF-16
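If unix2dos is not installed, sed can add the carriage returns in the same kind of pipeline. This is a sketch under two assumptions of mine: the file is little-endian (hence the explicit UTF-16LE, which also keeps iconv from writing a byte-order mark of its own) and the local iconv knows that encoding name. The function name is illustrative.

```shell
cr=$(printf '\r')

# Convert LF to CR/LF in a UTF-16LE stream (stdin -> stdout)
utf16le_lf_to_crlf() {
    iconv -f UTF-16LE -t UTF-8 |
        sed "s/$cr\$//; s/\$/$cr/" |    # drop an existing CR, then append one
        iconv -f UTF-8 -t UTF-16LE
}

# Usage sketch: two LF-terminated lines, shown as hex after conversion
converted_hex=$(printf 'a\nb\n' | iconv -f UTF-8 -t UTF-16LE |
    utf16le_lf_to_crlf | od -An -v -tx1 | tr -d ' \n')
echo "$converted_hex"
```

Stripping an existing CR first means lines that already end in CR/LF are left as they are.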

unix2dos is what you're looking for. See its different options to find the one that's right for your UTF-16 encoding.

Solution:
perl -pe "BEGIN { binmode $_, ':raw:encoding(UTF-16LE)' for *STDIN, *STDOUT }; s/\n\0/\r\0\n\0/g;" < input.file > output.file
Credit to my coworker Manu and Stream-process UTF-16 file with BOM and Unix line endings in Windows perl

How to sort lines based on a sub string in the line

I have the following output:
aaa=12
bbb=124
cccc=1
dddd=15
I want to sort the above lines based on the values, so the output should look like this:
$ cat file | awk_or_sed_or_any_command
cccc=1
aaa=12
dddd=15
bbb=124
UPDATE
I tried the following commands:
$ cat file | awk -F '=' '{print $2"="$1}' | sort | awk -F '=' '{print $2"="$1}'
But it's too long.
Is there a better suggestion than the one above?
Note: my Linux uses sort from BusyBox, which supports only the following options:
$ sort --help
BusyBox v1.19.4 (2014-04-04 18:50:39 CEST) multi-call binary.
Usage: sort [-nru] [FILE]...
Sort lines of text
-n Sort numbers
-r Reverse sort order
-u Suppress duplicate lines

Use the following command
sort -n -t = -k 2 your_file
gives me
alex@rhyme ~ $ ash
$ cat <<EOF | sort -n -t = -k 2
> aaa=12
> bbb=124
> cccc=1
> dddd=15
> EOF
cccc=1
aaa=12
dddd=15
bbb=124
$ which sort
/usr/bin/sort
$ LANG=C sort --version
sort (GNU coreutils) 8.21
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and Paul Eggert.
Check the sort manpage for other sort options

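sed -e 's/^\(.*\)=\(.*\)$/000000\2 &/;s/^0*\([0-9]\{7\}\) /\1 /' YourFile | sort | sed -e 's/^[0-9]* //'
This prepares the lines for a basic sort (not numeric, and without column selection): it prefixes each line with its value zero-padded to 7 digits, sorts, then strips the prefix to put the lines back in their original form.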
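Since the BusyBox sort in the question does support -n, the same decorate, sort, undecorate idea also works without zero padding. A minimal sketch, assuming the values are non-negative integers and sed is available (the function name is made up):

```shell
# Sort "key=value" lines from stdin by numeric value using only sort -n
sort_by_value() {
    sed 's/^\(.*\)=\(.*\)$/\2 &/' |    # prefix each line with its value
        sort -n |                       # numeric sort on that prefix
        sed 's/^[^ ]* //'               # drop the prefix again
}

sorted=$(printf '%s\n' 'aaa=12' 'bbb=124' 'cccc=1' 'dddd=15' | sort_by_value)
echo "$sorted"
```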
