count number of Chinese character and add in the end of line - string

I have the file, which have a Chinese word in each line like this :
王大明
新型传染病
電子雷射
I want to add the number of Chinese character in each end of line :
王大明 3
新型传染病 5
電子雷射 4
How can I do this?
I know command, sed, wc. However, I cannot achieve this work. I tried many things, but clearly I need help here.
sed -i s/$/{length $0}/ myfile
sed -i s/$/{wc -m}/ myfile
awk '{$2=system(awk 'length') OFS $2} 1' myfile

What exactly will work will depend entirely on what exactly your input looks like. If you are dealing with Unicode glyphs, use a Unicode-aware tool such as e.g. Python.
bash$ cat uniline
#!/usr/bin/env python3
import sys
for line in sys.stdin:
line = line.rstrip('\n')
print(line, len(line))
bash$ chmod +x uniline
bash$ uniline <<\:
> 王大明
> 新型传染病
> 電子雷射
> :
王大明 3
新型传染病 5
電子雷射 4
(I had to trim some whitespace from the ends of the lines in the example you posted.)
For the record, my system encoding is UTF-8, meaning the first line's representation as bytes is
bash$ echo '王大明' | xxd
00000000: e78e 8be5 a4a7 e698 8e0a ..........
Perhaps see also Problematic questions about decoding errors for some relevant background.
If you are lucky, even Awk and wc might be locale-aware on your platform. Your sed attempts really have no chance of working (though if you have GNU sed you could try with the /e option; but really, probably don't). If you have GNU Awk and the en_US.UTF-8 locale defined, this works, too:
bash$ echo $'\xe7\x8e\x8b\xe5\xa4\xa7\xe6\x98\x8e' |
> LC_ALL=en-US.UTF-8 awk '{ print $0, length }'
王大明 3

if you're VERY certain the only multi-byte characters there are chinese, then do
gawk/mawk/mawk2 '{ print $0, \
\
gsub(/\342|\343|\344|\345|\346|\347|\350|\351|\357|\360/, "&") }'
This list of leading-bytes shall correctly account for either 3- or 4-byte code-points related to chinese chars, of either simplified and traditional, plus all special compatibility variants.
Run that in either byte-mode or unicode-mode and it'll give you the same result. Your locale settings DOES NOT matter here (as long as your input is already UTF8 compliant text)
If you're definitely in byte-mode or LC_ALL=C, then
awk '{ print $0, gsub(/[\342-\351\357\360]/,"&") }'
One of the less-mentioned-but-excellent use case for gsub() is to use it for purposes of counting occurrences without having to do split() or substr().
if you're REALLY pedantic about exactness, the hideous regex i use myself is
function isChinese(str6) { return (str6 ~
/\344|\345|\346|\347|\350|\351|
(\343|\360|\357)(\244|\245|\246|\247|
\250|\251|\252|\253)|(\357\271|
\343(\204|\207))(\200|\201|\202|\203|\204|
\205|\206|\207|\210|
\211|\212|\213|\214|\215|\216|\217)|(\343\206|
\357\270)(\260|\261|\262|\263|\264|\265|\266|\267|
\270|\271|\272|\273|\274|\275|\276|\277)|
(\343|\360)(\240|\241|\242|\243|\254|\255|\256|\257|\260|
\261)|\342(\272|\273|\274|\275|\276|\277(\200|
\210|\211|\212|\213|\214|\215|\216|\217))|
(\342\277|\343(\204|\206|\207))(\220|\221|\222|
\223|\224|\225|\226|\227|\230|\231|\232|\233|
\234|\235|\236|\237)|\343(\200|\210|\211|\212|
\213|\214|\215|\216|\217|\220|\221|\222|\223|
\224|\225|\226|\227|\230|\231|\232|\233|\234|
\235|\236|\237|\262|\263|\264|\265|\266|(\204|
\206|\207)(\240|\241|\242|\243|\244|
\245|\246|\247|\250|\251|\252|\253|\254|\255|\256|\257))/) };

Related

extracting text from one file and creating new file with that text using linux/bash

I have a sequence file that has a repeated pattern that looks like this:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS"
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
and so on.
I want to extract the text between and including each >g## and create a new file titled protein_g##.faa
In the above example it would create a file called "protein_g34.faa" and it would be:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
I was trying to use sed but I am not very experienced using it. My guess was something like this:
$ sed -n '/^>g*/s///p; y/ /\n/' file > "g##"
but I can clearly tell that that is wrong... maybe the right thing is using awk?
Thanks!
Yeah, I would use awk for that. I don't think sed can write to more than one different output stream.
Here's how I would write that:
< input.txt awk '/^\$>/{fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} {print $0 > fname}'
Breaking it down into details:
< input.txt This part reads in the input file.
awk Runs awk.
/^\$>/ On lines which start with the literal string $>, run the piece of code in brackets.
(If previous step matched) {fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} Take the first field in the previous line. Remove the first two characters of that field. Surround that with protein_ .faa. Save it as the variable fname. Print a message about switching files.
This next block has no condition before it. Implicitly, that means that it matches every line.
{print $0 > fname} Take the entire line, and send it to the filename held by fname. If no file is selected, this will cause an error.
Hope that helps!
If awk is an option:
awk '/\|/ {split($1,a,">"); fname="protein_"a[2]".faa"} {print $0 >> fname}' src.dat
awk is better than sed for this problem. You can implement it in sed with
sed -rz 's/(\$>)(g[^ ]*)([^\n]*\n[^\n]*)\n/echo '\''\1\2\3'\'' > protein_\2.faa/ge' file
This solution is nice for showing some sed tricks:
-z for parsing fragments that span several lines
(..) for remembering strings
\$ matching a literal $
[^\n]* matching until end of line
'\'' for a single quote
End single quoted string, escape single quote and start new single quoted string
\2 for recalling the second remembered string
Write a bash command in the replacement string
e execute result of replacement
awk procedure
awk allows records to be extracted between empty (or white space only) lines by setting the record separator to an empty string RS=""
Thus the records intended for each file can be got automatically.
The id to be used in the filename can be extracted from field 1 $1 by splitting the (default white-space-separated) field at the ">" mark, and using element 2 of the split array (named id in this example).
The file is written from awk before closing the file to prevent errors is you have many lines to process.
The awk procedure
The example data was saved in a file named all.seq and the following procedure used to process it:
awk 'BEGIN{RS="";} {split($1,id,">"); fn="protein_"id[2]".faa"; print $0 > fn; close(fn)}' all.seq
tests results
(terminal listings/outputs)
$ ls
all.seq protein_g104.faa protein_g115.faa protein_g34.faa
$ cat protein_g104.faa
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$ cat protein_g115.faa
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
$ cat protein_g34.faa
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS"
Tested using GNU Awk 5.1.0

How can I truncate a line of text longer than a given length?

How would you go about removing everything after x number of characters? For example, cut everything after 15 characters and add ... to it.
This is an example sentence should turn into This is an exam...
GnuTools head can use chars rather than lines:
head -c 15 <<<'This is an example sentence'
Although consider that head -c only deals with bytes, so this is incompatible with multi-bytes characters like UTF-8 umlaut ü.
Bash built-in string indexing works:
str='This is an example sentence'
echo "${str:0:15}"
Output:
This is an exam
And finally something that works with ksh, dash, zsh…:
printf '%.15s\n' 'This is an example sentence'
Even programmatically:
n=15
printf '%.*s\n' $n 'This is an example sentence'
If you are using Bash, you can directly assign the output of printf to a variable and save a sub-shell call with:
trim_length=15
full_string='This is an example sentence'
printf -v trimmed_string '%.*s' $trim_length "$full_string"
Use sed:
echo 'some long string value' | sed 's/\(.\{15\}\).*/\1.../'
Output:
some long strin...
This solution has the advantage that short strings do not get the ... tail added:
echo 'short string' | sed 's/\(.\{15\}\).*/\1.../'
Output:
short string
So it's one solution for all sized outputs.
Use cut:
echo "This is an example sentence" | cut -c1-15
This is an exam
This includes characters (to handle multi-byte chars) 1-15, c.f. cut(1)
-b, --bytes=LIST
select only these bytes
-c, --characters=LIST
select only these characters
Awk can also accomplish this:
$ echo 'some long string value' | awk '{print substr($0, 1, 15) "..."}'
some long strin...
In awk, $0 is the current line. substr($0, 1, 15) extracts characters 1 through 15 from $0. The trailing "..." appends three dots.
Todd actually has a good answer however I chose to change it up a little to make the function better and remove unnecessary parts :p
trim() {
if (( "${#1}" > "$2" )); then
echo "${1:0:$2}$3"
else
echo "$1"
fi
}
In this version the appended text on longer string are chosen by the third argument, the max length is chosen by the second argument and the text itself is chosen by the first argument.
No need for variables :)
Using Bash Shell Expansions (No External Commands)
If you don't care about shell portability, you can do this entirely within Bash using a number of different shell expansions in the printf builtin. This avoids shelling out to external commands. For example:
trim () {
local str ellipsis_utf8
local -i maxlen
# use explaining variables; avoid magic numbers
str="$*"
maxlen="15"
ellipsis_utf8=$'\u2026'
# only truncate $str when longer than $maxlen
if (( "${#str}" > "$maxlen" )); then
printf "%s%s\n" "${str:0:$maxlen}" "${ellipsis_utf8}"
else
printf "%s\n" "$str"
fi
}
trim "This is an example sentence." # This is an exam…
trim "Short sentence." # Short sentence.
trim "-n Flag-like strings." # Flag-like strin…
trim "With interstitial -E flag." # With interstiti…
You can also loop through an entire file this way. Given a file containing the same sentences above (one per line), you can use the read builtin's default REPLY variable as follows:
while read; do
trim "$REPLY"
done < example.txt
Whether or not this approach is faster or easier to read is debatable, but it's 100% Bash and executes without forks or subshells.

Replace multiline string with sed

I have a file that's basically an INI/CFG file the looks like this:
[thing-a]
attribute1=foo
attribute2=bar
attribute3=foobar
attribute4=barfoo
[thing-b]
attribute1=dog
attribute3=foofoo
attribute4=castles
[thing-c]
attribute1=foo
attribute4=barfoo
[thing-d]
attribute1=123455
attribute2=dogs
attribute3=biscuits
attribute4=1234
Each 'thing' has a set of attributes that could include all the same ones or a subset there of.
I am trying to write a small bash script that will replace the attributes for 'thing-c' with a predefined block $a1, $a2 & $a3 are generated elsewhere in the wider script:
NEW_BLOCK="[thing-c]
attribute1=${a1}
attribute2=${a2}
attribute3=${a3}"
I can find the right block with sed like this:
THING_BLOCK=$(sed -nr "/^\[thing-c\]/ { :l /^\s*[^#].*/ p; n; /^\[/ q; b l; }" ./myThingFile)
I'm not sure if i've gone down a rabbit hole or what with this and I'm pretty sure there is a better way of doing it.
I'm wanting to do what is:
sed "s/${THING_BLOCK}/${NEW_BLOCK}/"
But I can't quite figure out the multiline aspect to this and I'm not sure what the best route to take is.
Is there a way to do this sort of multiline find and replace with sed (or a better way with bash)
Is there a way to do this sort of multiline find and replace ...
Yes there is indeed a better way, albeit using awk:
awk -v blk="$NEW_BLOCK" -v RS= '{ORS = RT} $1 == "[thing-c]" {$0 = blk} 1' file
Using -v RS= we use an empty record separator that splits records in input file on each new line.
Another awk. Store the replacement to file2 and:
$ awk -v RS="" '
NR==FNR {
b=$0
next
}
$1~/thing-c/ {
$0=b
}
{
print (++c==1?"":ORS) $0
}' file2 file1
Output:
[thing-a]
attribute1=foo
attribute2=bar
attribute3=foobar
attribute4=barfoo
[thing-b]
attribute1=dog
attribute3=foofoo
attribute4=castles
[thing-c]
attribute1=${a1}
attribute2=${a2}
attribute3=${a3}
[thing-d]
attribute1=123455
attribute2=dogs
attribute3=biscuits
attribute4=1234
When you want to use sed(IMHO awk is better here), you must have "nice" data (no special characters that sed will try to handle and [ inside block thing-3).
I tested with
read -d '' -r NEW_BLOCK <<END
[thing-c]
attribute1=${a1}
attribute2=${a2}
attribute3=${a3}
END
For my solution I first need to replace newlines in $NEW_BLOCK with the two characters \n.
echo "This is the replacement string: ${NEW_BLOCK//$'\n'/\\n}"
With the "multi-line" option "-z" you can do
sed -rz "s/\[thing-c\][^[]*/${NEW_BLOCK//$'\n'/\\n}\n\n/" myThingFile

Find HEX value in file and grep the following value

I have a 2GB file in raw format. I want to search for all appearance of a specific HEX value "355A3C2F74696D653E" AND collect the following 28 characters.
Example: 355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135
In this case I want the output: "323031312D30342D32365431343A34373A30322D31343A34373A3135" or better: 2011-04-26T14:47:02-14:47:15
I have tried with
xxd -u InputFile | grep '355A3C2F74696D653E' | cut -c 1-28 > OutputFile.txt
and
xxd -u -ps -c 4000000 InputFile | grep '355A3C2F74696D653E' | cut -b 1-28 > OutputFile.txt
But I can't get it working.
Can anybody give me a hint?
As you are using xxd it seems to me that you want to search the file as if it were binary data. I'd recommend using a more powerful programming language for this; the Unix shell tools assume there are line endings and that the text is mostly 7-bit ASCII. Consider using Python:
#!/usr/bin/python
import mmap
fd = open("file_to_search", "rb")
needle = "\x35\x5A\x3C\x2F\x74\x69\x6D\x65\x3E"
haystack = mmap.mmap(fd.fileno(), length = 0, access = mmap.ACCESS_READ)
i = haystack.find(needle)
while i >= 0:
i += len(needle)
print (haystack[i : i + 28])
i = haystack.find(needle, i)
If your grep supports -P parameter then you could simply use the below command.
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{28}'
323031312D30342D32365431343A
For 56 chars,
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{56}'
323031312D30342D32365431343A34373A30322D31343A34373A3135
Why convert to hex first? See if this awk script works for you. It looks for the string you want to match on, then prints the next 28 characters. Special characters are escaped with a backslash in the pattern.
Adapted from this post: Grep characters before and after match?
I added some blank lines for readability.
VirtualBox:~$ cat data.dat
Thisis a test of somerandom characters before thestringI want5Z</time>2011-04-26T14:47:02-14:47:15plus somemoredata
VirtualBox:~$ cat test.sh
awk '/5Z\<\/time\>/ {
match($0, /5Z\<\/time\>/); print substr($0, RSTART + 9, 28);
}' data.dat
VirtualBox:~$ ./test.sh
2011-04-26T14:47:02-14:47:15
VirtualBox:~$
EDIT: I just realized something. The regular expression will need to be tweaked to be non-greedy, etc and between that and awk need to be tweaked to handle multiple occurrences as you need them. Perhaps some of the folks more up on awk can chime in with improvements as I am real rusty. An approach to consider anyway.

Getting n-th line of text output

I have a script that generates two lines as output each time. I'm really just interested in the second line. Moreover I'm only interested in the text that appears between a pair of #'s on the second line. Additionally, between the hashes, another delimiter is used: ^A. It would be great if I can also break apart each part of text that is ^A-delimited (Note that ^A is SOH special character and can be typed by using Ctrl-A)
output | sed -n '1p' #prints the 1st line of output
output | sed -n '1,3p' #prints the 1st, 2nd and 3rd line of output
your.program | tail +2 | cut -d# -f2
should get you 2/3 of the way.
Improving Grumdrig's answer:
your.program | head -n 2| tail -1 | cut -d# -f2
I'd probably use awk for that.
your_script | awk -F# 'NR == 2 && NF == 3 {
num_tokens=split($2, tokens, "^A")
for (i = 1; i <= num_tokens; ++i) {
print tokens[i]
}
}'
This says
1. Set the field separator to #
2. On lines that are the 2nd line, and also have 3 fields (text#text#text)
3. Split the middle (2nd) field using "^A" as the delimiter into the array named tokens
4. Print each token
Obviously this makes a lot of assumptions. You might need to tweak it if, for example, # or ^A can appear legitimately in the data, without being separators. But something like that should get you started. You might need to use nawk or gawk or something, I'm not entirely sure if plain awk can handle splitting on a control character.
bash:
read
read line
result="${line#*#}"
result="${result%#*}"
IFS=$'\001' read result -a <<< "$result"
$result is now an array that contains the elements you're interested in. Just pipe the output of the script to this one.
here's a possible awk solution
awk -F"#" 'NR==2{
for(i=2;i<=NF;i+=2){
split($i,a,"\001") # split on SOH
for(o in a ) print o # print the splitted hash
}
}' file

Resources