How to extract characters from a text file

I want to extract all characters from a text file in order to create a subset font. How can I extract and sort the characters?
Example:
input "Hello, Harry. 안녕? 잘 지내니? おはよう。どうもありがとう。"
↓
output " ,.?Haelory。あうおがとどはもより내녕니안잘지"

perl -C -Mutf8 -MList::Util=uniq -E'say uniq sort "Hello, Harry. 안녕? 잘 지내니? おはよう。どうもありがとう。" =~ /(\X)/g'

In JavaScript, that would be:
let input = "Hello, Harry. 안녕? 잘 지내니? おはよう。どうもありがとう。";
let output = [...new Set(Array.from(input))].sort().join('');
// -> " ,.?Haelory。あうおがとどはもより내녕니안잘지"
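A Python equivalent can be sketched the same way (note this splits on code points, not grapheme clusters like the Perl `\X` version, which matters if the text contains combining characters):

```python
# Collect the unique code points and sort them by code point value.
s = "Hello, Harry. 안녕? 잘 지내니? おはよう。どうもありがとう。"
out = "".join(sorted(set(s)))
print(out)  # " ,.?Haelory。あうおがとどはもより내녕니안잘지"
```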


How to grep multi line string with new line characters or tab characters or spaces

My test file has text like:
> cat test.txt
new dummy("test1", random1).foo("bar1");
new dummy("
test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
"test4", random4).foo("bar4");
I am trying to match all statements ending with a semicolon (;) and containing the text "dummy(". Then I need to extract the string present in the double quotes inside dummy. I have come up with the following command, but it matches only the first and third statements.
> perl -ne 'print if /dummy/ .. /;/' test.txt | grep -oP 'dummy\((.|\n)*,'
dummy("test1",
dummy("test3",
With the -o flag I expected to extract the string between the double quotes inside dummy, but that is also not working. Can you please give me an idea of how to proceed?
Expected output is:
test1
test2
test3
test4
Some of the answers below work for basic file structures. If a statement spans more than one line, the code breaks. E.g. input text files with more newline characters:
new dummy("test1", random1).foo("bar1");
new dummy("
test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
"test4", random4).foo("bar4");
new dummy("test5",
random5).foo("bar5");
new dummy("test6", random6).foo(
"bar6");
new dummy("test7", random7).foo("
bar7");
I referred to following SO links:
How to give a pattern for new line in grep?
how to grep multiple lines until ; (semicolon)
@TLP was pretty close:
perl -0777 -nE 'say for map {s/^\s+|\s+$//gr} /\bdummy\(\s*"(.+?)"/gs' test.txt
test1
test2
test3
test4
Using:
- -0777 to slurp the file in as a single string
- /\bdummy\(\s*"(.+?)"/gs to find all the quoted string content after "dummy(" (with optional whitespace before the opening quote); the s flag allows . to match newlines. Note that any string containing escaped double quotes will break this regex.
- map {s/^\s+|\s+$//gr} to trim leading/trailing whitespace from each string.
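The same idea can be sketched in Python (a hypothetical port of the Perl one-liner, with re.DOTALL playing the role of the /s flag):

```python
import re

text = '''new dummy("test1", random1).foo("bar1");
new dummy("
test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
"test4", random4).foo("bar4");
'''

# Find the quoted content after dummy( ; DOTALL lets . cross newlines,
# and strip() trims whitespace like the Perl map block does.
matches = [m.strip() for m in re.findall(r'\bdummy\(\s*"(.+?)"', text, re.DOTALL)]
print(matches)  # ['test1', 'test2', 'test3', 'test4']
```

It shares the same caveat: escaped double quotes inside a string would break the regex.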
This perl should work:
perl -0777 -pe 's/(?m)^[^(]* dummy\(\s*"\s*([^"]+).*/$1/g' file
test1
test2
test3
test4
Following gnu-grep + tr should also work:
grep -zoP '[^(]* dummy\(\s*"\s*\K[^"]+"' file | tr '"' '\n'
test1
test2
test3
test4
With your shown samples, please try the following awk code, written and tested with GNU awk.
awk -v RS='(^|\n)new[^;]*;' '
RT{
rt=RT
gsub(/\n+|[[:space:]]+/,"",rt)
match(rt,/"[^"]*"/)
print substr(rt,RSTART+1,RLENGTH-2)
}
' Input_file
You can use Text::ParseWords to extract the quoted fields.
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;
my $str = do {
local $/;
<DATA>;
}; # slurp the text into a variable
my @lines = quotewords(q("), 1, $str); # extract fields
my @txt;
for (0 .. $#lines) {
if ($lines[$_] =~ /\bdummy\s*\(/) {
push @txt, $lines[$_+1]; # target text will be in fields following "dummy("
}
}
s/^\s+|\s+$//g for @txt; # trim leading/trailing whitespace
print Dumper \@txt;
__DATA__
new dummy("test1", random1).foo("bar1");
new dummy("
test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
"test4", random4).foo("bar4");
Output:
$VAR1 = [
'test1',
'test2',
'test3',
'test4'
];
Given:
$ cat file
new dummy("test1", random1).foo("bar1");
new dummy("
test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
"test4", random4).foo("bar4");
You can use GNU grep this way:
$ grep -ozP '[^;]*\bdummy[^";]*"\s*\K[^";]*[^;]*;' file | tr '\000' '\n' | grep -oP '^[^"]*'
test1
test2
test3
test4
Somewhat more robust: if this is ;-delimited text, you can
- split on the ;
- filter for /\bdummy\b/
- grab the first field in quotes
- strip the whitespace.
Here is all that in a ruby:
ruby -e 'puts $<.read.split(/(?<=;)/).
select{|b| b[/\bdummy\b/]}.
map{|s| s[/(?<=")[^"]*/].strip}' file
# same output
awk-based solution handling everything via FS :
<test1.txt gawk -b -e 'BEGIN { RS="^$"
FS="((^|\\n)?"(___="[^\\n")"]+y[(]"(_="[ \\t\\n]*")(__="[\\42]")(_)\
"|"(_="[ \\t]*")(__)(_)"[,]"(___)";]+[;][\\n])+"} sub(OFS=ORS,"",$!--NF)'
test1
test2
test3
test4
gawk was benchmarked at 2 million rows at 5.15 secs, so unless your input file is beyond 100 MB, this suffices.
*** caveat : avoid using mawk-1.9.9.6 with this solution
Suggesting simple gawk script (standard linux awk):
awk '/dummy/{print gensub("[[:space:]]*","",1,$2)}' RS=';' FS='"' input.txt
Explanation:
RS=';' Set awk records separator to ;
FS='"' Set awk fields separator to "
/dummy/ Filter only records matching the dummy RegExp
gensub("[[:space:]]*","",1,$2) Trim any whitespace from the beginning of the 2nd field
print gensub("[[:space:]]*","",1,$2) Print the trimmed 2nd field

Is there a Go function that works like linux cut?

This is probably a very basic question, but I have not been able to find an answer after reviewing the strings package docs.
Basically, all I want to do is the equivalent of:
echo "hello world" | cut -d" " -f2
This splits the string "hello world" using spaces as delimiters, and selects only the 2nd part (1-indexed).
In Go, for splitting there is strings.Split(), which returns a slice that you can index or slice however you like.
s := "hello world"
fmt.Println(strings.Split(s, " ")[1])
This outputs the same. Try it on the Go Playground. If the input is not guaranteed to have 2 parts, the above indexing ([1]) might panic. Check the length of the slice before doing so.
There is the strings.Split() function which splits the string at the specified sub-string.
There are also the functions Fields(s string) []string, and FieldsFunc(s string, f func(rune) bool) []string.
The former splits the string around whitespace, and the latter uses the given function to determine where the string must be split.
The difference between Split and Fields is that Fields treats multiple consecutive spaces as one split location: strings.Fields("  foo bar  baz   ") yields ["foo" "bar" "baz"], while strings.Split("  foo bar  baz   ", " ") yields ["" "" "foo" "bar" "" "baz" "" "" ""].
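The same distinction exists in Python, for comparison: str.split(" ") mirrors strings.Split, while argument-less str.split() mirrors strings.Fields:

```python
s = "  foo bar  baz   "
# Like strings.Fields: runs of whitespace collapse to one split point.
print(s.split())     # ['foo', 'bar', 'baz']
# Like strings.Split: every single space is a split point, empties kept.
print(s.split(" "))  # ['', '', 'foo', 'bar', '', 'baz', '', '', '']
# The cut -d" " -f2 equivalent:
print("hello world".split(" ")[1])  # world
```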

In Perl, why is a utf-8 string printed differently when split into characters?

A specially constructed string is printed differently when I use
print $b;
or
print for split //, $b;
A minimal example is:
#!perl
use warnings;
use strict;
use Encode;
my $b = decode 'utf8', "\x{C3}\x{A1}\x{E2}\x{80}\x{93}\x{C3}\x{A1}"; # 'á–á' in Unicode;
print $b, "\n";
print for split //, $b
The output on the console screen (I think I use cp860) is:
Wide character in print at xx.pl line 9.
├íÔÇô├í
Wide character in print at xx.pl line 10.
ßÔÇôß
or in hex:
C3 A1 E2 80 93 C3 A1
E1 E2 80 93 E1
(separated by 0D 0A of course, i.e., \r\n).
The question is WHY is the character rendered differently?
Surprisingly, the effect disappears without the en-dash. The effect is also seen for longer strings, as the following example shows.
For the string 'Él es mi tío Toño –Antonio Pérez' (typed as Unicode in the program; note that the two lines are different!):
Wide character in print at xx.pl line 14.
├ël es mi t├¡o To├▒o ÔÇôAntonio P├®rez
Wide character in print at xx.pl line 15.
╔l es mi tÝo To±o ÔÇôAntonio PÚrez
However, for the string 'Él es mi tío Toño, Antonio Pérez':
╔l es mi tÝo To±o, Antonio PÚrez
╔l es mi tÝo To±o, Antonio PÚrez
nothing bad happens, and the two lines are rendered in the same way. The only difference is the presence of an en-dash –, i.e., '\x{E2}\x{80}\x{93}'!
Also, print join '', split //, $b; gives the same result as print $b; but different from print for split //, $b;.
If I add binmode STDOUT, 'utf8';, then both outputs are ÔÇô├í = E2 80 93 C3 A1.
So my question is not exactly about how to avoid it, but about why this happens: why does the same string behave differently when split?
Apparently in both cases the utf8 flag is on. Here is a more detailed program that shows more information about both strings: $a before decode and $b after decode:
#!perl
use warnings;
use strict;
use 5.010;
use Encode;
my $a = "\x{C3}\x{A1}\x{E2}\x{80}\x{93}\x{C3}\x{A1}"; # 'á–á' in Unicode;
my $b = decode 'utf8', $a;
say '------- length and utf8 ---------';
say "Length (a)=", length $a, ", is_utf8(a)=", (Encode::is_utf8 ($a) // 'no'), ".";
say "Length (b)=", length $b, ", is_utf8(b)=", (Encode::is_utf8 ($b) // 'no'), ".";
say '------- as a variable---------';
say "a: $a";
say "b: $b", ' <== *** WHY?! ***';
say '------- split ---------';
print "a: "; print for split //, $a; say '';
print "b: "; print for split //, $b; say ' <== *** DIFFERENT! ***';
say '------- split with spaces ---------';
print "a: "; print "[$_] " for split //, $a; say '';
print "b: "; print "[$_] " for split //, $b; say '';
say '------- split with properties ---------';
print "a: "; print "[$_ is_utf=" . Encode::is_utf8 ($_) . " length=" . length ($_) . "] " for split //, $a; say '';
print "b: "; print "[$_ is_utf=" . Encode::is_utf8 ($_) . " length=" . length ($_) . "] " for split //, $b; say '';
say '------- ord() ---------';
print "a: "; print ord, " " for split //, $a; say '';
print "b: "; print ord, " " for split //, $b; say '';
and here is its output on the console:
------- length and utf8 ---------
Length (a)=7, is_utf8(a)=.
Length (b)=3, is_utf8(b)=1.
------- as a variable---------
a: ├íÔÇô├í
Wide character in say at x.pl line 16.
b: ├íÔÇô├í <== *** WHY?! ***
------- split ---------
a: ├íÔÇô├í
Wide character in print at x.pl line 19.
b: ßÔÇôß <== *** DIFFERENT! ***
------- split with spaces ---------
a: [├] [í] [Ô] [Ç] [ô] [├] [í]
Wide character in print at x.pl line 22.
b: [ß] [ÔÇô] [ß]
------- split with properties ---------
a: [├ is_utf= length=1] [í is_utf= length=1] [Ô is_utf= length=1] [Ç is_utf= length=1] [ô is_utf= length=1] [├ is_utf= length=1] [í is_utf= length=1]
Wide character in print at x.pl line 25.
b: [ß is_utf=1 length=1] [ÔÇô is_utf=1 length=1] [ß is_utf=1 length=1]
------- ord() ---------
a: 195 161 226 128 147 195 161
b: 225 8211 225
The difference is whether the string being printed contains any characters >255. print only knows you did something wrong in that situation[1].
Given a handle with no :encoding, print expects a string of bytes (string of characters ≤255).
When it doesn't receive bytes (the string contains characters >255), it notifies you of the error ("wide character") and guesses that you meant to encode the string using UTF-8.
You can think of print on a handle with no :encoding as doing the following:
if ($s =~ /[^\x00-\xFF]/) {
warn("Wide character");
utf8::encode($s);
}
my $b = decode 'utf8', "\x{C3}\x{A1}\x{E2}\x{80}\x{93}\x{C3}\x{A1}";
is the same as
my $b = "\xE1\x{2013}\xE1";
As such, you are doing
print "\xE1\x{2013}\xE1";
print "\xE1";
print "\x{2013}";
print "\xE1";
print "\xE1\x{2013}\xE1"; # Wide char! C3 A1 E2 80 93 C3 A1
Perl notices you forgot to encode, warns you, and prints the string encoded using UTF-8.
print "\xE1"; # E1
Perl has no way of knowing you forgot to encode, so it prints what you asked it to print.
print "\x{2013}"; # Wide char! E2 80 93
Perl notices you forgot to encode, warns you, and prints the string encoded using UTF-8.
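The byte sequences above can be verified in Python by modelling Perl's two behaviors explicitly: characters ≤255 pass through as single bytes (i.e. latin-1), while the "wide character" fallback encodes as UTF-8. This is a sketch of the analogy, not of perl's internals:

```python
# The decoded string from the question: a-acute, EN DASH, a-acute.
s = "\xE1\u2013\xE1"

# A character <= 255 printed to a raw handle comes out as one byte:
assert "\xE1".encode("latin-1") == b"\xE1"

# A character > 255 triggers the "wide character" UTF-8 fallback:
assert "\u2013".encode("utf-8") == b"\xE2\x80\x93"

# Printing the whole string at once encodes ALL of it as UTF-8,
# which is why the lone \xE1 comes out as C3 A1 in that case:
assert s.encode("utf-8") == b"\xC3\xA1\xE2\x80\x93\xC3\xA1"
```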
Footnotes
1. The choice of storage format (as returned by is_utf8) should never have an effect. print is correctly unaffected by it.
utf8::downgrade( my $d = chr(0xE1) ); print($d); # UTF8=0 prints E1
utf8::upgrade( my $u = chr(0xE1) ); print($u); # UTF8=1 prints E1

How to parse words in awk?

I was wondering how to parse a paragraph that looks like the following:
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
And many other lines with text that I do not need
* * * * * * *
Autolisp - Dialect of LISP used by the Autocad CAD package, Autodesk,
Sausalito, CA.
CPL -
1. Combined Programming Language. U Cambridge and U London. A very
complex language, syntactically based on ALGOL-60, with a pure functional
subset.
Modula-3* - Incoprporation of Modula-2* ideas into Modula-3. "Modula-3*:
so that I can get the following output from the awk statement:
Autolisp
CPL
Modula-3*
I have tried the following statements. The file I want to filter is huge: it is a list of all the existing programming languages so far, but basically all the lines follow the same pattern as the above.
Statements I have used so far:
BEGIN{$0 !~ /^ / && NF == 2 && $2 == "-"} { print $1 }
BEGIN{RS=""; ORS="\n\n"; FS=OFS="\n"} /^FLIP -/{print $1,$3}
BEGIN{RS=""; FS=OFS="\n"} {print $1 NF-1}
BEGIN{NF == 2 && $2 == "-" } { print $1 }
BEGIN { RS = "" } { print $1 }
The statements that have worked for me so far are:
BEGIN { RS = "\n\n"; FS = " - " }
{ print $1 }
awk -F " - " "/ - /{ print $1 }" file.txt
But it still prints lines I don't need or skips lines I do need.
Thanks for your help!
I have been breaking my head over this for some days because I am a rookie with AWK programming.
The default FS should be fine. To avoid any duplicate lines, you can pipe the output to sort -u:
$ gawk '$2 == "-" { print $1 }' file | sort -u
Autolisp
CPL
Modula-3*
It might not filter out everything you want but you can keep adding rules until the bad data is filtered.
Alternatively, you can avoid using sort by using an associative array:
$ gawk '$2=="-" { arr[$1] } END { for (key in arr) print key}' file
Autolisp
CPL
Modula-3*
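The same '$2 == "-"' rule can be sketched in Python, assuming (as the awk answers do) that a language line has the name as its first whitespace-separated field and "-" as its second:

```python
# A small excerpt of the sample data from the question.
text = """Autolisp - Dialect of LISP used by the Autocad CAD package, Autodesk,
Sausalito, CA.
CPL -
Modula-3* - Incoprporation of Modula-2* ideas into Modula-3."""

names = []
for line in text.splitlines():
    fields = line.split()
    # Keep the first field of every line whose second field is "-",
    # skipping duplicates (the sort -u / associative-array step).
    if len(fields) >= 2 and fields[1] == "-" and fields[0] not in names:
        names.append(fields[0])
print(names)  # ['Autolisp', 'CPL', 'Modula-3*']
```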
If it doesn't have to be awk, it would probably work to first use grep to select lines of the right form, and then use sed to trim off the end, as follows:
grep -e '^.* -' file | sed -e 's/\(^.*\) -.*$/\1/'
Edit: After some playing around with awk, it looks like part of your issue is that you don't always have '[languagename] - [stuff]', but rather '[languagename] -\n[stuff]', as is the case with CPL in the sample text, and therefore, FS=" - " doesn't separate on things like that.
Also, one possible thing to try is as follows:
BEGIN { r = "^.* -"; }
{
if (match($0, r)) {
printf("%s\n", substr($0, 1, RSTART + RLENGTH - 3));
}
}
I don't actually know much about awk, but this is my best guess at replicating what the grep and sed do above. It does appear to work on the sample text you gave, at least.

add constant after keyword

I have a question concerning the manipulation of a text file.
I have something like this
any text keyword 21 any text 32 any text
any text keyword 12 any text keyword 12 any text 23 any text
any text keyword 34 any text (keyword 45) any text (34) any text
Now I wonder if I can use grep/awk/sed/vi/... to add a constant after the keyword.
For example, I want to add a value of 10 to every integer after "keyword", leaving the other numbers and the file format the same:
any text keyword 31 any text 32 any text
any text keyword 22 any text keyword 22 any text 23 any text
any text keyword 44 any text (keyword 55) any text (34) any text
Sorry, I did not find anything so far...
If a Perl solution is OK for you:
perl -pe 's/(?<=keyword )(\d+)/$1+10/ge;' file
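The same lookbehind-plus-callback approach can be sketched in Python, with re.sub taking a function in place of Perl's /e modifier:

```python
import re

line = "any text keyword 21 any text 32 any text"
# Add 10 only to integers that immediately follow "keyword ".
out = re.sub(r"(?<=keyword )\d+", lambda m: str(int(m.group(0)) + 10), line)
print(out)  # any text keyword 31 any text 32 any text
```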
You mentioned vim; here it goes:
:%s/\v(keyword )@<=[0-9]+/\=submatch(0)+10/g
I tried hard for a sed version:
sed 's/keyword[ \t]*\([0-9]*\)/keyword $(( \1 + 10))/g;
s/"/\\"/g;
s/^/echo \"/;
s/$/\"/' input |
sh
Look at this Perl solution:
perl -pe 's/keyword (\d+)/"keyword ".($1 + 10)/eg' your_file
If you want to exclude some numbers from the sum (34 and 35 in this example):
perl -pe 's/keyword (\d+)/if ($1 != 34 && $1 != 35) { "keyword ".($1 + 10) } else { "keyword ".$1 }/eg' your_file
This works. Now I will admit I'm no awk expert so there may be shorter ways to do it but this is what I hacked together:
#!/bin/sh
cat $1 | awk \
'function incr(str) {
if (match(str, "[0-9]+")) {
number = substr(str, RSTART, RLENGTH)
number = number+10
printf("keyword %d",number)
str = substr(str, RSTART+RLENGTH)
}
}
function findall(str, re) {
where=match(str, re)
if (where==0)
{
print(str)
}
else
{
printf("%s", substr(str, 1, RSTART-1))
offset=RSTART+RLENGTH
incr(substr(str, RSTART, RLENGTH))
str = substr(str, offset)
findall(str, re)
}
}
{
findall($0, "keyword [0-9]+");
}'
