I have a tab-separated text file whose first column holds the word to be found and whose second column holds the replacement. The file contains English and Arabic pairs. Once a word has been found and replaced, it should not be changed again.
For example:
adam a +dam
a b
ال ال+
So for a given text file:
adam played with a ball ال
I expect:
a +dam played with b ball ال+
However, I get:
b +dbm plbyed with b bbll ال+
I am using the following sed command to find and replace:
sed -e 's/^/s%/' -e 's/\t/%/' -e 's/$/%g/' tab_sep_file.txt | sed -f - original_file.txt >replaced.txt
How can I fix this issue?
The basic problem with your approach is that you don't want text matched by a prior substitution to be replaced again by a later one - you don't want to change the a's in a +dam to b's. This makes sed a pretty poor choice - you can fairly easily build a regular expression that matches all of the things you want to replace, but picking which replacement to use is the hard part.
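To see the problem concretely, the first sed in your pipeline turns the sample TSV into this script:

s%adam%a +dam%g
s%a%b%g
s%ال%ال+%g

sed then runs these commands in order on every line, so the second command happily replaces the a's that the first command just inserted - which is exactly the b +dbm output you are seeing.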
A way using GNU awk:
gawk -F'\t' '
FNR == NR { subs[$1] = $2; next } # populate the array of substitutions
ENDFILE {
    if (FILENAME == ARGV[1]) {
        # Build a regular expression of things to substitute
        subre = "\\<("
        first = 0
        for (s in subs)
            subre = sprintf("%s%s%s", subre, first++ ? "|" : "", s)
        subre = sprintf("%s)\\>", subre)
    }
}
{
    # Do the substitution
    nwords = patsplit($0, words, subre, between)
    printf "%s", between[0]
    for (n = 1; n <= nwords; n++)
        printf "%s%s", subs[words[n]], between[n]
    printf "\n"
}
' tab_sep_file.txt original_file.txt
which outputs
a +dam played with b ball ال+
First it reads the TSV file and builds an array of words to be replaced and text to replace it with (subs). Then after reading that file, it builds a regular expression to match all possible words to be found - \<(a|adam)\> in this case. The \< and \> match only at the beginning and end, respectively, of words, so the a in ball won't match.
Then for the second file with the text you want to process, it uses patsplit() to split each line into an array of matched parts (words) and the bits between matches (between), and iterates over the length of the array, printing out the replacement text for each match. That way it avoids re-matching text that's already been replaced.
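For instance, on the text adam played with a ball, patsplit() returns 2 and fills the arrays roughly like this:

words[1] = "adam"    between[0] = ""
words[2] = "a"       between[1] = " played with "
                     between[2] = " ball"

so the loop prints between[0], then subs["adam"] followed by " played with ", then subs["a"] followed by " ball".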
And a perl version that uses a similar approach (taking advantage of perl's ability to evaluate the replacement text in an s/// substitution):
perl -e '
use strict;
use warnings;
# Set file/standard stream char encodings from locale
use open ":locale";
# Or for explicit UTF-8 text
# use open ":encoding(UTF-8)", ":std";
my %subs;
open my $words, "<", shift or die $!;
while (<$words>) {
    chomp;
    my ($word, $rep) = split "\t", $_, 2;
    $subs{$word} = $rep;
}
my $subre = "\\b(?:" . join("|", map { quotemeta } keys %subs) . ")\\b";
while (<<>>) {
    print s/$subre/$subs{$&}/egr;
}
' tab_sep_file.txt original_file.txt
(This one will escape regular expression metacharacters in the words to replace, making it more robust)
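For the sample word list, $subre comes out as something like \b(?:adam|a|ال)\b. The order of the alternatives depends on hash ordering, but the \b anchors mean that a alone can never match inside adam, so the order doesn't matter here.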
I have a file containing some specific words. I have another file containing URLs, some of which contain those words from file1.
I would like to print the URLs that match each word in file1. If a word is not found in file2, it should print "no matching".
I tried with awk and grep, and used if conditions too, but did not get the expected results.
File1:
abc
Def
XYZ
File2:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
Output can be like:
abc:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Xyz:
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Etc..
Tried:
file=/bin/file1.txt
for i in `cat $file1`;
do
a=$i
echo "$a:" | awk '$repos.txt ~ $a {printf $?}'
done
Tried some other ways like if condition with grep and all... but no luck.
abc means it should only search for abc, not abcd.
You appear to want case-insensitive matching.
An awk solution:
$ cat <<'EOD' >file1
abc
Def
XYZ
missing
EOD
$ cat <<'EOD' >file2
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
EOD
$ awk '
# create a lowercase version of each line
{
    lc = tolower($0)
}
# loop over lines of file1:
# store search strings in an array;
# key is search string, value will be results found
NR==FNR {
    h[lc]
    next
}
# loop over lines of file2:
# if search string found, append line to results
{
    for (s in h)
        if (lc ~ s)
            h[s] = h[s] "\n" $0
}
# loop over search strings and print results;
# if no result, show error message
END {
    for (s in h)
        print s ":" (h[s] ? h[s] : "\nno matching")
}
' file1 file2
missing:
no matching
def:
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
abc:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
xyz:
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
$
Your attempt is pretty far from the mark; it would be worth learning the basics of the shell and Awk before you proceed.
Here is a simple implementation which avoids reading lines with for.
while IFS='' read -r word; do
    echo "$word:"
    grep -F "$word" File2
done <File1
If you want to match case-insensitively, use grep -iF.
The requirement to avoid substring matches is a complication. The -w option to grep nominally restricts matching to entire words, but its definition of "word" characters includes the underscore, so you can't use it directly here. A manual approximation might look like
grep -iE "(^|[^a-z])$word([^a-z]|$)" File2
but this might not work with all grep implementations.
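You can see the problem with the sample data:

$ echo 'Https://gitlab.private.com/apm-team/mi_abc_linux1.git' | grep -wc abc
0

No match, because the underscores on both sides of abc count as word characters.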
A better design is perhaps to prefix each output line with the match(es), and to loop over the input file only once.
awk 'NR==FNR { w[$0] = "(^|[^a-z])" $0 "([^a-z]|$)"; next }
     { m = ""
       for (a in w) if ($0 ~ w[a]) m = m (m ? "," : "") a
       if (m) print m ":" $0 }' File1 File2
In brief, we collect the search words in the array w from the first input file. When reading the second input file, we collect matches on all the search words in m; if m is non-empty, we print its value followed by the input line which matched.
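With File1 and File2 as above, this prints

abc:Https://gitlab.private.com/apm-team/mi_abc_linux1.git
abc:Https://gitlab.private.com/apm-team/mi_abc_linux2.git
abc:Https://gitlab.private.com/apm-team/mi_abc_linux3.git

and nothing for Def or XYZ, because as written the comparison is case-sensitive.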
Again, if you want case-insensitive matching, use tolower() where appropriate.
Demo, featuring lower-case comparisons: https://ideone.com/iTWpFn
I have a very large csv file that is too big to open in excel for this operation.
I need to replace a specific string for approx 6000 records out of the 1.5mil in the csv, the string itself is in the comma separated format like so:
ABC,FOO.BAR,123456
There are other columns on either side that are of no concern; I only need enough surrounding data to make sure the final field (the numbers) is unique.
I have another file with the string to replace and the replacement string like (for the above):
"ABC,FOO.BAR,123456","ABC,FOO.BAR,654321"
So in the case above, 123456 is being replaced by 654321. A simple (yet maddeningly slow) way to do this is to open both docs in notepad++, find the first string, and replace it with the second string, but with over 6000 records this isn't great.
I was hoping someone could give advice on a scripting solution? e.g.:
$file1 = base.csv
$file2 = replace.csv
For each row in $file2 {
awk '{sub(/$file2($firstcolumn)/,$file2($Secondcolumn)' $file1
}
Though I'm not entirely sure how to adapt awk to do an operation like this.
EDIT: Sorry, I should have been more specific: the data in my replacement csv is only two columns; two raw strings!
It would be easier, of course, if your delimiter were not used within the fields...
You can do this in two steps: create a sed script from the lookup file, then use it on the main data file to perform the replacements.
For example (this assumes there are no escaped quotes in the fields, which may not hold):
$ awk -F'","' '{print "s/" $1 "\"/\"" $2 "/"}' lookup_file > replace.sed
$ sed -f replace.sed data_file
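For the sample lookup row, replace.sed would contain

s/"ABC,FOO.BAR,123456"/"ABC,FOO.BAR,654321"/

Keep in mind the left-hand side is still a regular expression, so the . in FOO.BAR can in principle match any character.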
awk -F\" '
NR==FNR { subst[$2]=$4; next }
{
for (s in subst) {
pos = index($0, s)
if (pos) {
$0 = substr($0, 1, pos-1) subst[s] substr($0, pos + length(s))
break
}
}
print
}
' "$file2" "$file1" # > "$file1.$$.tmp" && mv "$file1.$$.tmp" "$file1"
The part after the # shows how you could replace the input data file with the output.
The block associated with NR==FNR is only executed for the first input file, the one with the search and replacement strings.
subst[$2]=$4 builds an associative array (dictionary): the key is the search string, the value the replacement string.
Fields $2 and $4 are the search string and the replacement string, respectively, because Awk was instructed to break the input into fields by " (-F\"); note that this assumes that your strings do not contain escaped embedded " chars.
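For the sample replacement row, the fields come out as

$1 = (empty)    $2 = ABC,FOO.BAR,123456    $3 = ,    $4 = ABC,FOO.BAR,654321

i.e., $2 and $4 are exactly the two raw strings, stripped of their enclosing quotes.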
The remaining block then processes the data file:
For each input line, it loops over the search strings and looks for a match on the current line:
Once a match is found, the replacement string is substituted for the search string, and matching stops.
print simply prints the (possibly modified) line.
Note that since you want literal string replacements, regex-based functions such as sub() are explicitly avoided in favor of literal string-processing functions index() and substr().
As an aside: since you say there are columns on either side in the data file, consider making the search/replacement strings more robust by placing , on either side of them (this could be done inside the awk script).
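A minimal sketch of that aside (assuming each search string is always flanked by commas in the data file, i.e. it never sits at the very start or end of a line) would be to change the first block to

NR==FNR { subst["," $2 ","] = "," $4 ","; next }   # anchor on the surrounding commas

so that only whole, comma-delimited occurrences get replaced.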
I would recommend using a language with a CSV parsing library rather than trying to do this with shell tools. For example, Ruby:
require 'csv'
replacements = CSV.open('replace.csv','r').to_h
File.open('base.csv', 'r').each_line do |line|
  replacements.each do |old, new|
    line.gsub!(old) { new }
  end
  puts line
end
Note that Enumerable#to_h requires Ruby v2.1+; replace with this for older Rubys:
replacements = Hash[*CSV.open('replace.csv','r').to_a.flatten]
You only really need CSV for the replacements file; this assumes you can apply the substitutions to the other file as plain text, which speeds things up a bit and avoids having to parse the old/new strings out into fields themselves.
I finally know how to use regular expressions to replace one substring with another every place where it occurs within a string. But what I need to do now is a bit more complicated than that.
A string I must transform will have many instances of the newline character ('\n'). If a newline character is enclosed within fish-tags (between '<' and '>'), I need to replace it with a simple space character (' ').
However, if a newline character occurs anywhere else in the string, I need to leave that newline character alone.
There will be several places in the string that are enclosed in fish-tags, and several places that aren't.
Is there a way to do this in Perl?
I honestly don't recommend doing this with regular expressions. Besides the fact that you should never parse HTML with a regular expression, it's also a pain to do negative matches with regular expressions, and anyone reading the code will honestly have no idea what you just did. Doing it manually, on the other hand, is really easy to understand.
This code assumes well-formed HTML that doesn't have tags starting inside the definition of other tags (otherwise you would have to track all the instances and increment/decrement a count appropriately), and it does not handle < or > inside quoted strings, which isn't the most common thing. If you need to handle all of that, I really recommend you use a real HTML parser; there are many of them.
Obviously, if you're not reading this from a filehandle, the loop would go over an array of lines instead (or the output of splitting the whole text; though if you split, you would append ' ' or "\n" depending on the inside variable, since splitting removes the newlines).
use strict;
use warnings;
# Default to being outside a tag
my $inside = 0;
while (my $line = <DATA>) {
    # Find the last < and > in the string
    my ($open, $close) = map { rindex($line, $_) } qw(< >);
    # Update our state accordingly.
    if ($open > $close) {
        $inside = 1;
    } elsif ($open < $close) {
        $inside = 0;
    }
    # If we're inside a tag, change the newline (the last character in the
    # line) to a space. If you instead want to remove it, use the built-in chomp.
    if ($inside) {
        # chomp($line);
        substr($line, -1) = ' ';
    }
    print $line;
}
__DATA__
This is some text
and some more
<enclosed><a
b
c
> <d
e
f
>
<g h i
>
Given:
$ echo "$txt"
Line 1
Line 2
< fish tag line 1
and line 2 >
< line 3 >
< fish tag line 4
and line 5 >
You can do:
$ echo "$txt" | perl -0777 -lpe "s/(<[^\n>]*)\n+([^>]*>)/\1\2/g"
Line 1
Line 2
< fish tag line 1 and line 2 >
< line 3 >
< fish tag line 4 and line 5 >
I will echo that this only works in limited cases. Please do not get into the general habit of using a regex for HTML.
This solution uses zdim's data (thanks, zdim)
I prefer to use an executable replacement together with the non-destructive option of the tr/// operator
This solution finds all occurrences of strings enclosed in angle brackets <...> and alters all newlines within each one to single spaces
Note that it would be simple to allow for quoted substrings containing any characters by writing this instead
$data =~ s{ ( < (?: "[^"]+" | [^>] )+ > ) }{ $1 =~ tr/\n/ /r }gex;
use strict;
use warnings 'all';
use v5.14; # For /r option
my $data = do {
    local $/;
    <DATA>;
};
$data =~ s{ ( < [^<>]+ > ) }{ $1 =~ tr/\n/ /r }gex;
print $data;
__DATA__
start < inside tags> no new line
again <inside, with one nl
> out
more <inside, with two NLs
and more text
>
output
start < inside tags> no new line
again <inside, with one nl > out
more <inside, with two NLs and more text >
The (X)HTML/XML shouldn't be parsed with regexes, but since no fuller description of the problem is given, here is a way to go at it. Hopefully it demonstrates how tricky and involved this can get.
You can match a newline itself, together with the details of how linefeeds may appear in the text:
use warnings;
use strict;
my $text = do { # read all text into one string
    local $/;
    <DATA>;
};
1 while $text =~ s/< ([^>]*) \n ([^>]*) >/<$1 $2>/gx;
print $text;
__DATA__
start < inside tags> no new line
again <inside, with one nl
> out
more <inside, with two NLs
and more text
>
This prints
start < inside tags> no new line
again <inside, with one nl > out
more <inside, with two NLs and more text >
The negated character class [^>] matches anything other than >, any number of times (including zero) thanks to *, up to a \n. Then another such pattern follows the \n, up to the closing >. The /x modifier allows the spaces inside the regex, for readability. We also need to consider two particular cases.
There may be multiple \n inside <...>, for which the while loop is a clean solution.
There may be multiple <...> with \n, which is what /g is for.
The 1 while ... idiom is another way to write while (...) { }, where the body of the loop is empty so everything happens in the condition, which is repeatedly evaluated until false. In our case the substitution keeps being done in the condition until there is no match, when the loop exits.
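Tracing the sample input: in the tag with two newlines, the greedy [^>]* means the first pass replaces the last newline inside the tag, leaving one more; the substitution in the loop condition succeeds again on the second pass and removes it, then fails on the third attempt, which ends the loop.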
Thanks to ysth for bringing up these points and for the 1 while ... solution.
All of this necessary care for various details and edge cases (of which there may be more) hopefully convinces you that it is better to reach for an HTML parsing module suitable for the particular task. For this we'd need to know more about the problem.
I have two text files file1.txt and file2.txt which both contain lines of words like this:
fare
word
word-ed
wo-ded
wor
and
fa-re
text
uncial
woded
wor
worded
or something like this. By a word, I mean a succession of the letters a-z, possibly with accents, together with the symbol -. My question is: how can I create a third file output.txt from the Linux command line (using awk, sed, etc.) out of these two files which satisfies the following conditions:
If the same word occurs in the two files, the third file output.txt contains it exactly once.
If a hyphenated version (for example fa-re in file2.txt) of a word in one file occurs in the other, then only the hyphenated version is retained in output.txt (so only fa-re is retained in our example).
Thus, output.txt should contain the following words:
fa-re
word
word-ed
wo-ded
wor
text
uncial
================Edit========================
I have modified the files and given the output file as well.
I will try to make sure manually that there are no differently hyphenated words (such as wod-ed and wo-ded).
Another awk:
!($1 in a) || $1 ~ "-" {
    key = value = $1; gsub("-","",key); a[key] = value
}
END { for (i in a) print a[i] }
$ awk -f npr.awk file1.txt file2.txt
text
word-ed
uncial
wor
wo-ded
word
fa-re
This is not exactly what you asked, but it is perhaps better suited to what you need.
awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for( i in w) print w[i]}'
this will group all words in the files by equivalence class (match without hyphen). You can have another pass from this result to get what you desire.
uncial
word
woded wo-ded
wor wor
worded word-ed
text
fa-re fare
The advantage is that you don't have to manually check whether there are alternative hyphenated words, and you can see how many different instances you have for each word.
For example, this will filter the previous list down to the desired output.
awk '{w=$1; for(i=1;i<=NF;i++) if(match($i,/-/)!=0)w=$i; print w}'
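Putting the two passes together:

awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for( i in w) print w[i]}' file1.txt file2.txt |
awk '{w=$1; for(i=1;i<=NF;i++) if(match($i,/-/)!=0)w=$i; print w}'

prints the seven desired words, though in no guaranteed order, since for (i in w) is unordered.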
Awk Solution
!($1 in words) {
    split($1, f, "-")
    w = f[1] f[2]
    if (f[2])
        words[w] = $1
    else
        words[w]
}
END {
    for (k in words)
        if (words[k])
            print words[k]
        else
            print k
}
$ awk -f script.awk file1.txt file2.txt
wor
fa-re
text
wo-ded
uncial
word-ed
word
Breakdown
!($1 in words) {
...
}
Only process the line if the first field doesn't already reside as a key in the array words.
split($1, f, "-")
Splits the first field into the array f, using - as the delimiter. The first and second parts of the word will reside in f[1] and f[2] respectively. If the word is not hyphenated, it will reside in its entirety inside f[1].
w = f[1] f[2]
Assigns the dehyphenated word to w by concatenating the first and second parts of the word. If the word was not originally hyphenated, the result will be the same, since f[2] is empty.
if (f[2])
words[w] = $1
else
words[w]
Store the dehyphenated word as a key in the words array. If the word was hyphenated (f[2] is not empty), store the original as the key's value.
END {
    for (k in words)
        if (words[k])
            print words[k]
        else
            print k
}
After the file has been processed, iterate through the words array; if the key holds a value (a hyphenated word), print the value, otherwise print the key (a non-hyphenated word).
I was wondering how to parse a paragraph that looks like the following:
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text
And many other lines with text that I do not need
* * * * * * *
Autolisp - Dialect of LISP used by the Autocad CAD package, Autodesk,
Sausalito, CA.
CPL -
1. Combined Programming Language. U Cambridge and U London. A very
complex language, syntactically based on ALGOL-60, with a pure functional
subset.
Modula-3* - Incoprporation of Modula-2* ideas into Modula-3. "Modula-3*:
So that I get the following output from the awk command:
Autolisp
CPL
Modula-3*
I have tried the following commands, because the file I want to filter is huge: it is a list of all the existing programming languages so far, but basically all the lines follow the same pattern as the above.
Commands I have used so far:
BEGIN{$0 !~ /^ / && NF == 2 && $2 == "-"} { print $1 }
BEGIN{RS=""; ORS="\n\n"; FS=OFS="\n"} /^FLIP -/{print $1,$3}
BEGIN{RS=""; FS=OFS="\n"} {print $1 NF-1}
BEGIN{NF == 2 && $2 == "-" } { print $1 }
BEGIN { RS = "" } { print $1 }
The commands that have worked best for me so far are:
BEGIN { RS = "\n\n"; FS = " - " }
{ print $1 }
awk -F " - " "/ - /{ print $1 }" file.txt
But they still print lines I don't need or skip lines I do need.
Thanks for your help & responses!
I have been racking my brain over this for days because I am a rookie at awk programming.
The default FS should be fine. To avoid any duplicate lines, you can pipe the output to sort -u:
$ gawk '$2 == "-" { print $1 }' file | sort -u
Autolisp
CPL
Modula-3*
It might not filter out everything you want, but you can keep adding rules until the bad data is filtered.
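For example, if the full file turned out to have other lines whose second field happens to be - (say, indented continuation lines), one possible extra rule (a sketch, depending on data not shown here) would be to also require the line to start flush left:

$ gawk '$2 == "-" && /^[^ \t]/ { print $1 }' file | sort -u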
Alternately you can avoid using sort by using an associative array:
$ gawk '$2=="-" { arr[$1] } END { for (key in arr) print key}' file
Autolisp
CPL
Modula-3*
If it doesn't have to be awk, it would probably work to first use grep to select lines of the right form, and then use sed to trim off the end, as follows:
grep -e '^.* -' file.txt | sed -e 's/\(^.*\) -.*$/\1/'
Edit: After some playing around with awk, it looks like part of your issue is that you don't always have '[languagename] - [stuff]', but rather '[languagename] -\n[stuff]', as is the case with CPL in the sample text, and therefore, FS=" - " doesn't separate on things like that.
Also, one possible thing to try is as follows:
BEGIN { r = "^.* -" }
{
    if (match($0, r)) {
        printf("%s\n", substr($0, 1, RSTART + RLENGTH - 3));
    }
}
I don't actually know much about awk, but this is my best guess at replicating what the grep and sed do above. It does appear to work on the sample text you gave, at least.