Merge two files with no pseudo-repetitions - Linux

I have two text files file1.txt and file2.txt which both contain lines of words like this:
fare
word
word-ed
wo-ded
wor
and
fa-re
text
uncial
woded
wor
worded
or something like this. By a word, I mean a succession of the letters a-z, possibly with accents, together with the symbol -. My question is: how can I create a third file output.txt from the Linux command line (using awk, sed, etc.) out of these two files which satisfies the following conditions:
If the same word occurs in the two files, the third file output.txt contains it exactly once.
If a hyphenated version of a word in one file occurs in the other (for example fa-re in file2.txt), then only the hyphenated version is retained in output.txt (so only fa-re is retained in our example).
Thus, output.txt should contain the following words:
fa-re
word
word-ed
wo-ded
wor
text
uncial
Edit:
I have modified the files and given the output file as well.
I will try to make sure manually that there are no differently hyphenated words (such as wod-ed and wo-ded).

Another awk:
# process a word if its dehyphenated form is new,
# or if this occurrence is hyphenated (hyphenated forms win)
!($1 in a) || $1 ~ "-" {
    key = value = $1
    gsub("-", "", key)    # key: the word with hyphens removed
    a[key] = value        # value: the original spelling
}
END { for (i in a) print a[i] }
$ awk -f npr.awk file1.txt file2.txt
text
word-ed
uncial
wor
wo-ded
word
fa-re
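Note that for (i in a) visits keys in an unspecified order, so your output may be ordered differently; if you want a stable order, pipe the result through sort:
$ awk -f npr.awk file1.txt file2.txt | sort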

This is not exactly what you asked for, but it is perhaps better suited to what you need.
awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for( i in w) print w[i]}'
This will group all the words in the files by equivalence class (they match once hyphens are removed). You can run another pass over this result to get exactly what you want.
uncial
word
woded wo-ded
wor wor
worded word-ed
text
fa-re fare
The advantages are that you don't have to manually check for alternatively hyphenated words, and that you can see how many different instances you have of each word.
For example, this will filter the previous list down to the desired output.
awk '{w=$1; for(i=1;i<=NF;i++) if(match($i,/-/)!=0)w=$i; print w}'
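So chaining the two passes gives the full solution in one pipeline (a sketch, simply combining the two one-liners above):
awk '{k=$1; gsub("-","",k); w[k]=$1 FS w[k]} END{for(i in w) print w[i]}' file1.txt file2.txt |
awk '{w=$1; for(i=1;i<=NF;i++) if(match($i,/-/)!=0) w=$i; print w}'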

Awk Solution
!($1 in words) {
    split($1, f, "-")
    w = f[1] f[2]
    if (f[2])
        words[w] = $1
    else
        words[w]
}
END {
    for (k in words)
        if (words[k])
            print words[k]
        else
            print k
}
$ awk -f script.awk file1.txt file2.txt
wor
fa-re
text
wo-ded
uncial
word-ed
word
Breakdown
!($1 in words) {
...
}
Only process the line if the first field doesn't already reside as a key in the array words.
split($1, f, "-")
Splits the first field into the array f using - as the delimiter. The first and second parts of the word will reside in f[1] and f[2] respectively. If the word is not hyphenated, it will reside in its entirety inside f[1].
w = f[1] f[2]
Assigns the dehyphenated word to w by concatenating the first and second parts of the word. If the word was not originally hyphenated, the result will be the same since f[2] is empty.
if (f[2])
words[w] = $1
else
words[w]
Stores the dehyphenated word as a key in the words array. If the word was hyphenated (f[2] is not empty), stores the original hyphenated form as the key's value.
END {
for (k in words)
if (words[k])
print words[k]
else
print k
}
After the files have been processed, iterate through the words array; if the key holds a value (a hyphenated word), print the value, otherwise print the key (a non-hyphenated word).
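One caveat: w = f[1] f[2] only rejoins the first two parts, so a word with two or more hyphens (say wo-d-ed) would get the wrong key. A minimal variant of the same idea that handles any number of hyphens, using gsub() instead of split():
!($1 in words) {
    w = $1
    gsub("-", "", w)          # dehyphenate, however many hyphens
    if (w != $1)
        words[w] = $1         # hyphenated form wins
    else
        words[w]              # just create the key
}
END {
    for (k in words)
        print (words[k] ? words[k] : k)
}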

Related

Check if a word from one file exists in another file and print the matching line

I have a file containing some specific words, and another file containing URLs, some of which contain words from file1.
I would like to print each word from file1 together with the URLs from file2 that contain it. If a word is not found in file2, print "no matching" instead.
I tried awk and grep, and also used if conditions, but did not get the expected results.
File1:
abc
Def
XYZ
File2:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
Output can be like:
abc:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Xyz:
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Etc..
Tried:
file=/bin/file1.txt
for i in `cat $file1`;
do
a=$i
echo "$a:" | awk '$repos.txt ~ $a {printf $?}'
done
Tried some other ways like if condition with grep and all... but no luck.
abc means it should only search for abc, not abcd.
You appear to want case-insensitive matching.
An awk solution:
$ cat <<'EOD' >file1
abc
Def
XYZ
missing
EOD
$ cat <<'EOD' >file2
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
EOD
$ awk '
# create a lowercase version of each line
{
    lc = tolower($0)
}
# loop over the lines of file1:
# store the search strings as array keys;
# the value will accumulate the results found
NR==FNR {
    h[lc]
    next
}
# loop over the lines of file2:
# if a search string is found, append the line to its results
{
    for (s in h)
        if (lc ~ s)
            h[s] = h[s] "\n" $0
}
# loop over the search strings and print the results;
# if there is no result, show an error message
END {
    for (s in h)
        print s ":" (h[s] ? h[s] : "\nno matching")
}
' file1 file2
missing:
no matching
def:
Https://gitlab.private.com/apm-team/mi_def_linux1.git
Https://gitlab.private.com/apm-team/mi_def_linux2.git
abc:
Https://gitlab.private.com/apm-team/mi_abc_linux1.git
Https://gitlab.private.com/apm-team/mi_abc_linux2.git
Https://gitlab.private.com/apm-team/mi_abc_linux3.git
xyz:
Https://gitlab.private.com/apm-team/mi_xyz_linux1.git
Https://gitlab.private.com/apm-team/mi_xyz_linux2.git
$
Your attempt is pretty far from the mark; you should probably learn the basics of the shell and Awk before you proceed.
Here is a simple implementation which avoids reading lines with for.
while IFS='' read -r word; do
echo "$word:"
grep -F "$word" File2
done <File1
If you want to match case-insensitively, use grep -iF.
The requirement to avoid substring matches is a complication. The -w option to grep nominally restricts matching to entire words, but the definition of "word" characters includes the underscore character, so you can't use that directly. A manual approximation might look like
grep -iE "(^|[^a-z])$word([^a-z]|$)" File2
but this might not work with all grep implementations.
A better design is perhaps to prefix each output line with the match(es), and only loop over the input file once.
awk 'NR==FNR { w[$0] = "(^|[^a-z])" $0 "([^a-z]|$)"; next }
    { m = ""
      for (a in w) if ($0 ~ w[a]) m = m (m ? "," : "") a
      if (m) print m ":" $0 }' File1 File2
In brief, we collect the search words in the array w from the first input file. When reading the second input file, we collect matches on all the search words in m; if m is non-empty, we print its value followed by the input line which matched.
Again, if you want case-insensitive matching, use tolower() where appropriate.
Demo, featuring lower-case comparisons: https://ideone.com/iTWpFn
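For completeness, a sketch of the case-insensitive variant, lower-casing both the search words and each input line before comparing:
awk 'NR==FNR { w[tolower($0)] = "(^|[^a-z])" tolower($0) "([^a-z]|$)"; next }
    { lc = tolower($0); m = ""
      for (a in w) if (lc ~ w[a]) m = m (m ? "," : "") a
      if (m) print m ":" $0 }' File1 File2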

Find and replace words using sed command not working

I have a tab-separated text file; the first column holds the word to be found and the second column holds the word to replace it with. This text file contains English and Arabic pairs. Once a word has been found and replaced, it should not be changed again.
For example:
adam a +dam
a b
ال ال+
So for a given text file:
adam played with a ball ال
I expect:
a +dam played with b ball ال+
However, I get:
b +dbm plbyed with b bbll ال+
I am using the following sed command to find and replace:
sed -e 's/^/s%/' -e 's/\t/%/' -e 's/$/%g/' tab_sep_file.txt | sed -f - original_file.txt >replaced.txt
How can I fix this issue?
The basic problem with your approach is that you don't want text matched by a prior substitution to be replaced by a later one - you don't want the a's in a +dam changed to b's. This makes sed a poor choice: you can fairly easily build a regular expression that matches all of the things you want to replace, but picking which replacement to apply is the hard part.
A way using GNU awk:
gawk -F'\t' '
FNR == NR { subs[$1] = $2; next }  # populate the array of substitutions
ENDFILE {
    if (FILENAME == ARGV[1]) {
        # Build a regular expression of things to substitute
        subre = "\\<("
        first = 0
        for (s in subs)
            subre = sprintf("%s%s%s", subre, first++ ? "|" : "", s)
        subre = sprintf("%s)\\>", subre)
    }
}
{
    # Do the substitution
    nwords = patsplit($0, words, subre, between)
    printf "%s", between[0]
    for (n = 1; n <= nwords; n++)
        printf "%s%s", subs[words[n]], between[n]
    printf "\n"
}
' tab_sep_file.txt original_file.txt
which outputs (in a UTF-8 locale):
a +dam played with b ball ال+
First it reads the TSV file and builds an array of words to be replaced and text to replace it with (subs). Then after reading that file, it builds a regular expression to match all possible words to be found - \<(a|adam)\> in this case. The \< and \> match only at the beginning and end, respectively, of words, so the a in ball won't match.
Then for the second file with the text you want to process, it uses patsplit() to split each line into an array of matched parts (words) and the bits between matches (between), and iterates over the length of the array, printing out the replacement text for each match. That way it avoids re-matching text that's already been replaced.
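If patsplit() is unfamiliar, here is a minimal illustration (with made-up input) of how the matched parts and the separators around them line up:
$ gawk 'BEGIN {
    n = patsplit("adam played", parts, /adam/, seps)
    printf "%d match: [%s][%s][%s]\n", n, seps[0], parts[1], seps[1]
}'
1 match: [][adam][ played]
seps[0] is the (possibly empty) text before the first match, and seps[i] is the text following parts[i].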
And a perl version that uses a similar approach (Taking advantage of perl's ability to evaluate the replacement text in a s/// substitution):
perl -e '
use strict;
use warnings;
# Set file/standard stream char encodings from locale
use open ":locale";
# Or for explicit UTF-8 text
# use open ":encoding(UTF-8)", ":std";
my %subs;
open my $words, "<", shift or die $!;
while (<$words>) {
    chomp;
    my ($word, $rep) = split "\t", $_, 2;
    $subs{$word} = $rep;
}
my $subre = "\\b(?:" . join("|", map { quotemeta } keys %subs) . ")\\b";
while (<<>>) {
    print s/$subre/$subs{$&}/egr;
}
' tab_sep_file.txt original_file.txt
(This one will escape regular expression metacharacters in the words to replace, making it more robust)

How to grep new line character present in a column in text file

I have a text file with fields separated by a pipe delimiter. One column, named address, has newline characters inside its values.
I have to remove those newlines from the file. Is there any way to grep for the newline characters in the address field?
This is very fragile, and you may want to make it robust, but you could try something like:
$ cat input
a|b|c|d|e|f|g
1|2|3
a|4|5|6|7
$ awk 'NF<7{getline a; $0 = $0 a}1' FS=\| input
a|b|c|d|e|f|g
1|2|3a|4|5|6|7
If you're worried about multiple fields containing the record separator, or runs of newlines, you might try:
awk '{while (split($0, a, "|") < 7) {if (getline b) $0 = $0 b; else exit}}1' FS=\| input
(In each, I've arbitrarily chosen 7 as the number of expected columns, but that's easy to parameterize.)
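For instance, the column count can be passed in with -v; a sketch based on the second variant, where cols is the expected number of fields:
awk -v cols=7 '{while (split($0, a, "|") < cols) {if (getline b) $0 = $0 b; else exit}}1' input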

Vim function to sort lines by tags

I'm trying to create a custom function in gVim to sort lines based on a tag, and I'm a bit out of my depth. The goal is to create a custom function in my _gvimrc so I can re-sort like this quickly and as necessary. I need some pointers as to where to study.
My goal is to group lines by a tag at the start of the line
file contents 1
:a some text
:b some other text
:a more text
:b more other text
:c final text
file contents 2
will become
:a
some text
more text
:b
some other text
more other text
:c
final text
file contents 1
file contents 2
The challenge is largely that I don't want to mess with the other lines in the file - those which have no tags. The :sort function will reorder all of those.
I have a feeling I will need to:
1. yank all the lines with tags into a register, and delete them (some kind of :g/pattern/yank register ?)
2. put them all at the beginning of the file (some kind of :g/pattern/put register ?)
3. sort the block by the tags (some kind of :sort r /pattern/ ?)
4. iterate over each tag to reformat from
:a text
:a text
:a text
to
:a
text
text
text
I'm not proficient enough in gVim to know where to really start - if anyone here is expert enough to assist with one of these sub-problems, so to speak, or has a better idea for the methodology, I'd really appreciate it. I'm particularly stymied on number 4.
If this isn't the kind of thing gVim is capable of, please let me know, but I have a feeling this is quite possible and just out of my pay grade.
I think awk or some other higher-level language would fit this better. Example using awk and Vim's filter command, %!:
:%!awk '$1 ~ /^:/ { a[$1] = a[$1] substr($0, length($1) + 2) "\n"; next } { r = r $0 "\n" } END { n = asorti(a, keys); for (k = 1; k <= n; k++) { print keys[k]; printf "%s", a[keys[k]] } printf "%s", r }'
Prettier version:
$1 ~ /^:/ {
    # collect tagged lines under their tag, minus the tag itself
    a[$1] = a[$1] substr($0, length($1) + 2) "\n"
    next
}
{
    # collect untagged lines in their original order
    r = r $0 "\n"
}
END {
    n = asorti(a, keys)          # sort the tags
    for (k = 1; k <= n; k++) {
        print keys[k]            # the tag on its own line
        printf "%s", a[keys[k]]  # then its collected text
    }
    printf "%s", r               # untagged lines last
}
Note: asorti() requires GNU awk (gawk).
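To wire this into gVim as a reusable command, one option is to save the prettier version to a file and call it through the filter from your _gvimrc; a minimal sketch, assuming the script is saved at ~/sortbytags.awk (a hypothetical path):
command! SortByTags %!gawk -f ~/sortbytags.awk
After that, :SortByTags re-sorts the whole buffer in place.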

How to split a delimited text file in Linux, based on a number of records, when data fields contain the end-of-record separator

Problem Statement:
I have a delimited text file offloaded from Teradata which happens to have "\n" (newline characters, i.e. EOL markers) inside data fields.
The same EOL marker also terminates each record.
I need to split this file into two or more files (based on a record count given by me) while retaining the newlines inside data fields, splitting only at the line breaks that end each record.
Example:
1|Alan
Wake|15
2|Nathan
Drake|10
3|Gordon
Freeman|11
Expectation :
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11
What i have tried :
awk 'BEGIN{RS="\n"}NR%2==1{x="SplitF"++i;}{print > x}' inputfile.txt
The code can't distinguish between newlines inside data fields and the real end-of-record newlines. Is there a way this can be achieved?
Edit: I have changed the problem statement and the example. Please share your thoughts on the new example.
Use the following awk approach:
awk '{ r = (r != "") ? r RS $0 : $0; if (NR%4 == 0) { print r > ("file" (++i) ".txt"); r = "" } }
END { if (r) print r > ("file" (++i) ".txt") }' inputfile.txt
NR%4==0 - each of your logical records occupies two physical lines, and we put two records in each file, so we split after every 4 lines
Results:
> cat file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
> cat file2.txt
3|Gordon
Freeman|11
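The record count can be parameterized too; a sketch, assuming each logical record spans exactly two physical lines as in the example (n is the number of records per output file):
awk -v n=2 '{ r = (r != "") ? r RS $0 : $0
              if (NR % (2*n) == 0) { print r > ("file" (++i) ".txt"); r = "" } }
            END { if (r) print r > ("file" (++i) ".txt") }' inputfile.txt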
If you are using GNU awk you can do this by setting RS appropriately, e.g.:
parse.awk
BEGIN { RS="[0-9]\\|" }
# Skip the empty first record by checking NF (Note: this will also skip
# any empty records later in the input)
NF {
    # Send the record, with the appropriate key, to a numbered file
    printf("%s", d $0) > ("file" i ".txt")
}
# When we have found enough records, close the current file and
# prepare i for opening the next one
#
# Note: NR-1 because of the empty first record
(NR-1)%n == 0 {
    close("file" i ".txt")
    i++
}
# Remember the record key in d, again
# because of the empty first record
{ d = RT }
Run it like this:
gawk -f parse.awk n=2 infile
Where n is the number of records to put into each file.
Output:
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11
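One caveat: RS="[0-9]\\|" only matches a single digit before the pipe, so a record key like 10| would be split in the wrong place. If your keys can be longer, you could widen the pattern (a sketch; the rest of the script is unchanged):
BEGIN { RS = "[0-9]+\\|" }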
