Is it possible to repeat a match in a grep regexp? - linux

I am using this:
grep '\s[A-Z]+\s[A-Z]+\s' file.txt -Po
Which will match
ABC DE
AB AB
DEF GHIFOO
etc
What I want to do is something like
grep '\s([A-Z]+)\s%1\s' file.txt -Po
to only match
AB AB
BC BC
DDD DDD
etc.
I can't work out if it's even possible, let alone how. Is it?
Thanks

The first captured group should be specified as \1 not as %1:
Sample file.txt:
AA AB
AB AB
BC BC
DDD DDD
NN WN
Consider the updated regex pattern:
grep -Po '\b([A-Z]+)\s\1\s*' file.txt
The output:
AB AB
BC BC
DDD DDD
Bonus approach for opposite action:
grep -Po '\b([A-Z]+)\s(?!\1)[A-Z]+\s*' file.txt
The output:
AA AB
NN WN
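One caveat with the first pattern: nothing anchors the end of the backreference, so a line such as AB ABC would still yield the partial match AB AB. Below is a minimal sketch of a stricter variant. It assumes GNU grep built with PCRE support (-P) and uses hypothetical sample data; the trailing \b is an addition for illustration, not part of the answer above.

```shell
# Hypothetical sample data, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'AA AB' 'AB AB' 'AB ABC' 'DDD DDD' > "$tmp/file.txt"

# \1 repeats whatever group 1 captured; the trailing \b rejects "AB ABC".
strict_matches=$(grep -Po '\b([A-Z]+)\s\1\b' "$tmp/file.txt")
printf '%s\n' "$strict_matches"

rm -rf "$tmp"
```

Only the lines AB AB and DDD DDD are reported here.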

Related

Join two files linux

I am trying to join two files but they don't have the same number of lines. I need to join them by the second column.
File1:
11#San Noor#New York, US
22#Maria Shiry#Dubai, UA
55#John Smith#London, England
66#Viki Sam#Roman, Italy
81#Sara Moheeb#Montreal, Canada
File2:
C1#Steve White#11
C2#Hight Look#21
E1#The Heaven is more#52
I1#The Roma Seen#55
The output for paired lines should look like:
San Noor#Steve White
The output for unpairable lines should look like:
Sara Moheeb#NA
(The file3 after joining should contain 5 lines and look as follows.)
San Noor#Steve White
Maria Shiry#Hight Look
John Smith#The Heaven is more
Viki Sam#The Roma Seen
Sara Moheeb#NA
I have tried to join these two files using this command:
join -t '#' -j2 -e "NA" <(sort -t '#' -k2 File1) <(sort -t '#' -k2 File2) > File3
It says that both files are not sorted. Also, I need a way to fill in missing values after join.
Extract relevant columns and paste them together.
paste -d '#' <(cut -d '#' -f2 file1) <(cut -d '#' -f2 file2)
However, this does not produce NA when one file has fewer lines than the other; the missing side is simply left empty. You could pipe the result through something like awk -v OFS='#' -F'#' '{ for (i=1;i<=NF;++i) if (length($i) == 0) $i="NA" } 1' to replace empty fields with the string NA.
Your join approach can also work, but the files share no common field to join on, so join on an imaginary column of line numbers instead:
join -t'#' -eNA -a1 -a2 -o1.2,2.2 <(cut -d'#' -f2 file1 | nl -w1 -s'#') <(cut -d'#' -f2 file2 | nl -w1 -s'#')
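The same pairing by line number can also be sketched in a single awk pass, which avoids any sorting concerns with join. This is only an illustration: the array names left and right are hypothetical, the sample data is copied from the question, and the sketch assumes the first file is not empty (otherwise the NR == FNR test would also be true for the second file).

```shell
# Hypothetical copies of the sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' '11#San Noor#New York, US' '22#Maria Shiry#Dubai, UA' \
    '55#John Smith#London, England' '66#Viki Sam#Roman, Italy' \
    '81#Sara Moheeb#Montreal, Canada' > "$tmp/file1"
printf '%s\n' 'C1#Steve White#11' 'C2#Hight Look#21' \
    'E1#The Heaven is more#52' 'I1#The Roma Seen#55' > "$tmp/file2"

# Remember column 2 of each file by line number, then print the pairs,
# substituting NA where one file has run out of lines.
paired=$(awk -F'#' '
    NR == FNR { left[FNR] = $2; n1 = FNR; next }
              { right[FNR] = $2; n2 = FNR }
    END {
        n = (n1 > n2) ? n1 : n2
        for (i = 1; i <= n; i++) {
            l = (i in left)  ? left[i]  : "NA"
            r = (i in right) ? right[i] : "NA"
            print l "#" r
        }
    }
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$paired"

rm -rf "$tmp"
```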

Swap column x of tab-separated values file with column x of second tsv file

Let's say I have:
file1.tsv
Foo\tBar\tabc\t123
Bla\tWord\tabc\tqwer
Blub\tqwe\tasd\tqqq
file2.tsv
123\tzxcv\tAAA\tqaa
asd\t999\tBBB\tdef
qwe\t111\tCCC\tabc
And I want to overwrite column 3 of file1.tsv with column 3 of file2.tsv to end up with:
Foo\tBar\tAAA\t123
Bla\tWord\tBBB\tqwer
Blub\tqwe\tCCC\tqqq
What would be a good way to do this in bash?
Take a look at this awk:
awk 'FNR==NR{a[NR]=$3;next}{$3=a[FNR]}1' OFS='\t' file{2,1}.tsv > output.tsv
If you want to use just bash, with little more effort:
while IFS=$'\t' read -r a1 a2 _ a4; do
IFS=$'\t' read -ru3 _ _ b3 _
printf '%s\t%s\t%s\t%s\n' "$a1" "$a2" "$b3" "$a4"
done <file1.tsv 3<file2.tsv >output.tsv
Output:
Foo Bar AAA 123
Bla Word BBB qwer
Blub qwe CCC qqq
Another way to do this, with a correction pointed out by @PesaThe:
paste -d$'\t' <(cut -d$'\t' -f1,2 file1.tsv) <(cut -d$'\t' -f3 file2.tsv) <(cut -d$'\t' -f4 file1.tsv)
The output will be:
Foo Bar AAA 123
Bla Word BBB qwer
Blub qwe CCC qqq
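If the column number varies, the awk approach generalizes by passing it in as a variable. The following is a sketch, assuming both files have the same number of lines; the names col and repl are hypothetical. It also sets FS to a tab, so fields that contain spaces stay intact (the one-liner above relies on awk's default whitespace splitting).

```shell
# Hypothetical copies of the sample files from the question.
tmp=$(mktemp -d)
printf 'Foo\tBar\tabc\t123\nBla\tWord\tabc\tqwer\nBlub\tqwe\tasd\tqqq\n' > "$tmp/file1.tsv"
printf '123\tzxcv\tAAA\tqaa\nasd\t999\tBBB\tdef\nqwe\t111\tCCC\tabc\n' > "$tmp/file2.tsv"

col=3
# The file providing the replacement column is read first.
swapped=$(awk -v col="$col" 'BEGIN { FS = OFS = "\t" }
    FNR == NR { repl[FNR] = $col; next }   # first file: remember the column
    { $col = repl[FNR] } 1                 # second file: overwrite and print
' "$tmp/file2.tsv" "$tmp/file1.tsv")
printf '%s\n' "$swapped"

rm -rf "$tmp"
```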

Printing all the lines that contain a certain word exactly k times

I have to search for all the lines from a file which contain a given word exactly k times. I think that I should use grep/sed/awk but I don't know how. My idea was to check every line by line using sed and grep like this:
line=1
while [ (sed -n -'($line)p' $name) -n ]; do
if [ (sed -n -'($line)p' $name | grep -w -c $word) -eq "$number" ]; then
sed -n -'($line)p' $name
fi
let line+=1
done
My first problem is that I get the following error: syntax error near unexpected token 'sed'. Then I realized that, for my test file, the command sed -n -'p1' test.txt | grep -w -c "ab" doesn't return the number of occurrences of "ab" in the first line of my file (it returns 1, but there are 3 occurrences).
My test.txt file:
abc ab cds ab abcd edfs ab
kkmd ab jnabc bad ab
abcdefghijklmnop ab cdab ab ab
abcde bad abc cdef a b
awk to the rescue!
$ awk -F'\\<ab\\>' -v count=2 'NF==count+1' file
kkmd ab jnabc bad ab
Note that the \< and \> word boundaries might be gawk-specific.
For variable assignment, I think the easiest will be
$ word=ab; awk -F"\\\<$word\\\>" -v count=2 'NF==count+1' file
kkmd ab jnabc bad ab
You could use grep, but you'd have to use it twice. (You can't use a single grep because ERE has no way to negate a string, you can only negate a bracket expression, which will match single characters.)
The following is tested with GNU grep v2.5.1, where you can use \< and \> as (possibly non-portable) word delimiters:
$ word="ab"
$ < input.txt egrep "(\<$word\>.*){3}" | egrep -v "(\<$word\>.*){4}"
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab
$ < input.txt egrep "(\<$word\>.*){2}" | egrep -v "(\<$word\>.*){3}"
kkmd ab jnabc bad ab
The idea here is that we'll extract from our input file lines with N occurrences of the word, then strip from that result any lines with N+1 occurrences. Lines with fewer than N occurrences of course won't be matched by the first grep.
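The two-pass idea can be wrapped in a small function so that N is a parameter. This is a sketch only: the function name exactly_n is hypothetical, it relies on the same \< and \> word delimiters (GNU grep), and it assumes the word contains no regex metacharacters.

```shell
# exactly_n WORD N FILE: print the lines of FILE containing WORD exactly N times.
exactly_n() {
    # First pass keeps lines with at least N occurrences,
    # second pass drops those with N+1 or more.
    grep -E "(\<$1\>.*){$2}" "$3" | grep -Ev "(\<$1\>.*){$(($2 + 1))}"
}

# Hypothetical copy of the test file from the question.
tmp=$(mktemp -d)
printf '%s\n' 'abc ab cds ab abcd edfs ab' 'kkmd ab jnabc bad ab' \
    'abcdefghijklmnop ab cdab ab ab' 'abcde bad abc cdef a b' > "$tmp/input.txt"

twice=$(exactly_n ab 2 "$tmp/input.txt")
thrice=$(exactly_n ab 3 "$tmp/input.txt")
printf '%s\n' "$twice" "$thrice"

rm -rf "$tmp"
```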
Or, you might also do this in pure bash, if you're feeling slightly masochistic:
$ word="ab"; num=3
$ readarray -t lines < input.txt
$ for this in "${lines[@]}"; do declare -A words=(); x=( $this ); for y in "${x[@]}"; do ((words[$y]++)); done; [ "0${words[$word]}" -eq "$num" ] && echo "$this"; done
abc ab cds ab abcd edfs ab
abcdefghijklmnop ab cdab ab ab
Broken out for easier reading (or scripting):
#!/usr/bin/env bash
# Salt to taste
word="ab"; num=3
# Pull content into an array. This isn't strictly necessary, but I like
# getting my file IO over with quickly if possible.
# (-t strips the trailing newline from each line.)
readarray -t lines < input.txt
# Walk through the array (or you could just walk through the input file)
for this in "${lines[@]}"; do
    # Initialize this line's counter array
    declare -A words=()
    # Break up the words into array elements
    x=( $this )
    # Step through the array, counting each unique word
    for y in "${x[@]}"; do
        ((words[$y]++))
    done
    # Check the count for "our" word
    [ "0${words[$word]}" -eq "$num" ] && echo "$this"
done
Wasn't that fun? :)
But this awk option makes the most sense to me. It's a portable one-liner that doesn't depend on GNU awk (so it'll work in OS X, BSD, etc.)
awk -v word="ab" -v num=3 '{for(i=1;i<=NF;i++){a[$i]++}} a[word]==num; {delete a}' input.txt
This works by building an associative array to count the words on each line, then printing the line if the count for the "interesting" word is what's specified as num. It's the same basic concept as the bash script above, but awk lets us do this so much better. :)
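If only one word matters, the per-line array can be replaced by a single counter. The following is a variation on the same idea rather than the one-liner above; the function name count_only and the variable c are hypothetical, and words are taken to be whitespace-separated fields.

```shell
# count_only WORD NUM FILE: print lines where WORD appears as a field exactly NUM times.
count_only() {
    awk -v word="$1" -v num="$2" '
        { c = 0; for (i = 1; i <= NF; i++) if ($i == word) c++ }
        c == num
    ' "$3"
}

# Hypothetical copy of the test file from the question.
tmp=$(mktemp -d)
printf '%s\n' 'abc ab cds ab abcd edfs ab' 'kkmd ab jnabc bad ab' \
    'abcdefghijklmnop ab cdab ab ab' 'abcde bad abc cdef a b' > "$tmp/input.txt"

three=$(count_only ab 3 "$tmp/input.txt")
printf '%s\n' "$three"

rm -rf "$tmp"
```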
You can do this with grep
grep -E "(${word}.*){${number}}" test.txt
This looks for at least ${number} occurrences of ${word} per line, including occurrences embedded in longer words (as the example below shows). The wildcard .* is needed since we also want to match occurrences of ${word} that are not next to each other.
Here's what I do:
$ echo 'abc ab cds ab abcd edfs ab
kkmd ab jnabc bad ab
abcdefghijklmnop ab cdab ab ab
abcde bad abc cdef a b' > test.txt
$ word=abc
$ number=2
$ grep -E "(${word}.*){${number}}" test.txt
> abc ab cds ab abcd edfs ab
> abcde bad abc cdef a b
Maybe you need to use sed. If you are looking for character sequences, you can use code like this. However, it doesn't distinguish between the word on its own and the word embedded in another word (so it treats ab and abc as both containing ab).
word="ab"
number=2
sed -n -e "/\($word.*\)\{$(($number + 1))\}/d" -e "/\($word.*\)\{$number\}/p" test.txt
By default, nothing is printed (-n).
The first -e expression looks for 3 (or more) occurrences of $word and deletes lines containing them (and skips to the next line of input). The $(($number + 1)) is shell arithmetic.
The second -e expression looks for 2 occurrences of $word (there won't be more) and prints the lines that match.
If you want words on their own, then you have to work a lot harder. You'd need extended regular expressions, triggered with the -E option on BSD (Mac OS X), or -r with GNU sed.
number=2
plus1=$(($number + 1))
word=ab
sed -En -e "/(^|[^[:alnum:]])($word([^[:alnum:]]).*){$plus1}/d" \
-e "/(^|[^[:alnum:]])($word([^[:alnum:]]).*){$number}$word$/d" \
-e "/(^|[^[:alnum:]])($word([^[:alnum:]]|$).*){$number}/p" test.txt
This is similar to the previous version, but it has considerably more delicate word handling.
The unit (^|[^[:alnum:]]) looks for either the start of line or a non-alphanumeric character (change alnum to alpha throughout if you don't want digits to stop matches).
The first -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters, N+1 times, and deletes such lines (skipping to the next line of input).
The second -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters N times, and then the word again followed by end of line, and deletes such lines.
The third -e looks for start of line or a non-alphanumeric character, followed by the word and a non-alphanumeric and zero or more other characters N times and prints such lines.
Given the (extended) input file:
abc NO ab cds ab abcd edfs ab
kkmd YES ab jnabc bad ab
abcd NO efghijklmnop ab cdab ab ab
abcd NO efghijklmnop ab cdab ab ab
abcd NO e bad abc cdef a b
ab YES abcd abcd ab
best YES ab ab candidly
best YES ab ab candidly
ab NO abcd abcd ab ab
hope NO abcd abcd ab ab ab
nope NO abcd abcd ab ab ab
ab YES abcd abcd ab not bad
said YES ab not so bad ab or bad
Example output:
kkmd YES ab jnabc bad ab
ab YES abcd abcd ab
best YES ab ab candidly
best YES ab ab candidly
ab YES abcd abcd ab not bad
said YES ab not so bad ab or bad
It is not a trivial exercise in sed. It would be simpler if you could rely on word-boundary detection. For example, in Perl:
number=2
plus1=$(($number + 1))
word=ab
perl -n -e "next if /(\b$word\b.*?){$plus1}/;
print if /(\b$word\b.*?){$number}/" test.txt
This produces the same output as the sed script, but is a lot simpler because of the \b word boundary detection (the .*? non-greedy matching isn't crucial to the operation of the script).
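For completeness, the line-by-line loop from the question can be made to work as well. grep -c counts matching lines rather than matches, which is why the question's test printed 1 for a line with three occurrences. Here is a sketch that assumes a grep supporting -o and -w (GNU and BSD grep both do); the function name exact_count is hypothetical, and this is much slower than the awk and Perl versions because it starts a pipeline for every line.

```shell
# exact_count WORD NUMBER FILE: print lines containing WORD exactly NUMBER times.
exact_count() {
    word=$1 number=$2 file=$3
    while IFS= read -r line; do
        # grep -o prints each match on its own line, so wc -l counts occurrences;
        # -w restricts matches to whole words.
        count=$(( $(printf '%s\n' "$line" | grep -ow -- "$word" | wc -l) ))
        if [ "$count" -eq "$number" ]; then
            printf '%s\n' "$line"
        fi
    done < "$file"
}

# Hypothetical copy of the test file from the question.
tmp=$(mktemp -d)
printf '%s\n' 'abc ab cds ab abcd edfs ab' 'kkmd ab jnabc bad ab' \
    'abcdefghijklmnop ab cdab ab ab' 'abcde bad abc cdef a b' > "$tmp/test.txt"

looped=$(exact_count ab 2 "$tmp/test.txt")
printf '%s\n' "$looped"

rm -rf "$tmp"
```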

How To Check File Name from 2nd List Name in Linux?

I want to ask how to compare two lists of names stored in files, like:
cat /data/file1/ab.txt
aa
bb
cc
dd
ee
cat /data/file2/cd.txt
cc
dd
ee
aa
zz
xx
yy
and I want the output to be something like:
zz
xx
yy
sort ab.txt > /tmp/file1
sort cd.txt > /tmp/file2
comm -13 /tmp/file1 /tmp/file2
The comm program compares two sorted files and shows the lines that are unique to each and the lines they have in common. -13 suppresses column 1 (lines unique to file 1) and column 3 (lines common to both), leaving only the lines unique to file 2.
You can also use grep:
$ grep -vf ab.txt cd.txt
zz
xx
yy
-f tells grep to obtain patterns from ab.txt and -v inverts the matches.
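One thing to watch: by default each line of ab.txt is treated as a regular expression that may match anywhere in a line. A sketch of the difference follows, using a hypothetical extra entry aab in the second list; -F (fixed strings) and -x (whole-line match) are standard grep options.

```shell
# Hypothetical lists; "aab" is added to show the substring problem.
tmp=$(mktemp -d)
printf '%s\n' aa bb cc dd ee > "$tmp/ab.txt"
printf '%s\n' cc dd ee aa zz xx yy aab > "$tmp/cd.txt"

loose=$(grep -vf "$tmp/ab.txt" "$tmp/cd.txt")          # drops "aab": it contains "aa"
only_in_cd=$(grep -Fxvf "$tmp/ab.txt" "$tmp/cd.txt")   # keeps "aab"
printf '%s\n' "$only_in_cd"

rm -rf "$tmp"
```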
You can also use awk:
awk 'NR==FNR{a[$1];next}!($1 in a)' ab.txt cd.txt

Use a file to extract specified rows from another file

input1:
1 s1
100 s100
90 s90
input2:
a 1
b 3
c 7
d 100
e 101
f 90
Output:
a 1
d 100
f 90
I know join can do this, but it needs to (1) sort these common fields and (2) after the join, I need to remove the second column from input1. Does anyone have a better solution for this?
Here's one way using awk:
awk 'FNR==NR { a[$1]; next } $2 in a' file1 file2
Results:
a 1
d 100
f 90
This might work for you (GNU sed):
sed -r 's|(\S+).*|/\\<\1$/p|' input1 | sed -nf - input2
Depending on your requirements, grep might do:
grep -wFf <(cut -d' ' -f1 input1) input2
Output:
a 1
d 100
f 90
Note that grep is not column-aware and will happily match where it can.
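To illustrate that caveat, here is a sketch with a hypothetical extra row 100 x in input2, where a key appears in the first column instead of the second. The grep version reports that row, while the awk version only compares column 2. A temporary keys file stands in for the process substitution.

```shell
# Hypothetical data; the row "100 x" has a key in the wrong column.
tmp=$(mktemp -d)
printf '%s\n' '1 s1' '100 s100' '90 s90' > "$tmp/input1"
printf '%s\n' 'a 1' 'b 3' '100 x' 'f 90' > "$tmp/input2"

cut -d' ' -f1 "$tmp/input1" > "$tmp/keys"
by_grep=$(grep -wFf "$tmp/keys" "$tmp/input2")    # also prints "100 x"
by_awk=$(awk 'FNR == NR { keys[$1]; next } $2 in keys' "$tmp/input1" "$tmp/input2")
printf '%s\n' "$by_awk"

rm -rf "$tmp"
```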
As far as I know, awk is the better solution for this, but since it has already been provided, below is a Perl solution.
> perl -F -lane '$H{$F[0]}=$F[1];END{%T=reverse(%H);foreach (values %H){if(exists($H{$_})){print $T{$_}." ".$_;}}}' file1 file2
a 1
d 100
f 90
