use sed to return only last line that contains specific string - linux

All help would be appreciated, as I have tried much googling and have drawn a blank :)
I am new to sed and do not know what the command I need might be.
I have a file containing numerous lines, e.g.:
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
The file is plain text and is not formatted (i.e. not CSV or anything like that).
I want to search for a list of specific strings, e.g. John Smith, Mike Smith, Jim Smith, and return only the last line in the file for each string found (all other matching lines would be deleted).
(I do not necessarily need every unique entry, ie Jane Smith may or may not be needed)
It is important that the original order of the lines found is maintained in the output.
So the result would be :
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
There are about 100 specific search strings.
Thank you :)

Assuming sample.txt contains the data you provided:
$ cat sample.txt
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
For this sample data, the following script works just fine:
$ cut -f1,2 -d' ' sample.txt | sort | uniq | while read s; do tac sample.txt | grep -m1 -n -e "$s" ; done | sort -n -r -t':' | cut -f2 -d':'
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
Here is the breakdown of the script:
First, generate all the unique keys (first name and last name, in this case).
Next, find the last occurrence of each of these keys. We do this by reversing the file with tac and taking the first match (grep -m1), printing the line number along with the output.
Finally, sort the output in reverse line-number order to restore the original order, then remove the line numbers (we don't need them).
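The three steps above can also be collapsed into a single pass: reverse with tac, keep only the first line seen per key with awk, and reverse back. A minimal sketch, assuming the key is always the first two fields (sample.txt is recreated here so the snippet is self-contained):

```shell
# recreate the sample data from the question
cat > sample.txt <<'EOF'
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
EOF
# reverse, keep the first line per "First Last" key, reverse back
tac sample.txt | awk '!seen[$1 " " $2]++' | tac
```

This prints the four C lines in their original order (Jane Smith included, since no list filtering is done here).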

You didn't specify the format of the given list; I assume it is comma-separated, the same as you wrote in the question: e.g. John Smith, Mike Smith, Jim Smith.
From your description, you want the whole line for each string found, not only columns 1 and 2.
Given those two points, I have:
awk -v list="John Smith, Mike Smith, Jim Smith" '
BEGIN { split(list, p, ",\\s*") }
{
  for (i = 1; i <= length(p); i++) {
    if ($0 ~ p[i]) {
      a[p[i]] = $0
      break
    }
  }
}
END { for (x in a) print a[x] }' file
You can fill the list with your own strings, separated by commas. The output with your test data:
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
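One caveat: for (x in a) does not guarantee any particular output order. A sketch of an order-preserving variant that walks the list in its given order (which matches the sample output here), using only POSIX awk features (the file is recreated so the snippet runs on its own):

```shell
# recreate the sample data from the question
cat > file <<'EOF'
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
EOF
awk -v list="John Smith, Mike Smith, Jim Smith" '
BEGIN { n = split(list, p, /, */) }
{
  # a[i] always holds the latest line matching pattern i
  for (i = 1; i <= n; i++)
    if ($0 ~ p[i]) { a[i] = $0; break }
}
END { for (i = 1; i <= n; i++) if (i in a) print a[i] }' file
```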

Reverse the file first, e.g. something like this:
$ sed -n '/Mike/{p;q}' <(tac input.txt)
Mike Smith C2345613213

sed -n -e 's/.*/&³/
H
$ {x
s/\n/²/g
t a
:a
s/²\([a-zA-Z]* [a-zA-Z]* \)[^³]*³\(\(.*\)²\1\)/\2/
t a
s/²//g;s/³$//;s/³/\
/g
p
}' YourFile
This works with any name (as long as names contain neither ² nor ³). For compound names, change the pattern to [-a-zA-Z].
Note that your sample data also contains Jane Smith, who appears at least once, so she shows up in the output as well.
For a specific list of names, use grep -f beforehand; it's faster and easier to maintain, with no change to the code.
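A sketch of that grep -f idea, with hypothetical file names (patterns.txt holds one search string per line) and a small made-up data set; -F treats the patterns as fixed strings, which is faster for a plain name list:

```shell
# hypothetical data file and pattern list
printf '%s\n' 'John Smith A1' 'Mike Smith A2' 'Jim Smith A3' \
              'John Smith B1' 'Mike Smith B2' 'Jim Smith B3' > YourFile
printf '%s\n' 'John Smith' 'Mike Smith' > patterns.txt
# prefilter with grep -Ff, then keep the last line per name, preserving order
tac YourFile | grep -Ff patterns.txt | awk '!seen[$1 " " $2]++' | tac
```

Jim Smith is filtered out because he is not in patterns.txt; John and Mike keep only their last (B) lines.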

Related

Regular Expressions Python replacing couple names

I would like to find and replace expressions like "John and Jane Doe" with "John Doe and Jane Doe".
For a sample expression:
regextest = 'Heather Robinson, Jane and John Smith, Kiwan and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown'
I can find the expression and replace it with a fixed string but I am not able to replace it with a modification of the original text.
re.sub(r'[a-zA-Z]+\s*and\s*[a-zA-Z]+.[^,]*',"kittens" ,regextest)
Output: 'Heather Robinson, kittens, kittens, Jimmy Nichols, Melanie Carbone, and Nancy Brown'
I think that instead of a string ("kittens") we can pass a function that makes that change, but I am unable to write that function. I am getting the errors below.
def re_couple_name_and(m):
    return f'*{m.group(0).split()[0]+m.group(0).split()[-1:]+ m.group(0).split()[1:]}'
re.sub(r'[a-zA-Z]+\s*and\s*[a-zA-Z]+.[^,]*',re_couple_name_and ,regextest)
IIUC, one way using capture groups:
def re_couple_name_and(m):
    family_name = m.group(3).split(" ", 1)[1]
    return "%s %s" % (m.group(1), family_name) + m.group(2) + m.group(3)
re.sub(r'([a-zA-Z]+)(\s*and\s*)([a-zA-Z]+.[^,]*)',re_couple_name_and ,regextest)
Output:
'Heather Robinson, Jane Smith and John Smith, Kiwan Brady John and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown'
You can use the regex below to capture the items to be interchanged, then use re.sub() to construct the new string.
(\w+)( +and +)(\w+)( +[^,]*)
Example
import re
text="Heather Robinson, Jane and John Smith, Kiwan and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown"
print(re.sub(r"(\w+)( +and +)(\w+)( +[^,]*)",r"\1\4\2\3\4",text))
Output
Heather Robinson, Jane Smith and John Smith, Kiwan Brady John and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown

How to use cut and paste commands as a single line command?

In Unix, I am trying to write a sequence of cut and paste commands (saving the result of each command in a file) that inverts every name in the file shortlist (below) and places a comma after the last name (for example, bill johnson becomes johnson, bill).
Here is my file shortlist:
2233:charles harris :g.m. :sales :12/12/52: 90000
9876:bill johnson :director :production:03/12/50:130000
5678:robert dylan :d.g.m. :marketing :04/19/43: 85000
2365:john woodcock :director :personnel :05/11/47:120000
5423:barry wood :chairman :admin :08/30/56:160000
I am able to cut from shortlist, but I am not sure how to paste the result into my filenew file in the same command line. Here is my code for cut:
cut -d: -f2 shortlist
result:
charles harris
bill johnson
robert dylan
john woodcock
barry wood
Now I want this pasted into my filenew file, so that when I cat filenew the result looks like this:
harris, charles
johnson, bill
dylan, robert
woodcock, john
wood, barry
Please guide me through this. Thank you.
You could do it with a single awk:
awk -F: '{l=""; split($2,a, / /); if(a[2]) l=a[2] ", "; print l a[1]}' shortlist
I am assuming that if there is no second name you don't want to print the comma (and that no name has more than 2 words). Resetting l on each line keeps a one-word name from reusing the previous line's value.
Once you've used cut to split up the string, it may be easier to use awk than paste to produce the result you want:
$ cut -d":" -f2 shortlist | awk '{printf "%s, %s\n", $2, $1}'
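If you want to stay with cut and paste and also write the requested filenew in one go, here is a bash sketch (process substitution is bash-specific; it assumes every name has exactly two words, as in the sample; shortlist is recreated so the snippet is self-contained):

```shell
# recreate the sample shortlist file
cat > shortlist <<'EOF'
2233:charles harris :g.m. :sales :12/12/52: 90000
9876:bill johnson :director :production:03/12/50:130000
5678:robert dylan :d.g.m. :marketing :04/19/43: 85000
2365:john woodcock :director :personnel :05/11/47:120000
5423:barry wood :chairman :admin :08/30/56:160000
EOF
# paste last names and first names back together, comma-separated,
# then widen the comma to ", " and save to filenew
paste -d, <(cut -d: -f2 shortlist | awk '{print $2}') \
          <(cut -d: -f2 shortlist | awk '{print $1}') | sed 's/,/, /' > filenew
cat filenew
```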

How to make each nth line a column using awk?

I have a single column text file looking like this:
John
Doe
Male
1984
Marie
Parker
Female
1989
And I would like to convert it to look like this:
John Doe Male 1984
Marie Parker Female 1989
I've tried using awk and modulo but I cannot manage to find a working solution.
$ pr -4at file
John Doe Male 1984
Marie Parker Female 1989
or your format
$ pr -4ats' ' file
John Doe Male 1984
Marie Parker Female 1989
of course with awk
$ awk 'ORS=NR%4?FS:RS' file
John Doe Male 1984
Marie Parker Female 1989
with paste
$ paste -d' ' - - - - < file
John Doe Male 1984
Marie Parker Female 1989
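For reference, the terse ORS one-liner above can be spelled out with an explicit modulo test, which is the approach the question was reaching for (the input file is rebuilt here so the snippet runs on its own):

```shell
# rebuild the sample single-column file
printf '%s\n' John Doe Male 1984 Marie Parker Female 1989 > file
# print a space after each line, except every 4th line, which gets a newline
awk '{ sep = (NR % 4 == 0) ? "\n" : " "; printf "%s%s", $0, sep }' file
```

Note this prints no trailing newline if the line count is not a multiple of 4.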

Replace single character in string with new line

I'm trying to edit some text files by replacing single characters with a new line.
Before:
Bill Fred Jack L Max Sam
After:
Bill Fred Jack
Max Sam
This is the closest I have gotten, but the single character is not always going to be 'L'.
cat File.txt | tr "L" "\n"
you can try this
sed "s/\s\S\s/\n/g" File.txt
Explanation:
You want to replace any word formed by a single character with a line break (\n).
\s : space and tab
\S : non-whitespace character
bash-4.3$ cat file.txt
1.Bill Fred Jack L Max Sam
2.Bill Fred Jack M Max Sam
3.Bill Fred Jack N Max Sam
bash-4.3$ sed 's/\s[A-Z]\s/\n/g' file.txt
1.Bill Fred Jack
Max Sam
2.Bill Fred Jack
Max Sam
3.Bill Fred Jack
Max Sam
sed "s/[[:blank:]][[:alpha:]][[:blank:]]/\
/g" YourFile
This is the POSIX version, assuming the single letter is inside the string and not at an edge (start or end).

grep shows occurrences of pattern on a per line basis

From the input file:
I am Peter
I am Mary
I am Peter Peter Peter
I am Peter Peter
I want output to be like this:
1 I am Peter
3 I am Peter Peter Peter
2 I am Peter Peter
Where 1, 3 and 2 are occurrences of "Peter".
I tried this, but the info is not formatted the way I wanted:
grep -o -n Peter inputfile
This is not easily solved with grep; I would suggest moving "two tools up" to awk:
awk '$0 ~ FS { print NF-1, $0 }' FS="Peter" inputfile
Output:
1 I am Peter
3 I am Peter Peter Peter
2 I am Peter Peter
Edit:
To answer a question in the comments:
What if I want case insensitive? and what if I want multiple pattern
like "Peter|Mary|Paul", so "I am Peter peter pAul Mary marY John",
will yield the count of 5?
If you are using GNU awk, you do it by enabling IGNORECASE and setting the pattern in FS like this:
awk '$0 ~ FS { print NF-1, $0 }' IGNORECASE=1 FS="Peter|Mary|Paul" inputfile
Output:
1 I am Peter
1 I am Mary
3 I am Peter Peter Peter
2 I am Peter Peter
5 I am Peter peter pAul Mary marY John
You don’t need -o or -n. From grep --help:
-o, --only-matching show only the part of a line matching PATTERN
...
-n, --line-number print line number with output lines
Remove them and your output will be better. I think you’re misinterpreting -n: it just shows the line number, not the occurrence count.
It looks like you’re trying to get the count of “Peter” appearances per line. You’d need something beyond a single grep for that; awk could be a good choice. Or you could loop over each line, split it into words (say, into an array), and grep -c the array for each line to print the line’s count.
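A sketch of that per-line loop idea, counting matches with grep -o piped to wc -l rather than grep -c (since -c counts matching lines, not occurrences); the input file is recreated so the snippet is self-contained:

```shell
# recreate the sample input
cat > inputfile <<'EOF'
I am Peter
I am Mary
I am Peter Peter Peter
I am Peter Peter
EOF
# for each line, count "Peter" occurrences and print count + line,
# skipping lines with no match
while IFS= read -r line; do
  n=$(printf '%s\n' "$line" | grep -o 'Peter' | wc -l)
  [ "$n" -gt 0 ] && printf '%s %s\n' "$n" "$line"
done < inputfile
```

This is far slower than the awk solution on large files (one grep per line), but it is easy to follow.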
