Removing a multi-line string from a field in a file - linux

I have a csv file, shown below, which is sent by the source system; they have no processing mechanism on their end except the ability to add columns:
1,"Bob Smith
531 Pennsylvania Avenue
Washington, DC",3,4,"qqqqzzzz"
5,"Bob Smith
531 Pennsylvania Avenue
Washington, DC",6,7,"qqqqzzzz"
Expected output:
1,"Bob Smith 531 Pennsylvania Avenue Washington, DC",3,4
5,"Bob Smith 531 Pennsylvania Avenue Washington, DC",6,7
I have tried the approach below:
Requested the source system to add an identifier, "qqqqzzzz", at the end of each line
Tried to replace all newlines with spaces, and then to replace every qqqqzzzz with a newline
But the last replacement of qqqqzzzz also introduces a newline before the closing quotes, which breaks the record onto the next line as below:
1,"Bob Smith 531 Pennsylvania Avenue Washington, DC",3,4,""
5,"Bob Smith
sed '/^$/d' all.csv|tr '\n' ' '|sed 's/qqqqzzzz/\n/g' >results.csv
I also tried the quoted-text replacement solutions from several related answers, without success.
Update after trying this command:
$ sed 'N;N;s/\n//g;s/,"qqqqzzzz"$//' quotetest.csv
1,"Bob Smith 531 Pennsylvania Avenue Washington, DC",3,4,"qqqqzzzz"
5,"Bob Smith 531 Pennsylvania Avenue Washington, DC",6,7

Using GNU awk:
$ awk 'BEGIN{RS=",\"qqqqzzzz\" ?\r?\n"}{$1=$1}1' file
1,"Bob Smith 531 Pennsylvania Avenue Washington, DC",3,4
5,"Bob Smith 531 Pennsylvania Avenue Washington, DC",6,7
Tested with dos and unix line endings. The key was to use the identifier and its related extra characters (comma, conditional space, and line-ending characters) as the record separator (RS); the subtlety was that there was a space after the first identifier but not after the second.
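If stepping outside sed/awk is an option, Python's csv module parses the embedded newlines inside quoted fields natively, so the qqqqzzzz sentinel isn't even needed. A minimal sketch, with the sample rows inlined in place of reading all.csv:

```python
import csv
import io

raw = '''1,"Bob Smith
531 Pennsylvania Avenue
Washington, DC",3,4,"qqqqzzzz"
5,"Bob Smith
531 Pennsylvania Avenue
Washington, DC",6,7,"qqqqzzzz"
'''

rows = []
for row in csv.reader(io.StringIO(raw)):
    row = row[:-1]                      # drop the trailing qqqqzzzz marker column
    row[1] = " ".join(row[1].split())   # collapse the embedded newlines to spaces
    rows.append(row)

out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n").writerows(rows)
print(out.getvalue())
```

The writer re-quotes the address field automatically because it contains a comma, which reproduces the expected output exactly.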


Regular Expressions Python replacing couple names

I would like to find and replace expressions like "John and Jane Doe" with "John Doe and Jane Doe"
For a sample expression:
regextest = 'Heather Robinson, Jane and John Smith, Kiwan and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown'
I can find the expression and replace it with a fixed string but I am not able to replace it with a modification of the original text.
re.sub(r'[a-zA-Z]+\s*and\s*[a-zA-Z]+.[^,]*',"kittens" ,regextest)
Output: 'Heather Robinson, kittens, kittens, Jimmy Nichols, Melanie Carbone, and Nancy Brown'
I think that instead of a string ("kittens") we can pass a function that makes the change, but I am unable to write that function; I get errors with the code below.
def re_couple_name_and(m):
    return f'*{m.group(0).split()[0] + m.group(0).split()[-1:] + m.group(0).split()[1:]}'
re.sub(r'[a-zA-Z]+\s*and\s*[a-zA-Z]+.[^,]*',re_couple_name_and ,regextest)
IIUC, one way using capture groups:
def re_couple_name_and(m):
    family_name = m.group(3).split(" ", 1)[1]
    return "%s %s" % (m.group(1), family_name) + m.group(2) + m.group(3)
re.sub(r'([a-zA-Z]+)(\s*and\s*)([a-zA-Z]+.[^,]*)',re_couple_name_and ,regextest)
Output:
'Heather Robinson, Jane Smith and John Smith, Kiwan Brady John and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown'
You can use the regex below to capture the items to be interchanged, then use re.sub() to construct the new string.
(\w+)( +and +)(\w+)( +[^,]*)
Example
import re
text="Heather Robinson, Jane and John Smith, Kiwan and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown"
print(re.sub(r"(\w+)( +and +)(\w+)( +[^,]*)",r"\1\4\2\3\4",text))
Output
Heather Robinson, Jane Smith and John Smith, Kiwan Brady John and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown
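A variant of the same substitution using named groups, which can make the replacement template easier to read. This is a sketch of the same idea, not taken from the answers above:

```python
import re

text = ("Heather Robinson, Jane and John Smith, Kiwan and Nichols Brady John, "
        "Jimmy Nichols, Melanie Carbone, and Nancy Brown")

# Same four groups as the numbered version, but named for readability.
pattern = r"(?P<first>\w+)(?P<conj> +and +)(?P<second>\w+)(?P<family> +[^,]*)"
result = re.sub(pattern, r"\g<first>\g<family>\g<conj>\g<second>\g<family>", text)
print(result)
```

Note that "Melanie Carbone, and Nancy Brown" is untouched because the comma before "and" prevents the `\w+ +and` part from matching.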

Using Spark to merge two or more files content and manipulate the content

Can I use Spark to do the following?
I have three files to merge and change the contents:
First File called column_header.tsv with this content:
first_name last_name address zip_code browser_type
Second file called data_file.tsv with this content:
John Doe 111 New Drive, Ca 11111 34
Mary Doe 133 Creator Blvd, NY 44499 40
Mike Coder 13 Jumping Street UT 66499 28
Third file called browser_type.tsv with content:
34 Chrome
40 Safari
28 FireFox
After Spark processes the above, the final_output.tsv file should have these contents:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Is this doable using Spark? I will also consider sed or awk if those tools can do it. I know the above is possible with Python, but I would prefer using Spark for the data manipulation. Any suggestions? Thanks in advance.
Here it is in awk, just in case. Notice the file order:
$ awk 'NR==FNR{ a[$1]=$2;next }{ $NF=($NF in a?a[$NF]:$NF) }1' file3 file1 file2
Output:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Explained:
NR==FNR {                         # process the browser_type file (the first file read)
  a[$1]=$2                        # remember the browser name, keyed by its code
  next                            # skip to the next record
}
{                                 # process the other two files
  $NF=($NF in a ? a[$NF] : $NF)   # replace the last field with the browser name from a
}
1                                 # 1 is always true: implicit print
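The same build-a-lookup-then-rewrite-the-last-field logic can be sketched in plain Python, with the three files' contents inlined for illustration:

```python
# Mirror of the awk approach: a dict maps browser codes to names,
# then the last field of each data row is replaced via that dict.
header = "first_name last_name address zip_code browser_type"
data_rows = [
    "John Doe 111 New Drive, Ca 11111 34",
    "Mary Doe 133 Creator Blvd, NY 44499 40",
    "Mike Coder 13 Jumping Street UT 66499 28",
]
browser_rows = ["34 Chrome", "40 Safari", "28 FireFox"]

lookup = dict(line.split() for line in browser_rows)  # {"34": "Chrome", ...}

out_lines = [header]
for line in data_rows:
    fields = line.split(" ")
    fields[-1] = lookup.get(fields[-1], fields[-1])   # replace code if known
    out_lines.append(" ".join(fields))
print("\n".join(out_lines))
```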
It is possible. Read the header first:
with open("column_header.tsv") as fr:
    columns = fr.readline().split()
Read data_file.tsv:
users = spark.read.option("delimiter", "\t").csv("data_file.tsv").toDF(*columns)
Read browser_type.tsv:
browsers = spark.read.option("delimiter", "\t").csv("browser_type.tsv") \
    .toDF("browser_type", "browser_name")
Join:
users.join(browsers, "browser_type", "left").write.csv(path)

How to make each nth line a column using awk?

I have a single column text file looking like this:
John
Doe
Male
1984
Marie
Parker
Female
1989
And I would like to convert it to look like this:
John Doe Male 1984
Marie Parker Female 1989
I've tried using awk and modulo but I cannot manage to find a working solution.
$ pr -4at file
John Doe Male 1984
Marie Parker Female 1989
or, with a space separator to match your format:
$ pr -4ats' ' file
John Doe Male 1984
Marie Parker Female 1989
and of course with awk:
$ awk 'ORS=NR%4?FS:RS' file
John Doe Male 1984
Marie Parker Female 1989
or with paste:
$ paste -d' ' - - - - < file
John Doe Male 1984
Marie Parker Female 1989
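The same every-4-lines grouping is easy to sketch in Python too, should the file need further processing afterwards (input lines inlined here):

```python
lines = ["John", "Doe", "Male", "1984", "Marie", "Parker", "Female", "1989"]

# Group every 4 consecutive lines into one space-joined row.
rows = [" ".join(lines[i:i + 4]) for i in range(0, len(lines), 4)]
print("\n".join(rows))
```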

use sed to return only last line that contains specific string

All help would be appreciated as I have tried much googling and have drawn a blank :)
I am new to sed and do not know what the command I need might be.
I have a file containing numerous lines eg
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
The file is plain text and is not formatted (i.e. not CSV or anything like that).
I want to search for a list of specific strings, e.g. John Smith, Mike Smith, Jim Smith, and return only the last line in the file for each string found (all other matching lines would be deleted).
(I do not necessarily need every unique entry; i.e. Jane Smith may or may not be needed.)
It is important that the original order of the lines found is maintained in the output.
So the result would be :
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
There are about 100 specific search strings.
Thank you :)
Assuming sample.txt contains the data you provided:
$ cat sample.txt
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
For this sample data, the following script works just fine:
$ cut -f1,2 -d' ' sample.txt | sort | uniq | while read s; do tac sample.txt | grep -m1 -n -e "$s" ; done | sort -n -r -t':' | cut -f2 -d':'
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
Here is the breakdown of the script:
First, generate all the unique search strings (first name + last name in this case).
Next, find the last occurrence of each string: reverse the file with tac and take the first occurrence, printing the line number along with the match.
Finally, sort the output in reverse line-number order to restore the original order, then strip the line numbers (we don't need them).
You didn't say what format the list of names is in; I assume it is comma-separated, the same as written in the question: e.g. John Smith, Mike Smith, Jim Smith.
From your description, you want the whole line for each string found, not only columns 1 and 2.
Based on those two points:
awk -v list="John Smith, Mike Smith, Jim Smith" '
BEGIN{split(list,p,",\\s*")}
{
  for(i=1;i<=length(p);i++){
    if($0~p[i]){
      a[p[i]]=$0
      break
    }
  }
}
END{for(x in a)print a[x]}' file
You can fill the list with your strings, separated by commas. Output with your test data:
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Reverse the file first, e.g. something like this:
$ sed -n '/Mike/{p;q}' <(tac input.txt)
Mike Smith C2345613213
sed -n -e 's/.*/&³/
H
$ {x
s/\n/²/g
t a
:a
s/²\([a-zA-Z]* [a-zA-Z]* \)[^³]*³\(\(.*\)²\1\)/\2/
t a
s/²//g;s/³$//;s/³/\
/g
p
}' YourFile
Works with any name (names must not contain ² or ³). For compound names, change the pattern to use [-a-zA-Z].
In your sample, Jane Smith also appears at least once.
To restrict matching to your specific list, run a grep -f on the file first; it's faster and easier to maintain without changing the sed code.
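For comparison, here is a Python sketch of the same last-occurrence filter. It keeps the original line order by remembering each match's line number (the name list and data are the question's samples, inlined):

```python
wanted = ["John Smith", "Mike Smith", "Jim Smith"]
lines = [
    "John Smith Aweqwewq321", "Mike Smith A2345613213", "Jim Smith Ad432143432",
    "Jane Smith A432434324", "John Smith Bweqwewq321", "Mike Smith B2345613213",
    "Jim Smith Bd432143432", "Jane Smith B432434324", "John Smith Cweqwewq321",
    "Mike Smith C2345613213", "Jim Smith Cd432143432", "Jane Smith C432434324",
]

last = {}  # name -> (line number, line) of its last occurrence
for i, line in enumerate(lines):
    for name in wanted:
        if name in line:
            last[name] = (i, line)

# Emit the surviving lines in their original file order.
result = [line for _, line in sorted(last.values())]
print("\n".join(result))
```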

grep and egrep selecting numbers

I have to find all entries of people whose zip code has “22” in it. NOTE: this should not include something like Mike Keneally whose street address includes “22”.
Here are some samples of data:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
Here is the command I have so far, but I don't know why it's not working.
egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
I guess this is your sample names.txt:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
Your regex matches any line satisfying these conditions:
[A-Z][A-Z] has two consecutive upper case characters
\s* zero or more space characters
[0-9]+ one or more digit character
[22] a character class matching either 2 or 2, i.e. a single 2 (not two in a row)
[0-9]+$ one or more digit characters at the end of the line
To get lines satisfying your actual requirement:
zip code has “22” in it
you can do it this way:
egrep '[A-Z]{2}\s+[0-9]*22' names.txt
If the zip code is always the last field, you can use this awk:
awk '$NF~/22/' file
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Ruth Underwood, Mariemont, OH 42522
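A Python sketch of the same idea, anchoring the 22 to the digits that follow the two-letter state code so a street number like 6221 cannot match (sample lines inlined):

```python
import re

lines = [
    "Bianca Jones, 612 Charles Blvd, Louisville, KY 40228",
    "Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125",
    "George Duke, San Diego, CA 93241",
    "Ruth Underwood, Mariemont, OH 42522",
]

# Match "22" only inside the digit run that ends the line after the state code.
zip22 = re.compile(r"[A-Z]{2}\s+\d*22\d*$")
matches = [line for line in lines if zip22.search(line)]
print("\n".join(matches))
```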
