grep and egrep selecting numbers - linux

I have to find all entries of people whose zip code has “22” in it. NOTE: this should not include something like Mike Keneally whose street address includes “22”.
Here are some samples of data:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
Here is the command I have so far, but I don't know why it's not working.
egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt

I guess this is your sample names.txt:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
Your command matches any line satisfying all of these conditions:
.* any run of characters (greedy)
[A-Z][A-Z] two consecutive uppercase letters
\s* zero or more whitespace characters
[0-9]+ one or more digits
[22] a character class matching either 2 or 2, i.e. a single 2, not the sequence 22
[0-9]+$ one or more digits at the end of the line
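You can see the problem by running this on the sample: because [22] needs only a single 2 with at least one digit on each side, every zip code in the sample satisfies the pattern, so the whole file comes back:
$ egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522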
To get lines satisfying your requirement (zip code has “22” in it), you can do it this way:
egrep '[A-Z]{2}\s+[0-9]*22' names.txt
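Against the sample data, this keeps exactly the two lines whose zip contains 22:
$ egrep '[A-Z]{2}\s+[0-9]*22' names.txt
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Ruth Underwood, Mariemont, OH 42522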

If the zip code is always the last field, you can use this awk, where $NF is the last whitespace-separated field:
awk '$NF~/22/' file
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Ruth Underwood, Mariemont, OH 42522

Related

Removing multi-line string from a field in a file

I have a CSV file, shown below, which is sent by a source system; they have no processing capability on their end except to add columns:
1,"Bob Smith
531 Pennsylvania Avenue
Washington, DC",3,4,"qqqqzzzz"
5,"Bob Smith
531 Pennsylvania Avenue
Washington, DC",6,7,"qqqqzzzz"
Expected output:
1,"Bob Smith 531 Pennsylvania Avenue Washington, DC",3,4
5,"Bob Smith 531 Pennsylvania Avenue Washington, DC",6,7
I have tried the following approach:
Requested the source system to add an identifier, "qqqqzzzz", at the end of each line
Replaced all the newlines with spaces, and then replaced every qqqqzzzz with a newline again
But the final replacement of qqqqzzzz leaves the surrounding quotes behind, which breaks the record onto the next line, as below:
1,"Bob Smith 531 Pennsylvania Avenue Washington, DC",3,4,""
5,"Bob Smith
sed '/^$/d' all.csv|tr '\n' ' '|sed 's/qqqqzzzz/\n/g' >results.csv
I also tried the quoted-text replacement solutions from several related questions.
Update, after trying this command:
$ sed 'N;N;s/\n//g;s/,"qqqqzzzz"$//' quotetest.csv
1,"Bob Smith 531 Pennsylvania Avenue Washington, DC",3,4,"qqqqzzzz"
5,"Bob Smith 531 Pennsylvania Avenue Washington, DC",6,7
Using GNU awk:
$ awk 'BEGIN{RS=",\"qqqqzzzz\" ?\r?\n"}{$1=$1}1' file
1,"Bob Smith 531 Pennsylvania Avenue Washington, DC",3,4
5,"Bob Smith 531 Pennsylvania Avenue Washington, DC",6,7
Tested with DOS and Unix line endings. The key was to use the identifier plus its surrounding extra characters (comma, optional space, and line-ending characters) as the record separator (RS); the catch to spot was that there is a space after the first identifier but not after the second.
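For reference, here is the same one-liner spread out with comments (a regular-expression RS is a GNU awk feature):
gawk 'BEGIN { RS=",\"qqqqzzzz\" ?\r?\n" }  # a record ends at the identifier, optional space, optional CR, newline
      { $1=$1 }                            # rebuilding the record replaces its embedded newlines with single spaces
      1' file                              # print each cleaned record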

Using Spark to merge two or more files content and manipulate the content

Can I use Spark to do the following?
I have three files to merge and change the contents:
First File called column_header.tsv with this content:
first_name last_name address zip_code browser_type
Second file called data_file.tsv with this content:
John Doe 111 New Drive, Ca 11111 34
Mary Doe 133 Creator Blvd, NY 44499 40
Mike Coder 13 Jumping Street UT 66499 28
Third file called browser_type.tsv with content:
34 Chrome
40 Safari
28 FireFox
The final_output.tsv file, after Spark processes the above, should have this content:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Is this doable using Spark? I would also consider sed or awk if it is possible with those tools. I know the above is possible with plain Python, but I would prefer to use Spark for the data manipulation. Any suggestions? Thanks in advance.
Here it is in awk, just in case. Notice the file order:
$ awk 'NR==FNR{ a[$1]=$2;next }{ $NF=($NF in a?a[$NF]:$NF) }1' file3 file1 file2
Output:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Explained:
NR==FNR {                             # process the browser_type file (the first file argument)
    a[$1]=$2                          # remember the browser name ($2), keyed by its id ($1)
    next                              # skip to the next record
}
{                                     # process the other files
    $NF = ($NF in a ? a[$NF] : $NF)   # replace the last field with the browser name from a, if known
}
1                                     # shorthand for print
It is possible. Read the header:
with open("column_header.tsv") as fr:
    columns = fr.readline().split()
Read data_file.tsv:
users = spark.read.option("delimiter", "\t").csv("data_file.tsv").toDF(*columns)
Read browser_type.tsv (also tab-separated):
browsers = spark.read.option("delimiter", "\t").csv("browser_type.tsv") \
    .toDF("browser_type", "browser_name")
Join:
users.join(browsers, "browser_type", "left").write.csv(path)
Note that to reproduce final_output.tsv exactly, you would also need to select browser_name in place of the code and write with the tab delimiter and header options.

Replace single character in string with new line

I'm trying to edit some text files by replacing single characters with a new line.
Before:
Bill Fred Jack L Max Sam
After:
Bill Fred Jack
Max Sam
This is the closest I have gotten, but the single character is not always going to be 'L'.
cat File.txt | tr "L" "\n"
You can try this:
sed "s/\s\S\s/\n/g" File.txt
Explanation:
You want to convert any single-character word into a line break \n:
\s : Space and tab
\S : Non-whitespace characters
bash-4.3$ cat file.txt
1.Bill Fred Jack L Max Sam
2.Bill Fred Jack M Max Sam
3.Bill Fred Jack N Max Sam
bash-4.3$ sed 's/\s[A-Z]\s/\n/g' file.txt
1.Bill Fred Jack
Max Sam
2.Bill Fred Jack
Max Sam
3.Bill Fred Jack
Max Sam
sed "s/[[:blank:]][[:alpha:]][[:blank:]]/\
/g" YourFile
This is a POSIX version, assuming the single letter is inside the string and not at an edge (the start or end of the line).
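If the single-letter word can also land at the start or end of a line, a field-based rewrite avoids needing blanks on both sides. Here is a sketch in awk (an alternative approach, not the answer above; it treats any one-character field as the break):
awk '{
    line = ""
    for (i = 1; i <= NF; i++) {
        if (length($i) == 1)                              # a one-letter word becomes a line break
            line = line "\n"
        else                                              # otherwise append the word, space-separated
            line = line (line == "" || line ~ /\n$/ ? "" : " ") $i
    }
    print line
}' File.txt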

use sed to return only last line that contains specific string

All help would be appreciated as I have tried much googling and have drawn a blank :)
I am new to sed and do not know what the command I need might be.
I have a file containing numerous lines, e.g.:
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
The file is plain text and is not formatted (i.e. not CSV or anything like that).
I want to search for a list of specific strings, e.g. John Smith, Mike Smith, Jim Smith, and return only the last entry in the file for each string found (all other matching lines would be deleted).
(I do not necessarily need every unique entry; i.e. Jane Smith may or may not be needed.)
It is important that the original order of the lines found is maintained in the output.
So the result would be :
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
There are about 100 specific search strings.
Thank you :)
Assuming sample.txt contains the data you provided:
$ cat sample.txt
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
For this sample data, the following script works just fine:
$ cut -f1,2 -d' ' sample.txt | sort | uniq | while read s; do tac sample.txt | grep -m1 -n -e "$s" ; done | sort -n -r -t':' | cut -f2 -d':'
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
Here is the breakdown of the script:
First, generate all the unique strings (first name + last name in this case).
Then find the last occurrence of each string: reverse the file with tac and take the first match, printing the line number along with it.
Finally, sort by line number in reverse order to restore the original file order, then strip the line numbers (we don't need them anymore).
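For example, the middle stage emits each match prefixed with its line number counted from the bottom of the file:
$ tac sample.txt | grep -m1 -n -e 'John Smith'
4:John Smith Cweqwewq321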
You didn't say what format your list is in; I assume it is comma-separated, as written in the question: e.g. John Smith, Mike Smith, Jim Smith.
From your description, you want the whole line for each string found, not only columns 1 and 2.
Given those two points, I have:
awk -v list="John Smith, Mike Smith, Jim Smith" '
BEGIN { split(list, p, ",\\s*") }
{
    for (i = 1; i <= length(p); i++) {
        if ($0 ~ p[i]) {
            a[p[i]] = $0
            break
        }
    }
}
END { for (x in a) print a[x] }' file
You can fill list with your own strings, separated by commas. Output with your test data:
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Reverse the file first, e.g. something like this:
$ sed -n '/Mike/{p;q}' <(tac input.txt)
Mike Smith C2345613213
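For several names you could wrap this in a loop; a sketch (note the results come out in list order, not the original file order the question asks for):
for name in 'John Smith' 'Mike Smith' 'Jim Smith'; do
    sed -n "/$name/{p;q}" <(tac input.txt)   # last occurrence of each name
done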
sed -n -e 's/.*/&³/
H
$ {x
s/\n/²/g
t a
:a
s/²\([a-zA-Z]* [a-zA-Z]* \)[^³]*³\(\(.*\)²\1\)/\2/
t a
s/²//g;s/³$//;s/³/\
/g
p
}' YourFile
This works with any name (as long as it contains neither ² nor ³). For compound names, change the pattern to [-a-zA-Z].
In your file there is also Jane Smith, which appears at least once, so it will be kept as well.
To restrict the output to a specific list of names, run a grep -f beforehand; it's faster and easier to maintain, with no change to the code.
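A sketch of that pre-filter, assuming a hypothetical names.txt holding one name per line:
$ cat names.txt
John Smith
Mike Smith
Jim Smith
$ grep -f names.txt input.txt   # keeps only lines matching some listed name; pipe this into the sed script above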

Cut command not extracting fields properly by default delimiter

I have a text file in which I must cut the fields 3,4,5 and 8:
219 432 4567 Harrison Joel M 4540 Accountant 09-12-1985
219 433 4587 Mitchell Barbara C 4541 Admin Asst 12-14-1995
219 433 3589 Olson Timothy H 4544 Supervisor 06-30-1983
219 433 4591 Moore Sarah H 4500 Dept Manager 08-01-1978
219 431 4527 Polk John S 4520 Accountant 09-22-1998
219 432 4567 Harrison Joel M 4540 Accountant 09-12-1985
219 432 1557 Harrison James M 4544 Supervisor 01-07-2000
Since the default delimiter is tab, the command to extract the fields would be:
cut -f 3,4,5,8 filename
The thing is that the output is the same as the original file content. What is happening here? Why doesn't this work?
Your file doesn't actually contain any tab characters.
By default, cut prints any lines which don't contain the delimiter, unless you specify the -s option.
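One quick way to verify is cat -A (GNU cat), which renders tabs as ^I and marks each line end with $; with your first sample line it would look something like this, with no ^I in sight:
$ cat -A filename | head -1
219 432 4567 Harrison Joel M 4540 Accountant 09-12-1985$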
Since your records are aligned on character boundaries rather than tab-separated, you should use the -c option to specify which columns to cut. For example:
cut -c 9-12,14-25,43-57 file
