Using Spark to merge two or more files content and manipulate the content - apache-spark

Can I use Spark to do the following?
I have three files to merge and change the contents:
First File called column_header.tsv with this content:
first_name last_name address zip_code browser_type
Second file called data_file.tsv with this content:
John Doe 111 New Drive, Ca 11111 34
Mary Doe 133 Creator Blvd, NY 44499 40
Mike Coder 13 Jumping Street UT 66499 28
Third file called browser_type.tsv with content:
34 Chrome
40 Safari
28 FireFox
The final_output.tsv file, after Spark has processed the above, should have these contents:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Is this doable using Spark? I would also consider sed or awk if they can do it. I know the above is possible with Python, but I would prefer to use Spark for the data manipulation. Any suggestions? Thanks in advance.

Here it is in awk, just in case. Notice the file order:
$ awk 'NR==FNR{ a[$1]=$2;next }{ $NF=($NF in a?a[$NF]:$NF) }1' file3 file1 file2
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
NR==FNR { # process browser_type file
a[$1]=$2 # remember the browser name ($2) for each browser id ($1)
next } # skip to the next record
{ # process the other files
$NF=( $NF in a ? a[$NF] : $NF) } # replace last field with browser from a
1 # implicit print
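For comparison, the same lookup-and-replace idea can be sketched in plain Python. This is only an illustration of the logic, assuming whitespace-separated fields as in the awk version; the function name `replace_browser_ids` and the inline sample lists are made up for the example.

```python
def replace_browser_ids(browser_lines, data_lines):
    """Swap the last field of each data line for its browser name, if known."""
    # Build the id -> name lookup, like the NR==FNR block in the awk script.
    names = {}
    for line in browser_lines:
        parts = line.split()
        if len(parts) >= 2:
            names[parts[0]] = parts[1]

    result = []
    for line in data_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Keep the original value when the id is not in the lookup
        # (this is what leaves the header line untouched).
        fields[-1] = names.get(fields[-1], fields[-1])
        result.append(" ".join(fields))
    return result


browsers = ["34 Chrome", "40 Safari", "28 FireFox"]
rows = [
    "first_name last_name address zip_code browser_type",
    "John Doe 111 New Drive, Ca 11111 34",
]
for row in replace_browser_ids(browsers, rows):
    print(row)
```

Like the awk one-liner, this rejoins fields with single spaces, so it would not preserve tabs in a real TSV file.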

It is possible. Read the header:
with open("column_header.tsv") as fr:
    columns = fr.readline().split()
Read data_file.tsv (spark is assumed to be an existing SparkSession):
users = spark.read.option("delimiter", "\t").csv("data_file.tsv").toDF(*columns)
Read browser_type.tsv:
browsers = spark.read.option("delimiter", "\t").csv("browser_type.tsv") \
    .toDF("browser_type", "browser_name")
Join the two on browser_type and write the result:
users.join(browsers, "browser_type", "left").write.csv(path)


Reading from a file how to append strings to a list until a specific marker in python

I have a input text file:
This file contains information of students who are joining picnic:
### bob alice rhea john
mary alex roma
peter &##&
Now I would like to append the students' names to a list, starting when the ### marker is found and stopping when the &##& marker is found. The output should be (using Python):
bob alice rhea john mary alex roma peter
You can use this method
txt = """This file contains information of students who are joining picnic:
### bob alice rhea john
mary alex roma
peter &##&"""
or:
txt = open("./file.txt").read()
ls = txt.split("###")[1].split("&##&")[0].split()
print(ls)
This prints:
['bob', 'alice', 'rhea', 'john', 'mary', 'alex', 'roma', 'peter']
Using re.findall with re.sub:
import re
inp = """This file contains information of students who are joining picnic:
### bob alice rhea john
mary alex roma
peter &##&"""
output = re.sub(r'\s+', ' ', re.findall(r'###\s+(.*?)\s+&##&', inp, flags=re.DOTALL)[0])
print(output) # bob alice rhea john mary alex roma peter
If you want a list, then use:
output = re.split(r'\s+', re.findall(r'###\s+(.*?)\s+&##&', inp, flags=re.DOTALL)[0])
This prints:
['bob', 'alice', 'rhea', 'john', 'mary', 'alex', 'roma', 'peter']
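If the file is too large to read in one go, a line-by-line variant may be closer to what the question title describes. The following is a rough sketch, assuming both markers appear as separate whitespace-delimited tokens (as they do in the sample); the function name `names_between_markers` is made up for the example.

```python
def names_between_markers(lines, start="###", end="&##&"):
    """Collect the words that appear between the start and end markers."""
    names = []
    collecting = False
    for line in lines:
        for word in line.split():
            if word == start:
                collecting = True      # begin appending after this token
            elif word == end:
                return names           # stop at the end marker
            elif collecting:
                names.append(word)
    return names


sample = [
    "This file contains information of students who are joining picnic:",
    "### bob alice rhea john",
    "mary alex roma",
    "peter &##&",
]
students = names_between_markers(sample)
print(" ".join(students))
```

An open file object can be passed in place of the list, since the function only iterates over lines.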

Linux : How can I print line number and column number when values do not match for tab separated files using AWK in linux

Compare file1 with file2 and print the line number of each record that differs, together with the numbers of the columns whose values differ in file2.
In file1:
User_ID First_name Last_name Address Postal_code
User_1 fistname Lastname 35, Park Lake, California 32068
user2 Johnny Depp 32, Park Lake, California
user3 Tom Cruise 5322 Otter Lane Middleberge 32907
user4 Leonardo DiCaprio Half-Way Pond, Georgetown 1230
user5 Sylvester Stallone 6762,33 Ave N,St. Petersburg 33710
user6 Srleo Stallone 6762,33 Ave N,St. Petersburg 33700
In file2:
User_ID First_name Last_name Address Postal_code
User_1 fistname Lastname 35, Park Lake, California 32068
user2 Johnny Depp 32, NEW Street, California 96206
user30 Tom Cruise 5322 Otter Lane Middleberge 32907
user4 Leonardo DiCaprio' Half-Way Pond, Georgetown 00000
user5 Sylvester Stallone 6762,33 Ave N,St. Petersburg 33710
user7 Nicolas Cage 55010
user6 Srleo Stallone 6762,33 Ave N,St. Petersburg 33700
Expected result:
The differences in file2, given as the
line number followed by the column numbers (where the values do not match):
Line No. 2 COLUMN NO- 4,5
Line No. 3 COLUMN NO-1
Line No. 4 COLUMN NO 3,5
Line No. 5 COLUMN NO 5
Line No. 6 COLUMN NO 1,2,3,4,5
Note: The files to be compared are several GB in size, tab-separated, and have more than 400 columns.
I am using:
awk 'NR==FNR{Arr[$0]++;next}!($0 in Arr){print FNR}' file1 file2
However, it gives me only the line numbers and not the column numbers.
This should do it; however, the output doesn't match your expected result:
paste f1 f2 |
awk -F'\t' 'NR==1 {n=NF/2}
{c=s=""; for(i=1;i<=n;i++) if($i!=$(i+n)) {c=c s i; s=","}
if(c!="") print "Line No. " NR-1 " COLUMN NO " c}'
Line No. 2 COLUMN NO 4,5
Line No. 3 COLUMN NO 1
Line No. 4 COLUMN NO 3,5
Line No. 6 COLUMN NO 1,2,3,4,5
Line No. 7 COLUMN NO 1,2,3,4,5
Either you're not comparing line by line, or you're following some other unwritten spec.
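For reference, the same line-by-line comparison can be sketched in Python. This is an illustration rather than a tuned solution: it assumes tab-separated input, numbers the header as line 0 (matching NR-1 above), and treats a missing line or column as an empty value. The function name `differing_columns` and the small inline samples are made up for the example.

```python
from itertools import zip_longest


def differing_columns(lines1, lines2, sep="\t"):
    """Yield (line_number, [column_numbers]) for each pair of rows that differ."""
    pairs = zip_longest(lines1, lines2, fillvalue="")
    for line_no, (a, b) in enumerate(pairs):
        if a == b:
            continue  # fast path: identical lines need no column split
        cols_a = a.rstrip("\n").split(sep)
        cols_b = b.rstrip("\n").split(sep)
        diff = [
            i + 1  # 1-based column numbers
            for i, (x, y) in enumerate(zip_longest(cols_a, cols_b, fillvalue=""))
            if x != y
        ]
        if diff:
            yield line_no, diff


file1 = ["id\tname\tzip\n", "u1\tTom\t32907\n", "u2\tAnn\t\n"]
file2 = ["id\tname\tzip\n", "u10\tTom\t32907\n", "u2\tAnn\t96206\n", "u3\tBob\t1\n"]
report = list(differing_columns(file1, file2))
for line_no, cols in report:
    print("Line No.", line_no, "COLUMN NO", ",".join(map(str, cols)))
```

Because it only iterates, two open file objects can be passed directly, so multi-GB files are streamed rather than loaded into memory.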

use sed to return only last line that contains specific string

All help would be appreciated as I have tried much googling and have drawn a blank :)
I am new to sed and do not know what the command I need might be.
I have a file containing numerous lines eg
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
The file is plain text and is not formatted (ie not csv or anything like that )
I want to search for a list of specific strings, eg. John Smith, Mike Smith, Jim Smith and return only the last line entry in the file for each string found (all other lines found would be deleted).
(I do not necessarily need every unique entry, ie Jane Smith may or may not be needed)
It is important that the original order of the lines found is maintained in the output.
So the result would be :
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
I am new to sed and do not know what this command might be.
There are about 100 specific search strings.
Thank you :)
Assuming sample.txt contains the data you provided:
$ cat sample.txt
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
For this sample data, the following script works just fine:
$ cut -f1,2 -d' ' sample.txt | sort | uniq | while read s; do tac sample.txt | grep -m1 -n -e "$s" ; done | sort -n -r -t':' | cut -f2 -d':'
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
Here is the breakdown of the script:
First generate all the unique strings ( First Name, Last Name in this case )
Now find the last occurrence of each of these strings. To do this, we reverse the file and find the first occurrence. We also print the line number along with the output.
Now sort the output in reverse line-number order, and then remove the line numbers (we don't need them).
You didn't specify the format of the given list; I assume it is comma-separated, as you wrote in the question: e.g. John Smith, Mike Smith, Jim Smith
From your description, you want the whole last line for each string found, not only columns 1 and 2.
From the above two points, I have:
awk -v list="John Smith, Mike Smith, Jim Smith" 'BEGIN{split(list,p,",\\s*")}
{for(i in p) if($0~p[i]) a[p[i]]=$0
}END{for(x in a)print a[x]}' file
You can fill the list with your strings, separated by commas. The output with your test data:
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Reverse the file first, e.g. something like this:
$ sed -n '/Mike/{p;q}' <(tac input.txt)
Mike Smith C2345613213
sed -n -e 's/.*/&³/
$ {x
t a
s/²\([a-zA-Z]* [a-zA-Z]* \)[^³]*³\(\(.*\)²\1\)/\2/
t a
}' YourFile
This works with any name, as long as it contains neither ² nor ³. For compound names, change the pattern to [-a-z-A-Z].
In your list, Jane Smith also appears at least once.
For the specific list, run grep -f first; it is faster and easier to maintain, since the code does not need to change.
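If a scripting language is an option, the requirement (last line per search string, original order kept) can also be sketched in Python. This is a minimal illustration, assuming plain substring matching; the function name `last_matches` is made up for the example.

```python
def last_matches(lines, wanted):
    """Return the last line containing each wanted string, in file order."""
    last = {}  # wanted string -> (line index, line text)
    for index, line in enumerate(lines):
        for name in wanted:
            if name in line:
                last[name] = (index, line)  # later matches overwrite earlier ones
    # Sort by line index to restore file order; the set drops a line
    # that happens to be the last match for more than one string.
    return [line for _, line in sorted(set(last.values()))]


sample = [
    "John Smith Aweqwewq321",
    "Mike Smith A2345613213",
    "Jim Smith Ad432143432",
    "Jane Smith A432434324",
    "John Smith Cweqwewq321",
    "Mike Smith C2345613213",
    "Jim Smith Cd432143432",
    "Jane Smith C432434324",
]
found = last_matches(sample, ["John Smith", "Mike Smith", "Jim Smith"])
print("\n".join(found))
```

With about 100 search strings this does one substring test per string per line, which should be acceptable for moderately sized files.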

grep and egrep selecting numbers

I have to find all entries of people whose zip code has “22” in it. NOTE: this should not include something like Mike Keneally whose street address includes “22”.
Here are some samples of data:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
Here is the command I have so far, but I don't know why it's not working.
egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
I guess this is your sample names.txt:
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125
George Duke, San Diego, CA 93241
Ruth Underwood, Mariemont, OH 42522
egrep '.*[A-Z][A-Z]\s*[0-9]+[22][0-9]+$' names.txt
Your pattern matches any line that satisfies all of these conditions:
[A-Z][A-Z] two consecutive upper-case characters
\s* zero or more whitespace characters
[0-9]+ one or more digits
[22] a single character that is either 2 or 2, i.e. just one 2 (a bracket expression matches one character)
[0-9]+$ one or more digits at the end of the line
To get the lines that satisfy your requirement:
zip code has “22” in it
you can do it this way:
egrep '[A-Z]{2}\s+[0-9]*22' names.txt
If zip code is always the last field, you can use this awk
awk '$NF~/22/' file
Bianca Jones, 612 Charles Blvd, Louisville, KY 40228
Ruth Underwood, Mariemont, OH 42522
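The same idea can be checked quickly in Python. This is a small sketch, assuming the zip code is the last field and is preceded by a two-letter state code, as in the sample data; the names `ZIP_WITH_22` and `has_22_in_zip` are made up for the example.

```python
import re

# Two capital letters (the state), whitespace, then a run of digits
# containing "22" that extends to the end of the line.
ZIP_WITH_22 = re.compile(r"[A-Z]{2}\s+\d*22\d*$")


def has_22_in_zip(line):
    """True if the trailing zip code of the line contains '22'."""
    return ZIP_WITH_22.search(line.strip()) is not None


people = [
    "Bianca Jones, 612 Charles Blvd, Louisville, KY 40228",
    "Frank V. Zappa, 6221 Hot Rats Blvd, Los Angeles, CA 90125",
    "George Duke, San Diego, CA 93241",
    "Ruth Underwood, Mariemont, OH 42522",
]
matches = [p for p in people if has_22_in_zip(p)]
print("\n".join(matches))
```

Anchoring on the state code and the end of the line is what keeps a street number such as 6221 from matching.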

Combine results of column one Then sum column 2 to list total for each entry in column one

I am a bit of a Bash newbie, so please bear with me here.
I have a text file dumped by another piece of software (that I have no control over), listing each user with the number of times they accessed a certain resource. It looks like this:
Jim 109
Bob 94
John 92
Sean 91
Mark 85
Richard 84
Jim 79
Bob 70
John 67
Sean 62
Mark 59
Richard 58
Jim 57
Bob 55
John 49
Sean 48
Mark 46
My goal here is to get an output like this.
Jim [Total for Jim]
Bob [Total for Bob]
John [Total for John]
And so on.
The names change each time I run the query in the software, so a static search on each name, piped through wc, does not help.
This sounds like a job for awk :) Pipe the output of your program to the following awk script:
your_program | awk '{a[$1]+=$2}END{for(name in a)print name " " a[name]}'
Sean 201
Bob 219
Jim 245
Mark 190
Richard 142
John 208
The awk script itself can be explained better in this format:
# executed on each line
{
    # 'a' is an array. It will be initialized
    # as an empty array by awk on its first usage
    # '$1' contains the first column - the name
    # '$2' contains the second column - the amount
    # on every line the total score of 'name'
    # will be incremented by 'amount'
    a[$1]+=$2
}
# executed at the end of input
END{
    # print every name and its score
    for(name in a)print name " " a[name]
}
Note, to get the output sorted by score, you can add another pipe to sort -rn -k2, which sorts numerically by the second column in reverse order:
your_program | awk '{a[$1]+=$2}END{for(n in a)print n" "a[n]}' | sort -rn -k2
Jim 245
Bob 219
John 208
Sean 201
Mark 190
Richard 142
Pure Bash:
declare -A result # an associative array
while read name value; do
    ((result[$name]+=value)) # add this line's value to the running total
done < "$infile"
for name in "${!result[@]}"; do
    printf "%-10s%10d\n" "$name" "${result[$name]}"
done
If the first 'done' has no redirection from an input file
this script can be used with a pipe:
your_program | ./
and sorting the output
your_program | ./ | sort
The output:
Bob 219
Richard 142
Jim 245
Mark 190
John 208
Sean 201
GNU datamash:
datamash -W -s -g1 sum 2 < input.txt
Bob 219
Jim 245
John 208
Mark 190
Richard 142
Sean 201
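For completeness, the same group-and-sum can be sketched in Python. This is only an illustration, assuming each line holds exactly a name and an integer separated by whitespace; the function name `total_by_name` and the shortened sample are made up for the example.

```python
from collections import defaultdict


def total_by_name(lines):
    """Sum the second column for each distinct value in the first column."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        name, count = parts
        totals[name] += int(count)
    return dict(totals)


sample = ["Jim 109", "Bob 94", "Jim 79", "Bob 70", "Jim 57"]
totals = total_by_name(sample)
# Print the highest total first.
for name, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(name, total)
```

Reading from a pipe would work the same way by passing sys.stdin instead of the list.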
