Linux: How can I print the line number and column number when values do not match for tab-separated files using AWK

Compare file1 with file2 and print the line number of each differing record along with the column number(s) in file2 where the difference occurs.
In file1:
User_ID First_name Last_name Address Postal_code
User_1 fistname Lastname 35, Park Lake, California 32068
user2 Johnny Depp 32, Park Lake, California
user3 Tom Cruise 5322 Otter Lane Middleberge 32907
user4 Leonardo DiCaprio Half-Way Pond, Georgetown 1230
user5 Sylvester Stallone 6762,33 Ave N,St. Petersburg 33710
user6 Srleo Stallone 6762,33 Ave N,St. Petersburg 33700
and
In file2:
User_ID First_name Last_name Address Postal_code
User_1 fistname Lastname 35, Park Lake, California 32068
user2 Johnny Depp 32, NEW Street, California 96206
user30 Tom Cruise 5322 Otter Lane Middleberge 32907
user4 Leonardo DiCaprio' Half-Way Pond, Georgetown 00000
user5 Sylvester Stallone 6762,33 Ave N,St. Petersburg 33710
user7 Nicolas Cage 55010
user6 Srleo Stallone 6762,33 Ave N,St. Petersburg 33700
Expected result: for each differing record in file2, print the line number followed by the column number(s) where the values do not match.
Line No. 2 COLUMN NO 4,5
Line No. 3 COLUMN NO 1
Line No. 4 COLUMN NO 3,5
Line No. 5 COLUMN NO 5
Line No. 6 COLUMN NO 1,2,3,4,5
Note: the files to be compared are several GB in size; they are tab separated and have more than 400 columns.
I am using:
awk 'NR==FNR{Arr[$0]++;next}!($0 in Arr){print FNR}' file1 file2
However, it gives me only the line numbers, not the column numbers.

This should do it; however, it doesn't match your expected result:
paste f1 f2 |
awk -F'\t' 'NR==1 {n=NF/2}                   # half the pasted fields come from each file
  {for(i=1;i<=n;i++)
     if($i!=$(i+n))                          # field i of f1 vs the matching field of f2
       {c=c s i; s=","}                      # collect the mismatched column numbers
   if(c)
     {print "Line No. " NR-1 " COLUMN NO " c;
      c=""; s=""}}'
Line No. 2 COLUMN NO 4,5
Line No. 3 COLUMN NO 1
Line No. 4 COLUMN NO 3,5
Line No. 6 COLUMN NO 1,2,3,4,5
Line No. 7 COLUMN NO 1,2,3,4,5
Either you're not comparing line by line, or the expected result follows some other unwritten spec.
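A strictly line-by-line sketch without paste (an alternative, not the answer above) might look like the following; note that it caches all of file1 in memory, which matters at multi-GB sizes with 400+ columns:
awk -F'\t' '
NR==FNR { for (i = 1; i <= NF; i++) f1[FNR, i] = $i; nf[FNR] = NF; next }   # cache every field of file1
{
    c = ""; s = ""
    n = (NF > nf[FNR]) ? NF : nf[FNR]                # compare up to the wider of the two records
    for (i = 1; i <= n; i++)
        if ($i != f1[FNR, i]) { c = c s i; s = "," }
    if (c) print "Line No. " FNR-1 " COLUMN NO " c
}' file1 file2
FNR-1 keeps the same numbering convention as above, where the header row is not counted as a data row.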

Related

AWK count occurrences of column A based on uniqueness of column B

I have a file with several columns, and I want to count, for each value in one column, how many distinct values of a second column occur with it.
For example:
column 10 column 15
-------------------------------
orange New York
green New York
blue New York
gold New York
orange Amsterdam
blue New York
green New York
orange Sweden
blue Tokyo
gold New York
I am fairly new to using commands like awk and am looking to gain more practical knowledge.
I've tried some different variations of
awk '{A[$10 OFS $15]++} END {for (k in A) print k, A[k]}' myfile
but, not quite understanding the code, I didn't get the output I expected.
I am expecting output of
orange 3
blue 2
green 1
gold 1
With GNU awk. I assume tab is your field separator.
awk '{count[$10 FS $15]++}END{for(j in count) print j}' FS='\t' file | cut -d $'\t' -f 1 | sort | uniq -c | sort -nr
Output:
3 orange
2 blue
1 green
1 gold
I suppose it could be more elegant.
Single GNU awk invocation version (works with non-GNU awk too, it just doesn't sort the output):
$ gawk 'BEGIN{ OFS=FS="\t" }
NR>1 { names[$2,$1]=$1 }                        # one entry per distinct (place, color) pair
END { for (n in names) colors[names[n]]++;      # count distinct pairs per color
PROCINFO["sorted_in"] = "#val_num_desc";        # GNU awk: iterate results by descending count
for (c in colors) print c, colors[c] }' input.tsv
orange 3
blue 2
gold 1
green 1
Adjust column numbers as needed to match real data.
Bonus solution that uses sqlite3:
$ sqlite3 -batch -noheader <<EOF
.mode tabs
.import input.tsv names
SELECT "column 10", count(DISTINCT "column 15") AS total
FROM names
GROUP BY "column 10"
ORDER BY total DESC, "column 10";
EOF
orange 3
blue 2
gold 1
green 1
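Applying the same distinct-pair idea to the column numbers from the question ($10 and $15) in a single plain awk pass, a minimal sketch, assuming tab-separated input in a file named myfile:
awk 'BEGIN { FS = "\t" }
     { pairs[$10 FS $15]++ }                        # one entry per distinct (column-10, column-15) pair
     END {
         for (p in pairs) { split(p, k, FS); count[k[1]]++ }
         for (c in count) print c, count[c]         # output order is arbitrary; pipe through sort if needed
     }' myfile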

File Manipulation in UNIX - Creating duplicate records based on the count from a column and manipulating only one column

I have a space-delimited file.txt with n columns. The 3rd column is comma delimited, and I want to create duplicate records in the same file.txt based on how many comma-separated values column 3 contains, splitting that column so each duplicate record carries one of the values.
--file.txt
I have 0,1,2,3 apples
I have 2,3 bananas
I have 3 oranges
--desiredoutput.txt
I have 0 apples
I have 1 apples
I have 2 apples
I have 3 apples
I have 2 bananas
I have 3 bananas
I have 3 oranges
Awk solution:
awk '$3~/,/{
n=split($3, a, ","); f=$1 OFS $2               # split the comma list; keep fields 1 and 2
sub(/^[^ ]+ +[^ ]+ +[^ ]+/,"")                 # strip the first three fields from the record
for (i=1; i<=n; i++) print f, a[i] $0          # one output line per comma-separated value
next
}1' file
The output:
I have 0 apples
I have 1 apples
I have 2 apples
I have 3 apples
I have 2 bananas
I have 3 bananas
I have 3 oranges
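A variant parameterized on which column holds the comma list (col=3 here, matching the question) is sketched below; it handles records with no comma the same way:
awk -v col=3 '{
    n = split($col, parts, ",")              # n is 1 when there is no comma
    for (j = 1; j <= n; j++) {
        $col = parts[j]                      # substitute one value at a time and print
        print
    }
}' file.txt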

How to delete lines in file1 based on column match with file2

I have 2 files; file1 and file2. File1 has many lines/rows and columns. File2 has just one column, with several lines/rows. All of the strings in file2 are found in file1. I want to create a new file (file3), such that the lines in file1 that contain any of the strings in file2 are deleted.
For example,
File1:
Sally ate 083 popcorn
Rick has 241 cars
John won 505 dollars
Bruce knows 121 people
File2:
083
121
Desired file3:
Rick has 241 cars
John won 505 dollars
Note that I do not want to enter the strings in file 2 into a command manually (the actual files are much larger than in the example).
Thanks!
awk approach:
awk 'BEGIN{p=""}FNR==NR{if(!/^$/){p=p$0"|"} next} $0!~substr(p, 1, length(p)-1)' file2 file1 > file3
p="" the variable treated as pattern containing all column values from file2
FNR==NR ensures that the next expression is performed for the first input file i.e. file2
if(!/^$/){p=p$0"|"} means: if it's not an empty line !/^$/ (as it could be according to your input) concatenate pattern parts with | so it eventually will look like 083|121|
$0!~substr(p, 1, length(p)-1) - checks if a line from the second input file(file1) is not matched with pattern(i.e. file2 column values)
The file3 contents:
Rick has 241 cars
John won 505 dollars
grep suits your purpose better than a line editor:
grep -v -f File2 File1 >File3
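Adding -F (fixed strings) and -w (whole words) guards against values in File2 being read as regular expressions or matching inside longer strings; a possible stricter variant:
grep -vwFf File2 File1 > File3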
Try this -
#cat f1
Sally ate 083 popcorn
Rick has 241 cars
John won 505 dollars
Bruce knows 121 people
#cat f2
083
121
#grep -vwf f2 f1
Rick has 241 cars
John won 505 dollars

Using Spark to merge two or more files content and manipulate the content

Can I use Spark to do the following?
I have three files to merge and change the contents:
First File called column_header.tsv with this content:
first_name last_name address zip_code browser_type
Second file called data_file.tsv with this content:
John Doe 111 New Drive, Ca 11111 34
Mary Doe 133 Creator Blvd, NY 44499 40
Mike Coder 13 Jumping Street UT 66499 28
Third file called browser_type.tsv with content:
34 Chrome
40 Safari
28 FireFox
The final_output.tsv file after Spark processing the above should have this contents:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Is this doable using Spark? I will also consider sed or awk if it is possible with those tools. I know the above is possible with Python, but I would prefer using Spark to do the data manipulation and changes. Any suggestions? Thanks in advance.
Here it is in awk, just in case. Notice the file order:
$ awk 'NR==FNR{ a[$1]=$2;next }{ $NF=($NF in a?a[$NF]:$NF) }1' file3 file1 file2
Output:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Explained:
NR==FNR { # process browser_type file
a[$1]=$2 # remember remember the second of ...
next } # skip to the next record
{ # process the other files
$NF=( $NF in a ? a[$NF] : $NF) } # replace last field with browser from a
1 # implicit print
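Since the files are described as tab separated and the address field contains spaces, a tab-aware variant of the same idea (a sketch using the file names from the question) would set the separators explicitly:
awk 'BEGIN { FS = OFS = "\t" }
     NR==FNR { a[$1] = $2; next }                 # browser_type.tsv: map code -> browser name
     { $NF = ($NF in a ? a[$NF] : $NF) }          # swap the code in the last field
     1' browser_type.tsv column_header.tsv data_file.tsv > final_output.tsv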
It is possible. Read header:
with open("column_header.tsv") as fr:
    columns = fr.readline().split()
Read data_file.tsv:
users = spark.read.option("delimiter", "\t").csv("data_file.tsv").toDF(*columns)
Read browser_type.tsv:
browsers = spark.read.csv("browser_type.tsv") \
    .toDF("browser_type", "browser_name")
Join:
users.join(browsers, "browser_type", "left").write.csv(path)

Compare one field, Remove duplicate if value of another field is greater

Trying to do this at the Linux command line: combine two files and compare values based on ID, keeping only the record whose Date value is newer/greater (edit: equal to or greater than). Because the ID 456604 is in both files, I want to keep only the one from File 2 with the newer date: "20111015 456604 tgf"
File 1
Date ID Note
20101009 456604 abc
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
File 2
Date ID Note
20111015 111111 abc
20111015 222222 abc
20111015 333333 xyz
20111015 456604 tgf
And then the output should have both files combined, but keeping only the second ID value, with the newer date. The order the rows are in does not matter; this is just an example of the output for the concept.
Output
Date ID Note
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
20111015 111111 abc
20111015 222222 abc
20111015 333333 xyz
20111015 456604 tgf
$ cat file1.txt file2.txt | sort -ru | awk '!($2 in seen) { print; seen[$2] }'
Date ID Note
20111015 456604 tgf
20111015 333333 xyz
20111015 222222 abc
20111015 111111 abc
20101009 666666 xyz
20101009 555555 abc
20101009 444444 abc
Sort the combined files by descending date and only print a line the first time you see an ID.
EDIT
More compact edition, thanks to Steve:
cat file1.txt file2.txt | sort -ru | awk '!seen[$2]++'
You didn't specify how you'd like to handle the case where the dates are also duplicated, or even if this case could exist. Therefore, I have assumed that by 'greater' you really mean 'greater than or equal to' (it also makes handling the header a tiny bit easier). If that's not the case, please edit your question.
awk code:
awk 'FNR==NR {
       a[$2] = $1
       b[$2] = $0
       next
     }
     a[$2] >= $1 {
       print b[$2]
       delete b[$2]
       next
     }
     1
     END {
       for (i in b) {
         print b[i]
       }
     }' file2 file1
Explanation:
Basically, we use an associative array, called a, to store the 'ID' and 'Date' as key and value, respectively. We also store the contents of file2 in memory using another associative array called b. When file1 is read, we test whether column two exists in our array a and whether that key's value is greater than or equal to column one. If it is, we print the corresponding line from array b, delete it from the array, and skip to the next line/record of input. The 1 on its own always evaluates to true, thereby enabling printing where the previous (two) conditions are not met. This has the effect of printing any unmatched records from file1. Finally, we print what's left in array b.
Results:
Date ID Note
20111015 456604 tgf
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
20111015 222222 abc
20111015 111111 abc
20111015 333333 xyz
Another awk way
awk 'NR==1;FNR>1{a[$2]=(a[$2]<$1&&b[$2]=$3)?$1:a[$2]}
END{for(i in a)print a[i],i,b[i]}' file file2
Compares the value in the array to the previously stored value to determine which is higher, and also stores the third field if the current record is higher.
Then prints out the stored date, key (field 2), and the value stored for field 3.
Or shorter
awk 'NR==1;FNR>1{(a[$2]<$1&&b[$2]=$0)&&a[$2]=$1}END{for(i in b)print b[i]}' file file2
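For completeness, a minimal sort-free sketch of the same idea, assuming the dates stay in YYYYMMDD form (so numeric comparison works) and that on equal dates the record seen last, i.e. from file2, should win:
awk 'FNR==1 { if (NR==1) print; next }                    # print the header once, skip the second one
     $1+0 >= date[$2]+0 { date[$2] = $1; line[$2] = $0 }  # keep the newest record per ID
     END { for (id in line) print line[id] }              # output order is arbitrary
' file1 file2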
