script to compare two large 900 x 900 comma delimited files - linux

I have tried awk but havent been able to perform a diff for every cell 1 at a time on both files. I have tried awk but havent been able to perform a diff for every cell 1 at a time on both files. I have tried awk but havent been able to perform a diff for every cell 1 at a time on both files.

If you just want a rough answer, possibly the simplest thing is to do something like:
tr , \\n file1 > /tmp/output
tr , \\n file2 | diff - /tmp/output
That will convert each file to one column and run diff. You can compute the cells that differ from the line numbers of the output.

Simplest way with awk without accounting for newlines inside fields,quoted commas etc.
Print the same
awk 'BEGIN{RS=",|"RS}a[FNR]==$0;{a[NR]=$0}' file{,2}
Print differences
awk 'BEGIN{RS=",|"RS}FNR!=NR&&a[FNR]!=$0;{a[NR]=$0}' file{,2}
Print which are the same different
awk 'BEGIN{RS=",|"RS}FNR!=NR{print "cell"FNR (a[FNR]==$0?"":" not")" the same"}{a[NR]=$0}' file{,2}
Input
file
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
file2
1,2,3,4,5
2,7,1,9,12
1,1,1,1,12
Output
same
1
2
3
4
5
7
9
Different
2
1
12
1
1
1
1
12
Same different
cell1 the same
cell2 the same
cell3 the same
cell4 the same
cell5 the same
cell6 not the same
cell7 the same
cell8 not the same
cell9 the same
cell10 not the same
cell11 not the same
cell12 not the same
cell13 not the same
cell14 not the same
cell15 not the same

Related

Bash script: filter columns based on a character

My text file should be of two columns separated by a tab-space (represented by \t) as shown below. However, there are a few corrupted values where column 1 has two values separated by a space (represented by \s).
A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5
My objective is to create a table as follows:
A\t1
B\t2
C\t3
D\t4
E\t5
i.e. discard the 2nd value that is present after the space in column 1 for eg. in C\sx\t3 I can discard the x that is present after space and store the columns as C\t3.
I have tried a couple of things but with no luck.
I tried to cut the cols based on \t into independent columns and then cut the first column based on \s and join them again. However, it did not work.
Here is the snippet:
col1=(cut -d$'\t' -f1 $file | cut -d' ' -f1)
col2=(cut -d$'\t' -f1 $file)
myArr=()
for((idx=0;idx<${#col1[#]};idx++));do
echo "#{col1[$idx]} #{col2[$idx]}"
# I will append to myArr here
done
The output is appending the list of col2 to the col1 as A B C D E 1 2 3 4 5. And on top of this, my file is very huge i.e. 5,300,000 rows so I would like to avoid looping over all the records and appending them one by one.
Any advice is very much appreciated.
Thank you. :)
And another sed solution:
Search and replace any literal space followed by any number of non-TAB-characters with nothing.
sed -E 's/ [^\t]+//' file
A 1
B 2
C 3
D 4
E 5
If there could be more than one actual space in there just make it 's/ +[^\t]+//' ...
Assuming that when you say a space you mean a blank character then using any awk:
awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
Solution using Perl regular expressions (for me they are easier than seds, and more portable as there are few versions of sed)
$ cat ls
A 1
B 2
C x 3
D 4
E y 5
$ cat ls |perl -pe 's/^(\S+).*\t(\S+)/$1 $2/g'
A 1
B 2
C 3
D 4
E 5
This code gets all non-empty characters from the front and all non-empty characters from after \t
Try
sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
The ANSI-C Quoting ($'...') feature of Bash is used to make tab characters visible as \t.
take advantage of FS and OFS and let them do all the hard work for you
{m,g}awk NF=NF FS='[ \t].*[ \t]' OFS='\t'
A 1
B 2
C 3
D 4
E 5
if there's a chance of leading edge or trailing edge spaces and tabs, then perhaps
mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' OFS='\t' RS='[\r]?\n'

Extract part of Column Awk

I am trying to count the number of occurrences per second in a log file for a term searched. I've been using AWK and have the issue of the time stamp being locate in a column with additional information. Is it possible to get the number of occurrences per second by only looking for the time pattern 00:00:00 - 24:00:00?
Data example:
[01/May/2018:23:59:59.532
[01/May/2018:23:59:59.848
[01/May/2018:23:59:59.851
[01/May/2018:23:59:59.911
[01/May/2018:23:59:59.923
[01/May/2018:23:59:59.986
[01/May/2018:23:59:59.988
[01/May/2018:23:59:59.756
[01/May/2018:23:59:59.786
[01/May/2018:23:59:59.883
So far I can extract the data easily enough using:
awk '/00:00:00/,/24:00:00/{if(/search_term/) a[$4]++} END{for(k in a) print k " - " a[k]}' file.log |sort
This will return:
[02/May/2018:10:40:05.903 - 1
[02/May/2018:10:40:05.949 - 1
[02/May/2018:10:40:05.975 - 1
[02/May/2018:10:40:05.982 - 2
[02/May/2018:10:40:06.022 - 1
[02/May/2018:10:40:06.051 - 1
[02/May/2018:10:40:06.054 - 1
[02/May/2018:10:40:06.086 - 1
[02/May/2018:10:40:06.094 - 1
[02/May/2018:10:40:06.126 - 1
What I'm aiming for is more:
10:40:05 - 5
10:40:06 - 6
No idea if I'm even thinking about this correctly. New to AWK in general.
Use colon and dot as the field separators, and we have hours in col2, minutes in col3 and seconds in col4
awk -F'[:.]' '
{count[$2 ":" $3 ":" $4]++}
END {for (time in count) print time " - " count[time]}
' file
10:40:05 - 4
10:40:06 - 6
Output will not necessarily be sorted. If you're using GNU awk, use
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (time in count)
print time " - " count[time]
}
(reference),
or simply pipe the output to | sort
One thing you can do is this:
awk 'BEGIN{FIELDWIDTHS = "1 11 1 12"} {print $4}' datetimes
Specify the field widths and then this will give you your time, for example. If you don't care about milliseconds, then "1 11 1 8 4"
You can use substr for the line as index of an array. for example, you have this file
cat 1.txt
[01/May/2018:23:59:59.532
[01/May/2018:01:59:59.848
[01/May/2018:02:59:59.851
[01/May/2018:02:59:59.911
[01/May/2018:02:59:59.923
[01/May/2018:02:00:59.986
you can use an awk command like this
cat 1.txt | awk '{a[substr($0,index($0,":")+1,8)]++} END{for(i in a) print i" - "a[i]}'
where substr($0,index($0,":")+1,8) cuts 8 chars from the occurrence of the first ":", use this as index of the array

Extract substring from first column

I have a large text file with 2 columns. The first column is large and complicated, but contains a name="..." portion. The second column is just a number.
How can I produce a text file such that the first column contains ONLY the name, but the second column stays the same and shows the number? Basically, I want to extract a substring from the first column only AND have the 2nd column stay unaltered.
Sample data:
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
So the result file would be something like this
app-name_01 0
myapp-02 1
app_name_public 3
...
If your actual Input_file is same as the shown sample then following code may help you in same.
awk '{sub(/.*name=\"/,"");sub(/\".* /," ")} 1' Input_file
Output will be as follows.
app-name_01 0
myapp-02 1
app_name_public 3
Using GNU awk
$ awk 'match($0,/name="([^"]*)"/,a){print a[1],$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Non-Gawk
awk 'match($0,/name="([^"]*)"/){t=substr($0,RSTART,RLENGTH);gsub(/name=|"/,"",t);print t,$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Input:
$ cat infile
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
Here's a sed solution:
sed -r 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/g' Input_file
Explanation:
With the parantheses your store in groups what's inbetween.
First group is everything after name=" till the first ". [^"] means "not a double-quote".
Second group is simply "one or more numbers at the end of the line preceeded with a space".

Mapping lines to columns in *nix

I have a text file that was created when someone pasted from Excel into a text-only email message. There were originally five columns.
Column header 1
Column header 2
...
Column header 5
Row 1, column 1
Row 1, column 2
etc
Some of the data is single-word, some has spaces. What's the best way to get this data into column-formatted text with unix utils?
Edit: I'm looking for the following output:
Column header 1 Column header 2 ... Column header 5
Row 1 column 1 Row 1 column 2 ...
...
I was able to achieve this output by manually converting the data to CSV in vim by adding a comma to the end of each line, then manually joining each set of 5 lines with J. Then I ran the csv through column -ts, to get the desired output. But there's got to be a better way next time this comes up.
Perhaps a perl-one-liner ain't "the best" way, but it should work:
perl -ne 'BEGIN{$fields_per_line=5; $field_seperator="\t"; \
$line_break="\n"} \
chomp; \
print $_, \
$. % $fields_per_row ? $field_seperator : $line_break; \
END{print $line_break}' INFILE > OUTFILE.CSV
Just substitute the "5", "\t" (tabspace), "\n" (newline) as needed.
You would have to use a script that uses readline and counter. When the program reaches that line you want, use cut command and space as a dilimeter to get the word you want
counter=0
lineNumber=3
while read line
do
counter += 1
if lineNumber==counter
do
echo $line | cut -d" " -f 4
done
fi

write a two column file from two files using awk

I have two files of one column each
1
2
3
and
4
5
6
I want to write a unique file with both elements as
1 4
2 5
3 6
It should be really simple I think with awk.
You could try paste -d ' ' <file1> <file2>. (Without -d ' ' the delimiter would be tab.)
paste works okay for the example given but it doesn't handle variable length lines very well. A nice little-know core-util pr provides a more flexible solution:
$ pr -mtw 4 file1 file2
1 4
2 5
3 6
A variable length example:
$ pr -mtw 22 file1 file2
10 4
200 5
300,000,00 6
And since you asked about awk here is one way:
$ awk '{a[FNR]=a[FNR]$0" "}END{for(i=1;i<=length(a);i++)print a[i]}' file1 file2
1 4
2 5
3 6
Using awk
awk 'NR==FNR { a[FNR]=$0;next } { print a[FNR],$0 }' file{1,2}
Explanation:
NR==FNR will ensure our first action statement runs for first file only.
a[FNR]=$0 with this we are inserting first file into array a indexed at line number
Once first file is complete we move to second action
Here we print each line of first file along with second file

Resources