I have a file of about 1 million records. I need to extract the records that have different FName and LName values for the same ID.
Input File
Col1,Col2,Col3,Col4,ID,FName,Col5,LName,Col6,Col7,Col8
AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
AP,abc@gmail.com,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,abc@gmail.com,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,abc@gmail.com,xyz1,abc1,567,Alex,,Smith,phn2,fax2,url1
AP,abc@gmail.com,xyz1,abc1,789,Allen,,Prack,phn2,fax2,url1
The result that I want to see
AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
Can any awk or sed command or script help? Thanks
You may try this awk:
awk 'BEGIN {FS=OFS=","} {id = $5; name = $6 FS $8} id in map && map[id] != name {if (!done[id]++) print rec[id]; print} {map[id] = name; rec[id] = $0}' file
AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
Or a bit more readable:
awk 'BEGIN {
    FS = OFS = ","
}
{
    id = $5
    # name holds fname and lname joined by the field separator
    name = $6 FS $8
}
# if this id is already stored as a key in map, check whether
# the stored name differs from the current name
id in map && map[id] != name {
    # print the previous record if not already printed
    if (!done[id]++)
        print rec[id]
    # print the current record
    print
}
{
    # store the name keyed by id in the map array,
    # and the full record keyed by id in the rec array
    map[id] = name
    rec[id] = $0
}' file
Using GNU awk for arrays of arrays:
$ awk -F, '
    { vals[$5][$6 FS $8] = $0 }
    END {
        for ( id in vals ) {
            if ( length(vals[id]) > 1 ) {
                for ( name in vals[id] ) {
                    print vals[id][name]
                }
            }
        }
    }
' file
AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
Or, if your input file is sorted by ID as shown in your sample input, then with any awk and without storing the input file in memory:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR > 1 {
    id = $5
    name = $6 FS $8
    if ( id == prevId ) {
        if ( name != prevName ) {
            if ( firstRec != "" ) {
                print firstRec
                firstRec = ""
            }
            print
        }
    }
    else {
        firstRec = $0
    }
    prevId = id
    prevName = name
}
$ awk -f tst.awk file
AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
This one-liner should do the job:
awk -F "," '!a[$5] {a[$5]=$0} a[$5]!=$0{print a[$5]; print $0; a[$5]=$0}' input_file.txt
Output:
AP,abc@gmail.com,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,abc2@gmail.com,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
Note that for each ID the entire lines are compared, not just the FName and LName fields.
awk -F, -v id="123" '$5 == id { map[NR]=$0 } END { for (i in map) print map[i] }' file
With awk, set the field separator to a comma and pass in a variable called id. When the fifth (ID) field equals the passed id, add the line to an array called map, indexed by the record number, with the line as the value. At the end, loop through the array and print the values.
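For instance (a sketch against the sample data above), pulling a different ID:

awk -F, -v id="345" '$5 == id { map[NR]=$0 } END { for (i in map) print map[i] }' file

This would print the two (duplicate) records for ID 345.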
I have two files. The first column is common between both files. I would like to merge them, copying the second column of file1 onto every line of file2 whose first column matches (as the third column of the output).
file1
412234;mark
413234;raja
file2
412234;value1
412234;value2
412234;value3
412234;value4
413234;value1
413234;value2
413234;value3
Output file
412234;value1;mark
412234;value2;mark
412234;value3;mark
412234;value4;mark
413234;value1;raja
413234;value2;raja
413234;value3;raja
Try this:
awk -F';' 'BEGIN{FS=OFS=";"} FNR==NR{a[$1]=$2; next} ($1 in a){print $1, $2, a[$1]}' file1 file2
Explanation:
-F';' means that awk will use ; as the input field separator;
BEGIN{FS=OFS=";"} also sets the output field separator, used by the print function;
awk parses all files sequentially, and the condition:
FNR==NR
is true only while parsing the first file.
While parsing file1, it fills an array a, using the first field as the index and the second field as the value;
a is expected to be
a[412234] = mark
a[413234] = raja
($1 in a) is the condition to meet: it is true when the first field of a file2 line exists as an index in array a.
If true, then execute:
print $1, $2, a[$1]
which prints the fields from file2 joined by OFS, followed by the value of a saved from file1.
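To see the FNR==NR mechanics for yourself, a quick check (works with any two small files) is to print both counters side by side:

awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' file1 file2

FNR resets to 1 at the start of every input file while NR keeps counting across files, so the two are equal only within the first file.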
----- EDIT
In case file1 contains multiple lines with the same index, you need to save all of the values, and then scan the whole array for multiple matches on file2:
awk -F';' '
function vlen(arr,   n, i) { n = 0; for (i in arr) n++; return n }                              # helper: array length
function contained(val, vect,   x) { for (x in vect) if (vect[x] == val) return 1; return 0 }   # helper: membership test
BEGIN { FS = OFS = ";" }
FNR == NR { n = vlen(a); a[n] = $1; b[n] = $2; next }   # file1: save all keys and values in parallel arrays
{                                                       # file2: scan the whole array a looking for matches
    if (contained($1, a)) { for (i in a) if (a[i] == $1) print $1, $2, b[i] }
    else print $1, $2
}' file1 file2
Here we define the vlen and contained helper functions; the extra parameters (n, i, x) keep those variables local to each function.
Would you try the following:
awk '
BEGIN { FS = OFS = ";" }
NR == FNR {
    c[$1]++
    a[$1, c[$1]] = $2
    next
}
{
    if (c[$1]) {
        for (i = 1; i <= c[$1]; i++) {
            $3 = a[$1, i]; print
        }
    } else {
        print
    }
}' file1 file2
Result with the file1 and file2 provided in the OP's last comment:
412234;value1;mark
412234;value1;raja
412234;value2;mark
412234;value2;raja
413234;value1
413234;value2
If the index in the 1st column (such as 412234) appears more than once in file1, we need to preserve every existing value in the 2nd column (such as mark) without overwriting. That is why the array c is introduced: it counts the occurrences of each index. Note that the order of the result differs from the OP's expected output; I hope that is acceptable.
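If the order matters, piping the result through sort should restore it; a sketch, where merge.awk is a hypothetical file holding the script above:

awk -f merge.awk file1 file2 | sort -t ';' -k1,1n

Here -t sets the field separator and -k1,1n sorts numerically on the first field.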
I have two text files:
File-1:
PRKCZ
TNFRSF14
PRDM16
MTHFR
File-2 (contains two tab-delimited columns):
atherosclerosis GRAB1|PRKCZ|TTN
cardiomyopathy,hypercholesterolemia PRKCZ|MTHFR
Pulmonary arterial hypertension,arrhythmia PRDM16|APOE|GATA4
Now, for each name in File-1, also print the corresponding disease names from File-2 where it matches. So the output would be:
PRKCZ atherosclerosis,cardiomyopathy,hypercholesterolemia
PRDM16 Pulmonary arterial hypertension,arrhythmia
MTHFR cardiomyopathy,hypercholesterolemia
I have tried the code:
$ awk '{k=$1}
NR==FNR{if(NR>1)a[k]=","b"="$1";else{a[k]="";b=$1}next}
k in a{print $0a[k]}' File1 File2
but I did not obtain the desired output. Can anybody help correct it, please?
You can do this with the following awk script:
script.awk
BEGIN { FS="[\t]" }
NR==FNR { split($2, tmp, "|")
for( ind in tmp ) {
name = tmp[ ind ]
if (name in disease) { disease[ name ] = disease[ name ] "," $1 }
else { disease[ name ] = $1 }
}
next
}
{ if( $1 in disease) print $1, disease[ $1 ] }
Use it like this: awk -f script.awk File-2 File-1 (note that File-2 comes first).
Explanation:
the BEGIN block sets up tab as the field separator.
the NR == FNR block is executed for the first argument (File-2): it reads each disease with its names, splits the names, and then appends the disease to a dictionary under each of the names
the last block is executed only for the second argument (File-1), because of the next in the previous block: it outputs the diseases that are stored under the name taken from $1
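As a small illustration of the split() call that the NR == FNR block relies on (plain awk, independent of this script):

awk 'BEGIN { n = split("GRAB1|PRKCZ|TTN", tmp, "|"); for (i = 1; i <= n; i++) print i, tmp[i] }'

split returns the number of pieces (3 here) and fills tmp with them; those pieces are exactly what the script uses as dictionary keys.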
I want to transform a file from this format
1;a;34;34;a
1;a;34;23;d
1;a;34;23;v
1;a;4;2;r
1;a;3;2;d
2;f;54;3;f
2;f;34;23;e
2;f;23;5;d
2;f;23;23;g
3;t;26;67;t
3;t;34;45;v
3;t;25;34;h
3;t;34;23;u
3;t;34;34;z
to this format
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g;;;
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
These are csv-style files, so it should work with awk or sed ... but I have failed so far. If the first value is the same, I want to append the last three values to the first line, and this should run until the last entry in the file.
Here is some awk code, but it does not work:
#!/usr/bin/awk -f
BEGIN{ FS = " *; *"}
{ ORS = "\;" }
{
x = $1
print $0
}
{ if (x == $1)
print $3, $4, $5
else
print "\n"
}
END{
print "\n"
}
$ cat tst.awk
BEGIN { FS=OFS=";" }
{ curr = $1 FS $2 }
curr == prev {
    sub(/^[^;]*;[^;]*/, "")
    printf "%s", $0
    next
}
{
    printf "%s%s", (NR > 1 ? ORS : ""), $0
    prev = curr
}
END { print "" }
$ awk -f tst.awk file
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
If I understand you correctly that you want to build a line from fields 3-5 of all lines with the same first two fields (preceded by those two fields), then
awk -F \; 'key != $1 FS $2 { if(NR != 1) print line; key = $1 FS $2; line = key } { line = line FS $3 FS $4 FS $5 } END { print line }' filename
That is
key != $1 FS $2 { # if the key (first two fields) changed
if(NR != 1) print line; # print the line (except at the very
# beginning, to not get an empty line there)
key = $1 FS $2 # remember the new key
line = key # and start building the next line
}
{
line = line FS $3 FS $4 FS $5 # take the value fields from each line
}
END { # and at the very end,
print line # print the last line (that the block above
} # cannot handle)
You got good answers in awk. Here is one in perl:
perl -F';' -lane'
    $key = join ";", @F[0..1];         # Establish your key
    $seen{$key}++ or push @rec, $key;  # Remember the order
    push @{ $h{$key} }, @F[2..$#F]     # Build your data structure
  }{
    $, = ";";                          # Set the output list separator
    print $_, @{ $h{$_} } for @rec' file  # Print as per order
This is going to seem a lot more complicated than the other answers, but it's adding a few things:
It computes the maximum number of fields from all built up lines
Appends any missing fields as blanks to the end of the built up lines
The POSIX awk on a Mac doesn't maintain the order of array elements, even when the keys are numbered, when using the for (key in array) syntax. To maintain the output order, you can keep track of it as I've done, or pipe to sort afterwards.
Having matching numbers of fields in the output appears to be a requirement, per the specified output. Without knowing what that count should be, this awk script first loads all the lines, computes the maximum number of fields in an output line, and then outputs the lines, with any adjustments, in order.
#!/usr/bin/awk -f
BEGIN {FS=OFS=";"}
{
key = $1
# create an order array for the mac's version of awk
if( key != last_key ) {
order[++key_cnt] = key
last_key = key
}
val = a[key]
# build up an output line in array a for the given key
start = (val=="" ? $1 OFS $2 : val)
a[key] = start OFS $3 OFS $4 OFS $5
# count number of fields for each built up output line
nf_a[key] += 3
}
END {
# compute the max number of fields per any built up output line
for(k in nf_a) {
nf_max = (nf_a[k]>nf_max ? nf_a[k] : nf_max)
}
for(i=1; i<=key_cnt; i++) {
key = order[i]
# compute the number of blank flds necessary
nf_pad = nf_max - nf_a[key]
blank_flds = nf_pad!=0 ? sprintf( "%*s", nf_pad, OFS ) : ""
gsub( / /, OFS, blank_flds )
# output lines along with appended blank fields in order
print a[key] blank_flds
}
}
If the desired number of fields in the output lines is known ahead of time, simply appending the blank fields on key switch without all these arrays would work and make a simpler script.
I get the following output:
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g;;;
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
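To illustrate that simpler variant: a minimal sketch, assuming the final field count is known to be 17 (2 key fields plus 5 groups of 3, as in the sample); the count and the flush helper are my additions, not part of the script above:

#!/usr/bin/awk -f
BEGIN { FS = OFS = ";"; total = 17 }   # assumed final field count
$1 != key {                            # key switch: flush the previous line
    if (NR > 1) flush()
    key = $1
    line = $1 OFS $2
    nf = 2
}
{ line = line OFS $3 OFS $4 OFS $5; nf += 3 }   # append this line's three values
END { flush() }
function flush() {
    while (nf++ < total) line = line OFS   # pad short lines with empty fields
    print line
}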
I have a file something like this:
1111,K1
2222,L2
3333,LT50
4444,K2
1111,LT50
5555,IA
6666,NA
1111,NA
2222,LT10
Output that I need:
1111,K1,LT50,NA
2222,L2,LT10
3333,LT50
4444,K2
5555,IA
6666,NA
The number in the 1st column may repeat any number of times, but the output I need must be sorted and unique.
awk -F"," '{a[$1]=a[$1]FS$2}END{for(i in a) print i,a[i]}' file | sort
If you have a big file, you can flush the collected items every N lines, e.g. 50000. This is safe as long as the lines for each key are grouped together, since a key that straddles a flush boundary would be printed in two pieces:
BEGIN { FS = "," }
{ a[$1] = a[$1] FS $2 }
NR % 50000 == 0 {
    for (i in a) print i a[i]
    delete a   # delete the array so it won't take up memory
}
END {
    for (i in a) print i a[i]
}
Here is an understandable attempt using a non-standard tool, the SQLite shell. The database is in-memory.
echo 'create table tmp (a int, b text);
.separator ,
.import file.txt tmp
.output out.txt
SELECT a, group_concat(b) FROM tmp GROUP BY a ORDER BY a ASC;
.output stdout
.q' | sqlite3
This is a solution in Python. The script reads data from stdin.
#!/usr/bin/env python
import sys

d = {}
for line in sys.stdin:
    pair = line.strip().split(',')
    d.setdefault(pair[0], []).append(pair[1])
for key in sorted(d):
    print("%s,%s" % (key, ','.join(d[key])))
Here's one in Perl, but it isn't going to be particularly efficient:
#!/usr/bin/perl -w
use strict;
my %lines;
while (<>) {
chomp;
my ($key, $value) = split /,/;
$lines{$key} .= "," if $lines{$key};
$lines{$key} .= $value;
}
for my $key (sort keys %lines) {
    print "$key,$lines{$key}\n";
}
Use like this:
$ ./command <file >newfile
You will likely have better luck with a multiple-pass solution, though. I don't really have time to write that for you, but here's an outline (with a sketch after the list):
Grab and remove the first line from the file.
Parse through the rest of the file, concatenating any matching line and removing it.
At the end of the file, output your new long line.
If the file still has content, loop back to 1.
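A minimal sketch of that outline in shell plus awk; work and work.tmp are hypothetical scratch files, and each pass scans the remaining data twice, so this is for illustration only:

#!/bin/sh
cp file work                              # scratch copy we can shrink
while [ -s work ]; do
    key=$(head -n 1 work | cut -d, -f1)   # grab the first remaining key
    # emit one output line: the key, then every matching value
    awk -F, -v k="$key" '$1 == k { printf ",%s", $2 } END { print "" }' work |
        sed "s/^/$key/"
    # drop the lines just handled and loop on what is left
    awk -F, -v k="$key" '$1 != k' work > work.tmp && mv work.tmp work
done
rm -f work

Piping the result through sort would give the sorted output asked for above.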