How to aggregate or merge files - Linux

Can anyone help me merge different files on a common column? Please =(
file1.txt
ID Kg Year
3454 1000 2010
3454 1200 2011
3323 1150 2009
2332 1000 2011
3454 1156 201
file2.txt
ID Place
3454 A1
3323 A2
2332 A6
5555 A9
file 1+2
ID Kg Year Place
3454 1000 2010 A1
3454 1200 2011 A1
3323 1150 2009 A2
2332 1000 2011 A6
3454 1156 2013 A1
So the second file should be joined onto the first. As you can see, ID 5555 from file 2 is simply not used.
How can I do this on Linux, or ...?

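If you start with sorted files, the tool is join. In your case, you can sort on the fly with process substitution.
join <(sort file1.txt) <(sort file2.txt)
The headers will be joined as well, but the header line won't appear at the top. In typical locales, piping the result to sort -r moves it to the top (and reverses the order of the data rows).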
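If the header should stay on top without reversing the data rows, one option is to join the header lines and the data rows separately. This is only a minimal sketch: it assumes the first line of each file is a header and the ID is the first column, and the scratch directory and the .header, .sorted and merged.txt names are made up for illustration.

```shell
# Work in a scratch directory and recreate the sample files from the question.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'ID Kg Year' '3454 1000 2010' '3454 1200 2011' \
    '3323 1150 2009' '2332 1000 2011' '3454 1156 201' > file1.txt
printf '%s\n' 'ID Place' '3454 A1' '3323 A2' '2332 A6' '5555 A9' > file2.txt

# Sort only the data rows (line 2 onward) on the ID column.
tail -n +2 file1.txt | sort -k1,1 > file1.sorted
tail -n +2 file2.txt | sort -k1,1 > file2.sorted

# Join the two header lines on their own ("ID" pairs with "ID"),
# then append the joined data rows.
head -n 1 file1.txt > file1.header
head -n 1 file2.txt > file2.header
{
    join file1.header file2.header
    join file1.sorted file2.sorted
} > merged.txt
cat merged.txt
```

The rows come out in ID order rather than in the original order, and ID 5555 is dropped because join prints only paired lines by default.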

If you don't care about maintaining the order of lines, use the join command from karakfa's answer.
To keep the original order of lines, use awk:
awk '
NR==FNR {place[$1]=$2; next}
$1 in place {print $0, place[$1]}
' file2.txt file1.txt | column -t
ID Kg Year Place
3454 1000 2010 A1
3454 1200 2011 A1
3323 1150 2009 A2
2332 1000 2011 A6
3454 1156 201 A1

Related

Trying to sort two different columns of a text file, (one asc, one desc) in the same awk script

I have tried to do it separately, and I am getting the right result, but I need help to combine the two.
This is the csv file:
maruti swift 2007 50000 5
honda city 2005 60000 3
maruti dezire 2009 3100 6
chevy beat 2005 33000 2
honda city 2010 33000 6
chevy tavera 1999 10000 4
toyota corolla 1995 95000 2
maruti swift 2009 4100 5
maruti esteem 1997 98000 1
ford ikon 1995 80000 1
honda accord 2000 60000 2
fiat punto 2007 45000 3
I am using this script to sort by first field:
BEGIN { print "========Sorted Cars by Maker========" }
{ arr[$1] = $0 }
END {
    PROCINFO["sorted_in"] = "@val_str_desc"
    for (i in arr) print arr[i]
}
I also want to run a sort on the year($3) ascending in the same script.
I have tried many ways but to no avail.
A little help to do that would be appreciated..
One in GNU awk:
$ gawk '
{
    a[$1][$3][++c[$1,$3]] = $0                # maker -> year -> input order
}
END {
    PROCINFO["sorted_in"] = "@ind_str_desc"   # makers, descending
    for (i in a) {
        PROCINFO["sorted_in"] = "@ind_str_asc"        # years, ascending
        for (j in a[i]) {
            PROCINFO["sorted_in"] = "@ind_num_asc"    # duplicates, input order
            for (k in a[i][j])
                print a[i][j][k]
        }
    }
}' file
Output:
toyota corolla 1995 95000 2
maruti esteem 1997 98000 1
maruti swift 2007 50000 5
...
Assumptions:
individual fields do not contain white space
primary sort: 1st field in descending order
secondary sort: 3rd field in ascending order
no additional sorting requirements provided in case there's a duplicate of 1st + 3rd fields (eg, maruti + 2009) so we'll maintain the input ordering
One idea using sort (-s makes the sort stable, so lines that tie on both keys keep their input order):
sort -s -k1,1r -k3,3n auto.dat
Another idea using GNU awk (for arrays of arrays and PROCINFO["sorted_in"]):
awk '
    { cars[$1][$3][n++] = $0 }    # "n" used to distinguish between duplicates of $1+$3
END { PROCINFO["sorted_in"] = "@ind_str_desc"
      for (make in cars) {
          PROCINFO["sorted_in"] = "@ind_num_asc"
          for (yr in cars[make])
              for (n in cars[make][yr])
                  print cars[make][yr][n]
      }
    }
' auto.dat
Both of these generate:
toyota corolla 1995 95000 2
maruti esteem 1997 98000 1
maruti swift 2007 50000 5
maruti dezire 2009 3100 6
maruti swift 2009 4100 5
honda accord 2000 60000 2
honda city 2005 60000 3
honda city 2010 33000 6
ford ikon 1995 80000 1
fiat punto 2007 45000 3
chevy tavera 1999 10000 4
chevy beat 2005 33000 2
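Where GNU awk is not available, the same ordering can probably be reached with a decorate/sort/undecorate pipeline. The sketch below works under the same assumptions as listed above. It prefixes each line with its line number so that duplicates of maker + year keep their input order even without a stable sort. The scratch directory and the sorted.txt name are only for illustration.

```shell
cd "$(mktemp -d)" || exit 1
cat > auto.dat <<'EOF'
maruti swift 2007 50000 5
honda city 2005 60000 3
maruti dezire 2009 3100 6
chevy beat 2005 33000 2
honda city 2010 33000 6
chevy tavera 1999 10000 4
toyota corolla 1995 95000 2
maruti swift 2009 4100 5
maruti esteem 1997 98000 1
ford ikon 1995 80000 1
honda accord 2000 60000 2
fiat punto 2007 45000 3
EOF

# 1. Decorate: prefix each line with its line number (the tie-breaker).
# 2. Sort: maker (now field 2) descending, year (now field 4) ascending,
#    then the original line number ascending.
# 3. Undecorate: strip the line-number prefix again.
awk '{ print NR, $0 }' auto.dat |
    sort -t ' ' -k2,2r -k4,4n -k1,1n |
    cut -d ' ' -f 2- > sorted.txt
cat sorted.txt
```

Because the line number is an explicit sort key, the result does not depend on sort being stable.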

Put every X rows of input into a new column

I have a file with 3972192 lines, each holding two tab-separated values. I would like to move every block of 47288 lines into a new column (which gives 84 columns). I read this other question (Put every N rows of input into a new column), which does what I want, but with awk I get:
awk: program limit exceeded: maximum number of fields size=32767
If I do it with pr, the limit on the number of columns is 36.
To do this, I first selected column 2 with awk:
awk '{print $2}' input_file>values_file
To get the first-column values I did:
awk '{print $1}' input_file>headers_file
head -n 47288 headers_file >headers_file2
Once I have both files, I will put them together with paste:
paste -d values_file headers_file2 >Desired_output
Example:
INPUT:
-Line1: ABCD 12
-Line2: ASDF 3435
...
-Line47288: QWER 345466
-Line47289: ABCD 456
...
-Line94576: QWER 25
...
-Line3972192: QWER 436
DESIRED OUTPUT:
- Line1: ABCD 12 456 ....
...
- Line47288: QWER 345466 25 .... 436
Any advice? Thanks in advance.
I assume every block follows the same pattern, that is, the first column lists the same labels in the same order [ABCD ASDF ... QWER] in each block.
If so, take the first column of the first block [47288 lines] and write it to the target file.
Then take the second column of each block and paste it onto the target file as a new column.
I tried it with this data file:
ABCD 1001
EFGH 1002
IJKL 1003
MNOP 1004
QRST 1005
UVWX 1006
ABCD 2001
EFGH 2002
IJKL 2003
MNOP 2004
QRST 2005
UVWX 2006
ABCD 3001
EFGH 3002
IJKL 3003
MNOP 3004
QRST 3005
UVWX 3006
ABCD 4001
EFGH 4002
IJKL 4003
MNOP 4004
QRST 4005
UVWX 4006
ABCD 5001
EFGH 5002
IJKL 5003
MNOP 5004
QRST 5005
UVWX 5006
And with this script:
#!/bin/bash
# Lines per block; change to 47288 for the real file.
# (Not named LINES, because bash itself may set that variable.)
BLOCKSIZE=6
INPUT='data.txt'
TOTALLINES=$(wc -l < "$INPUT")
TOTALBLOCKS=$((TOTALLINES / BLOCKSIZE))
# First column of the target file: column 1 of the first block.
# cut splits on TAB by default, which matches the tab-separated input.
head -n "$BLOCKSIZE" "$INPUT" | cut -f 1 > target.txt
# Append column 2 of each block to the target file as a new column.
BLOCK=1
while [ "$BLOCK" -le "$TOTALBLOCKS" ]
do
    HEADVALUE=$((BLOCK * BLOCKSIZE))
    head -n "$HEADVALUE" "$INPUT" | tail -n "$BLOCKSIZE" | cut -f 2 > tmpcol.txt
    cp target.txt targettmp.txt
    paste targettmp.txt tmpcol.txt > target.txt
    BLOCK=$((BLOCK + 1))
done
# Remove the temporary files.
rm -f targettmp.txt tmpcol.txt
And I got this target file:
ABCD 1001 2001 3001 4001 5001
EFGH 1002 2002 3002 4002 5002
IJKL 1003 2003 3003 4003 5003
MNOP 1004 2004 3004 4004 5004
QRST 1005 2005 3005 4005 5005
UVWX 1006 2006 3006 4006 5006
I hope this helps you.
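The same idea can also be sketched as a single pass in awk. This is an alternative under the same assumption (every block lists the same labels in the same order), not the approach used in the script above. Each output row is built up as a string rather than as fields of one input record, so it should not depend on awk's per-record field limit. The variable n and the scratch directory are illustrative; the generated data mirrors the test file above, and n would be 47288 for the real input.

```shell
cd "$(mktemp -d)" || exit 1
# Recreate the tab-separated test data: 5 blocks of 6 labelled lines.
for block in 1 2 3 4 5; do
    i=0
    for label in ABCD EFGH IJKL MNOP QRST UVWX; do
        i=$((i + 1))
        printf '%s\t%s\n' "$label" "$((block * 1000 + i))"
    done
done > data.txt

awk -F '\t' -v n=6 '
{
    i = (NR - 1) % n             # position of this line inside its block
    if (NR <= n) row[i] = $1     # first block: each row starts with its label
    row[i] = row[i] "\t" $2      # every block: append the value as a new column
}
END { for (i = 0; i < n; i++) print row[i] }
' data.txt > target.txt
cat target.txt
```

This reads the input once and keeps only n strings in memory, whereas the loop above re-reads the beginning of the file for every block.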

Merging and Adding Data in Excel Worksheets

I have 8 sheets of data (from Dec 2014 to July 2015, one sheet per month). Each sheet contains that month's data in three columns: A/C #, Name, and Amount (e.g. the Dec 2014 sheet holds the data for Dec 2014).
Dec 2014 contains the data below:
A/C # Name Dec 2014
A12 ABC 100
A13 CBA 200
A14 BCA 300
Whereas January 2015 contains data as below
A/C # Name Dec 2014
A12 ABC 5
A13 CBA 300
*A15 IJK 900*
All sheets contain mostly the same data, plus some additional rows for customers added in that month, and the amounts differ. E.g. January 2015 may contain an additional client A/C #, name and amount for January 2015, as marked above.
I want a consolidated sheet of data where all data is arranged as below:
A/C # Name Dec 2014 Jan 2015 Feb 2015 Mar 2015 Apr 2015
A12 ABC 100 5
A13 CBA 200 300
A14 BCA 300 0
A15 IJK 0 900
I would suggest connecting to the worksheets using ADODB. Then you can issue an SQL statement that will merge the records together.
This could be run from a VBScript, or from Excel.
For a similar strategy, see here.

Using a pipe to input in an awk statement

So I'm dealing with a file named cars. Here are its contents:
toyota corolla 1970 2500
chevy malibu 1999 3000
ford mustang 1965 10000
volvo s80 1998 9850
ford thundbd 2003 10500
chevy malibu 2000 3500
honda civic 1985 450
honda accord 2001 6000
ford taurus 2004 17000
toyota rav4 2002 750
chevy impala 1985 1550
ford explor 2003 9500
I'm using grep to filter for lines containing a specific automaker, piping that to my awk statement, and finally piping the result to tee to save it in a new file.
Here's the line of code I'm having trouble with:
grep "$model" cars |
awk '($3+0) >= ("'$max_year'"+0) && ($4+0) <= ("'$max_price'"+0)' |
tee last_search
I previously defined variables max_year and max_price as a user input in my script.
The file last_search is made but it's always empty.
You almost certainly have something wrong with your variables. Print them out and build up the pipeline one command at a time to debug.
As it stands, it works fine for the following values:
$ max_year=2000
$ max_price=10000
$ model=a
$ grep "$model" cars
toyota corolla 1970 2500
chevy malibu 1999 3000
ford mustang 1965 10000
chevy malibu 2000 3500
honda civic 1985 450
honda accord 2001 6000
ford taurus 2004 17000
toyota rav4 2002 750
chevy impala 1985 1550
$ grep "$model" cars | awk '($3+0) >= ("'$max_year'"+0) && ($4+0) <= ("'$max_price'"+0)'
chevy malibu 2000 3500
honda accord 2001 6000
toyota rav4 2002 750
There are also better ways of doing it that avoid splicing shell variables into the awk program text, which is error-prone. You can pass them in with -v instead:
grep "$model" cars |
awk -v Y="$max_year" -v P="$max_price" '$3 >= Y && $4 <= P'
(You'll note I'm not using the string+0 trick there. GNU awk, which you're almost certainly using under Linux, handles that just fine: it compares numerically when both operands look like numbers.)
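Alternatively, export the variables (set -a marks every variable assigned afterwards for export) and read them in awk through ENVIRON:
set -a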
model=malibu
max_year=2000
max_price=4000
awk '
$2 == ENVIRON["model"] &&
$3 >= ENVIRON["max_year"] &&
$4 <= ENVIRON["max_price"]
' cars |
tee last_search
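Putting the debugging advice and the -v approach together, a minimal sketch might look like the following. The awk variable names pat, year and price are made up for illustration, and the example values are the ones used in the answer above. Note one difference from the original pipeline: index() does a fixed-string match, whereas grep "$model" treats the value as a regular expression.

```shell
cd "$(mktemp -d)" || exit 1
cat > cars <<'EOF'
toyota corolla 1970 2500
chevy malibu 1999 3000
ford mustang 1965 10000
volvo s80 1998 9850
ford thundbd 2003 10500
chevy malibu 2000 3500
honda civic 1985 450
honda accord 2001 6000
ford taurus 2004 17000
toyota rav4 2002 750
chevy impala 1985 1550
ford explor 2003 9500
EOF

model=a
max_year=2000
max_price=10000

# Show exactly what the script received; empty values or stray
# whitespace become visible between the brackets.
printf 'model=[%s] max_year=[%s] max_price=[%s]\n' "$model" "$max_year" "$max_price"

# One awk process does both the text match and the numeric filter.
awk -v pat="$model" -v year="$max_year" -v price="$max_price" '
index($0, pat) && $3 >= year && $4 <= price
' cars | tee last_search
```

If the bracketed values look wrong (for instance empty), the problem is in how the script reads its input rather than in the awk condition.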

merging two files based on two columns

I have a question very similar to a previous post:
Merging two files by a single column in unix
but I want to merge my data based on two columns (the rows are in the same order in both files, so there is no need to sort).
Example,
subjectid subID2 name age
12 121 Jane 16
24 241 Kristen 90
15 151 Clarke 78
23 231 Joann 31
subjectid subID2 prob_disease
12 121 0.009
24 241 0.738
15 151 0.392
23 231 1.2E-5
And the output to look like
subjectid SubID2 prob_disease name age
12 121 0.009 Jane 16
24 241 0.738 Kristen 90
15 151 0.392 Clarke 78
23 231 1.2E-5 Joanna 31
When I use join, it only considers the first column (subjectid) and repeats the subID2 column.
Is there a way of doing this with join, or some other way? Thank you.
The join command has no option to use more than one field as the join key, so you have to add some logic of your own. Assuming your files have a fixed number of fields on each line, you can use something like this:
join f1 f2 | awk '{print $1" "$2" "$3" "$4" "$6}'
provided the field counts are as given in your examples. Otherwise, adjust the list of fields printed by the awk command, adding or removing fields as needed.
If the rows are in the same order in both files, you can still join on a single column and use -o to specify which columns to output, like:
join -o '1.1 1.2 2.3 1.3 1.4' file_a file_b
as described in join(1).
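To really match on both columns rather than only the first, one possible sketch is an awk lookup keyed on the pair (subjectid, subID2). The names file_a and file_b follow the join example above (file_a holds name and age, file_b holds prob_disease). The scratch directory and merged.txt are only for illustration, and the sketch assumes whitespace-separated fields with no spaces inside a field.

```shell
cd "$(mktemp -d)" || exit 1
cat > file_a <<'EOF'
subjectid subID2 name age
12 121 Jane 16
24 241 Kristen 90
15 151 Clarke 78
23 231 Joann 31
EOF
cat > file_b <<'EOF'
subjectid subID2 prob_disease
12 121 0.009
24 241 0.738
15 151 0.392
23 231 1.2E-5
EOF

awk '
NR == FNR { prob[$1, $2] = $3; next }    # file_b: (subjectid, subID2) -> prob_disease
($1, $2) in prob { print $1, $2, prob[$1, $2], $3, $4 }
' file_b file_a > merged.txt
cat merged.txt
```

Rows of file_a whose key pair does not occur in file_b are skipped, and the output keeps the row order of file_a, so no sorting is needed.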

Resources