I would like to merge two files by one column in unix.
I have file_a:
subjectid name age
12 Jane 16
24 Kristen 90
15 Clarke 78
23 Joann 31
I have another file_b:
subjectid prob_disease
12 0.009
24 0.738
15 0.392
23 1.2E-5
I would like to merge files a and b by subjectid on the command line. Each file is about 2 million lines long; I tried this in R, but it froze because of the amount of data. Could someone please help me do this in Linux?
Desired output:
subjectid prob_disease name age
12 0.009 Jane 16
24 0.738 Kristen 90
15 0.392 Clarke 78
23 1.2E-5 Joann 31
Please help and thank you!
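Check out join(1). join expects both inputs to be sorted on the join field, but since your two files list subjectid in the same order, you should not even need any flags here: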
$ join file_b file_a
subjectid prob_disease name age
12 0.009 Jane 16
24 0.738 Kristen 90
15 0.392 Clarke 78
23 1.2E-5 Joann 31
You're looking for the join command:
$ cat test.1
12 Jane 16
24 Kristen 90
15 Clarke 78
23 Joann 31
$ cat test.2
12 0.009
24 0.738
15 0.392
23 1.2E-5
$ join -j1 -o 2.1,2.2,1.2,1.3 <(sort test.1) <(sort test.2)
12 0.009 Jane 16
15 0.392 Clarke 78
23 1.2E-5 Joann 31
24 0.738 Kristen 90
$
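For files that carry a header line and are not already in matching order, as in the question, the sort-then-join idea above can be sketched as a small function. This is only a sketch under assumptions: merge_by_id is a hypothetical name, each file is assumed to have exactly one header line, space-separated fields, and the key in column 1, and the rows come out sorted by subjectid as text rather than in their original order.

```shell
#!/bin/sh
# merge_by_id FILE_A FILE_B  (hypothetical helper, not a standard tool)
# Prints FILE_B's columns followed by FILE_A's, matched on column 1.
# Assumes each file has exactly one header line and space-separated fields.
merge_by_id() {
    a=$1
    b=$2
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }

    # join needs both inputs sorted on the key, so sort the data rows only.
    tail -n +2 "$a" | LC_ALL=C sort -k1,1 > "$tmp_a"
    tail -n +2 "$b" | LC_ALL=C sort -k1,1 > "$tmp_b"

    # Header: FILE_B's header, then FILE_A's header minus its key column.
    hdr_a=$(head -n 1 "$a")
    hdr_b=$(head -n 1 "$b")
    printf '%s %s\n' "$hdr_b" "${hdr_a#* }"

    # Use the same locale as the sort so join agrees on the ordering.
    LC_ALL=C join "$tmp_b" "$tmp_a"

    rm -f "$tmp_a" "$tmp_b"
}
```

It could be called as merge_by_id file_a file_b > merged.txt. Since sort works with temporary files on disk, this approach is not expected to hit the memory limits described for R, though that is untested here.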
I am new to KDB. I have a table in the following format:
id date name order
34 2020.01.20 John 10
23 2020.01.20 John -20
21 2020.01.20 John 30
43 2020.01.20 John -400
44 2020.01.20 Dan -6483
22 2020.01.20 Dan 8796
The sample table can be created as follows:
t:([]id:(34, 23, 21, 43, 44, 22); date:(2020.01.20; 2020.01.20; 2020.01.20; 2020.01.20; 2020.01.20; 2020.01.20); name:(`John`John`John`John`Dan`Dan); order:(10, -20, 30, -400, -6483, 8796));
For any given date and name, I want every possible subset of the other orders, in the format below. The order column is now the sum of the order value for id and the order values of all the ids.
id date name order ids
34 2020.01.20 John 10 0n
34 2020.01.20 John -10 23
34 2020.01.20 John 40 21
34 2020.01.20 John -390 43
34 2020.01.20 John 20 23, 21
34 2020.01.20 John -360 21, 43
34 2020.01.20 John -380 23, 21, 43
23 2020.01.20 John -20 0n
23 2020.01.20 John -10 34
23 2020.01.20 John 10 21
23 2020.01.20 John -420 43
23 2020.01.20 John 20 34, 21
23 2020.01.20 John -390 21, 43
23 2020.01.20 John -380 34, 21, 43
21 2020.01.20 John 30 0n
21 2020.01.20 John 40 34
21 2020.01.20 John 10 23
21 2020.01.20 John -370 43
21 2020.01.20 John 20 34, 23
21 2020.01.20 John -390 23, 43
21 2020.01.20 John -380 34, 23, 43
43 2020.01.20 John -400 0n
43 2020.01.20 John -390 34
43 2020.01.20 John -420 23
43 2020.01.20 John -370 21
43 2020.01.20 John -410 34, 23
43 2020.01.20 John -390 23, 21
43 2020.01.20 John -380 34, 23, 21
44 2020.01.20 Dan -6483 0n
44 2020.01.20 Dan 2313 22
22 2020.01.20 Dan 8796 0n
22 2020.01.20 Dan 2313 44
I'm not sure this is the optimal solution, but the snippet below will do what you are looking for:
orderMap: (!) . t`id`order;
subsets: ungroup
  update ids: {x where each (count[x]-1){x cross 01b}/01b} each ids from
  select ids: id by name from t;
t: ej[`name;t;subsets];
t: delete from t where id in' ids;
t: update order: order + sum each orderMap#/:ids from t;
t
For consistency, the ids column is created as a list of integer lists, and an empty list `long$() is used instead of 0n.
In more detail:
orderMap: (!) . t`id`order gives the id-to-order mapping. Here I assume that ids are unique.
subsets is a table of names and the id subsets assigned to each name. {(count[x]-1){x cross 01b}/01b} returns "include" flags that help form the subsets, e.g. 0000b, 1000b, 0100b, .... This could be done more efficiently with the binary representation of integers.
ej[`name;t;subsets] joins the original table with the id subsets by name.
delete from t where id in' ids deletes rows where id is included in the ids subset.
update order: order + sum each orderMap#/:ids from t adds the orders of the ids subset to the order of id, using orderMap.
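The "include flags" idea, and the remark that it could be driven by the binary representation of integers, can be illustrated outside q. The bash sketch below is not a translation of the q code and is not from the original answer: subset_sums is a hypothetical helper, and the arrays hold only the three other John ids for the row with id 34. It counts a mask from 0 to 2^n - 1 and treats bit j as the include flag for the j-th id.

```shell
#!/usr/bin/env bash
# Illustration only: the other John ids for the row id=34, and their orders.
ids=(23 21 43)
orders=(-20 30 -400)
base=10   # order value of id 34 itself

# subset_sums (hypothetical helper): one line per subset of ids,
# printed as "<base + sum of included orders><TAB><included ids>".
subset_sums() {
    local n=${#ids[@]} mask j total members
    for ((mask = 0; mask < (1 << n); mask++)); do
        total=$base
        members=""
        for ((j = 0; j < n; j++)); do
            # Bit j of mask is the include flag for ids[j].
            if (( (mask >> j) & 1 )); then
                total=$((total + orders[j]))
                members+="${members:+ }${ids[j]}"
            fi
        done
        printf '%s\t%s\n' "$total" "${members:-none}"
    done
}

subset_sums
```

The eight sums this prints (10, -10, 40, 20, -390, -410, -360, -380) are the same values that appear for id 34 in the q output further down, in a different order.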
This gets you part of the way, though this approach has an extra combination that you seem to exclude (you can exclude these afterwards if need be):
q)comb:{$[type b:(count[a:x except y]-1)(01b cross)/01b;(`long$();a);a where each b]};
q)update sum each order from ungroup ungroup select id,order:(order,/:'order i?comb[i]each i),ids:id i?comb[i]each i by date,name from t
date name id order ids
---------------------------------
2020.01.20 Dan 44 -6483 `long$()
2020.01.20 Dan 44 2313 ,22
2020.01.20 Dan 22 8796 `long$()
2020.01.20 Dan 22 2313 ,44
2020.01.20 John 34 10 `long$()
2020.01.20 John 34 -390 ,43
2020.01.20 John 34 40 ,21
2020.01.20 John 34 -360 21 43
2020.01.20 John 34 -10 ,23
2020.01.20 John 34 -410 23 43
2020.01.20 John 34 20 23 21
2020.01.20 John 34 -380 23 21 43
2020.01.20 John 23 -20 `long$()
2020.01.20 John 23 -420 ,43
...
I have a file, for example, as follows:
HEAD 100 ABC 12 24 12.5
HEAD 101 ABC 10 20 11.9
HEAD 102 ABC 13 18 15.4
HEAD 103 ABC 14 21 17.6
HEAD 104 ABC 16 23 13.4
How could I add a number (like 5) to column 2 of every row? The result should be
HEAD 105 ABC 12 24 12.5
HEAD 106 ABC 10 20 11.9
HEAD 107 ABC 13 18 15.4
HEAD 108 ABC 14 21 17.6
HEAD 109 ABC 16 23 13.4
I tried to do it in vi command mode with
:%s/\d\+/\=submatch(0)+1/g
but it changes every number in the row.
THANKS!!!
Alternatively, in block-visual mode (Ctrl-V) select the numbers to be incremented:
HEAD 100 ABC 12 24 12.5
HEAD 101 ABC 10 20 11.9
HEAD 102 ABC 13 18 15.4
HEAD 103 ABC 14 21 17.6
HEAD 104 ABC 16 23 13.4
Then press keys: 5 Ctrl-A
The visual approach is quick, but to give another example suppose you want the output:
HEAD 150 ABC 12 24 12.5
HEAD 151 ABC 10 20 11.9
HEAD 152 ABC 13 18 15.4
HEAD 153 ABC 14 21 17.6
HEAD 154 ABC 16 23 13.4
that is, add 5 to the digit in the 7th column. You could either use the visual method with 50 Ctrl-A, or use specific-column matching (\%c):
:%s/\%>6c\%<8c\d/\=submatch(0)+5/
You were very close, but just don't use a global match, instead use:
:%s/\d\+/\=submatch(0)+5/
This will only replace the first match (in this case the desired second column).
Output:
HEAD 105 ABC 12 24 12.5
HEAD 106 ABC 10 20 11.9
HEAD 107 ABC 13 18 15.4
HEAD 108 ABC 14 21 17.6
HEAD 109 ABC 16 23 13.4
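If the change does not have to happen inside vi, the same "add 5 to column 2" edit can be sketched as an awk filter. This is an alternative, not part of the answers above: add_to_col2 is a hypothetical name, fields are assumed to be whitespace-separated, and because awk rebuilds the line when a field is assigned, runs of spaces collapse to single spaces.

```shell
#!/bin/sh
# add_to_col2 N  (hypothetical helper)
# Reads lines on stdin and adds N to the second whitespace-separated field.
add_to_col2() {
    awk -v n="$1" '{ $2 += n; print }'
}
```

For example, add_to_col2 5 < input.txt > output.txt, where both file names are placeholders.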
I've been trying to figure out how to SUM the top 2 values of an array using SUMPRODUCT, but I also want to add a criterion so that a value is only included if its row matches a specific string. I thought I could combine SUMPRODUCT and SUMIF, but I have been unsuccessful.
Position Age ADP Trend Value
QB 23 241 84.2 21
QB 35 185 -37.5 142
QB 27 300 25 19
QB 26 300 25 19
QB 32 300 25 19
RB 22 98 -2.2 1051
RB 24 69 0.3 1929
RB 24 238 6 25
RB 26 300 25 19
RB 26 300 25 19
WR 22 300 25 19
WR 24 300 25 19
WR 26 232 -17 36
WR 25 300 25 19
WR 28 300 25 19
WR 23 9 -4.2 8591
WR 23 178 21.4 161
WR 23 38 8.5 4679
WR 26 222 102.8 53
WR 23 300 25 19
WR 26 300 25 19
TE 26 117 -18.7 617
TE 36 193 -30.3 119
TE 26 199 -22.5 105
TE 24 300 25 19
What I want is to SUM the top two values under the Value column IF the Position = QB.
How can I accomplish this?
Cheers!
Use this array formula:
=SUM(LARGE(IF(A2:A25="QB",E2:E25,""),1),LARGE(IF(A2:A25="QB",E2:E25,""),2))
Press CTRL+SHIFT+ENTER to evaluate the formula as it is an array formula.
I have a bash script as below:
day=(58 34 107 91 43 39 41 76 37 47 70 74 56 19 95 38 48 96 50 76 89 79 46 105 26 88 69 87 23 82 99 77 114 52 87 63 33 52 57 45 48 49 55 60 34 107 48 40 25 20 16)
year=(1952 1953 1954 1955 1956 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004)
for dom in $day; do
for yrs in $year; do
ncks -O -d time,$dom imdJJAS$yrs.nc ac_$yrs.nc
done
done
Basically I am trying to extract the time dimension for each year using the NCO ncks command. The script runs, but the outputs are incorrect. For year 1951, it successfully extracted the 58th time value, but from 1952 onwards it extracts the last value in the day array (16), which is incorrect.
I've tried setting {$day[a]} since it's an array, but with that it extracts the last value in the array for all years instead.
I am not too sure what I'm doing wrong. I've looked through quite a few posts about this, but it doesn't seem to be working.
I'd appreciate any help.
Cheers!
$array by itself will expand to the first element in the array. To expand to the full array you should use ${array[@]}:
for dom in "${day[@]}"; do
for yrs in "${year[@]}"; do
ncks -O -d "time,${dom}" "imdJJAS${yrs}.nc" "ac_${yrs}.nc"
done
done
I also quoted your variable expansions and changed $dom and $yrs to ${dom} and ${yrs}. The latter prevents mistakenly referring to an undefined variable: $dom_abc is not the same as ${dom}_abc.
If I understand your intention correctly, you are trying to use corresponding values from both arrays. In that case you need a numerical index. for VAR in ARRAY iterates over all values of the array.
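Pairing day[i] with year[i] through a numerical index, as described above, might look like the following. This is a sketch under assumptions: the arrays are shortened to three entries, pair_commands is a hypothetical helper, and it only prints the ncks commands (a dry run) so the pairing can be checked before anything is executed.

```shell
#!/usr/bin/env bash
# Shortened arrays for illustration; day[i] is the time index for year[i].
day=(58 34 107)
year=(1952 1953 1954)

# pair_commands (hypothetical helper): prints one ncks command per
# (day, year) pair instead of running it.
pair_commands() {
    local i
    # "${!year[@]}" expands to the indices of the array: 0 1 2 ...
    for i in "${!year[@]}"; do
        printf 'ncks -O -d time,%s imdJJAS%s.nc ac_%s.nc\n' \
            "${day[i]}" "${year[i]}" "${year[i]}"
    done
}

pair_commands
```

Once the printed commands look right, the printf line can be replaced by the actual ncks call with the same quoted expansions.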
Here is a sample of my data
Hour Index Visits
0 67
1 22
2 111
3 22
4 0
5 0
6 22
7 44
8 0
9 89
10 22
11 111
12 44
13 89
14 44
15 111
16 177
17 89
18 44
19 44
20 89
21 22
22 89
23 44
24 133
25 44
26 22
27 22
28 44
29 22
30 44
31 44
32 22
What I want to do is add two columns. The first holds a date that starts at Jan 1, 2013 and repeats for 24 rows before incrementing to the next day. The second just displays the month of the date in the first. Here is what it should look like:
Hour Index Visits date month
0 67 1/1/2013 1
1 22 1/1/2013 1
2 111 1/1/2013 1
3 22 1/1/2013 1
4 0 1/1/2013 1
5 0 1/1/2013 1
6 22 1/1/2013 1
7 44 1/1/2013 1
8 0 1/1/2013 1
9 89 1/1/2013 1
10 22 1/1/2013 1
11 111 1/1/2013 1
12 44 1/1/2013 1
13 89 1/1/2013 1
14 44 1/1/2013 1
15 111 1/1/2013 1
16 177 1/1/2013 1
17 89 1/1/2013 1
18 44 1/1/2013 1
19 44 1/1/2013 1
20 89 1/1/2013 1
21 22 1/1/2013 1
22 89 1/1/2013 1
23 44 1/1/2013 1
24 133 2/1/2013 1
25 44 2/1/2013 1
26 22 2/1/2013 1
27 22 2/1/2013 1
28 44 2/1/2013 1
29 22 2/1/2013 1
30 44 2/1/2013 1
31 44 2/1/2013 1
32 22 2/1/2013 1
Suppose your hours start in A2. Then in the date column (column C) you can write:
=DATE(2013,1,1)+INT(A2/24)
and fill it down.
Next, in the month column (column D), write:
=MONTH(C2)
and fill it down.