merging two files based on two columns - linux

I have a question very similar to a previous post:
Merging two files by a single column in unix
but I want to merge my data based on two columns (the row orders are the same, so there is no need to sort).
Example:
subjectid subID2 name age
12 121 Jane 16
24 241 Kristen 90
15 151 Clarke 78
23 231 Joann 31
subjectid subID2 prob_disease
12 121 0.009
24 241 0.738
15 151 0.392
23 231 1.2E-5
And I want the output to look like:
subjectid subID2 prob_disease name age
12 121 0.009 Jane 16
24 241 0.738 Kristen 90
15 151 0.392 Clarke 78
23 231 1.2E-5 Joann 31
When I use join it only considers the first column (subjectid) and repeats the subID2 column.
Is there a way of doing this with join or some other way, please? Thank you.

The join command doesn't have an option to use more than one field as the join criterion. Hence, you will have to add some intelligence into the mix. Assuming your files have a FIXED number of fields on each line, you can use something like this:
join f1 f2 | awk '{print $1" "$2" "$6" "$3" "$4}'
provided the field counts are as given in your examples. Otherwise, you need to adjust the fields printed by the awk command, adding or removing some as needed.
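If you really do need to match on both columns (rather than relying on identical row order), one rough workaround is to build a composite key from the first two fields, join on that, and split it back out afterwards. A sketch, assuming the files are whitespace-separated, are named f1 and f2 as above, and that neither key column contains a ':':
# build a "subjectid:subID2" key, sort on it, join, then split the key back into two columns
join <(awk '{print $1":"$2, $3, $4}' f1 | sort -k1,1) \
     <(awk '{print $1":"$2, $3}' f2 | sort -k1,1) \
     | awk '{split($1, k, ":"); print k[1], k[2], $4, $2, $3}'
Because join needs its inputs sorted on the key, the output comes back in key order rather than the original row order.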

If the row orders are identical, you could still merge by a single column and specify which columns to output and in what order, like:
join -o '1.1 1.2 2.3 1.3 1.4' file_a file_b
as described in join(1).

Related

sort pyspark dataframe within groups

I would like to sort column "time" within each "id" group.
The data looks like:
id time name
132 12 Lucy
132 10 John
132 15 Sam
78 11 Kate
78 7 Julia
78 2 Vivien
245 22 Tom
I would like to get this:
id time name
132 10 John
132 12 Lucy
132 15 Sam
78 2 Vivien
78 7 Julia
78 11 Kate
245 22 Tom
I tried
df.orderBy(['id','time'])
But I don't need to sort "id".
I have two questions:
Can I sort only "time" within the same "id", and how?
Would it be more efficient to sort only "time" rather than using orderBy() to sort both columns?
This is exactly what windowing is for.
You can create a window partitioned by the "id" column and sorted by the "time" column. Next you can apply any function on that window.
# Create a Window
from pyspark.sql.window import Window
w = Window.partitionBy(df.id).orderBy(df.time)
Now you can use this window with any function.
For example, let's say you want to create a column with the time delta between consecutive rows within the same group:
import pyspark.sql.functions as f
df = df.withColumn("timeDelta", df.time - f.lag(df.time,1).over(w))
I hope this gives you an idea. Effectively you have sorted your dataframe using the window and can now apply any function to it.
If you just want to view your result, you could find the row number and sort by that as well.
df.withColumn("order", f.row_number().over(w)).sort("order").show()

Ranking Dates Based on Another Column - Spotfire

Does anyone know of a way to circumvent the Spotfire limitation on using the OVER function to RANK or order dates in a custom expression?
To provide a little background, I am trying to identify or mark each lease in the data below as 1, 2, 3, etc. For example, since 63 appears twice in the left column, I would like to return a 1 and a 2 to identify the two different leases, starting on 1/1/2016 and 8/1/2017. Then a 1 and a 2 for 72, a 1 for 140, and so on. Unfortunately, OVER functions can only be used with aggregation methods, and I don't know of another method to produce the result I am looking for.
Tenant Lease_From Lease_To Tenant_status
63 1/1/2016 1/31/2017 Current
63 8/1/2017 7/31/2018 Current
72 10/1/2016 7/31/2017 Current
72 8/1/2017 7/31/2018 Current
140 2/1/2017 7/31/2018 Current
149 8/1/2016 7/31/2017 Current
149 8/1/2017 7/31/2018 Current
156 1/15/2017 3/31/2018 Current
156 4/1/2018 3/31/2019 Current
Use this:
Rank([Lease_From], [Tenant])
Gives this as the result:
Tenant Lease_From Lease_To Tenant_status Rank([Lease_From], [Tenant])
63 1/1/2016 1/31/2017 Current 1
63 8/1/2017 7/31/2018 Current 2
72 10/1/2016 7/31/2017 Current 1
72 8/1/2017 7/31/2018 Current 2
140 2/1/2017 7/31/2018 Current 1
149 8/1/2016 7/31/2017 Current 1
149 8/1/2017 7/31/2018 Current 2
156 1/15/2017 3/31/2018 Current 1
156 4/1/2018 3/31/2019 Current 2
Please consider #blakeoft's answer the correct one!
That said, as an FYI, First() is considered an aggregation method, and OVER statements can be included inside an If()! So you can accomplish the same thing with an expression like:
If([Lease_From] = First([Lease_From]) OVER ([Tenant]), 1, 2)
When you combine If() and OVER in this way, you can get some really cool and powerful visualizations, BUT you do lose the ability to mark data effectively. This is because the expression is evaluated from the context of the If() rather than the OVER; in other words, all rows are considered instead of only the ones selected.
You can get around this with some black magic (AKA data functions), but it's a bit contrived.
Again, in this situation, Rank() is absolutely the correct solution.

how can I use multiple operations in awk to edit a text file

I have a text file like this small example:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 150 151 2 BA
chr10:103909786-103910082 152 153 1 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 294 295 4 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 2932 2933 2 CA
chr10:104573088-104576021 58 59 1 BA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
In this file there are 5 tab-separated columns. The first column is considered the ID; for example, in the first row the whole "chr10:103909786-103910082" is the ID.
1- In the 1st step I would like to filter out rows based on the 4th column.
If the number in the 4th column is less than 10 and the 5th column of that row is BA, the row is filtered out. Likewise, if the number in the 4th column is less than 5 and the 5th column of that row is CA, the row is filtered out.
2- and 3- In the next steps I want to compute a ratio from the numbers in the 4th column. The 1st column contains repeated values that represent the same ID, and I want one ratio per ID, so in the output every ID appears only once. Each ID has both BA and CA rows in the 5th column. For each ID I should get two values, one for CA and one for BA: the CA value is the sum of all 4th-column values that belong to that ID and are classified as CA, and the BA value is the sum of all 4th-column values that belong to that ID and are classified as BA. The last step is to take the ratio CA/BA per ID as the final value. The expected output for the small example would look like this:
1- after filtration:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
2- after summarizing each group (CA and BA):
chr10:103909786-103910082 147 148 35 BA
chr10:103909786-103910082 274 275 35 CA
chr10:104573088-104576021 2925 2926 144 CA
chr10:104573088-104576021 819 820 45 BA
3- the final output (the ratio computed from the values in the 4th column):
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2
In the above lines, 1 = 35/35 and 3.2 = 144/45.
I am trying to do this in awk:
awk -F "\t" '{ (if($4 < -10 & $5==BA)), (if($4 < -5 & $5==CA)) ; print $2 = BA/CA} file.txt > out.txt
I tried to follow the steps mentioned above in the code but did not succeed. Do you know how to solve the problem?
If the records with the same ID are always consecutive, you can do it like this:
awk 'ID!=$1 {
         if (ID) {
             print ID, a["CA"]/a["BA"]; a["CA"]=a["BA"]=0;
         }
         ID=$1
     }
     $5=="BA" && $4>=10 || $5=="CA" && $4>=5 { a[$5]+=$4 }
     END { print ID, a["CA"]/a["BA"] }' file.txt
The first block tests whether the ID has changed; if so, it prints the previous ID and its ratio, then resets the sums.
The second block filters out unwanted records and accumulates the per-group sums.
The END block prints the result for the last ID.
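If the rows for the same ID are not guaranteed to be consecutive, a variant (just a sketch, using the same thresholds, and assuming every ID keeps at least one BA row so the division is safe) is to accumulate the per-ID, per-group sums in an array and print everything in the END block:
awk '$5=="BA" && $4>=10 || $5=="CA" && $4>=5 { sum[$1 SUBSEP $5] += $4; ids[$1] = 1 }
     END { for (id in ids) print id, sum[id SUBSEP "CA"]/sum[id SUBSEP "BA"] }' file.txt
The output order then follows awk's array iteration, so pipe it through sort if you need a fixed order.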

Combine results of column one, then sum column two to list a total for each entry in column one

I am a bit of a Bash newbie, so please bear with me here.
I have a text file dumped by another piece of software (that I have no control over), listing each user with the number of times they accessed a certain resource. It looks like this:
Jim 109
Bob 94
John 92
Sean 91
Mark 85
Richard 84
Jim 79
Bob 70
John 67
Sean 62
Mark 59
Richard 58
Jim 57
Bob 55
John 49
Sean 48
Mark 46
.
.
.
My goal here is to get output like this:
Jim [Total for Jim]
Bob [Total for Bob]
John [Total for John]
And so on.
Names change each time I run the query in the software, so a static search for each name piped through wc does not help.
This sounds like a job for awk :) Pipe the output of your program to the following awk script:
your_program | awk '{a[$1]+=$2}END{for(name in a)print name " " a[name]}'
Output:
Sean 201
Bob 219
Jim 245
Mark 190
Richard 142
John 208
The awk script itself can be explained better in this format:
# executed on each line
{
    # 'a' is an associative array. It will be initialized
    # as an empty array by awk on its first usage.
    # '$1' contains the first column - the name
    # '$2' contains the second column - the amount
    #
    # on every line the total score of 'name'
    # will be incremented by 'amount'
    a[$1] += $2
}
# executed at the end of input
END {
    # print every name and its score
    for (name in a) print name " " a[name]
}
Note: to get the output sorted by score, you can add another pipe to sort -rn -k2. -rn -k2 sorts numerically by the second column in reverse order:
your_program | awk '{a[$1]+=$2}END{for(n in a)print n" "a[n]}' | sort -rn -k2
Output:
Jim 245
Bob 219
John 208
Sean 201
Mark 190
Richard 142
Pure Bash:
declare -A result   # an associative array
while read name value; do
    (( result[$name] += value ))
done < "$infile"
for name in "${!result[@]}"; do
    printf "%-10s%10d\n" "$name" "${result[$name]}"
done
If the first 'done' has no redirection from an input file, this script can be used with a pipe:
your_program | ./script.sh
and for sorting the output:
your_program | ./script.sh | sort
The output:
Bob 219
Richard 142
Jim 245
Mark 190
John 208
Sean 201
GNU datamash:
datamash -W -s -g1 sum 2 < input.txt
Output:
Bob 219
Jim 245
John 208
Mark 190
Richard 142
Sean 201
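If you also want the totals in descending order, as with the sorted awk output above, you could pipe the result through sort (a small addition, not part of the original answer):
datamash -W -s -g1 sum 2 < input.txt | sort -k2,2nr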

linux sort inside a column

I want to sort a file by the second character of the second column, in numeric order.
The sample file looks like this:
aa 19
aa 189
aa 167
ab 13
nd 23
at 32
ca 90
I expect a result like this:
ca 90
at 32
ab 13
nd 23
aa 167
aa 189
aa 19
I use the command sort -n -k 2.2,2.2 [filename].
But it shows me the result like this:
aa 167
aa 189
aa 19
ab 13
nd 23
at 32
ca 90
That is not the result I want. Does anybody know what's wrong with my command?
The problem is the field delimiter: without -t, sort includes the leading blanks in the following field, so position 2.2 lands on the first digit of the number instead of the second. Specifying a single space as the delimiter fixes the offsets:
sort -t ' ' -n -k 2.2,2.2
works just fine.
Edit: my man page says that any whitespace is counted as a delimiter by default, but the leading blanks are then included in the following field, which shifts the character positions; that is why adding -t ' ' solves it.
sort -t ' ' -k2.2,2.2 filename
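If you are using GNU sort (an assumption; other implementations may not have this flag), --debug underlines the characters each key actually covers, which makes the shifted offset easy to see:
sort --debug -n -k 2.2,2.2 filename         # without -t, field 2 starts at the blank, so 2.2 lands on the first digit
sort --debug -t ' ' -n -k 2.2,2.2 filename  # with -t ' ', 2.2 is the second digit, as intended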
