How to use AWK command for a variable with multiple entries? [closed] - linux

How can I use awk to extract information for an ID with multiple observations/variables? In the following example, I have two IDs (a, b); both have 6 observations at different ages with different measurements.
The file is sorted by ID and age.
The 'line numbers' are not part of the actual data file, but the headings are!
I'd like to use awk to identify and extract {print} the "measurement" difference between the earliest age and the latest age for each unique ID. Looking at the following example, for ID (a), I'd like to obtain 50-11=39.
id age measurement
1 a 2 11
2 a 4 20
3 a 6 19
4 a 7 89
5 a 8 43
6 a 12 50
7 b 1 15
8 b 3 23
9 b 5 30
10 b 6 33
11 b 7 45
12 b 10 60
I would highly appreciate it if you could explain the details, so that I can learn from them.

Ordered input data
Given that the line numbers are not part of the data but the headings are, the file looks more like:
id age measurement
a 2 11
a 4 20
a 6 19
a 7 89
a 8 43
a 12 50
b 1 15
b 3 23
b 5 30
b 6 33
b 7 45
b 10 60
This script analyzes that file as desired:
awk 'NR==1 { next }
$1 != id { if (id != "") print id, id_max - id_min; id = $1; id_min = $3; }
{ id_max = $3 }
END { if (id != "") print id, id_max - id_min; }' data
The first line of the script skips the heading line of the file.
The second line checks whether the ID has changed; if so, it checks whether there was a previous ID, and if so, prints that ID's result. It then stores the current ID and records this row's measurement as the minimum-age measurement (the file is sorted by age within each ID, so the first row of a group carries the earliest age).
The third line records every row's measurement as the measurement for the current maximum age of the current ID; because rows arrive in age order, the value left standing when the ID changes belongs to the latest age.
The last line prints out the result for the last group in the file, if there was any data in the file.
Sample output:
a 39
b 45
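As an aside, the same computation can be written more compactly by remembering the first and last measurement seen for each ID. This is a sketch added here for comparison, not part of the original answer; it relies on the same sorted-by-age assumption, and the for (id in first) loop visits IDs in an unspecified order:
awk 'NR > 1 { if (!($1 in first)) first[$1] = $3; last[$1] = $3 }
     END { for (id in first) print id, last[id] - first[id] }' data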
Unordered input data
Even if the data is not sequenced by ID and age, the code can be adapted to work:
awk 'NR==1 { next }
     { if (group[$1]++ == 0)
       {
           order[++sequence] = $1
           id_age_min[$1] = $2; id_val_min[$1] = $3
           id_age_max[$1] = $2; id_val_max[$1] = $3
       }
       if ($2 < id_age_min[$1]) { id_age_min[$1] = $2; id_val_min[$1] = $3; }
       if ($2 > id_age_max[$1]) { id_age_max[$1] = $2; id_val_max[$1] = $3; }
     }
     END { for (i = 1; i <= sequence; i++)
           {
               id = order[i];
               print id, id_val_max[id] - id_val_min[id]
           }
         }' data
This skips the heading line, then tracks groups as they arrive and arranges to print the results in that order (using the group, order, and sequence variables). For each row, if the group has not been seen before, it initializes the data from the current row (whose values serve as both the minimum and the maximum).
If the age in the current row ($2) is less than the current minimum age for the ID (id_age_min[$1]), it records the new minimum age and the corresponding value. If the age in the current row is greater than the current maximum age (id_age_max[$1]), it records the new maximum age and the corresponding value.
At the end, for each ID in sequence, print out the ID and the difference between the maximum and minimum value for that ID.
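The order and sequence bookkeeping matters because awk's for (key in array) loop visits keys in an unspecified order. A quick illustration (a sketch added here, not part of the original answer):
echo 'b a c' | awk '{ for (i = 1; i <= NF; i++) seen[$i]++ }
                    END { for (k in seen) printf "%s ", k; print "" }'
Depending on the awk implementation, this can print a b c, c a b, or any other order, which is why the script above records first-appearance order explicitly in order[] and replays it in the END block.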
Shuffled data:
id age measurement
a 12 50
b 10 60
a 4 20
b 3 23
b 5 30
a 7 89
b 6 33
b 7 45
a 8 43
a 6 19
a 2 11
b 1 15
Sample output:
a 39
b 45
It so happens that an a row still appeared before a b row, so the output is the same as before.
More data (note b appears before a this time):
id age measurement
b 10 60
a 12 50
a 4 20
b 3 23
c 2 19
d -9 20
d 10 31
e 10 31
b 5 30
a 7 89
b 6 33
e -9 20
b 7 45
a 8 43
a 6 19
f -9 -3
f -7 -1
g -2 -8
g -8 -3
a 2 11
b 1 15
Result:
b 45
a 39
c 0
d 11
e 11
f 2
g -5

Related

Calculate the number of occurrences of words in a column and find the second, third most common

I have a formula that finds the most frequently occurring text, and it works well.
=INDEX(Rng,MATCH(MAX(COUNTIF(Rng,Rng)),COUNTIF(Rng,Rng),0))
How can I tweak it to find the second highest, the third highest?
2nd:
=LARGE(A2:A; 2)
3rd:
=LARGE(A2:A; 3)
update 1:
use query:
=QUERY(A:A,
"select A,count(A) where A is not null group by A label count(A)''")
To get only the 2nd or 3rd, you can use INDEX like:
=INDEX(QUERY(A:A,
"select A,count(A) where A is not null group by A label count(A)''"), 2)
update 2:
=INDEX(QUERY({'Data Entry Errors'!I:I},
"select Col1,count(Col1) where Col1 is not null group by Col1 order by count(Col1) desc limit 3 label count(Col1)''"),,1)
In Google Sheets, to get the number of occurrences of each word in the column A2:A, use this:
=query(A2:A, "select A, count(A) where A is not null group by A order by count(A) desc label count(A) '' ", 0)
To get just the second and third result and the number of their occurrences, use this:
=query(A2:A, "select A, count(A) where A is not null group by A order by count(A) desc limit 2 offset 1 label count(A) '' ", 0)
To get just the names that are the second and third by the number of their occurrences, use this:
=query( query(A2:A, "select A, count(A) where A is not null group by A order by count(A) desc limit 2 offset 1 label count(A) '' ", 0), "select Col1", 0 )
For Excel 365
Say we have data in column A from A2 through A66 like:
20
11
27
18
3
31
2
30
8
1
18
32
3
5
4
6
4
1
22
11
2
46
33
34
25
53
37
9
20
2
12
4
5
4
23
39
19
4
28
22
5
16
24
7
6
10
13
31
56
23
1
16
27
39
1
6
11
6
20
11
24
12
9
29
12
and we want a frequency table listing the most frequent value, the second most frequent value, the third, etc.
The simplest approach is to construct a Pivot Table, but if you need a formula approach, then in B2 enter:
=UNIQUE(A2:A66)
in C2 enter:
=COUNTIF(A$2:A$66,B2)
We now sort cols B:C by C. In D2 enter:
=SORTBY(B2:C35,C2:C35,-1)

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by a particular month-year, keeps the entry with the latest date in that month-year, and drops the rest. The data runs until the year 2020.
I was only able to fetch the count by month-year; I have not been able to write proper code that groups the data by month-year and indicator and gets the correct results.
Use Series.dt.to_period for month periods, aggregate the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass the result to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90

Find occurrences of conditional value from one column and count values from another column in a dataframe

I have a dataframe containing userIds, week number, and a column X as shown below:
I am trying to group by the userIds if X is greater than 3 for 3 weeks.
I have tried using groupby and lambda in pandas, but I am stuck:
weekly_X = df.groupby(['Userid','Week #'], as_index=False)
UserIds Week X
123 14 3
123 15 4
123 16 7
123 17 2
123 18 1
456 14 4
456 15 5
456 16 11
456 17 2
456 18 6
The result I am aiming for is a dataframe containing user 456 and how many weeks the condition occurred.
One approach flags the qualifying users with a boolean aggregation, then filters the original frame:
df_3 = df.groupby('UserIds').apply(lambda x: (x.X > 3).sum() > 3).to_frame('ID_want').reset_index()
df = df[df.UserIds.isin(df_3.loc[df_3.ID_want == 1,'UserIds'])]
Get the counts of values greater than 3 with an aggregate sum, and then filter the totals greater than 3:
s = df['X'].gt(3).astype(int).groupby(df['UserIds']).sum()
out = s[s.gt(3)].reset_index(name='count')
print (out)
UserIds count
0 456 4

Concatenating columns from different files, while skipping the blank lines

I know it's likely possible to do this with awk, but I have no idea how to do it.
Suppose I have the following 2 tab-separated files, where there are blank lines that contain only \n:
file1:
A 1 4
B 2 5
C 3 6

D 7 10
E 8 11
A 9 12
file2:
E 13 16
F 14 17
G 15 18

H 19 22
I 20 23
J 21 24
I want to generate a new file which corresponds to the concatenation of the first 2 columns from file 1 with the third column from file 2, and then the third column from file 1:
final file:
A 1 16 4
B 2 17 5
C 3 18 6

D 7 22 10
E 8 23 11
A 9 24 12
Note that, in the final file, it's important that the blank lines are kept blank, with no tabs inserted there.
Simple paste + awk combination:
paste file1 file2 | awk '!NF{ print "" }NF{ print $1,$2,$6,$3 }'
The output:
A 1 16 4
B 2 17 5
C 3 18 6

D 7 22 10
E 8 23 11
A 9 24 12
awk 'NR==FNR{a[NR]=$3;next} NF{$3=a[FNR] OFS $3} 1' file2 file1
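For anyone decoding that one-liner, here is the same program spelled out with comments (a sketch of the identical logic, under the same assumption that the two files align line by line):
awk 'NR == FNR { a[NR] = $3; next }    # first pass (file2): save column 3, keyed by line number
     NF        { $3 = a[FNR] OFS $3 }  # second pass (file1): prepend the saved value to column 3
     1                                 # always-true pattern: print the line; blank lines pass through
    ' file2 file1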

Dropping all id rows if at least one cell meets a given criterion (e.g. has a missing value)

My dataset is in the following form:
clear
input id var
1 20
1 21
1 32
1 34
2 11
2 .
2 15
3 21
3 22
3 1
3 2
3 5
end
In my true dataset, observations are sorted by id and by year (not shown here).
What I need to do is to drop all the rows of a specific id if (at least) one of the following two conditions is met:
there is at least one missing value of var.
var decreases from one row to the next (for the same id)
So in my example what I would like to obtain is:
id var
1 20
1 21
1 32
1 34
Now, my unfortunate attempt has been to use row-wise operations together with by, in order to create a drop1 variable to be used later to subset the dataset.
Something along these lines (which is clearly wrong):
bysort id: gen drop1=1 if var[_n] < var[_n-1] | var[_n]==.
This doesn't work, and I am not even sure that I am considering the cleanest and most direct way to solve the task.
How would you proceed? Any help would be highly appreciated.
My interpretation is that you want to drop the complete group if either of two conditions is met. I assume your dataset is sorted in some way, most likely based on another variable. Otherwise, the structure is fragile.
The logic is simple. Check for decreasing values, but leave out the first observation of each group, i.e., leave out _n == 1. The first observation, if non-missing, would otherwise always be flagged, because the nonexistent var[0] is treated as missing, and Stata treats missing as larger than any number. Then, check also for missings.
clear
set more off
input id var
1 20
1 21
1 32
1 34
2 11
2 .
2 15
3 21
3 22
3 1
3 2
3 5
end
// maintain original sequencing
gen orig = _n
order id orig
bysort id (orig) : gen todrop = sum((var < var[_n-1] & _n > 1) | missing(var))
list, sepby(id)
by id : drop if todrop[_N]
list, sepby(id)
One way to do this is to create some indicator variable as you had attempted. If you only want to drop where var decreases from one observation to the next, you could use:
clear
input id var
1 20
1 21
1 32
1 34
2 11
2 .
2 15
3 21
3 22
3 1
3 2
3 5
4 .
4 2
end
gen i = id if mi(var)
bysort id : egen k = mean(i)
drop if id == k
drop i k
drop if var[_n-1] > var[_n] & _n != 1
However, if you want to get the output you supplied in the post (drop all subsequent observations where var decreases from some max value), you could try the following in place of the last line above.
local N = _N
forvalues i = 1/`N' {
drop if var[_n-1] > var[_n] & _n != 1
}
The loop just ensures that the drop if var... line is executed enough times that all observations where var < 34 are dropped.
