Align rows in text file with command line - text

I have data in a text file where the rows are repeated in several blocks, and each block shares the same first column of row labels. I'd like to use the command line to align the rows into a single table.
The data text file looks like:
Values SampleA SampleB SampleC
Value1 1.00 2.00 3.00
Value2 3.00 2.00 1.00
Value3 2.00 1.00 3.00
Values SampleD SampleE SampleF
Value1 1.00 2.00 3.00
Value2 3.00 2.00 1.00
Value3 2.00 1.00 3.00
And I'd like the resulting file to look like:
Values SampleA SampleB SampleC SampleD SampleE SampleF
Value1 1.00 2.00 3.00 1.00 2.00 3.00
Value2 3.00 2.00 1.00 3.00 2.00 1.00
Value3 2.00 1.00 3.00 2.00 1.00 3.00

This solution creates lots of temp files, but cleans up after.
# put each paragraph into its own file:
awk -v RS= '{print > sprintf("%s_%06d", FILENAME, NR)}' data.txt
# now, join them, and align the columns
join data.txt_* | column -t | tee data.txt.joined
# and cleanup the temp files
rm data.txt_*
Verify afterwards with: wc -l data.txt data.txt.joined
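A note on sorting: join normally expects its inputs sorted on the join field, and here the Values header line actually sorts after Value1..Value3. GNU join gets away with it because, by default, it only diagnoses disorder when unpairable lines turn up; if your join complains anyway, the GNU --nocheck-order flag disables the check:
join --nocheck-order data.txt_* | column -t | tee data.txt.joined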

Related

How to club the average value (in another column) for rows having same values in csv file using python/pandas

I have generated a CSV, shown below:
term unique_identity score1 score2 score3
abc uid_1 8 0.5 0.8
abc uid_1 22 0.65 0.42
abc uid_1 17 0.78 0.48
abc uid_2 12 0.62 0.45
abc uid_2 13 0.49 0.21
abc uid_2 15 0.50 0.10
abc uid_3 12 0.49 0.20
abc uid_3 13 0.60 0.10
abc uid_3 31 0.21 0.56
However, I want to club (average) the value of each score across all rows with the same unique identity, somewhat like below:
term unique_id average_score1 average_score2 average_score3
abc uid_1 15.66 ((8+22+17)/3) 0.6433 ((0.5+0.65+0.78)/3) 0.566 ((0.8+0.42+0.48)/3)
abc uid_2 13.33 ((12+13+15)/3) 0.5366 ((0.62+0.49+0.50)/3) 0.2533 ((0.45+0.21+0.10)/3)
I tried using .groupby() in pandas but failed to get anything like this.
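For reference, a minimal sketch of the usual groupby approach, assuming the table above is loaded into a DataFrame (the filename scores.csv and the whitespace separator are assumptions based on the sample):
import pandas as pd

# assumed filename and separator; adjust to the real CSV
df = pd.read_csv("scores.csv", sep=r"\s+")

out = (df.groupby(["term", "unique_identity"], as_index=False)[["score1", "score2", "score3"]]
         .mean()  # one row per (term, unique_identity), column-wise averages
         .rename(columns={f"score{i}": f"average_score{i}" for i in (1, 2, 3)}))
print(out)
For uid_1 this gives average_score1 = (8+22+17)/3 = 15.66..., matching the expected output above.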

Failing to sort pandas dataframe by values in descending order and then alphabetically in ascending order

A slice of my dataframe, df, is like this, so you can reproduce it.
import pandas as pd
data ={'feature_name': ['nite', 'thank', 'ok', 'havent', 'done', 'beverage', 'yup', 'lei','thanx', 'okie', '146tf150p', 'home', 'too', 'anytime',
'where', '645', 'er', 'tick', 'blank'], 'values':[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.98, 0.98] }
df = pd.DataFrame(data)
df.set_index('feature_name',inplace=True)
dfs=df.sort_index(ascending=True).sort_values(by = ['values'], ascending=False)
dfs
My output is this:
values
feature_name
146tf150p 1.00
645 1.00
where 1.00
too 1.00
thanx 1.00
thank 1.00
okie 1.00
ok 1.00
nite 1.00
lei 1.00
home 1.00
havent 1.00
er 1.00
done 1.00
beverage 1.00
anytime 1.00
yup 1.00
blank 0.98
tick 0.98
I do not quite understand why the output is not like this instead; the chained sorts really should work, yet they do not:
146tf150p 1.00
645 1.00
anytime 1.00
beverage 1.00
done 1.00
er 1.00
havent 1.00
home 1.00
...
How can I fix this?
Get rid of the set_index and use sort_values on both values and feature_name. Chaining two sorts only preserves the earlier order on ties if the second sort is stable, and sort_values defaults to a non-stable quicksort, which is why the alphabetical order got scrambled:
print (df.sort_values(by = ['values',"feature_name"], ascending=(False, True)))
feature_name values
10 146tf150p 1.00
15 645 1.00
13 anytime 1.00
5 beverage 1.00
4 done 1.00
16 er 1.00
3 havent 1.00
11 home 1.00
7 lei 1.00
0 nite 1.00
2 ok 1.00
9 okie 1.00
1 thank 1.00
8 thanx 1.00
12 too 1.00
14 where 1.00
6 yup 1.00
18 blank 0.98
17 tick 0.98
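If you do want to keep the chained form, using the stable mergesort for the second sort should also preserve the alphabetical tie order (a sketch along the same lines):
dfs = (df.set_index('feature_name')
         .sort_index(ascending=True)
         .sort_values(by='values', ascending=False, kind='mergesort'))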

Get Proportions of One Hot Encoded Values While Aggregating - Pandas

I have a df like this,
Date Value
0 2019-03-01 0
1 2019-04-01 1
2 2019-09-01 0
3 2019-10-01 1
4 2019-12-01 0
5 2019-12-20 0
6 2019-12-20 0
7 2020-01-01 0
Now, I need to group them by quarter and get the proportions of 1s and 0s, so that my final output looks like this:
Date Value1 Value0
0 2019-03-31 0 1
1 2019-06-30 1 0
2 2019-09-30 0 1
3 2019-12-31 0.25 0.75
4 2020-03-31 0 1
I tried the following code, but it doesn't seem to work:
def custom_resampler(array):
    import numpy as np
    return array / np.sum(array)

df.set_index('Date').resample('Q')['Value'].apply(custom_resampler)
Is there a pandastic way I can achieve my desired output?
Resample by quarter, get the value_counts, and unstack. Next, rename the columns using the name property of the columns. Last, divide each row value by the row total:
import pandas as pd

df = pd.read_clipboard(sep=r'\s{2,}', parse_dates=['Date'])  # read the question's table
res = (df
       .resample(rule="Q", on="Date")
       .Value
       .value_counts()
       .unstack("Value", fill_value=0)
      )
res.columns = [f"{res.columns.name}{ent}" for ent in res.columns]
res = res.div(res.sum(axis=1), axis=0)
res
Value0 Value1
Date
2019-03-31 1.00 0.00
2019-06-30 0.00 1.00
2019-09-30 1.00 0.00
2019-12-31 0.75 0.25
2020-03-31 1.00 0.00
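An equivalent spelling that skips the manual rename and division, using normalize=True on the groupby's value_counts (a sketch of the same idea):
import pandas as pd

res = (df.groupby(pd.Grouper(key="Date", freq="Q"))["Value"]
         .value_counts(normalize=True)    # proportions per quarter instead of raw counts
         .unstack("Value", fill_value=0)
         .add_prefix("Value"))            # columns 0/1 become Value0/Value1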

Convert over 24Hr time into 24Hr time

My data contains a column (A2:A40000) of accumulated hourly time. I am trying to convert it to 24-hour time, for example: 24 to 0, 57 to 9, and so on.
Once that is done, I also want to get which day the time falls on, based on column A. Assuming the first 0 starts on Monday 12:00 AM, Hr 24 would be Tuesday, Hr 57 would be Wednesday, and so on.
The end goal is to create a whisker chart with Day and time on X and column B on Y.
I would appreciate any help to get me started.
Example:
A B
0.00 0.00
1.00 0.00
1.00 0.00
2.00 0.00
12.00 2.00
14.00 0.00
16.00 0.00
17.00 0.00
17.00 0.00
18.00 0.00
19.00 10.00
22.00 0.00
23.00 0.00
24.00 1.00
26.00 0.00
28.00 0.00
46.00 0.00
58.00 0.00
10240.00 0.00
To get the hour of the day:
=HOUR(1+A1/24)
To get the weekday (numerical):
=WEEKDAY(2+A1/24)
To get the weekday (name):
=CHOOSE(WEEKDAY(2+A1/24),"Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
In your case, to get the hour of the day, use =HOUR(1+A1/24); you have to add the offset since your count starts at zero rather than on a real date. Hope this helps.
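As a quick sanity check with the question's own example, assuming Excel's default 1900 date system: for A1 = 57, 57/24 = 2.375, so =HOUR(1+57/24) = HOUR(3.375) = 9, and =WEEKDAY(2+57/24) = WEEKDAY(4.375) = 4, which the CHOOSE list maps to "Wednesday", exactly the 57 -> Hr 9 on Wednesday from the question.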

Linux text: add line to previous line of a pattern

I would like to add a specific line "TER" to several variable text files:
Input:
[...]
ATOM 4149 C LEU C 9 136.820 120.050 53.540 1.00 0.00
ATOM 4150 O LEU C 9 136.600 118.860 53.240 1.00 0.00
ATOM 4151 O LEU C 9 137.310 120.340 54.650 1.00 0.00
ATOM 4154 N LYS D 2 115.050 134.940 61.060 1.00 0.00
ATOM 4155 H1 LYS D 2 115.660 134.160 61.180 1.00 0.00
ATOM 4156 H2 LYS D 2 114.760 135.000 60.100 1.00 0.00
[...]
Output:
[...]
ATOM 4149 C LEU C 9 136.820 120.050 53.540 1.00 0.00
ATOM 4150 O LEU C 9 136.600 118.860 53.240 1.00 0.00
ATOM 4151 O LEU C 9 137.310 120.340 54.650 1.00 0.00
TER
ATOM 4154 N LYS D 2 115.050 134.940 61.060 1.00 0.00
ATOM 4155 H1 LYS D 2 115.660 134.160 61.180 1.00 0.00
ATOM 4156 H2 LYS D 2 114.760 135.000 60.100 1.00 0.00
[...]
So the pattern is: the first time a " D " line appears after a " C " line, add a "TER" line between them (after the " C " line, before the " D " line). All other numbers and characters can vary.
I found some examples with the sed command, but I do not know how to add a line relative to the previous line.
With awk:
$ awk 'last_c5=="C" && $5=="D" {print "TER"}; last_c5=$5' file
ATOM 4149 C LEU C 9 136.820 120.050 53.540 1.00 0.00
ATOM 4150 O LEU C 9 136.600 118.860 53.240 1.00 0.00
ATOM 4151 O LEU C 9 137.310 120.340 54.650 1.00 0.00
TER
ATOM 4154 N LYS D 2 115.050 134.940 61.060 1.00 0.00
ATOM 4155 H1 LYS D 2 115.660 134.160 61.180 1.00 0.00
ATOM 4156 H2 LYS D 2 114.760 135.000 60.100 1.00 0.00
It keeps track of the previous line's 5th column by storing it in the last_c5 variable. When the previous value was C and the current one is D, it prints TER before the current line. The trailing last_c5=$5 pattern both updates the variable and, because the assigned value is a non-empty string, triggers awk's default action of printing every line.
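Since the question mentions sed, here is a GNU sed sketch of the same idea (the 0,/regexp/ address and \n in the replacement are GNU extensions, and it assumes " D " first appears on the target line):
sed '0,/ D /s/.* D .*/TER\n&/' file
The 0,/ D / range ends at the first line containing " D ", and the substitution prefixes that line with a TER line. Unlike the awk version, which prints TER at every C-to-D transition, this inserts it only before the first one, which is what the question asked for.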
