Extract and copy rows and columns from a large text file - Linux

In addition to my previous question: how can I extract values in column format from a large text file, and how do I pull out the values of a specific column? The text file looks like the attached image.
I want to extract the RESOL and FSC values given in the text so that I can plot them in Numbers or Excel.
To be more general: what should I do if I later want to extract the Part_FSC and CC values as well?
Thanks
Text File:
Opening MRC/CCP4 file for WRITE...
File : map2.mrc
NX, NY, NZ: 600 600 600
MODE : real
SYMMETRY REDUNDANCY: 1
Fraction_mask, Fraction_particle = 0.1193 0.0111
C sqrt sqrt
C NO. RESOL RING RAD FSPR FSC Part_FSC Part_SSNR Rec_SSNR CC
C 2 952.50 0.0017 0.09 1.000 0.000 400.0145 479.40 0.1222
C 3 476.25 0.0033 0.19 1.000 0.000 159.3959 159.98 0.1586
C 4 317.50 0.0050 0.92 0.999 0.000 48.2248 43.27 0.0155
C 5 238.12 0.0067 0.42 1.000 0.000 88.3074 76.69 0.2637
C 6 190.50 0.0083 0.48 0.999 0.000 64.0162 56.25 0.4148
C 7 158.75 0.0100 1.41 0.992 0.000 17.1695 15.64 0.1282
C 8 136.07 0.0117 5.56 0.954 0.000 6.8244 6.47 0.0171
C 9 119.06 0.0133 1.49 0.993 0.000 16.1918 16.42 0.2729
C 10 105.83 0.0150 1.68 0.990 0.000 12.8313 13.83 0.3729
C 11 95.25 0.0167 3.55 0.969 0.000 6.8012 7.95 0.2624
C 12 86.59 0.0183 16.00 0.830 0.000 2.5273 3.13 0.0826

Untested, but this should work:
awk '$1 == "C"{printf "%s\t%s\n", $3,$6}' <filename>
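The same pattern extends to the other columns the question mentions. In the data rows, Part_FSC is field 7 and CC is field 10 (the header line splits RING RAD into two words, so its field numbering differs by one), so an equally untested variant for the general case would be:
awk '$1 == "C"{printf "%s\t%s\n", $7,$10}' <filename>
If the two header-like lines (C sqrt sqrt and the column-name line) get in the way, add a test that the second field is a plain number, e.g. for the original RESOL/FSC extraction:
awk '$1 == "C" && $2 ~ /^[0-9]+$/{printf "%s\t%s\n", $3,$6}' <filename>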

Related

Reorganize aggregate results in pandas multilevel-columns dataframe

I have a dataframe with 217 rows:
df
A C
fm ae fm ae
0 0.491 0.368 0.789 0.789
1 0.369 0.333 0.433 0.412
2 0.372 0.276 0.772 0.759
3 0.346 0.300 0.474 0.391
4 0.213 0.161 0.323 0.312
.. ... ... ... ...
212 0.000 0.000 1.000 1.000
213 1.000 1.000 1.000 1.000
214 1.000 1.000 1.000 1.000
215 1.000 1.000 1.000 1.000
216 1.000 1.000 1.000 1.000
[217 rows x 4 columns]
I need to report the mean results, and I get the following:
df.mean()
A fm 0.548
ae 0.508
C fm 0.671
ae 0.650
dtype: float64
But I would like the output to look like this:
fm ae
A 0.548 0.508
C 0.671 0.650
I tried df.mean().groupby(level=0), but that did not give me the table I want.
What is the cleanest way to achieve this organization of aggregate results?
One way is to unstack it:
out = df.mean().unstack()
or you could stack it, then use groupby + mean:
out = df.stack(level=0).groupby(level=1).mean()
Output:
ae fm
A 0.508 0.548
C 0.650 0.671
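For reference, a minimal self-contained sketch of both routes (the column layout is taken from the question, but the frame is filled with random values, not the question's data):
import numpy as np
import pandas as pd

# Same two-level column layout as in the question.
cols = pd.MultiIndex.from_product([['A', 'C'], ['fm', 'ae']])
df = pd.DataFrame(np.random.rand(217, 4), columns=cols)

out1 = df.mean().unstack()                        # index A/C, one column per inner level
out2 = df.stack(level=0).groupby(level=1).mean()  # same table via stack + groupby
print(out1)
print(out2)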

combining two dataframes just based on the rows, without considering respective index

I have two dataframes, df1 and df2:
A1 A2
0 0.001 0.002
1 100 200
2 0.3 0.4
B1 B2
86 0.0002 0.003
12 0.2 0.3
123 -0.001 0.000
Due to their generation mechanism them, their indexes are not matched, and one of the dataframes was not even indexed orderly. I need to just combine these two dataframes regardless of index, the expected result is as follows.
A1 A2 B1 B2
0 0.001 0.002 0.0002 0.003
1 100 200 0.2 0.3
2 0.3 0.4 -0.001 0.000
In other words, I just need to combine them based on row order. I tried
pd.concat([df1,df2],index=1,ignore_index=True) and pd.DataFrame([df1,df2]); neither works.
Use pd.concat:
>>> pd.concat([df1.reset_index(drop=True),
df2.reset_index(drop=True)], axis=1)
A1 A2 B1 B2
0 0.001 0.002 0.0002 0.003
1 100.000 200.000 0.2000 0.300
2 0.300 0.400 -0.0010 0.000
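A minimal self-contained sketch of the same idea, with the two frames rebuilt from the values shown in the question:
import pandas as pd

df1 = pd.DataFrame({'A1': [0.001, 100, 0.3], 'A2': [0.002, 200, 0.4]})
df2 = pd.DataFrame({'B1': [0.0002, 0.2, -0.001], 'B2': [0.003, 0.3, 0.000]},
                   index=[86, 12, 123])

# Dropping both indexes first makes concat align purely on row position.
combined = pd.concat([df1.reset_index(drop=True),
                      df2.reset_index(drop=True)], axis=1)
print(combined)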

Fill a pandas dataframe by counting strings in an array and adding their values from another array

1) A numpy array r consisting of strings:
import numpy as np
r = np.array([['S', 'S'],['S', 'V1'],['S', 'V2'],['V1', 'S'],['V1', 'V1']])
2) A numpy array acc containing values. The first value refers to the first element of each row of the two-dimensional array r, and the second value refers to the second element:
acc = np.array([0.613,0.387])
3) Question: I want to fill dataframe df1. For example: Row 1) r[0] = ['S', 'S'] contains 'S' in both positions, so fill S = 0.613 + 0.387 = 1.0 in df1, while V1 and V2 are zero because they do not appear in that row. Row 2) r[1] = ['S', 'V1'] contains one 'S', so fill S = 0.613, V1 = 0.387, and V2 = 0 (does not appear), and so on.
Desired output:
import pandas as pd
df1 = pd.DataFrame({'S':[1,0.613,0.613,0.387,0], 'V1': [0,0.387,0,0.613,1],'V2': [0,0,0.387,0,0]})
print(df1)
S V1 V2
0 1.000 0.000 0.000
1 0.613 0.387 0.000
2 0.613 0.000 0.387
3 0.387 0.613 0.000
4 0.000 1.000 0.000
You can stack the dataframe, map the values and pivot back:
# reshape to long form: one row per (row, column) cell of r
s = pd.DataFrame(r).stack().reset_index(name='val')
# replace the column position (0 or 1) with its weight from acc
s['level_1'] = acc[s['level_1']]
# pivot back: one row per original row of r, one column per label
s.pivot_table(index='level_0',
              columns='val',
              values='level_1',
              aggfunc='sum',
              fill_value=0)
Output:
val S V1 V2
level_0
0 1.000 0.000 0.000
1 0.613 0.387 0.000
2 0.613 0.000 0.387
3 0.387 0.613 0.000
4 0.000 1.000 0.000
Another way, using pd.get_dummies(), np.vectorize and df.groupby() on axis=1:
df = pd.get_dummies(pd.DataFrame(r), prefix='', prefix_sep='')
s = pd.Series(acc, index=range(1, len(acc) + 1))
final = (pd.DataFrame(np.vectorize(s.get)(np.where(df.eq(1), df.cumsum(axis=1), df)),
                      columns=df.columns)
           .groupby(df.columns, axis=1).sum())
S V1 V2
0 1.000 0.000 0.000
1 0.613 0.387 0.000
2 0.613 0.000 0.387
3 0.387 0.613 0.000
4 0.000 1.000 0.000
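Neither answer spells the rule out in plain code, so here is a minimal, deliberately unvectorised sketch of the same computation (the label list ['S', 'V1', 'V2'] is read off the desired output):
import numpy as np
import pandas as pd

r = np.array([['S', 'S'], ['S', 'V1'], ['S', 'V2'], ['V1', 'S'], ['V1', 'V1']])
acc = np.array([0.613, 0.387])

# One output row per row of r; each label accumulates the weight of the
# position (first or second) in which it occurs.
out = pd.DataFrame(0.0, index=range(len(r)), columns=['S', 'V1', 'V2'])
for i, row in enumerate(r):
    for label, weight in zip(row, acc):
        out.loc[i, label] += weight
print(out)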

Position of cells where the midpoint of the graph lies

I have the following data.
x y
0.00 0.00
0.03 1.74
0.05 2.60
0.08 3.04
0.11 3.47
0.13 3.90
0.16 4.33
0.19 4.59
0.21 4.76
0.20 3.90
0.18 3.12
0.18 2.60
0.16 2.17
0.15 1.73
0.13 1.47
0.12 1.21
0.14 2.60
0.17 3.47
0.21 3.90
0.23 4.33
0.26 4.76
0.28 5.19
0.31 5.45
0.33 5.62
0.37 5.79
0.38 5.97
0.42 6.14
0.44 6.22
0.47 6.31
0.49 6.39
0.51 6.48
I used =MAX()/2 to obtain the half-maximum (the 50% level), which in this case is 3.24.
The value 3.24 does not appear among the y values, but it falls between 3.04 and 3.47.
How can I find the address of these 2 cells?
Note: The 50th percentile also hits on the other part of the graph, but I only require the first instance.
Assuming your data is in columns A and B, with a header in row 1 (first numbers in row 2), and your =MAX()/2 formula is in D2:
Use AGGREGATE to determine the first row where the Y value exceeds the value in D2. Then do it again and subtract 1 from the row number.
=AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)
That will return row number 6, the first occurrence exceeding the value in D2.
=AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)-1
That will give you row number 5.
Use the row numbers in conjunction with INDEX and you can pull the X values.
=INDEX(A:A,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)-1)
=INDEX(A:A,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1))
That will give you the X values. If you want the corresponding Y values, simply change the INDEX lookup range from A:A to B:B.
=INDEX(B:B,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)-1)
=INDEX(B:B,AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1))
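If you literally need the addresses of the two cells rather than their values, the same row numbers can be fed to ADDRESS (an untested sketch; the 2 assumes the y values sit in column B):
=ADDRESS(AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1)-1,2)
=ADDRESS(AGGREGATE(15,6,ROW($B$2:$B$32)/(B2:B32>D2),1),2)
These return text like $B$5 and $B$6.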

RODBC read error where an Excel column contains leading NAs

I have been reading Excel sheets into R using the RODBC package and have hit an issue with the Excel ODBC driver. Columns that contain (sufficient) leading NAs are coerced to logical.
In Excel the data appears as follows:
period n n.ft n.pt
1/02/1985 0.008 NA 0.025
1/03/1985 -0.003 NA -0.024
1/04/1985 0.002 NA 0.015
1/05/1985 0.006 NA 0.012
1/06/1985 0.001 NA 0.003
1/07/1985 0.005 NA 0.010
1/08/1985 0.006 NA 0.001
1/09/1985 0.007 NA 0.013
1/10/1985 -0.002 NA 0.009
1/11/1985 0.013 NA 0.019
1/12/1985 -0.004 NA -0.021
1/01/1986 0.008 NA 0.009
1/02/1986 0.002 NA 0.009
1/03/1986 0.002 -0.003 1.000
1/04/1986 0.010 -0.003 0.041
1/05/1986 0.000 -0.001 -0.004
1/06/1986 0.005 0.003 0.005
1/07/1986 -0.003 0.005 0.012
1/08/1986 -0.001 -0.003 -0.021
1/09/1986 0.003 -0.001 0.012
1/10/1986 0.003 0.003 0.010
1/11/1986 -0.003 0.003 -0.003
1/12/1986 0.003 -0.003 0.022
1/01/1987 0.001 0.013 -0.004
1/02/1987 0.004 -0.004 0.011
1/03/1987 0.004 0.008 0.005
1/04/1987 0.000 0.002 -0.002
1/05/1987 0.001 0.002 0.006
1/06/1987 0.004 0.010 0.00
I read in the data with:
require(RODBC)
conexcel <- odbcConnectExcel(xls.file="C:/data/example.xls")
s1 <- 'SOx'
dd <- sqlFetch(conexcel, s1)
odbcClose(conexcel)
This reads the entire second column of data (n.ft) in as NA. I think this is because the column type is guessed to be logical, so the subsequent numbers are treated as invalid and hence NA.
> str(dd)
'data.frame': 29 obs. of 4 variables:
$ period: POSIXct, format: "1985-02-01" "1985-03-01" ...
$ n : num 0.00833 -0.00338 0.00157 0.00562 0.00117 ...
$ n#ft : logi NA NA NA NA NA NA ...
$ n#pt : num 0.02515 -0.02394 0.0154 0.01224 0.00301 ...
I am trying to find a way to prevent this coercion to logical, which I think is causing the subsequent error.
I found this Q+A by searching SO; however, I am at work and have no hope of being permitted to edit the registry to change the DWORD default, as suggested (I understand that the value set there determines how many rows are scanned before the driver settles on a data type and bombs my read).
Right now, I'm thinking that the best solution is to invert the data in Excel and read it into R upside-down.
I love a good hack but surely there's a better solution?
This is not a bug, but a feature of ODBC (note the lack of R), as documented here:
http://support.microsoft.com/kb/257819/en-us
(long page; search for "mixed data type").
Since reading Excel files with ODBC is rather limited, I prefer one of the alternatives mentioned by Gabor, with a preference for XLConnect.
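For example, a minimal sketch of the XLConnect route (untested here; the file path and the SOx sheet name are taken from the question):
library(XLConnect)
wb <- loadWorkbook("C:/data/example.xls")
dd <- readWorksheet(wb, sheet = "SOx")
str(dd)
# If the guessed types are still off, readWorksheet() also accepts colTypes /
# forceConversion arguments to push individual columns to numeric.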
