Pandas: start and stop parsing after a delimiter keyword - python-3.x

I'm a chemist dealing with Potential Energy Distributions. The output is kind of messy (some lines use more columns than others) and there are several analyses in one file, so I'd like to start and stop parsing when I see specific "keywords" or signs like "***".
Here is an example of my input:
Average max. Potential Energy <EPm> = 41.291
TED Above 100 Factor TAF=0.011
Average coordinate population 1.000
s 1 1.00 STRE 4 7 NH 1.015024 f3554 100
s 2 1.00 STRE 2 1 CH 1.096447 f3127 13 f3126 13 f3073 37 f3073 34
s 3 1.00 STRE 2 5 CH 1.094347 f3127 38 f3126 36 f3073 12 f3073 11
s 4 1.00 STRE 6 8 CH 1.094349 f3127 36 f3126 38 f3073 11 f3073 13
s 5 1.00 STRE 2 3 CH 1.106689 f2950 48 f2944 46
s 6 1.00 STRE 6 9 CH 1.106696 f2950 47 f2944 47
s 7 1.00 STRE 6 10 CH 1.096447 f3127 12 f3126 13 f3073 33 f3073 38
s 8 1.00 STRE 4 2 NC 1.450644 f1199 43 f965 39
s 9 1.00 STRE 4 6 NC 1.450631 f1199 43 f965 39
s 10 1.00 BEND 7 4 6 HNC 109.30 f1525 12 f1480 42 f781 18
s 11 1.00 BEND 1 2 3 HCH 107.21 f1528 33 f1525 21 f1447 12
s 12 1.00 BEND 5 2 1 HCH 107.42 f1493 17 f1478 36 f1447 20
s 13 1.00 BEND 8 6 10 HCH 107.42 f1493 17 f1478 36 f1447 20
s 14 1.00 BEND 3 2 5 HCH 108.14 f1525 10 f1506 30 f1480 14 f1447 13
s 15 1.00 BEND 9 6 8 HCH 108.13 f1525 10 f1506 30 f1480 14 f1447 13
s 16 1.00 BEND 10 6 9 HCH 107.20 f1528 33 f1525 21 f1447 12
s 17 1.00 BEND 6 4 2 CNC 112.81 f383 85
s 18 1.00 TORS 7 4 2 1 HNCH -172.65 f1480 10 f781 55
s 19 1.00 TORS 1 2 4 6 HCNC 65.52 f1192 27 f1107 14 f243 18
s 20 1.00 TORS 5 2 4 6 HCNC -176.80 f1107 17 f269 35 f243 11
s 21 1.00 TORS 8 6 4 2 HCNC -183.20 f1107 17 f269 35 f243 11
s 22 1.00 TORS 3 2 4 6 HCNC -54.88 f1273 26 f1037 22 f243 19
s 23 1.00 TORS 9 6 4 2 HCNC 54.88 f1273 26 f1037 22 f243 19
s 24 1.00 TORS 10 6 4 2 HCNC -65.52 f1192 30 f1107 18 f243 21
****
9 STRE modes:
1 2 3 4 5 6 7 8 9
8 BEND modes:
10 11 12 13 14 15 16 17
7 TORS modes:
18 19 20 21 22 23 24
19 CH modes:
2 3 4 5 6 7 11 12 13 14 15 16 18 19 20 21 22 23 24
0 USER modes:
alternative coordinates 25
k 10 1.00 BEND 7 4 2 HNC 109.30
k 11 1.00 BEND 1 2 4 HCN 109.41
k 12 1.00 BEND 5 2 4 HCN 109.82
k 13 1.00 BEND 8 6 4 HCN 109.82
k 14 1.00 BEND 3 2 1 HCH 107.21
k 15 1.00 BEND 9 6 4 HCN 114.58
k 16 1.00 BEND 10 6 8 HCH 107.42
k 18 1.00 TORS 7 4 2 5 HNCH -54.98
k 18 1.00 TORS 7 4 2 3 HNCH 66.94
k 18 1.00 OUT 4 2 6 7 NCCH 23.30
k 19 1.00 OUT 2 3 5 1 CHHH 21.35
k 19 1.00 OUT 2 1 5 3 CHHH 21.14
k 19 1.00 OUT 2 3 1 5 CHHH 21.39
k 20 1.00 OUT 2 1 4 5 CHNH 21.93
k 20 1.00 OUT 2 5 4 1 CHNH 21.88
k 20 1.00 OUT 2 1 5 4 CHHN 16.36
k 21 1.00 TORS 8 6 4 7 HCNH 54.98
k 21 1.00 OUT 6 10 9 8 CHHH 21.39
k 22 1.00 OUT 2 1 4 3 CHNH 20.12
k 22 1.00 OUT 2 5 4 3 CHNH 19.59
k 23 1.00 TORS 9 6 4 7 HCNH -66.94
k 23 1.00 OUT 6 8 4 9 CHNH 19.59
k 24 1.00 TORS 10 6 4 7 HCNH -187.34
k 24 1.00 OUT 6 9 4 10 CHNH 20.32
k 24 1.00 OUT 6 8 4 10 CHNH 21.88
I'd like to skip the first 3 lines (I know how to do that with skiprows=3), then stop parsing at the "***" and accommodate my content into 11 columns with predefined names like "tVib1", "%PED1", "tVib2", "%PED2", etc.
After that, in this same file, I'll have to start parsing again after the words "alternative coordinates", also into 11 columns.
This looks very hard for me to achieve.
Any help is much appreciated.
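To make the goal concrete, here is the rough shape of what I'm after, as a sketch only (the file name and the generated column labels are placeholders, not part of my data):
import pandas as pd

#a minimal sketch: cut the raw text at the markers first,
#then hand each block to pandas separately ("example.dd2" is a placeholder)
with open("example.dd2") as fh:
    text = fh.read()
#everything before the **** separator, minus the first three header lines
block1 = text.split("****")[0].splitlines()[3:]
#everything after the line containing "alternative coordinates"
block2 = text.split("alternative coordinates")[1].splitlines()[1:]

#split each line on whitespace; the DataFrame constructor pads short rows with None
df1 = pd.DataFrame([ln.split() for ln in block1 if ln.strip()])
df2 = pd.DataFrame([ln.split() for ln in block2 if ln.strip()])
df1.columns = ["c" + str(i) for i in df1.columns]  #rename to tVib1, %PED1, ... as needed
print(df1.head())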

For the .dd2 file provided, I used another strategy. The implicit assumptions are:
1) a line is only converted when it starts either with lower case - space - digit or with at least five whitespaces, followed by at least one upper case word
2) if missing, the first, the third and every f-column is reused from the last line
3) the third column contains the first upper case word
4) if the gap between the two upper case columns is smaller than a given variable max_col, NaN is introduced for the missing values
5) the f value columns start two columns after the second upper case column
import re
import pandas as pd
import numpy as np

def align_columns(file_name, col_names=["ID", "N1", "S1", "N2", "N3", "N4", "N5", "S2", "N6"], max_col=4):
    #max_col: number of columns between the two capitalised columns
    #col_names: names for the first columns, N = number, S = string; adapt to your needs
    #both parameters are optional
    #collect all data sets as a list of lists
    all_lines = []
    last_id, last_cat, last_fval = 0, 0, []
    #open the file for reading
    for line_in in open(file_name, "r"):
        #use only lines that start either
        #with lower case - space - digit or with at least five spaces
        #and have an upper case word in the line
        start_str = re.match(r"([a-z]\s\d|\s{5,}).*[A-Z]+", line_in)
        if not start_str:
            continue
        #split data columns into chunks, using 2 or more whitespaces as delimiter
        sep_items = re.split(r"\s{2,}", line_in.strip())
        #if the ID is missing, use the information from the last line
        if not re.match(r"[a-z]\s\d", sep_items[0]):
            sep_items.insert(0, last_id)
            sep_items.insert(2, last_cat)
            sep_items.extend(last_fval)
        #otherwise keep the information in case it is missing from the next line
        else:
            last_id = sep_items[0]
            last_cat = sep_items[2]
        #get the indices of the two columns with upper case words
        index_upper = [i for i, item in enumerate(sep_items) if item.isupper()]
        if len(index_upper) < 2 or index_upper[0] != 2 or index_upper[1] > index_upper[0] + max_col + 1:
            print("Irregular format, skipped line:")
            print(line_in)
            continue
        #store the f values in case they are missing from the next line
        last_fval = sep_items[index_upper[1] + 2:]
        #if there are not enough columns between the two capitalised columns, fill with NaN
        if index_upper[1] < 3 + max_col:
            fill_nan = [np.nan] * (3 + max_col - index_upper[1])
            sep_items[index_upper[1]:index_upper[1]] = fill_nan
        #append to the list
        all_lines.append(sep_items)
    #create a pandas dataframe from the list
    df = pd.DataFrame(all_lines)
    #convert columns to float, if possible
    df = df.apply(pd.to_numeric, errors='ignore', downcast='float')
    #label columns according to the col_names list and add f0, f1, ... at the end
    df.columns = [col_names[i] if i < len(col_names) else "f" + str(i - len(col_names)) for i in df.columns]
    return df

#-----------------main script--------------
#use the function's default parameters
conv_file = align_columns("a1-91a.dd2")
print(conv_file)
#use custom parameters for the labels and the number of fill columns
col_labels = ["X1", "Y1", "Z1", "A1", "A2", "A3", "A4", "A5", "A6", "Z2", "B1"]
conv_file2 = align_columns("a1-91a.dd2", col_labels, 6)
print(conv_file2)
This is more flexible than the first solution: the number of f value columns is not restricted to a specific number.
The example shows how to use it both with the standard parameters defined by the function and with custom parameters. It is surely not the most beautiful solution, and I am happy to upvote any more elegant one. But it works, at least in my Python 3.5 environment. If there are any problems with a data file, please let me know.
P.S.: The solution to convert the appropriate columns into float was provided by jezrael.

This doesn't seem that hard - you have already described everything you want; you only need to translate it into Python. First, you can read your whole file and store it as a list of lines:
with open(filename, 'r') as file_in:
    lines = file_in.readlines()
then you can begin reading from line 3 and parse until you find the "***" (note the loop must run while the marker is not yet found, i.e. while find() returns -1):
ind = 3
while lines[ind].find('***') == -1:
    tmp = lines[ind]
    ... do what you want with tmp ...
    ind = ind + 1
and then you can keep going, replacing find("...") with whatever keyword you need.
To process each line tmp, you can use very useful Python string methods like tmp.split() and tmp.strip(), convert strings to numbers, and so on.
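Put together, a minimal runnable version of that idea could look like this (a sketch; "data.txt" is a placeholder file name):
import pandas as pd

with open("data.txt", 'r') as file_in:
    lines = file_in.readlines()

rows = []
ind = 3
#collect rows until the separator line is reached
while ind < len(lines) and lines[ind].find('***') == -1:
    rows.append(lines[ind].split())
    ind = ind + 1
#rows of unequal length are padded with None by the DataFrame constructor
df = pd.DataFrame(rows)
print(df.head())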

I made a first script according to your example here on SO. It is not very flexible - it assumes that the first three columns are filled with values and then aligns the two columns with uppercase words by filling the up to four columns in between with NaN where necessary. The reason to fill with NaN is that pandas functions like .sum() or .mean() ignore it when calculating the value for a column.
import re
import io
import pandas as pd

#adapt this part to your needs
#enforce reading 12 columns, N = number, S = string, F = f number
col_names = ["ID", "N1", "S1", "N2", "N3", "N4", "N5", "S2", "N6", "F1", "F2", "F3"]
#only import lines that start with these patterns
startline = ("s ", "k ")
#number of columns between the two capitalised columns
max_col = 4
#create a temporary file-like object to feed to the csv reader later
pan_wr = io.StringIO()
#open the file for reading
for line in open("test.txt", "r"):
    #check whether the row should be ignored
    if line.startswith(startline):
        #find the text between the two capitalised columns
        col_betw = re.search(r"\s{2,}([A-Z]+.*)\s{2,}[A-Z]+\s{2,}", line).group(1)
        #determine how many elements this segment contains
        nr_col_betw = len(re.split(r"\s{2,}", col_betw.strip()))
        #test whether there are not enough numbers
        if nr_col_betw <= max_col:
            #fill with NA, which the pandas csv reader interprets as NaN
            subst = col_betw + " NA" * (max_col - nr_col_betw + 1)
            line = line.replace(col_betw, subst, 1)
        #write the new line into the file-like object
        pan_wr.writelines(line)
#reset the pointer for the csv reader
pan_wr.seek(0)
#the csv reader creates a data frame from the file-like object, splitting at a delimiter of two or more whitespaces
#index_col: the first column is not treated as an index; names: names for the columns
df = pd.read_csv(pan_wr, delimiter=r"\s{2,}", index_col=False, names=col_names, engine="python")
print(df)
This works well, but it can't deal with the .dd2 file you posted later. I am currently testing a different approach for that.
To be continued...
P.S.: I found conflicting information on the use of index_col=False by the csv reader. Some say you should now use index_col=None to suppress converting the first column into the index, but that didn't work in my tests.
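For what it's worth, a quick check of the variant used above (a sketch; the two-line sample string is made up): with names given and index_col=False, the first data column stays a regular column and the frame keeps a default RangeIndex.
import io
import pandas as pd

#made-up sample in the same whitespace-separated style
raw = "s 1 1.00\ns 2 1.00\n"
df = pd.read_csv(io.StringIO(raw), delimiter=r"\s+", index_col=False,
                 names=["ID", "N1", "N2"], engine="python")
print(df.index)   #RangeIndex(start=0, stop=2, step=1)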

Related

How to drop columns of csv data in J

I have a lot of csv files from which I have to drop the date column.
I have a J line that reads a csv file into a numeric array: rdtabfile =: (0&".;.2#:(TAB&,)#:}:);._2) # ReadFile #<
If you know the column number of the date column, I would just use a mask across each line of the array and the copy (#) dyadic verb.
[ t =: i. 4 5
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
mask=: ~: [: i. # NB. x would be the column to be dropped, y is the numeric matrix
delcol=: (mask # ])"1
1 delcol t
0 2 3 4
5 7 8 9
10 12 13 14
15 17 18 19
delcola=: ((~: [: i. #) # ])"1 NB. can be done in one line
2 delcola t
0 1 3 4
5 6 8 9
10 11 13 14
15 16 18 19

Replacing the first column values according to the second column pattern

How can I use regex to replace values in a DataFrame - here, the 5th column - according to a pattern in the 1st column? Column 5 consists only of ones for now. However, I would like to start changing this column whenever the pattern 34444 appears in the 1st column. The program is then supposed to replace the ones with 11111, 22222, 33333, etc., incrementing each time the pattern appears, until the end of the file.
Sample of the file:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 1 3 138.998480 12.596951 0.223780
22 12 1 4 138.333252 11.884713 -0.281429
23 13 1 4 139.498084 13.356891 -0.480091
24 14 1 4 139.710930 11.981460 0.697098
25 15 1 4 138.452807 13.136061 0.990663
Expected result:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 2 3 138.998480 12.596951 0.223780
22 12 2 4 138.333252 11.884713 -0.281429
23 13 2 4 139.498084 13.356891 -0.480091
24 14 2 4 139.710930 11.981460 0.697098
25 15 2 4 138.452807 13.136061 0.990663
Yeah, if you really want re, there is a way. But I doubt it would really be more efficient than a for loop.
1. re.finditer
import pandas as pd
import numpy as np
import re

# present col 1 as a string of digits
arr1 = df['1'].values
str1 = "".join([str(i) for i in arr1])
ans = np.ones(len(str1), dtype=int)
# every time the pattern is found, increase all later elements by 1
for match in re.finditer('34444', str1):
    e = match.end()
    ans[e:] += 1
# replace column 5
df['5'] = ans
# Output
df[['0', '5', '1']]
Out[50]:
0 5 1
11 1 1 1
12 2 1 1
13 3 1 1
14 4 1 1
15 5 1 1
16 6 1 3
17 7 1 4
18 8 1 4
19 9 1 4
20 10 1 4
21 11 2 3
22 12 2 4
23 13 2 4
24 14 2 4
25 15 2 4
2. naïve for-loop
This checks the array directly, element by element. Compared with re.finditer, no typecasting is involved, but an explicit for loop is written. The same output is obtained. Please benchmark it yourself if efficiency becomes relevant, say, with tens of millions of rows.
arr1 = df['1'].values
ans = np.ones(len(arr1), dtype=int)
n = len(arr1)
for i, el in enumerate(arr1):
    # termination
    if i > n - 5:
        break
    # ignore non-3 elements
    if el != 3:
        continue
    # if the pattern is found, increase all later elements by 1
    if np.all(arr1[i+1:i+5] == 4):
        ans[i+5:] += 1
df['5'] = ans
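For reference, the df used in both snippets can be rebuilt from the question's sample (a sketch; only a few rows are repeated here). The header row "0 5 1 2 3 4" names six columns, so pandas treats the leading field of each data row as the index:
import io
import pandas as pd

sample = """0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
16 6 1 3 138.998480 12.596951 0.223780
"""
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")
print(df[['0', '5', '1']])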

How to set a variable space with right alignment for a string in Python?

I'm trying to write a program where, given a number N, one has to print the decimal, octal, hexadecimal and binary representations of all numbers in the range 1 to N. The trouble is that the platform requires the output in a particular format.
Suppose the number is 17; then the output should look like:
1 1 1 1
2 2 2 10
3 3 3 11
4 4 4 100
5 5 5 101
6 6 6 110
7 7 7 111
8 10 8 1000
9 11 9 1001
10 12 A 1010
11 13 B 1011
12 14 C 1100
13 15 D 1101
14 16 E 1110
15 17 F 1111
16 20 10 10000
17 21 11 10001
For 7 it would be:
1 1 1 1
2 2 2 10
3 3 3 11
4 4 4 100
5 5 5 101
6 6 6 110
7 7 7 111
If you notice, the above is required to be printed such that the decimal, octal and hexadecimal numbers get a minimum of 2 spaces at their left, whereas the binary numbers need at least one space at their left. As the length of the numbers increases, the space has to grow accordingly so that the minimum space is kept even for the longest number. So, how do I print them using a variable space? So far I have tried this:
Code
def print_formatted(number):
    space = len(str(bin(number))[2:])
    for i in range(1, number+1):
        print('{:2d}'.format(i), end='')
        print('{:>3s}'.format(str(oct(i))[2:]), end='')
        print('{:>3s}'.format(str(hex(i))[2:]), end='')
        print('{:>'+str(space)+'s}'.format(str(bin(i))[2:]))

print_formatted(17)
Here, I just tried the variable space with the binary numbers, but it gives me an error:
print('{:>'+str(space)+'s}'.format(str(bin(i))[2:]))
ValueError: Single '}' encountered in format string
Is there any fix/alternative for this?
Your problem is operator precedence - the + for string concatenation binds more weakly than the method call in
'{:>' + str(space) + 's}'.format(str(bin(i))[2:])
That's why you call .format(...) only on 's}', not on the whole string. And that's where the
ValueError: Single '}' encountered in format string
comes from.
Putting the complete format string into parentheses before applying .format to it fixes that.
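A quick way to see it (just an illustration): the substring that .format() actually receives is 's}', whose lone '}' triggers the error, while the parenthesised version uses the whole template:
space = 5
# 's}'.format('101') raises ValueError: Single '}' encountered in format string
# the parenthesised form formats against the full '{:>5s}' template instead
print(('{:>' + str(space) + 's}').format('101'))   # prints '  101'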
You also need one more space for the binary column, and you can skip some str() calls that are not needed:
def print_formatted(number):
    space = len(str(bin(number))[2:]) + 1  # fix here
    for i in range(1, number+1):
        print('{:2d}'.format(i), end='')
        print('{:>3s}'.format(oct(i)[2:]), end='')
        print('{:>3s}'.format(hex(i)[2:]), end='')
        print(('{:>'+str(space)+'s}').format(bin(i)[2:]))  # fix here

print_formatted(17)
Output:
1 1 1 1
2 2 2 10
3 3 3 11
4 4 4 100
5 5 5 101
6 6 6 110
7 7 7 111
8 10 8 1000
9 11 9 1001
10 12 a 1010
11 13 b 1011
12 14 c 1100
13 15 d 1101
14 16 e 1110
15 17 f 1111
16 20 10 10000
17 21 11 10001
From your given output above, you might need to prepend two spaces to this - I'm not sure if it's a formatting error in your output above or part of the restrictions.
You could also shorten this by using f-strings (and removing the superfluous str() around bin, oct and hex: they all return strings already).
Then you need to calculate the field widths from your input value:
def print_formatted(number):
    de, bi, oc, he = len(str(number)), len(bin(number)), len(oct(number)), len(hex(number))
    for i in range(1, number+1):
        print(f' {i:{de}d}{oct(i)[2:]:>{oc}s}{hex(i)[2:]:>{he}s}{bin(i)[2:]:>{bi}s}')

print_formatted(26)
so that it accommodates values other than 17, e.g. 128:
1 1 1 1
2 2 2 10
3 3 3 11
...
8 10 8 1000
...
16 20 10 10000
...
32 40 20 100000
...
64 100 40 1000000
...
128 200 80 10000000

How can I add an X axis showing plot data seconds to a matplotlib pyplot price volume graph?

The code below plots a price volume chart using data from a tab-separated csv file. Each row contains values for the columns IDX, TRD, TIMESTAMPMS, VOLUME and PRICE. As is, the X axis shows the IDX value. I would like the X axis to display the seconds computed from the timestamp in milliseconds attached to each row. How can this be obtained?
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import pandas as pd
data = pd.read_csv('secondary-2018-08-12-21-32-56.csv', index_col=0, sep='\t')
print(data.head(50))
fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(10,5))
ax[0].plot(data.index, data['PRICE'])
ax[1].bar(data.index, data['VOLUME'])
plt.show()
(The resulting figure shows PRICE as a line in the top panel and VOLUME as bars in the bottom panel, both plotted against IDX.)
Here are the data as displayed by the print(data.head(50)) instruction:
TRD TIMESTAMPMS VOLUME PRICE
IDX
1 4 1534102380000 0.363583 6330.41
2 20 1534102381000 5.509219 6329.13
3 3 1534102382000 0.199049 6328.69
4 5 1534102383000 1.055055 6327.36
5 2 1534102384000 0.006343 6328.26
6 4 1534102385000 0.167502 6330.38
7 1 1534102386000 0.002039 6326.69
8 0 1534102387000 0.000000 6326.69
9 4 1534102388000 0.163813 6327.62
10 2 1534102389000 0.007060 6326.66
11 4 1534102390000 0.015489 6327.64
12 5 1534102391000 0.035618 6328.35
13 2 1534102392000 0.006003 6330.12
14 5 1534102393000 0.172913 6328.77
15 1 1534102394000 0.019972 6328.03
16 3 1534102395000 0.007429 6328.03
17 1 1534102396000 0.000181 6328.03
18 3 1534102397000 1.041483 6328.03
19 2 1534102398000 0.992897 6328.74
20 3 1534102399000 0.061871 6328.11
21 2 1534102400000 0.000123 6328.77
22 4 1534102401000 0.028650 6330.25
23 2 1534102402000 0.035504 6330.01
24 3 1534102403000 0.982527 6330.11
25 5 1534102404000 0.298366 6329.11
26 2 1534102405000 0.071119 6330.06
27 3 1534102406000 0.025547 6330.02
28 2 1534102407000 0.003413 6330.11
29 4 1534102408000 0.431217 6330.05
30 3 1534102409000 0.021627 6330.23
31 1 1534102410000 0.009661 6330.28
32 1 1534102411000 0.004209 6330.27
33 1 1534102412000 0.000603 6328.07
34 6 1534102413000 0.655872 6330.31
35 1 1534102414000 0.000452 6328.09
36 7 1534102415000 0.277340 6328.07
37 8 1534102416000 0.768351 6328.04
38 1 1534102417000 0.078893 6328.20
39 2 1534102418000 0.000446 6326.24
40 2 1534102419000 0.317381 6326.83
41 2 1534102420000 0.100009 6326.24
42 2 1534102421000 0.000298 6326.25
43 6 1534102422000 0.566820 6330.00
44 1 1534102423000 0.000060 6326.30
45 2 1534102424000 0.047524 6326.30
46 4 1534102425000 0.748773 6326.61
47 3 1534102426000 0.007656 6330.23
48 1 1534102427000 0.000019 6326.32
49 1 1534102428000 0.000014 6326.34
50 0 1534102429000 0.000000 6326.34
I believe you need data = data.set_index('TIMESTAMPMS') to get the axis to autoscale.
I don't know if I understood you correctly; try:
data['TIMESTAMPMS'] = data['TIMESTAMPMS']/1000
ax[0].plot(data['TIMESTAMPMS'], data['PRICE'])
ax[1].bar(data['TIMESTAMPMS'], data['VOLUME'])
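Building on that, a sketch that shows seconds elapsed since the first row rather than raw timestamp values (it reuses the data frame loaded in the question):
#seconds elapsed since the first row, computed from the millisecond timestamps
data['SECONDS'] = (data['TIMESTAMPMS'] - data['TIMESTAMPMS'].iloc[0]) / 1000
fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(10, 5))
ax[0].plot(data['SECONDS'], data['PRICE'])
ax[1].bar(data['SECONDS'], data['VOLUME'])
plt.show()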

Variable string formatting in python 3

Input is a number, e.g. 9, and I want to print the decimal, octal, hex and binary values from 1 to 9, like:
1 1 1 1
2 2 2 10
3 3 3 11
4 4 4 100
5 5 5 101
6 6 6 110
7 7 7 111
8 10 8 1000
9 11 9 1001
How can I achieve this in python3 using syntax like
dm, oc, hx, bn = len(str(9)), len(bin(9)[2:]), ...
print("{:dm%d} {:oc%s}" % (i, oct(i[2:]))
I mean, if the number is 999, I want the decimal 10 printed as ' 10', and since the binary equivalent of 999 is the ten-digit 1111100111, I want the binary 1010 printed right-aligned to that width, like ' 1010'.
You can use str.format() and its mini-language to do the whole thing for you:
for i in range(1, 10):
    print("{v} {v:>6o} {v:>6x} {v:>6b}".format(v=i))
Which will print:
1 1 1 1
2 2 2 10
3 3 3 11
4 4 4 100
5 5 5 101
6 6 6 110
7 7 7 111
8 10 8 1000
9 11 9 1001
UPDATE: To define field 'widths' in a variable you can use a format-within-format structure:
w = 5  # field width, i.e. offset to the right for all octal/hex/binary values
for i in range(1, 10):
    print("{v} {v:>{w}o} {v:>{w}x} {v:>{w}b}".format(v=i, w=w))
Or define a different width variable for each field type if you want them non-uniformly spaced.
By the way, since you've tagged your question with python-3.x: if you're using Python 3.6 or newer, you can use Literal String Interpolation (f-strings) to simplify it even more:
w = 5  # field width, i.e. offset to the right for all octal/hex/binary values
for v in range(1, 10):
    print(f"{v} {v:>{w}o} {v:>{w}x} {v:>{w}b}")
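If the widths should also adapt to the input range, w can be derived from the largest value, in the spirit of the other answer (a small variation, not from the original post):
n = 128
w = len(bin(n)) - 2 + 1  # widest binary representation plus one leading space
for v in range(1, n + 1):
    print(f"{v} {v:>{w}o} {v:>{w}x} {v:>{w}b}")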
