Building triples from two different data frames - python-3.x

I want to build the triples source --> target --> edge and store these triples in a new dataframe. I have two data frames:
Accident_ID Location CarID_1 CarID_2 DriverID_1 DriverID_2
0 1 Tartu 1000 1001 1 3
1 2 Tallin 1002 1003 2 5
2 3 Tartu 1004 1005 4 6
3 4 Tallin 1006 1007 7 8
User_ID First Name Last Name Age Address Accident_ID ROLE
0 1 Chester Murphy 25 Narva 108, Tartu 1 Driver
1 2 Walter Turner 26 Tilgi 49, Tartu 2 Driver
2 3 Daryl Fowler 25 Piik 67, Tartu 1 Driver
3 4 Ted Nelson 45 Herne 20, Tartu 3 Driver
4 5 Olivia Crawford 38 Kalevi 25, Tartu 2 Driver
5 1 Chester Murphy 25 Narva 108, Tartu 2 Witness
6 6 Amy Miller 27 Riia 408, Tartu 3 Driver
7 7 Tes Smith 25 Narva 108, Tartu 4 Driver
8 8 Josh Blake 36 Parnu 37, Tallin 4 Driver
9 3 Daryl Fowler 25 Piik 67, Tartu 4 Witness
The triples that I have to form follow the pattern shown in the attached diagram.
What is the Python code for this? I have written the code below, but I am getting an error that Witness is not defined:
df3 = df1.merge(df2, on='Accident_ID')
df3["train"] = df3.Accident_ID < 5
df3["train"].value_counts()

triples = []
for _, row in df3[df3["train"]].iterrows():
    if row["ROLE"] == "Driver":
        if row["User_ID"] == row["DriverID_1"]:
            Drives = (row["User_ID"], row["CarID_1"], "Drives")
        elif row["User_ID"] == row["DriverID_2"]:
            Drives = (row["User_ID"], row["CarID_2"], "Drives")
    else:
        Witness = (row["User_ID"], row["Accident_ID"], "Witness")
    Involved_in_first = (row["CarID_1"], row["Accident_ID"], "Involved in")
    Involved_in_second = (row["CarID_2"], row["Accident_ID"], "Involved in")
    Happened_in = (row["Accident_ID"], row["Location"], "Happened in")
    Lives_in = (row["User_ID"], row["Address"], "Lives in")
    triples.extend((Drives, Witness, Involved_in_first, Involved_in_second, Happened_in, Lives_in))

triples_df = pd.DataFrame(triples, columns=["Source", "Target", "Edge"])
triples_df.shape

You should do something like this, and follow the same process for the rest of the edges:
df = df2.merge(df1, on=['Accident_ID'], how='inner')
print(df)

columns = ['Source', 'Target', 'Edge']
rows = []
for i in range(0, df.shape[0]):
    row1 = [
        df.iloc[i]['First_Name'],
        df.iloc[i]['CarID_1'],
        'Drives'
    ]
    row2 = [
        df.iloc[i]['First_Name'],
        df.iloc[i]['Accident_ID'],
        'Witness'
    ]
    rows.append(row1)
    rows.append(row2)

df_g = pd.DataFrame(rows, columns=columns)
print(df_g)
Output:
Source Target Edge
0 Chester 1000 Drives
1 Chester 1 Witness
2 Daryl 1000 Drives
3 Daryl 1 Witness
4 Walter 1002 Drives
5 Walter 2 Witness
6 Olivia 1002 Drives
7 Olivia 2 Witness
8 Chester 1002 Drives
9 Chester 2 Witness
10 Ted 1004 Drives
11 Ted 3 Witness
12 Amy 1004 Drives
13 Amy 3 Witness
14 Tes 1006 Drives
15 Tes 4 Witness
16 Josh 1006 Drives
17 Josh 4 Witness
18 Daryl 1006 Drives
19 Daryl 4 Witness
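To build the remaining edge types (Involved in, Happened in, Lives in), here is a minimal sketch extending the same per-row loop; it assumes the merged frame df and the column names used above (First_Name, CarID_1, CarID_2, Location, Address), which may need adapting to your actual headers:
rows = []
for i in range(df.shape[0]):
    acc = df.iloc[i]['Accident_ID']
    # same two edges as above
    rows.append([df.iloc[i]['First_Name'], df.iloc[i]['CarID_1'], 'Drives'])
    rows.append([df.iloc[i]['First_Name'], acc, 'Witness'])
    # remaining edge types from the question
    rows.append([df.iloc[i]['CarID_1'], acc, 'Involved in'])
    rows.append([df.iloc[i]['CarID_2'], acc, 'Involved in'])
    rows.append([acc, df.iloc[i]['Location'], 'Happened in'])
    rows.append([df.iloc[i]['First_Name'], df.iloc[i]['Address'], 'Lives in'])

df_g = pd.DataFrame(rows, columns=['Source', 'Target', 'Edge']).drop_duplicates()
print(df_g)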

Related

Calculate Percentage using Pandas DataFrame

Of all the medals won by these 5 countries across all Olympics, what is the percentage of medals won by each of them?
I have combined all the Excel files into one using a pandas DataFrame, but I am now stuck on finding the percentages.
Country Gold Silver Bronze Total
0 USA 10 13 11 34
1 China 2 2 4 8
2 UK 1 0 1 2
3 Germany 12 16 8 36
4 Australia 2 0 0 2
0 USA 9 9 7 25
1 China 2 4 5 11
2 UK 0 1 0 1
3 Germany 11 12 6 29
4 Australia 1 0 1 2
0 USA 9 15 13 37
1 China 5 2 4 11
2 UK 1 0 0 1
3 Germany 10 13 7 30
4 Australia 2 1 0 3
Combined data sheet
Code that I have tried so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame()
for f in ['E:\\olympics\\Olympics-2002.xlsx', 'E:\\olympics\\Olympics-2006.xlsx',
          'E:\\olympics\\Olympics-2010.xlsx', 'E:\\olympics\\Olympics-2014.xlsx',
          'E:\\olympics\\Olympics-2018.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.to_excel("E:\\olympics\\combineddata.xlsx")
data = pd.read_excel("E:\\olympics\\combineddata.xlsx")
print(data)

final_Data = {}
for i in data['Country']:
    x = i
    t1 = (data[(data.Country == x)].Total).tolist()
    print("Name of Country=", i, int(sum(t1)))
    final_Data.update({i: int(sum(t1))})
t3 = data.groupby('Country').Total.sum()
t2 = df['Total'].sum()
t4 = t3 / t2 * 100
print(t3)
print(t2)
print(t4)
This is how I got the answer. Now I need to plot it; I want to make a pie chart.
Let's assume you have created the DataFrame as 'df'. Then you can do the following to first group by and then calculate percentages.
df = df.groupby('Country').sum()
df['Gold_percent'] = (df['Gold'] / df['Gold'].sum()) * 100
df['Silver_percent'] = (df['Silver'] / df['Silver'].sum()) * 100
df['Bronze_percent'] = (df['Bronze'] / df['Bronze'].sum()) * 100
df['Total_percent'] = (df['Total'] / df['Total'].sum()) * 100
df = df.round(2)
print (df)
The output will be as follows:
Gold Silver Bronze ... Silver_percent Bronze_percent Total_percent
Country ...
Australia 5 1 1 ... 1.14 1.49 3.02
China 9 8 13 ... 9.09 19.40 12.93
Germany 33 41 21 ... 46.59 31.34 40.95
UK 2 1 1 ... 1.14 1.49 1.72
USA 28 37 31 ... 42.05 46.27 41.38
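If you then want the pie chart mentioned in the question, a minimal matplotlib sketch (assuming the grouped df from above):
import matplotlib.pyplot as plt

# share of total medals per country, using the Total_percent column computed above
df['Total_percent'].plot.pie(autopct='%.2f%%', figsize=(6, 6))
plt.ylabel('')   # hide the default series label
plt.title('Share of total medals by country')
plt.show()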
I do not have the exact dataset that you have, so I am explaining with a similar dataset. Try to add a column with the sum of medals across rows, then find the percentage by dividing each row by the sum of the entire column.
I am posting this as a model; check it:
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'ExshowroomPrice': [21000, 26000, 28000, 34000],
        'RTOPrice': [2200, 250, 2700, 3500]}
df = pd.DataFrame(cars, columns=['Brand', 'ExshowroomPrice', 'RTOPrice'])
Brand ExshowroomPrice RTOPrice
0 Honda Civic 21000 2200
1 Toyota Corolla 26000 250
2 Ford Focus 28000 2700
3 Audi A4 34000 3500
df['percentage'] = (df.ExshowroomPrice + df.RTOPrice) * 100 / (df.ExshowroomPrice.sum() + df.RTOPrice.sum())
print(df)
Brand ExshowroomPrice RTOPrice percentage
0 Honda Civic 21000 2200 19.719507
1 Toyota Corolla 26000 250 22.311942
2 Ford Focus 28000 2700 26.094348
3 Audi A4 34000 3500 31.874203
Hope it's clear.

Groupby and create a new column by randomly assign multiple strings into it in Pandas

Let's say I have students' info (id, age, and class) as follows:
id age class
0 1 23 a
1 2 24 a
2 3 25 b
3 4 22 b
4 5 16 c
5 6 16 d
I want to group by class and create a new column named major by randomly assigning math, art, business, or science to it, which means that for the same class, the major strings are the same.
We may need to use apply(lambda x: random.choice(...)) to achieve this, but I don't know how. Thanks for your help.
Output expected:
id age major class
0 1 23 art a
1 2 24 art a
2 3 25 science b
3 4 22 science b
4 5 16 business c
5 6 16 math d
Use numpy.random.choice with the number of values given by the length of the DataFrame:
df['major'] = np.random.choice(['math', 'art', 'business', 'science'], size=len(df))
print (df)
id age major
0 1 23 business
1 2 24 art
2 3 25 science
3 4 22 math
4 5 16 science
5 6 16 business
EDIT: for same major values per groups use Series.map with dictionary:
c = df['class'].unique()
vals = np.random.choice(['math', 'art', 'business', 'science'], size=len(c))
df['major'] = df['class'].map(dict(zip(c, vals)))
print (df)
id age class major
0 1 23 a business
1 2 24 a business
2 3 25 b art
3 4 22 b art
4 5 16 c science
5 6 16 d math
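As a hedged alternative closer to the apply/random.choice idea from the question, groupby().transform broadcasts a scalar returned per group to every row of that group; a minimal sketch (the seed is only there to make the assignment repeatable):
np.random.seed(0)
options = ['math', 'art', 'business', 'science']
# one random major is drawn per class and broadcast to all rows of that class
df['major'] = df.groupby('class')['class'].transform(lambda s: np.random.choice(options))
print(df)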

Create an aggregate column based on other columns in pandas dataframe

I have a dataframe as below:
import pandas as pd
import numpy as np
import datetime
# initialise data as lists
data = {'group': ["A", "A", "B", "B", "B"],
        'A1_val': [4, 5, 7, 6, 5],
        'A1M_val': [10, 100, 100, 10, 1],
        'AB_val': [4, 5, 7, 6, 5],
        'ABM_val': [10, 100, 100, 10, 1],
        'AM_VAL': [4, 5, 7, 6, 5]}

# Create DataFrame
df1 = pd.DataFrame(data)
df1
group A1_val A1M_val AB_val ABM_val AM_VAL
0 A 4 10 4 10 4
1 A 5 100 5 100 5
2 B 7 100 7 100 7
3 B 6 10 6 10 6
4 B 5 1 5 1 5
Step 1: I want to create columns as below:
A1_agg_val = A1_val + A1M_val (strip the M out of the column prefix; if the names then match, sum them)
Similarly, AB_agg_val = AB_val + ABM_val
Since there is no matching column for 'AM_VAL', AM_agg_val = AM_VAL
My expected output:
group A1_val A1M_val AB_val ABM_val AM_VAL A1_AGG_val AB_AGG_val A_AGG_val
0 A 4 10 4 10 4 14 14 4
1 A 5 100 5 100 5 105 105 5
2 B 7 100 7 100 7 107 107 7
3 B 6 10 6 10 6 16 16 6
4 B 5 1 5 1 5 6 6 5
You can use groupby on axis=1:
out = (df1.assign(**df1.loc[:, df1.columns.str.lower().str.endswith('_val')]
                      .groupby(lambda x: x[:2], axis=1)
                      .sum()
                      .add_suffix('_agg_value')))
print(out)
group A1_val A1M_val AB_val ABM_val AM_VAL A1_agg_value AB_agg_value \
0 A 4 10 4 10 4 14 14
1 A 5 100 5 100 5 105 105
2 B 7 100 7 100 7 107 107
3 B 6 10 6 10 6 16 16
4 B 5 1 5 1 5 6 6
AM_agg_value
0 4
1 5
2 7
3 6
4 5
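If you prefer to follow the stated rule literally (strip a trailing M from the prefix before matching), here is a hedged sketch; note it labels the AM_VAL aggregate A_AGG_val, as in the expected output:
import re

# take only the *_val columns, group them by their prefix with a trailing 'M' removed, and sum
val_cols = df1.columns[df1.columns.str.lower().str.endswith('_val')]
prefix = lambda c: re.sub('M$', '', c.split('_')[0])
agg = df1[val_cols].groupby(prefix, axis=1).sum().add_suffix('_AGG_val')
print(df1.join(agg))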

How to select rows in a DataFrame based on every transition for particular values in a particular column?

I have a DataFrame that has an ID column and a Value column that only consists of (0, 1, 2). I want to capture only those rows where there is a transition from (0-1) or (1-2) in the Value column. This process has to be done for each ID separately.
I tried to group by ID and use a difference aggregation function, so that I can take those rows for which the difference of values is 1. But it fails under certain conditions.
df=df.loc[df['values'].isin([0,1,2])]
df = df.sort_values(by=['Id'])
df.value.diff()
Given DataFrame:
Index UniqID Value
1    a    1
2    a    0
3    a    1
4    a    0
5    a    1
6    a    2
7    b    0
8    b    2
9    b    1
10    b    2
11    b    0
12    b    1
13    c    0
14    c    1
15    c    2
16    c    2
Expected Output:
2    a    0
3    a    1
4    a    0
5    a    1
6    a    2
9    b    1
10    b    2
11    b    0
12    b    1
13    c    0
14    c    1
15    c    2
Only expecting those rows when there is a transition from either 0-1 or 1-2.
Thank you in advance.
Use my solution; it works for groups with tuples of patterns:
np.random.seed(123)
N = 100
d = {
    'UniqID': np.random.choice(list('abcde'), N),
    'Value': np.random.choice([0, 1, 2], N),
}
df = pd.DataFrame(d).sort_values('UniqID')
#print (df)

pat = [(0, 1), (1, 2)]
a = np.array(pat)

s = (df.groupby('UniqID')['Value']
       .rolling(2, min_periods=1)
       .apply(lambda x: np.all(x[None, :] == a, axis=1).any(), raw=True))

mask = (s.mask(s == 0)
         .groupby(level=0)
         .bfill(limit=1)
         .fillna(0)
         .astype(bool)
         .reset_index(level=0, drop=True))

df = df[mask]
print (df)
UniqID Value
99 a 1
98 a 2
12 a 1
63 a 2
38 a 0
41 a 1
9 a 1
72 a 2
64 b 1
67 b 2
33 b 0
68 b 1
57 b 1
71 b 2
10 b 0
8 b 1
61 c 1
66 c 2
46 c 0
0 c 1
40 c 2
21 d 0
74 d 1
15 d 1
85 d 2
6 d 1
88 d 2
91 d 0
83 d 1
4 d 1
34 d 2
96 d 0
48 d 1
29 d 0
84 d 1
32 e 0
62 e 1
37 e 1
55 e 2
16 e 0
23 e 1
Assuming the transition is strictly from 1 -> 2 and 0 -> 1 (this assumption is valid as well).
Similar Sample data:
index,id,value
1,a,1
2,a,0
3,a,1
4,a,0
5,a,1
6,a,2
7,b,0
8,b,2
9,b,1
10,b,2
11,b,0
12,b,1
13,c,0
14,c,1
15,c,2
16,c,2
Load this into a pandas DataFrame.
Then, use the code below:
def grp_trns(x):
    x['dif'] = x.value.diff().fillna(0)
    return pd.DataFrame(list(x[x.dif == 1]['index'] - 1) + list(x[x.dif == 1]['index']))

target_index = df.groupby('id').apply(lambda x: grp_trns(x)).values.squeeze()
print(df[df['index'].isin(target_index)][['index', 'id', 'value']])
It gives the desired dataframe based on the assumption:
index id value
1 2 a 0
2 3 a 1
3 4 a 0
4 5 a 1
5 6 a 2
8 9 b 1
9 10 b 2
10 11 b 0
11 12 b 1
12 13 c 0
13 14 c 1
14 15 c 2
Edit: To include the transition 1 -> 0, below is the updated function:
def grp_trns(x):
    x['dif'] = x.value.diff().fillna(0)
    index1 = list(x[x.dif == 1]['index'] - 1) + list(x[x.dif == 1]['index'])
    index2 = list(x[(x.dif == -1) & (x.value == 0)]['index'] - 1) + list(x[(x.dif == -1) & (x.value == 0)]['index'])
    return pd.DataFrame(index1 + index2)
My version uses shift and diff() to delete all lines with a diff value equal to 0, 2, or -2:
import pandas
import numpy as np

df = pandas.DataFrame({'index': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
                       'UniqId': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
                       'Value': [1, 0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 0, 1, 2, 2]})
df['diff'] = np.NaN
for element in df['UniqId'].unique():
    df['diff'].loc[df['UniqId'] == element] = df.loc[df['UniqId'] == element]['Value'].diff()
df['diff'] = df['diff'].shift(-1)
df = df.loc[(df['diff'] != -2) & (df['diff'] != 2) & (df['diff'] != 0)]
print(df)
I am still waiting for updates about the 2-1 and 1-2 relationship.
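For comparison, a hedged sketch using shift within each group: a row is kept if it is either the start or the end of a 0 -> 1 or 1 -> 2 transition (data recreated from the question's sample):
import pandas as pd

df = pd.DataFrame({
    'UniqID': list('aaaaaabbbbbbcccc'),
    'Value': [1, 0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 0, 1, 2, 2],
})

prev_val = df.groupby('UniqID')['Value'].shift()    # previous value within the same ID
next_val = df.groupby('UniqID')['Value'].shift(-1)  # next value within the same ID

# end of a transition: previous -> current is 0->1 or 1->2
is_end = ((prev_val == 0) & (df['Value'] == 1)) | ((prev_val == 1) & (df['Value'] == 2))
# start of a transition: current -> next is 0->1 or 1->2
is_start = ((df['Value'] == 0) & (next_val == 1)) | ((df['Value'] == 1) & (next_val == 2))

print(df[is_end | is_start])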

Pandas: start and stop parsing after a delimiter keyword

I'm a chemist dealing with Potential Energy Distributions. The output is kind of messy (some lines use more columns than others), and we have several analyses in one file, so I'd like to start and stop parsing when I see specific "keywords" or signs like "***".
Here is an example of my input:
Average max. Potential Energy <EPm> = 41.291
TED Above 100 Factor TAF=0.011
Average coordinate population 1.000
s 1 1.00 STRE 4 7 NH 1.015024 f3554 100
s 2 1.00 STRE 2 1 CH 1.096447 f3127 13 f3126 13 f3073 37 f3073 34
s 3 1.00 STRE 2 5 CH 1.094347 f3127 38 f3126 36 f3073 12 f3073 11
s 4 1.00 STRE 6 8 CH 1.094349 f3127 36 f3126 38 f3073 11 f3073 13
s 5 1.00 STRE 2 3 CH 1.106689 f2950 48 f2944 46
s 6 1.00 STRE 6 9 CH 1.106696 f2950 47 f2944 47
s 7 1.00 STRE 6 10 CH 1.096447 f3127 12 f3126 13 f3073 33 f3073 38
s 8 1.00 STRE 4 2 NC 1.450644 f1199 43 f965 39
s 9 1.00 STRE 4 6 NC 1.450631 f1199 43 f965 39
s 10 1.00 BEND 7 4 6 HNC 109.30 f1525 12 f1480 42 f781 18
s 11 1.00 BEND 1 2 3 HCH 107.21 f1528 33 f1525 21 f1447 12
s 12 1.00 BEND 5 2 1 HCH 107.42 f1493 17 f1478 36 f1447 20
s 13 1.00 BEND 8 6 10 HCH 107.42 f1493 17 f1478 36 f1447 20
s 14 1.00 BEND 3 2 5 HCH 108.14 f1525 10 f1506 30 f1480 14 f1447 13
s 15 1.00 BEND 9 6 8 HCH 108.13 f1525 10 f1506 30 f1480 14 f1447 13
s 16 1.00 BEND 10 6 9 HCH 107.20 f1528 33 f1525 21 f1447 12
s 17 1.00 BEND 6 4 2 CNC 112.81 f383 85
s 18 1.00 TORS 7 4 2 1 HNCH -172.65 f1480 10 f781 55
s 19 1.00 TORS 1 2 4 6 HCNC 65.52 f1192 27 f1107 14 f243 18
s 20 1.00 TORS 5 2 4 6 HCNC -176.80 f1107 17 f269 35 f243 11
s 21 1.00 TORS 8 6 4 2 HCNC -183.20 f1107 17 f269 35 f243 11
s 22 1.00 TORS 3 2 4 6 HCNC -54.88 f1273 26 f1037 22 f243 19
s 23 1.00 TORS 9 6 4 2 HCNC 54.88 f1273 26 f1037 22 f243 19
s 24 1.00 TORS 10 6 4 2 HCNC -65.52 f1192 30 f1107 18 f243 21
****
9 STRE modes:
1 2 3 4 5 6 7 8 9
8 BEND modes:
10 11 12 13 14 15 16 17
7 TORS modes:
18 19 20 21 22 23 24
19 CH modes:
2 3 4 5 6 7 11 12 13 14 15 16 18 19 20 21 22 23 24
0 USER modes:
alternative coordinates 25
k 10 1.00 BEND 7 4 2 HNC 109.30
k 11 1.00 BEND 1 2 4 HCN 109.41
k 12 1.00 BEND 5 2 4 HCN 109.82
k 13 1.00 BEND 8 6 4 HCN 109.82
k 14 1.00 BEND 3 2 1 HCH 107.21
k 15 1.00 BEND 9 6 4 HCN 114.58
k 16 1.00 BEND 10 6 8 HCH 107.42
k 18 1.00 TORS 7 4 2 5 HNCH -54.98
k 18 1.00 TORS 7 4 2 3 HNCH 66.94
k 18 1.00 OUT 4 2 6 7 NCCH 23.30
k 19 1.00 OUT 2 3 5 1 CHHH 21.35
k 19 1.00 OUT 2 1 5 3 CHHH 21.14
k 19 1.00 OUT 2 3 1 5 CHHH 21.39
k 20 1.00 OUT 2 1 4 5 CHNH 21.93
k 20 1.00 OUT 2 5 4 1 CHNH 21.88
k 20 1.00 OUT 2 1 5 4 CHHN 16.36
k 21 1.00 TORS 8 6 4 7 HCNH 54.98
k 21 1.00 OUT 6 10 9 8 CHHH 21.39
k 22 1.00 OUT 2 1 4 3 CHNH 20.12
k 22 1.00 OUT 2 5 4 3 CHNH 19.59
k 23 1.00 TORS 9 6 4 7 HCNH -66.94
k 23 1.00 OUT 6 8 4 9 CHNH 19.59
k 24 1.00 TORS 10 6 4 7 HCNH -187.34
k 24 1.00 OUT 6 9 4 10 CHNH 20.32
k 24 1.00 OUT 6 8 4 10 CHNH 21.88
I'd like to skip the first 3 lines (I know how to do that with skiprows=3), then stop parsing at the "***" and fit my content into 11 columns with predefined names like "tVib1", "%PED1", "tVib2", "%PED2", etc.
After that, in this same file, I'll have to start parsing after the words "alternative coordinates", again into 11 columns.
This looks very hard for me to achieve.
Any help is much appreciated.
For the .dd2 file provided, I used another strategy. The implicit assumptions are:
1) a line is only converted when it starts with either lower case - space - digit or with at least five whitespaces, followed by at least one upper case word
2) if missing, the first, third and every f column is reused from the last line
3) the third column contains the first upper case word
4) if the gap between the two upper case columns is less than a given variable max_col, NaN is introduced for the missing values
5) f value columns start two columns after the second upper case column
import re
import pandas as pd
import numpy as np

def align_columns(file_name, col_names=["ID", "N1", "S1", "N2", "N3", "N4", "N5", "S2", "N6"], max_col=4):
    #max_col: number of columns between the two capitalised columns
    #column names for the first values N = number, S = string, F = f number, adapt to your needs
    #both optional parameters
    #collect all data sets as a list of lists
    all_lines = []
    last_id, last_cat, last_fval = 0, 0, []
    #opening file to read
    for line_in in open(file_name, "r"):
        #use only lines that start either
        #with lower case - space - digit or at least five spaces
        #and have an upper case word in the line
        start_str = re.match("([a-z]\s\d|\s{5,}).*[A-Z]+", line_in)
        if not start_str:
            continue
        #split data columns into chunks using 2 or more whitespaces as a delimiter
        sep_items = re.split("\s{2,}", line_in.strip())
        #if ID is missing use the information from last line
        if not re.match("[a-z]\s\d", sep_items[0]):
            sep_items.insert(0, last_id)
            sep_items.insert(2, last_cat)
            sep_items.extend(last_fval)
        #otherwise keep the information in case it is missing from next line
        else:
            last_id = sep_items[0]
            last_cat = sep_items[2]
        #get index for the two columns with upper case words
        index_upper = [i for i, item in enumerate(sep_items) if item.isupper()]
        if len(index_upper) < 2 or index_upper[0] != 2 or index_upper[1] > index_upper[0] + max_col + 1:
            print("Irregular format, skipped line:")
            print(line_in)
            continue
        #get f values in case they are missing for next line
        last_fval = sep_items[index_upper[1] + 2:]
        #if not enough columns between the two capitalised columns, fill with NaN
        if index_upper[1] < 3 + max_col:
            fill_nan = [np.nan] * (3 + max_col - index_upper[1])
            sep_items[index_upper[1]:index_upper[1]] = fill_nan
        #append to list
        all_lines.append(sep_items)
    #create pandas dataframe from list
    df = pd.DataFrame(all_lines)
    #convert columns to float, if possible
    df = df.apply(pd.to_numeric, errors='ignore', downcast='float')
    #label columns according to col_names list and add f0, f1... at the end
    df.columns = [col_names[i] if i < len(col_names) else "f" + str(i - len(col_names)) for i in df.columns]
    return df

#-----------------main script--------------
#use standard parameters of function
conv_file = align_columns("a1-91a.dd2")
print(conv_file)

#use custom parameters for labels and number of fill columns
col_labels = ["X1", "Y1", "Z1", "A1", "A2", "A3", "A4", "A5", "A6", "Z2", "B1"]
conv_file2 = align_columns("a1-91a.dd2", col_labels, 6)
print(conv_file2)
This is more flexible than the first solution. The number of f value columns is not restricted to a specific number.
The example shows you how to use it with the standard parameters defined by the function and with custom parameters. It is surely not the most beautiful solution, and I am happy to upvote any more elegant one. But it works, at least in my Python 3.5 environment. If there are any problems with a data file, please let me know.
P.S.: The solution to convert the appropriate columns into float was provided by jezrael.
It seems not that hard: you already described all you want; all you need is to translate it to Python. First you can read your whole file and store it in a list of lines:
with open(filename, 'r') as file_in:
    lines = file_in.readlines()
then you can begin reading from line 3 and parse until you find the "***":
ind = 3
while lines[ind].find('***') == -1:
    tmp = lines[ind]
    # ... do what you want with tmp ...
    ind = ind + 1
and then you can keep on doing whatever you need, replacing find("...") with any keyword you need.
To manage each of your lines tmp, you can use very useful Python functions like tmp.split(), tmp.strip(), converting strings to numbers, etc.
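To cut the file into the two blocks you described before doing anything fancier, a minimal sketch; the file name is hypothetical and splitting on plain whitespace is only a first pass:
import pandas as pd

with open("ped_output.txt", "r") as fh:   # hypothetical file name
    lines = fh.readlines()

# locate the two markers
stop = next(i for i, l in enumerate(lines) if "***" in l)
alt = next(i for i, l in enumerate(lines) if "alternative coordinates" in l)

first_block = lines[3:stop]      # skip the 3 header lines, stop at the "***" marker
second_block = lines[alt + 1:]   # everything after "alternative coordinates"

# ragged rows are padded with None by the DataFrame constructor
df_first = pd.DataFrame([l.split() for l in first_block])
df_second = pd.DataFrame([l.split() for l in second_block])
print(df_first.head())
print(df_second.head())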
I made a first script according to your example here on SO. It is not very flexible: it assumes that the first three columns are filled with values and then aligns the two columns with uppercase words by filling the four columns in between with NaN, if necessary. The reason to use this fill value is that pandas functions like .sum() or .mean() ignore it when calculating the value for a column.
import re
import io
import pandas as pd

#adapt this part to your needs
#enforce to read 12 columns, N = number, S = string, F = f number
col_names = ["ID", "N1", "S1", "N2", "N3", "N4", "N5", "S2", "N6", "F1", "F2", "F3"]
#only import lines that start with these patterns
startline = ("s ", "k ")
#number of columns between the two capitalised columns
max_col = 4

#create temporary file like object to feed later to the csv reader
pan_wr = io.StringIO()

#opening file to read
for line in open("test.txt", "r"):
    #checking, if row should be ignored
    if line.startswith(startline):
        #find the text between the two capitalized columns
        col_betw = re.search("\s{2,}([A-Z]+.*)\s{2,}[A-Z]+\s{2,}", line).group(1)
        #determine, how many elements we have in this segment
        nr_col_betw = len(re.split(r"\s{2,}", col_betw.strip()))
        #test, if there are not enough numbers
        if nr_col_betw <= max_col:
            #fill with NA, which is interpreted by pandas csv reader as NaN
            subst = col_betw + " NA" * (max_col - nr_col_betw + 1)
            line = line.replace(col_betw, subst, 1)
        #write into file like object the new line
        pan_wr.writelines(line)

#reset pointer for csv reader
pan_wr.seek(0)
#csv reader creates data frame from file like object, splits at delimiter with more than one whitespace
#index_col: the first column is not treated as an index, names: name for columns
df = pd.read_csv(pan_wr, delimiter=r"\s{2,}", index_col=False, names=col_names, engine="python")
print(df)
This works well, but it can't deal with the .dd2 file you posted later. I am currently testing a different approach for this.
To be continued...
P.S.: I found conflicting information on the use of index_col = False for the csv reader. Some say you should now use index_col = None to suppress converting the first column into the index, but that didn't work in my tests.
