Pandas appends duplicate rows even though if statement shouldn't be triggered - python-3.x

I have many csv files that only have one row of data. I need to take data from two of the cells and put them into a master csv file ('new_gal.csv'). Initially this will only contain the headings, but no data.
#The file I am pulling from:
file_name = "N4261_pacs160.csv"
#I have the code written to separate gal_name, cat_name, and cat_num (N4261, pacs, 160)
An example of the csv is given here. I am trying to pull "flux" and "rms" from this file. (Sorry it isn't aligned nicely; I can't figure out the formatting).
name,band,ra,dec,raerr,decerr,flux,snr,snrnoise,stn,rms,strn,fratio,fwhmxfit,fwhmyfit,flag_elong,edgeflag,flag_blend,warmat,obsid,ssomapflag,dist,angle
HPPSC160A_J121923.1+054931,red,184.846389,5.8254,0.000151,0.00015,227.036,10.797,21.028,16.507,13.754,37.448,1.074,15.2,11,0.7237,f,0,f,1342199758,f,1.445729,296.577621
I read this csv and pull the data I need
with open(file_name, 'r') as table:
    reader = csv.reader(table, delimiter=',')
    read = iter(reader)
    next(read)
    for row in read:
        fluxP = row[6]
        errP = row[10]

#Open the master csv with pandas
df = pd.read_csv('new_gal.csv')
The master csv file has format:
Galaxy Cluster Mult. Detect. LumDist z W1 W1 err W2 W2 err W3 W3 err W4 W4 err 70 70 err 100 100 err 160 160 err 250 250 err 350 350 err 500 500 err
The main problem is that I want to search the "Galaxy" column in 'new_gal.csv' for the galaxy name. If it is not there, I need to add a new row with the galaxy name and the flux and error measurements. When I run this multiple times, I get duplicate rows even though the append command is nested in the if statement. I only want it to append a new row if the galaxy name is not already there; otherwise, it should only update the flux and error values for that galaxy.
if cat_name == 'pacs':
    if gal_name not in df["Galaxy"]:
        df = df.append({"Galaxy": gal_name}, ignore_index=True)
        if cat_num == "70":
            df.loc[df.Galaxy == gal_name, ["70"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["70 err"]] = errP
        elif cat_num == "100":
            df.loc[df.Galaxy == gal_name, ["100"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["100 err"]] = errP
        elif cat_num == "160":
            df.loc[df.Galaxy == gal_name, ["160"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["160 err"]] = errP
    else:
        if cat_num == "70":
            df.loc[df.Galaxy == gal_name, ["70"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["70 err"]] = errP
        elif cat_num == "100":
            df.loc[df.Galaxy == gal_name, ["100"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["100 err"]] = errP
        elif cat_num == "160":
            df.loc[df.Galaxy == gal_name, ["160"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["160 err"]] = errP
After running the code 5 times with the same file, I have 5 identical lines in the table.
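A note on where the duplicates come from: in applied to a pandas Series tests membership against the index (like a dict), not against the values, so gal_name not in df["Galaxy"] is effectively always true and the append runs on every pass. A minimal value-based check, reusing the names from the question, could look like this (a sketch, not the full fix):

# check the column's values rather than the Series index
if not df["Galaxy"].isin([gal_name]).any():
    df = df.append({"Galaxy": gal_name}, ignore_index=True)
# equivalently: if gal_name not in df["Galaxy"].values: ...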

I think I've got something that'll work after tinkering with it this morning...
A couple of points... You shouldn't build a DataFrame incrementally in pandas; get the data set up externally, then do one build. In what I have below, I'm building a big dictionary from the small csv files and then using merge to put it together with the master file.
If your .csv files aren't formatted properly, you can either change the split character below or switch to a csv reader, which is a bit more powerful.
You should put all of the smaller .csv files in a folder called 'orig_data' to make this work.
main prog
# galaxy compiler
import os, re
import pandas as pd

# folder location for the small .csvs, NOT the master
data_folder = 'orig_data'  # this folder should be in same directory as program

result = {}
splitter = r'(.+)_([a-zA-Z]+)([0-9]+)\.'  # regex to break up file name into 3 groups

for file in os.listdir(data_folder):
    file_data = {}
    # split up the filename and process
    galaxy, cat_name, cat_num = re.match(splitter, file).groups()
    #print(galaxy, cat_name, cat_num)
    with open(os.path.join(data_folder, file), 'r') as src:
        src.readline()  # read the header and disregard it
        data = src.readline().replace(' ', '').strip().split(',')  # you can change the split char
        flux = float(data[2])
        rms = float(data[3])
        err_tag = cat_num + ' err'
        file_data = {'cat_name': cat_name,
                     cat_num: flux,
                     err_tag: rms}
    result[galaxy] = file_data

df2 = pd.DataFrame.from_dict(result, orient='index')
df2.index.rename('galaxy', inplace=True)
# check the resulting build!
#print(df2)

# build master dataframe
master_df = pd.read_csv('master_data.csv')
#print(master_df.head())

# merge the 2 dataframes on galaxy name.  See the dox on merge for other
# options and whether you want an "outer" join or other type of join...
master_df = master_df.merge(df2, how='outer', on='galaxy')

# convert boolean flags properly
conv = {'t': True, 'f': False}
master_df['flag_nova'] = master_df['flag_nova'].map(conv).astype('bool')

print(master_df)
print()
print(master_df.info())
print()
print(master_df.describe())
example data files in orig_data folder
filename: A99_dbc100.csv
band,weight,flux,rms
junk, 200.44,2e5,2e-8
filename: B250_pacs100.csv
band,weight,flux,rms
nada,2.44,19e-5, 74
...etc.
example master csv
galaxy,color,stars,flag_nova
A99,red,15,f
B250,blue,4e20,t
N1000,green,3e19,f
X99,white,12,t
Result:
galaxy color stars ... 200 err 100 100 err
0 A99 red 1.500000e+01 ... NaN 200000.00000 2.000000e-08
1 B250 blue 4.000000e+20 ... NaN 0.00019 7.400000e+01
2 N1000 green 3.000000e+19 ... 88.0 NaN NaN
3 X99 white 1.200000e+01 ... NaN NaN NaN
[4 rows x 9 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 9 columns):
galaxy 4 non-null object
color 4 non-null object
stars 4 non-null float64
flag_nova 4 non-null bool
cat_name 3 non-null object
200 1 non-null float64
200 err 1 non-null float64
100 2 non-null float64
100 err 2 non-null float64
dtypes: bool(1), float64(5), object(3)
memory usage: 292.0+ bytes
None
stars 200 200 err 100 100 err
count 4.000000e+00 1.0 1.0 2.000000 2.000000e+00
mean 1.075000e+20 1900000.0 88.0 100000.000095 3.700000e+01
std 1.955121e+20 NaN NaN 141421.356103 5.232590e+01
min 1.200000e+01 1900000.0 88.0 0.000190 2.000000e-08
25% 1.425000e+01 1900000.0 88.0 50000.000143 1.850000e+01
50% 1.500000e+19 1900000.0 88.0 100000.000095 3.700000e+01
75% 1.225000e+20 1900000.0 88.0 150000.000048 5.550000e+01
max 4.000000e+20 1900000.0 88.0 200000.000000 7.400000e+01
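If the goal is to write the combined table back out as the new master file, pandas can do that directly (a sketch, reusing the master file name from the code above):

master_df.to_csv('master_data.csv', index=False)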

Related

Dataframe creation from text file and regex: python code optimisation

I want to extract patterns from a textfile and create pandas dataframe.
Each line inside the text file look like this:
2022-07-01,08:00:57.853, +12-34 = 1.11 (0. AA), a=0, b=1 cct= p=0 f=0 r=0 pb=0 pbb=0 prr=2569 du=89
I want to extract the following patterns:
+12-34, 1.11, a=0, b=1 cct= p=0 f=0 r=0 p=0 pbb=0 prr=2569 du=89 where cols={id,res,a,b,p,f,r,pb,pbb,prr,du}.
I have written the following the code to extract patterns and create dataframe. The file is around 500MB containing huge amount of rows.
files = glob.glob(path_torawfolder + "*.txt")
lines = []
for fle in files:
    with open(fle) as f:
        items = {}
        lines += f.readlines()

df = pd.DataFrame()
for l in lines:
    feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
    feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
    feature_dict["id"] = (l.split("+")[-1]).split(" =")[0]
    feature_dict["res"] = re.findall(r'(\d\.\d{2})', feature_interest)[0]
    dct = {k: [v] for k, v in feature_dict.items()}
    series = pd.DataFrame(dct)
    #print(series)
    df = pd.concat([df, series], ignore_index=True)
Any suggestions to optimize the code and reduce the processing time, please?
Thanks!
A bit of improvement: in the previous code, there were a few unnecessary conversions from dict to DataFrame.
dicts = []

def create_dataframe():
    df = pd.DataFrame()
    for l in lines:
        feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
        feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
        feature_dict["id"] = (l.split("+")[-1]).split(" =")[0]
        feature_dict["res"] = re.findall(r'(\d\.\d{2})', feature_interest)[0]
        dicts.append(feature_dict)
    df = pd.DataFrame(dicts)
    return df
Line #       Hits         Time    Per Hit   % Time  Line Contents
     8                                              def create_dataframe():
     9          1        551.0      551.0      0.0      df = pd.DataFrame()
    10    1697339     727220.0        0.4      1.7      for l in lines:
    11    1697338    1706328.0        1.0      4.0          feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
    12    1697338   20857891.0       12.3     49.1          feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
    13
    14    1697338    1987874.0        1.2      4.7          feature_dict["ctry_provider"] = (l.split("+")[-1]).split(" =")[0]
    15    1697338    9142820.0        5.4     21.5          feature_dict["acpa_codes"] = re.findall(r'(\d\.\d{2})', feature_interest)[0]
    16    1697338    1039880.0        0.6      2.4          dicts.append(feature_dict)
    17
    18          1    7025303.0  7025303.0     16.5      df = pd.DataFrame(dicts)
    19          1          2.0        2.0      0.0      return df
The improvement reduced the computation to a few minutes. Any more suggestions to optimize it, for example with dask or parallel computing?
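One further idea before reaching for dask or multiprocessing (a sketch, assuming the id/res patterns used above and the lines list from the question): let pandas run the regexes internally with Series.str.extract instead of looping over every line in Python.

import pandas as pd

s = pd.Series(lines)
# vectorised extraction of the two scalar fields; the key=value pairs
# would still need a loop or str.extractall plus an unstack/pivot step
ids = s.str.extract(r'\+(\S+) =', expand=False)
res = s.str.extract(r'(\d\.\d{2})', expand=False)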

Python code to split my excel column value based on delimiter & write 1st split value to same column and 2nd to new column created next to that

In my Excel sheet I have data of the kind below...
sys_id    Domain                     location    tax_amount    tp_category
8746                                 BLR         60000         link:IT,value:63746
2864      link:EFT,value:874887      HYD         50000
3624      link:Cred,value:897076     CHN         55000
7354                                 BLR         60000         link:sales,value:83746
I want the output in my Excel in the format below...
sys_id    Domain_link    Domain_value    location    tp_category_link    tp_category_value
8746                                     BLR         IT                  63746
2864      EFT            874887          HYD
3624      Cred           897076          CHN
7354                                     BLR         sales               83746
Please help me with the method or logic I should follow to get the data in the above format.
I have a huge amount of data, and I need to compare it with another Excel file that holds the target data.
You can use pandas, assuming that your input file is called b1.xlsx:
import pandas as pd
import re
from pandas.api.types import is_string_dtype

VALUE_LINK_REGEX = re.compile('^.*link:(.*),value:(.*)$')
df = pd.read_excel('b1.xlsx', engine='openpyxl')
cols_to_drop = []
for col in filter(lambda c: is_string_dtype(df[c]), df.columns):
    m = df[col].map(lambda x: None if pd.isna(x) else VALUE_LINK_REGEX.match(x))
    # skip columns with strings different from the expected format
    if m.notna().sum() == 0:
        continue
    cols_to_drop.append(col)
    df[f'{col}_link'] = m.map(lambda x: None if x is None else x.groups()[0])
    df[f'{col}_value'] = m.map(lambda x: None if x is None else x.groups()[1])
df.drop(columns=cols_to_drop, inplace=True)
df.to_excel('b2.xlsx', index=False, engine='openpyxl')
This is the resulting df:
sys_id location tax_amount Domain_link Domain_value tp_category_link \
0 8746 BLR 60000 None None IT
1 2864 HYD 50000 EFT 874887 None
2 3624 CHN 55000 Cred 897076 None
3 7354 BLR 60000 None None sales
tp_category_value
0 63746
1 None
2 None
3 83746
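A possibly simpler variant (a sketch, not part of the answer above) uses Series.str.extract with named groups on each affected column, e.g. for Domain:

pattern = r'link:(?P<link>[^,]+),value:(?P<value>.+)'
extracted = df['Domain'].str.extract(pattern)   # rows without a match become NaN
df['Domain_link'] = extracted['link']
df['Domain_value'] = extracted['value']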

How do I remove square brackets from my dataframe?

I'm trying to create a comparison between my predicted and actual values.
Here is my try:
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['Op1', 'Op2', 'S2', 'S3', 'S4', 'S7', 'S8', 'S9', 'S11', 'S12', 'S13', 'S14', 'S15', 'S17', 'S20', 'S21']], df.unit)
predicted = []
actual = []
for i in range(1, len(df.unit.unique())):
    xp = df[(df.unit == i) & (df.cycles == len(df[df.unit == i].cycles))]
    xa = xp.cycles.values
    xp = xp.values[0, 2:].reshape(1, -2)
    predicted.append(reg.predict(xp))
    actual.append(xa)
and to display the dataframe:
data = {'Actual cycles': actual, 'Predicted cycles': predicted }
df_2 = pd.DataFrame(data)
df_2.head()
I will get an output:
Actual cycles Predicted cycles
0 [192] [56.7530579842869]
1 [287] [50.76877712361329]
2 [179] [42.72575900074571]
3 [189] [42.876506912637524]
4 [269] [47.40087182743173]
Ignoring the values that are way off, how do I remove the square brackets in the dataframe? And is there a neater way to write my code? Thank you!
print(df_2)
Actualcycles Predictedcycles
0 [192] [56.7530579842869]
1 [287] [50.76877712361329]
2 [179] [42.72575900074571]
3 [189] [42.876506912637524]
4 [269] [47.40087182743173]
df=df_2.apply(lambda x:x.str.strip('[]'))
print(df)
Actualcycles Predictedcycles
0 192 56.7530579842869
1 287 50.76877712361329
2 179 42.72575900074571
3 189 42.876506912637524
4 269 47.40087182743173
Below is a minimal example of your cycles column with brackets:
import pandas as pd

df = pd.DataFrame({
    'cycles': [[192], [287], [179], [189], [269]]
})
This code gets you the column without brackets:
df['cycles'] = df['cycles'].str[0]
The output looks like this:
print(df)
cycles
0 192
1 287
2 179
3 189
4 269
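Since each list holds a single value, a neater option (a sketch against the loop in the question, where each appended item is a 1-element array) is to flatten while building the comparison frame, so the brackets never appear:

import numpy as np
import pandas as pd

df_2 = pd.DataFrame({
    'Actual cycles': np.concatenate(actual),        # each xa is a 1-element array
    'Predicted cycles': np.concatenate(predicted),  # each reg.predict(xp) is a 1-element array
})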

passing parameters in groupby aggregate function

I have a dataframe, referenced as df in the code below, and I'm applying aggregate functions on multiple columns of each group. I also applied the user-defined lambda functions f4, f5, f6, f7. Some functions are very similar, like f4, f6 and f7, where only the parameter value differs. Can I pass these parameters from dictionary d, so that I have to write only one function instead of several?
f4 = lambda x: len(x[x > 10])    # count the frequency of bearing greater than threshold value
f4.__name__ = 'Frequency'
f5 = lambda x: len(x[x < 3.4])   # count the stop points with velocity less than threshold value 3.4
f5.__name__ = 'stop_frequency'
f6 = lambda x: len(x[x > 0.2])   # count the points with velocity greater than threshold value 0.2
f6.__name__ = 'frequency'
f7 = lambda x: len(x[x > 0.25])  # count the points with acceleration greater than threshold value 0.25
f7.__name__ = 'frequency'

d = {'acceleration': ['mean', 'median', 'min'],
     'velocity': [f5, 'sum', 'count', 'median', 'min'],
     'velocity_rate': f6,
     'acc_rate': f7,
     'bearing': ['sum', f4],
     'bearing_rate': 'sum',
     'Vincenty_distance': 'sum'}

df1 = df.groupby(['userid', 'trip_id', 'Transportation_Mode', 'segmentid'], sort=False).agg(d)
# flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
# MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
I would like to write a function like
f4(p) = lambda x: len(x[x>p])
f4.__name__ = 'Frequency'

d = {'acceleration': ['mean', 'median', 'min'],
     'velocity': [f5, 'sum', 'count', 'median', 'min'],
     'velocity_rate': f4(0.2),
     'acc_rate': f4(0.25),
     'bearing': ['sum', f4(10)],
     'bearing_rate': 'sum',
     'Vincenty_distance': 'sum'}
The csv file of dataframe df is available at given link for more clarity of data.
https://drive.google.com/open?id=1R_BBL00G_Dlo-6yrovYJp5zEYLwlMPi9
It is possible, though not straightforward; this builds on the solution by neilaronson.
The solution is also simplified by summing the True values of a boolean mask.
def f4(p):
    def ipf(x):
        return (x < p).sum()
        # your solution
        # return len(x[x < p])
    ipf.__name__ = 'Frequency'
    return ipf

d = {'acceleration': ['mean', 'median', 'min'],
     'velocity': [f4(3.4), 'sum', 'count', 'median', 'min'],
     'velocity_rate': f4(0.2),
     'acc_rate': f4(.25),
     'bearing': ['sum', f4(10)],
     'bearing_rate': 'sum',
     'Vincenty_distance': 'sum'}

df1 = df.groupby(['userid', 'trip_id', 'Transportation_Mode', 'segmentid'], sort=False).agg(d)
# flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
# MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
EDIT: You can also pass a parameter for greater or less:
def f4(p, op):
    def ipf(x):
        if op == 'greater':
            return (x > p).sum()
        elif op == 'less':
            return (x < p).sum()
        else:
            raise ValueError("second argument has to be greater or less only")
    ipf.__name__ = 'Frequency'
    return ipf

d = {'acceleration': ['mean', 'median', 'min'],
     'velocity': [f4(3.4, 'less'), 'sum', 'count', 'median', 'min'],
     'velocity_rate': f4(0.2, 'greater'),
     'acc_rate': f4(.25, 'greater'),
     'bearing': ['sum', f4(10, 'greater')],
     'bearing_rate': 'sum',
     'Vincenty_distance': 'sum'}

df1 = df.groupby(['userid', 'trip_id', 'Transportation_Mode', 'segmentid'], sort=False).agg(d)
# flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
# MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
print (df1.head())
print (df1.head())
userid trip_id segmentid Transportation_Mode acceleration_mean \
0 141 1.0 1 walk 0.061083
1 141 2.0 1 walk 0.109148
2 141 3.0 1 walk 0.106771
3 141 4.0 1 walk 0.141180
4 141 5.0 1 walk 1.147157
acceleration_median acceleration_min velocity_Frequency velocity_sum \
0 -1.168583e-02 -2.994428 1000.0 1506.679506
1 1.665535e-09 -3.234188 464.0 712.429005
2 -3.055414e-08 -3.131293 996.0 1394.746071
3 9.241707e-09 -3.307262 340.0 513.461259
4 -2.609489e-02 -3.190424 493.0 729.702854
velocity_count velocity_median velocity_min velocity_rate_Frequency \
0 1028 1.294657 0.284747 288.0
1 486 1.189650 0.284725 134.0
2 1020 1.241419 0.284733 301.0
3 352 1.326324 0.339590 93.0
4 504 1.247868 0.284740 168.0
acc_rate_Frequency bearing_sum bearing_Frequency bearing_rate_sum \
0 169.0 81604.187066 884.0 -371.276356
1 89.0 25559.589869 313.0 -357.869944
2 203.0 -71540.141199 57.0 946.382581
3 78.0 9548.920765 167.0 -943.184805
4 93.0 -24021.555784 67.0 535.333624
Vincenty_distance_sum
0 1506.679506
1 712.429005
2 1395.328768
3 513.461259
4 731.823664
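For what it's worth, a variant of the same closure idea (a sketch, not part of the original answer) passes a comparison function instead of the 'greater'/'less' string, which avoids the if/elif branch:

import operator

def f4(p, op=operator.gt):
    def ipf(x):
        return op(x, p).sum()   # count the values where the comparison holds
    ipf.__name__ = 'Frequency'
    return ipf

# usage: f4(10) counts x > 10, f4(3.4, operator.lt) counts x < 3.4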

how to replace a cell in a pandas dataframe

After forming the below python pandas dataframe (for example)
import pandas
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pandas.DataFrame(data,columns=['Name','Age'])
If I iterate through it, I get
In [62]: for i in df.itertuples():
    ...:     print( i.Index, i.Name, i.Age )
    ...:
0 Alex 10
1 Bob 12
2 Clarke 13
What I would like to achieve is to replace the value of a particular cell
In [67]: for i in df.itertuples():
    ...:     if i.Name == "Alex":
    ...:         df.at[i.Index, 'Age'] = 100
    ...:
Which seems to work
In [64]: df
Out[64]:
Name Age
0 Alex 100
1 Bob 12
2 Clarke 13
The problem appears when I use a larger, different dataset and do the following:
First, I create a new column named NETELEMENT with a default value of "".
Then I would like to replace the default value "" with the string that the function lookup_netelement returns.
df['NETELEMENT'] = ""
for i in df.itertuples():
    df.at[i.Index, 'NETELEMENT'] = lookup_netelement(i.PEER_SRC_IP)
    print( i, lookup_netelement(i.PEER_SRC_IP) )
But what I get as a result is:
Pandas(Index=769, SRC_AS='', DST_AS='', COMMS='', SRC_COMMS=nan, AS_PATH='', SRC_AS_PATH=nan, PREF='', SRC_PREF='0', MED='0', SRC_MED='0', PEER_SRC_AS='0', PEER_DST_AS='', PEER_SRC_IP='x.x.x.x', PEER_DST_IP='', IN_IFACE='', OUT_IFACE='', PROTOCOL='udp', TOS='0', BPS=35200.0, SRC_PREFIX='', DST_PREFIX='', NETELEMENT='', IN_IFNAME='', OUT_IFNAME='') routerX
meaning that it should be:
NETELEMENT='routerX' instead of NETELEMENT=''
Could you please advise what I am doing wrong ?
EDIT: for completeness, lookup_netelement is defined as
def lookup_netelement(ipaddr):
    try:
        x = LOOKUP['conn'].hget('ipaddr;{}'.format(ipaddr), 'dev') or b""
    except Exception as e:
        logger.error('looking up `ipaddr` for netelement caused `{}`'.format(repr(e)), exc_info=True)
        x = b""
    x = x.decode("utf-8")
    return x
It sounds like you are looking for where for conditional replacement, i.e.
def wow(x):
    return x ** 10

df['new'] = df['Age'].where(~(df['Name'] == 'Alex'), wow(df['Age']))
Output :
Name Age new
0 Alex 10 10000000000
1 Bob 12 12
2 Clarke 13 13
3 Alex 15 576650390625
Based on your edit, you're trying to apply the function, i.e.
df['new'] = df['PEER_SRC_IP'].apply(lookup_netelement)
Edit: For your comment on sending two columns, use a lambda with axis=1, i.e.
def wow(x, y):
    return '{} {}'.format(x, y)

df.apply(lambda x: wow(x['Name'], x['Age']), 1)
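As a side note (a sketch, assuming the small Name/Age frame from the question), the itertuples loop in the question can usually be replaced by a vectorised boolean assignment:

# set Age to 100 for every row whose Name is 'Alex', without iterating
df.loc[df['Name'] == 'Alex', 'Age'] = 100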
