I have a dataframe that I split into 3 segments. In a for loop I calculate features for each segment; I want to join the results of the 3 iterations into one row, renaming each feature column with a 1, 2 or 3 suffix.
splitt = np.array_split(df, 3)
for x in splitt:
    x1 = pd.DataFrame([np.mean(x['A'])], columns=['mean'])
    x1['std'] = np.std(x['B'])
Desired result:
mean_A_1 std_B_1 mean_A_2 std_B_2 mean_A_3 std_B_3
You can use:
df = pd.DataFrame({'A':[0,1,5,5,8,9,6],'B':[5,8,0,9,5,6,3]})
splitt=np.array_split(df,3)
L = {k: v for i, x in enumerate(splitt, 1)
        for k, v in {f'mean_A_{i}': x['A'].mean(), f'std_B_{i}': np.std(x['B'])}.items()}

df1 = pd.DataFrame([L])
print(df1)
mean_A_1 std_B_1 mean_A_2 std_B_2 mean_A_3 std_B_3
0 2.0 3.299832 6.5 2.0 7.5 1.5
Another solution with a loop:
L = []
splitt = np.array_split(df, 3)
for i, x in enumerate(splitt, 1):
    d = {f'mean_A_{i}': x['A'].mean(),
         f'std_B_{i}': np.std(x['B'])}
    L.append(pd.Series(d))

df1 = pd.concat(L).to_frame().T
print(df1)
mean_A_1 std_B_1 mean_A_2 std_B_2 mean_A_3 std_B_3
0 2.0 3.299832 6.5 2.0 7.5 1.5
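For more segments or more statistics, the two approaches above can also be expressed as a single groupby over a chunk label; a sketch on the same sample data (the `labels` array and the aggregation spec are my own choices, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 5, 5, 8, 9, 6], 'B': [5, 8, 0, 9, 5, 6, 3]})

# Tag every row with its chunk number (1-based), mirroring np.array_split.
chunks = np.array_split(df, 3)
labels = np.concatenate([np.full(len(c), i) for i, c in enumerate(chunks, 1)])

agg = df.groupby(labels).agg(mean_A=('A', 'mean'),
                             std_B=('B', lambda s: np.std(s)))

# Flatten to one row: mean_A_1, std_B_1, mean_A_2, std_B_2, ...
row = {f'{col}_{i}': agg.loc[i, col] for i in agg.index for col in agg.columns}
df1 = pd.DataFrame([row])
print(df1)
```

This scales to any number of chunks or statistics by extending the `agg` spec instead of the dict literal.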
I want to extract patterns from a text file and create a pandas dataframe.
Each line inside the text file looks like this:
2022-07-01,08:00:57.853, +12-34 = 1.11 (0. AA), a=0, b=1 cct= p=0 f=0 r=0 pb=0 pbb=0 prr=2569 du=89
I want to extract the following patterns:
+12-34, 1.11, a=0, b=1 cct= p=0 f=0 r=0 pb=0 pbb=0 prr=2569 du=89 where cols={id,res,a,b,p,f,r,pb,pbb,prr,du}.
I have written the following code to extract the patterns and create the dataframe. The file is around 500MB and contains a huge number of rows.
import glob
import re
import pandas as pd

files = glob.glob(path_torawfolder + "*.txt")

lines = []
for fle in files:
    with open(fle) as f:
        items = {}
        lines += f.readlines()

df = pd.DataFrame()
for l in lines:
    feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
    feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
    feature_dict["id"] = (l.split("+")[-1]).split(" =")[0]
    feature_dict["res"] = re.findall(r'(\d\.\d{2})', feature_interest)[0]
    dct = {k: [v] for k, v in feature_dict.items()}
    series = pd.DataFrame(dct)
    #print(series)
    df = pd.concat([df, series], ignore_index=True)
Any suggestions to optimize the code and reduce the processing time, please?
Thanks!
A bit of improvement: the previous code made a few unnecessary conversions from dict to DataFrame.
dicts = []

def create_dataframe():
    df = pd.DataFrame()
    for l in lines:
        feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
        feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
        feature_dict["id"] = (l.split("+")[-1]).split(" =")[0]
        feature_dict["res"] = re.findall(r'(\d\.\d{2})', feature_interest)[0]
        dicts.append(feature_dict)
    df = pd.DataFrame(dicts)
    return df
Line # Hits Time Per Hit % Time Line Contents
8 def create_dataframe():
9 1 551.0 551.0 0.0 df = pd.DataFrame()
10 1697339 727220.0 0.4 1.7 for l in lines:
11 1697338 1706328.0 1.0 4.0 feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
12 1697338 20857891.0 12.3 49.1 feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
13
14 1697338 1987874.0 1.2 4.7 feature_dict["ctry_provider"] = (l.split("+")[-1]).split(" =")[0]
15 1697338 9142820.0 5.4 21.5 feature_dict["acpa_codes"] = re.findall(r'(\d\.\d{2})',feature_interest)[0]
16 1697338 1039880.0 0.6 2.4 dicts.append(feature_dict)
17
18 1 7025303.0 7025303.0 16.5 df = pd.DataFrame(dicts)
19 1 2.0 2.0 0.0 return df
This improvement reduced the computation to a few minutes. Any more suggestions to optimize, e.g. by using dask or parallel computing?
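One way to cut out the remaining Python-level loop entirely is to let pandas run the regexes over the whole file at once with `str.extract`/`str.extractall`; a sketch under the assumption that every line follows the format above and that the key=value keys are unique within a line (the two sample lines below are made up):

```python
import pandas as pd

lines = [
    "2022-07-01,08:00:57.853, +12-34 = 1.11 (0. AA), a=0, b=1 cct= p=0 f=0 r=0 pb=0 pbb=0 prr=2569 du=89",
    "2022-07-01,08:01:02.101, +56-78 = 2.34 (0. BB), a=1, b=0 cct= p=1 f=0 r=2 pb=0 pbb=3 prr=100 du=12",
]
s = pd.Series(lines)

# One vectorized pass per pattern instead of a loop over lines.
df = s.str.extract(r'\+(?P<id>\S+) = (?P<res>\d\.\d{2})')

# Collect every key=value pair in one call, then pivot the pairs into columns.
pairs = s.str.extractall(r'(?P<key>\S+)=(?P<val>\w+)')
kv = pairs.reset_index().pivot(index='level_0', columns='key', values='val')

df = df.join(kv)
print(df)
```

The regex work then happens inside pandas' C loop rather than once per line in Python, which is usually where most of the remaining time goes.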
The code block below produces this table:
Trial Week Branch Num_Dep Tot_dep_amt
1 1 1 4 4200
1 1 2 7 9000
1 1 3 6 4800
1 1 4 6 5800
1 1 5 5 3800
1 1 6 4 3200
1 1 7 3 1600
. . . . .
. . . . .
1 1 8 5 6000
9 19 40 3 2800
Code:
trials = 10
dep_amount = []
branch = 41
total = []
week = 1
week_num = []
branch_num = []
dep_num = []
trial_num = []
weeks = 20

df = pd.DataFrame()
for a in range(1, trials):
    print("Starting trial", a)
    for b in range(1, weeks):
        for c in range(1, branch):
            depnum = int(np.round(np.random.normal(5, 2, 1) / 1) * 1)
            acc_dep = 0
            for d in range(1, depnum):
                dep_amt = int(np.round(np.random.normal(1200, 400, 1) / 200) * 200)
                acc_dep = acc_dep + dep_amt
            temp = pd.DataFrame.from_records([{'Trial': a, 'Week': b, 'branch': c, 'Num_Dep': depnum, 'Tot_dep_amt': acc_dep}])
            df = pd.concat([df, temp])

df = df[['Trial', 'Week', 'branch', 'Num_Dep', 'Tot_dep_amt']]
df = df.reset_index()
df = df.drop('index', axis=1)
I would like to break the branches apart in the for loop and instead have the resulting df with headers:
Trial Week Branch_1_Num_Dep Branch_1_Tot_dep_amount Branch_2_Num_Dep ... etc.
I know this could be done by generating the df and performing an encoding, but for this task I would like it to be generated in the for loop if possible.
In order to achieve this with minimal changes to your code, you can do something like the following:
df = pd.DataFrame()
for a in range(1, trials):
    print("Starting trial", a)
    for b in range(1, weeks):
        records = {'Trial': a, 'Week': b}
        for c in range(1, branch):
            depnum = int(np.round(np.random.normal(5, 2, 1) / 1) * 1)
            acc_dep = 0
            for d in range(1, depnum):
                dep_amt = int(np.round(np.random.normal(1200, 400, 1) / 200) * 200)
                acc_dep = acc_dep + dep_amt
            records['Branch_{}_Num_Dep'.format(c)] = depnum
            records['Branch_{}_Tot_dep_amount'.format(c)] = acc_dep
        temp = pd.DataFrame.from_records([records])
        df = pd.concat([df, temp])

df = df.reset_index()
df = df.drop('index', axis=1)
Overall, it seems that what you are doing can be done in more elegant and faster ways. I would recommend taking a look at vectorization as a concept (e.g. here).
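As one concrete example of that, the wide layout can be produced from the long frame with a single pivot rather than being built inside the loop; a sketch with made-up values (the toy frame below stands in for the df the original loop builds, and the list-valued `index` argument assumes a reasonably recent pandas):

```python
import pandas as pd

# Long-format frame like the one the original loop builds (toy values).
df = pd.DataFrame({
    'Trial':       [1, 1, 1, 1],
    'Week':        [1, 1, 2, 2],
    'branch':      [1, 2, 1, 2],
    'Num_Dep':     [4, 7, 5, 3],
    'Tot_dep_amt': [4200, 9000, 3800, 1600],
})

# Pivot branches into columns, then flatten the resulting MultiIndex.
wide = df.pivot(index=['Trial', 'Week'], columns='branch',
                values=['Num_Dep', 'Tot_dep_amt'])
wide.columns = [f'Branch_{b}_{stat}' for stat, b in wide.columns]
wide = wide.reset_index()
print(wide)
```

The column order differs slightly from the interleaved Branch_1/Branch_2 order above (all Num_Dep columns come first), but a single reindex fixes that if it matters.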
I have a dataframe (referenced as df in the code) and I'm applying aggregate functions to multiple columns of each group. I also apply user-defined lambda functions f4, f5, f6 and f7. Some functions are very similar, like f4, f6 and f7, where only the parameter value differs. Can I pass these parameters from the dictionary d, so that I only have to write one function instead of several?
f4 = lambda x: len(x[x>10]) # count the frequency of bearing greater than threshold value
f4.__name__ = 'Frequency'
f5 = lambda x: len(x[x<3.4]) # count the stop points with velocity less than threshold value 3.4
f5.__name__ = 'stop_frequency'
f6 = lambda x: len(x[x>0.2]) # count the points with velocity greater than threshold value 0.2
f6.__name__ = 'frequency'
f7 = lambda x: len(x[x>0.25]) # count the points with acceleration greater than threshold value 0.25
f7.__name__ = 'frequency'
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f5, 'sum' ,'count', 'median', 'min'],
'velocity_rate':f6,
'acc_rate':f7,
'bearing':['sum', f4],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
I'd like to write a function like
f4(p) = lambda x: len(x[x>p])
f4.__name__ = 'Frequency'
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f5, 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2),
'acc_rate':f4(0.25),
'bearing':['sum', f4(10)],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
The csv file of dataframe df is available at given link for more clarity of data.
https://drive.google.com/open?id=1R_BBL00G_Dlo-6yrovYJp5zEYLwlMPi9
It is possible, though not easy; this solution is by neilaronson.
It also simplifies each count to the sum of the True values of a boolean mask.
def f4(p):
    def ipf(x):
        # your solution: return len(x[x < p])
        return (x < p).sum()
    ipf.__name__ = 'Frequency'
    return ipf
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f4(3.4), 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2),
'acc_rate':f4(.25),
'bearing':['sum', f4(10)],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
EDIT: You can also pass a parameter to choose between greater and less:
def f4(p, op):
    def ipf(x):
        if op == 'greater':
            return (x > p).sum()
        elif op == 'less':
            return (x < p).sum()
        else:
            raise ValueError("second argument has to be 'greater' or 'less' only")
    ipf.__name__ = 'Frequency'
    return ipf
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f4(3.4, 'less'), 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2, 'greater'),
'acc_rate':f4(.25, 'greater'),
'bearing':['sum', f4(10, 'greater')],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
print (df1.head())
userid trip_id segmentid Transportation_Mode acceleration_mean \
0 141 1.0 1 walk 0.061083
1 141 2.0 1 walk 0.109148
2 141 3.0 1 walk 0.106771
3 141 4.0 1 walk 0.141180
4 141 5.0 1 walk 1.147157
acceleration_median acceleration_min velocity_Frequency velocity_sum \
0 -1.168583e-02 -2.994428 1000.0 1506.679506
1 1.665535e-09 -3.234188 464.0 712.429005
2 -3.055414e-08 -3.131293 996.0 1394.746071
3 9.241707e-09 -3.307262 340.0 513.461259
4 -2.609489e-02 -3.190424 493.0 729.702854
velocity_count velocity_median velocity_min velocity_rate_Frequency \
0 1028 1.294657 0.284747 288.0
1 486 1.189650 0.284725 134.0
2 1020 1.241419 0.284733 301.0
3 352 1.326324 0.339590 93.0
4 504 1.247868 0.284740 168.0
acc_rate_Frequency bearing_sum bearing_Frequency bearing_rate_sum \
0 169.0 81604.187066 884.0 -371.276356
1 89.0 25559.589869 313.0 -357.869944
2 203.0 -71540.141199 57.0 946.382581
3 78.0 9548.920765 167.0 -943.184805
4 93.0 -24021.555784 67.0 535.333624
Vincenty_distance_sum
0 1506.679506
1 712.429005
2 1395.328768
3 513.461259
4 731.823664
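An alternative to comparing strings is to pass a comparison function from the `operator` module, which removes the if/elif chain entirely; a minimal sketch (the `freq` name and the toy frame are mine, not from the question):

```python
import operator
import pandas as pd

def freq(p, op=operator.gt):
    """Count the values for which op(value, p) holds, e.g. value > p."""
    def ipf(x):
        return op(x, p).sum()
    ipf.__name__ = 'Frequency'
    return ipf

df = pd.DataFrame({'velocity': [0.1, 0.3, 5.0], 'bearing': [8, 12, 15]})

stops = freq(3.4, operator.lt)(df['velocity'])  # values below 3.4
high = freq(10)(df['bearing'])                  # values above 10 (default op is greater-than)
print(stops, high)
```

These `freq(...)` callables drop into the aggregation dict d exactly like the f4(p, op) variants above.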
I have a pandas dataframe sorted by a number of columns. Now I'd like to split the dataframe in predefined percentages, so as to extract and name a few segments.
For example, I want to take the first 20% of rows to create the first segment, then the next 30% for the second segment and leave the remaining 50% to the third segment.
How would I achieve that?
Use numpy.split:
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
#print (df)
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print (a)
A B C D E
0 0.543405 0.278369 0.424518 0.844776 0.004719
1 0.121569 0.670749 0.825853 0.136707 0.575093
2 0.891322 0.209202 0.185328 0.108377 0.219697
3 0.978624 0.811683 0.171941 0.816225 0.274074
print (b)
A B C D E
4 0.431704 0.940030 0.817649 0.336112 0.175410
5 0.372832 0.005689 0.252426 0.795663 0.015255
6 0.598843 0.603805 0.105148 0.381943 0.036476
7 0.890412 0.980921 0.059942 0.890546 0.576901
8 0.742480 0.630184 0.581842 0.020439 0.210027
9 0.544685 0.769115 0.250695 0.285896 0.852395
print (c)
A B C D E
10 0.975006 0.884853 0.359508 0.598859 0.354796
11 0.340190 0.178081 0.237694 0.044862 0.505431
12 0.376252 0.592805 0.629942 0.142600 0.933841
13 0.946380 0.602297 0.387766 0.363188 0.204345
14 0.276765 0.246536 0.173608 0.966610 0.957013
15 0.597974 0.731301 0.340385 0.092056 0.463498
16 0.508699 0.088460 0.528035 0.992158 0.395036
17 0.335596 0.805451 0.754349 0.313066 0.634037
18 0.540405 0.296794 0.110788 0.312640 0.456979
19 0.658940 0.254258 0.641101 0.200124 0.657625
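The two split points passed to np.split can also be derived from an arbitrary fraction list with a cumulative sum, so the 20/30/50 example generalizes to any number of segments; a small sketch (the helper name is mine):

```python
import numpy as np
import pandas as pd

def split_by_pct(df, fracs):
    # Cumulative fractions -> cumulative row counts, dropping the final 100% mark,
    # e.g. [0.2, 0.3, 0.5] -> split points after 20% and 50% of the rows.
    cuts = (np.cumsum(fracs)[:-1] * len(df)).astype(int)
    return np.split(df, cuts)

np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list('ABCDE'))
a, b, c = split_by_pct(df, [0.2, 0.3, 0.5])
print(len(a), len(b), len(c))
```

Like the one-liner above, this keeps the original row order, whereas the sample-based approaches below shuffle rows.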
Creating a dataframe with 70% of the values of the original dataframe:
part_1 = df.sample(frac = 0.7)
Creating a dataframe with the remaining 30% of the values:
part_2 = df.drop(part_1.index)
I've written a simple function that does the job; maybe it will help you.
P.S.: the sum of the fractions must be 1. It returns len(fracs) new dfs, so you can pass as long a fractions list as you want (e.g. fracs=[0.1, 0.1, 0.3, 0.2, 0.2]).
np.random.seed(100)
df = pd.DataFrame(np.random.random((99,4)))
def split_by_fractions(df: pd.DataFrame, fracs: list, random_state: int = 42):
    assert sum(fracs) == 1.0, 'fractions sum is not 1.0 (fractions_sum={})'.format(sum(fracs))
    remain = df.index.copy().to_frame()
    res = []
    for i in range(len(fracs)):
        fractions_sum = sum(fracs[i:])
        frac = fracs[i] / fractions_sum
        idxs = remain.sample(frac=frac, random_state=random_state).index
        remain = remain.drop(idxs)
        res.append(idxs)
    return [df.loc[idxs] for idxs in res]

train, test, val = split_by_fractions(df, [0.8, 0.1, 0.1])  # one df per fraction, in order
print(train.shape, test.shape, val.shape)
outputs:
(79, 4) (10, 4) (10, 4)
What I am trying to do is get bootstrap confidence limits by row, regardless of the number of rows, and make a new dataframe from the output. I can currently do this for the entire dataframe, but not by row. The data in my actual program looks similar to what I have below:
0 1 2
0 1 2 3
1 4 1 4
2 1 2 3
3 4 1 4
I want the new dataframe to look something like this with the lower and upper confidence limits:
0 1
0 1 2
1 1 5.5
2 1 4.5
3 1 4.2
The current generated output looks like this:
0 1
0 2.0 2.75
The Python 3 code below generates a mock dataframe and the bootstrap confidence limits for the entire dataframe. The result is a new dataframe with just 2 values, an upper and a lower confidence limit, rather than 4 sets of 2 (one for each row).
import pandas as pd
import numpy as np
import scikits.bootstrap as sci
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)
x= zz.dtypes
print(x)
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
b = sci.ci(a)
b = pd.DataFrame(b)
b = b.T
print(b)
Thank you for any help.
scikits.bootstrap operates by assuming that data samples are arranged by row, not by column. If you want the opposite behavior, just use the transpose, and a statfunction that doesn't combine columns.
import pandas as pd
import numpy as np
import scikits.bootstrap as sci
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
print(zz)
x= zz.dtypes
print(x)
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
b = sci.ci(a.T, statfunction=lambda x: np.average(x, axis=0))
print(b.T)
Below is the answer I ended up figuring out, to create bootstrap CIs by row.
import pandas as pd
import numpy as np
import numpy.random as npr
zz = pd.DataFrame([[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]],
[[1,2],[2,3],[3,6]],[[4,2],[1,4],[4,6]]])
x= zz.dtypes
a = pd.DataFrame(np.array(zz.values.tolist())[:, :, 0],zz.index, zz.columns)
print(a)
def bootstrap(data, num_samples, statistic, alpha):
    n = len(data)
    idx = npr.randint(0, n, (num_samples, n))
    samples = data[idx]
    stat = np.sort(statistic(samples, 1))
    return (stat[int((alpha/2.0)*num_samples)],
            stat[int((1-alpha/2.0)*num_samples)])

cc = list(a.index.values)  # informs the generator of the number of rows

def bootbyrow(cc):
    for xx in range(len(cc)):
        k = a.apply(lambda y: y[xx]).values
        yield list(bootstrap(k, 10000, np.mean, 0.05))

abc = pd.DataFrame(list(bootbyrow(cc)))  # bootstrap ci by row
# the next 4 lines just show that it's working correctly
a0 = bootstrap((a.loc[0,].values),10000,np.mean,0.05)
a1 = bootstrap((a.loc[1,].values),10000,np.mean,0.05)
a2 = bootstrap((a.loc[2,].values),10000,np.mean,0.05)
a3 = bootstrap((a.loc[3,].values),10000,np.mean,0.05)
print(abc)
print(a0)
print(a1)
print(a2)
print(a3)
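The generator above can also be replaced by a fully vectorized version that resamples all rows at once with a single 3-D index array; a sketch of the same percentile bootstrap (the function name and seed are my own, and for simplicity it reuses one set of resample indices across rows):

```python
import numpy as np
import pandas as pd

def bootstrap_rows(a, num_samples=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI of the mean for each row of DataFrame a."""
    rng = np.random.default_rng(seed)
    vals = a.to_numpy()                       # shape (rows, cols)
    n = vals.shape[1]
    # One draw of resample indices, shared by every row.
    idx = rng.integers(0, n, size=(num_samples, n))
    # Fancy indexing yields shape (rows, num_samples, cols); average each resample.
    means = vals[:, idx].mean(axis=2)
    lo = np.percentile(means, 100 * alpha / 2, axis=1)
    hi = np.percentile(means, 100 * (1 - alpha / 2), axis=1)
    return pd.DataFrame({'lower': lo, 'upper': hi}, index=a.index)

a = pd.DataFrame([[1, 2, 3], [4, 1, 4], [1, 2, 3], [4, 1, 4]])
ci = bootstrap_rows(a)
print(ci)
```

With only the loop over resamples pushed into NumPy, 10000 resamples for all rows complete in a single fancy-indexing pass instead of one Python-level bootstrap call per row.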