This is my data set
fake_abalone2
Sex Length Diameter Height Whole Weight Shucked Weight Viscera Weight Shell Weight Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
4 K 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
5 K 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.1200 8
I'm getting a syntax error while using the following method. Please help me out.
I want the value in the "Sex" column to change depending on the "Rings" column. If the "Rings" value is less than 10, the corresponding "Sex" value should be changed to 'K'. Otherwise, no change should be made to the "Sex" column.
fake_abalone2["sex"]=fake_abalone2["Rings"].apply(lambda x:"K" if x<10)
File "", line 1
fake_abalone2["sex"]=fake_abalone2["Rings"].apply(lambda x:"K" if x<10)
SyntaxError: invalid syntax
The following method works perfectly:
df1["Sex"] = df1.apply(lambda x: "K" if x.Rings < 10 else x["Sex"], axis=1)
df1 is the dataframe
Sex Length Diameter Height Whole Weight Shucked Weight Viscera Weight Shell Weight Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
1 K 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
2 K 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
4 K 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
5 K 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.1200 8
6 F 0.530 0.415 0.150 0.7775 0.2370 0.1415 0.3300 20
You can use NumPy instead of a lambda function.
Import NumPy with import numpy as np, then you can use the following method to replace the string:
fake_abalone2['Sex'] = np.where(fake_abalone2['Rings'] < 10, 'K', fake_abalone2['Sex'])
The main problem is the output of the lambda function:
.apply(lambda x:"K" if x<10)
A conditional expression must specify a value for when the condition fails, so Python rejects an if without an else. Add one, e.g.:
.apply(lambda x: "K" if x < 10 else None)
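Note that else None blanks out "Sex" for rings of 10 or more. If the goal is to leave those rows unchanged, a boolean-mask assignment is a minimal alternative sketch (column names taken from the question):

# Set "Sex" to "K" only where Rings < 10; all other rows keep their current value.
fake_abalone2.loc[fake_abalone2["Rings"] < 10, "Sex"] = "K"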
I want to write data with headers into a file. The first three lines are unique and can be considered a 'block', which is then repeated with increments in x and y (0.12 and 1, respectively). The data in the file should look like:
#X #Y Xmin Ymin Z
1 1 0.0000 0.000 0.0062
1 2 0.0000 0.350 0.0156
1 3 0.0000 0.750 0.0191
1 4 0.0000 1.000 0.0062
1 5 0.0000 1.350 0.0156
1 6 0.0000 1.750 0.0191
1 7 0.0000 2.000 0.0062
1 8 0.0000 2.350 0.0156
1 9 0.0000 2.750 0.0191
2 1 0.1200 0.000 0.0062
2 2 0.1200 0.350 0.0156
2 3 0.1200 0.750 0.0191
2 4 0.1200 1.000 0.0062
2 5 0.1200 1.350 0.0156
2 6 0.1200 1.750 0.0191
2 7 0.1200 2.000 0.0062
2 8 0.1200 2.350 0.0156
2 9 0.1200 2.750 0.0191
3 1 0.2400 0.000 0.0062
3 2 0.2400 0.350 0.0156
3 3 0.2400 0.750 0.0191
3 4 0.2400 1.000 0.0062
3 5 0.2400 1.350 0.0156
3 6 0.2400 1.750 0.0191
3 7 0.2400 2.000 0.0062
3 8 0.2400 2.350 0.0156
3 9 0.2400 2.750 0.0191
I tried making the first three lines into three lists and writing the headers and first two columns with two nested for loops, but failed to write the repeating three-line block.
l1 = [0.0000, 0.000, 0.0062]
l2 = [0.0000, 0.350, 0.0156]
l3 = [0.0000, 0.750, 0.0191]
pitch_x = 0.12
pitch_y = 1

with open('dataprep_test.txt', 'w') as f:
    f.write('#x #y Xmin Ymin Z \n')
    for i in range(1, 4, 1):
        k = 1
        for j in range(1, 4, 1):
            d_x = pitch_x * (i - 1)
            d_y = pitch_y * (j - 1)
            f.write('%d %d %f %f %f \n' % (i, k, l1[0] + d_x, l1[1] + d_y, l1[2]))
            f.write('%d %d %f %f %f \n' % (i, k + 1, l2[0] + d_x, l2[1] + d_y, l2[2]))
            f.write('%d %d %f %f %f \n' % (i, k + 2, l3[0] + d_x, l3[1] + d_y, l3[2]))
            k = k + 3
Is there a smarter way to do it using Python's built-in functions, structures, and methods (lists, dictionaries, etc.)?
I'd just refactor the data generation into a generator function. You can also easily accept an arbitrary number of vectors.
def generate_data(initial_vectors, pitch_x, pitch_y, i_count=3, j_count=3):
    for i in range(i_count):
        for j in range(j_count):
            d_x = pitch_x * i
            d_y = pitch_y * j
            # Number the rows 1..(j_count * len(initial_vectors)) within each x block,
            # so #Y runs 1..9 as in the desired output.
            for k, (x, y, z) in enumerate(initial_vectors, 1 + j * len(initial_vectors)):
                yield (i + 1, k, x + d_x, y + d_y, z)

def main():
    l1 = [0.0000, 0.000, 0.0062]
    l2 = [0.0000, 0.350, 0.0156]
    l3 = [0.0000, 0.750, 0.0191]
    with open('dataprep_test.txt', 'w') as f:
        f.write('#x #y Xmin Ymin Z \n')
        for i, k, x, y, z in generate_data([l1, l2, l3], pitch_x=0.12, pitch_y=1):
            f.write(f'{i:d} {k:d} {x:f} {y:f} {z:f}\n')

if __name__ == '__main__':
    main()
Furthermore, if a future version of your project wants to use JSON files instead, you can simply json.dumps(list(generate_data(...))), etc.
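For instance, a minimal sketch of that JSON variant (output file name assumed; l1-l3 as defined above):

import json

# Tuples from the generator serialize as JSON arrays.
with open('dataprep_test.json', 'w') as f:
    json.dump(list(generate_data([l1, l2, l3], pitch_x=0.12, pitch_y=1)), f, indent=2)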
You could also do it like this, which builds every part explicitly:
file = r'F:\code\some_file.csv'  # raw string so the backslashes stay literal
some_headers = ['x#', 'y#', 'Xmin', 'Ymin', 'Z']

# new lists
list_x = [1, 1, 1]
list_y = [1, 2, 3]
list_xmin = [0, 0, 0]
list_ymin = [0, 0.35, 0.75]
list_z = [0.0062, 0.0156, 0.0191]

# build new lists with whatever rules you need
for i in range(10):
    list_x.append(i)
    list_y.append(i * 2)
    list_xmin.append(i)
    list_ymin.append(i * 3)
    list_z.append(i)

# write to file
with open(file, 'w') as csvfile:
    # write headers
    for i in some_headers:
        csvfile.write(i + ',')
    csvfile.write('\n')
    # write data, one comma-separated line per row
    for i in range(len(list_x)):
        line_to_write = str(list_x[i]) + ',' + str(list_y[i]) + ',' + str(list_xmin[i])
        line_to_write = line_to_write + ',' + str(list_ymin[i]) + ',' + str(list_z[i])
        line_to_write = line_to_write + '\n'
        csvfile.writelines(line_to_write)

# finished
print('done')
The result is a CSV file that starts like this:
x#,y#,Xmin,Ymin,Z,
1,1,0,0,0.0062
1,2,0,0.35,0.0156
1,3,0,0.75,0.0191
0,0,0,0,0
...
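For what it's worth, the standard-library csv module would handle the delimiters and line endings for you; a minimal sketch under the same assumptions (same file path and lists as above):

import csv

# csv.writer inserts the commas and newlines itself.
with open(file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(some_headers)
    writer.writerows(zip(list_x, list_y, list_xmin, list_ymin, list_z))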
I would like to create a new column where the values are the sum of the last 14 values of column atr1. How can I do it?
I tried
col = a.columns.get_loc('atr1')
a['atrsum'] = a.iloc[-14:,col].sum()
But I get only a fixed value in the new column. The DataFrame is below for reference.
time open high low close volume atr1
0 1620518400000 1.6206 1.8330 1.5726 1.7663 8.830913e+08 NaN
1 1620604800000 1.7662 1.8243 1.5170 1.6423 7.123049e+08 0.3073
2 1620691200000 1.6418 1.7791 1.5954 1.7632 5.243267e+08 0.1837
3 1620777600000 1.7633 1.8210 1.5462 1.5694 5.997101e+08 0.2748
4 1620864000000 1.5669 1.9719 1.5000 1.9296 1.567655e+09 0.4719
... ... ... ... ... ... ... ...
360 1651622400000 0.7712 0.8992 0.7677 0.8985 2.566498e+08 0.1315
361 1651708800000 0.8986 0.9058 0.7716 0.7884 3.649706e+08 0.1342
362 1651795200000 0.7884 0.7997 0.7625 0.7832 2.440587e+08 0.0372
363 1651881600000 0.7832 0.7858 0.7467 0.7604 1.268089e+08 0.0391
364 1651968000000 0.7605 0.7663 0.7254 0.7403 1.751395e+08 0.0409
Use the pandas expanding function; 14 is the minimum number of periods, followed by the column you would like to sum:
a.expanding(14)['atr1'].sum()
I must be missing something in the question, my apologies. I just used the data you shared with the 2 previous days as the window, and this is the result:
df['atrsum'] = df['atr1'].expanding(2).sum()
id time open high low close volume atr1 atrsum
0 0 1620518400000 1.6206 1.8330 1.5726 1.7663 8.830913e+08 NaN NaN
1 1 1620604800000 1.7662 1.8243 1.5170 1.6423 7.123049e+08 0.3073 NaN
2 2 1620691200000 1.6418 1.7791 1.5954 1.7632 5.243267e+08 0.1837 0.4910
3 3 1620777600000 1.7633 1.8210 1.5462 1.5694 5.997101e+08 0.2748 0.7658
4 4 1620864000000 1.5669 1.9719 1.5000 1.9296 1.567655e+09 0.4719 1.2377
5 360 1651622400000 0.7712 0.8992 0.7677 0.8985 2.566498e+08 0.1315 1.3692
6 361 1651708800000 0.8986 0.9058 0.7716 0.7884 3.649706e+08 0.1342 1.5034
7 362 1651795200000 0.7884 0.7997 0.7625 0.7832 2.440587e+08 0.0372 1.5406
8 363 1651881600000 0.7832 0.7858 0.7467 0.7604 1.268089e+08 0.0391 1.5797
9 364 1651968000000 0.7605 0.7663 0.7254 0.7403 1.751395e+08 0.0409 1.6206
Result with a rolling sum:
df['atrsum'] = df['atr1'].rolling(2).sum()
id time open high low close volume atr1 atrsum
0 0 1620518400000 1.6206 1.8330 1.5726 1.7663 8.830913e+08 NaN NaN
1 1 1620604800000 1.7662 1.8243 1.5170 1.6423 7.123049e+08 0.3073 NaN
2 2 1620691200000 1.6418 1.7791 1.5954 1.7632 5.243267e+08 0.1837 0.4910
3 3 1620777600000 1.7633 1.8210 1.5462 1.5694 5.997101e+08 0.2748 0.4585
4 4 1620864000000 1.5669 1.9719 1.5000 1.9296 1.567655e+09 0.4719 0.7467
5 360 1651622400000 0.7712 0.8992 0.7677 0.8985 2.566498e+08 0.1315 0.6034
6 361 1651708800000 0.8986 0.9058 0.7716 0.7884 3.649706e+08 0.1342 0.2657
7 362 1651795200000 0.7884 0.7997 0.7625 0.7832 2.440587e+08 0.0372 0.1714
8 363 1651881600000 0.7832 0.7858 0.7467 0.7604 1.268089e+08 0.0391 0.0763
9 364 1651968000000 0.7605 0.7663 0.7254 0.7403 1.751395e+08 0.0409 0.0800
At least for me the answer from Naveed did not work, but I found a different way:
a['atrsum'] = a['atr1'].rolling(window=14).apply(sum).dropna()
This gives me the result. (a['atr1'].rolling(window=14).sum() is equivalent and faster.)
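To see the windowing on its own, a minimal sketch on toy data (column names taken from the question):

import pandas as pd

a = pd.DataFrame({'atr1': range(1, 21)})  # stand-in for the real data
# Sum over the last 14 values; the first 13 rows lack a full window and get NaN.
a['atrsum'] = a['atr1'].rolling(window=14).sum()
print(a.tail())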
I'm working on a Kaggle project. Below is my CSV file column:
total_sqft
1056
1112
34.46Sq. Meter
4125Perch
1015 - 1540
34.46
10Sq. Yards
10Acres
10Guntha
10Grounds
The column is of type object. First I want to convert all the values to float, then replace the string 1015 - 1540 with its average value, and finally convert the units to square feet. I've tried different Stack Overflow solutions but none of them seems to work. Any help would be appreciated.
Expected Output:
total_sqft
1056.00
1112.00
370.307
1123031.25
1277.5
34.46
90.00
435600
10890
24003.5
1 square meter = 10.764 square feet
1 perch = 272.25 square feet
1 square yard = 9 square feet
1 acre = 43560 square feet
1 guntha = 1089 square feet
1 ground = 2400.35 square feet
First extract the numeric values with Series.str.extractall, convert them to floats, and take the mean per row:
df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
                             .astype(float)
                             .groupby(level=0)
                             .mean())
print(df)
total_sqft avg
0 1056 1056.00
1 1112 1112.00
2 34.46Sq. Meter 34.46
3 4125Perch 4125.00
4 1015 - 1540 1277.50
5 34.46 34.46
More information is then needed to convert to square feet.
EDIT: Create a dictionary of unit factors, extract the unit names from the column, map them to factors (defaulting to 1 where no unit is present), and finally multiply the columns:
d = {'Sq. Meter': 10.764, 'Perch': 272.25, 'Sq. Yards': 9,
     'Acres': 43560, 'Guntha': 1089, 'Grounds': 2400.35}

df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
                             .astype(float)
                             .groupby(level=0)
                             .mean())
df['unit'] = df['total_sqft'].str.extract(f'({"|".join(d)})', expand=False)
df['map'] = df['unit'].map(d).fillna(1)
df['total_sqft'] = df['avg'].mul(df['map'])
print(df)
total_sqft avg unit map
0 1.056000e+03 1056.00 NaN 1.000
1 1.112000e+03 1112.00 NaN 1.000
2 3.709274e+02 34.46 Sq. Meter 10.764
3 1.123031e+06 4125.00 Perch 272.250
4 1.277500e+03 1277.50 NaN 1.000
5 3.446000e+01 34.46 NaN 1.000
6 9.000000e+01 10.00 Sq. Yards 9.000
7 4.356000e+05 10.00 Acres 43560.000
8 1.089000e+04 10.00 Guntha 1089.000
9 2.400350e+04 10.00 Grounds 2400.350
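The same steps can also be collapsed into a small helper; a sketch reusing the factor dictionary d from above (the function name is mine):

def to_sqft(col, factors):
    # Mean of all numbers in each cell (handles ranges such as "1015 - 1540").
    avg = col.str.extractall(r'(\d+\.*\d*)').astype(float).groupby(level=0).mean()[0]
    # Unit factor per cell, defaulting to 1 where no unit string is found.
    unit = col.str.extract(f'({"|".join(factors)})', expand=False)
    return avg * unit.map(factors).fillna(1)

df['total_sqft'] = to_sqft(df['total_sqft'], d)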
I have a lot of parameters for which I have to calculate the year-on-year growth.
Type 2006-Q1 2006-Q2 2006-Q3 2006-Q4 2007-Q1 2007-Q2 2007-Q3 2007-Q4 2008-Q1 2008-Q2 2008-Q3 2008-Q4
MonMkt_IntRt 3.44 3.60 3.99 4.40 4.61 4.73 5.11 4.97 4.92 4.89 5.29 4.51
RtlVol 97.08 97.94 98.25 99.15 99.63 100.29 100.71 101.18 102.04 101.56 101.05 99.49
IntRt 4.44 5.60 6.99 7.40 8.61 9.73 9.11 9.97 9.92 9.89 7.29 9.51
GMR 9.08 9.94 9.25 9.15 9.63 10.29 10.71 10.18 10.04 10.56 10.05 9.49
I need to calculate the growth, i.e. in column 2007-Q1 I need to find the growth from 2006-Q1. The formula is (2007-Q1 / 2006-Q1) - 1.
I have gone through the link below and tried to write the code:
Calculating year over year growth by group in Pandas
df = pd.read_csv('c:/Econometric/EconoModel.csv')
df.set_index('Type', inplace=True)
df.sort_index(axis=1, inplace=True)
df_t = df.T
df_output = (df_t / df_t.shift(4)) - 1
The output is as below
Type          2006-Q1 2006-Q2 2006-Q3 2006-Q4 2007-Q1 2007-Q2 2007-Q3 2007-Q4 2008-Q1 2008-Q2 2008-Q3 2008-Q4
MonMkt_IntRt      NaN     NaN     NaN     NaN  0.3398  0.3159  0.2806  0.1285  0.0661  0.0340  0.0363 -0.0912
RtlVol            NaN     NaN     NaN     NaN  0.0261  0.0240  0.0249  0.0204  0.0242  0.0126  0.0033 -0.0166
IntRt             NaN     NaN     NaN     NaN  0.6666  0.5375  0.3919  0.2310  0.1579  0.0195  0.0856 -0.2688
GMR               NaN     NaN     NaN     NaN  0.0077 -0.0310  0.1124  0.1704  0.0571 -0.0240 -0.0140 -0.0127
Use iloc to shift data slices. See an example on test df.
import pandas as pd

df = pd.DataFrame({i: [0 + i, 1 + i, 2 + i] for i in range(0, 12)})
print(df)

   0  1  2  3  4  5  6  7  8  9  10  11
0  0  1  2  3  4  5  6  7  8  9  10  11
1  1  2  3  4  5  6  7  8  9  10  11  12
2  2  3  4  5  6  7  8  9  10  11  12  13

df.iloc[:, 3:12] = df.iloc[:, 3:12].values / df.iloc[:, 0:9].values - 1
print(df)

   0  1  2    3    4     5     6     7         8         9        10        11
0  0  1  2  inf  3.0  1.50  1.00  0.75  0.600000  0.500000  0.428571  0.375000
1  1  2  3  3.0  1.5  1.00  0.75  0.60  0.500000  0.428571  0.375000  0.333333
2  2  3  4  1.5  1.0  0.75  0.60  0.50  0.428571  0.375000  0.333333  0.300000
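Applied to the question's frame, the same positional arithmetic would look roughly like this (assuming df has Type as the index and its 12 quarter columns sorted chronologically):

# Year-on-year growth: each quarter divided by the same quarter 4 columns earlier.
df.iloc[:, 4:] = df.iloc[:, 4:].values / df.iloc[:, :-4].values - 1
df.iloc[:, :4] = float('nan')  # the first year has no prior-year quarters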
I could not find any issue with your code. Simply add axis=1 to the DataFrame.shift() method, since you are comparing across columns. I have executed the following code, and it gives the result you expected.
def getSampleDataframe():
    df_economy_model = pd.DataFrame(
        {
            'Type': ['MonMkt_IntRt', 'RtlVol', 'IntRt', 'GMR'],
            '2006-Q1': [3.44, 97.08, 4.44, 9.08],
            '2006-Q2': [3.6, 97.94, 5.6, 9.94],
            '2006-Q3': [3.99, 98.25, 6.99, 9.25],
            '2006-Q4': [4.4, 99.15, 7.4, 9.15],
            '2007-Q1': [4.61, 99.63, 8.61, 9.63],
            '2007-Q2': [4.73, 100.29, 9.73, 10.29],
            '2007-Q3': [5.11, 100.71, 9.11, 10.71],
            '2007-Q4': [4.97, 101.18, 9.97, 10.18],
            '2008-Q1': [4.92, 102.04, 9.92, 10.04],
            '2008-Q2': [4.89, 101.56, 9.89, 10.56],
            '2008-Q3': [5.29, 101.05, 7.29, 10.05],
            '2008-Q4': [4.51, 99.49, 9.51, 9.49]
        })  # Your data
    return df_economy_model

df_cd_americas = getSampleDataframe()
df_cd_americas.set_index('Type', inplace=True)
df_yearly_growth = (df_cd_americas / df_cd_americas.shift(4, axis=1)) - 1
print(df_cd_americas)
print(df_yearly_growth)
I have the following DataFrame
data inflation
0 2000.01 0.62
1 2000.02 0.13
2 2000.03 0.22
3 2000.04 0.42
4 2000.05 0.01
5 2000.06 0.23
6 2000.07 1.61
7 2000.08 1.31
8 2000.09 0.23
9 2000.10 0.14
Note that the format of the year-month is with a dot.
When I try to convert to datetime as in:
inflation.data = pd.to_datetime(inflation.data, format='%Y.%m')
I get both line 0 and line 9 as 2000-01-01.
That means pandas is automatically changing .10 into .01.
Is that a bug, or just a format issue?
You're actually using the formatting codes slightly incorrectly. Look at the pandas helpfile:
pandas.to_datetime(*args, **kwargs)
Convert argument to datetime.
Parameters:
arg : string, datetime, list, tuple, 1-d array, Series
You appear to be feeding it float64s when it expects strings.
Try the following code, or convert your inflation.data to strings first. Note that a plain str() renders 2000.10 as '2000.1', which would still parse as January, so format with two decimals instead, e.g. inflation.data.apply('{:.2f}'.format).
f0 = ['2000.01', '2000.02', '2000.03', '2000.04', '2000.05',
      '2000.06', '2000.07', '2000.08', '2000.09', '2000.10']

inflation = pd.DataFrame(f0, columns=['data'])
inflation.data = pd.to_datetime(inflation.data, format='%Y.%m')
output
Out[3]:
0 2000-01-01
1 2000-02-01
2 2000-03-01
3 2000-04-01
4 2000-05-01
5 2000-06-01
6 2000-07-01
7 2000-08-01
8 2000-09-01
9 2000-10-01
Name: data, dtype: datetime64[ns]
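If the column really is float64 to begin with, a minimal sketch of the whole conversion (rendering each value with exactly two decimals so the trailing zero of .10 survives):

import pandas as pd

inflation = pd.DataFrame({'data': [2000.01, 2000.02, 2000.10]})
# '{:.2f}' keeps two decimal digits, so 2000.1 becomes '2000.10' rather than '2000.1'.
inflation['data'] = pd.to_datetime(inflation['data'].map('{:.2f}'.format), format='%Y.%m')
print(inflation)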
This is an interesting problem: the conversion reads .10 as .01, and you can't use any string split methods on the current float type.
Here is my take on this:
Use the Python math module's modf function, which returns the fractional and integer parts of x.
Then round the year and month parts and convert them to strings for to_datetime to interpret.
import math

df['Year'] = df.data.apply(lambda x: round(math.modf(x)[1])).astype(str)
df['Month'] = df.data.apply(lambda x: round(math.modf(x)[0] * 100)).astype(str)
df = df.drop('data', axis=1)
df['Date'] = pd.to_datetime(df.Year + ':' + df.Month, format='%Y:%m')
df = df.drop(['Year', 'Month'], axis=1)
You get
inflation Date
0 0.62 2000-01-01
1 0.13 2000-02-01
2 0.22 2000-03-01
3 0.42 2000-04-01
4 0.01 2000-05-01
5 0.23 2000-06-01
6 1.61 2000-07-01
7 1.31 2000-08-01
8 0.23 2000-09-01
9 0.14 2000-10-01