I'm trying to apply a function to a specific column in this dataframe
datetime PM2.5 PM10 SO2 NO2
0 2013-03-01 7.125000 10.750000 11.708333 22.583333
1 2013-03-02 30.750000 42.083333 36.625000 66.666667
2 2013-03-03 76.916667 120.541667 61.291667 81.000000
3 2013-03-04 22.708333 44.583333 22.854167 46.187500
4 2013-03-06 223.250000 265.166667 116.236700 142.059383
5 2013-03-07 263.375000 316.083333 97.541667 147.750000
6 2013-03-08 221.458333 297.958333 69.060400 120.092788
I'm trying to apply this function(below) to a specific column(PM10) of the above dataframe:
range1 = [list(range(0,50)),list(range(51,100)),list(range(101,200)),list(range(201,300)),list(range(301,400)),list(range(401,2000))]
def c1_c2(x,y):
for a in y:
if x in a:
min_val = min(a)
max_val = max(a)+1
return max_val - min_val
Where "x" can be any column and "y" = Range1
Available Options
df.PM10.apply(c1_c2,args(df.PM10,range1),axis=1)
df.PM10.apply(c1_c2)
I've tried these couple of available options and none of them seems to be working. Any suggestions?
Not sure what the expected output is from the function. But to get the function getting called you can try the following
from functools import partial
df.PM10.apply(partial(c1_c2, y=range1))
Update:
Ok, I think I understand a little better. This should work, but 'range1' is a list of lists of integers. Your data doesn't have integers and the new column comes up empty. I created another list based on your initial data that works. See below:
df = pd.read_csv('pm_data.txt', header=0)
range1= [[7.125000,10.750000,11.708333,22.583333],list(range(0,50)),list(range(51,100)),list(range(101,200)),
list(range(201,300)),list(range(301,400)),list(range(401,2000))]
def c1_c2(x,y):
for a in y:
if x in a:
min_val = min(a)
max_val = max(a)+1
return max_val - min_val
df['function']=df.PM10.apply(lambda x: c1_c2(x,range1))
print(df.head(10))
datetime PM2.5 PM10 SO2 NO2 new_column function
0 2013-03-01 7.125000 10.750000 11.708333 22.583333 25.750000 16.458333
1 2013-03-02 30.750000 42.083333 36.625000 66.666667 2.104167 NaN
2 2013-03-03 76.916667 120.541667 61.291667 81.000000 6.027083 NaN
3 2013-03-04 22.708333 44.583333 22.854167 46.187500 2.229167 NaN
4 2013-03-06 223.250000 265.166667 116.236700 142.059383 13.258333 NaN
5 2013-03-07 263.375000 316.083333 97.541667 147.750000 15.804167 NaN
6 2013-03-08 221.458333 297.958333 69.060400 120.092788 14.897917 NaN
Only the first item in 'function' had a match because it came from your initial data because of 'if x in a'.
Old Code:
I'm also not sure what you are doing. But you can use a lambda to modify columns or create new ones.
Like this,
import pandas as pd
I created a data file to import from the data you posted above:
datetime,PM2.5,PM10,SO2,NO2
2013-03-01,7.125000,10.750000,11.708333,22.583333
2013-03-02,30.750000,42.083333,36.625000,66.666667
2013-03-03,76.916667,120.541667,61.291667,81.000000
2013-03-04,22.708333,44.583333,22.854167,46.187500
2013-03-06,223.250000,265.166667,116.236700,142.059383
2013-03-07,263.375000,316.083333,97.541667,147.750000
2013-03-08,221.458333,297.958333,69.060400,120.092788
Here is how I import it,
df = pd.read_csv('pm_data.txt', header=0)
and create a new column and apply a function to the data in 'PM10'
df['new_column'] = df['PM10'].apply(lambda x: x+15 if x < 30 else x/20)
which yields,
datetime PM2.5 PM10 SO2 NO2 new_column
0 2013-03-01 7.125000 10.750000 11.708333 22.583333 25.750000
1 2013-03-02 30.750000 42.083333 36.625000 66.666667 2.104167
2 2013-03-03 76.916667 120.541667 61.291667 81.000000 6.027083
3 2013-03-04 22.708333 44.583333 22.854167 46.187500 2.229167
4 2013-03-06 223.250000 265.166667 116.236700 142.059383 13.258333
5 2013-03-07 263.375000 316.083333 97.541667 147.750000 15.804167
6 2013-03-08 221.458333 297.958333 69.060400 120.092788 14.897917
Let me know if this helps.
"I've tried these couple of available options and none of them seems to be working..."
What do you mean by this? What's your output, are you getting errors or what?
I see a couple of problems:
range1 lists contain int while your column values are float, so c1_c2() will return None.
if the data types were the same within range1 and columns, c1_c2() will return None when value is not in range1.
Below is how I would do it, assuming the data-types match:
def c1_c2(x):
range1 = [list of lists]
for a in range1:
if x in a:
min_val = min(a)
max_val = max(a)+1
return max_val - min_val
return x # returns the original value if not in range1
df.PM10.apply(c1_c2)
I have stock data set like
**Date Open High ... Close Adj Close Volume**
0 2014-09-17 465.864014 468.174011 ... 457.334015 457.334015 21056800
1 2014-09-18 456.859985 456.859985 ... 424.440002 424.440002 34483200
2 2014-09-19 424.102997 427.834991 ... 394.795990 394.795990 37919700
3 2014-09-20 394.673004 423.295990 ... 408.903992 408.903992 36863600
4 2014-09-21 408.084991 412.425995 ... 398.821014 398.821014 26580100
I need to cumulative sum the columns Open,High,Close,Adj Close, Volume
I tried this df.cumsum(), its shows the the error time stamp error.
I think for processing trade data is best create DatetimeIndex:
#if necessary
#df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
And then if necessary cumulative sum for all column:
df = df.cumsum()
If want cumulative sum only for some columns:
cols = ['Open','High','Close','Adj Close','Volume']
df[cols] = df.cumsum()
I am trying to calculating total age through minadmit and date of birth columns.
I tried this :
patient_admission['minadmit'] = pd.to_datetime(patient_admission['minadmit'], infer_datetime_format=True)
patient_admission['DOB'] = pd.to_datetime(patient_admission['DOB'], infer_datetime_format=True)
print("*******************")
print(patient_admission['minadmit'])
print("*******************")
print(patient_admission['DOB'])
And this is the result :
*******************
0 2149-12-17 20:41:00
1 2149-12-17 20:41:00
2 2149-12-17 20:41:00
3 2188-11-12 09:22:00
4 2110-07-27 06:46:00
...
58971 2111-09-30 12:04:00
58972 2161-07-15 12:00:00
58973 2135-01-06 07:15:00
58974 2129-01-03 07:15:00
58975 2149-06-08 15:21:00
Name: minadmit, Length: 58976, dtype: datetime64[ns]
*******************
0 2075-03-13
1 2075-03-13
2 2075-03-13
3 2164-12-27
4 2090-03-15
...
58971 2026-05-25
58972 2124-07-27
58973 2049-11-26
58974 2076-07-25
58975 2098-07-25
Name: DOB, Length: 58976, dtype: datetime64[ns]
After that, I just write this :
patient_admission['age'] = list(map(lambda x: x.days , (patient_admission['minadmit'] - patient_admission['DOB'])/365.242 ))
I have this error :
raise OverflowError("Overflow in int64 addition") OverflowError: Overflow in int64 addition
What is the cause of this error, and how can can I fix it.
Try this
patient_admission['date_of_admission'] = pd.to_datetime(patient_admission['date_of_admission']).dt.date
patient_admission['DOB'] = pd.to_datetime(patient_admission['DOB']).dt.date
Along with it not sure if this will fix but you may need to use x.dt.days in your lambda function
I am trying to convert these string objects to python datetime objects but the letter T,Z are creating a problem.
datetime.strptime(i,'%Y-%m-%d').date() for i in df['dateUpdated']
0 2014-02-01T04:41:06Z
1 2013-12-11T05:28:28Z
2 2015-11-19T22:20:29Z
3 2014-02-01T05:22:01Z
4 2016-05-19T14:31:26Z
ValueError: unconverted data remains: T04:41:06Z
Assuming you are working with Pandas, use pd.to_datetime:
import pandas as pd
df = pd.DataFrame({'dateUpdated': ['2014-02-01T04:41:06Z', '2013-12-11T05:28:28Z']})
df['dateUpdated'] = pd.to_datetime(df['dateUpdated'])
Results:
0 2014-02-01 04:41:06+00:00
1 2013-12-11 05:28:28+00:00
Name: dateUpdated, dtype: datetime64[ns, UTC]
If you then want to access only the date part of your new column, you can use:
df['dateUpdated'].dt.date
I have read this Pandas: convert type of column and this How to convert datatype:object to float64 in python?
I have current output of df:
Day object
Time object
Open float64
Close float64
High float64
Low float64
Day Time Open Close High Low
0 ['2019-03-25'] ['02:00:00'] 882.2 882.6 884.0 882.1
1 ['2019-03-25'] ['02:01:00'] 882.9 882.9 883.4 882.9
2 ['2019-03-25'] ['02:02:00'] 882.8 882.8 883.0 882.7
So I can not use this:
day_=df.loc[df['Day'] == '2019-06-25']
My final purpose is to extract df by filtering the value of column "Day" by specific condition.
I think the reason of df.loc above failed to excecute is that dtype of Day is object so I can not execute df.loc
so I try to convert the above df to something like this:
Day Time Open Close High Low
0 2019-03-25 ['02:00:00'] 882.2 882.6 884.0 882.1
1 2019-03-25 ['02:01:00'] 882.9 882.9 883.4 882.9
2 2019-03-25 ['02:02:00'] 882.8 882.8 883.0 882.7
I have tried:
df=pd.read_csv('output.csv')
df = df.convert_objects(convert_numeric=True)
#df['Day'] = df['CTR'].str.replace('[','').astype(np.float64)
df['Day'] = pd.to_numeric(df['Day'].str.replace(r'[,.%]',''))
But it does not work with error like this:
ValueError: Unable to parse string "['2019-03-25']" at position 0
I am novice at pandas and this may be duplicated!
Pls, help me to find solution. Thanks alot.
Try this I hope it would work
first remove list brackets by from day then do filter using .loc
df = pd.DataFrame(data={'Day':[['2016-05-12']],
'day2':[['2016-01-01']]})
df['Day'] = df['Day'].apply(''.join)
df['Day'] = pd.to_datetime(df['Day']).dt.date.astype(str)
days_df=df.loc[df['Day'] == '2016-05-12']
Second Solution
If the list is stored as string
from ast import literal_eval
df2 = pd.DataFrame(data={'Day':["['2016-05-12']"],
'day2':["['2016-01-01']"]})
df2['Day'] = df2['Day'].apply(literal_eval)
df2['Day'] = df2['Day'].apply(''.join)
df2['Day'] = pd.to_datetime(df2['Day']).dt.date.astype(str)
days_df=df2.loc[df2['Day'] == '2016-05-12']