Pandas is converting month 10 into month 1. Is there a format issue here? - python-3.x

I have the following DataFrame
      data  inflation
0  2000.01       0.62
1  2000.02       0.13
2  2000.03       0.22
3  2000.04       0.42
4  2000.05       0.01
5  2000.06       0.23
6  2000.07       1.61
7  2000.08       1.31
8  2000.09       0.23
9  2000.10       0.14
Note that the year and month are separated by a dot.
When I try to convert it to datetime, as in:
inflation.data = pd.to_datetime(inflation.data, format='%Y.%m')
I get both line 0 and line 9 as 2000-01-01.
That means pandas is automatically changing .10 into .01.
Is that a bug, or just a format issue?

You're actually passing to_datetime the wrong input type; the format codes themselves are fine. Look at the pandas helpfile:
pandas.to_datetime(*args, **kwargs)
Convert argument to datetime.
Parameters:
arg : string, datetime, list, tuple, 1-d array, Series
You appear to be feeding it float64s when it expects strings. As a float, 2000.10 is the same number as 2000.1, so the trailing zero is lost before to_datetime ever sees it, and '%Y.%m' then reads the lone 1 as January.
Try the following code, or convert inflation.data to zero-padded strings first, e.g. with inflation.data.map('{:.2f}'.format); a plain str() conversion would still yield '2000.1' and reproduce the problem.
import pandas as pd

# string values, so the trailing zero in '2000.10' survives
f0 = ['2000.01', '2000.02', '2000.03', '2000.04', '2000.05',
      '2000.06', '2000.07', '2000.08', '2000.09', '2000.10']
inflation = pd.DataFrame(f0, columns=['data'])   # a list, not a set, for column names
inflation.data = pd.to_datetime(inflation.data, format='%Y.%m')
Output:
Out[3]:
0 2000-01-01
1 2000-02-01
2 2000-03-01
3 2000-04-01
4 2000-05-01
5 2000-06-01
6 2000-07-01
7 2000-08-01
8 2000-09-01
9 2000-10-01
Name: data, dtype: datetime64[ns]
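Note that if the column already contains float64 values, as in the question, a plain str() conversion would still drop the trailing zero (str(2000.10) gives '2000.1'). A minimal sketch of a zero-padded conversion, assuming the original float column:
import pandas as pd

# floats, as in the question; 2000.10 is stored as 2000.1
inflation = pd.DataFrame({'data': [2000.01, 2000.09, 2000.10]})
# format with two decimals to restore the zero-padded month, then parse
inflation.data = pd.to_datetime(inflation.data.map('{:.2f}'.format), format='%Y.%m')
# 2000.10 now becomes 2000-10-01 rather than 2000-01-01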

This is an interesting problem. The float representation itself drops the trailing zero (2000.10 is stored as 2000.1), and you can't use string split methods on the float type.
Here is my take on this:
Use the Python math module's modf function, which returns the fractional and integer parts of x.
Then round the year and month parts and convert them to strings for to_datetime to interpret.
import math
import pandas as pd

# integer part of the float is the year, fractional part * 100 is the month
df['Year'] = df.data.apply(lambda x: round(math.modf(x)[1])).astype(str)
df['Month'] = df.data.apply(lambda x: round(math.modf(x)[0] * 100)).astype(str)
df = df.drop('data', axis=1)
# rebuild a parseable string like '2000:10' and convert it
df['Date'] = pd.to_datetime(df.Year + ':' + df.Month, format='%Y:%m')
df = df.drop(['Year', 'Month'], axis=1)
You get
inflation Date
0 0.62 2000-01-01
1 0.13 2000-02-01
2 0.22 2000-03-01
3 0.42 2000-04-01
4 0.01 2000-05-01
5 0.23 2000-06-01
6 1.61 2000-07-01
7 1.31 2000-08-01
8 0.23 2000-09-01
9 0.14 2000-10-01

Related

How to get the column name of a dataframe from values in a numpy array

I have a df with 15 columns:
df.columns:
0 class
1 name
2 location
3 income
4 edu_level
--
14 marital_status
after some transformations I got a numpy.ndarray with shape (15, 3) named loads:
0.52 0.33 0.09
0.20 0.53 0.23
0.60 0.28 0.23
0.13 0.45 0.41
0.49 0.9
...
So, 3 columns with 15 values.
What I need to do:
I want to get the df column names for the values in the first column of loads that are greater than .50.
For this example, the columns of df related to the first column of loads with values higher than 0.5 should return:
0 class
2 location
The same for the second column of loads, which should return:
1 name
3 income
4 edu_level
and the same logic applies to the 3rd column of loads.
I managed to get the numpy array loads the way I need it, but I am having a bad time with this last part. I know I can simply pick the columns manually, but that will be a hard task when df has more than 15 features.
Can anyone help me, please?
Given your threshold, you can create a boolean array in order to filter df.columns:
threshold = .5
for j in range(loads.shape[1]):
    print(df.columns[loads[:, j] > threshold])
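If you'd rather collect the matching column names than print them, a dict comprehension over the columns of loads does the same thing; a small sketch under the same assumptions (df with 15 columns, loads of shape (15, 3)):
threshold = .5
# maps each loads column index to the df column names whose value exceeds the threshold
selected = {j: df.columns[loads[:, j] > threshold].tolist()
            for j in range(loads.shape[1])}
# e.g. selected[0] -> ['class', 'location'] for the example data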

Pandas Dataframe: Dropping Selected rows with 0.0 float type values

I have a dataset that contains an amount column as a float type. Some of the rows contain values of 0.00, and because they skew the dataset, I need to drop them. I have temporarily set "Amount" as the index and sorted the values as well.
Afterwards, I attempted to drop the rows after subsetting with .loc, but I keep getting an error message of the form ValueError: Buffer has wrong number of dimensions (expected 1, got 3):
mortgage = mortgage.set_index('Gross Loan Amount').sort_values('Gross Loan Amount')
mortgage.drop([mortgage.loc[0.0]])
I also tried this:
mortgage.drop(mortgage.loc[0.0])
which flagged an error of the form KeyError: "[Column_names] not found in axis".
How else can I accomplish the task?
You could make a boolean frame and then use any:
df = df[~(df == 0).any(axis=1)]
In this code, all rows that have at least one zero in their data have been removed.
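If only the amount column should be checked rather than every column, filtering on that single column is enough; a minimal sketch using the question's 'Gross Loan Amount' column (before setting it as the index):
mortgage = mortgage[mortgage['Gross Loan Amount'] != 0.0]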
Let me see if I get your problem. I created this sample dataset:
df = pd.DataFrame({'Values': [200.04,100.00,0.00,150.15,69.98,0.10,2.90,34.6,12.6,0.00,0.00]})
df
Values
0 200.04
1 100.00
2 0.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
9 0.00
10 0.00
Now, in order to get rid of the 0.00 values, you just have to do this:
df = df[df['Values'] != 0.00]
Output:
df
Values
0 200.04
1 100.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
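Note the filtered frame keeps its original index (2, 9 and 10 are gone above); if a contiguous 0..n index matters, reset it afterwards:
df = df.reset_index(drop=True)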

Fill missing value in different columns of dataframe using mean or median of last n values

I have a dataframe which contains time series data. What I want to do is efficiently fill all the missing values in the different columns by substituting a median value computed over a timedelta of, say, N minutes. E.g. if a column has data for 10:20, 10:21, 10:22, 10:23, 10:24, ... and the value at 10:22 is missing, then with a timedelta of 2 minutes I would want it filled with the median of the values at 10:20, 10:21, 10:23 and 10:24.
One way I can do this is:
for each column in the dataframe:
    find the indices which have a NaN value
    for each index with a NaN value:
        extract all values using between_time with (index - timedelta) and (index + timedelta)
        find the median of the extracted values
        set the value at that index to that median
This looks like two nested for loops, and not a very efficient approach. Is there an efficient way to do it?
Thanks
IIUC you can resample your time column, then fillna with a centered rolling window:
import numpy as np
import pandas as pd

# dummy data setup
np.random.seed(500)
n = 2
df = pd.DataFrame({"time": pd.to_timedelta([f"10:{i}:00" for i in range(15)]),
                   "value": np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5, 10]]).reset_index(drop=True)   # knock out 10:05 and 10:10
print(df)
time value
0 10:00:00 4
1 10:01:00 9
2 10:02:00 3
3 10:03:00 3
4 10:04:00 8
5 10:06:00 9
6 10:07:00 2
7 10:08:00 9
8 10:09:00 9
9 10:11:00 7
10 10:12:00 3
11 10:13:00 3
12 10:14:00 7
# upsample to 1-minute frequency, then fill the gaps with a centered rolling mean
s = df.set_index("time").resample("60S").asfreq()
print(s.fillna(s.rolling(n*2+1, min_periods=1, center=True).mean()))
value
time
10:00:00 4.0
10:01:00 9.0
10:02:00 3.0
10:03:00 3.0
10:04:00 8.0
10:05:00 5.5
10:06:00 9.0
10:07:00 2.0
10:08:00 9.0
10:09:00 9.0
10:10:00 7.0
10:11:00 7.0
10:12:00 3.0
10:13:00 3.0
10:14:00 7.0
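Since the question asks for the median rather than the mean, the same pattern works with rolling(...).median(); a sketch under the same setup:
print(s.fillna(s.rolling(n*2+1, min_periods=1, center=True).median()))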

Python Life Expectancy

Trying to use pandas to calculate life expectancy with complex equations.
Multiplying or dividing column by column is not difficult to do.
My data is:
   A     b
1  0.99  1000
2  0.95  = 0.99 * 1000 = 990
3  0.93  = 0.95 * 990
Field A is fully populated and field b has only the initial 1000.
Each following b is the previous A times the previous b, i.e. b2 = A1 * b1.
I tried the shift function, but got a result for b2 only and the rest zeros. Any help please? Thanks, mazin
IIUC, if you're starting with:
>>> df
A b
0 0.99 1000.0
1 0.95 NaN
2 0.93 NaN
Then you can do:
df.loc[df.b.isnull(),'b'] = (df.A.cumprod()*1000).shift()
>>> df
A b
0 0.99 1000.0
1 0.95 990.0
2 0.93 940.5
Or more generally (each b value is the first b times the running product of A up to the previous row, which is exactly what cumprod plus shift builds):
df['b'] = (df.A.cumprod()*df.b.iloc[0]).shift().fillna(df.b.iloc[0])

Resampling with a Multiindex and several columns

I have a pandas dataframe with the following structure:
ID  date        m_1   m_2
1   2016-01-03  10    3.4
    2016-02-07  11    3.3
    2016-02-07  10.4  2.8
2   2016-01-01  10.9  2.5
    2016-02-04  12    2.3
    2016-02-04  11    2.7
    2016-02-04  12.1  2.1
Both ID and date form a MultiIndex. The data represent measurements made by some sensors (two sensors in the example). Those sensors sometimes take several measurements per day (as shown in the example).
My questions are:
How can I resample this so I have one row per day per sensor, but one column with the mean, another with the max, another with the min, etc.?
How can I "align" (maybe this is not the correct word) the two time series, so both begin and end at the same time (from 2016-01-01 to 2016-02-07), adding the missing days as NAs?
You can use groupby with DataFrameGroupBy.resample, aggregate with a dict of functions, and then reindex by MultiIndex.from_product:
df = df.reset_index(level=0).groupby('ID').resample('D').agg({'m_1':'mean', 'm_2':'max'})
df = df.reindex(pd.MultiIndex.from_product(df.index.levels, names = df.index.names))
#alternative for adding missing start and end datetimes
#df = df.unstack().stack(dropna=False)
print (df.head())
m_2 m_1
ID date
1 2016-01-01 NaN NaN
2016-01-02 NaN NaN
2016-01-03 3.4 10.0
2016-01-04 NaN NaN
2016-01-05 NaN NaN
For a PeriodIndex in the second level, use set_levels with to_period:
df.index = df.index.set_levels(df.index.get_level_values('date').to_period('d'), level=1)
print (df.index.get_level_values('date'))
PeriodIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
'2016-01-09', '2016-01-10', '2016-01-11', '2016-01-12',
'2016-01-13', '2016-01-14', '2016-01-15', '2016-01-16',
'2016-01-17', '2016-01-18', '2016-01-19', '2016-01-20',
'2016-01-21', '2016-01-22', '2016-01-23', '2016-01-24',
'2016-01-25', '2016-01-26', '2016-01-27', '2016-01-28',
'2016-01-29', '2016-01-30', '2016-01-31', '2016-02-01',
'2016-02-02', '2016-02-03', '2016-02-04', '2016-02-05',
'2016-02-06', '2016-02-07', '2016-01-01', '2016-01-02',
'2016-01-03', '2016-01-04', '2016-01-05', '2016-01-06',
'2016-01-07', '2016-01-08', '2016-01-09', '2016-01-10',
'2016-01-11', '2016-01-12', '2016-01-13', '2016-01-14',
'2016-01-15', '2016-01-16', '2016-01-17', '2016-01-18',
'2016-01-19', '2016-01-20', '2016-01-21', '2016-01-22',
'2016-01-23', '2016-01-24', '2016-01-25', '2016-01-26',
'2016-01-27', '2016-01-28', '2016-01-29', '2016-01-30',
'2016-01-31', '2016-02-01', '2016-02-02', '2016-02-03',
'2016-02-04', '2016-02-05', '2016-02-06', '2016-02-07'],
dtype='period[D]', name='date', freq='D')
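If you want the mean, max and min of every measurement at once, as the first question asks, passing a list of functions to agg gives one column per (measurement, statistic) pair; a sketch along the same lines, starting again from the original frame:
agg = df.reset_index(level=0).groupby('ID').resample('D').agg(['mean', 'max', 'min'])
# columns become a MultiIndex: (m_1, mean), (m_1, max), ..., (m_2, min)
agg.columns = ['_'.join(c) for c in agg.columns]   # optional: flatten to m_1_mean, ...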
