Resample-interpolate Pandas Dataframe, interpolate to NaN - python-3.x

I have a Pandas dataframe with datetime indices and a single column. The datetime indices are not regularly spaced (but ordered), and the corresponding values in the column are either numbers or NaN.
My goal is to interpolate these entries to hourly datetimes. I can do so with the following snippet:
df = dataframe.resample('H')
interpolated = df.interpolate(method='linear')
where dataframe is my raw dataframe, and interpolated is the interpolated dataframe.
This works perfectly well, but the problem is that this function seems to interpolate over the NaN values, which is not what I want. Ideally, I'd like the function to return NaN if it tries to interpolate between two entries at least one of which is NaN (so if I interpolate between NaN and NaN: return NaN; if I interpolate between NaN and 5: return NaN; if I interpolate halfway between 3 and 5: return 4).
I could code my own brute-force interpolator which does this, but I'd much rather get my grip on Pandas. This seems like a pretty straightforward problem - does anyone know a method to achieve this?
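There is no single built-in flag for this, but one pandas-native sketch is to resample and interpolate as usual, then mask out every hourly point whose nearest original sample on either side was NaN. The timestamps and values below are illustrative, not from the question:

```python
import numpy as np
import pandas as pd

# Illustrative irregular series: a NaN sample sits between 02:00 and 06:00.
idx = pd.to_datetime([
    "2021-01-01 00:00", "2021-01-01 02:00",
    "2021-01-01 03:30", "2021-01-01 06:00",
])
s = pd.Series([3.0, 5.0, np.nan, 8.0], index=idx)

# Plain resample + interpolate fills straight across the NaN sample.
interp = s.resample("H").mean().interpolate(method="linear")

# For each hourly timestamp, check whether the nearest original sample
# on either side was NaN, and mask those interpolated values back to NaN.
was_nan = s.isna()
prev_nan = was_nan.reindex(interp.index, method="ffill").fillna(False)
next_nan = was_nan.reindex(interp.index, method="bfill").fillna(False)
result = interp.mask(prev_nan | next_nan)
# 01:00 -> 4.0 (halfway between 3 and 5); 03:00 through 05:00 -> NaN
```

reindex with method='ffill'/'bfill' looks up, for each hourly timestamp, the flag of the previous/next original sample, so any hourly point bracketed by a NaN on either side is masked.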

Related

Trying to divide two columns of a dataframe but get Nan

Background:
I deal with a dataframe and want to divide the two columns of this dataframe to get a new column. The code is shown below:
import pandas as pd
df = {'drive_mile': [15.1, 2.1, 7.12], 'price': [40, 9, 31]}
df = pd.DataFrame(df)
df['price/km'] = df[['drive_mile', 'price']].apply(lambda x: x[1]/x[0])
print(df)
And I get the below result:
   drive_mile  price  price/km
0       15.10     40       NaN
1        2.10      9       NaN
2        7.12     31       NaN
Why would this happen? And how can I fix it?
As pointed out in the comments, you missed the axis=1 parameter, which makes apply operate row-wise. Without it, apply runs column-wise and returns a Series indexed by the column names ('drive_mile', 'price'); that index does not align with the DataFrame's row index when you assign the result back, so you get a column of NaN.
However, more importantly, do not use apply to perform a division! apply is often much less efficient than vectorized operations.
Use div (note the order: price divided by distance, matching the 'price/km' column name):
df['price/km'] = df['price'].div(df['drive_mile'])
Or /:
df['price/km'] = df['price'] / df['drive_mile']
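A runnable recap, assuming price-per-distance is the intended quantity (the column is named 'price/km'): the axis=1 fix makes apply work, and the vectorized division is the idiomatic way.

```python
import pandas as pd

df = pd.DataFrame({'drive_mile': [15.1, 2.1, 7.12], 'price': [40, 9, 31]})

# apply works once axis=1 makes it operate row by row (still slow):
df['price/km'] = df[['drive_mile', 'price']].apply(
    lambda x: x['price'] / x['drive_mile'], axis=1)

# the vectorized equivalent, which is what you should actually use:
df['price/km'] = df['price'] / df['drive_mile']
```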

How to find complete empty row in pandas

I am working on a dataset in which I need to find the completely empty rows.
example:
     A    B    C    D
0  nan  nan  nan  nan
1    1   ss  nan  3.0
2    2   bb   w2  4.0
3  nan  nan  nan  nan
Currently, I am using
import pandas as pd
nan_col=[]
for col in df.columns:
    if df.loc[df[col].isnull()].empty != True:
        nan_col.append(col)
But this captures columns that contain null values, whereas I need to capture fully null rows.
Expected answer: rows [0, 3]
Can anyone suggest a way to identify completely null rows in the dataframe?
You can test whether every value in a row is missing with DataFrame.isna combined with DataFrame.all(axis=1), then get the index values by boolean indexing:
L = df.index[df.isna().all(axis=1)].tolist()
# alternative; slower on a huge dataframe
# L = df[df.isna().all(axis=1)].index.tolist()
print(L)
[0, 3]
Or you could use dropna with set and sorted: take the index after dropping the rows that are all NaN, take the index of the whole dataframe, and use the symmetric-difference operator ^ to find the labels that appear in only one of the two sets; sorted then returns them as a sorted list:
print(sorted(set(df.index) ^ set(df.dropna(how='all').index)))
If the index might contain duplicates, you can use a list comprehension over the whole df's index, adding a value when its position is not in the dropna index; enumerate is used so this still works even when all index labels are identical:
idx = df.dropna(how='all').index
print([i for index, i in enumerate(df.index) if index not in idx])
Both snippets output:
[0, 3]
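A self-contained sketch of the first approach, with the DataFrame reconstructed from the question's example:

```python
import numpy as np
import pandas as pd

# Rows 0 and 3 are entirely NaN, as in the question's example.
df = pd.DataFrame({
    'A': [np.nan, 1, 2, np.nan],
    'B': [np.nan, 'ss', 'bb', np.nan],
    'C': [np.nan, np.nan, 'w2', np.nan],
    'D': [np.nan, 3.0, 4.0, np.nan],
})

# Index labels of the rows where every value is missing.
L = df.index[df.isna().all(axis=1)].tolist()
# L -> [0, 3]
```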

Column of NaN created when concatenating series into dataframe

I've created an output variable a = pd.Series(), then run a number of simulations in a for loop that appends each simulation's results, temporarily stored in x, to a as successive columns, each renamed to the simulation number (starting from zero), using the following code:
a = pd.concat([a, x.rename(sim_count)], axis=1)
For some reason, the resulting dataframe includes a column of "NaN" values to the left of my first column of simulated results that I can't get rid of, as follows (example shows the results of three simulations):
     0         0         1         2
0  NaN  0.136799  0.135325 -0.174987
1  NaN -0.010517  0.108798  0.003726
2  NaN  0.116757  0.030352  0.077443
3  NaN  0.148347  0.045051  0.211610
4  NaN  0.014309  0.074419  0.109129
Any idea how to prevent this column of NaN values from being generated?
Basically, by creating your output variable via pd.Series() you are creating an empty dataset. This is carried over in the concatenation: the empty series is aligned to the same number of rows as x, and the only way pandas can represent those missing rows is with NaN. When you concatenate you are effectively saying: add my new series alongside the "empty" series, and the empty series just becomes a column of NaN.
A more effective way of doing this is to assign "a" to a dataframe then concatenate.
a = pd.DataFrame()
a = pd.concat([a, x.rename(sim_count)], axis=1)
You might be asking yourself why this works while using pd.Series() forces a column of NaNs. My understanding is that the empty DataFrame contributes no columns at all, so your new data simply goes INTO it, whereas when you do pd.concat([pd.Series(), x.rename(sim_count)], axis=1) you are telling pandas that the empty series is itself a column to be retained, and that the new data should be added alongside it. Hence the column of NaNs.
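A minimal reproduction of both behaviours; the random numbers here are just a stand-in for one simulation's output x:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Seeding the loop with an empty Series leaves a stray all-NaN column.
a_bad = pd.Series(dtype=float)
for sim_count in range(3):
    x = pd.Series(rng.normal(size=5))   # stand-in for one simulation run
    a_bad = pd.concat([a_bad, x.rename(sim_count)], axis=1)
# a_bad has 4 columns: the all-NaN leftover plus the 3 result columns

# Seeding with an empty DataFrame does not.
a_good = pd.DataFrame()
for sim_count in range(3):
    x = pd.Series(rng.normal(size=5))
    a_good = pd.concat([a_good, x.rename(sim_count)], axis=1)
# a_good has exactly the 3 result columns
```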

How to obtain the difference of values of two specific dates with Pandas [duplicate]

This questions is similar to Python: Pandas Divide DataFrame by first row
I have a DataFrame which looks like this:
            1125400  5430095  1095751
2013-04-02    98.91      NaN  5626.79
2013-04-03    99.29      NaN  5727.53
2013-04-04    99.79      NaN  5643.75
2013-04-07   100.55      NaN  5630.78
2013-04-08   100.65      NaN  5633.77
I would like to divide the values of the last row by the values of the first row in order to obtain the percentage difference over time.
A clearer way is to use iloc (last row divided by first row, as asked):
df.iloc[-1] / df.iloc[0]
Just take the last row and the first row values, then divide them, like this: df.T[df.index[-1]] / df.T[df.index[0]]
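On data like the question's (values copied from the excerpt above, with the all-NaN column dropped for brevity), last row over first row gives the growth factor, from which the percentage difference follows:

```python
import pandas as pd

df = pd.DataFrame(
    {1125400: [98.91, 99.29, 99.79, 100.55, 100.65],
     1095751: [5626.79, 5727.53, 5643.75, 5630.78, 5633.77]},
    index=pd.to_datetime(["2013-04-02", "2013-04-03", "2013-04-04",
                          "2013-04-07", "2013-04-08"]),
)

ratio = df.iloc[-1] / df.iloc[0]   # last row divided by first row
pct = (ratio - 1) * 100            # percentage change over the period
```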

pandas apply over a single column

I have a few columns in a dataframe read from a CSV file that appear to have a mix of nan and strings (the dataframe also has a few other columns that are float, with a few nan values as well), for example:
[nan '12/31/1990 12:00:00 AM' '06/03/1991 12:00:00 AM'
'09/15/1991 12:00:00 AM' '11/11/1991 12:00:00 AM']
I am interested to convert this to
[nan '12/31/1990' '06/03/1991'
'09/15/1991' '11/11/1991']
This question is in four parts:
suppose I want to convert the strings in the example above to remove the time, for instance with the function
def rem_t_from_d(x): return x.split(sep=' ')[0]
How should I change this function to handle the nan entries (floats)? And would any missing value in a string column (when read from the CSV) be converted to nan (float) even if the rest of the column is strings?
how can I apply this function over a single column of pandas? In pandas documentation, the structure of the function is given as
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
but I don't see any ability to apply it to a column of the dataframe.
How can I check if an element in the dataframe is nan? The documentation gives examples of checking whether a whole column contains nan (e.g. using .notnull()) and of assigning an element to nan (e.g. using = np.nan), but not of testing a single element. I tried using np.isnan, but that gives me a type error.
also, does pandas have an equivalent of NA_integer_, NA_real_, NA_character_, NA_complex_ in R, so that an entire column can be designated as a string type if needed, instead of a mix of string and float?
I am using Python 3.4
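The pieces asked about can be sketched together as follows. The sample data mirrors the excerpt; pd.isna is the scalar-safe NaN test (np.isnan raises TypeError on strings), and selecting a single column yields a Series whose .apply maps the function element-wise:

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing float NaN with date strings, as when read from CSV.
s = pd.Series([np.nan, '12/31/1990 12:00:00 AM', '06/03/1991 12:00:00 AM'])

def rem_t_from_d(x):
    # pd.isna works on scalars of any type; np.isnan would fail on a str.
    if pd.isna(x):
        return x
    return x.split(sep=' ')[0]

# Applying over a single column: select it to get a Series, then use .apply.
cleaned = s.apply(rem_t_from_d)
# cleaned -> [nan, '12/31/1990', '06/03/1991']
```

For a whole DataFrame column this would read df['date_col'].apply(rem_t_from_d); 'date_col' is a hypothetical name, since the question does not give one.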
