How to obtain the difference of values of two specific dates with Pandas [duplicate]

This question is similar to Python: Pandas Divide DataFrame by first row
I have a DataFrame which looks like this:
            1125400  5430095  1095751
2013-04-02    98.91      NaN  5626.79
2013-04-03    99.29      NaN  5727.53
2013-04-04    99.79      NaN  5643.75
2013-04-07   100.55      NaN  5630.78
2013-04-08   100.65      NaN  5633.77
I would like to divide the values of the last row by the values of the first row in order to obtain the percentage difference over time.

A clearer way is to use iloc (note the order: last row divided by first, as asked):
df.iloc[-1] / df.iloc[0]

Just take the last row and the first row values, then divide them, like this: df.T[df.index[-1]] / df.T[df.index[0]]
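For reference, a minimal runnable sketch of the iloc approach, reconstructing the frame from the question (the percentage step is an addition for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {1125400: [98.91, 99.29, 99.79, 100.55, 100.65],
     5430095: [np.nan] * 5,
     1095751: [5626.79, 5727.53, 5643.75, 5630.78, 5633.77]},
    index=pd.to_datetime(['2013-04-02', '2013-04-03', '2013-04-04',
                          '2013-04-07', '2013-04-08']))

ratio = df.iloc[-1] / df.iloc[0]   # last row over first row
print((ratio - 1) * 100)           # change over time, in percent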

Related

How to find completely empty rows in pandas

I am working on a dataset in which I need to find the completely empty rows.
Example:
     A    B    C    D
0  NaN  NaN  NaN  NaN
1    1   ss  NaN  3.0
2    2   bb   w2  4.0
3  NaN  NaN  NaN  NaN
Currently, I am using:
import pandas as pd

nan_col = []
for col in df.columns:
    if df.loc[df[col].isnull()].empty != True:
        nan_col.append(col)
But this captures columns that contain null values, whereas I need to capture fully null rows.
Expected answer: rows [0, 3]
Can anyone suggest a way to identify completely null rows in the dataframe?
You can check whether all values in each row are missing with DataFrame.isna and DataFrame.all, and then get the index values by boolean indexing:
L = df.index[df.isna().all(axis=1)].tolist()
# alternative; slower for a huge DataFrame
# L = df[df.isna().all(axis=1)].index.tolist()
print(L)
[0, 3]
Or you could use dropna with set and sorted: get the index after dropping the all-NaN rows, take the symmetric difference (^) with the full index to find the labels that are not in both, then sort the result into a list, like the below:
print(sorted(set(df.index) ^ set(df.dropna(how='all').index)))
If you might have duplicate index values, compare positions instead of labels: reset the index so that dropna reports the positions of the surviving rows, then use a list comprehension with enumerate to emit the original labels of the remaining (all-NaN) positions, like the below:
keep = df.reset_index(drop=True).dropna(how='all').index
print([label for pos, label in enumerate(df.index) if pos not in keep])
Both codes output:
[0, 3]
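Put together, a self-contained sketch that builds the example frame and runs both approaches:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1, 2, np.nan],
                   'B': [np.nan, 'ss', 'bb', np.nan],
                   'C': [np.nan, np.nan, 'w2', np.nan],
                   'D': [np.nan, 3.0, 4.0, np.nan]})

print(df.index[df.isna().all(axis=1)].tolist())                  # [0, 3]
print(sorted(set(df.index) ^ set(df.dropna(how='all').index)))   # [0, 3]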

How to add an empty column in a dataframe using pandas (without specifying column names)?

I have a dataframe with only one column (headerless). I want to add another empty column to it having the same number of rows.
To make it clearer: currently my data frame has size 1050 (since it has only one column); I want the new size to be 1050 x 2, with the second column completely empty.
A pandas DataFrame always has columns, so to append a new default column filled with missing values, use the current number of columns as the new column's name:
import numpy as np
import pandas as pd

s = pd.Series([2, 3, 4])
df = s.to_frame()
df[len(df.columns)] = np.nan
# for a one-column df this is the same as
# df[1] = np.nan
print(df)
   0    1
0  2  NaN
1  3  NaN
2  4  NaN
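The same idea in the asker's setting, with a hypothetical stand-in for the headerless single-column frame (e.g. one loaded via pd.read_csv(..., header=None)):
import numpy as np
import pandas as pd

df = pd.DataFrame(range(1050))   # one column, named 0

df[len(df.columns)] = np.nan     # new empty column, named 1
print(df.shape)                  # (1050, 2)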

Resample-interpolate Pandas Dataframe, interpolate to NaN

I have a Pandas dataframe with datetime indices and a single column. The datetime indices are not regularly spaced (but ordered), and the corresponding values in the column are either numbers or NaN.
My goal is to interpolate these entries to hourly datetimes. I can do so with the following snippet:
df = dataframe.resample('H')
interpolated = df.interpolate(method='linear')
where dataframe is my raw dataframe, and interpolated is the interpolated dataframe.
This works perfectly well, but the problem is that this function seems to interpolate over the NaN values, which is not what I want. Ideally, I'd like the function to return NaN if it tries to interpolate between two entries at least one of which is NaN (so interpolating between NaN and NaN returns NaN; between NaN and 5 returns NaN; halfway between 3 and 5 returns 4).
I could code my own brute-force interpolator that does this, but I'd much rather stay within Pandas. This seems like a pretty straightforward problem - does anyone know a method to achieve this?
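No answer is captured here; one possible approach (a sketch with hypothetical data; the masking strategy is an assumption, not taken from this thread) is to interpolate on the union of the original and hourly indexes, then blank out every hourly point whose neighbouring original observation is NaN:
import numpy as np
import pandas as pd

# hypothetical irregular series containing a NaN observation
idx = pd.to_datetime(['2013-01-01 00:10', '2013-01-01 03:40',
                      '2013-01-01 07:20', '2013-01-01 12:05'])
dataframe = pd.DataFrame({'value': [3.0, np.nan, 5.0, 7.0]}, index=idx)

is_nan = dataframe['value'].isna()

# hourly grid inside the observed range, united with the original
# points so the interpolation passes through the real observations
hourly = pd.date_range(idx.min().ceil('h'), idx.max().floor('h'), freq='h')
union = idx.union(hourly)

interp = dataframe['value'].reindex(union).interpolate(method='time')

# a point is invalid if the previous or next original observation is NaN
prev_nan = is_nan.reindex(union).ffill().astype(bool)
next_nan = is_nan.reindex(union).bfill().astype(bool)
interp[prev_nan | next_nan] = np.nan

interpolated = interp.reindex(hourly)
print(interpolated)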

How to join two dataframes for which column time values are within a certain range and are not datetime or timestamp objects?

I have two dataframes as shown below:
    time  browncarbon  blackcarbon
181.7335     0.105270          NaN
181.3809     0.166545     0.001217
181.6197     0.071581          NaN
422 rows x 3 columns

   start       end    toc
179.9989  180.0002  155.0
180.0002  180.0016  152.0
180.0016  180.0030  151.0
1364 rows x 3 columns
The first dataframe has a time column with instants every four minutes. The second dataframe has two time columns spaced every two minutes. These time columns do not start and end at the same time; however, they contain data collected over the same day. How could I make another dataframe containing:
time browncarbon blackcarbon toc
422 rows X 4 columns
There is a related answer on Stack Overflow, however, that is applicable only when the time columns are datetime or timestamp objects. The link is: How to join two dataframes for which column values are within a certain range?
Addendum 1: The multiple start and end rows that fall within one of the time rows should still correspond to a single toc value, as they do now; however, that value should be the average of the multiple toc rows, which is not the case presently.
Addendum 2: Merging two pandas dataframes with complex conditions
We create an artificial key column and do an outer merge on it to get the cartesian product (every row of df1 paired with every row of df2). Then we keep only the rows where time falls within the [start, end] range, using .query.
Note: I edited the value of one row so we get a match (see row 0 in the example dataframes at the bottom).
df1.assign(key=1).merge(df2.assign(key=1), on='key', how='outer')\
   .query('(time >= start) & (time <= end)')\
   .drop(['key', 'start', 'end'], axis=1)
output
time browncarbon blackcarbon toc
1 180.0008 0.10527 NaN 152.0
Example dataframes used:
df1:
time browncarbon blackcarbon
0 180.0008 0.105270 NaN
1 181.3809 0.166545 0.001217
2 181.6197 0.071581 NaN
df2:
start end toc
0 179.9989 180.0002 155.0
1 180.0002 180.0016 152.0
2 180.0016 180.0030 151.0
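For completeness, a self-contained version of that merge-and-query answer, with imports and the example frames included (newer pandas also offers merge(..., how='cross') in place of the key trick):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'time': [180.0008, 181.3809, 181.6197],
                    'browncarbon': [0.105270, 0.166545, 0.071581],
                    'blackcarbon': [np.nan, 0.001217, np.nan]})
df2 = pd.DataFrame({'start': [179.9989, 180.0002, 180.0016],
                    'end': [180.0002, 180.0016, 180.0030],
                    'toc': [155.0, 152.0, 151.0]})

out = (df1.assign(key=1).merge(df2.assign(key=1), on='key', how='outer')
          .query('(time >= start) & (time <= end)')
          .drop(['key', 'start', 'end'], axis=1))
print(out)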
Since the start and end intervals are mutually exclusive, you could add a key column to df2 covering the integer values from floor(start) to floor(end), add a floor(time) column to df1, and then take a left outer join of df1 and df2. That should do it, except that you may have to remove NaN values and extra columns afterwards. If you send me the CSV files, I may be able to send you the script. I hope I answered your question.
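A rough sketch of that idea, reusing the df1 and df2 built in the sketch above and assuming each [start, end] interval stays within a single integer bucket:
import numpy as np

df1['key'] = np.floor(df1['time']).astype(int)
df2['key'] = np.floor(df2['start']).astype(int)

out = df1.merge(df2, on='key', how='left')
# keep in-range matches, plus the rows that matched nothing at all
inside = (out['time'] >= out['start']) & (out['time'] <= out['end'])
out = out[inside | out['start'].isna()]
print(out.drop(columns=['key', 'start', 'end']))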
Perhaps you could just convert your columns to Timestamps and then use the answer in the other question you linked:
from pandas import Timestamp
from dateutil.relativedelta import relativedelta as rd

def to_timestamp(x):
    return Timestamp(2000, 1, 1) + rd(days=x)

df['start_time'] = df.start.apply(to_timestamp)
df['end_time'] = df.end.apply(to_timestamp)
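The time column of the first frame (df1 here, an assumed name) needs the same conversion before the datetime-based range join from the linked answer can be applied:
df1['time_stamp'] = df1.time.apply(to_timestamp)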
Your 2nd data frame is too short to show a meaningful merge, so I modified it a little:
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'start': [179.9989, 180.0002, 180.0016, 181.3, 181.5, 181.7],
                    'end': [180.0002, 180.0016, 180.003, 181.5, 185.7, 181.8],
                    'toc': [155.0, 152.0, 151.0, 150.0, 149.0, 148.0]})
df1['Rank'] = np.arange(len(df1))
new_df = pd.merge_asof(df1.sort_values('time'), df2,
                       left_on='time', right_on='start')
gives you:
time browncarbon blackcarbon Rank start end toc
0 181.3809 0.166545 0.001217 1 181.3 181.5 150.0
1 181.6197 0.071581 NaN 2 181.5 185.7 149.0
2 181.7335 0.105270 NaN 0 181.7 181.8 148.0
from which you can drop the extra columns and sort_values on Rank. For example:
new_df.sort_values('Rank').drop(['Rank','start','end'], axis=1)
gives:
time browncarbon blackcarbon toc
2 181.7335 0.105270 NaN 148.0
0 181.3809 0.166545 0.001217 150.0
1 181.6197 0.071581 NaN 149.0

Fill the Null values of the first row of dataframe with 100 [duplicate]

This question already has answers here: pandas fillna not working
I have a dataframe which looks like this:
51183 53423 51989 52483 51342
100 NaN NaN 83.33 NaN
NaN NaN 50 25 12.5
Here, '51183', '53423', ... are column names. I want to fill the null values present in the first row with 100.
I tried doing this:
df[:1].fillna(100)
It changes the null values in the first row to 100, but it doesn't update the dataframe itself.
I want the result to look like this:
51183 53423 51989 52483 51342
100 100 100 83.33 100
NaN NaN 50 25 12.5
If you could help me achieve that, I'd greatly appreciate it.
To update the row, try this:
df[:1] = df[:1].fillna(100)
Your attempt was almost OK: df[:1] gets the initial row, but as a copy. Then .fillna(100) changes all NaN values to 100 in that copy, not in the original table.
An attempt to add inplace=True:
df[:1].fillna(100, inplace=True)
does the job, but also issues a SettingWithCopyWarning. A method that avoids the warning is e.g. to use .iloc and then .fillna:
df.iloc[0].fillna(100, inplace=True)
Note, however, that this relies on df.iloc[0] being a view of the frame; under pandas' copy-on-write behaviour it no longer updates the frame, so the assignment form above is the safer choice.
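A quick runnable check of the assignment variant on the frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({51183: [100, np.nan], 53423: [np.nan, np.nan],
                   51989: [np.nan, 50], 52483: [83.33, 25],
                   51342: [np.nan, 12.5]})

df[:1] = df[:1].fillna(100)
print(df)   # first row: 100, 100, 100, 83.33, 100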
