Fixing Pandas NaN when making a new column? - python-3.x

I have two pandas DataFrames:
id  volume
1   100
2   200
3   300
and
id  2020-07-01  2020-07-02  ...
1   12          14
2   5           1
3   7           8
I am trying to make a new column in the first table based on the values in the second table.
df['Total_Change'] = df2.iloc[:, 0] - df2.iloc[:, -1]
df['Change_MoM'] = df2.iloc[:, -2] - df2.iloc[:, -1]
This works, but the values are all shifted down by one row, so that the first value is NaN and the last value is lost; my result is
id  volume  Total_Change  Change_MoM
1   100     NaN           NaN
2   200     -2            -2
3   300     4             4
Why is this happening? I've already double-checked that the df2.iloc statements are grabbing the correct values, but I don't understand why my first table's values are shifted down a row. I've also tried shifting the table up by one, but that left a NaN at the bottom.
The two tables are the same size. To be clear, I want to know how to prevent the NaN from occurring in the first place, not to replace it with some other value.

Both dfs have a different index; a quick fix is to add reset_index():
df=df.reset_index(drop=True)
df2=df2.reset_index(drop=True)
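For illustration, a minimal sketch of the misalignment (an assumption about the setup, suggested by the output shown: df has the default RangeIndex while df2 is indexed by id):
import pandas as pd

# df uses the default RangeIndex (0, 1, 2); df2 is indexed by id (1, 2, 3)
df = pd.DataFrame({'id': [1, 2, 3], 'volume': [100, 200, 300]})
df2 = pd.DataFrame({'2020-07-01': [12, 5, 7], '2020-07-02': [14, 1, 8]},
                   index=[1, 2, 3])

# assignment aligns on index labels: label 0 has no match in df2 (NaN),
# and the row labelled 3 in df2 is silently dropped
df['Change_MoM'] = df2.iloc[:, -2] - df2.iloc[:, -1]

# after resetting both indexes, the rows line up positionally
df = df.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
df['Change_MoM'] = df2.iloc[:, -2] - df2.iloc[:, -1]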

Related

Take the mean of n numbers in a DataFrame column and "drag" formula down similar to Excel

I'm trying to take the mean of n numbers in a pandas DataFrame column and "drag" the formula down each row to get the respective mean.
Let's say there are 6 rows of data with "Numbers" in column A and "Averages" in column B. I want to take the average of A1:A2, then "drag" that formula down to get the average of A2:A3, A3:A4, etc.
list = [55,6,77,75,9,127,13]
finallist = pd.DataFrame(list)
finallist.columns = ['Numbers']
Below gives me the average of rows 0:2 in the Numbers column. So calling out the rows with .iloc[0:2] works, but when I try to shift down a row it doesn't work:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2])
Below I'm trying to take the average of the first two rows, then shift down by 1 as you move down the rows, but I get a value of NaN:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2].shift(1))
I expected the .iloc[0:2].shift(1) to shift the mean function down 1 row but still apply to 2 total rows, but I got a value of NaN.
What's happening in your shift(1) approach is that you're shifting the values in your data "down" by one position, so this code:
df['Numbers'].iloc[0:2].shift(1)
Produces the output:
0 NaN
1 55.0
Then you take the average of these two, which evaluates to NaN, and then you assign that single value to every element of the Averages Series here:
df['Averages'] = statistics.mean(df['Numbers'].iloc[0:2].shift(1))
You can instead use rolling() combined with mean() to get a sliding average across the entire data frame like this:
import pandas as pd
values = [55,6,77,75,9,127,13]
df = pd.DataFrame(values)
df.columns = ['Numbers']
df['Averages'] = df['Numbers'].rolling(2, min_periods=1).mean()
This produces the following output:
   Numbers  Averages
0       55      55.0
1        6      30.5
2       77      41.5
3       75      76.0
4        9      42.0
5      127      68.0
6       13      70.0
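Note that rolling() aligns each window's result with the window's last row. If the goal is the Excel-style drag where the average of A1:A2 lands on the first row, one option (an assumption about the intended layout) is to shift the result back up:
# average of rows i and i+1, stored on row i; the last row has no
# following value, so it ends up NaN
df['Averages'] = df['Numbers'].rolling(2).mean().shift(-1)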

How to join two dataframes for which column time values are within a certain range and are not datetime or timestamp objects?

I have two dataframes as shown below:
time browncarbon blackcarbon
181.7335 0.105270 NaN
181.3809 0.166545 0.001217
181.6197 0.071581 NaN
422 rows x 3 columns
start end toc
179.9989 180.0002 155.0
180.0002 180.0016 152.0
180.0016 180.0030 151.0
1364 rows x 3 columns
The first dataframe has a time column with an instant every four minutes. The second dataframe has two time columns spaced every two minutes. These time columns do not start and end at the same instants; however, they contain data collected over the same day. How could I make another dataframe containing:
time browncarbon blackcarbon toc
422 rows X 4 columns
There is a related answer on Stack Overflow; however, it is applicable only when the time columns are datetime or timestamp objects. The link is: How to join two dataframes for which column values are within a certain range?
Addendum 1: The multiple start and end rows that get encapsulated into one of the time rows should still correspond to one toc row, as they do right now; however, that value should be the average of the multiple toc rows, which is not presently the case.
Addendum 2: Merging two pandas dataframes with complex conditions
We create an artificial key column to do an outer merge and get the Cartesian product back (every row of df1 paired with every row of df2). Then we filter down to the rows where time falls between start and end with .query.
Note: I edited the value of one row so we get a match (see row 0 in the example dataframes at the bottom).
df1.assign(key=1).merge(df2.assign(key=1), on='key', how='outer')\
.query('(time >= start) & (time <= end)')\
.drop(['key', 'start', 'end'], axis=1)
output
       time  browncarbon  blackcarbon    toc
1  180.0008      0.10527          NaN  152.0
Example dataframes used:
df1:
time browncarbon blackcarbon
0 180.0008 0.105270 NaN
1 181.3809 0.166545 0.001217
2 181.6197 0.071581 NaN
df2:
start end toc
0 179.9989 180.0002 155.0
1 180.0002 180.0016 152.0
2 180.0016 180.0030 151.0
Since the start and end intervals are mutually exclusive, we may be able to add a new column to df2 holding floor(start) (more generally, every integer between floor(start) and floor(end)), add a floor(time) column to df1, and then take a left outer join of df1 with df2 on those columns. I think that should do it, except that you may have to remove NaN values and extra columns if required. If you send me the CSV files, I may be able to send you the script. I hope I answered your question.
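A rough sketch of that bucketing idea, reusing the example frames above; it assumes (per the caveat in the paragraph) that no [start, end] interval crosses an integer boundary, otherwise df2 would need one row per covered integer:
import numpy as np

# join only rows whose times share the same integer part, then filter exactly
df1b = df1.assign(bucket=np.floor(df1['time']).astype(int))
df2b = df2.assign(bucket=np.floor(df2['start']).astype(int))
merged = (df1b.merge(df2b, on='bucket', how='left')
               .query('(time >= start) & (time <= end)')
               .drop(columns=['bucket', 'start', 'end']))
This keeps the intermediate frame close to len(df1) rows instead of the full cross product.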
Perhaps you could just convert your columns to Timestamps and then use the answer in the other question you linked:
from pandas import Timestamp
from dateutil.relativedelta import relativedelta as rd
def to_timestamp(x):
    return Timestamp(2000, 1, 1) + rd(days=x)
df['start_time'] = df.start.apply(to_timestamp)
df['end_time'] = df.end.apply(to_timestamp)
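Presumably the same conversion would then be applied to the first frame before using the linked answer:
# same conversion for the other frame's time column
df1['time'] = df1['time'].apply(to_timestamp)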
Your second data frame is too short to produce a meaningful merge, so I modified it a little:
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'start': [179.9989, 180.0002, 180.0016, 181.3, 181.5, 181.7],
                    'end': [180.0002, 180.0016, 180.003, 181.5, 185.7, 181.8],
                    'toc': [155.0, 152.0, 151.0, 150.0, 149.0, 148.0]})
df1['Rank'] = np.arange(len(df1))
new_df = pd.merge_asof(df1.sort_values('time'), df2,
                       left_on='time',
                       right_on='start')
gives you:
time browncarbon blackcarbon Rank start end toc
0 181.3809 0.166545 0.001217 1 181.3 181.5 150.0
1 181.6197 0.071581 NaN 2 181.5 185.7 149.0
2 181.7335 0.105270 NaN 0 181.7 181.8 148.0
from which you can drop the extra columns and sort_values on Rank. For example:
new_df.sort_values('Rank').drop(['Rank','start','end'], axis=1)
gives:
time browncarbon blackcarbon toc
2 181.7335 0.105270 NaN 148.0
0 181.3809 0.166545 0.001217 150.0
1 181.6197 0.071581 NaN 149.0
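Note that neither answer handles Addendum 1 (averaging toc when several start/end rows fall inside a single time row). One way, sketched against the filtered frame from the query approach (called out in the cross-merge sketch above), is a groupby on the df1 columns:
# average toc over all intervals captured by the same time row;
# dropna=False (pandas >= 1.1) keeps rows where blackcarbon is NaN
out = (out.groupby(['time', 'browncarbon', 'blackcarbon'], dropna=False)['toc']
          .mean()
          .reset_index())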

Fill the Null values of the first row of dataframe with 100 [duplicate]

This question already has answers here:
pandas fillna not working
(5 answers)
Closed 3 years ago.
I have a dataframe which looks like this:
51183 53423 51989 52483 51342
100 NaN NaN 83.33 NaN
NaN NaN 50 25 12.5
Here, '51183', '53423', ... are column names. I want to fill the null values present in the first row with 100.
I tried doing this:
df[:1].fillna(100)
It changes the null values in the first row to 100, but it doesn't update them in the dataframe.
I want the result to look like this:
51183 53423 51989 52483 51342
100 100 100 83.33 100
NaN NaN 50 25 12.5
If you could help me achieve that , I'll greatly appreciate it.
To update the row, try this:
df[:1] = df[:1].fillna(100)
Your try was almost OK.
df[:1] gets the initial row, but treats it as a copy of this row.
Then .fillna(100) changes all NaN values to 100, but in this copy,
not in the table.
An attempt to add inplace=True:
df[:1].fillna(100, inplace=True)
does the job, but also issues a SettingWithCopyWarning.
A method to do the job without this warning is e.g. to use .iloc and then .fillna:
df.iloc[0].fillna(100, inplace=True)
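On newer pandas versions with copy-on-write enabled, inplace modification of a row selection like df.iloc[0] no longer propagates back to the frame, so a more future-proof variant is to assign the filled row back:
# explicit assignment, warning-free and independent of copy-on-write
df.iloc[0] = df.iloc[0].fillna(100)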

Percentage of values when one column has values and other column is null

Maybe this is a duplicate of another question, but I am not able to solve the problem.
I have transaction data with 100 features and 2.3 million rows. For every combination of columns, I want to find the percentage of rows where one column has a value and the other column is null.
Example:
A B C D
1 NA 2 3
2 4 5 6
NA 5 6 7
8 2 NA NA
9 8 7 6
So output should be:
When A has values, B is null 1/4 = 0.25 of the time
When A has values, C is null 1/4 = 0.25 of the time
Similarly for every other combination of columns and create a dataframe for it.
I tried the combinations function in Python, but it's not giving the desired result:
itertools.combinations(daf.columns, n)
You can write two for loops to iterate over pairs of columns and then compare, as sketched below.
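A sketch of that idea using itertools.permutations with vectorized masks instead of row loops; the denominator is assumed to be the count of non-null values in the first column, which reproduces the 1/4 in the example:
import itertools
import numpy as np
import pandas as pd

# the example data from the question
df = pd.DataFrame({'A': [1, 2, np.nan, 8, 9],
                   'B': [np.nan, 4, 5, 2, 8],
                   'C': [2, 5, 6, np.nan, 7],
                   'D': [3, 6, 7, np.nan, 6]})

rows = []
for a, b in itertools.permutations(df.columns, 2):
    has_a = df[a].notna()
    # fraction of rows where column a has a value but column b is null,
    # out of the rows where a has a value
    rows.append({'has_value': a, 'is_null': b,
                 'fraction': (has_a & df[b].isna()).sum() / has_a.sum()})
result = pd.DataFrame(rows)
On 100 columns this is 9,900 ordered pairs, each computed with two vectorized masks, which should stay tractable even at 2.3 million rows.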

How to use extractall in Pandas and get a new column with the extracted strings?

I have a data frame of 15 columns from a CSV file. I am trying to extract one part of the text of a column and create a new column containing that information for each row. Each row of 'phospho' should have only one match for my extractall pattern. Now I am trying to add the result to my data frame, but I get the error:
TypeError: incompatible index of inserted column with frame index
The dataset has two columns with names and six columns with values (like 65.98, for example).
Ex:
accession  sequence                 modification                        phospho               CON_1  CON_2  CON_3  LIF1  LIF2  LIF3
P18767     [R].GAAQNIIPASTGAAK.[A]  1xTMT6plex[K15];1xTMT6plex[N-Term]  1xPhospho [S3(98.3)]  ...
Here is the freaking code:
a = pmap1['phospho'].str.extractall(r'([STEHRYD]\d*)')
pmap1['phosphosites'] = a
Thanks!
I created pmap1 using the following sample data:
pmap1 = pd.DataFrame(data=[['S34T44X', 1], ['E23H78Y', 2],
                           ['R49Y81Z', 3], ['D20U23X', 4]],
                     columns=['phospho', 'nn'])
When you extract all matches:
a = pmap1['phospho'].str.extractall(r'([STEHRYD]\d*)')
the result is:
           0
  match
0 0      S34
  1      T44
1 0      E23
  1      H78
  2        Y
2 0      R49
  1      Y81
3 0      D20
Note that:
The result is of DataFrame type (with a single column named 0).
It contains eight rows, so it is not clear into which row of pmap1 particular matches should be inserted.
The index is actually a MultiIndex with 2 levels:
The first (unnamed) level is the index of the source row,
The second level (named match) contains the number of
match within the current row.
E.g. in the row with index 0, 2 matches were found:
S34 - No 0,
T44 - No 1.
So you cannot directly save a as a new column of pmap1,
because pmap1 has an "ordinary" index while a has a MultiIndex,
incompatible with the index of pmap1.
This is exactly what the error message says.
If you want somehow "add" a to pmap1, you can e.g. "break" each match
as a separate column the following way:
a2 = a.unstack()
Gives the result:
           0
match    0    1    2
0      S34  T44  NaN
1      E23  H78    Y
2      R49  Y81  NaN
3      D20  NaN  NaN
where the columns are a MultiIndex, so to drop its first
level, run:
a2.columns = a2.columns.droplevel()
The result is:
match    0    1    2
0      S34  T44  NaN
1      E23  H78    Y
2      R49  Y81  NaN
3      D20  NaN  NaN
Then you can perform the actual join, executing:
pmap1.join(a2)
The result is:
   phospho  nn    0    1    2
0  S34T44X   1  S34  T44  NaN
1  E23H78Y   2  E23  H78    Y
2  R49Y81Z   3  R49  Y81  NaN
3  D20U23X   4  D20  NaN  NaN
If you are unhappy with numbers as column names, you can change them as you wish.
If you are unhappy with NaN values for "missing" matches
(rows where fewer matches were found than in other rows),
add .fillna('') to the last instruction.
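For example, a possible renaming (the phosphosite_ prefix is only an illustration):
a2.columns = [f'phosphosite_{i}' for i in a2.columns]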
Edit
There is a shorter solution:
After you created a, you can do the whole rest of processing
with a single instruction:
pmap1.join(a[0].unstack()).fillna('')
