Divide all columns of a dataframe by a smaller one by key without merging - python-3.x

I know this is a frequently asked question, but I have been searching for quite some time and cannot find the answer.
I have a dataset like this:
ID. denominator
A 2
B 4
C 5
and another one like this:
ID.  Value1  Value2  Value3  Value4 ...
A    2       1       4       8      ...
B    4       2       6       0      ...
C    5       5       7       7      ...
I want to divide every value column of the second dataset by the denominator from the first dataset, matched on the ID, and replace the values in the second dataset with the result of the division.
Also, the datasets are big, so I don't want to do it by merging and then dividing, as in some answers.
Datasets:
df1 = pd.DataFrame({
'ID.':list('abc'),
'denominator':[2, 4, 5]
})
df2 = pd.DataFrame({
'ID.':list('abc'),
'var2':[1,0.5,7],
'var3':[7,8,9],
'var1':[1,3,1]
})

You can use set_index:
df2.set_index('ID.').div(df1.set_index('ID.')['denominator'], axis=0)
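As a rough sketch of that one-liner applied to the sample frames from the question: div with axis=0 aligns the denominator Series to the rows by index label, so no merged intermediate frame is built. The names denominators and result are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID.': list('abc'),
    'denominator': [2, 4, 5],
})
df2 = pd.DataFrame({
    'ID.': list('abc'),
    'var2': [1, 0.5, 7],
    'var3': [7, 8, 9],
    'var1': [1, 3, 1],
})

# Series of denominators keyed by ID (illustrative name)
denominators = df1.set_index('ID.')['denominator']

# Row-wise division: each row of df2 is divided by the denominator with the same ID
result = df2.set_index('ID.').div(denominators, axis=0).reset_index()
print(result)
```

An ID that is present in df2 but missing from df1 would come out as NaN rather than raising an error, so it may be worth checking for NaN afterwards.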

Related

How do I validate data mapping between 2 data frames in pandas

I am trying to validate a data mapping between two data frames for specific columns. I need to validate the following:
whether values in a specific column in df1 match the mapping in a specific column in df2.
whether values in a specific column in df1 do not match the specified mapping, i.e. df2 holds a different value.
whether values in a specific column in df1 have no match in df2.
df1 looks like this:
cp_id  cp_code
2A23   A
2A24   D
3A45   G
7A96   B
2A30   R
6A18   K
df2 looks like this:
cp_type_id  cp_type_code
2A23        8
2A24        7
3A45        3
2A44        1
6A18        8
4A08        2
The data mapping consists of sets of values, where a code may map to any value within the corresponding set, as follows:
('A','C','F','K','M') in df1 should map to (2, 8) in df2 - either 2 or 8
('B') in df1 should map to 4 in df2
('D','G','I') in df1 should map to 7 in df2
('T','U') in df1 should map to (3,5) in df2 - either 3 or 5
Note that df1 has a cp_code as R which is not mapped and that 3A45 is a mismatch. The good news is there is a unique identifier key to use.
First, I created a list for each mapping set and created a statement using merge to check for each mapping. I ended up with 3 lists and 3 statements per set, which I am not sure if this is the right way to do it.
At the end I want to combine the matches into one df that I call match, all no_matches into another df that I call no_match, and all no_mappings into another df that I call no_mapping, like the following:
Match
cp_id  cp_code  cp_type_id  cp_type_code
2A23   A        2A23        8
2A24   D        2A24        7
6A18   K        6A18        8
Mismatch
cp_id  cp_code  cp_type_id  cp_type_code
3A45   G        3A45        3
No Mapping
cp_id  cp_code  cp_type_id  cp_type_code
7A96   B        NaN         NaN
NaN    NaN      2A44        1
2A30   R        NaN         NaN
NaN    NaN      4A08        2
I am having a hard time getting the no_match case to work.
This is what I tried for no match:
filtered df1 based on the set 2 codes
filtered df2 based on not in map 2 codes
for the no mapping, I did a df merge with on='cp_id'
no_mapping_set2 = df1_filtered.merge(df2_filtered, on='cp_id', indicator = True)
With the code above, for cp_id = 'B', for example, instead of getting only 1 row back, I get a lot of duplicate rows with cp_id = 'B'.
Just to state my level, I am a beginner in Python. Any help would be appreciated.
Thank you so much for your time.
Rob
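One possible approach, sketched under assumptions rather than taken from an accepted answer: do a single outer merge on the key with indicator=True, then classify the rows. The names mapping_sets, allowed, merged and both are illustrative. The sketch assumes cp_id and cp_type_id are unique within their frames; duplicate keys would multiply rows in the merge, which may be related to the duplicates described above.

```python
import pandas as pd

df1 = pd.DataFrame({
    'cp_id': ['2A23', '2A24', '3A45', '7A96', '2A30', '6A18'],
    'cp_code': ['A', 'D', 'G', 'B', 'R', 'K'],
})
df2 = pd.DataFrame({
    'cp_type_id': ['2A23', '2A24', '3A45', '2A44', '6A18', '4A08'],
    'cp_type_code': [8, 7, 3, 1, 8, 2],
})

# Mapping sets from the question: any code on the left may map to any value on the right
mapping_sets = [
    ({'A', 'C', 'F', 'K', 'M'}, {2, 8}),
    ({'B'}, {4}),
    ({'D', 'G', 'I'}, {7}),
    ({'T', 'U'}, {3, 5}),
]
# Flatten into a lookup: cp_code -> set of allowed cp_type_code values
allowed = {code: targets for codes, targets in mapping_sets for code in codes}

# One outer merge; the _merge column records which side each row came from
merged = df1.merge(df2, left_on='cp_id', right_on='cp_type_id',
                   how='outer', indicator=True)

# Rows whose key exists on both sides either match the mapping or do not
both = merged[merged['_merge'] == 'both']
is_match = both.apply(
    lambda row: row['cp_type_code'] in allowed.get(row['cp_code'], set()),
    axis=1,
)
match = both[is_match].drop(columns='_merge')
no_match = both[~is_match].drop(columns='_merge')

# Rows whose key exists on one side only have no counterpart to validate against
no_mapping = merged[merged['_merge'] != 'both'].drop(columns='_merge')
```

With the sample data this puts 2A23, 2A24 and 6A18 in match, 3A45 in no_match, and the four unpaired keys in no_mapping.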

How to organise different datasets on Excel into the same layout/order (using pandas)

I have multiple Excel spreadsheets containing the same types of data but they are not in the same order. For example, if file 1 has the results of measurements A, B, C and D from River X printed in columns 1, 2, 3 and 4, respectively but file 2 has the same measurements taken for a different river, River Y, printed in columns 6, 7, 8, and 9 respectively, is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe (i.e. make it so that Sheet2 has the measurements for River Y printed in columns 1, 2, 3 and 4)? Sometimes the data is presented horizontally, not vertically as described above, too. If I have the same measurements for, say, 400 different rivers on 400 separate sheets, but the presentation/layout of data is erratic with regards to each individual file, it would be useful to be able to put a single order on every spreadsheet without having to manually shift columns on Excel.
Is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe?
You can get a list of columns from one of your dataframes and then sort that. Next you can use the sorted order to reorder your remaining dataframes. I've created an example below:
import pandas as pd
import numpy as np
# Create an example of your problem
root = 'River'
suffix = list('123')
cols_1 = [root + '_' + each_suffix for each_suffix in suffix]
cols_2 = [root + '_' + each_suffix for each_suffix in suffix[::-1]]  # reversed column order
data = np.arange(9).reshape(3,3)
df_1 = pd.DataFrame(columns=cols_1, data=data)
df_2 = pd.DataFrame(columns=cols_2, data=data)
df_1
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
df_2
[out] River_3 River_2 River_1
0 0 1 2
1 3 4 5
2 6 7 8
col_list = df_1.columns.to_list()  # Get a list of column names; use col_list.sort() to sort in place, or
sorted_col_list = sorted(col_list, reverse=False)  # use reverse=True to invert the order
def rearrange_df_cols(df, target_order):
    df = df[target_order]  # select the columns in the target order
    print(df)
    return df
rearrange_df_cols(df_1, sorted_col_list)
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
rearrange_df_cols(df_2, sorted_col_list)
[out] River_1 River_2 River_3
0 2 1 0
1 5 4 3
2 8 7 6
You can write a function based on the above and apply it to all of your files/sheets, provided that all column names exist in each of them (NB they must be written identically).
Sometimes the data is presented horizontally, not vertically as described above, too.
This would be better as a separate question. In principle you should check the dimensions of your data, e.g. df.shape, and based on the shape either use df.transpose() followed by your function to reorder the column names, or use your function directly.
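A minimal sketch of that idea follows. Note that it swaps the df.shape check for a check on where the expected measurement names are found, since shape alone cannot tell the orientation of a square sheet. The function name normalize_layout and the sample frames are hypothetical.

```python
import pandas as pd

def normalize_layout(df, target_order):
    """Return df with its columns in target_order, transposing first if needed."""
    names = set(target_order)
    # If the names are not in the columns but are in the index,
    # the sheet is laid out horizontally, so flip it.
    if not names.issubset(df.columns) and names.issubset(df.index):
        df = df.transpose()
    return df[target_order]

target = ['A', 'B', 'C']

# Measurements as columns, but in a different order
vertical = pd.DataFrame({'C': [3], 'A': [1], 'B': [2]})
# Measurements as rows
horizontal = pd.DataFrame({'River Y': [2, 3, 1]}, index=['B', 'C', 'A'])

frames = [normalize_layout(frame, target) for frame in (vertical, horizontal)]
for frame in frames:
    print(frame)
```

A sheet that is missing one of the target names raises a KeyError here, which is probably preferable to silently producing a misaligned layout.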

How to add series to a Dataframe? [duplicate]

I have the following indexed DataFrame with named columns and a non-contiguous row index:
a b c d
2 0.671399 0.101208 -0.181532 0.241273
3 0.446172 -0.243316 0.051767 1.577318
5 0.614758 0.075793 -0.451460 -0.012493
I would like to add a new column, 'e', to the existing data frame and do not want to change anything in the data frame (i.e., the new column always has the same length as the DataFrame).
0 -0.335485
1 -1.166658
2 -0.385571
dtype: float64
How can I add column e to the above example?
Edit 2017
As indicated in the comments and by @Alexander, currently the best method to add the values of a Series as a new column of a DataFrame could be using assign:
df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
Edit 2015
Some reported getting the SettingWithCopyWarning with this code.
However, the code still runs perfectly with the current pandas version 0.16.1.
>>> sLength = len(df1['a'])
>>> df1
a b c d
6 -0.269221 -0.026476 0.997517 1.294385
8 0.917438 0.847941 0.034235 -0.448948
>>> df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
a b c d e
6 -0.269221 -0.026476 0.997517 1.294385 1.757167
8 0.917438 0.847941 0.034235 -0.448948 2.228131
>>> pd.version.short_version
'0.16.1'
The SettingWithCopyWarning aims to inform you of a possibly invalid assignment on a copy of the DataFrame. It doesn't necessarily say you did it wrong (it can trigger false positives), but from 0.13.0 it lets you know there are more adequate methods for the same purpose. If you get the warning, just follow its advice: Try using .loc[row_index,col_indexer] = value instead
>>> df1.loc[:,'f'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
a b c d e f
6 -0.269221 -0.026476 0.997517 1.294385 1.757167 -0.050927
8 0.917438 0.847941 0.034235 -0.448948 2.228131 0.006109
>>>
In fact, this is currently the more efficient method as described in pandas docs
Original answer:
Use the original df1 indexes to create the series:
df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
This is the simple way of adding a new column: df['e'] = e
I would like to add a new column, 'e', to the existing data frame and do not change anything in the data frame. (The series always got the same length as a dataframe.)
I assume that the index values in e match those in df1.
The easiest way to initiate a new column named e, and assign it the values from your series e:
df['e'] = e.values
assign (Pandas 0.16.0+)
As of Pandas 0.16.0, you can also use assign, which assigns new columns to a DataFrame and returns a new object (a copy) with all the original columns in addition to the new ones.
df1 = df1.assign(e=e.values)
As per this example (which also includes the source code of the assign function), you can also include more than one column:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.assign(mean_a=df.a.mean(), mean_b=df.b.mean())
a b mean_a mean_b
0 1 3 1.5 3.5
1 2 4 1.5 3.5
In context with your example:
np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
mask = df1 < -0.7  # element-wise comparison, True where a value is below -0.7
df1 = df1[~mask.any(axis=1)]  # keep rows with no such value; ~ negates a boolean mask
sLength = len(df1['a'])
e = pd.Series(np.random.randn(sLength))
>>> df1
a b c d
0 1.764052 0.400157 0.978738 2.240893
2 -0.103219 0.410599 0.144044 1.454274
3 0.761038 0.121675 0.443863 0.333674
7 1.532779 1.469359 0.154947 0.378163
9 1.230291 1.202380 -0.387327 -0.302303
>>> e
0 -1.048553
1 -1.420018
2 -1.706270
3 1.950775
4 -0.509652
dtype: float64
df1 = df1.assign(e=e.values)
>>> df1
a b c d e
0 1.764052 0.400157 0.978738 2.240893 -1.048553
2 -0.103219 0.410599 0.144044 1.454274 -1.420018
3 0.761038 0.121675 0.443863 0.333674 -1.706270
7 1.532779 1.469359 0.154947 0.378163 1.950775
9 1.230291 1.202380 -0.387327 -0.302303 -0.509652
The description of this new feature when it was first introduced can be found here.
Super simple column assignment
A pandas dataframe is implemented as an ordered dict of columns.
This means that the __getitem__ [] can not only be used to get a certain column, but __setitem__ [] = can be used to assign a new column.
For example, this dataframe can have a column added to it by simply using the [] accessor
size name color
0 big rose red
1 small violet blue
2 small tulip red
3 small harebell blue
df['protected'] = ['no', 'no', 'no', 'yes']
size name color protected
0 big rose red no
1 small violet blue no
2 small tulip red no
3 small harebell blue yes
Note that this works even if the index of the dataframe is off.
df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
size name color protected
3 big rose red no
2 small violet blue no
1 small tulip red no
0 small harebell blue yes
[]= is the way to go, but watch out!
However, if you have a pd.Series and try to assign it to a dataframe where the indexes are off, you will run in to trouble. See example:
df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
size name color protected
3 big rose red yes
2 small violet blue no
1 small tulip red no
0 small harebell blue no
This is because a pd.Series by default has an index enumerated from 0 to n. And the pandas [] = method tries to be "smart"
What actually is going on.
When you use the [] = method with a Series on the right-hand side, pandas quietly aligns the Series to the dataframe's index, which is effectively a left join of the index of the left-hand dataframe against the index of the right-hand Series: df['column'] = series
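A small sketch of the difference between assigning by label and assigning by position, using a cut-down version of the flower table above. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['rose', 'violet', 'tulip', 'harebell']},
                  index=[3, 2, 1, 0])
labels = pd.Series(['no', 'no', 'no', 'yes'])  # default index 0..3

# Assigning a Series: values are matched to rows by index label
by_label = df.copy()
by_label['protected'] = labels

# Assigning a plain array: values are placed top to bottom, ignoring labels
by_position = df.copy()
by_position['protected'] = labels.values

# A Series with a label the dataframe does not have: no new row appears
extra = pd.Series(['no', 'no', 'no', 'yes', 'maybe'], index=[0, 1, 2, 3, 99])
with_extra = df.copy()
with_extra['protected'] = extra

print(by_label)
print(by_position)
print(with_extra)
```

In by_label the 'yes' lands on the row labelled 3 (the rose), while in by_position it lands on the last row (the harebell).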
Side note
This quickly causes cognitive dissonance, since the []= method is trying to do a lot of different things depending on the input, and the outcome cannot be predicted unless you just know how pandas works. I would therefore advise against using []= in code bases, but when exploring data in a notebook, it is fine.
Going around the problem
If you have a pd.Series and want it assigned from top to bottom, or if you are writing production code and you are not sure of the index order, it is worth safeguarding against this kind of issue.
You could downcast the pd.Series to a np.ndarray or a list, this will do the trick.
df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values
or
df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))
But this is not very explicit.
Some coder may come along and say "Hey, this looks redundant, I'll just optimize this away".
Explicit way
Setting the index of the pd.Series to be the index of the df is explicit.
df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)
Or more realistically, you probably have a pd.Series already available.
protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index
3 no
2 no
1 no
0 yes
Can now be assigned
df['protected'] = protected_series
size name color protected
3 big rose red no
2 small violet blue no
1 small tulip red no
0 small harebell blue yes
Alternative way with df.reset_index()
Since the index dissonance is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index, this should be faster, but it is not very clean, since your function now probably does two things.
df = df.reset_index(drop=True)  # reset_index returns a new object, so reassign
protected_series = protected_series.reset_index(drop=True)
df['protected'] = protected_series
size name color protected
0 big rose red no
1 small violet blue no
2 small tulip red no
3 small harebell blue yes
Note on df.assign
While df.assign makes it more explicit what you are doing, it actually has all the same problems as the above []=
df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
size name color protected
3 big rose red yes
2 small violet blue no
1 small tulip red no
0 small harebell blue no
Just watch out with df.assign that your column is not called self. It will cause errors. This makes df.assign smelly, since the function has this kind of artifact.
df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
TypeError: assign() got multiple values for keyword argument 'self'
You may say, "Well, I'll just not use self then". But who knows how this function changes in the future to support new arguments. Maybe your column name will be an argument in a new update of pandas, causing problems with upgrading.
It seems that in recent Pandas versions the way to go is to use df.assign:
df1 = df1.assign(e=np.random.randn(sLength))
It doesn't produce SettingWithCopyWarning.
Doing this directly via NumPy will be the most efficient:
df1['e'] = np.random.randn(sLength)
Note my original (very old) suggestion was to use map (which is much slower):
df1['e'] = df1['a'].map(lambda x: np.random.random())
Easiest ways:-
data['new_col'] = list_of_values
data.loc[ : , 'new_col'] = list_of_values
This way you avoid what is called chained indexing when setting new values in a pandas object.
If you want to set the whole new column to an initial base value (e.g. None), you can do this: df1['e'] = None
This actually would assign "object" type to the cell. So later you're free to put complex data types, like list, into individual cells.
I got the dreaded SettingWithCopyWarning, and it wasn't fixed by using the iloc syntax. My DataFrame was created by read_sql from an ODBC source. Using a suggestion by lowtech above, the following worked for me:
df.insert(len(df.columns), 'e', pd.Series(np.random.randn(sLength), index=df.index))
This worked fine to insert the column at the end. I don't know if it is the most efficient, but I don't like warning messages. I think there is a better solution, but I can't find it, and I think it depends on some aspect of the index.
Note that this only works once and will raise an error if you try to overwrite an existing column.
Note As above and from 0.16.0 assign is the best solution. See documentation http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html#pandas.DataFrame.assign
Works well for data flow type where you don't overwrite your intermediate values.
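As an illustration of that data-flow style, here is a minimal sketch in which each step returns a new frame and the input is never modified. The column names total and share_a are made up for the example.

```python
import pandas as pd

raw = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

# Each assign returns a new DataFrame; the lambda receives the frame
# produced by the previous step, so later columns can use earlier ones.
result = (
    raw
    .assign(total=lambda d: d['a'] + d['b'])
    .assign(share_a=lambda d: d['a'] / d['total'])
)

print(result)
print(raw.columns.tolist())  # raw still has only its original columns
```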
First create a Python list, list_of_e, that holds the relevant data.
Use this:
df['e'] = list_of_e
To create an empty column
df['i'] = None
If the column you are trying to add is a Series variable, then just:
df["new_columns_name"] = series_variable_name  # this will do it for you
This works well even if you are replacing an existing column: just use the name of the column you want to replace as new_columns_name. It will overwrite the existing column data with the new Series data.
If the data frame and Series object have the same index, pandas.concat also works here:
import pandas as pd
df
# a b c d
#0 0.671399 0.101208 -0.181532 0.241273
#1 0.446172 -0.243316 0.051767 1.577318
#2 0.614758 0.075793 -0.451460 -0.012493
e = pd.Series([-0.335485, -1.166658, -0.385571])
e
#0 -0.335485
#1 -1.166658
#2 -0.385571
#dtype: float64
# here we need to give the series object a name which converts to the new column name
# in the result
df = pd.concat([df, e.rename("e")], axis=1)
df
# a b c d e
#0 0.671399 0.101208 -0.181532 0.241273 -0.335485
#1 0.446172 -0.243316 0.051767 1.577318 -1.166658
#2 0.614758 0.075793 -0.451460 -0.012493 -0.385571
In case they don't have the same index:
e.index = df.index
df = pd.concat([df, e.rename("e")], axis=1)
Foolproof:
df.loc[:, 'NewCol'] = 'New_Val'
Example:
df = pd.DataFrame(data=np.random.randn(20, 4), columns=['A', 'B', 'C', 'D'])
df
A B C D
0 -0.761269 0.477348 1.170614 0.752714
1 1.217250 -0.930860 -0.769324 -0.408642
2 -0.619679 -1.227659 -0.259135 1.700294
3 -0.147354 0.778707 0.479145 2.284143
4 -0.529529 0.000571 0.913779 1.395894
5 2.592400 0.637253 1.441096 -0.631468
6 0.757178 0.240012 -0.553820 1.177202
7 -0.986128 -1.313843 0.788589 -0.707836
8 0.606985 -2.232903 -1.358107 -2.855494
9 -0.692013 0.671866 1.179466 -1.180351
10 -1.093707 -0.530600 0.182926 -1.296494
11 -0.143273 -0.503199 -1.328728 0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832 0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15 0.955298 -1.430019 1.434071 -0.088215
16 -0.227946 0.047462 0.373573 -0.111675
17 1.627912 0.043611 1.743403 -0.012714
18 0.693458 0.144327 0.329500 -0.655045
19 0.104425 0.037412 0.450598 -0.923387
df.drop([3, 5, 8, 10, 18], inplace=True)
df
A B C D
0 -0.761269 0.477348 1.170614 0.752714
1 1.217250 -0.930860 -0.769324 -0.408642
2 -0.619679 -1.227659 -0.259135 1.700294
4 -0.529529 0.000571 0.913779 1.395894
6 0.757178 0.240012 -0.553820 1.177202
7 -0.986128 -1.313843 0.788589 -0.707836
9 -0.692013 0.671866 1.179466 -1.180351
11 -0.143273 -0.503199 -1.328728 0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832 0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15 0.955298 -1.430019 1.434071 -0.088215
16 -0.227946 0.047462 0.373573 -0.111675
17 1.627912 0.043611 1.743403 -0.012714
19 0.104425 0.037412 0.450598 -0.923387
df.loc[:, 'NewCol'] = 0
df
A B C D NewCol
0 -0.761269 0.477348 1.170614 0.752714 0
1 1.217250 -0.930860 -0.769324 -0.408642 0
2 -0.619679 -1.227659 -0.259135 1.700294 0
4 -0.529529 0.000571 0.913779 1.395894 0
6 0.757178 0.240012 -0.553820 1.177202 0
7 -0.986128 -1.313843 0.788589 -0.707836 0
9 -0.692013 0.671866 1.179466 -1.180351 0
11 -0.143273 -0.503199 -1.328728 0.610552 0
12 -0.923110 -1.365890 -1.366202 -1.185999 0
13 -2.026832 0.273593 -0.440426 -0.627423 0
14 -0.054503 -0.788866 -0.228088 -0.404783 0
15 0.955298 -1.430019 1.434071 -0.088215 0
16 -0.227946 0.047462 0.373573 -0.111675 0
17 1.627912 0.043611 1.743403 -0.012714 0
19 0.104425 0.037412 0.450598 -0.923387 0
One thing to note, though, is that if you do
df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
this will effectively be a left join on the df1.index. So if you want to have an outer join effect, my probably imperfect solution is to create a dataframe with index values covering the universe of your data, and then use the code above on that dataframe. For example,
data = pd.DataFrame(index=all_possible_values)
data['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
To insert a new column at a given location (0 <= loc <= number of columns) in a data frame, just use DataFrame.insert:
DataFrame.insert(loc, column, value)
Therefore, if you want to add the column e at the end of a data frame called df, you can use:
e = [-0.335485, -1.166658, -0.385571]
df.insert(loc=len(df.columns), column='e', value=e)
value can be a Series, an integer (in which case all cells get filled with this one value), or an array-like structure
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html
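A short sketch of insert in use, on a made-up three-row frame. It also shows the behaviour noted in another answer: insert modifies the frame in place and refuses to overwrite an existing column.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Append at the end: loc equals the current number of columns
df.insert(loc=len(df.columns), column='e',
          value=[-0.335485, -1.166658, -0.385571])

# Insert at the front; a scalar is broadcast to every row
df.insert(loc=0, column='flag', value=True)

# Inserting a column name that already exists raises ValueError
try:
    df.insert(loc=0, column='e', value=0)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(df)
print(duplicate_rejected)
```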
Let me just add that, just like for hum3, .loc didn't solve the SettingWithCopyWarning and I had to resort to df.insert(). In my case false positive was generated by "fake" chain indexing dict['a']['e'], where 'e' is the new column, and dict['a'] is a DataFrame coming from dictionary.
Also note that if you know what you are doing, you can switch off the warning using
pd.options.mode.chained_assignment = None
and then use one of the other solutions given here.
Before assigning a new column, if you have indexed data, you need to sort the index. At least in my case I had to:
data.set_index(['index_column'], inplace=True)
"if index is unsorted, assignment of a new column will fail"
data.sort_index(inplace = True)
data.loc['index_value1', 'column_y'] = np.random.randn(data.loc['index_value1', 'column_x'].shape[0])
To add a new column, 'e', to the existing data frame
df1.loc[:,'e'] = pd.Series(np.random.randn(sLength), index=df1.index)
I was looking for a general way of adding a column of numpy.nans to a dataframe without getting the dumb SettingWithCopyWarning.
From the following:
the answers here
this question about passing a variable as a keyword argument
this method for generating a numpy array of NaNs in-line
I came up with this:
col = 'column_name'
df = df.assign(**{col:numpy.full(len(df), numpy.nan)})
For the sake of completeness - yet another solution using DataFrame.eval() method:
Data:
In [44]: e
Out[44]:
0 1.225506
1 -1.033944
2 -0.498953
3 -0.373332
4 0.615030
5 -0.622436
dtype: float64
In [45]: df1
Out[45]:
a b c d
0 -0.634222 -0.103264 0.745069 0.801288
4 0.782387 -0.090279 0.757662 -0.602408
5 -0.117456 2.124496 1.057301 0.765466
7 0.767532 0.104304 -0.586850 1.051297
8 -0.103272 0.958334 1.163092 1.182315
9 -0.616254 0.296678 -0.112027 0.679112
Solution:
In [46]: df1.eval("e = @e.values", inplace=True)
In [47]: df1
Out[47]:
a b c d e
0 -0.634222 -0.103264 0.745069 0.801288 1.225506
4 0.782387 -0.090279 0.757662 -0.602408 -1.033944
5 -0.117456 2.124496 1.057301 0.765466 -0.498953
7 0.767532 0.104304 -0.586850 1.051297 -0.373332
8 -0.103272 0.958334 1.163092 1.182315 0.615030
9 -0.616254 0.296678 -0.112027 0.679112 -0.622436
If you just need to create a new empty column then the shortest solution is:
df.loc[:, 'e'] = pd.Series()
The following is what I did... But I'm pretty new to pandas and really Python in general, so no promises.
df = pd.DataFrame([[1, 2], [3, 4], [5,6]], columns=list('AB'))
newCol = [3,5,7]
newName = 'C'
values = np.insert(df.values,df.shape[1],newCol,axis=1)
header = df.columns.values.tolist()
header.append(newName)
df = pd.DataFrame(values,columns=header)
If we want to assign a scalar value, e.g. 10, to all rows of a new column in a df:
df = df.assign(new_col=lambda x: 10)  # x is the whole DataFrame passed in to the lambda func
df will now have new column 'new_col' with value=10 in all rows.
If you get the SettingWithCopyWarning, an easy fix is to copy the DataFrame you are trying to add a column to.
df = df.copy()
df['col_name'] = values
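A minimal sketch of the situation this addresses, assuming the frame you are adding a column to was produced by filtering another frame. The names subset and e are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, -2, 3], 'b': [4, 5, 6]})

# Filtering yields an object that pandas may treat as derived from df;
# an explicit copy makes it an independent DataFrame.
subset = df[df['a'] > 0].copy()

# Adding a column to the copy does not touch the original frame
subset['e'] = [10, 20]

print(subset)
print(df.columns.tolist())
```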
x=pd.DataFrame([1,2,3,4,5])
y=pd.DataFrame([5,4,3,2,1])
z=pd.concat([x,y],axis=1)
4 ways you can insert a new column to a pandas DataFrame
using simple assignment and the insert(), assign() and concat() methods.
import pandas as pd
df = pd.DataFrame({
'col_a':[True, False, False],
'col_b': [1, 2, 3],
})
print(df)
col_a col_b
0 True 1
1 False 2
2 False 3
Using simple assignment
ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
print(ser)
0 a
1 b
2 c
dtype: object
df['col_c'] = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
print(df)
col_a col_b col_c
0 True 1 NaN
1 False 2 a
2 False 3 b
Using assign()
e = pd.Series([1.0, 3.0, 2.0], index=[0, 2, 1])
ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
df.assign(col_c=ser.values, col_b=e.values)
col_a col_b col_c
0 True 1.0 a
1 False 3.0 b
2 False 2.0 c
Using insert()
df.insert(len(df.columns), 'col_c', ser.values)
print(df)
col_a col_b col_c
0 True 1 a
1 False 2 b
2 False 3 c
Using concat()
ser = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])
df = pd.concat([df, ser.rename('col_c')], axis=1)
print(df)
col_a col_b col_c
0 True 1.0 NaN
1 False 2.0 NaN
2 False 3.0 NaN
10 NaN NaN a
20 NaN NaN b
30 NaN NaN c
This is a special case of adding a new column to a pandas dataframe. Here, I am adding a new feature/column based on the data in an existing column of the dataframe.
So, let our dataframe have the columns 'feature_1', 'feature_2' and 'probability_score', and suppose we have to add a new column 'predicted_class' based on the data in column 'probability_score'.
I will use the map() method of the Series and define a function of my own which implements the logic for giving a particular class label to every row in my dataframe.
data = pd.read_csv('data.csv')
def myFunction(x):
    # implement your logic here; the 0.5 threshold and the labels 1/0 are placeholders
    if x >= 0.5:
        return 1
    return 0
variable_1 = data['probability_score']
predicted_class = variable_1.map(myFunction)
data['predicted_class'] = predicted_class
# check the dataframe: the new column is derived from the existing column for each row
data.head()
Whenever you add a Series object as new column to an existing DF, you need to make sure that they both have the same index.
Then add it to the DF
e_series = pd.Series([-0.335485, -1.166658,-0.385571])
print(e_series)
e_series.index = d_f.index
d_f['e'] = e_series
d_f
import pandas as pd
# Define a dictionary containing data
data = {'a': [0,0,0.671399,0.446172,0,0.614758],
'b': [0,0,0.101208,-0.243316,0,0.075793],
'c': [0,0,-0.181532,0.051767,0,-0.451460],
'd': [0,0,0.241273,1.577318,0,-0.012493]}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# Declare a list that is to be converted into a column
col_e = [-0.335485,-1.166658,-0.385571,0,0,0]
# add column 'e'
df['e'] = col_e
# Observe the result
df

How to remove duplicates rows by same values in different order in dataframe by pandas

How can I remove the duplicates in df? df has only one column. Here "60,25" and "25,60" are a pair of duplicated rows. The output should be a new df. For each pair of duplicated rows, the kept row should have the form "A,B" where A < B, and the removed row should be the one where A > B. In this case, "25,60" and "80,123" should be kept. A unique row should stay as it is.
IIUC, using get_dummies with duplicated
df[~df.A.str.get_dummies(sep=',').duplicated()]
Out[956]:
A
0 A,C
1 A,B
4 X,Y,Z
Data input
df
Out[957]:
A
0 A,C
1 A,B
2 C,A
3 B,A
4 X,Y,Z
5 Z,Y,X
Update: the OP changed the question into a completely different one
newdf=df.A.str.get_dummies(sep=',')
newdf[~newdf.duplicated()].dot(newdf.columns+',').str[:-1]
Out[976]:
0 25,60
1 123,37
dtype: object
I'd do a combination of things.
Use pandas.Series.str.split to split by commas
Use apply(frozenset) to get a hashable set such that I can use duplicated
Use pandas.Series.duplicated with keep='last'
df[~df.A.str.split(',').apply(frozenset).duplicated(keep='last')]
A
1 123,17
3 80,123
4 25,60
5 25,42
Addressing comments
df.A.apply(
lambda x: tuple(sorted(map(int, x.split(','))))
).drop_duplicates().apply(
lambda x: ','.join(map(str, x))
)
0 25,60
1 17,123
2 80,123
5 25,42
Name: A, dtype: object
Setup
df = pd.DataFrame(dict(
A='60,25 123,17 123,80 80,123 25,60 25,42'.split()
))

Pandas Calculate CAGR with Slicing (missing values)

As a follow-up to this question,
I'd like to calculate the CAGR from a pandas data frame such as this, where there are some missing data values:
df = pd.DataFrame({'A' : ['1','2','3','7'],
'B' : [7,6,np.nan,4],
'C' : [5,6,7,1],
'D' : [np.nan,9,9,8]})
df=df.set_index('A')
df
B C D
A
1 7 5 NaN
2 6 6 9
3 NaN 7 9
7 4 1 8
Thanks in advance!
When calculating returns from a level, it's OK to use the most recent available value. For example, when calculating the CAGR for row 1, we want to use (5/7) ^ (1/2) - 1. Likewise, for row 3, (9/7) ^ (1/2) - 1. The assumption is that we annualize across all years looked at; three columns give two periods, which matches the exponent 1/(len(df.columns) - 1) in the code.
With these assumptions:
df = df.bfill(axis=1).ffill(axis=1)
Then apply solution from linked question.
df['CAGR'] = df.T.pct_change().add(1).prod().pow(1./(len(df.columns) - 1)).sub(1)
Without this assumption, the only other reasonable choice would be to annualize by the number of non-NaN observations. So I need to track that with:
notnull = df.notnull().sum(axis=1)
df = df.bfill(axis=1).ffill(axis=1)
df['CAGR'] = df.T.pct_change().add(1).prod().pow(1./(notnull.sub(1))).sub(1)
In fact, this becomes the more general solution, as it also works for the case without nulls.
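A self-contained sketch of this second variant on the question's data. One deliberate swap: instead of pct_change().add(1).prod(), it divides the last filled value by the first, which gives the same number because the product of the period-to-period growth factors telescopes. The names periods, filled, growth and cagr are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '3', '7'],
                   'B': [7, 6, np.nan, 4],
                   'C': [5, 6, 7, 1],
                   'D': [np.nan, 9, 9, 8]}).set_index('A')

# Number of periods per row: non-NaN observations minus one
periods = df.notnull().sum(axis=1).sub(1)

# Fill gaps along each row so the first and last columns always hold a value
filled = df.bfill(axis=1).ffill(axis=1)

# Total growth factor per row: last value over first value
growth = filled.iloc[:, -1] / filled.iloc[:, 0]

# Annualize by the number of periods actually observed
cagr = growth.pow(1.0 / periods).sub(1)
print(cagr)
```

A row with a single non-NaN observation has zero periods, so the exponent is a division by zero; such rows would need separate handling.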
