How to add series to a Dataframe? [duplicate] - python-3.x

I have the following indexed DataFrame with named columns and a non-contiguous row index:
a b c d
2 0.671399 0.101208 -0.181532 0.241273
3 0.446172 -0.243316 0.051767 1.577318
5 0.614758 0.075793 -0.451460 -0.012493
I would like to add a new column, 'e', to the existing data frame and do not want to change anything in the data frame (i.e., the new column always has the same length as the DataFrame).
0 -0.335485
1 -1.166658
2 -0.385571
dtype: float64
How can I add column e to the above example?

Edit 2017
As indicated in the comments and by @Alexander, currently the best method to add the values of a Series as a new column of a DataFrame may be assign:
df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
Edit 2015
Some reported getting the SettingWithCopyWarning with this code.
However, the code still runs perfectly with the current pandas version 0.16.1.
>>> sLength = len(df1['a'])
>>> df1
a b c d
6 -0.269221 -0.026476 0.997517 1.294385
8 0.917438 0.847941 0.034235 -0.448948
>>> df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
a b c d e
6 -0.269221 -0.026476 0.997517 1.294385 1.757167
8 0.917438 0.847941 0.034235 -0.448948 2.228131
>>> pd.version.short_version
'0.16.1'
The SettingWithCopyWarning aims to inform you of a possibly invalid assignment on a copy of the DataFrame. It doesn't necessarily mean you did it wrong (it can trigger false positives), but from 0.13.0 it lets you know there are more adequate methods for the same purpose. If you get the warning, just follow its advice: Try using .loc[row_index,col_indexer] = value instead
>>> df1.loc[:,'f'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
a b c d e f
6 -0.269221 -0.026476 0.997517 1.294385 1.757167 -0.050927
8 0.917438 0.847941 0.034235 -0.448948 2.228131 0.006109
>>>
In fact, this is currently the more efficient method as described in pandas docs
Original answer:
Use the original df1 indexes to create the series:
df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)

This is the simple way of adding a new column: df['e'] = e

I would like to add a new column, 'e', to the existing data frame and do not change anything in the data frame. (The series always got the same length as a dataframe.)
I assume that the index values in e match those in df1.
The easiest way to initiate a new column named e, and assign it the values from your series e:
df['e'] = e.values
assign (Pandas 0.16.0+)
As of Pandas 0.16.0, you can also use assign, which assigns new columns to a DataFrame and returns a new object (a copy) with all the original columns in addition to the new ones.
df1 = df1.assign(e=e.values)
As per this example (which also includes the source code of the assign function), you can also include more than one column:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.assign(mean_a=df.a.mean(), mean_b=df.b.mean())
a b mean_a mean_b
0 1 3 1.5 3.5
1 2 4 1.5 3.5
In context with your example:
np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
mask = df1.applymap(lambda x: x <-0.7)
df1 = df1[~mask.any(axis=1)]
sLength = len(df1['a'])
e = pd.Series(np.random.randn(sLength))
>>> df1
a b c d
0 1.764052 0.400157 0.978738 2.240893
2 -0.103219 0.410599 0.144044 1.454274
3 0.761038 0.121675 0.443863 0.333674
7 1.532779 1.469359 0.154947 0.378163
9 1.230291 1.202380 -0.387327 -0.302303
>>> e
0 -1.048553
1 -1.420018
2 -1.706270
3 1.950775
4 -0.509652
dtype: float64
df1 = df1.assign(e=e.values)
>>> df1
a b c d e
0 1.764052 0.400157 0.978738 2.240893 -1.048553
2 -0.103219 0.410599 0.144044 1.454274 -1.420018
3 0.761038 0.121675 0.443863 0.333674 -1.706270
7 1.532779 1.469359 0.154947 0.378163 1.950775
9 1.230291 1.202380 -0.387327 -0.302303 -0.509652
The description of this new feature when it was first introduced can be found here.
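A minimal sketch of the point that assign returns a new object rather than modifying the original. The frame and values here are made up for illustration and only mimic the shape of the question's df1:

```python
import pandas as pd

# Hypothetical frame with a non-contiguous index, similar to the question's df1.
df1 = pd.DataFrame({"a": [0.67, 0.45, 0.61]}, index=[2, 3, 5])
e = pd.Series([-0.34, -1.17, -0.39])  # default index 0, 1, 2

# assign returns a new DataFrame; df1 itself is left untouched.
# Passing e.values sidesteps index alignment, so values go in by position.
df2 = df1.assign(e=e.values)

print(list(df1.columns))  # the original still has only column 'a'
print(df2)
```

Rebinding the name (df1 = df1.assign(...)) is what makes the new column appear "in" df1.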

Super simple column assignment
A pandas dataframe is implemented as an ordered dict of columns.
This means that the __getitem__ [] can not only be used to get a certain column, but __setitem__ [] = can be used to assign a new column.
For example, this dataframe can have a column added to it by simply using the [] accessor
size name color
0 big rose red
1 small violet blue
2 small tulip red
3 small harebell blue
df['protected'] = ['no', 'no', 'no', 'yes']
size name color protected
0 big rose red no
1 small violet blue no
2 small tulip red no
3 small harebell blue yes
Note that this works even if the index of the dataframe is off.
df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
size name color protected
3 big rose red no
2 small violet blue no
1 small tulip red no
0 small harebell blue yes
[]= is the way to go, but watch out!
However, if you have a pd.Series and try to assign it to a dataframe where the indexes are off, you will run into trouble. See example:
df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
size name color protected
3 big rose red yes
2 small violet blue no
1 small tulip red no
0 small harebell blue no
This is because a pd.Series by default has an index enumerated from 0 to n. And the pandas [] = method tries to be "smart"
What actually is going on.
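When you use the [] = method with a pd.Series on the right-hand side, pandas quietly aligns the series to the index of the left-hand dataframe, much like a left join on the index: values are matched by index label rather than by position, labels that exist only in the series are dropped, and dataframe rows without a matching label get NaN. df['column'] = series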
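The label matching can be checked directly. This is a small sketch with made-up data (the index labels 5, 3, 2 and 9 are arbitrary), comparing assignment of a Series with assignment of its raw values:

```python
import pandas as pd

# Hypothetical frame whose index is not 0..n-1.
df = pd.DataFrame({"name": ["rose", "violet", "tulip"]}, index=[5, 3, 2])
s = pd.Series(["a", "b", "c"], index=[2, 3, 9])

# Assigning a Series matches on index labels:
# label 9 has no row in df and is dropped, label 5 has no value and becomes NaN.
df["aligned"] = s

# Assigning the underlying array ignores labels and fills top to bottom.
df["positional"] = s.values

print(df)
```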
Side note
This quickly causes cognitive dissonance, since the []= method is trying to do a lot of different things depending on the input, and the outcome cannot be predicted unless you just know how pandas works. I would therefore advise against the []= in code bases, but when exploring data in a notebook, it is fine.
Going around the problem
If you have a pd.Series and want it assigned from top to bottom, or if you are writing production code and you are not sure of the index order, it is worth safeguarding against this kind of issue.
You could convert the pd.Series to a np.ndarray or a list; this will do the trick.
df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values
or
df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))
But this is not very explicit.
Some coder may come along and say "Hey, this looks redundant, I'll just optimize this away".
Explicit way
Setting the index of the pd.Series to be the index of the df is explicit.
df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)
Or more realistically, you probably have a pd.Series already available.
protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index
3 no
2 no
1 no
0 yes
Can now be assigned
df['protected'] = protected_series
size name color protected
3 big rose red no
2 small violet blue no
1 small tulip red no
0 small harebell blue yes
Alternative way with df.reset_index()
Since the index dissonance is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index. This should be faster, but it is not very clean, since your function now probably does two things. Note that reset_index returns a new object, so the result has to be assigned back.
df = df.reset_index(drop=True)
protected_series = protected_series.reset_index(drop=True)
df['protected'] = protected_series
size name color protected
0 big rose red no
1 small violet blue no
2 small tulip red no
3 small harebell blue yes
Note on df.assign
While df.assign makes it more explicit what you are doing, it actually has all the same problems as the above []=
df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
size name color protected
3 big rose red yes
2 small violet blue no
1 small tulip red no
0 small harebell blue no
Just watch out with df.assign that your column is not called self. It will cause errors. This makes df.assign smelly, since there is this kind of artifact in the function.
df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
TypeError: assign() got multiple values for keyword argument 'self'
You may say, "Well, I'll just not use self then". But who knows how this function changes in the future to support new arguments. Maybe your column name will be an argument in a new update of pandas, causing problems with upgrading.

It seems that in recent Pandas versions the way to go is to use df.assign:
df1 = df1.assign(e=np.random.randn(sLength))
It doesn't produce SettingWithCopyWarning.

Doing this directly via NumPy will be the most efficient:
df1['e'] = np.random.randn(sLength)
Note my original (very old) suggestion was to use map (which is much slower):
df1['e'] = df1['a'].map(lambda x: np.random.random())

Easiest ways:
data['new_col'] = list_of_values
data.loc[ : , 'new_col'] = list_of_values
This way you avoid what is called chained indexing when setting new values in a pandas object.

If you want to set the whole new column to an initial base value (e.g. None), you can do this: df1['e'] = None
This actually would assign "object" type to the cell. So later you're free to put complex data types, like list, into individual cells.
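A quick sketch of the dtype difference, assuming a small made-up frame: initialising with None gives an object column, whereas initialising with np.nan gives a float column.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3]})  # hypothetical data

df1["e"] = None    # object dtype: cells can later hold arbitrary Python objects
df1["f"] = np.nan  # float64 dtype: a numeric column of missing values

print(df1.dtypes)
```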

I got the dreaded SettingWithCopyWarning, and it wasn't fixed by using the iloc syntax. My DataFrame was created by read_sql from an ODBC source. Using a suggestion by lowtech above, the following worked for me:
df.insert(len(df.columns), 'e', pd.Series(np.random.randn(sLength), index=df.index))
This worked fine to insert the column at the end. I don't know if it is the most efficient, but I don't like warning messages. I think there is a better solution, but I can't find it, and I think it depends on some aspect of the index.
Note that this only works once and will give an error message if you try to overwrite an existing column.
Note: as above, from 0.16.0 assign is the best solution. See documentation http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html#pandas.DataFrame.assign
Works well for data flow type where you don't overwrite your intermediate values.
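The "only works once" behaviour can be reproduced with a short sketch (made-up values; the flag name second_insert_failed is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})  # hypothetical data

# First insert appends column 'e' at the end.
df.insert(len(df.columns), "e", [0.1, 0.2])

# A second insert with the same name raises ValueError by default.
try:
    df.insert(len(df.columns), "e", [0.3, 0.4])
    second_insert_failed = False
except ValueError:
    second_insert_failed = True

# Plain assignment, by contrast, overwrites the existing column.
df["e"] = [0.3, 0.4]
print(second_insert_failed, df)
```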

First create a python's list_of_e that has relevant data.
Use this:
df['e'] = list_of_e

To create an empty column
df['i'] = None

If the column you are trying to add is a series variable then just :
df["new_columns_name"]=series_variable_name #this will do it for you
This works even if you are replacing an existing column. Just use the name of the column you want to replace as new_columns_name, and the existing column data will be overwritten with the new series data.

If the data frame and Series object have the same index, pandas.concat also works here:
import pandas as pd
df
# a b c d
#0 0.671399 0.101208 -0.181532 0.241273
#1 0.446172 -0.243316 0.051767 1.577318
#2 0.614758 0.075793 -0.451460 -0.012493
e = pd.Series([-0.335485, -1.166658, -0.385571])
e
#0 -0.335485
#1 -1.166658
#2 -0.385571
#dtype: float64
# here we need to give the series object a name which converts to the new column name
# in the result
df = pd.concat([df, e.rename("e")], axis=1)
df
# a b c d e
#0 0.671399 0.101208 -0.181532 0.241273 -0.335485
#1 0.446172 -0.243316 0.051767 1.577318 -1.166658
#2 0.614758 0.075793 -0.451460 -0.012493 -0.385571
In case they don't have the same index:
e.index = df.index
df = pd.concat([df, e.rename("e")], axis=1)

Foolproof:
df.loc[:, 'NewCol'] = 'New_Val'
Example:
df = pd.DataFrame(data=np.random.randn(20, 4), columns=['A', 'B', 'C', 'D'])
df
A B C D
0 -0.761269 0.477348 1.170614 0.752714
1 1.217250 -0.930860 -0.769324 -0.408642
2 -0.619679 -1.227659 -0.259135 1.700294
3 -0.147354 0.778707 0.479145 2.284143
4 -0.529529 0.000571 0.913779 1.395894
5 2.592400 0.637253 1.441096 -0.631468
6 0.757178 0.240012 -0.553820 1.177202
7 -0.986128 -1.313843 0.788589 -0.707836
8 0.606985 -2.232903 -1.358107 -2.855494
9 -0.692013 0.671866 1.179466 -1.180351
10 -1.093707 -0.530600 0.182926 -1.296494
11 -0.143273 -0.503199 -1.328728 0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832 0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15 0.955298 -1.430019 1.434071 -0.088215
16 -0.227946 0.047462 0.373573 -0.111675
17 1.627912 0.043611 1.743403 -0.012714
18 0.693458 0.144327 0.329500 -0.655045
19 0.104425 0.037412 0.450598 -0.923387
df.drop([3, 5, 8, 10, 18], inplace=True)
df
A B C D
0 -0.761269 0.477348 1.170614 0.752714
1 1.217250 -0.930860 -0.769324 -0.408642
2 -0.619679 -1.227659 -0.259135 1.700294
4 -0.529529 0.000571 0.913779 1.395894
6 0.757178 0.240012 -0.553820 1.177202
7 -0.986128 -1.313843 0.788589 -0.707836
9 -0.692013 0.671866 1.179466 -1.180351
11 -0.143273 -0.503199 -1.328728 0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832 0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15 0.955298 -1.430019 1.434071 -0.088215
16 -0.227946 0.047462 0.373573 -0.111675
17 1.627912 0.043611 1.743403 -0.012714
19 0.104425 0.037412 0.450598 -0.923387
df.loc[:, 'NewCol'] = 0
df
A B C D NewCol
0 -0.761269 0.477348 1.170614 0.752714 0
1 1.217250 -0.930860 -0.769324 -0.408642 0
2 -0.619679 -1.227659 -0.259135 1.700294 0
4 -0.529529 0.000571 0.913779 1.395894 0
6 0.757178 0.240012 -0.553820 1.177202 0
7 -0.986128 -1.313843 0.788589 -0.707836 0
9 -0.692013 0.671866 1.179466 -1.180351 0
11 -0.143273 -0.503199 -1.328728 0.610552 0
12 -0.923110 -1.365890 -1.366202 -1.185999 0
13 -2.026832 0.273593 -0.440426 -0.627423 0
14 -0.054503 -0.788866 -0.228088 -0.404783 0
15 0.955298 -1.430019 1.434071 -0.088215 0
16 -0.227946 0.047462 0.373573 -0.111675 0
17 1.627912 0.043611 1.743403 -0.012714 0
19 0.104425 0.037412 0.450598 -0.923387 0

One thing to note, though, is that if you do
df1['e'] = Series(np.random.randn(sLength), index=df1.index)
this will effectively be a left join on the df1.index. So if you want to have an outer join effect, my probably imperfect solution is to create a dataframe with index values covering the universe of your data, and then use the code above. For example,
data = pd.DataFrame(index=all_possible_values)
df1['e'] = Series(np.random.randn(sLength), index=df1.index)

To insert a new column at a given location (0 <= loc <= number of columns) in a data frame, just use DataFrame.insert:
DataFrame.insert(loc, column, value)
Therefore, if you want to add the column e at the end of a data frame called df, you can use:
e = [-0.335485, -1.166658, -0.385571]
df.insert(loc=len(df.columns), column='e', value=e)
value can be a Series, an integer (in which case all cells get filled with this one value), or an array-like structure
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html
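As a rough illustration of the loc argument, here is a sketch on a made-up three-row frame that inserts one column at the front (using a scalar, which is broadcast to every row) and one at the end:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # hypothetical data

# loc=0 puts the new column first; a scalar fills all cells.
df.insert(loc=0, column="first", value=0)

# loc=len(df.columns) appends at the end; insert modifies df in place.
e = [-0.335485, -1.166658, -0.385571]
df.insert(loc=len(df.columns), column="e", value=e)

print(df)
```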

Let me just add that, just like for hum3, .loc didn't solve the SettingWithCopyWarning and I had to resort to df.insert(). In my case the false positive was generated by "fake" chained indexing dict['a']['e'], where 'e' is the new column, and dict['a'] is a DataFrame coming from a dictionary.
Also note that if you know what you are doing, you can switch off the warning using
pd.options.mode.chained_assignment = None
and then use one of the other solutions given here.

Before assigning a new column, if you have indexed data, you need to sort the index. At least in my case I had to:
data.set_index(['index_column'], inplace=True)
# if index is unsorted, assignment of a new column will fail
data.sort_index(inplace = True)
data.loc['index_value1', 'column_y'] = np.random.randn(data.loc['index_value1', 'column_x'].shape[0])

To add a new column, 'e', to the existing data frame
df1.loc[:, 'e'] = pd.Series(np.random.randn(sLength), index=df1.index)

I was looking for a general way of adding a column of numpy.nans to a dataframe without getting the dumb SettingWithCopyWarning.
From the following:
the answers here
this question about passing a variable as a keyword argument
this method for generating a numpy array of NaNs in-line
I came up with this:
col = 'column_name'
df = df.assign(**{col:numpy.full(len(df), numpy.nan)})

For the sake of completeness - yet another solution using DataFrame.eval() method:
Data:
In [44]: e
Out[44]:
0 1.225506
1 -1.033944
2 -0.498953
3 -0.373332
4 0.615030
5 -0.622436
dtype: float64
In [45]: df1
Out[45]:
a b c d
0 -0.634222 -0.103264 0.745069 0.801288
4 0.782387 -0.090279 0.757662 -0.602408
5 -0.117456 2.124496 1.057301 0.765466
7 0.767532 0.104304 -0.586850 1.051297
8 -0.103272 0.958334 1.163092 1.182315
9 -0.616254 0.296678 -0.112027 0.679112
Solution:
In [46]: df1.eval("e = @e.values", inplace=True)
In [47]: df1
Out[47]:
a b c d e
0 -0.634222 -0.103264 0.745069 0.801288 1.225506
4 0.782387 -0.090279 0.757662 -0.602408 -1.033944
5 -0.117456 2.124496 1.057301 0.765466 -0.498953
7 0.767532 0.104304 -0.586850 1.051297 -0.373332
8 -0.103272 0.958334 1.163092 1.182315 0.615030
9 -0.616254 0.296678 -0.112027 0.679112 -0.622436

If you just need to create a new empty column then the shortest solution is:
df.loc[:, 'e'] = pd.Series(dtype='float64')

The following is what I did... But I'm pretty new to pandas and really Python in general, so no promises.
df = pd.DataFrame([[1, 2], [3, 4], [5,6]], columns=list('AB'))
newCol = [3,5,7]
newName = 'C'
values = np.insert(df.values,df.shape[1],newCol,axis=1)
header = df.columns.values.tolist()
header.append(newName)
df = pd.DataFrame(values,columns=header)

If we want to assign a scalar value, e.g. 10, to all rows of a new column in a df:
df = df.assign(new_col=lambda x:10) # x is the whole DataFrame passed in to the lambda func
df will now have new column 'new_col' with value=10 in all rows.
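A small sketch of how assign treats its arguments, on made-up data: a plain scalar is broadcast as is, while a callable is called once with the whole DataFrame, not once per row. The column names total and frame are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})  # hypothetical data

out = df.assign(
    new_col=10,                                # scalar, no lambda needed
    total=lambda frame: frame["a"] + frame["b"],  # frame is the full DataFrame
)
print(out)
```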

If you get the SettingWithCopyWarning, an easy fix is to copy the DataFrame you are trying to add a column to.
df = df.copy()
df['col_name'] = values
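A typical situation where this helps is adding a column to a filtered subset. The sketch below uses made-up data; taking an explicit copy makes it unambiguous that the new column belongs to the subset only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3]})  # hypothetical data

# Filtering may hand back something pandas treats as derived from df;
# .copy() makes the subset an independent DataFrame.
subset = df[df["a"] > 0].copy()
subset["e"] = [0.5, 0.7]

print(subset)
print(list(df.columns))  # the original frame is unchanged
```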

x=pd.DataFrame([1,2,3,4,5])
y=pd.DataFrame([5,4,3,2,1])
z=pd.concat([x,y],axis=1)

4 ways you can insert a new column to a pandas DataFrame
using simple assignment, insert(), assign() and concat() methods.
import pandas as pd
df = pd.DataFrame({
'col_a':[True, False, False],
'col_b': [1, 2, 3],
})
print(df)
col_a col_b
0 True 1
1 False 2
2 False 3
Using simple assignment
ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
print(ser)
0 a
1 b
2 c
dtype: object
df['col_c'] = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
print(df)
col_a col_b col_c
0 True 1 NaN
1 False 2 a
2 False 3 b
Using assign()
e = pd.Series([1.0, 3.0, 2.0], index=[0, 2, 1])
ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
df.assign(col_c=ser.values, col_b=e.values)
col_a col_b col_c
0 True 1.0 a
1 False 3.0 b
2 False 2.0 c
Using insert()
df.insert(len(df.columns), 'col_c', ser.values)
print(df)
col_a col_b col_c
0 True 1 a
1 False 2 b
2 False 3 c
Using concat()
ser = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])
df = pd.concat([df, ser.rename('col_c')], axis=1)
print(df)
col_a col_b col_c
0 True 1.0 NaN
1 False 2.0 NaN
2 False 3.0 NaN
10 NaN NaN a
20 NaN NaN b
30 NaN NaN c

This is a special case of adding a new column to a pandas dataframe. Here, I am adding a new feature/column based on the data in an existing column of the dataframe.
So, suppose our dataframe has the columns 'feature_1', 'feature_2' and 'probability_score', and we have to add a new column 'predicted_class' based on the data in column 'probability_score'.
I will use the Series.map() method from pandas together with a function of my own, which implements the logic for giving a class label to every row in the dataframe.
data = pd.read_csv('data.csv')
def myFunction(x):
    # implement your logic here; the 0.5 threshold and the labels 1/0
    # are placeholders, not part of the original answer
    if x >= 0.5:
        return 1
    return 0
variable_1 = data['probability_score']
predicted_class = variable_1.map(myFunction)
data['predicted_class'] = predicted_class
# check the dataframe: the new column is derived row by row from the existing column
data.head()

Whenever you add a Series object as new column to an existing DF, you need to make sure that they both have the same index.
Then add it to the DF
e_series = pd.Series([-0.335485, -1.166658,-0.385571])
print(e_series)
e_series.index = d_f.index
d_f['e'] = e_series
d_f

import pandas as pd
# Define a dictionary containing data
data = {'a': [0,0,0.671399,0.446172,0,0.614758],
'b': [0,0,0.101208,-0.243316,0,0.075793],
'c': [0,0,-0.181532,0.051767,0,-0.451460],
'd': [0,0,0.241273,1.577318,0,-0.012493]}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# Declare a list that is to be converted into a column
col_e = [-0.335485,-1.166658,-0.385571,0,0,0]
# add column 'e'
df['e'] = col_e
# Observe the result
df

Related

Pandas Dataframe of Unique Triples

I'm currently working on some python dataframes over on pandas. And I'm not sure how this operation can be done. For example, I have an empty dataframe df and list of the following triples:
L = [(1,2,3), (2,5,4), (2,5,4), (3,2,0), (2,1,3)]
I wish to add all these triples into the dataframe df with columns ['id', 'a', 'b', 'c'] according to some constraint. The id is simply a counter that determines how many items have been added so far and a, b, and c are columns for the triples (but they would be commutative with each other). So the idea is to linearly traverse all items in L and then add each one to the df according to the restriction:
It is ok to add (1,2,3) since df is still empty. (id=0)
It is ok to add (2,5,4) since it or any of its permutation has not appeared yet in df. (id=1)
We then see (2,5,4) but this already exists in df, hence we cannot add it.
Next is (3,2,0) and we can clearly add this for the same reason as #2. (id=2)
Finally, it's (2,1,3). This exact triple has not appeared in df yet, but since it's a permutation of an existing triple in df (namely (1,2,3)), we cannot add it to df.
In the end, the final df should look something like this.
id a b c
0 1 2 3
1 2 5 4
2 3 2 0
Does anyone know how this can be done? My idea is to first build an auxiliary list LL that would contain these "unique" triples and then just transform it into a pandas df. But I'm not sure if that is a fast and elegant approach.
Fast solution
Create a numpy array from the list, then sort the array along axis=1 and use duplicated to create a boolean mask to identify dupes, then remove the duplicate rows from the array and create a new dataframe
a = np.array(L)
m = pd.DataFrame(np.sort(a, axis=1)).duplicated()
pd.DataFrame(a[~m], columns=['a', 'b', 'c'])
Result
a b c
0 1 2 3
1 2 5 4
2 3 2 0
You can use a dictionary comprehension with a frozenset of the tuple as key to eliminate the duplicated permutations, then feed the values to the DataFrame constructor:
L = [(1,2,3), (2,5,4), (2,5,4), (3,2,0), (2,1,3)]
df = pd.DataFrame({frozenset(t): t for t in L[::-1]}.values(),
columns=['a', 'b', 'c'])
output:
a b c
0 1 2 3
1 3 2 0
2 2 5 4
If order is important, you can use a set to collect the seen values instead:
seen = set()
df = pd.DataFrame([t for t in L if (f:=frozenset(t)) not in seen
and not seen.add(f)],
columns=['a', 'b', 'c'])
output:
a b c
0 1 2 3
1 2 5 4
2 3 2 0
handling duplicates values in the tuple
df = pd.DataFrame({tuple(sorted(t)): t
for t in L[::-1]}.values(),
columns=['a', 'b', 'c'])
If there are many columns, sorting becomes inefficient, then you can use a Counter:
from collections import Counter
df = pd.DataFrame({frozenset(Counter(t).items()): t
for t in L[::-1]}.values(),
columns=['a', 'b', 'c'])
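To see why a plain frozenset key is not enough once a tuple may repeat a value, consider this small check (the two triples are made up for illustration). A frozenset forgets multiplicities, while the sorted tuple and the Counter-based key keep them:

```python
from collections import Counter

t1, t2 = (1, 1, 2), (1, 2, 2)  # not permutations of each other

# Both collapse to {1, 2}, so they would wrongly count as duplicates.
same_as_sets = frozenset(t1) == frozenset(t2)

# Sorting keeps repeated values, so the keys differ.
same_sorted = tuple(sorted(t1)) == tuple(sorted(t2))

# Counter records how often each value occurs, so the keys differ as well.
same_counter = frozenset(Counter(t1).items()) == frozenset(Counter(t2).items())

print(same_as_sets, same_sorted, same_counter)
```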
pure pandas alternative:
You can do the same with pandas using loc and aggregation to frozenset (a plain set is not hashable, so duplicated cannot compare it):
df = pd.DataFrame(L).loc[lambda d: ~d.agg(frozenset, axis=1).duplicated()]

How to split a pandas column into multiple columns [duplicate]

I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.
Currently, I do the following:
data = pandas.read_csv('mydata.csv')
which gives something like:
data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
I'd like to slice this dataframe in two dataframes: one containing the columns a and b and one containing the columns c, d and e.
It is not possible to write something like
observations = data[:'c']
features = data['c':]
I'm not sure what the best method is. Do I need a pd.Panel?
By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other side, data['a':] is not permitted but data[0:] is.
Is there a practical reason for this? This is really confusing if columns are indexed by Int, given that data[0] != data[0:1]
2017 Answer - pandas 0.20: .ix is deprecated. Use .loc
See the deprecation in the docs
.loc uses label based indexing to select both rows and columns. The labels being the values of the index or the columns. Slicing with .loc includes the last element.
Let's assume we have a DataFrame with the following columns:
foo, bar, quz, ant, cat, sat, dat.
# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat
.loc accepts the same slice notation that Python lists do for both row and columns. Slice notation being start:stop:step
# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat
# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar
# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat
# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned
# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
# sat cat ant quz bar
# slice notation is syntactic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat
# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat
You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z
# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
# foo ant
# w
# x
# y
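The column slices above can be verified on a one-row frame. This is only a sketch: the frame is built here with the column names from the answer and dummy values, and .iloc is included for contrast because it excludes the stop position.

```python
import pandas as pd

cols = ["foo", "bar", "quz", "ant", "cat", "sat", "dat"]
df = pd.DataFrame([list(range(7))], columns=cols)  # dummy values

inclusive = list(df.loc[:, "foo":"sat"].columns)        # stop label is included
every_second = list(df.loc[:, "foo":"cat":2].columns)   # step of 2
reversed_cols = list(df.loc[:, "sat":"bar":-1].columns)  # negative step
positional = list(df.iloc[:, 0:3].columns)              # stop position excluded

print(inclusive, every_second, reversed_cols, positional, sep="\n")
```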
Note: .ix has been deprecated since Pandas v0.20. You should instead use .loc or .iloc, as appropriate.
The DataFrame.ix index is what you want to be accessing. It's a little confusing (I agree that Pandas indexing is perplexing at times!), but the following seems to do what you want:
>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
b c d e
0 0.418762 0.042369 0.869203 0.972314
1 0.991058 0.510228 0.594784 0.534366
2 0.407472 0.259811 0.396664 0.894202
3 0.726168 0.139531 0.324932 0.906575
where .ix[row slice, column slice] is what is being interpreted. More on Pandas indexing here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-advanced
Lets use the titanic dataset from the seaborn package as an example
# Load dataset (pip install seaborn)
>> import seaborn as sns
>> titanic = sns.load_dataset('titanic')
using the column names
>> titanic.loc[:,['sex','age','fare']]
using the column indices
>> titanic.iloc[:,[2,3,6]]
using ix (pandas versions before 0.20 only)
>> titanic.ix[:,['sex','age','fare']]
or
>> titanic.ix[:,[2,3,6]]
using the reindex method
>> titanic.reindex(columns=['sex','age','fare'])
Also, given a DataFrame data as in your example, if you would like to extract columns a and d only (i.e. the 1st and the 4th column), the iloc method of the pandas dataframe is what you need. All you need to know is the positions of the columns you would like to extract. For example:
>>> data.iloc[:,[0,3]]
will give you
a d
0 0.883283 0.100975
1 0.614313 0.221731
2 0.438963 0.224361
3 0.466078 0.703347
4 0.955285 0.114033
5 0.268443 0.416996
6 0.613241 0.327548
7 0.370784 0.359159
8 0.692708 0.659410
9 0.806624 0.875476
You can slice along the columns of a DataFrame by referring to the names of each column in a list, like so:
data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
data_ab = data[list('ab')]
data_cde = data[list('cde')]
And if you came here looking for slicing two ranges of columns and combining them together (like me) you can do something like
op = df[list(df.columns[0:899]) + list(df.columns[3593:])]
print(op)
This will create a new dataframe with the first 899 columns and all columns from position 3593 onward (assuming you have some 4000 columns in your data set).
Here's how you could use different methods to do selective column slicing, including selective label based, index based and the selective ranges based column slicing.
In [37]: import pandas as pd
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))
In [44]: df
Out[44]:
a b c d e f g
0 0.409038 0.745497 0.890767 0.945890 0.014655 0.458070 0.786633
1 0.570642 0.181552 0.794599 0.036340 0.907011 0.655237 0.735268
2 0.568440 0.501638 0.186635 0.441445 0.703312 0.187447 0.604305
3 0.679125 0.642817 0.697628 0.391686 0.698381 0.936899 0.101806
In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing
Out[45]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing
Out[46]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
In [47]: df.iloc[:, 0:3] ## index based column ranges slicing
Out[47]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
### with 2 different column ranges, index based slicing:
In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
Another way to get a subset of columns from your DataFrame, assuming you want all the rows, would be to do:
data[['a','b']] and data[['c','d','e']]
If you want to use numerical column indexes you can do:
data[data.columns[:2]] and data[data.columns[2:]]
These two are equivalent:
>>> print(df2.loc[140:160,['Relevance','Title']])
>>> print(df2.ix[140:160,[3,7]])
If the DataFrame looks like this:
group name count
fruit apple 90
fruit banana 150
fruit orange 130
vegetable broccoli 80
vegetable kale 70
vegetable lettuce 125
and the output should look like
group name count
0 fruit apple 90
1 fruit banana 150
2 fruit orange 130
you can use the logical operator np.logical_not:
df[np.logical_not(df['group'] == 'vegetable')]
more about
https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.logic.html
other logical operators
logical_and(x1, x2, /[, out, where, ...]) Compute the truth value of x1 AND x2 element-wise.
logical_or(x1, x2, /[, out, where, casting, ...]) Compute the truth value of x1 OR x2 element-wise.
logical_not(x, /[, out, where, casting, ...]) Compute the truth value of NOT x element-wise.
logical_xor(x1, x2, /[, out, where, ..]) Compute the truth value of x1 XOR x2, element-wise.
You can use the method truncate
df = pd.DataFrame(np.random.rand(10, 5), columns = list('abcde'))
df_ab = df.truncate(before='a', after='b', axis=1)
df_cde = df.truncate(before='c', axis=1)

How to reindex a data frame by a custom dict?

Good morning!
In my script I have a DataFrame that could have between 1 and 3 rows and always 2 columns. Here I show two different examples of my possible df:
df = pd.DataFrame({'name_col': ['L', 'V'], 'counter': [30, 4]})
or
df = pd.DataFrame({'name_col': ['VE'], 'counter': [10]})
My objective is to get always a df like the following (in brackets for the second example above):
df =
name_col counter
0 VE 0 (10)
1 V 4 (0)
2 L 30 (0)
I mean, I always want to have these 3 values, VE, V, L, in that order, in my final df. I've already tried different combinations of reindex and map, but nothing works.
Thank you very much in advance, guys!
Maybe an answer could be:
new_df = pd.DataFrame({'name_col': ['VE', 'V', 'L']})
df_final = new_df.merge(df, on = 'name_col', how = 'left').fillna(0)
Do you think this is a good approach?
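One possible alternative, sketched here rather than taken from the thread, is to use name_col as the index and reindex against the fixed order, filling missing names with 0. The helper name with_all_names and the variable order are made up for this example.

```python
import pandas as pd

order = ["VE", "V", "L"]  # the fixed row order wanted in the result


def with_all_names(df):
    """Return df with one row per name in `order`, counter 0 where missing."""
    return (
        df.set_index("name_col")
        .reindex(order, fill_value=0)
        .reset_index()
    )


df_a = pd.DataFrame({"name_col": ["L", "V"], "counter": [30, 4]})
df_b = pd.DataFrame({"name_col": ["VE"], "counter": [10]})

res_a = with_all_names(df_a)
res_b = with_all_names(df_b)
print(res_a)
print(res_b)
```

Using fill_value=0 in reindex avoids the intermediate NaN values, so the counter column keeps its integer dtype.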

Adding a row to existing dataframe [duplicate]

How do I create an empty DataFrame, then add rows, one by one?
I created an empty DataFrame:
df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
Then I can add a new row at the end and fill a single field with:
df = df._set_value(index=len(df), col='qty1', value=10.0)
It works for only one field at a time. What is a better way to add new row to df?
You can use df.loc[i], where the row with index i will be what you specify it to be in the dataframe.
>>> import pandas as pd
>>> from numpy.random import randint
>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
...     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
>>> df
lib qty1 qty2
0 name0 3 3
1 name1 2 4
2 name2 2 8
3 name3 2 1
4 name4 9 6
In case you can get all data for the data frame upfront, there is a much faster approach than appending to a data frame:
Create a list of dictionaries in which each dictionary corresponds to an input data row.
Create a data frame from this list.
I had a similar task for which appending to a data frame row by row took 30 min, and creating a data frame from a list of dictionaries completed within seconds.
rows_list = []
for row in input_rows:
    dict1 = {}
    # get input row in dictionary format
    # key = col_name
    dict1.update(blah..)
    rows_list.append(dict1)
df = pd.DataFrame(rows_list)
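A runnable version of the same idea might look like the following. The contents of input_rows are invented stand-in data, since the original leaves the input unspecified:

```python
import pandas as pd

# Stand-in input data (an assumption; the original does not say where rows come from)
input_rows = [('name0', 3, 3), ('name1', 2, 4), ('name2', 2, 8)]

rows_list = []
for lib, qty1, qty2 in input_rows:
    # one dictionary per row, keyed by column name
    rows_list.append({'lib': lib, 'qty1': qty1, 'qty2': qty2})

# build the DataFrame once, after all rows are collected
df = pd.DataFrame(rows_list)
print(df)
```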
In the case of adding a lot of rows to dataframe, I am interested in performance. So I tried the four most popular methods and checked their speed.
Performance
Using .append (NPE's answer)
Using .loc (fred's answer)
Using .loc with preallocating (FooBar's answer)
Using dict and create DataFrame in the end (ShikharDua's answer)
Runtime results (in seconds):
Approach                 1000 rows   5000 rows   10 000 rows
.append                  0.69        3.39        6.78
.loc without prealloc    0.74        3.90        8.35
.loc with prealloc       0.24        2.58        8.70
dict                     0.012       0.046       0.084
So I use addition through the dictionary for myself.
Code:
import pandas as pd
import numpy as np
import time
# When re-running in the same session, first clear earlier results with: del df1, df2, df3, df4
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
    df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)
# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
    df2.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)
# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
    df3.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)
# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
    row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
    dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)
df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)
P.S.: I believe my implementation isn't perfect, and maybe there is some optimization that could be done.
You could use pandas.concat(). For details and examples, see Merge, join, and concatenate.
For example:
def append_row(df, row):
    return pd.concat([
        df,
        pd.DataFrame([row], columns=row.index)]
    ).reset_index(drop=True)
df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
new_row = pd.Series({'lib':'A', 'qty1':1, 'qty2': 2})
df = append_row(df, new_row)
NEVER grow a DataFrame!
Yes, people have already explained that you should NEVER grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?
Here are the most important reasons, taken from my post here.
It is always cheaper/faster to append to a list and create a DataFrame in one go.
Lists take up less memory and are a much lighter data structure to work with, append, and remove.
dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make them object, which is bad.
An index is automatically created for you, instead of you having to take care to assign the correct index to the row you are appending.
This is The Right Way™ to accumulate your data
data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
These options are horrible
append or concat inside a loop
append and concat aren't inherently bad in isolation. The
problem starts when you iteratively call them inside a loop - this
results in quadratic memory usage.
# Creates empty DataFrame and appends
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True)
    # This is equally bad:
    # df = pd.concat(
    #     [df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])],
    #     ignore_index=True)
Empty DataFrame of NaNs
Never create a DataFrame of NaNs as the columns are initialized with
object (slow, un-vectorizable dtype).
# Creates DataFrame of NaNs and overwrites values.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]
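The dtype point can be checked directly. A small sketch with arbitrary sample values comparing a preallocated frame of NaNs with a frame built from a list in one go:

```python
import pandas as pd

# Preallocated frame: no data yet, so every column falls back to object dtype
empty = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))

# Frame built once from collected rows: dtypes are inferred from the data
built = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])

print(empty.dtypes)  # all object
print(built.dtypes)  # integer columns
```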
The Proof is in the Pudding
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
Benchmarking code for reference.
It's posts like this that remind me why I'm a part of this community. People understand the importance of teaching folks getting the right answer with the right code, not the right answer with wrong code. Now you might argue that it is not an issue to use loc or append if you're only adding a single row to your DataFrame. However, people often look to this question to add more than just one row - often the requirement is to iteratively add a row inside a loop using data that comes from a function (see related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.
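Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the append-based snippets in this thread raise an AttributeError. If the pieces really do arrive as separate frames, a reasonable pattern is to collect them in a list and call pd.concat once at the end. In this sketch the generator is a made-up stand-in for some_function_that_yields_data:

```python
import pandas as pd

def some_function_that_yields_data():
    # Hypothetical stand-in producing three (a, b, c) rows
    for k in range(3):
        yield k, k * 2, k * k

pieces = []
for a, b, c in some_function_that_yields_data():
    # collect small frames in a plain list instead of growing a DataFrame
    pieces.append(pd.DataFrame([{'A': a, 'B': b, 'C': c}]))

# a single concat at the end, rather than one per iteration
df = pd.concat(pieces, ignore_index=True)
print(df)
```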
If you know the number of entries ex ante, you should preallocate the space by also providing the index (taking the data example from a different answer):
import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )
# now fill it up row by row
for x in np.arange(0, numberOfRows):
    # loc or iloc both work here since the index is natural numbers
    df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
Out[23]:
lib qty1 qty2
0 -1 -1 -1
1 0 0 0
2 -1 0 -1
3 0 -1 0
4 -1 0 0
Speed comparison
In[30]: %timeit tryThis() # function wrapper for this answer
In[31]: %timeit tryOther() # function wrapper without index (see, for example, #fred)
1000 loops, best of 3: 1.23 ms per loop
100 loops, best of 3: 2.31 ms per loop
And - as from the comments - with a size of 6000, the speed difference becomes even larger:
Increasing the size of the array (12) and the number of rows (500) makes
the speed difference more striking: 313ms vs 2.29s
mycolumns = ['A', 'B']
df = pd.DataFrame(columns=mycolumns)
rows = [[1,2],[3,4],[5,6]]
for row in rows:
    df.loc[len(df)] = row
You can append a single row as a dictionary using the ignore_index option.
>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
Animal Color
0 cow blue
1 horse red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
Animal Color
0 cow blue
1 horse red
2 mouse black
For efficient appending, see How to add an extra row to a pandas dataframe and Setting With Enlargement.
Add rows through loc on a non-existing index key (the older ix indexer has since been removed from pandas). For example:
In [1]: se = pd.Series([1,2,3])
In [2]: se
Out[2]:
0 1
1 2
2 3
dtype: int64
In [3]: se[5] = 5.
In [4]: se
Out[4]:
0 1.0
1 2.0
2 3.0
5 5.0
dtype: float64
Or:
In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
.....: columns=['A','B'])
.....:
In [2]: dfi
Out[2]:
A B
0 0 1
1 2 3
2 4 5
In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']
In [4]: dfi
Out[4]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
In [5]: dfi.loc[3] = 5
In [6]: dfi
Out[6]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
3 5 5 5
For the sake of a Pythonic way:
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())
lib qty1 qty2
0 NaN 10.0 NaN
You can also build up a list of lists and convert it to a dataframe -
import pandas as pd
columns = ['i','double','square']
rows = []
for i in range(6):
    row = [i, i*2, i*i]
    rows.append(row)
df = pd.DataFrame(rows, columns=columns)
giving
i double square
0 0 0 0
1 1 2 1
2 2 4 4
3 3 6 9
4 4 8 16
5 5 10 25
If you always want to add a new row at the end, use this:
df.loc[len(df)] = ['name5', 9, 0]
I figured out a simple and nice way:
>>> df
A B C
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
A B C
one 1 2 3
two 4 5 6
Note the caveat with performance as noted in the comments.
This is not an answer to the OP question, but a toy example to illustrate ShikharDua's answer which I found very useful.
While this fragment is trivial, in the actual data I had 1,000s of rows, and many columns, and I wished to be able to group by different columns and then perform the statistics below for more than one target column. So having a reliable method for building the data frame one row at a time was a great convenience. Thank you ShikharDua!
import pandas as pd
BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
'Territory' : ['West','East','South','West','East','South'],
'Product' : ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData
columns = ['Customer','Num Unique Products', 'List Unique Products']
rows_list=[]
for name, group in BaseData.groupby('Customer'):
    RecordtoAdd={} #initialise an empty dict
    RecordtoAdd.update({'Customer' : name}) #
    RecordtoAdd.update({'Num Unique Products' : len(pd.unique(group['Product']))})
    RecordtoAdd.update({'List Unique Products' : pd.unique(group['Product'])})
    rows_list.append(RecordtoAdd)
AnalysedData = pd.DataFrame(rows_list)
print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)
You can use a generator object to create a DataFrame, which will be more memory efficient than a list.
num = 10
# Generator function to generate generator object
def numgen_func(num):
    for i in range(num):
        yield ('name_{}'.format(i), (i*i), (i*i*i))
# Generator expression to generate generator object (it can be consumed only once and cannot be reused)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num) )
df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))
To add a row to an existing DataFrame you can use the append method.
df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400 }])
Instead of a list of dictionaries as in ShikharDua's answer (row-based), we can also represent our table as a dictionary of lists (column-based), where each list stores one column in row-order, given we know our columns beforehand. At the end we construct our DataFrame once.
In both cases, the dictionary keys are always the column names. Row order is stored implicitly as order in a list. For c columns and n rows, this uses one dictionary of c lists, versus one list of n dictionaries. The list-of-dictionaries method has each dictionary storing all keys redundantly and requires creating a new dictionary for every row. Here we only append to lists, which overall is the same time complexity (adding entries to list and dictionary are both amortized constant time) but may have less overhead due to being a simple operation.
# Current data
data = {"Animal":["cow", "horse"], "Color":["blue", "red"]}
# Adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")
# At the end, construct our DataFrame
df = pd.DataFrame(data)
# Animal Color
# 0 cow blue
# 1 horse red
# 2 mouse black
Create a new record (data frame) and add to old_data_frame.
Pass a list of values and the corresponding column names to create a new_record (data_frame):
new_record = pd.DataFrame([[0, 'abcd', 0, 1, 123]], columns=['a', 'b', 'c', 'd', 'e'])
old_data_frame = pd.concat([old_data_frame, new_record])
Here is the way to add/append a row in a Pandas DataFrame:
def add_row(df, row):
    df.loc[-1] = row
    df.index = df.index + 1
    return df.sort_index()
add_row(df, [1,2,3])
It can be used to insert/append a row in an empty or populated Pandas DataFrame.
If you want to add a row at the end, append it as a list:
valuestoappend = [val1, val2, val3]
res = res.append(pd.Series(valuestoappend, index = ['lib', 'qty1', 'qty2']), ignore_index = True)
Another way to do it (probably not very performant):
# add a row
def add_row(df, row):
    colnames = list(df.columns)
    ncol = len(colnames)
    assert ncol == len(row), "Length of row must be the same as width of DataFrame: %s" % row
    return df.append(pd.DataFrame([row], columns=colnames))
You can also enhance the DataFrame class like this:
import pandas as pd
def add_row(self, row):
    self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row
All you need is loc[df.shape[0]] or loc[len(df)]
# Assuming your df has 4 columns (str, int, str, bool)
df.loc[df.shape[0]] = ['col1Value', 100, 'col3Value', False]
or
df.loc[len(df)] = ['col1Value', 100, 'col3Value', False]
You can concatenate two DataFrames for this. I basically came across this problem to add a new row to an existing DataFrame with a character index (not numeric).
So, I put the data for the new row in a dict and the index in a list.
new_dict = {put input for new row here}
new_list = [put your index here]
new_df = pd.DataFrame(data=new_dict, index=new_list)
df = pd.concat([existing_df, new_df])
initial_data = {'lib': np.array([1,2,3,4]), 'qty1': [1,2,3,4], 'qty2': [1,2,3,4]}
df = pd.DataFrame(initial_data)
df
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
val_1 = [10]
val_2 = [14]
val_3 = [20]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
You can use a for loop to iterate through values or can add arrays of values.
val_1 = [10, 11, 12, 13]
val_2 = [14, 15, 16, 17]
val_3 = [20, 21, 22, 43]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
1 11 15 21
2 12 16 22
3 13 17 43
Make it simple. By taking a list as input which will be appended as a row in the data-frame:
import pandas as pd
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    res_list = list(map(int, input().split()))
    res = res.append(pd.Series(res_list, index=['lib', 'qty1', 'qty2']), ignore_index=True)
pandas.DataFrame.append
DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=False) → 'DataFrame'
Code
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
With ignore_index set to True:
df.append(df2, ignore_index=True)
If you have a data frame df and want to add a list new_list as a new row to df, you can simply do:
df.loc[len(df)] = new_list
If you want to add a new data frame new_df under data frame df, then you can use:
df.append(new_df)
We often see the construct df.loc[subscript] = … to assign to one DataFrame row. Mikhail_Sam posted benchmarks containing, among others, this construct as well as the method using dict and create DataFrame in the end. He found the latter to be the fastest by far.
But if we replace the df3.loc[i] = … (with preallocated DataFrame) in his code with df3.values[i] = …, the outcome changes significantly, in that that method performs similar to the one using dict. So we should more often take the use of df.values[subscript] = … into consideration. However note that .values takes a zero-based subscript, which may be different from the DataFrame.index.
Before adding a row, we convert the dataframe to a dictionary. Its keys are the dataframe's columns, and each column's values are stored in a nested dictionary whose keys are the dataframe's index values.
That idea led me to write the code below.
df2 = df.to_dict()
values = ["s_101", "hyderabad", 10, 20, 16, 13, 15, 12, 12, 13, 25, 26, 25, 27, "good", "bad"] # This is the total row that we are going to add
i = 0
for x in df.columns: # Here df.columns gives us the main dictionary key
    df2[x][101] = values[i] # Here the 101 is our index number. It is also the key of the sub dictionary
    i += 1
If all data in your Dataframe has the same dtype you might use a NumPy array. You can write rows directly into the predefined array and convert it to a dataframe at the end.
It seems to be even faster than converting a list of dicts.
import time
import pandas as pd
import numpy as np
from string import ascii_uppercase
startTime = time.perf_counter()
numcols, numrows = 5, 10000
npdf = np.ones((numrows, numcols))
for row in range(numrows):
    npdf[row, 0:] = np.random.randint(0, 100, (1, numcols))
df5 = pd.DataFrame(npdf, columns=list(ascii_uppercase[:numcols]))
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numrows))
print(df5.shape)
This code snippet uses a list of dictionaries to update the data frame. It adds on to ShikharDua's and Mikhail_Sam's answers.
import pandas as pd
colour = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
dict1={}
feat_list=[]
for x in colour:
    for y in fruits:
        # print(x, y)
        dict1 = dict([('x',x),('y',y)])
        # print(f'dict 1 {dict1}')
        feat_list.append(dict1)
        # print(f'feat_list {feat_list}')
feat_df=pd.DataFrame(feat_list)
feat_df.to_csv('feat1.csv')

Group by and Count Function returns NaNs [duplicate]

I am using .size() on a groupby result in order to count how many items are in each group.
I would like the result to be saved to a new column name without manually editing the column names array, how can it be done?
This is what I have tried:
grpd = df.groupby(['A','B'])
grpd['size'] = grpd.size()
grpd
and the error I got:
TypeError: 'DataFrameGroupBy' object does not support item assignment
(on the second line)
The .size() built-in method of DataFrameGroupBy objects actually returns a Series object with the group sizes and not a DataFrame. If you want a DataFrame whose column is the group sizes, indexed by the groups, with a custom name, you can use the .to_frame() method and use the desired column name as its argument.
grpd = df.groupby(['A','B']).size().to_frame('size')
If you wanted the groups to be columns again you could add a .reset_index() at the end.
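Putting both steps together, a short sketch on made-up sample data (the intermediate names counts and flat are only for illustration):

```python
import pandas as pd

# Made-up sample data
df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})

# Series of group sizes -> one-column DataFrame named 'size', indexed by (A, B)
counts = df.groupby(['A', 'B']).size().to_frame('size')

# Move the group keys back into ordinary columns
flat = counts.reset_index()
print(flat)
```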
You need transform with size; the length of df stays the same as before:
Notice:
Here it is necessary to select one column after groupby, otherwise you get an error. Because GroupBy.size counts NaNs too, it does not matter which column is used; all columns work the same.
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x','y','y']
, 'B': ['a', 'c', 'c','b','b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df['size'] = df.groupby(['A', 'B'])['A'].transform('size')
print (df)
A B size
0 x a 1
1 x c 2
2 x c 2
3 y b 2
4 y b 2
If you need to set the column name when aggregating, the length of df is obviously NOT the same as before:
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x','y','y']
, 'B': ['a', 'c', 'c','b','b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df = df.groupby(['A', 'B']).size().reset_index(name='Size')
print (df)
A B Size
0 x a 1
1 x c 2
2 y b 2
The result of df.groupby(...) is not a DataFrame. To get a DataFrame back, you have to apply a function to each group, transform each element of a group, or filter the groups.
It seems like you want a DataFrame that contains (1) all your original data in df and (2) the count of how much data is in each group. These things have different lengths, so if they need to go into the same DataFrame, you'll need to list the size redundantly, i.e., for each row in each group.
df['size'] = df.groupby(['A','B'])['A'].transform('size')
(Aside: It's helpful if you can show succinct sample input and expected results.)
You can set the as_index parameter in groupby to False to get a DataFrame instead of a Series:
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'], 'B': [1, 2, 2, 2]})
df.groupby(['A', 'B'], as_index=False).size()
Output:
A B size
0 a 1 1
1 a 2 1
2 b 2 2
Let's say n is the name of the DataFrame and cst is the column of items being repeated.
The code below puts the count in the next column:
from collections import Counter
cstn=Counter(n.cst)
cstlist = pd.DataFrame.from_dict(cstn, orient='index').reset_index()
cstlist.columns=['name','cnt']
n['cnt']=n['cst'].map(cstlist.loc[:, ['name','cnt']].set_index('name').iloc[:,0].to_dict())
Hope this will work
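A shorter route to the same per-row count, sketched on invented data and reusing the names n, cst and cnt from above, is to map the column through its own value_counts:

```python
import pandas as pd

# Invented sample data; 'cst' holds the repeated items
n = pd.DataFrame({'cst': ['a', 'b', 'a', 'c', 'a', 'b']})

# value_counts() gives a Series indexed by item; map() looks each row's item up in it
n['cnt'] = n['cst'].map(n['cst'].value_counts())
print(n)
```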

Resources