Sum dictionary values stored in DataFrame columns - python-3.x

I have a data frame where each cell holds a list of dictionaries. I want to sum only the values and store the result in a new column.
Column 1                                      Desired Output
[{'Apple': 3}, {'Mango': 2}, {'Peach': 4}]    9
[{'Blue': 2}, {'Black': 1}]                   3
df['Desired Output'] = [sum(x) for x in df['Column 1']]
df

Assuming your Column 1 column does indeed have dictionaries (and not strings that look like dictionaries), this should do the trick:
df['Desired Output'] = df["Column 1"].apply(lambda lst: sum(sum(d.values()) for d in lst))
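For example, a minimal runnable sketch with the sample data from the question (if the cells are actually strings that look like dicts, parse them first with ast.literal_eval):
import pandas as pd

df = pd.DataFrame({'Column 1': [
    [{'Apple': 3}, {'Mango': 2}, {'Peach': 4}],
    [{'Blue': 2}, {'Black': 1}],
]})
# for each row, sum the values of every dict in the list
df['Desired Output'] = df['Column 1'].apply(lambda lst: sum(sum(d.values()) for d in lst))
print(df['Desired Output'].tolist())  # [9, 3]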

Related

How to add an empty column in a dataframe using pandas (without specifying column names)?

I have a dataframe with only one column (headerless). I want to add another empty column to it having the same number of rows.
To make it clearer: currently my data frame has 1050 rows and one column; I want the new shape to be 1050 x 2, with the second column completely empty.
A pandas DataFrame always has columns, so to add a new default column filled with missing values, use the current number of columns as the new column's label:
import numpy as np
import pandas as pd

s = pd.Series([2, 3, 4])
df = s.to_frame()
df[len(df.columns)] = np.nan
# which for a one-column df is the same as
# df[1] = np.nan
print(df)
   0   1
0  2 NaN
1  3 NaN
2  4 NaN

How to split a pandas column into multiple columns [duplicate]

I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.
Currently, I do the following:
data = pandas.read_csv('mydata.csv')
which gives something like:
data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
I'd like to slice this dataframe in two dataframes: one containing the columns a and b and one containing the columns c, d and e.
It is not possible to write something like
observations = data[:'c']
features = data['c':]
I'm not sure what the best method is. Do I need a pd.Panel?
By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other hand, data['a':] is not permitted but data[0:] is.
Is there a practical reason for this? It is really confusing when columns are indexed by integers, given that data[0] != data[0:1].
2017 Answer - pandas 0.20: .ix is deprecated. Use .loc
See the deprecation in the docs
.loc uses label based indexing to select both rows and columns. The labels being the values of the index or the columns. Slicing with .loc includes the last element.
Let's assume we have a DataFrame with the following columns:
foo, bar, quz, ant, cat, sat, dat.
# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat
.loc accepts the same slice notation that Python lists do, for both rows and columns. Slice notation is start:stop:step
# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat
# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar
# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat
# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned
# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
# sat cat ant quz bar
# slice notation is syntactic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat
# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat
You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z
# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
# foo ant
# w
# x
# y
Note: .ix has been deprecated since Pandas v0.20. You should instead use .loc or .iloc, as appropriate.
The DataFrame.ix index is what you want to be accessing. It's a little confusing (I agree that Pandas indexing is perplexing at times!), but the following seems to do what you want:
>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
b c d e
0 0.418762 0.042369 0.869203 0.972314
1 0.991058 0.510228 0.594784 0.534366
2 0.407472 0.259811 0.396664 0.894202
3 0.726168 0.139531 0.324932 0.906575
where .ix[row slice, column slice] is what is being interpreted. More on Pandas indexing here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-advanced
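Since .ix has been removed in modern pandas, a direct .loc translation of the snippet above (a minimal sketch, same selection) is:
>>> df.loc[:, 'b':]
which returns the same b through e columns.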
Let's use the titanic dataset from the seaborn package as an example
# Load dataset (pip install seaborn)
>> import seaborn as sns  # seaborn.apionly was removed in newer seaborn releases
>> titanic = sns.load_dataset('titanic')
using the column names
>> titanic.loc[:,['sex','age','fare']]
using the column indices
>> titanic.iloc[:,[2,3,6]]
using .ix (only for pandas versions older than 0.20)
>> titanic.ix[:,['sex','age','fare']]
or
>> titanic.ix[:,[2,3,6]]
using the reindex method
>> titanic.reindex(columns=['sex','age','fare'])
Also, given a DataFrame data as in your example, if you would like to extract columns a and d only (i.e. the 1st and the 4th columns), the iloc method of the pandas dataframe is what you need and can be used very effectively. All you need to know are the indices of the columns you would like to extract. For example:
>>> data.iloc[:,[0,3]]
will give you
a d
0 0.883283 0.100975
1 0.614313 0.221731
2 0.438963 0.224361
3 0.466078 0.703347
4 0.955285 0.114033
5 0.268443 0.416996
6 0.613241 0.327548
7 0.370784 0.359159
8 0.692708 0.659410
9 0.806624 0.875476
You can slice along the columns of a DataFrame by referring to the names of each column in a list, like so:
data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
data_ab = data[list('ab')]
data_cde = data[list('cde')]
And if you came here looking to slice two ranges of columns and combine them (like me), you can do something like
op = df[list(df.columns[0:899]) + list(df.columns[3593:])]
print(op)
This will create a new dataframe with the first 899 columns and all columns from index 3593 onward (assuming you have some 4000 columns in your data set).
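An alternative sketch for combining two positional ranges, assuming the same column counts as above, is np.r_ together with iloc:
import numpy as np

# np.r_ concatenates the two integer ranges into one index array
op = df.iloc[:, np.r_[0:899, 3593:df.shape[1]]]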
Here's how you could use different methods to do selective column slicing, including selective label based, index based and the selective ranges based column slicing.
In [37]: import pandas as pd
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))
In [44]: df
Out[44]:
a b c d e f g
0 0.409038 0.745497 0.890767 0.945890 0.014655 0.458070 0.786633
1 0.570642 0.181552 0.794599 0.036340 0.907011 0.655237 0.735268
2 0.568440 0.501638 0.186635 0.441445 0.703312 0.187447 0.604305
3 0.679125 0.642817 0.697628 0.391686 0.698381 0.936899 0.101806
In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing
Out[45]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing
Out[46]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
In [47]: df.iloc[:, 0:3] ## index based column ranges slicing
Out[47]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
### with 2 different column ranges, index based slicing:
In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
Another way to get a subset of columns from your DataFrame, assuming you want all the rows, would be to do:
data[['a','b']] and data[['c','d','e']]
If you want to use numerical column indexes you can do:
data[data.columns[:2]] and data[data.columns[2:]]
The following are equivalent:
>>> print(df2.loc[140:160,['Relevance','Title']])
>>> print(df2.ix[140:160,[3,7]])
If the DataFrame looks like this:
group      name      count
fruit      apple     90
fruit      banana    150
fruit      orange    130
vegetable  broccoli  80
vegetable  kale      70
vegetable  lettuce   125
and the desired output is:
   group   name    count
0  fruit   apple   90
1  fruit   banana  150
2  fruit   orange  130
you can use the logical operator np.logical_not:
df[np.logical_not(df['group'] == 'vegetable')]
More about numpy's logical routines:
https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.logic.html
Other logical operators:
logical_and(x1, x2, /[, out, where, ...])           Compute the truth value of x1 AND x2 element-wise.
logical_or(x1, x2, /[, out, where, casting, ...])   Compute the truth value of x1 OR x2 element-wise.
logical_not(x, /[, out, where, casting, ...])       Compute the truth value of NOT x element-wise.
logical_xor(x1, x2, /[, out, where, ...])           Compute the truth value of x1 XOR x2 element-wise.
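For comparison, the same row filter in idiomatic pandas (a minimal sketch using the sample frame above) avoids numpy entirely:
df[df['group'] != 'vegetable']
# or, with the element-wise NOT operator ~
df[~(df['group'] == 'vegetable')]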
You can use the truncate method:
df = pd.DataFrame(np.random.rand(10, 5), columns = list('abcde'))
df_ab = df.truncate(before='a', after='b', axis=1)
df_cde = df.truncate(before='c', axis=1)

concatenating lists in a pandas dataframe and getting the unique tokens in another column

I have a data frame consisting of 4 columns; the first one is ID, and the other columns have lists as their values in each row. I need to concatenate these three list columns, take the unique tokens, and create another column. This is one row, and I have a bit more than 1 million records.
original_df = pd.DataFrame({'ID': 1,
                            'Name_List1': [['aa', 'bb']],
                            'Name_List2': [['Mutiso', 'Julia', 'Linger']],
                            'Name_List3': [['Mutiso', 'Julia', 'Linger', 'bb', 'cc']]})
and the desired df is the output of this script:
desired_df = pd.DataFrame({'ID': 1,
                           'Name_List1': [['aa', 'bb']],
                           'Name_List2': [['Mutiso', 'Julia', 'Linger']],
                           'Name_List3': [['Mutiso', 'Julia', 'Linger', 'bb', 'cc']],
                           'Unique_name_list': [['aa', 'bb', 'cc', 'Mutiso', 'Julia', 'Linger']]})
How can I get the 5th column, i.e. the 'Unique_name_list' column?
You can try stack(), explode() (note: explode is new in pandas 0.25+) and groupby + agg, then map:
m = (original_df.set_index('ID').stack().explode()
     .drop_duplicates().groupby(level=0).agg(list))
original_df['Unique_name_list'] = original_df['ID'].map(m)
print(original_df)
ID Name_List1 Name_List2 Name_List3 \
0 1 [aa, bb] [Mutiso, Julia, Linger] [Mutiso, Julia, Linger, bb, cc]
Unique_name_list
0 [aa, bb, Mutiso, Julia, Linger, cc]
Or (slower version)
You can try apply with np.concatenate and set:
import numpy as np

original_df = original_df.set_index('ID')
final = original_df.assign(Unique_name_list=original_df.apply(
    lambda x: [*set(np.concatenate(x))], axis=1)).reset_index()
ID Name_List1 Name_List2 Name_List3 \
0 1 [aa, bb] [Mutiso, Julia, Linger] [Mutiso, Julia, Linger, bb, cc]
Unique_name_list
0 [bb, Mutiso, cc, aa, Julia, Linger]
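If you want to keep first-seen order without numpy, a hedged alternative sketch uses itertools.chain with dict.fromkeys (which preserves insertion order in Python 3.7+); this assumes original_df still has its default index:
from itertools import chain

cols = ['Name_List1', 'Name_List2', 'Name_List3']
# chain the three lists per row, dedupe while preserving order
original_df['Unique_name_list'] = original_df[cols].apply(
    lambda row: list(dict.fromkeys(chain.from_iterable(row))), axis=1)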

How can I apply a function that I created to each successive row of a column in python?

I have 4 rows in a column that I want to iterate over and apply a function to.
The df is a single column and the values, for instance, are:
a
b
c
d
I want to apply the function in this way:
function(a,b)
function(a,c)
function(a,d)
function(b,c)
function(c,d)
How can I do this in python?
I've tried using df.apply(lambda column: compare_score, axis=0)
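A minimal sketch, assuming the goal is to apply the function to every unique pair of rows (the listing above skips (b, d), possibly an oversight), could use itertools.combinations; compare_score here is a hypothetical stand-in for your real function:
from itertools import combinations

import pandas as pd

def compare_score(x, y):
    # hypothetical placeholder - replace with your real comparison
    return x == y

df = pd.DataFrame({'col': ['a', 'b', 'c', 'd']})
results = {(x, y): compare_score(x, y) for x, y in combinations(df['col'], 2)}
print(results)  # keys: (a,b), (a,c), (a,d), (b,c), (b,d), (c,d)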

pandas generates a new column based on values from another column considering duplicates

I am working on a dataframe which has a column in which each value is a list. I want to derive a new column that only considers lists whose size is greater than 1 and assigns a unique integer id to the corresponding row. If the elements of two lists are the same but in a different order, the two lists should be assigned the same id. A sample dataframe looks like this:
document_no_list  cluster_id
[1,2,3]           1
[3,2,1]           1
[4,5,6,7]         2
[8]               0
[9,10]            3
[10,9]            3
The cluster_id column only considers the 1st, 2nd, 3rd, 5th and 6th rows, each of which holds a list of size greater than 1, and assigns a unique integer id to the corresponding cell; [1,2,3] and [3,2,1], as well as [9,10] and [10,9], should be assigned the same cluster_id.
I asked a similar question, without considering duplicate list values, at
pandas how to derived values for a new column base on another column
I am wondering how to do that in pandas.
First, you need to assign a column with the list lengths, and another column with each list sorted and converted to a tuple (so that permutations compare equal and the values are hashable):
df['list_len'] = df.document_no_list.apply(len)
df['list_sorted'] = df.document_no_list.apply(lambda x: tuple(sorted(x)))
Then you need to assign a cluster_id for each distinct sorted list:
ids = df.loc[df.list_len > 1, ['list_sorted']].drop_duplicates()
ids['cluster_id'] = range(1,len(ids)+1)
Left join this onto the original dataframe, and fill whatever hasn't been joined (the singletons) with zeros:
df.merge(ids, how = 'left').fillna({'cluster_id':0})
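Putting the steps together as a runnable sketch with the sample data above:
import pandas as pd

df = pd.DataFrame({'document_no_list': [[1, 2, 3], [3, 2, 1], [4, 5, 6, 7],
                                        [8], [9, 10], [10, 9]]})
df['list_len'] = df.document_no_list.apply(len)
df['list_sorted'] = df.document_no_list.apply(lambda x: tuple(sorted(x)))

ids = df.loc[df.list_len > 1, ['list_sorted']].drop_duplicates()
ids['cluster_id'] = range(1, len(ids) + 1)

# left join on list_sorted; unmatched singletons get cluster_id 0
out = df.merge(ids, how='left').fillna({'cluster_id': 0})
out['cluster_id'] = out['cluster_id'].astype(int)
print(out[['document_no_list', 'cluster_id']])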
