Pandas: Update values of a column - python-3.x

I have a large dataframe with multiple columns (sample shown below). I want to update the values of one particular column (the Population column) by dividing them by 1000.
City Population
Paris 23456
Lisbon 123466
Madrid 1254
Pekin 86648
I have tried
df['Population'].apply(lambda x: int(str(x))/1000)
and
df['Population'].apply(lambda x: int(x)/1000)
Both give me the error
ValueError: invalid literal for int() with base 10: '...'

If your DataFrame really does look as presented, then the second example should work just fine (with the int not even being necessary):
In [16]: df
Out[16]:
City Population
0 Paris 23456
1 Lisbon 123466
2 Madrid 1254
3 Pekin 86648
In [17]: df['Population'].apply(lambda x: x/1000)
Out[17]:
0 23.456
1 123.466
2 1.254
3 86.648
Name: Population, dtype: float64
In [18]: df['Population']/1000
Out[18]:
0 23.456
1 123.466
2 1.254
3 86.648
However, from the error, it seems like you have the unparsable string '...' somewhere in your Series, and that the data needs to be cleaned further.
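If that is the case, a minimal sketch for locating and cleaning the offending values might look like this (assuming the column really is named 'Population'; the exact cleanup step depends on your data):
import pandas as pd

# coerce unparsable values to NaN instead of raising
population = pd.to_numeric(df['Population'], errors='coerce')

# inspect the rows that failed to parse
print(df[population.isna()])

# divide the parsable values by 1000 (unparsable ones stay NaN)
df['Population'] = population / 1000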

Related

Python pandas move cell value to another cell in same row

I have a dataFrame like this:
id Description Price Unit
1 Test Only 1254 12
2 Data test Fresher 4
3 Sample 3569 1
4 Sample Onces Code test
5 Sample 245 2
I want to move values from the Price column into the Description column whenever the Price value is not an integer, and have the Price become NaN. There is no specific word to match; the only rule is that if the Price column has a non-integer value, that string value should move to the Description column.
I have already tried pandas replace and concat, but they don't work.
Desired output is like this:
id Description Price Unit
1 Test Only 1254 12
2 Fresher 4
3 Sample 3569 1
4 Code test
5 Sample 245 2
This should work:
import numpy as np
import pandas as pd

# data
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Description': ['Test Only', 'Data test', 'Sample', 'Sample Onces', 'Sample'],
                   'Price': ['1254', 'Fresher', '3569', 'Code test', '245'],
                   'Unit': [12, 4, 1, np.nan, 2]})
# convert price column to numeric and coerce errors
price = pd.to_numeric(df.Price, errors='coerce')
# for rows where price is not numeric, replace description with these values
df.Description = df.Description.mask(price.isna(), df.Price)
# assign numeric price to price column
df.Price = price
df
Use:
#convert values to numeric
price = pd.to_numeric(df['Price'], errors='coerce')
#test missing values
m = price.isna()
#shifted only matched rows
df.loc[m, ['Description','Price']] = df.loc[m, ['Description','Price']].shift(-1, axis=1)
print (df)
id Description Price
0 1 Test Only 1254
1 2 Fresher NaN
2 3 Sample 3569
3 4 Code test NaN
4 5 Sample 245
If you need numeric values in the output Price column:
df = df.assign(Price=price)
print (df)
id Description Price
0 1 Test Only 1254.0
1 2 Fresher NaN
2 3 Sample 3569.0
3 4 Code test NaN
4 5 Sample 245.0
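As a further sketch (not from the answers above), the same move can be expressed with numpy.where, assuming pandas and numpy are imported as pd and np:
# coerce Price to numeric; non-numeric entries become NaN
price = pd.to_numeric(df['Price'], errors='coerce')
# where Price is not numeric, copy the original string into Description
df['Description'] = np.where(price.isna(), df['Price'], df['Description'])
df['Price'] = price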

Is a KeyError in a dataframe caused by incorrect groupby application?

I have made a summarized dataframe from another dataframe using .groupby and .agg.
sum_df = cnms_df.groupby(['Tiermetric', 'Mod_unMod', 'Val_Combined', 'Det_Approx', 'State', 'Region', 'CO_FIPS']).agg({'MILES': 'sum'})
However, something doesn't look quite right; there seem to be missing values.
Tiermetric Mod_unMod Val_Combined Det_Approx State Region CO_FIPS MILES
Other 1 UnMapped ASSESSED Approx IN 5 18001 8.397255
18003 3.284817
18011 64.019156
18017 9.068318
TIER 4 Modernized VALID Detailed NC 4 37119 2.046716
NC 4 37120 59.890107
NC 4 37025 3.773599
When I try to do something like this:
sum_df['CO_FIPS'][0]
I get an error that seems related to indexing:
KeyError: 'CO_FIPS'
What I want is for my final dataframe to look like this:
Tiermetric Mod_unMod Val_Combined Det_Approx State Region CO_FIPS MILES
Other 1 UnMapped ASSESSED Approx IN 5 18001 8.397255
Other 1 UnMapped ASSESSED Approx IN 5 18003 3.284817
Other 1 UnMapped ASSESSED Approx IN 5 18011 64.019156
Other 1 UnMapped ASSESSED Approx IN 5 18017 9.068318
TIER 4 Modernized VALID Detailed NC 4 37119 2.046716
TIER 4 Modernized VALID Detailed NC 4 37120 59.890107
TIER 4 Modernized VALID Detailed NC 4 37025 3.773599
How can I fix this?
The groupby and sum cause all of these columns to become a multi-index. You can use reset_index() or pass in as_index=False to turn the index into columns.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a':[1, 1, 2], 'b':[10, 20, 30]})
In [3]: df
Out[3]:
a b
0 1 10
1 1 20
2 2 30
In [4]: grouped = df.groupby('a').agg({'b': 'sum'})
In [5]: grouped # a is an index now
Out[5]:
b
a
1 30
2 30
In [6]: grouped = df.groupby('a', as_index=False).agg({'b': 'sum'})
In [7]: grouped # now a is a column
Out[7]:
a b
0 1 30
1 2 30
This will work with multi-indexes as well.
Set as_index to False (by default it is True):
sum_df = cnms_df.groupby(['Tiermetric', 'Mod_unMod', 'Val_Combined', 'Det_Approx', 'State', 'Region', 'CO_FIPS'], as_index=False).agg({'MILES': 'sum'})
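Alternatively, following the reset_index() suggestion above, a minimal sketch with the same columns as in the question:
sum_df = (cnms_df.groupby(['Tiermetric', 'Mod_unMod', 'Val_Combined', 'Det_Approx',
                           'State', 'Region', 'CO_FIPS'])
                 .agg({'MILES': 'sum'})
                 .reset_index())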

Calculate the sum of the numbers separated by a comma in a dataframe column

I am trying to calculate the sum of all the numbers separated by a comma in a dataframe column, however I keep getting an error. This is what the dataframe looks like:
Description scores
logo
graphics
eyewear 0.360740,-0.000758
glasses 0.360740,-0.000758
picture -0.000646
tutorial 0.001007,0.000968,0.000929,0.000889
computer 0.852264,0.001007,0.000968,0.000929,0.000889
This is what the code looks like
test['Sum'] = test['scores'].apply(lambda x: sum(map(float, x.split(','))))
However I keep getting the following error
ValueError: could not convert string to float:
I thought it could be because of the missing values at the start of the dataframe, but I subset the dataframe to exclude the missing values and I still get the same error.
Desired output:
Description scores SUM
logo
graphics
eyewear 0.360740,-0.000758 0.359982
glasses 0.360740,-0.000758 0.359982
picture -0.000646 -0.000646
tutorial 0.001007,0.000968,0.000929,0.000889 0.003793
computer 0.852264,0.001007,0.000968,0.000929,0.000889 0.856057
How can I resolve it?
There are times when plain Python seems very effective; this might be one of those.
df['scores'].apply(lambda x: sum(float(i) if len(x) > 0 else np.nan for i in x.split(',')))
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
You can just do str.split
df.scores.str.split(',',expand=True).astype(float).sum(1).mask(df.scores.isnull())
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
dtype: float64
Another solution using explode, groupby and sum functions:
df.scores.str.split(',').explode().astype(float).groupby(level=0).sum(min_count=1)
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
Name: scores, dtype: float64
Or to make #WeNYoBen's answer slightly shorter:
df.scores.str.split(',',expand=True).astype(float).sum(1, min_count=1)

Create multiple subsets from an existing pandas dataframe [duplicate]

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).
I would like to split the dataframe into 60 dataframes (a dataframe for each participant).
In the dataframe, data, there is a variable called 'name', which is the unique code for each participant.
I have tried the following, but nothing happens (or execution does not stop within an hour). What I intend to do is to split the data into smaller dataframes, and append these to a list (datalist):
import pandas as pd

def splitframe(data, name='name'):
    n = data[name][0]
    df = pd.DataFrame(columns=data.columns)
    datalist = []
    for i in range(len(data)):
        if data[name][i] == n:
            df = df.append(data.iloc[i])
        else:
            datalist.append(df)
            df = pd.DataFrame(columns=data.columns)
            n = data[name][i]
            df = df.append(data.iloc[i])
    return datalist
I do not get an error message, the script just seems to run forever!
Is there a smart way to do it?
Can I ask why not just do it by slicing the data frame? Something like:
import numpy as np
import pandas as pd

#create some data with Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] * 4,
                     'Ob1': np.random.rand(16),
                     'Ob2': np.random.rand(16)})
#create unique list of names
UniqueNames = data.Names.unique()
#create a data frame dictionary to store your data frames
DataFrameDict = {elem: pd.DataFrame() for elem in UniqueNames}
for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names == key]
Hey presto you have a dictionary of data frames just as (I think) you want them. Need to access one? Just enter
DataFrameDict['Joe']
Firstly, your approach is inefficient because appending to the list on a row-by-row basis will be slow: it has to periodically grow the list when there is insufficient space for a new entry. List comprehensions are better in this respect, as the size is determined up front and allocated once.
However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?
I would sort the dataframe by the 'name' column, set the index to this column and, if required, not drop it.
Then generate a list of all the unique entries and perform a lookup using these entries; crucially, if you are only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.
Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:
# sort the dataframe
df.sort_values(by='name', inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
You can convert the groupby object to tuples and then to a dict:
df = pd.DataFrame({'Name':list('aabbef'),
                   'A':[4,5,4,5,5,4],
                   'B':[7,8,9,4,2,3],
                   'C':[1,3,5,7,1,0]}, columns = ['Name','A','B','C'])
print (df)
Name A B C
0 a 4 7 1
1 a 5 8 3
2 b 4 9 5
3 b 5 4 7
4 e 5 2 1
5 f 4 3 0
d = dict(tuple(df.groupby('Name')))
print (d)
{'b': Name A B C
2 b 4 9 5
3 b 5 4 7, 'e': Name A B C
4 e 5 2 1, 'a': Name A B C
0 a 4 7 1
1 a 5 8 3, 'f': Name A B C
5 f 4 3 0}
print (d['a'])
Name A B C
0 a 4 7 1
1 a 5 8 3
It is not recommended, but it is possible to create DataFrames by group:
for i, g in df.groupby('Name'):
    globals()['df_' + str(i)] = g
print (df_a)
Name A B C
0 a 4 7 1
1 a 5 8 3
Easy:
[v for k, v in df.groupby('name')]
Groupby can help you:
grouped = data.groupby(['name'])
Then you can work with each group as if it were a dataframe for each participant. DataFrameGroupBy object methods such as apply, transform, aggregate, head, first and last return a DataFrame object.
Or you can make a list from grouped and get each DataFrame by index:
l_grouped = list(grouped)
l_grouped[0][1] is the DataFrame for the first group (the first name).
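As an illustrative sketch, a single participant's DataFrame can also be pulled out directly with get_group (here 'Joe' is a hypothetical participant code):
grouped = data.groupby('name')
joe_df = grouped.get_group('Joe')  # rows for the participant coded 'Joe'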
In addition to Gusev Slava's answer, you might want to use groupby's groups:
{key: df.loc[value] for key, value in df.groupby("name").groups.items()}
This will yield a dictionary with the keys you have grouped by, pointing to the corresponding partitions. The advantage is that the keys are maintained and don't vanish in the list index.
The method in the OP works, but isn't efficient. It may have seemed to run forever, because the dataset was long.
Use .groupby on the 'method' column, and create a dict of DataFrames with unique 'method' values as the keys, with a dict-comprehension.
.groupby returns a groupby object, that contains information about the groups, where g is the unique value in 'method' for each group, and d is the DataFrame for that group.
The value of each key in df_dict, will be a DataFrame, which can be accessed in the standard way, df_dict['key'].
The original question wanted a list of DataFrames, which can be done with a list-comprehension:
df_list = [d for _, d in df.groupby('method')]
import pandas as pd
import seaborn as sns # for test dataset
# load data for example
df = sns.load_dataset('planets')
# display(df.head())
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
# Using a dict-comprehension, the unique 'method' value will be the key
df_dict = {g: d for g, d in df.groupby('method')}
print(df_dict.keys())
[out]:
dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])
# or a specific name for the key, using enumerate (e.g. df1, df2, etc.)
df_dict = {f'df{i}': d for i, (g, d) in enumerate(df.groupby('method'))}
print(df_dict.keys())
[out]:
dict_keys(['df0', 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9'])
df_dict['df0'].head(3) or df_dict['Astrometry'].head(3)
There are only 2 in this group
method number orbital_period mass distance year
113 Astrometry 1 246.36 NaN 20.77 2013
537 Astrometry 1 1016.00 NaN 14.98 2010
df_dict['df1'].head(3) or df_dict['Eclipse Timing Variations'].head(3)
method number orbital_period mass distance year
32 Eclipse Timing Variations 1 10220.0 6.05 NaN 2009
37 Eclipse Timing Variations 2 5767.0 NaN 130.72 2008
38 Eclipse Timing Variations 2 3321.0 NaN 130.72 2008
df_dict['df2'].head(3) or df_dict['Imaging'].head(3)
method number orbital_period mass distance year
29 Imaging 1 NaN NaN 45.52 2005
30 Imaging 1 NaN NaN 165.00 2007
31 Imaging 1 NaN NaN 140.00 2004
For more information, see the seaborn datasets documentation and the NASA Exoplanets archive.
Alternatively
This is a manual method to create separate DataFrames using pandas: Boolean Indexing
This is similar to the accepted answer, but .loc is not required.
This is an acceptable method for creating a couple of extra DataFrames.
The pythonic way to create multiple objects is to place them in a container (e.g. dict, list, generator, etc.), as shown above.
df1 = df[df.method == 'Astrometry']
df2 = df[df.method == 'Eclipse Timing Variations']
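If a generator container (mentioned above) is preferred, a minimal sketch could be:
# lazily yields one sub-DataFrame per unique 'method' value
frames = (group for _, group in df.groupby('method'))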
In [28]: df = DataFrame(np.random.randn(1000000,10))
In [29]: df
Out[29]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0 1000000 non-null values
1 1000000 non-null values
2 1000000 non-null values
3 1000000 non-null values
4 1000000 non-null values
5 1000000 non-null values
6 1000000 non-null values
7 1000000 non-null values
8 1000000 non-null values
9 1000000 non-null values
dtypes: float64(10)
In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
1 loops, best of 3: 849 ms per loop
In [32]: len(frames)
Out[32]: 16667
Here's a groupby way (and you could do an arbitrary apply rather than sum)
In [9]: g = df.groupby(lambda x: x/60)
In [8]: g.sum()
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16667 entries, 0 to 16666
Data columns (total 10 columns):
0 16667 non-null values
1 16667 non-null values
2 16667 non-null values
3 16667 non-null values
4 16667 non-null values
5 16667 non-null values
6 16667 non-null values
7 16667 non-null values
8 16667 non-null values
9 16667 non-null values
dtypes: float64(10)
Sum is cythonized, which is why this is so fast:
In [10]: %timeit g.sum()
10 loops, best of 3: 27.5 ms per loop
In [11]: %timeit df.groupby(lambda x: x/60)
1 loops, best of 3: 231 ms per loop
A method based on a list comprehension and groupby, which stores all the split dataframes in a list variable that can be accessed using the index.
Example
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
You can use the groupby command, if you already have some labels for your data.
out_list = [group[1] for group in in_series.groupby(label_series.values)]
Here's a detailed example:
Let's say we want to partition a pandas Series into a list of chunks using some labels.
For example, in_series is:
2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 5, dtype: float64
And its corresponding label_series is:
2019-07-01 08:00:00 1
2019-07-01 08:02:00 1
2019-07-01 08:04:00 2
2019-07-01 08:06:00 2
2019-07-01 08:08:00 2
Length: 5, dtype: float64
Run
out_list = [group[1] for group in in_series.groupby(label_series.values)]
which returns out_list a list of two pd.Series:
[2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
Length: 2, dtype: float64,
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 3, dtype: float64]
Note that you can use some parameters from in_series itself to group the series, e.g., in_series.index.day
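For instance, a minimal sketch grouping by the day component of the index (assuming in_series has a DatetimeIndex):
out_list = [group[1] for group in in_series.groupby(in_series.index.day)]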
Here's a small function which might help (efficiency probably not perfect, but compact and more or less easy to understand):
def get_splited_df_dict(df: 'pd.DataFrame', split_column: 'str'):
    """
    splits a pandas.DataFrame on split_column and returns it as a dict
    """
    df_dict = {value: df[df[split_column] == value].drop(split_column, axis=1)
               for value in df[split_column].unique()}
    return df_dict
It converts a DataFrame into multiple DataFrames by selecting each unique value in the given column and putting all those entries into a separate DataFrame.
The .drop(split_column, axis=1) is just for removing the column that was used to split the DataFrame. The removal is not necessary, but it can help cut down on memory usage after the operation.
The result of get_splited_df_dict is a dict, meaning one can access each DataFrame like this:
splitted = get_splited_df_dict(some_df, some_column)
# accessing the DataFrame with 'some_column_value'
splitted[some_column_value]
The existing answers cover all the good cases and explain fairly well how the groupby object is like a dictionary with keys and values that can be accessed via .groups. Yet more methods to do the same job as the existing answers are:
Create a list by unpacking the groupby object and casting it to a dictionary:
dict([*df.groupby('Name')]) # same as dict(list(df.groupby('Name')))
Create a tuple + dict (this is the same as #jezrael's answer):
dict((*df.groupby('Name'),))
If we only want the DataFrames, we could get the values of the dictionary (created above):
[*dict([*df.groupby('Name')]).values()]
I had a similar problem. I had a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe into 500 dataframes (10 stores * 50 items) to apply Machine Learning models to each of them, and I couldn't do it manually.
This is the head of the dataframe:
I have created two lists:
one for the names of the dataframes,
and one for the pairs [item_number, store_number].
df_names = []
for i in range(1, len(items)*len(stores)+1):
    df_names.append('df' + str(i))

list_couple_s_i = []
for item in items:
    for store in stores:
        list_couple_s_i.append([item, store])
And once the two lists are ready you can loop on them to create the dataframes you want:
for name, it_st in zip(df_names, list_couple_s_i):
    globals()[name] = df.where((df['item'] == it_st[0]) &
                               (df['store'] == it_st[1]))
    globals()[name].dropna(inplace=True)
In this way I have created 500 dataframes.
Hope this will be helpful!
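As an alternative sketch (not the author's method), the same 500 splits could be produced with a single groupby into a dict keyed by (item, store) pairs, assuming the columns are named 'item' and 'store':
# dict mapping (item, store) -> DataFrame for that combination
df_dict = {key: group for key, group in df.groupby(['item', 'store'])}
# e.g. df_dict[(1, 1)] would be the frame for item 1 in store 1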

Python - Calculating standard deviation (row level) of dataframe columns

I have created a Pandas Dataframe and am able to determine the standard deviation of one or more columns of this dataframe (column level). I need to determine the standard deviation for all the rows of a particular column. Below are the commands that I have tried so far
# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()
salary 8.194421e-01
num_months 3.690081e+05
no_of_hours 2.518869e+02
# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)
# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()
salary 8.194421e-01
# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)
0 4.374107e+12
1 4.377543e+12
2 4.374026e+12
3 4.374046e+12
4 4.374112e+12
5 4.373926e+12
When I execute the below command I am getting "NaN" for all the records. Is there a way to resolve this?
# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
It is expected; check DataFrame.std:
Normalized by N-1 by default. This can be changed using the ddof argument
With one element, you're dividing by 0. So if you have one column and want the sample standard deviation across columns (axis=1), you get all missing values.
Sample:
inp_df = pd.DataFrame({'salary':[10,20,30],
'num_months':[1,2,3],
'no_of_hours':[2,5,6]})
print (inp_df)
salary num_months no_of_hours
0 10 1 2
1 20 2 5
2 30 3 6
Select one column by one [] for Series:
print (inp_df['salary'])
0 10
1 20
2 30
Name: salary, dtype: int64
Get std of Series - get a scalar:
print (inp_df['salary'].std())
10.0
Select one column by double [] for one column DataFrame:
print (inp_df[['salary']])
salary
0 10
1 20
2 30
Get std of DataFrame per index (default value) - get one element Series:
print (inp_df[['salary']].std())
#same like
#print (inp_df[['salary']].std(axis=0))
salary 10.0
dtype: float64
Get std of DataFrame per columns (axis=1) - get all NaNs:
print (inp_df[['salary']].std(axis = 1))
0 NaN
1 NaN
2 NaN
dtype: float64
If you change the default ddof=1 to ddof=0:
print (inp_df[['salary']].std(axis = 1, ddof=0))
0 0.0
1 0.0
2 0.0
dtype: float64
If you want std by two or more columns:
#select 2 columns
print (inp_df[['salary', 'num_months']])
salary num_months
0 10 1
1 20 2
2 30 3
#std by index
print (inp_df[['salary','num_months']].std())
salary 10.0
num_months 1.0
dtype: float64
#std by columns
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0 5.656854
1 10.606602
2 16.970563
dtype: float64
