Calculate the sum of the numbers separated by a comma in a dataframe column - python-3.x

I am trying to calculate the sum of all the numbers separated by commas in a dataframe column, but I keep getting an error. This is what the dataframe looks like:
Description scores
logo
graphics
eyewear 0.360740,-0.000758
glasses 0.360740,-0.000758
picture -0.000646
tutorial 0.001007,0.000968,0.000929,0.000889
computer 0.852264 0.001007,0.000968,0.000929,0.000889
This is what the code looks like
test['Sum'] = test['scores'].apply(lambda x: sum(map(float, x.split(','))))
However I keep getting the following error
ValueError: could not convert string to float:
I thought it could be because of the missing values at the start of the dataframe, but even after subsetting the dataframe to exclude the missing values, I still get the same error.
Output
Description scores SUM
logo
graphics
eyewear 0.360740,-0.000758 0.359982
glasses 0.360740,-0.000758 0.359982
picture -0.000646 -0.000646
tutorial 0.001007,0.000968,0.000929,0.000889 0.003793
computer 0.852264 0.001007,0.000968,0.000929,0.000889 0.856057
How can I resolve it?

There are times when plain Python is the most effective tool, and this might be one of them.
df['scores'].apply(lambda x: sum(float(i) if len(x) > 0 else np.nan for i in x.split(',')))
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
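For reference, here is a minimal, self-contained sketch of the same idea, assuming the blank rows hold empty strings (the sample data below is a reconstruction, not the asker's actual frame):

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the 'scores' column, with empty strings
# standing in for the blank rows.
scores = pd.Series([
    '', '',
    '0.360740,-0.000758',
    '0.360740,-0.000758',
    '-0.000646',
    '0.001007,0.000968,0.000929,0.000889',
    '0.852264,0.001007,0.000968,0.000929,0.000889',
])

# float('') would raise ValueError, so guard against empty cells explicitly.
sums = scores.apply(
    lambda x: sum(float(i) for i in x.split(',')) if x else np.nan
)
print(sums.round(6).tolist())
```

The guard is what the original lambda was missing: any row whose cell is empty (or otherwise blank) makes `float('')` fail with exactly the `ValueError` from the question.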

You can just use str.split:
df.scores.str.split(',',expand=True).astype(float).sum(1).mask(df.scores.isnull())
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
dtype: float64

Another solution using explode, groupby and sum functions:
df.scores.str.split(',').explode().astype(float).groupby(level=0).sum(min_count=1)
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
Name: scores, dtype: float64
Or, to make @WeNYoBen's answer slightly shorter:
df.scores.str.split(',',expand=True).astype(float).sum(1, min_count=1)
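Both vectorized answers can be checked against a hypothetical reconstruction of the question's frame (None for the blank rows); the two approaches should agree:

```python
import pandas as pd

# Hypothetical reconstruction of the question's 'scores' column.
df = pd.DataFrame({'scores': [
    None, None,
    '0.360740,-0.000758',
    '0.360740,-0.000758',
    '-0.000646',
    '0.001007,0.000968,0.000929,0.000889',
    '0.852264,0.001007,0.000968,0.000929,0.000889',
]})

# Approach 1: split into columns, sum across them; min_count=1 keeps
# all-NaN rows as NaN instead of 0.
a = df.scores.str.split(',', expand=True).astype(float).sum(1, min_count=1)

# Approach 2: explode into one value per row, then sum back per index.
b = (df.scores.str.split(',').explode().astype(float)
       .groupby(level=0).sum(min_count=1))

print(a.round(6).tolist())
```

`min_count=1` is the key detail in both: without it, `sum` treats an all-NaN row as an empty sum and returns 0.0 rather than NaN.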


Keeping columns of pandas dataframe whose substring is in the list

I have a dataframe with many columns. I only want to retain those columns whose substring is in the list. For example the lst and dataframe is:
lst = ['col93','col71']
sample_id. col9381.3 col8371.8 col71937.9 col19993.1
1
2
3
4
Based on the substrings, the resulting dataframe will look like:
sample_id. col9381.3 col71937.9
1
2
3
4
I have code that goes through the list and filters out the columns whose names contain a substring from the list, but I don't know how to combine the results into one dataframe. The code so far:
for i in lst:
    df2 = df1.filter(regex=i)
    if df2.shape[1] > 0:
        print(df2)
The above code is able to filter out the columns, but I don't know how to combine all of these into one dataframe. Insights will be appreciated.
Try with startswith which accepts a tuple of options:
df.loc[:, df.columns.str.startswith(('sample_id.',)+tuple(lst))]
Or filter which accepts a regex as you were trying:
df.filter(regex='|'.join(['sample_id']+lst))
Output:
sample_id. col9381.3 col71937.9
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
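A minimal sketch of the filter approach, assuming a reconstructed frame with the question's column names:

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's frame.
df = pd.DataFrame({
    'sample_id.': [1, 2, 3, 4],
    'col9381.3': np.nan,
    'col8371.8': np.nan,
    'col71937.9': np.nan,
    'col19993.1': np.nan,
})

lst = ['col93', 'col71']

# filter(regex=...) keeps every column whose name matches the pattern;
# joining with '|' builds one alternation from the whole list.
out = df.filter(regex='|'.join(['sample_id'] + lst))
print(list(out.columns))
```

Joining the substrings with `'|'` turns the per-item loop from the question into a single call, which is why the results come back as one dataframe.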

Unable to make dtype conversion from object to str

loan_amnt funded_amnt funded_amnt_inv term
0 5000.0 5000.0 4975.0 36 months
1 2500.0 2500.0 2500.0 60 months
My dataframe looks like the above.
I need to change '36 months' under term to '3' (converting months to years), but I'm unable to do so because the dtype of 'term' is object.
loan.term.replace('36months','3',inplace=True) --> no change in the dataframe
I tried the below code for type conversion but it still returns the dtype as Object
loan['term']=loan.term.astype(str)
Expected output:
loan_amnt funded_amnt funded_amnt_inv term
0 5000.0 5000.0 4975.0 3
1 2500.0 2500.0 2500.0 5
Any help would be dearly appreciated. Thank you for your time.
Put your data in a variable like data. Note that this iterates over rows, so it assumes data is a list of row dictionaries rather than a DataFrame:
for i in data:
    i['term'] = int(i['term'].split(' ')[0]) // 12
print(data)
you'll get your output!
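If the data is actually a pandas DataFrame, a vectorized sketch of the same transformation might look like this (the sample frame is a reconstruction of the question's 'term' column):

```python
import pandas as pd

# Hypothetical reconstruction of the 'term' column from the question.
loan = pd.DataFrame({'term': ['36 months', '60 months']})

# Take the leading number, cast to int, then integer-divide by 12 for years.
loan['term'] = loan['term'].str.split().str[0].astype(int) // 12
print(loan['term'].tolist())
```

This also sidesteps the original problem: replace('36months', ...) never matched because the actual cell value contains a space ('36 months').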

Remove "x" number of characters from a string in a pandas dataframe?

I have a pandas dataframe df looking like this:
a b
thisisastring 5
anotherstring 6
thirdstring 7
I want to remove characters from the left of the strings in column a based on the number in column b. So I tried:
df["a"] = d["a"].str[df["b"]:]
But this will result in:
a b
NaN 5
NaN 6
NaN 7
Instead of:
a b
sastring 5
rstring 6
ring 7
Any help? Thanks in advance!
Using zip with string slice
df.a=[x[y:] for x,y in zip(df.a,df.b)]
df
Out[584]:
a b
0 sastring 5
1 rstring 6
2 ring 7
You can do it with apply, to apply this row-wise:
df.apply(lambda x: x.a[x.b:],axis=1)
0 sastring
1 rstring
2 ring
dtype: object
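Both answers can be sketched end to end on a reconstructed frame:

```python
import pandas as pd

# Hypothetical reconstruction of the question's frame.
df = pd.DataFrame({'a': ['thisisastring', 'anotherstring', 'thirdstring'],
                   'b': [5, 6, 7]})

# Slice each string by its own row's offset; plain Python slicing accepts
# a per-row integer, which the vectorized .str[...] accessor does not.
df['a'] = [x[y:] for x, y in zip(df['a'], df['b'])]
print(df['a'].tolist())
```

The reason the original attempt produced NaN is that `.str[df["b"]:]` tries to use a whole Series as a slice bound instead of one integer per row.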

Pandas: Update values of a column

I have a large dataframe with multiple columns (sample shown below). I want to update the values of one particular (population column) column by dividing the values of it by 1000.
City Population
Paris 23456
Lisbon 123466
Madrid 1254
Pekin 86648
I have tried
df['Population'].apply(lambda x: int(str(x))/1000)
and
df['Population'].apply(lambda x: int(x)/1000)
Both give me the error
ValueError: invalid literal for int() with base 10: '...'
If your DataFrame really does look as presented, then the second example should work just fine (with the int not even being necessary):
In [16]: df
Out[16]:
City Population
0 Paris 23456
1 Lisbon 123466
2 Madrid 1254
3 Pekin 86648
In [17]: df['Population'].apply(lambda x: x/1000)
Out[17]:
0 23.456
1 123.466
2 1.254
3 86.648
Name: Population, dtype: float64
In [18]: df['Population']/1000
Out[18]:
0 23.456
1 123.466
2 1.254
3 86.648
However, from the error, it seems like you have the unparsable string '...' somewhere in your Series, and that the data needs to be cleaned further.
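One way to clean such a Series is pd.to_numeric with errors='coerce', which turns unparsable entries into NaN instead of raising (the placeholder value below is assumed, based on the error message):

```python
import pandas as pd

# Hypothetical Series containing an unparsable '...' entry.
pop = pd.Series(['23456', '123466', '...', '86648'])

# errors='coerce' converts anything unparsable to NaN instead of raising.
cleaned = pd.to_numeric(pop, errors='coerce') / 1000
print(cleaned.tolist())
```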

Append Two Dataframes Together (Pandas, Python3)

I am trying to append/join(?) two different dataframes together that don't share any overlapping data.
DF1 looks like
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
....
Brown 6
and DF2 looks like
Area Miles
2 3
1 2
....
7 12
I am trying to append these together using
bigdata = df1.append(df2,ignore_index = True).reset_index()
but I get this
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
Area Miles
2 3
1 2
How do I get something like this?
Teams Points Area Miles
Red 2 2 3
Green 1 1 2
Orange 3
Yellow 4
EDIT: Regarding Edchum's answer, I have tried merge and join, but each creates a somewhat strange table. Instead of what I am looking for (as listed above), it will return something like this:
Teams Points Area Miles
Red 2 2 3
Green 1
Orange 3 1 2
Yellow 4
Use concat and pass param axis=1:
In [4]:
pd.concat([df1,df2], axis=1)
Out[4]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
join also works:
In [8]:
df1.join(df2)
Out[8]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
As does merge:
In [11]:
df1.merge(df2,left_index=True, right_index=True, how='left')
Out[11]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
EDIT
In the case where the indices do not align (for example, your first df has index [0, 1, 2, 3] and your second df has index [0, 2]), the above operations will naturally align against the first df's index, resulting in a NaN row for index row 1. To fix this, you can reindex the second df, either by calling reset_index() or by assigning directly: df2.index = [0, 1].
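A short sketch of that fix, with df2 deliberately given the misaligned index [0, 2]:

```python
import pandas as pd

df1 = pd.DataFrame({'Teams': ['Red', 'Green', 'Orange', 'Yellow'],
                    'Points': [2, 1, 3, 4]})
# df2 deliberately carries the non-default index [0, 2].
df2 = pd.DataFrame({'Area': [2, 1], 'Miles': [3, 2]}, index=[0, 2])

# Resetting df2's index realigns it to [0, 1] before the side-by-side concat.
out = pd.concat([df1, df2.reset_index(drop=True)], axis=1)
print(out)
```

Without the reset_index(drop=True), row 1 of df1 would pair with nothing and row 2 would pick up df2's second row, producing the "strange table" shown in the question's edit.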
