Unable to make dtype conversion from object to str - python-3.x

   loan_amnt  funded_amnt  funded_amnt_inv       term
0     5000.0       5000.0           4975.0  36 months
1     2500.0       2500.0           2500.0  60 months
My dataframe looks like the one above.
I need to change '36 months' under term to '3' (converting months to years), but I'm unable to do so because the dtype of 'term' is object.
loan.term.replace('36months','3',inplace=True) --> no change in the dataframe
I tried the code below for type conversion, but it still reports the dtype as object:
loan['term']=loan.term.astype(str)
Expected output:
   loan_amnt  funded_amnt  funded_amnt_inv  term
0     5000.0       5000.0           4975.0     3
1     2500.0       2500.0           2500.0     5
Any help would be dearly appreciated. Thank you for your time.

Put your data in a variable like data, then use this:
# take the number before the space, convert it to int, and turn months into years
data['term'] = data['term'].str.split(' ').str[0].astype(int) // 12
print(data)
you'll get your output!
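For what it's worth, the original replace call failed for two reasons: the value in the data is '36 months' (with a space), and Series.replace without regex=True only swaps values that match exactly. Also, astype(str) will still report dtype object, because pandas stores Python strings in object arrays. A regex-based alternative, as a minimal sketch:
# sketch: assumes 'term' holds strings like '36 months' / '60 months';
# regex=True lets replace() rewrite matching substrings instead of whole values
loan['term'] = loan['term'].replace(r'\s*months', '', regex=True).astype(int) // 12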

Related

How to unmerge cells and create a standard dataframe when reading excel file?

I would like to convert this dataframe
into this dataframe
So far reading excel the standard way gives me the following result.
df = pd.read_excel(folder + 'abcd.xlsx', sheet_name="Sheet1")
   Unnamed: 0   Unnamed: 1  T12006      T22006      T32006  \
0  Casablanca       Global     100    97.27252   93.464538
1         NaN  Résidentiel     100   95.883979   92.414063
2         NaN  Appartement     100   95.425152   91.674379
3         NaN       Maison     100  101.463607  104.039383
4         NaN        Villa     100   102.45132  101.996932
Thank you
You can try the .fillna() method with the parameter method='ffill'. According to the pandas documentation: ffill: propagate last valid observation forward to next valid.
So, your code would be like:
df.fillna(method='ffill', inplace=True)
And give the first two columns proper names with this line:
df = df.rename(columns={'Unnamed: 0': 'City', 'Unnamed: 1': 'Type'})
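Putting it all together, a minimal sketch (folder, the file name, and the 'Unnamed: …' headers are taken from the question):
import pandas as pd

df = pd.read_excel(folder + 'abcd.xlsx', sheet_name="Sheet1")
# forward-fill the city names that the merged cells left as NaN
df.fillna(method='ffill', inplace=True)
# give the two unnamed columns meaningful headers
df = df.rename(columns={'Unnamed: 0': 'City', 'Unnamed: 1': 'Type'})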

Pandas Dataframe: Dropping Selected rows with 0.0 float type values

Please, I have a dataset that contains amounts as a float type. Some of the rows contain values of 0.00, and because they skew the dataset, I need to drop them. I have temporarily set the amount as the index and sorted the values as well.
Afterwards, I attempted to drop the rows after subsetting with .loc, but I keep getting an error message of the form ValueError: Buffer has wrong number of dimensions (expected 1, got 3)
mortgage = mortgage.set_index('Gross Loan Amount').sort_values('Gross Loan Amount')
mortgage.drop([mortgage.loc[0.0]])
I equally tried this:
mortgage.drop(mortgage.loc[0.0])
It flagged an error of the form KeyError: "[Column_names] not found in axis".
Please, how else can I accomplish the task?
You could make a boolean frame and then use any:
df = df[~(df == 0).any(axis=1)]
In this code, all rows that have at least one zero have been removed.
Let me see if I get your problem. I created this sample dataset:
df = pd.DataFrame({'Values': [200.04,100.00,0.00,150.15,69.98,0.10,2.90,34.6,12.6,0.00,0.00]})
df
Values
0 200.04
1 100.00
2 0.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
9 0.00
10 0.00
Now, in order to get rid of the 0.00 values, you just have to do this:
df = df[df['Values'] != 0.00]
Output:
df
Values
0 200.04
1 100.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
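One caveat: an exact != 0.00 comparison only drops true zeros. If the zeros come out of floating-point arithmetic, they may be stored as values like 1e-16 and survive the filter. A tolerance-based sketch, assuming the same sample df and numpy:
import numpy as np

# keep only the rows whose value is not approximately zero
df = df[~np.isclose(df['Values'], 0.0)]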

How to update rows based on the previous row of a dataframe in Python

I have a time series data given below:
date        product  price  amount
11/01/2019  A           10      20
11/02/2019  A           10      20
11/03/2019  A           25      15
11/04/2019  C           40      50
11/05/2019  C           50      60
I have high-dimensional data, and I have just added a simplified version with the two columns {price, amount}. I am trying to transform it into relative changes along the time index, as illustrated below:
date        product  price  amount
11/01/2019  A          NaN     NaN
11/02/2019  A            0       0
11/03/2019  A           15      -5
11/04/2019  C          NaN     NaN
11/05/2019  C           10      10
I am trying to get the relative change of each product along the time index. If a previous date does not exist for a given product, I add NaN.
Is there a function to do this?
Group by product and use .diff()
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
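If by "relative change" a ratio is wanted rather than an absolute difference, pct_change works the same way inside each group. A minimal sketch against the original frame (run this instead of the .diff() line):
# percent change relative to the previous row of the same product
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].pct_change()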

Calculate the sum of the numbers separated by a comma in a dataframe column

I am trying to calculate the sum of all the numbers separated by commas in a dataframe column; however, I keep getting an error. This is what the dataframe looks like:
Description                                       scores
logo
graphics
eyewear                               0.360740,-0.000758
glasses                               0.360740,-0.000758
picture                                        -0.000646
tutorial             0.001007,0.000968,0.000929,0.000889
computer    0.852264 0.001007,0.000968,0.000929,0.000889
This is what the code looks like
test['Sum'] = test['scores'].apply(lambda x: sum(map(float, x.split(','))))
However I keep getting the following error
ValueError: could not convert string to float:
I thought it could be because of the missing values at the start of the dataframe, but I subset the dataframe to exclude the missing values and still get the same error.
Expected output:
Description                                       scores       SUM
logo
graphics
eyewear                               0.360740,-0.000758  0.359982
glasses                               0.360740,-0.000758  0.359982
picture                                        -0.000646 -0.000646
tutorial             0.001007,0.000968,0.000929,0.000889  0.003793
computer    0.852264 0.001007,0.000968,0.000929,0.000889  0.856057
How can I resolve it?
There are times when plain Python seems to be very effective; this might be one of those:
df['scores'].apply(lambda x: sum(float(i) if len(x) > 0 else np.nan for i in x.split(',')))
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
You can just do str.split
df.scores.str.split(',',expand=True).astype(float).sum(1).mask(df.scores.isnull())
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
dtype: float64
Another solution using explode, groupby and sum functions:
df.scores.str.split(',').explode().astype(float).groupby(level=0).sum(min_count=1)
0 NaN
1 NaN
2 0.359982
3 0.359982
4 -0.000646
5 0.003793
6 0.856057
Name: scores, dtype: float64
Or, to make #WeNYoBen's answer slightly shorter:
df.scores.str.split(',',expand=True).astype(float).sum(1, min_count=1)
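The min_count=1 keyword is what keeps the all-NaN rows as NaN; without it, sum() over a row of NaNs returns 0.0. A small self-contained check (a two-row toy frame, not the asker's data):
import pandas as pd

df = pd.DataFrame({'scores': [None, '0.360740,-0.000758']})
parts = df.scores.str.split(',', expand=True).astype(float)
print(parts.sum(1))               # row 0 becomes 0.0
print(parts.sum(1, min_count=1))  # row 0 stays NaN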

ValueError: arrays must all be same length in python using pandas DataFrame

I'm a newbie in Python and am using DataFrame from the pandas package (Python 3.6).
I set it up like the code below,
df = DataFrame({'list1': list1, 'list2': list2, 'list3': list3, 'list4': list4, 'list5': list5, 'list6': list6})
and it gives an error like ValueError: arrays must all be same length.
So I checked the lengths of all the arrays, and list1 & list2 have one more entry than the other lists. If I want to add one entry to those other four lists (list3, list4, list5, list6) by using pd.resample, how should I write the code...?
Also, those lists are 1-minute time series.
Does anybody have an idea, or can someone help me out here?
Thanks in advance.
EDIT
So I changed it as EdChum said,
and added a time list at the front. It looks like this:
2017-04-01 0:00  895.87  730  12.8  4  19.1   380
2017-04-01 0:01  894.4   730  12.8  4  19.1   380
2017-04-01 0:02  893.08  730  12.8  4  19.3   380
2017-04-01 0:03  890.41  730  12.8  4  19.7   380
2017-04-01 0:04  889.28  730  12.8  4  19.93  380
and I typed code like this:
df.resample('1min', how='mean', fill_method='pad')
And it gives me this error: TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
I'd just construct a Series for each list and then concat them all:
In [38]:
l1 = list('abc')
l2 = [1,2,3,4]
s1 = pd.Series(l1, name='list1')
s2 = pd.Series(l2, name='list2')
df = pd.concat([s1,s2], axis=1)
df
Out[38]:
list1 list2
0 a 1
1 b 2
2 c 3
3 NaN 4
As you can pass a name arg to the Series ctor, it will name each column in the df; plus, it will place NaN where the column lengths don't match.
resample refers to when you have a DatetimeIndex that you want to rebase or adjust in length based on some time period, which is not what you want here. The alternative is to reindex, which I think is unnecessary and messy:
In [40]:
l1 = list('abc')
l2 = [1,2,3,4]
s1 = pd.Series(l1)
s2 = pd.Series(l2)
df = pd.DataFrame({'list1':s1.reindex(s2.index), 'list2':s2})
df
Out[40]:
list1 list2
0 a 1
1 b 2
2 c 3
3 NaN 4
Here you'd need to know the longest length and then reindex all Series using that index; if you just concat, it will automatically adjust the lengths and fill the missing elements with NaN.
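Applied to the asker's six lists, the concat route would look something like this sketch (list1 through list6 are the variables from the question):
# a named Series per list; concat pads the shorter ones with NaN
series = [pd.Series(lst, name='list%d' % i)
          for i, lst in enumerate([list1, list2, list3, list4, list5, list6], start=1)]
df = pd.concat(series, axis=1)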
According to the documentation, it looks quite difficult to do this with pd.resample(): you would have to calculate a frequency that adds only one value to your df, and the function really isn't made for that (it is meant for straightforward reshaping, e.g. 1 min to 30 sec or to 1 h)! You'd better try what EdChum did :P
