How to change the format for values in a dataframe? - python-3.x

I need to change the format of the values in a dataframe column. Suppose I have a dataframe in this format:
df =
sector funding_total_usd
1 NaN
2 10,00,000
3 3,90,000
4 34,06,159
5 2,17,50,000
6 20,00,000
How can I change it to this format:
df =
sector funding_total_usd
1 NaN
2 10000.00
3 3900.00
4 34061.59
5 217500.00
6 20000.00
This is my code:
for row in df['funding_total_usd']:
    dt1 = row.replace(',', '')
    print(dt1)
This is the error that I got: "AttributeError: 'float' object has no attribute 'replace'"
I would really appreciate help with how to do this.

Here's the way to get the decimal places:
import pandas as pd
import numpy as np
df= pd.DataFrame({'funding_total_usd': [np.nan, 1000000, 390000, 3406159,21750000,2000000]})
print(df)
df['funding_total_usd'] /= 100
print(df)
funding_total_usd
0 NaN
1 1000000.0
2 390000.0
3 3406159.0
4 21750000.0
5 2000000.0
funding_total_usd
0 NaN
1 10000.00
2 3900.00
3 34061.59
4 217500.00
5 20000.00
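To control how the floats are printed, run this as your first command before you print. It displays every float with two decimal places. Note that it only affects the display; it does not change the stored values or strip commas from strings.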
pd.options.display.float_format = '{:.2f}'.format
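The error in the question appears because NaN is a float, so calling replace on it fails. If the column really holds strings with commas, the vectorized .str accessor skips missing values. A minimal sketch, assuming the column contains strings such as '10,00,000' plus NaN (the sample frame below is reconstructed from the question, not taken from real data):

```python
import numpy as np
import pandas as pd

# Assumed input: comma-separated strings plus a missing value, as in the question.
df = pd.DataFrame(
    {
        "funding_total_usd": [
            np.nan,
            "10,00,000",
            "3,90,000",
            "34,06,159",
            "2,17,50,000",
            "20,00,000",
        ]
    }
)

# .str methods leave NaN untouched, so no explicit float check is needed.
cleaned = df["funding_total_usd"].str.replace(",", "", regex=False)

# Convert to float and divide by 100 to match the requested output.
df["funding_total_usd"] = cleaned.astype(float) / 100
print(df)
```

The division by 100 is only there because the requested output is scaled that way; drop it if you just want plain numbers.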

Related

Multiplying 2 pandas dataframes generates nan

I have 2 dataframes as below
import pandas as pd
dat = pd.DataFrame({'val1' : [1,2,1,2,4], 'val2' : [1,2,1,2,4]})
dat1 = pd.DataFrame({'val3' : [1,2,1,2,4]})
Now I want to multiply each column of dat by dat1. So I did the following:
dat * dat1
However, this generates NaN for all elements.
Could you please help with the correct approach? I could run a for loop over each column of dat, but I wonder if there is a better method available to do the same.
Thanks for any pointers.
When doing multiplication (or any arithmetic operation), pandas aligns on labels. For dataframes this applies to both the index and the columns. Where the labels match, it multiplies; otherwise it puts NaN, and the result has the union of the indices and columns of the operands.
So, to "avoid" this alignment, make dat1 a label-unaware data structure, e.g., a NumPy array:
In [116]: dat * dat1.to_numpy()
Out[116]:
val1 val2
0 1 1
1 4 4
2 1 1
3 4 4
4 16 16
To see what's "really" being multiplied, you can do the alignment yourself:
In [117]: dat.align(dat1)
Out[117]:
( val1 val2 val3
0 1 1 NaN
1 2 2 NaN
2 1 1 NaN
3 2 2 NaN
4 4 4 NaN,
val1 val2 val3
0 NaN NaN 1
1 NaN NaN 2
2 NaN NaN 1
3 NaN NaN 2
4 NaN NaN 4)
(Extra: dat and dat1 currently have the same index. Change the index of one of them and align again to see the union behaviour.)
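The union behaviour mentioned above can be tried with a small sketch. The shifted index [1, 2, 3, 4, 5] is an arbitrary choice for illustration, not something from the question:

```python
import pandas as pd

dat = pd.DataFrame({"val1": [1, 2, 1, 2, 4], "val2": [1, 2, 1, 2, 4]})

# Same values as dat1 in the question, but with a shifted index (assumption).
dat1 = pd.DataFrame({"val3": [1, 2, 1, 2, 4]}, index=[1, 2, 3, 4, 5])

# align() defaults to an outer join on both axes.
left, right = dat.align(dat1)

# Both results now carry the union of the row labels and column labels.
print(left.index.tolist())
print(left.columns.tolist())
print(right)
```

Row 0 exists only in dat and row 5 only in dat1, so both aligned frames end up with six rows.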
You need to change two things:
use mul with axis=0
use a Series instead of dat1 (otherwise the multiplication will try to align the column labels, and there are none in common between your two dataframes)
out = dat.mul(dat1['val3'], axis=0)
output:
val1 val2
0 1 1
1 4 4
2 1 1
3 4 4
4 16 16
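To check that the two suggested approaches agree, here is a short sketch using the frames from the question. The variable names product, by_series and by_array are only illustrative:

```python
import pandas as pd

dat = pd.DataFrame({"val1": [1, 2, 1, 2, 4], "val2": [1, 2, 1, 2, 4]})
dat1 = pd.DataFrame({"val3": [1, 2, 1, 2, 4]})

# Plain multiplication aligns on column labels; none match, so all is NaN.
product = dat * dat1
print(product)

# Multiply every column of dat by the val3 Series, matching on the row index.
by_series = dat.mul(dat1["val3"], axis=0)

# Multiply by a label-free (5, 1) array, which broadcasts across the columns.
by_array = dat * dat1.to_numpy()

print(by_series)
print(by_array)
```

The Series version still aligns on the row index, so it is the safer choice when the two frames might be in a different row order; the NumPy version multiplies purely by position.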

Pandas: Get substring between start and end of the characters

I am attempting to get substring between start and end of different characters. I tried several different regex notations, I am coming close to the output I need, but it is not fully correct. What can I do to fix this?
Data csv
ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling
Code:
import pandas as pd
df = pd.read_csv('sample2.csv')
print(df)
I tried:
df['TEST'].astype(str).str.extract('(1#.*(?=#))')
Output from the above code (it does not pick up '1#USA' at the end of a line):
1#London4#Harry Potter#5Rowling#
1#England
NaN
NaN
NaN
1#China
Output needed:
1#London
1#England
1#USA
NaN
NaN
1#China
You can do this:
>>> df.TEST.str.extract("(1#[a-zA-Z]*)")
0
0 1#London
1 1#England
2 1#USA
3 NaN
4 NaN
5 1#China
You can try:
# capture all characters that are neither `#` nor digits
# following 1#
df['TEST'].str.extract(r'(1#[^#\d]+)', expand=False)
Output:
0 1#London
1 1#England
2 1#USA
3 NaN
4 NaN
5 1#China
Name: TEST, dtype: object
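For anyone who wants to try this without the CSV file, here is a self-contained sketch. It reads the sample data from the question through io.StringIO instead of sample2.csv, and the column name MATCH is made up for the example:

```python
import io

import pandas as pd

# Sample data copied from the question, read from memory instead of a file.
csv_text = """ID,TEST
abc,1#London4#Harry Potter#5Rowling##
cde,6#Harry Potter1#England#5Rowling
efg,4#Harry Potter#5Rowling##1#USA
ghi,
jkm,4#Harry Potter5#Rowling
xyz,4#Harry Potter1#China#5Rowling
"""
df = pd.read_csv(io.StringIO(csv_text))

# "1#" followed by one or more characters that are neither "#" nor a digit.
# The raw string keeps \d from being treated as a string escape.
df["MATCH"] = df["TEST"].str.extract(r"(1#[^#\d]+)", expand=False)
print(df[["ID", "MATCH"]])
```

Unlike the [a-zA-Z]* pattern, the negated character class also keeps spaces and other non-letter characters, which matters if a place name contains them.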

Converting timedeltas to integers for consecutive time points in pandas

Suppose I have the dataframe
import pandas as pd
df = pd.DataFrame({"Time": ['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04']})
print(df)
Time
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04
If I want to calculate the time from the lowest time point for each time in the dataframe, I can use the apply function like
df['Time'] = pd.to_datetime(df['Time'])
df.sort_values('Time', inplace=True)
df['Time'] = df['Time'].apply(lambda x: (x - df['Time'].iloc[0]).days)
print(df)
Time
0 0
1 1
2 2
3 3
Is there a function in Pandas that does this already?
I recommend not using apply here. Subtract the first value directly and use the .dt accessor:
(df.Time-df.Time.iloc[0]).dt.days
0 0
1 1
2 2
3 3
Name: Time, dtype: int64
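If the column is not sorted, subtracting the minimum avoids relying on iloc[0]. A minimal sketch, assuming the dates arrive in arbitrary order (the shuffled sample and the Days column name are made up for illustration):

```python
import pandas as pd

# Same dates as in the question, deliberately out of order.
df = pd.DataFrame({"Time": ["2010-01-03", "2010-01-01", "2010-01-04", "2010-01-02"]})

times = pd.to_datetime(df["Time"])

# Subtracting the earliest timestamp gives a timedelta Series;
# .dt.days turns it into whole days as integers.
df["Days"] = (times - times.min()).dt.days
print(df)
```

This gives the offset from the earliest time point without needing to sort first.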

Remove "x" number of characters from a string in a pandas dataframe?

I have a pandas dataframe df looking like this:
a b
thisisastring 5
anotherstring 6
thirdstring 7
I want to remove characters from the left of the strings in column a based on the number in column b. So I tried:
df["a"] = df["a"].str[df["b"]:]
But this will result in:
a b
NaN 5
NaN 6
NaN 7
Instead of:
a b
sastring 5
rstring 6
ring 7
Any help? Thanks in advance!
Use zip with string slicing:
df['a'] = [x[y:] for x, y in zip(df['a'], df['b'])]
df
Out[584]:
a b
0 sastring 5
1 rstring 6
2 ring 7
You can do it with apply, to apply this row-wise:
df.apply(lambda x: x.a[x.b:],axis=1)
0 sastring
1 rstring
2 ring
dtype: object
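Here is a runnable version of the zip approach with the frame from the question rebuilt by hand. The column name trimmed is just for the example:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": ["thisisastring", "anotherstring", "thirdstring"],
        "b": [5, 6, 7],
    }
)

# Pair each string with its own offset and slice it in plain Python.
# .str[...] cannot take a Series as a slice bound, hence the comprehension.
df["trimmed"] = [text[n:] for text, n in zip(df["a"], df["b"])]
print(df)
```

The list comprehension usually runs faster than apply with axis=1, since it avoids building a Series object for every row.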

Using relative positioning with Python 3.5 and pandas

I am formatting some csv files, and I need to add columns that use other columns for arithmetic. Like in Excel, B3 = sum(A1:A3)/3, then B4 = sum(A2:A4)/3. I've looked up relative indexes and haven't found what I'm trying to do.
import pandas as pd

def formula_columns(csv_list, dir_env):
    for file in csv_list:
        df = pd.read_csv(dir_env + file)
        avg_12(df)
        print(df[10:20])

# Create AVG(12) Column
def avg_12(df):
    df['AVG(12)'] = df['Price']
    # Right here I want to set each value of 'AVG(12)' to equal
    # the sum of the value of price from its own index plus the
    # previous 11 indexes
    df.loc[:10, 'AVG(12)'] = 0
I would imagine this is a common task, so I assume I'm looking in the wrong places. If anyone has some advice, I would appreciate it. Thanks.
That can be done with the rolling method:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1, 5, 10), columns = ['A'])
df
Out[151]:
A
0 2
1 4
2 1
3 1
4 4
5 2
6 4
7 2
8 4
9 1
Take the averages of A1:A3, A2:A4 etc:
df.rolling(3).mean()
Out[152]:
A
0 NaN
1 NaN
2 2.333333
3 2.000000
4 2.000000
5 2.333333
6 3.333333
7 2.666667
8 3.333333
9 2.333333
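The rolling method requires pandas 0.18 or later. For earlier versions, use pd.rolling_mean(), which was deprecated in 0.18 and is no longer available in current pandas: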
pd.rolling_mean(df['A'], 3)
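Applied to the question's avg_12 function, the same idea might look like the sketch below. It assumes a 'Price' column as in the question; the sample prices 1 to 24 are made up so the result is easy to check by hand:

```python
import pandas as pd


def avg_12(df):
    """Add an 'AVG(12)' column: mean of each price and the previous 11 rows."""
    df["AVG(12)"] = df["Price"].rolling(12).mean()
    # The first 11 rows have no full window and come out as NaN;
    # the question sets them to 0, so do the same here.
    df["AVG(12)"] = df["AVG(12)"].fillna(0)
    return df


# Hypothetical sample data standing in for one of the csv files.
df = avg_12(pd.DataFrame({"Price": range(1, 25)}))
print(df[10:20])
```

If a partial average is preferred for the first rows instead of 0, rolling(12, min_periods=1) is worth a look.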
