How to check numbers after decimal point? - python-3.x

How do I check whether a number has non-zero digits after the decimal point?
import pandas as pd
df = pd.DataFrame({'num':[1,2,3.5,4,5.8]})
df:
num
0 1.0
1 2.0
2 3.5
3 4.0
4 5.8
After check:
num check_point
0 1.0 0
1 2.0 0
2 3.5 1
3 4.0 0
4 5.8 1

Use numpy.modf to get the values after the decimal point, then compare them to 0 with ne (not equal) and cast to integer:
import numpy as np

df['check_point'] = np.modf(df['num'])[0].ne(0).astype(int)
Or use numpy.where:
df['check_point'] = np.where(np.modf(df['num'])[0] == 0, 0, 1)
Another idea is to test whether each float is a whole number with is_integer:
df['check_point'] = np.where(df['num'].apply(lambda x: x.is_integer()), 0, 1)
Or:
df['check_point'] = np.where(df['num'].sub(df['num'].astype(int)).astype(bool), 1, 0)
print (df)
num check_point
0 1.0 0
1 2.0 0
2 3.5 1
3 4.0 0
4 5.8 1
Detail:
print (np.modf(df['num']))
(0 0.0
1 0.0
2 0.5
3 0.0
4 0.8
Name: num, dtype: float64, 0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
Name: num, dtype: float64)
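numpy.modf returns a pair of arrays (fractional parts, integral parts), so the [0] above selects the fractional parts. A minimal self-contained sketch unpacking both parts explicitly (variable names are mine):
import numpy as np
import pandas as pd

df = pd.DataFrame({'num': [1, 2, 3.5, 4, 5.8]})
# np.modf splits each value into its fractional and integral parts
frac, whole = np.modf(df['num'])
df['check_point'] = frac.ne(0).astype(int)  # 1 where a fractional part remains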

Check the difference between the number and its rounded version to determine whether it is an integer.
df['check_point'] = df.num.sub(df.num.round()).ne(0).astype(int)

Related

Creating python data frame from list of dictionary

I have the following data:
sentences = [{'mary':'N', 'jane':'N', 'can':'M', 'see':'V','will':'N'},
{'spot':'N','will':'M','see':'V','mary':'N'},
{'will':'M','jane':'N','spot':'V','mary':'N'},
{'mary':'N','will':'M','pat':'V','spot':'N'}]
I want to create a data frame where each key (word) from the pairs above becomes a row index and each value (tag) becomes a column name. The cells of the data frame hold the count of each matching word/tag pair.
The expected result should be:
df = pd.DataFrame([(4,0,0),
(2,0,0),
(0,1,0),
(0,0,2),
(1,3,0),
(2,0,1),
(0,0,1)],
index=['mary', 'jane', 'can', 'see', 'will', 'spot', 'pat'],
columns=('N','M','V'))
Use value_counts per column with DataFrame.apply, replace the missing values, convert to integers and finally transpose with DataFrame.T:
df = pd.DataFrame(sentences)
df = df.apply(pd.value_counts).fillna(0).astype(int).T
print (df)
M N V
mary 0 3 1
jane 0 2 0
can 1 0 0
see 0 0 2
will 3 1 0
spot 0 2 1
pat 0 0 1
Or use DataFrame.stack with SeriesGroupBy.value_counts and Series.unstack:
df = df.stack().groupby(level=1).value_counts().unstack(fill_value=0)
print (df)
M N V
can 1 0 0
jane 0 2 0
mary 0 3 1
pat 0 0 1
see 0 0 2
spot 0 2 1
will 3 1 0
pd.DataFrame(sentences).T.stack().groupby(level=0).value_counts().unstack().fillna(0)
M N V
can 1.0 0.0 0.0
jane 0.0 2.0 0.0
mary 0.0 3.0 1.0
pat 0.0 0.0 1.0
see 0.0 0.0 2.0
spot 0.0 2.0 1.0
will 3.0 1.0 0.0
Cast to int if needed:
pd.DataFrame(sentences).T.stack().groupby(level=0).value_counts().unstack().fillna(0).astype(int)
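If the row and column order of the expected frame matters, a reindex after any of the variants above (assigning the result to df first) lines it up; this is a small addition, not part of the original answers:
df = df.reindex(index=['mary', 'jane', 'can', 'see', 'will', 'spot', 'pat'],
                columns=['N', 'M', 'V'])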

Replace a column value with number using pandas

For the following dataset, I can replace column 1 with the numeric value easily.
df['1'].replace(['A', 'B', 'C', 'D'], [0, 1, 2, 3], inplace=True)
But if I have 3,600 or more distinct values in a column, how can I replace them with numeric values without writing out each value by hand?
Please let me know. I don't understand how to do that. If anybody has any solution please share with me.
Thanks in advance.
import pandas as pd
df = pd.DataFrame({1: ['A', 'B', 'C', 'C', 'D', 'A'],
                   2: [0.6, 0.9, 5, 4, 7, 1],
                   3: [0.3, 1, 0.7, 8, 2, 4]})
print(df)
1 2 3
0 A 0.6 0.3
1 B 0.9 1.0
2 C 5.0 0.7
3 C 4.0 8.0
4 D 7.0 2.0
5 A 1.0 4.0
np.where makes it easy.
import numpy as np
df[1] = np.where(df[1]=="A", "0",
np.where(df[1]=="B", "1",
np.where(df[1]=="C","2",
np.where(df[1]=="D","3",np.nan))))
print(df)
1 2 3
0 0 0.6 0.3
1 1 0.9 1.0
2 2 5.0 0.7
3 2 4.0 8.0
4 3 7.0 2.0
5 0 1.0 4.0
But if you have a lot of categories, you might want to think about other ways.
import string
upper=list(string.ascii_uppercase)
a=pd.DataFrame({'Alp':upper})
print(a)
Alp
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
.
.
19 T
20 U
21 V
22 W
23 X
24 Y
25 Z
for k in np.arange(0, 26):
    a = a.replace(to_replace=upper[k], value=k)
print(a)
Alp
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
.
.
.
21 21
22 22
23 23
24 24
25 25
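The loop can also be collapsed into a single replace call with a dict built from enumerate, a variation on the same idea:
a = a.replace({letter: k for k, letter in enumerate(upper)})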
If there are many values to replace, you can use factorize:
df[1] = pd.factorize(df[1])[0] + 1
print (df)
1 2 3
0 1 0.6 0.3
1 2 0.9 1.0
2 3 5.0 0.7
3 3 4.0 8.0
4 4 7.0 2.0
5 1 1.0 4.0
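factorize also returns the unique labels in encoding order, which is handy if you later need to map the codes back. A small self-contained sketch (data repeated from the example above):
codes, uniques = pd.factorize(pd.Series(['A', 'B', 'C', 'C', 'D', 'A']))
print(codes)    # [0 1 2 2 3 0]
print(uniques)  # Index(['A', 'B', 'C', 'D'], dtype='object')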
You could do something like
df.loc[df['1'] == 'A','1'] = 0
df.loc[df['1'] == 'B','1'] = 1
### Or
keys = df['1'].unique().tolist()
i = 0
for key in keys:
    df.loc[df['1'] == key, '1'] = i
    i = i + 1
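With thousands of distinct values, building the mapping once and applying it with Series.map is usually faster than looping. A sketch of the same idea (names are mine), using the question's string column '1':
mapping = {key: i for i, key in enumerate(df['1'].unique())}
df['1'] = df['1'].map(mapping)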

Locate rows with 0 value columns and set them to none pandas

Data:
f a b
5 0 1
5 1 3
5 1 3
5 6 3
5 0 0
5 1 5
5 0 0
I know how to locate the rows where both columns are 0; setting them to None, on the other hand, is a mystery.
df_o[(df_o['a'] == 0) & (df_o['b'] == 0)]
# set a and b to None
Expected result:
f a b
5 0 1
5 1 3
5 1 3
5 6 3
5 None None
5 1 5
5 None None
If you are working with numeric values, None is converted to NaN and integers are cast to float by design:
df_o.loc[(df_o['a'] == 0) & (df_o['b'] == 0), ['a','b']] = None
print (df_o)
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN
Another solution uses DataFrame.all with axis=1 to check whether all values per row are True:
df_o.loc[(df_o[['a', 'b']] == 0).all(axis=1), ['a','b']] = None
print (df_o)
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN
Details:
print ((df_o[['a', 'b']] == 0))
a b
0 True False
1 False False
2 False False
3 False False
4 True True
5 False False
6 True True
print ((df_o[['a', 'b']] == 0).all(axis=1))
0 False
1 False
2 False
3 False
4 True
5 False
6 True
dtype: bool
One way I can think of is this: create an extra copy of the dataframe, because the first assignment changes df.a, so the second condition has to be checked against the copy while the values on the main dataframe are set to None. Not the cleanest solution, but:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['f'] = [5,5,5,5,5,5,5]
df['a'] = [0,1,1,6,0,1,0]
df['b'] = [1,3,3,3,0,5,0]
df1 = df.copy()
df['a'] = np.where((df.a == 0) & (df.b == 0), None, df.a)
df['b'] = np.where((df1.a == 0) & (df1.b == 0), None, df.b)
print(df)
Output:
f a b
0 5 0 1
1 5 1 3
2 5 1 3
3 5 6 3
4 5 None None
5 5 1 5
6 5 None None
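A small variation on the same idea (not from the answer above) avoids the extra copy by computing the row mask once and reusing it for both columns:
m = (df['a'] == 0) & (df['b'] == 0)
df['a'] = np.where(m, None, df['a'])
df['b'] = np.where(m, None, df['b'])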
df.replace(0, np.nan) -- to get NaNs (possibly more useful)
df.replace(0, 'None') -- what you actually want
It is surely not the most elegant way to do this, but maybe this helps.
import pandas as pd
data = {'a': [0,1,1,6,0,1,0],
'b':[1,3,3,3,0,5,0]}
df_o = pd.DataFrame.from_dict(data)
df_None = df_o[(df_o['a'] == 0) & (df_o['b'] == 0)]
df_o.loc[df_None.index,:] = None
print(df_o)
Out:
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN
This is how I would do it:
import pandas as pd
a = pd.Series([0, 1, 1, 6, 0, 1, 0])
b = pd.Series([1, 3, 3, 3, 0, 5 ,0])
data = pd.DataFrame({'a': a, 'b': b})
v = [[data[i][j] for i in data] == [0, 0] for j in range(len(data['a']))] # spot null rows
a = [None if v[i] else a[i] for i in range(len(a))]
b = [None if v[i] else b[i] for i in range(len(b))]
data = pd.DataFrame({'a': a, 'b': b})
print(data)
Output:
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN

Multiple columns difference of 2 Pandas DataFrame

I am new to Python and Pandas; can someone help me with the report below?
I want to report the difference of N columns and create new columns holding the difference values. Is it possible to make this dynamic, as I have more than 30 columns? (The columns are fixed in number; the row values can change.)
A and B can be alphanumeric.
Use join with sub for the difference of the DataFrames:
#if columns are strings, first cast it
df1 = df1.astype(int)
df2 = df2.astype(int)
#if first columns are not indices
#df1 = df1.set_index('ID')
#df2 = df2.set_index('ID')
df = df1.join(df2.sub(df1).add_prefix('sum'))
print (df)
A B sumA sumB
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Or similar:
df = df1.join(df2.sub(df1), rsuffix='sum')
print (df)
A B Asum Bsum
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Detail:
print (df2.sub(df1))
A B
ID
0 5 3.0
1 6 5.0
2 7 5.0
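The question's sample frames are not shown here; a minimal reconstruction consistent with the printed output above (an assumption, only so the snippets can be run) would be:
import pandas as pd

df1 = pd.DataFrame({'A': [10, 11, 12], 'B': [2.0, 3.0, 4.0]},
                   index=pd.Index([0, 1, 2], name='ID'))
df2 = pd.DataFrame({'A': [15, 17, 19], 'B': [5.0, 8.0, 9.0]},
                   index=pd.Index([0, 1, 2], name='ID'))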
IIUC
df1[['C','D']]=(df2-df1)[['A','B']]
df1
Out[868]:
ID A B C D
0 0 10 2.0 5 3.0
1 1 11 3.0 6 5.0
2 2 12 4.0 7 5.0
df1.assign(B=0)
Out[869]:
ID A B C D
0 0 10 0 5 3.0
1 1 11 0 6 5.0
2 2 12 0 7 5.0
The 'ID' column should really be an index. See the Pandas tutorial on indexing for why this is a good idea.
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df = df1.copy()
df[['C', 'D']] = df2 - df1
df['B'] = 0
print(df)
outputs
A B C D
ID
0 10 0 5 3.0
1 11 0 6 5.0
2 12 0 7 5.0

Conditional cumulative sum in Python/Pandas

Consider my dataframe, df:
data data_binary sum_data
2 1 1
5 0 0
1 1 1
4 1 2
3 1 3
10 0 0
7 0 0
3 1 1
How can I calculate the cumulative sum of data_binary within groups of contiguous 1 values?
The first group of 1's has a single 1, so sum_data is just 1. However, the second group of 1's has three 1's, so sum_data is [1, 2, 3].
I've tried using np.where(df['data_binary'] == 1, df['data_binary'].cumsum(), 0), but that returns
array([1, 0, 2, 3, 4, 0, 0, 5])
Which is not what I want.
You want to take the cumulative sum of data_binary and subtract the most recent cumulative sum where data_binary was zero.
b = df.data_binary
c = b.cumsum()
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
Output
0 1
1 0
2 1
3 2
4 3
5 0
6 0
7 1
Name: data_binary, dtype: int64
Explanation
Let's start by looking at each step side by side
cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
b, c,
c.mask(b != 0),
c.mask(b != 0).ffill(),
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
], axis=1, keys=cols))
Output
data_binary cumulative_sum nan_non_zero forward_fill final_result
0 1 1 NaN NaN 1
1 0 1 1.0 1.0 0
2 1 2 NaN 1.0 1
3 1 3 NaN 1.0 2
4 1 4 NaN 1.0 3
5 0 4 4.0 4.0 0
6 0 4 4.0 4.0 0
7 1 5 NaN 4.0 1
The problem with cumulative_sum is that the rows where data_binary is zero do not reset the sum, and that is the motivation for this solution. How do we "reset" the sum when data_binary is zero? Easy! I slice the cumulative sum where data_binary is zero and forward fill the values. When I take the difference between this and the cumulative sum, I've effectively reset the sum.
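The same trick can be wrapped in a small helper so it reads as one step (a sketch; the function name is mine):
def cumsum_reset_on_zero(s):
    # cumulative sum that restarts every time the series hits 0
    c = s.cumsum()
    return c.sub(c.mask(s != 0).ffill(), fill_value=0).astype(int)

df['sum_data'] = cumsum_reset_on_zero(df['data_binary'])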
I think you can use groupby with GroupBy.cumsum: first compare the column with its shifted values for inequality (ne) and take the cumulative sum of that comparison to create group ids for consecutive runs. Last, set the result to 0 where data_binary is 0, using mask:
print (df.data_binary.ne(df.data_binary.shift()).cumsum())
0 1
1 2
2 3
3 3
4 3
5 4
6 4
7 5
Name: data_binary, dtype: int32
df['sum_data1'] = df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum()).cumsum()
df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
print (df)
data data_binary sum_data sum_data1
0 2 1 1 1
1 5 0 0 0
2 1 1 1 1
3 4 1 2 2
4 3 1 3 3
5 10 0 0 0
6 7 0 0 0
7 3 1 1 1
If you want piRSquared's excellent answer as one single command:
df['sum_data'] = df[['data_binary']].apply(
lambda x: x.cumsum().sub(x.cumsum().mask(x != 0).ffill(), fill_value=0).astype(int),
axis=0)
Note that the double square brackets on the right-hand side are necessary to make a one-column DataFrame instead of a Series, in order to use apply with the axis argument (which is not available when apply is used on a Series).
