pd.Series(pred).value_counts(): how to get the first column into a dataframe? - python-3.x

I apply pd.Series(pred).value_counts() and get this output:
0 2084
-1 15
1 13
3 10
4 7
6 4
11 3
8 3
2 3
9 2
7 2
5 2
10 2
dtype: int64
When I convert it to a list, I get only the second column (the counts):
c_list = list(pd.Series(pred).value_counts())
Out:
[2084, 15, 13, 10, 7, 4, 3, 3, 3, 2, 2, 2, 2]
How do I ultimately get a dataframe that looks like this, including a new column giving each size as a % of the total size?
df=
[class , size ,relative_size]
0 2084 , x%
-1 15 , y%
1 13 , etc.
3 10
4 7
6 4
11 3
8 3
2 3
9 2
7 2
5 2
10 2

You are very nearly there. Typing this blind, as you didn't provide a sample input:
df = pd.Series(pred).value_counts().to_frame().reset_index()
df.columns = ['class', 'size']
df['relative_size'] = df['size'] / df['size'].sum()
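To make this concrete, here is a self-contained sketch with a made-up `pred` (the question provides none); multiply by 100, as below, if you want literal percentages rather than fractions:

```python
import pandas as pd

# Hypothetical stand-in for `pred`; any iterable of class labels works
pred = [0, 0, 0, 0, -1, -1, 1]

df = pd.Series(pred).value_counts().to_frame().reset_index()
df.columns = ['class', 'size']
df['relative_size'] = df['size'] / df['size'].sum() * 100  # size as % of total
print(df)
```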


New DataFrame column that contains IDs where value is outside bounds?

I have the following DataFrame:
data: dict[str, list[int]] = {
    "x1": [5, 6, 7, 8, 9],
    "min1": [3, 3, 3, 3, 3],
    "max1": [8, 8, 8, 8, 8],
    "x2": [0, 1, 2, 3, 4],
    "min2": [2, 2, 2, 2, 2],
    "max2": [7, 7, 7, 7, 7],
    "x3": [7, 6, 7, 6, 7],
    "min3": [1, 1, 1, 1, 1],
    "max3": [6, 6, 6, 6, 6],
}
n: int = 3  # number of xi
df: pd.DataFrame = pd.DataFrame(data=data)
print(df)
Output
x1 min1 max1 x2 min2 max2 x3 min3 max3
0 5 3 8 0 2 7 7 1 6
1 6 3 8 1 2 7 6 1 6
2 7 3 8 2 2 7 7 1 6
3 8 3 8 3 2 7 6 1 6
4 9 3 8 4 2 7 7 1 6
I would like to add a new column alert to df that contains the IDs i where xi < mini or xi > maxi.
Expected result
x1 min1 max1 x2 min2 max2 x3 min3 max3 alert
0 5 3 8 0 2 7 7 1 6 "2,3"
1 6 3 8 1 2 7 6 1 6 "2"
2 7 3 8 2 2 7 7 1 6 "3"
3 8 3 8 3 2 7 6 1 6 ""
4 9 3 8 4 2 7 7 1 6 "1,3"
I looked at this answer but could not understand how to apply it to my problem.
Below is my working implementation that I wish to improve.
def f(row: pd.Series) -> str:
    alert: str = ""
    for k in range(1, n + 1):
        if row[f"x{k}"] < row[f"min{k}"] or row[f"x{k}"] > row[f"max{k}"]:
            alert += f"{k}"
    return ",".join(list(alert))
df["alert"] = df.apply(f, axis=1)
Actually, given that your output is made of strings, your approach isn't too bad. I would just suggest making alert a list, not a string (this also keeps working if i ever has more than one digit):
def f(row: pd.Series) -> str:
    alert: list = []
    for k in range(1, n + 1):
        if row[f"x{k}"] < row[f"min{k}"] or row[f"x{k}"] > row[f"max{k}"]:
            alert.append(f"{k}")
    return ",".join(alert)
In a slightly fancier way, you can do:
xs = df.filter(regex='^x')
mins = df.filter(like='min').to_numpy()
maxes = df.filter(like='max').to_numpy()
mask = (xs < mins) | (xs > maxes)
df['alert'] = (mask @ xs.columns.str.replace('x', ',')).str.replace('^,', '', regex=True)
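The trick above relies on the fact that matrix-multiplying a boolean mask by string labels concatenates the labels of the True cells. A self-contained sketch on the question's data, using the equivalent `.dot` form with trailing commas (the variable names here are my own):

```python
import pandas as pd

df = pd.DataFrame({
    "x1": [5, 6, 7, 8, 9], "min1": [3] * 5, "max1": [8] * 5,
    "x2": [0, 1, 2, 3, 4], "min2": [2] * 5, "max2": [7] * 5,
    "x3": [7, 6, 7, 6, 7], "min3": [1] * 5, "max3": [6] * 5,
})

xs = df.filter(regex='^x')
mins = df.filter(like='min').to_numpy()
maxes = df.filter(like='max').to_numpy()
mask = (xs < mins) | (xs > maxes)  # boolean frame, one column per xi

# True cells contribute their '<i>,' label, False cells contribute '';
# DataFrame.dot concatenates them, and the trailing comma is stripped
labels = pd.Series([c.lstrip('x') + ',' for c in xs.columns], index=xs.columns)
df['alert'] = mask.dot(labels).str.rstrip(',')
print(df['alert'].tolist())
```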
We can also group the dataframe along columns, keyed by the integer suffix each column name contains:
df['alert'] = (df.groupby(df.columns.str.extract(r'(\d+)$')[0].tolist(), axis=1)
               .apply(lambda g: g[f'x{g.name}'].lt(g[f'min{g.name}']) | g[f'x{g.name}'].gt(g[f'max{g.name}']))
               .apply(lambda row: ','.join(row.index[row]), axis=1))
print(df)
x1 min1 max1 x2 min2 max2 x3 min3 max3 alert
0 5 3 8 0 2 7 7 1 6 2,3
1 6 3 8 1 2 7 6 1 6 2
2 7 3 8 2 2 7 7 1 6 3
3 8 3 8 3 2 7 6 1 6
4 9 3 8 4 2 7 7 1 6 1,3
Intermediate result
(df.groupby(df.columns.str.extract(r'(\d+)$')[0].tolist(), axis=1)
 .apply(lambda g: g[f'x{g.name}'].lt(g[f'min{g.name}']) | g[f'x{g.name}'].gt(g[f'max{g.name}'])))
1 2 3
0 False True True
1 False True False
2 False False True
3 False False False
4 True False True
Using pandas:
a = (pd.wide_to_long(df.reset_index(), ['x', 'min', 'max'], 'index', 'alert')
     .loc[lambda x: x['x'].lt(x['min']) | x['x'].gt(x['max'])]
     .reset_index()
     .groupby('index')['alert'].agg(lambda x: ','.join(x.astype(str))))
df.join(a)
x1 min1 max1 x2 min2 max2 x3 min3 max3 alert
0 5 3 8 0 2 7 7 1 6 2,3
1 6 3 8 1 2 7 6 1 6 2
2 7 3 8 2 2 7 7 1 6 3
3 8 3 8 3 2 7 6 1 6 NaN
4 9 3 8 4 2 7 7 1 6 1,3

How to change rows to columns in python

I want to convert my dataframe's rows to columns, taking only the last value of the last column.
Here is my dataframe:
df=pd.DataFrame({'flag_1':[1,2,3,1,2,500],'dd':[1,1,1,7,7,8],'x':[1,1,1,7,7,8]})
print(df)
flag_1 dd x
0 1 1 1
1 2 1 1
2 3 1 1
3 1 7 7
4 2 7 7
5 500 8 8
df_out:
1 2 3 1 2 500 1 1 1 7 7 8 8
Assuming you want a list as output, you can mask the initial values of the last column and stack:
import numpy as np
out = (df
       .assign(**{df.columns[-1]: np.r_[[pd.NA] * (len(df) - 1), [df.iloc[-1, -1]]]})
       .T.stack().to_list()
       )
Output:
[1, 2, 3, 1, 2, 500, 1, 1, 1, 7, 7, 8, 8]
For a wide dataframe with a single row, use .to_frame().T in place of to_list() (here with a MultiIndex):
flag_1 dd x
0 1 2 3 4 5 0 1 2 3 4 5 5
0 1 2 3 1 2 500 1 1 1 7 7 8 8
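Since the target is simply every column except the last in full, followed by the final value of the last column, a plain comprehension gives the same list (a sketch on the question's df):

```python
import pandas as pd

df = pd.DataFrame({'flag_1': [1, 2, 3, 1, 2, 500],
                   'dd': [1, 1, 1, 7, 7, 8],
                   'x': [1, 1, 1, 7, 7, 8]})

# All values of every column but the last, then the last column's final value
out = [v for col in df.columns[:-1] for v in df[col]] + [df.iloc[-1, -1]]
```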

Pandas grouping with loops

Is there a way to group a dataframe (a csv file) in these ways?
For example, I want to select blocks of ten rows of the first column to average, and then do the same for the second column, but grouping every 10th row instead of consecutive blocks.
For ex. I want the average of:
1 1 3rd 4th
1 2 .. ..
1 3 .. ..
..
1 9 .. ..
1 10 .. ..
2 1 .. ..
2 2 .. ..
2 3 .. ..
So selecting the first chunk of the 1st column to calculate an average, and then every x rows for the second column.
For example, from a df like this one...
241888 1 1
241888 2 1
241888 3 2
241888 4 2
241888 5 3
241888 6 3
241888 7 4
241888 8 4
241888 9 5
241888 10 5
665309 1 3
665309 2 3
665309 3 4
665309 4 4
665309 5 5
665309 6 5
665309 7 6
665309 8 6
665309 9 7
665309 10 7
and then
df.groupby('241888').mean()[3]
df.groupby('665309').mean()[3]
df.groupby('1' of the 2nd column).mean()[3]
df.groupby('10' of the 2nd column).mean()[3]
giving 3, 5, 2 and 6.
Sorry if I did not understand you properly. Do you want this?
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [11, 12, 13, 14, 15, 16, 17],
                   'c': [21, 22, 23, 24, 25, 26, 27]})
print(df)
print("sum of 1st column over every 2nd row (i.e. 2, 4, 6):")
print(df.iloc[[x for x in df.index if (x + 1) % 2 == 0], 0].sum())
print("sum of 3rd column over every 3rd row (i.e. 23, 26):")
print(df.iloc[[x for x in df.index if (x + 1) % 3 == 0], 2].sum())
output:
a b c
0 1 11 21
1 2 12 22
2 3 13 23
3 4 14 24
4 5 15 25
5 6 16 26
6 7 17 27
sum of 1st column over every 2nd row (i.e. 2, 4, 6):
12
sum of 3rd column over every 3rd row (i.e. 23, 26):
49
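For the block and stride averaging described in the question itself, grouping on a computed key may be closer to the goal. A hedged sketch (the toy columns below stand in for the real CSV, whose layout isn't shown):

```python
import pandas as pd

df = pd.DataFrame({'a': range(20), 'b': range(20)})

# Average blocks of 10 consecutive rows of column 'a'
block_means = df['a'].groupby(df.index // 10).mean()

# Average every 10th row of column 'b': rows 0,10 together, rows 1,11 together, ...
stride_means = df['b'].groupby(df.index % 10).mean()
```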

Need to create incremental series number using python

I need to create an incremental series for a given column of a dataframe in Python. Any help is much appreciated.
Suppose I have this dataframe column:
df['quadrant']
Out[6]:
0 4
1 4
2 4
3 3
4 3
5 3
6 2
7 2
8 2
9 1
10 1
11 1
I want to create a new column such that
index quadrant new value
0 4 1
1 4 5
2 4 9
3 3 2
4 3 6
5 3 10
6 2 3
7 2 7
8 2 11
9 1 4
10 1 8
11 1 12
Using NumPy, you can create the array as:
import numpy as np

def value(q, k=1):
    diff_quadrant = np.diff(q)
    j = 0
    ramp = []
    for i in np.where(diff_quadrant != 0)[0]:
        ramp.extend(range(i - j + 1))
        j = i + 1
    ramp.extend(range(len(q) - j))
    ramp = np.array(ramp) * k  # sawtooth-shaped array
    a = np.ones(len(q), dtype=int) * 5
    return a - q + ramp

quadrant = np.array([3, 3, 3, 3, 4, 4, 4, 2, 2, 1, 1, 1])
b = value(quadrant, 4)
# [ 2  6 10 14  1  5  9  3  7  4  8 12]
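If the pattern is exactly the one in the question (quadrants 4 down to 1, each repeat within a quadrant adding 4), a shorter pandas sketch is possible; the `5 - quadrant` base value is an assumption read off the expected output:

```python
import pandas as pd

df = pd.DataFrame({'quadrant': [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]})

# Base value 5 - quadrant maps 4->1, 3->2, 2->3, 1->4;
# each further occurrence within the same quadrant adds 4
df['new_value'] = (5 - df['quadrant']) + df.groupby('quadrant').cumcount() * 4
```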

Expanding/Duplicating dataframe rows based on condition

I am an R user who has recently started using Python 3 for data management. I am struggling with a way to expand/duplicate data frame rows based on a condition. I also need to be able to expand rows in a variable way. I'll illustrate with this example.
I have this data:
df = pd.DataFrame([[1, 10], [1,15], [2,10], [2, 15], [2, 20], [3, 10], [3, 15]], columns = ['id', 'var'])
df
Out[6]:
id var
0 1 10
1 1 15
2 2 10
3 2 15
4 2 20
5 3 10
6 3 15
I would like to expand rows for both ID == 1 and ID == 3. I would also like to expand each ID == 1 row by 1 duplicate each, and I would like to expand each ID == 3 row by 2 duplicates each. The result would look like this:
df2
Out[8]:
id var
0 1 10
1 1 10
2 1 15
3 1 15
4 2 10
5 2 15
6 2 20
7 3 10
8 3 10
9 3 10
10 3 15
11 3 15
12 3 15
13 3 15
I've been trying to use np.repeat, but I am failing to think of a way that I can use both ID condition and variable duplication numbers at the same time. Index ordering does not matter here, only that the rows are duplicated appropriately. I apologize in advance if this is an easy question. Thanks in advance for any help and feel free to ask clarifying questions.
This should do it (pd.concat is used here because DataFrame.append was removed in pandas 2.0):
dup = {1: 1, 3: 2}  # which id, and how many extra copies to add
res = df.copy()
for k, v in dup.items():
    for i in range(v):
        res = pd.concat([res, df.loc[df['id'] == k]], ignore_index=True)
res.sort_values(['id', 'var'], inplace=True)
res.reset_index(inplace=True, drop=True)
res
# id var
#0 1 10
#1 1 10
#2 1 15
#3 1 15
#4 2 10
#5 2 15
#6 2 20
#7 3 10
#8 3 10
#9 3 10
#10 3 15
#11 3 15
#12 3 15
P.S. Your desired output has 7 values for id 3, while your description implies 6 values.
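A loop-free alternative is `Index.repeat` with a per-row repeat count (1 plus the number of extra duplicates); this follows the written description, i.e. six rows for id 3:

```python
import pandas as pd

df = pd.DataFrame([[1, 10], [1, 15], [2, 10], [2, 15], [2, 20], [3, 10], [3, 15]],
                  columns=['id', 'var'])

extra = {1: 1, 3: 2}  # id -> number of extra copies per row
counts = df['id'].map(extra).fillna(0).astype(int) + 1
res = df.loc[df.index.repeat(counts)].reset_index(drop=True)
```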
I think the code below gets your job done (pd.concat replaces the removed DataFrame.append):
df_1 = df.loc[df.id == 1]
df_3 = df.loc[df.id == 3]
df1 = pd.concat([df] + [df_1] * 1, ignore_index=True)
df1 = pd.concat([df1] + [df_3] * 2, ignore_index=True).sort_values(by='id')
id var
0 1 10
1 1 15
7 1 10
8 1 15
2 2 10
3 2 15
4 2 20
5 3 10
6 3 15
9 3 10
10 3 15
11 3 10
12 3 15
