I have the following data in a column of my data frame. How can I replace each domain name with a number? I tried using replace in a for loop, but I have more than 1,200 unique domain names, so this does not seem like an ideal way to do it:
for i, v in np.ndenumerate(np.unique(df['domain'])):
    df['domain'] = df['domain'].replace(to_replace=[v], value=i[0]+1, inplace=True)
but it does not work
data frame:
type domain
0 1 yahoo.com
1 1 google.com
2 0 google.com
3 0 aa.com
4 0 google.com
5 0 aa.com
6 1 abc.com
7 1 msn.com
8 1 abc.com
9 1 abc.com
....
I want to convert it to:
type domain
0 1 1
1 1 2
2 0 2
3 0 3
4 0 2
5 0 3
6 1 4
7 1 5
8 1 4
9 1 4
....
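The loop fails because replace(..., inplace=True) returns None, and that None is then assigned back to the column. Let's use pd.factorize instead: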
df.assign(domain=pd.factorize(df.domain)[0]+1)
Output:
type domain
0 1 1
1 1 2
2 0 2
3 0 3
4 0 2
5 0 3
6 1 4
7 1 5
8 1 4
9 1 4
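As a rough, self-contained sketch of the pd.factorize approach above: the frame is rebuilt from the sample rows in the question, and the names codes, uniques, encoded and domain_by_id are illustrative only. pd.factorize returns zero-based codes in order of first appearance together with the unique values, so the original names can still be looked up afterwards.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "type": [1, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    "domain": ["yahoo.com", "google.com", "google.com", "aa.com", "google.com",
               "aa.com", "abc.com", "msn.com", "abc.com", "abc.com"],
})

# codes are 0-based and follow order of first appearance; uniques holds the labels.
codes, uniques = pd.factorize(df["domain"])

# Shift by one so numbering starts at 1, as in the desired output.
encoded = df.assign(domain=codes + 1)

# Illustrative reverse lookup from number back to domain name.
domain_by_id = dict(enumerate(uniques, start=1))
```

Keeping uniques around is optional, but it makes the encoding reversible if the domain names are needed later.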
If it does not really matter which number each domain is assigned, you can try this:
import pandas as pd
df.domain.astype('category').cat.codes
Out[154]:
0 4
1 2
2 2
3 0
4 2
5 0
6 1
7 3
8 1
9 1
dtype: int8
If it does matter, you can try:
maplist=df[['domain']].drop_duplicates(keep='first').reset_index(drop=True).reset_index().set_index('domain')
maplist['index']=maplist['index']+1
df.domain=df.domain.map(maplist['index'])
Out[177]:
type domain
0 1 1
1 1 2
2 0 2
3 0 3
4 0 2
5 0 3
6 1 4
7 1 5
8 1 4
9 1 4
I'm trying to come up with a LAMBDA formula that captures the following recursive calculation:
Column A has 40 rows with integers between 1 and 40. Column B divides each integer in column A by 6 and rounds it up. Column C divides each integer in column B by 6 and rounds it up. This continues until the integer is 1 or less, and then I want the sum of the full row for a given integer. So, for example, for the number 25 in column A, I get 6 (5 from column B and 1 from column C). For the number 40 in column A, I get 10 (7 from column B, 2 from column C, 1 from column D).
Is it possible to come up with a LAMBDA function that would get me the correct output for a given number in column A? I don't want to use VBA - just want to use the LAMBDA function for this.
Data  Column 1  Column 2  Column 3  Column 4  Sum
1     0         0         0         0         1
2     1         0         0         0         1
3     1         0         0         0         1
4     1         0         0         0         1
5     1         0         0         0         1
6     1         0         0         0         1
7     2         1         0         0         3
8     2         1         0         0         3
9     2         1         0         0         3
10    2         1         0         0         3
11    2         1         0         0         3
12    2         1         0         0         3
13    3         1         0         0         4
14    3         1         0         0         4
15    3         1         0         0         4
16    3         1         0         0         4
17    3         1         0         0         4
18    3         1         0         0         4
19    4         1         0         0         5
20    4         1         0         0         5
21    4         1         0         0         5
22    4         1         0         0         5
23    4         1         0         0         5
24    4         1         0         0         5
25    5         1         0         0         6
26    5         1         0         0         6
27    5         1         0         0         6
28    5         1         0         0         6
29    5         1         0         0         6
30    5         1         0         0         6
31    6         1         0         0         7
32    6         1         0         0         7
33    6         1         0         0         7
34    6         1         0         0         7
35    6         1         0         0         7
36    6         1         0         0         7
37    7         2         1         0         10
Use BYROW and SCAN:
=BYROW(A1:A40,LAMBDA(c,SUM(SCAN(c,SEQUENCE(,4,6,0),LAMBDA(a,b,IF(a=1,0,ROUNDUP(a/b,0)))))))
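To check the formula outside Excel, here is a rough Python sketch of what the SCAN step appears to compute for a single input. It assumes the same fixed four steps and divisor 6 as SEQUENCE(,4,6,0); the function name ceil_chain_sum is hypothetical.

```python
def ceil_chain_sum(n, divisor=6, steps=4):
    """Sum the chain of rounded-up divisions, mirroring the SCAN/LAMBDA formula."""
    total = 0
    value = n
    for _ in range(steps):
        # Mirrors IF(a=1, 0, ROUNDUP(a/b, 0)); -(-x // d) is integer ceiling division.
        value = 0 if value == 1 else -(-value // divisor)
        total += value
    return total


results = {n: ceil_chain_sum(n) for n in (7, 25, 37, 40)}
```

With four steps the chain reaches 1 for any input up to 6^4 = 1296, which covers the 1 to 40 range in the question.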
I would like to subtract a value (for example, 2) from specific columns of a data frame.
csv1=
X Y Subdie 1v 2v 5v 10v
0 1 0 4 2 4 2 2
1 2 0 2 3 4 4 6
2 3 0 3 5 4 6 8
3 4 0 4 2 5 4 4
4 5 0 4 2 5 8 4
I want to subtract 2 from the 1v and 2v columns. I tried this code:
Cv=(csv1.loc[:,' 1v':' 5v'])-2
I got output like this:
1v 2v 5v
0 0 2 0
1 1 2 2
2 3 2 4
3 0 3 2
4 0 3 6
Expected output (including the other columns):
x y 1v 2v 5v 10v
0 1 0 0 2 0 2
1 2 0 1 2 2 6
2 3 0 3 2 4 8
3 4 0 0 3 2 4
4 5 0 0 3 6 4
Don't create a copy, perform an in place modification:
csv1.loc[:, ' 1v':' 5v'] -= 2
Modified csv1:
X Y Subdie 1v 2v 5v 10v
0 1 0 4 0 2 0 2
1 2 0 2 1 2 2 6
2 3 0 3 3 2 4 8
3 4 0 4 0 3 2 4
4 5 0 4 0 3 6 4
NB. I kept your slice as in the question, but you should avoid leading spaces in column names. Also, ' 1v':' 5v' selects 1v, 2v, and 5v (the end label is inclusive).
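A small runnable sketch of the same in-place update, with the sample frame rebuilt from the question. Stripping the column names first is my own addition to illustrate the NB above, not something the question does.

```python
import pandas as pd

# Sample data rebuilt from the question, including the leading spaces in some names.
csv1 = pd.DataFrame({
    "X": [1, 2, 3, 4, 5],
    "Y": [0, 0, 0, 0, 0],
    "Subdie": [4, 2, 3, 4, 4],
    " 1v": [2, 3, 5, 2, 2],
    " 2v": [4, 4, 4, 5, 5],
    " 5v": [2, 4, 6, 4, 8],
    " 10v": [2, 6, 8, 4, 4],
})

# Optional cleanup: remove leading/trailing spaces from the column names.
csv1.columns = csv1.columns.str.strip()

# Label slices with .loc include both endpoints, so this touches 1v, 2v and 5v.
csv1.loc[:, "1v":"5v"] -= 2
```

To change only 1v and 2v, as the question's text says, the slice would end at "2v" instead.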
I have a dataframe like the one below.
df = pd.DataFrame({'group':[1,2,1,3,3,1,4,4,1,4], 'match': [1,1,1,1,1,1,1,1,1,1]})
group match
0 1 1
1 2 1
2 1 1
3 3 1
4 3 1
5 1 1
6 4 1
7 4 1
8 1 1
9 4 1
I want to get the top n groups, as shown below (n=3).
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
In my actual data, each row carries additional information that I need to keep, so I only want to sort by the number of matches and extract the top n groups.
How can I do this?
If you need the top 3 groups per value of the match column, I believe you need SeriesGroupBy.value_counts with GroupBy.head to get the top 3 per group, then convert the index to a DataFrame with Index.to_frame and join it back with DataFrame.merge:
s = df.groupby('match')['group'].value_counts().groupby(level=0).head(3).swaplevel()
df = s.index.to_frame().reset_index(drop=True).merge(df)
print (df)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
Or, if you only need rows where match is 1, use Series.value_counts after filtering with boolean indexing:
s = df.loc[df['match'] == 1, 'group'].value_counts().head(3)
# index=False keeps 'group' only as a column, so merge has no index/column ambiguity
df = s.index.to_frame(index=False, name='group').merge(df)
print (df)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
Solution with isin and ordered categoricals:
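# if you need to filter on match == 1
idx = df.loc[df['match'] == 1, 'group'].value_counts().head(3).index
# if you don't need the filter
#idx = df.group.value_counts().head(3).index
# copy so the column assignment below does not trigger SettingWithCopyWarning
df = df[df.group.isin(idx)].copy()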
df['group'] = pd.CategoricalIndex(df['group'], ordered=True, categories=idx)
df = df.sort_values('group')
print (df)
group match
0 1 1
2 1 1
5 1 1
8 1 1
6 4 1
7 4 1
9 4 1
3 3 1
4 3 1
The difference between the solutions is easiest to see with different data in the match column:
df = pd.DataFrame({'group':[1,2,1,3,3,1,4,4,1,4,10,20,10,20,10,30,40],
'match': [1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]})
print (df)
group match
0 1 1
1 2 1
2 1 1
3 3 1
4 3 1
5 1 1
6 4 1
7 4 1
8 1 1
9 4 1
10 10 0
11 20 0
12 10 0
13 20 0
14 10 0
15 30 0
16 40 0
Top 3 values per group of match:
s = df.groupby('match')['group'].value_counts().groupby(level=0).head(3).swaplevel()
df1 = s.index.to_frame().reset_index(drop=True).merge(df)
print (df1)
group match
0 10 0
1 10 0
2 10 0
3 20 0
4 20 0
5 30 0
6 1 1
7 1 1
8 1 1
9 1 1
10 4 1
11 4 1
12 4 1
13 3 1
14 3 1
Top 3 values where match == 1:
s = df.loc[df['match'] == 1, 'group'].value_counts().head(3)
df2 = s.index.to_frame(index=False, name='group').merge(df)
print (df2)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
Top 3 values, ignoring the match column:
s = df['group'].value_counts().head(3)
df3 = s.index.to_frame(index=False, name='group').merge(df)
print (df3)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 10 0
5 10 0
6 10 0
7 4 1
8 4 1
9 4 1
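As a further sketch along the lines of the isin solution above, the filter-and-order step can be wrapped in a small helper that keeps every other column and the original row labels. The function name top_n_groups and its parameters are hypothetical, and it ignores the match column (the "not important" case).

```python
import pandas as pd

# Original sample data from the question.
df = pd.DataFrame({"group": [1, 2, 1, 3, 3, 1, 4, 4, 1, 4],
                   "match": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})


def top_n_groups(frame, n, column="group"):
    """Keep rows of the n most frequent values in `column`, most frequent first."""
    top = frame[column].value_counts().head(n).index
    # Position of each kept group in the frequency ranking (0 = most frequent).
    rank = {value: position for position, value in enumerate(top)}
    kept = frame[frame[column].isin(top)]
    # A stable sort preserves the original row order within each group.
    order = kept[column].map(rank).sort_values(kind="stable").index
    return kept.loc[order]


result = top_n_groups(df, 3)
```

Because whole rows are selected rather than rebuilt through a merge, any extra columns in the real data come along unchanged. This assumes the row index is unique.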
I have a table like this. I want to draw a histogram of how often 0, 1, 2 and 3 occur across the whole table. Is there a way to do it?
You can use melt and then hist (or value_counts to see the raw counts).
For example:
df
A B C D
0 3 1 1 1
1 3 3 2 2
2 1 0 1 1
3 3 2 3 0
4 3 1 1 3
5 3 0 3 1
6 3 1 1 0
7 1 3 3 0
8 3 1 3 3
9 3 3 1 3
df.melt()['value'].value_counts()
3 18
1 14
0 5
2 3
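A self-contained sketch of the melt step, using the example frame shown above. The plotting call is left as a comment because it needs matplotlib installed; treat it as one possible way to draw the chart rather than the only one.

```python
import pandas as pd

# Example frame copied from the answer above.
df = pd.DataFrame({
    "A": [3, 3, 1, 3, 3, 3, 3, 1, 3, 3],
    "B": [1, 3, 0, 2, 1, 0, 1, 3, 1, 3],
    "C": [1, 2, 1, 3, 1, 3, 1, 3, 3, 1],
    "D": [1, 2, 1, 0, 3, 1, 0, 0, 3, 3],
})

# melt stacks all columns into a single 'value' column, so counts cover the whole table.
counts = df.melt()["value"].value_counts().sort_index()

# With matplotlib available, a bar chart of the counts could be drawn with:
# counts.plot(kind="bar")
```

Sorting by index puts the bars in the order 0, 1, 2, 3 instead of by frequency.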
I have a dataframe with a multiindex (ID, Date, LID) and columns from 0 to N that looks something like this:
0 1 2 3 4
ID Date LID
00112 11-02-2014 I 0 1 5 6 7
00112 11-02-2014 II 2 4 5 3 4
00112 30-07-2015 I 5 7 1 1 2
00112 30-07-2015 II 3 2 8 7 1
I would like to group the dataframe by ID and Date and concatenate the columns to the same row such that it looks like this:
0 1 2 3 4 5 6 7 8 9
ID Date
00112 11-02-2014 0 1 5 6 7 2 4 5 3 4
00112 30-07-2015 5 7 1 1 2 3 2 8 7 1
Using pd.concat and pd.DataFrame.xs
pd.concat(
[df.xs(x, level=2) for x in df.index.levels[2]],
axis=1, ignore_index=True
)
0 1 2 3 4 5 6 7 8 9
ID Date
112 11-02-2014 0 1 5 6 7 2 4 5 3 4
30-07-2015 5 7 1 1 2 3 2 8 7 1
Use unstack + sort_index:
df = df.unstack().sort_index(axis=1, level=1)
# for new column names
df.columns = np.arange(len(df.columns))
print (df)
0 1 2 3 4 5 6 7 8 9
ID Date
112 11-02-2014 0 1 5 6 7 2 4 5 3 4
30-07-2015 5 7 1 1 2 3 2 8 7 1
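For anyone who wants to run the unstack approach end to end, here is a sketch with the sample frame rebuilt from the question. The ID is kept as the string "00112", which is an assumption; the printed outputs above show it as 112.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
index = pd.MultiIndex.from_tuples(
    [("00112", "11-02-2014", "I"), ("00112", "11-02-2014", "II"),
     ("00112", "30-07-2015", "I"), ("00112", "30-07-2015", "II")],
    names=["ID", "Date", "LID"],
)
df = pd.DataFrame(
    [[0, 1, 5, 6, 7], [2, 4, 5, 3, 4], [5, 7, 1, 1, 2], [3, 2, 8, 7, 1]],
    index=index,
)

# Move LID into the columns, then order the columns by LID so that all
# "I" values come before all "II" values within each row.
wide = df.unstack("LID").sort_index(axis=1, level="LID")

# Replace the two-level column labels with a flat 0..N-1 range.
wide.columns = np.arange(wide.shape[1])
```

If some (ID, Date) pairs lack one of the LID values, unstack fills the gaps with NaN, so real data may need a fillna or dropna afterwards.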