pivot all the columns of a dataframe - python-3.x

I have this dataframe:
import pandas as pd

df = pd.DataFrame({'Jan': ['US', 'GB', 'NL', 'CH', 'GB', 'US'],
                   'Feb': ['US', 'AU', 'RU', 'NO', 'AU', 0],
                   'Mar': ['PL', 'AU', 'FI', 'US', 'CH', 'CH']})
I would like to create a stacked bar chart showing the count of countries per month, so I first need to transform this dataframe into this form:
Jan Feb Mar
US 2 1 1
GB 2 0 0
NL 1 0 0
CH 1 0 2
AU 0 2 1
RU 0 1 0
NO 0 1 0
PL 0 0 1
FI 0 0 1
0 0 1 0
My actual dataframe is large, and I want to display only the 10 most common countries for each month on the stacked bar plot. I noticed that pandas pivot isn't doing the job.

You could stack the frame and build a crosstab of values against months:
In [46]: s = df.stack().reset_index()
In [47]: pd.crosstab(s[0], s['level_1']).rename_axis(None, axis=0).rename_axis(None, axis=1)
Out[47]:
Feb Jan Mar
0 1 0 0
AU 2 0 1
CH 0 1 2
FI 0 0 1
GB 0 2 0
NL 0 1 0
NO 1 0 0
PL 0 0 1
RU 1 0 0
US 1 2 1
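An equivalent alternative that keeps the original month order is to count values per column directly (value_counts leaves NaN where a value never occurs, hence the fillna):
In [48]: df.apply(pd.Series.value_counts).fillna(0).astype(int)

The question also asks for only the 10 most common countries on the stacked bar chart. One simple reading is the 10 countries with the highest overall counts; a minimal sketch, assuming the crosstab result above is stored in a variable named counts and matplotlib is available:

import matplotlib.pyplot as plt

# keep the 10 countries with the highest total count across all months
top10 = counts.sum(axis=1).nlargest(10).index
# transpose so months are on the x-axis, one stacked segment per country
counts.loc[top10].T.plot(kind='bar', stacked=True)
plt.show()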

Related

Vectorized way of using the previous row value based on the condition

I have a pandas dataframe as below, and I want to apply the following condition: if column 'A' is 1, update the value of column 'F' with the previous value of 'F'. This can be done by iterating row by row, but that is not efficient; I want a vectorized way of doing it.
df = pd.DataFrame({'A': [1, 1, 1, 0, 0, 0, 1, 0, 0],
                   'C': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'D': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'F': [2, 0, 0, 0, 0, 1, 1, 1, 1]})
df
A C D F
0 1 1 1 2
1 1 1 1 0
2 1 1 1 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
My desired output:
A C D F
0 1 1 1 2
1 1 1 1 2
2 1 1 1 2
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
I tried the code below, but it does not work, because shift does not see the updated value of the previous row.
df['F'] = df.groupby(['A'])['F'].shift(1)
df
A C D F
0 1 1 1 NaN
1 1 1 1 2.0
2 1 1 1 0.0
3 0 0 0 NaN
4 0 0 0 0.0
5 0 0 0 0.0
6 1 1 1 0.0
7 0 1 1 1.0
8 0 1 1 1.0
transform('first')
df.F.groupby(df.A.rsub(1).cumsum()).transform('first')
0 2
1 2
2 2
3 0
4 0
5 1
6 1
7 1
8 1
Name: F, dtype: int64
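To see why this works, inspect the grouping key: df.A.rsub(1) computes 1 - A, which is 0 where A is 1 and 1 where A is 0, so its cumulative sum stays constant across each run of A == 1 rows (and ties each run to the A == 0 row just before it):

df.A.rsub(1).cumsum()
0    0
1    0
2    0
3    1
4    2
5    3
6    3
7    4
8    5
Name: A, dtype: int64

Within each group, the first F is the value to propagate, which is exactly what transform('first') does; unlike the shift attempt above, it also preserves the integer dtype.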
Assign to column 'F'
df.assign(F=df.F.groupby(df.A.rsub(1).cumsum()).transform('first'))
A C D F
0 1 1 1 2
1 1 1 1 2
2 1 1 1 2
3 0 0 0 0
4 0 0 0 0
5 0 0 0 1
6 1 1 1 1
7 0 1 1 1
8 0 1 1 1
We can also do it without groupby:
# mark the first row of each run of consecutive A == 1 rows
where = df['A'].eq(1) & df['A'].ne(df['A'].shift())
# keep F only at run starts, forward-fill, then restore the original F where A != 1
df['F'] = df['F'].where(where).ffill().mask(df['A'].ne(1), df['F'])
print(df)
A C D F
0 1 1 1 2.0
1 1 1 1 2.0
2 1 1 1 2.0
3 0 0 0 0.0
4 0 0 0 0.0
5 0 0 0 1.0
6 1 1 1 1.0
7 0 1 1 1.0
8 0 1 1 1.0

How to switch runs of 1 (ON) flags longer than a specified threshold to 0 in a pandas dataframe?

A flag column in a pandas dataframe is populated with 1s and 0s, and the problem is to identify runs of consecutive 1s. Let t be the threshold number of days. Two types of transformation are required:
i) if there are more than t 1s together, turn the (t+1)th 1 onwards to 0;
ii) if there are more than t 1s together, turn all of the 1s to 0.
My approach would be to create two helper columns, Result1 and Result2, and filter using them.
I have not been able to come up with anything, so I am not posting any code. A nudge or hint in the right direction would be appreciated.
Use:
import numpy as np

# mark the 0 rows
m = df['Value'].eq(0)
# cumulative sum of the 0s gives a run id; keep only the 1 rows
g = m.cumsum()[~m]
# Result1: 0 on the 0 rows, otherwise a running counter within each run of 1s
df['Result1'] = np.where(m, 0, df.groupby(g).cumcount().add(1))
# Result2: 0 on the 0 rows, otherwise the run length (max of Result1 per group)
df['Result2'] = np.where(m, 0, df.groupby(g)['Result1'].transform('max')).astype(int)
print(df)
Value Result1 Result2
0 1 1 1
1 0 0 0
2 0 0 0
3 1 1 2
4 1 2 2
5 0 0 0
6 1 1 4
7 1 2 4
8 1 3 4
9 1 4 4
10 0 0 0
11 0 0 0
12 1 1 1
13 0 0 0
14 1 1 1
15 0 0 0
16 0 0 0
17 1 1 6
18 1 2 6
19 1 3 6
20 1 4 6
21 1 5 6
22 1 6 6
23 0 0 0
24 1 1 1
25 0 0 0
26 0 0 0
27 1 1 1
28 0 0 0
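With Result1 and Result2 in place, both requested transformations follow directly from a threshold t; a minimal sketch, assuming t = 3 and writing into two hypothetical output columns:

t = 3
# i) keep the first t 1s of each run, zero out the (t+1)th onwards
df['Transform1'] = np.where(df['Result1'] > t, 0, df['Value'])
# ii) zero out every run whose total length exceeds t
df['Transform2'] = np.where(df['Result2'] > t, 0, df['Value'])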

Python 3.x: Pandas DataFrame: how do we change the column names for a specific range?

I have a large csv file that I am reading with pandas. Below is a sample of what my data looks like; the column names are 0, 4, 6, 8, 10, 12, 14, 16, 18.
0 4 6 8 10 12 14 16 18
-2 4500 4500 4500 4500 4500 4500 4500 4500
-1 4650 4650 4650 4650 4650 4650 4650 4650
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0
If I use Data.columns, I can change the column names. However, I only want to change part of them: starting from the third column, I want to rename 6, 8, 10, 12, 14, 16, and 18 to bird, dog, strawberry, kiwi, tree, chocolate, and snow respectively.
0 4 bird dog strawberry kiwi tree chocolate snow
-2 4500 4500 4500 4500 4500 4500 4500 4500
-1 4650 4650 4650 4650 4650 4650 4650 4650
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0
How would you write this? Remember that I have a massive file and want to rename a large number of columns, so I need an efficient way of doing this.
Thanks!
Edit: to be clear, I want to change the column names starting from the third column.
Since you want to rename all the columns from the third one onwards, you can use zip to create a dict, then rename:
# sample data
df = pd.DataFrame(np.random.randn(5,9), columns=[0,4,6,8,10,12,14,16,18])
# create a dict using zip from df.columns[2:]
d = dict(zip(df.columns[2:].values, ['bird','dog','strawberry','kiwi','tree','chocolate','snow']))
# rename your columns
df = df.rename(columns=d)
0 4 bird dog strawberry kiwi tree \
0 -0.121085 1.263364 -0.008604 -0.240872 1.433633 0.092023 -0.903776
1 0.570377 0.565611 -1.107842 1.498852 -0.655996 -1.215298 0.639862
2 0.367796 -1.357311 -0.106241 -0.824072 1.055168 0.862952 0.475000
3 0.945560 0.359249 -0.282965 0.230909 -2.278477 1.656094 -0.031756
4 -0.611121 -0.159064 -0.711482 2.342169 0.044782 -0.955120 1.481766
chocolate snow
0 0.607185 0.694980
1 -0.666239 0.208806
2 0.018151 -0.656670
3 -0.438527 0.678592
4 1.035624 0.537486
import pandas as pd
example_list = [
    {'name': 'a', 'age': 2, 'gender': 'm'},
    {'name': 'b', 'age': 5, 'gender': 'm'},
]
df = pd.DataFrame(example_list)
print(df)
df.rename(columns = {'name':'First Name'}, inplace = True)
print(df)
Output
age gender name
0 2 m a
1 5 m b
age gender First Name
0 2 m a
1 5 m b
EDIT:
import pandas as pd

example_list = [
    {0: 'a', 4: 2, 6: 'm'},
    {0: 'b', 4: 5, 6: 'm'},
]
df = pd.DataFrame(example_list)
print(df)
df.rename(columns={0: 'apple', 4: 'banana', 6: 'pear'}, inplace=True)
print(df)
OUTPUT:
0 4 6
0 a 2 m
1 b 5 m
apple banana pear
0 a 2 m
1 b 5 m

Sorting a dataframe and creating new columns based on the rank of each element

I have the following dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
        'name': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
        'Value': [1, 2, 3, 4, 5, 6, 0, 2, 4, 6, 3, 5]
    },
    columns=['name', 'id', 'Value'])
I can sort the data using id and value as shown below:
df.sort_values(['id','Value'],ascending = [True,False])
The printed table appears as follows:
name id Value
D 1 4
C 1 3
B 1 2
A 1 1
B 2 6
A 2 5
D 2 2
C 2 0
B 3 6
D 3 5
A 3 4
C 3 3
I would like to create four new columns (Rank1, Rank2, Rank3, Rank4): if a row holds the highest Value within its id, Rank1 should be 1, otherwise 0; if it holds the second highest Value, Rank2 should be 1, otherwise 0; and likewise for Rank3 and Rank4.
How could I do that?
Thanks.
Use:
df = df.join(pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
print (df)
name id Value Rank1 Rank2 Rank3 Rank4
3 D 1 4 1 0 0 0
2 C 1 3 0 1 0 0
1 B 1 2 0 0 1 0
0 A 1 1 0 0 0 1
5 B 2 6 1 0 0 0
4 A 2 5 0 1 0 0
7 D 2 2 0 0 1 0
6 C 2 0 0 0 0 1
9 B 3 6 1 0 0 0
11 D 3 5 0 1 0 0
8 A 3 4 0 0 1 0
10 C 3 3 0 0 0 1
Details:
For a counter per group, use GroupBy.cumcount, then add 1:
print (df.groupby('id').cumcount().add(1))
3 1
2 2
1 3
0 4
5 1
4 2
7 3
6 4
9 1
11 2
8 3
10 4
dtype: int64
For the indicator columns, use get_dummies with add_prefix:
print (pd.get_dummies(df.groupby('id').cumcount().add(1)).add_prefix('Rank'))
Rank1 Rank2 Rank3 Rank4
3 1 0 0 0
2 0 1 0 0
1 0 0 1 0
0 0 0 0 1
5 1 0 0 0
4 0 1 0 0
7 0 0 1 0
6 0 0 0 1
9 1 0 0 0
11 0 1 0 0
8 0 0 1 0
10 0 0 0 1
This does not require a prior sort: rank each Value within its id group in descending order, then build the dummies from the ranks:
df.join(
    pd.get_dummies(
        df.groupby('id').Value.rank(ascending=False).astype(int)
    ).add_prefix('Rank')
)
   name  id  Value  Rank1  Rank2  Rank3  Rank4
0     A   1      1      0      0      0      1
1     B   1      2      0      0      1      0
2     C   1      3      0      1      0      0
3     D   1      4      1      0      0      0
4     A   2      5      0      1      0      0
5     B   2      6      1      0      0      0
6     C   2      0      0      0      0      1
7     D   2      2      0      0      1      0
8     A   3      4      0      0      1      0
9     B   3      6      1      0      0      0
10    C   3      3      0      0      0      1
11    D   3      5      0      1      0      0
More robust when ties are possible: method='first' still assigns distinct ranks to equal Values, so the dummy columns stay one-hot:
df.join(
    pd.get_dummies(
        df.groupby('id').Value.rank(ascending=False, method='first').astype(int)
    ).add_prefix('Rank')
)
The output is the same as above for this sample, which contains no ties.

Create dummy variables for interdependent categories in pandas

I'm trying to set up a linear regression model in order to predict traffic counts based on the day and the time of day. Since both are categorical variables, I have to create dummy variables. The get_dummies function makes this very easy when doing it for each variable individually. However, in the case of predicting traffic volumes, the interdependence between the day and the time of day is important, so I need dummies for all days × all time intervals.
I made a small example, to avoid troubling you with a big dataset:
import pandas as pd
df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                   'Time': [11, 15, 9, 15, 17, 10, 20],
                   'Count': [100, 150, 150, 150, 180, 60, 50]})
df_dummies = pd.get_dummies(df.Day)
print(df_dummies)
Results in a nice dataframe with dummies:
Fri Mon Sat Sun Thu Tue Wed
0 0 1 0 0 0 0 0
1 0 0 0 0 0 1 0
2 0 0 0 0 0 0 1
3 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0
5 0 0 1 0 0 0 0
6 0 0 0 1 0 0 0
So what I'm after is something like this:
import pandas as pd
df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                   'Time': [11, 15, 9, 15, 17, 10, 20],
                   'Count': [100, 150, 150, 150, 180, 60, 50]})
df_dummies = pd.get_dummies(df.Day * df.Time)
print(df_dummies)
With a result like this:
Fri_9 Fri_15 Mon_9 Mon_15 Sat_9 Sat_15 Sun_9 ...
0 0 1 0 0 0 0 0 ...
1 0 0 0 0 0 1 0 ...
2 0 0 0 0 0 0 1 ...
3 0 0 0 0 1 0 0 ...
4 1 0 0 0 0 0 0 ...
5 0 0 1 0 0 0 0 ...
6 0 0 0 1 0 0 0 ...
7 0 0 0 0 0 0 0 ...
[...]
Is there any way in which this can be done elegantly?
I believe you need to join the columns together, casting Time to string:
df_dummies = pd.get_dummies(df.Day + '_' + df.Time.astype(str))
#df_dummies = pd.get_dummies(df.Day.str.cat(df.Time.astype(str), sep='_'))
print(df_dummies)
Fri_17 Mon_11 Sat_10 Sun_20 Thu_15 Tue_15 Wed_9
0 0 1 0 0 0 0 0
1 0 0 0 0 0 1 0
2 0 0 0 0 0 0 1
3 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0
5 0 0 1 0 0 0 0
6 0 0 0 1 0 0 0
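For the stated regression goal, these interaction dummies can be fed straight into a model; a minimal sketch using scikit-learn, assuming it is installed:

from sklearn.linear_model import LinearRegression

X = pd.get_dummies(df.Day + '_' + df.Time.astype(str))
model = LinearRegression().fit(X, df.Count)
print(model.predict(X))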
Regarding "I'm trying to set up a linear regression model in order to predict...":
Technically, you can make a dummy of the tuples:
>>> pd.get_dummies(df[['Day', 'Time']].apply(tuple, axis=1))
(Fri, 17) (Mon, 11) (Sat, 10) (Sun, 20) (Thu, 15) (Tue, 15) (Wed, 9)
0 0 1 0 0 0 0 0
1 0 0 0 0 0 1 0
2 0 0 0 0 0 0 1
3 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0
5 0 0 1 0 0 0 0
6 0 0 0 1 0 0 0
...
However, I think this approach is not the best at the ML level: it will probably fragment the data very much, making things hard for your regressor. You might consider a gradient-boosted decision tree instead, if you're after the interactions.
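If you do go the tree route, the raw categories can be fed in as integer codes instead of being expanded into dummies; a minimal sketch with scikit-learn, assuming it is available:

from sklearn.ensemble import GradientBoostingRegressor

# trees can split on the category codes directly, so no dummy explosion
X = df[['Day', 'Time']].copy()
X['Day'] = X['Day'].astype('category').cat.codes
model = GradientBoostingRegressor().fit(X, df.Count)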
