Pandas: Wide to long transformation: how to get the row and col numbers - python-3.x

Beginner question:
I have a matrix of, let's say, 3x3 and I want to convert it to the long format as follows:
Wide:
    A    B    C
A  0.1  0.2  0.3
B  0.1  0.2  0.3
C  0.1  0.2  0.3
Long:
  Col1 Col2  Row_num  Col_num  Value
0    A    A        1        1    0.1
1    A    B        1        2    0.2
2    A    C        1        3    0.3
...
8    C    C        3        3    0.3
I have tried various functions like melt, unstack(), and wide_to_long, but I can't get the column number. What is the best way to do this?
Thanks

Create data and unstack values
import pandas as pd

df = pd.DataFrame({'A': [0.1, 0.1, 0.1],
                   'B': [0.2, 0.2, 0.2],
                   'C': [0.3, 0.3, 0.3]},
                  index=['A', 'B', 'C'])
mapping = {col: idx for idx, col in enumerate(df.columns, 1)}
df = df.unstack().to_frame().reset_index()
df.columns = ['Col1', 'Col2', 'Value']
DataFrame
>>> df
  Col1 Col2  Value
0    A    A    0.1
1    A    B    0.1
2    A    C    0.1
3    B    A    0.2
4    B    B    0.2
5    B    C    0.2
6    C    A    0.3
7    C    B    0.3
8    C    C    0.3
Map remaining values
>>> df.assign(
...     Row_num=df['Col1'].map(mapping),
...     Col_num=df['Col2'].map(mapping)
... )
Output
  Col1 Col2  Value  Row_num  Col_num
0    A    A    0.1        1        1
1    A    B    0.1        1        2
2    A    C    0.1        1        3
3    B    A    0.2        2        1
4    B    B    0.2        2        2
5    B    C    0.2        2        3
6    C    A    0.3        3        1
7    C    B    0.3        3        2
8    C    C    0.3        3        3
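For reference, the same steps can be chained into one expression; a minimal sketch (wide stands in for the original 3x3 frame, since df is overwritten above):
import pandas as pd

wide = pd.DataFrame({'A': [0.1, 0.1, 0.1],
                     'B': [0.2, 0.2, 0.2],
                     'C': [0.3, 0.3, 0.3]},
                    index=['A', 'B', 'C'])
mapping = {col: idx for idx, col in enumerate(wide.columns, 1)}

out = (wide.unstack()
           .rename_axis(['Col1', 'Col2'])
           .reset_index(name='Value')
           .assign(Row_num=lambda d: d['Col1'].map(mapping),
                   Col_num=lambda d: d['Col2'].map(mapping)))
print(out)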

I'm sure there is a more efficient way to do this, since my method involves two for loops, but this is a quick and dirty way to transform the data the way you're looking for:
import pandas as pd

# df is your initial dataframe
df = pd.DataFrame({"A": [1, 1, 1],
                   "B": [2, 2, 2],
                   "C": [3, 3, 3]},
                  index=["A", "B", "C"])

# long_rows will store the data we need for the new df
long_rows = []

# loop through each row
for i in range(len(df)):
    # loop through each column
    for j in range(len(df.columns)):
        ind = df.index[i]
        col = df.columns[j]
        val = df.iloc[i, j]
        long_rows.append([ind, col, i + 1, j + 1, val])

new_df = pd.DataFrame(long_rows, columns=["Col1", "Col2", "Row_num", "Col_num", "Value"])
and the result:
new_df
  Col1 Col2  Row_num  Col_num  Value
0    A    A        1        1      1
1    A    B        1        2      2
2    A    C        1        3      3
3    B    A        2        1      1
4    B    B        2        2      2
5    B    C        2        3      3
6    C    A        3        1      1
7    C    B        3        2      2
8    C    C        3        3      3
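If the explicit loops become a bottleneck, one possible vectorized sketch (not part of the original answer) builds the same frame with stack and two label-to-position mappings:
import pandas as pd

# same example frame as in the answer above
df = pd.DataFrame({"A": [1, 1, 1],
                   "B": [2, 2, 2],
                   "C": [3, 3, 3]},
                  index=["A", "B", "C"])

# stack() pairs every (row label, column label) with its value
long_df = df.stack().rename_axis(["Col1", "Col2"]).reset_index(name="Value")

# 1-based positional numbers derived from the original labels
long_df["Row_num"] = long_df["Col1"].map({lab: i for i, lab in enumerate(df.index, 1)})
long_df["Col_num"] = long_df["Col2"].map({lab: i for i, lab in enumerate(df.columns, 1)})
print(long_df)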

Related

Replace values on dataset and apply quartile rule by row on pandas

I have a dataset with lots of variables. So I've extracted the numeric ones:
numeric_columns = transposed_df.select_dtypes(np.number)
Then I want to replace all 0 values for 0.0001
transposed_df[numeric_columns.columns] = numeric_columns.where(numeric_columns.eq(0, axis=0), 0.0001)
And here is the first problem: this line is not replacing the 0 values with 0.0001, but is replacing all non-zero values with 0.0001.
Also, after replacing the 0 values with 0.0001, I want to replace all values that are less than the first quartile of their row with -1 and leave the others as they were. But I can't work out how.
To answer your first question
In [36]: from pprint import pprint
In [37]: pprint(numeric_columns.where.__doc__)
('\n'
'Replace values where the condition is False.\n'
'\n'
'Parameters\n'
'----------\n'
Because where keeps values where the condition is True and replaces them where it is False, all the values except 0 are getting replaced.
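A minimal illustration of the difference on a toy Series (not the asker's data): where keeps values where the condition is True, while mask replaces them there:
import pandas as pd

s = pd.Series([0, 2, 0, 5])

# where: keep values where the condition holds, replace the rest
print(s.where(s.eq(0), 0.0001))   # 0.0000, 0.0001, 0.0000, 0.0001

# mask: replace values where the condition holds, keep the rest
print(s.mask(s.eq(0), 0.0001))    # 0.0001, 2.0000, 0.0001, 5.0000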
Use DataFrame.mask instead, and for the second condition compare against DataFrame.quantile:
import numpy as np
import pandas as pd

transposed_df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [0, 0.5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 0, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)

m1 = numeric_columns.eq(0)
m2 = numeric_columns.lt(numeric_columns.quantile(q=0.25, axis=1), axis=0)
transposed_df[numeric_columns.columns] = numeric_columns.mask(m1, 0.0001).mask(m2, -1)
print (transposed_df)
print (transposed_df)
   A    B  C    D  E  F
0  a -1.0  7  1.0  5  a
1  b -1.0  8  3.0  3  a
2  c  4.0  9 -1.0  6  a
3  d  5.0 -1  7.0  9  b
4  e  5.0  2 -1.0  2  b
5  f  4.0  3 -1.0  4  b
EDIT:
from scipy.stats import zscore
print (transposed_df[numeric_columns.columns].apply(zscore))
B C D E
0 -2.236068 0.570352 -0.408248 0.073521
1 0.447214 0.950586 0.408248 -0.808736
2 0.447214 1.330821 -0.816497 0.514650
3 0.447214 -0.570352 2.041241 1.838037
4 0.447214 -1.330821 -0.408248 -1.249865
5 0.447214 -0.950586 -0.816497 -0.367607
EDIT1:
transposed_df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [0, 1, 1, 1, 1, 1],
    'C': [1, 8, 9, 4, 2, 3],
    'D': [1, 3, 0, 7, 1, 0],
    'E': [1, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)
from scipy.stats import zscore
df1 = pd.DataFrame(numeric_columns.apply(zscore, axis=1).tolist(), index=transposed_df.index)
transposed_df[numeric_columns.columns] = df1
print (transposed_df)
A B C D E F
0 a -1.732051 0.577350 0.577350 0.577350 a
1 b -1.063410 1.643452 -0.290021 -0.290021 a
2 c -0.816497 1.360828 -1.088662 0.544331 a
3 d -1.402136 -0.412393 0.577350 1.237179 b
4 e -1.000000 1.000000 -1.000000 1.000000 b
5 f -0.632456 0.632456 -1.264911 1.264911 b

Is there anyway to make more than one dummies variable at a time? [duplicate]

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
columns specifies where to do the one-hot encoding.
>>> df
   A  B  C
0  a  c  1
1  b  c  2
2  a  b  3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
   C  A_a  A_b  B_b  B_c
0  1  1.0  0.0  0.0  1.0
1  2  0.0  1.0  0.0  1.0
2  3  1.0  0.0  1.0  0.0
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                    'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
   A  B
0  a  x
1  a  y
2  b  z
3  b  x
4  c  x
5  a  y
6  b  y
7  c  z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
   A        B
   a  b  c  x  y  z
0  1  0  0  1  0  0
1  1  0  0  0  1  0
2  0  1  0  0  0  1
3  0  1  0  1  0  0
4  0  0  1  1  0  0
5  1  0  0  0  1  0
6  0  1  0  0  1  0
7  0  0  1  0  0  1
If you don't want the multi-index columns, remove the keys=... from the concat call.
Somebody may have something more clever, but here are two approaches. Assuming you have a dataframe named df with columns 'Name' and 'Year' you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for loop.
First separate the categorical data from the DataFrame using select_dtypes(include="object"),
then apply get_dummies to each column iteratively with a for loop,
as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")

# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
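As the earlier answers show, the loop can also be replaced by passing the object-typed column names straight to get_dummies via its columns argument; a minimal sketch with a hypothetical frame standing in for train_data:
import pandas as pd

# hypothetical stand-in for train_data
train_data = pd.DataFrame({"color": ["red", "blue", "red"],
                           "size": ["S", "M", "L"],
                           "price": [10, 12, 9]})

object_cols = train_data.select_dtypes(include="object").columns
print(pd.get_dummies(train_data, columns=object_cols))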

Delete row from dataframe having "None" value in all the columns - Python

I need to delete rows completely from a dataframe that have the "None" value in all columns. I am using the following code -
df.dropna(axis=0, how='all', thresh=None, subset=None, inplace=True)
This does not make any difference to the dataframe. The rows with the "None" value are still there.
How to achieve this?
These Nones should be strings, so use replace first:
df = df.replace('None', np.nan).dropna(how='all')
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ['None', 'a', 'None'],
    'b': ['None', 'g', 'None'],
    'c': ['None', 'v', 'b'],
})
print (df)
a b c
0 None None None
1 a g v
2 None None b
df1 = df.replace('None', np.nan).dropna(how='all')
print (df1)
a b c
1 a g v
2 NaN NaN b
Or test for values not equal to 'None' with DataFrame.ne and DataFrame.any:
df1 = df[df.ne('None').any(axis=1)]
print (df1)
a b c
1 a g v
2 None None b
You should be dropping along axis 1. Use the how keyword to drop columns with any or all NaN values. Check the docs.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3], 'b':[-1, 0, np.nan], 'c':[np.nan, np.nan, np.nan]})
df
a b c
0 1 -1.0 NaN
1 2 0.0 NaN
2 3 NaN NaN
df.dropna(axis=1, how='any')
a
0 1
1 2
2 3
df.dropna(axis=1, how='all')
a b
0 1 -1.0
1 2 0.0
2 3 NaN
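For completeness, if the goal is to drop rows rather than columns, the same how keyword works with axis=0 once the values are real NaN; a minimal sketch (not the asker's data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3],
                   'b': [-1.0, np.nan, np.nan]})

# drop only the rows where every value is NaN
print(df.dropna(axis=0, how='all'))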

Change/swap values one after another in pandas dataframe for selected rows

Dataframe:
col1 col2
A 0
A 1
A nan
B 0
B 1
C and so on...
I am trying to change 1 to 0, 0 to 1 and nan stays as such in col2 wherever col1=='A'.
Code so far:
df.loc[(df.col1=='A') & (df.col2==0),'col2'] = 2
df.loc[(df.col1=='A') & (df.col2==1),'col2'] = 0
df.loc[(df.col1=='A') & (df.col2==2),'col2'] = 1
# Hope you understand why I am converting 0 to 2 first then to 1.
# Because if I convert all zeroes to 1 then all 1's will be converted to
# 0 in subsequent conversion.
Unique values in col2 are 0, 1, and nan.
Is there a correct/better way of doing this?
Also, is there a way to directly swap these numbers instead of assignment operators?
One solution uses Series.where and astype(bool) with ~ (the NOT operator), then converts back with astype(int). Use loc with boolean indexing to assign back to the DataFrame:
df.loc[df.col1.eq('A'), 'col2'] = df.col2.where(df.col2.isna(),
                                                (~df.col2.astype(bool)).astype(int))
[out]
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 0.0
4 B 1.0
5 C NaN
You can also try with df.mask():
m = df.col1.eq('A') & df.col2.isna()  # condition
df.col2 = 1 - df.col2.mask(m)
print(df)
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 1.0
4 B 0.0
I am trying to change 1 to 0, 0 to 1 and nan stays as such in col2
wherever col1=='A'.
Use np.where:
import numpy as np

df['col2'] = np.where(df['col1'] == 'A',
                      np.where(df['col2'] == 1, 0,
                               np.where(df['col2'].isnull(), df['col2'], 1)),
                      df['col2'])
Output
col1 col2
0 A 1.0
1 A 0.0
2 A NaN
3 B 0.0
4 B 1.0
5 C 0.0
In this case, you can also use your own function in combination with apply().
# import pandas
import pandas as pd
# make sample data
list_of_rows = [
    {'col1': 'A', 'col2': 1},
    {'col1': 'A', 'col2': 0},
    {'col1': 'A', 'col2': None},
    {'col1': 'B', 'col2': 0},
    {'col1': 'B', 'col2': 1},
    {'col1': 'B', 'col2': None},
]
# make a pandas data frame
df = pd.DataFrame(list_of_rows)
# define a function
def change_values(row):
    if row['col2'] == 0:
        return 1
    if row['col2'] == 1:
        return 0
    return row['col2']

# apply the function to the dataframe
df['col2'] = df.apply(lambda row: change_values(row), axis=1)
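Another option, not shown in the answers above, is a dictionary lookup with Series.map: values missing from the dict (including NaN) map to NaN, so nan stays untouched. A minimal sketch on a frame shaped like the question's:
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B'],
                   'col2': [0, 1, None, 0, 1]})

# swap 0 and 1 only where col1 == 'A'; anything not in the dict becomes NaN
rows = df['col1'].eq('A')
df.loc[rows, 'col2'] = df.loc[rows, 'col2'].map({0: 1, 1: 0})
print(df)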

How do I get nlargest rows without the sorting?

I need to extract the n-smallest rows of a pandas df, but it is very important to me to maintain the original order of rows.
code example:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 10, 8, 11, -1],
    'b': list('abdce'),
    'c': [1.0, 2.0, 1.5, 3.0, 4.0]})
df.nsmallest(3, 'a')
Gives:
a b c
4 -1 e 4.0
0 1 a 1.0
2 8 d 1.5
I need:
a b c
0 1 a 1.0
2 8 d 1.5
4 -1 e 4.0
Any ideas how to do that?
PS: In my real example the index is not sorted/sortable, as the labels are strings (names).
Simplest approach, assuming the index was sorted to begin with:
df.nsmallest(3, 'a').sort_index()
a b c
0 1 a 1.0
2 8 d 1.5
4 -1 e 4.0
Alternatively, with np.argpartition and iloc.
This doesn't depend on sorting the index.
import numpy as np

df.iloc[np.sort(df.a.values.argpartition(3)[:3])]
a b c
0 1 a 1.0
2 8 d 1.5
4 -1 e 4.0
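If the index cannot be sorted (e.g. string labels), one more possible sketch (not from the answers above) keeps the original row order by selecting with a membership test on the nsmallest index; it assumes the index labels are unique:
import pandas as pd

df = pd.DataFrame({'a': [1, 10, 8, 11, -1],
                   'b': list('abdce'),
                   'c': [1.0, 2.0, 1.5, 3.0, 4.0]})

# boolean mask preserves the original row order
keep = df.nsmallest(3, 'a').index
print(df[df.index.isin(keep)])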
