How to change one column of data into multiple columns based on row using Python Pandas? - python-3.x

I don't know if I phrased the question correctly.
For example, I want
1
0
1
0
1
0
1
0
change into
1 0 1
0 1 0
1 0 x
The first list should not be changed,
and the result should be converted to a DataFrame.
I tried using numpy.array: flatten the array, then reshape it into columns with reshape(-1,3).T.
But since some values are missing, I cannot reshape the array properly.

A possible solution would be to add the missing values to the array before resizing.
Starting point:
import numpy as np
import pandas as pd
# I assume you flattened the array.
data = np.array([1, 0, 1, 0, 1, 0, 1, 0, ])
Adding the new data based on the required shape and fill value:
new_shape = (3, 3)
fill_value = np.nan
missing_length = np.prod(new_shape) - data.size
missing_array = np.full(missing_length, fill_value)
data = np.hstack([data, missing_array])
Then apply the reshape and convert it to a dataframe:
data = data.reshape(new_shape)
df = pd.DataFrame(data)
print(df)
output:
     0    1    2
0  1.0  0.0  1.0
1  0.0  1.0  0.0
2  1.0  0.0  NaN
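The padding steps above can be wrapped so that the shape does not have to be hard-coded. This is only a sketch: the helper name to_frame and its n_cols parameter are made up for illustration, and it assumes the data should be laid out row by row, as in the answer.

```python
import numpy as np
import pandas as pd

def to_frame(values, n_cols, fill_value=np.nan):
    """Hypothetical helper: pad a 1-D sequence and reshape it row by row."""
    flat = np.asarray(values, dtype=float).ravel()
    # Ceiling division: number of rows needed to hold every value
    n_rows = -(-flat.size // n_cols)
    # Start from an array full of the fill value, then copy the data in
    padded = np.full(n_rows * n_cols, fill_value, dtype=float)
    padded[:flat.size] = flat
    return pd.DataFrame(padded.reshape(n_rows, n_cols))

df = to_frame([1, 0, 1, 0, 1, 0, 1, 0], n_cols=3)
print(df)
```

Because NaN is a float, the resulting columns are float64 even though the input holds integers.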

Related

Getting Dummy Back to Categorical

I have a df called X like this:
Index Class Family
1 Mid 12
2 Low 6
3 High 5
4 Low 2
I converted this to dummy variables using the code below:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
ohe = OneHotEncoder()
X_object = X.select_dtypes('object')
ohe.fit(X_object)
codes = ohe.transform(X_object).toarray()
feature_names = ohe.get_feature_names(['V1', 'V2'])
X = pd.concat([df.select_dtypes(exclude='object'),
pd.DataFrame(codes,columns=feature_names).astype(int)], axis=1)
Resultant df is like:
V1_Mid V1_Low V1_High V2_12 V2_6 V2_5 V2_2
1 0 0 1 0 0 0
..and so on
Question: How do I convert my resultant df back to the original df?
I have seen this but it gives me NameError: name 'Series' is not defined.
First we can regroup each original column from your resultant df into the original column names as the first level of a column multi-index:
>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', n=1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
Class Family
Mid Low High 12 6 5 2
0 1 0 0 1 0 0 0
Now we see the second-level of columns are the values. Thus, within each top-level we want to get the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():
>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
Class Family
0 Mid 12
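For reference, the same idxmax idea can be written as a self-contained sketch without the MultiIndex step. The helper name undo_dummies is hypothetical, the sample frame is rebuilt from the question's data, and the sketch assumes every column is named prefix_value with exactly one 1 per prefix in each row.

```python
import pandas as pd

# Dummy frame rebuilt from the question's data (Class -> V1, Family -> V2)
dummies = pd.DataFrame(
    [[1, 0, 0, 1, 0, 0, 0],
     [0, 1, 0, 0, 1, 0, 0],
     [0, 0, 1, 0, 0, 1, 0],
     [0, 1, 0, 0, 0, 0, 1]],
    columns=["V1_Mid", "V1_Low", "V1_High", "V2_12", "V2_6", "V2_5", "V2_2"],
)

def undo_dummies(frame, sep="_"):
    """Hypothetical helper: recover one label column per prefix via idxmax."""
    prefixes = [str(col).split(sep, 1)[0] for col in frame.columns]
    restored = {}
    for prefix in dict.fromkeys(prefixes):  # unique prefixes, order preserved
        cols = [c for c, p in zip(frame.columns, prefixes) if p == prefix]
        # idxmax returns, for each row, the column label that holds the 1
        labels = frame[cols].idxmax(axis="columns")
        # Keep only the part after the separator, i.e. the original value
        restored[prefix] = labels.map(lambda name: name.split(sep, 1)[1])
    return pd.DataFrame(restored)

restored = undo_dummies(dummies).rename(columns={"V1": "Class", "V2": "Family"})
print(restored)
```

The recovered values are strings, so a numeric column such as Family would still need pd.to_numeric afterwards.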
An even simpler way is to stick with pandas.
df = pd.DataFrame({"Index":[1,2,3,4],"Class":["Mid","Low","High","Low"],"Family":[12,6,5,2]})
# Combine features in new column
df["combined"] = list(zip(df["Class"], df["Family"]))
print(df)
Out:
Index Class Family combined
0 1 Mid 12 (Mid, 12)
1 2 Low 6 (Low, 6)
2 3 High 5 (High, 5)
3 4 Low 2 (Low, 2)
You can get the one hot encoding using pandas directly.
one_hot = pd.get_dummies(df["combined"])
print(one_hot)
Out:
(High, 5) (Low, 2) (Low, 6) (Mid, 12)
0 0 0 0 1
1 0 0 1 0
2 1 0 0 0
3 0 1 0 0
Then, to get back, you can just check the column name and select the rows in the original dataframe with the same tuple.
print(df[df["combined"]==one_hot.columns[0]])
Out:
Index Class Family combined
2 3 High 5 (High, 5)

What's the best way of converting a numeric array in a text file to a numpy array?

So I'm trying to create an array from a text file, the text file is laid out as follows. The numbers in the first two columns both go to 165:
0 0 1.0 0.0
1 0 0.0 0.0
1 1 0.0 0.0
2 0 -9.0933087157900000E-5 0.0000000000000000E+00
2 1 -2.7220323615900000E-09 -7.5751829208300000E-10
2 2 3.4709851601400000E-5 1.6729490538300000E-08
3 0 -3.2035914003000000E-06 0.0000000000000000E+00
3 1 2.6327440121800000E-05 5.4643630898200000E-06
3 2 1.4188179329400000E-05 4.8920365004800000E-06
3 3 1.2286058944700000E-05 -1.7854480816400000E-06
4 0 3.1973095717200000E-06 0.0000000000000000E+00
4 1 -5.9966018301500000E-06 1.6619345194700000E-06
4 2 -7.0818069269700000E-06 -6.7836271726900000E-06
4 3 -1.3622983381300000E-06 -1.3443472287100000E-05
4 4 -6.0257787358300000E-06 3.9396371953800000E-06
I'm trying to write a function where an array is made using the numbers in the 3rd columns, taking their positions in the array from the first two columns, and the empty cells are 0s. For example:
1 0 0 0
0 0 0 0
-9.09330871579000e-05 -2.72203236159000e-09 3.47098516014000e-05 0
-3.20359140030000e-06 2.63274401218000e-05 1.41881793294000e-05 1.22860589447000e-05
At the same time, I'm also trying to make a second array using the numbers from the 4th column instead of the 3rd. The code that I've written so far is shown below, followed by the array it produces. I'm not even sure where the 4.41278e-08 comes from:
import numpy as np
def createarray(filepath,maxdegree):
    Cnm = np.zeros((maxdegree+1,maxdegree+1))
    Snm = np.zeros((maxdegree+1,maxdegree+1))
    fid = np.genfromtxt(filepath)
    for row in fid:
        for n in range(0,maxdegree):
            for m in range(0,maxdegree):
                Cnm[n+1,m+1]=row[2]
                Snm[n+1,m+1]=row[3]
    return [Cnm, Snm]
0 0 0 0
0 4.41278e-08 4.41278e-08 4.41278e-08
0 4.41278e-08 4.41278e-08 4.41278e-08
0 4.41278e-08 4.41278e-08 4.41278e-08
I'm not getting any errors but I'm also not getting the right array. Can anyone shed some light on what I'm doing wrong?
Your data appear to be in a COO sparse matrix format already. This means that you could write your own function, but you could also capitalize on the work done in the scipy.sparse package.
For example this code creates a function that would generate one of your matrices at a time. You could modify it to return both matrices.
import numpy as np
from scipy import sparse
def createarray(filepath, maxdegree, value_column):
    """Create single array from file"""
    # load sparse data into numpy array
    data = np.loadtxt(filepath)
    # the first two columns hold the row and column indices;
    # cast them to int because loadtxt returns floats
    rows = data[:, 0].astype(int)
    cols = data[:, 1].astype(int)
    # use coo_matrix to create the sparse matrix where the
    # values are found in the value_column column of data
    M = sparse.coo_matrix((data[:, value_column], (rows, cols)), shape=(maxdegree+1, maxdegree+1))
    # if you need a numpy array call toarray() otherwise you
    # can return M which is sparse and more memory efficient
    return M.toarray()
Then for the first matrix you wanted to create you would set value_column to 2, and for the second you would set value_column to 3.
# first matrix
Cnm = createarray(filepath, maxdegree, 2)
# second matrix
Snm = createarray(filepath, maxdegree, 3)
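As for why the original function fills the array with a single number: the two inner loops ignore the indices stored in the file and write row[2] into every cell from [1,1] onwards, so each row of the file overwrites the previous one and only the last row's value survives (which is presumably where the 4.41278e-08 comes from). A plain NumPy sketch that uses the first two columns as indices could look like the following. The name create_arrays is made up, an in-memory sample stands in for the real file, and rows beyond maxdegree are simply skipped.

```python
import io
import numpy as np

# A few rows from the question, used here instead of a real file path
sample = io.StringIO("""\
0 0 1.0 0.0
1 0 0.0 0.0
1 1 0.0 0.0
2 0 -9.0933087157900000E-5 0.0000000000000000E+00
2 1 -2.7220323615900000E-09 -7.5751829208300000E-10
2 2 3.4709851601400000E-5 1.6729490538300000E-08
""")

def create_arrays(source, maxdegree):
    """Hypothetical helper: build Cnm and Snm with index arrays, no loops."""
    data = np.loadtxt(source)
    n = data[:, 0].astype(int)  # row index (first column of the file)
    m = data[:, 1].astype(int)  # column index (second column of the file)
    keep = (n <= maxdegree) & (m <= maxdegree)  # skip rows outside the array
    Cnm = np.zeros((maxdegree + 1, maxdegree + 1))
    Snm = np.zeros_like(Cnm)
    # Integer-array indexing writes each value to its own (n, m) cell
    Cnm[n[keep], m[keep]] = data[keep, 2]
    Snm[n[keep], m[keep]] = data[keep, 3]
    return Cnm, Snm

Cnm, Snm = create_arrays(sample, 2)
print(Cnm)
print(Snm)
```

Cells that never appear in the file keep the 0 they were initialised with, which matches the layout asked for in the question.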

How to check if column is binary? (Pandas)

How to (efficiently!) check if a column is binary ?
"col" "col2"
0 0 1
1 0 0
2 0 0
3 0 0
4 0 1
There might also be a problem with columns that aren't meant to be binary
but only contain zeros.
(I thought of keeping a list of their names, filled whenever a column is added to the DF,
but is there a way to directly mark a column as "binary" during creation?)
The purpose is feature scaling for machine learning (binary columns shouldn't be scaled).
If you want to get the names of the columns that contain only 0 or 1 values:
c = df.columns[df.isin([0,1]).all()]
print (c)
Index(['col', 'col2'], dtype='object')
If you need to filter the columns themselves:
df1 = df.loc[:, df.isin([0,1]).all()]
print (df1)
col col2
0 0 1
1 0 0
2 0 0
3 0 0
4 0 1
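Note that the isin test also accepts a column that contains only zeros, which is the case the question worries about. A small sketch that optionally requires both values to be present is shown below. The function name binary_columns, the require_both flag, and the extra column col3 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "col": [0, 0, 0, 0, 0],
    "col2": [1, 0, 0, 0, 1],
    "col3": [3, 0, 1, 0, 2],  # extra non-binary column, added for illustration
})

def binary_columns(frame, require_both=False):
    """Hypothetical helper: names of columns holding only 0/1 values."""
    only_01 = frame.isin([0, 1]).all()
    if require_both:
        # Exclude constant columns (all zeros or all ones)
        only_01 = only_01 & (frame.nunique() == 2)
    return list(frame.columns[only_01])

print(binary_columns(df))
print(binary_columns(df, require_both=True))
```

Whether an all-zero column should count as binary depends on the data, so the stricter check is left as an option rather than the default.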
You can use this:
pd.unique(df[['col', 'col2']].values.ravel('K'))
It returns:
array([0, 1], dtype=int64)
Or you can also use pd.unique on each column.
This is what I use; it also covers the corner cases with mixed string/numeric types:
import numpy as np
import pandas as pd
def checkBinary(ser, dropna = False):
    try:
        if dropna:
            ser = pd.to_numeric(ser.dropna(), errors="raise") #With a safety reminder that errors must be raised
        else:
            ser = pd.to_numeric(ser, errors="raise")
    except (ValueError, TypeError):
        return False
    return {0,1} == set(pd.unique(ser))
ser = pd.Series(["0",1,"1.000", np.nan])
checkBinary(ser, dropna = True)
>> True
ser = pd.Series(["0",0,"0.000"])
checkBinary(ser)
>> False

How to replace selected rows of pandas dataframe with a np array, sequentially?

I have a pandas dataframe
A B C
0 NaN 2 6
1 3.0 4 0
2 NaN 0 4
3 NaN 1 2
where I have a column A that has NaN values in some rows (not necessarily consecutive).
I want to replace these values not with a constant value (which pd.fillna does), but rather with the values from a numpy array.
So the desired outcome is:
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
I'm not sure the .replace method will help here as well, since that seems to replace value <-> value via dictionary. Whereas here I want to sequentially change NaN to its corresponding value (by index) in the np array.
I tried:
MWE:
huh = pd.DataFrame([[np.nan, 2, 6],
[3, 4, 0],
[np.nan, 0, 4],
[np.nan, 1, 2]],
columns=list('ABC'))
huh.A[huh.A.isnull()] = np.array([1,5,7]) # what i want to do, but this gives error
gives the error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I read the docs but I can't understand how to do this with .loc.
How do I do this properly, preferably without a for loop?
Other info:
The number of elements in the np array will always match the number of NaN in the dataframe, so your answer does not need to check for this.
You are really close; you need DataFrame.loc to avoid chained assignment:
huh.loc[huh.A.isnull(), 'A'] = np.array([1,5,7])
print (huh)
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
zip
This should account for uneven lengths
m = huh.A.isna()
a = np.array([1, 5, 7])
s = pd.Series(dict(zip(huh.index[m], a)))
huh.fillna({'A': s})
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
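To illustrate what "uneven lengths" means here, the zip idea can be wrapped in a small function and called with an array that is shorter than the number of NaNs. This is a sketch: the name fill_nan_sequentially is hypothetical, and it fills the column through Series.fillna on a copy rather than through the DataFrame-level call shown above.

```python
import numpy as np
import pandas as pd

huh = pd.DataFrame([[np.nan, 2, 6],
                    [3, 4, 0],
                    [np.nan, 0, 4],
                    [np.nan, 1, 2]],
                   columns=list("ABC"))

def fill_nan_sequentially(frame, column, values):
    """Hypothetical helper: fill NaNs in `column` with `values`, in order."""
    missing_index = frame.index[frame[column].isna()]
    # zip stops at the shorter input, so surplus NaNs or values are ignored
    replacements = pd.Series(dict(zip(missing_index, values)), dtype=float)
    filled = frame.copy()
    # fillna aligns on the index, so each value lands on its own row
    filled[column] = filled[column].fillna(replacements)
    return filled

exact = fill_nan_sequentially(huh, "A", np.array([1, 5, 7]))
short = fill_nan_sequentially(huh, "A", np.array([1, 5]))
print(exact)
print(short)
```

With the shorter array, the last NaN in column A is left in place instead of raising an error, and the original frame is not modified.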

Create dummies from a column for a subset of data which doesn't contain all the category values in that column

I am handling a subset of a large data set.
There is a column named "type" in the dataframe. The "type" column is expected to have values like [1,2,3,4].
In a certain subset, I find that the "type" column only contains certain values, like [1,4]:
In [1]: df
Out[2]:
type
0 1
1 4
When I create dummies from column "type" on that subset, it turns out like this:
In [3]:import pandas as pd
In [4]:pd.get_dummies(df["type"], prefix = "type")
Out[5]: type_1 type_4
0 1 0
1 0 1
It doesn't have the columns named "type_2" and "type_3". What I want is:
Out[6]: type_1 type_2 type_3 type_4
0 1 0 0 0
1 0 0 0 1
Is there a solution for this?
What you need to do is make the column 'type' into a pd.Categorical and specify the categories
pd.get_dummies(pd.Categorical(df.type, [1, 2, 3, 4]), prefix='type')
type_1 type_2 type_3 type_4
0 1 0 0 0
1 0 0 0 1
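When several subsets have to be encoded the same way, the categorical approach can be wrapped so that every subset yields identical columns. A minimal sketch, assuming the full category list is known up front; the names ALL_TYPES and type_dummies are made up for illustration.

```python
import pandas as pd

ALL_TYPES = [1, 2, 3, 4]  # assumed full list of categories

def type_dummies(frame, column="type", levels=ALL_TYPES):
    """Hypothetical helper: dummies with one column per known level."""
    # A categorical dtype remembers levels that are absent from this subset
    typed = frame[column].astype(pd.CategoricalDtype(categories=levels))
    # Cast to int so the output is 0/1 regardless of the pandas version
    return pd.get_dummies(typed, prefix=column).astype(int)

subset_a = pd.DataFrame({"type": [1, 4]})
subset_b = pd.DataFrame({"type": [2, 2, 3]})
dummies_a = type_dummies(subset_a)
dummies_b = type_dummies(subset_b)
print(dummies_a)
print(dummies_b)
```

Both results have the columns type_1 to type_4, so they can be concatenated or fed to the same model without realigning.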
Another solution with reindex and add_prefix:
df1 = (pd.get_dummies(df["type"])
         .reindex(columns=[1,2,3,4], fill_value=0)
         .add_prefix('type'))
print (df1)
type1 type2 type3 type4
0 1 0 0 0
1 0 0 0 1
Or categorical solution:
df1 = pd.get_dummies(df["type"].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4])), prefix='type')
print (df1)
type_1 type_2 type_3 type_4
0 1 0 0 0
1 0 0 0 1
Since you tagged your post as one-hot-encoding, you may find sklearn module's OneHotEncoder useful, in addition to pure Pandas solutions:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# sample data
df = pd.DataFrame({'type':[1,4]})
n_vals = 5
# one-hot encoding
# categories lists the allowed values (older scikit-learn releases used n_values and sparse instead)
encoder = OneHotEncoder(categories=[list(range(n_vals))], sparse_output=False, dtype=int)
data = encoder.fit_transform(df.type.values.reshape(-1,1))
# encoded data frame
newdf = pd.DataFrame(data, columns=['type_{}'.format(x) for x in range(n_vals)])
print(newdf)
   type_0  type_1  type_2  type_3  type_4
0       0       1       0       0       0
1       0       0       0       0       1
One advantage of this approach is that OneHotEncoder easily produces sparse output for very large class sets. (Just change to sparse_output=True in the OneHotEncoder() declaration.)
