Does copying a pandas DataFrame to another DataFrame actually copy the data type? - python-3.x

I am trying to copy a dataframe to another dataframe using the default 'deep' copy method. But when I do a calculation in the second dataframe, the result is shown in datatype 'int64'. Is there any way to show it in real format (float64)?
dilution_category_info
Output
IS_HIGH_VALUE 0 1
DIALUTION_CATEGORY
0 93117 107300
1 374679 628604
2 64642 192098
3 404921 823262
4 145663 322063
dilution_category_info_2 = dilution_category_info.copy()
dilution_category_info_2[0][0] = (dilution_category_info[0][0]/(dilution_category_info[0][0]\
+ dilution_category_info[1][0]))
Output
IS_HIGH_VALUE 0 1
DIALUTION_CATEGORY
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0

First, explicitly cast the datatype of your new deep-copied dataframe (assuming you have numpy available as np; if not, import it as import numpy as np for the code below to work):
dilution_category_info_2 = dilution_category_info.copy()
dilution_category_info_2 = dilution_category_info_2.astype(np.float64)
then assign the values as:
dilution_category_info_2[0][0] = (dilution_category_info[0][0]/(dilution_category_info[0][0]+ dilution_category_info[1][0]))
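As a side note, chained indexing like df[0][0] = ... can write to a temporary copy rather than the frame itself. Here is a minimal sketch of the same fix using .loc instead; the toy numbers below are stand-ins taken from the output above, not your real data:
import numpy as np
import pandas as pd

# toy stand-in for dilution_category_info (values copied from the output above)
dilution_category_info = pd.DataFrame({0: [93117, 374679], 1: [107300, 628604]})

# cast the whole copy to float64 up front, then assign via .loc
dilution_category_info_2 = dilution_category_info.copy().astype(np.float64)
dilution_category_info_2.loc[0, 0] = (
    dilution_category_info.loc[0, 0]
    / (dilution_category_info.loc[0, 0] + dilution_category_info.loc[0, 1])
)
print(dilution_category_info_2.dtypes)  # both columns are float64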

Related

Getting Dummy Back to Categorical

I have a df called X like this:
Index Class Family
1 Mid 12
2 Low 6
3 High 5
4 Low 2
Converted this to dummy variables using the code below:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
ohe = OneHotEncoder()
X_object = X.select_dtypes('object')
ohe.fit(X_object)
codes = ohe.transform(X_object).toarray()
feature_names = ohe.get_feature_names(['V1', 'V2'])
X = pd.concat([X.select_dtypes(exclude='object'),
               pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)
Resultant df is like:
V1_Mid V1_Low V1_High V2_12 V2_6 V2_5 V2_2
1 0 0 1 0 0 0
..and so on
Question: How do I convert my resultant df back to the original df?
I have seen this but it gives me NameError: name 'Series' is not defined.
First we can regroup each original column from your resultant df into the original column names as the first level of a column multi-index:
>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', n=1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
Class Family
Mid Low High 12 6 5 2
0 1 0 0 1 0 0 0
Now we see the second level of columns holds the values. Thus, within each top level we want to get the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():
>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
Class Family
0 Mid 12
An even simpler way is to just stick to pandas.
df = pd.DataFrame({"Index":[1,2,3,4],"Class":["Mid","Low","High","Low"],"Family":[12,6,5,2]})
# Combine features in new column
df["combined"] = list(zip(df["Class"], df["Family"]))
print(df)
Out:
Index Class Family combined
0 1 Mid 12 (Mid, 12)
1 2 Low 6 (Low, 6)
2 3 High 5 (High, 5)
3 4 Low 2 (Low, 2)
You can get the one hot encoding using pandas directly.
one_hot = pd.get_dummies(df["combined"])
print(one_hot)
Out:
(High, 5) (Low, 2) (Low, 6) (Mid, 12)
0 0 0 0 1
1 0 0 1 0
2 1 0 0 0
3 0 1 0 0
Then to get back, you can just check the name of the column and select the rows in the original dataframe with the same tuple.
print(df[df["combined"]==one_hot.columns[0]])
Out:
Index Class Family combined
2 3 High 5 (High, 5)
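To recover every row at once instead of filtering column by column, the single 1 in each row of one_hot can be mapped back to its tuple. A sketch building on the frames above:
# idxmax finds, per row, the column label (a (Class, Family) tuple) holding the 1
recovered = pd.DataFrame(one_hot.idxmax(axis=1).tolist(),
                         columns=["Class", "Family"])
print(recovered)  # same Class/Family pairs as the original df, in row order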

How to change one column of data to multiple columns based on row using Python Pandas?

I don't know if I put the question correctly...
For example, I want
1
0
1
0
1
0
1
0
change into
1 0 1
0 1 0
1 0 x
The first list should not be changed,
and the type should be changed to DataFrame.
I tried using numpy.array: flattening the array and reshaping to columns using reshape(-1,3).T,
but since there are some missing values in it, I cannot reshape the array properly.
A possible solution would be to add the missing values to the array before resizing.
Starting point:
import numpy as np
import pandas as pd
# I assume you flattened the array.
data = np.array([1, 0, 1, 0, 1, 0, 1, 0, ])
Adding the new data based on the required shape and fill value:
new_shape = (3, 3)
fill_value = np.nan
missing_length = np.prod(new_shape) - data.size
missing_array = np.full(missing_length, fill_value)
data = np.hstack([data, missing_array])
Then apply the reshape and convert it to a dataframe:
data = data.reshape(new_shape)
df = pd.DataFrame(data)
print(df)
output:
0 1 2
0 1.0 0.0 1.0
1 0.0 1.0 0.0
2 1.0 0.0 NaN
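For what it's worth, the padding step can also be written with np.pad. A compact sketch under the same assumptions (the array is cast to float so NaN fits):
import numpy as np
import pandas as pd

data = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
new_shape = (3, 3)
# pad the tail with NaN up to the target size, then reshape
padded = np.pad(data, (0, np.prod(new_shape) - data.size),
                constant_values=np.nan)
print(pd.DataFrame(padded.reshape(new_shape)))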

What's the best way of converting a numeric array in a text file to a numpy array?

So I'm trying to create an array from a text file, which is laid out as follows. The numbers in the first two columns both go up to 165:
0 0 1.0 0.0
1 0 0.0 0.0
1 1 0.0 0.0
2 0 -9.0933087157900000E-5 0.0000000000000000E+00
2 1 -2.7220323615900000E-09 -7.5751829208300000E-10
2 2 3.4709851601400000E-5 1.6729490538300000E-08
3 0 -3.2035914003000000E-06 0.0000000000000000E+00
3 1 2.6327440121800000E-05 5.4643630898200000E-06
3 2 1.4188179329400000E-05 4.8920365004800000E-06
3 3 1.2286058944700000E-05 -1.7854480816400000E-06
4 0 3.1973095717200000E-06 0.0000000000000000E+00
4 1 -5.9966018301500000E-06 1.6619345194700000E-06
4 2 -7.0818069269700000E-06 -6.7836271726900000E-06
4 3 -1.3622983381300000E-06 -1.3443472287100000E-05
4 4 -6.0257787358300000E-06 3.9396371953800000E-06
I'm trying to write a function where an array is made using the numbers in the 3rd column, taking their positions in the array from the first two columns, with the empty cells as 0s. For example:
1 0 0 0
0 0 0 0
-9.09330871579000e-05 -2.72203236159000e-09 3.47098516014000e-05 0
-3.20359140030000e-06 2.63274401218000e-05 1.41881793294000e-05 1.22860589447000e-05
At the same time, I'm also trying to make a second array, using the numbers from the 4th column instead of the 3rd. The code that I've written so far is as follows, and below it is the array produced. I'm not even sure where the 4.41278e-08 comes from:
import numpy as np
def createarray(filepath, maxdegree):
    Cnm = np.zeros((maxdegree+1, maxdegree+1))
    Snm = np.zeros((maxdegree+1, maxdegree+1))
    fid = np.genfromtxt(filepath)
    for row in fid:
        for n in range(0, maxdegree):
            for m in range(0, maxdegree):
                Cnm[n+1, m+1] = row[2]
                Snm[n+1, m+1] = row[3]
    return [Cnm, Snm]
0 0 0 0
0 4.41278e-08 4.41278e-08 4.41278e-08
0 4.41278e-08 4.41278e-08 4.41278e-08
0 4.41278e-08 4.41278e-08 4.41278e-08
I'm not getting any errors but I'm also not getting the right array. Can anyone shed some light on what I'm doing wrong?
Your data appear to be in a COO sparse matrix format already. This means you could use your own function, but you could also capitalize on the work done in the scipy.sparse package. (As for what went wrong in your loop: for every row of the file you overwrite the entire interior of Cnm and Snm with that single row's values, so the finished arrays simply hold the values of the last row read, which is where the repeated 4.41278e-08 comes from.)
For example, this code creates a function that would generate one of your matrices at a time. You could modify it to return both matrices.
import numpy as np
from scipy import sparse

def createarray(filepath, maxdegree, value_column):
    """Create single array from file"""
    # load sparse data into numpy array
    data = np.loadtxt(filepath)
    # use coo_matrix to create the sparse matrix where the
    # values are found in the value_column column of data
    M = sparse.coo_matrix((data[:, value_column], (data[:, 0], data[:, 1])),
                          shape=(maxdegree+1, maxdegree+1))
    # if you need a numpy array call toarray() otherwise you
    # can return M which is sparse and more memory efficient
    return M.toarray()
Then for the first matrix you wanted to create you would set value_column to 2, and for the second you would set value_column to 3.
# first matrix
Cnm = createarray(filepath, maxdegree, 2)
# second matrix
Snm = createarray(filepath, maxdegree, 3)
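If you'd rather stay in pure numpy, the same idea works with fancy indexing: use the first two columns as integer index arrays instead of looping. A sketch keeping your original two-argument signature:
import numpy as np

def createarray(filepath, maxdegree):
    data = np.genfromtxt(filepath)
    # first two columns are the (n, m) positions; cast them to ints
    n = data[:, 0].astype(int)
    m = data[:, 1].astype(int)
    Cnm = np.zeros((maxdegree + 1, maxdegree + 1))
    Snm = np.zeros((maxdegree + 1, maxdegree + 1))
    # place each value at its own (n, m) position in one vectorized step
    Cnm[n, m] = data[:, 2]
    Snm[n, m] = data[:, 3]
    return [Cnm, Snm]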

pandas - show column name + sum in which the sum is higher than zero

I read my dataframe in with:
dataframe = pd.read_csv("testFile.txt", sep = "\t", index_col= 0)
I got a dataframe like this:
cell 17472131 17472132 17472133 17472134 17472135 17472136
cell_0 1 0 1 0 1 0
cell_1 0 0 0 0 1 0
cell_2 0 1 1 1 0 0
cell_3 1 0 0 0 1 0
With pandas I would like to get all the column names for which the sum of the column is > 1, together with that sum.
So I would like:
17472131 2
17472133 2
17472135 3
I figured out how to get the sums of each column with
dataframe.sum(axis=0)
but this also returns the columns with a sum lower than 2. Is there a way to only show the columns whose sum is higher than, say, 1?
One pretty neat way is to use a lambda function in loc (since you read the file with index_col=0, cell is already the index, so no set_index is needed):
dataframe.sum().loc[lambda x: x > 1]
Output:
17472131 2
17472133 2
17472135 3
dtype: int64
Details: dataframe.sum returns a pd.Series, and lambda x: x > 1 produces a boolean series which loc uses for boolean indexing to select only the True parts of the pd.Series.
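The same result without the lambda, if you prefer two steps (a sketch using the dataframe as read above):
s = dataframe.sum(axis=0)   # one sum per column
print(s[s > 1])             # keep only the columns whose sum exceeds 1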

How to check if column is binary? (Pandas)

How to (efficiently!) check if a column is binary?
"col" "col2"
0 0 1
1 0 0
2 0 0
3 0 0
4 0 1
Also, there might be a problem with columns that aren't meant to be binary
but only include zeros.
(I thought of using a list of their names which is filled after the column is added to the DF,
but is there a way to directly mark a column as "binary" during creation?)
The purpose is feature scaling for machine learning (binaries shouldn't be scaled).
If you want to filter the column names whose values are all 0 or 1:
c = df.columns[df.isin([0,1]).all()]
print (c)
Index(['col', 'col2'], dtype='object')
If you need to filter the columns themselves:
df1 = df.loc[:, df.isin([0,1]).all()]
print (df1)
col col2
0 0 1
1 0 0
2 0 0
3 0 0
4 0 1
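Since the stated purpose is feature scaling, the same mask can feed directly into a scaler. A hedged sketch, using standardization as a stand-in for whatever scaling you apply:
# columns where every value is 0 or 1 are treated as binary and left alone
binary_cols = df.columns[df.isin([0, 1]).all()]
to_scale = df.columns.difference(binary_cols)
df[to_scale] = (df[to_scale] - df[to_scale].mean()) / df[to_scale].std()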
You can use this:
pd.unique(df[['col', 'col2']].values.ravel('K'))
and it returns:
array([0, 1], dtype=int64)
or you can also apply pd.unique to each column, for example:
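# per-column check (a sketch): True where a column's unique values are only 0/1
# note: a column of all zeros also passes this test, as flagged in the question
df.apply(lambda s: set(pd.unique(s)) <= {0, 1})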
Here is what I use to also cover corner cases with mixed string/numeric types:
import numpy as np
import pandas as pd

def checkBinary(ser, dropna=False):
    try:
        if dropna:
            # with a safety reminder that errors must be raised
            ser = pd.to_numeric(ser.dropna(), errors="raise")
        else:
            ser = pd.to_numeric(ser, errors="raise")
    except (ValueError, TypeError):
        return False
    return {0, 1} == set(pd.unique(ser))
ser = pd.Series(["0",1,"1.000", np.nan])
checkBinary(ser, dropna = True)
>> True
ser = pd.Series(["0",0,"0.000"])
checkBinary(ser)
>> False
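To flag every column of a frame at once, the function can be applied column-wise (a sketch assuming a frame df):
binary_mask = df.apply(checkBinary)   # one True/False per column
print(df.columns[binary_mask])        # names of the binary columns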
