How to draw a Venn diagram from dummy variables in Python with matplotlib_venn?

I have the following code to draw a Venn diagram.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib_venn as vplt
x = np.random.randint(2, size=(10, 3))
df = pd.DataFrame(x, columns=['A', 'B', 'C'])
print(df)
v = vplt.venn3(subsets=(1, 1, 1, 1, 1, 1, 1))
plt.show()
and the output is a Venn diagram with every region labelled 1.
What I actually want is to compute the numbers passed to subsets from the dataset itself. How can I do that? Or is there another easy way to build the Venn diagram directly from the dataset?
I also want to draw a box around the diagram and annotate the remaining area as the people for whom A, B, and C are all 0, and then compute the percentage of people in each region and use it as that region's label. I'm not sure how to achieve this.
Background of the Problem:
I have a dataset of more than 500 observations, and these three columns were recorded from one question where multiple choices could be chosen as answers.
I want to visualize, in one graph, how many people chose the 1st option, the 2nd option, etc., as well as how many chose both the 1st and 2nd, the 1st and 3rd, and so on.

Use numpy.argwhere to get the indices of the 1s in each column, turn each column's indices into a set, and plot the resulting sets:
In [85]: df
Out[85]:
A B C
0 0 1 1
1 1 1 0
2 1 1 0
3 0 0 1
4 1 1 0
5 1 1 0
6 0 0 0
7 0 0 0
8 1 1 0
9 1 0 0
In [86]: sets = [set(np.argwhere(v).ravel()) for k,v in df.items()]
...: venn3(sets, df.columns)
...: plt.show()
Note: if you want to draw an additional box with the number of items not in any of the categories, add these lines:
In [87]: ax = plt.gca()
In [88]: xmin, _, ymin, _ = ax.axes.axis('on')
In [89]: ax.text(xmin, ymin, (df == 0).all(1).sum(), ha='left', va='bottom')
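To also get the percentage labels and the box annotation the question asks for, the region counts can be computed directly with boolean masks. Here is a minimal sketch; the region order and the '100', '010', ... ids follow matplotlib_venn's documented convention, and the random data is just a stand-in for the real survey:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib_venn import venn3

df = pd.DataFrame(np.random.randint(2, size=(10, 3)), columns=['A', 'B', 'C'])
a, b, c = (df[col].astype(bool) for col in df.columns)
n = len(df)
# region order expected by venn3: (Abc, aBc, ABc, abC, AbC, aBC, ABC)
regions = ((a & ~b & ~c).sum(), (~a & b & ~c).sum(), (a & b & ~c).sum(),
           (~a & ~b & c).sum(), (a & ~b & c).sum(), (~a & b & c).sum(),
           (a & b & c).sum())
v = venn3(subsets=regions, set_labels=df.columns)
# relabel each region with its percentage of all respondents
for rid, count in zip(('100', '010', '110', '001', '101', '011', '111'), regions):
    lbl = v.get_label_by_id(rid)
    if lbl is not None:  # the label is None for empty regions
        lbl.set_text('{:.0f}%'.format(100 * count / n))
# turn the axes box on and annotate the people outside all three sets
ax = plt.gca()
ax.axis('on')
xmin, _, ymin, _ = ax.axis()
outside = (~a & ~b & ~c).sum()
ax.text(xmin, ymin, '{:.0f}%'.format(100 * outside / n), ha='left', va='bottom')
plt.show()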

Related

Getting Dummy Back to Categorical

I have a df called X like this:
Index Class Family
1 Mid 12
2 Low 6
3 High 5
4 Low 2
I converted this to dummy variables using the code below:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
ohe = OneHotEncoder()
X_object = X.select_dtypes('object')
ohe.fit(X_object)
codes = ohe.transform(X_object).toarray()
feature_names = ohe.get_feature_names(['V1', 'V2'])
X = pd.concat([X.select_dtypes(exclude='object'),
               pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)
Resultant df is like:
V1_Mid V1_Low V1_High V2_12 V2_6 V2_5 V2_2
1 0 0 1 0 0 0
..and so on
Question: How do I convert my resultant df back to the original df?
I have seen this but it gives me NameError: name 'Series' is not defined.
First we can regroup each original column from your resultant df into the original column names as the first level of a column multi-index:
>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', n=1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
Class Family
Mid Low High 12 6 5 2
0 1 0 0 1 0 0 0
Now the second level of the columns holds the values. Thus, within each top level we want to get the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():
>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
Class Family
0 Mid 12
An even simpler way is to just stick to pandas.
df = pd.DataFrame({"Index":[1,2,3,4],"Class":["Mid","Low","High","Low"],"Family":[12,6,5,2]})
# Combine features in new column
df["combined"] = list(zip(df["Class"], df["Family"]))
print(df)
Out:
Index Class Family combined
0 1 Mid 12 (Mid, 12)
1 2 Low 6 (Low, 6)
2 3 High 5 (High, 5)
3 4 Low 2 (Low, 2)
You can get the one hot encoding using pandas directly.
one_hot = pd.get_dummies(df["combined"])
print(one_hot)
Out:
(High, 5) (Low, 2) (Low, 6) (Mid, 12)
0 0 0 0 1
1 0 0 1 0
2 1 0 0 0
3 0 1 0 0
Then, to get back, you can just check the name of the column and select the rows in the original dataframe with the same tuple.
print(df[df["combined"]==one_hot.columns[0]])
Out:
Index Class Family combined
2 3 High 5 (High, 5)
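For completeness: if you are on pandas 1.5 or newer, pd.from_dummies reverses get_dummies-style columns directly, as long as each group of columns has exactly one 1 per row. A small sketch on made-up data:
import pandas as pd

dummies = pd.DataFrame({'Class_Mid': [1, 0], 'Class_Low': [0, 1],
                        'Family_12': [1, 0], 'Family_6': [0, 1]})
# sep='_' marks where the original column name ends and the value begins
print(pd.from_dummies(dummies, sep='_'))
# Out:
#   Class Family
# 0   Mid     12
# 1   Low      6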

How to change one column of data to multiple column based on row using Python Pandas?

I don't know if I'm putting the question correctly.
For example, I want
1
0
1
0
1
0
1
0
change into
1 0 1
0 1 0
1 0 x
The first list should not be changed, and the result should be converted to a DataFrame.
I tried using numpy.array, flattening the array, and reshaping it into columns with reshape(-1,3).T, but since there are some missing values I cannot reshape the array properly.
A possible solution would be to add the missing values to the array before resizing.
Starting point:
import numpy as np
import pandas as pd
# I assume you flattened the array.
data = np.array([1, 0, 1, 0, 1, 0, 1, 0, ])
Adding the new data based on the required shape and fill value:
new_shape = (3, 3)
fill_value = np.nan
missing_length = np.prod(new_shape) - data.size
missing_array = np.full(missing_length, fill_value)
data = np.hstack([data, missing_array])
Then apply the reshape and convert it to a dataframe:
data = data.reshape(new_shape)
df = pd.DataFrame(data)
print(df)
output:
0 1 2
0 1.0 0.0 1.0
1 0.0 1.0 0.0
2 1.0 0.0 NaN
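The padding and the reshape can also be done in one step with np.pad; a sketch assuming the same data and new_shape as above:
import numpy as np
import pandas as pd

data = np.array([1, 0, 1, 0, 1, 0, 1, 0])
new_shape = (3, 3)
# cast to float first so NaN can serve as the constant fill value
padded = np.pad(data.astype(float), (0, np.prod(new_shape) - data.size),
                constant_values=np.nan)
print(pd.DataFrame(padded.reshape(new_shape)))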

What's the best way of converting a numeric array in a text file to a numpy array?

So I'm trying to create an array from a text file. The text file is laid out as follows, and the numbers in the first two columns both go up to 165:
0 0 1.0 0.0
1 0 0.0 0.0
1 1 0.0 0.0
2 0 -9.0933087157900000E-5 0.0000000000000000E+00
2 1 -2.7220323615900000E-09 -7.5751829208300000E-10
2 2 3.4709851601400000E-5 1.6729490538300000E-08
3 0 -3.2035914003000000E-06 0.0000000000000000E+00
3 1 2.6327440121800000E-05 5.4643630898200000E-06
3 2 1.4188179329400000E-05 4.8920365004800000E-06
3 3 1.2286058944700000E-05 -1.7854480816400000E-06
4 0 3.1973095717200000E-06 0.0000000000000000E+00
4 1 -5.9966018301500000E-06 1.6619345194700000E-06
4 2 -7.0818069269700000E-06 -6.7836271726900000E-06
4 3 -1.3622983381300000E-06 -1.3443472287100000E-05
4 4 -6.0257787358300000E-06 3.9396371953800000E-06
I'm trying to write a function that builds an array using the numbers in the 3rd column, taking their positions in the array from the first two columns, with the empty cells being 0s. For example:
1 0 0 0
0 0 0 0
-9.09330871579000e-05 -2.72203236159000e-09 3.47098516014000e-05 0
-3.20359140030000e-06 2.63274401218000e-05 1.41881793294000e-05 1.22860589447000e-05
At the same time, I'm also trying to make a second array that uses the numbers from the 4th column instead of the 3rd. The code I've written so far is below, along with the array it produces; I'm not even sure where the 4.41278e-08 comes from:
import numpy as np
def createarray(filepath, maxdegree):
    Cnm = np.zeros((maxdegree+1, maxdegree+1))
    Snm = np.zeros((maxdegree+1, maxdegree+1))
    fid = np.genfromtxt(filepath)
    for row in fid:
        for n in range(0, maxdegree):
            for m in range(0, maxdegree):
                Cnm[n+1, m+1] = row[2]
                Snm[n+1, m+1] = row[3]
    return [Cnm, Snm]
0 0 0 0
0 4.41278e-08 4.41278e-08 4.41278e-08
0 4.41278e-08 4.41278e-08 4.41278e-08
0 4.41278e-08 4.41278e-08 4.41278e-08
I'm not getting any errors but I'm also not getting the right array. Can anyone shed some light on what I'm doing wrong?
Your data appear to be in a COO sparse matrix format already. This means that you could use your own function, but you could also capitalize on the work done in the scipy.sparse package.
For example this code creates a function that would generate one of your matrices at a time. You could modify it to return both matrices.
import numpy as np
from scipy import sparse

def createarray(filepath, maxdegree, value_column):
    """Create a single array from the file."""
    # load the sparse data into a numpy array
    data = np.loadtxt(filepath)
    # use coo_matrix to create the sparse matrix where the values are
    # found in the value_column column of data; the first two columns
    # give the (row, col) positions and must be cast to integers
    M = sparse.coo_matrix(
        (data[:, value_column], (data[:, 0].astype(int), data[:, 1].astype(int))),
        shape=(maxdegree+1, maxdegree+1))
    # if you need a numpy array call toarray(); otherwise you can
    # return M, which is sparse and more memory efficient
    return M.toarray()
Then for the first matrix you wanted to create you would set value_column to 2, and for the second you would set value_column to 3.
# first matrix
Cnm = createarray(filepath, maxdegree, 2)
# second matrix
Snm = createarray(filepath, maxdegree, 3)
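If you'd rather not depend on scipy, the same fill can be done with plain numpy fancy indexing; a sketch under the same file-layout assumptions (integer positions in the first two columns):
import numpy as np

def createarray(filepath, maxdegree):
    data = np.loadtxt(filepath)
    n = data[:, 0].astype(int)  # row positions
    m = data[:, 1].astype(int)  # column positions
    Cnm = np.zeros((maxdegree + 1, maxdegree + 1))
    Snm = np.zeros((maxdegree + 1, maxdegree + 1))
    # assign every value to its (n, m) position in one vectorized step
    Cnm[n, m] = data[:, 2]
    Snm[n, m] = data[:, 3]
    return Cnm, Snm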

How to check if column is binary? (Pandas)

How can I (efficiently!) check whether a column is binary?
"col" "col2"
0 0 1
1 0 0
2 0 0
3 0 0
4 0 1
There might also be a problem with columns that aren't meant to be binary but happen to contain only zeros.
(I thought of keeping a list of their names, filled after each column is added to the DF, but is there a way to directly mark a column as "binary" during creation?)
The purpose is feature scaling for machine learning (binary columns shouldn't be scaled).
If you want to filter the column names whose values are all 0 or 1:
c = df.columns[df.isin([0,1]).all()]
print (c)
Index(['col', 'col2'], dtype='object')
If you need to filter the columns themselves:
df1 = df.loc[:, df.isin([0,1]).all()]
print (df1)
col col2
0 0 1
1 0 0
2 0 0
3 0 0
4 0 1
You can use this:
pd.unique(df[['col', 'col2']].values.ravel('K'))
and it returns:
array([0, 1], dtype=int64)
or you can also apply pd.unique to each column separately.
Here is what I use to also cover corner cases with mixed string/numeric types:
import numpy as np
import pandas as pd

def checkBinary(ser, dropna=False):
    try:
        if dropna:
            # errors must be raised so non-numeric values are caught below
            ser = pd.to_numeric(ser.dropna(), errors="raise")
        else:
            ser = pd.to_numeric(ser, errors="raise")
    except (ValueError, TypeError):
        return False
    return {0, 1} == set(pd.unique(ser))
ser = pd.Series(["0",1,"1.000", np.nan])
checkBinary(ser, dropna = True)
>> True
ser = pd.Series(["0",0,"0.000"])
checkBinary(ser)
>> False
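Since the stated purpose is feature scaling, here is a sketch of how such a mask can be used to scale only the non-binary columns (assuming scikit-learn; the 'height' column is made up for illustration):
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'col': [0, 0, 0, 0, 0],
                   'col2': [1, 0, 0, 0, 1],
                   'height': [1.70, 1.60, 1.80, 1.75, 1.65]})
binary_cols = df.columns[df.isin([0, 1]).all()]
other_cols = df.columns.difference(binary_cols)
# scale only the non-binary columns; leave the 0/1 flags untouched
df[other_cols] = StandardScaler().fit_transform(df[other_cols])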

Search for value in a panel

I'm using the pandas library and have an instance of the Panel object. I want to find the number of elements that are equal to 0. I tried using the count method like so:
panel.count(0)
However, this returns counts along axis 0, and I want the number of elements within each df of the panel that are equal to zero. Is there any built-in command to do that? Can anyone help me?
You can use .sum() (and the axis argument controls which DataFrame slices you're summing over):
In [11]: p = pd.Panel([[[1, 1]], [[1, 2]], [[1, 2]]])
In [12]: (p == 1).sum(axis=0)
Out[12]:
0 1
0 3 1
In [13]: (p == 1).sum(axis=1)  # this is the default: .sum()
Out[13]:
0 1 2
0 1 1 1
1 1 0 0
In [14]: (p == 1).sum(axis=2)
Out[14]:
0 1 2
0 2 1 1
You may then want to sum this result, giving a Series (I don't think you can do this part in one step):
In [15]: (p == 1).sum(axis=0).sum(axis=0)
Out[15]:
0 3
1 1
dtype: int64
To find the total number of elements matching the condition (here == 1; use == 0 for your case), I'd use np.sum (though you could also chain .sum().sum().sum()):
In [21]: np.sum((p == 1).values)
Out[21]: 4
Note: surprisingly the .values is required here.
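Note that pd.Panel was removed in pandas 1.0; with a plain 3-D numpy array the same counts are one-liners (a sketch mirroring the example above):
import numpy as np

p = np.array([[[1, 1]], [[1, 2]], [[1, 2]]])  # shape (3, 1, 2)
print((p == 0).sum())        # total number of elements equal to 0
print((p == 1).sum(axis=0))  # per-position counts along axis 0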
