How can I create Frequency Matrix using all columns - python-3.x

Let's say that I have a dataset that contains 5 binary columns and 2 rows.
It looks like this:
c1 c2 c3 c4 c5
r1 0 1 0 1 0
r2 1 1 1 1 0
I want to create a matrix that counts, for each pair of columns, the rows in which both columns contain a 1, somewhat like a confusion matrix.
My desired output is:
c1 c2 c3 c4 c5
c1 - 1 1 1 0
c2 1 - 1 2 0
c3 1 1 - 1 0
c4 1 2 1 - 0
I have used pandas crosstab, but it only gives the desired output for 2 columns at a time; I want to use all of the columns.

Use dot (matrix multiplication):
df.T.dot(df)
# same as
# df.T @ df
c1 c2 c3 c4 c5
c1 1 1 1 1 0
c2 1 2 1 2 0
c3 1 1 1 1 0
c4 1 2 1 2 0
c5 0 0 0 0 0
You can use np.fill_diagonal to make the diagonal zero
d = df.T.dot(df)
np.fill_diagonal(d.to_numpy(), 0)
d
c1 c2 c3 c4 c5
c1 0 1 1 1 0
c2 1 0 1 2 0
c3 1 1 0 1 0
c4 1 2 1 0 0
c5 0 0 0 0 0
And as long as we're using Numpy, you could go all the way...
a = df.to_numpy()
b = a.T @ a
np.fill_diagonal(b, 0)
pd.DataFrame(b, df.columns, df.columns)
c1 c2 c3 c4 c5
c1 0 1 1 1 0
c2 1 0 1 2 0
c3 1 1 0 1 0
c4 1 2 1 0 0
c5 0 0 0 0 0
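Putting the pieces together, a runnable sketch built on the NumPy route (the frame is the two-row sample from the question; co_df is an assumed name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 0, 1, 0],
                   [1, 1, 1, 1, 0]],
                  index=['r1', 'r2'],
                  columns=['c1', 'c2', 'c3', 'c4', 'c5'])

a = df.to_numpy()
co = a.T @ a                 # entry (i, j): number of rows where ci and cj are both 1
np.fill_diagonal(co, 0)      # co is a fresh array, so mutating it in place is safe
co_df = pd.DataFrame(co, index=df.columns, columns=df.columns)
print(co_df)
```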

One way, using melt and merge with groupby:
s=df.reset_index().melt('index').loc[lambda x : x.value==1]
(s.merge(s, on='index')
   .query('variable_x != variable_y')
   .groupby(['variable_x', 'variable_y'])['value_x']
   .sum()
   .unstack(fill_value=0))
Out[32]:
variable_y c1 c2 c3 c4
variable_x
c1 0 1 1 1
c2 1 0 1 2
c3 1 1 0 1
c4 1 2 1 0
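The same chain as a self-contained sketch (the sample frame is rebuilt from the question; out is an assumed name):

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 0, 1, 0],
                   [1, 1, 1, 1, 0]],
                  index=['r1', 'r2'],
                  columns=['c1', 'c2', 'c3', 'c4', 'c5'])

# long form: one record per (row, column) pair where the value is 1
s = df.reset_index().melt('index').loc[lambda x: x.value == 1]
# self-merge pairs up columns that fired in the same row; the query drops the diagonal
out = (s.merge(s, on='index')
         .query('variable_x != variable_y')
         .groupby(['variable_x', 'variable_y'])['value_x']
         .sum()
         .unstack(fill_value=0))
print(out)
```

Note that c5 disappears here, since it is never 1 and so never survives the melt filter.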

Related

Ungroup pandas dataframe column values separated by comma

Hello, I have a grouped pandas dataframe in which the column values are comma-separated, and I want to ungroup it. The dataframe looks like this:
col1 col2 name exams
0,0,0 0,0,0, A1 exm1,exm2, exm3
0,1,0,20 0,0,2,20 A2 exm1,exm2, exm4, exm5
0,0,0,30 0,0,20,20 A3 exm1,exm2, exm3, exm5
output how I wanted
col1 col2 name exam
0 0 A1 exm1
0 0 A1 exm2
0 0 A1 exm3
0 0 A2 exm1
1 0 A2 exm2
0 2 A2 exm4
20 20 A2 exm5
..............
30 20 A3 exm5
I tried the approach from "Split (explode) pandas dataframe string entry to separate rows"
but was not able to get a proper approach. Can anyone please suggest how I can get my output?
Try explode; note that explode is a new function as of pandas 0.25.0:
df[['col1','col2','exams']]=df[['col1','col2','exams']].apply(lambda x : x.str.split(','))
df = df.join(pd.concat([df.pop(x).explode() for x in ['col1','col2','exams']],axis=1))
Out[62]:
name col1 col2 exams
0 A1 0 0 exm1
0 A1 0 0 exm2
0 A1 0 0 exm3
1 A2 0 0 exm1
1 A2 1 0 exm2
1 A2 0 2 exm4
1 A2 20 20 exm5
2 A3 0 0 exm1
2 A3 0 0 exm2
2 A3 0 20 exm3
2 A3 30 20 exm5
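A self-contained sketch of the explode approach, assuming a cleaned-up sample without the stray trailing commas in the question's data:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['0,0,0', '0,1,0,20', '0,0,0,30'],
                   'col2': ['0,0,0', '0,0,2,20', '0,0,20,20'],
                   'name': ['A1', 'A2', 'A3'],
                   'exams': ['exm1,exm2,exm3', 'exm1,exm2,exm4,exm5', 'exm1,exm2,exm3,exm5']})

cols = ['col1', 'col2', 'exams']
# split each comma-separated cell into a list...
df[cols] = df[cols].apply(lambda s: s.str.split(','))
# ...then explode every list column and join back onto the remaining columns
out = df.join(pd.concat([df.pop(c).explode() for c in cols], axis=1))
print(out)
```

This only works when the lists in each row have equal lengths, which is why the trailing comma in the original data has to be cleaned first.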

Adding new Column in Dataframe and updating row values as other columns name based on condition

I have a dataframe with columns a, c1, c2, c3, c4.
df =
a. c1. c2. c3. c4.
P1 1 0 0 0
P2 0 0 0 1
P3 1 0 0 0
P4 0 1 0 0
On the above df, I want to do the following:
Add a new column main whose value is the name of the column that contains 1 for that particular row.
For example, the first row will have the value 'c1' in its main column; similarly, the second row will have c4.
The resulting df will look like below:
df =
a. c1. c2. c3. c4. main
P1 1 0 0 0 c1
P2 0 0 0 1 c4
P3 1 0 0 0 c1
P4 0 1 0 0 c2
I am new to python and dataframes. Please help.
Use DataFrame.dot for matrix multiplication.
If a is the first column, omit it by indexing:
df['main'] = df.iloc[:, 1:].dot(df.columns[1:])
#if possible multiple 1 per row
#df['main'] = df.iloc[:, 1:].dot(df.columns[1:] + ',').str.rstrip(',')
print (df)
a c1 c2 c3 c4 main
0 P1 1 0 0 0 c1
1 P2 0 0 0 1 c4
2 P3 1 0 0 0 c1
3 P4 0 1 0 0 c2
If a is index:
df['main'] = df.dot(df.columns)
#if possible multiple 1 per row
#df['main'] = df.dot(df.columns + ',').str.rstrip(',')
print (df)
c1 c2 c3 c4 main
a
P1 1 0 0 0 c1
P2 0 0 0 1 c4
P3 1 0 0 0 c1
P4 0 1 0 0 c2
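A runnable sketch of the first variant, with the sample data reconstructed from the question (the trick works because multiplying a 0/1 row by the string column labels and summing concatenates only the matching names):

```python
import pandas as pd

df = pd.DataFrame({'a': ['P1', 'P2', 'P3', 'P4'],
                   'c1': [1, 0, 1, 0],
                   'c2': [0, 0, 0, 1],
                   'c3': [0, 0, 0, 0],
                   'c4': [0, 1, 0, 0]})

# 0/1 matrix times the label vector: 1 * 'c1' gives 'c1', 0 * 'c1' gives ''
df['main'] = df.iloc[:, 1:].dot(df.columns[1:])
print(df)
```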

DolphinDB pivot table

I have a table something looking like this:
id CompanyName ProductID productName
-- ----------- --------- -----------
1 c1 1 p1
2 c1 2 p2
3 c2 2 p2
4 c2 3 p3
5 c3 3 p3
6 c4 3 p3
7 c5 4 p4
8 c6 4 p4
9 c6 5 p5
Is it possible to run a DolphinDB query to get output like this:
companyName p1 p2 p3 p4 p5
------------------------------
c1 1 1 0 0 0
c2 0 1 1 0 0
c3 0 0 1 0 0
c4 0 0 1 0 0
c5 0 0 0 1 0
c6 0 0 0 1 1
The values in the above table are the counts of each product in each company. I get them with the query:
select count(*) from t group by companyName, productName
t1=select count(ProductID) from t pivot by CompanyName, productName
nullFill!(t1,0)

Index Value of Last Matching Row Python Panda DataFrame

I have a dataframe with a value of either 0 or 1 in column "C2" and either 0 or 1 in column "C1". I would like to append a column containing the index of the last row where C1 = 1, but only for the rows where C2 = 1. This might be easier to see than read:
d = {'C1' : pd.Series([1, 0, 1,0,0], index=[1,2,3,4,5]),'C2' : pd.Series([0, 0,0,1,1], index=[1,2,3,4,5])}
df = pd.DataFrame(d)
print(df)
C1 C2
1 1 0
2 0 0
3 1 0
4 0 1
5 0 1
#I've left out my attempts as they don't even get close
df['C3'] = IF C2 = 1: Call Function that gives Index Value of last place where C1 = 1 Else 0 End
This would result in this result set:
C1 C2 C3
1 1 0 0
2 0 0 0
3 1 0 0
4 0 1 3
5 0 1 3
I was trying to write a function for this, as there are roughly 2 million rows in my data set but only ~10k where C2 = 1.
Thank you in advance for any help, I really appreciate it. I only started
programming with Python a few weeks ago.
It is not completely straightforward; you have to do a few steps to get this result. The key here is the fillna method, which can do forward and backward filling.
Pandas methods often do more than one thing, which makes it hard to figure out which methods to use for what.
So let me talk you through this code.
First we need to set C3 to NaN, otherwise we cannot use fillna later.
Then we set C3 to the index, but only where C1 == 1 (the mask does this).
After this we can forward-fill to propagate the last observation forwards.
Then we mask away all the values where C2 == 0, the same way we set the index earlier, with a mask.
import numpy as np  # pd.np was removed in pandas 1.0

df['C3'] = np.nan
mask = df['C1'] == 1
df.loc[mask, 'C3'] = df.index[mask].copy()
df['C3'] = df['C3'].ffill()  # same as fillna(method='ffill')
mask = df['C2'] == 0
df.loc[mask, 'C3'] = 0
df
C1 C2 C3
1 1 0 0
2 0 0 0
3 1 0 0
4 0 1 3
5 0 1 3
EDIT:
Added a .copy() to the index; otherwise we overwrite it and the index ends up full of zeroes.
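An alternative sketch without the intermediate NaN column, using where plus ffill on the same sample frame (the helper name last_c1 is an assumption):

```python
import pandas as pd

df = pd.DataFrame({'C1': [1, 0, 1, 0, 0],
                   'C2': [0, 0, 0, 1, 1]},
                  index=[1, 2, 3, 4, 5])

# index where C1 == 1, NaN elsewhere, carried forward to later rows
last_c1 = pd.Series(df.index, index=df.index).where(df['C1'] == 1).ffill()
# keep it only on rows where C2 == 1; everything else becomes 0
df['C3'] = last_c1.where(df['C2'] == 1, 0).astype(int)
print(df)
```

Both where calls are vectorized, so this should scale fine to the ~2 million rows mentioned.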

Reading crosstab numbers from excel to access

I have to load a bunch of numbers from Excel into Access. I used the Import Excel Data feature in Access to load the data earlier.
Earlier:
Field1 Field2 Field3 QTY
A1 B1 C1 1
A1 B2 C2 2
A1 B3 C3 3
A1 B4 C4 4
A1 B5 C5 5
My data is in a crosstab format now.
For example:
A1 B1 B2 B3 B4 B5
C1 1 0 0 0 0
C2 0 2 0 0 0
C3 0 0 3 0 0
C4 0 0 0 4 0
C5 0 0 0 0 5
Is there a direct way in which I can import this into Access, or do I have to use a macro to convert it into the linear form used earlier?
Thanks,
Arka.
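If converting back to the linear form outside Access is an option, one route is to unpivot the sheet before importing; a sketch with pandas melt, where the column names are assumptions taken from the sample (the first header cell "A1" is treated as the Field1 value):

```python
import pandas as pd

# Crosstab sheet from the question: rows are C labels, columns are B labels
wide = pd.DataFrame({'A1': ['C1', 'C2', 'C3', 'C4', 'C5'],
                     'B1': [1, 0, 0, 0, 0],
                     'B2': [0, 2, 0, 0, 0],
                     'B3': [0, 0, 3, 0, 0],
                     'B4': [0, 0, 0, 4, 0],
                     'B5': [0, 0, 0, 0, 5]})

# melt turns each (row label, column label, cell) triple into one record
long = (wide.melt(id_vars='A1', var_name='Field2', value_name='QTY')
            .rename(columns={'A1': 'Field3'})
            .query('QTY != 0')            # drop the empty crosstab cells
            .assign(Field1='A1'))
long = long[['Field1', 'Field2', 'Field3', 'QTY']]
print(long)
```

The result matches the earlier linear layout and can then be imported into Access as before.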
