Count categorical values in DataFrame - python-3.x

I have a DataFrame that contains only categorical values:
Col1 | Col2| ... | ColM
Row
1 X | Y | ... | X
2 Z | X | ... | Y
3 Y | Z | ... | X
.
.
.
N X | Z | ... | Z
I would like to count how many times each category appears in the whole DataFrame.
Example result:
X - 100 times
Y - 30 times
Z - 210 times
Thank you for your help.

The most performant option is to use np.unique with the return_counts flag set:
u, c = np.unique(df, return_counts=True)
pd.Series(c, index=u)
Alternatively, combine stack with value_counts, which is much slower but simple and intuitive:
df.stack().value_counts()
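To make the two options above concrete, here is a minimal sketch on a small made-up DataFrame (the sample data and the variable names counts_np and counts_vc are assumptions for illustration, not from the question). Both approaches should give the same totals:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: three columns holding the categories X, Y, Z.
df = pd.DataFrame({
    "Col1": ["X", "Z", "Y", "X"],
    "Col2": ["Y", "X", "Z", "Z"],
    "Col3": ["X", "Y", "X", "Z"],
})

# Option 1: np.unique flattens the 2-D array and counts each distinct value.
values, counts = np.unique(df.to_numpy(dtype=object), return_counts=True)
counts_np = pd.Series(counts, index=values)

# Option 2: stack all columns into one long Series, then count its values.
counts_vc = df.stack().value_counts()

print(counts_np)
print(counts_vc)
```

Note that np.unique returns the categories sorted by label, while value_counts sorts by descending count.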

Related

Identify the parent and children value in the dataframe

I spent almost half of my day trying to solve this...
I want to follow the values in the Parent and Child columns and turn each chain into a row.
The values form a tree: a child node in one row becomes the parent node in the next step.
My sample data looks like this:
| Parent | Child |
--------------------------
0 | a b
1 | b c
2 | b d
3 | c e
4 | c f
5 | f g
6 | d h
and I want to transform it into this:
| Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
----------------------------------------------------------
0 | a | b | c | f | g | nan |
1 | a | b | c | e | nan | nan |
2 | a | b | d | h | nan | nan |
I tried writing a loop that searches for the next items, but it does not work.
Any help would be appreciated.
You can approach this as a graph problem using networkx.
Create all edges, find the roots and leaves, and compute the paths with all_simple_paths:
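import networkx as nx
import pandas as pd

# Build a directed graph from the Parent -> Child edges
G = nx.from_pandas_edgelist(df, source='Parent', target='Child',
                            create_using=nx.DiGraph)
# Roots have no incoming edges; leaves have no outgoing edges
roots = [n for n, d in G.in_degree() if d == 0]
leafs = [n for n, d in G.out_degree() if d == 0]
# One row per root-to-leaf path
df2 = pd.DataFrame([l for r in roots for l in nx.all_simple_paths(G, r, leafs)])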
output:
0 1 2 3 4
0 a b c e None
1 a b c f g
2 a b d h None
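If networkx is not available, the same root-to-leaf enumeration can be sketched with a plain depth-first search. This is a swapped-in technique, not the networkx method above, and it assumes the edges form a tree or forest with no cycles. The edge list is copied from the sample data; the names children, root_to_leaf_paths and rows are made up for illustration:

```python
from collections import defaultdict

# (Parent, Child) pairs from the sample data in the question.
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "e"),
         ("c", "f"), ("f", "g"), ("d", "h")]

children = defaultdict(list)  # parent -> list of its children
has_parent = set()            # every node that appears as a child
for parent, child in edges:
    children[parent].append(child)
    has_parent.add(child)

# Roots are parents that never appear as a child.
roots = [node for node in children if node not in has_parent]

def root_to_leaf_paths(node, prefix=()):
    """Yield every path from node down to a leaf (assumes no cycles)."""
    path = prefix + (node,)
    if node not in children:  # no outgoing edges: this is a leaf
        yield list(path)
        return
    for child in children[node]:
        yield from root_to_leaf_paths(child, path)

paths = [p for root in roots for p in root_to_leaf_paths(root)]

# Pad shorter paths with None so every row has the same width.
width = max(len(p) for p in paths)
rows = [p + [None] * (width - len(p)) for p in paths]
for row in rows:
    print(row)
```

The padded rows can be passed to pd.DataFrame(rows) to get the same shape as the output above.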

Splitting a column into multiple columns

I have a pandas dataframe as below :
| A | Value |
+----------+--------+
|ABC001035 | 34 |
|USN001185 | 45 |
|UCT010.75 | 23 |
|ATC001070 | 21 |
+----------+--------+
I want to split column A (based on its last three characters) into columns X and Y, so that it looks like this:
| A | Value | X | Y |
+----------+--------+---------+-----+
|ABC001035 | 34 | ABC001 | 035 |
|USN001185 | 45 | USN001 | 185 |
|UCT010.75 | 23 | UCT01 | 0.75|
|ATC001070 | 21 | ATC001 | 070 |
+----------+--------+---------+-----+
How can I split column A?
You can index all strings in a series with the .str accessor:
>>> df['X'] = df['A'].str[:-3]
>>> df['Y'] = df['A'].str[-3:]
>>> df
A Value X Y
0 ABC001035 34.0 ABC001 035
1 USN001185 45.0 USN001 185
2 UCT010.75 23.0 UCT010 .75
3 ATC001070 21.0 ATC001 070
Split your problem into smaller ones, easier to solve! :)
How to split a string (take the last 3 characters):
'Hello world!'[-3:]
# Returns: ld!
How to apply a function over a DataFrame value?
df.A.apply(lambda x: x[-3:])
# Returns pandas.Series: [035, 185, .75, 070]
How to save a Series to a new DataFrame column?
# Create Y column.
df['Y'] = df.A.apply(lambda x: x[-3:])
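As a quick check of the slicing itself, here is a small sketch on plain Python strings, using the values of column A from the question (the names codes and split_codes are made up for illustration). The .str accessor and the lambda above apply the same slices row by row:

```python
# Values of column A from the question.
codes = ["ABC001035", "USN001185", "UCT010.75", "ATC001070"]

# s[:-3] keeps everything except the last three characters;
# s[-3:] keeps only the last three characters.
split_codes = [(s[:-3], s[-3:]) for s in codes]

for x, y in split_codes:
    print(x, y)
```

Because the split is purely positional, 'UCT010.75' becomes 'UCT010' and '.75'; producing 'UCT01' and '0.75' would need a different rule for that row.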

LibreOffice/Excel Table Calculation Formula

I have three columns in one sheet. Col1 contains a mix of the values listed in Col2. I need to replace each Col1 value with the Col3 value from the row where Col2 matches it.
Is there a formula to do this in LibreOffice Calc?
Actual Table:
Col1 | col2 | Col3
A | A | X
C | B | Y
A | C | Z
B | |
A | |
B | |
C | |
A | |
C |
B |
Expected Output:
Col1 | col2 | Col3
X | A | X
Z | B | Y
X | C | Z
Y | |
X | |
Y | |
Z | |
X | |
Z |
Y |
Thanks in advance, I have been struggling with this for days.
Basically, this is a workaround. You want to change A->X, B->Y and C->Z in Col1. Create a Col4 with the formula
=CHAR(CODE(A1)+23)
This shifts the character code of A by 23, which gives X; likewise B becomes Y and C becomes Z.
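The arithmetic behind the offset can be checked with a short Python sketch, where chr and ord play the role of CHAR and CODE. The second half shows a lookup built from the Col2 and Col3 values in the question; it is a hypothetical alternative for when the pairs are not a fixed distance apart, not part of the formula above:

```python
# Col1 values from the question.
col1 = ["A", "C", "A", "B", "A", "B", "C", "A", "C", "B"]

# Same idea as =CHAR(CODE(A1)+23): shift each character code by 23.
shifted = [chr(ord(value) + 23) for value in col1]

# Alternative sketch: an explicit lookup from Col2 to Col3, which also
# works when the replacement letters are not at a constant offset.
lookup = dict(zip(["A", "B", "C"], ["X", "Y", "Z"]))
mapped = [lookup[value] for value in col1]

print(shifted)
print(mapped)
```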

Automatically calculate (or delete) rows in Excel when first column is changing

I have a big table where the first column, X, is the "input column", and its range keeps changing.
The Y columns contain formulas and functions (VLOOKUP) that use the first column, X, as the lookup value; the other columns are calculated from other sheets.
| A | B | C | D | E
1 | X | Y | Y | Y | Y
2 | X | Y | Y | Y | Y
3 | X | Y | Y | Y | Y
4 | X | Y | Y | Y | Y
I insert (and delete) X values (the actual data) and then double-click the fill handle to calculate all the Y columns, BUT this does not work well because the X range is not always the same. I tried converting the range to a table with Ctrl-T, but that did not work very well for me. Maybe I am not using it properly.
Problem:
If I paste a new X column, I need the Y columns to be calculated automatically, and if I delete a few X rows, the corresponding Y values should be deleted too. Right now I get something like this:
| A | B | C | D | E
1 | X | Y | Y | Y | Y
2 | X | Y | Y | Y | Y
3 | | N/A | N/A | N/A | N/A
4 | | N/A | N/A | N/A | N/A
or:
| A | B | C | D | E
1 | X | Y | Y | Y | Y
2 | X | Y | Y | Y | Y
3 | X | | | |
What I need:
If I remove an X value, the Y values should disappear automatically:
| A | B | C | D | E
1 | X | Y | Y | Y | Y
2 | X | Y | Y | Y | Y
If I add an X value, the Y values should be calculated automatically:
| A | B | C | D | E
1 | X | Y | Y | Y | Y
2 | X | Y | Y | Y | Y
3 | X | Y | Y | Y | Y
Hope it's clear, thank you!
For the Y columns, you can wrap the formula in IF:
=IF(A1>0,*Y COLUMN FORMULA*,"")
or try changing the formula to
=IFERROR(*Y COLUMN FORMULA*,"")
If that is still slow and you only ever change the X column,
you can use the code below:
Private Sub Worksheet_Change(ByVal Target As Range)
    If Target.Column = 1 And Target.Count = 1 Then 'CHECK IF THERE IS ANY CHANGE IN THE X COLUMN
        If Target.Value = Empty Then 'CHECK IF THE X VALUE HAS BEEN DELETED
            Rows(Target.Row).Delete 'IF THE X VALUE IS DELETED, DELETE THE WHOLE ROW
        Else
            'IF AN X VALUE IS ENTERED OR MODIFIED, COPY THE Y COLUMN FORMULAS FROM THE ROW ABOVE
            Cells(Target.Row - 1, 2).Resize(1, 4).Copy Cells(Target.Row, 2).Resize(1, 4)
        End If
    End If
End Sub

Search a string in multiple columns

I am using the following formula:
=IF(ISERROR(LOOKUP(2^15;SEARCH(MID(A1;1;9);$D$1:$D$100)));"No";"Yes")
This works perfectly!
Question: I want to search within $D$1:$E$100 instead of only column D. How can I modify the formula to search in two columns?
The easiest way is probably to AND the results of a search in each column. This translates to "if not found in D and not found in E then output No". The logic is as follows:
In column D | ISERROR(lookup in D) | In Column E | ISERROR(lookup in E) | result
N | Y | N | Y | No
N | Y | Y | N | Yes
Y | N | N | Y | Yes
Y | N | Y | N | Yes
=IF(AND(ISERROR(LOOKUP(2^15,SEARCH(MID(A1,1,9),$D$1:$D$100))),
ISERROR(LOOKUP(2^15,SEARCH(MID(A1,1,9),$E$1:$E$100)))),"No","Yes")
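The truth table above can be modelled with a short Python sketch. The helper names found_in and yes_no and the sample cell values are made up for illustration. The sketch assumes a plain case-insensitive substring match, so it does not reproduce the wildcard handling of SEARCH:

```python
def found_in(needle, cells):
    """True if needle occurs in any cell (case-insensitive, like SEARCH)."""
    needle = needle.lower()
    return any(needle in str(cell).lower() for cell in cells)

def yes_no(a1, col_d, col_e):
    key = a1[:9]                           # MID(A1,1,9): first nine characters
    not_in_d = not found_in(key, col_d)    # ISERROR(lookup in D)
    not_in_e = not found_in(key, col_e)    # ISERROR(lookup in E)
    # "No" only when the key is missing from both columns.
    return "No" if (not_in_d and not_in_e) else "Yes"

# Hypothetical cell contents covering the four rows of the table.
hit, miss = ["xx ABC001035 yy"], ["nothing here"]
for col_d, col_e in [(miss, miss), (miss, hit), (hit, miss), (hit, hit)]:
    print(yes_no("ABC001035-extra", col_d, col_e))
```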
