Columns total for categorical data - calculated-columns

How do I obtain the totals for each column (100 columns in total) in the data frame? My data is qualitative.
For example
ID1 ID2 ID3 ID4 ID5 ID100
Y N
Y Y
N N
N Y
I want to find the totals per column (how many Ys and Ns) in ID1, ID2, and so on.
I have tried the following code:
colSums(mydata[,[1:ncol(mydata)]
thanks in advance

If you say which package you're using for the array, I can be more specific. Here's a general solution that assumes your array is a 'list of lists', where each column is a list inside a larger list.
def sum_array(array):
    """Print the number of Y and N entries for each column."""
    for index, column in enumerate(array):
        y_count = 0
        n_count = 0
        for cell in column:
            if cell == "Y":
                y_count += 1
            elif cell == "N":
                n_count += 1
            else:
                raise TypeError("bad entry")
        # Print the counts once the whole column has been read
        print("column", index, "has Y:", y_count, "and N:", n_count, "entries.")
If you can provide more information, I'll try to help you with a more specific solution.
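The same per-column count can also be sketched with `collections.Counter` from the standard library. This is only an illustration: it assumes the data is held as a dictionary mapping column names to lists of cells, and the sample values and the helper name `count_values` are made up for the example.

```python
from collections import Counter

def count_values(columns):
    """Return {column_name: Counter of cell values} for every column."""
    return {name: Counter(cells) for name, cells in columns.items()}

# Hypothetical sample data (illustrative only, not the asker's real data)
mydata = {
    "ID1": ["Y", "Y", "N", "N"],
    "ID2": ["N", "Y", "N", "Y"],
}

totals = count_values(mydata)
for name, counts in totals.items():
    # A Counter returns 0 for values that never occur
    print(name, "Y:", counts["Y"], "N:", counts["N"])
```

A `Counter` also picks up unexpected values (blanks, typos) instead of raising, which may or may not be what you want.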

Related

I am finding it hard to understand Boolean Logic in code

The code below has the Python code with Boolean expressions. Please help me understand the logic in them and how they are bringing out the different results.
#Code 1
for row in range(7):  # Code to print rows
    for col in range(5):  # Code to print columns
        if ((col == 0 or col == 4) or row != 0):
            print("*", end = "")
        else:
            print(end = " ")
    print()
#Code 2
for row in range(7):  # Code to print rows
    for col in range(5):  # Code to print columns
        if ((col == 0 and col == 4) and row != 0):
            print("*", end = "")
        else:
            print(end = " ")
    print()
#Code 3
for row in range(7):  # Code to print rows
    for col in range(5):  # Code to print columns
        if ((col == 0 or col == 4) and row != 0):
            print("*", end = "")
        else:
            print(end = " ")
    print()
#Code 4
for row in range(7):  # Code to print rows
    for col in range(5):  # Code to print columns
        if ((col == 0 and col == 4) or row != 0):
            print("*", end = "")
        else:
            print(end = " ")
    print()
The four codes have in common:
Two nested loops go through 7x5 = 35 different cases.
For each case, a Boolean expression is evaluated. Depending on the result, a '*' is printed for true and a gap/space for false.
In English, the four Boolean expressions can be described as follows:
1: ((col == 0 or col == 4) or row!=0)
This is true when col is 0 or 4, or when row is not equal to 0.
In other words: in the first row, only the two outer columns print '*'.
For the remaining six rows, the outcome is '*' for all columns.
2: ((col == 0 and col == 4) and row!=0)
This can only be true for the last six rows.
But col cannot have two different values at the same time.
Therefore, the expression is always false. It is a contradiction.
3: ((col == 0 or col == 4) and row!=0)
This can only be true for two of the five columns.
It is false for the first row.
4: ((col == 0 and col == 4) or row!=0)
The col part is a contradiction and thus always false.
But the row part is true for the last six rows.
Therefore, one blank row followed by six rows of '*' is printed.
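One way to check the four descriptions is to build each pattern as a list of strings instead of printing it. The sketch below is only an illustration; the helper name `render` and the `expressions` dictionary are introduced here and are not part of the question.

```python
def render(predicate, rows=7, cols=5):
    """Return the pattern as a list of strings, one string per row."""
    return [
        "".join("*" if predicate(row, col) else " " for col in range(cols))
        for row in range(rows)
    ]

# The four Boolean expressions from the question, keyed by code number
expressions = {
    1: lambda row, col: (col == 0 or col == 4) or row != 0,
    2: lambda row, col: (col == 0 and col == 4) and row != 0,
    3: lambda row, col: (col == 0 or col == 4) and row != 0,
    4: lambda row, col: (col == 0 and col == 4) or row != 0,
}

patterns = {number: render(expr) for number, expr in expressions.items()}

for number, lines in patterns.items():
    print("Code", number)
    for line in lines:
        # Show the row between bars so blank rows stay visible
        print("|" + line + "|")
```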

Python and Pandas, find rows that contain value, target column has many sets of ranges

I have a messy dataframe where I am trying to "flag" the rows that contain a certain number in the ids column. The values in this column represent inclusive ranges: for example, "row 4" contains the following numbers:
2409,2410,2411,2412,2413,2414,2377,2378,1478,1479,1480,1481,1482,1483,1484
In "row 0" and "row 1", the range for one of the sets is backwards (1931,1930,1929).
If I want to know which rows have sets that contain "2340" and "1930", for example, how would I do this? I think a loop is needed, because I will sometimes need to query more than just two numbers. Using Python 3.8.
Example Dataframe
import pandas as pd
x = ['1331:1332,1552:1551,1931:1928,1965:1973,1831:1811,1927:1920',
     '1331:1332,1552:1551,1931:1929,180:178,1966:1973,1831:1811,1927:1920',
     '2340:2341,1142:1143,1594:1593,1597:1596,1310,1311',
     '2339:2341,1142:1143,1594:1593,1597:1596,1310:1318,1977:1974',
     '2409:2414,2377:2378,1478:1484',
     '2474:2476',
     ]
y = [6.48, 7.02, 7.02, 6.55, 5.99, 6.39]
df = pd.DataFrame(list(zip(x, y)), columns=['ids', 'val'])
display(df)  # available in Jupyter/IPython; use print(df) elsewhere
Desired Output Dataframe
I would write a function that performs two steps:
Given the ids_string that contains the ranges of ids, list all the ids as ids_num_list.
Check if the query_id is in ids_num_list.
def check_num_in_ids_string(ids_string, query_id):
    # Convert ids_string to ids_num_list
    ids_range_list = ids_string.split(',')
    ids_num_list = set()
    for ids_range in ids_range_list:
        if ':' in ids_range:
            # Sort the bounds as integers, not strings, so reversed ranges
            # also work when the bounds differ in length (e.g. 999:1001)
            lower, upper = sorted(int(n) for n in ids_range.split(":"))
            ids_num_list.update(range(lower, upper + 1))
        else:
            ids_num_list.add(int(ids_range))
    # Check if query number is in the list
    if int(query_id) in ids_num_list:
        return 1
    else:
        return 0
# Example usage
query_id_list = ['2340', '1930']
for query_id in query_id_list:
    df[f'n{query_id}'] = (
        df['ids']
        .apply(lambda x : check_num_in_ids_string(x, query_id))
    )
which returns the output you asked for:
ids val n2340 n1930
0 1331:1332,1552:1551,1931:1928,1965:1973,1831:1... 6.48 0 1
1 1331:1332,1552:1551,1931:1929,180:178,1966:197... 7.02 0 1
2 2340:2341,1142:1143,1594:1593,1597:1596,1310,1311 7.02 1 0
3 2339:2341,1142:1143,1594:1593,1597:1596,1310:1... 6.55 1 0
4 2409:2414,2377:2378,1478:1484 5.99 0 0
5 2474:2476 6.39 0 0
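If many numbers need to be queried, it may be cheaper to expand each row's ranges once and reuse the resulting sets. The following is a minimal sketch of that idea in plain Python, using the six strings from the question; the names `expand_ids`, `expanded`, and `flags` are introduced here for illustration.

```python
def expand_ids(ids_string):
    """Expand a string such as '1931:1929,1310' into a set of ints."""
    ids = set()
    for part in ids_string.split(","):
        # Compare the bounds as integers so reversed ranges work;
        # a single id such as '1310' yields one bound and a range of one
        bounds = [int(piece) for piece in part.split(":")]
        ids.update(range(min(bounds), max(bounds) + 1))
    return ids

rows = [
    "1331:1332,1552:1551,1931:1928,1965:1973,1831:1811,1927:1920",
    "1331:1332,1552:1551,1931:1929,180:178,1966:1973,1831:1811,1927:1920",
    "2340:2341,1142:1143,1594:1593,1597:1596,1310,1311",
    "2339:2341,1142:1143,1594:1593,1597:1596,1310:1318,1977:1974",
    "2409:2414,2377:2378,1478:1484",
    "2474:2476",
]
query_ids = [2340, 1930]

# Expand every row once, then test each query against the cached sets
expanded = [expand_ids(row) for row in rows]
flags = {q: [int(q in ids) for ids in expanded] for q in query_ids}
print(flags)
```

With the dataframe from the question, the same helper could be applied once with df['ids'].apply(expand_ids) and the resulting sets reused for every query.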

python matrix multiplication check if number of rows of 1st matrix is equal to number of columns of 2nd matrix

I need to perform a matrix multiplication between two matrices taken from user input. The code below works fine for the multiplication part, but if the number of rows of the 1st matrix is not equal to the number of columns of the 2nd matrix, it should print NOT POSSIBLE and exit. Instead, it still goes on to add the elements of the matrices. What is wrong in this code, and how can I fix it? Any help would be greatly appreciated.
def p_mat(M,row_n,col_n):
    for i in range(row_n):
        for j in range(col_n):
            print(M[i][j],end=" ")
        print()
def mat_mul(A_rows,A_cols,A,B_rows,B_cols,B):
    if A_cols != B_rows:
        print("NOT POSSIBLE")
    else:
        C = [[0 for i in range(B_cols)] for j in range(A_rows)]
        for i in range(A_rows) :
            for j in range(B_cols) :
                C[i][j] = 0
                for k in range(B_rows) :
                    C[i][j] += A[i][k] * B[k][j]
        p_mat(C, A_rows, B_cols)
if __name__== "__main__":
    A_rows = int(input("Enter number of rows of 1st matrix: "))
    A_cols = int(input("Enter number of columns of 1st matrix: "))
    B_rows = int(input("Enter number of rows of 2nd matrix: "))
    B_cols = int(input("Enter number of columns of 2nd matrix: "))
    ##### Initialization of matrix A and B #####
    A = [[0 for i in range(B_cols)] for j in range(A_rows)]
    B = [[0 for i in range(B_cols)] for j in range(A_rows)]
    print("Enter the elements of the 1st matrix: ")
    for i in range(A_rows):
        for j in range(A_cols):
            A[i][j] = int(input("A[" + str(i) + "][" + str(j) + "]: "))
    print("Enter the elements of the 2nd matrix: ")
    for i in range(B_rows):
        for j in range(B_cols):
            B[i][j] = int(input("B[" + str(i) + "][" + str(j) + "]:"))
    ##### Print the 1st & 2nd matrices #####
    print("First Matrix : ")
    p_mat(A,A_rows,A_cols)
    print("Second Matrix : ")
    p_mat(B,B_rows,B_cols)
    ### Function call to multiply the matrices ###
    mat_mul(A_rows,A_cols,A,B_rows,B_cols,B)
For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix.
If you really want to compare the number of rows of the 1st matrix with the number of columns of the 2nd matrix, change if A_cols != B_rows to if A_rows != B_cols.
Your current code prints NOT POSSIBLE when A_cols != B_rows, which is the correct check.
Ex.
Enter number of rows of 1st matrix: 2
Enter number of columns of 1st matrix: 3
Enter number of rows of 2nd matrix: 2
Enter number of columns of 2nd matrix: 3
Enter the elements of the 1st matrix:
A[0][0]: 1
A[0][1]: 2
A[0][2]: 3
A[1][0]: 4
A[1][1]: 5
A[1][2]: 6
Enter the elements of the 2nd matrix:
B[0][0]:1
B[0][1]:2
B[0][2]:3
B[1][0]:4
B[1][1]:5
B[1][2]:6
First Matrix :
1 2 3
4 5 6
Second Matrix :
1 2 3
4 5 6
NOT POSSIBLE
Another mistake in the code is in how you initialize the matrices. You are doing
A = [[0 for i in range(B_cols)] for j in range(A_rows)]
B = [[0 for i in range(B_cols)] for j in range(A_rows)]
If B_cols is smaller than A_cols, adding elements to A raises an IndexError.
Likewise, if B_rows is greater than A_rows, adding elements to B raises an IndexError, because B was created with only A_rows rows.
Change it to
A = [[0 for i in range(A_cols)] for j in range(A_rows)]
B = [[0 for i in range(B_cols)] for j in range(B_rows)]
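To stop before any elements are entered, the dimension check can be done as soon as the four sizes are known. The sketch below is only one possible arrangement: `dimensions_compatible` and this list-returning `mat_mul` are hypothetical helpers, and the matrices are hard-coded instead of read with input() so the example runs on its own.

```python
def dimensions_compatible(a_cols, b_rows):
    """A x B is defined only when columns of A equal rows of B."""
    return a_cols == b_rows

def mat_mul(A, B):
    """Multiply two matrices given as lists of rows and return the product."""
    a_rows, a_cols = len(A), len(A[0])
    b_rows, b_cols = len(B), len(B[0])
    if not dimensions_compatible(a_cols, b_rows):
        raise ValueError("NOT POSSIBLE")
    return [
        [sum(A[i][k] * B[k][j] for k in range(a_cols)) for j in range(b_cols)]
        for i in range(a_rows)
    ]

# Sample matrices (illustrative values): 2x3 times 3x2 gives 2x2
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 2], [3, 4], [5, 6]]
product = mat_mul(A, B)
print(product)
```

In the interactive script, calling dimensions_compatible(A_cols, B_rows) right after the four size prompts and exiting when it returns False would avoid asking for elements that can never be multiplied.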

Inserting Elements into a 2 Dimensional List

elements = []
i,j = 0,0
while(i<3):
    while(j<3):
        elements[i][j] = int(input())
        j+=1
    i+=1
    j=0
print(elements)
I'm trying to insert elements into a 2-dimensional list by getting the input from the user. I'm unable to do so; it's giving me an IndexError.
IndexError: list assignment index out of range
I'm expecting a 3x3 list.
Something like :
elements = [
[0,1,2],
[3,4,5],
[6,7,8]
]
What am I doing wrong here? [I do not wish to use Numpy or other libraries atm]
The problem in your case is that the list is initialized as an empty list, with size 0. So when you try to set a value at some index, Python raises an error saying the index is out of range, because that index doesn't exist yet.
My approach builds the list in place by appending values.
Get size as input first
>>> rows = int(input("Enter no. of rows: "))
Enter no. of rows: 2
>>> cols = int(input("Enter no. of Columns: "))
Enter no. of Columns: 2
Create a list and loop through ranges
>>> l = []
>>> for i in range(rows):
...     row_vals = []
...     for j in range(cols):
...         row_vals.append(int(input(f"Enter value at {i}th row and {j}th column: ")))
...     l.append(row_vals)
...
Enter value at 0th row and 0th column: 0
Enter value at 0th row and 1th column: 1
Enter value at 1th row and 0th column: 1
Enter value at 1th row and 1th column: 0
>>> l
[[0, 1], [1, 0]]
This will solve your problem:
elements = []
i, j = 0,0
while(i<3):
    elements.append([])
    while(j<3):
        elements[i].append(int(input()))
        j+=1
    i+=1
    j = 0
print(elements)
The points:
Lists in Python do not grow automatically when you assign to an index that does not exist; you have to build the list first.
You forgot to zero the "j" counter, so that it starts correctly in each row.
Cheers!
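An alternative to appending is to create the 3x3 list up front, after which index assignment as in the question works. This is a minimal sketch under that assumption; `build_grid` is a hypothetical helper, and the values come from range(9) instead of input() so the example runs unattended.

```python
def build_grid(values, rows=3, cols=3):
    """Pre-allocate a rows x cols list of lists, then fill it by index."""
    # Each row is created separately, so the rows are independent lists
    grid = [[0 for _ in range(cols)] for _ in range(rows)]
    source = iter(values)
    for i in range(rows):
        for j in range(cols):
            grid[i][j] = next(source)  # valid now: the slot already exists
    return grid

elements = build_grid(range(9))
print(elements)

# Pitfall: multiplying the outer list repeats the SAME inner list
aliased = [[0] * 3] * 3
aliased[0][0] = 9
print(aliased)  # every row shows the change
```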

Replace values in multiple untitled columns to 0, 1, 2 depending on column

EDITED AS PER COMMENTS
Background: Here is what the current dataframe looks like. The row labels are information texts in the original Excel file, but I hope this small reproduction of the data will be enough for a solution. The actual file has about 100 columns and 200 rows.
Column headers and Row #0 values are repeated with the pattern shown below -- except that the Sales or Validation text changes at every occurrence of a column with an existing title.
There is one more column before Sales, with text in each row. The mapping of Xs was done for this test. Unfortunately, I found no elegant way of displaying that text as part of the output below.
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 x x
2 x x
3 x x
Expected Output: Replacing the X with 0s, 1s and 2s depending on which column they are in (Commented / No Comment)
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 0 1
2 2 0
3 1 2
Possible Code: I assume the loop would look something like this:
while in row 9:
    if column value = "commented":
        replace all "x" with 1
    elif row 9 when column value = "no comment":
        replace all "x" with 2
    else:
        replace all "x" with 0
But being a python novice, I am not sure how to convert this to a working code. I'd appreciate all support and help.
Here is one way to do it:
Define a function to replace the x:
import re
def replaceX(col):
    # True where the cell is NOT an "x"; Series.where keeps those cells
    cond = ~((col == "x") | (col == "X"))
    # Titled column: its name does not look like "Unnamed: N"
    if not re.match(r'Unnamed: \d+', col.name):
        return col.where(cond, 0)
    else:
        # Check what is the value of the first row
        if col.iloc[0] == "Commented":
            return col.where(cond, 1)
        elif col.iloc[0] == "No comment":
            return col.where(cond, 2)
    return col
Or, if the first row never contains "Commented" or "No comment" for titled columns, you can use a solution without regex:
def replaceX(col):
    cond = ~((col == "x") | (col == "X"))
    # Check what is the value of the first row
    if col.iloc[0] == "Commented":
        return col.where(cond, 1)
    elif col.iloc[0] == "No comment":
        return col.where(cond, 2)
    return col.where(cond, 0)
Apply this function on the DataFrame:
# Apply the function to every column (axis defaults to 0, i.e. column by column)
df.apply(lambda col: replaceX(col))
Output:
title Unnamed: 2 Unnamed: 3
0 Commented No comment
1
2 0 2
3 1
Documentation:
Apply: applies a function to every column or row, depending on the axis.
Where: keeps the values of a Series where a condition is met and replaces them with the specified value where it is not.
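To see the replacement rule on its own, here is a plain-Python sketch of the same logic as the regex version above, without pandas. It assumes the columns are held as a dictionary of lists; the sample cells and the names `replacement_code` and `replace_x` are hypothetical and only serve the illustration.

```python
import re

def replacement_code(header, first_row_value):
    """Pick the number that replaces 'x' in a column, or None to skip it."""
    if not re.match(r"Unnamed: \d+", header):
        return 0  # titled column such as "Sales"
    if first_row_value == "Commented":
        return 1
    if first_row_value == "No comment":
        return 2
    return None  # unnamed column with an unknown first row: leave as is

def replace_x(columns):
    """Return a copy of the columns with every 'x' or 'X' replaced."""
    result = {}
    for header, cells in columns.items():
        code = replacement_code(header, cells[0])
        result[header] = [
            code if code is not None and cell in ("x", "X") else cell
            for cell in cells
        ]
    return result

# Hypothetical sample data (illustrative only)
columns = {
    "Sales": ["", "x", "", ""],
    "Unnamed: 2": ["Commented", "", "", "x"],
    "Unnamed: 3": ["No comment", "", "X", ""],
}
replaced = replace_x(columns)
print(replaced)
```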
