Percentage of values when one column has values and other column is null - python-3.x

May be this is the duplicate of other question but I am not able to solve the problem.
I have transaction data having 100 features and 2.3 million rows. I want to find percentage of values present in one column and Null in other column for every combination of columns.
Example:
A B C D
1 NA 2 3
2 4 5 6
NA 5 6 7
8 2 NA NA
9 8 7 6
So output should be:
When A has values B has Null 1/4=0.25 times
When A has values C has Null 1/4=0.25 times
Similarly for every other combination of columns and create a dataframe for it.
I tried combination of columns function in Python but it's not giving the desired result.
itertools.combinations(daf.columns, n)

You can write 2 for loops to iterate for individual columns and then compare.

Related

How to write function to extract n+1 column in pandas

I have a excel file with 200 columns. The first column is no. of visits, and other columns are the data with number of people for that number of visits
Visits A B C D
2 10 0 30 40
3 5 6 0 1
4 2 3 1 0
I want to write a function so that I have multiple dataframes with Visit column and A; visit column and B, and so on (I want to write a function, as the number of columns will increase in the future and I want to automatize the process). Also, I want to remove the data with 0.
Desired output:
dataframe 1:
visits A
dataframe 2:
Visits B
3 6
4 3
This is my first question. So sorry, if it is not properly framed. Thank you for your help.
Use DataFrame.items:
for i,col in df.set_index('visits').items():
print(col[col.ne(0)].to_frame(i).reset_index())
You can create a dict to save by the name of columns
dfs={i:col[col.ne(0)].to_frame(i).reset_index() for i,col in df.set_index('visits').items()}

Subtract a subset of columns from a key column in Pandas Pivot

I have a pivot table with multiple columns of data in a time series:
A B C D
11/1/2018 1 5 5 7
11/2/2018 2 6 6 8
11/3/2018 3 7 7 9
The values in the data columns are not important for this example. I would like to subtract the value in the "key" column (column A in this case) from a subset of columns: B & C in this case. I would then like to drop any columns not in the subset or the key column. Result would be:
A B C
11/1/2018 1 4 4
11/2/2018 2 4 4
11/3/2018 3 4 4
I have subtracted columns in the past via code like this:
df['dif'] = df['B'] -df['A']
But this will add the "dif" column. I would like to replace column B with B-A values. Also, instead of passing the instructions one at a time (B-A, C-A), would like to pass the list something like "if column in list, subtract key column, else drop column."
Thanks
pandas.DataFrame.sub with axis=0
When subtracting a Series from a DataFrame Pandas will align the columns of the DataFrame with the index of the Series by default. This is what happens when you use the - operator. However, when you use the pandas.DataFrame.sub method, you can override that default and specify that the DataFrame should align its index with the index of the Series.
def f(d, key, subset):
return d[[key]].join(d[subset].sub(d[key], axis=0))
f(df, 'A', ['B', 'C'])
A B C
11/1/2018 1 4 4
11/2/2018 2 4 4
11/3/2018 3 4 4
You can use apply to substract A from the subset columns that you choose and finally join again with A.
df['A'].to_frame().join(df[['B','C']].apply(lambda x: x - df['A']))
A B C
11/1/2018 1 4 4
11/2/2018 2 4 4
11/3/2018 3 4 4

Excel using SUMIF to calculate totals of multiple columns

I'm trying to use Excle's SUMIF to calculate totals of Col1 to Col5 for dates that are similar.
My formula is as follows =SUMIF($A2:$A7,A10,$B2:$F7), but this only gives me the total of a single column.
How can I get the Totals of all the columns based on the date like I've shown in my results.
Date Col 1 Col 2 Col 3 Col 4 Col 5
1/5/2017 1 2 2
1/5/2017 5 3 1
1/5/2017 9 5 5
2/5/2017 10 5 3
2/5/2017 20 10 3
2/5/2017 6 8 1 5
Desired Results
1/5/2017 15 7 7 3 1
2/5/2017 30 11 11 11 8
use below formula in cell B11
=SUMIF($A$2:$A$7,$A11,B$2:B$7)
Per the example you provided, One solution is to use SUMPRODUCT
Multiplies corresponding components in the given arrays, and returns the sum of those products
Microsoft Docs give a thorough example, but per SO etiquette, here is an example in case of link-rot: [FYI, I used absolute reference for easier filling across, arbitrary how you get it done though]
Forumlas shown:
Formula is kind of hard to see without clicking on image:
=SUMPRODUCT(($B$3:$B$8=$B$11)*C3:C8)
This basically breaks down like this, it searches the B:B column for a match, and it will naturally return a true or false for the match, or 0/1 counterparts, and multiplys that by the number found in the column to the right (C3:C8), so it will either be 1 * # = # or 0 * # = 0

vlookup with multiple resulting values

I have a table array that looks like this:
A B
1 2
1 3
1 9
2 3
2 4
2 11
2 23
2 56
3 7
4 13
My VLOOKUP formula is to check for 1 in column A and then return the corresponding B value. Is there anyway I can get all the values for 1? Currently it just returns back the last corresponding number for 1 i.e. 9 in column B.
You can Pivot that data, or you can try INDEX and MATCH together, perhaps even an IF command with it.

Find summation and count only if they are EQUAL in Excel

In EXCEL sheet I have 1728 rows and 2 columns (L and O). I am doing addition of these 2 columns in column P. Further I want to count the occurrence in this column if addition is EQUAL to 2 or 4 or 6 or 8 BUT condition here is that The COUNT should be such that BOTH the columns L and O are EQUAL and Their addition is either 2 or 4 or 6 or 8.
This means that only the columns in L and O with values "1+1" , "2+2", "3+3", "4+4" should be counted. The addition of "1+3", "4+2" should not be counted.
=COUNTIF(P:P,4)
does not work.
L O P M
===========================
1 1 2 1 (NO OF 2'S)
2 2 4 1 (NO OF 4'S)
3 3 6 1 (NO OF 6'S)
1 3 4* NO TO BE COUNTED
4 4 8 1 (NO OF 8'S)
2 4 6* NOT TO BE COUNTED
4 2 6*
AS SEEN ABOVE RESULT OF COUNTING IS STORED IN M. Let me know the formula
=IF(L29=M29,SUMPRODUCT(--($L$29:$L$35=$M$29:$M$35)*(L29=$L$29:$L$35)),"Not Counted")
My data started in row 29 so you will need to adjust the references. It counts the entire table in 1 shot. So if you added a row to the bottom that had 1 and 1 and 2, the results in column M in your first row would become 2 and the same for the row you just added.
Will this formula help...?
=IF(AND(A1=B1,OR(SUM(A1,B1)=2,SUM(A1,B1)=4,SUM(A1,B1)=6,SUM(A1,B1)=8)),SUM(A1,B1),"NOT TO BE COUNTED")
Just drag the formula till you have data. You will need to adjust the references.
Here is the reference data.

Resources