Trying to ignore zero values in Excel column

In an Excel spreadsheet I am working on there are two columns of interest, column B and column E. Column B contains some 0 values, and these are getting carried over to column E by the loop I am running with respect to column D. I want to write a Python script that ignores these 0s and instead picks the next-highest value by frequency into column E. The data currently looks like this:
   12NC          ModifiedSOCwrt12NC  SOC
0  232270463903  0                   0
1  232270463903  0                   0
2  232270463903  0                   0
3  232270463903  0                   0
4  232270463903  0                   RC0603FR-0738KL
5  232270463903  0                   RC0603FR-0738KL
6  232270463903  0                   RC0603FR-0738KL
I want to run a loop that picks the non-zero values from SOC (column B) and carries them over to ModifiedSOCwrt12NC (column E) based on the unique values in column D.
For example, column B holds the values [0, RCK2] across multiple rows that share a unique value in column D. The current loop picks whichever value in column B occurs most often and fills it into column E. If the occurrence counts of 0 and RCK2 are tied, it picks 0 per ASCII ordering, which I don't want; I want the code to pick RCK2 and fill that into column E.
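A minimal pandas sketch of that selection rule (assuming column D is the 12NC column and that the zeros in SOC are real 0 values rather than the string '0'; adjust the filter if not, and the file name here is hypothetical):

import pandas as pd

def most_frequent_nonzero(s):
    # Drop the zeros before counting, so a 0/RCK2 tie resolves to RCK2.
    s = s[(s != 0) & (s != '0')]
    return s.value_counts().idxmax() if not s.empty else 0

df = pd.read_excel("input.xlsx")  # hypothetical file name
df['ModifiedSOCwrt12NC'] = df.groupby('12NC')['SOC'].transform(most_frequent_nonzero)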

Since your data is not accessible, I have created test data similar to the one below.
We can read the data in pandas:
import pandas as pd
df = pd.read_excel("ExcelTemplate.xlsx")
df
  Index  SOC  Index2   12NC
0 YXGMY    0  ZJIZX   23445
1 NQHQC    0  JKJKT   23445
2 MWTLY    0  EFCYD   23445
3 RPQFE   AC  VLOJZ   23445
4 GPLUQ   AC  AKKKG   23445
5 WGYYM   AC  DSMLO   23445
6 XGTAQ    0  ZHGWS   45667
7 AMWDT    0  YROLO   45667
The following code does the summarization:
1. Summarize the data on 12NC and SOC and take the count.
2. Sort by 12NC, count, and SOC, with the highest count first.
3. Take the first value of SOC for each 12NC.
4. Merge with the original data to create column E.
5. Export back to Excel.
df1 = df.groupby(['12NC', 'SOC'])['Index'].count().reset_index()
df = df.merge(
    df1[df1['SOC'] != 0]
        .sort_values(by=['12NC', 'Index', 'SOC'], ascending=[True, False, True])
        .drop_duplicates(subset=['12NC'], keep='first')[['12NC', 'SOC']]
        .rename(index=str, columns={'SOC': 'ModifiedSOCwrt12NC'}),
    on=['12NC'], how='left')
df.to_excel("ExcelTemplate_modified.xlsx", index=False)
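If the chained version above reads as dense, the same steps can be written out one at a time (a sketch over the same test columns as above):

# 1. Count rows per (12NC, SOC) pair.
counts = df.groupby(['12NC', 'SOC'])['Index'].count().reset_index(name='count')

# 2. Keep non-zero SOC values only, most frequent first within each 12NC.
counts = counts[counts['SOC'] != 0].sort_values(['12NC', 'count'], ascending=[True, False])

# 3. One winning SOC per 12NC, renamed for column E.
winners = (counts.drop_duplicates('12NC')[['12NC', 'SOC']]
                 .rename(columns={'SOC': 'ModifiedSOCwrt12NC'}))

# 4. Merge back onto the original frame and export.
df = df.merge(winners, on='12NC', how='left')
df.to_excel("ExcelTemplate_modified.xlsx", index=False)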

Related

Randomly select and assign values to given number of rows in python dataframe

Col B contains only 1s and 0s.
Suppose I have a dataframe as below
Col A Col B
A 0
B 0
A 0
B 0
C 0
A 0
B 0
C 0
D 0
A 0
I aim to randomly choose 5% of the rows and change the value of Col B to 1. I saw df.sample(), but that won't let me make in-place changes to the column data.
You can try the random module from the standard library; it has its own sample function.
import random

# Sample 5% of the row positions without replacement.
randindx = random.sample(range(dataframe['Col B'].size), dataframe['Col B'].size // 20)
Considering 5%, you need to divide by 20.
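With those positions in hand, a follow-up sketch to actually write the 1s back (iloc is positional, matching the positions random.sample draws; dataframe is the asker's frame):

dataframe.iloc[randindx, dataframe.columns.get_loc('Col B')] = 1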
You can first use the sample method to get the random 5% of examples and get hold of their indices like so:
samples_indices = df.sample(frac=0.05, replace=False).index
With the indices in hand, the loc method can be used to update the values of the corresponding rows.
df.loc[samples_indices, 'Col B'] = 1
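Put together as a self-contained sketch (100 all-zero rows so that 5% is exactly 5 rows):

import pandas as pd

df = pd.DataFrame({'Col B': [0] * 100})

# Pick 5% of the rows at random and flip them in place.
samples_indices = df.sample(frac=0.05, replace=False).index
df.loc[samples_indices, 'Col B'] = 1

print(df['Col B'].sum())  # 5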

Excel: How do I fill a matrix by MAXIFS comparison?

I have this dataset:
Groups    A  A  B  B
location  a  b  c  d
          3  4  0  5
I also have a transformed version for better clarification:
Groups  location
A       a         3
A       b         4
B       c         0
B       d         5
What I want is a simple matrix filled with binary values.
The function should check, for each row and column, whether the value is the MAXIF of its respective group, and then compare it to the second value, which is the MAXIF of the other group. Therefore the combination b and d has to resolve to 1.
The intended output is as following:
I have a dataset with locations a-n that are grouped into groups A-N. So group A has the locations a, b; group B the locations c, d. The columns represent different features at each location.
I want to build a matrix out of it, but not the "usual" distance matrix; one that incorporates the following points:
- When building the matrix, the maximum values of each group get compared.
- I want to find out whether the value I am looking at is the maximum value in its group and, if so, compare it to the second group's maximum; if this number is larger, set the cell to 1.
- This should automatically fill all fields in the matrix.
I need this for a network analysis of my data, to strip out unneeded connections.
My current input is somewhat like this:
=IF(AND(>0(MAXIF()=value)>(AND(>0(MAXIF()=value);1;0)
How it looks in Excel:
=IF(AND(A$1<>$A7; A$3>0;(MAXIFS($A$3:$D$3;$A$1:$D$1;A$1)=A$3))<(AND(A$1<>$A7; $C7>0;MAXIFS($C$7:$C$10;$A$7:$A$10;$A7)=$C7));1;0)
However, I think internally it does not actually compare values but TRUEs and FALSEs, so connections that are smaller than the MAX are also getting 1s. My current output:
      A  A  B  B
      a  b  c  d
A  a  0  0  0  0
A  b  0  0  1  0
B  c  0  0  0  0
B  d  1  0  0  0
As you can see, the combination of a and d resolves to 1.
The output should look like this (the matrix is, generally speaking, 0; but when beacons like d (5) and b (4) meet, the cell gets a 1, since both are the highest within their group. Only here is there a connection between the two groups):
      A  A  B  B
      a  b  c  d
A  a  0  0  0  0
A  b  1  0  0  1
B  c  0  0  0  0
B  d  0  1  0  0
I understand the problem but don't know how to fix that.
I'm fairly sure this doesn't work properly, but it may help. I've restructured your data slightly to make it easier to write the formula.
The formula in C3 etc is
=IF(AND(MAXIFS($G$3:$G$6,$A$3:$A$6,$A3)=$G3,MAXIFS($C$7:$F$7,$C$1:$F$1,C$1)=C$7,$G3>MAXIFS($G$3:$G$6,$A$3:$A$6,"<>" & $A3)),1,0)
It's just checking if the value in G is the max for the group in column A, and if the value in row 7 is the max for group in row 1, and if the value in G is greater than the max of the other group. If they're all satisfied it inserts a '1'.

Is it possible to manually assign the value of the dummy variable?

I have a data set for automotive sales, and I want to convert the feature 'aspiration', which contains the two unique values 'std' & 'turbo', to categorical values using pd.get_dummies with the code below:
dummy_variable_2 = pd.get_dummies(df['aspiration'])
It is automatically assigning 0 to 'std' & 1 to 'turbo'.
I would like to change 'std' to 1 & 'turbo' to 0.
pd.get_dummies returns a dataframe that contains one column for each unique value in the input, and in each column only the rows with the corresponding unique value are set to one.
In your case, the dataframe contains two columns, one named turbo and one named std (prefixed with the original column name). If you want the column where the values of std are set to one, do the following:
df = pd.DataFrame({"aspiration":["std", "turbo", "std", "std", "std", "turbo"]})
dummies = pd.get_dummies(df)
std= dummies["aspiration_std"]
In this example, the variable dummies looks like:
   aspiration_std  aspiration_turbo
0               1                 0
1               0                 1
2               1                 0
3               1                 0
4               1                 0
5               0                 1
and std looks like:
0 1
1 0
2 1
3 1
4 1
5 0
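If the goal is literally std -> 1 and turbo -> 0 in a single column, mapping the values directly is a simpler sketch than get_dummies (the output column name here is an arbitrary choice):

df['aspiration_flag'] = df['aspiration'].map({'std': 1, 'turbo': 0})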

Pandas groupby value and return observation count to dataset

I have a dataset like the following:
id value
a 0
a 0
a 0
a 0
a 1
a 2
a 2
a 2
b 0
b 0
b 1
b 2
b 2
I want to group by the "id" column and, for each observation in the "value" column, return a new column in the original dataset that counts how many times that "value" occurs within its id.
An example of the output I'm looking for is represented in column "output":
id value output
a 0 4
a 0 4
a 0 4
a 0 4
a 1 1
a 2 3
a 2 3
a 2 3
b 0 2
b 0 2
b 1 1
b 2 2
b 2 2
When grouping on id "a", there are 4 observations of 0, which is provided in the column "output" for each row that contains id of "a" and value of 0.
I have tried applications of groupby and apply, to no avail. Any suggestions would be very helpful. Thank you.
Update: I figured out a solution for anyone who also faces this problem, and it works well.
grouped = df.groupby(['id','value'])
df['output'] = grouped['value'].transform('count')
This will return the count of observations under each bucket and return that count to each observation that meets that criteria, as shown in the "output" column above.
Group by id and value, then count the values:
data['output'] = data.groupby(['id', 'value'])['id'].transform('count')
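As a quick check, a self-contained sketch with the ids and values copied from the question reproduces the expected output column:

import pandas as pd

df = pd.DataFrame({'id': ['a'] * 8 + ['b'] * 5,
                   'value': [0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 1, 2, 2]})
# Count within each (id, value) bucket and broadcast back to every row.
df['output'] = df.groupby(['id', 'value'])['value'].transform('count')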

How to use extractall in Pandas and get a new column with the extracted strings?

I have a data frame of 15 columns from a csv file. I am trying to remove one part of the text of a column and create a new column containing that information on each row. Each row of 'phospho' should have only one match for my extractall pattern. Now, when I try to add the result to my data frame, I get the error:
TypeError: incompatible index of inserted column with frame index
The dataset has two columns with names and six columns with values (like 65.98, for example). One row looks like this (wrapped here for readability):
accession     P18767
sequence      [R].GAAQNIIPASTGAAK.[A]
modification  1xTMT6plex[K15];1xTMT6plex[N-Term]
phospho       1xPhospho [S3(98.3)]
plus the six value columns CON_1, CON_2, CON_3, LIF1, LIF2, LIF3.
Here is the freaking code:
a = pmap1['phospho'].str.extractall(r'([STEHRYD]\d*)')
pmap1['phosphosites'] = a
Thanks!
I created pmap1 using the following sample data:
import pandas as pd

pmap1 = pd.DataFrame(data=[['S34T44X', 1], ['E23H78Y', 2],
                           ['R49Y81Z', 3], ['D20U23X', 4]],
                     columns=['phospho', 'nn'])
When you extract all matches:
a = pmap1['phospho'].str.extractall(r'([STEHRYD]\d*)')
the result is:
           0
  match
0 0      S34
  1      T44
1 0      E23
  1      H78
  2        Y
2 0      R49
  1      Y81
3 0      D20
Note that:
- The result is of DataFrame type (with a single column named 0).
- It contains eight rows, so it is not clear into which source row particular matches should be inserted.
- The index is actually a MultiIndex with 2 levels: the first (unnamed) level is the index of the source row, and the second level (named match) holds the number of the match within the current row. E.g. in the row with index 0 there were found 2 matches: S34 (match 0) and T44 (match 1).
So you cannot directly save a as a new column of pmap1, because pmap1 has an "ordinary" index while a has a MultiIndex that is incompatible with it. This is exactly what the error message says.
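Side note: since the asker expects exactly one match per row, dropping the match level would also make direct assignment possible (a sketch that is only valid when every row really has a single match; with the sample data above it would fail on the duplicate rows):

pmap1['phosphosites'] = a[0].droplevel('match')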
If you want to somehow "add" a to pmap1, you can e.g. "break" each match out into a separate column in the following way:
a2 = a.unstack()
This gives the result:
           0
match      0    1    2
0        S34  T44  NaN
1        E23  H78    Y
2        R49  Y81  NaN
3        D20  NaN  NaN
where the columns are a MultiIndex; to drop its first level, run:
a2.columns = a2.columns.droplevel()
The result is:
match    0    1    2
0      S34  T44  NaN
1      E23  H78    Y
2      R49  Y81  NaN
3      D20  NaN  NaN
Then you can perform the actual join, executing:
pmap1.join(a2)
The result is:
   phospho  nn    0    1    2
0  S34T44X   1  S34  T44  NaN
1  E23H78Y   2  E23  H78    Y
2  R49Y81Z   3  R49  Y81  NaN
3  D20U23X   4  D20  NaN  NaN
If you are unhappy with numbers as column names, you can change them however you wish.
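For example, a prefix keeps them recognizable ('site_' is an arbitrary choice):

a2 = a2.add_prefix('site_')   # columns become site_0, site_1, site_2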
If you are unhappy with NaN values for "missing" matches (in rows where fewer matches were found than in other rows), add .fillna('') to the last instruction.
Edit
There is a shorter solution:
After you have created a, you can do the whole rest of the processing with a single instruction:
pmap1.join(a[0].unstack()).fillna('')
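And if one column holding all matches per row is preferable to one column per match, collapsing the matches into lists is another sketch worth considering:

pmap1['phosphosites'] = a[0].groupby(level=0).agg(list)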
