Randomly select and assign values to given number of rows in python dataframe - python-3.x

How can I randomly select a given number of rows in a pandas dataframe and assign values to them?
Col B contains only 1's and 0's.
Suppose I have a dataframe as below
Col A Col B
A 0
B 0
A 0
B 0
C 0
A 0
B 0
C 0
D 0
A 0
I aim to randomly choose 5% of the rows and change the value of Col B to 1. I saw df.sample(), but that won't allow me to make in-place changes to the column data.

You can try the random library, which has its own sample function.
import random
# pick 5% of the row positions without replacement
randindx = random.sample(range(len(dataframe)), len(dataframe) // 20)
dataframe.iloc[randindx, dataframe.columns.get_loc('Col B')] = 1
Since you want 5%, divide the number of rows by 20.

You can first use the sample method to get the random 5% of examples and get hold of their indices like so:
samples_indices = df.sample(frac=0.05, replace=False).index
Once you know the indices, the loc method can be used to update the values in those rows.
df.loc[samples_indices, 'Col B'] = 1
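As a rough end-to-end sketch of the two steps above: the 100-row frame and the random_state seed are assumptions added here so the result is reproducible. Note that 5% of the ten rows shown in the question is less than one row, which is why the sketch uses a larger frame.

```python
import pandas as pd

# Hypothetical data: the question's ten rows repeated to give 100 rows
df = pd.DataFrame({"Col A": list("ABABCABCDA") * 10, "Col B": 0})

# Draw 5% of the rows without replacement and keep only their index labels
# (random_state is an assumption, used here only for reproducibility)
samples_indices = df.sample(frac=0.05, replace=False, random_state=0).index

# Assign through loc so the original frame is modified
df.loc[samples_indices, "Col B"] = 1

print(df["Col B"].sum())  # number of rows that were set to 1
```

Dropping random_state gives a different selection on every run.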

Related

Excel - assign values based on the first unique item

I have got an excel question that I can not answer. Here is my table:
ID Key Count Unique Available Text Results
1 0 Text-1 Dupe-Y
2 1 Y Text-1 Y
3 0 Text-1 Dupe-Y
4 0 Text-1 Dupe-Y
5 1 N Text-2 N
6 1 Y Text-3 Y
7 0 Text-2 Dupe-N
8 0 Duplicate Text-2 Dupe-N
9 0 Duplicate Text-2 Dupe-N
10 0 Y Text-2 Dupe-N
ID Key is just a unique key.
Count Unique flags the first time each value in column Text appears. Available can be Y, N, or Duplicate, and Text is the main column I need to analyze my table. The Results work as follows: the first time each value in Text appears (Count Unique = 1), if there is a value in Available, then that is the value I need; if Count Unique is 0, the result is either Dupe-Y or Dupe-N, depending on the value in Available.
I tried with a formula like this one but got stuck after initial progress. =IF(B2=0,"",IFERROR(IF(COUNTIF(D:D,D2)>1,IF(COUNTIF($D:$D,D2)=1,"",C2),1),1))
Note that the column Results is the one I need to populate with a formula that is not affected by sorting or lack of it.
I guess you already have all those values and just need a formula for the Results column.
My formula will work only if the data is sorted as in your example. If the sorting changes, the formula will fail.
My formula is:
=IF(B2=1;D2;"Dupe-"&RIGHT(G1;1))

How to get the value of column 2 when column 1 is greater than 3 and check which bin this value belongs to

I have one dataframe with two columns, A and B. First I need to make empty bins with step 1 from 1 to 11: (1,2), (2,3), ..., (10,11). Then, wherever the value in column B of the original dataframe is greater than 3, I need the value of column 'A' from 2 rows before.
Here is example dataframe :
df=pd.DataFrame({'A':[1,8.5,5.2,7,8,9,0,4,5,6],'B':[1,2,2,2,3.1,3.2,3,2,1,2]})
Required output 1:
df_out1=pd.DataFrame({'Value_A':[8.5,5.2]})
Required_output_2:
df_output2:
Bins count
(1 2) 0
(2,3) 0
(3,4) 0
(4,5) 0
(5,6) 1
(6,7) 0
(7,8) 0
(8,9) 1
(9,10) 0
(10,11) 0
You can index a shifted series to get the value of 'A' a fixed number of rows before 'B' satisfies some condition. A shift of 3 reproduces the required output above:
out1 = df['A'].shift(3)[df['B'] > 3]
The thing you want to do with the bins is known as a histogram. You can easily do this with numpy like
count, bin_edges = np.histogram(out1, bins=[i for i in range(1, 12)])
out2 = pd.DataFrame({'bin_lo': bin_edges[:-1], 'bin_hi': bin_edges[1:], 'count': count})
Here 'bin_lo' and 'bin_hi' are the lower and upper bounds of the bins.
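Putting both steps together on the example dataframe from the question gives a small runnable sketch. The only change from the lines above is np.arange(1, 12) for the bin edges, which is equivalent to the list comprehension.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 8.5, 5.2, 7, 8, 9, 0, 4, 5, 6],
                   'B': [1, 2, 2, 2, 3.1, 3.2, 3, 2, 1, 2]})

# Value of A three rows above each row where B exceeds 3
out1 = df['A'].shift(3)[df['B'] > 3]

# Count how many of those values fall into each unit-wide bin from 1 to 11
count, bin_edges = np.histogram(out1, bins=np.arange(1, 12))
out2 = pd.DataFrame({'bin_lo': bin_edges[:-1],
                     'bin_hi': bin_edges[1:],
                     'count': count})

print(out1.tolist())  # [8.5, 5.2]
print(out2)
```

Only the (5, 6) and (8, 9) bins end up with a count of 1, which matches the second required output.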

Is there a way to convert my column of incrementing integers separated by zeros to the number of intervals encountered so far in a pandas dataframe?

I'm working in pandas and I have a column in my dataframe filled with 0s and incrementing integers starting at one. I would like to add another column of integers, where that column is a counter of how many intervals separated by zeros we have encountered up to that point. For example, my data would look like
Index
1
2
3
0
1
2
0
1
and I would like it to look like
Index IntervalCount
1 1
2 1
3 1
0 1
1 2
2 2
0 2
1 2
Is it possible to do this with a vectorized operation, or do I have to do this iteratively? Note that it's not important that it be a new column; it could also overwrite the old one.
You can use the cumsum function.
df["IntervalCount"] = (df["Index"] == 1).cumsum()
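A minimal sketch of that one-liner on the sample column from the question. It relies on the assumption that every interval starts with the value 1, so each 1 bumps the running count.

```python
import pandas as pd

df = pd.DataFrame({"Index": [1, 2, 3, 0, 1, 2, 0, 1]})

# True at every row that opens a new interval; the cumulative sum
# of those flags is the number of intervals seen so far
df["IntervalCount"] = (df["Index"] == 1).cumsum()

print(df["IntervalCount"].tolist())  # [1, 1, 1, 1, 2, 2, 2, 3]
```

The final row gets 3 here because it opens a third interval, whereas the table in the question shows 2 for that row.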

Excel How do I fill a matrix by MAXIFS comparison

I have this dataset:
Groups A A B B
location a b c d
3 4 0 5
I also have a transformed version for better clarification:
Groups location
A a 3
A b 4
B c 0
B d 5
What I want is a simple matrix that is filled with binary values.
The function should check, for each row and column, whether the value is the MAXIF of its respective group and then compare it to the second value, which is the MAXIF of the other group. Therefore the combination b and d has to resolve to 1.
The intended output is as following:
I have a dataset with a-n locations that are grouped in a-n groups. So group A has the locations a,b; group B the locations c,d. The columns represent different features at each location.
I want to build a matrix out of it, but not the "usual" distance matrix but one that incorporates the following questions:
-When building the matrix, the maximum values of each group get compared
-I want to find out, if the value I am looking at is the maximum value in this group and if so, compare it to the second groups maximum -> if this number is larger -> set it to 1
-this should automatically fill all fields in the matrix
I need this for a network analysis of my data, to wipe out not needed connections
My current input is somewhat like this:
=IF(AND(>0(MAXIF()=value)>(AND(>0(MAXIF()=value);1;0)
How it looks like in excel:
=IF(AND(A$1<>$A7; A$3>0;(MAXIFS($A$3:$D$3;$A$1:$D$1;A$1)=A$3))<(AND(A$1<>$A7; $C7>0;MAXIFS($C$7:$C$10;$A$7:$A$10;$A7)=$C7));1;0)
However I think internally it does not actually compare values but TRUES and FALSE. Therefore connections that are smaller than MAX are getting 1s. My output currently:
A A B B
a b c d
A a 0 0 0 0
A b 0 0 1 0
B c 0 0 0 0
B d 1 0 0 0
As you can see, the value a and d resolve to 1.
The output should look like this:
(The matrix is generally 0, but when beacons like d (5) and b (4) meet, it gets a "1" since both are the highest within their group. Only here is there a connection between the two groups.)
A A B B
a b c d
A a 0 0 0 0
A b 1 0 0 1
B c 0 0 0 0
B d 0 1 0 0
I understand the problem but don't know how to fix that.
I'm fairly sure this doesn't work properly, but it may help. I've restructured your data slightly to make it easier to write the formula.
The formula in C3 etc is
=IF(AND(MAXIFS($G$3:$G$6,$A$3:$A$6,$A3)=$G3,MAXIFS($C$7:$F$7,$C$1:$F$1,C$1)=C$7,$G3>MAXIFS($G$3:$G$6,$A$3:$A$6,"<>" & $A3)),1,0)
It's just checking if the value in G is the max for the group in column A, and if the value in row 7 is the max for group in row 1, and if the value in G is greater than the max of the other group. If they're all satisfied it inserts a '1'.

Trying to ignore zero values in Excel column

In an Excel spreadsheet I am working on there are two columns of interest, column B and column E. In column B there are some 0 values and these are getting carried over to column E based on the loop that I am running with respect to column D. I want to write a Python script to ignore these 0's and pick the next highest value based on their frequencies into column E.
12NC ModifiedSOCwrt12NC SOC
0 232270463903 0 0
1 232270463903 0 0
2 232270463903 0 0
3 232270463903 0 0
4 232270463903 0 RC0603FR-0738KL
5 232270463903 0 RC0603FR-0738KL
6 232270463903 0 RC0603FR-0738KL
I want to run a loop which picks non-zero values from SOC (column B) and carries it over to ModifiedSOCwrt12NC (column E) based on unique values in Column D.
For example, Column B has values = [0, RCK2] in multiple rows which are based on unique values in column D. So the current loop picks the maximum occurrences of values in column B and fills it into column E. If there is a tie between occurrences of 0 and RCK2, it picks 0 as per the ASCII standard (which I don't want to happen). I want the code to pick RCK2 and fill those in column E.
Since your data is not accessible, I have created similar test data, shown below.
We can read the data with pandas:
import pandas as pd
df = pd.read_excel("ExcelTemplate.xlsx")
df
Index SOC Index2 12NC
0 YXGMY 0 ZJIZX 23445
1 NQHQC 0 JKJKT 23445
2 MWTLY 0 EFCYD 23445
3 RPQFE AC VLOJZ 23445
4 GPLUQ AC AKKKG 23445
5 WGYYM AC DSMLO 23445
6 XGTAQ 0 ZHGWS 45667
7 AMWDT 0 YROLO 45667
The following code does the summarization:
First summarize data on 12NC and SOC and take count
Sort by 12NC, count and SOC, with highest count first
Take the first value of SOC for each 12NC
Merge with original Data to create column E
Export back to Excel
# Count rows per (12NC, SOC) pair
counts = df.groupby(['12NC', 'SOC'])['Index'].count().reset_index(name='n')
# Drop the 0 placeholders, then keep the most frequent SOC per 12NC
best = (counts[counts['SOC'] != 0]
        .sort_values(by=['12NC', 'n', 'SOC'], ascending=[True, False, True])
        .drop_duplicates(subset=['12NC'], keep='first')[['12NC', 'SOC']]
        .rename(columns={'SOC': 'ModifiedSOCwrt12NC'}))
# Attach the chosen SOC to every row and export back to Excel
df = df.merge(best, on='12NC', how='left')
df.to_excel("ExcelTemplate_modifies.xlsx", index=False)
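To see the tie case from the question without an Excel file, here is a small in-memory sketch of the same idea. The data is hypothetical, and it assumes SOC is stored as strings, so the placeholder is "0" rather than the integer 0.

```python
import pandas as pd

# Hypothetical sample: for 12NC 23445, "0" and "AC" are tied at three rows each
df = pd.DataFrame({
    "12NC": [23445] * 6 + [45667] * 2,
    "SOC": ["0", "0", "0", "AC", "AC", "AC", "0", "0"],
})

# Ignore the "0" placeholders before counting, so they can never win a tie
nonzero = df[df["SOC"] != "0"]

# Most frequent remaining SOC per 12NC; ties are broken alphabetically
best = (
    nonzero.groupby(["12NC", "SOC"]).size().reset_index(name="n")
    .sort_values(["12NC", "n", "SOC"], ascending=[True, False, True])
    .drop_duplicates("12NC")
    .set_index("12NC")["SOC"]
)

# Broadcast the chosen value to every row of the same 12NC
df["ModifiedSOCwrt12NC"] = df["12NC"].map(best)
print(df)
```

Groups that contain only "0" (here 45667) come out as NaN, which can be filled afterwards if a different default is wanted.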
