Randomly select a percentage of columns and rows in a dataframe (Pandas, Python 3)

I am trying to randomly select a certain percentage of rows and columns in my dataframe and fit these features into a logistic regression over 10 iterations. My dependent variable is whether a team won (1) or lost (0).
If I have a df that looks something like this (data is made up):
Won Field Injuries Weather Fouls Players
1 2 3 1 2 8
0 3 2 0 1 5
1 4 5 3 2 6
1 3 2 1 4 5
0 2 3 0 1 6
1 4 2 0 2 8
...
For example, let's say I want to select 50% (but this could change). I want to randomly select 50% (or the closest amount to 50% if it's an odd number) of the columns (Field, Injuries, Weather, Fouls, Players) and 50% of the rows in those columns to place in my model.
Here is my current code, which currently selects all of the columns and rows and fits them into my model, but I would like to dictate a random percentage:
import numpy as np
import statsmodels.api as sm

z = []
for i in range(10):
    train_cols = df.columns[1:]
    logit = sm.Logit(df['Won'], df[train_cols])
    result = logit.fit()
    exp = np.exp(result.params)  # odds ratios for each feature
    z.append([i, exp])
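The post stops short of the sampling step, so here is a minimal sketch of one way to add it (an assumption, not the asker's code): DataFrame.sample can draw the random subsets, with frac controlling the percentage of rows and columns kept on each iteration.

import numpy as np
import pandas as pd
import statsmodels.api as sm

frac = 0.5  # fraction of columns and rows to sample; change as needed
z = []
for i in range(10):
    # Randomly pick ~50% of the feature columns (everything except 'Won').
    feature_cols = pd.Series(df.columns[1:]).sample(frac=frac)
    # Randomly pick ~50% of the rows.
    subset = df.sample(frac=frac)
    logit = sm.Logit(subset['Won'], subset[list(feature_cols)])
    result = logit.fit()
    z.append([i, np.exp(result.params)])  # odds ratios for the sampled features

Because both samples are redrawn inside the loop, each of the 10 iterations fits the model on a different random subset of features and games.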

Related

Is there a way to sort a list so that rows with the same value in one column are evenly distributed?

Hoping to sort (below left) by sector but distribute evenly (below right):
Name   Sector.        Name.  Sector
A      1              A      1
B      1              E      2
C      1              H      3
D      4              D      4
E      2              B      1
F      2              F      2
G      2              J      3
H      3              I      4
I      4              C      1
J      3              G      2
Real data is 70+ rows with 4 sectors.
I've worked around it manually but would love to figure out how to do it with a formula in Excel.
Here's a more complete (and hopefully more accurate) idea: carouselOrder is the column I'd like to generate via a formula.
guestID  guestSector  carouselOrder
1        1            1
2        1            5
3        1            9
4        1            13
5        2            2
6        2            6
7        2            10
8        2            14
9        3            3
10       3            7
11       3            11
12       2            18
13       1            17
14       1            20
15       1            23
16       2            21
17       2            24
18       2            27
19       1            26
20       1            29
21       1            30
22       1            31
23       3            15
24       3            19
25       3            22
26       3            25
27       3            28
28       1            32
29       4            4
30       4            8
31       4            12
32       4            16
When using Office 365 you can use the following in D2: =MOD(SEQUENCE(COUNTA(A2:A11),,0),4)+1
This creates a repeating counter of the sectors 1 to 4 down the total count of rows in your data.
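As a side note (not part of the original answer), the helper column is just a cyclic 1-to-4 counter; a quick Python illustration of the same MOD pattern:

# Repeating 1..4 counter over n rows, mirroring =MOD(SEQUENCE(n,,0),4)+1
n = 10
helper = [(i % 4) + 1 for i in range(n)]
print(helper)  # [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]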
In C2 use the following:
=BYROW(D2#,LAMBDA(x,
INDEX(
FILTER($A$2:$A$11,$B$2:$B$11=x),
SUM(--(D$2:x=x)))))
This filters the Names whose sector equals the counter value of the current row, then indexes into that filtered list using the running count of that counter value in D2# up to the current row.
Let's try the following approach, which doesn't require creating a helper column. I would like to explain first the logic used to build the recurrence, then the Excel formula that builds it.
If we sort the input data Name and Sector. by Sector. in ascending order, the new positions of the Name values (letters) can be calculated as follows (Table 1):
Name  Sector.Sorted  Position
A     1              1+4*0=1
B     1              1+4*1=5
C     1              1+4*2=9
E     2              2+4*0=2
F     2              2+4*1=6
G     2              2+4*2=10
H     3              3+4*0=3
J     3              3+4*1=7
D     4              4+4*0=4
I     4              4+4*1=8
The new positions of the Name values (letters) follow this pattern (Formula 1):
position = Sector.Sorted + groupSize * factor
where groupSize is 4 in our case and factor counts how many times the same Sector.Sorted value has already appeared, starting from 0. Think of Sector.Sorted as groups, where each set of repeated values represents a group: 1, 2, 3 and 4. The sketch below makes the recurrence concrete.
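Here is a small Python sketch of Formula 1 (an illustration added for clarity, not part of the original answer), applied to the sorted sectors from Table 1:

from collections import defaultdict

def positions(sectors_sorted, group_size=4):
    # Formula 1: position = sector + group_size * factor,
    # where factor counts prior occurrences of the same sector.
    seen = defaultdict(int)
    result = []
    for s in sectors_sorted:
        result.append(s + group_size * seen[s])
        seen[s] += 1
    return result

print(positions([1, 1, 1, 2, 2, 2, 3, 3, 4, 4]))
# [1, 5, 9, 2, 6, 10, 3, 7, 4, 8]  -- matches the Position column of Table 1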
If we are able to build the Position values we can sort Name by the new positions via the SORTBY(array, by_array1) function. Check the SORTBY documentation for more information on how this function works.
Here is the formula to get the Name sorted in cell E2:
=LET(groupSize, 4,
 sorted, SORT(A2:B11,2),
 sName, INDEX(sorted,,1),
 sSector, INDEX(sorted,,2),
 seq0, SEQUENCE(ROWS(sSector),,0),
 mapResult, MAP(sSector, seq0, LAMBDA(a,b,
   IF(b=0, "SAME", IF(a=INDEX(sSector,b), "SAME", "NEW")))),
 factor, SCAN(-1, mapResult, LAMBDA(aa,c, IF(c="SAME", aa+1, 0))),
 pos, MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n)),
 SORTBY(sName, pos)
)
Here is the output: A, E, H, D, B, F, J, I, C, G (the evenly distributed order shown in the question).
Explanation
The name sorted represents the input data sorted by Sector. in ascending order, i.e.: SORT(A2:B11,2). The names sName and sSector represent each column of sorted.
To identify each group we need the following sequence (seq0) starting from 0, i.e. SEQUENCE(ROWS(sSector),,0).
Now we need to identify when a new group starts. We use MAP function for that and the result is represented by the name mapResult:
MAP(sSector, seq0, LAMBDA(a,b, IF(b=0, "SAME",
IF(a=INDEX(sSector,b), "SAME", "NEW"))))
The logic is the following: if we are at the beginning of the sequence (the first value of seq0), return SAME; otherwise compare the current value of sSector (a) against the previous one, INDEX(sSector,b). If they are equal we are still in the same group, otherwise a new group has started.
The intermediate result of mapResult is:
Name  Sector Sorted  mapResult
A     1              SAME
B     1              SAME
C     1              SAME
E     2              NEW
F     2              SAME
G     2              SAME
H     3              NEW
J     3              SAME
D     4              NEW
I     4              SAME
The first two columns are shown just for illustrative purposes; mapResult only returns the last column.
Now we just need to create the counter that restarts every time we find NEW. To do that we use the SCAN function, and the result is stored under the name factor. This value is the factor we multiply by 4 within each group (see Table 1):
SCAN(-1,mapResult, LAMBDA(aa,c,IF(c="SAME", aa+1,0)))
The accumulator starts at -1 because the counter starts at 0. Every time we find SAME, the previous value is incremented by 1; when it finds NEW (anything not equal to SAME), the accumulator resets to 0. The Python sketch below mirrors this reset logic.
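For comparison, here is the same reset-on-new-group logic as a short Python loop (an illustration, not part of the original answer):

def factors(sectors_sorted):
    # Running counter that resets to 0 whenever the sector value changes,
    # mirroring SCAN(-1, mapResult, ...) with the SAME/NEW flags.
    acc, prev, out = -1, None, []
    for s in sectors_sorted:
        acc = acc + 1 if prev is None or s == prev else 0
        out.append(acc)
        prev = s
    return out

print(factors([1, 1, 1, 2, 2, 2, 3, 3, 4, 4]))
# [0, 1, 2, 0, 1, 2, 0, 1, 0, 1]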
Here is the intermediate result of factor:
Name  Sector Sorted  mapResult  factor
A     1              SAME       0
B     1              SAME       1
C     1              SAME       2
E     2              NEW        0
F     2              SAME       1
G     2              SAME       2
H     3              NEW        0
J     3              SAME       1
D     4              NEW        0
I     4              SAME       1
The first three columns are shown for illustrative purposes.
Now we have all the elements to build our pattern for the new positions represented with the name pos:
MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n))
where m represents each element of Sector.Sorted and n the previously calculated factor values. As you can see, the Excel formula mirrors the generic pattern (Formula 1 above). The intermediate result will be:
Name  Sector Sorted  mapResult  factor  pos
A     1              SAME       0       1
B     1              SAME       1       5
C     1              SAME       2       9
E     2              NEW        0       2
F     2              SAME       1       6
G     2              SAME       2       10
H     3              NEW        0       3
J     3              SAME       1       7
D     4              NEW        0       4
I     4              SAME       1       8
The previous columns are shown just for illustrative purposes. Now that we have the new positions, we are ready to sort Name by them via:
SORTBY(sName,pos)
Update
The first MAP can be removed by building an input array for SCAN that carries both the sSector value and the index position used to find the previous element. SCAN only allows a single array as its input argument, so we combine both pieces of information into one array. This is the formula that can be used instead:
=LET(groupSize, 4,
 sorted, SORT(A2:B11,2),
 sName, INDEX(sorted,,1),
 sSector, INDEX(sorted,,2),
 factor, SCAN(-1, sSector&"-"&SEQUENCE(ROWS(sSector),,0),
   LAMBDA(aa,b, LET(s, TEXTSPLIT(b,"-"),
     item, INDEX(s,,1),
     idx, INDEX(s,,2),
     IF(aa=-1, 0, IF(1*item=INDEX(sSector, idx), aa+1, 0))))),
 pos, MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n)),
 SORTBY(sName, pos)
)
Inside SCAN we use a LET function to calculate all the elements required for the comparison as part of the corresponding LAMBDA. We extract the item and the idx position used to find the previous element of sSector; via 1*item=INDEX(sSector, idx) we compare each element of sSector with the previous one, starting from the second element. We multiply item by 1 because TEXTSPLIT converts the values to text, and the comparison would otherwise fail.

Combining the respective columns from 2 separate DataFrames using pandas

I have 2 large DataFrames with the same set of columns but different values. I need to combine the values in the respective columns (A and B here; there may be more in the actual data) into single values in the same columns (see the required output below). I have a quick way of implementing this using np.vectorize and df.to_numpy(), but I am looking for a way to implement this strictly with pandas. The criteria here are first readability of the code, then time complexity.
df1 = pd.DataFrame({'A':[1,2,3,4,5], 'B':[5,4,3,2,1]})
print(df1)
A B
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
and,
df2 = pd.DataFrame({'A':[10,20,30,40,50], 'B':[50,40,30,20,10]})
print(df2)
A B
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
I have one way of doing it which is quite fast:
import numpy as np
import pandas as pd

# This function might change into something more complex
def conc(a, b):
    return str(a) + '_' + str(b)

conc_v = np.vectorize(conc)
required = pd.DataFrame(conc_v(df1.to_numpy(), df2.to_numpy()), columns=df1.columns)
print(required)
#Required Output
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
Looking for an alternate way (strictly pandas) of solving this.
Criteria here is first readability of code
Another simple way is using add and radd
df1.astype(str).add(df2.astype(str).radd('_'))
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
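radd is simply the reflected version of add, so df2.astype(str).radd('_') prepends the separator to every value before the column-wise concatenation. A minimal sketch of the equivalence (using the df1/df2 defined in the question):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})
df2 = pd.DataFrame({'A': [10, 20, 30, 40, 50], 'B': [50, 40, 30, 20, 10]})

# radd('_') computes '_' + value element-wise, so these two are identical:
left = df1.astype(str).add(df2.astype(str).radd('_'))
right = df1.astype(str) + '_' + df2.astype(str)
assert left.equals(right)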

Calculating Cumulative Average every x successive rows in Excel( not to be confused with Average every x rows gap interval)

I want to calculate the cumulative average of every 3 successive rows of the Value field; the Cum_sum column in the output below shows the expected result. I tried the OFFSET method, but it gives the average at every 3-row gap interval, not the cumulative average of every 3 continuous rows.
Use Series.rolling with mean and then Series.shift: rolling(N).mean() labels each window by its last row, so shifting by -(N-1) aligns each average with the first row of its window.
import pandas as pd

N = 3
df = pd.DataFrame({'Value': [6, 9, 15, 3, 27, 33]})
df['Cum_sum'] = df['Value'].rolling(N).mean().shift(-N+1)
print(df)
Value Cum_sum
0 6 10.0
1 9 9.0
2 15 15.0
3 3 21.0
4 27 NaN
5 33 NaN

Find a subset of rows (N rows) in a Pandas data frame having the same values at a subset of columns

I have a df which contains customer data without a primary key. The same customer might show up multiple times.
I have a field (df2['campaign']) that is an int and reflects how many times the customer shows up in the df. There are also many customer attributes.
In my example, going from top to bottom, for each row (i.e. customer) I would like to find all n rows (i.e. all n customers) whose values in the education and default columns are the same. Remember, n is the int contained in df2['campaign'].
So, as shown below, for rows 0 and 1 I should search for 1 row each but find nothing, because no other row has matching values for their education-default combination.
For row 2 I should search for 1 row (because campaign == 1) where the education-default values match, and I find 1 match at index 4.
df2.head()
job marital education default campaign housing loan contact
0 3 1 0 0 1 0 0 1
1 7 1 3 1 1 0 0 1
2 7 1 3 0 1 2 0 1
3 0 1 1 0 1 0 0 1
4 7 1 3 0 1 0 2 1
Use df2_sorted = df2.sort_values(['education', 'default'], ascending=[True, True]) (DataFrame.sort is gone from recent pandas versions; sort_values is its replacement).
Then, if your data is not noisy, matching rows become neighbors.
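Sorting only puts candidate matches next to each other; to actually collect, for each row, the other rows that share its education-default pair, a groupby lookup is one option. A rough sketch under the question's assumptions (df2 as shown, campaign holding n):

import pandas as pd

# Map each (education, default) pair to the indices of the rows holding it.
groups = df2.groupby(['education', 'default']).groups

for i, row in df2.iterrows():
    candidates = [j for j in groups[(row['education'], row['default'])] if j != i]
    n = row['campaign']              # how many matches to search for
    print(i, candidates[:n])         # up to n matching row indices (empty if none)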

How to group by two Columns using Pandas?

I am working on an algorithm, which requires grouping by two columns. Pandas supports grouping by two columns by using:
df.groupby([col1, col2])
But the resulting dataframe is not the required dataframe
Work Setup:
Python : v3.5
Pandas : v0.18.1
Pandas Dataframe - Input Data:
Type Segment
id
1 Domestic 1
2 Salary 3
3 NRI 1
4 Salary 4
5 Salary 3
6 NRI 4
7 Salary 4
8 Salary 3
9 Salary 4
10 NRI 4
Required Dataframe:
Count of [Domestic, Salary, NRI] in each Segment
Domestic Salary NRI
Segment
1 1 0 1
3 0 3 0
4 0 3 2
Experiments:
group = df.groupby(['Segment', 'Type'])
group.size()
Segment Type Count
1 Domestic 1
NRI 1
3 Salary 3
4 Salary 3
NRI 2
I am able to achieve the required dataframe using MS Excel Pivot Table feature. Is there any way, where I can achieve similar results using pandas?
The Groupby.size operation returns a Series with a two-level MultiIndex. It can be converted into the required dataframe by unstacking the second index level, filling the NaNs that appear with 0:
df.groupby(['Segment', 'Type']).size().unstack(level=1, fill_value=0)
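A self-contained sketch recreating the input from the question to show the result (the construction of df here is my own, inferred from the table above):

import pandas as pd

df = pd.DataFrame({
    'Type': ['Domestic', 'Salary', 'NRI', 'Salary', 'Salary',
             'NRI', 'Salary', 'Salary', 'Salary', 'NRI'],
    'Segment': [1, 3, 1, 4, 3, 4, 4, 3, 4, 4],
}, index=range(1, 11))

print(df.groupby(['Segment', 'Type']).size().unstack(level=1, fill_value=0))
# Type     Domestic  NRI  Salary
# Segment
# 1               1    1       0
# 3               0    0       3
# 4               0    2       3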
