I'm hoping to take data sorted by sector (below left) and distribute it evenly across sectors (below right):
Name  Sector      Name  Sector
A     1           A     1
B     1           E     2
C     1           H     3
D     4           D     4
E     2           B     1
F     2           F     2
G     2           J     3
H     3           I     4
I     4           C     1
J     3           G     2
Real data is 70+ rows with 4 sectors.
I've worked around it manually but would love to figure out how to do it with a formula in Excel.
Here's a more complete (and hopefully more accurate) idea - the carouselOrder is the column I'd like to generate via a formula.
guestID  guestSector  carouselOrder
1        1            1
2        1            5
3        1            9
4        1            13
5        2            2
6        2            6
7        2            10
8        2            14
9        3            3
10       3            7
11       3            11
12       2            18
13       1            17
14       1            20
15       1            23
16       2            21
17       2            24
18       2            27
19       1            26
20       1            29
21       1            30
22       1            31
23       3            15
24       3            19
25       3            22
26       3            25
27       3            28
28       1            32
29       4            4
30       4            8
31       4            12
32       4            16
If you are using Office 365, you can put the following in D2: =MOD(SEQUENCE(COUNTA(A2:A11),,0),4)+1
This creates a repeating counter of the sectors 1 to 4, with one entry for each row in your data.
In C2 use the following:
=BYROW(D2#,LAMBDA(x,
INDEX(
FILTER($A$2:$A$11,$B$2:$B$11=x),
SUM(--(D$2:x=x)))))
For each row, this filters the names whose sector equals that row's counter value, then uses INDEX to pick the k-th name from the filtered list, where k is the number of times that sector has appeared in the counter (D2#) up to and including the current row.
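To make the logic of the two formulas easier to follow, here is a minimal Python sketch of the same idea, using the ten-row sample from the question. The variable names (`counter`, `by_sector`, `seen`) are illustrative and not part of the original answer.

```python
from collections import defaultdict

# Sample data from the question: (name, sector) pairs in their original order.
rows = [("A", 1), ("B", 1), ("C", 1), ("D", 4), ("E", 2),
        ("F", 2), ("G", 2), ("H", 3), ("I", 4), ("J", 3)]
n_sectors = 4

# Helper column D: MOD(SEQUENCE(n,,0),4)+1 gives 1,2,3,4,1,2,...
counter = [i % n_sectors + 1 for i in range(len(rows))]

# Names per sector, kept in their original order (the FILTER step).
by_sector = defaultdict(list)
for name, sector in rows:
    by_sector[sector].append(name)

# For each counter value, take the k-th name of that sector, where k is how
# many times the sector has already appeared in the counter (the INDEX step).
seen = defaultdict(int)
result = []
for sector in counter:
    result.append(by_sector[sector][seen[sector]])
    seen[sector] += 1

print(result)
```

In this sketch the lookup raises an IndexError as soon as the fixed 1-to-4 cycle asks for more names than a sector has, so it only works when the sector sizes happen to fit the cycle, as they do in the sample.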
Let's try the following approach, which doesn't require a helper column. I will first explain the logic used to build the recurrence, then the Excel formula that implements it.
If we sort the input data (Name and Sector.) by Sector. in ascending order, the new positions of the Name values (letters) can be calculated as follows (Table 1):
Name  Sector.Sorted  Position
A     1              1+4*0=1
B     1              1+4*1=5
C     1              1+4*2=9
E     2              2+4*0=2
F     2              2+4*1=6
G     2              2+4*2=10
H     3              3+4*0=3
J     3              3+4*1=7
D     4              4+4*0=4
I     4              4+4*1=8
The new positions of Name (letters) follow this pattern (Formula 1):
position = Sector.Sorted + groupSize * factor
where groupSize is 4 in our case and factor counts how many times the same Sector.Sorted value is repeated, starting from 0. Think about Sector.Sorted as groups, where each set of repeated values represents a group: 1,2,3 and 4.
Once we have the Position values, we can sort Name by them with the SORTBY(array, by_array1) function. Check the SORTBY documentation for more information on how this function works.
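As a cross-check of Formula 1, the following Python sketch computes the same positions for the sample data and sorts the names by them. It is only an illustration of the recurrence, not a translation of the Excel formula; the names `group_size`, `positions` and `ordered` are assumptions made for this example.

```python
from itertools import groupby

# Sample data from the question: (name, sector) pairs in their original order.
rows = [("A", 1), ("B", 1), ("C", 1), ("D", 4), ("E", 2),
        ("F", 2), ("G", 2), ("H", 3), ("I", 4), ("J", 3)]
group_size = 4

# Sort by sector; Python's sort is stable, so names keep their order per sector.
sorted_rows = sorted(rows, key=lambda r: r[1])

# position = sector + group_size * factor, where factor counts 0, 1, 2, ...
# within each run of equal sector values.
positions = []
for sector, group in groupby(sorted_rows, key=lambda r: r[1]):
    for factor, (name, _) in enumerate(group):
        positions.append((sector + group_size * factor, name))

# Sorting by position plays the role of SORTBY(sName, pos).
ordered = [name for _, name in sorted(positions)]
print(ordered)
```

The positions produced here match the Position column of Table 1.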
Here is the formula to get the Name sorted in cell E2:
=LET(groupSize, 4, sorted, SORT(A2:B11,2), sName,
INDEX(sorted,,1),sSector, INDEX(sorted,,2),
seq0, SEQUENCE(ROWS(sSector),,0), mapResult,
MAP(sSector, seq0, LAMBDA(a,b, IF(b=0, "SAME",
IF(a=INDEX(sSector,b), "SAME", "NEW")))), factor,
SCAN(-1,mapResult, LAMBDA(aa,c,IF(c="SAME", aa+1,0))),
pos,MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n)),
SORTBY(sName,pos)
)
Here is the output (the Name column sorted by the new positions): A, E, H, D, B, F, J, I, C, G.
Explanation
The name sorted represents the input data sorted by Sector. in ascending order, i.e.: SORT(A2:B11,2). The names sName and sSector represent each column of sorted.
To identify each group we need the following sequence (seq0) starting from 0, i.e. SEQUENCE(ROWS(sSector),,0).
Now we need to identify when a new group starts. We use MAP function for that and the result is represented by the name mapResult:
MAP(sSector, seq0, LAMBDA(a,b, IF(b=0, "SAME",
IF(a=INDEX(sSector,b), "SAME", "NEW"))))
The logic is the following: if we are at the beginning of the sequence (the first value of seq0), it returns SAME. Otherwise we compare the current value of sSector (a) with the previous one, given by INDEX(sSector,b). If they are equal, we are still in the same group; otherwise a new group has started.
The intermediate result of mapResult is:
Name  Sector Sorted  mapResult
A     1              SAME
B     1              SAME
C     1              SAME
E     2              NEW
F     2              SAME
G     2              SAME
H     3              NEW
J     3              SAME
D     4              NEW
I     4              SAME
The first two columns are shown just for illustrative purpose, but mapResult only returns the last column.
Now we just need to create the counter based on every time we find NEW. In order to do that we use SCAN function and the result is stored under the name factor. This value represents the factor we use to multiply by 4 within each group (see Table 1):
SCAN(-1,mapResult, LAMBDA(aa,c,IF(c="SAME", aa+1,0)))
The accumulator starts at -1 because the counter starts at 0. Every time we find SAME, the previous value is incremented by 1. When we find NEW (anything other than SAME), the accumulator is reset to 0.
Here is the intermediate result of factor:
Name  Sector Sorted  mapResult  factor
A     1              SAME       0
B     1              SAME       1
C     1              SAME       2
E     2              NEW        0
F     2              SAME       1
G     2              SAME       2
H     3              NEW        0
J     3              SAME       1
D     4              NEW        0
I     4              SAME       1
The first three columns are shown for illustrative purpose.
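For readers more familiar with Python, the behaviour of the SCAN step described above can be mimicked with itertools.accumulate. This is only an analogy under the assumption that SCAN applies the lambda cumulatively from the initial value; `map_result` is copied from the table above.

```python
from itertools import accumulate

# mapResult column from the table above.
map_result = ["SAME", "SAME", "SAME", "NEW", "SAME",
              "SAME", "NEW", "SAME", "NEW", "SAME"]

# Start the accumulator at -1; add 1 on SAME, reset to 0 on anything else.
# accumulate() also emits the initial value, so drop the first element.
factor = list(accumulate(
    map_result,
    lambda acc, flag: acc + 1 if flag == "SAME" else 0,
    initial=-1,
))[1:]

print(factor)
```

The `initial` argument requires Python 3.8 or later.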
Now we have all the elements to build our pattern for the new positions represented with the name pos:
MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n))
where m represents each element of Sector.Sorted and n each of the previously calculated factor values. As you can see, the Excel formula is a direct implementation of the generic pattern (Formula 1 above). The intermediate result will be:
Name  Sector Sorted  mapResult  factor  pos
A     1              SAME       0       1
B     1              SAME       1       5
C     1              SAME       2       9
E     2              NEW        0       2
F     2              SAME       1       6
G     2              SAME       2       10
H     3              NEW        0       3
J     3              SAME       1       7
D     4              NEW        0       4
I     4              SAME       1       8
The previous columns are shown just for illustrative purpose. Now we have the new positions, so we are ready to sort based on the new positions for Name via:
SORTBY(sName,pos)
Update
The first MAP can be removed by giving SCAN an input array that carries both the sSector value and the index position used to find the previous element. SCAN only accepts a single array as input, so we combine both pieces of information into a new array. This formula can be used instead:
=LET(groupSize, 4, sorted, SORT(A2:B11,2), sName,
INDEX(sorted,,1),sSector, INDEX(sorted,,2),
factor, SCAN(-1,sSector&"-"&SEQUENCE(ROWS(sSector),,0),
LAMBDA(aa,b, LET(s, TEXTSPLIT(b,"-"),item, INDEX(s,,1),
idx, INDEX(s,,2), IF(aa=-1, 0, IF(1*item=INDEX(sSector, idx), aa+1,0))))),
pos,MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n)),
SORTBY(sName,pos)
)
Inside SCAN we use a LET function to calculate all the elements required for the comparison performed by the corresponding LAMBDA function. We extract item and idx, the position used to find the previous element of sSector. Then, via:
1*item=INDEX(sSector, idx)
we can compare each element of sSector with the previous one, starting from the second element of sSector. We multiply item by 1 because TEXTSPLIT returns text; without the conversion the comparison would fail.
I have 2 large DataFrames with the same set of columns but different values. I need to combine the values in the respective columns (A and B here, possibly more in the actual data) into single values in the same columns (see the required output below). I have a quick way of implementing this using np.vectorize and df.to_numpy(), but I am looking for a way to implement this strictly with pandas. The criteria are readability of code first, then time complexity.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'A':[1,2,3,4,5], 'B':[5,4,3,2,1]})
print(df1)
A B
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
and,
df2 = pd.DataFrame({'A':[10,20,30,40,50], 'B':[50,40,30,20,10]})
print(df2)
A B
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
I have one way of doing it which is quite fast -
# This function might change into something more complex
def conc(a, b):
    return str(a) + '_' + str(b)
# Vectorize the scalar function so it applies elementwise to both arrays
conc_v = np.vectorize(conc)
required = pd.DataFrame(conc_v(df1.to_numpy(), df2.to_numpy()), columns=df1.columns)
print(required)
#Required Output
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
Looking for an alternate way (strictly pandas) of solving this.
Criteria here is first readability of code
Another simple way is to use add and radd (shown here with '-' as the separator; use '_' to match the required output):
df1.astype(str).add(df2.astype(str).radd('-'))
A B
0 1-10 5-50
1 2-20 4-40
2 3-30 3-30
3 4-40 2-20
4 5-50 1-10
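A variant of the same idea with the plain + operator is sketched below, using '_' as the separator so the result matches the required output from the question. It assumes both frames share the same index and column labels, since pandas aligns on them.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [5, 4, 3, 2, 1]})
df2 = pd.DataFrame({"A": [10, 20, 30, 40, 50], "B": [50, 40, 30, 20, 10]})

# Cast both frames to strings, then concatenate elementwise.
# pandas aligns the operands on index and column labels.
required = df1.astype(str) + "_" + df2.astype(str)
print(required)
```

If the combining function later becomes more complex than string concatenation, this operator form may no longer apply directly.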
I have a df which contains customer data without a primary key. The same customer might show up multiple times.
I have a field (df2['campaign']) that is an int and reflects how many times the customer shows up in the df. There are also many customer attributes.
In my example, going from top to bottom, for each row (i.e. customer), I would like to find all n rows (i.e. all n customers) whose values of the education and default columns are the same. Remember n is the int contained in df2['campaign']
So as shown below, for row 0 and 1 I should search 1 row but find nothing because there are no matching values for education-default combinations.
For row 2 I should search 1 row (because campaign == 1) where education-default values match, and find 1 row in index 4.
df2.head()
job marital education default campaign housing loan contact
0 3 1 0 0 1 0 0 1
1 7 1 3 1 1 0 0 1
2 7 1 3 0 1 2 0 1
3 0 1 1 0 1 0 0 1
4 7 1 3 0 1 0 2 1
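Use df2_sorted = df2.sort_values(['education', 'default'], ascending=[True, True]) (DataFrame.sort was deprecated in favor of sort_values and later removed from pandas).
Then, if your data is not noisy, rows with matching education and default values should become neighbors.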
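A small sketch of this suggestion on the five rows shown in the question follows. The helper `find_matches` is hypothetical and reflects one reading of the question: return up to n other rows with the same education and default values, where n is the row's campaign value.

```python
import pandas as pd

# The five rows from df2.head() in the question.
df2 = pd.DataFrame({
    "job":       [3, 7, 7, 0, 7],
    "marital":   [1, 1, 1, 1, 1],
    "education": [0, 3, 3, 1, 3],
    "default":   [0, 1, 0, 0, 0],
    "campaign":  [1, 1, 1, 1, 1],
    "housing":   [0, 0, 2, 0, 0],
    "loan":      [0, 0, 0, 0, 2],
    "contact":   [1, 1, 1, 1, 1],
})

# Sorting puts rows with equal (education, default) next to each other.
df2_sorted = df2.sort_values(["education", "default"])
print(df2_sorted)


def find_matches(df, idx):
    """Hypothetical helper: up to `campaign` other rows sharing the same
    education and default values as the row labelled `idx`."""
    row = df.loc[idx]
    same = (df["education"] == row["education"]) & (df["default"] == row["default"])
    others = df[same].drop(index=idx)
    return others.head(int(row["campaign"]))


print(find_matches(df2, 2).index.tolist())
```

Whether a row should also match rows above it (for example row 4 finding row 2) is not specified in the question; this sketch does not restrict the search direction.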
I am working on an algorithm, which requires grouping by two columns. Pandas supports grouping by two columns by using:
df.groupby([col1, col2])
But the resulting dataframe is not the required dataframe
Work Setup:
Python : v3.5
Pandas : v0.18.1
Pandas Dataframe - Input Data:
Type Segment
id
1 Domestic 1
2 Salary 3
3 NRI 1
4 Salary 4
5 Salary 3
6 NRI 4
7 Salary 4
8 Salary 3
9 Salary 4
10 NRI 4
Required Dataframe:
Count of [Domestic, Salary, NRI] in each Segment
         Domestic  Salary  NRI
Segment
1        1         0       1
3        0         3       0
4        0         3       2
Experiments:
group = df.groupby(['Segment', 'Type'])
group.size()
Segment Type Count
1 Domestic 1
NRI 1
3 Salary 3
4 Salary 3
NRI 2
I am able to achieve the required dataframe using MS Excel Pivot Table feature. Is there any way, where I can achieve similar results using pandas?
The groupby(...).size() operation returns a Series with a two-level MultiIndex (Segment, Type). To turn it into the required DataFrame, unstack the second index level so that Type becomes the columns, and fill the resulting NaNs with 0:
df.groupby(['Segment', 'Type']).size().unstack(level=1, fill_value=0)
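The sketch below runs this on the input data from the question and also shows pd.crosstab, which gives a pivot-table style count directly. The reindex call is an addition made here only to put the columns in the order the question asks for.

```python
import pandas as pd

# Input data from the question, indexed by id.
df = pd.DataFrame(
    {
        "Type": ["Domestic", "Salary", "NRI", "Salary", "Salary",
                 "NRI", "Salary", "Salary", "Salary", "NRI"],
        "Segment": [1, 3, 1, 4, 3, 4, 4, 3, 4, 4],
    },
    index=pd.Index(range(1, 11), name="id"),
)

# Count rows per (Segment, Type), then move Type into the columns.
counts = df.groupby(["Segment", "Type"]).size().unstack(level=1, fill_value=0)

# Optional: put the columns in the order used in the question.
counts = counts.reindex(columns=["Domestic", "Salary", "NRI"], fill_value=0)
print(counts)

# Alternative: crosstab counts the co-occurrences in a single call.
alt = pd.crosstab(df["Segment"], df["Type"])
print(alt)
```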