I am looking for an equivalent of the convenient Python pandas syntax:
# df is a pandas DataFrame
for fruit, sub_df in df.groupby('fruits'):
    # Do some stuff with sub_df and fruit
It is basically a groupby, where each group can be accessed as a single dataframe alongside its label (the common value in the grouping column).
I had a look at DataFusion, but I can't reproduce this behavior without first selecting all the unique values and then executing one select per value, which results in re-parsing the whole file multiple times.
I had a look at the Polars crate, which seemed promising, but I wasn't able to reach my goal with it either.
How would you do this with similar or better performance than the Python code? I am open to any syntax / library / approach that would allow me to efficiently partition the Parquet file by the values of a fixed column.
Here is a Rust sample using Polars as an example of the kind of input I am dealing with:
let s0 = Series::new("fruits", ["Apple", "Apple", "Pear", "Pear", "Pear", "Pear"].as_ref());
let s1 = Series::new("maturity", ["A", "B", "A", "C", "A", "D"].as_ref());
let s2 = Series::new("N", [1, 2, 2, 4, 2, 8].as_ref());
// create a new DataFrame
let df = DataFrame::new(vec![s0, s1, s2]).unwrap();
// I would like to loop on all fruits values, each time with a dataframe containing only the records with this fruit.
If you enable the partition_by feature, Polars exposes DataFrame::partition_by and DataFrame::partition_by_stable.
use polars::prelude::*;
fn main() -> Result<()> {
    let partitioned = df! {
        "fruits" => &["Apple", "Apple", "Pear", "Pear", "Pear", "Pear"],
        "maturity" => &["A", "B", "A", "C", "A", "D"],
        "N" => &[1, 2, 2, 4, 2, 8]
    }?.partition_by(["fruits"])?;
    dbg!(partitioned);
    Ok(())
}
Running this prints:
[src/main.rs:17] partitioned = [
shape: (4, 3)
┌────────┬──────────┬─────┐
│ fruits ┆ maturity ┆ N │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 │
╞════════╪══════════╪═════╡
│ Pear ┆ A ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Pear ┆ C ┆ 4 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Pear ┆ A ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Pear ┆ D ┆ 8 │
└────────┴──────────┴─────┘,
shape: (2, 3)
┌────────┬──────────┬─────┐
│ fruits ┆ maturity ┆ N │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 │
╞════════╪══════════╪═════╡
│ Apple ┆ A ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Apple ┆ B ┆ 2 │
└────────┴──────────┴─────┘,
]
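To relate this back to the pandas loop in the question, here is a minimal plain-Python sketch of what a single-pass partition does conceptually. It is not how Polars implements partition_by; the function name `partition_by` and the list-of-dicts row representation are assumptions made purely for illustration.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group row dicts by rows[key] in one pass over the data."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # Plain dicts keep insertion order, so groups come out in first-seen order.
    return dict(groups)


rows = [
    {"fruits": "Apple", "maturity": "A", "N": 1},
    {"fruits": "Apple", "maturity": "B", "N": 2},
    {"fruits": "Pear", "maturity": "A", "N": 2},
    {"fruits": "Pear", "maturity": "C", "N": 4},
    {"fruits": "Pear", "maturity": "A", "N": 2},
    {"fruits": "Pear", "maturity": "D", "N": 8},
]

partitions = partition_by(rows, "fruits")
for fruit, sub_rows in partitions.items():
    print(fruit, [r["N"] for r in sub_rows])
```

Each row is visited exactly once, which is the property the question asks for. Note that the Polars output above lists Pear before Apple, so the group order from partition_by should not be relied on.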
You can do something like:
let columns = df.columns();
if let Ok(grouped) = df.groupby("fruits") {
    let sub_df = grouped.select(columns).agg_list()?;
    dbg!(sub_df);
}
And that'll technically leave you with a dataframe. The problem then is that the dataframe's columns are arrays with all the values for each fruit, which might not be what you want.
+---------+--------------------------------------+-----------------------------+-----------------+
| fruits | fruits_agg_list | maturity_agg_list | N_agg_list |
| --- | --- | --- | --- |
| str | list [str] | list [str] | list [i32] |
+=========+======================================+=============================+=================+
| "Apple" | "[\"Apple\", \"Apple\"]" | "[\"A\", \"B\"]" | "[1, 2]" |
+---------+--------------------------------------+-----------------------------+-----------------+
| "Pear" | "[\"Pear\", \"Pear\", ... \"Pear\"]" | "[\"A\", \"C\", ... \"D\"]" | "[2, 4, ... 8]" |
+---------+--------------------------------------+-----------------------------+-----------------+
I have the table below:
╭───╥────────────┬──────────────────╮
│   ║ A          │ B                │
╞═══╬════════════╪══════════════════╡
│ 1 ║ Jack       │ 1 year 6 months  │
│ 2 ║ Emily      │ 6 months         │
│ 3 ║ Carl       │ 2 years 3 months │
│ 4 ║            │                  │
│ 5 ║ Team avg:  │ 1 year 5 months  │
└───╨────────────┴──────────────────┘
I would like to get the time span average from column B. Something like 1.42 years or 1 year 5 months.
Is there a way to input time periods in Excel in terms of years, months, and days? I could not figure out how to use date formats for this case.
I would prefer a non-macro solution if possible.
Any ideas? Thanks in advance.
Since you wrote that you could enter the numbers differently and want to do it in a single column, you could, for example, enter your data as y.mm and then use the array formula below to present a human-readable output.
Be sure to retain a leading zero for single digit months, as shown in the screen shot below.
=TEXT(INT(DOLLARFR(AVERAGE(INT(myRange)*12+MOD(myRange,1)*100)/12,12)),"[=1]0 ""year "";0"" years """) & TEXT(INT(MOD(DOLLARFR(AVERAGE(INT(myRange)*12+MOD(myRange,1)*100)/12,12),1)*100), "[=1]0"" month"";0"" months""")
To enter/confirm an array formula, hold down ctrl + shift while hitting enter. If you do this correctly, Excel will place braces {...} around the formula seen in the formula bar.
If you want the output in the same format as the entries, you can use the much simpler formula (also entered as an array formula)
=DOLLARFR(AVERAGE(INT(myRange)*12+MOD(myRange,1)*100)/12,12)
And, if you don't mind having a dot in the displayed data, and plural months/years even if months/years are one (1), you can custom format it as suggested by @RaulDurand:
0" years". 0 "months ";;
If you are going to add days into the mix, you can use a similar algorithm. I did not provide one because, although years contain a fixed number of months, both contain a variable number of days and you will need to decide how you are going to treat that situation. A VBA UDF would be much simpler to construct.
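For anyone who wants to check the arithmetic outside Excel, here is a rough Python sketch of the same idea: convert each y.mm entry to whole months, average, and split back into years and months. The function names are made up for this example, and unlike the TEXT formula above it does not truncate a fractional month.

```python
def ymm_to_months(value):
    """Convert a y.mm number (1.06 means 1 year 6 months) to whole months."""
    years = int(value)
    # round() absorbs floating-point noise such as 0.06 * 100 = 6.000000000000005
    months = round((value - years) * 100)
    return years * 12 + months


def average_span(values):
    """Return the average span as (years, months)."""
    avg_months = sum(ymm_to_months(v) for v in values) / len(values)
    years, months = divmod(avg_months, 12)
    return int(years), months


years, months = average_span([1.06, 0.06, 2.03])
print(f"{years} year(s) {months:g} month(s)")
```

With the three entries from the question this gives 18 + 6 + 27 = 51 months, so an average of 17 months, i.e. 1 year 5 months.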
I wouldn't bother looking for an intensely programmed solution: you only seem to have 4 data rows? I'd just type new data in instead
We don't do time spans in Excel, but we can denote a time span by a start date and an end date. Some time spans may be dynamic, dependent on the current time
If you have 2 dates you can do simple math: later_date - earlier_date
Excel will return a result in days and fractions of a day, so for example 2000-02-01 18:00 minus 2000-01-01 00:00 will give an answer of 31.75 - it's a time span of 31.75 days
You can format 31.75 as days and hours and excel will represent it as eg 31 days 18 hours 0 minutes, but you should appreciate that that is a formatting of a numeric value of 31.75 - get used to working with and thinking of spans of time as a decimal number of days
You can add these up, average them etc
You can add a number to a date to advance the date by that number of days
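The same "a span is just a decimal number of days" model can be sketched in Python with the standard datetime module, using the dates from the example above. This only illustrates the arithmetic; it is not a statement about how Excel stores dates internally.

```python
from datetime import datetime, timedelta

later = datetime(2000, 2, 1, 18, 0)
earlier = datetime(2000, 1, 1, 0, 0)

# Subtracting two dates gives a span; dividing by one day gives decimal days.
span = later - earlier
span_days = span / timedelta(days=1)
print(span_days)

# Spans can be summed and averaged like ordinary numbers.
spans = [span, timedelta(days=10)]
average = sum(spans, timedelta()) / len(spans)
print(average / timedelta(days=1))

# Adding a number of days to a date advances the date.
advanced = earlier + timedelta(days=31.75)
print(advanced)
```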
Just use three columns. Column B is Years. Column C is Months. Column D is the sum of B and a twelfth of C. So, D2 would be =B2 + C2/12.
╭───╥────────────┬───┬────┬──────────────────╮
│   ║ A          │ B │ C  │ D                │
╞═══╬════════════╪═══╪════╪══════════════════╡
│ 1 ║ Jack       │ 1 │ 6  │ = B1 + C1/12     │
│ 2 ║ Emily      │ 0 │ 6  │ = B2 + C2/12     │
│ 3 ║ Carl       │ 2 │ 3  │ = B3 + C3/12     │
│ 4 ║            │   │    │                  │
│ 5 ║ Team avg:  │ * │ ** │ = AVERAGE(D1:D3) │
└───╨────────────┴───┴────┴──────────────────┘
* B5 can be INT(D5) to give years
** C5 can be (D5 - INT(D5)) * 12 to give months
Based on @RonRosenfeld's comments I managed to use numbers. In this case the time spans in column B are written as:
╭───╥────────────┬─────────────────╮
│ ║ A │ B │
╞═══╬════════════╪═════════════════╡
│ 1 ║ Jack │ 1.06 │
│ 2 ║ Emily │ 0.06 │
│ 3 ║ Carl │ 2.03 │
│ 4 ║ │ │
│ 5 ║ Team avg: │ │
└───╨────────────┴─────────────────┘
and using the following custom format
0" years". 00 "months ";;
it becomes:
╭───╥────────────┬───────────────────╮
│ ║ A │ B │
╞═══╬════════════╪═══════════════════╡
│ 1 ║ Jack │ 1 years. 06 months│
│ 2 ║ Emily │ 0 years. 06 months│
│ 3 ║ Carl │ 2 years. 03 months│
│ 4 ║ │ │
│ 5 ║ Team avg: │ 1.42 years │
└───╨────────────┴───────────────────┘
Then, the average time span is calculated as:
=(SUMPRODUCT(INT(B1:B3))+SUMPRODUCT(MOD(B1:B3,1))/0.12)/COUNT(B1:B3)
Also, custom formatting is used in the result.
This comes up fairly often, and I always find myself fighting against it. So I'd like the definitive solution, if there is one.
The essential problem boils down to this conflict:
J likes to work with homogeneous lists / tables etc. (i.e., no lists with items of different lengths, for example)
Sometimes you want to apply a verb to each item of a list, where the result of the verb is itself a list whose number of items varies.
This is typically solved using chain (or flat_map) in functional languages.
example to demonstrate the general problem
To take a concrete example, say you want to list all possible pairs from the list 0 1 2 3, where the first is strictly less than the second:
0 1
0 2
0 3
1 2
1 3
2 3
Of course, you could table ,/ or catalog { to get the full cross product, and then filter so you're left with just the upper triangle:
That is, take the result of ,"0/~ i.4:
0 0
0 1
0 2
0 3
1 0
1 1
1 2
1 3
2 0
2 1
2 2
2 3
3 0
3 1
3 2
3 3
Actually, to make the geometry more clear, let's display it as ;/"2 ,"0/~ i.4:
┌───┬───┬───┬───┐
│0 0│0 1│0 2│0 3│
├───┼───┼───┼───┤
│1 0│1 1│1 2│1 3│
├───┼───┼───┼───┤
│2 0│2 1│2 2│2 3│
├───┼───┼───┼───┤
│3 0│3 1│3 2│3 3│
└───┴───┴───┴───┘
And now the result we seek is the upper triangle. But this approach has drawbacks:
We have to do double the work we need.
We have to introduce a separate filter step to remove the extra results we don't want.
The above two things obscure the intent of our code.
Solutions using { have similar issues.
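For comparison, this is roughly what the chain / flat_map version of the same example looks like in Python. It is only meant to pin down the behavior being asked about; `flat_map` and `pairs_above` are names invented for this sketch.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and concatenate the variable-length results."""
    return list(chain.from_iterable(func(item) for item in items))


def pairs_above(i, n=4):
    """All pairs (i, j) with i < j < n; the result length depends on i."""
    return [(i, j) for j in range(i + 1, n)]


pairs = flat_map(pairs_above, range(4))
print(pairs)
```

Each call returns a list of a different length (3, 2, 1, and 0 pairs), and the concatenation needs neither fill nor a separate filter step.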
it would be nice if...
The chain approach would look something like this:
g=. ,"0 i.
(g 0);(g 1);(g 2);(g"0 i.3);<(<@g"0 (1+i.3))
which produces:
┌──┬───┬───┬───┬─────────────┐
│ │1 0│2 0│0 0│┌───┬───┬───┐│
│ │ │2 1│0 0││1 0│2 0│3 0││
│ │ │ │ ││ │2 1│3 1││
│ │ │ │1 0││ │ │3 2││
│ │ │ │0 0│└───┴───┴───┘│
│ │ │ │ │ │
│ │ │ │2 0│ │
│ │ │ │2 1│ │
└──┴───┴───┴───┴─────────────┘
The last two columns are close to what I want, but in the penultimate column we have the nuisance of automatic fill obscuring our result, and in the final column our correct results are boxed, but unboxing them returns the fill.
What is a good (and idiomatically J) way to solve issues like this?
NOTE: I'm not looking for an ad hoc solution to the problem in the example, but a solution to the general problem that is solved by chain in other languages.
I think that this answer is too specific for what you want, but it does suggest that the solution to part 2 of the challenge (variable length results) is to use each=:&.> so that the padding can be avoided.
(< 0 1 2 3) ;"1@:((],.>#[) each) 0 1 2 3
0 1
0 2
0 3
1 2
1 3
2 3
(< 0 1 2 3) ((],.>#[) each) 0 1 2 3
+---+---+---+--+
|0 1|1 2|2 3| |
|0 2|1 3| | |
|0 3| | | |
+---+---+---+--+
Adding another answer to my question for posterity. A phrase I've been using recently when golfing is:
;@(,.&.> <\)@i. 3
0 0
1 0
1 1
2 0
2 1
2 2
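For readers who don't read J fluently, here is my understanding of that phrase spelled out as a Python sketch: box the prefixes, pair each item with every element of its own prefix, then raze the boxes into one flat list. The function name `prefix_pairs` is made up for the example.

```python
def prefix_pairs(n):
    """Pair each i in 0..n-1 with every element of its prefix 0..i."""
    # One variable-length "box" per item, like the boxed results in J.
    boxed = [[(i, j) for j in range(i + 1)] for i in range(n)]
    # Flatten the boxes into a single list, like raze.
    return [pair for box in boxed for pair in box]


print(prefix_pairs(3))
```

The boxes have lengths 1, 2, and 3, and flattening them never introduces fill.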
Another, slightly longer alternative:
>@(>:/ #&, {@;)~@i.
For example:
>@(>:/ #&, {@;)~@i. 3
0 0
1 0
1 1
2 0
2 1
2 2
I want to create a number of random columns in Excel with these characteristics:
Each column has 9 cells
Each cell is either 0, 1, or 2
Each column has SUM = 10
I tried creating 9 random numbers in column A and then using ROUND(B1/SUM(B$1:B$9);1)*10 for the columns, but due to ROUND (I think) it is not completely correct, as not all columns have sum=10 (some have 8, others 10, etc.)
For example:
Column B: 0,1,1,1,1,1,1,2,2
Column C: 0,0,1,1,1,1,2,2,2
Column D: 0,0,0,1,1,2,2,2,2
Column E: 0,0,0,0,2,2,2,2,2
and so on, numbers in any order like
Column Z: 1,1,2,0,1,1,1,1,2
The closest I can get is with this:
=IF(SUM(A$1:A1)>=10,0,IF(SUM(A$1:A1)=9,1,IF(SUM(A$1:A1)=8,2,RANDBETWEEN(1,2))))
Put it in A2 and copy down and over. It must go in row 2 or it will cause a circular reference.
It fills the column with 1 or 2 till it sums to 10, then the rest are zeros.
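If you want to convince yourself that this rule always lands on exactly 10, here is a small Python simulation of the same logic. It assumes 9 cells per column (rows 2 to 10) and an empty row 1; `fill_column` is just a name for this sketch.

```python
import random


def fill_column(rng, cells=9):
    """Simulate the formula: each cell looks only at the sum of the cells above it."""
    column = []
    for _ in range(cells):
        running = sum(column)
        if running >= 10:
            value = 0                  # target reached, pad with zeros
        elif running == 9:
            value = 1                  # exactly one short
        elif running == 8:
            value = 2                  # exactly two short
        else:
            value = rng.randint(1, 2)  # RANDBETWEEN(1,2)
        column.append(value)
    return column


rng = random.Random(0)
for _ in range(3):
    print(fill_column(rng))
```

Even if every random draw is 1, the running sum is 8 after eight cells, so the ninth cell is forced to 2 and the column still totals 10.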
Edit
This is about as random as I can get; this will allow 0s randomly:
=IF(SUM(A$1:A1)>=10,0,IF(SUM(A$1:A1)=9,1,IF(SUM(A$1:A1)=8,2,IF(AND(SUM(A$1:A1)<=ROW()-2,ROW()>5),2,RANDBETWEEN(0,2)))))
There are only 5 possible combinations of 9 numbers 0, 1 and 2 (disregarding order) where the total = 10.
2,2,2,2,2,0,0,0,0
2,2,2,2,1,1,0,0,0
2,2,2,1,1,1,1,0,0
2,2,1,1,1,1,1,1,0
2,1,1,1,1,1,1,1,1
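The count of five is easy to confirm by brute force. A short Python sketch (the function name `templates` is only for illustration): for each possible number of 2s, the number of 1s and 0s is forced by the sum and the cell count.

```python
def templates(cells=9, total=10):
    """List every multiset of 0/1/2 values with the given length and sum."""
    found = []
    for twos in range(cells + 1):
        ones = total - 2 * twos        # forced by the required sum
        zeros = cells - twos - ones    # forced by the number of cells
        if ones >= 0 and zeros >= 0:
            found.append([2] * twos + [1] * ones + [0] * zeros)
    return found


for combo in templates():
    print(combo)
```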
Put those combinations in a spreadsheet:
╔════╦══════════════════════╤═════════╤═════════╤═════════╤═════════╕
║ ║ A │ B │ C │ D │ E │
╠════╬══════════════════════╪═════════╪═════════╪═════════╪═════════╡
║ 1 ║ CORRECT COMBINATIONS │
╟────╫──────────────────────┼─────────┼─────────┼─────────┼─────────┤
║ 2 ║ Group 1 │ Group 2 │ Group 3 │ Group 4 │ Group 5 │
╟────╫──────────────────────┼─────────┼─────────┼─────────┼─────────┤
║ 3 ║ 2 │ 2 │ 2 │ 2 │ 2 │
╟────╫──────────────────────┼─────────┼─────────┼─────────┼─────────┤
║ 4 ║ 2 │ 2 │ 2 │ 2 │ 1 │
╟────╫──────────────────────┼─────────┼─────────┼─────────┼─────────┤
║ 5 ║ 2 │ 2 │ 2 │ 1 │ 1 │
╟────╫──────────────────────┼─────────┼─────────┼─────────┼─────────┤
║ 6 ║ 2 │ 2 │ 1 │ 1 │ 1 │
╟────╫──────────────────────┼─────────┼─────────┼─────────┼─────────┤
║ 7 ║ 2 │ 1 │ 1 │ 1 │ 1 │
╟────╫──────────────────────┼─────────┼─────────┼─────────┼─────────┤
║ 8 ║ 0 │ 1 │ 1 │ 1 │ 1 │
╟────╫──────────────────────┼─────────┼─────────┼─────────┼─────────┤
║ 9 ║ 0 │ 0 │ 1 │ 1 │ 1 │
╟────╫──────────────────────┼─────────┼─────────┼─────────┼─────────┤
║ 10 ║ 0 │ 0 │ 0 │ 1 │ 1 │
╟────╫──────────────────────┼─────────┼─────────┼─────────┼─────────┤
║ 11 ║ 0 │ 0 │ 0 │ 0 │ 1 │
╙────╨──────────────────────┴─────────┴─────────┴─────────┴─────────┘
Use RAND() to produce 9 random numbers in a column (say cells G3:G11).
Use RANK(G3,$G$3:$G$11) to get a randomly ordered list of the numbers 1-9 in the neighbouring column.
Use RANDBETWEEN(1,5) to randomly choose one of the 5 allowed number combinations (say in cell I2).
Use INDEX to reference the cell in the randomly selected column (1-5) and the randomly ordered row (1-9), from within the 9x5 region of allowed values. E.g. in cell I3: =INDEX($A$3:$E$11,H3,$I$2)
You can also combine the RANK() into the INDEX function.
╔════╦═══════════════════════╤══════╤════════╤═══╤═══════════════════════╤════════╕
║ ║ G │ H │ I │ J │ K │ L │
╠════╬═══════════════════════╪══════╪════════╪═══╪═══════════════════════╪════════╡
║ 1 ║ │ │ group: │ │ │ group: │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 2 ║ RANDOM number (order) │ rank │ 3 │ │ RANDOM number (order) │ 4 │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 3 ║ 0.04 │ 8 │ 0 │ │ 0.92 │ 2 │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 4 ║ 0.13 │ 7 │ 1 │ │ 0.79 │ 1 │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 5 ║ 0.9 │ 1 │ 2 │ │ 0.2 │ 0 │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 6 ║ 0.36 │ 6 │ 1 │ │ 0.31 │ 1 │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 7 ║ 0.49 │ 5 │ 1 │ │ 0.98 │ 2 │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 8 ║ 0.89 │ 2 │ 2 │ │ 0.65 │ 1 │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 9 ║ 0 │ 9 │ 0 │ │ 0.68 │ 1 │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 10 ║ 0.84 │ 3 │ 2 │ │ 0.57 │ 1 │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 11 ║ 0.65 │ 4 │ 1 │ │ 0.28 │ 1 │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 12 ║ │ │ │ │ │ │
╟────╫───────────────────────┼──────┼────────┼───┼───────────────────────┼────────┤
║ 13 ║ │ │ 10 │ │ │ 10 │
╙────╨───────────────────────┴──────┴────────┴───┴───────────────────────┴────────┘
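The steps above can be sketched in Python as follows. This mirrors the RAND / RANK / RANDBETWEEN / INDEX layout rather than being the shortest way to shuffle; `TEMPLATES` and `random_column` are names chosen for this example, and ties between random keys are ignored because they are practically impossible.

```python
import random

# The five allowed combinations, one list per "Group" column.
TEMPLATES = [
    [2, 2, 2, 2, 2, 0, 0, 0, 0],
    [2, 2, 2, 2, 1, 1, 0, 0, 0],
    [2, 2, 2, 1, 1, 1, 1, 0, 0],
    [2, 2, 1, 1, 1, 1, 1, 1, 0],
    [2, 1, 1, 1, 1, 1, 1, 1, 1],
]


def random_column(rng):
    """Pick a template at random and read it back in a random row order."""
    template = rng.choice(TEMPLATES)           # like RANDBETWEEN(1,5)
    keys = [rng.random() for _ in template]    # like the RAND() column
    # Descending rank of each key, 1-based, like RANK(G3,$G$3:$G$11).
    ranks = [1 + sum(other > key for other in keys) for key in keys]
    return [template[rank - 1] for rank in ranks]  # like INDEX(..., rank, group)


rng = random.Random(42)
column = random_column(rng)
print(column, sum(column))
```

Because the ranks are a permutation of 1 to 9, every value of the chosen template is used exactly once, so the column sum stays at 10.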
Here's a randomized solution for you. First, create a table of possible sets. Given your constraints, there are only 5 possible sets of solutions. I put this table in cells B2:F10, with the headers in row 1. Note that this table can go anywhere, even on a different sheet if preferred. In a final product, I would probably hide these rows. Anyway, it looks like this:
Next, because you want a random number of columns, in cell A12 I put in a header called # of Columns and in cell B12 is this formula (feel free to adjust the upper and lower bounds to what you're looking for, this is just a random number between 3 and 10): =RANDBETWEEN(3,10)
Now we can setup our randomized columns and what sets they use:
In cell B14 and copied right (to the maximum number of columns defined in the previous formula, so in this example it goes to K because B:K is 10 columns), use this formula:
=IF(COLUMN(A14)>$B$12,"","Column "&COLUMN(A14))
In cell B15 and copied right is this formula:
=IF(B14="","",INDEX($B$1:$F$1,,RANDBETWEEN(1,5)))
In cell B16 and copied right and down for 9 rows (so in this example it is copied to K24) is this formula:
=IF(B$14="","",INDEX($B$2:$F$10,MATCH(LARGE(B$26:B$34,ROW(B1)),B$26:B$34,0),MATCH(B$15,$B$1:$F$1,0)))
Finished, it will look like this (note that before completing the next step of this answer, it will show #NUM! errors, explained below):
You'll notice that third formula references a range we haven't built yet, in rows 26:34. In that range, there is another table full of randomized numbers so that the Sets can get scrambled to give us randomized results. Building that table is very easy. In cell B26 and copied over and down to K34 (again, over to the maximum number of columns and down for 9 rows), is this formula:
=IF(B$14="","",RAND())
Now with the randomizers, you'll get results as shown in the second image, with randomized sets of 9 numbers that sum to 10, consisting of 0s, 1s, and 2s. At this point you can cut/paste the Sets and Randomizers tables to a different sheet if preferred, or simply hide those rows.
Because of the constraint, there are only 5 unique combinations of values to get to 10:
5 twos; 0 ones; 4 zeros
4 twos; 2 ones; 3 zeros
3 twos; 4 ones; 2 zeros
2 twos; 6 ones; 1 zero
1 two; 8 ones; 0 zeros
We pick one of the five possibilities at random, scramble the elements and stuff the results into a column.
Store the templates in Sheet1 and the output in columns A through Z in sheet Sheet2.
In Sheet1:
The code:
Sub croupier()
    Dim Itms(1 To 9) As Variant
    Dim i As Long, J As Long, k As Long
    Dim s1 As Worksheet, s2 As Worksheet
    Set s1 = Sheets("Sheet1")
    Set s2 = Sheets("Sheet2")
    ' Fill columns 1 to 26 (A through Z) of Sheet2
    For i = 1 To 26
        ' Pick one of the five template columns in Sheet1 at random
        J = Application.WorksheetFunction.RandBetween(1, 5)
        For k = 1 To 9
            Itms(k) = s1.Cells(k, J).Value
        Next k
        Call Shuffle(Itms)
        For k = 1 To 9
            s2.Cells(k, i).Value = Itms(k)
        Next k
    Next i
End Sub
Sub Shuffle(InOut() As Variant)
    Dim Hi As Long, Low As Long, i As Long, J As Long
    Dim tempF As Double, temp As Variant
    Hi = UBound(InOut)
    Low = LBound(InOut)
    ReDim Helper(Low To Hi) As Double
    Randomize
    ' Give every element a random sort key
    For i = Low To Hi
        Helper(i) = Rnd
    Next i
    ' Gap-based sort passes on the keys (the gap J halves each round),
    ' swapping the items in step with their keys
    J = (Hi - Low + 1) \ 2
    Do While J > 0
        For i = Low To Hi - J
            If Helper(i) > Helper(i + J) Then
                tempF = Helper(i)
                Helper(i) = Helper(i + J)
                Helper(i + J) = tempF
                temp = InOut(i)
                InOut(i) = InOut(i + J)
                InOut(i + J) = temp
            End If
        Next i
        For i = Hi - J To Low Step -1
            If Helper(i) > Helper(i + J) Then
                tempF = Helper(i)
                Helper(i) = Helper(i + J)
                Helper(i + J) = tempF
                temp = InOut(i)
                InOut(i) = InOut(i + J)
                InOut(i + J) = temp
            End If
        Next i
        J = J \ 2
    Loop
End Sub
Sample Sheet2: