Haskell multivariable lambda function for lists

I am confused on how this is computed.
Input: groupBy (\x y -> (x*y `mod` 3) == 0) [1,2,3,4,5,6,7,8,9]
Output: [[1],[2,3],[4],[5,6],[7],[8,9]]
First, do x and y refer to the current and the next element?
Second, is this saying that it will group the elements that equal 0 when modded by 3? If so, how come there are elements in the output that are not equal to 0 when modded by 3?
Found here: http://zvon.org/other/haskell/Outputlist/groupBy_f.html

To answer your second question: We compare two elements by multiplying them and seeing if the product is divisible by 3. "So why are there elements in the output not divisible by 3?" A failing test doesn't filter elements out (that's what filter does); rather, when the predicate fails, the element starts a separate group. When it succeeds, the element joins the current group.
As to your first question, this took me a little while to figure out... x and y aren't two consecutive elements; rather, y is the current element and x is the first element in the current group. (!)
1 * 2 = 2; 2 `mod` 3 = 2; 1 and 2 go in separate groups.
2 * 3 = 6; 6 `mod` 3 = 0; 2 and 3 go in the same group.
2 * 4 = 8; 8 `mod` 3 = 2; 4 gets put in a different group.
...
Notice, on that last line, we're looking at 2 and 4 — not 3 and 4, as you might reasonably expect.
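This falls out of the standard definition of groupBy in Data.List (as in GHC's base library): each group is seeded with an element x, and span extends the group for as long as the predicate relates that seed x to the next element y:
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy _  []     = []
groupBy eq (x:xs) = (x:ys) : groupBy eq zs
  where (ys, zs) = span (eq x) xs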

First, do x and y refer to the current and the next element?
Roughly, yes.
Second, is this saying that it will group the elements that equal 0 when modded by 3? If so, how come there are elements in the output that are not equal to 0 when modded by 3?
The lambda defines a relation between two integers x and y which holds whenever the product x*y is a multiple of 3. Since 3 is prime, either x or y must itself be a multiple of 3.
For the input [1,2,3,4,5,6,7,8,9], it is first checked whether 1 is in relation with 2. This is false, so 1 gets its own singleton group [1]. Then we proceed with 2 and 3: now the relation holds, so 2 and 3 share a group. Next, we check whether 2 and 4 are in relation: this is false, so the group stays [2,3] and gets no larger. Then we proceed with 4 and 5, and so on.
I must confess that I do not like this example very much, since the relation is not an equivalence relation (because it is not transitive). Because of this, the exact result of groupBy is not guaranteed: the implementation might test the relation on 3,4 (true) instead of 2,4 (false), and build a group [2,3,4] instead.
Quoting from the docs:
The predicate is assumed to define an equivalence.
So, once this contract is violated, there are no guarantees on what the output of groupBy might be.
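You can check the non-transitivity directly; rel here is just a name for the lambda above:
rel :: Int -> Int -> Bool
rel x y = (x * y) `mod` 3 == 0

main :: IO ()
main = print (rel 2 3, rel 3 4, rel 2 4)   -- (True,True,False): 2~3 and 3~4 hold, but 2~4 does not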

The groupBy function takes a list and returns a list of lists such that each sublist in the result contains only equal elements, based on the equality function you provide.
In this case, you are trying to find all sublists where, for all sublist elements x and y, mod (x*y) 3 == 0 (and, separately, the sublists where it doesn't). Slightly weird, but there you go. Note that groupBy only looks at adjacent elements, so sort the list first to reduce the number of duplicate groups.

Related

Sum of arrays with repeated indices

How can I add an array of numbers to another array by indices, especially with repeated indices? Like this:
x
1 2 3 4
idx
0 1 0
y
5 6 7
] x add idx;y NB. (1 + 5 + 7) , (2 + 6) , 3 , 4
13 8 3 4
All nouns (x, idx, y) can be millions of items long, and I need a fast 'add' verb.
UPDATE
Solution (thanks to Dan Bron):
cumIdx =: 1 : 0
:
'i z' =. y               NB. unbox the index vector and the update vector
n =. ~. i                NB. nub: the unique indices
x n}~ (n{x) + i u//. z   NB. key (u//.) sums the updates per index, then amend x
)
(1 2 3 4) + cumIdx (0 1 0);(5 6 7)
13 8 3 4
For now, a short answer in the "get it done" mode:
data =. 1 2 3 4
idx =. 0 1 0
updat =. 5 6 7
cumIdx =: adverb define
:
n =. ~. m                NB. nub of the index array (the adverb's noun operand)
y n}~ (n{y) + m +//. x   NB. key-sum the updates per index, then amend y
)
updat idx cumIdx data NB. 13 8 3 4
In brief (each step is run by hand in the transcript after this list):
1. Start by grouping the update array (in your post, y¹) where your index array has the same value, and taking the sum of each group.
2. Accomplish this using the adverb key (/.) with sum (+/) as its verbal argument, deriving a dyadic verb whose arguments are idx on the left and the update array (your y, my updat) on the right.
3. Get the nub (~.) of your index array.
4. Select these (unique) indices from your value array (your x, my data). This will, by definition, have the same length as the group sums we calculated in (1).
5. Add these to the group sums.
6. Now you have your final updates to the data; they and the nub have the same length, so you just merge them into your value array using }, as you did in your code.
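Run by hand on the example nouns above, the steps look like this (nothing new, just the algorithm spelled out):
sums =. idx +//. updat        NB. steps 1-2: per-index sums of the updates
sums
12 6
n =. ~. idx                   NB. step 3: the unique indices
n
0 1
(n { data) + sums             NB. steps 4-5: add the existing values at those indices
13 8
((n { data) + sums) n} data   NB. step 6: merge back into the value array
13 8 3 4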
Since we kept the update array small (never greater than its original length), this should have decent performance on larger inputs, though I haven't run any tests. The only performance drawback is the double computation of the nub of idx (once explicitly with ~. and once implicitly with /.), though since your values are integers, this should be relatively cheap; it's one of J's stronger areas, performance-wise.
¹ I realize renaming your arrays makes this answer more verbose than it needs to be. However, you named your primary data x rather than y (which is the convention), so if I had kept your names, the nouns inside the definition would have had the opposite meanings to the ones outside it when I invoked cumIdx, which I thought would cause greater confusion. For this reason, it's best to keep "primary data" on the right (y) and "control data" on the left (x). You might also consider constraining your use of the special names x, y, u, v, m and n to where they're already implicitly defined by invoking an explicit definition; definitely never change their nameclasses.
This approach also uses key (/.) but is a bit more simplistic. It is likely to use more space than Dan Bron's, especially for big updates.
addByIdx=: {{ (m , i.#y) +//. x,y }}   NB. keys: m for the updates, i.#y for the data itself
updat idx addByIdx data
13 8 3 4

pandas how to flatten a list in a column while keeping list ids for each element

I have the following df,
A id
[ObjectId('5abb6fab81c0')] 0
[ObjectId('5abb6fab81c3'),ObjectId('5abb6fab81c4')] 1
[ObjectId('5abb6fab81c2'),ObjectId('5abb6fab81c1')] 2
I'd like to flatten each list in A and assign its corresponding id to each element in the list, like:
A id
ObjectId('5abb6fab81c0') 0
ObjectId('5abb6fab81c3') 1
ObjectId('5abb6fab81c4') 1
ObjectId('5abb6fab81c2') 2
ObjectId('5abb6fab81c1') 2
I think the comment is coming from a similar question; you can use my original post's approach, or this one:
df.set_index('id').A.apply(pd.Series).stack().reset_index().drop('level_1',1)
Out[497]:
id 0
0 0 1.0
1 1 2.0
2 1 3.0
3 1 4.0
4 2 5.0
5 2 6.0
Or
pd.DataFrame({'id':df.id.repeat(df.A.str.len()),'A':df.A.sum()})
Out[498]:
A id
0 1 0
1 2 1
1 3 1
1 4 1
2 5 2
2 6 2
This probably isn't the most elegant solution, but it works. The idea here is to loop through df (which is why this is likely an inefficient solution), and then loop through each list in column A, appending each item and the id to new lists. Those two new lists are then turned into a new DataFrame.
a_list = []
id_list = []
for index, a, i in df.itertuples():
    for item in a:
        a_list.append(item)
        id_list.append(i)
df1 = pd.DataFrame(list(zip(a_list, id_list)), columns=['A', 'id'])
As I said, inelegant, but it gets the job done. There's probably at least one better way to optimize this, but hopefully it gets you moving forward.
EDIT (April 2, 2018)
I had the thought to run a timing comparison between mine and Wen's code, simply out of curiosity. The two variables are the length of column A, and the length of the list entries in column A. I ran a bunch of test cases, iterating by orders of magnitude each time. For example, I started with A length = 10 and ran through to 1,000,000, at each step iterating through randomized A entry list lengths of 1-10, 1-100 ... 1-1,000,000. I found the following:
Overall, my code is noticeably faster (especially at increasing A lengths) as long as the list lengths are less than ~1,000. As soon as the randomized list length hits the ~1,000 barrier, Wen's code takes over in speed. This was a huge surprise to me! I fully expected my code to lose every time.
Length of column A generally doesn't matter - it simply increases the overall execution time linearly. The only case in which it changed the results was for A length = 10. In that case, no matter the list length, my code ran faster (also strange to me).
Conclusion: If the list entries in A are on the order of a few hundred elements (or fewer), my code is the way to go. But if you're working with huge data sets, use Wen's! Also worth noting that as you hit the 1,000,000 barrier, both methods slow down drastically. I'm using a fairly powerful computer, and each was taking minutes by the end (it actually crashed on the A length = 1,000,000 and list length = 1,000,000 case).
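For the curious, a minimal sketch of how such a comparison can be set up (the generator and sizes here are illustrative only, not the exact benchmark described above):
import random
import timeit
import pandas as pd

def make_df(n_rows, max_len):
    # column A holds lists of random length, mimicking the question's data
    return pd.DataFrame({
        'A': [[random.random() for _ in range(random.randint(1, max_len))]
              for _ in range(n_rows)],
        'id': range(n_rows),
    })

df = make_df(1000, 100)
print(timeit.timeit(
    lambda: pd.DataFrame({'id': df.id.repeat(df.A.str.len()),
                          'A': df.A.sum()}),
    number=10))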
Flattening and unflattening can be done with these functions:
def flatten(df, col):
    col_flat = pd.DataFrame([[i, x] for i, y in df[col].apply(list).iteritems() for x in y], columns=['I', col])
    col_flat = col_flat.set_index('I')
    df = df.drop(col, 1)
    df = df.merge(col_flat, left_index=True, right_index=True)
    return df
Unflattening:
def unflatten(flat_df, col):
    return flat_df.groupby(level=0).agg({**{c: 'first' for c in flat_df.columns}, col: list})
After unflattening we get the same dataframe back, except for column order:
(df.sort_index(axis=1) == unflatten(flatten(df, 'A'), 'A').sort_index(axis=1)).all().all()
>> True
To create a unique index, you can call reset_index after flattening.
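As a side note: on pandas 0.25 or newer, DataFrame.explode does this flattening directly, keeping the id column aligned with each element:
df.explode('A').reset_index(drop=True)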

How does Append (,) work?

Append , says:
x,y appends items of y to items of x after:
1. Reshaping an atomic argument to the shape of the items of the other,
2. Bringing the arguments to a common rank (of at least 1) by repeatedly itemizing (,:) any of lower rank, and
3. Bringing them to a common shape by padding with fill elements in the manner described in Section II B.
What is meant by 1.? Don't steps 2. and 3. already do this? Could 1. be removed from the list with the result staying the same? (I assume it cannot, but I would like to understand why.)
Reshaping an atomic argument to the shape of the items of the other,
This step will repeat an argument, if it is atomic, which is different from "padding with fill elements" (of step 3.). Compare the scalar 5 with the list 1$5 (i.e. a list of one element):
NB. scalar 5, atomic case (step 1. applies), argument is repeated
(i. 2 3), 5
0 1 2
3 4 5
5 5 5
NB. list 1$5, non-atomic case (step 2. and 3. apply), argument is padded
(i. 2 3), 1$5
0 1 2
3 4 5
5 0 0
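For contrast, here is a third case: a list whose shape already matches the items needs neither reshaping (step 1.) nor padding (step 3.); only the itemizing of step 2. applies:
NB. list i.3 = 0 1 2, matching item shape, argument is itemized and appended as a row
(i. 2 3), i. 3
0 1 2
3 4 5
0 1 2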

x-.y And what about intersection?

x-.y includes all items of x except for those that are cells of y
But what if I want to get all items that are cells of x and of y?
I can achieve this by
x -.^:2 y
But it requires running an expensive operation twice.
Is there a better solution?
e. is often useful when working with sets.
x e. y
gives a list of matches:
for each item of x return 1 if it exists in the "set" y, 0 otherwise.
1 2 3 4 e. 5 9 2
0 1 0 0
Then,
x (e. # [) y
selects those elements that do exist in both lists.
1 2 3 4 (e. # [) 5 9 2
2
5 8 (e. # [) i.12
5 8
Doing -. twice is the classic way of implementing intersection in J.
The inefficiency is minor: a constant factor. In general, you should not concern yourself with efficiency issues in J unless they exceed a factor of 2; when you have resource problems, you will generally want to focus on the factor-of-1000-or-greater issues.
Put differently, if ([-.-.) or -.^:2 is too slow for you, then -. alone would also be too slow for you. (This can happen on extremely large data sets where the underlying implementation has been inefficient; recent versions of J have had some work done to correct this issue.)
Disappointing, perhaps, but practical.
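For reference, the double complement on the same data as the e. example above:
1 2 3 4 ([-.-.) 5 9 2
2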

Need Hint for ProjectEuler Problem

What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?
I could easily brute force the solution in an imperative programming language with loops. But I want to do this in Haskell and not having loops makes it much harder. I was thinking of doing something like this:
[n | n <- [1..], d <- [1..20], n `mod` d == 0] !! 0
But I know that won't work, because d = 1 will make the condition True for any n. I need a hint on how to make it so that n `mod` d is calculated for all of [1..20] and verified for all 20 numbers.
Again, please don't give me a solution. Thanks.
As with many of the Project Euler problems, this is at least as much about math as it is about programming.
What you're looking for is the least common multiple of a set of numbers, which happen to be in a sequence starting at 1.
A likely tactic in a functional language is trying to make it recursive based on figuring out the relation between the smallest number divisible by all of [1..n] and the smallest number divisible by all of [1..n+1]. Play with this with some smaller numbers than 20 and try to understand the mathematical relation or perhaps discern a pattern.
Instead of searching until you find such a number, consider a constructive algorithm where, given a set of numbers, you construct the smallest (or least) positive number that is evenly divisible by (aka "is a common multiple of") all of those numbers. Look at the standard algorithms for computing a least common multiple, and consider how Euclid's algorithm (which computes greatest common divisors) might apply.
Can you think of any relationship between two numbers in terms of their greatest common divisor and their least common multiple? How about among a set of numbers?
If you look at it, this is a list-filtering operation: an infinite list of numbers, filtered on whether each number is divisible by all the numbers from 1 to 20.
So what we need is a function that takes an integer and a list of integers and tells whether the integer is divisible by all the numbers in the list:
isDivisible :: [Int] -> Int -> Bool
and then use this with filter as
filter (isDivisible [1..20]) [1..]
Now, as Haskell is a lazy language, you just need to take the required number of items from the above filter result (in your case just one, so head does the job).
I hope this helps you. This is a simple solution, and there will be many other one-line solutions for this too :)
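A minimal sketch of that idea (brute force, so fine for understanding, though, as a later answer notes, far too slow to actually finish for 1 to 20):
isDivisible :: [Int] -> Int -> Bool
isDivisible ds n = all (\d -> n `mod` d == 0) ds

-- head is safe here: the list is infinite and a solution exists
smallest :: Int
smallest = head (filter (isDivisible [1..20]) [1..])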
Alternative answer: You can just take advantage of the lcm function provided in the Prelude.
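For instance, folding lcm over the range (the numeric result is deliberately not shown, since the question asks for hints only):
lcmOneToTwenty :: Integer
lcmOneToTwenty = foldr lcm 1 [1..20]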
For efficiently solving this, go with Don Roby's answer. If you just want a little hint on the brute force approach, translate what you wrote back into english and see how it differs from the problem description.
You wrote something like "filter the product of the positive naturals and the positive naturals from 1 to 20"
what you want is more like "filter the positive naturals by some function of the positive naturals from 1 to 20"
You have to get mathy in this case. Do a foldl through [1..20], starting with an accumulator n = 1. For each number p of that list, you only proceed if p is prime. For that prime p, find the largest integer q such that p^q <= 20, then multiply: n *= p^q. Once the foldl finishes, n is the number you want.
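A sketch of that fold, with a naive trial-division isPrime added for illustration (the helper names are mine, not part of the answer above):
isPrime :: Int -> Bool
isPrime k = k > 1 && all (\d -> k `mod` d /= 0) [2 .. k - 1]

-- for each prime p <= m, multiply in the largest power p^q <= m
lcmUpTo :: Int -> Int
lcmUpTo m = foldl step 1 (filter isPrime [2 .. m])
  where step n p = n * last (takeWhile (<= m) (iterate (* p) p))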
A possible brute force implementation would be
head [n|n <- [1..], all ((==0).(n `mod`)) [1..20]]
but in this case it would take way too long. The all function tests whether a predicate holds for all elements of a list. The section ((==0).(n `mod`)) is short for (\d -> mod n d == 0).
So how could you speed up the calculation? Let's factorize our divisors in prime factors, and search for the highest power of every prime factor:
2 = 2
3 = 3
4 = 2^2
5 = 5
6 = 2 * 3
7 = 7
8 = 2^3
9 = 3^2
10 = 2 * 5
11 = 11
12 = 2^2 * 3
13 = 13
14 = 2 * 7
15 = 3 * 5
16 = 2^4
17 = 17
18 = 2 * 3^2
19 = 19
20 = 2^2 * 5
--------------------------------
max = 2^4 * 3^2 * 5 * 7 * 11 * 13 * 17 * 19
Using this number we have:
all ((==0).(2^4*3^2*5*7*11*13*17*19 `mod`)) [1..20]
--True
Hey, it is divisible by all numbers from 1 to 20. Not very surprising. E.g. it is divisible by 15 because it "contains" the factors 3 and 5, and it is divisible by 16, because it "contains" the factor 2^4. But is it the smallest possible number? Think about it...
