Is it possible to parallelize or unroll this loop? - multithreading

I am trying to see if I can improve the performance of the following loop in C++, which uses two-dimensional vectors (_external and _Table) and has a loop-carried dependency on the previous iteration. Additionally, the innermost loop computes an index that makes the read of _Table on the right-hand side non-sequential.
int N = 8000;
int M = 400;
int P = 100;
for(int i = 1; i <= N; i++){
for(int j = 0; j < M; j++){
for(int k =0; k < P; k++){
int index = _external.at(j).at(k);
_Table.at(j).at(i) += _Table.at(index).at(i-1);
}
}
}
What can I do to improve the performance of a loop like this?

Well it looks to me like the order in which these statements:
int index = _external.at(j).at(k);
_Table.at(j).at(i) += _Table.at(index).at(i-1);
are executed is critical to correctness. (That is, if the iteration order for i, j, k changes, then the results will be different ... and incorrect.)
So I think you are only left with micro-optimizations, like hoisting the expressions _Table.at(j).at(i) and _external.at(j) out of the innermost loop.
Consider this:
for(int k =0; k < P; k++){
int index = _external.at(j).at(k);
_Table.at(j).at(i) += _Table.at(index).at(i-1);
}
This loop is repeatedly adding numbers to _Table.at(j).at(i). Since (by inspection) _Table.at(index).at(i-1) must be reading from a different cell of the table (because of i-1 versus i), you could do this:
int temp = 0;
for(int k =0; k < P; k++){
int index = _external.at(j).at(k);
temp += _Table.at(index).at(i-1);
}
_Table.at(j).at(i) += temp;
This will reduce the number of calls to at, and may also improve cache performance a bit.
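As a quick sanity check of the rewrite above, here is a minimal Python model of the loop (not the C++ itself). It assumes integer table entries and that every value in _external is a valid row index; the names update_original, update_hoisted and make_inputs are hypothetical and only serve the illustration. It runs both forms on the same small random input so the resulting tables can be compared.

```python
import random


def update_original(table, external):
    """Mirror of the question's triple loop: accumulate directly into table[j][i]."""
    n_steps = len(table[0]) - 1
    for i in range(1, n_steps + 1):
        for j in range(len(external)):
            for index in external[j]:
                table[j][i] += table[index][i - 1]


def update_hoisted(table, external):
    """Same computation, but the inner sum goes into a local accumulator first."""
    n_steps = len(table[0]) - 1
    for i in range(1, n_steps + 1):
        for j, row_indices in enumerate(external):
            temp = 0
            for index in row_indices:
                # Reads column i-1 only, so it never sees the cell being updated.
                temp += table[index][i - 1]
            table[j][i] += temp


def make_inputs(m=5, p=3, n_steps=4, seed=0):
    """Small random stand-ins for _Table (m rows, n_steps+1 columns) and _external (m x p)."""
    rng = random.Random(seed)
    external = [[rng.randrange(m) for _ in range(p)] for _ in range(m)]
    table = [[rng.randrange(10) for _ in range(n_steps + 1)] for _ in range(m)]
    return table, external


if __name__ == "__main__":
    table, external = make_inputs()
    a = [row[:] for row in table]
    b = [row[:] for row in table]
    update_original(a, external)
    update_hoisted(b, external)
    print(a == b)
```

The two forms agree because, for a fixed i, every read comes from column i-1 and every write goes to column i.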

Related

I'm not able to understand logic of coin changing problem in o(sum) space complexity

I'm having difficulty understanding the O(sum) space solution to the coin change problem.
The problem statement is:
You are given a set of coins A. In how many ways can you make sum B, assuming you have an infinite supply of each coin in the set?
NOTE:
Coins in set A will be unique. Expected space complexity of this problem is O(B).
The solution is:
int count( int S[], int m, int n )
{
int table[n+1];
memset(table, 0, sizeof(table));
table[0] = 1;
for(int i=0; i<m; i++)
for(int j=S[i]; j<=n; j++)
table[j] += table[j-S[i]];
return table[n];
}
Can someone explain this code to me?
First, let's identify the parameters and variables used in the function:
Parameters:
S contains the denominations of all m coins, i.e. each element holds the value of one coin.
m represents the number of coin denominations. Essentially, it's the length of array S.
n represents the sum B to be achieved.
Variables:
table: Element i in array table contains the number of ways sum i can be achieved with the given coins. table[0] = 1 because there is a single way to achieve a sum of 0 (not using any coin).
i loops through each coin.
Logic:
The number of ways to achieve a sum j = sum of the following:
number of ways to achieve a sum of j - S[0]
number of ways to achieve a sum of j - S[1]
...
number of ways to achieve a sum of j - S[m-1] (S[m-1] is the value of the mth coin)
I did not completely decipher nor validate the rest of the code, but I hope this is a step in the right direction.
Added comments to code:
#include <stdio.h>
#include <string.h>
int count( int S[], int m, int n )
{
int table[n+1];
memset(table, 0, sizeof(table));
table[0] = 1;
for(int i=0; i<m; i++) // Loop through all of the coins
for(int j=S[i]; j<=n; j++) // Achieve sum j between the value of S[i] and n.
table[j] += table[j-S[i]]; // Add to the number of ways to achieve sum j the number of ways to achieve sum j - S[i]
return table[n];
}
int main() {
int S[] = {1, 2};
int m = 2;
int n = 3;
int c = count(S, m, n);
printf("%d\n", c);
}
Notes:
The code avoids repeats: 3 = 1+1+1 or 1+2 (2 ways, instead of 3 if 2+1 were also counted).
No dependence on the order of the coins in terms of value.
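The two notes above follow from the loop order: coins are on the outer loop, so each coin is fully accounted for before the next one is introduced. A minimal Python sketch of that point is below; count_combinations mirrors the C function, while count_ordered is a hypothetical variant (not from the original code) with the loops swapped, shown only for contrast because it counts 1+2 and 2+1 separately.

```python
def count_combinations(coins, target):
    """Coins on the outer loop: each multiset of coins is counted once."""
    table = [0] * (target + 1)
    table[0] = 1  # one way to make 0: use no coins
    for coin in coins:
        for amount in range(coin, target + 1):
            table[amount] += table[amount - coin]
    return table[target]


def count_ordered(coins, target):
    """Amounts on the outer loop: every ordering of the coins is counted."""
    table = [0] * (target + 1)
    table[0] = 1
    for amount in range(1, target + 1):
        for coin in coins:
            if coin <= amount:
                table[amount] += table[amount - coin]
    return table[target]


if __name__ == "__main__":
    print(count_combinations([1, 2], 3))  # 1+1+1 and 1+2
    print(count_combinations([2, 1], 3))  # same count with the coins reordered
    print(count_ordered([1, 2], 3))       # 1+1+1, 1+2 and 2+1
```

Both functions use a single table of size target + 1, so the space stays O(B) either way; only the meaning of the count changes.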

OpenMP in Biham-Middleton-Levine BML model

I've got a serial version of BML and I'm trying to write a parallel one with OpenMP. Basically, my main function has a loop that calls two functions, one for horizontal and one for vertical moves, like this:
for (s = 0; s < nmovss; s++) {
horizontal_movs(grid, N);
copy_sides(grid, N);
cur = 1-cur;
vertical_movs(grid, N);
copy_sides(grid, N);
cur = 1-cur;
}
Where cur is the current grid. Then horizontal and vertical functions are similar and have a nested loop:
for(i = 1; i <= n; i++) {
for(j = 1; j <= n+1; j++) {
if(grid[cur][i][j-1] == LR && grid[cur][i][j] == EMPTY) {
grid[1-cur][i][j-1] = EMPTY;
grid[1-cur][i][j] = LR;
}
else {
grid[1-cur][i][j] = grid[cur][i][j];
}
}
}
The code produces a ppm image at every step, and with a certain input the serial version produces an output that we can assume is correct. But when I use #pragma omp parallel for inside the two functions H and V, the ppm image is split into as many zones as there are threads (e.g. 4).
I suppose the problem is that every thread should run both functions in sequence before terminating, because the movements are strictly connected. I don't know how to do that. If I put the pragma at a higher level, such as before the main loop, there is no speed-up. Obviously the ppm file must not be sliced like the image.
Going on, I tried this solution, which gives me a result identical to the serial code, but I don't exactly understand why:
# pragma omp parallel num_threads(thread_count) default(none) \
shared(grid, n, cur) private(i, j)
for(i = 1; i <= n+1; i++) {
# pragma omp for
for(j = 1; j <= n; j++) {
if(grid[cur][i-1][j] == TB && grid[cur][i][j] == EMPTY) {
grid[1-cur][i-1][j] = EMPTY;
grid[1-cur][i][j] = TB;
}
else {
grid[1-cur][i][j] = grid[cur][i][j];
}
}
}
}
However, if I use just one more thread than the available cores (4), the execution time "explodes" instead of remaining roughly the same.

Longest Common Substring non-DP solution with O(m*n)

The definition of the problem is:
Given two strings, find the longest common substring.
Return the length of it.
I was solving this problem and I think I solved it with O(m*n) time complexity. However I don't know why when I look up the solution, it's all talking about the optimal solution being dynamic programming - http://www.geeksforgeeks.org/longest-common-substring/
Here's my solution, you can test it here: http://www.lintcode.com/en/problem/longest-common-substring/
int longestCommonSubstring(string &A, string &B) {
int ans = 0;
for (int i=0; i<A.length(); i++) {
int counter = 0;
int k = i;
for (int j=0; j<B.length() && k <A.length(); j++) {
if (A[k]!=B[j]) {
counter = 0;
k = i;
} else {
k++;
counter++;
ans = max(ans, counter);
}
}
}
return ans;
}
My idea is simple, start from the first position of string A and see what's the longest substring I can match with string B, then start from the second position of string A and see what's the longest substring I can match....
Is there something wrong with my solution? Or is it not O(m*n) complexity?
Good news: your algorithm is O(mn). Bad news: it doesn't work correctly.
Your inner loop is wrong: it's intended to find the longest initial substring of A[i:] in B, but it works like this:
j = 0
While j < len(B)
Match as much of A[i:] against B[j:]. Call it s.
Remember s if it's the longest so far found.
j += len(s)
This fails to find the longest match. For example, when A = "XXY" and B = "XXXY" and i=0 it'll find "XX" as the longest match instead of the complete match "XXY".
Here's a runnable version of your code (lightly transcribed into C) that shows the faulty result:
#include <string.h>
#include <stdio.h>
int lcs(const char* A, const char* B) {
int al = strlen(A);
int bl = strlen(B);
int ans = 0;
for (int i=0; i<al; i++) {
int counter = 0;
int k = i;
for (int j=0; j<bl && k<al; j++) {
if (A[k]!=B[j]) {
counter = 0;
k = i;
} else {
k++;
counter++;
if (counter >= ans) ans = counter;
}
}
}
return ans;
}
int main(int argc, char**argv) {
printf("%d\n", lcs("XXY", "XXXY"));
return 0;
}
Running this program outputs "2".
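For comparison, here is a minimal Python sketch of one way to get a correct answer in O(mn) time without a DP table. This is a swapped-in technique, not a repair of the loop above: it scans every alignment (diagonal) of A against B and tracks the current run of equal characters. The function name longest_common_substring is an assumption for the example.

```python
def longest_common_substring(a, b):
    """Length of the longest common substring, scanning one alignment at a time."""
    best = 0
    # offset = (index in b) - (index in a); it is constant along one alignment.
    for offset in range(-(len(a) - 1), len(b)):
        run = 0
        start = max(0, -offset)
        stop = min(len(a), len(b) - offset)
        for i in range(start, stop):
            if a[i] == b[i + offset]:
                run += 1
                best = max(best, run)
            else:
                run = 0  # a mismatch ends the current run on this alignment
    return best


if __name__ == "__main__":
    print(longest_common_substring("XXY", "XXXY"))
```

Each pair (i, j) is visited exactly once across all alignments, so the total work is m*n character comparisons with constant extra space.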
Your solution is O(nm) complexity, and if you compare its structure to the provided algorithm it's exactly the same; however, yours does not memoize.
One advantage of the dynamic algorithm provided in the link is that, within the same time complexity class, it can recall different substring lengths in O(1); otherwise, it looks good to me.
This kind of thing will happen from time to time, because storing sub-problem solutions will not always result in a better run time (on the first call) and may give the same complexity class instead (e.g. try to compute the nth Fibonacci number with a dynamic solution and compare that to a tail-recursive solution). Note that in this case, like your case, after the array is filled the first time, it's faster to return an answer on each successive call.

searching for dynamic programming solution

Problem :
There is a stack consisting of N bricks. You and your friend decide to play a game using this stack. In this game, the players alternately remove 1, 2 or 3 bricks from the top, and the numbers on the bricks removed by a player are added to that player's score. You have to play in such a way that you obtain the maximum possible score, given that your friend will also play optimally and that you make the first move.
Input Format
First line will contain an integer T i.e. number of test cases. There will be two lines corresponding to each test case, first line will contain a number N i.e. number of element in stack and next line will contain N numbers i.e. numbers written on bricks from top to bottom.
Output Format
For each test case, print a single line containing your maximum score.
I tried recursion, but it didn't work:
int recurse(int length, int sequence[5], int i) {
if(length - i < 3) {
int sum = 0;
for(i; i < length; i++) sum += sequence[i];
return sum;
} else {
int sum1 = 0;
int sum2 = 0;
int sum3 = 0;
sum1 += recurse(length, sequence, i+1);
sum2 += recurse(length, sequence, i+2);
sum3 += recurse(length, sequence, i+3);
return max(max(sum1,sum2),sum3);
}
}
int main() {
int sequence[] = {0, 0, 9, 1, 999};
int length = 5;
cout << recurse(length, sequence, 0);
return 0;
}
My approach to solving this problem was as follows:
Both players play optimally.
So, the solution is to be built in a manner that need not take the player into account. This is because both players are going to pick the best choice available to them for any given state of the stack of bricks.
The base cases:
Either player, when left with the last one/two/three bricks, will choose to remove all bricks.
For the sake of convenience, let's assume that the array is actually in reverse order (i.e. a[0] is the value of the bottom-most brick in the stack) (This can easily be incorporated by performing a reverse operation on the array.)
So, the base cases are:
# Base Cases
dp[0] = a[0]
dp[1] = a[0]+a[1]
dp[2] = a[0]+a[1]+a[2]
Building the final solution:
Now, in each iteration, a player has 3 choices.
pick brick (i), or,
pick brick (i and i-1) , or,
pick brick (i,i-1 and i-2)
If the player opted for choice 1, the following would result:
player secures a[i] points from the brick (i) (+a[i])
will not be able to procure the points on the bricks removed by the opponent. This value is stored in dp[i-1] (which the opponent will end up scoring by virtue of this choice made by the player).
will surely procure the points on the bricks not removed by the opponent. (+ Sum of all the bricks up until brick (i-1) not removed by opponent )
A prefix array to store the partial sums of points of bricks can be computed as follows:
# build prefix sum array
pre = [a[0]]
for i in range(1,n):
pre.append(pre[-1]+a[i])
And, now, if player opted for choice 1, the score would be:
ans1 = a[i] + (pre[i-1] - dp[i-1])
Similarly, for choices 2 and 3. So, we get:
ans1 = a[i]+ (pre[i-1] - dp[i-1]) # if we pick only ith brick
ans2 = a[i]+a[i-1]+(pre[i-2] - dp[i-2]) # pick 2 bricks
ans3 = a[i]+a[i-1]+a[i-2]+(pre[i-3] - dp[i-3]) # pick 3 bricks
Now, each player wants to maximize this value. So, in each iteration, we pick the maximum among ans1, ans2 and ans3.
dp[i] = max(ans1, ans2, ans3)
Now, all we have to do is to iterate from 3 through to n-1 to get the required solution.
Here is the final snippet in python:
a = map(int, raw_input().split())
n = len(a)  # number of bricks
a.reverse() # so that a[0] is bottom brick of stack
dp = [0 for x1 in xrange(n)]
dp[0] = a[0]
dp[1] = a[0]+a[1]
dp[2] = a[0]+a[1]+a[2]
# build prefix sum array
pre = [a[0]]
for i in range(1,n):
pre.append(pre[-1]+a[i])
for i in xrange(3,n):
# We can pick brick i, (i,i-1) or (i,i-1,i-2)
ans1 = a[i]+ (pre[i-1] - dp[i-1]) # if we pick only ith brick
ans2 = a[i]+a[i-1]+(pre[i-2] - dp[i-2]) # pick 2
ans3 = a[i]+a[i-1]+a[i-2]+(pre[i-3] - dp[i-3]) #pick 3
# both players maximise this value. Doesn't matter who is playing
dp[i] = max(ans1, ans2, ans3)
print dp[n-1]
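The recurrence above can also be wrapped in a function that takes the bricks directly, which makes it easier to try on small stacks. The following is a minimal Python 3 sketch under a few assumptions: the name max_score is hypothetical, the input list is given top to bottom as in the problem statement, and the three explicit base cases are replaced by letting the same formula run from the first brick with dp[0] = 0 for an empty stack.

```python
def max_score(bricks_top_to_bottom):
    """Best total for the player who moves first, with both sides playing optimally."""
    a = list(reversed(bricks_top_to_bottom))  # a[0] is the bottom brick
    n = len(a)

    # pre[i] = a[0] + ... + a[i-1], the total value of the bottom i bricks.
    pre = [0] * (n + 1)
    for i in range(n):
        pre[i + 1] = pre[i] + a[i]

    # dp[i] = best score for whoever is to move when the bottom i bricks remain.
    dp = [0] * (n + 1)
    for i in range(1, n + 1):
        best = None
        for k in (1, 2, 3):
            if k > i:
                break
            taken = pre[i] - pre[i - k]        # value of the k bricks removed now
            leftover = pre[i - k] - dp[i - k]  # what the opponent cannot keep from us
            score = taken + leftover
            if best is None or score > best:
                best = score
        dp[i] = best
    return dp[n]


if __name__ == "__main__":
    print(max_score([0, 0, 9, 1, 999]))
```

On the stack from the question, [0, 0, 9, 1, 999], the first player's best move is to take only the top brick, which forces the opponent to leave the 999 brick within reach.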
At first sight your code seems wrong for a couple of reasons:
The player is not taken into account. You taking a brick and your friend taking a brick are not the same thing (you have to maximize your own score; the combined total is of course always the sum of the numbers on the bricks).
It looks like plain recursion with no memoization, and that approach will explode to exponential computing time (you're using the "brute force" approach, enumerating all possible games).
A dynamic programming approach is clearly possible because the best possible continuation of a game doesn't depend on how you reached a certain state. For the state of the game you'd need:
Who's next to play (you or your friend)
How many bricks are left on the stack
With these two inputs you can compute how much you can collect from that point to the end of the game. There are two cases:
1. It's your turn
You need to try collecting 1, 2 or 3 bricks and recurse on the next game state, where the opponent will have to choose. Of the three cases you keep the highest result.
2. It's the opponent's turn
You need to simulate the collection of 1, 2 or 3 bricks and recurse on the next game state, where you'll have to choose. Of the three cases you keep the lowest result (because the opponent is trying to maximize his/her result, not yours).
At the very beginning of the function you just need to check whether the same game state has been processed before, and when returning from a computation you need to store the result. Thanks to this lookup/memoization the search time will not be exponential, but linear in the number of distinct game states (just 2*N where N is the number of bricks).
In Python:
memory = {}
bricks = [0, 0, 9, 1, 999]
def maxResult(my_turn, index):
key = (my_turn, index)
if key in memory:
return memory[key]
if index == len(bricks):
result = 0
elif my_turn:
result = None
s = 0
for i in range(index, min(index+3, len(bricks))):
s += bricks[i]
x = s + maxResult(False, i+1)
if result is None or x > result:
result = x
else:
result = None
for i in range(index, min(index+3, len(bricks))):
x = maxResult(True, i+1)
if result is None or x < result:
result = x
memory[key] = result
return result
print maxResult(True, 0)
import java.io.*;
import java.util.*;
import java.text.*;
import java.math.*;
import java.util.regex.*;
public class Solution {
public static void main(String[] args){
Scanner sc=new Scanner(System.in);
int noTest=sc.nextInt();
for(int i=0; i<noTest; i++){
int noBrick=sc.nextInt();
ArrayList<Integer> arr=new ArrayList<Integer>();
for (int j=0; j<noBrick; j++){
arr.add(sc.nextInt());
}
long sum[]= new long[noBrick];
sum[noBrick-1]= arr.get(noBrick-1);
for (int j=noBrick-2; j>=0; j--){
sum[j]= sum[j+1]+ arr.get(j);
}
long[] max=new long[noBrick];
if(noBrick>=1)
max[noBrick-1]=arr.get(noBrick-1);
if(noBrick>=2)
max[noBrick-2]=Math.max(arr.get(noBrick-2),max[noBrick-1]+arr.get(noBrick-2)); // keep the long result, no int cast
if(noBrick>=3)
max[noBrick-3]=Math.max(arr.get(noBrick-3),max[noBrick-2]+arr.get(noBrick-3)); // keep the long result, no int cast
if(noBrick>=4){
for (int j=noBrick-4; j>=0; j--){
long opt1= arr.get(j)+sum[j+1]-max[j+1];
long opt2= arr.get(j)+arr.get(j+1)+sum[j+2]-max[j+2];
long opt3= arr.get(j)+arr.get(j+1)+arr.get(j+2)+sum[j+3]-max[j+3];
max[j]= (long)Math.max(opt1,Math.max(opt2,opt3));
}
}
long cost= max[0];
System.out.println(cost);
}
}
}
I tried this using Java, seems to work alright.
Here is a better solution, without recursion, that I found on the internet.
#include <iostream>
#include <fstream>
#include <algorithm>
#define MAXINDEX 10001
using namespace std;
long long maxResult(int a[MAXINDEX], int LENGTH){
long long prefixSum [MAXINDEX] = {0};
prefixSum[0] = a[0];
for(int i = 1; i < LENGTH; i++){
prefixSum[i] += prefixSum[i-1] + a[i];
}
long long dp[MAXINDEX] = {0};
dp[0] = a[0];
dp[1] = dp[0] + a[1];
dp[2] = dp[1] + a[2];
for(int k = 3; k < LENGTH; k++){
long long x = prefixSum[k-1] + a[k] - dp[k-1];
long long y = prefixSum[k-2] + a[k] + a[k-1] - dp[k-2];
long long z = prefixSum[k-3] + a[k] + a[k-1] + a[k-2] - dp[k-3];
dp[k] = max(x,max(y,z));
}
return dp[LENGTH-1];
}
int main(){
int cases;
int bricks[MAXINDEX];
ifstream fin("test.in");
fin >> cases;
for (int i = 0; i < cases; i++){
long n;
fin >> n;
for(int j = 0; j < n; j++) fin >> bricks[j];
reverse(bricks, bricks+n);
cout << maxResult(bricks, n)<< endl;
}
return 0;
}

Longest Common Subsequence for a series of strings

For the Longest Common Subsequence of 2 Strings I have found plenty examples online and I believe that I understand the solution.
What I don't understand is, what is the proper way to apply this problem for N Strings? Is the same solution somehow applied? How? Is the solution different? What?
This problem becomes NP-hard when the input has an arbitrary number of strings. It is tractable only when the input has a fixed number of strings. If the input has k strings, we could apply the same DP technique by using a k-dimensional array to store the optimal solutions of sub-problems.
Reference: Longest common subsequence problem
To find the Longest Common Subsequence (LCS) of 2 strings A and B, you can traverse a 2-dimensional array diagonally, as shown in the link you posted. Every element in the array corresponds to the problem of finding the LCS of the prefixes A' and B' (A cut at its row number, B cut at its column number). This problem can be solved by calculating the value of all elements in the array.
You must be certain that when you calculate the value of an array element, all sub-problems required to calculate that value have already been solved. That is why you traverse the 2-dimensional array diagonally.
This solution can be scaled to finding the longest common subsequence of N strings, but this requires a general way to iterate over an array of N dimensions such that any element is reached only after all the sub-problems it depends on have been solved.
Instead of iterating the N-dimensional array in a special order, you can also solve the problem recursively. With recursion it is important to save the intermediate solutions, since many branches will require the same intermediate solutions. I have written a small example in C# that does this:
string lcs(string[] strings)
{
if (strings.Length == 0)
return "";
if (strings.Length == 1)
return strings[0];
int max = -1;
int cacheSize = 1;
for (int i = 0; i < strings.Length; i++)
{
cacheSize *= strings[i].Length;
if (strings[i].Length > max)
max = strings[i].Length;
}
string[] cache = new string[cacheSize];
int[] indexes = new int[strings.Length];
for (int i = 0; i < indexes.Length; i++)
indexes[i] = strings[i].Length - 1;
return lcsBack(strings, indexes, cache);
}
string lcsBack(string[] strings, int[] indexes, string[] cache)
{
for (int i = 0; i < indexes.Length; i++ )
if (indexes[i] == -1)
return "";
bool match = true;
for (int i = 1; i < indexes.Length; i++)
{
if (strings[0][indexes[0]] != strings[i][indexes[i]])
{
match = false;
break;
}
}
if (match)
{
int[] newIndexes = new int[indexes.Length];
for (int i = 0; i < indexes.Length; i++)
newIndexes[i] = indexes[i] - 1;
string result = lcsBack(strings, newIndexes, cache) + strings[0][indexes[0]];
cache[calcCachePos(indexes, strings)] = result;
return result;
}
else
{
string[] subStrings = new string[strings.Length];
for (int i = 0; i < strings.Length; i++)
{
if (indexes[i] <= 0)
subStrings[i] = "";
else
{
int[] newIndexes = new int[indexes.Length];
for (int j = 0; j < indexes.Length; j++)
newIndexes[j] = indexes[j];
newIndexes[i]--;
int cachePos = calcCachePos(newIndexes, strings);
if (cache[cachePos] == null)
subStrings[i] = lcsBack(strings, newIndexes, cache);
else
subStrings[i] = cache[cachePos];
}
}
string longestString = "";
int longestLength = 0;
for (int i = 0; i < subStrings.Length; i++)
{
if (subStrings[i].Length > longestLength)
{
longestString = subStrings[i];
longestLength = longestString.Length;
}
}
cache[calcCachePos(indexes, strings)] = longestString;
return longestString;
}
}
int calcCachePos(int[] indexes, string[] strings)
{
int factor = 1;
int pos = 0;
for (int i = 0; i < indexes.Length; i++)
{
pos += indexes[i] * factor;
factor *= strings[i].Length;
}
return pos;
}
My code example can be optimized further. Many of the strings being cached are duplicates, and some are duplicates with just one additional character added. This uses more space than necessary when the input strings become large.
On input: "666222054263314443712", "5432127413542377777", "6664664565464057425"
The LCS returned is "54442"
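The same recursive idea can be written more compactly if only the length of the LCS is needed. The sketch below is a minimal Python version, not a port of the C# code: the name lcs_length is hypothetical, the state is a tuple of prefix lengths (one per string), and functools.lru_cache plays the role of the cache array.

```python
from functools import lru_cache


def lcs_length(strings):
    """Length of the longest common subsequence of all the given strings."""
    if not strings:
        return 0
    strings = tuple(strings)

    @lru_cache(maxsize=None)
    def solve(ends):
        # ends[i] = how many leading characters of strings[i] are still in play.
        if any(e == 0 for e in ends):
            return 0
        last = [s[e - 1] for s, e in zip(strings, ends)]
        if all(c == last[0] for c in last):
            # All prefixes end in the same character: use it and shorten every prefix.
            return 1 + solve(tuple(e - 1 for e in ends))
        # Otherwise drop the last character of one prefix at a time and keep the best.
        best = 0
        for i in range(len(ends)):
            shorter = ends[:i] + (ends[i] - 1,) + ends[i + 1:]
            best = max(best, solve(shorter))
        return best

    return solve(tuple(len(s) for s in strings))


if __name__ == "__main__":
    print(lcs_length(["abcde", "ace", "aXcYe"]))
```

The number of cached states is the product of the string lengths, which is why this only stays practical for a small, fixed number of strings, in line with the NP-hardness remark above.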
