Writing a C program to solve an equation numerically - linux

I am trying to solve x-cos(x)=0 numerically.
I need the program to accept one argument on the command line that becomes the desired accuracy of the solution.
The solution should yield an answer for which the expression x-cos(x) evaluates to within +/- the specified accuracy (epsilon) of 0.
The maximum number of iterations should be set to 100.
The program should start with a first guess value of x=0.
The desired accuracy should be accepted in both floating-point and scientific notation formats.
There should be a warning message if too few or too many arguments are supplied, and the program should then exit.
If a solution is found within the maximum number of iterations, the program should print the solution, the accuracy and the number of iterations.
If no solution is achieved within the maximum number of iterations, the program should print a message saying so and then exit.
Finally, find the smallest accuracy (in powers of 10) that can be achieved within the maximum number of iterations.
I know that there are loops involved. I've started it like this:
#include<stdio.h>
#include<math.h>
int
main(void)
{
int MAX_ITERATIONS[100],x=0;
float epsilon;
double epsilon;
x=cos[x];
for (x=0; x<MAX_ITERATIONS; ++x)
if (MAX_ITERATIONS < x)
x=MAX_ITERATIONS[100];
}
I am not sure where to go from here or if I am even on the right track.

Here is some code to help you get started. My philosophy is to make only small changes and to always keep a copy of what was working before. That way, when I break something, I know exactly where I broke it. This code does not do everything you want, but you can make those changes yourself. To compile the code, I used cc -lm progname.c. To execute it, I used ./a.out 0.002.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[]) {
    double delta, x;
    double epsilon = 0.001;              /* default accuracy if none is supplied */
    double previous = 1.0;

    if (argc > 1)
        epsilon = strtof(argv[1], NULL);
    printf("Using epsilon = %12.8f\n", epsilon);

    /* Step x forward until |x - cos(x)| stops shrinking. */
    for (x = 0.1; x < 1.0; x += epsilon) {
        delta = fabs(x - cos(x));
        if (delta < previous)
            previous = delta;
        else
            break;
    }
    printf("%12.8f %12.8f %12.8f\n", x, cos(x), delta);
}
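If you want something closer to the stated requirements (first guess x = 0, at most 100 iterations, accuracy read from the command line, argument checking), a fixed-point iteration x = cos(x) is one straightforward approach. The sketch below is only a starting point under those assumptions; the structure, names and messages are my own, not part of the assignment or of the answer above.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MAX_ITERATIONS 100

int main(int argc, char *argv[])
{
    if (argc != 2) {                          /* exactly one argument expected */
        fprintf(stderr, "usage: %s epsilon\n", argv[0]);
        return 1;
    }
    double epsilon = strtod(argv[1], NULL);   /* accepts 0.001 as well as 1e-3 */
    double x = 0.0;                           /* first guess */

    for (int i = 0; i < MAX_ITERATIONS; ++i) {
        if (fabs(x - cos(x)) < epsilon) {     /* close enough to the root */
            printf("solution %.10f, accuracy %g, iterations %d\n", x, epsilon, i);
            return 0;
        }
        x = cos(x);                           /* fixed-point step x = cos(x) */
    }
    printf("no solution found within %d iterations\n", MAX_ITERATIONS);
    return 0;
}

Since strtod understands both plain decimals and scientific notation, ./a.out 1e-6 and ./a.out 0.000001 behave the same, and the iteration should stay comfortably inside the 100-iteration limit for accuracies down to around 1e-15.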

Related

GMP setting last digit to zero

I'm looking for the fastest way to set the last digit of a positive number l, declared as mpz_t, to zero. I didn't find a function that already does this. For example, 6531489321483 should be changed to 6531489321480.
Update
It appears that subtraction and modulo is the superior method for zeroing out the last digit with mpz_t types. Just as @MarkDickinson and @MarcGlisse pointed out, the asymptotic behavior greatly favors using mpz_tdiv_r_ui (or mpz_fdiv_r_ui) over mpz_tdiv_q_ui followed by mpz_mul_ui. My original benchmarks were on relatively small numbers (25 digits). I retested on a 175-digit number and the sub_mod method was nearly 40% faster.
Test value: 1234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789
Result with div_mul: 1234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456780
Result with sub_mod: 1234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456780
time with division followed by multiplication: 6.145656
time with subtraction and modulo: 4.413998
And with a 350 digit number we see that sub_mod is around 85% faster:
Test value: 12345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789
Result with div_mul: 12345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456780
Result with sub_mod: 12345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456789123456789876543212345678912345678987654321234567891234567898765432123456780
time with division followed by multiplication: 10.256122
time with subtraction and modulo: 5.522990
It should be noted that whether we use mpz_tdiv_r_ui or mpz_fdiv_r_ui, the results were almost identical.
Since the sub_mod method was only marginally slower with smaller numbers, it seems reasonable to just use it in all cases.
It would be nice to test this on different compilers. I'm currently using clang 5.0.1.
Original
Benchmarks on my machine show that division followed by multiplication is faster than finding the remainder via modulo operator and subtracting.
#include <stdio.h>
#include <time.h>
#include <gmp.h>

void div_mul(mpz_t x) {
    mpz_tdiv_q_ui(x, x, 10u);
    mpz_mul_ui(x, x, 10u);
}

// Maybe this could be simpler?
void sub_mod(mpz_t x, mpz_t y) {
    // N.B. mpz_mod_ui is equivalent to mpz_fdiv_r_ui. Changed to
    // mpz_tdiv_r_ui for consistency with div_mul.
    mpz_tdiv_r_ui(y, x, 10u);
    mpz_sub(x, x, y);
}
Main:
int main() {
    mpz_t testVal;
    mpz_init(testVal);
    mpz_set_str(testVal, "1234567898765432123456789", 10);
    gmp_printf("Test value: %Zd\n", testVal);

    mpz_t x;
    mpz_t y;
    mpz_init(x);
    mpz_init(y);

    mpz_set(x, testVal);
    div_mul(x);
    gmp_printf("Result with div_mul: %Zd\n", x);

    mpz_set(x, testVal);
    sub_mod(x, y);
    gmp_printf("Result with sub_mod: %Zd\n", x);

    const int limit = 100000000;

    const double checkPoint0 = (double) clock() / CLOCKS_PER_SEC;
    for (int i = 0; i < limit; ++i) {
        mpz_set(x, testVal);
        div_mul(x);
    }
    const double checkPoint1 = (double) clock() / CLOCKS_PER_SEC;
    const double time_div_mul = checkPoint1 - checkPoint0;
    printf("time with division followed by multiplication: %f\n", time_div_mul);

    const double checkPoint2 = (double) clock() / CLOCKS_PER_SEC;
    for (int i = 0; i < limit; ++i) {
        mpz_set(x, testVal);
        sub_mod(x, y);
    }
    const double checkPoint3 = (double) clock() / CLOCKS_PER_SEC;
    const double time_sub_mod = checkPoint3 - checkPoint2;
    printf("time with subtraction and modulo: %f\n", time_sub_mod);

    mpz_clear(testVal);
    mpz_clear(x);
    mpz_clear(y);
    return 0;
}
Output:
Test value: 1234567898765432123456789
Result with div_mul: 1234567898765432123456780
Result with sub_mod: 1234567898765432123456780
time with division followed by multiplication: 2.941251
time with subtraction and modulo: 3.171949
I suspect one reason the latter method is slightly slower is that it needs two variables, since compound operations on a single line are not available in the C API. If we could use gmpxx, we could write x - x % 10.
Another thought as to why the first method is faster is that div_mul involves two operations with unsigned integers, while sub_mod involves an operation with an unsigned integer followed by an operation on an mpz_t.
I tried to get this reproduced on ideone.com but could not get gmp.h loaded, so I opted to implement a similar benchmark with type long long int just for fun. You will note the presence of volatile and that the limit is one billion instead of one hundred million as seen above. The volatile was needed to keep the for loop from being optimized away.
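That long long version is not reproduced here, but a rough sketch of the same idea (a volatile result so the loops are not optimized away, and a one-billion-iteration limit) might look like this; the test value and output format are placeholders rather than the original ideone code.

#include <stdio.h>
#include <time.h>

int main(void) {
    const long long limit = 1000000000LL;            /* one billion iterations */
    volatile long long testVal = 1234567898765432123LL;
    volatile long long x = 0;                        /* volatile keeps the loops from being optimized away */

    double t0 = (double) clock() / CLOCKS_PER_SEC;
    for (long long i = 0; i < limit; ++i)
        x = (testVal / 10) * 10;                     /* division followed by multiplication */
    double t1 = (double) clock() / CLOCKS_PER_SEC;
    printf("div_mul: %lld, time %f\n", (long long) x, t1 - t0);

    double t2 = (double) clock() / CLOCKS_PER_SEC;
    for (long long i = 0; i < limit; ++i)
        x = testVal - testVal % 10;                  /* subtraction and modulo */
    double t3 = (double) clock() / CLOCKS_PER_SEC;
    printf("sub_mod: %lld, time %f\n", (long long) x, t3 - t2);

    return 0;
}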
Wouldn't converting the number to a string and changing the last character be the fastest way?

CUDA Programming: Compilation Error

I am making a CUDA program that implements the data-parallel prefix sum calculation operating on N numbers. My code is also supposed to generate the numbers on the host using a random number generator. However, I always seem to run into an "unrecognized token" and an "expected a declaration" error on the closing bracket of int main when attempting to compile. I am running the code on Linux.
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#include <math.h>
__global__ void gpu_cal(int *a,int i, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if(tid>=i && tid < n) {
a[tid] = a[tid]+a[tid-i];
}
}
int main(void)
{
int key;
int *dev_a;
int N=10;//size of 1D array
int B=1;//blocks in the grid
int T=10;//threads in a block
do{
printf ("Some limitations:\n");
printf (" Maximum number of threads per block = 1024\n");
printf (" Maximum sizes of x-dimension of thread block = 1024\n");
printf (" Maximum size of each dimension of grid of thread blocks = 65535\n");
printf (" N<=B*T\n");
do{
printf("Enter size of array in one dimension, currently %d\n",N);
scanf("%d",&N);
printf("Enter size of blocks in the grid, currently %d\n",B);
scanf("%d",&B);
printf("Enter size of threads in a block, currently %d\n",T);
scanf("%d",&T);
if(N>B*T)
printf("N>B*T, this will result in an incorrect result generated by GPU, please try again\n");
if(T>1024)
printf("T>1024, this will result in an incorrect result generated by GPU, please try again\n");
}while((N>B*T)||(T>1024));
cudaEvent_t start, stop; // using cuda events to measure time
float elapsed_time_ms1, elapsed_time_ms3;
int a[N],gpu_result[N];//for result generated by GPU
int cpu_result[N];//CPU result
cudaMalloc((void**)&dev_a,N * sizeof(int));//allocate memory on GPU
int i,j;
srand(1); //initialize random number generator
for (i=0; i < N; i++) // load array with some numbers
a[i] = (int)rand() ;
cudaMemcpy(dev_a, a , N*sizeof(int),cudaMemcpyHostToDevice);//load data from host to device
cudaEventCreate(&start); // instrument code to measure start time
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
//GPU computation
for(j=0;j<log(N)/log(2);j++){
gpu_cal<<<B,T>>>(dev_a,pow(2,j),N);
cudaThreadSynchronize();
}
cudaMemcpy(gpu_result,dev_a,N*sizeof(int),cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0); // instrument code to measue end time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms1, start, stop );
printf("\n\n\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1); // print out execution time
//CPU computation
cudaEventRecord(start, 0);
for(i=0;i<N;i++)
{
cpu_result[i]=0;
for(j=0;j<=i;j++)
{
cpu_result[i]=cpu_result[i]+a[j];
}
}
cudaEventRecord(stop, 0); // instrument code to measue end time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms3, start, stop );
printf("Time to calculate results on CPU: %f ms.\n\n", elapsed_time_ms3); // print out execution time
//Error check
for(i=0;i < N;i++) {
if (gpu_result[i] != cpu_result[i] ) {
printf("ERROR!!! CPU and GPU create different answers\n");
break;
}
}
//Calculate speedup
printf("Speedup on GPU compared to CPU= %f\n", (float) elapsed_time_ms3 / (float) elapsed_time_ms1);
printf("\nN=%d",N);
printf("\nB=%d",B);
printf("\nT=%d",T);
printf("\n\n\nEnter '1' to repeat, or other integer to terminate\n");
scanf("%d",&key);
}while(key == 1);
cudaFree(dev_a);//deallocation
return 0;
}​
The very last } in your code is a Unicode character. If you delete this entire line, and retype the }, the error will be gone.
There are two compile errors in your code.
First, the last closing bracket is a Unicode character, so you should either re-save the file with plain ASCII characters or delete and retype that final bracket.
Second, the variable N is used as an array size on this line: int a[N],gpu_result[N];//for result generated by GPU
N was declared as a plain int, and a non-constant array size is not accepted by every C or C++ compiler, so you should change the declaration of N to const int N.
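As a side note, if you want to locate the stray character yourself, a small helper like the sketch below (my own addition, not part of either answer) reads a source file and reports any non-ASCII bytes together with their offsets. Running it on your .cu file should point straight at the offending byte near the final brace.

#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s sourcefile\n", argv[0]);
        return 1;
    }
    FILE *fp = fopen(argv[1], "rb");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    long offset = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        if (c > 0x7F)        /* anything outside plain 7-bit ASCII */
            printf("non-ASCII byte 0x%02X at offset %ld\n", c, offset);
        ++offset;
    }
    fclose(fp);
    return 0;
}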

time.h regarding doubts please explain

Could someone clear up the following doubts regarding this code:
1) What is this clock()? How does it work?
2) I have never seen ; after a for statement. When do I need to use these semicolons after for, and will it loop through only the next line of code, int tick = clock();?
3) Is he converting tick to float by doing this: (float)tick? Can I do this with every variable I first declare as short, int, long or long long, and then change it to float just by writing (float)variable_name?
Thanks in advance
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main()
{
    int i;
    for (i = 0; i < 10000000; i++);
    int tick = clock();
    printf("%f", (float)tick / CLOCKS_PER_SEC);
    return 0;
}
1) What is this clock()? How does it work?
It reads the "CPU" clock of your program (how much processor time your program has consumed so far). Note that this is CPU time, not the machine's real-time clock, and its granularity depends on the system: CLOCKS_PER_SEC is commonly 1,000,000, so don't expect cycle-level precision.
2) I have never seen ; after a for statement. When do I need to use these semicolons after for, and will it loop through only the next line of code, int tick = clock();?
No. The ';' means the for statement has an empty body, so the loop just spins on its own and does not include the next line. That is fragile, though, because a loop that does nothing can be optimized away entirely, so the number of cycles it takes could drop to zero.
So the author hopes it counts from 0 to 10 million as a crude delay.
3) Is he converting tick to float by doing this: (float)tick? Can I do this with every variable I first declare as short, int, long or long long, and then change it to float just by writing (float)variable_name?
Yes. There are three problems here, though:
a) he uses int for tick, which is wrong; he should use clock_t.
b) clock_t is likely larger than 32 bits, and a float cannot even represent all 32-bit integers exactly.
c) printf() takes a double anyway, so converting to float gains nothing: the float will immediately be converted back to a double.
So this would be better:
clock_t tick = clock();
printf("%f", (double)tick / CLOCKS_PER_SEC);
Now, this is fine if you want to know the total time your program has taken to run so far. If you'd like to know how long a certain piece of code takes, read clock() once before and once after that code; the difference gives you the amount of time your code took to run.
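As a small illustration of that before/after pattern (the busy loop below is just a stand-in for whatever code you actually want to time):

#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t before = clock();

    /* ... the code you want to time; a throwaway loop stands in for it here */
    volatile long sink = 0;
    for (long i = 0; i < 10000000L; i++)
        sink += i;

    clock_t after = clock();
    printf("elapsed CPU time: %f seconds\n", (double)(after - before) / CLOCKS_PER_SEC);
    return 0;
}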

Parallel processing a prime finder with openMP

I am trying to construct a prime finder for a bit of C practice. I've got the algorithm down and I've done a bunch of optimisations to make it faster. I then decided to try to parallelize it because, hey, why not! It turns out to be harder than I thought. I can either get all threads running the same work (with the same arguments), or only a single thread will run if I try to supply different arguments to each one. I really have no idea what I'm doing here, but you can see some experimental values I'm using in this code:
// gcc -std=c99 -o multithread multithread.c -fopenmp -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
int pf(unsigned int start, unsigned int limit, unsigned int q);
int main(int argc, char *argv[])
{
    printf("prime finder\n");
    int j, slimits[4] = {1,10000000,20000000,30000000}, elimits[4] = {10000000,20000000,30000000,40000000};
    double startTime = omp_get_wtime();
    #pragma omp parallel shared(slimits, elimits primes)
    {
        #pragma omp for
        for (j = 0; j < 4; j++)
        {
            primes += pf(slimits[j], elimits[j], atoi(argv[2]));
        }
    }
    printf("%d prime numbers found in %.2f seconds.\n\n", primes, omp_get_wtime() - startTime);
    return 0;
}
I haven't included the pf function as it is quite large, but it works on its own; it returns the number of primes found. I'm sure the issue is here somewhere.
Any help would be greatly appreciated!
You have made at least one obvious (to me) and serious mistake. You've declared primes shared and allowed all the threads in the program to update it. You have, thereby, programmed a data race. Nothing in OpenMP (nor in C if I recall correctly) guarantees that += will be implemented atomically. You haven't actually specified what the problem with your program is, or what the problems are, but this must surely be one of them.
I'll tell you how to fix this later but I think there is a more serious underlying design problem you should address first. You seem to have decided that you would have 4 threads running and that you should divide the range of integers to test for primality into 4 and pass one chunk to each thread. Sure, you can make that work but it's not a smart approach to using OpenMP. Nor is it a smart approach to dividing the work of primality testing.
A smarter approach to OpenMP program design is to start off by making no assumptions about the number of threads that will be available to the executing program. Design for any number of threads, do not design a program whose behaviour depends on the number of threads it gets at run-time. Use OpenMP's facilities, specifically the schedule clause, to distribute the workload at run time.
Turning to primality testing. Draw, or at least think about, a scatter plot of points (i,t(i)), where i is an integer and t(i) is the time it takes to determine whether or not i is prime. The pattern in this plot is about as difficult to discern as the pattern in the plot of the occurrence of primes in the integers. In other words, the time to determine the primality of an integer is very unpredictable. It does tend to rise as the integers increase (well, excluding large even integers which I'm sure your test doesn't consider anyway).
One implication of this unpredictability is that if you divide a range of integers into N sub-ranges and give one sub-range to each of N threads you are not giving the threads the same amount of work to do. Indeed, in the range of integers 1..m (any m) there is one integer which takes much longer to test than any other integer in the range, and this time is the irreducible minimum that your program will take. A naive distribution of the range will produce a seriously unbalanced workload.
Here's what I think you should do to fix your program.
First, write a function which tests the primality of a single integer. This will be the basic task for your computation. Call this is_prime. Next, study the schedule clause for the parallel for construct. OpenMP provides a number of task scheduling options, I won't explain them here, you will find plenty of good documentation online. Finally, study also the reduction clause; this provides the solution to the data race you have programmed.
Applying all this I suggest you change
#pragma omp parallel shared(slimits, elimits primes)
{
    #pragma omp for
    for (j = 0; j < 4; j++)
    {
        primes += pf(slimits[j], elimits[j], atoi(argv[2]));
    }
}
to
#pragma omp parallel shared(slimits, elimits, max_int_to_test)
{
    #pragma omp for reduction(+:primes) schedule (dynamic, 10)
    for (j = 3; j < max_int_to_test; j += 2)
    {
        primes += is_prime(j);
    }
}
With any luck my rudimentary C hasn't screwed up the syntax too much.
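To make that concrete, here is a self-contained sketch along the lines suggested above; the is_prime shown is a deliberately simple trial-division test, the default upper limit of 40000000 mirrors the original elimits, and the chunk size of 10 in the schedule clause is just an example value, not something tuned.

// gcc -std=c99 -O2 -fopenmp primes_omp.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Simple trial-division primality test: returns 1 if n is prime, 0 otherwise. */
static int is_prime(unsigned int n)
{
    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;
    for (unsigned int d = 3; (unsigned long long) d * d <= n; d += 2)
        if (n % d == 0)
            return 0;
    return 1;
}

int main(int argc, char *argv[])
{
    unsigned int max_int_to_test = (argc > 1) ? (unsigned int) atoi(argv[1]) : 40000000u;
    int primes = (max_int_to_test > 2) ? 1 : 0;  /* count 2 up front, then test only odd numbers */
    double startTime = omp_get_wtime();

    #pragma omp parallel for reduction(+:primes) schedule(dynamic, 10)
    for (unsigned int j = 3; j < max_int_to_test; j += 2)
        primes += is_prime(j);

    printf("%d prime numbers found in %.2f seconds.\n", primes, omp_get_wtime() - startTime);
    return 0;
}

There is no manual division of the range any more: schedule(dynamic, 10) hands out chunks of ten odd numbers to whichever thread is free, which deals with the load imbalance described above, and the reduction clause removes the data race on primes.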

Thread-safe random number generation for Monte-Carlo integration

I'm trying to write something which very quickly calculates random numbers and can be applied on multiple threads. My current code is:
/* Approximating PI using a Monte-Carlo method. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>

#define N 1000000000 /* As large as possible for increased accuracy */

double random_function(void);

int main(void)
{
    int i = 0;
    double X, Y;
    double count_inside_temp = 0.0, count_inside = 0.0;
    unsigned int th_id = omp_get_thread_num();
    #pragma omp parallel private(i, X, Y) firstprivate(count_inside_temp)
    {
        srand(th_id);
        #pragma omp for schedule(static)
        for (i = 0; i <= N; i++) {
            X = 2.0 * random_function() - 1.0;
            Y = 2.0 * random_function() - 1.0;
            if ((X * X) + (Y * Y) < 1.0) {
                count_inside_temp += 1.0;
            }
        }
        #pragma omp atomic
        count_inside += count_inside_temp;
    }
    printf("Approximation to PI is = %.10lf\n", (count_inside * 4.0)/ N);
    return 0;
}

double random_function(void)
{
    return ((double) rand() / (double) RAND_MAX);
}
This works, but from observing a resource manager I know it's not using all the threads. Does rand() work for multithreaded code? And if not, is there a good alternative? Many thanks. Jack
Is rand() thread safe? Maybe, maybe not:
"The rand() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe."
One test and good learning exercise would be to replace the call to rand() with, say, a fixed integer and see what happens.
The way I think of pseudo-random number generators is as a black box which takes an integer as input and returns an integer as output. For any given input the output is always the same, but there is no pattern in the sequence of numbers and the sequence is uniformly distributed over the range of possible outputs. (This model isn't entirely accurate, but it'll do.) The way you use this black box is to choose a starting number (the seed), then use each output value both in your application and as the input for the next call to the random number generator. There are two common approaches to designing an API:
Two functions, one to set the initial seed (e.g. srand(seed)) and one to retrieve the next value from the sequence (e.g. rand()). The state of the PRNG is stored internally in a sort of global variable. Generating a new random number either will not be thread safe (hard to tell, but the output stream won't be reproducible) or will be slow in multithreaded code (you end up with some serialization around the state value).
An interface where the PRNG state is exposed to the application programmer. Here you typically have three functions: init_prng(seed), which returns some opaque representation of the PRNG state; get_prng(state), which returns a random number and changes the state variable; and destroy_prng(state), which just cleans up allocated memory and so on. PRNGs with this type of API should all be thread safe and run in parallel with no locking, because you are in charge of managing the (now thread-local) state variable.
I generally write in Fortran and use Ladd's implementation of the Mersenne Twister PRNG (that link is worth reading). There are lots of suitable PRNGs in C which expose the state to your control. That kind of PRNG looks good, and using one (with the initialization and destroy calls inside the parallel region and private state variables) should give you a decent speedup.
Finally, it's often the case that PRNGs can be made to perform better if you ask for a whole sequence of random numbers in one go (e.g. the compiler can vectorize the PRNG internals). Because of this, libraries often have something like a get_prng_array(state) function which gives you back an array full of random numbers, as if you had put get_prng in a loop filling the array elements, just more quickly. This would be a second optimization (and would need an extra for loop inside the parallel for loop). Obviously, you don't want to run out of per-thread stack space doing this!
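To make the second, state-exposed style concrete, here is a minimal rework of the pi example from the question using POSIX rand_r, whose state is just an unsigned int owned by the caller. rand_r is a weak generator compared with the Mersenne Twister implementations mentioned above, so treat this purely as an illustration of the per-thread-state pattern rather than a recommendation of rand_r itself.

/* Approximating PI with per-thread PRNG state (sketch using POSIX rand_r). */
// gcc -std=c99 -O2 -fopenmp pi_rand_r.c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 100000000 /* number of sample points */

int main(void)
{
    long long count_inside = 0;

    #pragma omp parallel reduction(+:count_inside)
    {
        /* Each thread owns and seeds its own generator state. */
        unsigned int seed = 1234u + (unsigned int) omp_get_thread_num();

        #pragma omp for schedule(static)
        for (long long i = 0; i < N; i++) {
            double X = 2.0 * ((double) rand_r(&seed) / RAND_MAX) - 1.0;
            double Y = 2.0 * ((double) rand_r(&seed) / RAND_MAX) - 1.0;
            if (X * X + Y * Y < 1.0)
                count_inside++;
        }
    }

    printf("Approximation to PI is = %.10f\n", 4.0 * (double) count_inside / N);
    return 0;
}

Because each thread keeps its state in a private seed variable there is no locking around the generator, so all the threads should stay busy; swapping rand_r for a better per-thread generator only changes the two lines that draw X and Y.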
