Cross-compiling from Linux to Windows, but the Windows terminal won't stay open? (gcc) - linux

Quick question:
I'm using Ubuntu as my coding environment, and I am trying to write a C program for Windows for school.
The assignment says I have to do something using the system clock, and I decided to make a quick benchmarking program. Here it is:
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int main() {
    int i = 0;
    int p = (int) getpid();
    int n = 0;
    clock_t cstart = clock();
    clock_t cend = 0;

    for (i = 0; i < 100000000; i++) {
        long f = (((i+9)*99)%4)+(8+i*999);
        if (i % p == 0)
            printf("i=%d, f=%ld\n", i, f);
    }

    cend = clock();
    printf("%.3f cpu sec\n", ((double)cend - (double)cstart) * 1.0e-6);
    return 0;
}
When I cross-compile from Ubuntu to Windows using mingw32, it builds fine. However, when I run the program on Windows, two issues appear:
The benchmark runs as expected and takes roughly 5 seconds, yet the timer says it took 0.03 seconds. (This doesn't happen when testing in my Ubuntu VM: if the benchmark takes 5 seconds in real time, the timer says 5 seconds. So obviously, this is an issue.)
Then, once the program is done, the Windows terminal closes immediately.
How do I make the program stay open so you can look at the time for more than a few milliseconds, and how can I make the reported time reflect the benchmark's actual runtime, like it does when I test on Ubuntu?
Thanks!
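For reference, a minimal sketch of how the two fixes could look - dividing by CLOCKS_PER_SEC instead of the hard-coded 1.0e-6 scale (MinGW links against the Microsoft C runtime, where CLOCKS_PER_SEC is 1000, not 1,000,000), and calling getchar() before returning so the console window stays open. This is only a sketch with a stand-in workload, not the original program; note also that on Windows clock() reports elapsed wall-clock time rather than CPU time.

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        clock_t cstart = clock();

        /* stand-in for the benchmark work (volatile so it is not optimized away) */
        volatile long f = 0;
        for (long i = 0; i < 100000000; i++)
            f += i % 7;

        clock_t cend = clock();

        /* Divide by CLOCKS_PER_SEC instead of assuming its value:
           it is 1000000 with glibc but only 1000 with the MSVC runtime. */
        printf("%.3f cpu sec\n", (double)(cend - cstart) / CLOCKS_PER_SEC);

        /* Keep the console window open until Enter is pressed. */
        printf("Press Enter to exit...\n");
        getchar();
        return 0;
    }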

Related

Why does my process take too long to die?

Basically I'm using Linux 2.6.34 on PowerPC (Freescale e500mc). I have a process (a kind of VM that was developed in-house) that uses about 2.25 G of mlocked VM. When I kill it, I notice that it takes upwards of 2 minutes to terminate.
I investigated a little. First, I closed all open file descriptors but that didn't seem to make a difference. Then I added some printk in the kernel and through it I found that all delay comes from the kernel unlocking my VMAs. The delay is uniform across pages, which I verified by repeatedly checking the locked page count in /proc/meminfo. I've checked with programs that allocate that much memory and they all die as soon as I signal them.
What do you think I should check now? Thanks for your replies.
Edit: I had to find a way to share more information about the problem, so I wrote the program below:
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <sys/time.h>

#define MAP_PERM_1 (PROT_WRITE | PROT_READ | PROT_EXEC)
#define MAP_PERM_2 (PROT_WRITE | PROT_READ)
#define MAP_FLAGS (MAP_ANONYMOUS | MAP_FIXED | MAP_PRIVATE)
#define PG_LEN 4096
#define align_pg_32(addr) (addr & 0xFFFFF000)
#define num_pg_in_range(start, end) ((end - start + 1) >> 12)

inline void __force_pgtbl_alloc(unsigned int start)
{
    volatile int *s = (int *) start;
    *s = *s;
}

int __map_a_page_at(unsigned int start, int whichperm)
{
    int perm = whichperm ? MAP_PERM_1 : MAP_PERM_2;
    if (MAP_FAILED == mmap((void *)start, PG_LEN, perm, MAP_FLAGS, 0, 0)) {
        fprintf(stderr,
                "mmap failed at 0x%x: %s.\n",
                start, strerror(errno));
        return 0;
    }
    return 1;
}

int __mlock_page(unsigned int addr)
{
    if (mlock((void *)addr, (size_t)PG_LEN) < 0) {
        fprintf(stderr,
                "mlock failed on page: 0x%x: %s.\n",
                addr, strerror(errno));
        return 0;
    }
    return 1;
}

void sigint_handler(int p)
{
    struct timeval start = {0, 0}, end = {0, 0}, diff = {0, 0};
    gettimeofday(&start, NULL);
    munlockall();
    gettimeofday(&end, NULL);
    timersub(&end, &start, &diff);
    printf("Munlock'd entire VM in %u secs %u usecs.\n",
           diff.tv_sec, diff.tv_usec);
    exit(0);
}

int make_vma_map(unsigned int start, unsigned int end)
{
    int num_pg = num_pg_in_range(start, end);
    if (end < start) {
        fprintf(stderr,
                "Bad range: start: 0x%x end: 0x%x.\n",
                start, end);
        return 0;
    }
    for (; num_pg; num_pg--, start += PG_LEN) {
        if (__map_a_page_at(start, num_pg % 2) && __mlock_page(start))
            __force_pgtbl_alloc(start);
        else
            return 0;
    }
    return 1;
}

void display_banner()
{
    printf("-----------------------------------------\n");
    printf("Virtual memory allocator. Ctrl+C to exit.\n");
    printf("-----------------------------------------\n");
}

int main()
{
    unsigned int vma_start, vma_end, input = 0;
    int start_end = 0; // 0: start; 1: end;

    display_banner();
    // Bind SIGINT handler.
    signal(SIGINT, sigint_handler);

    while (1) {
        if (!start_end)
            printf("start:\t");
        else
            printf("end:\t");
        scanf("%i", &input);
        if (start_end) {
            vma_end = align_pg_32(input);
            make_vma_map(vma_start, vma_end);
        }
        else {
            vma_start = align_pg_32(input);
        }
        start_end = !start_end;
    }
    return 0;
}
As you can see, the program accepts ranges of virtual addresses, each range being defined by a start and an end. Each range is then further subdivided into page-sized VMAs by giving different permissions to adjacent pages. Interrupting the program (using SIGINT) triggers a call to munlockall(), and the time for that call to complete is duly noted.
Now, when I run it on the Freescale e500mc with Linux 2.6.34 over the range 0x30000000-0x35000000, I get a total munlockall() time of almost 45 seconds. However, if I do the same thing with smaller start-end ranges in random order (that is, not necessarily increasing addresses) such that the total number of pages (and locked VMAs) is roughly the same, I observe a total munlockall() time of no more than 4 seconds.
I tried the same thing on x86_64 with Linux 2.6.34 and my program compiled with the -m32 flag, and the variation, though not as pronounced as on PowerPC, is still about 8 seconds for the first case and under a second for the second.
I also tried the program on Linux 2.6.10 on one end and 3.19 on the other, and these monumental differences don't exist there. What's more, munlockall() always completes in under a second.
So, it seems that the problem, whatever it is, exists only around the 2.6.34 version of the Linux kernel.
You said the VM was developed in-house. Does this mean you have access to the source? I would start by checking to see if it has anything to stop it from immediately terminating to avoid data loss.
Otherwise, could you potentially try to provide more information? You may also want to check out https://unix.stackexchange.com/ as they would be better suited to help with any issues the Linux kernel may be having.

Why does this simple program on a shared variable not scale? (no lock)

I'm new to concurrent programming. I implemented a CPU-intensive workload and measured how much speedup I could gain. However, I cannot get any speedup as I increase the number of threads.
The program does the following task:
There's a shared counter to count from 1 to 1000001.
Each thread does the following until the counter reaches 1000001:
increments the counter atomically, then
runs a loop 10000 times.
There are roughly 1000001 * 10000 ≈ 10^10 operations in total to be performed, so I should be able to get good speedup as I increase the number of threads.
Here's how I implemented it:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdatomic.h>

pthread_t workers[8];
atomic_int counter; // a shared counter

void *runner(void *param);

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: ./thread thread_num\n");
        return 1;
    }
    int NUM_THREADS = atoi(argv[1]);
    pthread_attr_t attr;
    counter = 1; // initialize shared counter
    pthread_attr_init(&attr);

    const clock_t begin_time = clock(); // begin timer
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&workers[i], &attr, runner, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(workers[i], NULL);
    const clock_t end_time = clock(); // end timer

    printf("Thread number = %d, execution time = %lf s\n", NUM_THREADS,
           (double)(end_time - begin_time) / CLOCKS_PER_SEC);
    return 0;
}

void *runner(void *param) {
    int temp = 0;
    while (temp < 1000001) {
        temp = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
        for (int i = 1; i < 10000; i++)
            temp % i; // do some CPU intensive work
    }
    pthread_exit(0);
}
However, when I run my program, I cannot get better performance than sequential execution!
gcc-4.9 -std=c11 -pthread -o my_program my_program.c
for i in 1 2 3 4 5 6 7 8; do \
./my_program $i; \
done
Thread number = 1, execution time = 19.235998 s
Thread number = 2, execution time = 20.575237 s
Thread number = 3, execution time = 25.161116 s
Thread number = 4, execution time = 28.278671 s
Thread number = 5, execution time = 28.185605 s
Thread number = 6, execution time = 28.050380 s
Thread number = 7, execution time = 28.286925 s
Thread number = 8, execution time = 28.227132 s
I run the program on a 4-core machine.
Does anyone have suggestions to improve the program? Or any clue why I cannot get speedup?
The only work here that can be done in parallel is the loop:
for (int i = 0; i < 10000; i++)
    temp % i; // do some CPU intensive work
gcc, even at the minimal optimisation level, will not emit any code for the temp%i; void expression (disassemble it and see), so this essentially becomes an empty loop, which executes very fast. The execution time in the case of multiple threads running on different cores will be dominated by the cache line containing your atomic variable ping-ponging between the cores.
You need to make this loop actually do a significant amount of work before you'll see a speed-up.
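As a rough illustration (a sketch, not part of the original answer) of making the per-iteration work observable, the result of the modulo can be accumulated and returned so the compiler cannot discard the loop. This assumes the same counter, includes, and thread setup as the program above:

    /* Drop-in replacement for runner() in the program above; relies on the same
       atomic_int counter and <stdatomic.h>/<pthread.h> includes already present. */
    void *runner(void *param) {
        long sink = 0;                 /* accumulates the per-iteration results */
        int temp = 0;
        while (temp < 1000001) {
            temp = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
            for (int i = 1; i < 10000; i++)
                sink += temp % i;      /* result is kept, so the loop cannot be removed */
        }
        return (void *)sink;           /* returning sink makes the work observable */
    }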

How to generate a delay

I'm new to kernel programming and I'm trying to understand some OS basics. I am trying to generate a delay using a technique that I've implemented successfully on a 20 MHz microcontroller.
I know this is a totally different environment, as I'm using Linux (CentOS) on my 2 GHz Core 2 Duo processor.
I've tried the following code, but I'm not getting a delay.
#include <linux/kernel.h>
#include <linux/module.h>

int init_module(void)
{
    unsigned long int i, j, k, l;

    for (l = 0; l < 100; l++)
    {
        for (i = 0; i < 10000; i++)
        {
            for (j = 0; j < 10000; j++)
            {
                for (k = 0; k < 10000; k++);
            }
        }
    }
    printk("\nhello\n");
    return 0;
}

void cleanup_module(void)
{
    printk("bye");
}
When I run dmesg after inserting the module, as quickly as I can, the string "hello" is already there. If my calculation is right, the above code should give me at least a 10-second delay.
Why is it not working? Is there anything related to threading? How could a 2 GHz processor execute the above code instantly, without any noticeable delay?
The compiler is optimizing your loop away since it has no side effects.
To actually get a 10 second (non-busy) delay, you can do something like this:
#include <linux/sched.h>
//...
unsigned long to = jiffies + (10 * HZ); /* current time + 10 seconds */
while (time_before(jiffies, to))
{
    schedule();
}
or better yet:
#include <linux/delay.h>
//...
msleep(10 * 1000);
For short delays you may use mdelay, ndelay, and udelay.
I suggest you read Linux Device Drivers, 3rd edition, chapter 7.3, which deals with delays, for more information.
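Putting the msleep() approach into a complete, if minimal, module might look roughly like the sketch below. This is an illustrative example rather than code from the answer, and the function names (delay_init, delay_exit) are made up:

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>
    #include <linux/delay.h>   /* msleep() */

    static int __init delay_init(void)
    {
        printk(KERN_INFO "delay module: sleeping for 10 seconds\n");
        msleep(10 * 1000);     /* non-busy sleep for roughly 10 seconds */
        printk(KERN_INFO "delay module: hello\n");
        return 0;
    }

    static void __exit delay_exit(void)
    {
        printk(KERN_INFO "delay module: bye\n");
    }

    module_init(delay_init);
    module_exit(delay_exit);
    MODULE_LICENSE("GPL");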
To answer the question directly, it's likely your compiler is seeing that these loops don't do anything and "optimizing" them away.
As for this technique, what it looks like you're trying to do is use all of the processor to create a delay. While this may work, an OS should be designed to maximize processor time. This will just waste it.
I understand it's experimental, but just a heads-up.

How do I make a multi-threaded app use all the cores on Ubuntu under VMWare?

I have a multi-threaded app that processes a very large data file. It works great on Windows 7; the code is all C++ and uses the pthreads library for cross-platform multi-threading. When I run it under Windows on my Intel i3, Task Manager shows all four cores pegged to the limit, which is what I want. I compiled the same code with g++ under Ubuntu/VMWare Workstation - the same number of threads are launched, but all threads are running on one core (as far as I can tell - the task manager only shows one core busy).
I'm going to dive into the pthreads calls - perhaps I missed some default setting - but if anybody has any ideas, I'd like to hear them, and I can give more info.
Update: I did set up VMWare to see all four cores, and /proc/cpuinfo shows 4 cores.
Update 2: I just wrote a simple app to show the problem - maybe it's VMWare only? Do any Linux natives out there want to try it and see if it actually loads down multiple cores? To run this on Windows you will need the pthreads library - easily downloadable. And if anyone can suggest something more CPU-intensive than printf, go ahead!
#ifdef _WIN32
#include "stdafx.h"
#endif
#include "stdio.h"
#include "stdlib.h"
#include "pthread.h"

void *Process(void *data)
{
    long id = (long)data;
    for (int i = 0; i < 100000; i++)
    {
        printf("Process %ld says Hello World\n", id);
    }
    return NULL;
}

#ifdef _WIN32
int _tmain(int argc, _TCHAR* argv[])
#else
int main(int argc, char* argv[])
#endif
{
    int numCores = 1;
    if (argc > 1)
        numCores = strtol(&argv[1][2], NULL, 10);

    pthread_t *thread_ids = (pthread_t *)malloc(numCores * sizeof(pthread_t));
    for (int i = 0; i < numCores; i++)
    {
        pthread_create(&thread_ids[i], NULL, Process, (void *)i);
    }
    for (int i = 0; i < numCores; i++)
    {
        pthread_join(thread_ids[i], NULL);
    }
    return 0;
}
I changed your code a bit. I changed numCores = strtol(&argv[1][2], NULL, 10); to numCores = strtol(&argv[1][0], NULL, 10); to make it work under Linux by calling ./core 4 - maybe you were passing something in front of the number of cores, or it's because _TCHAR is wider than one byte? I'm not that familiar with Windows. Furthermore, since I wasn't able to stress the CPU with only printf, I also changed Process a bit.
void *Process(void *data)
{
    long hdata = (long)data;
    long id = (long)data;
    for (int i = 0; i < 10000000; i++)
    {
        printf("Process %ld says Hello World\n", id);
        for (int j = 0; j < 100000; j++)
        {
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            hdata *= j;
            /* ... (more repetitions of hdata *= j; elided) ... */
        }
    }
    return (void*)hdata;
}
And now when I run gcc -O2 -lpthread -std=gnu99 core.c -o core && ./core 4, you can see that all 4 threads are running on 4 different cores - well, they are probably swapped from core to core over time, but all 4 cores are working overtime.
core.c: In function ‘main’:
core.c:75:50: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
Starting 4 threads
Process 0 says Hello World
Process 1 says Hello World
Process 3 says Hello World
Process 2 says Hello World
I verified it with htop; hope it helps. :) I run a dedicated Debian sid x86_64 box with 4 cores, in case you're wondering.
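If you want to double-check which core each thread actually lands on, one option (not covered in the answer above, and glibc-specific) is to have every thread report sched_getcpu(). A minimal sketch, to be compiled with gcc -pthread:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>    /* sched_getcpu() */
    #include <stdio.h>

    /* Each thread prints the CPU it is currently running on. */
    void *report(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld is running on CPU %d\n", id, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, report, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Since the scheduler may migrate threads at any time, the printed CPU number is only a snapshot, but it is enough to see whether threads are being spread across cores.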

Question regarding the clock() function

Why does the time to execute function f1() change from one run to another in debug mode? Why is it always zero in release mode?
I didn't include stdio.h or cstdio, and the code compiled. How?
#include <iostream>
#include <ctime>

void f1()
{
    for( int i = 0; i < 10000; i++ );
}

int main()
{
    clock_t start, finish;
    start = clock();
    for( int i = 0; i < 100000; i++ ) f1();
    finish = clock();
    double duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf( "Duration = %6.2f seconds\n", duration);
}
Possibly the machine you're running your test code on is too fast. Try increasing the loop count to a really huge number.
Another thing to try is testing with the sleep() function.
This should confirm the behavior of your clock() measurements.
I believe the reason you are seeing zero runtime for f1() in release mode is that the compiler is optimizing the function away. Since your for loop has an empty body, it can effectively be removed during compilation.
I'm guessing that this optimization is not performed in debug mode, which would explain why you see a longer execution time. It varies between runs simply because your OS scheduler (almost certainly) does not guarantee a fixed time slot for processes.
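One common way to keep such an empty timing loop from being removed (shown here as a plain-C sketch rather than the original C++ program) is to give the loop body a side effect through a volatile variable:

    #include <stdio.h>
    #include <time.h>

    volatile long sink;          /* volatile: the compiler must perform every store */

    void f1(void)
    {
        for (int i = 0; i < 10000; i++)
            sink = i;            /* loop body now has an observable side effect */
    }

    int main(void)
    {
        clock_t start = clock();
        for (int i = 0; i < 100000; i++)
            f1();
        clock_t finish = clock();
        printf("Duration = %6.2f seconds\n",
               (double)(finish - start) / CLOCKS_PER_SEC);
        return 0;
    }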
As for why you can use printf() when you have not explicitly included <cstdio>, it's because of the <iostream> include.
From looking at my headers in C:\Program Files\Microsoft Visual Studio 10.0\VC\include, I can see that iostream includes istream and ostream, both of which include ios, which includes xlocnum, which includes both cstdlib and cstdio.

Resources