How to tell if Spock is actually executing tests in parallel

We have some API integration tests where a single test class with 19 rows in its where: table takes ~30 minutes to run. We're trying to speed this up using Spock's (experimental) Parallel Execution feature. We are using a simple SpockConfig.groovy file:
runner {
  parallel {
    enabled true
    fixed(4)
  }
}
This still took ~30 minutes on a GitLab runner. Is there any way to log something out so we can verify that the test is running in parallel? Or am I misunderstanding the nature of Spock's parallelization?

You can parallelise Spock testing on several levels, i.e. per specification (test class) or per feature (test method). The per-method setting also means that in iterated (unrolled) tests, iterations can run in parallel. Here is some proof:
import static org.spockframework.runtime.model.parallel.ExecutionMode.CONCURRENT

runner {
  parallel {
    enabled true
    // These values are the default already, specifying them is redundant
    // defaultSpecificationExecutionMode = CONCURRENT
    // defaultExecutionMode = CONCURRENT
    fixed 4
  }
}

package de.scrum_master.testing

import spock.lang.Specification

import static java.lang.System.currentTimeMillis
import static java.lang.Thread.currentThread

class ParallelExecutionTest extends Specification {
  static long startTime = currentTimeMillis()
  static volatile int execCount = 0

  def "feature-A-#count"() {
    given:
    def i = ++execCount
    printf "%5d %2d [%s] >> %s%n", currentTimeMillis() - startTime, i, currentThread().name, specificationContext.currentIteration.displayName
    sleep 1000
    printf "%5d %2d [%s] << %s%n", currentTimeMillis() - startTime, i, currentThread().name, specificationContext.currentIteration.displayName

    where:
    count << (1..8)
  }

  def "feature-B-#count"() {
    given:
    def i = ++execCount
    printf "%5d %2d [%s] >> %s%n", currentTimeMillis() - startTime, i, currentThread().name, specificationContext.currentIteration.displayName
    sleep 1000
    printf "%5d %2d [%s] << %s%n", currentTimeMillis() - startTime, i, currentThread().name, specificationContext.currentIteration.displayName

    where:
    count << (1..8)
  }
}
The output would look something like this (I used the volatile variable to make it easier to sort the log output in my editor into groups of four threads running simultaneously):
68 1 [ForkJoinPool-1-worker-3] >> feature-B-1
68 2 [ForkJoinPool-1-worker-2] >> feature-A-5
68 3 [ForkJoinPool-1-worker-1] >> feature-B-6
68 4 [ForkJoinPool-1-worker-4] >> feature-B-2
1177 1 [ForkJoinPool-1-worker-3] << feature-B-1
1177 3 [ForkJoinPool-1-worker-1] << feature-B-6
1179 4 [ForkJoinPool-1-worker-4] << feature-B-2
1180 2 [ForkJoinPool-1-worker-2] << feature-A-5
1191 5 [ForkJoinPool-1-worker-3] >> feature-B-3
1199 6 [ForkJoinPool-1-worker-4] >> feature-B-4
1200 7 [ForkJoinPool-1-worker-1] >> feature-B-5
1204 8 [ForkJoinPool-1-worker-2] >> feature-A-6
2199 5 [ForkJoinPool-1-worker-3] << feature-B-3
2205 7 [ForkJoinPool-1-worker-1] << feature-B-5
2213 6 [ForkJoinPool-1-worker-4] << feature-B-4
2213 8 [ForkJoinPool-1-worker-2] << feature-A-6
2203 9 [ForkJoinPool-1-worker-3] >> feature-B-7
2207 10 [ForkJoinPool-1-worker-1] >> feature-A-1
2217 11 [ForkJoinPool-1-worker-4] >> feature-B-8
2222 12 [ForkJoinPool-1-worker-2] >> feature-A-8
3205 9 [ForkJoinPool-1-worker-3] << feature-B-7
3213 10 [ForkJoinPool-1-worker-1] << feature-A-1
3229 12 [ForkJoinPool-1-worker-2] << feature-A-8
3230 11 [ForkJoinPool-1-worker-4] << feature-B-8
3208 13 [ForkJoinPool-1-worker-3] >> feature-A-2
3216 14 [ForkJoinPool-1-worker-1] >> feature-A-3
3234 15 [ForkJoinPool-1-worker-4] >> feature-A-4
3237 16 [ForkJoinPool-1-worker-2] >> feature-A-7
4214 13 [ForkJoinPool-1-worker-3] << feature-A-2
4230 14 [ForkJoinPool-1-worker-1] << feature-A-3
4245 15 [ForkJoinPool-1-worker-4] << feature-A-4
4245 16 [ForkJoinPool-1-worker-2] << feature-A-7
As you can clearly see, with your own settings you can expect multiple features and even multiple iterations in your spec to run concurrently. If in your case the concurrent iterations take as long to finish as running the test sequentially, it could mean that those tests spend most of their time using a shared resource which is synchronised or in some other way can only be used by one caller at a time. So Spock is probably not the problem here, but rather the shared resource or the way you synchronise on it.
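The effect described above is not specific to Spock or the JVM. As a language-neutral illustration, here is a minimal C++ sketch (not Spock code; the function name runWorkersMs and its parameters are made up for this example) in which several threads each "use a resource" for a fixed time. When all of them have to hold the same lock, the total run time is the sum of the individual times, exactly as if they had run sequentially, even though the threads themselves were started in parallel.

```cpp
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Starts `workers` threads that each simulate `workMs` milliseconds of work.
// If `shared` is true, every thread must hold the same mutex while working,
// which models a resource that only one caller can use at a time.
// Returns the wall-clock time for all threads to finish, in milliseconds.
long long runWorkersMs(int workers, int workMs, bool shared) {
    std::mutex resource;
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (int i = 0; i < workers; ++i) {
        threads.emplace_back([&resource, workMs, shared] {
            std::unique_lock<std::mutex> lock(resource, std::defer_lock);
            if (shared) lock.lock();  // serialises the simulated work
            std::this_thread::sleep_for(std::chrono::milliseconds(workMs));
        });
    }
    for (auto& t : threads) t.join();

    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

Calling runWorkersMs(4, 50, true) should take at least about 200 ms, while runWorkersMs(4, 50, false) should finish in roughly the time of a single worker. If your parallel Spock run looks like the first case, look for the equivalent of that mutex in your test setup or in the API under test.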

Related

pthread_cond_signal doesn't unblock pthread_cond_wait thread immediately

I'm observing behaviour which doesn't seem to be in line with how pthread_cond_signal and pthread_cond_wait should behave (according to the man pages). man 3 pthread_cond_signal stipulates that:
The pthread_cond_signal() function shall unblock at least one of the threads that are blocked on the specified condition variable cond (if any threads are blocked on cond).
This isn't precise enough and doesn't clarify if, at the same time, the thread calling pthread_cond_signal will yield its time back to the scheduler.
Here's an example program:
#include <pthread.h>
#include <iostream>
#include <time.h>
#include <unistd.h>

int mMsg = 0;
pthread_mutex_t mMsgMutex;
pthread_cond_t mMsgCond;
pthread_t consumerThread;
pthread_t producerThread;

void* producer(void* data) {
    (void) data;
    while(true) {
        pthread_mutex_lock(&mMsgMutex);
        std::cout << "1> locked" << std::endl;
        mMsg += 1;
        std::cout << "1> sending signal, mMsg = " << mMsg << "" << std::endl;
        pthread_cond_signal(&mMsgCond);
        pthread_mutex_unlock(&mMsgMutex);
    }

    return nullptr;
}

void* consumer(void* data) {
    (void) data;
    pthread_mutex_lock(&mMsgMutex);

    while(true) {
        while (mMsg == 0) {
            pthread_cond_wait(&mMsgCond, &mMsgMutex);
        }
        std::cout << "2> wake up, msg: " << mMsg << std::endl;
        mMsg = 0;
    }

    return nullptr;
}

int main()
{
    pthread_mutex_init(&mMsgMutex, nullptr);
    pthread_cond_init(&mMsgCond, nullptr);

    pthread_create(&consumerThread, nullptr, consumer, nullptr);

    std::cout << "starting producer..." << std::endl;

    sleep(1);
    pthread_create(&producerThread, nullptr, producer, nullptr);

    pthread_join(consumerThread, nullptr);
    pthread_join(producerThread, nullptr);
    return 0;
}
Here's the output:
starting producer...
1> locked
1> sending signal, mMsg = 1
1> locked
1> sending signal, mMsg = 2
1> locked
1> sending signal, mMsg = 3
1> locked
1> sending signal, mMsg = 4
1> locked
1> sending signal, mMsg = 5
1> locked
1> sending signal, mMsg = 6
1> locked
1> sending signal, mMsg = 7
1> locked
1> sending signal, mMsg = 8
1> locked
1> sending signal, mMsg = 9
1> locked
1> sending signal, mMsg = 10
2> wake up, msg: 10
1> locked
1> sending signal, mMsg = 1
1> locked
1> sending signal, mMsg = 2
1> locked
1> sending signal, mMsg = 3
...
It seems like there's no guarantee that a pthread_cond_signal call will immediately unblock a thread waiting in pthread_cond_wait. At the same time it seems that any number of pthread_cond_signal calls can be lost after the first one has been issued.
Is this really the intended behaviour or am I doing something wrong here?

This is the intended behavior. pthread_cond_signal does not make the calling thread yield its remaining runtime; the caller simply continues to run.
And yes, pthread_cond_signal will immediately unblock at least one thread waiting on the corresponding condition variable. However, that doesn't guarantee that said waiting thread will immediately run. It just tells the OS that this thread is no longer blocked, and it's up to the OS thread scheduler to decide when to start running it. The woken thread also has to reacquire the mutex before pthread_cond_wait returns, and your producer re-locks that mutex right away in its loop. Since the signalling thread is already running, is hot in the cache etc., it will likely have plenty of time to do something before the now-unblocked thread starts doing anything.
In your example above, if you don't want to skip messages, maybe what you're looking for is something like a producer-consumer queue, maybe backed by a ring buffer.
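To make that suggestion concrete, here is a minimal sketch of such a queue. It uses C++11 std::mutex and std::condition_variable instead of the raw pthread calls, and a std::deque rather than a ring buffer, purely to keep it short; the names BoundedQueue and transferAll are made up for this example. The key difference from the code in the question is that the producer blocks while the queue is full, so it can no longer run ahead and overwrite messages the consumer has not seen yet.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Bounded FIFO queue: push() blocks while full, pop() blocks while empty.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(int value) {
        std::unique_lock<std::mutex> lock(mutex_);
        // Wait until the consumer has made room; the predicate guards
        // against spurious wakeups.
        notFull_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push_back(value);
        notEmpty_.notify_one();
    }

    int pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !items_.empty(); });
        int value = items_.front();
        items_.pop_front();
        notFull_.notify_one();
        return value;
    }

private:
    std::size_t capacity_;
    std::deque<int> items_;
    std::mutex mutex_;
    std::condition_variable notEmpty_;
    std::condition_variable notFull_;
};

// Sends 1..count from a producer thread to a consumer thread and returns
// the values in the order the consumer received them.
std::vector<int> transferAll(int count, std::size_t capacity) {
    BoundedQueue queue(capacity);
    std::vector<int> received;

    std::thread producer([&] {
        for (int i = 1; i <= count; ++i) queue.push(i);
    });
    std::thread consumer([&] {
        for (int i = 0; i < count; ++i) received.push_back(queue.pop());
    });

    producer.join();
    consumer.join();
    return received;
}
```

With this structure the scheduler is still free to let the producer run for a while after each notify, but only until the queue fills up, so every message is delivered exactly once and in order regardless of when the consumer actually gets to run.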

How many promises can Perl 6 keep?

That's a bit of a glib title, but in playing around with Promises I wanted to see how far I could stretch the idea. In this program, I make it so I can specify how many promises I want to make.
The default value in the thread scheduler is 16 threads (rakudo/ThreadPoolScheduler.pm)
If I specify more than that number, the program hangs but I don't get a warning (say, like "Too many threads").
If I set RAKUDO_MAX_THREADS, I can stop the program hanging but eventually there is too much thread competition to run.
I have two questions, really.
How would a program know how many more threads it can make? That's slightly more than the number of promises, for what that's worth.
How would I know how many threads I should allow, even if I can make more?
This is Rakudo 2017.01 on my puny Macbook Air with 4 cores:
my $threads = @*ARGS[0] // %*ENV<RAKUDO_MAX_THREADS> // 1;
put "There are $threads threads";

my $channel = Channel.new;

# start some promises
my @promises;
for 1 .. $threads {
    @promises.push: start {
        react {
            whenever $channel -> $i {
                say "Thread {$*THREAD.id} got $i";
            }
        }
    }
}
put "Done making threads";

for ^100 { $channel.send( $_ ) }
put "Done sending";

$channel.close;
await |@promises;
put "Done!";

This isn't actually about Promise per se, but rather about the thread pool scheduler. A Promise itself is just a synchronization construct. The start construct actually does two things:
Ensures a fresh $_, $/, and $! inside of the block
Calls Promise.start with that block
And Promise.start also does two things:
Creates and returns a Promise
Schedules the code in the block to be run on the thread pool, and arranges that successful completion keeps the Promise and an exception breaks the Promise.
It's not only possible, but also relatively common, to have Promise objects that aren't backed by code on the thread pool. Promise.in, Promise.anyof and Promise.allof factories don't immediately schedule anything, and there are all kinds of uses of a Promise that involve doing Promise.new and then calling keep or break later on. So I can easily create and await on 1000 Promises:
my @p = Promise.new xx 1000;
start { sleep 1; .keep for @p };
await @p;
say 'done' # completes, no trouble
Similarly, a Promise is not the only thing that can schedule code on the ThreadPoolScheduler. The many things that return Supply (like intervals, file watching, asynchronous sockets, asynchronous processes) all schedule their callbacks there too. It's possible to throw code there fire-and-forget style by doing $*SCHEDULER.cue: { ... } (though often you care about the result, or any errors, so it's not especially common).
The current Perl 6 thread pool scheduler has a configurable but enforced upper limit, which defaults to 16 threads. If you create a situation where all 16 are occupied but unable to make progress, and the only thing that can make progress is stuck in the work queue, then deadlock will occur. This is not unique to the Perl 6 thread pool; any bounded pool will be vulnerable to this (and any unbounded pool will be vulnerable to using up all resources and getting the process killed :-)).
As mentioned in another post, Perl 6.d will make await and react non-blocking constructs; this has always been the plan, but there were insufficient development resources to realize it in time for Perl 6.c. The use v6.d.PREVIEW pragma provides early access to this feature. (Also, fair warning, it's a work in progress.) The upshot of this is that an await or react on a thread owned by the thread pool will pause the execution of the scheduled code (for those curious, by taking a continuation) and allow the thread to get on with further work. The resumption of the code will be scheduled when the awaited thing completes, or the react block gets done. Note that this means you can be on a different OS thread before and after the await or react in 6.d. (Most Perl 6 users will not need to care about this. It's mostly relevant for those writing bindings to C libraries, or doing other systems-y stuff. And a good C library binding will make it so users of the binding don't have to care.)
The upcoming 6.d change doesn't eliminate the possibility of exhausting the thread pool, but it will mean that a bunch of ways you can exhaust it in 6.c will no longer be of concern (of note, writing recursive divide-and-conquer things that await the results of the divided parts, or having thousands of active react blocks launched with start react { ... }).
Looking forward, the thread pool scheduler itself will also become smarter. What follows is speculation, though given I'll likely be the one implementing the changes it's probably the best speculation on offer. :-) The thread pool will start following the progress being made, and use it to dynamically tune the pool size. This will include noticing that no progress is being made and, combined with the observation that the work queues contain items, adding threads to try and resolve the deadlock - at the cost of memory overhead of added threads. Today the thread pool conservatively tends to spawn up to its maximum size anyway, even if this is not a particularly optimal choice; most likely some kind of hill-climbing algorithm will be used to try and settle on an optimal number instead. Once that happens, the default max_threads can be raised substantially, so that more programs will - at the cost of a bunch of memory overhead - be able to complete, but most will run with just a handful of threads.

Quick fix, add use v6.d.PREVIEW; on the first line.
This fixes a number of thread exhaustion issues.
I added a few other changes like $*SCHEDULER.max_threads, and adding the Promise “id” so that it is easy to see that the Thread id doesn't necessarily correlate with a given Promise.
#! /usr/bin/env perl6
use v6.d.PREVIEW; # <--

my $threads = @*ARGS[0] // $*SCHEDULER.max_threads;
put "There are $threads threads";

my $channel = Channel.new;

# start some promises
my @promises;
for 1 .. $threads {
    @promises.push: start {
        react {
            whenever $channel -> $i {
                say "Thread $*THREAD.id() ($_) got $i";
            }
        }
    }
}
put "Done making threads";

for ^100 { $channel.send( $_ ) }
put "Done sending";

$channel.close;
await @promises;
put "Done!";
There are 16 threads
Done making threads
Thread 4 (14) got 0
Thread 4 (14) got 1
Thread 8 (8) got 3
Thread 10 (6) got 4
Thread 6 (1) got 5
Thread 16 (5) got 2
Thread 3 (16) got 7
Thread 7 (8) got 8
Thread 7 (9) got 9
Thread 5 (3) got 6
Thread 3 (6) got 10
Thread 11 (2) got 11
Thread 14 (5) got 12
Thread 4 (16) got 13
Thread 16 (15) got 14 # <<
Thread 13 (11) got 15
Thread 4 (15) got 16 # <<
Thread 4 (15) got 17 # <<
Thread 4 (15) got 18 # <<
Thread 11 (15) got 19 # <<
Thread 13 (15) got 20 # <<
Thread 3 (15) got 21 # <<
Thread 9 (13) got 22
Thread 18 (15) got 23 # <<
Thread 18 (15) got 24 # <<
Thread 8 (13) got 25
Thread 7 (15) got 26 # <<
Thread 3 (15) got 27 # <<
Thread 7 (15) got 28 # <<
Thread 8 (15) got 29 # <<
Thread 13 (13) got 30
Thread 14 (13) got 31
Thread 8 (13) got 32
Thread 6 (13) got 33
Thread 9 (15) got 34 # <<
Thread 13 (15) got 35 # <<
Thread 9 (15) got 36 # <<
Thread 16 (15) got 37 # <<
Thread 3 (15) got 38 # <<
Thread 18 (13) got 39
Thread 3 (15) got 40 # <<
Thread 7 (14) got 41
Thread 12 (15) got 42 # <<
Thread 15 (15) got 43 # <<
Thread 4 (1) got 44
Thread 11 (1) got 45
Thread 7 (15) got 46 # <<
Thread 8 (15) got 47 # <<
Thread 7 (15) got 48 # <<
Thread 17 (15) got 49 # <<
Thread 10 (10) got 50
Thread 10 (15) got 51 # <<
Thread 11 (14) got 52
Thread 6 (8) got 53
Thread 5 (13) got 54
Thread 11 (15) got 55 # <<
Thread 11 (13) got 56
Thread 3 (13) got 57
Thread 7 (13) got 58
Thread 16 (16) got 59
Thread 5 (15) got 60 # <<
Thread 5 (15) got 61 # <<
Thread 6 (15) got 62 # <<
Thread 5 (15) got 63 # <<
Thread 5 (15) got 64 # <<
Thread 17 (11) got 65
Thread 15 (15) got 66 # <<
Thread 17 (15) got 67 # <<
Thread 11 (13) got 68
Thread 10 (15) got 69 # <<
Thread 3 (15) got 70 # <<
Thread 11 (15) got 71 # <<
Thread 6 (15) got 72 # <<
Thread 16 (13) got 73
Thread 6 (13) got 74
Thread 17 (15) got 75 # <<
Thread 4 (13) got 76
Thread 8 (13) got 77
Thread 12 (15) got 78 # <<
Thread 6 (11) got 79
Thread 3 (15) got 80 # <<
Thread 11 (13) got 81
Thread 7 (13) got 82
Thread 4 (15) got 83 # <<
Thread 7 (15) got 84 # <<
Thread 7 (15) got 85 # <<
Thread 10 (15) got 86 # <<
Thread 7 (15) got 87 # <<
Thread 12 (13) got 88
Thread 3 (13) got 89
Thread 18 (13) got 90
Thread 6 (13) got 91
Thread 18 (13) got 92
Thread 15 (15) got 93 # <<
Thread 16 (15) got 94 # <<
Thread 12 (15) got 95 # <<
Thread 17 (15) got 96 # <<
Thread 11 (13) got 97
Thread 15 (16) got 98
Thread 18 (7) got 99
Done sending
Done!

Unexpected task switch on Linux despite real time and nice -20

I have a program that needs to execute with 100% performance but I see that it is sometimes paused for more than 20 uSec. I've struggled with this for a while and can't find the reason/explanation.
So my question is:
Why is my program "paused"/"stalled" for 20 uSec every now and then?
To investigate this I wrote the following small program:
#include <string.h>
#include <stdlib.h>   // exit
#include <time.h>     // clock_gettime
#include <iostream>
#include <signal.h>
using namespace std;

unsigned long long get_time_in_ns(void)
{
    struct timespec tmp;
    if (clock_gettime(CLOCK_MONOTONIC, &tmp) == 0)
    {
        // Force 64-bit arithmetic so the multiplication cannot overflow
        return tmp.tv_sec * 1000000000ULL + tmp.tv_nsec;
    }
    else
    {
        exit(0);
    }
}

bool go_on = true;

static void Sig(int sig)
{
    (void)sig;
    go_on = false;
}

int main()
{
    unsigned long long t1=0;
    unsigned long long t2=0;
    unsigned long long t3=0;
    unsigned long long t4=0;
    unsigned long long t5=0;
    unsigned long long t2saved=0;
    unsigned long long t3saved=0;
    unsigned long long t4saved=0;
    unsigned long long t5saved=0;
    struct sigaction sig;
    memset(&sig, 0, sizeof(sig));
    sig.sa_handler = Sig;
    if (sigaction(SIGINT, &sig, 0) < 0)
    {
        cout << "sigaction failed" << endl;
        return 0;
    }
    while (go_on)
    {
        t1 = get_time_in_ns();
        t2 = get_time_in_ns();
        t3 = get_time_in_ns();
        t4 = get_time_in_ns();
        t5 = get_time_in_ns();
        // Track the longest gap seen between consecutive calls
        if ((t2-t1)>t2saved) t2saved = t2-t1;
        if ((t3-t2)>t3saved) t3saved = t3-t2;
        if ((t4-t3)>t4saved) t4saved = t4-t3;
        if ((t5-t4)>t5saved) t5saved = t5-t4;
        cout <<
            t1 << " " <<
            t2-t1 << " " <<
            t3-t2 << " " <<
            t4-t3 << " " <<
            t5-t4 << " " <<
            t2saved << " " <<
            t3saved << " " <<
            t4saved << " " <<
            t5saved << endl;
    }
    cout << endl << "Closing..." << endl;
    return 0;
}
The program simply tests how long it takes to call the function "get_time_in_ns". It does this 5 times in a row and also tracks the longest time measured.
Normally it takes 30 ns to call the function, but sometimes it takes as long as 20000 ns, which I don't understand.
A little part of the program output is:
8909078678739 37 29 28 28 17334 17164 17458 18083
8909078680355 36 30 29 28 17334 17164 17458 18083
8909078681947 38 28 28 27 17334 17164 17458 18083
8909078683521 37 29 28 27 17334 17164 17458 18083
8909078685096 39 27 28 29 17334 17164 17458 18083
8909078686665 37 29 28 28 17334 17164 17458 18083
8909078688256 37 29 28 28 17334 17164 17458 18083
8909078689827 37 27 28 28 17334 17164 17458 18083
The output shows that normal call time is approx. 30ns (column 2 to 5) but the largest time is nearly 20000ns (column 6 to 9).
I start the program like this:
chrt -f 99 nice -n -20 myprogram
Any ideas why the call sometimes takes 20000ns when it normally takes 30ns?
The program is executed on a dual Xeon (8 cores each) machine.
I connect using SSH.
top shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8107 root rt -20 16788 1448 1292 S 3.0 0.0 0:00.88 myprogram
2327 root 20 0 69848 7552 5056 S 1.3 0.0 0:37.07 sshd

Even the lowest value of niceness is not a real-time priority; the process is still in policy SCHED_OTHER, which is a round-robin time-sharing policy. You need to switch to a real-time scheduling policy with sched_setscheduler(), either SCHED_FIFO or SCHED_RR as required.
Note that this will still not give you absolute 100% CPU if it isn't the only task running. Even if the task never blocks, Linux will still grant a few percent of the CPU time to non-real-time tasks so that a runaway RT task will not effectively hang the machine (by default 5%: /proc/sys/kernel/sched_rt_runtime_us is 950000 out of a sched_rt_period_us of 1000000). Of course, a real-time task needing 100% CPU time is unlikely to perform correctly.
Edit: Given that the process already runs with an RT scheduler, as pointed out (nice values are only relevant to SCHED_OTHER, so it's pointless to set those in addition), the rest of my answer still applies as to how and why other tasks are still being run (remember that there are also a number of kernel tasks).
The only way better than this is probably dedicating one CPU core to the task to get the most out of it. Obviously this only works on multi-core CPUs. There is a question related to that here: Whole one core dedicated to single process
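As a rough sketch of what pinning a task to one core and requesting a real-time policy look like from inside the program, the Linux-specific snippet below uses sched_getaffinity, sched_setaffinity and sched_setscheduler. The helper names (firstAllowedCpu, pinToCpu, requestFifo) are made up for this example, and the CPU_* macros rely on _GNU_SOURCE, which g++ defines by default. Note the assumption built into it: pinning only restricts where this task runs. It does not keep other tasks off that core; for that you would additionally need something like the isolcpus kernel boot parameter or a cpuset.

```cpp
#include <cerrno>
#include <sched.h>

// Returns the lowest-numbered CPU the calling thread is allowed to run on,
// or -1 if the affinity mask cannot be read.
int firstAllowedCpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) return cpu;
    }
    return -1;
}

// Restricts the calling thread to a single CPU.
// Returns 0 on success or the errno value on failure.
int pinToCpu(int cpu) {
    cpu_set_t only;
    CPU_ZERO(&only);
    CPU_SET(cpu, &only);
    return sched_setaffinity(0, sizeof(only), &only) == 0 ? 0 : errno;
}

// Requests the SCHED_FIFO policy at the given priority (1..99 on Linux).
// Returns 0 on success or the errno value on failure; EPERM is the usual
// result when running without root or CAP_SYS_NICE.
int requestFifo(int priority) {
    sched_param param{};
    param.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &param) == 0 ? 0 : errno;
}
```

A program would typically call pinToCpu(firstAllowedCpu()) and requestFifo(...) once at startup, before entering its time-critical loop. Starting it with chrt -f 99, as in the question, has the same effect as the requestFifo call.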

Mpiexec fails to terminate when program ends

I am running an mpi program on a cluster. When the program ends the job does not. And so I have to wait for it to time out.
I am not sure how to debug this. I checked that the program got to the finalize statement in MPI, and it does. I am using lib Elemental.
Final lines of the program
if (grid.Rank() == 0) std::cout << "Finalize" << std::endl;
std::string message = std::string("rank_") +
std::to_string(mpi::Rank(mpi::COMM_WORLD)) + "_a";
std::cout << message;
Finalize();
message = message + "b";
std::cout << message;
mpi::Finalize();
message = message + "c";
std::cout << message;
return 0;
The output is
Finalize
rank_0_arank_0_abrank_0_abcmpiexec: killall: caught signal 15 (Terminated).
mpiexec: kill_tasks: killing all tasks.
mpiexec: wait_tasks: waiting for taub205.
mpiexec: killall: caught signal 15 (Terminated).
=>> PBS: job killed: walltime 801 exceeded limit 780
----------------------------------------
Begin Torque Epilogue (Tue Nov 4 16:15:19 2014)
Job ID: ***
Username: ***
Group: ***
Job Name: mpi_test1
Session: 11270
Limits:
ncpus=1,neednodes=1:ppn=6:m24G:taub,nodes=1:ppn=6:m24G:taub,walltime=00:13:00
Resources: cput=00:02:12,mem=429524kb,vmem=773600kb,walltime=00:13:21
Job Queue: secondary
Account: ***
Nodes: taub205
End Torque Epilogue
----------------------------------------
Running these modules on https://campuscluster.illinois.edu/hardware/#taub
> module list
Currently Loaded Modulefiles:
1) torque/4.2.9 5) gcc/4.7.1
2) moab/7.2.9 6) mvapich2/2.0b-gcc-4.7.1
3) env/taub 7) mvapich2/mpiexec
4) blas 8) lapack

Pulse width modulation (PWM) on AVR Studio

I'm trying to use PWM for an LED on an ATmega8, any pin of port B. Setting up timers has been annoying, and I don't know what to do with my OCR1A. Here's my code, and I'd love some feedback.
I'm just trying to figure out how to use PWM. I know the concept, and OCR1A is supposed to be the fraction of the whole counter time I want the pulse on.
#define F_CPU 1000000 // 1 MHz
#include <avr/io.h>
#include <avr/delay.h>
#include <avr/interrupt.h>
int main(void){
TCCR1A |= (1 << CS10) | (1 << CS12) | (1 << CS11);
OCR1A = 0x0000;
TCCR1A |= ( 0 << WGM11 ) | ( 1 << WGM10 ) | (WGM12 << 1) | (WGM13 << 0);
TCCR1A |= ( 1 << COM1A0 ) | ( 0 << COM1A1 );
TIMSK |= (1 << TOIE1); // Enable timer interrupt
DDRB = 0xFF;
sei(); // Enable global interrupts
PORTB = 0b00000000;
while(1)
{
OCR1A = 0x00FF; //I'm trying to get the timer to alternate being on for 100% of the time,
_delay_ms(200);
OCR1A = 0x0066; // Then 50%
_delay_ms(200);
OCR1A = 0x0000; // Then 0%
_delay_ms(200);
}
}
ISR (TIMER1_COMA_vect) // timer0 overflow interrupt
{
PORTB =~ PORTB;
}

No, this is not how you should do PWM. For example, how would you set a PWM duty cycle of, say, 42% with it? Also, the code is bigger than it needs to be; it can be done in a much more efficient way. You are also wasting a 16-bit timer on 8-bit operations. You have two 8-bit timers (Timer/Counter 0 and 2) and one 16-bit timer, Timer/Counter 1.
It's also a bad idea to set unused port pins to output. All port pins which are not connected to anything should be left as inputs.
The ATmega8 has a built-in PWM generator on timers 1 and 2, so there is no need to simulate it in software. You don't even have to toggle your ports manually (you only have to set the corresponding port pin to output).
You don't even need any interrupt.
#define fillrate OCR2A
//...
// main()
PORTB=0x00;
DDRB=0x08; //We use PORTB.3 as output, for OC2A, see the atmega8 reference manual
// Mode: Phase correct PWM top=0xFF
// OC2A output: Non-Inverted PWM
TCCR2A=0x81;
// Set the speed here, it will depend on your clock rate.
TCCR2B=0x02;
// for example, this will alternate between 75% and 42% PWM
while(1)
{
fillrate = 191; // ca. 75% PWM
delay_ms(2000);
fillrate = 107; // ca. 42% PWM
delay_ms(2000);
}
Note that you can drive another LED with a second PWM signal by using the same timer and setting OCR2B instead of OCR2A. Don't forget to set TCCR2A to enable OC2B as an output for your PWM, as in this example only OC2A is enabled.

You need to configure Timer1 with these two lines before using OCR1A:
TCCR1A = (1 << WGM10) | (1 << COM1A1);
TCCR1B = (1 << CS10) | (1 << WGM12);
And then use this:
OCR1A = in;
Note that the range is 0-255. Calculate your percentages, and there you have it!
#define F_CPU 1000000 // 1 MHz
#include <avr/io.h>
#include <util/delay.h> // <avr/delay.h> is obsolete
#include <avr/interrupt.h>

int main(void){
    // Timer1: 8-bit fast PWM, non-inverting output on OC1A, no prescaler
    TCCR1A = (1 << WGM10) | (1 << COM1A1);
    TCCR1B = (1 << CS10) | (1 << WGM12);
    DDRB = 0xFF; // OC1A is on PB1
    sei(); // Enable global interrupts
    PORTB = 0b00000000;
    while(1)
    {
        OCR1A = 255; // fully on
        _delay_ms(200);
        OCR1A = 125; // about half
        _delay_ms(200);
        OCR1A = 0;
        _delay_ms(200);
    }
}
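For the "calculate your percentages" step, the arithmetic is just a linear mapping from 0-100% onto the 0-255 compare range. Here is a small sketch of it; the helper name dutyToCompare is made up for this example, and it assumes the 8-bit PWM setup from the code above. It is written as a constexpr so it can be checked on a PC as well as compiled with avr-g++.

```cpp
#include <cstdint>

// Maps a duty cycle in percent (0-100) to an 8-bit compare value (0-255),
// rounded to the nearest integer. Inputs above 100 are clamped to 100.
constexpr std::uint8_t dutyToCompare(unsigned percent) {
    return static_cast<std::uint8_t>(
        ((percent > 100u ? 100u : percent) * 255u + 50u) / 100u);
}
```

For example, dutyToCompare(42) gives 107 and dutyToCompare(75) gives 191, so a line such as OCR1A = dutyToCompare(42); sets roughly a 42% duty cycle.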
