I'm trying to use PWM for an LED on an ATmega8, on any pin of port B. Setting up timers has been annoying, and I don't know what to do with my OCR1A. Here's my code, and I'd love some feedback.
I'm just trying to figure out how to use PWM. I know the concept, and OCR1A is supposed to be the fraction of the whole counter period for which I want the pulse on.
#define F_CPU 1000000 // 1 MHz
#include <avr/io.h>
#include <avr/delay.h>
#include <avr/interrupt.h>
int main(void){
TCCR1A |= (1 << CS10) | (1 << CS12) | (1 << CS11);
OCR1A = 0x0000;
TCCR1A |= ( 0 << WGM11 ) | ( 1 << WGM10 ) | (WGM12 << 1) | (WGM13 << 0);
TCCR1A |= ( 1 << COM1A0 ) | ( 0 << COM1A1 );
TIMSK |= (1 << TOIE1); // Enable timer interrupt
DDRB = 0xFF;
sei(); // Enable global interrupts
PORTB = 0b00000000;
OCR1A = 0x00FF; //I'm trying to get the timer to alternate being on for 100% of the time,
OCR1A = 0x0066; // Then 50%
OCR1A = 0x0000; // Then 0%
ISR (TIMER1_COMA_vect) // timer0 overflow interrupt

No, this is not how you should do PWM. For example, how would you set a duty cycle of, say, 42% with it? The code is also bigger than it needs to be and can be written much more efficiently, and you are wasting a 16-bit timer on 8-bit operations. You have two 8-bit timers (Timer/Counter 0 and 2) and one 16-bit timer, Timer/Counter 1.
It's also a bad idea to set unused port pins to output. All port pins that are not connected to anything should be left as inputs.
The ATmega8 has a built-in PWM generator on timers 1 and 2, so there is no need to simulate it in software. You don't even have to drive the port manually (you only have to set the corresponding port pin to output).
You don't even need an interrupt.
#define fillrate OCR2A
// main()
DDRB=0x08; //We use PORTB.3 as output, for OC2A, see the atmega8 reference manual
// Mode: Phase correct PWM top=0xFF
// OC2A output: Non-Inverted PWM
// Set the speed here, it will depend on your clock rate.
// for example, this will alternate between 75% and 42% PWM
fillrate = 191; // ca. 75% PWM
fillrate = 107; // ca. 42% PWM
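Note that on AVRs whose Timer 2 has two compare channels you can drive a second LED with its own PWM from the same timer by setting OCR2B instead of OCR2A. Don't forget to set TCCR2A to enable OC2B as an output for your PWM, as this example only enables OC2A. On the ATmega8 itself, Timer 2 has a single compare register (OCR2, configured through TCCR2, with its output OC2 on PB3), so adjust the register names to match your datasheet.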
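The fill-rate values used above (191 for about 75%, 107 for about 42%) come from scaling a percentage onto the 8-bit compare range 0-255. Here is a minimal sketch of that arithmetic, written as host-side C++ so it can be checked without hardware. The helper name duty_to_compare is made up for this illustration and is not part of avr-libc.

```cpp
#include <cstdint>

// Hypothetical helper (not from avr-libc): map a duty cycle in percent
// onto an 8-bit compare value, rounding to the nearest step.
// Inputs above 100 are clamped to 100.
constexpr std::uint8_t duty_to_compare(unsigned percent)
{
    return static_cast<std::uint8_t>(
        ((percent > 100u ? 100u : percent) * 255u + 50u) / 100u);
}
```

On the target you would then write something like fillrate = duty_to_compare(42); instead of working the number out by hand.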

You need to initialize Timer 1 with these two lines:
TCCR1A = (1 << WGM10) | (1 << COM1A1);
TCCR1B = (1 << CS10) | (1 << WGM12);
And then use this:
OCR1A = in
Keep in mind that the range is 0-255. Work out your percentages, and there you have it!
#define F_CPU 1000000 // 1 MHz
#include <avr/io.h>
#include <util/delay.h> // <avr/delay.h> is the obsolete location of this header
#include <avr/interrupt.h>
int main(void){
TCCR1A = (1 << WGM10) | (1 << COM1A1);
TCCR1B = (1 << CS10) | (1 << WGM12);
DDRB = 0xFF;
sei(); // Enable global interrupts
PORTB = 0b00000000;
OCR1A = 255;
OCR1A = 125;
OCR1A = 0;
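For reference, WGM10 together with WGM12 selects 8-bit fast PWM on Timer 1 (TOP = 0xFF), CS10 alone selects no prescaling, and COM1A1 gives a non-inverted output on OC1A. Assuming the 1 MHz clock from the F_CPU define, the resulting PWM frequency can be estimated with the sketch below (host-side C++; the function name is illustrative only).

```cpp
// Fast PWM counts from 0 to TOP and then wraps, so one PWM period
// lasts (TOP + 1) timer ticks, and each tick is `prescaler` CPU cycles.
constexpr double fast_pwm_hz(double f_cpu_hz, unsigned prescaler, unsigned top)
{
    return f_cpu_hz / (static_cast<double>(prescaler) * (top + 1u));
}
```

With 1 MHz, a prescaler of 1 and TOP = 255 this gives roughly 3.9 kHz, fast enough that an LED shows no visible flicker. Note that on the ATmega8 the OC1A output is PB1, so that is the pin the LED has to be on.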


VGA pixel grouping on STM32

I have some code that displays a single pixel on screen through VGA, but I am a bit stuck on how to set multiple pixels on screen where I want them. I set up two timers for vertical sync and horizontal sync; then, using the V-sync interrupt, I set a flag to allow PA8 to toggle and output a pixel at the correct time, based on the setCompare value I set on the timer's channel. The STM32F103C8 is also overclocked to 128 MHz. Here's the code:
#include "Arduino.h"
//640x480 at 60Hz
static volatile int vflag = 0;
void setup() {
#define FLASH_ACR (*(volatile uint32_t*)(0x40022000))
FLASH_ACR = 0b110010; //enable flash prefetch and wait state to increase stability at higher freq
pinMode(PA0, PWM); //31,468.75Hz (Horizontal Sync) (Channel 1)
Timer2.setOverflow(4067); //reload register value
Timer2.setPrescaleFactor(1); //number that divides main clock
Timer2.setCompare(1, 488); //12% duty cycle (Syncpulse/Wholeline)
Timer2.setCompare(2, 2000); //0-4067 = vertical line going left or right respectively
Timer2.attachInterrupt(2, TRIGGER);
pinMode(PA6, PWM); //60Hz (Vertical Sync) (Channel 1)
Timer3.setOverflow(4183); //reload register value
Timer3.setPrescaleFactor(510); //number that divides main clock
Timer3.setCompare(1, 16); //0.38% duty cycle (Syncpulse/Wholeframe)
Timer3.setCompare(2, 2000); //0-4183 = horizontal line going up or down respectively
Timer3.attachInterrupt(2, TRIGGER2);
pinMode(PA8, OUTPUT); //need to set PinMode in order for the ODR register to work
void loop() {
void TRIGGER(){
__asm__ volatile (
"ldr r0, =(0x4001080C) \n\t" //GPIOA base address is 0x40010800 and ODR offset is 0x0C
"ldr r1, =(1<<8) \n\t" //turn on PA8
"ldr r2, =0 \n\t" //turn off PA8
"str r1, [r0] \n\t" //turn on PA8
"str r2, [r0] \n\t" //turn off PA8
vflag = 0; //we set the vflag back to zero when we're done outputting pixels.
I understand there are graphical defects/glitches and the code can be improved, but I'm trying to focus on how this works in theory. What I want to do is display a word on screen; that word will be made up of letters, and those letters will be made up of groups of pixels. So what's the best (or simplest) way to group pixels and draw them at multiple places on screen? Or how is this usually done?
I do not code for STM32, so even the code looks foreign to me; however, it sounds like you are hard-coding the individual pixels with a timer and generating the VGA signal on a GPIO pin. That combination of methods is problematic for programmable graphics.
I am using an AVR32 (UC3A, with a much slower clock than yours) to generate a VGA image using:
screen buffer (video RAM)
Simply put, I have the entire screen image stored in MCU memory, so you can change its contents without changing the VGA output code.
The problem is that you need enough memory for the image (encoded in a way that enables direct transfer to the VGA connector). I am using an AVR32 with 16+32+32 KByte of RAM, but most MCUs have much less (static images can be stored in EPROM, but then the output image cannot be changed). So if you do not have enough memory, either lower the resolution until the image fits or add external memory to your system.
an integrated HW peripheral for VGA signal generation
I have several versions of VGA signal generation working:
SDRAM ... using SDRAM interface of the MCU
SMC ... using SMC interface of the MCU
SSC ... using synchronous serial interface
It all boils down to copying the screen buffer to an I/O interface at the VGA dot clock (~30MHz), then combining the MCU pins used into VGA signals on the HW side.
For speed you can use DMA.
On top of all this you need to generate the sync signals, and that is all.
So look at your MCU datasheet and find an interface capable of synchronous transfer of at least 3 bits of data (R, G, B) at the VGA dot clock speed. The closer the clock is to the VGA dot clock the better (some VGA monitors and LCDs do not tolerate too big a difference).
The sync signals can be hard-coded or even encoded in the video RAM.
The fastest option is a serial interface, but then the output is just B&W instead of RGB (unless you have 3 SSC units/channels). However, you can send 8/16/32 pixels at once, or use DMA directly, so the MCU has time for other things, and it also requires much less VRAM.
My favourite is the SDRAM interface (using just its data bus).
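To make the screen-buffer idea concrete for the original question (letters built from groups of pixels), here is a minimal host-side sketch of a 1-bit-per-pixel buffer with a glyph-drawing routine. It is not the code from my system: the buffer size, the names FrameBuffer and draw_glyph, and the hand-drawn letter are all assumptions made for this illustration. The output code would only ever read the buffer, while drawing code only writes to it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative 1-bit-per-pixel screen buffer; the size is an assumption.
constexpr std::size_t kWidth = 128;
constexpr std::size_t kHeight = 64;

struct FrameBuffer {
    std::array<std::uint8_t, kWidth * kHeight / 8> bits{};

    void set_pixel(std::size_t x, std::size_t y, bool on) {
        if (x >= kWidth || y >= kHeight) return;  // ignore off-screen pixels
        std::uint8_t& byte = bits[(y * kWidth + x) / 8];
        const std::uint8_t mask = static_cast<std::uint8_t>(0x80u >> (x % 8));
        if (on) byte |= mask;
        else    byte &= static_cast<std::uint8_t>(~mask);
    }

    bool get_pixel(std::size_t x, std::size_t y) const {
        if (x >= kWidth || y >= kHeight) return false;
        return (bits[(y * kWidth + x) / 8] & (0x80u >> (x % 8))) != 0;
    }
};

// Hand-drawn 8x8 letter 'A' (not from a real font): one byte per row,
// most significant bit = leftmost pixel.
constexpr std::array<std::uint8_t, 8> kGlyphA = {
    0x18, 0x24, 0x42, 0x42, 0x7E, 0x42, 0x42, 0x00};

// Copy a glyph into the buffer with its top-left corner at (x0, y0).
void draw_glyph(FrameBuffer& fb, const std::array<std::uint8_t, 8>& glyph,
                std::size_t x0, std::size_t y0) {
    assert(x0 + 8 <= kWidth && y0 + 8 <= kHeight);  // glyph must fit on screen
    for (std::size_t row = 0; row < 8; ++row)
        for (std::size_t col = 0; col < 8; ++col)
            fb.set_pixel(x0 + col, y0 + row,
                         ((glyph[row] >> (7 - col)) & 1u) != 0);
}
```

A word is then just a loop over its characters, calling draw_glyph with x0 advanced by 8 for each letter, and the same glyph table can be reused anywhere on screen.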
Here is an image from my system:
My interconnection looks like this (using the MCU's SDRAM interface version):
VGA <- AT32UC3A0512
R PX10 (EBI_D0)
G PX09 (EBI_D1)
B PX08 (EBI_D2)
Bright PX07 (EBI_D3)*
Here is the relevant source:
#define _PA_VGA_HS 8
#define _PA_VGA_VS 16
#define _PAmo 24
volatile avr32_gpio_port_t *port_PA=&GPIO.port[AVR32_PIN_PA00>>5];
volatile U8 *SDRAM=(U8*)AVR32_EBI_CS0_ADDRESS;
//--- VGA 640x480x4 60Hz -------------------------------------------------------------------------
#define VRAM_xs 304
#define VRAM_ys 400
#define VRAM_bs 4
#define VRAM_ls (VRAM_xs>>1)
U8 VRAM_empty[VRAM_ls];
// Horizontal timing [us]
#define VGA_t0 3
#define VGA_t1 5
#define VGA_t2 32
// Vertical timing [lines ~31.817us]; stretched to 32us++ so that more pixels fit
#define VGA_ys 525
#define VGA_VS 2
#define VGA_y0 (36+40)
#define VGA_y1 (VGA_y0+VRAM_ys)
void VGA_init();
void VGA_screen();
void VGA_init()
static const gpio_map_t EBI_GPIO_MAP[] =
gpio_enable_module(EBI_GPIO_MAP, sizeof(EBI_GPIO_MAP) / sizeof(EBI_GPIO_MAP[0]));
AVR32_SDRAMC.mr=0; // normal mode
AVR32_SDRAMC.tr=0; // no refresh (T=0)
// map SDRAM CS -> memory space
U32 a;
for (a=0;a<VRAM_ls*VRAM_ys;a++) VRAM[a]=0;
for (a=0;a<VRAM_ls;a++) VRAM_empty[a]=0;
void VGA_screen()
U32 a,x,y,c,PA,t0;
for (;;)
for (PA=_PAmo,a=0,y=0;y<VGA_ys;y++)
if (y== 0) PA^=_PA_VGA_VS; else PA^=0; // VS on
if (y==VGA_VS) PA^=_PA_VGA_VS; else PA^=0; // VS off
PA^=_PA_VGA_HS; // HS on
PA^=_PA_VGA_HS; // HS off
*SDRAM=0; // blank (black)
if ((y>=VGA_y0)&&(y<VGA_y1))
for (x=0;x<VRAM_ls;x++)
*SDRAM=c>>4; // write pixel into SDRAM interface (address is ignored as I use only data bus pins)
*SDRAM=c; // write pixel into SDRAM interface (address is ignored as I use only data bus pins)
*SDRAM=0; // blank (black)
#include "System\include.h"
#include "pic_zilog_inside.h"
//#include "VGA_EBI_SMC.h"
#include "VGA_EBI_SDRAMC.h"
//#include "VGA_SSC.h"
void pic_copy(U8 *dst,U32 dst_xs,U32 dst_ys,U32 dst_bs,U8 *src,U32 src_xs,U32 src_ys,U32 src_bs)
U32 x0,y0,a0,l0;
U32 x1,y1,a1,l1;
U32 a; U8 c,m;
l0=1; l1=1;
if (dst_bs==1) l0=dst_xs>>3;
if (dst_bs==2) l0=dst_xs>>2;
if (dst_bs==4) l0=dst_xs>>1;
if (dst_bs==8) l0=dst_xs;
if (src_bs==1) l1=src_xs>>3;
if (src_bs==2) l1=src_xs>>2;
if (src_bs==4) l1=src_xs>>1;
if (src_bs==8) l1=src_xs;
for (a0=0;a0<dst_ys*l0;a0++) dst[a0]=0;
for (y0=0;y0<dst_ys;y0++)
for (x0=0;x0<dst_xs;x0++)
if (src_bs==1)
if (src_bs==4)
if (U32(x0&1)==0) c>>=4;
if (dst_bs==1)
dst[a]|=c; if (!c) dst[a]^=c;
if (dst_bs==4)
if (c) c=15;
if (U32(x0&1)==0) { c<<=4; m=0x0F; } else m=0xF0;
int main(void)
const U32 pic_zilog_inside_xs=640;
const U32 pic_zilog_inside_ys=480;
const U32 pic_zilog_inside_bs=1;
const U32 pic_zilog_inside[]= // hard coded image
The function pic_copy just copies a hard-coded image into VRAM.
The function VGA_screen() creates the VGA signal in an endless loop, so other tasks must be placed in ISRs, hard-coded into the pause code, or run between individual frames (this is really demanding on my setup, as I have a slow MCU clock, so there is not much room for other work). The VRAM is encoded in 16 colors (4 bits per pixel):
8 4 2 1
Brightness B G R
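The 4-bit encoding can be sketched in a few lines of host-side C++. The bit weights come from the table above. The nibble order (even x in the high nibble) is an assumption based on how pic_copy shifts even pixels left by 4, and the function names are made up for this illustration.

```cpp
#include <cstdint>

// Bit weights from the table: 8 = brightness, 4 = B, 2 = G, 1 = R.
constexpr std::uint8_t make_color(bool bright, bool b, bool g, bool r)
{
    return static_cast<std::uint8_t>(
        (bright ? 8 : 0) | (b ? 4 : 0) | (g ? 2 : 0) | (r ? 1 : 0));
}

// Two pixels share one byte. Assumed layout: even x lives in the high
// nibble, odd x in the low nibble. Returns the updated byte.
constexpr std::uint8_t put_nibble(std::uint8_t old_byte, unsigned x,
                                  std::uint8_t color)
{
    return (x & 1u) == 0u
        ? static_cast<std::uint8_t>((old_byte & 0x0F) | ((color & 0x0F) << 4))
        : static_cast<std::uint8_t>((old_byte & 0xF0) | (color & 0x0F));
}
```

Writing pixel x on a line then means updating the byte at index x >> 1 of that line with put_nibble.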
The brightness bit should just add some voltage to R, G, B through a few resistors and diodes, but I never implemented it on the HW side; instead I have this circuit (8 colors only):
The diodes must be fast and have the same barrier voltage, and the capacitors are 1nF. This avoids glitches in the image caused by the data bus timing of the interface used. The diodes are also needed for the brightness bit if it is added in the future; for more info see the R2R current problem here:
Generating square wave in AVR Assembly without PWM
[Edit1] I made huge changes to the code:
//--- VGA EBI SDRAMC DMACA ver: 3.0 --------------------------------------------------------------
VGA <- AT32UC3A3256
R PX10 (EBI_D0)
G PX09 (EBI_D1)
B PX08 (EBI_D2)
Bright PX07 (EBI_D3)*
/HS PX58
/VS PX59
Button PB10 (Bootloader)
debug PX54 (timing of TC00)
//#define _Debug AVR32_PIN_PX54
#define _Button AVR32_PIN_PB10
#define _VGA_HS (1<<(AVR32_PIN_PX58&31))
#define _VGA_VS (1<<(AVR32_PIN_PX59&31))
#define _VGA_mask (_VGA_HS|_VGA_VS)
volatile avr32_gpio_port_t *port_VGA=&GPIO.port[AVR32_PIN_PX58>>5];
volatile U8 *SDRAM=(U8*)AVR32_EBI_CS0_ADDRESS;
//--- VGA 640x480x4 60Hz -------------------------------------------------------------------------
#define VRAM_xs 256
#define VRAM_ys 192
#define VRAM_bs 8
#define VRAM_ls ((VRAM_xs*VRAM_bs)>>3)
volatile static U8 VRAM[VRAM_ls*VRAM_ys];
| V sync |
| |----------------------|
| | V back |
| | |------------------| |
|H|H| |H|
|s|b| Video 640x480 |f|
|y|a| 525 lines |r|
|n|c| 60 Hz |o|
|c|k| |n|
| | |------------------|t|
| | V front |
// VGA 640x480 60Hz H timing [pixels] dot clock = 25.175MHz
#define VGA_H_front 16
#define VGA_H_sync 96
#define VGA_H_back 48
#define VGA_H_video 640
// VGA 640x480 60Hz H timing [us] Ht = H/25.175, Hf = Vf*(VGA_V_video+VGA_V_front+VGA_V_sync+VGA_V_back)
#define VGA_Ht_front 1
#define VGA_Ht_sync 2
#define VGA_Ht_back 1
#define VGA_Hf 31500
// VGA 640x480 60Hz V timing [lines]
#define VGA_V_video 480
#define VGA_V_front 10
#define VGA_V_sync 2
#define VGA_V_back 33
#define VGA_Vf 60
void VGA_init();
void VGA_screen();
__attribute__((__interrupt__)) static void ISR_TC00_VGA() // TC0 chn:0 VGA horizontal frequency
{ // 8us every 31.75us -> 25% of CPU power
tc_read_sr(&AVR32_TC0,0); // more centered image requires +1 us to VGA_Ht_back -> 28% CPU power
#define y0 (VGA_V_video)
#define y1 (y0+VGA_V_front)
#define y2 (y1+VGA_V_sync)
#define y3 (y2+VGA_V_back)
#define yscr0 ((VGA_V_video>>1)-VRAM_ys)
#define yscr1 (yscr0+(VRAM_ys<<1))
static volatile U8 *p;
static volatile U32 y=y3;
#ifdef _Debug
// init state
if (y>=y3){ y=0; p=VRAM; port_VGA->ovrs=_VGA_mask; }
// VS sync
if (y==y1) port_VGA->ovrc=_VGA_VS; // VS = L
if (y==y2) port_VGA->ovrs=_VGA_VS; // VS = H
// HS sync
wait_us(VGA_Ht_front); // front
port_VGA->ovrc=_VGA_HS; // HS = L
wait_us(VGA_Ht_sync); // sync
port_VGA->ovrs=_VGA_HS; // HS = H
wait_us(VGA_Ht_back); // back
// 8bit pixelformat DMACA, scan doubler + y offset
if ((y>=yscr0)&&(y<yscr1))
// Enable the DMACA
// Src Address: the source_data address
AVR32_DMACA.sar2 = (uint32_t)p;
// Dst Address: the dest_data address
AVR32_DMACA.dar2 = (uint32_t)SDRAM;
// Linked list ptrs: not used.
AVR32_DMACA.llp2 = 0x00000000;
// Channel 2 Ctrl register low
AVR32_DMACA.ctl2l =
(0 << AVR32_DMACA_CTL2L_INT_EN_OFFSET) | // Enable interrupts
(0 << AVR32_DMACA_CTL2L_DST_TR_WIDTH_OFFSET) | // Dst transfer width: 8bit (1,2 multiply the dot clock by 2x)
(0 << AVR32_DMACA_CTL2L_SRC_TR_WIDTH_OFFSET) | // Src transfer width: 8bit
(0 << AVR32_DMACA_CTL2L_DINC_OFFSET) | // Dst address increment: increment
(0 << AVR32_DMACA_CTL2L_SINC_OFFSET) | // Src address increment: increment
(0 << AVR32_DMACA_CTL2L_DST_MSIZE_OFFSET) | // Dst burst transaction len: 1 data items (each of size DST_TR_WIDTH)
(0 << AVR32_DMACA_CTL2L_SRC_MSIZE_OFFSET) | // Src burst transaction len: 1 data items (each of size DST_TR_WIDTH)
(0 << AVR32_DMACA_CTL2L_TT_FC_OFFSET) | // transfer type:M2M, flow controller: DMACA
(1 << AVR32_DMACA_CTL2L_DMS_OFFSET) | // Destination master: HSB master 2
(0 << AVR32_DMACA_CTL2L_SMS_OFFSET) | // Source master: HSB master 1
(0 << AVR32_DMACA_CTL2L_LLP_D_EN_OFFSET) | // Not used
(0 << AVR32_DMACA_CTL2L_LLP_S_EN_OFFSET); // Not used
// Channel 2 Ctrl register high
AVR32_DMACA.ctl2h =
((VRAM_ls) << AVR32_DMACA_CTL2H_BLOCK_TS_OFFSET) | // Block transfer size
(0 << AVR32_DMACA_CTL2H_DONE_OFFSET); // Not done
// Channel 2 Config register low
AVR32_DMACA.cfg2l =
(0 << AVR32_DMACA_CFG2L_HS_SEL_DST_OFFSET) | // Destination handshaking: ignored because the dst is memory
(0 << AVR32_DMACA_CFG2L_HS_SEL_SRC_OFFSET); // Source handshaking: ignored because the src is memory.
// Channel 2 Config register high
AVR32_DMACA.cfg2h =
(0 << AVR32_DMACA_CFG2H_DEST_PER_OFFSET) | // Dest hw handshaking itf: ignored because the dst is memory.
(0 << AVR32_DMACA_CFG2H_SRC_PER_OFFSET); // Source hw handshaking itf: ignored because the src is memory.
// Enable Channel 2 : start the process.
// DMACA is messing up first BYTE so send it by SW before DMA
// scan doubler increment only every second scanline
if ((y&1)==1) p+=VRAM_ls;
*SDRAM=0; y++;
#ifdef _Debug
#undef y0
#undef y1
#undef y2
#undef y3
void VGA_init()
#ifdef _Debug
static const gpio_map_t EBI_GPIO_MAP[] =
gpio_enable_module(EBI_GPIO_MAP, sizeof(EBI_GPIO_MAP) / sizeof(EBI_GPIO_MAP[0]));
AVR32_SDRAMC.mr=0; // normal mode
AVR32_SDRAMC.tr=0; // no refresh (T=0)
// map SDRAM CS -> memory space
for (U32 a=0;a<VRAM_ls*VRAM_ys;a++) VRAM[a]=0;
Now it's enough to call VGA_init(); and everything runs in the background using a timer and DMA between internal RAM and the EBI SDRAM interface. It uses only 25% of CPU power in the current configuration. However, half of the VRAM is wasted, as only 4 bits of each byte are used, so the high nibble could be used for back buffering to compensate. I also downclocked everything to 66MHz, as I do not have enough RAM for higher resolutions.
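As a sanity check on the timing constants, the line and frame rates follow directly from the VGA_H_* and VGA_V_* values and the 25.175MHz dot clock. The sketch below only redoes that arithmetic in host-side C++; the constant names are illustrative.

```cpp
// Values copied from the VGA_H_* and VGA_V_* defines in the listing.
constexpr unsigned h_total = 16 + 96 + 48 + 640;   // pixels per scanline
constexpr unsigned v_total = 480 + 10 + 2 + 33;    // lines per frame
constexpr double dot_clock_hz = 25175000.0;

constexpr double line_hz  = dot_clock_hz / h_total;  // horizontal frequency
constexpr double frame_hz = line_hz / v_total;       // vertical refresh rate

// Memory needed for the 256x192 buffer at 8 bits per pixel.
constexpr unsigned vram_bytes = 256u * 192u * 8u / 8u;
```

This gives 800 pixels per line, 525 lines per frame, a line rate of 31468.75 Hz (VGA_Hf rounds it to 31500) and a refresh rate just under 60 Hz, while the buffer takes 49152 bytes of RAM.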

UART Bluetooth communication problem: what is the proper format to send data to the UART (integer values)?

I have written my own functions to send and receive over the UART, and sending the data does not seem to be a problem. In the data visualizer we can see the values and even plot them.
However, when sending this data over Bluetooth, we cannot get the values to plot in any of the many available apps.
I believe there is a problem with the way we are sending data through the UART to the Bluetooth module, and that is why we cannot get the values plotted.
As a beginner at all this, I would appreciate it if someone could advise us whether the code below is OK, whether there is a mistake, and whether there is a better way to send the data through the UART so that the Bluetooth link works properly. The goal is to be able to plot (graph) the values on the phone.
Many thanks
#define F_CPU 16000000UL
#include <avr/io.h>
#include <util/delay.h>
#include <avr/interrupt.h>
#include <stdio.h>
#define BAUDRATE 9600
#define BAUD_PRESCALLER (((F_CPU / (BAUDRATE * 16UL))) - 1)
float V_n,V_nm1,V_measure=0;
volatile int Velo_pulse;
float Exp_fltr_Coeff=0.2;
unsigned int Counter_ADC=0b0001;
unsigned int Value1;
char String[]="";
//----------Functions Definition
void Timer1_Control();
void AttachInterrupt();
void Set_Ports();
void AnalogRead_Setup();
unsigned int AnalogRead();
void USART_init(void);
unsigned char USART_receive(void);
void USART_send( unsigned char data);
void USART_putstring(char* StringPtr, unsigned int Value1);
int main(void){
USART_init(); //Call the USART initialization code
return 0;
void USART_init(void){
UBRR0H = (unsigned char)(BAUD_PRESCALLER>>8); //UBRR0H = (uint8_t)(BAUD_PRESCALLER>>8);
UBRR0L = (unsigned char)(BAUD_PRESCALLER);
UCSR0B = (1<<RXEN0)|(1<<TXEN0); //Enable receiver / transmitter
UCSR0C = (1<<USBS0)|(3<<UCSZ00); //Set frame format: 8data, 2stop bit
unsigned char USART_receive(void){
while(!(UCSR0A & (1<<RXC0))); //Wait for data to be received (buffer RXCn in the UCSRnA register)
return UDR0;
void USART_send( unsigned char data){
while(!(UCSR0A & (1<<UDRE0))); //Waiting for empty transmit buffer (buffer UDREn in the UCSRnA register)
UDR0 = data; //Loading Data on the transmit buffer
void USART_putstring(char* String, unsigned int Value1){
while(*String != 0x00){
void Set_Ports()
DDRD = 0b11111111; //All port is output
DDRD ^= (1 << DDD5); // PD5 is now input
//ADMUX ^= Counter_ADC; //Swapping between ADC0 an ADC1
void AnalogRead_Setup()
ADCSRA |= (1 << ADPS2) | (0 << ADPS1) | (0 << ADPS0); // Set ADC prescaler to 16 - 1 MHz sample rate # 16MHz
ADMUX |= (1 << REFS0); // Set ADC reference to AVCC
ADMUX |= (1 << ADLAR); // Left adjust ADC result to allow easy 8 bit reading
ADCSRA |= (1 << ADATE); // Set ADC to Free-Running Mode
ADCSRA |= (1 << ADIE); // Interrupt in Conversion Complete
ADCSRA |= (1 << ADEN); // Enable ADC
unsigned int AnalogRead(unsigned int PortVal)
if (PortVal==5){
ADMUX |= (0 << MUX3) | (1 << MUX2) | (0 << MUX1) | (1 << MUX0); //sets the pin 0101 sets pin5
} else if (PortVal==4){
ADMUX |= (0 << MUX3) | (1 << MUX2) | (0 << MUX1) | (0 << MUX0); //sets the pin 0101 sets pin4
ADCSRA |= (1 << ADSC); // Start A2D Conversions
//while(ADCSRA & (1 << ADSC));
return ADCH;
//----------Timer Functions
ISR (TIMER1_COMPA_vect) // Timer1 ISR (compare A vector - Compare Interrupt Mode)
ISR (INT0_vect)
void Timer1_Control()
TCCR1A=0b00000000; //Clear the timer1 registers
TCCR1B=0b00001101; //Sets prescaler (1024) & Compare mode
OCR1A=2604; // 160ms - 6 Hz
void AttachInterrupt()
DDRD ^= (1 << DDD2); // PD2 (PCINT0 pin) is now an input
PORTD |= (1 << PORTD2); // turn On the Pull-up // PD2 is now an input with pull-up enabled
EICRA = 0b00000011; // set INT0 to trigger on rising edge change
EIMSK = 0b00000001; // Turns on INT0
Look at the string initialization:
char String[]="";
This allocates an array of chars with a size of 1 item (the terminating zero).
Then you make a call, passing this array as the first parameter:
And the USART_putstring is as follows:
void USART_putstring(char* String, unsigned int Value1){
while(*String != 0x00){
Note the sprintf(String,"%d\r\n",Value1); call: it converts the numeric value into the char buffer. That is, the buffer must be large enough to hold the text representation of the number, the line ending \r\n, and the zero string terminator.
But since your string buffer has room for only 1 char, what happens after sprintf depends entirely on luck: maybe there is some unused memory after it, so the whole thing will look as if it is working. Maybe there are other variables there, and their values will be overwritten, which makes the program behave unexpectedly later. Or maybe there is some essential data, and your app will crash. The behavior may change after you add a few lines and recompile the code.
The point is: be careful with your buffers. Instead of relying on an initializer to determine the size, set the exact size of the buffer. The number is at most 6 characters long (1 possible sign and 5 digits, assuming you're using AVR-GCC, where int is 16 bits wide and so has -32768 as its minimum), plus 2 for \r\n, plus 1 for the terminating zero. That is, the buffer should hold at least 9 characters.
char String[9];
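As a small illustration of the sizing rule, here is a host-side sketch that formats a 16-bit value into a 9-byte buffer using snprintf, which is bounded by the buffer size and is also available in avr-libc. The names kLineBufferSize and format_line are made up for this example; they are not part of your code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Worst case for a 16-bit signed value: "-32768" + "\r\n" + '\0' = 9 bytes.
constexpr std::size_t kLineBufferSize = 6 + 2 + 1;

// Format value as decimal text followed by CR LF.
// Returns the number of characters written (excluding the terminator),
// or -1 if the buffer was too small and the text was truncated.
int format_line(char* buffer, std::size_t size, std::int16_t value)
{
    assert(buffer != nullptr && size > 0);
    const int needed =
        std::snprintf(buffer, size, "%d\r\n", static_cast<int>(value));
    return (needed >= 0 && static_cast<std::size_t>(needed) < size) ? needed
                                                                    : -1;
}
```

On the AVR you would then pass each character of the buffer to USART_send until the terminating zero is reached. Unlike sprintf, snprintf never writes past the end of the buffer, so a sizing mistake shows up as truncated output rather than corrupted memory.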

Multiple threads running on one core instead of four depending on the OS

I am using Raspbian on a Raspberry Pi 3.
I need to divide my code into a few blocks (2 or 4) and assign a thread per block to speed up calculations.
At the moment, I am testing with simple loops (see attached code) on one thread and then on 4 threads. The execution time on 4 threads is always 4 times longer, so it looks like these 4 threads are scheduled to run on the same CPU.
How do I assign each thread to run on a different CPU? Even 2 threads on 2 CPUs would make a big difference to me.
I even tried g++ 6, with no improvement. Using the OpenMP parallel library in the code with "#pragma omp for" still runs on one CPU.
I tried to run this code on Fedora Linux x86 and saw the same behavior, but on Windows 8.1 with VS2015 I got different results: the time was the same on one thread and on 4 threads, so it was running on different CPUs.
Would you have any suggestions?
Thank you.
#include <iostream>
//#include <arm_neon.h>
#include <ctime>
#include <thread>
#include <mutex>
#include <iostream>
#include <vector>
using namespace std;
float simd_dot0() {
unsigned int i;
unsigned long rezult;
for (i = 0; i < 0xfffffff; i++) {
rezult = i;
return rezult;
int main() {
unsigned num_cpus = std::thread::hardware_concurrency();
std::mutex iomutex;
std::vector<std::thread> threads(num_cpus);
cout << "Start Test 1 CPU" << endl; // prints !!!Hello World!!!
double t_start, t_end, scan_time;
scan_time = 0;
t_start = clock();
t_end = clock();
scan_time += t_end - t_start;
std::cout << "\nExecution time on 1 CPU: "
<< 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
cout << "Finish Test on 1 CPU" << endl; // prints !!!Hello World!!!
cout << "Start Test 4 CPU" << endl; // prints !!!Hello World!!!
scan_time = 0;
t_start = clock();
for (unsigned i = 0; i < 4; ++i) {
threads[i] = std::thread([&iomutex, i] {
std::cout << "\nExecution time on CPU: "
<< i << std::endl;
// Simulate important work done by the tread by sleeping for a bit...
for (auto& t : threads) {
t_end = clock();
scan_time += t_end - t_start;
std::cout << "\nExecution time on 4 CPUs: "
<< 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
cout << "Finish Test on 4 CPU" << endl; // prints !!!Hello World!!!
cout << "!!!Hello World!!!" << endl; // prints !!!Hello World!!!
while (1);
return 0;
Edit:
On the Raspberry Pi 3 with Raspbian I used g++ 4.9 and 6 with the following flags:
-std=c++11 -ftree-vectorize -Wl--no-as-needed -lpthread -march=armv8-a+crc -mcpu=cortex-a53 -mfpu=neon-fp-armv8 -funsafe-math-optimizations -O3
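One thing worth ruling out before looking at CPU affinity is the measurement itself. std::clock() reports the processor time used by the program, and on Linux that is summed over all of its threads, so four busy threads that really do run in parallel can report about four times the processor time of one thread. The MSVC runtime's clock() returns elapsed wall-clock time instead, which may account for the different numbers on Windows. The sketch below measures both kinds of time around the same piece of work; the names Timing and measure are made up for this illustration.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

struct Timing {
    double wall_ms;  // elapsed real time
    double cpu_ms;   // processor time as reported by std::clock()
};

// Run work() once and report both wall-clock and processor time.
template <typename Work>
Timing measure(Work work)
{
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    assert(cpu_start != static_cast<std::clock_t>(-1));  // clock available
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    return Timing{
        std::chrono::duration<double, std::milli>(wall_end - wall_start).count(),
        1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC};
}
```

If the part of main that starts and joins the four threads is wrapped in measure, a wall_ms close to the single-thread figure together with a cpu_ms about four times larger would indicate that the threads are in fact spread over four cores.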

std::async performance on Windows and Solaris 10

I'm running a simple threaded test program on both a Windows machine (compiled using MSVS2015) and a server running Solaris 10 (compiled using GCC 4.9.3). On Windows I'm getting significant performance increases from increasing the threads from 1 to the amount of cores available; however, the very same code does not see any performance gains at all on Solaris 10.
The Windows machine has 4 cores (8 logical) and the Unix machine has 8 cores (16 logical).
What could be the cause for this? I'm compiling with -pthread, and it is creating threads since it prints all the "S"es before the first "F". I don't have root access on the Solaris machine, and from what I can see there's no installed tool which I can use to view a process' affinity.
Example code:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>
std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count());
std::normal_distribution<double> randn(0.0, 1.0);
double generate_randn(uint64_t iterations)
// Print "S" when a thread starts
std::cout << "S";
double rvalue = 0;
for (int i = 0; i < iterations; i++)
rvalue += randn(gen);
// Print "F" when a thread finishes
std::cout << "F";
return rvalue/iterations;
int main(int argc, char *argv[])
if (argc < 2)
return 0;
uint64_t count = 100000000;
uint32_t threads = std::atoi(argv[1]);
double total = 0;
std::vector<std::future<double>> futures;
std::chrono::high_resolution_clock::time_point t1;
std::chrono::high_resolution_clock::time_point t2;
// Start timing
t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < threads; i++)
// Start async tasks
futures.push_back(std::async(std::launch::async, generate_randn, count/threads));
for (auto &future : futures)
// Wait for tasks to finish
total += future.get();
// End timing
t2 = std::chrono::high_resolution_clock::now();
// Take the average of the threads' results
total /= threads;
std::cout << std::endl;
std::cout << total << std::endl;
std::cout << "Finished in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms" << std::endl;
As a general rule, classes defined by the C++ standard library do not have any internal locking. Modifying an instance of a standard library class from more than one thread, or reading it from one thread while writing it from another, is undefined behavior, unless "objects of that type are explicitly specified as being sharable without data races" (N3337). The RNG classes are not "explicitly specified as being sharable without data races". (cout is an example of a stdlib object that is "sharable without data races", as long as you haven't done ios::sync_with_stdio(false).)
As such, your program is incorrect because it accesses a global RNG object from more than one thread simultaneously; every time you request another random number, the internal state of the generator is modified. On Solaris, this seems to result in serialization of accesses, whereas on Windows it is probably instead causing you not to get properly "random" numbers.
The cure is to create separate RNGs for each thread. Then each thread will operate independently, and they will neither slow each other down nor step on each other's toes. This is a special case of a very general principle: multithreading always works better the less shared data there is.
There's an additional wrinkle to worry about: each thread will call system_clock::now at very nearly the same time, so you may end up with some of the per-thread RNGs seeded with the same value. It would be better to seed them all from a random_device object. random_device requests random numbers from the operating system, and does not need to be seeded; but it can be very slow. The random_device should be created and used inside main, and seeds passed to each worker function, because a global random_device accessed from multiple threads (as in the previous edition of this answer) is just as undefined as a global default_random_engine.
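The seeding scheme just described can be sketched in isolation: seeds are drawn on one thread from a single random_device, and each worker builds its own engine from the seed it is given. The function names mean_of_normals and make_seeds are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Worker that owns its generator; nothing here is shared between threads.
double mean_of_normals(std::uint64_t iterations, unsigned int seed)
{
    assert(iterations > 0);  // averaging zero samples is meaningless
    std::default_random_engine gen(seed);
    std::normal_distribution<double> randn(0.0, 1.0);
    double sum = 0.0;
    for (std::uint64_t i = 0; i < iterations; ++i)
        sum += randn(gen);
    return sum / static_cast<double>(iterations);
}

// Draw one seed per worker from a single random_device, on the caller's thread.
std::vector<unsigned int> make_seeds(std::size_t workers)
{
    std::random_device device;
    std::vector<unsigned int> seeds;
    seeds.reserve(workers);
    for (std::size_t i = 0; i < workers; ++i)
        seeds.push_back(device());
    return seeds;
}
```

Because a worker's result depends only on its arguments, running it with the same seed twice gives the same value, which also makes a multithreaded run reproducible if you log the seeds.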
All told, your program should look something like this:
#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>
#include <cstdint>
#include <cstdlib>

static double generate_randn(uint64_t iterations, unsigned int seed)
{
    // Print "S" when a thread starts
    std::cout << "S";
    // Each task owns its engine and distribution, so no state is shared
    std::default_random_engine gen(seed);
    std::normal_distribution<double> randn(0.0, 1.0);
    double rvalue = 0;
    for (uint64_t i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    return rvalue/iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;
    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    double total = 0;
    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;
    std::random_device make_seed;
    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < threads; i++)
    {
        // Start async tasks, each with its own seed drawn here in main
        futures.push_back(std::async(std::launch::async,
                                     generate_randn,
                                     count/threads,
                                     make_seed()));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();
    // Take the average of the threads' results
    total /= threads;
    std::cout << '\n' << total
              << "\nFinished in "
              << std::chrono::duration_cast<
                   std::chrono::milliseconds>(t2 - t1).count()
              << " ms\n";
    return 0;
}
(This isn't really an answer, but it won't fit into a comment, especially with the command formatting and links.)
You can profile your executable on Solaris using Solaris Studio's collect utility. On Solaris, that will be able to show you where your threads are contending.
collect -d /tmp -p high -s all app [app args]
Then view the results using the analyzer utility:
analyzer /tmp/test.1.er &
Replace /tmp/test.1.er with the path to the output generated by a collect profile run.
If your threads are contending over some resource(s), as @zwol posted in his answer, you will see it.
An Oracle marketing brief for the toolset can be found here: http://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf
You can also try compiling your code with Solaris Studio for more data.

Large overhead in CUDA kernel launch outside GPU execution

I am measuring the running time of kernels, as seen from a CPU thread, by measuring the interval from just before launching a kernel to just after a cudaDeviceSynchronize (using gettimeofday). I also call cudaDeviceSynchronize before I start recording the interval. In addition, I instrument the kernels to record a GPU timestamp (using clock64) at the start of the kernel, taken by thread(0,0,0) of each block from block(0,0,0) to block(occupancy-1,0,0), into an array whose size equals the number of SMs. At the end of the kernel code, every thread writes its timestamp to another array (of the same size) at the index of the SM it runs on.
The intervals calculated from the two arrays are 60-70% of the interval measured from the CPU thread.
For example, on a K40, while gettimeofday gives an interval of 140ms, the average of the intervals calculated from the GPU timestamps is only 100ms. I have experimented with many grid sizes (15 blocks to 6K blocks) but have found similar behavior so far.
__global__ void some_kernel(long long *d_start, long long *d_end){
d_start[blockIdx.x] = clock64();
//some_kernel code
d_end[blockIdx.x] = clock64();
Does this seem possible to the experts?
Does this seem possible to the experts?
I suppose anything is possible for code you haven't shown. After all, you may just have a silly bug in any of your computation arithmetic. But if the question is "is it sensible that there should be 40ms of unaccounted-for time overhead on a kernel launch, for a kernel that takes ~140ms to execute?" I would say no.
I believe the method I outlined in the comments is reasonably accurate. Take the minimum clock64() timestamp from any thread in the grid (but see note below regarding SM restriction). Compare it to the maximum time stamp of any thread in the grid. The difference will be comparable to the reported execution time of gettimeofday() to within 2 percent, according to my testing.
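The host-side reduction of those timestamps can be sketched as follows. This is an illustration rather than the code from my test case: it assumes, as the per-SM arrays in the test case suggest, that timestamps are only compared within the same SM, and that the clock rate is given in kHz, which is how cudaDevAttrClockRate reports it. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Longest (end - start) interval over all SMs, converted to milliseconds.
// t_start[sm] holds the earliest and t_end[sm] the latest clock64() value
// seen on that SM; clock_khz is the SM clock in kHz.
double longest_interval_ms(const unsigned long long* t_start,
                           const unsigned long long* t_end,
                           std::size_t num_sms, int clock_khz)
{
    assert(clock_khz > 0);
    unsigned long long longest = 0;
    for (std::size_t sm = 0; sm < num_sms; ++sm) {
        if (t_end[sm] > t_start[sm]) {  // skip SMs that recorded nothing
            const unsigned long long cycles = t_end[sm] - t_start[sm];
            if (cycles > longest) longest = cycles;
        }
    }
    return static_cast<double>(longest) / clock_khz;  // cycles / kHz = ms
}
```

The result is what gets compared against the gettimeofday interval measured around the kernel launch and cudaDeviceSynchronize.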
Here is my test case:
$ cat t1040.cu
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define LS_MAX 2000000000U
#define MAX_SM 64
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
__device__ int result;
__device__ unsigned long long t_start[MAX_SM];
__device__ unsigned long long t_end[MAX_SM];
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
__device__ __inline__ uint32_t __mysmid(){
uint32_t smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;}
__global__ void kernel(unsigned ls){
unsigned long long int ts = clock64();
unsigned my_sm = __mysmid();
atomicMin(t_start+my_sm, ts);
// junk code to waste time
int tv = ts&0x1F;
for (unsigned i = 0; i < ls; i++){
tv &= (ts+i);}
result = tv;
// end of junk code
ts = clock64();
atomicMax(t_end+my_sm, ts);
}
// optional command line parameter 1 = kernel duration, parameter 2 = number of blocks, parameter 3 = number of threads per block
int main(int argc, char *argv[]){
unsigned ls;
if (argc > 1) ls = atoi(argv[1]);
else ls = 1000000;
if (ls > LS_MAX) ls = LS_MAX;
int num_sms = 0;
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
cudaCheckErrors("cuda get attribute fail");
int gpu_clk = 0;
cudaDeviceGetAttribute(&gpu_clk, cudaDevAttrClockRate, 0);
if ((num_sms < 1) || (num_sms > MAX_SM)) {printf("invalid sm count: %d\n", num_sms); return 1;}
unsigned blks;
if (argc > 2) blks = atoi(argv[2]);
else blks = num_sms;
if ((blks < 1) || (blks > 0x3FFFFFFF)) {printf("invalid blocks: %u\n", blks); return 1;}
unsigned ntpb;
if (argc > 3) ntpb = atoi(argv[3]);
else ntpb = 256;
if ((ntpb < 1) || (ntpb > 1024)) {printf("invalid threads: %u\n", ntpb); return 1;}
kernel<<<1,1>>>(100); // warm up
cudaCheckErrors("kernel fail");
unsigned long long *h_start, *h_end;
h_start = new unsigned long long[num_sms];
h_end = new unsigned long long[num_sms];
for (int i = 0; i < num_sms; i++){
h_start[i] = 0xFFFFFFFFFFFFFFFFULL; // sentinel so atomicMin keeps the earliest timestamp
h_end[i] = 0;}
cudaMemcpyToSymbol(t_start, h_start, num_sms*sizeof(unsigned long long));
cudaMemcpyToSymbol(t_end, h_end, num_sms*sizeof(unsigned long long));
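unsigned long long htime = dtime_usec(0);
kernel<<<blks,ntpb>>>(ls); // the launch being timed
cudaDeviceSynchronize(); // wait for the kernel to finish before stopping the host timer
htime = dtime_usec(htime);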
cudaMemcpyFromSymbol(h_start, t_start, num_sms*sizeof(unsigned long long));
cudaMemcpyFromSymbol(h_end, t_end, num_sms*sizeof(unsigned long long));
cudaCheckErrors("some error");
printf("host elapsed time (ms): %f \n device sm clocks:\n start:", htime/1000.0f);
unsigned long long max_diff = 0;
for (int i = 0; i < num_sms; i++) {printf(" %12llu ", h_start[i]);}
printf("\n end: ");
for (int i = 0; i < num_sms; i++) {printf(" %12llu ", h_end[i]);}
for (int i = 0; i < num_sms; i++) if ((h_start[i] != 0xFFFFFFFFFFFFFFFFULL) && (h_end[i] != 0) && ((h_end[i]-h_start[i]) > max_diff)) max_diff=(h_end[i]-h_start[i]);
printf("\n max diff clks: %llu\nmax diff kernel time (ms): %f\n", max_diff, max_diff/(float)(gpu_clk));
return 0;
}
$ nvcc -o t1040 t1040.cu -arch=sm_35
$ ./t1040 1000000 1000 128
host elapsed time (ms): 2128.818115
device sm clocks:
start: 3484744 3484724
end: 2219687393 2228431323
max diff clks: 2224946599
max diff kernel time (ms): 2128.117432
This code can only be run on a cc3.5 or higher GPU due to the use of 64-bit atomicMin and atomicMax.
I've run it on a variety of grid configurations, on both a GT640 (very low end cc3.5 device) and a K40c (high end), and the timing results between host and device agree to within 2% for reasonably long kernel execution times. If you pass 1 as the first command line parameter, with very small grid sizes, the kernel execution time will be very short (nanoseconds), whereas the host will see about 10-20us. That difference is the kernel launch overhead being measured. So the 2% number applies to kernels that take much longer than 20us to execute.
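To make the unit handling behind that comparison explicit: cudaDevAttrClockRate reports the clock in kHz, which is ticks per millisecond, so dividing a clock64() difference by it gives milliseconds directly. The following is only a rough host-side sketch in plain C++. The helper names clocks_to_ms and percent_gap are made up for illustration, and it uses double where the test case uses float.

```cpp
#include <cmath>

// Convert a clock64() tick count to milliseconds.
// clock_khz is the value read from cudaDevAttrClockRate (kHz = ticks per ms).
double clocks_to_ms(unsigned long long clks, int clock_khz) {
    return static_cast<double>(clks) / static_cast<double>(clock_khz);
}

// Gap between host-measured and device-measured time,
// expressed as a percentage of the host-measured time.
double percent_gap(double host_ms, double device_ms) {
    return 100.0 * std::fabs(host_ms - device_ms) / host_ms;
}
```

With the numbers from the sample run, percent_gap(2128.818115, 2128.117432) comes out at roughly 0.03%, well inside the 2% figure. For the 140ms vs. 100ms case in the question it would be about 29%. With a hypothetical 1,000,000 kHz clock (a round number chosen for illustration, not a measurement), clocks_to_ms(2000000000ULL, 1000000) is 2000 ms.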
It accepts 3 (optional) command line parameters, the first of which varies the amount of time the kernel will execute.
My timestamping is done on a per-SM basis, because the clock64() resource is indicated to be a per-SM resource. The SM clocks are not guaranteed to be synchronized between SMs, so a start and an end timestamp should only be compared when both come from the same SM.
You can modify the grid dimensions. The second optional command line parameter specifies the number of blocks to launch. The third optional command line parameter specifies the number of threads per block. The timing methodology I have shown here should not be dependent on number of blocks launched or number of threads per block. If you specify fewer blocks than SMs, the code should ignore "unused" SM data.
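The per-SM reduction and the "ignore unused SMs" rule can be pulled out into a small host-side function, which makes the logic easier to check in isolation. This is a sketch in plain C++, not part of the test case. The name max_sm_interval and the two constants are hypothetical, and the sentinel values are assumed to be the same ones the final loop of the test case checks for.

```cpp
#include <cstddef>

// Sentinels the test case checks for: an SM that never ran a block
// still holds these values after the kernel finishes.
const unsigned long long kUnusedStart = 0xFFFFFFFFFFFFFFFFULL;
const unsigned long long kUnusedEnd = 0ULL;

// Longest (end - start) interval over all SMs that actually ran work.
// Start and end are only compared within one SM, never across SMs,
// because the SM clocks are not guaranteed to be synchronized.
unsigned long long max_sm_interval(const unsigned long long *start,
                                   const unsigned long long *end,
                                   std::size_t num_sms) {
    unsigned long long max_diff = 0;
    for (std::size_t i = 0; i < num_sms; ++i) {
        if (start[i] == kUnusedStart || end[i] == kUnusedEnd) continue; // unused SM
        unsigned long long diff = end[i] - start[i];
        if (diff > max_diff) max_diff = diff;
    }
    return max_diff;
}
```

Fed the two start/end pairs from the sample output, this returns 2224946599, matching the "max diff clks" line printed by the run.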
