CUDA Programming: Compilation Error

I am writing a CUDA program that computes a data-parallel prefix sum over N numbers. The code is also supposed to generate the numbers on the host using a random number generator. However, when I compile it I always get an "unrecognized token" error and an "expected a declaration" error on the closing brace of int main. I am running the code on Linux.
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#include <math.h>
__global__ void gpu_cal(int *a,int i, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if(tid>=i && tid < n) {
a[tid] = a[tid]+a[tid-i];
}
}
int main(void)
{
int key;
int *dev_a;
int N=10;//size of 1D array
int B=1;//blocks in the grid
int T=10;//threads in a block
do{
printf ("Some limitations:\n");
printf (" Maximum number of threads per block = 1024\n");
printf (" Maximum sizes of x-dimension of thread block = 1024\n");
printf (" Maximum size of each dimension of grid of thread blocks = 65535\n");
printf (" N<=B*T\n");
do{
printf("Enter size of array in one dimension, currently %d\n",N);
scanf("%d",&N);
printf("Enter size of blocks in the grid, currently %d\n",B);
scanf("%d",&B);
printf("Enter size of threads in a block, currently %d\n",T);
scanf("%d",&T);
if(N>B*T)
printf("N>B*T, this will result in an incorrect result generated by GPU, please try again\n");
if(T>1024)
printf("T>1024, this will result in an incorrect result generated by GPU, please try again\n");
}while((N>B*T)||(T>1024));
cudaEvent_t start, stop; // using cuda events to measure time
float elapsed_time_ms1, elapsed_time_ms3;
int a[N],gpu_result[N];//for result generated by GPU
int cpu_result[N];//CPU result
cudaMalloc((void**)&dev_a,N * sizeof(int));//allocate memory on GPU
int i,j;
srand(1); //initialize random number generator
for (i=0; i < N; i++) // load array with some numbers
a[i] = (int)rand() ;
cudaMemcpy(dev_a, a , N*sizeof(int),cudaMemcpyHostToDevice);//load data from host to device
cudaEventCreate(&start); // instrument code to measure start time
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
//GPU computation
for(j=0;j<log(N)/log(2);j++){
gpu_cal<<<B,T>>>(dev_a,pow(2,j),N);
cudaThreadSynchronize();
}
cudaMemcpy(gpu_result,dev_a,N*sizeof(int),cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0); // instrument code to measue end time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms1, start, stop );
printf("\n\n\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1); // print out execution time
//CPU computation
cudaEventRecord(start, 0);
for(i=0;i<N;i++)
{
cpu_result[i]=0;
for(j=0;j<=i;j++)
{
cpu_result[i]=cpu_result[i]+a[j];
}
}
cudaEventRecord(stop, 0); // instrument code to measue end time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms3, start, stop );
printf("Time to calculate results on CPU: %f ms.\n\n", elapsed_time_ms3); // print out execution time
//Error check
for(i=0;i < N;i++) {
if (gpu_result[i] != cpu_result[i] ) {
printf("ERROR!!! CPU and GPU create different answers\n");
break;
}
}
//Calculate speedup
printf("Speedup on GPU compared to CPU= %f\n", (float) elapsed_time_ms3 / (float) elapsed_time_ms1);
printf("\nN=%d",N);
printf("\nB=%d",B);
printf("\nT=%d",T);
printf("\n\n\nEnter '1' to repeat, or other integer to terminate\n");
scanf("%d",&key);
}while(key == 1);
cudaFree(dev_a);//deallocation
return 0;
}

The very last } in your code is a non-ASCII Unicode character rather than a plain }. If you delete that entire line and retype the }, the error will go away.

There are two compile errors in your code.
First, the last closing brace is a non-ASCII Unicode character, so delete it and retype it as a plain }.
Second, N is used as an array size in this line: int a[N],gpu_result[N];//for result generated by GPU
N is declared as a plain int, and a non-constant array size is not allowed in standard C++ (variable-length arrays are a C99 feature that some compilers accept only as an extension); nvcc compiles .cu files as C++. Declaring N as const int N fixes this, but only if N is no longer read with scanf; otherwise the arrays have to be allocated dynamically.
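One way to sidestep the array-size problem described above is to size the host buffers at run time. The sketch below is only an illustration, not the asker's code: it swaps `int a[N]` for `std::vector<int>`, and the helper names `make_input` and `cpu_prefix_sum`, as well as the `% 100` value range, are my own assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper: fill a runtime-sized host buffer with pseudo-random
// numbers, as the question does with rand().
std::vector<int> make_input(int n, unsigned seed) {
    assert(n >= 0);
    std::srand(seed);
    std::vector<int> a(static_cast<std::size_t>(n));  // replaces `int a[N]`
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = std::rand() % 100;  // small values (my choice) keep the sums small
    return a;
}

// Hypothetical helper: host-side inclusive prefix sum, usable as the
// reference result to compare against the GPU output.
std::vector<int> cpu_prefix_sum(const std::vector<int> &a) {
    std::vector<int> out(a.size());
    int running = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        running += a[i];   // sum of a[0..i]
        out[i] = running;
    }
    return out;
}
```

The vector's storage is contiguous, so it can be handed to the existing copy call, for example cudaMemcpy(dev_a, a.data(), a.size() * sizeof(int), cudaMemcpyHostToDevice).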

Related

C Function to return a String resulting in corrupted top size

I am trying to write a program that calls upon an [external library (?)] (I'm not sure that I'm using the right terminology here) that I am also writing to clean up a provided string. For example, if my main.c program were to be provided with a string such as:
asdfFAweWFwseFL Wefawf JAWEFfja FAWSEF
it would call upon a function in externalLibrary.c (let's call it externalLibrary_Clean for now) that would take in the string, and return all characters in upper case without spaces:
ASDFFAWEWFWSEFLWEFAWFJAWEFFJAFAWSEF
The crazy part is that I have this working... so long as my string doesn't exceed 26 characters in length. As soon as I add a 27th character, I end up with an error that says
malloc(): corrupted top size.
Here is externalLibrary.c:
#include "externalLibrary.h"
#include <ctype.h>
#include <malloc.h>
#include <assert.h>
#include <string.h>
char * restrict externalLibrary_Clean(const char* restrict input) {
// first we define the return value as a pointer and initialize
// an integer to count the length of the string
char * returnVal = malloc(sizeof(input));
char * initialReturnVal = returnVal; //point to the start location
// until we hit the end of the string, we use this while loop to
// iterate through it
while (*input != '\0') {
if (isalpha(*input)) { // if we encounter an alphabet character (a-z/A-Z)
// then we convert it to an uppercase value and point our return value at it
*returnVal = toupper(*input);
returnVal++; //we use this to move our return value to the next location in memory
}
input++; // we move to the next memory location on the provided character pointer
}
*returnVal = '\0'; //once we have exhausted the input character pointer, we terminate our return value
return initialReturnVal;
}
int * restrict externalLibrary_getFrequencies(char * ar, int length){
static int freq[26];
for (int i = 0; i < length; i++){
freq[(ar[i]-65)]++;
}
return freq;
}
the header file for it (externalLibrary.h):
#ifndef LEARNINGC_EXTERNALLIBRARY_H
#define LEARNINGC_EXTERNALLIBRARY_H
#ifdef __cplusplus
extern "C" {
#endif
char * restrict externalLibrary_Clean(const char* restrict input);
int * restrict externalLibrary_getFrequencies(char * ar, int length);
#ifdef __cplusplus
}
#endif
#endif //LEARNINGC_EXTERNALLIBRARY_H
my main.c file from where all the action is happening:
#include <stdio.h>
#include "externalLibrary.h"
int main() {
char * unfilteredString = "ASDFOIWEGOASDGLKASJGISUAAAA";//if this exceeds 26 characters, the program breaks
char * cleanString = externalLibrary_Clean(unfilteredString);
//int * charDist = externalLibrary_getFrequencies(cleanString, 25); //this works just fine... for now
printf("\nOutput: %s\n", unfilteredString);
printf("\nCleaned Output: %s\n", cleanString);
/*for(int i = 0; i < 26; i++){
if(charDist[i] == 0){
}
else {
printf("%c: %d \n", (i + 65), charDist[i]);
}
}*/
return 0;
}
I'm extremely well versed in Java programming and I'm trying to translate my knowledge over to C as I wish to learn how my computer works in more detail (and have finer control over things such as memory).
If I were solving this problem in Java, it would be as simple as creating two class files: one called main.java and one called externalLibrary.java, where I would have static String Clean(string input) and then call upon it in main.java with String cleanString = externalLibrary.Clean(unfilteredString).
Clearly this isn't how C works, but I want to learn how it does work (and why my code is crashing with "corrupted top size").

The bug is this line:
char * returnVal = malloc(sizeof(input));
This is a bug because sizeof(input) is the size of the pointer, not of the string it points to, so the call requests only enough space to store a pointer: 8 bytes in a 64-bit program. What you want is enough space for the modified string, which you can request with the following line:
char *returnVal = malloc(strlen(input) + 1);
The other part of your question is why the program doesn't crash when your string is no longer than 26 characters. The reason is that malloc is allowed to give the caller slightly more than the caller requested.
In your case, the message "malloc(): corrupted top size" suggests that you are using glibc malloc, which is the default on Linux. In a 64-bit process, that allocator always gives you at least 0x18 (24) usable bytes (the minimum chunk size of 0x20, minus 8 bytes for the size/status field). In the specific case where the allocation immediately precedes the "top" chunk, writing past the end of the allocation clobbers the size of "top".
If your cleaned string is longer than 23 (0x17) characters, you start to clobber the size/status field of the subsequent chunk, because you also need 1 byte for the terminating NUL. Any string of 23 characters or fewer does not cause a problem.
As to why you didn't get an error with a 26-character string: to answer that precisely, one would have to see the exact program with the 26-character string that does not crash. For example, if the 26-character input contained 3 blanks, the result would need only 26 + 1 - 3 = 24 bytes, which would fit.
If you are not interested in that level of detail, fixing the malloc call to request the proper amount will fix your crash.
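For reference, here is a minimal sketch of the cleaning function with the corrected allocation. It is written so that it also compiles as C++ (hence the cast on malloc and no restrict); the name `clean_upper` is hypothetical and is not the asker's externalLibrary_Clean.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <cstring>

// Hypothetical rewrite of the cleaning routine from the question.
// Returns a malloc'd, NUL-terminated string; the caller must free() it.
char *clean_upper(const char *input) {
    assert(input != nullptr);
    // The output is never longer than the input, so strlen(input) + 1 bytes
    // (the extra byte holds the terminating NUL) is always enough.
    char *out = static_cast<char *>(std::malloc(std::strlen(input) + 1));
    if (out == nullptr) return nullptr;
    char *dst = out;
    for (const char *src = input; *src != '\0'; ++src) {
        // isalpha/toupper expect a value representable as unsigned char
        unsigned char c = static_cast<unsigned char>(*src);
        if (std::isalpha(c)) *dst++ = static_cast<char>(std::toupper(c));
    }
    *dst = '\0';
    return out;
}
```

With this allocation the 27-character string from main.c no longer writes past the end of the buffer, because the buffer is sized from the string rather than from the pointer.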

MPI_Recv takes a long and strange time to return

I am trying to compare the execution time of two programs:
- the first one uses the non-blocking functions MPI_Ssend and MPI_Irecv, which allow some calculations to be done "while messages are being sent and received".
- the other one uses blocking OpenMPI functions.
I have no problem assessing the performance of the first program, and it looks "good".
My problem is that the second program, which uses MPI_Recv, often takes a very long and rather "strange" time to finish: always a little more than 1 second (for example 1.001033 seconds), as if the process had to do something that takes exactly 1 second.
I modified my initial code to show you an equivalent version.
#include "mpi.h"
#include <stdlib.h>
#include <stdio.h>
#define SIZE 5000
int main(int argc, char * argv[])
{
MPI_Request trash;
int rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int myVector[SIZE];
for(int i = 0 ; i < SIZE ; i++)
myVector[i] = 100;
int buff[SIZE];
double startTime = MPI_Wtime();
if(rank == 0)
MPI_Issend(myVector, SIZE, MPI_INT, 1, 1, MPI_COMM_WORLD, &trash);
if(rank == 1) {
MPI_Recv(buff, SIZE, MPI_INT, 0, 1, MPI_COMM_WORLD, NULL);
printf("Processus 1 finished in %lf seconds\n", MPI_Wtime()-startTime);
}
MPI_Finalize();
return 0;
}
Thank you.

Large overhead in CUDA kernel launch outside GPU execution

I am measuring the running time of kernels, as seen from a CPU thread, by measuring the interval from just before launching a kernel to just after a cudaDeviceSynchronize (using gettimeofday). I call cudaDeviceSynchronize before I start recording the interval. I also instrument the kernels to record a GPU timestamp (using clock64) at the start of the kernel, taken by thread(0,0,0) of each block from block(0,0,0) to block(occupancy-1,0,0), into an array whose size equals the number of SMs. At the end of the kernel code, every thread writes its timestamp to another array (of the same size) at the index of the SM it runs on.
The intervals calculated from the two arrays are 60-70% of that measured from the CPU thread.
For example, on a K40, while gettimeofday gives an interval of 140ms, the avg of intervals calculated from GPU timestamps is only 100ms. I have experimented with many grid sizes (15 blocks to 6K blocks) but have found similar behavior so far.
__global__ void some_kernel(long long *d_start, long long *d_end){
if(threadIdx.x==0){
d_start[blockIdx.x] = clock64();
}
//some_kernel code
d_end[blockIdx.x] = clock64();
}
Does this seem possible to the experts?

Does this seem possible to the experts?
I suppose anything is possible for code you haven't shown. After all, you may just have a silly bug in any of your computation arithmetic. But if the question is "is it sensible that there should be 40ms of unaccounted-for time overhead on a kernel launch, for a kernel that takes ~140ms to execute?" I would say no.
I believe the method I outlined in the comments is reasonably accurate. Take the minimum clock64() timestamp from any thread in the grid (but see note below regarding SM restriction). Compare it to the maximum time stamp of any thread in the grid. The difference will be comparable to the reported execution time of gettimeofday() to within 2 percent, according to my testing.
Here is my test case:
$ cat t1040.cu
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define LS_MAX 2000000000U
#define MAX_SM 64
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
__device__ int result;
__device__ unsigned long long t_start[MAX_SM];
__device__ unsigned long long t_end[MAX_SM];
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
__device__ __inline__ uint32_t __mysmid(){
uint32_t smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;}
__global__ void kernel(unsigned ls){
unsigned long long int ts = clock64();
unsigned my_sm = __mysmid();
atomicMin(t_start+my_sm, ts);
// junk code to waste time
int tv = ts&0x1F;
for (unsigned i = 0; i < ls; i++){
tv &= (ts+i);}
result = tv;
// end of junk code
ts = clock64();
atomicMax(t_end+my_sm, ts);
}
// optional command line parameter 1 = kernel duration, parameter 2 = number of blocks, parameter 3 = number of threads per block
int main(int argc, char *argv[]){
unsigned ls;
if (argc > 1) ls = atoi(argv[1]);
else ls = 1000000;
if (ls > LS_MAX) ls = LS_MAX;
int num_sms = 0;
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
cudaCheckErrors("cuda get attribute fail");
int gpu_clk = 0;
cudaDeviceGetAttribute(&gpu_clk, cudaDevAttrClockRate, 0);
if ((num_sms < 1) || (num_sms > MAX_SM)) {printf("invalid sm count: %d\n", num_sms); return 1;}
unsigned blks;
if (argc > 2) blks = atoi(argv[2]);
else blks = num_sms;
if ((blks < 1) || (blks > 0x3FFFFFFF)) {printf("invalid blocks: %d\n", blks); return 1;}
unsigned ntpb;
if (argc > 3) ntpb = atoi(argv[3]);
else ntpb = 256;
if ((ntpb < 1) || (ntpb > 1024)) {printf("invalid threads: %d\n", ntpb); return 1;}
kernel<<<1,1>>>(100); // warm up
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
unsigned long long *h_start, *h_end;
h_start = new unsigned long long[num_sms];
h_end = new unsigned long long[num_sms];
for (int i = 0; i < num_sms; i++){
h_start[i] = 0xFFFFFFFFFFFFFFFFULL;
h_end[i] = 0;}
cudaMemcpyToSymbol(t_start, h_start, num_sms*sizeof(unsigned long long));
cudaMemcpyToSymbol(t_end, h_end, num_sms*sizeof(unsigned long long));
unsigned long long htime = dtime_usec(0);
kernel<<<blks,ntpb>>>(ls);
cudaDeviceSynchronize();
htime = dtime_usec(htime);
cudaMemcpyFromSymbol(h_start, t_start, num_sms*sizeof(unsigned long long));
cudaMemcpyFromSymbol(h_end, t_end, num_sms*sizeof(unsigned long long));
cudaCheckErrors("some error");
printf("host elapsed time (ms): %f \n device sm clocks:\n start:", htime/1000.0f);
unsigned long long max_diff = 0;
for (int i = 0; i < num_sms; i++) {printf(" %12llu ", h_start[i]);}
printf("\n end: ");
for (int i = 0; i < num_sms; i++) {printf(" %12llu ", h_end[i]);}
for (int i = 0; i < num_sms; i++) if ((h_start[i] != 0xFFFFFFFFFFFFFFFFULL) && (h_end[i] != 0) && ((h_end[i]-h_start[i]) > max_diff)) max_diff=(h_end[i]-h_start[i]);
printf("\n max diff clks: %llu\nmax diff kernel time (ms): %f\n", max_diff, max_diff/(float)(gpu_clk));
return 0;
}
$ nvcc -o t1040 t1040.cu -arch=sm_35
$ ./t1040 1000000 1000 128
host elapsed time (ms): 2128.818115
device sm clocks:
start: 3484744 3484724
end: 2219687393 2228431323
max diff clks: 2224946599
max diff kernel time (ms): 2128.117432
$
Notes:
This code can only be run on a cc3.5 or higher GPU due to the use of 64-bit atomicMin and atomicMax.
I've run it on a variety of grid configurations, on both a GT640 (a very low-end cc3.5 device) and a K40c (high end), and the timing results between host and device agree to within 2% for reasonably long kernel execution times. If you pass 1 as the command line parameter, with very small grid sizes, the kernel execution time will be very short (nanoseconds) whereas the host will see about 10-20us; this is kernel launch overhead being measured. So the 2% number applies to kernels that take much longer than 20us to execute.
It accepts 3 (optional) command line parameters, the first of which varies the amount of time the kernel will execute.
My timestamping is done on a per-SM basis, because the clock64() resource is indicated to be a per-SM resource. The sm clocks are not guaranteed to be synchronized between SMs.
You can modify the grid dimensions. The second optional command line parameter specifies the number of blocks to launch. The third optional command line parameter specifies the number of threads per block. The timing methodology I have shown here should not be dependent on number of blocks launched or number of threads per block. If you specify fewer blocks than SMs, the code should ignore "unused" SM data.
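The last printf in the test case turns a clock64() difference into milliseconds by dividing by the value of cudaDevAttrClockRate, which is reported in kHz. A small host-side sketch of that arithmetic follows; the function name `ticks_to_ms` is hypothetical, and the conversion assumes the SM really ran at the reported rate for the whole kernel, which may not hold exactly.

```cpp
#include <cassert>

// Hypothetical helper mirroring max_diff/(float)(gpu_clk) in the test case.
// ticks:     difference of two clock64() readings taken on the same SM
// clock_khz: value obtained for cudaDevAttrClockRate (kilohertz)
double ticks_to_ms(unsigned long long ticks, int clock_khz) {
    assert(clock_khz > 0);
    // kHz means "thousand ticks per second", i.e. ticks per millisecond,
    // so ticks / kHz is already a time in milliseconds.
    return static_cast<double>(ticks) / clock_khz;
}
```

Using double rather than float here is my choice; it only avoids rounding the 64-bit tick count more than necessary.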

Why does my process take too long to die?

Basically I'm using Linux 2.6.34 on PowerPC (Freescale e500mc). I have a process (a kind of VM that was developed in-house) that uses about 2.25 G of mlocked VM. When I kill it, I notice that it takes upwards of 2 minutes to terminate.
I investigated a little. First, I closed all open file descriptors but that didn't seem to make a difference. Then I added some printk in the kernel and through it I found that all delay comes from the kernel unlocking my VMAs. The delay is uniform across pages, which I verified by repeatedly checking the locked page count in /proc/meminfo. I've checked with programs that allocate that much memory and they all die as soon as I signal them.
What do you think I should check now? Thanks for your replies.
Edit: I had to find a way to share more information about the problem, so I wrote the program below:
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <sys/time.h>
#define MAP_PERM_1 (PROT_WRITE | PROT_READ | PROT_EXEC)
#define MAP_PERM_2 (PROT_WRITE | PROT_READ)
#define MAP_FLAGS (MAP_ANONYMOUS | MAP_FIXED | MAP_PRIVATE)
#define PG_LEN 4096
#define align_pg_32(addr) (addr & 0xFFFFF000)
#define num_pg_in_range(start, end) ((end - start + 1) >> 12)
inline void __force_pgtbl_alloc(unsigned int start)
{
volatile int *s = (int *) start;
*s = *s;
}
int __map_a_page_at(unsigned int start, int whichperm)
{
int perm = whichperm ? MAP_PERM_1 : MAP_PERM_2;
if(MAP_FAILED == mmap((void *)start, PG_LEN, perm, MAP_FLAGS, 0, 0)){
fprintf(stderr,
"mmap failed at 0x%x: %s.\n",
start, strerror(errno));
return 0;
}
return 1;
}
int __mlock_page(unsigned int addr)
{
if (mlock((void *)addr, (size_t)PG_LEN) < 0){
fprintf(stderr,
"mlock failed on page: 0x%x: %s.\n",
addr, strerror(errno));
return 0;
}
return 1;
}
void sigint_handler(int p)
{
struct timeval start = {0 ,0}, end = {0, 0}, diff = {0, 0};
gettimeofday(&start, NULL);
munlockall();
gettimeofday(&end, NULL);
timersub(&end, &start, &diff);
printf("Munlock'd entire VM in %u secs %u usecs.\n",
diff.tv_sec, diff.tv_usec);
exit(0);
}
int make_vma_map(unsigned int start, unsigned int end)
{
int num_pg = num_pg_in_range(start, end);
if (end < start){
fprintf(stderr,
"Bad range: start: 0x%x end: 0x%x.\n",
start, end);
return 0;
}
for (; num_pg; num_pg --, start += PG_LEN){
if (__map_a_page_at(start, num_pg % 2) && __mlock_page(start))
__force_pgtbl_alloc(start);
else
return 0;
}
return 1;
}
void display_banner()
{
printf("-----------------------------------------\n");
printf("Virtual memory allocator. Ctrl+C to exit.\n");
printf("-----------------------------------------\n");
}
int main()
{
unsigned int vma_start, vma_end, input = 0;
int start_end = 0; // 0: start; 1: end;
display_banner();
// Bind SIGINT handler.
signal(SIGINT, sigint_handler);
while (1){
if (!start_end)
printf("start:\t");
else
printf("end:\t");
scanf("%i", &input);
if (start_end){
vma_end = align_pg_32(input);
make_vma_map(vma_start, vma_end);
}
else{
vma_start = align_pg_32(input);
}
start_end = !start_end;
}
return 0;
}
As you can see, the program accepts ranges of virtual addresses, each range being defined by a start and an end. Each range is then subdivided into page-sized VMAs by giving different permissions to adjacent pages. Interrupting the program (with SIGINT) triggers a call to munlockall(), and the time that call takes to complete is printed.
Now, when I run it on a Freescale e500mc with Linux 2.6.34 over the range 0x30000000-0x35000000, I get a total munlockall() time of almost 45 seconds. However, if I do the same thing with smaller start-end ranges in random order (that is, not necessarily increasing addresses) such that the total number of pages (and locked VMAs) is roughly the same, I observe a total munlockall() time of no more than 4 seconds.
I tried the same thing on x86_64 with Linux 2.6.34, with my program compiled with -m32, and the variation, though not as pronounced as on ppc, is still there: 8 seconds for the first case and under a second for the second.
I also tried the program on Linux 2.6.10 at one end and on 3.19 at the other, and these large differences don't exist there. What's more, munlockall() always completes in under a second.
So, it seems that the problem, whatever it is, exists only around the 2.6.34 version of the Linux kernel.

You said the VM was developed in-house. Does this mean you have access to the source? I would start by checking to see if it has anything to stop it from immediately terminating to avoid data loss.
Otherwise, could you provide more information? You may also want to check out https://unix.stackexchange.com/, as they would be better suited to help with any issues the Linux kernel may be having.

C++ 11 std::thread strange behavior

I am experimenting a bit with std::thread and C++11, and I am encountering strange behaviour.
Please have a look at the following code:
#include <cstdlib>
#include <thread>
#include <vector>
#include <iostream>
void thread_sum_up(const size_t n, size_t& count) {
size_t i;
for (i = 0; i < n; ++i);
count = i;
}
class A {
public:
A(const size_t x) : x_(x) {}
size_t sum_up(const size_t num_threads) const {
size_t i;
std::vector<std::thread> threads;
std::vector<size_t> data_vector;
for (i = 0; i < num_threads; ++i) {
data_vector.push_back(0);
threads.push_back(std::thread(thread_sum_up, x_, std::ref(data_vector[i])));
}
std::cout << "Threads started ...\n";
for (i = 0; i < num_threads; ++i)
threads[i].join();
size_t sum = 0;
for (i = 0; i < num_threads; ++i)
sum += data_vector[i];
return sum;
}
private:
const size_t x_;
};
int main(int argc, char* argv[]) {
const size_t x = atoi(argv[1]);
const size_t num_threads = atoi(argv[2]);
A a(x);
std::cout << a.sum_up(num_threads) << std::endl;
return 0;
}
The main idea here is that I want to specify a number of threads which do independent computations (in this case, simple increments).
After all threads are finished, the results should be merged in order to obtain an overall result.
Just to clarify: this is only for testing purposes, to help me understand how C++11 threads work.
However, when compiling this code using the command
g++ -o threads threads.cpp -pthread -O0 -std=c++0x
on an Ubuntu box, I get very strange behaviour when I execute the resulting binary.
For example:
$ ./threads 1000 4
Threads started ...
Segmentation fault (core dumped)
(should yield the output: 4000)
$ ./threads 100000 4
Threads started ...
200000
(should yield the output: 400000)
Does anybody have an idea what is going on here?
Thank you in advance!

Your code has many problems (see even thread_sum_up for about 2-3 bugs), but the main bug I found by glancing at your code is here:
data_vector.push_back(0);
threads.push_back(std::thread(thread_sum_up, x_, std::ref(data_vector[i])));
See, when you push_back into a vector (I'm talking about data_vector), it can move all previous data around in memory. But you take the address of (a reference to) a cell for your thread and then push_back again, which makes the previous reference invalid.
This will cause you to crash.
For an easy fix, add data_vector.reserve(num_threads); just after creating it.
Edit at your request - some bugs in thread_sum_up
void thread_sum_up(const size_t n, size_t& count) {
size_t i;
for (i = 0; i < n; ++i); // see that last ';' there? means this loop is empty. it shouldn't be there
count = i; // You're just setting count to be i. why do that in a loop? Did you mean +=?
}

The cause of your crash might be that std::ref(data_vector[i]) is invalidated by the next push_back on data_vector. Since you know the number of threads, call data_vector.reserve(num_threads) before you start spawning the threads to keep the references from being invalidated.

As you resize the vector with the calls to push_back, it is likely to have to reallocate the storage space, causing the references to the contained values to be invalidated. This causes the thread to write to non-allocated memory, which is undefined behavior.
Your options are to pre-allocate the size you need (vector::reserve is one option), or choose a different container.
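To make the pre-allocation option concrete, here is a minimal sketch, not the asker's class: the free function `sum_up` is hypothetical, the result vector is sized up front (instead of calling reserve and then push_back) so that it never reallocates while the threads hold references into it, and the worker actually increments its counter in the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Worker from the question with the stray ';' after the for removed,
// so the loop body really increments the counter n times.
void thread_sum_up(std::size_t n, std::size_t &count) {
    for (std::size_t i = 0; i < n; ++i) ++count;
}

// Hypothetical free-function version of A::sum_up.
std::size_t sum_up(std::size_t n, std::size_t num_threads) {
    assert(num_threads > 0);
    // All result slots exist before any thread starts, so the references
    // passed to the threads stay valid for the vector's whole lifetime.
    std::vector<std::size_t> data(num_threads, 0);
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (std::size_t i = 0; i < num_threads; ++i)
        threads.emplace_back(thread_sum_up, n, std::ref(data[i]));
    for (std::thread &t : threads) t.join();
    std::size_t sum = 0;
    for (std::size_t v : data) sum += v;
    return sum;
}
```

Built with the same flags as in the question (including -pthread), sum_up(1000, 4) corresponds to the ./threads 1000 4 run.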
