Write data to an SD card through a buffer without a race condition

Write data to an SD card through a buffer without a race condition - io

I am writing firmware for a data logging device. It reads data from sensors at 20 Hz and writes data to an SD card. However, the time to write data to SD card is not consistent (about 200-300 ms). Thus one solution is writing data to a buffer at a consistent rate (using a timer interrupt), and have a second thread that writes data to the SD card when the buffer is full.
Here is my naive implementation:
#define N 64
char buffer[N];
int count;
ISR() {
if (count < N) {
char a = analogRead(A0);
buffer[count] = a;
count = count + 1;
}
}
void loop() {
if (count == N) {
myFile.open("data.csv", FILE_WRITE);
int i = 0;
for (i = 0; i < N; i++) {
myFile.print(buffer[i]);
}
myFile.close();
count = 0;
}
}
The code has the following problems:
Writing data to the SD card is blocking reading when the buffer is full
It might have a race conditions.
What is the best way to solve this problem? Using a circular buffer, or double buffering? How do I ensure that a race condition does not happen?

You have rather answered your own question; you should use either double buffering or a circular buffer. Double-buffering is probably simpler to implement and appropriate for devices such as an SD card for which block-writes are generally more efficient.
Buffer length selection may need some consideration; generally you would make the buffer the same as the SD sector buffer size (typically 512 bytes), but that may not be practical, and with a sample rate as low as 20 sps, optimising SD write performance is perhaps not an issue.
Another consideration is that you need to match your sample rate to the file-system latency by selecting an appropriate buffer size. In this case the 64 sample buffer buffer will fill in a little more than three seconds, but the block write takes only up-to 300 ms - so you could use a much smaller buffer if required - 8 samples would be sufficient - although be careful, you may have observed latency of 300 ms, but it may be larger when specific boundaries are crossed in the physical flash memory - I have seen significant latency on some cards at 1 Mbyte boundaries - moreover card performance varies significantly between sizes and manufacturers.
An adaptation of your implementation with double-buffering is below. I have reduced the buffer length to 32 samples, but with double-buffering the total is unchanged at 64, but the write lag is reduced to 1.6 seconds.
// Double buffer and its management data/constants
static volatile char buffer[2][32];
static const int BUFFLEN = sizeof(buffer[0]);
static const unsigned char EMPTY = 0xff;
static volatile unsigned char inbuffer = 0;
static volatile unsigned char outbuffer = EMPTY;
ISR()
{
static int count = 0;
// Write to current buffer
char a = analogRead(A0);
buffer[inbuffer][count] = a;
count++ ;
// If buffer full...
if( count >= BUFFLEN )
{
// Signal to loop() that data available (not EMPTY)
outbuffer = inbuffer;
// Toggle input buffer
inbuffer = inbuffer == 0 ? 1 : 0;
count = 0;
}
}
void loop()
{
// If buffer available...
if( outbuffer != EMPTY )
{
// Write buffer
myFile.open("data.csv", FILE_WRITE);
for( int i = 0; i < BUFFLEN; i++)
{
myFile.print(buffer[outbuffer][i]);
}
myFile.close();
// Set the buffer to empty
outbuffer = EMPTY;
}
}
Note the use of volatile and unsigned char for the shared data. It is important that data shared between concurrent execution contexts is accessed explicitly and atomically; access to an int on 8-bit AVR based Arduino requires multiple machine instructions and the interrupt may occur part way through a read/write in loop() and cause an incorrect value to be read.

Related

How to pass efficiently the hugepages-backed buffer to the BM DMA device in Linux?

I need to provide a huge circular buffer (a few GB) for the bus-mastering DMA PCIe device implemented in FPGA.
The buffers should not be reserved at the boot time. Therefore, the buffer may be not contiguous.
The device supports scatter-gather (SG) operation, but for performance reasons, the addresses and lengths of consecutive contiguous segments of the buffer are stored inside the FPGA.
Therefore, usage of standard 4KB pages is not acceptable (there would be up to 262144 segments for each 1GB of the buffer).
The right solution should allocate the buffer consisting of 2MB hugepages in the user space (reducing the maximum number of segments by factor of 512).
The virtual address of the buffer should be transferred to the kernel driver via ioctl. Then the addresses and the length of the segments should be calculated and written to the FPGA.
In theory, I could use get_user_pages to create the list of the pages, and then call sg_alloc_table_from_pages to obtain the SG list suitable to program the DMA engine in FPGA.
Unfortunately, in this approach I must prepare the intermediate list of page structures with length of 262144 pages per 1GB of the buffer. This list is stored in RAM, not in the FPGA, so it is less problematic, but anyway it would be good to avoid it.
In fact I don't need to keep the pages maped for the kernel, as the hugepages are protected against swapping out, and they are mapped for the user space application that will process the received data.
So what I'm looking for is a function sg_alloc_table_from_user_hugepages, that could take such a user-space address of the hugepages-based memory buffer, and transfer it directly into the right scatterlist, without performing unnecessary and memory-consuming mapping for the kernel.
Of course such a function should verify that the buffer indeed consists of hugepages.
I have found and read these posts: (A), (B), but couldn't find a good answer.
Is there any official method to do it in the current Linux kernel?

At the moment I have a very inefficient solution based on get_user_pages_fast:
int sgt_prepare(const char __user *buf, size_t count,
struct sg_table * sgt, struct page *** a_pages,
int * a_n_pages)
{
int res = 0;
int n_pages;
struct page ** pages = NULL;
const unsigned long offset = ((unsigned long)buf) & (PAGE_SIZE-1);
//Calculate number of pages
n_pages = (offset + count + PAGE_SIZE - 1) >> PAGE_SHIFT;
printk(KERN_ALERT "n_pages: %d",n_pages);
//Allocate the table for pages
pages = vzalloc(sizeof(* pages) * n_pages);
printk(KERN_ALERT "pages: %p",pages);
if(pages == NULL) {
res = -ENOMEM;
goto sglm_err1;
}
//Now pin the pages
res = get_user_pages_fast(((unsigned long)buf & PAGE_MASK), n_pages, 0, pages);
printk(KERN_ALERT "gupf: %d",res);
if(res < n_pages) {
int i;
for(i=0; i<res; i++)
put_page(pages[i]);
res = -ENOMEM;
goto sglm_err1;
}
//Now create the sg-list
res = sg_alloc_table_from_pages(sgt, pages, n_pages, offset, count, GFP_KERNEL);
printk(KERN_ALERT "satf: %d",res);
if(res < 0)
goto sglm_err2;
*a_pages = pages;
*a_n_pages = n_pages;
return res;
sglm_err2:
//Here we jump if we know that the pages are pinned
{
int i;
for(i=0; i<n_pages; i++)
put_page(pages[i]);
}
sglm_err1:
if(sgt) sg_free_table(sgt);
if(pages) kfree(pages);
* a_pages = NULL;
* a_n_pages = 0;
return res;
}
void sgt_destroy(struct sg_table * sgt, struct page ** pages, int n_pages)
{
int i;
//Free the sg list
if(sgt->sgl)
sg_free_table(sgt);
//Unpin pages
for(i=0; i < n_pages; i++) {
set_page_dirty(pages[i]);
put_page(pages[i]);
}
}
The sgt_prepare function builds the sg_table sgt structure that i can use to create the DMA mapping. I have verified that it contains the number of entries equal to the number of hugepages used.
Unfortunately, it requires that the list of the pages is created (allocated and returned via the a_pages pointer argument), and kept as long as the buffer is used.
Therefore, I really dislike that solution. Now I have 256 2MB hugepages used as a DMA buffer. It means that I have to create and keeep unnecessary 128*1024 page structures. I also waste 512 MB of kernel address space for unnecessary kernel mapping.
The interesting question is if the a_pages may be kept only temporarily (until the sg-list is created)? In theory it should be possible, as the pages are still locked...

sending audio via bluetooth a2dp source esp32

I am trying to send measured i2s analogue signal (e.g. from mic) to the sink device via Bluetooth instead of the default noise.
Currently I am trying to change the bt_app_a2d_data_cb()
static int32_t bt_app_a2d_data_cb(uint8_t *data, int32_t i2s_read_len)
{
if (i2s_read_len < 0 || data == NULL) {
return 0;
}
char* i2s_read_buff = (char*) calloc(i2s_read_len, sizeof(char));
bytes_read = 0;
i2s_adc_enable(I2S_NUM_0);
while(bytes_read == 0)
{
i2s_read(I2S_NUM_0, i2s_read_buff, i2s_read_len,&bytes_read, portMAX_DELAY);
}
i2s_adc_disable(I2S_NUM_0);
// taking care of the watchdog//
TIMERG0.wdt_wprotect=TIMG_WDT_WKEY_VALUE;
TIMERG0.wdt_feed=1;
TIMERG0.wdt_wprotect=0;
uint32_t j = 0;
uint16_t dac_value = 0;
// change 16bit input signal to 8bit
for (int i = 0; i < i2s_read_len; i += 2) {
dac_value = ((((uint16_t) (i2s_read_buff[i + 1] & 0xf) << 8) | ((i2s_read_buff[i + 0]))));
data[j] = (uint8_t) dac_value * 256 / 4096;
j++;
}
// testing for loop
//uint8_t da = 0;
//for (int i = 0; i < i2s_read_len; i++) {
// data[i] = (uint8_t) (i2s_read_buff[i] >> 8);// & 0xff;
// da++;
// if(da>254) da=0;
//}
free(i2s_read_buff);
i2s_read_buff = NULL;
return i2s_read_len;
}
I can hear the sawtooth sound from the sink device.
Any ideas what to do?

your data can be an array of some float digits representing analog signals or analog signal variations, for example, a 32khz sound signal contains 320000 float numbers to define captures sound for every second. if your data have been expected to transmit in offline mode you can prepare your outcoming data in the form of a buffer plus a terminator sign then send buffer by Bluetooth module of sender device which is connected to the proper microcontroller. for the receiving device, if you got terminator character like "\r" you can process incoming buffer e.g. for my case, I had to send a string array of numbers but I often received at most one or two unknown characters and to avoid it I reject it while fulfill receiving container.
how to trim unknown first characters of string in code vision
if you want it in online mode i.e. your data must be transmitted and played concurrently. you must consider delays and reasonable time to process for all microcontrollers and devices like Bluetooth, EEprom iCs and...

I'm also working on a project "a2dp source esp32".
I'm playing a wav-file from spiffs.
If the wav-file is 44100, 16-bit, stereo then you can directly write a stream of bytes from the file to the array data[ ].
When I tried to write less data than in the len-variable and return less (for example 88), I got an error, now I'm trying to figure out how to reduce this buffer because of big latency (len=512).
Also, the data in the array data[ ] is stored as stereo.
Example: read data from file to data[ ]-array:
size_t read;
read = fread((void*) data, 1, len, fwave);//fwave is a file
if(read<len){//If get EOF, go to begin of the file
fseek(fwave , 0x2C , SEEK_SET);//skip wav-header 44bytesт
read = fread((void*) (&(data[read])), 1, len-read, fwave);//read up
}
If file mono, I convert it to stereo like this (I read half and then double data):
int32_t lenHalf=len/2;
read = fread((void*) data, 1, lenHalf, fwave);
if(read<lenHalf){
fseek(fwave , 0x2C , SEEK_SET);//skip wav-header 44bytesт
read = fread((void*) (&(data[read])), 1, lenHalf-read, fwave);//read up
}
//copy to the second channel
uint16_t *data16=(uint16_t*)data;
for (int i = lenHalf/2-1; i >= 0; i--) {
data16[(i << 1)] = data16[i];
data16[(i << 1) + 1] = data16[i];
}
I think you have got sawtooth sound because:
your data is mono?
in your "return i2s_read_len;" i2s_read_len less than len
you // change 16bit input signal to 8bit, in the array data[ ] data as 16-bit: 2ByteLeft-2ByteRight-2ByteLeft-2ByteRight-...
I'm not sure, it's a guess.

Linux: How to mmap a sequence of physically contiguous areas into user space?

In my driver I have certain number of physically contiguous DMA buffers (e.g. 4MB long each) to receive data from a device. They are handled by hardware using the SG list. As the received data will be subjected to intensive processing, I don't want to switch off cache and I will use dma_sync_single_for_cpu after each buffer is filled by DMA.
To simplify data processing, I want those buffers to appear as a single huge, contiguous, circular buffer in the user space.
In case of a single buffer I simply use remap_pfn_range or dma_mmap_coherent. However, I can't use those functions multiple times to map consecutive buffers.
Of course, I can implement the fault operation in the vm_operations so that it finds the pfn of the corresponding page in the right buffer, and inserts it into the vma with vm_insert_pfn.
The acquisition will be really fast, so I can't handle mapping when the real data arrive. But this can be solved easily. To have all mapping ready before the data acquisition starts, I can simply read the whole mmapped buffer in my application before starting the acquisition, so that all pages are already inserted when the first data arrive.
Tha fault based trick should work, but maybe there is something more elegant? Just a single function, that may be called multiple times to build the whole mapping incrementally?
Additional difficulty is that the solution should be applicable (with minimal adjustments) to kernels starting from 2.6.32 to the newest one.
PS. I have seen that annoying post. Is there a danger that if the application attempts to write something to the mmapped buffer (just doing the in place processing of data), my carefully built mapping will be destroyed by COW?

Below is my solution that works for buffers allocated with dmam_alloc_noncoherent.
Allocation of the buffers:
[...]
for(i=0;i<DMA_NOFBUFS;i++) {
ext->buf_addr[i] = dmam_alloc_noncoherent(&my_dev->dev, DMA_BUFLEN, &my_dev->buf_dma_t[i],GFP_USER);
if(my_dev->buf_addr[i] == NULL) {
res = -ENOMEM;
goto err1;
}
//Make buffer ready for filling by the device
dma_sync_single_range_for_device(&my_dev->dev, my_dev->buf_dma_t[i],0,DMA_BUFLEN,DMA_FROM_DEVICE);
}
[...]
Mapping of the buffers
void swz_mmap_open(struct vm_area_struct *vma)
{
}
void swz_mmap_close(struct vm_area_struct *vma)
{
}
static int swz_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
long offset;
char * buffer = NULL;
int buf_num = 0;
//Calculate the offset (according to info in https://lxr.missinglinkelectronics.com/linux+v2.6.32/drivers/gpu/drm/i915/i915_gem.c#L1195 it is better not ot use the vmf->pgoff )
offset = (unsigned long)(vmf->virtual_address - vma->vm_start);
buf_num = offset/DMA_BUFLEN;
if(buf_num > DMA_NOFBUFS) {
printk(KERN_ERR "Access outside the buffer\n");
return -EFAULT;
}
offset = offset - buf_num * DMA_BUFLEN;
buffer = my_dev->buf_addr[buf_num];
vm_insert_pfn(vma,(unsigned long)(vmf->virtual_address),virt_to_phys(&buffer[offset]) >> PAGE_SHIFT);
return VM_FAULT_NOPAGE;
}
struct vm_operations_struct swz_mmap_vm_ops =
{
.open = swz_mmap_open,
.close = swz_mmap_close,
.fault = swz_mmap_fault,
};
static int char_sgdma_wz_mmap(struct file *file, struct vm_area_struct *vma)
{
vma->vm_ops = &swz_mmap_vm_ops;
vma->vm_flags |= VM_IO | VM_RESERVED | VM_CAN_NONLINEAR | VM_PFNMAP;
swz_mmap_open(vma);
return 0;
}

OpenCL float sum reduction

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):
__local float sum = 0;
int i;
for(i = 0; i < length; i++)
sum += //some operation depending on i here;
Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.
In pseudo code, I would like to able to write something like this:
int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
res = sum;
Is there a way?
I have a race condition on sum.

To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.
Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.
If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).
EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.
__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output)
{
int lid = get_local_id(0);
int group_size = get_local_size(0);
partial_sums[lid] = data[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE);
for(int i = group_size/2; i>0; i >>= 1) {
if(lid < i) {
partial_sums[lid] += partial_sums[lid + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if(lid == 0) {
output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
}
}

I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.
The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.
I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.
This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.
The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else call from another kernel and do this last step on the GPU.
This part is a little over my head, but I believe, this code also avoids bank switching issues by reading from local memory essentially sequentially. ** Would love confirmation on that from anyone that knows.
Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...
__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
const size_t globalId = get_global_id(0);
const size_t localId = get_local_id(0);
target[localId] = A[globalId+AOffset];
barrier(CLK_LOCAL_MEM_FENCE);
size_t blockSize = get_local_size(0);
size_t halfBlockSize = blockSize / 2;
while (halfBlockSize>0) {
if (localId<halfBlockSize) {
target[localId] += target[localId + halfBlockSize];
if ((halfBlockSize*2)<blockSize) { // uneven block division
if (localId==0) { // when localID==0
target[localId] += target[localId + (blockSize-1)];
}
}
}
barrier(CLK_LOCAL_MEM_FENCE);
blockSize = halfBlockSize;
halfBlockSize = blockSize / 2;
}
if (localId==0) {
output[get_group_id(0)] = target[0];
}
}
https://pastebin.com/xN4yQ28N

You can use new work_group_reduce_add() function for sum reduction inside single work group if you have support for OpenCL C 2.0 features

A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.
For example, please use the following ridiculously simple CL code:
__kernel void foldKernel(__global float *arVal, int offset) {
int gid = get_global_id(0);
arVal[gid] = arVal[gid]+arVal[gid+offset];
}
With the following Java/JOCL host code (or port it to C++ etc):
int t = totalDataSize;
while (t > 1) {
int m = t / 2;
int n = (t + 1) / 2;
clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
cl_event evFold = new cl_event();
clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
clWaitForEvents(1, new cl_event[]{evFold});
t = n;
}
The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddle with "m" and "n" is to handle non-power-of-two arrays.
Easy for OpenCL to parallelize well for any GPU platform (i.e. fast).
Low memory, because it works in place
Works efficiently with non-power-of-two data sizes
Flexible, e.g. you can change kernel to do "min" instead of "+"

Why would a VC++ program that is storing 5MB of data consume 64MB of system memory?

I have been working on trying to figure out why my program is consuming so much system RAM. I'm loading a file from disk into a vector of structs of several dynamically allocated arrays. A 16MB file ends up consuming 280MB of system RAM according to task manager. The types in the file are mostly chars with some shorts and a few longs. There are 331,000 records in the file containing on average about 5 fields. I converted the vector to a struct and that reduced the memory to about 255MB but that still seems very high. With the vector taking up so much memory the program is running out of memory so I need to find a way to get the memory usage more reasonable.
I wrote a simple program to just stuff a vector (or array) with 1,000,000 char pointers. I would expect it to allocate 4+1 bytes for each giving 5MB of memory required for storage, but in fact it is using 64MB (array version) or 67MB (vector version). When the program first starts up it only consumes 400K so why is there an additional 59MB for array or 62MB for vectors being allocated? This extra memory seems to be for each container, so if I create a size_check2 and copy everything and run it the program uses up 135MB for 10MB worth of pointers and data.
Thanks in advance,
size_check.h
#pragma once
#include <vector>
class size_check
{
public:
size_check(void);
~size_check(void);
typedef unsigned long size_type;
void stuff_me( unsigned int howMany );
private:
size_type** package;
// std::vector<size_type*> package;
size_type* me;
};
size_check.cpp
#include "size_check.h"
size_check::size_check(void)
{
}
size_check::~size_check(void)
{
}
void size_check::stuff_me( unsigned int howMany )
{
package = new size_type*[howMany];
for( unsigned int i = 0; i < howMany; ++i )
{
size_type *me = new size_type;
*me = 33;
package[i] = me;
// package.push_back( me );
}
}
main.cpp
#include "size_check.h"
int main( int argc, char * argv[ ] )
{
const unsigned int buckets = 20;
const unsigned int size = 50000;
size_check* me[buckets];
for( unsigned int i = 0; i < buckets; ++i )
{
me[i] = new size_check();
me[i]->stuff_me( size );
}
printf( "done.\n" );
}

In my test using VS2010, a debug build had a working set size of 52,500KB. But a release build had a working set
size of 20,944KB.
Debug builds will usually use more memory than optimized builds due to the debug heap manager doing things like creating memory fences.
In release builds, I suspect that the heap manager reserves more memory than you are actually using as a performance optimization.

Memory Leak
package = new size_type[howMany]; // instantiate 50,000 size_type's
for( unsigned int i = 0; i < howMany; ++i )
{
size_type *me = new size_type; // Leak: results in an extra 50k size_type's being instantiated
*me = 33;
package[i] = *me; // Set a non-pointer to what is at the address of pointer "me"
// Would package[i] = 33; not suffice?
}
Furthermore, make sure you've compiled in release mode

There might be a couple reasons why you're seeing such a large memory footprint from your test program. Inside your
void size_check::stuff_me( unsigned int howMany )
{
This method is always getting called with howMany = 50000.
package = new size_type[howMany];
Assuming this is on a 32-bit setup the above statement will allocate 50,000 * 4 bytes.
for( unsigned int i = 0; i < howMany; ++i )
{
size_type *me = new size_type;
The above will allocate new storage on each iteration of the loop. Since this loops 50,000 and the allocation never gets deleted that effectively takes up another 50,000 * 4 bytes upon loop completion.
*me = 33;
package[i] = *me;
}
}
Lastly, since stuff_me() gets called 20 times from main() your program would have allocated at least ~8Mbytes upon completion. If this is on a 64-bit system than the footprint will likely double since sizeof(long) == 8bytes.
The increase in memory consumption could have something to do with the way VS implements dynamic allocation. For performance reasons, it's possible that due to the multiple calls to new your program is reserving extra memory so as to avoid hitting up the OS everytime it needs more.
FYI, when I ran your test program on mingw-gcc 4.5.2, the memory consumption was ~20Mbytes -- much lower than what you were seeing but still a substantial amount. If I changed the stuff_me method to this:
void size_check::stuff_me( unsigned int howMany )
{
package = new size_type[howMany];
size_type *me = new size_type;
for( unsigned int i = 0; i < howMany; ++i )
{
*me = 33;
package[i] = *me;
}
delete me;
}
memory consumption goes down quite a bit down to ~4-5mbytes.

I think I found the answer by delving into the new statement. In debug builds there are two items that are created when you do a new. One is _CrtMemBlockHeader which is 32 bytes in length. The other is noMansLand (a memory fence) with a size of 4 bytes which gives us an overhead of 36 bytes for each new. In my case each individual new for a char was costing me 37 bytes. In release builds the memory usage is reduced to about 1/2 but I can't tell exactly how much is allocated for each new as I can't get to the new/malloc routine.
So my work around is to allocate a large block of memory to hold the file in memory. Then parse the memory image filling in a vector of pointers to the beginning of each of the records. Then on demand, I build a record from the memory image using the pointer to the beginning of the selected record. Doing this reduced the memory footprint to <25MB.
Thanks for all your help and suggestions.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string