How to handle special characters in a CSV file with an encoding other than UTF-8 - node.js

I am trying to read a CSV file in Node.js using createReadStream but am stuck on an issue with special characters. When the CSV file's charset is UTF-8, special characters come through intact, but if the charset is anything other than UTF-8, the special characters get converted to ?.
Here is what I have tried:
let parseOptions = {
    headers: false,
    ignoreEmpty: false,
    trim: true,
    discardUnmappedColumns: false,
    quoteHeaders: true
};
let stream = fs.createReadStream(obj.data.file_data.path, { encoding: 'utf8' });
let parser = csv.fromStream(stream, parseOptions)
    .on("data", function(row) {
        console.log('Row data ----->', row);
        // Prints row
    })
    .on("end", function() {
        // process data here
    });
I have tried the encoding options binary, utf16, and others as well, but nothing seems to handle all characters. Is there any way to ignore the charset and keep the special characters intact, or to convert them to UTF-8?
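One possible approach (a sketch, not from the original thread, drawing on the answers further down this page): read the file as raw bytes and decode them explicitly with iconv-lite before the csv module sees them, so the parser always receives proper JavaScript strings. The 'latin1' source charset here is an assumption you would replace with your file's actual encoding:

var fs = require('fs');
var iconv = require('iconv-lite');

// No encoding option: keep the stream as raw bytes.
var raw = fs.createReadStream(obj.data.file_data.path);
// Decode the bytes using the file's real charset (assumed to be latin1 here).
var decoded = raw.pipe(iconv.decodeStream('latin1'));

csv.fromStream(decoded, parseOptions)
    .on("data", function(row) {
        console.log('Row data ----->', row);
    })
    .on("end", function() {
        // process data here
    });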

Related

Buffer.from(base64EncodedString, 'base64').toString('binary') vs 'utf8'

In Node.js: Why does this test fail on the second call of main?
test('base64Encode and back', () => {
    function main(input: string) {
        const base64string = base64Encode(input);
        const text = base64Decode(base64string);
        expect(input).toEqual(text);
    }
    main('demo');
    main('😉😉😉');
});
Here are my functions:
export function base64Encode(text: string): string {
    const buffer = Buffer.from(text, 'binary');
    return buffer.toString('base64');
}
export function base64Decode(base64EncodedString: string): string {
    const buffer = Buffer.from(base64EncodedString, 'base64');
    return buffer.toString('binary');
}
From these pages, I figured I had written these functions correctly so that one would reverse the other:
https://github.com/node-browser-compat/btoa/blob/master/index.js
https://github.com/node-browser-compat/atob/blob/master/node-atob.js
https://stackoverflow.com/a/47890385/470749
If I change the 'binary' options to be 'utf8' instead, the test passes.
But my database currently has data where this function only seems to work if I use 'binary'.
binary is an alias for latin1
'latin1': Latin-1 stands for ISO-8859-1. This character encoding only supports the Unicode characters from U+0000 to U+00FF. Each character is encoded using a single byte. Characters that do not fit into that range are truncated and will be mapped to characters in that range.
This character set cannot represent multibyte UTF-8 characters.
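You can see the data loss directly in the Node REPL: '😉' is stored in JavaScript as the surrogate pair U+D83D U+DE09, and 'binary'/'latin1' keeps only the low byte of each UTF-16 code unit:
> Buffer.from('😉', 'binary')
<Buffer 3d 09>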
To get UTF-8 multibyte characters back, go directly to Base64 and back again, letting Buffer default to utf8:
function base64Encode(str) {
    return Buffer.from(str).toString('base64');
}
function base64Decode(str) {
    return Buffer.from(str, 'base64').toString();
}
> base64Encode('😉')
'8J+YiQ=='
> base64Decode('8J+YiQ==')
'😉'

Why are the line breaks different between CSV (Macintosh) and CSV when parsing with the node module csv-parser?

I'm using the node module csv-parser for streaming CSV parsing. It works fine when uploading a CSV (comma-separated values) file, but when we upload a CSV (Macintosh) file, a problem occurs with the line breaks. A CSV generated on Windows contains line breaks like \r\n, but CSV (Mac) contains only \r, as that is the Mac format. What configuration is needed to make it work for both file types?
Here's the code snippet where the streams are hooked up.
// Create a read stream for the passed file path and abort if the file is not found
let readStream: fs.ReadStream;
try {
    readStream = fs.createReadStream(filePath);
} catch (error) {
    console.log('Skipped order batch file processing. File not found.');
    resolve();
    return;
}
// Create the CSV transform
let csvStream: Transform;
if (file.mapping) {
    csvStream = csv({ headers: false });
} else {
    csvStream = csv();
}
readStream
    .pipe(csvStream);
csv-parser has a newline option whose default value is "\n"; using "\r" made it work:
csvStream = csv({ headers: false, newline:"\r" });
How can I set the newline value conditionally, for example "\r" for CSV (Mac), "\r\n" for CSV (Windows), and "\n" for Linux?
Note: I need to detect this while reading the file.
Your help would be really appreciated!
Thanks!
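One way to do this (a sketch, not from the original thread): read a small sample of the file first, guess the line ending it actually uses, and then construct the parser with that value. detectNewline is a hypothetical helper, and the newline option of csv-parser is the one mentioned in the question above:

const fs = require('fs');
const csv = require('csv-parser');

// Hypothetical helper: read the first few KB of the file and
// guess its line-ending style from the bytes found there.
function detectNewline(filePath, sampleSize) {
    sampleSize = sampleSize || 4096;
    const fd = fs.openSync(filePath, 'r');
    const sample = Buffer.alloc(sampleSize);
    const bytesRead = fs.readSync(fd, sample, 0, sampleSize, 0);
    fs.closeSync(fd);
    const text = sample.toString('utf8', 0, bytesRead);
    if (text.includes('\r\n')) return '\r\n'; // Windows
    if (text.includes('\r')) return '\r';     // classic Mac
    return '\n';                              // Linux
}

const csvStream = csv({ headers: false, newline: detectNewline(filePath) });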

Converting a string from utf8 to latin1 in NodeJS

I'm using a Latin1-encoded DB and can't change it to UTF-8, which means I run into issues with certain application data. I'm using Tesseract to OCR a document (Tesseract encodes in UTF-8) and tried to use iconv-lite; however, it creates a buffer, and buffer-to-string conversion does not seem to allow "latin1" encoding.
I've read a bunch of questions/answers; however, all I get is setting client encoding and stuff like that.
Any ideas?
Since Node.js v7.1.0, you can use the transcode function from the buffer module:
https://nodejs.org/api/buffer.html#buffer_buffer_transcode_source_fromenc_toenc
For example:
const buffer = require('buffer');
const latin1Buffer = buffer.transcode(Buffer.from(utf8String), "utf8", "latin1");
const latin1String = latin1Buffer.toString("latin1");
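Note that this direction is lossy: per the Node docs, transcode replaces characters that cannot be represented in the target encoding with ?. In the REPL:
> buffer.transcode(Buffer.from('Château ☺'), 'utf8', 'latin1').toString('latin1')
'Château ?'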
You can create a buffer from the UTF-8 string you have, and then decode that buffer to Latin-1 using iconv-lite, like this:
var buff = Buffer.from(tesseract_string, 'utf8'); // Buffer.from replaces the deprecated new Buffer()
var DB_str = iconv.decode(buff, 'ISO-8859-1');
I've found a way to convert a text file with any encoding to UTF-8:
var fs = require('fs'),
    charsetDetector = require('node-icu-charset-detector'),
    iconvlite = require('iconv-lite');

/* Having different encodings
 * on text files in a git repo
 * but need to serve always on
 * standard 'utf-8'
 */
function getFileContentsInUTF8(file_path) {
    var content = fs.readFileSync(file_path);
    var original_charset = charsetDetector.detectCharset(content);
    var jsString = iconvlite.decode(content, original_charset.toString());
    return jsString;
}
It's also in a gist here: https://gist.github.com/jacargentina/be454c13fa19003cf9f48175e82304d5
Maybe you can try this approach, where content would be your database buffer data (in latin1 encoding).
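A hypothetical usage of the function above (the path is made up):

var text = getFileContentsInUTF8('/var/data/legacy-export.csv');
// text is now a normal JavaScript string, safe to store or serve as UTF-8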

Why can't I parse this CSV file inside of node.js?

Here's my code:
var options = {
    rowDelimiter: 'windows',
    encoding: 'ascii'
};
var data = fs.readFileSync(localFolder + '/' + file, 'ascii');
console.log(data);
csv().from.string(data, options).to.array(function(data, count) {
    console.log(data);
});
The first console.log returns the following data:
"Filename","DID#","Document Type","Date Sent","School","First Name","Middle Name","Last Name","DOB","SSN","Application #","Common App ID","RH CEEB","Class Of","Years Attended"
"TR58A3D.pdf","TR58A3D","Transcript","07/19/2012","zz Screaming Eagle High School","Kim","","Smith","05/05/1995","","","","555555","2013",""
"TR58AQH.pdf","TR58AQH","Transcript","07/19/2012","zz Screaming Eagle High School","Jon","","Sink","05/09/1996","","","","555555","2015",""
Running file on the CSV confirms the line terminators:
[scott#localhost]$ file transcripts/index_07_19_2012_1043460.csv
transcripts/index_07_19_2012_1043460.csv: ASCII text, with CRLF line terminators
The second console.log doesn't print anything to my console. Anyone have any ideas why it's not parsing the CSV?
The problem was the value of the rowDelimiter option. It needs to be the actual line-break characters used, i.e. \r\n or \r.
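Applied to the options from the question (the file output above reported CRLF line terminators):

var options = {
    rowDelimiter: '\r\n', // the literal line-break sequence, instead of 'windows'
    encoding: 'ascii'
};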

Reading from a file containing accented characters in Node.js

So I am parsing a large CSV file and pushing the results into Mongo.
The file is MaxMind's city database. It has all kinds of fun UTF-8 characters. I am still getting ? symbols in some city names. Here is how I am reading the file (using the csv node module):
csv().from.stream(fs.createReadStream(path.join(__dirname, 'datafiles', 'cities.csv'), {
    flags: 'r',
    encoding: 'utf8'
})).on('record', function(row, index) {
    // ... uninteresting code to add it to mongodb
});
What could I be doing wrong here?
I am getting things like this in mongo: Ch�teauguay, Canada
EDIT:
I tried using a different lib to read the file:
lazy(fs.createReadStream(path.join(__dirname, 'datafiles', 'cities.csv'), {
    flags: 'r',
    encoding: 'utf8',
    autoClose: true
}))
    .lines
    .map(String)
    .skip(1) // skip the header line
    .map(function (line) {
        console.log(line);
    });
It produces the same bad results:
154252,"PA","03","Capellan�a","",8.3000,-80.5500,,
154220,"AR","01","Villa Espa�a","",-34.7667,-58.2000,,
It turns out MaxMind encodes their data in latin1.
This works:
var iconv = require('iconv-lite');
lazy(fs.createReadStream(path.join(__dirname, 'datafiles', 'cities.csv')))
    .lines
    .map(function(byteArray) {
        return iconv.decode(byteArray, 'latin1');
    })
    .skip(1) // skip the header line
    .map(function (line) {
        // WORKS: accented characters come through intact
    });
