Node.js equivalent to Python utf8, SHA1, base64

I have this piece of code in Python 3:
import hashlib
import base64

payload = 'my URI'
payload_utf8 = payload.encode("utf-8")
print(payload_utf8)
payload_sha1 = hashlib.sha1(payload_utf8).digest()
print(payload_sha1)
payload_base64 = base64.b64encode(payload_sha1)
print(payload_base64)
I want the same result but in Node.js. I have tried this:
const crypto = require('crypto');
const utf8 = require('utf8'); // the 'utf8' npm package used below

const payload = "my URI";
console.log(payload);
const payload_UTF8 = utf8.encode(payload);
console.log(payload_UTF8);
const payload_Sha = crypto.createHash('sha1').update(payload_UTF8).digest();
console.log(payload_Sha);
const payload_Base64 = Buffer.from(payload_Sha).toString('base64');
But the results aren't the same.

The results are the same; the only difference is that the Python example returns a bytes object while the JS example returns a string. If you want the exact same output in string form on the Python side, use print(payload_base64.decode("utf-8")).
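In fact, the separate utf8.encode step isn't needed in Node: hash.update() treats string input as UTF-8 by default, so a minimal equivalent of the whole Python pipeline is:

const crypto = require('crypto');

const payload = 'my URI';
// update() interprets string input as UTF-8 unless told otherwise,
// and digest('base64') does the Base64 step in one go
const payload_base64 = crypto.createHash('sha1').update(payload).digest('base64');
console.log(payload_base64);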

Related

Cannot read cyrillic symbols from a .csv file

I need to read a .csv file, get the data in .json format and work with it.
I'm using the npm package convert-csv-to-json. As a result, Cyrillic symbols aren't displayed properly:
const csvToJson = require('convert-csv-to-json');
let json = csvToJson.fieldDelimiter(',').getJsonFromCsv("input.csv");
console.log(json);
Result: the Cyrillic characters come out garbled.
If I try to decode the file:
const csvToJson = require('convert-csv-to-json');
let json = csvToJson.asciiEncoding().fieldDelimiter(',').getJsonFromCsv("input.csv");
console.log(json);
the result is still garbled.
When I open the .csv file in AkelPad or Notepad++ it displays correctly, and the detected encoding is Win 1251 (ANSI - кириллица, i.e. Cyrillic).
Is there a way to read the file with the proper encoding, or to decode the resulting string?
Try using UTF-8 encoding instead of ASCII. That is, change
let json = csvToJson.asciiEncoding().fieldDelimiter(',').getJsonFromCsv("input.csv");
to
let json = csvToJson.utf8Encoding().fieldDelimiter(',').getJsonFromCsv("input.csv");
Here is code that solves the problem:
const fs = require('fs');
const iconv = require('iconv-lite');
const Papa = require('papaparse');

// read the csv file into a raw buffer
const buffer = fs.readFileSync("input.csv");
// decode the buffer to a string using the win1251 encoding
let dataString = iconv.decode(buffer, 'win1251');
// parse the string into an array of objects
let config = {
  header: true
};
const parsedOutput = Papa.parse(dataString, config);
console.log('parsedOutput: ', parsedOutput);
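With header: true, Papa.parse returns each row as a plain object keyed by the header names; the rows are available on parsedOutput.data, alongside errors and meta fields.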

Different result in NodeJS calculating MD5 hash using crypto

I am trying to get the MD5 hash of a number in NodeJS using crypto, but I am getting a different hash than the one returned by sites that calculate it.
According to http://onlinemd5.com/ the MD5 hash for 1092000 is AF118C8D2A0D27A1D49582FDF6339B7C.
When I try to calculate the hash for that number in NodeJS it gives me a different result (ac4d61a5b76c96b00235a124dfd1bfd1). My code:
const crypto = require('crypto');
const num = 1092000;
const hash = crypto.createHash('md5').update(toString(num)).digest('hex');
console.log(hash);
If you convert it to a string normally, it works:
const hash = crypto.createHash('md5').update(String(num)).digest('hex'); // or num.toString()
See the difference:
toString(num) = "[object Undefined]"
(1092000).toString() = "1092000"
If you console.log(this) in a Node environment you will see that this = {} with typeof 'object'. At the top level of a Node module, this points at module.exports, and a bare toString resolves to Object.prototype.toString, so toString(num) ignores its argument and describes its this value instead of converting the number to a string.
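Putting the two together, a minimal runnable check (the expected digest is the one quoted from onlinemd5.com in the question):

const crypto = require('crypto');

const num = 1092000;
console.log(toString(num)); // "[object Undefined]" - not the number at all
console.log(String(num));   // "1092000"
const hash = crypto.createHash('md5').update(String(num)).digest('hex');
console.log(hash); // af118c8d2a0d27a1d49582fdf6339b7c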

NodeJS SHA1 get raw output (PHP SHA1 raw output equivalent)

In PHP, the code below returns the raw output of the SHA1 of "string":
sha1("string", true);
What is the Node.js equivalent for getting the raw SHA1 output?
Edit: I did some testing, and this line:
crypto.createHash('sha1').update('string').digest('base64');
generates the same output as PHP's
base64_encode(sha1('string', true));
My issue occurs when I try to concatenate a string with the result of sha1, then take the sha1 again:
base64_encode(sha1(sha1("string", true) . "another string", true))
This gives a different result in Node.js:
var stringhash = crypto.createHash('sha1').update('string').digest();
crypto.createHash('sha1').update("another string" + stringhash).digest('base64')
Something like this:
const crypto = require('crypto');
let digest = crypto.createHash('sha1').update('string').digest();
process.stdout.write( digest );
EDIT: the equivalent of your second example:
let hash1 = crypto.createHash('sha1').update('string').digest();
let hash2 = crypto.createHash('sha1').update(hash1).update('another string');
let digest = hash2.digest('base64');
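The original attempt diverges from PHP because "another string" + stringhash coerces the Buffer to a UTF-8 string (mangling raw bytes) and also reverses the operand order. Chaining update() keeps the bytes intact; concatenating buffers explicitly works too:

const crypto = require('crypto');

let hash1 = crypto.createHash('sha1').update('string').digest();
// same result as the chained update() version above
let digest = crypto.createHash('sha1')
  .update(Buffer.concat([hash1, Buffer.from('another string')]))
  .digest('base64');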

PDF to Text extractor in nodejs without OS dependencies

Is there a way to extract text from PDFs in Node.js without any OS dependencies (like pdf2text, or xpdf on Windows)? I wasn't able to find any 'native' PDF packages for Node.js; they are always a wrapper/utility on top of an existing OS command.
Thanks
Have you checked PDF2Json? It is built on top of PDF.js. Though it does not provide the text output as a single string, I believe you can reconstruct the final text from the generated JSON output (a sketch follows the field list below):
'Texts': an array of text blocks with position, actual text and styling information:
'x' and 'y': relative coordinates for positioning
'clr': a color index into the color dictionary, the same 'clr' field as in the 'Fill' object. If the color can be found in the color dictionary, an 'oc' field holding the 'original color' value will be added to the block.
'A': text alignment, one of:
left
center
right
'R': an array of text runs; each text run object has two main fields:
'T': the actual text
'S': a style index into the style dictionary. More info about the 'Style Dictionary' can be found in the 'Dictionary Reference' section.
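For illustration, a rough sketch of that reconstruction, assuming a recent pdf2json where the parsed pages sit under pdfData.Pages (older releases nest them under pdfData.formImage.Pages) and each 'T' value is URI-encoded:

const PDFParser = require("pdf2json");

const pdfParser = new PDFParser();
pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
pdfParser.on("pdfParser_dataReady", pdfData => {
  // join the text runs of every 'Texts' block, page by page;
  // the 'T' values are URI-encoded, so decode them first
  const text = pdfData.Pages
    .map(page => page.Texts.map(t => t.R.map(r => decodeURIComponent(r.T)).join("")).join(" "))
    .join("\n");
  console.log(text);
});
pdfParser.loadPDF("sample.pdf");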
After some work, I finally got a reliable function for reading text from PDF using https://github.com/mozilla/pdfjs-dist
To get this to work, first install the package on the command line:
npm i pdfjs-dist
Then create a file with this code (I named the file "pdfExport.js" in this example):
const pdfjsLib = require("pdfjs-dist");

async function GetTextFromPDF(path) {
  let doc = await pdfjsLib.getDocument(path).promise;
  let page1 = await doc.getPage(1);
  let content = await page1.getTextContent();
  let strings = content.items.map(function(item) {
    return item.str;
  });
  return strings;
}

module.exports = { GetTextFromPDF };
Then it can simply be used in any other js file you have like so:
const pdfExport = require('./pdfExport');
pdfExport.GetTextFromPDF('./sample.pdf').then(data => console.log(data));
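Note that GetTextFromPDF as written only reads page 1 and returns an array of per-item strings; join the array, or loop from 1 to doc.numPages, if you need the whole document as one string.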
Thought I'd chime in here for anyone who comes across this question in the future.
I had this problem and spent hours going through literally all the PDF libraries on NPM. My requirement was to run on AWS Lambda, so I could not depend on OS dependencies.
The code below is adapted from another Stack Overflow answer (which I cannot currently find). The only difference is that it imports the ES5 build, which works with Node >= 12; if you just import pdfjs-dist, you get the error "Readable Stream is not defined". Hope it helps!
import * as pdfjslib from 'pdfjs-dist/es5/build/pdf.js';

export default class Pdf {
  public static async getPageText(pdf: any, pageNo: number) {
    const page = await pdf.getPage(pageNo);
    const tokenizedText = await page.getTextContent();
    const pageText = tokenizedText.items.map((token: any) => token.str).join('');
    return pageText;
  }

  public static async getPDFText(source: any): Promise<string> {
    const pdf = await pdfjslib.getDocument(source).promise;
    const maxPages = pdf.numPages;
    const pageTextPromises = [];
    for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
      pageTextPromises.push(Pdf.getPageText(pdf, pageNo));
    }
    const pageTexts = await Promise.all(pageTextPromises);
    return pageTexts.join(' ');
  }
}
Usage (note: fs.readFile without a callback does not return the file contents, so use the sync or promise variant):
const fileBuffer = fs.readFileSync('sample.pdf');
const pdfText = await Pdf.getPDFText(fileBuffer);
This solution worked for me on Node 14.20.1 using "pdf-parse": "^1.1.1".
You can install it with:
yarn add pdf-parse
This is the main function, which converts the PDF file to text:
const path = require('path');
const fs = require('fs');
const pdf = require('pdf-parse');
const assert = require('assert');

const extractText = async (pathStr) => {
  assert(fs.existsSync(pathStr), `Path does not exist ${pathStr}`);
  const pdfFile = path.resolve(pathStr);
  const dataBuffer = fs.readFileSync(pdfFile);
  const data = await pdf(dataBuffer);
  return data.text;
};

module.exports = {
  extractText
};
Then you can use the function like this:
const { extractText } = require('../api/lighthouse/lib/pdfExtraction')
extractText('./data/CoreDeveloper-v5.1.4.pdf').then(t => console.log(t))
Instead of using the proposed PDF2Json you can also use PDF.js directly (https://github.com/mozilla/pdfjs-dist). This has the advantage that you are not depending on modesty, the owner of PDF2Json, to keep the PDF.js base up to date.

Converting a string from utf8 to latin1 in NodeJS

I'm using a Latin1-encoded DB and can't change it to UTF-8, which means I run into issues with certain application data. I'm using Tesseract to OCR a document (Tesseract encodes in UTF-8) and tried to use iconv-lite; however, it creates a buffer, and that buffer then has to be converted back into a string. But again, buffer-to-string conversion does not allow "latin1" encoding.
I've read a bunch of questions/answers; however, all I get is advice about setting the client encoding and the like.
Any ideas?
Since Node.js v7.1.0, you can use the transcode function from the buffer module:
https://nodejs.org/api/buffer.html#buffer_buffer_transcode_source_fromenc_toenc
For example:
const buffer = require('buffer');
const latin1Buffer = buffer.transcode(Buffer.from(utf8String), "utf8", "latin1");
const latin1String = latin1Buffer.toString("latin1");
You can create a buffer from the UTF-8 string you have, and then decode that buffer to Latin-1 using iconv-lite, like this:
const iconv = require('iconv-lite');
const buff = Buffer.from(tesseract_string, 'utf8'); // new Buffer() is deprecated
const DB_str = iconv.decode(buff, 'ISO-8859-1');    // ISO-8859-1 is Latin-1
I've found a way to convert a text file in any encoding to UTF-8:
var fs = require('fs'),
    charsetDetector = require('node-icu-charset-detector'),
    iconvlite = require('iconv-lite');

/* Having different encodings
 * on text files in a git repo
 * but need to serve always on
 * standard 'utf-8'
 */
function getFileContentsInUTF8(file_path) {
  var content = fs.readFileSync(file_path);
  var original_charset = charsetDetector.detectCharset(content);
  var jsString = iconvlite.decode(content, original_charset.toString());
  return jsString;
}
It's also in a gist here: https://gist.github.com/jacargentina/be454c13fa19003cf9f48175e82304d5
Maybe you can try this, where content should be your database buffer data (in latin1 encoding):
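Presumably something along these lines (a sketch assuming iconv-lite, as in the answers above; the sample bytes are hypothetical):

const iconv = require('iconv-lite');

// content: hypothetical raw latin1-encoded bytes from the database
const content = Buffer.from([0x63, 0x61, 0x66, 0xe9]); // "café" in latin1
const text = iconv.decode(content, 'latin1');
console.log(text); // "café"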
