I have this piece of code in Python 3:
import base64
import hashlib

payload = 'my URI'
payload_utf8 = payload.encode("utf-8")
print(payload_utf8)
payload_sha1 = hashlib.sha1(payload_utf8).digest()
print(payload_sha1)
payload_base64 = base64.b64encode(payload_sha1)
print(payload_base64)
I want the same result in Node.js. I have tried this:
const crypto = require('crypto');
const utf8 = require('utf8');

const payload = "my URI";
console.log(payload);
const payload_UTF8 = utf8.encode(payload);
console.log(payload_UTF8);
const payload_Sha = crypto.createHash('sha1').update(payload_UTF8).digest()
console.log(payload_Sha);
const payload_Base64 = Buffer.from(payload_Sha).toString('base64');
But the results aren't the same.
The results are the same; the only difference is that the Python example returns a bytes object while the JS example returns a string. If you want the exact same result in string format on the Python side, use print(payload_base64.decode("utf-8")).
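For reference, a minimal Node.js sketch that prints the same base64 string directly; crypto's update() accepts an input encoding, so the separate utf8 package isn't needed:
const crypto = require('crypto');

const payload = 'my URI';
// hash the UTF-8 bytes of the string and emit base64 in one chain
const payload_base64 = crypto.createHash('sha1').update(payload, 'utf8').digest('base64');
console.log(payload_base64);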
I need to read a .csv file, get the data in .json format, and work with it.
I'm using the npm package convert-csv-to-json. As a result, Cyrillic symbols aren't displayed properly:
const csvToJson = require('convert-csv-to-json');
let json = csvToJson.fieldDelimiter(',').getJsonFromCsv("input.csv");
console.log(json);
Result: the Cyrillic text comes out as garbled characters.
If I try to decode the file:
const csvToJson = require('convert-csv-to-json');
let json = csvToJson.asciiEncoding().fieldDelimiter(',').getJsonFromCsv("input.csv");
console.log(json);
the result is still garbled.
When I open the .csv file in AkelPad or Notepad++ it displays correctly, and the detected encoding is Windows-1251 (ANSI, Cyrillic).
Is there a way to read the file with the proper encoding, or to decode the resulting string?
Try using UTF-8 encoding instead of ASCII.
That is, change
let json = csvToJson.asciiEncoding().fieldDelimiter(',').getJsonFromCsv("input.csv");
to
let json = csvToJson.utf8Encoding().fieldDelimiter(',').getJsonFromCsv("input.csv");
This is code that solves the problem:
const fs = require('fs');
const iconv = require('iconv-lite');
const Papa = require('papaparse');
// read csv file and get buffer
const buffer = fs.readFileSync("input.csv");
// parse buffer to string with encoding
let dataString = iconv.decode(buffer, 'win1251');
// parse string to array of objects
let config = {
  header: true
};
const parsedOutput = Papa.parse(dataString, config);
console.log('parsedOutput: ', parsedOutput);
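Papa.parse returns an object with data, errors, and meta fields, so the decoded rows are available directly as plain objects:
// each entry of parsedOutput.data is an object keyed by the CSV headers
console.log(parsedOutput.data[0]);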
I am trying to get the MD5 hash of a number in Node.js using crypto, but I am getting a different hash than the one returned by a site where I can calculate the hash.
According to http://onlinemd5.com/ the MD5 hash for 1092000 is AF118C8D2A0D27A1D49582FDF6339B7C.
When I try to calculate the hash for that number in Node.js it gives me a different result (ac4d61a5b76c96b00235a124dfd1bfd1). My code:
const crypto = require('crypto');
const num = 1092000;
const hash = crypto.createHash('md5').update(toString(num)).digest('hex');
console.log(hash);
If you convert it to a string normally it works:
const hash = crypto.createHash('md5').update(String(num)).digest('hex'); // or num.toString()
See the difference:
toString(num) = [object Undefined]
(1092000).toString() = "1092000"
If you console.log(this) in a Node module you will see that by default it is:
this = {} typeof = 'object'
At the top level of a Node module, this points at module.exports, so the bare toString call ends up invoking Object.prototype.toString, which ignores its argument and formats its this value instead; it is not the right way to convert anything to a string.
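A runnable sketch of the fix; the expected digest is the one the question obtained from onlinemd5.com:
const crypto = require('crypto');

const num = 1092000;

// both conversions hash the string "1092000" and match the online result
console.log(crypto.createHash('md5').update(String(num)).digest('hex'));    // af118c8d2a0d27a1d49582fdf6339b7c
console.log(crypto.createHash('md5').update(num.toString()).digest('hex')); // af118c8d2a0d27a1d49582fdf6339b7c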
In PHP, the code below returns the raw SHA1 output of the string "string":
sha1("string", true);
What is the nodeJS equivalent of getting the SHA1 raw output?
Edit: I did some testing, and this line:
crypto.createHash('sha1').update('string').digest('base64');
generates the same output as PHP's
base64_encode(sha1('string', true));
My issue occurs when I try to concatenate a string with the result of sha1, then take the sha1 again:
base64_encode(sha1(sha1("string", true) . "another string", true))
This gives a different result in Node.js:
var stringhash = crypto.createHash('sha1').update('string').digest();
crypto.createHash('sha1').update("another string" + stringhash).digest('base64')
Something like this:
const crypto = require('crypto');
let digest = crypto.createHash('sha1').update('string').digest();
process.stdout.write( digest );
EDIT: the equivalent of your second example:
let hash1 = crypto.createHash('sha1').update('string').digest();
let hash2 = crypto.createHash('sha1').update(hash1).update('another string');
let digest = hash2.digest('base64');
The key point is that update() accepts a Buffer, so the raw digest bytes are fed into the second hash directly; concatenating the Buffer with a string, as in your version, implicitly converts it via toString() and corrupts the bytes.
Is there a way to extract text from PDFs in Node.js without any OS dependencies (like pdf2text, or xpdf on Windows)? I wasn't able to find any 'native' PDF packages in Node.js. They are always wrappers/utilities on top of an existing OS command.
Thanks
Have you checked PDF2Json? It is built on top of PDF.js. It does not provide the text output as a single line, but I believe you can reconstruct the final text from the generated JSON output (see the sketch after the field list):
'Texts': an array of text blocks with position, actual text and styling information:
- 'x' and 'y': relative coordinates for positioning
- 'clr': a color index into the color dictionary, the same 'clr' field as in the 'Fill' object. If a color can be found in the color dictionary, an 'oc' field will be added to the block as the 'original color' value.
- 'A': text alignment, one of:
  - left
  - center
  - right
- 'R': an array of text runs, where each text run object has two main fields:
  - 'T': the actual text
  - 'S': a style index into the style dictionary. More info about the 'Style Dictionary' can be found in the 'Dictionary Reference' section.
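As a rough sketch of that reconstruction (assuming a recent pdf2json where the parsed pages sit under pdfData.Pages and the 'T' values are URI-encoded; older versions nest the pages under pdfData.formImage instead):
const PDFParser = require('pdf2json');

const pdfParser = new PDFParser();
pdfParser.on('pdfParser_dataError', err => console.error(err.parserError));
pdfParser.on('pdfParser_dataReady', pdfData => {
  // join every text run of every text block on every page
  const text = pdfData.Pages
    .map(page => page.Texts
      .map(block => block.R.map(run => decodeURIComponent(run.T)).join(''))
      .join(' '))
    .join('\n');
  console.log(text);
});
pdfParser.loadPDF('./sample.pdf');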
After some work, I finally got a reliable function for reading text from a PDF using https://github.com/mozilla/pdfjs-dist
To get this to work, first install the package on the command line:
npm i pdfjs-dist
Then create a file with this code (I named the file "pdfExport.js" in this example):
const pdfjsLib = require("pdfjs-dist");

async function GetTextFromPDF(path) {
  let doc = await pdfjsLib.getDocument(path).promise;
  let page1 = await doc.getPage(1);
  let content = await page1.getTextContent();
  let strings = content.items.map(function(item) {
    return item.str;
  });
  return strings;
}

module.exports = { GetTextFromPDF };
Then it can simply be used in any other js file you have like so:
const pdfExport = require('./pdfExport');
pdfExport.GetTextFromPDF('./sample.pdf').then(data => console.log(data));
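Note that GetTextFromPDF only reads page 1. A sketch of the same approach extended to every page in the document, using doc.numPages:
async function GetTextFromAllPages(path) {
  const doc = await pdfjsLib.getDocument(path).promise;
  let text = '';
  // walk each page and collect its text items
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i);
    const content = await page.getTextContent();
    text += content.items.map(item => item.str).join(' ') + '\n';
  }
  return text;
}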
Thought I'd chime in here for anyone who comes across this question in the future.
I had this problem and spent hours going through literally all the PDF libraries on NPM. My requirement was to run on AWS Lambda, so I could not depend on OS dependencies.
The code below is adapted from another Stack Overflow answer (which I cannot currently find). The only difference is that it imports the ES5 version, which works with Node >= 12. If you just import pdfjs-dist, you will get the error "Readable Stream is not defined". Hope it helps!
import * as pdfjslib from 'pdfjs-dist/es5/build/pdf.js';

export default class Pdf {
  public static async getPageText(pdf: any, pageNo: number) {
    const page = await pdf.getPage(pageNo);
    const tokenizedText = await page.getTextContent();
    const pageText = tokenizedText.items.map((token: any) => token.str).join('');
    return pageText;
  }

  public static async getPDFText(source: any): Promise<string> {
    const pdf = await pdfjslib.getDocument(source).promise;
    const maxPages = pdf.numPages;
    const pageTextPromises = [];
    for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
      pageTextPromises.push(Pdf.getPageText(pdf, pageNo));
    }
    const pageTexts = await Promise.all(pageTextPromises);
    return pageTexts.join(' ');
  }
}
Usage
const fileBuffer = fs.readFileSync('sample.pdf');
const pdfText = await Pdf.getPDFText(fileBuffer);
This solution worked for me on Node 14.20.1 with "pdf-parse": "^1.1.1".
You can install it with:
yarn add pdf-parse
This is the main function which converts the PDF file to text.
const path = require('path');
const fs = require('fs');
const pdf = require('pdf-parse');
const assert = require('assert');
const extractText = async (pathStr) => {
  assert(fs.existsSync(pathStr), `Path does not exist ${pathStr}`);
  const pdfFile = path.resolve(pathStr);
  const dataBuffer = fs.readFileSync(pdfFile);
  const data = await pdf(dataBuffer);
  return data.text;
};

module.exports = {
  extractText
};
Then you can use the function like this:
const { extractText } = require('../api/lighthouse/lib/pdfExtraction')
extractText('./data/CoreDeveloper-v5.1.4.pdf').then(t => console.log(t))
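Per pdf-parse's README, the result object exposes more than just the text, which can be handy for sanity checks (the file path here is hypothetical):
const fs = require('fs');
const pdf = require('pdf-parse');

(async () => {
  const data = await pdf(fs.readFileSync('./data/sample.pdf'));
  console.log(data.numpages); // number of pages
  console.log(data.info);     // PDF info dictionary (Title, Author, ...)
  console.log(data.text);     // the full extracted text
})();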
Instead of using the proposed PDF2Json you can also use PDF.js directly (https://github.com/mozilla/pdfjs-dist). This has the advantage that you are not depending on modesty, the owner of PDF2Json, to keep the underlying PDF.js base up to date.
I'm using a Latin1-encoded DB and can't change it to UTF-8, meaning that I run into issues with certain application data. I'm using Tesseract to OCR a document (Tesseract encodes in UTF-8) and tried to use iconv-lite; however, it creates a buffer, and I need to convert that buffer into a string. But again, buffer-to-string conversion does not allow "latin1" encoding.
I've read a bunch of questions/answers; however, all I get is setting client encoding and stuff like that.
Any ideas?
Since Node.js v7.1.0, you can use the transcode function from the buffer module:
https://nodejs.org/api/buffer.html#buffer_buffer_transcode_source_fromenc_toenc
For example:
const buffer = require('buffer');
const latin1Buffer = buffer.transcode(Buffer.from(utf8String), "utf8", "latin1");
const latin1String = latin1Buffer.toString("latin1");
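One documented caveat of buffer.transcode: input that cannot be represented in the target encoding is replaced with '?'. A small sketch:
const buffer = require('buffer');

// '€' has no Latin-1 code point, so it comes back as '?'
const out = buffer.transcode(Buffer.from('héllo €', 'utf8'), 'utf8', 'latin1');
console.log(out.toString('latin1')); // héllo ?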
You can create a buffer from the UTF-8 string you have, and then decode that buffer to Latin-1 using iconv-lite, like this:
var iconv = require('iconv-lite');

var buff = Buffer.from(tesseract_string, 'utf8');
var DB_str = iconv.decode(buff, 'ISO-8859-1');
I've found a way to convert any encoded text file to UTF-8:
var fs = require('fs');
var charsetDetector = require('node-icu-charset-detector');
var iconvlite = require('iconv-lite');
/* Text files in a git repo may have
 * different encodings, but we always
 * need to serve them as standard 'utf-8' */
function getFileContentsInUTF8(file_path) {
  var content = fs.readFileSync(file_path);
  var original_charset = charsetDetector.detectCharset(content);
  var jsString = iconvlite.decode(content, original_charset.toString());
  return jsString;
}
It's also in a gist here: https://gist.github.com/jacargentina/be454c13fa19003cf9f48175e82304d5
Maybe you can try this, where content should be your database buffer data (in Latin1 encoding):
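Presumably something like the following, assuming iconv-lite and that content is a Buffer of Latin1 bytes coming from the database (the original snippet is not shown):
var iconv = require('iconv-lite');

// content is assumed to be a Buffer holding Latin1-encoded bytes
var text = iconv.decode(content, 'latin1');
console.log(text);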