Convert process.stdin numeric file descriptor to a FileHandle? - node.js

How can I convert the numeric file descriptor in process.stdin to a FileHandle object like those returned by fs.promises.open()?
Rationale:
want to work with stdin or a named input file in a uniform way
hate that uniform way to be based on numeric file descriptors (which could be done by using filehandle.fd, but eughh)

There does not seem to be a stable way to get a FileHandle from a fd value, at least as of 19.2.0. There is a complicated work-around here that might work, but it is clearly not a recommended approach: https://github.com/nodejs/node/issues/43821
If you're okay not supporting Windows, you could do:
import fs from "node:fs/promises"
const inputFileHandle = await fs.open("/dev/stdin", "r")
const outputFileHandle = await fs.open("/dev/stdout", "w")
It doesn't actually use the same underlying file descriptor as process.stdin.fd and process.stdout.fd (0 and 1, respectively), but it should achieve basically the same effect.

Related

Converting a nodejs buffer to string and back to buffer gives a different result in some cases

I created a .docx file.
Now, I do this:
// read the file to a buffer
const data = await fs.promises.readFile('<pathToMy.docx>')
// Converts the buffer to a string using 'utf8' but we could use any encoding
const stringContent = data.toString()
// Converts the string back to a buffer using the same encoding
const newData = Buffer.from(stringContent)
// We expect the values to be equal...
console.log(data.equals(newData)) // -> false
I don't understand in what step of the process the bytes are being changed...
I already spent sooo much time trying to figure this out, without any result... If someone can help me understand what part I'm missing out, it would be really awesome!
A .docx file is not a UTF-8 string (it's a binary ZIP file), so when you read it into a Buffer object and then call .toString() on it, you're assuming it is already encoded as UTF-8 in the buffer and that you now want to move it into a JavaScript string. That's not what you have. Your binary data will likely contain byte sequences that are invalid in UTF-8, and those will be coerced into valid UTF-8, causing an irreversible change.
What Buffer.toString() does is take a Buffer that is ALREADY encoded in UTF-8 and put it into a JavaScript string. See this comment in the doc:
If encoding is 'utf8' and a byte sequence in the input is not valid UTF-8, then each invalid byte is replaced with the replacement character U+FFFD.
So, the code you show in your question is wrongly assuming that Buffer.toString() takes binary data and reversibly encodes it as a UTF8 string. That is not what it does and that's why it doesn't do what you are expecting.
Your question doesn't describe what you're actually trying to accomplish. If you want to do something useful with the .docx file, you probably need to actually parse it from its binary ZIP file form into the actual components of the file in their appropriate format.
Now that you've explained you're trying to store it in localStorage, you need to encode the binary data into a string format. One popular option is Base64; it isn't super efficient (size-wise), but it is better than many others. See "Binary Data in JSON String. Something better than Base64" for prior discussion on this topic. Ignore the notes about compression in that other answer because your data is already ZIP compressed.

shutil.copyfileobj but without headers or skip first line [duplicate]

I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).
#import csv files from folder
path =r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None,)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")
This code works fine, but it is slow. It can take up to 2 days to process.
I was given a single line script for Terminal command line that does the same (but with no headers). This script takes 20 seconds.
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.
Thanks.
If you don't need the CSV in memory, just copying from input to output, it'll be a lot cheaper to avoid parsing at all, and copy without building up in memory:
import shutil
import glob
#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")
That's it; shutil.copyfileobj handles efficiently copying the data, dramatically reducing the Python level work to parse and reserialize. Don't omit the allFiles.sort() call!†
This assumes all the CSV files have the same format, encoding, line endings, etc.; that the encoding's newline contains a byte equal to ASCII \n, and that byte is the last byte of the character (so ASCII and all ASCII superset encodings work, as do UTF-16-BE and UTF-32-BE, but not UTF-16-LE and UTF-32-LE); and that the header doesn't contain embedded newlines. If that's the case, it's a lot faster than the alternatives.
For the cases where the encoding's version of a newline doesn't look enough like an ASCII newline, or where the input files are in one encoding and the output file should be in a different encoding, you can add the work of decoding and encoding without adding CSV parsing/serializing work. On Python 2, add from io import open to get Python 3-like efficient encoding-aware file objects. Define known_input_encoding as a string naming the known encoding of the input files (e.g. known_input_encoding = 'utf-16-le'), and output_encoding as the encoding for the output file, which may be the same or different:
# Other imports and setup code prior to first with unchanged from before
# Perform encoding to chosen output encoding, disabling line-ending
# translation to avoid conflicting with CSV dialect, matching raw binary behavior
with open('someoutputfile.csv', 'w', encoding=output_encoding, newline='') as outfile:
    for i, fname in enumerate(allFiles):
        # Decode with known encoding, disabling line-ending translation
        # for same reasons as above
        with open(fname, encoding=known_input_encoding, newline='') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            # just letting the file object decode from input and encode to output
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")
This is still much faster than involving the csv module, especially in modern Python (where the io module has undergone greater and greater optimization, to the point where the cost of decoding and reencoding is pretty minor, especially next to the cost of performing I/O in the first place). It's also a good validity check for self-checking encodings (e.g. the UTF family) even if the encoding is not supposed to change; if the data doesn't match the assumed self-checking encoding, it's highly unlikely to decode validly, so you'll get an exception rather than silent misbehavior.
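As a rough illustration of the validity-check point, here is a minimal sketch that runs the same decode-then-encode copy over in-memory byte strings instead of real files. The helper name transcode and the sample data are made up for this example; only io.TextIOWrapper, io.BytesIO and shutil.copyfileobj are standard library calls.

```python
import io
import shutil

def transcode(raw, input_encoding, output_encoding="utf-8"):
    """Decode raw bytes and re-encode them block by block, without CSV parsing."""
    # newline="" disables line-ending translation on both sides
    infile = io.TextIOWrapper(io.BytesIO(raw), encoding=input_encoding, newline="")
    sink = io.BytesIO()
    outfile = io.TextIOWrapper(sink, encoding=output_encoding, newline="")
    shutil.copyfileobj(infile, outfile)
    outfile.flush()  # push encoded text through to the underlying BytesIO
    return sink.getvalue()

# Valid UTF-16-LE input is converted to UTF-8 with line endings untouched.
good = "a,b\r\n1,2\r\n".encode("utf-16-le")
print(transcode(good, "utf-16-le"))  # b'a,b\r\n1,2\r\n'

# Bytes that are not valid UTF-8 raise instead of being copied silently.
try:
    transcode(b"\xff\xfe\xfd", "utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

With real files, the same exception would surface from shutil.copyfileobj inside the loop, so a mislabeled input file stops the merge rather than corrupting the output.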
Because some of the duplicates linked here are looking for an even faster solution than copyfileobj, some options:
The only succinct, reasonably portable option is to continue using copyfileobj and explicitly pass a non-default length parameter, e.g. shutil.copyfileobj(infile, outfile, 1 << 20) (1 << 20 is 1 MiB, a number which shutil has switched to for plain shutil.copyfile calls on Windows due to superior performance).
Still portable, but only works for binary files and not succinct, would be to copy the underlying code copyfile uses on Windows, which uses a reusable bytearray buffer with a larger size than copyfileobj's default (1 MiB, rather than 64 KiB), removing some allocation overhead that copyfileobj can't fully avoid for large buffers. You'd replace shutil.copyfileobj(infile, outfile) with (3.8+'s walrus operator, :=, used for brevity) the following code adapted from CPython 3.10's implementation of shutil._copyfileobj_readinto (which you could always use directly if you don't mind using non-public APIs):
buf_length = 1 << 20  # 1 MiB buffer; tweak to preference
# Using a memoryview gets zero copy performance when short reads occur
with memoryview(bytearray(buf_length)) as mv:
    while n := infile.readinto(mv):
        if n < buf_length:
            with mv[:n] as smv:
                outfile.write(smv)
        else:
            outfile.write(mv)
Non-portably, if you can (in any way you feel like) determine the precise length of the header, and you know it will not change by even a byte in any other file, you can write the header directly, then use OS-specific calls similar to what shutil.copyfile uses under the hood to copy the non-header portion of each file, using OS-specific APIs that can do the work with a single system call (regardless of file size) and avoid extra data copies (by pushing all the work to in-kernel or even within file-system operations, removing copies to and from user space) e.g.:
a. On Linux kernel 2.6.33 and higher (and any other OS that allows the sendfile(2) system call to work between open files), you can replace the .readline() and copyfileobj calls with:
filesize = os.fstat(infile.fileno()).st_size # Get underlying file's size
os.sendfile(outfile.fileno(), infile.fileno(), header_len_bytes, filesize - header_len_bytes)
To make it signal resilient, it may be necessary to check the return value from sendfile, and track the number of bytes sent + skipped and the number remaining, looping until you've copied them all (these are low level system calls, they can be interrupted).
b. On any system with Python 3.8+ built against glibc >= 2.27 (or on Linux kernel 4.5+), where the files are all on the same filesystem, you can replace sendfile with copy_file_range:
filesize = os.fstat(infile.fileno()).st_size # Get underlying file's size
os.copy_file_range(infile.fileno(), outfile.fileno(), filesize - header_len_bytes, header_len_bytes)
With similar caveats about checking for copying fewer bytes than expected and retrying.
c. On OSX/macOS, you can use the completely undocumented, and therefore even less portable/stable API shutil.copyfile uses, posix._fcopyfile for a similar purpose, with something like (completely untested, and really, don't do this; it's likely to break across even minor Python releases):
infile.seek(header_len_bytes) # Skip past header
posix._fcopyfile(infile.fileno(), outfile.fileno(), posix._COPYFILE_DATA)
which assumes fcopyfile pays attention to the seek position (docs aren't 100% on this) and, as noted, is not only macOS-specific, but uses undocumented CPython internals that could change in any release.
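For option (a), the retry loop described there (check the return value from sendfile and keep going until everything is copied) might look like the following minimal sketch. It assumes Linux, regular files opened in binary mode, an output file that was not opened in append mode, and a header length you already know. The function name copy_after_header is hypothetical and not part of shutil or os.

```python
import os

def copy_after_header(infile, outfile, header_len_bytes):
    """Copy everything after the header from infile to outfile via os.sendfile.

    Returns the number of bytes copied. Linux-only sketch.
    """
    outfile.flush()  # make sure buffered Python-level writes land first
    in_fd, out_fd = infile.fileno(), outfile.fileno()
    offset = header_len_bytes
    remaining = os.fstat(in_fd).st_size - header_len_bytes
    while remaining > 0:
        # sendfile may copy fewer bytes than requested, so loop on the result
        sent = os.sendfile(out_fd, in_fd, offset, remaining)
        if sent == 0:
            break  # unexpected end of file (e.g. the input shrank); stop
        offset += sent
        remaining -= sent
    return offset - header_len_bytes
```

Inside the merge loop, a call such as copy_after_header(infile, outfile, header_len_bytes) would stand in for the .readline() and copyfileobj pair on every file after the first.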
† An aside on sorting the results of glob: That allFiles.sort() call should not be omitted; glob imposes no ordering on the results, and for reproducible results, you'll want to impose some ordering (it wouldn't be great if the same files, with the same names and data, produced an output file in a different order simply because in-between runs, a file got moved out of the directory, then back in, and changed the native iteration order). Without the sort call, this code (and all other Python+glob module answers) will not reliably read from a directory containing a.csv and b.csv in alphabetical (or any other useful) order; it'll vary by OS, file system, and often the entire history of file creation/deletion in the directory in question. This has broken stuff before in the real world, see details at A Code Glitch May Have Caused Errors In More Than 100 Published Studies.
Are you required to do this in Python? If you are open to doing this entirely in shell, all you'd need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:
cat a-randomly-selected-csv-file.csv | head -n1 > merged.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
You don't need pandas for this; the simple csv module would work fine.
import csv
# allFiles is the list of input paths built with glob, as in the question
df_out_filename = 'df_out.csv'
write_headers = True
# newline='' lets the csv module control line endings (Python 3)
with open(df_out_filename, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)  # First row of each file is its header
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.
Here's a simpler approach; you can use pandas (though I am not sure how it will help with RAM usage):
import pandas as pd
import glob
path =r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_)
    stockstats_data = pd.concat((df, stockstats_data), axis=0)

Difference between io.StringIO and a string variable in python

I am new to python.
Can anybody explain the difference between a string variable and io.StringIO? Both can store characters.
e.g
String variable
k= 'RAVI'
io.StringIO
string_out = io.StringIO()
string_out.write('A sample string which we have to send to server as string data.')
string_out.getvalue()
If we print k or string_out.getvalue() both will print the text
print(k)
print(string_out.getvalue())
They are similar because both str and StringIO represent strings, they just do it in different ways:
str: Immutable
StringIO: Mutable, file-like interface, which stores strs
A text-mode file handle (as produced by open("somefile.txt")) is also very similar to StringIO (both are "Text I/O"), with the latter allowing you to avoid using an actual file for file-like operations.
You can use io.StringIO() to simulate files. Because Python is dynamically typed, anything that accepts a file object will usually also accept an io.StringIO() instance, meaning you can have a "file" in memory whose contents you control without actually writing any temporary files to disk.
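As a small sketch of that idea, the function below is written for any text-mode file object and is then handed a StringIO instead of a real file. The name count_rows and the sample data are invented for illustration.

```python
import csv
import io

def count_rows(fileobj):
    """Count CSV rows in any text-mode file-like object."""
    return sum(1 for _ in csv.reader(fileobj))

# Build the "file" in memory instead of on disk.
fake_file = io.StringIO()
fake_file.write("name,qty\n")
fake_file.write("apple,3\n")
fake_file.seek(0)  # rewind, just as you would before re-reading a real file
print(count_rows(fake_file))  # 2
```

The same count_rows would work unchanged on the object returned by open("somefile.csv", newline=""), which is what makes StringIO handy in tests.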

Python 3 combining file open and read commands - a need to close a file and how?

I am working through "Learn Python 3 the Hard Way" and am making code more concise. Lines 11 to 18 of the program below (line 1 starts at # program: p17.py) are relevant to my question. Opening and reading a file are very easy and it is easy to see how you close the file you open when working with the files. The original section is commented out and I include the concise code on line 16. I commented out the line of code that causes an error (on line 20):
$ python3 p17_aside.py p17_text.txt p17_to_file_3.py
Copying from p17_text.txt to p17_to_file_3.py
This is text.
Traceback (most recent call last):
  File "p17_aside.py", line 20, in <module>
    indata.close()
AttributeError: 'str' object has no attribute 'close'
Code is below:
# program: p17.py
# This program copies one file to another. It uses the argv function as well
# as exists - from sys and os.path modules respectively
from sys import argv
from os.path import exists
script, from_file, to_file = argv
print(f"Copying from {from_file} to {to_file}")
# we could do these two on one line, how?
#in_file = open(from_file)
#indata = in_file.read()
#print(indata)
# THE ANSWER -
indata = open(from_file).read()
# The next line was used for testing
print(indata)
# indata.close()
So my question is should I just avoid the practice of combining commands as done above or is there a way to properly deal with that situation so files are closed when they should be? Is it necessary to deal with the situation of closing a file at all in this situation?
A context manager used with the with statement is a convenient way to make sure your file is closed as needed:
with open(from_file) as fobj:
    indata = fobj.read()
Nowadays, you can also use Path-like objects and their read_text and read_bytes methods:
# This assumes Path from pathlib has been imported
indata = Path(from_file).read_text()
The error you were seeing occurred because you were not trying to close the file, but the str into which you had read its contents. You'd need to assign the object returned by open to a name, and then read from and close that one:
fobj = open(from_file)
indata = fobj.read()
fobj.close() # This is OK
Strictly speaking, you would not need to close that file, as dangling file descriptors are cleaned up along with the process when it exits. Especially in a short example like this, it would be of relatively little concern.
I hope I understood the follow-up question in the comments correctly; to extend on this a bit more:
If you wanted a single command, look at the pathlib.Path example above.
With open as such, you cannot perform read and close in a single operation without assigning the result of open to a variable, as both read and close would have to be performed on the same object returned by open. If you do:
var = fobj.read()
Now var refers to the content read out of the file (a str, which has no close method, so there is nothing you could close).
If you did:
open(from_file).close()
After that (but also before; at any point), you would simply open that file (again) and close it immediately. By the way, this returns None, just in case you wanted to get the return value. It would not affect previously opened file handles and file-like objects, and it would not serve any practical purpose except perhaps making sure you can open the file.
But again: it's good practice to perform the housekeeping, but strictly speaking (and especially in short code like this), if you did not close the file and relied on the OS to clean up after your process, it would work fine.
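To see the with statement doing the housekeeping, here is a minimal sketch that checks the file object's closed attribute after the block ends and compares the result with pathlib's read_text. The helper name read_and_report and the temporary file are made up for this example.

```python
import os
import tempfile
from pathlib import Path

def read_and_report(path):
    """Read a file inside a with block; return its text and whether it got closed."""
    with open(path) as fobj:
        indata = fobj.read()
    # fobj still names the file object here, but the with block has closed it
    return indata, fobj.closed

# Demo on a throwaway file (the file name is arbitrary).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "p17_text.txt")
    Path(sample).write_text("This is text.\n")
    text, was_closed = read_and_report(sample)
    print(was_closed)                        # True
    print(text == Path(sample).read_text())  # True
```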
How about the following:
# to open the file and read it
indata = open(from_file).read()
print(indata)
# this opens the file a second time and closes that new handle; it does not touch the handle used for reading above
open(from_file).close()

Nodejs fs.readfile vs new buffer binary

I have a situation where I receive a base64 encoded image, decode it, then want to use it in some analysis activity.
I can use Buffer to go from base64 to binary, but I seem to be unable to use that output as expected (as an image).
The solution for now is to convert to binary, write it to a file, then read that file again. The fs output can be used as an image, but this approach seems inefficient and adds extra steps; I would expect the Buffer output to also be a usable image, since it holds the same data.
My question is: how does the fs.readFile output differ from the Buffer output? And is there a way I can use the Buffer output as I would the fs output?
Buffer from a base64 string:
var bin = new Buffer(base64String, 'base64').toString('binary');
Read a file
var bin = fs.readFileSync('image.jpg');
Many thanks
