Gulp extract text from a file - node.js

I want to build up a table of contents file based on the comments from the first line of each file.
I can get to the files no problem and read the contents, but that only returns a buffer of the file.
I want to check whether the first line is a comment; if it is, extract that line and add it to a new file.
var bufferContents = through.obj(function (file, enc, cb) {
  console.log(file.contents);
  cb(null, file); // signal completion so the stream can continue
});

If the file is pretty large, I would recommend using a lazy streaming module:
https://github.com/jpommerening/node-lazystream
Or, if the line length is fixed, you can set the chunk size accordingly:
https://stackoverflow.com/a/19426486/1502019


Eliminate footer and header information from multiple text files (why can't I eliminate the last line as easily as I can eliminate the first lines?)

I have been trying all day.
# successfully writes the data from line 17 and next lines
# to a new (temp) file named and saved in the os
import os
import glob

files = glob.glob('/Users/path/Documents/test/*.txt')
for myspec in files:
    temp_filename = 'foo.temp.txt'
    with open(myspec) as f:
        for n in range(17):
            f.readline()
        with open(temp_filename, 'w') as w:
            w.writelines(f)
    # delete original file and rename the temp file so it replaces the original file
    os.remove(myspec)
    os.rename(temp_filename, myspec)
print("done")
The above works and it works well! I love it. I am very happy.
But this below does NOT work (same files, I am preprocessing files) :
# trying unsuccessfully to remove the last line which is line
# 2048 in all files and save again like above
import os
import glob

files = glob.glob('/Users/path/Documents/test/*.txt')
for myspec in files:
    temp_filename = 'foo.temp.txt'
    with open(myspec) as f:
        for n in range(-1):
            f.readline()
        with open(temp_filename, 'w') as w:
            w.writelines(f)
    # delete original file and rename the temp file so it replaces the original file
    os.remove(myspec)
    os.rename(temp_filename, myspec)
print("done")
This does not work. It doesn't give an error, it prints done, but it does not change the file. I have tried range(-1), all the way up to range(-7), thinking maybe there were blank lines at the end I could not see. This is the only difference between the two blocks of code. If anyone could help that would be great.
To summarize: I permanently got rid of the headers, but I still have a one-line footer I cannot get rid of permanently.
Thank you so much for any help. I need to write permanently edited files, because I have a ton of code that wants 2- or 3-column files without all the header/footer junk, and the junk and file types vary widely. If I lose the junk permanently, ASCII can guess the file types correctly. And I really do not want to rewrite that code right now; it's very complicated, involves uncertainty, and took me months to get working correctly. I don't read the files until I'm inside a function, and there are many files displayed in multiple drop-downs. All day I've been at this and have tried other methods; I'd like to make THIS method above work, popping off the last line and writing the result back to a permanent file. It doesn't like the -1. Right now it is just one specific line (specifically line 2048 after the header is removed), so just removing line 2048 would be fine too. It's the last line of the files, which are a batch of TSV files that are CCD readouts. Thanks in advance!
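For what it's worth, range(-1) is an empty range, so the skip loop never runs and the whole file is copied unchanged; readline() can only move forward, not backward from the end. One sketch (reusing the same hypothetical paths as above, not a confirmed fix from the thread) that drops the last line by reading the file into a list first:

```python
import glob
import os

files = glob.glob('/Users/path/Documents/test/*.txt')
for myspec in files:
    temp_filename = 'foo.temp.txt'
    with open(myspec) as f:
        lines = f.readlines()        # read every line into a list
    with open(temp_filename, 'w') as w:
        w.writelines(lines[:-1])     # write all lines except the last
    os.remove(myspec)
    os.rename(temp_filename, myspec)
print("done")
```

Reading the whole file into memory is fine here since the files are only ~2048 lines each.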

How to remove the pre header line item of a CSV file using csv.DictReader

I would like to remove the first line from a csv file like the one below, which comes with the encoding type on it; when I'm reading that file with csv.DictReader, it recognizes this word as a key of the dictionary.
csv input: (raw_file_data)
UTF-8,,,,
POSID,POST1,VERNR,PBUKR,PWPOS
"B00007","testing",08027011,"0030","CNY"
code to read it:
for row in csv.DictReader(codecs.getreader(encoding="ISO-8859-1")(raw_file_data)):
    data_list.append(row)
The result is that the first line of the CSV is being treated as the key row, and that's my issue. Can anyone help me ignore that first line and have the csv reader start from the second line, which contains the header information? I tried something with next(), but I could not solve it.
Many thanks.
You can skip the first line by calling next on the file iterator before you pass it to csv.DictReader:
file = codecs.getreader(encoding="ISO-8859-1")(raw_file_data)
next(file)
for row in csv.DictReader(file):
    data_list.append(row)
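A self-contained sketch of the same idea, with io.StringIO standing in for the decoded raw_file_data stream (the sample rows are taken from the question):

```python
import csv
import io

# Stand-in for codecs.getreader(...)(raw_file_data) from the question.
file = io.StringIO(
    'UTF-8,,,,\n'
    'POSID,POST1,VERNR,PBUKR,PWPOS\n'
    '"B00007","testing",08027011,"0030","CNY"\n'
)

next(file)  # throw away the pre-header "UTF-8,,,," line

data_list = []
for row in csv.DictReader(file):
    data_list.append(row)

print(data_list[0]['POSID'])  # prints B00007
```

Because the stream is already positioned past the first line, DictReader takes the POSID row as the header, which is the behavior the question asks for.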

Node.js "readline" + "fs.createReadStream" : Specify start & end line number

https://nodejs.org/api/readline.html
provides this solution for reading large files like CSVs line by line:
const { once } = require('events');
const { createReadStream } = require('fs');
const { createInterface } = require('readline');

(async function processLineByLine() {
  try {
    const rl = createInterface({
      input: createReadStream('big-file.txt'),
      crlfDelay: Infinity
    });

    rl.on('line', (line) => {
      // Process the line.
    });

    await once(rl, 'close');
    console.log('File processed.');
  } catch (err) {
    console.error(err);
  }
})();
But I don't want to read the entire file from beginning to end; I want parts of it, say from line number 1 to 10000, 20000 to 30000, etc.
Basically I want to be able to set a 'start' & 'end' line for a given run of my function.
Is this doable with readline & fs.createReadStream?
If not please suggest alternate approach.
PS: It's a large file (around 1 GB) & loading it in memory causes memory issues.
But I don't want to read the entire file from beginning to end but parts of it say from line number 1 to 10000, 20000 to 30000, etc.
Unless your lines are of fixed, identical length, there is NO way to know where line 10,000 starts without reading from the beginning of the file and counting lines until you get to line 10,000. That's how text files with variable length lines work. Lines in the file are not physical structures that the file system knows anything about. To the file system, the file is just a gigantic blob of data. The concept of lines is something we invent at a higher level and thus the file system or OS knows nothing about lines. The only way to know where lines are is to read the data and "parse" it into lines by searching for line delimiters. So, line 10,000 is only found by searching for the 10,000th line delimiter starting from the beginning of the file and counting.
There is no way around it, unless you preprocess the file into a more efficient format (like a database) or create an index of line positions.
Basically I want to be able to set a 'start' & 'end' line for a given run of my function.
The only way to do that is to "index" the data ahead of time so you already know where each line starts/ends. Some text editors made to handle very large files do this. They read through the file (perhaps lazily) reading every line and build an in-memory index of what file offset each line starts at. Then, they can retrieve specific blocks of lines by consulting the index and reading that set of data from the file.
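The approach is language-agnostic; here is a minimal sketch in Python (the helper names are made up) that scans the file once to record each line's byte offset, then seeks straight to any line range on later reads:

```python
def build_line_index(path):
    """One pass over the file: offsets[i] = byte offset where line i starts."""
    offsets = [0]
    with open(path, 'rb') as f:
        for line in f:
            offsets.append(offsets[-1] + len(line))
    return offsets[:-1]  # drop the offset just past the last line

def read_line_range(path, offsets, start, end):
    """Read lines start..end-1 (0-based) by seeking via the index."""
    with open(path, 'rb') as f:
        f.seek(offsets[start])
        return [f.readline() for _ in range(end - start)]
```

The index for a 1 GB file is small (one integer per line), so it fits in memory even when the file itself does not.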
Is this doable with readline & fs.createReadStream?
Without fixed length lines, there's no way to know where in the file line 10,000 starts without counting from the beginning.
It's a large file (around 1 GB) & loading it in memory causes memory issues.
Streaming the file a line at a time with the linereader module or others that do something similar will handle the memory issue just fine so that only a block of data from the file is in memory at any given time. You can handle arbitrarily large files even in a small memory system this way.
A new line is just a character (or two characters if you're on Windows); you have no way of knowing where those characters are without processing the file.
You are however able to read only a certain byte range in a file. If you know for a fact that every line contains 64 bytes, you can skip the first 100 lines by starting your read at byte 6400, and you can read only 100 lines by stopping your read at byte 12800.
Details on how to specify start and end points are available in the createReadStream docs.
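That fixed-length calculation can be sketched in Python too (assuming, as above, that every line is exactly 64 bytes including the line delimiter):

```python
LINE_LEN = 64  # assumed fixed record length, delimiter included

def read_fixed_lines(path, start_line, count):
    """Seek straight to start_line and read count fixed-length records."""
    with open(path, 'rb') as f:
        f.seek(start_line * LINE_LEN)   # e.g. line 100 starts at byte 6400
        return [f.read(LINE_LEN) for _ in range(count)]
```

No scanning or counting is needed because the offset of any line is just line_number * LINE_LEN.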

re-organize data stored in a csv

I have successfully downloaded my data from a given url and for storing it into a csv file I used the following code:
fx = open(destination_url, "w")  # write data into a file
for line in lines:  # loop through the string
    fx.write(line + "\n")
fx.close()  # close the file object
return
What happened is that the data is stored, but not on separate lines. As one can see in the snapshot, the data is not separated into different lines even though I use '\n'.
Every separate line of data that I wanted seems to be separated by '\r' (marked in yellow) within the same cell of the csv file.
I know I am missing something here, but can I get some pointers on rearranging each chunk that ends with a '\r' onto a separate line?
I hope I have made myself clear.
Thanks
~V
There is a method called writelines:
https://www.tutorialspoint.com/python/file_writelines.htm
Some examples are in the given link; try that first, as it should work. If it does not, we need to know the format of the data (what is inside each element), so print that out during each iteration.
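Given the '\r' separators visible in the snapshot, one likely fix (a sketch; the raw string below is a stand-in for the actual downloaded data) is to let str.splitlines() do the splitting before writing, since it handles \r, \n and \r\n alike:

```python
raw = 'col1,col2\r1,2\r3,4'  # stand-in for the downloaded data

with open('fixed.csv', 'w') as fx:
    for line in raw.splitlines():  # splits on \r as well as \n
        fx.write(line + '\n')
```

This normalizes every record to end in '\n', so spreadsheet software sees each record on its own row.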

file.read() not working as intended in string comparison

I've been trying to get the following code to create a .txt file, write some string to it, and then print a message if said string is in the file. This is merely a study for a more complex project, but even given its simplicity, it's still not working.
Code:
import io
file = open("C:\\Users\\...\\txt.txt", "w+") #"..." is the rest of the file destination
file.write('wololo')
if "wololo" in file.read():
    print("ok")
This code always skips the if, as if there were no "wololo" inside the file, even though I've checked many times and it is properly in there.
I'm not exactly sure what the problem could be, and I've spent a great deal of time searching everywhere for a solution, all to no avail. What could be wrong in this simple code?
Oh, and if I was to search for a string in a much bigger .txt file, would it still be wise to use file.read()?
Thanks!
When you write to your file, the cursor is moved to the end of the file. If you want to read the data afterwards, you'll have to move the cursor back to the beginning of the file, such as:
file = open("txt.txt", "w+")
file.write('wololo')
file.seek(0)
if "wololo" in file.read():
    print("ok")
file.close()  # Remember to close the file
If the file is big, you should consider iterating over the file line by line instead, so that the entire file is never held in memory at once. Also consider using a context manager (the with keyword), so that you don't have to explicitly close the file yourself.
with open('bigdata.txt') as ifile:
    for line in ifile:
        if 'wololo' in line:
            print('OK')
            break
    else:  # for/else: this runs only if the loop finished without a break
        print('String not in file')
