How to Pipeline File Lines in NodeJS

How to Pipeline File Lines in NodeJS - node.js

I am new to programming NodeJS and want to load multi-line sections from a file into MongoDB. I have seen simple solution that reads the file into memory and parses into lines. This will fill my need but doesn't seem the "NodeJS" way of doing it. If I just wanted to quickly load the file, I would load IntelliJ and do it in Java. I want to learn asynchronous and pipelines.
I see the the basic steps are stream the file in chunks, parse the chunks into lines, group lines into sections, convert sections into JSON, and insert JSON into MongoDB.
I like the idea of pipelines since I can easily replace parts and reuse others. It also helps in this case since the slowest step is the last and the whole input file could be loaded into memory before the first MongoDB document is written.
I have searched for a good example but they seem to be missing parts and explanations of what I need to modify. I have seen you can easily pipe a file stream in chunks but I need lines. I have seen you can easily stream a file to a line parser but that is not a pipeline.
Any ideas on how to do this or good examples?
Thanks,
Wes.

Related

Can antlr4 be used to parse very large gzip compressed files?

I am trying to parse very large gzip compressed (10+GB) file in python3. Instead of creating the parse tree, instead I used embedded actions based on the suggestions in this answer.
However, looking at the FileStream code it wants to read the entire file and then parse it. This will not work for big files.
So, this is a two part question.
Can ANTLR4 use a file stream, probably custom, that allows it to read chunks of the file at a time? What should the class interface look like?
Predicated on the above having "yes", would that class need to handle seek operations, which would be a problem if the underlying file is gzip compressed?

Short anser: no, not possible.
Long(er) answer: ANTLR4 can potentially use unlimited lookahead, so it relies on the stream to seek to any position with no delay or parsing speed will drop to nearly a hold. For that reason all runtimes use a normal file stream that reads in the entire file at once.
There were discussions/attempts in the past to create a stream that buffers only part of the input, but I haven't heard of anything that actually works.

Merging PDF files in Haskell

The Preview application on the Mac allows one to merge multiple PDF files, although the functionality is rather obscure. I'm writing a utility in Haskell that needs to perform a similar task, that is, merge an arbitrary number of PDF files into one new file.
Does anyone have a suggestion as to where to start with this? Obviously if there's a library on Hackage that will do most of the work out of the box that would be ideal, but if not, then some pointers about where to start would be very much appreciated.

I'm working on pdf library, that supports parsing and generating. It is low level, higher level tools are in todo list yet (because it is hard to design good high level API).
Here is an example of unpacking and decrypting of PDF file. It is easy to implement PDF merging, but you need to be familiar with PDF internals.
ADDED:
I create a basic example of merging PDF files in Haskell. 150 lines of code total, but it lacks few features (see comments at on the top of the file). They are easy to add, so let me know if you are interested.

The PDF file format isn't that complicated. Adobe has an official specification document for it somewhere. Essentially a PDF file contains a set of numbered "objects". You'd have to get all the objects from each PDF file, renumber them so they're unique, and then you need to fiddle with the page index so all the pages actually show up.
There appear to be a couple of packages on Hackage for writing PDF files, but I don't see much for reading them. You may like to look at the source code for pdfsplit for ideas. Also HPDF.

Saving a program and loading a program in use

I'm trying to create a program that moves a little man forward by walking or running etc.
I need to be able to save the state of the little man and load it at any given time and I have no idea how to do this. Could anyone please help me get the ball rolling?

You will probably have the state (the position of the character, maybe something else) stored in some sort of Python data structures. You could create a file storing this information. The most obvious ways are:
Create a plain-text file with text/numbers needed to recreate the state. You will need to parse it to load the state, but you can choose any file format you like.
Use pickle to dump your Python objects to a file directly. You will then read them using pickle.load and won't have to bother parsing the file.

Combining resources into a single binary file

How does one combine several resources for an application (images, sounds, scripts, xmls, etc.) into a single/multiple binary file so that they're protected from user's hands? What are the typical steps (organizing, loading, encryption, etc...)?
This is particularly common in game development, yet a lot of the game frameworks and engines out there don't provide an easy way to do this, nor describe a general approach. I've been meaning to learn how to do it, but I don't know where to begin. Could anyone point me in the right direction?

There are lots of ways to do this. m_pGladiator has some good ideas, especially with seralization. I would like to make a few other comments.
First, if you are going to pack a bunch of resources into a single file (I call these packfiles), then I think that you should work to avoid loading the whole file and then deseralizing out of that file into memory. The simple reason is that it's more memory. That's really not a problem on PC's I guess, but it's good practice, and it's essential when working on the console. While we don't (currently) serialize objects as m_pGladiator has suggested, we are moving towards that.
There are two types of packfiles that you might have. One would be a file where you want arbitrary access to the contents of the files. A second type might be a collection of files where you need all of those files when loading a level. A basic example might be:
An audio packfile might contain all the audio for your game. You might only need to load certain kinds of audio for the menus or interface screens and different sets of audio for the levels. This might fall intot he first category above.
A type that falls into the second category might be all models/textures/etc for a level. You basically want to load the entire contents of this file into the game at load time because you will (likely) need all of it's contents while a player is playing that level or section.
many of the packfiles that we build fall into the second category. We basically package up the level contents, and then compresses them with something like zlib. When we load one of these at game time, we read a small amount of the file, uncompress what we've read into a memory buffer, and then repeat until the full file has been read into memory. The buffer we read into is relatively small while final destination buffer is large enough to hold the largest set of uncompressed data that we need. This method is tricky, but again, it saves on RAM, it's an interesting exercise to get working, and you feel all nice and warm inside because you are being a good steward of system resources. once the packfile has been completely uncompressed into it's destinatino buffer, we run a final pass on the buffer to fix up pointer locations, etc. This method only works when you write out your packfile as structures that the game knows. In other words, our packfile writing tools share struct (or classses) with the game code. We are basically writing out and compressing exact representations of data structures.
If you simply want to cut down on the number of files that you are shipping and installing on a users machine, you can do with something like the first kind of packfile that I describe. Maybe you have 1000s of textures and would just simply like to cut down on the sheer number of files that you have to zip up and package. You can write a small utility that will basically read the files that you want to package together and then write a header containing the files and their offsets in the packfile, and then you can write the contents of the file, one at a time, one after the other, in your large binary file. At game time, you can simply load the header of this packfile and store the filenames and offsets in a hash. When you need to read a file, you can hash the filename and see if it exists in your packfile, and if so, you can read the contents directly from the packfile by seeking to the offset and then reading from that location in the packfile. Again, this method is basically a way to pack data together without regards for encryption, etc. It's simply an organizational method.
But again, I do want to stress that if you are going a route like I or m_pGladiator suggests, I would work hard to not have to pull the whole file into RAM and then deserialize to another location in RAM. That's a waste of resources (that you perhaps have plenty of). I would say that you can do this to get it working, and then once it's working, you can work on a method that only reads part of the file at a time and then decompresses to your destination buffer. You must use a comprsesion scheme that will work like this though. zlib and lzw both do (I believe). I'm not sure about an MD5 algorithm.
Hope that this helps.

do as Java: pack it all in a zip, and use an filesystem-like API to read directly from there.

Personally, I never used the already available tools to do that. If you want to prevent your game to be hacked easily, then you have to develop your own resource manipulation engine.
First of all read about serializing objects. When you load a resource from file (graphic, sound or whatever), it is stored in some object instance in the memory. A game usually uses dozens of graphical and sound objects. You have to make a tool, which loads them all and stores them in collections in the memory. Then serialize those collections into a binary file and you have every resource there.
Then you can use for example MD5 or any other encryption algorithm to encrypt this file.
Also, you can use zlib or other compression library to make this big binary file a bit smaller.
In the game, you should load the encrypted binary file and unpack it. Then decrypt it. Then deserialize the object collections and you have all resources back in memory.
Of course you can make this more comprehensive by storing in different binary files the resources for different levels and so on - there are plenty of variants, depending on what you want. Also you can first zip, then encrypt, or make other combinations of the steps.

Short answer: yes.
In Mac OS 6,7,8 there was a substantial API devoted to this exact task. Lookup the "Resource Manager" if you are interested. Edit: So does the ROOT physics analysis package.
Not that I know of a good tool right now. What platform(s) do you want it to work on?
Edited to add: All of the two-or-three tools of this sort that I am away of share a similar struture:
The file starts with a header and index
There are a series of blocks some of which may have there own headers and indicies, some of which are leaves
Each leaf is a simple serialization of the data to be stored.
The whole file (or sometimes individual blocks) may be compressed.
Not terribly hard to implement your own, but I'd look for a good existing one that meets your needs first.

For future people, like me, who are wondering about this same topic, check out the two following links:
http://www.sfml-dev.org/wiki/en/tutorials/formatdat
http://archive.gamedev.net/reference/programming/features/pak/

How do you visualize logfiles in realtime?

Sometimes it might be useful, but mostly just looking cool or impressive to visualize log files (anything from http requests and to bandwith usage to cups of coffee drunk per day).
I know about Visitorville which I think look a bit silly, and then there's gltail.
How do you "visualize" your log files in realtime?

There is also the logstalgia tool. Visualizes Apache logs. See http://code.google.com/p/logstalgia/ for more details and a youtube video.

You may take a look at Apache Chainsaw. This nifty tool allows Log incomes from nearly everyqhere and has live filtering and colering. If you have an already written Log, I'm not sure if it can read it, it's been a while since I used it last time (was very usefull for the prototyping phase of our JBoss server)

Google has released the Visualization API that is probably flexible enough to help you:
The Google Visualization API lets you access multiple sources of structured data that you can display, choosing from a large selection of visualizations. The Google Visualization API also provides a platform that can be used to create, share and reuse visualizations written by the developer community at large.
It requires some Javascript knowledge and includes Google Docs integration, Spreadsheet integration. Check out the Gallery for some examples.

You could take a look at this. http://www.intalisys.com. 3D realtime vis app

We use Awk and Perl scripts to parse the log files and create summary reports and "databases" (technically databases in that each row corresponds to a unique event with many columns of data about that event, but not stored in a traditional database format. We're moving in that direction). I like Awk because you can very quickly search for specific strings in the log files using regex, keep counters and gather data from the log file entries, and do all kinds of calculations with that data. Then use your favorite plotting software. We use Excel, mainly because that's what was here before I started this job. I prefer MATLAB and it's open-source cousin, Octave, which is built on gnuplot.

I prefer Sawmill for visualizing data. You can basically throw any log file against it, and it will not only autodetect its structure*, but will also decide on how to analyze it. Even if you have a custom log file, you can still define what and how shall be analyzed and visualized.

I mainly use R to visualize data, but I've heard of Orange, too.

Not sure if it fits the question, but I just released this:
numStepCsvLogVis - analyze logfile data in CSV format
It uses Python's matplotlib, is motivated by the need to visualize syslog data in context of debugging kernel circular buffer operation (and variables) in C; and it visualizes by using CSV file format as intermediary to the logfile data (I cannot explain it better in brief - take a look at the README for more detail).
It has a "step" player accessed in terminal, and can handle "live" stdin input, but unfortunately, I cannot get a better response that 1 FPS when plot renders, so I wouldn't really call it "realtime" per se - but you can use it to eventually generate sonified videos of plot animations.

A simple solution is to use Logstalgia alongside the lightweight local-web-server.
First install the above. Then, from the root folder of your site visualise your logs in realtime with:
$ ws --log-format default | logstalgia -

Using SciTe, Notepad++ or other powerful text editor which have file processing routines, so you can create a script that colorizes parts of the log or just delete some non-important lines from it

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string