Filtering by length of lines in file giving unexpected result - haskell

I am working my way through Learn You a Haskell for Great Good!. I am currently on the files and streams section of Chapter 9. For some reason, when I pipe input into one of the example Haskell programs, I do not get the same output as the book. I am using ConEmu to run Linux commands on Windows. For example, I have the program below (short_lines.hs), which only prints out lines that are less than 10 characters long:
main = interact $ unlines . filter ((<10) . length) . lines
I am going to be passing this file (short_long.txt):
i'm short
so am i
i am a loooooooooong line!!!
yeah i'm long so what hahahaha!!!!!!
short line
loooooooooooooooooooooooooooong
short
Here is the command:
cat short_long.txt | runhaskell short_lines.hs
Here is my output:
so am i
short
The book says that the output is the following:
i'm short
so am i
short
I believe this has to do with the handling of the newline character, but I can't figure it out, since lines should have removed the newline characters before filtering. It works with manual input but not with piping. Why am I getting a different output? Am I doing something wrong? I tried removing trailing newline characters in the Atom editor but it didn't change anything. Any help on why I am not getting the expected result, and what I could do to get it, would be greatly appreciated. Thank you!

The default newline mode for stdin is nativeNewline, which chooses its behavior based on what it believes your OS to be. I suspect that it has (wrongly) decided you are on a Unix system and it therefore should not do CRLF conversion; thus when given a Windows-style file each line has a trailing '\r' character. Try using
import System.IO
main = do
    hSetNewlineMode stdin universalNewlineMode
    interact $ unlines . filter ((<10) . length) . lines
to force CRLF conversion and see if that gets you the expected results.
I can reproduce your problem on my Unix system by converting a text file to DOS mode before giving it to your program. Having done so, my suggested fix gets the desired behavior.

I found out that I can change the line ending style from Windows-CRLF to Unix-LF in the Atom editor. It is shown at the bottom of the window and simply says CRLF or LF; you can click on it to choose a different line ending style. For this book, that is what I will use for simplicity's sake. However, I believe that amalloy's answer is a better long-term, universal approach to IO.

Related

Is there an end= equivalent for inputs?

So as I'm sure you know, the print() function has a specific keyword argument called end.
#as an example
print('BOB', end='-')
#would output
BOB-
So is there something like this for inputs? For example, if I wanted to have an input that would look something like this:
Input here
►
-------------------------------------------------------
And have the input at the ► and be inside the dashes using something like
x = input('Input here\n►', end='-------')
Would there be some equivalent?
EDIT:
Just to be clear, everything will be printed at the same time. The input would just be on the line marked with the ►, and the ---- would be printed below it, but at the SAME time. This means that the input would be "enclosed" by the ---.
Also, there has been a comment about curses - can you please clarify on this?
Not exactly what you want, but if the --- (or ___) can also be on the same line, you could use an input prompt with \r:
input("__________\r> ")
This means: print 10 _, then go back \r to the beginning of the line, print > overwriting the first two _, then capture the input, overwriting more _. This shows the input prompt > ________. After typing some chars: > test____. Captured input: 'test'
For more complex input forms, you should consider using curses.
When using basic console IO, once a line has been ended with a newline, it's gone and can't be edited. You can't move the cursor up to print anything above that last line; you can only add new output on the lines below.
That means that without using a specialized "console graphics" library like curses (as tobias_k suggests), you pretty much can't do what you're asking. You can mess around a little with the contents of the last line (overwriting text you've already written there), but you can't write to any line other than the last one.
To understand why console IO works this way, you should know that very early computers didn't have screens. Instead, their console output was directly printed out one line at a time on paper. While some line printers could print several characters on the same spot (to get effects like strikethrough or underline), you couldn't unprint anything once it was on the paper. Furthermore, the paper feed only worked in one direction. Once you had sent a newline character that told the printer to advance the paper, you couldn't go back to an old line again.
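To make the curses suggestion more concrete, here is a minimal sketch of the kind of thing it allows. This is an illustration, not code from the original answers: the window layout, the 40-dash width, and the prompt text are assumptions, and it assumes a terminal where Python's standard curses module is available (on Windows that typically means installing the windows-curses package).
import curses

def ask(stdscr):
    curses.echo()                        # echo typed characters so the user sees the input
    stdscr.addstr(0, 0, "Input here")
    stdscr.addstr(2, 0, "-" * 40)        # dashes drawn below the input line before input is read
    stdscr.move(1, 0)                    # cursor on the line marked with the arrow in the question
    return stdscr.getstr(1, 0).decode()  # read a line of input from that position

answer = curses.wrapper(ask)             # wrapper sets up and restores the terminal state
print("You typed:", answer)
Because curses redraws the screen itself, the dashes really are visible below the cursor while the user types, which is exactly what plain print/input cannot do.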
I believe this would solve your problem:
print(f">>> {input()} ------")
OR
print(f"{input(">>>")} ------")
F-strings are quite useful when it comes to printing text + variables.

How to strip binary characters from a file?

I've got a file that contains lines that look like this in vim:
^[[0;32msalt-2016.3.2-1.el6.noarch^[[0;0m^M
which look like this in more:
salt-2016.3.2-1.el6.noarch
I would like to produce a copy of this file that only contains the displayed characters as more shows them. I tried piping it through dos2unix but it refuses to do anything, complaining that "dos2unix: Binary symbol 0x1B found at line 2".
Probably I could achieve what I want with some sed statements, but I'm wondering whether there is a linux/unix utility that will take output from more or cat and produce a file that contains only the whitespace and text as displayed?
There's something called ansifilter which does exactly this. I tested it out on my file and it works.
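If installing a tool is not an option, the same effect can usually be approximated with a short script that strips the escape sequences. Here is a rough Python sketch (my own addition, not part of the original answer); the regex only covers the common ESC[...m style CSI sequences and the trailing carriage returns shown above, not every possible control code.
import re
import sys

# Matches CSI escape sequences such as ESC[0;32m and ESC[0;0m.
ansi_csi = re.compile(r'\x1b\[[0-9;]*[A-Za-z]')

for line in sys.stdin:
    # Drop the color codes and the stray carriage returns, keep everything else.
    sys.stdout.write(ansi_csi.sub('', line).replace('\r', ''))
Run it as python strip_ansi.py < input.txt > output.txt (the script name is just a placeholder).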

node JS console log ascii symbol

Hi, I am using node JS for my app, and I want to print ASCII symbols in the terminal. Here is a table of ASCII symbols; please check the Extended ASCII Codes field. I want to print a square or circle, for example 178 or 219.
Can anyone tell me how I can do it? Thank you.
Like several other languages, Javascript suffers from The UTF-16 Curse. Except that Javascript has an even worse form of it, The UCS-2 Curse. Things like charCodeAt and fromCharCode only ever deal with 16-bit quantities, not with real, 21-bit Unicode code points. Therefore, if you want to print out something like 𝒜, U+1D49C, MATHEMATICAL SCRIPT CAPITAL A, you have to specify not one character but two "char units": "\uD835\uDC9C".
Please refer to this link: https://dheeb.files.wordpress.com/2011/07/gbu.pdf
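For what it's worth, you can see where that surrogate pair comes from by encoding the code point as UTF-16. A small illustration (written in Python rather than Javascript, purely to show the byte layout; not from the original answer):
import unicodedata

ch = chr(0x1D49C)                          # U+1D49C
print(unicodedata.name(ch))                # MATHEMATICAL SCRIPT CAPITAL A
print(ch.encode('utf-16-be').hex(' ', 2))  # d835 dc9c -- the two "char units" \uD835\uDC9C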
Your desired character is not a printable ASCII character. On linux you can print all the printable ascii characters by running this command:
for((i=32;i<=127;i++)) do printf \\$(printf '%03o\t' "$i"); done;printf "\n"
or
man ascii
So what you can do is print unicode characters. Here is a list of all the available unicode characters, and you can select one which looks almost identical to your desired character.
http://unicode-table.com/en/#2764
I've tested on a Windows terminal and it is still not showing the desired character, but it works on Linux. If it's still not working, you may have to make sure LANGUAGE="en_US.UTF-8" is set in /etc/rc.conf and LANG="en_US.UTF-8" in /etc/locale.conf.
So printing out something like this on node console:
console.log('\u2592 start typing...');
will output this result:
▒ start typing...
Actually, if you only care about ASCII, that should not be a real problem at all. You only have to escape the characters properly. A good reference for this is https://mathiasbynens.be/notes/javascript-escapes
console.log('\xB2 \xDB')
Works for me with recentish node under Windows (cmd shell) and mac OS. For ASCII characters you can just convert them to hex and prepend them with \x in your strings. Give it a try with node -e "console.log('\xB2')"
And when you try this answer, and it works, you might want to try:
node -e "console.log('\x07')"

Removing lines containing encoding errors in a text file

I must warn you I'm a beginner. I have a text file in which some lines contain encoding errors. By "error" I mean that when I view the file in my Linux console, those lines show question marks instead of characters.
I want to remove every line showing those "question marks". I tried to grep -v the problematic character, but it doesn't work. The file itself is UTF8 and I guess some of the lines come from texts encoded in another format. I know I could find a way to reconvert them properly, but I just want them gone for now.
Do you have any ideas about how I could do this please?
PS: Some lines contain diacritics which are displayed fine. The "strings" command seems to remove too many "good" lines.
When dealing with mojibake on character encodings other than ANSI, you must check two things:
Is the file really encoded in X? (X being UTF-8 WITHOUT BOM in your case. You could be trying to read UTF-8 WITH BOM, UTF-16, latin-1, etc. as UTF-8, and that would be the problem.) Try reading it in (not converting it to) other encodings and see if any of them fits; see the sketch after these two checks.
Is your locale or text editor set to read the file as UTF-8? If not, that may be the problem. Check for support and figure out how to change the setting. On Linux, try the locale and setlocale commands to check and set it properly.
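As a concrete way to run the first check, here is a rough Python sketch that tries to read the same file under a few candidate encodings and reports which ones decode without errors. The file name and the encoding list are just examples, and note that latin-1 accepts any byte sequence, so a clean decode there does not prove much.
# Candidate encodings to try; adjust the list and the file name as needed.
candidates = ['utf-8', 'utf-8-sig', 'utf-16', 'cp1252', 'latin-1']

for enc in candidates:
    try:
        with open('input.txt', encoding=enc) as f:
            f.read()
        print(enc, 'decodes cleanly')
    except UnicodeError as exc:
        print(enc, 'fails:', exc)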
I like how Notepad++ for Windows (which also runs perfectly on Linux using Wine) lets you set any encoding you want to read the file with, without trying to convert it (of course, if you set one other than the encoding the file is actually in, you will only see those weird characters). It also has a separate option which allows you to convert the file from one encoding to another. That has been pretty useful to me.
If you are a beginner you may be interested in this article. It explains briefly and clearly the whats, whys and hows of character encoding.
[EDIT] If the above fails even with windows-1252 and similar ANSI encodings, I've just learned here how to remove non-ASCII characters using the Unix tr command, turning the file into plain ASCII (but be aware that the information in those extra characters is lost in this output and there is no coming back, so keep the input file around in case you find a better fix):
tr -cd '\11\12\40-\176' < $INPUT_FILE > $OUTPUT_FILE
or, if you want to get rid of the whole line:
grep -v -P "[^\11\12\40-\176]" $INPUT_FILE > $OUTPUT_FILE
[EDIT 2] This answer here gives a pretty good guess of what could be happening if none of the encodings work on your file (unfortunately, the only straightforward solution seems to be removing those problematic characters).
You can use a micro-Perl script like:
perl -pe 's/[^[:ascii:]]+//g;' my_utf8_file.txt
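Since the tr, grep and perl one-liners above all key on ASCII, they will also drop or mangle lines that merely contain correct diacritics (the PS in the question). If the goal is only to drop the mis-encoded lines, filtering on UTF-8 validity may work better; here is a rough Python sketch of that idea (my own addition; the file names are placeholders):
import sys

# Read raw bytes line by line and keep only the lines that decode as UTF-8.
with open(sys.argv[1], 'rb') as src, open(sys.argv[2], 'wb') as dst:
    for raw in src:
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError:
            continue          # bytes that are not valid UTF-8: drop the whole line
        dst.write(raw)
Run it as python keep_utf8_lines.py input.txt output.txt (again, the script name is only an example); lines with valid accented characters are kept, while lines containing bytes from other encodings are removed.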

set list command in vi

I am trying to parse a giant log file using node.js. The file does not seem to have '\n', but when I do set list in vi it shows me '$' at the end of every line. Does anyone know what that is? I mean, can I split a string on that?
I would recommend checking out your file via
cat -v -e
which will show you all unprintable characters and line endings.
It happens when you do set list, so you should read :h 'list' instead of asking this here. Everything you need to know about this $ is stated in the help.
The second question (splitting a string on end-of-line) is answered in :h getline(). I also doubt that the file really does not have a newline, so please write here how you came to the conclusion that «the file does not seem to have '\n'».
