lazyness, does words read (within interact) all the input, or only what is needed? - haskell

I am starting to learn Haskell, and I am trying to understand
how much work do the functions do (specially with respect to
the laziness concept). Please see the following program:
main::IO()
main = interact ( head . words)
Will this program read all the input or only the first word in input?

Just the first word:
% yes | ghc -e 'interact (head . words)'
y
%
But beware: this relies a feature called "lazy IO" that is only kind of related to the technique of laziness in pure code. Pure functions are lazy by default and you must work hard to make them strict; IO is "strict IO" by default and you must work hard to make it lazy IO. A handful of library functions (notably interact, (h)getContents, and readFile) have gone to this effort.
It also has some problems with composability.

Conceptually, it reads only what it needs. But it probably uses a buffer to do so:
$ yes | strace -feread,write ghc -e 'interact (head . words)'
...
[pid 61274] read(0, "y\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\ny\n"..., 8096) = 8096
[pid 61272] write(1, "y", 1y) = 1
[pid 61272] --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=0}} ---
[pid 61272] write(5, "\376", 1) = 1
[pid 61273] read(4, "\376", 1) = 1
[pid 61273] +++ exited with 0 +++
[pid 61274] +++ exited with 0 +++
[pid 61276] +++ exited with 0 +++
+++ exited with 0 +++
This shows (on a Linux system) that the program split itself into multiple threads, one of them read 8KiB of data from stdin, then another output the first word. The main reason is that repeatedly reading small amounts is quite inefficient. Asynchronous sources like terminals and sockets may produce smaller amounts of data, though:
$ strace -f -e trace=read,write -e signal= ghc -e 'interact (head . words)'
...
hello program
Process 61594 attached
[pid 61592] read(0, "hello program\n", 8096) = 14
[pid 61590] write(1, "hello", 5hello) = 5
In this case, the terminal layer completed the read at the first newline, even though the buffer was still 8KiB large. As this was enough data for the first word to be identified, no further reads were needed.

Related

How does gdb start an assembly compiled program and step one line at a time?

Valgrind says the following on their documentation page
Your program is then run on a synthetic CPU provided by the Valgrind core
However GDB doesn't seem to do that. It seems to launch a separate process which executes independently. There's also no c library from what I can tell. Here's what I did
Compile using clang or gcc gcc -g tiny.s -nostdlib (-g seems to be required)
gdb ./a.out
Write starti
Press s a bunch of times
You'll see it'll print out "Test1\n" without printing test2. You can also kill the process without terminating gdb. GDB will say "Program received signal SIGTERM, Terminated." and won't ever write Test2
How does gdb start the process and have it execute only one line at a time?
.text
.intel_syntax noprefix
.globl _start
.p2align 4, 0x90
.type _start,#function
_start:
lea rsi, [rip + .s1]
mov edi, 1
mov edx, 6
mov eax, 1
syscall
lea rsi, [rip + .s2]
mov edi, 1
mov edx, 6
mov eax, 1
syscall
mov eax, 60
xor edi, edi
syscall
.s1:
.ascii "Test1\n"
.s2:
.ascii "Test2\n"
starti implementation
As usual for a process that wants to start another process, it does a fork/exec, like a shell does. But in the new process, GDB doesn't just make an execve system call right away.
Instead, it calls ptrace(PTRACE_TRACEME) to wait for the parent process to attach to it, so GDB (the parent) is already attached before the child process makes an execve() system call to make this process start executing the specified executable file.
Also note in the execve(2) man page:
If the current program is being ptraced, a SIGTRAP signal is sent
to it after a successful execve().
So that's how the kernel debugging API supports stopping before the first user-space instruction is executed in a newly-execed process. i.e. exactly what starti wants. This doesn't depend on setting a breakpoint; that can't happen until after execve anyway, and with ASLR the correct address isn't even known until after execve picks a base address. (GDB by default disables ASLR, but it still works if you tell it not to disable ASLR.)
This is also what GDB use if you set breakpoints before run, manually, or by using start to set a one-time breakpoint on main. Before the starti command existed, a hack to emulate that functionality was to set an invalid breakpoint before run, so GDB would stop on that error, giving you control at that point.
If you strace -f -o gdb.trace gdb ./foo or something, you'll see some of what GDB does. (Nested tracing apparently doesn't work, so running GDB under strace means GDB's ptrace system call fails, but we can see what it does leading up to that.)
...
231566 execve("/usr/bin/gdb", ["gdb", "./foo"], 0x7ffca2416e18 /* 57 vars */) = 0
# the initial GDB process is PID 231566.
... whole bunch of stuff
231566 write(1, "Starting program: /tmp/foo \n", 28) = 28
231566 personality(0xffffffff) = 0 (PER_LINUX)
231566 personality(PER_LINUX|ADDR_NO_RANDOMIZE) = 0 (PER_LINUX)
231566 personality(0xffffffff) = 0x40000 (PER_LINUX|ADDR_NO_RANDOMIZE)
231566 vfork( <unfinished ...>
# 231584 is the new PID created by vfork that would go on to execve the new PID
231584 openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 13
231584 newfstatat(13, "", {st_mode=S_IFDIR|0500, st_size=0, ...}, AT_EMPTY_PATH) = 0
231584 getdents64(13, 0x558403e20360 /* 16 entries */, 32768) = 384
231584 close(3) = 0
... all these FDs
231584 close(12) = 0
231584 getdents64(13, 0x558403e20360 /* 0 entries */, 32768) = 0
231584 close(13) = 0
231584 getpid() = 231584
231584 getpid() = 231584
231584 setpgid(231584, 231584) = 0
231584 ptrace(PTRACE_TRACEME) = -1 EPERM (Operation not permitted)
231584 write(2, "warning: ", 9) = 9
231584 write(2, "Could not trace the inferior pro"..., 37) = 37
231584 write(2, "\n", 1) = 1
231584 write(2, "warning: ", 9) = 9
231584 write(2, "ptrace", 6) = 6
231584 write(2, ": ", 2) = 2
231584 write(2, "Operation not permitted", 23) = 23
231584 write(2, "\n", 1) = 1
# gotta love unbuffered stderr
231584 exit_group(127) = ?
231566 <... vfork resumed>) = 231584 # in the parent
231584 +++ exited with 127 +++
# then the parent is running again
231566 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=231584, si_uid=1000, si_status=127, si_utime=0, si_stime=0} ---
231566 rt_sigreturn({mask=[]}) = 231584
... then I typed "quit" and hit return
There some earlier clone system calls to create more threads in the main GDB process, but those didn't exit until after the vforked PID that attempted ptrace(PTRACE_TRACEME). They were all just threads since they used clone with CLONE_VM. There was one earlier vfork / execve of /usr/bin/iconv.
Annoyingly, modern Linux has moved to PIDs wider than 16-bit so the numbers get inconveniently large for human minds.
step implementation:
Unlike stepi which would use PTRACE_SINGLESTEP on ISAs that support it (e.g. x86 where the kernel can use the TF trap flag, but interestingly not ARM), step is based on source-level line number <-> address debug info. That's usually pointless for asm, unless you want to step past macro expansions or something.
But for step, GDB will use ptrace(PTRACE_POKETEXT) to write an int3 debug-break opcode over the first byte of an instruction, then ptrace(PTRACE_CONT) to let execution run in the child process until it hits a breakpoint or other signal. (Then put back the original opcode byte when this instruction needs to execute). The place at which it puts that breakpoint is something it finds by looking for the next address of a line-number in the DWARF or STABS debug info (metadata) in the executable. That's why only stepi (aka si) works when you don't have debug info.
Or possibly it would use PTRACE_SINGLESTEP one or two times as an optimization if it saw it was close.
(I normally only use si or ni for debugging asm, not s or n. layout reg is also nice, when GDB doesn't crash. See the bottom of the x86 tag wiki for more GDB asm debugging tips.)
If you meant to ask how the x86 ISA supports debugging, rather than the Linux kernel API which exposes those features via a target-independent API, see the related Q&As:
How is PTRACE_SINGLESTEP implemented?
Why Single Stepping Instruction on X86?
How to tell length of an x86-64 instruction opcode using CPU itself?
Also How does a debugger work? has some Windowsy answers.

Bash file descriptors vs Linux file descriptors

I'm just trying to reconcile these two seemingly similar concepts.
In Bash, one is allowed to make arbitrary redirections, and importantly, using one's chosen file descriptor number. However in Linux, the value returned by an open call (AFAIK) cannot be chosen by the calling process.
Thus, are Bash fd numbers the same as the fd numbers returned by system calls? If not, what's the difference?
Here's a little experiment that might shed some light on what's going on when you open a file descriptor in bash with a number of your choosing:
> cat test.txt
foobar!
> cat test.sh
#!/bin/bash
exec 17<test.txt
read -u 17 line
echo "$line"
exec 17>&-
> strace ./test.sh
//// A bunch of stuff omitted so we can skip to the interesting part...
open("test.txt", O_RDONLY) = 3
fcntl(17, F_GETFD) = -1 EBADF (Bad file descriptor)
dup2(3, 17) = 17
close(3) = 0
fcntl(17, F_GETFD) = 0
ioctl(17, TCGETS, 0x7ffc56f093f0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(17, 0, SEEK_CUR) = 0
read(17, "foobar!\n", 128) = 8
write(1, "foobar!\n", 8foobar!) = 8
fcntl(17, F_GETFD) = 0
fcntl(17, F_DUPFD, 10) = 10
fcntl(17, F_GETFD) = 0
fcntl(10, F_SETFD, FD_CLOEXEC) = 0
close(17) = 0
The part that answers your question is where it calls open() on test.txt, which returns a value of 3. This is what you would most likely get in a C program if you did the same, because file descriptors 0, 1, and 2 (i.e., stdin, stdout, and stderr) are all you have open initially. The number 3 is just the next available file descriptor.
And we see that in the strace output of the bash script as well. What bash does differently is that it then calls fcntl(17, F_GETFD) to check if file descriptor 17 is already open (because it wants to use that fd for test.txt). Then, when fcntl returns EBADF indicating no such fd is open, bash knows it is free to use it. So then it calls dup2(3, 17) to make fd 17 a copy of fd 3. Finally, it calls close() on fd 3 to free it up again, leaving fd 17 (and only fd 17) as an open file descriptor for test.txt.
So the answer to your question is that bash file descriptors are not special creatures set apart from the "normal" file descriptors that everyone else uses. They are in fact just the same thing. You could easily use the same trick in your C program to open files with file descriptor numbers of your choosing.
Also, it's worth pointing out that bash doesn't really get to choose its own file descriptor when it calls open(). It has to make do with whatever open() returns, like everyone else. All that's really going on in your bash script is some smoke and mirrors (via dup2()) to make it seem as if you get to choose your own file descriptor.

How to test STDIN read error

I have a Linux program (currently in assembly) that has a check: if a read from STDIN failed, show an error message. The problem is that I do not know how to test this condition, how to execute the program so that it will fail reading from STDIN. IT must be run without STDIn or STDIN couldbe closed some how before the program starts?
Yes, you can close the file descriptor, that will trigger an error. Test using bash:
$ strace ./a.out 0<&-
execve("./a.out", ["./a.out"], [/* 32 vars */]) = 0
[ Process PID=4012 runs in 32 bit mode. ]
read(0, 0xffe13fec, 1) = -1 EBADF (Bad file descriptor)
You can also provoke other errors that are listed in the man page, such as:
$ strace ./a.out 0</tmp
execve("./a.out", ["./a.out"], [/* 32 vars */]) = 0
[ Process PID=4056 runs in 32 bit mode. ]
read(0, 0xffed5c0c, 1) = -1 EISDIR (Is a directory)

Capture both exit status and output from a system call in R

I've been playing a bit with system() and system2() for fun, and it struck me that I can save either the output or the exit status in an object. A toy example:
X <- system("ping google.com",intern=TRUE)
gives me the output, whereas
X <- system2("ping", "google.com")
gives me the exit status (1 in this case, google doesn't take ping). If I want both the output and the exit status, I have to do 2 system calls, which seems a bit overkill. How can I get both with using only one system call?
EDIT : I'd like to have both in the console, if possible without going over a temporary file by using stdout="somefile.ext" in the system2 call and subsequently reading it in.
As of R 2.15, system2 will give the return value as an attribute when stdout and/or stderr are TRUE. This makes it easy to get the text output and return value.
In this example, ret ends up being a string with an attribute "status":
> ret <- system2("ls","xx", stdout=TRUE, stderr=TRUE)
Warning message:
running command ''ls' xx 2>&1' had status 1
> ret
[1] "ls: xx: No such file or directory"
attr(,"status")
[1] 1
> attr(ret, "status")
[1] 1
I am a bit confused by your description of system2, because it has stdout and stderr arguments. So it is able to return both exit status, stdout and stderr.
> out <- tryCatch(ex <- system2("ls","xx", stdout=TRUE, stderr=TRUE), warning=function(w){w})
> out
<simpleWarning: running command ''ls' xx 2>&1' had status 2>
> ex
[1] "ls: cannot access xx: No such file or directory"
> out <- tryCatch(ex <- system2("ls","-l", stdout=TRUE, stderr=TRUE), warning=function(w){w})
> out
[listing snipped]
> ex
[listing snipped]
I suggest using this function here:
robust.system <- function (cmd) {
stderrFile = tempfile(pattern="R_robust.system_stderr", fileext=as.character(Sys.getpid()))
stdoutFile = tempfile(pattern="R_robust.system_stdout", fileext=as.character(Sys.getpid()))
retval = list()
retval$exitStatus = system(paste0(cmd, " 2> ", shQuote(stderrFile), " > ", shQuote(stdoutFile)))
retval$stdout = readLines(stdoutFile)
retval$stderr = readLines(stderrFile)
unlink(c(stdoutFile, stderrFile))
return(retval)
}
This will only work on a Unix-like shell that accepts > and 2> notations, and the cmd argument should not redirect output itself. But it does the trick:
> robust.system("ls -la")
$exitStatus
[1] 0
$stdout
[1] "total 160"
[2] "drwxr-xr-x 14 asieira staff 476 10 Jun 18:18 ."
[3] "drwxr-xr-x 12 asieira staff 408 9 Jun 20:13 .."
[4] "-rw-r--r-- 1 asieira staff 6736 5 Jun 19:32 .Rapp.history"
[5] "-rw-r--r-- 1 asieira staff 19109 11 Jun 20:44 .Rhistory"
$stderr
character(0)

Linux proc/pid/fd for stdout is 11?

Executing a script with stdout redirected to a file. So /proc/$$/fd/1 should point to that file (since stdout fileno is 1). However, actual fd of the file is 11. Please, explain, why.
Here is session:
$ cat hello.sh
#!/bin/sh -e
ls -l /proc/$$/fd >&2
$ ./hello.sh > /tmp/1
total 0
lrwx------ 1 nga users 64 May 28 22:05 0 -> /dev/pts/0
lrwx------ 1 nga users 64 May 28 22:05 1 -> /dev/pts/0
lr-x------ 1 nga users 64 May 28 22:05 10 -> /home/me/hello.sh
l-wx------ 1 nga users 64 May 28 22:05 11 -> /tmp/1
lrwx------ 1 nga users 64 May 28 22:05 2 -> /dev/pts/0
I have a suspicion, but this is highly dependent on how your shell behaves. The file descriptors you see are:
0: standard input
1: standard output
2: standard error
10: the running script
11: a backup copy of the script's normal standard out
Descriptors 10 and 11 are close on exec, so won't be present in the ls process. 0-2 are, however, prepared for ls before forking. I see this behaviour in dash (Debian Almquist shell), but not in bash (Bourne again shell). Bash instead does the file descriptor manipulations after forking, and incidentally uses 255 rather than 10 for the script. Doing the change after forking means it won't have to restore the descriptors in the parent, so it doesn't have the spare copy to dup2 from.
The output of strace can be helpful here.
The relevant section is
fcntl64(1, F_DUPFD, 10) = 11
close(1) = 0
fcntl64(11, F_SETFD, FD_CLOEXEC) = 0
dup2(2, 1) = 1
stat64("/home/random/bin/ls", 0xbf94d5e0) = -1 ENOENT (No such file or
+++++++>directory)
stat64("/usr/local/bin/ls", 0xbf94d5e0) = -1 ENOENT (No such file or directory)
stat64("/usr/bin/ls", 0xbf94d5e0) = -1 ENOENT (No such file or directory)
stat64("/bin/ls", {st_mode=S_IFREG|0755, st_size=96400, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
+++++++>child_tidptr=0xb75a8938) = 22748
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 22748
--- SIGCHLD (Child exited) # 0 (0) ---
dup2(11, 1) = 1
So, the shell moves the existing stdout to an available file descriptor above 10 (namely, 11), then moves the existing stderr onto its own stdout (due to the >&2 redirect), then restores 11 to its own stdout when the ls command is finished.

Resources