I have a specific data format with 'n' (arbitrary) rows and '4' columns. If 'n' is '10', the example data would look like this.
1.01e+00 -2.01e-02 -3.01e-01 4.01e+02
1.02e+00 -2.02e-02 -3.02e-01 4.02e+02
1.03e+00 -2.03e-02 -3.03e-01 4.03e+02
1.04e+00 -2.04e-02 -3.04e-01 4.04e+02
1.05e+00 -2.05e-02 -3.05e-01 4.05e+02
1.06e+00 -2.06e-02 -3.06e-01 4.06e+02
1.07e+00 -2.07e-02 -3.07e-01 4.07e+02
1.08e+00 -2.08e-02 -3.08e-01 4.07e+02
1.09e+00 -2.09e-02 -3.09e-01 4.09e+02
1.10e+00 -2.10e-02 -3.10e-01 4.10e+02
The constraints on this input are:
The data should have '4' columns.
The data are separated by white space.
I want to implement a check that the input file has '4' columns in every row. I built my own, based on M.S.B's answer to the post "Reading data file in Fortran with known number of lines but unknown number of entries in each line".
program readtest
use :: iso_fortran_env
implicit none
character(len=512) :: buffer
integer :: i, i_line, n, io, pos, pos_tmp, n_space
integer,parameter :: max_len = 512
character(len=max_len) :: filename
filename = 'data_wrong.dat'
open(42, file=trim(filename), status='old', action='read')
print *, '+++++++++++++++++++++++++++++++++++'
print *, '+ Count lines +'
print *, '+++++++++++++++++++++++++++++++++++'
n = 0
i_line = 0
do
pos = 1
pos_tmp = 1
i_line = i_line+1
read(42, '(a)', iostat=io) buffer
(*1)! Count blank spaces.
n_space = 0
do
pos = index(buffer(pos+1:), " ") + pos
if (pos /= 0) then
if (pos > pos_tmp+1) then
n_space = n_space+1
pos_tmp = pos
else
pos_tmp = pos
end if
endif
if (pos == max_len) then
exit
end if
end do
pos_tmp = pos
if (io /= 0) then
exit
end if
print *, '> line : ', i_line, ' n_space : ', n_space
n = n+1
end do
print *, ' >> number of line = ', n
end program
If I run the above program with an input file that contains some malformed rows, like the following,
1.01e+00 -2.01e-02 -3.01e-01 4.01e+02
1.02e+00 -2.02e-02 -3.02e-01 4.02e+02
1.03e+00 -2.03e-02 -3.03e-01 4.03e+02
1.04e+00 -2.04e-02 -3.04e-01 4.04e+02
1.05e+00 -2.05e-02 -3.05e-01 4.05e+02
1.06e+00 -2.06e-02 -3.06e-01 4.06e+02
1.07e+00 -2.07e-02 -3.07e-01 4.07e+02
1.0 2.0 3.0
1.08e+00 -2.08e-02 -3.08e-01 4.07e+02 1.00
1.09e+00 -2.09e-02 -3.09e-01 4.09e+02
1.10e+00 -2.10e-02 -3.10e-01 4.10e+02
The output is like this,
+++++++++++++++++++++++++++++++++++
+ Count lines +
+++++++++++++++++++++++++++++++++++
> line : 1 n_space : 4
> line : 2 n_space : 4
> line : 3 n_space : 4
> line : 4 n_space : 4
> line : 5 n_space : 4
> line : 6 n_space : 4
> line : 7 n_space : 4
> line : 8 n_space : 3 (*2)
> line : 9 n_space : 5 (*3)
> line : 10 n_space : 4
> line : 11 n_space : 4
>> number of line = 11
As you can see, the malformed rows are detected as I intended (see (*2) and (*3)), and I can add 'if' statements to print error messages.
But I think my code is 'extremely' ugly, since I had to do something like (*1) to count consecutive white spaces as one space. I think there should be a much more elegant way to ensure that each row contains exactly '4' columns, say,
read(*,'4(X, A)') line
(which didn't work)
My program would also fail if a line is longer than 'max_len', which is set to '512' in this case. '512' should be enough for most practical purposes, but I also want my checking subroutine to be robust in this respect.
So I want to improve my subroutine in at least these aspects:
It should be more elegant (unlike (*1)).
It should be more general (especially with regard to 'max_len').
Does anyone have experience in building this kind of input-checking subroutine?
Any comments would be highly appreciated.
Thank you for reading the question.
Without knowledge of the exact data format, I think it would be rather difficult to achieve what you want (or at least, I wouldn't know how to do it).
In the most general case, I think your space counting idea is the most robust and correct.
It can be adapted to avoid the maximum string length problem you describe.
In the following code, I go through the data as an unformatted, stream access file.
Basically, you read every character and take note of newlines and spaces.
As you did, you use spaces to count the columns (skipping repeated spaces) and newline characters to count the rows.
However, here we do not read the entire line as a string and then search it for spaces; we read character by character, which avoids the fixed string length problem and leaves us with a single loop. Hope it helps.
EDIT: now handles white space at the beginning and end of a line, and empty lines.
program readtest
use :: iso_fortran_env
implicit none
character :: old_char, new_char
integer :: line, io, cols
logical :: beg_line
integer,parameter :: max_len = 512
character(len=max_len) :: filename
filename = 'data_wrong.txt'
! Output format to be used later
100 format (a, 3x, i0, a, 3x , i0)
open(42, file=trim(filename), status='old', action='read', &
form="unformatted", access="stream")
! set utils
old_char = " "
line = 0
beg_line = .true.
cols = 0
! Start scanning char by char
do
read(42, iostat = io) new_char
! Exit if EOF
if (io < 0) then
exit
end if
! Deal with empty lines
if (beg_line .and. new_char==new_line(new_char)) then
line = line + 1
write(*, 100, advance="no") "Line number:", line, &
"; Columns number:", cols
write(*,'(6x, a)') "EMPTYLINE"
! Deal with beginning of line for white spaces
elseif (beg_line) then
beg_line = .false.
! this indicates new columns
elseif (new_char==" " .and. old_char/=" ") then
cols = cols + 1
! End of line: time to print
elseif (new_char==new_line(new_char)) then
if (old_char/=" ") then
cols = cols+1
endif
line = line + 1
! Printing out results
write(*, 100, advance="no") "Line number:", line, &
"; Columns number:", cols
if (cols == 4) then
write(*,'(6x, a5)') "OK"
else
write(*,'(6x, a5)') "ERROR"
end if
! Restart with a new line (reset counters)
cols = 0
beg_line = .true.
end if
old_char = new_char
end do
end program
This is the output of this program:
Line number: 1; Columns number: 4 OK
Line number: 2; Columns number: 4 OK
Line number: 3; Columns number: 4 OK
Line number: 4; Columns number: 4 OK
Line number: 5; Columns number: 4 OK
Line number: 6; Columns number: 4 OK
Line number: 7; Columns number: 4 OK
Line number: 8; Columns number: 3 ERROR
Line number: 9; Columns number: 5 ERROR
Line number: 10; Columns number: 4 OK
Line number: 11; Columns number: 4 OK
If you knew your data format, you could read each line into a vector of dimension 4 and use the iostat variable to print an error for every line where iostat is greater than 0.
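For comparison, the same character-by-character scan can be sketched in Python. This is only an illustration of the technique, not a translation of the Fortran program: the name `count_columns` and its `expected` parameter are made up here, and tabs and carriage returns are treated as separators too, which the Fortran version does not do.

```python
import io

def count_columns(stream, expected=4):
    """Yield (line_number, n_columns, ok) for each line of a text stream.

    The stream is consumed one character at a time, so no fixed-size
    line buffer is needed.
    """
    line_no = 0
    cols = 0
    in_field = False   # True while inside a run of non-blank characters
    seen_any = False   # True once the current line has at least one character
    while True:
        ch = stream.read(1)
        if ch == "":                      # end of file
            if seen_any:                  # last line had no trailing newline
                line_no += 1
                yield line_no, cols, cols == expected
            return
        if ch == "\n":                    # end of line: report and reset
            line_no += 1
            yield line_no, cols, cols == expected
            cols, in_field, seen_any = 0, False, False
            continue
        seen_any = True
        if ch in " \t\r":                 # separator: leave the current field
            in_field = False
        elif not in_field:                # first character of a new field
            in_field = True
            cols += 1

sample = "1.0 2.0 3.0 4.0\n1.0 2.0 3.0\n  1 2 3 4 5  \n"
for row in count_columns(io.StringIO(sample)):
    print(row)
```

A column is counted when a field starts rather than when it ends, so leading and trailing blanks need no special handling.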
Instead of counting whitespace you can use manipulation of substrings to get what you want. A simple example follows:
program foo
implicit none
character(len=512) str ! Assume str is sufficiently long buffer
integer fd, cnt, m, n
open(newunit=fd, file='test.dat', status='old')
do
cnt = 0
read(fd,'(A)',end=10) str
str = adjustl(str) ! Eliminate possible leading whitespace
do
n = index(str, ' ') ! Find first space
if (n /= 0) then
write(*, '(A)', advance='no') str(1:n)
str = adjustl(str(n+1:))
end if
if (len_trim(str) == 0) exit ! Trailing whitespace
cnt = cnt + 1
end do
if (cnt /= 3) then ! cnt is one less than the number of fields, so 4 columns give cnt == 3
write(*,'(A)') ' Error'
else
write(*,*)
end if
end do
10 close(fd)
end program foo
The program below should read any line of reasonable length (up to the line limit your compiler defaults to, which is generally 2 GB nowadays). You could change it to stream I/O to have no limit, but most Fortran compilers have trouble reading stream I/O from stdin, which is what this example reads from. So if the line looks anything like a list of numbers, it should read them, tell you how many it read, and let you know if it had an error reading any value as a number (character strings, strings bigger than the size of a REAL value, ...). All the parts here are explained on the Fortran Wiki, but to keep it short this is a stripped-down version that just puts the pieces together. The oddest behavior it has is that if you enter something like this, with a slash in it,
10 20,,30,40e4 50 / this is a list of numbers
it would treat everything after the slash as a comment and return five values without generating a non-zero status. For a more detailed explanation of the code, see the annotated pieces on the Wiki; search for "getvals" and "readline".
So with this program you can read a line, and if the return status is zero and the number of values read is four, you should be good, except for a few dusty corners where the lines would definitely not look like a list of numbers.
module M_getvals
implicit none
private
public getvals, readline
contains
subroutine getvals(line,values,icount,ierr)
character(len=*),intent(in) :: line
real :: values(:)
integer,intent(out) :: icount, ierr
character(len=:),allocatable :: buffer
character(len=len(line)) :: words(size(values))
integer :: ios, i
ierr=0
words=' '
buffer=trim(line)//"/"
read(buffer,*,iostat=ios) words
icount=0
do i=1,size(values)
if(words(i).eq.'') cycle
read(words(i),*,iostat=ios)values(icount+1)
if(ios.eq.0)then
icount=icount+1
else
ierr=ios
write(*,*)'*getvals* WARNING:['//trim(words(i))//'] is not a number'
endif
enddo
end subroutine getvals
subroutine readline(line,ier)
character(len=:),allocatable,intent(out) :: line
integer,intent(out) :: ier
integer,parameter :: buflen=1024
character(len=buflen) :: buffer
integer :: last, isize
line=''
ier=0
INFINITE: do
read(*,iostat=ier,fmt='(a)',advance='no',size=isize) buffer
if(isize.gt.0)line=line//buffer(:isize)
if(is_iostat_eor(ier))then
last=len(line)
if(last.ne.0)then
if(line(last:last).eq.'\\')then
line=line(:last-1)
cycle INFINITE
endif
endif
ier=0
exit INFINITE
elseif(ier.ne.0)then
exit INFINITE
endif
enddo INFINITE
line=trim(line)
end subroutine readline
end module M_getvals
program tryit
use M_getvals, only: getvals, readline
implicit none
character(len=:),allocatable :: line
real,allocatable :: values(:)
integer :: icount, ier, ierr
INFINITE: do
call readline(line,ier)
if(allocated(values))deallocate(values)
allocate(values(len(line)/2+1))
if(ier.ne.0)exit INFINITE
call getvals(line,values,icount,ierr)
write(*,'(*(g0,1x))')'VALUES=',values(:icount),'NUMBER OF VALUES=',icount,'STATUS=',ierr
enddo INFINITE
end program tryit
Honestly, it should work reasonably with just about any line you throw at it.
PS:
If you are always reading four values, using list-directed I/O, checking the iostat= value on READ, and checking whether you hit EOR would be very simple (just a few lines). But since you said you wanted to read lines of arbitrary length, I am assuming four values on a line was just an example and you wanted something very generic.
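The idea of "read whatever is on the line, then check the count and the status" can be sketched in Python as follows. This is not a port of getvals: `parse_values` and `line_is_valid` are hypothetical helpers, and they only imitate two of the conventions mentioned above, namely blanks or commas as separators and a slash ending the input. Other list-directed features (repeat counts, quoted strings) are not reproduced.

```python
import re

def parse_values(line):
    """Return (values, n_bad) for one line of text.

    values holds every token that parsed as a float; n_bad counts the
    tokens that did not.
    """
    data = line.split("/", 1)[0]                  # ignore everything after a slash
    tokens = [t for t in re.split(r"[,\s]+", data) if t]
    values, n_bad = [], 0
    for tok in tokens:
        try:
            values.append(float(tok))
        except ValueError:
            n_bad += 1
    return values, n_bad

def line_is_valid(line, expected=4):
    """True if the line holds exactly `expected` numbers and nothing else."""
    values, n_bad = parse_values(line)
    return n_bad == 0 and len(values) == expected

print(parse_values("10 20,,30,40e4 50 / this is a list of numbers"))
print(line_is_valid("1.01e+00 -2.01e-02 -3.01e-01 4.01e+02"))
print(line_is_valid("1.0 2.0 3.0"))
```

Because the check is on the number of parsed values, both the short row and the row with an extra value from the question's test file would be rejected.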
I have a text file and I would like to write a function that reads this file and returns a list of tuples, where each tuple consists of the word as a string, the word's line number as an int, and the position of the last character of the word as an int. Sample input:
example of the first line
followed by the second line
Sample output:
[
("example",1,8);
("of",1,11);
("the",1,15);
("first",1,21);
("line",1,26);
("followed",2,13);
("by",2,16);
("the",2,20);
("second",2,27);
("line",2,32)
]
The function that you are looking for looks something like this,
let read filename =
In_channel.read_lines filename |>
List.mapi ~f:(fun line data ->
String.split data ~on:' ' |>
List.fold_map ~init:0 ~f:(fun pos word ->
let pos = pos + String.length word in
pos+1, (word,line+1,pos-1)) |>
snd) |>
List.concat
Here is how to use it. First install the dependencies,
opam install dune stdio merlin
Next, setup your project,
dune init exe readlines --libs=base,stdio
Then open readlines.ml in your favorite editor and substitute its contents with the following,
open Base
open Stdio
let read filename =
In_channel.read_lines filename |>
List.mapi ~f:(fun line data ->
String.split data ~on:' ' |>
List.fold_map ~init:0 ~f:(fun pos word ->
let pos = pos + String.length word in
pos+1, (word,line+1,pos-1)) |>
snd) |>
List.concat
let print =
  List.iter ~f:(fun (word,line,pos) ->
      printf "(%s,%d,%d)\n" word line pos)
let main filename =
print (read filename)
let () = match Sys.get_argv () with
| [|_; filename|] -> main filename
| _ -> failwith "expects one argument: filename"
To run and test, create a sample input, e.g. a file named test.txt
example of the first line
followed by the second line
(make sure that the last line is followed by a newline)
Now you can run it,
dune exec ./readlines.exe test.txt
The result should be the following,
(example,1,6)
(of,1,9)
(the,1,13)
(first,1,19)
(line,1,24)
(followed,2,7)
(by,2,10)
(the,2,14)
(second,2,21)
(line,2,26)
(Notice, that I am counting positions from 0 not from 1).
You can also run this code interactively in utop, but you would need to install base and stdio and load them into the interpreter, with
#require "base";;
#require "stdio";;
If you're not using utop but the default OCaml toplevel, you need to also install ocamlfind (opam install ocamlfind) and do
#use "topfind";;
#require "base";;
#require "stdio";;
If you want to use only the standard library, you can do what you want with String.split_on_char plus a little extra work applied to each line.
Here is an example of how you could do it for the first line:
let ic = open_in (*your file name*) in
let first_line = input_line ic in
let words = String.split_on_char ' ' first_line in
let rec aux accLen =
function
| [] -> []
| s :: ts ->
match s with
(* empty string means that there was a white space before the split *)
| "" -> aux (accLen +1) ts
| s -> let l = accLen + String.length s in (1, s, l) :: aux l ts
in aux 0 words;;
As ivg said, you can replace the aux function with a List.fold_left :
let ic = open_in (*your file name*) in
let first_line = input_line ic in
let words = String.split_on_char ' ' first_line in
let _, l = List.fold_left (
fun (accLen, accRes) ->
function
| "" -> (accLen+1, accRes)
| s -> let l = accLen + String.length s in (l, (1, s, l) :: accRes)
) (0, []) words
in List.rev l;;
Doesn't include the file I/O component, but does properly handle multiple spaces between words, including tabs. Some fun use of fold_left to entertain a new OCaml programmer.
let words_with_last_index line =
line ^ " "
|> String.to_seqi
|> Seq.fold_left
(fun (wspace, cur_word, words) (cur_pos, cur_ch) ->
match cur_ch with
| ' ' when wspace || cur_pos = 0 -> (true, cur_word, words)
| '\t' when wspace || cur_pos = 0 -> (true, cur_word, words)
| ' ' | '\t' -> (true, "", words @ [(cur_word, cur_pos - 1)])
| ch -> (false, cur_word ^ String.make 1 ch, words))
(false, "", [])
|> (fun (_, _, collection) -> collection)
let parse_lines text =
String.split_on_char '\n' text
|> List.mapi
(fun i line ->
line
|> words_with_last_index
|> List.map (fun (word, pos) -> (word, i + 1, pos)))
|> List.flatten
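As a cross-check of the expected positions, the same task can be sketched in Python with a regular expression. This is an illustration only, not part of any of the OCaml answers; `word_positions` is a made-up name, and positions are counted from 0 as in the first answer.

```python
import re

def word_positions(text):
    """Return (word, line_number, last_index) triples.

    Line numbers start at 1; last_index is the 0-based position of the
    word's final character within its line.
    """
    result = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # \S+ matches a maximal run of non-whitespace, so repeated
        # spaces and tabs between words are skipped automatically.
        for m in re.finditer(r"\S+", line):
            result.append((m.group(), line_no, m.end() - 1))
    return result

sample = "example of the first line\nfollowed by the second line\n"
for triple in word_positions(sample):
    print(triple)
```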
This program predicts the first winning move of the famous Game of Nim. I just need a little help figuring out a problem in the code. The input file reads something like this.
3
13 4 5
29 5 1
34 4 50
The first number would represent the number of lines following the first line that the program has to read. So if the case was
2
**13 4 5
29 5 1**
34 4 50
it would only read the next two lines following it.
This is my code so far:
def main ():
    nim_file = open('nim.txt', 'r')
    first_line = nim_file.readline()
    counter = 1
    n = int (first_line)
    for line in nim_file:
        for j in range(1, n):
            a, b, c = [int(i) for i in line.split()]
            nim_sum = a ^ b ^ c
            if nim_sum == 0:
                print ("Heaps:", a, b, c, ": " "You Lose!")
            else:
                p = a ^ nim_sum
                q = b ^ nim_sum
                r = c ^ nim_sum
                if p < a:
                    stack1 = a - p
                    print ("Heaps:", a, b, c, ": " "remove", stack1, "from Heap 1")
                elif q < b:
                    stack2 = b - q
                    print ("Heaps:", a, b, c, ": " "remove", stack2, "from Heap 2")
                elif r < c:
                    stack3 = c - r
                    print ("Heaps:", a, b, c, ": " "remove", stack3, "from Heap 3")
                else:
                    print ("Error")
    nim_file.close()

main()
I converted the first line to an int and at first tried a while loop with a counter, so that the counter wouldn't go above the value of n, but that didn't work. Any thoughts?
If the file is small, just load the whole thing:
lines = open('nim.txt').readlines()
interesting_lines = lines[1:int(lines[0])+1]
and continue from there.
You have two nested for statements, and the inner one doesn't make much sense. You need to keep just one, like this:
for _ in range(n):
    a, b, c = [int(i) for i in nim_file.readline().split()]
and remove for line in nim_file. Also check out this question and consider using the with statement to handle the file opening/closing.
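Putting these suggestions together, a minimal sketch could look like the following. It assumes the input layout from the question (a count on the first line, then three heap sizes per line); `first_moves` is a hypothetical name, and it returns its results instead of printing them so that it is easy to check.

```python
def first_moves(lines):
    """lines: iterable of text lines, the first one holding the count n.

    Returns a list of (heaps, move) pairs, where move is None for a
    losing position or (heap_number, stones_to_remove) otherwise.
    """
    it = iter(lines)
    n = int(next(it))
    results = []
    for _ in range(n):                      # read exactly n data lines
        heaps = [int(tok) for tok in next(it).split()]
        nim_sum = 0
        for h in heaps:
            nim_sum ^= h
        if nim_sum == 0:
            results.append((heaps, None))
            continue
        for idx, h in enumerate(heaps, start=1):
            target = h ^ nim_sum            # size this heap should shrink to
            if target < h:
                results.append((heaps, (idx, h - target)))
                break
    return results

sample = ["3\n", "13 4 5\n", "29 5 1\n", "34 4 50\n"]
for heaps, move in first_moves(sample):
    print(heaps, move)
```

An open file is itself an iterable of lines, so with a real file the call would be `with open('nim.txt') as f: first_moves(f)`, which also closes the file automatically.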
Here I am trying to find the index of '-' followed by '}' in a String.
For an input like substrIndex "abcd -} sad" it gives me an output of 10,
which is the entire string length.
Also, if I do something like substrIndex "abcd\n -} sad" it gives me 6.
Why is that so with \n? What am I doing wrong? Please correct me, I'm a noob.
substrIndex :: String -> Int
substrIndex ""=0
substrIndex (s:"") = 0
substrIndex (s:t:str)
  | s == '-' && t == '}' = 0
  | otherwise = 2 + (substrIndex str)
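Your program has a bug: it checks the characters two at a time. But what if the - and } fall into different pairs, for example in S-}?
It will first check whether S and - are equal to - and } respectively.
Since they don't match, it drops both and moves on with } alone, so the match is never seen. That is what happens with "abcd -} sad": the - sits at the odd index 5, so every pair is skipped and the result is 2 for each of the five pairs, i.e. 10. In "abcd\n -} sad" the extra \n shifts the - to the even index 6, so -} lines up with a pair and 6 is in fact the right answer.
So, you just need to change the logic a little bit, like this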
substrIndex (s:t:str)
  | s == '-' && t == '}' = 0
  | otherwise = 1 + (substrIndex (t:str))
Now, if the current pair doesn't match -}, then just skip the first character and proceed with the second character, substrIndex (t:str). So, if S- doesn't match, your program will proceed with -}. Since we dropped only one character we add only 1, instead of 2.
This can be shortened and written clearly, as suggested by user2407038, like this
substrIndex :: String -> Int
substrIndex [] = 0
substrIndex ('-':'}':_) = 0
substrIndex (_:xs) = 1 + substrIndex xs
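The difference between the two strategies can be checked with a short Python sketch. Both functions are illustrative stand-ins, not translations endorsed by the answer: `pairwise_index` mimics the original two-at-a-time recursion, and `sliding_index` mimics the final shortened version, including its behavior of returning the string length when -} is absent.

```python
def pairwise_index(s):
    """Mimic the original code: compare characters in non-overlapping pairs."""
    i = 0
    while i + 1 < len(s):
        if s[i] == '-' and s[i + 1] == '}':
            return i
        i += 2                     # skips BOTH characters, so odd offsets are missed
    return i

def sliding_index(s):
    """Mimic the corrected code: advance one character at a time."""
    for i in range(len(s)):
        if s.startswith("-}", i):
            return i
    return len(s)                  # not found: same result as the Haskell version

for text in ["abcd -} sad", "abcd\n -} sad", "S-}"]:
    print(repr(text), pairwise_index(text), sliding_index(text))
```

For the two inputs from the question, the pairwise scan gives 10 and 6, while the one-step scan gives 5 and 6.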
I am looking for a gsub-based function that would let me do combinatorial string replacement, so that if I have an arbitrary number of string replacement rules
replrules=list("<x>"=c(3,5),"<ALK>"=c("hept","oct","non"),"<END>"=c("ane","ene"))
and a target string
string="<x>-methyl<ALK><END>"
it would give me a dataframe with the final string name and the substitutions that were made as in
name x ALK END
3-methylheptane 3 hept ane
5-methylheptane 5 hept ane
3-methyloctane 3 oct ane
5-methyloctane 5 ... ...
3-methylnonane 3
5-methylnonane 5
3-methylheptene 3
5-methylheptene 5
3-methyloctene 3
5-methyloctene 5
3-methylnonene 3
5-methylnonene 5
The target string would be of arbitrary structure, e.g. it could also be string="1-<ALK>anol" or each pattern could occur several times, as in string="<ALK>anedioic acid, di<ALK>yl ester"
What would be the most elegant way to do this kind of thing in R?
How about
d <- do.call(expand.grid, replrules)
d$name <- paste0(d$'<x>', "-", "methyl", d$'<ALK>', d$'<END>')
EDIT
This seems to work (substituting each of these into the strsplit)
string = "<x>-methyl<ALK><END>"
string2 = "<x>-ethyl<ALK>acosane"
string3 = "1-<ALK>anol"
Using Richard's regex
d <- do.call(expand.grid, list(replrules, stringsAsFactors=FALSE))
names(d) <- gsub("<|>","",names(d))
s <- strsplit(string3, "(<|>)", perl = TRUE)[[1]]
out <- list()
for(i in s) {
out[[i]] <- ifelse (i %in% names(d), d[i], i)
}
d$name <- do.call(paste0, unlist(out, recursive=F))
EDIT
This should work for repeat items
d <- do.call(expand.grid, list(replrules, stringsAsFactors=FALSE))
names(d) <- gsub("<|>","",names(d))
string4 = "<x>-methyl<ALK><END>oate<ALK>"
s <- strsplit(string4, "(<|>)", perl = TRUE)[[1]]
out <- list()
for(i in seq_along(s)) {
out[[i]] <- ifelse (s[i] %in% names(d), d[s[i]], s[i])
}
d$name <- do.call(paste0, unlist(out, recursive=F))
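The underlying idea (take the Cartesian product of the rule values, then substitute each combination into the template) can also be sketched outside R. The Python version below is only an illustration: `expand` is a made-up name, it returns a list of dicts rather than a data frame, and the row order follows `itertools.product`, which may differ from `expand.grid`. It assumes that no replacement value itself contains a placeholder.

```python
from itertools import product

def expand(template, rules):
    """Return one dict per combination of replacement values.

    Only placeholders that actually occur in the template are expanded,
    and every occurrence of a placeholder gets the same value.
    """
    keys = [k for k in rules if k in template]
    rows = []
    for combo in product(*(rules[k] for k in keys)):
        name = template
        for key, value in zip(keys, combo):
            name = name.replace(key, str(value))
        row = {"name": name}
        # Column names are the placeholders without the angle brackets.
        row.update({k.strip("<>"): v for k, v in zip(keys, combo)})
        rows.append(row)
    return rows

replrules = {"<x>": [3, 5],
             "<ALK>": ["hept", "oct", "non"],
             "<END>": ["ane", "ene"]}
for row in expand("<x>-methyl<ALK><END>", replrules):
    print(row)
```

Because `str.replace` substitutes every occurrence, a template such as "<ALK>anedioic acid, di<ALK>yl ester" needs no special handling for repeated placeholders.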
Well, I'm not exactly sure we can even produce a "correct" answer to your question, but hopefully this helps give you some ideas.
Okay, so in s, I just split the string where it might be of most importance. Then g gets the first value in each element of replrules. Then I constructed a data frame as an example, so dat is a one-row example of how it would look.
> (s <- strsplit(string, "(?<=l|\\>)", perl = TRUE)[[1]])
# [1] "<x>" "-methyl" "<ALK>" "<END>"
> g <- sapply(replrules, "[", 1)
> dat <- data.frame(name = paste(append(g, s[2], after = 1), collapse = ""))
> dat[2:4] <- g
> names(dat)[2:4] <- sapply(strsplit(names(g), "<|>"), "[", -1)
> dat
# name x ALK END
# 1 3-methylheptane 3 hept ane