Parsec csv parser parsing extra line - haskell

I have defined the following Parsec parser for parsing csv files into a table of strings, i.e. [[String]]
-- A csv file is some rows separated, and possibly ended, by a newline character
csvParser = sepEndBy row (char '\n')
-- A row is some cells separated by a comma character
row = sepBy cell (char ',')
-- A cell is either a quoted cell or a normal cell
cell = qcell <|> ncell
-- A normal cell is a series of characters, each either an escaped character or anything other than , and newline
ncell = many (escChar <|> noneOf ",\n")
-- A quoted cell is a " followed by characters that are either escaped characters or anything other than ", then a closing "
qcell = do
    char '"'
    res <- many (escChar <|> noneOf "\"")
    char '"'
    return res
-- An escaped character is any character preceded by a \. The \ is discarded.
escChar = char '\\' >> anyChar
I don't really know whether the comments are excessive and annoying or whether they help. As a Parsec noob they would help me, so I thought I would add them.
It works pretty well, but there is a problem: it creates an extra, empty row in the table. For example, if I have a csv file with 10 rows (that is, only 10 lines, with no empty lines at the end*), the [[String]] structure has length 11, and the last list of Strings contains one element: an empty String (at least that is how it appears when printing it with show).
My main question is: Why does this extra row appear, and what can I do to stop it?
Another thing I have noticed is that if there are empty lines after the data in the csv file, these end up in the table as rows containing only an empty String. I thought that using sepEndBy instead of sepBy would make the extra empty lines be ignored. Is this not the case?
*After looking at the text file in a hex editor, it seems that it indeed actually ends in a newline character, even though vim doesn't show it...

If you want each row to have at least one cell, you can use sepBy1 instead of sepBy. The difference between sepBy and sepBy1 is the same as the difference between many and many1: the 1 version only parses sequences of at least one element. So row becomes this:
row = sepBy1 cell (char ',')
Also, the usual style is to write sepBy1 infix: cell `sepBy1` char ','. This reads more naturally: you have a "cell separated by a comma" rather than "separated by cell a comma".
EDIT: On its own this does not remove the extra row, because ncell uses many and therefore succeeds on empty input, so after the final newline row still succeeds with a single empty cell. If you don't want to accept empty cells, you have to specify that ncell has at least one character using many1:
ncell = many1 (escChar <|> noneOf ",\n")

Python List Formatting and Updation

I have a list, e.g. a = ["dgbbgfbjhffbjjddvj/n//n//n' "]
How do I remove the trailing newlines, i.e. all the /n together with the extra single quote at the end?
Expected result = ["dfgjhgjjhgfjjfgg"] (I typed it randomly)
You can use the string rstrip() method.
Usage:
str.rstrip([c])
where c is the set of characters to trim; whitespace is trimmed by default when no argument is given.
Example:
a = ['Return a copy of the string\n', 'with trailing characters removed\n\n']
[i.rstrip('\n') for i in a]
Result:
['Return a copy of the string', 'with trailing characters removed']
More about rstrip():
https://www.tutorialspoint.com/python3/string_rstrip.htm
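The list in the question appears to contain literal "/n" sequences, a single quote and a space rather than real newline characters, so rstrip('\n') would leave it unchanged. A minimal sketch, assuming that reading of the input: rstrip treats its argument as a set of characters, so every unwanted trailing character can be listed at once. The helper name strip_trailing is made up for this illustration.

```python
def strip_trailing(items, chars):
    """Return a new list with any trailing characters from `chars` removed.

    `chars` is treated as a set of characters, not as a suffix.
    """
    return [s.rstrip(chars) for s in items]


# Input copied from the question: literal "/n" sequences, a quote and a space.
a = ["dgbbgfbjhffbjjddvj/n//n//n' "]
cleaned = strip_trailing(a, "/n' ")
print(cleaned)

# Real newline characters are handled the same way.
print(strip_trailing(["abc\n\n"], "\n"))
```

Because the argument is a character set, a string whose real content ends in "n" or "/" would lose those characters too, so this only fits data where that cannot happen.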

Parsing block comments with Megaparsec using symbols for start and end

I want to parse text similar to this in Haskell using Megaparsec.
# START SKIP
def foo(a,b):
    c = 2*a # Foo
    return a + b
# END SKIP
Here # START SKIP and # END SKIP mark the start and end of the block of text to parse.
Unlike skipBlockComment, I want the parser to return the lines between the start and end markers.
This is my parser.
skip :: Parser String
skip = s >> manyTill anyChar e
  where s = string "# START SKIP"
        e = string "# END SKIP"
The skip parser works as intended.
To allow for a variable amount of white space within the start and end markers, for example #   START   SKIP, I've tried the following:
skip' :: Parser String
skip' = s >> manyTill anyChar e
  where s = symbol "#" >> symbol "START" >> symbol "SKIP"
        e = symbol "#" >> symbol "END" >> symbol "SKIP"
Using skip' to parse the above text gives the following error.
3:15:
unexpected 'F'
expecting "END", space, or tab
I would like to understand the cause of this error and how I can fix it.
As Alec already commented, the problem is that as soon as e encounters '#', it counts as a consumed character. And the way parsec and its derivatives work is that as soon as you've consumed any characters, you're committed to that parsing branch – i.e. the manyTill anyChar alternative is then not considered anymore, even though e ultimately fails here.
You can easily request backtracking though, by wrapping the end delimiter in try:
skip' :: Parser String
skip' = s >> manyTill anyChar e
  where s = symbol "#" >> symbol "START" >> symbol "SKIP"
        e = try $ symbol "#" >> symbol "END" >> symbol "SKIP"
Before consuming '#', try sets a “checkpoint”; when e fails later on (in your example, at "Foo"), the parser rewinds to that checkpoint and acts as if no characters had matched at all.
In fact, traditional parsec would behave the same way for skip as well. But because matching a string and succeeding only if it matches entirely is such a common task, megaparsec's string is implemented like try . string, i.e. if the failure occurs within that fixed string, it always backtracks.
However, compound parsers still don't backtrack by default the way they do in attoparsec. The main reason is that if anything can backtrack to any point, there is no clear point of failure to show in the error message.

How to remove extra spaces in between string, matlab?

I have created a script to convert text to Morse code, and now I want to modify it to include a slash between words, i.e. space, slash, space between Morse code words. I know my loop before the main loop is incorrect, and I want to fix it to do as stated above. I just really need help. Thank you!
...
Word=input('Please enter a word:','s');
...
Code=MC_1;
...
case ' '
Code='/'
otherwise
Valid=0;
end
if Valid
fprintf('%s ',Code);
else
disp('Input has invalid characters!')
break
end
I know you want to write a loop to remove multiple spaces in between words, but the best way to remove white space in your particular problem is to use regular expressions, specifically regexprep. Regular expressions are used to search for particular patterns / substrings within a larger string. In this case, we are looking for substrings that consist of one or more whitespace characters. regexprep finds substrings that match a pattern and replaces them with another string. Here, you would search for every run of one or more whitespace characters and replace it with a single space. Also, I see that you've trimmed both leading and trailing whitespace from the string using strtrim, which is great. Now, all you need to do is call regexprep like so:
Word = regexprep(Word, '\s+', ' ');
\s+ is the regular expression for finding at least one white space character. We then replace this with a single whitespace. As such, supposing we had this string stored in Word:
Word = '   hello    how   are   you   ';
Doing a trim of leading and trailing whitespace, then calling regexprep in the way we talked about thus gives:
Word = strtrim(Word);
Word = regexprep(Word, '\s+', ' ')
Word =
hello how are you
As you can see, the leading and trailing white space was removed with strtrim, and the regular expression takes care of the rest of the spaces in between.
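For readers who want to try the trim-then-collapse idea outside MATLAB, here is a minimal sketch of the same two steps in Python, with str.strip standing in for strtrim and re.sub standing in for regexprep. The function name collapse_spaces and the sample string are made up for this illustration, and the final join only shows where a " / " word separator could be produced once the spacing is clean.

```python
import re


def collapse_spaces(text):
    """Trim both ends, then replace every run of whitespace with one space."""
    return re.sub(r"\s+", " ", text.strip())


word = "   hello    how   are   you   "
clean = collapse_spaces(word)
print(clean)

# With single spaces guaranteed, the words can be joined with " / ".
with_slashes = " / ".join(clean.split(" "))
print(with_slashes)
```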
However, if you are dead set on using a loop, what you can do is use a logical variable which is set to true when we detect a white space, and then we use this variable and skip other white space characters until we hit a character that isn't a space. We would then place our space, then /, then space, then continue. In other words, do something like this:
Word = strtrim(Word); %// Remove leading and trailing whitespace
space_hit = false; %// Initialize space encountered flag
Word_noSpace = []; %// Will store our new string
for index=1:length(Word) %// For each character in our word
    if Word(index) == ' ' %// If we hit a space
        if space_hit %// Check to see if we have already hit a space
            continue; %// Continue if we have
        else
            Word_noSpace = [Word_noSpace ' ']; %// If not, add a space, then set the flag
            space_hit = true;
        end
    else
        space_hit = false; %// When we finally hit a non-space, set back to false
        Word_noSpace = [Word_noSpace Word(index)]; %// Keep appending characters
    end
end
Word = Word_noSpace; %// Replace to make compatible with the rest of your code
for Character = Word %// Your code begins here
...
...
The above code starts with an empty string called Word_noSpace, which will hold the word with each run of spaces replaced by a single space. The loop goes through each character. When we encounter a space, we check whether the previous character was also a space: if it was, we skip it; if it wasn't, we append a single space and set the flag. When we hit a non-space character, we reset the flag and append that character. The result is a string in which every run of extra spaces has been replaced with a single space.
Running the above code after you trim the leading and trailing white space thus gives:
Word =
hello how are you

Split string into 100 words parts in R

How do I split a single huge "character" into smaller ones, each containing exactly 100 words?
For example, this is how I used to split it into single words.
myCharSplitByWords <- strsplit(myCharUnSplit, " ")[[1]]
I think that this can probably be done with a regex (maybe by selecting every 100th space or something), but I couldn't write a proper expression.
I'm new to R and I'm totally stuck. Thanks
Maybe there is a way using regular expressions but after strsplit it would be easier to group the words by "hand":
## example data
set.seed(1)
string <- paste0(sample(c(LETTERS[1:10], " "), 1e5, replace=TRUE), collapse="")
## split if there is at least one space
words <- strsplit(string, "\\s+")[[1]]
## build group index
group <- rep(seq(ceiling(length(words)/100)), each=100)[1:length(words)]
## split by group index
words100 <- split(words, group)
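The split-then-group idea above is not specific to R. As a rough sketch of the same logic in Python (the function name chunk_words and the generated sample text are assumptions made for this example), the words are split on runs of whitespace and then sliced into consecutive groups of 100, with the last group holding whatever remains.

```python
def chunk_words(text, size=100):
    """Split `text` on whitespace and group the words into lists of `size`.

    The last list holds the leftover words and may be shorter than `size`.
    """
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


# Hypothetical sample: 250 short words separated by single spaces.
sample = " ".join("w%d" % i for i in range(250))
chunks = chunk_words(sample)
print([len(c) for c in chunks])

# Each group can be turned back into one string if needed.
parts = [" ".join(c) for c in chunks]
```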
You can find the start of every batch of 100 words, where a word is a run of non-spaces followed by a run of spaces (if that's your definition of a word), with:
ind <- gregexpr("([^ ]+? +){100}", string)[[1]]
and then take substrings of your original with
hundredWords <- substring(string, ind, c(ind[-1]-1, nchar(string)))
This will leave trailing spaces at the end of each entry, and the final entry will not necessarily have 100 entries, but will have the remaining words that are left after removing batches of 100. If you have another definition of word delimiter (tabs, punctuation, ...) then post that and we can change the regular expression accordingly.

Finding mean of ascii values in a string MATLAB

The string I am given is as follows:
scrap1 =
a le h
ke fd
zyq b
ner i
You'll notice there are 2 blanks, each a space (ASCII 32), in each row. I need to find the mean ASCII value in each column without taking the spaces (32) into account. First I would convert to numbers with double(scrap1), but then how do I find the mean without taking the spaces into account?
If it's only the ASCII 32 you want to omit:
d = double(scrap1);
result = mean(d(d~=32)); %// logical indexing to remove unwanted value, then mean
You can remove the intermediate spaces in the string with scrap1(scrap1 == ' ') = ''; This replaces any space in the input with an empty string. Then you can do the conversion to double and average the result. See here for other methods.
You can probably use a regular expression, "\s", to find the spaces and ignore them.
findSpace = regexp(scrap1, '\s', 'ignore')
% I am not sure about the 'ignore' option; this is just what comes to mind. You can read more about regexp by typing doc regexp.
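The question asks for a mean per column, while the snippets above either average everything at once or only locate the spaces. As a hedged illustration of the per-column logic, written in Python rather than MATLAB, the sketch below walks the columns of a small character grid and averages the character codes that are not spaces. The function name column_means and the three-row sample are invented for this example (the original scrap1 spacing is not recoverable here), and all rows are assumed to have the same length.

```python
def column_means(rows, skip=" "):
    """Mean character code of each column, ignoring the `skip` character.

    Returns None for a column that contains only `skip` characters.
    """
    means = []
    for column in zip(*rows):  # iterate over columns of the character grid
        codes = [ord(ch) for ch in column if ch != skip]
        means.append(sum(codes) / len(codes) if codes else None)
    return means


# Hypothetical 3x3 grid; each row contains one space.
rows = ["a b", "cd ", " ef"]
result = column_means(rows)
print(result)
```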
