vim: delete single space between (character/number) and (character/number)

I have a text file with four variables (TA 000, TB 111, T2 333, T56 R88), separated from each other by three spaces, like this:
TA 000   TB 111   T2 333   T56 R88
Is it possible to erase the single space within each variable with Vim, while keeping intact the three spaces that separate the variables?
TA000   TB111   T2333   T56R88

Certainly. One approach is with capturing groups, capturing the words + single space + numbers, and reassembling only with words + numbers:
:%s/\(\w\+\) \(\d\+\)/\1\2/g
Another approach matches only the single space (and replaces it with nothing), asserting (but not matching) the stuff around it:
:%s/\w\zs \ze\d//g
The \zs and \ze (you can look up anything here via :h /\zs etc.) are specific to Vim. A variation (that would work in other regular expression engines, too) would be using positive lookahead and lookbehind, but the syntax is more complex.
If the three spaces have special meaning (to limit the matching places), you can incorporate those into both approaches, too. I leave that to you, as such relatively easy problems provide great learning experiences :-)

Related

How to (with good asymptotic complexity) delete all of a specific character from a (very long) line?

I have a line with 25.9 million characters, about 2.4 million of which are commas, and I want to remove all of the commas from the line.
If I use the command :s/,//g it constructs a regular expression which is run repeatedly on the line until there are no commas left. This seems to run in O(n^2) time based on empirical measurement, and as such the substitution runs for well over an hour on this line.
Using a macro is no good because of the redraw that occurs which tends to be somewhat expensive when you are in the middle of such a long line.
Splitting up the lines seems to be the best option, but due to the structure of the file, I'd need to create a new buffer to do so cleanly.
Yes, there are much better ways to output this much data that do not involve CSVs with ridiculous numbers of columns; let's assume I didn't generate it, but I have it, and I have to work with it.
Is there an asymptotically fast way to simply delete every occurrence of a specific character from a line in vim?
As a text editor, Vim isn't well suited to such pathologically formatted files (as you've already found out).
As others have already commented, tr is a good alternative for removing the commas. Either externally:
$ tr -d , < input.txt
Or from within Vim:
:.! tr -d ,
Vim also has a built-in low-level tr() function (:help tr()). Unfortunately, it doesn't handle deletion, only conversion. You could use it to change commas into semicolons in the current line like this:
:call setline('.', tr(getline('.'), ',', ';'))

Why do ANSI color escapes end in 'm' rather than ']'?

ANSI terminal color escapes can be done with \033[...m in most programming languages. (You may need to do \e or \x1b in some languages)
What has always seemed odd to me is how they start with \033[, but they end in m. Is there some historical reason for this (perhaps ] was mapped to the slot that is now occupied by m in the ASCII table?), or is it an arbitrary character choice?
It's not completely arbitrary, but follows a scheme laid out by committees, and documented in ECMA-48 (the same as ISO 6429). Except for the initial Escape character, the succeeding characters are specified by ranges.
While the pair Escape[ is widely used (this is called the control sequence introducer CSI), there are other control sequences (such as Escape], the operating system command OSC). These sequences may have parameters, and a final byte.
In the question, using CSI, the m is a final byte, which happens to tell the terminal what the sequence is supposed to do. The parameters if given are a list of numbers. On the other hand, with OSC, the command-type is at the beginning, and the parameters are less constrained (they might be any string of printable characters).
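To make that concrete, here is a small illustration (mine, not part of the original answer) of a CSI sequence with its parameter bytes and final byte, written in Haskell:
-- A CSI sequence is ESC '[' <parameters> <final byte>; the final byte 'm'
-- means "Select Graphic Rendition" (colours and text attributes).
boldRed, reset :: String
boldRed = "\ESC[1;31m"   -- parameters 1 (bold) and 31 (red foreground), final byte 'm'
reset   = "\ESC[0m"      -- parameter 0 resets all attributes
main :: IO ()
main = putStrLn (boldRed ++ "error:" ++ reset ++ " back to the default rendition")
Changing the final byte changes the meaning: for example \ESC[2J (final byte 'J') erases the display instead of setting colours.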

Elegant way to parse "line splices" (backslashes followed by a newline) in megaparsec

For a small compiler project, we are currently working on a compiler for a subset of C, for which we decided to use Haskell and megaparsec. Overall we have made good progress, but there are still some corner cases that we cannot correctly handle yet. One of them is the treatment of backslashes followed by a newline. To quote from the specification:
Each instance of a backslash character (\) immediately followed by a
new-line character is deleted, splicing physical source lines to form
logical source lines. Only the last backslash on any physical source
line shall be eligible for being part of such a splice.
(§5.1.1, ISO/IEC 9899:201x)
So far we came up with two possible approaches to this problem:
1.) Implement a pre-lexing phase in which the initial input is reproduced and every occurrence of \\\n is removed. The big disadvantage we see in this approach is that we lose accurate error locations, which we need.
2.) Implement a special char' combinator that behaves like char but looks an extra character ahead and will silently consume any \\\n. This would give us correct positions. The disadvantage here is that we would need to replace every occurrence of char with char' in every parser, even in the megaparsec-provided ones like string, integer, whitespace, etc.
Most likely we are not the first people trying to parse a language with such a "quirk" with parsec/megaparsec, so I could imagine that there is some nicer way to do it. Does anyone have an idea?
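For what it's worth, a minimal sketch of approach 2.) could look like the following (assuming a plain Parsec Void String parser; splice and char' are placeholder names, and threading this through string, integer, whitespace and friends is exactly the drawback described above):
import Control.Monad (void)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

type Parser = Parsec Void String

-- Silently consume any run of backslash-newline splices.
splice :: Parser ()
splice = void $ many (try (char '\\' *> eol))

-- A char-like combinator that skips splices before matching its character.
char' :: Char -> Parser Char
char' c = splice *> char c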

Detecting syllables in a word containing non-alphabetical characters

I'm implementing a readability test and have implemented a simple algorithm for detecting syllables.
I detect sequences of vowels and count them per word; for example, the word "should" contains one sequence of vowels, which is 'ou'. Before counting them I remove suffixes like -les, -e, -ed (for example, the word "like" contains one syllable but two sequences of vowels, so this method works).
But...
Consider these words / sequences:
x-ray (it contains two syllables)
I'm (one syllable; maybe I should remove all apostrophes in the text?)
goin'
I'd've
n' (for example Pork n' Beans)
3rd (how to treat this?)
12345
What should I do with special characters? Remove them all? That would be OK for most words, but not for "n'" and "x-ray". And how should digits be treated?
These are special cases, but I'd be very glad to hear about any experience or ideas on this subject.
I'd advise you to first determine how much of your data consists of these kinds of words and how much it matters to your program's overall performance. Also compile some statistics of which kinds occur most.
There's no simple correct solution for this problem, but I can suggest a few heuristics:
A ' between two consonants (shouldn't) seems to mark the elision of a syllable
A ' with a vowel or word boundary on one side (I'd, goin') seems not to do so (but note that goin' is still two syllables)
Any word, including n' is at least one syllable long
Dashes (-) may be handled by treating the text on both sides as separate words
3rd can be handled by code that writes ordinals out as words, or by simpler heuristics.
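As an illustration only, here is a crude Haskell sketch of the dash-splitting and "at least one syllable per word" heuristics above (the vowel-run counting and the function names are mine, not a proven algorithm):
import Data.Char (isAlpha, toLower)

vowels :: String
vowels = "aeiouy"

-- Count maximal runs of vowels as a crude syllable estimate.
vowelRuns :: String -> Int
vowelRuns = go . map toLower
  where
    go s = case dropWhile (`notElem` vowels) s of
             ""   -> 0
             rest -> 1 + go (dropWhile (`elem` vowels) rest)

-- Treat the text on both sides of a dash as separate words ("x-ray" -> ["x", "ray"]).
splitOnDashes :: String -> [String]
splitOnDashes s = case break (== '-') s of
                    (w, [])       -> [w]
                    (w, _ : rest) -> w : splitOnDashes rest

-- Every (sub)word counts as at least one syllable, so "n'" and "x" still count.
syllables :: String -> Int
syllables = sum . map (max 1 . vowelRuns) . filter (not . null) . splitOnDashes . filter keep
  where keep c = isAlpha c || c == '-'
With this sketch, syllables "x-ray" is 2 and syllables "n'" is 1; contractions like "goin'" would still need the apostrophe heuristics listed above.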

What character sequence should I not allow in a filename?

I found out after testing that Linux allows any character in a file name except for / and null (\0). So what sequences should I not allow in a filename? I have heard a leading - may confuse some command-line programs; that doesn't matter to me, but it may bother other people if they decide to collect a bunch of files and filter them with some GNU programs.
It was suggested that I remove leading and trailing spaces, and I plan to, but only because typically the user doesn't mean to have leading/trailing spaces.
What problematic sequences might there be, and what sequences should I consider not allowing?
I am also considering disallowing characters that are illegal on Windows, just for convenience. I think I may also disallow dashes at the beginning (a dash is a legal Windows character).
Your question is somewhat confusing since you talk at length about Linux, but then in a comment to another answer you say that you are generating filenames for people to download, which presumably means that you have absolutely no control whatsoever over the filesystem and operating system that the files will be stored on, making Linux completely irrelevant.
For the purpose of this answer I'm going to assume that your question is wrong and your comment is correct.
The vast majority of operating systems and filesystems in use today fall roughly into three categories: POSIX, Windows and MacOS.
The POSIX specification is very clear on what a filename that is guaranteed to be portable across all POSIX systems looks like. The characters that you can use are defined in Section 3.276 (Portable Filename Character Set) of the Open Group Base Specification as:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789._-
The maximum filename length that you can rely on is defined in Section 13.23.3.5 (<limits.h> Minimum Values) as 14. (The relevant constant is _POSIX_NAME_MAX.)
So, a filename which is up to 14 characters long and contains only the 65 characters listed above, is safe to use on all POSIX compliant systems, which gives you 24407335764928225040435790 combinations (or roughly 84 bits).
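(That figure is the number of names of length 1 through 14 over a 65-character alphabet: 65^1 + 65^2 + ... + 65^14 = (65^15 - 65)/64 = 24407335764928225040435790.)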
If you don't want to annoy your users, you should add two more restrictions: don't start the filename with a dash or a dot. Filenames starting with a dot are customarily interpreted as "hidden" files and are not displayed in directory listings unless explicitly requested. And filenames starting with a dash may be interpreted as an option by many commands. (Sidenote: it is amazing how many users don't know about the rm ./-rf or rm -- -rf tricks.)
This leaves you at 23656340818315048885345458 combinations (still 84 bits).
Windows adds a couple of new restrictions to this: filenames cannot end with a dot and filenames are case-insensitive. This reduces the character set from 65 to 39 characters (37 for the first, 38 for the last character). It doesn't add any length restrictions, Windows can deal with 14 characters just fine.
This reduces the possible combinations to 17866587696996781449603 (73 bits).
Another restriction is that Windows treats everything after the last dot as a filename extension which denotes the type of the file. If you want to avoid potential confusion (say, if you generate a filename like abc.mp3 for a text file), you should avoid dots altogether.
You still have 13090925539866773438463 combinations (73 bits).
If you have to worry about DOS, then additional restrictions apply: the filename consists of one or two parts (separated by a dot), where neither of the two parts can contain a dot. The first part has a maximum length of 8 characters, the second of 3. Again, the second part is usually reserved to indicate the file type, which leaves you only 8 characters.
Now you have 4347792138495 possible filenames or 41 bits.
The good news is that you can use the 3 character extension to actually correctly indicate the file type, without breaking the POSIX filename limit (8+3+1 = 12 < 14).
If you want your users to be able to burn the files onto a CD-R formatted with ISO9660 Level 1, then you have to disallow hyphen anywhere, not just as the first character. Now, the remaining character set looks like
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789_
which gives you 3512479453921 combinations (41 bits).
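As an illustration (not taken from any of the answers above), a small Haskell predicate with the hypothetical name isSafeName could enforce the restrictions accumulated so far: portable character set, 14-character limit, no leading dash or dot, no trailing dot:
-- POSIX portable filename character set.
portable :: String
portable = ['A'..'Z'] ++ ['a'..'z'] ++ ['0'..'9'] ++ "._-"

-- Hypothetical validator: only portable characters, at most 14 characters
-- (_POSIX_NAME_MAX), no leading dash or dot, no trailing dot (for Windows).
isSafeName :: String -> Bool
isSafeName ""   = False
isSafeName name =
     length name <= 14
  && all (`elem` portable) name
  && head name `notElem` "-."
  && last name /= '.'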
I would leave the determination of what's "valid" up to the OS and filesystem driver. Let the user type whatever they want, and pass it on. Handle errors from the OS in an appropriate manner. The exception is I think it's reasonable to strip leading and trailing spaces. If people want to create filenames with embedded spaces or leading dashes or question marks, and their chosen filesystem allows it, it shouldn't be up to you to try to prevent them.
It's possible to mount different filesystems at different mount points (or drives in Windows) that have different rules regarding legal characters in a file name. Handling this sort of thing inside your application will be much more work than is necessary, because the OS will already do it for you.
Since you seem to be interested primarily in Linux, one thing to avoid is characters that the (typical) shell will try to interpret, for example, as a wildcard. You can create a file named "*" if you insist, but you might have some users who don't appreciate it much.
Are you developing an application where you have to ask the user to create files themselves? If that's what you are doing, then you can set the rules in your application (e.g. only allow [a-zA-Z0-9_.] and reject all other special characters). This is much simpler to enforce.
urlencode all strings to be used as filenames and you'll only have to worry about length. This answer might be worth reading.
I'd recommend using a whitelist of characters. In general, symbols in filenames will annoy people.
By all means allow people to use a-z, 0-9 and Unicode characters > 0x80, but do not allow arbitrary symbols; things like & and , will cause a lot of annoyance, as will full stops in inappropriate places.
I think the ASCII symbols which are safe to allow are: full stop, underscore and hyphen.
Allowing any OTHER ASCII symbols in the filename is asking for trouble.
A filename should also not start with an ASCII symbol. Policy on spaces in filenames is tricky, as users may expect to be able to use them, but some filenames are obviously silly (such as those which START with spaces).
