I use this to detect space in a string in Lua:
if string.byte(" ")==32 then blah blah
What is the return number (instead of 32) for enter key or new line in Lua?
These numbers denote the ASCII codes for each character. Here's a chart for future reference (but only to 127, as extended ASCII is not supported) so newline is 10.
You can also print a list with the following code:
for i=1,127 do
print(i .. " = " .. string.char(i))
end
However, command characters (such as newline) are difficult to interpret.
You can check them with the \n and \r characters.
> =string.byte '\r'
13
> =string.byte '\n'
10
I don't know the number, but you could try finding it by printing print(string.byte("\n"))
Related
I'd like to replace double quotes " characters which come in pairs. Let me explain what I mean.
"Some sentence"
Here double quotes should be replaced because they come in pair.
"Some sentence
Here should not be replaced - there is no matching pair for the first quote character.
I'd like to replace first quote character with „.
❯ echo „ |hexdump -C
00000000 e2 80 9e 0a
And the second quote character with ”
❯ echo ” |hexdump -C
00000000 e2 80 9d 0a
Summing it up, the following:
Hi, "how
are you"
Should be the following after being replacement is made.
Hi, „how
are you”
I've come up with the following code, but it fails to work:
'sed -r s/(\")(.+)(\")/\1\xe2\x80\x9e\3\xe2\x80\x9d/g'
" hi " gives "„"”.
EDIT
As requested in the comments, here comes a sample from a file to be modified. Important note: the file is structured - perhaps it may help. The file is always a srt file, i.e. movie subtitle format.
104
00:10:25,332 --> 00:10:27,876
Kobieta mówi do drugiej:
"Widzisz to, co ja?"
105
00:10:28,001 --> 00:10:30,904
A tamta: "No to co?
Każdy wygląda tak samo."
Your expression doesn't work because you have three capturing groups: The three sets of (). You are putting the 1st (the first quote) and the 3rd (the last quote) in the output and ignoring the 2nd, which is the part you want to keep.
There's no reason to capture the quotes, since you don't want to inject them into the output. Only the bit in the middle needs to be captured.
There is also a flaw, the (.*) will itself match against a string containing a quote. So /"(.*)"/ would match the entire sequence "one"two", with the capture, (.*), matching one"two. Use [^"]* to match a sequence of non-quote characters.
Fixing this, and treating the entire text file as one line with -z, which only works if there are no nul characters in the text file, it appears this works:
sed -zE 's/"([^"]+)"/„\1“/g'
sed -rn ':a;s/"([^"]*)"/„\1”/g;/"/!{p;b;};$p;N;ba'
It substitutes all "xx" with „xx”. If the result contains no more " it is printed and we restart with next line. Else we concatenate the next line and we restart. The $p is just here to print the last lines if they contain a dangling ".
I'm new to perl, and I'm trying to print out the folderName from mork files (from Thunderbird).
From: https://github.com/KevinGoodsell/mork-converter/blob/master/doc/mork-format.txt
The second type of special character sequence is a dollar sign
followed by two hexadecimal digits which give the value of the
replacement byte. This is often used for bytes that are non-printable
as ASCII characters, especially in UTF-16 text. For example, a string
with the Unicode snowman character (U+2603):
☃snowman☃
may be represented as UTF-16 text in an Alias this way:
<(83=$03$26s$00n$00o$00w$00m$00a$00n$00$03$26)>
From all the Thunderbird files I've seen it's actually encoded in UTF-8 (2 to 4 bytes).
The following characters need to be escaped (with \) within the string to be used literally: $, ) and \
Example: aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\x08 should print aaa$AAñb☺í\x08
$C3$B1 is ñ; $E2$98$BA is ☺; $C3$ADis í
I tried using the regex to replaced unescaped $ into \x
my $unescaped = qr/(?<!\\)(?:(\\\\)*)/;
$folder =~ s/$unescaped\$/\\x/g;
$folder =~ s/\\([\\$)])/$1/g; # unescape "\ $ ("
Within perl it just prints the literal string.
My workaround is feeding it into bash's printf and it succeeds... unless there's a literal "\x" in the string
$ folder=$(printf "$(mork.pl 8777646a.msf)")
$ echo "$folder"
aaa$AAñb☺í
Questions i consulted:
Convert UTF-8 character sequence to real UTF-8 bytes
But it seems it interprets every byte by itself, not in groups.
In Perl, how can I convert an array of bytes to a Unicode string?
I don't know how to apply this solution to my use case.
Is there any way to achieve this in perl?
The following substitution seems to work for your input:
s/\\([\$\\])|\$(..)/$2 ? chr hex $2 : $1/ge;
Capture \$ or \\, if matched, replace them with $ or \. Otherwise, capture $.. and convert to the corresponding byte.
If you want to work with the result in Perl, don't forget to decode it from UTF-8.
$chars = decode('UTF-8', $bytes);
I have a large (4 GB) Windows .csv text file (each lines end in "\r\n") in a Linux environment that was supposed to have been a csv delimited file (delimiter = '|', text qualifier = '"') with each field separated by a pipe and enclosed in double quotes. Any narrative text field with embedded double quotes was supposed to have the double quote escaped with a second double quote (ie. " the quick "brown" fox" was supposed to have been represented as "the quick ""brown"" fox"). Unfortunately escaping the embedded double quotes did not occur. Further the text fields may include embedded new lines (i.e. Windows CR (\r\n)) which need to be retained.
Sample lines might look as follows:
"1234567890123456"|"2016-07-30"|"2016-08-01"|"123"|"456"|"789"|"text narrative field starts\r\n
with text lines that may have embedded double quotes "For example"\r\n
and may include measurements such as 1/2" x 2" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote"\r\n
"9876543210654321"|"2017-01-31"|"2018-08-01"|"123"|"456"|"789"|"text narrative field"\r\n
"2345678901234567"|"...."\r\n
with the objective to have the output appear as follows:
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts\r\n
with text lines that may have embedded double quotes ""For example""\r\n
and may include measurements such as 1/2"" x 2"" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote~\r\n
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~\r\n
~2345678901234567~|~....~\r\n
The solution I was attempting to implement was to:
SUCCESSFUL: change all the "|" sequences to ~|~
SUCCESSFUL: change the double quote (")at the start of the first line and end of the last line to a tilde (~)
change the ending and starting double quotes to tildes for any lines ending in a double quote at the end of the first line and terminated with a CR (\r\n) (eg. ..."\r\n) and the next line begins with a double quote, followed by 16 digit number and a tilde (eg. "1234567890123456~...) (i.e. it is the start of a new record)
convert all remaining double quote characters to two successive double quotes (change " to "")
then reverse the first 3 steps above changing all ~ back to double quotes.
I started by using sed to replace all strings with double quote, followed by a pipe, followed by a double quote (i.e. "|") with a tilde, pipe, tilde (i.e. ~|~). I then manually replaced the first and last doublequote in the file with a tilde.
This is where I ran into issues as I tried to count the number of occurrences where a line ends with a doublequote(") and the start of the next line begins with a doublequote followed by a 16 digit number and a "~" which will tell me the actual number of csv records in the file (minus one) as opposed to the number of lines. I attempted to do this using grep: grep '"\r\n"\d{16}~' | wc -l but that didn't work
I then need to replace those double quotes wherein a double quote ends a record and the succeeding record begins with a double quote followed by a 16 digit number and a "~" leaving everything else intact.
I tried to use sed: sed 's/"\r\n"(\d{16}~)/~\r\n~\1' windows_file.txt but it is not working as hoped.
I would welcome any recommendations as to how to accomplish the above.
The script below does what you expect using awk, except for the very last line in the file since it does not know where that record ends.
It could be fixed counting lines in the file but would be impractical since it's a big file.
Looking at data structure records are separated by "\r\n" and fields by "|" let's use that with awk.
gawk 'BEGIN{
RS="\"\r\n\"" # input record separator RS, 2 double quotes with a DOS line ending in the middle
FS="\"\\|\"" # input field separator FS, 2 double quotes with a pipe in the middle
ORS="~\r\n~" # your record separator
OFS="~|~" # your field separator
} {
$1=$1 # trick awk into believing something has changed
if (NR == 1){ # first record, replace first character
print "~" substr($0,2)
}else{
print $0
}
} ' test.txt
Result (assuming lines end with \r\n):
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts
with text lines that may have embedded double quotes "For example"
and may include measurements such as 1/2" x 2" with
the text continuing and includes embedded line breaks
which will finally be terminated with a double quote~
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~
~10654321~|~2018-09-31~|~2018-08-01~|~123~|~456~|~789~|~asdasdasdasdad asasda"
~
~
PS: will break if a field contains a line that starts with " and the preceding line within the same ends with "\r\n since the pattern will match the proposed RS.
"10654321"|"2018-09-31"|"2018-08-01"|"123"|"456"|"789"|"asdasdasdasdad asasda"\r\n
"some more"\r\n
"22222"|".... (another record)
I want to remove in my text any kind of ASCII character with code in interval [128-255].
gsub(/[^a-z]/, "", $0) --This is how I remove everything except the letters;
gsub(/ē|é|ě|è|ū|ú|ǔ|ù|ǖ|ǘ|ǚ|ǜ|ü|ō|ó|ǒ|ò|ī|í|ǐ|ì|ā|á|ǎ|à|å|ä|â/, "", $0) -- This is how I remove some extended characters, but not every.
gsub(/"[\128-\255]"/, "", $0) I am trying this, but it shows me an error, invalid interval. So, can anybody please help with that problem. Thanks beforehand.
Backslash codes must be in octal, or prefixed with a x and in hexadecimal.
\200-\377
\x80-\xff
Or you could just use strings.
The \nnn syntax is octal (where n is 0-7), so:
\128 = invalid octal
\200 = 128
\255 = 173
\377 = 255
So you want:
\200-\377
I have an assignment asking me to print x iterations of a string for each character in that string. So if the string input is "Gum", then it should print out:
Gum
Gum
Gum
Right now my code is
my $string = <>;
my $length = length($string);
print ($string x $length, "\n");
And I'm getting gum printed five times as my output.
Those who have said you will get CR + LF at the end of the line on a Windows system are mistaken. Perl will convert the native line ending to a simple newline \n on any platform.
You must bear this in mind whether you are reading from the terminal or from a file.
The built-in chomp function will remove the line terminator character from the end of a string variable. If the string doesn't end with a line terminator then it will have no effect.
So when you type GumEnter you are setting $string to "Gum\n", and length will show that it has four characters.
You are seeing it five times on your screen because the first line is what you typed in yourself. The following four are printed by the program.
After a chomp, $string is just "Gum" with a length of three characters, which is what you want.
To output this on separate lines you have to print a newline after each line, so you can write
my $string = <>;
chomp $string;
my $length = length $string;
print ("$string\n" x $length);
or perhaps
print $string, "\n" for 1 .. $length;
I hope that helps
As you are simply using the input string, it still contains the newline at the end. This is also counted as a character. On my system, it outputs 4 Gum\n.
chomp($string) will remove the line ending, but the output will then also run together, resulting in GumGumGum\n
When You insert input and press enter afterwards You don't enter "Gum" but "Gum\r\n" which is a string of length 5. You should do trimming.
Your code is working fine. See this: http://ideone.com/AsPFh3
Possibility 1: It might be that you're putting 2 spaces while giving input from command line, that's why the length comes out to be 5, and it prints 5 times. Something like this: http://ideone.com/fsvnrd
In above case the my $string=<>; will give you my $string = "gum "; so length will be 5.
Possibility 2:
Another possibility is that if you use Windows then it will add carriage return (\r) and new line (due to enter \n) at the end of string. So it makes the length 5.
Edit: To print in new line: Use the below code.
#!/usr/bin/perl
# your code goes here
chomp(my $string=<>);
my $length = length($string);
print ("$string\n" x $length);
Demo
Edit 2: To remove \r\n use the below:
$string=~ s/\r|\n//g; Read more here.