I have a script to return parts of a text file, but I notice it sometimes returns characters that are not visible when viewing the text file directly. For example, the word:
breeders
becomes
breed‰ rs
I tried adding "as Unicode text" to my text return, but that isn't working. Thoughts? Here's my script:
set some_file to "[...]Words.txt" as alias
set the_text to read some_file as string
set the text item delimiters of AppleScript to ", "
set the_lines to (every text item of the_text)
return some item of the_lines as Unicode text
Have you tried something like ruby -KU -e '"breeders".chars{|c|puts c.unpack("U*")[0].to_s(16)}' or searching for the characters that aren't displayed correctly in Character Viewer?
read will jumble up non-ASCII characters unless you add as «class utf8»:
do shell script "echo ä > /tmp/test.txt"
read POSIX file "/tmp/test.txt" as «class utf8»
as text, as string, and as Unicode text have been equivalent since 10.5.
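To see why the encoding matters, here is a small shell sketch, assuming the file was saved as UTF-8. It shows that a single non-ASCII character such as ä occupies two bytes, so a reader that assumes one byte per character displays two wrong characters instead of one correct one. The variable names are only illustrative.

```shell
# "ä" in UTF-8 is the byte pair 0xC3 0xA4, written here as octal escapes
# so the example does not depend on the encoding of the script itself.
utf8_bytes=$(printf '\303\244' | wc -c | tr -d ' ')
utf8_hex=$(printf '\303\244' | od -An -tx1 | tr -d ' \n')
echo "bytes: $utf8_bytes, hex: $utf8_hex"
```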
I need to put the following text format into a variable
"Sometext":"more text";'still text
However, because the text has both double and single quotes, I can't put it in a string. I've tried using @''@ and @""@, but it's not working.
Sidenote: I can't edit the text because it's originated automatically
What can I do?
Thank you in advance
If it's a literal, you can use a here string:
$variable = @"
"Sometext":"more text";'still text
"@
(Note that the final "@ has to be on a separate line, at the very beginning of that line.)
To build strings with complex quotations, consider composite formatting. Save the quotes in a variable and use formatting placeholders to insert them. Like so,
$squote = "'"
$dquote = '"'
$myString = "{0}Sometext{0}:{0}more text{0};{1}still text" -f $dquote, $squote
$myString
"Sometext":"more text";'still text
I have a string that is like so:
"string content here
"
because it is too long to fit on the screen in one line
The string is the name of a file I would like to read, but I always get an error saying the file was not found, because the string includes the newline character and the actual file name obviously does not. I cannot rename the file, and I have tried the strip function to remove the newline, but that doesn't work. How can I remove the newline character from my string so I can load my file?
You can use the function strip to remove any leading and trailing whitespace, including newlines, from a string.
>> text = "hello" + newline; %Create test string.
>> disp(text)
hello
>> text_stripped = strip(text);
>> disp(text_stripped)
hello
>>
In the above ">>" has been included to better present the removal of the whitespace in the string.
Consider replacing the newline character with nothing using strrep.
As an example:
s = sprintf('abc\ndef') % Create a string s with a newline character in the middle
s = strrep(s, newline, '') % Replace newline with nothing
Alternatively, you could use regular expressions if there are several characters causing you issues.
Alternatively, you could use strip if you know the newline always occurs at the beginning or end.
I am trying to change a file looking like this :
>sample_A#Dakota
text
text
text
>text_2#Idao
text
text
text
>junk_1#Alabama
text
text
text
>example_4#Dakota
text
text
text
>example5#Honduras
text
text
text
to a file looking like this :
>model_1#Dakota
text
text
text
>model_2#Idao
text
text
text
>model_3#Alabama
text
text
text
>model_4#Dakota
text
text
text
>model_5#Honduras
text
text
text
So, I need to find the text between > and #, and replace it with "model_" followed by an incremental number. I have found answers for doing each of these things separately, but I haven't been able to combine them. I would like to use bash, with a one-line answer using sed or awk.
I have tried this :
awk 'BEGIN { cntr = 0 } />/,/#/ { cntr++ ; print "model", cntr } !/>/,/#/ { print $0 }' infile
but I got this :
model 1
text
text
text
model 2
>text_2#Idao
text
text
text
model 3
>junk_1#Alabama
text
text
text
model 4
>example_4#Dakota
text
text
text
model 5
>example5#Honduras
text
text
text
Thanks in advance,
T
$ awk '/^>.*#/{sub(/^>[^#]+/, ">model_" ++c)} 1' ip.txt
>model_1#Dakota
text
text
text
>model_2#Idao
text
text
text
>model_3#Alabama
text
text
text
>model_4#Dakota
text
text
text
>model_5#Honduras
text
text
text
/^>.*#/ if line starts with > and has # in the line
sub function helps to search and replace first match
/^>[^#]+/ match characters from start of line from > until just before # character
">model_" ++c replacement string
c will be zero at the start (since this is numerical context), ++c will give the value after incrementing, so first time we get 1, next time 2 and so on
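As a quick check of the guard /^>.*#/, here is a small sketch on made-up input (the header >no_separator is hypothetical and not from the question). A header line without # is printed unchanged and does not consume a counter value. The increment is parenthesized here only for readability.

```shell
# Feed four sample lines to the same awk logic; only headers that contain
# a "#" are renamed, and only those advance the counter c.
result=$(printf '%s\n' '>sample_A#Dakota' 'text' '>no_separator' '>junk_1#Alabama' |
  awk '/^>.*#/ { sub(/^>[^#]+/, ">model_" (++c)) } 1')
printf '%s\n' "$result"
```

This prints >model_1#Dakota, text, >no_separator and >model_2#Alabama, one per line.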
$ awk 'sub(/^>[^#]+/,""){$0=">model_" (++cnt) $0} 1' file
>model_1#Dakota
text
text
text
>model_2#Idao
text
text
text
>model_3#Alabama
text
text
text
>model_4#Dakota
text
text
text
>model_5#Honduras
text
text
text
You could also try the following.
awk 'match($0,/>.*#/){print ">model_"++count"#" substr($0,RSTART+RLENGTH);next} 1' Input_file
awk '/^>/{$0=">model_" ++c "#" $3}1' FS='[>#]' file
I used > and # as field separators.
Output:
>model_1#Dakota
text
text
text
>model_2#Idao
text
text
text
>model_3#Alabama
text
text
text
>model_4#Dakota
text
text
text
>model_5#Honduras
text
text
text
This might work for you (GNU sed and shell):
sed -E '/^>.*#/{x;s/.*/expr & + 1/e;x;G;s/^[^#]*(.*)\n(.*)/echo "model_\2\1"/e}' file
For lines that begin > and contain #, increment a counter in the hold space (HS), append the HS to the current line and re-arrange into the desired format.
I have a function analyze_text: string -> unit to analyze a text. As a result, (most of the time,) ./analyze aText launches the function with the argument.
let usage_msg = "./analyze [options] TEXT" in
Arg.parse options analyze_text usage_msg;
However, I realize that when the text contains special characters like ", ' or !, it is not read correctly. Does anyone know a way to wrap the text so that it reaches the function intact?
Many characters have a special meaning to the shell. You can suppress that meaning by enclosing your input in single quotes.
$ echo 'a*$b"$c"!d'
a*$b"$c"!d
If your input itself contains a single quote, enclose that quote in double quotes and concatenate it with the rest of the input, which stays in single quotes.
E.g., say you want to print: He$l!o Wo$r'ld
You can do it like:
$ echo 'He$l!o Wo$r'"'"'ld'
He$l!o Wo$r'ld
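The same string can also be written with a backslash-escaped quote between two single-quoted pieces. The sketch below (variable names a and b are only illustrative) builds the string both ways and prints each, assuming a POSIX-style shell running a script, where ! is not subject to history expansion.

```shell
# Variant 1: close the single quotes, add an escaped quote, reopen.
a='He$l!o Wo$r'\''ld'
# Variant 2: close the single quotes, add a double-quoted quote, reopen.
b='He$l!o Wo$r'"'"'ld'
printf '%s\n' "$a" "$b"
```

Both lines of output read He$l!o Wo$r'ld.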
In your case, the culprit is not your OCaml code, but the behavior of your shell, e.g., bash. When you enter text at the bash prompt, many characters have a special meaning, e.g., ", ', $, \ and so on. To suppress the special meaning of a character in bash, you can either escape it with a backslash, e.g., \$, \\, \', or delimit the text with single quotes. A single quote cannot be escaped inside single-quoted text, though, so you have to close the quoted text, add an escaped quote, and reopen it.
The general approach is that when your input is actual text or data, not a sequence of commands and options, you should read the input from a file or from the standard input channel. This also helps, when the size of the input is large, as most of the shells limit (sometimes significantly) the total number of characters that can be passed through the command line. In vanilla OCaml, you can input the whole file into a single string using the following simple code
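let read_file filename =
  let buf = Buffer.create 4096 in
  let chan = open_in filename in
  begin
    (* add_channel raises End_of_file once fewer than 4096 characters
       remain; the characters read so far are still added to buf *)
    try while true do Buffer.add_channel buf chan 4096 done
    with End_of_file -> ()
  end;
  close_in chan;
  Buffer.contents buf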
Then you don't need to deal with any special characters, as your input will be the file and no shell in between will do any interpretations. You can even analyze binary data with that.
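On the shell side, a here-document with a quoted delimiter is one way to hand such text to a program on standard input without any escaping. This is only a sketch: read stands in for whatever program consumes the input, and the sample line is made up for illustration.

```shell
# Quoting the delimiter ('EOF') disables every expansion inside the
# here-document, so $HOME, backticks, quotes and ! arrive untouched.
# `read` is a stand-in for a program that reads its text from stdin.
IFS= read -r text <<'EOF'
"Sometext":"more text";'still text with $HOME, `date` and !bang
EOF
printf '%s\n' "$text"
```

The printed line is identical to the line inside the here-document.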
I have a large (4 GB) Windows .csv text file (each line ends in "\r\n") in a Linux environment. It was supposed to be a delimited file (delimiter = '|', text qualifier = '"') with each field separated by a pipe and enclosed in double quotes. Any narrative text field with embedded double quotes was supposed to have each double quote escaped with a second double quote (i.e. "the quick "brown" fox" was supposed to be represented as "the quick ""brown"" fox"). Unfortunately, the embedded double quotes were not escaped. Further, the text fields may include embedded line breaks (i.e. Windows CRLF (\r\n)), which need to be retained.
Sample lines might look as follows:
"1234567890123456"|"2016-07-30"|"2016-08-01"|"123"|"456"|"789"|"text narrative field starts\r\n
with text lines that may have embedded double quotes "For example"\r\n
and may include measurements such as 1/2" x 2" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote"\r\n
"9876543210654321"|"2017-01-31"|"2018-08-01"|"123"|"456"|"789"|"text narrative field"\r\n
"2345678901234567"|"...."\r\n
with the objective to have the output appear as follows:
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts\r\n
with text lines that may have embedded double quotes ""For example""\r\n
and may include measurements such as 1/2"" x 2"" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote~\r\n
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~\r\n
~2345678901234567~|~....~\r\n
The solution I was attempting to implement was to:
SUCCESSFUL: change all the "|" sequences to ~|~
SUCCESSFUL: change the double quote (") at the start of the first line and end of the last line to a tilde (~)
change the ending and starting double quotes to tildes wherever a line ends in a double quote followed by a CRLF (\r\n) (e.g. ..."\r\n) and the next line begins with a double quote, followed by a 16-digit number and a tilde (e.g. "1234567890123456~...) (i.e. it is the start of a new record)
convert all remaining double quote characters to two successive double quotes (change " to "")
then reverse the first 3 steps above changing all ~ back to double quotes.
I started by using sed to replace all strings with double quote, followed by a pipe, followed by a double quote (i.e. "|") with a tilde, pipe, tilde (i.e. ~|~). I then manually replaced the first and last doublequote in the file with a tilde.
This is where I ran into issues as I tried to count the number of occurrences where a line ends with a doublequote(") and the start of the next line begins with a doublequote followed by a 16 digit number and a "~" which will tell me the actual number of csv records in the file (minus one) as opposed to the number of lines. I attempted to do this using grep: grep '"\r\n"\d{16}~' | wc -l but that didn't work
I then need to replace those double quotes wherein a double quote ends a record and the succeeding record begins with a double quote followed by a 16 digit number and a "~" leaving everything else intact.
I tried to use sed: sed 's/"\r\n"(\d{16}~)/~\r\n~\1' windows_file.txt but it is not working as hoped.
I would welcome any recommendations as to how to accomplish the above.
The script below does what you expect using awk, except for the very last line in the file, since it does not know where that record ends.
That could be fixed by counting the lines in the file first, but it would be impractical since it's a big file.
Looking at the data structure, records are separated by "\r\n" and fields by "|", so let's use that with awk.
gawk 'BEGIN{
    RS="\"\r\n\""   # input record separator RS, 2 double quotes with a DOS line ending in the middle
    FS="\"\\|\""    # input field separator FS, 2 double quotes with a pipe in the middle
    ORS="~\r\n~"    # your record separator
    OFS="~|~"       # your field separator
} {
    $1=$1           # reassign a field so awk rebuilds $0 using OFS
    if (NR == 1){   # first record, replace first character
        print "~" substr($0,2)
    }else{
        print $0
    }
} ' test.txt
Result (assuming lines end with \r\n):
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts
with text lines that may have embedded double quotes "For example"
and may include measurements such as 1/2" x 2" with
the text continuing and includes embedded line breaks
which will finally be terminated with a double quote~
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~
~10654321~|~2018-09-31~|~2018-08-01~|~123~|~456~|~789~|~asdasdasdasdad asasda"
~
~
PS: this will break if a field contains a line that starts with " and the preceding line within the same field ends with "\r\n, since that sequence matches the proposed RS. For example:
"10654321"|"2018-09-31"|"2018-08-01"|"123"|"456"|"789"|"asdasdasdasdad asasda"\r\n
"some more"\r\n
"22222"|".... (another record)
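For the record count the question was after, a line-based alternative is to count the lines that start a record in the original, unmodified file. grep works line by line and so cannot match across "\r\n", and \d is not supported in its basic or extended syntax, which is why the original attempt failed. The sketch below uses made-up sample data and assumes that no narrative line happens to begin with a quoted 16-digit number followed by a pipe.

```shell
# Two records, the first spanning two physical lines; every line ends in CRLF.
# Count only the lines that begin with a quoted 16-digit number and a pipe.
records=$(
  printf '%s\r\n' \
    '"1234567890123456"|"2016-07-30"|"text narrative starts' \
    'with an embedded "quote" and 1/2" x 2" sizes"' \
    '"9876543210654321"|"2017-01-31"|"text narrative field"' |
  grep -cE '^"[0-9]{16}"[|]'
)
echo "records: $records"
```

On the real file the same pattern would be run directly against the file name instead of the printf output.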