How can I use uconv to convert a list of place names to ASCII? - icu

I'm trying to 'intelligently' convert place name strings to ASCII. I think what I'm looking for is transliteration. I was able to use the 'uconv' util to get some encouraging results:
Say my input was "Emberá":
uconv -x Latin-ASCII input.dat > output.dat
The corresponding output would be "Embera", which is exactly what I want. In some cases, however, I don't get the expected result (maybe when the input isn't Latin to begin with?). For example, look at this place (I can't copy and paste the name correctly into this text box): http://maps.google.ca/maps?q=karpos+macedonia&hl=en&ie=UTF8&hnear=Karpo%C5%A1,+Macedonia+(FYROM)&t=m&z=12.
Ideally that would be transliterated into "Karpos" (I think), but if I use it as input for uconv with the above command, uconv doesn't modify it at all.
So given a list of place names (here's the list if anyone's curious: http://www.mediafire.com/file/gb0guu117yp1p26/test.dat), how do I convert them into ASCII?

Try -x 'Any-Latin;Latin-ASCII'
You could also add --to-callback escape-unicode -t ascii to force everything in the output to ASCII; anything that still can't be converted is then written as an escape such as {U+3045}.
Note that 'intelligently' is... relative here. You're stripping off a lot of information and going through several layers of translation. This won't help much, but if you know that the text is going from, say, Greek to English (that's the el-en), you can do something like -x 'el-en;Any-Latin;Latin-ASCII' so that it can attempt to use language-specific transliteration.
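For comparison, here is a minimal Python sketch of the narrower job that Latin-ASCII does for accented Latin letters. It is not ICU and it is not what uconv does internally: it only decomposes each character (NFKD) and drops the combining marks, so it will not transliterate Greek, Cyrillic or other scripts the way Any-Latin does. The function name strip_diacritics is made up for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks.

    Only helps with letters that decompose into a base letter plus
    accents. Characters without such a decomposition (for example the
    letter o-with-stroke, or any non-Latin script) are left unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The two names from the question; the results are plain ASCII.
print(strip_diacritics("Ember\u00e1"))
print(strip_diacritics("Karpo\u0161"))
```

Anything that survives this step as non-ASCII still needs a real transliterator such as the ICU rules above.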

Related

node JS console log ascii symbol

Hi, I am using Node.js for my app, and I want to print ASCII symbols in the terminal. Here is a table of ASCII symbols; please check the Extended ASCII Codes field. I want to print a square or a circle, for example 178 or 219.
Can anyone tell me how to do it? Thank you.
Like several other languages, Javascript suffers from The UTF‐16
Curse. Except that Javascript has an even worse form of it, The UCS‐2
Curse. Things like charCodeAt and fromCharCode only ever deal with
16‐bit quantities, not with real, 21‐bit Unicode code points.
Therefore, if you want to print out something like 𝒜, U+1D49C,
MATHEMATICAL SCRIPT CAPITAL A, you have to specify not one character
but two “char units”: "\uD835\uDC9C".
Please refer to this link: https://dheeb.files.wordpress.com/2011/07/gbu.pdf
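The arithmetic behind those two "char units" can be sketched in a few lines. This is a Python illustration of the standard UTF-16 surrogate calculation, not JavaScript code; the helper name to_surrogate_pair is hypothetical.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D49C)
print(f"\\u{high:04X}\\u{low:04X}")  # prints \uD835\uDC9C
```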
Your desired character is not a printable ASCII character. On Linux you can print all the printable ASCII characters (codes 32 to 126) by running this command:
for ((i=32; i<=126; i++)); do printf "\\$(printf '%03o' "$i")"; done; printf '\n'
or
man ascii
So what you can do instead is print Unicode characters. Here is a list of the available Unicode characters; you can pick one that looks almost identical to your desired character.
http://unicode-table.com/en/#2764
I've tested this on a Windows terminal and it does not show the desired character there, but it works on Linux. If it still doesn't work, make sure LANGUAGE="en_US.UTF-8" is set in /etc/rc.conf and LANG="en_US.UTF-8" in /etc/locale.conf.
So printing something like this on the Node console:
console.log('\u2592 start typing...');
will output this result:
▒ start typing...
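To find which Unicode character corresponds to a given "extended ASCII" code, you can decode the byte with the code page the table was drawn from. The sketch below assumes the table in the question is IBM code page 437 (the classic DOS set); that is a guess, since the question does not say. It is Python rather than Node, and the helper name cp437_to_unicode is made up.

```python
def cp437_to_unicode(code: int) -> str:
    """Decode a single byte value as IBM code page 437 (assumed table)."""
    return bytes([code]).decode("cp437")


for code in (178, 219):
    ch = cp437_to_unicode(code)
    # Print the code point rather than the glyph, so this works on any terminal.
    print(code, f"U+{ord(ch):04X}")
```

The resulting code points can then be written as \uXXXX escapes in a JavaScript string, in the same way as the \u2592 example above.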
Actually, if you only care about ASCII, that should not be a real problem at all. You only have to escape the characters properly. A good reference for this is https://mathiasbynens.be/notes/javascript-escapes
console.log('\xB2 \xDB')
Works for me with a recent-ish Node under Windows (cmd shell) and macOS. For ASCII characters you can just convert the code to hex and prefix it with \x in your strings. Give it a try with node -e "console.log('\xB2')"
And when you have tried this answer and it works, you might want to try:
node -e "console.log('\x07')"

Removing lines containing encoding errors in a text file

I must warn you I'm a beginner. I have a text file in which some lines contain encoding errors. By "error", this is what I get when parsing the file in my linux console (question marks instead of characters):
I want to remove every line showing those "question marks". I tried to grep -v the problematic character, but it doesn't work. The file itself is UTF8 and I guess some of the lines come from texts encoded in another format. I know I could find a way to reconvert them properly, but I just want them gone for now.
Do you have any ideas about how I could do this please?
PS: Some lines contain diacritics which are displayed fine. The "strings" command seems to remove too many "good" lines.
When dealing with mojibake in character encodings other than ANSI, you must check two things:
Is the file really encoded in X? (X being UTF-8 without BOM in your case. You could be trying to read UTF-8 with BOM, UTF-16, Latin-1, etc. as UTF-8, and that would be the problem.) Try reading it in (not converting it to) other encodings and see if any of them fits.
Is your locale or text editor set to read the file as UTF-8? If not, that may be the problem. Check for support and figure out how to change the setting. On Linux, run the locale command to check the current setting, and change LANG or LC_ALL to set it properly.
I like how Notepad++ for Windows (which also runs perfectly on Linux using Wine) lets you pick any encoding to read the file in without trying to convert it (of course, if you pick one other than the file's actual encoding you will only see those weird characters), and also has a separate option for converting the file from one encoding to another. That has been pretty useful to me.
If you are a beginner you may be interested in this article. It explains briefly and clearly the whats, whys and hows of character encoding.
[EDIT] If the above fails, even with windows-1252 and similar ANSI encodings, I've just learned here how to remove non-ASCII characters using the Unix tr command, leaving pure ASCII (but be aware that the extra characters are lost in this output and there is no going back, so keep the input file in case you find a better fix):
tr -cd '\11\12\40-\176' < "$INPUT_FILE" > "$OUTPUT_FILE"
or, if you want to get rid of the whole line:
grep -v -P "[^\11\12\40-\176]" "$INPUT_FILE" > "$OUTPUT_FILE"
[EDIT 2] This answer here gives a pretty good guess of what could be happening if none of the encodings work on your file (Unfortunately the only straight forward solution seems to be removing those problematic characters).
You can use a micro-Perl script like:
perl -pe 's/[^[:ascii:]]+//g;' my_utf8_file.txt
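The commands above remove every non-ASCII character or line, which also removes correctly encoded diacritics. A different approach, sketched here in Python, is to keep a line only if it decodes cleanly as UTF-8. This rests on two assumptions that the question does not confirm: that the "question marks" come from bytes that are not valid UTF-8, or from U+FFFD replacement characters already stored in the file. The function name is hypothetical.

```python
def drop_undecodable_lines(data: bytes) -> bytes:
    """Keep only the lines that are valid UTF-8 and contain no U+FFFD."""
    kept = []
    for line in data.splitlines(keepends=True):
        try:
            text = line.decode("utf-8")
        except UnicodeDecodeError:
            continue  # invalid byte sequence: drop the whole line
        if "\ufffd" in text:
            continue  # replacement character left by an earlier bad conversion
        kept.append(line)
    return b"".join(kept)
```

To use it, read the input with open(path, "rb"), pass the bytes through the function, and write the result to a new file, keeping the original as the answer above advises. Text that was mis-decoded into other valid characters would not be caught by this check.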

Remove ANSI codes when storing script output

Some programs make beautiful progress bars and the like using ANSI escape sequences. That's nice.
What's not nice, though, is that if I put the output of that kind of program into a file and then try to view it, it's filled with strange escape sequences.
Is there a way to strip away all the ANSI codes while logging?
I usually log the output of a script this way:
./script >> /tmp/output.log
Try:
$ TERM=dumb ./script >> /tmp/output.log
If that doesn't work, it's because the ANSI codes have been hard-coded into the script, so there is no easy way to remove them. If it does work, that's because the script is doing the right thing: delegating things like pretty output to libncurses or similar, so that when you change the TERM variable, the library no longer sends those codes.
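If the codes do turn out to be hard-coded, one possible fallback (not part of the answer above) is to post-process the log and strip the sequences with a regular expression. A minimal Python sketch, covering only CSI sequences, i.e. ESC [ followed by parameters and a final byte, which is what colors and cursor movement use; the name strip_ansi is made up:

```python
import re

# CSI sequence: ESC [ , parameter bytes 0x30-0x3F, intermediate bytes
# 0x20-0x2F, then one final byte 0x40-0x7E.
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove CSI escape sequences; all other characters are kept as-is."""
    return ANSI_CSI.sub("", text)


print(strip_ansi("\x1b[1;31mERROR\x1b[0m: disk full"))
```

Carriage returns used to redraw progress bars, and other escape types such as OSC title sequences, are not touched by this pattern.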

How do I grep for entire, possibly wrapped, lines of code?

When searching code for strings, I constantly run into the problem that I get meaningless, context-less results. For example, if a function call is split across 3 lines, and I search for the name of a parameter, I get the parameter on a line by itself and not the name of the function.
For example, in a file containing
...
someFunctionCall ("test",
MY_CONSTANT,
(some *really) - long / expression);
grepping for MY_CONSTANT would return a line that looked like this:
MY_CONSTANT,
Likewise, in a comment block:
/////////////////////////////////////////
// FIXMESOON, do..while is the wrong choice here, because
// it makes the wrong thing happen
/////////////////////////////////////////
Grepping for FIXMESOON gives the very frustrating answer:
// FIXMESOON, do..while is the wrong choice here, because
When there are thousands of hits, single line results are a little meaningless. What I would like to do is have grep be aware of the start and stop points of source code lines, something as simple as having it consider ";" as the line separator would be a good start.
Bonus points if you can make it return the entire comment block if the hit is inside a comment.
I know you can't do this with grep alone. I am also aware of the option to have grep return a certain number of lines of context. Any suggestions on how to accomplish this under Linux? FYI my preferred languages are C and Perl.
I'm sure I could write something, but I know that somebody must have already done this.
Thanks!
You can use pcregrep with the -M option (multiline matching; pcregrep is grep with Perl-compatible regular expressions). Something like:
pcregrep -M ";*\R*.*thingtosearchfor*\R*.*;.*"
Here's an example using awk.
$ cat file
blah1
blah2
function1 ("test",
MY_CONSTANT,
(some *really) - long / expression);
function2( one , two )
blah3
blah4
$ awk -vRS=")" '/function1/{gsub(".*function1","function1");print $0RT}' file
function1 ("test",
MY_CONSTANT,
(some *really)
The concept behind it: RS is the record separator. Setting it to ")" means every record in your file is separated by ")" instead of a newline. This makes it easy to find your "function1", since you can then "grep" for it. (RT, the text that matched RS, is a gawk extension.) If you don't use awk, the same concept can be applied by splitting on ")".
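The same splitting idea, sketched in Python with ";" as the separator the question asks for rather than ")". This is a naive illustration: a semicolon inside a string literal, a comment or a for(;;) header will split a statement in the wrong place. The function name and sample text are made up.

```python
def find_statements(source: str, needle: str, terminator: str = ";") -> list[str]:
    """Return every terminator-delimited chunk of source that contains needle."""
    hits = []
    for chunk in source.split(terminator):
        if needle in chunk:
            # Re-attach the terminator that split() consumed.
            hits.append(chunk.strip() + terminator)
    return hits


sample = '''blah1;
someFunctionCall ("test",
MY_CONSTANT,
(some *really) - long / expression);
blah3;
'''

for statement in find_statements(sample, "MY_CONSTANT"):
    print(statement)
```

Unlike the awk run above, the whole call up to its closing ");" is returned, because the record ends at the semicolon rather than at the first ")".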
You could write a command line that uses grep with the options that give you the line number and the filename, then pipe these results through xargs into awk to parse those columns, and then use a little script of your own to display the N lines surrounding that line. :)
If this isn't an academic endeavour you could just use cscope (for C code only though). If you are willing to drop the requirement to search in comments ctags should be enough (and it also supports Perl).
I had a situation in which I had an XML file full of the names of zip files in an XML-style format, that is, with angle brackets (carets) around the names of the files, say example.zip<\stuff>
I used awk to change all the angle brackets into newlines and then used grep :)

How can I view log files in Linux and apply custom filters while viewing?

I need to read through some gigantic log files on a Linux system. There's a lot of clutter in the logs. At the moment I'm doing something like this:
cat logfile.txt | grep -v "IgnoreThis\|IgnoreThat" | less
But it's cumbersome -- every time I want to add another filter, I need to quit less and edit the command line. Some of the filters are relatively complicated and may be multi-line.
I'd like some way to apply filters as I am reading through the log, and a way to save these filters somewhere.
Is there a tool that can do this for me? I can't install new software so hopefully it's something that would already be installed -- e.g., less, vi, something in a Python or Perl lib, etc.
Changing the code that generates the log to generate less is not an option.
Use &pattern command within less.
From the man page for less
&pattern
    Display only lines which match the pattern; lines which do not
    match the pattern are not displayed. If pattern is empty (if
    you type & immediately followed by ENTER), any filtering is
    turned off, and all lines are displayed. While filtering is in
    effect, an ampersand is displayed at the beginning of the
    prompt, as a reminder that some lines in the file may be hidden.
    Certain characters are special as in the / command:
    ^N or !
        Display only lines which do NOT match the pattern.
    ^R
        Don't interpret regular expression metacharacters; that
        is, do a simple textual comparison.
Try the multitail tool. As well as letting you view multiple logs at once, I'm pretty sure it lets you apply regex filters interactively.
Based on ghostdog74's answer and the less manpage, I came up with this:
~/.bashrc:
export LESSOPEN='|~/less-filter.sh %s'
export LESS=-R # to allow ANSI colors
~/less-filter.sh:
#!/bin/sh
case "$1" in
*logfile*.log*) sed -f ~/less-filter.sed "$1"
;;
esac
~/less-filter.sed:
/deleteLinesLikeThis/d # to filter out lines
s/this/that/ # to change text on lines (useful to colorize using ANSI escapes)
Then:
less logfileFooBar.log.1 -- the filter is applied automatically.
cat logfileFooBar.log.1 | less -- to see the log without filtering
This is adequate for now but I would still like to be able to edit the filters on the fly.
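Since Python is one of the tools mentioned as available, here is a rough sketch of keeping the exclusion patterns in a separate file that can be edited between runs. It is not interactive and is not a replacement for the LESSOPEN setup above. The pattern file format (one regex per line, "#" for comments) and both function names are assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator


def load_patterns(text: str) -> list[re.Pattern]:
    """One regex per line; blank lines and lines starting with '#' are skipped."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(line))
    return patterns


def filter_lines(lines: Iterable[str], exclude: list[re.Pattern]) -> Iterator[str]:
    """Yield only the lines that match none of the exclusion patterns."""
    for line in lines:
        if not any(p.search(line) for p in exclude):
            yield line


patterns = load_patterns("# clutter to hide\nIgnoreThis\nIgnoreThat\n")
log = ["1 IgnoreThis noise\n", "2 real error\n", "3 IgnoreThat noise\n"]
print("".join(filter_lines(log, patterns)), end="")
```

In practice the pattern text would come from a saved file and the lines from the open log, with the output piped into less; that wiring is left out here.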
See the man page of less. There are some options you can use to search for words, for example. It has a line editing mode as well.
There's an application by Casstor Software Solutions called LogFilter (www.casstor.com) that can edit Windows/Mac/Linux text files and can easily perform file filtering. It supports multiple filters as well as regular expressions. I think it might be what you're looking for.

Resources