Node.js: Detect punycode IDN language - node.js

I know there are libraries like bestiejs/punycode.js or NodeJS Punycode to convert punycode, but I can't find any library that detects the language of a punycode name (Greek, Chinese, etc.).
Is it possible to detect the language of punycode natively, or does it take separate software to detect the language?
Also, is there any Node.js library I can use for punycode language detection?

Punycode is the ASCII (7-bit) representation of an otherwise Unicode-based Internationalized Domain Name (IDN). The conversion to punycode is a variable-length encoding and a purely mathematical process, combined with additional processing such as case folding and normalization to Unicode Normalization Form C. Because of this mathematical nature, language information is not part of the punycode representation at all. It is the Unicode equivalent of the punycode, whose characters lie in specific Unicode ranges/blocks, that gives each character its script/language.
Hence, if one needs language/script detection for an IDN, the name has to be converted to its U-label form first and then passed on to language/script detection routines.
To learn about the libraries that can be used in different programming languages for converting punycode to the corresponding Unicode labels, please refer to the following two documents created by the Universal Acceptance Steering Group (UASG):
UASG 018A, "UA Compliance of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-018a-ua-compliance-of-some-programming-language-libraries-and-frameworks-en/)
UASG 037, "UA-Readiness of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-037-ua-readiness-of-some-programming-language-libraries-and-frameworks-en/)
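The convert-then-detect approach described above can be sketched in Node.js without third-party libraries. This is only a rough illustration: `detectScripts` and the `SCRIPT_NAMES` list are hypothetical names made up for the example, while `url.domainToUnicode` and the `\p{Script=...}` regular expression escapes are built into Node.js. Note that it reports the script, not the language: a Han result, for instance, does not distinguish Chinese from Japanese.

```javascript
const { domainToUnicode } = require('node:url');

// Illustrative, non-exhaustive list of Unicode script names (an assumption of this sketch).
const SCRIPT_NAMES = ['Latin', 'Greek', 'Cyrillic', 'Han', 'Hiragana', 'Katakana',
  'Hangul', 'Arabic', 'Hebrew', 'Thai', 'Devanagari'];

// One regular expression per script, built from Unicode property escapes.
const SCRIPT_PATTERNS = SCRIPT_NAMES.map(
  (name) => [name, new RegExp(`\\p{Script=${name}}`, 'u')]
);

// Hypothetical helper: decode the punycode labels, then collect the scripts in use.
function detectScripts(domain) {
  const unicode = domainToUnicode(domain); // '' if the domain is invalid
  const scripts = new Set();
  for (const ch of unicode) {
    // Dots, digits and hyphens belong to no listed script and are skipped.
    const match = SCRIPT_PATTERNS.find(([, pattern]) => pattern.test(ch));
    if (match) scripts.add(match[0]);
  }
  return { unicode, scripts: [...scripts] };
}

console.log(detectScripts('xn--mnchen-3ya.de'));       // münchen.de, Latin
console.log(detectScripts('xn--fsqu00a.xn--0zwm56d')); // 例子.测试, Han
```

To go from script to an actual language guess, the decoded U-label would still have to be passed to a separate language detection routine, and domain labels are usually too short for that to be reliable.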

Related

A configurable (high/low level) programming language?

Is there a programming language that can be both high and low level? I'll elaborate...
Say, for example, you want to write an enterprise system which needs good abstractions and static typing (among other things), so you pick Java. But then, you need a portion of this system to be very low latency, so you pick C++ and do your own garbage collection. Then, you want some kind of automated build script, so you quickly write something up in Python.
Is there a language that could do all of these tasks, and be configurable such that one can use a GC or not, have the script interpreted or compiled, and have static typing or not, as specified? Say, a unified syntax with optional static typing, customizable GC, and compilers: a sort of customizable Frankenstein language?
Disclaimer, I have not taken any programming language/compiler course, so this may be a noob-ish question!
Real-world challenges are better dealt with using multiple programming languages and programming language implementations.
Since you mentioned Python: important parts of its libraries are implemented in C (or in Cython) for performance reasons.
The programming languages involved in a modern Web site are at least HTML, CSS, and JavaScript, plus a templating language, plus the language used for the backend, plus the language used for services, plus SQL or the like for querying databases.
The history of programming has recurring attempts at a single programming language for everything (Ada comes to mind), and each of them resulted in either inconsistent languages or complicated code. All of them were short-sighted.

Which standard language codes should I use for multilingual software?

I often see the abbreviation "en-US", whose first part corresponds to the 2-character language codes standardized in ISO 639-1. I also understand that the format of language tags generally consists of a primary language (subtag) code, followed by a series of other subtags separated by dashes, as explained in https://www.rfc-editor.org/rfc/rfc5646.
That link mentions that there are also 3-letter language codes defined in ISO 639-2, ISO 639-3, and ISO 639-5.
Still, there are more codes defined for Windows/.NET here: http://msdn.microsoft.com/en-us/goglobal/bb896001.aspx. These refer to the language tags as "culture names", and use a distinct 3-character code for "language name". So the "culture name" appears to be the 2-character language codes, although I'm not sure why they vary between Windows versions, or how well they follow the standard language codes. Is "en-US" really a "language code" or is it a "culture name"?
If I'm developing software to use language codes, which standard should I use? (The 2-character codes or the 3-character codes? If 3-character, then ISO 639-2, -3, or -5?)
Why should I choose one over the other? (For OS platform or programming framework compatibility?)
BCP 47 is the industry best-practice standard for identifying languages. You should use these language tags. BCP 47 dictates that if a language can be identified using either a 2-letter or a 3-letter code, the 2-letter code should be used.
Cultures and locales are distinct from language tags in how they conceive of the region information. The region information in a language tag identifies the origin of the particular dialect (en-US is American English, or the variety of English that originated in the United States), whereas the region information in a locale identifies the location where the information is relevant. Since the majority of American English speakers also live in the US, the distinction is not really important when it comes to providing information such as how to spell words or format dates or numbers.
Windows is moving away from the concept of a locale or culture to a more expressive notion of language and region (separately identified) which allows us to identify situations such as a speaker of American English who resides in England.
Note that there are cases where Windows still uses legacy names that predate this standard, and depending on how you rely on the OS, you may need to map between standards-compliant names and the legacy names.
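As a small illustration of how a BCP 47 tag separates language from region, here is a sketch using the `Intl.Locale` class built into modern JavaScript runtimes. The helper name `describeTag` is hypothetical and exists only for this example.

```javascript
// Hypothetical helper: split a BCP 47 tag into its subtags with the built-in Intl.Locale.
function describeTag(tag) {
  const locale = new Intl.Locale(tag); // throws a RangeError on a malformed tag
  return {
    tag: locale.toString(),     // tag with canonical casing
    language: locale.language,  // primary language subtag
    script: locale.script,      // undefined when the tag has no script subtag
    region: locale.region,      // undefined when the tag has no region subtag
  };
}

console.log(describeTag('en-us'));      // tag 'en-US', language 'en', region 'US'
console.log(describeTag('zh-hant-tw')); // tag 'zh-Hant-TW', language 'zh', script 'Hant', region 'TW'
```

Storing the canonical tag string and parsing it on demand keeps the data portable across platforms that follow BCP 47.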

How can I determine the language of a web page, like Chrome does?

I am trying to build a corpus for a certain language. But when I fetch a webpage, how can I determine its language?
Chrome can do it, but what's the principle behind it?
I can come up with some ad-hoc methods, like an educated guess based on character set, IP address, HTML tags, etc. But is there a more formal method?
I suppose the common method is looking at things like letter frequencies, common letter sequences and words, character sets (as you describe)... there are lots of different ways. An easy one would be to just get a bunch of dictionary files for various languages and test which one gets the most hits from the page, then offer, say, the next three as alternatives.
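The dictionary-hit idea can be sketched roughly as follows. The tiny word lists and the name `rankLanguages` are made up for illustration; a real implementation would load full dictionary or stopword files per language.

```javascript
// Tiny illustrative word lists (assumptions of this sketch, not real dictionaries).
const WORD_LISTS = {
  en: ['the', 'and', 'is', 'of', 'to', 'that', 'it', 'with'],
  es: ['el', 'la', 'los', 'y', 'es', 'de', 'que', 'con'],
  de: ['der', 'die', 'das', 'und', 'ist', 'nicht', 'ein', 'zu'],
};

// Hypothetical helper: count dictionary hits per language and rank by hit count.
function rankLanguages(text) {
  const words = text.toLowerCase().match(/\p{L}+/gu) || [];
  return Object.entries(WORD_LISTS)
    .map(([lang, list]) => {
      const dictionary = new Set(list);
      const hits = words.filter((word) => dictionary.has(word)).length;
      return { lang, hits };
    })
    .sort((a, b) => b.hits - a.hits); // best guess first, alternatives after it
}

const ranking = rankLanguages('Der Hund und die Katze ist nicht zu Hause');
console.log(ranking[0].lang);  // best guess
console.log(ranking.slice(1)); // alternatives to offer
```

The first entry is the best guess and the remaining entries are the alternatives to offer. Short texts give few hits, so the ranking is only trustworthy on page-sized input.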
If you are just interested in collecting corpora of different languages, you can look at country specific pages. For example, <website>.es is likely to be in Spanish, and <website>.de is likely to be in German.
Also, Wikipedia is translated into many languages. It is not hard to write a scraper for a particular language.
The model Chrome uses to determine a webpage's language is called the Compact Language Detector v3 (CLD3), and its C++ code is open source (sort of; it's not reproducible). There are also official Python bindings for it:
pip install gcld3

Voice Form Matching in Visual C++

Are there SDKs for voice-form matching / comparison for Visual C++? Or, possibly, for converting sounds to phonetics?
Usage: The program will do different things in response to certain command words spoken in a made-up foreign language (Klingon).
Analysis: comparison of the user's voice with an existing pre-recorded voice segment.
Rather than using existing speech recognition SDKs, I believe I have to opt for a more general approach, since the language I am dealing with isn't widely supported.

What are the preferred conventions in naming attributes, methods and classes in different languages?

Are the naming conventions similar in different languages? If not, what are the differences?
Each language has a specific style. At least one.
Each project adopts a specific style. At least, it should. This can sometimes be a different style from the canonical style of your language, probably based on the dev lead's preferences.
Which style to use?
If your language ships with a good standard library, try to adopt the conventions in that library.
If your language has a canonical book (The C Programming Language, The Camel Book, Programming Ruby, etc.), use that.
Sometimes the language designers (C#, Java spring to mind) actually write a bunch of guidelines. Use those, especially if the community adopts them too.
If you use multiple languages remember to stay flexible and adjust your preferred coding style to the language you are using - when coding in Python use a different style to coding in C# etc.
As others have said, things vary a lot, but here's a rough overview of the most commonly used naming conventions in various languages:
lowercase, lowercase_with_underscores:
Commonly used for local variables and function names (typical C syntax).
UPPERCASE, UPPERCASE_WITH_UNDERSCORES:
Commonly used for constants and variables that never change. Some (older) languages like BASIC also have a convention for using all upper case for all variable names.
CamelCase, javaCamelCase:
Typically used for function names and variable names. Some use it only for functions and combine it with lowercase or lowercase_with_underscores for variables. When javaCamelCase is used, it's typically used both for functions and variables.
This syntax is also quite common for external APIs, since this is how the Win32 and Java APIs do it. (Even if a library uses a different convention internally they typically export with the (java)CamelCase syntax for function names.)
prefix_CamelCase, prefix_lowercase, prefix_lowercase_with_underscores:
Commonly used in languages that don't support namespaces (e.g. C). The prefix will usually denote the library or module to which the function or variable belongs. Usually reserved for global variables and global functions. The prefix can also be in UPPERCASE. Some conventions use a lowercase prefix for internal functions and variables and an UPPERCASE prefix for exported ones.
There are of course many other ways to name things, but most conventions are based on one of the ones mentioned above or a variation on those.
BTW: I forgot to mention Hungarian notation on purpose.
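To make the conventions listed above concrete, here is a rough sketch that converts one identifier between them. All function names are hypothetical and chosen to mirror the labels used in the list.

```javascript
// Split an identifier into lowercase words at underscores and lower-to-upper boundaries.
function splitWords(name) {
  return name
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
    .split(/[\s_]+/)
    .filter(Boolean)
    .map((word) => word.toLowerCase());
}

const capitalize = (word) => word.charAt(0).toUpperCase() + word.slice(1);

// lowercase_with_underscores
const toLowercaseWithUnderscores = (name) => splitWords(name).join('_');
// UPPERCASE_WITH_UNDERSCORES
const toUppercaseWithUnderscores = (name) => toLowercaseWithUnderscores(name).toUpperCase();
// CamelCase (leading capital)
const toCamelCase = (name) => splitWords(name).map(capitalize).join('');
// javaCamelCase (leading lowercase)
const toJavaCamelCase = (name) => {
  const camel = toCamelCase(name);
  return camel.charAt(0).toLowerCase() + camel.slice(1);
};
// prefix_CamelCase, with the prefix naming the library or module
const toPrefixedCamelCase = (prefix, name) => `${prefix}_${toCamelCase(name)}`;

console.log(toLowercaseWithUnderscores('getWindowText')); // get_window_text
console.log(toUppercaseWithUnderscores('maxRetryCount')); // MAX_RETRY_COUNT
console.log(toCamelCase('get_window_text'));              // GetWindowText
console.log(toJavaCamelCase('get_window_text'));          // getWindowText
console.log(toPrefixedCamelCase('gui', 'get_window_text')); // gui_GetWindowText
```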
G'day,
One of the best recommendations I can make is to read the relevant section(s) of Steve McConnell's Code Complete. He has an excellent discussion of naming techniques.
HTH
cheers,
Rob
Of course there are some common guidelines, but there are also differences due to differences in language syntax/design.
For .NET (C#, VB, etc.) I would recommend the following resources:
Framework Design Guidelines - definitive book on .NET coding guidelines, including naming conventions
Naming Guidelines - guidelines from Microsoft
General Naming Conventions - another set of MS guidelines (C#, C++, VB)
I think that most naming conventions vary by developer. For example, I name variables like multiwordVarName, but some of the devs I have worked with used something like multiword_var_name or multiwordvarname or aj5g54ag or... I think it really depends on your preference.
Years ago a wise old programmer taught me the evils of Hungarian notation. This was on a real legacy system. Microsoft adopted it somewhat in the Windows SDK, and later in MFC. It was designed around loosely typed languages like C, not strongly typed languages like C++. At the time I was programming Windows 3.0 using Borland's Turbo Pascal 1.0 for Windows, which later became Delphi.
Anyway, long story short: at that time the team I was working on developed our own standard, very simple and applicable to almost all languages, based on simple prefixes:
a - argument
l - local
m - member
g - global
The emphasis here is on scope. Rely on the compiler to check types; all you need to care about is scope, that is, where the data lives. This has a big advantage over nasty old Hungarian notation: if you change the type of something while refactoring, you don't have to search for and rename all instances of it.
Nearly 16 years later I still promote the use of this practice, and have found it applicable to almost every language I have developed in.
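A short sketch of what the scope prefixes might look like in practice, written in JavaScript. The class and variable names are invented for illustration and are not from the original team's code.

```javascript
let gCallCount = 0; // g: global, shared by every Counter instance

class Counter {
  constructor(aStart) {   // a: argument
    this.mCount = aStart; // m: member
  }

  add(aStep) {
    const lNext = this.mCount + aStep; // l: local
    this.mCount = lNext;
    gCallCount += 1;
    return lNext;
  }
}

const counter = new Counter(10);
console.log(counter.add(5)); // 15
```

Nothing in the names says whether a value is a number or a string, so changing a type later does not force any renaming.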

Resources