What's the difference between a one-word token and a multi-word token in crf++ for Chinese? - crf++

I use crf++ for Chinese named entity recognition. The first column of the training file is the token, which represents the current word. I see that some people put only one Chinese character in the first column, while others put several Chinese characters, such as 中国.

A Chinese word can consist of one Chinese character or multiple Chinese characters:
中 corresponds to the English word "middle".
国 corresponds to another English word, "country".
中国 corresponds to the English word "China".
They are treated the same way: each one is the current word. Just as 'CHINA' has 5 English characters, 中国 has 2 Chinese characters, and both are the current word in crf++.
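To make the two tokenizations concrete, here is a minimal Python sketch that writes the same sentence as CRF++ rows, once with one word per row and once with one character per row. The helper name `to_crf_rows`, the example sentence, and the B-/I-/O tags with a `LOC` label are assumptions for illustration only; the one firm point is that CRF++ reads one token per line with the tag in the last column.

```python
def to_crf_rows(words_with_tags, per_character=False):
    """Build CRF++-style rows: token, a tab, then the tag in the last column.

    words_with_tags is a list of (word, entity_type) pairs, where
    entity_type is None for words outside any entity.
    The B-/I-/O tagging scheme is an assumption for this sketch.
    """
    rows = []
    for word, entity in words_with_tags:
        # Either keep the whole word as one token or split it into characters.
        units = list(word) if per_character else [word]
        for i, unit in enumerate(units):
            if entity is None:
                tag = "O"
            else:
                tag = ("B-" if i == 0 else "I-") + entity
            rows.append(f"{unit}\t{tag}")
    return "\n".join(rows) + "\n"


# Hypothetical example sentence: "我 爱 中国" with 中国 marked as a location.
sentence = [("我", None), ("爱", None), ("中国", "LOC")]
word_level = to_crf_rows(sentence)
char_level = to_crf_rows(sentence, per_character=True)
```

In the word-level output 中国 is a single token, while in the character-level output 中 and 国 are separate tokens that share one entity. CRF++ expects an empty line between sentences, so a real training file would add one after each block of rows.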

Related

How to capitalize specific uppercase words in vim

We have a 5000-line text file containing words like so:
BANKS
BEING AFRAID OF DOGS
This is a SENTENCE.
Just another sentence.
COUNTRY
Using vim, I want to capitalize the words only in the lines where all the words are in uppercase (meaning lines 3 and 4 should be left untouched). In other words, what I expect to get is:
Banks
Being Afraid Of Dogs
This is a SENTENCE.
Just another sentence.
Country
This can be done by combining two techniques, "Power of g" and "Switching case of characters".
First, select the lines that contain only uppercase letters and spaces: g/^[A-Z ]*$/
Then apply a title-case conversion to those lines: s/\<\(\w\)\(\w*\)\>/\u\1\L\2/g
The whole command is:
:g/^[A-Z ]*$/s/\<\(\w\)\(\w*\)\>/\u\1\L\2/g
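For readers who want to check the logic outside vim, the same two steps can be sketched in Python. This is only an illustration of what the command does, not a replacement for it; the function name `title_case_upper_lines` is made up for this example.

```python
import re


def title_case_upper_lines(text: str) -> str:
    """Title-case words only on lines made up of A-Z and spaces."""
    out = []
    for line in text.splitlines():
        # Same test as the :g pattern ^[A-Z ]*$ (uppercase letters and spaces only).
        if re.fullmatch(r"[A-Z ]*", line):
            # Same idea as \u\1\L\2: first letter uppercase, the rest lowercase.
            line = re.sub(
                r"\b(\w)(\w*)\b",
                lambda m: m.group(1).upper() + m.group(2).lower(),
                line,
            )
        out.append(line)
    return "\n".join(out)


sample = "BANKS\nBEING AFRAID OF DOGS\nThis is a SENTENCE.\nJust another sentence.\nCOUNTRY"
result = title_case_upper_lines(sample)
```

Lines 3 and 4 of the sample contain lowercase letters or a period, so they fail the line test and are left unchanged.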

Extract strings of a certain language from a dataframe in python

I have a pandas DataFrame that contains a column with sentences in different languages (6 languages). The DataFrame also contains a column that states which language each sentence belongs to. However, a sentence may contain non-letter ASCII characters such as =, #, etc., and words that do not belong to the same language, even though they are written in the same script. For an example, please refer to the sentence below, which has been marked as Spanish:
'¿Vas a venir a la tienda conmigo?+== #loja' #Note that 'loja' is a Portuguese word.
Since the sentence is marked as Spanish, I would like to remove all non-Spanish words and non-punctuation characters (+, =, #).
I have an idea for removing the non-punctuation characters: get the set of characters and remove the ones that are not letters (there are only a few punctuation characters, so there is no need to search). However, would someone be able to help remove the words that do not belong to the tagged language, such as the Portuguese word in the above example, using Python?
Thanks & Best Regards
Michael

What's the full name of "DBNum"?

In Excel, if the cell value is 123 and its custom format is specified as [DBNum2][$-804]General, it is displayed as 壹佰贰拾叁 (in Chinese, this is a local number format).
The question is:
What does DBNum mean? I think it should be an abbreviation of some words. If so, what is the full name?
Thanks for your answer.
It is the context that clarifies the name. Basically, to display numbers using native number characters, use a [NatNum1], [NatNum2], ... [NatNum11] modifier at the beginning of a number format code. DBNum is a native character modifier. DBNum is an identifier and has no expanded name. It is defined by usage.
Writing Chinese uppercase currency amounts using Excel's [DBNum2] number type
Today's meeting budget required converting the final amount to Chinese uppercase numbers. I could not find the relevant information in Excel help. After a Google search, I concluded that the "[DBNum2]" number type should be used to convert numbers to Chinese uppercase.
Some common usages are as follows:
1. Set the custom format in the cell format:
2. Use the TEXT function to convert:
Source Link https://translate.google.com/translate?hl=en&sl=zh-CN&u=http://blog.zengrong.net/post/278.html&prev=search
By the way, EditGrid does not support [DBNum2].
EDIT:
Further related information
Displaying Numbers Using Native Characters
NatNum modifiers
To display numbers using native number characters, use a [NatNum1], [NatNum2], ... [NatNum11] modifier at the beginning of a number format codes.
The [NatNum1] modifier always uses a one to one character mapping to convert numbers to a string that matches the native number format code of the corresponding locale. The other modifiers produce different results if they are used with different locales. A locale can be the language and the territory for which the format code is defined, or a modifier such as [$-yyy] that follows the native number modifier. In this case, yyy is the hexadecimal MS-LCID that is also used in currency format codes. For example, to display a number using Japanese short Kanji characters in an English US locale, use the following number format code:
[NatNum0]
Try to convert any native number string to ASCII Arabic digits. If already ASCII, it remains ASCII.
[NatNum1]
| Transliterations | Native Number Characters | DBNumX | Date Format |
| --- | --- | --- | --- |
| Chinese | Chinese lower case characters | | CAL: 1/7/7 [DBNum1] |
| Japanese | short Kanji characters | [DBNum1] | CAL: 1/4/4 [DBNum1] |
| Korean | Korean lower case characters | [DBNum1] | CAL: 1/7/7 [DBNum1] |
| Hebrew | Hebrew characters | | |
| Arabic | Arabic-Indic characters | | |
| Thai | Thai characters | | |
| Hindi | Indic-Devanagari characters | | |
| Odia | Odia (Oriya) characters | | |
| Marathi | Indic-Devanagari characters | | |
| Bengali | Bengali characters | | |
| Punjabi | Punjabi (Gurmukhi) characters | | |
| Gujarati | Gujarati characters | | |
| Tamil | Tamil characters | | |
| Telugu | Telugu characters | | |
| Kannada | Kannada characters | | |
| Malayalam | Malayalam characters | | |
| Lao | Lao characters | | |
| Tibetan | Tibetan characters | | |
| Burmese | Burmese (Myanmar) characters | | |
| Khmer | Khmer (Cambodian) characters | | |
| Mongolian | Mongolian characters | | |
| Nepali | Indic-Devanagari characters | | |
| Dzongkha | Tibetan characters | | |
| Farsi | East Arabic-Indic characters | | |
| Church Slavic | Cyrillic characters | | |
[NatNum2]
| Transliterations | Native Number Characters | DBNumX | Date Format |
| --- | --- | --- | --- |
| Chinese | Chinese upper case characters | | CAL: 2/8/8 [DBNum2] |
| Japanese | traditional Kanji characters | | CAL: 2/5/5 [DBNum2] |
| Korean | Korean upper case characters | [DBNum2] | CAL: 2/8/8 [DBNum2] |
| Hebrew | Hebrew numbering | | |
[NatNum3]
| Transliterations | Native Number Characters | DBNumX | Date Format |
| --- | --- | --- | --- |
| Chinese | fullwidth Arabic digits | | CAL: 3/3/3 [DBNum3] |
| Japanese | fullwidth Arabic digits | | CAL: 3/3/3 [DBNum3] |
| Korean | fullwidth Arabic digits | [DBNum3] | CAL: 3/3/3 [DBNum3] |
Source link: Common/Number Format Codes
Thanks, I have been looking for a description of [DBNUM1] (lower case Chinese number), [DBNUM2] (upper case Chinese number, for formal numbers) and [DBNUM3] (1-to-1 digit to Chinese number conversion), the last being obtained from somewhere else.
There is also another format code whose documentation still needs to be found. An example is "[>100]#,000", which displays a number in the indicated format, but only if it is greater than 100. I am not sure which formatting rule applies when the condition is not met, or whether multiple conditions can be specified.
It is a pity that Microsoft does not give a complete list of format codes at a single place.
The actual meaning may only be known by Microsoft Office developers, and we can only guess what the full name is.
The corresponding chapters that introduce the equivalent function in LibreOffice and OpenOffice just use the name DBNum directly, without any further introduction. Even the mapping table in LibreOffice says "DBNumX".
Moreover, you can't find the definition of DBNum in Office Support.
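To make concrete what the question's example shows (123 displayed as 壹佰贰拾叁), here is a small Python sketch that writes an integer with Chinese uppercase (financial) numerals. It is not Excel's implementation and makes no claim to match [DBNum2] in every case; the function name `to_chinese_upper` and the restriction to 0 through 9999 are assumptions made to keep the example short.

```python
DIGITS = "零壹贰叁肆伍陆柒捌玖"      # uppercase (financial) forms of 0-9
UNITS = ["", "拾", "佰", "仟"]        # ones, tens, hundreds, thousands


def to_chinese_upper(n: int) -> str:
    """Spell an integer in 0..9999 with Chinese uppercase numerals (sketch)."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return DIGITS[0]
    text = str(n)
    parts = []
    pending_zero = False
    for i, ch in enumerate(text):
        d = int(ch)
        if d == 0:
            # Remember the gap; a single 零 is written before the next digit.
            pending_zero = True
            continue
        if pending_zero:
            parts.append(DIGITS[0])
            pending_zero = False
        # Digit followed by its positional unit (no unit for the ones place).
        parts.append(DIGITS[d] + UNITS[len(text) - 1 - i])
    return "".join(parts)


example = to_chinese_upper(123)
```

Each nonzero digit is followed by its positional unit, and a run of zeros between nonzero digits collapses into a single 零.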

Choosing correct word for the given string

Suppose the given word is "connnggggggrrrraaatsss" and we need to convert it to "congrats".
Another example: "looooooovvvvvveeeeee" should be changed to "love".
Here the letters of the given word can be repeated any number of times, but the word should be changed to its correct form. We need to write a Java-based program.
You cannot simply collapse every repeated letter, because certain words legitimately contain a doubled letter in their spelling. So one way you could go is:
check each letter in the word and restrict its number of consecutive appearances to two
then check the new spelling with a spell checker; you might want to try Hunspell, as it is widely used by many word processors.
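The two steps above can be sketched as follows. The question asks for Java, but the idea carries over directly; this illustration is in Python. It goes one step beyond the answer by trying each repeated run as either one or two letters, since capping at two alone would turn the first example into "connggrraatss". The word set `KNOWN_WORDS` is a hypothetical stand-in for a real spell checker such as Hunspell, and the function names are made up.

```python
from itertools import groupby, product

# Hypothetical stand-in for a real dictionary or spell checker.
KNOWN_WORDS = {"congrats", "love", "good", "hello"}


def candidates(word):
    """Yield spellings where each repeated run is cut to one or two letters."""
    runs = [(ch, len(list(group))) for ch, group in groupby(word)]
    # A single letter stays as-is; a repeated run may be one or two letters.
    options = [[ch] if n == 1 else [ch, ch * 2] for ch, n in runs]
    for combo in product(*options):
        yield "".join(combo)


def normalize(word, is_known=KNOWN_WORDS.__contains__):
    """Return the first candidate accepted by the checker, or None."""
    for cand in candidates(word.lower()):
        if is_known(cand):
            return cand
    return None


fixed = [normalize(w) for w in ("connnggggggrrrraaatsss", "looooooovvvvvveeeeee", "gooood")]
```

The number of candidates doubles with each repeated run, which is fine for single words but would need pruning for very long inputs. With a real spell checker, `is_known` would be replaced by that checker's lookup call.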

How to write numbers in other language in web?

My website has two languages, English and Persian. When I use the Persian language and write a number, the number is shown in English digits.
For example, when I write 67 in Persian, it is unfortunately displayed as 67 in English digits, even though I typed it in Persian (my font is Arial):
How do I write numbers in Persian?
Persian digits have different Unicode code points. For example, the English zero (0) is code 48 (U+0030), but the Persian zero (۰) is U+06F0, so you have to enter numbers using the proper Unicode characters. Your font also has to support these code points (not all fonts do).
Persian digits run from U+06F0 (0, ۰) to U+06F9 (9, ۹), while English digits run from U+0030 (0) to U+0039 (9).
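Because the two digit ranges line up one to one, the conversion can also be done in code before the text reaches the page. The following Python sketch builds the mapping from the code points given above; the function name `to_persian_digits` is an assumption for illustration, and the same approach would work in whatever language generates the site.

```python
# Map U+0030..U+0039 (ASCII digits) to U+06F0..U+06F9 (Persian digits).
PERSIAN_DIGITS = {ord("0") + i: chr(0x06F0 + i) for i in range(10)}


def to_persian_digits(text: str) -> str:
    """Replace ASCII digits with Persian digits; other characters are kept."""
    return text.translate(PERSIAN_DIGITS)


converted = to_persian_digits("67")
```

The result still depends on the font: if the font in use lacks glyphs for U+06F0 to U+06F9, the browser will fall back to another font or show placeholder boxes.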
How to enter these characters (in Windows)
If you do not have the proper language support to enter these characters from your keyboard, you can enter them in WordPad (a standard Windows application) or in Microsoft Word by typing the four hexadecimal characters of the code (e.g. 06F9) and pressing the Alt+X shortcut. The four-character code will be converted to the corresponding Unicode character.
