moses train-model.perl script error, --lm factor:order:filename requied - nlp

when I run:
$MOSES/scripts/training/train-model.perl -hierarchical -ghkm -external-bin-dir /home/zhanwang/giza-pp/tools/ -root-dir . --corpus corpus/nl2mr --f mr --e nl
showed:
ERROR: use --lm factor:order:filename to specify at least one language
model at /home/zhanwang/mosesdecoder/scripts/training/train-model.perl
line 597.
root#zhanwang-virtual-machine:/home/zhanwang/mosesmodel/corpus3#
$MOSES/scripts/training/train-model.perl
But I don't want to use factor model.
try this too, show me the same thing. whatever parameter I try ,it just ask me to use --lm factor:order:filename.
For an standard phrase model, you will typically run the training script as follows.
train-model.perl -root-dir . --corpus corpus/euro --f de --e en
I want to build syntax-base translation model, what should I do?
$MOSES/scripts/training/train-model.perl -ghkm -external-bin-dir /home/zhanwang/giza-pp/tools/ -root-dir . --corpus corpus/nl2mr --f mr --e nl
here is my corpus:
root#zhanwang-virtual-machine:/home/zhanwang/mosesmodel/corpus3/corpus# head nl2mr.nl nl2mr.mr
==> nl2mr.nl <==
Give me the cities in Virginia .
What are the high points of states surrounding Mississippi ?
==> nl2mr.mr <==
answer city loc_2 stateid 'virginia'
answer high_point_1 state next_to_2 stateid 'mississippi'
I want to extract ghkm rules and build a model that can translate "Give me the cities in Virginia ." to "answer city loc_2 stateid 'virginia'"

Related

Clean the text data and save in csv format using Python

I have a text file of about 7000 sentences. Every sentence is in a new line. The sample format of my text file's data is given below. I want to change the format and clean the data using python.
(input.txt)
I\PP.sg.n.n am\VM.3.fut.sim.dcl.fin.n.n.n going\VER.0.gen.n.n to\NC.0.0.n.n school\JQ.n.n.crd .\PU
When\PPR.pl.1.0.n.n.n.n I\PP.0.y go\VM.0.0.0.0.nfn.n.n.n outside\NC.0.0.n.n ,\PU I\NST.0.n.n saw\NN.loc.n.n something\DAB.sg.y .\PU
I\PP.0.y eat\JQ.n.n.nnm rice\NC.0.loc.n.n .\PU
I want to change the format of the above data of the text file and want the below format in CSV.
(input.csv)
Sentences
Tags
I am going to school .
PP VM VER NC JQ PU
When I go outside , I saw something .
PPR PP VM NC PU NST NN DAB PU
I eat rice .
PP JQ NC PU
I have tried some approaches but nothing is working properly to get my desired format. I am really confused. It would be a great help for me if any kind soul can help me.
Thanks in advance for the help.
Python Code:
txt = r"""
I\PP.sg.n.n am\VM.3.fut.sim.dcl.fin.n.n.n going\VER.0.gen.n.n to\NC.0.0.n.n school\JQ.n.n.crd .\PU
When\PPR.pl.1.0.n.n.n.n I\PP.0.y go\VM.0.0.0.0.nfn.n.n.n outside\NC.0.0.n.n ,\PU I\NST.0.n.n saw\NN.loc.n.n something\DAB.sg.y .\PU
I\PP.0.y eat\JQ.n.n.nnm rice\NC.0.loc.n.n .\PU
"""
for line in txt.strip().split('\n'):
words, tags = [], []
for wordtag in line.strip().split():
splits = wordtag.split('\\', 1)
words.append(splits[0])
tags.append(splits[1].split('.')[0])
print(f"\"{' '.join(words)}\",\"{' '.join(tags)}\"")
Output:
"I am going to school .","PP VM VER NC JQ PU"
"When I go outside , I saw something .","PPR PP VM NC PU NST NN DAB PU"
"I eat rice .","PP JQ NC PU"

unable to get rid of all emojis

I need help removing emojis. I looked at some other stackoverflow questions and this is what I am de but for some reason my code doesn't get rid of all the emojis
d= {'alexveachfashion': 'Fashion Style * Haute Couture * Wearable Tech * VR\n👓👜⌚👠\nSoundCloud is Live #alexveach\n👇New YouTube Episodes ▶️👇', 'andrewvng': 'Family | Fitness | Friends | Gym | Food', 'runvi.official': 'Accurate measurement via SMART insoles & real-time AI coaching. Improve your technique & BOOST your performance with every run.\nSoon on Kickstarter!', 'triing': 'Augmented Jewellery™️ • Montreal. Canada.', 'gedeanekenshima': 'Prof na Etec Albert Einstein, Mestranda em Automação e Controle de Processos, Engenheira de Controle e Automação, Técnica em Automação Industrial.', 'jetyourdaddy': '', 'lavonne_sun': '☄️🔭 ✨\n°●°。Visual Narrative\nA creative heart with a poetic soul.\n————————————\nPARSONS —— Design & Technology', 'taysearch': 'All the World’s Information At Your Fingertips. (Literally) Est. 1991🇺🇸 🎀#PrincessofSearch 🔎Sample 👇🏽 the Search Engine Here 🗽', 'hijewellery': 'Fine 3D printed jewellery for tech lovers #3dprintedjewelry #wearabletech #jewellery', 'yhanchristian': 'Estudante de Engenharia, Maker e viciado em café.', 'femka': 'Fashion Futurist + Fashion Tech Lab Founder #technoirlab + Fashion Designer / Parsons & CSM Grad / Obsessed with #fashiontech #future #cryptocurrency', 'sinhbisen': 'Creator, TRiiNG, augmented jewellery label ⭕️ Transhumanist ⭕️ Corporeal cartographer ⭕️', 'stellawearables': '#StellaWearables ✉️Info#StellaWearables.com Premium Wearable Technology That Monitors Personal Health & Environments ☀️🏝🏜🏔', 'ivoomi_india': 'We are the manufacturers of the most innovative technologies and user-friendly gadgets with a global presence.', 'bgutenschwager': "When it comes to life, it's all about the experience.\nGoogle Mapper 🗺\n360 Photographer 📷\nBrand Rep #QuickTutor", 'storiesofdesign': 'Putting stories at the heart of brands and businesses | Cornwall and London | #storiesofdesign', 'trume.jp': '草創期から国産ウオッチの製造に取り組み、挑戦を続けてきたエプソンが世界に放つ新ブランド「TRUME」(トゥルーム)。目指すのは、最先端技術でアナログウオッチを極めるブランド。', 'themarinesss': "I didn't choose the blog life, the blog life chose me | Aspiring Children's Book Author | www.slayathomemum.com", 'ayowearable': 'The world’s first light-based wearable that helps you sleep better, beat jet lag and have more energy! #goAYO Get yours at:', 'wearyourowntechs': 'Bringing you the latest trends, Current Products and Reviews of Wearable Technology. Discover how they can enhance your Life and Lifestyle', 'roxfordwatches': 'The Roxford | The most stylish and customizable fitness smartwatch. Tracks your steps/calories/dist/sleep. Comes with FOUR bands, and a travel case!', 'playertek': "Track your entire performance - every training session, every match. \nBecause the best players don't hide.", '_kate_hartman_': '', 'hmsmc10': 'Health & Wellness 🍎\nBoston, MA 🏙\nSuffolk MPA ‘17 🎓 \n.\nJust Strong Ambassador 🏋🏻\u200d♀️', 'gadgetxtreme': 'Dedicated to reviewing gadgets, technologies, internet products and breaking tech news. Follow us to see daily vblogs on all the disruptive tech..', 'freedom.journey.leader': '📍MN\n🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿 \n📧Ashleybp5#gmail.com \n#homeschool #bossmom #builder #momlife', 'arts_food_life': 'Life through my phone.', 'medgizmo': 'Wearable #tech: #health #healthcare #wellness #gadgets #apps. Images/links provided as information resource only; doesn’t mean we endorse referenced', 'sawearables': 'The home of wearable tech in South Africa!\n--> #WearableTech #WearableTechnology #FitnessTech Find your wearable #', 'shop.mercury': 'Changing the way you charge.⚡️\nGet exclusive product discounts, and help us reach our goal below!🔋', 'invisawear': 'PRE-ORDERS NOW AVAILABLE! Get yours 25% OFF here: #girlboss #wearabletech'}
for key in d:
print("---with emojis----")
print(d[key])
print("---emojis removed----")
x=''.join(c for c in d[key] if c <= '\uFFFF')
print(x)
output example
---with emojis----
📍MN
🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿
📧Ashleybp5#gmail.com
#homeschool #bossmom #builder #momlife
---emojis removed----
MN
Wife • Homeschooling Mom to 5 • D Y I lover • Small town living in MN.
Ashleybp5#gmail.com
#homeschool #bossmom #builder #momlife
---with emojis----
Changing the way you charge.⚡️
Get exclusive product discounts, and help us reach our goal below!🔋
---emojis removed----
Changing the way you charge.⚡️
Get exclusive product discounts, and help us reach our goal below!
There is no technical definition of what an "emoji" is. Various glyphs may be used to render printable characters, symbols, control characters and the like. What seems like an "emoji" to you may be part of normal script to others.
What you probably want to do is to look at the Unicode category of each character and filter out various categories. While this does not solve the "emoji"-definition-problem per se, you get much better control over what you are actually doing without removing, for example, literally all characters of languages spoken by 2/3 of the planet.
Instead of filtering out certain categories, you may filter everything except the lower- and uppercase letters (and numbers). However, be aware that ꙭ is not "the googly eyes emoji" but the CYRILLIC SMALL LETTER DOUBLE MONOCULAR O, which is a normal lowercase letter to millions of people.
For example:
import unicodedata
s = "🍃Wife • Homeschooling Mom to 5 🐵 • D Y I lover 🔨 • Small town living in MN. 🌿"
# Just filter category "symbol"
t = ''.join(c for c in s if unicodedata.category(c) not in ('So', ))
print(t)
...results in
Wife • Homeschooling Mom to 5 • D Y I lover • Small town living in MN.
This may not be emoji-free enough, yet the • is technically a form of punctuation. So filter this as well
# Filter symbols and punctuations. You may want 'Cc' as well,
# to get rid of control characters. Beware that newlines are a
# form of control-character.
t = ''.join(c for c in s if unicodedata.category(c) not in ('So', 'Po'))
print(t)
And you get
Wife Homeschooling Mom to 5 D Y I lover Small town living in MN

The failure in using CRF+0.58 train NE Model

when i use CRF++0.58 to model a NE and progarm have a problem:
"reading training data:tagger.cpp(399) [feature_index_->buildFeatures(this)] 0.00s"
the develop environment:
red hat linux 6.5,gcc 5.0,CRF++0.58
written feature template:
template
dataset:
Boson_train.txt
Boson_test.txt
the first column is words ,the second column is pos,the third column is NER tagger
the problem:
when i want to train the NER model, i type this sentences "crf_learn -f 3 -c 4.0 template Boson_train crf_model", and i got
this notification, "reading training data:tagger.cpp(399) [feature_index_->buildFeatures(this)] 0.00s". I can't understand
the C++ language, so i can't fix the problem.
the method i tryed:
1.change the encode type of dataset. I use notepad++ to change "utf-8 with no BOM" to "utf-8". It didn't work.
2.change the delimiter from '\t' to ' '(space). It didn't work.
3.And i think maybe the template was wrong.So i use the crf++0.58/example/seg/template for test. It worked. But this template
is simple, so I use /example/JapaneseNE/template which is more similar with my feature template. It didn't work. Then, i check
the JapaneseNE example It works well. So i got confused. Is there someone can help me.
template
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,0]/%x[-1,0]/%x[0,0]
U06:%x[-1,0]/%x[0,0]/%x[1,0]
U07:%x[0,0]/%x[1,0]/%x[2,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]
U10:%x[-2,1]/%x[0,1]
U11:%x[-2,1]/%x[1,1]
U11:%x[-1,1]/%x[0,1]
U12:%x[0,0]/%x[0,1]
U13:%x[0,1]/%x[1,1]
U14:%x[0,1]/%x[2,1]
U15:%x[-1,0]/%x[0,1]
U16:%x[-1,0]/%x[-1,1]
U17:%x[1,0]/%x[1,1]
U18:%x[1,0]/%x[1,1]
U19:%x[2,0]/%x[2,1]
U20:%x[-1,2]
U21:%x[-2,2]
U22:%x[0,1]/%x[-1,2]
U23:%x[0,1]/%x[-2,2]
U24:%x[0,0]/%x[-1,2]
U25:%x[0,0]/%x[-2,2]
U26:%x[-1,2]/%x[-2,2]/%x[0,1]
U27:%x[-2,2]/%x[0,1]/%x[1,1]
U28:%x[-1,1]/%x[-1,2]/%x[0,1]
U29:%x[-1,2]/%x[0,0]/%x[0,1]
Boson_train
浙江 ns B_product_name
在线 b I_product_name
杭州 ns I_product_name
4 m B_time
月 m I_time
25 m I_time
日 m I_time
讯 ng Out
( x Out
记者 n Out
x Out
x B_person_name
施宇翔 nr I_person_name
x Out
通讯员 n B_person_name
x Out
方英 nr B_person_name
) x Out
毒贩 n Out
很 zg Out
“ x Out
时髦 nr Out
” x Out
, x Out
用 p Out
微信 vn B_product_name
交易 n Out
毒品 n Out
。 x Out
没 v Out
料想 v Out
警方 n B_person_name
也 d Out
You were debugging in the right direction. The issue is indeed with your template file.
Your training data has 3 columns (column 0:word, column 1:pos-tag and column 2:tag).
You cannot use the tag as feature, but your template file has reference to it (i.e, column 2) in many feature definitions (see, U20 to U29). Your training should work after removing/correcting these.
Hope this helps :)
You can also checkout these video tutorials for better understanding of Template Files and Training NER with CRF++ :
1) https://youtu.be/GJHeTvDkIaE
2) https://youtu.be/Ur5umC4BwN4

How can the NamedEntityTag be used as EntityMention in RelationMention in the RelationExtractor?

I'm trying to train my own NamedEntityRecognizer and RelationExtractor. I've managed the NER model, but the integration with the RelationExtractor is a bit tricky. I get the right NamedEntityTags, but the RelationMentions found by the are only one-term and with no extra NamedEntity than the default ones. I got input text:
America's President Nixon has passed a new law.
and I got the following output:
[Text=President CharacterOffsetBegin=10 CharacterOffsetEnd=19 PartOfSpeech=NNP Lemma=President NamedEntityTag=PRESIDENT]
[Text=Nixon CharacterOffsetBegin=20 CharacterOffsetEnd=25 PartOfSpeech=NNP Lemma=Nixon NamedEntityTag=PRESIDENT]
Extracted the following MachineReading relation mentions:
RelationMention [type=Live_In, start=0, end=4, {Live_In, 0.3719730539322602; OrgBased_In, 0.22490833335685645; _NR, 0.17474696244295865; Work_For, 0.11754788838692569; Located_In, 0.11082376188099895}
EntityMention [type=O, objectId=EntityMention-2, hstart=3, hend=4, estart=2, eend=4, headPosition=3, value="Nixon", corefID=-1]
EntityMention [type=LOCATION, objectId=EntityMention-1, hstart=0, hend=1, estart=0, eend=1, headPosition=0, value="America", corefID=-1]
]
You can match the text from the EntityMention with the NamedEntityText value from the NamedEntityTag.

Search in directory of files based on keywords from another file

Perl Newbie here and looking for some help.
I have a directory of files and a "keywords" file which has the attributes to search for and the attribute type.
For example:
Keywords.txt
Attribute1 boolean
Attribute2 boolean
Attribute3 search_and_extract
Attribute4 chunk
For each file in the directory, I have to:
lookup the keywords.txt
search based on Attribute type
something like the below.
IF attribute_type = boolean THEN
search for attribute;
set found = Y if attribute found;
ELSIF attribute_type = search_and_extract THEN
extract string where attribute is Found
ELSIF attribute_type = chunk THEN
extract the complete chunk of paragraph where attribute is found.
This is what I have so far and I'm sure there is a more efficient way to do this.
I'm hoping someone can guide me in the right direction to do the above.
Thanks & regards,
SiMa
# Reads attributes from config file
# First set boolean attributes. IF keyword is found in text,
# variable flag is set to Y else N
# End Code: For each text file in directory loop.
# Run the below for each document.
use strict;
use warnings;
# open Doc
open(DOC_FILE,'Final_CLP.txt');
while(<DOC_FILE>) {
chomp;
# open the file
open(FILE,'attribute_config.txt');
while (<FILE>) {
chomp;
($attribute,$attribute_type) = split("\t");
$is_boolean = ($attribute_type eq "boolean") ? "N" : "Y";
# For each boolean attribute, check if the keyword exists
# in the file and return Y or N
if ($is_boolean eq "Y") {
print "Yes\n";
# search for keyword in doc and assign values
}
print "Attribute: $attribute\n";
print "Attribute_Type: $attribute_type\n";
print "is_boolean: $is_boolean\n";
print "-----------\n";
}
close(FILE);
}
close(DOC_FILE);
exit;
It is a good idea to start your specs/question with a story ("I have a ..."). But
such a story - whether true or made up, because you can't disclose the truth -
should give
a vivid picture of the situation/problem/task
the reason(s) why all the work must be done
definitions for uncommon(ly used)terms
So I'd start with: I'm working in a prison and have to scan the emails
of the inmates for
names (like "Al Capone") mentioned anywhere in the text; the director
wants to read those mails in toto
order lines (like "weapon: AK 4711 quantity: 14"); the ordnance
officer wants those info to calculate the amount of ammunition and
rack space needed
paragraphs containing 'family'-keywords like "wife", "child", ...;
the parson wants to prepare her sermons efficiently
Taken for itself, each of the terms "keyword" (~running text) and
"attribute" (~structured text) of may be 'clear', but if both are applied
to "the X I have to search for", things get mushy. Instead of general ("chunk")
and technical ("string") terms, you should use 'real-world' (line) and
specific (paragraph) words. Samples of your input:
From: Robin Hood
To: Scarface
Hi Scarface,
tell Al Capone to send a car to the prison gate on sunday.
For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Regards
Robin
and your expected output:
--- Robin.txt ----
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife:
knife: Bowie quantity: 8
machine gun:
stinger rocket:
weapon:
weapon: AK 4711 quantity: 14
social relations paragaphs:
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Pseudo code should begin at the top level. If you start with
for each file in folder
load search list
process current file('s content) using search list
it's obvious that
load search list
for each file in folder
process current file using search list
would be much better.
Based on this story, examples, and top level plan, I would try to come
up with proof of concept code for a simplified version of the "process
current file('s content) using search list" task:
given file/text to search in and list of keywords/attributes
print file name
print "keywords:"
for each boolean item
print boolean item text
if found anywhere in whole text
print "Yes"
else
print "No"
print "order line:"
for each line item
print line item text
if found anywhere in whole text
print whole line
print "social relations paragaphs:"
for each paragraph
for each social relation item
if found
print paragraph
no need to check for other items
first implementation attempt:
use Modern::Perl;
#use English qw(-no_match_vars);
use English;
exit step_00();
sub step_00 {
# given file/text to search in
my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface
Hi Scarface,
tell Al Capone to send a car to the prison gate on sunday.
For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Regards
Robin
EOT
# print file name
say "--- Robin.txt ---";
# print "keywords:"
say "keywords:";
# for each boolean item
for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
# print boolean item text
printf " %s: ", $bi;
# if found anywhere in whole text
if ($whole_text =~ /$bi/) {
# print "Yes"
say "Yes";
# else
} else {
# print "No"
say "No";
}
}
# print "order line:"
say "order lines:";
# for each line item
for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
# print line item text
# if found anywhere in whole text
if ($whole_text =~ /^$li.*$/m) {
# print whole line
say " ", $MATCH;
}
}
# print "social relations paragaphs:"
say "social relations paragaphs:";
# for each paragraph
for my $para (split /\n\n/, $whole_text) {
# for each social relation item
for my $sr ("wife", "son", "husband") {
# if found
if ($para =~ /$sr/) {
## if ($para =~ /\b$sr\b/) {
# print paragraph
say $para;
# no need to check for other items
last;
}
}
}
return 0;
}
output:
perl 16953439.pl
--- Robin.txt ---
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife: Bowie quantity: 8
weapon: AK 4711 quantity: 14
social relations paragaphs:
tell Al Capone to send a car to the prison gate on sunday.
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Such (premature) code helps you to
clarify your specs (Should not-found keywords go into the output?
Is your search list really flat or should it be structured/grouped?)
check your assumptions about how to do things (Should the order line
search be done on the array of lines of thw whole text?)
identify topics for further research/rtfm (eg. regex (prison!))
plan your next steps (folder loop, read input file)
(in addition, people in the know will point out all my bad practices,
so you can avoid them from the start)
Good luck!

Resources