I want to truncate all tokens in a corpus to have a maximum length of 5 characters. Is there a way to set the --token-regex import option in MALLET to accomplish this? The code I'm currently using to import documents is this:
mallet-2.0.7/bin/mallet import-dir --input mallet-2.0.7/data/journals/ --output mallet-2.0.7/tmp/topic-input-journals.mallet --keep-sequence --remove-stopwords --stoplist-file mallet-2.0.7/stoplists/tr.txt --token-regex '\p{L}[\p{L}\p{P}]*\p{L}'
If this is not possible in the MALLET import command, I’d appreciate suggestions on how to do the same in R.
Yes, you can modify the token regex so that it only keeps words of at most 5 (or n) characters, using this regular expression:
\b\w{1,5}\b
where \b is a word boundary, \w matches a word character, and {1,5} sets the minimum (1) and maximum (5) number of characters.
Your command line should be:
mallet-2.0.7/bin/mallet import-dir --input mallet-2.0.7/data/journals/ --output mallet-2.0.7/tmp/topic-input-journals.mallet --keep-sequence --remove-stopwords --stoplist-file mallet-2.0.7/stoplists/tr.txt --token-regex '\b\w{1,5}\b'
In Java:
pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\b\\w{1,5}\\b")));
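If you'd like to sanity-check what the pattern keeps before running the import, here is a quick illustration in plain Python (the example sentence is made up; MALLET itself uses Java regexes, but \b, \w and {1,5} behave the same way here):
import re
# Only tokens of 1-5 word characters survive; longer words are dropped entirely.
print(re.findall(r"\b\w{1,5}\b", "topic modelling with mallet"))
# ['topic', 'with']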
Hope this helps.
Does anyone know if I can get the full vocabulary of the GloVe model?
I'm looking to do the same thing this guy does with BERT in this video [at 15:40]: https://www.youtube.com/watch?v=zJW57aCBCTk&ab_channel=ChrisMcCormickAI
The GloVe vectors and their vocabulary are simply distributed as (space-separated column) text files. On a Unix-derived OS, you can get the vocabulary with a command like:
cut -f 1 -d ' ' glove.6B.50d.txt
If you'd like to do it in Python, the following works. The only trick is that the files use no quoting. Rather, the GloVe files simply use space as a delimiter and space is not allowed inside tokens.
import csv
vocab = set()
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=" ", quoting=csv.QUOTE_NONE, escapechar=None)
    for row in reader:
        vocab.add(row[0])
print(vocab)
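Because there is no quoting to worry about, a plain split works just as well if you'd rather not involve csv; this is only an equivalent sketch of the same idea:
vocab = set()
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        # The word is everything before the first space; the rest is the vector.
        vocab.add(line.split(" ", 1)[0])
print(len(vocab))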
I'm currently trying to fine-tune DistilGPT-2 (with Pytorch and HuggingFace transformers library) for a code completion task. My corpus is arranged like the following example:
<|startoftext|>
public class FindCityByIdService {
private CityRepository cityRepository = ...
<|endoftext|>
My first attempt was to run the following script from the transformers library:
python run_clm.py \
--model_type=gpt2 \
--model_name_or_path distilgpt2 \
--do_train \
--train_file $TRAIN_FILE \
--num_train_epochs 100 \
--output_dir $OUTPUT_DIR \
--overwrite_output_dir \
--save_steps 20000 \
--per_device_train_batch_size 4
After doing some generation tests, I realized that the model is not predicting \n for any given context. I imagine that some pre-processing step or something similar is missing. But anyway, what should I do so that \n is predicted as expected?
HF Forum question
Thanks!!
I think I found a hacky solution for this.
In run_clm.py change:
def tokenize_function(examples):
    return tokenizer(examples[text_column_name])
to:
def tokenize_function(examples):
    return tokenizer([example + "\n" for example in examples[text_column_name]])
When the Dataset is initially built, the text file is split into lines without keeping the newline at the end of each line. Then the group_texts method concatenates them into batches without adding the newlines back. So changing tokenize_function to append \n to each line restores those newlines.
Just tested this change on my fine-tuning job and it worked! The resulting model now generates newlines.
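As a quick sanity check (just a sketch assuming the stock distilgpt2 tokenizer), you can verify that \n really is a token in GPT-2's byte-level BPE vocabulary, so once the newlines survive preprocessing the model can learn to emit them:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
# The newline maps to the byte-level symbol "Ċ", a single token id.
print(tok.convert_ids_to_tokens(tok("\n")["input_ids"]))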
spaCy's POS tagger is really convenient: it can tag a raw sentence directly.
import spacy
sp = spacy.load('en_core_web_sm')
sen = sp(u"I am eating")
But I'm using the tokenizer from NLTK. So how do I use an already-tokenized sentence like
['I', 'am', 'eating'] rather than 'I am eating' with spaCy's tagger?
BTW, where can I find detailed spaCy documentation?
I can only find an overview on the official website.
Thanks.
There are two options:
You write a wrapper around the nltk tokenizer and use it to convert text to spaCy's Doc format. Then overwrite nlp.tokenizer with that new custom function. More info here: https://spacy.io/usage/linguistic-features#custom-tokenizer.
Generate a Doc directly from a list of strings, like so:
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=[u"I", u"am", u"eating", u"."],
          spaces=[True, True, False, False])
Defining the spaces is optional - if you leave it out, each word will be followed by a space by default. This matters when using e.g. the doc.text afterwards. More information here: https://spacy.io/usage/linguistic-features#own-annotations
[edit]: note that nlp and doc are sort of 'standard' variable names in spaCy; they correspond to the variables sp and sen respectively in your code.
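Putting option 2 together end to end, a minimal sketch (assuming en_core_web_sm is installed) could look like this; the tagger and the other pipeline components are applied to the pre-built Doc by calling them directly:
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
# Tokens coming from an external tokenizer, e.g. nltk.word_tokenize
doc = Doc(nlp.vocab, words=["I", "am", "eating"])
# Run the pipeline components (tagger, parser, ...) on the custom Doc
for name, proc in nlp.pipeline:
    doc = proc(doc)
print([(token.text, token.pos_, token.tag_) for token in doc])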
When I tokenize text that contains both Chinese and English, the segmenter splits the English words into individual letters, which is not what I want. Consider the following code:
from nltk.tokenize.stanford_segmenter import StanfordSegmenter
segmenter = StanfordSegmenter()
segmenter.default_config('zh')
print(segmenter.segment('哈佛大学的Melissa Dell'))
The output will be 哈佛大学 的 M e l i s s a D e l l. How do I modify this behavior?
You could try jieba.
>>> import jieba
>>> jieba.lcut('哈佛大学的Melissa Dell')
['哈佛大学', '的', 'Melissa', ' ', 'Dell']
I can't speak for nltk, but Stanford CoreNLP will not exhibit this behavior when run on this sentence.
If you issue this command on your example you get proper tokenization:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file example.txt -outputFormat text
You might want to look into using stanza if you want to access Stanford CoreNLP via Python.
More info here: https://github.com/stanfordnlp/stanza
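If you do switch to stanza, a minimal sketch looks like this (it assumes the Chinese models have already been downloaded with stanza.download('zh'); the tokenizer should keep 'Melissa' and 'Dell' intact):
import stanza
# stanza.download('zh')  # one-time model download
nlp = stanza.Pipeline(lang='zh', processors='tokenize')
doc = nlp('哈佛大学的Melissa Dell')
print([token.text for sentence in doc.sentences for token in sentence.tokens])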
A text file contains words in brackets, e.g. '[Rahul] is a good batsman'. I want to identify the bracketed words and tag them with an '\O' sign, i.e. the output will be 'Rahul\O is a good batsman'.
How can I do it?
input: [Rahul] is a good batsman. #written in a file
output: Rahul\O is a good batsman. #written in a file
Using a regular expression:
>>> import re
>>> re.sub(r'\[([A-Za-z ]+)\]', r'\1\\O', '[Rahul] is a good batsman')
'Rahul\\O is a good batsman'
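Since the text lives in a file, a minimal read/write sketch (input.txt and output.txt are just placeholder names) could be:
import re

with open('input.txt', encoding='utf-8') as fin, \
        open('output.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        # Replace each bracketed word with the word followed by a literal \O tag.
        fout.write(re.sub(r'\[([A-Za-z ]+)\]', r'\1\\O', line))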