I'm currently trying to fine-tune DistilGPT-2 (with PyTorch and the Hugging Face transformers library) for a code completion task. My corpus is arranged like the following example:
<|startoftext|>
public class FindCityByIdService {
private CityRepository cityRepository = ...
<|endoftext|>
My first attempt was to run the following script from the transformers library:
python run_clm.py \
--model_type=gpt2 \
--model_name_or_path distilgpt2 \
--do_train \
--train_file $TRAIN_FILE \
--num_train_epochs 100 \
--output_dir $OUTPUT_DIR \
--overwrite_output_dir \
--save_steps 20000 \
--per_device_train_batch_size 4
After doing some generation tests, I realized that the model never predicts \n for any given context. I imagine that some preprocessing stage or something similar is missing. In any case, what should I do so that \n is predicted as expected?
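The generation test looked roughly like this (a minimal sketch; the prompt is only an illustration and the model directory is a placeholder for the --output_dir used above):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "output_dir"  # placeholder: the fine-tuned model directory
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

prompt = "public class FindCityByIdService {"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(repr(tokenizer.decode(outputs[0])))  # the decoded text contains no "\n" at all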
(HF Forum question)
Thanks!!
I think I found a hacky solution for this.
In run_clm.py change:
def tokenize_function(examples):
    return tokenizer(examples[text_column_name])
to:
def tokenize_function(examples):
    return tokenizer([example + "\n" for example in examples[text_column_name]])
When the dataset is initially built, the text file is split into lines without keeping the newline at the end of each line. The group_texts method then concatenates the lines into blocks without adding newlines back. Changing tokenize_function to append \n to each line gives us those newlines back.
Just tested this change on my fine-tuning job and it worked: the resulting model now generates newlines.
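As a quick sanity check (a sketch outside run_clm.py), you can confirm that GPT-2's byte-level BPE has a dedicated newline token and that it shows up once "\n" is appended:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
print(tok("\n").input_ids)                                   # a single id for the newline token
print(tok.convert_ids_to_tokens(tok("hello\n").input_ids))   # the newline appears as 'Ċ'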
I have a problem with my output file. Every time I run my program, it needs to write an answer to the dataframe as a new row, in string format. (Output example: 1,5,14,45,99.)
My task is checked automatically by a program like:
PYSPARK_PYTHON=/opt/conda/envs/dsenv/bin/python spark-submit \
--master yarn \
--name checker \
projects/3/shortest_path.py 12 34 /datasets/twitter/twitter.tsv hw3_output
This program produces an output file with only one row, but on my local notebook it works even across several runs.
Here is the part of my program that writes the output:
output = sys.argv[4]  # the output path, e.g. hw3_output in the spark-submit call above
d = [[answer]]
df_out = spark.createDataFrame(data=d)
df_out.write.format("csv").options(delimiter='\n').mode('append').save(output)
Can you please suggest a way to modify my program, or point out what is going wrong?
I have tried changing the options of .save in dozens of combinations.
Does anyone know if I can get the full vocabulary of the GloVe model?
I want to do the same thing that this video does for BERT [at 15:40]: https://www.youtube.com/watch?v=zJW57aCBCTk&ab_channel=ChrisMcCormickAI
The GloVe vectors and their vocabulary are simply distributed as (space-separated column) text files. On a Unix-derived OS, you can get the vocabulary with a command like:
cut -f 1 -d ' ' glove.6B.50d.txt
If you'd like to do it in Python, the following works. The only trick is that the files use no quoting: GloVe simply uses a space as the delimiter, and spaces are not allowed inside tokens.
import csv

vocab = set()
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=" ", quoting=csv.QUOTE_NONE, escapechar=None)
    for row in reader:
        vocab.add(row[0])  # the first column of each line is the token
print(vocab)
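If you want the vectors as well as the vocabulary, the same no-quoting format means each line is the token followed by its floating-point values, so a plain split also works (a sketch; same file name as above):

import numpy as np

embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        token, *values = line.rstrip().split(" ")
        embeddings[token] = np.asarray(values, dtype="float32")

print(len(embeddings))        # vocabulary size
print(embeddings["the"][:5])  # first few dimensions of one vector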
I tried to follow this.
But somehow I wasted a lot of time and ended up with nothing useful.
I just want to train a GloVe model on my own corpus (~900Mb corpus.txt file).
I downloaded the files provided in the link above and compiled them using Cygwin (after editing the demo.sh file to set VOCAB_FILE=corpus.txt; should I leave CORPUS=text8 unchanged?).
The output was:
cooccurrence.bin
cooccurrence.shuf.bin
text8
corpus.txt
vectors.txt
How can I use those files to load a GloVe model in Python?
You can do it using the glove_python library:
Install it: pip install glove_python
Then:
from glove import Corpus, Glove

# Create a corpus object; `lines` should be an iterable of tokenized
# sentences (lists of string tokens) built from your corpus.txt
corpus = Corpus()

# Fit the corpus to build the co-occurrence matrix used by GloVe
corpus.fit(lines, window=10)

glove = Glove(no_components=5, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')
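Once saved, you can load the model back and query it, for example (a sketch assuming the glove_python API and the file saved above; 'city' is just a placeholder for a word in your corpus):

from glove import Glove

glove = Glove.load('glove.model')
print(glove.most_similar('city', number=10))  # nearest neighbours by the learned vectors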
Reference: word vectorization using glove
This is how you get and build it:
$ git clone http://github.com/stanfordnlp/glove
$ cd glove && make
To train it on your own corpus, you just have to make changes to one file: demo.sh.
Remove the script from if to fi after 'make'.
Replace the CORPUS value with your file name, 'corpus.txt'.
There is another if block at the end of demo.sh:
if [ "$CORPUS" = 'text8' ]; then
Replace text8 there with your file name as well.
Run demo.sh once the changes are made:
$ ./demo.sh
Make sure your corpus file is in the correct format. You'll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. If your corpus has multiple documents, the documents (only) should be separated by newline characters.
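For reference, after those edits the variable block at the top of demo.sh reduces to roughly the following (the exact values may differ in your copy; CORPUS is your input file, BUILDDIR holds the compiled binaries, and the rest are output paths):

CORPUS=corpus.txt
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors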
Your corpus should go into the CORPUS variable. vectors.txt is the output, which is what you want. You can also train GloVe in Python, but it takes more time and you need a C compilation environment. I tried it before and wouldn't recommend it.
Here is my take on this:
After cloning the repository, edit the demo.sh file: since you are training on your own corpus, replace the CORPUS value with your file's name.
Then remove the script between make and the CORPUS assignment, as that part only downloads an example corpus.
Then run make, which creates the four binaries in the build folder.
Now run ./demo.sh, which trains on your corpus and does everything mentioned in the script; the output is written to the vectors.txt file.
Note: don't forget to keep your corpus file directly inside the GloVe folder.
I am using TensorFlow with Python to build a model to train on my data. I am following the Google tutorial about the pre-trained inception_v4 model. Here is the link: Fine-tuning a model from an existing checkpoint.
When I run the command
$ python train_image_classifier.py --train_dir=${TRAIN_DIR} \
--dataset_dir=${DATASET_DIR} --dataset_name=flowers \
--dataset_split_name=train --model_name=inception_v4 \
--checkpoint_path=${CHECKPOINT_PATH} \
--checkpoint_exclude_scopes=InceptionV4/Logits,InceptionV4/AuxLogits \
--trainable_scopes=InceptionV4/Logits,InceptionV4/AuxLogits
I get some errors.
Error information
I am wondering what I can do to fix them. Or is there any other way to use the Google pre-trained inception_v4 model?
Thanks for any help!
I want to truncate all tokens in a corpus to have a maximum length of 5 characters. Is there a way to set the --token-regex import option in MALLET to accomplish this? The code I'm currently using to import documents is this:
mallet-2.0.7/bin/mallet import-dir --input mallet-2.0.7/data/journals/ --output mallet-2.0.7/tmp/topic-input-journals.mallet --keep-sequence --remove-stopwords --stoplist-file mallet-2.0.7/stoplists/tr.txt --token-regex '\p{L}[\p{L}\p{P}]*\p{L}'
If this is not possible in the MALLET import command, I’d appreciate suggestions on how to do the same in R.
Yes, you can modify the token regex so that it keeps only words of at most 5 (or n) characters, using this regular expression:
\b\w{1,5}\b
where \b is a word boundary, \w is a word character, and {1,5} sets the minimum (1) and maximum (5) number of characters.
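As a quick check of what that pattern keeps (a small Python sketch; note that it keeps only words of at most five characters, so longer words are skipped rather than shortened):

import re

# Same pattern as the --token-regex above: whole words of 1-5 word characters.
print(re.findall(r"\b\w{1,5}\b", "tokenization of longer words is truncated"))
# ['of', 'words', 'is']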
Your command line should be:
mallet-2.0.7/bin/mallet import-dir --input mallet-2.0.7/data/journals/ --output mallet-2.0.7/tmp/topic-input-journals.mallet --keep-sequence --remove-stopwords --stoplist-file mallet-2.0.7/stoplists/tr.txt --token-regex '\b\w{1,5}\b'
In Java:
pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\b\\w{1,5}\\b")));
Hope this helps.