Fastbert: BertDataBunch error for multilabel text classification - nlp

I'm following the FastBert tutorial from huggingface https://medium.com/huggingface/introducing-fastbert-a-simple-deep-learning-library-for-bert-models-89ff763ad384
The problem is that the code is not exactly reproducible. The main issue I'm facing is the dataset preparation. The tutorial uses this dataset: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data
But if I set up the folder structure according to the tutorial and place the dataset files in the folders, I get errors with the databunch.
databunch = BertDataBunch(args['data_dir'], LABEL_PATH, args.model_name,
                          train_file='train.csv', val_file='val.csv',
                          test_data='test.csv',
                          text_col="comment_text", label_col=label_cols,
                          batch_size_per_gpu=args['train_batch_size'],
                          max_seq_length=args['max_seq_length'],
                          multi_gpu=args.multi_gpu, multi_label=True,
                          model_type=args.model_type)
It complains about the file format being wrong. How should I format the dataset and the labels for use with FastBert?

First of all, you can use the notebook from GitHub for FastBert.
https://github.com/kaushaltrivedi/fast-bert/blob/master/sample_notebooks/new-toxic-multilabel.ipynb
There is a small tutorial in the FastBert README on how to process the dataset before using.
Create a DataBunch object
The databunch object takes training, validation and test csv files and converts the data into internal representation for BERT, RoBERTa, DistilBERT or XLNet. The object also instantiates the correct data-loaders based on device profile and batch_size and max_sequence_length.
from fast_bert.data_cls import BertDataBunch
databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=False,
                          model_type='bert')
File format for train.csv and val.csv
index text label
0 Looking through the other comments, I'm amazed that there aren't any warnings to potential viewers of what they have to look forward to when renting this garbage. First off, I rented this thing with the understanding that it was a competently rendered Indiana Jones knock-off. neg
1 I've watched the first 17 episodes and this series is simply amazing! I haven't been this interested in an anime series since Neon Genesis Evangelion. This series is actually based off an h-game, which I'm not sure if it's been done before or not, I haven't played the game, but from what I've heard it follows it very well pos
2 This movie is nothing short of a dark, gritty masterpiece. I may be bias, as the Apartheid era is an area I've always felt for. pos
If your column names differ from the defaults (text and label), pass them to BertDataBunch through the text_col and label_col parameters.
labels.csv will contain a list of all unique labels. In this case the file will contain:
pos
neg
For multi-label classification, labels.csv will contain all possible labels:
toxic
severe_toxic
obscene
threat
insult
identity_hate
The file train.csv will then contain one column for each label, with each column value being either 0 or 1. Don't forget to set multi_label=True in BertDataBunch for multi-label classification.
id text toxic severe_toxic obscene threat insult identity_hate
0 Why the edits made under my username Hardcore Metallica Fan were reverted? 0 0 0 0 0 0
0 I will mess you up 1 0 0 1 0 0
label_col will be a list of label column names. In this case it will be:
['toxic','severe_toxic','obscene','threat','insult','identity_hate']
So, keep train.csv, val.csv (a copy of train.csv will do to get started), and test.csv inside data/.
In the labels folder, keep a labels.csv with the following contents:
toxic
severe_toxic
obscene
threat
insult
identity_hate
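A minimal sketch of that file preparation, using only the Python standard library. It assumes you pass in the path to the Kaggle train.csv, with its comment_text column and the six label columns. The function name prepare_fastbert_dirs, the val_fraction and seed parameters, and the random held-out split are illustrative choices of mine, not part of FastBert. Instead of copying train.csv to val.csv, it holds out a fraction of the rows for validation.

```python
import csv
import random
from pathlib import Path

# Label columns of the Jigsaw toxic-comment dataset, as listed above.
LABEL_COLS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def prepare_fastbert_dirs(source_csv, data_dir, label_dir, val_fraction=0.1, seed=0):
    """Write labels.csv plus train.csv/val.csv in the folder layout described above.

    Hypothetical helper for illustration; returns (n_train_rows, n_val_rows).
    """
    data_dir, label_dir = Path(data_dir), Path(label_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    label_dir.mkdir(parents=True, exist_ok=True)

    # labels.csv: one label name per line, no header row.
    (label_dir / "labels.csv").write_text("\n".join(LABEL_COLS) + "\n")

    # Read all rows first, so source_csv may even live inside data_dir.
    with open(source_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Shuffle reproducibly and hold out a validation slice (at least one row).
    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * val_fraction))
    splits = {"val.csv": rows[:n_val], "train.csv": rows[n_val:]}

    for name, subset in splits.items():
        with open(data_dir / name, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)
    return len(splits["train.csv"]), len(splits["val.csv"])
```

With the files written this way, the BertDataBunch call from the question would keep text_col="comment_text", pass the same list as label_col, and set multi_label=True.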

Related

Convert unknown labels to Yolov5

I own a dataset of images with unknown label format, which is:
angry_actor_104.jpg 0 28 113 226 141 22.9362 0
It indicates an image as follows:
image_name face_id_in_image face_box_top face_box_left face_box_right face_box_bottom face_box_confidence expression_label
My question is: How can this be converted into the yolov5 format?
I have been looking this up for a long time and hope someone can help.
Thank you very much in advance.
Since the format is unknown, you are unlikely to find existing code that handles the whole transformation, but I can share some tips to get started.
The annotations file alone does not have enough information to be converted to YOLO format, because you also need to know the dimensions of each image. If all of your images have the same dimensions this is easier, but if they differ you will need additional code to read the dimensions of each image. I will explain why below.
When you are done, you will need the images and labels in a specific directory structure like this, with one txt file per image:
/images/actor1.jpg
/images/actor2.jpg
/labels/actor1.txt
/labels/actor2.txt
This is the shape that you want to get the annotation files into.
face_id_in_image x_center_image y_center_image width height
There is a clear description of what the values mean here https://stackoverflow.com/a/66563144/5183735.
Now you need to do some math to calculate the values.
width = (face_box_right - face_box_left)/image_width
height = (face_box_bottom - face_box_top)/image_height
x_center_image = face_box_left/image_width + (width/2)
y_center_image = face_box_top/image_height + (height/2)
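The arithmetic above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the field order follows the format line in the question (top, left, right, bottom), and the image width and height must be supplied by you, since the annotation file does not contain them. The first output column follows the layout given above (face_id_in_image); YOLOv5 reads that column as the class index, so you may want to put expression_label there instead.

```python
def to_yolo_box(top, left, right, bottom, image_width, image_height):
    """Convert pixel box edges to normalized (x_center, y_center, width, height)."""
    width = (right - left) / image_width
    height = (bottom - top) / image_height
    x_center = left / image_width + width / 2
    y_center = top / image_height + height / 2
    return x_center, y_center, width, height


def convert_line(line, image_width, image_height):
    """Turn one annotation line into (image_name, yolo_label_line)."""
    # Field order as described in the question; confidence and expression are unused here.
    name, face_id, top, left, right, bottom, _confidence, _expression = line.split()
    box = to_yolo_box(float(top), float(left), float(right), float(bottom),
                      image_width, image_height)
    values = " ".join(f"{v:.6f}" for v in box)
    return name, f"{face_id} {values}"


# Example with the line from the question and a hypothetical 256x256 image.
name, label_line = convert_line("angry_actor_104.jpg 0 28 113 226 141 22.9362 0", 256, 256)
print(name, "->", label_line)
```

Each label_line would then be written to /labels/ in a txt file named after the image.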
I have some bits of code that may help you with reading the text file and saving the text files here.
https://github.com/pylabel-project/pylabel/blob/main/pylabel/exporter.py and https://github.com/pylabel-project/pylabel/blob/main/pylabel/importer.py.
If you are able to share your exact files I may be able to identify some shortcut to transform them.

Loading saved NER back into HuggingFace pipeline?

I am doing some research into HuggingFace's functionalities for transfer learning (specifically, for named entity recognition). To preface, I am a bit new to transformer architectures. I briefly walked through their example off of their website:
from transformers import pipeline
nlp = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
"close to the Manhattan Bridge which is visible from the window."
print(nlp(sequence))
What I would like to do is save and run this locally without having to download the "ner" model every time (which is over 1 GB in size). In their documentation, I see that you can save the pipeline using the "pipeline.save_pretrained()" function to a local folder. The results of this are various files which I am storing into a specific folder.
My question would be how can I load this model back up into a script to continue classifying as in the example above after saving? The output of "pipeline.save_pretrained()" is multiple files.
Here is what I have tried so far:
1: Following the documentation about pipeline
pipe = transformers.TokenClassificationPipeline(model="pytorch_model.bin", tokenizer='tokenizer_config.json')
The error I got was: 'str' object has no attribute "config"
2: Following HuggingFace example on ner:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model = AutoModelForTokenClassification.from_pretrained("path to folder following .save_pretrained()")
tokenizer = AutoTokenizer.from_pretrained("path to folder following .save_pretrained()")
label_list = [
"O", # Outside of a named entity
"B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
"I-MISC", # Miscellaneous entity
"B-PER", # Beginning of a person's name right after another person's name
"I-PER", # Person's name
"B-ORG", # Beginning of an organisation right after another organisation
"I-ORG", # Organisation
"B-LOC", # Beginning of a location right after another location
"I-LOC" # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
"close to the Manhattan Bridge."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
This yields an error: list index out of range
I also tried printing out just predictions which is not returning the text format of the tokens along with their entities.
Any help would be much appreciated!
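Loading a model like this has always worked for me:
from transformers import pipeline
# model_folder is the directory that save_pretrained() wrote its files to
model_folder = "path to folder following .save_pretrained()"
pipe = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
Have a look here for further examples of how to use pipelines.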

We have many mainframe files which are in EBCDIC format, is there a way in Python to parse or convert the mainframe file into csv file or text file?

I need to read the records from a mainframe file and apply some filters on the record values.
So I am looking for a solution to convert the mainframe file to CSV, text, or an Excel workbook so that I can easily perform operations on the file.
I also need to validate the record count.
Who said anything about EBCDIC? The OP didn't.
If it is all text then FTP'ing with EBCDIC to ASCII translation is doable, including within Python.
If not then either:
The extraction and conversion to CSV needs to happen on z/OS. Perhaps with a COBOL program. Then the CSV can be FTP'ed down with
or
The data has to be FTP'ed BINARY and then parsed and bits of it translated.
But, as so often is the case, we need more information.
I was recently processing the hardcopy log and wanted to break the records apart. I used Python to do this, as each record was effectively a fixed-position record with different data items at fixed locations. In my case the entire record was text, but one could easily apply this technique to convert various columns to an appropriate type.
Here is a sample record. I added a few lines to help visualize the data offsets used in the code to access the data:
          1         2         3         4         5         6         7         8         9
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
N 4000000 PROD     19114 06:27:04.07 JOB02679 00000090  $HASP373 PWUB02#C STARTED - INIT 17
Note the fixed column positions for the various items and how they are referenced by position. Using this technique you could process the file and create a CSV with the output you want for processing in Excel.
For my case I used Python 3.
def processBaseMessage(self, message):
    self.command = message[1]
    self.routing = list(message[2:9])
    self.routingCodes = []  # These are routing codes extracted from the system log.
    self.sysname = message[10:18]
    self.date = message[19:24]
    self.time = message[25:36]
    self.ident = message[37:45]
    self.msgflags = message[46:54]
    self.msg = [ message[56:] ]
You can then format the fields into whatever form you need for further processing. There are other ways to process mainframe data, with many variations, but based on the question this approach should suit your needs.
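As a rough sketch of how the slicing above could feed a CSV file, the following decodes fixed-length EBCDIC records and counts them along the way. Several things here are assumptions rather than facts about your data: that the file was transferred in binary, that every record has the same length (the lrecl parameter), and that the code page is cp037 (one of the EBCDIC codecs shipped with Python; yours may be cp500, cp1140, or another). The field offsets are the ones from the method above, and the function names are hypothetical.

```python
import csv
import io

# (name, start, end) offsets copied from the slicing code above.
FIELDS = [
    ("command", 1, 2),
    ("routing", 2, 9),
    ("sysname", 10, 18),
    ("date", 19, 24),
    ("time", 25, 36),
    ("ident", 37, 45),
    ("msgflags", 46, 54),
    ("msg", 56, None),
]


def parse_record(line):
    """Slice one fixed-position text record into a dict of stripped fields."""
    return {name: line[start:end].strip() for name, start, end in FIELDS}


def ebcdic_to_csv(raw, out, lrecl, encoding="cp037"):
    """Decode fixed-length EBCDIC records from bytes and write them as CSV rows.

    Returns the number of records written, so the count can be validated.
    """
    writer = csv.DictWriter(out, fieldnames=[name for name, _, _ in FIELDS])
    writer.writeheader()
    count = 0
    for offset in range(0, len(raw), lrecl):
        line = raw[offset:offset + lrecl].decode(encoding)
        writer.writerow(parse_record(line))
        count += 1
    return count


# Example: two copies of the sample record, padded to a hypothetical LRECL of 100.
sample = ("N 4000000 PROD     19114 06:27:04.07 JOB02679 00000090  "
          "$HASP373 PWUB02#C STARTED - INIT 17")
raw_bytes = sample.ljust(100).encode("cp037") * 2
buffer = io.StringIO()
record_count = ebcdic_to_csv(raw_bytes, buffer, lrecl=100)
print(record_count)
print(buffer.getvalue())
```

Filters on record values can be applied to the dict returned by parse_record before the row is written. Files containing packed-decimal or binary fields need per-field handling and are not covered by this sketch.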

How to load video (image sequences) to become data input for CNN-LSTM deep learning model?

In my project, I have two classes of data, "abnormal" and "normal". I have separate training and validation folders for these two types of data.
Each video folder contains the image frames of one video clip, and different video folders contain different numbers of frames. Now I need to load the frames into the model as input data and label each video's frames as one group. How can I do that? I use the Keras API. The structure is as follows:
train
------abnormal
--------------video1
---------------------image1
---------------------image2
---------------------image3
---------------------image...
--------------video2
--------------video3
--------------video...
------normal
(the same as above)
You can put all the images into a single folder and rename them with the class in the file name, such as 1.normal.jpg and 1.abnormal.jpg. This way you avoid the time taken to open each class folder and read its images.
Once you rename them, you can derive the labels with the following function:
def label(img):
    # The class is the second-to-last dot-separated part, e.g. '1.normal.jpg' -> 'normal'
    word_label = img.split('.')[-2]
    if word_label == 'normal':
        return 1
    elif word_label == 'abnormal':
        return 0
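If you would rather keep the folder layout from the question, so that the frames of each video stay together as one sequence, a sketch like the following could index it. It is an illustration only: the function name index_videos is hypothetical, the label values reuse the mapping from the function above, and frames are ordered by file name, which assumes names that sort correctly (for example zero-padded numbers). It returns file paths only. Loading the images and padding or sampling each video to a fixed number of frames for the CNN-LSTM is left out.

```python
from pathlib import Path

# Same mapping as the label() function above: normal -> 1, abnormal -> 0.
CLASS_TO_LABEL = {"normal": 1, "abnormal": 0}


def index_videos(split_dir):
    """Return a list of (frame_paths, label) pairs, one per video folder.

    Expects split_dir/<class_name>/<video_folder>/<frame files>.
    """
    samples = []
    for class_name, label_value in CLASS_TO_LABEL.items():
        class_dir = Path(split_dir) / class_name
        if not class_dir.is_dir():
            continue
        for video_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            # Sort by file name so frames keep their temporal order.
            frames = sorted(p for p in video_dir.iterdir() if p.is_file())
            if frames:
                samples.append((frames, label_value))
    return samples
```

Calling index_videos("train") would give one entry per video, each holding that video's frame paths and a single label.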

Basic importing coordinates into R and setting projection

Ok, I am trying to upload a .csv file, get it into a spatial points data frame and set the projection system to WGS 84. I then want to determine the distance between each point. This is what I have come up with:
cluster<-read.csv(file = "cluster.csv", stringsAsFactors=FALSE)
coordinates(cluster)<- ~Latitude+Longitude
cluster<-CRS("+proj=longlat +datum=WGS84")
d<-dist2Line(cluster)
This returns an error that says
Error in .pointsToMatrix(p) :
points should be vectors of length 2, matrices with 2 columns, or inheriting from a SpatialPoints* object
But this isn't working and I will be honest that I don't fully comprehend importing and manipulating spatial data in R. Any help would be great. Thanks
I was able to determine the issue I was running into. With WGS 84, coordinates() expects the longitude (x) before the latitude (y). This is just backwards from how all the GPS data I download is formatted (e.g. lat-long). Hope this helps anyone else who runs into this issue!
Thus the code should have been:
library(sp)
cluster <- read.csv(file = "cluster.csv", stringsAsFactors = FALSE)
coordinates(cluster) <- ~Longitude+Latitude
# Assign the CRS to the object; cluster <- CRS(...) would overwrite the data
proj4string(cluster) <- CRS("+proj=longlat +datum=WGS84")
