I am trying to build a text classifier using Mallet. The data is fairly big, so I am looking for a way, if possible, to run the "import" task on multiple threads because it is taking a long time to load. A few questions here:
Is there a way to manually parallelize the process by dividing the data, importing the pieces separately, and then joining them? I know I can run the imports in parallel and get multiple input files, but can I combine the resulting Mallet input files before training the classifier?
Does Mallet itself parallelize this process if there are available threads on the machine?
Thanks for the help!
Actually, your questions don't seem to be directly related to Mallet. To answer your second question: Mallet doesn't do this by itself. However, you can split the text into equal parts, keep them all in the same folder, and give Mallet the path to that folder. This link can help you achieve it; you need to follow the instructions in the "One instance per file" part.
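For the splitting step, a minimal sketch in plain Python (not Mallet code) could look like the following. It breaks one large corpus file into a one-file-per-instance layout, one folder per label, so the chunks can be prepared in parallel and then imported together. The input format (one "label<TAB>text" line per instance) and the import-dir call in the final comment are assumptions on my side.

```python
# Sketch: split "label<TAB>text" lines into one file per instance,
# grouped into one folder per label, for Mallet's "one instance per file" import.
import os

def split_corpus(corpus_path, out_dir):
    with open(corpus_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            label, text = line.rstrip("\n").split("\t", 1)
            label_dir = os.path.join(out_dir, label)      # one folder per class
            os.makedirs(label_dir, exist_ok=True)
            with open(os.path.join(label_dir, f"instance_{i}.txt"),
                      "w", encoding="utf-8") as out:
                out.write(text)

split_corpus("corpus.tsv", "mallet_input")
# Then import everything in one go, e.g.:
#   bin/mallet import-dir --input mallet_input/* --output train.mallet
```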
I'm new to NLP. I am looking for recommendations for an annotation tool to create a labeled NER dataset from raw texts.
In detail:
I'm trying to create a labeled dataset for specific types of entities in order to develop my own NER project (rule-based at first).
I assumed there would be some friendly frameworks that allow creating tagging projects, tagging text data, building a labeled dataset, and even sharing projects so several people could work on the same one, but I'm struggling to find one (I admit "friendly" or "intuitive" are subjective, yet this is my experience).
So far I've tried several frameworks:
I tried LightTag. It makes the tagging itself fast and easy (i.e. marking the words and giving them labels), but the entire process of creating a useful dataset is not as intuitive as I expected (i.e. uploading the text files, splitting into different tagging objects, saving the tags, etc.).
I've installed and tried LabelStudio and found it less mature than LightTag (I don't mean to judge here :))
I've also read about spaCy's Prodigy, a paid annotation tool. I would consider purchasing it, but their website only offers a live demo of the tagging phase, and I can't assess whether their product is superior to the other two products above.
Even on StackOverflow, the latest question I found on this matter is over 5 years old.
Do you have any recommendation for a tool to create a labeled NER dataset from raw text?
⚠️ Disclaimer
I am the author of Acharya. I would limit my answers to the points raised in the question.
Based on your question, Acharya would help you create the project, upload your raw text data, and annotate it to create a labeled dataset.
It would allow you to mark records individually for train or test in the dataset and would give data-centric reports to identify and fix annotation/labeling errors.
It allows you to add different algorithms (bring your own algorithm) to the project and train the model regularly. Once trained, it can give annotation suggestions from the trained models on untagged data to make the labeling process faster.
If you want to train in a different setup, it allows you to export the labeled dataset in multiple supported formats.
Currently, it does not support sharing of projects.
Acharya community edition is in alpha release.
github page (https://github.com/astutic/Acharya)
website (https://acharya.astutic.com/)
Doccano is another open-source annotation tool that you can check out https://github.com/doccano/doccano
I have used both DOCCANO (https://github.com/doccano/doccano) and BRAT (https://brat.nlplab.org/).
I find the latter very good, and it supports more functions. Both are free to use.
I am studying the decision tree algorithm and reading the sklearn source code. While reading the part about splitting on a feature when building the decision tree, I ran into a question about the _splitter.pyx file, which is located in the folder sklearn/tree. I have two questions. First, does the algorithm choose a feature randomly every time it performs a split? Second, given that randomness, can one feature be chosen more than once? I am quite confused about this and would appreciate any help. The file sklearn/tree/_splitter.pyx is at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_splitter.pyx
So, a little bit on my problem.
TL;DR
Can I use machine-learning instead of Elastic Search to find results depending on the user's text input? Is it a good idea?
I am working on a car spare parts project, and we have split the car into 300 parts that we store on the database, with some data for each part (weight, availability, etc).
When the customer inputs the text of his part, we need to be able to classify the part, and map it to one in our database.
Currently, people on our team manually map the customer text to the parts in our database, and we want to automate that process.
We tried using MongoDB text search, but it was often inaccurate since parts have different names in different parts of the country.
So we wanted something that gives more accurate results and improves the more data we have. We immediately considered TensorFlow. After some research and taking part of Google's Machine Learning Crash Course, I got to the point where it specified:
Models can't learn from string values, so you'll have to perform some feature engineering to convert those values to something numeric
That would be useful if we had a limited number of string features, but we don't know what the user will input as text.
So, my questions are:
1- Can we use Machine Learning to map text input by the user with some documents on our database?
2- If we can do that, is it a good idea to favor it over other search tools like ElasticSearch?
3- Can ElasticSearch improve its results the more data we have? How?
4- How would you go about this problem?
Note: I'd be doing that in Node.js, and since TensorFlow.js is new, I am inclined to go for other solutions, but if push comes to shove, and the results are much better, I would definitely go there.
TL;DR: Yes and yes.
TS;WM:
This is a perfectly suited problem for machine learning, especially if you have a database of past customer texts that have already been mapped to parts. Ideally, you have hundreds of texts mapped to each part. If that data is available, you can design and train a network. And models can learn from string values with some feature engineering; it's not that bad.
I'm not sure ElasticSearch would improve much on the network. I don't know much about auto parts trading, but as a wild guess, "the large round thingy that helps change direction" would never be mapped to "steering wheel" by ES but could be learned easily by a network - provided there are at least some examples of people using that text to specify steering wheel.
You can, but don't necessarily have to, use tensorflow.js for your network. The model could run on your server as a web service; you'd just send the customer's text to it, and it would send back its recommendations of part SKUs and names.
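To make the idea concrete, here is a minimal sketch in Python with scikit-learn (not TensorFlow.js; the library choice, the part names, and the example texts below are all made up for illustration): a TF-IDF representation plus a linear classifier that maps free customer text to a part in your catalog.

```python
# Sketch: map free-form customer text to a known part, given historical
# (text -> part) mappings produced by your team. All data here is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["front brake pad", "the round thing you steer with",
         "steering wheel cover", "rear brake disc"]
parts = ["brake_pad", "steering_wheel", "steering_wheel_cover", "brake_disc"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, parts)

# At request time, classify the customer's text into one of your parts.
print(model.predict(["front brake pad replacement"])[0])  # likely "brake_pad"
```

In practice you would train on the hundreds of mapped texts per part mentioned above, and you could expose this behind a small web service that your Node.js backend calls.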
I have an NLP task where I need to make sure that a paragraph of multiple sentences includes at least one well-structured question. I'm using OpenNLP to generate the parse trees for the paragraph. My questions are:
1- Is there a way to get a list of possible parse trees for a properly structured question?
2- How can I compare two parse trees?
Thanks
Well, you have actually answered the question yourself. You just have to get a dataset containing different types of questions and play with it.
Get different types of questions and the parse trees corresponding to them. Save all the output parse trees in a format you can use in the next step.
When it comes to comparing two parse trees, it's basically comparing text, which is quite a simple task.
But obviously, doing it like this will take more time and memory if you work directly with text files. To help with that, convert and save the parse trees of your standard questions in a binary format; this will take less time and memory when combined with the next step.
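For illustration, here is a minimal sketch in Python (assuming the OpenNLP parses are exported as Penn-Treebank-style bracketed strings; the two example parses below are made up). It compares only the structure of two trees by replacing every word with a placeholder before comparing:

```python
# Sketch: compare two bracketed parse trees by structure only.
from nltk import Tree

def skeleton(bracketed):
    """Reduce a bracketed parse to its structure, dropping the actual words."""
    tree = Tree.fromstring(bracketed)
    for pos in tree.treepositions('leaves'):
        tree[pos] = 'X'                    # replace each token with a placeholder
    return tree.pformat(margin=10**6)      # one-line canonical form

# Made-up example parses: a question vs. a declarative sentence.
question  = "(SBARQ (WHNP (WP What)) (SQ (VBZ is) (NP (DT the) (NN time))) (. ?))"
statement = "(S (NP (PRP It)) (VP (VBZ is) (NP (NN noon))) (. .))"

print(skeleton(question) == skeleton(statement))   # False: different structures
```

You could then pickle or hash these skeleton strings for the binary storage mentioned above.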
Hope this helps,All the best!
I have a piece of software that processes one picture and gives me some results for that picture, and a database which contains a lot of pictures.
I would like to build a distributed architecture in order to process these pictures on multiple servers and save time.
I heard about Spark and searched about it, but I'm not sure that this solution is good for me. Nevertheless, I don't want to miss something.
Indeed, in all the examples I found for Spark, it's always dealing with tasks/jobs that can be split into smaller tasks/jobs.
For example, a text can be split into multiple smaller texts, and so the word count can be processed easily.
However, when I use my software, I need to give a whole picture and not just parts of it.
So, is it possible to give Spark a task which contains 10 pictures (for example), and then have Spark split it into smaller tasks (1 task = 1 picture) and send each picture to a worker?
And if it's possible, is this very efficient? I actually heard about Celery and I'm wondering if this kind of solution is better for my case.
Thank you for your help! :)
I think it depends on what you mean by "a lot of pictures" and how often you will get them. If you have tens of thousands of pictures and you will get them frequently, then Spark will definitely be a good solution.
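For the mechanics of your question, here is a minimal PySpark sketch (the function name process_picture and the paths are made up; you would call your existing software there, e.g. via a subprocess). Each path becomes one task, so every worker receives whole pictures, never fragments of one:

```python
# Sketch: one picture per Spark task, processed in parallel on the workers.
from pyspark.sql import SparkSession

def process_picture(path):
    # Placeholder: invoke your existing software on this one whole picture
    # (e.g. via subprocess) and return its results.
    return {"picture": path, "result": "..."}

spark = SparkSession.builder.appName("picture-processing").getOrCreate()

# Made-up list of picture paths pulled from your database.
picture_paths = ["/data/pictures/img_001.jpg", "/data/pictures/img_002.jpg"]

rdd = spark.sparkContext.parallelize(picture_paths, numSlices=len(picture_paths))
results = rdd.map(process_picture).collect()

for r in results:
    print(r)
```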
From an architecture and requirements viewpoint, I think either Spark or Storm will fit the bill. My main concern would be whether the overhead is justified. This talk, for instance, is about real-time image processing with Spark:
https://www.youtube.com/watch?v=I6qmEcGNgDo
You could also look at this Quora thread:
https://www.quora.com/Has-anyone-developed-computer-vision-image-processing-algorithms-on-Twitter-Storm