Weka: how to use clustering methods to group similar string patterns

I am using Weka's clustering methods to group similar string patterns. I first used Weka's "StringToWordVector" filter and then directly applied some clustering methods, but I can't get correct results. Could someone suggest correct methods to group this kind of data? This is a small part of my data:
@relation ponds
@ATTRIBUTE LCC string
@data
acegiadfgiacehiacehiacfhjacehjadfhjacfgiadfhjadfhjadfhjacfhjadf
acehiadfhjacehiadfhjadfhjadfhjadfhjacfhfhjacehj
acehiadfhjacehiadfhjadfhjadfhjadfhjacfhjadfhjadfhjadfhjadfhjadfhjacehj
acehiadfhjacehiadfhjadfhjacfhjaacehjadfhjadfhjadfhjacfhj
acehiadfhjacehikkkkkkkkkkk
In fact, every line of this data represents a frequent pattern extracted by a data mining algorithm, and each letter (a, c, e, ...) represents an attribute. However, the patterns (lines) do not all have the same number of attributes, so how can I use clustering methods to group similar patterns? Thank you very much! Looking forward to your response :)
David

Every string is different, so "string to word vector" will give them different vectors. Please read "bag of words model" for details.
You could try clustering with Levenshtein distance, but I would rather try designing some good features for your problem.
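A minimal sketch of the Levenshtein idea outside of Weka, assuming the python-Levenshtein and scikit-learn packages; the patterns and the number of clusters are illustrative:

```python
# Sketch: agglomerative clustering over a pairwise normalized Levenshtein
# distance matrix. Assumes python-Levenshtein and scikit-learn are installed.
import numpy as np
import Levenshtein
from sklearn.cluster import AgglomerativeClustering

patterns = [
    "acehiadfhjacehiadfhjadfhjadfhjadfhjacfhfhjacehj",
    "acehiadfhjacehiadfhjadfhjadfhjadfhjacfhjadfhjadfhjadfhjadfhjadfhjacehj",
    "acehiadfhjacehikkkkkkkkkkk",
]

# Pairwise edit distances, normalized by the longer string so that patterns
# of different lengths remain comparable.
n = len(patterns)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = Levenshtein.distance(patterns[i], patterns[j])
        dist[i, j] = dist[j, i] = d / max(len(patterns[i]), len(patterns[j]))

# Cluster directly on the precomputed distances; no vector space is needed.
# (On scikit-learn < 1.2 the parameter is called affinity= instead of metric=.)
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(dist)
print(labels)
```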

Related

What robust algorithm implementation can I use to perform phrase similarity with two inputs?

This is the problem:
I have two columns in my metadata database, "field name" and "field description".
I need to check whether the "field description" is actually a description and not some sort of transformation of the "field name".
[Edit] I need to avoid preprocessing the text to remove separators, as I would have to consider a long list of cases (e.g. _-;$%/^| etc.)
Examples:
row | field_name      | field_description
1   | my_first_field  | my first field
2   | my_second_field | my------second------field
3   | my_third_field  | this is a description about the field, the description can contain the name of the field itself
Here the 1st and 2nd examples are similar (and thus wrong), while the 3rd is correct.
I have tried some implementations based on the Levenshtein distance, difflib, cosine similarity and spaCy, but none of them was robust on my examples (returning only around a 50% similarity rate on the 1st example).
Some of the implementations I tried to use:
https://towardsdatascience.com/surprisingly-effective-way-to-name-matching-in-python-1a67328e670e
https://spacy.io/usage/linguistic-features#vectors-similarity
https://docs.python.org/3/library/difflib.html
is there a way to check similarity between two full sentences in python?
[Edit]
I have just tried the HuggingFace semantic textual similarity implementation, with nice results:
field_name        | field_description                      | Score
my_field_name     | my_field_name                          | 1.0000
second_field_name | second field name                      | 0.8483
third_field_name  | third-field-name                       | 0.8717
fourth_field_name | this is a correct description field    | 0.4591
fifth_field_name  | fifth_-------field_//////////////name  | 0.8454
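For reference, a minimal sketch of how such scores can be computed with the sentence-transformers library (the model name here is an assumption, not necessarily the one used for the table above):

```python
# Sketch: semantic similarity between field names and descriptions.
# Assumes the sentence-transformers package; the model choice is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("my_field_name", "my_field_name"),
    ("second_field_name", "second field name"),
    ("fourth_field_name", "this is a correct description field"),
]

for name, desc in pairs:
    # Encode both strings and compare them with cosine similarity.
    emb = model.encode([name, desc], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    print(f"{name!r} vs {desc!r}: {score:.4f}")
```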
For your examples, the Levenshtein edit distance would work very well. It can also be 'customized', or you could use some preprocessing depending on your data.
But your text description of the problem makes me think that the real problem is likely much more complex, and maybe not even easy to define formally. It looks like you actually need a more semantic method, and this would probably require training a model with annotated data.
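As an illustration of the edit-distance route with no extra dependencies, here is a minimal sketch using difflib's similarity ratio instead of Levenshtein; the 0.6 threshold is an assumption and would need tuning on real data:

```python
# Sketch: flag descriptions that look like a mere transformation of the field
# name, using difflib (standard library). The threshold is an assumption.
from difflib import SequenceMatcher

def looks_like_transformation(field_name: str, field_description: str,
                              threshold: float = 0.6) -> bool:
    ratio = SequenceMatcher(None, field_name.lower(),
                            field_description.lower()).ratio()
    return ratio >= threshold

print(looks_like_transformation("my_first_field", "my first field"))        # True
print(looks_like_transformation("my_third_field",
                                "this is a description about the field"))   # False
```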

Determine text similarity through cluster analysis

I am a senior bachelor student in CS and I am currently working on my thesis. For this thesis I wrote a program that uses a density-based clustering approach, more specifically the OPTICS algorithm. I have an idea of how to use it, but I don't know if it is valid.
I want to use this algorithm for text classification. Texts are points in the set that have to be clustered, so that the resulting hierarchy consists of categories and subcategories of texts. For example, one such set is "Scientific literature", consisting of subsets "Mathematics", "Biology" etc.
I came up with the idea that I can analyze each text for specific words that occur in that particular text more often than in the whole dataset, while excluding insignificant words like prepositions. Perhaps I can use open-source natural language parsers for that purpose, like the Stanford parser. After that, the program combines these "characteristic words" from all texts into one set, and a certain number of the most frequent words can be taken from that set. That number becomes the dimensionality for the clustering, and each word's frequency in a particular text is used as a coordinate of a point. Thus we can cluster them.
The question is: is that idea valid, or complete nonsense? Can clustering in general, and density-based clustering in particular, be used for this kind of classification? Maybe there is some literature that can point me in the right direction?
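A minimal sketch of the vectorization idea described in the question, assuming scikit-learn; the documents are hypothetical placeholders:

```python
# Sketch: word-frequency vectors (stop words removed) clustered with OPTICS.
# Assumes scikit-learn; the documents are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import OPTICS

documents = [
    "theorem proof lemma algebra topology proof",
    "cell protein enzyme dna membrane protein",
    "integral derivative manifold algebra theorem",
    "genome species evolution dna protein cell",
]

# Each word's frequency in a document becomes one coordinate of that document.
vectors = CountVectorizer(stop_words="english").fit_transform(documents)

# Density-based clustering of the resulting points; cosine distance copes
# better with documents of different lengths than Euclidean distance.
labels = OPTICS(min_samples=2, metric="cosine").fit_predict(vectors.toarray())
print(labels)
```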
Clustering != classification.
Run the clustering algorithm, and study the results. Most likely, there will not be a cluster "scientific literature" with subjects "mathematics" - what do you do then?
Also, clusters will only give you sets, which is too coarse for similarity search. Quite the opposite: you first need to solve the similarity problem before you can run clustering algorithms such as OPTICS.
The "idea" you described is pretty much what everybody has been trying for years already.

Clustering a long list of words

I have the following problem at hand: I have a very long list of words, possibly names, surnames, etc. I need to cluster this word list so that similar words, for example words with a small edit (Levenshtein) distance, appear in the same cluster. For example, "algorithm" and "alogrithm" should have a high chance of appearing in the same cluster.
I am well aware of classical unsupervised clustering methods like k-means and EM clustering from the pattern recognition literature. The problem is that these methods work on points which reside in a vector space, whereas what I have at hand are strings. According to my survey efforts so far, the question of how to represent strings in a numerical vector space and how to calculate the "means" of string clusters is not sufficiently answered. A naive approach would be to combine k-means clustering with the Levenshtein distance, but the question still remains: how do you represent the "mean" of a set of strings? There is the TF-IDF weight, but it seems to be mostly related to the clustering of "text documents", not of single words. There appear to be some special string clustering algorithms, like the one at http://pike.psu.edu/cleandb06/papers/CameraReady_120.pdf
My search in this area is still going on, but I also wanted to get ideas from here. What would you recommend in this case? Is anyone aware of any methods for this kind of problem?
Don't look for clustering. This is misleading. Most algorithms will (more or less forcefully) break your data into a predefined number of groups, no matter what. That k-means isn't the right type of algorithm for your problem should be rather obvious, isn't it?
What you describe sounds less like clustering and more like near-duplicate detection; the difference is the scale. A clustering algorithm will produce "macro" clusters, e.g. divide your data set into 10 clusters. What you probably want is for much of your data not to be clustered at all; rather, you want to merge near-duplicate strings, which may stem from errors, right?
Levenshtein distance with a threshold is probably what you need. You can try to accelerate this by using hashing techniques, for example.
Similarly, TF-IDF is the wrong tool. It's used for clustering texts, not strings. TF-IDF is the weight assigned to a single word (string; but it is assumed that this string does not contain spelling errors!) within a larger document. It doesn't work well on short documents, and it won't work at all on single-word strings.
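A minimal sketch of the threshold idea, assuming the python-Levenshtein package; the distance threshold of 2 is an assumption:

```python
# Sketch: merge near-duplicate strings with a simple edit-distance threshold.
# Assumes python-Levenshtein; the threshold is an assumption. For large lists
# you would add a blocking/hashing step instead of comparing against every group.
import Levenshtein

def merge_near_duplicates(words, max_distance=2):
    groups = []  # each group is a list of mutually similar strings
    for w in words:
        for group in groups:
            if Levenshtein.distance(w, group[0]) <= max_distance:
                group.append(w)
                break
        else:
            groups.append([w])
    return groups

print(merge_near_duplicates(["algorithm", "alogrithm", "smith", "smyth", "jones"]))
# [['algorithm', 'alogrithm'], ['smith', 'smyth'], ['jones']]
```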
I have encountered the same kind of problem. My approach was to create a graph where each string is a node and each edge connects two nodes with a weight equal to the similarity of the two strings. You can use edit distance or Sorensen similarity for that. I also set a threshold of 0.2, so that the graph is not complete and therefore not too computationally heavy. After forming the graph you can use community detection algorithms to detect communities of nodes. Each community is formed by nodes that have many edges with each other, so they will be very similar to each other. You can use networkx or igraph to form the graph and identify each community; each community will then be a cluster of strings. I tested this approach with some strings that I wanted to cluster. Here are some of the identified clusters:
University cluster
Council cluster
Committee cluster
I visualised the graph with the Gephi tool.
Hope that helps even if it is quite late.
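A minimal sketch of this graph-based approach, assuming networkx and python-Levenshtein; the strings are illustrative, and the 0.2 threshold follows the description above:

```python
# Sketch: similarity graph + community detection, as described above.
# Assumes networkx and python-Levenshtein; the strings are illustrative.
import Levenshtein
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

strings = ["university of oxford", "oxford university", "city council",
           "county council", "ethics committee", "steering committee"]

G = nx.Graph()
G.add_nodes_from(strings)

# Connect pairs whose normalized similarity exceeds the threshold.
for i, a in enumerate(strings):
    for b in strings[i + 1:]:
        sim = Levenshtein.ratio(a, b)  # normalized similarity in [0, 1]
        if sim > 0.2:
            G.add_edge(a, b, weight=sim)

# Each detected community is one cluster of similar strings.
for community in greedy_modularity_communities(G, weight="weight"):
    print(sorted(community))
```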

How to represent text documents as feature vectors for text classification?

I have around 10,000 text documents.
How to represent them as feature vectors, so that I can use them for text classification?
Is there any tool which does the feature vector representation automatically?
The easiest approach is to go with the bag of words model. You represent each document as an unordered collection of words.
You probably want to strip out punctuation and you may want to ignore case. You might also want to remove common words like 'and', 'or' and 'the'.
To adapt this into a feature vector you could choose (say) 10,000 representative words from your sample, and have a binary vector v[i,j] = 1 if document i contains word j and v[i,j] = 0 otherwise.
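A minimal sketch of that binary bag-of-words representation, assuming scikit-learn; the documents are placeholders:

```python
# Sketch: binary bag-of-words feature vectors, as described above.
# Assumes scikit-learn; the documents are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]

# binary=True gives v[i, j] = 1 if document i contains word j, else 0;
# stop_words="english" drops common words like 'and', 'or' and 'the'.
vectorizer = CountVectorizer(binary=True, stop_words="english", lowercase=True)
X = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # get_feature_names() on old versions
print(X.toarray())
```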
To give a really good answer to the question, it would be helpful to know what kind of classification you are interested in: based on genre, author, sentiment, etc. For stylistic classification, for example, the function words are important; for a classification based on content they are just noise and are usually filtered out using a stop word list.
If you are interested in a classification based on content, you may want to use a weighting scheme like term frequency / inverse document frequency (tf-idf), in order to give more weight to words which are typical for a document and comparatively rare in the whole text collection. This assumes a vector space model of your texts, i.e. a bag-of-words representation of the text. (See the Wikipedia articles on the vector space model and tf-idf.) Usually tf-idf will yield better results than a binary scheme which only records whether a term occurs in a document.
This approach is so established and common that machine learning libraries like Python's scikit-learn offer convenience methods which convert the text collection into a matrix using tf/idf as a weighting scheme.
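For instance, a minimal sketch with scikit-learn's TfidfVectorizer; the documents are placeholders:

```python
# Sketch: tf-idf weighted document-term matrix, as mentioned above.
# Assumes scikit-learn; the documents are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "stock markets fell sharply on monday",
    "the striker scored twice in the final",
    "bond and stock prices rebounded today",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)  # shape: (n_documents, n_terms)

# Words typical of one document and rare in the collection get higher weight.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```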

Finding related texts (correlation between two texts)

I'm trying to find similar articles in a database via correlation.
So I split each text into an array of words, delete frequently used words (articles, pronouns and so on), and then compare two texts with a Pearson correlation coefficient function. For some texts it works, but for others it does not work well (longer texts get higher coefficients).
Can somebody advise a good method to find related texts?
Some of the problems you mention boil down to normalizing over document length and overall word frequency. Try tf-idf.
First and foremost, you need to specify what you precisely mean by similarity and when two documents are (more/less) similar.
If the similarity you are looking for is literal, then I would vectorise the documents using term frequencies and use the cosine similarity to compare them, since such vectors are inherently directional data. tf-idf and log-entropy weighting schemes may be worth testing depending on your use case. The edit distance is inefficient for long texts.
If you care more about the semantics, word embeddings are your ally.
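A minimal sketch of the literal-similarity option (term vectors plus cosine similarity), assuming scikit-learn; the texts are placeholders:

```python
# Sketch: literal similarity via tf-idf vectors and cosine similarity.
# Assumes scikit-learn; the texts are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "The central bank raised interest rates to curb inflation.",
    "Interest rates were raised by the central bank to fight inflation.",
    "The local team won the championship after a dramatic final.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Cosine similarity is length-normalized, so long and short texts compare fairly.
print(cosine_similarity(X).round(2))
```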
