I have a pandas DataFrame with a column of sentences in six different languages, and a second column stating which language each sentence belongs to. However, a sentence may contain non-letter ASCII characters such as =, # and so on, as well as words that do not belong to the tagged language even though they are written in the same script. For example, consider the sentence below, which has been marked as Spanish:
'¿Vas a venir a la tienda conmigo?+== #loja' #Note that 'loja' is a Portuguese word.
Since the sentence is marked as Spanish, I would like to remove all non-Spanish words and all non-punctuation symbols (+, =, #).
My idea for removing the non-punctuation symbols is to collect the set of characters that occur in the column and drop the ones that are neither letters nor punctuation (there are only a few punctuation characters, so no search is needed). However, could someone help me remove the words that do not belong to the tagged language, such as the Portuguese word in the example above, using Python?
Thanks & Best Regards
Michael
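One possible approach to the question above is a vocabulary lookup per language. The following is only a minimal sketch: the VOCAB dictionary, its language codes, and the clean_sentence helper are hypothetical names invented for illustration, and the tiny word sets stand in for real word lists that would have to be loaded from elsewhere. Words shared between languages (Spanish and Portuguese have many) will survive this kind of filter.

```python
import re

# Hypothetical, tiny per-language vocabularies (assumption for illustration);
# real word lists would be far larger and loaded from files.
VOCAB = {
    "es": {"vas", "a", "venir", "la", "tienda", "conmigo"},
    "pt": {"loja", "vai", "comigo"},
}

# A token is either a run of letters (Unicode-aware, excluding digits and
# underscore) or one of the punctuation marks we want to keep.
# Symbols such as +, = and # match neither branch and are dropped.
TOKEN_RE = re.compile(r"[^\W\d_]+|[¿?¡!.,;:]")


def clean_sentence(sentence, lang):
    """Keep punctuation and the words found in the vocabulary for `lang`."""
    vocab = VOCAB[lang]
    kept = []
    for token in TOKEN_RE.findall(sentence):
        if token[0].isalpha():
            if token.lower() in vocab:
                kept.append(token)
        else:
            kept.append(token)
    text = " ".join(kept)
    # Re-attach punctuation to the neighbouring word.
    text = re.sub(r"\s+([?!.,;:])", r"\1", text)
    text = re.sub(r"([¿¡])\s+", r"\1", text)
    return text


print(clean_sentence("¿Vas a venir a la tienda conmigo?+== #loja", "es"))
```

On a DataFrame this could be applied row by row, for example with df.apply(lambda row: clean_sentence(row["sentence"], row["lang"]), axis=1), where the column names "sentence" and "lang" are assumptions.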
I have a master corpus containing thousands of popular novels. I then have a .csv file that's a list containing about 450 different phrases (rhetorical_devices.csv). I am trying to use regex to do two things with these data:
Return a boolean telling me whether or not any phrase from the .csv list is present in the master_corpus.
Search for and then count the number of exact-match phrases between the .csv list and the master_corpus. I don't need to know which phrases matched, just the number of matches.
The .csv list is almost all multi-word phrases, things like:
huffed loudly
felt light-headed
couldn't they?
stop!
Some of the phrases contain pieces of punctuation that are relevant to my search, so for example, I need to be able to ID "couldn't they?" with the words in that exact order, question mark included. I keep getting all sorts of hits on sentences that contain "couldn't" and "they" and "?" in any random order. For this example, "They couldn't just stop?" is returning 2 hits for the count. Seems like my code is just looking for all of the words rather than them in the correct order and containing stipulated punctuation.
Right now, this is my attempt at a boolean, where master_corpus is all of the novels:
phrase_list = self.corpora['rhetorical_devices.csv'][0].to_list()
phrase_list = [i.lower() for i in phrase_list]
regex = '|'.join(phrase_list)
return bool(re.search(regex, master_corpus.lower()))
I think the ! and ? from the list are ending up as regex operators, but also I'm not sure how to import the list and make sure I'm looking for those exact matches.
Any help would be greatly appreciated.
Instead of using a regex, you should loop over the phrases like Mike L suggested:
total_matches = 0
corpus = master_corpus.lower()
for phrase in phrase_list:
    # str.count treats the phrase literally, so ? and ! need no escaping
    total_matches += corpus.count(phrase)
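If a regex is still preferred, the usual fix for the problem described in the question is to escape each phrase with re.escape before joining, so that ? and ! are matched literally. The sketch below is one way to do that; the helper names build_phrase_regex, any_phrase_present and count_phrase_matches are made up for this example and are not part of the original code.

```python
import re


def build_phrase_regex(phrases):
    """Compile an alternation of literally-escaped, lowercased phrases."""
    # Longest first, so a longer phrase wins over a shorter one that is
    # its prefix when both could match at the same position.
    unique = sorted({p.lower() for p in phrases if p}, key=len, reverse=True)
    if not unique:
        return re.compile(r"(?!)")  # pattern that never matches
    return re.compile("|".join(re.escape(p) for p in unique))


def any_phrase_present(phrases, corpus):
    """True if at least one phrase occurs verbatim in the corpus."""
    return bool(build_phrase_regex(phrases).search(corpus.lower()))


def count_phrase_matches(phrases, corpus):
    """Number of non-overlapping phrase matches, scanning left to right."""
    return len(build_phrase_regex(phrases).findall(corpus.lower()))


phrases = ["huffed loudly", "felt light-headed", "couldn't they?", "stop!"]
print(any_phrase_present(phrases, "They couldn't just stop?"))
print(count_phrase_matches(phrases, 'She huffed loudly. "Stop!" Couldn\'t they? Stop!'))
```

Note that the two counting strategies can differ: the alternation consumes each stretch of text at most once, while the loop with str.count counts every phrase independently, so a phrase contained inside another phrase is counted by the loop but not by the regex.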
We have a 5000-line text file containing words like so:
BANKS
BEING AFRAID OF DOGS
This is a SENTENCE.
Just another sentence.
COUNTRY
Using vim, I want to capitalize the words only in the lines where all the words are in uppercase (meaning lines 3 and 4 should be left untouched). In other words, what I expect to get is:
Banks
Being Afraid Of Dogs
This is a SENTENCE.
Just another sentence.
Country
This combines two Vim techniques, described in "Power of g" and "Switching case of characters".
First, restrict the command to lines that contain only uppercase letters and spaces, using g/^[A-Z ]*$/
Then convert each word on those lines to title case with s/\<\(\w\)\(\w*\)\>/\u\1\L\2/g
The whole command will be
:g/^[A-Z ]*$/s/\<\(\w\)\(\w*\)\>/\u\1\L\2/g
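For checking the result outside Vim, the same two-step logic can be sketched in Python. This is only an illustrative equivalent under the same assumptions as the Vim command (ASCII uppercase letters and spaces); the function name title_case_upper_lines is made up for this example.

```python
import re

# Same line filter as the :g pattern: only uppercase ASCII letters and spaces.
UPPER_ONLY = re.compile(r"^[A-Z ]*$")
# Same word pattern as the :s command: first character, then the rest.
WORD = re.compile(r"\b(\w)(\w*)\b")


def title_case_upper_lines(text):
    """Title-case the words on all-uppercase lines; leave other lines alone."""
    out = []
    for line in text.split("\n"):
        if UPPER_ONLY.match(line):
            line = WORD.sub(
                lambda m: m.group(1).upper() + m.group(2).lower(), line
            )
        out.append(line)
    return "\n".join(out)


sample = "BANKS\nBEING AFRAID OF DOGS\nThis is a SENTENCE.\nJust another sentence.\nCOUNTRY"
print(title_case_upper_lines(sample))
```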
Let's say I have a text corpus with inconsistently written bi-grams. An example would be "bi gram", "bi-gram", "bigram". Is there any standard text preprocessing method to normalize all these as the same thing? i.e. replace all such occurrences by "bigram". I should also mention that I have no prior knowledge of what exact bi-grams are present in the corpus.
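One heuristic that needs no prior list of bi-grams is to merge two adjacent tokens whenever their concatenation already occurs as a single token somewhere in the corpus. The sketch below illustrates that idea only; it is not a standard library method, the function names are invented here, and the heuristic can produce false merges (for example "a" + "part" when "apart" is in the vocabulary), so a frequency threshold would probably be needed in practice.

```python
import re


def tokenize(text):
    """Lowercase word tokens; hyphens and spaces both act as separators."""
    return re.findall(r"\w+", text.lower())


def merge_split_compounds(tokens, vocabulary):
    """Join adjacent tokens when the joined form is itself a known token."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in vocabulary:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


text = "A bi gram model, a bi-gram model and a bigram model."
tokens = tokenize(text)
vocabulary = set(tokens)  # every token seen in the corpus
print(merge_split_compounds(tokens, vocabulary))
```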
Another thing I'm curious about: spell correction of standard words such as common nouns is easy, but what about proper nouns? I'm assuming that the correct spelling occurs more frequently than the incorrect one. For example, I might have a pandas Series of text in which the majority of rows contain "California", but a few contain "Califonria" as well.
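Under the stated assumption that the correct spelling is the more frequent one, a frequency-guided fuzzy match is one way to sketch this. The example uses difflib.get_close_matches from the standard library; the function name build_corrections and the cutoff and min_ratio thresholds are assumptions chosen for illustration and would need tuning on real data.

```python
from collections import Counter
from difflib import get_close_matches


def build_corrections(tokens, cutoff=0.85, min_ratio=5):
    """Map rare spellings to a much more frequent, very similar spelling."""
    counts = Counter(tokens)
    corrections = {}
    canonical = []  # spellings accepted as correct so far
    # Visit tokens from most to least frequent, so that frequent spellings
    # become the reference forms for rarer look-alikes.
    for word, n in counts.most_common():
        match = get_close_matches(word, canonical, n=1, cutoff=cutoff)
        if match and counts[match[0]] >= min_ratio * n:
            corrections[word] = match[0]
        else:
            canonical.append(word)
    return corrections


tokens = ["California"] * 20 + ["Califonria"] * 2 + ["Colorado"] * 10
print(build_corrections(tokens))
```

The resulting mapping could then be applied to a pandas Series with its replace or map methods. Genuinely distinct names that happen to look alike would be merged by this heuristic, so the output is worth reviewing by hand.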
When converting the name 'Lukasieicz' to Soundex (LETTER,DIGIT,DIGIT,DIGIT,DIGIT), I come up with L2222.
However, I am being told by my lecture slides that the actual answer is supposed to be L2220.
Please explain why my answer is incorrect, or if the lecture answer was just a typo or something.
my steps:
Lukasieicz
remove and keep L
ukasieicz
Remove contiguous duplicate characters
ukasieicz
remove A,E,H,I,O,U,W,Y
KSCZ
convert up to first four remaining letters to soundex (as described in lecture directions)
2222
append beginning letter
L2222
If this is American Soundex as defined by the National Archives, you're both wrong. American Soundex consists of one letter and three numbers, so the result can be neither L2222 nor L2220. It's L222.
But let's say they added another number for some reason.
The basic substitution gives L2222. But you're supposed to collapse adjacent letters with the same numbers (step 3 below) and then pad with zeros if necessary (step 4).
3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
4. If you have too few letters in your word that you can't assign [four] numbers, append with zeros until there are [four] numbers. If you have more than [4] letters, just retain the first [4] numbers.
Lukasieicz # the original word
L_2_2___22 # replace with numbers, leave the gaps in
L_2_2___2 # apply step 3 and squeeze adjacent numbers
L2220 # apply step 4 and pad to four numbers
We can check how conventional (i.e. three-number) Soundex implementations behave with the shorter name Lukacz, which becomes L_2_22. Following rules 3 and 4, it should be L220.
The National Archives recommends an online Soundex calculator which produces L220. So does PostgreSQL and Text::Soundex in both its original flavor and NARA implementations.
$ perl -wle 'use Text::Soundex; print soundex("Lukacz"); print soundex_nara("Lukacz")'
L220
L220
MySQL, predictably, is doing its own thing and returns L200. As its documentation puts it:
This function implements the original Soundex algorithm, not the more popular enhanced version (also described by D. Knuth). The difference is that original version discards vowels first and duplicates second, whereas the enhanced version discards duplicates first and vowels second.
In conclusion, you forgot the squeeze step.
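The rules discussed above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the digits parameter is an addition made here so that the same function can produce both the standard three-number code and the four-number variant from the lecture, and only ASCII letters are handled meaningfully.

```python
# Letter-to-digit table of American Soundex.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name, digits=3):
    """American Soundex; digits=3 is the standard, digits=4 is the lecture's variant."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = []
    prev = CODES.get(first, "")  # the squeeze rule also covers the first letter
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            result.append(code)
        prev = code  # a vowel resets prev, so the same code may repeat after it
    # Pad with zeros, then truncate to one letter plus `digits` numbers.
    return (first + "".join(result) + "0" * digits)[: digits + 1]


print(soundex("Lukasieicz"))            # standard three-number form
print(soundex("Lukasieicz", digits=4))  # four-number form from the question
print(soundex("Lukacz"))
```

Tracking the previous code while skipping H and W is what implements the squeeze step that the question's manual procedure left out.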
Suppose the given word is "connnggggggrrrraaatsss" and we need to convert it to "congrats".
As another example, "looooooovvvvvveeeeee" should be changed to "love".
Each letter in the given word may be repeated any number of times, but the word should be changed to its correct form. We need to write a Java-based program.
You cannot simply collapse every repeated letter, because some words legitimately contain doubled letters in their spelling. So one way you could go is:
restrict the number of consecutive appearances of each letter in the word to two
then check the new spelling with a spell checker; you might want to try Hunspell, as it is widely used by many word processors.
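The two steps above can be sketched as follows. The example is written in Python for brevity, although the question asks for Java; the regex step carries over to Java's String.replaceAll with the same pattern. Instead of calling Hunspell, the sketch uses a small hypothetical DICTIONARY set as a stand-in for the spell checker, and tries every way of keeping each run at one or two letters.

```python
import re
from itertools import groupby, product

# Hypothetical stand-in for a real spell checker such as Hunspell.
DICTIONARY = {"congrats", "love", "good", "god"}


def squeeze_repeats(word, max_run=2):
    """Step 1: cap every run of the same character at max_run copies."""
    return re.sub(r"(.)\1{%d,}" % max_run, lambda m: m.group(1) * max_run, word)


def candidate_spellings(word, max_run=2):
    """All spellings obtained by shortening each run to 1..max_run letters."""
    runs = [(ch, min(len(list(grp)), max_run)) for ch, grp in groupby(word)]
    options = [[ch * n for n in range(1, length + 1)] for ch, length in runs]
    return {"".join(parts) for parts in product(*options)}


def normalize(word, dictionary=DICTIONARY):
    """Step 2: keep only the candidates the dictionary recognises."""
    squeezed = squeeze_repeats(word.lower())
    return sorted(candidate_spellings(squeezed) & dictionary)


print(squeeze_repeats("connnggggggrrrraaatsss"))
print(normalize("connnggggggrrrraaatsss"))
print(normalize("looooooovvvvvveeeeee"))
print(normalize("goood"))
```

The last call shows why a dictionary alone cannot always decide: both "god" and "good" are valid candidates, so some ranking (for example by word frequency) would still be needed. The number of candidates doubles with every repeated run, which is fine for single words but worth bounding for very long inputs.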