I have a pandas DataFrame with a column of sentences in six different languages and a second column stating which language each sentence belongs to. However, a sentence may contain non-letter ASCII characters such as =, #, etc., as well as words that do not belong to the tagged language, even though they may be written in the same script. For example, consider the sentence below, which has been marked as Spanish:
'¿Vas a venir a la tienda conmigo?+== #loja' #Note that 'loja' is a Portuguese word.
Since the sentence is marked as Spanish, I would like to remove all non-Spanish words and all stray symbols that are not punctuation (+, =, #).
I have an idea for removing the symbols: collect the set of characters that occur and drop the ones that are not letters (there are only a few punctuation characters, so they can be listed by hand). However, could someone help me remove the words that do not belong to the tagged language, such as the Portuguese word in the example above, using Python?
Thanks & Best Regards
Michael
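A minimal sketch of the idea described above, assuming a per-language word list is available. The names SPANISH_VOCAB, SYMBOLS_TO_DROP and clean_sentence are hypothetical, and the tiny vocabulary is only a stand-in for a real dictionary or language-identification step:

```python
import re

# Hypothetical stand-in for a real Spanish word list.
SPANISH_VOCAB = {"vas", "a", "venir", "la", "tienda", "conmigo"}

# Symbols to strip; punctuation such as "¿" and "?" is deliberately kept.
SYMBOLS_TO_DROP = "+=#"


def clean_sentence(sentence, vocabulary):
    """Drop unwanted symbols, then drop tokens whose letters are not in vocabulary."""
    # Replace each unwanted symbol with a space so neighbouring words stay separate.
    stripped = re.sub(f"[{re.escape(SYMBOLS_TO_DROP)}]", " ", sentence)
    kept = []
    for token in stripped.split():
        # Compare only the word characters, lowercased, against the vocabulary.
        core = re.sub(r"\W", "", token).lower()
        if core in vocabulary:
            kept.append(token)
    return " ".join(kept)


print(clean_sentence("¿Vas a venir a la tienda conmigo?+== #loja", SPANISH_VOCAB))
```

With a DataFrame, the same function could be applied row by row, picking the vocabulary from the language column. Tokens that consist only of punctuation are dropped by this sketch, and words shared between two languages cannot be told apart by a plain word list.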
For instance, I need to find double words like "Saw saw" and replace them with triple words like "Saw saw saw" in a txt file.
I'm thinking I'll be using :%s/pattern/replace/g
There will be multiple instances of this in the txt file, so I need to write something universal that works for different words.
A possible solution is:
%s/\c\([Ss]aw\) \1/\1 \1 \1/g
where \1 is a backreference to the first capturing group \([Ss]aw\). To force a lowercase first letter in the second and third occurrences, put \l before them in the replacement string:
%s/\c\([Ss]aw\) \1/\1 \l\1 \l\1/g
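For comparison, here is a rough Python sketch of the same substitution generalized to any repeated word rather than only "saw". It is not a Vim command; the function name triple_doubled_words is hypothetical, and \w+ is assumed to be an acceptable definition of a word:

```python
import re

# A word, a single space, and the same word again (case-insensitive backreference).
DOUBLED = re.compile(r"\b(\w+) \1\b", re.IGNORECASE)


def triple_doubled_words(text):
    """Turn 'Saw saw' into 'Saw saw saw' for any doubled word."""
    def repl(match):
        word = match.group(1)
        # Mirror Vim's \l: lowercase only the first character of the repeats.
        lowered = word[0].lower() + word[1:]
        return f"{word} {lowered} {lowered}"
    return DOUBLED.sub(repl, text)


print(triple_doubled_words("Saw saw the log, then the the end."))
```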
I won't say I am totally new to VB, but I am no expert either. I have a university task in which I am writing a Pig Latin program. I have everything sorted out except the part where punctuation is involved. I rearrange the characters in a word based on vowels and consonants, but I also need to make sure that any punctuation marks stay where they are. For instance, bat is changed to atbway, but if it has a period at the end, i.e. "bat.", the output has to be "atbway." with the period still at the end.
This also applies to words wrapped in quotation marks: the quotes need to stay as they are while the word between them is rearranged. I can put a function in place to check the first and last character of a word, but how do I check whether these characters are punctuation?
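One way to approach this, sketched here in Python rather than VB: peel punctuation off both ends of the token, rearrange only the core, and glue the punctuation back on. The helper names and the one-line rule used in the demo are hypothetical stand-ins for the program's actual Pig Latin logic. In VB.NET, Char.IsPunctuation could play the role that string.punctuation plays here.

```python
import string


def split_punctuation(token):
    """Return (leading punctuation, core word, trailing punctuation)."""
    start = 0
    while start < len(token) and token[start] in string.punctuation:
        start += 1
    end = len(token)
    while end > start and token[end - 1] in string.punctuation:
        end -= 1
    return token[:start], token[start:end], token[end:]


def transform_keeping_punctuation(token, transform):
    """Apply transform to the core word only, leaving edge punctuation in place."""
    lead, core, trail = split_punctuation(token)
    return lead + (transform(core) if core else "") + trail


# Hypothetical stand-in rule that reproduces the bat -> atbway example.
def demo_rule(word):
    return word[1:] + word[0] + "way"


print(transform_keeping_punctuation("bat.", demo_rule))
print(transform_keeping_punctuation('"bat"', demo_rule))
```

Note that string.punctuation covers ASCII punctuation only, so typographic quotes would need to be added to the check.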
What is the best way to solve this problem? Is there an algorithm for it?
"In a paragraph, we have to find and print all the words that share the same first three letters. Example: we input a paragraph and as output we get groups of words like:
a) 1. you 2. your 3. yours 4. yourself
b) 1. early 2. earlier 3. earliest
In this way we get all the words of the paragraph that have their first three letters in common."
A reasonable solution that's not too hard to code up is to maintain a map where the keys are the first three letters of each word and the values are the sets of words that start with those three letters. You can scan across the words in the paragraph and, for each one you encounter, take its first three letters, look up the map entry corresponding to those letters, and add the word to that set. You can then iterate over the map at the end, find all sets containing at least two words, and print out each cluster you find.
Overall, the runtime of this approach is O(L), where L is the total length of all the words in the paragraph. To see this, notice that for each word, we do a map lookup on a constant-sized prefix of that word, then copy all the characters of the word into the map. Overall, this visits each character at most a constant number of times.
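The map-based approach above can be sketched in Python as follows. The function name group_by_prefix is hypothetical, and the sketch assumes that words are runs of ASCII letters, that case is ignored, and that words shorter than three letters are skipped:

```python
import re
from collections import defaultdict


def group_by_prefix(paragraph, prefix_len=3):
    """Map each shared prefix to the sorted words that start with it."""
    groups = defaultdict(set)
    for word in re.findall(r"[A-Za-z]+", paragraph.lower()):
        if len(word) >= prefix_len:
            groups[word[:prefix_len]].add(word)
    # Keep only prefixes shared by at least two distinct words.
    return {prefix: sorted(words) for prefix, words in groups.items() if len(words) >= 2}


text = ("You said your book is yours; do it yourself. "
        "The early bird is earlier than the earliest.")
for prefix, words in group_by_prefix(text).items():
    print(prefix, words)
```

Using sets as the map values means a word that occurs several times in the paragraph is reported only once.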
A trie keyed on the first three characters, with the word index stored at the leaf, should do the trick.
I'm learning Vim and can't wrap my head around the difference between word and WORD.
I got the following from the Vim manual.
A word consists of a sequence of letters, digits and underscores, or a
sequence of other non-blank characters, separated with white space
(spaces, tabs, <EOL>). This can be changed with the 'iskeyword'
option. An empty line is also considered to be a word.
A WORD consists of a sequence of non-blank characters, separated with
white space. An empty line is also considered to be a WORD.
I feel word and WORD are just the same thing. They are both a sequence of non-blank chars separated with white spaces. An empty line can be considered as both word and WORD.
Question:
What's the difference between them?
And why/when would someone use WORD over word?
I've already searched Google and SO, but their search engines interpret WORD as just word, so it's like I'm searching for Vim word vs word, and of course I won't find anything useful.
A WORD is always delimited by whitespace.
A word is delimited by non-keyword characters, which are configurable. Whitespace characters aren't keyword characters, and usually other characters (like ()[],-) aren't either. Therefore, a word is usually smaller than a WORD; word navigation is more fine-grained.
Example
This "stuff" is not-so difficult!
wwww  wwwww  ww www ww wwwwwwwww    " (key)words, delimiters are non-keywords: "-! and whitespace
WWWW WWWWWWW WW WWWWWW WWWWWWWWWW   " WORDS, delimiters are whitespace only
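As a rough illustration outside Vim, the two definitions can be approximated with regular expressions in Python. This is only a sketch: it assumes an 'iskeyword' setting equivalent to letters, digits and underscore, and the function names are hypothetical. Following the manual's definition, a run of other non-blank characters (such as the quotes or the hyphen) counts as a word of its own.

```python
import re

# Assumption: keyword characters are letters, digits and underscore (\w).
WORD_RE = re.compile(r"\w+|[^\w\s]+")  # a run of keyword chars, or a run of other non-blanks
BIG_WORD_RE = re.compile(r"\S+")       # any run of non-blank characters


def vim_words(line):
    """Approximate Vim's 'word' units on one line."""
    return WORD_RE.findall(line)


def vim_big_words(line):
    """Approximate Vim's 'WORD' units on one line."""
    return BIG_WORD_RE.findall(line)


line = 'This "stuff" is not-so difficult!'
print(vim_words(line))
print(vim_big_words(line))
```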
To supplement the previous answers... I visualise it like this; WORD is bigger than word, it encompasses more...
If I do viw ("select inner word") while my cursor is on app in the following line, it selects app:
app/views/layouts/admin.blade.php
If I do viW (WORD) while my cursor is at the same place, it selects the whole sequence of characters. A WORD includes characters that words, which are like English words, do not, such as asterisks, slashes, parentheses, brackets, etc.
According to Vim documentation ( :h 03.1 )
A word ends at a non-word character, such as a ".", "-" or ")".
A WORD ends strictly with a white-space. This may not be a word in normal sense, hence the uppercase.
eg.
       ge      b  w                             e
       <-     <- --->                          --->
This is-a line, with special/separated/words (and some more). ~
   <----- <----- -------------------->         ----->
     gE      B             W                     E
If your cursor is at m (of more above)
a word would mean 'more' (i.e. delimited by the non-word character ')')
whereas a WORD would mean 'more).' (i.e. delimited by white-space only)
Similarly, if your cursor is at p (of special)
a word would mean 'special'
whereas a WORD would mean 'special/separated/words'
That's a grammar problem in understanding the definition of "word".
I got stuck at first on the Chinese version of this definition (it could be a mistranslation).
The definition is definitely correct, but it should be read like this:
A word consists of:
[(a sequence of letters,digits and underscores),
or (a sequence of other non-blank characters)],
separated with white space (spaces, tabs, <EOL>).
Whitespace is only needed to delimit two adjacent words of the same type.
More examples, in brackets, follow:
(example^&$%^Example) is three words: (example), (^&$%^) and (Example)
(^&^&^^ &&^&^) is two words: (^&^&^^) and (&&^&^)
(we're in stackoverflow) is five words: (we), ('), (re), (in) and (stackoverflow)
Another way to say it: if you're coding and want to move through the line, stopping at delimiters and things like "() . [] , :", use w.
If you want to bypass those and just jump between words, as you would in a novel or short story, use W.
For coding, the lowercase w is probably the one used most often. It depends on where you are in the code.