Tokenize in Python - python-3.x

I am trying to build a function in Python that allows me to tokenize a string of characters. I have written the following function:
import nltk

def tokenize(string):
    words = nltk.word_tokenize(string)
    return words
This function returns the following:
tokenize("Hello. What’s your name?")
['Hello', '.', 'What', '’', 's', 'your', 'name', '?']
But I need it to return the following:
['Hello', '.', 'What’s', 'your', 'name', '?']
How could I implement this?
Thank you
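
One possible approach (a sketch, not from the original thread) is to tokenize with a regex that keeps a word together with any apostrophe-joined suffix, handling both straight and curly apostrophes:

import re

def tokenize(string):
    # \w+(?:['’]\w+)* matches a word plus optional apostrophe suffixes ("What’s");
    # [^\w\s] matches any single character that is neither a word character nor whitespace
    return re.findall(r"\w+(?:['’]\w+)*|[^\w\s]", string)

print(tokenize("Hello. What’s your name?"))
# ['Hello', '.', 'What’s', 'your', 'name', '?']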

Related

Creating nested list for elements with same counts

Here's an example list:
['hello', 'hell', 'hel', 'he', 'h', 'he', 'hell', 'hello', 'hel', 'hello', 'hell']
How would I go about making a nested list for the elements with the same counts? To be more clear: group together the elements that appear the same number of times in the list. The output would look like this:
[['hello','hell'], ['hel', 'he'], ['h']]
Because the count of 'hello' and 'hell' is 3, they are grouped together, like the rest of the elements in the list.
With some imports it could be done like this:
from collections import Counter
from itertools import groupby
words = ['hello', 'hell', 'hel', 'he', 'h', 'he', 'hell', 'hello', 'hel', 'hello', 'hell']
counts = Counter(words)
res = [list(group) for _, group in groupby(counts, key=lambda k: counts[k])]
res will be:
[['hello', 'hell'], ['hel', 'he'], ['h']]
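One caveat (my note, not part of the original answer): itertools.groupby only groups consecutive items, so the code above relies on words with equal counts sitting next to each other in the Counter's insertion order. Sorting the keys by count first makes the grouping robust to any input order:

from collections import Counter
from itertools import groupby

words = ['hello', 'hell', 'hel', 'he', 'h', 'he', 'hell', 'hello', 'hel', 'hello', 'hell']
counts = Counter(words)
# Order the keys by count (descending) so equal counts are adjacent before grouping
ordered = sorted(counts, key=counts.get, reverse=True)
res = [list(group) for _, group in groupby(ordered, key=counts.get)]
# [['hello', 'hell'], ['hel', 'he'], ['h']]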

combining thousands of list strings in python

I have a .txt file of "Alice in Wonderland" and need to strip all the punctuation and make all of the words lowercase, so I can find the number of unique words in the file. The wordlist referred to below is one list of all the individual words as strings from the book, so wordlist looks like this:
["Alice's", 'Adventures', 'in', 'Wonderland', "ALICE'S",
'ADVENTURES', 'IN', 'WONDERLAND', 'Lewis', 'Carroll', 'THE',
'MILLENNIUM', 'FULCRUM', 'EDITION', '3.0', 'CHAPTER', 'I',
'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning',
'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her',
'sister', 'on', 'the', 'bank,'
The code I have for the solution so far is:
from string import punctuation

def wordcount(book):
    for word in wordlist:
        no_punc = word.strip(punctuation)
        lower_case = no_punc.lower()
        newlist = lower_case.split()
        print(newlist)
This works for stripping punctuation and making all words lowercase. However, newlist = lower_case.split() makes an individual list out of every word, so I cannot iterate over one big list to find the number of unique words. The reason I did the .split() is so that, when iterated over, Python does not count every letter as a word; each word is kept intact since it is its own list item. Any ideas on how I can improve this, or a more efficient approach? Here is a sample of the output:
['down']
['the']
['rabbit-hole']
['alice']
['was']
['beginning']
['to']
['get']
['very']
['tired']
['of']
['sitting']
['by']
['her']
Here is a modification of your code, with outputs:
from string import punctuation

wordlist = "Alice fell down down down!.. down into, the hole."
single_list = []
for word in wordlist.split(" "):
    no_punc = word.strip(punctuation)
    lower_case = no_punc.lower()
    newlist = lower_case.split()
    #print(newlist)
    single_list.append(newlist[0])
print(single_list)

# to get the unique words
single_list_unique = set(single_list)
print(single_list_unique)
print(len(single_list_unique))
and that produces:
['alice', 'fell', 'down', 'down', 'down', 'down', 'into', 'the', 'hole']
and the unique set:
{'fell', 'alice', 'down', 'into', 'the', 'hole'}
and the length of the unique:
6
(This may not be the most efficient approach, but it is close to your current code and will suffice for a book of thousands of elements. If this were a backend process serving multiple requests, you would optimize it further.)
EDIT----------
You may be importing from the file using a library that gives you a list of lines, in which case you will get AttributeError: 'list' object has no attribute 'split', or you might see IndexError: list index out of range because of an empty string. In that case, use this modification:
from string import punctuation

wordlist2 = ["", "Alice fell down down down!.. down into, the hole.", "There was only one hole for Alice to fall down into"]
single_list = []
for wordlist in wordlist2:
    for word in wordlist.split(" "):
        no_punc = word.strip(punctuation)
        lower_case = no_punc.lower()
        newlist = lower_case.split()
        #print(newlist)
        if len(newlist) > 0:
            single_list.append(newlist[0])
print(single_list)

# to get the unique words
single_list_unique = set(single_list)
print(single_list_unique)
print(len(single_list_unique))
producing:
['alice', 'fell', 'down', 'down', 'down', 'down', 'into', 'the', 'hole', 'there', 'was', 'only', 'one', 'hole', 'for', 'alice', 'to', 'fall', 'down', 'into']
{'there', 'fall', 'fell', 'alice', 'for', 'down', 'was', 'into', 'the', 'to', 'only', 'hole', 'one'}
13
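For comparison (my sketch, not part of the original answer), the same result can be written as a single set comprehension, which skips empty strings as it goes and builds the unique set directly:

from string import punctuation

lines = ["", "Alice fell down down down!.. down into, the hole.",
         "There was only one hole for Alice to fall down into"]

# Strip punctuation and lowercase each word; the final check drops empty results
unique = {word.strip(punctuation).lower()
          for line in lines
          for word in line.split()
          if word.strip(punctuation)}

print(len(unique))  # 13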

filtering a txt file in python on three letter words only

I need to write a program in Python 3 that filters a .txt file in the Linux shell for three-letter words only.
This is what I've got so far:
def main():
    string = open("verhaaltje.txt", "r")
    words = [word for word in string.split() if len(word) == 3]
    file.close()
    print(str(words))
main()
Is there anyone who can help?
Please share your txt file contents and error logs.
string = open("verhaaltje.txt", "r")
words = [word for word in string.read().split() if len(word) == 3]
string.close()
print(words)
With my above code and some text from the internet, I got
['the', 'the', 'the', 'm).', 'the', 'the', 'The', 'its', 'ago', 'was', 'm),', 'but', 'now', 'm).', 'and', 'its', 'big', 'but', 'the', 'are', 'and', 'its', 'the', 'and', 'for']
To print one word per line, modify the print statement a bit:
print ('\n'.join(words))
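Note that the sample output includes tokens like 'm).' and 'm),' because punctuation counts toward the length. If only alphabetic three-letter words should match, a variant (my sketch) strips punctuation before testing the length, and uses a with block so the file closes automatically:

from string import punctuation

with open("verhaaltje.txt", "r") as f:
    # Strip surrounding punctuation, then keep purely alphabetic three-letter words
    stripped = (word.strip(punctuation) for word in f.read().split())
    words = [w for w in stripped if len(w) == 3 and w.isalpha()]

print('\n'.join(words))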

LPTHW 48: nosetest says AttributeError: 'str' object has no attribute 'words' while output seems correct?

I'm working on LPTHW and I'm trying to understand my error.
This is my code:
class lexicon:
    def __init__(self, words):
        self.words = words

    def scan(self):
        direction = ['north', 'south', 'east', 'west']
        verb = ['go', 'stop', 'kill', 'eat']
        stop = ['the', 'in', 'of', 'from', 'at', 'it']
        noun = ['door', 'bear', 'princess', 'cabinet']
        number = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
        thewords = self.words.split()
        output = []
        for i in thewords:
            if i in direction:
                output.append(('direction', i))
        print(output)

lexicon('north').scan()
I've put in the print statement to check if the output is correct. The actual output:
[('direction', 'north')]
I'm trying to get my nose test script running with this:
from nose.tools import *
from ex48 import lexicon
def test_directions():
assert_equal(lexicon.scan("north"), [('direction', 'north')])
But when I execute it:
thewords = self.words.split()
AttributeError: 'str' object has no attribute 'words'
The output of the script seems to match what the nose script expects, a list containing a tuple with those values. But the error seems to say that I gave it a string and it wants something else? I've searched for the error message, but I can't wrap my head around what it's trying to tell me here.
Can somebody explain what I'm doing wrong here?
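
For what it's worth (my reading, since no answer is quoted here): lexicon.scan("north") calls scan on the class itself, so the string "north" is passed as self; self.words.split() then looks up a .words attribute on a plain string, which raises the AttributeError. The test needs to instantiate the class the same way the working script does, and scan() also needs to return output rather than just print it for the assertion to pass. A minimal sketch, assuming ex48.py holds the class above with return output added at the end of scan():

from nose.tools import assert_equal
from ex48 import lexicon

def test_directions():
    # Instantiate first so 'north' becomes self.words, then scan it
    assert_equal(lexicon('north').scan(), [('direction', 'north')])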

Splitting punctuation into a list (Python)

I want to be able to generate a list that includes the punctuation but I am struggling to find a solution.
Example: "Hello world! I am here."
["Hello","world","!","I","am","here","."]
So far I know that
"Hello World! I am here.".split()
will evaluate to
['Hello', 'World!', 'I', 'am', 'here.']
You can use a regex:
>>> s="Hello world! I am here."
>>>
>>> import re
>>> re.findall(r'\w+|[^\w\s]',s)
['Hello', 'world', '!', 'I', 'am', 'here', '.']
re.findall() with the regex r'\w+|[^\w\s]' finds every run of word characters (\w+) and every single character that is neither a word character nor whitespace ([^\w\s]).
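An equivalent approach (my sketch) splits on non-word characters while capturing them, then filters out the whitespace and empty strings the split leaves behind:

>>> import re
>>> s = "Hello world! I am here."
>>> [t for t in re.split(r'(\W)', s) if t.strip()]
['Hello', 'world', '!', 'I', 'am', 'here', '.']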
