Splitting punctuation into a list (Python)

I want to be able to generate a list that includes the punctuation, but I am struggling to find a solution.
Example: "Hello world! I am here."
["Hello","world","!","I","am","here","."]
So far I know that
"Hello World! I am here.".split()
will evaluate to
['Hello', 'World!', 'I', 'am', 'here.']

You can use a regex:
>>> s="Hello world! I am here."
>>>
>>> import re
>>> re.findall(r'\w+|[^\w\s]',s)
['Hello', 'world', '!', 'I', 'am', 'here', '.']
re.findall() with the regex r'\w+|[^\w\s]' finds every run of word characters (\w+) and every single character that is neither a word character nor whitespace ([^\w\s]).
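Note that [^\w\s] matches one character at a time, so a run of punctuation comes out as separate tokens; if you would rather keep each run together, adding + is enough. A quick check:

import re

s = "Hello world!! I am here..."

# [^\w\s] emits each punctuation character as its own token:
print(re.findall(r'\w+|[^\w\s]', s))
# ['Hello', 'world', '!', '!', 'I', 'am', 'here', '.', '.', '.']

# [^\w\s]+ keeps runs of punctuation together:
print(re.findall(r'\w+|[^\w\s]+', s))
# ['Hello', 'world', '!!', 'I', 'am', 'here', '...']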

Related

Struggling with removing stop words using nltk

I'm trying to remove the stop words from "I don't like ice cream." I have defined:
stop_words = set(nltk.corpus.stopwords.words('english'))
and the function
def stop_word_remover(text):
    return [word for word in text if word.lower() not in stop_words]
But when I apply the function to the string in question, I get this list:
[' ', 'n', '’', ' ', 'l', 'k', 'e', ' ', 'c', 'e', ' ', 'c', 'r', 'e', '.']
which, when I join the strings together as in ''.join(stop_word_remover("I don’t like ice cream.")), yields
' n’ lke ce cre.'
which is not what I was expecting.
Any tips on where I have gone wrong?
word for word in text iterates over the characters of text, not over words, because a string is a sequence of characters. You should tokenize the text into words first:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize

stop_words = set(nltk.corpus.stopwords.words('english'))

def stop_word_remover(text):
    word_tokens = word_tokenize(text)
    word_list = [word for word in word_tokens if word.lower() not in stop_words]
    return " ".join(word_list)

stop_word_remover("I don't like ice cream.")
## "n't like ice cream ."
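The leftover n't comes from how word_tokenize splits contractions; a quick check (assuming the same NLTK setup as above):

from nltk.tokenize import word_tokenize

# word_tokenize splits "don't" into "do" + "n't"; "do" is in the
# stopword list and gets removed, but "n't" is not, so it survives.
print(word_tokenize("I don't like ice cream."))
# ['I', 'do', "n't", 'like', 'ice', 'cream', '.']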

Tokenize in Python

I am trying to build a function in Python that lets me tokenize a character string. I have written the following function:
import nltk

def tokenize(string):
    words = nltk.word_tokenize(string)
    return words
This function returns the following:
tokenize("Hello. What’s your name?")
['Hello', '.', 'What', '’', 's', 'your', 'name', '?']
But I need it to return this instead:
['Hello', '.', 'What’s', 'your', 'name', '?']
How could I implement this?
Thank you
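One way to keep contractions together (a sketch, not an official NLTK recipe) is to adapt the re.findall() pattern from the first answer so that an apostrophe between word characters stays inside its token:

import re

def tokenize(string):
    # \w+[’']\w+ matches a word with an embedded apostrophe (What’s),
    # \w+ matches plain words, [^\w\s] matches punctuation characters.
    return re.findall(r"\w+[’']\w+|\w+|[^\w\s]", string)

tokenize("Hello. What’s your name?")
# ['Hello', '.', 'What’s', 'your', 'name', '?']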

python3 split comma separated string ignoring comma within quotes [duplicate]

I have some input that looks like the following:
A,B,C,"D12121",E,F,G,H,"I9,I8",J,K
The comma-separated values can be in any order. I'd like to split the string on commas; however, in the case where something is inside double quotation marks, I need it to both ignore commas and strip out the quotation marks (if possible). So basically, the output would be this list of strings:
['A', 'B', 'C', 'D12121', 'E', 'F', 'G', 'H', 'I9,I8', 'J', 'K']
I've had a look at some other answers, and I'm thinking a regular expression would be best, but I'm terrible at coming up with them.
Lasse is right; it's a comma-separated value file, so you should use the csv module. A brief example:
from csv import reader

# test
infile = ['A,B,C,"D12121",E,F,G,H,"I9,I8",J,K']
# real input is probably like
# infile = open('filename', 'r')
# or use 'with open(...) as infile:' and indent the rest

for line in reader(infile):
    print(line)

# for the test input, prints
# ['A', 'B', 'C', 'D12121', 'E', 'F', 'G', 'H', 'I9,I8', 'J', 'K']
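If you are starting from a single string rather than a file, csv.reader accepts any iterable of lines, so a one-element list works; a sketch:

from csv import reader

s = 'A,B,C,"D12121",E,F,G,H,"I9,I8",J,K'
fields = next(reader([s]))  # reader takes an iterable of lines
print(fields)
# ['A', 'B', 'C', 'D12121', 'E', 'F', 'G', 'H', 'I9,I8', 'J', 'K']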

Double for statement in list comprehension

wordlist = ['cat','dog','rabbit']
letterlist = [ ]
To list all the characters in all the words, we can do this:
letterlist = [word[i] for word in wordlist for i in range(len(word))]
['c', 'a', 't', 'd', 'o', 'g', 'r', 'a', 'b', 'b', 'i', 't']
However, when I try to do it in this way:
letterlist = [character for character in word for word in wordlist]
I get the error:
NameError: name 'word' is not defined on line 9
Can someone explain my error in understanding how list comprehension works?
Thanks.
Writing
wordlist = ["cat", "dog", "rabbit"]
letterlist = [character for character in word for word in wordlist]
is comparable to the following nested loop:
wordlist = ["cat", "dog", "rabbit"]
letterlist = []
for character in word:
for word in wordlist:
letterlist.append(character)
This loop throws the same error as your list comprehension: the outer clause, for character in word, runs before word has been defined as an element of wordlist. You just have the order backwards. Try the following:
letterlist = [character for word in wordlist for character in word]
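Read the for clauses left to right; they nest in the same order as an explicit loop:

wordlist = ["cat", "dog", "rabbit"]

# for word in wordlist is the outer loop, for character in word the inner:
letterlist = [character for word in wordlist for character in word]
print(letterlist)
# ['c', 'a', 't', 'd', 'o', 'g', 'r', 'a', 'b', 'b', 'i', 't']

# equivalent nested loop:
letterlist = []
for word in wordlist:
    for character in word:
        letterlist.append(character)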

Is there any way to force ipython to interpret utf-8 symbols?

I'm using ipython notebook.
What I want to do is search a literal string for any Spanish accented letters (ñ,á,é,í,ó,ú,Ñ,Á,É,Í,Ó,Ú) and change them to their closest representation in the English alphabet.
I decided to write down a simple function and give it a go:
def remove_accent(n):
    listn = list(n)
    for i in range(len(listn)):
        if listn[i] == 'ó':
            listn[i] = 'o'
    return listn
Seemed simple, right? Simply check whether the accented character is there and change it to its closest representation. So I went ahead and tested it, getting the following output:
in []: remove_accent('whatever !## ó')
out[]: ['w',
'h',
'a',
't',
'e',
'v',
'e',
'r',
' ',
'!',
'#',
'#',
' ',
'\xc3',
'\xb3']
I've tried to change the default encoding from ASCII to UTF-8 (I presume that's the problem, since I'm getting two positions for the accented character instead of one: '\xc3', '\xb3'), but this didn't work. What I would like to get is:
in []: remove_accent('whatever !## ó')
out[]: ['w',
'h',
'a',
't',
'e',
'v',
'e',
'r',
' ',
'!',
'#',
'#',
' ',
'o']
PS: this wouldn't be so bad if the accented character took up just one position instead of two; then I would only need to change the if condition, but I haven't found a way to do that either.
Your problem is that you are getting two bytes for the 'ó' character instead of one, because in Python 2 a plain string holds the UTF-8 encoded bytes ('\xc3', '\xb3'). Decode it to unicode first, so that every character has length one:
def remove_accent(n):
    n_unicode = unicode(n, "UTF-8")
    listn = list(n_unicode)
    for i in range(len(listn)):
        if listn[i] == u'ó':
            listn[i] = 'o'.encode('utf-8')
        else:
            listn[i] = listn[i].encode('utf-8')
    return listn
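If you can use Python 3, or want to map all the accented letters at once instead of writing one if per character, the standard unicodedata module offers a more general approach; a sketch (Python 3, not from the original answer):

import unicodedata

def remove_accents(text):
    # NFD decomposes 'ó' into 'o' plus a combining accent mark;
    # dropping the marks (category 'Mn') leaves the plain letters.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

print(remove_accents('whatever !## ó'))  # whatever !## o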
