I am converting a user-input list of strings into tuples. The user inputs a list of fractions (please, no "import fractions" suggestions), e.g.:
fractions = ["1/2","3/5","4/3","3/8","1/9","4/7"]
I would normally use the following code, which works:
user_input = 0
list_frac = []
print('Enter fractions into a list until you type "stop" in lower case:')
while user_input != 'stop':
    user_input = input('Enter a fraction ie: "1/2" >>>')
    list_frac.append(user_input)
list_frac.pop()  # pop "stop" off the list
result = []
for i in list_frac:
    result.append(tuple(i.split('/')))
print(result)
The result is a list of tuples:
fractions = [('1','2'),('3','5'),('4','3'),('3','8'),('1','9'),('4','7')]
I want to change the values in the tuples to integers as well, and I don't know how.
However, I also wish to learn lambda functions, so I am practicing on simple code like this. This is my attempt at the same code using lambda syntax:
tup_result = tuple(map(lambda i: result.append(i.split('/')), result))
But the result is an empty list, and there are no errors to help me.
The question: how do I change the strings in the list of tuples to ints, and how do I then accomplish all of this with a one-line lambda?
Any suggestions? I have the general concept of a lambda function down, but actually implementing this is a little confusing. Thanks for the help, folks!
I used comprehensions to solve the task:
fractions = ["1/2","3/5","4/3","3/8","1/9","4/7"]
print([(int(x),int(y)) for (x,y) in [k.split('/') for k in fractions]])
>>>[(1, 2), (3, 5), (4, 3), (3, 8), (1, 9), (4, 7)]
I started with Python not long ago myself and was confused about how to use lambda in the beginning as well. Then I read that Guido van Rossum had suggested that lambda forms would disappear in Python 3.0 (AlternateLambdaSyntax). Since then I have not used lambda at all and have had no problem with it. You have to understand how it works when you see it in code, but you can almost always write more readable code without using lambda (though I could be wrong). I hope it helps.
Update
There is a solution with map() and lambda, though I would not wish to see it even in my worst enemy's code:
print([(int(x),int(y)) for [x,y] in list(map(lambda frac: frac.split('/'),fractions))])
>>>[(1, 2), (3, 5), (4, 3), (3, 8), (1, 9), (4, 7)]
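For what it's worth, here is a sketch that folds the int conversion into a single map()/lambda call, assuming every input string is a well-formed "numerator/denominator" pair:

fractions = ["1/2", "3/5", "4/3", "3/8", "1/9", "4/7"]
# split each string on '/' and convert both parts to int inside the lambda
result = list(map(lambda frac: tuple(int(part) for part in frac.split('/')), fractions))
print(result)  # [(1, 2), (3, 5), (4, 3), (3, 8), (1, 9), (4, 7)]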
Related
I am solving a text-classification problem, and while annotating my data I found very long words that are actually whole sentences but are not separated by spaces.
One example I found while annotating my data is:
Throughnumerousacquisitionsandtransitions,Anacompstillexiststodaywithagreaterfocusondocumentmanagement
Desired output:
Through numerous acquisitions and transitions, Anacomp still exists today with a greater focus on document management.
I have looked at various frameworks such as Keras and PyTorch to see if they provide any functionality to solve this issue, but I couldn't find anything.
The problem that you are trying to solve is text/word segmentation. It is possible to approach this with ML, using a sequence model (such as an LSTM) and a word embedding (such as BERT).
This link details such an approach for the Chinese language. Chinese does not use spaces, so this sort of approach is a necessary preprocessing component in Chinese NLP tasks.
I would like to describe an automaton-based approach using the Aho-Corasick algorithm.
First, do a pip install pyahocorasick.
I'm using only the words from your input string for the sake of demonstration. In a real-world scenario you could use a dictionary of words from something like WordNet.
import ahocorasick

automaton = ahocorasick.Automaton()

input = 'Throughnumerousacquisitionsandtransitions, Anacompstillexiststodaywithagreaterfocusondocumentmanagement'

# Replace this with a large dictionary of words
word_dictionary = ['Through', 'numerous', 'acquisition', 'acquisitions', 'and', 'transitions', 'Anacomp', 'still',
                   'exists', 'today', 'with', 'a', 'greater', 'focus', 'on', 'document', 'management']

# add dictionary words to the automaton
for idx, key in enumerate(word_dictionary):
    automaton.add_word(key, (idx, key))

# Build the Aho-Corasick automaton for searching
automaton.make_automaton()

# to check for ambiguity: if there is a longer match then prefer that
previous_rng = range(0, 0)
previous_rs = set(previous_rng)

# Holds the end result dictionary
result = {}

# search the input using the automaton
for end_index, (insert_order, original_value) in automaton.iter(input):
    start_index = end_index - len(original_value) + 1
    current_rng = range(start_index, end_index)
    current_rs = set(current_rng)

    # ignore previous as there is a longer match available
    if previous_rs.issubset(current_rs):
        # remove the ambiguous short entry in favour of the longer entry
        if previous_rng in result:
            del result[previous_rng]
        result[current_rng] = (insert_order, original_value)
        previous_rng = current_rng
        previous_rs = current_rs
    # if there is no overlap of indices, then it's a new token, add to result
    elif previous_rs.isdisjoint(current_rs):
        previous_rng = current_rng
        previous_rs = current_rs
        result[current_rng] = (insert_order, original_value)
    # ignore current as it is a subset of previous
    else:
        continue

    assert input[start_index:start_index + len(original_value)] == original_value

for x in result:
    print(x, result[x])
This produces the following results:
range(0, 6) (0, 'Through')
range(7, 14) (1, 'numerous')
range(15, 26) (3, 'acquisitions')
range(27, 29) (4, 'and')
range(30, 40) (5, 'transitions')
range(43, 49) (6, 'Anacomp')
range(50, 54) (7, 'still')
range(55, 60) (8, 'exists')
range(61, 65) (9, 'today')
range(66, 69) (10, 'with')
range(71, 77) (12, 'greater')
range(78, 82) (13, 'focus')
range(83, 84) (14, 'on')
range(85, 92) (15, 'document')
range(93, 102) (16, 'management')
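If you then want to rebuild a spaced sentence from that result dict, a minimal sketch reusing the result variable from the code above could look like this (note that it simply joins the surviving matches with spaces; anything that never made it into result, such as the standalone 'a' in this run, is dropped):

# sort the matched ranges by their start index and join the matched words
segmented = ' '.join(word for rng, (order, word) in sorted(result.items(), key=lambda item: item[0].start))
print(segmented)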
dict = {'word1':8, 'word2':5, 'word3' : 15, 'word4' : 1}
sorted(dict.items(), key=lambda x: x[1])
[('word4', 1), ('word2', 5), ('word1', 8), ('word3', 15)]
Ok that's what I want.
OrderedDict(sorted(dict.items(), key=lambda x: x[1]))
OrderedDict([('word1', 8), ('word2', 5), ('word3', 15), ('word4', 1)])
That's not what I want.
I don't understand. What am I doing wrong? Why is the OrderedDict not ordered as desired?
I checked the code that you wrote, and the same case works for me. Maybe since you are using it in a shell-based environment, the values are not getting saved or something.
Here is a screenshot of the code along with the output that you expect.
Let me know if you have any more questions.
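For reference, here is a minimal, self-contained sketch of the expected behaviour, run as a script rather than relying on shell state and without shadowing the built-in dict name (word_counts is just an illustrative variable):

from collections import OrderedDict

word_counts = {'word1': 8, 'word2': 5, 'word3': 15, 'word4': 1}

# sorted() orders the items by value; OrderedDict then preserves that order
ordered = OrderedDict(sorted(word_counts.items(), key=lambda item: item[1]))
print(ordered)
# OrderedDict([('word4', 1), ('word2', 5), ('word1', 8), ('word3', 15)])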
I want to plot each list of tuples generated by the groupby command.
import pandas as pd
import more_itertools as mit

df = pd.DataFrame({'a': [0, 1, 2, 0, 1, 2, 3], 'b': [2, 10, 24, 56, 90, 1, 3]})
for group in mit.consecutive_groups(zip(df['a'], df['b']), ordering=lambda t: t[0]):
    print(list(group))
output:
[(0, 2), (1, 10), (2, 24)]
[(0, 56), (1, 90), (2, 1), (3, 3)]
I want to plot the first group, [(0, 2), (1, 10), (2, 24)], taking the first element of each tuple as x and the second element as y (x=0, y=2), and likewise for each following list of tuples. I am still trying, but have not figured it out yet.
You are looking for:
df.assign(grp = df.a.diff().ne(1).cumsum()).groupby('grp').plot('a','b')
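If you would rather plot directly from the consecutive_groups output in your question, a sketch with matplotlib (assuming it is installed) could look like this:

import matplotlib.pyplot as plt
import more_itertools as mit
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 0, 1, 2, 3], 'b': [2, 10, 24, 56, 90, 1, 3]})

for group in mit.consecutive_groups(zip(df['a'], df['b']), ordering=lambda t: t[0]):
    xs, ys = zip(*group)          # unpack each group's (x, y) tuples
    plt.plot(xs, ys, marker='o')  # one line per consecutive group

plt.show()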
I'm writing a Python script to read in a netCDF file in UGRID format. This requires reading in two 2D arrays:
x_coordinate = [[0,0],[1,200],[2,400],[3,600],[4,800]...]
y_coordinate = [[0,0],[1,5],[2,10],[3,15],[4,20]...]
and outputting an array:
coordinates = [[0,0],[200,5],[400,10],[600,15],[800,20]...]
so that I can then display it through mpl. Is there a way to do this efficiently without iterating through with comparative if statements?
In my opinion, zip() with a list comprehension should solve your problem.
Example as follows:
>>>list(zip([el[1] for el in x_coordinate], [el[1] for el in y_coordinate]))
[(0, 0), (200, 5), (400, 10), (600, 15), (800, 20)]
Can you try:
[[x[0][1], x[1][1]] for x in zip(x_coordinate, y_coordinate)]
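Since you mention efficiency and netCDF (where these arrays are typically already NumPy arrays), here is a sketch using NumPy instead of Python-level loops, assuming the second column of each array holds the actual coordinate values:

import numpy as np

x_coordinate = np.array([[0, 0], [1, 200], [2, 400], [3, 600], [4, 800]])
y_coordinate = np.array([[0, 0], [1, 5], [2, 10], [3, 15], [4, 20]])

# take the second column of each array and stack them side by side
coordinates = np.column_stack((x_coordinate[:, 1], y_coordinate[:, 1]))
print(coordinates)
# [[  0   0]
#  [200   5]
#  [400  10]
#  [600  15]
#  [800  20]]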
When I ran TF-IDF on a set of documents, it returned a TF-IDF matrix that looked like this:
(1, 12) 0.656240233446
(1, 11) 0.754552023393
(2, 6) 1.0
(3, 13) 1.0
(4, 2) 1.0
(7, 9) 1.0
(9, 4) 0.742540927053
(9, 5) 0.66980069547
(11, 19) 0.735138466738
(11, 7) 0.677916982176
(12, 18) 1.0
(13, 14) 0.697455191865
(13, 11) 0.716628394177
(14, 5) 1.0
(15, 8) 1.0
(16, 17) 1.0
(18, 1) 1.0
(19, 17) 1.0
(22, 13) 1.0
(23, 3) 1.0
(25, 6) 1.0
(26, 19) 0.476648253537
(26, 7) 0.879094103268
(28, 10) 0.532672175403
(28, 7) 0.523456282204
I want to know what this is; I am not able to understand how it is presented.
When I was in debug mode I came across indices, indptr and data... these somehow correlate with the data given. What are they?
There is also a lot of confusion in the numbers: I don't see a 0th, 5th or 6th document if, as I assume, the first element in the parentheses is the document index.
Kindly help me figure out how it works here. I know the general working of TF-IDF from the wiki (taking the log of the inverse document frequency and so on); I just want to know what these three different kinds of numbers are and what they refer to.
The source code is:
# This contains the list of file names
_filenames = []
# This contains the list of contents/text in the files
_contents = []
# This is a dict of filename: content
_file_contents = {}


class KmeansClustering():
    def kmeansClusters(self):
        global _report
        self.num_clusters = 5
        km = KMeans(n_clusters=self.num_clusters)
        vocab_frame = TokenizingAndPanda().createPandaVocabFrame()
        self.tfidf_matrix, self.terms, self.dist = TfidfProcessing().getTfidFPropertyData()
        km.fit(self.tfidf_matrix)
        self.clusters = km.labels_.tolist()
        joblib.dump(km, 'doc_cluster2.pkl')
        km = joblib.load('doc_cluster2.pkl')


class TokenizingAndPanda():
    def tokenize_only(self, text):
        '''
        This function tokenizes the text
        :param text: the text that you want to tokenize
        :return: the filtered tokens
        '''
        # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        return filtered_tokens

    def tokenize_and_stem(self, text):
        # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        stems = [_stemmer.stem(t) for t in filtered_tokens]
        return stems

    def getFilnames(self):
        '''
        :return:
        '''
        global _path
        global _filenames
        path = _path
        _filenames = FileAccess().read_all_file_names(path)

    def getContentsForFilenames(self):
        global _contents
        global _file_contents
        for filename in _filenames:
            content = FileAccess().read_the_contents_from_files(_path, filename)
            _contents.append(content)
            _file_contents[filename] = content

    def createPandaVocabFrame(self):
        global _totalvocab_stemmed
        global _totalvocab_tokenized
        # Enable this if you want to load the filenames and contents from a file structure.
        # self.getFilnames()
        # self.getContentsForFilenames()
        # for name, i in _file_contents.items():
        #     print(name)
        #     print(i)
        for i in _contents:
            allwords_stemmed = self.tokenize_and_stem(i)
            _totalvocab_stemmed.extend(allwords_stemmed)
            allwords_tokenized = self.tokenize_only(i)
            _totalvocab_tokenized.extend(allwords_tokenized)
        vocab_frame = pd.DataFrame({'words': _totalvocab_tokenized}, index=_totalvocab_stemmed)
        print(vocab_frame)
        return vocab_frame


class TfidfProcessing():
    def getTfidFPropertyData(self):
        tfidf_vectorizer = TfidfVectorizer(max_df=0.4, max_features=200000,
                                           min_df=0.02, stop_words='english',
                                           use_idf=True, tokenizer=TokenizingAndPanda().tokenize_and_stem,
                                           ngram_range=(1, 1))
        # print(_contents)
        tfidf_matrix = tfidf_vectorizer.fit_transform(_contents)
        terms = tfidf_vectorizer.get_feature_names()
        dist = 1 - cosine_similarity(tfidf_matrix)
        return tfidf_matrix, terms, dist
The result of TF-IDF applied to data is usually a 2D matrix A, where A_ij is the normalized frequency of the j-th term (word) in the i-th document. What you see in your output is a sparse representation of this matrix; in other words, only the non-zero elements are printed, so:
(1, 12) 0.656240233446
means that the 12th word (according to the vocabulary built by sklearn) has a normalized frequency of 0.656240233446 in the 1st document. The "missing" entries are zero, meaning that, for example, the 3rd word cannot be found in the 1st document (since there is no (1, 3)), and so on.
The fact that some documents are missing is a result of your particular code/data (which you did not include); maybe you set the vocabulary by hand, or limited the maximum number of features considered? There are many parameters in TfidfVectorizer that can cause this, but without your exact code (and some exemplary data) nothing more can be said. For example, setting min_df can cause it (as it drops very rare words), and max_features has the same effect.
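To see concretely what those (row, column) pairs and the indices/indptr/data attributes are, here is a minimal sketch on a made-up toy corpus (the documents and variable names are illustrative, not your data):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the dog barked"]   # toy corpus for illustration

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)   # a scipy.sparse CSR matrix

# Printing the sparse matrix lists only the non-zero entries,
# shown as (document_index, term_index)  value
print(tfidf_matrix)

# The underlying CSR storage you saw in the debugger:
# data holds the non-zero values, indices the column (term) index of each value,
# and indptr marks where each row (document) starts inside data/indices.
print(tfidf_matrix.data)
print(tfidf_matrix.indices)
print(tfidf_matrix.indptr)

# Map term indices back to words (older sklearn versions use get_feature_names())
print(vectorizer.get_feature_names_out())

# Dense view of the same matrix: one row per document, one column per term
print(tfidf_matrix.toarray())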