I am trying to produce a vector that represents how the elements of a list match a string. I have written this function in Python 3.x:
def vector_build(docs, var):
    vector = []
    features = docs.split(' ')
    for ngram in var:
        if ngram in features:
            vector.append(docs.count(ngram))
        else:
            vector.append(0)
    return vector
It works fine:
vector_build ('hi my name is peter',['hi', 'name', 'are', 'is'])
Out: [1, 1, 0, 1]
But this function does not scale to larger data. When its string parameter 'docs' is larger than 190 KB it takes more time than it should. So I am trying to replace the for loop with the map function, like this:
var = ['hi', 'name', 'are', 'is']
doc = 'hi my name is peter'
features = doc.split(' ')
vector = list(map(var,if ngram in var in features: vector.append(doc.count(ngram))))
But this returns this error:
SyntaxError: invalid syntax
Is there a way to replace that for loop with map, lambda, itertools in order to make the execution faster?
You can use a list comprehension for this task. Looking features up in a set should also speed the function up a bit.
var = ['hi', 'name', 'are', 'is']
doc = 'hi my name is peter'
features = doc.split(' ')
features_set = set(features)  # faster lookups
vector = [doc.count(ngram) if ngram in features_set else 0 for ngram in var]
print(vector)
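If the repeated doc.count(ngram) scans are the bottleneck on large strings, another option (my own sketch, not part of the answer above) is to pre-count the words once with collections.Counter; note that this counts whole words, whereas str.count also matches substrings, so check it still fits your data:

from collections import Counter

var = ['hi', 'name', 'are', 'is']
doc = 'hi my name is peter'

word_counts = Counter(doc.split(' '))           # count every word once, up front
vector = [word_counts[ngram] for ngram in var]  # missing words default to 0
print(vector)  # [1, 1, 0, 1]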
How can I improve the time efficiency of my algorithm to solve the following problem?
Problem:
Write a function that takes in a dictionary and a query, two string arrays.
It should return an array of integers where each element i contains the number
of anagrams of query[i] that exist in the dictionary. An anagram of a string is
another string with the same characters with the same frequencies, in any order.
Example: "bca", "abc", "cba", "cab" are all anagrams of "abc".
Current Solution
from collections import defaultdict

def IsAnagram(word):
    # a list of 26 elements, each representing the number of occurrences of an English letter in the word
    lst = [0] * 26
    for c in word:
        lst[ord(c) - 97] += 1
    return lst
def anagram_string(query, dictionary):
    d = defaultdict(int)
    res = []
    for w_1 in query:
        # if the anagram of w_1 was counted before
        if w_1 in d:
            res.append(d[w_1])
        else:
            # number of anagrams for this query word in the dictionary
            num_anagram = 0
            # loop through all words of the dictionary
            for w_2 in dictionary:
                if len(w_1) != len(w_2):
                    continue
                # add 1 to num_anagram in case all letters of w_1 are the same as in w_2
                num_anagram += IsAnagram(w_1) == IsAnagram(w_2)
            res.append(num_anagram)
            # record the word with its number of anagrams in the dictionary
            d[w_1] = num_anagram
    return res
The above code has a time complexity of O(n*m), where n is the number of words in the query array and m is the number of words in the dictionary array. Although it works fine for small arrays, it takes forever to compute the output array for lists of length 5000+. So, how can I improve it, or does someone maybe have a different idea?
Here is my solution in JavaScript:
const dictionary = ['hack', 'rank', 'khac', 'ackh', 'kran', 'rankhacker', 'a', 'ab', 'ba', 'stairs', 'raits']
const query = ['a', 'nark', 'bs', 'hack', 'stairs']

function evaluate (dictionary, query) {
  const finalResult = query.reduce((map, obj) => {
    map[obj.split('').sort().join('')] = 0
    return map
  }, {})
  dictionary.forEach(item => {
    const stringEvaluate = item.split('').sort().join('')
    if (finalResult.hasOwnProperty(stringEvaluate)) {
      finalResult[stringEvaluate]++
    }
  })
  return Object.values(finalResult)
}

console.log(evaluate(dictionary, query))
A solution in Python, returning the count of anagrams found in the dictionary list for each query word:
dictionary = ['hack', 'rank', 'khac', 'ackh', 'kran', 'rankhacker', 'a', 'ab', 'ba', 'stairs', 'raits']
query = ['a', 'nark', 'bs', 'hack', 'stairs']

myList = [sorted(element) for element in dictionary]

def anagram_string(query, dictionary):
    resList = []
    for element in query:
        count = myList.count(sorted(element))
        resList.append(count)
    return resList

print(anagram_string(query, dictionary))
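If the quadratic behaviour is still too slow for 5000+ words, a sketch of a further improvement (my own suggestion, not part of the answers above) is to count the sorted letter signatures of the dictionary once with collections.Counter, so each query word needs only one sort and one lookup instead of a scan over the whole dictionary:

from collections import Counter

dictionary = ['hack', 'rank', 'khac', 'ackh', 'kran', 'rankhacker', 'a', 'ab', 'ba', 'stairs', 'raits']
query = ['a', 'nark', 'bs', 'hack', 'stairs']

def anagram_counts(query, dictionary):
    # build the signature counts once: 'hack' -> 'achk', 'khac' -> 'achk', ...
    signatures = Counter(''.join(sorted(word)) for word in dictionary)
    # each query word is then one small sort plus one dictionary lookup
    return [signatures[''.join(sorted(word))] for word in query]

print(anagram_counts(query, dictionary))  # [1, 2, 0, 3, 1]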
I have a string from a program's output, and I need to convert the string into a dictionary. I have tried using the dict() and zip() functions, but I am not able to get the result I want.
This is the code I have so far:
string = "Eth1/1 vlan-1 typemode-eth status:access eth1/2 vlan-1 type-eth status:access"
list1=string.split(' ')
print(list1)
['Eth1/1', 'vlan-1', 'typemode-eth', 'status:access', 'eth1/2', 'vlan-1', 'type-eth', 'status:access']
Beyond this I have no idea how to get to the desired result, which should look like this:
{'eth1/1': {'Speed': '10Gb', 'Vlan': 1, 'Type Mode': 'eth', 'status': 'access'}, 'eth1/2': {'Speed': '10Gb', 'Vlan': 1, 'Type Mode': 'eth', 'status': 'access'}}
To get a value from your result, see the following example (see the inline comments).
import re

result = {}
string = "Eth1/1 vlan-1 typemode-eth status:access eth1/2 vlan-1 type-eth status:access"

a = [m.end() for m in re.finditer('access', string)]  # this gives the 2 positions of the word "access".
list1 = [string[:a[0]], string[a[0] + 1:]]  # two substrings. a[0] is used to get
# roughly the middle of the string, where the split point between both
# substrings lies. Using "access" as the keyword gives flexibility if there is a
# third substring as well.
# Each substring still has to be parsed into a key and a nested dict of fields;
# once that is done, the result should be the same as result2 below.
#              y1                 z1
result2 = {'eth1/1': {'Speed': '10Gb', 'Vlan': 1, 'Type Mode': 'eth', 'status': 'access'},
           'eth1/2': {'Speed': '10Gb', 'Vlan': 1, 'Type Mode': 'eth', 'status': 'access'}}
#           y2 = 'eth1/2'

#             y1         z1
x = result2['eth1/1']['Speed']  # replace the word at y1 or z1 to fetch another value.
print('Got x : %s' % x)  # this prints '10Gb'.
Basically what you've created is nested dictionaries. Addressing y1 first lets you get data from that particular inner dictionary; after y1, asking for z1 gets the value of that particular key inside the first nested dictionary. If you change the keys at x you get different values back (regardless of the fact that it looks the same in your example; try it with different values to see the result). Enjoy!
Try this code below:
string = "Eth1/1 vlan-1 typemode-eth status:access eth1/2 vlan-1 type-eth status:access eth1/3 vlan-1 type-eth status:access"
strList = string.split(" ")
indexPos = []
for data in range(0,len(strList)):
if strList[data].lower()[0:3] == 'eth':
print('Found',data)
indexPos.append(data)
dataDict = dict()
for i in range(0,len(indexPos)):
stringDict = dict()
stringDict['Speed'] = '10Gb'
if i is not len(indexPos)-1:
string = strList[indexPos[i]:indexPos[i+1]]
else:
string = strList[indexPos[i]:]
for i in range(0,len(string)):
if i is not 0:
if i is not 3:
valueSplit = string[i].split('-')
else:
print(i)
valueSplit = string[i].split(':')
stringDict[valueSplit[0]] = valueSplit[1]
dataDict[string[0]] = stringDict
I wrote this code according to the pattern in your string. Please let me know if it works for you.
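For comparison, here is a more compact sketch (my own addition, not from the answers above) that parses the same string with a regular expression; the field pattern and the hard-coded '10Gb' speed are assumptions taken from the desired output in the question:

import re

string = "Eth1/1 vlan-1 typemode-eth status:access eth1/2 vlan-1 type-eth status:access"

result = {}
# each block looks like: <interface> vlan-<n> <...>type<...>-<mode> status:<status>
pattern = r'(\S+)\s+vlan-(\d+)\s+\S*type\S*-(\w+)\s+status:(\w+)'
for iface, vlan, mode, status in re.findall(pattern, string):
    result[iface.lower()] = {
        'Speed': '10Gb',          # assumed constant, as in the desired output
        'Vlan': int(vlan),
        'Type Mode': mode,
        'status': status,
    }

print(result)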
I have written a program which counts trigrams that occur 5 times or more in a text file. The trigrams should be printed out according to their frequency.
I cannot find the problem!
I get the following error message:
list index out of range
I have tried to make the range bigger but that did not work out
f = open("bsp_file.txt", encoding="utf-8")
text = f.read()
f.close()

words = []
for word in text.split():
    word = word.strip(",.:;-?!-–—_ ")
    if len(word) != 0:
        words.append(word)

trigrams = {}
for i in range(len(words)):
    word = words[i]
    nextword = words[i + 1]
    nextnextword = words[i + 2]
    key = (word, nextword, nextnextword)
    trigrams[key] = trigrams.get(key, 0) + 1

l = list(trigrams.items())
l.sort(key=lambda x: x[1])
l.reverse()

for key, count in l:
    if count < 5:
        break
    word = key[0]
    nextword = key[1]
    nextnextword = key[2]
    print(word, nextword, nextnextword, count)
The result should look like this (simplified):
s = "this is a trigram which is an example............."
this is a
is a trigram
a trigram which
trigram which is
which is an
is an example
As the comments pointed out, you're iterating over your list words with the index i, and you try to access words[i + 1] and words[i + 2]; when i reaches the last cells of words, i + 1 and i + 2 are out of range.
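A minimal fix, keeping the rest of your code unchanged, is to stop the loop two words before the end so that words[i + 2] always exists (a sketch with placeholder data):

words = "this is a trigram which is an example".split()  # placeholder for your word list

trigrams = {}
for i in range(len(words) - 2):  # stop two words early so words[i + 2] is always valid
    key = (words[i], words[i + 1], words[i + 2])
    trigrams[key] = trigrams.get(key, 0) + 1

print(trigrams)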
I suggest you read this tutorial to generate n-grams with pure python: http://www.albertauyeung.com/post/generating-ngrams-python/
Answer
If you don't have much time to read it all, here's the function I recommend, adapted from the link:
def get_ngrams_count(words, n):
    # generate a list of tuples representing all n-grams
    ngrams_tuple = zip(*[words[i:] for i in range(n)])
    # turn the list into a dictionary with the counts of all n-grams
    ngrams_count = {}
    for ngram in ngrams_tuple:
        if ngram not in ngrams_count:
            ngrams_count[ngram] = 0
        ngrams_count[ngram] += 1
    return ngrams_count

trigrams = get_ngrams_count(words, 3)
Please note that you can make this function a lot simpler by using a Counter (which subclasses dict, so it will be compatible with your code) :
from collections import Counter

def get_ngrams_count(words, n):
    # build a dictionary with the counts of all n-grams directly
    return Counter(zip(*[words[i:] for i in range(n)]))

trigrams = get_ngrams_count(words, 3)
Side Notes
You can use the bool argument reverse in .sort() to sort your list from most common to least common:
l = list(trigrams.items())
l.sort(key=lambda x: x[1], reverse=True)
This is a tad faster than sorting your list in ascending order and then reversing it with .reverse().
A more generic loop for printing your sorted list (it will work for any n-grams, not just trigrams):
for ngram, count in l:
    if count < 5:
        break
    # " ".join(ngram) combines all elements of ngram into a string, separated by spaces
    print(" ".join(ngram), count)
def bibek():
    test_list = [[]]
    x = int(input("Enter the length of String elements using enter -: "))
    for i in range(0, x):
        a = str(input())
        a = list(a)
        test_list.append(a)
    del(test_list[0])

    def filt(b):
        d = ['b', 'i', 'b']
        if b in d:
            return True
        else:
            return False

    for t in test_list:
        x = filter(filt, t)
        for i in x:
            print(i)

bibek()
Suppose test_list = [['b','i','b'], ['s','i','b'], ['r','i','b']].
The output should be ib, since ib is common among all of them.
An option is to use set and its methods:
test_list = [['b', 'i', 'b'], ['s', 'i', 'b'], ['r', 'i', 'b']]

common = set(test_list[0])
for item in test_list[1:]:
    common.intersection_update(item)

print(common)  # {'i', 'b'}
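As a side note (my own addition, not part of the original answer), the same intersection can be written in one line with set.intersection:

common = set.intersection(*map(set, test_list))
print(common)  # {'i', 'b'}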
UPDATE: now that you have clarified your question, I would do this:
from difflib import SequenceMatcher

test_list = [['b', 'i', 'b', 'b'], ['s', 'i', 'b', 'b'], ['r', 'i', 'b', 'b']]

# convert the lists to plain strings
strgs = [''.join(item) for item in test_list]

common = strgs[0]
for item in strgs[1:]:
    sm = SequenceMatcher(isjunk=None, a=item, b=common)
    match = sm.find_longest_match(0, len(item), 0, len(common))
    common = common[match.b:match.b + match.size]

print(common)  # 'ibb'
The trick here is to use difflib.SequenceMatcher in order to get the longest common substring.
One more update after a further clarification of your question, this time using collections.Counter:
from collections import Counter

strgs = 'app', 'bapp', 'sardipp', 'ppa'

common = Counter(strgs[0])
print(common)
for item in strgs[1:]:
    c = Counter(item)
    for key, number in common.items():
        common[key] = min(number, c.get(key, 0))

print(common)                      # Counter({'p': 2, 'a': 1})
print(sorted(common.elements()))   # ['a', 'p', 'p']
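A further side note (my own addition): collections.Counter already supports this as a multiset intersection with the & operator, which keeps the minimum count of each element:

from collections import Counter
from functools import reduce

strgs = 'app', 'bapp', 'sardipp', 'ppa'

common = reduce(lambda acc, item: acc & Counter(item), strgs[1:], Counter(strgs[0]))
print(common)  # Counter({'p': 2, 'a': 1})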
I am trying to create a list of dictionaries that contain lists of words under the 'body' and 'summ' keys, using spaCy. I am also using BeautifulSoup, since the actual data is raw HTML.
This is what I have so far:
from pymongo import MongoClient
from bs4 import BeautifulSoup as bs
import spacy
import string

clt = MongoClient('localhost')
db1 = clt['mchack']
db2 = clt['clean_data']

nlp = spacy.load('en')
valid_shapes = ['X.X', 'X.X.', 'X.x', 'X.x.', 'x.x', 'x.x.', 'x.X', 'x.X.']

cake = list()
sent_x = list()
temp_b = list()
temp_s = list()
sent_y = list()

table = str.maketrans(dict.fromkeys(string.punctuation))

for item in db1.article.find().limit(1):
    finale_doc = {}
    x = bs(item['news']['article']['Body'], 'lxml')
    y = bs(item['news']['article']['Summary'], 'lxml')

    for content in x.find_all('p'):
        v = content.text
        v = v.translate(table)
        sent_x.append(v)
    body = ' '.join(sent_x)

    for content in y.find_all('p'):
        v = content.text
        v = v.translate(table)
        sent_y.append(v)
    summ = ' '.join(sent_y)

    b_nlp = nlp(body)
    s_nlp = nlp(summ)

    for token in b_nlp:
        if token.is_alpha:
            temp_b.append(token.text.lower())
        elif token.shape_ in valid_shapes:
            temp_b.append(token.text.lower())
        elif token.pos_ == 'NUM':
            temp_b.append('<NUM>')
        elif token.pos_ == "<SYM>":
            temp_b.append('<SYM>')

    for token in s_nlp:
        if token.is_alpha:
            temp_s.append(token.text.lower())
        elif token.shape_ in valid_shapes:
            temp_s.append(token.text.lower())
        elif token.pos_ == 'NUM':
            temp_s.append('<NUM>')
        elif token.pos_ == "<SYM>":
            temp_s.append('<SYM>')

    finale_doc.update({'body': temp_b, 'summ': temp_s})
    cake.append(finale_doc)
    print(cake)

    del sent_x[:]
    del sent_y[:]
    del temp_b[:]
    del temp_s[:]
    del finale_doc

print(cake)
The first print statement gives the proper output:
'summ': ['as', 'per', 'the', 'budget', 'estimates', 'we', 'are', 'going', 'to', 'spend', 'rs', '<NUM>', 'crore', 'in', 'the', 'next', 'year'],
'body': ['central', 'government', 'has', 'proposed', 'spendings', 'worth', 'over', 'rs', '<NUM>', 'crore', 'on', 'medical', 'and', 'cash', 'benefits', 'for', 'workers', 'and', 'family', 'members']}]
However, after emptying the lists sent_x, sent_y, temp_b and temp_s, the output becomes:
[{'summ': [], 'body': []}]
You keep passing references to temp_b and temp_s. That's why, after emptying these lists, cake's content also changes (the values of the dictionary are the same objects as temp_b and temp_s)!
You simply need to make a copy before appending the finale_doc dict to the cake list.
finale_doc.update({'body': list(temp_b), 'summ': list(temp_s)})
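Equivalently (my wording, not the original answer), slicing with temp_b[:] or calling temp_b.copy() makes the same kind of copy. A tiny self-contained example of the effect:

temp_b = ['central', 'government']
temp_s = ['as', 'per']

finale_doc = {}
finale_doc.update({'body': temp_b[:], 'summ': temp_s.copy()})  # copies, not references

del temp_b[:]
del temp_s[:]
print(finale_doc)  # {'body': ['central', 'government'], 'summ': ['as', 'per']}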
You should try creating a minimal reproducible version of this, as it would meet Stack Overflow guidelines and you would be likely to answer your own problem.
I think what you are asking is this:
How can I empty a list without changing other instances of that list?
I made some code and I think it should work:
items = []
contents = []
for value in (1, 2):
    contents.append(value)
    items.append(contents)
    print(contents)
    del contents[:]
print(items)
This prints [1], [2] like I want, but then it prints [[], []] instead of [[1], [2]].
Then I could answer your question:
Objects (including lists) persist and are shared by reference, so this won't work.
Instead of modifying (adding to and then deleting from) the same list, you probably want to create a new list inside the loop. You can verify this by looking at id(contents) and id(items[0]), etc., and seeing they are all the same list. You can even do contents.append(None); print(items) and see that you now have [[None], [None]].
Try doing
for ...:
    contents = []
    contents.append(value)
instead of
contents = []
for ...:
    ...
    del contents[:]
Edit: Another answer suggests making a copy of the values as you add them. This will work, but in your case I feel that making a copy and then nulling is unnecessarily complicated. This might be appropriate if you continued to add to the list.
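Applied to your original loop, a sketch of that idea (with placeholder data instead of your MongoDB documents) would look like this:

cake = []
for body_words, summ_words in [(['hi', 'there'], ['hi']), (['more', 'words'], ['more'])]:  # placeholder data
    temp_b = []  # fresh lists on every iteration, so nothing has to be emptied later
    temp_s = []
    temp_b.extend(body_words)
    temp_s.extend(summ_words)
    cake.append({'body': temp_b, 'summ': temp_s})

print(cake)  # each dict keeps its own lists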