I am doing string processing in MATLAB, and I usually use cell arrays to store the individual words in the text.
Example:
a = {'this', 'is', 'an', 'array', 'of', 'strings'}
To search for the word 'of' in this array, I loop through the array and check each individual element against my word. This method does not scale: with a large dataset the array a grows large, and looping over every element becomes too slow. Is there a smarter way, perhaps a better native data structure in MATLAB, that would make this search faster?
A map container is one option. I don't know what specific sort of string processing you intend to do, but here's an example of how you can store each string as a key associated with a vector of the index positions of that word in a cell array:
a = {'this', 'is', 'an', 'array', 'of', 'strings', 'this', 'is'};
strMap = containers.Map(); %# Create container
for index = 1:numel(a)                       %# Loop over words to add
    word = a{index};
    if strMap.isKey(word)
        strMap(word) = [strMap(word) index]; %# Add to an existing key
    else
        strMap(word) = index;                %# Make a new key
    end
end
You could then get the index positions of a word:
>> indices = strMap('this')
indices =
1 7 %# Cells 1 and 7 contain 'this'
Or check if a word exists in the cell array (i.e. if it is a key):
>> strMap.isKey('and')
ans =
0 %# 'and' is not present in the cell array
Related
I'm working on a text file which contains a great many words, and I want to get all the words together with their lengths. For example, I first want all the words whose length is 2, then 3, then 4, and so on up to 15. For example:
Word = this, length = 4
hate: 4
love: 4
that: 4
china: 5
Great: 5
and so on up to 15
I was trying to do this with the following code, but I couldn't iterate through all the keys one by one. With this code I only get the words of length 5, but I want the loop to run from 2 up to 15 in sequence.
text = open(r"C:\Users\israr\Desktop\counter\Bigdata.txt")
d = dict()
for line in text:
line = line.strip()
line = line.lower()
words = line.split(" ")
for word in words:
if word not in d:
d[word] = len(word)
def getKeysByValue(d, valueToFind):
listOfKeys = list()
listOfItems = d.items()
for item in listOfItems:
if item[1] == valueToFind:
listOfKeys.append(item[0])
return listOfKeys
listOfKeys = getKeysByValue(d, 5)
print("Keys with value equal to 5")
#Iterate over the list of keys
for key in listOfKeys:
print(key)
What I have done is:
Changed the structure of your dictionary:
In your version of the dictionary, each "word" is a key whose value is its length, like this:
{"hate": 4, "love": 4}
New version:
{4: ["hate", "love"], 5:["great", "china"]} Now the keys are integers and values are lists of words. For instance, if key is 4, the value will be a list of all words from the file with length 4.
After that, the code populates the dictionary with the data read from the file. If a key is not yet present in the dictionary it is created; otherwise the word is appended to the list stored against that key.
Finally, the keys are sorted and their values printed, so all words of each length are printed in sequence.
You forgot to close the file in your code. It's good practice to release any resource a program uses once it has finished with it (to avoid resource leaks, memory leaks, and similar errors). Most of the time this just means closing the resource; closing the file, for instance, releases it so that other programs can use it.
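As a side note, a with statement closes the file for you automatically, even if an error occurs part-way through; a minimal sketch of the same read loop (using the books.txt filename from the code below):

with open(r"books.txt") as myFile:
    for line in myFile:
        words = line.lower().strip().split(" ")
        # ... process the words exactly as in the full code below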
# Open the file to read data from
myFile = open(r"books.txt")

# create an empty dictionary where we will store the words grouped by length
# format of the data in the dictionary will be:
# {1: [words from file of length 1], 2: [words from file of length 2], ..... so on }
d = dict()

# iterate over all the lines of our file
for line in myFile:
    # get the words from the current line
    words = line.lower().strip().split(" ")
    # iterate over each word from the current line
    for word in words:
        # get the length of this word
        length = len(word)
        # if there is no word of this length in the dictionary yet,
        # create a list against this length
        # (length is the key, and the value is the list of words with this length)
        if length not in d:
            d[length] = [word]
        # if there is already a word of this length, append the current word to that list
        else:
            d[length].append(word)

# print each length with its words, in increasing order of length
for key in sorted(d.keys()):
    print(key, end=":")
    print(d[key])

myFile.close()
The first part of your code is correct: dictionary d will give you all the unique words with their respective lengths.
Now you want to get all the words with their length, as shown below:
{'this':4, 'that':4, 'water':5, 'china':5, 'great':5.......till length 15}
To order the words by their lengths, you can sort the dictionary items by value as below.
import operator
sorted_d = sorted(d.items(), key=operator.itemgetter(1))
sorted_d will be a list of (word, length) pairs in increasing order of length:
[('this', 4), ('that', 4), ('water', 5), ('china', 5), ('great', 5), ..., ('abcdefghijklmno', 15), ...]
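If you then need the words printed grouped by length from 2 up to 15, as the question asks, one possible follow-up is a sketch built on sorted_d with itertools.groupby (not part of the original answer):

from itertools import groupby

# sorted_d is ordered by length, so consecutive pairs with the same length form one group
for length, pairs in groupby(sorted_d, key=lambda item: item[1]):
    if 2 <= length <= 15:
        print(length, ":", [word for word, _ in pairs])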
What I want to accomplish is an algorithm that finds the most duplicated letter across the entire list of strings. I'm new to Python, so it's taken me roughly two hours to get to this stage. The problem with my current code is that it returns every duplicated letter, when I'm only looking for the most duplicated one. Additionally, I would like to know of a faster way that doesn't use two for loops.
Code:
rock_collections = ['aasdadwadasdadawwwwwwwwww', 'wasdawdasdasdAAdad', 'WaSdaWdasSwd', 'daWdAWdawd', 'QaWAWd', 'fAWAs', 'fAWDA']
seen = []
dupes = []
for words in rock_collections:
    for letter in words:
        if letter not in seen:
            seen.append(letter)
        else:
            dupes.append(letter)
print(dupes)
If you are looking for the letter which appears the greatest number of times, I would recommend the following code:
def get_popular(strings):
    full = ''.join(strings)
    unique = list(set(full))
    return max(
        list(zip(unique, map(full.count, unique))), key=lambda x: x[1]
    )
rock_collections = [
    'aasdadwadasdadawwwwwwwwww',
    'wasdawdasdasdAAdad',
    'WaSdaWdasSwd',
    'daWdAWdawd',
    'QaWAWd',
    'fAWAs',
    'fAWDA'
]
print(get_popular(rock_collections)) # ('d', 19)
Let me break down the code for you:
full contains all of the strings joined together, with nothing between them. set(full) produces a set, meaning that it contains every unique letter only once. list(set(full)) turns this back into a list, so the elements have a fixed order when you iterate over them.
map(full.count, unique) goes over each of the unique letters and counts how many times it occurs in the full string. zip(unique, ...) pairs those counts with their respective letters. key=lambda x: x[1] is a way of saying: don't take the maximum of the tuples themselves, take the maximum of the second element of each tuple (which is the number of times the letter appears). max then finds the most common letter using that key.
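If you want to avoid the explicit counting altogether, collections.Counter offers a single-pass alternative; this is just a sketch, not part of the answer above, but it gives the same result on this data:

from collections import Counter

def get_popular_counter(strings):
    # count every character of the joined string in one pass,
    # then take the most frequent (letter, count) pair
    return Counter(''.join(strings)).most_common(1)[0]

print(get_popular_counter(rock_collections))  # ('d', 19)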
I have a list of strings =
['after','second','shot','take','note','of','the','temp']
I want to strip all strings after the appearance of 'note'.
It should return
['after','second','shot','take']
There are also lists which do not contain the flag word 'note'.
So in case of a list of strings =
['after','second','shot','take','of','the','temp']
it should return the list as it is.
How can I do that in a fast way? I have to repeat the same thing for many lists of unequal lengths.
tokens = tokens[:tokens.index('note')] if 'note' in tokens else tokens
There is no need for iteration when you can slice the list:
strings[:strings.index('note')+1]
where strings is your input list of strings. The end index of a slice is exclusive, hence the +1 makes sure 'note' is included.
In case of missing data ('note'):
try:
    final_lst = strings[:strings.index('note')+1]
except ValueError:
    final_lst = strings
if you want to make sure the flagged word is present:
if 'note' in lst:
    lst = lst[:lst.index('note')+1]
Pretty much the same as #Austin's answer above.
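For illustration, here is how the slicing idea behaves on the two sample lists from the question; dropping the +1 excludes 'note' itself, which is what the expected output above shows (a small sketch with a made-up helper name, not part of either answer):

def cut_at(tokens, flag='note'):
    # slice up to (and excluding) the flag word; return the list unchanged if it is absent
    return tokens[:tokens.index(flag)] if flag in tokens else tokens

print(cut_at(['after', 'second', 'shot', 'take', 'note', 'of', 'the', 'temp']))
# ['after', 'second', 'shot', 'take']
print(cut_at(['after', 'second', 'shot', 'take', 'of', 'the', 'temp']))
# ['after', 'second', 'shot', 'take', 'of', 'the', 'temp']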
I have working code that performs a nested dictionary lookup and appends the result of another lookup to each key's list, using the output of numpy's nonzero function. Basically, I need a list of strings appended to a dictionary. These strings and the dictionary's keys are at one point hashed to integers and tracked in separate dictionaries whose keys are the integer hashes and whose values are the strings. I need to look up those hashed values and store the string results in the dictionary. It's confusing, so hopefully looking at the code helps. Here's a simplified version of the code:
for key in ResultDictionary:
    ResultDictionary[key] = []

true_indices = np.nonzero(numpy_array_of_booleans)
for idx in range(0, len(true_indices[0])):
    ResultDictionary.get(HashDictA.get(true_indices[0][idx])).append(HashDictB.get(true_indices[1][idx]))
This code works for me, but I am hoping there's a way to improve the efficiency. I am not sure if I'm limited due to the nested lookup. The speed is also dependent on the number of true results returned by the nonzero function. Any thoughts on this? Appreciate any suggestions.
Here are two suggestions:
1) since your hash dicts are keyed with ints it might help to transform them into arrays or even lists for faster lookup if that is an option.
k, v = map(list, (HashDictB.keys(), HashDictB.values()))
mxk, mxv = max(k), max(v, key=len)                  # largest key and longest word
lookupB = np.empty((mxk+1,), dtype=f'U{len(mxv)}')  # fixed-width unicode lookup array
lookupB[k] = v
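With such a lookup array in place, the per-element .get calls from the question can become a single vectorized indexing step (assuming the true_indices tuple from the question's code):

wordsB = lookupB[true_indices[1]]  # all column words at once, no per-element .get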
2) you probably can save a number of lookups in ResultDictionary and HashDictA by processing your numpy_array_of_booleans row-wise:
i, j = np.where(numpy_array_of_booleans)
bnds, = np.where(np.r_[True, i[:-1] != i[1:], True])
ResultDict = {HashDictA[i[l]]: [HashDictB[jj] for jj in j[l:r]] for l, r in zip(bnds[:-1], bnds[1:])}
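To see what the boundary trick computes, here is a small self-contained toy run; the mask and hash dicts are made-up stand-ins for the ones in the question:

import numpy as np

mask = np.array([[True, False, True, False],
                 [False, False, False, False],
                 [True, True, False, True]])
HashDictA = {0: 'doc0', 1: 'doc1', 2: 'doc2'}     # row hash -> string
HashDictB = {0: 'w0', 1: 'w1', 2: 'w2', 3: 'w3'}  # column hash -> string

i, j = np.where(mask)
bnds, = np.where(np.r_[True, i[:-1] != i[1:], True])
ResultDict = {HashDictA[i[l]]: [HashDictB[jj] for jj in j[l:r]]
              for l, r in zip(bnds[:-1], bnds[1:])}
print(ResultDict)  # {'doc0': ['w0', 'w2'], 'doc2': ['w0', 'w1', 'w3']}

Rows with no True entries (doc1 here) simply do not appear in the result.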
2b) if for some reason you need to incrementally add associations, you could do something like this (I'll shorten the variable names):
from operator import itemgetter

res = {}

def add_batch(data, res, hA, hB):
    i, j = np.where(data)
    bnds, = np.where(np.r_[True, i[:-1] != i[1:], True])
    for l, r in zip(bnds[:-1], bnds[1:]):
        if l+1 == r:
            res.setdefault(hA[i[l]], set()).add(hB[j[l]])
        else:
            res.setdefault(hA[i[l]], set()).update(itemgetter(*j[l:r])(hB))
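A hypothetical single-batch call, reusing the toy mask and hash dicts from the sketch above:

res = {}
add_batch(mask, res, HashDictA, HashDictB)
print(res)  # {'doc0': {'w0', 'w2'}, 'doc2': {'w0', 'w1', 'w3'}} (set order may vary)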
You can't do much about the dictionary lookups - you have to do those one at a time.
You can clean up the array indexing a bit:
idxes = np.argwhere(numpy_array_of_booleans)
for i, j in idxes:
    ResultDictionary.get(HashDictA.get(i)).append(HashDictB.get(j))
argwhere is transpose(nonzero(...)), turning the tuple of arrays into an (n, 2) array of index pairs. I don't think this makes a difference in speed, but the code is cleaner.
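A tiny comparison of the two index formats, on a made-up 2x3 mask:

import numpy as np

mask = np.array([[True, False, True],
                 [False, True, False]])
print(np.nonzero(mask))   # (array([0, 0, 1]), array([0, 2, 1]))  -- one array per axis
print(np.argwhere(mask))  # [[0 0]
                          #  [0 2]
                          #  [1 1]]  -- one (row, col) pair per True entry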
I have a vector with > 30000 words. I want to create a subset of this vector which contains only those words whose length is greater than 5. What is the best way to achieve this?
Basically df contains multiple sentences.
So,
wordlist = df2;
wordlist = [strip(wordlist[i]) for i in 1:length(wordlist)];
Now, I need to subset wordlist so that it contains only those words whose length is greater than 5.
sub(A,find(x->length(x)>5,A)) # => creates a view (most efficient way to make a subset)
EDIT: getindex() returns a copy of desired elements
getindex(A,find(x->length(x)>5,A)) # => makes a copy
You can use filter
wordlist = filter(x->islenatleast(x,6),wordlist)
and combine it with a fast condition such as islenatleast defined as:
function islenatleast(s, l)
    if sizeof(s) < l return false end
    # assumes each char takes at least a byte
    l == 0 && return true
    p = 1
    i = 0
    while i < l
        if p > sizeof(s) return false end
        p = nextind(s, p)
        i += 1
    end
    return true
end
According to my timings, islenatleast is faster than calculating the whole length (under some conditions). Additionally, this shows a strength of Julia: a user-defined primitive can compete with the core function length.
But doing:
wordlist = filter(x->length(x)>5,wordlist)
will also do.