How to sort list of strings without using any pre-defined function? - python-3.x

I am new to Python and I am stuck on one problem.
I have a list like ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again'] which I want to sort without using any predefined function.
I have thought about it a lot but have not been able to solve it properly.
Is there a short and elegant way to sort such a list of strings without using predefined functions?
Which algorithm is best suited to sorting a list of strings?
Thanks.

This sounds like you're learning about sorting algorithms. One of the simplest sorting methods is bubblesort. Basically, it's just making passes through the list and looking at each neighboring pair of values. If they're not in the right order, we swap them. Then we keep making passes through the list until there are no more swaps to make, then we're done. This is not the most efficient sort, but it is very simple to code and understand:
values = ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
def bubblesort(values):
    '''Sort a list of values in place using bubblesort.'''
    # named is_sorted so the built-in sorted() is not shadowed
    is_sorted = False
    while not is_sorted:
        is_sorted = True
        # take a pass through every pair of values in the list
        for index in range(0, len(values)-1):
            if values[index] > values[index+1]:
                # if the left value is greater than the right value, swap them
                values[index], values[index+1] = values[index+1], values[index]
                # also, this means the list was NOT fully sorted during this pass
                is_sorted = False
print(f'Original: {values}')
bubblesort(values)
print(f'Sorted: {values}')
## OUTPUT ##
# Original: ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
# Sorted: ['again', 'and', 'hello', 'makes', 'perfect', 'practice', 'world']
There are lots more sorting algorithms to learn about, and they each have different strengths and weaknesses - some are faster than others, some take up more memory, etc. It's fascinating stuff and worth it to learn more about Computer Science topics. But if you're a developer working on a project, unless you have very specific needs, you should probably just use the built-in Python sorting algorithms and move on:
values = ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
print(f'Original: {values}')
values.sort()
print(f'Sorted: {values}')
## OUTPUT ##
# Original: ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
# Sorted: ['again', 'and', 'hello', 'makes', 'perfect', 'practice', 'world']
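As one illustration of the "other sorting algorithms" mentioned above, here is a minimal merge sort sketch. It is not part of the original answer; the function name merge_sort is just chosen for this example. It splits the list in half, sorts each half recursively, and merges the two sorted halves, which generally takes fewer comparisons than bubblesort on long lists at the cost of extra memory.

```python
def merge_sort(values):
    """Return a new sorted list using merge sort (illustrative sketch)."""
    # a list of zero or one items is already sorted
    if len(values) <= 1:
        return values[:]
    middle = len(values) // 2
    left = merge_sort(values[:middle])
    right = merge_sort(values[middle:])
    # merge the two sorted halves by repeatedly taking the smaller front item
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # one half is exhausted; whatever remains in the other is already sorted
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

values = ['hello', 'world', 'and', 'practice', 'makes', 'perfect', 'again']
print(f'Sorted: {merge_sort(values)}')
```

Unlike the bubblesort above, this version returns a new list and leaves the original untouched.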

Related

Memory efficient way to create a set from a list of lists in Python

I have a list of lists, in which each inner-list is a tokenized text, so its length is the number of words in the text.
corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]
Now, I want to create a set that contains all unique tokens from the corpus. For the above example, the desired output would be:
{'this', 'is', 'text', 'one', 'two'}
Currently, I have:
from itertools import chain
all_texts_list = list(chain(*corpus))
vocabulary = set(all_texts_list)
But this seems a memory-inefficient way of doing it.
Is there a more efficient way to obtain this set?
I found this link. However, there they want to find the set of unique lists and not the set of unique elements from the list.
You can use a simple for loop with set update operation.
vocabulary = set()
for tokens in corpus:
    vocabulary.update(tokens)
Output:
{'this', 'one', 'text', 'two', 'is'}
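A possible alternative, sketched here under the assumption that the corpus is already in memory as a list of lists, is to keep itertools.chain but skip the intermediate list. chain.from_iterable yields tokens lazily, so set() consumes them one at a time and only the set itself grows.

```python
from itertools import chain

corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]

# chain.from_iterable walks the inner lists lazily; no flattened copy is built
vocabulary = set(chain.from_iterable(corpus))

# sets are unordered, so sort only for a stable display
print(sorted(vocabulary))
```

This should produce the same set as the loop with update(); which one reads better is mostly a matter of taste.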

If I have a nested list for translating words how do I print the translation of a users input word?

I have a nested list as follows:
lst == [ ('cat', 'gatto'), ('one', 'uno'), ('two', 'due'), ('three', 'tre'), ('four', 'quattro') ]
If a user inputs the second element of any of these sublists, how do I return the first element?
So if the user inputs 'gatto', the returned value is 'cat'
A simple answer is the following, but I suggest you also learn about dictionaries in Python, which I cover in the second answer.
# First answer
lst = [ ('cat', 'gatto'), ('one', 'uno'), ('two', 'due'), ('three', 'tre'), ('four', 'quattro') ]
inp = 'gatto'
for i in lst:
    if inp == i[1]:
        print(i[0])
You can use a Python dictionary to store the pairs, which is very handy for this kind of lookup. A dictionary holds key: value pairs, and looking up a value by its key does not require scanning every element the way the loop above does.
# Second way
translations = {'gatto': 'cat', 'uno': 'one', 'due': 'two', 'tre': 'three', 'quattro': 'four'}
print('gatto', translations['gatto'])
This prints gatto cat, so the lookup returns cat (as you wanted).
Also, note that you used two equals signs (==) when defining your list. Only one (=) is required.
Hope this helps you.
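If the data already exists as the list of tuples from the question, the dictionary does not have to be typed out by hand. The sketch below builds it with a dict comprehension; the names italian_to_english and translate are made up for this example.

```python
lst = [('cat', 'gatto'), ('one', 'uno'), ('two', 'due'),
       ('three', 'tre'), ('four', 'quattro')]

# each tuple is (english, italian); use the Italian word as the key
italian_to_english = {italian: english for english, italian in lst}

def translate(word):
    """Return the English word for an Italian one, or None if unknown."""
    return italian_to_english.get(word)

print(translate('gatto'))  # cat
print(translate('cane'))   # None, because 'cane' is not in the list
```

Using .get() instead of square brackets avoids a KeyError when the user types a word that is not in the list.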

How to remove a string element from a list in Python .remove is not working for me

I am trying to remove strings from a list after they have been chosen, to avoid getting the same word again, but when I try .remove or .pop it doesn't remove the word. Why is this and how could I fix it?
I also tried to create a copy of the word in case it got removed before returning the word from the function. Would this affect the word if it's already chosen?
Thanks for any help, I am new to programming, as you could probably tell!
import random
def choose_a_word_easy():  # function for choosing random easy word.
    words = ['ant', 'bee', 'cat', 'dog', 'egg', 'hat', 'golf', 'jelly', 'king', 'bird', 'hot', 'cold', 'fish', 'log',
             'dad', 'mum', 'goal', 'help', 'file', 'neat', 'car', 'moon', 'eye', 'tree', 'rice', 'ice', 'speed', 'rat',
             'water', 'rain', 'snow', 'spoon', 'light', 'gold', 'zoo', 'oil', 'goat', 'yoga', 'judo', 'japan', 'hello']
    pick = random.choice(words)  # randomly choose any word from the list.
    # p1 = pick
    words.remove(pick)
    return pick
By declaring your list inside the function choose_a_word_easy, a new list is created on every call. You want the same list to be reused on every call. Do so by creating the list outside the function's scope and passing it as an argument.
import random
words = ['ant', 'bee', 'cat', 'dog', ...]
def pick_and_remove(lst):
    pick = random.choice(lst)
    lst.remove(pick)
    return pick
pick = pick_and_remove(words)
print(pick) # 'bee'
print(words) # ['ant', 'cat', 'dog', ...]
Note that your function can be made slightly more efficient by randomly picking an index and popping it, which avoids searching the list for the chosen value.
def pick_and_remove(lst):
    i = random.randrange(len(lst))
    return lst.pop(i)
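A different approach, not the one in the answer above, is to shuffle a copy of the list once and then pop from the end each time a word is needed. This is only a sketch with a shortened word list; the name remaining is made up for the example.

```python
import random

words = ['ant', 'bee', 'cat', 'dog', 'egg', 'hat']

# shuffle a copy once so the original list stays intact
remaining = words[:]
random.shuffle(remaining)

picks = []
while remaining:
    # pop() with no index removes the last item, so no word can repeat
    picks.append(remaining.pop())

print(picks)      # every word exactly once, in random order
print(remaining)  # [] once all words have been used
```

Because each pop takes the last element, nothing has to be shifted or searched for, and the loop ends naturally when the words run out.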

Break down a long string into multiple lists

Is there a simple way to break down this string into multiple lists in Python so that I can then create a dataframe with those lists?
1|Mirazur|Menton, France|2|Noma|Copenhagen, Denmark|3|Asador Etxebarri|Axpe, Spain|4|Gaggan|Bangkok, Thailand|5|Geranium|Copenhagen, Denmark|6|Central|Lima, Peru|7|Mugaritz|San Sebastián, Spain|8|Arpège|Paris, France|9|Disfrutar|Barcelona, Spain|10|Maido|Lima, Peru|11|Den|Tokyo, Japan
I want to break it down so that it looks like:
[1, Mirazur, Menton, France]
[2, Noma, Copenhagen, Denmark]
and so on and so forth.
I'm really new to all this, so any advice is really appreciated. A simpler answer, rather than a 'fancier' one, would be great so that I can understand the more basic concepts first!
Piece of cake. The basis is splitting on the | character; this will give you a flat list of all items. Next, split the list into smaller ones of a fixed size; a well-researched question with lots of answers. I chose https://stackoverflow.com/a/5711993/2564301 because it does not use any external libraries and returns a useful base for the next step:
print (zip(*[data.split('|')[i::3] for i in range(3)]))
This returns a zip type, as can be seen with
for item in zip(*[data.split('|')[i::3] for i in range(3)]):
    print (item)
which comes pretty close:
('1', 'Mirazur', 'Menton, France')
('2', 'Noma', 'Copenhagen, Denmark')
('3', 'Asador Etxebarri', 'Axpe, Spain')
etc.
(If you are wondering why zip is needed, print the result of [data.split('|')[i::3] for i in range(3)].)
The final step is to convert each tuple into a list of its own.
Putting it together:
import pprint
data = '1|Mirazur|Menton, France|2|Noma|Copenhagen, Denmark|3|Asador Etxebarri|Axpe, Spain|4|Gaggan|Bangkok, Thailand|5|Geranium|Copenhagen, Denmark|6|Central|Lima, Peru|7|Mugaritz|San Sebastián, Spain|8|Arpège|Paris, France|9|Disfrutar|Barcelona, Spain|10|Maido|Lima, Peru|11|Den|Tokyo, Japan'
data = [list(item) for item in zip(*[data.split('|')[i::3] for i in range(3)])]
pprint.pprint (data)
Result (nice indentation courtesy of pprint):
[['1', 'Mirazur', 'Menton, France'],
 ['2', 'Noma', 'Copenhagen, Denmark'],
 ['3', 'Asador Etxebarri', 'Axpe, Spain'],
 ['4', 'Gaggan', 'Bangkok, Thailand'],
 ['5', 'Geranium', 'Copenhagen, Denmark'],
 ['6', 'Central', 'Lima, Peru'],
 ['7', 'Mugaritz', 'San Sebastián, Spain'],
 ['8', 'Arpège', 'Paris, France'],
 ['9', 'Disfrutar', 'Barcelona, Spain'],
 ['10', 'Maido', 'Lima, Peru'],
 ['11', 'Den', 'Tokyo, Japan']]
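Since the question asked for the most basic approach, here is a sketch of the same split-then-group idea written as a plain loop with slicing instead of zip. It uses a shortened version of the string to keep the example small, and the variable names parts and rows are made up for the example.

```python
# shortened sample of the original string
data = '1|Mirazur|Menton, France|2|Noma|Copenhagen, Denmark|3|Asador Etxebarri|Axpe, Spain'

# step 1: split on | to get one flat list of items
parts = data.split('|')

# step 2: walk through the flat list three items at a time
rows = []
for start in range(0, len(parts), 3):
    rows.append(parts[start:start + 3])

for row in rows:
    print(row)
```

Each row is a list of three strings (rank, name, location), the same shape as the pprint result above.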

Difference between the total number of words (length of a list) and vocabulary of a list or file in NLP?

How to compute the total number of words and vocabulary of a corpus stored as a list in python? What is the major difference between these two terms?
Suppose I am using the following list. The total number of words, or the length of the list, can be computed by len(L1). However, I would like to know how to calculate the vocabulary of the list below.
L1 = ['newnes', 'imprint', 'elsevier', 'elsevier', 'corporate', 'drive', 'suite',
'burlington', 'usa', 'linacre', 'jordan', 'hill', 'oxford', 'uk',
'elsevier', 'inc', 'right', 'reserved', 'exception', 'newness', 'uk', 'military',
'organization', 'summary', 'task', 'definition', 'system', 'definition',
'system', 'engineering', 'military', 'project', 'military', 'project',
'definition', 'input', 'output', 'operation', 'requirement', 'development',
'overview', 'spacecraft', 'development', 'architecture', 'design']
Is this what you're looking for?
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
list_of_tokens = ['cat', 'dog','cats', 'children','dog']
unique_tokens = set(list_of_tokens)
### {'cat', 'cats', 'children', 'dog'}
tokens_lemmatized = [ lemmatizer.lemmatize(token) for token in unique_tokens]
#### ['child', 'cat', 'cat', 'dog']
unique_tokens_lemmatized = set(tokens_lemmatized)
#### {'cat', 'child', 'dog'}
print('Input tokens:', len(list_of_tokens), 'Lemmatized tokens:', len(unique_tokens_lemmatized))
#### Input tokens: 5 Lemmatized tokens: 3
If your question is regarding how to get the number of unique words in a list, that can be achieved using sets. (From what I remember from NLP, the vocabulary of a corpus should mean the collection of unique words in that corpus.)
Convert your list to a set by passing it to set(), then call len() on the result. In your case, you would get the number of unique words in the list L1 like so:
len(set(L1)) #number of unique words in L1
Edit: You now mentioned that the vocabulary is the set of lemmatized words. In this case, you would do the same thing except import a lemmatizer from NLTK or whatever NLP library you're using, run your list through the lemmatizer, convert the output into a set, and proceed as above.
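To make the difference between the two numbers concrete, here is a small sketch without lemmatization, using a handful of tokens taken from L1. The Counter at the end is an optional extra, assuming per-word frequencies are also of interest.

```python
from collections import Counter

tokens = ['military', 'project', 'military', 'project',
          'definition', 'system', 'definition']

# total number of words: every occurrence counts
total_words = len(tokens)

# vocabulary: each distinct word counts once
vocabulary = set(tokens)

# how often each distinct word occurs
counts = Counter(tokens)

print('Total words:', total_words)          # Total words: 7
print('Vocabulary size:', len(vocabulary))  # Vocabulary size: 4
print(counts['military'])                   # 2
```

The total counts every token, while the vocabulary size counts each distinct token once, so the vocabulary can never be larger than the total.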
