Best way to split free text into an items list in Python (python-3.x)

I have free text like '2packetLays 1 liter milk 2loafsbrownbread2kgbasmatirice' and want to split it into the format below, separating out closely matched food items:
['2', 'packet', 'Lays', '1', 'liter', 'milk', '2', 'loafs', 'brownbread', '2', 'kg', 'basmatirice']
I was able to split the above text into words as below:
['2', 'packet', 'Lays', '1', 'liter', 'milk', '2', 'loafs', 'brown' ,'bread', '2', 'kg', 'basmati', 'rice']
But I want 'brownbread' as one word, whereas I was getting it as two separate words:
['brown', 'bread']
and I want 'basmatirice' as one word too.
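One possible approach (a sketch only, assuming you can maintain a vocabulary of the units and food items you care about) is to first split digit runs from letter runs with a regex, and then do a greedy longest-match segmentation of each letter run against that vocabulary, so that 'brownbread' and 'basmatirice' stay whole:

import re

# Hypothetical vocabulary; in practice this would be your full list of units and products.
VOCAB = {"packet", "lays", "liter", "milk", "loafs", "brownbread", "kg", "basmatirice"}

def split_items(text):
    tokens = []
    # Separate digit runs from letter runs (this also splits on whitespace).
    for chunk in re.findall(r"\d+|[A-Za-z]+", text):
        if chunk.isdigit():
            tokens.append(chunk)
            continue
        low = chunk.lower()
        i = 0
        while i < len(low):
            # Greedy longest-prefix match so "brownbread" is kept as one item.
            for j in range(len(low), i, -1):
                if low[i:j] in VOCAB:
                    tokens.append(chunk[i:j])
                    i = j
                    break
            else:
                # No vocabulary match: keep the remainder as a single token.
                tokens.append(chunk[i:])
                break
    return tokens

print(split_items('2packetLays 1 liter milk 2loafsbrownbread2kgbasmatirice'))
# ['2', 'packet', 'Lays', '1', 'liter', 'milk', '2', 'loafs', 'brownbread', '2', 'kg', 'basmatirice']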

Related

How to change tokenization (huggingface)?

In an NER task we want to classify sentence tokens using different tagging schemes (BIO, for example). But we can't join the subtokens back together when the tokenizer splits the sentence more finely than we want.
I would like to classify the sentence 'weight 40.5 px' with a custom tokenization (split on spaces in this example).
But after tokenization
tokenizer.convert_ids_to_tokens(tokenizer(['weight', '40.5', 'px'], is_split_into_words=True)['input_ids'])
I get
['[CLS]', 'weight', '40', '.', '5', 'p', '##x', '[SEP]']
where '40.5' is split into the separate tokens '40', '.', '5'. This is a problem for me, because I want to classify 3 tokens ('weight', '40.5', 'px'), but they don't get merged back automatically, since '40', '.', '5' does not look like '40', '##.', '##5'.
What can I do to solve this problem?
You can get the relation between the raw text and the tokenized tokens through "offset_mapping".
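For example, here is a minimal sketch of that idea with a fast tokenizer (the model name "bert-base-uncased" is only an assumption for illustration); word_ids() gives the same word-level mapping directly:

from transformers import AutoTokenizer

# Assumes a fast tokenizer; "bert-base-uncased" is just an example model name.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

words = ["weight", "40.5", "px"]
enc = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'weight', '40', '.', '5', 'p', '##x', '[SEP]']

# offset_mapping holds the (start, end) character span of each sub-token within its word,
# and word_ids() maps every sub-token back to the index of the original word
# (None for the special tokens), so '40', '.', '5' all map to word index 1.
print(enc["offset_mapping"])
print(enc.word_ids())
# e.g. [None, 0, 1, 1, 1, 2, 2, None]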

Looping through a list using a range function stops working after a number of steps

So I have a list of words and numbers, and I was trying to use a for loop and the range function to remove the numbers so that I am left with only the words, as shown below:
# This is a sample of elements in the list:
billboard_artists = ['The Chainsmokers Featuring Halsey', '6', '1', '3', '6', '1', '3', 'Sia Featuring Sean Paul', '1', '1', '27', '1', '1', '27', 'Major Lazer Featuring Justin Bieber & MO', '2', '2', '4', '2', '2', '4', 'twenty one pilots', '4', '4', '9', '4', '4', '9', 'Calvin Harris Featuring Rihanna']

for item in billboard_artists:
    try:
        for num in range(100):
            if int(item) == num:
                billboard_artists.remove(item)
    except ValueError:
        print(item)

print(billboard_artists)
The expected result was to get a full list containing just the songs. However, the loop only works for the first 56 elements or so, before a few of the numbers start reappearing:
# ['The Chainsmokers Featuring Halsey', 'Sia Featuring Sean Paul', 'Major Lazer Featuring Justin Bieber & MO', 'twenty one pilots', 'Calvin Harris Featuring Rihanna', 'twenty one pilots', 'Drake Featuring WizKid & Kyla', 'The Chainsmokers Featuring Daya', '3', 'Justin Timberlake', '1', '1', 'Adele', 'Rihanna']
The names on the list are to be used in making a playlist using the Spotify Restful API.
After noticing this issue, I went with a different solution, using the ValueError exception to build a different list of the artists.
my_list = []
for item in billboard_artists:
    try:
        for num in range(100):
            if int(item) == num:
                billboard_artists.remove(item)
    except ValueError:
        my_list.append(item)
Which DID eventually return the full list of only the words and not the numbers. (Mission Accomplished I guess)
However, it still bothers me WHY the RANGE function BROKE DOWN AFTER A FEW ITERATIONS (60 or so). Any possible explanations would be greatly appreciated.
I think there are a number of reasons that could impact your output. One is that the size of your list is changing as you iterate over it (you are removing items from the list while you are going over it). But since your second implementation worked, it seems this isn't a huge issue in this case.
The other reason I can think of is that an item in your list might be, for example, "1". There are many instances of "1" in your list, and the official docs say that list.remove() removes the first item in the list that it encounters with that value. But if the size of the list is changing while you are trying to remove, this can skip over some items.
Lastly, this is not the best way to achieve this: for each item in the list you iterate over 100 numbers, and on top of that list.remove() scans the entire list each time as well, making it very inefficient. You don't really need any of that; you can rewrite it as follows:
my_list = [item for item in billboard_artists if not item.isnumeric()]
In this implementation you are not removing anything from the billboard_artists list, so you won't have any wonky behaviour! And it only goes over the billboard_artists list once :)
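To see the skipping concretely, here is a tiny self-contained illustration (not the OP's data) of what removing during iteration does:

nums = ['1', '2', '3', 'a', 'b']
for item in nums:
    if item.isnumeric():
        nums.remove(item)

print(nums)
# ['2', 'a', 'b'] -- '2' survives because removing '1' shifted everything left,
# so the iterator stepped straight past it.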

XPath: how to get the node following text()

This is the structure I am trying to parse (shown as an image in the original question).
I have a for loop over every child of <p>. I need to get the name and the associated number in the <sup> node.
desired output would be like:
Toloo, 1;
William C Baker, 2;
etc.
Here is my for loop:
for b in i.xpath('./p/text() | ./p/b/text()'):
    b.xpath('.//following-sibling::sup[1]/text()').get()
It does not return any result. Where am I going wrong?
PS: if I run the XPath without the for loop, it gets the job done:
i.xpath('./p[2]/text()/following-sibling::sup/text() | ./p[2]/b/text()/following-sibling::sup/text()').getall()
['2', '1', '1', '1', '2', '1', '1', '1', '2']
Without more details:
link, parser (html, lxml, ...), scraper (BeautifulSoup, Selenium, ...), the desired output format (list, dict, ...), do you really need ";" as a separator?
If the number of names ALWAYS matches the number of numbers then try this:
So given your image:
from io import StringIO
from lxml import etree
f = StringIO('<p><b>Toloo Taghian</b><sup>1</sup>", Willam C. Baker"<sup>1</sup>", Stephanie Bertrand"<sup>2</sup></p>')
i = etree.parse(f)
Then:
groups = i.xpath('//p')
for el in groups:
    name = el.xpath('.//text()')
Result:
['Toloo Taghian', '1', '", Willam C. Baker"', '1', '", Stephanie Bertrand"', '2']
Then:
name2 = list(zip(name[::2], name[1::2]))
Result as a list of tuples:
[('Toloo Taghian', '1'),
('", Willam C. Baker"', '1'),
('", Stephanie Bertrand"', '2')]
Consider a list comprehension, e.g. result = [text.xpath('concat(., ": ", following-sibling::sup[1])') for text in i.xpath('./p/text() | ./p/b/text()')]
Thanks for your posts, guys.
What I did in the end was parse all the text nodes and then, for each of them, find the first preceding sibling. That turned out to be the simplest solution.
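For reference, a minimal sketch of that kind of name/number pairing with lxml, reusing the sample markup from the answer above (the details may differ from the OP's actual page and exact approach):

from io import StringIO
from lxml import etree

f = StringIO('<p><b>Toloo Taghian</b><sup>1</sup>", Willam C. Baker"<sup>1</sup>", Stephanie Bertrand"<sup>2</sup></p>')
tree = etree.parse(f)

pairs = []
for sup in tree.xpath('//p/sup'):
    # The name is the node (text or <b>) immediately before each <sup>.
    prev = sup.xpath('preceding-sibling::node()[1]')[0]
    name = prev if isinstance(prev, str) else prev.text
    pairs.append((name.strip(' ,"'), sup.text))

print(pairs)
# [('Toloo Taghian', '1'), ('Willam C. Baker', '1'), ('Stephanie Bertrand', '2')]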

sqlite3.ProgrammingError: Incorrect number of bindings supplied. Following along with CS50

I know there are a lot of questions about this on SO, but none of them helped me. I am trying to follow along with CS50 lecture 9, using sqlite3 instead of their SQL library, but keep getting the error sqlite3.ProgrammingError: Incorrect number of bindings supplied. on the following line of code:
books = cursor.execute("SELECT * FROM books WHERE id IN (?)", session["cart"])
No matter how I format it, I can't get it to work. I have tried using a list, a tuple, and all combinations of those two, as well as a tuple of tuples and a list of lists. Nothing works. I have tried using executemany as well, but that didn't work either.
The table books has only the id and name of the book, with id being the primary key. session["cart"] is a list, which in this case happened to be:
['1', '1', '2', '2', '4', '2', '5', '7', '6', '3', '4', '4']
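For reference, the usual fix (a sketch only, with a hypothetical database file name) is to generate one ? placeholder per value in the list, since a single ? binds exactly one parameter:

import sqlite3

conn = sqlite3.connect("store.db")  # hypothetical file name
cursor = conn.cursor()

cart = ['1', '1', '2', '2', '4', '2', '5', '7', '6', '3', '4', '4']

# One "?" per item in the cart, e.g. "?, ?, ?, ..." for 12 items.
placeholders = ", ".join("?" for _ in cart)
books = cursor.execute(
    f"SELECT * FROM books WHERE id IN ({placeholders})", cart
).fetchall()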

Break down a long string into multiple lists

Is there a simple way to break down this string into multiple lists in Python so that I can then create a dataframe with those lists?
1|Mirazur|Menton, France|2|Noma|Copenhagen, Denmark|3|Asador Etxebarri|Axpe, Spain|4|Gaggan|Bangkok, Thailand|5|Geranium|Copenhagen, Denmark|6|Central|Lima, Peru|7|Mugaritz|San Sebastián, Spain|8|Arpège|Paris, France|9|Disfrutar|Barcelona, Spain|10|Maido|Lima, Peru|11|Den|Tokyo, Japan
I want to break it down so that it looks like:
[1, Mirazur, Menton, France]
[2, Noma, Copenhagen, Denmark]
and so on so forth.
I'm really new to all this, so any advice is really appreciated. The simplest answer possible, rather than any 'fancier' one, would be great so that I can understand the more basic concepts first!
Piece of cake. The basis is splitting on the | character; this will give you a flat list of all items. Next, split the list into smaller ones of a fixed size; a well-researched question with lots of answers. I chose https://stackoverflow.com/a/5711993/2564301 because it does not use any external libraries and returns a useful base for the next step:
print (zip(*[data.split('|')[i::3] for i in range(3)]))
This returns a zip type, as can be seen with
for item in zip(*[data.split('|')[i::3] for i in range(3)]):
    print (item)
which comes pretty close:
('1', 'Mirazur', 'Menton, France')
('2', 'Noma', 'Copenhagen, Denmark')
('3', 'Asador Etxebarri', 'Axpe, Spain')
etc.
(If you are wondering why zip is needed, print the result of [data.split('|')[i::3] for i in range(3)].)
The final step is to convert each tuple into a list of its own.
Putting it together:
import pprint
data = '1|Mirazur|Menton, France|2|Noma|Copenhagen, Denmark|3|Asador Etxebarri|Axpe, Spain|4|Gaggan|Bangkok, Thailand|5|Geranium|Copenhagen, Denmark|6|Central|Lima, Peru|7|Mugaritz|San Sebastián, Spain|8|Arpège|Paris, France|9|Disfrutar|Barcelona, Spain|10|Maido|Lima, Peru|11|Den|Tokyo, Japan'
data = [list(item) for item in zip(*[data.split('|')[i::3] for i in range(3)])]
pprint.pprint (data)
Result (nice indentation courtesy of pprint):
[['1', 'Mirazur', 'Menton, France'],
['2', 'Noma', 'Copenhagen, Denmark'],
['3', 'Asador Etxebarri', 'Axpe, Spain'],
['4', 'Gaggan', 'Bangkok, Thailand'],
['5', 'Geranium', 'Copenhagen, Denmark'],
['6', 'Central', 'Lima, Peru'],
['7', 'Mugaritz', 'San Sebastián, Spain'],
['8', 'Arpège', 'Paris, France'],
['9', 'Disfrutar', 'Barcelona, Spain'],
['10', 'Maido', 'Lima, Peru'],
['11', 'Den', 'Tokyo, Japan']]
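Since the end goal was a dataframe, here is a short follow-up sketch with pandas (the column names are made up for illustration, and only the first three records are shown to keep it compact):

import pandas as pd

raw = '1|Mirazur|Menton, France|2|Noma|Copenhagen, Denmark|3|Asador Etxebarri|Axpe, Spain'
rows = [list(item) for item in zip(*[raw.split('|')[i::3] for i in range(3)])]

# Build the dataframe directly from the list of lists.
df = pd.DataFrame(rows, columns=['rank', 'restaurant', 'location'])
print(df)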
