Menu extracting - nlp

I am interested in extracting and structuring information from restaurant menus. I need to extract the items from a menu in the form category / name / price.
For instance, we have the following website. It has a drinks section containing a number of items. For that website I'd like to be able to extract
Drink / Cappuccino / € 1,50
SANDWICHES / filled sandwich, pistolet (round roll) or emperor roll / € 1,30
etc ...
Of course it shouldn't be limited only to this website.
The only way I can see to handle that is applying a bunch of regexps, but I don't believe listing all possible dish names is feasible.
I know that the topic might be too broad for a question, but anyway any suggestions or references to relevant articles or books will be much appreciated.

This seems quite possible. You may not be able to list all possible dishes, but you can list all possible categories.
Assuming that in every menu the dish name follows the category name and is itself followed by the price, you can identify dish names.
The algorithm will look like this:
foreach(category: category_list):
    foreach(word: document):
        if(category == word):
            dish = read next (if the data is structured as a table, read the next row or column)
            price = read next and check its format to see whether it is a currency amount or a price
The point is you will need to analyse different websites to understand how the information is structured and prepare your algorithm to deal with all possible structures.
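As a rough illustration of the idea above, here is a minimal Python sketch. It assumes the menu has already been reduced to one text fragment per line, that category names come from a known list, and that prices look like a currency symbol followed by a number. The names CATEGORIES, PRICE_RE and extract_menu are made up for this example, and unlike the pseudocode it keeps reading items until the next category appears.

```python
import re

# Hypothetical category list; a real one would be built by surveying many menus.
CATEGORIES = {"DRINKS", "SANDWICHES"}

# Assumed price format: currency symbol, optional space, digits, optional decimals.
PRICE_RE = re.compile(r"^(?:€|\$|£)\s?\d+(?:[.,]\d{2})?$")


def extract_menu(lines, categories=CATEGORIES):
    """Return (category, name, price) tuples from a list of text lines."""
    items = []
    category = None      # most recent category heading seen
    pending_name = None  # dish name waiting for its price
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.upper() in categories:
            category = line
            pending_name = None
        elif PRICE_RE.match(line):
            # A price closes the item only if a category and a name precede it.
            if category and pending_name:
                items.append((category, pending_name, line))
            pending_name = None
        else:
            pending_name = line
    return items


menu_lines = ["Drinks", "Cappuccino", "€ 1,50", "SANDWICHES", "filled sandwich", "€ 1,30"]
for item in extract_menu(menu_lines):
    print(" / ".join(item))
```

Real pages will need a preprocessing step (for example reading table rows or columns) before something like this can be applied.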

Related

REGEX for Netflix Viewing Activity Titles (TV Show vs Movie using 'Episode')

I have a csv file containing Netflix viewing data for all users on an account (38k entries) which I am analyzing in Power Bi.
There was no column for Movie/TV, which I needed, but it's clear that entries with the word 'Episode' in them are episodes from TV shows, Netflix series, etc., so I created a column in Power Query based on that. Here is an example of what I mean (other columns removed).
Title | ContentType
The Office (U.S.): Season 5: Heavy Competition (Episode 24) | TV
Forensic Files: Collection 1: A Tight Leash (Episode 2) | TV
Kung Fu Panda | Movie
Teen Wolf: Season 1: Lunatic (Episode 8) | TV
Kung Fu Panda 2 | Movie
This seems to have worked quite well, but ideally I want a way to be sure I don't have any erroneously labeled entries (e.g. "Star Wars: A New Hope (Episode IV)", which is not an entry in this dataset, but there is a risk of other 'Movie' titles using this format that I can't manually check for).
I am a total Regex beginner, and sloppily put together the expression \b[Ee]pisode[\s\S]\b[^0123456789] to try and find any entries with the word episode that didn't have a number following it, and all entries were still TV Shows, but this would not account for something like "A New Hope(Episode 4)".
I'm a little stuck now and there are likely other exceptions to the 'Episode' rule that I am not considering. Functionally, the way I have done this is working for my purposes, but I'm trying to show due diligence for anyone that reads my report.
My question: is there a better expression to try that would account for such outliers?
Thanks!
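One possible way to approach this, sketched in Python rather than Power Query: treat a title as TV only when it has the full shape seen in the sample rows ("Show: Season or Collection: Episode title (Episode N)"), and flag anything else that mentions "episode" for manual review. The three-part colon structure is an assumption drawn only from the example rows above, and the names TV_PATTERN, MENTIONS_EPISODE and classify_title are invented for this sketch.

```python
import re

# Assumed TV shape: "<show>: <season/collection>: <episode title> (Episode <digits>)"
TV_PATTERN = re.compile(r"^.+: .+: .+ \(Episode \d+\)\s*$")

# Any mention of the word "episode", in any letter case.
MENTIONS_EPISODE = re.compile(r"\bepisode\b", re.IGNORECASE)


def classify_title(title):
    """Return 'TV', 'Movie', or 'Review' for titles that need a manual check."""
    if TV_PATTERN.match(title):
        return "TV"
    if MENTIONS_EPISODE.search(title):
        # Mentions an episode but lacks the full TV structure, e.g. a film subtitle.
        return "Review"
    return "Movie"


for title in [
    "The Office (U.S.): Season 5: Heavy Competition (Episode 24)",
    "Kung Fu Panda 2",
    "Star Wars: A New Hope (Episode IV)",
    "A New Hope(Episode 4)",
]:
    print(title, "->", classify_title(title))
```

The "Review" bucket is the due-diligence part: if it comes back empty for the whole file, that is evidence the simple 'Episode' rule did not mislabel anything.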

Assigning values to a list using certain parameters in Excel

I'm new here. I was wondering if someone could help me simplify the way I assign a type of value to a list of products, maybe with some macro which I'm really not an expert at.
For now what I have are two tables:
Table 1: has details on what type of product is being manufactured. It has a name (ID), description (Tipico), Qty of connections (puntas), and amount of hours used to manufacture each product (Hs estimadas)
Table 2: has specific information about each individual product that is currently being manufactured. People fill in the specific number of connections (puntas) each product will require, and this assigns a certain type of product (in column "Tipico") according to the parameters defined in Table 1, for example:
To do this I have created this simple IF / AND / OR function in Excel, but I find it too long, too messy and really difficult to correct or extend when new typical products are needed:
=IF(AND(OR([#Area]="PROTECCIÓN";[#Area]="CONTROL");RIGHT([#Tablero];3)<>"RTU";[#Area]<>"DAG";AND([#[Cant. Puntas]]>40;[#[Cant. Puntas]]<800));"TIPICO 1";
IF(AND(OR([#Area]="PROTECCIÓN";[#Area]="CONTROL");RIGHT([#Tablero];3)<>"RTU";[#Area]<>"DAG";AND([#[Cant. Puntas]]>800;[#[Cant. Puntas]]<1200));"TIPICO 2";
IF(AND(OR([#Area]="PROTECCIÓN";[#Area]="CONTROL");RIGHT([#Tablero];3)<>"RTU";[#Area]<>"DAG";AND([#[Cant. Puntas]]>1200;[#[Cant. Puntas]]<1600));"TIPICO 3";
IF(AND(OR([#Area]="PROTECCIÓN";[#Area]="CONTROL");RIGHT([#Tablero];3)<>"RTU";[#Area]<>"DAG";AND([#[Cant. Puntas]]>1600;[#[Cant. Puntas]]<2000));"TIPICO 4";
IF(AND(OR([#Area]="PROTECCIÓN";[#Area]="CONTROL");RIGHT([#Tablero];3)<>"RTU";[#Area]<>"DAG";AND([#[Cant. Puntas]]>2000;[#[Cant. Puntas]]<2400));"TIPICO 5";
IF(AND([#Area]="COMUNICACIONES";LEFT([#Tablero];14)="TABLERO ETL600");"TIPICO 6";
IF(AND([#Area]="COMUNICACIONES";OR(LEFT([#Tablero];11)="TABLERO FOX";LEFT([#Tablero];11)="TABLERO NSD570";LEFT([#Tablero];15)="TABLERO CENTRAL"));"TIPICO 7";
IF(AND([#Area]="CONTROL";RIGHT([#Tablero];3)="RTU");"TIPICO 8";
IF(AND([#Area]="DAG";AND([#[Cant. Puntas]]>0;[#[Cant. Puntas]]<=900));"TIPICO 9";
IF(AND([#Area]="DAG";AND([#[Cant. Puntas]]>900;[#[Cant. Puntas]]<=1200));"TIPICO 10";
IF(AND([#Area]="DAG";AND([#[Cant. Puntas]]>1200;[#[Cant. Puntas]]<=1600));"TIPICO 11";
IF(AND([#Area]="CONTROL";RIGHT([#Tablero];4)="TIOR");"TIPICO 12";
IF(AND([#Area]="CONTROL";OR(RIGHT([#Tablero];8)="IEC61850";RIGHT([#Tablero];9)="OPERACIÓN"));"TIPICO 13";"")))))))))))))
Is it possible to do this any other way?
Any ideas will be much appreciated!
Thanks to everyone
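One alternative to a nested formula is to keep the rules in a table and evaluate them in order, so that adding a new "Tipico" means adding one row. The Python sketch below only illustrates that idea; it is not Excel or VBA. The rule values are read off the formula above, the names RULES and classify are invented here, and one thing is assumed: the "TABLERO NSD570" test is treated as a plain prefix match, whereas the formula compares only the first 11 characters against a 14-character string.

```python
def starts(*prefixes):
    """Predicate: board name starts with any of the given prefixes."""
    return lambda board: board.startswith(prefixes)


def ends(*suffixes):
    """Predicate: board name ends with any of the given suffixes."""
    return lambda board: board.endswith(suffixes)


def any_board(board):
    return True


def not_rtu(board):
    return not board.endswith("RTU")


PC = {"PROTECCIÓN", "CONTROL"}

# (tipico, allowed areas, board test, low (exclusive), high, high is inclusive)
RULES = [
    ("TIPICO 1", PC, not_rtu, 40, 800, False),
    ("TIPICO 2", PC, not_rtu, 800, 1200, False),
    ("TIPICO 3", PC, not_rtu, 1200, 1600, False),
    ("TIPICO 4", PC, not_rtu, 1600, 2000, False),
    ("TIPICO 5", PC, not_rtu, 2000, 2400, False),
    ("TIPICO 6", {"COMUNICACIONES"}, starts("TABLERO ETL600"), None, None, False),
    ("TIPICO 7", {"COMUNICACIONES"},
     starts("TABLERO FOX", "TABLERO NSD570", "TABLERO CENTRAL"), None, None, False),
    ("TIPICO 8", {"CONTROL"}, ends("RTU"), None, None, False),
    ("TIPICO 9", {"DAG"}, any_board, 0, 900, True),
    ("TIPICO 10", {"DAG"}, any_board, 900, 1200, True),
    ("TIPICO 11", {"DAG"}, any_board, 1200, 1600, True),
    ("TIPICO 12", {"CONTROL"}, ends("TIOR"), None, None, False),
    ("TIPICO 13", {"CONTROL"}, ends("IEC61850", "OPERACIÓN"), None, None, False),
]


def classify(area, board, puntas):
    """Return the first matching tipico, or "" when no rule applies."""
    for tipico, areas, board_ok, low, high, inclusive in RULES:
        if area not in areas or not board_ok(board):
            continue
        if low is not None:
            below_high = puntas <= high if inclusive else puntas < high
            if not (puntas > low and below_high):
                continue
        return tipico
    return ""


print(classify("CONTROL", "TABLERO A", 500))
print(classify("DAG", "TABLERO B", 900))
```

Because the bounds are copied as written, a PROTECCIÓN or CONTROL product with exactly 800, 1200, 1600 or 2000 puntas matches no rule, just as in the formula; that may or may not be intended.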

Sharepoint Lookup

I have been absolutely stymied by SharePoint lookups and the complete lack of information anywhere that relates to my problem, so this is one last stab at seeing if anyone has a clue. If not, I am going to find another job which doesn't involve the use of this very good but highly complicated system.
My problem is that I want to add a column in list 1 that looks up a column in list 2. All very easy, you may think, and the nice videos on YouTube make it seem very easy. So imagine the frustration when I create the column, choose the appropriate list from the "get information from" drop-down, and then go to the "in this column" drop-down to find that the fields I want are missing. I have tried this many ways and could not resolve it. So I decided to start again and set up a brand new list 2 which contains just 4 columns, each created manually: one column is a "single line of text" called Financial year, and the other three are "number" columns called Population, Dwellings and Non Domestic. Having done that, I would expect to see those 4 columns (Financial year, Population, Dwellings and Non Domestic) all appear in the "in this column" drop-down when setting up the lookup. But no, of course not: bloody SharePoint will only show Title, Financial year, ID, Content Type, version, and Title (linked to item), none of which I am the slightest bit interested in. I want to look up Population, or Dwellings, or Non Domestic. Why is this so easy to do in Excel, but in SharePoint it seems as if Bill Gates has decided he's in charge of what people look up? In a word, it's crap.
I should have added that the same happens whatever list I select to look up from.

Book ordering comparison between spreadsheets for existing catalogue of a Library

I have recently asked this question on Google's spreadsheet page.
I have a significant data comparison problem I would like to solve. It relates to purchasing books for a library. We have a catalogue of over 11,000 books. When we order new books we need to compare our proposed purchases to the current stock. Currently we can only compare them to our catalogue manually, very laboriously, book by book.
We need to do 3 things to make our life easier:
1 Easily clean out bad data/characters in the ISBNs. These are either spaces, hyphens (-) or periods/full stops (.). A simple formula to run over all ISBN fields would be great.
2 Compare data between one spreadsheet with 11,000 books in it (current library stock), a second with up to 1,000 books in it (currently on order) and finally the third, currently active one (about to be ordered) with 50 to 200 books listed in it.
All spreadsheets use the same column configuration as below
Library orders
Title | Author | Publisher | ISBN (long version) | US$ | UKgpd | HK$ | Other$ | P/O no. | Date ordered
UNNATURAL SELECTION | MARA HVISTENDAHL | Public Affairs Publishing; Reprint edition (May 1, 2012) | 978610391511 | | | | | |
Finally, the output of these comparisons should quickly and easily identify on which lines we have matches and what type of match each one is: Author only, Author and Title, or Author, Title and ISBN, etc., for all the possible combinations. To make this easier, assume spreadsheet 1 is an unalterable master table, with spreadsheet 2 similar. It is really only on spreadsheet 3 that we need to be clear whether we are starting to reorder materials.
If it is possible to have these as different sheets in a workbook it would be ideal. The only additional feature is that any scripts that run need to be able to cope with spreadsheet 1 increasing in size as new acquisitions arrive and are included. Both spreadsheets 2 and 3 will vary (increase and decrease) as the ordering process proceeds.
Finally the absolute ideal would be for this comparison process to be instant (live) and ongoing as data is included.
If anyone would like to take this on 3 Library staff will be eternally grateful.
regards
Nick
This would be very much easier had you one sheet rather than three (simply add a column to each existing sheet to show whether in stock, on order or to be ordered – three individual letters would be sufficient, then append each of the smaller two files to the largest). Then for example you could apply Conditional Formatting to highlight duplicates one column at a time (Author, Title etc). Apart from the initial data cleansing it would mean in the future switching ‘between sheets’ would merely involve changing a one-letter flag. Filtering would allow you and your colleagues to appear to have three separate sheets and if anyone asks for a particular Title the search would be one-time, not in triplicate.
Also, http://www.microsoft.com/en-gb/download/details.aspx?id=15011 may be of interest, as may =SUBSTITUTE. And with data validation you could prevent entry of a new ISBN that is already in your list.
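To make the cleaning and matching steps concrete, here is a small Python sketch of the logic (outside any spreadsheet). It strips spaces, hyphens and periods from ISBNs, then reports which of Author, Title and ISBN agree between a proposed order and each catalogue row. The function names and the sample book data are invented for illustration, and text fields are compared only after upper-casing and collapsing whitespace, which is an assumption about what should count as a match.

```python
import re

FIELDS = ("author", "title", "isbn")


def clean_isbn(raw):
    """Remove spaces, periods and hyphens from an ISBN string."""
    return re.sub(r"[\s.\-]", "", raw)


def norm_text(text):
    """Upper-case and collapse runs of whitespace so near-identical text compares equal."""
    return " ".join(text.upper().split())


def normalise(book):
    return {
        "author": norm_text(book["author"]),
        "title": norm_text(book["title"]),
        "isbn": clean_isbn(book["isbn"]),
    }


def find_matches(candidate, catalogue):
    """Return (row number, matched field names) for rows sharing at least one field."""
    cand = normalise(candidate)
    results = []
    for row, book in enumerate(catalogue, start=1):
        existing = normalise(book)
        matched = [f for f in FIELDS if cand[f] and cand[f] == existing[f]]
        if matched:
            results.append((row, matched))
    return results


# Made-up sample data, not real catalogue entries.
catalogue = [
    {"title": "Example Book", "author": "A. Writer", "isbn": "000-0-00 000.000 0"},
    {"title": "Another Book", "author": "A. Writer", "isbn": "111-1-11-111111-1"},
]
proposed = {"title": "EXAMPLE  BOOK", "author": "a. writer", "isbn": "0000000000000"}

for row, fields in find_matches(proposed, catalogue):
    print(f"row {row}: matches on {', '.join(fields)}")
```

The same idea carries over to a single combined sheet with a status flag: normalise once, then compare on the cleaned columns.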

Can't get INDEX/MATCH functions to do what I need?

It is quite an in-depth Excel sheet (to me), so here is a link to it: https://dl.dropboxusercontent.com/u/19122839/Movies.xlsm
On the Filters sheet, I have a search feature. This allows you to put in different genres, years, etc. and will pull up results.
The genre part does not seem to be working correctly for some reason.
In the movie_genres sheet, there is a Genre Equals and Genre Count column that seem to be marking the information correctly, but when you go to the movies sheet, the Matches Genre column does not. I use this function:
=INDEX(Genres[Genre Count],MATCH(Movies[[#This Row],[ID]],Genres[ID],0))
Which, to me, should pull the Genre Count, but in the case where there is more than one genre (I used Blank Check as an example in this case), it doesn't mark it as a 1. How can I correct this?
For example, if you add the Comedy as a second genre, it pulls up more results than if you only have Family. I think I just need a fresh pair of eyes looking at this and it is probably something dumb, but any help would be great.
I believe I need to make it so that the index/match function I use in Movies[Matches Genre] will work as long as there is a 1 in Genres[Genre Count] for that ID. It only seems to work if there is a 1 in the first instance of the ID.
EDIT: I have added in a COUNT feature to better explain what I am talking about. With only Family as a genre, it shows there are 10 results, but when you add Comedy as a second genre, you get 40 results. This number should never go up as you add genres.
Perhaps try using SUMIF like this
=SUMIF(Genres[ID],[#ID],Genres[Genre Count])
If one movie might have several 1s but you only want 1 maximum then change to
=IF(SUMIF(Genres[ID],[#ID],Genres[Genre Count])>0,1,0)
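The difference between the two approaches can be seen in a small Python sketch. MATCH with match type 0 returns the first row whose ID matches, so INDEX only ever sees that row's Genre Count, whereas SUMIF adds up every row for the ID. The rows below are invented sample data, and index_match and sumif_flag are hypothetical stand-ins for the formulas, not Excel functions.

```python
# Each tuple is (ID, Genre Count) from a hypothetical Genres table.
# Movie 1 has two genre rows; only the second one matches the search.
genres = [(1, 0), (1, 1), (2, 1), (3, 0)]


def index_match(rows, movie_id):
    """Mimic INDEX/MATCH with exact match: use the first row with this ID."""
    for row_id, count in rows:
        if row_id == movie_id:
            return count
    return None  # Excel would show #N/A here


def sumif_flag(rows, movie_id):
    """Mimic IF(SUMIF(...)>0,1,0): 1 if any row for this ID has a count."""
    total = sum(count for row_id, count in rows if row_id == movie_id)
    return 1 if total > 0 else 0


print(index_match(genres, 1))  # looks only at the first row for ID 1
print(sumif_flag(genres, 1))   # considers every row for ID 1
```

With this data the first call misses the matching genre because it sits on the second row for that ID, which is the symptom described in the question.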
