Finding an observation's value given values for other variables

A simplified version of my data would look like:
Own_Name Own_Position Year Boss_Name Boss_Position
John Director 2017 Tess Managing Director
Tess Lead Director 2017 Jim CEO
John Lead Director 2018 Jim CEO
Tess CFO 2018 Jim CEO
All data concerning Own_Name, Own_Position and Year is present (so Jim and any others would get their own rows with Own_Name, Boss_Name, etc.), but there are many entries where the Boss_Position (but not the Boss_Name) is missing.
Thus, I'm trying to find what the Boss_Position will be, given the Boss_Name and Year. I can do this for a single observation (keep only the rows where Own_Name matches Boss_Name and Year matches, then use the corresponding Own_Position), but I am not sure of the best way to do it when looping through all the observations with a missing Boss_Position, since using keep would seem to be very destructive and time-consuming.
Ideally, the code would look something like
replace Boss_Position = Own_Position[YearNameMatcher(Boss_Name Year)] if missing(Boss_Position)
where YearNameMatcher is the function that does what I'm asking for, but I am not sure how to best proceed.
I'm also new to Stata, so I may not be aware of more obvious solutions, though I have tried searching without success.

One solution is to generate a separate and temporary dataset with the unique values of variables Boss_Name-Year and then merge this dataset with the original data. You can try this code with your data:
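* save the full dataset so it can be restored later
snapshot save
* keep only rows where the boss's position is known
keep if Boss_Position != ""
drop Own*
* one row per boss and year
duplicates drop Year Boss_Name, force
tempfile boss
save `boss'
* "1" assumes this is the first snapshot taken in the session
snapshot restore 1
drop Boss_Position
* bring the known positions back in, matched on boss and year
merge m:1 Boss_Name Year using `boss'
snapshot erase _all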
This code assumes that each Boss_Name and Year pair corresponds to a single Boss_Position. With this approach you don't need to loop through all the missing observations.
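For comparison, the lookup the question describes (find the row where Own_Name equals Boss_Name in the same Year and take its Own_Position) can be sketched outside Stata. This is only a minimal Python illustration of the idea, not a Stata solution: the helper name fill_boss_positions and the list-of-dictionaries layout are hypothetical, and the rows are the sample from the question with two Boss_Position values blanked out.

```python
def fill_boss_positions(rows):
    """Fill empty Boss_Position values from the boss's own row in the same year."""
    # Map (name, year) -> that person's own position.
    own_position = {(r["Own_Name"], r["Year"]): r["Own_Position"] for r in rows}
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if not r["Boss_Position"]:
            # Stays empty when the boss has no row of their own for that year.
            r["Boss_Position"] = own_position.get((r["Boss_Name"], r["Year"]), "")
        filled.append(r)
    return filled


# Sample rows from the question, with two boss positions blanked (assumption).
rows = [
    {"Own_Name": "John", "Own_Position": "Director", "Year": 2017,
     "Boss_Name": "Tess", "Boss_Position": ""},
    {"Own_Name": "Tess", "Own_Position": "Lead Director", "Year": 2017,
     "Boss_Name": "Jim", "Boss_Position": "CEO"},
    {"Own_Name": "John", "Own_Position": "Lead Director", "Year": 2018,
     "Boss_Name": "Jim", "Boss_Position": ""},
    {"Own_Name": "Tess", "Own_Position": "CFO", "Year": 2018,
     "Boss_Name": "Jim", "Boss_Position": "CEO"},
]
result = fill_boss_positions(rows)
```

Building the (name, year) dictionary once plays the same role as the temporary file in the merge above: every missing value is resolved by a single keyed lookup rather than by filtering the data once per observation.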

Related

Date Entity Parsing Incorrect Year for Incomplete Dates

I have a dataset (df_test) consisting of several news articles (Text_4). Using SpaCy, I've extracted the 'DATE' entities. For those, I want to see whether they are in the future or in the past relative to the article's publication date (RP_DateFormatted), in order to identify news articles that reference future events such as product launches.
My current code is
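from datetime import datetime
from itertools import groupby
import dateparser
for index, row in df_test.iterrows():
    doc = nlp(row.Text_4)
    # group the extracted entities by label, e.g. entities['DATE']
    entities = {key: list(g) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}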
... some other steps ... then:
    ListDATE3 = [dateparser.parse(replace_all((i.text), od), languages=['en'],
                                  settings={'RELATIVE_BASE': datetime.strptime(row.RP_DateFormatted, '%Y-%m-%d'),
                                            'PREFER_DAY_OF_MONTH': 'last',
                                            'PREFER_DATES_FROM': 'future'}) for i in entities['DATE']]
    df_test.PY_Entities_DatesParsed[index] = ListDATE3
I have trouble with the line 'PREFER_DATES_FROM': 'future', for example:
Article was written on August 15th 2005 but no year is given in the text. SpaCy extracts "Aug 15" as Date. The dateparser sets the year to 2006 (because it is in the future). Consequently, I would then believe that the news article talks about the future - which it does not.
Setting 'PREFER_DATES_FROM': 'past' would also not help me in a case when an event is described that happens in February (without a year given in the text). This is likely to be next February but the dateparser would set it to this year's February.
Is there a way to add an if statement to the settings or to create a new function based on the dateparser? Please note that each news article can have multiple dates (entities['DATE'] is a list for each row in my dataframe).
I am using Python 3.8

I don't think you're going to be able to solve this just with options to DateParser. That interprets dates mechanically given a string, but in order to tell whether these dates are in the past or future you're using knowledge of the surrounding words and context of the article ("at next February's festival...").
This is a pretty hard thing to get right in an automated system. In NLP research this is referred to as "grounding", and includes related problems, like telling who "President of the United States" refers to (what year was it?), or what color "red" is (is it red like a stop sign, or red like red hair?).
What I would do is start by using rule-based techniques to identify whether dates are in the past or future before passing them to date parser. So take some words from around date entities, and if "last" is there then it's in the past, if "next" is there then it's in the future, that sort of thing. See how well it does. (You might think you could just take words before the date entity, but you can also have "February last year was really cold" or something.)
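A rough sketch of that rule-based first pass, using only the standard library, might look like the following. The cue-word lists, the window size and the function name guess_tense are all assumptions chosen for illustration; the character offsets are meant to be the start and end of a date entity in the article text (spaCy spans expose these as start_char and end_char).

```python
import re

# Hypothetical cue words; a real list would need tuning on actual articles.
PAST_CUES = {"last", "previous", "ago", "earlier"}
FUTURE_CUES = {"next", "upcoming", "coming"}


def guess_tense(text, start, end, window=3):
    """Guess 'past', 'future' or None from words near text[start:end]."""
    # Look on both sides, since the cue can follow the date ("February last year").
    before = re.findall(r"[a-z']+", text[:start].lower())[-window:]
    after = re.findall(r"[a-z']+", text[end:].lower())[:window]
    nearby = set(before) | set(after)
    has_past = bool(nearby & PAST_CUES)
    has_future = bool(nearby & FUTURE_CUES)
    if has_past == has_future:
        return None  # no cue found, or conflicting cues
    return "past" if has_past else "future"


sentence = "The product will launch at next February's festival."
span_start = sentence.index("February")
span_end = span_start + len("February")
print(guess_tense(sentence, span_start, span_end))
```

When a cue is found, the guess could be used to pick the 'PREFER_DATES_FROM' value for that one entity before calling dateparser.parse; when it returns None, some fallback is still needed, for example choosing the candidate year closest to the publication date.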
If you want to try a statistical system after that, you could look at using the spancat in spaCy with different kinds of context windows to classify dates as "future" or "past".

How to identify a string source using a standalone list of identifiers

I have a really quick question about which function I need to use for my current conundrum:
I'm building a tool that automatically identifies a retailer from the first 5 digits of the account number (their "code" so to speak).
To illustrate in account number "1111122222" the "11111" will be the retailer code and the "22222" will be the customer's unique ID.
Each retailer can have several dozen unique codes so I have a separate sheet with a code table in it. (Separated because it will be split off into a standalone workbook later on)
Codes table looks like this:
Bobs Burgers | Johns Chicken | Ali's Shwarma
12345 | 56784 |77774
45698 | 33333 |44444
12398 | 99999 |55555
As we receive data in blocks of 20~30 accounts at a time, all I'd like this thing to do is check the accounts against the code list and output the name of the retailer. And maybe yell "conflict, abort and run for the border!" if more than one retailer is identified :)
Apologies for the stupid question, but by this point I'm on my ninth cup of coffee and I just can't remember what functions I need to use.
P.S. The reason why I'm making my life difficult and not using a standard lookup table is because higher ups want no manual involvement from the end users with the data, so it's all gotta be identified and forwarded to relevant parties without them touching the data or destinations. I've already got the Importing automated and have the distribution ready to go, just the middle part that sent me for a loop. I'll post the full code of the tool once it's complete in case anyone needs something like this.

Apologies for the brain fart - I figured out the solution. I was trying to set the codes up as a table with the Retailer as the header and each retailer in its own column, which just wasn't working in any way. My less than elegant solution was to reformat the codebook as a "code:Retailer" table, which allowed a VLookup to actually pull the data properly, and to have the codes extracted via a =LEFT(TEXT(cell),5) function inside a hidden buffer sheet in the workbook rather than via VBA.
I then set up a pivot table in the hidden sheet that gave me a nice percentage value to work off of and set up a data refresh gate at every step in the macros.
Whole thing is a bit slow and will require a bit of manual installation on everyone's PCs but it works now.
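For anyone wanting the same logic outside Excel, the "code:Retailer" lookup plus the conflict check can be sketched in a few lines of Python. This is only an illustration of the idea, not the workbook's actual implementation: the codes come from the table in the question, while the function name identify_retailer and the sample account numbers are made up.

```python
# Codebook in "code: retailer" form, taken from the table in the question.
CODEBOOK = {
    "12345": "Bobs Burgers", "45698": "Bobs Burgers", "12398": "Bobs Burgers",
    "56784": "Johns Chicken", "33333": "Johns Chicken", "99999": "Johns Chicken",
    "77774": "Ali's Shwarma", "44444": "Ali's Shwarma", "55555": "Ali's Shwarma",
}


def identify_retailer(accounts, codebook=CODEBOOK):
    """Return the single retailer for a block of accounts, or raise on a problem."""
    retailers = set()
    for account in accounts:
        code = str(account)[:5]  # first five characters, like LEFT(..., 5)
        if code not in codebook:
            raise KeyError(f"unknown retailer code: {code}")
        retailers.add(codebook[code])
    if len(retailers) != 1:
        # More than one retailer in the block (or an empty block).
        raise ValueError(f"expected exactly one retailer, found {sorted(retailers)}")
    return retailers.pop()


# Hypothetical account numbers: 5-digit retailer code + 5-digit customer ID.
block = ["1234522222", "4569800001", "1239877777"]
print(identify_retailer(block))
```

Keeping the account numbers as text avoids losing leading zeros before the first five characters are taken.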
P.S. Thanks #Cyril for reminding me of Index - made another one of my projects ten times easier!

Ranking text to see if they got a better health plan

I need to write code that will evaluate whether a member's plan changed from the previous year and put "Moved Up" in a column to the right if the plan changed. First the code needs to make sure that the rows have the same member ID and that it is a new year. Here is what it looks like in Excel:
000880093121 2015 Bronze 60 HMO
000880093121 2016 Silver HMO
My first thought was to use nested IF statements but I do not know how to tell excel that the Silver plan is a better plan than Bronze. There is a total of five different plans that members can have.
=IF(A3=A2,IF(B3>B2,"Moved Up"))
This will successfully compare the member IDs and make sure that it is a new year. I just do not understand how to give text values a numeric value so that they can be compared. Also, there are over 30k rows that I will be applying it to.
The output that I am looking for should be this:
000880093121 2015 Bronze 60 HMO -
000880093121 2016 Silver HMO Moved Up
Thanks for the help, much appreciated.

Put the different plans in a table sorted from "least-good" (top) to "best" (bottom), and name that range (e.g.) "planTable".
Then you can do this:
=IF(AND(A3=A2,B3>B2),
    IF(MATCH(C3,planTable,0)>MATCH(C2,planTable,0),"Moved Up",""),
    "")

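The MATCH approach works because a plan's position in the sorted table acts as its numeric rank. The same idea in Python, as a minimal sketch: only "Bronze 60 HMO" and "Silver HMO" come from the question, the remaining plan names are placeholders, and flag_moves is a hypothetical helper that expects rows already sorted by member ID and year.

```python
# Plans from worst to best; list position plays the role of MATCH(...).
# Only the first two names appear in the question, the rest are placeholders.
PLAN_ORDER = ["Bronze 60 HMO", "Silver HMO", "Placeholder Plan 3",
              "Placeholder Plan 4", "Placeholder Plan 5"]
RANK = {plan: position for position, plan in enumerate(PLAN_ORDER)}


def flag_moves(rows):
    """rows: (member_id, year, plan) tuples sorted by member_id, then year."""
    flags = []
    previous = None
    for member_id, year, plan in rows:
        moved_up = (
            previous is not None
            and previous[0] == member_id          # same member
            and year > previous[1]                # later year
            and RANK[plan] > RANK[previous[2]]    # better-ranked plan
        )
        flags.append("Moved Up" if moved_up else "")
        previous = (member_id, year, plan)
    return flags


rows = [
    ("000880093121", 2015, "Bronze 60 HMO"),
    ("000880093121", 2016, "Silver HMO"),
]
print(flag_moves(rows))
```
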
Building a customized, fuzzy and multiple Vlookup

Ok so, twice a month I receive a large file of about 100 rows, which contains 4 columns:
Building name - value - county - state
I have to complete 2 other columns based on a master list that has thousands of entries.
I want to produce something very similar to this fabulous add-in (http://www.microsoft.com/en-us/download/details.aspx?id=15011), but a bit simpler and that I could use at work without problems.
What I need to do is the following:
In order to match my input with the master file, I know the county and state must match, but then, the building names can change a bit in each file for the same building (ie "John Miller #34" can be "Miller, John 34 A"), and that the values may vary but not too much.
Based on that, I want to bring from the master to my file, all the entries that may match each of my rows, filtering by County and State first, and then by similarity in name and value.
Could you please share your thoughts on how you´d approach this?
I know this is not a simple thing, but anything may help!

You could also use wildcards to try and match on the primary identifier within the name. from your example, that might be "Miller", for example.

Unfortunately for you, the vlookup "fuzzy logic" is nowhere near reliable for your purpose (see the comment on my answer below for details), and you won't have any indicator as to whether the returned result is accurate or not.
It's possible to get 100% of what you want through some heavy coding in a user-defined function, but this is probably well beyond your comfort zone.
A clunky solution, although somewhat easy to explain and adopt, is to create an "identity column" for every unique scenario that can occur. So, for example:
Then you can import your master sheet and add the same identity column to the left, and perform your vlookup. When a new configuration is added you can just add that to the master list and it will populate in your imported file in future instances.
That said, if you are interested in learning, there have been many people who have walked in your shoes and felt your pain. You may want to indulge in this:
http://www.mrexcel.com/forum/excel-questions/195635-fuzzy-matching-new-version-plus-explanation.html
Because what you are truly requesting is an algorithm. It's not a simple thing, but it's very possible. And if you take the time to learn you not only solve your immediate problem, but make yourself marketable as an Excel wiz. Good luck!
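To give a feel for what such an algorithm involves, here is a small Python sketch of the approach described in the question: filter the master list by county and state first, then rank what is left by name similarity and keep only values that are reasonably close. It uses difflib from the standard library rather than the method in the linked thread, and the thresholds, field names and sample rows are all assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and sort the tokens so word order is ignored."""
    return " ".join(sorted(re.findall(r"[a-z0-9]+", name.lower())))


def candidates(row, master, min_name_score=0.6, max_value_diff=0.2):
    """Return master rows in the same county/state, best name match first."""
    scored = []
    for m in master:
        if (m["county"], m["state"]) != (row["county"], row["state"]):
            continue  # county and state must match exactly
        score = SequenceMatcher(None, normalize(row["name"]), normalize(m["name"])).ratio()
        value_diff = abs(m["value"] - row["value"]) / max(abs(row["value"]), 1)
        if score >= min_name_score and value_diff <= max_value_diff:
            scored.append((score, m))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for score, m in scored]


# Hypothetical data built around the name example from the question.
master = [
    {"name": "Miller, John 34 A", "value": 1100, "county": "Kent", "state": "DE"},
    {"name": "Miller, John 34 A", "value": 1100, "county": "Essex", "state": "NJ"},
    {"name": "Oak Plaza", "value": 1000, "county": "Kent", "state": "DE"},
]
row = {"name": "John Miller #34", "value": 1000, "county": "Kent", "state": "DE"}
matches = candidates(row, master)
```

Returning every candidate above the threshold, instead of a single best guess, leaves the final decision to a person, which addresses the concern above about not knowing whether a fuzzy result is accurate.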

Book ordering comparison between spreadsheets for existing catalogue of a Library

I have recently asked this question of google's spreadsheet page.
I have a significant data comparison problem I would like to solve. It relates to purchasing books for a Library. We have a catalogue of over 11,000 books. When we order new books we need to compare our proposed purchases to the current stock. Currently we can only compare them to our catalogue manually, very laboriously, book by book.
We need to do 3 things to make our life easier -
1 easily clean out bad data/characters in the ISBN's - these are either spaces, - (hyphen's) or . (period mark or full stops). A simple formula to run over all ISBN fields would be great.
2 I need to compare data between 1 spreadsheet with 11,000 books in it (current library stock), a second with up to 1000 books in it (currently on order) and finally the third currently active one (about to be ordered) with 50 to 200 books listed in it.
All spreadsheets use the same column configuration as below
Library orders
Title Author Publisher ISBN (long version) US$ UKgpd HK$ Other$ P/O no. Date ordered
UNNATURAL SELECTION MARA HVISTENDAHL Public Affairs Publishing; Reprint edition (May 1, 2012) 978610391511
Finally, the output of these comparisons should quickly and easily identify on which lines we have matches and what type of match it is: Author only, Author and Title, or Author, Title and ISBN, etc., for all the possible combinations. To make this easier, assume spreadsheet 1 is an unalterable master table, with spreadsheet 2 similar. It is really only on spreadsheet 3 that we need to be clear whether we are starting to reorder materials.
If it is possible to have these as different sheets in a workbook it would be ideal. The only additional feature is that any scripts that run need to be able to cope with spreadsheet 1 increasing in size as new acquisitions arrive and are included. Both spreadsheets 2 and 3 will vary (increase and decrease) as the ordering process proceeds.
Finally the absolute ideal would be for this comparison process to be instant (live) and ongoing as data is included.
If anyone would like to take this on 3 Library staff will be eternally grateful.
regards
Nick

This would be very much easier had you one sheet rather than three (simply add a column to each existing sheet to show whether in stock, on order or to be ordered – three individual letters would be sufficient, then append each of the smaller two files to the largest). Then for example you could apply Conditional Formatting to highlight duplicates one column at a time (Author, Title etc). Apart from the initial data cleansing it would mean in the future switching ‘between sheets’ would merely involve changing a one-letter flag. Filtering would allow you and your colleagues to appear to have three separate sheets and if anyone asks for a particular Title the search would be one-time, not in triplicate.
Also, http://www.microsoft.com/en-gb/download/details.aspx?id=15011 may be of interest, as may =SUBSTITUTE. And with data validation you would prevent entry of a new ISBN that is already in your list.
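The cleaning and comparison steps themselves are simple enough to sketch in Python, in case a script turns out to be easier than formulas. This is only an illustration under assumptions: the function names are hypothetical, the records are plain dictionaries keyed by the column names from the question, and matching is exact after trimming and lowercasing.

```python
import re


def clean_isbn(raw):
    """Remove spaces, hyphens and full stops from an ISBN."""
    return re.sub(r"[\s.\-]", "", str(raw))


def match_type(order, stock):
    """Name the fields on which two book records agree, e.g. 'Author, Title'."""
    matched = []
    if order["Author"].strip().lower() == stock["Author"].strip().lower():
        matched.append("Author")
    if order["Title"].strip().lower() == stock["Title"].strip().lower():
        matched.append("Title")
    order_isbn = clean_isbn(order["ISBN"])
    # An empty ISBN on both sides should not count as a match.
    if order_isbn and order_isbn == clean_isbn(stock["ISBN"]):
        matched.append("ISBN")
    return ", ".join(matched)


# The sample record from the question, with separators added to the order's ISBN.
stock = {"Title": "UNNATURAL SELECTION", "Author": "MARA HVISTENDAHL",
         "ISBN": "978610391511"}
order = {"Title": "Unnatural Selection", "Author": "Mara Hvistendahl",
         "ISBN": "978-6103 915.11"}
print(match_type(order, stock))
```

Running match_type for each proposed order against every stock and on-order record gives the "type of match" column the question asks for; an empty result means no field matched.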
