Date Entity Parsing Incorrect Year for Incomplete Dates - python-3.x

I have a dataset (df_test) containing several news articles (Text_4). Using spaCy, I've extracted the 'DATE' entities. For each of them I want to determine whether it lies in the future or in the past relative to the article's publication date (RP_DateFormatted), so that I can identify news articles that reference future events such as product launches.
My current code is
for index, row in df_test.iterrows():
    doc = nlp(row.Text_4)
    entities = {key: list(g) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}
    # ... some other steps ... then:
    ListDATE3 = [dateparser.parse(replace_all((i.text), od), languages=['en'],
                                  settings={'RELATIVE_BASE': datetime.strptime(row.RP_DateFormatted, '%Y-%m-%d'),
                                            'PREFER_DAY_OF_MONTH': 'last',
                                            'PREFER_DATES_FROM': 'future'}) for i in entities['DATE']]
    df_test.PY_Entities_DatesParsed[index] = ListDATE3
I have trouble with the line 'PREFER_DATES_FROM': 'future', for example:
The article was written on August 15th 2005, but no year is given in the text. spaCy extracts "Aug 15" as a DATE entity. dateparser then sets the year to 2006 (because it prefers dates in the future). Consequently, I would conclude that the news article talks about the future, which it does not.
Setting 'PREFER_DATES_FROM': 'past' would not help either when the text describes an event that happens in February (without a year given). This is likely to be next February, but dateparser would set it to this year's February.
Is there a way to add an if statement to the settings or to create a new function based on dateparser? Please note that each news article can have multiple dates (entities['DATE'] is a list for each row in my dataframe).
I am using Python 3.8.

I don't think you're going to be able to solve this with dateparser options alone. dateparser interprets a date string mechanically, but to tell whether these dates are in the past or future you are relying on the surrounding words and the context of the article ("at next February's festival...").
This is a pretty hard thing to get right in an automated system. In NLP research this is referred to as "grounding", and includes related problems, like telling who "President of the United States" refers to (what year was it?), or what color "red" is (is it red like a stop sign, or red like red hair?).
What I would do is start by using rule-based techniques to identify whether dates are in the past or future before passing them to date parser. So take some words from around date entities, and if "last" is there then it's in the past, if "next" is there then it's in the future, that sort of thing. See how well it does. (You might think you could just take words before the date entity, but you can also have "February last year was really cold" or something.)
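As a rough illustration of that rule-based idea, here is a minimal sketch using only the standard library. The cue-word lists, the window size, and the function name classify_date_context are all assumptions made up for this example, not part of spaCy or dateparser. It looks at a few words on both sides of the date, which also covers cases like "February last year".

```python
import re

# Hypothetical cue words; extend these after inspecting real articles.
PAST_CUES = {"last", "previous", "ago", "earlier"}
FUTURE_CUES = {"next", "upcoming", "coming", "following"}


def classify_date_context(text, start, end, window=4):
    """Guess whether the date at text[start:end] refers to the past or future.

    Counts cue words among the `window` words before and after the date span
    and returns "past", "future", or "unknown".
    """
    before = re.findall(r"[A-Za-z']+", text[:start].lower())[-window:]
    after = re.findall(r"[A-Za-z']+", text[end:].lower())[:window]
    context = before + after
    past_hits = sum(word in PAST_CUES for word in context)
    future_hits = sum(word in FUTURE_CUES for word in context)
    if past_hits > future_hits:
        return "past"
    if future_hits > past_hits:
        return "future"
    return "unknown"


sentence = "The festival returns next February with a bigger stage."
start = sentence.index("February")
print(classify_date_context(sentence, start, start + len("February")))
```

With spaCy, the character offsets could come from each entity's start_char and end_char attributes, and the result could then decide which 'PREFER_DATES_FROM' value to pass to dateparser for that entity. Dates that come back as "unknown" would still need a fallback rule.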
If you want to try a statistical system after that, you could look at using the spancat in spaCy with different kinds of context windows to classify dates as "future" or "past".


Finding an observation's value given values for other variables

A simplified version of my data would look like:
Own_Name  Own_Position   Year  Boss_Name  Boss_Position
John      Director       2017  Tess       Managing Director
Tess      Lead Director  2017  Jim        CEO
John      Lead Director  2018  Jim        CEO
Tess      CFO            2018  Jim        CEO
All data concerning Own_Name, Own_Position and Year is present (so Jim and any others would get their own rows with Own_Name and Boss_Name etc.), but there are many entries where the Boss_Position (but not Boss_Name) is missing.
Thus, I'm trying to find what the Boss_Position will be, given the Boss_Name and Year. I can do this if I need to pull just one observation (just keep the relevant data where Own_Name matches Boss_Name and Year matches, and use the corresponding Own_Position), but I am not sure what the best way to do this would be where I'm looping through all the missing observations for Boss_Name, since using keep would seem to be very destructive and time consuming.
Ideally, the code would look something like
replace Boss_Position = Own_Position[YearNameMatcher(Boss_Name Year)] if missing(Boss_Position)
where YearNameMatcher is the function that does what I'm asking for, but I am not sure how to best proceed.
I'm also new to Stata, so I may not be aware of more obvious solutions, though I have tried searching without success.
One solution is to generate a separate and temporary dataset with the unique values of variables Boss_Name-Year and then merge this dataset with the original data. You can try this code with your data:
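* Save the full dataset so it can be restored later
snapshot save
* Keep only the rows where the boss's position is known
keep if Boss_Position!=""
drop Own*
* Keep one row per Boss_Name-Year pair
duplicates drop Year Boss_Name, force
tempfile boss
save `boss'
* Restore the full data and merge the known positions back in
snapshot restore 1
drop Boss_Position
merge m:1 Boss_Name Year using `boss'
snapshot erase _all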
This code assumes that each Boss_Name and Year pair corresponds to a single Boss_Position. With this approach you don't need to loop through all the missing observations.
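For readers who find the lookup logic easier to follow outside Stata, the same idea can be sketched in plain Python. This is only an illustration: the function name fill_boss_positions and the list-of-dicts layout are assumptions made for the example, and it is not a replacement for the merge above.

```python
def fill_boss_positions(rows):
    """Fill empty Boss_Position values from rows sharing Boss_Name and Year."""
    # Build the lookup from rows where the boss's position is known.
    known = {}
    for row in rows:
        if row["Boss_Position"]:
            known.setdefault((row["Boss_Name"], row["Year"]), row["Boss_Position"])
    # Fill the gaps; rows without a matching pair stay empty.
    for row in rows:
        if not row["Boss_Position"]:
            row["Boss_Position"] = known.get((row["Boss_Name"], row["Year"]), "")
    return rows


data = [
    {"Own_Name": "Tess", "Year": 2017, "Boss_Name": "Jim", "Boss_Position": "CEO"},
    {"Own_Name": "Ann", "Year": 2017, "Boss_Name": "Jim", "Boss_Position": ""},
]
print(fill_boss_positions(data)[1]["Boss_Position"])
```

As with the Stata version, the first known position for a Boss_Name-Year pair wins, so the result is only reliable if those pairs are consistent in the data.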

excel comparison of two datasets

I am having a little difficulty conceptually understanding how to complete a task. Please forgive the context, but it will help.
I have a set of timetable information that contains the following
Date_Start (mm/dd/yyyy hh:mm:ss)
Date_End (mm/dd/yyyy hh:mm:ss)
Activity_location (String, code, example: B/B/012)
Other information that is not important
We have performed an audit (people going and doing a manual check on room occupancy). This audit was done using a google form which has now produced a spreadsheet. Unfortunately this doesn't quite match the format of the other one and instead contains:
Date
Time
B/B/012
B/B/011
... etc.
The problem is that each room is an individual column, regardless of if it was audited, which produces .... a lot of columns. I have already combined the Date and Time from the second dataset to produce a comparable datetime.
My task is to compare the information: I have the timetable data (what should have happened) and the audit information (what did happen), and I need to find any discrepancies.
I am just having a little difficulty understanding how I might get these datasets into a format where I can compare them. I would really appreciate any help you might be able to give.
If you are using Excel and have the dates (stripped of unneeded characters) in a column, you simply need to tell Excel how to interpret these values as dates. For instance, if you have the value 1/3/16, Excel may interpret it as March 1st, 2016 or Jan 3rd, 2016.
To tell Excel how to interpret dates, you select the column (all cells in the column having values), right-click and select Format cells.... There, you can tell Excel that the value should be read as dd/mm/yy or mm/dd/yy.
Once you have Excel fully aware of the meaning of those dates, you can simply compare them (e.g. if(B3>G3... will check if the date in B3 is later than that in G3).
Hope this assists you to proceed.
UPDATE
Based on the exchange through comments, here is my final answer.
If you need to establish a relation (say between spreadsheet "A" and spreadsheet "B") where the relation between the columns/rows of the two sheets is not simply one-to-one and, even worse, the one-to-many correlation is not predictable (meaning that in one case you have a one-to-one, in the next a one-to-4 and in the next a one-to-17), the only solution is either pivoting one of the tables or writing some macros.
I don't see any other way out of this. Sorry.
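To show what "pivoting one of the tables" could mean here, this is a minimal Python sketch that turns the one-column-per-room audit sheet into one record per room and timestamp, which is the shape of the timetable data. The column name "DateTime", the cell values, and the function name unpivot_audit are assumptions for the example; the real audit export may look different.

```python
def unpivot_audit(rows, id_column="DateTime"):
    """Turn one-column-per-room audit rows into (datetime, room, value) records."""
    records = []
    for row in rows:
        for column, value in row.items():
            if column == id_column or value in ("", None):
                continue  # skip the timestamp itself and rooms that were not audited
            records.append((row[id_column], column, value))
    return records


audit = [
    {"DateTime": "03/01/2016 09:00:00", "B/B/012": "occupied", "B/B/011": ""},
]
print(unpivot_audit(audit))
```

Once both datasets are in this long format, each audit record can be checked against the timetable rows for the same room whose Date_Start and Date_End bracket the audit time.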

categorizing documents depending on their date fields

I've been stuck with an annoying problem for a while that I can't fix. I have a field in all of the documents that represents time: a date in the format dd.mm.yyyy.
What I'm trying to do is categorise them: show the documents that are due today, the documents that are due within the next 7 days, etc.
Here's the code (formula for the categorized field) that I have:
#If(#Today > pi_due_date; "Late docs"; #Today=pi_due_dat; "Todays docs";((pi_due_date - #Now)/86400)>0 &((pi_due_date - #Now)/86400)<7;"This weeks docs";"Future docs")
Everything was fine until today (after 12:00 PM) I noticed that this part: #Today=pi_due_dat; "Todays docs"; does not work, it does not return the document in the "Todays docs" category. Pretty much the same thing is happening to all the other categories and I don't understand what is causing this problem.
pi_due_dat is missing the 'e' at the end.
Assuming it is more than that, though, you'll want to make sure that you are only comparing the dates and not a date/time.
Try #Date(pi_due_date) = #Today instead.
I would like to point out that using #Today or #Now in a view (selection criteria or column value) will create serious performance issues, as the view will be constantly re-indexed. It will affect all applications on that server as well.
You may want to rethink the design, perhaps by having a scheduled nightly agent that sets a flag on the documents to indicate how they are being categorized.
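The date-only comparison is easy to get wrong in any language, so here is the same categorisation written as a small Python sketch, purely as an illustration of the logic (it is not formula language, and the function name categorize is made up for the example). The time part is dropped before comparing, so a due date with a time later today still counts as today.

```python
from datetime import date, datetime


def categorize(due, today):
    """Categorise a due date relative to `today`, ignoring any time of day."""
    # datetime is a subclass of date, so strip the time part first.
    due_date = due.date() if isinstance(due, datetime) else due
    days_left = (due_date - today).days
    if days_left < 0:
        return "Late docs"
    if days_left == 0:
        return "Todays docs"
    if days_left < 7:
        return "This weeks docs"
    return "Future docs"


print(categorize(datetime(2024, 5, 10, 15, 30), date(2024, 5, 10)))
```

A nightly agent could compute a label like this once per document and store it in a field, so the view never has to evaluate the current date itself.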

Book ordering comparison between spreadsheets for existing catalogue of a Library

I have recently asked this question of google's spreadsheet page.
I have a significant data comparison problem I would like to solve. It relates to purchasing books for a library. We have a catalogue of over 11,000 books. When we order new books we need to compare our proposed purchases to the current stock. Currently we can only compare them to our catalogue manually, very laboriously, book by book.
We need to do 3 things to make our life easier -
1 easily clean out bad data/characters in the ISBNs - these are either spaces, - (hyphens) or . (periods or full stops). A simple formula to run over all ISBN fields would be great.
2 I need to compare data between 1 spreadsheet with 11,000 books in it (current library stock), a second with up to 1000 books in it (currently on order) and finally the third currently active one (about to be ordered) with 50 to 200 books listed in it.
All spreadsheets use the same column configuration as below
Library orders
Title | Author | Publisher | ISBN (long version) | US$ | UKgpd | HK$ | Other$ | P/O no. | Date ordered
UNNATURAL SELECTION | MARA HVISTENDAHL | Public Affairs Publishing; Reprint edition (May 1, 2012) | 978610391511
Finally, the output of these comparisons should quickly and easily identify on which lines we have matches, and what type of match it is: Author only, Author and Title, or Author, Title and ISBN, etc., for all the possible combinations. To make this easier, assume spreadsheet 1 is an unalterable master table, with spreadsheet 2 similar. It is really only on spreadsheet 3 that we need to be clear whether we are starting to reorder materials.
If it is possible to have these as different sheets in a workbook it would be ideal. The only additional feature is that any scripts that run need to be able to cope with spreadsheet 1 increasing in size as new acquisitions arrive and are included. Both spreadsheets 2 and 3 will vary (increase and decrease) as the ordering process proceeds.
Finally the absolute ideal would be for this comparison process to be instant (live) and ongoing as data is included.
If anyone would like to take this on 3 Library staff will be eternally grateful.
regards
Nick
This would be very much easier had you one sheet rather than three (simply add a column to each existing sheet to show whether in stock, on order or to be ordered – three individual letters would be sufficient, then append each of the smaller two files to the largest). Then for example you could apply Conditional Formatting to highlight duplicates one column at a time (Author, Title etc). Apart from the initial data cleansing it would mean in the future switching ‘between sheets’ would merely involve changing a one-letter flag. Filtering would allow you and your colleagues to appear to have three separate sheets and if anyone asks for a particular Title the search would be one-time, not in triplicate.
Also, http://www.microsoft.com/en-gb/download/details.aspx?id=15011 may be of interest, as may =SUBSTITUTE. And with data validation you could prevent entry of a new ISBN that is already in your list.
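If a script turns out to be easier than formulas, the cleaning and matching steps could look roughly like the Python sketch below. It is only an illustration under assumptions: the function names clean_isbn and matching_fields are invented for the example, the rows are assumed to be dictionaries keyed by the column headers, and the ISBN shown is a made-up placeholder.

```python
def clean_isbn(raw):
    """Remove the spaces, hyphens and full stops mentioned in the question."""
    return "".join(ch for ch in raw if ch not in " -.")


def matching_fields(order_row, stock_row, fields=("Author", "Title", "ISBN")):
    """Return the names of the fields on which two rows agree."""
    matches = []
    for field in fields:
        left = order_row.get(field, "")
        right = stock_row.get(field, "")
        if field == "ISBN":
            left, right = clean_isbn(left), clean_isbn(right)
        else:
            # Compare text fields without regard to case or stray whitespace.
            left, right = left.strip().casefold(), right.strip().casefold()
        if left and left == right:
            matches.append(field)
    return matches


order = {"Author": "Mara Hvistendahl", "Title": "Unnatural Selection", "ISBN": "978-0-00.000000 2"}
stock = {"Author": "MARA HVISTENDAHL", "Title": "UNNATURAL SELECTION", "ISBN": "9780000000002"}
print(matching_fields(order, stock))
```

The returned list gives the match type directly (Author only, Author and Title, and so on), and empty cells never count as a match.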

how to search for latest content on google?

In the Google search box, when we type something like " 'java code' + inurl:javalobby ", we get search results where the website link contains the string javalobby and the page contains the string java code.
Similarly, is there a way to search for the most recently updated content on the internet that contains the keyword entered in the search box?
Thanks.
There are two tricks in Google to narrow your search by date: using the keyword daterange:startdate-enddate, or searching by content creation date.
1. Using the syntax daterange:startdate-enddate : The catch is that the date must be expressed as a Julian date, a continuous count of days since noon UTC on January 1, 4713 BC. So, for example, July 8, 2002 is Julian date 2452463.5 and May 22, 1968 is 2439998.5. Furthermore, Google isn't fond of decimals in its daterange: queries; use only integers: 2452463 or 2452464. You can convert Julian dates online here.
Example: Geri Halliwell left the Spice Girls around May 27, 1998. If you wanted to get a lot of information about the breakup, you could try doing a date search in a ten-day window, say May 25 to June 4. That query would look like this:
"Geri Halliwell" "Spice Girls" daterange:2450958-2450968
2. Searching by content creation date : Try adding a string of common date formats to your query. If you wanted something from May 2003, for example, you could try appending:
("May * 2003" | "May 2003" | 05/03 | 05/*/03)
A query like that uses up most of your ten-query limit, however, so it's best to be judicious, perhaps by cycling through these formats one at a time. If any one of these is giving you too many results, try restricting your search to the title tag of the page.
If you're feeling really lucky you can search for a full date, like May 9, 2003. Your decision then is if you want to search for the date in the format above or as one of many variations: 9 May 2003, 9/5/2003, 9 May 03, and so forth. Exact-date searching will severely limit your results and shouldn't be used except as a last-ditch option.
When using date-range searching, you'll have to be flexible in your thinking, more general in your search than you otherwise would be (because the date-range search will narrow your results down a lot), and persistent in your queries because different dates and date ranges will yield very different results. But you'll be rewarded with smaller result sets that are focused on very specific events and topics.
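The Julian date conversion needed for the daterange: syntax can also be done locally instead of with an online converter. The following is a minimal Python sketch; the helper names julian_date and daterange_term are made up for this example, and the integer truncation simply follows the "use only integers" advice above.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal and the Julian date
# at 00:00 UTC on the same calendar day.
ORDINAL_TO_JD = 1721424.5


def julian_date(day):
    """Julian date at midnight UTC at the start of `day`."""
    return day.toordinal() + ORDINAL_TO_JD


def daterange_term(start, end):
    """Build a daterange: term, truncating each Julian date to an integer."""
    return f"daterange:{int(julian_date(start))}-{int(julian_date(end))}"


print(julian_date(date(2002, 7, 8)))                         # 2452463.5
print(daterange_term(date(1998, 5, 25), date(1998, 6, 4)))   # daterange:2450958-2450968
```

The two printed values reproduce the July 8, 2002 figure and the Geri Halliwell query from the examples above.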
