Named Entity Extraction of dates - nlp

I am absolutely new to NER, extraction, and programming in general. I am trying to figure out a way to extract the due dates and start dates of certain documents. Is there a way to do this? A place where I can start? I have been looking around, but the problem I run into is always the same: I can extract dates, but not whether a date is a due date or a posting date. If a document has only one date, is it the posting date or the due date? Stuff like that. Any help would be appreciated.
Example:
"Essay on Medieval Asia was due on September 3rd."
"Your last assignment that was given on April 6th was supposed to be submitted in 10 days."
"The bid is due no later than a month from the date it was posted(today)."

The number of ways to express dates in free text is huge. There are a few solutions:
You can come up with a set of regular expressions and try to parse the dates yourself.
Another option is to train a supervised sequence classifier such as a CRF, if you have documents with dates annotated.
A third option, which can give quick results, is to use this framework from Facebook research, https://github.com/facebookincubator/duckling. It will identify expressions which are dates or time expressions, and it will even normalise them into a single unique date.
Yet another option is ctparse, based on Duckling but a pure Python package to parse time expressions from natural language in German and English.
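To illustrate the regular-expression route, here is a minimal sketch that finds "Month day" dates and labels each one by the nearest cue word before it. The cue lists (`DUE_CUES`, `START_CUES`) and the function name `label_dates` are made up for this example and would need tuning on real documents; this is not what Duckling or ctparse do internally.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Matches e.g. "September 3rd" or "April 6"; other date styles are not covered.
DATE_RE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}}(?:st|nd|rd|th)?\b", re.IGNORECASE)

# Hypothetical cue phrases; extend these for your own documents.
DUE_CUES = ("due", "deadline", "submit", "no later than")
START_CUES = ("given on", "posted", "assigned", "start")


def label_dates(sentence):
    """Return (date_text, role) pairs, where role is 'due', 'start' or 'unknown'."""
    lowered = sentence.lower()
    results = []
    for match in DATE_RE.finditer(sentence):
        before = lowered[:match.start()]
        # Position of the last cue of each kind that appears before the date.
        due_pos = max(before.rfind(cue) for cue in DUE_CUES)
        start_pos = max(before.rfind(cue) for cue in START_CUES)
        if due_pos == -1 and start_pos == -1:
            role = "unknown"
        elif due_pos > start_pos:
            role = "due"
        else:
            role = "start"
        results.append((match.group(0), role))
    return results


print(label_dates("Essay on Medieval Asia was due on September 3rd."))
print(label_dates("Your last assignment that was given on April 6th "
                  "was supposed to be submitted in 10 days."))
```

Relative expressions such as "in 10 days" or "a month from the date it was posted" are not handled here; resolving those against an anchor date is the kind of work a time-expression parser is better suited for.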

Related

How to convert BC dates in NodaTime Instant to a regular date

Instant.FromUnixTimeSeconds(-100100000000).ToDateTimeUtc()
Once the date gets too ancient this doesn't work anymore, for example, BC dates.
Is there an easy way to convert NodaTime Instant values to years, months, and days that works for the entire range of supported Instant values (i.e. 27000 BCE to 31000 CE)?
I don't mind what data type, I am just looking to easily extract the regular time periods from the Instant values.
It's been a year or more since I've used NodaTime, but [this page in the user guide][1] says:
Additionally, all calendars are restricted to four digit formats, even
in year-of-era representations, which avoids ever having to parse
5-digit years. This leads to a Gregorian calendar from 9999 BCE to
9999 CE inclusive, or -9998 to 9999 in "absolute" years.
Your question could be read to mean that you think BC dates don't work at all. Once you go more than a few thousand years from the present, strange things start to happen: the Earth's rotation rate changes, so there are different kinds of days, those counted from sunrises versus the uniform kind of time used for radioactive decay or for calculating planetary positions. It would help if you mentioned your application.
[1]: https://nodatime.org/3.1.x/userguide/range
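If all you need is year, month, and day numbers and you are happy to simply extend the Gregorian calendar backwards, the arithmetic can be done on the raw Unix seconds without any date type. The sketch below is in Python rather than C#, does not use NodaTime, and follows a well-known integer "days to civil date" algorithm; the function names are my own. Years come out in astronomical numbering, so 0 means 1 BCE and -1 means 2 BCE.

```python
def civil_from_days(days):
    """Convert days since 1970-01-01 to (year, month, day) in the
    proleptic Gregorian calendar, using integer arithmetic only."""
    z = days + 719468                      # shift the epoch to 0000-03-01
    era = z // 146097                      # 400-year cycles (floor division)
    doe = z - era * 146097                 # day of era, 0..146096
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)   # day of March-based year
    mp = (5 * doy + 2) // 153              # March-based month, 0..11
    day = doy - (153 * mp + 2) // 5 + 1
    month = mp + 3 if mp < 10 else mp - 9
    year = yoe + era * 400 + (1 if month <= 2 else 0)
    return year, month, day


def unix_seconds_to_ymd(seconds):
    """Astronomical (year, month, day) for a Unix timestamp in seconds."""
    return civil_from_days(seconds // 86400)


print(unix_seconds_to_ymd(0))              # (1970, 1, 1)
print(unix_seconds_to_ymd(-100100000000))  # a date well before year 1
```

Because Python integers are unbounded, this works for any timestamp, but the caveats above still apply: for ancient dates the result is a calendar label, not a historically used date.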

Forecast or Estimate Next Month Sales For Each Customer

The problem statement that I am currently working on has data available for 27 customers and the purchase amount they have transacted on (in total) for each month in 2021 from Jan until Sept. The data looks like the attached image with this question/post.
(Image: sample dataset)
I could simply use the average to predict the next value, but that would not be very precise. Then again, in the absence of any other data or features/columns, is that the only way to solve this, or are there other methods anyone can suggest? Note that both Excel and Python examples are fine.
Additional note: I have already tried the FORECAST functions in Excel, but I am not sure whether the outcome is correct, since the Microsoft documentation merely provides the formula by which the function performs its calculations. Excel provides five FORECAST functions in total, but the documentation is poor, so if I later want to write the same solution in Python or any other programming language, I would not know how to reproduce it.
From a cursory glance at the data, there may be complexity that I'm missing, such as seasonality, trend, noise, or outliers, but let's just assume that this data is a simple trend line for each client.
At a high level, Excel can do this with a simple FORECAST.ETS(target_date, values, timeline, [seasonality], [data_completion], [aggregation]) formula.
It can be streamlined with Excel's built-in data tool, Forecast Sheet.
I could talk about Python, but that's a little more hands-on with a time series forecast.
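For the simple-trend assumption, a Python version does not need much. The following is a minimal sketch that fits a straight line to one customer's monthly totals by ordinary least squares and extrapolates one month ahead. The function name and the sample figures are made up for illustration, and this is a plain linear fit, not the exponential smoothing that FORECAST.ETS performs.

```python
def forecast_next(values):
    """Fit y = a + b*t over t = 0..n-1 by least squares and
    return the fitted value at t = n (the next period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept + slope * n


# Hypothetical Jan..Sep totals for two customers (not the poster's data).
sales = {
    "customer_a": [100, 110, 120, 130, 140, 150, 160, 170, 180],
    "customer_b": [50, 50, 50, 50, 50, 50, 50, 50, 50],
}
for name, monthly in sales.items():
    print(name, round(forecast_next(monthly), 2))
```

With only nine points per customer, anything more elaborate than a trend line or an average is hard to justify, so it is worth comparing this against the plain average on held-out months before trusting either.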

Why does ICU have distinctions for "stand alone" values for dates?

ICU has different formatting symbols for "stand alone" values. For example:
q Stand Alone quarter
L Stand Alone month in year
c Stand Alone local day of week
The documentation states:
"Stand Alone" values refer to those designed to stand on their own, as opposed to being with other formatted values. "2nd quarter" would use the stand alone format (QQQQ), whereas "2nd quarter 2007" would use the regular format (qqqq yyyy).
However, this doesn't explain why there is a distinction. I presume that this matters for some languages, but what are some examples?
(More confusingly, the documentation contradicts itself since it uses both q and Q for the stand alone version.)
I also presume that stand alone versions aren't needed for other fields (such as year, hour, minute, seconds) because those are numeric. If that's the case, however, why do the stand alone values for weekday, month, and quarter support numeric forms?
The distinction is relevant for Polish, for example. A full date in the form dd MMMM yyyy is "22 września 2022" (the month name is in the genitive), whereas a month on its own, as in a calendar header using LLLL yyyy, is "wrzesień 2022" (nominative).
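The Polish case can be sketched in a few lines. The table below is hand-written for this example and contains only September; it does not call ICU, and the names `POLISH_MONTHS` and `format_polish` are hypothetical. It only shows why a formatter needs two month-name tables and a way to choose between them.

```python
# Hand-written data for illustration only; a real formatter would take
# these forms from locale data rather than a hard-coded dictionary.
POLISH_MONTHS = {
    9: {"format": "września", "stand_alone": "wrzesień"},
}


def format_polish(year, month, day=None):
    """Use the 'format' (genitive) form next to a day number and the
    'stand_alone' (nominative) form when the month appears without one."""
    forms = POLISH_MONTHS[month]
    if day is None:
        return f"{forms['stand_alone']} {year}"
    return f"{day} {forms['format']} {year}"


print(format_polish(2022, 9, 22))  # 22 września 2022
print(format_polish(2022, 9))      # wrzesień 2022
```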
I ended up filing ICU-21225 to correct the contradiction in the documentation and to ask for clarification. One of the comments directed me to https://www.unicode.org/reports/tr35/tr35-dates.html#months_days_quarters_eras, which states:
The context is either format (the default), the form used within a complete date format string (such as "Saturday, November 12"), or stand-alone, the form for date elements used independently, such as in calendar headers. The most important distinction between format and stand-alone forms is a grammatical distinction, for languages that require it. For example, many languages require that a month name without an associated day number (i.e. an independent form) be in the basic nominative form, while a month name with an associated day number (as in a complete date format) should be in a different grammatical form: genitive, partitive, etc.
I'm still curious about specific examples (which languages?), though.

Change the Excel date input format?

I've been struggling with Excel (2016) date formats. I know how to change display formats for dates and cells but the problem I have is the input format for dates. If I input a date as "DD.MM" or "DD.MM.YYYY" it does recognize it as a date but if I input the date as "DD.MM." (with the second dot after the month), Excel does not recognize it as a date anymore. The column in question is formatted as short date.
Is there anything that can be done or is this by-design? If so, it seems really strange as at least in my country it's the official way to write the date containing that second dot after the month number when there is no year included in the date.
I've been searching and Googling for solution but couldn't find anything on this really. I appreciate all comments and help regarding this question!
SUMMARY/TL;DR:
Excel version is 2016, country is Finland and language is Finnish
Excel accepts/recognizes these as dates: 12.5 or 30.8
Excel does NOT accept/recognize these as dates: 12.5. or 30.8.
The column in question is formatted as short date
The dot after the month seems to be screwing things up
Why is this happening? Can anything be done?
Kind regards,
Tenttu
Yes, it is/was by design. (Funny enough, my Excel won't allow dots, only dashes (-), so I couldn't even test if "15.8" works)
So there's a slight chance that Excel's language settings (the defaults for time (24 hours or AM/PM), dates (MM/DD or DD/MM), decimals (comma or dot), etc.) don't allow the dot after the month. Here's an example of a user that has that dot and wants to get rid of it. So your system language is a good candidate for why this doesn't work for you.
However, I realize that the example linked above doesn't feature a date with a dot at the end, which suggests it is rather by design. For example, if I add a dot to a valid date or time, it results in a #VALUE! error. That's because of how Excel converts text to a date; remember, dates are actually stored as serial numbers. Adding a dot at the end makes that conversion "impossible". We might think it's easy to just remove the dot, but that has to be programmed explicitly, and I'm leaning towards there being no such step during text-to-date conversion (certainly not on my system, as I get #VALUE!).
One work-around is to strip the ending dot from the date to make it a valid date. So, you can import sheets with dates with dots at the end, then strip them away, and you'll be good to go!
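If the dates pass through a script before they reach Excel, the stripping step could look like the sketch below. It is plain Python, not an Excel feature; the function name `parse_finnish_date` and the `default_year` parameter are my own, and the year has to be supplied because "12.5." does not carry one.

```python
from datetime import date


def parse_finnish_date(text, default_year):
    """Parse 'D.M', 'D.M.' or 'D.M.YYYY' (with or without a trailing dot)."""
    parts = text.strip().rstrip(".").split(".")
    if len(parts) == 2:
        day, month = (int(p) for p in parts)
        year = default_year
    elif len(parts) == 3:
        day, month, year = (int(p) for p in parts)
    else:
        raise ValueError(f"unrecognised date: {text!r}")
    return date(year, month, day)


print(parse_finnish_date("12.5.", 2016))      # 2016-05-12
print(parse_finnish_date("30.8", 2016))       # 2016-08-30
print(parse_finnish_date("1.2.2016", 2016))   # 2016-02-01
```

Writing the result back in an unambiguous form such as YYYY-MM-DD avoids depending on how Excel's regional settings interpret the text.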

How should I store old dates in SharePoint?

I need to store dates in SharePoint that need to go back around 5000 BC. Ideally, I would like to be able to do date addition/subtraction, like this:
oldDate = '5000 BC';
newDate = '1995 AD';
DateDiff(oldDate, newDate, 'Years'); // equals 6995
How should I proceed? Build an old_date class based on strings? Just use regular dates, but add an AD or BC that makes the date negative?
This is a seriously non-trivial problem, and it really depends on what exactly you want to do with those dates. For example, we've only used the current (Gregorian) calendar since 1582. Before that it was the Julian calendar, and before that an old Roman calendar. To make matters worse, this info is really only for Western Europe (and culturally related areas). So if you are hoping to have something that will give you proper accepted dates for historical events with a little simple math, you are in for a big disappointment.
If you just want to carry the Gregorian calendar backwards, I suppose that's doable. However, there still is error, and on that scale it matters. From Wikipedia:
On timescales of thousands of years,
the Gregorian calendar falls behind
the seasons because the slowing down
of the Earth's rotation makes each day
slightly longer over time (see tidal
acceleration and leap second) while
the year maintains a more uniform
duration
If you are interested only in years and not in days, then you could build a custom field with a custom editor and store the year as an integer value.
Values less than zero mean BC, and values greater than or equal to zero mean AD.
I ended up storing dates as a text field in ISO 8601 format:
YYYY-MM-DDThh:mm:ss.sTZD
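You don't have to store the entire string. For instance, to store 5000 BC, you would enter -5000-01-01. (Strictly speaking, ISO 8601 numbers years astronomically, with year 0000 being 1 BC, so -5000 denotes 5001 BC; pick one convention and apply it consistently.) Date addition and subtraction don't come very easily this way, but it was much easier to get the data in there in the format I wanted.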
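For year-only arithmetic like the DateDiff in the question, storing the year as a signed integer is enough. Here is a minimal sketch in Python (not SharePoint code; the function names are made up). It converts BC/AD years to astronomical numbering first, because there is no year 0 between 1 BC and 1 AD. With that convention the span from 5000 BC to 1995 AD comes out as 6994 years rather than the 6995 you get by simply adding the two numbers.

```python
def to_astronomical(year, era):
    """Convert (year, 'BC' or 'AD') to astronomical year numbering,
    where 1 BC is year 0 and 2 BC is year -1."""
    if year < 1:
        raise ValueError("era years start at 1")
    era = era.upper()
    if era == "AD":
        return year
    if era == "BC":
        return 1 - year
    raise ValueError(f"unknown era: {era!r}")


def years_between(start, end):
    """Whole-year difference between two (year, era) pairs."""
    return to_astronomical(*end) - to_astronomical(*start)


print(years_between((5000, "BC"), (1995, "AD")))  # 6994
print(years_between((1, "BC"), (1, "AD")))        # 1
```

If you prefer the simpler "negative means BC" convention from the answer above, the subtraction becomes a plain integer difference, at the cost of being off by one across the BC/AD boundary.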

Resources