How to identify and homogenize date format of instances in a string?

I can't find a way to identify the date formats of a string in MATLAB and put all of them in the same format. I have the following cell array:
list = {'01-Sep-1882'; ...
'01-Aug-1895'; ...
'04/01/1912'; ...
'Tue, 05/28/46'; ...
'Tue, 03/10/53'; ...
'06/20/58'; ...
'Thu, 09/20/73'; ...
'Fri, 08/15/75'; ...
'Sun, 12/01/1996'};
If I call datenum(list), I get an error because the rows do not all have the same date format. Can you think of a way around this?

You can do this by successively applying datetime to convert each format and isnat to identify those dates that didn't convert properly. In addition, you can specify days of the week in the date format string and what pivot year to use for years with only the last two numbers. Starting with the sample data and expected date formats in your question, here's the code to do it:
% Input:
list = {'01-Sep-1882'; ...
'01-Aug-1895'; ...
'04/01/1912'; ...
'Tue, 05/28/46'; ...
'Tue, 03/10/53'; ...
'06/20/58'; ...
'Thu, 09/20/73'; ...
'Fri, 08/15/75'; ...
'Sun, 12/01/1996'};
% Conversion code:
dt = datetime(list, 'InputFormat', 'dd-MMM-yyyy');
index = isnat(dt);
dt(index) = datetime(list(index), 'InputFormat', 'MM/dd/yy', 'PivotYear', 1900);
index = isnat(dt);
dt(index) = datetime(list(index), 'InputFormat', 'eee, MM/dd/yy', 'PivotYear', 1900)
% Output:
dt =
9×1 datetime array
01-Sep-1882
01-Aug-1895
01-Apr-1912
28-May-1946
10-Mar-1953
20-Jun-1958
20-Sep-1973
15-Aug-1975
01-Dec-1996
Now you can convert these to numeric values with datenum:
dnum = datenum(dt);
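The same try-one-format-then-the-next idea can be sketched outside MATLAB. The Python version below is only an illustration and not part of the answer above: the helper name parse_mixed, the FORMATS table, and the fixed 1900 century (standing in for 'PivotYear', 1900) are assumptions, and %b and %a assume an English locale.

```python
from datetime import date, datetime
from typing import Optional

# (format, has_two_digit_year) pairs, tried in order until one matches.
FORMATS = [
    ("%d-%b-%Y", False),      # 01-Sep-1882
    ("%m/%d/%Y", False),      # 04/01/1912
    ("%m/%d/%y", True),       # 06/20/58
    ("%a, %m/%d/%Y", False),  # Sun, 12/01/1996
    ("%a, %m/%d/%y", True),   # Tue, 05/28/46
]


def parse_mixed(text: str, century: int = 1900) -> Optional[date]:
    """Return the first successful parse of text, or None if no format fits."""
    for fmt, two_digit_year in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        if two_digit_year:
            # Assumption: every two-digit year belongs to the given century.
            # (A 29 February in a non-leap target year would raise here.)
            parsed = parsed.replace(year=century + parsed.year % 100)
        return parsed.date()
    return None


samples = ["01-Sep-1882", "04/01/1912", "Tue, 05/28/46", "06/20/58", "Sun, 12/01/1996"]
for sample in samples:
    print(sample, "->", parse_mixed(sample))
```

Returning None for unparsed strings plays the role that NaT and isnat play in the MATLAB code.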

Part of the issue here is that MATLAB's datenum cannot parse strings with a leading weekday, such as Tue, 05/28/46. So let's clean the original list into a form it can parse.
% original list
list = {'01-Sep-1882','01-Aug-1895','04/01/1912','Tue, 05/28/46','Tue, 03/10/53','06/20/58','Thu, 09/20/73','Fri, 08/15/75','Sun, 12/01/1996'}
% split each cell in the list
list_split = cellfun(@(x) strsplit(x,' '), list, 'UniformOutput', false);
% now detect where there is unusual format, this will give logical array
abnormal_idx = cellfun(@(x) length(x) == 2, list_split,'UniformOutput', true)
% make copy
clean_list = list;
% now at abnormal indices retain only the part that MATLAB understands
clean_list(abnormal_idx) = cellfun(@(x) x{2}, list_split(abnormal_idx), 'UniformOutput', false);
% now run datenum on clean list
date_num = cellfun(@datenum, clean_list, 'UniformOutput', true);

Related

PowerShell, CSV date imported in different format

I want to import this kind of csv into Excel
Work Item Type,ID,State,Date Request,Created Date
"Task","4533","Closed","2-9-2020 14:26:00","3-9-2020 08:17:39"
"Task","4535","Closed","3-9-2020 12:26:44","3-9-2020 12:29:33"
"Task","4577","Closed","3-9-2020 15:56:00","4-9-2020 09:12:21"
"Task","4580","New","17-8-2020 09:47:00","4-9-2020 09:49:39"
"Task","4581","Resolved","28-8-2020 10:22:00","4-9-2020 10:24:46"
"Task","4582","Resolved","24-8-2020 10:05:00","4-9-2020 10:31:12"
"Task","4604","Resolved","8-9-2020 08:06:58","8-9-2020 08:07:23"
"Task","4605","Resolved","8-9-2020 09:18:32","8-9-2020 09:18:58"
All dates in this example must be seen with a format day-month-year hour:minute:second
I do the import like this:
Import-Csv -Path '.\Issues.csv' | ForEach-Object {
$sheet1.Cells.Item(1,1) = 'ID'
$sheet1.Cells.Item(1,2) = 'Status'
$sheet1.Cells.Item(1,3) = 'Date Request'
$sheet1.Cells.Item(1,4) = 'Date Created'
$DateRequest = ([datetime]::ParseExact(($($_."Date Request")),$fmtDate,$inv).ToString($fmtDate))
$sheet1.Cells.Item($row,1) = $($_.ID)
$sheet1.Cells.Item($row,2) = $($_.State)
$sheet1.Cells.Item($row,3) = $($_."Date Request")
$sheet1.Cells.Item($row,4) = $($_."Created Date")
$row = $row + 1
}
The result of my Import
ID Status Date Request Date Created
4533 Closed 9/02/2020 14:26 9/03/2020 8:17
4535 Closed 9/03/2020 12:26 9/03/2020 12:29
4577 Closed 9/03/2020 15:56 9/04/2020 9:12
4580 New 17-8-2020 09:47:00 9/04/2020 9:49
4581 Resolved 28-8-2020 10:22:00 9/04/2020 10:24
4582 Resolved 24-8-2020 10:05:00 9/04/2020 10:31
4604 Resolved 9/08/2020 8:06 9/08/2020 8:07
4605 Resolved 9/08/2020 9:18 9/08/2020 9:18
As you can see, some dates are read from the CSV in a month-day-year format,
others are read in a day-month-year format.
The date 3 September has become 9 March.
I have tried using CultureInfo, but without any success.
$inv = [System.Globalization.CultureInfo]::InvariantCulture
$fmtDate = "dd/MM/YYYY HH:mm:ss"
$DateRequest = ([datetime]::ParseExact(($($_."Date Request")),$fmtDate,$inv).ToString($fmtDate))
Does anyone have any suggestions to solve this?
First of all, the dates in your CSV file have the format d-M-yyyy HH:mm:ss (yyyy is lowercase, and the days and months in the fields do not have leading zeros).
Try
$fmtDate = "d-M-yyyy HH:mm:ss"
What puzzles me is why you want to first parse the date in the CSV and then use ToString() to reformat it into the exact same string format.
Take off the .ToString($fmtDate), as in
$DateRequest = [datetime]::ParseExact($_."Date Request",$fmtDate, $inv)
and feed that DateTime object into the Excel cell.
dd/MM/YYYY HH:mm:ss does NOT describe the input date format you have - dd and MM are for day and month numbers with leading zeros.
Use:
$fmtDateInput = 'd-M-yyyy HH:mm:ss'
$fmtDateOutput = "dd/MM/yyyy HH:mm:ss"
[datetime]::ParseExact($dateString, $fmtDateInput, $culture).ToString($fmtDateOutput)
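To make the parse-with-one-format, print-with-another idea concrete, here is a minimal sketch of the same round trip in Python. It is an illustration only, not the PowerShell fix itself: the names INPUT_FORMAT, OUTPUT_FORMAT and reformat_timestamp are assumptions, and the format strings are the strptime equivalents of d-M-yyyy HH:mm:ss and dd/MM/yyyy HH:mm:ss.

```python
from datetime import datetime

# Day first, then month. strptime accepts one- or two-digit days and months here.
INPUT_FORMAT = "%d-%m-%Y %H:%M:%S"
# Zero-padded day and month for the output.
OUTPUT_FORMAT = "%d/%m/%Y %H:%M:%S"


def reformat_timestamp(text: str) -> str:
    """Parse a day-month-year timestamp and re-emit it with leading zeros."""
    parsed = datetime.strptime(text, INPUT_FORMAT)
    return parsed.strftime(OUTPUT_FORMAT)


parsed = datetime.strptime("3-9-2020 08:17:39", INPUT_FORMAT)
print(parsed.month)  # the month field, which must be September, not March
print(reformat_timestamp("3-9-2020 08:17:39"))
```

Because the input format is stated explicitly, the day and month cannot be swapped by a locale-dependent guess, which is what happened to the rows whose day was 12 or less.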

To check if the continuity of dates are missing in a column

I want to check a column in my dataframe: if a month has no date in it, the code should output that missing month in the format MMM-YYYY.
The data set looks like this :
date_start_balance date_end_balance start_balance
22.02.16 22.03.16 3590838
22.04.16 22.05.16 69788
15.06.16 21.07.16 452165
Both date columns are in datetime format. In the data set above, March and May are missing from the start column, and these should be returned as MMM-YYYY.
I have tried the following code :
import datetime
dates = df1['date_start_balance'].tolist()
missing = []
for i in range(0,len(dates)-1):
    if dates[i+1].month - dates[i+1].month != 1:
        for j in range(dates[i].month+1,dates[i+1].month):
            missing.append(datetime(dates[i].year, j,1))
print(missing)
You can first create a date range with pd.date_range:
may = pd.date_range(start='2016-05-01', end='2016-05-31')
Then build the list of dates that you already have; in this example there is only one date, 2016-05-15:
your_list = [datetime.datetime.strptime('15052016', "%d%m%Y").date()]
Finally, take the difference between the range and your list to get the dates that are missing:
may.difference(your_list)
DatetimeIndex(['2016-05-01', '2016-05-02', '2016-05-03', '2016-05-04',
'2016-05-05', '2016-05-06', '2016-05-07', '2016-05-08',
'2016-05-09', '2016-05-10', '2016-05-11', '2016-05-12',
'2016-05-13', '2016-05-14', '2016-05-16', '2016-05-17',
'2016-05-18', '2016-05-19', '2016-05-20', '2016-05-21',
'2016-05-22', '2016-05-23', '2016-05-24', '2016-05-25',
'2016-05-26', '2016-05-27', '2016-05-28', '2016-05-29',
'2016-05-30', '2016-05-31'],
dtype='datetime64[ns]', freq=None)
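The answer above lists missing days, while the question asks for missing months. A minimal sketch of the month-level check follows, using only the standard library. The function name missing_months is hypothetical, and the input is assumed to be any list of objects with .year and .month attributes, such as the result of df1['date_start_balance'].tolist().

```python
from datetime import date


def missing_months(dates):
    """Return months with no entry between the first and last date, as 'Mon-YYYY'."""
    present = {(d.year, d.month) for d in dates}
    if not present:
        return []
    (year, month), last = min(present), max(present)
    missing = []
    while (year, month) < last:
        if (year, month) not in present:
            # %b is the abbreviated month name (English locale assumed).
            missing.append(date(year, month, 1).strftime("%b-%Y"))
        # Step forward one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return missing


start_dates = [date(2016, 2, 22), date(2016, 4, 22), date(2016, 6, 15)]
print(missing_months(start_dates))
```

Comparing (year, month) pairs instead of subtracting month numbers keeps the check correct across a year boundary, for example from November to February.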

How to change a given date in "yyyy-MM-dd HH:mm:ss.SSS" format to "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" format in groovy

How to convert a given date in yyyy-MM-dd HH:mm:ss.SSS format to yyyy-MM-dd'T'HH:mm:ss.SSS'Z' format in groovy
For example, the given date is 2019-03-18 16:20:05.6401383. I want it to be converted to 2019-03-18T16:20:05.6401383Z
This is the code Used:
def date = format1.parse("2019-03-18 16:20:05.6401383");
String settledAt = format2.format(date)
log.info ">>> " + date + " " + settledAt
The result, where the date is getting changed somehow: Mon Mar 18 18:06:46 EDT 2019 & 2019-03-18T18:06:46.383Z
Thanks in advance for all the answers.
If you're on Java 8+ and Groovy 2.5+, I would use the new Date/Time API:
import java.time.*
// 'S' is fraction-of-second; 'n' would read the digits as a nanosecond count instead
def date = LocalDateTime.parse('2019-03-18 16:20:05.6401383', 'yyyy-MM-dd HH:mm:ss.SSSSSSS')
String settledAt = date.format(/yyyy-MM-dd'T'HH:mm:ss.SSSSSSS'Z'/)
This is presuming the input date has a "Zulu" time zone.
This is a feature of Java:
def date = Date.parse("yyyy-MM-dd HH:mm:ss.SSS","2019-03-18 16:20:05.6401383")
returns
Mon Mar 18 18:06:46 EET 2019
The problem is that SSS handles only milliseconds (3 digits after the seconds),
but you are providing 7 digits, 6401383, which are read as 6401383 milliseconds and added to the time.
As a workaround, remove the extra digits with a regexp:
def sdate1 = "2019-03-18 16:20:05.6401383"
sdate1 = sdate1.replaceAll( /(\d{3})\d*$/, '$1') //keep only the first 3 fractional digits
def date = Date.parse("yyyy-MM-dd HH:mm:ss.SSS",sdate1)
def sdate2 = date.format("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
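The truncate-then-parse workaround is not specific to Groovy. As a minimal sketch in Python (an illustration only; the function name to_iso_millis is hypothetical, and appending a literal Z assumes the input is already in UTC, as the first answer also presumes):

```python
from datetime import datetime


def to_iso_millis(text: str) -> str:
    """Keep three fractional digits and return yyyy-MM-dd'T'HH:mm:ss.SSS'Z' style text."""
    head, _, fraction = text.partition(".")
    # Truncate (do not round) to milliseconds; pad if fewer than three digits were given.
    millis = fraction[:3].ljust(3, "0")
    parsed = datetime.strptime(f"{head}.{millis}", "%Y-%m-%d %H:%M:%S.%f")
    # Assumption: the timestamp is UTC, so a literal 'Z' suffix is appropriate.
    return parsed.isoformat(timespec="milliseconds") + "Z"


print(to_iso_millis("2019-03-18 16:20:05.6401383"))
```

Cutting the fraction before parsing is the key step: the parser then never sees more digits than its pattern can represent.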

how to get only date string from a long string

I know there are lots of Q&As to extract datetime from string, such as dateutil.parser, to extract datetime from a string
import dateutil.parser as dparser
dparser.parse('something sep 28 2017 something',fuzzy=True).date()
output: datetime.date(2017, 9, 28)
but my question is how to know which part of the string produced this extraction, e.g. I want a function that also returns 'sep 28 2017'
datetime, datetime_str = get_date_str('something sep 28 2017 something')
outputs: datetime.date(2017, 9, 28), 'sep 28 2017'
any clue or any direction that i can search around?
Extending the discussion with @Paul and following the solution from @alecxe, I propose the following solution, which works on a number of test cases. I've made the problem slightly more challenging:
Step 1: get excluded tokens
import dateutil.parser as dparser
ostr = 'something sep 28 2017 something abcd'
_, excl_str = dparser.parse(ostr,fuzzy_with_tokens=True)
gives outputs of:
excl_str: ('something ', ' ', 'something abcd')
Step 2 : rank tokens by length
excl_str = list(excl_str)
excl_str.sort(reverse=True,key = len)
gives a sorted token list:
excl_str: ['something abcd', 'something ', ' ']
Step 3: delete tokens and ignore space element
for i in excl_str:
    if i != ' ':
        ostr = ostr.replace(i,'')
print(repr(ostr))
gives a final output
ostr: 'sep 28 2017 '
Note: step 2 is required because a shorter token that is a substring of a longer one causes problems. In this case, if deletion followed the order ('something ', ' ', 'something abcd'), the replacement would first remove something from something abcd, so abcd would never get deleted and the result would be 'sep 28 2017 abcd'.
Interesting problem! There is no direct way to get the parsed date string out of the bigger string with dateutil. The problem is that the dateutil parser does not even have this string available as an intermediate result, since it builds the parts of the future datetime object on the fly, character by character.
It, though, also collects a list of skipped tokens which is probably your best bet. As this list is ordered, you can loop over the tokens and replace the first occurrence of the token:
from dateutil import parser
s = 'something sep 28 2017 something'
parsed_datetime, tokens = parser.parse(s, fuzzy_with_tokens=True)
for token in tokens:
    s = s.replace(token.lstrip(), "", 1)
print(s) # prints "sep 28 2017"
I am not 100% sure, though, that this would work in all possible cases, especially with different whitespace characters (notice how I had to work around things with .lstrip()).
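Both answers remove the skipped tokens with str.replace, which is where the ordering and substring concerns come from. One possible variant, sketched below under stated assumptions, walks through the string by position instead, so each token is removed only at or after the end of the previous one. The function name strip_skipped is hypothetical, and the tokens tuple in the demo is copied from the fuzzy_with_tokens output quoted above rather than produced by calling dateutil here.

```python
def strip_skipped(text, tokens):
    """Remove each skipped token once, scanning left to right, and return what is left."""
    kept = []
    pos = 0
    for token in tokens:
        if not token.strip():
            continue  # whitespace-only tokens sit inside the date itself, so keep them
        idx = text.find(token, pos)
        if idx == -1:
            continue  # token not found after the current position, leave text as is
        kept.append(text[pos:idx])  # text between the previous token and this one
        pos = idx + len(token)      # resume the search after the removed token
    kept.append(text[pos:])
    return "".join(kept).strip()


ostr = "something sep 28 2017 something abcd"
# Skipped tokens as reported for this string with fuzzy_with_tokens=True.
tokens = ("something ", " ", "something abcd")
print(repr(strip_skipped(ostr, tokens)))
```

Because the search position only moves forward, the tokens can be used in the order dateutil returns them, and no sort by length is needed for this example.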

Excel to matlab timestamps

I have data in Excel in the form of timestamps, so it looks like
30/11/12 12:42 AM
30/11/12 12:47 AM
30/11/12 12:56 AM
30/11/12 1:01 AM
I need to get it to matlab to look like this
dateStrings = {...
'30/11/12 12:42 AM' ...
'30/11/12 12:47 AM' ...
'30/11/12 12:56 AM' ...
'30/11/12 1:01 AM' ...
};
I have tried xlsread but it doesn't put in the strings.
The following works for me (in Octave, but should be the same in MATLAB):
>> [num,txt,raw]=xlsread('dates.xls','A1:A4')
num =
4.1243e+004
4.1243e+004
4.1243e+004
4.1243e+004
txt = {}(0x0)
raw =
{
[1,1] = 4.1243e+004
[2,1] = 4.1243e+004
[3,1] = 4.1243e+004
[4,1] = 4.1243e+004
}
>> datestr(num+datenum(1900,1,1,0,0,0)-2)
ans =
30-Nov-2012 00:42:00
30-Nov-2012 00:47:00
30-Nov-2012 00:56:00
30-Nov-2012 01:01:00
>> whos ans
Variables in the current scope:
Attr Name Size Bytes Class
==== ==== ==== ===== =====
ans 4x20 80 char
Total is 80 elements using 80 bytes
Check out the datestr function for the various output format options.
Arnaud
I managed to find a way to solve it:
1. Copy and paste your dates into Excel in dd-mm-yyyy format
2. In Excel, highlight the data, right-click, and choose Format Cells/Number
3. In MATLAB, run a=xlsread(xlsfile);
4. Type datestr(a+693960)
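Both answers rely on the same arithmetic: an Excel serial number counts days from 30 December 1899, and 693960 is the MATLAB datenum of that day. A minimal Python sketch of the conversion follows, to show where the constant comes from. The names EXCEL_EPOCH, MATLAB_OFFSET and excel_serial_to_datetime are assumptions, and the sketch assumes Excel's default 1900 date system with serials after 1 March 1900.

```python
from datetime import datetime, timedelta

# Day zero of Excel's 1900 date system (assumption: workbook uses the default system).
EXCEL_EPOCH = datetime(1899, 12, 30)
# MATLAB datenum of the Excel epoch; datenum is Python's proleptic ordinal plus 366.
MATLAB_OFFSET = EXCEL_EPOCH.toordinal() + 366


def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial day number to a datetime, rounded to the nearest second."""
    return EXCEL_EPOCH + timedelta(seconds=round(serial * 86400))


serial = 41243 + 42 / 1440  # 41243 days plus 42 minutes
print(MATLAB_OFFSET)
print(excel_serial_to_datetime(serial).strftime("%d-%b-%Y %H:%M:%S"))
```

Rounding to whole seconds avoids timestamps such as 00:41:59.999999 that floating-point day fractions can otherwise produce.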
