I have millions of records in Azure Data Explorer. Each of these records has a timestamp value associated with it. I want to be able to convert this timestamp value to a specific time zone.
For example, in SQL I use AT TIME ZONE to convert a timestamp value from one zone to another:
Select CONVERT(datetime, timestampvalueColumn) AT TIME ZONE 'UTC' AT TIME ZONE 'US Eastern Standard Time' as 'TimeInEST' from Table;
I am not willing to use an offset value, as it doesn't account for daylight saving changes.
How can I do this with the Kusto Query Language in ADX?
Well, the Kusto team is moving fast :-)
Support for time zone conversion has been added:
datetime_local_to_utc()
datetime_utc_to_local()
// Sample generation. Not part of the solution
let t = materialize(range i from 1 to 15 step 1 | extend dt_utc = ago(rand()*365d*10));
// Solution Starts here
t
| extend dt_et = datetime_utc_to_local(dt_utc, "US/Eastern")
| extend offset = dt_et - dt_utc
i   dt_utc                        dt_et                         offset
5   2012-12-03T17:24:51.6057076Z  2012-12-03T12:24:51.6057076Z  -05:00:00
14  2012-12-10T05:04:17.8507406Z  2012-12-10T00:04:17.8507406Z  -05:00:00
10  2013-03-23T14:42:00.4276416Z  2013-03-23T10:42:00.4276416Z  -04:00:00
15  2013-10-01T06:28:36.4665806Z  2013-10-01T02:28:36.4665806Z  -04:00:00
11  2017-07-18T06:10:30.9963876Z  2017-07-18T02:10:30.9963876Z  -04:00:00
3   2017-11-17T21:57:58.4443366Z  2017-11-17T16:57:58.4443366Z  -05:00:00
6   2018-05-09T03:36:24.7533896Z  2018-05-08T23:36:24.7533896Z  -04:00:00
12  2018-06-05T17:36:41.7970716Z  2018-06-05T13:36:41.7970716Z  -04:00:00
4   2018-08-03T16:25:19.9323686Z  2018-08-03T12:25:19.9323686Z  -04:00:00
8   2019-02-21T17:33:52.9957996Z  2019-02-21T12:33:52.9957996Z  -05:00:00
2   2020-09-24T18:37:08.0049776Z  2020-09-24T14:37:08.0049776Z  -04:00:00
1   2020-12-09T19:57:23.7480626Z  2020-12-09T14:57:23.7480626Z  -05:00:00
7   2021-01-17T13:42:55.0632136Z  2021-01-17T08:42:55.0632136Z  -05:00:00
9   2021-03-04T23:44:01.7192366Z  2021-03-04T18:44:01.7192366Z  -05:00:00
13  2022-06-04T16:26:57.8826486Z  2022-06-04T12:26:57.8826486Z  -04:00:00
Usually the answer is "Don't do it in Kusto": do it in the client that is reading the results from Kusto, which almost certainly has "utc-to-local-time" or "utc-to-this-timezone" functions.
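For example, if the client reading the results happens to be Python with pandas (purely an illustrative assumption, as is the 'dt_utc' column name below), the conversion is a one-liner that also handles DST:
import pandas as pd

# Sketch only: 'dt_utc' stands in for a UTC timestamp column read from Kusto.
df = pd.DataFrame({"dt_utc": pd.to_datetime(
    ["2019-02-21T17:33:52Z", "2018-08-03T16:25:19Z"], utc=True)})
# tz_convert handles daylight saving transitions automatically.
df["dt_et"] = df["dt_utc"].dt.tz_convert("US/Eastern")
print(df)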
You can build a convenience function using an idea similar to the function given below. Note that the conversion works for DST (Daylight Saving Time) as well. You just need a way to map a place to its timezone string. In the function that follows, the mapping is from a Brazilian state abbreviation to its timezone string.
See the documentation for a list of available timezones.
.create-or-alter function with (
    docstring = 'Given a state (UF) and a UTC datetime, returns the local time. The GMT-5 zone west of AM is not handled.'
) ToLocalDatetime(state: string, dtutc: datetime) {
    let selected_tz = iff('GO,DF,MG,ES,RJ,SP,PR,SC,RS' has state, 'America/Sao_Paulo',
                      iff('MA,PI,CE,RN,PB' has state, 'America/Fortaleza',
                      iff('AL,SE' has state, 'America/Maceio',
                      iff('BA' == state, 'America/Bahia',
                      iff('RR' == state, 'America/Boa_Vista',
                      iff('MS' == state, 'America/Campo_Grande',
                      iff('MT' == state, 'America/Cuiaba',
                      iff('AM' == state, 'America/Manaus',
                      iff('PA,AP' has state, 'America/Belem',
                      iff('AC' == state, 'America/Rio_Branco',
                      iff('RO' == state, 'America/Porto_Velho',
                      iff('PE' == state, 'America/Recife',
                      iff('TO' == state, 'America/Araguaina', '')))))))))))));
    let localdt = datetime_utc_to_local(dtutc, selected_tz);
    let dt_hr = split(format_datetime(localdt, "yyyy-MM-dd HH:mm:ss"), " ");
    iff(isnotempty(localdt),
        strcat(dt_hr[0], "T", dt_hr[1], format_timespan(localdt - dtutc, "HH:mm")),
        '')
}
A couple of tests around the moment when DST ended in the Brazilian DF state:
print(ToLocalDatetime('DF', datetime('2019-02-17 01:00:00')))
Output: 2019-02-16T23:00:00-02:00
print(ToLocalDatetime('DF', datetime('2019-02-17 02:00:00')))
Output: 2019-02-16T23:00:00-03:00
I agree with other answers stating that it is better to do it on the client side in most cases. Additionally, the iff sequence in the function is ugly. For a more elegant solution, it is possible to define a datatable such as:
datatable(state:string, tz:string) [
'GO,DF,MG,ES,RJ,SP,PR,SC,RS', 'America/Sao_Paulo',
'MA,PI,CE,RN,PB', 'America/Fortaleza',
......
However, if you do that, you cannot use the function in some scenarios, due to documented restrictions.
Related
I have six .csv files I am importing and all contain emails:
Donors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Donors Q1 2021 R12.csv",
                     usecols=["Email Address"])
Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Activists Q1 2021 R12.csv",
                        usecols=["Email"])
Low_Level_Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Low Level Activists Q1 2021 R12.csv",
                                  usecols=["Email"])
Ambassadors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Ambassadors Q1 2021.csv",
                          usecols=["Email Address"])
Volunteers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Volunteers Q1 2021 R12.csv",
                         usecols=["Email Address"])
Followers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Followers Q1 2021 R12.csv",
                        usecols=["Email"])
While I am only importing emails (annoyingly with two different naming conventions because of the systems they originate from), I am adding the import name as a classifier, e.g. Donors, Volunteers, etc.
Donors['Value'] = "Donors"
Activists['Value'] = "Activists"
Low_Level_Activists['Value'] = "Low_Level_Activists"
Ambassadors['Value'] = "Ambassadors"
Volunteers['Value'] = "Volunteers"
Followers['Value'] = 'Followers'
I then concatenate all the files and handle the naming issue. I am sure there is a more elegant way to do this but here's what I have:
S1 = pd.concat([Donors, Activists, Low_Level_Activists, Ambassadors, Volunteers, Followers], ignore_index=True)
S1['Handle'] = S1['Email Address'].where(S1['Email Address'].notnull(), S1['Email'])
S1= S1.drop(['Email','Email Address'], axis = 1)
print(S1['Handle'].count()) #checks full count
The total on that last line is 166,749
Here is my problem. I need to filter the emails down to unique values - easy enough using .nunique() - but I also need to carry the classifier. So if a single email is both a Donor and an Activist, I pull both when I try to merge the unique values with the classifier.
I have been at this for many hours (and to the end of the Internet!) and can't seem to find a workable solution. I've tried dictionaries, for loops, merges, etc. ad infinitum. The unique email count is 165,923 (figured out via Python and/or Excel :( ).
Essentially, I would want to pull the earliest classifier in my list on a match. So if an email is a Donor and an Activist -> call them a Donor. Or if an email is a Volunteer and a Follower -> call them a Volunteer on one email record.
Any help would be greatly appreciated.
I'll give it a try with some made-up data:
import pandas as pd
fa = pd.DataFrame([['paul@mail.com', 'Donors'], ['max@mail.com', 'Donors']], columns=['Handle', 'Value'])
fb = pd.DataFrame([['paul@mail.com', 'Activists'], ['annie@mail.com', 'Activists']], columns=['Handle', 'Value'])
S1 = pd.concat([fa, fb])
print(S1)
gives
Handle Value
0 paul@mail.com Donors
1 max@mail.com Donors
0 paul@mail.com Activists
1 annie@mail.com Activists
You can group by Handle and then pick any Value you like, e.g. the first:
for handle, group in S1.groupby('Handle'):
    print(handle, group.reset_index().loc[0, 'Value'])
gives
annie@mail.com Activists
max@mail.com Donors
paul@mail.com Donors
or collect all roles of a person:
for handle, group in S1.groupby('Handle'):
    print(handle, group.Value.unique())
gives
annie@mail.com ['Activists']
max@mail.com ['Donors']
paul@mail.com ['Donors' 'Activists']
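If the goal from the question is to keep the highest-priority classifier per email (a Donor before an Activist, and so on), one possible sketch, assuming the priority list below matches the real classifier order, is to make Value an ordered categorical, sort, and drop duplicates:
# Assumed priority order, highest first; adjust to the real classifier list.
priority = ['Donors', 'Activists', 'Low_Level_Activists', 'Ambassadors', 'Volunteers', 'Followers']
S1['Value'] = pd.Categorical(S1['Value'], categories=priority, ordered=True)
# Sort by priority, then keep the first (highest-priority) row per email address.
S1_unique = S1.sort_values('Value').drop_duplicates('Handle', keep='first')
print(S1_unique)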
I've been trying to optimise a bokeh server to calculate live stats by selected country on Covid19.
I found myself repeating a groupby call to calculate new columns and was wondering, having selected the groupby, if I could then apply it to multiple columns in a way similar to .agg()?
For example:
dfall = pd.DataFrame(db("SELECT * FROM C19daily"))
dfall.set_index(['geoId', 'date'], drop=False, inplace=True)
dfall = dfall.sort_index(ascending=True)
dfall.head()
                      id        date geoId  cases  deaths          auid
geoId date
AD    2020-03-03   70119  2020-03-03    AD      1       0  AD03/03/2020
      2020-03-14   70118  2020-03-14    AD      1       0  AD14/03/2020
      2020-03-16   70117  2020-03-16    AD      3       0  AD16/03/2020
      2020-03-17   70116  2020-03-17    AD      9       0  AD17/03/2020
      2020-03-18   70115  2020-03-18    AD      0       0  AD18/03/2020
I need to create new columns based on 'cases' and 'deaths', applying various functions like cumsum(). Currently I do this the long way:
dfall['ccases'] = dfall.groupby(level=0)['cases'].cumsum()
dfall['dpc_cases'] = dfall.groupby(level=0)['cases'].pct_change(fill_method='pad', periods=7)
.....
dfall['cdeaths'] = dfall.groupby(level=0)['deaths'].cumsum()
dfall['dpc_deaths'] = dfall.groupby(level=0)['deaths'].pct_change(fill_method='pad', periods=7)
I tried to optimise the groupby call like this:
with dfall.groupby(level=0) as gr:
    gr = gr['cases'].cumsum()...
But the error suggests the class doesn't support this:
AttributeError: __enter__
I thought I could use .agg({}) and supply a dictionary:
g = dfall.groupby(level=0).agg({'cc' : 'cumsum', 'cd' : 'cumsum'})
but that produces another error
pandas.core.base.SpecificationError: nested renamer is not supported
I have plenty of other bits to optimise; I thought this Python part would be the easiest and would save a few ms!
Could anyone nudge me in the right direction?
To avoid repeating dfall.groupby(level=0) you can just save it in a variable:
gb = dfall.groupby(level=0)
gb_cases = gb['cases']
dfall['ccases'] = gb_cases.cumsum()
dfall['dpc_cases'] = gb_cases.pct_change(fill_method='pad', periods=7)
...
And to run multiple aggregations using a single expression, I think you can use named aggregation. But I have no clue whether it will be more performant or not. Either way, it's better to profile the code and improve the actual bottlenecks.
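For reference, here is a minimal sketch of what named aggregation looks like (the output column names are made up); note that it covers reducing aggregations such as sum or max, while transformations like cumsum or pct_change still go through the grouped columns as shown above:
# Each keyword defines an output column as (source column, aggregation function).
summary = dfall.groupby(level=0).agg(
    total_cases=('cases', 'sum'),
    peak_cases=('cases', 'max'),
    total_deaths=('deaths', 'sum'),
)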
I am trying to parse raw data results from a text file into an organised tuple, but I am having trouble getting it right.
My raw data from the text file looks something like this:
Episode Cumulative Results
EpisodeXD0281119
Date collected21/10/2019
Time collected10:00
Real time PCR for M. tuberculosis (Xpert MTB/Rif Ultra):
PCR result Mycobacterium tuberculosis complex NOT detected
Bacterial Culture:
Bottle: Type FAN Aerobic Plus
Result No growth after 5 days
EpisodeST32423457
Date collected23/02/2019
Time collected09:00
Gram Stain:
Neutrophils Occasional
Gram positive bacilli Moderate (2+)
Gram negative bacilli Numerous (3+)
Gram negative cocci Moderate (2+)
EpisodeST23423457
Date collected23/02/2019
Time collected09:00
Bacterial Culture:
A heavy growth of
1) Klebsiella pneumoniae subsp pneumoniae (KLEPP)
ensure that this organism does not spread in the ward/unit.
A heavy growth of
2) Enterococcus species (ENCSP)
Antibiotic/Culture KLEPP ENCSP
Trimethoprim-sulfam R
Ampicillin / Amoxic R S
Amoxicillin-clavula R
Ciprofloxacin R
Cefuroxime (Parente R
Cefuroxime (Oral) R
Cefotaxime / Ceftri R
Ceftazidime R
Cefepime R
Gentamicin S
Piperacillin/tazoba R
Ertapenem R
Imipenem S
Meropenem R
S - Sensitive ; I - Intermediate ; R - Resistant ; SDD - Sensitive Dose Dependant
Comment for organism KLEPP:
** Please note: this is a carbapenem-RESISTANT organism. Although some
carbapenems may appear susceptible in vitro, these agents should NOT be used as
MONOTHERAPY in the treatment of this patient. **
Please isolate this patient and practice strict contact precautions. Please
inform Infection Prevention and Control as contact screening might be
indicated.
For further advice on the treatment of this isolate, please contact.
The currently available laboratory methods for performing colistin
susceptibility results are unreliable and may not predict clinical outcome.
Based on published data and clinical experience, colistin is a suitable
therapeutic alternative for carbapenem resistant Acinetobacter spp, as well as
carbapenem resistant Enterobacteriaceae. If colistin is clinically indicated,
please carefully assess clinical response.
EpisodeST234234057
Date collected23/02/2019
Time collected09:00
Authorised by xxxx on 27/02/2019 at 10:35
MIC by E-test:
Organism Klebsiella pneumoniae (KLEPN)
Antibiotic Meropenem
MIC corrected 4 ug/mL
MIC interpretation Resistant
Antibiotic Imipenem
MIC corrected 1 ug/mL
MIC interpretation Sensitive
Antibiotic Ertapenem
MIC corrected 2 ug/mL
MIC interpretation Resistant
EpisodeST23423493
Date collected18/02/2019
Time collected03:15
Potassium 4.4 mmol/L 3.5 - 5.1
EpisodeST45445293
Date collected18/02/2019
Time collected03:15
Creatinine 32 L umol/L 49 - 90
eGFR (MDRD formula) >60 mL/min/1.73 m2
Creatinine 28 L umol/L 49 - 90
eGFR (MDRD formula) >60 mL/min/1.73 m2
Essentially the pattern is that ALL information starts with a unique EPISODE NUMBER, followed by a DATE and TIME, and then the result of whatever test. This is the pattern throughout.
What I am trying to parse into my tuple is the date, time, name of the test and the result - whatever it might be. I have the following code:
from collections import namedtuple  # namedtuple is used for the DS record below

with open(filename) as f:
    data = f.read()
data = data.splitlines()

DS = namedtuple('DS', 'date time name value')
parsed = list()

idx_date = [i for i, r in enumerate(data) if r.strip().startswith('Date')]
for start, stop in zip(idx_date[:-1], idx_date[1:]):
    chunk = data[start:stop]
    date = time = name = value = None
    for row in chunk:
        if not row: continue
        row = row.strip()
        if row.startswith('Episode'): continue
        if row.startswith('Date'):
            _, date = row.split()
            date = date.replace('collected', '')
        elif row.startswith('Time'):
            _, time = row.split()
            time = time.replace('collected', '')
        else:
            name, value, *_ = row.split()
            print(name)
    parsed.append(DS(date, time, name, value))

print(parsed)
My problem is that I am unable to find a way to parse the heterogeneous test RESULT into a form that I can use later, for example for the tuple DS ('DS', 'date time name value'):
DATE = 21/10/2019
TIME = 10:00
NAME = Real time PCR for M tuberculosis or Potassium
RESULT = Negative or 4.7
Any advice appreciated. I have hit a brick wall.
I have a variable called FOUNDATION_DATE which includes the following date observations in string format:
'01/Jan/12'
''
''
''
'01/Jan/08'
''
'01/Jan/44'
''
''
'14/Oct/08'
''
''
'12/Jul/04'
'03/Aug/05'
'20/Apr/10'
'30/Dec/98'
'09/Apr/16'
'01/Jan/10'
'01/Dec/01'
'01/Jan/93'
I am using the MATLAB function datetime to transform the above observations into the datetime data type. The code is:
datetime(FOUNDATION_DATE,'InputFormat','dd/MMM/yy')
which provides the following results:
01-Jan-2012
NaT
NaT
NaT
01-Jan-2008
NaT
01-Jan-2044
NaT
NaT
14-Oct-2008
NaT
NaT
12-Jul-2004
03-Aug-2005
20-Apr-2010
30-Dec-1998
09-Apr-2016
01-Jan-2010
01-Dec-2001
01-Jan-1993
While for the majority of cases the transformation is done properly, for the observation '01/Jan/44' it is not, as the year becomes 2044. This issue appears in many other date observations of my variable (only a small sample is presented here), and it is quite strange that it appears for date observations with years before 1969.
Does anyone have a solution for accurately transforming these strings to datetime variables? Any explanation of why this happens?
You want the 'PivotYear' option, which defines which 100-year range a 2-digit year refers to:
datetime( '01/Jan/44', 'inputformat', 'dd/MMM/yy', 'pivotyear', 1930 )
So here the 100-year range is 1930 - 2029
The default, as documented (and therefore not very "strange"), is:
year(datetime('now'))-50 % = 1969 at time of writing (2019)
When the year is given with only 2 digits, MATLAB makes an assumption about what the first two digits are; you can override this by:
startYear = year(datetime('now')) - 99;
datetime('01/Jan/69', 'InputFormat', 'dd/MMM/yy', 'PivotYear', startYear)
That will make any 2-digit dates up until today be interpreted as dates in the past.
I have the current day as a string in MATLAB. For example, if today is '20180703', I need 7 strings containing:
'20180702'
'20180701'
'20180630'
'20180629'
'20180628'
'20180627'
'20180626'
Simple:
t = datetime('20180703', 'InputFormat', 'yyyyMMdd')
t = t - days(1:7)
datestr(t, 'yyyymmdd')
Edit.
As excaza pointed out, datetime and datestr use different format specifiers, hence 'MM' in the first function and 'mm' in the second one.
I would go with something like:
lastSevenDays = arrayfun(@(offset) datestr(now-offset, 'yyyymmdd'), 1:7, 'UniformOutput', false)
or, in a more MATLAB-like way:
datestr(now - days(1:7), 'yyyymmdd')