I am trying to modify the overlapping time period problem so that a 1-day gap between dates still counts as an overlap: as long as the difference between dates is less than 2 days, the periods should be treated as overlapping.
This is the dataframe containing the dates:
import pandas as pd

df_dates = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103, 104, 104, 104, 102, 104, 104, 103, 106, 106, 106],
"start dates": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3), pd.Timestamp(2002,10,20), pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9), pd.Timestamp(2005, 2, 8), pd.Timestamp(1993, 1, 1), pd.Timestamp(2005, 2, 3), pd.Timestamp(2005, 2, 16), pd.Timestamp(2002, 11, 16), pd.Timestamp(2005, 2, 23), pd.Timestamp(2005, 10, 11), pd.Timestamp(2015, 2, 9), pd.Timestamp(2011, 11, 24), pd.Timestamp(2011, 11, 24), pd.Timestamp(2011, 12, 21)],
"end dates": [pd.Timestamp(2002, 1, 3), pd.Timestamp(2002, 12, 3),pd.Timestamp(2002,11,20), pd.Timestamp(2003, 4, 4), pd.Timestamp(2004, 11, 1), pd.Timestamp(2015, 2, 8), pd.Timestamp(2005, 2, 3), pd.Timestamp(2005, 2, 15) , pd.Timestamp(2005, 2, 21), pd.Timestamp(2003, 2, 16), pd.Timestamp(2005, 10, 8), pd.Timestamp(2005, 10, 21), pd.Timestamp(2015, 2, 17), pd.Timestamp(2011, 12, 31), pd.Timestamp(2011, 11, 25), pd.Timestamp(2011, 12, 22)]
})
This was helpful with answering the basic overlap question, but I am not sure how to modify it to also count a 1-day difference as an overlap.
This was my attempt at answering the question; it partly works, but the overlap calculation is not always right:
from datetime import timedelta

def Dates_Restructure(df, pers_id, start_dates, end_dates):
    df.sort_values([pers_id, start_dates], inplace=True)
    # True where the current start is more than 1 day after the previous end, i.e. a new group starts.
    df['overlap'] = (df.groupby(pers_id)
                       .apply(lambda x: (x[end_dates].shift() - x[start_dates]) < timedelta(days=-1))
                       .reset_index(level=0, drop=True))
    df['cumsum'] = df.groupby(pers_id)['overlap'].cumsum()
    return df.groupby([pers_id, 'cumsum']).aggregate({start_dates: min, end_dates: max}).reset_index()
I would appreciate your help with this. Thanks.
This is the answer I came up with, and it worked. I combined the two solutions from my question to get it.
import pandas as pd
from datetime import timedelta

def Dates_Restructure(df_dates, pers_id, start_dates, end_dates):
    df2 = df_dates.copy()
    # Turn every period into two events: +1 at its start, -1 at its end.
    startdf2 = pd.DataFrame({pers_id: df2[pers_id], 'time': df2[start_dates], 'start_end': 1})
    enddf2 = pd.DataFrame({pers_id: df2[pers_id], 'time': df2[end_dates], 'start_end': -1})
    mergedf2 = pd.concat([startdf2, enddf2]).sort_values([pers_id, 'time'])
    # A running sum of 1 at a start event means no other period is open, so a new group begins there.
    mergedf2['cumsum'] = mergedf2.groupby(pers_id)['start_end'].cumsum()
    mergedf2['new_start'] = mergedf2['cumsum'].eq(1) & mergedf2['start_end'].eq(1)
    mergedf2['group'] = mergedf2.groupby(pers_id)['new_start'].cumsum()
    df2['group_id'] = mergedf2['group'].loc[mergedf2['start_end'].eq(1)]
    df3 = df2.groupby([pers_id, 'group_id']).aggregate({start_dates: min, end_dates: max}).reset_index()
    # Second pass: periods whose gap is at most 1 day share a GROUP_ID and are merged as well.
    df3.sort_values([pers_id, start_dates], inplace=True)
    df3['overlap'] = (df3.groupby(pers_id)
                         .apply(lambda x: (x[end_dates].shift() - x[start_dates]) < timedelta(days=-1))
                         .reset_index(level=0, drop=True))
    df3['GROUP_ID'] = df3.groupby(pers_id)['overlap'].cumsum()
    return df3.groupby([pers_id, 'GROUP_ID']).aggregate({start_dates: min, end_dates: max}).reset_index()
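For reference, a minimal usage sketch with the dataframe from the question (the column names are the ones defined there):

merged = Dates_Restructure(df_dates, 'id', 'start dates', 'end dates')
print(merged)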
I'm reading data from an API and have a list of lists like this:
listData = [[datetime.datetime(2018, 1, 1, 5, 0), -6.78125],
[datetime.datetime(2018, 1, 1, 7, 0), -6.125],
[datetime.datetime(2018, 1, 1, 8, 0), -5.90625]]
I need to create a complete list filling in the missing values. I've created a destination, like this:
listDest = [[datetime.datetime(2018, 1, 1, 5, 0), None],
[datetime.datetime(2018, 1, 1, 6, 0), None],
[datetime.datetime(2018, 1, 1, 7, 0), None],
[datetime.datetime(2018, 1, 1, 8, 0), None]]
The end result should look like this:
[[datetime.datetime(2018, 1, 1, 5, 0), -6.78125],
[datetime.datetime(2018, 1, 1, 6, 0), None],
[datetime.datetime(2018, 1, 1, 7, 0), -6.125],
[datetime.datetime(2018, 1, 1, 8, 0), -5.90625]]
Here is the code I've tried:
for blankTime, blankValue in listDest:
    for dataTime, dataValue in listData:
        if blankTime == dataTime:
            blankIndex = listDest.index(blankTime)
            dataIndex = listData.index(dataTime)
            listDest[blankIndex] = tempRm7[dataIndex]
This returns the following error, which is confusing since I know that value is in both lists.
ValueError: datetime.datetime(2018, 1, 1, 5, 0) is not in list
I attempted to adapt the methods in this answer but that's for a 1D list and I couldn't figure out how to make it work for my 2D list.
If both lists are sorted, you can merge them and then group them by timestamp (using heapq.merge/itertools.groupby). For equal timestamps, merge keeps the element from listData (the first iterable) ahead of the placeholder from listDest, so next(g) picks the filled-in value when one exists and the None placeholder otherwise:
import datetime
from heapq import merge
from itertools import groupby
listData = [[datetime.datetime(2018, 1, 1, 5, 0), -6.78125],
[datetime.datetime(2018, 1, 1, 7, 0), -6.125],
[datetime.datetime(2018, 1, 1, 8, 0), -5.90625]]
listDest = [[datetime.datetime(2018, 1, 1, 5, 0), None],
[datetime.datetime(2018, 1, 1, 6, 0), None],
[datetime.datetime(2018, 1, 1, 7, 0), None],
[datetime.datetime(2018, 1, 1, 8, 0), None]]
out = [next(g) for _, g in groupby(merge(listData, listDest, key=lambda k: k[0]), lambda k: k[0])]
# pretty print to screen:
from pprint import pprint
pprint(out)
Prints:
[[datetime.datetime(2018, 1, 1, 5, 0), -6.78125],
[datetime.datetime(2018, 1, 1, 6, 0), None],
[datetime.datetime(2018, 1, 1, 7, 0), -6.125],
[datetime.datetime(2018, 1, 1, 8, 0), -5.90625]]
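As a side note, a simpler alternative that does not need either list to be sorted is to build a lookup dict from listData and rebuild listDest from it (a rough sketch, not taken from the answer above):

# dict() accepts the [timestamp, value] pairs directly; missing timestamps map to None.
lookup = dict(listData)
out = [[t, lookup.get(t)] for t, _ in listDest]
pprint(out)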
I have the frequency of each bigram in a dataset. I need to sort them in descending order and visualise the top n bigrams. These are the frequencies associated with each bigram:
{('best', 'price'): 95, ('price', 'range'): 190, ('range', 'got'): 5, ('got', 'diwali'): 2, ('diwali', 'sale'): 2, ('sale', 'simply'): 1, ('simply', 'amazed'): 1, ('amazed', 'performance'): 1, ('performance', 'camera'): 30, ('camera', 'clarity'): 35, ('clarity', 'device'): 1, ('device', 'speed'): 1, ('speed', 'looks'): 1, ('looks', 'display'): 1, ('display', 'everything'): 2, ('everything', 'nice'): 5, ('nice', 'heats'): 2, ('heats', 'lot'): 14, ('lot', 'u'): 2, ('u', 'using'): 3, ('using', 'months'): 20, ('months', 'no'): 10, ('no', 'problems'): 8, ('problems', 'whatsoever'): 1, ('whatsoever', 'great'): 1}
Can anyone help me visualise these bigrams?
If I understand you correctly, this is what you need
import seaborn as sns
import matplotlib.pyplot as plt

bg_dict = {('best', 'price'): 95, ('price', 'range'): 190, ('range', 'got'): 5, ('got', 'diwali'): 2, ('diwali', 'sale'): 2, ('sale', 'simply'): 1,
           ('simply', 'amazed'): 1, ('amazed', 'performance'): 1, ('performance', 'camera'): 30, ('camera', 'clarity'): 35, ('clarity', 'device'): 1,
           ('device', 'speed'): 1, ('speed', 'looks'): 1, ('looks', 'display'): 1, ('display', 'everything'): 2, ('everything', 'nice'): 5, ('nice', 'heats'): 2, ('heats', 'lot'): 14,
           ('lot', 'u'): 2, ('u', 'using'): 3, ('using', 'months'): 20, ('months', 'no'): 10, ('no', 'problems'): 8, ('problems', 'whatsoever'): 1, ('whatsoever', 'great'): 1}

# Sort the (bigram, count) pairs by count, largest first.
bg_dict_sorted = sorted(bg_dict.items(), key=lambda kv: kv[1], reverse=True)
bg, counts = list(zip(*bg_dict_sorted))
# Join each bigram tuple into a single label, e.g. ('best', 'price') -> 'best-price'.
bg_str = list(map(lambda x: '-'.join(x), bg))
sns.barplot(x=bg_str, y=counts)
plt.xticks(rotation=90)
plt.show()
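If you only want the top n bigrams, slice the sorted list before plotting (a small sketch; n = 10 is just an illustrative cutoff):

n = 10  # illustrative cutoff
top_bg, top_counts = zip(*bg_dict_sorted[:n])
sns.barplot(x=['-'.join(b) for b in top_bg], y=top_counts)
plt.xticks(rotation=90)
plt.show()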
I have a numpy array of milliseconds as integers, which I want to convert to an array of Python datetimes via a timedelta operation.
The following MWE works, but I'm convinced there is a more elegant approach, or one with better performance, than multiplication by 1 ms.
import numpy as np
import pandas as pd
from datetime import timedelta
start = pd.Timestamp('2016-01-02 03:04:56.789101').to_pydatetime()
dt = np.array([19, 14980, 19620, 54964615, 54964655, 86433958])
time_arr = start + dt * timedelta(milliseconds=1)
So your approach produces:
In [56]: start = pd.Timestamp('2016-01-02 03:04:56.789101').to_pydatetime()
In [57]: start
Out[57]: datetime.datetime(2016, 1, 2, 3, 4, 56, 789101)
In [58]: dt = np.array([ 19, 14980, 19620, 54964615, 54964655, 86433958])
In [59]: time_arr = start + dt * timedelta(milliseconds=1)
In [60]: time_arr
Out[60]:
array([datetime.datetime(2016, 1, 2, 3, 4, 56, 808101),
datetime.datetime(2016, 1, 2, 3, 5, 11, 769101),
datetime.datetime(2016, 1, 2, 3, 5, 16, 409101),
datetime.datetime(2016, 1, 2, 18, 21, 1, 404101),
datetime.datetime(2016, 1, 2, 18, 21, 1, 444101),
datetime.datetime(2016, 1, 3, 3, 5, 30, 747101)], dtype=object)
The equivalent using np.datetime64 types:
In [61]: dt.astype('timedelta64[ms]')
Out[61]: array([ 19, 14980, 19620, 54964615, 54964655, 86433958], dtype='timedelta64[ms]')
In [62]: np.datetime64(start)
Out[62]: numpy.datetime64('2016-01-02T03:04:56.789101')
In [63]: np.datetime64(start) + dt.astype('timedelta64[ms]')
Out[63]:
array(['2016-01-02T03:04:56.808101', '2016-01-02T03:05:11.769101',
'2016-01-02T03:05:16.409101', '2016-01-02T18:21:01.404101',
'2016-01-02T18:21:01.444101', '2016-01-03T03:05:30.747101'], dtype='datetime64[us]')
I can produce the same array from your time_arr with np.array(time_arr, dtype='datetime64[us]').
tolist converts these datetime64 items to datetime objects:
In [97]: t1=np.datetime64(start) + dt.astype('timedelta64[ms]')
In [98]: t1.tolist()
Out[98]:
[datetime.datetime(2016, 1, 2, 3, 4, 56, 808101),
datetime.datetime(2016, 1, 2, 3, 5, 11, 769101),
datetime.datetime(2016, 1, 2, 3, 5, 16, 409101),
datetime.datetime(2016, 1, 2, 18, 21, 1, 404101),
datetime.datetime(2016, 1, 2, 18, 21, 1, 444101),
datetime.datetime(2016, 1, 3, 3, 5, 30, 747101)]
or wrap it back in an array to get your time_arr:
In [99]: np.array(t1.tolist())
Out[99]:
array([datetime.datetime(2016, 1, 2, 3, 4, 56, 808101),
...
datetime.datetime(2016, 1, 3, 3, 5, 30, 747101)], dtype=object)
Just for the calculation, datetime64 is faster, but with the conversions it may not be the fastest overall.
https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html
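For completeness, pandas can do the same conversion in one step (a rough sketch, not from the answer above; it converts back to Python datetime objects at the end):

import numpy as np
import pandas as pd

start = pd.Timestamp('2016-01-02 03:04:56.789101')
dt = np.array([19, 14980, 19620, 54964615, 54964655, 86433958])
# Timestamp + TimedeltaIndex -> DatetimeIndex; to_pydatetime() yields datetime.datetime objects.
time_arr = (start + pd.to_timedelta(dt, unit='ms')).to_pydatetime()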
I am using pyspark and pyspark-cassandra.
I have noticed this behaviour on multiple versions of Cassandra (3.0.x and 3.6.x) using COPY, sstableloader, and now saveToCassandra in pyspark.
I have the following schema
CREATE TABLE test (
id int,
time timestamp,
a int,
b int,
c int,
PRIMARY KEY ((id), time)
) WITH CLUSTERING ORDER BY (time DESC);
and the following data
(1, datetime.datetime(2015, 3, 1, 0, 18, 18, tzinfo=<UTC>), 1, 0, 0)
(1, datetime.datetime(2015, 3, 1, 0, 19, 12, tzinfo=<UTC>), 0, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 22, 59, tzinfo=<UTC>), 1, 0, 0)
(1, datetime.datetime(2015, 3, 1, 0, 23, 52, tzinfo=<UTC>), 0, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 32, 2, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 32, 8, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 43, 30, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 44, 12, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 48, 49, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 49, 7, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 50, 5, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 50, 53, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 51, 53, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 51, 59, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 54, 35, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 55, 28, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 55, 55, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 56, 24, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 11, 14, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 11, 17, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 12, 8, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 12, 10, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 17, 43, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 17, 49, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 12, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 24, tzinfo=<UTC>), 2, 1, 0)
Towards the end of the data, there are two rows which have the same timestamp.
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 1, 2, 0)
It is my understanding that when I save to Cassandra, one of these will "win" - there will only be one row.
After writing to Cassandra using
rdd.saveToCassandra(keyspace, table, ['id', 'time', 'a', 'b', 'c'])
Neither row appears to have won. Rather, the rows seem to have "merged".
1 | 2015-03-01 01:17:43+0000 | 1 | 2 | 0
1 | 2015-03-01 01:17:49+0000 | 0 | 3 | 0
1 | 2015-03-01 01:24:12+0000 | 1 | 2 | 0
1 | 2015-03-01 01:24:18+0000 | 2 | 2 | 0
1 | 2015-03-01 01:24:24+0000 | 2 | 1 | 0
Rather than the 2015-03-01 01:24:18+0000 containing (1, 2, 0) or (2, 1, 0), it contains (2, 2, 0).
What is happening here? I can't for the life of me figure out what is causing this behaviour.
This is a little-known effect that comes from batching writes together. Batching assigns the same write timestamp to every insert in the batch, and when two writes carry exactly the same timestamp there is a special merge rule, since there is no "last" write. The Spark Cassandra Connector uses intra-partition batches by default, so this kind of clobbering of values is very likely to trigger it.
With two identical write timestamps, the merge keeps the greater value of each column.
Given table (key, a, b):

Begin batch
    Insert "foo", 2, 1
    Insert "foo", 1, 2
End batch
The batch gives both mutations the same timestamp. Cassandra cannot choose a "last written" value since both happened at the same time, so it simply takes the greater of the two values for each column. The merged result will be
"foo", 2, 2