Python: Calculate number of days between two days - python-3.x

I have two arrays, reference array and target array. Each array has day of year (DOY) information and I am trying to find the difference in actual number of days between the two. Here is the code,
import numpy as np
array_ref = np.array([[362,284],[89,360]])
array_ref
array([[362, 284],
[ 89, 360]])
array_n = np.array([[2, 365], [194, 10]])
array_n
array([[ 2, 365],
[194, 10]])
The absolute difference gives this,
print(abs(array_ref-array_n))
[[360 81]
[105 350]]
However, I am trying to achieve this,
[[5, 81]
[105, 15]]
I am not sure whether I have to use a datetime or timedelta function, or whether there is a simpler way to achieve this. Thanks for your help.

With remainder division.
(array_n-array_ref)%365
array([[ 5, 81],
[105, 15]], dtype=int32)
In general, you may want to check which subtraction is closer:
np.minimum((array_ref-array_n)%365, (array_n-array_ref)%365)
array([[ 5, 81],
[105, 15]], dtype=int32)
Though this will clearly fail to take leap years into account.
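To handle the leap-year caveat, the year length can be made a parameter. The following is only a sketch: the function name doy_distance and the days_in_year argument are made up for illustration, and it assumes both arrays refer to years of the same length.

```python
import numpy as np

def doy_distance(a, b, days_in_year=365):
    """Shortest wrap-around distance in days between two day-of-year arrays."""
    # Plain absolute difference, folded into the range [0, days_in_year)
    diff = np.abs(np.asarray(a) - np.asarray(b)) % days_in_year
    # Going the other way around the year may be shorter
    return np.minimum(diff, days_in_year - diff)

array_ref = np.array([[362, 284], [89, 360]])
array_n = np.array([[2, 365], [194, 10]])

print(doy_distance(array_ref, array_n))
print(doy_distance(array_ref, array_n, days_in_year=366))
```

Unlike the one-directional (array_n-array_ref)%365, this is symmetric in its two arguments, so it does not matter which array is treated as the reference.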

Related

Make predictions on a dataframe with list categorical columns and other types of data

I have a dataframe that looks like this:
df = {'user_id': [23, 34, 12, 9],
'car_id': [[22, 132, 999], [22, 345, 2], [134], [87, 44, 3, 222]],
'start_date': ['2012-02-17', '2013-11-22', '2013-11-22', '2014-03-15'],
'cat_col1': ['str1', 'str2', 'str3', 'str3'],
'cat_col2': [['str1', 'str2'], ['str4'], ['str5, str1'], ['str6', 'str2']],
'cat_col3': [['str11', 'str22', 'str34'], ['str444'], ['str51, str111'], ['str62', 'str233']],
'num_sold': [23, 43, 111, 23],
'to_predict': [0.4, 0.5, 0.22, 0.9]}
There are around 100 000 unique user_ids and 200 000 unique car_ids, and the categorical columns have thousands of unique values, so one-hot encoding (OHE) is not an option. I need to predict to_predict for a given value of cat_col1, cat_col2, cat_col3 (I need to have their original values at the end for predictions). There is a relationship between those categorical columns but it is not clearly defined. Is it possible to do this in keras, perhaps with embedding layers, and would that make sense for categorical columns? If so, would it make sense to utilise the date column and convert it into a time series using LSTMs? Or what would be the best approach for this kind of prediction in general?

Large rounding errors in python plots

I try to plot the following simple sequence
a_n=\frac{3^n+1}{7^n+8}
which should tend to 0, but the plot shows a weird effect for values of $n$ near 20....
I use the code
import numpy as np
import matplotlib.pyplot as plt
def f(n):
    return (3**n+1)/(7**n+8)
n=np.arange(0,25, 1)
plt.plot(n,f(n),'bo-')
On the other hand, computing numerically the above sequence one does not find such large values
for i in range(0,25):
    print([i,f(i)])
[0, 0.2222222222222222]
[1, 0.26666666666666666]
[2, 0.17543859649122806]
[3, 0.07977207977207977]
[4, 0.034039020340390205]
[5, 0.014510853404698185]
[6, 0.0062044757218015075]
[7, 0.0026567874970706124]
[8, 0.0011382857610720493]
[9, 0.00048778777316480816]
[10, 0.00020904485804220367]
[11, 8.958964415487241e-05]
[12, 3.8395417418579486e-05]
[13, 1.6455158259653074e-05]
[14, 7.05220773432529e-06]
[15, 3.022374322043928e-06]
[16, 1.295303220696569e-06]
[17, 5.551299431298911e-07]
[18, 2.3791283154177113e-07]
[19, 1.0196264191387531e-07]
[20, 4.3698275080881505e-08]
[21, 1.872783217393992e-08]
[22, 8.026213788319863e-09]
[23, 3.439805909206865e-09]
[24, 1.4742025325067883e-09]
Why is this happening?
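The issue is not with matplotlib, but with the datatype of the numbers that arange is producing. You did not specify a dtype, and the docs for arange state that it is then inferred from the other input arguments. Your inputs are integers, so the array holds fixed-width integers (32-bit on my machine), and 3**n and 7**n silently overflow once they exceed the largest value that type can represent. Your loop over range does not show this because Python's built-in int has arbitrary precision. When I check the type: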
print(type(n[0]))
<class 'numpy.int32'>
If I change the dtype to single precision floats, we get the behavior you expect:
n = np.arange(0,25,1, dtype=np.float32)
print(type(n[0]))
<class 'numpy.float32'>
plt.plot(n,f(n),'bo-')
Alternatively, you could just write the step as 1. instead of 1, which makes arange produce double-precision floats (even though the resulting array contains whole numbers [0., 1., 2., ...]).
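To see the overflow directly, here is a small sketch that forces a 32-bit integer dtype (so it behaves the same on every platform) and compares the result against exact Python integers. The variable names are made up for illustration.

```python
import numpy as np

def f(n):
    return (3**n + 1) / (7**n + 8)

n_int32 = np.arange(0, 25, 1, dtype=np.int32)    # fixed-width integers
n_float = np.arange(0, 25, 1, dtype=np.float64)  # double-precision floats

# Reference values computed with Python ints, which never overflow
exact = np.array([f(i) for i in range(25)])

wrong = f(n_int32)  # 7**n wraps around silently once it exceeds the int32 range
right = f(n_float)

# Index of the first element where the int32 result departs from the reference
first_bad = int(np.argmax(~np.isclose(wrong, exact, rtol=1e-9, atol=0)))
print("first wrong index:", first_bad)
print("float64 matches reference:", np.allclose(right, exact, rtol=1e-9, atol=0))
```

With a 64-bit integer dtype the same wrap-around still happens, just at a larger n, so converting to floats (or using Python ints) is the more robust fix.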

Value Error when trying to create a dictionary with lists as values

I am having issues creating a dictionary that assigns a list of multiple values to each key. Currently the data is a list of two-item lists, category and value, for example:
sample_data = [["January", 9],["Februrary", 10], ["June", 12], ["March", 15], ["January", 10],["June", 14], ["March", 16]]
It has to be transformed into a dictionary like this:
d = {"January" : [9,10], "February":[10], "June":[12,14], "March": [15,16]}
This is my current code:
from collections import defaultdict
d = defaultdict(list)
for category, value in sample_data:
    d[category].append(value)
This works for small samples, but with very large samples of data it raises a ValueError saying "too many values to unpack". Is there any way I could improve on this code, or is there another way of doing this?
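So, the setdefault method creates a list as the value for a key if that key is not present yet.
d = defaultdict(list)
for category, value in sample_data:
    # setdefault also works on a plain dict; with defaultdict(list) it is redundant
    d.setdefault(category, []).append(value)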
Output:
defaultdict(<class 'list'>, {'January': [9, 10], 'Februrary': [10], 'June': [12, 14], 'March': [15, 16]})
Note: I do not have a larger sample set to work with but the setdefault() method could possibly help out with that.
One way to address this is to change the code so it accepts more than one value per row. This is only a guess: the problem could be in your data, for example a row for one month that carries two or more values at once.
Note: *value means the name can take multiple values (it collects everything after the category into a list).
Without the * before value, the loop can unpack only one number per row. That is why you got the error "too many values to unpack".
The sample data is not complete enough to show the exact failure point, so there may be other issues with the data. But this should help you eliminate that error, or at least narrow it down.
data = [["January", 9],["Februrary", 10], ["June", 12],
["March", 15], ["January", 10],["June", 14], ["March", 16],
['April', 20, 21, 22]] # <--- add April & 3 values (to handle the earlier error)
from collections import defaultdict
# d = {"January" : [9,10], "February":[10], "June":[12,14],
# "March": [15,16]}
# This is my current code:
dc = defaultdict(list)
for category, *value in data: # *value to accept multiple values
    dc[category].append(value)
print(dc)
output:
defaultdict(<class 'list'>, {'January': [[9], [10]], 'Februrary': [[10]], 'June': [[12], [14]], 'March': [[15], [16]], 'April': [[20, 21, 22]]})
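Note that appending the starred name nests each row's values in their own list. If flat lists like the ones in the question are wanted, extend can be used instead of append. This is a sketch on made-up rows (the "April" and "May" entries are invented to show the irregular cases):

```python
from collections import defaultdict

rows = [["January", 9], ["June", 12], ["January", 10],
        ["April", 20, 21, 22],  # more than one value in a row
        ["May"]]                # a row with no value at all

grouped = defaultdict(list)
for row in rows:
    if not row:
        continue  # skip completely empty rows
    category, *values = row
    # extend adds the values one by one, so the result stays a flat list
    grouped[category].extend(values)

print(dict(grouped))
```

Rows of any length pass through this loop without a ValueError, which also makes it easier to find the irregular rows afterwards.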

count the number of array elements that belong to a given numerical range

Given a NumPy array and a numerical range, e.g., [0.2, 0.3], are there any functions to count the array elements that fall within this range? Thanks.
As in the answer posted here, you can use np.where and np.logical_and to get the indices of the matching elements, then apply len() to the index array:
import numpy as np
a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56])
b = np.where(np.logical_and(a>=6, a<=10))
# returns (array([3, 4, 5]),)
# np.where returns a tuple with one index array per dimension, so take its first element
len(b[0])
# returns 3
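If only the count is needed, the index array can be skipped altogether by counting the True entries of the boolean mask. A minimal sketch, with low and high as illustrative names for the range bounds:

```python
import numpy as np

a = np.array([1, 3, 5, 6, 9, 10, 14, 15, 56])
low, high = 6, 10

# Boolean mask that is True where low <= a <= high
mask = (a >= low) & (a <= high)

# Count the True entries of the mask
count = int(np.count_nonzero(mask))
print(count)  # 3
```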

Get top-n items of every row in a scipy sparse matrix

After reading this similar question, I still can't fully understand how to go about implementing the solution I'm looking for. I have a sparse matrix, e.g.:
import numpy as np
from scipy import sparse
arr = np.array([[0,5,3,0,2],[6,0,4,9,0],[0,0,0,6,8]])
arr_csc = sparse.csc_matrix(arr)
I would like to efficiently get the top n items of each row, without converting the sparse matrix to dense.
The end result should look like this (assuming n=2):
top_n_arr = np.array([[0,5,3,0,0],[6,0,0,9,0],[0,0,0,6,8]])
top_n_arr_csc = sparse.csc_matrix(top_n_arr)
What is wrong with the linked answer? Does it not work in your case? or you just don't understand it? Or it isn't efficient enough?
I was going to suggest working out a means of finding the top values for a row of an lil format matrix, and apply that row by row. But I would just be repeating my earlier answer.
OK, my previous answer was a start, but lacked some details on iterating through the lil format. Here's a start; it probably could be cleaned up.
Make the array, and a lil version:
In [42]: arr = np.array([[0,5,3,0,2],[6,0,4,9,0],[0,0,0,6,8]])
In [43]: arr_sp=sparse.csc_matrix(arr)
In [44]: arr_ll=arr_sp.tolil()
The row function from the previous answer:
def max_n(row_data, row_indices, n):
    i = row_data.argsort()[-n:]
    # i = row_data.argpartition(-n)[-n:]
    top_values = row_data[i]
    top_indices = row_indices[i] # do the sparse indices matter?
    return top_values, top_indices, i
Iterate over the rows of arr_ll, apply this function and replace the elements:
In [46]: for i in range(arr_ll.shape[0]):
   ....:     d,r=max_n(np.array(arr_ll.data[i]),np.array(arr_ll.rows[i]),2)[:2]
   ....:     arr_ll.data[i]=d.tolist()
   ....:     arr_ll.rows[i]=r.tolist()
   ....:
In [47]: arr_ll.data
Out[47]: array([[3, 5], [6, 9], [6, 8]], dtype=object)
In [48]: arr_ll.rows
Out[48]: array([[2, 1], [0, 3], [3, 4]], dtype=object)
In [49]: arr_ll.tocsc().A
Out[49]:
array([[0, 5, 3, 0, 0],
[6, 0, 0, 9, 0],
[0, 0, 0, 6, 8]])
In the lil format, the data is stored in 2 object type arrays, as sublists, one with the data numbers, the other with the column indices.
Viewing the data attributes of a sparse matrix is handy when doing new things. Changing those attributes has some risk, since it can mess up the whole array. But it looks like the lil format can be tweaked like this safely.
The csr format is better for accessing rows than csc. Its data is stored in 3 arrays: data, indices and indptr. The lil format effectively splits 2 of those arrays into sublists based on information in the indptr. csr is great for math (multiplication, addition etc), but not so good when changing the sparsity (turning nonzero values into zeros).
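The steps above can be collected into one function. This is only a sketch: keep_top_n_per_row is a made-up name, it works on a lil copy rather than the original matrix, and it re-sorts the kept positions so that each row's column indices stay in ascending order (an extra precaution, not something the session above does).

```python
import numpy as np
from scipy import sparse

def keep_top_n_per_row(mat, n):
    """Return a CSR matrix keeping only the n largest stored values of each row."""
    lil = mat.tolil(copy=True)  # work on a copy in lil format
    for i in range(lil.shape[0]):
        data = np.asarray(lil.data[i])
        cols = np.asarray(lil.rows[i])
        if data.size > n:
            # positions of the n largest values, put back into column order
            keep = np.sort(np.argsort(data)[-n:])
            lil.data[i] = data[keep].tolist()
            lil.rows[i] = cols[keep].tolist()
    return lil.tocsr()

arr = np.array([[0, 5, 3, 0, 2], [6, 0, 4, 9, 0], [0, 0, 0, 6, 8]])
top2 = keep_top_n_per_row(sparse.csc_matrix(arr), 2)
print(top2.toarray())
```

Only the final print converts to dense, and only for display; the selection itself runs on the per-row lists of the lil matrix.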
