Make predictions on a dataframe with list categorical columns and other types of data - python-3.x

I have a dataframe that looks like this:
df = {'user_id': [23, 34, 12, 9],
      'car_id': [[22, 132, 999], [22, 345, 2], [134], [87, 44, 3, 222]],
      'start_date': ['2012-02-17', '2013-11-22', '2013-11-22', '2014-03-15'],
      'cat_col1': ['str1', 'str2', 'str3', 'str3'],
      'cat_col2': [['str1', 'str2'], ['str4'], ['str5', 'str1'], ['str6', 'str2']],
      'cat_col3': [['str11', 'str22', 'str34'], ['str444'], ['str51', 'str111'], ['str62', 'str233']],
      'num_sold': [23, 43, 111, 23],
      'to_predict': [0.4, 0.5, 0.22, 0.9]}
There are around 100,000 unique user_ids and 200,000 unique car_ids, and the categorical columns have thousands of unique values, so one-hot encoding (OHE) is not an option. I need to predict to_predict for given values of cat_col1, cat_col2 and cat_col3 (I need their original values at the end for predictions). There is a relationship between those categorical columns, but it is not clearly defined. Is it possible to do this in Keras, perhaps with embedding layers, and would that make sense for categorical columns? If so, would it make sense to use the date column and convert the data into a time series with LSTMs? Or what would be the best approach for this kind of prediction in general?
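Since no answer is recorded here, a minimal framework-free sketch of the embedding idea may help. An embedding layer is just a trainable lookup table mapping each id to a dense vector; a variable-length list column can be reduced to a fixed-size vector by mean-pooling. The table below is randomly initialised rather than trained, and the sizes are illustrative assumptions (in Keras this table would be a `keras.layers.Embedding(n_cars, emb_dim)`):

```python
import numpy as np

# Hypothetical vocabulary size and embedding width (tunable assumptions)
n_cars, emb_dim = 200_000, 16
rng = np.random.default_rng(0)

# An embedding is a lookup table: one dense, learned row per id.
# Here it is random only to keep the sketch framework-free.
car_table = rng.normal(size=(n_cars, emb_dim))

def embed_list(ids, table):
    """Mean-pool the embeddings of a variable-length id list into one vector."""
    return table[np.asarray(ids)].mean(axis=0)

row_vec = embed_list([22, 132, 999], car_table)
print(row_vec.shape)  # (16,) - fixed-size vector regardless of list length
```

The pooled vectors for each categorical column can then be concatenated with the numeric features and fed to a dense network; the LSTM question depends on whether consecutive dates per user really form a sequence.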

Related

How to select indices according to another tensor in pytorch?

I have two tensors, a and b, and I want to retrieve the values of b at the positions of the max values in a. That is,
max_values, indices = torch.max(a, dim=0, keepdim=True)
However, I do not know how to use the indices to retrieve the values of b. Can anybody help solve it? Thanks a lot!
Edit:
Sorry for not describing my problem concretely. To give a minimal example, the values of tensors a and b are:
a = torch.tensor([[1,2,4],[2,1,3]])
b = torch.tensor([[10,24,2],[23,4,5]])
If I use torch.max(a, dim=0, keepdim=True), it will return:
max: tensor([[2, 2, 4]])
indices: tensor([[1, 0, 0]])
What I want to obtain is the selected value of tensor b according to the indices of max values of a in dim=0, that is,
tensor([[23, 24, 2]])
I have tried b[indices], but the result is not what I want:
tensor([[[23,  4,  5],
         [10, 24,  2],
         [10, 24,  2]]])
You can use torch.gather:
torch.gather(b, dim=0, index=indices)
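For readers without PyTorch at hand, the same selection can be sketched in NumPy, where np.take_along_axis plays the role of torch.gather:

```python
import numpy as np

a = np.array([[1, 2, 4], [2, 1, 3]])
b = np.array([[10, 24, 2], [23, 4, 5]])

# argmax along axis 0 plays the role of torch.max(...).indices
indices = a.argmax(axis=0)[None, :]          # [[1, 0, 0]]

# np.take_along_axis is the NumPy analogue of torch.gather
result = np.take_along_axis(b, indices, axis=0)
print(result)  # [[23 24  2]]
```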

Value Error when trying to create a dictionary with lists as values

I am having issues creating a dictionary that assigns a list of multiple values to each key. Currently the data is in a list of lists, each holding two items, a category and a value, for example:
sample_data = [["January", 9],["Februrary", 10], ["June", 12], ["March", 15], ["January", 10],["June", 14], ["March", 16]]
It has to be transformed into a dictionary like this:
d = {"January" : [9,10], "February":[10], "June":[12,14], "March": [15,16]}
This is my current code:
from collections import defaultdict

d = defaultdict(list)
for category, value in sample_data:
    d[category].append(value)
This works for small samples, but with very large samples of data it raises a ValueError saying "too many values to unpack". Is there any way I could improve on this code, or is there another way of doing this?
The setdefault method creates a list as the value for a key when the key is not yet present, so a plain dict is enough (combining it with defaultdict would be redundant):
d = {}
for category, value in sample_data:
    d.setdefault(category, []).append(value)
Output:
{'January': [9, 10], 'Februrary': [10], 'June': [12, 14], 'March': [15, 16]}
Note: I do not have a larger sample set to work with, but the setdefault() method could possibly help with that.
One way to solve this is probably to change the code to accept more than one value per row. This is just a guess: there could be a problem in your data (e.g. in value), for example one particular month having two or more data points in a single row.
Note: *value means that it can take multiple values (more than one).
Without the * before value, it can only take one number at a time. That is why you got the error "too many values to unpack...".
Because the sample data is not complete enough to show the exact error point, there are probably other issues with the data, but this could help you eliminate the earlier error, or narrow down to it.
from collections import defaultdict

data = [["January", 9], ["Februrary", 10], ["June", 12],
        ["March", 15], ["January", 10], ["June", 14], ["March", 16],
        ['April', 20, 21, 22]]  # <--- add April & 3 values (to handle the earlier error)

# desired: d = {"January": [9, 10], "February": [10], "June": [12, 14],
#               "March": [15, 16]}
# This is my current code:
dc = defaultdict(list)
for category, *value in data:  # *value to accept multiple values
    dc[category].append(value)
print(dc)
output:
defaultdict(<class 'list'>, {'January': [[9], [10]], 'Februrary': [[10]], 'June': [[12], [14]], 'March': [[15], [16]], 'April': [[20, 21, 22]]})
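If flat value lists are preferred (matching the dictionary shape asked for in the question), one small variation is to extend instead of append, so each row's values are merged rather than nested. A sketch of that variation:

```python
from collections import defaultdict

data = [["January", 9], ["Februrary", 10], ["June", 12],
        ["March", 15], ["January", 10], ["June", 14], ["March", 16],
        ["April", 20, 21, 22]]

dc = defaultdict(list)
for category, *values in data:
    dc[category].extend(values)   # extend flattens; append would nest lists

print(dict(dc))
# {'January': [9, 10], 'Februrary': [10], 'June': [12, 14],
#  'March': [15, 16], 'April': [20, 21, 22]}
```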

How to calculate WAPE for given dataframe in python

I want to know how to calculate the WAPE value if we have a dataframe in the below format.
I am using Python. I need it for evaluating forecasting.
According to Wikipedia, the WAPE (Weighted Absolute Percent Error) can be calculated by
dividing the sum of the absolute deviations by the total sales of all products.
In pandas, you can do that by using the - operator, which works element-wise between series, combined with the abs() and sum() methods for series (and regular float division):
import pandas as pd

df = pd.DataFrame({'Actual': [23, 32, 44, 37, 48, 42, 39],
                   'Forecasted': [25, 30, 41, 34, 45, 47, 40]})
wape = (df.Actual - df.Forecasted).abs().sum() / df.Actual.sum()
print(wape)
0.07169811320754717
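The same computation can be wrapped in a small helper with a guard against a zero denominator; the function name and the guard are illustrative additions:

```python
import pandas as pd

def wape(actual: pd.Series, forecast: pd.Series) -> float:
    """Weighted Absolute Percentage Error: sum of absolute errors
    divided by the sum of actuals."""
    denom = actual.sum()
    if denom == 0:
        raise ValueError("WAPE is undefined when actuals sum to zero")
    return (actual - forecast).abs().sum() / denom

df = pd.DataFrame({'Actual': [23, 32, 44, 37, 48, 42, 39],
                   'Forecasted': [25, 30, 41, 34, 45, 47, 40]})
print(wape(df.Actual, df.Forecasted))  # 0.07169811320754717
```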

Python: Calculate number of days between two days

I have two arrays, reference array and target array. Each array has day of year (DOY) information and I am trying to find the difference in actual number of days between the two. Here is the code,
import numpy as np
array_ref = np.array([[362,284],[89,360]])
array_ref
array([[362, 284],
[ 89, 360]])
array_n = np.array([[2, 365], [194, 10]])
array_n
array([[ 2, 365],
[194, 10]])
The absolute difference gives this,
print(abs(array_ref-array_n))
[[360 81]
[105 350]]
However, I am trying to achieve this,
[[5, 81]
[105, 15]]
I am not sure if I have to use any datetime or timedelta function, or if there is a simpler way to achieve this. Thanks for your help.
With remainder division.
(array_n-array_ref)%365
array([[ 5, 81],
[105, 15]], dtype=int32)
In general, you may want to check which subtraction is closer:
np.minimum((array_ref-array_n)%365, (array_n-array_ref)%365)
array([[ 5, 81],
[105, 15]], dtype=int32)
Though this will clearly fail to take leap years into account.
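If the year of each observation is also available, exact day counts (including leap years) can be sketched with the standard library's datetime; the helper name and example years here are assumptions, not from the question:

```python
from datetime import date, timedelta

def doy_to_date(year: int, doy: int) -> date:
    """Convert a (year, day-of-year) pair to a calendar date."""
    return date(year, 1, 1) + timedelta(days=doy - 1)

# Hypothetical example: DOY 362 of 2019 vs DOY 2 of 2020
d_ref = doy_to_date(2019, 362)   # 2019-12-28
d_new = doy_to_date(2020, 2)     # 2020-01-02
print((d_new - d_ref).days)      # 5

# Leap years are handled automatically: 2020 has a DOY 366
print(doy_to_date(2020, 366))    # 2020-12-31
```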

Input missed data in DF with predicted data

I have data with 200 columns and 30k rows, and some of it is missing. I'd like to predict the missing values and fill them in: predict the None values and put the predicted data there.
I want to split the data by indexes, train a model on the known data, predict the unknown values, join the known and predicted values, and return them to exactly the same places in the data.
P.S. Median, dropna and other such methods are not interesting; I just want prediction of the missing values.
df = {'First': [30, 22, 18, 49, 22], 'Second': [80, 28, 16, 56, 30],
      'Third': [14, None, None, 30, 27], 'Fourth': [14, 85, 17, 22, 14],
      'Fifth': [22, 33, 45, 72, 11]}
df = pd.DataFrame(df, columns = ['First', 'Second', 'Third', 'Fourth'])
The desired result is the same DataFrame with all columns completed with data.
I do not really understand your question either, but I might have an idea for you. Have a look at the fancyimpute package, which offers imputation methods based on predictive models (e.g. KNN). Hope this solves your question.
It is hard to understand the question. However, it seems like you may be interested in this question and the answer.
Using a custom function Series in fillna
Basically (from the link), you would
create a column with predicted values
use fillna with that column as the parameter
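A minimal sketch of that two-step recipe on the question's sample frame, assuming a linear relationship between columns; the least-squares model is an illustrative stand-in, not from the linked answer (fancyimpute or a scikit-learn imputer would be the more robust choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First':  [30, 22, 18, 49, 22],
                   'Second': [80, 28, 16, 56, 30],
                   'Third':  [14, None, None, 30, 27],
                   'Fourth': [14, 85, 17, 22, 14]})

known = df['Third'].notna()
features = ['First', 'Second', 'Fourth']

# Step 1: fit a least-squares linear model (with intercept) on rows
# where 'Third' is known, then predict the missing rows.
X = df.loc[known, features].to_numpy(dtype=float)
y = df.loc[known, 'Third'].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

X_miss = df.loc[~known, features].to_numpy(dtype=float)
pred = pd.Series(np.c_[X_miss, np.ones(len(X_miss))] @ coef,
                 index=df.index[~known])

# Step 2: fillna aligns on the index, so only the missing rows change.
df['Third'] = df['Third'].fillna(pred)
print(df)
```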
