Fill missing data in a DataFrame with predicted data - python-3.x

I have a DataFrame with 200 columns and 30k rows. It contains missing data, and I'd like to predict the missing values and fill them in; that is, predict the None values and put the predicted data in their place.
I want to split the data by indexes, train a model on the known data, predict the unknown values, join the known and predicted values, and return them to exactly the same places in the data.
P.S. Median imputation, dropna and other such methods are not interesting to me; I only want prediction of the missing values.
import pandas as pd

df = {'First': [30, 22, 18, 49, 22],
      'Second': [80, 28, 16, 56, 30],
      'Third': [14, None, None, 30, 27],
      'Fourth': [14, 85, 17, 22, 14],
      'Fifth': [22, 33, 45, 72, 11]}
df = pd.DataFrame(df, columns=['First', 'Second', 'Third', 'Fourth', 'Fifth'])
Desired result: the same DataFrame, with every column completed with data.

I do not fully understand your question either, but I might have an idea for you. Have a look at the fancyimpute package, which offers imputation methods based on predictive models (e.g. KNN). Hope this solves your problem.
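A minimal sketch of that idea on the frame above, reusing df and pd from the question (an assumption: fancyimpute is installed; recent versions expose fit_transform, while older ones used complete()):
from fancyimpute import KNN

# fit_transform returns a plain numpy array with the missing
# entries imputed from the k nearest rows.
filled = KNN(k=3).fit_transform(df.values)
df_filled = pd.DataFrame(filled, columns=df.columns, index=df.index)
print(df_filled)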

It is hard to understand the question. However, it seems like you may be interested in this question and its answer:
Using a custom function Series in fillna
Basically (from the link), you would
create a column with the predicted values
use fillna with that column as the parameter (see the sketch below)
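A minimal sketch of that recipe on the example frame from the question (the choice of RandomForestRegressor and of the feature columns are my assumptions, not part of the linked answer):
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({'First': [30, 22, 18, 49, 22],
                   'Second': [80, 28, 16, 56, 30],
                   'Third': [14, None, None, 30, 27],
                   'Fourth': [14, 85, 17, 22, 14],
                   'Fifth': [22, 33, 45, 72, 11]})

features = ['First', 'Second', 'Fourth', 'Fifth']
known = df[df['Third'].notna()]
unknown = df[df['Third'].isna()]

# Train on the rows where 'Third' is known, predict the rest.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(known[features], known['Third'])

# A Series of predictions indexed like the missing rows; fillna
# aligns on the index, so values land in exactly the right places.
predicted = pd.Series(model.predict(unknown[features]), index=unknown.index)
df['Third'] = df['Third'].fillna(predicted)
print(df)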

Related

Make predictions on a dataframe with list categorical columns and other types of data

I have a dataframe that looks like this:
import pandas as pd

df = pd.DataFrame({'user_id': [23, 34, 12, 9],
                   'car_id': [[22, 132, 999], [22, 345, 2], [134], [87, 44, 3, 222]],
                   'start_date': ['2012-02-17', '2013-11-22', '2013-11-22', '2014-03-15'],
                   'cat_col1': ['str1', 'str2', 'str3', 'str3'],
                   'cat_col2': [['str1', 'str2'], ['str4'], ['str5', 'str1'], ['str6', 'str2']],
                   'cat_col3': [['str11', 'str22', 'str34'], ['str444'], ['str51', 'str111'], ['str62', 'str233']],
                   'num_sold': [23, 43, 111, 23],
                   'to_predict': [0.4, 0.5, 0.22, 0.9]})
There are around 100,000 unique user_ids and 200,000 unique car_ids, and the categorical columns have thousands of unique values, so one-hot encoding is not an option. I need to predict to_predict for a given value of cat_col1, cat_col2 and cat_col3 (I need to have their original values at the end for predictions). There is a relationship between those categorical columns, but it is not clearly defined. Is it possible to do this in Keras with embedding layers, and would that make sense for categorical columns? If so, would it make sense to utilise the date column and convert it into a time series using LSTMs? Or what would be the best approach for this kind of prediction in general?
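For what it is worth, embedding layers are the usual way around one-hot encoding for high-cardinality columns. A minimal sketch for a single integer-encoded categorical column (the layer sizes and names here are illustrative assumptions, not a full solution):
from tensorflow import keras

n_categories = 200_000  # e.g. the number of unique car_ids
embedding_dim = 32      # assumption: a modest embedding size

# Each categorical value must first be mapped to an integer id.
cat_input = keras.Input(shape=(1,), dtype='int32')
embedded = keras.layers.Embedding(n_categories, embedding_dim)(cat_input)
flat = keras.layers.Flatten()(embedded)
output = keras.layers.Dense(1)(flat)  # regression head for to_predict

model = keras.Model(cat_input, output)
model.compile(optimizer='adam', loss='mse')
model.summary()
List-valued columns such as cat_col2 could get one embedding per id averaged with GlobalAveragePooling1D, and one such branch per column can be concatenated before the dense head; whether an LSTM over the date column helps depends on whether the rows really form sequences per user.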

Value Error when trying to create a dictionary with lists as values

I am having issues creating a dictionary that assigns a list of multiple values to each key. Currently the data is in a list of lists of 2 items, category and value, for example:
sample_data = [["January", 9],["Februrary", 10], ["June", 12], ["March", 15], ["January", 10],["June", 14], ["March", 16]]
It has to be transformed into a dictionary like this:
d = {"January" : [9,10], "February":[10], "June":[12,14], "March": [15,16]}
This is my current code:
from collections import defaultdict

d = defaultdict(list)
for category, value in sample_data:
    d[category].append(value)
This works for small samples, but with very large samples of data it raises a ValueError saying "too many values to unpack". Is there any way I could improve on this code, or is there another way of doing this?
So, the setdefault method creates a list as the value for a key, which means a plain dict is enough:
d = {}
for category, value in sample_data:
    d.setdefault(category, []).append(value)
Output:
{'January': [9, 10], 'Februrary': [10], 'June': [12, 14], 'March': [15, 16]}
Note: I do not have a larger sample set to test with, but the setdefault() method could possibly help with that.
One way to solve this is probably to change the code to accept more than one value per row. This is just a guess: it could be a problem in your data, e.g. one particular month having 2+ extra data points showing up all at once.
Note: *value means that it can collect multiple values (more than one).
Without the * before value, the loop can only unpack one number at a time; that is why you got the error "too many values to unpack".
Because the sample data is not complete enough to show the exact error point, there is probably some other issue with the data, but this could help you eliminate the earlier error, or narrow down to it.
data = [["January", 9],["Februrary", 10], ["June", 12],
["March", 15], ["January", 10],["June", 14], ["March", 16],
['April', 20, 21, 22]] # <--- add April & 3 values (to handle the earlier error)
from collections import defaultdict
# d = {"January" : [9,10], "February":[10], "June":[12,14],
# "March": [15,16]}
# This is my current code:
dc = defaultdict(list)
for category, *value in data: # *value to accept multiple values
dc[category].append(value)
print(dc)
output:
defaultdict(<class 'list'>, {'January': [[9], [10]], 'Februrary': [[10]], 'June': [[12], [14]], 'March': [[15], [16]], 'April': [[20, 21, 22]]})

How to calculate WAPE for given dataframe in python

I want to know how to calculate the WAPE value if we have a dataframe in the format below.
I am using Python; I need it for evaluating forecasts.
According to Wikipedia, the WAPE (Weighted Absolute Percent Error) can be calculated by dividing the sum of the absolute deviations by the total sales of all products.
In pandas, you can do that using the - operator, which works element-wise between Series, combined with the abs() and sum() methods (and regular float division):
import pandas as pd

df = pd.DataFrame({'Actual': [23, 32, 44, 37, 48, 42, 39],
                   'Forecasted': [25, 30, 41, 34, 45, 47, 40]})
wape = (df.Actual - df.Forecasted).abs().sum() / df.Actual.sum()
print(wape)
0.07169811320754717
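If you need this repeatedly, a small helper is easy to write. A sketch reusing the df above; the guard against a zero actuals total is my addition, not part of the answer:
def wape(actual, forecast):
    # WAPE = sum(|actual - forecast|) / sum(actual)
    total = actual.sum()
    if total == 0:
        raise ValueError('WAPE is undefined when the actuals sum to zero')
    return (actual - forecast).abs().sum() / total

print(wape(df.Actual, df.Forecasted))  # 0.07169811320754717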

Python: unique weighted random values

From this thread I found out that I can use an approach with random.choices for my needs:
import random

class Weights:
    ITEM = {
        'a': 0.5,
        'b': 0.4,
        'c': 0.3,
        'd': 0.2,
        'e': 0.1,
    }

slot_1 = random.choices(population=list(Weights.ITEM.keys()),
                        weights=list(Weights.ITEM.values()), k=1)[0]
slot_2 = ...?
slot_3 = ...?
Is it possible for me to get an array with k=3 that has "unique" results (e.g. ['a', 'b', 'c']), or somehow to exclude any previously selected value from the next call (with k=1)?
For example, let's say slot_1 got "b"; then slot_2 should be drawn from everything else, without the "b" value.
This step can be performance-sensitive, and I think that creating new arrays each time is not a good idea.
Maybe there is something besides random.choices that can be applied in this case.
You could take all the samples at once using numpy's random.choice with the replace=False option and store them with multiple assignment, getting it into one line of code. Note that np.random.choice requires the probabilities to sum to 1, so the weights above (which sum to 1.5) have to be normalized first:
import numpy as np

keys = list(Weights.ITEM.keys())
p = np.array(list(Weights.ITEM.values()))
p = p / p.sum()  # np.random.choice rejects weights that do not sum to 1

slot_1, slot_2, slot_3 = np.random.choice(keys, size=3, replace=False, p=p)
More generally, you could have a function that generates arbitrary-length subsamples (k is the subsample length, n is the number of subsamples; with replace=False, n*k must not exceed the number of values):
def subsamples(n, k, values, weights):
    picks = np.random.choice(values, size=n * k, replace=False, p=weights)
    return [list(sublist) for sublist in np.split(picks, n)]
>>> subsamples(3, 5, range(100), [.01] * 100)
[[39, 34, 27, 91, 88], [19, 98, 62, 55, 38], [37, 22, 54, 11, 84]]
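If you do want the draw-one-at-a-time behaviour from the question, i.e. excluding each picked value from the next call, a small loop over random.choices works as well. A sketch reusing Weights.ITEM from above; random.choices only needs relative weights, so no renormalization step is required:
import random

def weighted_sample_without_replacement(weights_dict, k):
    pool = dict(weights_dict)  # copy, so the original mapping is untouched
    picked = []
    for _ in range(k):
        keys = list(pool)
        choice = random.choices(keys, weights=[pool[key] for key in keys], k=1)[0]
        picked.append(choice)
        del pool[choice]  # exclude it from the next draw
    return picked

print(weighted_sample_without_replacement(Weights.ITEM, 3))  # e.g. ['a', 'c', 'b']
This does rebuild the key and weight lists on every draw, though, so on a hot path the single numpy call above should be faster.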

DataFrame display not as expected

I tried to color the column titles in purple, but the output doesn't come out aligned. Is there a way to fix it?
import pandas as pd

purple_text = '\033[35m'
reset = '\033[m'

list_1 = [12, 27, 33, 56, 11, 90]
list_2 = [43, 55, 76, 26, 84, 62]

df = pd.DataFrame({f'{purple_text} Numbers_1 {reset}': list_1,
                   f'{purple_text} Numbers_2 {reset}': list_2})
print(df.to_string(index=False))
Your issue comes from the fact that the ANSI escape codes add invisible characters to the header text, so pandas computes the wrong column widths.
To remedy this, you can use a display setting; this one works fine:
pd.set_option('display.colheader_justify', 'left')
Results: the headers and columns now line up.
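For completeness, the option just needs to be set before printing; a minimal usage with the snippet from the question:
pd.set_option('display.colheader_justify', 'left')
print(df.to_string(index=False))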
