sklearn: create sparse feature vector from multiple choice features

I have data which looks like this:
20003.0, 20003.1, 20004.0, 20004.1, 34
1, 2, 3, 4, 5
where the columns ending in .* are multiple choice and the values they take on are categorical.
Ideally, I would be able to pass [{'20003': ['1', '2'], '20004': ['3', '4'], '34': 5}] to DictVectorizer but this is not supported yet.
How should this data be loaded to create sparse feature vectors?

If you can sum up the values for the same groups of columns:
In [63]: df
Out[63]:
20003.0 20003.1 20004.0 20004.1 34
0 1 2 3 4 5
In [64]: d = df.groupby(df.columns.str.split('.').str[0], axis=1).sum().to_dict('r')
In [65]: d
Out[65]: [{'20003': 3, '20004': 7, '34': 5}]
In [66]: from sklearn.feature_extraction import DictVectorizer
In [67]: v = DictVectorizer()
In [68]: X = v.fit_transform(d)
In [69]: X.toarray()
Out[69]: array([[ 3., 7., 5.]])
In [70]: v.inverse_transform(X)
Out[70]: [{'20003': 3.0, '20004': 7.0, '34': 5.0}]
Explanation:
In [71]: df.groupby(df.columns.str.split('.').str[0], axis=1).sum()
Out[71]:
20003 20004 34
0 3 7 5
In [72]: df.groupby(df.columns.str.split('.').str[0], axis=1).sum().to_dict('r')
Out[72]: [{'20003': 3, '20004': 7, '34': 5}]
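On recent pandas versions the same idea still works, though DataFrame.groupby(axis=1) and the 'r' shorthand for to_dict are deprecated. A minimal sketch of an equivalent, assuming a recent pandas and scikit-learn:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame([[1, 2, 3, 4, 5]],
                  columns=['20003.0', '20003.1', '20004.0', '20004.1', '34'])

# Group the columns by their prefix without groupby(axis=1):
# transpose, group the rows, sum, and transpose back.
grouped = df.T.groupby(df.columns.str.split('.').str[0]).sum().T
d = grouped.to_dict('records')  # 'records' is the non-deprecated spelling of 'r'

v = DictVectorizer()
X = v.fit_transform(d)
print(X.toarray())  # [[3. 7. 5.]]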

Related

How to write a function that works with a dictionary, a Series, and a DataFrame column?

I am trying to write a function that works with a Series and a DataFrame.
import pandas as pd

dct = {10: 0.5, 20: 2, 30: 3, 40: 4}

# Defining the function
def funtion_dict(row, dict1):
    total_area = row['total_area']
    if total_area.round(-1) in dict1:
        return dict1.get(total_area.round(-1)) * total_area

# checking the function in a test situation
row = pd.DataFrame(
    {
        'total_area': [53, 14.8, 94, 77, 12],
        'b': [5, 4, 3, 2, 1],
        'c': ['X', 'Y', 'Y', 'Y', 'Z'],
    }
)
print(funtion_dict(row, dct))
I keep getting the error "'Series' objects are mutable, thus they cannot be hashed". Please help.
This is the expected behavior: you are trying to use a Series as a dictionary lookup key, which is not allowed.
From your code,
dct = {10: 0.5, 20: 2, 30: 3, 40: 4}
df = pd.DataFrame({
    'total_area': [53, 14.8, 94, 77, 12],
    'b': [5, 4, 3, 2, 1],
    'c': ['X', 'Y', 'Y', 'Y', 'Z'],
})
If you want to add another column to your data frame with multipliers matched from a dictionary, you can do it like so:
df['new_column'] = df['total_area'].round(-1).map(dct) * df['total_area']
which will then give you
total_area b c new_column
0 53.0 5 X NaN
1 14.8 4 Y 7.4
2 94.0 3 Y NaN
3 77.0 2 Y NaN
4 12.0 1 Z 6.0
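The NaN rows appear because the rounded areas (50, 90, 80) have no entry in the dictionary. If you still want a function form rather than .map(), the fix is to apply it row by row so that row['total_area'] is a scalar; a minimal sketch (the respelled function name and the NaN default are my own choices):
import pandas as pd

dct = {10: 0.5, 20: 2, 30: 3, 40: 4}
df = pd.DataFrame({'total_area': [53, 14.8, 94, 77, 12]})

# With axis=1, `row` is a single record, so row['total_area'] is a
# scalar and can safely be used as a dictionary key.
def function_dict(row, dict1):
    key = round(row['total_area'], -1)
    return dict1.get(key, float('nan')) * row['total_area']

df['new_column'] = df.apply(function_dict, axis=1, dict1=dct)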

Calculating weighted average using grouped .agg in pandas

I would like to calculate, by group, the mean of one column and the weighted mean of another column in a dataset using the .agg() function within pandas. I am aware of a few solutions, but they aren't very concise.
One solution has been posted here (pandas and groupby: how to calculate weighted averages within an agg), but it still doesn't seem very flexible, because the weights column is hard-coded into the lambda function definition. I'm looking to create a syntax closer to this:
(
    df
    .groupby(['group'])
    .agg(avg_x=('x', 'mean'),
         wt_avg_y=('y', 'weighted_mean', weights='weight'))
)
Here is a fully worked example with code that seems needlessly complicated:
import pandas as pd
import numpy as np
# sample dataset
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})
df
#>>> group x y weights
#>>> 0 a 1 5 0.75
#>>> 1 a 2 6 0.25
#>>> 2 b 3 7 0.75
#>>> 3 b 4 8 0.25
# aggregation logic
summary = pd.concat(
    [
        df.groupby(['group']).x.mean(),
        df.groupby(['group']).apply(lambda x: np.average(x['y'], weights=x['weights']))
    ], axis=1
)
# manipulation to format the output of the aggregation
summary = summary.reset_index().rename(columns={'x': 'avg_x', 0: 'wt_avg_y'})
# final output
summary
#>>> group avg_x wt_avg_y
#>>> 0 a 1.50 5.25
#>>> 1 b 3.50 7.25
Using the .apply() method on the entire DataFrame was the simplest solution I could arrive at that does not hardcode the column name inside the function definition.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})
summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series([
            np.mean(x['x']),
            np.average(x['y'], weights=x['weights'])
        ], index=['avg_x', 'wt_avg_y'])
    )
    .reset_index()
)
# final output
summary
#>>> group avg_x wt_avg_y
#>>> 0 a 1.50 5.25
#>>> 1 b 3.50 7.25
How about this:
grouped = df.groupby('group')

def wavg(group):
    group['mean_x'] = group['x'].mean()
    group['wavg_y'] = np.average(group['y'], weights=group.loc[:, "weights"])
    return group

grouped.apply(wavg)
(Note that this broadcasts the group statistics back onto every input row rather than producing one row per group.)
Try:
df["weights"]=df["weights"].div(df.join(df.groupby("group")["weights"].sum(), on="group", rsuffix="_2").iloc[:, -1])
df["y"]=df["y"].mul(df["weights"])
res=df.groupby("group", as_index=False).agg({"x": "mean", "y": "sum"})
Outputs:
group x y
0 a 1.5 5.25
1 b 3.5 7.25
Since your weights sum to 1 within groups, you can assign a new column and groupby as usual:
(df.assign(wt_avg_y=df['y'] * df['weights'])
   .groupby('group')
   .agg({'x': 'mean', 'wt_avg_y': 'sum', 'weights': 'sum'})
   .assign(wt_avg_y=lambda x: x['wt_avg_y'] / x['weights'])
)
Output:
x wt_avg_y weights
group
a 1.5 5.25 1.0
b 3.5 7.25 1.0
Steven M. Mortimer's solution is clean and easy to read. Alternatively, one could use dict notation inside pd.Series() such that the index= argument is not needed. This provides slightly better readability in my opinion.
summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series({
            'avg_x': np.mean(x['x']),
            'wt_avg_y': np.average(x['y'], weights=x['weights'])
        }))
    .reset_index()
)
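To avoid hardcoding column names without giving up readability, a small factory function can also be passed to .apply(); a sketch using the same sample data (weighted_mean here is a hypothetical helper, not a pandas built-in):
import numpy as np
import pandas as pd

def weighted_mean(value_col, weight_col):
    # Returns a function that computes the weighted mean of one
    # column of a group, weighted by another column.
    def _wm(g):
        return np.average(g[value_col], weights=g[weight_col])
    return _wm

wt_avg = (
    df.groupby('group')
      .apply(weighted_mean('y', 'weights'))
      .rename('wt_avg_y')
      .reset_index()
)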

Pandas dataframe to dictionary with custom format

I'm trying to convert a pandas DataFrame into a dictionary, but I need a specific output format. I have been reading and reviewing many other answers but I can't solve it; my dataframe looks like:
label Min Max Prom Desv. Est. Cr Tz Cpk Zup Zlow PPM % OOS # Datos
0 test1 1.25 1.46 1.329 0.0426 1.161 -0.023 0.697 2.090 3.077 19354 2 268
1 test2 4.80 5.50 5.110 0.1368 0.774 -1.097 0.926 2.778 4.972 2735 0 268
2 test3 2.58 2.96 2.747 0.0709 0.760 -1.029 0.973 2.918 4.977 1762 0 268
I've tried this (and others options but this is the most similar to the desire output):
dict = df.set_index('label').groupby('label').apply(lambda g: g.values.tolist()).to_dict()
And I got:
{'test1': [[1.25, 1.46, 1.329, 0.0426, 1.161, -0.023, 0.697, 2.09, 3.077, 19354.0, 2.0, 268.0]],
'test2': [[4.8, 5.5, 5.11, 0.1368, 0.774, -1.097, 0.926, 2.778, 4.972, 2735.0, 0.0, 268.0]],
'test3': [[2.58, 2.96, 2.747, 0.0709, 0.76, -1.0290, 0.973, 2.918, 4.977, 1762.0, 0.0, 268.0]]}
But what I'm looking for is something like:
{'label':'test1', 'cols':[1.25, 1.46, 1.329, 0.0426, 1.161, -0.023, 0.697, 2.09, 3.077, 19354.0, 2.0, 268.0]},
{'label':'test2', 'cols': [4.8, 5.5, 5.11, 0.1368, 0.774, -1.097, 0.926, 2.778, 4.972, 2735.0, 0.0, 268.0]},
{'label':'test3', 'cols': [2.58, 2.96, 2.747, 0.0709, 0.76, -1.0290, 0.973, 2.918, 4.977, 1762.0, 0.0, 268.0]}
Many thanks in advance for any idea or suggestion.
You can use a lambda function to build the output you want:
df.apply(lambda x: {'label':x.label, 'cols': x.tolist()[1:]}, axis=1).tolist()
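The same structure can also be built via to_dict('records'); a sketch, assuming 'label' is a column as in the question:
# Build one {'label': ..., 'cols': [...]} dict per row, dropping the
# label from the value list.
records = [
    {'label': rec['label'],
     'cols': [v for k, v in rec.items() if k != 'label']}
    for rec in df.to_dict('records')
]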
Well, reading the title of your question literally, there's always .to_dict():
>>> df = pd.DataFrame([dict(a=1, b=2), dict(a=3, b=4), dict(a=5, b=6)])
>>> df
a b
0 1 2
1 3 4
2 5 6
>>> df.to_dict()
{'a': {0: 1, 1: 3, 2: 5}, 'b': {0: 2, 1: 4, 2: 6}}
But your example suggests you're looking for a list of dicts,
as might conveniently be produced by iterrows or itertuples:
>>> df = pd.DataFrame([dict(a=1, b=2), dict(a=3, b=4), dict(a=5, b=6)])
>>> df
a b
0 1 2
1 3 4
2 5 6
>>>
>>> for i, row in df.iterrows():
... print(dict(row), list(row))
...
{'a': 1, 'b': 2} [1, 2]
{'a': 3, 'b': 4} [3, 4]
{'a': 5, 'b': 6} [5, 6]
>>>
>>> for row in df.itertuples(index=False):
... print(dict(row._asdict()))
...
{'a': 1, 'b': 2}
{'a': 3, 'b': 4}
{'a': 5, 'b': 6}
Using list(row)[1:] to skip past the label would probably fit the bill for you.

How does TensorFlow deal with np.nan?

Please consider the following code:
x = tf.constant([[[1, np.nan, np.nan], [4, 3, -1]],
                 [[10, np.nan, 3], [20, 5, -7]],
                 [[5, np.nan, 3], [np.nan, 15, -17]]])
x_max = tf.reduce_max(x, reduction_indices=[0])
with tf.Session() as sess:
    print(np.shape(sess.run(x)))
    print(sess.run(x))
    print(sess.run(x_max))
The output is as follows:
(3, 2, 3)
[[[ 1. nan nan]
[ 4. 3. -1.]]
[[ 10. nan 3.]
[ 20. 5. -7.]]
[[ 5. nan 3.]
[ nan 15. -17.]]]
[[ 10. -inf 3.]
[ 20. 15. -1.]]
Now my question is: how does TensorFlow deal with np.nan? Is there anything like numpy.nanmax or similar?
Quoting this link (credit goes to Yaroslav Bulatov):
Different parts of TensorFlow treat them differently:
* Float computations (usually?) propagate them.
* Int conversion treats them as 0.
* Int computations fail with an error.
* Python parts of TensorFlow often raise an error on NaN, i.e., trying to add a NaN summary to a histogram will fail with a Python exception.
Here is an example for some float operations:
a = tf.constant([1.0, np.nan])
b = tf.constant(np.nan)
r = tf.reduce_min(a)
m = a * b
with tf.Session() as sess:
    print(sess.run(r))  # prints 1.0
    print(sess.run(m))  # array([nan, nan], dtype=float32)
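If numpy.nanmax-like behavior is wanted, it can be emulated by masking the NaNs before reducing; a minimal sketch in the same TF1-style API as above:
import numpy as np
import tensorflow as tf

x = tf.constant([[1.0, np.nan], [np.nan, 4.0]])

# Replace NaNs with -inf so the reduction ignores them, like np.nanmax.
neg_inf = tf.fill(tf.shape(x), float('-inf'))
nan_max = tf.reduce_max(tf.where(tf.is_nan(x), neg_inf, x), axis=0)

with tf.Session() as sess:
    print(sess.run(nan_max))  # [1. 4.]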

Vector combination and array sorting

I have two column vectors
A = [8, 2, 2, 1]
B = ['John', 'Peter', 'Paul', 'Evans']
How do I combine them to have an array of
C =
8 'John'
2 'Peter'
2 'Paul'
1 'Evans'
And how do I sort C in ascending order such that I have
C =
1 'Evans'
2 'Paul'
2 'Peter'
8 'John'
I just migrated to Python from Matlab and I am having difficulty with this.
Here's an approach using np.argsort with kind='mergesort', which is stable and therefore preserves the original order among equal elements, to get the sorted indices. We can then stack the input arrays as columns and index into the result with those indices for the desired output, like so -
In [172]: A
Out[172]: array([8, 2, 2, 1])
In [173]: B
Out[173]:
array(['John', 'Peter', 'Paul', 'Evans'],
dtype='|S5')
In [174]: sidx = A.argsort(kind='mergesort')
In [175]: np.column_stack((A,B))[sidx]
Out[175]:
array([['1', 'Evans'],
['2', 'Peter'],
['2', 'Paul'],
['8', 'John']],
dtype='|S21')
If you would like to keep the int type for the first column in the output array, you could create an object dtype array, like so -
arr = np.empty((len(A), 2), dtype=object)
arr[:, 0] = A
arr[:, 1] = B
out = arr[sidx]  # sidx is the same as in the previous approach
Result -
In [189]: out
Out[189]:
array([[1, 'Evans'],
[2, 'Peter'],
[2, 'Paul'],
[8, 'John']], dtype=object)
Use np.column_stack() and np.sort():
In [9]: np.column_stack((A, B))
Out[9]:
array([['8', 'John'],
['2', 'Peter'],
['2', 'Paul'],
['1', 'Evans']],
dtype='<U5')
In [10]:
In [10]: np.sort(np.column_stack((A, B)))[::-1]
Out[10]:
array([['1', 'Evans'],
['2', 'Paul'],
['2', 'Peter'],
['8', 'John']],
dtype='<U5')
