How to determine the relationship or correlation among categorical input variables, or between input and output variables, in classification? - python-3.x

I am starting to learn machine learning by myself.
I have a set of input variables (categorical and continuous): job (retired, manager, technician, etc.), education (high school, unknown, bachelor, master, etc.), duration of contact, age, marital status, etc. The output variable is yes or no (does the customer agree to purchase a new product?).
First of all I want to analyze the dataset, but I do not know how to find the correlation between input and output variables for discrete input data in Python.
Should I clear all the missing data (unknown)?

Two things come to mind:
1. Look at how the features are correlated
2. Look at what the statistical distributions of purchased vs. not-purchased look like
Feature correlation
import pandas as pd
import seaborn as sns
df = pd.DataFrame({
'job': ['retired', 'retired', 'manager', 'manager', 'manager', 'technician', 'technician', None, None],
'education': ['high', 'high', 'master', 'unknown', 'master', 'master', 'high', 'unknown', 'master'],
'duration_of_contact': [3, 1, 5, 3, 1, 9 ,8, 3, 1],
'age': [50, 65, 30, 29, 38, 42, 25, 10, 10],
'married': [1, 1, 0, 1, 0, 1, 0, 0, 0],
'purchase': [0, 0, 1, 1, 1, 0, 0, 1, 1]
})
sns.heatmap(df.select_dtypes("number").corr())  # corr() needs numeric columns; job and education are strings
Statistical properties
You could look at the distributions of the two cases when purchase is True, and when purchase is False:
sns.boxplot(x="purchase", y="age", data=df)
See more plots you can use from seaborn: https://seaborn.pydata.org/tutorial/categorical.html
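The heatmap above only covers the numeric columns. For a categorical input such as job, a minimal sketch (not part of the original answer) is to cross-tabulate it against the output and, if a correlation number is wanted, one-hot encode it first. Treating missing jobs as their own "unknown" category instead of dropping the rows is an assumption made here for illustration; whether that is appropriate depends on the data.

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["retired", "retired", "manager", "manager", "manager",
            "technician", "technician", None, None],
    "purchase": [0, 0, 1, 1, 1, 0, 0, 1, 1],
})

# Assumption: keep rows with a missing job and label them "unknown".
job = df["job"].fillna("unknown")

# Share of purchase == 0 / 1 within each job category (each row sums to 1).
rates = pd.crosstab(job, df["purchase"], normalize="index")
print(rates)

# One-hot encode the category so it can be correlated with the output.
dummies = pd.get_dummies(job, prefix="job").astype(int)
corr_with_target = dummies.corrwith(df["purchase"])
print(corr_with_target)
```

With only nine rows these numbers are purely illustrative; on a real dataset the same two calls give a quick first look before any modelling.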


Is there a way to use Huggingface pretrained tokenizer with wordpiece prefix?

I'm doing a sequence labeling task with BERT. In order to align the word pieces with labels, I need some marker to identify them, so that I can get a single embedding for each word by either summing or averaging.
For example, I want the word New~york tokenized into New ##~ ##york. Looking at some old examples on the internet, that was what you got by using BertTokenizer before, but clearly not anymore (according to their documentation).
So when I run:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer(batch_sentences, return_tensors="pt")
decoded = tokenizer.decode(inputs["input_ids"][0])
print(decoded)
and I get:
[CLS] hello, i'm testing this efauenufefu [SEP]
But the encoding clearly suggests otherwise: the nonsense at the end was indeed broken up into pieces...
In [4]: inputs
Out[4]:
{'input_ids': tensor([[ 101, 19082, 117, 178, 112, 182, 5193, 1142, 174, 8057,
23404, 16205, 11470, 1358, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
I also tried BertTokenizerFast, which, unlike BertTokenizer, allows you to specify the wordpiece prefix:
tokenizer2 = BertTokenizerFast("bert-base-cased-vocab.txt", wordpieces_prefix = "##")
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer2(batch_sentences, return_tensors="pt")
decoded = tokenizer2.decode(inputs["input_ids"][0])
print(decoded)
Yet the decoder gave me exactly the same...
[CLS] hello, i'm testing this efauenufefu [SEP]
So, is there a way to use the pretrained Huggingface tokenizer with prefix, or must I train a custom tokenizer myself?
Maybe you are looking for tokenize:
from transformers import BertTokenizerFast
t = BertTokenizerFast.from_pretrained('bert-base-uncased')
t.tokenize("hello, i'm testing this efauenufefu")
Output:
['hello',
',',
'i',
"'",
'm',
'testing',
'this',
'e',
'##fa',
'##uen',
'##uf',
'##ef',
'##u']
You can also get a mapping of each token to its respective word, among other things (newer transformers releases expose the same mapping as word_ids()):
o = t("hello, i'm testing this efauenufefu", add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False)
o.words()
Output:
[0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7]
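Given such a word index per token, the averaging the question asks about can be sketched as follows. This is a minimal illustration, not part of the original answer: token_emb is a random stand-in for the model's per-token output, and the hidden size of 4 is an arbitrary assumption.

```python
import torch

# Word index per token, copied from the o.words() output above.
word_ids = [0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7]

# Stand-in for the model's token embeddings (13 tokens, hidden size 4).
# In practice this would come from the model; the values here are random.
torch.manual_seed(0)
token_emb = torch.randn(len(word_ids), 4)

idx = torch.tensor(word_ids)
n_words = int(idx.max()) + 1

# Sum the token vectors that belong to each word ...
sums = torch.zeros(n_words, token_emb.size(1)).index_add_(0, idx, token_emb)
# ... and divide by the number of tokens per word to get the mean.
counts = torch.bincount(idx, minlength=n_words).unsqueeze(1)
word_emb = sums / counts

print(word_emb.shape)
```

Dropping the division by counts gives the summed variant instead of the average.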

How to retain 2D (or more) shape when using PyTorch masked_select

Suppose I have the following two matching shape tensors:
a = tensor([[ 0.0113, -0.1666, 0.5960, -0.0667], [-0.0977, -0.1984, 0.5153, 0.0420]])
selectors = tensor([[ True, True, False, False], [ True, False, True, False]])
When using torch.masked_select to find the values in a that match True indices in selectors like this:
torch.masked_select(a, selectors)
The output will be in 1D shape instead of the original 2D shape:
tensor([ 0.0113, -0.1666, -0.0977, 0.5153])
This is consistent with masked_select behavior as it is given in the documentation (torch.masked_select). However, my goal is to get a result that matches the shape of the two original tensors. I.e.:
tensor([[0.0113, -0.1666], [-0.0977, 0.5153]])
Is there a way to get this without having to loop over all the elements in the tensors and find the mask for each one? Please note that I have also looked into using torch.where, but it doesn't fit the case I have as I see it.
As @jodag pointed out, for general inputs each row of the desired masked result might have a different number of elements, depending on how many True values there are in the same row of selectors. However, you could overcome this by allowing trailing zero padding in the result.
Basic solution:
indices = torch.masked_fill(torch.cumsum(selectors.int(), dim=1), ~selectors, 0)
masked = torch.scatter(input=torch.zeros_like(a), dim=1, index=indices, src=a)[:,1:]
Explanation:
By applying cumsum() row-wise over selectors, we compute for each unmasked element in a the target column it should be copied to in the output tensor. Then, scatter() performs a row-wise scattering of a's elements to these computed target locations. All masked elements are given the index 0, so the first element in each row of the result contains one of the masked elements (which one is arbitrary, and we don't care). We then discard these unwanted first values by taking the slice [:,1:]. Before slicing, the scattered tensor has the same size as the input a; after slicing, the result has one column fewer, which is still enough to hold every selected element of a row.
Usage example:
>>> a = torch.tensor([[ 1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]])
>>> selectors = torch.tensor([[ True, False, False, True, False, True], [False, False, True, True, False, False]])
>>> torch.cumsum(selectors.int(), dim=1)
tensor([[1, 1, 1, 2, 2, 3],
[0, 0, 1, 2, 2, 2]])
>>> indices = torch.masked_fill(torch.cumsum(selectors.int(), dim=1), ~selectors, 0)
>>> indices
tensor([[1, 0, 0, 2, 0, 3],
[0, 0, 1, 2, 0, 0]])
>>> torch.scatter(input=torch.zeros_like(a), dim=1, index=indices, src=a)
tensor([[ 5, 1, 4, 6, 0, 0],
[60, 30, 40, 0, 0, 0]])
>>> torch.scatter(input=torch.zeros_like(a), dim=1, index=indices, src=a)[:,1:]
tensor([[ 1, 4, 6, 0, 0],
[30, 40, 0, 0, 0]])
Adapting output size: Here, the length of dim=1 of the resulting masked tensor is the maximum number of unmasked items in a row. For your original example, the output shape would be (2, 2) as you desired. Note that if this number is not known in advance and a is on CUDA, computing it causes an additional host-device synchronization that might affect performance.
To do so, instead of allocating input=torch.zeros_like(a) for scatter(), allocate it with a.new_zeros(size=(a.size(0), torch.max(indices).item() + 1)). The +1 accounts for the first column, which is sliced out later. The host-device synchronization occurs when the result of max() is read to compute the output size.
Example:
>>> torch.scatter(input=a.new_zeros(size=(a.size(0), torch.max(indices).item() + 1)), dim=1, index=indices, src=a)[:,1:]
tensor([[ 1, 4, 6],
[30, 40, 0]])
Changing the padding value: If a custom padding value is wanted, one could use torch.full_like(a, my_custom_value) rather than torch.zeros_like(a) when allocating the output for scatter().
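The pieces above (cumsum indices, adapted output size, custom padding) can be combined into one helper. This is only a sketch: the function name masked_select_2d and the pad_value parameter are hypothetical, and it is applied here to the tensors from the question.

```python
import torch

def masked_select_2d(a, selectors, pad_value=0):
    """Row-wise masked select for 2D tensors, padded on the right.

    Hypothetical helper: the name and pad_value are illustrative only.
    """
    # Target column for each selected element; masked elements go to column 0.
    indices = torch.cumsum(selectors.int(), dim=1).masked_fill(~selectors, 0)
    # Widest row plus one dump column (this reads a value back from the device).
    width = int(indices.max()) + 1
    out = a.new_full((a.size(0), width), pad_value)
    # Scatter row-wise, then drop the dump column.
    return torch.scatter(out, dim=1, index=indices, src=a)[:, 1:]

a = torch.tensor([[0.0113, -0.1666, 0.5960, -0.0667],
                  [-0.0977, -0.1984, 0.5153, 0.0420]])
selectors = torch.tensor([[True, True, False, False],
                          [True, False, True, False]])
result = masked_select_2d(a, selectors)
print(result)
```

Rows with fewer selected elements than the widest row are filled with pad_value on the right.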

Getting Concordance result of lifelines CoxPH model in a dataframe

I am using CoxPH implementation of lifelines package in python. Currently, results are in tabular view of coefficients and related stats and can be seen with print_summary(). Here is an example
df = pd.DataFrame({'duration': [4, 6, 5, 5, 4, 6],
'event': [0, 0, 0, 1, 1, 1],
'cat': [0, 1, 0, 1, 0, 1]})
cph = CoxPHFitter()
cph.fit(df, duration_col='duration', event_col='event', show_progress=True)
cph.print_summary()
[Image: table of results from print_summary()]
How can I get only the concordance index as a dataframe or list? cph.summary returns a dataframe of the main results, i.e. p-values and coefficients, but it does not include the concordance index and other surrounding information.
You can access the c-index with cph.concordance_index_, and you could put this into a list or dataframe if you wish.
You can also compute the concordance index for a Cox model yourself with concordance_index from lifelines.utils. The code is given below.
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index
# Here df is assumed to have a duration column 'T' and an event column 'E'
cph = CoxPHFitter().fit(df, 'T', 'E')
Cindex = concordance_index(df['T'], -cph.predict_partial_hazard(df), df['E'])
This code gives the C-index value, which also matches cph.concordance_index_.

Avoid duplication of date column for child entity

I have a simple entity set parent1 <- child -> parent2 and a need to use a cutoff dataframe. My target is the parent1 and it's accessible at any time of predictions. I want to specify a date column only for the parent2 so that this time information could be joined to the child. It doesn't work this way and I get data leakage on the first level features from the parent1-child entities. The only thing I can do is to duplicate the date column to the child too. Is it possible to normalize the child avoiding the date column?
Example. Imagine we have 3 entities. Box player information (parent1 with "name"), match information (parent2 with "country"), and their combination (child with "n_hits" in one specific match):
import featuretools as ft
import pandas as pd
players = pd.DataFrame({"player_id": [1, 2, 3], "player_name": ["Oleg", "Kirill", "Max"]})
player_stats = pd.DataFrame({
"match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
"match_id": [11, 11, 12, 12], "n_hits": [20, 30, 40, 50]})
matches = pd.DataFrame({
"match_id": [11, 12], "match_date": pd.to_datetime(['2014-1-10', '2014-1-20']),
"country": ["Russia", "Germany"]})
es = ft.EntitySet()
es.entity_from_dataframe(
entity_id="players", dataframe=players,
index="player_id",
variable_types={"player_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
entity_id="player_stats", dataframe=player_stats,
index="match_player_id",
variable_types={"match_player_id": ft.variable_types.Categorical,
"player_id": ft.variable_types.Categorical,
"match_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
entity_id="matches", dataframe=matches,
index="match_id",
time_index="match_date",
variable_types={"match_id": ft.variable_types.Categorical})
es = es.add_relationship(ft.Relationship(es["players"]["player_id"],
es["player_stats"]["player_id"]))
es = es.add_relationship(ft.Relationship(es["matches"]["match_id"],
es["player_stats"]["match_id"]))
Here I want to use all the information that is available on the 15th of January. So the only legal information is that of the first match, not the second.
cutoff_df = pd.DataFrame({
"player_id":[1, 2, 3],
"match_date": pd.to_datetime(['2014-1-15', '2014-1-15', '2014-1-15'])})
fm, features = ft.dfs(entityset=es, target_entity='players', cutoff_time=cutoff_df,
cutoff_time_in_index=True, agg_primitives = ["mean"])
fm
I got
                      player_name  MEAN(player_stats.n_hits)
player_id  time
1          2014-01-15  Oleg         30
2          2014-01-15  Kirill       30
3          2014-01-15  Max          50
The only way I know to set up a proper match_date for player_stats is to join this information from matches:
player_stats = pd.DataFrame({
"match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
"match_id": [11, 11, 12, 12], "n_hits": [20, 30, 40, 50],
"match_date": pd.to_datetime(
['2014-1-10', '2014-1-10', '2014-1-20', '2014-1-20']) ## a result of join
})
...
es = es.entity_from_dataframe(
entity_id="player_stats", dataframe=player_stats,
index="match_player_id",
time_index="match_date", ## a change here too
variable_types={"match_player_id": ft.variable_types.Categorical,
"player_id": ft.variable_types.Categorical,
"match_id": ft.variable_types.Categorical})
And I get the expected result
                      player_name  MEAN(player_stats.n_hits)
player_id  time
1          2014-01-15  Oleg         20.0
2          2014-01-15  Kirill       30.0
3          2014-01-15  Max          NaN
Featuretools is very conservative when it comes to the time index of an entity. We try not to infer a time index if it isn't provided. Therefore, you have to create the duplicate column as you suggest.
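Since the column has to be duplicated anyway, it does not need to be typed out by hand. A minimal pandas sketch of the join, using the dataframes from the question:

```python
import pandas as pd

player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12], "n_hits": [20, 30, 40, 50]})
matches = pd.DataFrame({
    "match_id": [11, 12],
    "match_date": pd.to_datetime(["2014-1-10", "2014-1-20"]),
    "country": ["Russia", "Germany"]})

# Copy only match_date onto the child rows; a left join keeps every
# player_stats row in its original order.
player_stats = player_stats.merge(
    matches[["match_id", "match_date"]], on="match_id", how="left")
print(player_stats)
```

The resulting player_stats can then be registered with time_index="match_date" as in the question.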

Reshaping / Transforming pandas.Dataframe

Hej there,
I've got the following pandas.DataFrame
df = pandas.DataFrame({
"date": ["2016-12-11", "2016-12-12", "2016-12-13", "2016-12-14", "2016-12-15"],
"dim1": ["dim11", "dim12", "dim12", "dim11", "dim13"],
"dim2": ["dim22", "dim21", "dim21", "dim22", "dim23"],
"dim3": ["dim31", "dim32", "dim32", "dim31", "dim33"],
"val1": [1, 2, 3, 4, 5],
"val2": [6, 7, 8, 9, 10],
"val3": [11,12,13,14,15]
})
What I want now is to specify multiple "dimensions" and multiple "values", so that the DataFrame
is reshaped / transformed so that the specified dimensions and values are "combined" with each other.
Not specified values may vanish but specified dimensions should stay in the resulting DataFrame.
To make it clear a simple example of a resulting DataFrame.
Specified dimensions are: dim1, dim2
Specified values are: val1, val2
df_res = pandas.DataFrame({
"date": ["2016-12-11", "2016-12-12", "2016-12-13", "2016-12-14", "2016-12-15"],
"dim3": ["dim31", "dim32", "dim32", "dim31", "dim33"],
"dim11_dim22_val1": [1, 0, 0, 4, 0],
"dim12_dim21_val1": [0, 2, 3, 0, 0],
"dim13_dim23_val1": [0, 0, 0, 0, 5],
"dim11_dim22_val2": [6, 0, 0, 9, 0],
"dim12_dim21_val2": [0, 7, 8, 0, 0],
"dim13_dim23_val2": [0, 0, 0, 0, 10]
})
So basically there are multiple combinations of dim1, dim2, val1 and val2. val3 drops from the result but the dimensions date and dim3 stay in there.
As a side note: Afterwards I will do a df_res.to_dict(orient="records"), which should output
[
{"date": "2016-12-11", "dim3": "dim31", "dim11_dim22_val1": 1, "dim12_dim21_val1": 0, "dim13_dim23_val1": 0, "dim11_dim22_val2": 6, "dim12_dim21_val2": 0, "dim13_dim23_val2": 0}
...
]
Can I do this with some pandas magic?
Maybe in multiple steps of df.pivot?
Kind regards
Dennis
Part 1:
1) Set the columns starting with dim, along with date, as the index; these stay static during the whole operation. Pass append=True so the original row index is kept as well, which avoids problems with duplicate index entries.
2) Unstack the required levels. Drop the unwanted val3 column and fill missing values with 0.
3) Rename the columns by joining the multi-index tuples with an underscore.
4) Reset the index levels that were not unstacked, and sort the column names to match the required output.
df.set_index(df.filter(like='dim').columns.tolist()+['date'], append=True, inplace=True)
df = df.unstack(level=[2,1]).drop('val3', axis=1).fillna(0).astype(int)
df.columns = ['_'.join(c[::-1]) for c in df.columns]
df_res = df.reset_index(level=[2,1]).sort_index(axis=1)
df_res
Part 2:
df_res.to_dict('records')
produces:
[{'date': '2016-12-11',
'dim11_dim22_val1': 1,
'dim11_dim22_val2': 6,
'dim12_dim21_val1': 0,
'dim12_dim21_val2': 0,
'dim13_dim23_val1': 0,
'dim13_dim23_val2': 0,
'dim3': 'dim31'}, ..........
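If the dimensions and values should be passed in as parameters, an alternative sketch is to build the combined key explicitly and use get_dummies instead of unstack. This is a different technique from the answer above; the function name combine and its keep parameter (the columns to carry over unchanged) are hypothetical.

```python
import pandas as pd

def combine(df, dims, vals, keep):
    """Spread each value column over the observed combinations of dims.

    Hypothetical helper: names and signature are illustrative only.
    """
    # One key per row, e.g. "dim11_dim22".
    key = df[dims].agg("_".join, axis=1)
    # 0/1 indicator column for every observed key.
    indicators = pd.get_dummies(key).astype(int)
    out = df[keep].copy()
    for v in vals:
        # Multiply each indicator row by the row's value, then label the columns.
        out = out.join(indicators.mul(df[v], axis=0).add_suffix(f"_{v}"))
    return out

df = pd.DataFrame({
    "date": ["2016-12-11", "2016-12-12", "2016-12-13", "2016-12-14", "2016-12-15"],
    "dim1": ["dim11", "dim12", "dim12", "dim11", "dim13"],
    "dim2": ["dim22", "dim21", "dim21", "dim22", "dim23"],
    "dim3": ["dim31", "dim32", "dim32", "dim31", "dim33"],
    "val1": [1, 2, 3, 4, 5],
    "val2": [6, 7, 8, 9, 10],
    "val3": [11, 12, 13, 14, 15],
})

df_res = combine(df, dims=["dim1", "dim2"], vals=["val1", "val2"], keep=["date", "dim3"])
records = df_res.to_dict(orient="records")
print(records[0])
```

This sketch assumes the dimension columns hold strings; other dtypes would need to be converted before joining.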
