For a boolean tensor of shape (15, 10), I want to perform bitwise_or along axis 0 so that the resulting tensor has shape (10,). torch.bitwise_or does not support this.
I know it can be done in NumPy using np.bitwise_or.reduce(x, axis=0), but I did not find anything similar in torch. How do I reduce a torch tensor?
Hi, I figured out the problem. If you look at the docstring for the reduce function, it is essentially just a for loop that folds the ufunc over the array, starting from the identity element:
# ufunc reduce docstring (pseudocode)
# op.identity is 0 for bitwise_or
r = op.identity  # op = ufunc
for i in range(len(A)):
    r = op(r, A[i])
return r
So, to solve your problem:
import numpy as np
import torch

bool_arr = np.random.randint(0, 2, (15, 10), dtype=bool)  # create a boolean array
tensor_bool_arr = torch.tensor(bool_arr)  # create the torch version

np_output = np.bitwise_or.reduce(bool_arr, axis=0)
# array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True])

# Create a PyTorch equivalent of the bitwise_or reduce
r = torch.tensor(0)
for i in range(len(tensor_bool_arr)):
    r = torch.bitwise_or(r, tensor_bool_arr[i])
torch_output = r.type(torch.bool)
# tensor([True, True, True, True, True, True, True, True, True, True])

assert torch_output.shape[0] == np_output.shape[0]
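For boolean tensors specifically, a loop-free alternative (my addition, not part of the original answer) is torch.any along the reduction dimension, since OR-reducing booleans just asks whether any element along that axis is True:

import numpy as np
import torch

bool_arr = np.random.randint(0, 2, (15, 10), dtype=bool)
tensor_bool_arr = torch.tensor(bool_arr)

# bitwise_or reduction over axis 0 == "is any element along dim 0 True?"
torch_output = tensor_bool_arr.any(dim=0)  # shape (10,), dtype torch.bool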
Related
I have the following simple function:
def f1(y_true, y_pred):
    return {"f1": 100 * sklearn.metrics.f1_score(y_true, y_pred)}
According to the scikit-learn documentation, the arguments to f1_score can have the following types:
y_true: 1d array-like, or label indicator array / sparse matrix
y_pred: 1d array-like, or label indicator array / sparse matrix
and the output is of type:
float or array of float, shape = [n_unique_labels]
How do I add type hints to this function so that mypy doesn't complain?
I tried variations of the following:
Array1D = NewType('Array1D', Union[np.ndarray, List[np.float64]])

def f1(y_true: Union[List[float], Array1D], y_pred: Union[List[float], Array1D]) -> Dict[str, Union[List[float], Array1D]]:
    return {"f1": 100 * sklearn.metrics.f1_score(y_true, y_pred)}
but that gave errors.
This is the approach I use to avoid similar mypy issues. It takes advantage of the numpy.typing module introduced in NumPy 1.20. The ArrayLike type already covers List[float], so there is no need to cover it explicitly.
Running mypy v0.971 with numpy v1.23.1 on this shows no issues.
from typing import List, Dict

import numpy as np
import numpy.typing as npt
import sklearn.metrics

def f1(y_true: npt.ArrayLike, y_pred: npt.ArrayLike) -> Dict[str, npt.ArrayLike]:
    return {"f1": 100 * sklearn.metrics.f1_score(y_true, y_pred)}

y_true_list: List[float] = [1, 0, 1, 0]
y_pred_list: List[float] = [1, 0, 1, 1]
y_true_np: npt.ArrayLike = np.array(y_true_list)
y_pred_np: npt.ArrayLike = np.array(y_pred_list)

assert f1(y_true_list, y_pred_list) == f1(y_true_np, y_pred_np)
Instead of
Array1D = NewType("Array1D", Union[np.ndarray, List[np.float64]])
you may use a plain type alias:
Array1D = Union[np.ndarray, List[np.float64]]
mypy rejects the NewType version because the second argument to NewType must be a subclassable type, and a Union is not.
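For completeness, here is a sketch of the question's function using that alias (my illustration, mirroring the signature from the question):

from typing import Dict, List, Union

import numpy as np
import sklearn.metrics

Array1D = Union[np.ndarray, List[np.float64]]  # plain type alias, not NewType

def f1(y_true: Union[List[float], Array1D], y_pred: Union[List[float], Array1D]) -> Dict[str, Union[List[float], Array1D]]:
    return {"f1": 100 * sklearn.metrics.f1_score(y_true, y_pred)}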
I have a tensor it defined as:
import torch
it = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')
Given this definition, what does it > 0 mean?
Using the > operator is the same as using the torch.gt() function.
In other words,
it > 0
is the same as
torch.gt(it, 0)
and it returns a boolean tensor (dtype torch.bool in current PyTorch; a ByteTensor in older versions) of the same shape as it, where out[i] is True if it[i] > 0 and False otherwise.
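A quick demonstration with the tensor from the question (my sketch, on CPU rather than cuda:0):

import torch

it = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

print(it > 0)           # tensor([False, False, False, False, False, False, False, False, False, False])
print(torch.gt(it, 0))  # identical result
print((it > 0).dtype)   # torch.bool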
As seen from the question, it is a 1-d tensor consisting of 10 elements.
When we write it > 0, each element of the tensor it is compared with 0, and the output is set to True or False depending on whether that element is greater than 0 or not. The result is also a 1-d boolean tensor of True and False values.
In your case, you will get a 1-d tensor like this: [False, False, False, False, False, False, False, False, False, False], as none of the elements in it is greater than 0 (they are all exactly 0).
In simpler terms:
If result is a variable (actually a 1-d tensor) with the same shape as it, then result = it > 0 can be written as:

for i in range(len(it)):
    if it[i] > 0:
        result[i] = True
    else:
        result[i] = False

But when it is performed as result = it > 0, the execution is much faster than writing our own for/while loop, because the comparison runs as a single vectorized operation.
Hope this helps you.
I am about 4 weeks into the whole Python and machine learning area.
I have written something using LinearClassifier in TensorFlow using Lending Club's data.
However, when I run the script it hangs at some point.
Any help from experienced people would be appreciated. Here is a copy of the script.
""" Collect and load the data """
import os
import tarfile
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from six.moves import urllib
import tensorflow as tf
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
HOME_PATH = os.getcwd()
""" load the csv file with the lending data and convert to tensors """
def convert_duration(s):
try:
if pd.isnull(s):
return s
elif s[0] == '<':
return 0.0
elif s[:2] == '10':
return 10.0
else:
return np.float(s[0])
except TypeError:
return np.float64(s)
def load_data(file_name):
csv_path = os.path.join(HOME_PATH, file_name)
csv_data = pd.read_csv(csv_path, encoding = "ISO-8859-1", dtype={'desc': np.str, 'verification_status_joint': np.str, 'loan_status': np.str})
loans = csv_data.loc[csv_data['loan_status'].isin(['Fully Paid', 'Charged Off'])] # Sort out only fully Paid (Paid) and Charged Off (Default)
loans['loan_status'] = loans['loan_status'].apply(lambda s: np.float(s == 'Fully Paid')) # Convert to boolean integer
# Drop Columns with one distinct data field
for col in loans.columns:
if loans[col].nunique() == 1:
del loans[col]
for col in loans.columns:
if (loans[col].notnull().sum() / len(loans.index)) < 0.1 :
del loans[col]
# Remove all irrelevant columns & hifg prediction columns based on pure descetion
loans.drop(labels=['id', 'member_id', 'grade', 'sub_grade', 'last_credit_pull_d', 'emp_title', 'url', 'desc', 'title', 'issue_d', 'earliest_cr_line', 'last_pymnt_d','addr_state'], axis=1, inplace=True)
# Process the text based variables
# Term
loans['term'] = loans['term'].apply(lambda s:np.float(s[1:3]))
loans['emp_length'] = loans['emp_length'].apply(lambda s: convert_duration(s))
#change zip code to just the first 3 significant digits
loans['zip_code'] = loans['zip_code'].apply(lambda s:np.float(s[:3]))
loans.fillna(0,inplace=True)
loan_data = shuffle(loans)
X = loan_data.drop(labels=['loan_status'], axis=1)
Y = loan_data['loan_status']
## consider processing tensorflow feature columns here and return as one response and standardise at one
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.fit_transform(X_test)
return (X_train, Y_train), (X_test, Y_test)
def my_input_fn(features, labels, batch_size , shuffle=True):
# consider changing categorical columns and all
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
dataset = dataset.shuffle(buffer_size=1000).repeat(count=None).batch(batch_size)
return dataset.make_one_shot_iterator().get_next()
# Start on calls to make data available
(X_train, Y_train), (X_test, Y_test) = load_data("loan_data.csv")
my_feature_columns = []
numerical_columns = ['loan_amnt',
'funded_amnt',
'funded_amnt_inv',
'int_rate',
'installment',
'annual_inc',
'dti',
'delinq_2yrs',
'inq_last_6mths',
'mths_since_last_delinq',
'mths_since_last_record',
'open_acc',
'pub_rec',
'revol_bal',
'revol_util',
'total_acc',
'total_pymnt',
'total_pymnt_inv',
'total_rec_prncp',
'total_rec_int',
'total_rec_late_fee',
'recoveries',
'collection_recovery_fee',
'last_pymnt_amnt',
'collections_12_mths_ex_med',
'mths_since_last_major_derog',
'acc_now_delinq',
'tot_coll_amt',
'tot_cur_bal',
'total_rev_hi_lim']
categorical_columns = ['home_ownership',
'verification_status',
'pymnt_plan',
'purpose',
'initial_list_status',
'application_type']
for key in numerical_columns:
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))
for key in categorical_columns:
    my_feature_columns.append(tf.feature_column.categorical_column_with_hash_bucket(key=key, hash_bucket_size=10))

classifier = tf.estimator.LinearClassifier(
    feature_columns=my_feature_columns
)

classifier.train(
    input_fn=lambda: my_input_fn(X_train, Y_train, 100),
    steps=100
)

eval_result = classifier.evaluate(
    input_fn=lambda: my_input_fn(X_test, Y_test, 100)
)

print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
Here is a sample of the output in the console before it hangs:
/Users/acacia/Desktop/work/machine_learning/tensor_flow/logistic_regression.py:43: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
loans['loan_status'] = loans['loan_status'].apply(lambda s: np.float(s == 'Fully Paid')) # Convert to boolean integer
/Users/acacia/Desktop/work/machine_learning/tensor_flow/logistic_regression.py:53: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
loans.drop(labels=['id', 'member_id', 'grade', 'sub_grade', 'last_credit_pull_d', 'emp_title', 'url', 'desc', 'title', 'issue_d', 'earliest_cr_line', 'last_pymnt_d','addr_state'], axis=1, inplace=True)
/Users/acacia/Desktop/work/machine_learning/tensor_flow/logistic_regression.py:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
loans['term'] = loans['term'].apply(lambda s:np.float(s[1:3]))
/Users/acacia/Desktop/work/machine_learning/tensor_flow/logistic_regression.py:59: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
loans['emp_length'] = loans['emp_length'].apply(lambda s: convert_duration(s))
/Users/acacia/Desktop/work/machine_learning/tensor_flow/logistic_regression.py:62: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
loans['zip_code'] = loans['zip_code'].apply(lambda s:np.float(s[:3]))
/Users/acacia/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py:3035: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
downcast=downcast, **kwargs)
INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: /var/folders/2t/bhtmq3ln5mb6mv26w6pfbq_m0000gn/T/tmpictbxp6x
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/2t/bhtmq3ln5mb6mv26w6pfbq_m0000gn/T/tmpictbxp6x', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1a205d6358>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/2t/bhtmq3ln5mb6mv26w6pfbq_m0000gn/T/tmpictbxp6x/model.ckpt.
INFO:tensorflow:loss = 69.31472, step = 1
INFO:tensorflow:Saving checkpoints for 100 into /var/folders/2t/bhtmq3ln5mb6mv26w6pfbq_m0000gn/T/tmpictbxp6x/model.ckpt.
INFO:tensorflow:Loss for final step: 0.0.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-05-07-10:55:12
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/2t/bhtmq3ln5mb6mv26w6pfbq_m0000gn/T/tmpictbxp6x/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
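One observation on the hang (my note, not from the original thread): my_input_fn builds its dataset with .repeat(count=None), which repeats indefinitely. That is fine for classifier.train, which is capped by steps=100, but classifier.evaluate is called without a steps argument, and an Estimator evaluates until the input function signals end-of-input, which an infinitely repeated dataset never does. A minimal sketch of an evaluation-safe input function:

def my_eval_input_fn(features, labels, batch_size):
    # No shuffle and, crucially, no repeat: one pass over the data,
    # so classifier.evaluate terminates when the dataset is exhausted.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    dataset = dataset.batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()

eval_result = classifier.evaluate(
    input_fn=lambda: my_eval_input_fn(X_test, Y_test, 100)
)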
I am trying to train a TensorFlow-based random forest regression on numerical and continuous data.
When I try to fit my estimator, it begins with the message below:
INFO:tensorflow:Constructing forest with params =
INFO:tensorflow:{'num_trees': 10, 'max_nodes': 1000, 'bagging_fraction': 1.0, 'feature_bagging_fraction': 1.0, 'num_splits_to_consider': 10, 'max_fertile_nodes': 0, 'split_after_samples': 250, 'valid_leaf_threshold': 1, 'dominate_method': 'bootstrap', 'dominate_fraction': 0.99, 'model_name': 'all_dense', 'split_finish_name': 'basic', 'split_pruning_name': 'none', 'collate_examples': False, 'checkpoint_stats': False, 'use_running_stats_method': False, 'initialize_average_splits': False, 'inference_tree_paths': False, 'param_file': None, 'split_name': 'less_or_equal', 'early_finish_check_every_samples': 0, 'prune_every_samples': 0, 'feature_columns': [_NumericColumn(key='Average_Score', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='lat', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), _NumericColumn(key='lng', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)], 'num_classes': 1, 'num_features': 2, 'regression': True, 'bagged_num_features': 2, 'bagged_features': None, 'num_outputs': 1, 'num_output_columns': 2, 'base_random_seed': 0, 'leaf_model_type': 2, 'stats_model_type': 2, 'finish_type': 0, 'pruning_type': 0, 'split_type': 0}
Then the process breaks down and I get the ValueError below:
ValueError: Shape must be at least rank 2 but is rank 1 for 'concat' (op: 'ConcatV2') with input shapes: [?], [?], [?], [] and with computed input tensors: input[3] = <1>.
This is the code I am using:
import tensorflow as tf
from tensorflow.contrib.tensor_forest.python import tensor_forest
from tensorflow.python.ops import resources
import pandas as pd
from tensorflow.contrib.tensor_forest.client import random_forest
from tensorflow.python.estimator.inputs import numpy_io
import numpy as np
def getFeatures():
Average_Score = tf.feature_column.numeric_column('Average_Score')
lat = tf.feature_column.numeric_column('lat')
lng = tf.feature_column.numeric_column('lng')
return [Average_Score,lat ,lng]
# Import hotel data
Hotel_Reviews=pd.read_csv("./DataMining/Hotel_Reviews.csv")
Hotel_Reviews_Filtered=Hotel_Reviews[(Hotel_Reviews.lat.notnull() |
Hotel_Reviews.lng.notnull())]
Hotel_Reviews_Filtered_Target = Hotel_Reviews_Filtered[["Reviewer_Score"]]
Hotel_Reviews_Filtered_Features = Hotel_Reviews_Filtered[["Average_Score","lat","lng"]]
#Preprocess the data
x=Hotel_Reviews_Filtered_Features.to_dict('list')
for key in x:
x[key] = np.array(x[key])
y=Hotel_Reviews_Filtered_Target.values
#specify params
params = tf.contrib.tensor_forest.python.tensor_forest.ForestHParams(
feature_colums= getFeatures(),
num_classes=1,
num_features=2,
regression=True,
num_trees=10,
max_nodes=1000)
#build the graph
graph_builder_class = tensor_forest.RandomForestGraphs
est=random_forest.TensorForestEstimator(
params, graph_builder_class=graph_builder_class)
#define input function
train_input_fn = numpy_io.numpy_input_fn(
x=x,
y=y,
batch_size=1000,
num_epochs=1,
shuffle=True)
est.fit(input_fn=train_input_fn, steps=500)
The variable x is a dict of numpy arrays, each of shape (512470,):
{'Average_Score': array([ 7.7, 7.7, 7.7, ..., 8.1, 8.1, 8.1]),
'lat': array([ 52.3605759, 52.3605759, 52.3605759, ..., 48.2037451,
48.2037451, 48.2037451]),
'lng': array([ 4.9159683, 4.9159683, 4.9159683, ..., 16.3356767,
16.3356767, 16.3356767])}
The variable y is a numpy array of shape (512470, 1):
array([[ 2.9],
[ 7.5],
[ 7.1],
...,
[ 2.5],
[ 8.8],
[ 8.3]])
Force each array in x to be 2 dim using ndmin=2. Then the shapes should match and concat should be able to operate.
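A minimal sketch of that suggestion (my interpretation; the transpose keeps the sample axis first, which numpy_input_fn expects):

import numpy as np

for key in x:
    # ndmin=2 turns shape (512470,) into (1, 512470); transposing gives
    # (512470, 1), so every feature is rank 2 with samples on axis 0.
    x[key] = np.array(x[key], ndmin=2).T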
I want to skip some data points that have specific labels (e.g. if label >= 7). My code is here:
true = tf.constant(True)
less_op = tf.less(label, tf.constant(delimiter))
label = tf.cast(
    tf.slice(record_bytes, [0], [label_bytes]), tf.int32)
tf.cond(less_op, lambda: true, lambda: true)
and on the 4th line I get the error: ValueError: Shapes (1,) and () are not compatible. My assumption is that it is caused by less_op (if I substitute it with true, the code works). I also investigated that there is some problem with label: the code less_op = tf.less(tf.constant(1), tf.constant(delimiter)) works perfectly.
TensorFlow expects the predicate to be of shape None or [], not (1,). It's weird behavior that should be fixed, in my opinion, because here tf.less returns a tensor of shape (1,) and not shape ().
Change this:
tf.cond(less_op, lambda: true, lambda: true)
to this:
tf.cond(tf.reshape(less_op, []), lambda: true, lambda: true)
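For context, a self-contained sketch of the fix (my example with assumed values, TF 1.x graph mode):

import tensorflow as tf

label = tf.constant([5])                    # shape (1,), like the label in the question
less_op = tf.less(label, tf.constant([7]))  # also shape (1,)

# tf.cond needs a scalar predicate, so collapse the shape-(1,) tensor to shape ().
pred = tf.reshape(less_op, [])
result = tf.cond(pred, lambda: tf.constant(True), lambda: tf.constant(False))

with tf.Session() as sess:
    print(sess.run(result))  # True, since 5 < 7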