I am trying to build a predictive model and found this line of code somewhere. What does it do? Here it is: "X_train.reset_index(inplace = True)"
I think it would help if you provided more context for your question. In the meantime: the line you have shown resets the index of the training dataset of whatever model you're working with (usually X denotes the data and Y denotes the labels).
The dataset is a pandas DataFrame object, and reset_index replaces its current index with the default integer index (0, 1, 2, ...), so that each row is numbered instead of named. Unless you pass drop=True, the old index is kept as a new column, and inplace=True modifies X_train directly instead of returning a new DataFrame. You can find more information about this in the documentation for this method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
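To make the effect concrete, here is a minimal sketch. The DataFrame contents and the name X_other are made up for illustration; only reset_index and its drop and inplace parameters come from pandas itself.

```python
import pandas as pd

# Hypothetical training data whose index is no longer 0..n-1,
# for example after a shuffle or a train/test split.
X_train = pd.DataFrame({"feature": [10, 20, 30]}, index=[7, 2, 5])

# Default behaviour: the old index is kept as a new column named "index".
X_train.reset_index(inplace=True)
print(X_train)

# With drop=True the old index is discarded instead.
X_other = pd.DataFrame({"feature": [10, 20, 30]}, index=[7, 2, 5])
X_other.reset_index(drop=True, inplace=True)
print(X_other)
```

If you do not want the extra "index" column in your features, drop=True is probably the variant you are after.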
I'm familiar with SBERT and its pre-trained models and they are amazing! But at the same time, I want to understand how the results are calculated, and I can't find anything more specific on their website.
For example, I have a document and I want to find other documents that are similar to it. I used 2 documents containing 200-250 words each (I changed the model.max_seq_length to 350 so the model can handle bigger texts), and in the end we can see that the cosine-similarity is 0.79. Is that all we can see? Is there a way to extract the main phrases/keywords that made the model return this high value of similarity?
Thanks in advance!
Have you tried to make either a simple word-count-comparison between the two documents and other random documents? Or a tf-idf, if the two documents are part of a bigger corpus?
Another thing you can do is look inside the "stored_embeddings" matrix (see code below) in which SBERT encodes your sentences (e.g. for 20,000 documents you'll get a 20,000 x 384 matrix), after having saved it into a pickle file like this:
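from sentence_transformers import SentenceTransformer
import pickle

# Assumption: any SBERT model works here; this one produces 384-dimensional embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
# sentences is the list of documents (strings) you want to encode
embeddings = model.encode(sentences)

# Store sentences and embeddings on disk
with open('embeddings.pkl', "wb") as fOut:
    pickle.dump({'sentences': sentences, 'embeddings': embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)

# Load them back
with open('embeddings.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    stored_embeddings = stored_data['embeddings']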
The stored_embeddings variable can be handled as a numpy matrix and can therefore be indexed to access single elements. You can look at the values of each of the 384 dimensions and compare the values that the two documents take in one particular dimension (you can go column by column, but for a big matrix I suggest you don't loop over it with enumerate(), as it will take forever). You can see which dimension has the highest values or variance, for example.
I'm not saying it'll be interpretable what you'll find, but at least you can try and see what you find.
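One way to make the per-dimension comparison concrete is to split the cosine similarity into one term per dimension, since the cosine is just the sum of the products of the normalised components. The sketch below uses plain Python and toy 4-dimensional vectors; dimension_contributions is a hypothetical helper, and in practice you would pass two rows of stored_embeddings instead.

```python
import math

def dimension_contributions(a, b):
    """Return each dimension's share of the cosine similarity between a and b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return [x * y / (norm_a * norm_b) for x, y in zip(a, b)]

# Toy 4-dimensional vectors standing in for two rows of stored_embeddings.
doc_a = [0.5, 0.1, -0.3, 0.8]
doc_b = [0.4, -0.2, -0.1, 0.9]

contributions = dimension_contributions(doc_a, doc_b)
cosine = sum(contributions)  # the per-dimension terms add up to the cosine

# Dimension indices sorted by how much they add to the similarity.
ranked = sorted(range(len(contributions)), key=lambda i: contributions[i], reverse=True)
print(round(cosine, 3), ranked)
```

This tells you which dimensions drive the score, not which words do; the dimensions themselves may still not be interpretable.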
After training my dataset which has a number of categorical data using fastai's tabular model, I wish to read out the entity embedding and use it to map to my original data values.
I can see the embedding weights. The number of inputs doesn't seem to match anything, but maybe it is based on the unique categorical values in the train_ds.
To get that map, I would like to get the self.categories dictionary from the Categorify transform class. Is there any way to get that from the data variable obtained by calling TabularList.from_df?
Or maybe someone can tell me a better way to get this map. I know the input df passed to TabularList.from_df() is not it, because the number of rows is wrong, most likely because df is split into train and valid subsets. But there is no easy way to obtain the train part of the TabularList to check just the train part.
It's strange I can't find any code example that shows this. Doesn't anyone else care to map the entity embedding value back to its original categorical value?
I found it.
It is in data.train_ds.inner_df.
I'm trying to wrap some functionality of the lime Python library around Spark ML models. The general idea is to take a PipelineModel (containing each phase of data transformation and the application of the model) as input and build a function that calls the Spark model, applies the lime algorithm and gives an explanation for each single row.
Some context
The lime algorithm consists in approximating locally a trained machine learning model. In its implementation, lime just basically needs a function that, given a feature vector as input, evaluates the predictions of the model. With this function, lime can perturb slightly the feature input, see how the model predictions change and then give an explanation. So, theoretically, it can be applied to any model, evaluated with any engine.
The idea here is to use it with Spark ml models.
The wrapping
In particular, I'm wrapping the LimeTabularExplainer. In order to work, it needs a feature vector in which each element is an index corresponding to the category. Digging into the StringIndexer and similar stages, it's pretty easy to build such a vector from the "raw" values of the data. Then I built a function that, from such a vector (or a 2d array if you have more than one case), creates a Spark DataFrame, applies the PipelineModel and returns the model predictions.
The task
Ideally, I would like to build a function that does the following:
process a row of an input DataFrame
from the row, build and collect a numpy vector that works as input for the lime explainer
internally, the lime explainer slightly changes that vector in many ways, building a 2d array of "similar" cases
the above cases are transformed back into a Spark DataFrame
the PipelineModel is applied to the above DataFrame, and the results are collected and passed back to the lime explainer, which continues its work
The problem
As you can see (if you have read this far!), for each row of the DataFrame you build another DataFrame. So you cannot define a udf, since you are not allowed to call Spark functions inside a udf.
So the question is: how can I parallelize the above procedure? Is there another approach that I could follow to avoid the problem?
I think you can still use udfs in this case, followed by explode() to retrieve all the results on different lines. You just have to make sure the input column is already the vector you want to feed to lime.
That way you don't even have to collect out of Spark, which is expensive. Maybe you can even use vectorized udfs in your case to gain speed (not sure).
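from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, DoubleType

def function(base_case):
    # limefunction is a placeholder for your code that perturbs base_case
    list_similarCases = limefunction(base_case)
    return list_similarCases

# Assumption: each similar case is a list of doubles, hence an array of arrays
f_udf = udf(function, ArrayType(ArrayType(DoubleType())))
df_result = df_start.withColumn("similar_cases", explode(f_udf("base_case")))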
I am using tf.contrib.data.make_csv_dataset to convert my CSV data into a dataset that serves features and labels nicely for the specified columns from the CSV.
How can I specify extra column(s) from the CSV which I may want to be available during testing of the model, but not to be used for training and for model calculations? For instance, while evaluating testing accuracy, I want to know for which specific rows in the CSV dataset the predictions were wrong. Is there a way to serve an additional parameter that I could leverage to figure out what exactly the model got wrong?
Right now the code looks something like this (based on the Tensorflow example pages):
test_dataset = tf.contrib.data.make_csv_dataset(
    CSV_file,
    BATCH_TEST_SIZE,
    column_names=column_names,
    select_columns=column_select,
    label_name=label_name,
    num_epochs=1,
    shuffle=False)\
    .map(pack_features_vector)
And then during testing, the code does this:
for (x, y) in test_dataset:
    logits = model(x)
    prediction = tf.argmax(logits, axis=1, output_type=tf.int32)
    print('Act\t{}\nPred\t{}\n\n'.format(y, prediction))
Since the generator function only serves x and y values, how can I tell for which specific rows of the original CSV file the predictions were wrong?
How could I do something like
for (x, y, z) in test_dataset:
    print(z[x])
where z would be that additional column, which I could then examine?
Thank you for clarifying your question. I believe that, to see which rows were predicted incorrectly, you can use model.predict_classes() in Keras. The following code should give you an array of what was guessed by your model:
predictionArr = model.predict_classes(testData).reshape(-1)
This will give you an array the length of your test data set that you can compare element by element with the ground truths. Note that predict_classes() only exists on Sequential models in older Keras versions; on newer versions, take the argmax of model.predict() instead.
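Once you have the predictions as a flat array, tracing them back to CSV rows is a matter of keeping an identifier for each row in the same order. The sketch below is plain Python with made-up values; it assumes the test set was read with shuffle=False so the order matches the file, and that the identifier column was read separately. find_wrong_rows and row_ids are hypothetical names.

```python
def find_wrong_rows(row_ids, labels, predictions):
    """Return (row_id, label, prediction) for every misclassified example."""
    return [
        (row_id, y, p)
        for row_id, y, p in zip(row_ids, labels, predictions)
        if y != p
    ]

# Hypothetical values: row_ids would come from the extra CSV column,
# labels from the label column, predictions from the model.
row_ids = ["r1", "r2", "r3", "r4"]
labels = [0, 1, 2, 1]
predictions = [0, 2, 2, 0]

wrong = find_wrong_rows(row_ids, labels, predictions)
print(wrong)
```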
Hope this is helpful and answers your question!
I am using Spark (core/MLlib) with Java, version 2.3.1.
I am applying three transformations to a Dataset - StringIndexer, OneHotEncoderEstimator, VectorAssembler - to transform a categorical variable in my dataset into individual columns of 1 and 0 for each category. On my train data, this transformation works with no issues, everything is as expected, and I am saving this model to file.
My issue comes when I try to use this model on a new datapoint:
public static double loadModel(Obj newData) {
    SparkSession spark = Shots.buildSession();
    // Function which applies transformations
    Dataset<Row> data = buildDataset(spark, Arrays.asList(newData));
    LogisticRegressionModel lrModel = LogisticRegressionModel.load(modelPath);
    // Error is thrown here as the model doesn't seem to understand the input
    Dataset<Row> preds = lrModel.transform(data);
    preds.show();
}
The issue, I believe, is that the transformation is now being applied to only one row of data which outputs only one category for the categorical feature and a vector with only one element after transformation. This causes an error when the LogisticRegressionModel transform is applied, which is expecting a vector with length greater than one for that feature... I think.
I know my error is not knowing how to apply the train transform to the new data... but I am unsure where exactly the error is and, as a result, do not know where to find the answer (is the issue with saving the model, do I need to save something else like the pipeline, etc.).
The actual error being thrown is -
java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes: x.size = 7, y.size = 2
I came to the conclusions above from a visual examination of the data.
An example may help explain: I have a categorical feature with 3 values [Yes, No, Maybe]. My train data includes all three values, I end up with a vector feature of length 3 signifying the category.
I am then using the same pipeline on a single data point to predict a value, but the categorical feature can only take one of Yes, No, or Maybe, as there is only one data point. Therefore, when you apply the same transformation as above you end up with a vector with one element, as opposed to three, causing the model transform to throw an error.
In general, you are not using the API correctly. A correct workflow should include preserving the whole set of Models trained in the process (in your case at least the StringIndexerModel; the other components look like Transformers) and reapplying these to the new data.
The most convenient way of doing it is using Pipeline:
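import org.apache.spark.ml.Pipeline

// indexer, encoder, assembler and lr are the stages you have already defined
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
val pipelineModel = pipeline.fit(data)
pipelineModel.transform(data)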
PipelineModels can be saved like any other component, as long as all their stages are writable.
The issue here was that I was saving the model at the wrong point.
To preserve the effect of previous transformations you need to fit the pipeline to the data and then write/save the model. This means saving PipelineModel rather than Pipeline. If you fit after you load the data, then the transformation will be reapplied in full and you will lose the state required for the transformation to work.
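To see why the fitted state matters, here is a toy illustration in plain Python. It is not Spark code: OneHotIndexer is a hypothetical stand-in for a fitted StringIndexer plus one-hot encoder, and it only mimics the idea that the category list is learned once at fit time.

```python
class OneHotIndexer:
    """Toy stand-in for a fitted StringIndexer + one-hot encoder (not Spark code)."""

    def fit(self, values):
        # The learned state: one slot per category seen during fitting.
        self.categories = sorted(set(values))
        return self

    def transform(self, value):
        return [1.0 if value == c else 0.0 for c in self.categories]

train = ["Yes", "No", "Maybe", "Yes"]
new_point = "No"

# Reusing the state fitted on the training data keeps the vector length at 3.
fitted = OneHotIndexer().fit(train)
reused = fitted.transform(new_point)

# Refitting on the single new point learns only one category, so the length is 1.
refit = OneHotIndexer().fit([new_point]).transform(new_point)

print(len(reused), len(refit))
```

The refit vector no longer matches what the downstream model was trained on, which is the same kind of size mismatch the BLAS.dot error reports.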
I was also facing the same issue. StringIndexer will fail for new values in test or new data so we can choose to skip those unknown values.
new StringIndexer().setHandleInvalid("skip")
or pass union of both the train and test data to the pipeline and split it post-transformation.
You have two options:
handle_data_PipelineModel ==> df ---> split_dataset ==> train_df/test_df--> arithmetic_PipelineModel----->test_model--->evaluate
df == > split_dataset ==> train_df/test_df--> PipelineModel(handle_data_stage and arithmetic_stage) ---> probably error
Option 1 is safe: You need to save handle_data_PipelineModel and arithmetic_PipelineModel.
Option 2 is bad, no matter how you save the model: when you split the data first, the distributions of train_df and test_df will differ.
Note: The data set must not be split before the data-handling PipelineModel has processed it.