How to view and interpret the output of an LDA model using gensim - python-3.x

I am able to create the LDA model and save it. Now I am trying to load the model and pass it a new document:
lda = LdaModel.load('..\\models\\lda_v0.1.model')
doc_lda = lda[new_doc_term_matrix]
print(doc_lda)
Printing doc_lda only gives me the object: <gensim.interfaces.TransformedCorpus object at 0x000000F82E4BB630>
However, I want to get the topic words associated with it. Which method do I have to use? I was referring to this.

Not sure if this is still relevant, but have you tried get_document_topics()? Though I assume that would only work if you've updated your LDA model using update().
I don't think there is anything wrong with your code - the "Usage example" from the documentation link you posted uses doc2bow which returns a sparse vector - I don't know what new_doc_term_matrix consists of, but I'll assume it worked fine.
You might want to look at this Stack Overflow question: you are trying to print an object that isn't directly printable; the data you want is somewhere inside the object, and that data is printable.
Alternatively, you can also use your IDE's capabilities - the Variable explorer in Spyder, for example - to click yourself into the objects and get the info you need.
For more info on similarity analysis with gensim, see this tutorial.
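For illustration, here is a minimal sketch of that idea - assuming new_doc_term_matrix is a list of bag-of-words vectors produced by doc2bow, as in the usage example - which iterates the transformed corpus and looks up the top words of each topic with show_topic():
from gensim.models import LdaModel

lda = LdaModel.load('..\\models\\lda_v0.1.model')

# Iterating the TransformedCorpus yields one (topic_id, probability) list per document.
for doc_topics in lda[new_doc_term_matrix]:
    for topic_id, prob in doc_topics:
        # show_topic returns the top (word, weight) pairs for the given topic
        top_words = [word for word, weight in lda.show_topic(topic_id, topn=10)]
        print(topic_id, round(prob, 3), top_words)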

Use this (print_topics is a method of the model itself, not of the transformed corpus):
lda.print_topics(-1)

Related

Turning a Dataframe into a Series with .squeeze("columns")

I'm studying how to work with data right now and so I'm following along with a tutorial for working with time series data. Among the first things he does is read_csv on the file path and then use squeeze=True to read it as a Series. Unfortunately (and as you probably know), squeeze has been deprecated from read_csv.
I've been reading documentation to figure out how to read a CSV as a Series, and everything I try fails. The documentation itself says to use pd.read_csv('filename').squeeze('columns'), but when I check the type afterward, it is always still a DataFrame.
I've looked up various other methods online, but none of them seem to work. I'm doing this on a Jupyter Notebook using Python3 (which the tutorial uses as well).
If anyone has any insights into why I cannot change the type in this way, I would appreciate it. I'm not sure if I've misunderstood the tutorial altogether or if I'm not understanding the documentation.
I do literally type .squeeze("columns") when I write this out because when I write a column name or index, it fails completely. Am I doing that correctly? Is this the correct method or am I missing a better method?
Thanks for the help!
shampoo = pd.read_csv('shampoo_with_exog.csv',index_col= [0], parse_dates=True).squeeze("columns")
I would start with this...
# Change the stuff between the '' to the entire file path of where your csv is located.
df = pd.read_csv(r'c:\user\documents\shampoo_with_exog.csv')
To start, this will name your DataFrame df, which is kind of the unspoken industry standard, the same as pd for pandas.
Additionally, this lets you use a "raw" string (the r prefix), which makes it easier to insert directories into your Python code.
Once you are able to successfully run this, you can simply put df in a separate cell in Jupyter. This will show you what your data looks like from your CSV. Once you have done all that, you can start manipulating your data. While you can use the fancy options in pd.read_csv(), I mostly just try to get the data and manipulate it from the code itself. Obviously there are reasons not to stop at a plain pd.read_csv, but as you progress you can start adding things here and there. I almost never use squeeze, although I'm sure there will be those here to comment stating how "essential" it is for whatever the specific case might be.
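For what it's worth, here is a minimal sketch of when .squeeze("columns") actually returns a Series: it only does so when exactly one column is left after reading, so a file with several data columns stays a DataFrame. shampoo.csv below is a hypothetical file with just a date column and one value column:
import pandas as pd

# Two columns read in -> squeeze("columns") has nothing to squeeze, still a DataFrame
df = pd.read_csv('shampoo.csv')
print(type(df.squeeze("columns")))

# Moving the date column into the index leaves a single column -> squeeze returns a Series
ser = pd.read_csv('shampoo.csv', index_col=0, parse_dates=True).squeeze("columns")
print(type(ser))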

What's the name argument in string_to_hash_bucket_fast for?

I'm just new to TensorFlow. When checking the implementation of feature_column.categorical_column_with_hash_bucket, I found this code:
sparse_id_values = string_ops.string_to_hash_bucket_fast(
    sparse_values, self.hash_bucket_size, name='lookup')
I'm not sure why name='lookup' is used here - is it related to lookup_ops.py? The documentation for tf.string_to_hash_bucket_fast specifies:
name: A name for the operation (optional).
But I don't quite understand it. Trying to go deeper into the source code, I found that it's wrapped in a generated wrapper interface, and I can't even find a detailed implementation of the algorithm. Any suggestions?
tf.string_to_hash_bucket_fast() creates an actual op in the graph. It's called StringToHashBucketFast in the native implementation, see the source code in tensorflow/core/kernels/string_to_hash_bucket_op.cc. In particular, it has a gradient. So the name can be helpful to recognize this op in the graph, for example in tensorboard.
The name lookup in this place explains the meaning of sparse_id_values: it's a conversion of sparse_values to ids (hashes).
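As a small illustration (assuming the TensorFlow 1.x graph-mode API used in the question), the name only labels the node in the graph and has no effect on the hashing itself:
import tensorflow as tf

words = tf.constant(["cat", "dog", "fish"])
# 'name' just names the op in the graph, which is what you see in TensorBoard
ids = tf.string_to_hash_bucket_fast(words, num_buckets=1000, name="lookup")

print(ids.op.name)  # -> "lookup" (or "lookup_1", ... if the name is already taken)
with tf.Session() as sess:
    print(sess.run(ids))  # three bucket ids in [0, 1000)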

Convolution layer in keras

I am exploring the convolution layers in Keras from:
https://github.com/fchollet/keras/blob/master/keras/layers/convolutional.py#L233
Everywhere I found the following type of code lines:
@interfaces.legacy_conv1d_support
@interfaces.legacy_conv2d_support
What is the purpose and role of these lines? I searched on Google but couldn't find an answer anywhere. Please explain.
These lines starting with @ are called decorators in Python. Check out this page to read a brief summary about them. The basic function of these decorators is that they wrap the following function in another function that adds some kind of "wrapper" behaviour, like preprocessing the arguments, changing the accessibility of the function, etc.
Taking a look at the interfaces.py file you will see this:
legacy_conv1d_support = generate_legacy_interface(
    allowed_positional_args=['filters', 'kernel_size'],
    conversions=[('nb_filter', 'filters'),
                 ('filter_length', 'kernel_size'),
                 ('subsample_length', 'strides'),
                 ('border_mode', 'padding'),
                 ('init', 'kernel_initializer'),
                 ('W_regularizer', 'kernel_regularizer'),
                 ('b_regularizer', 'bias_regularizer'),
                 ('W_constraint', 'kernel_constraint'),
                 ('b_constraint', 'bias_constraint'),
                 ('bias', 'use_bias')],
    preprocessor=conv1d_args_preprocessor)
So, the use of this function is basically to rename parameters. Why is this? The Keras API changed the names of some arguments of some functions (like W_regularizer -> kernel_regularizer). To allow users to keep running old code, they added this decorator, which just replaces the old argument names with the corresponding new parameter names before calling the real function. This allows you to run "old" Keras 1 code even though you have installed Keras 2.
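To make the renaming mechanism concrete, here is a toy sketch - not the actual Keras implementation - using a made-up rename_kwargs decorator and a dummy conv1d function:
import functools

def rename_kwargs(conversions):
    """Toy decorator that maps old keyword-argument names to new ones."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for old_name, new_name in conversions:
                if old_name in kwargs:
                    kwargs[new_name] = kwargs.pop(old_name)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rename_kwargs([('nb_filter', 'filters'), ('filter_length', 'kernel_size')])
def conv1d(filters, kernel_size):
    return filters, kernel_size

print(conv1d(nb_filter=32, filter_length=3))  # (32, 3) - the old names still work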
Tl;dr: These lines are just there for compatibility reasons. As these are purely internal aspects of Keras, there is nothing you have to worry about or take care of.

Making a Datum from a Data Set, Stanford NLP

I was trying to run through examples for the Stanford NLP Classifier and had a question about classifying a new data set. I see that the ".test" file contains the "goldClass", which is the right answer, as well as the string that is supposed to be tested.
The example test set has the following format:
<label> <string>
<label> <string>
...
This makes sense for evaluating a model once it has been created from a hand-classified data set. But now, once a model is created, how do I classify a completely new data set? I no longer have the associated labels... I just have the new set of strings that I want to know the class for...
But to classify them, I will have to create a Datum object. To create a Datum object, I will need to use makeDatumFromLine(), which requires a TSV line... WHY does this have to be TSV? What is the use of specifying a goldClass when classifying new data?
I hope my question was clear.
"Why does it have to be TSV?"
Because it does - the classifier parses each new line with the same tab-separated column layout it was trained on, so you just leave the gold-class column empty. My solution to get the class of an unknown line is the following:
Datum<String, String> example = cdc.makeDatumFromLine("\tEverybody here needs food badly!");
System.out.println("Example data-> "+cl.classOf(example));
Seems to work fine. I don't know if this is the best use, but it works for me.

Capybara: Should I get rid of extracted constants or keep them?

I was wondering about some best practices regarding extraction of selectors to constants. As a general rule, it is usually recommended to extract magic numbers and string literals to constants so they can be reused, but I am not sure if this is really a good approach when dealing with selectors in Capybara.
At the moment, I have a file called "selectors.rb" which contains the selectors that I use. Here is part of it:
SELECTORS = {
  checkout: {
    checkbox_agreement: 'input#agreement-1',
    input_billing_city: 'input#billing\:city',
    input_billing_company: 'input#billing\:company',
    input_billing_country: 'input#billing\:country_id',
    input_billing_firstname: 'input#billing\:firstname',
    input_billing_lastname: 'input#billing\:lastname',
    input_billing_postcode: 'input#billing\:postcode',
    input_billing_region: 'input#billing\:region_id',
    input_billing_street1: 'input#billing\:street1',
    ....
  }
}
In theory, I put my selectors in this file, and then I could do something like this:
find(SELECTORS[:checkout][:input_billing_city]).click
There are several problems with this:
If I want to know the selector that is used, I have to look it up
If I change the name in selectors.rb, I could forget to change it somewhere else in the file which will result in find(nil).click
With the example above, I can't use this selector with fill_in(SELECTORS[:checkout][:input_billing_city]), because it requires an ID, name or label
There are probably a few more problems with that, so I am considering getting rid of the constants. Has anyone been in a similar spot? What is a good way to deal with this situation?
Someone mentioned the SitePrism gem to me: https://github.com/natritmeyer/site_prism
A Page Object Model DSL for Capybara
SitePrism gives you a simple, clean and semantic DSL for describing your site using the Page Object Model pattern, for use with Capybara in automated acceptance testing.
It is very helpful in that regard and I have adjusted my code accordingly.
