Using {scatter, line}_kws argument in Seaborn - python-3.x

I am trying to customize a regplot using the _kws arguments, but I get an exception. In fact, I do not know how to use these arguments: how to pass values to them, or even what kind of properties I can influence through them. I cannot find relevant documentation or examples.
Here is a sample of code that returns an exception. I tried other things as well but all return error messages.
toy_data # json format
'{"REGION":{"3041":2,"1335":5,"6261":8,"548":7,"4471":8,"3226":5,"5601":5,"1141":4,"1175":4,"1825":5,"5038":4,"3767":5,"1536":1,"168":5,"6247":6,"31":5,"1107":3,"5067":6,"985":3,"6176":5,"3415":6,"3013":2,"4785":1,"2676":3,"228":8,"5807":7,"530":7,"4678":5,"1062":3,"1698":3,"6648":3,"4686":5,"571":7,"760":5,"5178":9,"6090":8,"4945":2,"5636":7,"490":8,"1734":4,"3012":2,"14":5,"4637":2,"3239":4,"5866":2,"5297":4,"3011":2,"612":1,"1137":4,"1384":5,"3194":5,"632":2,"3820":3,"3923":9,"6580":3,"3870":9,"5952":5,"6423":5,"1101":3,"4622":6,"975":3,"1954":7,"4515":3,"1252":4,"457":8,"4712":1,"4446":6,"788":5,"2392":2,"704":5,"2378":2,"547":7,"115":6,"3703":8,"1949":7,"5852":8,"1468":2,"1680":3,"471":8,"750":5,"2605":3,"3974":6,"3029":2,"1237":4,"1521":1,"2543":5,"5907":6,"5782":4,"5974":5,"4070":9,"1838":5,"3880":9,"1938":4,"2596":4,"6533":2,"2941":2,"6160":2,"3572":7,"2326":2,"1355":5},"Tuition":{"3041":20825.0,"1335":10948.0,"6261":null,"548":14144.0,"4471":8622.0,"3226":14190.0,"5601":12897.0,"1141":23799.0,"1175":13141.0,"1825":2372.0,"5038":null,"3767":3732.0,"1536":null,"168":19143.0,"6247":9804.0,"31":8000.0,"1107":20203.0,"5067":null,"985":12334.0,"6176":8459.0,"3415":6561.0,"3013":13496.0,"4785":20544.0,"2676":32395.0,"228":13328.0,"5807":21132.0,"530":8212.0,"4678":15113.0,"1062":17176.0,"1698":17596.0,"6648":null,"4686":null,"571":14405.0,"760":4987.0,"5178":15505.0,"6090":15685.0,"4945":23896.0,"5636":13710.0,"490":5906.0,"1734":22306.0,"3012":21284.0,"14":4499.0,"4637":25300.0,"3239":19052.0,"5866":null,"5297":10399.0,"3011":11401.0,"612":35653.0,"1137":19869.0,"1384":15669.0,"3194":18833.0,"632":22675.0,"3820":21771.0,"3923":7139.0,"6580":null,"3870":10359.0,"5952":null,"6423":null,"1101":15326.0,"4622":21863.0,"975":4613.0,"1954":12967.0,"4515":9531.0,"1252":18609.0,"457":2140.0,"4712":22745.0,"4446":8585.0,"788":11430.0,"2392":18870.0,"704":28870.0,"2378":null,"547":null,"115":12977.0,"3703":12633.0,"1949":8881.0,"5852":23186.0,"1468":29049.0,"1680":7137.0,"471":null,"750":6894.0,"2605":3283.0,"3974":5282.0,"3029":18048.0,"1237":6355.0,"1521":29464.0,"2543":6558.0,"5907":11972.0,"5782":17544.0,"5974":5769.0,"4070":3452.0,"1838":4592.0,"3880":7932.0,"1938":6861.0,"2596":2265.0,"6533":null,"2941":34857.0,"6160":null,"3572":11571.0,"2326":34945.0,"1355":19308.0}}'
sns.regplot(data=toy_data,
            y='Tuition',
            x="REGION",
            x_estimator=np.mean,
            scatter_kws['color'] = 'r',
            line_kws['color'] = 'b')
plt.show()
plt.clf()
scatter_kws['color'] = 'r',
^
SyntaxError: keyword can't be an expression

Looking at the documentation:
{scatter,line}_kws : dictionaries
Additional keyword arguments to pass to plt.scatter and plt.plot.
It can be seen that they are keyword arguments to regplot and that they are dictionaries. In addition, the parameters they accept can be found in the documentation of plt.plot and plt.scatter, depending on which argument you are using.
Therefore, your call to regplot would look something like:
sns.regplot(data=toy_data,
            y='Tuition',
            x="REGION",
            x_estimator=np.mean,
            scatter_kws={'c': 'r'},
            line_kws={'color': 'b'})
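Any keyword argument accepted by plt.scatter or plt.plot can go in the corresponding dictionary. For example, a sketch with illustrative values for marker size, transparency, line width, and line style:
sns.regplot(data=toy_data,
            y='Tuition',
            x="REGION",
            x_estimator=np.mean,
            scatter_kws={'color': 'r', 's': 40, 'alpha': 0.5},           # plt.scatter kwargs
            line_kws={'color': 'b', 'linewidth': 2, 'linestyle': '--'})  # plt.plot kwargs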

Related

How to add stop words to TfidfVectorizer?

I am trying to add stop words to my stop-word list; however, the code I am using doesn't seem to work:
Creating stop words list:
stopwords = nltk.corpus.stopwords.words('english')
CustomListofWordstoExclude = ['rt']
stopwords1 = stopwords.extend(CustomListofWordstoExclude)
Here I am converting the text to a dtm (document term matrix) with tfidf weighting:
vect = TfidfVectorizer(stop_words = 'english', min_df=150, token_pattern=u'\\b[^\\d\\W]+\\b')
dtm = vect.fit_transform(df['tweets'])
dtm.shape
But when I do this, I get this error:
FutureWarning: Pass input=None as keyword args. From version 0.25 passing these as positional arguments will result in an error
warnings.warn("Pass {} as keyword args. From version 0.25 "
What does this mean? Is there an easier way to add stopwords?
I'm unable to reproduce the warning. However, note that a warning such as this does not mean that your code did not run as intended. It means that in future releases of the package it may not work as intended. So if you try the same thing next year with updated packages, it may not work.
With respect to your question about using stop words, there are two changes that need to be made for your code to work as you expect.
list.extend() extends the list in-place, but it doesn't return the list. To see this you can do type(stopwords1) which gives NoneType. To define a new variable and add the custom words list to stopwords in one line, you could just use the built-in + operator functionality for lists:
stopwords = nltk.corpus.stopwords.words('english')
CustomListofWordstoExclude = ['rt']
stopwords1 = stopwords + CustomListofWordstoExclude
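Alternatively, if you prefer extend, call it for its side effect and then pass the mutated list itself; a minimal sketch:
stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(['rt'])  # mutates stopwords in place; the return value is None
# later: TfidfVectorizer(stop_words=stopwords, ...)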
To actually use stopwords1 as your new stopwords list when performing the TF-IDF vectorization, you need to pass stop_words=stopwords1:
vect = TfidfVectorizer(stop_words=stopwords1,  # passed stopwords1 here
                       min_df=150,
                       token_pattern=u'\\b[^\\d\\W]+\\b')
dtm = vect.fit_transform(df['tweets'])
dtm.shape

sklearn.metrics.ConfusionMatrixDisplay using scientific notation

I am generating a confusion matrix as follows:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(truth_labels, predicted_labels, labels=n_classes)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp = disp.plot(cmap="Blues")
plt.show()
However, some of my values for True Positive, True Negative, etc. are over 30,000, and they are being displayed in scientific format (3e+04). I want to show all digits and have found the values_format parameter in the ConfusionMatrixDisplay documentation. I have tried using it like this:
disp = ConfusionMatrixDisplay(confusion_matrix=cm, values_format='')
But I get a type error:
TypeError: __init__() got an unexpected keyword argument 'values_format'.
What am I doing wrong? Thanks in advance!
In case somebody runs into the same problem: I just found the answer. The values_format argument has to be passed to disp.plot, not to the ConfusionMatrixDisplay constructor, like so:
disp.plot(cmap="Blues", values_format='')
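For integer counts specifically, a format spec such as 'd' also works; a short sketch, assuming cm has been computed as above:
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap="Blues", values_format='d')  # 'd' renders counts as plain integers, not 3e+04
plt.show()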

padding in tf.data.Dataset in tensorflow

Code:
a=training_dataset.map(lambda x,y: (tf.pad(x,tf.constant([[13-int(tf.shape(x)[0]),0],[0,0]])),y))
gives the following error:
TypeError: in user code:
<ipython-input-32-b25101c2110a>:1 None *
a=training_dataset.map(lambda x,y: (tf.pad(tensor=x,paddings=tf.constant([[13-int(tf.shape(x)[0]),0],[0,0]]),mode="CONSTANT"),y))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py:264 constant **
allow_broadcast=True)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py:282 _constant_impl
allow_broadcast=allow_broadcast))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:456 make_tensor_proto
_AssertCompatible(values, dtype)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:333 _AssertCompatible
raise TypeError("Expected any non-tensor type, got a tensor instead.")
TypeError: Expected any non-tensor type, got a tensor instead.
However, when I use:
a=training_dataset.map(lambda x,y: (tf.pad(x,tf.constant([[1,0],[0,0]])),y))
The above code works fine.
This brings me to the conclusion that something is wrong with 13-tf.shape(x)[0], but I cannot understand what.
I tried converting the tf.shape(x)[0] to int(tf.shape(x)[0]) and still got the same error.
What I want the code to do:
I have a tf.data.Dataset object holding variable-length sequences of shape (None, 128), where the first dimension (None) is less than 13. I want to pad the sequences so that every element has shape (13, 128).
Is there any alternate way (if the above problem cannot be solved)?
A solution that works:
using:
paddings = tf.concat(([[13-tf.shape(x)[0],0]], [[0,0]]), axis=0)
instead of using:
paddings = tf.constant([[13-tf.shape(x)[0],0],[0,0]])
works for me.
However, I still cannot figure out why the latter one did not work.
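A likely explanation, based on the traceback: tf.constant builds a constant from plain Python or NumPy values, but inside Dataset.map, tf.shape(x)[0] is a symbolic tensor, so 13-tf.shape(x)[0] is a tensor as well, and make_tensor_proto rejects it with "Expected any non-tensor type, got a tensor instead." tf.concat (and tf.stack), by contrast, accept tensors. A sketch of the same fix written with tf.stack:
a = training_dataset.map(
    lambda x, y: (tf.pad(x, tf.stack([[13 - tf.shape(x)[0], 0], [0, 0]])), y))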

Unusual nans Returned by scipy LinearNDInterpolator

I'm trying to interpolate the following data with Python (3.8.1) using the aforementioned scipy function (official documentation here; source code here). The official documentation is incredibly sparse, so I'm hopeful that someone else out there has experience with this function and may know the source of the issue. Specifically, I run the following four lines of code:
predictor = [[-1.7134013337139833, 0.9582376963057636, -0.21528572746395735], [3.25933089248862, -0.7087236333980123, 0.012808817274351122], [-0.5596739049487544, -1.8723369742231246, 0.03114189522349198], [0.23080764211370225, 1.0639221305852422, -0.602148693975945], [-0.9879484423429669, -0.16678510825693527, 0.5570132252912631], [0.0029439785978213986, -0.10016927713200409, -0.18197412051828055], [0.3530872261969887, 0.6347161018351574, 0.7285361235605389], [-1.122894723267098, 0.22837861478723648, -0.9022469946784363], [-0.02862856314533664, 0.014623415207400122, 3.078346263312741], [-1.3367570531570616, -0.3218239542354167, 0.489878302042675]]
response = [0.020235605909933625, 1.4729016163456679e-05, 0.021931080605237303, 0.21271851410989498, 0.26870984350693583, 0.9577608837143238, 0.3470452852299319, 0.11918254249689647, 7.657429164576589e-05, 0.1187813551565562]
from scipy.interpolate import LinearNDInterpolator
away = LinearNDInterpolator(predictor, response)
Now, if I write away.__call__([0,0,0])[0] then Python returns 0.8208492283847619,
which is the desired outcome and is a sensible value based on the given test data. Similarly, away.__call__([0,0,1])[0] returns 0.22018657078617598, which is also a sensible value.
However, away.__call__([0,1,1])[0] returns nan. What changed? Does anyone happen to know?
Thank you.
This occurs when away.__call__(x) is passed a point x that lies outside the convex hull of the predictor points - essentially, when x lies outside the region of interpolation. LinearNDInterpolator returns nan for such points by default.
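You can check this yourself with scipy.spatial.Delaunay, the triangulation LinearNDInterpolator builds internally; a sketch using the predictor list from the question:
import numpy as np
from scipy.spatial import Delaunay

hull = Delaunay(np.asarray(predictor))
print(hull.find_simplex(np.array([0.0, 0.0, 0.0])))  # >= 0: inside the hull, interpolation succeeds
print(hull.find_simplex(np.array([0.0, 1.0, 1.0])))  # -1: outside the hull, so nan is returned
If you need a number instead of nan outside the hull, LinearNDInterpolator also accepts a fill_value argument.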

How to use select() transformation in Apache Spark?

I am following the Intro to Spark course on edX. However, I can't understand a few things; the following is a lab assignment. FYI, I am not looking for the solution.
I am not able to understand why I am receiving the error
TypeError: 'Column' object is not callable
Following is the code
from pyspark.sql.functions import regexp_replace, trim, col, lower
def removePunctuation(column):
    """
    Args:
        column (Column): A Column containing a sentence.
    """
    # The following line raises the error. I believe I am selecting all the
    # rows from the DataFrame 'column' where the attribute is named 'sentence'.
    result = column.select('sentence')
    return result
sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
    .select(removePunctuation(col('sentence')))
    .show(truncate=False))
Can you elaborate a little? Thanks in advance.
The column parameter is not a DataFrame object and, therefore, does not have access to the select method. You'll need to use other functions to solve this problem.
Hint: Look at the import statement.
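To make the hint concrete without solving the exercise: a Column is transformed by applying the imported column functions to it, not by calling DataFrame methods on it. A minimal sketch of the pattern, using sentenceDF from the question (the actual punctuation handling is left to you):
def removePunctuation(column):
    # Compose column functions; regexp_replace fits in this chain as well.
    return lower(trim(column))

sentenceDF.select(removePunctuation(col('sentence'))).show(truncate=False)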
