Maxima: Rounding the entries of a list using Robert Dodier's excel_round.mac

I want the entries of a Maxima list to be rounded in Excel style using Robert Dodier's excel_round.mac (see Maxima: Round like in Excel) on the fly.
I created a list, say the values of exp(0.5*x) over the range [-2, 2].
(%i1) makelist(exp(0.5*x),x,-2,2)$
soln: %
(%o1) [0.36788,0.60653,1,1.6487,2.7183]
The problem I'm having is how to get all the list entries into the x below.
(%i1) excel_round(x,2)$
ev(%, x=?)
Since trying the following
(%i2) map(exel_round(x,2),soln);
(%o2) [exel_round(x,2)(0.36788),exel_round(x,2)(0.60653),
exel_round(x,2)(1),exel_round(x,2)(1.6487),exel_round(x,2)(2.7183)]
and
(%i12) makelist(exp(0.5*x),x,-2,2);
soln: %$
excel_round(x,2)$
ev(%, x=soln)
(%o12) [0.36788,0.60653,1,1.6487,2.7183]
(%o15) excel_round([0.36788,0.60653,1,1.6487,2.7183],2)
didn't yield the desired results, I returned to semi-brute force:
(%i18) excel_round(x,2)$
ev(%, x=soln[1]);
excel_round(x,2)$
ev(%, x=soln[2]);
excel_round(x,2)$
ev(%, x=soln[3]);
excel_round(x,2)$
ev(%, x=soln[4]);
excel_round(x,2)$
ev(%, x=soln[5])
(%o19) 0.37
(%o21) 0.61
(%o23) 1.0
(%o25) 1.65
(%o27) 2.72
This is probably better than typing in all the list entries by hand. There surely must be a better and more elegant way.
Any help/suggestions welcome.
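For reference, a sketch of one more elegant route, assuming excel_round.mac is loaded and defines excel_round(x, digits): map needs a function as its first argument, so wrap the call in a lambda. (Note the map attempt above also misspells excel_round as exel_round, which is why it echoes back unevaluated.)
(%i1) map(lambda([u], excel_round(u, 2)), soln);
(%o1) [0.37,0.61,1.0,1.65,2.72]
An equivalent alternative should be makelist(excel_round(x, 2), x, soln), which iterates x over the entries of soln.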

Related

Regularized l1 Logistic regression Feature Selection returns different coef_ when rerun

I have a strange issue, already mentioned here: LinearSVC Feature Selection returns different coef_ in Python, but I cannot really relate it to my case.
I have a Regularised L1 logistic regression that I am using for feature selection.
When I simply rerun the code the number of the feature selected changes.
The target variable is binary (1, 0). The number of features is 709. There are 435 training observations, so there are more features than observations. The penalty C was obtained through TimeSeriesSplit CV and never changes when I rerun; I verified that.
Below is the code for the feature selection part:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

X = df_training_features
y = df_training_targets
lr_l1 = LogisticRegression(C=LR_penalty.C, max_iter=10000, class_weight=None, dual=False,
                           fit_intercept=True, intercept_scaling=1, l1_ratio=None, n_jobs=None,
                           penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
                           verbose=0, warm_start=False).fit(X, y)
model = SelectFromModel(lr_l1, threshold=1e-5, prefit=True)
feature_idx = model.get_support()
feature_name = X.columns[feature_idx]
X_new = model.transform(X)
# Print the surviving coefficients
importance = lr_l1.coef_[0]
for i, v in enumerate(importance):
    if np.abs(v) >= 1e-5:
        print('Feature: %0d, Score: %.5f' % (i, v))
sel = importance[np.abs(importance) >= 1e-5]
# Plot feature importance
plt.figure(figsize=(12, 10))
plt.bar([x for x in feature_name], sel)
plt.xticks(fontsize=10, rotation=70)
plt.ylabel('Feature Importance', fontsize=14)
plt.show()
The result sometimes gives me 22 selected features, other times 23 or 24. I am not sure what is happening. I thought the issue was in SelectFromModel, so I explicitly set the threshold to 1e-5 (which is the default for l1 regularisation), but nothing changes.
It is always the same features that are sometimes in and sometimes out, so I checked their coefficients, thinking they might be close to that threshold; instead they are not (they are 1 or 2 orders of magnitude higher).
Can anybody please help? I have been struggling with this for more than a day.
You used solver=liblinear. From the documentation:
random_state : int, RandomState instance, default=None
Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. See Glossary for details.
So try setting a fixed value for random_state and you should converge to the same results.
After a very quick search, I found liblinear uses coordinate descent to minimize the cost function (source). This means that it will choose a random set of coefficients and minimize the cost function one step at a time. I suppose your results are slightly different because they each started at different points.
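A minimal sketch of that change, reusing the question's variables (X, y, and LR_penalty come from the question; 42 is an arbitrary seed):
from sklearn.linear_model import LogisticRegression

# A fixed random_state makes liblinear's data shuffling deterministic,
# so reruns select the same features.
lr_l1 = LogisticRegression(C=LR_penalty.C, penalty='l1', solver='liblinear',
                           max_iter=10000, random_state=42).fit(X, y)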

How can I easily change tf-idf similarity dataframe using apply

I'm using Python 3.
I am doing TF-IDF and recording every result with similarity above 0.8. But the for loop is too slow, because the shape is 51,336 x 51,336. How can I create the dataframe faster, without a for statement? It is taking 50 minutes now.
I want to make a dataframe like this.
[column_0],[column_1],[similarity]
index[0], column[0], value
index[0], column[1], value
index[0], column[2], value
....
index[100], column[51334], value
index[100], column[51335], value
index[100], column[51336], value
...
index[51336], column[51335], value
index[51336], column[51336], value
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

tf = TfidfVectorizer()
tfidf_matrix = tf.fit_transform(df['text'])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
indices = pd.Series(df.index, index=df['index_name'])
similarity = pd.DataFrame(columns=['column_0', 'column_1', 'similarity'])
for n in range(len(cosine_sim)):
    for i in list(enumerate(cosine_sim[n])):
        if i[1] > 0.8 and i[1] < 0.99:
            similarity = similarity.append({'column_0': indices.index[n],
                                            'column_1': indices.index[i[0]],
                                            'similarity': i[1]}, ignore_index=True)
If you are thinking of parallelizing the job, unfortunately there is no way to parallelize/distribute access to the vocabulary that these vectorizers need. Hence the alternative hack: use the HashingVectorizer.
The scikit-learn docs provide an example of using this vectorizer to train a classifier in batches:
https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
Hope this will help you.
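A minimal sketch of that swap (the n_features value is an arbitrary assumption; HashingVectorizer is stateless, so batches can be transformed independently and in parallel):
from sklearn.feature_extraction.text import HashingVectorizer

# No vocabulary to fit: transform() works on any batch independently.
hv = HashingVectorizer(n_features=2**18, alternate_sign=False)
hashed_matrix = hv.transform(df['text'])
Note that hashing alone skips IDF weighting; a TfidfTransformer can be applied on top of the hashed counts if that weighting matters.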

Python- ARIMA predictions returning all NaNs

I'm trying to follow the time series tutorial here (using my own dataset):
https://www.analyticsvidhya.com/blog/2018/02/time-series-forecasting-methods/
Surprisingly, I am able to successfully reach Part 7: ARIMA. In this section, though, I am stumbling quite a bit: all the values in its Prediction column are NaN.
In the terminal, I see
a date index has been provided but it has no associated frequency information and so will be ignored when forecasting
My test data set has a few date gaps for when no transactions occurred, so I fill it with
test=test.set_index('DATE').asfreq('D', fill_value=0)
I also do the same thing with my ARIMA dataset, so the index matches the test set.
The rest of the relevant code is as follows:
import statsmodels.api as sm

train = df[0:180]
test = df[180:]
SARIMA = test.copy()
fit = sm.tsa.statespace.SARIMAX(train['COUNT'], order=(1,1,1), seasonal_order=(0,0,0,5)).fit()
SARIMA['SARIMA'] = fit.predict(start=0, end=93, dynamic=True)
print(SARIMA)
print(test)
In the print output, the index for the test set and the SARIMA set are the same. The SARIMA frame contains a column SARIMA which holds the predictions, except they are all NaN. What am I missing?
test
DATE COUNT
2018-06-21 1
2018-06-22 3
..
2018-11-21 3
2018-11-22 4
SARIMA
DATE COUNT SARIMA
2018-06-21 1 NaN
2018-06-22 3 NaN
..
2018-11-21 3 NaN
2018-11-22 4 NaN
edit:
For some reason statsmodels simply cannot detect the index frequency. I've tried
SARIMA=SARIMA.set_index('DATE').asfreq('D',fill_value=0)
SARIMA.index=pd.to_datetime(SARIMA.index)
SARIMA.index=pd.DatetimeIndex(SARIMA.index.values, freq='D')
but the warning always appears.
edit2: I straight up tried to make a new dataset in Excel:
DATE COUNT
2018/01/01 1
2018/01/02 2
..
2018/01/10 3
2018/01/11 4
created the model with the same lines above, except setting enforce_stationarity and enforce_invertibility to False. All the predictions are still NaN.
edit3: using the fake Excel dataset, I've come one step closer. Passing start='2018-01-01' and end='2018-01-21' yielded predictions of all 0s, which is better than NaN. Can anyone make sense of these results?
edit4: setting dynamic=False returned reasonable predictions. Clearly I'm no statistician.
Another reason behind this behavior could be the SARIMAX parameters. I have not found a way to overwrite them yet, so if this is the cause, try changing your initial parameters.
import random
import statsmodels.api
import numpy as np
import matplotlib.pyplot as plt

endog = np.array(random.sample(range(100, 200), 17))
for cd in range(2):
    m = statsmodels.api.tsa.statespace.SARIMAX(
        endog=endog,
        order=(1, 1, 1),
        seasonal_order=(0, cd, 0, 12),
        trend='n'
    ).fit()
    plt.plot(endog)
    plt.plot(m.fittedvalues)
    plt.title('D: ' + str(cd))
    plt.show()
Some dates were missing in the dataset
SARIMA.index=pd.DatetimeIndex(SARIMA.index.values, freq='D') corrected this.
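A minimal sketch combining the fixes above (an explicit daily frequency plus dynamic=False; it assumes df has the DATE and COUNT columns from the question):
import pandas as pd
import statsmodels.api as sm

# Give the whole series an explicit daily frequency up front, so the
# frequency warning disappears and predict() returns date-indexed values.
df = df.set_index('DATE').asfreq('D', fill_value=0)
train, test = df[0:180], df[180:]

fit = sm.tsa.statespace.SARIMAX(train['COUNT'], order=(1, 1, 1),
                                seasonal_order=(0, 0, 0, 5)).fit()
# dynamic=False uses the observed series for lagged values (one-step-ahead),
# avoiding the all-NaN/all-zero forecasts described in the edits.
SARIMA = test.copy()
SARIMA['SARIMA'] = fit.predict(start=test.index[0], end=test.index[-1], dynamic=False)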

Python3: dictionary of dictionaries for table-content in file

The task at hand, where I got stuck, is that I have to put the table content of a file into a dictionary-of-dictionaries structure.
The file contains something like this: (first six lines of ascii-file)
Name-----------|Alt name-------|------RA|-----DEC|-----z|---CR|----FX|---FX*|Error|---LX|--NH|ID-|Ref#----
RXCJ0000.1+0816 UGC12890 0.0295 8.2744 0.0396 0.26 5.80 5.39 12.4 0.37 5.9 1,3
RXCJ0001.9+1204 A2692 0.4877 12.0730 0.2033 0.08 1.82 1.81 17.9 3.24 5.1 1
RXCJ0004.9+1142 UGC00032 1.2473 11.7006 0.0761 0.17 3.78 3.68 12.7 0.93 5.3 2,4
RXCJ0005.3+1612 A2703 1.3440 16.2105 0.1164 0.24 4.96 4.94 11.8 2.88 3.7 B 2,5
RXCJ0006.3+1052 a) 1.5906 10.8677 0.1698 0.15 3.28 3.28 19.3 4.05 5.6 1
I can provide a file sample if necessary.
The following code works fine till it comes to storing each line-dict into a second dict.
#!/usr/bin/env python3
from collections import *
from re import *

obsrun = {}
objects = {}
re = compile('\d+.\d\d\d\d')
filename = 'test.asc'

with open(filename, 'r') as f:
    lines = f.readlines()

for l in lines[2:]:
    # split the read line into a list
    o_bject = l.split()
    #print(o_bject)
    # iterate over each entry and populate the line-dictionary with values of interest
    # what's needed (in cols of table): identifier, common name, right ascension, declination
    for k in o_bject:
        objects.__setitem__('id', o_bject[0])
        objects.__setitem__('common_name', o_bject[1])
        # sometimes the common name has blanks, multiple entries or replacements
        if re.match(o_bject[2]):
            objects.__setitem__('ra', float(o_bject[2]))
            objects.__setitem__('dec', float(o_bject[3]))
        else:
            objects.__setitem__('ra', float(o_bject[3]))
            objects.__setitem__('dec', float(o_bject[4]))
    # extract the identifier (name of the object) for use as key
    name = objects.get('id')
    #print(name)
    print(objects)  # *
    # as documented in http://stackoverflow.com/questions/1024847/add-to-a-dictionary-in-python
    obsrun[name] = objects
#print(obsrun)
# getting an ordered dictionary sorted by keys
OrderedDict(sorted(obsrun.items(), key=lambda t: t[0]))  # t[0] keys, t[1] values
What one can see from the console output is that the inner for-loop does what it's supposed to do. This is confirmed by the print(objects) at *.
But when it comes to storing the row-dicts as values in the second dict, it gets populated with the same values everywhere. The keys are correctly built.
What I don't understand is that the print() command displays the correct content of "objects", yet it is not stored into "obsrun" correctly.
Does the error lie in the view-like nature of dicts, or what did I do wrong?
How should I improve the code?
Thanks in advance,
Christian
You created only one dictionary, so each time through the loop you are modifying the same one.
Move the line
objects = {}
into the for l in line[2:]: loop. This will create a separate dict for each line of the file.
Also, using __setitem__ directly is unnecessary and makes the code harder to read. Change the lines from objects.__setitem__('id', o_bject[0]) to objects['id'] = o_bject[0].
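A minimal sketch of the corrected loop with both changes applied (all names come from the question; re is the compiled pattern from the original code):
obsrun = {}
for l in lines[2:]:
    objects = {}  # a fresh dict for every line, so entries are not shared
    o_bject = l.split()
    objects['id'] = o_bject[0]
    objects['common_name'] = o_bject[1]
    if re.match(o_bject[2]):
        objects['ra'] = float(o_bject[2])
        objects['dec'] = float(o_bject[3])
    else:
        objects['ra'] = float(o_bject[3])
        objects['dec'] = float(o_bject[4])
    obsrun[objects['id']] = objects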
It's worth pointing out that you don't really need a dict-of-dicts unless you are trying to look up the entries by name. (You don't explain much what the use case is, here.)
The one thing that leaps out from your code is that you're using setitem a lot - I think maybe you are coming from C++ or Java, where dictionaries do not have language support built in. In Python, this is not the case- you can say d[key]=value to add an item to a dictionary.
Here's some code to create a list (array) of dictionaries. It would be pretty trivial to make Table a dictionary keyed on one of the fields. I'll leave that for you to figure out. :)
Alternatively, a list is much easier to iterate over than a dict, if your problem is going to be performing computations on the data. So if you have to add up or average up or find the min/max, you probably want this version.
#!/usr/bin/env python3 -tt
data = open('test.asc')
header = data.readline().replace('-', '')
Field_names = header.split('|')
Table = []
# Read in the remaining lines, one at a time
for line in data:
    fields = line.split()
    Table.append(dict(zip(Field_names, fields)))
from pprint import pprint
pprint(Table)
So you're saying that handing "objects" to obsrun just links to "objects" rather than copying its content? So I have to create a separate inner dict for each line, since it's only referenced.
You're right about setitem. I used it to make it more clear to me, what exactly I'm doing there.
I will try moving objects = {} into the inner for-loop.
Thanks for the answer. Will get back to report if that did the trick.
Update: That did it! Thanks so much, I really got stuck there, but I learned something important about dictionaries: in this case they are just referenced, which saves memory.
cheers,
Christian

Basic Python Calculation

How would you get 1.75 if:
num2=24
num1=3
total=14
print('Subtotal left: %.1f' % (num2/(num1*total)))
Is there another way to write this calculation to get that result? I can only get 0.4, 2.3, etc. I know Python follows PEMDAS. I've tried all night to get that result, but unfortunately I'm not smart enough with Python yet.
Please Help! :)
Cheers.
14/(24/3)==1.75
I'll leave the coding to you.
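In the question's variables, that expression reads as follows (a sketch; %.2f is used instead of the original %.1f so both decimals show):
print('Subtotal left: %.2f' % (total / (num2 / num1)))  # 14 / (24 / 3) == 1.75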
You could try:
result = (total*num1)/num2
print("Subtotal left: %f" % result)
Output:
Subtotal left: 1.75
