Python Progress Bar for non-iterable process - python-3.x

I'm using this notebook, with the Apply DocumentClassifier section altered as shown below.
JupyterLab, kernel: conda_mxnet_latest_p37.
tqdm is a progress bar wrapper. It seems to work both on for loops and in the CLI. However, I would like to use it on the line:
classified_docs = doc_classifier.predict(docs_to_classify)
This is an iterative process, but the iteration happens under the bonnet.
How can I apply tqdm to this line?
Code Cell:
doc_dir = "GRIs/"  # contains 2 .pdfs

with open('filt_gri.txt', 'r') as filehandle:
    tags = [current_place.rstrip() for current_place in filehandle.readlines()]

doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=tags,
                                                batch_size=2)

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

Based on this TDS article, all Python progress bar libraries work with for loops. Hypothetically, I could alter the predict() function and add the progress bar there, but that's simply too much work.
Note: I'm happy to remove this answer if there is indeed a solution for processes that are not iterably "accessible".
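One possible workaround (a sketch only, assuming predict() can be called on arbitrary sublists of docs_to_classify) is to chunk the documents yourself and wrap the chunk loop in tqdm, so progress is reported per batch rather than per document:

from tqdm import tqdm

# hypothetical sketch: report progress per chunk of documents;
# assumes doc_classifier.predict() accepts sublists of docs_to_classify
chunk_size = 2  # match the batch_size used above
classified_docs = []
for i in tqdm(range(0, len(docs_to_classify), chunk_size)):
    classified_docs.extend(doc_classifier.predict(docs_to_classify[i:i + chunk_size]))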

Related

How to download pipeline code from Pycaret AutoML into .py files?

PyCaret seems like a great AutoML tool. It works fast and simple, and I would like to download the generated pipeline code into .py files to double-check and, if needed, customize some parts. Unfortunately, I don't know how to make that happen. Reading the documentation has not helped. Is it possible or not?
It is not possible to get the underlying code since PyCaret takes care of this for you. But it is up to you as the user to decide the steps that you want your flow to take, e.g.:
# Setup experiment with user-defined options for preprocessing, etc.
setup(...)
# Create a model (uses training split only)
model = create_model("lr")
# Tune hyperparameters (user can pass a custom tuning grid if needed)
# Again, uses training split only
tuned = tune_model(model, ...)
# Finalize the model (so that the model with the best hyperparameters is retrained on the entire dataset)
finalize_model(tuned)
# Any other steps you would like to do.
...
Finally, you can save the entire pipeline as a pkl file for use later
# Saves the model + pipeline as a pkl file
save_model(final, "my_best_model")
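To reuse it later, you can load the saved pipeline back; a minimal sketch, assuming the classification module and a new DataFrame new_data with the same columns as the training data:

from pycaret.classification import load_model, predict_model

# load the saved pipeline + model and score new data with it
final = load_model("my_best_model")
predictions = predict_model(final, data=new_data)  # new_data is assumed to exist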
You may get a partial (incomplete) answer with get_config("prep_pipe") in 2.6.10 or in 3.0.0rc1.
Just run setup() as in the examples, store the result as cdf1, and try cdf1.pipeline; you may get output like Pipeline(...).
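A minimal sketch of that route, assuming PyCaret 2.x, the functional classification API, and a DataFrame data with a 'default' target column:

from pycaret.classification import setup, get_config

# assumes PyCaret 2.x: 'prep_pipe' holds the preprocessing pipeline built by setup()
exp = setup(data=data, target='default', session_id=123,
            silent=True)  # skip the interactive dtype confirmation prompt (PyCaret 2.x)
prep_pipe = get_config("prep_pipe")
print(prep_pipe)  # prints something like Pipeline(steps=[...])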
When working with pycaret=3.0.0rc4, you have two options.
Option 1:
get_config("pipeline")
Option 2:
lb = get_leaderboard()
lb.iloc[0]['Model']
Option 1 will give you the transformations done to the data whilst option 2 will give you the same plus the model and its parameters.
Here's some sample code (from a notebook, based on their documentation on the Binary Classification Tutorial (CLF101) - Level Beginner):
from pycaret.datasets import get_data
from pycaret.classification import *
dataset = get_data('credit')
data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)
exp_clf101 = setup(data = data, target = 'default', session_id=123)
best = compare_models()
evaluate_model(best)
# OPTION 1
get_config("pipeline")
# OPTION 2
lb = get_leaderboard()
lb.iloc[0]['Model']

jupyter notebook buffers when writing to file until closed? How to see results without closing jupyter notebook?

I am new to Python programming; I taught myself QuickBASIC back in high school and am trying to teach myself Python now using the only tools I have available right now: a Chromebook running crouton with Jupyter Notebook installed on Linux (which may be my first set of mistakes, teaching myself and learning with an "interesting" setup, but here I am). Anyhow, I regularly find interesting links on the web on my phone and email them to myself to review later, and I figured I would create a Python program to sort the list of URLs and output them to a file.
I included the code below, which does work, but the problem I have is with Jupyter Notebook itself. I googled a bit to try to find an answer but have been unsuccessful so far. Here is the actual issue:
When I write to a file in Jupyter Notebook, it does not show up in the local filesystem until I close the notebook. This is less than ideal, as I like to use the notebook for quick testing and bug fixing instead of using IDLE or bash, so having to close Jupyter just to see what was written to the file is not helpful. Is there a way to flush the buffer or something, so I can open the text file in a text editor and see the results without having to quit Jupyter?
If needed, here is the program:
##################################################################
# #
# parse_my_search.py - I save URLs to a text file for later #
# viewing and/or bookmarking. These lists grow rather quickly, #
# so this is a program to sort them by certain subject or #
# websites and make sure that I have no duplicates. #
# #
##################################################################
import os.path
from os import listdir
URL_set = set()  # set - used to remove duplicates when sorting/organizing
URL_list = [] # list - used to allow duplicates when sorting/organizing
temp_list = [] # list - temporary usage, discard or reassign to [] when not used
input_file = "" # file to read in, get the URLs, and then close
output_file = 'python.txt'
# file name for output. Will probably change this to a list to write the
# output files to. This should NOT overwrite
# the existing file, but it should open a python file (as in, a text
# file with extention '.txt' which has the URLs that include the word
# 'python' in it), a youtube file (any/all youtube URLs), and an 'else'
# file that is the catch-all for anything that is not in the first two
# files. NOTE: this has not been done yet, only opens single file
input_file = 'My_searches.txt'
while True:
    try:
        #for elem in os.listdir('.'):
        #    print(elem)
        #print(os.listdir('.'), end=" ")
        #print(onlyfiles)
        #print("enter filename")
        #input_file = input()
        read_from = open(input_file)
        break
    except OSError as err:
        print("OS error: {0}".format(err))

ItemsReadInFromFile = read_from.readlines()
for item in ItemsReadInFromFile:
    URL_list.append(item.strip('\n'))

# using list comprehension to
# perform removal of empty strings
URL_list = [i for i in URL_list if i]

# removing duplicates:
URL_set = set(URL_list)
URL_list = list(URL_set)

URL_list.sort()  # this will change the list / is destructive. No need to do the following:
# URL_list = URL_list.sort()
# In fact, doing so returns 'None'

# sorting for python
write_to = open(output_file, 'w+')
for item in URL_list:
    if 'python' in item.lower():
        item += ('\n')
        write_to.write(item)

# this is a test that the program works to this point
print("so far so good", input_file)
read_from.close()
I figured out the answer today, and it turns out that it was not Jupyter at all: the problem was that I did not close the output file after writing the data at the end of the program. That is why the data did not show up in a text editor until Jupyter was closed; the file was still open.
Insert facepalm here.
A rookie mistake, and a little embarrassing, but I am happy that I figured it out. This can now be closed.
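For anyone hitting the same thing, a minimal sketch of the fix: either close (or flush) the file explicitly, or use a context manager so the contents reach disk as soon as the block ends, without quitting Jupyter.

# context-manager version: the file is flushed and closed when the block ends,
# so a text editor sees the contents immediately
with open(output_file, 'w+') as write_to:
    for item in URL_list:
        if 'python' in item.lower():
            write_to.write(item + '\n')

# alternatively, keep the explicit open() and add:
# write_to.flush()   # push buffered data to disk now
# write_to.close()   # ...or simply close the file when done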

Python 3.7, Feedparser module cannot parse BBC weather feed

When I parse the example RSS link provided by BBC Weather, it gives only an empty feed. The example link is: "https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/2643123"
I've tried using the feedparser module in Python; I would like to do this in either Python or C++, but Python seemed easier. I've also tried rewriting the URL without https:// and with .xml, and it still doesn't work.
import feedparser
d = feedparser.parse('https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/2643123')
print(d)
It should give a result similar to the RSS feed at the link, but it just gets an empty feed.
First, I know you got no result - not an error like I did. Perhaps you are running a different version. As I mentioned, it yields a result on an older version under Python 2, using a program that has been running solidly every night for about 5 years, but it throws an exception on a freshly installed feedparser 5.2.1 on Python 3.7.4 64-bit.
I'm not entirely sure what is going on, but the function called _gen_georss_coords is throwing a StopIteration on the first call. I have noted some references to this error arising from the implementation of PEP 479. It is written as a generator, but for your RSS feed it only has to return one tuple. Here is the offending function.
def _gen_georss_coords(value, swap=True, dims=2):
    # A generator of (lon, lat) pairs from a string of encoded GeoRSS
    # coordinates. Converts to floats and swaps order.
    latlons = map(float, value.strip().replace(',', ' ').split())
    nxt = latlons.__next__
    while True:
        t = [nxt(), nxt()][::swap and -1 or 1]
        if dims == 3:
            t.append(nxt())
        yield tuple(t)
There is something curious going on, perhaps to do with PEP 479 and the fact that there are two separate generators at work in the same function, that causes StopIteration to bubble up to the calling function. Anyway, I rewrote it in a somewhat more straightforward way.
def _gen_georss_coords(value, swap=True, dims=2):
    # A generator of (lon, lat) pairs from a string of encoded GeoRSS
    # coordinates. Converts to floats and swaps order.
    latlons = list(map(float, value.strip().replace(',', ' ').split()))
    # step by `dims` so each iteration consumes one complete coordinate tuple
    for i in range(0, len(latlons), dims):
        t = [latlons[i], latlons[i + 1]][::swap and -1 or 1]
        if dims == 3:
            t.append(latlons[i + 2])
        yield tuple(t)
You can define the above new function in your code, then execute the following to patch it into feedparser
saveit, feedparser._gen_georss_coords = (feedparser._gen_georss_coords, _gen_georss_coords)
Once you're done with it, you can restore feedparser to its previous state with
feedparser._gen_georss_coords, _gen_georss_coords = (saveit, feedparser._gen_georss_coords)
Or, if you're confident that this is solid, you can modify feedparser itself. Anyway, I did this trick and your RSS feed suddenly started working. Perhaps in your case it will also result in some improvement.
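For completeness, a minimal usage sketch (assuming the rewritten _gen_georss_coords from above is defined in your script):

import feedparser

# patch the rewritten generator into feedparser, parse the feed, then restore it
saveit, feedparser._gen_georss_coords = (feedparser._gen_georss_coords, _gen_georss_coords)
d = feedparser.parse('https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/2643123')
print(d.feed.get('title'))
for entry in d.entries:
    print(entry.title)
feedparser._gen_georss_coords = saveit  # restore the original implementation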

Hide code cells in Jupyter Notebook, execute with Papermill, transform to PDF with nbconvert

I want to run a Python program (not the command line) that uses papermill to execute a Jupyter notebook and then transform it into a PDF. The idea works, but I'm unable to hide the input cells.
For papermill, report_mode=True is supposed to hide input cells, but there seems to be a problem with Jupyter Classic (https://github.com/nteract/papermill/issues/130).
Other extensions like hide_input or HTML scripts are also not sufficient.
Maybe an nbconvert template for hiding cells is a solution, but I couldn't get that running.
My minimal code:
import papermill as pm
import nbformat
from nbconvert import PDFExporter

pm.execute_notebook(
    "Input.ipynb",
    "Output.ipynb",
    parameters=dict(id=id),
    report_mode=True,
)

notebook_filename = "Output.ipynb"
with open(notebook_filename) as f:
    nb = nbformat.read(f, as_version=4)

pdf_exporter = PDFExporter()
pdf_data, resources = pdf_exporter.from_notebook_node(nb)
So I'm looking for a way to execute the notebook, hide the input cells, and transform the notebook to a PDF. I want to use nbconvert in Python and not as a command-line tool, since the script will run daily.
I know you said you "don't want to use command line", but how about having your python script execute a subprocess command after running papermill? Mixing with this answer:
import subprocess
subprocess.call(['jupyter', 'nbconvert', '--to', 'pdf',
                 '--TemplateExporter.exclude_input=True', 'Output.ipynb'])
You're missing one small detail in your code:
pm.execute_notebook(
    "Input.ipynb",
    "Output.ipynb",
    parameters=dict(id=id),
    report_mode=True,
)

notebook_filename = "Output.ipynb"
with open(notebook_filename) as f:
    nb = nbformat.read(f, as_version=4)

pdf_exporter = PDFExporter()
# add this
pdf_exporter.exclude_input = True
pdf_data, resources = pdf_exporter.from_notebook_node(nb)
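One small addition not shown above: from_notebook_node returns the PDF as bytes, so you still need to write it to disk yourself, for example:

# persist the rendered PDF next to the executed notebook
with open("Output.pdf", "wb") as f:
    f.write(pdf_data)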
You can also use Ploomber for this; it has a few benefits, like running notebooks in parallel:
# convert.py
from pathlib import Path

from ploomber import DAG
from ploomber.tasks import NotebookRunner
from ploomber.products import File
from ploomber.executors import Parallel

dag = DAG(executor=Parallel())

# hide input and convert to PDF using the LaTeX converter
NotebookRunner(
    Path('input.ipynb'),
    File('output-latex.pdf'),
    dag=dag,
    # do not include input code (only cells' output)
    nbconvert_export_kwargs={'exclude_input': True},
    name='latex')

# hide input and convert to PDF using the webpdf converter (no need to install LaTeX!)
NotebookRunner(
    Path('input.ipynb'),
    File('output-web.pdf'),
    dag=dag,
    # do not include input code (only cells' output)
    nbconvert_export_kwargs={'exclude_input': True},
    # use webpdf converter
    nbconvert_exporter_name='webpdf',
    name='web')

# generate both PDFs
if __name__ == '__main__':
    dag.build()
Note: as of Ploomber 0.20, notebooks must have a "parameters" cell (you can add an empty one). See instructions here.
To run it:
pip install ploomber
python convert.py

pyldavis Unable to view the graph

I am trying to visually depict my topics in Python using pyLDAvis. However, I am unable to view the graph. Do we have to view the graph in the browser, or will it pop up upon execution? Below is my code:
import pyLDAvis
import pyLDAvis.gensim as gensimvis
print('Pyldavis ....')
vis_data = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary)
pyLDAvis.display(vis_data)
The program keeps running after executing the above commands. Where should I view my graph? Or where will it be stored? Is it integrated only with the IPython notebook? Kindly guide me through this.
P.S. My Python version is 3.5.
This does not work:
pyLDAvis.display(vis_data)
This will work for you:
pyLDAvis.show(vis_data)
I'm facing the same problem now.
EDIT:
My script looks as follows:
first part:
import pyLDAvis
import pyLDAvis.sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

print('start script')
tf_vectorizer = CountVectorizer(strip_accents='unicode', stop_words='english',
                                lowercase=True, token_pattern=r'\b[a-zA-Z]{3,}\b',
                                max_df=0.5, min_df=10)
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
# note: newer scikit-learn versions renamed n_topics to n_components
lda_tf = LatentDirichletAllocation(n_topics=20, learning_method='online')
print('fit')
lda_tf.fit(dtm_tf)
second part:
print('prepare')
vis_data = pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)
print('display')
pyLDAvis.display(vis_data)
The problem is in the line "vis_data = (...)". If I run the script, it prints 'prepare' and keeps running after that without printing anything else (so it never reaches the line print('display')).
Funny thing is, when I just run the whole script it gets stuck on that line, but when I run only the first part, go to my console, and execute just the single line "vis_data = pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)", it executes in a couple of seconds.
As for the graph, I saved it as HTML ("simple") and used the HTML file to view the graph.
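If it helps, the HTML route is a one-liner; a minimal sketch using pyLDAvis's save_html (the resulting file can be opened in any browser):

import pyLDAvis

# write the prepared visualization to a standalone HTML file
pyLDAvis.save_html(vis_data, 'lda_visualization.html')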
I ran into the same problem (I use PyCharm as my IDE). The problem is that pyLDAvis is developed for IPython (see the docs, https://media.readthedocs.org/pdf/pyldavis/latest/pyldavis.pdf, page 3).
My fix/workaround (a short sketch follows below):
make a dict of lda_tf, dtm_tf, tf_vectorizer (e.g. pyLDAviz_dict)
dump the dict to a file (e.g. mydata_pyLDAviz.pkl)
read the pkl file into a notebook (I did get some deprecation info from pyLDAvis, but that had no effect on the end result)
play around with pyLDAvis in the notebook
if you're happy with the view, dump it into HTML
The cause is (most likely) that pyLDAvis expects continuous user interaction (including a user-initiated "exit"). However, I would rather dump data from a smart IDE and read it into Jupyter than develop/code in a Jupyter notebook. That's pretty much like going back to before-emacs times.
From experience, this approach works quite nicely for other plotting routines.
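A rough sketch of that workaround; the dict and file names are just illustrative:

import pickle

# in the IDE script, after fitting the vectorizer and the LDA model:
pyLDAviz_dict = {'lda_tf': lda_tf, 'dtm_tf': dtm_tf, 'tf_vectorizer': tf_vectorizer}
with open('mydata_pyLDAviz.pkl', 'wb') as f:
    pickle.dump(pyLDAviz_dict, f)

# then, in a Jupyter notebook:
import pickle
import pyLDAvis
import pyLDAvis.sklearn

with open('mydata_pyLDAviz.pkl', 'rb') as f:
    d = pickle.load(f)

vis_data = pyLDAvis.sklearn.prepare(d['lda_tf'], d['dtm_tf'], d['tf_vectorizer'])
pyLDAvis.display(vis_data)  # or pyLDAvis.save_html(vis_data, 'lda.html')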
If you received a module error for pyLDAvis.gensim, then try this import instead:
import pyLDAvis.gensim_models
You get the error because of a new version update.
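A minimal sketch with the renamed module (newer pyLDAvis releases moved the gensim helper to pyLDAvis.gensim_models); the call itself stays the same:

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# same prepare/display flow as in the question, just with the renamed module
vis_data = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary)
pyLDAvis.display(vis_data)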
