I'm using a Jupyter Notebook on a Spark EMR cluster, want to learn more about a certain command but I don't know what the right technology stack is to search. Is that Spark? Python? Jupyter special syntax? Pyspark?
When I try to google it, I get only a couple results and none of them actually include the content I quoted. It's like it ignores the %%.
What does "%%spark_sql" do, what does it originate from, and what are arguments you can pass to it like -s and -n?
An example might look like
%%spark_sql -s true
select
*
from df
These are called magic commands/functions. Try running %pinfo %%spark_sql or %pinfo2 %%spark_sql in a Jupyter cell and see if it gives you detailed information about %%spark_sql.
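For context: in IPython/Jupyter, names starting with % are line magics and names starting with %% are cell magics. %%spark_sql is not part of core IPython, so it must be registered by whichever kernel or extension your EMR notebook environment loads, and its exact flags (like -s and -n) depend on that package. You can inspect what is registered in your own session with, for example:

```
%lsmagic          # list every line and cell magic currently registered
%%spark_sql?      # show the magic's docstring, which should document its flags
```

Note that search engines typically strip the % characters from queries, which is why quoting %%spark_sql turns up so little.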
I want my program to print the progress to the console as it is running using papermill. I'm using the following code:
print(str(int(float(percentage) * 100)))
inside a for loop, and it doesn't print the string to the console, so I have no idea how far along the progress is other than watching the stage part of papermill's output. The papermill invocation goes as follows:
/your/dir/>papermill "yournotebook.ipynb" "your notebook output.ipynb"
Any suggestions?
Have you used:
--progress-bar
Example:
/your/dir/>papermill "yournotebook.ipynb" "your notebook output.ipynb" --log-output --log-level DEBUG --progress-bar
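Two things matter here: papermill only relays a cell's stdout to the console when --log-output is passed, and buffered prints may not appear until the cell finishes. A minimal sketch of the loop, with items standing in for whatever you actually iterate over:

```python
items = ["a", "b", "c", "d"]  # hypothetical work items standing in for your real loop

for i, _ in enumerate(items):
    percentage = (i + 1) / len(items)
    # flush=True pushes the line out immediately instead of leaving it in the buffer;
    # papermill only forwards this stdout when invoked with --log-output
    print("progress: " + str(int(float(percentage) * 100)) + "%", flush=True)
```

With that in place, the command line from the answer above streams each progress line as it is produced.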
I have passed a parameter from JobScheduling to a Databricks notebook and tried capturing it inside the Python notebook using dbutils.widgets.get().
When I run the scheduled job, I get the error "No Input Widget defined", thrown as an InputWidgetNotDefined exception.
May I know the reason? Thanks.
To use widgets, you first need to create them in the notebook.
For example, to create a text widget, run:
dbutils.widgets.text('date', '2021-01-11')
After the widget has been created once on a given cluster, it can be used to supply parameters to the notebook on subsequent job runs.
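A minimal sketch of the create-then-read pattern, assuming a job parameter named date (this only runs inside a Databricks notebook, where dbutils is predefined by the runtime):

```
# Create the widget once; the second argument is the default value.
dbutils.widgets.text('date', '2021-01-11')

# On scheduled runs, a job parameter with the same name overrides the default.
date = dbutils.widgets.get('date')
print(date)
```

If the widget was never created, dbutils.widgets.get('date') is what raises the InputWidgetNotDefined error from the question.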
I have a Spark 2.1 standalone cluster installed on 2 hosts.
There are two Zeppelin (0.7.1) notebooks:
first one: prepares data, makes aggregations, and saves the output to files with:
data.write.option("header", "false").csv(file)
second one: a notebook with shell paragraphs that merge all part* files from the Spark output into one file
I would like to ask about 2 cases:
How to configure Spark to write output to one file
After notebook 1 completes, how to add relations so that all paragraphs in notebook 2 run, e.g.:
NOTEBOOK 1:
data.write.option("header", "false").csv(file)
"run notebook2"
NOTEBOOK2:
shell code
Have you tried adding a paragraph at the end of note1 that executes note2 through the Zeppelin API? You can optionally add a loop that checks, also through the API, whether all paragraphs have finished executing.
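For the single-file question: a common approach is to collapse the DataFrame to one partition before writing, so Spark emits a single part file. This is reasonable only when the result fits comfortably in one executor's memory. A sketch, assuming data and file are as in the question:

```
# coalesce(1) reduces the output to a single partition -> a single part-* file
data.coalesce(1).write.option("header", "false").csv(file)
```

Note the output is still a directory containing one part-* file; renaming or moving it to a final filename would still need a shell step or a filesystem API call, so this removes the merge but not necessarily the second notebook.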
I am trying to visually depict my topics in Python using pyLDAvis. However, I am unable to view the graph. Do we have to view the graph in a browser, or will it pop up upon execution? Below is my code:
import pyLDAvis
import pyLDAvis.gensim as gensimvis
print('Pyldavis ....')
vis_data = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary)
pyLDAvis.display(vis_data)
The program stays in execution mode after running the above commands. Where should I view my graph? Or where will it be stored? Is it integrated only with the IPython notebook? Kindly guide me through this.
P.S. My Python version is 3.5.
This does not work:
pyLDAvis.display(vis_data)
This will work for you:
pyLDAvis.show(vis_data)
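pyLDAvis.display() only renders inline inside Jupyter/IPython notebooks, which is why it appears to do nothing in a plain script, while pyLDAvis.show() starts a blocking local web server and opens the browser for you. If you don't want a blocking server, another option (assuming vis_data came from a prepare() call as above) is to write the visualization to a standalone HTML file and open it in a browser yourself:

```
pyLDAvis.save_html(vis_data, 'lda_vis.html')
```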
I'm facing the same problem now.
EDIT:
My script looks as follows:
first part:
import pyLDAvis
import pyLDAvis.sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
print('start script')
tf_vectorizer = CountVectorizer(strip_accents='unicode', stop_words='english', lowercase=True, token_pattern=r'\b[a-zA-Z]{3,}\b', max_df=0.5, min_df=10)
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
# note: n_topics was renamed to n_components in scikit-learn >= 0.19
lda_tf = LatentDirichletAllocation(n_topics=20, learning_method='online')
print('fit')
lda_tf.fit(dtm_tf)
second part:
print('prepare')
vis_data = pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)
print('display')
pyLDAvis.display(vis_data)
The problem is in the line "vis_data = (...)". If I run the script, it prints 'prepare' and keeps on running after that without printing anything else (so it never reaches the line print('display')).
The funny thing is, when I run the whole script it gets stuck on that line, but when I run the first part, go to my console, and execute just the single line vis_data = pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer), it completes in a couple of seconds.
As for the graph, I saved it as HTML ("simple") and used the HTML file to view the graph.
I ran into the same problem (I use PyCharm as IDE). The problem is that pyLDAvis is developed for IPython (see the docs, https://media.readthedocs.org/pdf/pyldavis/latest/pyldavis.pdf, page 3).
My fix/workaround:
make a dict of lda_tf, dtm_tf, tf_vectorizer (e.g., pyLDAviz_dict)
dump the dict to a file (e.g. mydata_pyLDAviz.pkl)
read the pkl file into the notebook (I did get some deprecation info from pyLDAvis, but that had no effect on the end result)
play around with pyLDAvis in the notebook
if you're happy with the view, dump it to HTML
The cause is (most likely) that pyLDAvis expects continuous user interaction (including a user-initiated "exit"). However, I would rather dump data from a smart IDE and read it into Jupyter than develop/code in a Jupyter notebook. That's pretty much like going back to pre-Emacs times.
From experience, this approach works quite nicely for other plotting routines.
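The dump/read steps above can be sketched with the standard pickle module. The variable and file names are the hypothetical ones from the list; any picklable objects behave the same way:

```python
import os
import pickle
import tempfile

# Stand-ins for lda_tf, dtm_tf, tf_vectorizer -- any picklable objects work the same.
pyLDAviz_dict = {"lda_tf": [0.1, 0.9], "dtm_tf": [[1, 0], [0, 2]], "tf_vectorizer": "vec"}

path = os.path.join(tempfile.gettempdir(), "mydata_pyLDAviz.pkl")

# Dump from the IDE-side script...
with open(path, "wb") as f:
    pickle.dump(pyLDAviz_dict, f)

# ...and read it back inside the notebook session.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```

From there, the notebook can call pyLDAvis.sklearn.prepare() on the loaded objects interactively.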
If you received a module error for pyLDAvis.gensim, then try this import instead:
import pyLDAvis.gensim_models
You get the error because the module was renamed in a newer version of pyLDAvis.