When I wanted to do a sentiment analysis project, I searched a lot online and at last landed on this website, which explains the code but does not explain how to use Spark with respect to the code, i.e. where to add the code.
Website: http://stdatalabs.blogspot.in/2017/09/twitter-sentiment-analysis-using-spark.html?m=1
It would be of great help if anyone could explain this to me completely, as I am a beginner and this is my first project in big data.
Thank you.
At the bottom of the post there is a link to the GitHub repo (https://github.com/stdatalabs/sparkNLP-elasticsearch); you should check that out (literally).
The main class is com.stdatalabs.SparkES.TwitterSentimentAnalysis according to the pom.xml.
So running mvn package will yield an executable .jar (run it with java -jar).
Running the jar will prompt you for some Twitter config (keys, etc.) and save the results to a local Elasticsearch cluster using the hardcoded index (and mapping) twitter_020717/tweet.
You can now alter the code any way you want, build, run, and check the results.
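To make the flow concrete, here is a rough PySpark sketch of the same idea: score a few tweets and write them to the hardcoded twitter_020717/tweet index. The actual repo is Scala and uses Stanford CoreNLP for sentiment; this sketch swaps in TextBlob and assumes the elasticsearch-hadoop connector is available to Spark, so treat it as an illustration of the pipeline, not the repo's code.

```python
# Hypothetical PySpark illustration -- not the Scala code from sparkNLP-elasticsearch.
# Assumes pyspark and textblob are installed, and the elasticsearch-hadoop
# connector jar is on the Spark classpath (e.g. via --packages).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from textblob import TextBlob

spark = SparkSession.builder.appName("TwitterSentimentSketch").getOrCreate()

# Stand-in for the tweets the real job pulls from the Twitter API.
tweets = spark.createDataFrame(
    [("1", "Spark makes big data fun"), ("2", "this build is broken again")],
    ["id", "text"],
)

# Simple polarity score in place of the Stanford CoreNLP sentiment used by the repo.
polarity = udf(lambda t: TextBlob(t).sentiment.polarity, DoubleType())
scored = tweets.withColumn("sentiment", polarity("text"))

# Save to a local Elasticsearch cluster using the hardcoded index/mapping.
(scored.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.resource", "twitter_020717/tweet")
    .mode("append")
    .save())
```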
My company is in the process of migrating all our pipelines from AzureML over to Databricks, and I have been tasked with refactoring one of our existing pipelines built with the azureml-sdk (using classes such as PipelineData, PythonScriptStep, etc.) and converting it into a dbx pipeline that uses a deployment.yml file.
I have found the "Deployment file reference" on the dbx documentation page, and I think it's quite adequate compared to some of AzureML's documentation. However, if I had an example project to complement that page, it would help me greatly to put it into practice.
Are there any repos/sources that give an example of building a dbx pipeline that uses .py files instead of notebooks?
Please take a look at the Quickstart doc which generates a sample project and walks you through it step by step.
If you're looking for a more profound and in-depth example oriented towards MLOps practices, take a look at the following session: MLOps on Databricks: A How-To Guide. It also links to an example repo that uses dbx.
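To give a feel for the .py-based approach, here is a minimal, hypothetical task file that a dbx workflow could point at instead of a notebook (the module and table names are made up). Your deployment.yml would reference a file like this, for example through a spark_python_task's python_file field; check the Deployment file reference for the exact structure your dbx version expects.

```python
# tasks/ingest_task.py -- hypothetical module referenced from deployment.yml
# (e.g. via a spark_python_task; verify the exact field names against the
# dbx Deployment file reference for your version).
from pyspark.sql import SparkSession


def main() -> None:
    spark = SparkSession.builder.getOrCreate()

    # Placeholder transformation standing in for a real pipeline step.
    df = spark.range(1000).withColumnRenamed("id", "record_id")

    # Hypothetical output table name.
    df.write.mode("overwrite").saveAsTable("example_db.ingested_records")


if __name__ == "__main__":
    main()
```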
I'm trying to follow this guide to try out Keras with SystemML and Spark:
https://towardsdatascience.com/how-to-train-your-neural-networks-in-parallel-with-keras-and-apache-spark-ea8a3f48cae6#
However, I could not find the free Spark plan mentioned there on IBM Watson. Can anyone help me find it, or is it just not available anymore?
Thanks!
In IBM Watson Studio (you can create a free account), go to your projects (or create one). Within the project, go to the "Environments" tab and choose one of your existing environments or create a new one; this lets you choose the Spark environment described at the URL you mentioned.
Thanks!
Actually, that was exactly what I was trying. But I was confused because it says it will consume capacity units per hour, while the blog said it would be free.
Now I found out that one gets 50 capacity units free each month, so I can try the code.
My plan is to query the GA API with Python 3 and google2Pandas.
My problem so far is that I don't know where to start. When I look at the google2pandas README it looks easy, but I have trouble building my own script from it and implementing the OAuth2 part.
What is the right way to start from these boilerplates?
All those functions are a bit confusing to me.
What do I really need in order to use the Analytics v4 API and fetch some simple data for my dashboard? Which parameters do I have to set, and how or where in the file should I do that? Another question: do I have to use those functions in a new Python file, or can I just start with _panalysis_ga.py?
It would be really helpful if you could guide me here or at least steer me in the right direction with an example.
The linked repository kind of has the answer, but I appreciate it's not always clear if you've never seen it before. There is no need to do anything for the OAuth2 process, as the library seems to take care of that.
Use pip to install the google2Pandas library on your machine.
You then need to create a GCP account if you don't already have one, and follow step 1 here to get the credentials.
You can then use the Quick Demo shown in the README file of the repository (modify the query to your needs).
EDIT
Look at the New and Improved section of the README file, as it is the most up-to-date one.
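As a rough sketch of what the "New and Improved" (v4) usage looks like: you build a Reporting API v4 request body as a plain dict and pass it, together with your credentials file, to the library's query object. The names below (GoogleAnalyticsQueryV4, the secrets parameter, execute_query) are based on my reading of the README and may differ in your version, so verify them against the repo.

```python
# Hedged sketch based on the google2pandas README ("New and Improved" section).
# Class/parameter names (GoogleAnalyticsQueryV4, secrets=, execute_query) may
# differ in your version of the library -- verify against the repo.
from google2pandas import GoogleAnalyticsQueryV4

# A Reporting API v4 request body; replace the viewId with your own GA view ID.
query = {
    "reportRequests": [{
        "viewId": "123456789",  # placeholder
        "dateRanges": [{"startDate": "8daysAgo", "endDate": "today"}],
        "dimensions": [{"name": "ga:date"}, {"name": "ga:pagePath"}],
        "metrics": [{"expression": "ga:pageviews"}],
    }]
}

# The credentials JSON from step 1 of the GCP setup; the library handles OAuth2.
conn = GoogleAnalyticsQueryV4(secrets="./analytics_secrets.json")
df = conn.execute_query(query)  # returns a pandas DataFrame
print(df.head())
```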
I am trying to query Cassandra using Apache Drill. The only connector I could find is here:
http://www.confusedcoders.com/bigdata/apache-drill/sql-on-cassandra-querying-cassandra-via-apache-drill
However, this does not build; it fails with an "artifact not found" error. I also had another developer who is more versed in these tools take a stab at it, but he had no luck either.
I tried contacting the developer of the plugin I referenced, but the blog does not work and won't let me post comments. Has anyone got this plugin to work (and if so, how?), or is there another plugin or method I can use to connect Apache Drill to Cassandra? If anyone could show me how to connect and execute a simple SQL query, that would be much appreciated.
I looked at the latest Cassandra storage plugin patch and the latest Apache Drill source. The Drill code has changed and the patch can no longer be applied.
I then manually took the patch apart (it is mostly diff output). Most of the patch was new classes, which I could easily add to the latest Drill source tree, and most of the other updates were easy to insert into the current source. There were two specific classes that required some minor code modifications/extensions. I rebuilt the distribution from the modified source and installed the Drill servers on a 3-node cluster. The Cassandra schema failed to initialize properly, throwing a null pointer exception in one of the new classes. This leads me to believe that the (latest) modified storage plugin is incompatible with the latest version of Cassandra. Since the author of the original storage plugin is unreachable and no one else is stepping up to support the code, this is a dead horse. Beat it if you must.
I was the author of the patch, written a year back. I could not get it merged into Drill then, and later got occupied with other stuff :(
With so many changes to Drill internals, I am not sure what amount of welding would be needed at this point to get it working. Please use the code just as a reference for writing a Drill storage plugin.
I have added a banner at the top of the blog post to save fellow developers' hours.
I don't know if anyone is still interested in this topic, but I've been experimenting with this plugin and got it to work with Drill 1.18-SNAPSHOT. Here is a link to my branch with this code: 1. My plan is to submit this as a PR for Drill, but it still needs some work. This code will successfully query Cassandra 3.11.5 (the latest stable version).
I'm searching for a Stanford_CoreNLP plugin with Stanford NER (not StanfordParser or StanfordPOSTagger) for GATE (General Architecture for Text Engineering). I found some information about the plugin here, but I couldn't find it integrated with GATE (version 8) by default. I also tried to find a link to download the plugin, but couldn't find one...
Does anyone have a clue about how to activate it or where to download it from?
Thank you in advance...
As Jon mentioned, somehow the Stanford_CoreNLP plugin was left out of the most recent release package of GATE. However, it is included in the daily snapshots built by their Jenkins server. You can download those here:
http://jenkins.gate.ac.uk/job/GATE-Nightly/lastSuccessfulBuild/
Unfortunately, there is no pre-built .gapp file for Stanford NER included with the GATE plugin. This means it isn't as simple as loading an application file to run Stanford NER inside GATE -- there's quite a bit more configuration involved. You might be able to build a custom .gapp file of your own, but in the meantime, the NER.java file in the source code for the Stanford NER plugin will help you get started running it inside GATE:
http://sourceforge.net/p/gate/code/HEAD/tree/gate/trunk/plugins/Stanford_CoreNLP/src/gate/stanford/NER.java