I am fairly new to Spark and starting my first project. I need to analyze Twitter data for sentiment analysis, using the TextBlob library in Python.
I am able to get the Twitter data and have the DStream created after all the necessary transformations. The challenge I'm facing is how to make the DStream data (which contains the tweet text) available to TextBlob for analysis, since TextBlob accepts only string values. How can I get the DStream values into TextBlob for sentiment analysis? Any pointers are highly appreciated.
Thanks,
Kary
I recently tried using TextBlob on a streaming dataset and wrote a small function to convert tweets to text and apply TextBlob.
You could write something like this:
from textblob import TextBlob

def getSentiment(self, text):
    # benchmark, positive, negative and noresponse are assumed to be
    # numeric thresholds/labels defined elsewhere (e.g. benchmark = 0.0)
    sentiment = TextBlob(text).sentiment.polarity
    if sentiment > float(benchmark):
        return float(positive)
    elif sentiment < float(benchmark):
        return float(negative)
    else:
        return float(noresponse)
and then write a UDF that wraps that method (obj here is an instance of the class that defines getSentiment):
sentiment_score_udf = F.udf(lambda x: obj.getSentiment(x), FloatType())
Here F is pyspark.sql.functions, and FloatType comes from pyspark.sql.types.
and then you can use the expression below (for example inside a select) to calculate the sentiment score:
sentiment_score_udf(col("value")).alias("sentiment_score")
hope this helps
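For the original DStream use case, a minimal sketch (assuming tweets_dstream is a hypothetical name for the DStream of tweet-text strings built by the earlier transformations) is to apply TextBlob inside a map, since each DStream record is already a plain string:
from textblob import TextBlob

def score_tweet(text):
    # each record in the DStream is a plain string, which is all TextBlob needs
    return text, TextBlob(text).sentiment.polarity

scored_dstream = tweets_dstream.map(score_tweet)
scored_dstream.pprint()
The thresholded getSentiment logic above can be dropped into that map instead, if you prefer the bucketed scores.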
I am using Spark NLP for Healthcare (John Snow Labs). In the code below, we export the results of one particular document (index 12) to an HTML file.
Visualize results
from sparknlp_display import NerVisualizer
NerVisualizer().display(
result = result.collect()[12],
label_col = 'ner_chunk',
document_col = 'document',
save_path='./export.html'
)
How do I export the results of all the records (the whole data frame) to a CSV/Excel file?
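One possible approach, sketched here and not tested against the Healthcare release (it assumes result is the pipeline's output DataFrame and that the ner_chunk annotations carry the usual result and metadata fields), is to flatten the chunks and write them out with Spark's CSV writer:
from pyspark.sql import functions as F

# one row per detected chunk, keeping the chunk text and its entity label
chunks = (
    result
    .select(F.explode("ner_chunk").alias("chunk"))
    .select(
        F.col("chunk.result").alias("chunk_text"),
        F.col("chunk.metadata").getItem("entity").alias("entity"),
    )
)

# coalesce(1) produces a single CSV part file; drop it for very large outputs
chunks.coalesce(1).write.mode("overwrite").option("header", True).csv("./export_csv")
The CSV can then be opened in Excel; Spark has no built-in Excel writer.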
As mentioned above, I'm processing a 64 GB CSV file on an AWS EMR cluster using a Jupyter notebook. I concatenated my two columns into one, docum = concat(title, abstract). This is a sample of the data:
+--------------------+
|               docum|
+--------------------+
|Clinical features...|
|Nitric oxide: a p...|
|Surfactant protei...|
|Role of endotheli...|
|Gene expression i...|
+--------------------+
only showing top 5 rows
The data set is too large to post a full document here, but I need help removing the stopwords so I can run K-means on this data.
I tried using gensim, but the module is not available on PySpark; I also tried pulling the text into a Python list, but the file was too large and I ran out of memory. This is the last step I did:
df2=df.select(concat(df.title,df.abstract))
df2 = df2.withColumnRenamed("concat(title, abstract)","docum")
Now I just need to figure out the stopwords so I can continue.
Thank you for your time.
You can use a Spark ML transformer for that:
from pyspark.ml.feature import Tokenizer, StopWordsRemover
text = """
The data set is too large to even post a full document on here. But I need
help removing the stopwords so I can run Kmeans on this data.
I tried using the gensim but the module is not available on pyspark, I tried throwing it into a
python list but it was too large of a file I ran out or memory. This is the last Step I did
"""
df = spark.createDataFrame([(1, text)], ["id", "text"])
# separate the text into words
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words_df = tokenizer.transform(df)
# remove defined stop words
remover = StopWordsRemover(inputCol="words", outputCol="result", stopWords=["the", "a", "is", "it", "to"])
final_df = remover.transform(words_df).select("result")
# display() is notebook-specific (e.g. Databricks); use final_df.show() elsewhere
display(final_df)
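For the 64 GB dataset in the question, a minimal sketch (assuming df2 with its single docum column, as built above) reuses the same transformers with Spark's built-in English stop word list instead of a hand-written one:
from pyspark.ml.feature import Tokenizer, StopWordsRemover

# split each document into words
tokenizer = Tokenizer(inputCol="docum", outputCol="words")
words_df = tokenizer.transform(df2)

# loadDefaultStopWords ships a standard English stop word list with Spark ML
remover = StopWordsRemover(
    inputCol="words",
    outputCol="filtered",
    stopWords=StopWordsRemover.loadDefaultStopWords("english"),
)
clean_df = remover.transform(words_df)
The filtered column can then feed a CountVectorizer/TF-IDF step before K-means.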
Links:
StopWordsRemover
I have managed to get the BERT model working with the John Snow Labs spark-nlp library. I am able to save the "trained model" to disk as follows.
Fit Model
df_bert_trained = bert_pipeline.fit(textRDD)
df_bert = df_bert_trained.transform(textRDD)
save model
df_bert_trained.write().overwrite().save("/home/XX/XX/trained_model")
However,
First, as per the docs here https://nlp.johnsnowlabs.com/docs/en/concepts, it's stated that one can load the model as
EmbeddingsHelper.load(path, spark, format, reference, dims, caseSensitive)
but it's unclear to me what the variable "reference" represents at this point.
Second, has anyone managed to save the BERT embeddings as a pickle file in python?
In Spark NLP, BERT comes as a pre-trained model. That means it's already a model that was trained, fitted, etc. and saved in the right format.
That being said, there is no reason to fit or save it again. You can, however, save the result once you transform your DataFrame into a new DataFrame that has BERT embeddings for each token.
Example:
Start a Spark Session in spark-shell with Spark NLP package
spark-shell --packages JohnSnowLabs:spark-nlp:2.4.0
import com.johnsnowlabs.nlp.annotators._
import com.johnsnowlabs.nlp.base._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

// Download and load the pretrained BERT model
val embeddings = BertEmbeddings.pretrained(name = "bert_base_cased", lang = "en")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(true)
  .setPoolingLayer(0)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentence,
    tokenizer,
    embeddings
  ))

// Test and transform
val testData = Seq(
  "I like pancakes in the summer. I hate ice cream in winter.",
  "If I had asked people what they wanted, they would have said faster horses"
).toDF("text")

val predictionDF = pipeline.fit(testData).transform(testData)
The predictionDF is a DataFrame that contains BERT embeddings for each token inside your dataset. The BertEmbeddings pre-trained models come from TF Hub, which means they are the exact same pre-trained weights published by Google. All 5 models are available:
bert_base_cased (en)
bert_base_uncased (en)
bert_large_cased (en)
bert_large_uncased (en)
bert_multi_cased (xx)
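If the goal on the Python side is simply to persist the embeddings (rather than pickling the model), one hedged option, assuming a predictionDF produced by an equivalent PySpark pipeline, is to write the transformed DataFrame to Parquet:
# predictionDF is assumed to be the transformed DataFrame that holds the
# "embeddings" column produced by the pipeline above
predictionDF.write.mode("overwrite").parquet("/tmp/bert_token_embeddings")

# load it back later without re-running BERT
reloaded = spark.read.parquet("/tmp/bert_token_embeddings")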
Let me know if you have any questions or problems and I'll update my answer.
References:
https://github.com/JohnSnowLabs/spark-nlp
https://github.com/JohnSnowLabs/spark-nlp-models
https://github.com/JohnSnowLabs/spark-nlp-workshop
I use Spark 2.0.0 and I'd like to train an LDA model on a Tweets dataset. When I try to execute
val ldaModel = new LDA().setK(3).run(corpus)
I get this error
error: reference to LDA is ambiguous;
it is imported twice in the same scope by import org.apache.spark.ml.clustering.LDA and import org.apache.spark.mllib.clustering.LDA
Could someone please help me?
Thanks!
It looks like you have both of the following import statements:
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.mllib.clustering.LDA
You would need to remove one of them.
If you are using Spark ML (the DataFrame-based API), the proper syntax would be:
import org.apache.spark.ml.clustering.LDA
/*feature extraction step*/
val lda = new LDA().setK(3)
val model = lda.fit(corpus)
If you are using the RDD-based API, then you would have to write:
import org.apache.spark.mllib.clustering.LDA
/*feature extraction step*/
val lda = new LDA().setK(3)
val model = lda.run(corpus)
I am currently working on a Sparkling Water application and I am a total beginner in Spark and H2O.
What I want to do:
load an input text file
create a word2vec model
create a DataFrame with a column word and a column Vector
use the DataFrame as input for H2O
When creating the model I get a map, but I don't know how to create a DataFrame from it. The output should look like this:
word | Vector
assert | [0.3, 0.4.....]
sense | [0.6, 0.2.....]
and so on.
This is my code so far:
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pysparkling import *
import h2o
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row
# Starting h2o application on spark cluster
hc = H2OContext(sc).start()
# Loading input file
inp = sc.textFile("examples/custom/text8.txt").map(lambda row: row.split(" "))
# building the word2vec model with a vector size of 10
word2vec = Word2Vec()
model = word2vec.setVectorSize(10).fit(inp)
# Sanity check
model.findSynonyms("property",5)
# assign the vector representation (a map) to a variable
wordVectorsDF = model.getVectors()
# TODO: transform wordVectorsDF (a map of word -> vector) into a DataFrame
Is there any approach to this, or functions provided by Spark?
Thanks in advance
I found out that there are two libraries for a Word2Vec transformation - I don't know why.
from pyspark.mllib.feature import Word2Vec
from pyspark.ml.feature import Word2Vec
The second one returns a DataFrame from the function getVectors() and has different parameters for building a model than the first one.
Maybe somebody can comment on why there are two different libraries.
Thanks in advance.
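For reference, a minimal sketch of the DataFrame-based API (assuming inp is the RDD of token lists from the question and an active SparkSession):
from pyspark.ml.feature import Word2Vec

# the ml version expects a DataFrame with a column of string arrays
tokens_df = inp.map(lambda words: (words,)).toDF(["tokens"])

word2vec = Word2Vec(vectorSize=10, minCount=1, inputCol="tokens", outputCol="vector")
model = word2vec.fit(tokens_df)

# getVectors() here returns a DataFrame with a "word" and a "vector" column
model.getVectors().show(5)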
First of all, H2O doesn't support a Vector column type, so you'd have to make a frame like this:
word | V1 | V2 | ...
assert | 0.3 | 0.4 | ...
sense | 0.6 | 0.2 | ...
Now for the actual question: no, since it's a Scala Map. We provide ways to create frames from data sources (files on HDFS/S3, databases, etc.) or conversions from RDDs/DataFrames, but not from Java/Scala collections. Writing one would be possible but quite cumbersome.
Not the most performant solution but the easiest code-wise would be to make a DF (or RDD) first (by running sc.parallelize on map.toSeq) and then convert it to an H2OFrame:
import hc._
val wordsDF = sc.parallelize(wordVectorsDF.toSeq).toDF
val h2oFrame = asH2OFrame(wordsDF)
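A rough Python equivalent of the same idea (hypothetical: it assumes the mllib model's getVectors() map from the question, a vector size of 10, an active SparkSession named spark, and pysparkling's as_h2o_frame on hc):
# explode each word vector into separate numeric columns (V1..V10),
# since H2O has no Vector column type
rows = [(word, *[float(x) for x in vec]) for word, vec in wordVectorsDF.items()]
columns = ["word"] + ["V{}".format(i + 1) for i in range(10)]
words_df = spark.createDataFrame(rows, columns)

# convert the Spark DataFrame into an H2OFrame
h2o_frame = hc.as_h2o_frame(words_df)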