Spark NLP for Healthcare visualization library - apache-spark

I'm using Spark NLP for Healthcare (John Snow Labs). In the code below, we export the results for one particular document (index 12) to an HTML file.
Visualize results
from sparknlp_display import NerVisualizer

NerVisualizer().display(
    result=result.collect()[12],
    label_col='ner_chunk',
    document_col='document',
    save_path='./export.html'
)
How do I export the results for all the records (the whole DataFrame) into a CSV/Excel file?
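One way to approach this (independent of the visualizer) is to flatten the annotation column yourself and write it out with Spark's CSV writer. The sketch below is only an illustration, not a definitive recipe: it assumes the usual Spark NLP annotation schema, where each entry of ner_chunk carries result, begin, end and a metadata map with an entity key; adjust the column names to your own pipeline.

from pyspark.sql import functions as F

# Explode the ner_chunk annotations into one row per chunk, then pull the
# fields of interest out into plain columns (schema assumed, see above).
flat = (
    result
    .select(F.explode("ner_chunk").alias("chunk"))
    .select(
        F.col("chunk.result").alias("chunk_text"),
        F.col("chunk.metadata").getItem("entity").alias("entity"),
        F.col("chunk.begin").alias("begin"),
        F.col("chunk.end").alias("end"),
    )
)

# Write everything as a single CSV file; Excel can open CSV directly.
flat.coalesce(1).write.mode("overwrite").option("header", True).csv("./export_csv")

If a native .xlsx file is required, the flattened DataFrame can also be converted with flat.toPandas().to_excel('export.xlsx'), provided it fits in driver memory.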

Related

Running PySpark on AWS EMR: need help removing stop words from a DataFrame

As mentioned above, I'm processing a 64 GB CSV file on an AWS EMR cluster using a Jupyter notebook. I concatenated my two columns into one, docum = concat(title, abstract). This is a sample of the data:
+--------------------+
|               docum|
+--------------------+
|Clinical features...|
|Nitric oxide: a p...|
|Surfactant protei...|
|Role of endotheli...|
|Gene expression i...|
+--------------------+
only showing top 5 rows
The data set is too large to post a full document here, but I need help removing the stop words so I can run K-means on this data.
I tried using gensim, but the module is not available in PySpark; I also tried collecting the text into a Python list, but the file was too large and I ran out of memory. This is the last step I did:
df2=df.select(concat(df.title,df.abstract))
df2 = df2.withColumnRenamed("concat(title, abstract)","docum")
Now I just need to figure out the stop words so I can continue.
Thank you for your time.
You can use a Spark ML transformer for that:
from pyspark.ml.feature import Tokenizer, StopWordsRemover
text = """
The data set is too large to even post a full document on here. But I need
help removing the stopwords so I can run Kmeans on this data.
I tried using the gensim but the module is not available on pyspark, I tried throwing it into a
python list but it was too large of a file I ran out or memory. This is the last Step I did
"""
df = spark.createDataFrame([(1, text)], ["id", "text"])
# separate text into words
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words_df = tokenizer.transform(df)
# remove defined stop words
remover = StopWordsRemover(inputCol="words", outputCol="result", stopWords=["the", "a", "is", "it", "to"])
final_df = remover.transform(words_df).select("result")
display(final_df)
Links:
StopWordsRemover
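Applied to the DataFrame from the question, a minimal sketch could look like the following; it assumes the df2 DataFrame with its docum column from above and uses Spark's built-in English stop-word list instead of a hand-written one.

from pyspark.ml.feature import Tokenizer, StopWordsRemover

# Tokenize the concatenated title+abstract column, then drop Spark's built-in
# English stop words; 'df2' and its 'docum' column come from the question.
tokenizer = Tokenizer(inputCol="docum", outputCol="words")
remover = StopWordsRemover(
    inputCol="words",
    outputCol="filtered",
    stopWords=StopWordsRemover.loadDefaultStopWords("english"),
)

tokens_df = tokenizer.transform(df2)
clean_df = remover.transform(tokens_df).select("docum", "filtered")
clean_df.show(5, truncate=False)

The filtered column of token arrays can then be fed into a feature extractor (e.g. CountVectorizer or HashingTF) before running K-means.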

Reading Unzipped Shapefiles stored in AWS S3 from AWS EMR Cluster using PySpark in Jupyter Notebook

I'm completely new to AWS EMR and Apache Spark. I'm trying to assign GeoIDs to residential properties using shapefiles, but I'm not able to read the shapefiles from my S3 bucket. Please help me understand what is going on, as I couldn't find any answer on the internet that explains this exact problem.
import shapefile
import pandas as pd
from shapely.geometry import shape  # needed for the centroid calculation

def read_shapefile(shp_path):
    """
    Read a shapefile into a Pandas dataframe with a 'coords' column holding
    the geometry information. This uses the pyshp package.
    """
    # read file, parse out the records and shapes
    sf = shapefile.Reader(shp_path)
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]
    # write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df

read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10")
[Screenshot: the files that I want to read]
[Screenshot: the error I'm getting while reading from the bucket]
I really want to read these shapefiles on the AWS EMR cluster, as it's not possible for me to work on them individually on my local machine. Any kind of help is appreciated.
I was able to read my shapefiles from the S3 bucket as binary objects first, then build a wrapper function around that, and finally pass the individual file objects to the shapefile.Reader() method as the .shp, .shx and .dbf parts separately.
This was happening because PySpark cannot read formats that are not supported by the SparkContext. I found this link helpful: Using pyshp to read a file-like object from a zipped archive.
My solution:
import io
import shapefile
import pandas as pd
from shapely.geometry import shape

def read_shapefile(shp_path):
    # read all matching files from S3 as (path, bytes) pairs
    blocks = sc.binaryFiles(shp_path)
    block_dict = dict(blocks.collect())
    # hand the .shp, .shx and .dbf parts to pyshp as file-like objects
    sf = shapefile.Reader(
        shp=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shp")][0]]),
        shx=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shx")][0]]),
        dbf=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".dbf")][0]]),
    )
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]
    # write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df

block_shapes = read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10*")
This works fine without breaking.

Spark streaming using TextBlob for sentiment analysis

I am fairly new to Spark and starting my first project. I need to analyze Twitter data for sentiment analysis, and I need to use the TextBlob library in Python for doing it.
I am able to get the Twitter data and have the DStream created after all the necessary transformations. The challenge I'm facing is how to make the DStream data (which holds the tweet text) available to TextBlob for analysis, since TextBlob accepts only string values. How can I get the DStream value into TextBlob for sentiment analysis? Any pointers are highly appreciated.
Thanks,
Kary
I recently tried using TextBlob on a streaming dataset and wrote a small function to convert tweets to text and apply TextBlob, so you may write something like this:
def getSentiment(self, text):
    # 'benchmark', 'positive', 'negative' and 'noresponse' are assumed to be
    # threshold and label values defined elsewhere in the class.
    sentiment = TextBlob(text).sentiment.polarity
    if sentiment > float(benchmark):
        return float(positive)
    elif sentiment < float(benchmark):
        return float(negative)
    else:
        return float(noresponse)
Then write a UDF that accepts the text:
sentiment_score_udf = F.udf(lambda x: obj.getSentiment(x), FloatType())
Here F is pyspark.sql.functions.
Then you can use the expression below to calculate the sentiment score:
sentiment_score_udf(col("value")).alias("sentiment_score")
Hope this helps.
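Put together, a minimal sketch might look like the following. The threshold and label values are hypothetical placeholders, and tweets_df with a value column is assumed to be the (streaming) DataFrame holding the tweet text.

from textblob import TextBlob
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

# Hypothetical threshold and label values; adjust to your own scheme.
BENCHMARK, POSITIVE, NEGATIVE, NO_RESPONSE = 0.0, 1.0, -1.0, 0.0

def get_sentiment(text):
    # Guard against null/empty tweets coming out of the stream.
    if not text:
        return NO_RESPONSE
    polarity = TextBlob(text).sentiment.polarity
    if polarity > BENCHMARK:
        return POSITIVE
    elif polarity < BENCHMARK:
        return NEGATIVE
    return NO_RESPONSE

sentiment_score_udf = F.udf(get_sentiment, FloatType())

# 'tweets_df' is assumed to be a DataFrame with a 'value' column of tweet text.
scored_df = tweets_df.withColumn(
    "sentiment_score", sentiment_score_udf(F.col("value"))
)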

save() on a Pyspark ML Word2vec model is creating empty folders

I'm trying to save a word2vec model that I built in pyspark on spark 2.0.
word2vec_model.write().overwrite().save('filepath/word2vec')
This successfully finishes and creates 2 sub-folders (data & metadata) under the folder word2vec but these 2 subfolders are empty except for an empty file titled _SUCCESS.
And subsequently the load fails.
w2vw = Word2Vec.load('filepath/word2vec')
with the exception: java.lang.UnsupportedOperationException: empty collection
The Word2Vec model itself works fine, and I create it via a series of simple transformers. I'm not sure what is going wrong. My model-creation code snippet:
tokenizer = Tokenizer(inputCol="input", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered_1")
customRemover = CustomRemover(inputCol="filtered_1",outputCol="filtered")
word2vec = Word2Vec(inputCol="filtered",vectorSize=100, minCount=10)
Any help would be appreciated.
I think you saved the fitted Word2VecModel rather than the Word2Vec estimator, so you must load the fitted model with the code below:
from pyspark.ml.feature import Word2VecModel
w2vw_model = Word2VecModel.load('filepath/word2vec')
If you saved only the Word2Vec estimator, i.e. this object:
word2vec = Word2Vec(inputCol="filtered", vectorSize=100, minCount=10)
word2vec.write().overwrite().save('filepath_to_just_word2vec_not_its_model')
then you must load it with:
w2vw = Word2Vec.load('filepath_to_just_word2vec_not_its_model')
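As a minimal end-to-end sketch of the difference (assuming a DataFrame df that already has the filtered column of tokens; the outputCol name is added here only for illustration):

from pyspark.ml.feature import Word2Vec, Word2VecModel

# Estimator: just configuration, no learned vectors yet.
word2vec = Word2Vec(inputCol="filtered", outputCol="features",
                    vectorSize=100, minCount=10)

# Fitting produces a Word2VecModel, which holds the actual word vectors.
model = word2vec.fit(df)  # 'df' is assumed to contain the 'filtered' column
model.write().overwrite().save('filepath/word2vec')

# Load the fitted model back with the *Model* class, not the estimator class.
loaded = Word2VecModel.load('filepath/word2vec')
loaded.getVectors().show(5)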

Run LDA algorithm on Spark 2.0

I use Spark 2.0.0 and I'd like to train an LDA model on a Tweets dataset. When I try to execute
val ldaModel = new LDA().setK(3).run(corpus)
I get this error
error: reference to LDA is ambiguous;
it is imported twice in the same scope by import org.apache.spark.ml.clustering.LDA and import org.apache.spark.mllib.clustering.LDA
Could someone please help me?
Thanks!
It looks like you have both of the following import statements:
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.mllib.clustering.LDA
You would need to remove one of them.
If you are using Spark ML (the DataFrame-based API), the proper syntax would be:
import org.apache.spark.ml.clustering.LDA
/*feature extraction step*/
val lda = new LDA().setK(3)
val model = lda.fit(corpus)
If you are using the RDD-based API (MLlib), then you would write:
import org.apache.spark.mllib.clustering.LDA
/*feature extraction step*/
val lda = new LDA().setK(3)
val model = lda.run(corpus)

Resources