How to use a large volume of data in Spark - apache-spark

I'm working with Spark in Python, trying to map PDF files through some custom parsing. Currently I'm loading the PDFs with pdfs = sparkContext.binaryFiles("some_path/*.pdf").
I set the RDD to be cacheable on disk with pdfs.persist(pyspark.StorageLevel.MEMORY_AND_DISK).
I then try to map the parsing operation and then save a pickle file, but it fails with an out-of-memory error in the heap. Could you help me please?
Here is the simplified code of what I do:
from pyspark import SparkConf, SparkContext
import pyspark

# There is some code here that sets up an `args` object with argparse,
# but it's not very interesting and a bit long, so I skip it.

def extractArticles(tupleData):
    url, bytesData = tupleData
    # Convert the bytesData into `content`, a list of dicts
    return content

sc = SparkContext("local[*]", "Legilux PDF Analyser")

inMemoryPDFs = sc.binaryFiles(args.filePattern)
inMemoryPDFs.persist(pyspark.StorageLevel.MEMORY_AND_DISK)

pdfData = inMemoryPDFs.flatMap(extractArticles)
pdfData.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
pdfData.saveAsPickleFile(args.output)
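One possible direction (a sketch under assumptions, not a confirmed fix for this exact job): binaryFiles materializes each whole PDF in memory, and MEMORY_AND_DISK still fills the heap before spilling, so two commonly suggested mitigations are persisting the raw bytes to disk only and giving the local-mode JVM more heap at launch.

# Sketch, not a confirmed fix: cache the raw PDF bytes on disk only so they
# don't compete with the parser for heap, and skip caching the parsed output
# if it is only written once.
# Heap itself is raised at launch, e.g.:  spark-submit --driver-memory 8g my_job.py
inMemoryPDFs = sc.binaryFiles(args.filePattern)
inMemoryPDFs.persist(pyspark.StorageLevel.DISK_ONLY)

pdfData = inMemoryPDFs.flatMap(extractArticles)
pdfData.saveAsPickleFile(args.output)   # no persist needed if the result is used once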

Related

Job failure with no more details. I used a simple rdd.map, converted to a DF, and called show()

I'm a super beginner with PySpark, just trying some code to process my documents in Databricks Community. I have a lot of HTML pages in a DataFrame and need to map a function that cleans all HTML tags.
from selectolax.parser import HTMLParser
from pyspark.sql.types import StructType, StructField, StringType

def get_text_selectolax(html):
    tree = HTMLParser(html)
    if tree.body is None:
        return None
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()
    for node in tree.css('body'):
        if node.tag == "strong":
            print("node.html")
            print(node.html)
    text = tree.body.text(separator='\n')
    return text

df_10 = df.limit(10)  # Out: df_10: pyspark.sql.dataframe.DataFrame
rdd_10_2 = df_10.select("html").rdd.map(get_text_selectolax)
schema = StructType([
    StructField("html", StringType()),
])
df_10_2 = spark.createDataFrame(rdd_10_2, schema)
df_10_2.show()  # -----------> here the code fails
I want to clean all my documents and get a DataFrame to work with. Thanks.
Here is the complete notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5506005740338231/939083865254574/8659136733442891/latest.html
I could get the thing working, but in Scala, which is fine for me.
val version = "3.9.1"
val baseUrl = s"http://repo1.maven.org/maven2/edu/stanford/nlp/stanford-corenlp"
val model = s"stanford-corenlp-$version-models.jar"
val url = s"$baseUrl/$version/$model"

if (!sc.listJars().exists(jar => jar.contains(model))) {
  import scala.sys.process._
  // download model
  s"wget -N $url".!!
  // make model files available to driver
  s"jar xf $model".!!
  // add model to workers
  sc.addJar(model)
}

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

val df_limpo = ds.select(cleanxml('html).as("acordao"))
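For anyone who wants to stay in PySpark, one plausible cause of the failure above (an assumption, since the error text isn't shown) is that df_10.select("html").rdd yields Row objects rather than raw strings, and createDataFrame with that schema expects rows or tuples rather than bare strings. A sketch that sidesteps both by wrapping the same cleaning function in a UDF (selectolax would need to be installed on the workers):

# Hypothetical PySpark-side sketch: apply get_text_selectolax (from the question)
# as a UDF instead of going through the RDD, so Rows and schemas are handled by Spark.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

clean_html = udf(get_text_selectolax, StringType())

df_10_clean = df_10.withColumn("text", clean_html(df_10["html"]))
df_10_clean.select("text").show()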

Reading Unzipped Shapefiles stored in AWS S3 from AWS EMR Cluster using PySpark in Jupyter Notebook

I'm completely new to AWS EMR and Apache Spark. I'm trying to assign GeoIDs to residential properties using shapefiles, but I'm not able to read the shapefiles from my S3 bucket. Please help me understand what is going on, as I couldn't find any answer on the internet that explains the exact problem.
import shapefile
import pandas as pd
from shapely.geometry import shape  # assumed source of the shape() call below

def read_shapefile(shp_path):
    """
    Read a shapefile into a Pandas dataframe with a 'coords' column holding
    the geometry information. This uses the pyshp package.
    """
    # read file, parse out the records and shapes
    sf = shapefile.Reader(shp_path)
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]
    # write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df

read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10")
[Screenshot: the files I want to read]
[Screenshot: the error I get while reading from the bucket]
I really want to read these shapefiles in an AWS EMR cluster, as it's not possible for me to work on them individually on my local machine. Any kind of help is appreciated.
I was able to read my shapefiles from the S3 bucket as binary objects to begin with, then build a wrapper function around them, and finally parse the individual file objects to the shapefile.Reader() method in the .shp, .shx, and .dbf formats separately.
This was happening because PySpark cannot read formats that are not provided through the SparkContext. I found this link helpful: Using pyshp to read a file-like object from a zipped archive.
My solution
def read_shapefile(shp_path):
    import io
    import shapefile
    import pandas as pd
    from shapely.geometry import shape  # assumed source of the shape() call below

    blocks = sc.binaryFiles(shp_path)
    block_dict = dict(blocks.collect())

    sf = shapefile.Reader(
        shp=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shp")][0]]),
        shx=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shx")][0]]),
        dbf=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".dbf")][0]]))

    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]

    # write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df

block_shapes = read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10*")
This works fine without breaking.
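If you then need the result as a distributed Spark DataFrame rather than a local pandas one, a small follow-up sketch (assuming a SparkSession named spark is available; the nested coords/centroid columns are dropped here because lists of points do not map cleanly onto Spark column types):

# Hypothetical follow-up: push the plain attribute columns into Spark for distributed work.
attrs_only = block_shapes.drop(columns=["coords", "centroid"])
spark_blocks = spark.createDataFrame(attrs_only)
spark_blocks.show(5)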

How to evaluate spark Dstream objects with an spark data frame

I am writing a Spark app where I need to evaluate the streaming data against historical data, which sits in a SQL Server database.
The idea is that Spark will fetch the historical data from the database, persist it in memory, and evaluate the streaming data against it.
I am getting the streaming data as follows:
import re
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext, functions as func, Row

sc = SparkContext("local[2]", "realtimeApp")
ssc = StreamingContext(sc, 10)
files = ssc.textFileStream("hdfs://RealTimeInputFolder/")

######## Let's get the data from the db which is relevant for streaming ###
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
dataurl = "jdbc:sqlserver://myserver:1433"
db = "mydb"
table = "stream_helper"
credential = "my_credentials"

######## basic data for evaluation purposes ########
files_count = files.flatMap(lambda file: file.split())
pattern = '(TranAmount=Decimal.{2})(.[0-9]*.[0-9]*)(\\S+ )(TranDescription=u.)([a-zA-z\\s]+)([\\S\\s]+ )(dSc=u.)([A-Z]{2}.[0-9]+)'
tranfiles = "wasb://myserver.blob.core.windows.net/RealTimeInputFolder01/"

def getSqlContextInstance(sparkContext):
    if 'sqlContextSingletonInstance' not in globals():
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

def pre_parse(logline):
    """
    Read files as rows of SQL in PySpark streaming using the pattern; for use with logging.
    A 0/1 flag is added in case there is any failure in processing by this pattern.
    """
    match = re.search(pattern, logline)
    if match is None:
        return (logline, 0)
    else:
        return (
            Row(
                customer_id=match.group(8),
                trantype=match.group(5),
                amount=float(match.group(2))
            ), 1)

def parse():
    """
    The actual processing happens here.
    """
    parsed_tran = ssc.textFileStream(tranfiles).map(pre_parse)
    success = parsed_tran.filter(lambda s: s[1] == 1).map(lambda x: x[0])
    fail = parsed_tran.filter(lambda s: s[1] == 0).map(lambda x: x[0])
    if fail.count() > 0:
        print("number of non-parsed files: %d" % fail.count())
    return success, fail

success, fail = parse()
Now I want to evaluate it against the DataFrame that I get from the historical data:
base_data = sqlContext.read.format("jdbc").options(driver=driver, url=dataurl, database=db, user=credential, password=credential, dbtable=table).load()
Since this is returned as a DataFrame, how do I use it for my purpose?
The streaming programming guide here says
"You have to create a SQLContext using the SparkContext that the StreamingContext is using."
This makes me even more confused about how to use the existing DataFrame with the streaming object. Any help is highly appreciated.
To manipulate DataFrames, you always need a SQLContext, so you can instantiate it like:
sc = SparkContext("local[2]", "realtimeApp")
sqlc = SQLContext(sc)
ssc = StreamingContext(sc, 10)
These two contexts (SQLContext and StreamingContext) will coexist in the same job because they are associated with the same SparkContext.
But keep in mind, you can't instantiate two different SparkContexts in the same job.
Once you have created your DataFrame from your DStream, you can join your historical DataFrame with the DataFrame created from your stream.
To do that, I would do something like:
yourDStream.foreachRDD(lambda rdd: sqlContext
    .createDataFrame(rdd)
    .join(historicalDF, ...)
    ...
)
When you manipulate streams, think about the amount of streamed data you need for your join; you may be interested in the windowed functions.
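For illustration, a minimal windowing sketch (the 30-second window, 10-second slide, and the customer_id join key are assumed values, not from the original answer):

# Hypothetical sketch: join only the last 30 seconds of streamed data,
# re-evaluated every 10 seconds, against the historical DataFrame.
windowed = yourDStream.window(30, 10)  # windowDuration, slideDuration (in seconds)
windowed.foreachRDD(lambda rdd: sqlContext
    .createDataFrame(rdd)
    .join(historicalDF, "customer_id")  # assumed join key
    .show()
)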

How to save IDFmodel with PySpark

I have produced an IDFModel with PySpark and ipython notebook as follows:
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
hashingTF = HashingTF() #this will be used with hashing later
txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey() #this returns RDD of (filename, string) pairs for each file from the directory
split_data_train = txtdata_train.map(parse) #my parse function puts RDD in form I want
tf_train = hashingTF.transform(split_data_train) #creates term frequency sparse vectors for the training set
tf_train.cache()
idf_train = IDF().fit(tf_train) #makes IDFmodel, THIS IS WHAT I WANT TO SAVE!!!
tfidf_train = idf_train.transform(tf_train)
This is based on this guide: https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I would like to save this model so I can load it again later in a different notebook. However, there is no information on how to do this; the closest thing I found is:
Save Apache Spark mllib model in python
But when I tried the suggestion in the answer
idf_train.save(sc, "/home/ubuntu/newfolder")
I get this error:
AttributeError: 'IDFModel' object has no attribute 'save'
Is there something I am missing, or is it not possible to save IDFModel objects? Thanks!
I did something like that in Scala/Java. It seems to work, but might not be very efficient. The idea is to write the model to a file as a serialized object and read it back later. Good luck! :)
try {
  val fileOut: FileOutputStream = new FileOutputStream(savePath + "/idf.jserialized");
  val out: ObjectOutputStream = new ObjectOutputStream(fileOut);
  out.writeObject(idf);
  out.close();
  fileOut.close();
  System.out.println("\nSerialization Successful... Checkout your specified output file..\n");
} catch {
  case foe: FileNotFoundException => foe.printStackTrace()
  case ioe: IOException => ioe.printStackTrace()
}
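For completeness, a Python-side alternative sketch: in newer Spark versions the DataFrame-based pyspark.ml API (as opposed to the pyspark.mllib API used in the question) has an IDFModel that supports save and load. This assumes a DataFrame tf_df with a vector column named "rawFeatures":

# Sketch using the DataFrame-based ML API; tf_df and the column names are assumptions.
from pyspark.ml.feature import IDF, IDFModel

idf = IDF(inputCol="rawFeatures", outputCol="features")
idf_model = idf.fit(tf_df)

idf_model.save("/home/ubuntu/newfolder")              # persist the fitted model
same_model = IDFModel.load("/home/ubuntu/newfolder")  # reload it in another notebook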

Generate single json file for pyspark RDD

I am building a Python script in which I need to generate a JSON file from a JSON RDD.
Following is the code snippet for saving the JSON file:
jsonRDD.map(lambda x: json.loads(x)) \
       .coalesce(1, shuffle=True) \
       .saveAsTextFile('examples/src/main/resources/demo.json')
But I need to write the JSON data to a single file instead of having it distributed across several partitions.
So please suggest an appropriate solution for it.
Without the use of additional libraries like pandas, you could save your RDD of several JSON records by reducing them to one big string, with each record separated by a newline:
# perform your operation
# note that you do not need a lambda expression for json.loads
jsonRDD = jsonRDD.map(json.loads).coalesce(1, shuffle=True)

# map jsons back to strings
jsonRDD = jsonRDD.map(json.dumps)

# reduce to one big string with one json on each line
json_string = jsonRDD.reduce(lambda x, y: x + "\n" + y)

# write your string to a file
with open("path/to/your.json", "w") as f:
    f.write(json_string.encode("utf-8"))
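If the reduced string is too large to hold on the driver at once, one possible variation (a sketch, not part of the original answer) is to stream the records to the file one at a time:

# Hypothetical variation: write records line by line instead of building one
# giant string on the driver first.
# Assumes jsonRDD already holds JSON strings, as after the map(json.dumps) step above.
with open("path/to/your.json", "w") as f:
    for record in jsonRDD.toLocalIterator():
        f.write(record + "\n")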
I have had issues with PySpark saving off JSON files once I have them in an RDD or DataFrame, so what I do is convert them to a pandas DataFrame and save that to a non-distributed directory.
import pandas
df1 = sqlContext.createDataFrame(yourRDD)
df2 = df1.toPandas()
df2.to_json(yourpath)
