Splitting a misshapen csv file using pyspark RDD. EMR. YARN memory exception errors - apache-spark

I have been working on this code for a while. Below I have listed the code and most of the cluster configuration I am using on EMR. The purpose of the code is to split some csv files in two at a certain line number based on some basic iteration (I have included a simple split in the code below).
I frequently get the error "Container killed by YARN for exceeding memory limits" and have followed the design principles in the link below to resolve it, but I just don't see why this job would run into memory problems. I have over 22GB of YARN overhead, and the files range from a few MB to single-digit GB.
I have sometimes used r5a.12xlarge instances, to no avail. I really don't see any kind of memory leak in this code. It also seems very slow: it only processed something like 20GB of output to S3 in 16 hours. Is this a good way to parallelize this split operation? Is there a memory leak? What gives?
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-yarn-memory-limit/
[
  {
    "Classification": "spark",
    "Properties": {
      "spark.maximizeResourceAllocation": "true"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.executor.memoryOverheadFactor": ".2"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Configurations": [],
        "Properties": {
          "PYSPARK_PYTHON": "python36"
        },
        "Classification": "export"
      }
    ],
    "Properties": {}
  }
]
import argparse
import os
from functools import partial
from io import StringIO
from typing import List, Tuple, Union

import boto3
import pandas
from pyspark import SparkContext

# S3Url is a small helper (not shown here) that parses an s3:// URL into bucket and key.

def writetxt(txt: Union[List[str], pandas.DataFrame], path: str) -> None:
    s3 = boto3.resource('s3')
    s3path = S3Url(path)
    object = s3.Object(s3path.bucket, s3path.key)
    if isinstance(txt, pandas.DataFrame):
        csv_buffer = StringIO()
        txt.to_csv(csv_buffer)
        object.put(Body=csv_buffer.getvalue())
    else:
        object.put(Body='\n'.join(txt).encode())

def main(
    x: Tuple[str, str],
    output_files: str
) -> None:
    filename, content = x
    filename = os.path.basename(S3Url(filename).key)
    content = content.splitlines()
    # Split the csv file
    columnAttributes, csvData = content[:100], content[100:]
    writetxt(csvData, os.path.join(output_files, 'data.csv', filename))
    writetxt(columnAttributes, os.path.join(output_files, 'attr.csv', filename))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Split some misshapen csv files.')
    parser.add_argument('input_files', type=str,
                        help='The location of the input files.')
    parser.add_argument('output_files', type=str,
                        help='The location to put the output files.')
    parser.add_argument('--nb_partitions', type=int, default=4)
    args = parser.parse_args()

    # creating the context
    sc = SparkContext(appName="Broadcom Preprocessing")

    # We use minPartitions because otherwise small files get put in the same partition
    # together by default, which we have a lot of.
    # We use foreach to reduce the number of function calls, which slow down Spark.
    distFiles = sc.wholeTextFiles(args.input_files, minPartitions=args.nb_partitions) \
        .foreach(partial(main, output_files=args.output_files))

I think your memory issues arise because you're doing the actual data splitting in Python code. Spark processes run in the JVM, but when you call custom Python code, the relevant data must be serialized over to a Python process on each worker node in order to execute, which adds a lot of overhead. I believe you can accomplish what you're trying to do entirely with Spark operations, meaning the final program will run entirely in the JVM-based Spark processes.
Try something like this:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import *

input_path = "..."
split_num = 100

# load filenames & contents
filesDF = spark.createDataFrame(sc.wholeTextFiles(input_path), ['filename', 'contents'])

# break into individual lines & number them; posexplode yields a 0-based
# position for each line, which we shift so the header is line_number 1
linesDF = filesDF.select(
    "filename",
    posexplode(split(col("contents"), "\n")).alias("line_number", "contents")
).withColumn("line_number", col("line_number") + lit(1))
# split into headers & body
headersDF = linesDF.where(col("line_number") == lit(1))
bodyDF = linesDF.where(col("line_number") > lit(1))

# split the body in 2 based on the split line number
splitLinesDF = bodyDF.withColumn("split", when(col("line_number") < lit(split_num), 0).otherwise(1))
split_0_DF = splitLinesDF.where(col("split") == lit(0)) \
    .select("filename", "line_number", "contents") \
    .union(headersDF).orderBy("filename", "line_number")
split_1_DF = splitLinesDF.where(col("split") == lit(1)) \
    .select("filename", "line_number", "contents") \
    .union(headersDF).orderBy("filename", "line_number")

# collapse all lines back down into a file
firstDF = split_0_DF.groupBy("filename").agg(concat_ws("\n", collect_list(col("contents"))).alias("contents"))
secondDF = split_1_DF.groupBy("filename").agg(concat_ws("\n", collect_list(col("contents"))).alias("contents"))
# pandas UDF for more memory-efficient transfer of data from Spark to Python
@pandas_udf(returnType=IntegerType())
def writeFile(filename, contents):
    # <save to S3 here>
    ...

# write each row to a file; select() is lazy, so an action is needed to run the UDF
firstDF.select(writeFile(col("filename"), col("contents"))).collect()
secondDF.select(writeFile(col("filename"), col("contents"))).collect()
Finally, you will need some custom Python code to save each split file to S3 (or you could just code everything in Scala/Java). It is far, far more efficient to do this via pandas UDFs than to pass a standard Python function to .foreach(...). Internally, Spark will serialize the data to Arrow format in chunks (one per partition), which is very efficient.
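For illustration, a minimal sketch of what such a UDF body might look like, using the Series-to-Series pandas UDF API of Spark 3.x; the bucket, prefix, and write_file name below are hypothetical placeholders, not part of the original answer:

import boto3
import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

OUTPUT_BUCKET = "my-output-bucket"   # hypothetical bucket
OUTPUT_PREFIX = "split-files"        # hypothetical prefix

@pandas_udf(LongType())
def write_file(filename: pd.Series, contents: pd.Series) -> pd.Series:
    # Each call receives one Arrow batch from the JVM as pandas Series.
    s3 = boto3.resource("s3")
    written = []
    for name, body in zip(filename, contents):
        key = f"{OUTPUT_PREFIX}/{name.split('/')[-1]}"
        s3.Object(OUTPUT_BUCKET, key).put(Body=body.encode("utf-8"))
        written.append(len(body))
    # Return one value per input row, as the pandas UDF contract requires.
    return pd.Series(written)

# select() is lazy, so trigger it with an action:
# firstDF.select(write_file(col("filename"), col("contents"))).collect()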
Additionally, it looks like you're trying to put the entire object into S3 in a single request. If the data is too large, that will fail. You should check out the S3 streaming (multipart) upload functionality.
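A sketch of that streaming approach with boto3, assuming placeholder bucket, key, and contents: upload_fileobj reads the source in chunks and switches to a multipart upload once the payload crosses the configured threshold, so a single oversized PUT is never attempted.

import io
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# 64 MB parts; above multipart_threshold boto3 uses a multipart upload.
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        multipart_chunksize=64 * 1024 * 1024)

contents = "...one reassembled csv file as a string..."  # placeholder
s3.upload_fileobj(io.BytesIO(contents.encode("utf-8")),
                  "my-output-bucket", "split-files/data.csv",  # placeholder bucket/key
                  Config=config)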

Related

AWS Glue performance when writing

After performing joins and aggregation I want the output to be in 1 file, partitioned based on some column.
When I use repartition(1) the job takes 1 hr, and if I remove repartition(1) there will be multiple partitions of that file and it takes 30 mins (refer to the example below).
So is there a way to write the data into 1 file?
...
...
df = df.repartition(1)
glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={
        "path": "s3://s3path",
        "partitionKeys": ["choice"]
    },
    format="csv",
    transformation_ctx="datasink2")
Is there any other way to increase the write performance? Does changing the format help? And how can I achieve parallelism while having 1 file output?
S3 storage example
**if repartition(1)** // what I want but takes more time
choice=0/part-00-001
..
..
choice=500/part-00-001
**if removed** // takes less time but multiple files are present
choice=0/part-00-001
....
choice=0/part-00-0032
..
..
choice=500/part-00-001
....
choice=500/part-00-0032
Instead of using df.repartition(1), use df.repartition("choice"):
df= df.repartition("choice")
glueContext.write_dynamic_frame.from_options(
frame = df,
connection_type = "s3",
connection_options = {
"path": "s3://s3path"
"partitionKeys": ["choice"]
},
format = "csv",
transformation_ctx = "datasink2")
If the goal is to have one single file, use coalesce instead of repartition; it avoids a data shuffle.
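A minimal sketch of that alternative, assuming the same write_dynamic_frame.from_options call as above:

# coalesce(1) narrows the existing partitions down to one without the
# full shuffle that repartition(1) triggers.
df = df.coalesce(1)
# ...then call glueContext.write_dynamic_frame.from_options(...) exactly as above.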

Reading large volume data from Teradata using Dask cluster/Teradatasql and sqlalchemy

I need to read a large volume of data (approx. 800M records) from Teradata. My code works fine for a million records, but for larger sets it takes a long time to build the metadata. Could someone please suggest how to make it faster? Below is the code snippet I am using in my application.
import dask
import pandas as pd
from dask import delayed
from dask.dataframe import from_delayed

def get_partitions(num_partitions):
    list_range = []
    initial_start = 0
    for i in range(num_partitions):
        amp_range = 3240 // num_partitions
        start = (i * amp_range + 1) * initial_start
        end = (i + 1) * amp_range
        list_range.append((start, end))
        initial_start = 1
    return list_range

@delayed
def load(query, start, end, connString):
    df = pd.read_sql(query.format(start, end), connString)
    engine.dispose()
    return df

connString = "teradatasql://{user}:{password}@{hostname}/?logmech={logmech}&encryptdata=true"

results = from_delayed([load(query, start, end, connString)
                        for start, end in get_partitions(num_partitions)])
The build time is probably spent in finding out the metadata of your table. This is done by fetching the whole of the first partition and analysing it.
You would be better off either specifying it explicitly, if you know the dtypes upfront, e.g., {col: dtype, ...} for all the columns, or generating it from a separate query that you limit to just as many rows as it takes to be sure you have the right types:
meta = dask.compute(load(query, 0, 10, connString))[0]
results = from_delayed(
    [
        load(query, start, end, connString)
        for start, end in get_partitions(num_partitions)
    ],
    meta=meta.iloc[:0, :]  # zero-length version of the table
)
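For the first option mentioned above, a minimal sketch with a hypothetical schema; the column names and dtypes below are placeholders for your actual table:

# Declaring meta up front avoids running the sampling query entirely.
meta = {
    "customer_id": "int64",      # placeholder column
    "txn_ts": "datetime64[ns]",  # placeholder column
    "amount": "float64",         # placeholder column
}
results = from_delayed(
    [load(query, start, end, connString)
     for start, end in get_partitions(num_partitions)],
    meta=meta,
)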

Load a master data file into the Spark ecosystem

While building a log processing system, I came across a scenario where I need to look up data from a tree file (like a DB) for each and every log line to get the corresponding value. What is the best approach to load an external file which is very large into the Spark ecosystem? The tree file is 2GB in size.
Here is my scenario:
I have a file containing a huge number of log lines.
Each log line needs to be split by a delimiter into 70 fields.
I need to look up the data from the tree file for one of the 70 fields of a log line.
I am using the Apache Spark Python API and running on a 3-node cluster.
Below is the code which I have written. But it is really slow.
def process_logline(line, tree):
    row_dict = {}
    line_list = line.split(" ")
    row_dict["host"] = tree_lookup_value(tree, line_list[0])
    new_row = Row(**row_dict)
    return new_row

def run_job(vals):
    spark.sparkContext.addFile('somefile')
    tree_val = open(SparkFiles.get('somefile'))
    lines = spark.sparkContext.textFile("log_file")
    converted_lines_rdd = lines.map(lambda l: process_logline(l, tree_val))
    log_line_rdd = spark.createDataFrame(converted_lines_rdd)
    log_line_rdd.show()

Massive Azure wasb JSON folder (450 GB) read in Spark in an optimized way

I have processed a file in Azure Spark. It takes a long time to process the file. Can anyone please suggest an optimized way to achieve shorter processing times? I have also attached my sample code.
// Azure container filesystem; it contains the source, destination, archive and result files
val azureContainerFs = FileSystem.get(sc.hadoopConfiguration)

// Read the source file list
val sourceFiles = azureContainerFs.listStatus(new Path("/" + sourcePath + "/"), new PathFilter {
  override def accept(path: Path): Boolean = {
    val name = path.getName
    name.endsWith(".json")
  }
}).toList.par

// Ingestion processing for each file
for (sourceFile <- sourceFiles) {
  // Tokenize the file name from the path
  val sourceFileName = sourceFile.getPath.toString.substring(sourceFile.getPath.toString.lastIndexOf('/') + 1)

  // Create a customer invoice DF from the source json
  val customerInvoiceDf = sqlContext.read.format("json").schema(schemaDf.schema).json("/" + sourcePath + "/" + sourceFileName).cache()

  // ... (rest of the per-file processing omitted in the question)
}
Thanks in Advance!
Please tell us a bit more about your stack and processing power (number of master and worker nodes, how you deploy code, things like that).

Make Python dictionary available to all spark partitions

I am trying to develop an algorithm in pyspark for which I am working with the linalg.SparseVector class. I need to create a dictionary of key-value pairs as input to each SparseVector object. The keys have to be integers as they represent integers (in my case, user ids). I have a separate method that reads the input file and returns a dictionary where each user ID (string) is mapped to an integer index. When I go through the file again and do
FileRdd.map(lambda x: userid_idx[x[0]]), I receive a KeyError. I'm thinking this is because my dict is unavailable to all partitions. Is there a way to make the userid_idx dict available to all partitions, similar to a distributed map in MapReduce? Also, I apologize for the mess; I am posting this from my phone and will update it in a while from my laptop.
The code as promised:
from pyspark.mllib.linalg import SparseVector
from pyspark import SparkContext
import glob
import sys
import time

"""We create user and item indices starting from 0 to #users and 0 to #items respectively.
This is done to store them in SparseVectors as dicts."""

def create_indices(inputdir):
    items = dict()
    user_id_to_idx = dict()
    user_idx_to_id = dict()
    item_idx_to_id = dict()
    item_id_to_idx = dict()
    item_idx = 0
    user_idx = 0
    for inputfile in glob.glob(inputdir + "/*.txt"):
        print(inputfile)
        with open(inputfile) as f:
            for line in f:
                toks = line.strip().split("\t")
                try:
                    user_id_to_idx[toks[1].strip()]
                except KeyError:
                    user_id_to_idx[toks[1].strip()] = user_idx
                    user_idx_to_id[user_idx] = toks[1].strip()
                    user_idx += 1
                try:
                    item_id_to_idx[toks[0].strip()]
                except KeyError:
                    item_id_to_idx[toks[0].strip()] = item_idx
                    item_idx_to_id[item_idx] = toks[0].strip()
                    item_idx += 1
    return user_idx_to_id, user_id_to_idx, item_idx_to_id, item_id_to_idx, user_idx, item_idx

# pass in the hdfs path to the input files and the spark context.
def runKNN(inputdir, sc, user_id_to_idx, item_id_to_idx):
    rdd_text = sc.textFile(inputdir)
    try:
        new_rdd = rdd_text.map(lambda x: (item_id_to_idx[str(x.strip().split("\t")[0])],
                                          {user_id_to_idx[str(x.strip().split("\t")[1])]: 1})) \
                          .reduceByKey(lambda x, y: x.update(y))
    except KeyError:
        sys.exit(1)
    new_rdd.saveAsTextFile("hdfs:path_to_output/user/hadoop/knn/output")

if __name__ == "__main__":
    sc = SparkContext()
    u_idx_to_id, u_id_to_idx, i_idx_to_id, i_id_to_idx, u_idx, i_idx = create_indices(sys.argv[1])
    u_idx_to_id_b = sc.broadcast(u_idx_to_id)
    u_id_to_idx_b = sc.broadcast(u_id_to_idx)
    i_idx_to_idx_b = sc.broadcast(i_idx_to_id)
    i_id_to_idx_b = sc.broadcast(i_id_to_idx)
    num_users = sc.broadcast(u_idx)
    num_items = sc.broadcast(i_idx)
    runKNN(sys.argv[1], sc, u_id_to_idx_b.value, i_id_to_idx_b.value)
In Spark, that dictionary will already be available to you, as it is shipped to every task. For example:
dictionary = {1:"red", 2:"blue"}
rdd = sc.parallelize([1,2])
rdd.map(lambda x: dictionary[x]).collect()
# Prints ['red', 'blue']
You will probably find that your issue is actually that your dictionary does not contain the key you are looking up!
From the Spark documentation:
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program.
A copy of local variables referenced will be sent to the node along with the task.
Broadcast variables will not help you here; they are simply a tool to improve performance by sending the data once per node rather than once per task.
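For completeness, a minimal sketch of the broadcast-variable pattern referred to here; it changes how the dictionary is shipped, not whether it is available:

# The broadcast value is sent to each executor once and reused across its tasks,
# instead of being serialized into the closure of every task.
dictionary_b = sc.broadcast({1: "red", 2: "blue"})
rdd = sc.parallelize([1, 2])
rdd.map(lambda x: dictionary_b.value[x]).collect()
# ['red', 'blue']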
