Problems with append to output in PySpark - apache-spark

I have a problem with the output file. Every time I run my program, it needs to write an answer as a new row in string format. (Output example: 1,5,14,45,99.)
My task is checked automatically by a program invoked like this:
PYSPARK_PYTHON=/opt/conda/envs/dsenv/bin/python spark-submit \
--master yarn \
--name checker \
projects/3/shortest_path.py 12 34 /datasets/twitter/twitter.tsv hw3_output
This program produces an output file with only one row, but on my local notebook it works even after several runs.
Here is the part of my program that writes the output:
output = sys.argv[4]
d = [[answer]]
df_out = spark.createDataFrame(data=d)
df_out.write.format("csv").options(delimiter='\n').mode('append').save(output)
Can you please suggest a way to modify my program, or point out what is going wrong? I have tried changing the options of .save in dozens of combinations.
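For reference, here is a minimal sketch of writing the answer as a single comma-separated row in append mode (assuming answer is a list of integers, as in the example output). Note that append mode adds a new part-* file to the output directory on each run rather than modifying existing files, so a checker has to read the whole directory to see all rows.
import sys
from pyspark.sql import SparkSession

# minimal sketch, assuming `answer` is a list of ints such as [1, 5, 14, 45, 99]
spark = SparkSession.builder.appName("shortest_path").getOrCreate()
output = sys.argv[4]

row = ",".join(str(x) for x in answer)              # "1,5,14,45,99"
df_out = spark.createDataFrame([(row,)], ["path"])
# each run in append mode adds a new part-* file under `output`
df_out.write.format("csv").mode("append").save(output)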

Related

PySpark code improvement idea (to lower the duration)

I have several CSV files stored in HDFS; each file is about 500 MB (roughly 200k lines and 20 columns). The data is mainly of String type (90%).
I am doing basic operations on those files, like filtering, mapping, and regexp_extract, to generate two "clean" files.
I do not ask Spark to infer the schema.
The filtering step:
# Filter with document type
self.data = self.data.filter('type=="1" or type=="3"')
# Remove message documents without an ID
self.data = self.data.filter('id_document!="1" and id_document!="7" and id_document is not NULL')
# Remove message when the Service was unavailable.
self.data = self.data.filter('message!="Service Temporarily Unavailable"')
# Choose messages with the following actions
expression = '(.*receiv.*|.*reject.*|.*[^r]envoi$|creation)'
self.data = self.data.filter(self.data['action'].rlike(expression))
# Choose messages with a valid date format
self.data = self.data.filter(self.data['date'].rlike('(\d{4}-\d{2}-\d{2}.*)'))
# Parse message
self.data = self.data.withColumn('message_logiciel', regexp_extract(col('message'), r'.*app:([A-Za-z\s-\déèàç
Then I do a mapping based on a dictionary, and I finish with steps like this:
self.log = self.log.withColumn('action_result', col('message')) \
.withColumn('action_result', regexp_replace('action_result',
'.*(^[Ee]rror.{0,15}422|^[Ee]rror.{0,15}401).*',
': error c')) \
.withColumn('action_result', regexp_replace('action_result',
'.*(^[Ee]rror.*failed to connect.*|^[Ee]rror.{0,15}502).*',
': error ops')) \
.withColumn('action_result', regexp_replace('action_result', '.*(.*connector.*off*).*',
': connector off'))
Finally I write the cleaned data to Hive.
My issue is the duration of the process: each 500 MB file takes 15 minutes, which does not seem right to me. I am guessing that:
either my code does not follow Spark logic (too many regexp_extract calls?),
or the slow part is writing to Hive.
I am wondering why it takes so long, and whether you have any clue to lower this duration.
My environment is the following:
Coding in PySpark with a Jupyter Notebook, connected to the cluster with Livy; the cluster is a simple 3-node cluster (4x3 cores, 3x16 GB RAM).
I tried to increase the allocated cores and memory, but the duration is still the same.
I have done the same exercise with Pandas in plain Python, and it works just fine.
Thank you for your help.
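A hedged diagnostic sketch (self.data is the cleaned DataFrame from the question, clean_table is a placeholder table name): since chained filter calls are collapsed by Catalyst into a single pass, a useful first step is timing the cleaning and the Hive write separately to see which one dominates.
import time

# hedged sketch: `self.data` is the cleaned DataFrame from the question,
# "clean_table" is a hypothetical Hive table name
self.data.explain()                      # consecutive filters collapse into one stage

start = time.time()
cleaned = self.data.cache()
print(cleaned.count(), "rows cleaned in", round(time.time() - start, 1), "s")

start = time.time()
cleaned.write.mode("overwrite").saveAsTable("clean_table")
print("Hive write took", round(time.time() - start, 1), "s")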

How can I use the saveAsTable function when I have two Spark streams running in parallel in the same notebook?

I have two Spark streams set up in a notebook to run in parallel like so.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df1 = spark \
.readStream.format("delta") \
.table("test_db.table1") \
.select('foo', 'bar')
writer_df1 = df1.writeStream.option("checkpointLocation", checkpoint_location_1) \
.foreachBatch(
lambda batch_df, batch_epoch:
process_batch(batch_df, batch_epoch)
) \
.start()
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df2 = spark \
.readStream.format("delta") \
.table("test_db.table2") \
.select('foo', 'bar')
writer_df2 = df2.writeStream.option("checkpointLocation", checkpoint_location_2) \
.foreachBatch(
lambda batch_df, batch_epoch:
process_batch(batch_df, batch_epoch)
) \
.start()
These dataframes then get processed row by row, with each row being sent to an API. If the API call reports an error, I convert the row to JSON and append it to a common failures table in Databricks.
columns = ['table_name', 'record', 'time_of_failure', 'error_or_status_code']
vals = [(table_name, json.dumps(row.asDict()), datetime.now(), str(error_or_http_code))]
error_df = spark.createDataFrame(vals, columns)
error_df.select('table_name', 'record', 'time_of_failure', 'error_or_status_code').write.format('delta').mode('append').saveAsTable("failures_db.failures_db")
When attempting to add the row to this table, the saveAsTable() call here throws the following exception.
py4j.protocol.Py4JJavaError: An error occurred while calling o3578.saveAsTable.
: java.lang.IllegalStateException: Cannot find the REPL id in Spark local properties. Spark-submit and R doesn't support transactional writes from different clusters. If you are using R, please switch to Scala or Python. If you are using spark-submit , please convert it to Databricks JAR job. Or you can disable multi-cluster writes by setting 'spark.databricks.delta.multiClusterWrites.enabled' to 'false'. If this is disabled, writes to a single table must originate from a single cluster. Please check https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions-faq for more details.
If I comment out one of the streams and re-run the notebook, any errors from the API calls get inserted into the table with no issues. I feel like there's some configuration I need to add but am not sure of where to go from here.
Not sure if this is the best solution, but I believe the problem comes from each stream writing to the table at the same time. I split this table into separate tables for each stream and it worked after that.
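A minimal sketch of that workaround, assuming the failure-logging code from the question and a hypothetical per-stream naming scheme (failures_db.failures_<table_name>). The error message also mentions setting spark.databricks.delta.multiClusterWrites.enabled to false as an alternative, but separate tables avoid concurrent writers entirely.
import json
from datetime import datetime

def log_failure(table_name, row, error_or_http_code):
    # hypothetical helper: one failures table per source stream,
    # e.g. failures_db.failures_table1 and failures_db.failures_table2;
    # `spark` is the SparkSession predefined in the notebook
    columns = ['table_name', 'record', 'time_of_failure', 'error_or_status_code']
    vals = [(table_name, json.dumps(row.asDict()), datetime.now(), str(error_or_http_code))]
    error_df = spark.createDataFrame(vals, columns)
    error_df.write.format('delta').mode('append') \
        .saveAsTable("failures_db.failures_" + table_name)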

Saving a file locally in Databricks PySpark

I am sure there is documentation for this somewhere and/or the solution is obvious, but I've come up dry in all of my searching.
I have a dataframe that I want to export to a text file to my local machine. The dataframe contains strings with commas, so just display -> download full results ends up with a distorted export. I'd like to export out with a tab-delimiter, but I cannot figure out for the life of me how to download it locally.
I have
match1.write.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.save("file:\\\C:\\Users\\user\\Desktop\\NewsArticle.txt")
but clearly this isn't right. I suspect it is writing somewhere else (somewhere I don't want it to be...) because running it again gives me the error that the path already exists. So... what is the correct way?
cricket_007 pointed me along the right path: ultimately, I needed to save the file to the FileStore of Databricks (not just dbfs), and then download it from the xxxxx.databricks.com/file/[insert file path here] link.
My resulting code was:
# repartition(1): save as one collective file; CSV format with a header,
# quote escaping disabled, and a tab delimiter; saved to the FileStore
df.repartition(1) \
    .write.format('csv') \
    .option("header", True) \
    .option("quote", "") \
    .option("delimiter", "\t") \
    .save('dbfs:/FileStore/df/')
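As a follow-up, here is a hedged sketch of collapsing that output to one named file inside the FileStore so it can be fetched through the workspace file link mentioned above (the paths are just the ones from this answer; dbutils is only available inside a Databricks notebook):
# list the part files written by the save above and copy the single part file
# to a stable name; it can then be downloaded via the xxxxx.databricks.com/file/... link
part = [f.path for f in dbutils.fs.ls("dbfs:/FileStore/df/") if f.name.startswith("part-")][0]
dbutils.fs.cp(part, "dbfs:/FileStore/df/NewsArticle.txt")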
Check if it is present at the location below; multiple part files should be in that folder.
import os
print(os.getcwd())
If you want to create a single file (not multiple part files) then you can use coalesce(1) (but note that it forces one worker to fetch the whole dataset and write it sequentially, so it is not advisable for huge data):
df.coalesce(1).write.format("csv").\
option("delimiter", "\t").\
save("<file path>")
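Note that even with coalesce(1) the save path is still a directory containing a single part-*.csv file, so you still need to pull that part file out (for example as sketched above for the FileStore) before you have a plain file to download.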
Hope this helps!

How many times does the script used in Spark pipes get executed?

I tried the Spark Scala code below and got the output mentioned below.
I tried to pass the inputs to the script, but it did not receive them, and when I used collect the print statement from the script appeared twice.
My simple and very basic Perl script first:
#!/usr/bin/perl
print("arguments $ARGV[0] \n"); # Just print the arguments.
My Spark code:
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object PipesExample {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val distScript = "/home/srinivas/test.pl"
    sc.addFile(distScript)
    val rdd = sc.parallelize(Array("srini"))
    val piped = rdd.pipe(Seq(SparkFiles.get("test.pl")))
    println(" output " + piped.collect().mkString(" "))
  }
}
The output looked like this:
output arguments arguments
1) What mistake have I made that causes it to fail to receive the arguments?
2) Why did it execute twice?
If this looks too basic, please excuse me. I am trying to understand this as best I can and want to clear my doubts.
From my experience, it is executed twice because Spark divides your RDD into two partitions and each partition is passed to your external script.
The reason your application could not pick up the test.pl file is that the file lives on one node, but the application master is created on one of the nodes in the cluster; if the file is not on that node, it cannot be picked up.
You should save the file in HDFS or S3 to access external files, or else pass the HDFS file location through the spark-submit options.
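On the second point, a small sketch may help (a PySpark equivalent, assuming a local SparkContext): pipe() runs the external command once per partition and feeds each partition's elements to the command's stdin, one per line, rather than passing them as command-line arguments. So the script should read STDIN, and the partition count controls how many times it runs.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# one partition -> the piped command runs exactly once
rdd = sc.parallelize(["srini"], numSlices=1)
# 'cat' simply echoes its stdin, standing in for test.pl here
piped = rdd.pipe("cat")
print(piped.collect())   # ['srini']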

How to get the Application ID from the Submission ID or Driver ID programmatically

I am submitting a Spark job in cluster deploy mode and I get a submission ID in my code. In order to use the Spark REST API we need the application ID. How can we get the application ID from the submission ID programmatically?
To answer this question, it is assumed you already know that the application ID can be obtained inside a running application with:
scala> spark.sparkContext.applicationId
Getting it from outside the application is a little different. Over the REST API it is possible to send commands, as in the following useful tutorial about the Apache Spark hidden REST API: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
Another solution is to program it yourself. You can send a spark-submit in the following style:
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://my-cluster:7077 \
--deploy-mode cluster \
/path/to/examples.jar 1000
In the terminal output you should see the log being produced, and in that log there is the application ID (App_ID). $SPARK_HOME here is the environment variable pointing to the folder where Spark is located.
With Python, for example, it is possible to obtain the App_ID in code as described below. First we create a list to send the command with Python's subprocess module. The subprocess module can create a PIPE from which you can extract the log information instead of relying on Spark's standard behaviour of posting it to the terminal. Make sure to use communicate() after Popen, to prevent waiting on the OS. Then split the output into lines and scrape through it to find the App_ID. The example can be found below:
import os
import subprocess

submitSparkList = [os.path.expandvars('$SPARK_HOME/bin/spark-submit'),
                   '--class', 'org.apache.spark.examples.SparkPi',
                   '--master', 'spark://my-cluster:7077',
                   '--deploy-mode', 'cluster',
                   '/path/to/examples.py', '1000']
sparkCommand = subprocess.Popen(submitSparkList, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, universal_newlines=True)
stdout, stderr = sparkCommand.communicate()
stderr = stderr.splitlines()
for line in stderr:
    # this is the first line from the REST API that contains the ID; scrape through the logs to find it
    if "Connected to Spark cluster" in line:
        app_ID_index = line.find('app-')
        app_ID = line[app_ID_index:]  # this gives the app_ID
        print('The app ID is ' + app_ID)
The Python documentation contains a warning about not using the communicate() function:
https://docs.python.org/2/library/subprocess.html
Warning This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.
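Alternatively, once the application is running, Spark's monitoring REST API exposes the application ID directly. A hedged sketch follows; the host and port are assumptions (port 4040 on the driver or 18080 on the history server in a default setup):
import requests

# assumption: the Spark UI (port 4040) or history server (port 18080) is reachable
resp = requests.get("http://my-cluster:18080/api/v1/applications")
for app in resp.json():
    print(app["id"], app["name"])   # the "id" field is the application ID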
