How to run Hive SQL in Spark - python-3.x

add file s3://nouveau3/cleanser/cleanser.py
CREATE EXTERNAL TABLE IF NOT EXISTS ext_tbl (
c STRING
) ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
LOCATION 's3-location'
tblproperties ('skip.header.line.count'='1');
CREATE TABLE main_tbl (schema);
INSERT INTO TABLE main_tbl
SELECT TRANSFORM(c)
USING 'python cleanser.py' as (schema)
FROM ext_tbl;
The INSERT query runs for more than 15 minutes. To improve that, how can I run the query in Spark? The S3 location has more than 50 objects (gz format).

Approach 1 - If the query doesn't deal with too much data, and depending on the capacity of your edge nodes, you can run it directly in Spark just by logging into the spark-sql> shell.
Approach 2 - The spark-sql shell will not submit the query in cluster mode; it will just run on a single edge node, and this might kill your job if the edge node falls short of resources.
You can write a Python script which reads your query and calls spark.sql("your queries"). You can then launch this job using spark-submit --deploy-mode cluster. The spark-submit command gives you an option to specify the deploy mode, which should be cluster. This will leverage your entire cluster instead of just one node; a minimal sketch follows.
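For illustration only, here is a rough sketch of such a script, reusing the table names, S3 path and cleanser.py transform from the question (the file name run_transform.py and the schema placeholder are assumptions, not part of the original setup):
# run_transform.py -- minimal sketch; assumes ext_tbl and main_tbl already exist in the metastore
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-transform")
         .enableHiveSupport()   # lets spark.sql() see the Hive metastore
         .getOrCreate())

# distribute the transform script to the executors (Spark SQL equivalent of Hive's `add file`)
spark.sql("ADD FILE s3://nouveau3/cleanser/cleanser.py")

# run the same TRANSFORM-based insert through Spark SQL
spark.sql("""
    INSERT INTO TABLE main_tbl
    SELECT TRANSFORM(c)
    USING 'python cleanser.py' as (schema)
    FROM ext_tbl
""")

spark.stop()
You could then submit it with something like spark-submit --master yarn --deploy-mode cluster run_transform.py so the work is distributed across the cluster rather than confined to the edge node.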

Related

Pyspark on EMR and external hive/glue - can drop but not create tables via sqlContext

I'm writing a dataframe to an external hive table from pyspark running on EMR. The work involves dropping/truncating data from an external hive table, writing the contents of a dataframe into the aforementioned table, then writing the data from hive to DynamoDB. I am looking to write to an internal table on the EMR cluster, but for now I would like the hive data to be available to subsequent clusters. I could write to the Glue catalog directly and force it to be registered, but that is a step further than I need to go.
All components work fine individually on a given EMR cluster: I can create an external hive table on EMR, either using a script or ssh and hive shell. This table can be queried by Athena and can be read from by pyspark. I can create a dataframe and INSERT OVERWRITE the data into the aforementioned table in pyspark.
I can then use hive shell to copy the data from the hive table into a DynamoDB table.
I'd like to wrap all of the work into the one pyspark script instead of having to submit multiple distinct steps.
I am able to drop tables using
sqlContext.sql("drop table if exists default.my_table")
When I try to create a table using sqlContext.sql("create table default.mytable(id string,val string) STORED AS ORC") I get the following error:
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-xx-xxx-xx-xxx/xx.xxx.xx.xx to ip-xxx-xx-xx-xx:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ip-xxx-xx-xx-xx:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
I can't figure out why I can create an external hive table in Glue using the hive shell on the cluster, and drop the table using the hive shell or the pyspark sqlContext, but I can't create a table using the sqlContext. I have checked around and the solutions offered (copying hive-site.xml) don't make sense in this context, as I can clearly write to the required addresses with no hassle, just not in pyspark. And it is doubly strange that I can drop the tables and they are definitely dropped when I check in Athena.
Running on:
emr-5.28.0
Hadoop distribution: Amazon 2.8.5
Spark 2.4.4
Hive 2.3.6
Livy 0.6.0 (for notebooks, but my experimentation is via ssh and the pyspark shell)
Turns out I could create tables via a spark.sql() call as long as I provided a location for the tables. It seems the Hive shell doesn't require it, yet spark.sql() does. Not expected, but not entirely surprising.
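For example, a sketch of the kind of call that worked (the S3 path here is a placeholder, not the original location):
# providing an explicit LOCATION is what made CREATE TABLE succeed from pyspark
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.mytable (id STRING, val STRING)
    STORED AS ORC
    LOCATION 's3://my-bucket/path/to/mytable/'
""")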
Complementing @Zeathor's answer. After configuring the EMR and Glue connection and permissions (you can check more here: https://www.youtube.com/watch?v=w20tapeW1ME), you will just need to write Spark SQL commands:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('TestSession').getOrCreate()
spark.sql("create database if not exists test")
You can then create your tables from dataframes:
# df is the DataFrame you want to persist
df.createOrReplaceTempView("first_table")
spark.sql("create table test.table_name as select * from first_table")
All the database and table metadata will then be stored in the AWS Glue Catalogue.
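If you want to double-check that the metadata landed in the shared catalogue, a quick sanity check from the same session could be:
spark.sql("show databases").show()
spark.sql("show tables in test").show()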

Query performance in spark-submit vs hive shell

I am having a hard time debugging why a simple query against a Hive external table (DynamoDB-backed) takes north of 10 minutes via spark-submit, while it takes only 4 seconds in the hive shell.
The Hive external table refers to a DynamoDB table, say Employee[id, name, ssn, dept]. id is the partition key and ssn is the range key.
Using AWS EMR 5.29, Spark, Hive, Tez, Hadoop. 1 master, 4 core nodes, m5.l.
In the hive shell: select name, dept, ssn from employee where id='123/ABC/X12I' returns results in 4 seconds.
Now, let's say I have the following code in code.py (ignoring the imports):
spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data=spark.sql("select name, dept, ssn from employee where id='123/ABC/X12I'")
# print data or get length
I submit the above on the master node as:
spark-submit --jars /pathto/emr-ddb-hive.jar,/pathto/emr-ddb-hadoop.jar code.py
The above spark-submit takes a long time, 14+ minutes. I am not sure which parameter needs to be tweaked or set to get a better response time.
In the hive shell I ran SET; to view the parameters the hive shell is using, and there are a gazillion.
I also tried a boto3 DynamoDB query and it is way faster than my simple PySpark SQL via spark-submit.
I am missing something fundamental... Any idea or direction is appreciated.
I was doing an aggregation when trying to print, by calling collect(). I had read about it, but did not realize it was that bad (timing-wise). I also ended up doing some more experiments, like take(n) and limit 1.
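A rough sketch of the difference, reusing the query from the question (the row count in take() is only illustrative):
data = spark.sql("select name, dept, ssn from employee where id='123/ABC/X12I'")

# collect() materialises every matching row on the driver and waits for the whole job
all_rows = data.collect()

# take(n) / limit(n) only need the first n rows, so they usually come back much sooner
preview = data.take(5)
data.limit(1).show()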

SPARK Performance degrades with incremental loads in local mode

I am trying to run an Apache Spark SQL job (1.6) in local mode over a 3-node cluster and I face the below issues in production.
Execution time for the duplication layer is increasing day by day after the incremental load at the DL layer.
Nearly 150K records are being inserted in each table every day.
We have tried the default as well as the MEMORY_AND_DISK persist mechanism, but it behaves the same in both cases.
Execution time for the other tables is impacted if we run the large tables first.
The Spark job is invoked in a standard way from a shell script using spark-submit, and the SQL query from my Spark job is as below.
val result = sqlcontext.sql(
  "CREATE TABLE " + DB + "." + table_name +
  " row format delimited fields terminated by '^' STORED as ORC" +
  " tblproperties(\"orc.compress\"=\"SNAPPY\", \"orc.stripe.size\"='67108864')" +
  " AS select distinct a.* from " + fdl_db + "." + table_name + " a," +
  " (SELECT SRL_NO, MAX(" + INC_COL + ") as incremental_col" +
  "  FROM " + fdl_db + "." + table_name + " group by SRL_NO) b" +
  " where a.SRL_NO = b.SRL_NO and a." + INC_COL + " = b.incremental_col"
).repartition(100)
please let me know if you need any more info.

How to prevent running spark submit twice in case of failure in cluster mode?

We are running a batch process using Spark, and we use spark-submit to submit our jobs with the options
--deploy-mode cluster \
--master yarn-cluster \
We basically take CSV files, do some processing on them, and create Parquet files from them. We run multiple files in the same spark-submit command using a config file. Now let's say we have 10 files to process and the process fails on, say, file 6: Spark tries to re-run the process, and it processes all the files up to file 6 again, writing duplicate records for those first 5 files before failing. We are creating Parquet files, so we don't have control over how Spark names those files, but it always creates unique names.
Is there a Spark property I can change so that a failed process is not re-executed?
The property spark.yarn.maxAppAttempts worked in my case. I set its value to 1, as below, in my spark-submit command:
--conf "spark.yarn.maxAppAttempts=1"

Spark SQL performance difference in spark-sql vs spark-shell REPL

Spark newb question: I'm making exactly the same Spark SQL query in spark-sql and in spark-shell. The spark-shell version takes about 10 seconds, while the spark-sql version takes about 20.
The spark-sql REPL gets the query directly:
spark-sql> SELECT .... FROM .... LIMIT 20
The spark-shell REPL commands are like this:
scala> val df = sqlContext.sql("SELECT ... FROM ... LIMIT 20 ")
scala> df.show()
In both cases, it's exactly the same query. Also, the query returns only a few rows because of the explicit LIMIT 20.
What's different about how the same query is executed from the different CLIs?
I'm running on Hortonworks sandbox VM (Linux CentOS) if that helps.
I think it comes down to two things.
First, it could be related to the order. If you run the query in spark-sql first, Spark has to build the explain plan from scratch; if you run the same query again, whether from the shell or from spark-sql, it can take less time because the explain plan is easily retrieved.
Second, it could be related to how spark-sql acquires its resources. I have seen this multiple times: spark-shell gets its resources and starts processing faster than spark-sql. You can check this from the UI or from top; you will find that spark-shell actually starts faster than spark-sql.
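If you want to rule out plan differences, one way (a sketch; some_table stands in for the real table) is to print the plan from the shell and compare it with what spark-sql reports:
# in pyspark (spark-shell is analogous)
df = sqlContext.sql("SELECT * FROM some_table LIMIT 20")   # placeholder query
df.explain(True)   # prints the parsed, analyzed, optimized and physical plans
df.show()
In the spark-sql REPL the equivalent is EXPLAIN EXTENDED SELECT * FROM some_table LIMIT 20; if the plans match, the remaining gap is more likely startup and resource acquisition than query execution.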
