I have a super-simple pyspark script:
1. Run a query (Hive) and create a dataframe A.
2. Perform aggregates on A, which creates dataframe B.
3. Print the number of rows of the aggregated result with B.count().
4. Save the results of B in a Hive table using B.insertInto().
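A minimal PySpark sketch of those steps (the query, aggregation, and table names are placeholders, not from the original script):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="aggregate-and-insert")
sqlContext = HiveContext(sc)

# 1. Run a Hive query and create dataframe A (query text is a placeholder)
A = sqlContext.sql("SELECT key, value FROM source_db.source_table")

# 2. Perform aggregates on A, which creates dataframe B
B = A.groupBy("key").count()

# 3. Print the number of rows of the aggregated result
print(B.count())

# 4. Save B into an existing Hive table (placeholder name; this is the B.insertInto() step)
B.write.insertInto("target_db.target_table")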
But I noticed something: in the Spark web UI the insertInto job is listed as completed, yet the client program (a notebook) still shows the insert as running. If I run a count directly against the Hive table with a Hive client (no Spark), the row count does not match B.count(). If I run the count again, the number of rows has increased (but still does not match B.count()). After some minutes, the Hive row-count query finally matches B.count().
My question is: if the insertInto() job is already completed according to the Spark web UI, what is it still doing? Given the growing row count, it looks as if the insertInto is still running, but that does not match the Spark web UI. My guess is that something like a Hive table partition metadata update is running, or something similar.
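One way to probe that guess from the same Spark session is sketched below, reusing the placeholder table name from above (the SHOW PARTITIONS check only applies if the target table is partitioned):

# Re-read the target table through the Hive metastore (placeholder name from the sketch above)
written = sqlContext.table("target_db.target_table")
print(written.count())  # should converge to B.count() once the write is fully visible

# If the table is partitioned, check whether all partitions have been registered yet
sqlContext.sql("SHOW PARTITIONS target_db.target_table").show(100, False)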
I am trying to run an Apache Spark SQL job (1.6) in local mode over a 3-node cluster, and I face the issues below in production.
Execution time for the duplication layer is increasing day by day after the incremental load at the DL layer.
Nearly 150K records are being inserted into each table every day.
We have tried the default as well as the "MEMORY_AND_DISK" persist mechanism, but it behaves the same in both cases.
Execution time for the other tables is impacted if we run the large tables first.
The Spark job is invoked in a standard way from a shell script using spark-submit; the SQL query from my Spark job is below.
val result = sqlcontext.sql(
  "CREATE TABLE " + DB + "." + table_name +
  " row format delimited fields terminated by '^'" +
  " STORED as ORC tblproperties(\"orc.compress\"=\"SNAPPY\",\"orc.stripe.size\"='67108864')" +
  " AS select distinct a.* from " + fdl_db + "." + table_name + " a," +
  " (SELECT SRL_NO, MAX(" + INC_COL + ") as incremental_col" +
  "  FROM " + fdl_db + "." + table_name + " group by SRL_NO) b" +
  " where a.SRL_NO = b.SRL_NO and a." + INC_COL + " = b.incremental_col"
).repartition(100)
Please let me know if you need any more info.
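For reference, the same logic can be split into steps so the deduplicated result is persisted before it is written. The following is a rough PySpark sketch, under the assumption that DB, table_name, fdl_db and INC_COL are defined as in the Scala code above and that sqlContext is a HiveContext; the write step uses plain ORC and does not reproduce the tblproperties from the original CTAS:

from pyspark import StorageLevel

src = "{0}.{1}".format(fdl_db, table_name)

latest = sqlContext.sql(
    "SELECT SRL_NO, MAX({0}) AS incremental_col FROM {1} GROUP BY SRL_NO".format(INC_COL, src))
latest.registerTempTable("latest_per_srl")

deduped = sqlContext.sql(
    ("SELECT DISTINCT a.* FROM {0} a JOIN latest_per_srl b "
     "ON a.SRL_NO = b.SRL_NO AND a.{1} = b.incremental_col").format(src, INC_COL))

# Persist the deduplicated rows so they are computed once, then inspect the row count
deduped.persist(StorageLevel.MEMORY_AND_DISK)
print(deduped.count())

# Write as ORC into the target database; the Snappy/stripe-size tblproperties are not set here
sqlContext.sql("USE " + DB)
deduped.write.format("orc").mode("overwrite").saveAsTable(table_name)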
add file s3://nouveau3/cleanser/cleanser.py
CREATE EXTERNAL TABLE IF NOT EXISTS ext_tbl (
c STRING
) ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
LOCATION 's3-location'
tblproperties ('skip.header.line.count'='1');
CREATE TABLE main_tbl (schema);
INSERT INTO TABLE main_tbl
SELECT TRANSFORM(c)
USING 'python cleanser.py' as (schema)
FROM ext_tbl;
The INSERT query runs for more than 15 minutes. To improve that, how can I run that query in Spark? The s3 location has more than 50 objects (gz format).
Approach 1 - If the query doesn't deal with too much data, and depending on the capacity of your edge nodes, you can run it directly on Spark just by logging into the spark-sql> shell.
Approach 2 - The spark-sql shell will not submit the query in cluster mode; it will just run on a single edge node, and this might kill your job if the edge node falls short of resources.
Instead, you can write a Python script that reads your query and calls spark.sql("your queries"), then launch the job using spark-submit --deploy-mode cluster. The spark-submit command gives you an option to specify the deploy mode, which should be cluster. This will leverage your entire cluster instead of just one node.
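For example, a rough sketch of such a script, using the file and table names from the question above; the AS (schema) placeholder would need the real column list, and whether ADD FILE makes the script visible to TRANSFORM exactly as in Hive can depend on the Spark version:

# cleanse_job.py - a sketch of Approach 2
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("cleanse-and-load") \
    .enableHiveSupport() \
    .getOrCreate()

# Ship the cleanser script to the executors, as in the original Hive statement
spark.sql("add file s3://nouveau3/cleanser/cleanser.py")

# The same TRANSFORM-based insert as before, now run through Spark SQL
spark.sql("""
    INSERT INTO TABLE main_tbl
    SELECT TRANSFORM(c)
    USING 'python cleanser.py' AS (schema)
    FROM ext_tbl
""")

spark.stop()

It can then be launched with spark-submit --deploy-mode cluster cleanse_job.py so the work runs on the cluster rather than on a single edge node; the exact spark-submit options depend on your cluster.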
I have a bunch of CSV files stored in blob storage that contain records like this:
2016-04-19 20:26:01.0299,+05:30,ecc84966-9bc0-4bef-9cd2-ad79c25be278,test001,178.03499442294,,Good
2016-04-19 20:26:02.0303,+05:30,ecc84966-9bc0-4bef-9cd2-ad79c25be278,test001,160.205223861246,,Good
I have created an External Hive table with the following command
CREATE EXTERNAL TABLE my_history (
DataTimestamp Timestamp,
TimezoneOffset String,
SystemGuid String,
TagName String,
NumericValue Double,
StringValue String
)
PARTITIONED BY (year int, month int, day int, hour int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb://mycontainer@mystorage.blob.core.windows.net/';
and have managed to add many partitions like the one below for a month's worth of data:
ALTER TABLE my_history ADD IF NOT EXISTS PARTITION (year=2016, month = 03, day= 16, hour=00) LOCATION "Year=2016/Month=03/Day=16/Hour=00"
There are around 135,733,286 records in the table; at least, that's what the Hive query select count(*) from my_history says.
Now I have the following 2 issues:
1. Jupyter Hangs
When I execute a query like hiveContext.sql("select count(*) from my_history").show() I get no results, not even an exception, whereas running the same query from Hive gives me 135,733,286 as the result after a long, long time, say 400+ seconds.
2. Slow Results
I tried a simple duplicate query on Hive like this
SELECT
my_history.DataTimestamp,
my_history.TagName,
COUNT(*) as count,
MIN(my_history.NumericValue) as min_value,
MAX(my_history.NumericValue) as max_value
FROM
default.my_history
WHERE
my_history.TagName = 'test021'
GROUP BY
my_history.TagName,
my_history.DataTimestamp
HAVING
count > 1;
It takes close to 450 seconds to return a result. I kind of expected it to return results in a fraction of that time, as I have close to 60 cores on my HDInsight cluster. Running it from Jupyter again didn't yield any results, nor did running the same query multiple times improve the performance, even though I have read that Spark caches the RDD for the next query.
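For reference, Spark SQL only caches a table across queries when told to; a minimal sketch of explicit caching, assuming the hiveContext from above:

# Caching is explicit: materialize the table once, then reuse it in later queries
hiveContext.cacheTable("my_history")
hiveContext.sql("select count(*) from my_history").show()   # first run pays the scan-and-cache cost
hiveContext.sql("select count(*) from my_history").show()   # later runs read the cached data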
What am I missing here?
Thanks
Kiran
Jupyter may hang if there are no resources in YARN to start a new Spark application for your notebook. In this case Jupyter will wait until resources are available. Resources may be consumed by other Spark applications from other notebooks. Check the YARN UI to see whether other applications are running and whether resources are available. You can kill other applications from this UI, or, in the case of notebooks, you can shut them down from Jupyter's "Running notebooks" UI.
Slow queries may be caused by many issues. The first thing to check is whether your Spark application uses all available cores in YARN. In the Preview, notebooks are provided with around 25% of the resources. You can change that allocation with the %%configure command. Set the number of cores to 4 and the number of executors to 15:
%%configure -f
{"name":"remotesparkmagics-sample", "executorMemory": "12G", "executorCores":4, "numExecutors":15}
This should give all 60 cores to your application.
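To sanity-check the allocation afterwards from a notebook cell, assuming the sc handle the notebook provides:

print(sc.defaultParallelism)  # on YARN this usually reflects the total executor cores
print(sc.getConf().get("spark.executor.cores", "not set"))
print(sc.getConf().get("spark.executor.instances", "not set"))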
Spark newb question: I'm making exactly the same Spark SQL query in spark-sql and in spark-shell. The spark-shell version takes about 10 seconds, while the spark-sql version takes about 20.
The spark-sql REPL gets the query directly:
spark-sql> SELECT .... FROM .... LIMIT 20
The spark-shell REPL commands are like this:
scala> val df = sqlContext.sql("SELECT ... FROM ... LIMIT 20 ")
scala> df.show()
In both cases, it's exactly the same query. Also, the query returns only a few rows because of the explicit LIMIT 20.
What's different about how the same query is executed from the different CLIs?
I'm running on Hortonworks sandbox VM (Linux CentOS) if that helps.
I think it comes down to two things.
First, it could be related to the order. If you run spark-sql first, Spark has to build the explain plan from scratch. But if you then run the same query again, whether from the shell or from spark-sql, it can take less time than the first run because the explain plan is easier to retrieve.
Second, it could be related to how spark-sql goes about requesting and ordering its resources. I have seen this multiple times: spark-shell gets its resources and starts processing faster than spark-sql. You can check this from the UI or from top; you will find that the actual start of spark-shell is faster than that of spark-sql.
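One way to separate those two effects is to run the same statement twice in one session and compare the timings and the plan. A PySpark sketch for illustration (the query text is the placeholder from the question and must be substituted); the same idea applies in spark-shell:

import time

query = "SELECT ... FROM ... LIMIT 20"   # placeholder from the question; substitute the real query

df = sqlContext.sql(query)
df.explain()    # inspect the physical plan Spark builds for the query

for run in (1, 2):
    start = time.time()
    sqlContext.sql(query).show()
    print("run %d took %.1f s" % (run, time.time() - start))
# A much faster second run suggests the first run's cost was mostly session startup,
# resource acquisition, and plan/metadata warm-up rather than the query itself.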