I am using Spark 2.0.0 to query a Hive table. My SQL is:
select * from app.abtestmsg_v limit 10
Yes, I just want to get the first 10 records from the view app.abtestmsg_v.
When I run this SQL in spark-shell it is very fast, taking about 2 seconds.
But the problem comes when I try to run the same query from my Python code, in a very simple PySpark program. Below is my PySpark code:
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
import json
hc = HiveContext(sc)
hc.setConf("hive.exec.orc.split.strategy", "ETL")
hc.setConf("hive.security.authorization.enabled",false)
zj_sql = 'select * from app.abtestmsg_v limit 10'
zj_df = hc.sql(zj_sql)
zj_df.collect()
Below is my Scala code:
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
hive.setConf("hive.exec.orc.split.strategy", "ETL")
val df = hive.sql("select * from silver_ep.zj_v limit 10")
df.rdd.collect()
From the INFO log, I find:
Although I use "limit 10" to tell Spark that I just want the first 10 records, Spark still scans and reads all files of the view (in my case, the source data of this view contains 100 files, each about 1 GB in size). So there are nearly 100 tasks, each reading one file, and all the tasks are executed serially. It takes nearly 15 minutes to finish these 100 tasks, but all I want is the first 10 records.
So I don't know what is wrong. Could anybody give me some suggestions?
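A minimal PySpark sketch of two things worth trying, assuming the same sc and view as above; neither is a guaranteed fix, just what the symptoms suggest:

from pyspark.sql import HiveContext

hc = HiveContext(sc)
# Leave hive.exec.orc.split.strategy at its default ("HYBRID") rather than
# forcing "ETL": the ETL strategy reads every ORC file footer while planning
# splits, which by itself can touch all 100 files.
zj_df = hc.sql("select * from app.abtestmsg_v")
# take(10) fetches rows incrementally, starting from a single partition and
# scanning more only if needed, instead of materializing the whole result.
first_rows = zj_df.take(10)
for row in first_rows:
    print(row)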
I am following the book Spark: The Definitive Guide and was writing a basic program that streams data. The book says I should use the awaitTermination() method to process the query correctly. When I run the code below, it runs indefinitely until I press Ctrl+C, and then it ends with an exception. My question is: how can I monitor the status of my streaming query so that, as soon as the stream completes, my program exits after showing the output? In the example below, once it has read all the files and written the counts to the console, it should end, but it doesn't. I also tried inserting activityQuery.stop(), but that didn't work either. How can I achieve this? Any help is appreciated.
from pyspark import SparkConf
from pyspark.sql import *
from pyspark.sql.functions import *
from time import sleep
conf = SparkConf()
spark = SparkSession.builder.config(conf=conf).appName('testapp').getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 2)
spark.conf.set("spark.sql.streaming.schemaInference", "true")
static = spark.read.format("json").load("/home/scom/.test/spark/Spark-The-Definitive-Guide/data/activity-data/")
dataSchema = static.schema
streaming = spark.readStream.schema(dataSchema).option("maxFilesPerTrigger", 1).json("/home/scom/.test/spark/Spark-The-Definitive-Guide/data/activity-data/")
activityCounts = streaming.groupBy("gt").count()
activityQuery = activityCounts.writeStream.queryName("activity_counts").format("console").outputMode("complete").start()
activityQuery.awaitTermination()
for x in range(5):
    spark.sql("select * from activity_counts").show()
    sleep(1)
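For reference, a rough sketch of one way to do this, assuming that "finished" means "no more input files waiting and no micro-batch in flight"; it only uses awaitTermination(timeout), status and stop() from the StreamingQuery API:

import time

# activityQuery is the query started above with .start()
time.sleep(5)  # give the first micro-batch a chance to run before polling

while activityQuery.isActive:
    # awaitTermination(timeout) returns True if the query terminated within
    # `timeout` seconds, and False if it is still running.
    if activityQuery.awaitTermination(5):
        break
    status = activityQuery.status
    # Once no new data is waiting and no trigger is active, assume the static
    # input directory has been fully processed and stop the query ourselves.
    if not status["isDataAvailable"] and not status["isTriggerActive"]:
        activityQuery.stop()

# With outputMode("complete") the console sink has already printed the
# final counts by the time the query stops.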
I am attempting to build a Glue job that will execute a SQL query against an existing Glue catalog and store the results in another Glue catalog (in the example below, returning only the record with the highest price for each value of sn). When the Spark query runs against CSV-sourced data, however, it includes the header row in the results. This does not happen when the source is Parquet. The Glue catalog's SerDe parameters include skip.header.line.count = 1, and running the query against the source data through Athena does not include the headers.
Is there a way to explicitly tell Spark to ignore header rows when using .sql()?
Here is the essence of the Python code my Glue job executes:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
glue_source_database_name = 'source_database'
glue_destination_database_name = 'destination_database'
table_name = 'diamonds10_csv'
partition_count = 5
merge_query = 'SELECT SEQ.`sn`,SEQ.`carat`,SEQ.`cut`,SEQ.`color`,SEQ.`clarity`,SEQ.`depth`,SEQ.`table`,SEQ.`price`,SEQ.`x`,SEQ.`y`,SEQ.`z` FROM ( SELECT SUB.`sn`,SUB.`carat`,SUB.`cut`,SUB.`color`,SUB.`clarity`,SUB.`depth`,SUB.`table`,SUB.`price`,SUB.`x`,SUB.`y`,SUB.`z`, ROW_NUMBER() OVER ( PARTITION BY SUB.`sn` ORDER BY SUB.`price` DESC ) AS test_diamond FROM `diamonds10_csv` AS SUB) AS SEQ WHERE SEQ.test_diamond = 1'
spark_context = SparkContext.getOrCreate()
spark = SparkSession( spark_context )
spark.sql( f'use {glue_source_database_name}')
targettable = spark.sql(merge_query)
targettable.repartition(partition_count).write.option("path",f'{s3_output_path}/{table_name}').mode("overwrite").format("parquet").saveAsTable(f'`{glue_destination_database_name}`.`{table_name}`')
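Given the behaviour described above, one blunt workaround is to drop the leaked header row from the result before writing it out. This sketch assumes the CSV header literally contains the column names used in the query, so the stray row is the one whose sn column holds the string 'sn':

from pyspark.sql import functions as F

targettable = spark.sql(merge_query)
# The leaked header shows up as an ordinary row whose values are the column
# names themselves, so filtering on one column is enough to drop it.
cleaned = targettable.filter(F.col('sn') != 'sn')
cleaned.repartition(partition_count) \
    .write.option("path", f'{s3_output_path}/{table_name}') \
    .mode("overwrite") \
    .format("parquet") \
    .saveAsTable(f'`{glue_destination_database_name}`.`{table_name}`')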
I am running a Spark job that reads data from Teradata. The query looks like:
select * from db_name.table_name sample 5000000;
I'm trying to pull a sample of 5 million rows of data. When I try to print the number of rows in the resulting DataFrame, it gives a different result each time I run it: sometimes 4999937 and sometimes 5000124. Is there any particular reason for this kind of behaviour?
EDIT #1:
The code I'm using:
val query = "(select * from db_name.table_name sample 5000000) as data"
var teradataConfig = Map("url"->"jdbc:teradata://HOSTNAME/DATABASE=db_name,DBS_PORT=1025,MAYBENULL=ON",
"TMODE"->"TERA",
"user"->"username",
"password"->"password",
"driver"->"com.teradata.jdbc.TeraDriver",
"dbtable" -> query)
var df = spark.read.format("jdbc").options(teradataConfig).load()
df.count
Try caching the resulting DataFrame and then performing the count action on it:
df.cache()
println(s"Record count: ${df.count()}")
Without the cache, every action re-runs the JDBC query against Teradata, and SAMPLE can return a different set of rows each time. From here on, when you reuse df to create a new DataFrame or apply any other transformation, you won't get mismatched counts, because the data is served from the cache.
Make sure you have given Spark enough memory to hold the cached DataFrame.
It doesn't look like Apache Spark or the Spark Cassandra Connector is reading multiple partitions in parallel.
Here is my code, using spark-shell:
import org.apache.spark.sql._
import org.apache.spark.sql.types.StringType
spark.sql("""CREATE TEMPORARY VIEW hello USING org.apache.spark.sql.cassandra OPTIONS (table "hello", keyspace "db", cluster "Test Cluster", pushdown "true")""")
val df = spark.sql("SELECT test from hello")
val df2 = df.select(df("test").cast(StringType).as("test"))
val rdd = df2.rdd.map { case Row(j: String) => j }
val df4 = spark.read.json(rdd) // This line takes forever
I have about 700 million rows, each about 1 KB, and this line
val df4 = spark.read.json(rdd) takes forever, as I get the following output:
[Stage 1:==========> (4866 + 24) / 25256]
At this rate it will probably take roughly 3 hours.
I measured the network throughput of the Spark worker nodes using iftop and it is about 75 MB/s (megabytes per second), which is pretty good, but I am not sure it is reading partitions in parallel. Any ideas on how to make it faster?
Here is my DAG.
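One thing that may help: with no schema supplied, spark.read.json(rdd) makes an extra full pass over the data just to infer the schema before it builds df4. Below is a PySpark-flavoured sketch of passing an explicit schema instead (the same reader calls exist in Scala); the field names are placeholders for whatever the JSON in the test column actually contains:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Placeholder schema -- replace the fields with the real structure of the JSON.
json_schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])

# Supplying the schema up front skips the inference pass, so the Cassandra
# data is only scanned once, when df4 is actually used.
df4 = spark.read.schema(json_schema).json(rdd)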
The thing is, I have read permission on one table, which is partitioned by year, month and day, but I do not have permission to read the data under 2016/04/24.
When I execute this in the Hive CLI:
hive> select * from table where year="2016" and month="06" and day="01";
I can read every other day's data, just not 2016/04/24.
But when I run it in Spark:
sqlContext.sql("select * from table where year='2016' and month='06' and day='01'")
an exception is thrown saying that I don't have permission on hdfs/.../2016/04/24.
Does this show that Spark SQL loads the whole table first and then filters?
How can I avoid loading the whole table?
You can use JdbcRDD directly. With it you bypass the Spark SQL engine, so your queries are sent straight to Hive.
To use JdbcRDD you need to load the Hive JDBC driver and register it first (if it is not already registered).
val driver = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driver)
Then you can create a JdbcRDD:
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val connUrl = "jdbc:hive2://..."
val query = """select * from table where year="2016" and month="06" and day="01" and ? = ?"""
val lowerBound = 0
val upperBound = 0
val numOfPartitions = 1
val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(connUrl),
  query,
  lowerBound,
  upperBound,
  numOfPartitions,
  (r: ResultSet) => r.getString(1) // get data here, or with a function
)
The JdbcRDD query must contain two ? placeholders, which JdbcRDD fills with sub-ranges of the lower and upper bounds in order to partition your data. So you should write a better query than mine; this one just creates a single partition to demonstrate how it works.
However, before doing all of this I recommend you look at HiveContext, which supports HiveQL as well.