What is the fastest way to read database using PySpark? - apache-spark

I am trying to read a table of a database using PySpark and SQLAlchamy as follows:
SUBMIT_ARGS = "--jars mysql-connector-java-5.1.45-bin.jar pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
sc = SparkContext('local[*]', 'testSparkContext')
sqlContext = SQLContext(sc)
t0 = time.time()
database_uri = 'jdbc:mysql://{}:3306/{}'.format("127.0.0.1",<db_name>)
dataframe_mysql = sqlContext.read.format("jdbc").options(url=database_uri, driver = "com.mysql.jdbc.Driver", dbtable = <tablename>, user= <user>, password=<password>).load()
print(dataframe_mysql.rdd.map(lambda row :list(row)).collect())
t1 = time.time()
database_uri2 = 'mysql://{}:{}#{}/{}'.format(<user>,<password>,"127.0.0.1",<db_name>)
engine = create_engine(database_uri2)
connection = engine.connect()
s = text("select * from {}.{}".format(<db_name>,<table_name>))
result = connection.execute(s)
for each in result:
print(each)
t2= time.time()
print("Time taken by PySpark:", (t1-t0))
print("Time taken by SQLAlchamy", (t2-t1))
This is the time taken to fetch some 3100 rows:
Time taken by PySpark: 12.326745986938477
Time taken by SQLAlchamy: 0.21664714813232422
Why SQLAlchamy is outperforming PySpark? Is there any way to make this faster? Is there any error in my approach?

Why SQLAlchamy is outperforming PySpark? Is there any way to make this faster? Is there any error in my approach?
More than one. Ultimately you try use Spark in a way it is not intended to be used, measure incorrect thing and introduce incredible amount of indirection. Overall:
JDBC DataSource is inefficient, and as you use it is completely sequential. Check parallellizing reads in Spark Gotchas.
Collecting data is not intended for production use in practice.
You introduce a lot of indirection, by converting data to RDD and serializing, fetching to driver and deserializing.
Your code measures not only data processing time, but also cluster / contexts initialization time.
local mode (designed for prototyping and unit testing) is just a cherry on the top.
And so on...
So at the end of the day your code is slow but it is not something you'd use in production application. SQLAlchemy and Spark are designed for complete different purposes - if you're looking for low latency database access layer Spark is not the right choice.

Related

Spark job taking long time # Code or Environment Issue?

we have a 300 node cluster, each node having 132gb memory and 20 cores. the ask is - remove data from tableA which is in tableB and then merge B with A and push A to teradata.
below is the code
val ofitemp = sqlContext.sql("select * from B")
val ofifinal = sqlContext.sql("select * from A")
val selectfromfinal = sqlContext.sql("select A.a,A.b,A.c...A.x from A where A.x=B.x")
val takefromfinal = ofifinal.except(selectfromfinal)
val tempfinal = takefromfinal.unionAll(ofitemp)tempfinal.write.mode("overwrite").saveAsTable("C")
val tempTableFinal = sqlContext.table("C")tempTableFinal.write.mode("overwrite").insertInto("A")
the config used to run spark is -
EXECUTOR_MEM="16G"
HIVE_MAPPER_HEAP=2048 ## MB
NUMBER_OF_EXECUTORS="25"
DRIVER_MEM="5G"
EXECUTOR_CORES="3"
with A and B having few million records, the job is taking several hours to run.
As am very new to Spark, am not understanding - is it the code issue or the environment setting issue.
would be obliged, if you can share your thoughts to overcome the performance issues.
In your code, exceptcould be a bottleneck because it compares all columns for equality. Is this really what you need (I'm confused about the join À.x=B.y`in the line before)
If you only need to check on 1 attribute, the fastest way would be to do a "leftanti"-join :
val takefromfinal = ofifinal.join(ofitemp,$"A.x"===$"B.y","leftanti")
Besides that, study spark-UI and identify the bottleneck

SPARK JDBC Connection Reuse for many Queries Executed

Borrowing from SO 26634853, then the following questions :
Using an IMPALA connection like this is a one-shot set up :
val JDBCDriver = "com.cloudera.impala.jdbc41.Driver"
val ConnectionURL = "jdbc:impala://url.server.net:21050/default;auth=noSasl"
Class.forName(JDBCDriver).newInstance
val con = DriverManager.getConnection(ConnectionURL)
val stmt = con.createStatement()
val rs = stmt.executeQuery(query)
val resultSetList = Iterator.continually((rs.next(), rs)).takeWhile(_._1).map(r => {
getRowFromResultSet(r._2) // (ResultSet) => (spark.sql.Row)
}).toList
sc.parallelize(resultSetList)
What if I need to put a loop around the con.createStatement() and associated code thereunder with some logic and execute it say, some, 5000 times?
Referring to the db connection overhead discussions with map vs. mapPartitions, would I, in this case, incur 5000 x the cost of the connection, or is it re-usable the way it is done here? From documentation on SCALA JDBC it looks like it can be reused.
My thinking is that as it is not a high-level SPARK API like df_mysql = sqlContext.read.format("jdbc").options ..., then I think it should remain open, but would like to check. May be the SPARK env closes it automatically, but I think not. At the end of processing a close could be issued?
Using the HIVE Context means we do not need to open the connection every time - or is that not so? Then using a parquet or ORC table, then I presume would allow such an approach as performance is quite fast.
I tried this simulation and connection remains open, so provided not in foreach, it is not an issue performance-wise.
var counter = 0
do
{
counter = counter + 1
val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select author from family) f ", connectionProperties)
dataframe_mysql.show
} while (counter < 3)

Spark streaming performance issue. Every minute process time increasing

We are facing performance issue in my streaming application. I am reading my data from kafka topic using DirectStream and converting the data into dataframe. After doing some aggregation operation in dataframe ,I am saving the result into registerTempTable. The registerTempTable will be used for next minute dataframe compression. And compared result will be saved in HDFS and data will be overwritten in existing registerTempTable.
In this I am facing performance issue. My streaming job ruining in first min 15sec, second min 18 sec and third min 20 sec like that it keeps increasing the processing time. Over period my streaming job will queued.
About my application.
Streaming will run on every 60 sec.
Spark version 2.1.1( I am using pyspark)
Kafka topic have four partitions.
To solve the issue, I have tried below steps.
Step 1: While submitting my job I am giving “spark.sql.shuffle.partitions=4” .
Step 2: while saving my dataframe as textfile I am using coalesce(4).
When I see my spark url every min run, save as file stage is doubling up like first min 14 stages, second min 28 stages and third min 42 stages.
Spark UI Result.
Hi,
Thanks for reply,
Sorry, i am new to spark. I am not sure what exactly I need change, can you please help me. do I need do unpersist my "df_data"?
I cache and unpersist my "df_data" data frame also. But still, i am facing the same issue.
Do I need to enable my checkpoint? like adding "ssc.checkpoint("/user/test/checkpoint")" this code.createDirectStream will support checkpoint? Or i need to enable offset value? please let me know what change is required here.
if __name__ == "__main__":
sc = SparkContext(appName="PythonSqlNetworkWordCount")
sc.setLogLevel("ERROR")
sparkSql = SQLContext(sc)
ssc = StreamingContext(sc, 60)
zkQuorum = {"metadata.broker.list" : "m2.hdp.com:9092"}
topic = ["kpitmumbai"]
kvs = KafkaUtils.createDirectStream(ssc,topic,zkQuorum)
schema = StructType([StructField('A', StringType(), True), StructField('B', LongType(), True), StructField('C', DoubleType(), True), StructField('D', LongType(), True)])
first_empty_df = sqlCtx.createDataFrame(sc.emptyRDD(), schema)
first_empty_df.registerTempTable("streaming_tbl")
lines = kvs.map(lambda x :x[1])
lines.foreachRDD(lambda rdd: empty_rdd() if rdd.count() == 0 else
CSV(rdd))
ssc.start()
ssc.awaitTermination()
def CSV(rdd1):
spark = getSparkSessionInstance(rdd1.context.getConf())
psv_data = rdd1.map(lambda l: l.strip("\s").split("|") )
data_1 = psv_data.map(lambda l : Row(
A=l[0],
B=l[1],
C=l[2],
D=l[3])
hasattr(data_1 ,"toDF")
df_2= data_1.toDF()
df_last_min_data = sqlCtx.sql("select A,B,C,D,sample from streaming_tbl")#first time will be empty and next min onwards have values
df_data = df_2.groupby(['A','B']).agg(func.sum('C').alias('C'),func.sum('D').alias('D'))
df_data.registerTempTable("streaming_tbl")
con=(df_data.A==df_last_min_data.A) & (df_data.B==df_last_min_data.B)
path1=path="/user/test/str" + str(Starttime).replace(" ","").replace("-","").replace(":","")
df_last_min_data.join(df_data,con,"inner").select(df_last_min_data.A,df_last_min_data.b,df_data.C,df_data.D).write.csv(path=path1,mode="append")
Once again thanks for the reply.
After doing some aggregation operation in dataframe ,I am saving the result into registerTempTable. The registerTempTable will be used for next minute dataframe compression. And compared result will be saved in HDFS and data will be overwritten in existing registerTempTable.
The most likely problem is you don't checkpoint the table and lineage keeps growing with each iteration. This makes each iteration more and more expensive, especially when data is not cached.
Overall if you need stateful operations you should prefer existing stateful transformations. Both "old" streaming and structured streaming come with their own variants, which can be used in a variety of scenarios.

Can spark execute table operations on remote nodes? (vs. row operations)

Most of spark's Dataset functions are per-row operations. However, I'd like to distribute execution of ML tasks to run on Spark -- most ML tasks are naturally operations that are functions of tables, and not natually functions of rows. (I've looked at MLLib -- its way too limited, and in many cases execution is made orders of magnitude slower in spark by distribute operations over many cores that could otherwise fit on a single core).
Its important that ML algorithms process collections of rows, not single rows, and so I'd like to materialize a table into memory on a node. (I pinky promise it will fit into core). How can I do this?
Functionally, I'd like to do:
def mlsubtask(table, arg2, arg3):
data = table.collect()
...
sc = SparkContext(...)
sqlctx = SQLContext(sc)
...
df = sqlctx.sql("SELECT ...")
results = sc.parallelize([(df,arg2,arg3),(df,arg2,arg3),(df,arg2,arg3)]).map(mlsubtask).collect()
If can perform execution like this:
sc = SparkContext(...)
sqlctx = SQLContext(sc)
...
df = sqlctx.sql("SELECT ...")
df = df.collect()
results = sc.parallelize([(df,arg2,arg3),(df,arg2,arg3),(df,arg2,arg3)]).map(mlsubtask).collect()
... but this brings the data to the client, which in then re-serialized and quite inefficient.
For a single task:
def mlsubtask(iter_rows):
data_table = list(iter_rows) # Or other way of bringing into memory.
...
df.repartition(1).mapPartitions(mlsubtask)

How to eval spark.ml model without DataFrames/SparkContext?

With Spark MLLib, I'd build a model (like RandomForest), and then it was possible to eval it outside of Spark by loading the model and using predict on it passing a vector of features.
It seems like with Spark ML, predict is now called transform and only acts on a DataFrame.
Is there any way to build a DataFrame outside of Spark since it seems like one needs a SparkContext to build a DataFrame?
Am I missing something?
Re: Is there any way to build a DataFrame outside of Spark?
It is not possible. DataFrames live inside SQLContext with it living in SparkContext. Perhaps you could work it around somehow, but the whole story is that the connection between DataFrames and SparkContext is by design.
Here is my solution to use spark models outside of spark context (using PMML):
You create model with a pipeline like this:
SparkConf sparkConf = new SparkConf();
SparkSession session = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate();
String tableName = "schema.table";
Properties dbProperties = new Properties();
dbProperties.setProperty("user",vKey);
dbProperties.setProperty("password",password);
dbProperties.setProperty("AuthMech","3");
dbProperties.setProperty("source","jdbc");
dbProperties.setProperty("driver","com.cloudera.impala.jdbc41.Driver");
String tableName = "schema.table";
String simpleUrl = "jdbc:impala://host:21050/schema"
Dataset<Row> data = session.read().jdbc(simpleUrl ,tableName,dbProperties);
String[] inputCols = {"column1"};
StringIndexer indexer = new StringIndexer().setInputCol("column1").setOutputCol("indexed_column1");
StringIndexerModel alphabet = indexer.fit(data);
data = alphabet.transform(data);
VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
Predictor p = new GBTRegressor();
p.set("maxIter",20);
p.set("maxDepth",2);
p.set("maxBins",204);
p.setLabelCol("faktor");
PipelineStage[] stages = {indexer,assembler, p};
Pipeline pipeline = new Pipeline();
pipeline.setStages(stages);
PipelineModel pmodel = pipeline.fit(data);
PMML pmml = ConverterUtil.toPMML(data.schema(),pmodel);
FileOutputStream fos = new FileOutputStream("model.pmml");
JAXBUtil.marshalPMML(pmml,new StreamResult(fos));
Using PPML for predictions (locally, without spark context, which can be applied to a Map of arguments and not on a DataFrame):
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);
inputFieldMap = new HashMap<String, Field>();
Map<FieldName,String> args = new HashMap<FieldName, String>();
Field curField = evaluator.getInputFields().get(0);
args.put(curField.getName(), "1.0");
Map<FieldName, ?> result = evaluator.evaluate(args);
Spent days on this problem too. It's not straightforward. My third suggestion involves code I have written specifically for this purpose.
Option 1
As other commenters have said, predict(Vector) is now available. However, you need to know how to construct a vector. If you don't, see Option 3.
Option 2
If the goal is to avoid setting up a Spark server (standalone or cluster modes), then its possible to start Spark in local mode. The whole thing will run inside a single JVM.
val spark = SparkSession.builder().config("spark.master", "local[*]").getOrCreate()
// create dataframe from file, or make it up from some data in memory
// use model.transform() to get predictions
But this brings unnecessary dependencies to your prediction module, and it consumes resources in your JVM at runtime. Also, if prediction latency is critical, for example making a prediction within a millisecond as soon as a request comes in, then this option is too slow.
Option 3
MLlib FeatureHasher's output can be used as an input to your learner. The class is good for one hot encoding and also for fixing the size of your feature dimension. You can use it even when all your features are numerical. If you use that in your training, then all you need at prediction time is the hashing logic there. Its implemented as a spark transformer so it's not easy to re-use outside of a spark environment. So I have done the work of pulling out the hashing function to a lib. You apply FeatureHasher and your learner during training as normal. Then here's how you use the slimmed down hasher at prediction time:
// Schema and hash size must stay consistent across training and prediction
val hasher = new FeatureHasherLite(mySchema, myHashSize)
// create sample data-point and hash it
val feature = Map("feature1" -> "value1", "feature2" -> 2.0, "feature3" -> 3, "feature4" -> false)
val featureVector = hasher.hash(feature)
// Make prediction
val prediction = model.predict(featureVector)
You can see details in my github at tilayealemu/sparkmllite. If you'd rather copy my code, take a look at FeatureHasherLite.scala.There are sample codes and unit tests too. Feel free to create an issue if you need help.

Resources