How do I print out a spark.sql object? - apache-spark

I have a spark.sql object that includes a couple of variables.
import com.github.nscala_time.time.Imports.LocalDate
val first_date = new LocalDate(2020, 4, 1)
val second_date = new LocalDate(2020, 4, 7)
val mydf = spark.sql(s"""
select *
from tempView
where timestamp between '{0}' and '{1}'
""".format(start_date.toString, end_date.toString))
I want to print out mydf because I ran mydf.count and got 0 as the outcome.
I ran mydf and got back mydf: org.apache.spark.sql.DataFrame = [column: type]
I also tried println(mydf) and it didn't return the query.
There is this related question, but it does not have the answer.
How can I print out the query?

The easiest way would be to store your query in a variable, print the variable to see the query, and then pass the same variable to spark.sql.
Example:
In Spark Scala:
val start_date="2020-01-01"
val end_date="2020-02-02"
val query=s"""select * from tempView where timestamp between'${start_date}' and '${end_date}'"""
print (query)
//select * from tempView where timestamp between'2020-01-01' and '2020-02-02'
spark.sql(query)
In PySpark:
start_date="2020-01-01"
end_date="2020-02-02"
query="""select * from tempView where timestamp between'{0}' and '{1}'""".format(start_date,end_date)
print(query)
#select * from tempView where timestamp between'2020-01-01' and '2020-02-02'
#use same query in spark.sql
spark.sql(query)
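As a complementary check (a Scala sketch that reuses the query variable from the Scala example above), Spark can also print the DataFrame's query plans; an unsubstituted placeholder such as a literal '{0}' in the timestamp filter would be visible there.
// explain(true) prints the parsed, analyzed, optimized and physical plans of
// the DataFrame built from the query string, which makes it easy to spot a
// filter that still contains a literal, unreplaced placeholder.
val mydf = spark.sql(query)
mydf.explain(true)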

Here it is in PySpark.
start_date="2020-01-01"
end_date="2020-02-02"
q="select * from tempView where timestamp between'{0}' and '{1}'".format(start_date,end_date)
print(q)
Here is the online running version: https://repl.it/repls/FeistyVigorousSpyware

Related

Spark to SparkSQL equivalent syntax

I have these two lines in Spark and I want to get the equivalent in Spark SQL (I'm working in a Python environment):
df = spark_df.filter(spark_df["col_name".lower()].rlike("[0-9]{9}$")).count()
spark_df = spark_df.withColumn(columnname, F.to_date(F.col(columnname), "yyyyMMdd"))
For Spark SQL, first register the DataFrame as a temp view and then run the SQL.
Example:
spark_df.createOrReplaceTempView("tmp")
df=spark.sql("""select count(*) from tmp where lower(col_name) rlike("[0-9]{9}$") """).collect()[0][0]
spark_df = spark.sql("""select *, to_date(columnname,"yyyyMMdd") columnname from tmp """)

How to convert a SQL query to a Spark Dataset?

I have val test = sql("Select * from table1"), which returns a DataFrame. I want to convert it to a Dataset, which is not working.
test.toDS is throwing an error.
Please provide more detail about the error.
If you want to convert a DataFrame into a Dataset, use the code below:
case class MyClass(field1: Int, field2: Long) // for example; field names must match the query's columns
import org.apache.spark.sql.Dataset
import spark.implicits._ // provides the Encoder required by .as[MyClass]
val df = spark.sql("Select * from table1")
val ds: Dataset[MyClass] = df.as[MyClass]
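As a small usage sketch (the field names are just the illustrative ones from the case class above, not columns of a real table1):
// With a typed Dataset[MyClass], rows are accessed through case class fields
// instead of untyped Row lookups, and a misspelled field fails at compile time.
val filtered = ds.filter(_.field2 > 100L)   // typed filter on field2
filtered.show()
val total = ds.map(_.field2).reduce(_ + _)  // sum of field2 across all rows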

update cassandra from spark

I have a table in Cassandra, tfm.foehis, that already has data.
When I did the first load of data from Spark to Cassandra, I used this set of commands:
import org.apache.spark.sql.functions._
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val wkdir="/home/adminbigdata/tablas/"
val fileIn= "originales/22_FOEHIS2.csv"
val fileOut= "22_FOEHIS_PRE2"
val fileCQL= "22_FOEHISCQL"
val data = sc.textFile(wkdir + fileIn).filter(!_.contains("----")).map(_.trim.replaceAll(" +", "")).map(_.dropRight(1)).map(_.drop(1)).map(_.replaceAll(",", "")).filter(array => array(6) != "MOBIDI").filter(array => array(17) != "").saveAsTextFile(wkdir + fileOut)
val firstDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").option("mode", "DROPMALFORMED").option("delimiter", "|").load(wkdir + fileOut)
val columns: Array[String] = firstDF.columns
val reorderedColumnNames: Array[String] = Array("hoclic","hodtac","hohrac","hotpac","honrac","hocdan","hocdrs","hocdsl","hocol","hocpny","hodesf","hodtcl","hodtcm","hodtea","hodtra","hodtrc","hodtto","hodtua","hohrcl","hohrcm","hohrea","hohrra","hohrrc","hohrua","holinh","holinr","honumr","hoobs","hooe","hotdsc","hotour","housca","houscl","houscm","housea","houser","housra","housrc")
val secondDF= firstDF.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
secondDF.write.cassandraFormat("foehis", "tfm").save()
But when I load new data using the same script, I get errors, and I don't know what's wrong.
This is the message:
java.lang.UnsupportedOperationException: 'SaveMode is set to ErrorIfExists and Table
tfm.foehis already exists and contains data.
Perhaps you meant to set the DataFrame write mode to Append?
Example: df.write.format.options.mode(SaveMode.Append).save()" '
The error message clearly tells you that you need to use Append mode and shows what you can do. In your case it happens because the destination table already exists and the write mode is set to "error if exists". If you still want to write the data, the code should be the following:
import org.apache.spark.sql.SaveMode
secondDF.write.cassandraFormat("foehis", "tfm").mode(SaveMode.Append).save()

Will Spark reuse an RDD in the DAG for a single action?

Will Spark reuse an RDD in the DAG for a single action?
Case 1
val df1 = spark.sql("select id, value from table")
val df2 = spark.sql("select id, value from table")
df1.join(df2, "id").show()
Case 2
val df1 = spark.sql("select id, value from table")
val df2 = df1.filter($"value" > 0)
df1.join(df2, "id").show()
Questions
In case 1, will the query "select id, value from table" be executed only once?
In case 2, will the query be executed only once?
If not, how can I optimize the code so that the query executes only once? The query may be very slow.
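One common way to make sure the source query is evaluated only once is to cache the shared DataFrame before joining; the following is only an illustrative sketch for case 2, not a confirmed answer.
// Sketch for case 2, assuming it is acceptable to cache the intermediate result.
// Without a cache, the physical plan may contain two scans of the table, one
// for each side of the self-join.
val df1 = spark.sql("select id, value from table").cache()
df1.count()                        // materializes the cache with a single scan
val df2 = df1.filter($"value" > 0) // derived from the cached data
df1.join(df2, "id").show()         // both sides now read from the cache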

sparkSQL wrong result

My Spark version is 1.5.0, and I use Spark SQL in spark-shell to do some ETL. Here is my code:
import com.databricks.spark.avro._
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
import java.security.MessageDigest
val dfGoods = sqlContext.read.avro("hdfs:///user/data/date=*")
dfGoods.registerTempTable("goodsinfo")
val dfGoodsLmt=sqlContext.sql("SELECT * FROM (SELECT goodsid, etype, goodsattribute, row_number() over (partition by goodsid order by runid DESC) rank_num FROM goodsinfo) tmp WHERE rank_num =1")
I use dfGoodsLmt.count() to check the row count. The first time, the result is always wrong, but when I rerun dfGoodsLmt.count() after that, the result is right. I have tried this many times and I don't know why.
Here is a demo; the result should be 1000, but I need to run it more than once to get the right answer.
case class data(id: Int, name: Int)
val tmp = (1 to 1000) zip (1 to 1000)
tmp.map(x => data(x._1, x._2)).toDF.registerTempTable("test_table")
sqlContext.sql("select * from (select *, row_number() over(partition by id order by id DESC) rank from test_table) tmp where rank = 1").count
