Get the query stored in a database and then collect information after running that query - apache-spark

I have a use case which I am trying to implement with Spark in AWS Glue.
I have one table that stores a query as a column value, and I need to run that query from my script.
For example:
Select src_query from table;
This gives me another query, shown below:
select tabl2.col1, tabl3.col2 from table2 join table3;
Now I want to collect the results of this second query into a DataFrame and proceed further.
source_df = spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("dbtable", "table1").option("user", Oracle_Username).option("password", Oracle_Password).load()
Now when we run this, the data from table1 gets stored in source_df. One of the columns of table1 stores a SQL query, e.g. select col1,col2 from tabl2;
Now I want to run the query mentioned above and store its result in a DataFrame, something like:
final_df2 = spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("query", "select col1,col2 from tabl2").option("user", Oracle_Username).option("password", Oracle_Password).load()
How can I get the query from the first DataFrame and run it as a query to fetch the result into another DataFrame?

You can use the code below when the source table has a small number of rows, since we use collect to bring all the queries from the source table to the driver.
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF on a local Seq
// create sample tables referenced in the stored queries
Seq(("A","01/01/2022",1), ("AXYZ","02/01/2022",1), ("AZYX","03/01/2022",1),("AXYZ","04/01/2022",0), ("AZYX","05/01/2022",0),("AB","06/01/2022",1), ("A","07/01/2022",0) )
.toDF("Category", "date", "Indictor")
.write.mode("overwrite").saveAsTable("table1")
Seq(("A","01/01/2022",1), ("b","02/01/2022",0), ("c","03/01/2022",1) )
.toDF("Category", "date", "Indictor")
.write.mode("overwrite").saveAsTable("table2")
//create the source dataframe
val df=Seq( (1,"select Category from table1"), (2,"select date from table2") )
.toDF("Sno", "Query")
//extract the queries from the source table.
val qrys = df.select("Query").collect()
qrys.foreach(println)
//execute each query from the column and save its result as a table
qrys.zipWithIndex.foreach { case (row, i) =>
  spark.sql(row.mkString).write.mode("overwrite").saveAsTable("newtbl" + i)
}
//select from the new tables.
spark.sql("select * from newtbl0").show
spark.sql("select * from newtbl1").show
Output:
[select Category from table1]
[select date from table2]
+--------+
|Category|
+--------+
| A|
| AXYZ|
| AZYX|
| AXYZ|
| AZYX|
| AB|
| A|
+--------+
+----------+
| date|
+----------+
|01/01/2022|
|02/01/2022|
|03/01/2022|
+----------+
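For the asker's original JDBC/Oracle setup, the same idea can be sketched in PySpark: collect the stored query strings from source_df on the driver and feed each one back to the JDBC reader through its query option (available in Spark 2.4+). This is only a sketch; the column name src_query and the connection variables are taken from the question, and it again assumes the number of stored queries is small enough to collect.
# collect the stored query strings to the driver (assumes few rows)
queries = [row["src_query"] for row in source_df.select("src_query").collect()]

result_dfs = []
for q in queries:
    # strip a trailing semicolon, which the JDBC "query" option does not accept
    q = q.strip().rstrip(";")
    df = (spark.read.format("jdbc")
          .option("url", Oracle_jdbc_url)
          .option("query", q)
          .option("user", Oracle_Username)
          .option("password", Oracle_Password)
          .load())
    result_dfs.append(df)

# result_dfs[0] now holds the result of the first stored query
result_dfs[0].show()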

Related

Spark.sql not able to read Japanese (multibyte character) from Hive table?

I am writing Japanese characters into a Hive table as part of one of my programs. Later, when I select that field from Hive I can read it, but when I read it from spark.sql it does not give me the expected result.
spark.sql("select SQL_VAL as sql_val from abc.ac_tbl where d_name='2019-07-09_14:26:16.486' ").show()
+-------+
|sql_val|
+-------+
| ?|
+-------+
When the same table is queried from Hive, it gives this output:
select SQL_VAL as sql_val from abc.ac_tbl where d_name='2019-07-09_14:26:16.486'
sql_val
文

Hive table is read multiple times when using spark.sql and union

I have a single Hive table that is used in multiple subsequent spark.sql queries.
Each stage shows a HiveTableScan, which is not necessary as the table only needs to be read once.
How can I avoid this?
Here is a simplified example that replicates the problem
Create an example table:-
spark.sql("CREATE DATABASE IF NOT EXISTS default")
spark.sql("DROP TABLE IF EXISTS default.data")
spark.sql("CREATE TABLE IF NOT EXISTS default.data(value INT)")
spark.sql("INSERT OVERWRITE TABLE default.data VALUES(1)")
Run multiple queries that build on the previous dataframe:-
query1 = spark.sql("select value from default.data")
query1.createOrReplaceTempView("query1")
query2 = spark.sql("select max(value)+1 as value from query1").union(query1)
query2.createOrReplaceTempView("query2")
query3 = spark.sql("select max(value)+1 as value from query2").union(query2)
query3.createOrReplaceTempView("query3")
spark.sql("select value from query3").show()
Expected output is:-
+-----+
|value|
+-----+
|    3|
|    2|
|    1|
+-----+
EDITED
You can use cacheTable(tableName).
try this:
query1 = spark.sql("select value from default.data")
query1.createOrReplaceTempView("query1")
spark.sqlContext().cacheTable("query1")
query2 = spark.sql("select max(value)+1 as value from query1").union(query1)
query2.createOrReplaceTempView("query2")
spark.sqlContext().cacheTable("query2")
query3 = spark.sql("select max(value)+1 as value from query2").union(query2)
query3.createOrReplaceTempView("query3")
spark.sqlContext().cacheTable("query3")
spark.sql("select value from query3").show()
Using this function, Spark SQL will cache your tables in an in-memory columnar format to minimize memory usage.
Then you can uncache the tables using uncacheTable() as below:
spark.catalog.uncacheTable("query1")
spark.catalog.uncacheTable("query2")
spark.catalog.uncacheTable("query3")
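An alternative sketch of the same idea: instead of caching each intermediate view, cache the base table (or its DataFrame) once so the later unions read from memory rather than triggering another HiveTableScan. This is untested PySpark, assuming the same default.data table from the example above.
# cache the base table once, then build on the DataFrame
base = spark.table("default.data").cache()
base.count()  # materialize the cache so later stages read from memory

query1 = base
query2 = query1.selectExpr("max(value) + 1 as value").union(query1)
query3 = query2.selectExpr("max(value) + 1 as value").union(query2)
query3.show()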

SparkSQL Column Query not showing column contents?

I have created a persistent table via df.saveAsTable
When I run the following query I receive these results
spark.sql("""SELECT * FROM mytable """).show()
I get a view of the DataFrame with all of its columns and all of the data.
However when I run
spark.sql("""SELECT 'NameDisplay' FROM mytable """).show()
I receive results that look like this
| NameDisplay|
|--|
| NameDisplay |
| NameDisplay |
| NameDisplay |
| NameDisplay |
| NameDisplay |
| NameDisplay |
NameDisplay is definitely one of the columns in the table as it's shown when I run select * - how come this is not shown in the second query?
The issue was using quotes on the column name; it needs to be escaped with backticks: `NameDisplay`.
Selecting 'NameDisplay' in SQL selects the literal text "NameDisplay", so the results you got are in fact valid.
To select values of the "NameDisplay" column, then you must issue:
"SELECT NameDisplay FROM mytable "
Or, if you need to quote it (maybe in case the column was created like this or has spaces, or is case-sensitive):
"""SELECT `NameDisplay` FROM mytable"""
This is SQL syntax, nothing specific to Spark.
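A quick PySpark illustration of the difference, assuming a hypothetical mytable with a NameDisplay column:
# single quotes select a string literal: the same constant value on every row
spark.sql("SELECT 'NameDisplay' FROM mytable").show()

# no quotes (or backticks) select the actual column values
spark.sql("SELECT NameDisplay FROM mytable").show()
spark.sql("SELECT `NameDisplay` FROM mytable").show()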

Add Column in Apache Cassandra

How can I check in node.js that a column does not exist in Apache Cassandra?
I need to add a column only if it does not already exist.
I have read that I must run a select first, but if I select a column that does not exist, it returns an error.
Note that if you're on Cassandra 3.x and up, you'll want to query from the columns table on the system_schema keyspace:
aploetz#cqlsh:system_schema> SELECT * FROm system_schema.columns
WHERE keyspace_name='stackoverflow'
AND table_name='vehicle_information'
AND column_name='name';
keyspace_name | table_name | column_name | clustering_order | column_name_bytes | kind | position | type
---------------+---------------------+-------------+------------------+-------------------+---------+----------+------
stackoverflow | vehicle_information | name | none | 0x6e616d65 | regular | -1 | text
(1 rows)
You can check for a column's existence with a select query on the system.schema_columns table.
Suppose you have the table test_table in keyspace test and want to check whether the column test_column exists.
Use the query below:
SELECT * FROM system.schema_columns WHERE keyspace_name = 'test' AND columnfamily_name = 'test_table' AND column_name = 'test_column';
If the above query returns a row, the column exists; otherwise it does not.
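A minimal sketch of the check-then-add flow, written here with the Python cassandra-driver purely to illustrate the logic (the same two statements can be issued from the Node.js driver). The keyspace, table, and column names are the hypothetical ones from the answer above, and the system_schema.columns table assumes Cassandra 3.x.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# look the column up in the schema tables (system_schema on Cassandra 3.x+)
rows = session.execute(
    "SELECT column_name FROM system_schema.columns "
    "WHERE keyspace_name = %s AND table_name = %s AND column_name = %s",
    ("test", "test_table", "test_column"),
)

# only add the column when the lookup came back empty
if not rows.one():
    session.execute("ALTER TABLE test.test_table ADD test_column text")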

Using Dataframe instead of spark sql for data analysis

Below is the sample Spark SQL I wrote to get the count of males and females enrolled in an agency. I used SQL to generate the output.
Is there a way to do the same thing using only the DataFrame API, without SQL?
val districtWiseGenderCountDF = hiveContext.sql("""
| SELECT District,
| count(CASE WHEN Gender='M' THEN 1 END) as male_count,
| count(CASE WHEN Gender='F' THEN 1 END) as FEMALE_count
| FROM agency_enrollment
| GROUP BY District
| ORDER BY male_count DESC, FEMALE_count DESC
| LIMIT 10""".stripMargin)
Starting with Spark 1.6 you can use pivot + groupBy to achieve what you'd like.
Without sample data (and without a Spark > 1.5 environment at hand), here is a solution that should work (not tested):
val df = hiveContext.table("agency_enrollment")
df.groupBy("district","gender").pivot("gender").count
see How to pivot DataFrame? for a generic example
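If you want a result with exactly the same column names as the original SQL (male_count, FEMALE_count) rather than pivoted gender columns, a DataFrame-only sketch of the conditional counts looks like this; it is written in PySpark syntax and assumes a table agency_enrollment with District and Gender columns, as in the question.
from pyspark.sql import functions as F

df = spark.table("agency_enrollment")

# count(CASE WHEN ...) becomes count(when(...)): count ignores the nulls
# produced by when() for non-matching rows
district_wise = (
    df.groupBy("District")
      .agg(
          F.count(F.when(F.col("Gender") == "M", 1)).alias("male_count"),
          F.count(F.when(F.col("Gender") == "F", 1)).alias("FEMALE_count"),
      )
      .orderBy(F.desc("male_count"), F.desc("FEMALE_count"))
      .limit(10)
)
district_wise.show()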
