Spark SQL - best way to programmatically loop over a table - apache-spark

Say I have the following spark dataframe:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 2 | 1 |
| 3 | 1 |
| 4 | NULL |
| 5 | 4 |
| 6 | NULL |
| 7 | 6 |
| 8 | 3 |
This dataframe represents a tree structure consisting of several disjoint trees. Now, say that we have a list of nodes [8, 7], and we want to get a dataframe containing just the nodes that are roots of the trees containing the nodes in the list.The output looks like:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 6 | NULL |
What would be the best (fastest) way to do this with spark queries and pyspark?
If I were doing this in plain SQL I would just do something like this:
CREATE TABLE #Tmp
Node_id int,
Parent_id int
INSERT INTO #Tmp Child_Nodes
SELECT #num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL
WHILE #num > 0
INSERT INTO #Tmp (
SELECT
p.Node_id
p.Parent_id
FROM
#Tmp t
LEFT-JOIN Nodes p
ON t.Parent_id = p.Node_id)
SELECT #num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL
END
SELECT Node_id FROM #Tmp WHERE Parent_id IS NULL
Just wanted to know if there's a more spark-centric way of doing this using pyspark, beyond the obvious method of simply looping over the dataframe using python.

parent_nodes = spark.sql("select Parent_id from table_name where Node_id in [2,7]").distinct()
You can join the above dataframe with the table to get the Parent_id of those nodes as well.

Related

Show create table on a Hive Table in Spark SQL - Treats CHAR, VARCHAR as STRING

I have a need to generate DDL statements for Hive tables & views programmatically. I tried using Spark and Beeline for this task. Beeline takes around 5-10 seconds for each of the statements whereas Spark completes the same thing in a few milliseconds. I am planning to use Spark since it is faster compared to beeline. One downside of using spark for getting DDL statements from the hive is, it treats CHAR, VARCHAR characters as String and it doesn't preserve the length information that goes with CHAR,VARCHAR data types. At the same time beeline preserves the data type and the length information for CHAR,VARCHAR data types. I am using Spark 2.4.1 and Beeline 2.1.1.
Given below the sample create table command and its show create table output.
Beeline Output:
Spark-Shell:
I wanted to know if there is any configuration on the Spark side to preserve the data type and length information for CHAR,VARCHAR data types. If there are other ways to get DDL from Hive quickly, I will be fine with that also.
This is in
Hive 3.1.1
Spark 3.1.1
Your stack overflow issue raised and I quote:
"I have a need to generate DDL statements for Hive tables & views programmatically. I tried using Spark and Beeline for this task. Beeline takes around 5-10 seconds for each of the statements whereas Spark completes the same thing in a few milliseconds. I am planning to use Spark since it is faster compared to beeline. One downside of using spark for getting DDL statements from the hive is, it treats CHAR, VARCHAR characters as String and it doesn't preserve the length information that goes with CHAR,VARCHAR data types. At the same time beeline preserves the data type and the length information for CHAR,VARCHAR data types. I am using Spark 2.4.1 and Beeline 2.1.1. Given below the sample create table command and its show create table output."
Create a simple table in Hive in test database
hive> use test;
OK
hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
OK
hive> desc formatted etc;
# col_name data_type comment
id bigint
col1 varchar(30)
col2 string
# Detailed Table Information
Database: test
OwnerType: USER
Owner: hduser
CreateTime: Fri Mar 11 18:29:34 GMT 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1647023374
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Now let's go to spark-shell
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647023374')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
You can see Spark shows columns correctly
Now let us go and create the same table in hive through beeline
0: jdbc:hive2://rhes75:10099/default> use test
No rows affected (0.019 seconds)
0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)
. . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)
0: jdbc:hive2://rhes75:10099/default> desc formatted etc
. . . . . . . . . . . . . . . . . . > +-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| id | bigint | |
| col1 | varchar(30) | |
| col2 | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | test | NULL |
| OwnerType: | USER | NULL |
| Owner: | hduser | NULL |
| CreateTime: | Fri Mar 11 18:51:00 GMT 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
| | bucketing_version | 2 |
| | numFiles | 0 |
| | numRows | 0 |
| | rawDataSize | 0 |
| | totalSize | 0 |
| | transient_lastDdlTime | 1647024660 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
33 rows selected (0.159 seconds)
Now check that in spark-shell again
scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
`id` BIGINT,
`col1` VARCHAR(30),
`col2` STRING)
USING text
TBLPROPERTIES (
'bucketing_version' = '2',
'transient_lastDdlTime' = '1647024660')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
It shows OK. So in summary you get column definitions in Spark as you have defined them in Hive.
In your statement above and I quote "I am using Spark 2.4.1 and Beeline 2.1.1", refers to older versions of Spark and hive which may have had such issues.

Using group_by where with Spark SQL

Hello I have created a Spark Dataframe reading from a parquet file which looks like this
+-------+-----+-----+-----+-----+
| col-a|col-b|col-c|col-d|col-e|
+-------+-----+-----+-----+-----+
| .....|.....|.....|.....|.....|
+-------+-----+-----+-----+-----+
I want to do a group by using col-a and col-b and then find out how many groups have more than 1 unique row. For example, if we have rows
a | b | x | y | z |
a | b | x | y | z |
c | b | m | n | p |
c | b | m | o | r |
I want to find out where count > 1 from
col-a | col-b | count |
a | b | 1 |
c | b | 2 |
I wrote this query
"SELECT `col-a`, `col-b`, count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count \
FROM df GROUP BY `col-a`, `col-b` WHERE count > 1"
This actually generates an error mismatched input 'WHERE' expecting {<EOF>, ';'}. Can anyone help me what I am doing wrong here?
I am currently doing it like
df_2 = spark.sql("SELECT `col-a`, `col-b`, count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count FROM df GROUP BY `col-a`, `col-b`")
df_2.createOrReplaceTempView("df_2")
spark.sql("SELECT * FROM df_2 WHERE `count` > 1").show()
Filtering after a group by requires a having clause. where has to be before group by.
It might work like this. You may also be able to refer to the complex column by index (not sure if that works in spark). Would keep it cleaner.
SELECT
`col-a`,
`col-b`,
count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count \
FROM df
GROUP BY `col-a`, `col-b`
HAVING count(DISTINCT (`col-c`, `col-d`, `col-e`)) > 1

How to combine two columns into one in Sqlite and also get the underlying value of the Foreign Key?

I want to be able to combine two columns from a table into one column then to to be able to get the actual value of the foreign keys. I can do these things individually but not together.
Following the answer below I was able to combine the two columns into one using the first sql statement below.
How to combine 2 columns into a new one in sqlite
The combining process is shown below:
+---+---+
|HT | AT|
+---+---+
|1 | 2 |
|5 | 7 |
|9 | 5 |
+---+---+
into one column as shown:
+---+
|HT |
+---+
| 1 |
| 5 |
| 9 |
| 2 |
| 7 |
| 5 |
+---+
The second SQL statement show's the actual value of each foreign key corresponding to each foreign key id. The Foreign Key Table.
+-----+------------------------+
|T_id | TN |
+-----+------------------------+
| 1 | 'Dallas Cowboys |
| 2 | 'Chicago Bears' |
| 5 | 'New England Patriots' |
| 7 | 'New York Giants' |
| 9 | 'New York Jets' |
+-----+------------------------+
sql = "SELECT * FROM (SELECT M.HT FROM M UNION SELECT M.AT FROM Match)t"
The second sql statement lets me get the foreign key values for each value in M.HT.
sql = "SELECT M.HT, T.TN FROM M INNER JOIN T ON M.HT = T.Tid WHERE strftime('%Y-%m-%d', M.ST) BETWEEN \'2015-08-01\' AND \'2016-06-30\' AND M.Comp = 6 ORDER BY M.ST"
Result of second SQL statement:
+-----+------------------------+
| HT | TN |
+-----+------------------------+
| 1 | 'Dallas Cowboys |
| 5 | 'New England Patriots' |
| 9 | 'New York Jets' |
+-----+------------------------+
But try as I might I have not been able to combine these queries!
I believe the following will work (assuming that the tables are Match and T and baring the WHERE and ORDER BY clauses for brevity/ease) :-
SELECT DISTINCT(m.ht), t.tn
FROM
(SELECT Match.HT FROM Match UNION SELECT Match.AT FROM Match) AS m
JOIN T ON t.tid = m.ht
JOIN Match ON (m.ht = Match.ht OR m.ht = Match.at)
/* WHERE and ORDER BY clauses using Match as m only has columns ht and at */
WHERE strftime('%Y-%m-%d', Match.ST)
BETWEEN \'2015-08-01\' AND \'2016-06-30\' AND Match.Comp = 6
ORDER BY Match.ST
;
Note only tested without the WHERE and ORDER BY clause.
That is using :-
DROP TABLE IF EXISTS Match;
DROP TABLE IF EXISTS T;
CREATE TABLE IF NOT EXISTS Match (ht INTEGER, at INTEGER, st TEXT DEFAULT (datetime('now')));
CREATE TABLE IF NOT EXISTS t (tid INTEGER PRIMARY KEY, tn TEXT);
INSERT INTO T (tn) VALUES('Cows'),('Bears'),('a'),('b'),('Pats'),('c'),('Giants'),('d'),('Jets');
INSERT INTO Match (ht,at) VALUES (1,2),(5,7),(9,5);
/* Directly without the Common Table Expression */
SELECT
DISTINCT(m.ht), t.tn,
Match.st /*<<<<< Added to show results of obtaining other values from Matches >>>>> */
FROM
(SELECT Match.HT FROM Match UNION SELECT Match.AT FROM Match) AS m
JOIN T ON t.tid = m.ht
JOIN Match ON (m.ht = Match.ht OR m.ht = Match.at)
/* WHERE and ORDER BY clauses here using Match */
;
Noting that limited data (just the one extra column) was used for brevity
Results in :-

How to retrieve data from cassandra using IN Query with given order?

I'm selecting data from a Cassandra database using a query. It is working fine but how to get the data in same order as I have given IN query?
I have created table like this:
id | n | p | q
----+---+---+------
5 | 1 | 2 | 4
10 | 2 | 4 | 3
11 | 1 | 2 | null
I am trying to select data using
SELECT *
FROM malleshdmy
WHERE id IN ( 11,10,5)
But, It producing same data as like stored.
id | n | p | q
----+---+---+------
5 | 1 | 2 | 4
10 | 2 | 4 | 3
11 | 1 | 2 | null
Please help me in this issue.
I want data as 11,10 and 5
If the id is partition key, then it's impossible - data are sorted only inside the clustering columns, and data for different partition keys could be returned in arbitrary order (but sorted inside that partition).
You need to sort data yourself.
Since id is your partition key, your data is actually being sorted by the token of id, not the values themselves:
cqlsh:testid> SELECT id,n,p,q,token(id) FROM table;
id | n | p | q | system.token(id)
----+---+---+------+----------------------
5 | 1 | 2 | 4 | -7509452495886106294
10 | 2 | 4 | 3 | -6715243485458697746
11 | 1 | 2 | null | -4156302194539278891
Because of this, you don't have any control over how the partition key is sorted.
In order to sort your data by id, you need to make id a clustering column rather than a partition key. Your data will still need a partition key, however, and this will always be sorted by token.
If you decide to make id a clustering column, you will need to specify that you want a descending order in your order by statement
CREATE TABLE clusterTable (
... partition type, //partition key with a type to be specified
... id INT,
... n INT,
... p INT,
... q INT,
... PRIMARY KEY((partition),id))
... WITH CLUSTERING ORDER BY (id DESC);
This link is very helpful in discussing how ordering works in Cassandra: https://www.datastax.com/dev/blog/we-shall-have-order

How can I overwrite in a Spark DataFrame null entries with other valid entries from the same dataframe?

I have a Spark DataFrame with data like this
| id | value1 |value2 |
------------------------
| 1 | null | 1 |
| 1 | 2 | null |
And want to transform it
into
| id | value1 |value2 |
-----------------------
| 1 | 2 | 1 |
That is, I need to get the rows with the same id and merge their values in a single row.
Could you explain me what is the most scalable way to do this?
df.groupBy(“id”).agg(collect_set(“value1”).alias(“value1”),collect_set(“value2”).alias(“value2”))
//more elegant way of doing for dynamic columns
df.groupBy(“id”).agg(df.columns.tail.map((_ -> “collect_set”)).toMap).show
//1.5
Val df1=df.rdd.map(i=>(i(0).toString,i(1).toString)).groupByKey.mapValues(_.toSet.toList.filter(_!=“null”)).toDF()
Val df2 = df.rdd.map(i=>(i(0).toString,i(2).toString)).groupByKey.mapValues(_.toSet.toList.filter(_!=“null”)).toDF()
df1.join(df2,df1(“_1”) === df2(“_1”),”inner”).drop(df2(“_1”)).show

Resources