Using DataFrame instead of Spark SQL for data analysis - apache-spark

Below is a sample Spark SQL query I wrote to get the count of males and females enrolled in an agency. I used SQL to generate the output.
Is there a way to do the same thing using only the DataFrame API, not SQL?
val districtWiseGenderCountDF = hiveContext.sql("""
| SELECT District,
| count(CASE WHEN Gender='M' THEN 1 END) as male_count,
| count(CASE WHEN Gender='F' THEN 1 END) as FEMALE_count
| FROM agency_enrollment
| GROUP BY District
| ORDER BY male_count DESC, FEMALE_count DESC
| LIMIT 10""".stripMargin)

Starting with Spark 1.6 you can use pivot + group by to achieve what you'd like.
Without sample data (and without a Spark > 1.5 environment at hand), here's a solution that should work (not tested):
val df = hiveContext.table("agency_enrollment")
df.groupBy("District").pivot("Gender").count()
See How to pivot DataFrame? for a generic example.
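For completeness, here is an untested sketch that reproduces the full query, including the conditional counts, ordering and limit, using DataFrame operations only. It assumes the same District and Gender column names as the SQL and does not even need pivot, so it should also work before Spark 1.6:
import org.apache.spark.sql.functions.{col, count, desc, when}

val df = hiveContext.table("agency_enrollment")

val districtWiseGenderCountDF = df
  .groupBy("District")
  .agg(
    count(when(col("Gender") === "M", 1)).as("male_count"),   // count(CASE WHEN Gender='M' THEN 1 END)
    count(when(col("Gender") === "F", 1)).as("female_count")) // count(CASE WHEN Gender='F' THEN 1 END)
  .orderBy(desc("male_count"), desc("female_count"))
  .limit(10)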

Related

Salting Technique to tackle Skew in Spark SQL

I am trying to understand salting techniques to tackle skew in Spark SQL. I have done some reading online and have come up with a very rudimentary implementation in the Spark SQL API.
Let's assume that table1 is Skewed on cid=1:
Table 1:
cid | item
---------
1 | light
1 | cookie
1 | ketchup
1 | bottle
2 | dish
3 | cup
As shown above, cid=1 occurs more often than the other keys.
Table 2:
cid | vehicle
---------
1 | taxi
1 | truck
2 | cycle
3 | plane
Now my code looks like the following:
create temporary view table1_salt as
select
cid, item, concat(cid, '-', floor(rand() * 20)) as salted_key
from table1;
create temporary view table2_salt as
select
cid, vehicle, explode(array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key
from table2;
Final Query:
select a.cid, a.item, b.vehicle
from table1_salt a
inner join table2_salt b
on a.salted_key = concat(b.cid, '-', b.salted_key);
In the above example, I have used 20 salts/splits.
Questions:
1. Is there any rule of thumb for choosing the optimal number of splits? For example, if table1 has 10 million records, how many bins/buckets should I use? (In this simple test example I have used 20.)
2. As shown above, when creating table2_salt I am hardcoding the salts (0, 1, 2, ... through 19). Is there a better way to implement the same functionality without the hardcoding and the clutter? (What if I want to use 100 splits?) A sketch of one alternative follows after this list.
3. Since we are replicating the second table (table2) N times, doesn't that degrade the join performance?
Note: I need to use the Spark 2.4 SQL API only.
Also, kindly let me know if there are any advanced examples available on the net. Any help is appreciated.
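A minimal sketch (untested) of the same salting pattern without the hardcoded array, assuming Spark 2.4: the sequence() function generates the salt values, so the number of splits becomes a single parameter. numSalts is a hypothetical tuning knob, the table and column names follow the example above, and the Scala interpolation is only there because plain Spark SQL has no variables.
// numSalts controls how many splits the skewed key is spread over.
val numSalts = 20

spark.sql(s"""
  CREATE OR REPLACE TEMPORARY VIEW table1_salt AS
  SELECT cid, item, concat(cid, '-', floor(rand() * $numSalts)) AS salted_key
  FROM table1
""")

spark.sql(s"""
  CREATE OR REPLACE TEMPORARY VIEW table2_salt AS
  SELECT cid, vehicle, explode(sequence(0, ${numSalts - 1})) AS salt
  FROM table2
""")

val joined = spark.sql("""
  SELECT a.cid, a.item, b.vehicle
  FROM table1_salt a
  INNER JOIN table2_salt b
    ON a.salted_key = concat(b.cid, '-', b.salt)
""")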

Optimizing Theta Joins in Spark SQL

I have just two tables. I need to get the records from the first table (a big table, 10M rows) whose transaction date is less than or equal to the effective date present in the second table (a small table with 1 row); this result set will then be consumed by downstream queries.
Table Transact:
tran_id | cust_id | tran_amt | tran_dt
1234 | XYZ | 12.55 | 10/01/2020
5678 | MNP | 25.99 | 25/02/2020
5561 | XYZ | 32.45 | 30/04/2020
9812 | STR | 10.32 | 15/08/2020
Table REF:
eff_dt |
30/07/2020 |
Hence, as per the logic, I should get back the first 3 rows and discard the last record, since its date is greater than the reference date (present in the REF table).
To do this I have used a non-equi Cartesian join between these tables:
select
/*+ MAPJOIN(b) */
a.tran_id,
a.cust_id,
a.tran_amt,
a.tran_dt
from transact a
inner join ref b
on a.tran_dt <= b.eff_dt
However, this SQL is taking forever to complete due to the cross join with the transact table, even when using broadcast hints.
So is there any smarter way to implement the same logic that will be more efficient? In other words, is it possible to optimize the theta join in this query?
Thanks in advance.
Referring to https://databricks.com/session/optimizing-apache-spark-sql-joins: can you try bucketing on tran_dt (bucketed on year/month only) and writing two queries to do the same work?
First query: tran_dt (year/month) < eff_dt (year/month). This lets Spark actively pick up whole buckets earlier than 2020/07 rather than checking each and every record's tran_dt.
Second query: tran_dt (year/month) = eff_dt (year/month) and tran_dt (day) <= eff_dt (day).
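A rough sketch (untested) of that two-query split, assuming tran_dt and eff_dt are date-typed and keeping the table and column names from the question; the union at the end is an assumption about how the two result sets would be combined.
// Query 1: whole months strictly before the effective month; yyyyMM arithmetic keeps the comparison simple.
val part1 = spark.sql("""
  SELECT /*+ BROADCAST(b) */ a.tran_id, a.cust_id, a.tran_amt, a.tran_dt
  FROM transact a
  JOIN ref b
    ON year(a.tran_dt) * 100 + month(a.tran_dt) < year(b.eff_dt) * 100 + month(b.eff_dt)
""")

// Query 2: the effective month itself, filtered down to the day.
val part2 = spark.sql("""
  SELECT /*+ BROADCAST(b) */ a.tran_id, a.cust_id, a.tran_amt, a.tran_dt
  FROM transact a
  JOIN ref b
    ON year(a.tran_dt) = year(b.eff_dt)
   AND month(a.tran_dt) = month(b.eff_dt)
   AND day(a.tran_dt) <= day(b.eff_dt)
""")

val result = part1.union(part2)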

Cross Join for calculation in Spark SQL

I have a temporary view with only 1 record/value, and I want to use that value to calculate the age of the customers present in another big table (with 100M rows). I used a CROSS JOIN clause, which is causing a performance issue.
Is there a better approach to implement this requirement that will perform better? Would a broadcast hint be suitable in this scenario? What is the recommended approach to tackle such scenarios?
Reference table: (contains only 1 value)
create temporary view ref
as
select to_date(refdt, 'dd-MM-yyyy') as refdt --returns only 1 value
from tableA
where logtype = 'A';
Cust table (10 M rows):
custid | birthdt
A1234 | 20-03-1980
B3456 | 09-05-1985
C2356 | 15-12-1990
Query (calculate age w.r.t birthdt):
select
a.custid,
a.birthdt,
cast((datediff(b.refdt, a.birthdt)/365.25) as int) as age
from cust a
cross join ref b;
My question is: is there a better approach to implement this requirement?
Thanks
Simply use withColumn!
df.withColumn("new_col", to_date(lit("10-05-2020"), "dd-MM-yyyy"))
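For instance, a minimal sketch of that withColumn route applied to the age calculation, assuming the constant reference date '10-05-2020' and a date-typed birthdt column; custDF is an illustrative name for the cust table loaded as a DataFrame.
import org.apache.spark.sql.functions.{col, datediff, lit, to_date}

val custDF = spark.table("cust")

val withAge = custDF
  .withColumn("ref_dt", to_date(lit("10-05-2020"), "dd-MM-yyyy"))                    // the single reference value
  .withColumn("age", (datediff(col("ref_dt"), col("birthdt")) / 365.25).cast("int")) // same formula as the SQL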
Alternatively, since inside the view you are using a constant value, you can simply put the same value in the query below without the cross join:
select
a.custid,
a.birthdt,
cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age
from cust a;
scala> spark.sql("select * from cust").show(false)
+------+----------+
|custid|birthdt |
+------+----------+
|A1234 |1980-03-20|
|B3456 |1985-05-09|
|C2356 |1990-12-15|
+------+----------+
scala> spark.sql("select a.custid, a.birthdt, cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age from cust a").show(false)
+------+----------+---+
|custid|birthdt |age|
+------+----------+---+
|A1234 |1980-03-20|40 |
|B3456 |1985-05-09|35 |
|C2356 |1990-12-15|29 |
+------+----------+---+
It is hard to work out exactly what you need, but if you cannot use Scala or pyspark and DataFrames with .cache etc., then instead of using a temporary view, just create a single-row table. My impression is that you are using Spark %sql in a notebook on, say, Databricks; that is my suspicion, as it were.
That said, a broadcast join hint may well mean the optimizer only sends out 1 row. See https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-hint-framework.html#specifying-query-hints
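For reference, a hedged sketch of that broadcast-hinted variant, keeping the ref view and cust table from the question; the hint syntax assumes Spark 2.2 or later.
// The BROADCAST hint ships the single-row ref view to every executor, so the
// cross join is effectively a map-side lookup rather than a shuffle.
val withAge = spark.sql("""
  SELECT /*+ BROADCAST(b) */
         a.custid,
         a.birthdt,
         CAST(datediff(b.refdt, a.birthdt) / 365.25 AS INT) AS age
  FROM cust a
  CROSS JOIN ref b
""")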

Getting latest date in a partition by year / month / day using SparkSQL

I am trying to incrementally transform new partitions in a source table into a new table using Spark SQL. The data in both the source and target are partitioned as follows: /data/year=YYYY/month=MM/day=DD/. I was initially just going to select the MAX of year, month and day to get the newest partition, but that is clearly wrong. Is there a good way to do this?
If I construct a date and take the max, like MAX(CONCAT(year, '-', month, '-', day)::date), that would be quite inefficient, right? Because it would need to scan all the data to pull the newest partition.
Try the following to get the latest partition without reading any data at all, only metadata:
spark.sql("show partitions <table>").agg(max('partition)).show
You can use the result of SHOW PARTITIONS, which is more efficient because it only hits the metastore. However, you can't just apply a max to the raw value there; we need to construct the date first and then take the max.
Here's a sample:
from pyspark.sql import functions as F

# SHOW PARTITIONS only touches the metastore, not the data files.
df = sqlContext.sql("show partitions <table>")
df.show(10, False)

# Strip the "year=", "month=", "day=" labels, turn the slashes into dashes,
# and parse the result as a date so max() compares real dates, not strings.
date = F.to_date(F.regexp_replace(F.regexp_replace("partition", "[a-z=]", ""), "/", "-"))
df.select(F.max(date).alias("max_date")).show()
Input Values:
+------------------------+
|partition |
+------------------------+
|year=2019/month=11/day=5|
|year=2019/month=9/day=5 |
+------------------------+
Result:
+----------+
| max_date|
+----------+
|2019-11-05|
+----------+
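As a follow-up, a hedged Scala sketch of using that maximum date to read back only the newest partition; source_table is a hypothetical table name, and the 'yyyy-M-d' parse format is an assumption made to tolerate the non-zero-padded month/day values shown above.
import org.apache.spark.sql.functions.{col, max, regexp_replace, to_date}

// Turn each "year=YYYY/month=M/day=D" string into a real date, then take the max.
val maxDate = spark.sql("SHOW PARTITIONS source_table")
  .select(to_date(
    regexp_replace(regexp_replace(col("partition"), "[a-z=]", ""), "/", "-"),
    "yyyy-M-d").as("part_date"))
  .agg(max(col("part_date")))
  .first()
  .getDate(0)

// year/month/day are partition columns, so this filter prunes down to the newest partition.
val latestPartitionDF = spark.table("source_table")
  .where(col("year") === maxDate.toLocalDate.getYear
    && col("month") === maxDate.toLocalDate.getMonthValue
    && col("day") === maxDate.toLocalDate.getDayOfMonth)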

SparkSQL/Hive: equivalent of MySQL's `information_schema.table.{data_length, table_rows}`?

In MySQL, we can query the table information_schema.tables and obtain useful information such as data_length or table_rows:
select
data_length
, table_rows
from
information_schema.tables
where
table_schema='some_db'
and table_name='some_table';
+-------------+------------+
| data_length | table_rows |
+-------------+------------+
| 8368 | 198 |
+-------------+------------+
1 row in set (0.01 sec)
Is there an equivalent mechanism for SparkSQL/Hive?
I am okay with using SparkSQL or a programmatic API like HiveMetaStoreClient (Java API org.apache.hadoop.hive.metastore.HiveMetaStoreClient). For the latter, I read the API doc (here) and could not find any method related to table row counts and sizes.
There is no single command for this meta-information. Rather, there is a set of commands you may use:
Describe Table/View/Column
desc [formatted|extended] schema_name.table_name;
show table extended like part_table;
SHOW TBLPROPERTIES tblname("foo");
Display Column Statistics (Hive 0.14.0 and later)
DESCRIBE FORMATTED [db_name.]table_name column_name;
DESCRIBE FORMATTED [db_name.]table_name column_name PARTITION (partition_spec);
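If Spark SQL is acceptable, here is a hedged sketch of how row count and size can usually be surfaced; the statistics must be gathered first with ANALYZE TABLE or they may be missing or stale, the exact property names depend on the table format, and some_db.some_table simply mirrors the MySQL example.
// Gather table-level statistics (row count, size in bytes) into the metastore.
spark.sql("ANALYZE TABLE some_db.some_table COMPUTE STATISTICS")

// For Hive tables the gathered numbers typically appear among the table
// parameters (e.g. numRows, totalSize) in the detailed table information.
spark.sql("DESCRIBE FORMATTED some_db.some_table").show(100, truncate = false)

// The same parameters can also be listed directly.
spark.sql("SHOW TBLPROPERTIES some_db.some_table").show(100, truncate = false)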
