How to implement "Cross Join" in Spark? - apache-spark

We plan to move Apache Pig code to the new Spark platform.
Pig has a "Bag/Tuple/Field" concept and behaves similarly to a relational database. Pig provides support for CROSS/INNER/OUTER joins.
For CROSS JOIN, we can use alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
But as we move to the Spark platform, I couldn't find a counterpart in the Spark API. Does anyone have any ideas?

For RDDs, it is oneRDD.cartesian(anotherRDD).
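For example, a minimal spark-shell sketch (the RDD names and contents are just placeholders):
val oneRDD = sc.parallelize(Seq(1, 2, 3))
val anotherRDD = sc.parallelize(Seq("a", "b"))
// cartesian pairs every element of the first RDD with every element of the second
val crossed = oneRDD.cartesian(anotherRDD)   // RDD[(Int, String)]
crossed.collect()
// 3 x 2 = 6 pairs, e.g. Array((1,a), (1,b), (2,a), (2,b), (3,a), (3,b))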

Here is the recommended version for Spark 2.x Datasets and DataFrames:
scala> val ds1 = spark.range(10)
ds1: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> ds1.cache.count
res1: Long = 10
scala> val ds2 = spark.range(10)
ds2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> ds2.cache.count
res2: Long = 10
scala> val crossDS1DS2 = ds1.crossJoin(ds2)
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint]
scala> crossDS1DS2.count
res3: Long = 100
Alternatively, it is possible to use the traditional join syntax with no join condition. Set the configuration option below to avoid the error that follows.
spark.conf.set("spark.sql.crossJoin.enabled", true)
Error when that configuration is omitted (using the "join" syntax specifically):
scala> val crossDS1DS2 = ds1.join(ds2)
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint]
scala> crossDS1DS2.count
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
...
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
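As the message suggests, the cartesian product can also be written explicitly as a SQL CROSS JOIN, which should not need the configuration flag. A quick sketch reusing ds1 and ds2 from above (the view names are arbitrary):
ds1.createOrReplaceTempView("t1")
ds2.createOrReplaceTempView("t2")
spark.sql("SELECT * FROM t1 CROSS JOIN t2").count   // 100 rows (10 x 10)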
Related: spark.sql.crossJoin.enabled for Spark 2.x

Different spark-sql results for data field type timestamp

Why am I getting different count results when I use the 'T' separator for a timestamp field in Spark SQL?
FYI: I am reading the data from Cassandra tables using DSE Spark.
DataStax version: DSE 5.1.3
Apache Cassandra™ 3.11.0.1855
Apache Spark™ 2.0.2.6
DataStax Spark Cassandra Connector 2.0.5
scala> val data = spark.sql("select * from pramod.history ").where(col("sent_date") >= "2024-06-11 00:00:00.000Z" && col("sent_date") <= "2027-11-15 00:00:00.000Z")
data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [tx_id: string, agreement_number: string ... 37 more fields]
scala> data.count()
res21: Long = 181466
scala> val data = spark.sql("select * from pramod.history ").where(col("sent_date") >= "2024-06-11T00:00:00.000Z" && col("sent_date") <= "2027-11-15T00:00:00.000Z")
data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [tx_id: string, agreement_number: string ... 37 more fields]
scala> data.count()
res22: Long = 163228
I am also getting a different result when I use cassandraCount() compared to Spark SQL:
scala> val rdd = sc.cassandraTable("pramod", "history").select("tx_id","sent_date").where("sent_date>='2024-06-11 00:00:00.000Z' and sent_date <='2027-11-15 00:00:00.000Z'")
rdd: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[77] at RDD at CassandraRDD.scala:19
scala> rdd.count()
res20: Long = 181005
scala> rdd.cassandraCount()
res25: Long = 181005
I haven't tested this, so I'm not 100% sure, but it could be that the value is compared as a string rather than as a timestamp - at least I have seen such behaviour with filter pushdown. Can you try something like:
data.filter("ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)")
This is a problem with the time zone: Spark uses the local time zone. Try setting the time zone to UTC in your Spark application:
import java.util.TimeZone
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
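Putting both suggestions together, a sketch against the query from the question (untested here; the table and column names come from the question):
import java.util.TimeZone
import org.apache.spark.sql.functions.{col, lit}

// make the JVM default time zone UTC (as suggested above)
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

val data = spark.sql("select * from pramod.history")
  .where(col("sent_date") >= lit("2024-06-11 00:00:00").cast("timestamp")
      && col("sent_date") <= lit("2027-11-15 00:00:00").cast("timestamp"))
data.count()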

How to JOIN 3 RDD's using Spark Scala

I want to join 3 tables using Spark RDDs. I achieved my objective using Spark SQL, but when I tried to join them using RDDs I did not get the desired results. Below is my query using Spark SQL and its output:
scala> actorDF.as("df1").join(movieCastDF.as("df2"),$"df1.act_id"===$"df2.act_id").join(movieDF.as("df3"),$"df2.mov_id"===$"df3.mov_id").
filter(col("df3.mov_title")==="Annie Hall").select($"df1.act_fname",$"df1.act_lname",$"df2.role").show(false)
+---------+---------+-----------+
|act_fname|act_lname|role |
+---------+---------+-----------+
|Woody |Allen |Alvy Singer|
+---------+---------+-----------+
Now I have created the paired RDDs for the three datasets, as shown below:
scala> val actPairedRdd=actRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3))))
scala> actPairedRdd.take(5).foreach(println)
(101,(James,Stewart,M))
(102,(Deborah,Kerr,F))
(103,(Peter,OToole,M))
(104,(Robert,De Niro,M))
(105,(F. Murray,Abraham,M))
scala> val movieCastPairedRdd=movieCastRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2))))
movieCastPairedRdd: org.apache.spark.rdd.RDD[(String, (String, String))] = MapPartitionsRDD[318] at map at <console>:29
scala> movieCastPairedRdd.foreach(println)
(101,(901,John Scottie Ferguson))
(102,(902,Miss Giddens))
(103,(903,T.E. Lawrence))
(104,(904,Michael))
(105,(905,Antonio Salieri))
(106,(906,Rick Deckard))
scala> val moviePairedRdd=movieRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3),p(4),p(5),p(6))))
moviePairedRdd: org.apache.spark.rdd.RDD[(String, (String, String, String, String, String, String))] = MapPartitionsRDD[322] at map at <console>:29
scala> moviePairedRdd.take(2).foreach(println)
(901,(Vertigo,1958,128,English,1958-08-24,UK))
(902,(The Innocents,1961,100,English,1962-02-19,SW))
Here actPairedRdd and movieCastPairedRdd are linked with each other, and movieCastPairedRdd and moviePairedRdd are linked, since they have a common column.
Now when I join all three datasets I am not getting any data:
scala> actPairedRdd.join(movieCastPairedRdd).join(moviePairedRdd).take(2).foreach(println)
I am getting blank records. So where am I going wrong? Thanks in advance.
JOINs like this with RDDs are painful; that's another reason why DataFrames are nicer.
You get no data because the pair RDDs (K, V) have no common data in the K part with the last RDD: the keys 101, 102 will join, but they have nothing in common with 901, 902. You need to shift the key around, as in this more limited example of mine:
val rdd1 = sc.parallelize(Seq(
(101,("James","Stewart","M")),
(102,("Deborah","Kerr","F")),
(103,("Peter","OToole","M")),
(104,("Robert","De Niro","M"))
))
val rdd2 = sc.parallelize(Seq(
(101,(901,"John Scottie Ferguson")),
(102,(902,"Miss Giddens")),
(103,(903,"T.E. Lawrence")),
(104,(904,"Michael"))
))
val rdd3 = sc.parallelize(Seq(
(901,("Vertigo",1958 )),
(902,("The Innocents",1961))
))
val rdd4 = rdd1.join(rdd2)
val new_rdd4 = rdd4.keyBy(x => x._2._2._1) // Redefine Key for join with rdd3
val rdd5 = rdd3.join(new_rdd4)
rdd5.collect
returns:
res14: Array[(Int, ((String, Int), (Int, ((String, String, String), (Int, String)))))] = Array((901,((Vertigo,1958),(101,((James,Stewart,M),(901,John Scottie Ferguson))))), (902,((The Innocents,1961),(102,((Deborah,Kerr,F),(902,Miss Giddens))))))
You will need to strip out the data you want via a map; I leave that to you. The join is an INNER join by default.
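For completeness, a sketch of such a map over rdd5, with the tuple positions read off the result type above (the extra movie fields in the full dataset would slot in the same way):
// rdd5: (movId, ((title, year), (actId, ((fname, lname, gender), (movId, role)))))
val projected = rdd5.map { case (_, ((title, _), (_, ((fname, lname, _), (_, role))))) =>
  (fname, lname, role, title)
}
projected.collect.foreach(println)
// e.g. (James,Stewart,John Scottie Ferguson,Vertigo)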

Temp table caching with spark-sql

Is a table registered with registerTempTable (createOrReplaceTempView with spark 2.+) cached?
Using Zeppelin, I register a DataFrame in my Scala code after heavy computation, and then within %pyspark I want to access it and filter it further.
Will it use a memory-cached version of the table? Or will it be rebuilt each time?
Registered tables are not cached in memory.
The registerTempTable (createOrReplaceTempView in Spark 2.+) method will just create or replace a view of the given DataFrame with a given query plan.
The query plan is only converted to a canonicalized SQL string and stored as view text in the metastore when a permanent view is created.
You'll need to cache your DataFrame explicitly. e.g :
df.createOrReplaceTempView("my_table") // df.registerTempTable("my_table") for Spark < 2.x
spark.catalog.cacheTable("my_table")   // sqlContext.cacheTable("my_table") for Spark < 2.x
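To double-check that the table is actually cached, the catalog can be queried (Spark 2.x API):
spark.catalog.isCached("my_table")   // true once cacheTable has been called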
EDIT:
Let's illustrate this with an example :
Using cacheTable :
scala> val df = Seq(("1",2),("b",3)).toDF
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> sc.getPersistentRDDs
// res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> df.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> spark.catalog.cacheTable("my_table") // sqlContext.cacheTable("...") before Spark 2.0
scala> sc.getPersistentRDDs
// res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
// Map(2 -> In-memory table my_table MapPartitionsRDD[2] at
// cacheTable at <console>:26)
Now the same example, using cache followed by createOrReplaceTempView (cache.registerTempTable before Spark 2.x):
scala> sc.getPersistentRDDs
// res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> val df = Seq(("1",2),("b",3)).toDF
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> df.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> df.cache.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res6: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
// Map(2 -> ConvertToUnsafe
// +- LocalTableScan [_1#0,_2#1], [[1,2],[b,3]]
// MapPartitionsRDD[2] at cache at <console>:28)
It is not cached automatically. You should cache it explicitly:
sqlContext.cacheTable("someTable")
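For reference, the same can be done with the CACHE TABLE SQL statement (a sketch, using the registered name from above):
sqlContext.sql("CACHE TABLE someTable")     // caches the registered table by name
sqlContext.sql("UNCACHE TABLE someTable")   // removes it from the cache again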

How can I check whether my RDD or dataframe is cached or not?

I have created a DataFrame, say df1, and cached it using df1.cache(). How can I check whether it has been cached or not?
Also, is there a way to see all my cached RDDs or DataFrames?
You can check storageLevel.useMemory on a DataFrame and getStorageLevel.useMemory on an RDD to find out whether the dataset is in memory.
For the DataFrame, do this:
scala> val df = Seq(1, 2).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> df.storageLevel.useMemory
res1: Boolean = false
scala> df.cache()
res0: df.type = [value: int]
scala> df.storageLevel.useMemory
res1: Boolean = true
For the RDD do this:
scala> val rdd = sc.parallelize(Seq(1,2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21
scala> rdd.getStorageLevel.useMemory
res9: Boolean = false
scala> rdd.cache()
res10: rdd.type = ParallelCollectionRDD[1] at parallelize at <console>:21
scala> rdd.getStorageLevel.useMemory
res11: Boolean = true
@Arnab,
Did you find the function in Python?
Here is an example for a DataFrame DF:
DF.cache()
print(DF.is_cached)
Hope this helps.
Ram
Since Spark (Scala) 2.1.0, this can be checked for a DataFrame as follows:
dataframe.storageLevel.useMemory
You can retrieve the storage level of an RDD since Spark 1.4, and of a DataFrame since Spark 2.1:
val storageLevel = rdd.getStorageLevel
val storageLevel = dataframe.storageLevel
Then you can check where it's stored as follows:
val isCached: Boolean = storageLevel.useMemory || storageLevel.useDisk || storageLevel.useOffHeap
In Java and Scala, the following method can be used to find all the persisted RDDs:
sparkContext.getPersistentRDDs()
This is documented in the SparkContext API.
It looks like this method is not available in Python yet:
https://issues.apache.org/jira/browse/SPARK-2141
But one could use this short-term hack:
sparkContext._jsc.getPersistentRDDs().items()

Error in Caching a Table in SparkSQL

I am trying to cache a table available in Hive (using spark-shell). Given below is my code:
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.cacheTable("sparkdb.firsttable")
and I am getting the exception below:
org.apache.spark.sql.catalyst.analysis.NoSuchTableException
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:112)
The table firsttable is available in the database sparkdb (in Hive). The issue seems to be with how I provide the database name. How do I achieve this?
PS: A HiveQL query like the one shown below does work without any issues:
scala> hiveContext.sql("select * from sparkdb.firsttable")
Below are the results from a few other method calls:
scala> hiveContext.tables("sparkdb")
res14: org.apache.spark.sql.DataFrame = [tableName: string, isTemporary: boolean]
scala> hiveContext.tables("sparkdb.firsttable")
res15: org.apache.spark.sql.DataFrame = [tableName: string, isTemporary: boolean]
Aha! I was right, this seems to be SPARK-8105. So, for now, your best bet is to do the select * and cache that.
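A sketch of that workaround, reusing the hiveContext and table name from the question (the temp table name is arbitrary):
// Work around SPARK-8105: select from the database-qualified table,
// register the result as a temp table, and cache that instead
val firstTable = hiveContext.sql("select * from sparkdb.firsttable")
firstTable.registerTempTable("firsttable")
hiveContext.cacheTable("firsttable")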
