Does Spark load the entire table into memory if the select condition is based on an RDD transformation?

Dataset<Row> a = spark.read().format("com.memsql.spark.connector").option("query", "select * from a").load();
a = a.filter("x = y"); // keep only rows where column x equals column y
String xstring = "...select all values of x from a and make comma separated string";
Dataset<Row> b = spark.read().format("com.memsql.spark.connector").option("query", "select * from b where x in " + xstring).load();
b.show();
In this case, would Spark load the entire table b into memory and then filter out the xstring rows, or would it actually build xstring first and then load only a subset of table b into memory when we call show?

When MemSQL is queried using option("query", "select * from .......") the entire result (not the table) is read from MemSQL into the executors. The MemSQL Spark Connector 2.0 supports column and filter pushdown, for which the SQL itself needs to contain the filter and join conditions rather than having the filter and join applied on the DataFrame. In your example, predicate pushdown will be used: the entire table 'a' will be read because there is no filter condition, xstring will be built, and then only the part of table 'b' that matches the x in (...) condition is read.
Here is the MemSQL documentation explaining this.
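The flow described above can be sketched roughly as follows: the IN list is assembled on the driver and embedded in the query text, so the filter travels with the SQL. The helper name `build_in_query` and the sample values are hypothetical, the snippet only builds the string, and it assumes the keys are trusted integers.

```python
def build_in_query(table, column, values):
    """Return a SELECT whose WHERE clause carries the filter, so the database applies it."""
    # Hypothetical helper: assumes integer keys coming from a trusted source.
    if not values:
        # An empty IN () list is invalid in most SQL dialects; match nothing instead.
        return f"select * from {table} where 1 = 0"
    in_list = ", ".join(str(int(v)) for v in values)
    return f"select * from {table} where {column} in ({in_list})"


# Made-up values of x, standing in for what would be collected from the filtered table a.
x_values = [3, 7, 11]
query = build_in_query("b", "x", x_values)
print(query)
```

The resulting string is what would be passed to option("query", ...), so only the matching rows of b leave the database.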

Related

How to make my identity column consecutive on delta table in Azure Databricks?

I am trying to create a Delta table with a consecutive identity column. The goal is for our clients to be able to see whether there is some data they did not receive from us.
It looks like the generated identity column is not consecutive, which makes the "INCREMENT BY 1" quite misleading.
store_visitor_type_name = ["apple","peach","banana","mango","ananas"]
card_type_name = ["door","desk","light","coach","sink"]
store_visitor_type_desc = ["monday","tuesday","wednesday","thursday","friday"]
colnames = ["column2","column3","column4"]
data_frame = spark.createDataFrame(zip(store_visitor_type_name,card_type_name,store_visitor_type_desc),colnames)
data_frame.createOrReplaceTempView('vw_increment')
data_frame.display()
%sql
CREATE or REPLACE TABLE TEST(
`column1SK` BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1)
,`column2` STRING
,`column3` STRING
,`column4` STRING
,`inserted_timestamp` TIMESTAMP
,`modified_timestamp` TIMESTAMP
)
USING delta
LOCATION '/mnt/Marketing/Sales';
MERGE INTO TEST as target
USING vw_increment as source
ON target.`column2` = source.`column2`
WHEN MATCHED
AND (target.`column3` <> source.`column3`
OR target.`column4` <> source.`column4`)
THEN
UPDATE SET
`column2` = source.`column2`
,`modified_timestamp` = current_timestamp()
WHEN NOT MATCHED THEN
INSERT (
`column2`
,`column3`
,`column4`
,`modified_timestamp`
,`inserted_timestamp`
) VALUES (
source.`column2`
,source.`column3`
,source.`column4`
,current_timestamp()
,current_timestamp()
)
I'm getting the following results. You can see that the values are not sequential. What is also very confusing is that they do not start at 1, even though this is explicitly specified in the query.
I can see in the documentation (https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters) :
The automatically assigned values start with start and increment by
step. Assigned values are unique but are not guaranteed to be
contiguous. Both parameters are optional, and the default value is 1.
step cannot be 0.
Is there a workaround to make this identity column consecutive?
I guess I could have another column and do a ROW_NUMBER operation after the MERGE, but it looks expensive.
You can use PySpark to achieve this without the row_number() function.
I read the TEST table as a Spark DataFrame and converted it to a pandas-on-Spark DataFrame. Calling reset_index() on it creates a new index column.
Then I converted it back to a Spark DataFrame and added 1 to the index column values, since the index starts at 0.
df = spark.sql("select * from test")
pdf = df.to_pandas_on_spark()
#to create new index column.
pdf.reset_index(inplace=True)
final_df = pdf.to_spark()
#Since index starts from 0, I have added 1 to it.
final_df.withColumn('index',final_df['index']+1).show()
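For comparison, the ROW_NUMBER idea mentioned in the question can be sketched as below. This uses Python's built-in sqlite3 module purely as a stand-in so the snippet runs without a cluster (window functions need SQLite 3.25 or newer). The table contents are made up, and the assumption is that the same SELECT would work against the Delta table through spark.sql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (column1SK INTEGER, column2 TEXT)")
# Made-up identity values with gaps, similar to what the question describes.
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(2, "apple"), (5, "peach"), (9, "banana")],
)

# Derive a gap-free sequence from the ordering of the (non-contiguous) identity column.
rows = conn.execute(
    "SELECT ROW_NUMBER() OVER (ORDER BY column1SK) AS seq, column1SK, column2 "
    "FROM test ORDER BY seq"
).fetchall()
for row in rows:
    print(row)
```

In Spark, a window with no PARTITION BY moves all rows into a single partition, which is the cost the question alludes to.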

PySpark Pushing down timestamp filter

I'm using PySpark version 2.4 to read some tables using jdbc with a Postgres driver.
df = spark.read.jdbc(url=data_base_url, table="tablename", properties=properties)
One column is a timestamp column and I want to filter it like this:
df_new_data = df.where(df.ts > last_datetime )
This way the filter is pushed down as a SQL query, but the datetime format is not right. So I tried this approach:
df_new_data = df.where(df.ts > F.date_format( F.lit(last_datetime), "y-MM-dd'T'hh:mm:ss.SSS") )
but then the filter is not pushed down anymore.
Can someone clarify why this is the case?
When loading data from a database table, if you want to push a query down to the database and get back only a few result rows, you can provide a query instead of the table name and receive just the result as a DataFrame. This way, the database engine processes the query and returns only the results to Spark.
The table parameter identifies the JDBC table to read. You can use anything that is valid in a SQL query FROM clause, including a parenthesized subquery. Note that the subquery must be given an alias.
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
df.show()
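Applied to the timestamp filter from the question, the same idea might look like the sketch below. The helper name `timestamp_pushdown_query`, the alias `t`, and the sample datetime are hypothetical. The snippet only builds the subquery string and assumes the table and column names come from a trusted source.

```python
from datetime import datetime


def timestamp_pushdown_query(table, column, last_datetime, alias="t"):
    """Build an aliased subquery that filters on a timestamp inside the database."""
    # Render the value as 'YYYY-MM-DD HH:MM:SS.ffffff', a form Postgres reads as a timestamp.
    literal = last_datetime.strftime("%Y-%m-%d %H:%M:%S.%f")
    return f"(select * from {table} where {column} > '{literal}') {alias}"


# Made-up cutoff value standing in for last_datetime from the question.
last_datetime = datetime(2020, 1, 15, 13, 30, 0)
pushdown_query = timestamp_pushdown_query("tablename", "ts", last_datetime)
print(pushdown_query)
```

The string would then be passed as table=pushdown_query to spark.read.jdbc, exactly like the employees example above.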

How can you update values in a dataset?

As far as I know, Apache Spark doesn't have functionality that imitates the SQL UPDATE command, i.e. changing a single value in a column given a certain condition. The only way around that is the following command I was instructed to use (here on Stack Overflow): withColumn(columnName, when(condition, value));
However, the condition must be of type Column, meaning I have to use the built-in column functions Spark provides (equalTo, isin, lt, gt, etc.). Is there a way I can use an SQL statement instead of those built-in functions?
The problem is that I'm given a text file with SQL statements, like WHERE ID > 5 or WHERE AGE != 50, etc. I then have to label values based on those conditions. I thought of following the withColumn() approach, but I can't plug an SQL statement into that function. Any idea how I can get around this?
I found a way around this:
Split your dataset into two sets: the values you want to update and the values you don't want to update.
Dataset<Row> valuesToUpdate = dataset.filter("conditionToFilterValues");
Dataset<Row> valuesNotToUpdate = dataset.except(valuesToUpdate);
valuesToUpdate = valuesToUpdate.withColumn("updatedColumn", lit("updateValue"));
Dataset<Row> updatedDataset = valuesNotToUpdate.union(valuesToUpdate);
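This, however, doesn't keep the same order of records as the original dataset, so if order is important to you, this won't meet your needs. Note also that except behaves like SQL's EXCEPT DISTINCT, so duplicate rows in the part that is not updated are collapsed.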
In PySpark you have to use .subtract instead of .except
If you are using a DataFrame, you can register it as a temp table
using df.registerTempTable("events")
Then you can query it like this:
sqlContext.sql("SELECT * FROM events")
A when clause translates into a CASE expression, which corresponds to the SQL CASE clause.
Example:
scala> val condition_1 = when(col("col_1").isNull,"NA").otherwise("AVAILABLE")
condition_1: org.apache.spark.sql.Column = CASE WHEN (col_1 IS NULL) THEN NA ELSE AVAILABLE END
or you can chain when clause as well
scala> val condition_2 = when(col("col_1") === col("col_2"),"EQUAL").when(col("col_1") > col("col_2"),"GREATER").
| otherwise("LESS")
condition_2: org.apache.spark.sql.Column = CASE WHEN (col_1 = col_2) THEN EQUAL WHEN (col_1 > col_2) THEN GREATER ELSE LESS END
scala> val new_df = df.withColumn("condition_1",condition_1).withColumn("condition_2",condition_2)
If you still want to use a table, you can register your DataFrame / Dataset as a temporary table and run SQL queries against it:
df.createOrReplaceTempView("tempTable")//spark 2.1 +
df.registerTempTable("tempTable")//spark 1.6
Now you can run SQL queries:
spark.sql("your queries goes here with case clause and where condition!!!")//spark 2.1
sqlContext.sql("your queries goes here with case clause and where condition!!!")//spark 1.6
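As a rough illustration of the CASE approach with conditions that arrive as SQL text, the sketch below builds one CASE expression from a list of condition strings. It uses Python's built-in sqlite3 module only so that it runs without a cluster. The table, the rules, and the labels are made up, the condition strings are assumed to be trusted, and the assumption is that the same query text could be passed to spark.sql once the data is registered as a temp view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 50), (2, 30), (6, 30), (9, 50)],
)

# Conditions as they might appear in the text file (WHERE keyword stripped),
# each paired with a hypothetical label. The first matching rule wins.
rules = [("id > 5", "HIGH_ID"), ("age != 50", "NOT_50")]

when_parts = " ".join(f"WHEN {cond} THEN '{label}'" for cond, label in rules)
case_expr = f"CASE {when_parts} ELSE 'NONE' END"
query = f"SELECT id, age, {case_expr} AS label FROM events ORDER BY id"

labels = conn.execute(query).fetchall()
for row in labels:
    print(row)
```

Because the labelling happens in a single SELECT, the original rows are not split and re-unioned, unlike the filter/except approach.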
If you are using a Java Dataset, you can update it as shown below:
Dataset ratesFinal1 = ratesFinal.filter(" on_behalf_of_comp_id != 'COMM_DERIVS' ");
ratesFinal1 = ratesFinal1.filter(" status != 'Hit/Lift' ");
Dataset ratesFinalSwap = ratesFinal1.filter (" on_behalf_of_comp_id in ('SAPPHIRE','BOND') and cash_derivative != 'cash'");
ratesFinalSwap = ratesFinalSwap.withColumn("ins_type_str",functions.lit("SWAP"));
Adding a new column with the value from an existing column:
ratesFinalSTW = ratesFinalSTW.withColumn("action", ratesFinalSTW.col("status"));

What does "Correlated scalar subqueries must be Aggregated" mean?

I use Spark 2.0.
I'd like to execute the following SQL query:
val sqlText = """
select
f.ID as TID,
f.BldgID as TBldgID,
f.LeaseID as TLeaseID,
f.Period as TPeriod,
coalesce(
(select
f.ChargeAmt
from
Fact_CMCharges f
where
f.BldgID = Fact_CMCharges.BldgID
limit 1),
0) as TChargeAmt1,
f.ChargeAmt as TChargeAmt2,
l.EFFDATE as TBreakDate
from
Fact_CMCharges f
join
CMRECC l on l.BLDGID = f.BldgID and l.LEASID = f.LeaseID and l.INCCAT = f.IncomeCat and date_format(l.EFFDATE,'D')<>1 and f.Period=EFFDateInt(l.EFFDATE)
where
f.ActualProjected = 'Lease'
except(
select * from TT1 t2 left semi join Fact_CMCharges f2 on t2.TID=f2.ID)
"""
val query = spark.sql(sqlText)
query.show()
It seems that the inner statement in coalesce gives the following error:
pyspark.sql.utils.AnalysisException: u'Correlated scalar subqueries must be Aggregated: GlobalLimit 1\n+- LocalLimit 1\n
What's wrong with the query?
You have to make sure that your subquery, by definition (and not by data), returns only a single row. Otherwise the Spark analyzer complains while analyzing the SQL statement.
So when Catalyst can't be 100% sure, just by looking at the SQL statement (without looking at your data), that the subquery returns only a single row, this exception is thrown.
If you are sure that your subquery returns only a single row, you can wrap the selected value in one of the following standard aggregate functions to keep the Spark analyzer happy:
first
avg
max
min
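The shape of the rewrite (an aggregate in place of limit 1) can be sketched as below. Python's built-in sqlite3 module is used only as a stand-in so the snippet runs anywhere. SQLite does not enforce Spark's rule, so this shows the query form rather than reproducing the error. The table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE charges (id INTEGER, bldg_id INTEGER, charge_amt REAL)")
conn.executemany(
    "INSERT INTO charges VALUES (?, ?, ?)",
    [(1, 10, 100.0), (2, 10, 250.0), (3, 20, 75.0)],
)

# The correlated subquery uses max(...) so that, by definition, it yields one value
# per outer row, instead of relying on LIMIT 1.
query = """
SELECT f.id,
       coalesce((SELECT max(c.charge_amt)
                 FROM charges c
                 WHERE c.bldg_id = f.bldg_id), 0) AS max_charge
FROM charges f
ORDER BY f.id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Which aggregate to use depends on which of the matching rows the original limit 1 was meant to pick.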

Converting the Hive SQL output to an array[Double]

I am reading some data from a Hive table using a HiveContext in Spark, and the output is a Row with only one column. I need to convert this to an array of Double. I have tried every way I could think of, with no success. Can somebody please help with this?
val qRes = hiveContext.sql("""
Select Sum(EQUnit) * Sum( Units)
From pos_Tran_orc T
INNER JOIN brand_filter B
On t.mbbrandid = b.mbbrandid
inner join store_filter s
ON t.msstoreid = s.msstoreid
Group By Transdate
""")
What next?
You can simply map using the Row.getDouble method:
qRes.map(_.getDouble(0)).collect()
