I have a use case where there is a table with one column that holds a sequence of SQL queries.
I want to run these SQL queries in a Spark program one after the other, not in parallel, because the SQL query on the Nth row depends on the (N-1)th row.
Given this constraint, how can I execute the queries sequentially rather than in parallel?
I think you could use something like this:
import org.apache.spark.sql.functions.col

val listOfQueryRows = spark.table("foo_db.table_of_queries")
  .orderBy(col("query_index"))   // enforce the execution order before projecting
  .select(col("sql_query"))
  .collect()                     // bring the query strings to the driver

listOfQueryRows.foreach(queryRow => spark.sql(queryRow.getString(0)))
This selects all your queries from the sql_query column, orders them by the index given in query_index, and collects them on the driver as the local array listOfQueryRows. The array is then iterated over sequentially, executing the query from each row in turn.
I have a query in spark-sql with a lot of values in the IN clause:
select * from table where x in (<long list of values>)
When I run this query I get a TransportException from the MetastoreClient in Spark.
Column x is the partition column of the table. The Hive metastore is on Oracle.
Is there a hard limit on how many values can be in the IN clause?
Or can I maybe set the timeout value higher to give the metastore more time to answer?
Yes, there is a limit: since your Hive metastore is on Oracle, you can pass at most 1,000 values inside an IN clause (an Oracle restriction). However, you can combine several IN clauses with the OR operator, slicing the list of values into windows of at most 1,000 each.
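A minimal sketch of that slicing in Scala, assuming the column is named x and the values are strings (the value list and the table name are placeholders, not from the original question):

// Hypothetical value list; in practice this comes from your application.
val values: Seq[String] = Seq("v1", "v2", "v3")

// Build "x in (...) or x in (...) or ..." with at most 1000 values per window.
val predicate = values
  .grouped(1000)
  .map(window => window.mkString("x in ('", "', '", "')"))
  .mkString(" or ")

val result = spark.sql(s"select * from table where $predicate")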
I have a bunch of Hive tables.
I want to:
Pull the tables into a pyspark DF.
Do a UDF on them.
Join 4 tables based on customer id.
Is there a concept of indexing in Spark to speed up the operation?
If so, what's the command?
How do I create an index on a DataFrame?
I understand your problem, but the thing is, you acquire the data at the same time you process it. Therefore, calculating an index before joining is useless, as creating the index would itself take more time than it saves.
If you run several actions over the same data, you may want to cache it to speed things up, but otherwise an index is not the solution to investigate.
There is maybe another thing you can try: df.repartition.
This will create partitions of your df according to one column. But I have no idea whether it will help.
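A hedged sketch of that idea: repartitioning both sides by the join key so rows with the same customer id land in matching partitions (the table and column names here are assumptions):

import org.apache.spark.sql.functions.col

// Hypothetical tables; repartitioning both by the join key co-locates
// matching customer ids before the join.
val customers = spark.table("db.customers").repartition(col("customer_id"))
val orders = spark.table("db.orders").repartition(col("customer_id"))

// Cache only if the joined result is reused by several later actions.
val joined = customers.join(orders, Seq("customer_id")).cache()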
I know that Cassandra allows GROUP BY and can run a UDF on the grouped data.
Is there any default function to get the first row of each aggregated set?
(How) can I stop processing data and return a result from my UDF immediately (e.g. after one or a few rows have been processed)?
For now I'm using ... COUNT(1) ... as a workaround.
Actually, you don't need any UDF. It works as described out of the box: for any selected column that is not aggregated, Cassandra returns the value from the first row of each group. Just GROUP BY the fields you need.
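A minimal CQL sketch (the table and column names are assumptions):

-- For each distinct partition key, the non-aggregated columns ck and
-- payload come from the first row of that group.
SELECT pk, ck, payload FROM my_table GROUP BY pk;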
I am trying to execute 3 conditional inserts to different tables inside a batch by using the Cassandra cpp-driver:
BEGIN BATCH
insert into table1 values (...) IF NOT EXISTS
insert into table2 values (...) IF NOT EXISTS
insert into table3 values (...) IF NOT EXISTS
APPLY BATCH
But I am getting the following error:
Batch with conditions cannot span multiple tables
If the above is not possible in Cassandra, what is the alternative to perform multiple conditional inserts as a transaction and ensure that all succeed or all fail?
I'm afraid there are no alternatives. Conditional statements in a BATCH environment are limited to a single table only, and I don't think there's room for that to change in the future.
This is due to how Cassandra works internally: a batch containing a conditional update (a so-called lightweight transaction) can only operate on a single partition, because lightweight transactions are based on Paxos, and the Paxos implementation works at the partition level only. Moreover, when multiple conditional statements appear in the same BATCH, all the conditions must hold for the batch to succeed: if even one conditional update fails, the entire batch fails.
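For contrast, a conditional batch is accepted when every statement targets the same partition of the same table. A minimal sketch (the events table and its columns are assumptions):

BEGIN BATCH
-- Both inserts share the partition key 'p1', so one Paxos round covers them.
INSERT INTO events (pk, seq, payload) VALUES ('p1', 1, 'a') IF NOT EXISTS;
INSERT INTO events (pk, seq, payload) VALUES ('p1', 2, 'b') IF NOT EXISTS;
APPLY BATCH;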
You can read more about BATCH statements in the documentation.
You'd basically take a performance hit for the conditional update and another for the batched operation, and Cassandra stops you before you get that far.
It seems to me you designed this RDBMS-like. A NoSQL alternative (I don't know whether it can be applied to your use case, though) is to denormalize your data into a fourth table that combines the other three, and then issue a single conditional insert against that fourth table.
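A hedged sketch of that denormalization; the table layout and column names are assumptions, not from the original question:

-- One row carries the data previously split across table1..table3,
-- so a single lightweight transaction covers all of it.
CREATE TABLE combined (
    id uuid PRIMARY KEY,
    t1_value text,
    t2_value text,
    t3_value text
);

INSERT INTO combined (id, t1_value, t2_value, t3_value)
VALUES (uuid(), 'a', 'b', 'c')
IF NOT EXISTS;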
I wanted to know if there's a way to join two or more result sets into one.
I actually need to execute more than one query and return just one result set. I can't use the UNION or the JOIN operators because I'm working with Cassandra (CQL)
Thanks in advance!
Frameworks like PlayOrm provide support for JOIN (INNER and LEFT JOIN) queries in Cassandra.
http://buffalosw.com/wiki/Command-Line-Tool/
You may see more examples at:
https://github.com/deanhiller/playorm/blob/master/src/test/java/com/alvazan/test/TestJoins.java
If you want to query multiple rows within the same column family, you can use the IN keyword:
SELECT * FROM testCF WHERE key IN ('rowKeyA', 'rowKeyB', 'rowKeyZ') LIMIT 10;
This will get you back 10 results from each row.
If you need to join results from different CFs, or query with differing WHERE clauses, then you need to run multiple queries and merge the results in code; Cassandra doesn't cater for that kind of thing.
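A minimal sketch of the merge-in-code approach, using the DataStax Java driver from Scala (the contact point, keyspace, tables, and WHERE clauses are all assumptions):

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

// Run each query separately, then concatenate the rows client-side.
val rowsA = session.execute("SELECT * FROM table_a WHERE key = 'a'").all().asScala
val rowsB = session.execute("SELECT * FROM table_b WHERE key = 'b'").all().asScala
val merged = rowsA ++ rowsB

cluster.close()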
PlayOrm can do joins, but you may need to have PlayOrm partitioning turned on so you still scale (i.e. you don't want to join 1 billion rows with 1 billion rows). Typically, instead, you join one partition with another partition, or a partition on the Account table with a partition on the Users table; i.e. make sure you still design for scale.