I have a question that needs an expert's opinion.
I have a table called config that contains some configuration information, as the name suggests. I need these details to be accessible from all the executors during my job's life cycle. My first option is to broadcast them as a List of a case class, but then I got the idea of registering the config as a temp table using registerTempTable() and using it across my job.
Can this temp table approach be used as an alternative to broadcast variables? (I have extensive hands-on experience with broadcasting.)
registerTempTable just gives you the possibility to run plain SQL queries against your DataFrame; there is no performance benefit, caching, or materialization involved.
You should go with broadcasting (I would suggest using a Map for the configuration parameters).
With registerTempTable() and then using the table for lookups, Spark will mostly use a broadcast join internally anyway, given that the table/config file size is below the broadcast threshold (about 10 MB by default).
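A minimal sketch of the broadcast approach (spark-shell style). The config table name comes from the question; the key/value column names are assumptions:

// Assumes an existing SparkSession named `spark` (as in spark-shell).
// Collect the small config table to the driver and broadcast it as a Map,
// so every executor can do cheap in-memory lookups.
val configMap: Map[String, String] = spark.table("config")
  .collect()
  .map(row => row.getAs[String]("key") -> row.getAs[String]("value"))
  .toMap

val configBroadcast = spark.sparkContext.broadcast(configMap)

// Inside any executor-side function:
// val threshold = configBroadcast.value.getOrElse("threshold", "default")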
I need to join a Hive table with JSON data from a REST endpoint. Is it better to use a UDF or a data source (like a temp table)? If using a UDF, what would be a good way to throttle requests per second (RPS)?
If you need to look up data from the REST endpoint in Spark, you likely want to look at mapPartitions. There is a good explanation of why it can be better than just using map (and a UDF). It also speaks to throttling by implication: each partition is processed by one task at a time, so you can set a theoretical maximum on concurrency this way. (I say theoretical maximum because you aren't always guaranteed to get all the executors you ask for.)
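A rough sketch of the mapPartitions approach with a crude per-partition throttle (spark-shell style). The Hive table name, the id column, the endpoint URL, and the target rate are all assumptions, not from the question:

import spark.implicits._

// Ids to enrich; assumes an existing SparkSession `spark` and a Hive table with an `id` column.
val ids = spark.table("my_hive_table").select($"id").as[String]

val maxRequestsPerSecondPerPartition = 5  // assumed throttle target

val enriched = ids.mapPartitions { iter =>
  val delayMs = 1000L / maxRequestsPerSecondPerPartition
  iter.map { id =>
    Thread.sleep(delayMs)  // crude throttle: at most ~5 requests/second per partition
    // Hypothetical endpoint; swap in a real HTTP client and error handling for production.
    val json = scala.io.Source.fromURL(s"https://api.example.com/items/$id").mkString
    (id, json)
  }
}

enriched.toDF("id", "payload")  // join this back to the Hive table as needed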
I'm working on a project that involves reading data from an RDBMS using JDBC, and I have succeeded in reading the data. This is something I will be doing regularly, on a weekly basis. So I've been trying to come up with a way to ensure that after the initial read, subsequent ones only pull updated records instead of pulling the entire table.
I can do this with Sqoop incremental import by specifying three parameters (--check-column, --incremental lastmodified/append, and --last-value). However, I don't want to use Sqoop for this. Is there a way I can replicate the same in Spark with Scala?
Secondly, some of the tables do not have a unique column that can be used as partitionColumn, so I thought of using a row_number function to add a unique column to these tables and then getting the MIN and MAX of that column as lowerBound and upperBound respectively. My challenge now is how to dynamically pass these values into the read statement, like below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"

val df = spark.read.format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("partitionColumn", row_nums)
  .option("lowerBound", min(row_nums))
  .option("upperBound", max(row_nums))
  .option("numPartitions", some value)
  .option("fetchsize", some value)
  .option("dbtable", queryNum)
  .option("user", user)
  .option("password", password)
  .load()
I know the above code is not right and might be missing a whole lot of steps, but I guess it'll give a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read in the max row of your prior output, either directly by loading all the data files or via a log file that you write out each time. If your data files are massive you may need the log file; if they are smaller you could potentially load them directly.
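For the dynamic lowerBound/upperBound part of the question, one option is a small preliminary JDBC query that computes the bounds before the partitioned read. A rough sketch, assuming a last_modified column for the incremental filter (the column name, the stored last value, and the tuning numbers are all assumptions, not from the original post):

// Assumes an existing SparkSession `spark` plus the driver/url/user/password vals from the question.
val lastValue = "2023-01-01 00:00:00"  // assumed: read from your prior run's log or output

val queryNum =
  s"""(select a1.*, row_number() over (order by sales) as row_nums
     |  from (select * from schema.table
     |         where last_modified > '$lastValue') a1) q""".stripMargin

// One-row query to get the bounds for the partition column.
// (This will throw if there are no new rows; guard for that in a real job.)
val bounds = spark.read.format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("user", user)
  .option("password", password)
  .option("dbtable", s"(select min(row_nums) as lo, max(row_nums) as hi from $queryNum) b")
  .load()
  .first()

val lower = bounds.getAs[Number]("lo").longValue
val upper = bounds.getAs[Number]("hi").longValue

val df = spark.read.format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("user", user)
  .option("password", password)
  .option("dbtable", queryNum)
  .option("partitionColumn", "row_nums")
  .option("lowerBound", lower)
  .option("upperBound", upper)
  .option("numPartitions", 8)      // assumed; tune for your cluster and table size
  .option("fetchsize", 1000)       // assumed
  .load()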
Is there a config for controlling the number of files written using INSERT or CREATE TABLE AS in Presto? Looking for something similar or identical to the Spark counterpart spark.sql.shuffle.partitions = 1.
I am looking to decrease the number of small files that are generated with INSERT, to avoid additional ETL in Spark with the above Spark config. Is this possible? I haven't found anything close to this in the Presto docs.
You can't control the number of output files directly, but you can reduce the number of files that get written by turning on the scale-writers config option (or scale_writers session property). Add the following to the config.properties file:
scale-writers=true
When that option is enabled, Trino (formerly known as PrestoSQL) starts with the minimum number of writers and scales up as needed based on throughput.
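If you can't edit config.properties, the scale_writers session property mentioned above can also be enabled per session; something along these lines:

SET SESSION scale_writers = true;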
See this discussion on the Trino Community Slack:
https://trinodb.slack.com/archives/CFLB9AMBN/p1564046069087800?thread_ts=1563945529.046400&cid=CFLB9AMBN
Unfortunately, this option is not yet documented as of Presto 327. I created an issue to track this improvement to the documentation: https://github.com/trinodb/trino/issues/2352.
I am working with about a TB of data stored in Cassandra and trying to query it using Spark and R (could be Python).
My preference for querying the data would be to expose the Cassandra table I'm querying as a Spark RDD (using sparklyr and the spark-cassandra-connector with spark-sql) and simply do an inner join on the column of interest (it is a partition key column). The company I'm working with says that this approach is a bad idea, as it will translate into an IN clause in CQL and thus cause a big slow-down.
Instead I'm using their preferred method: write a closure that extracts the data for a single id in the partition key using a JDBC connection, and then apply that closure 200k times, once for each id I'm interested in. I use spark_apply to apply that closure in parallel across executors. I also set spark.executor.cores to 1 so I get a lot of parallelization.
I'm having a lot of trouble with this approach and am wondering what the best practice is. Is it true that Spark SQL does not account for the slowdown associated with pulling multiple ids from a partition key column (IN operator)?
A few points here:
Working with Spark SQL is not always the most performant option; the optimizer might not always do as good a job as code you write yourself.
Check the logs carefully during your work, and always check how your high-level queries are translated into CQL queries. In particular, make sure you avoid a full table scan if you can.
If you are joining on the partition key, you should look into leveraging the methods repartitionByCassandraReplica and joinWithCassandraTable (see the sketch after this list). Have a look at the official doc here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md and Tip 4 of this blog post: https://www.instaclustr.com/cassandra-connector-for-spark-5-tips-for-success/
Final note: it's quite common to have two Cassandra data centers when using Spark. The first one serves regular reads/writes, the second one is used for running Spark. It's a separation-of-concerns best practice (at the cost of an additional DC, of course).
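A rough sketch of that join-by-partition-key path with the connector's RDD API (spark-shell style). The keyspace "ks", table "events", id column, and input path are all assumptions:

import com.datastax.spark.connector._

// Assumes an existing SparkContext `sc` configured with spark.cassandra.connection.host.
// The partition key column of ks.events is assumed to be `id`.
case class IdKey(id: String)

// The ~200k ids of interest, e.g. loaded from a file.
val idsRdd = sc.textFile("hdfs:///path/to/ids.txt").map(IdKey(_))

// Co-locate the lookup keys with the Cassandra replicas that own them,
// then fetch only the matching partitions instead of scanning the whole table.
val joined = idsRdd
  .repartitionByCassandraReplica("ks", "events")
  .joinWithCassandraTable("ks", "events")

joined.take(10).foreach(println)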
Hope it helps!
I have to change the schema of one of my tables in Cassandra. It cannot be done by simply using the ALTER TABLE command, because there are changes in the primary key.
So the question is: How to do such a migration in the best way?
Using the COPY command in cqlsh is not an option here because the dump file can be really huge.
Can I solve this problem without creating a custom application?
As Guillaume suggested in the comment, you can't do this directly in Cassandra; schema-altering operations are very limited here. You have to perform such a migration manually using one of the tools suggested there, or, if you have very large tables, you can leverage Spark.
Spark can efficiently read data from your nodes, transform it locally, and save it back to the database. Remember that such a migration requires reading the whole table's content, so it might take a while. It might be the most performant solution, but it needs some bigger preparation: setting up a Spark cluster.
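A rough sketch of that Spark-based migration using the spark-cassandra-connector's DataFrame API. The keyspace and table names are assumptions, and the new table must already have been created with the new primary key:

import org.apache.spark.sql.SaveMode

// Assumes an existing SparkSession `spark` configured with spark.cassandra.connection.host.
val oldTable = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "old_table"))
  .load()

// Apply whatever column transformations the new schema needs here, then append
// into the already-created target table; rows get re-clustered by the new primary key.
oldTable.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "new_table"))
  .mode(SaveMode.Append)
  .save()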