Select count(*) issue with hive and spark - apache-spark

I get the correct count after I run the ANALYZE statement.
But my problem is, it needs to be run every time of the count is updated. Technically I should be able to update the count for the same partition.
But it returns the same count if I don't execute the ANALYZE statement.
This is the query I execute for the count to be updated.
ANALYZE TABLE bi_events_identification_carrier_sam PARTITION(year, month, day) COMPUTE STATISTICS;
And executing is not convenient at all. any ideas?

Your count(*) query is using stats to get the result.
If you are using spark to write data, then you can set spark.sql.statistics.size.autoUpdate.enabled to true. This makes sure that Spark updates the table stats automatically after the write is done.
If you are using Hive, you can set set hive.stats.autogather=true;.
Once you enable these settings, then the write query will automatically update the stats and the subsequent read query will work fine.

Related

how to avoid errors when querying hive table being loaded by Spark at the same time

We have a use case where we run an ETL written in spark on top of some streaming data, the ETL writes results to the target hive table every hour, but users are commonly running queries to the target table and we have faced cases of having query errors due to spark loading the table at the same time. What alternatives do we have to avoid or minimize this errors? Any property to the spark job(or to the hive table)? or something like creating a temporary table?
The error is:
java.io.FileNotFoundException: File does not exist [HDFS PATH]
Which i think happens because the metadata says there is a file A that gets deleted during the job execution.
The table is partitioned by year, month, day(using HDFS as storage) and every time the ETL runs it updates(via a partition overwrite) only current date partition. Currently no "transactional" tables are enabled in the cluster(even if they were i tested the use case on a test cluster without luck)
The easy option is to use a table format thats designed to handle concurrent reads and writes like hudi or delta lake. The more complicated version involves using a partitioned append only table that the writer writes to. On completion the writer updates a view to point to the new data. Another possible option is to partition the table on insert time.
Have a set of two tables and a view over them:
CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);
CREATE VIEW foo AS SELECT x, y, z, ... FROM foo_a;
First iteration of ETL process needs to:
Synchronize foo_a -> foo_b
Do the work on foo_b
Drop view foo and recreate it pointing to foo_b
Until step 3 user queries run against table foo_a. From the moment of switch they run against foo_b. Next iteration of ETL will work in the opposite way.
This is not perfect. You need double storage and some extra complexity in the ETL. And anyway this approach might fail if:
user is unlucky enough to hit a short time between dropping and
recreating the view
user submits a query that's heavy enough to run across two iterations of ETL
not sure but check it out
CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);

Incremental and parallelism read from RDBMS in Spark using JDBC

I'm working on a project that involves reading data from RDBMS using JDBC and I succeeded reading the data. This is something I will be doing fairly constantly, weekly. So I've been trying to come up with a way to ensure that after the initial read, subsequent ones should only pull updated records instead of pulling the entire data from the table.
I can do this with sqoop incremental import by specifying the three parameters (--check-column, --incremental last-modified/append and --last-value). However, I dont want to use sqoop for this. Is there a way I can replicate same in Spark with Scala?
Secondly, some of the tables do not have unique column which can be used as partitionColumn, so I thought of using a row-number function to add a unique column to these table and then get the MIN and MAX of the unique column as lowerBound and upperBound respectively. My challenge now is how to dynamically parse these values into the read statement like below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"
val df = spark.read.format("jdbc").
option("driver", driver).
option("url",url ).
option("partitionColumn",row_nums).
option("lowerBound", min(row_nums)).
option("upperBound", max(row_nums)).
option("numPartitions", some value).
option("fetchsize",some value).
option("dbtable", queryNum).
option("user", user).
option("password",password).
load()
I know the above code is not right and might be missing a whole lot of processes but I guess it'll give a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read in the max row of your prior output either directly from loading all data files or via a log file that you write out each time. If your data files are massive you may need to use the log file, if smaller you could potentially load.

Cassandra CQL: How to insert only records, which are not older than 3 years?

I have some table like this:
CREATE TABLE events (
id int,
eventdate timestamp,
PRIMARY KEY (id)
);
What I'm trying to do is conditional insert, which going to verify if eventdate is not older than 3 years and insert data if the condition is met.
In SQL something similar could be achieved by DATEADD
How to handle it in Cassandra?
select * from events and iterate (paging) through the result set. Issue an insert for everything older than 3 years. A quick python script and giving it a day or two to run will accomplish it in less time than more elaborate things. Particularly if this is a one off thing. If you need to do it regularly I would recommend writing a spark job to do it. You can get more efficient if you dont want to use spark and want to run it locally by splitting up token ranges on the select statement to be the ring boundaries.
Cassandra wont support large bulk operations that require reads before writes that must read entire data set. It wouldn't work on clusters its designed to support (think petabytes across many data centers).

Select All Records From Cassandra

I am trying to select all records from one Cassandra table (~10M records) which should be distributed over 4 nodes using CQL shell, but every time I do that it partitions the output to 1K records Max. So my question is, it is possible to select all records at once as I am trying to see how much time it takes Cassandra to retrieve all records.
When you write "SELECT * from CF" CQL client will never select everything at once. It's just a stupid action for large data. Instead it will load only first page and give you an iterator. Cassandra from 2.0 version supports automatic query paging. So you should call your select all query and ITERATE over pages to load full column family. See an example for python client. There is no way to load all in one action in CQL now and it shouldn't be.
While it was already pointed out that it's a bad idea to try and load all data in cqlsh, what you're trying to do is still somewhat possible. You just need to set a limit and probably increase the timeout for cqlsh.
user#host:~# cqlsh --request-timeout=600
This will start the shell with a request timeout of 10 minutes.
select * from some_table limit 10000000;
Please do not use this in a production environment, as it might have terrible implications for performance and cluster availability!

Require help in creating design for cassandra data model for my requirement

I have a Job_Status table with 3 columns:
Job_ID (numeric)
Job_Time (datetime)
Machine_ID (numeric)
Other few fields containing stats (like memory, CPU utilization)
At a regular interval (say 1 min), entries are inserted in the above table for the Jobs running on each Machines.
I want to design the data model in Cassandra.
My requirement is to get list (pair) of jobs which are running at the same time on 2 or more than 2 machines.
I have created table with Job_Id and Job_Time as primary key for row but in order to achieve the desired result I have to do lots of parsing of data after retrieval of records.
Which is taking a lot of time when the number of records reach around 500 thousand.
This requirement expects the operation like inner join of SQL, but I can’t use SQL due to some business reasons and also SQL query with such huge data set is also taking lots of time as I tried that with dummy data in SQL Server.
So I require your help on below points:
Kindly suggest some efficient data model in Cassandra for this requirement.
How the join operation of SQL can be achieved/implemented in Cassandra database?
Kindly suggest some alternate design/algorithm. I am stuck at this problem for a very long time.
That's a pretty broad question. As a general approach you might want to look at pairing Cassandra with Spark so that you could do the large join in parallel.
You would insert jobs into your table when they start and delete them when they complete (possibly with a TTL set on insert so that jobs that don't get deleted will auto delete after some time).
When you wanted to update your pairing of jobs, you'd run a spark batch job that would load the table data into an RDD, and then do a map/reduce operation on the data, or use spark SQL to do a SQL style join. You'd probably then write the resulting RDD back to a Cassandra table.

Resources