PySpark giving error with IN query in Cassandra - apache-spark

I have a large Cassandra keyspace (around 20 GB) on an AWS Cassandra server whose master node has 16 GB of RAM. I am trying to run an IN query:
"select colA, colB, colC where colA in {}".format(variable)
colA is a clustering key.
variable is a Python variable holding around 500K entries. I am facing two problems. First, the query does not work at all at that size. Second, with around 20K entries it takes around 20 minutes. Is there any optimization that can be done?
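One common way to keep a very large IN list manageable is to split it into smaller batches and issue one query per batch. The following is a minimal sketch of that idea in plain Python, not a tested fix for this cluster. The table name `my_keyspace.my_table`, the batch size, and the `%s` placeholder style (as used for simple statements by the DataStax Python driver) are assumptions; the original query does not name a table.

```python
from itertools import islice


def chunked(values, size):
    """Yield successive lists of at most `size` items from `values`."""
    it = iter(values)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_in_query(batch, table="my_keyspace.my_table"):
    # `table` is a hypothetical name; the original query does not name one.
    # One placeholder per value, so values are bound rather than
    # formatted into the query string.
    placeholders = ", ".join(["%s"] * len(batch))
    return (
        f"SELECT colA, colB, colC FROM {table} "
        f"WHERE colA IN ({placeholders})"
    )


if __name__ == "__main__":
    variable = list(range(10))  # stand-in for the 500K-entry collection
    for batch in chunked(variable, 4):
        # Each (query, batch) pair would be passed to the driver's execute call.
        print(build_in_query(batch), batch)
```

Whether batching helps here depends on the data model: colA is a clustering key, so each query may still touch many partitions unless the partition key is also restricted.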

Related

PySpark dataframe drops records while writing to a hive table

I am trying to write a PySpark dataframe to a Hive table, which is also created by the line below:
parks_df.write.mode("overwrite").saveAsTable("fs.PARKS_TNTO")
When I print the count of the dataframe with parks_df.count(), I get 1000 records.
But in the final table fs.PARKS_TNTO I get 980 records, so 20 records are being dropped. How can I resolve this issue? Also, how can I capture the records that are being dropped? There are no partitions on the final table fs.PARKS_TNTO.

Apache Hive Add TIMESTAMP partition using alter table statement

I'm currently running MSCK REPAIR TABLE SCHEMA.TABLENAME for all my tables after data is loaded.
As the partitions grow, this statement takes much longer (sometimes more than 5 minutes) for one table. I know it scans and parses all partitions in S3 (where my data is) and then adds the latest partitions to the Hive metastore.
I want to replace MSCK REPAIR with an ALTER TABLE ADD PARTITION statement. MSCK REPAIR works perfectly fine for adding the latest partitions, but I'm facing a problem with the TIMESTAMP value in the partition when using ALTER TABLE ADD PARTITION.
I have a table with four partition columns (part_dt STRING, part_src STRING, part_src_file STRING, part_ldts TIMESTAMP).
After running MSCK REPAIR, the SHOW PARTITIONS command gives me the output below:
hive> show partitions hub_cont;
OK
part_dt=20181016/part_src=asfs/part_src_file=kjui/part_ldts=2019-05-02 06%3A30%3A39
But when I drop the above partition from the metastore and recreate it using ALTER TABLE ADD PARTITION:
hive> alter table hub_cont add partition(part_dt='20181016',part_src='asfs',part_src_file='kjui',part_ldts='2019-05-02 06:30:39');
OK
Time taken: 1.595 seconds
hive> show partitions hub_cont;
OK
part_dt=20181016/part_src=asfs/part_src_file=kjui/part_ldts=2019-05-02 06%3A30%3A39.0
Time taken: 0.128 seconds, Fetched: 1 row(s)
It is adding .0 at the end of the timestamp value. When I query the table for this partition, I get 0 records.
Is there a way to add a partition that has a timestamp value without this zero being added at the end? I can't figure out how MSCK REPAIR handles this case when the ALTER TABLE statement cannot.
The same should happen if you insert with dynamic partitions: new partitions are created with .0 because the default timestamp string representation includes the milliseconds part. REPAIR TABLE finds the new folders and adds the partitions to the metastore, and it also works correctly because a timestamp string without milliseconds is quite compatible with the timestamp...
The solution is to use STRING instead of TIMESTAMP and remove the milliseconds explicitly.
But first of all, double-check that you really have millions of rows in a single partition and really need timestamp-grain partitioning rather than DATE, and that this partition column is really significant (for example, if it is functionally dependent on another partition column such as part_src_file, you can get rid of it completely). Too many partitions will cause performance degradation.
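As a rough illustration of the STRING approach, the sketch below formats the load timestamp to whole seconds and builds the ALTER TABLE statement from it. It assumes part_ldts has been redefined as a STRING partition column; the helper names `partition_value` and `add_partition_sql` are hypothetical, and the statement text simply mirrors the one in the question.

```python
from datetime import datetime


def partition_value(ts):
    """Format a timestamp to whole seconds, dropping any fractional part."""
    return ts.strftime("%Y-%m-%d %H:%M:%S")


def add_partition_sql(table, part_dt, part_src, part_src_file, ts):
    # Assumes part_ldts is a STRING partition column, as the answer suggests.
    return (
        f"ALTER TABLE {table} ADD PARTITION ("
        f"part_dt='{part_dt}', "
        f"part_src='{part_src}', "
        f"part_src_file='{part_src_file}', "
        f"part_ldts='{partition_value(ts)}')"
    )


if __name__ == "__main__":
    load_ts = datetime(2019, 5, 2, 6, 30, 39)
    print(add_partition_sql("hub_cont", "20181016", "asfs", "kjui", load_ts))
```

Formatting the value on the client side keeps the partition name identical to the folder name that MSCK REPAIR would have discovered.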

Cassandra: Does SELECT COUNT(*) differ between versions 2.x and 3.x?

I'm migrating data from a Cassandra cluster on version 2.2.4 to one on 3.11.3 by exporting the table as a CSV file and using it to populate a new table in the new cluster. I'm using SELECT COUNT(*) to verify that the data has been copied over correctly, but I am seeing a discrepancy in the number of rows. Could this be because of the difference in versions? Is there anything else that would explain it? Thanks!
Here are the steps I'm running through:
SELECT COUNT(*) FROM table_cass2
count
-------
7951
(1 rows)
COPY table_cass2 TO '/tmp/table.csv'
COPY table_cass3 FROM '/tmp/table.csv'
Using 15 child processes
Starting copy of <table> with columns [..].
Processed: 7951 rows; Rate: 3741 rows/s; Avg. rate: 6045 rows/s
7951 rows imported from 1 files in 1.315 seconds (0 skipped).
SELECT COUNT(*) FROM table_cass3
count
-------
7919
(1 rows)
To answer my own question, someone else on my team confirmed that it is normal for there to be a small but consistent difference in results for SELECT COUNT(*) queries between different instances of Cassandra.
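One way to narrow down such a discrepancy, assuming the exported CSV is still available, is to compare the number of rows in the file with the number of distinct primary-key values. Cassandra writes are upserts, so rows that share a primary key in the target table overwrite each other on import. The sketch below is only a diagnostic aid: `key_indexes` (the positions of the target table's primary-key columns in the CSV) is a hypothetical parameter, and the file is assumed to have no header row.

```python
import csv
import io


def count_rows_and_keys(csv_file, key_indexes):
    """Return (total rows, distinct key tuples) for an open CSV file."""
    total = 0
    keys = set()
    for row in csv.reader(csv_file):
        if not row:  # skip blank lines
            continue
        total += 1
        keys.add(tuple(row[i] for i in key_indexes))
    return total, len(keys)


if __name__ == "__main__":
    # Tiny in-memory stand-in for /tmp/table.csv.
    sample = io.StringIO("a,1,x\na,1,y\nb,2,z\n")
    total, distinct = count_rows_and_keys(sample, key_indexes=[0, 1])
    print(f"{total} rows, {distinct} distinct keys")
```

If the two numbers differ, the gap between them is the number of rows that would collapse into existing rows on import.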

Merge very large hive Tables (11 to be precise) using Spark

I am basically substituting for another programmer.
Problem description:
There are 11 Hive tables, each with 8 to 11 columns. All of these tables have around 5 columns whose names are similar but which hold different values.
For example, Table A has mobile_no, date, and duration columns, and so does Table B, but the values are not the same. The other columns have different names in each table.
In all tables the data types are simple ones: string, integer, double. String values are at most 100 characters long.
Each table contains around 50 million records. I need to merge these 11 tables, taking their columns as they are, into one big table.
Our Spark cluster has 20 physical servers, each with 36 cores (72 if you count virtualization) and 512 GB of RAM. The Spark version is 2.2.x.
I have to do the merge efficiently in terms of both memory and speed.
Can you help me with this problem?
N.B.: please let me know if you have questions.

Select count(*) unstable on wide rows - Cassandra 2.1.2

I'm running a 4-node Cassandra 2.1.2 cluster (6 cores per machine, 32 GB RAM).
I have 2 similar tables with about 650K rows each. The rows are pretty wide: 150K columns.
On the first table, running select count(*) from cqlsh consistently returns the same result (the actual number of rows), but on the second table I get completely different values from run to run.
The only difference between the two tables is that the second table has a column containing a collection (list) of 3 Doubles, whereas the first table has a single Double in that column.
There is no data being inserted into the tables, and there are no compactions going on.
The row cache is disabled.
Any ideas on how to fix this?
