Why is UNCACHE TABLE in spark-sql not working? - apache-spark

I'm learning Spark SQL. I used spark-sql to uncache a table that I had previously cached, but after submitting the UNCACHE command I can still query the cached table. Why does this happen?
Spark version 3.2.0 (Pre-built for Apache Hadoop 2.7)
Hadoop version 2.7.7
Hive metastore 2.3.9
Linux Info
Static hostname: master
Icon name: computer-vm
Chassis: vm
Machine ID: 15c**********************10b2e19
Boot ID: 48b**********************efc169b
Virtualization: vmware
Operating System: Ubuntu 18.04.6 LTS
Kernel: Linux 4.15.0-163-generic
Architecture: x86-64
spark-sql (default)> CACHE TABLE testCache SELECT * FROM students WHERE AGE = 13;
Error in query: Temporary view 'testCache' already exists
spark-sql (default)> UNCACHE TABLE testCache;
Response code
Time taken: 0.092 seconds
spark-sql (default)> SELECT * FROM testCache;
NAME rollno AGE
Kent 8 21
Marry 1 10
Eddie Davis 5 13
Amy Smith 3 13
Barron 3 12
Fleur Laurent 4 9
Ivy 3 8
Time taken: 0.492 seconds, Fetched 7 row(s)

UNCACHE TABLE removes the entries and associated data from the in-memory and/or on-disk cache for a given table or view; it does not drop the table or view itself. So you can still query it, and the result is simply recomputed instead of being read from the cache.
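For illustration, a minimal PySpark sketch of the same behaviour (it assumes the students table from the question exists in the metastore):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uncache-demo").enableHiveSupport().getOrCreate()

# CACHE TABLE ... SELECT creates a temporary view named testCache and caches its result
spark.sql("CACHE TABLE testCache SELECT * FROM students WHERE AGE = 13")

# UNCACHE only evicts the cached data; the temporary view definition is untouched
spark.sql("UNCACHE TABLE testCache")

# Still works: the view is simply recomputed from students instead of read from the cache
spark.sql("SELECT * FROM testCache").show()

# To make the name go away entirely, drop the temporary view itself
spark.catalog.dropTempView("testCache")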

Related

Can I add Apache Cassandra nodes to a DataStax (DSE) Cassandra cluster?

I am in the process of migrating from DataStax Enterprise (DSE) Cassandra to Apache Cassandra 3.11.
I have a cluster of 7 DSE Cassandra nodes.
Is there a way to create a new Apache Cassandra cluster and connect it to the DSE cluster so that my writes go to both DSE and Apache Cassandra?
That way, once data is being written to both clusters, I can gradually migrate my read APIs from DSE to Apache Cassandra.
Yes, I've done this before.
First of all, find the exact Cassandra version (not the DSE version) that your cluster is running:
SELECT release_version FROM system.local;
release_version
-----------------
3.11.4
You can also see this version number when connecting with cqlsh. The DSE version of Cassandra will have a (long) build number added on to that. But the idea is that the version of Apache Cassandra on new nodes should match the DSE version of Cassandra as closely as possible.
Next, build up your Apache Cassandra "replacement" nodes as a new logical datacenter. Make sure that they use a different dc_name (than the existing nodes) in the cassandra-rackdc.properties file. The first node (or two) should use nodes from the existing cluster as seed nodes. The following nodes can then use the first nodes as seeds. Plus, the cluster_name needs to match.
Now check the keyspace definitions for system_auth, system_traces, system_distributed, and any keyspaces that the app needs. Make sure they're using NetworkTopologyStrategy; if not, change them to it and configure the replication factor (RF) for the existing DC (the DC name must match the dc_name of the existing DSE nodes). Then you can extend replication to the new data center.
If the current dc_name is DSE_DC and the new dc_name is AC_DC, then:
ALTER KEYSPACE yourkeyspace WITH replication =
{'class': 'NetworkTopologyStrategy',
'DSE_DC': '3', 'AC_DC': '3'};
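To sanity-check that each keyspace now lists both DCs, you can read system_schema.keyspaces; a minimal sketch with the Python driver (the contact point is a placeholder):

from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])   # placeholder; any reachable node will do
session = cluster.connect()

# system_schema.keyspaces holds each keyspace's replication map (Cassandra 3.x)
for row in session.execute("SELECT keyspace_name, replication FROM system_schema.keyspaces"):
    print(row.keyspace_name, row.replication)

cluster.shutdown()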
Once that change is done, run a nodetool rebuild on each new Apache Cassandra node.
nodetool rebuild -- DSE_DC
That will stream the data from DSE_DC to the current node. Then, you should be able to switch your API over by specifying the new data center name.
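With the Python driver, specifying the new data center name might look like the following (keyspace name and contact points are placeholders):

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy

# Pin the driver to the new Apache Cassandra datacenter (AC_DC in the example above)
profile = ExecutionProfile(load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="AC_DC"))

cluster = Cluster(
    ["10.0.0.1", "10.0.0.2"],                          # placeholder contact points in AC_DC
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("yourkeyspace")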
Edit 20200506
Check your data directories. The most important thing that needs to match up for this to work is the SSTable format.
ver 3.11.4+
43 Feb 20 08:55 md-1-big-CompressionInfo.db
83 Feb 20 08:55 md-1-big-Data.db
10 Feb 20 08:55 md-1-big-Digest.crc32
16 Feb 20 08:55 md-1-big-Filter.db
17 Feb 20 08:55 md-1-big-Index.db
4769 Feb 20 08:55 md-1-big-Statistics.db
57 Feb 20 08:55 md-1-big-Summary.db
92 Feb 20 08:55 md-1-big-TOC.txt
ver 4.0-alpha4:
47 May 6 10:13 na-1-big-CompressionInfo.db
107 May 6 10:13 na-1-big-Data.db
10 May 6 10:13 na-1-big-Digest.crc32
16 May 6 10:13 na-1-big-Filter.db
32 May 6 10:13 na-1-big-Index.db
4687 May 6 10:13 na-1-big-Statistics.db
66 May 6 10:13 na-1-big-Summary.db
92 May 6 10:13 na-1-big-TOC.txt
You can also verify this in DataStax's Product Compatibility Guide.
Basically, if your SSTable files are prefixed with m[a,b,c,d], then 3.11.6 should be able to work.
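If it helps, here is a small sketch (the path is hypothetical) that tallies which SSTable format prefixes are present under a data directory:

import os
import re
from collections import Counter

DATA_DIR = "/var/lib/cassandra/data/yourkeyspace"   # hypothetical path; point it at your keyspace

# Data components are named <format>-<generation>-big-Data.db, e.g. md-1-big-Data.db
pattern = re.compile(r"^([a-z]{2})-\d+-big-Data\.db$")

counts = Counter()
for root, _dirs, files in os.walk(DATA_DIR):
    for name in files:
        match = pattern.match(name)
        if match:
            counts[match.group(1)] += 1

print(counts)   # e.g. Counter({'md': 12}) indicates the 3.11.x "md" format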

PySpark code to read from a Cassandra table takes almost 14 minutes to read 6 GB of data

Spark cluster: I am using 4 cores and 4 executor instances.
The Cassandra table data size after filtering is 6 GB.
I am reading data from this Cassandra table using PySpark code.
I am applying a filter on the partition keys (3 partition keys).
Predicate pushdown is happening.
One of the partition-key filters is a list of 5000 values.
This simple read is taking more than 14 minutes.
Is this the expected time, or can we achieve it in less time?
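For reference, the read described above presumably looks roughly like this with the spark-cassandra-connector (host, keyspace, table, and column names are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("cassandra-read")
         .config("spark.cassandra.connection.host", "cassandra-host")   # placeholder host
         .getOrCreate())

pk3_values = ["v1", "v2", "v3"]   # in the question this list has about 5000 values

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")   # placeholders
      .load()
      # filters on partition-key columns can be pushed down to Cassandra
      .filter((col("pk1") == "a") & (col("pk2") == "b") & col("pk3").isin(pk3_values)))

df.explain()    # look for PushedFilters in the physical plan to confirm pushdown
print(df.count())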

Tez vs Spark - huge performance differences

I'm using HDP 2.6.4 and am seeing huge differences between Spark SQL and Hive on Tez. Here's a simple query on a table of ~95 M rows:
SELECT DT, Sum(1) from mydata GROUP BY DT
DT is the partition column, a string that marks the date.
In spark-shell, with 15 executors, 10 GB of memory for the driver and 15 GB per executor, the query runs for 10-15 seconds.
When running on Hive (from beeline), the query runs (actually, it is still running) for 500+ seconds. (!!!)
To make things worse, this application takes significantly more resources than the spark-shell session I ran the job in.
UPDATE: It finished 1 row selected (672.152 seconds)
More information about the environment:
Only one queue used, with capacity scheduler
The job runs under my own user. We use Kerberos with LDAP.
AM Resource: 4096 MB
using tez.runtime.compress with Snappy
data is in Parquet format, no compression applied
tez.task.resource.memory 6134 MB
tez.counters.max 10000
tez.counters.max.groups 3000
tez.runtime.io.sort.mb 8110 MB
tez.runtime.pipelined.sorter.sort.threads 2
tez.runtime.shuffle.fetch.buffer.percent 0.6
tez.runtime.shuffle.memory.limit.percent 0.25
tez.runtime.unordered.output.buffer.size-mb 460 MB
Enable Vectorization and Map Vectorization true
Enable Reduce Vectorization false
hive.vectorized.groupby.checkinterval 4096
hive.vectorized.groupby.flush.percent 0.1
hive.tez.container.size 682
More Updates:
When checking about vectorization on this link, I noticed I don't see Vectorized execution: true anywhere when I used explain. Another thing that caught my attention is the following: table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Namely, when checking table itself: STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' and OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
Comparisons between Spark and Tez usually come out roughly even, but I'm seeing dramatic differences.
What should be the first thing to check?
Thanks
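One hedged way to double-check what the metastore reports for the table, and what plan is actually produced, is from the Spark side (the table name is taken from the query above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("check-mydata").enableHiveSupport().getOrCreate()

# Shows the InputFormat/OutputFormat/SerDe the metastore has recorded for the table
spark.sql("DESCRIBE FORMATTED mydata").show(100, truncate=False)

# The extended physical plan reveals which reader (e.g. a Parquet scan) is actually used
spark.sql("SELECT DT, SUM(1) FROM mydata GROUP BY DT").explain(True)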
In the end, we gave up and installed LLAP. I'm going to accept it as an answer, as I have sort of an OCD and this unanswered question has been poking my eyes for long enough.

Spark 1.4.1 dataframe queries on Hive ORC tables take forever

I am using Apache Spark 1.4.1 (which is integrated with Hive 0.13.1) along with Hadoop 2.7.
I have created an ORC table with Snappy compression in Hive and inserted around 50 million records into it using the Spark DataFrame API (insertInto method), as below:
inputDF.write.format("orc").mode(SaveMode.Append).partitionBy("call_date","hour","batch_id").insertInto("MYTABLE")
This table has around 50-60 columns, with 3 columns being varchar and all other columns being either INT or FLOAT.
My problem is that when I query the table using the Spark commands below:
var df1 = hiveContext.sql("select * from MYTABLE")
val count1 = df1.count()
the query never returns and is stuck for several hours. The Spark console logs are stuck at the following:
16/12/02 00:50:46 INFO DAGScheduler: Submitting 2700 missing tasks from ShuffleMapStage 70 (MapPartitionsRDD[553] at cache at MYTABLE_LOAD.scala:498)
16/12/02 00:50:46 INFO YarnScheduler: Adding task set 70.0 with 2700 tasks
The table has 2700 part files in the warehouse directory.
I have tried coalescing the inputDF to 10 partitions before inserting into the table, which created 270 part files for the table instead of 2700, but querying the table gives the same issue, i.e. the query never returns.
The strange thing is that when I invoke the same select query via spark-shell (invoked with 5g driver memory), the query gives results in less than a minute.
Even for other ORC tables (not Snappy compressed), querying them using hiveContext.sql with very simple queries (select from table where ) is taking more than 10 minutes.
Can someone please advise what could be the issue here? I don't think there is something wrong with the table, as the spark-shell query wouldn't have worked in that case.
Many thanks in advance.

How to use sstableloader?

I use Cassandra 3.4 on some CentOS 7 machines.
I have 2 clusters:
Cluster 1 with 2 DCs: DC1 has 2 machines (192.168.0.171/192.168.172), DC2 has 1 machine (192.168.0.173). Cluster 1 has some data on it, in one keyspace with replication 2 : 1.
Cluster 2 with 1 datacenter: DC3 has 2 machines (192.168.0.174/192.168.0.175).
On the second cluster, DC3, I created the keyspace "keyspace1" with NetworkTopologyStrategy : DC3 : 2.
I ran some cassandra-stress writes against 192.168.0.175:
cassandra-stress write n=1000000 -node 192.168.0.175
At this point cassandra-stress should have generated some garbage data.
I checked /var/lib/cassandra/data/keyspace1/standard1-97a771600d4011e69a5a13282caaa658 and there I have ma-1-big-Data.db (57 MB), ma-2-big-Data.db (65 MB), ma-3-big-Data.db (65 MB).
My question:
Let's assume the garbage data is actual data and I want to stream this data from Cluster 2 into Cluster 1.
How can I do that using sstableloader?
NOTE: Please give, if possible, examples with commands (I'm quite a newbie in this domain :( )
bin/sstableloader -d 192.168.0.171,192.168.172 /var/lib/cassandra/data/keyspace1/standard1-97a771600d4011e69a5a13282caaa658
This command will load the data from one cluster into the other cluster.
Note: the keyspace and table must exist in both clusters, and the tables must have the same schema.
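Since the schema must already exist on the destination, you could create it there first; a hedged sketch with the Python driver (the keyspace name and replication settings are taken from the question and may need adjusting):

from cassandra.cluster import Cluster

# Connect to the destination cluster (Cluster 1 in the question)
cluster = Cluster(["192.168.0.171"])
session = cluster.connect()

# Replication here mirrors the 2 : 1 layout described for Cluster 1; adjust as needed
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS keyspace1
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 2, 'DC2': 1}
""")

# The table must match the source schema exactly; replay the source's CREATE TABLE here,
# e.g. the statement shown by DESCRIBE in cqlsh on Cluster 2.
# session.execute("CREATE TABLE IF NOT EXISTS keyspace1.standard1 ( ... )")

cluster.shutdown()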
