I want to migrate 500 tables from MySQL to Cassandra, but I do not want to create the schemas in Cassandra by hand before the migration.
I know about the CQL-IMPORT option in Sqoop, but it only copies data into tables that already exist in Cassandra.
Is there any way to have all the table structures copied from MySQL and converted to Cassandra schema format? Creating 500 tables in Cassandra, each with more than 100 columns, will be time-consuming.
Please help.
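One way to avoid writing the DDL by hand is to generate it from MySQL's own metadata. Below is a minimal Python sketch, assuming you have already read each table's columns (for example from information_schema.columns, using COLUMN_NAME, DATA_TYPE and COLUMN_KEY) into a list of tuples. The function name cql_create_table and the MYSQL_TO_CQL type mapping are hypothetical and only illustrative; they are not part of Sqoop or any Cassandra tool. Reusing the MySQL primary key as the Cassandra primary key is also an assumption that should be reviewed per table.

```python
# Hypothetical mapping from MySQL DATA_TYPE values to CQL types.
# This is an assumption for illustration; extend or adjust it for your data.
MYSQL_TO_CQL = {
    "tinyint": "tinyint",
    "smallint": "smallint",
    "int": "int",
    "bigint": "bigint",
    "float": "float",
    "double": "double",
    "decimal": "decimal",
    "char": "text",
    "varchar": "text",
    "text": "text",
    "blob": "blob",
    "date": "date",
    "datetime": "timestamp",
    "timestamp": "timestamp",
}


def cql_create_table(keyspace, table, columns):
    """Build a CQL CREATE TABLE statement.

    columns is a list of (name, mysql_type, is_primary_key) tuples,
    in the order they should appear in the table.
    """
    pk = [name for name, _, is_pk in columns if is_pk]
    if not pk:
        # Cassandra requires a primary key on every table.
        raise ValueError(f"table {table} has no primary key column")
    # Unknown MySQL types fall back to text (an assumption, not a rule).
    col_defs = [
        f"{name} {MYSQL_TO_CQL.get(mysql_type.lower(), 'text')}"
        for name, mysql_type, _ in columns
    ]
    col_defs.append(f"PRIMARY KEY ({', '.join(pk)})")
    body = ",\n  ".join(col_defs)
    return f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} (\n  {body}\n);"


# Example with made-up column metadata.
print(cql_create_table(
    "ks", "users",
    [("id", "int", True), ("name", "varchar", False), ("created", "datetime", False)],
))
```

Looping this over all 500 tables and writing the output to a .cql file would give a script to run before the data import. Note that with a multi-column key, CQL treats the first column as the partition key and the rest as clustering columns, so the generated keys may need manual adjustment to suit your queries.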
When I query a table in my Cassandra database (on an EC2 instance), data from before 2022-04-15 is not showing. The table had data going back to November 2021.
I am new to Cassandra. What could have happened? Last week I exported data to CSV in addition to my usual work.
Need some help as we are baffled. Using Impala SQL, we did ALTER TABLE to add 3 columns to a parquet table. The table is used by both Spark (v2) and Impala jobs.
After the columns were added, Impala correctly reports the new columns using describe. However, Spark does not report the freshly added columns when spark.sql("describe tablename") is executed.
We double checked Hive and it correctly reports the added columns.
We ran refresh table tablename in spark but it still doesn't see the new columns. We believe we must be overlooking something simple. What step did we miss?
Update: Impala sees the table with the new columns, but Spark does not acknowledge them. After reading more about Spark, it appears that the Spark engine reads the schema from the Parquet file rather than from the Hive metastore. The suggested workaround did not work, and the only recourse we could find was to drop the table and rebuild it.
I am connecting to a Delta table in an Azure Gen2 data lake by mounting it in Databricks and creating a table ('using delta'). I then connect to this table in Power BI using the Databricks connector.
Firstly, I am unclear as to the relationship between the data lake and the Spark table in Databricks. Is it correct that the Spark table retrieves the latest snapshot from the data lake (delta lake) every time it is itself queried? Is it also the case that it is not possible to effect changes in the data lake via operations on the Spark table?
Secondly, what is the best way to reduce the columns in the Spark table (ideally before it is read into Power BI)? I have tried creating the Spark table with a specified subset of columns, but I get a "cannot change schema" error. Instead, I can create another Spark table that selects from the first Spark table, but this seems pretty inefficient and (I think) will need to be recreated frequently in line with the refresh schedule of the Power BI report. Is it possible to have a Spark Delta table that references another Spark Delta table, so that the former is also always the latest snapshot when queried?
As you can tell, my understanding of this is limited (as is the documentation!) but any pointers very much appreciated.
Thanks in advance and for reading!
A table in Spark is just metadata that specifies where the data is located. When you read the table, Spark looks up in the metastore where the data is stored, what its schema is, and so on, and then accesses that data. Changes made on ADLS will therefore also be reflected in the table. It is also possible to modify the data through the table, but that depends on what access rights are available to the Spark cluster that processes the data: you can set permissions either at the ADLS level or by using table access control.
For the second part, you just need to create a view over the original table that selects only the limited set of columns. The data is not copied, and the latest updates in the original table will always be available for querying. Something like:
CREATE OR REPLACE VIEW myview
AS SELECT col1, col2 FROM mytable
P.S. If you're only accessing the data via Power BI or other BI tools, you may want to look into Databricks SQL (once it is in public preview), which is heavily optimized for BI use cases.
I am trying to MERGE two tables using Spark SQL and am getting an error with the statement.
The tables are created as external tables pointing to Azure ADLS storage. The SQL is executed using Databricks.
Table 1:
Name,Age,Sex
abc,24,M
bca,25,F
Table 2:
Name,Age,Sex
abc,25,M
acb,25,F
Table 1 is the target table and Table 2 is the source table.
Table 2 contains one insert record and one update record, which need to be merged into the target Table 1.
Query:
MERGE INTO table1 USING table2 ON (table1.name = table2.name)
WHEN MATCHED AND (table1.age <> table2.age OR table1.sex <> table2.sex)
THEN UPDATE SET table1.age = table2.age, table1.sex = table2.sex
WHEN NOT MATCHED
THEN INSERT (name, age, sex) VALUES (table2.name, table2.age, table2.sex)
Does Spark SQL support MERGE, or is there another way of achieving this?
Thanks
Sat
To use MERGE you need Delta Lake (and the associated jars). With Delta tables, you can then use MERGE.
Otherwise, SQL MERGE is not supported by Spark, and you need to implement the logic yourself with the DataFrame writer APIs. There are a few different ways to do this. Even with ORC ACID tables, Spark will not work in this way.
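To make the "own logic" concrete, here is a plain-Python illustration of the upsert semantics that MERGE expresses, using the sample rows from the question. This is not Spark code: merge_rows is a hypothetical helper, and matching on the Name column is an assumption taken from the ON clause in the question. In Spark, the same idea would have to be expressed with DataFrame operations and the result written back to storage.

```python
def merge_rows(target, source, key="Name"):
    """Upsert source rows into target rows, matching on `key`.

    Rows whose key exists in target are replaced (the WHEN MATCHED branch);
    rows whose key does not exist are appended (the WHEN NOT MATCHED branch).
    """
    # Index the target by key; dicts keep insertion order in Python 3.7+.
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        merged[row[key]] = dict(row)
    return list(merged.values())


# Sample data from the question.
table1 = [
    {"Name": "abc", "Age": 24, "Sex": "M"},
    {"Name": "bca", "Age": 25, "Sex": "F"},
]
table2 = [
    {"Name": "abc", "Age": 25, "Sex": "M"},
    {"Name": "acb", "Age": 25, "Sex": "F"},
]

for row in merge_rows(table1, table2):
    print(row)
```

Here abc is updated to age 25, bca is left untouched, and acb is inserted.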
For example: I want to create 40 tables in one keyspace, and I want to shard 3 of those 40 tables. Is it possible to shard specific tables without creating a new keyspace?
I have seen "How to shard only specific tables using vitess", but that approach requires creating a new keyspace, which I don't want to do. I want sharded and unsharded tables in one keyspace. Is this possible?
This is currently not possible. A keyspace is categorized as sharded or unsharded. So, you have to migrate the tables you want to shard into a sharded keyspace and then reshard the keyspace.
Some people worked around this by assigning a "null primary vindex" to the unsharded tables, essentially forcing all rows to live in the first shard. But I don't know if this was experimental or was actually used in production.