MySQL Cluster: “The table is full” with disk storage - mysql-cluster

I have a problem with MySQL Cluster.
I created a table intended to be stored on disk, but I think it is always stored in memory.
I created the table with the query below:
CREATE TABLE IF NOT EXISTS user (
`FIELD_KEY` varchar(255) NOT NULL,
`FIELD0` text,
`FIELD1` text,
`FIELD2` text,
`FIELD3` text,
`FIELD4` text,
`FIELD5` text,
`FIELD6` text,
`FIELD7` text,
`FIELD8` text,
`FIELD9` text,
PRIMARY KEY (FIELD_KEY)
) ENGINE=ndbcluster DEFAULT CHARSET=utf8 max_rows=40000000
TABLESPACE ts_1 STORAGE DISK;
ALTER TABLE usertable4M PARTITION BY KEY() PARTITIONS 4;
My Environment (3 nodes in the same VM):
MySQL Cluster 7.5
management nodes: 1
data node: 1
sql nodes: 1
VM Config : 16GB RAM, 100GB HDD
MySQL Cluster settings (config.ini) are set to:
[ndbd default]
noofreplicas=1
DataMemory=2048M
# Memory to allocate for index storage
IndexMemory=1024M
# To support multi-thread processors
MaxNoOfExecutionThreads=2
FragmentLogFileSize=256M
NoOfFragmentLogFiles=12
#-------
RedoBuffer=32M
SharedGlobalMemory=256M
DiskPageBufferMemory=4096M
NoOfFragmentLogParts = 4
#-------
[ndb_mgmd]
HostName=192.168.1.10
DataDir=/var/lib/mysql-cluster
[ndbd]
HostName=192.168.1.10
NodeId=2
DataDir=/usr/local/mysql/data
[mysqld]
HostName=192.168.1.10
In order to store data on disk, I created a LOGFILE GROUP and a TABLESPACE with the queries below:
CREATE LOGFILE GROUP lg ADD UNDOFILE 'undo1.dat'
INITIAL_SIZE=12288M
UNDO_BUFFER_SIZE=32M
ENGINE=NDB;
CREATE TABLESPACE ts_1
ADD DATAFILE 'data1.dat'
USE LOGFILE GROUP lg
INITIAL_SIZE=256M
ENGINE=NDB;
ALTER TABLESPACE ts_1 ADD DATAFILE 'data2.dat' INITIAL_SIZE=256M ENGINE=NDB;
.
.
.
ALTER TABLESPACE ts_1 ADD DATAFILE 'data24.dat' INITIAL_SIZE=256M ENGINE=NDB;
MemoryUsage report
$ sudo ndb_mgm -e 'all report MemoryUsage'
Connected to Management Server at: 192.168.1.10:1186
Node 2: Data usage is 94%(62259 32K pages of total 65536)
Node 2: Index usage is 2%(2801 8K pages of total 131104)
When I ran a query to insert 4,000,000 records (about 4 GB) into my table, I got the error "The table 'user' is full"; no more data could be inserted, and only 1,611,787 records made it into the table. I don't understand why this is so.
Could somebody explain the situation to me, and how I can resolve the problem?

"In a Disk Data table, the first 256 bytes of a TEXT or BLOB column are stored in memory; only the remainder is stored on disk."
You can check this in detail in the MySQL documentation on NDB Cluster Disk Data tables.
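In your case the values are small: 4,000,000 rows totalling about 4 GB is roughly 1 KB per row, i.e. around 100 bytes per TEXT field, so each value most likely fits entirely within those first 256 bytes and therefore lives in DataMemory rather than in the ts_1 tablespace. At roughly 1.2 KB per row, DataMemory=2048M runs out after about 1.6 million rows, which matches both the 94% data usage in the report and the 1,611,787 rows you managed to insert. Besides raising DataMemory, one way around this, assuming each value fits within a bounded length, is to declare the columns as VARCHAR with a column-level STORAGE DISK attribute so the full values go to the tablespace. A rough sketch (the VARCHAR sizes are an assumption):
-- Sketch only: column-level STORAGE DISK (supported for NDB tables) keeps the
-- full VARCHAR values in the ts_1 tablespace instead of DataMemory.
-- VARCHAR(255) is an assumed size; adjust it to your real field lengths.
CREATE TABLE IF NOT EXISTS user (
  `FIELD_KEY` varchar(255) NOT NULL,
  `FIELD0` varchar(255) STORAGE DISK,
  `FIELD1` varchar(255) STORAGE DISK,
  `FIELD2` varchar(255) STORAGE DISK,
  `FIELD3` varchar(255) STORAGE DISK,
  `FIELD4` varchar(255) STORAGE DISK,
  `FIELD5` varchar(255) STORAGE DISK,
  `FIELD6` varchar(255) STORAGE DISK,
  `FIELD7` varchar(255) STORAGE DISK,
  `FIELD8` varchar(255) STORAGE DISK,
  `FIELD9` varchar(255) STORAGE DISK,
  PRIMARY KEY (FIELD_KEY)
) ENGINE=ndbcluster DEFAULT CHARSET=utf8 max_rows=40000000
TABLESPACE ts_1 STORAGE DISK;
You can verify the effect by running ndb_mgm -e 'all report MemoryUsage' again after reloading the data.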

Related

Load spark bucketed table from disk previously written via saveAsTable

Version: DBR 8.4 | Spark 3.1.2
Spark allows me to create a bucketed hive table and save it to a location of my choosing.
df_data_bucketed = (df_data.write.mode('overwrite').bucketBy(9600, 'id').sortBy('id')
.saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)
I have verified that this saves the table data to my specified path (in my case, blob storage).
In the future, the table 'data_bucketed' might be wiped from my Spark catalog, or mapped to something else, and I'll want to "recreate" it using the data that was previously written to blob storage, but I can find no way to load a pre-existing, already-bucketed Spark table.
The only thing that appears to work is
df_data_bucketed = (spark.read.format("parquet").load(bucketed_path)
.write.mode('overwrite').bucketBy(9600, 'id').sortBy('id')
.saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)
This seems nonsensical, because it essentially loads the data from disk and unnecessarily overwrites it with the exact same data just to take advantage of the buckets. (It's also very slow due to the size of the data.)
You can use Spark SQL to create that table in your catalog:
spark.sql("""CREATE TABLE IF NOT EXISTS tbl...""")
Following this, you can tell Spark to rediscover the data by running spark.sql("MSCK REPAIR TABLE tbl").
I found the answer at https://www.programmerall.com/article/3196638561/
Read from the saved Parquet file: if you want to use historically saved data, you can't use the above method, nor can you read it with spark.read.parquet() like a regular file, because data read that way does not carry bucket information. The correct way is to use a CREATE TABLE statement. For details, refer to https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [COMMENT col_comment1], ...)]
USING data_source
[OPTIONS (key1=val1, key2=val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]
Examples are as follows:
spark.sql(
"""
|CREATE TABLE bucketed
| (name string)
| USING PARQUET
| CLUSTERED BY (name) INTO 10 BUCKETS
| LOCATION '/path/to'
|""".stripMargin)

how to create Synapse serverless pool partition table similar to hive

In Hive we can create a partitioned table as follows:
CREATE EXTERNAL TABLE testdb.test_table(name string, age int)
PARTITIONED BY (dept string)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION '/path/to/dataFile/';
for files spread across locations like:
/path/to/dataFile/dept1
/path/to/dataFile/dept2
/path/to/dataFile/dept3
and then we can add the partitions like this:
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept1') LOCATION '/path/to/dataFile/dept1';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept2') LOCATION '/path/to/dataFile/dept2';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept3') LOCATION '/path/to/dataFile/dept3';
In Azure, our files are spread across different folders in a container. I need to create a partitioned external table in the Synapse serverless pool. The syntax I am following is:
CREATE EXTERNAL TABLE [testdb].[test1]
(
[STUDYID] varchar(2000) ,[SITEID] varchar(2000) )
WITH
(
    LOCATION = '/<abc_location>/csv/archive/',
    DATA_SOURCE = [datalake],
    FILE_FORMAT = [csv_comma_values]
)
I was checking the Azure docs but didn't find any relevant documentation for this. Is there any way we can achieve something similar to the Hive code?
Partitioned external tables are currently not available in Azure Synapse.
The location starts from the root folder, and by specifying /** you can return the files from subfolders. The external table will skip files whose names start with an underscore (_) or a period (.).
CREATE EXTERNAL TABLE sample_table
(
firstname varchar(50),
lastname varchar(50)
)
WITH (
LOCATION = '/csv/**',
DATA_SOURCE = blob_storage,
FILE_FORMAT = csv_file_format
)
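Applied to the folder layout from the Hive example above, the wildcard form would look roughly like this (the table name is made up, the dept value itself is not exposed as a column this way, and [datalake] / [csv_comma_values] stand for your existing data source and file format):
-- Sketch: one external table reading every file under dept1, dept2, dept3, ...
CREATE EXTERNAL TABLE [testdb].[test_all_depts]
(
    [name] varchar(2000),
    [age] int
)
WITH (
    LOCATION = '/path/to/dataFile/**',
    DATA_SOURCE = [datalake],
    FILE_FORMAT = [csv_comma_values]
);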

How to awk specific fields from python subprocess output

Tablespace name = SYSCATSPACE
Tablespace ID = 0
Tablespace Type = Database managed space
Tablespace Content Type = All permanent data. Regular table space.
Tablespace Page size (bytes) = 16384
Tablespace Extent size (pages) = 4
Automatic Prefetch size enabled = Yes
Buffer pool ID currently in use = 1
Buffer pool ID next startup = 1
Using automatic storage = Yes
Auto-resize enabled = Yes
How can I select the tablespace name if the auto-resize flag is set to Yes? (I need to do this for X number of tablespaces that are displayed with additional information.)
What db2 command are you using to get this output?
Instead of trying to parse the output, you should select the tablespaces from the Db2 catalog system tables and filter the rows per your need:
SELECT MEMBER, TBSP_ID, TBSP_NAME FROM TABLE(MON_GET_TABLESPACE('',-2)) AS t
WHERE TBSP_AUTO_RESIZE_ENABLED = 1
It's easier.
PS: I don't know what you mean by doing it a "pythonic" way. Python has no magic here; under the covers, it has to ask Db2 for the information. I am giving you the best way to retrieve the information you want from the Db2 perspective, regardless of the language being used.
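If you only need the names (for example, to iterate over them from your Python script), the same table function can be narrowed down; a minimal sketch:
-- Only the names of tablespaces with auto-resize enabled, across all members.
SELECT DISTINCT TBSP_NAME
FROM TABLE(MON_GET_TABLESPACE('', -2)) AS t
WHERE TBSP_AUTO_RESIZE_ENABLED = 1;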

Hive managed tables are not drop on azure data lake store

I recently discovered what I would call a bug, and I'd like to be sure it is one.
We work on an Azure platform with HDInsight 3.6 and two separate storages: a blob storage and a Data Lake Store.
For most part of our work we use Hive.
From what we know, when you drop a managed table, the data under this table is dropped too.
To be sure of this, we tried the following:
CREATE TABLE test(id String) PARTITIONED BY (part String) STORED AS ORC ;
INSERT INTO TABLE test PARTITION(part='part1') VALUES('id1');
INSERT INTO TABLE test PARTITION(part='part2') VALUES('id2');
INSERT INTO TABLE test PARTITION(part='part3') VALUES('id3');
These queries are executed on the default database, i.e. on the blob storage.
The data is stored correctly under the location of the table test: if we check, we have three directories part=* with files under them.
Then I drop the table:
DROP TABLE test ;
If we check the database directory, there is no longer a directory named test, so the data was dropped as well; this is what we expect as the correct Hive behavior.
And now comes the trick: for our work we use databases located on a Data Lake Store, and when we use this code:
use database_located_on_adl ;
CREATE TABLE test(id String) PARTITIONED BY (part String) STORED AS ORC ;
INSERT INTO TABLE test PARTITION(part='part1') VALUES('id1');
INSERT INTO TABLE test PARTITION(part='part2') VALUES('id2');
INSERT INTO TABLE test PARTITION(part='part3') VALUES('id3');
DROP TABLE test ;
The table is created correctly and the data is stored correctly, BUT the data is not dropped by the DROP TABLE command...
Am I missing something? Or is this normal behavior?
If anyone sees this old post and has the same issue: our problem was that we were missing write permission on the Hive trash (the /user/hiveUserName/.Trash HDFS folder).
Hope this helps!
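If granting write access to the trash folder is not an option, another route, assuming your Hive version supports it, is to bypass the trash entirely when dropping the table:
-- PURGE deletes the files immediately instead of moving them to .Trash,
-- so the data cannot be recovered afterwards.
DROP TABLE test PURGE;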

Cassandra stress-tool column-size does not change

When testing my schema on a single-node cluster, I am getting identical data for each stress-test run.
I specify my cassandra-stress.yaml file as follows:
Table and column spec
table_definition: |
CREATE TABLE files (
id uuid PRIMARY KEY,
data blob
)
columnspec:
- name: data
size: UNIFORM(10..100)
population: UNIFORM(1..100B)
So, I specify the column size to be uniformly distributed from 10 to 100, but this config results in identical blob data of a fixed size in each row.
I use this command to run the test:
cassandra-stress user profile=..\cqlstress.yaml n=20 ops(insert=5)
What am I doing wrong?
My wish is to have data of variable length.
