When testing my schema on a single-node cluster, I get identical data for each stress-test run
This is my cassandra-stress.yaml file.
Table and column spec:
table_definition: |
  CREATE TABLE files (
    id uuid PRIMARY KEY,
    data blob
  )
columnspec:
  - name: data
    size: UNIFORM(10..100)
    population: UNIFORM(1..100B)
So I specify the column size to be uniformly distributed from 10 to 100, but this config results in identical, fixed-size blob data in every row.
I use this command to run the test:
cassandra-stress user profile=..\cqlstress.yaml n=20 ops(insert=5)
What am I doing wrong?
I want the data to have variable length.
Version: DBR 8.4 | Spark 3.1.2
Spark allows me to create a bucketed hive table and save it to a location of my choosing.
df_data_bucketed = (df_data.write.mode('overwrite').bucketBy(9600, 'id').sortBy('id')
.saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)
I have verified that this saves the table data to my specified path (in my case, blob storage).
In the future, the table 'data_bucketed' might be wiped from my Spark catalog, or mapped to something else, and I'll want to recreate it using the data that's been previously written to blob storage, but I can find no way to load a pre-existing, already-bucketed Spark table.
The only thing that appears to work is
df_data_bucketed = (spark.read.format("parquet").load(bucketed_path)
.write.mode('overwrite').bucketBy(9600, 'id').sortBy('id')
.saveAsTable('data_bucketed', format='parquet', path=bucketed_path)
)
This seems nonsensical, because it's essentially loading the data from disk and unnecessarily overwriting it with the exact same data just to take advantage of the buckets. (It's also very slow due to the size of this data.)
You can use Spark SQL to create that table in your catalog:
spark.sql("""CREATE TABLE IF NOT EXISTS tbl...""")
Following this, you can tell Spark to rediscover the data by running spark.sql("MSCK REPAIR TABLE tbl").
I found the answer at https://www.programmerall.com/article/3196638561/
Read from the saved Parquet file: If you want to use historically saved data, you can't use the above method, nor can you use spark.read.parquet() like reading regular files. The data read in this way does not carry bucket information. The correct way is to use the CREATE TABLE statement. For details, refer to https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 col_type1 [COMMENT col_comment1], ...)]
USING data_source
[OPTIONS (key1=val1, key2=val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]
Examples are as follows:
spark.sql(
"""
|CREATE TABLE bucketed
| (name string)
| USING PARQUET
| CLUSTERED BY (name) INTO 10 BUCKETS
| LOCATION '/path/to'
|""".stripMargin)
I want to connect Apache Superset with Apache Spark (I have Spark 3.1.2) and Query the data on Superset's SQL Lab using Apache Spark SQL.
On Spark's master node, I started the Thrift server using this command: spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
Then I added the Spark cluster as a database in Superset using the SQLAlchemy URI hive://hive@spark:10000/. I am able to access the Spark cluster from Superset.
I can load JSON data as a table using this SQL:
CREATE table IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json"
and I am able to query data using simple SQL statements like SELECT * FROM test_table LIMIT 10.
But the problem is that the JSON data is compressed as gzipped files.
So I tried
CREATE table IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json.gz"
but it did not work. I want to know how to load gzipped JSON data into a table.
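A quick way to narrow this down is to check whether Spark itself can parse the gzipped files independently of the table definition, since Spark's JSON reader normally detects the gzip codec from the .gz suffix. A small diagnostic sketch, assuming a PySpark session on the same cluster and the path from the question:

# Read the compressed file directly; if this prints a sensible schema and rows,
# the gzip compression itself is not the problem and the table's LOCATION is
# the more likely culprit.
df = spark.read.json("/path/to/data.json.gz")
df.printSchema()
df.show(5, truncate=False)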
Compressed JSON storage
If you have large JSON text, you can explicitly compress it using the built-in COMPRESS function. In the following example, compressed JSON content is stored as binary data, and we have a computed column that decompresses the JSON back to the original text using the DECOMPRESS function:
CREATE TABLE Person
( _id int identity constraint PK_JSON_ID primary key,
data varbinary(max),
value AS CAST(DECOMPRESS(data) AS nvarchar(max))
)
INSERT INTO Person(data)
VALUES (COMPRESS(@json))
COMPRESS and DECOMPRESS functions use standard GZip compression.
Another example:
postgres=# CREATE EXTENSION json_fdw;
postgres=# CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;
postgres=# CREATE FOREIGN TABLE customer_reviews
(
customer_id TEXT,
"review.date" DATE,
"review.rating" INTEGER,
"product.id" CHAR(10),
"product.group" TEXT,
"product.title" TEXT,
"product.similar_ids" CHAR(10)[]
)
SERVER json_server
OPTIONS (filename '/home/citusdata/customer_reviews_nested_1998.json.gz');
Note: This example was taken from https://www.citusdata.com/blog/2013/05/30/run-sql-on-json-files-without-any-data-loads
In Hive we can create a partitioned table as:
CREATE EXTERNAL TABLE testdb.test_table(name string, age int)
PARTITIONED BY (dept string)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION '/path/to/dataFile/';
for files spread across locations like:
/path/to/dataFile/dept1
/path/to/dataFile/dept2
/path/to/dataFile/dept3
and then we can add the partitions like:
ALTER TABLE testdb.test_table ADD PARTITION (dept='dept1') LOCATION '/path/to/dataFile/dept1';
ALTER TABLE testdb.test_table ADD PARTITION (dept='dept2') LOCATION '/path/to/dataFile/dept2';
ALTER TABLE testdb.test_table ADD PARTITION (dept='dept3') LOCATION '/path/to/dataFile/dept3';
In Azure, our files are spread across different folders in a container. I need to create a partitioned external table in the Synapse serverless pool. The syntax I am following is:
CREATE EXTERNAL TABLE [testdb].[test1]
(
    [STUDYID] varchar(2000),
    [SITEID] varchar(2000)
)
WITH
(
LOCATION = '/<abc_location>/csv/archive/',
DATA_SOURCE = [datalake],
FILE_FORMAT = [csv_comma_values]
)
I was checking the Azure docs but didn't find any relevant documentation for this. Is there any way we can achieve something similar to the Hive code?
Partitioned external tables are currently not available in Azure Synapse.
The location starts from the root folder, and by specifying /**, you can return files from subfolders. The external table will skip files whose names start with an underscore (_) or a period (.).
CREATE EXTERNAL TABLE sample_table
(
firstname varchar(50),
lastname varchar(50)
)
WITH (
LOCATION = '/csv/**',
DATA_SOURCE = blob_storage,
FILE_FORMAT = csv_file_format
)
I have a problem with MySQL Cluster.
I created a table intended to be stored on disk, but I think it's always stored in memory.
I have created the table with the query below:
CREATE TABLE IF NOT EXISTS user (
`FIELD_KEY` varchar(255) NOT NULL,
`FIELD0` text,
`FIELD1` text,
`FIELD2` text,
`FIELD3` text,
`FIELD4` text,
`FIELD5` text,
`FIELD6` text,
`FIELD7` text,
`FIELD8` text,
`FIELD9` text,
PRIMARY KEY (FIELD_KEY)
) ENGINE=ndbcluster DEFAULT CHARSET=utf8 max_rows=40000000
TABLESPACE ts_1 STORAGE DISK;
ALTER TABLE user PARTITION BY KEY() PARTITIONS 4;
My Environment (3 nodes in the same VM):
MySQL Cluster 7.5
management nodes: 1
data node: 1
sql nodes: 1
VM Config : 16GB RAM, 100GB HDD
MySQL Cluster settings (config.ini) are set to:
[ndbd default]
noofreplicas=1
DataMemory=2048M
# Memory to allocate for index storage
IndexMemory=1024M
# To support multi-thread processors
MaxNoOfExecutionThreads=2
FragmentLogFileSize=256M
NoOfFragmentLogFiles=12
#-------
RedoBuffer=32M
SharedGlobalMemory=256M
DiskPageBufferMemory=4096M
NoOfFragmentLogParts = 4
#-------
[ndb_mgmd]
HostName=192.168.1.10
DataDir=/var/lib/mysql-cluster
[ndbd]
HostName=192.168.1.10
NodeId=2
DataDir=/usr/local/mysql/data
[mysqld]
HostName=192.168.1.10
In order to store data on disk, I created a LOGFILE GROUP and a TABLESPACE with the queries below:
CREATE LOGFILE GROUP lg ADD UNDOFILE 'undo1.dat'
INITIAL_SIZE=12288M
UNDO_BUFFER_SIZE=32M
ENGINE=NDB;
CREATE TABLESPACE ts_1
ADD DATAFILE 'data1.dat'
USE LOGFILE GROUP lg
INITIAL_SIZE=256M
ENGINE=NDB;
ALTER TABLESPACE ts_1 ADD DATAFILE 'data2.dat' INITIAL_SIZE=256M ENGINE=NDB;
.
.
.
ALTER TABLESPACE ts_1 ADD DATAFILE 'data24.dat' INITIAL_SIZE=256M ENGINE=NDB;
MemoryUsage report
$ sudo ndb_mgm -e 'all report MemoryUsage'
Connected to Management Server at: 192.168.1.10:1186
Node 2: Data usage is 94%(62259 32K pages of total 65536)
Node 2: Index usage is 2%(2801 8K pages of total 131104)
When I run a query to insert 4,000,000 records (4 GB) into my table, I get the error "The table 'user' is full" and cannot insert any more data; only 1,611,787 records have been inserted. I don't understand why this is so.
Could somebody explain the situation to me, and how I can resolve the problem?
"In a Disk Data table, the first 256 bytes of a TEXT or BLOB column are stored in memory; only the remainder is stored on disk."
You can check the MySQL NDB Cluster Disk Data documentation for the details.
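A rough back-of-the-envelope sketch of why DataMemory fills up here, under the assumption that the 4 GB insert spreads evenly over the 4,000,000 rows (about 1 KB per row across the 10 TEXT columns, i.e. roughly 100 bytes per value, which is below the 256-byte in-memory portion):

# Rough estimate of why the rows end up in DataMemory rather than the tablespace.
# Assumption: ~4 GB over 4,000,000 rows, i.e. ~1 KB of user data per row split
# across 10 TEXT columns (~100 bytes each). Values shorter than the 256-byte
# in-memory portion of a TEXT column are stored entirely in memory, so in this
# scenario effectively the whole row stays in DataMemory.
rows_planned = 4_000_000
bytes_per_row = 4 * 1024**3 // rows_planned    # ~1 KB of user data per row
data_memory = 2048 * 1024**2                   # DataMemory=2048M from config.ini

# Ignoring per-row and per-page overhead, the rows that fit in DataMemory:
print(data_memory // bytes_per_row)            # roughly 2,000,000 rows at best

# The MemoryUsage report shows ~94% of DataMemory used after ~1.6 million rows,
# which is consistent once row and page overhead are taken into account.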
I am trying to teach myself Hive and familiarise myself with HDInsight and HDFS on Microsoft Azure using the tutorial From Raw Data to Insight using HDP and Microsoft Business Intelligence.
I have managed to stage the data on HDFS and am now using Azure PowerShell and the Microsoft Azure HDInsight Query Console to create a Hive schema on the raw data.
I am trying to use the DDL statement below to create the table 'price_data':
create external table price_data (stock_exchange string, symbol string, trade_date string, open float, high float, low float, close float, volume int, adj_close float)
row format delimited
fields terminated by ','
stored as textfile
location '/nyse/nyse_prices';
The blob files are located in the container "nyse" and each blob file within the container is named 'nyse_prices/NYSE_daily_prices_.csv'.
I have made sure that the format conforms to Processing data with Hive documentation on MSDN.
When I run the above query it executes successfully and creates the table.
The external table should be pointing to the underlying files and therefore should be populated with the data within each CSV file.
However when I run the query:
select count(*) from price_data
It returns 0. This is not correct. Can someone let me know what I am doing wrong here?
Cheers
I think the location you are specifying may be incorrect.
You have a default container, which is the container you specify or create when creating an HDInsight cluster, for example 'mycontainer'. If I put all the csv files in that container as nyse_prices/filename.csv, then my location would be just '/nyse_prices' - just the directory that contains the files. The container is treated as the root in this case - '/'.
If the files are not in the default container, or are on a different storage account, you can use a location of 'wasb://container@storagename.blob.core.windows.net/nyse_prices'.
As a test, I just created nyse_prices/ in my default container and uploaded some of the csv's to it. Then I modified your query to use location '/nyse_prices'; and was able to run selects against the data after that.