Can I add more than 300 columns in Apache Kudu? - apache-kudu

I have been asked to create a Kudu table.
I know that Kudu is a columnar store, and my company's database table currently has about 285 columns, which fits within a Kudu table. But is it possible to dynamically add columns beyond the 300-column limit in a Kudu table? If so, how?

#HJSG
Kudu gives its best performance with fewer than 300 columns, and Kudu tables can have a maximum of 300 columns; it is a Kudu limitation. Have a look at https://kudu.apache.org/docs/known_issues.html#_columns

Not sure what happens if you exceed the 300-column limit, but here you are:
--unlock-unsafe-flags --max-num-columns=1000
Ref
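
For illustration, here is a minimal sketch of creating a wide table with the kudu-python client, assuming the cluster was started with the unsafe flag override above; the master host, table name, and column names are hypothetical, and going past 300 columns is unsupported territory per the answer above.

import kudu
from kudu.client import Partitioning

# Hypothetical master address; the create_table() call below would be rejected
# past 300 columns unless the cluster runs with the unsafe flag override.
client = kudu.connect(host='kudu-master.example.com', port=7051)

builder = kudu.schema_builder()
builder.add_column('id').type(kudu.int64).nullable(False).primary_key()
for i in range(400):  # deliberately more than the default 300-column limit
    builder.add_column('col_%03d' % i, type_=kudu.int64, nullable=True)
schema = builder.build()

partitioning = Partitioning().add_hash_partitions(column_names=['id'], num_buckets=3)
client.create_table('wide_table_example', schema, partitioning)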

Related

Query like a full table scan in hadoop/spark

I have about 1 million records. I want to manage these juridical judgment records in Spark/Hadoop. My question is: can I query Spark/Hadoop to get all records at once, like a full table scan? Or can I paginate efficiently, for example to get records 800,000 to 800,050?
My problem is that I use Elasticsearch for full-text search, but if I want to get results 800,000 to 800,050 I'm obliged to use the scroll API, which turns out to be very slow because it starts from 0, takes 10,000 records, then another 10,000, and so on. My goal is to "jump" straight to record 800,000 without walking through chunks of 10,000 records.
Hive or SparkSQL can be used to query offset ranges of datasets, yes. But they won't help with textual search out of the box.
MongoDB can do both, since it also includes Lucene indexes like Elasticsearch.
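
If it helps, here is a minimal sketch of the Hive/SparkSQL-style offset pagination the answer refers to, using PySpark with a row_number window; the dataset path, sort key, and column names are hypothetical.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("pagination-sketch").getOrCreate()

judgments = spark.read.parquet("hdfs:///data/judgments")  # hypothetical dataset

# A global window (no partitionBy) funnels all rows through one partition,
# which is fine for a sketch but expensive on large data.
w = Window.orderBy("judgment_id")  # hypothetical, stable sort key

page = (judgments
        .withColumn("rn", F.row_number().over(w))
        .where((F.col("rn") > 800000) & (F.col("rn") <= 800050))
        .drop("rn"))

page.show()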

Cassandra (DSE) - Need suggestion on using PER PARTITION LIMIT on huge data

I have a table with around 4M partitions, and each partition contains 4 rows, so the table holds about 16M rows (wide columns). Since our table is a time-series database, we only need the latest row or version per partition key. I can achieve the desired result with the query below, but it puts load on the cluster and is time-consuming. I would like to know whether there is a better way, or whether this is the only way.
SELECT some_value FROM some_table PER PARTITION LIMIT 1;
Using PER PARTITION LIMIT won't hurt performance. In fact, it's efficient for achieving what you need from each partition, since only the first row is returned and Cassandra doesn't have to iterate over the other rows in the partition. Cheers!
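
For reference, a minimal sketch of running that query with the DataStax Python driver; the contact point and keyspace are hypothetical. Note that PER PARTITION LIMIT 1 returns the first row of each partition in clustering order, so the clustering order must place the latest version first for this to mean "latest".

from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])            # hypothetical contact point
session = cluster.connect("my_keyspace")   # hypothetical keyspace

# The result set pages transparently across the ~4M partitions.
rows = session.execute("SELECT some_value FROM some_table PER PARTITION LIMIT 1")
for row in rows:
    print(row.some_value)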

Questions about storing data in SQL or Table Storage

I have many questions on whether to store my data into SQL or Table Storage and the best way to store them for efficiency.
Use Case:
I have around 5 million rows of objects that are currently stored in a MySQL database. Only the metadata (Lat, Long, ID, Timestamp) is kept in the database; the other ~150 columns about each object, which are not used much, were moved into Table Storage.
In Table Storage, should all of those rarely used 150 columns be packed into a single column of one row per object, instead of being spread over multiple rows?
For each of these 5 million objects in the database there is additional information (temperature readings, trajectories, etc.). The trajectory data used to be stored in SQL (~300 rows per object) but was moved to Table Storage to be cost effective. It is currently stored in Table Storage in a relational manner, where each row looks like (PK: ID, RK: ID-Depth-Date, X, Y, Z).
It currently takes a long time to grab much of the trajectory data; Table Storage seems to be pretty slow in our case, and I want to improve read performance. Should the data be stored so that each object has one row for its trajectory, with all the XYZ samples stored in one column in JSON format? Instead of fetching 300 rows, it would only need to fetch one.
Is Table Storage the best place to store all of this data? If I wanted to get an X, Y, Z at a certain measured depth, I would have to get the whole row and parse through the JSON, so this is probably a trade-off.
Is it feasible to keep the trajectory data, readings, etc. in a SQL database where there can be (5,000,000 x 300) rows for the trajectory data? There is also some information about the objects that can reach (5,000,000 x 20,000) rows. This is probably too much for a SQL database and would have to live in Azure cloud storage. If so, would the JSON option be the best one? The trade-off is that if I want a portion, say 1,000 rows, I would have to get the whole table; however, isn't that faster than querying through 20,000 rows? I could probably split the data into sets of 1,000 rows and use SQL as metadata to find out which sets I need from cloud storage.
Pretty much, I'm having trouble understanding how to group and format data in Azure table storage so that it is efficient and fast to grab for my application.
Here's an example of my data and how I am getting it: http://pastebin.com/CAyH4kHu
As an alternative to Table Storage, you can consider using Azure SQL DB Elastic Scale to spread the trajectory data (and associated object metadata) across multiple Azure SQL databases. This allows you to overcome the capacity (and compute) limits of a single database. You would be able to perform object-specific queries and inserts efficiently, and you have options for performing queries across multiple databases, assuming you are working with a .NET application tier. You can find out more at http://azure.microsoft.com/en-us/documentation/articles/sql-database-elastic-scale-get-started/
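
For comparison, here is a minimal sketch of the layout the question considers, one entity per trajectory with all XYZ samples in a single JSON property, using the azure-data-tables Python SDK; the connection string, table, and property names are hypothetical. Keep the per-property (64 KB string) and per-entity (1 MB) size limits in mind when packing many samples into one column.

import json
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    conn_str="<connection string>",   # placeholder
    table_name="trajectories",
)

points = [{"depth": d, "x": 1.0 * d, "y": 2.0 * d, "z": 3.0 * d} for d in range(300)]

table.upsert_entity({
    "PartitionKey": "object-123",     # one partition per object
    "RowKey": "trajectory",           # a single row instead of ~300
    "Points": json.dumps(points),     # all XYZ samples in one JSON column
})

# One point read replaces fetching ~300 separate rows.
entity = table.get_entity(partition_key="object-123", row_key="trajectory")
trajectory = json.loads(entity["Points"])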

What's the maximum number of rows per single read/write transactions in Azure Storage Table

We are investigating the performance of Azure Table storage. What we want to know is the maximum number of rows per single read and write transaction for a table. Is there any official documentation we can refer to?
Thanks a lot.
You can write up to 100 rows (entities) in a single table storage transaction, assuming that all of the rows/entities have the same PartitionKey.
With respect to reading, you can read up to 1,000 rows in one storage transaction, once again assuming the same PartitionKey.
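
As a minimal sketch of a 100-entity batch write with the azure-data-tables Python SDK (connection string, table, and entity names are hypothetical):

from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    conn_str="<connection string>",   # placeholder
    table_name="readings",
)

# A single entity batch accepts at most 100 operations, and every entity in it
# must share the same PartitionKey.
batch = [
    ("upsert", {"PartitionKey": "sensor-1", "RowKey": str(i), "value": i})
    for i in range(100)
]
table.submit_transaction(batch)

On the read side, the service returns at most 1,000 entities per query response; the SDK follows continuation tokens as you keep iterating the results.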

Deleting a Windows Azure Table from Table Storage - Cost?

We currently have an Azure table filled with logs. We have no idea how many records are in it, but we know that we performed roughly 3 million transactions, so in the worst-case scenario we will have about 300 million rows.
We want to completely delete all the logs.
If we delete the table, will this count as one transaction, or will it batch-delete all the rows it can, putting us at around 3 million transactions again?
I can't find any official info confirming that the Delete Table command is actually one transaction.
Any help?
Thanks !
Transactions are billed as single REST requests.
As such you will be charged for 1 transaction to delete the table.
To be completely precise, that would (or could) be two storage transactions:
One to drop the table.
One to re-create the table for continued logging.
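
As a sketch of those two requests with the azure-data-tables Python SDK (connection string and table name are hypothetical); the delete is one request no matter how many rows the table held:

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection string>")  # placeholder

service.delete_table("logs")   # one REST request / billed transaction
# Re-creating a table with the same name right away can fail with a conflict
# until the deletion finishes in the background, so retry or wait if needed.
service.create_table("logs")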
