Any way to disable hive-style partitioning? - apache-spark

I am writing an Apache Iceberg table that is synced to a metastore. When generating tables, the partitioning appears as hive-style, when I'd prefer it to be just the singular value. I have also tested Hudi tables, which come with a built-in way to disable hive-style partitioning. I understand that this may not be the best solution / best practice, but it's the outcome I currently need.
Example Output:
directory/year=2022/month=12/day=07/parquet.file
Preferred Output:
directory/2022/12/07/parquet.file
Since I can find no way of disabling this formatting via Iceberg table properties, is there any way to do it via my hive-site config? Thanks in advance!
I have tried researching the issue, but have found very little information on other people attempting the same thing.
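For contrast, the Hudi switch mentioned above is a single write option. A minimal PySpark sketch, with placeholder table name, partition fields, and paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-flat-partitions").getOrCreate()
df = spark.read.parquet("s3://bucket/input/")  # placeholder source

(df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.partitionpath.field", "year,month,day")
    # "false" yields directory/2022/12/07/... instead of year=2022/month=12/day=07/...
    .option("hoodie.datasource.write.hive_style_partitioning", "false")
    .mode("append")
    .save("s3://bucket/directory/"))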

Related

How do I specify equality versus position deletes when using merge-on-read?

The Iceberg documentation discusses using merge-on-read when deleting data, and also refers to position deletes versus equality deletes. It seems straightforward to specify that I want merge-on-read in the table properties.
I've looked through the Iceberg documentation and also found half a dozen external sites that talk about the pros and cons of each method, but none of them describe how to specify position versus equality. Is this a table property? How do I choose a method?
I'm using Spark 3.3 on EMR with Scala/Python.
You don't need to specify position (POS) or equality (EQ) deletes; the engine selects between the two automatically based on the scenario.
To get the best out of Iceberg, you may want to pay attention to the following:
Choosing merge-on-read or copy-on-write
Compacting files according to a specified policy
Expiring snapshots and deleting orphaned data
Hope this helps.
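To make the first point concrete: the mode itself is a table property, while the position-versus-equality choice is not. A minimal Spark SQL sketch, assuming a format-v2 table with the placeholder name db.events:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-write-modes").getOrCreate()

# Each write path has its own mode property; the accepted values are
# 'merge-on-read' and 'copy-on-write'.
spark.sql("""
    ALTER TABLE db.events SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")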

How to Partition Database Table in Azure Data Explorer?

I started exploring ADX a few days back. I imported my data from Azure SQL to ADX using an ADF pipeline, but when I query that data, it takes a long time. To find a workaround I researched table data partitioning, and I am now fairly clear on partition types and tricks.
The problem is, I couldn't find any sample (Kusto syntax) that guides me in defining partitioning on ADX database tables. Can anyone please help me with this syntax?
The partition operator is probably what you are looking for:
T | partition by Col1 ( top 10 by MaxValue )
T | partition by Col1 { U | where Col2=toscalar(Col1) }
ADX doesn't currently have the notion of partitioning a table, though it may be added in the future.
That said, with the lack of technical information currently provided, it's somewhat challenging to understand how you reached the conclusion that partitioning your table is required and is the appropriate solution, as opposed to the (many) other directions ADX does allow you to pursue.
If you would be willing to detail what actions you're performing, the characteristics of your data and schema, and which parts are performing slower than expected, that may help in providing a more meaningful and helpful answer.
[If you aren't keen on exposing that information publicly, it's OK to open a support ticket with these details (through the Azure portal).]
(Update: the functionality has been available for a while now. Read more at https://yonileibowitz.github.io/blog-posts/data-partitioning.html)
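Following that update, here is a hedged sketch of applying a partitioning policy from Python with the azure-kusto-data client. The cluster URI, database, table, and column names are placeholders, and the policy JSON should be checked against the linked post:

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://mycluster.westeurope.kusto.windows.net"  # placeholder
client = KustoClient(KustoConnectionStringBuilder.with_az_cli_authentication(cluster))

# Hash-partition MyTable on a high-cardinality string column.
command = """.alter table MyTable policy partitioning @'{"PartitionKeys":[{"ColumnName":"tenant_id","Kind":"Hash","Properties":{"Function":"XxHash64","MaxPartitionCount":128}}]}'"""
client.execute_mgmt("MyDatabase", command)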

How can I write to HDFS from Spark to make access to that data faster?

Assuming that I am not using tools like Hive or HBase (Spark is unable to use Hive indexes for optimization anyway), what is the best way to write data to HDFS so that access to that data is faster?
What I was thinking is to save many different files whose names are derived from the keys. Say we have a database of people identified by their first name and surname. Maybe I could save files named by the first letters of the first name and surname; that way we would have 26x26=676 files. So, for example, if we want to see the record of Alan Walker, we would only need to load the file AW. Would this be a good way, or are there much better ways to do this kind of thing?
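A minimal PySpark sketch of that layout, with placeholder paths; note that Spark's partitionBy names the directories bucket=AW rather than bare AW:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("initials-bucketing").getOrCreate()
people = spark.createDataFrame(
    [("Alan", "Walker", 34), ("Ada", "Lovelace", 36)],
    ["firstname", "surname", "age"])

# Derive the two-letter bucket, e.g. "AW" for Alan Walker.
bucketed = people.withColumn(
    "bucket",
    F.concat(F.upper(F.substring("firstname", 1, 1)),
             F.upper(F.substring("surname", 1, 1))))

# One directory per bucket; a reader filtering on the bucket column
# only touches that directory (partition pruning).
bucketed.write.mode("overwrite").partitionBy("bucket").parquet("hdfs:///people")
aw = spark.read.parquet("hdfs:///people").where(F.col("bucket") == "AW")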
I believe that an index is what you need. In HDFS, as in databases, indexing has some overhead on insertion but makes queries much faster.
HDFS does not have any sort of index, as it is a DFS rather than a database, but the requirement you mention has been implemented by third-party programs.
There are many indexing tools that work with HDFS; you can have a look at Apache Solr, for instance.
Here is a tutorial to keep you going: https://lucene.apache.org/solr/guide/6_6/running-solr-on-hdfs.html

Is Cassandra just a storage engine?

I've been evaluating Cassandra to replace MySQL in our microservices environment, as MySQL is the only portion of the infrastructure that is not distributed. Our needs are both write- and read-intensive, since it's a platform for exchanging raw data; a type of "bus", for lack of a better description. Our selects are fairly simple and should remain that way, but I'm already struggling to get past some basic filtering due to the extreme limitations of SELECT queries.
For example, if I need to filter data, the field has to be in the key. At that point I can't change data in those fields because they're part of the key. I can use a SASI index, but then I hit a wall if I need to filter by more than one field. The hope was that materialized views would help with this, but in another post I was told to avoid them due to some instability and problematic behavior.
It would seem that Cassandra is good at storage but, realistically, not good as a standalone database platform for non-trivial applications beyond very basic filtering (i.e. a single field). I'm guessing I'll have to accept the use of another front-end like Elastic, Solr, etc. The other option might be to accept the idea of filtering data within application logic, which is doable, as long as the data sets coming back remain small enough.
Apache Cassandra is far more than just a storage engine. It is designed as a distributed database oriented towards high availability and partition tolerance, which can limit query flexibility if you want good and reliable performance.
It has a query language, CQL, which is quite powerful, but it is deliberately limited in ways that guide users towards queries that perform well. To use it effectively you need to model your tables around your queries.
More often than not you need to query your data in multiple ways, so users will often denormalize their data into multiple tables. Materialized views aim to make that experience better, but they have had their share of bugs and limitations, as you indicated. If you consider using them you should be aware of those limitations, although that is generally a good idea when evaluating anything.
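A minimal sketch of that denormalization pattern with the Python cassandra-driver; the keyspace, table, and column names are hypothetical:

from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""")

# One table per query path: "events for a user" and "events for a source".
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events_by_user (
        user_id text, ts timestamp, payload text,
        PRIMARY KEY ((user_id), ts))""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events_by_source (
        source text, ts timestamp, payload text,
        PRIMARY KEY ((source), ts))""")

# Each event is written twice, once per table, so each query hits
# exactly one partition instead of needing a secondary index.
now = datetime.now(timezone.utc)
session.execute("INSERT INTO demo.events_by_user (user_id, ts, payload) VALUES (%s, %s, %s)",
                ("u1", now, "raw-data"))
session.execute("INSERT INTO demo.events_by_source (source, ts, payload) VALUES (%s, %s, %s)",
                ("bus-a", now, "raw-data"))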
If you need advanced querying capabilities, or do not have ahead-of-time knowledge of what the queries will be, Cassandra may not be a good fit. You can build these capabilities using products like Spark and Solr on top of Cassandra (as DataStax Enterprise does), but it may be difficult to achieve using Cassandra alone.
On the other hand there are many use cases where Cassandra is a great fit, such as messaging, personalization, sensor data, and so on.

Cassandra Data Model for apache access logs

In a POC, we are using Cassandra for storing (among other things) parsed Apache access logs, used together with Apache Spark + Zeppelin. We have managed to get things working, BUT we are very uncertain about how to model the data correctly.
Edit: Our queries will span months and years rather than weeks and days. In production, jobs will likely be executed perhaps daily (at least for now), and we will use a smaller dataset during development.
Since this will be used for analytics ONLY, the queries can be pretty much anything, but of course we can consider a handful of queries in advance.
E.g.:
Latency percentiles
Geo distribution
Sum of requests
Popular REST resources
...etc.
Partition key + primary key. This is really difficult... the only thing I can think of is something like ((userid, [webresource]), timestamp).
At least this would give a fairly even distribution; otherwise we would have to use a checksum or something, which feels wrong.
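To make that concrete, a minimal sketch of the key shape above via the Python cassandra-driver (names are placeholders), with a day bucket added so that a single user/resource pair cannot grow into an unbounded partition:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS poc
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""")

# ((userid, webresource, day), ts): the day bucket spreads writes and
# bounds partition size; ts orders rows within each partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS poc.access_logs (
        userid text,
        webresource text,
        day date,
        ts timestamp,
        latency_ms int,
        geo text,
        PRIMARY KEY ((userid, webresource, day), ts))""")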
Or should I have different tables for different types, like latency, geo, etc.? Or would this be a good use case for materialized views?
I have googled for something like this without any luck, so perhaps Cassandra is a poor fit for this, BUT still, we would really like to see how far we can get.
Anyway, any input is highly appreciated!
Regards /Johan
