Partitioning a table in Databricks

Can someone clarify whether I can partition a table on a column and restrict access to certain partitions for certain people?
Does table partitioning have anything to do with restricting access? For example: if I have a partition by Country, with around 150+ countries, can we restrict access to only 10 of those countries at the partition level?
Thanks

Related

Spark - Get Counts while saving into hive table (ORC)

I would like to ask if there is any way to get the count of a DataFrame that I am inserting into a Hive table using saveAsTable(), without a performance penalty.
Ideally I would like to log the row counts before and after the insert, as that would be really useful information in a Splunk dashboard, but I don't want to add Hive queries that might harm performance significantly, as I have more than 100 transformations.
Thanks for help in advance!
For newly created tables and/or partitions (populated through the INSERT OVERWRITE command), statistics are automatically computed by default. You have to explicitly set the boolean variable hive.stats.autogather to false so that statistics are not automatically computed and stored in the Hive metastore:
set hive.stats.autogather=false;
Table-level statistics:
spark.sql("ANALYZE TABLE tableName COMPUTE STATISTICS").show()
which results in
parameters:{totalSize=0, numRows=0, rawDataSize=0...
Table Partition Level Statistics:
spark.sql("ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS").show()
Note: When the user issues that command, he may or may not specify the partition specs. If the user doesn't specify any partition specs, statistics are gathered for the table as well as all the partitions (if any).
Table Column Level Statistics:
spark.sql("ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS FOR COLUMNS").show()
You can find more details at: https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ExistingTables%E2%80%93ANALYZE
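If the goal is to report row counts without extra COUNT(*) queries, one option is to read numRows back out of the metastore statistics fragment shown above. A minimal pure-Python sketch, assuming the parameters fragment keeps the comma-separated key=value form printed above (the function name and sample string are illustrative):

```python
import re

def parse_table_params(s: str) -> dict:
    # Pull the key=value pairs out of the "parameters:{...}" fragment
    # that the Hive metastore prints for a table or partition.
    inner = re.search(r"parameters:\{([^}]*)\}", s)
    if not inner:
        return {}
    return dict(pair.split("=", 1)
                for pair in inner.group(1).split(", ")
                if "=" in pair)

# Hypothetical stats string in the shape shown above:
params = parse_table_params("parameters:{totalSize=1024, numRows=42, rawDataSize=980}")
print(params["numRows"])  # → 42
```

Combined with ANALYZE TABLE ... COMPUTE STATISTICS after the insert, this gets a count from already-computed metadata instead of rescanning the data.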

Query (with Cosmos DB) on Partition key results in multiple Partition key ranges, How is this possible? [duplicate]

I'm having difficulty understanding the difference between partition keys and partition key ranges in Cosmos DB. I understand generally that a partition key in Cosmos DB is a JSON property/path within each document that is used to evenly distribute data among multiple partitions and avoid uneven "hot partitions" -- and that the partition key decides the physical placement of documents.
But it's not clear to me what a partition key range is. Is it just a range of literal partition keys, from first to last, grouped by each individual partition in the collection? I know the ranges can be found by performing a GET request to the endpoint https://{databaseaccount}.documents.azure.com/dbs/{db-id}/colls/{coll-id}/pkranges, but I conceptually want to be sure I understand. I'm also still not clear on how to view the specific partition key range that a specific document belongs to.
https://learn.microsoft.com/en-us/rest/api/cosmos-db/get-partition-key-ranges
You define a property on your documents that you want to use as the partition key.
Cosmos DB hashes the value of that property for all documents in the collection and maps different partition keys to different physical partitions.
Over time your collection will grow, and you might end up having, for example, 100 logical partitions distributed over 5 physical partitions.
Partition key ranges are just collections of partition keys grouped by the physical partitions they are mapped to.
So, in this example, you would get 5 pkranges with a min/max partition key value for each.
Notice that pkranges can change over time: as your collection grows, physical partitions get split, causing some partition keys to be moved to a new physical partition and part of the previous range to move to a new location.
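That mapping can be sketched in plain Python: hash each logical partition key into a hash space, assign contiguous slices of that space to physical partitions, and report the observed min/max per physical partition as its pkrange. This is a toy model only; md5 and the 5-way split are stand-ins, not Cosmos DB's actual hash function or placement logic:

```python
import hashlib

NUM_PHYSICAL = 5          # physical partitions in this toy example
HASH_SPACE = 16 ** 8      # size of our simplified hash space

def hash_pk(value: str) -> int:
    # Hash a partition key value to a point in the hash space
    # (stand-in for Cosmos DB's internal hash function).
    return int(hashlib.md5(value.encode()).hexdigest()[:8], 16)

def physical_partition(value: str) -> int:
    # Each physical partition owns one contiguous slice of the hash space.
    return hash_pk(value) * NUM_PHYSICAL // HASH_SPACE

# 100 logical partition keys distributed over the physical partitions:
keys = [f"tenant-{i}" for i in range(100)]
pkranges = {}
for k in keys:
    h = hash_pk(k)
    p = physical_partition(k)
    lo, hi = pkranges.get(p, (h, h))
    pkranges[p] = (min(lo, h), max(hi, h))

for pid in sorted(pkranges):
    lo, hi = pkranges[pid]
    print(f"pkrange {pid}: min={lo:#010x} max={hi:#010x}")
```

When a physical partition splits, its slice of the hash space is divided, which is why the pkranges endpoint can return different ranges over time.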

A substitute OR query for Cassandra

I have a table in my Cassandra DB with columns userid, city1, city2 and city3. What would my query be if I wanted to retrieve all users that have "Paris" as a city? I understand Cassandra doesn't have OR so I'm not sure how to structure the query.
First, it heavily depends on the structure of the table: if you have userid as the partition key, you can of course use a secondary index to search for users by city, but it's not optimal because it's a fan-out call - the request is sent to all nodes in the cluster. You could redesign to use a materialized view with city as the partition key, but you may have problems if some cities have a lot of users.
In general, if you need to select several values in the same column, you can use the IN operator, but it's better not to use it on partition keys (parallel queries are better). If you need OR on different columns, you need to run parallel queries and collect the results on the application side.
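The "parallel queries plus client-side union" approach for OR across city1/city2/city3 can be sketched like this. The in-memory users list and the query_column helper are hypothetical stand-ins for async CQL queries against secondary-indexed columns:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the users table; in a real app each
# query below would be an async CQL SELECT against a secondary index.
users = [
    {"userid": "u1", "city1": "Paris",  "city2": "Lyon",  "city3": "Nice"},
    {"userid": "u2", "city1": "Berlin", "city2": "Paris", "city3": "Bonn"},
    {"userid": "u3", "city1": "Rome",   "city2": "Milan", "city3": "Turin"},
]

def query_column(column, value):
    # Stand-in for one per-column query, e.g.
    # session.execute_async("SELECT userid FROM users WHERE city1 = %s", [value])
    return [u["userid"] for u in users if u[column] == value]

def users_with_city(value):
    # One query per column, run in parallel; the OR is done client-side
    # by unioning the result sets.
    columns = ["city1", "city2", "city3"]
    with ThreadPoolExecutor(max_workers=len(columns)) as pool:
        results = pool.map(query_column, columns, [value] * len(columns))
    merged = set()
    for rows in results:
        merged.update(rows)
    return sorted(merged)

print(users_with_city("Paris"))  # → ['u1', 'u2']
```

The set union also deduplicates users who happen to list the same city in more than one column.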

Multi-tenancy in Cassandra

We are supporting multi-tenancy. Is it better to have the customer id as part of the partition key or as a clustering column?
Having customer id as part of the partition key will ensure that one customer's data cannot be viewed by another customer.
With customer id as a clustering column, developers have to ensure that customer id is part of the where clause. It also takes up more space.
Is one approach better than the other?
There will be an impact on time-series data, since data will be partitioned by customer id, and a super user with access to all customers will not be able to view time-series data correctly.
Thanks
Have customer_id as a part of your partition key. You'll need this to ensure that each customer's data is stored together.
However, make sure that customer_id is not the only partition key. If you have a time series data set with millions of rows, you won't want to attempt to store them all in the same partition (it'll get too big).
There will be an impact on time-series data, since data will be partitioned by customer id, and a super user with access to all customers will not be able to view time-series data correctly.
This comes back to designing your tables with a query-based approach. If you have a query requirement to support queries on time series data for all (or multiple) customers at once, then you may need a table designed to support that.
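A common way to keep customer_id in the partition key without letting partitions grow unbounded is to add a time bucket to the key. A minimal sketch; the monthly bucket and the function name are illustrative choices, not a prescribed scheme:

```python
from datetime import datetime, timezone

def partition_key(customer_id: str, ts: datetime) -> tuple:
    # Composite partition key: (customer_id, month bucket).
    # CQL equivalent (hypothetical table):
    #   PRIMARY KEY ((customer_id, month), event_time)
    # The monthly bucket is an example; choose a bucket size that keeps
    # each partition well under Cassandra's practical size limits.
    return (customer_id, ts.strftime("%Y-%m"))

ts = datetime(2023, 5, 17, 12, 0, tzinfo=timezone.utc)
print(partition_key("acme", ts))  # → ('acme', '2023-05')
```

Queries for one customer's recent data then hit a small, known set of partitions, while a cross-customer time-series view would need its own query-driven table.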

Windows Azure table access latency Partition keys and row keys selection

We have a Windows Azure Table Storage system where various entity types report values during the day, so we've got the following partition and row key scheme:
There are about 4000-5000 entities. There are 6 entity types, roughly evenly distributed, so around 800 each.
PartitionKey: entityType-Date
RowKey: entityId
Each row records the values for an entity for that particular day. This is currently JSON serialized.
The data is quite verbose.
We will periodically want to look back at the values in these partitions over a month or two months depending on what our website users want to look at.
We are having a problem in that if we want to query a month of data for one entity we find that we have to query 31 partition keys by entityId.
This is very slow initially but after the first call the result is cached.
Unfortunately the nature of the site is that there will be a varying number of different queries so it's unlikely the data will benefit much from caching.
We could obviously make the partitions bigger i.e. perhaps a whole week of data and expand the rowKeys to entityId and date.
What other options are open to me, or is simply the case that Windows Azure tables suffer fairly high latency?
Some options include:
Make the 31 queries in parallel.
Make a single query on a partition key range, that is:
PartitionKey >= entityType-StartDate and PartitionKey <= entityType-EndDate and RowKey = entityId.
It is possible that depending on your data, this query may have less latency than your current query.
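The range-query option works because yyyy-MM-dd dates sort lexicographically, so one month of an entity's partitions forms a single contiguous key range. A toy in-memory sketch (the rows, names, and date format are hypothetical):

```python
# Partition keys of the form entityType-Date, with dates formatted
# yyyy-MM-dd so they sort lexicographically. Hypothetical rows standing
# in for the table:
rows = [
    ("sensor-2023-01-05", "entity-42", "payload-a"),
    ("sensor-2023-01-20", "entity-42", "payload-b"),
    ("sensor-2023-02-03", "entity-42", "payload-c"),
    ("sensor-2023-01-20", "entity-99", "payload-d"),
]

def month_query(entity_type, start_date, end_date, entity_id):
    # Equivalent OData filter, served as one range scan instead of 31 point
    # queries:
    #   PartitionKey ge 'sensor-2023-01-01' and
    #   PartitionKey le 'sensor-2023-01-31' and
    #   RowKey eq 'entity-42'
    lo = f"{entity_type}-{start_date}"
    hi = f"{entity_type}-{end_date}"
    return [r for r in rows if lo <= r[0] <= hi and r[1] == entity_id]

print(month_query("sensor", "2023-01-01", "2023-01-31", "entity-42"))
```

Note the range scan still filters by RowKey within each partition, so it trades 31 round trips for one larger scan; which is faster depends on the data.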
