I have an Excel file connected to an Access database. I have created a query through Power Query that simply brings the target table into the file and does a couple of minor things to it. I don’t load this to a worksheet but maintain a connection only.
I then have a number of other queries linking to the table created in the first query.
In one of these linked queries, I apply a variety of filters to exclude certain products, customers and so on. This reduces the 400,000 records in the original table in the first query down to around 227,000 records. I then load this table to a worksheet to do some analysis.
Finally, I have a couple of queries looking at the 227,000-record table. However, I notice that when I refresh these queries and watch the progress in the right-hand pane, they still churn through 400,000 records, as if they are reading all the way back to the original table.
Is there any way to stop this happening? I would expect that to speed up queries that refer to datasets that have themselves already been filtered.
Alternatively is there a better way to do what I’m doing?
Thanks
First: How are you refreshing your queries? If you execute them one at a time then yes, they're all independent. However, when using Excel 2016 on a workbook where "Fast Data Load" is disabled on all queries, I've found that a Refresh All does cache and share query results with downstream queries!
Failing that, you could try the following:
1. Move the query that makes the 227,000-row table into its own group called "Refresh First".
2. Place your cursor in your 227,000-row table and click Data - Get & Transform - From Table.
3. Change all of your queries to pull from this new query rather than the source.
4. Create another group called "Refresh Second" that contains every query that is downstream of the query you created in step 2, and loads data to the workbook.
5. Move any remaining queries that load to the workbook into "Refresh First", "Refresh Second", or some other group. (By the way: I usually also have a "Connections" group that holds every query that doesn't load data to the workbook, too.)
Unfortunately, once you do this, "Refresh All" would have to be done twice to ensure all source changes are fully propagated, because those 227,000 rows will be used before they've been updated from the 400,000. If you're willing to put up with this and refresh manually then you're all set! You can right-click and refresh query groups. Just right-click and refresh the first group, wait, then right-click and refresh the second one.
For a more idiot-proof way of refreshing... you could try automating it with VBA, but queries normally refresh in the background; it will take some extra work to ensure that the queries in the second group aren't started before all of the queries in your "Refresh First" group have completed.
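If you do go the VBA route, here's a rough sketch of the idea; the connection names are placeholders (Power Query connections are usually named "Query - <QueryName>"), and turning off BackgroundQuery forces each refresh to finish before the next one starts:

' Hypothetical example: refresh the "Refresh First" connections synchronously,
' then the "Refresh Second" ones, so downstream queries never start early.
Sub RefreshGroupsInOrder()
    Dim firstGroup As Variant, secondGroup As Variant, nm As Variant
    Dim cn As WorkbookConnection

    firstGroup = Array("Query - Filtered227k")                     ' placeholder names
    secondGroup = Array("Query - Analysis1", "Query - Analysis2")  ' placeholder names

    For Each nm In firstGroup
        Set cn = ThisWorkbook.Connections(nm)
        cn.OLEDBConnection.BackgroundQuery = False   ' wait for this refresh to complete
        cn.Refresh
    Next nm

    For Each nm In secondGroup
        Set cn = ThisWorkbook.Connections(nm)
        cn.OLEDBConnection.BackgroundQuery = False
        cn.Refresh
    Next nm
End Sub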
Or... I've learned to strike a balance between fidelity in the real world and speed when developing by doing the following:
Create a query called "ProductionMode" that returns true if you want full data, or false if you're just testing each query. This can just be a parameter if you like.
Create a query called "fModeSensitiveQuery" defined as
let
    // Get this once each time this function is retrieved and cached, OUTSIDE of
    // what happens each time the returned function is executed
    queryNameSuffix = if ProductionMode then
        ""
    else
        " Cached",

    // We can now use the pre-rendered queryNameSuffix value as a private
    // variable that's not computed each time the function is called
    returnedFunction = (queryName as text) as table => Expression.Evaluate(
        Expression.Identifier(queryName & queryNameSuffix),
        #shared
    )
in
    returnedFunction
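For reference, the ProductionMode query from the first step can be as simple as a one-line query (or a Boolean parameter) that you flip by hand:

// ProductionMode
true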
For each slow query ("YourQueryName") that loads to a table:
1. Create "YourQueryName Cached" as a query that pulls straight from the results table (see the sketch after this list).
2. Create "modeYourQueryName" as a query defined as fModeSensitiveQuery("YourQueryName").
3. Change all queries that use YourQueryName to use modeYourQueryName instead.
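As a minimal sketch (assuming "YourQueryName" is loaded to a worksheet table of the same name), "YourQueryName Cached" can simply read the previously loaded results back from the workbook:

let
    // Read the already-loaded results from the Excel table named "YourQueryName"
    Source = Excel.CurrentWorkbook(){[Name="YourQueryName"]}[Content]
in
    Source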
Now you can flip ProductionMode to true and changes propagate completely, or flip ProductionMode to false and you can test small changes quickly; if you're refreshing just one query it isn't recomputing the entire upstream to test it! Plus, I don't know why, but when doing a Refresh All I'm pretty sure it also speeds up the whole thing even when ProductionMode is true!!!
This method has three caveats that I'm aware of:
Be sure to update your "YourQueryName Cached" query any time the "YourQueryName" query's resulting columns are added, removed, renamed, or typed differently. Or better yet, delete and recreate them. You can do this because,
Power Query won't recognize your "YourQueryName" and "YourQueryName Cached" queries as dependencies of "modeYourQueryName". The Query Dependencies diagram won't be quite right, you'll be able to delete "YourQueryName" or "YourQueryName Cached" without Power Query stopping you, and renaming YourQueryName will break things instead of Power Query automatically changing all of your other queries accordingly.
While faster, the user-experience is a rougher ride, too! The UI gets a little jerky because (and I'm totally guessing, here) this technique seems to cause many more queries to finish simultaneously, flooding Excel with too many repaint requests at the same time. (This isn't a problem, really, but it sure looks like one when you aren't expecting it!)
I want to delete more than 1 million user records in Kentico 10.
I tried deleting them with UserInfoProvider.DeleteUser() (see the following documentation), but a simple calculation suggests it would take nearly a year.
https://docs.kentico.com/api10/configuration/users#Users-Deletingauser
Since that's only a rough calculation it may actually take a bit less, but it would still take far too long.
Is there any other way to delete users in a short time?
Of course, make sure you have a backup of your database before you do any of this.
Depending on the features you're using, you could get away with a SQL statement. Because a user is referenced by multiple other tables, the SQL statement can get pretty complex, and you need to make sure you remove those other references before removing the actual user record.
I'd highly recommend an API approach and deleting users through the API so it removes all the references for you automatically. In your API calls, make sure you wrap the delete action in the following so it skips event logging and other labor-intensive activities that aren't needed.
using (var context = new CMSActionContext())
{
    context.DisableAll();

    // delete your user
}
In your code, I'd select only the top 100 or so at a time and delete them in batches. Assuming you don't need this done all in one run, you could let a scheduled task run your custom code for a week and see where you're at.
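As a rough sketch of that batched approach (the where condition is a placeholder for however you identify the users to remove, and the code assumes Kentico 10's UserInfoProvider API):

using System.Linq;
using CMS.Base;
using CMS.Membership;

public static void DeleteUsersInBatches()
{
    using (var context = new CMSActionContext())
    {
        // Suppress event logging and other expensive side effects
        context.DisableAll();

        while (true)
        {
            // Take a small batch so each pass stays cheap
            var batch = UserInfoProvider.GetUsers()
                                        .Where("UserEnabled = 0")   // placeholder condition
                                        .TopN(100)
                                        .ToList();

            if (batch.Count == 0)
            {
                break;
            }

            foreach (var user in batch)
            {
                // Removes the user and cleans up its references
                UserInfoProvider.DeleteUser(user);
            }
        }
    }
}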
If all else fails, figure out how to delete the user and the 70+ foreign key references and you'll be golden.
Why don't you delete them with a SQL query? I believe it will be much faster.
Bulk delete functionality exists starting from version 10.
UserInfoProvider has a BulkDelete method. Actually, any InfoProvider object inherited from AbstractInfoProvider has a BulkDelete method.
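For example (a hedged sketch only: the exact access pattern and signature may differ between versions, and the where condition is a placeholder):

using CMS.DataEngine;
using CMS.Membership;

// Assumption: BulkDelete is reached via the provider instance and takes a where condition
UserInfoProvider.ProviderObject.BulkDelete(new WhereCondition().WhereEquals("UserEnabled", 0));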
How can I delete a row from Cassandra and get the value it had just before the deletion?
I could execute a SELECT and DELETE query in series, but how can I be sure that the data was not altered concurrently between the execution of those two queries?
I've tried to execute the SELECT and DELETE queries in a batch but that seems to be not allowed.
cqlsh:foo> BEGIN BATCH
... SELECT * FROM data_by_user WHERE user = 'foo';
... DELETE FROM data_by_user WHERE user = 'foo';
... APPLY BATCH;
SyntaxException: line 2:4 mismatched input 'SELECT' expecting K_APPLY (BEGIN BATCH [SELECT]...)
In my use case I have one main table that stores data for items, and I've built several tables that allow me to look up items based on that information.
If I delete an item from the main table, I must also remove it from the other tables.
CREATE TABLE items (id text PRIMARY KEY, owner text, liking_users set<text>, ...);
CREATE TABLE owned_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
CREATE TABLE liked_items_by_user (user text, item_id text, PRIMARY KEY ((user), item_id));
...
I'm afraid the tables might contain wrong data if I delete an item while, at the same time, someone e.g. hits the like button of that same item:
1. The deleteItem method executes a SELECT query to fetch the current row of the item from the main table.
2. The likeItem method, running at the same time, executes an UPDATE query and inserts the item into the owned_items_by_user, liked_items_by_user, ... tables. This happens after the SELECT statement has run, and the UPDATE runs before the DELETE.
3. The deleteItem method deletes the item from the owned_items_by_user, liked_items_by_user, ... tables based on the data just retrieved via the SELECT statement. That data does not yet contain the just-added like, so the item is deleted but the just-added like remains in the liked_items_by_user table.
You can do a select beforehand, then do a lightweight transaction on the delete to ensure that the data still looks exactly like it did when you selected. If it does, you know the latest state before you deleted. If it does not, keep retrying the whole procedure until it sticks.
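For example, using the items table from the question (the value in the IF clause stands in for whatever the preceding SELECT returned; compare whichever non-key columns matter to you):

-- Read the current state first
SELECT * FROM items WHERE id = 'item-1';

-- Lightweight transaction: only delete if the row still matches what was read
DELETE FROM items WHERE id = 'item-1'
IF owner = 'foo';

-- [applied] = false in the result means the row changed after the SELECT: retry the whole procedure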
Unfortunately you cannot do a SELECT query inside a batch statement. If you read the docs here, only insert, update, and delete statements can be used.
What you're looking for is atomicity on the execution, but batch statements are not going to be the way forward. If the data has been altered, your worst case situation is zombies, or data that could reappear.
Cassandra uses a grace period mechanism to deal with this, you can find the details here. If for whatever reason this is critical to your business logic, the "best" thing you can do in this situation is to increase the consistency level, or restructure the read pattern at the application level to not rely on perfect atomicity, whichever is the right trade-off for you. So either you give up some of the performance, or tune down the requirement.
In practice, QUORUM should be more than enough to satisfy most situations most of the time. Alternatively, you can do an ALL, and you pay the performance penalty, but that means all replicas for the given foo partition key will have to acknowledge the write both in the commitlog and the memtable. Note, this still means a flush from the commitlog will need to happen before the delete is complete, but you can tune the consistency to the level you require.
You don't have atomicity in the SQL sense, but depending on throughput it's unlikely that you will need it (touch wood).
TLDR:
USE CONSISTENCY ALL;
DELETE FROM data_by_user WHERE user = 'foo';
That should do the trick. The error you're seeing now comes from the ANTLR3 grammar parser for CQL 3, which is not designed to accept SELECT queries inside batches simply because they are not supported; you can see that here.
I have a MongoDB database which I need to update daily (delete non-relevant documents and add new ones).
The DB is not sharded.
I take the data from an external data master which is not so easy to work with.
There are 2 options:
1. Re-ingest the entire DB (not so big) into a temp collection and then rename it to the old collection name (with dropTarget set to true).
2. Do the hard work myself: delete the old entries, and figure out from the data master which new documents are relevant and insert them into the DB.
Option 1 is obviously preferable, but what is the impact? I'm doing this maintenance at a late hour, but I don't want users to get errors when querying the DB during the rename process.
Is using rename to overwrite a collection a standard way to get things done or am I abusing the API ? :)
According to the documentation, renameCollection blocks all database activity for the duration of the operation. If your users have set a sufficiently large timeout, they will not be directly affected by this rename operation; however, as the dataset can change under their feet, there might be side effects. For example, renaming a collection can invalidate open cursors, which interrupts queries that are currently returning data.
Regarding renaming of collections in production, personally I would avoid this where possible, firstly because of the cursor issue above, but more importantly because an incomplete renameCollection operation can leave the target collection in an unusable state and require manual intervention to clean up. Instead I would use an update with upsert:true that overwrites the entire document or inserts a new record if it doesn't exist.
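As a sketch in the mongo shell (collection and field names are placeholders):

// Overwrite the whole document, or insert it if it doesn't exist yet
db.items.update(
    { _id: "item-1" },                                      // match on the key
    { _id: "item-1", name: "Foo", updatedAt: new Date() },  // full replacement document
    { upsert: true }
);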
I am working on an ASP.NET Web Forms project and I use jQuery DataTables to visualize data fetched from SQL Server. I need to pass the results for the current page and the total number of results, for which so far I have this code:
var queryResult = query.Select(p => new[] { p.Id.ToString(),
                                            p.Name,
                                            p.Weight.ToString(),
                                            p.Address })
                       .Skip(iDisplayStart)
                       .Take(iDisplayLength)
                       .ToArray();
and the result that I get when I return it to the view like:
iTotalRecords = queryResult.Count(),
is the number of records that the user has chosen to see per page. Logical, but I hadn't thought about it while building my method chaining. Now I'm thinking about the optimal way to implement this. Since it's likely to be used with relatively large amounts of data (10,000 rows, maybe more), I would like to leave as much of the work as I can to the SQL server. However, I found several questions about this, and the impression I get is that I have to make two queries to the database, or manipulate the total result in my code. But I think this won't be efficient when you have to work with many records.
So what can I do here to get best performance?
Regarding what you're looking for, I don't think there is a simple answer.
I believe the only way you can currently do this is by running more than one query, as you have already suggested, whether that is encapsulated inside a stored procedure (SPROC) call or generated by EF.
However, I believe you can make optimisations to make your query run quicker.
First of all, every query execution MAY result in the query being recached as you are chaining your methods together; this means that the actual query being executed will need to be recompiled and cached by SQL Server (if that is your chosen technology) before being executed. This normally only takes a few milliseconds, but if the query being executed itself only takes a few milliseconds then this is relatively expensive.
Entity Framework will translate this LINQ query and execute it using derived tables. With a small result set of approx. 1k records to be paged, your current solution may be best suited. This would also depend upon how complex your SQL filtering is, as generated by your method chaining.
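For example, the simple approach is just two round trips (a sketch; query is the same IQueryable built in the question):

// One query for the total count...
var totalRecords = query.Count();

// ...and one for the requested page (Skip/Take need a stable ordering)
var pageRows = query.OrderBy(p => p.Id)
                    .Skip(iDisplayStart)
                    .Take(iDisplayLength)
                    .ToList()   // materialize just this page, then format it in memory
                    .Select(p => new[] { p.Id.ToString(), p.Name, p.Weight.ToString(), p.Address })
                    .ToArray();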
If your result set to be paged grows towards 15k records, I would suggest writing a SPROC to get the best performance and scalability. It would insert the records into a temp table and run two queries against it: first to get the paged records, and second to get the total count.
alter proc dbo.usp_GetPagedResults
    @Skip int = 10,
    @Take int = 10
as
begin

    select
        row_number() over (order by id) [RowNumber],
        t.Name,
        t.Weight,
        t.Address
    into
        #results
    from
        dbo.MyTable t

    declare @To int = @Skip + @Take - 1

    select * from #results where RowNumber between @Skip and @To

    select max(RowNumber) from #results

end
go
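A quick usage example (note that @Skip here is effectively the first row number to return):

-- Returns rows 1-10 in the first result set and the total row count in the second
exec dbo.usp_GetPagedResults @Skip = 1, @Take = 10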
You can use EF to map the SPROC call to entity types, or create a new custom type containing the results and the total count.
Stored Procedures with Multiple Results
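As a hedged sketch of that mapping, based on the multiple-result-sets approach linked above (MyDbContext and PagedResultRow are placeholder types; PagedResultRow's properties would match the columns of the SPROC's first result set):

using System;
using System.Data;
using System.Data.Entity.Infrastructure;
using System.Data.SqlClient;
using System.Linq;

public static void GetPagedResults(int iDisplayStart, int iDisplayLength)
{
    using (var db = new MyDbContext())
    {
        var cmd = db.Database.Connection.CreateCommand();
        cmd.CommandText = "dbo.usp_GetPagedResults";
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.Add(new SqlParameter("@Skip", iDisplayStart + 1));  // first row number
        cmd.Parameters.Add(new SqlParameter("@Take", iDisplayLength));

        db.Database.Connection.Open();
        try
        {
            using (var reader = cmd.ExecuteReader())
            {
                // First result set: the page of records
                var objectContext = ((IObjectContextAdapter)db).ObjectContext;
                var page = objectContext.Translate<PagedResultRow>(reader).ToList();

                // Second result set: the total number of rows
                reader.NextResult();
                reader.Read();
                var totalRecords = Convert.ToInt32(reader[0]);

                // ... bind page and totalRecords to the DataTables response as needed
            }
        }
        finally
        {
            db.Database.Connection.Close();
        }
    }
}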
I found that the cost of running the above SPROC was approximately a third of running the query which EF generated to get the same result, based upon a result set size of 15k records. It was, however, three times slower than the EF query for a 1k-record result set, due to the temp table creation.
Encapsulating this logic inside a SPROC allows the query to be refactored and optimised as your result set grows, without having to change any application code.
The suggested solution doesn't simply wrap the derived-table query created by Entity Framework inside a SPROC, as I found there was only a marginal performance difference between running that SPROC and running the query directly.
The attributes for the <jdbc:inbound-channel-adapter> component in Spring Integration include data-source, sql and update. These allow for separate SELECT and UPDATE statements to be run against tables in the specified database. Both sql statements will be part of the same transaction.
The limitation here is that both the SELECT and UPDATE will be performed against the same data source. Is there a workaround for the case when the UPDATE will be on a table in a different data source (not just a separate database on the same server)?
Our specific requirement is to select rows in a table which have a timestamp prior to a specific time. That time is stored in a table in a separate data source. (It could also be stored in a file). If both sql statements used the same database, the <jdbc:inbound-channel-adapter> would work well for us out of the box. In that case, the SELECT could use the time stored, say, in table A as part of the WHERE clause in the query run against table B. The time in table A would then be updated to the current time, and all this would be part of one transaction.
One idea I had was, within the sql and update attributes of the adapter, to use SpEL to call methods in a bean. The method defined for sql would look up a time stored in a file, and then return the full SELECT statement. The method defined for update would update the time in the same file and return an empty string. However, I don't think such an approach is failsafe, because the reading and writing of the file would not be part of the same transaction that the data source is using.
If, however, the update was guaranteed to fire only upon commit of the data source transaction, that would work for us. In the event of a failure, the database transaction would commit, but the file would not be updated. We would then get duplicate rows, but we should be able to handle that. The issue would be if the file was updated and the database transaction failed. That would mean lost messages, which we could not handle.
If anyone has any insights as to how to approach this scenario, it would be greatly appreciated.
Use two different channel adapters with a pub-sub channel, or an outbound gateway followed by an outbound channel adapter.
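A rough sketch of the first option (two JDBC adapters sharing a publish-subscribe channel; bean names, channels, and SQL are placeholders, not taken from your configuration, and the time-based WHERE clause is omitted for brevity):

<!-- The SELECT polls data source A; note there is no update attribute here -->
<int-jdbc:inbound-channel-adapter channel="selectedRows"
                                  data-source="dataSourceA"
                                  query="SELECT * FROM table_b">
    <int:poller fixed-delay="5000"/>
</int-jdbc:inbound-channel-adapter>

<int:publish-subscribe-channel id="selectedRows"/>

<!-- Subscriber 1: the UPDATE runs against data source B -->
<int-jdbc:outbound-channel-adapter channel="selectedRows"
                                   data-source="dataSourceB"
                                   query="UPDATE cutoff_table SET cutoff = CURRENT_TIMESTAMP"/>

<!-- Subscriber 2: pass the selected rows on for normal processing -->
<int:bridge input-channel="selectedRows" output-channel="processChannel"/>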
If necessary, start the transaction(s) upstream of both; if you want true atomicity you would need to use an XA transaction manager and XA datasources. Or, you can get close by synchronizing the two transactions so they get committed very close together.
See Dave Syer's article "Distributed transactions in Spring, with and without XA" and specifically the section on Best Efforts 1PC.