Right now my workflow looks like this:
get a list of rows from a Postgres database (let's say 10,000)
for each row, call an API endpoint to get a value, so 10,000 values returned from the API
for each row that got a value back, update a field in the database, so 10,000 rows updated
Right now I am doing an update after each API fetch, but as you can imagine this isn't the most optimized way.
What other options do I have?
The bottleneck in that code is probably fetching the data from the API. The trick below only lets you send many small queries to the DB faster, without waiting for a round trip between each update.
To run multiple updates in a single query, you can use common table expressions and pack several small update queries into one CTE query:
https://runkit.com/embed/uyx5f6vumxfy
knex
.with('firstUpdate', knex.raw('?', [knex('table').update({ colName: 'foo' }).where('id', 1)]))
.with('secondUpdate', knex.raw('?', [knex('table').update({ colName: 'bar' }).where('id', 2)]))
.select(1)
The knex.raw trick there is a workaround, since the .with(string, function) implementation has a bug.
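As a rough sketch (not part of the original answer), here is how the same trick could be extended to the 10,000 rows from the question, assuming the rows look like [{ id, value }, ...] and reusing the 'table'/'colName' names from the example above; the batch size and helper name are illustrative:
// Pack the per-row UPDATEs into CTE batches so each batch is one round trip.
// Assumes an initialized knex instance and rows shaped like [{ id, value }, ...].
async function updateInBatches(knex, rows, batchSize = 500) {
  for (let i = 0; i < rows.length; i += batchSize) {
    const batch = rows.slice(i, i + batchSize);
    let query = null;
    batch.forEach((row, idx) => {
      const update = knex.raw('?', [
        knex('table').update({ colName: row.value }).where('id', row.id),
      ]);
      // same knex.raw workaround as above, one CTE per row in the batch
      query = query === null ? knex.with(`u${idx}`, update) : query.with(`u${idx}`, update);
    });
    await query.select(1); // executes all UPDATEs of this batch in a single query
  }
}
This keeps the number of database round trips at roughly rows.length / batchSize instead of one per row, while the API fetching remains the dominant cost.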
I'll try to make this as clear as possible so that a full example isn't required; this has to be a concept I didn't grasp properly and am struggling with, rather than a problem with the data or the Spark code itself.
I'm required to insert each city's data into its own database (MongoDB), and I'm trying to perform those upserts as fast as possible.
Take as an example a DataFrame with the following columns, where I want to do some upserts against MongoDB keyed on, for example, year, city and zone:
year - city - zone - num_business - num_vehicles
Having grouped by those columns, all that remains is to perform the upsert into the DB.
Using the MongoDB driver, I'm required to instantiate several WriteConfigs to cope with the multiple databases (one database per city).
// the 'getDatabaseWriteConfigsPerCity' method filters the 'df' so it only contains the docs from a single city.
for (cityDBConnection <- getDatabaseWriteConfigsPerCity(df)) {
  cityDBConnection.getDf.foreach(
    ... // set MongoDB upsert criteria.
  )
}
Doing it that way works, but more performance could be gained by using foreachPartition, since I hope the records within the DF are spread across the executors so that more data is upserted concurrently.
However, I get erroneous results when using foreachPartition. Erroneous because they seem incomplete: counters are way off, and so on.
I suspect this is because the same keys end up in different partitions, and it's not until those are merged on the master that they would be inserted into MongoDB as a single record.
Is there any way I can make sure each partition contains all of the documents related to an upsert key?
I don't really know if I'm being clear enough, but if it's still too complicated I will update the question as soon as possible.
Is there any way I can make sure each partition contains all of the documents related to an upsert key?
If you do:
df.repartition("city").foreachPartition{...}
you can be sure that all records with the same city end up in the same partition (but there will probably be more than one city per partition!).
Using node-postgres I want to update columns in my user model. At present I have this:
async function update_user_by_email(value, column, email) {
  const sql = format('UPDATE users SET %I = $1 WHERE email = $2', column);
  await pool.query(sql, [value, email]);
}
So I can do this
await update_user_by_email(value, column_name, email_address);
However, if I want to update multiple columns and values, I am currently doing something very inefficient and calling that method X times, i.e. once per column:
await update_user_by_email(value, column_name, email_address);
await update_user_by_email(value_2, column_name_2, email_address);
await update_user_by_email(value_3, column_name_3, email_address);
How can I generate this with just one call to the database?
Thanks
You have a few options here:
node-postgres allows you to create queries based on prepared statements (this builds on PostgreSQL's native prepared statements).
These are recommended by Postgres for populating a table, as a secondary option to using its COPY command. You would end up running more SQL statements (probably one per row), but the advantages of prepared statements are supposed to offset this somewhat.
You can also combine this with transactions, also mentioned in the postgres "populate" link above.
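As an illustration of this first option (not part of the original answer), here is a minimal sketch of a named prepared statement run inside a transaction with node-postgres; the statement name, the updated column, and the shape of 'updates' are assumptions:
// Sketch: one named prepared statement, reused for every row, inside a transaction.
// Assumes the question's 'users' table; 'updates' is [{ value, email }, ...].
const { Pool } = require('pg');
const pool = new Pool();

async function update_users(updates) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    for (const { value, email } of updates) {
      // node-postgres prepares the named statement once per connection and reuses it
      await client.query({
        name: 'update-user-name',
        text: 'UPDATE users SET name = $1 WHERE email = $2',
        values: [value, email],
      });
    }
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}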
Another option is the approach taken by a different library called pg-promise (specifically its helpers module). The pg-promise helpers literally build the SQL statement (as a string) for a bulk insert/update. That way you can have a single statement that updates/inserts thousands of rows at a time.
It's also possible (and relatively easy) to custom-build your own SQL helpers, or to supplement pg-promise, by pulling structural data directly from the information_schema tables and columns views.
One of the more tedious things about pg-promise is having to give it all the column names (and sometimes definitions, default values, etc.). If you're working with dozens or hundreds of separate tables, auto-generating this info directly from the database itself is probably simpler and more robust (you don't have to update arrays of column names every time you change your database).
NOTE: You don't need to use pg-promise to submit queries generated by their helpers library. Personally, I like node-postgres better for actual db communications, and typically only use the pg-promise helpers library for building those bulk SQL statements.
NOTE 2: It's worth noting that pg-promise wrote its own SQL injection protection (by escaping single quotes in values and double quotes in table/column names). The same would need to be done for the third option, whereas prepared statements are natively protected from SQL injection by the database server itself.
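For the original question (several columns of one row in a single call), here is a hedged sketch of the "build your own helper" approach, reusing the question's node-postgres and pg-format setup; the helper signature and the example column names are illustrative:
const format = require('pg-format');
const { Pool } = require('pg');
const pool = new Pool();

// 'updates' is a plain object: { column_name: value, ... }
async function update_user_by_email(updates, email) {
  const columns = Object.keys(updates);
  const values = Object.values(updates);
  // %I escapes each column identifier; the values stay parameterized ($1, $2, ...)
  const setClause = columns
    .map((col, i) => `${format('%I', col)} = $${i + 1}`)
    .join(', ');
  const sql = `UPDATE users SET ${setClause} WHERE email = $${columns.length + 1}`;
  await pool.query(sql, [...values, email]);
}

// one round trip instead of three:
await update_user_by_email({ name: value, city: value_2, phone: value_3 }, email_address);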
I am working on an ASP.NET Web Forms project and I use the jQuery DataTables plugin to visualize data fetched from SQL Server. I need to pass both the results for the current page and the total number of results, for which so far I have this code:
var queryResult = query.Select(p => new[] { p.Id.ToString(),
p.Name,
p.Weight.ToString(),
p.Address })
.Skip(iDisplayStart)
.Take(iDisplayLength).ToArray();
The result that I get when I return this to the view, i.e.
iTotalRecords = queryResult.Count(),
is the number of records the user has chosen to see per page. That's logical, but I hadn't thought about it while building my method chain. Now I'm thinking about the optimal way to implement this. Since it's likely to be used with relatively large amounts of data (10,000 rows, maybe more), I would like to leave as much work as possible to the SQL server. However, from the several similar questions I've found, the impression I get is that I either have to make two queries to the database or handle the full result set in my code, which I don't think will be efficient with many records.
So what can I do here to get the best performance?
Regarding what you're looking for, I don't think there is a simple answer.
I believe the only way you can currently do this is by running more than one query, as you have already suggested, whether that is encapsulated inside a stored procedure (SPROC) call or generated by EF.
However, I believe you can make optimisations so that your query runs quicker.
First of all, because you are chaining your methods together, every execution MAY result in the query plan being recompiled and re-cached by SQL Server (if that is your chosen technology) before being executed. This normally only takes a few milliseconds, but if the query itself only takes a few milliseconds, that overhead is relatively expensive.
Entity Framework will translate this LINQ query and execute it using derived tables. With a small result set of approx. 1k records to be paged, your current solution may be best suited. This will also depend on how complex the SQL filtering generated by your method chaining is.
If the result set to be paged grows towards 15k records, I would suggest writing a SPROC to get the best performance and scalability: insert the records into a temp table and run two queries against it, first to get the paged records and second to get the total count.
alter proc dbo.usp_GetPagedResults
    @Skip int = 10,
    @Take int = 10
as
begin
    select
        row_number() over (order by id) [RowNumber],
        t.Name,
        t.Weight,
        t.Address
    into
        #results
    from
        dbo.MyTable t

    declare @To int = @Skip + @Take - 1

    select * from #results where RowNumber between @Skip and @To
    select max(RowNumber) from #results
end
go
You can use the EF to map a SPROC call to entity types or create a new custom type containing the results and the number of results.
Stored Procedures with Multiple Results
I found that the cost of running the above SPROC was approximately a third of running the query EF generated to get the same result, based on a result set of 15k records. It was, however, three times slower than the EF query for a 1k record result set, due to the temp table creation.
Encapsulating this logic inside a SPROC allows the query to be refactored and optimised as your result set grows, without having to change any application code.
The suggested solution doesn't simply wrap the derived-table query generated by Entity Framework inside a SPROC, as I found there was only a marginal performance difference between running that SPROC and running the query directly.
I have a requirement to update all users with a specific value in a job.
I have about a million users in my Cassandra database. Is it okay to query a million users first and then do some kind of batch update, or is there an existing implementation for this kind of work? I am using the Hector API to interact with Cassandra. What would be the best possible way to do this?
You never want to fetch 1 million users and keep them locally. Ideally you want to iterate over all those user keys using a range query; Hector calls this a RangeSliceQuery. There is a good example here:
http://irfannagoo.wordpress.com/2013/02/27/hector-slice-query-options-with-cassandra/
For the start and end key, use null, and also add rangeQuery.setRowCount(100) to fetch 100 rows at a time.
Do this inside a loop. On the first iteration you fetch with null as both the start and end key; the last key you get from that result set becomes the start key of your next query, and you continue paginating like this.
You can then use batch mutate and update in batches.
http://hector-client.github.io/hector/source/content/API/core/1.0-1/me/prettyprint/cassandra/service/BatchMutation.html
I wanted to know how to put a 10-result limit on a Redis query. I'm using a Node.js library and streamline.js.
Basically, I run HGETALL as a command, but the docs state that the SORT command has an option for LIMIT. I was just wondering if there is any way to apply a limit in Redis. Here is a sample of one of the queries:
members.hgetall(All,_);
HGETALL retrieves all members (fields & values) of a specific hash key. All of them, without limitation.
SORT, on the other hand, operates on Lists, Sets and Sorted Sets. It returns the members of these structures in an ordered manner, as dictated by SORT's parameters; see the SORT documentation.
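As an illustration (not part of the original answer), here is a hedged sketch of SORT with LIMIT using the classic node_redis callback API; it assumes a hypothetical Set 'member:ids' holding numeric ids and hashes named 'member:<id>':
const redis = require('redis');
const client = redis.createClient();

// SORT works on lists/sets/sorted sets; LIMIT caps the reply at 10 elements,
// and the GET patterns pull fields out of the related hashes ('->' syntax).
client.sort(
  'member:ids',
  'LIMIT', 0, 10,
  'GET', '#',               // the id itself
  'GET', 'member:*->name',  // the 'name' field of each member:<id> hash
  (err, reply) => {
    if (err) throw err;
    console.log(reply);     // flat array: [id1, name1, id2, name2, ...]
  }
);
So while HGETALL itself cannot be limited, keeping the hash keys in a Set (or List) lets SORT's LIMIT and GET options page through the related hashes ten at a time.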