Creating a large number of Vendors at once - acumatica

I have a scenario in which I want to create a large number of Vendors programmatically in one go (around 500,000). Performance is very important. I believe the standard approach would be to use something like Import Scenarios, but this is quite slow.
Another option is to use the VendorMaint graph programmatically to create new Vendors and use the graph to save them. However, this is also quite slow. Is there another approach to create such a large number of vendors at once?
Can I use something like a PXDatabase.Insert, or are there other options? It would be good to know how to use multiple CPU cores if available.
I came across this article which mentions the use of PXDatabase, but I am not sure whether it is a standard approach or whether there are implications I should be aware of.
http://blog.zaletskyy.com/save-dac-class-in-acumatica-to-db

Related

Persisting only part of a data source

I'm using intake to access the catalog entry catalog.ocean.GFDL_CM2_6.GFDL_CM2_6_control_ocean_surface.
At the moment I only work with small patches of that data, but accessing that data every single time is still quite costly (it's on Google Cloud Storage). So I want to use the persist option of intake to store that data locally. However, as far as I've understood from the docs, one can only persist the whole dataset. For that specific dataset that would cost almost 400 dollars at $0.10 per GB, since the total data is 3976 GB.
Hence my questions:
Is there a way (especially for a zarr file, which in theory should make this quite easy) to persist only parts of the data (for instance only a subset of the variables)?
This is probably more complicated, but can I push things further, by persisting regions of data I'm interested in (in terms of coordinates values for instance)?
There is no direct Intake way to do what you are asking for. Intake was conceived as a way to get your data into a format that you can then manipulate as you normally do, i.e., deal with only the loading part, so that a persisted data-set is the same as the original.
However, it is not hard to accomplish manually: you should grab the xarray, filter for the region you need, and call to_zarr to save the new dataset. You can then point a simple catalogue entry like the old one at the new location.
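The manual steps above could look roughly like the following. This is only a sketch: it assumes the catalog entry can be turned into an xarray Dataset (for intake-xarray sources this is normally done with to_dask()), and the variable name, coordinate name, bounds and output path in the usage comment are hypothetical placeholders rather than names taken from the dataset.

```python
def persist_subset(ds, variables, region, path):
    """Write only the chosen variables and coordinate ranges of `ds` to a Zarr store.

    `ds` is expected to be an xarray.Dataset; `region` maps a coordinate name
    to a (low, high) pair of coordinate values.
    """
    subset = ds[list(variables)]  # keep only the requested data variables
    # Label-based selection on each coordinate named in `region`.
    subset = subset.sel({name: slice(lo, hi) for name, (lo, hi) in region.items()})
    subset.to_zarr(path, mode="w")  # overwrite any existing store at `path`
    return subset


# Hypothetical usage (names below are placeholders, not taken from the dataset):
# ds = catalog.ocean.GFDL_CM2_6.GFDL_CM2_6_control_ocean_surface.to_dask()
# persist_subset(ds, ["surface_temp"], {"yt_ocean": (-60, -40)}, "local_patch.zarr")
```

Because the data is loaded lazily, only the chunks that overlap the selected region should need to be fetched from remote storage when to_zarr runs.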
You could have done this manipulation in a driver directly if this was a specific pattern that would repeat a lot. In fact, we have mooted the idea of whether/how to implement such processing steps in Intake, but there is no plan yet. In the end, we may take the work on pipelines in Holoviews to describe processing steps.

Support for multiple byte ranges on Azure blob read/write

We need random read (and later write) access to thousands of discrete ranges (each in the order of a few KBs) within very large binary blobs (in the order of 100s of GB). The current APIs force us to submit a single request for each such range. One negative aspect is billing, of course, but the main problem is the client-side and network loads for handling all these requests!
Are there any known ways of avoiding the massive overhead for access patterns like this?
Assume that reformatting the data is not viable, since the access patterns vary. Replicating the data in a multitude of versions optimized for each access pattern variation is also highly undesirable, for several reasons (optimization lead time, storage costs, data management, plus not all access patterns can be predicted - the known ones might not even be used).
Extending the "Range" REST API header to support multiple ranges would be the ideal solution, but obviously that's not ours to control.
Unfortunately, there is no nice way to do that. The current API (I think you're using the Get Blob API) only supports a single range per request, not multiple ranges; the details are here.
As of now, there is no good workaround for this issue. I saw the User Voice item you submitted; it's good feedback and I have already upvoted it. Hopefully the MS team can implement it in a future release.

Considerations for time-series

We are looking into using Azure Table Storage (ATS) together with Deedle (or other libraries with similar functionality) for our time-series storage, manipulations and calculations. From what I can read, F# also seems like a good choice for operations on arrays.
Our starting point is a set of time-series for energy consumption. The series will either be the consumption within an interval (fixed or irregular intervals) or a counter (from which we can calculate the consumption from one reading to the next). As a data point is just a tag (used as a partition key), timestamp (rowkey) and value, this should be well suited for ATS.
From a user's perspective, they want to do calculations on the series for a given period and resolution, e.g. calculate a third series as a difference between two others, for one given year with monthly resolution.
This raises a number of questions:
Will ATS together with F# be fast enough? If we have 10.000 data points? 100.000? Compared to C#?
Resampling will require calculations of points between the series' timestamps. I haven't seen any Deedle examples for (linear) interpolation, but I assume that this is just passing a function which can look at the necessary data points? Will this be fast enough for our number of points?
The calculations will be determined by the users, so we must store them as configuration. My best guess so far is to have the formula in some format we can parse easily into reverse Polish notation, and take special care of tags that represent series (i.e. read from ATS, resample, then do the operations).
Any comments will be highly appreciated!
I think Isaac already mentioned the most important points, but as this question involves some of the things I'm involved with, I thought I'd share a few additional remarks!
BigDeedle. As Isaac mentioned, I used Azure Table storage in BigDeedle. This is mainly useful if you want to explore data interactively using Deedle APIs and do some filtering and range restriction before getting the data in memory and running your calculations. BigDeedle loads data lazily from a potentially very big external data source. That said, if you eventually need to load all data into memory, this might not be all that useful for you.
The storage model used in BigDeedle might be useful though - it partitions data based on date, so when you want to get values in a given date range, it knows in which partitions to look. In my experience, loading data from ATS works pretty well, especially if you can do it on an MBrace cluster running in Azure (which is what my NDC demo does in the end).
Efficiency. I think the combination should work well for 10k or 100k data points - there will be no difference whether you do this from F# or C#. As for Deedle, I've definitely used it with data sets of this size - we optimize the library "as needed". Most of the functions are quite efficient already, but there may be some operations that are not. This is something that can be fixed if you open an issue on GitHub.
Resampling. There is a built-in function for linear interpolation (see here), but I suspect you may need to write your own custom interpolation. Deedle does not "hide the underlying data" from you, so this is not too hard - the last example on this page shows a custom function for filling missing data that uses linear interpolation. If you are doing something like this, you'll need to have the data in memory (so BigDeedle would not be very useful here).
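To make the interpolation step concrete, here is a minimal language-neutral sketch written in Python. It is not Deedle's API: the function names are made up for illustration, and it assumes timestamps are plain numbers (for example epoch seconds) sorted in ascending order.

```python
from bisect import bisect_left


def interpolate_at(times, values, t):
    """Linearly interpolate a series (sorted `times`, matching `values`) at time `t`."""
    if not times or t < times[0] or t > times[-1]:
        return None  # no extrapolation outside the observed range
    i = bisect_left(times, t)
    if times[i] == t:
        return values[i]  # exact hit, no interpolation needed
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def resample(times, values, new_times):
    """Evaluate the series at each timestamp in `new_times`."""
    return [interpolate_at(times, values, t) for t in new_times]


# Readings at t=0, 10, 20 resampled onto a 5-unit grid.
print(resample([0, 10, 20], [0.0, 100.0, 50.0], [0, 5, 10, 15, 20, 25]))
# -> [0.0, 50.0, 100.0, 75.0, 50.0, None]
```

Each lookup is a binary search, so resampling a series of 100k points onto a similar-sized grid stays cheap once the data is in memory.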
Specifying calculations. I suspect this is a separate question, but F# is great for domain-specific languages. I did a talk on that at an earlier NDC. Generally, you can either specify your own DSL (and parse it) or have an embedded DSL where people write a subset of F#. F# has good support for both.
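As a rough illustration of the "own DSL" route, the sketch below parses an infix formula into reverse Polish notation with the shunting-yard algorithm and evaluates it point by point. It is written in Python for brevity, the tag names are hypothetical, it assumes all series have already been resampled onto the same timestamps, and it does not handle unary minus or malformed input.

```python
import operator
import re

# operator -> (precedence, function); all four are left-associative
OPS = {"+": (1, operator.add), "-": (1, operator.sub),
       "*": (2, operator.mul), "/": (2, operator.truediv)}


def to_rpn(formula):
    """Convert an infix formula to a list of RPN tokens (shunting-yard)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|[-+*/()]", formula)
    output, stack = [], []
    for tok in tokens:
        if tok in OPS:
            # pop operators of higher or equal precedence before pushing
            while stack and stack[-1] in OPS and OPS[stack[-1]][0] >= OPS[tok][0]:
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the "("
        else:
            output.append(tok)  # series tag or numeric constant
    while stack:
        output.append(stack.pop())
    return output


def evaluate(rpn, series):
    """Evaluate RPN tokens; tags look up equal-length lists in `series`."""
    length = len(next(iter(series.values())))
    stack = []
    for tok in rpn:
        if tok in OPS:
            right, left = stack.pop(), stack.pop()
            stack.append([OPS[tok][1](a, b) for a, b in zip(left, right)])
        elif tok in series:
            stack.append(series[tok])
        else:
            stack.append([float(tok)] * length)  # broadcast a constant
    return stack.pop()


rpn = to_rpn("(meter_a - meter_b) * 2")
print(rpn)  # -> ['meter_a', 'meter_b', '-', '2', '*']
print(evaluate(rpn, {"meter_a": [10, 20], "meter_b": [1, 2]}))  # -> [18.0, 36.0]
```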
PS: If you wanted to get some more help with F#, Deedle and Azure tables, feel free to get in touch. I'm happy to share my experience - you should be able to find a contact via my profile.
F# versus C# will probably be basically the same perf wise unless you do something completely different between the two (for example, immutable vs mutable data sets). Both compile down to IL at the end of the day.
Azure Table Storage - make sure you pick your partition + row keys correctly. There is a lot of documentation on picking Azure Table Storage partition keys, especially for time series - make sure you group rows at the correct level so that data is distributed, with partitions neither too large nor too small. You might also want to look at the Azure Storage Type Provider and / or Azure Storage F# libraries, which make working with ATS easier than the standard .NET SDK.
Deedle AFAIK does indeed have the ability to replace missing values across time series, and there's at least a project called BigDeedle which works directly over ATS (although I'm not sure how ready this project is).
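One possible key scheme, shown only as an illustration in Python: one partition per tag and calendar month, with a sortable timestamp as the row key. The month granularity and the key formats are assumptions for this sketch, not a recommendation from the ATS documentation; the right grouping depends on how many readings each tag produces.

```python
from datetime import datetime


def make_keys(tag, ts):
    """Return (partition_key, row_key) for one reading: one partition per tag and month."""
    return f"{tag}_{ts:%Y%m}", f"{ts:%Y%m%dT%H%M%S}"


def partitions_for_range(tag, start, end):
    """List the partition keys that can contain readings between start and end."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(f"{tag}_{year:04d}{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return keys


print(make_keys("meter42", datetime(2015, 3, 7, 14, 30, 0)))
# -> ('meter42_201503', '20150307T143000')
print(partitions_for_range("meter42", datetime(2014, 11, 15), datetime(2015, 2, 1)))
# -> ['meter42_201411', 'meter42_201412', 'meter42_201501', 'meter42_201502']
```

With keys like these, a query for one year at monthly resolution touches twelve known partitions per tag instead of scanning a single ever-growing one.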

How to keep an object unique

I have a static DataTable (with 80k records) in Common.DLL and that Common.DLL is referred by 10 windows services. So, instead of having 10 copies of that DataTable in the memory I need to have it as 1 copy and all the services pointing to that data source. Is this approach possible?
Given that the services will at least be using different AppDomains, and quite possibly different processes, sharing the same data between all of them would be tricky.
I would personally suggest that you just don't worry about it - unless each record is actually pretty large, 80K records is still going to be fairly small.
You could potentially have an 11th service which is the only one to have the data, and then talk to that service from the other ones. But that's introducing a lot of complexity for very little benefit.
One way of potentially saving memory would be to use a List<T> for a custom type, instead of a DataTable - that may well be more efficient, and would almost certainly be more pleasant to use within the code. It doesn't help if you really need DataTable for whatever you're doing with it, but personally I try to avoid that...
You could create a WCF service that is hosted locally and reads the 80k records into memory.
You would then define an API on the WCF service that contains methods appropriate to whatever calls your 10 windows services need to make.
Doing this would add a level of complexity to your solution that may well not be needed though.
Use the Singleton pattern. Here's a detailed tutorial with C#.

Strategies for search across disparate data sources

I am building a tool that searches people based on a number of attributes. The values for these attributes are scattered across several systems.
As an example, dateOfBirth is stored in a SQL Server database as part of system ABC. That person's sales region assignment is stored in some horrible legacy database. Other attributes are stored in a system only accessible over an XML web service.
To make matters worse, the legacy database and the web service can be really slow.
What strategies and tips should I consider for implementing a search across all these systems?
Note: Although I posted an answer, I'm not confident it's a great answer. I don't intend to accept my own answer unless no one else gives better insight.
You could consider using an indexing mechanism to retrieve and locally index the data across all the systems, and then perform your searches against the index. Searches would be an awful lot faster and more reliable.
Of course, this just shifts the problem from one part of your system to another - now your indexing mechanism has to handle failures and heterogeneous systems, but that may be an easier problem to solve.
Another factor is how often the data changes. If you have to query data in real-time that goes stale very quickly, then indexing may not be practical.
If you can get away with a restrictive search, start by returning a list based on the search criteria corresponding to the fastest data source. Then join up those records with the other systems and remove records which don't match the search criteria.
If you have to implement OR logic, this approach is not going to work.
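A minimal sketch of this narrowing approach, assuming every system can be wrapped in a function that takes a criterion value plus an optional set of candidate ids. The in-memory dictionaries below are stand-ins for the real systems, and the field names are illustrative.

```python
def restrictive_search(sources, criteria):
    """AND-combine criteria, querying sources in the given (fastest-first) order."""
    candidates = None  # None means "no restriction yet"
    for field, lookup in sources:
        if field not in criteria:
            continue
        matches = lookup(criteria[field], candidates)
        candidates = matches if candidates is None else candidates & matches
        if not candidates:
            break  # nothing left, so the slower systems are never queried
    return candidates or set()


# Stand-ins for the real systems (hypothetical data).
BIRTH_DATES = {1: "1980-01-01", 2: "1980-01-01", 3: "1975-06-30"}
REGIONS = {1: "north", 2: "south", 3: "north"}


def by_birth_date(value, candidates):
    ids = BIRTH_DATES if candidates is None else candidates
    return {pid for pid in ids if BIRTH_DATES.get(pid) == value}


def by_region(value, candidates):
    # A slow system only needs to check the ids that survived so far.
    ids = REGIONS if candidates is None else candidates
    return {pid for pid in ids if REGIONS.get(pid) == value}


sources = [("dateOfBirth", by_birth_date), ("region", by_region)]
print(restrictive_search(sources, {"dateOfBirth": "1980-01-01", "region": "north"}))
# -> {1}
```

The intersection step is what ties this to AND semantics: with OR, a record rejected by the fast source could still match in a slow one, so the slow sources would have to be searched in full.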
While not an actual answer, this might at least get you partway to a workable solution. We had a similar situation at a previous employer - lots of data sources, different ways of accessing those data sources, different access permissions, military/government/civilian sources, etc. We used Mule, which is built around the Enterprise Service Bus concept, to connect these data sources to our application. My details are a bit sketchy, as I wasn't the actual implementor, just an integrator, but what we did was define a channel in Mule. Then you write a simple integration piece to go between the channel and the data source, and between the application and the channel. The integration piece does the work of making the actual query and formatting the results. We had a generic SQL integration piece for accessing a database, and for things like web services we had some base classes that implemented common functionality, so the actual customization of the integration pieces was a lot less work than it sounds. The application could then query the channel, which would handle accessing the various data sources, transforming the results into a normalized bit of XML, and returning them to the application.
This had a lot of advantages for our situation. We could include new data sources for existing queries by simply connecting them to the channel - the application didn't have to know or care what data sources were there, as it only looked at the data from the channel. Since data can be pushed or pulled from the channel, we could have a data source notify the application when, for example, its data was updated.
It took a while to get it configured and working, but once we got it going, we were pretty successful with it. In our demo setup, we ended up with 4 or 5 applications acting as both producers and consumers of data, and connecting to maybe 10 data sources.
Have you thought of moving the data into a separate structure?
For example, Lucene stores data to be searched in a schema-less inverted index. You could have a separate program that retrieves data from all your different sources and puts them in a Lucene index. Your search could work against this index and the search results could contain a unique identifier and the system it came from.
http://lucene.apache.org/java/docs/
(There are implementations in other languages as well)
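To show the idea rather than Lucene's actual API, here is a toy inverted index in Python. The class and method names are invented for this sketch, and it assumes the separate systems share a common person identifier; matching records without a shared id is a separate problem.

```python
from collections import defaultdict


class PeopleIndex:
    """Toy inverted index: (field, value) -> person ids, plus where each id came from."""

    def __init__(self):
        self._postings = defaultdict(set)  # (field, value) -> person ids
        self._origins = defaultdict(set)   # person id -> systems that supplied data

    def add(self, system, person_id, fields):
        """Index the attributes one system holds for one person."""
        self._origins[person_id].add(system)
        for field, value in fields.items():
            self._postings[(field, str(value).lower())].add(person_id)

    def search(self, **criteria):
        """Return {person_id: [systems]} for people matching every criterion."""
        ids = None
        for field, value in criteria.items():
            hits = self._postings.get((field, str(value).lower()), set())
            ids = hits if ids is None else ids & hits
        return {pid: sorted(self._origins[pid]) for pid in (ids or set())}


index = PeopleIndex()
index.add("ABC", 7, {"dateOfBirth": "1980-01-01"})
index.add("legacy", 7, {"region": "North"})
index.add("ABC", 8, {"dateOfBirth": "1980-01-01"})
print(index.search(dateOfBirth="1980-01-01", region="north"))
# -> {7: ['ABC', 'legacy']}
```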
Have you taken a look at YQL? It may not be the perfect solution, but it might give you a starting point to work from.
Well, for starters I'd parallelize the queries to the different systems. That way we can minimize the query time.
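A small sketch of querying the systems in parallel with a thread pool, in Python. The three query functions are hypothetical stand-ins that just sleep to imitate slow I/O; real ones would call the SQL database, the legacy database and the web service.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_all(sources, criteria):
    """Run every source query concurrently and return {source_name: result}."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(fn, criteria) for name, fn in sources.items()}
        # result() blocks until that query finishes and re-raises its exception, if any
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for the three systems.
def query_sql(criteria):
    time.sleep(0.05)
    return [1, 2]


def query_legacy(criteria):
    time.sleep(0.2)
    return [2, 3]


def query_web_service(criteria):
    time.sleep(0.2)
    return [2]


sources = {"sql": query_sql, "legacy": query_legacy, "web": query_web_service}
print(query_all(sources, {"region": "north"}))
# -> {'sql': [1, 2], 'legacy': [2, 3], 'web': [2]}
```

Threads suit this case because the work is I/O-bound: the total wait is roughly that of the slowest system rather than the sum of all three.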
You might also want to think about caching and aggregating the search attributes for subsequent queries in order to speed things up.
You have the option of creating an aggregation service or middleware that aggregates all the different systems so that you can provide a single interface for querying. If you do that, this is where I'd do the previously mentioned caching and parallelization optimizations.
However, with all of that you will need to weigh up the development time, deployment time and long-term benefits of the effort against migrating the old legacy database to a faster, more modern one. You haven't said how tied into other systems those databases are, so it may not be a very viable option in the short term.
EDIT: in response to data going out of date. You can consider caching your data if you don't need it to always match the database in real time. Also, if some data doesn't change very often (e.g. dates of birth) then you should cache it. If you employ caching, you could make your system configurable as to which tables/columns to include in or exclude from the cache, and you could give each table/column a configurable cache timeout with an overall default.
Use Pentaho/Kettle to copy all of the data fields that you can search on and display into a local MySQL database.
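The per-column timeout idea could be sketched like this in Python. The class name, column names and timeout values are illustrative assumptions; the clock is injectable only so the expiry behaviour can be demonstrated without waiting.

```python
import time


class AttributeCache:
    """Cache with a default time-to-live and optional per-column overrides."""

    def __init__(self, default_ttl, ttl_by_column=None, clock=time.monotonic):
        self._default_ttl = default_ttl
        self._ttl_by_column = dict(ttl_by_column or {})
        self._clock = clock
        self._entries = {}  # (column, key) -> (expires_at, value)

    def get(self, column, key, loader):
        """Return the cached value, calling loader(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get((column, key))
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)  # slow call to the underlying system
        ttl = self._ttl_by_column.get(column, self._default_ttl)
        self._entries[(column, key)] = (now + ttl, value)
        return value


# Dates of birth rarely change, so they get a much longer timeout than the default.
fake_now = [0.0]
cache = AttributeCache(default_ttl=60, ttl_by_column={"dateOfBirth": 86400},
                       clock=lambda: fake_now[0])
loads = []


def slow_lookup(person_id):
    loads.append(person_id)
    return f"value-{len(loads)}"


print(cache.get("salesRegion", 1, slow_lookup))  # -> value-1 (loaded)
print(cache.get("salesRegion", 1, slow_lookup))  # -> value-1 (cached)
fake_now[0] = 61
print(cache.get("salesRegion", 1, slow_lookup))  # -> value-2 (expired, reloaded)
```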
http://www.pentaho.com/products/data_integration/
Create a batch script to run nightly and update your local copy. Maybe even every hour. Then, write your query against your local MySQL database and display the results.
