Automating Consolidation Processes


My company currently uses Excel for reporting, where we have to collect data from various business units on a monthly basis. Each unit sends an Excel file with 50 columns and 10-1000 row items. After receiving the files, we use VBA to consolidate them. This consolidated master file is then split into sections and sent to various personnel, and any changes they make have to be updated in the master file.
Is there any way that this process can be improved and automated using a different system?

Is there any way that this process can be improved and automated using a different system?
Well, you already have the "low cost, low tech" solution.
The proper solution (until something better is invented) is a proper web application, which collects the data from the various users, processes it and then generates the necessary reports.
This endeavor is not something to be treated lightly, even if it sounds like a small task. Your company needs to understand what they want, and then contact some supplier companies to get an estimation of the costs.
The costs cover at least:
the development of the application;
the server on which the application will run after it is finished; can be a virtual server;
the costs of training the employees to use the application properly;
the costs of actually getting the employees to use the application properly (the managers themselves usually being the most resistant);
Of course, I assume that you already have some network infrastructure and that backups of all important data are done according to best practices (by IT)...
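As a rough illustration of the "collect, process, report" core of such an application, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name submissions, the columns unit, item and amount, and the helper functions are all hypothetical; a real system would sit behind a web front end and model all 50 columns.

```python
import sqlite3


def open_store():
    # In-memory database for the sketch; a real deployment would use a server database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submissions ("
        " unit TEXT NOT NULL, item TEXT NOT NULL, amount REAL NOT NULL,"
        " PRIMARY KEY (unit, item))"
    )
    return conn


def submit(conn, unit, rows):
    # Upsert: a later correction replaces the earlier value instead of duplicating it.
    conn.executemany(
        "INSERT OR REPLACE INTO submissions (unit, item, amount) VALUES (?, ?, ?)",
        [(unit, item, amount) for item, amount in rows],
    )
    conn.commit()


def report_totals(conn):
    # The "consolidated master" is just a query, not a separately maintained file.
    cur = conn.execute(
        "SELECT unit, SUM(amount) FROM submissions GROUP BY unit ORDER BY unit"
    )
    return cur.fetchall()


conn = open_store()
submit(conn, "north", [("rent", 1000.0), ("travel", 250.0)])
submit(conn, "south", [("rent", 800.0)])
submit(conn, "north", [("travel", 300.0)])  # a correction overwrites the old row
totals = report_totals(conn)
```

The point of the design is that every business unit writes to one shared store, so there is no master file to split, send out and merge back.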


How to apply ratelimiting (restricting access) on logged-in users taking screenshots of my website?

I maintain a record of the email/IP of users who take screenshots of our website (detected as a keypress with key code 44, Print Screen).
Currently, I am blocking them based on their weekly screenshot count.
However, I'm thinking of applying daily rate limiting: users who cross the daily threshold are denied access for some time. The restriction gets longer each time they cross the threshold again, the daily threshold itself starts to decrease, and at some point they are restricted permanently.
Is this the best way to reduce unrestrained screenshots of my website?
Thank You
I tried restricting users based on their weekly print-screen count. However, there were some users who were crossing the weekly threshold in only a few hours. I would definitely like to restrict such users immediately.
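To make the scheme described above concrete, here is a minimal sketch in Python of the server-side bookkeeping only. The class name EscalatingLimiter, the parameter names and the doubling rule for the block duration are assumptions chosen for illustration, and the sketch can only count events that the client actually reports.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds


@dataclass
class UserState:
    strikes: int = 0            # how many times the daily threshold was crossed
    blocked_until: float = 0.0  # timestamp at which a temporary block ends
    permanent: bool = False
    window_start: float = 0.0   # start of the current counting window
    count: int = 0              # screenshots seen in the current window


class EscalatingLimiter:
    def __init__(self, base_limit=20, limit_step=5, base_block=3600, max_strikes=3):
        self.base_limit = base_limit
        self.limit_step = limit_step
        self.base_block = base_block
        self.max_strikes = max_strikes
        self.users = {}

    def daily_limit(self, state):
        # The threshold shrinks with every strike, but never drops below 1.
        return max(1, self.base_limit - state.strikes * self.limit_step)

    def is_blocked(self, user, now):
        state = self.users.get(user)
        if state is None:
            return False
        return state.permanent or now < state.blocked_until

    def record_screenshot(self, user, now):
        """Record one reported screenshot; return True if the user is now blocked."""
        state = self.users.setdefault(user, UserState(window_start=now))
        if state.permanent or now < state.blocked_until:
            return True
        if now - state.window_start >= DAY:
            state.window_start, state.count = now, 0
        state.count += 1
        if state.count > self.daily_limit(state):
            state.strikes += 1
            state.count, state.window_start = 0, now
            if state.strikes >= self.max_strikes:
                state.permanent = True
            else:
                # Block duration doubles with each strike (an arbitrary choice).
                state.blocked_until = now + self.base_block * 2 ** (state.strikes - 1)
            return True
        return False
```

Because the limit is checked on every event, a user who burns through the threshold within a few hours is blocked immediately rather than at the end of the week.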

I think you will have a hard time restricting access based on a client-side action. Screenshots can always be taken with OS tools such as the Snipping Tool, or with web scrapers such as EyeWitness. It may be worth going back to the drawing board to get some better answers:
Why do you want to block people for taking screenshots?
Is this temporary restriction going to actually stop this happening?
How long do you want users to be restricted for?
Have you researched any methods of preventing screenshots from being taken? Rather than trying to detect who takes them?
Have you warned users that taking screenshots will result in them being blocked? This may stop them in the first place.

High memory/performance critical computing - Architectural approach opinions

I need an architectural opinion and approaches to the following problem:
INTRO:
We have a table of ~4M rows called Purchases. We also have a table
of ~5k rows called Categories. In addition, we have a table of
~4k SubCategories. We are using T-SQL to store data. At the user's
request (during runtime), the server receives a request of about 10-15 N
possibilities of parameters. Based on the parameters, we take
purchases, sort them by categories and subCategories and do some
computing. Some of the process of "computing" includes filtering,
sorting, rearranging fields of purchases, subtracting purchases from
each other, adding some other purchases to each other, finding savings,
etc. This process is user specific, therefore every user WILL get
different data back, based on their roles.
Problem:
This process takes about 3-5 minutes and we want to cut it
down. Previously, this process was done in memory, in the browser, via
web workers (JS). We moved away from that as the memory usage grew
really large and most browsers started to fail on load. Then we moved
the service to the server (NodeJS), which processed the request on the fly
via child processes. Reason for child processes: the computing process
goes through a for loop about 5,000 times (once for each category) and does
the above-mentioned "computing". Via child processes we were able to
distribute the work across a number of child processes, which gave us somewhat
better results if we ran on at least 16 cores (16 child processes).
Current processing time is down to about 1.5-2 minutes, but we
want to see if we have better options.
I understand it's hard to fully understand our goals without seeing any code, so to ask the question specifically: what are some ways of doing computing on semi-big data at runtime?
Some thoughts we had:
using SQL in-memory tables and doing computations in sql
using azure batch services
using bigger machines (~32-64 cores; this may be our best shot if we cannot get any other ideas. Of course, cost increases drastically, but we accept that cost will increase)
stepping into hadoop ecosystem ( or other big data ecosystems )
some other useful facts:
our purchases are about ~1GB ( becoming a little too large for in-memory computing )
We are thinking of doing pre-computing and caching on redis to have SOME data ready for client ( we are going to use their parameters set in their account to pre-compute every day, yet clients tend to change those parameters frequently, therefore we have to have some efficient way of handling data that is NOT cached and pre-computed )
If there is more information we can provide to better understand our dilemma, please comment and I will provide as much info as possible. There would be too much code to paste in here for one to fully understand the algorithms therefore I want to try delivering our problem with words if possible.
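The pre-compute-and-cache idea mentioned above, with a fallback for parameter sets that were not pre-computed, could look roughly like the following Python sketch. The names params_key, ResultCache and slow_report are hypothetical, a plain dict stands in for Redis, and the "computation" is a placeholder.

```python
import hashlib
import json


def params_key(user_role, params):
    # Canonical JSON so the same parameters always map to the same key,
    # regardless of the order in which they were supplied.
    payload = json.dumps({"role": user_role, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResultCache:
    def __init__(self, compute):
        self._compute = compute  # the expensive function (hypothetical signature)
        self._store = {}         # stand-in for Redis
        self.misses = 0

    def precompute(self, user_role, params):
        # Called by the nightly job for the parameters saved in each account.
        self._store[params_key(user_role, params)] = self._compute(user_role, params)

    def get(self, user_role, params):
        key = params_key(user_role, params)
        if key not in self._store:
            # Not pre-computed: pay the full cost once, then remember the result.
            self.misses += 1
            self._store[key] = self._compute(user_role, params)
        return self._store[key]


calls = []


def slow_report(role, params):
    # Placeholder for the real per-category computation.
    calls.append(role)
    return sum(params.values())


cache = ResultCache(slow_report)
cache.precompute("buyer", {"year": 2017, "region": 3})
```

The role is part of the key because results are user specific; any other input that changes the result would have to be included as well.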

Never decide about technology before being sure about workflow's critical-path
This will never help you achieve ( a yet unknown ) target.
Not knowing the process critical-path, no one could calculate any speedup from whatever architecture one may aggressively-"sell" you or just-"recommend" you to follow as "geeky/nerdy/sexy/popular" - whatever one likes to hear.
What would you get from such a premature decision?
Typically a mix of both budgeting ( co$t$ ) and Project-Management ( sliding time-scales ) nightmares:
additional costs ( new technology also means new skills to learn, new training costs, new delays while the team re-shapes, re-adjusts and grows into a mature use of the new technology at performance levels better than those of the currently used tools, etc. )
risks of choosing a "popular" brand which, on the other hand, does not exhibit any of the super-powers the marketing texts were promising ( but once you have paid the initial costs of entry, there is no other way than to bear the risk of never achieving the intended target, possibly due to overestimated performance benefits & underestimated costs of transformation & heavily underestimated costs of operations & maintenance )
What would you say if you could use a solution where "Better options" remain your options:
you can start now, with the code you are currently using without a single line of code changed
you can start now, on a gradual path of performance scaling that still depends only on YOUR free will
you can avoid all risks of (mis)-investing in an extra-premium-cost "super-box", and instead stay on the safe side and re-use cheap, massively in-service tested / fine-tuned / deployment-proven COTS hardware units ( common dual-CPU machines with a few GB of RAM, used in the thousands in datacentres )
you can scale up to any level of performance you need, growing CPU-bound processing performance gradually from the start, hassle-free, up to some ~1k, ~2k, ~4k or ~8k CPUs as needed -- yes, up to many thousands of CPUs that your current workers' code can use immediately. That immediate gain leaves your teams free hands and more time for thorough work on design improvements and code re-factoring for even better performance envelopes, should the current workflow, "passively" just smart-distributed over say ~1000, later ~2000 or ~5000 CPU cores ( still without a single SLOC changed ), not suffice on its own
you can scale up -- again, gradually, on an as-needed basis, hassle-free -- up to ( almost ) any size of the in-RAM capacity, be it on Day 1 ~8TB, ~16TB, ~32TB, ~64TB, jumping to ~72TB or ~128TB next year, if needed -- all that keeping your budget always ( almost ) linear and fully adjusted by your performance plans and actual customer-generated traffic
you can isolate and focus your R&D efforts not on (re)-learning "new"-platform(s), but purely into process (re)-design for further increasing the process performance ( be it using a strategy of pre-computing, where feasible, be it using smarter fully-in-RAM layouts for even faster ad-hoc calculations, that cannot be statically pre-computed )
What would business owners say to such ROI-aligned strategy?
If one makes the CEO + CFO "buy" a new toy, well, that is cool for hacking this today and that tomorrow, but such an approach will never make shareholders any happier than throwing ( their ) money into the river Nile.
If one can show an efficient Project plan, where most of the knowledge and skills are focused on a business-aligned target while at the same time protecting the ROI, that would make both your CEO + CFO and, I guarantee, all your shareholders very happy, wouldn't it?
So, which way would you decide to go?

This topic isn't really new, but just in case... As far as my experience can tell, I would say your T-SQL DB might be your bottleneck here.
Have you measured the performance of your SQL queries? What do you compute on SQL server side? on the Node.js side?
A good start would be to measure the response time of your SQL queries, revamp your queries, work on indexes and dig into how your DB query engine works if needed. Sometimes a small tuning in the DB settings does the trick!
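As a small illustration of "measure first, then index", here is a Python sketch that uses SQLite from the standard library as a stand-in for the real database. The table layout, the index name idx_purchases_category and the helpers timed and query_plan are assumptions; on SQL Server you would inspect the execution plan and use something like SET STATISTICS TIME ON instead of EXPLAIN QUERY PLAN.

```python
import sqlite3
import time


def timed(conn, sql, args=()):
    # Run one query and return (rows, elapsed seconds).
    start = time.perf_counter()
    rows = conn.execute(sql, args).fetchall()
    return rows, time.perf_counter() - start


def query_plan(conn, sql, args=()):
    # SQLite-specific: the last column of each plan row is a readable description.
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " ".join(row[-1] for row in plan_rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (id INTEGER PRIMARY KEY, category_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchases (category_id, amount) VALUES (?, ?)",
    [(i % 5000, float(i % 97)) for i in range(50_000)],
)

QUERY = "SELECT SUM(amount) FROM purchases WHERE category_id = ?"

# Measure and inspect the query before touching anything.
plan_before = query_plan(conn, QUERY, (42,))
rows_before, secs_before = timed(conn, QUERY, (42,))

# Add an index on the filter column, then measure again.
conn.execute("CREATE INDEX idx_purchases_category ON purchases (category_id)")
plan_after = query_plan(conn, QUERY, (42,))
rows_after, secs_after = timed(conn, QUERY, (42,))
```

Comparing plan_before with plan_after shows whether the engine switched from a full scan to an index lookup; the timings only become meaningful on realistic data volumes.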

DDD how to model time tracking?

I am developing an application that has employee time tracking module. When employee starts working (e.g. at some abstract machine), we need to save information about him working. Each day lots of employees work at lots of machines and they switch between them. When they start working, they notify the system that they have started working. When they finish working - they notify the system about it as well.
I have an aggregate Machine and an aggregate Employee. These two are aggregate roots with their own behavior. Now I need a way to build reports for any given Employee or any given Machine for any given period of time. For example, I want to see which machines a given employee used over a period of time and for how long. Or I want to see which employees worked at a given machine, and for how long, over a period of time.
Ideally (I think) my aggregate Machine should have methods startWorking(Employee employee) and finishWorking(Employee employee).
I created another aggregate, EmployeeWorkTime, that stores information about the Machine, the Employee, and the start and finish timestamps. Now I need a way to modify one aggregate and create another at the same time (or ideally some other approach, since this way is somewhat difficult).
Also, employees have a Shift that describes for how many hours a day they must work. The information from a Shift should be saved in EmployeeWorkTime aggregate in order to be consistent in a case when Shift has been changed for given Employee.
Rephrased question
I have a Machine, I have an Employee. HOW the heck can I save this information:
This Employee worked at this Machine from 1.05.2017 15:00 to 1.05.2017 18:31.
I could do this simply using CRUD, saving multiple aggregates in one transaction, going database-first. But I want to use DDD methods to be able to manage complexity since the overall domain is pretty complex.

From what I understand about your domain, you must model the process of an Employee working on a machine. You can implement this using a process manager/saga. Let's name it EmployeeWorkingOnAMachineSaga. It works like this (using CQRS; you can adapt it to other architectures):
When an employee wants to start working on a machine, the EmployeeAggregate receives the command StartWorkingOnAMachine.
The EmployeeAggregate checks that the employee is not working on another machine; if not, it raises the EmployeeWantsToWorkOnAMachine event and changes the status of the employee to wantingToWorkOnAMachine.
This event is caught by the EmployeeWorkingOnAMachineSaga, which loads the MachineAggregate from the repository and sends it the command TryToUseThisMachine. If the machine is not vacant, it rejects the command and the saga sends the RejectWorkingOnTheMachine command to the EmployeeAggregate, which in turn changes its internal status (by raising an event, of course).
If the machine is vacant, it changes its internal status to occupiedByAnEmployee (by raising an event).
The flow is similar when the worker stops working on the machine.
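The start-of-work flow above can be sketched in a few lines of Python. This is a heavily simplified, in-memory illustration: events are plain tuples, there is no event store or message bus, and the confirm step and the "working" and "idle" status values are assumptions that the description above does not spell out.

```python
class Employee:
    def __init__(self, employee_id):
        self.employee_id = employee_id
        self.status = "idle"

    def start_working_on(self, machine_id):
        # Handles the StartWorkingOnAMachine command.
        if self.status != "idle":
            raise ValueError("employee is already working or waiting for a machine")
        self.status = "wantingToWorkOnAMachine"
        return ("EmployeeWantsToWorkOnAMachine", self.employee_id, machine_id)

    def confirm(self):
        # Assumed step: the machine accepted the employee.
        self.status = "working"

    def reject(self):
        # Handles the RejectWorkingOnTheMachine command.
        self.status = "idle"


class Machine:
    def __init__(self, machine_id):
        self.machine_id = machine_id
        self.occupied_by = None

    def try_to_use(self, employee_id):
        # Handles the TryToUseThisMachine command; False means "not vacant".
        if self.occupied_by is not None:
            return False
        self.occupied_by = employee_id
        return True


class EmployeeWorkingOnAMachineSaga:
    def __init__(self, employees, machines):
        # Plain dicts stand in for the repositories.
        self.employees = employees
        self.machines = machines

    def handle(self, event):
        name, employee_id, machine_id = event
        if name != "EmployeeWantsToWorkOnAMachine":
            return
        employee = self.employees[employee_id]
        if self.machines[machine_id].try_to_use(employee_id):
            employee.confirm()
        else:
            employee.reject()


employees = {"e1": Employee("e1"), "e2": Employee("e2")}
machines = {"m1": Machine("m1")}
saga = EmployeeWorkingOnAMachineSaga(employees, machines)
saga.handle(employees["e1"].start_working_on("m1"))
saga.handle(employees["e2"].start_working_on("m1"))  # machine is taken, so rejected
```

Neither aggregate modifies the other directly; the saga is the only place that knows about both.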
Now I need a way to build reports for any given Employee or any given Machine for any given period of time. For example, I want to see which machines a given employee used over a period of time and for how long. Or I want to see which employees worked at a given machine, and for how long, over a period of time.
This should be implemented by read-models that just listen to the relevant events and build the reports that you need.
Also, employees have a Shift that describes for how many hours a day they must work. The information from a Shift should be saved in EmployeeWorkTime aggregate in order to be consistent in a case when Shift has been changed for given Employee
Depending on how you want the system to behave you can implement it using a Saga (if you want the system to do something if the employee works more or less than it should) or as a read-model/report if you just want to see the employees that do not conform to their daily shift.

I am developing an application that has employee time tracking module. When employee starts working (e.g. at some abstract machine), we need to save information about him working. Each day lots of employees work at lots of machines and they switch between them. When they start working, they notify the system that they have started working. When they finish working - they notify the system about it as well.
A critical thing to notice here is that the activity you are tracking is happening in the real world. Your model is not the book of record; the world is.
Employee and Machine are real world things, so they probably aren't aggregates. TimeSheet and ServiceLog might be; these are the aggregates (documents) that you are building by observing the activity in the real world.
If event sourcing is applicable there, how can I store domain events efficiently to build reports faster? Should each important domain event be its own aggregate?
Fundamentally, yes -- your event stream is going to be the activity that you observe. Technically, you could call it an aggregate, but it's a pretty anemic one; it is easier to just think of it as a database, or a log.
In this case, it's probably just full of events like
TaskStarted {badgeId, machineId, time}
TaskFinished {badgeId, machineId, time}
Having recorded these events, you forward them to the domain model. For instance, you would take all of the events with Bob's badgeId and dispatch them to his Timesheet, which starts trying to work out how long he was at each work station.
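As a sketch of what dispatching Bob's events to his timesheet might involve, the following Python pairs each TaskStarted with the matching TaskFinished and sums the time per machine. The field names follow the two events above; the function time_per_machine and the rule of ignoring a finish without a matching start are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class TaskStarted(NamedTuple):
    badge_id: str
    machine_id: str
    time: datetime


class TaskFinished(NamedTuple):
    badge_id: str
    machine_id: str
    time: datetime


def time_per_machine(events, badge_id):
    """Total time one badge spent at each machine, based on the observed events."""
    open_tasks = {}                   # machine_id -> start time of the open task
    totals = defaultdict(timedelta)
    own = sorted((e for e in events if e.badge_id == badge_id), key=lambda e: e.time)
    for event in own:
        if isinstance(event, TaskStarted):
            open_tasks[event.machine_id] = event.time
        elif event.machine_id in open_tasks:
            totals[event.machine_id] += event.time - open_tasks.pop(event.machine_id)
        # A finish without a recorded start is ignored here; a real timesheet
        # would probably flag it for review instead.
    return dict(totals)


events = [
    TaskStarted("bob", "m1", datetime(2017, 5, 1, 15, 0)),
    TaskStarted("alice", "m2", datetime(2017, 5, 1, 15, 30)),
    TaskFinished("bob", "m1", datetime(2017, 5, 1, 18, 31)),
    TaskFinished("alice", "m2", datetime(2017, 5, 1, 16, 0)),
]
bob_times = time_per_machine(events, "bob")
```

The same event log can feed a per-machine report by filtering on machine_id instead of badge_id.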
Given that Machine and Employee are aggregate roots (they have their own invariants and business logic in a complex net of interrelations, timeshift-feature is only one of the modules)
You are likely to get yourself into trouble if you assume that your digital model controls a real world entity. Digital shopping carts and real world shopping carts are not the same thing; the domain model running on my phone can't throw things out of my physical cart when I exceed my budget. It can only signal that, based on the information that it has, the contents are not in compliance with my budgeting policy. Truth, and the book of record are the real world.
Greg Young discusses this in his talk at DDDEU 2016.
You can also review the Cargo DDD Sample; in particular, pay careful attention to the distinction between Cargo and HandlingHistory.
Aggregates are information resources; they are documents with internal consistency rules.

FetchXml optimization

I am looking for ways to optimize the bulk download of CRM 2011 data. Here are the two main scenarios:
a) Full synchronization: Download of all data - first all accounts, then all contacts etc etc.
b) Incremental synchronization: Download of all entities modified since given date
We use a multithreaded downloader with 3 threads. Each thread performs FetchXml for one entity type, which is downloaded page by page. Parsed objects are stored in the downloader cache and the downloader goes on to the next page. Another thread pulls the downloaded data from the cache and processes it. This organization increases the download speed more than 2x.
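The arrangement described above is a producer/consumer pipeline. A minimal Python sketch of that shape follows; fetch_page and process are hypothetical callables supplied by the caller, a queue stands in for the downloader cache, and here the calling thread plays the role of the processing thread.

```python
import queue
import threading


def download_all(entity_types, fetch_page, process, workers=3):
    """Download every entity type page by page while processing runs in parallel."""
    work = queue.Queue()
    for name in entity_types:
        work.put(name)
    cache = queue.Queue()   # downloaded pages waiting to be processed
    done = object()         # sentinel: one downloader thread has finished

    def downloader():
        try:
            while True:
                try:
                    entity = work.get_nowait()
                except queue.Empty:
                    break
                page = 1
                while True:
                    records, more = fetch_page(entity, page)
                    cache.put((entity, records))
                    if not more:
                        break
                    page += 1
        finally:
            # Always signal completion, even if fetch_page raised.
            cache.put(done)

    threads = [threading.Thread(target=downloader) for _ in range(workers)]
    for thread in threads:
        thread.start()

    finished = 0
    while finished < workers:
        item = cache.get()
        if item is done:
            finished += 1
        else:
            process(*item)
    for thread in threads:
        thread.join()


# Fake data source for demonstration: entity -> list of pages.
FAKE_PAGES = {"account": [[1, 2], [3]], "contact": [[4]], "task": []}


def fake_fetch_page(entity, page):
    pages = FAKE_PAGES[entity]
    records = pages[page - 1] if pages else []
    return records, page < len(pages)


collected = {}


def collect(entity, records):
    collected.setdefault(entity, []).extend(records)


download_all(["account", "contact", "task"], fake_fetch_page, collect)
```

Each downloader picks up the next entity type as soon as it finishes the previous one, so three threads stay busy even when the entity types differ greatly in size.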
The problems I see:
a) The FetchXml protocol is very inefficient. For example, it contains lots of unneeded data: FormattedValues take 10-15% of the bandwidth (my data show ~15% in the source Xml stream or ~10% in the zipped stream), although all we do with them is a) Xml parsing, b) throwing them away. (Note that the parsing is not negligible either - the iOS/Android Mono parsers are surprisingly slow.)
b) In the case of incremental synchronization, most of the FetchXml requests return zero items. In this case it would be highly desirable to combine several FetchXml requests into one. (AFAIK it is impossible.) Or maybe use another trick, such as asking for the counts of modified objects. I have not yet investigated what is possible.
Does anybody have any advice on how to optimize FetchXml traffic?

Your fastest method would be to use SQL server directly for something like this (unless you are using online).
To make the incremental faster, your best bet is to use the aggregate functionality FetchXML provides which is both extremely quick and less verbose.
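For the incremental case, one way to use the aggregate functionality is to ask only for the number of records modified since the last sync and to skip the full download when that count is zero. The Python sketch below merely builds such a query string; the helper name modified_count_fetchxml and the alias modified_count are made up, and the exact attribute and operator names should be checked against the CRM 2011 SDK before relying on them.

```python
import xml.etree.ElementTree as ET


def modified_count_fetchxml(entity, id_attribute, since_iso):
    """Build an aggregate FetchXML query counting records modified since a date."""
    fetch = ET.Element("fetch", {"mapping": "logical", "aggregate": "true"})
    ent = ET.SubElement(fetch, "entity", {"name": entity})
    # Count on the primary key; the alias names the value in the result.
    ET.SubElement(
        ent,
        "attribute",
        {"name": id_attribute, "alias": "modified_count", "aggregate": "count"},
    )
    flt = ET.SubElement(ent, "filter", {"type": "and"})
    ET.SubElement(
        flt,
        "condition",
        {"attribute": "modifiedon", "operator": "ge", "value": since_iso},
    )
    return ET.tostring(fetch, encoding="unicode")


query = modified_count_fetchxml("account", "accountid", "2013-01-01T00:00:00Z")
```

The response to such a query is a single small row per entity type, which avoids paying for a full page request just to learn that nothing changed.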
Why parse on the iOS/Android Mono? If you are instead sharing this to a large number of devices, you'd be better off having a central caching server that could send back this data in a json (zipped) format to the devices (or possibly bson). Then the caching server would request an update of the changes, process those and then send back incremental changes in whatever format to the clients. Would be considerably faster on the clients and far less bandwidth.

I'm not sure of a way to further optimize FetchXML. I would question why you're not using the OData endpoints and REST, especially if you're primarily concerned with the amount of data being sent over the wire.
I have talked to some brilliant CRM MVPs, and I know they have used REST to migrate data to CRM. I'm not sure if they did it because it was faster, but I assumed that was why.
I do know that moving away from XML should reduce the amount of data being sent to the client, since XML is extremely bloated.

Have a look at the ExecuteMultipleRequest to allow you to perform multiple requests/queries at once... http://msdn.microsoft.com/en-us/library/microsoft.xrm.sdk.messages.executemultiplerequest.aspx

Reporting progress on a million call process

I have a console/desktop application that crawls a lot of data (think a million calls) from various web services. At any given time I have about 10 threads performing these calls and aggregating the data into a MySQL database. All seeds are also stored in a database.
What would be the best way to report its progress? By progress I mean:
How many calls already executed
How many failed
What's the average call duration
How much is left
I thought about logging all of them somehow and tailing the log to get the data. Another idea was to offer some kind of output on an always-open TCP endpoint, where some form of UI could read the data and display some aggregation. Both ways look too rough and too complicated.
Any other ideas?

The "best way" depends on your requirements. If you use a logging framework like NLog, you can plug in a variety of logging targets like files, databases, the console or TCP endpoints.
You can also use a viewer like Harvester as a logging target.
When logging multi-threaded applications I sometimes have an additional thread that writes a summary of progress to the logger once every so often (e.g. every 15 seconds).
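The summary-thread idea can be sketched as follows in Python: the worker threads update shared counters under a lock, and a separate thread hands a snapshot to a logging callback at a fixed interval. The names Progress and start_reporter are hypothetical, and log can be any callable, for example a logger method or print.

```python
import threading


class Progress:
    """Thread-safe counters shared by all worker threads."""

    def __init__(self, total):
        self._lock = threading.Lock()
        self.total = total
        self.succeeded = 0
        self.failed = 0
        self.seconds = 0.0

    def record(self, duration, ok=True):
        # Called by a worker after each call, with the call's duration in seconds.
        with self._lock:
            if ok:
                self.succeeded += 1
            else:
                self.failed += 1
            self.seconds += duration

    def snapshot(self):
        with self._lock:
            done = self.succeeded + self.failed
            return {
                "executed": done,
                "failed": self.failed,
                "avg_seconds": self.seconds / done if done else 0.0,
                "remaining": self.total - done,
            }


def start_reporter(progress, log, interval=15.0):
    """Log a summary every `interval` seconds until the returned event is set."""
    stop = threading.Event()

    def run():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            log(progress.snapshot())
        log(progress.snapshot())  # one final summary on shutdown

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return stop, thread
```

A typical use would be to call start_reporter(progress, print) before starting the workers, then set the returned event and join the thread once they are done.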

Since it is a console application, just use WriteLine: have the application spit the important stuff out to the console.
I did something similar in an application that I created to export PDFs from a SQL Server database back into PDF format.
You can do it many different ways. If you are counting records and their size, you can run a tally of sorts and have it show the total every so many records.
I also wrote out to a text file, so that I could keep track of all the PDFs and what case numbers they went to and things like that. That information is in the answer that I gave to the above linked question.
You could also write things out to a text file every so often with the statistics.
The logger that Eric J. mentions is probably going to be a little bit easier to implement, and would be a nice tool for your toolbox.
These options are just as valid depending on your specific needs.
