Data and structure cleansing of Excel sheets - Excel

I have over 6,000 Excel sheets. While all the sheets describe the same thing, they are independently formatted. They all have between 9 and 13 columns, but the columns are out of order, the column names are independently misspelled, and they may or may not have a second or third header row.
I am currently trying, in Python, to read cells in a left-down-right-up motion to locate the same data, but there are simply too many differences in structure names, column ordering, and data definitions to lock them in one at a time. Is there a tool that I can use to read these documents and conform them to a single format, via a rapid mapping function?
Thanks much.

Wow, it's the Ultimate Data Horror Story.
I want to ask how you ever let it get this way... but I actually don't want to know; I'm already going to have nightmares about this.
It's like that Hoarding show on TV, but with data.
No, I'm afraid that if you can't even identify a pattern then there's no magic function that will be able to either.
But that doesn't mean it's a lost cause. It's just going to need some human interaction, and there are ways to minimize the pain.
What you need is a custom interface that will load the documents one by one, and will walk a human through clicking each relevant column or area, and then automatically load the next document.
There would also need to be buttons for sorting out things like obvious garbage sheets (blanks?), "unknowns" (which get put in a folder for advanced research later), and other "unpredictables" that may come up during the process.
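A bare-bones sketch of what that interaction loop could look like (openpyxl and the canonical column names are my assumptions here; a real tool would render the sheet and take clicks rather than typed numbers):

import openpyxl

# Canonical target schema -- made up for illustration.
CANONICAL = ["Budget", "Actual", "Department", "Quarter"]

def review(path):
    """Show one sheet's headers and let a human map them, then move on."""
    ws = openpyxl.load_workbook(path, read_only=True).active
    headers = [c.value for c in next(ws.iter_rows(max_row=1))]
    print(f"{path}: {headers}")
    mapping = {}
    for name in CANONICAL:
        # The human types the position of each canonical column, or 'x'
        # to send the sheet to the "unknowns" folder for later research.
        answer = input(f"Which column holds {name}? (1-{len(headers)} or x) ")
        if answer.lower() == "x":
            return None
        mapping[name] = int(answer) - 1
    return mapping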
Also, perhaps once you get into it, you'll notice a pattern you're not thinking of, like maybe *"the person who handled the files from 2002 to 2004 set them up this way"*, or, "when Budget is misspelled, it's always either Bugdet or Budteg".
In this scope, little patterns like that can make a big difference.
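Those little patterns are exactly what you could feed into a fuzzy matcher before falling back to the clicking UI. A sketch using Python's standard difflib (same made-up canonical names as above):

import difflib

CANONICAL = ["Budget", "Actual", "Department", "Quarter"]

def map_header(raw):
    # Fuzzy-match a raw header against the canonical list; difflib's
    # default cutoff of 0.6 is a guess -- tune it on real data.
    hit = difflib.get_close_matches(raw.strip(), CANONICAL, n=1, cutoff=0.6)
    return hit[0] if hit else None   # None -> route to the clicking UI

for raw in ["Bugdet", "Budteg", "Dept."]:
    print(raw, "->", map_header(raw))
# Bugdet -> Budget, Budteg -> Budget, Dept. -> None (needs a human)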
Depending on your coding skills, you may or may not need outside assistance with this. I assume this is not data that can just get thrown out, or you wouldn't be asking...
If each document took an average of 20 seconds to process, that would be about 33 hours in total. An hour a day and it's done in a month. Or someone full-time, and it's done in a week.
Do you have a budget you can throw at this? Data archaeology is an actual thing! Hell, I'll do it for you for the right price... (wouldn't break the bank, depending on how urgent it is, of course!)
Either way, this ain't going to be fun for "someone"...

Related

Open multiple positions in Backtrader

Does someone know if it's possible to open multiple positions with only a single data feed? I am trying to do a second buy whilst in a position, which doesn't seem to be possible.
Nobody seems to address this issue. Does anyone have any experience with Backtrader and any input?
If you are just trying to buy more stock to add to your position, then yes, you should be able to do this; if you cannot, recheck your strategy code in next().
If you are trying to track two separate positions of the same data...
One cannot have two separate positions in the same data feed. You may trade additional positions if you like but they will be combined in Backtrader. Even if you use two strategies you will still have one combined broker.
The reason for this is to simulate real-world conditions as closely as possible. If you have a brokerage account, you would most likely have just one position. (I know there are exceptions.)
One solution would be to track your trading manually in a dictionary, keyed by the signal/sub-strategy each trade results from. It's a bit more tedious to develop but very doable.
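A minimal sketch of that manual tracking (the signal names and the open_for_signal helper are made up; the broker still nets everything into one position per feed):

import backtrader as bt

class SubPositionStrategy(bt.Strategy):
    def __init__(self):
        self.sub_positions = {}   # signal name -> accumulated size
        self.pending = {}         # order ref -> signal name

    def open_for_signal(self, signal, size):
        order = self.buy(size=size)        # still netted by the broker
        self.pending[order.ref] = signal   # remember which signal asked

    def notify_order(self, order):
        # When a buy completes, credit its size to the signal that sent it.
        if order.status == order.Completed:
            signal = self.pending.pop(order.ref, None)
            if signal is not None:
                self.sub_positions[signal] = (
                    self.sub_positions.get(signal, 0) + order.executed.size)

    def next(self):
        # Example: two sub-strategies buying on the same data feed.
        if len(self) == 1:
            self.open_for_signal("sig_a", 10)
        elif len(self) == 2:
            self.open_for_signal("sig_b", 5)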

Need help simplifying or improving a weighted distribution formula in excel (math/excel/programming noob)

I've been having some fun creating a rather extensive inventory in Google Sheets for my collection of trading cards. I buy the majority of my collectibles in lots meaning that I pay a total of X dollars for Y number of cards of different values (as opposed to buying each card individually).
In my spreadsheet I have a "Purchase Price" column where I enter in the price I paid for each card. If I buy 1 lot of 10 cards, to find the value of each of those cards you would just divide the cost of the lot by the number of cards in the lot. So if I purchased 1 lot of 10 cards for a total of $100, the Purchase Price of each card would equal $10. Simple enough right?
Well, that would be fine if you were OK with the rare, uncommon, and common cards in the lot all having the same exact purchase price, even though their real market values would be different. So, what I did was create a formula that automatically adjusts the purchase price for each card that's part of a lot based on its rarity, so it's at least closer to the actual market value of the card.
Here is the formula:
=IFS(B2="C",D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="U",D2*$B$16*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="R",D2*$B$17*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2)
Not sure if that means much to anyone, so here's a link to an example spreadsheet of the formula in action below.
And if you don't care to check that out, here's a screenshot:
The problem:
So the formula works exactly how I want it to work EXCEPT when there are 0 commons in a lot. When that happens I get a #DIV/0! error saying that "Function DIVIDE parameter 2 cannot be zero." I understand why this is happening since it doesn't like to divide by 0 in the first line, but what I don't understand is how to fix it.
How can I fix the DIV error, or is there a better way to do this, perhaps an alternative formula or approach? I am not a programmer and somewhat of a beginner at Excel.
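For concreteness, the allocation the formula implements boils down to weight * lot price / total weighted count. A Python sketch of the same arithmetic (the counts, weights, and $100 lot below are made up; only the formula's shape comes from the sheet):

counts  = {"C": 5, "U": 3, "R": 2}    # cards of each rarity in the lot
weights = {"C": 1, "U": 2, "R": 4}    # relative value weights ($B$15:$B$17)
lot_price = 100.0                     # total paid for the lot

total = sum(counts[r] * weights[r] for r in counts)   # 5 + 6 + 8 = 19
per_card = {r: weights[r] * lot_price / total for r in counts}

# Sanity check: per-card prices times counts add back up to the lot price.
assert abs(sum(per_card[r] * counts[r] for r in counts) - lot_price) < 1e-9
print(per_card)   # {'C': 5.26, 'U': 10.53, 'R': 21.05} (rounded)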
Two suggestions.
1. Embed each division in an IFERROR() function, as shown below. This will return zero instead of the error; you can substitute another value or calculation. Depending on the level at which you introduce the IFERROR (around just one of the three calculations, or around all three), you might instead choose to embed the IFS in another IFS that tests for zeroes. Once there are no more divisions by zero, there is no more need for IFERROR; so it becomes a question of formula efficiency.
=IFERROR(D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,0)
2. Forget about all of this and seek a commercially logical solution. The logic says that you never buy a lot unless it contains some items you want, and the seller never has a lot that doesn't contain rubbish. In the end you get inundated with commons, meaning you have more of them than you can ever hope to sell. So, what's their real, commercial value? Value your rare and uncommon cards individually and all the baggage not at all. You will find the outcome more realistic for both the commons and the rares. BTW, that's what they do with stamp and coin collections.
I realise this is more a comment than an answer, but it's too long to put in a comment:
Your formula is unreadable, as you can see:
=IFS(B2="C",D2*$B$15*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="U",D2*$B$16*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2,
B2="R",D2*$B$17*G2/((D2*$B$15)+(E2*$B$16)+(F2*$B$17))/D2)
First I'd advise you to create a new column (you can always hide it), I:I, which contains the following formula (for I2; note the first factor is D2, not B2, which holds the rarity letter):
=D2*G2/((D2*const_weight_common)+(E2*const_weight_uncommon)+(F2*const_weight_rare))/D2
(And you give this a meaningful header name)
As far as names for $B$15:$B$17 are concerned, do something like:
$B$15 : const_weight_common
$B$16 : const_weight_uncommon
$B$17 : const_weight_rare
(You do know how to use names in Excel?)
Like this, your formula becomes:
=IFS(B2="C",I2 * const_weight_common,
B2="U",I2 * const_weight_uncommon,
B2="R",I2 * const_weight_rare)
As far as your error is concerned: as mentioned in another answer, this might be tackled using the =IF() formula, so I2 becomes:
=IF(D2<>0; ...; _ERROR_VALUE) // up to you what to use as the error value
Like this, your formulas become much clearer and it will be easier to solve possible problems.
I Ain't No Math-A-Magician
But I can help you with this...
The way I see it, there are three schools of thought and you need to figure out which one is yours:
The programmer - trapping #div/0 errors all day
The purist - there is only one answer, and it is undefined
The pragmatic - a graph shows results approaching a limit
I think the programmer is either dangerous or ineffectual, or dangerously ineffectual. Sure, he can trap the error, but what exactly does that do? I'll tell you exactly what it does: it puts lipstick on a pig. It's literally replacing one string, "#div/0!", with a different, more aesthetic string. Or he can play second fiddle to the devil, publicly killing the one bug everyone knew how to defend against and creating another.
I think the purist is right about one thing: the answer is what it is and it can't be anything else; but he's also wrong. A precisely known theoretical answer may very well be undefined, but where in the real world has anyone ever seen division by zero? It's a mathematical construct, like infinity; we can safely ignore it. Don't believe me? What happens when you divide the sun by zero? Go ahead, I'll wait for your answer.
I think these things because I am grounded in pragmatism. One string is not better than the other if they both symbolize an attempt to divide by zero. I prefer knowing who my enemies are so I may keep them in front of me. Nature truly abhors the undefined and is only slightly displeased with a vacuum.

(Probably) A Simple Command - Excel

I just joined recently and am really excited to dive into the world of programming. There is still a ton of stuff I don't know, but I'm very proud of myself because I feel like I'm making headway into programming, whereas I used to have a mental block before. I've always been an infrastructure type of gal. But anyway --
I am creating an Excel spreadsheet for my new budget. Here is a screencap of my problem (according to the rules, new users can't attach images):
http://i66.tinypic.com/hx53zm.png
So this is what I want it to do, logically speaking: B38 stays blank until something is entered. It does (B7-B14-B36) if all the fields have something in them. Otherwise, it just subtracts whatever's in B14 and/or B36 from B7.
I'm sure it's really simple -- I just lack the knowledge since I'm new. I have been playing around with this for a few days and searching on Google, and I can't figure out how to make it work for my spreadsheet. I have tried the CountA, Count, If, Isblank statements... and just can't get it to go.
This isn't really important to anything in my life, it's just something I'm making for myself to keep my financials in order -- AND to give me practice with some coding.
Thank you for any help you can give me!
Chris
If I understand you correctly, you will want to add the following to B38
=IF(B7 <> ""; B7-B14-B36; "")
Depending on your version of Excel, you may need to replace the ; with ,. Note that blank cells count as zero in arithmetic, so B7-B14-B36 still gives the right result when B14 or B36 is empty.

Dynamics CRM 2011 Import Data Duplication Rules

I have a requirement to regularly import data from Excel (CSV) into Dynamics CRM.
Instead of using simple Data Duplication Rules, I need to implement a point system to determine whether a record is considered a duplicate or not.
Let me give an example. These are the particular rules for the import:
First Name, exact match, 10 pts
Last Name, exact match, 15 pts
Email, exact match, 20 pts
Mobile Phone, exact match, 5 pts
And then the Threshold value => 19 pts
Now, if a record has First Name and Last Name matching an old record in the entity, the points will be 25 pts, which is higher than the threshold (19 pts), and the record is therefore considered a duplicate.
If, for example, a particular record only has the same First Name and Mobile Phone, the points will be 15 pts, which is lower than the threshold, and it is thus considered a non-duplicate.
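A minimal Python sketch of that scoring rule (the field names and record shapes are made up; matching is plain exact comparison):

RULES = {"first_name": 10, "last_name": 15, "email": 20, "mobile": 5}
THRESHOLD = 19

def score(new, existing):
    # Sum the points of every field that matches exactly (and is non-empty).
    return sum(pts for field, pts in RULES.items()
               if new.get(field) and new.get(field) == existing.get(field))

def is_duplicate(new, existing):
    return score(new, existing) >= THRESHOLD

old = {"first_name": "Ann", "last_name": "Lee",
       "email": "ann@example.com", "mobile": "555-0100"}
print(is_duplicate({"first_name": "Ann", "last_name": "Lee"}, old))    # True, 25 pts
print(is_duplicate({"first_name": "Ann", "mobile": "555-0100"}, old))  # False, 15 pts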
What is the best approach to achieve this requirement? Is it possible to utilize the default Import Data functionality in MS CRM? Is there any 3rd-party add-on that meets my requirement above?
Thank you for all the help.
Updated
Hi Konrad, thank you for your suggestions, let me elaborate here:
Excel. You could filter out the data using Excel and then, once you've obtained a unique list, import it.
Nice one, but I don't think it is really workable in my case; the data will be coming in regularly from the client in moderate numbers (hundreds to thousands), and typically the client won't check the data for duplicates.
Workflow. Run a process removing any instance calculated as a duplicate.
Workflow is a good idea; however, since it is processed asynchronously, my concern is that in some cases the user may already have made updates/changes to the inserted data before the workflow finishes, creating data inconsistency or at the very least a confusing user experience.
Plugin. On every creation of a new record, you'd check if it's to be regarded as duplicate-ish and cancel its creation (or mark it for removal).
I like this approach. So I just import as usual (for example, into the contact entity), but I already have a plugin in place that gets triggered every time a record is created; the plugin will check whether the record is duplicate-ish or not and take the necessary action.
I haven't been fiddling a lot with duplicate detection but looking at your criteria you might be able to make rules that match those, pretty much three rules to cover your cases, full name match, last name and mobile phone match and email match.
If you want to do the points system I haven't seen any out of the box components that solve this, however CRM Extensions have a product called Import Manager that might have that kind of duplicate detection. They claim to have customized duplicate checking. Might be worth asking them about this.
Otherwise it's custom coding that will solve this problem.
I can think of the following approaches to the task (depending on the number of records, the repetitiveness of the import, automation requirements etc.); they may all be good in some way. Would you care to elaborate on the current conditions?
Excel. You could filter out the data using Excel and then, once you've obtained a unique list, import it.
Plugin. On every creation of a new record, you'd check if it's to be regarded as duplicate-ish and cancel its creation (or mark it for removal).
Workflow. Run a process removing any instance calculated as a duplicate.
You also need to consider the implications of eliminating data like this. There's a mathematical issue. Suppose that the uniqueness radius (i.e. the threshold in this 1D case) is 3. Consider the following set of numbers (it's listed twice, just in a different order; an underscore marks a record dropped as a duplicate of an earlier survivor).
1 3 5 7 -> 1 _ 5 _
3 1 5 7 -> _ 3 _ 7
Are you sure that's the intended result? Under some circumstances, you can even end up with result sets of different sizes (depending only on the order). I'm a bit curious about why and how this setup came up.
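A toy Python sketch of that order-dependence, with the same radius of 3:

def greedy_dedup(values, radius=3):
    # Keep a value only if it's at least `radius` away from everything
    # kept so far -- the greedy rule a threshold check applies on import.
    kept = []
    for v in values:
        if all(abs(v - k) >= radius for k in kept):
            kept.append(v)
    return kept

print(greedy_dedup([1, 3, 5, 7]))   # [1, 5]
print(greedy_dedup([3, 1, 5, 7]))   # [3, 7] -- same set, different survivors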
Personally, I'd go with the plugin, if the above is OK by you. If you need to make sure that some of the unique-ish elements never get omitted, you'd probably be best off applying a test algorithm to a backup of the data. However, that may defeat its purpose.
In fact, it sounds so interesting that I might create the solution for you (just to show it can be done) and blog about it. What's the deadline?

Alternative spreadsheet for real-time data

I am looking for an alternative spreadsheet to Excel, preferably but not necessarily open source, that allows a programmer to create a plugin that can update cells in the sheet from an external data source in real time. The spreadsheet would then internally compute all dependent calculation chains upon change of value.
This is similar functionality to what the RTD method does with Microsoft Excel. The rate of external data change could be moderate to high (whatever such relativistic terms mean).
Also the reverse process would be useful, i.e. detecting a change in cells and then sending that information to a plugin that can communicate with external processes.
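To make the mechanism concrete, here's a toy Python sketch of the push-then-recalculate behaviour I mean (all names are made up; there's no real dependency graph or cycle handling):

class Sheet:
    def __init__(self):
        self.values = {}
        self.formulas = {}   # cell -> (function, dependency cell names)

    def set_value(self, cell, value):
        # An external feed pushes a tick; dependents get recomputed.
        self.values[cell] = value
        self._recalc()

    def set_formula(self, cell, func, deps):
        self.formulas[cell] = (func, deps)
        self._recalc()

    def _recalc(self):
        # Naive fixed-point pass: fine for a toy, no cycles allowed.
        for _ in range(len(self.formulas) + 1):
            for cell, (func, deps) in self.formulas.items():
                if all(d in self.values for d in deps):
                    self.values[cell] = func(*(self.values[d] for d in deps))

s = Sheet()
s.set_formula("B1", lambda a: a * 2, ["A1"])
s.set_value("A1", 21)      # simulated real-time update
print(s.values["B1"])      # 42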
Any recommendations or experience in trying this?
I am afraid you will not find any. The main consumers of real-time spreadsheets (grids) are big banks, and they usually invest in their own solutions, because they can afford to and have traditionally seen it as an advantage over the competition. Some of the solutions are very dated, but still going strong! Three years ago I worked on a system which was written in C++ (with TibcoRv as a backbone) and was already five years old. It is still alive and kicking.
One of the strong points of a bespoke grid is "Excel-like formulae" where a user can use fields from a provided data dictionary. So rather than referencing cells, you reference data from your systems. It makes formulae easier to implement and read. And of course you can export or share them; users really like that.
The following could be of some help:
http://www.dadisp.com
http://www.quantrix.com
http://www.resolversystems.com/products/
http://pyspread.sourceforge.net/
http://matrex.sourceforge.net/
This may not exactly satisfy your real-time requirement, but it is worth exploring.
