How to compare data between a database and a guide which are differently structured? - excel

A rather complicated problem in data exchange between a database and a bookform:
The organisation in which I work has a database in mysql for all social profit organisations in Brussels, Belgium. At the same time there is a booklet created in Indesign which was developed in a different time and with different people than the database and consequently has a different structure.
Every year a new book is published and the data needs to be compared manually because of this difference in structure. The book changes its way of displaying entries according to the need of a chapter. It would help to have a crossplatform search and change tool, best not with one keyword but with all the relevant data for an entry in the book.
An example of an entry in the booklet:
BESCHUTTE WERKPLAATS BOUCHOUT
Neromstraat 26 • 1861 Wolvertem • Tel 02-272 42 80 • Fax 02-269 85 03 • Gsm 0484-101 484 E-mail info#bwbouchout.be • Website www.bwbouchout.be Werkdagen: 8u - 16u30, vrijdag tot 14u45.
Personen met een fysieke en/of verstandelijke handicap. Ook psychiatrische patiënten en mensen met een meervoudige handicap.
Capaciteit: 180 tewerkstellingsplaatsen.
A problem: The portable phone number is written in another format as in the database. The database would say: 0484 10 14 84 the book says: 0484-101 484
The opening times are formulated completely different, but some of it is similar.
Are there tools which would make life easier? Tools where you would be able to find similar data something like: similar data finder for excel but then cross platform and with more possibilities? I believe most data exchange programs work very "one-way same for every entry". Is there a program which is more flexible?
For clarity: I need to compare the data, not to generate the data out of the database.
It could mean saving a lot of time, money and eyestrain. Thanks,
Erik Willekens

Erik,
The specific problem of comparing two telephone number which are formatted differently is relatively easy to overcome by stripping all non-numeric characters.
However I don't think that's really what you are trying to achieve. I believe you're attempting to compare whether the booklet data is different to the database data but disregard certain formatting.
Realistically this isn't possible without having some very well defined rules on the formatting. For instance formatting on the organisation name is probably very significant whereas telephone number formatting is not.
Instead you should be tracking changes within the database and then manually check the booklet.

One possible solution is to store the booklet details for each record in your database alongside the correctly formatted ones. This allows you to perform a manual conversion once for the entire booklet and then each subsequent year lets you just compare the new booklet values to the old booklet values stored in the DB.
An example might make this clearer. Imagine you had this very simple record:
Org Name Booklet Org Name GSM Booklet GSM
-------- ---------------- --- -----------
BESCHUTTE BESCHUTTE WERKP 0484 10 14 84 0484-101 484
When you get next year's booklet, then as long as the GSM number in the new booklet still says 0484-101 484 you won't have to worry about converting it to your database format and then checking to see if it has changed.
This would not be a good approach if a large proportion of details in the booklet changed each year

Related

Excel Data Validation not processing recent cell-data from smartphone input

I have recently observed an issue regarding my data in a column that I use to perform data validation on my spreadsheet.
So There is nothing wrong with the formula, neither is there anything from with the use of data validation.
It should be looking for duplicate entries, which works quite fine.
The issue is that it no longer recognizes input made from a smartphone using the excel app.
so what i did was to retype cell text field from my PC and it worked perfectly.
Is there a way that I can continue using this technique (Data validation) without having to re-enter data from a PC in order for it to process?
Certainly! Yes, that is possible.
But... with all the possibilities in today's world, is your current strategy the one that is the best for you?
That is something I cannot answer for you.
That is something I cannot enumerate for you.
But... There is something that I can introduce to you.
PowerQuery
PowerQuery was a free add-on for Excel 2010 and 2013 and it has been baked directly into Excel for more than half a decade. So, if you're using the mobile app then you probably have a modern version of Excel with PowerQuery right at your finger tips.
Your first step if to determine how you want to make your data available for Excel to get. Go to the Data Tab on the ribbon and review your options in the "Get Extetnal Data" group.
It doesn't matter if free data is your Creed and your most intimate moments are publicly available through your raw data feed. Or if paranoia is the reason why you constantly drive around the block scraping SSIDs before squirreling them away to SQL server for detailed analysis. Or if you're using a USB cable to transfer photos to your PC because your mom walked in on you without knocking and was so disgusted by what she saw on your desktop that you're banned from the family LAN... For life. None of that matters because Excel can connect to your data in so many ways that one of them will be perfect for you.
There is a sense of familiarity when Importing your data into PowerQuery. It's not unlike following those timeless MS Wizards; but nothing like the uncanny sensation of being dropped into the PowerQuery editor. It is simultaneously the same as Excel and different from Excel and it may be the closest you ever come to visiting a parallel universe. Many of the same tools are available but they behave just slightly differently. And in some cases, like the Text To Columns tool, it is light years ahead of Excel and you will find yourself cursing at MS for not using it as a replacement for the old tool.
When you're done transforming your data, you'll have a tight clean table. But the real prize, is that you have fully automated pipe from source to product .
I figured that the phone user included extra spaces when inputting the data.
So i Used the TRIM() function which takes care of the extra spaces between, before, or after each word, and that did the job.
Therefore the major error was that there were additional spaces that was not recognized in the tested data.

Dynamics CRM 2011 Import Data Duplication Rules

I have a requirement in which I need to import data from excel (CSV) to Dynamics CRM regularly.
Instead of using some simple Data Duplication Rules, I need to implement a point system to determine whether a data is considered duplicate or not.
Let me give an example. For example these are the particular rules for Import:
First Name, exact match, 10 pts
Last Name, exact match, 15 pts
Email, exact match, 20 pts
Mobile Phone, exact match, 5 pts
And then the Threshold value => 19 pts
Now, if a record have First Name and Last Name matched with an old record in the entity, the points will be 25 pts, which is higher than the threshold (19 pts), therefore the data is considered as Duplicate
If, for example, the particular record only have same First Name and Mobile Phone, the points will be 15 pts, which is lower than the threshold and thus considered as Non-Duplicate
What is the best approach to achieve this requirement? Is it possible to utilize the default functionality of Import Data in the MS CRM? Is there any 3rd party Add-on that answer my requirement above?
Thank you for all the help.
Updated
Hi Konrad, thank you for your suggestions, let me elaborate here:
Excel. You could filter out the data using Excel and then, once you've obtained a unique list, import it.
Nice one but I don't think it is really workable in my case, the data will be coming regularly from client in moderate numbers (hundreds to thousands). Typically client won't check about the duplication on the data.
Workflow. Run a process removing any instance calculated as a duplicate.
Workflow is a good idea, however since it is being processed asynchronously, my concern is the user in some cases may already do some update/changes to the data inserted, before the workflow finish working.. therefore creating some data inconsistency or at the very least confusing user experience
Plugin. On every creation of a new record, you'd check if it's to be regarded as duplicate-ish and cancel it's creation (or mark for removal).
I like this approach. So I just import like usual (for example, to contact entity), but I already have a plugin in place that getting triggered every time a record is created, the plugin will check whether the record is duplicat-ish or not and took necessary action.
I haven't been fiddling a lot with duplicate detection but looking at your criteria you might be able to make rules that match those, pretty much three rules to cover your cases, full name match, last name and mobile phone match and email match.
If you want to do the points system I haven't seen any out of the box components that solve this, however CRM Extensions have a product called Import Manager that might have that kind of duplicate detection. They claim to have customized duplicate checking. Might be worth asking them about this.
Otherwise it's custom coding that will solve this problem.
I can think of the following approaches to the task (depending on the number of records, repetitiveness of the import, automatization requirement etc.) they may be all good somehow. Would you care to elaborate on the current conditions?
Excel. You could filter out the data using Excel and then, once you've obtained a unique list, import it.
Plugin. On every creation of a new record, you'd check if it's to be regarded as duplicate-ish and cancel it's creation (or mark for removal).
Workflow. Run a process removing any instance calculated as a duplicate.
You also need to consider the implication of such elimination of data. There's a mathematical issue. Suppose that the uniqueness' radius (i.e. the threshold in this 1D case) is 3. Consider the following set of numbers (it's listed twice, just in different order).
1 3 5 7 -> 1 _ 5 _
3 1 5 7 -> _ 3 _ 7
Are you sure that's the intended result? Under some circumstances, you can even end up with sets of records of different sizes (only depending on the order). I'm a bit curious on why and how the setup came up.
Personally, I'd go with plugin, if the above is OK by you. If you need to make sure that some of the unique-ish elements never get omitted, you'd probably best of applying a test algorithm to a backup of the data. However, that may defeat it's purpose.
In fact, it sounds so interesting that I might create the solution for you (just to show it can be done) and blog about it. What's the dead-line?

SSIS Split String address

I have a column which is made up of addresses as show below.
Address
1 Reid Street, Manchester, M1 2DF
12 Borough Road, London, E12,2FH
15 Jones Street, Newcastle, Tyne & Wear, NE1 3DN
etc .. etc....
I am wanting to split this into different columns to import into my SQL database. I have been trying to use Findstring to seperate by the comma but am having trouble when some addresses have more "sections" than others. ANy ideas whats the best way to go about this?
Many THanks
This is a requirements specification problem, not an implementation problem. The more you can afford to assume about the format of the addresses, the more detailed parsing you will be able to do; the other side of the same coin is that the less you will assume about the structure of the address, the fewer incorrect parses you will be blamed for.
It is crucial to determine whether you will only need to process UK postal emails, or whether worldwide addresses may occur.
Based on your examples, certain parts of the address seem to be always present, but please check this resource to determine whether they are really required in all UK email addresses.
If you find a match between the depth of parsing that you need, and the assumptions that you can safely make, you should be able to keep parsing by comma indexes (FINDSTRING); determine some components starting from the left, and some starting from the right of the string; and keep all that remains as an unparsed body.
It may also well happen that you will find that your current task is a mission impossible, especially in connection with international postal addresses. This is why most websites and other data collectors require the entry of postal address in an already parsed form by the user.
Excellent points raised by Hanika. Some of your parsing will depend on what your target destination looks like. As an ignorant yank, based on Hanika's link, I'd think your output would look something like
Addressee
Organisation
BuildingName
BuildingAddress
Locality
PostTown
Postcode
BasicsMet (boolean indicating whether minimum criteria for a good address has been met.)
In the US, just because an address could not be properly CASSed doesn't mean it couldn't be delivered - cip, my grandparent-in-laws live in enough small town that specifying their name and city is sufficient for delivery as local postal officials know who they are. For bulk mailings though, their address would not qualify for the bulk mailing rate and would default to first class mailing. I assume a similar scenario exists for UK mail
The general idea is for each row flowing through, you'll want to do your best to parse the data out into those buckets. The optimal solution for getting it "right" is to change the data entry method to validate and capture data into those discrete buckets. Since optimal never happens, it becomes your task to sort through the dross to find your gold.
Whilst you can write some fantastic expressions with FINDSTRING, I'd advise against it in this case as maintenance alone will drive you mad. Instead, add a Script Transformation and build your parsing logic in .NET (vb or c#). There will then be a cycle of running data through your transformation and having someone eyeball the results. If you find a new scenario, you go back and adjust your business rules. It's ugly, it's iterative and it's prone to producing results that a human wouldn't have.
Alternatives to rolling your address standardisation logic
buy it. Eventually your business needs outpace your ability to cope with constantly changing business rules. There are plenty of vendors out there but I'm only familiar with US based ones
upgrade to SQL Server 2012 to use DQS (Data Quality Services). You'll probably still need to buy a product to build out your knowledge base but you could offload the business rule making task to a domain expert ("Hey you, you make peanuts an hour. Make sure all the addresses coming out of this look like addresses" was how they covered this in the beginning of one of my jobs).

Integrating with 500+ applications

Our customers use 500+ applications and we would like to integrate these applications with our. What is the best way to do that? These applications are time registration applications and common for most of them is that they can export to csv or similar, some of them are actually home-brewed excel sheets where time is registered.
The best idea so far is to create our own excel sheet, which can be used to integrate with all these applications. The integrations could be in the form of cells containing something like ='[c:\export.csv]rawdata'!$A$3 Where export.csv is the csv file exported from the time registration applications. Can you see a better way to integrate against all these applications? It should be mentioned that almost all our customers have Microsoft Office.
Edit: Answers to the excellent questions from Pontus Gagge:
How similar are the data in the different applications?
I assume that since they time registration applications, they will have some similarities, but I assume that some will register the how long time one has worked in total for a whole month, while others will spesify for each day. If Excel is chosen, I believe that many of the differences could be ironed out using basic formulas.
What quality is the data?
The quality of the data can vary so basic validation must be undertaken, a good way is also to make it transparent for the customers, how our application understands their input, so they are responsible.
How large amounts of data are you talking about?
There will be information about the time worked for up to 50 employees.
Is the integration one-way only?
Yes
With what frequency should information be transferred?
Once per month (when they need to pay salaries).
How often do the applications themselves change, and how often does your product change?
If their application is a home-brewed Excel sheet, then I assume it will change once a year (due for example a mistake someone). If it is a standard proper time registration application, then I do not believe they are updated more often than every fifth year or so, as it is a very stabile concept.
Should the integration be fully automatic or can your end users trigger a data transfer?
They can surely trigger data transfer. The users are often dedicated to the process so they can be trained at doing it, which means that they could make up to, say 30, mouse clicks in order to integrate each month.
Will the customers have somebody to monitor the integrations?
As we have many customers, many of them should be able to undertake the integration themselves. We will though be able to assist them over the telephone. We cannot, though undertake the integration ourselves because we would then be responsible for any errors due to user mistakes, etc.
Does the phrase 'integration spaghetti' mean anything to you...?
I am looking for ideas from the best chefs to cook a nice large portion of that.
You need to come up with a common data format, and a way to translate the individual data formats to the common format. There's really no way around this - any solution you come up with will have to do this in one way or the other. It's the essential complexity of what you're doing.
The bigger issue is actually variances within the source data, in terms of how things like dates are stored, missing columns, etc. Doing a generic conversion for CSV to move columns around is comparatively easy.
I would also look at CSV and then use an OLEDB connection against the CSV file for importing.
If you try to make something that can interface to any data structure in the universe (and 500 is plenty close enough), it is guaranteed to be a maintenance nightmare. Instead I would approach this from multiple angles:
Devise an interface into which a human can enter this data already in the proper format. With 500+ clients, I'd make this a small, raw but functional browser based site that users can use to enter this information manally. This is the fall-back. At the end of the day, a human can re-key the information into the site and solve the import issue. Ideally, everyone would use this instead of their own format. Data entry people are cheap.
Similar to above, but expanded, I would develop a standard application or standardize on an off-the-shelf application that can be used to replace their existing format. This might take more time than #1. The goal would be to only do one-time imports of these varying data schemas into the application and be done with them for good.
The nice thing about spreadsheets is that you can do anything anywhere. The bad thing about spreadsheets is that you can do anything anywhere. With CSV or a spreadsheet there is simply no way to enforce data integrity and thus consistency (which is the primary goal) on the data. If the source data is already in a database, then that is obviously simpler.
I would be inclined to use database format into which each of these files need to be converted rather than a spreadsheet (e.g. use something like Jet (MDB)). If you have non-Windows users then that will make it harder and you might have to use a spreadsheet. The problem is that it is too easy for the user to change their source structure, break their upload and come crying to you. If a given end user has a resident expert, they can find a way of importing the data into that database format . If you are that expert, then I would on a case-by-case basis, write something that would import into that database format. XML would be the other choice, but that will likely take more coding than an import/export into a database format.
Standardization of the apps (even having all the sources in a database format instead of a spreadsheet would help) and control over the data schema is the ultimate goal rather than permitting a gazillion formats. There really is no nice answer other than standardization. Otherwise, you are having to write a converter for every Tom-Dick-and-Harry format and again when someone changes the source format.
With a multitude of data sources mapping each one correctly to an intermediate format is not trivial. Regular expressions are good with a finite set of known data formats. Multipass can help when data is ambiguous without context (month,day fields and have several days of data), and also help defeat data entry errors. But it seems as this data is connected to salaries there needs a good reliable transfer.
An import configuring trick
Get the customer to make a set of training data in the application. It should have a "predefined unique date" and each subsequent data field have a number corresponding to the target data field in your application. On importing your application needs to recognise the predefined date, determine the unique translation required and effect the displaying/saving of this "mapping key", and stop the import. eg If you expect "Duration hours" in field two then get the user to enter 2 in the relevant field which might be "Attendance hours".
On subsequent runs, and with the mapping definition key, import becomes a fairly easy process of translation.
Note on terms
"predefined date" - must be historical, say founding date of your company?, might need to be in PC clock settable range.
"mapping key" - could be string of hex digits and nybble based so tractable to workout
The entered code can be extended to signify required conversions ie customer's application has durations in days and your application expects it in hours.
Interfacing with windows programs (in order if increasing fragility)
Ye Olde saving as CSV file
Print to operating system printer that is setup as a text file/pdf, then scavenge the data out of that
Extract data via the application interface control, typically ActiveX for several windows programs ie like Matlab's Spreadsheet Link
Read native file format xls format ie like Matlab's xlsread
Add an additional intermediate spreadsheet sheet that has extended cell references ie ='[filename]rawdata'!$A$3
Have a look at Teiid by JBoss: http://jboss.org/teiid
Also consider using SOA - e.g., if you're on Java, try JBoss SOA platform: http://www.jboss.com/resources/soa/?intcmp=1004
Use a simple XML format. A non-technical person can easily understand a simple XML format (and could even identify basic problems with XML documents that are not well-formed).
Maybe use a DTD (or even better an XML schema) to do very basic validation, and then supplement this with an XSL stylesheet to do more validation with better error reporting. (An XSL stylesheet simply converts from XML to something else and so can be generate readable error messages.)
The advantage of this approach is that web browsers such as Internet Explorer can apply the XSL stylesheets. A customer need only spend at most a day enhancing their applications or writing excel macros to generate the XML data in the format that you specify.
Recent versions of Excel have support for converting spreadsheet data to XML, and can even validate against schemas.
Once the data passes the XSL validation checks, you have validated XML data.
If you have heaps of data and heaps of money, you could look at existing data management and cleansing tools:
http://www-01.ibm.com/software/data/infosphere/datastage
http://www-01.ibm.com/software/data/infosphere/qualitystage
But even then, you'll likely need to follow kyoryu's suggestion assuming you have 500+ data formats. The problem isn't your side. You need them to standardize their output formats if you have no control over their apps. CSV is likely the easiest. You could even send them a excel template to help them along.

What is the correct way to implement a massive hierarchical, geographical search for news?

The company I work for is in the business of sending press releases. We want to make it possible for interested parties to search for press releases based on a number of criteria, the most important being location. For example, someone might search for all news sent to New York City, Massachusetts, or ZIP code 89134, sent from a governmental institution, under the topic of "traffic". Or whatever.
The problem is, we've sent, literally, hundreds of thousands of press releases. Searching is slow and complex. For example, a press release sent to Queens, NY should show up in the search I mentioned above even though it wasn't specifically sent to New York City, because Queens is a subset of New York City. We may also want to implement "and" and "or" and negation and text search to the query to create complex searches. These searches also have to be fast enough to function as dynamic RSS feeds.
I really don't know anything about search theory, or how it's properly done. The way we are getting by right now is using a data mart to store the locations the releases were sent to in a single table. However, because of the subset thing mentioned above, the data mart is gigantic with millions of rows. And we haven't even implemented cities yet, and there are about 50,000 cities in the United States, which will exponentially increase the size of the data mart by so much I'm afraid it just won't work anymore.
Anyway, I realize this is not a simple question and there won't be a "do this" answer. However, I'm hoping one of you can point me in the right direction where I can learn about how massive searches are done? Because I really know nothing about it. And such a search engine is turning out to be incredibly difficult to make. Thanks! I know there must be a way because if Google can search the entire internet we must be able to search our own database :-)
Google can search the entire internet, and your data via a Google Appliance!

Resources