Grouping similar strings that have misspellings, spacing differences, etc. - Excel

I have a data set of about 1 million employer names. These names are from a free-form text field so they include misspellings and variations in the way they are inputted (e.g. "Amazon" .. "Amzaon" .. "Amazon.com" .. "Amazon Web Services" .. "AWS").
I want to either A) group these 1 million names so I have a reasonably accurate sense of how many unique employers are in the data set, or B) be able to find all variations of any given employer.
So far, I've been using the data in Tableau, then filtering on "employer name" and searching all variations of the name I can think of. But it's tedious and I'm pretty sure I'm leaving many out.
I've also used the fuzzy add-in for Excel, but it hasn't worked that well on misspellings, special characters, and so on.

Tableau just isn't suited to this kind of analysis straight out of the box, and I would highly recommend doing some pre-processing on your data before trying to build a workbook around it.
Like another commenter said, you could look into using Tableau Prep Builder for a one-time transformation of your data set, but if you want to automate this process, it costs extra to add that functionality to whatever Tableau Server installation you have.
If you're familiar with Python or R (and the integration between Tableau Server and those services is supported by your organization), you could look into building a script to run the transformation in real time, but it probably won't be very efficient.
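For what it's worth, here is a minimal sketch of the kind of pre-processing script I mean, in Python. It assumes the third-party rapidfuzz library, and the normalization rules and the similarity threshold of 80 are illustrative rather than definitive:

    import re
    from rapidfuzz import fuzz  # pip install rapidfuzz

    def normalize(name):
        """Lowercase, drop punctuation, and collapse whitespace."""
        name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
        return " ".join(name.split())

    def group_names(names, threshold=80):
        """Assign each name to the first existing group it matches closely enough."""
        groups = {}  # canonical normalized name -> list of original spellings
        for original in names:
            norm = normalize(original)
            for canonical in groups:
                if fuzz.token_set_ratio(norm, canonical) >= threshold:
                    groups[canonical].append(original)
                    break
            else:
                groups[norm] = [original]
        return groups

    print(group_names(["Amazon", "Amzaon", "Amazon.com", "AMAZON INC", "AWS"]))

Note that a pairwise loop like this gets slow at a million rows, so in practice you would block on a cheap key first (say, the first few characters of the normalized name). Also, purely fuzzy matching will never connect "AWS" to "Amazon"; that still needs a small alias table.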

Try experimenting with Tableau Prep Builder - the companion tool that comes with your Tableau Creator license. It has a group feature that is designed for just these problems.
In Prep Builder, you’ll just need to connect to your data, add a cleaning step, and then add a group to your cleaning step.


Excel Data Validation not processing recent cell-data from smartphone input

I have recently observed an issue with the data in a column that I use to perform data validation on my spreadsheet.
There is nothing wrong with the formula, nor is there anything wrong with the use of data validation.
It should be looking for duplicate entries, which works quite fine.
The issue is that it no longer recognizes input made from a smartphone using the Excel app.
So what I did was retype the cell text from my PC, and it worked perfectly.
Is there a way that I can continue using this technique (Data validation) without having to re-enter data from a PC in order for it to process?
Certainly! Yes, that is possible.
But... with all the possibilities in today's world, is your current strategy the one that is the best for you?
That is something I cannot answer for you.
That is something I cannot enumerate for you.
But... There is something that I can introduce to you.
PowerQuery
PowerQuery was a free add-on for Excel 2010 and 2013, and it has been baked directly into Excel for more than half a decade. So, if you're using the mobile app, then you probably have a modern version of Excel with PowerQuery right at your fingertips.
Your first step is to determine how you want to make your data available for Excel to get. Go to the Data tab on the ribbon and review your options in the "Get External Data" group.
It doesn't matter if free data is your Creed and your most intimate moments are publicly available through your raw data feed. Or if paranoia is the reason why you constantly drive around the block scraping SSIDs before squirreling them away to SQL server for detailed analysis. Or if you're using a USB cable to transfer photos to your PC because your mom walked in on you without knocking and was so disgusted by what she saw on your desktop that you're banned from the family LAN... For life. None of that matters because Excel can connect to your data in so many ways that one of them will be perfect for you.
There is a sense of familiarity when importing your data into PowerQuery. It's not unlike following those timeless MS Wizards; but nothing like the uncanny sensation of being dropped into the PowerQuery editor. It is simultaneously the same as Excel and different from Excel, and it may be the closest you ever come to visiting a parallel universe. Many of the same tools are available, but they behave just slightly differently. And in some cases, like the Text To Columns tool, it is light years ahead of Excel, and you will find yourself cursing at MS for not using it as a replacement for the old tool.
When you're done transforming your data, you'll have a tight, clean table. But the real prize is that you have a fully automated pipeline from source to product.
I figured that the phone user included extra spaces when inputting the data.
So I used the TRIM() function, which takes care of the extra spaces between, before, or after each word, and that did the job.
Therefore the root cause was additional spaces that were not recognized in the tested data.

Creating a DTM on Alteryx Designer

I am new to Alteryx and am trying to use it for analysing unstructured data. I have a column of descriptions in text form and I intend to use the K-Means Clustering tool for topic modelling. For K-Means to work on text, I will need to convert my text into a Document Term Matrix (DTM) so that the terms appear as continuous variables to the clustering tool. However, I am struggling to find a way to convert my text to a DTM.
Does anyone know a way to do so? I am currently looking at the R tool but am not exactly sure how to start either. Hoping that all of you experts here can help me out!
I have looked through posts on text analysis and realized that most fell back on the Microsoft Azure ML Text Analysis Macro. However, I would like to avoid using the macro (to not be restricted to limited runs every month for scalability) and instead use tools that are available in Alteryx.
Thanks to everyone in advance!
With Alteryx being more of a pictorial drag-and-drop workflow, it's not trivial to explain here; however, I've created the following workflow and included the actual workflow itself on the Alteryx forum here. The workflow utilizes term frequencies from Inauguration speeches but should apply to any collection of documents. It just splits the text into words on various non-alphanumeric characters and does a summary.
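If you do end up going the R tool (or Python tool) route instead, the underlying idea is just a term-frequency matrix. Here is a rough Python sketch using scikit-learn (my assumption, not something bundled with Alteryx); the sample texts and parameters are placeholders:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import KMeans

    docs = pd.Series([
        "late delivery of parts",
        "parts delivered late again",
        "invoice amount was incorrect",
    ])

    # CountVectorizer tokenizes the text and counts terms per document, which is
    # essentially the split-and-summarize step in the workflow described above.
    vectorizer = CountVectorizer(max_features=1000, stop_words="english")
    dtm = vectorizer.fit_transform(docs)  # rows = documents, columns = terms

    # The DTM columns are continuous (counts), so K-Means can consume them directly.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(dtm)
    print(pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out()))
    print(kmeans.labels_)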

ActivePivot with a rules engine

I have just started on a project which is regulatory in nature, and the business area of the IB I work with uses ActivePivot to manage their securities (inventory).
One of the tasks we need to do is take the data set that feeds ActivePivot and run some sort of simple rules engine over it. There is a little bit of netting involved at the transactional level, but it's mostly simple rules using basic operators. I haven't used ActivePivot before, but the users are telling me it doesn't really allow them to add fields within the cube, which I understand from a technical perspective. I also noted that ActiveViam have a product called ActiveUI which on the surface appears to do this?
Does anyone have any tips/advice on what worked for them? The business also wants a better data visualisation tool (graphs and the like). I was looking at Tableau but am open to suggestions. Many thanks for any help given.
There is no clear question here, so I will respond to your different points one by one:
run some sort of simple rules engine over the data that feeds ActivePivot
You can apply your rules engine to the data set in your project before feeding ActivePivot, just as if you were not using ActivePivot afterwards (sketched after these points).
users are telling me it doesn't really allow them to add fields within the cube
You cannot add fields once the cube is started, but you can update the description of your cube in your project to integrate the new fields brought by your new logic.
I also noted that ActiveViam have a product called ActiveUI which on the surface appears to do this?
ActiveUI is a UI for the ActiveViam products (including ActivePivot), so it provides you with (among other things) tables and charts to navigate your data.
The business also want a better data visualisation tool (graphs and the likes).. I was looking at tableaux but open to suggestions
ActiveUI can provide this. ActivePivot follows the standard for OLAP databases (XMLA), so it is also compatible with other XMLA clients like Excel and Tableau. Your BI team has probably already chosen which client they want to use, so you should check with them.
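On the first point, here is a toy sketch (in Python, with invented field names and rules; nothing here is an ActivePivot API) of what "simple rules with basic operators plus a bit of netting" can look like upstream of the cube load:

    from collections import defaultdict

    # Invented rules: each one adds a boolean field that later becomes an
    # ordinary field in the cube description.
    rules = [
        ("eligible", lambda row: row["rating"] in {"AAA", "AA"}),
        ("large",    lambda row: row["quantity"] >= 1_000_000),
    ]

    def enrich(row):
        for name, predicate in rules:
            row[name] = predicate(row)
        return row

    def net_by_security(rows):
        """Netting at the transactional level: sum signed quantities per ISIN."""
        netted = defaultdict(float)
        for row in rows:
            netted[row["isin"]] += row["quantity"] * (1 if row["side"] == "BUY" else -1)
        return netted

    trades = [
        {"isin": "XS123", "side": "BUY",  "quantity": 2_000_000, "rating": "AAA"},
        {"isin": "XS123", "side": "SELL", "quantity":   500_000, "rating": "AAA"},
    ]
    enriched = [enrich(dict(t)) for t in trades]
    print(net_by_security(enriched))  # {'XS123': 1500000.0}

The output of a step like this is what you then describe in the cube (the original fields plus the new rule fields), which is the "update the description of your cube" part above.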

TFS Query results with a list of linked work item IDs in Excel?

How can I get a list of linked work item IDs for a set of work items?
Excel-hosted queries preferred. API Sample is acceptable.
Direct DB table query is acceptable (read-only and unsupported of course!)
Many thanks in advance! -Zephan
MORE INFORMATION
UPDATE: No answers for my original Q so broadening scope of acceptable answers as follows:
Answer for TFS2015 (migrating very shortly) or TFS2013 (potentially useful for TFS2015) is preferred over TFS2010
Coding acceptable if there are any APIs or PowerShell cmdlets (MS or community).
Connecting directly (read-only!) to TFS DB tables is acceptable (source tables and related relationship link table names). Yes, directly referencing TFS DB tables is VERY unsupported, read-only, and "AT YOUR OWN RISK." Still beats having to manually copy/paste data or reconstruct list of links in Excel.
ORIGINAL QUESTION & DETAILS
My team uses TFS2010 (soon 2013 or hoping 2015) and VS2010-2015. I need to support traceability reports and analyze/quantify our coverage of ~300 Test Case work items linked to ~400 Requirement work items. Direct Link and Tree queries are close but don't give me related links on the same row as parent work item. Many thanks in advance for your suggestions and any related code fragments.
Example:
3 test cases (Test1, Test2, Test3)
4 Requirements (Req1, Req2, Req3, Req4)
For simplicity let's just use TFS work item IDs to represent each TestN and ReqN. In actuality, I have a keyword to identify my validation requirements (separate from the 1,000's of other requirements in this Team Project). The only Test Case WIs I care about for this problem are those linked to one or more Validation Requirements for traceability.
Scenarios:
1:1 (simple) Test1 is linked to Req1
1:2 (1:n) Test2 is linked to Req2 and Req3
2:1 (n:1) Test3 (and Test2) are both linked to Req3
0:1 (Requirement missing Test coverage) Req4 has no test case links
I have a good coverage-gap query by creating a Direct Link query for all Requirements, then setting the "linking filters" to "Only return items that do not have the specified links".
Desired output (all tests with list of related work items):
|Test1 | Req1 |
|Test2 | Req2, Req3 |
|Test3 | Req3 |
For row #2 I am OK with other separators or even entire list using same separator (.CSV or TAB delimited).
Skip right to the answer now if you have a tidy one. If not, then I added considerable RELATED RESEARCH info below to help kick-start an idea that fits the need! Especially since this hasn't been discoverably solved in the last 5 years :-).
RELATED RESEARCH (loooong but may be useful)
1. Visual Studio Queries
Flat Queries should support a list of linked items out-of-the-box... but they do not. The RelatedLinkCount field is handy for knowing if there are any links to chase, but that's it for flat queries.
Direct Link queries give a list of all direct links, but the related IDs are on rows below the parent work item. I am seriously considering creating a formula to look at the next X rows to build a list of IDs, but this would be fragile, especially when more than 3 requirements are linked to the same test. Still, it might solve 80% of my tracing needs.
Tree Queries also show links, but on different rows. Additionally, they tend to follow just one link type. Ideally I will need a list of User Requirements linked to Functional Requirements linked to Test Case(s).
2. Tools / Plug-ins
SmartExcel4TFS (eDEVTech, http://www.modernrequirements.com/smartexcel4tfs/) has 3 reports it supports, but none gets me the core data I need in an easily used format. At least it is FREE if you have an MSDN Premium subscription.
The Requirements to Tests Trace Matrix is super-interesting. Alas, right now I need to go the other way (Requirements linked to a given test case). Also, it merges cells and has sub-sections that are hard to manipulate, I think. (I may revisit this option though.)
Intersection Traceability Matrix report is WAY too wide for a full 300 x 400 grid :-O.
The Work Item Decomposition Matrix also didn't give me the desired contents (though frankly I've forgotten this report's layout from when I checked ~1 month ago).
3. TFS API calls
I have actually avoided this route in favor of a native Excel solution... but if I can get an example of Excel VBA code (or other code with a link to calling it from within Excel) I may go this route. At this point I don't have time to dig into rolling my own... but this would be cool assuming performance is acceptable.
Relevant API/code fragments:
Retrieving TFS Results from a Tree Query (Blogs.msdn.com 2012.02.22) - Looks like this would get me the data I need, but it is not in Excel so I'd need a bridge example of some sort calling this within Excel.
Retrieving work items and their linked work items in a single query using the TFS APIs (stackoverflow.com 2012.01.12) - Also looks very promising, but not connected to Excel. Gives hints for 2 level and 3 level nested links and performance consideration (don't make second call for each item returned!)
Retrieving work items using the Team Foundation Server API (pwee167.github.io 2012.09.18) - Excellently written introductory walkthrough blog posting to learn how to build an (ASP.Net MVC3) app that calls TFS APIs to run Flat or Tree queries. Start here if writing C# (which I could do but don't have time/justification unless easy example to integrate with Excel).
How can I query work items and their linked changesets in TFS? (stackoverflow.com 2011.05.10) - I don't need changesets but this has VB code to instantiate new TfsTeamProjectCollection which might work directly in Excel VBA (assuming proper reference is found and added)
var projectCollection = new TfsTeamProjectCollection(
    new Uri("http://localhost:8080/tfs"),
    new UICredentialsProvider());
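For completeness, here is an untested sketch (Python, not Excel-native) of what the REST route might look like once we are on TFS 2015: run a WIQL query, pull work items with their relations expanded, and flatten the linked IDs onto one row per item. The collection URL, query, and authentication are placeholders; on-prem TFS typically wants NTLM rather than basic auth.

    import csv
    import requests

    TFS = "http://tfsserver:8080/tfs/DefaultCollection"  # placeholder collection URL
    AUTH = ("DOMAIN\\user", "password")                   # placeholder; use NTLM in practice

    wiql = {"query": "SELECT [System.Id] FROM WorkItems "
                     "WHERE [System.WorkItemType] = 'Test Case'"}
    ids = [wi["id"] for wi in
           requests.post(TFS + "/_apis/wit/wiql?api-version=1.0",
                         json=wiql, auth=AUTH).json()["workItems"]]

    rows = []
    for start in range(0, len(ids), 200):  # the work items endpoint caps batch size
        batch = ids[start:start + 200]
        resp = requests.get(TFS + "/_apis/wit/workitems",
                            params={"ids": ",".join(map(str, batch)),
                                    "$expand": "relations",
                                    "api-version": "1.0"},
                            auth=AUTH)
        for wi in resp.json()["value"]:
            linked = [rel["url"].rsplit("/", 1)[-1]  # linked work item ID is the URL tail
                      for rel in wi.get("relations", [])
                      if "/wit/workitems/" in rel.get("url", "").lower()]
            rows.append([wi["id"], ", ".join(linked)])

    with open("trace_list.csv", "w", newline="") as f:  # open this CSV in Excel afterwards
        csv.writer(f).writerows(rows)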
OK, that's everything I have gathered on this problem. Please help contribute with the missing magic tool/snippet or follow the info above to build that last bit I have not had time to prototype & debug. Many thanks in advance!! -Zephan

Integrating with 500+ applications

Our customers use 500+ applications and we would like to integrate these applications with ours. What is the best way to do that? These applications are time registration applications, and common for most of them is that they can export to CSV or similar; some of them are actually home-brewed Excel sheets where time is registered.
The best idea so far is to create our own Excel sheet, which can be used to integrate with all these applications. The integrations could be in the form of cells containing something like ='[c:\export.csv]rawdata'!$A$3 where export.csv is the CSV file exported from the time registration application. Can you see a better way to integrate against all these applications? It should be mentioned that almost all our customers have Microsoft Office.
Edit: Answers to the excellent questions from Pontus Gagge:
How similar are the data in the different applications?
I assume that since they are time registration applications, they will have some similarities, but some will register how long one has worked in total for a whole month, while others will specify it for each day. If Excel is chosen, I believe that many of the differences could be ironed out using basic formulas.
What quality is the data?
The quality of the data can vary, so basic validation must be undertaken. A good approach is also to make it transparent to the customers how our application understands their input, so that they are responsible for it.
How large amounts of data are you talking about?
There will be information about the time worked for up to 50 employees.
Is the integration one-way only?
Yes
With what frequency should information be transferred?
Once per month (when they need to pay salaries).
How often do the applications themselves change, and how often does your product change?
If their application is a home-brewed Excel sheet, then I assume it will change once a year (due, for example, to someone's mistake). If it is a proper, standard time registration application, then I do not believe they are updated more often than every fifth year or so, as it is a very stable concept.
Should the integration be fully automatic or can your end users trigger a data transfer?
They can surely trigger data transfer. The users are often dedicated to the process, so they can be trained to do it, which means that they could make up to, say, 30 mouse clicks in order to integrate each month.
Will the customers have somebody to monitor the integrations?
As we have many customers, many of them should be able to undertake the integration themselves. We will, though, be able to assist them over the telephone. We cannot, however, undertake the integration ourselves, because we would then be responsible for any errors due to user mistakes, etc.
Does the phrase 'integration spaghetti' mean anything to you...?
I am looking for ideas from the best chefs to cook a nice large portion of that.
You need to come up with a common data format, and a way to translate the individual data formats to the common format. There's really no way around this - any solution you come up with will have to do this in one way or the other. It's the essential complexity of what you're doing.
The bigger issue is actually variances within the source data, in terms of how things like dates are stored, missing columns, etc. Doing a generic conversion for CSV to move columns around is comparatively easy.
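To make that concrete, here is a small sketch (Python with pandas; the common schema, the per-application column maps, and the date formats are examples only) of the "one mapping per source into a common format" idea:

    import pandas as pd

    COMMON_COLUMNS = ["employee_id", "date", "hours"]

    # One small mapping per source application; this is where the per-source
    # variance (column names, date formats, missing columns) gets absorbed.
    SOURCE_MAPS = {
        "app_a": {"columns": {"EmpNo": "employee_id", "Day": "date", "Worked": "hours"},
                  "date_format": "%d.%m.%Y"},
        "app_b": {"columns": {"Employee": "employee_id", "Date": "date", "Hrs": "hours"},
                  "date_format": "%Y-%m-%d"},
    }

    def to_common(csv_path, source):
        spec = SOURCE_MAPS[source]
        df = pd.read_csv(csv_path).rename(columns=spec["columns"])
        df["date"] = pd.to_datetime(df["date"], format=spec["date_format"])
        # Columns a source cannot supply are created empty so every file ends up
        # with the same shape; validation can then flag the gaps.
        for col in COMMON_COLUMNS:
            if col not in df.columns:
                df[col] = pd.NA
        return df[COMMON_COLUMNS]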
I would also look at CSV and then use an OLEDB connection against the CSV file for importing.
If you try to make something that can interface to any data structure in the universe (and 500 is plenty close enough), it is guaranteed to be a maintenance nightmare. Instead I would approach this from multiple angles:
Devise an interface into which a human can enter this data already in the proper format. With 500+ clients, I'd make this a small, raw but functional browser-based site that users can use to enter this information manually. This is the fall-back. At the end of the day, a human can re-key the information into the site and solve the import issue. Ideally, everyone would use this instead of their own format. Data entry people are cheap.
Similar to above, but expanded, I would develop a standard application or standardize on an off-the-shelf application that can be used to replace their existing format. This might take more time than #1. The goal would be to only do one-time imports of these varying data schemas into the application and be done with them for good.
The nice thing about spreadsheets is that you can do anything anywhere. The bad thing about spreadsheets is that you can do anything anywhere. With CSV or a spreadsheet there is simply no way to enforce data integrity and thus consistency (which is the primary goal) on the data. If the source data is already in a database, then that is obviously simpler.
I would be inclined to use a database format into which each of these files needs to be converted, rather than a spreadsheet (e.g. use something like Jet (MDB)). If you have non-Windows users then that will make it harder and you might have to use a spreadsheet. The problem is that it is too easy for the user to change their source structure, break their upload and come crying to you. If a given end user has a resident expert, they can find a way of importing the data into that database format. If you are that expert, then I would, on a case-by-case basis, write something that would import into that database format. XML would be the other choice, but that will likely take more coding than an import/export into a database format.
Standardization of the apps (even having all the sources in a database format instead of a spreadsheet would help) and control over the data schema is the ultimate goal rather than permitting a gazillion formats. There really is no nice answer other than standardization. Otherwise, you will have to write a converter for every Tom, Dick, and Harry format, and write it again when someone changes the source format.
With a multitude of data sources, mapping each one correctly to an intermediate format is not trivial. Regular expressions are good with a finite set of known data formats. Multiple passes can help when data is ambiguous without context (e.g. month and day fields where you have several days of data), and can also help defeat data entry errors. But since this data is connected to salaries, there needs to be a good, reliable transfer.
An import configuring trick
Get the customer to make a set of training data in their application. It should have a "predefined unique date", and each subsequent data field should contain a number corresponding to the target data field in your application. On import, your application needs to recognise the predefined date, determine the unique translation required, display/save this "mapping key", and stop the import. E.g. if you expect "Duration hours" in field two, then get the user to enter 2 in the relevant field, which might be "Attendance hours". (A rough sketch of this follows the notes below.)
On subsequent runs, and with the mapping definition key, import becomes a fairly easy process of translation.
Note on terms
"predefined date" - must be historical, say founding date of your company?, might need to be in PC clock settable range.
"mapping key" - could be string of hex digits and nybble based so tractable to workout
The entered code can be extended to signify required conversions, e.g. the customer's application has durations in days and your application expects hours.
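A rough sketch of the trick in Python (the sentinel date, the CSV layout, and the target field names are all invented for illustration):

    import csv

    SENTINEL_DATE = "1899-12-31"  # the "predefined unique date"; must be historical
    TARGET_FIELDS = {1: "employee_id", 2: "duration_hours", 3: "project_code"}

    def learn_mapping(csv_path):
        """Return {source column index -> target field name} from the training row."""
        with open(csv_path, newline="") as f:
            for row in csv.reader(f):
                if SENTINEL_DATE in row:
                    date_col = row.index(SENTINEL_DATE)
                    # every other non-empty field holds the number of the target field
                    return {i: TARGET_FIELDS[int(v)]
                            for i, v in enumerate(row)
                            if i != date_col and v.strip()}
        raise ValueError("training row with the predefined date not found")

    def apply_mapping(csv_path, mapping):
        """Translate every ordinary row using the learned mapping key."""
        with open(csv_path, newline="") as f:
            for row in csv.reader(f):
                if SENTINEL_DATE in row:
                    continue  # skip the training row itself
                yield {target: row[i] for i, target in mapping.items()}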
Interfacing with Windows programs (in order of increasing fragility)
Ye Olde saving as CSV file
Print to an operating system printer that is set up as a text file/PDF, then scavenge the data out of that
Extract data via the application's interface control, typically ActiveX for several Windows programs, e.g. Matlab's Spreadsheet Link
Read the native file format (xls), e.g. Matlab's xlsread
Add an additional intermediate spreadsheet sheet that has extended cell references ie ='[filename]rawdata'!$A$3
Have a look at Teiid by JBoss: http://jboss.org/teiid
Also consider using SOA - e.g., if you're on Java, try JBoss SOA platform: http://www.jboss.com/resources/soa/?intcmp=1004
Use a simple XML format. A non-technical person can easily understand a simple XML format (and could even identify basic problems with XML documents that are not well-formed).
Maybe use a DTD (or even better, an XML schema) to do very basic validation, and then supplement this with an XSL stylesheet to do more validation with better error reporting. (An XSL stylesheet simply converts from XML to something else and so can generate readable error messages.)
The advantage of this approach is that web browsers such as Internet Explorer can apply the XSL stylesheets. A customer need only spend at most a day enhancing their applications or writing Excel macros to generate the XML data in the format that you specify.
Recent versions of Excel have support for converting spreadsheet data to XML, and can even validate against schemas.
Once the data passes the XSL validation checks, you have validated XML data.
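On the receiving side, the validation step can be quite small; here is a sketch in Python with the lxml library (the schema and stylesheet file names and the XML layout are placeholders):

    from lxml import etree  # pip install lxml

    schema = etree.XMLSchema(etree.parse("timesheet.xsd"))  # the XSD you publish
    doc = etree.parse("customer_upload.xml")                # a customer's submission

    if not schema.validate(doc):
        # schema.error_log carries line numbers and messages for every violation
        for error in schema.error_log:
            print("line %d: %s" % (error.line, error.message))
    else:
        # the XSL stylesheet turns valid XML into a readable summary/confirmation
        transform = etree.XSLT(etree.parse("report.xsl"))
        print(str(transform(doc)))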
If you have heaps of data and heaps of money, you could look at existing data management and cleansing tools:
http://www-01.ibm.com/software/data/infosphere/datastage
http://www-01.ibm.com/software/data/infosphere/qualitystage
But even then, you'll likely need to follow kyoryu's suggestion, assuming you have 500+ data formats. The problem isn't on your side. You need them to standardize their output formats if you have no control over their apps. CSV is likely the easiest. You could even send them an Excel template to help them along.
