I am new to Alteryx and am trying to use it for analysing unstructured data. I have a column of description in text form and I intend to use the K-Means Clustering tool for topic modelling. For K-means to work on text, I will need to convert my text into a Document Term Matrix (DTM) so that they appear as continuous variables to the clustering tool. However, I am struggling to find a way I can convert my text to a DTM.
Does anyone know a way to do so? I am currently looking at the R tool but am not exactly sure how to start with it either. Hoping that all of you experts here can help me out!
I have looked through posts on text analysis and realized that most fell back on the Microsoft Azure ML Text Analysis Macro. However, I would like to avoid using the macro (to not be restricted to limited runs every month for scalability) and instead use tools that are available in Alteryx.
Thanks to everyone in advance!
With Alteryx being more of a pictorial drag-and-drop workflow tool, it's not trivial to explain here. However, I've created the following workflow and posted the actual workflow itself on the Alteryx forum here. The workflow computes term frequencies from inauguration speeches but should apply to any collection of documents. It just splits the words based on various non-numeric characters and does a summary. This is what the workflow looks like:
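For readers without the workflow file, the same split-and-count idea can be sketched outside Alteryx in a few lines of Python. This is only an illustration, not the workflow itself: the function name build_dtm and the two sample sentences are made up, and splitting on every non-letter character is an assumption about what the workflow does.

```python
import re
from collections import Counter

def build_dtm(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Assumption: split on any run of non-letter characters, after lowercasing.
    tokenized = [
        [tok for tok in re.split(r"[^a-z]+", doc.lower()) if tok]
        for doc in documents
    ]
    # One column per distinct term, in alphabetical order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical sample documents, for illustration only.
docs = ["The people and the nation", "A nation of laws"]
vocab, dtm = build_dtm(docs)
print(vocab)
for row in dtm:
    print(row)
```

Each row of the matrix is one document and each column one term, which is the numeric layout a clustering tool such as K-Means expects.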
I have a data set (I have provided a link to a photo of the first 10 lines of the data). For my project I need to filter the “MS” column for “ES” (Spain), analyze the data, and draw some conclusions from it, with graphs, visualizations, etc.
But my problem is that, looking at the data, I just don’t know where to start, because it seems like there is nothing to do with the data that is there.
So I have come to all of you to see if you can give me some suggestions and to see if anyone can see any sort of starting point.
These are the details of my project if it helps
Objectives: Represent the protected spaces of Spain and explain the data
Evaluation criteria: Representation is comprehensive, “nice looking” (visualizations), use of analytics (SQL, Python, Excel, etc.). In addition, a report with resources and the conclusions of the analysis, with additional representation.
Notes: it’s important that the client (“teachers”, and also people from companies in Spain) can understand the analysis from the report and visualizations.
Thanks! I have 1863 lines of data (already filtered to ES (SPAIN))
I have a data set of about 1 million employer names. These names are from a free-form text field so they include misspellings and variations in the way they are inputted (e.g. "Amazon" .. "Amzaon" .. "Amazon.com" .. "Amazon Web Services" .. "AWS").
I want to either A) group these 1 million so I have a somewhat accurate sense of how many unique employers are in the data set or B) be able to find all variations of any given employer.
So far, I've been using the data in Tableau, then filtering on "employer name" and searching all variations of the name I can think of. But it's tedious and I'm pretty sure I'm leaving many out.
I've also used the fuzzy add-in for excel but it hasn't worked that well on misspellings, special characters...
Tableau just isn't suited for doing this kind of analysis straight out of the box, and I would highly recommend doing some pre-processing on your data before trying to build a workbook around it.
Like another commenter said, you could look into using Tableau Prep Builder for a one-time transformation on your data set, but if you wanted to automate this process it costs extra to add functionality to whatever Tableau Server installation you have.
If you're familiar with Python or R (and the integration between Tableau Server and those services is supported by your organization), you could look into building a script to run the transformation real-time, but it probably won't be too efficient.
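As a rough idea of what such a pre-processing script could look like in Python, here is a minimal sketch that greedily groups similar spellings using the standard library's difflib. The function names, the 0.75 threshold, and the sample names are illustrative assumptions, not a tuned solution.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and drop everything except letters, digits and spaces."""
    return re.sub(r"[^a-z0-9 ]+", "", name.lower()).strip()

def group_names(names, threshold=0.75):
    """Greedily put each name into the first group whose representative is similar enough."""
    groups = []  # list of (normalized representative, [original names])
    for name in names:
        key = normalize(name)
        for rep, members in groups:
            # ratio() is a similarity score between 0.0 and 1.0
            if SequenceMatcher(None, key, rep).ratio() >= threshold:
                members.append(name)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append((key, [name]))
    return [members for _, members in groups]

# Hypothetical sample, loosely based on the examples in the question.
sample = ["Amazon", "Amzaon", "Amazon.com", "Google", "Gogle"]
print(group_names(sample))
```

Character-level similarity like this catches misspellings and small suffixes, but it will not link abbreviations such as "AWS" to "Amazon Web Services"; those would need a manually maintained alias list. Comparing every name against every group representative is also slow for a million rows, so in practice you would probably de-duplicate exact matches first and only compare names that share, say, the same first letters.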
Try experimenting with Tableau Prep Builder - the companion tool that comes with your Tableau Creator license. It has a group feature that is designed for just these problems.
In Prep Builder, you’ll just need to connect to your data, add a cleaning step, and then add a group to your cleaning step.
I need some help implementing Learning to Rank (LTR). It is related to my semester project and I'm totally new to this. The details are as follows:
I gathered around 90 documents and created 10 user queries. Now I have to rank these documents for each query using three algorithms, specifically LambdaMART, AdaRank, and Coordinate Ascent. Previously I applied clustering techniques on the Vector Space Model, but that was easy. In this case, however, I don't know how to prepare the data for these algorithms, as I have the textual data (documents and queries) in txt format in separate files. I have searched for solutions online and have been unable to find a proper one, so can anyone here please guide me in the right direction, i.e. the steps? I would really appreciate it.
As you said, you have already applied clustering in the vector space model. The inputs to these algorithms are also vectors.
Why don't you have a look at the standard data set introduced for the learning to rank problem (the LETOR benchmark), in which documents are represented as vectors of features?
There is also an implementation of these algorithms in Java (RankLib), which may give you an idea of how to solve the problem. I hope this helps you!
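To make the "vectors of features" idea concrete, here is a small Python sketch that turns one query-document pair into a line in the LETOR / SVMlight-style text format (relevance label, qid, then numbered feature:value pairs). The two features, the relevance label, and the document id are hypothetical; in a real project the labels come from your own relevance judgments and you would choose your own features.

```python
def to_letor_line(relevance, query_id, features, doc_id):
    """Format one query-document pair as '<label> qid:<id> 1:<v> 2:<v> ... # <doc>'."""
    # Feature ids are numbered from 1.
    feats = " ".join(f"{i}:{value:.4f}" for i, value in enumerate(features, start=1))
    return f"{relevance} qid:{query_id} {feats} # {doc_id}"

def term_overlap(query, document):
    """Hypothetical feature: fraction of query terms that occur in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

# Illustrative query, document and relevance label (all made up).
query = "learning to rank"
document = "a short survey of learning to rank methods"
features = [term_overlap(query, document), float(len(document.split()))]
line = to_letor_line(relevance=2, query_id=1, features=features, doc_id="doc_07")
print(line)
```

With 90 documents and 10 queries you would end up with up to 900 such lines, usually kept grouped by qid, and that file is what you hand to the ranking library for training.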
I'm trying to use fuzzy lookup to match a list of correct names with a set of "dirty" names. But apparently VBA only uses one core of my processor, and it takes too much time because I am using it on at least 5000 names.
Here's a link to the fuzzy code: https://www.mrexcel.com/forum/excel-questions/195635-fuzzy-matching-new-version-plus-explanation.html#post955137
I also researched "multi-threading" solutions for VBA and found that there's no native way of doing it, but someone made an alternative using some scripts.
Here's the link for the multithreading vba script tool: https://analystcave.com/excel-vba-multithreading-tool/
Now, all I need to do is to integrate the lookup code to this multithreading script so that it will speed up the processing of this function. I am assuming that this is possible right?
Can someone help me with this? I only learned VBA through googling and reading other codes but this vba multithread tool is quite complicated for a beginner like me.
Thank you very much!
I'm not qualified to address the multithreading, but about your speed issue: are you running the code directly on the spreadsheet?
A better method is to import the entire table or range into an Array, and run the code on it there while it's in computer memory. It runs MUCH faster there. Then paste the results into the spreadsheet.
Here's some info on pulling the data into an array:
Creating an Array from a Range in VBA
http://www.cpearson.com/excel/ArraysAndRanges.aspx
You'll have to fiddle with the rest of your code, but basically you'll treat the array as if it were a table.
Below is an excerpt from the Microsoft website. I believe their C#-based Fuzzy Lookup add-in for MS Excel is multi-threaded and much faster than the code you provided. Why reinvent the wheel when we have a better option available?
The Fuzzy Lookup Add-In for Excel was developed by Microsoft Research and performs fuzzy matching of textual data in Microsoft Excel. It can be used to identify fuzzy duplicate rows within a single table or to fuzzy join similar rows between two different tables. The matching is robust to a wide variety of errors including spelling mistakes, abbreviations, synonyms and added/missing data. For instance, it might detect that the rows “Mr. Andrew Hill”, “Hill, Andrew R.” and “Andy Hill” all refer to the same underlying entity, returning a similarity score along with each match. While the default configuration works well for a wide variety of textual data, such as product names or customer addresses, the matching may also be customized for specific domains or languages. The following libraries are required and will be installed if necessary:
.NET 4.5
VSTO 4.0
I have university placement data pulled from databases in excel sheet. I need to text mine the job description offered by companies, which is a descriptive field for all the rows and then come up with the analysis of profiles in demand.
Here is a snapshot of the data
Could anyone help me to kick start this activity?
Thanks
Saurabh
I am not a data expert but I have some data mining experience. I would try following these steps for starters:
Excel is not a good tool for such an analysis. Find a tool dedicated to data mining, e.g. RStudio. R has many useful out-of-the-box algorithms for data mining.
Cleanse the data e.g. all texts to lower case, remove stop words, remove punctuation, remove additional white spaces.
Tokenize the data e.g. 1 word tokens - "finance", "bachelor"
Decide how you will determine whether a certain profile is in demand. If by profile you mean that you need information on which tokens appear in the data more often than others, e.g. "finance", "bachelor", etc., then simply create a frequency matrix. R allows you to create a visualisation of this: word clouds.
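The cleansing, tokenizing and counting steps above can be sketched as follows. This uses Python rather than R purely for illustration; the stop-word list is deliberately tiny and the two job descriptions are made-up examples, so treat it as a starting point only.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "or", "the", "to", "with"}

def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

def token_frequencies(descriptions):
    """Count how often each one-word token appears across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(tokenize(text))
    return counts

# Hypothetical job descriptions, for illustration only.
descriptions = [
    "Bachelor in Finance, with strong Excel skills.",
    "Finance analyst; bachelor degree required.",
]
freq = token_frequencies(descriptions)
print(freq.most_common(3))
```

The resulting counts are the same numbers a word cloud would be drawn from.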
This is to start you off :). I am sure there is much more to be suggested in this matter.