My Question
What are the best practices for creating a customized report based on user form input? Specifically, how do I create an easy-to-maintain system that takes user input collected in a form and generates multiple paragraphs that explain the results of an analysis?
Background
I am working on a very large multiyear project with a startup (my client). My job is to program the analysis and generate reports for users. The pipeline for data looks like this:
Users enter information into a form -> results are calculated based on user input -> reports are displayed to users that share analysis.
It is really important to my client that some of the analysis results are displayed in paragraphs in a non-formal, user-friendly tone. The challenge is that the form and analysis are quite complex and will only get more complex over time. An example of the type of template used for the paragraphs looks something like this:
resultsParagraphText=`Hi ${userName}. We found that the best ice cream flavour for you is ${bestIceCreamFlavor}. These other flavors ${otherFlavors} might be good for you. Here are the reasons why you might enjoy these flavors: ${reasonsWhyGoodFlavors}.
However we would not recommend these other flavors ${badFlavors}. Here are the reasons you should avoid these bad flavors: ${reasonsWhyBadFlavors}.`
These results paragraphs, of which there are many, have several minor problems which combined are significant:
If there is a bug in the code, minor visual errors would be visible to end users (capitalization errors, missing/extra commas, and so on).
A lot of string comparisons (e.g. if answers.previousFlavors.includes("Vanilla")) are required to generate the results paragraphs. Minor errors in the forms (e.g. vanilla in the form is not capitalized, so answers.previousFlavors.includes("Vanilla") returns false even when the user enters vanilla) can cause errors in the results paragraph.
Changes in different parts of the project (form, analysis) directly affect how the results paragraph is made. Bad types, differences in string values, and null or undefined values that are not caught all have a direct impact on how the results paragraph is made.
There are many edge cases (e.g. what if the user has no other suitable good flavors? Then the sentence These other flavors ${otherFlavors} might be good for you. needs to be excluded).
It is hard to write paragraphs that use templates and have a non-formal tone.
and so on.
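To make the problems above concrete, here is a minimal sketch in Python of the kind of handling they call for. All function names are hypothetical, and the template is a shortened version of the one above (the reasons sentences are left out). It normalizes strings before comparing them, builds the comma list in one place, and drops a sentence when its list is empty.

```python
def normalize(value):
    """Trim and case-fold a string so "vanilla " and "Vanilla" compare equal."""
    return value.strip().casefold()


def has_flavor(flavors, wanted):
    """Case-insensitive membership test that skips None and empty entries."""
    target = normalize(wanted)
    return any(normalize(f) == target for f in flavors if f)


def join_list(items):
    """Join items as "a", "a and b", or "a, b, and c"; empty input gives ""."""
    items = [item.strip() for item in items if item and item.strip()]
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def build_results_paragraph(user_name, best_flavor, other_flavors=None, bad_flavors=None):
    """Assemble the paragraph sentence by sentence, skipping empty sections."""
    sentences = [
        f"Hi {user_name}.",
        f"We found that the best ice cream flavor for you is {best_flavor}.",
    ]
    other = join_list(other_flavors or [])
    if other:  # edge case: no other suitable flavors, so no sentence
        sentences.append(f"These other flavors might be good for you: {other}.")
    bad = join_list(bad_flavors or [])
    if bad:
        sentences.append(f"However, we would not recommend these flavors: {bad}.")
    return " ".join(sentences)


if __name__ == "__main__":
    print(build_results_paragraph("Sam", "mint", ["vanilla", "lemon"], ["coffee"]))
    print(build_results_paragraph("Sam", "mint"))
```

Keeping each sentence behind its own condition means an empty list removes the whole sentence instead of leaving a dangling comma or a blank placeholder.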
I have charts and other types of ways to display results and have explained to the client the challenges of sharing the information in paragraph form.
What I am looking for
I need examples, how-tos, and best practices on how to build a maintainable system for generating customized paragraphs based on user input. I know how to solve each of the individual issues (as they are fairly simple), but in a large project this will become very hard to maintain.
Notes
I have no clue what tags to use for the post. Feel free to edit/add tags if you know more appropriate ones.
The project is planning to use machine learning in the future for other parts of the project. If there is an ML/AI solution that is useful, please tell me.
I am working primarily in JavaScript, Python, C, and R, but if there is a library or tool in any other language, please tell me. Finding a solution is very important to me, and I would be willing to learn a lot to find the best solution.
To avoid this question being removed, I have rephrased it to avoid asking for personal opinion, instead asking for existing examples or how-tos. I can also imagine that others might find a solution fairly useful. If you can edit it to make the question less subjective, please do so.
If you have any questions or need clarification feel free to ask. Any help is appreciated.
I am working on text normalization. I have descriptions of variables/attributes, which I need to convert to correct English.
An example is shown below:
"This is the sta of the customer's order"
The word 'sta' above needs to be converted to 'status' based on the error and the context.
I tried out a character-level encoder-decoder architecture, but did not get good results. I need some direction on how to approach this problem.
input :"This is the sta of the customer's order"
output: "This is the status of the customer's order"
This is called spell checking. One common approach uses minimum edit distance. An edit is one of these actions: adding a character, removing a character, replacing a character with another, or transposing two adjacent characters. You can apply edits to a misspelled word to generate new words, then use a dictionary to check whether each one really exists in English. There may be more than one candidate for each incorrect word, and there are also ways to rank the candidates.
Reading this paper may be a good start:
A Survey of Spelling Error Detection and Correction Techniques
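As a rough illustration of the edit-based approach described in this answer, here is a minimal Python sketch. The tiny `WORD_COUNTS` dictionary and its frequencies are made up for the example, and ranking candidates by word frequency is just one simple ranking choice among many. A real system would load a proper word list.

```python
import string

# Hypothetical toy dictionary: word -> made-up frequency used for ranking.
WORD_COUNTS = {"status": 50, "state": 30, "stats": 5, "order": 40}


def edits1(word):
    """All strings one edit away: delete, transpose, replace, or insert."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def candidates(word, vocabulary, max_edits=2):
    """Dictionary words at the smallest edit distance (up to max_edits)."""
    if word in vocabulary:
        return {word}
    frontier = {word}
    for _ in range(max_edits):
        frontier = {edited for w in frontier for edited in edits1(w)}
        found = frontier & vocabulary
        if found:
            return found
    return set()


def correct(word, word_counts=WORD_COUNTS):
    """Pick the most frequent candidate, or return the word unchanged."""
    found = candidates(word, set(word_counts))
    return max(found, key=word_counts.get) if found else word


if __name__ == "__main__":
    print(correct("staus"))
    print(candidates("sta", set(WORD_COUNTS)))
```

Note that "sta" to "status" takes three insertions, so a two-edit limit does not reach it and the sketch offers closer words such as "state" instead. Truncations like this behave more like abbreviations than typos, so the surrounding context (or a prefix match against the dictionary) would have to do the ranking.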
I am working with a document, where each row contains a description for a specific incident (fire incidents, where firefighters turn up and thereafter write a report).
The incidents/reports are written by several different people, so the language varies a lot, which makes it difficult to code for one specific context using one word: is.number(search(substring;text))
Even if the word appears in the text, the context is often not related to what I am trying to analyse.
I want to broaden my word search to be more flexible, by being able to "put" or "store" several different words/phrases into my "substring" - being able to get closer to the specific context that I wish to analyse.
This way I could cover more data that is in fact related, but described differently in the individual incident reports.
I have tried to search for a solution myself, but am unsure on how to phrase this specific inquiry.
So far I have only been able to use the code piece above, which is a bit insufficient, when trying to comb through 2000 rows.
I hope that someone is able to help me!
Thank you
An example:
Store the following words: stopped fire, killed fire, fire was put out into: Killed fire
So that when I use Killed fire all the above wordings are included in my search.
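A minimal sketch of the idea in this example, written in Python rather than as a spreadsheet formula. The names `PHRASE_GROUPS` and `matches_group` are hypothetical, and the matching is a plain case-insensitive substring test.

```python
# Each label stores every wording that should count as that label.
PHRASE_GROUPS = {
    "Killed fire": ["stopped fire", "killed fire", "fire was put out"],
}


def matches_group(text, label, groups=PHRASE_GROUPS):
    """True if the text contains any phrase stored under the label."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in groups[label])


def label_rows(rows, label, groups=PHRASE_GROUPS):
    """Return the rows (e.g. incident descriptions) that match the label."""
    return [row for row in rows if matches_group(row, label, groups)]


if __name__ == "__main__":
    reports = [
        "Crew arrived 14:02, fire was put out within ten minutes.",
        "False alarm, no fire found.",
        "Neighbour had already stopped fire with an extinguisher.",
    ]
    for row in label_rows(reports, "Killed fire"):
        print(row)
```

Adding a new wording then means adding one string to the list, and every search on that label picks it up.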
I have university placement data pulled from databases into an Excel sheet. I need to text mine the job descriptions offered by companies, which are a descriptive field in every row, and then come up with an analysis of the profiles in demand.
Here is a snapshot of the data
Could anyone help me to kick start this activity?
Thanks
Saurabh
I am not a data expert but I have some data mining experience. I would try following these steps for starters:
Excel is not a good tool for such an analysis. Find a tool dedicated to data mining, e.g. RStudio. R has many useful out-of-the-box algorithms for data mining.
Cleanse the data e.g. all texts to lower case, remove stop words, remove punctuation, remove additional white spaces.
Tokenize the data e.g. 1 word tokens - "finance", "bachelor"
Decide how you will assert whether a certain profile is in demand. If by profile you mean that certain tokens (e.g. "finance", "bachelor") appear in the data more often than others, then simply create a frequency matrix. R allows you to visualise this with word clouds.
This is to start you off :). I am sure there is much more to be suggested in this matter.
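The cleansing, tokenizing and counting steps above can be sketched in a few lines. This sketch uses Python instead of the R suggested in the answer, the stop-word set is a tiny illustrative sample, and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokenize(text):
    """Lower-case, strip punctuation and extra whitespace, drop stop words."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [token for token in cleaned.split() if token not in STOP_WORDS]


def token_frequencies(descriptions):
    """Count one-word tokens across all job descriptions."""
    counts = Counter()
    for description in descriptions:
        counts.update(tokenize(description))
    return counts


if __name__ == "__main__":
    jobs = [
        "Bachelor in Finance required.",
        "Finance analyst, bachelor degree",
        "Marketing intern",
    ]
    for token, count in token_frequencies(jobs).most_common(3):
        print(token, count)
```

The resulting counts are the raw material for a frequency matrix or a word cloud.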
I need to break a line of text into different columns in Excel. Here is the input that I get.
Input:
37006 II Semester P.G. Diploma in Clinical Research and Clinical Data Management Examination, July/August 2012 Pharma Regulatory Affairs Time : 3 Hours Max. Marks : 100
Output: CSV record with structure (Code, Sem/Year, Subject, Course, Exam Date, Time, Marks)
37006 , II Semester, P.G. Diploma in Clinical Research and Clinical Data Management, Pharma Regulatory Affairs, July/August 2012 , 3 Hours , 100
I have data in different sets which constructs above lines. For example:
Grammar (this is an array / dictionary):
Semesters[I,II,III,IV,V,VI,VII,VIII,IX,X,1,2,3,4,5,6,7,8,9,10]
Years[I,II,III,IV,V,VI,VII,VIII,IX,X,1,2,3,4,5,6,7,8,9,10]
Subjects[P.G. Diploma in Clinical Research and Clinical Data Management, LL.B]
Courses[Pharma Regulatory Affairs,Law - Jurisprudence]
ExamDates[ July/August 2012 , Jan./Feb. 2013 ]
Time[3 Hours]
MaxMarks[30,40,50,60,70,80,90,100]
FYI,
I'm not sure that I can use any delimiters to break it up, as they are highly unpredictable and not dependable.
I'm not sure the text will be in the same order in each line, and there is no fixed length in characters or words.
My plan is to read word by word and try to match each word against the arrays that I have. If it matches, I categorize the word accordingly and add it to the relevant column in Excel.
I know how to handle the data and everything else, except for the optimized / best way to determine which category each word falls under.
Is there any lexical analysis expert that can share some thoughts on this?
You should use regular expressions for matching such a complicated text pattern.
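As a rough sketch of what that could look like, the Python below builds one pattern per column from the lists given in the question. It assumes the code is the leading number and that "Time :" and "Marks :" always label their values. `PATTERNS`, `parse_line` and `to_csv_row` are hypothetical names. Because each pattern is searched for independently, the order of the parts within a line does not matter.

```python
import csv
import io
import re

# Known values copied from the question's lists.
NUMERALS = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X",
            "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]
SUBJECTS = ["P.G. Diploma in Clinical Research and Clinical Data Management", "LL.B"]
COURSES = ["Pharma Regulatory Affairs", "Law - Jurisprudence"]
EXAM_DATES = ["July/August 2012", "Jan./Feb. 2013"]


def alternation(values):
    """Escape each value and join with |, longest first so "II" beats "I"."""
    ordered = sorted(values, key=len, reverse=True)
    return "|".join(re.escape(value) for value in ordered)


FIELDS = ["code", "sem_year", "subject", "course", "exam_date", "time", "marks"]
PATTERNS = {
    "code": re.compile(r"^\s*(\d+)"),
    "sem_year": re.compile(rf"\b((?:{alternation(NUMERALS)})\s+(?:Semester|Year))\b"),
    "subject": re.compile(rf"({alternation(SUBJECTS)})"),
    "course": re.compile(rf"({alternation(COURSES)})"),
    "exam_date": re.compile(rf"({alternation(EXAM_DATES)})"),
    "time": re.compile(r"Time\s*:\s*(\d+\s*Hours?)"),
    "marks": re.compile(r"Marks\s*:\s*(\d+)"),
}


def parse_line(line):
    """Search the line once per field; missing fields become empty strings."""
    record = {}
    for field in FIELDS:
        match = PATTERNS[field].search(line)
        record[field] = match.group(1) if match else ""
    return record


def to_csv_row(record):
    """Format one record as a CSV line (quoting handled by the csv module)."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow([record[field] for field in FIELDS])
    return buffer.getvalue().strip()


if __name__ == "__main__":
    sample = ("37006 II Semester P.G. Diploma in Clinical Research and Clinical "
              "Data Management Examination, July/August 2012 Pharma Regulatory "
              "Affairs Time : 3 Hours Max. Marks : 100")
    print(to_csv_row(parse_line(sample)))
```

A line with an unknown subject or course leaves that column empty, which makes it easy to spot values that still need to be added to the lists.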
Please take a look at a lexical analyzer like ANTLR. If you know Java or other languages that read regular expressions, you will be able to parse these with ease after an afternoon (or week) of torture. You can also write the regexp in Java, but I would nudge you toward the ANTLR interface, which you may use from Eclipse. It will show you how the lines are being parsed.
Have ANTLR or your Java code write out a CSV file. The CSV will be your vehicle for getting the data into the Excel spreadsheet.