Best practices for creating a customized report based on user form input?

My Question
What are the best practices for creating a customized report based on user form input? Specifically, how do I build an easy-to-maintain system that takes user input collected in a form and generates multiple paragraphs explaining the results of an analysis?
Background
I am working on a very large multiyear project with a startup (which is my client). My job is to program the analysis and generate reports for users. The data pipeline looks like this:
Users enter information into a form -> results are calculated based on user input -> reports sharing the analysis are displayed to users.
It is really important to my client that some of the analysis results are displayed as paragraphs written in an informal, user-friendly tone. The challenge is that the form and the analysis are quite complex and will only get more complex over time. A template for one of these paragraphs looks something like this:
resultsParagraphText=`Hi ${userName}. We found that the best ice cream flavour for you is ${bestIceCreamFlavor}. These other flavors ${otherFlavors} might be good for you. Here are the reasons why you might enjoy these flavors: ${reasonsWhyGoodFlavors}.
However, we would not recommend these other flavors ${badFlavors}. Here are the reasons you should avoid these bad flavors: ${reasonsWhyBadFlavors}.`
These results paragraphs, of which there are many, have several minor problems which, combined, are significant:
If there is a bug in the code, minor visual errors would be visible to end users (capitalization errors, missing/extra commas, and so on).
A lot of string comparisons (e.g. if (answers.previousFlavors.includes("Vanilla"))) are required to generate the results paragraphs. Minor inconsistencies in the form (e.g. vanilla is not capitalized in the form, so answers.previousFlavors.includes("Vanilla") returns false even when the user enters vanilla) can cause errors in the results paragraph.
Changes in other parts of the project (the form, the analysis) directly affect how the results paragraph is built. Bad types, differences in string values, and null or undefined values that are not caught all have a direct impact on the generated paragraph.
There are many edge cases (e.g. what if there are no other suitable flavors for the user? Then the sentence These other flavors ${otherFlavors} might be good for you. needs to be excluded).
It is hard to write paragraphs that use templates and still have an informal tone.
and so on. The sketch below shows the kind of normalization and guard code these issues lead to.
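To make these problems concrete, here is a minimal sketch of the kind of fix each one needs (the names FLAVORS, canonicalFlavor and resultsParagraph are made up for illustration):

// Normalize free-text form values to canonical IDs once, at the boundary,
// so nothing downstream does raw string comparisons like includes("Vanilla").
const FLAVORS = { vanilla: "Vanilla", chocolate: "Chocolate", strawberry: "Strawberry" };
function canonicalFlavor(raw) {
  return FLAVORS[String(raw).trim().toLowerCase()] ?? null;
}

// Assemble the paragraph from small sentence functions, so an edge case
// (e.g. no other suitable flavors) just drops a sentence instead of
// producing "These other flavors  might be good for you."
function resultsParagraph({ userName, bestFlavor, otherFlavors }) {
  const sentences = [
    `Hi ${userName}.`,
    `We found that the best ice cream flavour for you is ${bestFlavor}.`,
    otherFlavors.length > 0
      ? `These other flavors ${otherFlavors.join(", ")} might be good for you.`
      : null,
  ];
  return sentences.filter(Boolean).join(" ");
}

Each individual fix like this is simple; the maintenance problem is that dozens of paragraphs each need dozens of them.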
I have charts and other ways to display results, and I have explained to the client the challenges of sharing the information in paragraph form.
What I am looking for
I need examples, how-tos, and best practices for building a maintainable system that generates customized paragraphs based on user input. I know how to solve each of the individual issues (as they are fairly simple), but in a large project this will become very hard to maintain.
Notes
I have no clue what tags to use for the post. Feel free to edit/add tags if you know more appropriate ones.
The project is planning to use machine learning in other parts of the project in the future. If there is an ML/AI solution that would be useful, please tell me.
I am working primarily in JavaScript, Python, C, and R, but if there is a library or tool in any other language, please tell me. Finding a solution is very important to me, and I would be willing to learn a lot to find the best solution.
To avoid this question being removed, I have rephrased it to avoid asking for personal opinions, instead asking for existing examples or how-tos. I can also imagine that others might find a solution fairly useful. If you can edit it to make the question less subjective, please do so.
If you have any questions or need clarification feel free to ask. Any help is appreciated.

Related

Reusing cucumber steps in a large codebase/team

We're using cucumberJS on a fairly large codebase with hundreds of cucumber scenarios, and we've been running into issues with step reuse.
Since all the steps in Cucumber are global, it's quite difficult to write high-level steps like "and I select the first item in the list". We end up having to append "on homepage" (so: "I select the first item in the list of folders on homepage"), which just feels wrong and reads wrong.
Also, I find it very hard to figure out what the dependencies between steps are. For example, we use an "and I see " pattern for storing a page object reference on the world cucumber instance to be used in some later steps. I find that very awkward, since those dependencies are all but invisible when reading the .feature files.
What are your tips on how to use Cucumber within a large team? (Including "ditch cucumber and use instead" :) )
Write scenarios/steps that are about what you are doing and why you are doing it rather than about how you do things. Cucumber is a tool for doing BDD. The key word here is Behaviour, and its interpretation. The fundamental idea behind Cucumber and steps is that each piece of behaviour (the what) has a unique name and place in the application, and in the application context you can talk about that behaviour using that name without ambiguity.
So your examples should never be in steps, because they are about HOW you do something. Good steps never talk about clicking or selecting. Instead they talk about the reason WHY you are clicking or selecting.
When you follow this pattern you end up with fewer steps at a higher level of abstraction that are each focused on a particular topic.
This pattern is easy to implement, and moderately easy to maintain. The difficulty is that to write the scenarios you have to have a profound understanding of what you are doing and why it's important, so you can discover/uncover the language you need to express yourself distinctly, clearly and simply.
I'll give my standard example about login. I use this because we share an understanding of What login is and Why it's important. Realise that before you can login you have to be registered, and that is complex.
Scenario: Login
Given I am registered
When I login
Then I should be logged in
The implementation of this is interesting in that I delegate all the work to helper methods:
Given 'I am registered' do
  @i = create_registered_user
end

When 'I login' do
  login_as(user: @i)
end

Then 'I should be logged in' do
  should_be_logged_in
end
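Since the question is about cucumberJS, a rough JavaScript equivalent of the same delegation pattern might look like this (createRegisteredUser, loginAs and shouldBeLoggedIn are assumed helper functions you would write yourself):

const { Given, When, Then } = require('@cucumber/cucumber');

// Each step delegates immediately to a helper; shared state lives on the
// world object (`this`), so plain functions are used instead of arrows.
Given('I am registered', async function () {
  this.user = await createRegisteredUser();
});

When('I login', async function () {
  await loginAs({ user: this.user });
});

Then('I should be logged in', async function () {
  await shouldBeLoggedIn();
});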
Now your problem becomes one of managing helper methods. What you have is a global namespace with a large number of helper methods. This is now a code and naming problem, and all you have to do is:
- keep the number of helper methods as small as possible
- keep each helper method simple
- ensure there is no ambiguity between method names
- ensure there is no duplication
This is still a hard problem, but
- it's not as hard as what you are dealing with
- getting to this point has a large number of additional benefits
- it's now a code problem, and lots of people have experience of managing code.
You can do all these things with
- naming discipline (all my methods above have login in their name)
- clever but controlled use of arguments
- frequent refactoring and code cleaning
The code of your helper methods will have
- the highest churn of all your application code
- the greatest need to be simple and clear
So currently your problem is not about Cucumber, it's about the debt you have with your existing scenarios and their implementation. You have to pay off your debt if you want things to improve. Good luck!

Extracting user interests from social profiles

This is my first time dabbling in NLP so please excuse my ignorance. I'm looking for a method to extract interests/likes/hobbies from users' social profiles. Here is an example where all the interests/likes/hobbies are in bold:
"I consider myself a pretty diverse character... I'm a professional
wrestler, but I'd take a bullet for Wall•E. I train like a one-man genocide machine in the gym, but I cried at
"Armageddon." I'll head bang to AC/DC, and I'm seriously
considering getting a Legend of Zelda tattoo. I'm 420-friendly. I
like to party it up with the frat crowd one night, hang out with
my Burning Man friends the next, play Halo and World of
Warcraft the next, and jam with friends that aren't any younger than
40 the next. My youngest friend is 16, my oldest friend is 66. I'll
sing karaoke at the bars, and I'm my friends' collective
psychiatrist/shoulder."
The profiles are plain text. There are no meta tags or ids associated with any of it, it's just a paragraph of text.
My naive idea was to take each noun and match it against Freebase to see if it's an activity/artist/movie/book etc. The problem is that although most entities mentioned will be things the user likes, she will also mention things she doesn't like, and I have no means of distinguishing the two.
I have 2 questions:
What sub field of NLP should I be looking at? Some googleable algorithms/techniques/authors would be greatly appreciated.
How hard is this problem?
Thanks!
First, unless using NLP to do this is a particular objective for you, check your problem domain to see if you can avoid it completely.
For instance:
do these profiles have tags (supplied either by the site or by the user)?
what does the site's API make available (assuming that's how you are accessing this data; if you are scraping it, then this of course doesn't apply)? A good example is Facebook: if you read a user's posts, you'll see words like "wrestler", "karaoke", etc., but if you look at what fields are exposed via the Graph API, you'll see that these activities nearly always have an associated FB ID.
I am not a specialist in this field, but I can recommend a couple of resources directed at NLP which are accessible to the non-specialist or novice. The first is a text processing API. This simple web service uses REST and JSON I/O. It is free and seems to have a fairly generous rate limit.
This API appears to rely heavily on the excellent Natural Language Toolkit (NLTK), a mature, stable library in Python that includes modules directed at the problem in your question, e.g., sentiment analysis, tagging and chunk extraction, etc.
Which particular sub-domain is most relevant to solving the question in the OP? I don't know, but I suspect there's a module somewhere in the NLTK that does what you need. Finding that module is hopefully just a matter of skimming the API documentation (which is organized by module) and reading the Getting Started section, which contains an excellent survey of NLTK's modules as well as demos for each of them.

Open source projects for email scrubbing: generating structured data from an unstructured source?

I don't know where to start on this one, so hopefully you can clear up my question. I have a project where email will be searched for specific words/patterns and the results stored in a structured manner - something like what TripIt does.
The article states that they developed a DataMapper
The DataMapper is responsible for taking inbound email messages addressed to plans [at] tripit.com and transforming them from the semi-structured format you see in your mail reader into a highly structured XML document.
There is a comment that also states
If you're looking to build this yourself, reading a little bit about Wrappers and Wrapper Induction might be helpful
I Googled and read about wrapper induction, but the definition was just too broad and didn't help me understand how one would go about solving such a problem.
Is there some open source project out there that does similar things?
There are a couple of different ways and things you can do to accomplish this.
The first part, which involves getting access to the email content, I'll not answer here. Basically, I'll assume that you have access to the text of the emails; if you don't, there are libraries that allow you to connect Java to a mailbox, like Camel (http://camel.apache.org/mail.html).
So now you've got the email so then what?
A handy thing that could help is that LingPipe (http://alias-i.com/lingpipe/) has an entity recognizer that you can populate with your own terms. Specifically, look at some of their extraction tutorials and their dictionary extractor (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html). Inside the LingPipe dictionary extractor (http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html) you'd simply import the terms you're interested in and use that to associate labels with an email.
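The core idea is language-agnostic. A toy sketch of dictionary-based chunking in JavaScript (this is just the concept, not LingPipe's actual API; the terms and labels are invented):

// Scan text for known terms and emit labeled chunks with their offsets.
const DICTIONARY = new Map([
  ["flight", "TRAVEL"],
  ["confirmation number", "BOOKING_REF"],
  ["departing from", "ORIGIN_MARKER"],
]);

function chunk(text) {
  const chunks = [];
  const lower = text.toLowerCase();
  for (const [term, label] of DICTIONARY) {
    let start = lower.indexOf(term);
    while (start !== -1) {
      chunks.push({ term, label, start, end: start + term.length });
      start = lower.indexOf(term, start + 1);
    }
  }
  return chunks;
}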
You might also find the following question helpful: Dictionary-Based Named Entity Recognition with zero edit distance: LingPipe, Lucene or what?
Really a very broad question, but I can try to give you some general ideas, which might be enough to get started. Basically, it sounds like you're talking about an elaborate parsing problem - scanning through the text and looking to apply meaning to specific chunks. Depending on what exactly you're looking for, you might get some good mileage out of a few regular expressions to start - things like phone numbers, email addresses, and dates have fairly standard structures that should be matchable. Other data points might benefit from some indicator words - the phrase "departing from" might indicate that what follows is an address. The natural language processing community also has a large tool set available for text processing - check out things like parts of speech taggers and semantic analyzers if they're appropriate to what you're trying to do.
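For instance, a first pass in JavaScript (the patterns here are deliberately simplified illustrations, not production-grade):

// Simplified first-pass extractors; real phone and date formats vary far more.
const PATTERNS = {
  email: /[\w.+-]+@[\w-]+\.[\w.-]+/g,
  phone: /\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/g,
  date: /\b\d{1,2}\/\d{1,2}\/\d{2,4}\b/g,
  // Indicator phrase: capture what follows "departing from" on the same line.
  origin: /departing from\s+([^\n,.]+)/gi,
};

function extract(text) {
  const result = {};
  for (const [field, re] of Object.entries(PATTERNS)) {
    result[field] = [...text.matchAll(re)].map((m) => m[1] ?? m[0]);
  }
  return result;
}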
Armed with those techniques, you can follow a basic iterative development process: For each data point in your expected output structure, define some simple rules for how to capture it. Then, run the application over a batch of test data and see which samples didn't capture that datum. Look at the samples and revise your rules to catch those samples. Repeat until the extractor reaches an acceptable level of accuracy.
Depending on the specifics of your problem, there may be machine learning techniques that can automate much of that process for you.

Converting data into information: where to start?

We (my company) run a website which has lots of data recorded, like user registrations, visits, clicks, what users post, etc., but so far we don't have a tool to monitor the whole thing or to find patterns in it, so that we can understand what kind of information we can get from it and management can make decisions based on it. In short, we want to do what the people at Amazon or Google do with the data they collect.
Now, after the intro, I would like to know what this technology is called - is it data mining, machine learning, or something else? Where should we start to convert meaningless data into useful information?
I think what you need falls into the "realm" of parsing data, creating graphs, showing statistics about some elements, etc.
There is no "easy" answer, I can only answer parts of your question.
There are no premade magical analytical tools; big companies have their own backend tools tuned to parse large amounts of data and spit out summaries that are then used to build graphs or for statistical analysis.
I think the domain you are searching for is statistical data analysis. But there are many parts that go together here.
The best advice I can give you is to set up specific goals for your analysis and then try to see what the best solution is; your question is too open.
e.g., if you are interested in visits/clicks/website-related statistics, Google Analytics is a great tool, and very easy to use.

When the bots attack! [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
What are some popular spam prevention methods besides CAPTCHA?
I have tried doing 'honeypots' where you put a field and then hide it with CSS (marking it as 'leave blank' for anyone with stylesheets disabled) but I have found that a lot of bots are able to get past it very quickly. There are also techniques like setting fields to a certain value and changing them with JS, calculating times between load time and submit time, checking the referer URL, and a million other things. They all have their pitfalls and pretty much all you can hope for is to filter as much as you can with them while not alienating who you're here for: the users.
At the end of the day, though, if you really, really don't want bots to be sending things through your form, you're going to want to put a CAPTCHA on it - the best one I've seen that takes care of mostly everything is reCAPTCHA - but thanks to India's CAPTCHA-solving market and the ingenuity of spammers everywhere, that's not even successful all of the time. I would beware of using something that is 'ingenious' but kind of 'out there', as it would be more of a 'wtf' for users that are at least somewhat used to your usual CAPTCHAs.
Shocking, but almost every response here included some form of CAPTCHA. The OP wanted something different, I guess maybe he wanted something that actually works, and maybe even solves the real problem.
CAPTCHA doesn't work, and even if it did - it's the wrong problem - humans can still flood your system, and by definition CAPTCHA won't stop that (because it's designed only to tell whether or not you're a human - not that it does that well...)
So, what other solutions are there? Well, it depends... on your system and your needs.
For instance, if all you're trying to do is limit how many times a user can fill out a "Contact Me" form, you can simply throttle how many requests each user can submit per hour/day/whatever. If your users are anonymous, maybe you need to throttle according to IP addresses, and occasionally blacklist an IP (though this too can be circumvented, and causes other problems).
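A minimal in-memory sketch of that kind of per-IP throttle (Express-style; the window and limit values are arbitrary illustrations, and a real deployment would use a shared store such as Redis):

const WINDOW_MS = 60 * 60 * 1000; // one hour
const LIMIT = 5;                  // submissions allowed per window
const hits = new Map();           // ip -> { count, windowStart }

function throttle(req, res, next) {
  const now = Date.now();
  const entry = hits.get(req.ip);
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(req.ip, { count: 1, windowStart: now });
    return next();
  }
  if (++entry.count > LIMIT) {
    return res.status(429).send("Too many requests, try again later.");
  }
  next();
}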
If you're referring to a forum or blog comments (such as this one), well, the more I use this site the more I like its solution. A mix of authenticated users, authorization (based on reputation, which is not likely to be accumulated through flooding), throttling (how many you can do a day), the occasional CAPTCHA, and finally community moderation to clean up the few that get through - all combine to provide a decent solution. (I wonder if Jeff can provide some info on how much spam and other malposts actually get through...?)
Another control to consider (I don't know if they have it here) is some form of IDS/IPS - if you can detect and recognize spam, you can block THAT pattern. Moderation fills that need manually here...
Note that none of these alone prevents spam; each incrementally lowers the probability, and thus the profitability. This changes the economic equation and leaves CAPTCHA to actually provide enough value to be worth it - since it's no longer worth it for the spammers to bother breaking it or going around it (thanks to the other controls).
Give the user a simple calculation to perform:
What is the sum of 3 and 8?
By the way: Just surfed by an interesting approach of Microsoft Research: Asirra.
http://research.microsoft.com/asirra/
It shows you several pictures and you have to identify the pictures with a given motif.
Try Akismet
Captchas or any form of human-only questions are horrible from a usability perspective. Sometimes they're necessary, but I prefer to kill spam using filters like Akismet.
Akismet was originally built to thwart spam comments on WordPress blogs, but the API is capable of being adapted for other uses.
Update: We've started using the ruby library Rakismet on our Rails app, Yarp.com. So far, it's been working great to thwart the spam bots.
A very simple method which puts no load on the user is just to disable the submit button for a second after the page has been loaded. I used it on a public forum which had continuous spam posts, and it has stopped them since.
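A sketch of that idea in plain browser JavaScript (the one-second delay is the value mentioned above; bots that POST the form directly are better caught server-side by checking elapsed time):

// Disable the submit button briefly after page load; many dumb bots
// submit immediately, before the button is re-enabled.
document.addEventListener("DOMContentLoaded", () => {
  const button = document.querySelector("form button[type=submit]");
  button.disabled = true;
  setTimeout(() => { button.disabled = false; }, 1000);
});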
Ned Batchelder wrote up a technique that combines hashes with honeypots for some wickedly effective bot-prevention. No captchas, just code.
It's up at Stopping spambots with hashes and honeypots:
Rather than stopping bots by having people identify themselves, we can stop the bots by making it difficult for them to make a successful post, or by having them inadvertently identify themselves as bots. This removes the burden from people, and leaves the comment form free of visible anti-spam measures.
This technique is how I prevent spambots on this site. It works. The method described here doesn't look at the content at all. It can be augmented with content-based prevention such as Akismet, but I find it works very well all by itself.
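A much-simplified sketch of one ingredient of that technique - a server-issued timestamp plus keyed hash (Ned calls it a "spinner") that must round-trip through hidden form fields. The secret and the three-second threshold here are invented for illustration; the full write-up adds more (hashed field names, per-request randomness):

const crypto = require("crypto");
const SECRET = "replace-with-a-real-secret"; // illustrative only

// When rendering the form, embed both values as hidden fields.
function formToken() {
  const timestamp = Date.now().toString();
  const spinner = crypto.createHmac("sha256", SECRET).update(timestamp).digest("hex");
  return { timestamp, spinner };
}

// On submit: the hash must match, and a human needs a few seconds to fill a form.
function passesSpinnerCheck({ timestamp, spinner }) {
  const expected = crypto.createHmac("sha256", SECRET).update(timestamp).digest("hex");
  const elapsed = Date.now() - Number(timestamp);
  return spinner === expected && elapsed > 3000;
}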
http://chongqed.org/ maintains blacklists of active spam sources and the URLs being advertised in the spam. I have found filtering posts for the latter to be very effective in forums.
The most common ones I've observed are oriented around having the user solve simple puzzles, e.g. "which of the following is a picture of a cat?" (displaying thumbnails of dogs surrounding a cat), or simple math problems.
While interesting, I'm sure the arms race will eventually overwhelm those systems too.
You can use reCAPTCHA to at least make a captcha useful. Then you can ask questions with simple verbal math problems or similar. Microsoft's Asirra makes you find pictures of cats and dogs. Requiring a valid email address to activate an account stops spammers when they wouldn't get enough benefit from the service, but it might deter normal users as well.
The following is unfeasible with today's technology, but I don't think it's too far off. It's also probably overkill for dealing with forum spam, but could be useful for account sign-ups, or any situation where you wanted to be really sure you were dealing with humans and they would be prepared for it to take a few minutes to complete the process.
Have 2 users who are trying to prove themselves human connect to each other via their webcams and ask them if the person they are seeing is human and live (i.e. not a recording), by getting them to, for example, mirror each other's movements, or write something on a piece of paper. Get everyone to do this a few times with different users, and throw a few recordings into the mix which they also have to identify correctly as such.
A popular method on forums is to simply queue the threads of members with less than 10 posts in a moderation queue. Of course, this doesn't help if you don't have moderators, or it's not a forum. A more general method is the calculation of hyperlink to text ratios. Often, spam posts contain a ton of hyperlinks, and you can catch a lot this way. In the same vein is comparing the content of consecutive posts. Simply do not allow consecutive posts that are extremely similar.
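A rough sketch of the hyperlink-to-text ratio check (the 0.25 threshold is an arbitrary illustration to tune against real data):

// Flag posts whose text is dominated by link markup.
function linkRatio(post) {
  const linkChars = (post.match(/https?:\/\/\S+/g) || []).join("").length;
  return post.length === 0 ? 0 : linkChars / post.length;
}

function looksSpammy(post) {
  return linkRatio(post) > 0.25;
}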
Of course, anyone with knowledge of the measures you take is going to be able to get around them. To be honest, there is little you can do if you are the target of a specific attack. Rather, you should focus on preventing more general, unskilled attacks.
For human moderators it surely helps to be able to easily find and delete all posts from some IP, or all posts from some user if the bot is smart enough to use a registered account. Likewise the option to easily block IP addresses or accounts for some time, without further administration, will lessen the administrative burden for human moderators.
Using cookies to make bots and human spammers believe that their post is actually visible (while only they themselves see it) prevents them (or trolls) from changing techniques. Let the spammers and trolls see the other spam and troll messages.
Javascript evaluation techniques like this Invisible Captcha system require the browser to evaluate Javascript before the page submission will be accepted. It falls back nicely when the user doesn't have Javascript enabled by just displaying a conventional CAPTCHA test.
Animated captchas - scrolling text that is still easy for humans to recognize, provided you make sure that no single frame offers something complete to recognize.
multiple choice question - All it takes is a ______ and a smile. The idea here is that the user has to choose/understand.
session variable - check that a variable you put into the session is part of the request. This will foil dumb bots that simply generate requests, but probably not bots that are modeled on a browser.
math question - 2 + 5 = - again, a question that is easy for a human to solve but blocks bots that just generate canned responses (see the sketch after this list).
image grid - you create a grid of images - say a 3x3 grid of animal pictures - and the user has to pick out all the birds on the grid.
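A sketch of the session-backed math question (Express-style; it assumes app exists and session middleware such as express-session is configured):

// GET: generate a question and remember the expected answer server-side.
app.get("/form", (req, res) => {
  const a = 1 + Math.floor(Math.random() * 9);
  const b = 1 + Math.floor(Math.random() * 9);
  req.session.expectedAnswer = a + b;
  res.send(`What is the sum of ${a} and ${b}? <form method="post">...</form>`);
});

// POST: reject the submission unless the stored answer matches.
app.post("/form", (req, res) => {
  if (Number(req.body.answer) !== req.session.expectedAnswer) {
    return res.status(400).send("Please answer the question correctly.");
  }
  // ...process the genuine submission...
});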
Hope this gives you some ideas for your new solution.
A friend has the simplest anti-spam method, and it works.
He has a custom text box which says "please type in the number 4".
His blog is rather popular, but still not popular enough for bots to figure it out (yet).
Please remember to make your solution accessible to those not using conventional browsers. The iPhone crowd are not to be ignored, and those with vision and cognitive problems should not be excluded either.
Honeypots are one effective method. Phil Haack gives one good honeypot method, that could be used in principle for any forum/blog/etc.
You could also write a crawler that follows spam links and analyzes their page to see if it's a genuine link or not. The most obvious would be pages with an exact copy of your content, but you could pick out other indicators.
Moderation and blacklisting, especially with plugins like these ones for WordPress (or whatever you're using, similar software is available for most platforms), will work in a low-volume environment. If your environment is a low volume one, don't underestimate the advantage this gives you. Personally deciding what is reasonable content and what isn't gives you ultimate flexibility in spam control, if you have the time.
Don't forget, as others have pointed out, that CAPTCHAs are not limited to text recognition from an image. Visual association, math problems, and other non-subjective questions relayed through an image also qualify.
Sblam is an interesting project.
Invisible form fields. Make a form field that doesn't appear on screen to the user, using display: none as a CSS style so that it doesn't show up. For accessibility's sake, you could even put hidden text so that people using screen readers would know not to fill it in. Bots almost always fill in all fields, so you can block any post that filled in the invisible field.
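A sketch of the server-side half (the field name website is an arbitrary decoy, and the Express-style app is assumed):

// The form contains: <input name="website" style="display:none" aria-hidden="true">
// Humans never see or fill it; bots tend to fill every field.
function honeypotTripped(body) {
  return typeof body.website === "string" && body.website.trim() !== "";
}

app.post("/comment", (req, res) => {
  if (honeypotTripped(req.body)) {
    return res.status(400).end(); // silently drop the bot post
  }
  // ...save the comment...
});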
Block access based on a blacklist of spammers IP addresses.
Honeypot techniques put an invisible decoy form at the top of the page. Users don't see it and submit the correct form; bots submit the wrong form, which does nothing or bans their IP.
I've seen a few neat ideas along the lines of Asirra which ask you to identify which pictures are cats. I believe the idea originated from KittenAuth a while ago.
Use something like the Google Image Labeler with appropriately chosen images, such that a computer wouldn't be able to recognise the dominant features of an image that a human could.
The user would be shown an image and would have to type words associated with it. They would keep being shown images until they have typed enough words that agreed with what previous users had typed for the same image. Some images would be new ones that they weren't being tested against, but were included to record what words are associated with them. Depending on your audience you could also possibly choose images that only they would recognise.
Mollom is supposedly good at stopping spam. Both personal (free) and professional versions are available.
I know some people mentioned ASIRRA, but if you go to the "adopt me" links for the images, the linked page will say whether it's a cat or a dog. So it should be relatively easy for a bot to just follow all the adopt-me links, and it's just a matter of time for that project.
Just verify the email address and let Google/Yahoo etc. worry about it.
You could get some device ID software; the41 has fraud prevention software that can detect the hardware being used to access your site. I believe they use it to catch fraudsters, but it could be used to stop bots. Once you have identified a device being used by a bot, you can just block that device. Last time I checked, it can even trace your route through the phone network (not your Geo-IP!), so you can even block a post code if you want.
It's expensive though, so probably a better, cheaper solution would be something a little less Big Brother.

Resources