Converting data into information: Where to start?

We (my company) run a website that records lots of data, such as user registrations, visits, clicks, and what users post. So far we have no tool to monitor all of it or to find patterns in it, so we cannot tell what kind of information we could get from it for management to base decisions on. In short, we want something similar to what people at Amazon or Google do with the data they collect.
Now, after the intro, I would like to know what this technology is called: is it data mining, machine learning, or something else? Where should we start to convert meaningless data into useful information?

I think what you need falls within the "realm" of parsing data, creating graphs, showing statistics about some elements, etc.
There is no "easy" answer; I can only answer parts of your question.
There are no premade magical analytical tools. Big companies have their own backend tools tuned to parse large amounts of data and spit out summaries that are then used to build graphs or for statistical analysis.
I think the domain you are searching for is statistical data analysis. But there are many parts that go together here.
The best advice I can give you is to set specific goals for your analysis and then see what the best solution is; your question is too open.
For example, if you are interested in visits, clicks, and other website-related statistics, Google Analytics is a great tool and very easy to use.

Related

Best practices for creating a customized report based on user form input?

My Question
What are the best practices for creating a customized report based on user form input? Specifically, how do I create an easy-to-maintain system that takes user input collected in a form and generates multiple paragraphs that explain the results of an analysis?
Background
I am working on a very large multiyear project with a startup (my client). My job is to program the analysis and generate reports for users. The pipeline for data looks like this:
Users enter information into a form -> results are calculated based on user input -> reports that share the analysis are displayed to users.
It is really important to my client that some of the analysis results are displayed in paragraphs with an informal, user-friendly tone. The challenge is that the form and analysis are quite complex and will only get more complex over time. An example of the type of template used for the paragraphs looks something like this:
resultsParagraphText=`Hi ${userName}. We found that the best ice cream flavor for you is ${bestIceCreamFlavor}. These other flavors ${otherFlavors} might be good for you. Here are the reasons why you might enjoy these flavors: ${reasonsWhyGoodFlavors}.
However we would not recommend these other flavors ${badFlavors}. Here are the reasons you should avoid these bad flavors: ${reasonsWhyBadFlavors}.`
These results paragraphs, of which there are many, have several minor problems that are significant when combined:
If there is a bug in the code, minor visual errors are visible to end users (capitalization errors, missing or extra commas, and so on).
A lot of string comparisons (e.g. if answers.previousFlavors.includes("Vanilla")) are required to generate the results paragraphs. Minor errors in the forms (e.g. vanilla is not capitalized in the form, so answers.previousFlavors.includes("Vanilla") returns false even when the user enters vanilla) can cause errors in the results paragraph.
Changes in other parts of the project (form, analysis) directly affect how the results paragraph is made. Bad types, differences in string values, and uncaught null or undefined values all have a direct impact on the results paragraph.
There are many edge cases (e.g. what if the user has no other suitable flavors? Then the sentence These other flavors ${otherFlavors} might be good for you. needs to be excluded).
It is hard to write paragraphs that use templates and have a non-formal tone.
and so on.
I have charts and other types of ways to display results and have explained to the client the challenges of sharing the information in paragraph form.
What I am looking for
I need examples, how tos, best practices on how to build a maintainable system for generating customized paragraphs based on user input. I know how to solve each of the individual issues (as they are fairly simple) but in a large project this will become very hard to maintain.
Notes
I have no clue what tags to use for the post. Feel free to edit/add tags if you know more appropriate ones.
The project plans to use machine learning in other parts in the future. If there is an ML/AI solution that is useful, please tell me.
I am working primarily in JavaScript, Python, C, and R, but if there is a library or tool in any other language please tell me. Finding a solution is very important to me and I would be willing to learn a lot to find the best solution.
To keep this question from being removed, I have rephrased it to avoid asking for personal opinions and instead ask for existing examples or how-tos. I can also imagine that others might find a solution fairly useful. If you can edit it to make the question less subjective, please do so.
If you have any questions or need clarification feel free to ask. Any help is appreciated.
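One possible way to keep such paragraphs maintainable, offered as a minimal sketch and not an established best practice: normalize form values once at the boundary, build each sentence in its own small function that returns nothing when its data is missing, and join lists through a single helper so commas are handled in one place. All names here (normalize, join_list, other_flavors_sentence, build_results_paragraph) are hypothetical and not taken from any library.

```python
from __future__ import annotations

from typing import Optional


def normalize(value: str) -> str:
    """Trim and lowercase a form value so comparisons ignore case and stray spaces."""
    return value.strip().lower()


def join_list(items: list[str]) -> str:
    """Join items as 'a', 'a and b', or 'a, b and c'."""
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def other_flavors_sentence(other_flavors: list[str]) -> Optional[str]:
    """Return the sentence, or None when there is nothing to say (edge case)."""
    if not other_flavors:
        return None
    return f"These other flavors might be good for you: {join_list(other_flavors)}."


def build_results_paragraph(user_name: str, best_flavor: str, other_flavors: list[str]) -> str:
    """Assemble the paragraph from independent sentence builders."""
    sentences = [
        f"Hi {user_name.strip()}.",
        f"We found that the best ice cream flavor for you is {normalize(best_flavor)}.",
        other_flavors_sentence([normalize(f) for f in other_flavors]),
    ]
    # Drop sentences whose builder returned None, then join with single spaces.
    return " ".join(s for s in sentences if s)
```

Because each sentence builder is a plain function, it can be unit tested on its own, and a form change that breaks one sentence does not silently corrupt the rest of the paragraph.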

tensorflow for classification of strings vs elasticsearch

So, a little bit on my problem.
TL;DR
Can I use machine learning instead of Elasticsearch to find results based on the user's text input? Is it a good idea?
I am working on a car spare parts project, and we have split the car into 300 parts that we store in the database, with some data for each part (weight, availability, etc).
When the customer inputs the text for their part, we need to be able to classify the part and map it to one in our database.
Currently, people on our team manually map the customer text to the parts in our database. We want to automate that process.
We tried using MongoDB text search, but it was often inaccurate since parts have different names in different parts of the country.
So we wanted something that gives more accurate results and improves as we collect more data. We immediately considered TensorFlow. After some research and taking part of Google's Machine Learning Crash Course, I got to the point where it specified:
Models can't learn from string values, so you'll have to perform some feature engineering to convert those values to something numeric
That would work if we had a limited number of string features, but we don't know what text the user will enter.
So, my questions are:
1- Can we use Machine Learning to map text input by the user with some documents on our database?
2- If we can do that, is it a good idea to favor it over other search tools like ElasticSearch?
3- Can ElasticSearch improve its results the more data we have? How?
4- How would you go about this problem?
Note: I'd be doing that in Node.js, and since TensorFlow.js is new, I am inclined to go for other solutions, but if push comes to shove and the results are much better, I would definitely go there.
TL;DR: Yes and yes.
TS;WM:
This is a perfectly suited problem for machine learning. Especially so, if you have a database of past customer texts that have already been mapped to parts. Ideally, you have hundreds of texts mapped to each part. If that is present, you can design and train a network. And models can learn from string values with some engineering, and it's not that bad.
I'm not sure ElasticSearch would improve much on the network. I don't know much about auto parts trading, but as a wild guess, "the large round thingy that helps change direction" would never be mapped to "steering wheel" by ES but could be learned easily by a network - provided there are at least some examples of people using that text to specify steering wheel.
You can, but don't necessarily have to, use TensorFlow.js for your network. The AI could run on your server as a web service: you'd just send the customer's text over to it and it would send back its recommendations of part SKUs and names.
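To illustrate the "some engineering" needed to turn strings into numbers, here is a minimal sketch. It is not a neural network and not TensorFlow: it is a much simpler nearest-centroid classifier over bag-of-words counts, using only the Python standard library. The part labels and training texts are made up; real training data would be the manual mappings your team has already produced.

```python
from __future__ import annotations

import math
import re
from collections import Counter, defaultdict


def featurize(text: str) -> Counter:
    """Turn a string into numeric features: counts of lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(examples: list[tuple[str, str]]) -> dict[str, Counter]:
    """Sum the feature counts of all past customer texts mapped to each part."""
    centroids: dict[str, Counter] = defaultdict(Counter)
    for text, part in examples:
        centroids[part].update(featurize(text))
    return dict(centroids)


def classify(text: str, centroids: dict[str, Counter]) -> str:
    """Return the part whose summed counts are most similar to the input text."""
    features = featurize(text)
    return max(centroids, key=lambda part: cosine(features, centroids[part]))


# Hypothetical, hand-made training data (customer text, part label).
EXAMPLES = [
    ("steering wheel", "steering_wheel"),
    ("round thing that helps change direction", "steering_wheel"),
    ("front brake pads", "brake_pad"),
    ("pads for the brakes", "brake_pad"),
]
MODEL = train(EXAMPLES)
print(classify("the large round thingy to change direction", MODEL))
```

A trained network would replace featurize and classify with learned representations, but the data flow (text in, numeric features, best-matching part out) stays the same, and it can sit behind a web service called from Node.js.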

OLAP cube powering Excel Pivot. What's a better solution?

I'm looking to build a dynamic data environment for non-technical marketers.
I want to provide large sets of data in Excel pivot table form so that even marketers without analytics or technical backgrounds can access relevant performance information. I'm trying to avoid non-Excel front ends since I don't want users to have to constantly export data when they need to manipulate it in some way.
My first thought was to just throw together an OLAP cube populated with pre-aggregated data, but I got pushback from the IT team as OLAP is "obsolete." I don't disagree with them - there are definitely faster data processing architectures out there.
So my question is this: are there any other ways to structure the data so that marketers can access it easily but still manipulate it to some degree in Excel? I'm working with probably 50-100m rows of data and need the ability to scale dimensionality.
These are just my thoughts.
Really the question could be thrown back at your IT team. Your first thought was to throw together an OLAP cube. IT didn't like this. If they're so achingly hip that they consider OLAP "obsolete", what do they suggest as a better, more up-to-date alternative?
Or, to put it a different way - what is the substance of their objection to an OLAP solution? (I'm assuming there is one beyond "MS gave us an awesome presentation of PowerPivot/Azure tabular, with really great free snacks and coffee").
Your requirements are pretty clear:
1. Easy access for non-technical people
2. Structured data so that they don't have to interpret the raw data
3. Access through Excel
4. Scalability
I'll be paying close attention to any other answers to your question, because I'm always interested in finding out that I don't know something; but personally I haven't come across a better solution to these requirements than OLAP.
What makes me suspicious of the "post-OLAP" sentiment is related to point (2) in the list above. Non-technical users can tend to think of the cube data they consume as being somehow effortlessly produced, by some kind of magic. That in itself is an indicator of success, demonstrating just how easy it is for users to get what they want from a well-designed OLAP system.
But this effortlessness is an illusion: to structure the raw data into this form takes design effort, and the resulting structure incorporates design decisions and assertions: that is how it can be easy to use, because the hard stuff has been encapsulated in the cube design.
I have a definite Han Solo-like bad feeling about "post-OLAP": that it amounts to pandering to this illusion of effortless transformation of data into a usable form, and propagates further illusions.
Under OLAP, users get their wonderful magic usable data structure, and the hard work is done out of sight by developers like you or me. Perhaps we get something wrong so that they can't see data exactly as they'd like to - but at least the users can then talk to us and ask for what they do want.
My impression of the "post-OLAP" sales pitch is that it tries to dispense with the design work. We don't need those pesky expensive developers, we don't need to make specific design decisions (which necessarily enable some functionality while precluding some other functionality), we don't need cube-processing time-lags. We can somehow deliver this:
Input any data you like. Don't worry if it's completely unstructured or full of dirt!
Any scale
Immediate access to analytics without ETL/processing delays
Somehow, the output is usable, structured data. Structured by... no-one in particular. The user can structure it as they like, but somehow this will be easy
Call me cynical, but this sounds like magical thinking to me.

Which are the best tools available for an online routing application?

So here's my question. Suppose one is about to create an online web application that takes a current location and a destination as user input and displays, out of the 5-6 available routes stored in a database, the one that is most suitable in terms of distance. If OpenStreetMap data and OpenLayers are used, what would be the best way to make this happen?
What I am asking is what I would need for:
1. Storing the data in a database
2. Doing the routing calculations. If I would like to change the algorithms a bit for academic reasons and have more control over my final result, how would I do that? Do I need any programming language? Any good tutorials?
3. What is the difference between using pgRouting and using a custom solution (like the one mentioned above)? Would doing all the coding again myself be like reinventing the wheel?
4. What would be best for a commercial website, where fast calculations would be needed?
UPDATE: What I need is a way to connect (1) the user input (as geometry points), (2) the routing algorithm I have written, and (3) the road network, and return the best way to get to a point.
Please see the list of online routers and offline routers for OSM as well as the general wiki page about routing with OSM.
If that still doesn't answer your questions, ask a more specific one.
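For the routing calculation itself (point 2 of the question), a minimal sketch of Dijkstra's shortest-path algorithm may help show what a custom solution involves. The road graph below is hand-made with hypothetical node names and distances; in a real application it would be built from OSM data, and a tool like pgRouting would run comparable shortest-path queries inside the database.

```python
from __future__ import annotations

import heapq


def shortest_path(graph: dict, start: str, goal: str) -> tuple:
    """Dijkstra: return (total_distance, [nodes]), or (inf, []) if unreachable."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        # Push every unvisited neighbor with its accumulated distance.
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (dist + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical road network: node -> {neighbor: distance in km}.
ROADS = {
    "A": {"B": 2.0, "C": 5.0},
    "B": {"C": 1.0, "D": 4.0},
    "C": {"D": 1.0},
    "D": {},
}

distance, route = shortest_path(ROADS, "A", "D")
print(distance, route)
```

Changing the algorithm for academic purposes then amounts to changing how the edge weights are computed or how candidates are ordered in the queue.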

Open source projects for email scrubbing generating structured data from unstructured source?

I don't know where to start on this one, so hopefully you guys can clear up my question. I have a project where email will be searched for specific words/patterns and the results stored in a structured manner, something like what TripIt does.
The article states that they developed a DataMapper:
The DataMapper is responsible for taking inbound email messages
addressed to plans [at] tripit.com and transforming them from the
semi-structured format you see in your mail reader into a highly
structured XML document.
There is a comment that also states
If you're looking to build this yourself, reading a little bit about
Wrappers and Wrapper Induction might be helpful
I Googled and read about wrapper induction, but the definition was too broad and didn't help me understand how one would go about solving such a problem.
Is there some open source project out there that does similar things?
There are a couple of different ways and things you can do to accomplish this.
The first part, which involves getting access to the email content, I'll not answer here. Basically, I'll assume that you have access to the text of the emails; if you don't, there are libraries that let you connect Java to a mailbox, such as Camel (http://camel.apache.org/mail.html).
So now you've got the email; then what?
A handy thing that could help is that LingPipe (http://alias-i.com/lingpipe/) has an entity recognizer that you can populate with your own terms. Specifically, look at some of their extraction tutorials and their dictionary extractor (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html). Inside the LingPipe dictionary extractor (http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html) you'd simply import the terms you're interested in and use that to associate labels with an email.
You might also find the following question helpful: Dictionary-Based Named Entity Recognition with zero edit distance: LingPipe, Lucene or what?
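To make the dictionary idea concrete, here is a minimal sketch of exact dictionary matching. It is a plain Python stand-in for the concept, not LingPipe's API, and the terms and labels are made up for illustration.

```python
from __future__ import annotations

import re


def build_dictionary_tagger(terms: dict[str, str]):
    """Return a function that finds dictionary terms in text and labels them."""
    # Longest terms first so a longer phrase wins over a shorter prefix of it.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): label for term, label in terms.items()}

    def tag(text: str) -> list[tuple[str, str]]:
        """Return (matched text, label) pairs in order of appearance."""
        return [(m.group(1), lookup[m.group(1).lower()]) for m in pattern.finditer(text)]

    return tag


# Hypothetical dictionary: term -> label.
TERMS = {
    "United Airlines": "AIRLINE",
    "Hilton": "HOTEL",
    "confirmation number": "FIELD",
}
tag = build_dictionary_tagger(TERMS)
print(tag("Your Hilton stay; United Airlines confirmation number ABC123"))
```

The labels returned for each email can then be stored alongside the message as its structured representation.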
Really a very broad question, but I can try to give you some general ideas, which might be enough to get started. Basically, it sounds like you're talking about an elaborate parsing problem - scanning through the text and looking to apply meaning to specific chunks. Depending on what exactly you're looking for, you might get some good mileage out of a few regular expressions to start - things like phone numbers, email addresses, and dates have fairly standard structures that should be matchable. Other data points might benefit from some indicator words - the phrase "departing from" might indicate that what follows is an address. The natural language processing community also has a large tool set available for text processing - check out things like parts of speech taggers and semantic analyzers if they're appropriate to what you're trying to do.
Armed with those techniques, you can follow a basic iterative development process: For each data point in your expected output structure, define some simple rules for how to capture it. Then, run the application over a batch of test data and see which samples didn't capture that datum. Look at the samples and revise your rules to catch those samples. Repeat until the extractor reaches an acceptable level of accuracy.
Depending on the specifics of your problem, there may be machine learning techniques that can automate much of that process for you.
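The iterative process described above can be sketched roughly as follows. The three rules and the field names are hypothetical starting points, not a complete extractor; the misses helper reports which samples each rule failed on, so the rules can be revised and the batch rerun.

```python
from __future__ import annotations

import re

# Hypothetical starter rules: one regular expression per output field.
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    # Indicator phrase: whatever follows "departing from" up to a period or newline.
    "departure": re.compile(r"departing from\s+([^.\n]+)", re.IGNORECASE),
}


def extract(text: str) -> dict:
    """Apply every rule; a field is None when its rule did not match."""
    record = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        if match is None:
            record[field] = None
        else:
            # Use the capture group when the rule defines one, else the whole match.
            record[field] = match.group(1) if pattern.groups else match.group(0)
    return record


def misses(samples: list[str]) -> dict:
    """For each field, list the samples the current rules failed on."""
    report = {field: [] for field in RULES}
    for sample in samples:
        for field, value in extract(sample).items():
            if value is None:
                report[field].append(sample)
    return report
```

Running misses over a batch of test emails after each rule change shows whether the revision helped, which is the loop to repeat until accuracy is acceptable.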

Resources