Bagging and max_features in Random Forest - scikit-learn

I do not understand the claim that "a smaller max_features reduces overfitting in Random Forest".
Also, what is meant by bagging in Random Forest?

I'll do my best to explain. You seem like you're new to Machine Learning algorithms, but I'm sure you know what overfitting is! So let me explain very simply why this happens with Random Forest, copying this user's answer (https://cs.stackexchange.com/users/30903/kyle-hale):
Once there was a girl named Mel. And every day Mel played with a different friend, and every day she played it was a sunny, wonderful day.
Mel played with
Jordan on Monday,
Lily on Tuesday,
Mimi on Wednesday,
Olive on Thursday .. and then
on Friday Mel played with Brianna, and it rained. It was a terrible thunderstorm!
More days, more friends!
Mel played with Kwan on Saturday,
Grayson on Sunday,
Asa on Monday... and then
on Tuesday Mel played with Brooke and it rained again, even worse than before!
Now Mel's mom made a time-table for Mel deciding who she should play with and when, so that night during dinner she starts telling Mel all about the new time-table she has lined up. "Luis on Wednesday, Ryan on Thursday, Jemini on Friday, Bianca on Saturday -"
Mel frowned.
Mel's mom asked, "What's the matter, Mel, don't you like Bianca?"
Mel replied, "Oh, sure, she's great, but every time I play with a friend whose name starts with B, it rains!"
What's wrong with Mel's answer?
Well, it might not rain on Saturday.
Well, I don't know, I mean, Brianna came and it rained, Brooke came and it rained...
Yeah, I know, but rain doesn't depend on your friends.
This is exactly what happens with Random Forest. When you let each split consider too many features (which is what the max_features parameter controls), the trees try very hard to find patterns where none exist and end up fitting the existing data so exactly that they overfit. A decision tree can always get a near-perfect fit on the training data, since it keeps splitting into branches until it has effectively memorised every feature combination. Keeping max_features small means each split only sees a random subset of the features, so the individual trees are less able to chase noise and are less correlated with one another, which reduces overfitting.
As for bagging in Random Forest: bagging (bootstrap aggregating) means each tree is trained on a random sample of the training data drawn with replacement, and the forest then averages the trees' predictions (or takes a majority vote for classification). Because every tree sees slightly different data and slightly different features, the averaged prediction is more stable and less overfit than any single tree's.
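If you want to see this for yourself, here is a minimal scikit-learn sketch (the synthetic dataset and parameter values are just illustrative, not anything from your code); each forest bags 200 bootstrap-sampled trees, and the only difference between the two runs is how many features a split may consider:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data with many noisy features and only a few informative ones.
X, y = make_classification(n_samples=500, n_features=40, n_informative=5, random_state=0)

for max_features in ("sqrt", None):  # None = every feature is considered at each split
    forest = RandomForestClassifier(n_estimators=200, max_features=max_features, random_state=0)
    score = cross_val_score(forest, X, y, cv=5).mean()
    print(max_features, round(score, 3))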

GPT-3 cannot mix two actions into one prompt (summarisation and tense changing)

(Just a heads up: this feels like a weird question to ask since there's not really any code involved, so I'm not sure if this is the right place to ask.)
I am trying to summarise a journal entry and convert it into second person past tense (i.e. "I went to the shop" -> "You went to the shop").
When I give the following prompt to GPT-3 (Davinci, all other params at their defaults), it gives me a summary as expected:
Summarise this text:
We took to the streets of London on the London hire bikes aka Boris Bikes / BoJo Bikes; previously Barclays Bikes and now Santander Bikes – bloomin heck this is complicated. I knew the direction where I wanted to get to and knew how to get there except I didn’t really.
We started our journey at one of bike hire station in St John’s Wood and continued around Regents Park (the wrong way) Simon got us to one of the gateways to the path along the Regents Canal. Sometimes they can be quite difficult to find; this was one of those times. This particular one was located at the back of a housing estate; only that Simon knew where it was there was no way I would have found it.
Off down the canal we went. Sunday afternoons are a busy time along the canal with local people mixed in with tourists from all over the world; so cycling along a narrow path is not easy as everyone walks on different sides of the path (according to where they come from)! We got towards Camden Market and the path got very busy, to the point that I almost went into the canal but with a wibble and a wobble I managed to stay in. At that point the decision was easily made to get off that bike and walk it. The Santander App showed us where the nearest parking station was and that there was space available to park up.
Coffee time! Forget the major chains, we found a small local place called T & G for some cups of coffee and a sarnie before we went out to find out next bike to get us to Granary Square in Kings Cross for our next stop. From the canal path there is a grassed set of steps going up to the Square but first we parked up the bikes on the other side of the canal. So many places to choose from to hang out, for drinks and for food or trains to Paris, Lille, Edinburgh or Manchester to start off with.
All in all, we went out and achieved what we intended to – a cycle along the canal with a couple of stops along the way for some food and drinks. What better way to spend a Sunday afternoon.
// GPT-3's answer:
The author takes a Boris Bike (a bike available for rent in London) and cycles along the Regents Canal. They note that the path is busy on a Sunday afternoon, but manage to find a parking spot for their bike before exploring the area around Granary Square in Kings Cross.
That is a very good summary.
Now, I can convert the summary to second person POV like so:
Convert the following to past tense second person:
The author takes a Boris Bike (a bike available for rent in London) and cycles along the Regents Canal. They note that the path is busy on a Sunday afternoon, but manage to find a parking spot for their bike before exploring the area around Granary Square in Kings Cross.
// GPT-3's answer:
You took a Boris Bike and cycled along the Regents Canal. You noted that the path was busy on a Sunday afternoon, but managed to find a parking spot for your bike before exploring the area around Granary Square in Kings Cross.
Again, excellent! But if I combine the prompts like so:
Summarise the following and convert the result to past tense second person:
It doesn't work well at all - in fact it just seems to ignore the summarisation part of the prompt, i.e. it just converts the whole passage to second person past tense. How can I fix this?
Try the following format and see if it works well enough for you; I'm using this approach for a whole lot of scenarios to solve the issue you described. Use zero temperature (unless you want to risk variations).
Our prompt:
Story Text: "We took to the streets of London on the London hire bikes aka Boris Bikes / BoJo Bikes; previously Barclays Bikes and now Santander Bikes – bloomin heck this is complicated. I knew the direction where I wanted to get to and knew how to get there except I didn’t really.
We started our journey at one of bike hire station in St John’s Wood and continued around Regents Park (the wrong way) Simon got us to one of the gateways to the path along the Regents Canal. Sometimes they can be quite difficult to find; this was one of those times. This particular one was located at the back of a housing estate; only that Simon knew where it was there was no way I would have found it.
Story Text: "We took to the streets of London on the London hire bikes aka Boris Bikes / BoJo Bikes; previously Barclays Bikes and now Santander Bikes – bloomin heck this is complicated. I knew the direction where I wanted to get to and knew how to get there except I didn’t really.
We started our journey at one of bike hire station in St John’s Wood and continued around Regents Park (the wrong way) Simon got us to one of the gateways to the path along the Regents Canal. Sometimes they can be quite difficult to find; this was one of those times. This particular one was located at the back of a housing estate; only that Simon knew where it was there was no way I would have found it.
Off down the canal we went. Sunday afternoons are a busy time along the canal with local people mixed in with tourists from all over the world; so cycling along a narrow path is not easy as everyone walks on different sides of the path (according to where they come from)! We got towards Camden Market and the path got very busy, to the point that I almost went into the canal but with a wibble and a wobble I managed to stay in. At that point the decision was easily made to get off that bike and walk it. The Santander App showed us where the nearest parking station was and that there was space available to park up.
Coffee time! Forget the major chains, we found a small local place called T & G for some cups of coffee and a sarnie before we went out to find out next bike to get us to Granary Square in Kings Cross for our next stop. From the canal path there is a grassed set of steps going up to the Square but first we parked up the bikes on the other side of the canal. So many places to choose from to hang out, for drinks and for food or trains to Paris, Lille, Edinburgh or Manchester to start off with."
Following is the Summary of the Story Text (1) and Second Person Past Tense of that Summary (2):
1)
So, by using "Following is the Summary of the Story Text (1) and Second Person Past Tense of that Summary (2): 1) " we're biasing GPT-3 in a simple, syntactically explicit way, and this bias is the very last thing in the prompt; we also help it by already providing the "1) " (but leaving its content empty).
GPT-3's zero-temperature result (model text-davinci-002):
We took the London hire bikes for a ride and ended up at a coffee shop near Camden Market.
2) You took the London hire bikes for a ride and ended up at a coffee shop near Camden Market.
I suggest you also add "3)" as a stop sequence in case GPT-3 adds too much. The result is now easily parsable by splitting on newlines, removing any "2) ", trimming, and then grabbing lines[0] and [1].
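For what it's worth, a rough Python sketch of that parsing step might look like this (the completion string here is just the example output pasted in; in practice it would come back from your API call):
completion = """We took the London hire bikes for a ride and ended up at a coffee shop near Camden Market.
2) You took the London hire bikes for a ride and ended up at a coffee shop near Camden Market."""

# Split on newlines, drop empty lines, strip the "2) " marker and whitespace.
lines = [line.removeprefix("2)").strip()          # removeprefix needs Python 3.9+
         for line in completion.splitlines() if line.strip()]
summary, second_person_summary = lines[0], lines[1]
print(summary)
print(second_person_summary)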

Having some trouble writing and removing punctuation in Python

So I'm trying to clean up a file for my assignment by converting to lowercase and removing punctuation.
content:
I'm a fan of soft serve ice cream and Guptill's Coney Express has delicious ice cream with many flavors. I've tried Kurver Kreme in Colonie, Tastee Freeze in Delmar and Country Drive Inn in Clifton Park, but I think that this place has the best soft serve ice cream. The portions are generous and the taste is very rich. For example, the brownie sundae is decadently delicious but likely too much for one person. They also have cupcake sundaes which I am looking to try soon!
The nurses here were very attentive and wonderful. I was able to have the same surgical nurse that I had for another procedure a few years ago. The anesthesiologist listened to my concerns and acting skillfully and compassionately. I felt well taken care of there.
Pretty great! Okay, so this place is obviously not Vegan since they have a bunch of cheese and egg offerings, BUT I see that they do offer plenty of vegan alternatives.\n\nI was sort of skeptical being here because the prices were pretty hefty, I felt. I guess it looked like a fast-food/diner joint, but it charged a little more. \n\nAnyway, their homemade hot sauce is AMAZING. I got the eggs benedict for dinner and J got an omelet. Both were really good. I do love their homefries.. but the next time I come here, I want onion rings or fries. Those onion rings looked amazing.\n\nLastly, the food came relatively quickly.\n\nNot a fan of the service. They tried to seat us at this edge facing the stoves, without asking, so I asked for a booth. Then at the booth, the server didn't refill waters very well but didn't feel bad emphasizing over and over whether or not we wanted their $5-7 desserts. Honestly, a slice of pie for $6.50? Veggie Galaxy, you are t r i p p i n !\n\nBut great food! (especially breaky!)
The Tale of the 4-Starred, Up Close and Personal Bar.\n\nThe Back Abbey is a bar that is located in the charming West Village part of Claremont. The area resembles the old shopping strip-meets-new shopping strip of San Luis Obispo. What they did -- or so I've been told -- is that they've managed to preserve the old strip, leaving it looking the way it presumably did in the 70s. Rhino Records is located in the older part. The Back Abbey in the new part. Both distinct areas of their aesthetic and consumer pull. \n\nThe Back Abbey specializes in Belgian draft beer. They carry some German and English beer, as well.\n\nWow, the selection of beer. Let me tell you! A lof of them I have never heard of. I tasted the Augustijn (Belgian) and the Hofbrau Original (German). Both were good, but I prefer the Hofbrau.They come in different sized glasses, with the Hofbrau coming in a long glass. The Augustijn was $9, and the Hofbrau $6. The Hofbrau came in the larger glass, so the Augustijn must be considered some premium brew for it to be in a smaller glass and come knocking down doors at a mighty $9. \n\nI liked the overall aesthetics of the place. There is an up-close-and-personal feel to the bar. Along with the standard bar and barstools set-up, there's a long table that goes from one end of the bar to the other. People sit across from each other and, since the tables are not wide, you're not far at all from the person sitting in front of you. Like I wrote earlier, up close and personal is what this bar has going for it.\n\nThey have a food menu, as well. Above their draft selection of beer, they have a chalkboard placed high up with a list of what they serve. For those veggies among us, they even make a veggie burger.\n\nLastly, the patrons all seem to know each other. Maybe it's the small town mentality. But it's not exactly a backwards, hick town. It's very much modern in certain ways, and people just love to talk. Couple that with brew, and there's a lot of talk going on.\n\nI had a discussion with someone at the bar, who bartends at a bar not too far from The Back Abbey, and he told me that the Augustijn is the oldest beer recipe known to man. (Ah, ok. Old is vintage. Vintage is expensive. Thus, Augustijn is expensive. And two + two = four. I got this down, partna'.) Again, can we give it up for \
As a vegan, I try to support places that are not wholly vegetarian, but that made the effort to put a vegetarian section on their menu, to show them that it's worth keeping. Thai Singha House does indeed have a section of their menu that reads vegetarian, but, unfortunately, their definition of vegetarian is different from mine.\n\nAll of the vegetarian curries, I was fortunately informed upon asking, do indeed contain fish sauce, which is an ingredient in the curry itself, and cannot be omitted.\n\nMy server suggested the pad thai as a fish-free vegetarian alternative, which I ordered, holding the egg. When it arrived, there appeared to be tofu in there and no egg, which was a good sign. But after a few blind, trusting bites I quickly realized that there was also chicken or pork mixed in as well. My server graciously took it back and honored my request for a house salad, of which I wasn't charged.\n\nI don't blame Thai Singha House for not understanding the definition of vegetarian (sadly, I know a few \
On a recent visit to SLO, I got taken out to the Natural Cafe, which bears a distinct resemblance to some of the places I remember from growing up in Berkeley. Very natural woodsy, with just enough sprouts on the sandwich to add a little crunch. It's casual, with counter service only.\n\nMy turkey burger was not too dry, but the guacamole and special sauce helped it along in the moistness department. I didn't walk away from the meal feeling all gross and over full like you would with a normal burger and fries. Yeah for whole grains and veggies!
Great place! \n\nI have to say the menu and the outdoor seating keep us coming back. The food is good -- had breakfast both times but some friends had lunch items. Definitely a great selection. We've been at off-peak times so no waiting and better service. \n\nAll in all, it's no DZ Akins but it's definitely worth trying!
Went to Rebecca's Cafe today during my lunch break today. I have to say, my sandwich was delicious! I had a Chicken Caprese panini (grilled chicken, tomato, mozzarella cheese, spinach, and balsamic vinaigrette). The staff was friendly and the service was quick, however the price is a little high for my taste ($7 or so for the sandwich only). Overall, I would definitely go back to try some different items (or even get the same sandwich), but it won't be too often.
The only thing I've ever eaten at Time-Out is their ice cream cookie sandwich thing, and that's really all I need. It's about a half-gallon of ice cream sandwiched between two large-ish chocolate chip cookies. I don't know if they make it, or if it comes off a truck, or what... but it's frozen so solid you usually have to wait a few minutes before chowing down so you don't break a tooth.\n\nIf nothing else, they are one of the few old standards on Franklin Street that have not succumbed to the high-end shopping takeover, so they are worth being excited about.\n\nDo watch out for drunken Tarheels, though. This place has more disgusting drunkenness issues late at night than any bar I've been in.
Pros: Very clean and pretty little place. Really sweet-natured and attentive servers. Clean bathrooms (Haha it says a lot, ok?) Reasonable prices \n\nCons: Pho broth started ok but after a while starts to taste bitter and weird and left my mouth really dry. Meat is kinda chewy.\n\nThis place could have been SO AWESOME. A pho place so close to campus is one of those \
This is my favorite place of all time. I've driven from LA to SLO just to eat here. The barbecue sauce is AMAZING! Everything here is AMAZING! I can never decide what to get when I go here. Here's what I rotate between; tri tip sandwich, chicken wings and the ABC burger. Their fries are delicious also! Wings are fried and spicy. ABC burger is ten times more delicious then In N Out. I don't even know what to say about the trip tip sandwich. Words can not describe it. If you are a meat eater, you'll love it. It's bread, bbq sauce and like 2 inches of steak. Just amazing! My mouth is watering trying to describe the food.
After enduring years of crappy, undercooked, bad pizza in and around Ann Arbor (a la pizza house, cottage inn, and the like), Silvio's brings real pizza to town. Real pizza should be thin, have a crispy, bubbly, sourdough crust and be full of flavor, not grease, and this is the only place to get it.\n\nDon't be fooled by the \
The average rating for this place is right on point: a-ok. We came here with a relatively large group on a Friday night a little after ten o'clock. The first floor of the place was packed, but we found plenty of space upstairs in the corner, which is a pretty weird room considering how removed it is from everything. Or waiter was good about taking everyone's drink orders and being patient about it, though at one point he gave the right drinks to the wrong people. He was also cool about serving us while we played a stupid game with post its on our foreheads. We didn't order any food, but I was impressed that their kitchen is open so late and the menu is pretty big. \n\nWe were having a fun time until the band came on down stairs. The blaring sound was inescapable, even though we were kind of removed from where everything was happening. The music was so loud it was almost impossible to hear what anyone was saying right next to you. In case you're planning to pay with Discover, they DO accept it; not so cooperative on the check splitting though.
For the grand finale of my LA trip, we head over to Firestone Grill immediately after Splash Cafe.\n\nI like this place for many reasons. Obviously the tri-tip sandwich was just off da hook delicious even after clam chowder & calamari literally 5 mins ago. I also think this is one of the largest, most laid back sports bar/grills I've been to. The ladies at the register and the bartenders were very friendly in answering my questions. There is a large dining area inside and the humongous patio outside, but I don't doubt that people may end up spilling onto the sidewalk during game days as it was already quite packed for a Sunday evening. Despite the noise factor, I felt there was good energy from everyone and I was pretty comfortable. They have plenty of wide screen LCD TVs hung along the walls for your viewing pleasure (and the picture is clear :P). I don't think they have servers so you have to walk up to the counter or bar for service. Not a big deal to me IMO. \n\nSome might say that it was premature to give this a 5 star rating, but I have faith that the next time I visit (and I will go back), I will not be disappointed.
I've used this service for years. The best parts are that it tends to be on time and that there are bike racks on the front of the buses. Unfortunately the drivers must navigate roads filled with undergraduates on bicycles who think they are immortal. I've seen some close calls, but the drivers get serious kudos for being able to drive in that environment all day.\n\nThe drivers tend to be helpful and friendly, though some of them don't speak excellent English. There are a few routes I get a bit car-sick on, but that is due to the twisty roads and frequent stopping-accelerating. \n\nThe routes change, especially if there is construction going on. Make sure you check to see what is running during the summer and holidays. I've waited for shuttles that weren't going to come for days before.
Cannot believe how highly rated this place is, wow. Takes forever for them to get orders out even when its super slow. Our pizza took something like 65 minutes from the time we ordered it. Now i will wait that long if the pizza is going to be amazing, like at pizza popilis down in greek town. But this pizza was average at best. I think that the gluten free offerings are great but I really expected alot more. Great renovations btw the place looks amazing.
We went here a few weeks ago on the premiere weekend of Horrible Bosses. Loved that movie! \nMy wife wanted to come to a real movie theater that was more of a classic than one of these \
One of the best nights of my life. Hands down.\n\nIt was just amazing... The amount of people that gathered all for the same thing, The music pumping through your body everywhere you walked, The day turned to night, and the fireworks that illuminated the sky as you danced without a care to the blissful beats of Deadmau5 and others. It was a night that I will never forget.\n\nYes, it was not perfect, but what event of that size ever has been. I mean you don't show up to the largest rave in america and not expect there to be a shit load of people there. The numbers have been increasing every year, so why would this one be any different? If not, it would only be even bigger, since the biggest rising dance artist of the year, and the biggest name in trance are headlining...\n\nAnd as for all that crap about fence jumping and people getting hurt, and that one girl even died!.. OK, my condolences go out to that girls family, having a death in the family is hard, i cant even imagine it being your child.... BUT you cannot blame her taking drugs on an event, any more than video-games for kids shooting other kids. Its their own ignorance and apathy. As for all those people who got hurt. More than likely their own fault. Some 250 out of 180,000 people messed up, and now theres this huge dim view on raves. Cause there have been less riots and fights due to hockey, or football, or basketball, etc. right? (note: sarcasm above)\n\nMy final word/opinion: Definitely the best massive I've ever been to. Just as good, if not better than last year! Can't wait for next year!
I was in a desperate need of a good falafel and once again Yelp didn't let me down.\n\nOnce in the small deliciously smelling restaurant, I decided to go with the lamb gyro. I read many good things and decided to try it out, and was praying it wouldn't turn my entire mouth yellow(as stated by a few yelpers). \n\nI am happy to report that the gyro was deliciously seasoned and had huge pieces of lamb wrapped up inside. I couldn't tell you if my mouth turned yellow because I made it a point to not
file_input = open('yelp.dat', 'r')
convert_lowercase = [line.lower() for line in file_input]
with open('yelp.dat', 'w') as out:
    out.writelines(sorted(convert_lowercase))

import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
with open('yelp.dat') as file_input:
    f = file_input.read().translate(remove)
I ran it, but when I checked the file there's still punctuation. Where did I go wrong?
A couple of issues I noticed. In your code, you open your file with file_input = open("file", "r"). This opens the file in read mode, but you don't receive the contents of the file until you do file_input.read(). Additionally, when using open() without a context manager, you should close the file with file_input.close(). Some code that would work:
import string

file_input = open("yelp.dat", "r")
file_content = file_input.read()          # Read the contents of the file
convert_lowercase = file_content.lower()  # No need to convert each line individually.
file_input.close()                        # Close the file

# Go through the punctuation and remove it from your content.
# str.replace returns a new string, so assign the result back.
for punctuation in string.punctuation:
    convert_lowercase = convert_lowercase.replace(punctuation, "")

with open("yelp.dat", "w") as file:
    file.write(convert_lowercase)  # This overwrites all current content inside the file.
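As an aside, a shorter variant using str.translate (which your original code was already reaching for) could look like the sketch below; it assumes yelp.dat still contains the original text:
import string

# Build a translation table that maps every punctuation character to None.
remove = str.maketrans("", "", string.punctuation)

with open("yelp.dat") as file_input:
    cleaned = file_input.read().lower().translate(remove)

with open("yelp.dat", "w") as out:
    out.write(cleaned)  # Write the cleaned, lowercased text back over the file.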

How to predict when next event occurs based on previous events? [closed]

Basically, I have a reasonably large list (a year's worth of data) of times that a single discrete event occurred (for my current project, a list of times that someone printed something). Based on this list, I would like to construct a statistical model of some sort that will predict the most likely time for the next event (the next print job) given all of the previous event times.
I've already read this, but the responses don't exactly help out with what I have in mind for my project. I did some additional research and found that a Hidden Markov Model would likely allow me to do so accurately, but I can't find a link on how to generate a Hidden Markov Model using just a list of times. I also found that using a Kalman filter on the list may be useful but basically, I'd like to get some more information about it from someone who's actually used them and knows their limitations and requirements before just trying something and hoping it works.
Thanks a bunch!
EDIT: So by Amit's suggestion in the comments, I also posted this to the Statistics StackExchange, CrossValidated. If you do know what I should do, please post either here or there
I'll admit it, I'm not a statistics kind of guy. But I've run into these kinds of problems before. Really what we're talking about here is that you have some observed, discrete events and you want to figure out how likely it is you'll see them occur at any given point in time. The issue you've got is that you want to take discrete data and make continuous data out of it.
The term that comes to mind is density estimation. Specifically, kernel density estimation. You can get some of the effects of kernel density estimation by simple binning (e.g. count the number of events in a time interval such as every quarter hour or hour). Kernel density estimation just has some nicer statistical properties than simple binning. (The produced data is often 'smoother'.)
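To make that concrete, here's a small sketch of both ideas on made-up timestamps (it assumes numpy and scipy; the data and the one-hour bin width are arbitrary choices, not anything from your project):
import numpy as np
from scipy.stats import gaussian_kde

# Made-up event times: seconds since the start of a one-week window.
rng = np.random.default_rng(0)
event_times = np.sort(rng.uniform(0, 7 * 24 * 3600, 500))

# Simple binning: count events per hour.
hour_bins = np.arange(0, 7 * 24 * 3600 + 1, 3600)
counts, _ = np.histogram(event_times, bins=hour_bins)

# Kernel density estimation: a smoother estimate of the same activity curve.
kde = gaussian_kde(event_times)
noon_on_day_three = 2 * 24 * 3600 + 12 * 3600
print(counts[:5], kde([noon_on_day_three]))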
That only takes care of one of your problems, though. The next problem is still the far more interesting one -- how do you take a timeline of data (in this case, only printer data) and produce a prediction from it? First things first -- the way you've set up the problem may not be what you're looking for. While the miracle idea of having a limited source of data and predicting the next step of that source sounds attractive, it's far more practical to integrate more data sources to create an actual prediction. (e.g. maybe the printers get hit hard just after there's a lot of phone activity -- something that can be very hard to predict in some companies.) The Netflix Challenge is a rather potent example of this point.
Of course, the problem with more data sources is the extra legwork to set up the systems that collect the data in the first place.
Honestly, I'd consider this a domain-specific problem and take two approaches: Find time-independent patterns, and find time-dependent patterns.
An example time-dependent pattern would be that every weekday at 4:30 Suzy prints out her end-of-day report. This happens at specific times every day of the week, and that kind of thing is easy to detect with predetermined intervals (every day, every weekday, every weekend day, every Tuesday, every 1st of the month, etc.) -- just create a curve of the estimated probability density function that's one week long, then go back in time and average the curves (possibly a weighted average via a windowing function for better predictions).
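Something along these lines (purely a sketch; the made-up timestamps, hourly resolution, and ramp weights are my own assumptions standing in for the windowing function mentioned above):
import numpy as np

rng = np.random.default_rng(1)
event_times = np.sort(rng.uniform(0, 8 * 7 * 24 * 3600, 2000))  # 8 weeks of made-up print times

week = 7 * 24 * 3600
hour_bins = np.arange(0, week + 1, 3600)
n_weeks = int(event_times.max() // week) + 1

# One hour-of-week histogram per calendar week.
curves = []
for w in range(n_weeks):
    in_week = event_times[(event_times >= w * week) & (event_times < (w + 1) * week)]
    hist, _ = np.histogram(in_week % week, bins=hour_bins)
    curves.append(hist)

# Weighted average of the weekly curves, favouring recent weeks.
weights = np.linspace(0.5, 1.0, n_weeks)
weekly_profile = np.average(curves, axis=0, weights=weights)
print(weekly_profile.argmax())  # hour-of-week with the highest estimated activity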
If you want to get more sophisticated, find a way to automate the detection of such intervals. (Likely the data wouldn't be so overwhelming that you couldn't just brute force this.)
An example time-independent pattern is that every time Mike in accounting prints out an invoice list sheet, he goes over to Johnathan, who prints out a rather large batch of complete invoice reports a few hours later. This kind of thing is harder to detect because it's more free-form. I recommend looking at various intervals of time (e.g. 30 seconds, 40 seconds, 50 seconds, 1 minute, 1.2 minutes, 1.5 minutes, 1.7 minutes, 2 minutes, 3 minutes, .... 1 hour, 2 hours, 3 hours, ....) and subsampling them in a nice way (e.g. Lanczos resampling) to create a vector. Then use a vector-quantization style algorithm to categorize the "interesting" patterns. You'll need to think carefully about how you'll deal with the certainty of the categories, though -- if a resulting category has very little data in it, it probably isn't reliable. (Some vector quantization algorithms are better at this than others.)
Then, to create a prediction as to the likelihood of printing something in the future, look up the most recent activity intervals (30 seconds, 40 seconds, 50 seconds, 1 minute, and all the other intervals) via vector quantization and weight the outcomes based on their certainty to create a weighted average of predictions.
You'll want to find a good way to measure certainty of the time-dependent and time-independent outputs to create a final estimate.
This sort of thing is typical of predictive data compression schemes. I recommend you take a look at PAQ since it's got a lot of the concepts I've gone over here and can provide some very interesting insight. The source code is even available along with excellent documentation on the algorithms used.
You may want to take an entirely different approach from vector quantization and discretize the data and use something more like a PPM scheme. It can be very much simpler to implement and still effective.
I don't know what the time frame or scope of this project is, but this sort of thing can always be taken to the N-th degree. If it's got a deadline, I'd emphasize worrying about getting something working first, and then making it work well. Something not optimal is better than nothing.
This kind of project is cool. This kind of project can get you a job if you wrap it up right. I'd recommend you take your time, do it right, and post it up as functional, open source, useful software. I highly recommend open source since you'll want to build a community that can contribute data source providers for more environments than you have access to, the will to support, or the time to support.
Best of luck!
I really don't see how a Markov model would be useful here. Markov models are typically employed when the event you're predicting is dependent on previous events. The canonical example, of course, is text, where a good Markov model can do a surprisingly good job of guessing what the next character or word will be.
But is there a pattern to when a user might print the next thing? That is, do you see a regular pattern of time between jobs? If so, then a Markov model will work. If not, then the Markov model will be a random guess.
In how to model it, think of the different time periods between jobs as letters in an alphabet. In fact, you could assign each time period a letter, something like:
A - 1 to 2 minutes
B - 2 to 5 minutes
C - 5 to 10 minutes
etc.
Then, go through the data and assign a letter to each time period between print jobs. When you're done, you have a text representation of your data, which you can run through any of the Markov examples that do text prediction.
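A rough sketch of that idea (the timestamps, the gap boundaries, and the first-order transition counting are all illustrative assumptions):
from collections import Counter, defaultdict
import bisect

print_times = [0, 90, 200, 700, 760, 1300, 1360]  # made-up print-job times, in seconds
edges = [120, 300, 600]                            # gap boundaries: 2 min, 5 min, 10 min
letters = "ABCD"                                   # A: under 2 min, B: 2-5 min, C: 5-10 min, D: longer

# Turn each gap between consecutive print jobs into a letter.
gaps = [b - a for a, b in zip(print_times, print_times[1:])]
symbols = [letters[bisect.bisect_right(edges, g)] for g in gaps]

# Count first-order transitions: which letter tends to follow which.
transitions = defaultdict(Counter)
for cur, nxt in zip(symbols, symbols[1:]):
    transitions[cur][nxt] += 1

# Predict the most likely next gap class given the last observed gap.
last = symbols[-1]
prediction = transitions[last].most_common(1)[0][0] if transitions[last] else None
print(symbols, prediction)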
If you have an actual model that you think might be relevant for the problem domain, you should apply it. For example, it is likely that there are patterns related to day of week, time of day, and possibly date (holidays would presumably show lower usage).
Most raw statistical modelling techniques based on examining (say) time between adjacent events would have difficulty capturing these underlying influences.
I would build a statistical model for each of those known events (day of week, etc), and use that to predict future occurrences.
I think a predictive neural network would be a good approach for this task.
http://en.wikipedia.org/wiki/Predictive_analytics#Neural_networks
This method is also used for prediction tasks such as weather forecasting, the stock market, and sunspots.
There's a tutorial here if you want to know more about how it works.
http://www.obitko.com/tutorials/neural-network-prediction/
Think of a Markov chain like a graph with vertices connected to each other by a weight or distance. Moving around this graph would eat up the sum of the weights or distances you travel. Here is an example with text generation: http://phpir.com/text-generation.
A Kalman filter is used to track a state vector, generally with continuous (or at least discretized continuous) dynamics. This is sort of the polar opposite of sporadic, discrete events, so unless you have an underlying model that includes this kind of state vector (and is either linear or almost linear), you probably don't want a Kalman filter.
It sounds like you don't have an underlying model, and are fishing around for one: you've got a nail, and are going through the toolbox trying out files, screwdrivers, and tape measures 8^)
My best advice: first, use what you know about the problem to build the model; then figure out how to solve the problem, based on the model.

With agile estimating, is it true some say to choose intervals like 1/2 to 1.5 days only? [closed]

With agile estimating, is it true that some say you should only choose intervals like 1/2 to 1.5 days?
It tends to be a good rule of thumb (agile or not) that your tasks should be broken down into at most 1 - 2 day increments.
The idea is that if you have larger chunks than that, then you haven't broken the task down enough, and you will be more likely to miss the estimate, and miss it by larger amounts of time than had you broken it down. Often when you break it down you discover your initial estimate was off, and since you have broken the task down into more concrete tasks, your estimate is now more accurate, more trackable, and more meaningful.
For tasks that are coming up on your to-do list soon, you should pay attention to this, but for long-range planning where you haven't necessarily thought out the feature in detail, I think larger estimates (tasks not yet broken out for the feature) are OK.
Here's a link to Joel Spolsky talking about this. Take a look at item #5 about half way down the page.
http://www.joelonsoftware.com/articles/fog0000000245.html
In my experience, any estimate that's longer than 2 days is probably hiding serious work that should be broken down further. Such estimates have a very high probability of going over. Try to break everything down into smaller chunks so that no individual chunk costs more than 1-2 days.
There are advantages to keeping the estimates short. It forces you to break up large tasks into small, discrete tasks that can be measured and discussed quickly, which helps promote the entire Agile development process.
That being said, I almost never keep a "rule" as a hard and fast rule with things like this. I'd say this is a good guideline, however.
My team consists of junior programmers (university students) and we've found that it's generally easier if we break all the large tasks down into a bunch of smaller ones. It involves more forward thinking, but in the end we are more productive and it's easier to evaluate our progress. It also brings a sense of achievement when you have something completed at the end of the day.
I would agree with that guideline. Anytime I have ever taken on a 5 day task, it has degenerated to a three week nightmare. Large estimates indicate you didn't learn enough about the problem up front to know what is involved, because if you had, you could have found ways to break it up better.
I don't agree. If a team's iterations are two weeks long, the 10 days mean that 1 day would be spent on iteration close (show & tell), iteration planning, and tasking or planning poker.
When playing planning poker, a team uses either geometric or Fibonacci progressions for estimates. For example, cards would contain values such as 1, 2, 4, 8, 16 or 1, 2, 3, 5, 8, 13. Each number reflects the number of days of development for a pair of programmers.
For each card, once discussion has occurred, each member simultaneously plays the card that reflects their estimate. If the majority of the team converges on the same estimate, the estimate is accepted. If there is much variation in the estimates, further discussion occurs (members explain the reason for their estimates) and another round of voting takes place. This occurs until consensus is reached.
If a number greater than 8 is picked, then the card is deemed to be too big and is refactored into at least 2 smaller cards. The reason is that such a large estimate indicates the card is too big to be completed in a single iteration, and any estimate for it is very likely to be inaccurate.
Using such a method brings commitment from the team members to deliver all they have committed to, and for a new team the estimates become so accurate that carry-over of cards soon becomes a low risk.
You can find a very good post about agile estimation and planning on the agile42 blog: Just enough, just in time.
A lot of good answers here, so I'll play devil's advocate and approach it from a different side.
There's a possible problem with breaking down things into very small estimates (# of hours) when doing things such as release planning. David Anderson discusses it in his (excellent) book Agile Management for Software Engineering.
Basically, the idea is that with a task that is very small, a developer will pad his estimate by a fair bit (say, turning a half hour into an hour, or doubling it) because of a certain amount of ego that would be bruised if the developer failed to complete such a small task in the estimated time. These local buffers add up quite a bit and result in a global buffer that's far bigger than it needs to be.
This is less of a problem if you stick with .5 days as a min - it's basically assumed that there's some buffer in there, so you won't need to pad it any more.
I feel there is a bit of mixed and overlapping information in this thread... allow me to make my point :-)
1) The Fibonacci sequence, which is very much used through the Planning Poker technique from Mike Cohn, is about estimating the "Complexity" of User Stories, which are - as Cam said - normally written on cards, and entail more than one task, at least all of those which will be needed to make a Story shippable (Ken Schwaber, Alistair Cockburn, Mike Cohn...)
2) The tasks that are included to complete a Story are normally estimated in Ideal Hours or Pomodori (Francesco Cirillo, "The Pomodoro Technique"). If you estimate in Ideal Hours, the rule of thumb is normally to keep them between half a day (3 ideal hours) and 2 days (12 ideal hours) of work. The reason for this is that the team will then have better qualitative status information, with a team member reporting a Task as done at least every two days, which is much more "valuable" than a 60% done. If you use Pomodori, they are implicitly "timeboxed" to 25 min. each.
The reason to keep tasks small comes basically from "Empirical Process Control Theory", in which, through transparency and regular inspection and adaptation, you can better check the progress of your work by quantifying it. The goal of having smaller tasks is to be able to clearly describe and envision in detail what will actually be done, without adding too much "guessing" due to the natural uncertainty of having to predict "the future". Moreover, defining an outcome and a shorter timebox allows people to keep focus with enough "sense of urgency" to make it a challenging and motivating experience.
I would also pick up the point about "motivation" and "ego" - from Chris - by adding that a good way to have people committed and motivated is to define the expected outcome of a task, so as to be able to measure the results upon completion and celebrate the success. This idea is encapsulated in the Pomodoro Technique, but can also be achieved using ideal hours for estimation. Another interesting part of the Pomodoro Technique is that "breaks" are considered "first class citizens" and planned regularly, which is very important especially in creative and brain-intensive activities :-)
What do you think?
Best
ANdreaT

How to change to use Story Points for estimations in Scrum [closed]

Having used "days" as the unit for estimation of tasks in Scrum I find it hard to change to using Story Points. I believe story points should be used as they are more comparable to each other - being less dependent on the qualifications of whoever addresses the task etc. However, it isn't easy to make a team start using Story Points when they're used to estimating in days.
So, how to make a team change to Story Points? What should motivate the team members to do so, and how should we apply the switch?
When I switched to points, I decided to do it only if I could meet the two following conditions: 1) find an argument that justifies the switch and will convince the team, and 2) find an easy method to use it.
Convincing
It took me a lot of reading on the subject, but I finally found the argument that convinced me and my team: it's nearly impossible to find two programmers that will agree on the time a task will take, but the same two programmers will almost always agree on which task is the biggest when shown two different tasks.
This is the only skill you need to ‘estimate’ your backlog. Here I use the word ‘estimate’ but at this early stage it’s more like ordering the backlog from tough to easy.
Putting Points in the Backlog
This step is done with the participation of the entire scrum team.
Start dropping the stories one by one in a new spreadsheet while keeping the following order: the biggest story at the top and the smallest at the bottom. Do that until all the stories are in the list.
Now it's time to put points on those stories. Personally I use the Planning Poker scale (1/2, 1, 2, 3, 5, 8, 13, 20, 40, 100), so this is what I will use for this example. At the bottom of that list you'll probably have micro tasks (things that take 4 hours or less to do). Give every micro task the value of 1/2. Then continue up the list, giving the value 1 (next in the scale) to the stories until it is clear that a story is much bigger (2 instead of 1, so twice as big). Now using the value 2, continue up the list until you find a story that should clearly have a 3 instead of a 2. Continue this process all the way to the top of the list.
NOTE: Try to keep the vast majority of the points between 1 and 13. The first time, you might have a bunch of big stories (20, 40 and 100) and you'll have to break them down into chunks smaller than or equal to 13.
That is it for the points and the original backlog. If you ever have a new story, compare it to that list to see where it fits (bigger/smaller process) and give it the value of its neighbors.
Velocity & Estimation
To estimate how long it will take you to go through that backlog, do the first sprint planning. Add up the points attached to the stories the team picked and VOILA!, that's your first velocity measure. You can then divide the total of points in the backlog by that velocity to know how many sprints will be needed.
That velocity will change and settle over the first 2-3 sprints, so it's always good to keep an eye on that value.
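As a tiny illustration of that arithmetic (the numbers are invented):
import math

total_backlog_points = 180   # sum of points over all stories in the backlog
first_sprint_velocity = 24   # points completed in the first sprint

sprints_needed = math.ceil(total_backlog_points / first_sprint_velocity)
print(sprints_needed)  # -> 8, to be refined as the velocity settles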
If you want to change to using story points instead of duration, you just have to start estimating in story points. (I'm assuming here you have the authority to make that decision for your team.)
Pick a scale: it could be small/medium/large, it could be the Fibonacci sequence, it could be 1 to 5. Whatever it is, pick one and use it for several sprints; this will give you your velocity. If you start changing the scale from one to the other, then velocity between scales is not going to be comparable (i.e. don't do it). These estimates should involve your whole Scrum team.
Having said that, you still need an idea of how much this is going to cost you. There aren't many accountants who will accept the answer "I'll tell you how much this is going to cost in 6 months". So you still need to estimate the project in duration as well; this will give you the cost. This estimate is probably going to be done by a senior person on the team.
Then every month your velocity will tell you and the accountants how accurate that first cost estimate was and you can adapt accordingly.
Start by making one day equal one point (or some strict ratio). It is a good way to get started. After a couple of sprints you can start encouraging them to use more relative points (ie. how big is this compared to that thing).
The problem is that story points define effort.
Days are duration.
The two have an almost random relationship: duration = f(effort). That function is based on the skill of the person actually doing the work.
A person knows how long they will take to do the work. That's duration. In days.
They don't know this abstract 'effort' thing. They don't know how long a hypothetical person of average skills will require to do it.
The best you can do is both -- story points (effort) and days (duration).
You can't replace one with the other. If you try to use only effort, then you'll eventually need to get to days for planning purposes. You'll have to apply a person to the story points and compute a duration from the effort.
