Statistical Analysis of Loan Data - statistics

I have an assignment and need assistance with the 2 questions below:
Dataset link is https://www.sendspace.com/file/7zt8kh
Attached is a dataset, loanapp.csv. It contains data on loan applications to a bank,
including various information on the applicant and the purpose of the loan, along
with the eventual loan decision. Please answer the questions that follow:
1. Is there a significant difference between the income amount and the approved loan amount?
   a. Perform a statistical test to determine if there is a significant difference or not.
2. Which race has a better chance of getting a loan?
   a. Visualize a Count of Loan Decision by Race
   b. Visualize a % of Loan Approval by Race
   c. Perform a statistical test to determine which race has a better chance of getting a loan.
   d. Comment on the results
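For the second question, one commonly used test is a chi-square test of independence between race and loan decision. The sketch below is only an illustration: the counts are made up (they are not taken from loanapp.csv), the group labels are placeholders, and the helper name `chi_square_independence` is hypothetical. It computes the approval rate per group plus the Pearson chi-square statistic and its degrees of freedom; the p-value would then come from the chi-square distribution, which a statistics package can look up (SciPy's `scipy.stats.chi2_contingency` performs the whole test in one call).

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if decision were independent of group
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up counts, NOT from loanapp.csv: (approved, denied) per group.
counts = {"Group A": (90, 10), "Group B": (60, 40), "Group C": (75, 25)}

# Share of applications approved in each group (the "% of approval" view).
approval_rate = {g: a / (a + d) for g, (a, d) in counts.items()}

stat, dof = chi_square_independence(list(counts.values()))
print(approval_rate)
print(stat, dof)
```

A large statistic relative to the chi-square distribution with `dof` degrees of freedom indicates that approval rates differ between groups; it does not by itself say which group differs, so the per-group approval rates still need to be inspected.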

Related

Pure effect of an independent variable on the dependent variable

I have a statistics course assignment regarding the "pure" effect of the mileage on second hand cars' sales price.
The dataset contains several factors which may affect the sales price of cars on an exchange website, including:
Year manufactured
Mileage
Make
Type (Sedan, Wagon, SUV, etc)
Color
Complete logbook service (Y/N)
Fuel efficiency
Seller Zip Code
My understanding is that analysing the "pure" effect of one independent variable on the dependent variable means holding all the other variables constant, i.e. comparing only cars of the same make, manufacturing year range, type, color, etc. However, if I do that for just a single combination of characteristics, I'd give up many data points.
So what's the best approach to tackle this kind of problem? Should I do many sets of single-variable linear regressions between mileage and sales price on many combinations of similar cars and average the effect?
Sorry there isn't any data here. I just want to have a road map of solving the problem. Thanks very much.
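One common alternative to averaging many single-variable fits is a single multiple regression, in which the other factors enter as additional regressors (categorical ones as 0/1 dummy variables). The mileage coefficient is then the estimated price change per unit of mileage with the other included factors held fixed, and no data points are discarded. The sketch below is a minimal illustration under stated assumptions: the cars, the column choice (mileage, age, an SUV dummy) and the prices are all invented, and the prices are generated from a known rule so the recovered coefficient can be checked. It uses plain normal equations in pure Python; a statistics package would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical cars: (mileage in 1000 km, age in years, 1 if SUV else 0)
cars = [(20, 1, 0), (60, 3, 0), (100, 5, 0), (150, 8, 0),
        (30, 2, 1), (80, 4, 1), (120, 6, 1), (40, 5, 1)]

# Synthetic prices built from a known rule so the fit can be verified:
# price = 30000 - 50 * mileage - 1000 * age + 5000 * is_suv
prices = [30000 - 50 * m - 1000 * a + 5000 * s for m, a, s in cars]

# Design matrix with a leading 1.0 column for the intercept.
X = [[1.0, m, a, s] for m, a, s in cars]
intercept, b_mileage, b_age, b_suv = ols(X, prices)
print(round(b_mileage, 2))  # estimated price change per 1000 km, others fixed
```

With real data the fit will not be exact, and factors such as make or color would need one dummy column per category (minus a reference category).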

How to calculate the effort a customer type most likely requires of an FTE during a year

I know how many customers of each customer-size group each sales representative handles on an annual basis. Is it possible to calculate the likely time/effort required by each customer size based on this data set? Or said differently, I'm trying to find out if larger customers require more or less effort than smaller customers.
Is there a function or formula in Excel that will allow me to answer the above based on the data set below?
To add some context, in case it is helpful: there are 2080 work hours a year. I'm assuming they spend all their time with the customers under their responsibility. I also expect that the largest customers require more time than small customers, but I don't know how much more. That is what I'm trying to figure out. Some employees do handle a lot more customers than others, so it's probably best to look at the relative difference between the customer sizes for each employee...
Customer size is rated from 0 (very small) to 7 (the largest).
Below is a small data extract of a large Data table

Designing a domain model (class diagram) for a financial software [closed]

During my preparation for an exam in software engineering, I came across the following task in an old exam:
For a client, you create a new financial software whose task is, among other things, to perform tax calculations. The following requirements have been communicated to you by the Client:
The system must be able to:
calculate and display VAT for different countries and tax rates (Germany 19%, Austria 20%, Switzerland 8%).
calculate and display the income tax according to country-specific tax tables (separate table for Germany, Austria, Switzerland).
The system must allow the user to:
enter the tax relevant data (gross amount for VAT, annual income for income tax)
print the result of the tax calculation on a network printer.
send the result of the tax calculation to the appropriate tax office.
Task 1: Capture the requirements communicated by the client in a domain model (class diagram) with the following information: classes, attributes, methods, relationships, multiplicities, relationship name.
Solution:
I am not sure how to define the right classes, relationships and multiplicities. But I tried it and came to the following incomplete solution:
First Update:
Second Update:
Could someone help me with this? Thanks :)
Review of your diagram
I suggest you read your first diagram out loud; I leave it as an exercise to cross-check whether it really meets the requirements:
"A tax rate is composed of a country" (top composition). So countries do not exist independently of tax codes. Is this really what you meant? And does anything in the requirements say that there is only one tax rate per country?
"A tax rate is composed of an (optional) income tax rate, and an (optional) VAT rate" (double composition in the middle). Ouh!?
"Every income tax rate has its own tax category(ies)" (bottom composition). Isn't the idea of categories to group similar income tax rates?
"A tax rate aggregates tax administrations, and a tax administration may appear in several aggregates" (aggregation). Why should an administration be aggregated in tax codes?
First recommendation: read in your course about the difference between association, aggregation and composition. The use of aggregation and composition is in principle exceptional, and there must be strong reasons to use them.
Some more questions:
Where are the names of the relations?
What requirement justifies the tax administration? If it is justified, shouldn't it be related to a country?
Is printing some elements really part of the domain model or does it already belong to some user-interface?
Second recommendation: only show elements that you can reasonably derive from the requirements, and avoid any user-interface related behaviors.
Edit: your final diagram, following our exchanges in the comment section, represents much better what you wanted to represent initially. You could add the multiplicity 1..* rates for 1 category. You could also add a separator, in order to show classes consistently with property and operation sections, even if one of the two is empty. The design is still basic, since all properties/attributes are public, which is not recommended (but I suppose you did this to avoid a lot of extra getters/setters in your design).
Alternate approach:
Your narrative describes one single use-case, which is to perform a tax calculation and consists of entering the calculation data, printing it and sending it. The actors are probably some clerk of your customer and perhaps tax offices.
I find the following candidates for classes chronologically, when reading the narrative: VAT, country, tax rate, income tax, "country-specific tax tables", gross amount, annual income, tax calculation, tax office. Let's have a closer look:
Tax office is very unclear: is there a network printer per tax office? How is the relevant tax office determined? Is there one office per country, or can the organisation be more complex?
VAT and income tax are very different:
for VAT there are different rates per country. The applicable rate is always known, and the calculation is based on the applicable rate and the gross value.
For income tax the narrative speaks of country-specific tax tables: this means that the rate might not be known in advance, but depend on the taxable income level. (e.g. in Austria there is a minimum, and beyond it's flat rate; but in France, there is a normal rate, and a reduced rate for the first 500K€). In reality, income tax is much more complicated, since it may also depend on the legal form of the enterprise, or what is done with the income (re-invested vs. distributed), but let's keep it simple for the exercise. The wording leaves an ambiguity whether there is one table per country or several.
You could nevertheless generalize the concept of tax, if you'd want, considering in this exercise, that its amount is calculated for a base amount (gross amount or annual income).
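As a rough illustration of that generalisation (this is a sketch, not the diagram from the original answer), the two kinds of tax can share one operation that computes an amount from a base amount. The class names `Tax`, `VAT` and `IncomeTax` are hypothetical, the VAT rates are the ones given in the requirements, and the income tax brackets are invented purely to show how a table differs from a flat rate.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Tax(ABC):
    """Generalised tax: an amount calculated for a base amount."""

    @abstractmethod
    def amount(self, base: float) -> float:
        ...


@dataclass(frozen=True)
class VAT(Tax):
    country: str
    rate: float  # the applicable rate is known in advance

    def amount(self, base: float) -> float:
        return base * self.rate


@dataclass(frozen=True)
class IncomeTax(Tax):
    country: str
    # Tax table as (lower_bound, marginal_rate) pairs, sorted ascending.
    brackets: tuple

    def amount(self, base: float) -> float:
        total = 0.0
        for i, (lower, rate) in enumerate(self.brackets):
            if i + 1 < len(self.brackets):
                upper = self.brackets[i + 1][0]
            else:
                upper = float("inf")
            if base > lower:
                # Tax only the slice of income that falls in this bracket.
                total += (min(base, upper) - lower) * rate
        return total


# VAT rates as stated in the requirements.
vat_by_country = {
    "Germany": VAT("Germany", 0.19),
    "Austria": VAT("Austria", 0.20),
    "Switzerland": VAT("Switzerland", 0.08),
}

# Invented table, for illustration only (not a real tax schedule).
example_income_tax = IncomeTax("Germany", ((0, 0.0), (10_000, 0.2), (50_000, 0.4)))

print(vat_by_country["Germany"].amount(100.0))
print(example_income_tax.amount(60_000.0))
```

Printing and sending results are deliberately left out here, in line with the remark above that they may belong to the user interface rather than the domain model.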
The tax calculation is not fully clear: is it just the user interface, or is the calculation actually some domain object? This would give us:
This would lead to a diagram like:

"IF" function for analysis of hospital lab frequency

I work for a hospital that is part of a larger network. We were recently asked by our corporate overlords to address the use of a specific laboratory test. In general, this test should only be performed daily, which should be considered to correspond to a 24-hour period from the last draw. Sometimes, however, based on when people arrive at the hospital (e.g. 7pm), and in the interest of bundling labs for a single draw, they may be drawn sooner to coincide with routine testing, i.e. at 5am. It would otherwise never be necessary to repeat the test within a short (8-hour) window, particularly on the same day.
We have been asked to validate whether we are adhering to this general practice, as testing any more frequently than that, say within 12h of a previous test, has no real clinical value and thus adds unnecessary cost.
To address this issue I was given a dataset that, among other items, includes all instances in which the lab was performed, including collection date and time.
Please see the HIPAA-safe example below (to be clear, no real data, and the identifiers are not real); the actual dataset has over 4,174 entries corresponding to 1,328 unique persons. Everyone had at least one test performed; not everyone had more than one.
I THINK what I want is an IF formula that reads the preceding cell to 1) check whether it is the same person and 2) if so, subtract the time stamps to display the difference in time, which I can then filter, plot as a histogram, etc. Does this seem like a reasonable approach? Is there a preferable method to facilitate the analysis? Do any other forms of analysis come to mind?
=IF(B2=B1, D2-D1, "n/a")
example data set with formula:
any other forms of analysis come to mind?
By the looks of it you should consider taking the values under "Results" into account, assuming there is a band that might be considered 'normal' readings. The "one in 24 hours is sufficient" rule of thumb may well be appropriate for a series of values within the 'normal' band but not so much so if readings are close to 'danger level'.
That is, in some cases a higher than 'standard' frequency of monitoring may be in the patient's interest, even if not hospital policy, so it may be worth separating the "less than 24 hours interval" readings into those where the higher frequency provided information of little value (eg readings remaining within a 'normal' band) from any that crossed into or out of the band and/or large changes in value. This though may be more a matter of statistical analysis than programming and depend upon whether any action might be taken as a result of such "extra" readings.
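For reference, the row-by-row logic of the IF formula in the question (same person as the previous row, then subtract the timestamps) can be sketched outside Excel as below. This is only an illustration: the function name `hours_since_previous`, the identifiers and the timestamps are all made up. Note that both the formula and this sketch rely on the rows being sorted by person and then by collection time; the sketch sorts explicitly, while in Excel the sheet would need to be sorted first.

```python
from datetime import datetime


def hours_since_previous(records):
    """Given (person_id, collected_at) pairs, return a list of
    (person_id, collected_at, hours_since_previous_test) sorted by
    person and time. The gap is None for a person's first test."""
    out = []
    prev_id, prev_time = None, None
    for pid, ts in sorted(records):
        gap = None
        if pid == prev_id:
            gap = (ts - prev_time).total_seconds() / 3600
        out.append((pid, ts, gap))
        prev_id, prev_time = pid, ts
    return out


# Made-up identifiers and timestamps, for illustration only.
records = [
    ("P001", datetime(2020, 1, 1, 19, 0)),
    ("P001", datetime(2020, 1, 2, 5, 0)),
    ("P001", datetime(2020, 1, 3, 5, 0)),
    ("P002", datetime(2020, 1, 1, 8, 0)),
]

gaps = hours_since_previous(records)

# Tests repeated within 12 hours of the previous test for the same person.
too_soon = [(pid, ts, h) for pid, ts, h in gaps if h is not None and h < 12]
print(too_soon)
```

The list of gaps can then be filtered or binned into a histogram, as proposed in the question, and joined with the result values if the suggestion above about 'normal' bands is followed.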

Excel predicting future value

I have a large excel file that has monthly sales per customer for January - December 2016. I want to predict what their sales will be in January 2017.
You could average each client's data and ignore the zeros with a formula like
=AVERAGEIF(D2:D12,"<>0")
D2:D12 would be the range of a single client's sales variable and it would give you a monthly average for that client that you could use for January Predicted Sales.
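To make the behaviour of that formula explicit, here is a minimal sketch of the same calculation in Python. The function name `average_nonzero` and the sales figures are made up for illustration.

```python
def average_nonzero(values):
    """Mean of the non-zero entries, or None if every entry is zero."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        return None
    return sum(nonzero) / len(nonzero)


# Made-up monthly sales for one client; zeros are months without sales.
monthly_sales = [100, 0, 200, 0, 300]
print(average_nonzero(monthly_sales))
```

Ignoring the zeros answers "what does this client buy in a month when they buy at all", which overstates expected January sales for clients who skip many months. Where no cell meets the criterion, AVERAGEIF returns a #DIV/0! error; the sketch returns None in that case.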
You have several problems to solve:
1. Determining (a) candidate forecasting model(s) to use.
2. Organising your existing data to test whether such model(s) are actually suitable, performing such tests and selecting (a) suitable model(s). [There may be more than one model to be used, dependent on whether your data are homogeneous or not.]
3. Organising your existing data to apply your chosen model(s) for the purposes of making your prediction. (A different organisation to 2. may be required.)
Your description talks about "sales" but the data sample you provided mentions "claims". These are very different entities - sales (dependent on what type of sales) may well be as frequent as monthly, but claims are likely to be a lot less frequent. If this is the case and claims are highly infrequent, then there is little sense in trying to predict an individual customer's claim. In such a case it would make more sense to predict the aggregate level of claims across a group of customers.
With all modelling, and particularly with forecasting models, context is highly important in steering towards which particular types of model are likely to be suitable. As it is, you have provided no context about what your data really represents, so are unlikely (beyond random chance) to find that any solution offered to you is actually going to be suitable. A solution might compute but, in the context in which you are operating, will it provide anything like a sensible or justifiable set of forecasts?
The "AverageIf" solution may be sufficient; however, you may be able to do better if there are in fact any trends or seasonality in the data that could be used to modelling advantage. For each customer, I would check for autocorrelation in the data. "Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations as a function of the time lag between them." (https://en.wikipedia.org/wiki/Autocorrelation) For instance, if there is significant autocorrelation at lag = 12, this would suggest yearly seasonality in the data (maybe every January is similar). There is a nice tutorial on analyzing autocorrelation in Excel at:
http://www.real-statistics.com/time-series-analysis/stochastic-processes/autocorrelation-function/
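As a small illustration of what the autocorrelation check computes, the sketch below implements the usual sample autocorrelation at a given lag in plain Python. The function name `autocorrelation` and the series are made up; the series is an invented 12-month pattern repeated for two years. Keep in mind that a lag of 12 can only be estimated when more than 12 observations are available, so a single year of monthly data (as in the question) cannot show yearly seasonality on its own.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom


# Invented monthly pattern, repeated for two years (24 observations).
pattern = [5, 3, 4, 6, 8, 9, 12, 11, 7, 6, 4, 10]
series = pattern * 2

r12 = autocorrelation(series, 12)
r1 = autocorrelation(series, 1)
print(r12, r1)
```

Even for this perfectly repeating series the lag-12 value is 0.5 rather than 1, because only 12 of the 24 observations have a partner 12 months later; this estimator shrinks towards zero at long lags, which is another reason to want several years of history.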
If autocorrelation does exist, it would likely then be useful to perform a regression with the time component(s). If there is a trend with time in addition to a cyclical component, that should also be factored into the regression (e.g. via a "Year" variable); or a more sophisticated time series method that accommodates trend and autocorrelation could be applied, such as an Autoregressive Integrated Moving Average (ARIMA) model:
https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
Excel has a forecasting function that might help:
FORECAST.ETS function
Calculates or predicts a future value based on existing (historical) values by using the AAA version of the Exponential Smoothing (ETS) algorithm. The predicted value is a continuation of the historical values in the specified target date, which should be a continuation of the timeline. You can use this function to predict future sales, inventory requirements, or consumer trends.
This function requires the timeline to be organized with a constant step between the different points. For example, that could be a monthly timeline with values on the 1st of every month, a yearly timeline, or a timeline of numerical indices. For this type of timeline, it’s very useful to aggregate raw detailed data before you apply the forecast, which produces more accurate forecast results as well.
Syntax
FORECAST.ETS(target_date, values, timeline, [seasonality], [data_completion], [aggregation])
And you can see it in action in a workbook from the FORECAST.ETS.SEASONALITY page:
Download a sample workbook

Resources