Is there a difference between the people who repay the loans and those who don’t?

The banks are having a bit of trouble with debt at the moment. They have lent lots of money to people who promised to pay it back, and then didn’t. In the future, they would like to avoid lending to the kind of person who won’t pay back the loan, and that is where you come in. We have got some data from a bank describing 1000 of its loan customers. The data also tells us whether or not each customer repaid the loan (good or bad credit rating).

The question is simple – Is there a difference between the people who repay the loans and those who don’t?

You should use the Weka data mining package, which is installed on the university computers and is also available to download from: http://www.cs.waikato.ac.nz/~ml/weka/

The data set ‘credit-g’ is available on Blackboard in ARFF format. It is also provided as an Excel file, ‘German credit’, which explains what each variable means.

You should hand in a report covering the following:

  1. Select a suitable tree-building algorithm and build a model. Describe how you split the data for training and testing purposes. Interpret the output (what predictions you obtained, which attributes were used to make the predictions, and how many nodes and leaves the tree has).
  2. Give a detailed technical description of the classification model (which algorithm is used, and what tree induction method is utilised). Include a diagram showing the structure of the model that you built.
  3. Vary the model parameters and show how each change affects the results:
     • Change the confidence factor to 35% (0.35). Report any change in the model accuracy and explain the reasons behind it.
     • Set the ‘REP’ parameter (Reduced Error Pruning) to ‘TRUE’. Explain the meaning of this operation. Report and explain any change in the model accuracy.
     • Set the parameter ‘unpruned’ to ‘TRUE’. Report and explain any change in the model accuracy and in the tree structure. Explain which pruning method this algorithm uses.
  4. Compare the model’s ability to predict a defaulted loan with that of a baseline model of your choice (for example, a logistic regression model), and report on how easy it would be for the bank to understand the model.
  5. Analyse and describe the level of accuracy the model achieves and the errors it makes. Show a confusion matrix for the model and interpret it. Show a ROC curve for the decision as to whether or not a loan will be repaid.
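
For the final item above, the quantities behind a confusion matrix and a ROC curve can be sketched in a few lines of Python. This is an illustrative sketch only, not Weka's implementation: the eight customers, their labels and their predicted probabilities are made up, and the function names (`confusion_counts`, `rates`, `roc_points`, `auc`) are hypothetical. It treats ‘bad’ as the positive class, on the assumption that the aim is to spot loans that will not be repaid.

```python
def confusion_counts(actual, predicted, positive="bad"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Summary figures usually read off a confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


def roc_points(actual, scores, positive="bad"):
    """Lower the decision threshold step by step and record (FPR, TPR)."""
    pos = sum(1 for a in actual if a == positive)
    neg = len(actual) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one instance of each class")
    ranked = sorted(zip(scores, actual), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up example: true class and predicted probability of 'bad'.
actual = ["bad", "bad", "good", "bad", "good", "good", "good", "good"]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
predicted = ["bad" if s >= 0.5 else "good" for s in scores]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print(rates(tp, fp, fn, tn))
print("AUC:", auc(roc_points(actual, scores)))
```

When interpreting your own matrix, bear in mind that the two kinds of error need not cost the same: a false negative here is a bad customer who is given a loan, while a false positive is a good customer who is turned away.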