
R for Data: Exploring and Visualizing Data - Loan Automation Example (2)

The data visualizations so far have given us a clear sense of how much each independent variable contributes to the target variable. The next step is to examine the data for missing values and outliers, since we need a clean dataset to build a predictive model and generate predictions on the test data.

After data exploration and data visualization, we need to focus on the following steps.

Missing value imputation

As we already know, missing values are never a good sign in a dataset: they severely compromise our ability to make correct predictions. We therefore need to handle them with whichever imputation technique gives us good accuracy.

Only a few variables possess missing values, so we will impute them to tidy the dataset.
The missing values in Loan_Status belong to the test data, where it is the value we need to predict; apart from that column, the rest are genuine missing values.


    full$ApplicantIncome[is.na(full$ApplicantIncome)] <- median(full$ApplicantIncome, na.rm = TRUE)

    full$LoanAmount[is.na(full$LoanAmount)] <- median(full$LoanAmount, na.rm = TRUE)

    full$Credit_History[is.na(full$Credit_History)] <- median(full$Credit_History, na.rm = TRUE)

    full$Loan_Amount_Term[is.na(full$Loan_Amount_Term)] <- median(full$Loan_Amount_Term, na.rm = TRUE)

We can check for missing values again in the same way to make sure we have taken care of them. Next, we need to check whether the dataset contains any outliers that may have crept in through data-entry mistakes or other errors.
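As a minimal sketch of that check (a toy data frame stands in for the real `full` here, so the block is self-contained):

```r
# Toy data frame standing in for `full`; the real one comes from the loan data
full <- data.frame(ApplicantIncome = c(5000, NA, 7000),
                   LoanAmount      = c(120, 150, NA))

# Number of missing values in each column
na_counts <- colSums(is.na(full))
na_counts
```

After imputation, every entry of `na_counts` (except Loan_Status, whose NAs mark the test rows) should be zero.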

Outliers - We need to make sure that there are no invalid outliers that may compromise the accuracy of our model.


   library(outliers)   # provides grubbs.test
   grubbs.test(full$ApplicantIncome, type = 10)

### output 
Grubbs test for one outlier

data:  full$ApplicantIncome
G = 13.31300, U = 0.81896, p-value < 2.2e-16
alternative hypothesis: highest value 81000 is an outlier

As we see, the Grubbs test flags the applicant with the highest income (81000) as an outlier. It falls to us to decide whether to treat it as one; if we do, we would impute it the same way we imputed the missing values. A boxplot helps us visualize the situation.

Here we see that there are indeed certain incomes well above the usual range that register as outliers, but since they can genuinely be the higher incomes of real applicants, we leave them as they are. The same process can be repeated for the other variables as well.
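A minimal boxplot sketch with ggplot2 (the toy incomes here stand in for the real `full$ApplicantIncome`, including the flagged 81000 value):

```r
library(ggplot2)

# Toy incomes standing in for full$ApplicantIncome
incomes <- data.frame(ApplicantIncome = c(2500, 3800, 4100, 5500, 6000, 81000))

# Extreme values show up as points beyond the whiskers
p <- ggplot(incomes, aes(x = "", y = ApplicantIncome)) +
  geom_boxplot() +
  labs(x = NULL, y = "Applicant income")
p
```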

Feature engineering 
This is arguably the most critical portion, and it tests one's thinking skills: you can create different variables from the same information and thereby increase the prediction accuracy or interpretability of the model. We create some new features below, but you are free to create others that reflect your own creativity and understanding of the problem at hand.

Creating features

 full$total_income <- full$ApplicantIncome + full$CoapplicantIncome

 library(dplyr)   # for %>%, mutate and case_when

 ## pipe function used to create income factors
 df3 <- full %>%
   mutate(income_status = case_when(
     total_income < 5000                           ~ "low income",
     total_income >= 5000  & total_income < 13000  ~ "middle income",
     total_income >= 13000 & total_income < 28000  ~ "good income",
     total_income >= 28000                         ~ "high income",
     TRUE                                          ~ "NA"
   ))
 full <- df3

 ## feature engineering for dependent members
 df4 <- full %>%
   mutate(dependent_members = case_when(
     Dependents == "0"                     ~ "no dependents",
     Dependents == "1" | Dependents == "2" ~ "moderately dependent",
     Dependents == "3+"                    ~ "highly dependent",
     TRUE                                  ~ "NA"
   ))
 full <- df4

You can go one step further and create more variables from the independent ones that have a positive effect on the model.
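For example, one hypothetical extra feature is an approximate monthly installment (EMI) and the income left over after paying it; the names `emi` and `balance_income` are my own, not part of the original dataset (toy rows stand in for `full`, and LoanAmount is in thousands in this dataset):

```r
library(dplyr)

# Toy rows standing in for `full`
full <- data.frame(LoanAmount        = c(120, 150),
                   Loan_Amount_Term  = c(360, 180),
                   ApplicantIncome   = c(5000, 4000),
                   CoapplicantIncome = c(0, 1500))

full <- full %>%
  mutate(emi = LoanAmount * 1000 / Loan_Amount_Term,  # rough monthly payment
         balance_income = ApplicantIncome + CoapplicantIncome - emi)
```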

   ggplot(full[1:614,], mapping = aes(x = factor(Loan_Status), fill = income_status)) +
     geom_bar(position = "dodge")

Preparing for model 

The next step is to convert the categorical variables into factors so that our machine learning algorithm can handle them properly and give us good predictive accuracy.

 ## converting variables into factors in order to avoid warning messages
 full$Gender            <- as.factor(full$Gender)
 full$Married           <- as.factor(full$Married)
 full$Education         <- as.factor(full$Education)
 full$Self_Employed     <- as.factor(full$Self_Employed)
 full$Property_Area     <- as.factor(full$Property_Area)
 full$Dependents        <- as.factor(full$Dependents)
 full$dependent_members <- as.factor(full$dependent_members)
 full$income_status     <- as.factor(full$income_status)

   train <- full[1:614,]
   test <- full[615:981,]

   ## removing the identifier Loan_ID
   train <- train[,-1]

ML algorithm -

This is where we finally get to see the result of our hard work on the dataset during the earlier stages. I apply the random forest algorithm here, but you are free to apply other algorithms of your choice, and even to perform parameter tuning to increase your accuracy further.

Random forest

 library(randomForest)

 ## the test set needs a placeholder Loan_Status so predict() sees the same columns
 test$Loan_Status <- "0"
 test$Loan_Status <- as.factor(test$Loan_Status)
 train$Loan_Status <- as.factor(train$Loan_Status)

 ## prediction with random forest
 fit_classify <- randomForest(Loan_Status ~ ., train, importance = TRUE, ntree = 800)

 pred <- predict(fit_classify, test)

 ## saving it as a submission file
 solution <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = pred)
 # Write the solution to file
 write.csv(solution, file = 'loan_prediction_solution.csv', row.names = FALSE)

Random forest also provides an importance measure that tells us the relative importance of the independent variables, so we can remove the least important ones to improve accuracy.
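As a sketch of reading that importance measure (shown here on the built-in iris data so the block is self-contained; with the model above you would pass `fit_classify` instead of `fit`):

```r
library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 100)

imp <- importance(fit)   # matrix with MeanDecreaseAccuracy / MeanDecreaseGini per variable
imp
varImpPlot(fit)          # dot chart of the same information
```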

You can also use other ML algorithms or even try ensembling to improve accuracy and produce better results.
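One simple ensembling sketch is majority voting over the class predictions of several models; the prediction vectors `p1`-`p3` below are made up for illustration:

```r
# Hypothetical class predictions from three different models
p1 <- c("Y", "N", "Y", "Y")
p2 <- c("Y", "N", "Y", "N")
p3 <- c("N", "Y", "Y", "Y")

votes <- cbind(p1, p2, p3)

# Majority vote per observation (ties would go to the first level alphabetically)
majority <- apply(votes, 1, function(r) names(which.max(table(r))))
majority   # "Y" "N" "Y" "Y"
```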