In the previous part, we explored the data by creating both univariate and multivariate visualizations, which gave us a clear picture of the relationships between the different variables in the dataset.
Now we move to the next stage: preparing the data and selecting the variables to be used for predicting the target variable.

Model preparation

Missing values
As we observed, there are missing values that we need to take care of before modelling. To impute a categorical variable like Outlet_Size, we use the mode. For higher accuracy, the MICE package can be used to impute both categorical and continuous variables, and the same steps can be replicated on the test dataset.

Outliers
We then check for outliers, see how many are present, and treat them accordingly.

Modification of features
Next we modify the features to make them suitable as input to a machine learning algorithm. We also remove identifiers such as Item_Identifier and Outlet_Identifier so that we can build an ML model on the remaining features.

Machine learning
Finally, we apply ML algorithms to predict the target. We start with a random forest; from there, one can apply different algorithms, or techniques such as cross-validation and ensembling, to improve accuracy. To check the model's behaviour on the training data, we can remove the least important variables one by one and observe the effect on accuracy. The result can be saved as a CSV file and submitted. We can also try creating new features, and use boosting algorithms such as XGBoost or GBM to further enhance the results.
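The mode-imputation step described above can be sketched as follows in R. The column name Outlet_Size follows the dataset used in this series; the small data frame here is made up purely for illustration.

```r
# Impute a categorical column with its mode (most frequent level).
# NAs are excluded from the frequency count; empty strings are also filled.
impute_mode <- function(x) {
  tab <- table(x[x != ""])                # frequencies, ignoring blanks (NAs dropped by table)
  mode_val <- names(tab)[which.max(tab)]  # most frequent level
  x[is.na(x) | x == ""] <- mode_val       # fill missing and blank entries
  x
}

# Illustrative data frame with a blank and an NA in Outlet_Size
train <- data.frame(Outlet_Size = c("Medium", "", "Small", NA, "Medium"),
                    stringsAsFactors = FALSE)
train$Outlet_Size <- impute_mode(train$Outlet_Size)
```

For multivariate imputation of several columns at once, the `mice()` function from the MICE package can replace this simple rule.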
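One common way to treat the outliers mentioned above is to cap extreme values at chosen percentiles (winsorising). This is a sketch of that idea, not the only valid treatment; the cutoffs are assumptions you should tune to your data.

```r
# Cap a numeric vector at the 1st and 99th percentiles.
cap_outliers <- function(x, lower = 0.01, upper = 0.99) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[x < q[1]] <- q[1]   # floor extreme low values
  x[x > q[2]] <- q[2]   # cap extreme high values
  x
}

capped <- cap_outliers(c(1, 2, 3, 100))  # 100 is pulled down toward the 99th percentile
```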
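The identifier-removal and random forest steps can be sketched as below. This assumes the randomForest package is installed; the column names mirror the dataset used in this series, but the data frame here is synthetic, so the fitted model is for illustration only.

```r
library(randomForest)

set.seed(42)
# Synthetic stand-in for the training data (real column names, fake values)
train <- data.frame(
  Item_Identifier   = paste0("FD", 1:200),
  Outlet_Identifier = paste0("OUT", sample(1:5, 200, replace = TRUE)),
  Item_MRP          = runif(200, 30, 270),
  Item_Weight       = runif(200, 5, 20),
  Item_Outlet_Sales = runif(200, 100, 5000)
)

# Drop identifiers so the model learns only from genuine features
feats <- setdiff(names(train), c("Item_Identifier", "Outlet_Identifier"))
rf <- randomForest(Item_Outlet_Sales ~ ., data = train[, feats],
                   ntree = 200, importance = TRUE)

# Variable importance guides which features to drop one by one
print(importance(rf))

# Save predictions in a CSV for submission
pred <- predict(rf, train[, feats])
write.csv(data.frame(Item_Identifier   = train$Item_Identifier,
                     Outlet_Identifier = train$Outlet_Identifier,
                     Item_Outlet_Sales = pred),
          "submission.csv", row.names = FALSE)
```

From here, swapping `randomForest()` for `xgboost` or `gbm` calls, or wrapping the fit in a cross-validation loop, follows the same pattern.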