Let's understand: what the hell is regularization?

When a model fits the training data well but has poor predictive performance and generalization power on new data, we have an over-fitting problem.

In the visualization, the red and blue dots (data points) are separated by two classifiers: the green line (over-fitting) and the black line (best fit).

Here is the best answer at quora.com:
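To make the idea of over-fitting concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration (the sine-shaped target, the noise level, the polynomial degrees); it is not the data or the classifiers from the visualization. It fits polynomials of increasing degree and compares the error on the training points with the error on fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical ground-truth relationship, chosen only for this demo.
    return np.sin(np.pi * x)

# Small noisy training set and a larger held-out set from the same process.
x_train = rng.uniform(-1, 1, 15)
y_train = true_fn(x_train) + rng.normal(0, 0.3, 15)
x_test = rng.uniform(-1, 1, 200)
y_test = true_fn(x_test) + rng.normal(0, 0.3, 200)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

The training error can only go down as the degree grows, because each higher-degree polynomial can reproduce every lower-degree one. The held-out error is the number to watch: when it rises while the training error keeps falling, the model has started to fit the noise.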
First of all, I want to clarify how this problem of over-fitting arises. When someone wants to model a problem, say predicting a person's wage from their age, they will first try a linear regression model with age as the independent variable and wage as the dependent one. This model will mostly fail, since it is too simple.

Then, you might think of adding a few more explanatory variables. Your model becomes more complex. You find that your results are quite good, but not as good as you wish. So you add more variables: location, profession of parents, social background, number of children, weight, number of books, preferred color, best meal, last holiday destination, and so on and so forth. Your model will do well on the training data, but it is probably over-fitting.

So how do you solve this? This is where regularization comes in. You penalize your loss function by adding a multiple of an L1 (LASSO[2]) or an L2 (Ridge[3]) norm of your weights vector w (the vector of learned parameters in your linear regression). You get the following equation:

L(X,Y) + λN(w)

(L is the loss function; N is either the L1, L2 or any other norm.) This will help you avoid over-fitting.

Finally, you might ask how to choose the regularization parameter λ. One possible answer is to use cross-validation.
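As a rough sketch of the penalized objective L(X,Y) + λN(w), the function below assumes that L is the mean squared error of a linear model, which the text does not specify. The name `penalized_loss` and its arguments are hypothetical. For the ridge case it uses the squared L2 norm, which is the usual convention for ridge regression.

```python
import numpy as np

def penalized_loss(X, y, w, lam, norm="l2"):
    """Return L(X, y) + lam * N(w) for a linear model with weights w.

    Assumption: L is the mean squared error. N is the L1 norm (LASSO)
    or the squared L2 norm (ridge).
    """
    residuals = X @ w - y
    data_loss = np.mean(residuals ** 2)
    if norm == "l1":
        penalty = np.sum(np.abs(w))
    elif norm == "l2":
        penalty = np.sum(w ** 2)
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return float(data_loss + lam * penalty)

# Tiny example: these weights fit the data exactly, so only the penalty remains.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])
print(penalized_loss(X, y, w, lam=0.1, norm="l1"))
print(penalized_loss(X, y, w, lam=0.1, norm="l2"))
```

With λ = 0 the penalty vanishes and you are back to the plain loss. Larger values of λ make large weights more expensive, which pushes the model toward simpler solutions.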
For more information, read [Cortez et al., 2009]. Let's load it.

Mini Batch Gradient Descent

Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches, which are used to calculate the model error and update the model coefficients. Implementations may either sum the gradient over the mini-batch or average it, which further reduces the variance of the gradient. Mini-batch gradient descent seeks a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.

Above we specified the parameters that every batch of gradient descent will use. Remember that lambda is the regularization parameter here; it has the most influence on your results.

I tried many settings, so let's jump to the best one: lambda = 0.1 and alpha = 5e-05. With these values the model is not over-fitted; it is the best fit I found on this data set.

So, machine learners, here we learnt how to make our predictions accurate without letting the model learn the noise in the data.

What's next: hyper-parameter tuning and cross-validation.
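The mini-batch procedure described above can be sketched as follows. This is not the tutorial's original code. It assumes a linear model with a mean-squared-error loss and an L2 penalty, and it assumes that alpha denotes the learning rate. The function name, the batch size, the epoch count and the synthetic data are all hypothetical. The defaults for `lam` and `alpha` mirror the values reported above; the demo uses a larger step because its features are already on a unit scale.

```python
import numpy as np

def minibatch_gd_ridge(X, y, lam=0.1, alpha=5e-5, batch_size=32,
                       epochs=100, seed=0):
    """Mini-batch gradient descent for L2-regularized linear regression."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0  # the bias is left unpenalized, a common choice
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            error = Xb @ w + b - yb
            # Gradient averaged over the batch, plus the L2 penalty term.
            grad_w = 2 * Xb.T @ error / len(idx) + 2 * lam * w
            grad_b = 2 * error.mean()
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b

# Synthetic demo data (an assumption, not the data set used in the text).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 200)

w_plain, _ = minibatch_gd_ridge(X_demo, y_demo, lam=0.0, alpha=0.05, epochs=50)
w_reg, _ = minibatch_gd_ridge(X_demo, y_demo, lam=10.0, alpha=0.05, epochs=50)
print("no penalty:   ", np.round(w_plain, 3))
print("strong penalty:", np.round(w_reg, 3))
```

Averaging the gradient over the batch keeps the step size independent of the batch size. Comparing the two printed weight vectors shows the effect of lambda: the stronger the penalty, the more the weights shrink toward zero.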