From our earlier post, we delved into the reasons for using a Sigmoid function in Logistic Regression rather than a normal linear function. In this post, we continue with the Cost Function.

The Cost Function is important because it gives us the errors of our predictions and, subsequently, is needed for our learning algorithm. Concretely, we would like to minimise the errors of our predictions, i.e., to minimise the cost function. Ideally, if all our errors are zero, it is akin to a game of darts in which every dart hits the bull's-eye. On the other side of the argument, if our errors are very high, our predicted values are missing all of the true observed values, i.e., our darts are generally missing the bull's-eye.

While we would like to have zero errors when running our cost function over our hypothesised values (a perfect prediction for every value), this is not necessarily a good scenario, as it can lead to a phenomenon called "high variance". We will touch more on this in later writings.

Do note that I use cost and loss interchangeably, but for those accustomed to Andrew Ng's lectures, the "loss function" is for a single training example whereas the "cost function" takes the average over all training examples.

Going back to formulating the loss function, a typical choice would be to minimise the sum of squared errors, i.e. the sum of (ŷ⁽ⁱ⁾ - y⁽ⁱ⁾)², whereby i goes from 1 to m for m training examples. (In AI and Machine Learning, many notations get thrown around, and it is helpful to get a sense of them.) For Logistic Regression, however, the loss for a single training example is defined as:

-log(hθ(x))      if y = 1
-log(1 - hθ(x))  if y = 0

To see why this loss function makes sense, assume y = 1 and focus on the top equation, -log(hθ(x)). We want it to be very negative, as this is a loss function (recall that we want to minimise the loss function as an objective). That said, recall that hθ(x) is limited to a maximum value of 1, because the Sigmoid function constrains the estimated hypothesised value to between 0 and 1: hθ(x) = ŷ = σ(z). Hence, when y = 1, the loss is minimised to the extreme when ŷ = 1: a perfect prediction, which incurs little or no loss/cost.

Conversely, if y = 0, focus on the bottom part of the equation, -log(1 - hθ(x)). We would like this to be very negative, due to our loss minimisation objective. As a result, 1 - hθ(x) would have to be very large, and the corollary of that is hθ(x) being very small. However, recall that hθ(x) is limited to a minimum value of 0, since the Sigmoid function keeps the estimated hypothesised value between 0 and 1. Hence, when y = 0, the loss is minimised to the extreme when ŷ = 0: again a perfect prediction, which incurs little or no loss/cost.

The corresponding graphs of -log(hθ(x)) and -log(1 - hθ(x)) illustrate the above points.
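Since the original plots are not reproduced here, the two curves can be regenerated with a minimal sketch, assuming NumPy and Matplotlib are available; the helper names below, such as `sigmoid` and `logistic_loss`, are illustrative and not from the post:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # h_theta(x) = y_hat = sigmoid(z): squashes any real z into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y_hat, y):
    # Loss for a single training example:
    #   -log(y_hat)      when y = 1
    #   -log(1 - y_hat)  when y = 0
    return -np.log(y_hat) if y == 1 else -np.log(1.0 - y_hat)

# Plot the two branches of the loss over the admissible range of y_hat.
y_hat = np.linspace(0.001, 0.999, 500)  # stay strictly inside (0, 1) to avoid log(0)
plt.plot(y_hat, -np.log(y_hat), label="y = 1: -log(y_hat)")
plt.plot(y_hat, -np.log(1.0 - y_hat), label="y = 0: -log(1 - y_hat)")
plt.xlabel("y_hat = h_theta(x)")
plt.ylabel("loss")
plt.legend()
plt.show()

# Sanity check: a confident, correct prediction incurs almost no loss.
print(logistic_loss(sigmoid(5.0), 1))    # y = 1, y_hat close to 1 -> loss near 0
print(logistic_loss(sigmoid(-5.0), 0))   # y = 0, y_hat close to 0 -> loss near 0
```

As ŷ approaches 1, the y = 1 curve flattens towards zero loss while the y = 0 curve blows up, and vice versa as ŷ approaches 0, which is exactly the behaviour argued above.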