How Does Support Vector Machine (SVM) Algorithm Work?

In this blog, I am going to capture important points about the Support Vector Machine (SVM) algorithm. These notes help me prepare for data science interviews.

  • SVMs can solve complex machine learning problems and have numerous advantages over other classification algorithms, such as the ability to deal with large data sets, classify nonlinearly separable data, etc.
  • It is important to remember that SVMs belong to the class of linear machine learning models (logistic regression is also a linear model).
  • To summarize, SVMs are linear models that require numeric attributes. In case the attributes are non-numeric, you need to convert them to a numeric form in the data preparation stage.
  • Hyperplanes: A hyperplane is a boundary that classifies the data. It can be a line, a 2D plane, or even an n-dimensional plane that is beyond our imagination.
    • 2-dimensional hyperplane:
      • A line that separates one class from the other is also called a hyperplane.
      • W0 + W1x1 + W2x2 = 0, where x1 and x2 are the features, W1 and W2 are the coefficients, and W0 is the intercept.
    • 3-dimensional hyperplane:
      • In 3 dimensions, the hyperplane will be a plane with the expression ax + by + cz + d = 0. The plane divides the data set into two halves. Data points above the plane represent one class, while data points below the plane represent the other class.
      • In general, if you derive the d-dimensional hyperplane from d attributes, you can write the expression as W0 + W1x1 + W2x2 + ... + Wdxd = 0.
  • Linear discriminator.
    • Similar to the 2D and 3D expressions, an n-dimensional hyperplane also follows the general rule: all points above the plane yield a value greater than 0, and those below it yield a value less than 0, when plugged into the expression of the hyperplane (see the sketch below).
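
As a quick illustration of the rule above, here is a minimal NumPy sketch (my own toy example, with made-up coefficients) that plugs a few points into the hyperplane expression and uses the sign of the result to assign the class:

```python
import numpy as np

# Hypothetical 2D hyperplane W0 + W1*x1 + W2*x2 = 0 with W0 = -1, W1 = 2, W2 = 3
w0 = -1.0
w = np.array([2.0, 3.0])

points = np.array([
    [1.0, 1.0],   # 2*1 + 3*1 - 1 = 4 > 0  -> one class
    [0.0, 0.0],   # -1 < 0                 -> the other class
    [0.5, 0.0],   # 0                      -> lies exactly on the hyperplane
])

scores = w0 + points @ w              # value of the hyperplane expression
labels = np.where(scores > 0, 1, -1)  # sign decides the class
print(scores)   # [ 4. -1.  0.]
print(labels)   # [ 1 -1 -1]
```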
  • Maximal Margin Classifier: optimum hyperplane.
    • There could be multiple lines (hyperplanes) possible, which perfectly separate the two classes. But the best line is the one that maintains the largest possible equal distance from the nearest points of both the classes. So for the separator to be optimal, the margin — or the distance of the nearest point to the separator — should be maximum. This is called the Maximal Margin Classifier.
    • The maximal margin line (hyperplane), although it separates the two classes perfectly, is very sensitive to the training data. This means that the Maximal Margin Classifier will perform perfectly on the training data set. But on the unseen data, it may perform poorly. Also, there are cases where the classes cannot be perfectly separated. But the Soft Margin Classifier helps in solving this problem.
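
A minimal sketch of the Maximal Margin Classifier idea (my own toy example, using scikit-learn): on perfectly separable data, a linear SVM fitted with a very large misclassification penalty behaves almost like a hard-margin classifier, and the margin width can be read off as 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC

# Two tiny, linearly separable clusters (toy data assumed for illustration)
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# In scikit-learn, C is the penalty on misclassification, so a very large
# value approximates the hard (maximal) margin on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin_width = 2 / np.linalg.norm(w)   # distance between the two margin lines
print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```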
  • Soft Margin Classifier:
    • The Support Vector Classifier is also called the Soft Margin Classifier because, instead of searching for the margin that classifies each and every data point to the correct class, the Soft Margin Classifier allows some observations to fall on the wrong side. Only the points that are close to the hyperplane are considered for constructing it, and these points are called support vectors.
    • Like the Maximal Margin Classifier, the Support Vector Classifier also maximises the margin; but the margin, here, will allow some points to be misclassified.
    • So, to select the best-fit Support Vector Classifier, the notion of slack variables (epsilons, ε) helps by comparing candidate classifiers.
    • A slack variable is used to control misclassifications. It tells you where an observation is located relative to the margin and the hyperplane.
      • Each data point has a slack value associated with it, according to where the point is located.
      • The value of slack lies between 0 and +infinity.
      • Lower values of slack are better than higher values (slack = 0 implies a correct classification, slack > 1 implies an incorrect classification, and slack between 0 and 1 means the point is classified correctly but violates the margin).
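
To make the slack values concrete, here is a small sketch (my own example, using scikit-learn and the standard hinge relationship slack = max(0, 1 - y * f(x)), where f(x) is the decision function):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-D data with one deliberately noisy point so the classes overlap
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [1.5]])
y = np.array([-1, -1, -1, 1, 1, 1])   # the last point sits amid the other class

clf = SVC(kernel="linear", C=1.0).fit(X, y)

f = clf.decision_function(X)          # signed score f(x) for each point
slack = np.maximum(0, 1 - y * f)      # 0        -> correct and outside the margin
                                      # (0, 1)   -> correct side but inside the margin
                                      # > 1      -> on the wrong side (misclassified)
for xi, yi, si in zip(X.ravel(), y, slack):
    print(f"x={xi:4.1f}  y={yi:+d}  slack={si:.2f}")
```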
  • Cost of misclassification:
    • The cost of misclassification C acts as a budget: it is greater than or equal to the summation of the epsilons of all the data points (sum of all εi ≤ C).
    • When C is large, the slack variables can be large, i.e. you allow a larger number of data points to be misclassified or to violate the margin. So you get a hyperplane where the margin is wide and misclassifications are allowed. In this case, the model is flexible, more generalizable, and less likely to overfit. In other words, it has a high bias.
    • On the other hand, when C is small, you force the individual slack variables to be small, i.e. you do not allow many data points to fall on the wrong side of the margin or the hyperplane. So, the margin is narrow and there are few misclassifications. In this case, the model is less flexible, less generalizable, and more likely to overfit. In other words, it has a high variance.
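
One caution before a code sketch: the C described above is a budget on the total slack, whereas scikit-learn's C parameter is the penalty on slack, so it behaves in the opposite direction (a large scikit-learn C corresponds to a small budget, i.e. a narrow margin). With that caveat, here is a rough comparison on toy data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs so some misclassification is unavoidable (toy data)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for penalty in [0.01, 100]:
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"penalty C={penalty:>6}: margin width = {margin:.2f}, "
          f"support vectors = {clf.n_support_.sum()}")

# A small penalty (large slack budget) gives a wide margin with many support
# vectors; a large penalty (small budget) gives a narrower margin.
```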
  • Kernel:
    • Kernels enable the linear SVM model to separate nonlinearly separable data points, i.e. they let SVMs deal with nonlinear data sets.
  • Mapping Nonlinear Data to Linear Data
    • Transform nonlinear boundaries to linear boundaries by applying certain functions to the original attributes. The original space (X, Y) is called the attribute space, and the transformed space (X’, Y’) is called the feature space.
  • Feature Transformation
    • The process of transforming the original attributes into a new feature space is called ‘feature transformation’. However, as the number of attributes increases, there is an exponential increase in the number of dimensions in the transformed feature space (see the sketch below).
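
A small sketch of feature transformation (my own toy example): points labelled by whether they fall inside a circle are not linearly separable in the original attribute space (x1, x2), but become linearly separable once the squared attributes are added as new features:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: class 1 inside a circle of radius 1, class 0 outside it
X = rng.uniform(-2, 2, size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Feature transformation: (x1, x2) -> (x1, x2, x1^2, x2^2)
X_feat = np.column_stack([X, X ** 2])

acc_attribute_space = SVC(kernel="linear").fit(X, y).score(X, y)
acc_feature_space = SVC(kernel="linear").fit(X_feat, y).score(X_feat, y)
print("linear SVM in attribute space:", acc_attribute_space)  # ~ majority-class baseline
print("linear SVM in feature space:  ", acc_feature_space)    # close to 1.0
```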
  • Kernel Trick
    • As feature transformation results in a large number of features, it makes the modelling (i.e. the learning process) computationally expensive.
    • The key fact that makes the kernel trick possible is that, to find the best-fit model, the learning algorithm only needs the inner products of the observations (Xiᵀ · Xj). It never uses the individual data points X1, X2, etc. in isolation.
    • Kernel functions are used to bypass the explicit transformation process from the attribute space to the feature space; they rather do it implicitly (a numeric sketch of this appears below, after the list of kernel types).
      • The benefit of an implicit transformation is that now, you do not need to
        • Manually find the mathematical transformation needed to convert a nonlinear feature space to a linear feature space
        • Perform computationally heavy transformations
    • The three most popular types of kernel functions are
      • The linear kernel: This is the same as the Support Vector Classifier or the hyperplane, without any transformation at all.
      • The polynomial kernel: This kernel function is capable of creating nonlinear, polynomial decision boundaries.
      • The radial basis function (RBF) kernel: This is the most complex kernel function; it is capable of transforming highly nonlinear feature spaces to linear ones. It is even capable of creating elliptical (i.e. enclosed) decision boundaries
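
To see the kernel trick numerically (a sketch with a hypothetical degree-2 feature map of my own choosing): the polynomial kernel K(x, z) = (x·z)² returns exactly the inner product that an explicit transformation to the quadratic feature space would give, without ever constructing that space:

```python
import numpy as np

def phi(v):
    """Explicit quadratic feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = v
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel (no constant term): K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))   # inner product in the transformed feature space: 121.0
print(poly_kernel(x, z))        # same value, computed directly in the attribute space: 121.0
```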
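And a quick comparison of the three kernels listed above (my own toy experiment, on a data set with an enclosed, circular class boundary):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# make_circles gives an enclosed (roughly elliptical) decision boundary
X, y = make_circles(n_samples=300, factor=0.5, noise=0.1, random_state=0)

models = {
    "linear": SVC(kernel="linear"),
    "poly":   SVC(kernel="poly", degree=2, coef0=1),  # quadratic decision boundary
    "rbf":    SVC(kernel="rbf"),
}

for name, clf in models.items():
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:>6} kernel: CV accuracy = {score:.2f}")

# Expect the linear kernel to stay near 0.5 here, while the poly and RBF
# kernels capture the circular boundary and score much higher.
```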
  • Kernel Parameter
    • In nonlinear kernels such as the RBF, you use the parameter gamma to control the amount of nonlinearity in the model (some texts and packages parameterise the RBF with sigma instead, where gamma is roughly 1/(2σ²), so a higher gamma corresponds to a lower sigma).
    • The higher the value of gamma, the more nonlinearity is introduced.
    • The lower the value of gamma, the lesser the nonlinearity. Apart from gamma, you also have the hyperparameter C or the cost (with all types of kernels).
    • Cross-validation (or hit-and-trial, if you are only choosing from 2-3 types of kernels) is a good strategy for choosing the appropriate kernel and its hyperparameters (see the sketch below).
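
A sketch of the cross-validation strategy above, using scikit-learn's GridSearchCV (the parameter values are arbitrary choices of mine; note that scikit-learn names the RBF parameter gamma and treats C as the misclassification penalty):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy nonlinear data set (assumed for illustration)
X, y = make_circles(n_samples=300, factor=0.5, noise=0.1, random_state=0)

# One sub-grid per kernel, so each kernel only sees its own hyperparameters
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1, 10], "C": [0.1, 1, 10]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best parameters: ", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```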
