A decision tree is a predictive modeling approach used in machine learning. It works by moving from observation to observation (represented as branches) to reach conclusions about a target value (represented as leaves). Decision trees are a great tool for representing decision analysis visually and are used to solve both classification and regression problems. A decision tree follows If-Then-Else logic: evaluation starts at the tree’s root, and the final decisions are represented at the leaves.
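This If-Then-Else logic can be sketched as a small, hand-written tree. The loan-approval features, thresholds, and labels below are made up purely for illustration:

```python
# A hypothetical, hand-written decision tree expressed as plain
# If-Then-Else logic. Feature names and thresholds are invented.
def approve_loan(income, credit_score):
    if income >= 50_000:            # root split
        if credit_score >= 650:     # internal node
            return "approve"        # leaf
        return "review"             # leaf
    return "reject"                 # leaf

print(approve_loan(60_000, 700))  # -> approve
```

A learned decision tree has exactly this shape; training algorithms simply choose the splits and thresholds automatically instead of by hand.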

**Types of Decision Tree**

Decision trees can be categorized depending on the type of target variable. They are of two types:

- Categorical Variable Decision Tree (Classification Tree): They have a categorical target variable.
- Continuous Variable Decision Tree (Regression Tree): They have a continuous target variable.

**Assumptions for Decision Tree**

A decision tree rests on some critical assumptions:

- The whole starting set is considered as the root.
- Feature values are expected to be categorical; continuous features are discretized before the model is built.
- Records are distributed recursively based on the values of the attributes.
- A statistical measure is used to decide the order in which attributes are placed as the root or as internal nodes of the decision tree.

Decision trees are known to follow the Sum of Products (SOP) form, also known as Disjunctive Normal Form. Each branch from the root to a leaf node is a conjunction (product) of attribute values, and the different branches that end in the same class together form a disjunction (sum).

**The Working of a Decision Tree**

The accuracy of a decision tree depends on making the right strategic splits, and the splitting criteria differ between classification and regression trees.

A decision tree uses algorithms to decide whether a node should be divided into two or more sub-nodes; a good split makes the resulting sub-nodes more homogeneous with respect to the target variable. The choice of algorithm depends on the target variable. Some of the algorithms used in decision trees are:

- ID3 (Iterative Dichotomiser)
- C4.5
- CART: Classification And Regression Tree
- CHAID: Chi-squared Automatic Interaction Detection
- MARS: Multivariate Adaptive Regression Splines

**ID3 (Iterative Dichotomiser)**

ID3 uses a top-down greedy search through the space of branches, with no backtracking: it always makes the choice that appears best at that moment. ID3 follows the steps below:

- The original set, “S,” is the root node.
- Every iteration of the algorithm goes through all the unused attributes and calculates the attribute’s Entropy (H) and Information Gain (IG).
- Then it chooses the attribute which gives the highest IG or the lowest Entropy.
- Set “S” is then split by the selected attribute, resulting in a subset of the data.
- The algorithm then recurses on each subset, considering only those attributes that were not selected before.
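The entropy and information-gain calculations at the heart of these steps can be sketched in Python; the toy "outlook" attribute and labels below are invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) = -sum p_i * log2(p_i) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """IG = H(S) - sum_v (|S_v|/|S|) * H(S_v) when splitting on `attribute`."""
    total = entropy(labels)
    n = len(rows)
    # Group the label list by the attribute's value.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(subset) / n * entropy(subset) for subset in groups.values())
    return total - remainder

# Toy data: one record per dict, attribute -> value.
rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))  # -> 1.0 (a perfect split)
```

ID3 would compute this gain for every unused attribute and split on the one with the highest value.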

**Attribute Selection Measures**

Some of the criteria suggested for selecting attributes are:

**Entropy:** It measures the randomness of the information; higher entropy means greater unpredictability. For a set S with class proportions p_i, it is given by H(S) = −Σ p_i log2(p_i).

**Information Gain (IG):** It measures how well an attribute separates the training examples according to their target classification: IG(S, A) = H(S) − Σ_v (|S_v|/|S|) H(S_v), where S_v is the subset of S with value v for attribute A. ID3 uses this measure, looking for the attribute that gives the highest IG, i.e., the lowest remaining entropy.

**Gini Index:** The Gini Index is used by CART. It is a cost function used to evaluate splits in a dataset, given by 1 − Σ p_i² for class proportions p_i. It favors larger partitions and is easy to implement. It is calculated by following simple steps:

- Calculate the Gini Index for sub-nodes.
- Calculate the Gini Index of the split as the size-weighted average of the sub-nodes' Gini scores.
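These two Gini steps can be sketched as follows (the example labels are invented):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one node: 1 - sum p_i^2 over class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(*child_label_lists):
    """Gini of a split: each child's Gini, weighted by the child's size."""
    n = sum(len(child) for child in child_label_lists)
    return sum(len(child) / n * gini(child) for child in child_label_lists)

left = ["yes", "yes", "yes"]        # pure node  -> Gini 0.0
right = ["yes", "no"]               # mixed node -> Gini 0.5
print(gini_of_split(left, right))   # -> 0.2 (= 3/5 * 0.0 + 2/5 * 0.5)
```

CART would evaluate every candidate split this way and keep the one with the lowest weighted Gini.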

**Gain Ratio:** C4.5 uses the gain ratio to select attributes. The gain ratio reduces the bias of ID3, which prefers attributes with a large number of distinct values, by taking the number and size of the resulting branches into account before doing a split.

**Reduction in Variance:** This is used for continuous-target (regression) problems. It uses the simple formula for variance to decide the best split, calculated by following simple steps:

- Calculate the variance for every node.
- Calculate the variance of each split as the weighted average of variance of every node.
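A minimal sketch of these two variance steps (the example target values are invented):

```python
def variance(values):
    """Population variance of a node's target values."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def variance_of_split(*children):
    """Variance of a split: the size-weighted average of each child's
    variance. The split that lowers this most vs. the parent wins."""
    n = sum(len(c) for c in children)
    return sum(len(c) / n * variance(c) for c in children)

parent = [10.0, 12.0, 30.0, 32.0]
left, right = [10.0, 12.0], [30.0, 32.0]
print(variance(parent))                 # -> 101.0 (parent variance)
print(variance_of_split(left, right))   # -> 1.0 (a large reduction)
```

The reduction here is 101.0 − 1.0 = 100.0, so this split would be strongly preferred.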

**Chi-Squared:** This is one of the oldest classification techniques. It measures the statistical significance of the difference between a parent node and its sub-nodes, working on counts of success and failure. The higher the chi-squared value, the more significant the difference. Chi-squared can be calculated as follows:

- Calculate the chi-squared value of every individual sub-node.
- Calculate the chi-squared value of the split as the sum over all sub-nodes.

For each class in a sub-node, the contribution is (Actual − Expected)² / Expected, where the expected count is derived from the parent node's class distribution; the sub-node's chi-squared value is the sum of these contributions.
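A minimal sketch of this calculation, assuming expected counts are taken from the parent's class proportions (the counts below are invented):

```python
def chi_squared_node(observed, parent_proportions, node_size):
    """Chi-squared contribution of one sub-node.
    observed: dict class -> count in the sub-node.
    parent_proportions: dict class -> fraction of that class in the parent.
    Expected count = parent proportion * sub-node size."""
    total = 0.0
    for cls, p in parent_proportions.items():
        expected = p * node_size
        total += (observed.get(cls, 0) - expected) ** 2 / expected
    return total

def chi_squared_split(children, parent_proportions):
    """Sum the chi-squared statistic over all sub-nodes of a split."""
    return sum(
        chi_squared_node(obs, parent_proportions, sum(obs.values()))
        for obs in children
    )

# Parent: 10 "yes" and 10 "no" -> proportions 0.5 / 0.5.
parent_props = {"yes": 0.5, "no": 0.5}
children = [{"yes": 8, "no": 2}, {"yes": 2, "no": 8}]  # a strong split
print(chi_squared_split(children, parent_props))  # large value -> significant
```

A split that merely reproduces the parent's class mix would score 0.0, while the skewed children above score highly.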

**Avoiding Overfitting in Decision Trees**

When a dataset has many features, a decision tree tends to fit the training data too closely: with no limits set on it, a decision tree can reach 100% accuracy on the training set, which hurts accuracy on samples that are not part of the training set. Two ways to remove the overfitting problem are:

**Pruning:** This involves removing branches of the tree, starting from the leaves, in such a way that overall accuracy is not disturbed. The data is split into a training set and a validation set, and the tree is trimmed according to its performance on the validation set.

**Random Forest:** This is a type of ensemble learning, in which multiple models are combined to obtain better performance. It involves random sampling of the dataset and considering random subsets of the features when the nodes are split.

A technique called bagging is used to build up the ensemble of trees. In this technique, multiple training sets are created by randomized sampling with replacement (bootstrap sampling) from the original dataset. A single learning algorithm is then used to build a model on each sample, and the resulting predictions are combined in parallel by voting (for classification) or averaging (for regression).
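The bagging procedure can be sketched with a deliberately trivial base learner standing in for a full decision tree (the data and the majority-vote "stump" are invented for illustration):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) records WITH replacement (a bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def majority_stump(sample):
    """Trivial base learner standing in for a full decision tree:
    it simply predicts the majority class of its training sample."""
    labels = [label for _, label in sample]
    return Counter(labels).most_common(1)[0][0]

def bagged_predict(data, n_models=25, seed=0):
    """Train one stump per bootstrap sample, combine by majority vote."""
    rng = random.Random(seed)
    votes = [majority_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return Counter(votes).most_common(1)[0][0]

data = [(x, "yes") for x in range(7)] + [(x, "no") for x in range(3)]
print(bagged_predict(data))  # the majority class wins the vote
```

A random forest additionally restricts each tree to a random subset of features at every split, which this sketch omits.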

**Linear vs. Tree-Based Models**

Linear regression is preferred when a linear model approximates the relationship between the dependent and independent variables well; when the relationship is complex, a decision tree model is selected. A decision tree is also preferred when the model needs to be easy to explain.

**Example of Decision Tree**

In an example tree, each node splits the data, and the Gini value shown at a node measures its impurity. A node is pure when all of its records belong to the same class, and such a node is a leaf node.

**To Sum Up**

You can learn more about decision trees in detail by taking an **AI and Machine Learning course** online. In simple words, a decision tree is a machine learning model that you can use to make predictions. It is visually easy to understand and follows simple If-Then-Else logic. It has nodes and branches that split the dataset, and leaves that denote the decisions or target values.