
In this article, the Decision Tree, a fundamental machine learning concept, will be explained from scratch in a simple and understandable way.
What is a Decision Tree?
A Decision Tree is a tree-like structure used for making data-driven decisions. It is a supervised machine learning algorithm used for both classification and regression. It splits the data based on features to make decisions (predictions).
The Basic Components of a Decision Tree
A Decision Tree consists of four main components:
- Root Node: The starting point of the tree where the first split occurs.
- Internal Nodes: Points where the data is split further based on specific features; each one decides how to divide the data.
- Edges/Branches: Connections that represent the outcome of decisions and lead from one node to another.
- Leaf Nodes: The final nodes that represent the outcome, whether a classification or regression result.
Example of Decision Tree
Let's assume that a bank wants to evaluate its customers' loan applications. This can be done using the decision tree method.

Figure 1. Example of a decision tree visualization.
Figure 1 shows how the algorithm can evaluate customers' loan applications with a decision tree. In this figure, the blue box is the root node and the yellow boxes are internal nodes. The circles at the end are the leaf nodes.
Decision Tree Working Principle
As previously mentioned, decision trees make decisions by splitting the data based on certain features. At this point, the method tries to split the data in the best possible way to create homogeneous groups. The main criteria used in this process are Entropy and the Gini Index.
Let's assume we have a dataset that includes individuals' income levels and whether they would purchase a specific product. The decision tree starts by analyzing this data and uses mathematical calculations to identify the most effective question for the first split. This process relies on criteria such as Information Gain, Entropy, or the Gini Index, depending on the algorithm being used.
Entropy
Entropy, in its simplest form, tells us this: “Is there a consensus within the group, or is everyone saying something different?” For example, if a group of people all say the same thing - let’s say they all say “will buy the product” - then the group is quite clear, there’s no uncertainty, and entropy is close to zero. But if half the group says “will buy” and the other half says “won’t buy,” then the group is mixed, there’s no full consensus, meaning entropy is high. So, entropy measures how difficult the decision-making situation is.
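To make this concrete, here is a minimal sketch of the entropy calculation using only the Python standard library (the label lists are made-up examples, not part of the article's dataset):
from math import log2
from collections import Counter
def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return sum(-(count / total) * log2(count / total) for count in Counter(labels).values())
# Everyone agrees: no uncertainty, entropy is 0
print(entropy(['Buy', 'Buy', 'Buy', 'Buy']))   # 0.0
# Half and half: maximum uncertainty, entropy is 1
print(entropy(['Buy', 'Buy', 'No', 'No']))     # 1.0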
Information Gain
So, what is information gain? It’s actually directly related to entropy. Let's assume you have a large dataset, and there’s uncertainty within it. Now, you split this data into two parts, for example, “those with income above 50k” and “those below”. Let’s say that after this split, each group contains very clear answers: one group is almost entirely “will buy” and the other is “won’t buy.” In this case, you’ve done a great job splitting the data, and the uncertainty has significantly decreased. This reduction is called information gain. Initially, uncertainty was high, then you made a split, and now the picture is much clearer. The more uncertainty is reduced, the more information you’ve gained. That’s why it’s called information gain.
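As a rough illustration (again with made-up labels), information gain is simply the entropy of the whole group minus the weighted entropy of the groups produced by the split:
from math import log2
from collections import Counter
def entropy(labels):
    total = len(labels)
    return sum(-(count / total) * log2(count / total) for count in Counter(labels).values())
def information_gain(parent, children):
    # Entropy before the split minus the weighted entropy after the split
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted
# Splitting by "income above/below 50k" leaves each group completely pure
parent = ['Buy', 'Buy', 'Buy', 'No', 'No', 'No']
above_50k = ['Buy', 'Buy', 'Buy']
below_50k = ['No', 'No', 'No']
print(information_gain(parent, [above_50k, below_50k]))  # 1.0, all uncertainty removed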
Gini Index
Now let’s talk about the Gini index. Like entropy, Gini also measures the impurity or disorder in the data, but it does so using a different mathematical approach. The basic idea is this: “If I randomly pick two items from a group, what’s the probability they belong to different classes?” The more mixed the group is, the higher this probability. But if the group is completely unified - for example, if everyone says “won’t buy the product” - then there’s no chance of encountering a different opinion, and the Gini index is zero. Gini is similar to entropy but simpler, more practical, and easier to compute. That’s why algorithms often prefer Gini for speed and performance. For example, Python’s scikit-learn library uses Gini by default when building decision trees.
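A minimal sketch of the Gini calculation (one minus the sum of squared class probabilities), again with made-up labels:
from collections import Counter
def gini(labels):
    # Probability that two randomly picked items belong to different classes
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())
print(gini(['No', 'No', 'No', 'No']))     # 0.0, completely unified group
print(gini(['Buy', 'Buy', 'No', 'No']))   # 0.5, maximally mixed two-class group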
Based on these criteria, a question is formed and the first node is created. For example: "Is income > $5,000?" This would be the feature and threshold that provide the highest information gain or the lowest Gini value. Branches are then created based on the yes/no answers, and further questions are asked along those paths. Finally, when no further meaningful splits can be made, a leaf node is formed and a decision is made.
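To show how such a question can be found, here is a rough sketch (my own illustration, not scikit-learn's internal implementation) that tries every midpoint between neighbouring income values and keeps the threshold with the lowest weighted Gini impurity; it uses the same small dataset as the code example below:
from collections import Counter
def gini(labels):
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())
def best_threshold(values, labels):
    # Try every midpoint between consecutive sorted values and keep the purest split
    pairs = sorted(zip(values, labels))
    best = (None, float('inf'))
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        if not right:  # skip degenerate splits (possible with duplicate values)
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best
incomes = [3000, 4500, 6000, 8000, 12000, 2000, 7500, 5000]
buys = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No']
print(best_threshold(incomes, buys))  # (5500.0, 0.0), the same split the tree finds below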
Example in Python Code
In the code below, I will show how the example from the working principle section can be implemented in Python.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
# Example dataset
data = {
'Income': [3000, 4500, 6000, 8000, 12000, 2000, 7500, 5000],
'Buys': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No']
}
df = pd.DataFrame(data)
# Feature and Target
X = df[['Income']]
y = df['Buys']
# Decision Tree Model
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf.fit(X, y)
# Visualization of the Decision Tree
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=["Income"], class_names=clf.classes_, filled=True, rounded=True)
plt.title("Decision Tree Visualization")
plt.show()
Output

Figure 2. The output of the Python code.
As shown in Figure 2, the decision tree branches on the most informative question: whether the income is greater or less than 5500. Customers with an income of 5500 or less are classified as 'No', while those with a higher income are classified as 'Yes'.
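Once the tree is trained, it can also be used to predict new customers. Here is a small, hypothetical continuation of the code above (the incomes 4000 and 9000 are made up for illustration):
# Hypothetical new applicants, reusing the fitted clf model from the code above
new_customers = pd.DataFrame({'Income': [4000, 9000]})
print(clf.predict(new_customers))  # expected output: ['No' 'Yes']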
Advantages of Decision Trees
- Easy to Understand: Decision trees are simple and easy to understand, even for beginners. You can clearly see how decisions are being made.
- Different Data Types: Decision trees can work with both numbers and categories, making them flexible for many types of data.
- No Scaling: You don't need to scale or normalize the data before using a decision tree.
- Can Learn Non-linear Patterns: Decision trees can capture complex relationships between data points that other models may miss.
Disadvantages of Decision Trees
- Overfitting: This is the most important disadvantage. Decision trees can overfit the training data, meaning they perform well on the training set but poorly on new data.
- Greedy Algorithm: Decision trees use a greedy approach, making the locally optimal choice at each step. This can sometimes lead to suboptimal results because the model may not explore the best possible global solution.
- Biased: If some classes in the data are more common than others, the tree may favor the majority class.
- Instability: Small changes in the data can lead to a completely different tree being generated. This can make decision trees less stable compared to other models.
Conclusion
Decision Trees are a powerful and easy-to-understand tool in machine learning. They are especially useful when you want to clearly see how decisions are made based on your data. While they offer flexibility and work well with different types of data, it's important to be aware of their limitations, such as overfitting and sensitivity to small changes in data.
If you enjoyed this content, feel free to follow me and share this article to help more people learn. Thanks for your support! 🙌