Clean the table, encode features, hold out one day for testing, compute entropy and information gain by hand, fit a decision tree, score the test row, then compare with an equivalent deeper “Sunny → Wind” rule set.
Everything below hangs on this one supervised problem. Rows are past days; columns Sky and Wind strong? are inputs; Play outside? is the label we want the tree to learn. After training, a new day’s sky and wind will be dropped at the root and routed to a leaf — that leaf is the prediction.
Step 1 · Supervised setup
Supervised learning means every training row includes the correct answer y (here, play or not). The job of the tree is to memorize patterns that generalize: if we only memorized row IDs, a new day would be hopeless. Instead we memorize questions on Sky and Wind that reproduce the labels on the training table.
Step 2 · The dataset
Two inputs keep the arithmetic human-sized. “Wind strong?” is Yes/No. Sky uses three categories — but every split in a binary tree only compares one condition at a node (for example “is it sunny?” vs “not sunny,” which groups cloud and rain together for that decision).
| Day | Sky | Wind strong? | Play outside? |
|---|---|---|---|
| 1 | Sunny | No | Yes |
| 2 | Sunny | Yes | No |
| 3 | Rain | No | Yes |
| 4 | Rain | Yes | No |
| 5 | Cloudy | No | Yes |
| 6 | Cloudy | Yes | No |
The full pipeline below treats this table as the only source of truth: we clean and encode it, hold out one day, then use entropy and information gain exactly as in ID3/CART lectures to justify which question belongs at the root.
Step 3 · Data cleaning
Sunny, Rain, or Cloudy.Yes/No for wind and play (reject empty strings).(Day) — none here.Step 4 · Encoding
Binary entropy below uses counts of the positive class (Play = Yes). Wind is encoded as 1 = strong, 0 = calm. For root-split candidates we compare (a) “Wind strong?” and (b) “Sunny?” where Sunny = 1 and not-sunny = 0.
| Day | Sunny (1/0) | Wind (1=strong) | Play (1=Yes) | Split |
|---|---|---|---|---|
| 1 | 1 | 0 | 1 | Train |
| 2 | 1 | 1 | 0 | Train |
| 3 | 0 | 0 | 1 | Train |
| 4 | 0 | 1 | 0 | Train |
| 5 | 0 | 0 | 1 | Train |
| 6 | 0 | 1 | 0 | Test |
Train/test split: days 1–5 are the training set (build the tree here). Day 6 is held out for a blind prediction check.
Step 5 · Entropy on the training set
For a multiset of class labels with positive proportion p (here “Play = Yes”), use bits: H = −p log₂p − (1−p) log₂(1−p), with 0 log₂0 := 0.
Training root (days 1–5): 3× Play Yes, 2× No → p = 3/5.
H(root) = −(3/5)log₂(3/5) − (2/5)log₂(2/5) ≈ 0.971 bits.
Step 6 · Compare splits (information gain)
Weight each child by its fraction of training rows (here all denominators are 5).
| Candidate root split | Left child (rows, labels) | H(left) | Right child | H(right) | Weighted sum | IG |
|---|---|---|---|---|---|---|
| Wind strong? | Wind=0 → {1,3,5}, all Yes | 0 | Wind=1 → {2,4}, all No | 0 | (3/5)·0 + (2/5)·0 = 0 | 0.971 |
| Sunny? | Sunny=1 → {1,2}, 1Y / 1N | 1 | Sunny=0 → {3,4,5}, 2Y / 1N | ≈0.918 | (2/5)·1 + (3/5)·0.918 ≈ 0.951 | ≈0.020 |
Wind has the larger information gain on the training set, so a greedy ID3/CART-style trainer places “Wind strong?” at the root. The children are already pure on days 1–5, so no second split is required for this training slice.
Step 7 · Train & test (library)
Use only the wind column (one feature is enough here). After fit, predict for day 6 (wind = strong) should return No play, matching the held-out label — 1/1 correct on this tiny test. On such small data this is a sanity check, not a statistically stable benchmark.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
# Train: days 1–5 — feature = wind strong (1=yes), label = play (1=yes)
X_train = np.array([[0], [1], [0], [1], [0]])
y_train = np.array([1, 0, 1, 0, 1])
clf = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=0)
clf.fit(X_train, y_train)
# Test: day 6 — strong wind
X_test = np.array([[1]])
print(clf.predict(X_test)) # [0] → No play (matches label)
Step 8 · Optimal tree (entropy order)
Edges are labeled Yes / No for the question in the parent. “No” means wind is not strong (calm) → play outside; “Yes” means strong wind → stay home.
Step 9 · Equivalent depth-2 view (optional)
You can rewrite the same input-output map with two levels: ask Sunny, then Wind. It is not what entropy ranked first on this table, but every day still ends in the correct leaf. Branch labels spell out the answer to each question.
Local entropy check (left branch only, days 1–2): before wind, H = 1 bit (one Yes, one No). After splitting on wind, both leaves have H = 0, so the information gain of wind given Sunny is also 1 bit — the second question fixes the remaining uncertainty.
Step 10 · Wrap-up
Cleaning → encoding → train/test split → hand-computed entropy and gain → chosen root → DecisionTreeClassifier fit/predict → optional equivalent two-level policy tree. Use the interactive widget next to rehearse the wind-first decision path on new examples.
Each internal node asks a yes-or-no question on one feature. Each branch is an answer. Each leaf is a prediction (a class label in classification, or a number in regression).
Training finds good questions to ask about your inputs — thresholds on numeric features, or branches for categories — so that each leaf ends up with examples that mostly agree on the target label.
Because the path from root to leaf is a list of rules, you can explain a single prediction to a stakeholder without opening a thousand-dimensional weight matrix.
Algorithms such as CART (Classification and Regression Trees) grow the model top-down. At each step they consider many candidate questions — for example “is age ≤ 35?” or “is color = blue?” — and pick the one that best separates the training labels into purer child groups.
Training is greedy: it chooses the best split now, without lookahead to future levels. That makes training fast, but it also means the tree is not guaranteed to be globally optimal — depth limits, leaf-size rules, and ensembles exist partly to tame that greed.
Stopping rules matter: maximum depth, minimum samples per leaf, minimum impurity decrease, or maximum number of leaves. Without them, a tree can keep splitting until every leaf holds a single training row — perfect training accuracy, poor generalization.
For classification, a node is pure if every example shares the same label. Common impurity scores are Gini and entropy. Both are highest when classes are evenly split at a node, and zero when only one class remains.
For a binary node with positive class proportion p, Gini is 2p(1−p) and entropy is −p log₂ p − (1−p) log₂ (1−p) (with the convention 0 log 0 = 0). Libraries average child impurities weighted by sample counts to score a candidate split.
Curves use the textbook formulas; vertical scale is illustrative.
Imagine eight loan applications: five will default (D) and three will not (N). The root is messy. A rule “income ≤ 40k?” sends five rows to the left leaf — four D and one N — and three rows to the right leaf — one D and two N. Neither leaf is perfect, but both are less mixed than the parent; the algorithm would compare this gain against other candidate questions.
| Region | Default (D) | Not default (N) |
|---|---|---|
| Before split (root) | 5 | 3 |
| Left: income ≤ 40k | 4 | 1 |
| Right: income > 40k | 1 | 2 |
In practice you never hand-count this for real data — DecisionTreeClassifier in scikit-learn does it for every feature threshold the implementation considers.
from sklearn.tree import DecisionTreeClassifier
# X_train: numeric encoding of Sky, Wind columns; y_train: Play label
clf = DecisionTreeClassifier(
max_depth=4,
min_samples_leaf=20,
random_state=42,
)
clf.fit(X_train, y_train)
A single tree can change a lot if you perturb the training sample — that is high variance. Its decision boundaries are axis-aligned (boxes and stair-steps in feature space), which matches spreadsheet-style data well but approximates diagonal boundaries only by stacking many small steps.
Pruning (cost-complexity pruning in CART) removes splits that do not pay for themselves on a validation metric. Hyperparameters like max_depth and min_samples_leaf are often easier to tune first than full pruning schedules.
Random forests train many deep trees on bootstrap samples of the data and, at each split, consider only a random subset of features. Averaging (voting) predictions reduces variance while keeping non-linear boundaries.
Gradient boosting builds trees sequentially: each new tree targets the mistakes (residuals) of the ensemble so far. Shallow trees plus many rounds often beat a single giant tree on accuracy — at the cost of more tuning and a less transparent model unless you add explainability tools.
Takeaway: interpretability lives at the single-tree level; competitive accuracy on tabular data often lives at the forest / boosted level.
This matches the entropy-optimal model above: one question, two leaves. No = calm wind → play outside; Yes = strong wind → stay inside.
One decision — same rule the entropy table selected at the root.
Is the wind strong?
Supervised trees trade some raw accuracy for clarity — and ensembles trade some clarity back for state-of-the-art accuracy on tabular data.
Understand the single tree first; the forest is just teamwork.