Decision Tree — A Supervised Learning

Guided project

Six days, one question: play outside?

Everything below hangs on this one supervised problem. Rows are past days; columns Sky and Wind strong? are inputs; Play outside? is the label we want the tree to learn. After training, a new day’s sky and wind will be dropped at the root and routed to a leaf — that leaf is the prediction.

Step 1 · Supervised setup

Inputs and a known target

Supervised learning means every training row includes the correct answer y (here, play or not). The job of the tree is to memorize patterns that generalize: if we only memorized row IDs, a new day would be hopeless. Instead we memorize questions on Sky and Wind that reproduce the labels on the training table.

Step 2 · The dataset

Six labeled examples (toy weather)

Two inputs keep the arithmetic human-sized. “Wind strong?” is Yes/No. Sky uses three categories — but every split in a binary tree only compares one condition at a node (for example “is it sunny?” vs “not sunny,” which groups cloud and rain together for that decision).

Day	Sky	Wind strong?	Play outside?
1	Sunny	No	Yes
2	Sunny	Yes	No
3	Rain	No	Yes
4	Rain	Yes	No
5	Cloudy	No	Yes
6	Cloudy	Yes	No

The full pipeline below treats this table as the only source of truth: we clean and encode it, hold out one day, then use entropy and information gain exactly as in ID3/CART lectures to justify which question belongs at the root.

Step 3 · Data cleaning

Make the table machine-ready

Trim whitespace in every cell; Sky values must be exactly Sunny, Rain, or Cloudy.
Normalize booleans: Yes/No for wind and play (reject empty strings).
Check duplicates on (Day) — none here.
No missing values in this toy set; if a cell were blank we would drop the row or impute with an explicit policy.

Step 4 · Encoding

Map text to numbers for training

Binary entropy below uses counts of the positive class (Play = Yes). Wind is encoded as 1 = strong, 0 = calm. For root-split candidates we compare (a) “Wind strong?” and (b) “Sunny?” where Sunny = 1 and not-sunny = 0.

Day	Sunny (1/0)	Wind (1=strong)	Play (1=Yes)	Split
1	1	0	1	Train
2	1	1	0	Train
3	0	0	1	Train
4	0	1	0	Train
5	0	0	1	Train
6	0	1	0	Test

Train/test split: days 1–5 are the training set (build the tree here). Day 6 is held out for a blind prediction check.

Step 5 · Entropy on the training set

Shannon entropy H(S)

For a multiset of class labels with positive proportion p (here “Play = Yes”), use bits: H = −p log₂p − (1−p) log₂(1−p), with 0 log₂0 := 0.

Training root (days 1–5): 3× Play Yes, 2× No → p = 3/5.

H(root) = −(3/5)log₂(3/5) − (2/5)log₂(2/5) ≈ 0.971 bits.

Step 6 · Compare splits (information gain)

Information gain = H(parent) − weighted H(children)

Weight each child by its fraction of training rows (here all denominators are 5).

Candidate root split	Left child (rows, labels)	H(left)	Right child	H(right)	Weighted sum	IG
Wind strong?	Wind=0 → {1,3,5}, all Yes	0	Wind=1 → {2,4}, all No	0	(3/5)·0 + (2/5)·0 = 0	0.971
Sunny?	Sunny=1 → {1,2}, 1Y / 1N	1	Sunny=0 → {3,4,5}, 2Y / 1N	≈0.918	(2/5)·1 + (3/5)·0.918 ≈ 0.951	≈0.020

Wind has the larger information gain on the training set, so a greedy ID3/CART-style trainer places “Wind strong?” at the root. The children are already pure on days 1–5, so no second split is required for this training slice.

Step 7 · Train & test (library)

Fit on five rows, predict day 6

Use only the wind column (one feature is enough here). After fit, predict for day 6 (wind = strong) should return No play, matching the held-out label — 1/1 correct on this tiny test. On such small data this is a sanity check, not a statistically stable benchmark.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Train: days 1–5 — feature = wind strong (1=yes), label = play (1=yes)
X_train = np.array([[0], [1], [0], [1], [0]])
y_train = np.array([1, 0, 1, 0, 1])

clf = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=0)
clf.fit(X_train, y_train)

# Test: day 6 — strong wind
X_test = np.array([[1]])
print(clf.predict(X_test))  # [0] → No play (matches label)

Step 8 · Optimal tree (entropy order)

One question, two pure leaves

Edges are labeled Yes / No for the question in the parent. “No” means wind is not strong (calm) → play outside; “Yes” means strong wind → stay home.

Step 9 · Equivalent depth-2 view (optional)

Same decisions, Sunny first — with Yes/No on every edge

You can rewrite the same input-output map with two levels: ask Sunny, then Wind. It is not what entropy ranked first on this table, but every day still ends in the correct leaf. Branch labels spell out the answer to each question.

Local entropy check (left branch only, days 1–2): before wind, H = 1 bit (one Yes, one No). After splitting on wind, both leaves have H = 0, so the information gain of wind given Sunny is also 1 bit — the second question fixes the remaining uncertainty.

Step 10 · Wrap-up

What you just did

Cleaning → encoding → train/test split → hand-computed entropy and gain → chosen root → DecisionTreeClassifier fit/predict → optional equivalent two-level policy tree. Use the interactive widget next to rehearse the wind-first decision path on new examples.

Picture 1 · Structure

What any decision tree looks like

Each internal node asks a yes-or-no question on one feature. Each branch is an answer. Each leaf is a prediction (a class label in classification, or a number in regression).

Plain language

Questions, not magic

Training finds good questions to ask about your inputs — thresholds on numeric features, or branches for categories — so that each leaf ends up with examples that mostly agree on the target label.

Because the path from root to leaf is a list of rules, you can explain a single prediction to a stakeholder without opening a thousand-dimensional weight matrix.

Training

How a tree grows from rows and columns

Algorithms such as CART (Classification and Regression Trees) grow the model top-down. At each step they consider many candidate questions — for example “is age ≤ 35?” or “is color = blue?” — and pick the one that best separates the training labels into purer child groups.

Training is greedy: it chooses the best split now, without lookahead to future levels. That makes training fast, but it also means the tree is not guaranteed to be globally optimal — depth limits, leaf-size rules, and ensembles exist partly to tame that greed.

Stopping rules matter: maximum depth, minimum samples per leaf, minimum impurity decrease, or maximum number of leaves. Without them, a tree can keep splitting until every leaf holds a single training row — perfect training accuracy, poor generalization.

Impurity

Measuring “mixedness” at a node

For classification, a node is pure if every example shares the same label. Common impurity scores are Gini and entropy. Both are highest when classes are evenly split at a node, and zero when only one class remains.

For a binary node with positive class proportion p, Gini is 2p(1−p) and entropy is −p log₂ p − (1−p) log₂ (1−p) (with the convention 0 log 0 = 0). Libraries average child impurities weighted by sample counts to score a candidate split.

Curves use the textbook formulas; vertical scale is illustrative.

Worked micro-example

One split, two purer buckets

Imagine eight loan applications: five will default (D) and three will not (N). The root is messy. A rule “income ≤ 40k?” sends five rows to the left leaf — four D and one N — and three rows to the right leaf — one D and two N. Neither leaf is perfect, but both are less mixed than the parent; the algorithm would compare this gain against other candidate questions.

Region	Default (D)	Not default (N)
Before split (root)	5	3
Left: income ≤ 40k	4	1
Right: income > 40k	1	2

In practice you never hand-count this for real data — DecisionTreeClassifier in scikit-learn does it for every feature threshold the implementation considers.

from sklearn.tree import DecisionTreeClassifier
# X_train: numeric encoding of Sky, Wind columns; y_train: Play label

clf = DecisionTreeClassifier(
    max_depth=4,
    min_samples_leaf=20,
    random_state=42,
)
clf.fit(X_train, y_train)

Risk

High variance & the axis-aligned frontier

A single tree can change a lot if you perturb the training sample — that is high variance. Its decision boundaries are axis-aligned (boxes and stair-steps in feature space), which matches spreadsheet-style data well but approximates diagonal boundaries only by stacking many small steps.

Pruning (cost-complexity pruning in CART) removes splits that do not pay for themselves on a validation metric. Hyperparameters like max_depth and min_samples_leaf are often easier to tune first than full pruning schedules.

Ensembles

Why no one stops at one tree

Random forests train many deep trees on bootstrap samples of the data and, at each split, consider only a random subset of features. Averaging (voting) predictions reduces variance while keeping non-linear boundaries.

Gradient boosting builds trees sequentially: each new tree targets the mistakes (residuals) of the ensemble so far. Shallow trees plus many rounds often beat a single giant tree on accuracy — at the cost of more tuning and a less transparent model unless you add explainability tools.

Takeaway: interpretability lives at the single-tree level; competitive accuracy on tabular data often lives at the forest / boosted level.

Interactive

Try the wind-first tree

This matches the entropy-optimal model above: one question, two leaves. No = calm wind → play outside; Yes = strong wind → stay inside.

Should we play outside?

One decision — same rule the entropy table selected at the root.

Is the wind strong?

Prediction: Play outside — conditions look good.

Prediction: Stay inside — weather is too rough.

Decision Tree-
A Supervised Learning

Six days, one question: play outside?

Inputs and a known target

Six labeled examples (toy weather)

Make the table machine-ready

Map text to numbers for training

Shannon entropy H(S)

Information gain = H(parent) − weighted H(children)

Fit on five rows, predict day 6

One question, two pure leaves

Same decisions, Sunny first — with Yes/No on every edge

What you just did

What any decision tree looks like

Questions, not magic

How a tree grows from rows and columns

Measuring “mixedness” at a node

One split, two purer buckets

High variance & the axis-aligned frontier

Why no one stops at one tree

Try the wind-first tree

Should we play outside?

About this project

Decision Tree-A Supervised Learning

Six days, one question: play outside?

Inputs and a known target

Six labeled examples (toy weather)

Make the table machine-ready

Map text to numbers for training

Shannon entropy H(S)

Information gain = H(parent) − weighted H(children)

Fit on five rows, predict day 6

One question, two pure leaves

Same decisions, Sunny first — with Yes/No on every edge

What you just did

What any decision tree looks like

Questions, not magic

How a tree grows from rows and columns

Measuring “mixedness” at a node

One split, two purer buckets

High variance & the axis-aligned frontier

Why no one stops at one tree

Try the wind-first tree

Should we play outside?

About this project

Decision Tree-
A Supervised Learning