Artifact 5 · Supervised learning

Decision Tree:
A Supervised Learning Model

Clean the table, encode features, hold out one day for testing, compute entropy and information gain by hand, fit a decision tree, score the test row, then compare with an equivalent deeper “Sunny → Wind” rule set.

Six days, one question: play outside?

Everything below hangs on this one supervised problem. Rows are past days; the columns “Sky” and “Wind strong?” are inputs; “Play outside?” is the label we want the tree to learn. After training, a new day’s sky and wind enter at the root and are routed to a leaf, and that leaf is the prediction.

Step 1 · Supervised setup

Inputs and a known target

Supervised learning means every training row includes the correct answer y (here, play or not). The job of the tree is to learn patterns that generalize: if it only memorized row IDs, a new day would be hopeless. Instead it learns questions on Sky and Wind that reproduce the labels on the training table.

Step 2 · The dataset

Six labeled examples (toy weather)

Two inputs keep the arithmetic human-sized. “Wind strong?” is Yes/No. Sky uses three categories, but each node in a binary tree tests only one condition (for example “is it sunny?” vs “not sunny,” which groups Cloudy and Rain together for that decision).

Day | Sky    | Wind strong? | Play outside?
1   | Sunny  | No           | Yes
2   | Sunny  | Yes          | No
3   | Rain   | No           | Yes
4   | Rain   | Yes          | No
5   | Cloudy | No           | Yes
6   | Cloudy | Yes          | No

The full pipeline below treats this table as the only source of truth: we clean and encode it, hold out one day, then use entropy and information gain exactly as in ID3/CART lectures to justify which question belongs at the root.

Step 3 · Data cleaning

Make the table machine-ready

  • Trim whitespace in every cell; Sky values must be exactly Sunny, Rain, or Cloudy.
  • Normalize booleans: Yes/No for wind and play (reject empty strings).
  • Check duplicates on (Day) — none here.
  • No missing values in this toy set; if a cell were blank we would drop the row or impute with an explicit policy.
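
The checklist above can be sketched in plain Python. The helper name `clean_rows`, the dictionary keys, the case normalization, and the drop-on-blank policy are assumptions made for illustration; they are not taken from the original pipeline.

```python
ALLOWED_SKY = {"Sunny", "Rain", "Cloudy"}
ALLOWED_BOOL = {"Yes", "No"}


def clean_rows(raw_rows):
    """Trim cells, validate categories, and reject duplicate days.

    Hypothetical helper: each row is a dict with keys day, sky, wind, play.
    Rows with a blank cell are dropped (one explicit policy among several).
    """
    cleaned, seen_days = [], set()
    for row in raw_rows:
        cells = {key: str(value).strip() for key, value in row.items()}
        if any(cell == "" for cell in cells.values()):
            continue  # explicit policy: drop rows with a missing cell
        # Assumption: differences in letter case are normalized, not rejected.
        sky = cells["sky"].capitalize()
        wind = cells["wind"].capitalize()
        play = cells["play"].capitalize()
        if sky not in ALLOWED_SKY:
            raise ValueError(f"unexpected Sky value: {cells['sky']!r}")
        if wind not in ALLOWED_BOOL or play not in ALLOWED_BOOL:
            raise ValueError(f"unexpected Yes/No value on day {cells['day']}")
        day = int(cells["day"])
        if day in seen_days:
            raise ValueError(f"duplicate day: {day}")
        seen_days.add(day)
        cleaned.append({"day": day, "sky": sky, "wind": wind, "play": play})
    return cleaned


# The six-day table, with some deliberately messy whitespace and casing.
raw = [
    {"day": 1, "sky": " Sunny", "wind": "No", "play": "Yes "},
    {"day": 2, "sky": "Sunny", "wind": "yes", "play": "no"},
    {"day": 3, "sky": "Rain", "wind": "No", "play": "Yes"},
    {"day": 4, "sky": "Rain ", "wind": "Yes", "play": "No"},
    {"day": 5, "sky": "Cloudy", "wind": "No", "play": "Yes"},
    {"day": 6, "sky": "cloudy", "wind": "Yes", "play": "No"},
]

table = clean_rows(raw)
print(len(table))  # 6
```

Raising on an unexpected category, rather than silently fixing it, keeps the cleaning policy visible; a production pipeline might prefer to log and quarantine such rows instead.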

Step 4 · Encoding

Map text to numbers for training

Binary entropy below uses counts of the positive class (Play = Yes). Wind is encoded as 1 = strong, 0 = calm. For root-split candidates we compare (a) “Wind strong?” and (b) “Sunny?” where Sunny = 1 and not-sunny = 0.

Day | Sunny (1/0) | Wind (1=strong) | Play (1=Yes) | Split
1   | 1           | 0               | 1            | Train
2   | 1           | 1               | 0            | Train
3   | 0           | 0               | 1            | Train
4   | 0           | 1               | 0            | Train
5   | 0           | 0               | 1            | Train
6   | 0           | 1               | 0            | Test

Train/test split: days 1–5 are the training set (build the tree here). Day 6 is held out for a blind prediction check.
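
The encoding and the holdout can be reproduced in a few lines. This is a minimal sketch; the tuple layout and the names `encode`, `train`, and `test` are choices made here for illustration.

```python
# Cleaned rows from the weather table: (day, sky, wind_strong, play)
ROWS = [
    (1, "Sunny", "No", "Yes"),
    (2, "Sunny", "Yes", "No"),
    (3, "Rain", "No", "Yes"),
    (4, "Rain", "Yes", "No"),
    (5, "Cloudy", "No", "Yes"),
    (6, "Cloudy", "Yes", "No"),
]


def encode(row):
    """Map one text row to (day, sunny, wind, play) with 1/0 values."""
    day, sky, wind, play = row
    return (day, int(sky == "Sunny"), int(wind == "Yes"), int(play == "Yes"))


encoded = [encode(row) for row in ROWS]
train = [r for r in encoded if r[0] <= 5]  # days 1-5 build the tree
test = [r for r in encoded if r[0] == 6]   # day 6 is held out

print(train)
print(test)  # [(6, 0, 1, 0)]
```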

Step 5 · Entropy on the training set

Shannon entropy H(S)

For a multiset of class labels with positive proportion p (here “Play = Yes”), use bits: H = −p log₂p − (1−p) log₂(1−p), with 0 log₂0 := 0.

Training root (days 1–5): 3× Play Yes, 2× No → p = 3/5.

H(root) = −(3/5)log₂(3/5) − (2/5)log₂(2/5) ≈ 0.971 bits.
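
The same number can be checked with the standard library. A small sketch, assuming the five training labels are listed in day order; the function name `binary_entropy` is chosen here.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits for a two-class node with positive proportion p."""
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 * log2(0) := 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


train_labels = [1, 0, 1, 0, 1]  # Play = Yes on days 1, 3, 5
p = sum(train_labels) / len(train_labels)
h_root = binary_entropy(p)
print(f"{h_root:.3f}")  # 0.971
```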

Step 6 · Compare splits (information gain)

Information gain = H(parent) − weighted H(children)

Weight each child by its fraction of training rows (here all denominators are 5).

Candidate root split | Left child (rows, labels) | H(left) | Right child | H(right) | Weighted sum | IG
Wind strong? | Wind=0 → {1,3,5}, all Yes | 0 | Wind=1 → {2,4}, all No | 0 | (3/5)·0 + (2/5)·0 = 0 | 0.971
Sunny? | Sunny=1 → {1,2}, 1Y / 1N | 1 | Sunny=0 → {3,4,5}, 2Y / 1N | ≈0.918 | (2/5)·1 + (3/5)·0.918 ≈ 0.951 | 0.020

Wind has the larger information gain on the training set, so a greedy ID3/CART-style trainer places “Wind strong?” at the root. The children are already pure on days 1–5, so no second split is required for this training slice.
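
Both gains in the table can be recomputed from the encoded training rows. This is a sketch under the encoding from Step 4; the helper names `entropy` and `information_gain` are illustrative.

```python
import math


def entropy(labels):
    """Binary Shannon entropy (bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def information_gain(rows, feature_index):
    """H(parent) minus the row-weighted entropy of the two children."""
    parent = [r[-1] for r in rows]
    left = [r[-1] for r in rows if r[feature_index] == 0]
    right = [r[-1] for r in rows if r[feature_index] == 1]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
    return entropy(parent) - weighted


# Training rows (days 1-5) as (sunny, wind, play)
TRAIN = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0), (0, 0, 1)]

ig_sunny = information_gain(TRAIN, 0)
ig_wind = information_gain(TRAIN, 1)
print(f"{ig_wind:.3f} {ig_sunny:.3f}")  # 0.971 0.020
```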

Step 7 · Train & test (library)

Fit on five rows, predict day 6

Use only the wind column (one feature is enough here). After fit, predict for day 6 (wind = strong) should return No play, matching the held-out label — 1/1 correct on this tiny test. On such small data this is a sanity check, not a statistically stable benchmark.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Train: days 1–5 — feature = wind strong (1=yes), label = play (1=yes)
X_train = np.array([[0], [1], [0], [1], [0]])
y_train = np.array([1, 0, 1, 0, 1])

clf = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=0)
clf.fit(X_train, y_train)

# Test: day 6 — strong wind
X_test = np.array([[1]])
print(clf.predict(X_test))  # [0] → No play (matches label)

Step 8 · Optimal tree (entropy order)

One question, two pure leaves

Edges are labeled Yes / No for the question in the parent. “No” means wind is not strong (calm) → play outside; “Yes” means strong wind → stay home.
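
For a text view of the fitted stump, scikit-learn's `export_text` prints the learned rule. The sketch below refits the same five rows as Step 7; the feature name `wind_strong` is a label chosen here, not one used by the library.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Same training data as Step 7: wind strong (1=yes) -> play (1=yes)
X_train = np.array([[0], [1], [0], [1], [0]])
y_train = np.array([1, 0, 1, 0, 1])

clf = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=0)
clf.fit(X_train, y_train)

# One threshold on the wind column, then one class per leaf.
rules = export_text(clf, feature_names=["wind_strong"])
print(rules)
```

Because wind is encoded as 0/1, the library expresses the yes-or-no question as a numeric threshold between the two values.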

Step 9 · Equivalent depth-2 view (optional)

Same decisions, Sunny first — with Yes/No on every edge

You can rewrite the same input-output map with two levels: ask Sunny, then Wind. It is not what entropy ranked first on this table, but every day still ends in the correct leaf. Branch labels spell out the answer to each question.

Local entropy check (left branch only, days 1–2): before wind, H = 1 bit (one Yes, one No). After splitting on wind, both leaves have H = 0, so the information gain of wind given Sunny is also 1 bit — the second question fixes the remaining uncertainty.
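
The equivalence can be checked by writing both rule sets as functions and comparing them on all six days. This sketch assumes the not-sunny branch also asks about wind, which the text implies when it says every day ends in the correct leaf; the function names are illustrative.

```python
def wind_first(sunny, wind):
    """Depth-1 rule: strong wind -> no play (0), calm -> play (1)."""
    return 0 if wind == 1 else 1


def sunny_first(sunny, wind):
    """Depth-2 rule: ask Sunny, then Wind on each branch."""
    if sunny == 1:
        return 0 if wind == 1 else 1  # Sunny branch (days 1-2)
    # Not-sunny branch (assumed to ask the same wind question)
    return 0 if wind == 1 else 1


# All six days as (sunny, wind, play)
DAYS = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0), (0, 0, 1), (0, 1, 0)]

same = all(wind_first(s, w) == sunny_first(s, w) == play for s, w, play in DAYS)
print(same)  # True
```

The two branches of `sunny_first` are identical, which is another way of seeing that the Sunny question adds depth without changing any prediction on this table.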

Step 10 · Wrap-up

What you just did

Cleaning → encoding → train/test split → hand-computed entropy and gain → chosen root → DecisionTreeClassifier fit/predict → optional equivalent two-level policy tree. Use the interactive widget next to rehearse the wind-first decision path on new examples.


What any decision tree looks like

Each internal node asks a yes-or-no question on one feature. Each branch is an answer. Each leaf is a prediction (a class label in classification, or a number in regression).
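
One way to picture this structure in code is a pair of small classes plus a routing loop. The class and field names below are hypothetical, chosen only to mirror the node, branch, and leaf vocabulary.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    prediction: str  # class label returned when a row lands here


@dataclass
class Node:
    feature: str                  # question asked: is row[feature] true?
    if_yes: Union["Node", Leaf]   # branch followed on a Yes answer
    if_no: Union["Node", Leaf]    # branch followed on a No answer


def predict(tree, row):
    """Route a row from the root down to a leaf and return its prediction."""
    while isinstance(tree, Node):
        tree = tree.if_yes if row[tree.feature] else tree.if_no
    return tree.prediction


# The wind-first tree from the weather example.
wind_tree = Node("wind_strong", if_yes=Leaf("Stay inside"), if_no=Leaf("Play outside"))

print(predict(wind_tree, {"wind_strong": True}))   # Stay inside
print(predict(wind_tree, {"wind_strong": False}))  # Play outside
```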

Questions, not magic

Training finds good questions to ask about your inputs — thresholds on numeric features, or branches for categories — so that each leaf ends up with examples that mostly agree on the target label.

Because the path from root to leaf is a list of rules, you can explain a single prediction to a stakeholder without opening a thousand-dimensional weight matrix.


How a tree grows from rows and columns

Algorithms such as CART (Classification and Regression Trees) grow the model top-down. At each step they consider many candidate questions — for example “is age ≤ 35?” or “is color = blue?” — and pick the one that best separates the training labels into purer child groups.

Training is greedy: it chooses the best split now, without lookahead to future levels. That makes training fast, but it also means the tree is not guaranteed to be globally optimal — depth limits, leaf-size rules, and ensembles exist partly to tame that greed.

Stopping rules matter: maximum depth, minimum samples per leaf, minimum impurity decrease, or maximum number of leaves. Without them, a tree can keep splitting until every leaf holds a single training row — perfect training accuracy, poor generalization.
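
Greedy growth with stopping rules fits in a short recursive function. The sketch below is a simplified illustration for binary 0/1 features using entropy, not the CART implementation; `grow`, `best_split`, and the nested-dict tree format are assumptions made here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_split(rows, labels, features):
    """Return (gain, feature) for the binary feature with the highest gain."""
    parent = entropy(labels)
    best = (0.0, None)
    for f in features:
        yes = [y for r, y in zip(rows, labels) if r[f] == 1]
        no = [y for r, y in zip(rows, labels) if r[f] == 0]
        if not yes or not no:
            continue  # this question does not separate anything
        weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        if parent - weighted > best[0]:
            best = (parent - weighted, f)
    return best


def grow(rows, labels, features, depth=0, max_depth=2, min_gain=1e-9):
    """Greedy top-down growth: pick the best split now, then recurse."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority  # stopping rule: depth limit or pure node
    gain, feature = best_split(rows, labels, features)
    if feature is None or gain < min_gain:
        return majority  # stopping rule: no split is worth making
    yes_idx = [i for i, r in enumerate(rows) if r[feature] == 1]
    no_idx = [i for i, r in enumerate(rows) if r[feature] == 0]
    return {
        "ask": feature,
        "yes": grow([rows[i] for i in yes_idx], [labels[i] for i in yes_idx],
                    features, depth + 1, max_depth, min_gain),
        "no": grow([rows[i] for i in no_idx], [labels[i] for i in no_idx],
                   features, depth + 1, max_depth, min_gain),
    }


TRAIN_ROWS = [
    {"sunny": 1, "wind": 0}, {"sunny": 1, "wind": 1}, {"sunny": 0, "wind": 0},
    {"sunny": 0, "wind": 1}, {"sunny": 0, "wind": 0},
]
TRAIN_LABELS = ["Yes", "No", "Yes", "No", "Yes"]

tree = grow(TRAIN_ROWS, TRAIN_LABELS, ["sunny", "wind"])
print(tree)  # {'ask': 'wind', 'yes': 'No', 'no': 'Yes'}
```

On the weather training rows the recursion stops after one split because both children are already pure, even though `max_depth` would allow a second level.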

Measuring “mixedness” at a node

For classification, a node is pure if every example shares the same label. Common impurity scores are Gini and entropy. Both are highest when classes are evenly split at a node, and zero when only one class remains.

For a binary node with positive class proportion p, Gini is 2p(1−p) and entropy is −p log₂ p − (1−p) log₂ (1−p) (with the convention 0 log 0 = 0). Libraries average child impurities weighted by sample counts to score a candidate split.
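
The two formulas are easy to compare side by side. A minimal sketch using only the standard library; the function names are chosen here.

```python
import math


def gini(p):
    """Binary Gini impurity for positive proportion p."""
    return 2 * p * (1 - p)


def entropy(p):
    """Binary entropy in bits, with the convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Both scores are zero at the pure ends and peak at p = 0.5.
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p={p:.2f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

The peak values differ (0.5 for Gini, 1 bit for entropy), but both curves are symmetric around p = 0.5, so they usually rank candidate splits in a similar order.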

One split, two purer buckets

Imagine eight loan applications: five will default (D) and three will not (N). The root is messy. A rule “income ≤ 40k?” sends five rows to the left leaf — four D and one N — and three rows to the right leaf — one D and two N. Neither leaf is perfect, but both are less mixed than the parent; the algorithm would compare this gain against other candidate questions.

Node                | Default (D) | Not default (N)
Before split (root) | 5           | 3
Left: income ≤ 40k  | 4           | 1
Right: income > 40k | 1           | 2
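
The counts in this table are enough to score the split by hand. The sketch below computes the impurity decrease under both entropy and Gini; the helper names are illustrative.

```python
import math


def entropy_from_counts(d, n):
    """Entropy in bits of a node holding d defaults and n non-defaults."""
    total = d + n
    h = 0.0
    for count in (d, n):
        if count:
            q = count / total
            h -= q * math.log2(q)
    return h


def gini_from_counts(d, n):
    """Gini impurity of a node holding d defaults and n non-defaults."""
    total = d + n
    return 1 - (d / total) ** 2 - (n / total) ** 2


def decrease(impurity, root, left, right):
    """Parent impurity minus the row-weighted impurity of the two children."""
    total = sum(root)
    weighted = (sum(left) * impurity(*left) + sum(right) * impurity(*right)) / total
    return impurity(*root) - weighted


root, left, right = (5, 3), (4, 1), (1, 2)  # (D, N) counts from the table

entropy_gain = decrease(entropy_from_counts, root, left, right)
gini_decrease = decrease(gini_from_counts, root, left, right)
print(f"{entropy_gain:.3f}")   # 0.159
print(f"{gini_decrease:.3f}")  # 0.102
```

Both numbers are positive but small, which matches the description of leaves that are less mixed than the parent without being pure.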

In practice you never hand-count this for real data — DecisionTreeClassifier in scikit-learn does it for every feature threshold the implementation considers.

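from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder data (an assumption for illustration): 500 synthetic rows, 5 numeric features.
# Replace with your own numeric feature matrix X_train and label vector y_train.
X_train, y_train = make_classification(n_samples=500, n_features=5, random_state=42)

clf = DecisionTreeClassifier(
    max_depth=4,          # at most four questions from root to leaf
    min_samples_leaf=20,  # every leaf keeps at least 20 training rows
    random_state=42,
)
clf.fit(X_train, y_train)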

High variance & the axis-aligned frontier

A single tree can change a lot if you perturb the training sample — that is high variance. Its decision boundaries are axis-aligned (boxes and stair-steps in feature space), which matches spreadsheet-style data well but approximates diagonal boundaries only by stacking many small steps.

Pruning (cost-complexity pruning in CART) removes splits that do not pay for themselves on a validation metric. Hyperparameters like max_depth and min_samples_leaf are often easier to tune first than full pruning schedules.
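
In scikit-learn, cost-complexity pruning is exposed through the `ccp_alpha` parameter and the `cost_complexity_pruning_path` method. The sketch below picks an alpha on a validation split; the synthetic data, the 70/30 split, and the tie-breaking rule are assumptions made here, not a recommended recipe.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 400 rows, 6 numeric features.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate alphas come from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:  # alphas ascend, so ties go to the simpler tree
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```

A larger alpha removes more splits, so preferring it on ties keeps the smallest tree that still matches the best validation accuracy.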

Why no one stops at one tree

Random forests train many deep trees on bootstrap samples of the data and, at each split, consider only a random subset of features. Averaging (voting) predictions reduces variance while keeping non-linear boundaries.

Gradient boosting builds trees sequentially: each new tree targets the mistakes (residuals) of the ensemble so far. Shallow trees plus many rounds often beat a single giant tree on accuracy — at the cost of more tuning and a less transparent model unless you add explainability tools.

Takeaway: interpretability lives at the single-tree level; competitive accuracy on tabular data often lives at the forest / boosted level.
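
A quick way to see the difference is to cross-validate a single tree, a random forest, and a gradient-boosted ensemble on the same data. This is a hedged sketch on synthetic data; the dataset shape and hyperparameters are arbitrary choices made here, and the ranking can differ on other datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 300 rows, 8 features, 4 informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bootstrap samples plus a random feature subset at each split.
    "random forest": RandomForestClassifier(
        n_estimators=50, max_features="sqrt", random_state=0
    ),
    # Shallow trees added sequentially, each fitting the remaining errors.
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=50, max_depth=2, random_state=0
    ),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```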


Try the wind-first tree

This matches the entropy-optimal model above: one question, two leaves. No = calm wind → play outside; Yes = strong wind → stay inside.

Should we play outside?

One decision — same rule the entropy table selected at the root.

Is the wind strong?

No → Prediction: Play outside (conditions look good).
Yes → Prediction: Stay inside (weather is too rough).

Supervised trees trade some raw accuracy for clarity — and ensembles trade some clarity back for state-of-the-art accuracy on tabular data.

Understand the single tree first; the forest is just teamwork.

About this project

Introduction

Decision trees are a supervised learning model: they learn nested rules from labeled examples. This artifact walks one six-day weather table through cleaning, encoding, a train/test holdout, hand-computed entropy and information gain, a scikit-learn fit, and an interactive demo that matches the entropy-optimal wind-first tree, plus an optional Sunny-then-Wind equivalent diagram.

Description

Includes numeric gain tables, two tree figures with explicit Yes/No branch labels, a generic topology diagram with the same convention, and supporting sections on greedy training, impurity curves, a separate loan micro-example, variance, and ensembles.

Objective

Show the full decision-tree workflow on a dataset small enough to recompute by hand, while honestly comparing root-split candidates with entropy.

Process

Derived training-set counts; verified H and IG for wind versus sunny at the root; aligned the interactive widget and optimal tree diagram with the winning split; documented the optional deeper equivalent tree.

Tools and technologies used

HTML5, CSS (layout, responsive grids, callouts, code styling), JavaScript (IntersectionObserver, demo state), SVG for all figures, and Python/scikit-learn as the referenced training stack.

Value proposition

Delivers a self-contained teaching artifact: visuals plus interaction in one static page, suitable for sharing with instructors or employers without extra tooling.

Unique value

Hand-worked information gain sits next to runnable Python on the same rows, and the page names when a two-level tree is equivalent but not entropy-preferred at the root.

Relevance

Decision trees and their ensembles (random forest, gradient boosting) appear across industry tabular ML, credit risk, churn, and fault triage. Interpretable baselines are also increasingly important next to black-box models for audit and trust.

References

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Chapman & Hall.
Mitchell, T. M. (1997). Machine Learning, Chapter 3.
scikit-learn documentation: Decision Trees.
Course materials: AIML-500, Indiana Wesleyan University.