ML vs Probabilistic Programming

Bayesian statistics has had a resurgence over the last few decades, after a long stretch in which non-Bayesian methods dominated much of statistical theory and practice.

That revival has been driven at least as much by computational progress—better samplers, faster hardware, and usable software—as by philosophical debates about inference.

Bayesian methods use probability to quantify uncertainty in conclusions drawn from data. Bayesian data analysis typically follows three steps:

  1. Specify a full probability model — a joint distribution over observables and unknowns that reflects substantive knowledge about the problem.

  2. Condition on data - update from the prior to the posterior, the distribution of unknowns given what was observed.

  3. Check the model and its implications — assess whether the specification is reasonable, where it fails, and what should change.

Comparing ML and statistics

Machine learning is widely used for unsupervised and supervised problems. In the supervised setting, the headline goal is often strong predictive performance on regression or classification tasks, sometimes supplemented with post-hoc explanation of features or predictions. Classical statistical modeling, especially in the Bayesian tradition, puts more emphasis on encoding assumptions explicitly and returning distributional answers—not only a point prediction but uncertainty and alternative structural stories that remain plausible after seeing the data.

Black box vs probabilistic programming

Many production ML systems behave like black boxes: they map inputs to outputs with limited transparency about uncertainty or about why a particular structure is encoded. Post-hoc tools such as SHAP help, but they explain an existing fitted object rather than building uncertainty into the model from the start. I cover interpretability methods in more depth in part 1 and part 2 of my interpreting-ML series.

Probabilistic programming (PP) turns the generative story into code: you write down assumptions, get posterior distributions back, and can simulate data from the model before and after fitting. That makes customized models and domain-informed priors practical, especially as samplers and compilers have improved.
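
As a concrete illustration, here is a minimal sketch of that idea in PyMC (one PP library among several; the library choice, priors, and toy data below are my own assumptions, not a prescription): you state the generative story, and inference returns a posterior rather than a single point estimate.

```python
import numpy as np
import pymc as pm

# Toy data: five noisy measurements of an unknown quantity.
y = np.array([2.1, 1.9, 2.4, 2.2, 1.8])

with pm.Model():
    # Step 1: specify the full probability model (priors plus an observational model).
    mu = pm.Normal("mu", mu=0.0, sigma=5.0)              # prior on the unknown mean
    sigma = pm.HalfNormal("sigma", sigma=1.0)            # prior on the noise scale
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)   # likelihood

    # Step 2: condition on data by drawing from the posterior with MCMC.
    idata = pm.sample()
```

The same model object can also simulate datasets before and after fitting, which is what the predictive checks discussed later rely on.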

Bayes’ rule and when Bayesian modeling shines

The posterior combines what we assumed before the data (prior) with what the data say (likelihood). In symbols: $p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta)\, p(\theta)$.
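
A tiny worked example of that proportionality, using a grid over a success probability (the Beta(2, 2) prior and the 7-out-of-10 data are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Estimate a success probability theta after observing 7 successes in 10 trials.
theta = np.linspace(0.001, 0.999, 999)           # grid over the parameter
prior = stats.beta.pdf(theta, 2, 2)              # p(theta): mild preference for middling rates
likelihood = stats.binom.pmf(7, n=10, p=theta)   # p(data | theta)

unnormalized = likelihood * prior                # right-hand side of the proportionality
posterior = unnormalized / (unnormalized.sum() * (theta[1] - theta[0]))  # normalize to integrate to 1

# With this conjugate prior, the grid result matches Beta(2 + 7, 2 + 3) exactly.
```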

Bayesian workflows are especially valuable when:

  • The task is not a single leaderboard metric but understanding mechanism, heterogeneity, or propagation of uncertainty.
  • Data are structured — hierarchical, spatial, temporal, or grouped — and you want to share information sensibly across units.
  • Domain knowledge should enter through priors or constrained parameters rather than only through feature engineering.
  • Uncertainty itself is part of the decision (tails of a forecast, probability of exceeding a threshold, etc.).

A misspecified or poorly identified Bayesian model often shows its problems in diagnostics—divergent transitions, poor mixing, prior–data conflict—whereas a complex predictive model can look fine on a held-out AUC until it fails in deployment. Bayesian models are also generative: once fit, they describe how new data could be produced, which powers simulation-based checks covered later.

Frequentist and Bayesian approaches alike can stumble computationally. As Gelman puts it:

“Computational problems often mean there’s a problem with your model.” — Gelman (2008).

Unlike tuning a handful of hyperparameters for a fixed architecture, Bayesian work usually demands careful choices of parameterization, priors, and sampling settings, plus patient reading of trace plots and predictive checks. When you understand the machinery, you know what to adjust. A natural place to start reading about the core algorithms is Markov chain Monte Carlo.

Monte Carlo

Closed-form posteriors are rare in realistic models because integrating over the latent space is hard. Monte Carlo approximates integrals (and thus posterior expectations) by averaging over random draws.

We often want draws that concentrate in high-posterior regions rather than wasting effort in negligible corners of the space. That is where Markov chains designed to have the posterior as their stationary distribution come in—below.

The general Monte Carlo recipe:

  1. Define inputs consistent with your domain and model.

  2. Draw random inputs from distributions that reflect uncertainty or the sampling design.

  3. Run a deterministic forward computation for each draw (same input → same output).

  4. Aggregate (average, form empirical intervals, etc.).

Caveats: sampling design matters (you want exploration that matches the target), and error typically shrinks as you increase the number of effective samples—subject to budget and mixing limits.
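
A toy sketch of those four steps (the inventory-style setup and the input distributions are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# Steps 1-2: define inputs and draw them from distributions encoding uncertainty.
demand = rng.lognormal(mean=3.0, sigma=0.5, size=n_draws)   # uncertain demand
unit_cost = rng.normal(loc=2.0, scale=0.1, size=n_draws)    # uncertain per-unit cost

# Step 3: deterministic forward computation for each draw.
stocked = 40
profit = 5.0 * np.minimum(demand, stocked) - unit_cost * stocked

# Step 4: aggregate into an estimate plus an empirical interval.
expected_profit = profit.mean()
interval = np.percentile(profit, [5, 95])
```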

Markov chain Monte Carlo

MCMC uses randomness to construct a Markov chain whose long-run distribution is the posterior. That sidesteps direct high-dimensional integration while still targeting the correct distribution—at the cost of correlation between successive samples and the need for convergence diagnostics.

Probabilistic programming has scaled because efficient MCMC, especially Hamiltonian Monte Carlo (HMC), makes joint posteriors over many parameters workable in practice. Alternatives such as Laplace approximation, importance sampling / SIR, or a MAP point estimate remain useful baselines but often give incomplete uncertainty or bias when the posterior is skewed or multimodal.

Formally, the Markov property says the next state depends on the present, not the full path:

$$P(X_{t+1} \mid X_t, X_{t-1}, \ldots, X_0) = P(X_{t+1} \mid X_t).$$

In short: where you are now is enough to describe the distribution of the next step.

A familiar (imperfect) intuition is Snakes and Ladders: the distribution of your next square depends on your current square and the roll ahead, not on the full history of how you arrived.
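
A minimal simulation of that intuition (ignoring the snakes and ladders themselves): the update function sees only the current square, never the path taken to reach it.

```python
import numpy as np

rng = np.random.default_rng(1)

def step(square: int) -> int:
    """Next square depends only on the current square and a fresh die roll."""
    roll = rng.integers(1, 7)       # fair six-sided die
    return min(square + roll, 99)   # cap at the final square

square, path = 0, [0]
while square < 99:
    square = step(square)
    path.append(square)
```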

The Metropolis algorithm is a generic random-walk proposal scheme; in high dimensions its exploration can become inefficient because volume grows quickly. HMC uses gradient information from the log-posterior to propose coherent moves and usually explores curved, correlated posteriors more effectively. Like optimization, it can struggle with pathological geometry (funnels, multimodality); this visual introduction remains a helpful read.

Think of Hamiltonian Monte Carlo as a marble released high on the inside wall of a bowl: it can sweep through the relevant region before settling near the bottom.

How to picture the workflow:

  1. Initialize somewhere in parameter space (or following your software’s warm-up).

  2. Propose moves that respect the target distribution (Metropolis corrections or HMC integrator with accept/reject).

  3. Iterate, producing dependent samples; thin or use multiple chains for diagnostics.

Monte Carlo supplies averaging; the Markov structure supplies dependent draws that eventually behave like posterior samples once chains mix.
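
Here is a compact sketch of that loop using the plain Metropolis algorithm against a stand-in log-posterior (the standard normal target and the step size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(theta):
    # Unnormalized log-posterior; a standard normal as a stand-in.
    return -0.5 * theta**2

def metropolis(n_samples=5_000, step_size=1.0, init=0.0):
    samples = np.empty(n_samples)
    current, current_logp = init, log_target(init)
    for i in range(n_samples):
        proposal = current + step_size * rng.normal()        # random-walk proposal
        proposal_logp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if np.log(rng.uniform()) < proposal_logp - current_logp:
            current, current_logp = proposal, proposal_logp
        samples[i] = current                                  # a dependent draw either way
    return samples

draws = metropolis()   # in practice: run several chains and check mixing
```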

A brief side note on likelihood ratios: these compare how well two hypotheses explain the same observation (e.g. Bayes factors build on marginal likelihoods; generic “event given B vs. not B” ratios appear in testing). They are not the definition of MCMC, but they show up constantly in model comparison.

No-U-Turn Sampler (NUTS)

NUTS is an adaptive variant of HMC that automatically sets path lengths so the sampler avoids unnecessary retracing—analogous to the marble swinging past the bottom of the bowl and rolling back along nearly the same trajectory. Limiting those U-turns reduces wasted computation while preserving HMC’s favorable scaling in many models.
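
In PyMC, for example (again an assumption about tooling rather than a requirement), `pm.sample` defaults to NUTS for continuous parameters, so path-length tuning is handled for you:

```python
import pymc as pm

with pm.Model():
    theta = pm.Normal("theta", mu=0.0, sigma=1.0)
    # NUTS is the default step method for continuous parameters.
    idata = pm.sample(draws=1000, tune=1000, chains=4)
```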

For an animated comparison of samplers, see Chi Feng’s MCMC gallery.

Confounders still matter

Probabilistic programming does not replace causal thinking. If Z confounds X and Y, conditioning or stratification may be required before interpreting an association as if it reflected the effect of X. PP helps you state assumptions cleanly; it does not automatically fix hidden bias in how data were collected.
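
A quick simulated illustration of that point (the data-generating process here is entirely made up): X and Y share a common cause Z, so they correlate even though neither affects the other, and conditioning on Z removes most of the association.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

z = rng.normal(size=n)                      # common cause
x = z + rng.normal(scale=0.5, size=n)       # X driven by Z, no effect on Y
y = z + rng.normal(scale=0.5, size=n)       # Y driven by Z, no effect from X

print(np.corrcoef(x, y)[0, 1])              # strongly positive despite no causal link

band = np.abs(z) < 0.1                      # crude stratification on Z
print(np.corrcoef(x[band], y[band])[0, 1])  # near zero within the stratum
```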

Common building-block distributions

  • Poisson — models for count outcomes (events in a fixed exposure) when overdispersion is mild or absent; the mean fixes the variance unless you extend the model (e.g. to a negative binomial).

  • Binomial — repeated binary trials with a common success probability; with enough independent flips, relative frequencies concentrate around the true rate (law of large numbers).

  • Normal — convenient for continuous outcomes or latent traits when tails are not decisive; heavily used for measurement error and, with care, for random effects.

  • Beta / Dirichlet — natural priors on probabilities and simplex-valued parameters (proportions, mixture weights).

The right family depends on support (counts vs. real line vs. simplex), tail behavior, and overdispersion. Posterior predictive checks help you see whether a Poisson or Gaussian likelihood systematically misfits the data.
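
A small numerical check of the overdispersion point (parameter values chosen only to make the contrast visible): a Poisson ties the variance to the mean, while a negative binomial with the same mean can be much more dispersed.

```python
import numpy as np

rng = np.random.default_rng(3)
size = 100_000

counts_pois = rng.poisson(lam=4.0, size=size)
# NumPy's negative binomial: n successes with probability p; mean = n * (1 - p) / p.
counts_nb = rng.negative_binomial(n=2, p=2 / (2 + 4.0), size=size)

print(counts_pois.mean(), counts_pois.var())   # ~4, ~4  (variance locked to the mean)
print(counts_nb.mean(), counts_nb.var())       # ~4, ~12 (same mean, overdispersed)
```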

Visualization in probabilistic programming

Visualization is not decoration—it is how you learn whether the model, not just the optimizer, is behaving. Gelman et al. outline a practical workflow in Visualization in Bayesian workflow.

  1. Prior predictive — simulate from the prior through the observational model. The implied data should be plausible, not identical to the real dataset (which would suggest an overly tight prior). Weakly informative priors are often a good starting point.

  2. Diagnosing sampling — bivariate scatter and parallel-coordinate views of samples can reveal divergent transitions and other pathologies; many divergences often signal hard geometry or priors that fight the data, so when divergences look structured, treat that as signal, not noise.

  3. Posterior predictive checks — compare simulated replicate data from the fitted model to what you actually observed. Systematic discrepancies mean the generative story should change.

  4. Pointwise / leave-one-out views — highlight observations that drive the fit or that the model handles poorly; ELPD-style summaries support comparing models that assign probability, not only point predictions.

Iterate: better plots lead to better models, which lead to better plots.
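
A sketch of the first three steps using PyMC and ArviZ (both tool choices are assumptions; the Poisson model and fake counts are placeholders):

```python
import numpy as np
import arviz as az
import pymc as pm

y = np.random.default_rng(4).poisson(lam=3.0, size=50)   # stand-in observed counts

with pm.Model():
    lam = pm.Gamma("lam", alpha=2.0, beta=0.5)            # weakly informative prior on the rate
    pm.Poisson("y_obs", mu=lam, observed=y)

    prior_pred = pm.sample_prior_predictive()             # 1. prior predictive draws
    idata = pm.sample()                                   # 2. fit; watch for divergences
    idata.extend(pm.sample_posterior_predictive(idata))   # 3. posterior predictive draws

az.plot_trace(idata)   # 2. sampling diagnostics
az.plot_ppc(idata)     # 3. replicated data vs. observed data
```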

Practical tip: scaling inputs (Gelman)

For regression-style models, Gelman recommends centering continuous predictors and scaling by about two standard deviations so coefficients are more comparable across binary and continuous inputs (roughly putting slopes on a common interpretive scale). The classic reference is Scaling regression inputs by dividing by two standard deviations. Applied guidance also appears in hierarchical-modeling texts—for example, this chapter notes that mean-centering and scaling by one or two SDs are common and usually work well. For discussion in Gelman’s words, see this blog note.
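
A one-line version of that recommendation (a sketch, not the only way to implement it):

```python
import numpy as np

def scale_by_two_sd(x: np.ndarray) -> np.ndarray:
    """Center a continuous predictor and divide by two standard deviations."""
    return (x - x.mean()) / (2.0 * x.std())

# Binary 0/1 predictors are usually left unscaled so slopes stay roughly comparable.
```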

Conclusion

Machine learning and probabilistic programming answer different questions with different defaults: prediction with flexible function classes versus structured uncertainty with explicit generative assumptions. The gap narrows when you care about calibrated uncertainty, small data, hierarchy, or scientific interpretability—places where priors, hierarchical partial pooling, and posterior predictive checks earn their keep. PP tools make that workflow executable, but they still demand statistical judgment: parameterization, parsimony, and critique of the model against reality. The reward is not only a fit but an account of what the data support, and how fragile that account is.

Further reading