对于 A Few Useful Things to Know about Machine Learning 的总结,这是一篇很有启发的文章,总结了机器学习领域一些本质且普遍的概念.

Machine Learning = Representation + Evaluation + Optimization

It’s Generalization that Counts

  • 测试集留到最后→ 根据测试集调参会产生Contamination

  • cross validation

  • 虽然知道但其实还是很神奇的事情 → We don’t have access to the function we want to optimize.

Data Alone is Not Enough

  • 为了使Generatalization成立,必须使用某些Knowledge或者Assumption

  • No Free Lunch

  • Real World → Simple Assumption

    The functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions! In fact, very general assumptions—like smoothness, similar examples having similar classes, limited dependences, or limited complexity—are often enough to do very well, and this is a large part of why machine learning has been so successful.

  • Induction and Deduction

    Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is a vastly more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work.

    Deductoin: The conclusion is always true as long as the premises are true.

                 **(Idea  → Observation → Conclusion)**

    Induction: the quality of the idea or model or theory depends on the quality of the observations and analysis.

                **(Observation → Analysis → Theory)**
  • Representation → Knowledge

    The most useful learners in this regard are those that do not just have assumptions hardwired into them, but allow us to state them explicitly, vary them widely, and incorporate them automatically into the learning

  • Learning from Nature

    Machine learning is not magic; it cannot get something from nothing. What it does is get more from less.

    Learners combine knowledge with data to grow programs.

Overfitting Has Many Faces

  • Generalization Error → Bias + Variance

    Variance: 多个learner得到的结果不同

    Bias: expected estimator与grountruth的偏差

    模型参数角度: (频率派解释?即\(\theta\)为定值)

    \[\begin{aligned} MSE &= E[(\hat \theta_m-\theta)^2] \\ & = Bias(\hat \theta_m)^2 + Var(\hat \theta_m) \end{aligned} \]
  • 模型capacity越大,variance越大,bias越小。 → 由于inductive bias越小?

    例如神经网络可以拟合任何函数,那么variance就很大

    A linear learner has high bias, because when the frontier between two classes is not a hyperplane the learner is unable to induce it. Decision trees do not have this problem because they can represent any Boolean function, but on the other hand they can suffer from high variance: decision trees learned on different training sets generated by the same phenomenon are often very different,when in fact they should be the same.

  • Variance/Capacity越大的模型越需要更多的数据来减小Variance

    一个rule-based下的数据集,C4.5决策树在数据量小的时候不如naive bayes

    Situations like this are common in machine learning: strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.

  • regularization technique

    一种理解: 限制模型的解空间,可以约束variance?

  • statistical significance test

  • trade-off

    It is easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias). Simultaneously avoiding both requires learning a perfect classifier, and short of knowing it in advance there is no single technique that will always do best (no free lunch).

  • Noise and overfitting

    noise会使overfitting加重,但不是根本原因

    This can indeed aggravate overfitting, by making the learner draw a capricious frontier to keep those examples on what it thinks is the right side. But severe overfitting can occur even in the absence of noise.

  • multiple testing

Intuition Fails in High Dimensions

On the Surprising Behavior of Distance Metrics in High Dimensional Space

仅次于overfitting的第二大问题 → curse of dimension

  • 维度越多 → 数据覆盖范围越小

    Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space.

    假设100维度,10万亿(1e12)个数据, 只能覆盖\(1e^{-18}\)的空间

  • 难以进行有效的距离度量 → 基于相似性的模型break down

    1. 高维度下noise,覆盖了关联特征之间的影响

    2. 即使所有feature relevant,一个数据点相同距离的数据点的随着维度指数级增长

  • Curse of Dimension → PRML第一章

    In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant ``shell'' around it; and most of the volume of a highdimensional orange is in the skin, not the pulp.

    If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor.

    If we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another.

  • 特征不是越多越好 → 多一个特征至少不会损失模型性能(哪怕它没有提供额外信息) ❌

    Naively, one might think that gathering more features never hurts, since at worst they provide no new information about the class. But in fact their benefits may be outweighed by the curse of dimensionality.

  • blessing of non uniformity → 流形学习

Theoretical Guarantees Are Not What They Seem

  • Theory → PAC Learnable

    One of the major developments of recent decades has been the realization that in fact we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees.

  • Take with a large grain of salt

  • given infinite data, the learner is guaranteed to output the correct classifier.

    This is reassuring, but it would be rash to choose one learner over another because of its asymptotic guarantees. In practice, we are seldom in the asymptotic regime (also known as ``asymptopia''). And, because of the bias-variance trade-off I discussed earlier, *if learner A is better than learner B given infinite data, B is often better than A given finite data.*

  • Theory只是Theory

    The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design

    Learning is a complex phenomenon, and just because a learner has a theoretical justification and works in practice does not mean the former is the reason for the latter.

Feature Engineering is The Key

  • machine learning is not a one-shot process of building a dataset and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating.

  • Learning is often the quickest part of this, but that is because we have already mastered it pretty well! Feature engineering is more difficult because it is domain-specific, while learners can be largely general purpose. However, there is no sharp frontier between the two, and this is another reason the *most useful learners are those that facilitate incorporating knowledge.*

  • bear in mind that features that look irrelevant in isolation may be relevant in combination. For example, if the class is an XOR of k input features, each of them by itself carries no information about the class. (If you want to annoy machine learners, bring up XOR.) On the other hand, running a learner with a very large number of features to find out which ones are useful in combination may be too time-consuming, or cause overfitting. So there is ultimately no replacement for the smarts you put into feature engineering.

  • Deep learning is feature engineering(我自己说的)

More Data Beats a Clevrer Algorithm

  • As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it.

  • Scalability: 数据多→ 训练时间长

  • 好的模型的payoff其实并没有那么大

    All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of ``nearby.''

    With nonuniformly distributed data, learners can produce widely different frontiers while still making the same predictions in the regions that matter (those with a substantial number of training examples, and therefore also where most test examples are likely to appear).

    This also helps explain why powerful learners can be unstable but still accurate.

    很多不同的模型可以达到相同的效果(由于训练数据有限/空间稀疏)

  • As a rule, it pays to try the simplest learners first (for example, naïve Bayes before logistic regression, k-nearest neighbor before support vector machines)

  • 两种模型: 参数模型 vs. 非参数模型

    Learners can be divided into two major types: those whose representation has a fixed size, like linear classifiers, and those whose representation can grow with the data, like decision trees.

  • payoff在哪里

    由于实际数据有限 + 维度灾难,clever algorithm是能最大化利用数据的算法

    Clever algorithmsthose that make the most of the data and computing resources availableoften pay off in the end, provided you are willing to put in the effort.

  • 最重要的还是人们的insight

    In research papers, learners are typically compared on measures of accuracy and computational cost. But human effort saved and insight gained, although harder to measure, are often more important. This favors learners that produce human-understandable output (for example, rule sets). And the organizations that make the most of machine learning are those that have in place an infrastructure that makes experimenting with many different learners, data sources, and learning problems easy and efficient, and where there is a close collaboration between machine learning experts and application domain ones.

Learn Many Models, Not Just One

  • variants of one model → many variants of many model

  • model ensembles

    • bagging

      each classifier with a resampled dataset. → greatly decrease variance, slightly increase bias

    • boosting

      training examples have weights, and these are varied so that *each new classifier focuses on the examples the previous ones tended to get wrong.*

    • stacking

      the outputs of individual classifiers become the inputs of a ``higher-level'' learner that figures out how best to combine them.

      (有点像multi-scale的感觉?)

  • Bayesian model averaging vs. Ensemble

    Ensembles change the hypothesis space (for example, from single decision trees to linear combinations of them), and can take a wide variety of forms. BMA assigns weights to the hypotheses in the original space according to a fixed formula.

    BMA weights are extremely different from those produced by (say) bagging or boosting: the latter are fairly even, while the former are extremely skewed, to the point where the single highest-weight classifier usually dominates, making BMA effectively equivalent to just selecting it. 8 A practical consequence of this is that, *while model ensembles are a key part of the machine learning toolkit, BMA is seldom worth the trouble.*

Simplicity Does not Imply Accuracy

  • 奥卡姆剃刀 → 如无必要,勿增实体 → 相同训练误差下应该选择简单模型的泛化性更好(❌)

    • Model Ensembles

      The generalization error of a boosted ensemble continues to improve by adding classifiers even after the training error has reached zero.

    • SVM (为什么呢?)

      Another counterexample is support vector machines, which can effectively have an infinite number of parameters without overfitting.

    • 实际上,并没有直接联系

      Conversely, the function sign(sin(ax)) can discriminate an arbitrarily large, arbitrarily labeled set of points on the x axis, even though it has only one parameter. (怎么确定这个a呢)

  • complexity → size of hypothesis space

  • 复杂度与模型搜索

    A further complication arises from the fact that few learners search their hypothesis space exhaustively. A learner with a larger hypothesis space that tries fewer hypotheses from it is less likely to overfit than one that tries more hypotheses from a smaller space. As Pearl points out, the size of the hypothesis space is only a rough guide to what really matters for relating training and test error: the procedure by which a hypothesis is chosen.

  • 奥卡姆剃刀 → 简单本身是好的,而不是由于简单导致准确度更好

    The conclusion is that simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy.

Representable Does not imply Learnable

  • Given finite data, time and memory, standard learners can learn only a tiny subset of all possible functions, and these subsets are different for learners with different representations.

Correlation Doesn’t Imply Causation

  • 相关不是因果

  • 但学习相关至少是有用的

    Machine learning is usually applied to observational data, where the predictive variables are not under the control of the learner, as opposed to experimental data, where they are. Some learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted.

    On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation (for example, trying to understand what the causal chain might be).

  • 是否存在真正的因果是个哲学问题, 但对于机器学习而言:

    First, whether or not we call them ``causal,'' we would like to predict the effects of our actions, not just correlations between observable variables. Second, if you can obtain experimental data (for example by randomly assigning visitors to different versions of a Web site), then by all means do so.14