Language:
English
98,50 €*
Free shipping via Post / DHL
Currently not available
Categories:
Description
"The aim of this book is to provide the simplest formulations that can be derived "from first principles" with simple arguments"--
"The aim of this book is to provide the simplest formulations that can be derived "from first principles" with simple arguments"--
About the Author
Francis Bach
Table of Contents
Preface ix
I Preliminaries 1
1 Mathematical preliminaries 3
1.1 Linear algebra and differentiable calculus.................. 3
1.1.1 Minimization of quadratic forms................... 3
1.1.2 Inverting a 2 × 2 matrix........................ 4
1.1.3 Inverting matrices defined by blocks, matrix inversion lemma... 4
1.1.4 Eigenvalue and singular value decomposition............ 6
1.1.5 Differential calculus.......................... 7
1.2 Concentration inequalities........................... 7
1.2.1 Hoeffding’s inequality......................... 9
1.2.2 McDiarmid’s inequality........................ 12
1.2.3 Bernstein’s inequality (♦)....................... 14
1.2.4 Expectation of the maximum..................... 16
1.2.5 Estimation of expectations through quadrature (♦)......... 17
1.2.6 Concentration inequalities for matrices (♦♦)............. 18
2 Introduction to supervised learning 21
2.1 From training data to predictions....................... 22
2.2 Decision theory................................. 25
2.2.1 Loss functions............................. 25
2.2.2 Risks................................... 26
2.2.3 Bayes risk and Bayes predictor.................... 27
2.3 Learning from data............................... 30
2.3.1 Local averaging............................. 30
2.3.2 Empirical risk minimization...................... 31
2.4 Statistical learning theory........................... 33
2.4.1 Measures of performance....................... 35
2.4.2 Notions of consistency over classes of problems........... 35
2.5 No free lunch theorems (♦).......................... 36
2.6 Quest for adaptivity.............................. 38
2.7 Beyond supervised learning.......................... 39
2.8 Summary - book outline............................ 40
3 Linear least-squares regression 43
3.1 Introduction................................... 43
3.2 Least-squares framework............................ 44
3.3 Ordinary least-squares (OLS) estimator................... 45
3.3.1 Closed-form solution.......................... 45
3.3.2 Geometric interpretation........................ 46
3.3.3 Numerical resolution.......................... 47
3.4 Statistical analysis of OLS........................... 47
3.5 Fixed design setting.............................. 48
3.5.1 Statistical properties of the OLS estimator............. 50
3.5.2 Experiments.............................. 52
3.6 Ridge least-squares regression......................... 53
3.7 Lower-bound (♦)................................ 57
3.8 Random design analysis............................ 59
3.8.1 Gaussian designs............................ 61
3.8.2 General designs (♦♦)......................... 61
3.9 Principal component analysis (♦)....................... 63
3.10 Conclusion................................... 65
II Generalization bounds for learning algorithms 67
4 Empirical risk minimization 69
4.1 Convexification of the risk........................... 70
4.1.1 Convex surrogates........................... 71
4.1.2 Geometric interpretation of the support vector machine (♦).... 72
4.1.3 Conditional Φ-risk and classification calibration (♦)........ 74
4.1.4 Relationship between risk and Φ-risk (♦♦).............. 76
4.2 Risk minimization decomposition....................... 80
4.3 Approximation error.............................. 80
4.4 Estimation error................................ 81
4.4.1 Application of McDiarmid’s inequality................ 82
4.4.2 Easy case I: quadratic functions.................... 83
4.4.3 Easy case II: Finite number of models................ 84
4.4.4 Beyond finitely many models through covering numbers (♦)... 84
4.5 Rademacher complexity............................ 86
4.5.1 Symmetrization............................. 87
4.5.2 Lipschitz-continuous losses...................... 89
4.5.3 Ball-constrained linear predictions.................. 91
4.5.4 Putting things together (linear predictions)............. 92
4.5.5 From constrained to regularized estimation (♦)........... 93
4.5.6 Extensions and improvements..................... 96
4.6 Model selection (♦)............................... 98
4.6.1 Structural risk minimization...................... 98
4.6.2 Selection based on validation set................... 99
4.7 Relationship with asymptotic statistics (♦)................. 99
4.8 Summary.................................... 101
5 Optimization for machine learning 103
5.1 Optimization in machine learning....................... 103
5.2 Gradient descent................................ 105
5.2.1 Simplest analysis: ordinary least-squares............... 106
5.2.2 Convex functions and their properties................ 110
5.2.3 Analysis of GD for strongly convex and smooth functions..... 112
5.2.4 Analysis of GD for convex and smooth functions (♦)....... 117
5.2.5 Beyond gradient descent (♦)..................... 120
5.2.6 Non-convex objective functions (♦)................. 122
5.3 Gradient methods on non-smooth problems................. 123
5.4 Convergence rate of stochastic gradient descent (SGD)........... 127
5.4.1 Strongly convex problems (♦).................... 132
5.4.2 Adaptive methods (♦)......................... 134
5.4.3 Bias-variance trade-offs for least-squares (♦)............. 135
5.4.4 Variance reduction (♦)........................ 138
5.5 Conclusion................................... 143
6 Local averaging methods 145
6.1 Introduction................................... 145
6.2 Local averaging methods............................ 147
6.2.1 Linear estimators............................ 147
6.2.2 Partition estimators.......................... 148
6.2.3 Nearest-neighbors........................... 150
6.2.4 Nadaraya-Watson estimator a.k.a. kernel regression (♦)...... 151
6.3 Generic “simplest” consistency analysis................... 153
6.3.1 Fixed partition............................. 155
6.3.2 k-nearest neighbor........................... 158
6.3.3 Kernel regression (Nadaraya-Watson) (♦).............. 160
6.4 Universal consistency (♦)........................... 163
6.5 Adaptivity (♦♦)................................ 166
6.6 Conclusion................................... 167
7 Kernel methods 169
7.1 Introduction................................... 170
7.2 Representer theorem.............................. 170
7.3 Kernels..................................... 173
7.3.1 Linear and polynomial kernels.................... 175
7.3.2 Translation-invariant kernels on [0, 1]................. 176
7.3.3 Translation-invariant kernels on R^d.................. 179
7.3.4 Beyond vectorial input spaces (♦).................. 182
7.4 Algorithms................................... 184
7.4.1 Representer theorem.......................... 184
7.4.2 Column sampling............................ 185
7.4.3 Random features............................ 185
7.4.4 Dual algorithms (♦).......................... 186
7.4.5 Stochastic gradient descent (♦).................... 187
7.4.6 “Kernelization” of linear algorithms................. 188
7.5 Generalization guarantees - Lipschitz-continuous losses........... 189
7.5.1 Risk decomposition........................... 190
7.5.2 Approximation error for translation-invariant kernels on R^d.... 191
7.6 Theoretical analysis of ridge regression (♦)................. 194
7.6.1 Kernel ridge regression as a “linear” estimator........... 194
7.6.2 Bias and variance decomposition (♦)................ 195
7.6.3 Relating empirical and population covariance operators...... 198
7.6.4 Analysis for well-specified problems (♦)............... 200
7.6.5 Analysis beyond well-specified problems (♦)............. 201
7.6.6 Balancing bias and variance (♦)................... 202
7.7 Experiments................................... 203
7.8 Conclusion................................... 205
8 Sparse methods 207
8.1 Introduction................................... 207
8.1.1 Dedicated proof technique for constrained least-squares...... 209
8.1.2 Probabilistic and combinatorial lemmas............... 210
8.2 Variable selection by the ℓ0-penalty...................... 212
8.2.1 Assuming k is known.......................... 212
8.2.2 Estimating k (♦)............................ 214
8.3 Variable selection by ℓ1-regularization.................... 216
8.3.1 Intuition and algorithms........................ 217
8.3.2 Slow rates - random design...................... 221
8.3.3 Slow rates - fixed design........................ 221
8.3.4 Fast rates (♦).............................. 224
8.3.5 Zoo of conditions (♦♦)......................... 225
8.3.6 Random design (♦)........................... 227
8.4 Experiments................................... 228
8.5 Extensions.................................... 229
8.6 Conclusion................................... 230
9 Neural networks 233
9.1 Introduction................................... 233
9.2 Single hidden layer neural network...................... 235
9.2.1 Optimization.............................. 236
9.2.2 Rectified linear units and homogeneity................ 237
9.2.3 Estimation error............................ 239
9.3 Approximation properties........................... 241
9.3.1 Universal approximation property in one dimension........ 242
9.3.2 Infinitely many neurons and variation norm............. 243
9.3.3 Variation norm...
Details
Year of publication: | 2024 |
---|---|
Subject area: | IT / Computing |
Genre: | Imports, Computer Science |
Category: | Science & Technology |
Topic: | Reference works |
Medium: | Book |
ISBN-13: | 9780262049443 |
ISBN-10: | 0262049449 |
Language: | English |
Binding: | Hardcover |
Author: | Bach, Francis |
Publisher: | MIT Press Ltd |
Responsible person for the EU: | Product safety officer, Europaallee 1, D-36244 Bad Hersfeld, gpsr@libri.de |
Dimensions: | 188 x 243 x 38 mm |
By: | Francis Bach |
Publication date: | 24.12.2024 |
Weight: | 1.104 kg |