The second edition of this book has been published. All content updates after May 2019 refer to the first printing of the second edition.
For the content of the first edition, see Release first_edition
[TOC]
To make studying easier, some notes and tool descriptions are collected here.
If you need to reference this Repo:
Format: SmirkCao, Lihang, (2018), GitHub repository, https://github.com/SmirkCao/Lihang
or
@misc{SmirkCao,
  author = {SmirkCao},
  title = {Lihang},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SmirkCao/Lihang}},
  commit = {c5624a9bd757a5cc88e78b85b89e9221deb08270}
}
This part does not correspond to the preface of "Statistical Learning Methods"; still, the preface in the book is well written and is quoted as follows:
- In terms of content selection, we focus on introducing the most important and commonly used methods, especially methods related to classification and labeling problems.
- Try to discuss all methods within a unified framework, so that the book as a whole remains systematic.
- Applicable to college students and graduate students majoring in information retrieval and natural language processing.
Another thing to note is the author’s work background.
The author has been engaged in research on various intelligent processing of text data using statistical learning methods, including natural language processing, information retrieval, and text data mining.
If I ran a similarity search with my own model, the book closest in style to Mr. Li's would be "Semiconductor Optoelectronic Devices". It's a pity I didn't read that one over and over when I was young.
I hope that, through repeated reading, the whole book will first be read "thick" and then "thin" again. In all documents and code in this series, unless otherwise stated, "the book" refers to "Statistical Learning Methods" by Teacher Li Hang; content from other references is linked where it is cited.
Some references are listed in Refs, and some of them are very helpful for understanding the book. Descriptions and explanations of these files are added to Refs/README.md, which corresponds to the references section. Notes on other references have also been added to this document.
To make downloading the references easier, ref_downloader.sh was added during review02; it can be used to download the references listed in the book, and it is being filled in gradually as review02 progresses.
In addition, Teacher Li Hang's book is really thin (the second edition is no longer thin), but almost every sentence brings out multiple points and is worth reading again and again.
There is a symbol table after the table of contents in the book that explains the notation, so if you come across a symbol you do not understand, you can look it up there; there is also an index at the back of the book, which you can use to find the locations where a symbol appears. In this Repo, glossary_index.md is maintained to add explanations for the symbols and to mark the page numbers where they appear; its progress is updated along with the review.
After each algorithm or example there is a ◼️, marking where the algorithm or example ends. This is the end-of-proof symbol; you will recognize it once you read more of the literature.
When reading, questions often come up about the base of the logarithms. The more important cases are emphasized in the book, and those that are not emphasized can be understood from context. Moreover, because of the change-of-base formula, the choice of base does not matter much: the difference is only a constant coefficient. That said, different bases carry different physical meanings and can be more or less convenient for a given problem; for an analysis of this, see the discussion of entropy in PRML Section 1.6.
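For reference, this is just the standard change-of-base identity (notation here is generic, not from the book):

$$\log_a x = \frac{\log_b x}{\log_b a}$$

so, for example, entropy measured in bits and entropy measured in nats differ only by the constant factor $1/\ln 2$.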
Similarly, regarding constant coefficients in formulas: when an iterative solution is used, simplifying the formula to some extent can sometimes improve the convergence speed. The details become clearer with practice.
A chart is inserted here listing the number of pages each chapter takes up. Among the supervised learning methods, SVM takes up the most space; among the unsupervised methods, MCMC takes up the most; DT, HMM, CRF, SVD, PCA, LDA, and PageRank also take up relatively large space.
The chapters are related to each other, for example NB and LR, DT and AdaBoost, Perceptron and SVM, HMM and CRF, and so on. If you get stuck in a long chapter, you can review the earlier chapters it builds on, or check that chapter's references; the references usually describe the problem in more detail and may explain exactly the point where you are stuck.
Introduction
Three elements of statistical learning methods:
- Model
- Strategy
- Algorithm
The second edition has reorganized the directory structure of this chapter to make it clearer.
Perceptron
kNN
NB
DT
LR
For studying maximum entropy, it is recommended to read reference [1] of this chapter (Berger, 1996), which helps with understanding the examples in the book and grasping the maximum entropy principle.
So, why are LR and Maxent placed in one chapter?
- Both are log-linear models.
- Both can be used for binary and multi-class classification.
- Learning for both models typically uses maximum likelihood estimation, or regularized maximum likelihood estimation; it can be formalized as an unconstrained optimization problem, solvable with IIS, GD, BFGS, etc. (a minimal gradient-descent sketch follows this list).
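As an illustration of that last point, here is a minimal sketch of fitting binary logistic regression by gradient descent on the average negative log-likelihood. This is only an illustrative toy: the function name `fit_logistic_gd` and the toy data are made up for this example and are not from the book or this Repo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=1000):
    """Fit binary LR (y in {0,1}) by gradient descent on the average negative log-likelihood."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)                          # P(y=1|x) under the current w
        grad = Xb.T @ (p - y) / len(y)               # gradient of the average NLL
        w -= lr * grad
    return w

# toy usage
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, 0.0])
w = fit_logistic_gd(X, y)
print(sigmoid(np.hstack([X, np.ones((3, 1))]) @ w))  # predicted P(y=1|x) for the training points
```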
It is described as follows in Logistic regression,
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
There is also such a description
Logistic regression is a special case of maximum entropy with two labels +1 and −1.
The derivation in this chapter uses the property that $y \in \mathcal{Y}=\{0,1\}$.
In NLP, logistic regression is sometimes also referred to as Maxent.
SVM
Boosting
Let's pause here, because HMM and CRF usually lead into the later introduction of probabilistic graphical models. In "Machine Learning" by Zhou Zhihua, a separate chapter on probabilistic graphical models covers HMM, MRF, CRF, and related content. In addition, going from HMM to CRF itself involves many related points.
The first chapter of the book describes three applications of supervised learning: classification, labeling, and regression, with additions in Chapter 12. The book mainly covers learning methods for the first two, so it is natural to draw a dividing line here: classification models were introduced above, regression is only touched on briefly, and the labeling problem is mainly introduced from here on.
EM
The EM algorithm is an iterative algorithm for maximum likelihood estimation, or maximum posterior probability estimation, of the parameters of probabilistic models containing hidden variables. (The maximum likelihood estimation and maximum posterior probability estimation here are the learning strategies.)
If the variables of the probability model are all observed variables, then given the data, the model parameters can be estimated directly using the maximum likelihood estimation method or the Bayesian estimation method.
Note that if you do not understand this description in the book, please refer to the parameter estimation part of the Naive Bayes method in CH04.
The code for this part implements BMM and GMM; it is worth taking a look.
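If you want a concrete picture before opening that code, below is a minimal sketch of EM for a two-component one-dimensional Gaussian mixture. It assumes NumPy and SciPy; the function name `em_gmm_1d` and the initialization are my own choices for illustration, not the Repo's implementation.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=100):
    """EM for a 2-component 1-D Gaussian mixture:
    E-step computes responsibilities, M-step re-estimates parameters by weighted MLE."""
    # crude initialization: centers at the extremes, equal weights
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma = np.array([x.std(), x.std()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component
        dens = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood estimates of pi, mu, sigma
        nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / len(x)
    return pi, mu, sigma

# toy usage: data drawn from two Gaussians
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])
print(em_gmm_1d(x))
```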
Not much is written here about this chapter. EM is one of the top ten algorithms, and it is also closely tied to Hinton: Hinton published the second capsule network paper, "Matrix Capsules with EM Routing", at ICLR in 2018.
In CH22, the EM algorithm is classified as a basic machine learning method and does not involve specific machine learning models. It can be used for unsupervised learning, supervised learning, and semi-supervised learning.
HMM
CRF
Summary
This chapter is only a few pages long. You can consider the following ways to read it:
- Read it together with Chapter 1.
- When you run into unclear points in the earlier chapters, come back and read this chapter again.
- Read this chapter "thick": expand from this chapter back into the other ten chapters.
Note Figure 12.2 in this chapter, which mentions the logistic loss function. There, $y$ should be defined on $\mathcal{Y}=\{+1,-1\}$, whereas when LR was introduced earlier $y$ was defined on $\mathcal{Y}=\{0,1\}$; pay attention to the difference.
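To make the two conventions concrete (standard identities in my own notation, not the book's): with $y \in \mathcal{Y}=\{+1,-1\}$ the logistic loss is usually written as

$$L(y, f(x)) = \log\left[1 + \exp(-y\, f(x))\right],$$

while with $y \in \mathcal{Y}=\{0,1\}$ and $\hat{p} = \sigma(f(x)) = 1/(1+e^{-f(x)})$ the LR negative log-likelihood is

$$-\left[y \log \hat{p} + (1-y)\log(1-\hat{p})\right];$$

the two coincide under the label mapping $y=0 \leftrightarrow y=-1$.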
Teacher Li’s book really makes you gain something new every time you read it.
The second edition adds eight unsupervised learning methods: clustering, singular value decomposition, principal component analysis, latent semantic analysis, probabilistic latent semantic analysis, Markov chain Monte Carlo method, latent Dirichlet allocation, and PageRank.
Introduction
Clustering
The chapters in this book are not completely independent. This part aims to organize the connections between chapters and the applicable data sets; how far each algorithm implementation goes, and which data sets it can run on, is also part of that.