Maximum Likelihood Estimation (MLE)
Wikipedia describes Maximum Likelihood Estimation (MLE) as follows:
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given data.
We'll start with a Bernoulli distribution, the single-trial case of the binomial distribution.
Suppose we have the dataset 0, 1, 1, 0, 1, 1, where each data point has the following probabilities:
$$ p(x=1)=\mu, \quad p(x=0)=1-\mu$$
What is the maximum likelihood estimate of the parameter $\mu$?
We can think of the dataset as the outcomes of coin tosses, with 1 for heads and 0 for tails. Four of the six tosses are heads, so the coin appears biased towards heads. But by how much?
Let's multiply the probability of each data point as a function of the parameter $\mu$:
$$ L(\mu) = p(0)p(1)p(1)p(0)p(1)p(1) = (1-\mu)\mu\mu(1-\mu)\mu\mu = (1-\mu)^2\mu^4$$
Then, the log of $L(\mu)$ becomes:
$$ \log L(\mu) = 2\log(1-\mu) + 4\log\mu$$
Because the log is monotonically increasing, the $\mu$ that maximizes $\log L(\mu)$ also maximizes $L(\mu)$. To find it, let's set the derivative of $\log L(\mu)$ to zero:
$$ (\log L)^\prime = \frac {-2}{1-\mu} + \frac{4}{\mu} = 0$$
This gives us $\mu = 2/3$.
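Still, we don't yet know whether this stationary point is a real max, a min, or just an inflection point. So, let's check the sign of the 2nd derivative at that $\mu$:
$$ (\log L)^{\prime \prime} = -\frac {2}{(1-\mu)^2} - \frac {4}{\mu^2} $$
Putting $\mu = 2/3$ into the 2nd derivative gives $-18 - 9 = -27$, a negative value. In fact, the 2nd derivative is negative for every $\mu$ between 0 and 1, so we can be sure $L(\mu)$ is at its max when $\mu = 2/3$.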
So, if the dataset comes from coin tosses, we can say the coin's bias ($\mu$) towards heads is 2/3!
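The derivation above can be checked numerically. The following is a minimal sketch using only the standard library; the function name `log_likelihood` and the grid resolution are choices made for this illustration, not part of the original derivation.

```python
import math

data = [0, 1, 1, 0, 1, 1]  # the coin-toss dataset from the text


def log_likelihood(mu, xs):
    """Bernoulli log-likelihood: log(mu) for each 1, log(1 - mu) for each 0."""
    return sum(math.log(mu) if x == 1 else math.log(1.0 - mu) for x in xs)


# Evaluate on a grid strictly inside (0, 1) so that log(0) never occurs.
grid = [k / 1000 for k in range(1, 1000)]
mu_grid = max(grid, key=lambda mu: log_likelihood(mu, data))

# Closed-form result of setting the derivative to zero: the fraction of heads.
heads = sum(data)
tails = len(data) - heads
mu_closed = heads / len(data)

# Second derivative of the log-likelihood at mu_closed; negative means a maximum.
second_derivative = -tails / (1.0 - mu_closed) ** 2 - heads / mu_closed ** 2

print(mu_grid, mu_closed, second_derivative)
```

The grid search should land within one grid step of 2/3, and the second derivative should come out at about -27, matching the hand calculation.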
The example we used has just one parameter ($\mu$: bias towards heads).
In the next section, we'll deal with a more general case where we want to estimate multiple parameters.
Let's start by looking into the following likelihood function:
$$ L(w)=\prod_{i=1}^n \left(\phi(z^{(i)})\right)^{y^{(i)}}\left(1-\phi(z^{(i)})\right)^{1-y^{(i)}}$$
Note that, compared with the $L(\mu)$ in the previous section, here $\phi(z)$ plays the role of $\mu$: it is the (conditional) probability that $y = 1$.
$$ \phi(z)=p(y=1|x;w)=\frac{1}{1+e^{-z}}$$
where $z$ is the net input, summed over the features $j$ of a sample:
$$ z = \sum_j w_j x_j$$
We learned that we can use the logistic regression model to predict probabilities and class labels.
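As a rough illustration of the net input and the sigmoid defined above, here is a minimal standard-library sketch. The helper names (`net_input`, `sigmoid`, `predict`), the 0.5 threshold, and the example weights and sample are all assumptions made for this example.

```python
import math


def net_input(w, x):
    """z = sum over features of w_j * x_j, for one sample x and weights w."""
    return sum(wj * xj for wj, xj in zip(w, x))


def sigmoid(z):
    """phi(z) = 1 / (1 + exp(-z)), the modeled probability p(y=1 | x; w)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x, threshold=0.5):
    """Class label: 1 if phi(z) >= threshold, else 0 (threshold is an assumption)."""
    return 1 if sigmoid(net_input(w, x)) >= threshold else 0


# Hypothetical weights and sample, chosen only for illustration.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]  # x[0] = 1.0 plays the role of a bias input

z = net_input(w, x)  # 0.5 - 2.0 + 1.0 = -0.5
print(z, sigmoid(z), predict(w, x))
```

Since $z$ is negative here, $\phi(z)$ falls below 0.5 and the predicted label is 0. This simple `sigmoid` can overflow for very large negative $z$, so a production version would likely need a numerically safer form.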
Now let's think about the parameters of the model, for example, weights $w$ with the likelihood $L(w)$ defined above. We want to maximize it when we build a logistic regression model.
In other words, maximizing the likelihood means choosing the weights under which the observed class labels are most probable. Since we usually talk about minimizing a "cost", let's negate the likelihood function so that we get a cost function $J$ to minimize.
For convenience (for example, when we use gradient or stochastic gradient descent), we take the log of the likelihood, which turns the product into a sum, and use the negative log-likelihood as our "cost function", $J(w)$, which can be minimized using gradient descent:
$$ \log L(w)=\sum_{i=1}^n \left[ y^{(i)}\log\phi(z^{(i)})+(1-y^{(i)})\log\left(1-\phi(z^{(i)})\right) \right]$$
$$ J(w) = \sum_{i=1}^n \left[ -y^{(i)}\log\phi(z^{(i)})-(1-y^{(i)})\log\left(1-\phi(z^{(i)})\right) \right]$$
To make the properties of this cost function clearer, let's take a look at the cost for just one single-sample instance:
$$ J(\phi(z),y;w)= -y \log\phi(z)-(1-y) \log(1-\phi(z)) $$
If we look at the equation carefully, we can see that the first term becomes zero if $y = 0$, while the second term becomes zero if $y = 1$:
$$ J(\phi(z),y;w)= \begin{cases}-\log(\phi(z)) & if \; y=1 \\ -\log(1-\phi(z)) & if \; y=0 \end{cases}$$
In the picture (generated by the code below), the cost approaches 0 (green) if we correctly predict that a sample belongs to class 1. Similarly, we can see on the y axis that the cost also approaches 0 if we correctly predict $y = 0$ (blue).
However, if the prediction is wrong, we penalize it with an increasingly large cost: the cost goes to infinity as $\phi(z)$ approaches the wrong extreme.
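To put numbers on this behavior, here is a minimal sketch of the single-sample cost and the total cost $J(w)$. The function names and all probability values are hypothetical, picked only to illustrate the formulas.

```python
import math


def sample_cost(phi, y):
    """Cost of one sample: -log(phi) if y == 1, -log(1 - phi) if y == 0."""
    return -y * math.log(phi) - (1 - y) * math.log(1.0 - phi)


def total_cost(phis, ys):
    """J(w): the sum of per-sample costs, i.e. the negative log-likelihood."""
    return sum(sample_cost(phi, y) for phi, y in zip(phis, ys))


# Hypothetical predicted probabilities phi(z) for a sample whose true label is 1.
confident_right = sample_cost(0.99, 1)  # close to 0
confident_wrong = sample_cost(0.01, 1)  # much larger

# Hypothetical batch of predictions and true labels.
phis = [0.9, 0.2, 0.7, 0.4]
ys = [1, 0, 1, 0]
batch_cost = total_cost(phis, ys)

print(confident_right, confident_wrong, batch_cost)
```

A confident correct prediction costs almost nothing, while an equally confident wrong one costs hundreds of times more. Note that `sample_cost` assumes `phi` lies strictly between 0 and 1; at exactly 0 or 1 the logarithm is undefined.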
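Picture code:

    import matplotlib.pyplot as plt
    import numpy as np

    # Predicted probabilities phi(z), kept strictly inside (0, 1) to avoid log(0)
    phi = np.arange(0.01, 1.0, 0.01)

    # Single-sample cost for each true label
    j1 = -np.log(phi)      # cost if y = 1
    j0 = -np.log(1 - phi)  # cost if y = 0

    plt.plot(phi, j1, color="green", label="y=1")
    plt.plot(phi, j0, color="blue", label="y=0")

    # Raw string so that \phi is not treated as an escape sequence
    plt.xlabel(r'$\phi$(z)')
    plt.ylabel('J(w)')
    plt.title('Cost functions')
    plt.grid(False)
    plt.legend(loc='upper center')
    plt.show()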
The same plot can also be drawn in a Jupyter notebook via the inline matplotlib feature.