The basic concept behind simple linear regression modeling is to find the best-fitting straight line through a two-dimensional plane of paired X and Y values (that is, X and Y measurements). Once the line is found using the least-squares method (called the minimum variance method in this article), various statistical tests can be performed to determine how well the line fits the observed deviations in the Y values.
The linear equation (y = mx + b) has two parameters that must be estimated from the X and Y data provided: the slope (m) and the y-intercept (b). Once these two parameters are estimated, you can enter observed X values into the linear equation and look at the Y predictions the equation generates.
To estimate the m and b parameters using the minimum variance method, we need to find the values of m and b that minimize the difference between the observed and predicted values of Y across all X values. The difference between an observed value and its predicted value is called the error, (y_i - (mx_i + b)). If you square each error value and then sum the squared errors, the result is the sum of squared prediction errors. Using the minimum variance method to determine the best fit means finding the estimates of m and b that minimize this sum.
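To make the quantity being minimized concrete, here is a minimal sketch that computes the sum of squared errors for a candidate slope and intercept. The function name sumSquaredErrors() and the toy data are assumptions made for illustration; neither comes from the SimpleLinearRegression class.

```php
<?php
// Hypothetical helper (not part of SimpleLinearRegression):
// sums the squared differences between observed and predicted Y.
function sumSquaredErrors(array $X, array $Y, $m, $b) {
    $sse = 0.0;
    foreach ($X as $i => $x) {
        $error = $Y[$i] - ($m * $x + $b); // observed minus predicted
        $sse += $error * $error;
    }
    return $sse;
}

// Toy data, invented for illustration.
$X = array(1, 2, 3, 4, 5);
$Y = array(2, 4, 5, 4, 5);

// A flat line through the mean of Y versus a sloped candidate line.
printf("%.2f\n", sumSquaredErrors($X, $Y, 0.0, 4.0)); // 6.00
printf("%.2f\n", sumSquaredErrors($X, $Y, 0.6, 2.2)); // 2.40
```

The sloped line leaves a smaller sum than the flat line through the mean, which is the sense in which it fits this toy data better.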
Two basic methods can be used to find the estimates of m and b that satisfy the minimum variance method. In the first approach, you use a numerical search process that tries different values of m and b, evaluates them, and settles on the estimates that yield the minimum variance. The second approach is to use calculus to derive equations for estimating m and b. I'm not going to get into the calculus involved in deriving these equations, but I did use these analytical equations in the SimpleLinearRegression class to find the least-squares estimates of m and b (see the getSlope() and getYIntercept() methods in the SimpleLinearRegression class).
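For readers who want to see the analytical approach in code, the following is a sketch of the standard textbook closed-form estimates: the slope is the sum of cross-products of deviations divided by the sum of squared X deviations, and the intercept follows from the means. The bodies of getSlope() and getYIntercept() are not shown in this article, so leastSquaresFit() is a hypothetical stand-in and not the class's actual code.

```php
<?php
// Sketch only: standard closed-form least-squares estimates.
// leastSquaresFit() is a hypothetical name, not a method of the class.
// Assumes at least two distinct X values (otherwise $sxx is zero).
function leastSquaresFit(array $X, array $Y) {
    $n = count($X);
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;
    $sxy = 0.0; // sum of cross-products of deviations
    $sxx = 0.0; // sum of squared X deviations
    foreach ($X as $i => $x) {
        $sxy += ($x - $meanX) * ($Y[$i] - $meanY);
        $sxx += ($x - $meanX) * ($x - $meanX);
    }
    $m = $sxy / $sxx;
    $b = $meanY - $m * $meanX;
    return array('slope' => $m, 'intercept' => $b);
}

// Toy data, invented for illustration.
$fit = leastSquaresFit(array(1, 2, 3, 4, 5), array(2, 4, 5, 4, 5));
printf("y = %.1fx + %.1f\n", $fit['slope'], $fit['intercept']); // y = 0.6x + 2.2
```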
Even though you have equations for finding the least-squares estimates of m and b, plugging these parameters into the linear equation does not guarantee a straight line that fits the data well. The next step in the simple linear regression process is to determine whether the remaining prediction variance is acceptable.
You can use a statistical decision procedure to decide whether to reject the alternative hypothesis that a straight line fits the data. This procedure is based on calculating the T statistic and using a probability function to find the probability of observing a value that large by chance. As mentioned in Part 1, the SimpleLinearRegression class generates a number of summary values. One important summary value is the T statistic, which can be used to measure how well the linear equation fits the data. If the fit is good, the T statistic tends to be large; if the T value is small, you should replace your linear equation with a default model that assumes the mean of the Y values is the best predictor (because the average of a set of values is often a useful predictor of the next observation).
To test whether the T statistic is large enough to rule out the mean of the Y values as the best predictor, you need to calculate the probability of obtaining that T statistic by chance. If the probability is low, the null hypothesis that the mean is the best predictor can be rejected, and you can be more confident that a simple linear model is a good fit to the data. (See Part 1 for more information on calculating the probability of a T statistic.)
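As a rough illustration of where the T statistic comes from, the sketch below uses the usual textbook definition for the slope: the estimated slope divided by its standard error, with n - 2 degrees of freedom. The helper name slopeTStatistic() and the toy data are assumptions; the SimpleLinearRegression class may organize this calculation differently.

```php
<?php
// Sketch only: T statistic for the slope of a simple linear regression.
// slopeTStatistic() is a hypothetical helper name.
// Assumes more than two observations and a fit that is not perfect.
function slopeTStatistic(array $X, array $Y) {
    $n = count($X);
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;
    $sxy = 0.0;
    $sxx = 0.0;
    foreach ($X as $i => $x) {
        $sxy += ($x - $meanX) * ($Y[$i] - $meanY);
        $sxx += pow($x - $meanX, 2);
    }
    $m = $sxy / $sxx;
    $b = $meanY - $m * $meanX;

    // Sum of squared errors around the fitted line.
    $sse = 0.0;
    foreach ($X as $i => $x) {
        $sse += pow($Y[$i] - ($m * $x + $b), 2);
    }

    $df  = $n - 2;             // two parameters were estimated
    $mse = $sse / $df;         // mean squared error
    $se  = sqrt($mse / $sxx);  // standard error of the slope
    return array('t' => $m / $se, 'df' => $df);
}

// Toy data, invented for illustration.
$result = slopeTStatistic(array(1, 2, 3, 4, 5), array(2, 4, 5, 4, 5));
printf("T = %.4f, df = %d\n", $result['t'], $result['df']); // T = 2.1213, df = 3
```

The T value and degrees of freedom are what a probability function such as getStudentT() in Listing 1 takes as input.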
Let's return to the statistical decision procedure. It tells you when to reject the null hypothesis, but it does not tell you whether to accept the alternative hypothesis. In a research setting, the linear-model alternative hypothesis needs to be established through theoretical and statistical arguments.
The data research tool you will build implements a statistical decision procedure for linear models (T-tests) and provides summary data that can be used to construct the theoretical and statistical arguments needed to establish linear models. Data research tools can be classified as decision support tools that help knowledge workers study patterns in small to medium-sized data sets.
From a learning perspective, simple linear regression modeling is worth studying because it is a gateway to understanding more advanced forms of statistical modeling. For example, many core concepts in simple linear regression lay a good foundation for understanding multiple regression, factor analysis, and time series analysis.
Simple linear regression is also a versatile modeling technique. It can be used to model curvilinear data by transforming the raw data (usually with a logarithmic or power transformation). These transformations linearize the data so that it can be modeled using simple linear regression. The resulting linear model will be represented as a linear formula related to the transformed values.
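The sketch below illustrates the idea with a logarithmic transformation. It assumes toy data that grows exponentially, fits a straight line to the natural log of Y, and then undoes the transformation to express the model in the original units. The helper fitLine() and the data are hypothetical and are not taken from the SimpleLinearRegression class.

```php
<?php
// Hypothetical helper: closed-form least-squares fit of y = m*x + b.
// Returns array(slope, intercept).
function fitLine(array $X, array $Y) {
    $n = count($X);
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;
    $sxy = 0.0;
    $sxx = 0.0;
    foreach ($X as $i => $x) {
        $sxy += ($x - $meanX) * ($Y[$i] - $meanY);
        $sxx += pow($x - $meanX, 2);
    }
    $m = $sxy / $sxx;
    return array($m, $meanY - $m * $meanX);
}

// Toy data that doubles at every step: y = 2 * 2^x (invented).
$X = array(0, 1, 2, 3);
$Y = array(2, 4, 8, 16);

// Linearize with a natural-log transformation of Y.
$logY = array_map('log', $Y);
list($m, $b) = fitLine($X, $logY);

// The fitted model is linear in ln(y): ln(y) = m*x + b.
// Undo the transformation to express it in the original units.
printf("y = %.2f * %.2f^x\n", exp($b), exp($m)); // y = 2.00 * 2.00^x
```

Note that the fitted slope and intercept describe ln(y), not y, which is why the final formula has to be converted back before it is interpreted.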
Probability Function
In the previous article, I sidestepped the problem of implementing probability functions in PHP by asking R to find the probability value. I wasn't completely satisfied with this solution, so I began researching what it would take to develop PHP-based probability functions.
I started looking online for information and code. One source for both is the material on probability functions in the book Numerical Recipes in C. I reimplemented some of the probability function code (the gammln.c and betai.c functions) in PHP, but I still wasn't satisfied with the results. It seemed to require a bit more code than some other implementations. Additionally, I needed the inverse probability functions.
Fortunately, I stumbled upon John Pezzullo's Interactive Statistical Calculation. John's website about probability distribution functions has all the functions I need, implemented in JavaScript to make learning easier.
I ported the Student T and Fisher F functions to PHP. I changed the API a bit to conform to Java naming style and embedded all functions into a class called Distribution. A great feature of this implementation is the doCommonMath method, which is reused by all functions in this library. Other tests that I didn't bother to implement (normality test and chi-square test) also use the doCommonMath method.
Another aspect of this port is also worth noting. In JavaScript, you can assign dynamically computed values to instance variables, such as:
var PiD2 = Math.PI / 2
You can't do this in PHP. Only simple constant values can be assigned to instance variables. Hopefully this flaw will be resolved in PHP5.
Note that the code in Listing 1 does not define instance variables. This is because, in the JavaScript version, they are dynamically assigned values.
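One common way around this restriction is to assign the computed value in the constructor instead of in the property declaration. The sketch below assumes PHP 5 or later syntax (public, __construct), and the class name DistributionConstants is hypothetical; it is not how Listing 1 handles the value, which simply calls pi() / 2 inline.

```php
<?php
// Sketch of a workaround: compute the value in the constructor
// rather than in the property declaration.
class DistributionConstants {   // hypothetical class name
    public $PiD2;               // declared without a computed default

    public function __construct() {
        // Function calls are allowed here because this runs at run time.
        $this->PiD2 = pi() / 2;
    }
}

$c = new DistributionConstants();
printf("%.4f\n", $c->PiD2); // 1.5708
```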
Listing 1. Implementing the probability function
<?php
// Distribution.php
// Copyright John Pezzullo
// Released under same terms as PHP.
// PHP Port and OO'fying by Paul Meagher
class Distribution {

  // Series summation reused by all of the probability functions
  function doCommonMath($q, $i, $j, $b) {
    $zz = 1;
    $z  = $zz;
    $k  = $i;
    while ($k <= $j) {
      $zz = $zz * $q * $k / ($k - $b);
      $z  = $z + $zz;
      $k  = $k + 2;
    }
    return $z;
  }

  // Two-tailed probability of a T value at least as extreme as $t
  function getStudentT($t, $df) {
    $t  = abs($t);
    $w  = $t / sqrt($df);
    $th = atan($w);
    if ($df == 1) {
      return 1 - $th / (pi() / 2);
    }
    $sth = sin($th);
    $cth = cos($th);
    if (($df % 2) == 1) {
      // odd degrees of freedom
      return 1 - ($th + $sth * $cth * $this->doCommonMath($cth * $cth, 2, $df - 3, -1))
        / (pi() / 2);
    } else {
      // even degrees of freedom
      return 1 - $sth * $this->doCommonMath($cth * $cth, 1, $df - 3, -1);
    }
  }

  // Searches for the T value whose probability is $p,
  // halving the step size $dv on each pass
  function getInverseStudentT($p, $df) {
    $v  = 0.5;
    $dv = 0.5;
    $t  = 0;
    while ($dv > 1e-6) {
      $t  = (1 / $v) - 1;
      $dv = $dv / 2;
      if ($this->getStudentT($t, $df) > $p) {
        $v = $v - $dv;
      } else {
        $v = $v + $dv;
      }
    }
    return $t;
  }

  function getFisherF($f, $n1, $n2) {
    // implemented but not shown
  }

  function getInverseFisherF($p, $n1, $n2) {
    // implemented but not shown
  }
}
?>
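The search in getInverseStudentT() is easier to follow in isolation. The sketch below is a stand-alone version restricted to one degree of freedom, where the probability has the simple closed form used by the $df == 1 branch of getStudentT(). The function names studentTOneDf() and inverseStudentTOneDf() are hypothetical and exist only for this illustration.

```php
<?php
// Two-tailed probability for 1 degree of freedom, the same
// closed form used by the $df == 1 branch of getStudentT().
function studentTOneDf($t) {
    return 1 - atan(abs($t)) / (pi() / 2);
}

// Hypothetical stand-alone version of the search in getInverseStudentT().
// $v in (0, 1) maps to $t = 1/$v - 1 in (0, infinity), and the step
// $dv is halved on every pass until it drops below 1e-6.
function inverseStudentTOneDf($p) {
    $v  = 0.5;
    $dv = 0.5;
    $t  = 0;
    while ($dv > 1e-6) {
        $t  = (1 / $v) - 1;
        $dv = $dv / 2;
        if (studentTOneDf($t) > $p) {
            $v = $v - $dv; // probability too high: move toward larger t
        } else {
            $v = $v + $dv; // probability too low: move toward smaller t
        }
    }
    return $t;
}

// With 1 degree of freedom, a two-tailed probability of 0.5
// corresponds to T = 1, because atan(1) is half of pi/2.
printf("%.3f\n", inverseStudentTOneDf(0.5)); // 1.000
```

Because the probability falls steadily as T grows, repeatedly halving the step is enough to close in on the answer without needing a separate formula for the inverse.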
Output methods
Now that you have implemented the probability function in PHP, the only remaining challenge in developing a PHP-based data research tool is to design a method for displaying the analysis results.
The simple solution is to display the values of all instance variables to the screen as needed. In the first article, I did just that when showing the linear equation, T-value, and T-probability for the Burnout Study. It is helpful to be able to access specific values for specific purposes, and SimpleLinearRegression supports this usage.
However, another approach to outputting results is to systematically group parts of the output. If you study the output of the major statistical software packages used for regression analysis, you will find that they tend to group their output in the same way. They often include a Table Summary, an Analysis of Variance table, a Parameter Estimates table, and R values. Similarly, I created some output methods with the following names:
showTableSummary()
showAnalysisOfVariance()
showParameterEstimates()
showRValues()
I also have a method for displaying the linear prediction formula (getFormula()). Many statistical software packages do not output the formula, but instead expect the user to construct it from the output of the methods above. This is partly because the final form of the formula you end up using to model your data may differ from the default formula for the following reasons:
The Y-intercept may not have a meaningful interpretation.
The input values may have been transformed, in which case you may need to untransform them to get the final interpretation.
All of these methods assume that the output medium is a web page. Considering that you may want to output these summary values in a medium other than a web page, I decided to wrap these output methods in a class that inherits the SimpleLinearRegression class. The code in Listing 2 is intended to demonstrate the general logic of the output class. In order to make the common logic more prominent, the code that implements various show methods has been removed.
Listing 2. Demonstrating the common logic of the output class
<?php
// HTML.php
// Copyright 2003, Paul Meagher
// Distributed under GPL
include_once "slr/SimpleLinearRegression.php";

class SimpleLinearRegressionHTML extends SimpleLinearRegression {

  function SimpleLinearRegressionHTML($X, $Y, $conf_int) {
    SimpleLinearRegression::SimpleLinearRegression($X, $Y, $conf_int);
  }

  function showTableSummary($x_name, $y_name) { }

  function showAnalysisOfVariance() { }

  function showParameterEstimates() { }

  function showFormula($x_name, $y_name) { }

  function showRValues() { }
}
?>
The constructor of this class is just a wrapper around the SimpleLinearRegression class constructor. This means that if you want to display the HTML output of a SimpleLinearRegression analysis, you should instantiate the SimpleLinearRegressionHTML class instead of directly instantiating the SimpleLinearRegression class. The advantage is that you won't clutter the SimpleLinearRegression class with many unused methods, and you can more freely define classes for other output media (perhaps implementing the same API for different media types).
Graphical Output
The output methods you have implemented so far display summary values in HTML format. It is also appropriate to display scatter plots or line plots of these data in GIF, JPEG, or PNG format.
Rather than writing the code to generate line and scatter plots myself, I thought it would be better to use a PHP-based graphics library called JpGraph. JpGraph is being actively developed by Johan Persson, whose project website describes it this way:
JpGraph makes it easy to draw both "quick and dirty" graphs with a minimum of code and complex professional graphs that require very fine-grained control. JpGraph is equally well suited for scientific and business types of graphs.
The JpGraph distribution includes a number of sample scripts that can be customized for specific needs. Using JpGraph for the data research tool was as simple as finding a sample script that did something similar to what I needed and adapting it to fit my specific needs.
The script in Listing 3 is extracted from the sample data exploration tool (explore.php) and demonstrates how to call the library and populate the Line and Scatter classes with data from the SimpleLinearRegression analysis. The comments in this code were written by Johan Persson (the JpGraph code base is well documented).
Listing 3. Details of functions from the sample data exploration tool explore.php
<?php
// Snippet extracted from explore.php script
include ("jpgraph/jpgraph.php");
include ("jpgraph/jpgraph_scatter.php");
include ("jpgraph/jpgraph_line.php");
// Create the graph
$graph = new Graph(300,200,'auto');
$graph->SetScale("linlin");
// Setup title
$graph->title->Set("$title");
$graph->img->SetMargin(50,20,20,40);
$graph->xaxis->SetTitle("$x_name","center");
$graph->yaxis->SetTitleMargin(30);
$graph->yaxis->title->Set("$y_name");
$graph->title->SetFont(FF_FONT1,FS_BOLD);
// make sure that the X-axis is always at the
// bottom at the plot and not just at Y=0 which is
// the default position
$graph->xaxis->SetPos('min');
// Create the scatter plot with some nice colors
$sp1 = new ScatterPlot($slr->Y, $slr->X);
$sp1->mark->SetType(MARK_FILLEDCIRCLE);
$sp1->mark->SetFillColor("red");
$sp1->SetColor("blue");
$sp1->SetWeight(3);
$sp1->mark->SetWidth(4);
// Create the regression line
$lplot = new LinePlot($slr->PredictedY, $slr->X);
$lplot->SetWeight(2);
$lplot->SetColor('navy');
// Add the plots to the graph
$graph->Add($sp1);
$graph->Add($lplot);
// ... and stroke
$graph_name = "temp/test.png";
$graph->Stroke($graph_name);
?>
<img src='<?php echo $graph_name ?>' vspace='15'>
Data Research Script
The data research tool consists of a single script (explore.php) that calls methods of the SimpleLinearRegressionHTML class and the JpGraph library.
The script uses simple processing logic. The first part of the script performs basic validation on the submitted form data. If this form data passes validation, the second part of the script is executed.
The second part of the script contains code that analyzes the data and displays the summary results in HTML and graphical formats. The basic structure of the explore.php script is shown in Listing 4:
Listing 4. Structure of explore.php
<?php
// explore.php
if (!empty($x_values)) {
  $X    = explode(",", $x_values);
  $numX = count($X);
}
if (!empty($y_values)) {
  $Y    = explode(",", $y_values);
  $numY = count($Y);
}
// display data entry form if variables not set
if ( (empty($title)) OR (empty($x_name)) OR (empty($x_values)) OR
     (empty($y_name)) OR (empty($conf_int)) OR (empty($y_values)) OR
     ($numX != $numY) ) {
  // Omitted code for displaying entry form
} else {
  include_once "slr/SimpleLinearRegressionHTML.php";
  $slr = new SimpleLinearRegressionHTML($X, $Y, $conf_int);
  echo "<h2>$title</h2>";
  $slr->showTableSummary($x_name, $y_name);
  echo "<br><br>";
  $slr->showAnalysisOfVariance();
  echo "<br><br>";
  $slr->showParameterEstimates($x_name, $y_name);
  echo "<br>";
  $slr->showFormula($x_name, $y_name);
  echo "<br><br>";
  $slr->showRValues($x_name, $y_name);
  echo "<br>";
  include ("jpgraph/jpgraph.php");
  include ("jpgraph/jpgraph_scatter.php");
  include ("jpgraph/jpgraph_line.php");
  // The code for displaying the graphics is inline in the
  // explore.php script. The code for these two line plots
  // finishes off the script:
  // Omitted code for displaying scatter plus line plot
  // Omitted code for displaying residuals plot
}
?>
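Listing 4 only checks that both lists are present and have the same number of entries. If you also want to confirm that every entry is a number before running the analysis, a helper along the following lines could be applied to $x_values and $y_values. The function parseValues() is a hypothetical addition and is not part of explore.php.

```php
<?php
// Hypothetical helper: turn "1.5, 2, 3" into array(1.5, 2.0, 3.0),
// or return null if any entry is missing or not numeric.
function parseValues($csv) {
    $values = array();
    foreach (explode(",", $csv) as $item) {
        $item = trim($item); // tolerate spaces after the commas
        if (!is_numeric($item)) {
            return null;
        }
        $values[] = (float) $item;
    }
    return $values;
}

var_dump(parseValues("1, 2.5,3") !== null); // bool(true)
var_dump(parseValues("1,abc,3") === null);  // bool(true)
```

Returning null for a bad list lets the calling script fall back to showing the entry form, the same way Listing 4 does when a field is empty.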
Fire damage study
To demonstrate how to use the data research tool, I will use data from a hypothetical fire damage study. This study relates the amount of fire damage in major residential areas to their distance from the nearest fire station. Insurance companies, for example, would be interested in studying this relationship for the purpose of setting insurance premiums.
The data for this study are shown in the input screen in Figure 1.
Figure 1. Input screen showing study data
After the data is submitted, it is analyzed and the results of the analysis are displayed. The first result set displayed is the Table Summary, as shown in Figure 2.
Figure 2. Table Summary is the first result set displayed
The Table Summary displays the input data in tabular form, along with additional columns indicating the predicted value of Y corresponding to each observed value of X.
Figure 3 shows three high-level data summary tables following the Table Summary.
Figure 3. Shows three high-level data summary tables following Table Summary
The Analysis of Variance table shows how the variation in the Y values can be attributed to two main sources: variance explained by the model (see the Model row) and variance not explained by the model (see the Error row). A large F value means that the linear model captures most of the variation in the Y measurements. This table is more useful in a multiple regression setting, where each independent variable has its own row in the table.
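The following sketch shows one conventional way to compute that partition for a single predictor: the total sum of squares around the mean of Y is split into a model part and an error part, and the F value is the ratio of their mean squares. The helper anovaForLine() and the toy data are assumptions for illustration, and showAnalysisOfVariance() may compute or label these quantities differently.

```php
<?php
// Sketch only: partition the variation in Y into model and error parts.
// anovaForLine() is a hypothetical helper, not a method of the class.
function anovaForLine(array $X, array $Y) {
    $n = count($X);
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;
    $sxy = 0.0;
    $sxx = 0.0;
    $ssTotal = 0.0; // variation of Y around its mean
    foreach ($X as $i => $x) {
        $sxy     += ($x - $meanX) * ($Y[$i] - $meanY);
        $sxx     += pow($x - $meanX, 2);
        $ssTotal += pow($Y[$i] - $meanY, 2);
    }
    $m = $sxy / $sxx;
    $b = $meanY - $m * $meanX;

    $ssError = 0.0; // variation left over around the fitted line
    foreach ($X as $i => $x) {
        $ssError += pow($Y[$i] - ($m * $x + $b), 2);
    }
    $ssModel = $ssTotal - $ssError;

    // One model degree of freedom (the slope), n - 2 for error.
    $f = ($ssModel / 1) / ($ssError / ($n - 2));
    return array('ssModel' => $ssModel, 'ssError' => $ssError, 'F' => $f);
}

// Toy data, invented for illustration.
$a = anovaForLine(array(1, 2, 3, 4, 5), array(2, 4, 5, 4, 5));
printf("Model %.2f, Error %.2f, F %.2f\n",
       $a['ssModel'], $a['ssError'], $a['F']); // Model 3.60, Error 2.40, F 4.50
```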
The Parameter Estimates table shows the estimated Y-axis intercept (Intercept) and slope (Slope). Each row contains a T value and the probability of observing a T value that extreme (see the Prob > T column). The Prob > T value for the slope can be used to decide whether to reject the linear model.
If the probability of the T value is less than 0.05 (or a similarly small probability), you can reject the null hypothesis, because the probability of observing such an extreme value by chance is small. Otherwise, you must retain the null hypothesis.
In the fire damage study, the probability of obtaining a T value as large as 12.57 by chance is reported as 0.00000, that is, it is too small to show up in five decimal places. This means that the linear model is a useful predictor (better than the mean of the Y values) of Y values that correspond to the range of X values observed in this study.
The final report shows the correlation coefficients, or R values. They can be used to evaluate how well the linear model fits the data. A high R value indicates good agreement.
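As a point of reference, the sketch below computes the Pearson correlation coefficient R and its square for a set of paired values using the standard formula. The helper correlation() and the toy data are hypothetical; the article does not show which R values showRValues() prints or how it calculates them.

```php
<?php
// Sketch only: Pearson correlation coefficient for paired values.
// correlation() is a hypothetical helper, not a method of the class.
// Assumes neither X nor Y is constant (otherwise the divisor is zero).
function correlation(array $X, array $Y) {
    $n = count($X);
    $meanX = array_sum($X) / $n;
    $meanY = array_sum($Y) / $n;
    $sxy = 0.0;
    $sxx = 0.0;
    $syy = 0.0;
    foreach ($X as $i => $x) {
        $sxy += ($x - $meanX) * ($Y[$i] - $meanY);
        $sxx += pow($x - $meanX, 2);
        $syy += pow($Y[$i] - $meanY, 2);
    }
    return $sxy / sqrt($sxx * $syy);
}

// Toy data, invented for illustration.
$r = correlation(array(1, 2, 3, 4, 5), array(2, 4, 5, 4, 5));
printf("R = %.4f, R squared = %.2f\n", $r, $r * $r); // R = 0.7746, R squared = 0.60
```

In simple linear regression, R squared is the proportion of the variation in Y that the fitted line accounts for.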
Each summary report provides answers to various analytical questions about the relationship between the linear model and the data. Consult a textbook by Hamilton, Neter, or Pedhazur for more advanced treatments of regression analysis.
The final report elements to be displayed are the scatter plot and line graph of the data, as shown in Figure 4.
Figure 4. Final report elements: scatter plot and line graph
Most people are familiar with reading line graphs (such as the first graph in this series), so I will not comment on it except to say that the JpGraph library can produce high-quality scientific graphs for the Web. It also does a great job when you feed it scatter or line data.
The second plot relates the residuals (observed Y minus predicted Y) to your predicted Y values. This is an example of a graph used by advocates of Exploratory Data Analysis (EDA) to help maximize analysts' ability to detect and understand patterns in data. Experts can use this plot to answer diagnostic questions about how well the model fits.
This data research tool can easily be extended to produce more types of graphs, such as histograms, box plots, and quartile plots, which are standard EDA tools.
Math library architecture
My interest in mathematics as a hobby has kept me looking at math libraries in recent months. Explorations like this one push me to think about how to organize my code base and anticipate future growth.
I'll use the directory structure in Listing 5 for now:
Listing 5. Directory structure that is easy to grow
phpmath/
  burnout_study.php
  explore.php
  fire_study.php
  navbar.php
  dist/
    Distribution.php
    fisher.php
    student.php
  source.php
  jpgraph/
    etc...
  slr/
    SimpleLinearRegression.php
    SimpleLinearRegressionHTML.php
  temp/
In the future, setting the PHP_MATH variable will be done through a configuration file for the entire PHP math library.
What did you learn?
In this article, you learned how to use the SimpleLinearRegression class to develop data research tools for small to medium-sized data sets. Along the way, I also developed a native probability function for use with the SimpleLinearRegression class and extended the class with HTML output methods and graph generation code based on the JpGraph library.
From a learning perspective, simple linear regression modeling is worthy of further study because it is a gateway to understanding more advanced forms of statistical modeling. You will benefit from a solid understanding of simple linear regression before diving into more advanced techniques such as multiple regression or multivariate analysis of variance.
Even though simple linear regression uses only one variable to explain or predict the variation in another variable, finding simple linear relationships among all of the study variables is still often the first step in research data analysis. Just because data is multivariate does not mean that it must be studied using multivariate tools. In fact, starting out with a basic tool like simple linear regression is a great way to begin exploring patterns in your data.