Introduction
Compared with the communities around other open source languages such as Perl and Python, the PHP community has made no strong effort to develop math libraries.
One reason may be that a large number of mature mathematical tools already exist, which may discourage the community from developing PHP tools of its own. For example, I worked on a powerful tool, the S System, which has an impressive set of statistical libraries, was designed specifically for analyzing data sets, and won an ACM award in 1998 for its language design. If S or its open source cousin R is just a shell_exec call away, why go to the trouble of implementing the same statistical computing functionality in PHP? For more information about the S System, its ACM award, or R, see the related references.
Isn't this a waste of developer energy? If the only motivation for developing a PHP math library were to save developer effort and use the best tool for the job, then PHP's current approach makes sense.
On the other hand, pedagogical motivations may encourage the development of PHP math libraries. For about 10% of people, mathematics is an interesting subject worth exploring. For those who are also proficient in PHP, developing a PHP math library can enhance the process of learning mathematics. In other words, don't just read the chapter about T-tests; also implement a program that calculates the corresponding intermediate values and displays them in a standard format.
Through a worked example, I hope to demonstrate that developing a PHP math library is not a difficult task and can be an interesting technical and learning challenge. In this article, I provide an example PHP math library, called SimpleLinearRegression, which demonstrates a general approach that can be used to develop PHP math libraries. Let's start by discussing some general principles that guided me in developing the SimpleLinearRegression class.
Guiding Principles
I used six general principles to guide the development of the SimpleLinearRegression class.
Create a class for each analysis model.
Use backward chaining to develop classes.
Expect a large number of getters.
Store intermediate results.
Prefer a verbose API.
Perfection is not the goal.
Let’s examine each of these guidelines in more detail.
Create a class for each analysis model.
Each major analysis test or process should have a PHP class with the same name as the test or process. The class contains input functions, functions for calculating intermediate and summary values, and output functions (which display the intermediate and summary values on screen in text or graphical format).
Use backward chaining to develop classes
In mathematical programming, the coding target is usually the standard output values that an analysis procedure (such as MultipleRegression, TimeSeries, or ChiSquared) is expected to produce. From a problem-solving perspective, this means you can use backward chaining to develop the methods of a math class.
For example, the summary output screen displays the results of one or more summary statistics. These summary statistics depend on intermediate statistics, which may in turn depend on deeper intermediate statistics, and so on. This backward-chaining approach to development leads to the next principle.
Expect a large number of getters
Most of the development work on a math class involves calculating intermediate and summary values. In practice, this means that you shouldn't be surprised if your class contains many getter methods that calculate intermediate and summary values.
Store intermediate results
Store the results of intermediate calculations within a result object so that you can use them as input for subsequent calculations. This principle is implemented in the S language design. In the current context, it is implemented by choosing instance variables to represent the calculated intermediate values and summary results.
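As a minimal sketch of the principles so far, the following class stores each intermediate value in an instance variable so that later getters can read it. The class name SampleVariance and its members are hypothetical and are not part of the SimpleLinearRegression library; the sketch assumes at least two input values.

```php
<?php
// Hypothetical class, used only to illustrate the principles above:
// one class per analysis, getters for each value, stored intermediate results.
class SampleVariance {
    public $n;
    public $Mean;
    public $SumSquaredDeviations;
    public $Variance;

    function __construct(array $values) {
        $this->n = count($values);
        // Working backward from the summary value: Variance needs
        // SumSquaredDeviations, which in turn needs Mean.
        $this->Mean = $this->getMean($values);
        $this->SumSquaredDeviations = $this->getSumSquaredDeviations($values);
        $this->Variance = $this->getVariance();
    }

    function getMean(array $values) {
        return array_sum($values) / count($values);
    }

    function getSumSquaredDeviations(array $values) {
        $sum = 0.0;
        foreach ($values as $value) {
            // Reads the stored intermediate result instead of recomputing it.
            $sum += pow($value - $this->Mean, 2);
        }
        return $sum;
    }

    function getVariance() {
        // Sample variance: divide by n - 1 (assumes n >= 2).
        return $this->SumSquaredDeviations / ($this->n - 1);
    }
}

$sv = new SampleVariance(array(2, 4, 4, 6));
printf("%01.2f\n", $sv->Variance); // prints 2.67
```

Because every intermediate value is kept as a property, a calling script can inspect $sv->Mean or $sv->SumSquaredDeviations after construction without triggering any recalculation.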
Prefer a verbose API
When developing a naming scheme for the member functions and instance variables in the SimpleLinearRegression class, I discovered that longer names (something like getSumSquaredError instead of getYY2) make it easier to understand what a function does and what a variable means.
I have not completely abandoned abbreviated names; however, when I use a shortened form of a name, I try to provide a comment that fully explains its meaning. My take is this: highly abbreviated naming schemes are common in mathematical programming, but they make it harder than it needs to be to understand a mathematical routine and to prove that it is correct.
Perfection is not the goal
The goal of this coding exercise is not necessarily to develop a highly optimized and rigorous math engine for PHP. In the early stages, emphasis should be placed on learning to implement meaningful analytical tests and solving difficult problems in this area.
Instance Variables
When modeling a statistical test or process, you need to decide which instance variables to declare.
You can select the instance variables by accounting for the intermediate and summary values that the analysis process generates. Each intermediate and summary value can have a corresponding instance variable that holds the value as an object property.
I used this analysis to determine which variables to declare for the SimpleLinearRegression class in Listing 1. Similar analysis can be performed on MultipleRegression, ANOVA, or TimeSeries procedures.
Listing 1. Instance variables of the SimpleLinearRegression class
<?php
// Copyright 2003, Paul Meagher
// Distributed under GPL
class SimpleLinearRegression {
    var $n;
    var $X = array();
    var $Y = array();
    var $ConfInt;
    var $Alpha;
    var $XMean;
    var $YMean;
    var $SumXX;
    var $SumXY;
    var $SumYY;
    var $Slope;
    var $YInt;
    var $PredictedY = array();
    var $Error = array();
    var $SquaredError = array();
    var $TotalError;
    var $SumError;
    var $SumSquaredError;
    var $ErrorVariance;
    var $StdErr;
    var $SlopeStdErr;
    var $SlopeTVal; // T value of Slope
    var $YIntStdErr;
    var $YIntTVal; // T value for Y Intercept
    var $R;
    var $RSquared;
    var $DF; // Degrees of Freedom
    var $SlopeProb; // Probability of Slope Estimate
    var $YIntProb; // Probability of Y Intercept Estimate
    var $AlphaTVal; // T Value for given alpha setting
    var $ConfIntOfSlope;
    var $RPath = "/usr/local/bin/R"; // Your path here
    var $format = "%01.2f"; // Used for formatting output
}
?>
Constructor
The constructor method of the SimpleLinearRegression class accepts an X vector and a Y vector, each with the same number of values. You can also pass a confidence interval for your expected Y values; the default is 95%.
The constructor starts by verifying that the data is in a form suitable for processing. Once the input vectors pass the "equal size" and "more than one value" tests, the core part of the algorithm is executed.
Performing this task involves calculating the intermediate and summary values of the statistical process through a series of getter methods. The return value of each method call is assigned to an instance variable of the class. Storing calculation results in this way ensures that intermediate and summary values are available to calling routines in chained calculations. You can also display these results by calling the output methods of this class. Listing 2 shows the constructor.
Listing 2. Constructor of the SimpleLinearRegression class
<?php
// Copyright 2003, Paul Meagher
// Distributed under GPL
// Declared as __construct because PHP 8 no longer treats a method
// named after its class as the constructor.
function __construct($X, $Y, $ConfidenceInterval="95") {
    $numX = count($X);
    $numY = count($Y);
    if ($numX != $numY) {
        die("Error: Size of X and Y vectors must be the same.");
    }
    if ($numX <= 1) {
        die("Error: Size of input array must be at least 2.");
    }
    $this->n = $numX;
    $this->X = $X;
    $this->Y = $Y;
    $this->ConfInt = $ConfidenceInterval;
    $this->Alpha = (1 + ($this->ConfInt / 100) ) / 2;
    $this->XMean = $this->getMean($this->X);
    $this->YMean = $this->getMean($this->Y);
    $this->SumXX = $this->getSumXX();
    $this->SumYY = $this->getSumYY();
    $this->SumXY = $this->getSumXY();
    $this->Slope = $this->getSlope();
    $this->YInt = $this->getYInt();
    $this->PredictedY = $this->getPredictedY();
    $this->Error = $this->getError();
    $this->SquaredError = $this->getSquaredError();
    $this->SumError = $this->getSumError();
    $this->TotalError = $this->getTotalError();
    $this->SumSquaredError = $this->getSumSquaredError();
    $this->ErrorVariance = $this->getErrorVariance();
    $this->StdErr = $this->getStdErr();
    $this->SlopeStdErr = $this->getSlopeStdErr();
    $this->YIntStdErr = $this->getYIntStdErr();
    $this->SlopeTVal = $this->getSlopeTVal();
    $this->YIntTVal = $this->getYIntTVal();
    $this->R = $this->getR();
    $this->RSquared = $this->getRSquared();
    $this->DF = $this->getDF();
    $this->SlopeProb = $this->getStudentProb($this->SlopeTVal, $this->DF);
    $this->YIntProb = $this->getStudentProb($this->YIntTVal, $this->DF);
    $this->AlphaTVal = $this->getInverseStudentProb($this->Alpha, $this->DF);
    $this->ConfIntOfSlope = $this->getConfIntOfSlope();
    return true;
}
?>
The method names and their order were derived through a combination of backward chaining and reference to an undergraduate statistics textbook that explains step by step how to calculate the intermediate values. Each method name is simply the name of the intermediate value I need to calculate, prefixed with "get".
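The bodies of these getters are not shown in the article. As a rough sketch, the first few might look like the following, assuming that SumXX is the sum of squared deviations of X from its mean and SumXY is the sum of cross-products of the X and Y deviations, as in the standard least squares formulas. The class name LeastSquaresSketch is hypothetical, and the method bodies are assumptions rather than the library's actual code.

```php
<?php
// Hypothetical, trimmed-down class: method names mirror Listing 2, but the
// bodies are assumptions based on the standard least squares formulas.
class LeastSquaresSketch {
    public $X, $Y, $XMean, $YMean, $SumXX, $SumXY, $Slope, $YInt;

    function __construct(array $X, array $Y) {
        $this->X = $X;
        $this->Y = $Y;
        // Each getter result is stored so that later getters can use it.
        $this->XMean = $this->getMean($X);
        $this->YMean = $this->getMean($Y);
        $this->SumXX = $this->getSumXX();
        $this->SumXY = $this->getSumXY();
        $this->Slope = $this->getSlope();
        $this->YInt  = $this->getYInt();
    }

    function getMean(array $data) {
        return array_sum($data) / count($data);
    }

    // Sum of squared deviations of X from its mean.
    function getSumXX() {
        $sum = 0.0;
        foreach ($this->X as $x) {
            $sum += ($x - $this->XMean) * ($x - $this->XMean);
        }
        return $sum;
    }

    // Sum of cross-products of the X and Y deviations.
    function getSumXY() {
        $sum = 0.0;
        foreach ($this->X as $i => $x) {
            $sum += ($x - $this->XMean) * ($this->Y[$i] - $this->YMean);
        }
        return $sum;
    }

    // Least squares slope estimate.
    function getSlope() {
        return $this->SumXY / $this->SumXX;
    }

    // Least squares intercept estimate.
    function getYInt() {
        return $this->YMean - $this->Slope * $this->XMean;
    }
}

$fit = new LeastSquaresSketch(array(1, 2, 3, 4), array(3, 5, 7, 9));
printf("y = %01.2f + %01.2fx\n", $fit->YInt, $fit->Slope); // prints y = 1.00 + 2.00x
```

The toy data lies exactly on the line y = 1 + 2x, which makes the result easy to check by hand.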
Fit the model to the data
The SimpleLinearRegression procedure is used to produce a straight line fit to the data, where the line has the following standard equation:
y = b + mx
The PHP format of this equation looks similar to Listing 3:
Listing 3. PHP equation that fits the model to the data
$PredictedY[$i] = $YIntercept + $Slope * $X[$i];
The SimpleLinearRegression class uses the least squares criterion to derive estimates of the Y-intercept (Y Intercept) and slope (Slope) parameters. These estimated parameters are used to construct a linear equation (see Listing 3) that models the relationship between the X and Y values.
Using the derived linear equation, you can obtain the predicted Y value for each X value. If the linear equation fits the data well, then the observed and predicted values of Y tend to be consistent.
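To make the comparison between observed and predicted values concrete, here is a small sketch that applies the equation from Listing 3 and measures the disagreement. The function names predictY, residualErrors, and sumSquares are hypothetical standalone helpers, not methods of the class, and the slope and intercept are passed in rather than estimated.

```php
<?php
// Hypothetical helpers: apply a fitted line and measure how far the
// observed Y values fall from the predicted ones.
function predictY(array $X, $YInt, $Slope) {
    $PredictedY = array();
    foreach ($X as $i => $x) {
        $PredictedY[$i] = $YInt + $Slope * $x;
    }
    return $PredictedY;
}

function residualErrors(array $Y, array $PredictedY) {
    $Error = array();
    foreach ($Y as $i => $y) {
        $Error[$i] = $y - $PredictedY[$i]; // observed minus predicted
    }
    return $Error;
}

function sumSquares(array $values) {
    $sum = 0.0;
    foreach ($values as $value) {
        $sum += $value * $value;
    }
    return $sum;
}

$X = array(1, 2, 3);
$Y = array(2, 4, 7);
// Least squares estimates for this toy data: slope 2.5, intercept -2/3.
$PredictedY = predictY($X, -2 / 3, 2.5);
$Error = residualErrors($Y, $PredictedY);
printf("Sum of squared errors: %01.4f\n", sumSquares($Error)); // prints 0.1667
```

With least squares estimates the errors sum to zero (up to floating-point rounding), so it is the sum of squared errors, not the plain sum, that indicates how closely the line follows the data.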
How to determine if there is a good fit
The SimpleLinearRegression class generates quite a few summary values. An important one is the T statistic, which measures how well the linear equation fits the data. If the fit is very good, the T statistic tends to be large. If the T statistic is small, the linear equation should be replaced with the default model, which assumes that the mean of the Y values is the best predictor (the mean of a set of values is usually a useful predictor of the next observation).
To test whether the T statistic is large enough to reject the mean of the Y values as the best predictor, you need to calculate the probability of obtaining that T statistic by chance. If that probability is low, you can reject the null hypothesis that the mean is the best predictor and, accordingly, be confident that the simple linear model fits the data well.
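The article does not show how the class computes the T statistic. A minimal sketch, assuming the usual textbook definition (the slope divided by its standard error, with n - 2 degrees of freedom), might look like this. The function name slopeTValue is hypothetical; the local variable names follow the instance variables in Listing 1.

```php
<?php
// Hypothetical function: T statistic for the slope of a simple linear
// regression, assuming the standard textbook formulas.
function slopeTValue(array $X, array $Y) {
    $n = count($X);
    $XMean = array_sum($X) / $n;
    $YMean = array_sum($Y) / $n;

    $SumXX = 0.0;
    $SumXY = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $SumXX += pow($X[$i] - $XMean, 2);
        $SumXY += ($X[$i] - $XMean) * ($Y[$i] - $YMean);
    }
    $Slope = $SumXY / $SumXX;
    $YInt = $YMean - $Slope * $XMean;

    // Sum of squared differences between observed and predicted Y.
    $SumSquaredError = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $SumSquaredError += pow($Y[$i] - ($YInt + $Slope * $X[$i]), 2);
    }

    $DF = $n - 2;                                  // degrees of freedom
    $ErrorVariance = $SumSquaredError / $DF;       // needs n >= 3
    $SlopeStdErr = sqrt($ErrorVariance / $SumXX);  // standard error of slope
    return $Slope / $SlopeStdErr;
}

printf("%01.2f\n", slopeTValue(array(1, 2, 3), array(2, 4, 7))); // prints 8.66
```

The sketch needs at least three data points, and a perfect fit (zero error variance) would lead to a division by zero, so a real implementation would have to guard against both cases.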
So how do you calculate the probability of a T statistic value?
Calculating the probability of the T-statistic value
Since PHP lacks mathematical routines for calculating the probability of a T statistic value, I decided to hand this task off to the statistical computing package R (see www.r-project.org in Resources) to obtain the necessary values. I also want to draw attention to this package because:
R provides many ideas that a PHP developer might emulate in a PHP math library.
With R, it is possible to determine whether the values obtained from the PHP math library are consistent with those obtained from mature, freely available open source statistical packages.
The code in Listing 4 demonstrates how easy it is to hand off to R to get a value.
Listing 4. Handing off to the R statistical package to obtain a value
<?php
// Copyright 2003, Paul Meagher
// Distributed under GPL
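class SimpleLinearRegression {
    var $RPath = "/usr/local/bin/R"; // Your path here

    // Asks R for the value associated with T statistic $T at $df degrees
    // of freedom. Note that dt() is R's density function for the t
    // distribution; pt($T, $df, lower.tail = FALSE) would give the
    // upper-tail probability instead.
    function getStudentProb($T, $df) {
        $Probability = 0.0;
        $cmd = "echo 'dt($T, $df)' | $this->RPath --slave";
        $result = shell_exec($cmd);
        // R prints a scalar as "[1] value"; keep the second field.
        list($LineNumber, $Probability) = explode(" ", trim($result));
        return $Probability;
    }

    // Asks R for the T value (quantile) that corresponds to $alpha.
    function getInverseStudentProb($alpha, $df) {
        $InverseProbability = 0.0;
        $cmd = "echo 'qt($alpha, $df)' | $this->RPath --slave";
        $result = shell_exec($cmd);
        list($LineNumber, $InverseProbability) = explode(" ", trim($result));
        return $InverseProbability;
    }
}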
?>
Note that the path to the R executable has been set and used in both functions. The first function returns the probability value associated with the T statistic based on the Student's T distribution, while the second inverse function computes the T statistic corresponding to the given alpha setting. The getStudentProb method is used to evaluate the fit of the linear model; the getInverseStudentProb method returns an intermediate value, which is used to calculate the confidence interval for each predicted Y value.
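Both functions rely on R printing a scalar result in the form "[1] value" and on the arguments being safe to place inside a shell command. The sketch below shows one way to make those two steps a little more defensive. The function names buildRCommand and parseRScalar are hypothetical and not part of the class; escapeshellarg and preg_split are standard PHP functions.

```php
<?php
// Hypothetical helpers for talking to R; not part of the article's class.

// Builds the shell command, quoting the R expression and the R path so
// that unexpected characters cannot break out of the command.
function buildRCommand($expression, $rPath = "/usr/local/bin/R") {
    return "echo " . escapeshellarg($expression)
         . " | " . escapeshellarg($rPath) . " --slave";
}

// Extracts the number from output shaped like "[1] 0.25".
// Returns null when the output does not have that shape.
function parseRScalar($output) {
    // Split on any run of whitespace, not just a single space.
    $parts = preg_split('/\s+/', trim((string) $output));
    if (count($parts) !== 2 || !is_numeric($parts[1])) {
        return null;
    }
    return (float) $parts[1];
}

echo buildRCommand("qt(0.975, 23)"), "\n";
var_dump(parseRScalar("[1] 0.25\n"));   // a float parsed from the sample string
var_dump(parseRScalar(""));             // null: R produced no output
```

Returning null for unexpected output lets the caller distinguish "R is missing or failed" from a genuine result of 0.0, which the list() and explode() approach cannot do.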
Due to limited space, I cannot detail all the functions in this class one by one, so if you want to understand the terminology and steps involved in simple linear regression analysis, I encourage you to refer to an undergraduate statistics textbook.
Burnout Study
To demonstrate how to use this class, I will use data from a burnout study in a utility. Michael Leiter and Kimberly Ann Meechan studied the relationship between a measure of burnout called the Exhaustion Index and an independent variable called Concentration. Concentration refers to the proportion of people's social contacts that come from their work environment.
To study the relationship between the Exhaustion Index values and the Concentration values for the individuals in their sample, load these values into appropriately named arrays and instantiate the class with them. After instantiating the class, display some of the summary values it generates to evaluate how well the linear model fits the data.
Listing 5 shows a script that loads data and displays summary values:
Listing 5. Script that loads data and displays summary values
<?php
// BurnoutStudy.php
// Copyright 2003, Paul Meagher
// Distributed under GPL
include "SimpleLinearRegression.php";
// Load data from burnout study
$Concentration = array(20,60,38,88,79,87,
68,12,35,70,80,92,
77,86,83,79,75,81,
75,77,77,77,17,85,96);
$ExhaustionIndex = array(100,525,300,980,310,900,
410,296,120,501,920,810,
506,493,892,527,600,855,
709,791,718,684,141,400,970);
$slr = new SimpleLinearRegression($Concentration, $ExhaustionIndex);
$YInt = sprintf($slr->format, $slr->YInt);
$Slope = sprintf($slr->format, $slr->Slope);
$SlopeTVal = sprintf($slr->format, $slr->SlopeTVal);
$SlopeProb = sprintf("%01.6f", $slr->SlopeProb);
?>
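<table border='1' cellpadding='5'>
<tr>
<th align='right'>Equation:</th>
<td>Exhaustion = <?php echo $YInt; ?> + (<?php echo $Slope; ?> * Concentration)</td>
</tr>
<tr>
<th align='right'>T:</th>
<td><?php echo $SlopeTVal; ?></td>
</tr>
<tr>
<th align='right'>Prob &gt; T:</th>
<td><?php echo $SlopeProb; ?></td>
</tr>
</table>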
Running this script through a web browser produces the following output:
Equation: Exhaustion = -29.50 + (8.87 * Concentration)
T: 6.03
Prob > T: 0.000005
The last row of this table indicates that the probability of obtaining such a large T value by chance is very low. It can be concluded that the simple linear model has better predictive power than simply using the mean of the Exhaustion values.
Knowing the concentration of a person's social contacts in the workplace can be used to predict the level of exhaustion that person may be experiencing. The equation tells us that for every 1-unit increase in the Concentration value, the Exhaustion value of a person in the social service field increases by about 8.87 units. This is further evidence that, to reduce potential burnout, individuals in social services should consider making friends outside of their workplace.
This is only a rough sketch of what these results might mean. To fully explore the implications of this data set, you may want to study the data in more detail to make sure this is the correct interpretation. In the next article I will discuss what other analyses should be performed.
What did you learn?
For one, you don't have to be a rocket scientist to develop meaningful PHP-based math packages. By adhering to standard object-oriented techniques and explicitly adopting a backward-chaining approach to problem solving, it is relatively easy to use PHP to implement some of the more basic statistical processes.
From a teaching standpoint, I think this exercise is very useful, if only because it requires you to think about statistical tests or routines at both higher and lower levels of abstraction. In other words, a great way to supplement your study of a statistical test or procedure is to implement it as an algorithm.
Implementing statistical tests often requires going beyond the information given and creative problem solving and discovery. It is also a good way to discover gaps in knowledge about a subject.
On the downside, you will find that PHP lacks built-in support for sampling distributions, which is necessary to implement most statistical tests. You need to hand off to R to get these values, but I'm afraid you may not have the time or interest to install R. Native PHP implementations of some common probability functions could solve this problem.
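To give a sense of what a native implementation could involve, here is a rough sketch of the upper-tail probability of Student's t distribution computed by numerical integration (Simpson's rule) of the density. This is one possible technique chosen for brevity, not the method used by R or by this article's class. The function names are hypothetical, the degrees of freedom must be a positive integer, and very large values of df would overflow the gamma calculation.

```php
<?php
// Hypothetical native-PHP sketch; assumes PHP 7 or later.

// Gamma(k/2) for a positive integer k, using Gamma(x + 1) = x * Gamma(x)
// starting from Gamma(1/2) = sqrt(pi) or Gamma(1) = 1.
function gammaOfHalfInteger(int $k): float {
    $x = ($k % 2 === 1) ? 0.5 : 1.0;
    $value = ($k % 2 === 1) ? sqrt(M_PI) : 1.0;
    while ($x < $k / 2 - 1e-9) {
        $value *= $x;
        $x += 1.0;
    }
    return $value;
}

// Density of Student's t distribution with $df degrees of freedom.
function studentDensity(float $t, int $df): float {
    $c = gammaOfHalfInteger($df + 1)
       / (sqrt($df * M_PI) * gammaOfHalfInteger($df));
    return $c * pow(1 + $t * $t / $df, -($df + 1) / 2);
}

// P(T > $t), found by integrating the density over [0, |t|] with
// Simpson's rule and using the symmetry of the distribution.
// $steps must be even.
function studentUpperTail(float $t, int $df, int $steps = 2000): float {
    $a = abs($t);
    $h = $a / $steps;
    $sum = studentDensity(0.0, $df) + studentDensity($a, $df);
    for ($i = 1; $i < $steps; $i++) {
        $sum += (($i % 2 === 1) ? 4 : 2) * studentDensity($i * $h, $df);
    }
    $area = $sum * $h / 3;   // P(0 < T < |t|)
    $tail = 0.5 - $area;     // P(T > |t|)
    return ($t >= 0) ? $tail : 1.0 - $tail;
}

// With 1 degree of freedom the t distribution is the Cauchy
// distribution, for which P(T > 1) is exactly 0.25.
printf("%01.6f\n", studentUpperTail(1.0, 1));
```

A production-quality version would more likely evaluate the regularized incomplete beta function, which stays accurate in the far tails and for large degrees of freedom; comparing any native implementation against R's results is a sensible check.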
Another problem: the class generates many intermediate and summary values, but the summary output doesn't actually take advantage of them. I've provided some unwieldy output, but it is neither sufficient nor well enough organized for you to adequately interpret the results of the analysis. Actually, I have absolutely no idea how I can integrate the output method into this class. This needs to be addressed.
Finally, making sense of the data requires more than just looking at summary values. You also need to understand how individual data points are distributed. One of the best ways to do this is to graph your data. Again, I don't know much about this, but if you want to use this class to analyze real data, you need to solve this problem.