In web applications, paginating a large database result set is a well-known problem. Simply put, you don't want all of the query results displayed on a single page, so a paged display is more appropriate. While this was not a trivial task in classic ASP, in ASP.NET the DataGrid control reduces the process to just a few lines of code. So paging in ASP.NET is very easy, but the default DataGrid paging event reads all of the records from the database and loads them into the ASP.NET web application. When you have more than a million rows, this causes serious performance problems (if you don't believe it, run such a query in your application and watch the memory consumption of aspnet_wp.exe in Task Manager). That is why custom paging behavior is necessary: it guarantees that only the records required for the current page are fetched.
There are many articles and posts about this problem on the Internet, as well as some mature solutions. My purpose in writing this article is not to show you a stored procedure that will solve all your problems, but to optimize the existing methods and give you an application to test with, so that you can develop further according to your own needs.
But I am not very satisfied with the methods currently presented online. First, some of them use traditional ADO and are obviously written for "ancient" ASP. The remaining methods are SQL Server stored procedures, and some of those are unusable because of their slow response times, as you can see from the performance results at the end of this article; still, a few of them caught my attention.
Generalization
I want to carefully analyze the three currently common methods: temporary tables (TempTable), dynamic SQL (DynamicSQL), and row count (RowCount). In what follows I prefer to call the second method the ascending-descending (Asc-Desc) method. I don't think dynamic SQL is a good name, because dynamic SQL logic can be applied in the other methods as well. The common problem with all of these stored procedures is that you have to specify in advance which columns you are going to sort on, not just the primary key columns (PK columns), and that leads to a series of problems: for every query you want to display through paging, you need a separate paging query for each different sort column, which means you either write a different stored procedure for each sort column (regardless of which paging method is used), or you put this functionality into a single stored procedure with the help of dynamic SQL. These two approaches have only a small impact on performance, but they increase the maintenance burden, especially if you need to display different queries this way. Therefore, in this article I will try to generalize all the stored procedures using dynamic SQL, but for several reasons only partial generality can be achieved, so you will still have to write independent stored procedures for complex queries.
The second problem with allowing arbitrary sort columns, in addition to the primary key columns, is that if those columns are not properly indexed, none of these methods will help. In all of them, the paging source must be sorted first, and for large tables the cost of sorting on a non-indexed column is far from negligible. In that case none of the stored procedures can be used in practice because of the long response times (anywhere from a few seconds to several minutes, depending on the size of the table and the first record to be fetched). Indexes on the other columns can introduce additional unwanted performance problems of their own; for example, they can become very expensive if you import a lot of data every day.
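For illustration only - the table and column names below (Orders, OrderDate) are borrowed from the Northwind-style examples used later in this article, not from my test table - a plain nonclustered index on the sort column is usually the minimum prerequisite for any of the paging methods below to respond acceptably:

-- Sketch: index the column the pages are sorted on, so the paging source
-- does not have to sort the whole table on every call.
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate ON Orders (OrderDate)

Whether such an index is worth its maintenance cost during heavy daily imports is exactly the trade-off mentioned above.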
Temporary table
First, I am going to talk about the temporary table method. This is a widely recommended solution that I have encountered several times in my projects. Let's take a look at the essence of this method:
CREATE TABLE #Temp(
    ID int IDENTITY PRIMARY KEY,
    PK /* here goes the PK type */
)

INSERT INTO #Temp SELECT PK FROM Table ORDER BY SortColumn

SELECT ... FROM Table JOIN #Temp temp ON Table.PK = temp.PK
WHERE temp.ID > @StartRow AND temp.ID < @EndRow
ORDER BY temp.ID
By copying all the rows into the temporary table we can further optimize the query (SELECT TOP EndRow ...), but the bottom line is the worst case: a table containing 1 million records will produce a temporary table with 1 million records.
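To make the placeholders concrete, here is a minimal sketch of the same idea against the Northwind Orders table, fetching page 2 with a page size of 10 (the literal bounds and names are mine, purely for illustration):

CREATE TABLE #Temp(
    ID int IDENTITY PRIMARY KEY,
    PK int NOT NULL
)

-- copy every key, already sorted, into the temp table
INSERT INTO #Temp (PK)
SELECT OrderID FROM Orders ORDER BY OrderDate

-- rows 11 to 20 = page 2 with a page size of 10
SELECT o.*
FROM Orders o JOIN #Temp t ON o.OrderID = t.PK
WHERE t.ID > 10 AND t.ID <= 20
ORDER BY t.ID

DROP TABLE #Temp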
Considering this situation, and looking at the results from the article mentioned above, I decided to abandon this method in my tests.

Ascending-descending

This method uses the default sort order in the subquery and the reversed order in the main query. The principle is as follows:
DECLARE @temp TABLE(
    PK /* PK type */ NOT NULL PRIMARY KEY
)

INSERT INTO @temp
SELECT TOP (@PageSize) PK FROM
(
    SELECT TOP (@StartRow + @PageSize)
        PK,
        SortColumn
        /* If the sorting column is different from the PK, SortColumn must
           be fetched as well; otherwise just the PK is necessary */
    FROM Table
    ORDER BY SortColumn
    /* default order - typically ASC */
) AS t
ORDER BY SortColumn
/* reversed default order - typically DESC */

SELECT ... FROM Table JOIN @temp temp ON Table.PK = temp.PK
ORDER BY SortColumn
/* default order */
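Again purely as an illustration, here is a sketch of the ascending-descending idea against the Northwind Orders table, fetching page 2 with a page size of 10, sorted by OrderDate (the names and literals are my own assumptions):

DECLARE @temp TABLE(PK int NOT NULL PRIMARY KEY)

INSERT INTO @temp (PK)
SELECT TOP 10 PK FROM
(
    -- the first 20 rows in the default (ascending) order: pages 1 and 2
    SELECT TOP 20 OrderID AS PK, OrderDate
    FROM Orders
    ORDER BY OrderDate ASC
) AS FirstTwoPages
ORDER BY OrderDate DESC    -- keep only the last 10 of those rows, i.e. page 2

SELECT o.*
FROM Orders o JOIN @temp t ON o.OrderID = t.PK
ORDER BY o.OrderDate ASC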
Row count

This method relies on the SET ROWCOUNT statement in SQL Server, which makes it possible to skip the unnecessary rows and fetch only the required ones:
DECLARE @Sort /* the type of the sorting column */

SET ROWCOUNT @StartRow
SELECT @Sort = SortColumn FROM Table ORDER BY SortColumn

SET ROWCOUNT @PageSize
SELECT ... FROM Table WHERE SortColumn >= @Sort ORDER BY SortColumn
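As a concrete sketch (again against the Northwind Orders table, page 2 with a page size of 10, sorted by OrderDate; the literals are only illustrative):

DECLARE @Sort datetime          -- the type of the sorting column (OrderDate)

SET ROWCOUNT 11                 -- position on row 11, the first row of page 2
SELECT @Sort = OrderDate FROM Orders ORDER BY OrderDate

SET ROWCOUNT 10                 -- then fetch one page worth of rows
SELECT * FROM Orders WHERE OrderDate >= @Sort ORDER BY OrderDate

SET ROWCOUNT 0                  -- reset ROWCOUNT for the rest of the session

Note that this sketch assumes the sort column values are reasonably selective; duplicate values around the page boundary can shift a row between pages.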
Subquery

There are two other methods that I have considered, and their sources are different. The first is the well-known triple query, or self-query, method. In this article I use a similar generalized logic that encompasses all the other stored procedures. The idea is the same here (I made some reductions to the original code, since the record count is not needed in my tests):
SELECT ... FROM Table WHERE PK IN
(
    SELECT TOP @PageSize PK FROM Table WHERE PK NOT IN
    (
        SELECT TOP @StartRow PK FROM Table ORDER BY SortColumn
    )
    ORDER BY SortColumn
)
ORDER BY SortColumn
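Concretely (same illustrative Northwind setup: page 2, page size 10, sorted by OrderDate), the triple query looks like this:

SELECT * FROM Orders WHERE OrderID IN
(
    -- the 10 keys of page 2: the next 10 keys after the keys of page 1
    SELECT TOP 10 OrderID FROM Orders
    WHERE OrderID NOT IN
    (
        SELECT TOP 10 OrderID FROM Orders ORDER BY OrderDate
    )
    ORDER BY OrderDate
)
ORDER BY OrderDate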
Cursor
While browsing the Google discussion groups I found the last method. It uses a server-side dynamic cursor. Many people try to avoid cursors because of their inefficiency and their sequential, row-by-row nature, but looking back, paging is actually a sequential task: no matter which method you use, you have to get to the starting row. In the previous methods you select all the rows before the starting row plus the required rows, and then discard all the preceding rows. A dynamic cursor has the FETCH RELATIVE option, which performs this jump in one magical step. The basic logic is as follows:
DECLARE @PK /* PK type */
DECLARE @tblPK TABLE(
    PK /* PK type */ NOT NULL PRIMARY KEY
)

DECLARE PagingCursor CURSOR DYNAMIC READ_ONLY FOR
SELECT PK FROM Table ORDER BY SortColumn

OPEN PagingCursor
FETCH RELATIVE @StartRow FROM PagingCursor INTO @PK

WHILE @PageSize > 0 AND @@FETCH_STATUS = 0
BEGIN
    INSERT @tblPK(PK) VALUES(@PK)
    FETCH NEXT FROM PagingCursor INTO @PK
    SET @PageSize = @PageSize - 1
END

CLOSE PagingCursor
DEALLOCATE PagingCursor
SELECT ... FROM Table JOIN @tblPK temp ON Table.PK = temp.PK
ORDER BY SortColumn
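A concrete sketch of the cursor method, with the same illustrative assumptions (Northwind Orders, page 2, page size 10, sorted by OrderDate), might look like this:

DECLARE @PK int
DECLARE @PageSize int
DECLARE @tblPK TABLE(PK int NOT NULL PRIMARY KEY)
SET @PageSize = 10

DECLARE PagingCursor CURSOR DYNAMIC READ_ONLY FOR
SELECT OrderID FROM Orders ORDER BY OrderDate

OPEN PagingCursor
FETCH RELATIVE 11 FROM PagingCursor INTO @PK   -- jump straight to row 11, the first row of page 2

WHILE @PageSize > 0 AND @@FETCH_STATUS = 0
BEGIN
    INSERT @tblPK(PK) VALUES(@PK)
    FETCH NEXT FROM PagingCursor INTO @PK
    SET @PageSize = @PageSize - 1
END

CLOSE PagingCursor
DEALLOCATE PagingCursor

SELECT o.* FROM Orders o JOIN @tblPK t ON o.OrderID = t.PK
ORDER BY o.OrderDate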
Generalization of complex queries

I pointed out earlier that all the stored procedures use dynamic SQL to achieve generalization, so in theory they can work with any kind of complex query. Below is an example of a complex query based on the Northwind database.
SELECT Customers.ContactName AS Customer,
    Customers.Address + ', ' + Customers.City + ', ' + Customers.Country AS Address,
    SUM([Order Details].UnitPrice * [Order Details].Quantity) AS [Total money spent]
FROM Customers
INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID
INNER JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID
WHERE Customers.Country <> 'USA' AND Customers.Country <> 'Mexico'
GROUP BY Customers.ContactName, Customers.Address, Customers.City, Customers.Country
HAVING SUM([Order Details].UnitPrice * [Order Details].Quantity) > 1000
ORDER BY Customer DESC, Address DESC
The paging stored procedure call that returns the second page looks like this:
EXEC ProcedureName
/* Tables */
'
Customers
    INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID
    INNER JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID
',
/* PK */
'
Customers.CustomerID
',
/* ORDER BY */
'
Customers.ContactName DESC, Customers.Address DESC
',
/* Page number */
2,
/* Page size */
10,
/* Fields */
'
Customers.ContactName AS Customer,
Customers.Address + '', '' + Customers.City + '', '' + Customers.Country AS Address,
SUM([Order Details].UnitPrice * [Order Details].Quantity) AS [Total money spent]
',
/* Filter */
'
Customers.Country <> ''USA'' AND Customers.Country <> ''Mexico''
',
/* Group By */
'
Customers.CustomerID, Customers.ContactName, Customers.Address,
Customers.City, Customers.Country
HAVING SUM([Order Details].UnitPrice * [Order Details].Quantity) > 1000
'
It is worth noting that the original query uses aliases in the ORDER BY clause, but you had better avoid doing this in a paging stored procedure, because skipping the rows before the start of the page is the most time-consuming part. In fact, there are many possible implementations, but the principle is not to fetch all the fields at the beginning, only the primary key column (which plays the role of the sort column in the RowCount method); this lets that step finish as fast as possible. Only for the requested page are all the required fields fetched. Moreover, there are no field aliases in the final query, and in the row-skipping queries the indexed columns must be used.
There is another problem with the RowCount stored procedure: to achieve generalization, only one column is allowed in the ORDER BY clause. This is also a problem for the ascending-descending method and the cursor method: although they can sort on several columns, the primary key must consist of a single field. I suppose this could be solved with more dynamic SQL, but in my opinion it is not worth it. While such situations are possible, they do not occur very often; usually you can apply the principles above and write an independent paging stored procedure for those queries.
Performance Testing
In the tests I used these four methods; if you have a better one, I would be interested to know about it. In any case, I needed to compare these methods and evaluate their performance. My first idea was to write an ASP.NET test application with a paging DataGrid and measure the page results, but of course that would not reflect the true response time of the stored procedures, so a console application is more suitable. I also included a web application, not for performance testing, but as an example of DataGrid custom paging and the stored procedures working together.
In the tests I used an automatically generated large table into which I inserted about 500,000 records. If you do not have such a table to experiment with, you can click here to download a script with the table design and a stored procedure for generating the data. Instead of an auto-increment primary key column, I used a uniqueidentifier to identify the records. If you use the script mentioned above, you might consider adding an auto-increment column after generating the table; the auto-increment values will be ordered numerically according to the primary key, which also means the paging stored procedure with sorting on the primary key will be used to fetch the data for the current page.
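For example, a minimal sketch of that step (TestTable is just my placeholder for whatever name the generation script uses):

-- Add an auto-increment (IDENTITY) column to the already populated test table.
ALTER TABLE TestTable ADD ID int IDENTITY(1,1) NOT NULL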
To carry out the performance test, I called each stored procedure a number of times in a loop and then calculated the average response time. Because of caching, and in order to model the real situation more accurately - the time needed to fetch data for the same page over multiple calls of a stored procedure is usually not suitable for evaluation - the page number requested on each call of the same stored procedure should be random. Of course, we have to assume that the number of pages is fixed, somewhere between 10 and 20, so data for a given page number may be fetched many times, but randomly.
One thing we can easily notice is that the response time is determined by how far the page to be fetched is from the beginning of the result set: the further from the beginning, the more records have to be skipped. That is also the reason why I did not include the first 20 pages in my random sequence. As an alternative, I used the 2^n-th pages, and the size of the loop is the number of different pages needed x 1000, so each page is fetched almost 1,000 times (there will certainly be some deviation due to randomness).
Results
Here are my test results:
Conclusion
The tested methods, ordered from the best performer to the worst, are: row count, cursor, ascending-descending, and subquery. One interesting thing is that people usually rarely visit pages beyond the first five, so the subquery method may suit your needs in such cases, depending on the size of your result set and how frequently you predict distant pages will be requested; you are also likely to use a combination of these methods. If it were me, I would prefer the row count method in any case; it works quite well, even for the first page. The "any case" here excludes the cases where generalization is difficult; in those cases I would use a cursor. (For the first two I would probably use the subquery method, and the cursor method beyond that.)