Technical preparations for the early stage of a website with millions of visits (Part 2)

Author: Eve Cole  Update Time: 2011-07-13 16:00:51

This last article in the series is written for ordinary programmers. If that does not interest you, you can skip to the last few paragraphs.

Before designing the code structure, let's review what we have prepared so far: load-balanced WEB servers, master-slave DB servers with possible sharding, a cache, and scalable storage. Every aspect of organizing the code is closely tied to these preparations. I will go through them one by one, and each item will open with the stock phrase "As mentioned earlier" for easy comparison.

Before getting to those stock phrases, my mind has wandered, so let me insert a paragraph. In actual development we always compromise between performance and code elegance. With today's computers and language interpreters, questions such as how many layers of object calls there are, or whether to declare a variable as Map or HashMap, are the last things to worry about. Always find the slowest part of the system and start solving from there. For example, check whether the ORM you use does a lot of work you don't need, and whether the same data is fetched repeatedly. What we do is web application development, not low-level framework APIs. Code that is easy to read and understand is a very important part of ensuring quality. What is your program designed for? Different purposes call for different methods... Never mind, this topic deserves a separate discussion; this article has strayed too far. If you want to discuss it, you can follow my Weibo: http://t.sina.com.cn/liuzhiyi . Let's continue...

As mentioned earlier, the WEB servers need to be load balanced and the picture server needs to be separated. Consequently, when the code handles client state, do not keep that state on a single machine; for example, do not use file-based sessions. That is common sense. If possible, develop a unified single sign-on interface for users from the very beginning, covering how to determine login state across domains, how to determine it on static pages, and how the jump and return parameters are defined at login. Provide a good interface at the bottom layer and let the application layer use it directly (see GAE's user service for reference). The login design should take the characteristics of mobile devices into account. For example, desktop browsers can show a floating layer window, but NOKIA's built-in browser or UCWEB cannot handle that kind of presentation. The program must be able to handle both Ajax requests and direct requests through a URL.

With the picture server separated, resource files are best placed on it as well, so that the WEB servers serve only dynamic programs. Development and testing become a little more complicated (because an absolute URI is needed to access the files), but optimizing the page front end later will be much easier, and so will optimizing the IO of your WEB servers. When the program references resource files, it must go through one unified method. Many things can then be done automatically inside that method, such as combining CSS/js into one file, or automatically appending a QUERYSTRING to the generated URI. If a front-end cache service is used, generating a new QUERYSTRING is the simplest way to refresh both the server cache and the client cache.
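To make the "unified method for resource files" concrete, here is a minimal sketch in Python. The host name ASSET_HOST, the function name asset_url, and the idea of deriving the QUERYSTRING from a hash of the file content are all my own assumptions for illustration; a timestamp or release number would serve the same purpose.

```python
import hashlib

# Hypothetical host of the separate picture/static server.
ASSET_HOST = "http://static.example.com"


def asset_url(path, content):
    """Build an absolute URI for a resource file, with a version QUERYSTRING.

    The version is derived from the file content, so the URI changes only
    when the file changes. A changed URI makes front-end caches and browser
    caches fetch the new file.
    """
    version = hashlib.sha256(content).hexdigest()[:8]
    return "{}/{}?v={}".format(ASSET_HOST, path.lstrip("/"), version)


# Same path, different content: the generated URIs differ.
css_old = asset_url("/css/site.css", b"body { color: black; }")
css_new = asset_url("/css/site.css", b"body { color: navy; }")
print(css_old)
print(css_new)
```

Because every template goes through this one function, moving the files to another host or changing the versioning rule later touches a single place.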

As mentioned earlier, the database will be replicated, may have multiple masters and multiple slaves, and may be sharded. It is best to abstract the way our program handles data and put it in a separate layer. Taking the popular MVC pattern as an example, this means putting a data layer below the M layer. This data layer is not the familiar JDBC/PDO/ActiveRecord and the like, but your own data access layer, which exposes only methods to the outside and hides the details of data access. Don't worry if the code inside this layer is ugly, but it must provide every data storage function; nothing that deals with the database should appear at any other layer. The reason is that with a single relational database you may write SELECT...JOIN... or INSERT...INTO... directly, but later you may move some tables into a key-value database or shard them, and then all the original statements and methods need to change. If they are scattered everywhere, the migration will take a lot of effort, or you will end up with a huge Model.

In the data-level design, try to avoid JOIN queries. We can use more redundancy and more caching, fetch each type of data with its own query, and combine the results in the program. For more complex combinations of data where real-time requirements are not strict, use asynchronous processing, so that users only fetch the already processed result when they visit. For primary keys, avoid auto-increment IDs and use unique values generated by some rule instead; such a primary key is itself the simplest sharding distribution strategy. Even if you do use auto-increment IDs, it is best to get them from a dedicated ID generator. Otherwise, if a slave database is written to by accident, primary keys will easily conflict.

As mentioned earlier, there are caches sitting in front of our database. Don't treat MySQL's query cache as a cache; once the application becomes even slightly complex, QUERY CACHE turns into a burden. Caching is tightly bound to both the database and the business, and because it is tied to the business there is no one-size-fits-all approach. Still, there are some rules to follow.

Rule 1: The closer to the front end, the coarser the cache granularity. For example, the whole page is cached at the very front of the WEB tier, parts of a page one level down, and the individual records within those parts one level further down. The closer we are to the back end, the more flexibly we can operate, and the front-end code, which changes the most, becomes easier to write. In practice, because product requirements change quickly and iteration cycles keep getting shorter, it is sometimes hard to separate the Controller from the Model cleanly, and some caching at the Controller level is unavoidable. When that happens, make sure the cache the Controller operates on does not affect any other consumer of the data, that is, the cached data must be used by that Controller only.

Rule 2: The program must not break when the cache is missing. Leaving aside the avalanche effect caused by cache failure, your program must behave the same with the cache as without it. It must not be like Sina Weibo, where once the cache fails the fan and Weibo lists come up empty and the whole application is in a mess. Where the cache really is indispensable, showing the user an error message is better than showing misleading information.

Rule 3: Cache updates must be atomic or thread-safe. Especially with passive caching, it is quite likely that two users' visits trigger an update of the same cache entry at the same time. Usually this is not a big problem, since the cache can be rebuilt after it expires, but it may be one of the triggers of a chain reaction.

Rule 4: Caching also has a cost, not only a technical cost but also a cost in labor time. If the foreseeable traffic makes little difference to a feature whether it is cached or not, while caching would add complexity, then there is no need for it. We can add a TODO mark and add the caching in a later iteration.
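Rules 2 and 3 can be sketched in a few lines of Python. This is only an illustration under my own assumptions: DictCache is an in-memory stand-in for a real cache, CacheUnavailable is a made-up exception for "the cache is down", and the single lock only protects threads inside one process. Across several WEB servers the same idea would need an atomic operation in the cache service itself.

```python
import threading


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the cache is down."""


class DictCache:
    """In-memory stand-in for a real cache server."""

    def __init__(self):
        self.data = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise CacheUnavailable(key)
        return self.data.get(key)

    def set(self, key, value):
        if not self.available:
            raise CacheUnavailable(key)
        self.data[key] = value


class CachedLoader:
    """Passive cache: read the cache first, fall back to the database."""

    def __init__(self, cache, load_from_db):
        self._cache = cache
        self._load = load_from_db
        self._lock = threading.Lock()

    def get(self, key):
        try:
            value = self._cache.get(key)
        except CacheUnavailable:
            # Rule 2: without the cache, answer from the database.
            return self._load(key)
        if value is not None:
            return value
        # Rule 3: one thread rebuilds the entry, the others wait and re-check.
        with self._lock:
            try:
                value = self._cache.get(key)
            except CacheUnavailable:
                value = None
            if value is None:
                value = self._load(key)
                try:
                    self._cache.set(key, value)
                except CacheUnavailable:
                    pass  # serving uncached data is still correct
            return value


db_calls = []


def load_user(key):
    db_calls.append(key)          # record every trip to the "database"
    return "user-" + key


cache = DictCache()
users = CachedLoader(cache, load_user)
print(users.get("42"))   # first call loads from the database and fills the cache
print(users.get("42"))   # second call is served from the cache
cache.available = False
print(users.get("42"))   # cache down: still answers, straight from the database
```

Falling back to the database keeps the answers correct, but it does nothing against the avalanche effect mentioned in Rule 2; limiting the load on the database when the cache is gone is a separate problem.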

As mentioned earlier, file storage is independent, so all file operations are remote calls. You can expose a very simple RESTful interface on the file server, or provide an xmlrpc or json service. Every file the WEB servers generate or process is handed to the file server through that interface, and the WEB servers themselves do not need to store any files. This is why uploading an image and saving the article are done in two separate steps on many large websites.
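The two-step flow can be sketched like this in Python. The class FileService, its put and get methods, and the function save_article are hypothetical names; a real client would send these calls over HTTP to the file server, while here a dict stands in for the remote side so that the example runs on its own.

```python
import hashlib


class FileService:
    """Client-side interface to the file server.

    A real implementation would make remote calls; the dict below is only
    a stand-in for the remote storage.
    """

    def __init__(self):
        self._remote = {}

    def put(self, data):
        """Store the bytes remotely and return an identifier for them."""
        file_id = hashlib.sha256(data).hexdigest()[:16]
        self._remote[file_id] = data
        return file_id

    def get(self, file_id):
        return self._remote[file_id]


def save_article(files, articles, title, image_bytes):
    # Step 1: hand the upload to the file server and get an identifier back.
    image_id = files.put(image_bytes)
    # Step 2: save the article with a reference to the file, not the file itself.
    articles.append({"title": title, "image_id": image_id})
    return image_id


files = FileService()
articles = []
image_id = save_article(files, articles, "First post", b"fake image bytes")
print(articles[0])
```

The WEB server never writes the image to its own disk; it only keeps the identifier returned by the file server.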

All of the "As mentioned earlier" points above have in fact been made by countless people; I have merely repeated them in my own words based on the previous articles. Analyzed properly, the essence is very simple: besides layering the functional logic well, we also need to design separate interfaces for calls to external resources such as database storage, caching, queues, and file services. You can imagine your program as running on Amazon EC2 and using all of its web services: your database is its SimpleDB, your queue is its SQS, and your storage is its S3. The only difference is that Amazon's interfaces are remote calls while yours are internal calls.

Putting the supporting services behind interfaces means that replacing MySQL with PostgreSQL requires no changes to the business code, and the migration team hardly needs to communicate with the business development team. It means that the business development team programs against interfaces rather than against the database. It means that performance will not be dragged down by one business developer's mistake.
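As a small illustration of programming against the interface, here is a sketch in Python. The interface UserStore and its two implementations are hypothetical; SQLite (from the standard library) stands in for the relational database and a dict stands in for a key-value store. The business function greeting is the same line for line whichever backend is plugged in.

```python
import sqlite3
from abc import ABC, abstractmethod


class UserStore(ABC):
    """The interface the business code is written against."""

    @abstractmethod
    def add(self, user_id, name):
        ...

    @abstractmethod
    def name_of(self, user_id):
        ...


class SqlUserStore(UserStore):
    """Relational backend; in-memory SQLite stands in for the real server."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    def add(self, user_id, name):
        self._db.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))

    def name_of(self, user_id):
        row = self._db.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None


class KeyValueUserStore(UserStore):
    """Key-value backend; a dict stands in for the real store."""

    def __init__(self):
        self._data = {}

    def add(self, user_id, name):
        self._data[user_id] = name

    def name_of(self, user_id):
        return self._data.get(user_id)


def greeting(store, user_id):
    """Business code: depends only on the UserStore interface."""
    name = store.name_of(user_id)
    return "Hello, {}!".format(name) if name else "Hello, guest!"


for store in (SqlUserStore(), KeyValueUserStore()):
    store.add(1, "zhiyi")
    print(greeting(store, 1), greeting(store, 2))
```

Swapping the backend is then a decision made where the store is constructed, not in the business code.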

If the programming primer above does not interest you, start reading here:

Once the product design is finished and the program framework is built, this is the point where conflicts tend to arise. Product designers keep complaining that their ideas were not realized as intended, and programmers complain that the product design is unrealistic. These complaints mostly arise because product people do not understand technology and technical people do not understand the product. In a broad sense, the product includes market strategy, marketing methods, and functional design. When product and technology argue, the argument usually centers on a feature, but the real issue is the cost of implementing the feature and the benefit it can bring: whether one can be traded for the other, and whether both can be accommodated. If they can, the dispute is resolved; if not, flip a coin and trust to luck. There are plenty of examples of metrics surging because one feature was enhanced, and of opportunities being missed because a project was delayed. Radical decision-makers focus on benefits, conservative decision-makers focus on losses, and smart decision-makers ask whether the problem is really that serious.

No one can predict what will happen in the future; otherwise people would not say that starting a business is half luck. But some things can be stated with certainty, and for those you have to let the data speak.

Maybe not 100%, but 99.9% of websites have traffic statistics code installed, and even my http://zhiyi.us is no exception; Xinwen Network, after all, is always talking about scientific decision-making and scientific development. With statistics, many things can be determined. For example, you can work out which channels have the lowest acquisition cost per user from the source-to-target conversion rate, infer the reasons for user bounce rates from source and content visits, and judge whether a link is placed sensibly from user click behavior. Combine the data in different ways to find internal connections, analyze internal and external factors, formulate strategies accordingly, and cut down on decisions made by slapping one's forehead. Using data to support operations is a very professional matter. I don't understand profound mathematical models or complex formulas, but gradually working out that B happens because of A, and C because of A and B, is still relatively simple.

Author of the article: Liu Zhiyi (please indicate the source link and author when reprinting)

Article source: http://zhiyi.us/