Delphi and assembly talks (2)

Author：Eve Cole Update Time：2025-01-29 10:48:02

Elementary Optimization

When it comes to optimization, many people are dismissive: "Computers are so fast now, what's the point of being a few percent faster?" There is some truth to this. Modern compilers already optimize their output well, and outside of specific areas such as graphics, image processing, and multimedia, deliberate optimization is unnecessary in most cases. But if you already keep optimization in mind while writing code, you can get the optimization done while maintaining or even improving development efficiency. Why not?

Of course, algorithm design is the core of optimization. In most cases, a program's execution efficiency is determined mainly by the developer's overall grasp of the program and by the design of its algorithms. But sometimes optimizing the details makes sense too!

Moreover, this kind of optimization often does not require writing assembly directly, yet even then a knowledge of assembly proves its worth!

Take the following two functions:

function GetBit(i: Cardinal; n: Cardinal): Boolean;
begin
  Result := Boolean((i shr n) and 1);
end;

function GetBit(i: Cardinal; n: Cardinal): Boolean;
begin
  Result := Boolean((1 shl n) and i);
end;

Corresponding assembly code:

// first function
MOV ECX, EDX
SHR EAX, CL
AND EAX, $01

// second function
MOV ECX, EDX
MOV EDX, $01
SHL EDX, CL
AND EAX, EDX

They do the same thing: each takes the value of bit n of i and returns True if it is 1 and False if it is 0.
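The two approaches can be compared with a small console program. This is only a sketch: the names GetBitShr and GetBitMask are hypothetical, chosen here so both variants can live in one program, and the sketch uses explicit `<> 0` comparisons instead of Boolean casts so that the result does not depend on how a cast narrows the masked value.

```pascal
program GetBitCheck;
{$APPTYPE CONSOLE}

// Hypothetical name: variant that shifts the value down to bit 0.
function GetBitShr(i, n: Cardinal): Boolean;
begin
  Result := ((i shr n) and 1) <> 0;
end;

// Hypothetical name: variant that shifts a mask up to bit n.
function GetBitMask(i, n: Cardinal): Boolean;
begin
  Result := ((Cardinal(1) shl n) and i) <> 0;
end;

var
  n: Cardinal;
  Same: Boolean;
begin
  // Compare the two variants on every bit position of a test pattern.
  Same := True;
  for n := 0 to 31 do
    if GetBitShr($A5A5A5A5, n) <> GetBitMask($A5A5A5A5, n) then
      Same := False;
  WriteLn('variants agree: ', Same);

  // 5 is binary 101, so bits 0 and 2 are set and bit 1 is clear.
  WriteLn(GetBitShr(5, 0), ' ', GetBitShr(5, 1), ' ', GetBitShr(5, 2));
end.
```

Checking all 32 bit positions, not just the low ones, is deliberate: the high bits are where a mask-based test is most likely to go wrong.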

On the surface, you might think the two functions are equally efficient, but there is a difference. The first function shifts i itself. Under register, Delphi's default calling convention, the value of i is already in the register EAX, so the shift can be performed directly. The second function is different: to shift the immediate value 1, it must first be loaded into a register, which costs one more instruction. Of course, fewer instructions are not always faster than more instructions. In actual execution we must also consider the clock cycles each instruction takes and the pairing of instructions (more on this later). Instruction count alone does not settle the question; a comparison is only meaningful in a specific code environment.

Under normal circumstances this difference in efficiency is negligible, but it never hurts to keep optimization in mind while programming! If such code sits in the innermost layer of a loop, a few clock cycles per iteration accumulate over a large number of iterations, and the difference in execution efficiency may become very large!
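One rough way to see whether such a difference matters in a given environment is to time both variants in a tight loop. The program below is a minimal sketch under several assumptions: the function names and the iteration count are hypothetical, Now has coarse resolution, and the cost of the function call itself may well dominate the measurement. Results will vary with compiler version, compiler settings, and CPU.

```pascal
program GetBitTiming;
{$APPTYPE CONSOLE}
uses
  SysUtils, DateUtils;

const
  Iterations = 50000000; // hypothetical loop count, adjust as needed

// Hypothetical name: variant that shifts the value down to bit 0.
function GetBitShr(i, n: Cardinal): Boolean;
begin
  Result := ((i shr n) and 1) <> 0;
end;

// Hypothetical name: variant that shifts a mask up to bit n.
function GetBitMask(i, n: Cardinal): Boolean;
begin
  Result := ((Cardinal(1) shl n) and i) <> 0;
end;

var
  k, Hits: Cardinal;
  Start: TDateTime;
begin
  // Time the first variant; Hits keeps the calls from being useless work.
  Hits := 0;
  Start := Now;
  for k := 1 to Iterations do
    if GetBitShr(k, k and 31) then
      Inc(Hits);
  WriteLn('shift value: ', MilliSecondsBetween(Now, Start), ' ms, hits = ', Hits);

  // Time the second variant on exactly the same inputs.
  Hits := 0;
  Start := Now;
  for k := 1 to Iterations do
    if GetBitMask(k, k and 31) then
      Inc(Hits);
  WriteLn('shift mask:  ', MilliSecondsBetween(Now, Start), ' ms, hits = ', Hits);
end.
```

Both loops should report the same hit count. If they do not, the two variants are not equivalent and the timing comparison is meaningless.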

The above is just a small example. It shows that if you can think about some problems from the perspective of assembly during development, you can write more efficient low-level code in a high-level language without sacrificing development efficiency! Still, there are many cases where detail optimization has to be done with embedded assembly code, and sometimes using embedded assembly also makes the code itself quicker to write.

Suppose you need to reverse the byte order of a 32-bit number. How would you do it entirely in high-level Delphi code? You can use shifts, or call the built-in function Swap several times, but once you think of the BSWAP instruction, it all becomes very simple.

function SwapLong(Value: Cardinal): Cardinal;
asm
  BSWAP EAX
end;

Note: as above, Value is passed in the register EAX, and the 32-bit result is also returned in EAX, so a single instruction is all that is needed.
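For comparison, here is roughly what the shift-based version mentioned earlier might look like in pure Pascal. This is a sketch, not the article's code: the name SwapLongPas is hypothetical, and every term is masked before shifting so that each intermediate value fits in 32 bits.

```pascal
program SwapLongSketch;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical pure-Pascal counterpart of SwapLong: moves each of the
// four bytes to its mirrored position.
function SwapLongPas(Value: Cardinal): Cardinal;
begin
  Result := ((Value and $000000FF) shl 24) or
            ((Value and $0000FF00) shl 8) or
            ((Value and $00FF0000) shr 8) or
            ((Value and $FF000000) shr 24);
end;

begin
  // $12345678 has bytes 12 34 56 78; reversed they read 78 56 34 12.
  WriteLn(IntToHex(Int64(SwapLongPas($12345678)), 8));

  // Swapping twice must give back the original value.
  WriteLn(SwapLongPas(SwapLongPas($DEADBEEF)) = $DEADBEEF);
end.
```

Four masks, four shifts, and three ORs do the work of the single BSWAP instruction, which is the point of the example above.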

Of course, most embedded assembly optimizations are not that simple, and the little assembly taught in college is hardly enough for more in-depth optimization. Experience comes only from continually studying and comparing the assembly code the compiler generates! Fortunately, in most cases detail optimization is not the main part of program design.

However, if the program you are developing involves graphics, images, multimedia, and so on, more in-depth optimization is still necessary! Fortunately, Delphi 6 provides good support, whether for optimizing floating-point instructions or for using MMX, SSE, 3DNow, and the like. Even if you want an earlier version of Delphi to support these CPU extended instruction sets, or want to support new CPU instruction sets in the future, you can do so flexibly in embedded assembly by writing the numeric encoding of the instruction with the four directives DB, DW, DD, and DQ that Delphi supports (Borland's official Delphi 6 language manual only mentions DB, DW, and DD).

For example:

DW $A20F          // CPUID
DW $770F          // EMMS
DB $0F, $6F, $C1  // MOVQ MM0, MM1
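The word constants above look reversed because x86 stores the low byte of a word first, so DW $A20F emits the bytes 0F A2. The sketch below prints the byte order of the two DW values and splits the last byte of the MOVQ encoding into its mod, reg, and rm fields. The procedure name ShowWord is a hypothetical helper introduced only for this illustration.

```pascal
program OpcodeBytes;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: prints a word in the order its bytes are emitted
// on a little-endian CPU (low byte first, then high byte).
procedure ShowWord(const Name: string; W: Word);
begin
  WriteLn(Name, ': ', IntToHex(Lo(W), 2), ' ', IntToHex(Hi(W), 2));
end;

var
  ModRM: Byte;
begin
  ShowWord('CPUID', $A20F); // prints CPUID: 0F A2
  ShowWord('EMMS', $770F);  // prints EMMS: 0F 77

  // Third byte of the MOVQ example: two mod bits, three reg bits,
  // three rm bits. reg selects the destination, rm the source.
  ModRM := $C1;
  WriteLn('mod=', ModRM shr 6,
          ' reg=', (ModRM shr 3) and 7,
          ' rm=', ModRM and 7); // prints mod=3 reg=0 rm=1
end.
```

A mod value of 3 means both operands are registers, and reg=0 with rm=1 corresponds to MM0 as destination and MM1 as source, matching the comment on the DB line.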

Understanding the instructions is only the foundation. After designing the algorithm around FPU, MMX, and SSE, if you want to optimize it further, you must also understand some of the technical characteristics of the CPU itself.

Let’s take a look at the following two pieces of code:

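asm
  ADD [a], ECX    // read, modify, and write memory in one instruction
  ADD [b], EDX
end

asm
  MOV EAX, [a]    // load both values into registers
  MOV EBX, [b]
  ADD EAX, ECX    // modify in registers
  ADD EBX, EDX
  MOV [a], EAX    // store the results back
  MOV [b], EBX
end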

Is the first one more efficient? Wrong. As mentioned above, fewer instructions do not mean higher execution efficiency. According to the relevant documentation, each of the two instructions in the first piece of code takes 3 clock cycles (each has to complete three steps: read, modify, and write), while each of the six instructions in the second piece of code takes 1 clock cycle. So the two pieces of code are equally efficient? Wrong again: the second piece of code actually executes faster than the first! Why? Because CPUs from the Pentium onward have two pipelines for executing instructions, and when two adjacent instructions can be paired, they execute at the same time. So what exactly happens with the two pieces of code above?

Although the two instructions in the first piece of code can be paired, together they need 5 clock cycles rather than 3. The six instructions in the second piece of code, by contrast, pair up and execute in parallel, two per clock cycle, so they finish in 3 clock cycles. That is what leads to this result.

That said, these are all very simple examples, which by themselves cannot help you much. If you really want to optimize a specific program, look for dedicated articles on FPU and MMX optimization, or find the technical manuals and study techniques such as "out-of-order execution" and "branch prediction". I just hope that friends who are still in college will not focus only on the "money-making" development tools and fashionable new technologies, but will spend more time laying a foundation. Only with a solid foundation can you pick up new knowledge, new development tools, and new skills quickly... (a thousand words omitted).

But then again, knowledge is there to solve practical problems. If you focus only on technical details every day, you may become an excellent hacker, but you will never develop first-class software. The fundamental goal must still be to create value. So... I'll stop here; any more and this would no longer look like a technical article. ^_^

Addendum: besides execution efficiency, program optimization must also consider code size (smaller code loads into memory faster and is decoded faster). For example, to clear the EAX register, use SUB EAX, EAX or XOR EAX, EAX instead of MOV EAX, $0. Although each executes in 1 clock cycle, the former two (2 bytes) are clearly shorter than the latter (5 bytes). Since everything above concerns details, code size was not discussed. Most size reduction should be left to the compiler; just pay a little attention to it when writing embedded ASM code.
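The size difference can be seen by writing out the machine-code bytes. The sketch below stores one valid 32-bit encoding of each instruction in a byte array and prints its length. The constant names are hypothetical, and an assembler may choose an equivalent alternative encoding of the same length for the XOR and SUB forms.

```pascal
program ClearEaxSizes;
{$APPTYPE CONSOLE}

const
  // Machine-code bytes for three ways to zero EAX in 32-bit code
  // (hypothetical constant names, used only for this illustration).
  XorEaxEax: array[0..1] of Byte = ($33, $C0);                  // XOR EAX, EAX
  SubEaxEax: array[0..1] of Byte = ($2B, $C0);                  // SUB EAX, EAX
  MovEaxZero: array[0..4] of Byte = ($B8, $00, $00, $00, $00);  // MOV EAX, $0

begin
  WriteLn('XOR EAX, EAX: ', SizeOf(XorEaxEax), ' bytes');  // 2 bytes
  WriteLn('SUB EAX, EAX: ', SizeOf(SubEaxEax), ' bytes');  // 2 bytes
  WriteLn('MOV EAX, $0:  ', SizeOf(MovEaxZero), ' bytes'); // 5 bytes
end.
```

The MOV form is longer because it carries the full 32-bit immediate value after its one-byte opcode, while the XOR and SUB forms need only an opcode byte and one byte naming the two registers.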