Going the assembler way

Sometimes, when you absolutely have to squeeze every last bit of performance out of the code, there is only one solution left: rewrite it in assembler. My response to any such idea is always the same: don't do it! Rewriting code in assembler is almost always much more trouble than it is worth.

I do admit that there are legitimate reasons for writing assembler code. I looked around and quickly found five areas where assembler is still widely used: memory managers, graphical code, cryptography routines (encryption, hashing), compression, and interfacing with hardware.

Even in these areas, the situation changes quickly. I tested some small assembler routines from the graphical library GraphicEx and was quite surprised to find that they are not significantly faster than the equivalent Delphi code.

The biggest gain from assembler comes when you process a large buffer of data (such as a bitmap) and perform the same operation on every element. In such cases, you may be able to use the SSE2 instructions, which process several values at once and run circles around the scalar code that the Delphi compiler generates.

As assembler is not my game (I can read it, but I can't write well-optimized assembler code), my example is extremely simple. The code in the demo program, AsmCode, implements a four-dimensional vector (a record with four floating-point fields) and a function that multiplies two such vectors field by field:

type
  TVec4 = packed record
    X, Y, Z, W: Single;
  end;

function Multiply_PAS(const A, B: TVec4): TVec4;
begin
  Result.X := A.X * B.X;
  Result.Y := A.Y * B.Y;
  Result.Z := A.Z * B.Z;
  Result.W := A.W * B.W;
end;

As it turns out, this is exactly the kind of operation that can be implemented with instructions from the SSE family. In the code shown next, the first movups moves vector A into register xmm0. The second movups does the same for vector B and register xmm1. Then the magical instruction mulps multiplies the four single-precision values in register xmm0 by the four single-precision values in register xmm1 and leaves the products in xmm0. At the end, movups copies the result of the multiplication into the function result:

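function Multiply_ASM(const A, B: TVec4): TVec4;
asm
  movups xmm0, [A]       // load the four fields of A (unaligned load)
  movups xmm1, [B]       // load the four fields of B
  mulps  xmm0, xmm1      // multiply all four pairs of values at once
  movups [Result], xmm0  // store the four products into the function result
end;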

Running the test shows a clear winner. While Multiply_PAS needs 53 ms to multiply 10 million vectors, Multiply_ASM does the same in less than half the time: 24 ms.
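The timing loop itself is not shown above, so the following is only a minimal sketch of how such a measurement could be set up. It assumes TStopwatch from System.Diagnostics as the timer. The program name, the constant CIterations, and the input values are made up for illustration, and the absolute numbers will differ from machine to machine:

```pascal
program TimeMultiply;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics;

type
  TVec4 = packed record
    X, Y, Z, W: Single;
  end;

function Multiply_PAS(const A, B: TVec4): TVec4;
begin
  Result.X := A.X * B.X;
  Result.Y := A.Y * B.Y;
  Result.Z := A.Z * B.Z;
  Result.W := A.W * B.W;
end;

const
  CIterations = 10000000; // 10 million multiplications, as in the text

var
  A, B, R: TVec4;
  I: Integer;
  Timer: TStopwatch;

begin
  // Arbitrary input values, chosen only for the illustration
  A.X := 1; A.Y := 2; A.Z := 3; A.W := 4;
  B := A;

  Timer := TStopwatch.StartNew;
  for I := 1 to CIterations do
    R := Multiply_PAS(A, B);
  Timer.Stop;

  Writeln('Multiply_PAS: ', Timer.ElapsedMilliseconds, ' ms');
  Writeln('Last W value: ', R.W:0:1); // use the result so the loop has a visible effect
end.
```

To compare both implementations, the same loop would be repeated with Multiply_ASM in place of Multiply_PAS.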

As you can see in the previous example, an assembler block starts with the asm keyword and ends with end. With the Win32 compiler, you can mix Pascal and assembler code inside one method. The Win64 compiler does not allow this: in 64-bit mode, a method must be written either in pure Pascal or in pure assembler.
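To illustrate what mixing looks like, here is a small sketch of a Win32-only function that places an asm block between ordinary Pascal statements. The function SumTimesTwo and the program name are hypothetical and not part of the AsmCode demo. The stdcall directive is used here so that both parameters live on the stack and can be addressed by name, and @Result is the inline assembler's symbol for the function result:

```pascal
program MixedAsm;

{$APPTYPE CONSOLE}

// Compiles with the Win32 compiler only. The Win64 compiler rejects
// asm blocks nested inside a begin..end body.
function SumTimesTwo(X, Y: Integer): Integer; stdcall;
begin
  asm
    MOV EAX, X        // load the first parameter
    ADD EAX, Y        // add the second parameter
    MOV @Result, EAX  // store the sum into the function result
  end;
  Result := Result * 2; // back in Pascal: continue working with the result
end;

begin
  Writeln(SumTimesTwo(2, 3));
end.
```

For the 64-bit compiler, the whole function would have to be written either as a plain begin..end Pascal body or as a single asm..end body.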

The asm statement is supported only by the Windows and OS X compilers. In older sources, you'll also find the assembler directive, which is kept only for backward compatibility and does nothing.

I'll end this short excursion into the assembler world with some advice. Whenever you implement a part of your program in assembler, please also create a Pascal version. The best practice is to use a conditional symbol, PUREPASCAL, as a switch. With this approach, we could rewrite the multiplication code as follows:

function Multiply(const A, B: TVec4): TVec4;
{$IFDEF PUREPASCAL}
begin
  Result.X := A.X * B.X;
  Result.Y := A.Y * B.Y;
  Result.Z := A.Z * B.Z;
  Result.W := A.W * B.W;
end;
{$ELSE}
asm
  movups xmm0, [A]
  movups xmm1, [B]
  mulps xmm0, xmm1
  movups [Result], xmm0
end;
{$ENDIF}
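Because the asm statement is not available on every platform, one possible refinement is to define PUREPASCAL automatically whenever the target is not an Intel-compatible CPU. The following is a sketch under that assumption. It relies on the compiler's predefined CPUX86 and CPUX64 symbols, and the program name and messages are made up for the illustration:

```pascal
program PurePascalSwitch;

{$APPTYPE CONSOLE}

// Fall back to the Pascal implementation on any target that is
// neither 32-bit nor 64-bit Intel-compatible.
{$IF not (Defined(CPUX86) or Defined(CPUX64))}
  {$DEFINE PUREPASCAL}
{$IFEND}

begin
  {$IFDEF PUREPASCAL}
  Writeln('Pascal implementation selected');
  {$ELSE}
  Writeln('Assembler implementation selected');
  {$ENDIF}
end.
```

In a real project, the {$IF} block would more likely live in a shared include file, so that every unit with an assembler variant sees the same setting. Defining PUREPASCAL in the project options still forces the Pascal code on Intel targets.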