



# Using MMX Technology for image processing



| Student:    | Andreas Wierda                             |
|-------------|--------------------------------------------|
| Date:       | June 18, 1998                              |
| Company:    | Océ Technologies B.V.                      |
|             | P.O. box 101                               |
|             | 5900 MA Venlo                              |
| Study:      | Noordelijke Hogeschool Leeuwarden          |
|             | Engineering department                     |
|             | Technical Computer Science                 |
| Supervisors | :ir. R.J.P. Rutten (Océ Technologies B.V.) |
|             | ir. G.J. van Woudenberg                    |
|             | drs. P.J.A. van Eijk.                      |

# Preface

This report describes the results of my graduation project at the Research & Development (R&D) department of Océ Technologies B.V. in Venlo.

It is assumed that the reader is slightly familiar with computer architectures, assembly and C++ programming, but not with MMX Technology. Furthermore, it is assumed that the reader is not familiar with image processing.

During my period at Océ I have worked in pleasant way with a number of people. I specially would like to thank my supervisor, Rob Rutten. Furthermore I would like thank Marco Krom, Jan Jacobs and my fellow graduate students for the pleasant corporation. Especially Arjan v.d. Ven and Jurgen Rosendahl, who where researching familiar subjects. Finally, I would like to thank Klaas Jan Wierda for introducing me at Océ.

Venlo, June 18 1998,

Andreas Wierda

2

# Summary

This report describes the usage of Intel's MMX Technology in image processing applications. The primary goal is to provide the reader with knowledge on the possible problems when using MMX for image processing algorithms. Additionally an indication of the possible performance improvement is given for a number of algorithm types (compared to a C++ implementation).

Before discussing implementations of specific algorithms the basic concepts of digital copying are explained. After that the Pentium processor architecture is discussed. Instead of describing the hardware this discussion focuses on the consequences of hardware features for the programmer. The most important issues to take in account when writing code are the superscalar architecture and optimal usage of the memory bandwidth (caches). Finally, MMX Technology is discussed by explaining each of the MMX instruction types.

The following algorithms have been implemented:

- Smooth
- Sharpen
- Color conversion (lookup table with interpolation)
- Halftoning (error diffusion)

The discussion of the algorithm implementations starts with an introduction to each of the implemented algorithms. This is followed by a discussion of the main considerations made during the implementation of the algorithm in C++ and with MMX. The main considerations fall in the field of optimal image traversing and the way pixels can be processed in parallel. Next, the performance of the C++ and MMX versions are compared. Finally, an analysis of the processing time of the MMX version is made to determine the amount of processing time required for each part of the algorithm.

In the next chapter the impact of recent CPU developments on the way image processing algorithms can be implemented is discussed. Additionally the possibilities of a number of alternative CPU's are discussed (Intel Pentium II, AMD K6-2, Cyrix 6x86MX, IDT Winchip C6 and Motorola AltiVec).

In the last chapter it is concluded that the main problems encountered during the implementation of an algorithm are caused by shortcomings of the Pentium processors architecture. Additionally for some algorithms it is difficult to process pixels in parallel and very few MMX development tools are available. Using MMX results in a performance gain of a factor two to five, depending on the algorithm type.

Océ Technologies B.V. enabled the author of this report to do the research that formed the basis for this report. Océ-Technologies B.V. is not liable for the correctness of the contents and the conclusions in this report. The author is solely responsible for the contents of this report.

4

# Contents

| 1 Introduction                                       | 7  |
|------------------------------------------------------|----|
| 1.1 Background                                       | 7  |
| 1.2 Project definition                               |    |
| 1.3 Plan of action                                   |    |
|                                                      |    |
| 2 Project settings                                   |    |
| 2.1 Image processing for digital copiers             |    |
| 2.1.1 Digital copying                                |    |
| 2.1.2 Color                                          |    |
| 2.2 The processing path                              |    |
| 2.3 The test system                                  |    |
| 2.4 The Pentium Processor with MMX Technology        |    |
| 2.4.1 General features                               |    |
| 2.4.2 The superscalar architecture                   |    |
| 2.4.3 Instruction cycle times                        |    |
| 2.4.4 Caches                                         |    |
| 2.4.5 MMX Technology                                 |    |
| 2.5 Performance measurement                          |    |
|                                                      |    |
| 3 Algorithm implementation                           | 21 |
| 3.1 Introduction                                     |    |
|                                                      |    |
| 3.2 RGB Color plane separation                       |    |
| 3.2.1 The algorithm                                  |    |
| 3.2.2 The implementation                             |    |
| 3.3 Adding borders                                   |    |
| 3.3.1 The algorithm                                  |    |
| 3.3.2 The implementation<br>3.4 The smooth algorithm |    |
| 3.4.1 The algorithm                                  |    |
|                                                      |    |
| 3.4.2 C++ implementation<br>3.4.3 MMX implementation |    |
|                                                      |    |
| 3.4.4 The basic loop structure                       |    |
| 3.5 The sharpen algorithm                            |    |
| 3.5.1 The algorithm                                  |    |
| 3.5.2 C++ implementation                             |    |
| 3.5.3 MMX implementation                             |    |
| 3.6 RGB to CMYK conversion                           |    |
| 3.6.1 The algorithm                                  |    |
| 3.6.2 C++ implementation                             |    |
| 3.6.3 MMX implementation                             |    |
| 3.7 The halftoning algorithm                         |    |
| 3.7.1 The algorithm                                  |    |
| 3.7.2 C++ implementation                             |    |
| 3.7.3 MMX implementation                             |    |
| 3.8 Bits to Bytes conversion                         |    |
| 3.8.1 The algorithm                                  |    |
| 3.8.2 The implementation                             |    |
| 3.9 Removing borders                                 |    |
| 3.9.1 The algorithm                                  |    |
| 3.9.2 The implementation                             |    |
| 3.10 The total print processing path                 |    |
| 3.10.1 'Warming up' effects                          |    |
| 3.10.2 Algorithm performance                         |    |
| 3.10.3 Improvements of the processing path           |    |

| 4 Architecture developments                  | 95 |
|----------------------------------------------|----|
| 4.1 The IA32 architecture                    | 95 |
| 4.1.1 Intel Pentium Pro                      |    |
| 4.1.2 Intel Pentium II                       |    |
| 4.1.3 AMD                                    | 98 |
| 4.1.4 Cyrix                                  | 98 |
| 4.1.5 IDT                                    | 99 |
| 4.2 The PowerPC architecture                 | 99 |
|                                              |    |
| 5 Evaluation & conclusion                    |    |
| 5.1 Implementation issues                    |    |
| 5.1 Implementation issues                    |    |
|                                              |    |
| 5.1 Implementation issues<br>5.2 Performance |    |
| 5.1 Implementation issues<br>5.2 Performance |    |

7

| 1                       | Introduction                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |  |  |
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| 1.1                     | Background                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |  |  |
| GRT                     | Océ Technologies B.V. is one of the worlds leading producers of copiers, printers and plotters. In order to further expand Océ's market position, the Research & Development (R&D) department continuously improves existing products and develops new ones.<br>Within the R&D department, the Group Research and Technology (GRT) focuses on long term subjects. It investigates possibilities of new technologies for Océ and searches for emerging markets which might be interesting for Océ. |  |  |
| Image processing        | The last few years the well known analog copiers are gradually<br>being replaced by digital copiers, which achieve a much better<br>image quality. This image quality is, among others, achieved by<br>applying digital image processing algorithms. The hard- and<br>software required for this image processing accounts for a<br>considerable amount of the cost price. Therefore continuous<br>research is done on how to do more and better image processing<br>with less cost.              |  |  |
|                         | Since an image processing device, such as a digital copier,<br>usually contains a personal computer (PC), a considerable cost<br>reduction could be achieved when this PC is used for the<br>required image processing. However, currently the image<br>processing performance of PC's is not enough to satisfy the<br>large demands made. According to Intel MMX Technology can<br>improve the image processing performance of PC's significantly.                                               |  |  |
| 1.2                     | Project definition                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |  |  |
| Problems<br>Performance | The primary goal of this project is to investigate the problems<br>encountered when implementing image processing algorithms<br>with MMX Technology. Additionally, the possible performance<br>gain when using MMX must be determined for a number of<br>algorithm classes.                                                                                                                                                                                                                       |  |  |
|                         | <ul> <li>The following algorithms will be implemented in a demo program:</li> <li>Smoothing</li> <li>Sharpening</li> <li>Color conversion</li> <li>Halftoning</li> </ul>                                                                                                                                                                                                                                                                                                                          |  |  |
| 1.3                     | Plan of action                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |  |  |
|                         | Before implementing any algorithms the Pentium architecture<br>will be studied. Additionally general knowledge of digital copying<br>and its specific properties has to be acquired.                                                                                                                                                                                                                                                                                                              |  |  |
|                         | In the main phase of the project the four algorithms will be<br>implemented. The implementation of an algorithm starts with an<br>investigation of the way the algorithm operates. Next, a<br>reference C++ implementation of the algorithm will be built. This<br>is done to understand the operation of the algorithm better<br>before implementing the algorithm with MMX. After this an MMX<br>version will be implemented. In this report the entire                                         |  |  |

development phase will be described to give the reader maximum insights in the problems encountered. Finally the performance of the C++ and the MMX version of the algorithm will be compared. To gain insights in the weaknesses and bottlenecks of MMX an analysis of the processing time will be made.

During the final phase of the project a number of alternative architectures will be investigated, focusing on the differences with MMX.

| 2                | Project settings                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |  |  |
|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| 2.1              | Image processing for digital copiers                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |  |  |
| 2.1.1            | Digital copying                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |  |  |
| Image processing | A digital copier consists essentially of an expensive scanner, a computing-core for image processing and a fast print engine.<br>Although desktop scanners and printers might seem to have ideal characteristics from the users point of view, real world scanners and printers are far from ideal and a lot of image processing has                                                                                                                                                                                                                                                                                                                                                                                 |  |  |
|                  | to be performed to obtain an acceptable image quality.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |  |  |
| 2.1.2            | Color                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |  |  |
|                  | Color can be represented using several methods. Each of these methods has its own advantages and is often specific for a certain application. The demo application uses two different color representations: RGB and CMYK.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |  |  |
| 2.1.2.1          | RGB                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |  |  |
| Human eye        | RGB (Red, Green, Blue) is a well known color representation<br>which is among others used for television and scanning<br>applications. The RGB representation is based on the way the<br>human eye perceives color. The inner back of the human eye is<br>covered with a lot of small sensors, called 'cones'. There are<br>four types of cones; one type is sensitive to all light, the other<br>three to Red, Green and Blue light respectively. To represent a<br>color in for example a scanner, the Red, Green and Blue color-<br>bases are sufficient.<br>If all three color planes of a pixel are 255 (the upper limit), the<br>pixel will be white. If all three planes are zero the pixel will be<br>black. |  |  |
| 2.1.2.2          | СМҮК                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |  |  |
| Print engine     | The RGB representation causes problems when printing,<br>because on the white paper a white pixel should be represented<br>with no ink. Therefore in print engines the complementary CMY<br>(Cyan, Magenta, Yellow) color representation is used. A white<br>pixel can then be represented with no ink, a black pixel with ink<br>of all three colors. Because of the non-ideal characteristics of the<br>ink however, the black pixel created this way has a somewhat<br>brownish color. Since this is unacceptable for text, blacK is<br>added, forming the CMYK color representation.                                                                                                                             |  |  |
| 2.2              | The processing path                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |  |  |
|                  | There are many variations in processing-paths. The structure<br>which depends on both the desired image quality and restrictions<br>on the computational resources and architecture. Likewise, many<br>variations exist for the individual algorithms.                                                                                                                                                                                                                                                                                                                                                                                                                                                               |  |  |
|                  | In order to implement a wide range of algorithm classes, it was<br>chosen to implement both a scan (processing) path and a print<br>path. For algorithms such as the halftoning a number of different<br>version of varying complexity exist. It was chosen to implement                                                                                                                                                                                                                                                                                                                                                                                                                                             |  |  |

one of the versions, based on which the problems expected when implementing the others are predicted.

Both the sequence of the algorithms as well as the specific choices of algorithms of both paths are typical for Océ. Figure 1 shows the sequence of filter algorithms for both paths, where the right path represents the print path and the left the scan path. The italic algorithms are necessary to be able to show the processed images on a PC screen. In the processing path of a digital copier these algorithms are not necessary.



Figure 1: Processing path

Before implementing each of these algorithms a short description of them will be given, including example images.

# The test system

The development platform was a Compaq Deskpro 4000 which was also used for the performance measurements. The relevant characteristics of this PC are:

- Pentium 166 MHz Processor with MMX Technology (See section 2.4).
- An internal memory of 64 MB.
- A 256 KB level two cache.
- 2.4 GB hard disk.
- Windows NT 4.0 Workstation (Service Pack 3).

| Windows NT 4.0 | The default parameters of Windows NT were not changed,      |
|----------------|-------------------------------------------------------------|
|                | unless mentioned otherwise. The measures taken to achieve a |
|                | more constant performance are described in section 2.5.     |

Pentium 166 with MMX

2.3

# 2.4 The Pentium Processor with MMX Technology

## 2.4.1 General features

In 1997 Intel introduced the Pentium processor with MMX<sup>1</sup>. There were several changes that made the Pentium considerably faster than its predecessor, the 486:

- A dual pipeline architecture vs. a single pipeline on the 486. This so called superscalar architecture is described in section 2.4.2.
- MMX Technology<sup>2</sup> allows parallel data processing. MMX is described in section 2.4.5.
- The reduction of cycle times, especially of floating point instructions. See section 2.4.3 for more information on cycle times.
- A new branch prediction feature allows the processor to predict the destination of a branch, eliminating branch delays.
- A 64-bit databus, improving databus bandwidth (32 bit on the 486).
- 16 KB code- and 16KB datacache vs. 8K combined on the 486 (2x8 KB on the Pentium without MMX). The cache architecture and its influence on the performance is discussed in section 2.4.4.
- The branch prediction logic allows the processor to predict the destination of a branch, eliminating branch delays.
- Higher clock speeds further increased the CPU speed. On the 486 clock speeds varied from 25 MHz to 100 MHz. At its introduction the Pentium was available in a 60 and a 66 MHz version. Later versions reached clock speeds up to 233 MHz.

Some other features of the Pentium with MMX are:

- Dual processing configuration
- Fractional bus support, allowing the CPU to run at an higher clock speed than the external databus. The external databus of a Pentium CPU always runs at 66 MHz, while versions with internal clock speeds of 166, 200 and 233 MHz are available.

#### The superscalar architecture

On the predecessor of the 486, the 386 Intel introduced a pipelined architecture. Such an architecture refers to an architecture where the CPU executes each portion of an instruction in different stages. When a stage is completed, another one begins executing in the first stage, while the previous instruction moves to the second stage.

2.4.2

<sup>&</sup>lt;sup>1</sup> There are a number of differences between the conventional Pentium and the Pentium with MMX. The most important one obviously is MMX. Additionally, the Pentium with MMX has an improved branch prediction and larger caches. The conventional Pentium is not discussed further.

<sup>&</sup>lt;sup>2</sup> MMX is sometimes believed to stand for MultiMedia eXtensions or Matrix Math eXtensions. According to Intel however MMX is not an acronym.

| Pipelining       | <ul> <li>A five stage pipeline might be divided into the following stages:</li> <li>1. Fetch instruction opcode from memory</li> <li>2. Decode instruction opcode<sup>1</sup></li> <li>3. Calculate operand<sup>2</sup> address and read operand from memory</li> <li>4. Execute instruction</li> <li>5. Write instruction result</li> </ul>                                                                                                                                                                                                                                                                                               |
|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                  | If each stage takes 1 cycle this architecture might reach execution times of 1 cycle per instruction, compared to 5 cycles per instruction for the non-pipelined equivalent.                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| Pipeline delays  | There are two situations where there is a significant delay in the pipeline.<br>First, if a jump is encountered, the entire pipeline has to be flushed and reloaded.<br>Second, if one of the instructions takes more than one cycle to read the operands, while the other stage still require one cycle, the preceding stages of the pipeline will stall, since they have to wait for the next stage to complete the previous instruction.                                                                                                                                                                                                |
| ʻu' and ʻv'-pipe | The superscalar architecture introduced with the Pentium<br>basically consists of two parallel pipelines, each divided into five<br>stages. When running at maximum speed, two instruction results<br>per clock cycle can be produced. The pipelines of the Pentium<br>are called the 'u'-pipe and the 'v'-pipe.                                                                                                                                                                                                                                                                                                                           |
|                  | Unfortunately only the 'u'-pipe can execute all instructions. The 'v'-pipe can only execute 'simple' integer instructions, floating point and MMX instructions. Simple integer instructions are for example move instructions, the various additions/subtractions and logical instructions.<br>Memory accesses can only be made from the 'u'-pipe. Another exception is made for read-write dependencies; if the first instruction changes an operand of the second instruction they cannot be executed parallel, and the second pipeline will stall until the first instruction finishes.                                                 |
| Pairing          | When two instructions can be executed simultaneously, i.e. the<br>first in the 'u'-pipe, and the second in the 'v'-pipe they are said to<br>pair. A pairing rate of for example 50% indicates that half of the<br>instructions can be paired with another instruction, thus reducing<br>execution times by 25% (assuming that all instruction take the<br>same number of cycles).<br>The complete list of paring rules is described in "Pentium<br>Processor Family Developers Manual" [5], chapter two.<br>Examples and further information on paring issues can be found<br>in "Pentium Processor Optimization Tools" [9], chapter nine. |
| Pairing a filter | After a filter has been programmed the code has to be paired.<br>When pairing code the statements are re-ordered in such a way<br>that that the CPU can execute two instructions parallel.<br>Intel provides the tuning program VTune. This program<br>simulates the behaviour of a CPU, and logs the required<br>processing time. VTune also indicates why two instructions will<br>not pair.                                                                                                                                                                                                                                             |

<sup>&</sup>lt;sup>1</sup> The hexadecimal code of the executed statement. <sup>2</sup> The data element the operation is performed on or with.

| General pairing rules | <ol> <li>Although it is not necessary to learn the complete list of pairing rules, a few basic rules are still necessary:</li> <li>If the first instruction modifies a register that is used by the second they will not pair.</li> <li>Only one MMX shift unit is available. This means only one MMX shift instruction can be performed, either in the 'u'- or in the 'v'-pipe. The pack and unpack instructions also use the shift unit.</li> <li>MMX memory moves can only be issued in the 'u'-pipe.</li> <li>Add instructions do not pair with MMX memory moves.</li> <li>When writing data to memory the data has to be ready one cycle in advance before the memory move.</li> </ol> |  |
|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
|                       | These are a few of the paring rules for MMX instructions. To indicate the complexity of the total set of pairing rules; the complete list by Intel is 11 pages long.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |  |
|                       | It is important to keep the pairing in mind when programming the<br>filter; use all registers instead of using one register multiple<br>times for a number of tasks. This allows two (independent)<br>blocks of code to be merged, running the first block in the 'u'-<br>pipe, the second in the 'v'-pipe. Care also has to be taken not to<br>use non-pairable instructions, especially for non-mmx<br>instructions.                                                                                                                                                                                                                                                                      |  |
| Perfect pairing       | A pairing rate of 100% is possible by moving read and calculate operations from the beginning of the loop to the end of the loop <sup>1</sup> . This is necessary because when pairing the code, instructions that can be delayed are shifted down, and instructions that use recourses not used by other instructions are moved up. At the end of the loop however, it is not possible to move instructions down, commonly resulting in a number of not pairable instructions. If some instructions from the top of the loop are moved to the end, it is possible to pair them with the instructions that otherwise would not be pairable.                                                 |  |
|                       | It was mentioned before to take care not to switch related<br>statements. However, especially memory dependencies are<br>easily overseen. It is therefore inevitable that mistakes will be<br>made. If these mistakes are not noticed the consequences are<br>disastrous; it is virtually impossible to find bugs in paired code<br>because the logic between statements is completely lost <sup>2</sup> .                                                                                                                                                                                                                                                                                  |  |

<sup>&</sup>lt;sup>1</sup> And of course just before the beginning of the loop. If this is skipped, the first loop cycle will not perform all the required operations. <sup>2</sup> While pairing the smooth algorithm a mistake was made. The mistake was made in a group of

<sup>&</sup>lt;sup>2</sup> While pairing the smooth algorithm a mistake was made. The mistake was made in a group of about ten statements, but undoing the moves was not possible because Visual C++ had been shut down. After spending three hours searching for the misplaced statement. it was decide to start pairing all over again with a non-paired version of the filter.

| Guidelines for safe pairing | <ol> <li>In this paragraph some guidelines are given that help limit the consequences of errors made while pairing the code:         <ol> <li>Back up previous versions.</li> <li>Number each statement before starting to pair the code.</li> <li>After pairing about ten to twenty statements, test the filter code.</li> </ol> </li> <li>Test the filter by subtracting the filters result byte-wise from the result of the unpaired code previously stored, for example with Paint Shop Pro. Errors can cause subtle changes in the image.</li> <li>Do not close Visual C++ before the new version of the code passed a thorough test. If Visual C++ is closed, it is not possible to undo actions.</li> </ol> <li>Do not use test images that contain large areas of the same color. Such images are sometimes processed correct, although the code is corrupted.</li> |  |
|-----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| 2.4.3                       | Instruction cycle times                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |  |
|                             | One of the major improvements of the Pentium compared to the 486 is the reduction of instruction cycle times. For example a floating point add takes 23 to 72 cycles on a 386, 8 to 32 on a 486 and 1 to 3 on a Pentium. Besides the floating point instructions, quite a number of other instructions were also improved. The return instruction for example takes 11 cycles on a 386, 5 on a 486 and just 2 on a Pentium.                                                                                                                                                                                                                                                                                                                                                                                                                                                 |  |
| Memory accesses             | The cycle times for instructions that read and write to memory<br>are based upon the assumption that the data can be read (or<br>written) from the level one cache. In most situations however,<br>that will not be the case, and severe delays will occur. For more<br>information on cache and memory issues see section 2.4.4.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |  |
| 2.4.4                       | Caches                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |  |
| Code and data cache         | The first series of the Pentium had a separate code and data cache of each 8 KB, vs. 8 KB combined on the 486. There are some advantages in having the code and data cache separated. This separation generates fewer internal bus conflicts that could cause delays. But more important, this allows the code cache to contain additional information about each byte in the cache, for example instruction pairing information.                                                                                                                                                                                                                                                                                                                                                                                                                                           |  |
| Cache lines                 | The later versions of the Pentium, those who where equipped<br>with MMX Technology, have a separate code- and data cache of<br>16 KB each. The cache consists of 512 lines of 32 bytes each.<br>Every time a data item is read which is not cached, the<br>processor will read an entire cache line from memory. The cache<br>lines are always aligned to a physical address divisible by 32.<br>When a data element is spread over multiple cache lines, a read<br>takes at least 3 clock cycles extra because the CPU can not read<br>data spread over multiple cache lines. (The read is split up in<br>two reads, after which the data is merged). This is called a Data<br>Cache Unit (DCU) Split.                                                                                                                                                                     |  |
| Misaligned data             | Another delay of 3 clock cycles occurs when reading misaligned data. Intel recommends data to be aligned on the boundaries shown in the list below for optimal performance <sup>1</sup> :                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |  |

<sup>&</sup>lt;sup>1</sup> Source: "Intel Architecture Optimization Manual " [8], section 3.4.2.

| • | Bytes (8 bit):        | On any boundary.                     |
|---|-----------------------|--------------------------------------|
| • | Words (16 bit):       | Align data to be contained within an |
|   |                       | aligned 4-byte doubleword.           |
| • | Doublewords (32 bit): | On any boundary which is a multiple  |
|   |                       | of four.                             |
|   |                       |                                      |

Quadwords (64 bit): On any boundary which is a multiple of eight.

For example, a quadword read from address 0x02F01 is not aligned (light gray in Figure 2), and takes three clock cycles more to read than a quadword on address 0x02F08 (dark gray).



The cache is set-associative. This means that a cache line can not be assigned to arbitrary memory addresses. To determine the cache line and the position within a cache line, a physical address is split up like shown in Figure 3.

| bit 31                                               | 12 | 11 5            | 4 0             |  |
|------------------------------------------------------|----|-----------------|-----------------|--|
|                                                      |    | cache set value | byte nr in line |  |
| Figure 2: Manning a physical address to a sache line |    |                 |                 |  |

Figure 3: Mapping a physical address to a cache line

The seven bits for the cache set value are used to address 128 set values. The four least significant bits indicate the byte number of the physical address within the cache line.

For each of the 128 set values, four cache lines are available. Thus, a cache line address is composed as shown in Figure 4.

| bit  | 8 2              | 1  | 0   |     |
|------|------------------|----|-----|-----|
|      | cache set value  |    |     |     |
| Figu | re 4: Cache line | ac | ldı | ess |

The seven cache set value bits are the same bits as in the physical address, allowing 128 set values. The two least significant bits allow four cache lines per set value.

The consequence of this is that the cache can hold no more than four different data blocks with the same set value (bits 5 to 11) of the address.

Level two cache Level two cache Evel two Evel two cache Evel two Evel two cache Evel two Evel two Evel two Evel two Evel two Evel two Evel Evel two Evel two Evel Evel two Evel two

Set associative

| Memory             | If the data is not available in the level two cache, it has to be<br>read from memory. This takes between 200 and 300 ns. For our<br>166 MHz CPU this would be 33 to 50 clock cycles.                                                                                                                                                                                                                                                                                                                                                                                                    |  |  |  |  |  |  |
|--------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|--|--|
|                    | When a write is performed to an address which is not in the level<br>one cache, then the value will go right through to the level cache<br>or to the RAM, depending upon the implementation of the<br>motherboard. This takes approximately 100 ns, causing a delay<br>of 17 cycles (on our 166 MHz system).<br>The Pentium with MMX has four write buffers. This means that<br>up to four unfinished writes to uncached memory can be done<br>without delaying the subsequent instructions.                                                                                             |  |  |  |  |  |  |
| Address generation | The last cache related issue we will discuss here is the so called<br>Address Generation Interlock (AGI). This delay is caused by the<br>fact that it takes one clock cycle to calculate the address needed<br>by an instruction which accesses memory. Normally, this<br>calculation is done in a separate stage of the pipeline, while the<br>preceding instruction is executing. But if the address depends on<br>the result of an instruction executing in the preceding clock<br>cycle, an extra cycle is needed in order to calculate the address.<br>This is called an AGI stall. |  |  |  |  |  |  |
| 2.4.5              | MMX Technology                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |  |  |  |  |  |  |
|                    | In 1996 Intel introduced MMX Technology. Basically MMX provides a set of SIMD (Single Instruction Multiple Data) instructions, allowing processing of multiple data elements in one instruction.                                                                                                                                                                                                                                                                                                                                                                                         |  |  |  |  |  |  |
| Data types         | <ul> <li>MMX Technology introduces 57 new instructions, 4 new data types and 8 MMX registers. The basic data type is the packed fixed point integer, which can be used in the following configurations:</li> <li>Packed byte. Eight bytes combined in a 64 bit quantity.</li> <li>Packed word. Four words packed in a 64 bit quantity.</li> <li>Packed doubleword.</li> <li>Quadword. One 64 bit quantity.</li> </ul>                                                                                                                                                                    |  |  |  |  |  |  |
|                    | The first three data types are called packed data. The MMX instructions operate on groups of eight bytes, four words two doublewords or one quadword, as shown in Figure 5.                                                                                                                                                                                                                                                                                                                                                                                                              |  |  |  |  |  |  |



Figure 5: MMX data types

MMX Registers The MMX registers MM0 to MM7 are aliased on the general purpose floating point registers R0 to R7 and are used in the same way.

MMX Technology introduces a new arithmetic mode known as saturation arithmetic Saturation can best be explained by contrasting it with wraparound mode. In wraparound mode results that over- or underflow are truncated and only the least significant bits are returned. In saturation mode the results are clipped to the data range limits shown in Table 1.

|               | Lowe   | r limit   | Upper limit |           |  |  |
|---------------|--------|-----------|-------------|-----------|--|--|
|               | (hex)  | (decimal) | (hex)       | (decimal) |  |  |
| Signed byte   | 0x80   | -128      | 0x7F        | 127       |  |  |
| Signed word   | 0x800  | -32,768   | 0x7FFF      | 32,767    |  |  |
| Unsigned byte | 0x00   | 0         | 0xFF        | 255       |  |  |
| Unsigned word | 0x0000 | 0         | 0xFFFF      | 65,536    |  |  |

Table 1: Data type ranges

To perform actions on packed data types, Intel provides the following types of instructions:

- Arithmetic instructions
- Comparison instructions
- Conversion instructions
- Logical instructions
- Shift instructions
- Data transfer instructions
- The "Empty-MMX-State" instruction (EMMS)

Since limitations of the MMX instruction set have serious consequences on the way algorithms must be implemented, we will discuss some details of the MMX instruction set here. For a complete list of all MMX instructions, see appendix A.

Arithmetic instructions MMX provides most common arithmetic instructions, such as addition, subtraction and multiplication. The latter can only be performed on packed words, the other on packed words and packed bytes. The packed add, subtract and the packed multiply and add can also be performed on packed doublewords. However, none of the arithmetic instructions operates on a

quadword. A packed divide instruction is not available. Division can be made either with the shift-right instruction (divide by a power of 2) or by multiplying with the reciprocal.

Comparison instructions Since a packed data type consists of multiple data elements, a compare instruction could not be implemented with a single zero flag as is common for normal compare instructions. Instead, the compare instruction generates a bit mask as shown<sup>1</sup> in Figure 6.



Conversion instructions The conversion instructions are used to convert packed data from one type to another. To perform an extend operation, for example to convert packed bytes to words, the packed unpack instruction must be used. If in the example in Figure 7 the mm/m32 operand is filled with zeroes, the mm operand is extended from the packed byte data type to packed words.



Figure 7: Unpack operation (punpcklbw mm, mm/m32)

The complement of the unpack is the pack instruction (see Figure 8). This instruction can be used to shrink packed data types, for example from packed words to bytes.



Figure 8: Pack operation (PACKUSWB MM,MM/M64)

PACKUSWB stands for "Pack Unsigned with Saturation Words to Bytes". Thus, when the value of a unsigned packed word exceeds the range of the unsigned packed byte, the result is saturated.

The packed logical instructions are similar to those that operate on non-packed data types, except that the result is placed in a bit mask in the first (=destination) operand, similar to the packed compare instructions.

Logical instructions

 $<sup>^{1}</sup>$  mm = MMX register.

m32 = Memory location (32 bits).

m64 = Memory location (64 bits).

Shift instructions The MMX instruction set has three shift instructions; logical shift left and right and arithmetic shift right. The arithmetic shift right instruction is needed to divide -by a power of 2-. The difference between the arithmetic shift and the logical shift is that the latter also shifts the sign bit, unlike the arithmetic shift which skips this bit. The arithmetic shift left can be done with a logical shift left combined with some code to preserve the sign bit.

Data transfer instructions Data transfers can performed with the move instructions. When using these instructions to move packed data between registers an exact copy of the destination register is made. Care has to be taken when moving packed data from memory to an MMX register; the byte on the memory location with the highest address is placed in the most significant byte of the packed data type, as show in Figure 9.



Figure 9: MMX memory move operation (MOVQ MM,M64)

- Immediate operandsMost instructions do not accept immediate operands, for<br/>example to add '1' to all of the eight packed bytes. They only<br/>accept MMX registers or memory operands. An exception is<br/>made for the shift instructions, which accept an immediate<br/>operand for the number of bits to shift.'EMMS' instructionA special MMX instruction is the EMMS instruction. This<br/>instruction clears the internal state registers of the MMX unit.<br/>The EMMS instruction is necessary because the MMX registers
  - The EMMS instruction is necessary because the MMX registers are mapped to the floating point registers. The next floating point instruction after the EMMS instruction takes 57 cycles extra to execute, thus mixing floating point and MMX instructions must be avoided at all costs.
- Cycles times As far as the cycle times of regular MMX instructions are concerned; they all take one cycle, except for instructions that use the MMX multiply unit<sup>1</sup>. The results of these instructions are available three cycles after the instruction has been started. An average throughput of one instruction per cycle can be reached by starting a number of result-independent multiplies consecutively.

# 2.5 Performance measurement The time necessary to apply each filter is measured by reading the time stamp counter register of the Pentium CPU before and after the call to the filter function. The time stamp counter

the time stamp counter register of the Pentium CPU before and after the call to the filter function. The time stamp counter register is a 64 bits wide register that is incremented one each clock cycle.

<sup>&</sup>lt;sup>1</sup> The multiply unit is used by the packed multiply and the packed multiply-and-add instructions.

| Cycles/pixel             | The time needed to process one pixel is calculated by dividing the number of clock cycles that passed by the number of pixels processed.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Multitasking             | <ul> <li>Windows NT is a multitasking operating system. This implies that an applications performance also depends the other processes running. Even if all other applications are shut down, in the background Windows NT still runs internal processes, called services, such as the spooler service and the TCP/IP service. To reach constant performance of the filter, all services possible where shut down. Two services can not be shut down:</li> <li>1. The eventlog service .</li> <li>2. The RPC service.</li> </ul>                                                                                                                                                                                                                                                                                  |
|                          | It is expected that the influence of these services on the filter<br>applications performance will be minimal, because they are just<br>two of the nine services running normally. Besides shutting down<br>most services, the test application has been given maximum<br>scheduling priority <sup>1</sup> .                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| Disk swap                | Windows NT uses, like all modern operating systems, virtual<br>addressing. Virtual addressing allows the usage of more memory<br>than physical available in the computer. This is done by mapping<br>the virtual address either to a physical memory address (the data<br>is loaded into the physical memory), or to a location on the hard<br>disk. In the latter situation the data is said to be 'swapped' to the<br>disk. Windows NT manages address mapping in such a way that<br>a data block requested by an application is loaded in the<br>physical memory. This so called 'disk swap' dramatically<br>influences the filter applications performance. By closing all<br>other application, most of the disk swapping can be prevented.<br>However, if the filter application allocates more memory than |
|                          | physically available in the PC, Windows NT will have to swap data to the hard disk.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| Allocating memory        | <ul> <li>Therefore the filter application has to allocate the required memory in such a way that no swapping will occur in critical sections:</li> <li>1. Minimise the amount of allocated memory.</li> <li>2. Allocate memory outside the (timed) filter functions.</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|                          | Further research is required on how to achieve constant processing performance on a Windows NT platform.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| Fastest reproducible run | When measuring filter performance a number of measurements where performed, out of which the fastest that could be reproduced was documented. It was chosen to record the fastest reproducible measurement instead of the average because the slower measurements are heavily influenced by the time Windows NT needs to swap data to the hard disk <sup>2</sup> .                                                                                                                                                                                                                                                                                                                                                                                                                                                |

<sup>&</sup>lt;sup>1</sup> This is done with the Windows NT Task Manager. 2 A disk access requires several milliseconds, while a CPU clock cycle is about 6 nanoseconds (based on a clock speed of 166 MHz).

| 3                  | Algorithm implementation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 3.1                | Introduction                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| Image Gear         | The application framework needed to test the algorithms is built<br>using the MFC-Application wizard provided with Microsoft Visual<br>C++ 5.0. The wizard generates a customised application<br>template, containing for example menu's and toolbars. Reading<br>and writing of images has been implemented using the Image<br>Gear function library. The image is read by Image Gear from the<br>harddisk into memory as a linear sequence of pixels. Each pixel<br>is represented by 24 bits. For each of the Red, Green and Blue<br>color planes 8 bits are used, giving a Red, Green and Blue range<br>from 0 to 255. Within each pixel the planes are placed in Blue -<br>Green - Red order. |
| C++ vs. MMX        | <ul> <li>In the test application two versions of the processing paths are implemented; one written in standard C++, another using in-line assembler with MMX Technology. It has been decided to implement the filters in C++ for two reasons:</li> <li>1. To find out each filters typical difficulties.</li> <li>2. To be able to compare the performance of the MMX implementation to a conventional C++ implementation.</li> </ul>                                                                                                                                                                                                                                                              |
| Time stamp counter | A test session is started by choosing either the "C processing<br>path" or the "MMX processing path" option from the image<br>menu. Then the required main processing function ('FilterC' or<br>'FilterMmx') is called. These two functions are the base of the<br>processing path. They allocate all the required storage space<br>and call the specific filter functions. The required processing<br>time<br>is measured by reading the time stamp counter register before<br>and after the call to the filter function (see section 2.5).                                                                                                                                                       |
|                    | When the users starts a test session, a dialog box is displayed in<br>which the user can select the filters to execute. When the image<br>processing is completed, another dialog box is displayed<br>containing the number of cycles per pixel and the number of<br>milliseconds for each of the applied filters.                                                                                                                                                                                                                                                                                                                                                                                 |
| 3.2                | RGB Color plane separation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 3.2.1              | The algorithm                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                    | The filters in the processing path (smooth, sharpen) will be<br>applied to each of the individual color planes. In the image<br>however, the pixels are stored in a 3x8 bits block, where the first<br>eight bits represent the Blue value, the middle eight Green and<br>the last eight Red.<br>For an easier and faster <sup>1</sup> implementation of the filters, a function<br>is built that separates the three color planes, as well as its<br>complement, a function that merges the three planes.                                                                                                                                                                                         |
| 3.2.2              | The implementation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|                    | The implementation of the separate function is easy. A loop runs through the array containing the image and puts of every group                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |

<sup>&</sup>lt;sup>1</sup> Faster because memory reads are less cluttered up.

|                        | of three bytes the first byte in the Blue color plane array, the second into the Green, and the third into Red color plane array.                                                                                                                                                                                                                                                                     |  |  |  |  |  |
|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|--|
| C++ version only       | This function, which is written in C++, is used for the C++ as well<br>as for the MMX implementation of the processing path.<br>However, it is expected that the use of MMX technology will lead<br>to a performance improvement because an MMX<br>implementation can separate multiple color pixels parallel, where<br>the C++ implementation processes the pixels sequentially.                     |  |  |  |  |  |
| 3.3                    | Adding borders                                                                                                                                                                                                                                                                                                                                                                                        |  |  |  |  |  |
| 3.3.1                  | The algorithm                                                                                                                                                                                                                                                                                                                                                                                         |  |  |  |  |  |
| Smooth and sharpen     | In section 3.4.3.5 and 3.5.3.5 it can be read that both the smooth and the sharpen filter require an image width which is a multiple of eight. Since not all images have such a width, the image width has to be extended.                                                                                                                                                                            |  |  |  |  |  |
|                        | On page 81 it can be seen that the MMX implementation of the halftoning algorithm does not process all the border pixels. Both on the left and the right a seven pixels wide column is not completely processed. Therefore it has been decided to add two, eight pixel wide borders to the left and right edge of the image.                                                                          |  |  |  |  |  |
| Separated color planes | The border adding function can be applied to the merged color<br>planes or to the separated color planes. Since the function that<br>separates the color planes does not require an image width<br>which is a multiple of eight, it has been chosen to apply the<br>border adding function to the three separated color planes.                                                                       |  |  |  |  |  |
| 3.3.2                  | The implementation                                                                                                                                                                                                                                                                                                                                                                                    |  |  |  |  |  |
|                        | The implementation of the border adding function is not difficult.<br>Two loops are used to traverse the image; one to process the<br>image horizontally (the rows), another to process all the rows.                                                                                                                                                                                                 |  |  |  |  |  |
|                        | The inner loop, which processes one row, simply copies all the pixels from the source array to the destination array. After all the pixels from one row of the source image have been copied, a number of pixels is added to reach an image width of 8n. The two eight pixel wide borders which have to be added for the halftoning algorithm are copied before respectively after the row is copied. |  |  |  |  |  |
| MMX version only       | <ul> <li>It has been chosen to implement this function with MMX<br/>Technology for two reasons:</li> <li>By using MMX eight pixels can be copied simultaneously,<br/>compared to one of a C++ implementation.</li> <li>This function is only used by the MMX version of the<br/>processing path.</li> </ul>                                                                                           |  |  |  |  |  |
|                        |                                                                                                                                                                                                                                                                                                                                                                                                       |  |  |  |  |  |

# 3.4 The smooth algorithm

# 3.4.1 The algorithm

The smooth algorithm is used to suppress disturbances added to a scanned image because of the non-ideal characteristics of scanners. A scanner always adds some electronic noise to the scanned image, and scanning of regular patterns, such as color photos, can result in moiré like effects. A smoothing operation is performed on the separated Red, Green and Blue planes of the image to reduce these artefacts.

In Figure 10 the smooth filter is applied to the squared area.



<sup>&</sup>lt;sup>1</sup> There are 128 cache lines of 32 bytes each. See section 2.4.4.

which the bytes previously read will be overwritten<sup>1</sup>. This means that the bytes needed for the next one byte column have to be read again from the slow main-memory. When traversing the image array diagonally, of the 32 bytes read into the level one cache, the same (small) number of bytes would be used as when we would run vertically through the array. Because at a certain point cache lines still needed are overwritten, performance will be far from optimal. Because traversing image array horizontally results in the most Horizontal traversing efficient use of the level one cache, this way has been chosen. The direction -from left to right or vice versa- has no influence on the performance, so it has been decided to run from left to right, since the bytes are stored in memory from left to right. 3.4.2.2 **Re-using calculated values** Calculation of the new pixel value could be done in one single calculation where all the pixel values are added (with their own weight) and the sum is divided by 16. Such an implementation will result in poor performance because the kernel symmetry enables the calculation of row or column (weighted) sums. Because we run from left to right through the image, it is possible to calculate the weighted sum of each of Three kernel columns the three kernel columns, add them and divide by 16. By storing the sum of the rightmost column in a buffer the number of uncached memory accesses can be reduced by  $66\%^2$ . In the next loop cycle the previously buffered result can be multiplied by two to calculate the weighted sum of the middle column. One loop cycle later the value stored two cycles ago can be used as the weighted sum of the left column of the kernel. The example in Figure 12 shows how the same column-result is calculated three times when calculating three new pixel values. Suppose at a certain moment the memory looks like shown in Figure 12. Pixel '5' is currently processed. Memory: Kernel: 1 3 2 3 2 1 1 1 3 5 7 8 0 4 2 2 1 2 4 9 6 1 2 1

> Column=3+2\*7+2=19 Figure 12: Calculating the right kernel column

At the next iteration of the loop (when calculating the new value for the '7') the same column is used.

<sup>&</sup>lt;sup>1</sup> Since there are a source and destination array this will occur approximately after 2\*2048=4096 byes have been read. <sup>2</sup> Assuming that the buffer is loaded into the level one cache, or placed in registers.



Column=2\*3+4\*7+2\*2=38=2\*19 Figure 13: Calculating the middle kernel column

The third time, when calculating the new value for the '8', it looks like shown in Figure 14.



Column=3+2\*7+2=19 Figure 14: Calculating the left kernel column

## 3.4.2.3

# The final C++ implementation

With this buffering implemented the filter loop final C++ implementation looks like shown in the pseudocode in Code example 1. The buffer consists of two storage spaces, A and B. The function Weighted\_Sum calculates the (1, 2, 1) weighted sum of the first three bytes of a column, each with the appropriate weight.

```
Dst.Rows[0] = Src.Rows[0]
                                 // Copy top and bottom row
Dst.Rows[Height] = Src.Rows[Height]
RowCounter = 1
Do While (RowCounter < Height-1)
        Dst.Pixels[RowCounter,0] = Src.Pixels[RowCounter,0]
        A = Weighted_Sum(Src.Columns[0])
        B = Weighted_Sum(Src.Columns[1])
        ColCounter = 2
        Do While(ColCounter < Width-1)</pre>
             ł
              Temp=Weighted_Sum(Src.Columns[ColCounter])
              Dst.Pixels[RowCounter,ColCounter-1]=(A+2*B+Temp )/16
              A = B
              B = Temp
              ColCounter = ColCounter+1
             }
      Dst.Pixels[RowCounter,Width]=Src.Pixels[RowCounter,Width]
      RowCounter = RowCounter+1
      }
```

# Code example 1: C++ implementation of the smooth

PerformanceThe performance of the code is discussed in section 3.4.4.7,<br/>where it is compared to that of the MMX implementation.3.4.3MMX implementation3.4.3.1Parallelability of the algorithm

When implementing a filter with MMX Technology, the first question to ask is:

"Is it possible to process pixels simultaneously ?"

Unlimited parallelability

If not, using MMX will not improve performance because MMX's strength is the fact that it can perform an operation on multiple pixels simultaneously.

The smooth filter applies a kernel to a number of pixels of an image, resulting in the new pixel value. Since the output of the filter has no influence on the input, parallel processing of pixels is possible without any limitations.

#### 3.4.3.2 Pixel block width

Next it must be decided how many pixels can be processed simultaneously. The answer to this question is basically provided by the way the data to calculate one pixel is loaded into the MMX registers. To understand this, it must be clear how the pixels are

Pixels written sequentially stored in memory; the pixels of each row are written sequentially from left to right. The first (=leftmost) pixel of the next row is stored in memory right after the last (=rightmost) pixel of the row last written. Thus, the pixels of the image shown in Figure 15 are written in the same order as they are numbered; pixel '1' is stored on the lowest address in memory, pixel '80' on the highest. Pixel '21' is stored one location higher than pixel '20'. As a result of this, all the pixels of the top row are placed in adjacent memory locations, and can be read with one MMX memory read.

| 1  | 2                                    | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|----|--------------------------------------|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 21 | 22                                   | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 |
| 41 | 42                                   | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 |
| 61 | 62                                   | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 |
|    | Figure 15: Deading the source pixels |    |    |    |    |    |    |    |    |    |    |    |    |    |    |

Figure 15: Reading the source pixels

Thus when calculating the new value for pixel 22 the pixels of the top kernel-row can be read by reading pixel 1 with an MMX memory read. However, the MMX memory read loads eight pixels (=bytes) from memory. This enables the calculation of six pixels, 22 to 27<sup>1</sup>, shown in the squared area. It is obvious that this way of implementing the filter uses the memory bandwidth more efficient than an implementation that calculates only one pixel per loop cycle. But how many pixels should be processed each loop

cycle ?

The answer to this question is provided by the number of dataelements calculated in parallel. When processing packed bytes eight bytes are processed parallel<sup>2</sup>. As far as the other data types are concerned; four packed words or two packed doublewords can be processed parallel.

These considerations lead to the conclusion that it would be very inefficient to process six pixels per loop cycle; such an implementation only uses a part of the available processing power.

Since the pixels are stored in memory as bytes it is obvious that packed bytes are used for pixel calculations. This implies that each loop cycle eight new pixels are calculated.

Eight pixels per loop cycle

<sup>&</sup>lt;sup>1</sup> The kernel requires two extra -border- pixels.

<sup>&</sup>lt;sup>2</sup> Because a packed byte contains eight bytes.

# 3.4.3.3

Vertical traversing

## Re-usage of calculated values

The choice made in section 3.4.3.2 does not determine how to run through the image array. Although the example runs through the image array vertically, it would also be possible to run horizontally, as with the C++ implementation. The third possibility is to run through the image diagonal.

Let us suppose the filter runs horizontally through the image array. Data would then be processed as show in Figure 16; in the first loop cycle the pixels 22-29 are calculated requiring the grayed pixels of the source image. In the next loop cycle the pixels 30-37 are calculated, requiring the squared area of the source image.

| 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 |
| 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 |
| 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 |
|    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |

Figure 16: Traversing the image array horizontally

Thus, when running horizontally through the image, only the weighted sum of two columns can be re-used to calculate the next group of pixels. Now suppose the loop runs through the array vertically. The second cycle of the loop calculates the new values for pixels 42- 49, requiring the squared area of the source array in Figure 17.

| 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 |
|----|----|----|----|----|----|----|----|----|----|----|
| 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 |
| 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 |
| 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 |

Figure 17: Traversing the image array vertically

In this situation, the weighted sum of the row 21-30 and of the row 41-50 can be re-used. The re-usage of two previously calculated results is possible because the kernel is vertically symmetric and because the middle row is a multiple of the bottom row, as shown in Figure 18.

|                   | 1 | 2 | 1 |  |  |  |  |  |  |
|-------------------|---|---|---|--|--|--|--|--|--|
|                   | 2 | 4 | 2 |  |  |  |  |  |  |
|                   | 1 | 2 | 1 |  |  |  |  |  |  |
| a 18: Smoothing k |   |   |   |  |  |  |  |  |  |

Figure 18: Smoothing kernel

The third option was to run through the image diagonal. Such an implementation would result in poor performance because calculated values could not be re-used efficiently.

From the previous it is clear that optimal usage of previously calculated weighted sums can only be achieved when running vertically through the image array.

The values to be re-used have to be stored somewhere. If registers would be used, this would require four<sup>1</sup> of the eight

<sup>&</sup>lt;sup>1</sup> Four instead of two because the data has to be extended from bytes to words. The reasons it has been decided to extend data from bytes to words are discussed later.

available MMX registers, leaving not enough registers free to do the calculations. Thus, these values are stored in an array. Because the array is accessed frequently, the data will always be in the level one cache, allowing data to be read in one cycle.

## 3.4.3.4 Optimal image traversing

From the previous paragraph it is clear that as far as the reusage of calculated values is concerned the most efficient way of processing the image would be by running vertically through the image. But in the discussion of the C++ implementation<sup>1</sup> it was concluded that from the caches point of view, an optimal implementation would run horizontally through the image.

An implementation that combines these two demands would run vertically through the image to enable optimal re-usage of calculated values, but would not overwrite cache lines still needed. This can be achieved by running down until all the cache

lines are filled, instead of running down until the bottom of the image. This way of running through the image is shown in Figure 19.



Since this way of processing the image promises optimal results it has been chosen to implement the filter in this way. A key issue in this implementation is the moment where all the cache lines are filled. The level one cache consists of 512 lines of 32 bytes each. When the loop runs down the image, each loop cycle one cache line is filled<sup>2</sup>. This means that after running 256 lines<sup>3</sup> down, the entire cache is filled. At this point, the filter starts reading at the top of a new column. The number of rows to run down before starting at the top of the next column is called the

Row-set size

Cache lines filled

#### row-set size.

If the rowsets are too small performance is decreased because of the overhead. If the rowsets are too big cache lines needed later are overwritten, reducing performance.

The optimal rowset size is further influenced by other data that is read from memory. Because it is hard to predict the right rowset

<sup>&</sup>lt;sup>1</sup> See section 3.4.2.1.

<sup>&</sup>lt;sup>2</sup> When re-using the inner products only one 10 byte-wide row is needed to calculate 8 new pixels.

<sup>&</sup>lt;sup>3</sup> 512/2 because the source and destination both require cache space. The cache space used for overhead (for example Windows NT ) is neglected.

size, performance is tested with various rowset sizes. The results of these tests are described in section 3.4.4.5.

#### 3.4.3.5

Copy border pixels

# Border pixels

When processing pixels like shown in section 3.4.3.4, the leftmost, top and bottom pixels of the image require special treatment, because the complete kernel can not be applied to them. Either a special kernel for border pixels is created, or the border pixels are not processed.

A kernel to apply to the left-border pixels is shown in Figure 20.

| 2/12 | 1/12 |
|------|------|
| 4/12 | 2/12 |
| 2/12 | 1/12 |

Figure 20: Smooth kernel for left-border pixels

The pixel weighted 4/12 in the kernel shown in Figure 20 is the centerpixel. An additional difficulty when implementing special border kernels is that instead if dividing by 16, a division by 12 has to be made. Because of the absence of a packed divide instruction, the division by 12 is very complex (the division by 16 can be made by shifting three bits to the right). Instead of dividing by 12 it is also possible to assign different weights to the pixels to allow a division by 16.

Other algorithms, such as halftoning, add noise to the border of the image, requiring a border to be cut off. Therefore it is generally not necessary to pay special attention to the border pixel. Because of the influence they have on their neighbour pixels when the next filter is applied they are not skipped entirely (resulting in random values) but copied from the source image. The top, bottom and leftmost pixels of the image are copied like this, but a problem arises for the rightmost pixels. The processing loop of the filter calculates eight new pixels simultaneously. Since images will seldom have a width of  $(8n+2)^1$  pixels, the source image will have to be extended horizontally. At the end of the processing path, the border added to the right of the image will have to be removed.

If the filter would be implemented this way, of each row a number of pixels would be processed that would be deleted from the image later anyway (on the right edge). If the image width would be extended to a multiple of eight (8n), the seven rightmost pixels would have to be copied. Because the pixels are placed sequentially in the image array, the leftmost pixel of the next row are placed in memory right after the seven rightmost pixels of the previous, allowing the MMX move instruction to copy all eight pixels with two instructions. This requires half the time needed to copy a single leftmost and rightmost pixel.

The top and bottom rows of the image are copied at the beginning respectively the end of the processing loop.

<sup>&</sup>lt;sup>1</sup> Pixels are processed in 8 pixel wide groups (=8n), two border pixels are copied (=+2) resulting in an image width of 8n+2.

3.4.4

Because the next filter function in the processing path (sharpen) is similar to the smooth filter, an extra function that extends the image width is created and called from the "FilterMmx" function.

## The basic loop structure

The implementation of the basic loop structure designed in the previous sections is shown in Code example 2.

Code example 2: The basic loop structure of the smooth

In the inner loop the actual processing is done. Each loop cycle an eight pixel wide row is calculated, running down the image array (thus forming an eight pixel wide column). The middle loop (Do While (ColSetCounter > 0)) determines the number of those columns horizontally processed.

The outer loop counts the number of rowsets to process. If the number of rows left to process is less than the default number of rows in a rowset (256) the default number is adjusted. At the end of the next loop cycle, the ColCounter will be decreased to zero and the filter will exit from the loop.

Note that the required 'if-then' would normally be programmed with a compare instruction followed by a jump. However a significant delay occurs if the jump is predicted incorrect<sup>1</sup>.

Avoid 'if-then'

# 3.4.4.1

Minimum

## Avoiding a jump

The jump can be avoided with the following trick; basically, the minimum of the rowset-size and the number of rows left to process has to be taken. Code example 3 takes the minimum of ebx and eax, and places it in eax.

| sub | ebx, | eax |
|-----|------|-----|
| sbb | ecx, | ecx |
| and | ecx, | ebx |
| add | eax, | ecx |

Code example 3: Minimum of eax and ebx

The first instruction determines the difference between ebx and eax. If this difference is negative (eax > ebx) the carry-flag is set, otherwise it is cleared. The next instruction copies the carry bit to all bits of ecx, after which the 'AND' instruction masks the calculated difference. If the difference was positive (ebx > eax) the 'AND' instruction will clear ecx. If ebx was less than eax, ecx will be loaded with the (either zero or negative) difference. Finally, ecx is added to eax, causing eax to be decreased if ebx was less. Note that the initial value in ebx is lost.

#### 3.4.4.2 Register allocation

The next step in the implementation of the smooth filter is determining the register allocation for calculating the new pixels. Basically, all the pixels required to calculate a new pixel are loaded into MMX registers in such a way that they can be added each with its own weight, after which the result is divided by 16. However, suppose a pixel in a gray plane like shown in Figure 21 is calculated.

| Pixels: |     |     | Kernel: |   |   |  |
|---------|-----|-----|---------|---|---|--|
| 192     | 192 | 192 | 1       | 2 | 1 |  |
| 192     | 192 | 192 | 2       | 4 | 2 |  |
| 192     | 192 | 192 | 1       | 2 | 1 |  |

Figure 21: Calculating a pixel in a gray plane

When calculating the new pixel value for the grayed centerpixel, a number of registers would be used, each containing one of the source pixels. Next, each of the pixels would be multiplies width its weight, one two or four. However, the pixels multiplied with a weight of two or four, would saturate to 255 because the calculated values exceed the range of an unsigned byte (0 to 255), resulting in the calculation shown in [1] (the italic values are saturated pixels).

new pixel = 
$$\frac{4 \times 192 + (4 \times 255) + 255}{16} = 128$$
 [1]

In [2] the calculation without the saturation of the pixels is shown.

new pixel = 
$$\frac{4 \times 192 + 4 \times (2 \times 192) + 1 \times (4 \times 192)}{16} = 192$$
 [2]

Bytes saturate

The two resulting pixels are shown<sup>1</sup> in Figure 22.



<sup>&</sup>lt;sup>1</sup> This is for one color plane .

<sup>&</sup>lt;sup>2</sup> With a workaround it is possible to shift bytes. This can be done by shifting words after which the bits shifted out of the most- to the least significant byte of a word have to be masked out with an and-instruction.

them by two. Care has to be taken not to use PSLLQ (= packed shift left logical quadword), which will shift the entire 64-bit register instead of the four packed words, resulting in incorrect values. The two PADDUSW (= packed add unsigned words) add the three weighted pixels, forming the result of the top row.

Bytes to words The extension of the bytes read from memory can be implemented by using the PUNPCKLBW instruction (= Packed UNPaCK Lower Byte to Word). This instruction merges the four least significant bytes of two quadwords into one quadword, as shown in Figure 25.



Figure 25: Unpack operation (PUNPCKLBW MM, MM/M32)

If the second register (mm/m32) is filled with zeros, the lower four bytes of the first (mm) are extended from bytes to words. The counterpart of the unpack is the pack instruction, which is described in section 2.4.5.

After the extension of the bytes to words, two sets of four pixels are processed. After they have been calculated, the new pixel values have to be shrunk from packed words to packed bytes, after which they can be written to memory. Instead of processing each loop cycle eight pixels in two portions of four, it is also possible to process one portion of four pixels. However, this was not done because of two reasons:

- 1. Other algorithms that do not require the bytes to be extended to words require an image width of 8n pixels so the image width has to be extended anyway.
- 2. The sharpen filter, which is similar to the smooth, requires an image width of 8n. It is preferred to keep the loop structure of the sharpen and the smooth identical, because only the kernels differ.

Processing each loop cycle multiple sets of pixels instead of one is called 'loop unrolling'. Loop unrolling is sometimes applied to achieve better pairing rates (see section 2.4.2).

# The inner loop structure

The register allocation described in section 3.4.4.2 results in the inner loop structure shown in Code example 4. The code resembles the Process\_Pixels function in the pseudocode shown in Code example 2on page 30. Register esi and edi point to the lower-left pixels of the data-block required to process eight pixels. esi points to the source array, edi to the destination. The registers mm1 to mm7 are MMX registers, ResultBuffer is a four quadword sized buffer used to store the weighted sums used later.

Eight pixels per loop cycle

3.4.4.3

```
//Least significant 4 pixels
mm1=ResultBuffer[0]
                          //Upper row
mm2=Resultbuffer[3]
                          //Middle row
mm2=ShiftLeft(mm2,1)
                          //mm2=2*mm2
mm3=Read8Pixels(esi)
                           //Lower row
mm3=UnpackData(mm3)
mm3=CalculateRow(mm3)
mm0 = (mm1 + mm2 + mm3) / 16
                           //Calc. new pixels
ResultBuffer[0]=ResultBuffer[3] //Shift buffer.
Resultbuffer[3]=mm3
                          //Store res. of lower row
                    //Most significant 4 pixels
mm1=ResultBuffer[1]
                          //Upper row
mm2=Resultbuffer[4]
                          //Middle row
mm2=ShiftLeft(mm2,1)
                          //mm2=2*mm2
mm3=Read8Pixels(esi+4)
                          //Lower row
mm3=UnpackData(mm3)
mm3=CalculateRow(mm3)
mm1 = (mm1 + mm2 + mm3) / 16
                           //Calc. new pixels
ResultBuffer[1]=ResultBuffer[4] //Shift buffer
Resultbuffer[4]=mm3
                          //Store res. of lower row
mm0=PackData(mm0,mm1)
Write8Pixels(edi+1)
```

Code example 4: The inner loop structure of the smooth

#### Image size

Before the performance for various rowset sizes can be determined, first the resolution of the images which will be processed has to be determined. Figure 26 shows the relation between the number of dots per inch (dpi) and the number of pixels for two resolutions.



Figure 26: Relation between the resolution and the image size

Note that each pixel requires 3 bytes, resulting in an image size of 11.6 MB for the 200 dpi scan, and 26.1 MB for the 300 dpi scan. The image size was the primary reason why it was chosen

3.4.4.4

| Minimum physical memory | to use 200 dpi images instead of 300. If a 300 dpi image would<br>be used, the minimum amount of storage space to be allocated<br>by the program would be 52 MB <sup>1</sup> . On a PC with a total physical<br>memory of 64 MB this would lead to large portions of the image<br>being<br>swapped to and from the hard disk. Generally, the minimal size<br>of the physical memory should be about twice the size of the<br>image plus 25 MB, which is required for Windows NT. |  |  |  |
|-------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|
| 200 dpi: 10MB           | The test image scanned at 200 dpi has a total pixel size of 1580x2176. This is smaller than expected because the scanner removes 0,42 inch from the top and bottom edges, as well as 0.19 inch from the left and right edge. The 200 dpi scan with the borders cut of has a total size of 10 MB.                                                                                                                                                                                 |  |  |  |
| 3.4.4.5                 | The optimal rowset size                                                                                                                                                                                                                                                                                                                                                                                                                                                          |  |  |  |
|                         | Now the final code has been built and it has been decided which resolution images will be processed, the optimal rowset size can be determined.                                                                                                                                                                                                                                                                                                                                  |  |  |  |
| 150 dpi A4<br>200 dpi   | Performance is measured for two different images, an A4 scanned at 150 dpi, another scanned at 200 dpi resulting in a resolution of 1210x1627 for the first image and 1580x2176 for the second (both images with 16 million colors). The performance for different rowset sizes is shown in Table 2 (in cycles per                                                                                                                                                               |  |  |  |

processed pixel).

| Rowset | 1210x1627x16 milj | 1580x2176x16 milj |
|--------|-------------------|-------------------|
| size   | (cycles/pixel)    | (cycles/pixel)    |
| 1      | 48                | 50                |
| 2      | 35                | 37                |
| 4      | 29                | 31                |
| 8      | 26                | 28                |
| 16     | 24                | 26                |
| 32     | 23                | 25                |
| 48     | 23                | 25                |
| 64     | 23                | 25                |
| 96     | 23                | 33                |
| 128    | 29                | 34                |
| 256    | 29                | 34                |
| 512    | 35                | 40                |
| 1024   | 37                | 40                |

Table 2: Performance for various rowset sizes

For small rowset sizes performance is sub-optimal. After reaching a minimum, the number of cycles per pixel increases. The latter is caused by cache lines which are overwritten although they are still needed. For rowset sizes which are a bit too small performance decreases marginally. For *very* small rowsets the performance is poor. This is caused by the overhead required to process the rowsets.

The performance measurements for various rowset sizes clearly show the influence of the image size on the optimal rowset size.

<sup>&</sup>lt;sup>1</sup> Assuming the source an destination array both require 26 MB.

| Calculated rowset size | The optimal rowset size <i>seems</i> to be determined by the size of<br>the level two cache and the image width in pixels <sup>1</sup> .<br>The test PC has a level two cache of 256 KB. Assuming that<br>both the source and destination array require 125 KB, the<br>optimal rowset size for the 200 dpi scan can be determined with<br>the<br>calculation shown in [3].<br>$\frac{\frac{1}{2} \times L2 \text{ cache size}}{\text{Image width}} = \frac{\frac{1}{2} \times 256 \times 1024}{1580} = 83 \text{ rows/rowset} $ [3]                                           |
|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Test results           | Tests show the optimal rowset size is between 32 and 64<br>rows/rowset (200 dpi scan). The reason the value calculated in<br>[ 3 ] results in a value slightly too large is that it ignores the<br>cache space used for other variables, as well as the cache space<br>used by Windows NT.<br>From the previous it can be concluded that the optimal rowset<br>size can be estimated be dividing half the level two cache size<br>by the image width (in pixels), but that the accurate size can only<br>be determined by measuring performance with various rowset<br>sizes. |
| Rowsets of 64          | It has been decided to use a rowset size of 64 <sup>2</sup> , because this size gives optimal performance for both 150 and 200 dpi (A4) scans. For smaller images the performance will be sub-optimal.                                                                                                                                                                                                                                                                                                                                                                        |
| 38% faster by rowsets  | In Table 3 the performance improvement of an optimal rowset implementation has also been compared to an implementation that does not use rowsets (for the 150 dpi scan).                                                                                                                                                                                                                                                                                                                                                                                                      |
|                        | Without rowsets: 37 cycles/pixel                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                        | Rowsets of 64: 23 cycles/pixel                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|                        | Improvement: 38 %                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|                        | Table 3: Performance gain by using rowsets                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| 3.4.4.6                | Pairing the code                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| Re-order statements    | The final step in implementing the smooth filter is pairing the code. Because most of the processing time required is used for the inner loop, most attention is paid to the inner loop. When pairing the code, program statements are re-ordered in such a way that the CPU can execute two instructions parallel. Care has to be taken not to switch related statements, since this will cause the filter to malfunction.                                                                                                                                                   |
| 3.4.4.7                | Filter performance                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|                        | After the code has been paired performance has been tested again, resulting in the cycle times shown in Table 4. In this table the performance of the paired version is compared to the unpaired version <sup>3</sup> .                                                                                                                                                                                                                                                                                                                                                       |

<sup>&</sup>lt;sup>1</sup> Not the image with in bytes, because the filter operates on the separated color planes. Each pixel is represented with one byte in each color plane. <sup>2</sup> The final step in the implementation of the filter (pairing the code) has no influence on the

<sup>&</sup>lt;sup>2</sup> The final step in the implementation of the filter (pairing the code) has no influence on the optimal rowset size because pairing only influences the code, which is loaded into a separate code cache.

code cache. <sup>3</sup> The improvement is calculated by dividing the difference in number of cycles/pixel by the new number of cycles/pixel.

32% faster by paring

| Image size | Cycle          | Improvement     |      |  |  |
|------------|----------------|-----------------|------|--|--|
|            | unpaired       | unpaired paired |      |  |  |
| (pixels)   | (cycles/pixel) | (cycles/pixel)  | (%)  |  |  |
| 290x509    | 23             | 20              | 15,0 |  |  |
| 1580x2176  | 25             | 19              | 31,6 |  |  |

Table 4: Performance improvement by pairing the code

The difference in performance improvement is caused by the fact that for the larger image the memory bandwidth has more influence on the performance than for the smaller image. In Table 4 it can be seen that pairing the code improves performance of the smooth operation on the 200 dpi A4 image with 32%.

In Table 5 the performance of the paired MMX code is compared to that of the C++ code for several images.

| Image size    | Cycle          | Improv         | vement |          |
|---------------|----------------|----------------|--------|----------|
|               | C++ version    | MMX version    |        |          |
| (pixels)      | (cycles/pixel) | (cycles/pixel) | (%)    | (factor) |
| 87x16         | 67             | 13             | 80,6   | 5,2      |
| 63x96         | 73             | 16             | 78,1   | 4,6      |
| 127x96        | 83             | 17             | 79,5   | 4,9      |
| 255x320       | 94             | 19             | 79,8   | 4,9      |
| 290x509       | 94             | 20             | 78,7   | 4,7      |
| 1580x32       | 89             | 19             | 78,7   | 4,7      |
| 1580x2176 (1) | 96             | 19             | 80,2   | 5,1      |
| 1580x2176 (2) | 95             | 19             | 80,0   | 5,0      |

Table 5: Performance of the smooth filter

| Influence of clipping  | The last two images are A4 pages scanned at 200 dpi. The performance of the C++ implementation differs because the images differ; in the first image more clipping <sup>1</sup> occurs than in the second, thus requiring more calculations. The MMX implementation has a constant performance because no 'if-then' is needed <sup>2</sup> for clipping. |
|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| MMX 5x faster than C++ | From Table 5 it can be concluded that the MMX implementation<br>is about a factor five faster than the C++ implementation.<br>Processing the scanned A4 images with the C++ implementation<br>requires 96 cycles/pixel (1.99 seconds). The MMX<br>implementation requires 19 cycles/pixel (0.39 seconds).                                                |

Small imagesNotice that the number of cycles/pixel decreases for smaller<br/>images. This is caused by the fact that a (very) small image can<br/>be kept entirely in the level one cache<sup>3</sup>. A slightly larger image<br/>can still be kept into the level two cache, but the 200 dpi<br/>scanned A4 images (of approximately 10 MB) can not.

Added border Because the MMX implementation of the processing path adds a border to the image to extend the image width to a multiple of eight (8n), performance will be optimal for images where the extra pixels do not have to be added because the image width

<sup>&</sup>lt;sup>1</sup> If the pixel clips has to be tested with an "If-Then" statement.

<sup>&</sup>lt;sup>2</sup> See section 2.4.5.

<sup>&</sup>lt;sup>3</sup> It is loaded into the level one cache when performing the color plane separation.

|                     | already is 8n <sup>1</sup> . When implementing a complete processing path using MMX technology, the width of the scanned image would have to be chosen a multiple of eight.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Different CPU speed | Finally, a it must be remarked that the number of cycles/pixel<br>discussed will differ if the code is run on a system with a<br>different<br>CPU clock speed. This is caused by the fact that the time<br>required to read data from memory is constant, while the cycle<br>time changes. Thus for a CPU with a higher clock speed, the<br>number of cycles/pixel will increase.                                                                                                                                                                                                                                                                                                                               |
| 3.4.4.8             | Categorising the processing time                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|                     | <ul> <li>In Figure 27 the processing time required for the inner loop of the unpaired smooth filter is shown categorised into five categories:</li> <li>1. Read pixels</li> <li>2. Buffer (management)</li> <li>3. Write pixels</li> <li>4. Calculate</li> <li>5. Overhead</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                           |
| Cycle count method  | <ol> <li>If two instructions pair both are counted as using cycles.<br/>Alternative counting methods for paired instructions are:</li> <li>Count both half; when counting this way, a one cycle add<br/>instruction pairing with a ten cycle memory move, is counted<br/>as a five cycle instruction.</li> <li>If both instructions take the same amount of cycles, count<br/>both half. Otherwise count the one taking most cycles. Since<br/>memory accesses are often paired with arithmetic<br/>instructions, this way of counting cycles favours the<br/>arithmetic instructions.</li> <li>Weigh each instructions cycles with its proportion to the total<br/>number of cycles the pair takes.</li> </ol> |
|                     | The first two alternatives give a wrong impression of the partition<br>of the cycle times. The third is the best way of counting the cycle<br>times, but takes a lot of time to be applied. When counting all<br>instructions (with a pairing rate of 100%) the total number of<br>cycles will be twice the amount when counted according to<br>method three, but the partitioning of the processing time will be<br>the same. Therefore this method of counting the cycle times has<br>been chosen.                                                                                                                                                                                                            |
| For 200 dpi A4      | The processing time is measured for an A4 image scanned at 200 dpi. However, not the entire image is processed because it would take VTune a number of days to do this.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |

<sup>1</sup> When calculating the number of cycles/pixel the total number of cycles is devided by the number of pixels in the original image.



Figure 27: Categorised processing time of the smooth

Figure 27 shows that only half of the processing time is required for calculations. More than 40% of the processing time is needed for memory related operations (buffer management, read and write pixels). 51% of the processing time is required for calculations.

40% memory

51% calculations



Figure 28: Processing time per instruction type

In Figure 28 the processing time categorised per instruction type is shown. The cycle times are counted in the same way as in Figure 27. Notice that the amount of time used for move

operations (including memory moves) is about the same as the amount of memory related operations in Figure 27.

Figure 28 shows that only 20% of the processing time is used for arithmetic instructions. When taking in account that some of these are required for the overhead, this is surprisingly low compared to the 51% required for calculations in Figure 27. When adding the processing time of the conversion and shift instructions<sup>1</sup> however, this value<sup>2</sup> is also reached.

3.5 The sharpen algorithm

#### 3.5.1 The algorithm

During the smooth operation discussed in the previous chapter, sharp edges are blurred. This is unwanted for text and other sharp object. This effect is compensated by enhancing the edges using a sharpen operation. Thus, after performing a smooth and a sharpen the scanner interference is suppressed, but edges are still sharp.

In Figure 29 the sharpen filter is applied to the squared area.



Figure 29: Effect of sharpening

The sharpen operation is implemented as a 3x3 kernel operation, with the coefficients shown in Figure 30.

| -1/4     |                      | -1/4     |    |
|----------|----------------------|----------|----|
|          | 8/4                  |          |    |
| -1/4     |                      | -1/4     |    |
| Figure ( | 30 <sup>.</sup> Shar | penina k | er |

Figure 30: Sharpening kernel

#### 3.5.2 C++ implementation

3.5.2.1 Traversing the image array

> Basically the only difference between the smooth and the sharpen filter is that other coefficients are applied. Therefore the loop structure of the C++ implementation of both filters is identical. Thus the loop runs horizontally from left to right through the image.

Horizontal traversing

<sup>&</sup>lt;sup>1</sup> These instructions are counted as calculate-instructions in Figure 27 because they are necessary to enable the calculations.  $^{2}$  20%+18%+13% = 51%

| 3.5.2.2                 | Re-using calculated values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |  |  |  |  |  |
|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|--|
|                         | The limited kernel symmetry allows re-usage of calculated values only for the left and right column of the kernel. A calculated weighted sum of the right row can not be used in the next loop cycle, but in the loop cycle after that. This requires a buffer.                                                                                                                                                                                                                                                                                              |  |  |  |  |  |
| 3.5.2.3                 | Clipping                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |  |  |  |  |  |
| Negative pixels         | The negative values of the kernel can cause the new pixel value<br>to be negative (if the centerpixel is zero and the other pixels are<br>not). Negative pixels are invalid and should be corrected to<br>zero <sup>1</sup> .<br>Now suppose the center-source-pixel is 255, while the other<br>pixels are all zero. The new pixel value calculated will be 510<br>(255*8/4). This value is too large and should be corrected to 255.<br>The described adjustment of a pixel value because it exceeds<br>the pixels upper or lower limit is called clipping. |  |  |  |  |  |
|                         | Clipping requires the value of the calculated pixel to be tested,<br>and if necessary, adjusted. Normally, this would be done in the<br>way shown in Code example 5.                                                                                                                                                                                                                                                                                                                                                                                         |  |  |  |  |  |
|                         | <pre>If (New_Pixel &lt; 0) Then     ImageArray[Row,Col]=0 Else If (New_Pixel &gt; 255) Then     ImageArray[Row,Col]=255 Else     ImageArray[Row,Col]=New_Pixel Code example 5: Non-optimised clipping code</pre>                                                                                                                                                                                                                                                                                                                                             |  |  |  |  |  |
|                         | This will result in non optimal performance because the compiler will translate this using four jumps <sup>2</sup> (one for each 'if', one for each 'else'). The code in Code example 6 uses three jumps.                                                                                                                                                                                                                                                                                                                                                    |  |  |  |  |  |
|                         | <pre>ImageArray[Row,Col]=New_Pixel If (New_Pixel &lt; 0) Then     ImageArray[Row,Col]=0 Else If (New_Pixel &gt; 255) Then     ImageArray[Row,Col]=255 Code example 6: Optimised clipping     code</pre>                                                                                                                                                                                                                                                                                                                                                      |  |  |  |  |  |
| Optimised clipping code | This version of the clipping code will use less time because it contains less jumps; a jump which is mispredicted requires a lot of time because the pipeline has to be reloaded (see section 2.4.2).                                                                                                                                                                                                                                                                                                                                                        |  |  |  |  |  |
| 3.5.2.4                 | The final C++ implementation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |  |  |  |  |  |
|                         | Taking the issues described in the previous sections in account, the final structure of the C++ implementation of the sharpen filter is shown in Code example 7.                                                                                                                                                                                                                                                                                                                                                                                             |  |  |  |  |  |

 $<sup>^1</sup>$  If a pixels value exceeds the upper limit (255) the pixel should be saturated to 255.  $^2$  With the compiler option for optimisation set to off.

```
Code example 7: C++ implementation of the sharpen
```

```
Dst.Rows[0] = Src.Rows[0] // Copy top and bottom row
Dst.Rows[Height] = Src.Rows[Height]
RowCounter = 1
Do While (RowCounter < Height-1)
      {
         Dst.Pixels[RowCounter,0] = Src.Pixels[RowCounter,0]
        A = Src.Pixels[RowCounter-1,0] + Src.Pixels[RowCounter+1,0]
B = Src.Pixels[RowCounter-1,1] + Src.Pixels[RowCounter+1,1]
         ColCounter = 1
        Do While(ColCounter < Width-1)</pre>
              ł
               Temp=Src.Pixels[RowCounter-1,ColCounter+1] +
                    Src.Pixels[RowCounter+1,ColCounter+1]
              New_Pixel=(-A+8*Src.Pixels[ColCounter,RowCounter]-Temp)/4
              Dst.Pixels[RowCounter,ColCounter]=New_Pixel
               If (New_Pixel < 0) Then
                    Dst.Pixels[RowCounter, ColCounter]=0
               Else If (New_Pixel > 255) Then
                    Dst.Pixels[RowCounter, ColCounter]=255
              A = B
                                   //Shift buffer contents
              B = Temp
              ColCounter = ColCounter+1
              }
       Dst.Pixels[RowCounter,Width]=Src.Pixels[RowCounter,Width]
       RowCounter = RowCounter+1
       1
```

| Performance    | The performance of the code is discussed in section 3.5.3.11, where it is compared to that of the MMX implementation.                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 3.5.3          | MMX implementation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| 3.5.3.1        | Parallelability of the algorithm                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|                | Similar to the smooth filter, there are no limitations as far as the parallelability of the sharpen filter is concerned.                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| 3.5.3.2        | Pixel block width                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                | Next it has to be decided how many pixels to process in parallel.<br>This is determined by the number of data elements processed<br>parallel; eight packed bytes, four packed words or two packed<br>double-words.                                                                                                                                                                                                                                                                                                                                                                   |
| Smooth filter  | The smooth filter read pixels as packed bytes from memory,<br>after which they are extended to packed words for accuracy<br>reasons. Each loop cycle eight new pixels are calculated.<br>Because the ten-pixel wide rows of input pixels required to<br>calculate eight new pixel values have to be read with two MMX<br>memory reads. Besides this the pixels have to be extended to<br>packed words (see section 3.4.4.2). Therefore instead of<br>calculating eight pixels per loop cycle, the smooth filter could<br>also be implemented calculating four pixels per loop cycle. |
| Sharpen filter | Now, lets look at the input pixels required for the sharpen<br>algorithm to calculate eight new pixels. In Figure 32 the new<br>values for the pixels 22 to 29 are calculated, requiring input from<br>the grayed area.                                                                                                                                                                                                                                                                                                                                                              |
|                | -1/4 -1/4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |

| -1/4     |          | -1/4     |       |
|----------|----------|----------|-------|
| Figure 3 | 31: Shar | bening k | ernel |

| 1<br>21  | 2  | 3   | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 |
|----------|----|-----|----|----|----|----|----|----|----|----|----|
| 21       | 22 | 23  | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
| 41       | 42 | 43  | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 |
| 61       | 62 | 63  | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 |
| <u> </u> |    | ~ ~ |    |    |    |    |    |    |    |    |    |

Figure 32: Source pixels for the sharpen algorithm

Notice that the pixels 21 and 30, are not used for the calculation of 22-29. This is caused by the fact that the kernel does not use the pixel to the left and the pixel to the right of the centerpixel of the kernel.

Similar to the smooth, the sharpen requires ten pixels of the top and bottom row. But for the middle row only eight pixels are needed. These pixels can be read with one MMX memory read, instead of two for the top and bottom row (each). If the filter processes eight pixels per loop cycle five memory reads are required to calculate eight pixels. When processing four pixels per loop cycle, three reads are required for four pixels (=six reads for eight pixels).

Because a memory access requires relatively many cycles, it has been decided to process eight pixels per loop cycle.

Eight pixels per loop cycle

3.5.3.3

#### Re-usage of calculated values

Suppose the image is traversed vertically, and that the total calculation is split into three calculations (one for each of the kernel-rows). The calculated values for the bottom row can then be stored to be used for the top row in the loop cycle after the next loop cycle, like shown in the following example, with the image shown in Figure 34 (the sharpen kernel is shown in Figure 33).

| -1 |   | -1 |  |  |  |  |
|----|---|----|--|--|--|--|
|    | 8 |    |  |  |  |  |
| -1 |   | -1 |  |  |  |  |
|    |   |    |  |  |  |  |

Figure 33: Sharpen kernel

| 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 21  | 22  | 23  | 24  | 25  | 26  | 27  | 28  | 29  | 30  | 31  |
| 41  | 42  | 43  | 44  | 45  | 46  | 47  | 48  | 49  | 50  | 51  |
| 61  | 62  | 63  | 64  | 65  | 66  | 67  | 68  | 69  | 70  | 71  |
| 81  | 82  | 83  | 84  | 85  | 86  | 87  | 88  | 89  | 90  | 91  |
| 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 |

Figure 34: The intermediate result of the bottom row

For each of the three image rows the intermediate result of the kernel coefficients can be calculated. In the next loop cycle the pixels 42 to 49 are calculated.

| 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 21  | 22  | 23  | 24  | 25  | 26  | 27  | 28  | 29  | 30  | 31  |
| 41  | 42  | 43  | 44  | 45  | 46  | 47  | 48  | 49  | 50  | 51  |
| 61  | 62  | 63  | 64  | 65  | 66  | 67  | 68  | 69  | 70  | 71  |
| 81  | 82  | 83  | 84  | 85  | 86  | 87  | 88  | 89  | 90  | 91  |
| 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 |

Figure 35: The middle row

Because the middle row of the kernel is completely different than the bottom row, the intermediate result of the bottom row cannot be re-used. In the next loop cycle however, the stored intermediate result can be re-used for the top row.

| 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 21  | 22  | 23  | 24  | 25  | 26  | 27  | 28  | 29  | 30  | 31  |
| 41  | 42  | 43  | 44  | 45  | 46  | 47  | 48  | 49  | 50  | 51  |
| 61  | 62  | 63  | 64  | 65  | 66  | 67  | 68  | 69  | 70  | 71  |
| 81  | 82  | 83  | 84  | 85  | 86  | 87  | 88  | 89  | 90  | 91  |
| 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 |

Figure 36: Re-using the bottom row

Vertical traversing Obviously re-using calculated values in this way is only possible when running vertically through the image. If the image is traversed horizontally, only the two rightmost edge values can be re-used, similar to the smooth algorithm (see section 3.4.3.3).

### 3.5.3.4 Optimal image traversing

The fact that the re-usage of calculated values only is possible when running vertically through the image, combined with the cache related factors described in section 3.4.2.1 leads to the conclusion that optimal image traversing can be reached by using rowsets in the same way as with the smooth filter (see section 3.4.3.4).

The optimal rowset size has to be determined again, because the way the sharpen filter reads the data is slightly different than the way the smooth does. After the sharpen filter has been implemented completely the optimal rowset size is determined (see section 3.5.3.9).

3.5.3.5 Bo

Using rowsets

## Border pixels

Similar to the smooth, the sharpen algorithm requires special treatment of border pixels. For example, the left-border pixels would has to be calculated with the kernel shown in Figure 37.

|     | -1/6 |
|-----|------|
| 8/6 |      |
|     | -1/6 |
|     |      |

Figure 37: Kernel for left-border pixels

Because a border of the image is cut off anyway it is decided to skip the border pixels (top-, bottom-, left- and right edge).

With the smooth filters MMX implementation it was decided to extend the image's width to a multiple of eight  $(=8n)^1$ . Not

surprisingly, this is an optimal width for the sharpen filter too.Because the pixels are calculated in groups of eight, one MMXCopy bordersmemory move can be used to copy one pixel of the left border<br/>as well as seven pixels of the right border.

## 3.5.3.6 The basic loop structure

From the previous section it is clear that the loop structure used for the smooth filter can be used also for the sharpen filter, as shown in Code example 8.



Code example 8: The basic loop structure of the sharpen

The number of rows in a rowset temporarily is set to 256. After the filter has been fully implemented the optimal rowset size has to be determined (section 3.5.3.9).

Avoiding 'if-then' The jump normally required for the 'if-then' at the end of the outer loop can be avoided as described in section 3.4.4.1.

## Register allocation

3.5.3.7

Next the register allocation has to be determined. Basically, all the pixels needed to calculate a new pixel are loaded into MMX registers. After the centerpixels are multiplied with their weight (8) the four pixels of the upper and lower corners are subtracted. Finally the result is divided by four.

Now consider a pixel in a white plane like shown in Figure 38 is calculated.

| Pixe | ls: |     |   | Kerr | iel: |    |   |
|------|-----|-----|---|------|------|----|---|
| 255  | 255 | 255 | ] | -1   |      | -1 |   |
| 255  | 255 | 255 |   |      | 8    |    | ] |
| 255  | 255 | 255 |   | -1   |      | -1 | 1 |
|      |     |     |   |      |      |    | ÷ |

Figure 38: Calculating a pixel in a white plane

Suppose the new pixel value is calculated using packed bytes. Signed packed bytes are required because of the subtraction of the corner-pixels, resulting in a range of -128 to +127. The calculation of the new pixel is shown in [4] (saturated values are italic).

new pixel = 
$$\frac{-255 - 255 + 255 - 255 - 255}{4} = -191 = 0$$
 (clipped)

Bytes saturate

After the multiplication with eight the centerpixel saturates to 255. In [4] the corner pixels are subtracted unsigned with saturation, leading to the result shown<sup>1</sup>. Because negative pixels are not allowed, this value has to be clipped to 0 (see section 3.5.2.3).

The calculation without the (unwanted) saturation is shown in [5].

new pixel = 
$$\frac{-255 - 255 + (8 \times 255) - 255 - 255}{4} = 255$$
 [5]

255

The two resulting pixels are shown in Figure 39.

0



Figure 39: Difference between pixel value 0 and 255

Because these changes are unacceptable it has been decided to extend the packed unsigned bytes of the source image to packed

Packed signed words

Top kernel row

signed words before performing calculations.

Suppose with the memory shown in Figure 40 the new values for the pixels 12 to 15 have to be calculated.

| Ima | ige: |    |    |    |    |    |    |    | Ker | nel: |  |
|-----|------|----|----|----|----|----|----|----|-----|------|--|
| 01  | 02   | 03 | 04 | 05 | 06 | 07 | 08 | 09 | -1  |      |  |
| 11  | 12   | 13 | 14 | 15 | 16 | 17 | 18 | 19 |     | 8    |  |
| 21  | 22   | 23 | 24 | 25 | 26 | 27 | 28 | 29 | -1  |      |  |
| 31  | 32   | 33 | 34 | 35 | 36 | 37 | 38 | 39 |     |      |  |

Figure 40: Calculating four pixels

The calculation is divided into four parts; one part for each row, and a fifth to merge the three rows. The example in Figure 41 shows how the intermediate result for the top row is calculated. Similar to the smooth, register esi points to the source array.

| movq mm0, [esi] | mm0=[ | 04    | 03    | 02    | 01    |
|-----------------|-------|-------|-------|-------|-------|
| movq mm1[esi+4] | mm1=[ | 06    | 05    | 04    | 03    |
| paddusw mm0,mm1 | mm0=[ | 04+06 | 03+05 | 02+04 | 01+03 |

Figure 41: Calculating an intermediate result

The example assumes the pixels are written in the array as signed words; in the real implementation the pixels are written as unsigned bytes. The second instruction then reads data from (esi+2) instead of (esi+4).

Notice that the pixels are reverse ordered compared to the pixels in memory. This is caused by the way multi-byte data is read from memory in Intel Architecture microprocessors (see section 2.4.5).

<sup>&</sup>lt;sup>1</sup> Other implementation of the calculation have been investigated, but they all produce incorrect results.

Merge kernel rows

In the same way like shown in Figure 41 the intermediate values for the bottom row is calculated. The code in Figure 42 merges the three intermediate results, and writes the resulting pixels. The intermediate results are assumed to be loaded into mm0 (top-), mm1 (middle-) and mm2 (bottom row). In the final implementation the top row's intermediate value is re-used from the bottom row of previous pixels.

| mm0= | 04+06           | 03+05           | 02+04           | 01+03           |  |
|------|-----------------|-----------------|-----------------|-----------------|--|
| mm1= | 15              | 14              | 13              | 12              |  |
| mm2= | 24+26           | 23+25           | 22+24           | 21+23           |  |
| mm1= | -04-06+15       | -03-05+14       | -02-04+13       | -01-03+12       |  |
| mm1= | -04-06+15-24-26 | -03-05+14-23-25 | -02-04+13-22-24 | -01-03+12-21-23 |  |

psubw mm1,mm0 psubw mm1,mm2

Arithmetic vs. logical shift

Figure 42: Merging the three intermediate results

Next, the value in mm1 has to be divided by four to complete the operation. This is done with the instruction 'PSRAW MM1, 2' (Packed Shift Right Arithmetic Words) which shifts the words in mm1 two bits to the right. A common mistake is that instead of the arithmetic shift the logical shift instruction is used. For negative values this will produce incorrect results because then the (leftmost) sign-bit is also shifted, as illustrated in the example in Figure 43 where a byte is shifted.

|                                     | Binary                   | Decimal |
|-------------------------------------|--------------------------|---------|
| Start value:                        | 11111100                 | -4      |
| After shift right logical 1 bit:    | 01111110                 | 129     |
| After shift right arithmetic 1 bit: | 11111110                 | -2      |
| Figure 40. Chift legical ve         | بر ما با الم الم الم الم |         |

Figure 43: Shift logical vs. shift arithmetic

Notice that the logical shift inserts zeroes at the left side of the byte, while the arithmetic shift inserts the sign bit.

Similar to the smooth, the extension from packed bytes to packed words is done with the PUNPCKLBW (Packed UNPaCK Lower Bytes to Words) instruction.

# The inner loop structure

The previous section describes how four pixels are calculated with input from packed words. In the final implementation the input pixels are read as packed bytes, and each loop cycle eight pixels are calculated.

This results in the inner loop pseudocode shown in Code example 9. This code resembles the Process\_Pixels function in the pseudocode shown on page 45.

3.5.3.8

```
//Least significant 4 pixels
mm1=ResultBuffer[0] //Upper row
mm2=Read8Pixels(esi+1) //Middle row
mm6=mm2
                        //Copy
mm2=UnpackData(mm2)
mm3=Read8Pixels(esi+Width)//Lower row
mm3=UnpackData(mm3)
mm3=CalculateRow(mm3)
mm0 = (8 * mm2 - mm1 - mm3) / 4
                        //Calc. new pixels
ResultBuffer[0]=ResultBuffer[3] //Shift buffer.
Resultbuffer[3]=mm3
                         //Store res. of lower row
                   //Most significant 4 pixels
mm2=UnpackData(mm2)
                         //Middle row
mm3=Read8Pixels(esi+Width+2)//Lower row
mm3=UnpackData(mm3)
mm3=CalculateRow(mm3)
mm1=(8*mm2-mm1-mm3)/4
                         //Calc. new pixels
ResultBuffer[1]=ResultBuffer[4] //Shift buffer
Resultbuffer[4]=mm3
                        //Store res. of lower row
mm0=PackData(mm0,mm1)
Write8Pixels(edi+1)
```

Code example 9: Inner loop structure of the sharpen

In the pseudocode in Code example 9 register esi points to the source array; edi points to the destination array like shown in Figure 44.

|               | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 |
|---------------|----|----|----|----|----|----|----|----|----|----|----|----|
| $\rightarrow$ | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
| (             | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 |
|               | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 |

esi / edi

Figure 44: Pointer positioning

Notice that in the pseudocode no statements are placed to perform clipping of pixels whose values lie outside the range 0 to 255. This code can be omitted because the pack instruction used to pack the two packed words into a packed byte automatically clips the pixels<sup>1</sup>. This is a huge advantage compared to the C++ implementation, because the 'if-then' required there consumes a lot of processing time.

### 3.5.3.9

Automatic clipping

The optimal rowset size

In section 3.5.3.4 it was decided to use rowsets for optimal image traversing. Now the entire filter has been implemented the optimal rowset size has to be determined. Performance is measured for two A4 images; the first scanned at 150 dpi, the second at 200.

Rowsets of 64

3.5.3.10

26% gain by re-using

|                                     | Row set size                                                               | 1210x1627x16 milj<br>(cycles/pixel)                                                               | 1580x2176x16 milj<br>(cycles/pixel)                                                                                                        |
|-------------------------------------|----------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
|                                     | 1                                                                          | 44                                                                                                | 45                                                                                                                                         |
|                                     | 2                                                                          | 35                                                                                                | 37                                                                                                                                         |
|                                     | 4                                                                          | 31                                                                                                | 33                                                                                                                                         |
|                                     | 8                                                                          | 30                                                                                                | 31                                                                                                                                         |
|                                     | 16                                                                         | 28                                                                                                | 29                                                                                                                                         |
|                                     | 32                                                                         | 28                                                                                                | 29                                                                                                                                         |
|                                     | 48                                                                         | 28                                                                                                | 29                                                                                                                                         |
|                                     | 64                                                                         | 27                                                                                                | 29                                                                                                                                         |
|                                     | 96                                                                         | 27                                                                                                | 36                                                                                                                                         |
|                                     | 128                                                                        | 33                                                                                                | 37                                                                                                                                         |
|                                     | 256                                                                        | 34                                                                                                | 37                                                                                                                                         |
|                                     | 512                                                                        | 37                                                                                                | 41                                                                                                                                         |
|                                     | 1024                                                                       | 38                                                                                                | 41                                                                                                                                         |
| <i>i</i> ith<br>or t<br>ows<br>s sł | ce that the r<br>the smooth<br>he referenc<br>et size is be<br>nown in the | ange of optimal rowse<br>filter <sup>1</sup> .<br>e image of 200 dpi (ri<br>etween 16 and 64. Beo | or various rowset sizes<br>et sizes is much larger as<br>ght column) the optimal<br>cause smaller images, su<br>prefer larger rowset sizes |
| scan                                | performan                                                                  |                                                                                                   | naller than the 200 dpi A<br>imal, because the 150 dp<br>ith 64 rows/rowset.                                                               |
|                                     |                                                                            |                                                                                                   |                                                                                                                                            |
| Perf                                | ormance g                                                                  | ain caused by re-usi                                                                              | ng results                                                                                                                                 |

Before paring the code it was measured if the re-using results causes a performance improvement. It appeared that re-using results causes a performance improvement of 26% for the sharpen algorithm. Since the smooth kernel is symmetrical, there the performance gain will be even more.

#### 3.5.3.11 Filter performance

After the inner loop of the filter has been paired, performance is compared to that of the unpaired code. The results are shown in Table 7.

| Image size | Cycle          | times          | Improvement |
|------------|----------------|----------------|-------------|
|            | unpaired       | paired         |             |
| (pixels)   | (cycles/pixel) | (cycles/pixel) | (%)         |
| 290x509    | 27             | 20             | 25,9        |
| 1580x2176  | 29             | 20             | 31,0        |

Table 7: Performance gain caused by pairing

31% gain by pairing In Table 7 it can be seen that pairing improves the performance of the sharpen filter significantly. The improvement for the 200 dpi A4 is 31%. This is slightly less than for the smooth filter because the amount of (unpairable) memory accesses of the sharpen is higher.

> In Table 8 the performance of the paired version is compared to that of the C++ version for several images.

<sup>&</sup>lt;sup>1</sup> See section 3.4.4.5.

|                                | Image size                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | Cycle                                                                                                                   | times                                                                                                                  | Improvement                                                   |                                                |  |  |  |
|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|------------------------------------------------|--|--|--|
|                                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | C++ version                                                                                                             | MMX version                                                                                                            | -                                                             |                                                |  |  |  |
|                                | (pixels)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | (cycles/pixel)                                                                                                          | (cycles/pixel)                                                                                                         | (%)                                                           | (factor)                                       |  |  |  |
|                                | 87x16                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 55                                                                                                                      | 12                                                                                                                     | 78,2                                                          | 4,6                                            |  |  |  |
|                                | 63x96                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 63                                                                                                                      | 16                                                                                                                     | 74,6                                                          | 3,9                                            |  |  |  |
|                                | 127x96                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | 75                                                                                                                      | 18                                                                                                                     | 76,0                                                          | 4,2                                            |  |  |  |
|                                | 255x320                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | 85                                                                                                                      | 19                                                                                                                     | 77,6                                                          | 4,5                                            |  |  |  |
|                                | 290x509                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | 88                                                                                                                      | 20                                                                                                                     | 77,3                                                          | 4,4                                            |  |  |  |
|                                | 1580x32                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | 83                                                                                                                      | 18                                                                                                                     | 78,3                                                          | 4,6                                            |  |  |  |
|                                | 1580x2176 (1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | 90                                                                                                                      | 20                                                                                                                     | 77,8<br>77,3                                                  | 4,5                                            |  |  |  |
|                                | 1580x2176 (2) 88 20                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |                                                                                                                         |                                                                                                                        |                                                               | 4,4                                            |  |  |  |
|                                | Table 8: Performance of the sharpen algorithm                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                                                                                                         |                                                                                                                        |                                                               |                                                |  |  |  |
| MMX 4x faster than C++         | <ul> <li>The lower two images are A4 images scanned at 200 dpi. The difference between the cycle times is caused by the fact that in the first image more clipping<sup>1</sup> occurs than in the second. In the C++ implementation this causes a slight performance reduction because of the additional code processed.</li> <li>From Table 8 it can be concluded that the MMX implementation is about a factor 4.4 faster that the C++ implementation. Processing 200 dpi A4 images with the C++ implementation takes 90 cycles/pixel (1.86 seconds. The MMX implementation requires 20 cycles/pixel (0.41 seconds).</li> <li>Similar to the smooth, the number of cycles/pixel for the sharpen</li> </ul> |                                                                                                                         |                                                                                                                        |                                                               |                                                |  |  |  |
| Borders                        | generally decreat<br>discussed in sector<br>Performance of<br>with 0.5% <sup>(2)</sup> by re-<br>the image to a m<br>processing path<br>multiple of eight                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | tion 3.4.4.7 this<br>the MMX impler<br>emoving the fun<br>nultiple of eight.<br>can only proces                         | is caused by car<br>nentation could<br>ction that increa<br>Note that such a                                           | ching eff<br>be increa<br>ses the<br>a versior                | ects.<br>ased<br>width of<br>of the            |  |  |  |
| 3.5.3.12                       | Categorising th                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | e processing t                                                                                                          | ime                                                                                                                    |                                                               |                                                |  |  |  |
|                                | Figure 45 shows algorithm. The c the smooth filter                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | ycle times are c                                                                                                        |                                                                                                                        |                                                               |                                                |  |  |  |
| 43% calculations<br>49% memory | Notice that, com<br>for calculations (<br>sharpen kernel r<br>read the pixels h<br>sharpen filter red<br>source image, w<br>kernel row. Beca<br>operations has in                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | (43% vs. 51%) to<br>equires less call<br>has increased fro<br>quires the middle<br>hereas the smo<br>ause of this the t | his is caused by<br>culations. The tir<br>om 17% to 26%<br>e kernel row to b<br>oth filter can re-<br>ime required for | the fact<br>me requi<br>because<br>be read fouse<br>use the b | that the<br>red to<br>the<br>orm the<br>pottom |  |  |  |
|                                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |                                                                                                                         |                                                                                                                        |                                                               |                                                |  |  |  |

 <sup>&</sup>lt;sup>1</sup> See section 3.5.2.3.
 <sup>2</sup> 8 / 1580\*100%.
 <sup>3</sup> See section 3.4.4.8.
 <sup>4</sup> See Figure 27.
 <sup>5</sup> 26%+15%+8% = 49%



Figure 45: Categorised processing time of the sharpen

In Figure 46 the processing time usage is shown categorised according to the instruction type. This chart also shows that the number of memory operations has increased.



Figure 46: Processing time per instruction type

| 3.6                         | RGB to CMYK conversion                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |  |  |
|-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| 3.6.1                       | The algorithm                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |  |  |
| 3.6.1.1                     | The basic conversion                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |  |  |
|                             | After the smoothing and sharpening the image has to be prepared for printing. This is primarily done by the halftoning algorithm discussed in section 3.7.                                                                                                                                                                                                                                                                                                                                                                                                                                |  |  |
| Prepare for halftoning      | <ul> <li>Before the halftoning algorithm can be applied the image has to be converted from the RGB color representation to the CMYK<sup>1</sup> representation. This is necessary for two reasons:</li> <li>1. Print engines print from the CMYK<sup>2</sup> color domain.</li> <li>2. When halftoning RGB images around black areas, such as black text, color is created.</li> </ul>                                                                                                                                                                                                    |  |  |
| Complement                  | Since the CMYK color representation basically is the complement of the RGB representation, the corresponding CMYK value can be calculated from an RGB value.                                                                                                                                                                                                                                                                                                                                                                                                                              |  |  |
|                             | <ul> <li>This calculation is performed in three steps:</li> <li>1. Cyan = 255-Red.<br/>Magenta = 255-Green<br/>Yellow = 255-Blue</li> <li>2. Black = Min(Cyan, Magenta, Yellow)</li> <li>3. Cyan = Cyan-Black<br/>Magenta = Magenta-Black<br/>Yellow = Yellow-Black</li> </ul>                                                                                                                                                                                                                                                                                                            |  |  |
| 3.6.1.2                     | The interpolation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |  |  |
| Lookup table<br>Corrections | The conversion described in section 3.6.1.1 is not used in Océ processing paths. Instead the conversion is performed by using a 3D lookup table. The RGB values are used to determine the position in the table from where the CMYK value can be read. The conversion is performed this way to enable corrections in the processing path. These correction are required to correct the non-linear toner characteristics.                                                                                                                                                                  |  |  |
|                             | Unfortunately, if all RGB values would be placed in the lookup table, it would have a size of 64 MB <sup>3</sup> . Since this is much too large, the table is made smaller by placing every eighth R, G and B entry on the axis', instead of all R, G and B values.                                                                                                                                                                                                                                                                                                                       |  |  |
| 3D interpolation            | Since most CMYK values cannot be read directly from the table<br>a 3D interpolation has to be performed. Figure 47 and [ 6 ] show<br>how this interpolation is performed. The value of point x is<br>calculated by weighed adding the eight angular points that lie<br>around it. The weights are represented by three fractions $\alpha$ , $\beta$<br>and $\gamma$ . If the interpolated point lies at angular point zero for<br>example, $\alpha$ , $\beta$ and $\gamma$ are zero. In [ 6 ] all except point zero will be<br>weighed zero and x will have the same value as point zero. |  |  |

<sup>&</sup>lt;sup>1</sup> Cyan, Magenta, Yellow, Black. <sup>2</sup> See section 2.1.2.2. <sup>3</sup> 256\*256\*256\*4 = 67,108,864 B = 64MB.

|                       | $\beta \downarrow 0$<br>Figure 47: 3D interpolation                                                                                                                                                                                                                                                                                                                                                                                                           |
|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                       | $X \approx (1-a)(1-b)(1-\underline{e}) \cdot f(0) \qquad [6]$ $+ a \cdot (1-b)(1-\underline{e}) \cdot f(1)$ $+(1-a) \cdot b \cdot (1-\underline{e}) \cdot f(2)$ $+ a \cdot b \cdot (1-\underline{e}) \cdot f(3)$ $+(1-a)(1-b) \cdot \underline{e} \cdot f(4)$                                                                                                                                                                                                 |
|                       | + $\mathbf{a} \cdot (1-\mathbf{b}) \cdot \mathbf{g} \cdot f(5)$<br>+ $(1-\mathbf{a}) \cdot \mathbf{b} \cdot \mathbf{g} \cdot f(6)$<br>+ $\mathbf{a} \cdot \mathbf{b} \cdot \mathbf{g} \cdot f(7)$                                                                                                                                                                                                                                                             |
| 3.6.2                 | C++ implementation                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| 3.6.2.1               | Traversing the image arrays                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| Horizontal traversing | The first question to ask when implementing the conversion is<br>how to traverse the image array(s). Since none of the directions<br>offers special advantages (such as the re-use of intermediate<br>results) the direction that uses the caches optimally can be<br>chosen. Based on the arguments described in section 3.4.3.4<br>the horizontal direction is chosen. This way of traversing the<br>image can be implemented with one loop.                |
|                       | Since the input of the conversion function consists of three RGB arrays and the output of four CMYK arrays, a total of seven arrays have to be processed. Obviously, all the arrays are traversed in the same way.                                                                                                                                                                                                                                            |
| 3.6.2.2               | Reading the lookup table                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| 3D lookup table       | <ul> <li>The three dimensional lookup table can be stored in memory in two ways:</li> <li>1. As a three dimensional array containing 32 bit values (each containing the packed CMYK value. To be able to read the individual values the 32 bit quantity is copied into an array of four bytes.</li> <li>2. As a four dimensional array. This way the fourth dimension can be used to select the individual color components of the angular points.</li> </ul> |
| Copy 32 bit value     | A disadvantage of the first method is that the 32 bit quantity<br>must be copied into an array of four bytes. A disadvantage of the<br>second is that an additional multiply operation, which costs many<br>cycles, is used by the compiler to calculate the address of the                                                                                                                                                                                   |

element<sup>1</sup>. Since a multiply operation is a very time consuming operation the first method of storing the lookup table is chosen, but both methods seem to require about the same amount of time. Another method of addressing the color components in the lookup table is overlaying the lookup table with an array of four bytes. This method seems to be the most elegant one, but the required pointer manipulation will cost the same amount of time as needed to copy the 32 bit value to another array (the pointer is also stored as a 32 bit variable in the memory). 3.6.2.3 Implementation of the interpolation Basically the interpolation can be devided into: 1. Calculate the fractions  $(\alpha, \beta, \gamma)$  that determine the weight of the eight angular points. 2. Multiply each of the angular points values with its appropriate weight for all four color planes. 3. Add the eight weighted angular points. Fractions  $\alpha$ ,  $\beta$  and  $\gamma$ The fractions  $\alpha$ ,  $\beta$  and  $\gamma$  would normally have a value between zero and one. These values would be calculated by dividing the distance from point zero (which is zero to seven) by eight (the distance between two R, G, or B entries in the lookup table). To avoid rounding errors the division by eight is performed after each points' values have been multiplied with the distance to point zero. Obviously, fractions like  $(1-\alpha)$  should be converted to **(8-α)**. Since multiply operations required require much more cycles Multiply operations (10) than other operations, it is important to keep the number of multiplies as low as possible. When implementing the interpolation straightforward, 21 multiplies are required (see [ 6 ]). If the gammas are ignored, the weights of the points zero to three are the same as the weights of the points four to seven (see [7]). Computing these four different weights requires four<sup>2</sup> multiplies. After the points zero to three have been weighed (four multiplies), the sum of those values is multiplied with 8-gamma. Since the alpha and beta weights of point zero to three can be Re-using weights re-used to weigh the points four to seven, this only requires five multiplies, resulting in a total of 14 multiplies.

<sup>&</sup>lt;sup>1</sup> The address of an element in a multidimensional array is calculated by the compiler using a multiply for each dimension (the index times the size of the lower dimension). For more information on how the elements in a multidimensional arrays are addressed see section 3.6.3.4. <sup>2</sup> One multiply per weight.

|                        | $X \approx (8 - \boldsymbol{a})(8 - \boldsymbol{b}) f(0) $                                                                                                                                                                                                                                                                                                                                                                          | [7] |
|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
|                        | $+ \mathbf{a} \cdot (8 - \mathbf{b}) f(1)$                                                                                                                                                                                                                                                                                                                                                                                          |     |
|                        | $+(8-a)\cdot b\cdot f(2)$                                                                                                                                                                                                                                                                                                                                                                                                           |     |
|                        | $+ \mathbf{a} \cdot \mathbf{b} \cdot f(3)$                                                                                                                                                                                                                                                                                                                                                                                          |     |
|                        | +(8-a)(8-b)f(4)                                                                                                                                                                                                                                                                                                                                                                                                                     |     |
|                        | $+ a \cdot (8-b) f(5)$                                                                                                                                                                                                                                                                                                                                                                                                              |     |
|                        | $+(8-a)\cdot b\cdot f(6)$                                                                                                                                                                                                                                                                                                                                                                                                           |     |
|                        | + $\mathbf{a} \cdot \mathbf{b} \cdot f(7)$                                                                                                                                                                                                                                                                                                                                                                                          |     |
|                        | Since this reduces the number of multiplies significantly, it has<br>been decided to calculate the four alpha and beta weights<br>separately. The calculated values are stored in an array.                                                                                                                                                                                                                                         | S   |
| Four interpolations    | Notice that the above describes the interpolation of one value<br>Four of these interpolations are required for the entire RGB to<br>CMYK conversion.                                                                                                                                                                                                                                                                               |     |
| 32-bit arithmetic      | Before the interpolations can be implemented, it has to be<br>chosen which data type to use when performing the calculatio<br>It is not possible to use eight bit arithmetic because of the<br>multiplications <sup>1</sup> .<br>Sixteen bit arithmetic would be sufficient, but since 32-bit<br>arithmetic is faster <sup>2</sup> it has been chosen to use 32-bit arithmetic<br>(integers).                                       |     |
| 3.6.2.4                | Reading and writing the pixels                                                                                                                                                                                                                                                                                                                                                                                                      |     |
| Reading angular points | Reading the required eight angular points from the lookup table<br>is not difficult; the R, G and B values divided by eight can be<br>used as indexes. At position (R/8, G/8, B/8) point zero is located,<br>point one is loaded from position (R/8+1, G/8, B/8) and so on.<br>The individual color components are selected by mapping an<br>array of four bytes over the position in the lookup table indexed<br>by the RGB value. |     |
| Reading RGB            | The R, G and B values are read from the input arrays. Their position is indicated by the loop index.                                                                                                                                                                                                                                                                                                                                |     |
| Writing CMYK           | Writing the interpolated CMYK values to the four respective<br>arrays is not be a problem either. However, it would not be very<br>efficient to write the CMYK values to four new arrays. Since the<br>RGB values read are no longer needed, they can be overwritten<br>by three of the calculated values. By overwriting the source                                                                                                |     |
| Write to source arrays | pixels by calculated values, the memory bandwidth can be reduced drastically, resulting in better performance.                                                                                                                                                                                                                                                                                                                      |     |
| 3.6.2.5                | The final C++ implementation                                                                                                                                                                                                                                                                                                                                                                                                        |     |
| α, β and γ. remainder  | The issues described in the previous sections lead to the implementation shown in Code example 10. The fractions $\alpha$ , $\beta$ and $\gamma$ are calculated by taking the remainder of the division of G and B by eight. Of course the added neighbours have to be divided three times by eight <sup>3</sup> to compensate this.                                                                                                | R,  |

 $<sup>^1</sup>$  This would cause overflow.  $^2$  Source: "How to optimize for the Pentium Processor" [11], section 20.  $^3$  One time for  $\alpha,\,\beta$  and  $\gamma$  each.

3.6.3

3.6.3.1

```
// Position of neighb. 0
            Integer x, y, z
            Integer alpha, beta, gamma
            Integer Weights[4] // Weights of neighb. (ignore gamma)
            Byte Neighbours[8][4]
                                      // 8 neighbours, each CMYK
            Integer i=0
            Do While (i < Width*Height)</pre>
                   alpha=cArrayR[i]%8
                   beta =cArrayG[i]%8
                   gamma=cArrayB[i]%8
                   Weights=Calculate4Weights(alpha,beta)
                   x=cArrayR[i]/8
                   y=cArrayG[i]/8
                   z=cArrayB[i]/8
                   Neighbours=Read8Neighbours(x,y,z)
                   cArrayR[i]=(Weights[0 to 3]*
                                                                   // Cyan
                                       Neighbours[0 to 3][0]*(8-gamma) )+
                                (Weights[0 to 3]*
                                       Neighbours[4 to 7][0]*gamma) )/8/8/8
                   cArrayG[i]=(Weights[0 to 3]*
                                                                  // Magenta
                                       Neighbours[0 to 3][1]*(8-gamma) )+
                                (Weights[0 to 3]*
                                       Neighbours[4 to 7][1]*gamma) )/8/8/8
                    cArrayB[i]=(Weights[0 to 3]*
                                                                   // Yellow
                                       Neighbours[0 to 3][2]*(8-gamma) )+
                                (Weights[0 to 3]*
                                       Neighbours[4 to 7][2]*gamma) )/8/8/8
                    cArrayK[i]=(Weights[0 to 3]*
                                                                   // Black
                                       Neighbours[0 to 3][3]*(8-gamma) )+
                                (Weights[0 to 3]*
                                       Neighbours[4 to 7][3]*gamma) )/8/8/8
                    i++
                               Code example 10: The RGB to CMYK conversion
Performance
                         The performance of the code is discussed in section 3.6.3.9,
                         where the performance is compared to the MMX
                         implementation.
                         MMX implementation
                         Parallelability of the algorithm
                         Parallel pixel processing
                         Since no dependencies between the pixels exist, the basic RGB
                         to CMYK conversion does not limit the number of pixels to
                         process in parallel.
```

- Lookup table By using a lookup table however the number of pixel processed in parallel is limited to one, since the lookup operation can only be performed for one pixel. It is possible to create a lookup table in which multiple pixels can be looked up, but such a table would be very large.
- Parallel interpolation To perform the interpolation for multiple pixels in parallel, the pixels have to be packed manually, since they have to be looked up one at the time. Obviously, packing the pixels manually will decrease performance significantly. In the next two paragraphs

Four weights

Re-ordering

two other ways to implement parallelism in the algorithm will be discussed.

Both methods perform the interpolation in parallel. A key factor in the choice of one of these methods is the way the weights are calculated. Obviously the four weights required (ignoring gamma) are calculated in parallel. The resulting weights are loaded in a register as shown in Figure 48 (the weights could be reverse ordered). The weights are numbered according to the angular point they are applied to.

| (8-α)(8-β) | α (8-β)         | (8-α) β | αβ     |
|------------|-----------------|---------|--------|
| $=W_0$     | =W <sub>1</sub> | =W2     | $=W_3$ |

Figure 48: Calculated weights in register

Figure 49 shows the four color components of angular point zero loaded into a register<sup>1</sup>. Notice that these registers cannot be multiplied without some re-ordering because all the color components of a point have to multiplied with the same weight (in this case  $W_0$ ). Since re-ordering a register requires quite an amount of computations, the choice on how to perform the interpolation in parallel primarily depends on the amount of re-ordering required.

| C <sub>0</sub> | M <sub>0</sub> | Y <sub>0</sub> | K <sub>0</sub> |
|----------------|----------------|----------------|----------------|
|                |                |                | in realister   |

Figure 49: Loaded angular point in register

Before the two methods of parallel pixel processing can be<br/>discussed, the number of pixels calculated in parallel has to be<br/>determined. It might be tempting to try to interpolate eight pixels<br/>in parallel. However, this is not possible because the multiply<br/>operations required operate on packed words only. Therefore<br/>four words in parallelFour words in parallelfour words are multiplied in parallel,

## The four planes in parallel

The first method of parallel processing is to interpolate the four color planes in parallel. The neighbour points read from the lookup table are loaded in the right format (see Figure 49) but the

Re-order four weights weights (see Figure 48) have to be re-ordered. Since four different weights are used<sup>2</sup>, four registers have to be re-ordered. To be more precise; four weights have to be duplicated to four words.

After the weights have been duplicated correctly, they can be multiplied with the angular points. Finally the multiplied angular points can be added and divided without any re-ordering.

### Process four points in parallel

Re-order eight points The second method processes the input of four angular points in parallel. An advantage of this method is that the weights do not have to be re-ordered in any way. The eight angular points however have to be re-ordered completely. They are read from the lookup table as shown in Figure 49, but in order to multiply

<sup>&</sup>lt;sup>1</sup> The four components are extended to words because the multiplication can only be performed in packed words.

<sup>&</sup>lt;sup>2</sup> Each is used for two of the eight angular points.

Shift before addition

them correctly they have to be re-ordered to the format shown in Figure 50.



Figure 50: Format to process four pixels in parallel

After the multiplications have been performed, the input of the four angular points has to added. Since the packed add operation

cannot add the four words straightway, they first have to be shifted to the format shown in Figure 51. Notice that this addition uses only one of the four datapaths available.

| C <sub>3</sub> |
|----------------|
| C <sub>2</sub> |
| C <sub>1</sub> |
| C <sub>0</sub> |

Figure 51: Format to add the pixels

# Four points or four planes ?

|                         | Processing four <i>planes</i> in parallel requires four weights to be re-<br>ordered. The angular points read from the lookup table do not<br>have to be re-ordered. Processing four angular <i>points</i> in parallel<br>requires eight angular points to be re-ordered. Additionally the<br>results have to be shifted before the weighed points can be<br>added.                                                                                                                                                                                                                                                                                                                                                             |
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Four planes in parallel | Obviously the first method requires less operations than the second. Therefore it has been chosen to process the four planes in parallel.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| 3.6.3.2                 | Optimal image traversing                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Horizontally            | In section 3.6.2.1 it can be read that the RGB to CMYK conversion sets no limitations on the way the image is traversed. Therefore it has been decided that the MMX implementation will traverse the image horizontally, since this direction uses the caches optimally <sup>1</sup> .                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|                         | <ul> <li>Normally horizontal traversing means the image is traversed from the top-left to the down-right pixel of the image (each row from left to right). Such an implementation generally requires three registers;</li> <li>1. A counter of the number of pixels to process (or the number of pixels processed).</li> <li>2. One or more pointers to the position of the input array(s) currently processed.</li> <li>3. One or more pointers to the location in the output array(s) where the calculated result has to be stored.</li> <li>Since the RGB to CMYK conversion requires multiple pointers to source and destination registers, there are not enough registers available for such an implementation.</li> </ul> |

| Pixel addressing                                  | Therefore a different way of addressing the pixels has been chosen. One register is used to count the number of pixels left to process <sup>1</sup> , while the addresses of the input and output pixels are determined by adding the base address of the arrays to the register. Since the register is decremented the image array is processed backwards.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |  |
|---------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
|                                                   | This way of indexed addressing is commonly used in C++ code.<br>A disadvantage of this method is that the base address of the<br>array has to be read from memory, but since this memory<br>location is frequently accessed it generally will be loaded into the<br>cache.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |  |
| 3.6.3.3                                           | Determining <b>a</b> , <b>b</b> and <b>g</b>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |  |
|                                                   | Now that it has been decided how to implement parallelism, the way to calculate the distances of the interpolated point to neighbour zero has to be determined. These three values $(\alpha, \beta \text{ and } \gamma)$ are used to determine the weights of the points zero to three (ignoring gamma). These values are loaded into a register as shown in Figure 48.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |  |
| Remainder                                         | Alpha, beta and gamma are determined by taking the remainder of the division by eight of the R, G and B values.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |  |
| 3.6.3.4                                           | Absolute addresses of angular points                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |  |
|                                                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |  |
|                                                   | Point zero                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |  |
| Absolute address point zero                       | <b>Point zero</b><br>The next preparation necessary before the actual interpolation<br>can be performed is the calculation of the absolute position of<br>point zero. This is necessary because addressing a three<br>dimensional array in assembly is not as easy as in C++. The<br>addresses of the other angular points are calculated by adding<br>an offset to the address of point zero.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |  |
| Absolute address point zero<br>2D array in memory | The next preparation necessary before the actual interpolation<br>can be performed is the calculation of the absolute position of<br>point zero. This is necessary because addressing a three<br>dimensional array in assembly is not as easy as in C++. The<br>addresses of the other angular points are calculated by adding<br>an offset to the address of point zero.<br>To understand the way this address is calculated, it is necessary<br>to understand the way multidimensional arrays are placed in<br>memory. Figure 52 shows how a two dimensional array A is<br>placed in memory. Notice that the rightmost dimension is placed<br>linear in the memory.<br>$A \longrightarrow \begin{bmatrix} 0][0]\\[0][1]\\[0][2]\\[1][0]\\[1][1]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1][2]\\[1$ |  |
|                                                   | The next preparation necessary before the actual interpolation<br>can be performed is the calculation of the absolute position of<br>point zero. This is necessary because addressing a three<br>dimensional array in assembly is not as easy as in C++. The<br>addresses of the other angular points are calculated by adding<br>an offset to the address of point zero.<br>To understand the way this address is calculated, it is necessary<br>to understand the way multidimensional arrays are placed in<br>memory. Figure 52 shows how a two dimensional array A is<br>placed in memory. Notice that the rightmost dimension is placed<br>linear in the memory.<br>$A \longrightarrow \underbrace{[0][0]\\[0][2]\\[1][0]]}$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |  |

<sup>&</sup>lt;sup>1</sup> This way, after the register has been decremented, a jump-non-zero can be used. When counting the number of pixels already processed, the register has to be compared to the total number of pixels, after which a jump-equal can be used. This method is slower because an additional compare is necessary.

$$pos = A + I_{[0]} + \sum_{p=1}^{n-1} \prod_{q=1}^{p} S_{[q-1]} \cdot I_{[p]}$$
[8]

The lookup table is a three dimensional array with three axis' of  $33^{(1)}$  elements. The position of element [x][y][z] can be calculated as shown in [9].

$$pos = A + 4 \cdot \left[I_{[0]} + \sum_{p=1}^{n-1} \prod_{q=1}^{p} S_{[q-1]} + I_{[p]} \right]$$

$$= A + 4 \cdot \left[I_{[0]} + \prod_{p=1}^{n-1} \sum_{q=1}^{p} S_{[q-1]} + S_{[1]} \cdot S_{[0]} \cdot I_{[2]} \right]$$

$$= A + 4 \cdot \left[I_{[0]} + \prod_{p=1}^{n-1} \sum_{q=1}^{p} S_{[q-1]} + S_{[1]} \cdot S_{[0]} \cdot I_{[2]} \right]$$

$$= A + 4 \cdot \left[I_{[0]} + \prod_{p=1}^{n-1} \sum_{q=1}^{p} S_{[q-1]} + S_{[1]} \cdot S_{[0]} \cdot I_{[2]} \right]$$

$$= A + 4 \cdot \left[I_{[0]} + \prod_{p=1}^{n-1} \sum_{q=1}^{p} S_{[q-1]} + S_{[1]} \cdot S_{[0]} \cdot I_{[2]} \right]$$

Notice that the calculated index value is multiplied by four. This is necessary because the lookup table consists of 32 bit doublewords.

### The offset of the other points

Relative to point zero

32 bit array

The offset of the other angular points relative to point zero can be determined in the same way as the physical address of point zero is calculated. Figure 53 shows the offset of the angular points relative to point zero. Point two for example can be read from the memory location 4x33 bytes after point zero. The offset of point seven is (4x1+4x33+4x33x33).



Figure 53: Physical addresses of angular points

# 3.6.3.5

## Weighed addition of the angular points

The weighed addition of the angular points basically consists of six steps. The first four steps have to be performed for each angular point, the last two are performed once for all the points.

- 1. Duplicate the weight.
- 2. Read the angular point and extend the point to packed words.
- 3. Multiply the point with the weight.
- 4. Add the result of the multiplication to the intermediate result.
- 5. Multiply the (total) intermediate result with the appropriate gamma factor.
- Add intermediate results and divide by a factor to compensate the too large fractions α, β and γ.

<sup>&</sup>lt;sup>1</sup> One would expect 32 elements. The last is added for interpolation purposes.

# Duplicating the weight

Duplicating the weight can be done straightforward. Code example 11 shows how. Notice that the other weights are lost. Therefore a copy of the original weights has to be made before starting the duplication.

|                          | $\begin{array}{c ccccccccccccccccccccccccccccccccccc$                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|--------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Two angular points       | On page 55 it can be read that, when ignoring gamma, a weight<br>can be used for two angular points. To minimise the number of<br>registers used these two points are processed right after each<br>other.                                                                                                                                                                                                                                                                                                               |
| Duplicate via memory     | An alternative method of duplicating the weight is to write the<br>non-duplicated value four times to adjacent memory locations,<br>after which the MMX register can be read from memory. This<br>method has not been chosen because of the required memory<br>bandwidth.                                                                                                                                                                                                                                                |
|                          | Read the angular point                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| Lookup table             | After the weights have been duplicated the angular point has to<br>be read from the lookup table. The way the address of the point<br>is determined is described in section 3.6.3.4. After a point has<br>been read from the calculated address, the unpack instruction<br>(PUNPCKLBW) is used to extend it to packed words.                                                                                                                                                                                             |
|                          | Multiply with the weight                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|                          | The multiplication of the pixels with the weights is done with the PMULLW instruction. Theoretically, the multiplication of two 16 bit values normally requires 32 bits to store. Obviously four 32 bits values cannot be stored in a 64 bit MMX register. To solve this problem two multiply instructions are available: one storing the lower 16 bits of the result (PMULLLW), another storing the higher 16 bits (PMULHW). In our algorithm 16 bits should be sufficient to store all possible results <sup>1</sup> . |
|                          | Add to intermediate result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| Two intermediate results | After the multiplication is performed, the result is added to the previously multiplied angular points. This way two intermediate results (for each color plane) are calculated; one that must be multiplied with gamma, another to be multiplied with 8-gamma.                                                                                                                                                                                                                                                          |
|                          | But can this result always be stored in a 16 bit quantity ? The angular points read from the lookup table are eight bit values. Figure 54 shows the four weights. Notice that the net result when adding these weights always is 64, as shown in [10].                                                                                                                                                                                                                                                                   |

 $<sup>^1</sup>$  The maximum values multiplied are 255 (pixel) and 64 ( $\alpha$  and  $\beta$  weight). The result (16320) can easily be stored in a 16 bits quantity.

| $ \begin{array}{ c c c c c c c c c c c c c c c c c c c$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |  |  |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| sum = (8-a)(8-b) + a(8-b) + (8-a)b + ab $= 64 - 8b - 8a + ab + 8a - ab + 8b - ab + ab$ $= 64$ [10]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |  |  |
| Since 64 is a seven bit number the result of the weighed addition of four angular points can be stored in a 15 bit quantity.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |  |  |
| Multiply with gamma                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |  |  |
| When multiplying the two intermediate results (each produced by weighed adding four angular points), these 15 bit values are multiplied with a four bit gamma (ranging from zero to eight).                                                                                                                                                                                                                                                                                                                                                                                                                                      |  |  |
| This results in a 19 bit value. Since the register contains (16 bit) packed words the result will saturate, causing severe rounding errors. To avoid this, the intermediate result is shifted <sup>1</sup> three <sup>2</sup> bits to the right before multiplying with gamma. This results in a 16 bit value.                                                                                                                                                                                                                                                                                                                   |  |  |
| Add and divide                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |  |  |
| Finally, the two intermediate results have to be added and divided. The addition of the two 16 bits values would result in a 17 bit value, causing saturation. Therefore the intermediate results are shifted one <sup>3</sup> bit to the right before the addition is performed.                                                                                                                                                                                                                                                                                                                                                |  |  |
| Now all the weighed angular points have been added, the sum would have to be divided three by times by eight (/8/8/8). This would be done by shifting three times three bits to the right (=9 bits). Since the value already has been shifted four bits to the right, the sum is shifted five <sup>4</sup> bits to the right.                                                                                                                                                                                                                                                                                                    |  |  |
| Rounding errors                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |  |  |
| <ul> <li>During the interpolation the divisions cause rounding errors. Three divisions are performed:</li> <li>1. The two intermediate results are divided by eight before multiplying them with gamma.</li> <li>2. After they have been multiplied they are divided by two before the are added.</li> <li>3. After the have been added the result is divided by 32.</li> <li>When dividing by eight the maximum rounding error is 7<sup>(5)</sup>. After this, these errors (in the two intermediate results) are multiplied with <i>γ</i> respectively (8-<i>γ</i>) resulting in the errors shown in [11] and [12].</li> </ul> |  |  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |  |  |

 <sup>&</sup>lt;sup>1</sup> Shift right arithmetic has to be used. If shift right logical is used incorrect results will be produced (see page 47).
 <sup>2</sup> Equivalent to a division by eight.
 <sup>3</sup> Equivalent to a division by two.
 <sup>4</sup> Equivalent to a division by 32.
 <sup>5</sup> Fractions are *always* rounded down.

Pipelining

Interleaving

$$\Delta_1 \le \Delta_1 \times \boldsymbol{g}_{\max} \le 7 \times 7 \le 49$$
 [11]

$$\Delta_2 \le \Delta_2 \times (8 - \boldsymbol{g}_{\min}) \le 7 \times 8 \le 56 \quad [12]$$

Next, a division by two is performed to prevent saturation when adding the two intermediate values. Since this division also rounds the value, the error values in [13] and [14] are rounded up.

$$\Delta_{1\le} \frac{\Delta_1}{2} \le \frac{49}{2} \le 25$$
 [13]

$$\Delta_{2\leq} \underbrace{\Delta_2}_{2} \leq \underbrace{56}_{2} \leq 28$$
 [14]

Finally, the two intermediate results are added and divided by 32.

$$\Delta \le \sqrt[\Delta_1 + \Delta_2]{32} \le \sqrt[25 + 28]{32} \le 2$$
 [15]

Rounding error These rounding errors are considered acceptable.

## Cycle times of the multiply

The multiply operation normally takes three cycles to complete. When pairing the code the time necessary for this operation can be reduced in two ways. First multiplies can be pipelined. This means each cycle one multiply operation is started and one is completed. Second, by moving two pairs of non-related statements between the multiplication and the instruction that uses the result of the multiplication, two of the three cycles needed by the multiply unit can be used to perform other actions, as shown in Code example 12.

| pmull       | w mm0,       | mm1        |   |
|-------------|--------------|------------|---|
| pand<br>por | mm3,<br>mm5, | mm4<br>mm6 |   |
| mov<br>mov  | mm4,<br>mm5, |            |   |
| padd        | mm7,         | mm0        | ▲ |

Code example 12: Adding non-related statements

| 3.6.3.6      | Reading and writing the pixels                                                                                                                                                               |  |
|--------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
|              | Reading the source pixels                                                                                                                                                                    |  |
| Base pointer | In section 3.6.3.2 it can be seen that pixels are read from the input arrays by means of the base pointer to the array and an index register.                                                |  |
|              | The base pointer to the array is stored at a memory location,<br>similar to common C++ code. The required memory space is<br>allocated by declaring a pointer variable in the C++ section of |  |

the filter implementation. This variable is declared 'static' to ensure that it is aligned correctly<sup>1</sup>.

Addressing mode The pixel cannot be read by simply adding the declared variable and the index register, since there is no such addressing mode. Therefore the variable containing the base address of the array is first loaded into a general purpose register, after which the pixel can be read.

Because the memory location where the base address is stored is accessed each loop cycle, it will be loaded into the level one cache. Therefore reading the base address takes only one cycle. Reading the pixel with the more complex addressing mode does not take any extra cycles, so this way of reading the source pixels does not decreases performance significantly.

#### Writing the resulting pixels

|                              | After the interpolation is completed the calculated pixels have to |
|------------------------------|--------------------------------------------------------------------|
|                              | be written to the destination arrays. Similar to the C++ version   |
| Write to <i>source</i> array | (see section 3.6.2.4) the three source arrays are used to store    |
|                              | the calculated pixels. The pixels are addressed in the same way    |
|                              | as the source pixels.                                              |
|                              |                                                                    |
|                              |                                                                    |

A problem is caused by the fact that the four color components, which have to be written to four separate arrays, are loaded in a MMX register as packed words in the format shown in Figure 55.

| С                                                  | М | Y | K |  |  |
|----------------------------------------------------|---|---|---|--|--|
| Figure 55: Register format after the interpolation |   |   |   |  |  |

Write one byteBecause the calculated values are written over the source<br/>pixels, only one byte must be written. This is only possible by<br/>using a general purpose register. The data is copied to this<br/>register with the MOVD instruction. Since this instruction only<br/>copies the least significant 32 bits, the MMX register has to be<br/>shifted to copy the right color component.<br/>After the 32 bits have been copied into the general purpose

register, the least significant byte can be written to memory.

Code example 13 shows how the Yellow component is saved to ArrayB[esi]. Notice that the least significant byte of eax is moved by applying a conventional move instruction to register AL. Notice also that the blacK component is lost when shifting the MMX register 16 bits to the right, so color components have to be saved from right to left.

| pslrq<br>movd<br>mov<br>mov | eax,<br>edi, | 16<br>mml<br>ArrayB<br>esi], a | ;mm1=[<br>;mm1=[<br>;eax=[ | 0<br>М | C<br>Y | Y<br>M<br>]<br>[esi] | Y | ] |
|-----------------------------|--------------|--------------------------------|----------------------------|--------|--------|----------------------|---|---|
|                             | -            |                                |                            |        | -      |                      |   |   |

Code example 13: Writing the Yellow component

<sup>&</sup>lt;sup>1</sup> Static variables are aligned automatically by the compiler to the appropriate boundary. For more information on data alignment see page 15.

| 3.6.3.7             | Border pixels                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |  |  |  |
|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|
| White borders       | The C++ implementation processes all pixels, since all of them contain valuable information. The MMX version of the processing path however, has an eight pixel wide, white border <sup>1</sup> on both edges (see section 3.3). Since a white pixel will result in a CMYK value with all components zero, these pixels do not have to be interpolated. Because most of the time required to convert a pixel is needed for the interpolation, it might be a good idea not to interpolate these border pixels. |  |  |  |
| Borders are 1%      | Let us take a look at the amount of pixels skipped this way. An A4 image scanned at 200 dpi has a width of 1580 pixels. After adding two, eight pixel wide borders, the image width is 1596 pixels. The 16 pixels only make one percent of this <sup>2</sup> .                                                                                                                                                                                                                                                |  |  |  |
|                     | The performance gain possible when not interpolating these pixels will be less than one percent because the memory accesses required to read and write the <i>pixels</i> are still necessary <sup>3</sup> .                                                                                                                                                                                                                                                                                                   |  |  |  |
| Additional overhead | <ul> <li>The overhead caused by the added complexity of the loop structure will be:</li> <li>Two loop counter registers are required instead of one. This makes the usage of the loop counter to address the pixels much more complex (and therefore costs time).</li> <li>One or more jumps will be required. Since these jumps will be hard to predict for the Pentium's branch prediction they will consume many cycles.</li> </ul>                                                                        |  |  |  |
|                     | Because this will probably cost more cycles than the performance gain possible, it has been decided not to skip the border pixels.                                                                                                                                                                                                                                                                                                                                                                            |  |  |  |
| 3.6.3.8             | The final MMX implementation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |  |  |  |
|                     | Code example 14 shows the final structure of the RGB to CMYK color conversion in pseudocode.                                                                                                                                                                                                                                                                                                                                                                                                                  |  |  |  |
|                     | The functions, such as "DuplicateWeight()", represent a number<br>of actions. For the sake of clarity they are represented through a<br>function call. In the implementation however they are<br>implemented in-line with a number of statements.                                                                                                                                                                                                                                                             |  |  |  |

<sup>&</sup>lt;sup>1</sup> The border added to extend the image width to a multiple of eight is not always white, and can therefore not be skipped. <sup>2</sup> 16/1596\*100 = 1,0% <sup>3</sup> After the conversion has been implemented it turned out that 91% of the processing time is

needed for the interpolation (and preparations for it; see page 68). From this it can be concluded that skipping the border pixels can maximally result in a performance gain of 0.9%.

Static PIXEL \*ArrayR, ArrayG, ArrayB, ArrayK // Ptrs to arrays esi=Width\*Height Do While (esi >= 0) // Alpha and beta weights mm0=CalcWeights(ArrayR[esi],ArrayG[esi]) eax=CalcPosOfPoint0(ArrayR[esi],ArrayG[esi], ArrayB[esi], LTable) mm2=mm0DuplicateWeight(mm2,W0) mm6=ReadPoint(eax,0) // Point O // Interm. result 1 in mm6 // Point 4 mm6=Pmul(mm6,mm2) mm1=ReadPoint(eax,4) mm1=Pmul(mm1,mm2) // Interm. result 2 in mm1 mm2=mm0DuplicateWeight(mm2,W1) mm5=ReadPoint(eax,1) // Point 1 mm5=Pmul(mm5,mm2) // Add to result 1 mm6=mm6+mm5 mm3=ReadPoint(eax,5) // Point 5 mm3=Pmul(mm3,mm2) // Add to result 2 mm1=mm1+mm3 mm2=mm0DuplicateWeight(mm2,W2) // Point 2 mm5=ReadPoint(eax,2) mm5=Pmul(mm5,mm2) mm6=mm6+mm5 // Add to result 1 // Point 6 mm3=ReadPoint(eax,6) mm3=Pmul(mm3,mm2) // Add to result 2 mm1 = mm1 + mm3DuplicateWeight(mm0,W3) mm5=ReadPoint(eax,3) // Point 3 mm5=Pmul(mm5,mm0) // Add to result 1 mm6=mm6+mm5 mm3=ReadPoint(eax,7) // Point 7 mm3=Pmul(mm3,mm0) mm1=mm1+mm3 // Add to result 2 ebx=CalcGamma(ArrayB[esi] mm5=Duplicate(ebx) mml=ShiftRightArithmetic(mm1,3) mm1=mm1\*mm5 // Result 2 \* gamma mm5=8-mm5mm6=ShiftRightArithmetic(mm6,3) // Result 1 \* (8-gamma) mm6=mm6\*mm5 mml=ShiftRightArithmetic(mm1,1) mm6=ShiftRightArithmetic(mm6,1) // Add interm. results mm1 = mm1 + mm6mml=ShiftRightArithmetic(mm1,5) WritePixels(mm1, ArrayR[esi], ArrayG[esi], ArrayB[esi] ArrayK[esi]) esi--// Next pixel 1

Code example 14: MMX version of the conversion

3.6.3.9

40% faster by pairing

### Filter performance

After the code has been paired the performance is compared to the performance of the unpaired version. Table 9 shows that a performance improvement of about 40% is achieved. Notice that this is more than for the kernel operations (31%). This is primarily caused by the fact that the interpolation requires many register-register operations. Code containing many register-register operations will benefit more from pairing than code with many register-memory operations, because two operations that access memory cannot pair.

| Image size | Cycle          | Improvement    |      |
|------------|----------------|----------------|------|
|            | unpaired       | paired         |      |
| (pixels)   | (cycles/pixel) | (cycles/pixel) | (%)  |
| 290x509    | 208            | 125            | 39,9 |
| 1580x2176  | 201            | 123            | 38,8 |

Table 9: performance gain by pairing

Table 10 lists the performance of the C++ and the paired MMX version of the RGB to CMYK conversion. The two images of 1580x2176 pixels are A4 images scanned at 200 dpi.

MMX 5x faster than C++ For those images the MMX implementation of the conversion is 5.2 times faster than the C++ version. The performance gain is not only reached by processing four data elements in parallel, but also by using a much more efficient multiplier<sup>1</sup>.

| Image size    | Cycle                   | Improv         | vement |          |
|---------------|-------------------------|----------------|--------|----------|
|               | C++ version MMX version |                |        |          |
| (pixels)      | (cycles/pixel)          | (cycles/pixel) | (%)    | (factor) |
| 87x16         | 628                     | 130            | 79,3   | 4,8      |
| 63x96         | 639                     | 143            | 77,6   | 4,5      |
| 127x96        | 642                     | 133            | 79,3   | 4,8      |
| 255x320       | 640                     | 125            | 80,5   | 5,1      |
| 290x509       | 641                     | 125            | 80,5   | 5,1      |
| 1580x32       | 639                     | 118            | 81,5   | 5,4      |
| 1580x2176 (1) | 646                     | 123            | 81,0   | 5,3      |
| 1580x2176 (2) | 646                     | 124            | 80,8   | 5,2      |

Table 10: Performance of the RGB to CMYK conversion

## 3.6.3.10 Categorising the processing time

Figure 56 shows the processing time of the paired version divided into the categories discussed in the previous sections. The cycle times are counted as described in section 3.4.4.8. Notice that
50% for weighed addition
Only 8% to write pixels
Notice also that only eight percent of the processing time is used to write the pixels, although the unpacking operations require quite an amount of computations. This is primarily caused by the fact that the memory locations the pixels are written to are still loaded into the level one cache (the source pixels are read from the same location).

<sup>&</sup>lt;sup>1</sup> The conventional multiplier takes 10 cycles to complete. The MMX multiplier takes three but an average throughput of one per cycle can be reached by pipelining the multiplies.



<sup>&</sup>lt;sup>1</sup> For a white pixels all the CMYK components will have to be made zero.

zeroes position and writing the calculated pixels to the arrays is done in conventional assembly code.

Figure 57 shows the processing time needed to weighed add two angular points (point five and one). This represents the "Weighed add points" section in Figure 56.

Notice that duplicating the weights costs 41% of the processing Duplicate weight 41% time required to weighed add a point. With this amount of time in mind it might be a good idea to duplicate the weight by writing the individual value four times to memory, as discussed on page Duplicate via memory 61. If duplicating the weights would be entirely omitted the weighed addition of an angular points would be 41% faster. Since the weighed addition of the angular points takes 49% of the total processing time (see Figure 56), this would improve the overall performance with 20%<sup>(1)</sup>. When duplicating the weights via the memory, the calculated weights first have to be unpacked<sup>2</sup>. After this, they have to be moved four times to the memory. Obviously this requires some processing time, but remember that the memory locations were the weights are stored will generally be loaded into the level one cache. Therefore it is reasonable to expect a performance gain

of about ten percent.

```
About 10% faster
```



Figure 57: Categorised processing time of the weighed add section

Figure 58 shows the processing time categorised according to the instruction type. The amount of time required for arithmetic operations is surprisingly low; only 17%. This is, among others, caused by the fact that the multiplies are pipelined and interleaved with other instructions.

The reason the number of shift operations is relatively high (20%) is that duplicating the weights, as well as (un)packing, requires many shift operations.

20% shift

 $<sup>^{1}</sup>$  0.41x0.49 = 0.20

<sup>&</sup>lt;sup>2</sup> The package format is shown in Figure 48.



Figure 58: Processing time per instruction type

| 3.7                  | The halftoning algorithm                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 3.7.1                | The algorithm                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| Decrease color depth | Halftoning is used in every printing device to decrease the number of colors in the image to a number of colors printable by the print engine. This is necessary because a single-color pixel printed by a print engine can only have two values, on ('1') or off ('0') <sup>1</sup> . Since print engines normally use four base colors (CMYK <sup>2</sup> ), a total of nine <sup>3</sup> different colors can be printed.                              |
| CMYK color planes    | The pixels in an image are normally represented with 3*8 bits in the RGB color representation, allowing 16 million <sup>4</sup> colors. Before applying the halftoning algorithm, the RGB image is converted to four separate CMYK color planes, after which the individual planes are halftoned.                                                                                                                                                         |
| Bitwise pixels       | Note that because the halftoned pixels in a single color plane<br>can only have two values a single bit is sufficient for each pixel.<br>Because the pixels in the source image -with separated color<br>planes- are represented with eight bits, generally the halftoning<br>generates pixels which are either 0 (all bits '0') or 255 (all bits<br>'1'), after which only one bit of the calculated pixel value is<br>written to the destination image. |

<sup>&</sup>lt;sup>1</sup> Analog print engines can print gray scale pixels but the image processing discussed in this report is designed for digital print engines. <sup>2</sup> Cyan, Magenta, Yellow and Black. See section 2.1.2.2 for more information on color

representation. <sup>3</sup> Cyan, Magenta and Yellow allow  $2^3$ =8 different colors. Black is only used to print black pixels, resulting in a total of nine different colors. <sup>4</sup>  $2^{(8+8+8)} = 16,777,216$ 

Variations

Halftoning algorithms with varying image quality and algorithm complexity exist. In the following sections some of them are shortly discussed. To compare the quality of the processed images<sup>1</sup>, a (256 gray scale) test image called 'Lena' is printed<sup>2</sup> in Figure 59 at 28% of its original size. This image is used to compare the image quality of the various halftoning algorithms.



Figure 59: Original 'Lena'

# 3.7.1.1 Thresholding

Fixed threshold

Thresholding is the most basic form of halftoning. Each pixel in each of the separated color planes is compared to a fixed threshold; if the pixel's value exceeds the threshold the pixel is rounded to 255 (=on='1'), otherwise it is made 0 (=off='0').

In Figure 60 the thresholded 'Lena' is printed. A threshold of 128 has been used.



Figure 60: Thresholded 'Lena'

Figure 60 shows that thresholding gives reasonable quality for black text, but unacceptable quality for pictures. For this reason, thresholding is not often used.

# 3.7.1.2 Dithering

Dither matrix

Dithering is a halftoning variant where the threshold depends on the pixel's position. The threshold is read from the dither matrix, which is layed-over the image like shown in Figure 61.

<sup>&</sup>lt;sup>1</sup> Because 24-bit color images are separated into four color planes, the individual planes can be thought to be 256 gray scales images.

<sup>&</sup>lt;sup>2</sup> Because the printer this report is printed on also halftones the image, this is not the exact original. The image quality however is sufficient to compare the halftone algorithms.





In Figure 62 an 8x8 dither matrix is shown.

|   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
|---|-----|-----|-----|-----|-----|-----|-----|-----|
| 1 | 251 | 235 | 187 | 155 | 123 | 91  | 59  | 51  |
| 2 | 243 | 227 | 179 | 135 | 115 | 83  | 43  | 35  |
| 3 | 219 | 211 | 171 | 115 | 107 | 75  | 27  | 11  |
| 4 | 203 | 195 | 163 | 100 | 99  | 67  | 19  | 3   |
| 5 | 123 | 91  | 59  | 51  | 251 | 235 | 187 | 155 |
| 6 | 115 | 83  | 43  | 35  | 243 | 227 | 179 | 135 |
| 7 | 107 | 75  | 27  | 11  | 219 | 211 | 171 | 115 |
| 8 | 99  | 67  | 19  | 3   | 203 | 195 | 163 | 100 |

Figure 62: An 8x8 dither matrix

In Figure 63 the dithered 'Lena' is shown.



Figure 63: Dithered 'Lena'

For photos dithering gives acceptable results. When dithering text however, the edges of the letters are frayed as shown in Figure 64. Because this is very disturbing when reading the text, pure dithering is not often used for images containing text.

es An

Figure 64: Dithered text

3.7.1.3

# **Error diffusion**

Error diffusion uses a fixed threshold, similar to thresholding. The difference is that with error diffusion the error made when rounding a pixel is passed to the neighbouring pixels. The error is calculated by subtracting the new pixel value from the original pixel value. When calculating the black pixel in Figure 65

Error passing the error is added to four of the neighbouring pixels with the factors shown. The gray pixels are processed previously, the black pixel is the one currently processed.

|      |      | 7/16 |  |
|------|------|------|--|
| 3/16 | 5/16 | 1/16 |  |
|      |      |      |  |

Figure 65: Floyd-Steinberg error passing

"Floyd-Steinberg" This technique of passing the error called the "Floyd-Steinberg" algorithm. In Figure 66 the error diffused 'Lena' is shown.



Figure 66: Error diffused 'Lena'

'Worms' Figure 67 shows an error diffused gray plane (real size). Notice the little 'worms'. These are created when applying an error passing algorithm to very light or very dark planes.

| メリメリア ひとう アンリンパー                         |
|------------------------------------------|
| ちょうしゅう さつのし                              |
| アン・センドイン パント・シアイン                        |
| 20492332323677636233                     |
| - 1993 MAR & MART & 1997                 |
| 二分子 しゃくえん やさくちんくろう                       |
| ちん かんししょう かいがい かくせいき                     |
| くうえ そうもうアントメット どうさくら                     |
| こうえい シント・レイ・ノント                          |
| 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1 |
|                                          |

Figure 67: Error diffused gray plane

Good quality Because error diffusion produces good results for text and for (most) photos, this is a much used halftoning variant. A disadvantage of error diffusion are the 'worms' that sometimes are created in images containing large areas of the same color.

Other variants

3.7.1.4

In the previous sections the most common halftoning algorithms are discussed. Because most algorithms produce optimal results only for certain image types (text or photos), mixtures of the discussed variants are used. These mixtures commonly use dithering for photos and error diffusion for text areas.

It is expected that when implementing an algorithm, passing the error to the neighbouring pixels will cause most problems. For this reason it has been chosen to implement the error diffusion algorithm. Based on the problems encountered there, an indication of the difficulties expected when implementing other algorithms can be given.

An implementation of an error dithering algorithm with MMX or C++ will only differ from error diffusion in the way the threshold is determined. A mixture of error diffusion and dithering will probably be more complex because switching between the two algorithm types is difficult to implement.

#### 3.7.2 C++ implementation

#### 3.7.2.1 Traversing the image array

Similar to the algorithms previously implemented, the first question to ask when implementing the error diffusion algorithm is how to traverse the image array; horizontally, vertically or diagonally.

When taking a closer look at the algorithm it is noticed that it is not possible to re-use intermediate results as with the smooth and the sharpen algorithm. Only the error passing algorithm might prefer some way of traversing the image.

Figure 68 shows the way the error is passed to the four neighbour pixels; the black pixel is the one currently processed.

|   |   | 7 |  |
|---|---|---|--|
| 3 | 5 | 1 |  |
|   |   |   |  |

Figure 68: Floyd-Steinberg error passing

From Figure 68 it can be concluded that two ways of traversing the image array are possible; horizontally from left to right or diagonally from the left top to the right bottom. In both situations one of the error values has to be passed to the next pixel processed. The other error values are needed when processing the next row or (diagonal) column<sup>1</sup>.

Horizontal traversing In section 3.4.2.1 it was concluded that optimal caching can be achieved when traversing the image horizontally. Additionally, a loop that traverses an image diagonally, is more difficult to implement than a loop that traverses an image horizontally<sup>2</sup>. Therefore it is decided to process the image array horizontally.

3.7.2.2

#### The basic inner loop structure

Basically error diffusing one pixel consists of the following steps:

- 1. Receive the error from other pixels
- 2. Determine the new pixel value
- 3. Pass the error to other pixels

Step one and three are described in section 3.7.2.3.

<sup>&</sup>lt;sup>1</sup> Lacking a correct name diagonal column is used.

<sup>&</sup>lt;sup>2</sup> Vertical image traversing would be possible when using other Floyd-Steinberg error passing algorithms. The variant shown in Figure 68 has been chosen because it is most often used. Besides this, vertical image traversing does not offer any advantages over horizontal image traversing (for as far as the implementation is concerned).

Determining the new pixel value is not complex; after the error passed from other pixels is added, the pixel is compared to the threshold. If the pixel is smaller the new pixel value is zero, otherwise it is 255. After the new pixel value has been determined the error can be determined by subtracting the new pixel value from the old one.

#### 3.7.2.3 Passing the error

When processing the image horizontally, the error weighed 7/16 has to be added to the next pixel processed. The other three error values have to be stored in an array for later usage.

Since the image is traversed horizontally, the array where the error values are stored must have enough positions to store errors for an entire row of the image. Figure 69 shows how this array is filled.

| Image |   |   |   |  |
|-------|---|---|---|--|
|       |   |   |   |  |
|       | а | b | С |  |
| •     |   | - |   |  |

#### Error passing array

| 3a/16 | 5a/16+ | a/16+  | 5c/16+ | c/16+ |  |
|-------|--------|--------|--------|-------|--|
|       | 3b/16  | 5b/16+ | 6/16+  |       |  |
|       |        | 3c/16  |        |       |  |

Figure 69: Using an error passing array

In Figure 69 the way the errors have to be passed from the three dark-grayed pixels to the light grayed pixels is shown. The black pixel, which is processed the next loop cycle, receives an error from pixel c. This error value can be stored in a local variable, the light gray errors have to be stored in the error passing array. As shown in Figure 69 the middle of the light gray pixels receives

Receive three errors three errors passed from above. Only edge pixels receive less error values, all the other pixels in an image receive three error values from above (and one from the right).

It has been chosen to store the errors which have to be passed to the lower row in the array. It is also possible to store the rounding errors that have been made in the array and calculate the error to pass to the lower pixels later. This way, passing the error requires less calculations to store the error values but more to calculate the error to add to a pixel. For the entire image the total number of calculations is the same. Both implementations benefit from the caches in the same way, so any one of them can be chosen. As mentioned before it has been chosen to use the error passing array in the way shown in Figure 69.

When storing the error values in the array, care has to be taken not to overwrite values needed later.

Array

|                       | Image<br>01 02 03 04 05 06 07 08 09 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                       | 11       12       13       14       15       16       17       18       19       20         21       22       23       24       25       26       27       28       29       30                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|                       | Error passing array<br><b>a b c</b><br>Figure 70: Error passing via array                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| Overwrite next error  | Suppose (in Figure 70) pixel 15 is currently processed. After the error has been determined, the weighed error has to be saved to the error passing array in order to pass it to the pixels 24, 25 and 26. Note that the positions a and b in the array contain errors passed previously from the pixels 13 and 14. The error made with pixel 15 has to be added to those errors. Position c of the error passing array contains the error that has to be passed to pixel 16 in the next loop cycle. Unfortunately this position is also needed for the error passed from pixel 15 to 26. This problem has been solved by reading position c from the error pass to pixel 26 can be written to the array.                                                                                                                                                                                                                       |
| Negative error values | Now the basic way of passing the error is designed the data type<br>for the positions in the error passing array have to be chosen.<br>The pixels in the image are written as unsigned bytes, so it<br>would be obvious to create an error passing array of this type.<br>Because negative error values are also possible a sign bit is<br>required. Since it is not possible to multiply the unsigned bytes<br>read from the source image with the signed bytes, both the<br>pixels and the error values have to be extended to signed words.<br>An additional advantage of this extension is that rounding errors<br>are avoided, because the division by sixteen can be delayed<br>until the addition of the error value and the pixel value read from<br>the image array is performed. A disadvantage of the extension is<br>that instead of processing eight pixels per loop cycle, now only<br>four pixels are processed. |
|                       | Because the top row of the image does not receive error values<br>from above, the error passing array has to be cleared before the<br>actual filter loop is started. This is necessary because C++ does<br>not clear storage space when allocated by an application.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| 3.7.2.4               | Border pixels                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|                       | The Floyd-Steinberg error passing shown in Figure 68 requires special attention for border pixels, since either the left or the right neighbour pixel is missing.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| Skip borders          | To avoid problems, a special error passing algorithm can be<br>implemented for border pixels, or the border pixels can be<br>skipped entirely when processing the image. Since applying<br>different error passing mechanisms to border pixels requires four<br>additional error passing versions, it has been decided not to<br>process border pixels.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| 3.7.2.5               | Writing the resulting pixels                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                       | Error diffusion produces pixels which are either zero or 255.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |

Their values can therefore be represented with one bit.

Bitwise

Compared to the eight bit (= one byte) pixels read from the input file the one bit data requires much less memory bandwidth. For this reason it is

common practice to write the halftoned pixels as bits<sup>1</sup> instead of bytes.

Collecting bitsSince bitwise addressing is not possible, eight bits have to be<br/>collected, after which they are written as a one byte to the<br/>destination array.If it was chosen not to traverse the image horizontally in section<br/>3.7.2.1 this way of storing the bits would re-order the pixels.<br/>If the image would be traversed vertically for example, the<br/>resulting bits (which are collected into one byte) are<br/>automatically written horizontally. Thus the pixels would be<br/>rotated 90° !

Collecting the bits into a byte is implemented as shown in Code example 15.



Code example 15: Collecting bits into a byte

Skipping border pixels Caution has to be taken when skipping the border pixels. When storing the pixels as bytes a single byte can be skipped by simply incrementing the index to the array by one. However when writing the pixels as bits it is not sufficient to increment the BitCounter. The previously packed bits have to be shifted one bit to the left too. Additionally, if BitCounter has reached the value eight, the byte containing the packed bits has to be stored. Since a pixel value only depends on the value of one source pixel and the error passed from other pixels, it is possible to overwrite the source pixels with the calculated values<sup>2</sup>. Because the Overwrite source pixels memory location is loaded into the cache when reading the source pixel this could reduce the time necessary to write the pixel. However since the resulting pixels are written as bits, the calculated pixels require eight times less storage space than the original pixels. This causes the pointer to the currently processed

<sup>&</sup>lt;sup>1</sup> Obviously, when halftoning to more than one level more bits are required. Halftoning to, for example four levels produces pixels which can be represented with two bits.

<sup>&</sup>lt;sup>2</sup> For the smooth and the sharpen algorithm this is not possible because a source pixel is input for nine pixel calculations.

3.7.2.6

source pixel to 'run away' from the pointer to the storage space for the calculated value, eliminating the caching advantage. Therefore it has been decided to store the calculated values in a separate array.

#### The final C++ implementation

The complete C++ implementation of the error diffusion algorithm is shown in Code example 16 (in pseudocode).

```
Clear(ErrorArray)
BitCounter = 0
PackedBits = 0
RowCounter = 1
Do While (RowCounter < Height-1)
       SkipLeftPixel()
       ColCounter = 1
       NextError = ErrorArray[ColCounter] // Error from above
       Do While (ColCounter < Width-1)
             {
                                              // Get input value
              Old_Pixel = Src.Pixels[RowCounter,ColCounter]
              Old_Pixel = Pixel + (NextError/16)
                                              // Get new pixel
              If (Old_Pixel < Treshold) Then</pre>
                    New_Pixel=0
              Else
                    New_Pixel=255
              PackBits()
              If (BitCounter==8) Then StoreBits()
                                             // Error passing
              Error=Old_Pixel-New_Pixel
              NextError= (7*Error) + ErrorArray[ColCounter+1]
              ErrorArray[ColCounter+1]=1*Error
              ErrorArray[ColCounter] +=5*Error
              ErrorArray[ColCounter] +=3*Error
              ColCounter++
       SkipRightPixel()
       RowCounter ++
```

Code example 16: The final C++ implementation

| Nested loops | It has been chosen to implement two nested loops because the border pixels require special attention. The functions SkipLeftPixel() and SkipRightPixel() represent the actions necessary to skip the border pixels. The PackBits() and StoreBits() functions <sup>1</sup> pack the bits into a byte respectively stores the collected bits. All four functions are described in section 3.7.2.5. |
|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 3.7.3        | MMX implementation                                                                                                                                                                                                                                                                                                                                                                               |
| 3.7.3.1      | Parallelability of the algorithm                                                                                                                                                                                                                                                                                                                                                                 |
|              | The parallelability of the error diffusion algorithm is limited by<br>the way the error is passed to the neighbour pixels.<br>In Figure 71 the way the error is spread across the neighbours is<br>shown.                                                                                                                                                                                        |

<sup>&</sup>lt;sup>1</sup> In the C++ implementation these sub-functions are not used because of the extra overhead required; all the code is placed in one function.

7

|                       | 3 5 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                       | Figure 71: Floyd-Steinberg error passing                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Pixel dependencies    | Obviously, pixels that receive an error from other pixels can not<br>be processed before the error is known. These dependencies<br>highly limit the possibilities for parallel pixel processing. In<br>Figure 72 the dependencies are shown. The pixel currently<br>processed is black. This pixel receives errors from the light gray<br>pixels. The dark gray pixels are influenced by the error passed<br>from the black pixel.<br>01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18                                                                                                                                                                                                             |
|                       | 21       22       23       24       25       26       27       28       29       30       31       32       33       34       35       36       37       38         41       42       43       44       45       46       47       48       49       50       51       52       53       54       55       56       57       58         61       62       63       64       65       66       67       68       69       70       71       72       73       74       75       76       77       78         Figure 72: Pixel dependencies                                                                                                                                                               |
|                       | From Figure 72 it can be concluded that it is not possible to<br>process a number of pixels placed next to each other, as with for<br>example the smooth algorithm. Worse, it seems to be<br>impossible to process pixels in parallel. However, there are two<br>ways to process pixels in parallel; skewing and tiling. In the next<br>two paragraphs these two alternatives will be compared.                                                                                                                                                                                                                                                                                                         |
| Horizontal traversing | Both ways can be implemented with horizontal as well as diagonal image traversing. Since diagonal image traversing requires a more complex loop structure, and gives no advantage over horizontal traversing <sup>1</sup> , it has been chosen to traverse the image horizontally.                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                       | Skewing                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| Independent pixels    | Basically skewing avoids the dependencies between pixels by processing independent pixels. Figure 73 shows the dependencies for a pixel in the middle of the image.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|                       | 01       02       03       04       05       06       07       08       09       10       11       12       13       14       15       16       17       18         21       22       23       24       25       26       27       28       29       30       31       32       33       34       35       36       37       38         41       42       43       44       45       46       47       48       49       50       51       52       53       54       55       56       57       58         61       62       63       64       65       66       67       68       69       70       71       72       73       74       75       76       77       78         Figure 73: Dependencies |
|                       | The dependencies are limited to the areas to the upper left and lower right of the pixel processed. The pixels in the light gray area pass an error <i>to</i> the pixel, the pixels in the dark gray area receive an error <i>from</i> the pixel.                                                                                                                                                                                                                                                                                                                                                                                                                                                       |

From Figure 73 it can be concluded that parallel pixel processing is possible for pixel 12, 30 48 and so on. This is shown in Figure 74.

<sup>&</sup>lt;sup>1</sup> See section 3.7.2.1.

| 01  | 02  | 03  | 04  | 05  | 06  | 07  | 08  | 09  | 10  | 11  | 12  | 13  | 14  |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 21  | 22  | 23  | 24  | 25  | 26  | 27  | 28  | 29  | 30  | 31  | 32  | 33  | 34  |
| 41  | 42  | 43  | 44  | 45  | 46  | 47  | 48  | 49  | 50  | 51  | 52  | 53  | 54  |
| 61  | 62  | 63  | 64  | 65  | 66  | 67  | 68  | 69  | 70  | 71  | 72  | 73  | 74  |
| 81  | 82  | 83  | 84  | 85  | 86  | 87  | 88  | 89  | 90  | 91  | 92  | 93  | 94  |
| 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 |
| 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 |

Figure 74: Skewed processing

A disadvantage of skewed processing is that the pixels to be processed can not be read into an MMX register with one memory read, as done for example with the smooth algorithm. The pixels have to be read one at a time, after which they can be packed manually into an MMX register to allow parallel processing.

Figure 75 shows the required error passing when processing a number of pixels skewed. The black pixels are the pixels currently processed. The resulting error values a, b and c have to be passed as shown.

|       |       |                 |       |                 | а     | 7a/16 |
|-------|-------|-----------------|-------|-----------------|-------|-------|
|       |       |                 |       | 3a/16+<br>7b/16 | 5a/16 | a/16  |
|       | С     | 3b/16+<br>7c/16 | 5b/16 | b/16            |       |       |
| 3c/16 | 5c/16 | c/16            |       |                 |       |       |

Figure 75: Error passing with skewing

Notice that, when traversing the image array horizontally from left to right, some of the error values are required for the next pixel processed (dark gray pixels). The errors passed to the light gray pixels are required in the two next loop cycles. These three groups of error values can be stored in register, instead of an array to improve performance. The errors passed to the lower row cannot, because they have to be stored until the next row is processed.

The array required to store these error values can be implemented in a similar way as the error passing array of the C++ version of the error diffusion. In section 3.7.2.3 it was concluded that it was necessary for the C++ implementation to use signed words for the error passing array, and to extend the pixels from unsigned bytes to signed words.

For a skewed MMX implementation this is also required. This implies that each loop cycle only four pixels can be processed.

Now let us take a look at the number of memory accesses required to pass the error values when processing four pixels in a skewed implementation (see Figure 76). Assuming that the error passing array contains the error value received by a pixel (instead of the error value to pass from a pixel<sup>1</sup>), one memory read is required to receive an error from above to add to pixel a.

Manual packing

<sup>&</sup>lt;sup>1</sup> In section 3.7.2.3 it was concluded that both ways of storing the errors require the same amount of computations.

One memory operation

The error passed from pixel d to pixel a of the lower row has to be stored with three memory operations; two memory adds, one memory move. Thus, four memory operations are required to process four pixels (one per pixel).



Figure 76: Error passing via array

A disadvantage of skewing is that seven of the border pixels can not be processed in the same way the other pixels are, as can be seen in Figure 77.



Figure 77: Skewed edge pixels

Six pixels have to be skipped because of the sloped processing. The leftmost pixel has to be skipped because the error passed from that pixel to another cannot be passed since there is no left neighbour. This is an inevitable side effect of skewing that can be avoided by temporarily adding two, seven pixels wide borders to the left and right of the image. This causes an extra overhead of

0.9%<sup>1</sup> for A4 image scanned at 200 dpi.

#### Tiling

Another method of processing pixels in parallel is tiling. This methods divides an image into a number of overlapping<sup>2</sup> tiles, after which each loop cycle one pixel of each tile is processed. In Figure 78 for example the first iteration of the loop processes the four light gray pixels. The next processes the dark gray pixels and so on.

Figure 78: Tiling

Dividing the image into tiles is only allowed when the tiles do not influence each other. For error diffusion this is not true, but by creating an overlap of minimal 32 pixels<sup>3</sup> an acceptable result can be obtained. When dividing an A4 image scanned at 200 dpi

Overhead of 0.9%

Overlapping tiles

 $<sup>^{1}(2*7)/(2*7+1580)*100\% = 0.9\%</sup>$ <sup>2</sup> To eliminate the influence of one tile on another.

<sup>&</sup>lt;sup>3</sup> Source: "Image processing with SIMD architectures" [10].

Four pixels/loop cycle

Four memory accesses

7.5% overhead (1580 pixels wide) into four tiles, each 395 pixels wide, 7.5%<sup>(1)</sup> redundant pixels have to be processed.

Similar to a skewed implementation, the negative error values that have to be passed necessitate the extension of the pixels from unsigned bytes to signed words. This automatically reduces the number of pixels to process in parallel to four.

Now lets take a look at the way the error has to be passed. Since the error passing mechanism is identical for all four tiles, the error passing for only one tile is shown in Figure 79 (The black pixel is the one currently processed).



Figure 79: Error passing with tiling

Since the image is traversed from left to right the error to pass to the gray pixel can be stored in a register. The error received from above and the errors to pass to the lower neighbours have to be read/stored in an array. Thus, four memory accesses are necessary to pass the error for one pixel. If the image is divided into four tiles, a total of sixteen memory accesses is required to pass the errors of four pixels.

## Skewing or tiling ?

Both for skewing and for tiling manual packing of the pixels processed is required. An advantage of tiling compared to skewing is that manually packing the pixels is less complex than with skewing.

A tiled implementation processes 7.5% redundant pixels, compared to 0.9% for a skewed implementation. Clearly this is a considerable advantage of a skewed implementation.

The error passing mechanism of a tiled implementation requires sixteen memory accesses to process four pixels, compared to four for the skewed implementation. By implementing the error passing of the tiled implementation in a smart way the number of memory accesses can be reduced, but the required memory bandwidth still remains four times the value of the skewed implementation.

Skewed Therefore it has been chosen to implement the error diffusion algorithm skewed.

Besides skewing and tiling it is also possible to process one pixel of each of the four color planes. Since this method does not require the additional code necessary for skewing, performance will probably be better, while the implementation will be much easier.

This method has not been chosen for two reasons: 1. It is not possible for a gray-scale implementation.  This method avoids the specific problems of the error diffusion. Océ wants to know how these problems can be solved.

# 3.7.3.2 Reading the source pixels

Manual packing When processing an image skewed the source pixels have to be packed manually. This requires the four pixels processed to be loaded individually, after which they can be packed into an MMX register.

Pixel addressing It seems to be necessary to have four pointers to the four pixels processed. Since only six<sup>1</sup> general purpose registers are available, this is not the most optimal way to locate the pixels. In Figure 80 four skewed pixels are shown, with a pointer set to the upper pixel (19).

|    |    |    |    |    |    |    |    | esi |    |
|----|----|----|----|----|----|----|----|-----|----|
|    |    |    |    |    |    |    |    |     |    |
| 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | ▼   | 10 |
| 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19  | 20 |
| 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29  | 30 |
| 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39  | 40 |
| 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49  | 50 |
| 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59  | 60 |
| -  |    |    |    |    |    |    |    |     |    |

width Figure 80: Pixel addressing

The position of the pixels 27, 35 and 43 in the image array can be calculated by adding the image width minus two to the position of pixel 19. This way all four pixel can be loaded into registers.

MMX memory moves It has been chosen to load the pixels with MMX memory moves, instead of byte-moves for three reasons:

- 1. Loading the pixels requires a number of registers. While implementing the filter, it turned out that all conventional registers where used already<sup>2</sup>.
- 2. To be able to process the pixels the packed pixels have to be loaded into an MMX register.
- Manual packingThe way the pixels are packed is shown in Code example 17.<br/>Register eax contains the image width in pixels, esi the position<br/>of the upper pixel. Because the addressing mode [esi+3\*eax-6]<br/>is not allowed, the image width (eax) has to be added to esi to<br/>read to lower pixel. More additions can be avoided by using the<br/>right the addressing modes.<br/>The comment added to the code assumes the pixels are<br/>numbered as shown in Figure 81.

<sup>&</sup>lt;sup>1</sup> eax, ebx, ecx, edx, esi and edi.

<sup>&</sup>lt;sup>2</sup> See section 3.7.3.4.

| sub   | esi, | eax         | ;Restore original value of esi |
|-------|------|-------------|--------------------------------|
| por   | mm0, | mm⊥         | ;mm0=[00 a 00 b 00 c 00 d]     |
| pand  | mm1, |             |                                |
| -     |      |             | ;mm1=[0000 0000 0000 00 d]     |
| movq  | ,    |             | Read pixel d                   |
| por   | mm0, |             | mm0 = [00 a 00 b 00 c 0000]    |
| psllq | mm1, |             | mm1 = [0000 0000 00 c 0000]    |
| pand  | mm1, |             | mm1 = [0000 0000 0000 00 c]    |
| movq  |      | [esi-4+eax] | Read pixel c                   |
| por   |      | mm1         | imm0 = [00 a 00 b 0000 0000]   |
| psllq | mm1, | 32          | //mm1=[0000 00 b 0000 0000]    |
| pand  | mml, | mm2         | //mm1=[0000 0000 0000 00 b]    |
| movq  | mm1, | [esi-2]     | //Read pixel b                 |
| add   | esi, | eax         |                                |
| psllq | mm0, | 48          | //mm0=[00 a 0000 0000 0000]    |
| pand  | mmO, | mm2         | //mm0=[0000 0000 0000 00 a]    |
| movq  | mmO, | [esi]       | //Read pixel a                 |
|       |      |             | //mm2=[0000 0000 0000 00FF]    |

Code example 17: Packing skewed pixels



Figure 81: Pixel numbering

Notice the large amount of statements required to pack the pixels, where in the smooth filter one statement was sufficient. Obviously, this will consume a lot of processing time.

#### Writing the resulting pixels

After the new pixel values have been calculated the resulting pixel have to be written as bits. The C++ implementation<sup>1</sup> collects the bits into a byte, after which the byte is written to memory.

For the MMX implementation writing the bits this way is not possible. Figure 82 shows an image of  $16^{(2)}$ x6 pixels.

| 8  |    |    |    |       |      |     |       |       |        | 8     | 5      |    |    | •  |    |
|----|----|----|----|-------|------|-----|-------|-------|--------|-------|--------|----|----|----|----|
|    |    |    |    |       |      |     |       | -     |        |       |        | D  |    | D  |    |
| 01 | 02 | 03 | 04 | 05    | 06   | 07  | 08    | 09    | 10     | 11    | 12     | 13 | 14 | 15 | 16 |
| 17 | 18 | 19 | 20 | 21    | 22   | 23  | 24    | 25    | 26     | 27    | 28     | 29 | 30 | 31 | 32 |
| 33 | 34 | 35 | 36 | 37    | 38   | 39  | 40    | 41    | 42     | 43    | 44     | 45 | 46 | 47 | 48 |
| 49 | 50 | 51 | 52 | 53    | 54   | 55  | 56    | 57    | 58     | 59    | 60     | 61 | 62 | 63 | 64 |
| 65 | 66 | 67 | 68 | 69    | 70   | 71  | 72    | 73    | 74     | 75    | 76     | 77 | 78 | 79 | 80 |
| 81 | 82 | 83 | 84 | 85    | 86   | 87  | 88    | 89    | 90     | 91    | 92     | 93 | 94 | 95 | 96 |
|    |    |    |    | Figur | e 82 | ·Wr | itina | eight | t coll | ected | d hits |    |    |    |    |

Figure 82: Writing eight collected bits

After pixel 24 is processed it, along with the grayed pixels, would have to be written to the destination array. Because the pixels are written skewed, the row containing pixel 38 can not be written yet (only six pixels are processed); this write would have to be delayed two loop cycles. The two rows containing the pixels 52 and 66 would have to be delayed four respectively six

3.7.3.3

Write bitwise

<sup>&</sup>lt;sup>1</sup> See section 3.7.2.5.

<sup>&</sup>lt;sup>2</sup> The image width always is a multiple of eight. This is caused by the functions that add a border to the image (see section 3.4.3.5).

loop cycles. It is obvious that writing the pixels this way requires a complex loop structure with a number of storage spaces to temporarily store the bits. An alternative method is to shrink the bytes to bits after and an Shrink four rows entire row (thus four rows of *pixels*) is processed. This is possible because two nested loops are required, one to process the rows, another to process the columns<sup>1</sup>. The inner loop of the error diffusion algorithm can then write the calculated pixels as bytes, after which they are shrinked by a loop that is executed after a row is processed. This method does not require any additional storage space because the shrinked bytes can be written over the source pixels. Since an A4 image scanned at 200 dpi has a width of 1580 pixels, 4x1580=6320 bytes of storage space are required to store the four processed rows as bytes. This means that these rows are still loaded in the level one cache, so shrinking them requires a minimal processing time. The uncomplicated loop structure required to shrink the bytes this way has the advantage that no additional pointer is required; No additional pointer after the four rows are shrinked, the destination pointer of the inner loop is simply set to the next unused position in the array. At the end of the inner loop this pointer points to the last byte of the upper of the four processed row. By subtracting the image width, the pointer can be reset to the beginning of the first of the rows to shrink. Therefore it has been chosen to use the second way discussed of shrinking the bytes to bits. The loop performing the shrinking operation reads eight bytes into an MMX register, shrinks them, and writes one byte to the array. 3.7.3.4 The inner loop structure Code example 18 shows the loop structure resulting from the issues discussed in the previous sections in pseudocode. In the second statement the threshold is duplicated four times. This is Duplicate threshold necessary to enable four pixels to be calculated in parallel.

<sup>&</sup>lt;sup>1</sup> This is necessary because border pixels require special treatment.

| Clear(Error |                                                            |
|-------------|------------------------------------------------------------|
| Threshold=D | uplicateThreshold()                                        |
| RowCounter  | = Height                                                   |
| Do While (R | owCounter > 0)                                             |
| {           |                                                            |
|             | Counter = 7 // Skip 7 pixels                               |
| Do W        | Nhile (ColCounter < Width)                                 |
|             | (<br>mm0=Pack4Bytes(ColCounter)                            |
|             | lillio-pack4byles(corcounter)                              |
|             | mml=GetErrorFromAbove()                                    |
|             | mm3=GetInputPixels(mm0, mm1)                               |
|             | mm2=GetNewPixels(mm1)                                      |
|             | <pre>mml=GetErrors()</pre>                                 |
|             | Write4Bytes(mm2)                                           |
|             | PassErrors(mm1)                                            |
|             | ( )                                                        |
|             | ColCounter++                                               |
| _           | }                                                          |
|             | Counter = 1 // Bytes => bits                               |
| Do W        | Nhile (ColCounter < Width-1)                               |
|             | {<br>Shrink8Bytes()                                        |
|             | }                                                          |
| RowC        | Counter = RowCounter-4                                     |
| }           |                                                            |
| Code exa    | mple 18: Loop structure of the MMX implementation          |
|             |                                                            |
| The nseu    | docode in Code example 18 is quite self explaining.        |
|             | this would add to much complexity, most of the             |
|             |                                                            |
| registers   | used are not mentioned.                                    |
| Notice the  | at bassues RowCounter is degree anted four each loop       |
|             | at because RowCounter is decremented four each loop        |
| •           | ages with a height that is not a multiple of four will not |
| be          |                                                            |
|             | d entirely. Since this only concerns three out of 2176     |
| rows it ha  | s been decided to skip these rows and cut off the          |

Bottom rows

3.7.3.5

Generate

#### Masks

bottom of the image later.

Generally, if a filter requires masks these are generated outside the inner loop. Within the loop the registers they are stored in cannot be used for other purposes.

The error diffusion algorithm however requires a large amount of registers. The two masks required cannot be kept in registers, because then there are not enough registers left. Therefore the masks are generated in the inner loop. This generally requires two instructions per mask.

An alternative method of generating masks is loading them from memory. Because this memory locating is frequently accessed it will generally be loaded from the level one cache, so loading the mask requires only one cycle. It has been chosen to generate the mask because this reduces the memory bandwidth slightly, but for more complicated masks it is also possible to load them from memory. 17% faster by pairing

# 3.7.3.6

#### Filter performance

After the inner loops of the error diffusion implementation have been paired performance is compared to that of the unpaired version. The results are shown in Table 11.

| Image size | Cycle          | Improvement    |      |
|------------|----------------|----------------|------|
|            | unpaired       | paired         |      |
| (pixels)   | (cycles/pixel) | (cycles/pixel) | (%)  |
| 290x509    | 115            | 94             | 18,3 |
| 1580x2176  | 114            | 94             | 17,5 |

Table 11: Performance gain by paring

Notice that the performance gain caused by paired is about 17%, compared to 31% for the sharpen filter<sup>1</sup>. The reason the improvement is less is that the complexity of the error diffusion reduces the possibilities to pair statements. Contrary to the sharpen, where a pairing rate of 100% has been reached, the inner loops of the error diffusion algorithms could not be paired perfectly. A pairing rate of 98% was the maximum feasible.

The performance of the paired MMX version and the C++ version of the error diffusion is shown in Table 12 for several images.

| Image size    | Cycle          | Improvement    |      |          |
|---------------|----------------|----------------|------|----------|
|               | C++ version    | MMX version    |      |          |
| (pixels)      | (cycles/pixel) | (cycles/pixel) | (%)  | (factor) |
| 87x16         | 193            | 95             | 50,8 | 2,0      |
| 63x96         | 190            | 98             | 48,4 | 1,9      |
| 127x96        | 199            | 95             | 52,3 | 2,1      |
| 255x320       | 202            | 92             | 54,5 | 2,2      |
| 290x509       | 203            | 94             | 53,7 | 2,2      |
| 1580x32       | 203            | 96             | 52,7 | 2,1      |
| 1580x2176 (1) | 205            | 94             | 54,1 | 2,2      |
| 1580x2176 (2) | 206            | 94             | 54,4 | 2,2      |

Table 12: Performance of the error diffusion

The lower two images are A4 images scanned at 200 dpi. Notice the small difference in performance for larger and smaller images; this is caused by the fact the image is traversed horizontally. This causes all images to benefit in the same amount of the caches.

MMX 2x faster than C++ In Table 12 it can be seen that the performance of the MMX implementation is a factor 2.2 better than the C++ implementation. Since a part of this performance gain is caused by the fact that assembly code is slightly faster than compiled C++ code, it can be concluded that for error diffusion implementing the algorithm with MMX is still useful. For more complex algorithms<sup>2</sup> however, the performance gain might not be worth the difficulties encountered when implementing a filter with MMX Technology.

<sup>&</sup>lt;sup>1</sup> See section 3.5.3.11.

<sup>&</sup>lt;sup>2</sup> See section 3.7.1.4.

# 3.7.3.7

#### Categorising the processing time of the paired code

33% error passing 49% pack & unpack Figure 83 shows categorised processing time of the error diffusion algorithm<sup>1</sup>. Notice that  $33\%^{(2)}$  of the processing time is spent passing and receiving error values. Another 49% <sup>(3)</sup> of the processing time is used to read, pack, unpack and write the pixels. Just 4% of the processing time is needed to calculate the new pixel values and the rounding errors.



Figure 83: Categorised processing time of the error diffusion

Figure 84 shows the processing time categorised according to the instruction type. Notice that 26% of the processing time is required for logical and shift operations. These operations are needed for the packing and unpacking of the pixels. Since the extension from packed bytes to packed words is performed automatically when manually packing the pixels, no conversion instructions are used.

<sup>&</sup>lt;sup>1</sup> For more information on how the processing time is determined see section 3.4.4.8.

<sup>&</sup>lt;sup>2</sup> 26%+7%

<sup>&</sup>lt;sup>3</sup> 26%+23%



Figure 84: Processing time per instruction type

| 3.8                 | Bits to Bytes conversion                                                                                                                                                                                                                                                                                                                                                                      |
|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 3.8.1               | The algorithm                                                                                                                                                                                                                                                                                                                                                                                 |
| Viewing application | The images created by the error diffusion algorithm discussed in section 3.7 contains pixels which are represented by a single bit. These images can be sent to a print engine, but the image viewing application used to view the processed images cannot read such images. In order to be able to view the images on a PC the pixels represented with bits have to be represented by bytes. |
|                     | A pixel with the bitvalue '1' has to be converted to a byte '255'. A bit '0' has to be converted to a byte '0'. Thus the one bit has to be duplicated to all the bits of the byte.                                                                                                                                                                                                            |
| 3.8.2               | The implementation                                                                                                                                                                                                                                                                                                                                                                            |
|                     | The algorithm described in section 3.8 is implemented with MMX Technology. It has been chosen not to implement two versions                                                                                                                                                                                                                                                                   |
|                     | because this algorithm is not part of a processing path implemented in the image processing path of a digital copier.                                                                                                                                                                                                                                                                         |
| MMX version only    | 0 1 01                                                                                                                                                                                                                                                                                                                                                                                        |

<sup>1</sup> By using 32 bits arithmetic.

|                  | parallel. Thus the MMX implementation can reach twice the speed of a C++ version.                                                                                                                                                                                                                                                                                                                                                                          |
|------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 'Shift' and 'Or' | The image array is traversed by one loop. which processes the<br>entire array. Each loop cycle one byte is read from the array into<br>an MMX register. The bits in the byte are shifted and or-ed in<br>such a way that each bit is duplicated to one byte. Duplicating<br>the bits could be done more efficient with the shift right, which<br>duplicates the most significant bit. Unfortunately this operation<br>cannot be performed on packed bytes. |
|                  | Since the extension from bitwise represented pixels to bytes is<br>not part of a processing path implemented in a copier, and its<br>implementation is fairly straightforward, its implementation is not<br>discussed any further.                                                                                                                                                                                                                         |
| 3.9              | Removing borders                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| 3.9.1            | The algorithm                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                  | In section 3.3 the implementation of the function which adds a borders to the image is discussed.                                                                                                                                                                                                                                                                                                                                                          |
| No extra logic   | In a digital copier (or printer) no logic is needed to remove the borders. The entire image is sent to the print engine, but some of the wires are not connected. Since the test images processed by the demo program should have the same size <sup>1</sup> as the original images, the added borders have to be removed by an additional algorithm.                                                                                                      |
| 3.9.2            | The implementation                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| MMX version only | Since only the MMX version of the processing path adds borders<br>to the images, it is obvious that the function that removes the<br>borders is also implemented with MMX. Besides this, the<br>function that adds the borders is also implemented with MMX.                                                                                                                                                                                               |
|                  | The implementation of the border removing function is almost<br>identical to that of the border adding function. The only<br>difference is that now some pixels are <i>not</i> copied from the<br>source array. With the methods described in the previous<br>sections, this algorithm can easily be implemented. Since the<br>implementation is quite uncomplicated it is further not discussed<br>here.                                                  |
| 3.10             | The total print processing path                                                                                                                                                                                                                                                                                                                                                                                                                            |
| 3.10.1           | 'Warming up' effects                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| Caches           | When a sequence of algorithms is executed, the first algorithm will 'warm up' the caches. This means that data required by the other algorithms is loaded into the caches by the first.                                                                                                                                                                                                                                                                    |
|                  | Because of its limited size, the level one cache will not be<br>'warmed up', but the level two cache will. For large images, such<br>as the scanned A4 image, this will have little effect because<br>these images cannot be stored in the level two cache. When                                                                                                                                                                                           |

<sup>&</sup>lt;sup>1</sup> To enable a pixel-wise comparison between unprocessed and the processed image.

91

processing small images however the image will be loaded into the level two cache during the entire processing path. This is one of the reasons small images are sometimes processed much faster than large images.

In the implemented processing path the 'warming up' will primarily be done by the RGB color plane separation. Because both the C++ version of the processing path and the MMX version benefit in the same way from the 'warming up', it has little influence on the performance comparison between the C++ and the MMX versions. Similar, all the algorithms in the path benefit in the same way, so the performance comparison between the various algorithms is not influenced.

#### 3.10.2 Algorithm performance

pixels/second

200 dpi A4

166 MHz

Figure 85 shows the performance of the C++ and the MMX versions of the implemented algorithms. The chart shows the number of pixels of the 200 dpi A4<sup>1</sup> image processed per second on a 166 MHz CPU.



Figure 85: C++ vs. MMX for 200 dpi A4 (1580x2176) (166 MHz)

Performance

For kernel operations, such as the smooth and sharpen, as well as for the RGB to CMYK conversion the MMX version is a factor four to five faster than the C++ version. For more complex algorithms, such as the error diffusion, the performance of the MMX version is a factor two of that of the C++ version.

<sup>&</sup>lt;sup>1</sup> 1580x2176 = 3,438,080 pixels

| 3.10 | .3 |
|------|----|
|------|----|

RGB to CMYK

#### Improvements of the processing path

#### Tiling

In Table 5 and Table 8 it can be seen that the kernel operations Small images process small images about 50% faster than large ones. By splitting large images into a number of small, overlapping tiles (see section 3.7.3.1) the performance of these algorithms can be improved.

Additionally, all algorithms can be applied to the tile before the next tile is processed. This way the tile can be loaded from the cache instead of the slow main memory.

#### Merging algorithms

Since reading and writing the pixels costs a considerable amount of the processing time, it might be interesting to merge a number of algorithms into one, slightly more complex algorithm.

This is especially interesting for algorithms which are executed right after each other. The RGB color plane separation and the smooth operation for example can quite easily be merged into one algorithm. This algorithm first reads a block of 3x3 color pixels (=3x3x3 bytes). Then it separates the three color planes and applies the smoothing kernel<sup>1</sup>.

#### **Planes parallel**

Manual packing Manual packing The performance of the halftoning algorithm is limited heavily by the manual packing of the pixels. The algorithm now processes four pixels of one color plane simultaneously. An alternative way of processing four data elements in parallel is to process one pixel of each color plane simultaneously. Since the RGB to CMYK conversion also has to unpack the pixels manually, this will result in a performance gain for both the color conversion and the error diffusion.

In Figure 56 it can be seen that eight percent of the processing time of the RGB to CMYK conversion is required to unpack and write the pixels. Figure 83 shows that 26% of the processing time of the error diffusion is needed to read and pack the pixels. By writing the four color planes to one array the processing time required for these two parts of the code can be reduced drastically. In fact this time can be reduced almost to zero.

#### White detection

In section 3.6.3.10 it can be seen that most of the time needed for the RGB to CMYK conversion is used to interpolate the CMYK value. By detecting if a white pixels is processed the performance of the RGB to CMYK conversion can be improved. The actual improvement *heavily* depends on the image processed. When processing an entirely white image the time necessary to process the image can be reduced to one-fourth of the time

<sup>&</sup>lt;sup>1</sup> Obviously the algorithm can be optimised by buffering eight of the separated pixels. This way only one color pixels has to be separated per smoothed pixel.

Scan path

currently needed. For images containing no white pixels the processing time will increase slightly.

#### Adding borders

Finally, when using only the scanning section of the processing path, the performance can be slightly improved by adding less borders to the image.

On page 81 it can be read that the error diffusion algorithm requires additional borders to be added to the source image. This increases the image size of a 200 dpi A4 image with about 0.9%. Since the smooth and sharpen algorithms do not require this additional border their performance can be increased a little bit.

# Architecture developments

The IA32<sup>1</sup> architecture is commonly used in PC's. IA32 processors with MMX are for example:

- Intel Pentium with MMX
- Intel Pentium Pro
- Intel Pentium II with MMX
- AMD K6
- Cyrix 6x86MX
- IDT Winchip C6

In the next sections these processors will shortly be discussed, focusing on their possibilities for image processing. The Pentium with MMX is discussed in section 2.4.

## **Intel Pentium Pro**

The Pentium Pro was intended to be the successor of the Pentium. The major improvements were:

- 1. Dynamic execution.
- 2. Improved superscalar architecture.
- 3. On-chip level two cache.

#### **Dynamic execution**

Dynamic execution is a technique where, instead of executing instructions in the sequence the programmer wrote them, the *CPU* determines the order in which they are executed. This way more units can be used simultaneously. On the Pentium the programmer had to re-order the statements; now the CPU re-orders them automatically<sup>2</sup>.

Micro instructions The processor does not execute assembly instructions but micro instructions. An assembly instruction written by the programmer (or the compiler) is broken down into a number of micro instructions. At a certain moment the processor might execute for example the light gray instructions shown in Figure 86 (the dark gray instruction have been executed previously). Instruction 4a could be executed before the previous instructions were finished (it does not use results produced by them).



Unoptimised code Dynamic execution particularly improves the performance of unoptimised code. But also the combination of the superscalar architecture and dynamic execution has possibilities to improve performance. On the Pentium, two pairing instructions both have to be finished before the next pair is started. If one instruction of

4.1.1

4

4.1

<sup>&</sup>lt;sup>1</sup> IA32 is an acronym for Intel Architecture with 32-bit opcodes.

<sup>&</sup>lt;sup>2</sup> This way of re-ordering the instructions is also known as out-of-order execution.

|                 | the pair takes much more time than the other this results in a low<br>CPU utilisation. In these situations dynamic execution allows one<br>instruction of the next pair the be started. This way, while one<br>pixel is calculated, the data required for the next can be read<br>from the memory.                                                                                                                                                   |
|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Branch penalty  | A major disadvantage of dynamic processors is the high penalty<br>for mispredicted branches (10-20 cycles). To help avoid<br>branches conditional move instructions are available.                                                                                                                                                                                                                                                                   |
|                 | Improved superscalar architecture                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                 | The Pentium can execute two instructions per cycle. The<br>Pentium Pro can execute up to three micro operations <sup>1</sup> per<br>cycle. Besides this, more units have been added to allow more<br>types of operations to be executed in parallel. A memory read<br>now can be paired with a memory write.                                                                                                                                         |
|                 | On-chip level two cache                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                 | On the Pentium Pro's predecessors the level one cache was<br>placed on the same chip as the processor core. The level two<br>cache was placed in separate chips on the motherboard, causing<br>the level two cache to run at a much lower speed than the<br>processor.                                                                                                                                                                               |
|                 | On the Pentium Pro however, a level two cache of 256 or 512 KB is integrated in the CPU, running at the same clock speed as the CPU. This increased the performance significantly because by reducing the distance between the level one level two cache the access time of the level two cache could be reduced.                                                                                                                                    |
| Drop-out rate   | Placing the level two cache on the same chip as the CPU<br>caused the chip size of the Pentium Pro to be much larger than<br>that of the<br>Pentium. Because of the increased chip size the drop out rates<br>during the chip's production reached an all time high. This<br>caused the production costs of the Pentium Pro to be the highest<br>of all Intel chips. For this reason, Intel stopped producing the<br>Pentium Pro in the mid of 1998. |
| 4.1.2           | Intel Pentium II                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| Level two cache | Soon after the introduction of the Pentium Pro its high<br>production costs forced Intel to re-design the chip. This resulted<br>in the Pentium II. One of the differences between the Pentium II<br>and<br>the Pentium Pro is that on the Pentium II the level two cache is                                                                                                                                                                         |
|                 | no longer placed on the same chip as the processor <sup>2</sup> . The level two cache now runs at half of the CPU speed.                                                                                                                                                                                                                                                                                                                             |
|                 | Additionally MMX Technology was added, and the clock rates<br>were increased further. The Pentium II currently is available at<br>clock speeds ranging from 233 to 400 MHz. Intel plans to crank<br>up the clock speed to 500 MHz.                                                                                                                                                                                                                   |

<sup>&</sup>lt;sup>1</sup> Generally an assembly instruction is decoded to one micro instruction. Instructions that use complex addressing modes (MOV EAX, [EPB+32] for example) and Add-With-Carry instructions are decoded into two, pop's into three and push- es into four micro instructions. <sup>2</sup> The level two cache's chips are placed into the same package as the processor.

Intel also plans to increase the level two cache size to up to two MB (code name 'Xeon'). Although this processor (to be released in June 1998) will be quite expensive, this could be an interesting processor for image processing. By splitting the image in half, one color plane<sup>1</sup> can be loaded into the level two cache during the entire processing path, reducing the time required to read and write the pixels.

In the next two paragraphs two recently introduced improvements of the Pentium II will be discussed; the 100 MHz system bus and MMX2.

#### 100 MHz systembus

The systembus is the main data highway in a personal computer system. On early Pentium II's this bus ran at speeds of 50, 66, 75 or 83 MHz, depending on the installed CPU. Currently Intel is introducing Pentium II's which support a bus speed of 100 MHz. Since the systembus traditionally is one of the main bottlenecks in high performance PC's, it is expected that the performance will improve significantly. The 100 MHz is not only supported by Intel, but also by companies that build Intel clones, such as AMD, Cyrix and IDT.

In 1999 Intel plans the introduction of a 200 MHz system bus. This will further improve the overall performance of PC's.

#### **MMX2<sup>(2)</sup>**

| 'Katmai'       | By the end 1998 a new version of the Intel Pentium II called 'Katmai' will be available. This CPU not will only run at higher clock speeds than its predecessors, but will also contain the new MMX2 instruction set.                                                                                                                                                                                                                                                                      |
|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Floating point | Although Intel has not yet published the MMX2 instruction set,<br>the<br>primary improvement seems to be the addition of floating point<br>MMX instructions, which operate on eight separate registers. The<br>author of [13] <i>suggests</i> that Intel can make the eight new<br>registers 40 bytes (240 bits) large. Since Intel floating point units<br>typically operate on 80 bit values this would enable four 80 bit<br>values or eight 40 bit values to be processed in parallel. |
| Integer        | <ul> <li>Besides this MMX2 contains some new integer instructions:</li> <li>Packed shuffle.</li> <li>Insert/extract word.</li> <li>Packed average on bytes &amp; words.</li> <li>Packed unsigned multiply on words (MMX multiplies signed).</li> <li>Masking.</li> <li>Minimum and maximum operations on bytes &amp; words.</li> <li>A prefetch instruction.</li> </ul>                                                                                                                    |
|                | processing. The insert and extract word instructions are also<br>useful, especially for manual packing of pixels. Minimum and                                                                                                                                                                                                                                                                                                                                                              |

'Xeon'

<sup>&</sup>lt;sup>1</sup> One color plane = 1580x2176 pixels = 3,438,080 B = 3.28 MB.

<sup>&</sup>lt;sup>2</sup> Source: "Windows Source Home Page" [13]. The instruction set was retrieved by Clive Turvey. He disassembled a demo program that was placed accidentally on an Intel home page. Intel has *not* confirmed this, so this paragraph should be considered provisional.

maximum operations can be used for color stretch and edge detection algorithms. With the introduction of MMX2 Intel also will improve the developer environment for MMX. Currently MMX instructions can only be used in in-line assembly. For MMX2 Intel plans to release a compiler plug-in that allows the usage of MMX in standard C++ code. 4.1.3 AMD AMD is the main competitor of Intel. Before the introduction of the K6 high speed Pentium II versions, the AMD K6 processors outperformed the Intel processors. This was primarily caused by AMD's more elegant processor design. The later versions of the Pentium II used Intel's more advanced chip technology to reach higher clock speeds. This way Intel re-gained its position at the top of the market. AMD now targets at the lower end of the market by pricing its CPU's 25% below the equivalent Intel CPU. The K6 has a level one cache of 64 KB. This is twice the size of that of the Pentium II. However, since this is by far too small to store scanned images it is expected that this will not have a lot of influence on the image processing performance. AMD's more advanced processor design resulted in low execution latencies and no decode pairing restrictions. The penalty for misaligned data is only one cycle, compared to three cycles on Pentium (II/Pro) processors. AMD's K6-2<sup>(1)</sup> incorporates MMX3D. MMX3D is not only K6-2: MMX3D supported by AMD, but also by Cyrix, IDT and Microsoft. The main difference between MMX and MMX3D is that MMX3D incorporates SIMD floating point instructions, similar to MMX2. Except a prefetch and FEMMS (Fast Entry/Exit MMX State) instruction MMX3D does not have the improved MMX instructions MMX2 has. Compatibility It is important to realise that although they have some instructions in common, MMX3D and MMX2 are not compatible. In the second half of 1998 AMD plans to introduce the K6 3D+. In this processor the level two cache will be placed on the same chip the CPU is placed on to allow the cache to run at the same speed as the CPU does. Unlike the Pentium II, the conventional level two cache on the motherboard is used as a level three cache, similar to Digital's Alpha. The K6 3D+ will be introduced at a clock speed of 350 MHz, after which it will quickly move to 400 MHz and more. 4.1.4 Cyrix Another important competitor of Intel is Cyrix. Cyrix operates primarily on the low-cost side of the market where it managed to capture a stable market share.

<sup>&</sup>lt;sup>1</sup> The K6-2 was launched in the first half of 1998 as the successor of the K6. The K6-2 is also known as K6-3D. MMX3D is also known as 3Dnow.

| 6x86MX<br>M2 | The 6x86MX is Cyrix' latest CPU. It is a non-superscalar,<br>pipelined architecture that contains several advanced features,<br>such as branch prediction and speculative execution.<br>On 16 bit programs like Windows 95 the 6x86MX slightly<br>outperforms AMD's K6 and the Pentium with MMX. On 32 bit<br>programs like Windows NT however, the performance is<br>mediocre. A more advanced version of the 6x86MX is the M2.<br>This basically is a superscalar 6x86MX with two separate<br>pipelines. |
|--------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|              | Similar to AMD, Cyrix CPU's are quite well designed, but cannot keep up with the high-end Intel CPU's. This is, among others, caused by the low clock speeds Cyrix CPU's run on.                                                                                                                                                                                                                                                                                                                           |
|              | <ul> <li>The MMX Technology of the 6x86MX and M2 is fully compatible with Intel's. Performance tests show that its MMX performance is the lowest of all MMX CPU's available. Besides all the MMX instructions, the 6x86MX also contains some extended MMX instructions, some of which are useful for image processing:</li> <li>Packed magnitude (= absolute value compare).</li> <li>Packed conditional move (bytes).</li> <li>Packed Average (bytes).</li> </ul>                                         |
|              | Especially the conditional move and the packed magnitude<br>functions can be very useful for image processing. The first has<br>numerous applications, the latter can for example be used for<br>edge detection algorithms.                                                                                                                                                                                                                                                                                |
|              | Cyrix has announced an improved version of MMX called MMXFP to be incorporated in the 'Cayenne', which is to be released in the second half of 1998. Cyrix agreed with AMD that MMXFP will be identical to MMX3D.                                                                                                                                                                                                                                                                                          |
| 4.1.5        | IDT                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Winchip C6   | IDT is a relatively unknown CPU producer. In the first half of<br>1998<br>IDT announced its Winchip C6 CPU, which is claimed to have<br>about the same performance as the Pentium MMX for integer<br>operations. On floating point and MMX code however the C6 is<br>slower than the Pentium.                                                                                                                                                                                                              |
|              | The C6 does not incorporate the advanced features the Pentium II, K6-2 or the 6x86MX do; it simply is a classic but straight forward CPU with a large level one cache. The C6 has about the same price tag the 6x86MX has.                                                                                                                                                                                                                                                                                 |
|              | The successor of the C6, the C6+ will incorporate AMD's MMX3D extensions. The release date of the C6+ is not yet published.                                                                                                                                                                                                                                                                                                                                                                                |
| 4.2          | The PowerPC architecture                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| Mackintosh   | The PowerPC processors are designed by Motorola. They are<br>used for various applications, including Power Mackintosh<br>computers and embedded control applications. PowerPC CPU's<br>are currently produced by two vendors; Motorola and Samsung.                                                                                                                                                                                                                                                       |
| AltiVec      | The latest improvement of the PowerPC architecture is the addition of a vector processing SIMD technology called AltiVec.                                                                                                                                                                                                                                                                                                                                                                                  |

AltiVec instructions operate on 128 bits<sup>1</sup> wide registers called Vector Register File (VRF). The AltiVec programming model incorporates 32 vector registers which can be used in the following configurations<sup>2</sup>: 16 bytes (8 bit) 8 halfwords (16 bit) 4 words (32 bit) 1 quadword (128 bit) A total of 162 new instructions are available to process these data elements in parallel. The AltiVec programming model incorporates all of the MMX instructions. Additionally some other instructions are available, some of which are very useful for image processing: Sum across Multiply Multiply-sum (multiply and add to other register) Average Maximum Minimum Logical NOR Rotate instructions Splat (duplicate one element through register) Permute (Shuffle elements) Select element Prefetch data stream Various floating point arithmetic Prefetch The prefetch instruction can prefetch up to four data stre ams. First the stream is initialised by selecting a memory area to be loaded. The prefetcher then automatically loads these locations into the level one cache. In chapter 3 it can be read that memory accesses are a significant bottleneck. By using a prefetch instruction these delays can be avoided. Splat Another useful instruction for image processing is the splat instruction; this instruction duplicates one byte or halfword through an entire register. The splat instruction also operates on immediate values. Permute The permute instruction allows any byte in a register to be moved to another position. This can for example be used to reorder values in one cycle. All AltiVec instructions support both saturation and overflow arithmetic. Overflow arithmetic is generally used, but for image processing saturation arithmetic is very useful. Intel architecture CPU's use the little-endian byte numbering scheme to store multi-byte data in the memory<sup>3</sup>. Motorola CPU's traditionally use the big-endian scheme, where the bytes are stored reverse ordered. The AltiVec architecture however also supports the little endian scheme.

<sup>&</sup>lt;sup>1</sup> MMX registers are 64 bit wide.

<sup>&</sup>lt;sup>2</sup> Notice that the names of the data types are not the same as the names Intel uses.

<sup>&</sup>lt;sup>3</sup> See page 19.

Misaligned access

A weak point of the AltiVec architecture are the way misaligned memory accesses<sup>1</sup> are handled. If a quadword is read from memory for example the effective address is calculated by making the four least significant bits zero. If the access was misaligned this results in a memory read of the wrong data. A memory read on address 0x02F03 for example (Figure 87) should read the grayed data. Instead, the squared bytes are read.



AltiVec technology provides special instructions to merge misaligned datablocks. On Pentium (II/Pro) CPU's misaligned memory accesses cause a penalty of three cycles. An AltiVec CPU can merge two datablocks retrieve the misaligned datablock in the same number of cycles too.

<sup>&</sup>lt;sup>1</sup> On AltiVec CPU's misaligned memory accesses are accesses that are not aligned to the natural boundary of a data element. For example a halfword must be aligned on a two byte boundary. See section 2.4.4 for more information on alignment on the Pentium CPU.

| 5               | Evaluation & conclusion                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|-----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 5.1             | Implementation issues                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|                 | <ul> <li>When implementing image processing algorithms with MMX</li> <li>Technology a number of issues have to be taken in account.</li> <li>These issues can be devided into three categories:</li> <li>1. Architecture related</li> <li>2. Algorithm related</li> <li>3. Developing environment related</li> </ul>                                                                                                                                                                                                                                           |
|                 | Architecture related issues                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                 | The Pentium MMX architecture has some limitations that influence the way image processing algorithms are implemented.                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| Memory accesses | First of all, memory accesses (to load pixels for example) take a<br>relatively high amount of CPU clock cycles. This is caused by<br>the fact that memory chips do not reach the high clock speeds<br>the CPU does. Although cache memories improve the memory<br>bandwidth, scanned images cannot be entirely stored in them.<br>Therefore an algorithm has to be implemented in such a way<br>that the memory bandwidth is used optimally (caches).                                                                                                         |
|                 | Future processors, such as the Intel Pentium II 'Katmai', AMD K6-2, and Motorola AltiVec, incorporate instructions that allow prefetching of pixels by special processing units.                                                                                                                                                                                                                                                                                                                                                                               |
| Instructionset  | <ul> <li>Second, the Pentium with MMX lacks some instructions required for image processing, such as:</li> <li>Masking</li> <li>Shuffle or permutate</li> <li>Insert- and extract word</li> <li>Minimum/maximum/average operations</li> </ul>                                                                                                                                                                                                                                                                                                                  |
|                 | To perform such operations a number of instructions are required, causing a performance reduction.                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|                 | Intel's successor of MMX (MMX2) is expected to add some of<br>these missing instructions, but not all. An insert/extract word<br>instruction for example is not available. AMD's successor of<br>MMX (MMX3D or 3Dnow!) focuses on SIMD floating point<br>operations and adds no new SIMD integer instructions to MMX.<br>The Motorola AltiVec instructionset contains all of the<br>instructions required for image processing. Not only minimum,<br>maximum and average instructions, but also insert/extract word<br>and shuffle instructions are available. |
| Data types      | Third, many MMX instructions do not operate on the eight bit<br>data scanned images typically consist of. For most algorithms<br>this is no real limitation because they require the data to be<br>extended to larger data types anyway, but for some algorithms<br>(color stretch for example), this limits the number of pixels to be<br>processed in parallel to four, instead of eight.                                                                                                                                                                    |
|                 | Both Intel's MMX2 and AMD's MMX3D are not expected to increase the number of data elements that can be processed in parallel. For as far a the support of eight bit data is concerned:                                                                                                                                                                                                                                                                                                                                                                         |

|                    | although MMX2 adds some integer SIMD instructions the support of eight bit data has not been improved.<br>AltiVec CPU's process twice the amount of data elements MMX CPU's do (128 bit registers vs. 64). This means that 16 bytes or eight words <sup>1</sup> can be processed in parallel. Additionally, some of the more advanced AltiVec instructions, multiply-and-add for example, internally extend the data to prevent rounding errors.<br>All AltiVec instructions operate on 8- and 16 bit data. |
|--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Immediate operands | Fourth, most of the MMX instructions do not accept immediate<br>operands (constants). Multiply instructions for example cannot<br>multiply a value with a constant directly, so the constant has to<br>be loaded from memory into a register first (the move instruction<br>does not accept immediate operands either). None of the MMX<br>successors is expected to accept immediate operands.                                                                                                             |
| Registers          | Fifth, Intel architecture CPU's do not have enough registers<br>available to implement complex image processing algorithms<br>without storing register temporarily in the memory. Because of<br>the relatively low memory bandwidth this causes the<br>performance to decrease. Additionally the small number of<br>registers limits the possibilities to store immediate values in<br>registers during the entire processing loop.                                                                         |
|                    | Both Intel's and AMD's MMX successors are not expected to contain more registers than MMX does. AltiVec CPU's have 32 registers, which is a considerable improvement over the eight MMX registers.                                                                                                                                                                                                                                                                                                          |
| Image width        | Finally, a lot of image processing algorithms require special treatment of border pixels. Since SIMD CPU's process a fixed number of pixels in parallel, the image width has to be a multiple of the number of pixels processed in parallel. To process all types of images a border has to be added to some images in order to extend the image width. In a digital copier this is not necessary because the scanner can be configured to scan the optimal image width.                                    |
|                    | Algorithm related issues                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| Collect pixels     | Some of the image processing algorithms (lookup table and<br>error diffusion for example) heavily limit the possibilities for<br>parallel<br>processing. These limitations force the programmer to collect<br>the pixels to process with a number of instructions to allow<br>parallel processing. This occurs with error diffusion for example.                                                                                                                                                            |
|                    | Lookup operations can by definition not be performed in parallel,<br>unless a special lookup table is created. Since such a lookup<br>table's size will be increased exponentially this is generally not<br>possible for image processing.<br>These algorithm limitations, coupled to the described limitations<br>of the MMX instructionset, cause the performance increasement<br>to be reduced significantly.                                                                                            |
|                    | MMX2 and AltiVec contain insert- and extract word instructions that improve the possibilities to collect pixels. It is therefore                                                                                                                                                                                                                                                                                                                                                                            |

<sup>&</sup>lt;sup>1</sup> For simplicity's sake the Intel naming is used. Motorola calls 16 bit values half words and 32 bit values words. When speaking of words Intel means a 16 bit quantity.

expected that they will perform better on algorithms with poor parallelability. **Developing environment related issues** Currently MMX Technology can only be used in in-line assembly. This increases the time needed to implement an algorithm considerably, as well as the chance of program 'bugs'. VTune Besides this, only one development tool is available; VTune. VTune monitors an executing program and shows the time necessary for each statement. Additionally VTune indicates the cause of some execution delays. Unfortunately the current version of VTune (2.5) contains many 'bugs', making its usage very unpleasant. Intel has announced that with MMX2 a much improved version of VTune will be released. Especially the possibilities to analyse C++ code will be improved. Additionally Intel has recently released a compiler plug-in that allows the usage of MMX(2) in standard, C++ like statements. This is expected to decrease the time needed to implement an image processing algorithm considerably. 5.2 Performance For kernel operations (smooth, sharpen), as well as for the RGB to CMYK conversion the MMX implementation is a factor four to five faster than the C++ version. For more complex algorithms, such as error diffusion, this is a factor two. Most of the performance gain is reached by processing multiple Elements parallel (generally four) data elements in parallel. Additionally MMX implementations benefit from the fact that they have to be written Assembly in in-line assembly, which improves the performance slightly. Finally, some of the MMX processor units are faster than their conventional equivalents, from which especially the RGB to CMYK conversion benefits (multiply unit). However, in future (Pentium II) CPU's these units will also be improved, thus eliminating MMX's advantage. A lot of (potential) performance is lost when a certain algorithm Collecting pixels requires the pixels to be collected manually, instead of reading them as one sequential stream from memory. A similar problem occurs when the format of the data elements to be processed is not the same as the format in which the other input of the algorithm is received. This occurs for example with the interpolation<sup>1</sup>, where the weight of each of the angular points has to be re-ordered before the multiplication with the angular point can be performed. The successors of MMX will contain insert- and extract word instructions that help to avoid or relieve these problems. Generally it can be said that, although the MMX instructionset lacks a number of specific image processing instructions, the performance increases<sup>2</sup> almost linear with the number of Linear improvement elements processed in parallel, provided that no manual

<sup>&</sup>lt;sup>1</sup> Part of the RGB to CMYK conversion.

<sup>&</sup>lt;sup>2</sup> Compared to a C++ implementation.

collecting or re-ordering is required. For the algorithms <sup>1</sup> where this is required the performance gain is about half of the number of elements processed in parallel. It is expected that for algorithms with more complex calculations the gain will be slightly more<sup>2</sup>. An important remark to be made is that the number of pixels processed in parallel is limited by the data type of intermediate results (words instead of bytes). Extending data temporarily is often done in image processing algorithms to avoid large rounding errors. Besides this, sometimes data has to be extended because MMX instruction operate on certain data types only. Generally the 8 bit pixels have to be extended to 16 bits. All MMX instructions operate on 16 bit values. MMX can process four 16 bit data elements in parallel (eight 8-bit values). The performance gain actually reached (max. factor five) is less Intel than the gain Intel predicted (factor six). This is primarily caused by the fact that Intel processes small images only. These images can be stored entirely in the fast level one cache memory. Another reason Intel's prediction of the performance gain is too optimistic is that Intel only implements algorithms with an excellent parallelability and the possibility to process eight pixels in parallel, instead of four. 5.3 Recommendations Currently MMX's main limitations are the instructionset, the memory bandwidth and the lack of registers. Alternatives of MMX contain both an improved instructionset and have ways to improve the memory bandwidth. For image processing the Intel MMX2 Pentium II 'Katmai' with MMX2 and especially the Motorola AltiVec PowerPC with AltiVec are the most promising CPU's. The latter has the potential to reach twice the speed MMX2 does because the AltiVec registers can contain twice the amount of pixels an MMX(2) register can. Additionally the AltiVec has four times the number of registers MMX(2) has. Therefore further research on MMX2 and especially AltiVec is recommended. Another limitation of image processing on a PC lies in the fact that Windows NT Windows NT was not designed for (semi-) real-time applications. Due to Windows NT's multitasking system an application's performance can fluctuate considerably. This can be improved by shutting down large parts of the operating system, but generally such a solution cannot be used (because the PC also performs other tasks for example). Therefore further research on how to reach a constant application performance on a Windows NT platform is recommended.

<sup>&</sup>lt;sup>1</sup> For error diffusion and the color conversion. Only for error diffusion this was actually measured. Test indicate that when the C++ version of the color conversion uses a one-cycle multiplier the performance gain of the MMX version is about a factor two.

<sup>&</sup>lt;sup>2</sup> Because the amount of processing time required for pixel-collection is relatively low compared to the time needed to process the collected pixels.

| r 1 |  |
|-----|--|
|     |  |
|     |  |

# Literature

- 1. Rutten, R, An introduction to MMX, Océ internal report, 1997
- 2. <u>Intel MMX Technology at a Glance</u>, Intel Corporation, 1997, Order number 243100-003
- 3. <u>Intel Architecture Technology Overview</u>, Intel Corporation, 1997, Order number 243081
- 4. <u>Intel Architecture MMX Technology Developer's Manual</u>, Intel Corporation, 1997, Order number 243013
- 5. <u>Pentium Processor Family Developers Manual</u>, Intel Corporation, 1997, Order number 241428-05
- Intel Architecture MMX Technology Programmer's <u>Reference Manual</u>, Intel Corporation, 1997, Order number 243007
- Pentium and Pentium Pro Processors and Related Products, Intel Corporation, 1997, Order number 241732-004, ISBN 1-55512-265-5
- 8. <u>Intel Architecture Optimization Manual</u>, Intel Corporation, 1997, Order number 242816-003
- 9. Schmit, M. L., <u>Pentium Processor Optimization Tools</u>, AP Professional, 1995, ISBN 0-12-627230-1
- Rosendahl, J. J. A, <u>Implementing image processing</u> algorithms on SIMD architectures, Océ internal report, 1998
- 11. Fog, A, <u>How to optimize for the Pentium Processor</u>, 1998, http://www.announce.com/agner/assem/
- 12. <u>AltiVec Technology Programming Environments Manual</u>, Motorola Inc., 1998, http://www.motorola.com/AltiVec/
- Turvey, C, <u>Windows Source Home Page</u>, V Communications, 1998, http://www.tbcnet.com/~clive/
- <u>AMD-3D Technology Manual</u>, Advanced Micro Devices Inc., 1998, Publication number 21928.
- 15. <u>IDT Winchip C6 Processor data book</u>, 1997, Centaur technology Inc., http://www.winchip.com

Appendix A:

# MMX instruction set summary

| Category      | Mnemonic | Data type | Description                                                   |
|---------------|----------|-----------|---------------------------------------------------------------|
| Arithmetic    | PADD     | B,W,D     | Add with wrap-around on [byte, word, doubleword]              |
|               | PADDS    | B,W       | Add signed with saturation on [byte, word]                    |
|               | PADDUS   | B,W       | Add unsigned with saturation in [byte, word]                  |
|               | PSUB     | B,W,D     | Subtraction with wrap-around on [byte, word, doubleword]      |
|               | PSUBS    | B,W       | Subtract signed with saturation in [byte, word]               |
|               | PSUBUS   | B,W       | Subtract unsigned with saturation on [byte, word]             |
|               | PMULH    | W         | Packed multiply high on words                                 |
|               | PMULL    | W         | Packed multiply low on words                                  |
|               | PMADD    | WD        | Packed multiply on words and add resulting pairs              |
| Comparison    | PCMPEQ   | B,W,D     | Packed compare for equality [byte, word, doubleword]          |
|               | PCMPGT   | B,W,D     | Packed compare greater than [byte, word, doubleword]          |
| Conversion    | PACKUS   | WB        | Pack words into bytes (unsigned with saturation)              |
|               | PACKSS   | WB,DW     | Pack [words into bytes, bytes into words] (signed with        |
|               |          |           | saturation)                                                   |
|               | PUNPCKH  | BW,WD,DQ  | Unpack (interleave) high-order [bytes, words, doublewords]    |
|               | PUNPCKL  | BW,WD,DQ  | Unpack (interleave) low-order [bytes, words, doublewords]     |
| Logical       | PAND     | Q         | Bitwise AND                                                   |
|               | PANDN    | Q         | Bitwise AND NOT                                               |
|               | POR      | Q         | Bitwise OR                                                    |
|               | PXOR     | Q         | Bitwise XOR                                                   |
| Shift         | PSLL     | W,D,Q     | Packed shift left logical [word, doubleword, quadword] by     |
|               |          |           | amount specified in MMX register or immediate value           |
|               | PSRL     | W,D,Q     | Packed shift right logical [word, doubleword, quadword] by    |
|               |          |           | amount specified in MMX register or immediate value.          |
|               | PSRA     | W, D      | Packed shift right arithmetic [word, doubleword, quadword] by |
|               |          |           | amount specified in MMX register or immediate value           |
| Data transfer | MOV      | D,Q       | Move [doubleword, quadword] to MMX register or from MMX       |
|               |          |           | register                                                      |
| FP & MMX      | EMMS     |           | Empty MMX state                                               |
| State         |          |           |                                                               |

- B = Byte W = Word
- D = Doubleword
- Q = Quadword