## Instruction Set Extensions – Creating A Symmetric-Key Crypto-Processor

Abstract

In this article, the Author will talk about identifying computationally intensive operations within classifications of algorithms, such as symmetric-key ciphers. These operations require many instructions to implement when targeting a general-purpose processor. The concept of instruction set extensions will be introduced to accelerate these operations by off-loading them to custom hardware attached to the processor’s datapath that is accessed via newly defined instructions in the processor’s control logic.

The article is authored by Dr. Adam Elbirt a long time CRYPTOcrat and who is currently working as an Assistant Professor at University of Massachusetts Lowell.

You can find more information about Dr. Elbirt on his LI Profile.

Creating A Symmetric-Key Crypto-Processor

Most algorithms can be broken down into a finite number of core operations. When implementing an algorithm in software targeting a general-purpose processor, some core operations are easy to implement, requiring few instructions, while others are significantly more complex, requiring numerous instructions. An example of a core operation easily implemented in software is key addition, typically achieved by bit-wise XORing a round key with data. Examples of more complex core operations are bit-level permutations and long number arithmetic. Numerous instructions are required because the datapath of a general-purpose processor does not directly support the implementation of these operations due to limited processor word size, the requirement that data be operated upon in bytes or multiple of bytes instead of bits, the lack of a required ALU unit, etc.

When using a general-purpose processor to implement symmetric-key cryptographic algorithms such as block ciphers, even the fastest software implementations cannot satisfy the required bulk data encryption data rates for high-end applications such as ATM networks which require an encryption throughput of 622 Mbps. As a result, hardware implementations are necessary for block ciphers to achieve this required performance level. Although traditional hardware implementations lack flexibility, configurable hardware devices offer a promising alternative for the implementation of processors via the use of IP cores in Application Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA) technology. To illustrate, Altera Corporation offers IP core implementations of the Intel 8051 microcontroller and the Motorola 68000 processor in addition to their own Nios®-II embedded processor. Similarly, Xilinx Inc. offers IP core implementations of the PowerPC processor in addition to their own MicroBlaze^{TM}* *and PicoBlaze^{TM}* *embedded processors. ASIC and FPGA technologies provide the opportunity to augment the existing datapath of a processor implemented via an IP core to add acceleration modules supported through newly defined instruction set extensions targeting performance-critical functions. Moreover, many licensable and extendible processor cores are also available for the same purpose.

The use of instruction set extensions follows the hardware/software co-design paradigm to achieve the performance and physical security associated with hardware implementations while providing the portability and flexibility traditionally associated with software implementations. Moreover, when considering alternative solutions, instruction set extensions result in significant performance improvements versus traditional software implementations with considerably reduced logic resource requirements versus hardware-only solutions such as co-processors. The idea is to “improve the wheel” rather than to “reinvent the wheel”.

Examples of instruction set extensions designed to improve the performance of cryptographic algorithms include those implemented to perform arithmetic over the Galois Field GF(2* ^{m}*), usually targeting elliptic curve cryptography (ECC) systems. Word-level polynomial multiplication was shown to be the time-critical operation when targeting an ARM processor and a special Galois Field multiplication instruction resulted in significant performance improvement. Instruction set extensions targeting a SPARC V8 processor core and a 16-bit RISC processor core were used to accelerate the multiplication of binary polynomials for arithmetic in GF(2

*). An implementation targeting a MIPS32 architecture attempts to accelerate word-level polynomial multiplication through the use of Comba’s method of handling the inner loops of the multiplication operation. Numerous generalized Galois Field multipliers have also been proposed for use in elliptic curve cryptosystems. These implementations focus on accelerating exponentiation and inversion in Galois Fields*

^{m}*GF*(2

*) where*

^{m}*m ?*160

*-*256.

Instruction set extensions designed to minimize the number of memory accesses and accelerate the performance of AES implementations have been proposed for a wide range of processors. Extensions targeting a general-purpose RISC architecture with multimedia instructions yield strategies to implement AES using multimedia instructions while specifically attempting to minimize the number of memory accesses. While the processor is datapath-scalable, the strategies do not map well to 32-bit architectures. Extensions designed to combine the SubBytes andMixColumns AES functions into one *T table *look-up operation to speed up algorithm execution have also been proposed. However, the functional unit requires a significant amount of hardware to implement and cannot be used for either the final AES round (where the MixColumns function is not used) or key expansion (where the SubBytes function is used without the MixColumns function), and *T table *performance is heavily dependent upon available cache size. Extensions targeting the Xtensa 32-bit processor improve the performance of AES encryption but worsen the performance of decryption. An implementation targeting a LEON2 processor core combines the SubBytes and ShiftRows AES functions through the use of an instruction set extension termed *sbox*. Special instructions are also provided to efficiently compute the MixColumns AES function through the use of ECC instruction set extensions.

Clearly, the use of instruction set extensions allows existing processor technologies to be leveraged in combination with custom functionality to vastly improve the performance of the targeted algorithms. However, even within classifications of algorithms, such as symmetric-key algorithms, a wide range of additional functionality may be required to accelerate the entire suite. A trade-off analysis of hardware resource requirements versus expected performance improvement is critical when evaluating which core elements of each algorithm to accelerate via added hardware units. Relevant references that review the discussed implementations are included below.

References

1. S. Bartolini, I. Branovic, R. Giorgi, and E. Martinelli, “A Performance Evaluation of ARM ISA Extension for Elliptic Curve Cryptography Over Binary Finite Fields,” in *Proceedings of the Sixteenth Symposium on Computer Architecture and High Performance Computing – SBC-PAD 2004*, Foz do Igua¸cu, Brazil, October 27-29 2004, pp. 238-245.

2. J. Großchädl and G.-A. Kamendje, “Instruction Set Extension for Fast Elliptic Curve Cryptography Over Binary Finite Fields GF(2*m*),” in *Proceedings of the Fourteenth IEEE International Conference on Application- Specific Systems, Architectures and Processors – ASAP 2003*, The Hague, The Netherlands, June 24-26 2003, pp. 455-468.

3. J. Großchädl and E. Savas, “Instruction Set Extensions for Fast Arithmetic in Finite Fields GF(p) and GF(2*m*),” in *Workshop on Cryptographic Hardware and Embedded Systems – CHES 2004*, M. Joye and J.-J. Quisquater, Eds., Cambridge, Massachusetts, USA, August 11-13 2004, vol. LNCS 3156, pp. 133-147, Springer-Verlag.

4. J. Irwin and D. Page, “Using Media Processors for Low-Memory AES Implementation,” in *Proceedings of the Fourteenth IEEE International Conference on Application-Specific Systems, Architectures and Processors – ASAP 2003*, The Hague, The Netherlands, June 24-26 2003, pp. 144-154.

5. K. Nadehara, M. Ikekawa, and I. Kuroda, “Extended Instructions for the AES Cryptography and Their Efficient Implementation,” in *Proceedings of the Eighteenth IEEE Workshop on Signal Processing Systems – SIPS 2004*, Austin, Texas, USA, October 13-15 2004, pp. 152-157.

6. S. O’Melia, *Instruction Set Extensions for Enhancing the Performance of Symmetric Key Cryptographic Algorithms*, MSEE Thesis, University of Massachusetts Lowell, 2005.

7. S. Ravi, A. Raghunathan, N. Potlapally, and M. Sankaradass, “System Design Methodologies for a Wireless Security Processing Platform,” in *Proceedings of the 2002 Design Automation Conference – DAC 2002*, New Orleans, Louisiana, USA, June 10-14 2002, pp. 777-782.

8. S. Tillich and J. Großchädl, “A Simple Architectural Enhancement for Fast and Flexible Elliptic Curve Cryptography Over Binary Finite Fields GF(2*m*),” in *Proceedings of the Ninth Asia-Pacific Conference on Advances in Computer Systems Architecture – ACSAC 2004*, Beijing, China, September 7-9 2004, vol. LNCS 3189, pp. 282-295, Springer-Verlag.

9. S. Tillich and J. Großchädl, “Accelerating AES Using Instruction Set Extensions for Elliptic Curve Cryptography,” in *International Conference on Computational Science and Its Applications – ICCSA 2005*, O. Gervasi, M. L. Gavrilova, V. Kumar, A. Laganà, H. P. Lee, Y. Mun, D. Taniar, and C. J. K. Tan, Eds., Singapore, May 9-12 2005, vol. LNCS 3481, pp. 665-675, Springer-Verlag.

10. S. Tillich and J. Großchädl, “Instruction Set Extensions for Efficient AES Implementation on 32-bit Processors,” in *Workshop on Cryptographic Hardware and Embedded Systems – CHES 2006*, L. Goubin and M. Matsui, Eds., Yokohama, Japan, October 10-13 2006, vol. LNCS 4249, pp. 270-284, Springer-Verlag.

11. S. Tillich and J. Großchädl and A. Szekely, “An Instruction Set Extension for Fast and Memory-Efficient AES Implementation,” in *Proceedings of the Ninth International Conference on Communications and Multimedia Security – CMS 2005*, J. Dittmann, S. Katzenbeisser, and A. Uhl, Eds., Salzburg, Austria, September 19-21 2005, vol. LNCS 3677, pp. 11-21, Springer-Verlag.

All the products mentioned herein which have trademarks and/or registered trademarks belong to their respective owners.

## 7 Comments

** Please note the views expressed in the comments below are that of the commenter and the owners of this website may not agree with the views expressed.**

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

“…even the fastest software implementations cannot satisfy the required bulk data encryption data rates for high-end applications such as ATM networks which require an encryption throughput of 622 Mbps”?

There’s a whole bunch of stream ciphers that encrypt 8 bits every 3-5 cycles on Intel Pentium 4 and later. Even AES is capable of 8 bits every 15 cycles.

This article focuses on block ciphers as stream ciphers are not typically

used for bulk data encryption in software due to the high overhead

associated with bit-level manipulations. In regards to AES, a few

high-end implementations have been reported in the Gbps range but most

implementations require 100s of cycles when encrypting the minimum block

size of 128 bits.

I disagree with your assertion.

Bit-based stream ciphers are not used in software. There is whole bunch of stream ciphers that are word-oriented and use block-cipher like operations. They achieve better performance than block ciphers in software.

Have a look at the ECRYPT eSTREAM website where the world’s stream cipher experts have been slaving away for the past few years.

I must also disagree with your assertion. While stream ciphers certainly

have been developed that are high performance, bulk data encryption in

fielded software is dominated by block ciphers not stream ciphers, as

evidenced by the AES standardization process and the prior use of block

ciphers such as DES, IDEA, and a host of others. Regardless, the argument

is off-topic as ANY algorithm, cryptographic or otherwise, will have

software performance bottlenecks that are primarily caused by the need to

perform core operations that are poorly (or not at all) supported by the

target processor’s datapath. Instruction set extensions address this

bottleneck.

I think that the article is making a case for implementing core cryptography algorithms in ASIC/FPGA. Therefore, it need not try to argue that ASIC/FPGA implementations are superior to software with respect to speed or vice versa. I agree with Matt that given a superior processor and an appropriate choice of cryptography algorithm, the mentioned speed can be achieved.

The argument in this article should not be about raw speed. It should be that of cost ($) and jitter properties that are required when one goes down the communciation stack such as to the ATM layer, which is usually below the IP layer.

For real-time communciations (such as those for space communications, control systems, and SCADA systems), jitter is a greater evil than delay. Jitter can be understood as unpredictable delay. Imagine that you have to secure all GPS (Global Positioning System)communications with strong cryptography algorithm and provide a low-cost solution. Firstly, we cannot mandate every application to use a laptop for such implementations — the utility of embedded systems is proven and here to stay. Secondly, given the acute real-time requirements of GPS systems, what would be the best implementation you would opt for? I would bet on ASIC/FPGA implementations that are nicely dove-tailed with application-level software systems.

ASIC/FPGA implementations are inherently parallel in their operations — they are electronic circuits! The jitter characteristics of ASIC/FPGA would be far superior to any microprocessor that is required to execute multiple jobs. If we were to design a special electronic board with Pentium 4 processor and design the software module such that its shall only execute one job (which is to compute a cryptographic algorithm), then this hardware-software combine might possess similar jitter properties such as a similar, low-end ASIC/FPGA product. But, the cost of making such a hardware would be so prohibitive that nobody would make or buy such products. This is where the ASIC/FPGA advantage would be useful over a pure software system.

Additionally, this article may not be a stream cipher or block cipher debate. To make blanket statements that stream ciphers are not used for bulk encryption is incorrect. Do mobile phone networks use block or stream ciphers for bulk encryption on the network? I have designed a network-level, point-to-point bulk-encryption system that can use block or stream ciphers in order to cater to differing end user needs. If you have two-party communications, stream or block ciphers can be used. If you have more than two communicating parties, stream ciphers cannot be used nor can certain configurations of block ciphers that are not self-synchronous like OFB and CFB modes. If we are designing a cryptosystem for point-to-point communication applications that do not allow any form of message buffering and send very short messages, we will have to use stream ciphers. Very short messages may be messages that are 8-bits in length. We cannot have a secure 8-bit block cipher!

I do not believe in silver bullet solutions. Customized (or optimized) designs appeal to my beliefs and experiences. I have seen people in the industry trying to use IPSec for safety-critical real-time applications, which I believe are dangerous and wasteful attempts.

Instruction-set extensions were not proposed as a “silver bullet” solution

to cryptographic algorithm implementation and block ciphers were never

described as the only method of bulk data encryption. The point of the

article is to offer a method of accelerating implementations that take

advantage of IP processor cores targeting FPGAs and ASICs where the

processor is not optimized to perform particular functions such as core

cryptographic components. In such case, instruction-set extensions offer

an acceleration method that is easy to implement versus a full-custom

implementation of the particular algorithm, thus providing the flexibility

associated with software implementations in combination with the speed of

hardware implementations.

“…..advantage of IP processor cores targeting FPGAs and ASICs where the

processor is not optimized to perform particular functions such as core

cryptographic components..”

I have also tried to this same argument in the past. A prompt counter argument that I have received was: why not use a standard cryptography co-processor that is already available in the market?

http://www-03.ibm.com/security/cryptocards/pcicc.shtml

Are FPGA/ASIC based initiatives re-inventing the wheel or actually trying to provide solutions to voids that cannot be (or are not being) addressed by existing implementation techniques?

I have been unable to provide generic and robust answers to such questions. The general question in the industry is “make or buy.” Whenever a company can buy, it usually does not prefer to make. This has been a stone wall that has prevented many of my efforts to introduce high-speed, jitter-sensitive processors in the civilian domain. I know that there are advantages but I have been unable to provide a conclusive answer. I think that this forum could be a nice place for arriving at such an argument. Such argument can greatly improve the adoption of FPGA/ASIC based cryptography subsytems in commercial systems.

I feel that valuable arguments can come from the validity of the following assertions:

1) ASIC/FPGA implementations can be optimzied for power-constrained. environments such as embedded systems (if you do not activate a circuit, its power consumption will be minimal);

2) ASIC/FPGA implementations can provide improved resilience to side-channel cryptanalysis such as power analysis and so on.

3) ASIC/FPGA implementations inherently have superior jitter properties.

I think that conclusive findings to the above arguments could provide a better justification for why such implementations are not “reinventing the wheel” in any manner.

Hope that these feedbacks will be useful.