Lightweight Coprocessor for Koblitz Curves: 283-Bit ECC Including Scalar Conversion with only 4300 Gates

Abstract. We propose a lightweight coprocessor for 16-bit microcontrollers that implements high-security elliptic curve cryptography. It uses a 283-bit Koblitz curve and offers 140-bit security. Koblitz curves offer fast point multiplications if the scalars are given as specific τ-adic expansions, which results in a need for conversions between integers and τ-adic expansions. We propose the first lightweight variant of the conversion algorithm and, by using it, introduce the first lightweight implementation of Koblitz curves that includes the scalar conversion. We also include countermeasures against side-channel attacks, making the coprocessor the first lightweight coprocessor for Koblitz curves that includes a set of countermeasures against timing attacks, SPA, DPA and safe-error fault attacks. When the coprocessor is synthesized for 130 nm CMOS, it has an area of only 4,323 GE. When clocked at 16 MHz, it computes one 283-bit point multiplication in 98 ms with a power consumption of 97.70 μW, thus consuming 9.56 μJ of energy.


Introduction
Elliptic curve cryptography (ECC) is one of the prime candidates for bringing public-key cryptography to applications with strict constraints on implementation resources such as power, energy, circuit area, and memory. Lightweight applications that require strong public-key cryptography include, e.g., wireless sensor network nodes, RFID tags, medical implants, and smart cards. Such applications will have a central role in actualizing concepts such as the Internet of Things and, hence, providing strong cryptography with low resources has been an extremely active research field in recent years. As a result of this research line, we have several proposals for efficient lightweight implementations of ECC. These proposals focus predominantly on 163-bit elliptic curves, which provide a medium security level of about 80 bits [4-6,15,24,26,34,42,43]. We provide a coprocessor architecture that implements ECC using a high-security 283-bit Koblitz curve and includes countermeasures against side-channel attacks. Koblitz curves [23] are a special class of elliptic curves which enable very efficient point multiplications and, therefore, they are an attractive alternative also for lightweight implementations. However, these efficiency gains can be exploited only by representing scalars as specific τ-adic expansions. Most cryptosystems require the scalar also as an integer (see, e.g., ECDSA [31]). Therefore, cryptosystems utilizing Koblitz curves need both the integer and τ-adic representations of the scalar, which results in a need for conversions between the two domains. This is not a major problem in applications which have sufficient resources because fast methods for on-the-fly scalar conversion are available [8,37]. Consequently, very fast ECC implementations using Koblitz curves have been presented for both software [39] and hardware [18].
For lightweight implementations, however, the extra overhead introduced by these conversions has so far prevented efforts to use Koblitz curves. A recent paper [4] showed that Koblitz curves result in a very efficient lightweight implementation if τ-adic expansions are already available, but the fact that the conversion is not included seriously limits the possible applications of the implementation. An alternative approach was provided in a very recent paper [20], which delegates conversions from the lightweight implementation to a powerful server. However, this solution is not suitable for applications where both communicating parties are lightweight implementations, and it also requires minor modifications to the cryptosystems, which may hinder its use in some applications. Computing conversions directly in the lightweight implementation would be a better option in many cases and, hence, we focus on that alternative in this paper. All previous hardware implementations of the conversions [1,7,8,19,36] target high speed, which makes them unsuitable for lightweight implementations.
To the best of our knowledge, we present the following novel contributions:
- We present the first lightweight implementation of high-security ECC by using a 283-bit Koblitz curve offering roughly 140 bits of security. By high security, we mean security levels exceeding 128 bits (e.g., AES-128). Because the security of a cryptosystem utilizing multiple cryptographic algorithms is determined by its weakest algorithm, our implementation is the first lightweight implementation of ECC that can be combined, e.g., with AES-128 without reducing the security level of the entire system.
- We present the first complete lightweight implementation of Koblitz curves that also includes on-the-fly scalar conversion. We achieve this by presenting a lightweight variant of the conversion algorithm from [8] which is optimized for word-serial computations. As mentioned above, the first implementation introduced in [4] does not include the conversion, which limits the possible applications of the implementation. All conversion algorithms and architectures available in the literature focus on the speed of the conversion.
- The first lightweight implementation of Koblitz curves [4] does not include any countermeasures against side-channel attacks. We present the first lightweight implementation of Koblitz curves with countermeasures against side-channel attacks such as simple power analysis (SPA), differential power analysis (DPA), timing attacks, and safe-error fault attacks.
The paper is structured as follows. In Sect. 2, we provide a brief background on ECC and Koblitz curves. Then in Sects. 3 and 4, we describe our scalar conversion and point multiplication techniques. Our lightweight coprocessor architecture is presented in Sect. 5. We provide synthesis results in 130 nm CMOS and comparisons to other works in Sect. 6. We end with conclusions in Sect. 7.

Preliminaries
The use of elliptic curves for cryptography was independently proposed by Victor Miller [29] and Neal Koblitz [22] in the mid-1980s. Points that satisfy the equation of an elliptic curve form an additive Abelian group $E$ together with a special point $\mathcal{O}$, which is the zero element of the group. Elliptic curves over finite fields $\mathbb{F}_q$ are used in cryptography and we focus on elliptic curves over binary fields $\mathbb{F}_{2^m}$ (finite fields of characteristic two) with polynomial basis. Let $P_1, P_2 \in E$. The group operation $P_1 + P_2$ is called point addition when $P_1 \neq \pm P_2$ and point doubling when $P_1 = P_2$. The fundamental operation of ECC is the elliptic curve point multiplication $Q = kP$, where $k \in \mathbb{Z}$ and $Q, P \in E$.
Point multiplication is computed with a series of point additions and point doublings. The basic approach to compute point multiplications is to use the double-and-add algorithm (also called the binary algorithm), which iterates over the bits of $k$ one at a time and computes a point doubling for every bit and a point addition if the bit is one. Each point operation involves several operations in the underlying finite field. Projective coordinates are typically used for representing points as $(X, Y, Z)$ in order to reduce the number of inversions in $\mathbb{F}_{2^m}$. We use the López-Dahab coordinates [27] and specifically the point addition formulae from [2]. Another option that we considered was to use the λ-coordinates [33], which offer slightly faster point additions. In our case, however, the cost of obtaining the λ-coordinate representation, which includes an inversion in $\mathbb{F}_{2^m}$, outweighs the cheaper point additions.
Koblitz curves, introduced by Koblitz in [23], are a special class of elliptic curves defined by the following equation:

$$E_a : y^2 + xy = x^3 + ax^2 + 1, \quad a \in \{0, 1\}.$$

The Frobenius endomorphism for a point $P = (x, y)$ is given by $\phi(P) = (x^2, y^2)$ and, for Koblitz curves, it holds that $\phi(P) \in E$ for all $P \in E$. It can also be shown that $\phi^2(P) - \mu\phi(P) + 2P = \mathcal{O}$ for all $P \in E$, where $\mu = (-1)^{1-a}$ [23]. Consequently, the Frobenius endomorphism can be seen as a multiplication by the complex number $\tau = (\mu + \sqrt{-7})/2$ [23].
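As a quick numerical sanity check of the relation above, the following sketch (not part of the original design; the function name `tau` is ours) verifies that $\tau = (\mu + \sqrt{-7})/2$ is a root of $x^2 - \mu x + 2$ for both values of $a$ and that its norm is 2, which is why every division by τ roughly halves the norm of a scalar.

```python
import cmath

def tau(a: int) -> complex:
    """Complex number tau = (mu + sqrt(-7)) / 2 with mu = (-1)^(1 - a)."""
    mu = (-1) ** (1 - a)
    return (mu + cmath.sqrt(-7)) / 2

for a in (0, 1):
    mu = (-1) ** (1 - a)
    t = tau(a)
    # Characteristic equation of the Frobenius endomorphism: tau^2 - mu*tau + 2 = 0
    assert abs(t * t - mu * t + 2) < 1e-12
    # The norm of tau is tau * conj(tau) = 2
    assert abs(t * t.conjugate() - 2) < 1e-12
```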
Representing the scalar $k$ as a τ-adic expansion $t = \sum_{i=0}^{\ell-1} t_i \tau^i$ allows computing point multiplications with a Frobenius-and-add algorithm, which is similar to the double-and-add algorithm except that point doublings are replaced by Frobenius endomorphisms. Depending on the application, a τ-adic expansion can be found by converting an integer into a τ-adic expansion [8,23,28,37] and/or by finding a random τ-adic expansion directly [23,25]. In the latter case, a conversion in the other direction is typically required because most cryptosystems (e.g., ECDSA [31]) require the scalar as an integer, too. Conversions in either direction can be expensive [8] but once the τ-adic expansion is obtained, the point multiplication is significantly faster, which typically makes Koblitz curves more efficient than other standardized elliptic curves. So far, no efficient lightweight implementations of these conversions exist, which rules Koblitz curves out of the domain of lightweight cryptography.
Because the negative of a point is given simply as $-P = (x, x + y)$, the cost of point subtraction is practically equal to the cost of point addition and significant performance improvements can be obtained by using signed-bit representations for the scalar. In that case, a point addition is computed if $t_i = +1$ and a point subtraction is computed if $t_i = -1$. The most widely used signed-bit representation for Koblitz curves is the τ-adic nonadjacent form (τNAF) introduced by Solinas in [37], which has an average density of 1/3 for nonzero coefficients. Solinas also presented the window τNAF ($w$-τNAF) that allows even lower densities of $1/(w + 1)$ by utilizing precomputations to support an increased set of possible values for the coefficients $t_i$.

Both τNAF and $w$-τNAF have the serious downside that they are vulnerable to side-channel attacks because the pattern of point operations depends on the key bits. The basic approach for obtaining resistance against side-channel attacks for ECC is to use Montgomery's ladder [30], which employs a constant pattern of point operations. Unfortunately, Montgomery's ladder is not a viable choice for Koblitz curves because then all benefits of cheap Frobenius endomorphisms are lost. Certain options (e.g., by using dummy operations) have been proposed in [14]. In this paper, we reuse the idea of a zero-free τ-adic representation [32,41], which contains only nonzero digits, i.e., $t_i \in \{-1, +1\}$. When this representation is scanned with windows of size $w \geq 2$, the resulting point multiplication algorithm is both efficient and secure against many side-channel attacks because it employs a constant pattern of operations [32,41].
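To make the τNAF concrete, the sketch below implements the textbook τNAF recoding attributed to Solinas for an unreduced integer, together with a Horner evaluation in $\mathbb{Z}[\tau]$ that checks the result. It is an illustration only: the function names are ours, $\mu = -1$ is assumed as the default, and the coprocessor described in this paper uses the zero-free representation instead.

```python
def tnaf(k, mu=-1):
    """tau-adic NAF digits of the integer k, least significant digit first."""
    r0, r1 = k, 0
    digits = []
    while r0 != 0 or r1 != 0:
        if r0 & 1:
            # Choose u in {-1, +1} so that the next digit is forced to zero
            u = 2 - ((r0 - 2 * r1) % 4)
            r0 -= u
        else:
            u = 0
        digits.append(u)
        # Exact division of r0 + r1*tau by tau (r0 is even here)
        r0, r1 = r1 + mu * (r0 // 2), -(r0 // 2)
    return digits

def evaluate(digits, mu=-1):
    """Return (c0, c1) with c0 + c1*tau = sum(digits[i] * tau^i), by Horner's rule."""
    c0, c1 = 0, 0
    for t in reversed(digits):
        # Multiply by tau using tau^2 = mu*tau - 2, then add the digit
        c0, c1 = -2 * c1 + t, c0 + mu * c1
    return c0, c1

d = tnaf(3)
assert evaluate(d) == (3, 0)
# Nonadjacency: no two consecutive digits are both nonzero
assert all(d[i] == 0 or d[i + 1] == 0 for i in range(len(d) - 1))
```

The pattern of zero and nonzero digits in `d` depends on the scalar, which is exactly the property that makes τNAF-based point multiplication leak through its sequence of point operations.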

Koblitz Curve Scalar Conversion
The zero-free representation for an integer scalar $k$ is found so that $k$ is first reduced to $\rho = b_0 + b_1\tau \equiv k \pmod{\tau^m - 1}$ and the zero-free representation $t$ is generated from the reduced scalar $\rho$ [32,37,41]. The overhead of these conversions is especially important for lightweight implementations. Another important aspect is resistance against side-channel attacks. In the following, we describe our lightweight and side-channel resistant scalar conversion algorithms. Only SPA countermeasures are required because only one conversion is required per $k$. The scalar $k$ is typically a nonce but even if it is used multiple times, $t$ can be computed only once and stored.

Algorithm 1. Scalar reduction algorithm from [8]. Input: integer scalar $k$. In the algorithm, the remainder $u$ is the lsb of $d_0$, i.e., the remainder before division by τ.

Scalar Reduction
We choose the scalar reduction technique called lazy reduction (described as Algorithm 1) from [8]. The algorithm reduces an integer scalar by repeatedly dividing it by τ for $m$ times. This division can be implemented with shifts, additions, and subtractions, which makes the scalar reduction algorithm [8] attractive for lightweight implementations. However, the only known hardware implementations of this algorithm [8] and its speed-optimized versions [1,36] use full-precision integer arithmetic and parallelism to minimize cycle count. Hence, the reported architectures consume large areas and are thus not suitable for lightweight implementations. We observe that the original lazy reduction algorithm [8] can also be implemented in a word-serial fashion to reduce area requirements, but such a design decision increases the cycle count. To reduce the number of cycles, we optimize the computational steps of Algorithm 1. Further, we investigate the side-channel vulnerability of the algorithm and propose lightweight countermeasures against SPA.
Computational Optimization. In lines 6 and 7 of Algorithm 1, the computations of $d_1$ and $a_0$ require subtractions from zero. In a word-serial architecture with only one adder/subtracter circuit, they consume nearly 33 % of the cycles of the scalar reduction. We use the iterative property of Algorithm 1 and eliminate these two subtractions by replacing lines 6 and 7 with variants that compute the negatives of the new $(d_0, d_1)$ and $(a_0, a_1)$. However, with this modification, $(a_0, a_1)$ and $(d_0, d_1)$ have a wrong sign after every odd number of iterations of the for-loop in Algorithm 1. It may appear that this wrong sign could affect the correctness of $(b_0, b_1)$ in line 5. However, since the remainder $u$ (in line 3) is generated from $d_0$ instead of the correct value $-d_0$, a wrong sign is also assigned to $u$. Hence, the multiplications $u \cdot a_0$ and $u \cdot a_1$ in line 5 are always correct, and the computation of $(b_0, b_1)$ remains unaffected by the wrong signs. After completion of the for-loop, the sign of $(d_0, d_1)$ is wrong because $m$ is an odd integer for secure fields. Hence, the correct value of the reduced scalar is obtained by subtracting $(d_0, d_1)$ from $(b_0, b_1)$ in the final step.

Protection Against SPA. In line 5 of Algorithm 1, the computation of the new $(b_0, b_1)$ depends on the remainder bit $u$ generated from $d_0$, which is initialized to $k$. Multi-precision additions are performed when $u = 1$, whereas no addition is required when $u$ is zero. A side-channel attacker can detect this conditional computation and can use, e.g., the techniques from [8] to reconstruct the secret key from the remainder bits that are generated during the scalar reduction.
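Returning to the computational optimization, the following sketch models the sign-flipped lazy reduction in plain Python integers. It is a reconstruction under stated assumptions rather than the authors' Algorithm 1: we assume $\mu = -1$ (the NIST curve K-283 has $a = 0$), an odd $m$, and our own helper names; the update rules are derived from $\tau^2 = \mu\tau - 2$. The helper `congruent` checks $\rho \equiv k \pmod{\tau^m - 1}$ by exact division in $\mathbb{Z}[\tau]$.

```python
MU = -1  # assumption: K-283 has a = 0, hence mu = (-1)^(1-a) = -1

def mul(x, y):
    """Product of x0 + x1*tau and y0 + y1*tau, using tau^2 = MU*tau - 2."""
    return (x[0] * y[0] - 2 * x[1] * y[1],
            x[0] * y[1] + x[1] * y[0] + MU * x[1] * y[1])

def congruent(k, rho, m):
    """True if rho is congruent to the integer k modulo tau^m - 1."""
    t = (1, 0)
    for _ in range(m):
        t = mul(t, (0, 1))
    y = (t[0] - 1, t[1])                       # y = tau^m - 1
    conj = (y[0] + MU * y[1], -y[1])           # complex conjugate of y
    norm = y[0] * y[0] + MU * y[0] * y[1] + 2 * y[1] * y[1]
    q = mul((k - rho[0], -rho[1]), conj)       # (k - rho) * conj(y)
    return q[0] % norm == 0 and q[1] % norm == 0

def lazy_reduce_signflip(k, m):
    """Lazy reduction with negated updates (no subtraction from zero); m must be odd."""
    a0, a1 = 1, 0      # holds (-tau)^i
    b0, b1 = 0, 0      # accumulates the remainders weighted by tau^i
    d0, d1 = k, 0      # holds the running quotient, negated after odd i
    for _ in range(m):
        u = d0 & 1                        # remainder taken from the stored d0
        d0 -= u
        b0, b1 = b0 + u * a0, b1 + u * a1
        d0, d1 = d0 // 2 - d1, d0 // 2    # negative of (d1 - d0/2, -d0/2)
        a0, a1 = 2 * a1, a1 - a0          # negative of (-2*a1, a0 - a1)
    return b0 - d0, b1 - d1               # m odd: the sign of d is wrong, so subtract

assert lazy_reduce_signflip(5, 3) == (1, 0)
assert congruent(5, (1, 0), 3)
```

For $k = 5$ and $m = 3$ the sketch returns $\rho = 1$, and indeed $5 - 1 = 4 = (\tau^3 - 1)(-2 - \tau)$ up to sign, so the result is congruent to $k$ as required.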
One way to protect the scalar reduction from SPA is to perform dummy additions whenever $u = 0$. However, such countermeasures based on dummy operations require more memory and are vulnerable to fault attacks [11]. We propose a countermeasure inspired by the zero-free τ-adic representations from [32,41]. A zero-free representation is obtained by generating the remainders $u$ with the four-case map (3), which selects $u \in \{-1, +1\}$ from the low-order bits of $d_0$ and $d_1$. We observe that during the scalar reduction (which is basically a division by τ), we can generate the remainder bits $u$ as either $1$ or $-1$ throughout the entire for-loop in Algorithm 1. Because $u \neq 0$, a new $(b_0, b_1)$ is always computed in the for-loop and protection against SPA is achieved without dummy operations. The map (3) takes an odd $d_0$ and computes $u$ such that the new $d_0$ after division of $d - u$ by τ is also an odd integer. Algorithm 2 shows our computationally efficient SPA-resistant scalar reduction algorithm. All operations are performed in a word-serial fashion. Since the remainder generation in (3) requires the input $d_0$ to be an odd integer, the lsb of $d_0$ is always set to 1 (in line 3) when the input scalar $k$ is an even integer. In this case, the algorithm computes the reduced scalar of $k + 1$ instead of $k$ and, after the completion of the reduction, the reduced scalar should be decremented by one. Algorithm 2 uses a one-bit register $e$ to implement this requirement. The final subtraction in line 10 uses $e$ as a borrow to the adder/subtracter circuit. In the next section, we show that the subtraction $d_0 - u$ in line 6 also leaks information about $u$ and propose a countermeasure that prevents this.
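The idea of nonzero remainders can be sketched as follows. This is not the authors' Algorithm 2: it uses the plain (non-negated) update rules for readability, assumes $\mu = -1$, and the case split in `psi` is our own derivation from the requirement that the new $d_0$ stays odd. All function names are hypothetical.

```python
MU = -1  # assumption: mu = -1, as for a Koblitz curve with a = 0

def mul(x, y):
    """Product in Z[tau] with tau^2 = MU*tau - 2."""
    return (x[0] * y[0] - 2 * x[1] * y[1],
            x[0] * y[1] + x[1] * y[0] + MU * x[1] * y[1])

def congruent(k, rho, m):
    """True if rho is congruent to the integer k modulo tau^m - 1."""
    t = (1, 0)
    for _ in range(m):
        t = mul(t, (0, 1))
    y = (t[0] - 1, t[1])
    conj = (y[0] + MU * y[1], -y[1])
    norm = y[0] * y[0] + MU * y[0] * y[1] + 2 * y[1] * y[1]
    q = mul((k - rho[0], -rho[1]), conj)
    return q[0] % norm == 0 and q[1] % norm == 0

def psi(d0, d1):
    """For odd d0, pick u in {-1, +1} so that (d - u)/tau again has an odd d0."""
    if d1 & 1:
        return 1 if d0 % 4 == 1 else -1
    return -1 if d0 % 4 == 1 else 1

def zero_free_reduce(k, m):
    """Reduce k modulo tau^m - 1 using only nonzero remainders."""
    e = 1 - (k & 1)        # e = 1 for even k: reduce k + 1, subtract 1 at the end
    d0, d1 = k | 1, 0      # force the lsb of d0 to 1
    a0, a1 = 1, 0
    b0, b1 = 0, 0
    remainders = []
    for _ in range(m):
        u = psi(d0, d1)
        remainders.append(u)
        d0 -= u
        b0, b1 = b0 + u * a0, b1 + u * a1     # always executed since u != 0
        d0, d1 = d1 - d0 // 2, -(d0 // 2)     # division by tau
        a0, a1 = -2 * a1, a0 - a1             # multiplication by tau
    return (b0 + d0 - e, b1 + d1), remainders

rho, us = zero_free_reduce(4, 3)
assert rho == (-1, 1) and congruent(4, rho, 3)
assert set(us) <= {-1, 1}
```

Because every iteration performs the multi-precision update of $(b_0, b_1)$, the sequence of operations no longer depends on the remainder values, which is the property the text relies on.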

Computation of τ-adic Representation
For side-channel attack resistant point multiplication, we use the zero-free τ-adic representation proposed in [32,41] and described in Algorithm 3. In this paper, we add the following improvements to the algorithm.
Computational Optimization. The computation of $b_1$ in line 5 of Algorithm 3 requires a subtraction from zero. Similar to Sect. 3.1, this subtraction can be avoided by computing the negative of the new $(b_0, b_1)$ instead. With this modification, the sign of $(b_0, b_1)$ will be wrong after an odd number of iterations. In order to correct this, the sign of $t_i$ should be flipped for odd $i$ (by multiplying it with $(-1)^i$).
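A minimal sketch of the digit generation may help here. It is a reconstruction under assumptions, not the authors' Algorithm 3 or 4: $\mu = -1$ is assumed, the case split in `psi` is derived from the requirement that $b_0$ stays odd, and the names are ours. The `negated` flag models the optimization of this paragraph (negated updates with the sign of every odd-indexed digit flipped), and `min_len` models the minimum-length condition that the text introduces later as a timing countermeasure.

```python
def psi(b0, b1):
    """For odd b0 and mu = -1, pick u in {-1, +1} so that the next b0 is odd."""
    if b1 & 1:
        return 1 if b0 % 4 == 1 else -1
    return -1 if b0 % 4 == 1 else 1

def zero_free(b0, b1, min_len=0, negated=False):
    """Zero-free tau-adic digits of b0 + b1*tau (b0 odd), least significant first."""
    digits = []
    sign = 1
    while len(digits) < min_len or not (abs(b0) == 1 and b1 == 0):
        u = psi(b0, b1)
        digits.append(sign * u)
        b0 -= u
        if negated:
            b0, b1 = b0 // 2 - b1, b0 // 2     # avoids the subtraction from zero
            sign = -sign                       # stored pair now has the wrong sign
        else:
            b0, b1 = b1 - b0 // 2, -(b0 // 2)  # plain division by tau
    digits.append(sign * b0)                   # terminal value is +1 or -1
    return digits

def evaluate(digits):
    """Horner evaluation of sum(digits[i] * tau^i) for mu = -1."""
    c0, c1 = 0, 0
    for t in reversed(digits):
        c0, c1 = -2 * c1 + t, c0 - c1
    return c0, c1

d = zero_free(7, 3)
assert d == zero_free(7, 3, negated=True)
assert evaluate(d) == (7, 3) and set(d) <= {-1, 1}
```

Both variants yield the same digits because the case selection is an odd function of $(b_0, b_1)$: negating the stored pair negates the selected $u$, and the explicit sign flip undoes this.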
Protection Against SPA. Though point multiplications with zero-free representations are resistant against SPA [32], the generation of τ-adic bits (Algorithm 3) is vulnerable to SPA. In line 3 of Algorithm 3, a remainder $u$ is computed as per the four different cases described in (3) and then subtracted from $b_0$ in line 4. We use the following observations to detect the side-channel vulnerability in this subtraction and to propose a countermeasure against SPA.
1. For Cases 1, 2 and 3 in (3), the subtractions of $u$ are equivalent to flipping two (or one) least significant bits of $b_0$. Hence, actual subtractions are not computed in these cases.
2. For Case 4, the subtraction of $u$ from $b_0$ (i.e., the computation of $b_0 + 1$) involves carry propagation. Hence, an actual multi-precision subtraction is computed in this case.
3. If any iteration of the while-loop in Algorithm 3 meets Case 4, then the new value of $b_1$ will be even. Hence, the while-loop will meet either Case 1 or Case 2 in the next iteration.
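The three observations can be checked with a few lines of Python. The sketch assumes $\mu = -1$ and a case split derived from the requirement that $b_0$ stays odd; the assignment of the first three branches to Cases 1 to 3 is our assumption, whereas the last branch corresponds to Case 4 of the text ($u = -1$ with carry propagation). The function name is hypothetical.

```python
def subtract_remainder(b0, b1):
    """Return (u, b0 - u, needs_adder) for odd b0, using bit flips where possible."""
    low2 = b0 % 4
    if b1 % 2 == 0 and low2 == 1:
        return -1, b0 ^ 0b11, False   # b0 + 1: ...01 -> ...10, flip two bits
    if b1 % 2 == 0 and low2 == 3:
        return 1, b0 ^ 0b01, False    # b0 - 1: ...11 -> ...10, flip one bit
    if low2 == 1:
        return 1, b0 ^ 0b01, False    # b1 odd, b0 - 1: ...01 -> ...00, flip one bit
    return -1, b0 + 1, True           # b1 odd, b0 + 1: ...11 + 1 propagates a carry

u, r, adder = subtract_remainder(7, 3)
assert (u, r, adder) == (-1, 8, True)
# Observation 3: after Case 4, (b0 + 1)/2 is even, so the next b1 is even
assert ((7 + 1) // 2) % 2 == 0
```

Only the last branch uses the multi-precision adder, which is the operation an SPA attacker could distinguish from the cheap bit flips.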
Based on the differences in computation, a side-channel attacker using SPA can distinguish Case 4 from the other three cases. Hence, the attacker can reveal around 25 % of the bits of a zero-free representation. Moreover, the attacker knows that the following τ-adic bits are biased towards $1$ instead of $-1$ with a probability of 1/3. We propose a very low-cost countermeasure that skips this special addition $b_0 + 1$ for Case 4 by merging it with the computation of the new $(b_0, b_1)$ in Algorithm 3. In line 5, we compute the new $b_0$ as $(b_0 - 1)/2 - (b_1 - 1)$, which equals $(b_0 + 1)/2 - b_1$: since $b_1$ is an odd number for Case 4, we can represent it as $\{b_1, 1\}$ and subtract the least significant bit 1 from it without carry propagation. The computation of $b_1 \leftarrow (b_0 + 1)/2$ in line 5 of Algorithm 3 involves a carry propagation and thus an actual addition becomes necessary. We solve this problem by computing $b_1 \leftarrow (b_0 - 1)/2$ instead of the correct value $b_1 \leftarrow (b_0 + 1)/2$ and remembering the difference (i.e., 1) in a flag register $h$. Correctness of the τ-adic representation can be maintained by considering this difference in the future computations that use this wrong value of $b_1$. Now, as per observation 3, the next iteration of the while-loop meets either Case 1 or 2. We adjust the previous difference by computing the new $b_0$ as $(b_0 - u)/2 - b_1 - 1$. In a hardware architecture, this can be computed by setting the borrow input of the adder/subtracter circuit to 1 during the subtraction.
In (6), we show our new map $\Psi(\cdot)$ that computes a remainder $u$ and a new value $h$ of the difference flag following the above procedure. We consider $b_1[0] \oplus h$ (instead of $b_1[0]$ as in (3)) because a wrong $b_1$ is computed in Case 4 and the difference is kept in $h$. The same technique is also applied to protect the subtraction $d_0 - u$ in the scalar reduction in Algorithm 2.

Algorithm 4. SPA-resistant generation of a zero-free τ-adic representation
Protection Against Timing Attack. The terminal condition of the while-loop in Algorithm 3 is dependent on the input scalar. Thus, by observing the timing of the computation, an attacker is able to learn the higher-order bits of a short τ-adic representation. This allows the attacker to narrow down the search domain. We observe that we can continue the generation of zero-free τ-adic bits even when the terminal condition in Algorithm 3 is reached. In this case, the redundant part of the τ-adic representation is equivalent to the value of $b_0$ when the terminal condition was reached for the first time; hence, the result of the point multiplication remains correct. For example, starting from $(b_0, b_1) = (1, 0)$, the algorithm generates an intermediate zero-free representation $-\tau - 1$ and again reaches the terminal condition with $(b_0, b_1) = (-1, 0)$. The redundant representation $-\tau^2 - \tau - 1$ is equivalent to 1. If we continue, then the next terminal condition is again reached after generating another two bits. In this paper, we generate zero-free τ-adic representations that have lengths always larger than or equal to $m$ of the field $\mathbb{F}_{2^m}$. To implement this feature, we added the terminal condition $i < m$ to the while-loop. In Algorithm 4, we describe an algorithm for generating zero-free representations that applies the proposed computational optimizations and countermeasures against SPA and timing attacks. The while-loops of both Algorithms 3 and 4 require $b_0$ to be an odd integer. When the input $\rho$ has an even $b_0$, then an adjustment is made by adding one to $b_0$ and adding (subtracting) one to (from) $b_1$ when $b_1$ is even (odd). This adjustment is recorded in a flag $f$ in the following way: if $b_0$ is odd, then $f = 0$; otherwise $f = 1$ or $f = 2$ depending on whether $b_1$ is even or odd, respectively. At the end of a point multiplication, this flag is checked and $(\tau + 1)P$ or $(-\tau + 1)P$ is subtracted from the point multiplication result if $f = 1$ or $f = 2$, respectively.
This compensates the initial addition of $(\tau + 1)$ or $(-\tau + 1)$ to the reduced scalar $\rho$ described in line 2 of Algorithm 4.

Algorithm 5. Zero-free point multiplication with side-channel countermeasures
Input: An integer $k$, the base point $P = (x, y)$, a random element $r \in \mathbb{F}_{2^m}$
Output: The result point $Q = kP$
$(t, f) \leftarrow \mathrm{Convert}(k)$ ; /* Alg. 2 and 4 */

Point Multiplication
We base the point multiplication algorithm on the use of the zero-free representation discussed in Sect. 3. We give our modification of the point multiplication algorithm of [32,41] with window size $w = 2$ in Algorithm 5. The algorithm includes countermeasures against SPA, DPA, and timing attacks as well as inherent resistance against safe-error fault attacks. Implementation details of each operation used by Algorithm 5 are given in Appendix A. Below, we give a high-level description.

Line 1 computes the zero-free representation $t$ given an integer $k$ using Algorithms 2 and 4. It outputs a zero-free expansion of length $\ell$ with $t_i \in \{-1, +1\}$ represented as an $\ell$-bit vector and a flag $f$. Lines 2 and 3 perform the precomputations by computing $P_{+1} = \phi(P) + P$ and $P_{-1} = \phi(P) - P$. Lines 4 and 5 initialize the accumulator point $Q$ depending on the length of the zero-free expansion. If the length is odd, then $Q$ is set to $\pm P$ depending on the msb $t_{\ell-1}$. If the length is even, then $Q$ is initialized with $\pm\phi(P) \pm P$ by using the precomputed points depending on the values of the two msb's $t_{\ell-1}$ and $t_{\ell-2}$. Line 6 randomizes $Q$ by using a random element $r \in \mathbb{F}_{2^m}$ as suggested by Coron [9]. This randomization offers protection against DPA and attacks that calculate hypotheses about the values of $Q$ based on its known initial value (e.g., the doubling attack [12]). Lines 7 to 10 iterate the main loop of the algorithm by observing two bits of the zero-free expansion on each iteration. Each iteration begins in line 8 by computing two Frobenius endomorphisms. Line 9 either adds or subtracts $P_{+1} = (x_{+1}, y_{+1})$ or $P_{-1} = (x_{-1}, y_{-1})$ to or from $Q$ depending on the values of $t_i$ and $t_{i+1}$ processed by the iteration. It is implemented by using the equations from [2], which compute a point addition in mixed affine and López-Dahab [27] coordinates. Point addition and subtraction are carried out with exactly the same pattern of operations (see Appendix A). Lines 11 and 12 correct the adjustments that ensure that $b_0$ is odd before starting the generation of the zero-free representation (see Sect. 3.2). Line 13 retrieves the affine point of the result point $Q$.
The pattern of operations in Algorithm 5 is almost constant. The side-channel properties of the conversion (line 1) were discussed in Sect. 3. The precomputation (lines 2 and 3) is fixed and operates only on the base point, which is typically public. The initialization of $Q$ (lines 4 and 5) can be carried out with a constant pattern of operations with the help of dummy operations. The randomization of $Q$ protects from differential power analysis (DPA) and comparative side-channel attacks (e.g., the doubling attack [12]). The main loop operates with a fixed pattern of operations on a randomized $Q$, offering protection against SPA and DPA. Lines 11 and 12 depend on $t$ (and, thus, $k$) but they leak at most one bit to an adversary who can determine whether they were computed or not. This leakage can be prevented with a dummy operation. Although the algorithm includes dummy operations, it also offers good protection against safe-error fault attacks. The reason is that the main loop does not involve any dummy operations and, hence, even an attacker who is able to distinguish dummy operations learns only a few bits of information (at most, the lsb and the msb and whether the length is odd or even). Hence, fault attacks that aim to reveal secret information by distinguishing dummy operations are not a viable attack strategy.
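The control flow of the window-2 scan can be illustrated with a toy model in which a point is replaced by an element of $\mathbb{Z}[\tau]$ and the Frobenius endomorphism by a multiplication by τ. This is only a sketch under assumptions ($\mu = -1$, our own function names); it omits the randomization of $Q$, the flag correction, the dummy operations and all curve arithmetic, and it is not the authors' Algorithm 5.

```python
MU = -1  # assumption: mu = -1

def frob(q):
    """Toy Frobenius map: multiplication by tau in Z[tau]."""
    return (-2 * q[1], q[0] + MU * q[1])

def add(p, q):
    return (p[0] + q[0], p[1] + q[1])

def neg(p):
    return (-p[0], -p[1])

def zero_free_mult(t, P=(1, 0)):
    """Window-2 scan of zero-free digits t (least significant first)."""
    p_plus = add(frob(P), P)          # precomputation: phi(P) + P
    p_minus = add(frob(P), neg(P))    # precomputation: phi(P) - P
    n = len(t)
    if n % 2:                         # odd length: Q = +-P from the msb
        Q = P if t[-1] == 1 else neg(P)
        i = n - 2
    else:                             # even length: Q = +-phi(P) +- P from two msb's
        R = p_plus if t[-1] == t[-2] else p_minus
        Q = R if t[-1] == 1 else neg(R)
        i = n - 3
    while i >= 1:                     # same operation pattern in every iteration
        Q = frob(frob(Q))             # two Frobenius maps
        R = p_plus if t[i] == t[i - 1] else p_minus
        Q = add(Q, R if t[i] == 1 else neg(R))   # one addition or subtraction
        i -= 2
    return Q

def horner(t, P=(1, 0)):
    """Reference: digit-by-digit Frobenius-and-add."""
    Q = (0, 0)
    for digit in reversed(t):
        Q = add(frob(Q), P if digit == 1 else neg(P))
    return Q

digits = [-1, -1, -1]                 # the expansion -tau^2 - tau - 1 from Sect. 3
assert zero_free_mult(digits) == horner(digits) == (1, 0)
```

Each loop iteration executes two Frobenius maps followed by exactly one addition or subtraction of a precomputed value, regardless of the digit values, which mirrors the constant operation pattern discussed above.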

Architecture
In this section, we describe the hardware architecture (Fig. 1) of our ECC coprocessor for 16-bit microcontrollers such as TI MSP430F241x or MSP430F261x [40]. Such families of low-power microcontrollers have at least 4 KB of RAM and can run at a 16 MHz clock. We connect our coprocessor to the microcontroller using a memory-mapped interface [35] following the drop-in concept from [42], where the coprocessor is placed on the bus between the microcontroller and the RAM and memory access is controlled with multiplexers. The coprocessor consists of the following components: an arithmetic and logic unit (ALU), an address generation unit, a shared memory and a control unit composed of hierarchical finite state machines (FSMs).
The Arithmetic and Logic Unit (ECC-ALU) has a 16-bit data path and is used for both integer and binary field computations. The ECC-ALU is interfaced with the memory block using an input register pair $(R_1, R_2)$ and an output multiplexer. The central part of the ECC-ALU consists of a 16-bit integer adder/subtracter circuit, a 16-bit binary multiplier and two binary adders. A small Reduction-ROM contains several constants that are used during modular reductions. Finally, the output multiplexer is used to store the contents of the registers CL, T and a masked version of CL in the memory block; the masked version sets the msb's of the most significant word of an element to zero.
The Memory Block is a single-port RAM which is shared by the ECC coprocessor and the 16-bit microcontroller. Each 283-bit element of $\mathbb{F}_{2^{283}}$ requires 18 16-bit words totaling 288 bits. The coprocessor requires storage for 14 elements of $\mathbb{F}_{2^{283}}$ (see Appendix A), which gives 4032 bits of RAM (252 16-bit words). Some of these variables are reused for different purposes during the conversion.

The Address Unit generates address signals for the memory block. A small Base-ROM is used to keep the base addresses for storing different field elements in the memory. During any integer operation or binary field operation, the two address registers $RdB_1$ and $RdB_2$ in the address unit are loaded with the base addresses of the input operands. Similarly, the base addresses for writing intermediate or final results in the memory block are provided in the register $WtB_1$ and in the output from the Base-ROM ($WtB_2$). The adder circuit of the address block is an 8-bit adder which computes the physical address from a read/write offset value and a base address.
The Control Unit consists of a set of hierarchical FSMs that generate control signals for the blocks described above. The FSMs are described below.
(1) Scalar Conversion uses the part of the ECC-ALU shown by the red dashed polygon in Fig. 1. The computations controlled by this FSM are mainly integer additions, subtractions and shifts. During any addition or subtraction, the words of the operands are first loaded in the register pair $(R_1, R_2)$. The result-word is computed using the integer adder/subtracter circuit and stored in the accumulator register CL. During a right-shift, $R_2$ is loaded with the operand-word and $R_1$ is cleared. Then the lsb of the next higher word of the operand is stored in the one-bit register LSB. Now the integer adder is used to add the shifted value $\{LSB, R_2/2\}$ with $R_1$ to get the shifted word. One scalar conversion requires around 78,000 cycles.
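The word-serial shift and addition described above can be modeled as follows. The sketch assumes unsigned operands stored as little-endian lists of 16-bit words; the function names are ours, and sign handling as well as the actual register transfers of the coprocessor are not modeled.

```python
WORD = 16
MASK = (1 << WORD) - 1

def to_words(x, n):
    """Split a non-negative integer into n little-endian 16-bit words."""
    return [(x >> (WORD * i)) & MASK for i in range(n)]

def from_words(words):
    return sum(w << (WORD * i) for i, w in enumerate(words))

def shift_right_1(words):
    """Right-shift by one bit, one word at a time."""
    out = []
    for i, r2 in enumerate(words):
        # lsb of the next higher word becomes the msb of the shifted word
        lsb = words[i + 1] & 1 if i + 1 < len(words) else 0
        out.append((lsb << (WORD - 1)) | (r2 >> 1))   # {LSB, R2/2} + R1 with R1 = 0
    return out

def add_words(x, y):
    """Multi-precision addition with a carry chained between 16-bit words."""
    out, carry = [], 0
    for a, b in zip(x, y):
        s = a + b + carry
        out.append(s & MASK)
        carry = s >> WORD
    return out, carry

x = 0x123456789ABCDEF1
assert from_words(shift_right_1(to_words(x, 5))) == x >> 1
```

Processing one word per step is what keeps the datapath at 16 bits, at the price of a cycle count that grows with the operand length.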
(2) Binary Field Primitives use the registers and the portion of the ECC-ALU outside the red-dashed polygon in Fig. 1.
- Field addition sequentially loads two words of the operands in $R_2$, then multiplies the words by 1 (from the Reduction-ROM) and finally calculates the result-word in CL after accumulation. One field addition requires 60 cycles.
- Field multiplication uses the word-serial comb method [13]. It loads the words of the operands in $R_1$ and $R_2$, then multiplies the words and finally accumulates. After the completion of the comb multiplication, a modular reduction is performed, requiring mainly left-shifts and additions. The left-shifts are performed by multiplying the words with the values from the Reduction-ROM. One field multiplication requires 829 cycles.
- Field squaring computes the square of an element of $\mathbb{F}_{2^{283}}$ in linear time by squaring its words. The FSM first loads a word in both $R_1$ and $R_2$ and then squares the word by using the binary multiplier. After squaring the words, the FSM performs a modular reduction. The modular reduction is shared with the field multiplication FSM. One field squaring requires 200 cycles.
- Field inversion uses the Itoh-Tsujii algorithm [17] and performs field multiplications and squarings following an addition chain (1, 2, 4, 8, 16, 17, 34, 35, 70, 140, 141, 282) for $\mathbb{F}_{2^{283}}$. One inversion requires 65,241 cycles.
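The inversion can be illustrated with a bit-level software model that follows the addition chain listed above. The sketch assumes the NIST reduction polynomial $x^{283} + x^{12} + x^7 + x^5 + 1$ for the 283-bit binary field, which the text does not state explicitly, and it uses a simple shift-and-add multiplication instead of the word-serial comb method of the coprocessor. All names are ours.

```python
M = 283
# Assumption: NIST reduction polynomial for the 283-bit binary field
POLY = (1 << 283) | (1 << 12) | (1 << 7) | (1 << 5) | 1
CHAIN = (1, 2, 4, 8, 16, 17, 34, 35, 70, 140, 141, 282)

def gf_mul(a, b):
    """Shift-and-add multiplication in F_2^283 (polynomial basis, bits as coefficients)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached 283: reduce
            a ^= POLY
    return r

def gf_sqr_n(a, n):
    """Square a repeatedly, n times."""
    for _ in range(n):
        a = gf_mul(a, a)
    return a

def itoh_tsujii_inverse(a):
    """Inverse of a nonzero element: a^(2^283 - 2) computed along CHAIN."""
    beta = {1: a}           # beta[u] = a^(2^u - 1)
    for prev, cur in zip(CHAIN, CHAIN[1:]):
        step = cur - prev   # every step is itself an earlier chain element
        # a^(2^(prev+step) - 1) = (a^(2^prev - 1))^(2^step) * a^(2^step - 1)
        beta[cur] = gf_mul(gf_sqr_n(beta[prev], step), beta[step])
    return gf_mul(beta[282], beta[282])   # final squaring gives a^(2^283 - 2)

a = 0x1234567
assert gf_mul(a, itoh_tsujii_inverse(a)) == 1
```

Following this chain takes 11 multiplications (one per chain step) and 282 squarings (the step sizes sum to 281, plus the final squaring).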
(3) Point Operations and Point Multiplication are implemented by combining an FSM with a hardwired program ROM. The program ROM includes subprograms for all operations of Algorithm 5 and the address of the ROM is controlled by the FSM in order to execute Algorithm 5 (see Appendix A for details). Algorithm 5 is executed so that the microcontroller initializes the addresses reserved for the accumulator point $Q$ with the base point $(x, y)$ and the random element $r$ by writing $(X, Y, Z) \leftarrow (x, y, r)$. The scalar $k$ is written into the RAM before the microcontroller issues a start point multiplication command. When this command is received, the reduction part of the conversion is executed, followed by the computation of the msb(s) of the zero-free expansion. After this, the precomputations are performed by using $(x, y)$ and the results are stored into the RAM. The initialization of $Q$ is performed by writing either $P_{+1}$ or $P_{-1}$ in $(X, Y)$ if the length of the expansion is even; otherwise, a dummy write is performed. Similarly, the sign of $Q$ is changed if $t_{\ell-1} = -1$ and a dummy operation is computed otherwise. The main loop first executes two Frobenius endomorphisms and then issues an instruction that computes the next two bits of the zero-free expansion. By using these bits, either a point addition or a point subtraction is computed with $P_{+1}$ or $P_{-1}$. One iteration of the main loop takes 9,537 clock cycles. In the end, the affine coordinates of the result point are retrieved and they become available for the microcontroller in the addresses for the $X$ and $Y$ coordinates of $Q$.

Results and Comparisons
We described the architecture of Sect. 5 by using mixed Verilog and VHDL and simulated it with ModelSim SE 6.6d. We synthesized the code with Synopsys Design Compiler D-2010.03-SP4 using the regular compile for UMC 130 nm CMOS with voltage of 1.2 V by using Faraday FSC0L low-leakage standard cell libraries. The area given by the synthesis is 4,323 GE including everything in Fig. 1 except the single-port RAM. Computing one point multiplication requires in total 1,566,000 clock cycles including the scalar conversion. The power consumption at 16 MHz is 97.70 μW which gives an energy consumption of approximately 9.56 μJ per point multiplication. Table 1 summarizes our synthesis results together with several other lightweight ECC implementations from the literature.
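The reported latency and energy follow directly from the cycle count, the clock frequency and the power figure; the short calculation below reproduces them (the variable names are ours).

```python
cycles = 1_566_000        # clock cycles per point multiplication, incl. conversion
f_clk = 16e6              # clock frequency in Hz
power = 97.70e-6          # power consumption in W at 16 MHz

latency = cycles / f_clk  # seconds per point multiplication
energy = power * latency  # joules per point multiplication

print(f"latency = {latency * 1e3:.3f} ms")
print(f"energy  = {energy * 1e6:.2f} uJ")
```

This gives a latency of about 97.9 ms, which the text rounds to 98 ms, and an energy of about 9.56 μJ.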
Among all lightweight ECC processors available in the literature, the processor from [4] is the closest counterpart to our implementation because it is so far the only one that uses Koblitz curves. Even so, it differs from our architecture in many ways, which makes a fair comparison difficult. The most obvious difference is that the processor from [4] is designed for the less secure Koblitz curve NIST K-163. The architecture of [4] also differs from ours in many fundamental ways: they use a finite field over normal basis instead of polynomial basis, they use a bit-serial multiplier that requires all bits of both operands to be present during the entire multiplication instead of the word-serial architecture that we use, they store all variables in registers embedded into the processor architecture instead of an external RAM, and they do not provide support for scalar conversions or any countermeasures against side-channel attacks. They also provide implementation results on 65 nm CMOS. Our architecture is significantly more scalable for different Koblitz curves because, besides control logic and RAM requirements, other parts remain almost the same, whereas the entire multiplier needs to be changed for [4]. It is also hard to see how scalar conversions or side-channel countermeasures could be integrated into the architecture of [4] without significant increases in both area and latency.
Table 1 also includes implementations that use the binary curve B-163 and the prime curve P-160 from [31]. The area of our coprocessor is on the level of the smallest coprocessors available in the literature. Hence, the effect of selecting a 283-bit elliptic curve instead of a less secure curve is negligible in terms of area. The price to pay for higher security comes in the form of memory requirements and computation latency. The amount of memory is not a major issue because our processor shares the memory with the microcontroller, which typically has a large memory (e.g., TI MSP430F241x and MSP430F261x have at least 4 KB RAM [40]). The computation time is also on the same level as other published implementations because our coprocessor is designed to run at the relatively high clock frequency of the microcontroller, which is 16 MHz.
In this work our main focus was to investigate the feasibility of lightweight implementations of Koblitz curves for applications demanding high security. To enable a somewhat fair comparison with the existing lightweight implementations over $\mathbb{F}_{2^{163}}$, Table 1 provides estimates (see Appendix B) for area and cycles of ECC coprocessors that follow the design decisions presented in this paper and perform point multiplications on curves B-163 or K-163. Our estimates show that our coprocessors for both B-163 and K-163 require more cycles than [43], which also uses a 16-bit ALU. The reason is that [43] uses a dual-port RAM, whereas our implementation uses a single-port RAM (as it works as a coprocessor of the MSP430). Moreover, [43] has a dedicated squarer circuit to minimize the cycle count of squaring. Table 1 also provides cycle and area estimates for a modified version of the coprocessor that performs point multiplications using the Montgomery ladder on the NIST curve B-283. The estimated cycle count is calculated from the cycle counts of the field operations described in Sect. 5. From the estimated value, we see that a point multiplication on B-283 requires nearly 23.5 % more time. However, the coprocessor for B-283 is smaller by around 550 GE as no scalar conversion is needed.
Although application-specific integrated circuits are the primary targets for our coprocessor, it may also be useful for FPGA-based implementations whenever small ECC designs are needed. Hence, we also compiled our coprocessor for the Xilinx Spartan-6 XC6SLX4-2TQG144 FPGA by using the Xilinx ISE 13.4 Design Suite. After place & route, it requires only 209 slices (634 LUTs and 309 registers) and runs at clock frequencies up to 106.598 MHz.
Our coprocessor significantly improves speed, both classical and side-channel security, memory footprint, and energy consumption compared to leading lightweight software implementations [3,10,16,21,38]. For example, [10] reports a highly optimized assembly implementation running on a 32-bit Cortex-M0+ processor clocked at 48 MHz that computes a point multiplication on the less secure Koblitz curve K-233 without strong side-channel countermeasures. It computes a point multiplication in 59.18 ms (177.54 ms at 16 MHz) and consumes 34.16 μJ of energy.
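The frequency normalization used in this comparison is a simple linear scaling, which assumes that the cycle count of the software implementation does not change with the clock frequency. The sketch below reproduces the 16 MHz figure and, as a rough indication only, forms the ratios against our coprocessor; the helper name `scale_latency_ms` is chosen for this example.

```python
def scale_latency_ms(latency_ms, f_from_mhz, f_to_mhz):
    """Rescale a latency to another clock, assuming a fixed cycle count."""
    return latency_ms * f_from_mhz / f_to_mhz


# Software implementation of [10]: K-233 on Cortex-M0+ at 48 MHz.
sw_latency_16mhz_ms = scale_latency_ms(59.18, 48, 16)
sw_energy_uj = 34.16

# Our coprocessor: K-283 at 16 MHz.
hw_latency_ms = 1_566_000 / 16e6 * 1e3
hw_energy_uj = 9.56

latency_ratio = sw_latency_16mhz_ms / hw_latency_ms
energy_ratio = sw_energy_uj / hw_energy_uj

if __name__ == "__main__":
    print(f"software at 16 MHz: {sw_latency_16mhz_ms:.2f} ms")
    print(f"latency ratio (software / coprocessor): {latency_ratio:.2f}")
    print(f"energy ratio (software / coprocessor):  {energy_ratio:.2f}")
```

The ratios compare a 233-bit curve in software against a 283-bit curve in hardware, so they understate the difference at equal security level.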

Conclusions
In this paper we showed that implementing point multiplication on a high security 283-bit Koblitz curve is feasible with extremely low resources, making it viable for various lightweight applications. We also showed that Koblitz curves can be used in such applications even when the cryptosystem requires scalar conversions. Besides these contributions, we improved the scalar conversion by applying several optimizations and countermeasures against side-channel attacks. Finally, we designed a very lightweight architecture in only 4.3 kGE that can be used as a coprocessor for commercial 16-bit microcontrollers. Hence, we showed that Koblitz curves are feasible also for lightweight ECC even with on-the-fly scalar conversions and strong countermeasures against side-channel attacks.

A Implementation of Operations Used by Algorithm 5
The operations required by Algorithm 5 are implemented by combining an FSM and a program ROM. The program ROM includes subprograms for all operations of Algorithm 5 and the FSM sets the address of the ROM to the first instruction of the subprogram according to the phase of the algorithm and $t_{i+1}$, $t_i$. Table 2 shows the contents of the program ROM. The operations required by Algorithm 5 are in this ROM as follows:
- Line 0 obtains the next bits of the zero-free representation.
- Lines 1-23 perform the precomputation that computes $(x_{+1}, y_{+1}) = \phi(P) + P$ and $(x_{-1}, y_{-1}) = \phi(P) - P$.
- Line 24 computes the negative of Q during the initialization and Line 25 is the corresponding dummy operation.
- Lines 26-28 randomize the projective coordinates of Q by using the random $r \in \mathbb{F}_{2^{283}}$ which is stored in Z.
- Lines 29-34 compute two Frobenius endomorphisms for Q.
Point addition and point subtraction are computed with exactly the same sequence of operations. This is achieved by introducing an initialization which sets the values of three internal variables $x_p$, $y_p$, and $y_m$ according to Table 3 (these are in lines 35-46 in Table 2). This always requires two copy instructions followed by an addition. After this initialization, both point addition and point subtraction are computed with a common sequence of operations which adds the point $(x_p, y_p)$ to Q. The element $y_m$ is the y-coordinate of the negative of $(x_p, y_p)$ and it is also used during the point addition.
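The initialization can be illustrated with a small sketch. Table 3 is not reproduced here, so the assignment below is an assumption based on the standard fact that the negative of an affine point (x, y) on a binary curve $y^2 + xy = x^3 + ax^2 + b$ is (x, x + y). Field elements are modelled as Python integers holding polynomial-basis bit vectors, so field addition is XOR; the function names are hypothetical.

```python
def negate(x, y):
    """Negative of an affine point on a binary curve: -(x, y) = (x, x + y)."""
    return x, x ^ y


def init_operands(x, y, subtract):
    """Set (x_p, y_p, y_m) for adding (subtract=False) or subtracting
    (subtract=True) the precomputed point (x, y).

    Both cases use two copies and one field addition; only the destinations
    of the copied and the computed y-values are swapped. The Python branch is
    for illustration only and is not a constant-time construction.
    """
    x_p = x          # copy
    y_sum = x ^ y    # field addition in GF(2^m)
    if subtract:
        y_p, y_m = y_sum, y   # add -(x, y); its negative has y-coordinate y
    else:
        y_p, y_m = y, y_sum   # add (x, y); its negative has y-coordinate x + y
    return x_p, y_p, y_m


if __name__ == "__main__":
    x, y = 0b1011, 0b0110  # toy values, not points on a real curve
    print(init_operands(x, y, subtract=False))
    print(init_operands(x, y, subtract=True))
```

In both cases $y_m$ equals the y-coordinate of the negative of $(x_p, y_p)$, so the common addition sequence that follows does not need to know which case was selected.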

B Estimates for B-163 and K-163
Our estimated cycle count for scalar multiplication over $\mathbb{F}_{2^{163}}$ is based on the following facts:
1. A field element in $\mathbb{F}_{2^{163}}$ requires 11 16-bit words and, hence, is smaller by a factor of 0.61 than a field element in $\mathbb{F}_{2^{283}}$. Since field addition and squaring have linear complexity, we estimate that the cycle counts for these operations scale down by a factor of around 0.61 and become 37 and 122, respectively. In a similar way, we estimate that field multiplication (which has quadratic complexity) scales down to 309 cycles. A field inversion operation following an addition chain (1, 2, 4, 5, 10, 20, 40, 81, 162) requires nearly 22,700 cycles.
2. The for-loop in the scalar reduction operation (Algorithm 2) executes 163 times in $\mathbb{F}_{2^{163}}$ and performs linear operations such as additions/subtractions and shifting. Moreover, the length of the τ-adic representation of a scalar reduces to 163 (thus reducing by a factor of 0.57 in comparison to $\mathbb{F}_{2^{283}}$). So, we estimate that the cycle count for scalar conversion scales down by a factor of 0.57 × 0.61 and requires nearly 27,000 cycles.
3. One Frobenius-and-add operation over $\mathbb{F}_{2^{283}}$ in Algorithm 5 takes 9,537 cycles in total, of which 6,632 cycles are spent in eight quadratic-time field multiplications and the remaining 2,905 cycles are spent in linear-time operations. After scaling down, the cycle count for one Frobenius-and-add operation over $\mathbb{F}_{2^{163}}$ can be estimated to be around 4,250. The point multiplication loop iterates nearly 82 times for a τ-adic representation of length 164. Hence, the number of cycles spent in this loop can be estimated to be around 348,500.
4. The precomputation and the final conversion steps are mainly dominated by the cost of field inversions. Hence, the cycle counts can be estimated to be around 45,400.
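The arithmetic behind these estimates can be retraced with a short script. It uses only the numbers stated in the list above; the variable names are chosen for this example, and the final sum covers just the three itemized contributions, so it need not coincide with the entry reported in Table 1.

```python
import math

WORD_BITS = 16


def words(field_bits):
    """Number of 16-bit words needed for one field element."""
    return math.ceil(field_bits / WORD_BITS)


# Item 1: word-count ratio between the 163-bit and 283-bit fields.
scale = words(163) / words(283)          # 11 / 18, about 0.61

# Item 3: one Frobenius-and-add over the 283-bit field.
FROB_ADD_283 = 9537
MULT_PART_283 = 6632                     # eight field multiplications
LINEAR_PART_283 = FROB_ADD_283 - MULT_PART_283

# Scale down: eight multiplications at the estimated 309 cycles each,
# plus the linear-time part scaled by the factor 0.61 used in the text.
MULT_163 = 309
frob_add_163 = 8 * MULT_163 + 0.61 * LINEAR_PART_283

# Main loop: two digits per iteration, rounded estimate of 4,250 cycles.
iterations = 164 // 2
loop_cycles = iterations * 4250

# Items 2 and 4 as stated in the text.
CONVERSION_CYCLES = 27_000
PRECOMP_AND_FINAL_CYCLES = 45_400

itemized_total = CONVERSION_CYCLES + loop_cycles + PRECOMP_AND_FINAL_CYCLES

if __name__ == "__main__":
    print(f"scale factor: {scale:.2f}")
    print(f"Frobenius-and-add over the 163-bit field: {frob_add_163:.0f} cycles")
    print(f"main loop: {loop_cycles} cycles")
    print(f"sum of itemized contributions: {itemized_total} cycles")
```

The unrounded Frobenius-and-add estimate comes out slightly below 4,250, which is consistent with the rounded value used for the loop total.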