Skip to content
This repository was archived by the owner on Mar 16, 2023. It is now read-only.
This repository was archived by the owner on Mar 16, 2023. It is now read-only.

Research notes, SIMD implementation, faster multiplication and GF(2^m) #1

@HarryR

Description

@HarryR

I also found this really useful too: https://www.iacr.org/workshops/ches/ches2013/presentations/CHES2013_Session6_1.pdf (although it focuses a lot on binary curves, rather than prime)

However, it seems like it can't beat the MULX, ADOX, ADCX sequence as described by: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

And this covers squaring using MULX, ADOX and ADCX sequences: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/large-integer-squaring-ia-paper.pdf

This repository, in addition to https://github.com/cloudflare/sidh/blob/master/p503/arith_amd64.s seems to be enough to adapt libff with the MULX + ADOX/ADCX optimisations, batching loads and then a batch of interleaved adox/adcx for best pipeline usage.

From https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/polynomial-multiplication-instructions-paper.pdf (pages 13-15) - The CLMUL instruction looks like it's unlikely to be able to outperform the above without being able to represent the G(p) field over G(2^n) (I can't find an appropriate irreducible polynomial,... I don't think that's possible to easily represent some arbitrary prime over a binary field)

You would assume that an optimising compiler could produce something similar to the following for fq_impl_int128:

Karatsuba Multiplication 128-bit x 128-bit = 256-bit

// c1c0 = a * b
void GF2m_mul_2x2_xmm(__m128i *c1, __m128i *c0, __m128i b,
__m128i a)
{
 __m128i t1, t2;
 *c0 = _mm_clmulepi64_si128(a, b, 0x00);
 *c1 = _mm_clmulepi64_si128(a, b, 0x11);
 t1 = _mm_shuffle_epi32(a, 0xEE);
 t1 = _mm_xor_si128(a, t1);
 t2 = _mm_shuffle_epi32(b, 0xEE);
 t2 = _mm_xor_si128(b, t2);
 t1 = _mm_clmulepi64_si128(t1, t2, 0x00);
 t1 = _mm_xor_si128(*c0, t1);
 t1 = _mm_xor_si128(*c1, t1);
 t2 = t1;
 t1 = _mm_slli_si128(t1, 8);
 t2 = _mm_srli_si128(t2, 8);
 *c0 = _mm_xor_si128(*c0, t1);
 *c1 = _mm_xor_si128(*c1, t2);
}

Karatsuba Multiplication 256-bit x 256-bit = 512-bit

// c3c2c1c0 = a1a0 * b1b0
{
 a1 = _mm_set_epi64x(a->d[3], a->d[2]);
 a0 = _mm_set_epi64x(a->d[1], a->d[0]);
 b1 = _mm_set_epi64x(b->d[3], b->d[2]);
 b0 = _mm_set_epi64x(b->d[1], b->d[0]);
 GF2m_mul_2x2_xmm(&c1, &c0, a0, b0);
 GF2m_mul_2x2_xmm(&c3, &c2, a1, b1);
 a0 = _mm_xor_si128(a0, a1);
 b0 = _mm_xor_si128(b0, b1);
 GF2m_mul_2x2_xmm(&c5, &c4, a0, b0);
 c4 = _mm_xor_si128(c4, c0);
 c4 = _mm_xor_si128(c4, c2);
 c5 = _mm_xor_si128(c5, c1);
 c5 = _mm_xor_si128(c5, c3);
 c1 = _mm_xor_si128(c1, c4);
 c2 = _mm_xor_si128(c2, c5);
}

Quick Squaring (256-bit)^2 = 512-bit

// c3c2c1c0 = a1a0 * a1a0
{
 c0 = _mm_clmulepi64_si128(a0, a0, 0x00);
 c1 = _mm_clmulepi64_si128(a0, a0, 0x11);
 c2 = _mm_clmulepi64_si128(a1, a1, 0x00);
 c3 = _mm_clmulepi64_si128(a1, a1, 0x11);
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions