Day 1
Introduction to NEON
- Clarifying the resources shared by NEON and the scalar floating point engine
- Explaining the AArch32 and AArch64 differences
- NEON Register banks
- S, D and Q registers (AArch32)
- B, H, S, D and V registers (AArch64)
- Data types
- Vector vs scalar
- Related system registers
- Alignment issues
- Enabling NEON
- Differences between NEONv7 and NEONv8
NEON instruction syntax
- Instructions producing wider / narrower results
- Instruction modifiers
- Selecting the shape
- Selecting the operand / result type
- Syntax flexibility
- Declaring initialized vectors in C language
- Using unions with vectors and arrays of vectors to simplify debugging
- Casting vectors
Data transfer instructions
- Move
- Swap
- Table lookup
- Vector transpose
- Vector zip / unzip
- Data transfer between NEON and integer unit
Example: managing audio samples
Exercise: Using load with de-interleaving instructions to store all right-channel samples into one vector and all left-channel samples into another
Exercise: Clarifying narrow and long instructions, building a vector from bytes selected from a pair of vectors
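As a minimal sketch of the de-interleaving exercise above: the helper name `split_stereo` and the L,R,L,R sample order are assumptions made for illustration, not part of the course material. The NEON path is only compiled when the compiler defines `__ARM_NEON`; elsewhere the scalar loop does the same work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: split interleaved 16-bit stereo samples
 * (assumed order L,R,L,R,...) into separate left and right arrays.
 * "frames" is the number of L/R pairs. */
void split_stereo(const int16_t *interleaved, int16_t *left,
                  int16_t *right, size_t frames)
{
    assert(interleaved != NULL && left != NULL && right != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    /* vld2q_s16 loads 16 samples and de-interleaves them:
     * val[0] receives the even elements, val[1] the odd ones. */
    for (; i + 8 <= frames; i += 8) {
        int16x8x2_t lr = vld2q_s16(interleaved + 2 * i);
        vst1q_s16(left + i, lr.val[0]);
        vst1q_s16(right + i, lr.val[1]);
    }
#endif
    /* Scalar tail (and full fallback when NEON is unavailable). */
    for (; i < frames; i++) {
        left[i] = interleaved[2 * i];
        right[i] = interleaved[2 * i + 1];
    }
}
```

A single structure load replaces a separate load plus unzip, which is why the de-interleaving forms are usually the first thing to try for multi-channel data.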
Arithmetic Instructions
- Arithmetic instructions
- Add, modulo vs saturated arithmetic
- Halving / Doubling the result
- Rounding
- Subtract
- Multiply
- Multiply accumulate / Multiply subtract
- Absolute value
- Min / Max
Exercise: Implementing a complex multiply accumulate with NEON
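One possible shape for the complex multiply-accumulate exercise is sketched below. The function name `complex_mac` and the interleaved (re, im) float layout are assumptions; the course may use a different layout or data type. The NEON path relies on `vld2q_f32` to separate real and imaginary parts, then on `vmlaq_f32` / `vmlsq_f32` for the accumulate steps.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: acc[i] += a[i] * b[i] for n complex numbers,
 * each stored as two consecutive floats (re, im). */
void complex_mac(float *acc, const float *a, const float *b, size_t n)
{
    assert(acc != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        /* val[0] holds four real parts, val[1] four imaginary parts. */
        float32x4x2_t va = vld2q_f32(a + 2 * i);
        float32x4x2_t vb = vld2q_f32(b + 2 * i);
        float32x4x2_t vacc = vld2q_f32(acc + 2 * i);
        /* re += ar*br - ai*bi */
        vacc.val[0] = vmlaq_f32(vacc.val[0], va.val[0], vb.val[0]);
        vacc.val[0] = vmlsq_f32(vacc.val[0], va.val[1], vb.val[1]);
        /* im += ar*bi + ai*br */
        vacc.val[1] = vmlaq_f32(vacc.val[1], va.val[0], vb.val[1]);
        vacc.val[1] = vmlaq_f32(vacc.val[1], va.val[1], vb.val[0]);
        vst2q_f32(acc + 2 * i, vacc);
    }
#endif
    /* Scalar tail (and full fallback when NEON is unavailable). */
    for (; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        acc[2 * i] += ar * br - ai * bi;
        acc[2 * i + 1] += ar * bi + ai * br;
    }
}
```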
- Conversion instructions
- Converting Floating Point numbers into Fixed point numbers
- Converting Fixed point numbers into Floating point numbers
Exercise: Converting fixed-point elements into single precision floating point values and adding the resulting elements
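The conversion exercise could look roughly like the sketch below. The Q16.16 format, the `Q_FRAC_BITS` macro and the function name `sum_q16_as_float` are assumptions chosen for illustration. `vcvtq_n_f32_s32` takes the number of fraction bits as a compile-time constant, and the final reduction uses only pairwise / half-vector adds so that it builds for both AArch32 and AArch64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#define Q_FRAC_BITS 16 /* assumed Q16.16 fixed-point format */

/* Hypothetical helper: convert n fixed-point values to float and sum them. */
float sum_q16_as_float(const int32_t *q, size_t n)
{
    assert(q != NULL);
    float sum = 0.0f;
    size_t i = 0;
#if defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        int32x4_t fixed = vld1q_s32(q + i);
        /* Fixed-point to float conversion, dividing by 2^Q_FRAC_BITS. */
        acc = vaddq_f32(acc, vcvtq_n_f32_s32(fixed, Q_FRAC_BITS));
    }
    /* Reduce the four lanes to a single value. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    sum = vget_lane_f32(half, 0);
#endif
    /* Scalar tail (and full fallback when NEON is unavailable). */
    for (; i < n; i++)
        sum += (float)q[i] / (float)(1 << Q_FRAC_BITS);
    return sum;
}
```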
- Advanced arithmetic instructions
- Reciprocal estimate, reciprocal square root estimate, Newton-Raphson algorithm
- Pairwise instructions
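To make the reciprocal estimate / Newton-Raphson item concrete, here is a hedged sketch. `recip_refine` is a hypothetical scalar model of one refinement step, x' = x * (2 - d * x), which is what `vrecpsq_f32` followed by a multiply computes per lane. The guarded `recip_f32x4` helper, also hypothetical, applies two such steps to the hardware estimate; how many steps are needed for a given accuracy is something to verify on the target.

```c
#include <assert.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar model of one Newton-Raphson step refining x toward 1/d. */
float recip_refine(float d, float x)
{
    assert(d != 0.0f);
    return x * (2.0f - d * x);
}

#if defined(__ARM_NEON)
/* Four reciprocals at once: coarse estimate, then two refinements. */
float32x4_t recip_f32x4(float32x4_t d)
{
    float32x4_t x = vrecpeq_f32(d);      /* low-precision estimate of 1/d */
    x = vmulq_f32(x, vrecpsq_f32(d, x)); /* vrecpsq_f32 yields 2 - d*x    */
    x = vmulq_f32(x, vrecpsq_f32(d, x));
    return x;
}
#endif
```

Each step roughly squares the relative error, so a poor starting guess of 0.2 for 1/4 is already very close to 0.25 after three scalar steps.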
Day 2
Logic and Bitfield Instructions
- Element comparison
- Logic instructions
- Logical AND, Bit Clear, OR, XOR
- Operations with immediate values
- Bitfield instructions
- Count Leading zeros, ones, signs
- Bitwise insert instructions
- Conditional bitwise insert instructions, avoiding branches
- Shifts with possible rounding and saturation
- Bitfield reverse
Exercise: Transposing a matrix, shifting a large bitmap using vector instructions
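The "conditional bitwise insert, avoiding branches" item can be illustrated with a small sketch. The function name `select_greater` is an assumption; the idea is that a comparison produces an all-ones or all-zeros mask per element, and a bitwise select (`vbslq_s32` on NEON) picks each result without a branch. The scalar loop mirrors the same mask arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = a[i] if a[i] > b[i], else b[i],
 * computed with masks instead of branches. */
void select_greater(const int32_t *a, const int32_t *b,
                    int32_t *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);
        int32x4_t vb = vld1q_s32(b + i);
        uint32x4_t mask = vcgtq_s32(va, vb); /* all ones where a > b */
        vst1q_s32(out + i, vbslq_s32(mask, va, vb));
    }
#endif
    /* Scalar tail using the same mask-and-merge idea. */
    for (; i < n; i++) {
        uint32_t mask = 0u - (uint32_t)(a[i] > b[i]);
        uint32_t bits = ((uint32_t)a[i] & mask) | ((uint32_t)b[i] & ~mask);
        out[i] = (int32_t)bits;
    }
}
```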
NEON Cryptography Extension
- The Cryptography extension
- Algorithms supported
Optimizing techniques
- Automatic vectorization
- Tuning loops for optimal results
- Avoid loop feedback
- Avoid loop-dependent conditionals
- Avoid early termination
- Padding loops
Exercise: Experimenting with loop auto-vectorization
- Pointers and arrays
- Indirect addressing
- Pointer aliasing and restrict
Exercise: Using restrict to eliminate dependencies
- Function calls and inlining
Exercise: Making promises to help the compiler optimize
- Avoiding data dependencies
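As a small illustration of the pointer aliasing topic above: without `restrict`, the compiler generally has to assume that a store to `dst` may change what `src` points to, which can block or complicate auto-vectorization. The function name `scale_accumulate` is hypothetical, and whether a given compiler actually vectorizes this loop depends on its version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dst[i] += gain * src[i].
 * The restrict qualifiers are a promise by the caller that dst and src
 * never overlap, so iterations are independent of each other. */
void scale_accumulate(float *restrict dst, const float *restrict src,
                      float gain, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] += gain * src[i];
}
```

Calling this with overlapping buffers breaks the promise and gives undefined behavior, so the qualifier belongs only on interfaces where the caller can guarantee separation.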
NEON coding examples
- FIR filter
- Converting the scalar algorithm into a vector algorithm
- Finding the NEON instructions to encode the vector algorithm
- Optimizing the code
- Using the performance monitor to tune the algorithm
- FFT (DFT)
- Converting the scalar algorithm into a vector algorithm, understanding how circle properties can be used to process 4 angles concurrently
- Finding the NEON instructions to encode the vector algorithm
- Optimizing the code
- Using the performance monitor to tune the algorithm
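The scalar-to-vector conversion of the FIR filter might start from something like the sketch below. The function name `fir_f32`, the float data type and the non-reversed form out[i] = sum over k of coeff[k] * in[i + k] are assumptions for illustration; the course implementation may differ. The NEON path computes four consecutive outputs at once, multiplying four input samples by one scalar coefficient per step with `vmlaq_n_f32`.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = sum_k coeff[k] * in[i + k].
 * "in" must hold at least n_out + taps - 1 samples. */
void fir_f32(const float *in, const float *coeff, float *out,
             size_t n_out, size_t taps)
{
    assert(in != NULL && coeff != NULL && out != NULL && taps > 0);
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n_out; i += 4) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (size_t k = 0; k < taps; k++) {
            /* Four neighbouring inputs times one coefficient. */
            acc = vmlaq_n_f32(acc, vld1q_f32(in + i + k), coeff[k]);
        }
        vst1q_f32(out + i, acc);
    }
#endif
    /* Scalar tail (and full fallback when NEON is unavailable). */
    for (; i < n_out; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++)
            acc += coeff[k] * in[i + k];
        out[i] = acc;
    }
}
```

Vectorizing across outputs rather than across taps avoids a horizontal reduction per output, which is one of the choices the optimization step can then revisit.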