Day 1
Introduction to NEON
 Clarifying the resources shared by NEON and the scalar floating-point engine
 Explaining the AArch32 and AArch64 differences
 NEON Register banks
 S, D and Q registers (AArch32)
 B, H, S, D and V registers (AArch64)
 Data types
 Vector vs scalar
 Related system registers
 Alignment issues
 Enabling NEON
 Differences between NEONv7 and NEONv8
NEON instruction syntax
 Instructions producing wider / narrower results
 Instruction modifiers
 Selecting the shape
 Selecting the operand / result type
 Syntax flexibility
 Declaring initialized vectors in C
 Using unions with vectors and arrays of vectors to simplify debugging
 Casting vectors
Data transfer instructions
 Move
 Swap
 Table lookup
 Vector transpose
 Vector zip / unzip
 Data transfer between NEON and integer unit
 Practical lab: clarifying narrow and long instructions, building a vector from bytes selected from a pair of vectors
Exercise: 
Example: managing audio samples 
Exercise: 
Using load with deinterleaving instructions to store all right lane samples into a vector and left lane samples into another vector 
Exercise: 
Clarifying narrow and long instructions, building a vector from bytes selected from a pair of vectors 
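
The deinterleaving exercise above can be sketched as follows. This is a minimal illustration rather than the course solution: the function name `deinterleave_stereo`, the left-sample-first layout and the multiple-of-8 frame count are all assumptions made for the example. On NEON targets it uses the `vld2q_s16` intrinsic from `arm_neon.h`; elsewhere it falls back to an equivalent scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Split interleaved stereo samples (L0 R0 L1 R1 ...) into two planar buffers.
 * `frames` is the number of L/R pairs; this sketch assumes a multiple of 8.
 * The left-first ordering is an assumption, not something the outline states. */
void deinterleave_stereo(const int16_t *interleaved,
                         int16_t *left, int16_t *right, size_t frames)
{
    assert(frames % 8 == 0);
#if defined(__ARM_NEON)
    for (size_t i = 0; i < frames; i += 8) {
        /* vld2q_s16 loads 16 samples and deinterleaves them with stride 2:
         * val[0] receives the even-indexed elements, val[1] the odd-indexed ones. */
        int16x8x2_t lr = vld2q_s16(interleaved + 2 * i);
        vst1q_s16(left + i, lr.val[0]);
        vst1q_s16(right + i, lr.val[1]);
    }
#else
    /* Scalar fallback producing the same result on targets without NEON. */
    for (size_t i = 0; i < frames; ++i) {
        left[i] = interleaved[2 * i];
        right[i] = interleaved[2 * i + 1];
    }
#endif
}
```

A single structure load replaces a plain load followed by an unzip, which is why the deinterleaving loads are usually the first choice for multi-channel data.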
Arithmetic Instructions
 Arithmetic instructions
 Add, modulo vs saturated arithmetic
 Halving / Doubling the result
 Rounding
 Subtract
 Multiply
 Multiply accumulate / Multiply subtract
 Absolute value
 Min / Max
Exercise: 
Implementing a complex multiply accumulate with NEON 
 Conversion instructions
 Converting floating-point numbers into fixed-point numbers
 Converting fixed-point numbers into floating-point numbers
Exercise: 
Converting fixed-point elements into single-precision floating-point values and adding the resulting elements 
 Advanced arithmetic instructions
 Reciprocal estimate, reciprocal square root estimate, Newton-Raphson algorithm
 Pairwise instructions
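
One possible shape for the complex multiply accumulate exercise in this section is sketched below. The function name `complex_mla`, the interleaved (re, im) storage and the multiple-of-4 element count are assumptions chosen for the example. The NEON path relies on `vld2q_f32`, `vmlaq_f32`, `vmlsq_f32` and `vst2q_f32`; a scalar fallback computes the same values on other targets.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* acc[k] += a[k] * b[k] for n complex numbers stored as interleaved
 * (re, im) float pairs. This sketch assumes n is a multiple of 4. */
void complex_mla(float *acc, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
#if defined(__ARM_NEON)
    for (size_t i = 0; i < 2 * n; i += 8) {
        /* Deinterleaving loads: val[0] = real parts, val[1] = imaginary parts. */
        float32x4x2_t va = vld2q_f32(a + i);
        float32x4x2_t vb = vld2q_f32(b + i);
        float32x4x2_t vacc = vld2q_f32(acc + i);
        /* re += a.re * b.re - a.im * b.im */
        vacc.val[0] = vmlaq_f32(vacc.val[0], va.val[0], vb.val[0]);
        vacc.val[0] = vmlsq_f32(vacc.val[0], va.val[1], vb.val[1]);
        /* im += a.re * b.im + a.im * b.re */
        vacc.val[1] = vmlaq_f32(vacc.val[1], va.val[0], vb.val[1]);
        vacc.val[1] = vmlaq_f32(vacc.val[1], va.val[1], vb.val[0]);
        /* Re-interleave the results on the way back to memory. */
        vst2q_f32(acc + i, vacc);
    }
#else
    /* Scalar fallback for targets without NEON. */
    for (size_t k = 0; k < n; ++k) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        acc[2 * k] += ar * br - ai * bi;
        acc[2 * k + 1] += ar * bi + ai * br;
    }
#endif
}
```

Separating real and imaginary parts at load time keeps every lane doing the same operation, so four complex products need only four multiply accumulate / multiply subtract instructions.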

Day 2
Logic and Bitfield Instructions
 Element comparison
 Logic instructions
 Logical AND, Bit Clear, OR, XOR
 Operations with immediate values
 Bitfield instructions
 Count leading zeros, ones, signs
 Bitwise insert instructions
 Conditional bitwise insert instructions, avoiding branches
 Shifts with possible rounding and saturation
 Bitfield reverse
Exercise: 
Transposing a matrix, shifting a large bitmap using vector instructions 
NEON Cryptography Extension
 The Cryptography extension
 Algorithms supported
Optimizing techniques
 Automatic vectorization
 Tuning loops for optimal results
 Avoid loop feedback
 Avoid loop-dependent conditionals
 Avoid early termination
 Padding loops
Exercise: 
Experimenting with loop autovectorization 
 Pointers and arrays
 Indirect addressing
 Pointer aliasing and restrict
Exercise: 
Using restrict to eliminate dependencies 
 Function calls and inlining
Exercise: 
Making promises to help the compiler optimize 
 Avoiding data dependencies
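
The `restrict` exercise in this section can be illustrated with a loop like the one below. The function name `scale_add` is hypothetical. Whether a given compiler actually vectorizes the loop depends on the toolchain and its optimization flags, so this is a sketch of the promise being made, not a guarantee about the generated code.

```c
#include <stddef.h>

/* dst[i] += k * src[i] for n elements.
 * The restrict qualifiers (C99) promise the compiler that dst and src never
 * overlap, so a store to dst[i] cannot change any later src[j]. Without that
 * promise the compiler has to assume a possible dependency between iterations
 * and may either skip vectorization or emit a runtime overlap check. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```

Calling such a function with overlapping buffers breaks the promise and results in undefined behavior, so the qualifier belongs only on interfaces where the caller can guarantee distinct arrays.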
NEON coding examples
 FIR filter
 Converting the scalar algorithm into a vector algorithm
 Finding the NEON instructions to encode the vector algorithm
 Optimizing the code
 Using the performance monitor to tune the algorithm
 FFT (DFT)
 Converting the scalar algorithm into a vector algorithm, understanding how circle properties can be used to process 4 angles concurrently
 Finding the NEON instructions to encode the vector algorithm
 Optimizing the code
 Using the performance monitor to tune the algorithm
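
As a rough illustration of converting the scalar FIR algorithm into a vector algorithm, the sketch below computes four outputs per iteration and broadcasts one coefficient per tap. The function name `fir_block4`, the buffer layout (the input holds `taps - 1` history samples before the first current sample) and the multiple-of-4 output count are assumptions made for this example. It is an unoptimized starting point, not a tuned implementation.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* y[n] = sum over k of h[k] * x[n + taps - 1 - k], for n in [0, n_out).
 * x must hold n_out + taps - 1 samples: taps - 1 history samples followed
 * by n_out current samples. This sketch assumes n_out is a multiple of 4. */
void fir_block4(const float *x, const float *h, size_t taps,
                float *y, size_t n_out)
{
    assert(n_out % 4 == 0);
    for (size_t n = 0; n < n_out; n += 4) {
#if defined(__ARM_NEON)
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (size_t k = 0; k < taps; ++k) {
            /* Four consecutive inputs feed four consecutive outputs,
             * all multiplied by the same coefficient h[k]. */
            float32x4_t xs = vld1q_f32(x + n + taps - 1 - k);
            acc = vmlaq_n_f32(acc, xs, h[k]);
        }
        vst1q_f32(y + n, acc);
#else
        /* Scalar fallback with the same four-lane structure. */
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t k = 0; k < taps; ++k)
            for (size_t lane = 0; lane < 4; ++lane)
                acc[lane] += h[k] * x[n + lane + taps - 1 - k];
        for (size_t lane = 0; lane < 4; ++lane)
            y[n + lane] = acc[lane];
#endif
    }
}
```

Vectorizing across outputs rather than across taps avoids a horizontal reduction at the end of each output, which is one common reason to restructure the scalar loop before choosing instructions.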
