
RC1 NEON programming

This course explains how to use NEON SIMD instructions to boost multimedia algorithms

  • This course has been designed for programmers who want to run multimedia algorithms on the NEON Single Instruction Multiple Data (SIMD) execution units.
  • Each instruction family is detailed, first at assembly level, and then at C level using the intrinsics defined in the arm_neon.h file.
  • Several non-obvious uses of the data-processing instructions are presented.
  • Vector and vector element load / store instructions are studied, and guidelines for organizing data in memory are provided to minimize the number of memory accesses.
  • The underlying cache operation as well as the preload mechanisms (preload instructions and hardware prefetch) are detailed to explain how processing can be pipelined.
  • The course shows how typical DSP algorithms such as FIR and FFT can be vectorized and then optimized to run on the NEON unit.
  • An overview of the evolution of NEON between ARMv7 (Cortex-A9 for example) and ARMv8 in 64-bit mode (Cortex-A53 for example) is also provided.
Labs are compiled with GCC and run on a Linux Cortex-A9 board or on a simulator.
A more detailed course description is available on request at training@ac6-training.com
  • Prerequisite: knowledge of the ARMv7 instruction set.

  • Clarifying the resources shared by NEON and VFP
  • Register bank, Q registers, D registers
  • Data types
  • Vector vs scalar
  • Related system registers
  • Alignment issues
  • Enabling NEON/VFP
  • Differences between NEONv7 and NEONv8
  • Instructions producing wider / narrower results
  • Instruction modifiers
  • Selecting the shape
  • Selecting the operand / result type
  • Syntax flexibility
  • Declaring initialized vectors in C language
  • Using unions with vectors and arrays of vectors to simplify debugging
  • Casting vectors
  • Addressing modes
  • Vector load / store
  • Vector load / store multiple
  • Element and structure load / store instructions
    • Multiple single elements
    • Single element to 1 lane
    • Single elements to all lanes
  • Optimizing the ordering of data in memory to take advantage of 2-, 3- and 4-element structures
    • Example: managing audio samples
  • Processor acceleration mechanisms: store merging buffers
    • Practical lab: using load with de-interleaving instructions to store all right-channel samples into one vector and all left-channel samples into another vector
  • Move
  • Swap
  • Table lookup
  • Vector transpose
  • Vector zip / unzip
  • Data transfer between NEON and integer unit
    • Practical lab: clarifying narrow and long instructions, building a vector from bytes selected from a pair of vectors
  • Logical AND, Bit Clear, OR, XOR
  • Operations with immediate values
  • Bitwise insert instructions, avoiding branches
  • Count leading zeros, count set bits, count leading sign bits
  • Normalizing floating point numbers when VFP is not implemented
  • Scalar duplicate
  • Extract
  • Shift with possible rounding and saturation
  • Bitfield reverse
    • Practical lab: Transposing a matrix, shifting a large bitmap using vector instructions
  • Add, modulo vs saturated arithmetic
  • Halving / Doubling the result
  • Rounding
  • Subtract
  • Multiply
  • Multiply accumulate / Multiply subtract
  • Absolute value
  • Min / Max
  • Converting floating-point numbers into fixed-point numbers
  • Converting fixed-point numbers into floating-point numbers
  • Reciprocal estimate, reciprocal square root estimate, Newton-Raphson algorithm
  • Pairwise instructions
  • Element comparison
    • Practical lab: implementing a complex multiply accumulate with NEON
    • Practical lab: converting fixed-point elements into single-precision floating-point values and adding the resulting elements
  • FIR filter
    • Converting the scalar algorithm into a vector algorithm
    • Finding the NEON instructions to encode the vector algorithm
    • Optimizing the code
    • Using the performance monitor to tune the algorithm
  • FFT (DFT)
    • Converting the scalar algorithm into a vector algorithm, understanding how circle properties can be used to process 4 angles concurrently
    • Finding the NEON instructions to encode the vector algorithm
    • Optimizing the code
    • Using the performance monitor to tune the algorithm
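
As an illustration of the "scalar algorithm to vector algorithm" step listed for the FIR filter, here is a minimal sketch in C. It is not taken from the course material: the function names `fir_scalar` and `fir_vector` are hypothetical, and the convention y[n] = sum over k of h[k] * x[n + k] (with the caller supplying n_out + n_taps - 1 input samples) is an assumption made for the example. The vector path uses the arm_neon.h intrinsics `vld1q_f32`, `vdupq_n_f32`, `vmlaq_n_f32` and `vst1q_f32`, and the code falls back to the scalar loop when NEON is not available at compile time.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Scalar reference (hypothetical helper):
 * y[n] = sum_k h[k] * x[n + k], for n in [0, n_out).
 * Assumption: x holds at least n_out + n_taps - 1 samples. */
void fir_scalar(const float *x, const float *h, float *y,
                size_t n_out, size_t n_taps)
{
    for (size_t n = 0; n < n_out; ++n) {
        float acc = 0.0f;
        for (size_t k = 0; k < n_taps; ++k)
            acc += h[k] * x[n + k];
        y[n] = acc;
    }
}

/* Vector version (hypothetical helper): computes four consecutive
 * outputs per iteration, one output per lane of a Q register. */
void fir_vector(const float *x, const float *h, float *y,
                size_t n_out, size_t n_taps)
{
    size_t n = 0;
#if HAVE_NEON
    for (; n + 4 <= n_out; n += 4) {
        float32x4_t acc = vdupq_n_f32(0.0f);          /* 4 accumulators */
        for (size_t k = 0; k < n_taps; ++k) {
            float32x4_t xv = vld1q_f32(&x[n + k]);    /* x[n+k] .. x[n+k+3] */
            acc = vmlaq_n_f32(acc, xv, h[k]);         /* acc += xv * h[k] */
        }
        vst1q_f32(&y[n], acc);                        /* y[n] .. y[n+3] */
    }
#endif
    /* Remaining outputs (or all of them without NEON) use the scalar loop. */
    fir_scalar(x + n, h, y + n, n_out - n, n_taps);
}
```

This is only the first step of the sequence in the outline. The inner loop still reloads overlapping input data for every tap, which is the kind of memory traffic the later "optimizing the code" step would address. With GCC on an ARMv7 target, NEON code generation typically has to be enabled explicitly, for example with `-mfpu=neon`.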