PCSX2 Documentation/PCSX2 EE Recompiler

From PCSX2 Wiki
Latest revision as of 22:49, 13 January 2016

WORK IN PROGRESS. I think it is about time that I started contributing to this project ;)

This page summarizes my understanding of the EE recompiler. The idea is to collect the key information about the recompiler. It is interesting as general information, and it would be useful for porting or improving the recompiler one day.

External Reference

Global Overview of the EE recompiler

Other useful documentation

  1. Introduction to Dynamic Recompilation
  2. Recompilers: All 'dems buzzwords?


The 3 recompiler phases

  1. The recompilation phase:
    The purpose is to compile an EE instruction list into an x86 instruction list (also known as an instruction block). Instructions are stored in a buffer called x86Ptr, which can be seen as an instruction cache.
  2. The execution phase:
    The x86 instruction block will be executed.
  3. The pause phase:
    The purpose is to emulate the other HW blocks (VU, GIF, DMA, etc.). In particular, EE interrupts are handled here.
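The three phases can be pictured as the loop below. This is a toy model, not PCSX2 code. ToyCpu, recompileBlock, runUntil and the fixed 8-cycle event period are hypothetical. A std::function stands in for an x86 block, and the pause phase only counts how often it would run.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Toy model of the three phases; every name here is hypothetical.
struct ToyCpu {
    uint32_t pc = 0;
    uint32_t cycle = 0;
    uint32_t nextEventCycle = 8;   // assumed: run the pause phase every 8 cycles
    int recompiles = 0;
    int pauses = 0;
    std::vector<std::function<void(ToyCpu&)>> cache;  // "x86 blocks", indexed by pc / 4
};

// Phase 1 stand-in: "compile" the block at pc. The toy program is an
// endless loop of two one-instruction blocks (0x0 -> 0x4 -> 0x0 ...).
std::function<void(ToyCpu&)> recompileBlock(uint32_t pc) {
    uint32_t next = (pc == 0) ? 4 : 0;
    return [next](ToyCpu& c) { c.pc = next; c.cycle += 2; };
}

void runUntil(ToyCpu& cpu, uint32_t maxCycle) {
    while (cpu.cycle < maxCycle) {
        uint32_t idx = cpu.pc / 4;
        if (idx >= cpu.cache.size()) cpu.cache.resize(idx + 1);
        if (!cpu.cache[idx]) {                   // 1. recompilation phase (cache miss only)
            cpu.cache[idx] = recompileBlock(cpu.pc);
            ++cpu.recompiles;
        }
        cpu.cache[idx](cpu);                     // 2. execution phase
        if (cpu.cycle >= cpu.nextEventCycle) {   // 3. pause phase: VU, GIF, DMA, interrupts
            ++cpu.pauses;
            cpu.nextEventCycle += 8;
        }
    }
}
```

Each block is compiled only once. Every later visit goes straight to the execution phase, which is why the instruction buffer behaves like a cache.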

Memory And Buffer

An important part of the recompiler is the management of the various blocks (x86, EE, etc.). The schematic below shows the links between them.

[Figure: Recompiler_block.png]


PS2 Virtual space

The PS2 virtual space contains the EE instructions. An instruction is always 4 bytes. An instruction can be anywhere in the 4GB address space of the PS2 (yes, 4GB). Of course, those 4GB are physically mapped onto a 32MB RAM or a 4MB ROM. The address of the current instruction is stored in a hardware register named PC (Program Counter).


Recompiler lookup table

The recLUT (recompiler lookup table) gives the BASEBLOCK related to a PC. The LUT splits the virtual address space into 64kB pages. A convenient macro retrieves the current BASEBLOCK from a PC.

  • C++ macro: PC_GETBLOCK(PC)
  • C++ array: __aligned16 uptr recLUT[_64kb]
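The following is a simplified sketch of the PC to BASEBLOCK lookup. It assumes that each recLUT entry simply points to the BASEBLOCK slots of one 64kB page. recLUT_sketch, pcGetBlock and mapPage are hypothetical names, and the real PC_GETBLOCK macro may fold the page offset into the table entry so that no masking is needed. There is one slot per 4-byte instruction, hence 16384 slots per page.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct BASEBLOCK { uintptr_t m_pFnptr; };   // simplified: only the x86 entry point

constexpr uint32_t kPageShift = 16;                         // 64kB pages
constexpr uint32_t kPageCount = 1u << (32 - kPageShift);    // 65536 pages cover 4GB
constexpr uint32_t kBlocksPerPage = (1u << kPageShift) / 4; // one slot per 4-byte instruction

// Hypothetical simplified recLUT: one pointer per 64kB page of virtual
// space, pointing to that page's BASEBLOCK slots (nullptr = unmapped).
static BASEBLOCK* recLUT_sketch[kPageCount];

// Map one 64kB virtual page onto an array of kBlocksPerPage BASEBLOCKs.
void mapPage(uint32_t vpage, BASEBLOCK* slots) { recLUT_sketch[vpage] = slots; }

// Equivalent of PC_GETBLOCK(pc) in this model.
BASEBLOCK* pcGetBlock(uint32_t pc) {
    BASEBLOCK* page = recLUT_sketch[pc >> kPageShift];
    return page ? &page[(pc & 0xFFFFu) >> 2] : nullptr;
}
```

If two virtual pages are mapped to the same slot array, mirrored views of the same physical RAM share their BASEBLOCKs. Examples of such mirrors are the kuseg address 0x00100000 and the kseg0 address 0x80100000. In this sketch, unmapped pages return nullptr.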

The hwLUT (hardware lookup table) gives the physical PS2 address for a PS2 virtual address. Again, a convenient macro is available.

  • C++ macro: HWADDR(mem_address)
  • C++ array: __aligned16 uptr hwLUT[_64kb]

Important note: these LUTs are based on the TLB mapping set up when the EE kernel boots. They don't take later TLB updates into account, so they won't work if your game or program (Linux, homebrew) relies on TLB management. In that situation, your best bet is to use the interpreter.


Base Block

The recompiler maps the EE memory into 3 different BASEBLOCK arrays.

  • C++ array: BASEBLOCK *recRAM
  • C++ array: BASEBLOCK *recROM
  • C++ array: BASEBLOCK *recROM1


A BASEBLOCK is little more than a function pointer. It points to one of:

  • the x86 instruction buffer (the block is already compiled)
  • the JITCompile dispatcher, which will
    1. Call recRecompile(cpuRegs.pc) to recompile the current block
    2. Jump to the recompiled block PC_GETBLOCK(cpuRegs.pc)->m_pFnptr()
  • the JITCompileInBlock dispatcher.
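The JITCompile trick can be modeled in plain C++. Every BASEBLOCK starts out pointing to a compile stub. The stub "recompiles", patches the BASEBLOCK, and jumps to the result, so later dispatches go straight to the compiled code. This is a hypothetical sketch: a C++ function stands in for the emitted x86 code, and a hash map stands in for recRAM/recLUT.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical model: a block "function pointer" is a C++ function taking the PC.
using BlockFn = void (*)(uint32_t pc);
struct BASEBLOCK { BlockFn m_pFnptr; };

static std::unordered_map<uint32_t, BASEBLOCK> blocks;  // stand-in for recRAM/recLUT
static int compiled = 0;
static int executed = 0;

void compiledBlock(uint32_t) { ++executed; }  // stands for emitted x86 code

BASEBLOCK& getBlock(uint32_t pc);

// JITCompile stand-in: recompile, patch the BASEBLOCK, then jump to it.
void jitCompile(uint32_t pc) {
    ++compiled;
    getBlock(pc).m_pFnptr = compiledBlock;  // recRecompile(pc) would emit x86 here
    getBlock(pc).m_pFnptr(pc);              // PC_GETBLOCK(pc)->m_pFnptr()
}

// Every BASEBLOCK starts out pointing to the JITCompile stub.
BASEBLOCK& getBlock(uint32_t pc) {
    auto it = blocks.find(pc);
    if (it == blocks.end()) it = blocks.emplace(pc, BASEBLOCK{jitCompile}).first;
    return it->second;
}

// DispatcherReg-like entry: look up the BASEBLOCK and call its pointer.
void dispatch(uint32_t pc) { getBlock(pc).m_pFnptr(pc); }
```

Because compilation happens lazily on the first call, the dispatcher never has to test whether a block exists. It always performs the same indirect call.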


Lazy Allocation

Recompiler buffers are lazy-allocated to reduce the physical memory footprint. During init, contiguous virtual memory ranges are reserved to hold the various buffers. However, this memory can't be accessed yet (no read or write permission). A write to a buffer triggers a segmentation fault, which is caught by a dedicated handler. The handler grants access permission, and the memory is therefore allocated. To limit the number of segmentation faults to handle, buffers are committed in big chunks (typically a quarter of the maximum size).
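The reserve-then-fault scheme can be reproduced with POSIX calls on Linux. This is a minimal sketch, not PCSX2's handler. The 64MB reservation, the quarter-size chunks, and the names reserveLazyBuffer and onSegv are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <signal.h>
#include <sys/mman.h>

// Assumed sizes: reserve 64MB of address space, commit it in quarters.
constexpr size_t kReserveSize = size_t(64) << 20;
constexpr size_t kChunkSize = kReserveSize / 4;

static uint8_t* g_base = nullptr;

// On a fault inside the reservation, open up the whole chunk and return:
// the faulting write is re-executed and now succeeds. mprotect is not
// formally async-signal-safe, but this pattern is common on Linux.
static void onSegv(int sig, siginfo_t* info, void*) {
    uintptr_t addr = reinterpret_cast<uintptr_t>(info->si_addr);
    uintptr_t base = reinterpret_cast<uintptr_t>(g_base);
    if (g_base && addr >= base && addr < base + kReserveSize) {
        size_t chunk = (addr - base) / kChunkSize;
        if (mprotect(g_base + chunk * kChunkSize, kChunkSize, PROT_READ | PROT_WRITE) == 0)
            return;
    }
    signal(sig, SIG_DFL);  // not our buffer: the retried access crashes normally
}

// Reserve address space with no access rights and install the handler.
uint8_t* reserveLazyBuffer() {
    void* p = mmap(nullptr, kReserveSize, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    g_base = static_cast<uint8_t*>(p);

    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = onSegv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, nullptr) != 0) return nullptr;
    return g_base;
}
```

With this scheme, only the chunks that are actually written ever become accessible. An untouched quarter never gets read/write permission.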

In my humble opinion, this mechanism is useless nowadays because:

  • the kernel might already do lazy allocation under the hood
  • unused memory will be pushed to swap
  • 32-bit programs on recent machines are limited by the virtual address space, not by physical resources anymore. The former is 2-4GB, whereas a single DDR4 module can already hold 4GB.

This feature could be removed in the future, but it is a very low priority.

Memory Protection

The instruction cache buffer is a cache of the EE program stored in EE memory. Therefore, coherency between the cache and the EE memory must be maintained: a write to EE memory must translate into a discard of the affected cache content, likely followed by a recompilation.

This situation can occur because of self-modifying code or library linking (which patches pointers in RAM).

A naive implementation would instrument every write to detect the corresponding block, but that costs a big penalty on each memory write. Another would check the content of the instruction block at each execution; again, slow. A more complex implementation uses the page fault signal handler mechanism to detect invalid writes. Guess what, we chose the latter.

  • C++ function: void mmap_MarkCountedRamPage( u32 paddr )
  • C++ function: int mmap_GetRamPageInfo( u32 paddr )
  • C++ function: void mmap_ClearCpuBlock( uint offset )
  • C++ function: void dyna_page_reset(u32 start, u32 sz)
  • C++ function: void dyna_block_discard(u32 start, u32 sz)
  • C++ function: void recClear(u32 addr, u32 size)
  • C++ function: void mmap_PageFaultHandler::OnPageFaultEvent( const PageFaultInfo& info, bool& handled )
  • C++ array: u16 manual_page[Ps2MemSize::MainRam >> 12]
  • C++ array: u8 manual_counter[Ps2MemSize::MainRam >> 12]
  • C++ array: vtlb_PageProtectionInfo m_PageProtectInfo[Ps2MemSize::MainRam >> 12]

The Automatic Protection

The EE memory is mapped as 4K Read/Write pages, and a protection status is attached to each page. If the protection is manual, you need to handle it manually (easy, isn't it?); this case is discussed below. Otherwise, the page is marked as Read-Only.

The Write Interception

Now that the EE memory page is Read-Only, any write to it triggers a fault. On Linux, it is delivered as a SIGSEGV (segmentation fault) signal. PCSX2 replaces the default handler with its own, which dispatches the signal to the correct buffer. The buffer will:

  • Set the page back to Read/Write
  • Mark the memory protection as manual
  • Clear the recLUT cache
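The automatic protection and write interception steps can be modeled without real signals. In this hypothetical sketch, ProtectionModel, PageProt and onWrite are illustrative names. onWrite plays the role of the fault handler, and a per-page block counter stands in for the cached recLUT entries.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-page protection model (4kB pages), not PCSX2's vtlb code.
enum class PageProt : uint8_t { Unprotected, ReadOnly, Manual };

struct ProtectionModel {
    std::vector<PageProt> prot;     // one status per 4kB page
    std::vector<int> blocksInPage;  // recompiled blocks cached per page

    explicit ProtectionModel(uint32_t ramSize)
        : prot(ramSize >> 12, PageProt::Unprotected), blocksInPage(ramSize >> 12, 0) {}

    // A block was recompiled from physical address paddr.
    void onBlockCompiled(uint32_t paddr) {
        uint32_t page = paddr >> 12;
        ++blocksInPage[page];
        if (prot[page] == PageProt::Unprotected)
            prot[page] = PageProt::ReadOnly;  // automatic protection (mprotect in a real build)
    }

    // Models the fault handler: returns true if the write hit a Read-Only page.
    bool onWrite(uint32_t paddr) {
        uint32_t page = paddr >> 12;
        if (prot[page] != PageProt::ReadOnly) return false;  // write goes through
        prot[page] = PageProt::Manual;  // page is Read/Write again, checked manually
        blocksInPage[page] = 0;         // cached blocks of the page are cleared
        return true;
    }
};
```

Only the first write to a protected page pays for a fault. Once the page is manual, further writes go through, and detecting stale code becomes the job of the blocks themselves.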

The Manual Protection

When a page is marked as manually protected, all its blocks are recompiled with manual protection code. The protection is an integrity check at the start of the block: it compares the current EE instructions in memory with the instructions that were recompiled. On a mismatch, the block is cleared with the help of the dyna_block_discard function.
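The following is a minimal sketch of the manual integrity check. It assumes that the recompiler keeps a copy of the instructions it translated. ManualBlock, compileManual and checkBeforeRun are hypothetical names. In PCSX2, the check is part of the recompiled block itself rather than a separate C++ call.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical manually protected block: a snapshot of the EE instructions
// seen at compile time, compared against memory before each run.
struct ManualBlock {
    uint32_t startpc = 0;            // byte offset of the block in a toy EE RAM
    std::vector<uint32_t> snapshot;  // instructions seen at compile time
    bool valid = true;
};

ManualBlock compileManual(const std::vector<uint32_t>& ram, uint32_t startpc, uint32_t count) {
    ManualBlock b;
    b.startpc = startpc;
    auto first = ram.begin() + startpc / 4;
    b.snapshot.assign(first, first + count);
    return b;
}

// Returns true if the block may run; on a mismatch it is invalidated
// (the role dyna_block_discard plays in PCSX2).
bool checkBeforeRun(ManualBlock& b, const std::vector<uint32_t>& ram) {
    const uint32_t* current = ram.data() + b.startpc / 4;
    if (std::memcmp(current, b.snapshot.data(), b.snapshot.size() * sizeof(uint32_t)) != 0)
        b.valid = false;
    return b.valid;
}
```

The cost of this check is paid on every execution of the block. This is why manual protection is reserved for the pages that actually received writes.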

The Automatic Re-Protection

A few writes can happen at game startup to initialize some library pointers, and it would be costly to keep all those pages under manual protection. Therefore, a counter was added to the manually protected blocks; it triggers a recompilation later on, which restores the automatic protection. To avoid an endless "WRITE => SIGSEGV => MANUAL RECOMP => RECOMP" loop, the automatic re-protection is dropped after several attempts.
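One way to express the re-protection policy is a per-page counter with a cap. This sketch rests on assumptions: the threshold of 3 and the names PageState, onProtectedWrite and tryReprotect are hypothetical and do not come from PCSX2. PCSX2 tracks per-page counts in arrays such as manual_counter.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical policy: give up on automatic protection after this many faults.
constexpr uint8_t kMaxReprotect = 3;

struct PageState {
    bool manual = false;  // page currently checked by block prologues
    uint8_t faults = 0;   // how many times automatic protection was lost
};

// A write hit an automatically protected page (the SIGSEGV path).
void onProtectedWrite(PageState& p) {
    p.manual = true;
    if (p.faults < 255) ++p.faults;
}

// Later, when the page's blocks are recompiled: try automatic protection
// again, unless the page has already bounced back too often.
bool tryReprotect(PageState& p) {
    if (!p.manual || p.faults >= kMaxReprotect) return false;
    p.manual = false;
    return true;
}
```

A page that is written only during startup goes back to cheap automatic protection. A page that keeps being written, such as one mixing code and data, settles in manual mode after a bounded number of faults.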

Limitation

The biggest limitation is the mix of data and code in the same page. The data could simply be the program's global variables, often placed right after the code. It could also be a thread's stack, or the kernel area used to save the register context. To the protection mechanism, every write to such data looks like a code modification.

Blocks Management

Extended Base Block

The recompiler handles code by block. The standard base block was already discussed above (see Base Block). It is rather primitive and only allows a translation from the EE PC to the x86 buffer address. That is the minimum needed to run, but it isn't enough to manage the blocks, so they are extended to BASEBLOCKEX. This structure contains both the EE PC and the x86 buffer address, with their respective code sizes. Note: EE instructions have a constant size of 4 bytes, so the number of instructions can be stored directly. However, x86 instructions have a variable length, so only a byte length can be stored. An extended base block is created every time a new block is recompiled.
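The struct below shows one possible layout for an extended base block, following the description above. The field names and widths are hypothetical and were not copied from PCSX2.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical extended base block layout.
struct BASEBLOCKEX {
    uint32_t startpc;   // EE PC of the first instruction
    uint16_t size;      // length in EE instructions (each one is 4 bytes)
    uintptr_t fnptr;    // start of the x86 code in the recompiler buffer
    uint32_t x86size;   // length of the x86 code in bytes (variable-length ISA)

    uint32_t endpc() const { return startpc + size * 4u; }  // exclusive end
    bool contains(uint32_t pc) const { return pc >= startpc && pc < endpc(); }
};
```

Storing an instruction count instead of a byte size on the EE side works only because every EE instruction is exactly 4 bytes.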

All extended base blocks are aggregated into a map/array class named BaseBlockArray. It has the following properties:

  1. Entries are indexed with a contiguous index, i.e. if you have 10 elements, they are placed from index 0 to 9.
    • It means that every addition or removal shifts the following elements.
  2. Entries are sorted in ascending order of EE PC.
    • Be aware that blocks can overlap, so the Nth entry can end before the (N-1)th entry.
  3. Lookups are implemented as a binary search, which is easy thanks to the 2 previous properties.
  • C++ class: BaseBlockArray
  • C++ class: BaseBlocks
  • C++ object: static BaseBlocks recBlocks
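The class below is a minimal sorted-array index in the spirit of BaseBlockArray, but it is not its actual code. SortedBlocks and lastStartingAtOrBefore are hypothetical names. std::lower_bound keeps the entries ordered on insertion, and std::upper_bound implements the binary search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct BlockEx { uint32_t startpc; uint32_t size; };  // size in EE instructions

// Hypothetical contiguous index, sorted by ascending startpc.
class SortedBlocks {
    std::vector<BlockEx> v;
public:
    void insert(BlockEx b) {
        auto it = std::lower_bound(v.begin(), v.end(), b.startpc,
            [](const BlockEx& e, uint32_t pc) { return e.startpc < pc; });
        v.insert(it, b);  // shifts every following entry
    }
    // Binary search: index of the last block starting at or before pc, or -1.
    int lastStartingAtOrBefore(uint32_t pc) const {
        auto it = std::upper_bound(v.begin(), v.end(), pc,
            [](uint32_t p, const BlockEx& e) { return p < e.startpc; });
        return it == v.begin() ? -1 : int(it - v.begin()) - 1;
    }
    const BlockEx& operator[](int i) const { return v[i]; }
    int size() const { return int(v.size()); }
};
```

Because blocks can overlap, the block found by the search does not necessarily contain the PC. Consider a block at 0x100 (64 instructions) and a block at 0x120 (2 instructions). A search for 0x180 lands on the block at 0x120, which ends at 0x128. The block that actually covers 0x180 is the earlier one at 0x100, so a real lookup also has to inspect earlier entries.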


Block Link

At the end of a block's execution, there are 2 possibilities:

  • The block was rather big, so you will
    1. update the events (event test)
    2. look up the next block
    3. then execute it
  • The block was small, so you will
    1. look up the next block
    2. then execute it

Often, the next block is known at compile time. So the lookup can be done at compilation, and block1 can jump directly to block2. This saves a couple of instructions per block transition, and small loops really profit from this optimization.

This optimization requires keeping a record of the x86 pointer at the end of the block and of the destination PC. Several blocks can jump to the same PC, so you need a multimap.

There are 2 ways to handle link creation.

  1. You want to link to a block that already exists (let's call it a direct link).
    1. You get the x86 function pointer of the destination block.
    2. You update the jump instruction of the current block to jump to the destination block.
    3. You save the link to handle the removal of the destination block.
  2. You want to link to a block that doesn't exist yet (let's call it a deferred link).
    1. There is no x86 function pointer for the destination block yet.
    2. You update the jump instruction of the current block to jump to the recompiler dispatcher.
    3. You save the link to update the jump instruction later, when the destination block is created. As before, the link information is also kept to handle the removal of the destination block.

Every time a new block is created, you check whether it is the destination of any saved links. For each hit, the link is established. This way, all deferred links are handled.

Every time a block is deleted, you again check whether it is the destination of any saved links. For each hit, the link is deleted. It is important to delete dangling links; otherwise you might jump into the void.
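The link bookkeeping can be sketched with a std::multimap keyed by destination PC. Everything here is hypothetical. A JumpSlot stands for the patchable jump at the end of a compiled block, and kDispatcher stands for the address of the recompiler dispatcher.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <unordered_map>

using CodeAddr = uintptr_t;
constexpr CodeAddr kDispatcher = 0xD15;  // stand-in for the dispatcher address

// Stand-in for the jump instruction at the end of a compiled block.
struct JumpSlot { CodeAddr target = kDispatcher; };

class LinkManager {
    std::unordered_map<uint32_t, CodeAddr> blocks;  // EE PC -> x86 entry
    std::multimap<uint32_t, JumpSlot*> links;       // destination PC -> jumps to patch
public:
    // Direct link if the destination exists, deferred link otherwise.
    void link(JumpSlot* slot, uint32_t destPc) {
        auto it = blocks.find(destPc);
        slot->target = (it != blocks.end()) ? it->second : kDispatcher;
        links.emplace(destPc, slot);  // kept in both cases, for later updates
    }
    // A new block resolves every deferred link that targets it.
    void addBlock(uint32_t pc, CodeAddr code) {
        blocks[pc] = code;
        auto range = links.equal_range(pc);
        for (auto it = range.first; it != range.second; ++it) it->second->target = code;
    }
    // A deleted block sends its incoming jumps back to the dispatcher.
    void removeBlock(uint32_t pc) {
        blocks.erase(pc);
        auto range = links.equal_range(pc);
        for (auto it = range.first; it != range.second; ++it) it->second->target = kDispatcher;
    }
};
```

On deletion, this sketch redirects incoming jumps to the dispatcher and keeps the records, which turns them back into deferred links. Dropping the records, as described above, also works as long as no jump still points at the freed code. Removing the outgoing links of a deleted block is left out for brevity.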

The Block Manager

TO DO !

Code Generation

Dispatcher

PCSX2 contains various dispatchers, for the EE recompiler as well as for the other recompilers. Dispatchers are small hand-written ASM stubs, emitted at runtime by the _DynGen_* functions. They serve 2 purposes:

  1. Wrappers to call a C++ function from the recompiled code
  2. Glue between recompiled blocks

Note: "ExitRecompiledCode" is the exit address/label of the recompiler.

EventTest

  • Generated by _DynGen_DispatcherEvent
  • High-Level Description
  1. Execute "recEventTest"

DispatcherReg

  • Generated by _DynGen_DispatcherReg
  • High-Level Description
  1. Compute the address of BASEBLOCK related to the current PC. It is equivalent to "bb = PC_GETBLOCK(cpuRegs.pc)"
  2. Jump to the related function pointer. It is equivalent to "bb->m_pFnptr()"

JITCompile

  • Generated by _DynGen_JITCompile
  • High-Level Description
  1. Realign the stack
  2. Recompile the current block. It is equivalent to "recRecompile(cpuRegs.pc)"
  3. Execute the just-compiled block. It is equivalent to "DispatcherReg"

JITCompileInBlock

  • Generated by _DynGen_JITCompileInBlock
  • High-Level Description
  1. Realign the stack
  2. Execute JITCompile
  • Note: the purpose is to detect EE instructions that were already compiled inside another block.
  • Note 2: it might be removed soon

EnterRecompiledCode

  • Generated by _DynGen_EnterRecompiledCode
  • High-Level Description
  1. Set up the base frame pointer
  2. Align the stack pointer
  3. Save edi/esi/ebx on the stack
  4. Simulate a function call by pushing the call address and EBP onto the stack
  5. Simulate the stack frame preparation ("push ebp, mov ebp, esp")
  6. Execute DispatcherReg
  7. Handle the return of DispatcherReg call ("leave")
  8. Restore edi/esi/ebx
  9. Destroy the stack ("leave")
  10. Return to C++ world ("ret")
  • Note: I think the stack is simulated to ease stack unwinding for exceptions and the debugger.

DispatchBlockDiscard

  • Generated by _DynGen_DispatchBlockDiscard
  • High-Level Description
  1. Execute "dyna_block_discard"
  2. Exit the recompiler

DispatchPageReset

  • Generated by _DynGen_DispatchPageReset
  • High-Level Description
  1. Execute "dyna_page_reset"
  2. Exit the recompiler

Compilation

EE Memory Emulation