PCSX2 Documentation/PCSX2 Optimization

==Introduction==


[[File:pcsx2-optimization.jpg]]

The rest of the plugins are either not implemented, like USB and FireWire, or are fairly standard across emulators.


==Optimization==


When running, PCSX2 spends most of its time in recompiled code (EE/IOP or VUs). When in the EE recompiler, most components are executed as hardware registers are written through the memory write functions. About every 512 EE cycles, the recompilers call cpuBranchTest to update counters, give the IOP enough time to catch up, and perform other system work. VU programs are usually executed as soon as the program is started through the VIF. If VU1 is executed, the recompilers don't exit until the VU program exits. If VU0 is executed, only 512 VU cycles are run before control is given back to the EE so it can catch up (synchronizing VU0 can be very tricky). The IOP recompiler usually doesn't take up much CPU time: besides the fact that the IOP runs 8 times slower than the EE, it is also a 32-bit processor, so things are fairly quick.
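
To make that control flow concrete, here is a minimal C++ sketch of the dispatch loop. Only cpuBranchTest and the ~512-cycle interval come from the description above; the loop structure, stub bodies, and constant names are illustrative assumptions, not PCSX2 source.

<syntaxhighlight lang="cpp">
#include <cstdint>

// Conceptual sketch, not PCSX2 source: only cpuBranchTest and the
// ~512-cycle interval come from the text; names and constants are assumed.
constexpr std::uint32_t EE_BRANCH_INTERVAL = 512;

std::uint64_t eeCycles = 0;
std::uint64_t nextBranchTest = EE_BRANCH_INTERVAL;

// Stub: in the emulator this updates counters, lets the IOP catch up,
// and services pending events before returning to recompiled code.
void cpuBranchTest() { nextBranchTest = eeCycles + EE_BRANCH_INTERVAL; }

// Stub: in the emulator this jumps into a recompiled x86 block and
// reports how many EE cycles the block consumed.
std::uint32_t runRecompiledBlock() { return 32; }

int main()
{
    for (int i = 0; i < 100; ++i) // bounded here; the real loop runs as long as the game does
    {
        // Stay inside recompiled code as long as possible...
        eeCycles += runRecompiledBlock();

        // ...then roughly every 512 EE cycles, drop out to handle
        // counters, IOP synchronization, and other system work.
        if (eeCycles >= nextBranchTest)
            cpuBranchTest();
    }
}
</syntaxhighlight>
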
Because PCSX2 is written in C/C++, function calls and pipelining are very fast. Many people believe that rearranging code and optimizing the calling conventions will speed things up. This is not the case at all. Although there are functions that could benefit from some hand-coded assembly, you will not see a statistically significant speed difference, because the assembly might not improve speed by much, or the function is not called often enough to matter. For example, it is stupid to optimize the recInit function; it is called only once per game!
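
To see why, run the numbers with Amdahl's law: the overall speedup from optimizing one function is capped by the fraction of total runtime that function accounts for. A quick sketch, with both figures hypothetical:

<syntaxhighlight lang="cpp">
#include <cstdio>

int main()
{
    // Amdahl's law: overall = 1 / ((1 - p) + p / s), where p is the
    // fraction of total runtime spent in the optimized function and
    // s is the local speedup. Both figures below are hypothetical.
    double p = 0.001; // a recInit-style function: ~0.1% of total runtime
    double s = 10.0;  // a generous 10x from hand-coded assembly
    double overall = 1.0 / ((1.0 - p) + p / s);
    std::printf("overall speedup: %.4fx\n", overall); // ~1.0009x: noise
}
</syntaxhighlight>

Even a generous 10x local speedup on a function worth 0.1% of the runtime moves the needle by less than a tenth of a percent.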


==Recompiler Optimizations==


In short, don't waste your time optimizing the ADD or MULT instruction... or any single instruction, for that matter. The bottleneck of the recompilers is the way registers are allocated. Also, registers are not tracked across basic blocks, meaning they are all dumped to memory at the end of each block, which is very suboptimal. My advice is not to touch any recompiler unless you know what is going on. If you are inclined to optimize recompilers, do the EE. The VU recompilers are very fast at the moment, so don't waste your time there because you might kill compatibility... and the IOP is just not worth it. When running a profiler, you might see the cpuBranchTest function pop up a lot; don't waste your time rearranging code out of the function or trying to call it less frequently, it won't work.
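
As a toy illustration of that block-boundary problem (the emitter and the register numbering here are made up, not the real recompiler), notice how every block reloads its inputs from the in-memory register context and flushes them back at the end, even when the next block immediately uses the same registers:

<syntaxhighlight lang="cpp">
#include <cstdio>

// Hypothetical code emitter, for illustration only.
struct Emitter
{
    void loadGuestReg(int guest, int host)  { std::printf("  mov host%d, [ctx+%d]\n", host, guest * 8); }
    void storeGuestReg(int guest, int host) { std::printf("  mov [ctx+%d], host%d\n", guest * 8, host); }
    void add(int dst, int src)              { std::printf("  add host%d, host%d\n", dst, src); }
};

// Compile one basic block computing guest r2 += r3. Because registers are
// not tracked across blocks, every block reloads its inputs from the
// register context and flushes its results back at the end, even when
// the very next block touches the same registers again.
void compileBlock(Emitter& e, int blockId)
{
    std::printf("block %d:\n", blockId);
    e.loadGuestReg(2, 0);   // redundant reload if the previous block just wrote r2
    e.loadGuestReg(3, 1);
    e.add(0, 1);
    e.storeGuestReg(2, 0);  // forced flush at the block boundary
}

int main()
{
    Emitter e;
    for (int i = 0; i < 2; ++i)
        compileBlock(e, i);  // note the back-to-back store/load pairs
}
</syntaxhighlight>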


==GS plugin optimizations==


There have been various suggestions to use shaders to offload as much computation as possible. This is exactly what ZeroGS does! In fact, ZeroGS does it so well that your GPU will cry (hence the name KOSMOS). The biggest problems in the GS plugin are managing the render targets and caching the textures. Also, the trick to everything is never (and I really mean never) to transfer GPU memory to system memory. This problem is actually very common with many games, because game developers just do crazy shit with the GS... sometimes I get the urge to email companies and ask 'what were you on when you programmed this?'. This is the reason certain games are much slower than others. Sometimes the problem is fixable, sometimes it isn't (this is a good time to mention that the ocean Z-buffer bug in FF12 is due to exactly this problem... ZeroGS ingeniously ignores the memory transfer, so FF12 is nice and fast... hence the bug).
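
In (roughly) OpenGL terms, the trap looks like this: glReadPixels cannot return until the GPU has finished every pending draw to the render target, stalling the whole pipeline, while simply rebinding the render target as a texture source stays entirely on the GPU. A sketch, assuming a current GL context:

<syntaxhighlight lang="cpp">
// Sketch only: assumes a current OpenGL context (on Windows, include
// <windows.h> before <GL/gl.h>).
#include <GL/gl.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// The slow path a GS plugin must avoid: synchronously reading a render
// target back into system memory. The driver must drain all pending GPU
// work on that buffer before the copy can happen.
std::vector<std::uint8_t> readbackRenderTarget(int width, int height)
{
    std::vector<std::uint8_t> pixels(static_cast<std::size_t>(width) * height * 4);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());
    return pixels; // the "never, and I really mean never" case
}

// The fast path: keep the data on the GPU. When a game samples from a
// previous render target, bind that target as the texture source
// instead of copying it out and back in.
void useRenderTargetAsTexture(GLuint renderTargetTex)
{
    glBindTexture(GL_TEXTURE_2D, renderTargetTex); // no CPU round trip
}
</syntaxhighlight>

The render-target manager and texture cache exist precisely so that this fast path can be taken whenever correctness allows it.
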
Someone in the forums suggested executing VU1 instructions in the GPU's vertex shaders. If you look at the shader and VU instruction sets, you'll see that they are very similar, so I can understand why someone would think this. However, it is not possible, because the VUs are far more complex than vertex shaders! In fact, it is suicide to consider the amount of reengineering that would be needed just to get the GPU to execute a simple VU program... and the GPU already has enough work to do. While on the subject of the VUs, it is also worth mentioning that it is suicide to even consider putting the VU recompilers on their own CPU thread executing concurrently with the EE! Making things multi-threaded is tricky because of the data shared between the two threads; putting VU execution on another thread would probably slow things down more than speed them up. The reason multithreading the GS works so well is that the EE doesn't have to stop to synchronize with the GS: all it has to do is copy the stream it wants to send into a temporary buffer, and the GS plugin processes it when it gets the time.
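
A minimal producer/consumer sketch of that buffering idea (the names and the queue structure are assumptions; a real multithreaded GS queue is more elaborate, typically a ring buffer): the EE thread copies each GS packet into shared storage and immediately moves on, while the GS thread drains it whenever it gets time.

<syntaxhighlight lang="cpp">
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

using GsPacket = std::vector<std::uint8_t>;

std::mutex gsMutex;
std::condition_variable gsWake;
std::deque<GsPacket> gsQueue;   // packets waiting for the GS thread
bool shuttingDown = false;

// EE thread: copy the packet into the queue and return immediately.
// The EE never stops to wait for the GS to finish processing.
void eeSubmit(const std::uint8_t* data, std::size_t size)
{
    {
        std::lock_guard<std::mutex> lock(gsMutex);
        gsQueue.emplace_back(data, data + size);
    }
    gsWake.notify_one();
}

// GS thread: drain packets as they arrive, rendering at its own pace.
void gsThread()
{
    std::unique_lock<std::mutex> lock(gsMutex);
    while (!shuttingDown || !gsQueue.empty())
    {
        gsWake.wait(lock, [] { return shuttingDown || !gsQueue.empty(); });
        while (!gsQueue.empty())
        {
            GsPacket pkt = std::move(gsQueue.front());
            gsQueue.pop_front();
            lock.unlock();
            (void)pkt; // ...parse and render pkt here, outside the lock...
            lock.lock();
        }
    }
}

int main()
{
    std::thread gs(gsThread);
    std::uint8_t dummy[16] = {};   // stand-in for a real GS packet
    eeSubmit(dummy, sizeof dummy);
    {
        std::lock_guard<std::mutex> lock(gsMutex);
        shuttingDown = true;
    }
    gsWake.notify_one();
    gs.join();
}
</syntaxhighlight>

The key property is that eeSubmit never waits for rendering to finish; the only contention is the brief lock around the queue itself.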


==SPU2 optimizations==


I don't know what's going on here. SPU2 really needs reworking, and it isn't just the plugin: there is a whole synchronization issue between DMA and SPU2 that has to be taken care of. SPU2 should be multithreaded, and it shouldn't require any heavy synchronization with the main thread. If anyone is going to work on something, it should be SPU2.


==Other optimizations==


Most other components of PCSX2 are already pretty optimized. To summarize, the biggest bottlenecks are the recompilers, the GS plugin, and SPU2. x86-64 versions of the recompilers are currently being developed. ZeroGS OpenGL will be released as open source soon, so people can contribute their own changes if necessary. In fact, future versions of ZeroGS will have a complex patch system that will let gamers tweak settings so that each game runs as fast as possible (even dual-core users can see a big performance gain with correctly tweaked settings). No one is maintaining SPU2... but that will change if this continues.


Moral of the blog: Optimization is time-consuming, and it is more of an art than a science. Programs are developed much faster and have fewer bugs when you don't have to worry about optimizing (Java is a great example of a language in which it is easy to develop programs but very hard to make them run fast... by design, I guess). But no matter the application, having it go faster is much better than having it go slower. Game companies spend a lot of money optimizing their code because it is that important. The reason most newer games have beautiful graphics running in real time is not just that GPUs are getting faster, but also that companies are getting better at optimizing their game engines. For example, games from when the PS2 was first released look inferior to new PS2 games today. And with the way things are going in the software industry, it looks like C++ will stick around for a very long time because of its optimization capabilities. Learn C++; there are many books I could recommend, but the best way to learn is to actually go and program something.
{{PCSX2 Documentation Navbox}}