The Icon Bar: Building the Dream 2 - The RISC OS Sound System

A bit later than I was hoping, but nevertheless it's now time for Building the Dream 2. This time I'll be looking at the RISC OS sound system - everything from the terminology used, to what makes a sound, how the RISC OS sound system works, and how you can write your own sample player.

Terminology (and how sound works)

First, let's start with the basics. A sample can, confusingly, have two meanings. It can either be used to describe a sound clip (e.g. a .WAV file), or one of the individual units of sound that make up a part of the clip. For the latter, a sound sample is typically stored as a 16 bit signed integer. After some processing by the sound software/hardware to provide volume scaling or mixing with other sounds, the sample is sent to a digital-to-analogue converter that translates it into a voltage. This controls the position of the diaphragm in the speaker. In order to generate an audible sound, the position of the diaphragm must be changed many hundreds or thousands of times a second, resulting in it oscillating back and forth. These oscillations set up the required pressure waves in the air, which travel outwards from the speaker and eventually stimulate the receptors inside your ear.

Most sound hardware operates by processing a certain number of samples per second - this is known as the frequency (or sample rate) of the sound system, and is measured in Hertz (Hz). Most hardware supports several different frequency settings, e.g. 11kHz, 22kHz and 44kHz. Although at first glance it may appear that the computer is only able to generate sounds at 11kHz, 22kHz, or 44kHz frequencies, in reality many different frequencies of sound can be generated by sending the correct sequence of samples. For example, if the sound system is set to 22kHz, and the sequence of samples 32767, -32768, 32767, -32768, 32767, ... etc. is sent to the hardware, then a constant 11kHz tone will be emitted by the speaker. The tone will be at 11kHz instead of the expected 22kHz because the period (duration) of the wave is 2 samples. I.e. every 2 samples the in-out motion of the speaker diaphragm will repeat (and it is the in-out motion that results in a sound being produced). If the sequence of samples had instead been 32767, 0, -32768, 0, 32767, 0, -32768, 0, 32767, ... then the tone would be at 5.5kHz, because the period of the wave is 4 samples. Many other patterns are possible, each providing approximations to different frequencies of sound. The higher the sample rate of the hardware, and the higher the resolution of the samples (e.g. 24 bit instead of 16 bit), the closer these patterns come to representing the real shape of a specific frequency sound wave. One thing to remember though is that the maximum frequency sound you can output will be half the frequency of the sound system, because a period of at least two samples is required in order to represent the essential in-out motion.
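The relationship between the period of a pattern and the tone it produces can be sketched in a few lines of C. This is only an illustration, not code from the example player: the function names are made up for this sketch, it generates a plain square wave (so its 4-sample pattern differs slightly from the one above, which passes through 0), and it assumes a period of at least 2 samples.

```c
#include <stdint.h>
#include <stddef.h>

/* Fill buf with a square wave that repeats every period_samples samples.
   The first half of each cycle is full positive, the rest full negative.
   Assumes period_samples >= 2 (hypothetical helper for this sketch). */
void fill_square_wave(int16_t *buf, size_t count, unsigned period_samples)
{
    for (size_t i = 0; i < count; i++) {
        unsigned phase = (unsigned)(i % period_samples);
        buf[i] = (phase < period_samples / 2) ? INT16_MAX : INT16_MIN;
    }
}

/* Frequency of a tone whose pattern repeats every period_samples samples. */
double tone_frequency_hz(double sample_rate_hz, unsigned period_samples)
{
    return sample_rate_hz / period_samples;
}
```

With a 22050Hz sound system, a period of 2 samples gives 11025Hz and a period of 4 samples gives 5512.5Hz, matching the "11kHz" and "5.5kHz" figures above.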

Although I've mentioned the period as being the duration of one cycle of a sound wave, it is also used to describe the duration of one sample. For example, at 22kHz, the period of each sample is 1/22050s = 0.000045351 seconds (Note that 22kHz rarely equals 22000Hz. A 22kHz sound system would typically run at 22050Hz, and it is only written as 22kHz as a matter of convenience. Similarly 11kHz is 11025Hz and 44kHz is 44100Hz). This means that, every 0.000045351 seconds, the sound hardware reads a sample from the sound buffer, converts it to a voltage, and sends it to the speaker.

The buffer is another important aspect of the sound system. Rather than constantly pester the CPU with requests for individual samples, the sound hardware typically requests several hundred samples at a time, which the CPU will put into a memory buffer ready for reading by the sound hardware. This provides more efficient operation, as it reduces the number of context switches that the CPU must perform. (A context switch - e.g. switching from running an application to running the sound buffer fill routine - will take a finite amount of time. So the more context switches there are, the more CPU time will be lost to performing the switching.) Bigger buffers will require fewer context switches, and so will incur a lower CPU overhead. However care has to be taken, as the buffer introduces a delay in the sound system - if the buffer is too large the user will notice that a sound is only heard a considerable amount of time after it was meant to start playing. This is of more importance to games than regular desktop use, as games rely upon many sounds starting and stopping at specific times. It is also important for movie playback, to ensure the sound is in sync with the pictures - although if the buffer size remains constant, the delay can be taken into account by the movie player code.
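To put some numbers on that trade-off, here is a small C sketch. The function names are invented for this illustration, and it assumes the delay is simply the time taken to play one buffer's worth of samples; real hardware may add more on top.

```c
/* Approximate delay, in milliseconds, contributed by one buffer of
   buffer_samples samples played at sample_rate_hz. */
double buffer_latency_ms(unsigned buffer_samples, double sample_rate_hz)
{
    return 1000.0 * buffer_samples / sample_rate_hz;
}

/* Number of buffer fills (and so context switches) needed per second. */
double fills_per_second(unsigned buffer_samples, double sample_rate_hz)
{
    return sample_rate_hz / buffer_samples;
}
```

For example, a 441-sample buffer at 44100Hz adds about 10ms of delay and needs 100 fills per second; doubling the buffer size halves the number of fills but doubles the delay.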

The RISC OS sound system

There are actually two RISC OS sound systems - the older, practically obsolete 8bit sound system, and the newer 16bit sound system. Although I'll be discussing both in this section, the code samples in the next section will focus on the use of the 16bit sound system, as it is the easiest to write for, produces higher-quality sound output, and has better support for multiple users (via the SharedSound module).

In the beginning though, there was just the 8bit sound system. This is the system that was shipped with the first ARM-powered machines running Arthur. It could be configured to support between 1 and 8 channels of 8bit audio, which would then be mixed together in hardware to produce the two stereo channels output to the speakers. To do this, each channel could be assigned a stereo 'position', which controlled how much of it was mixed into the left and right output channels. Although the sample rate and buffer size of the sound system could be changed, these settings could not be changed on a channel-by-channel basis - they had to be a global change.

The 8bit sound system was based largely around the notion of voice generators - pieces of code (normally contained in a module) which either played back samples from a stored audio clip or generated the data on-the-fly. Each channel would be assigned a specific voice generator, meaning that each channel can only play one sound at once. Sound playback is controlled via the channel interface - so rather than request for a specific sound to be played, you instead issue a standard call to the channel controller to play whatever sound (voice) is currently attached to the given channel. The pitch, volume, and duration of the sound can be specified with the command, although it is up to the voice generator whether it pays attention to those values or not. Although each channel can only have one generator attached at once, there is nothing to stop the same generator from being attached to multiple channels.

Although adequate when first introduced, it's obvious that the old sound system contained several flaws. Only 8 channels existed, and so only 8 sounds could be played at once (unless software mixing was used). It was also a single-user system, as no framework existed for allocating specific channels to specific programs. Furthermore, the reliance upon voice generators made it difficult to implement continuous sound sources (such as music players). Although it is possible to implement a music player as a voice generator, the solution several music players took was to bypass the channel allocation system entirely and instead claim the entire sound system for themselves, preventing any other sounds from being played at all (even if the music player only used a few of the available hardware channels).

The other point to note about the 8bit sound system is that it used logarithmic samples, not linear; although this makes the data more difficult to process, it results in a marked increase in sound quality, as it essentially gives the computer a 12bit sound system.

16bit sound and SharedSound

With the introduction of the RiscPC came a standardised 16bit sound specification, and the hardware to go with it. The main aspect of the specification was the new SWI, Sound_LinearHandler, which allows programs to register their own 16bit sound handler function. This function has the role of filling both the left and right channels of the 16bit sound buffer with data; thus it is limited to stereo sound hardware only. The main drawback of the 16bit sound system is that only one handler can be registered at once - however this was soon resolved with the release of the SharedSound module, which is now the standard interface to the 16bit sound system under RISC OS.

The SharedSound module operates by registering its own handler (via Sound_LinearHandler); it then allows clients to register as many sound handlers as they want, via the SharedSound_InstallHandler SWI. Each time a buffer fill request is received from the Sound module, SharedSound will iterate through the list of client handlers and call each handler, allowing them to write whatever data they want to the sound buffer. Because the same buffer is shared between all clients, each client must correctly obey the mixing flags that SharedSound passes to it, to ensure data from other clients doesn't get overwritten. However the use of linear 16bit sound samples makes the mixing process trivial to perform.
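The idea of obeying a mixing flag can be sketched in C as follows. This is a simplified illustration only: the flag name and value are hypothetical, the convention of returning updated flags is an assumption of the sketch, and the real flag layout and handler interface should be taken from the SharedSound documentation.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical flag for this sketch: set once an earlier handler has
   already written data to the buffer. Not the real SharedSound value. */
#define SKETCH_FLAG_MIX 1u

/* Write count samples from src into buf. If the mix flag is set, add to
   what is already there; otherwise overwrite it. Returns the flags to
   pass on to the next handler (an assumption of this sketch). */
unsigned write_or_mix(int16_t *buf, const int16_t *src, size_t count,
                      unsigned flags)
{
    if (flags & SKETCH_FLAG_MIX) {
        for (size_t i = 0; i < count; i++)
            buf[i] = (int16_t)(buf[i] + src[i]); /* no overflow check */
    } else {
        for (size_t i = 0; i < count; i++)
            buf[i] = src[i];
    }
    return flags | SKETCH_FLAG_MIX;
}
```

The first handler to run overwrites whatever stale data was in the buffer, and every handler after it adds to the existing contents, so no client's output is lost.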

Apart from providing shared access to the 16bit sound buffer, SharedSound also provides useful information about the sample rate of the buffer, in particular fractional step values to convert from the client's sample rate to the buffer rate. It also allows the volume of each handler to be changed individually (although it is up to the fill code to perform the required volume scaling). It also allows several different types of handler to be used - standard handlers that can be run from an interrupt handler, callback handlers that perform processing in a callback (which means they can take longer, as interrupts will be enabled), and process handlers (which are called in a callback after all other handlers, to allow for any effects to be applied to the final sound buffer data). In my experience, even with relatively simple buffer fill code like the one in the example below, if you're playing more than a handful of sounds at once you will get better performance by switching from an interrupt-based fill routine to a callback-based fill routine.

The only downside to the SharedSound module is that the latest version is only available if you own RISC OS 6. If you don't have RISC OS 6 (Or any earlier version of the module shipped with RISC OS Select/Adjust), then it looks like the most recent version available on the Internet is version 1.04, available from here on Castle's website. For documentation, the best source would appear to be the 'OS' StrongHelp manual, available here on the StrongHelp site.

A sample simple SharedSound sample player

I'll now go into detail about how to write your own sample player, using SharedSound. The player will load one or more WAV files from disc and play them back all at the same time - demonstrating how to read (simple) WAV files, how to interface with SharedSound, and how to perform relatively simple activities such as playing multiple sounds at the same time, at their correct sample rates. The code could be easily expanded to support other features, e.g. looping, staggered playback, individual pitch/volume controls for each sound, etc. - basically everything you need for a simple music player or computer game.

Download the code

To run the sample you will need RISC OS 3.5 or above, a copy of the SharedSound module, and some WAV files. The included makefile is suitable for compiling the code with GCC.

The code is split into four sections, which will be explained below. It is a mixture of C and assembler; C for file reading and general control, assembler for buffer filling and some error handling. The code could have been written as a mix of BASIC and assembler (or in pure assembler), but processing the WAV files in C is much easier than in BASIC or assembler. If you really wanted you could use C for the buffer filling code instead of assembler - but that would require extra wrapper code to get the CPU/stack in the correct state for running APCS code, as well as extra care when writing the code to ensure certain instructions (e.g. floating point) aren't used. For a simple player like this, it's a lot easier and safer just to write the buffer fill code in assembler.

1. WAV reading

Each WAV file is read into a sound struct by the load_wav() function. This function will take the filename and sound pointer as input, and attempt to convert the WAV file into a sequence of stereo 16bit samples (stored in the correct format for placing in one of SharedSound's buffers). Although the code is by no means perfect, it should be able to understand almost any uncompressed 1 or 2 channel WAV file you throw at it. If you're interested in writing your own WAV reader, then you're out of luck, because I've lost the link to the original page I used when writing the code. All I can do is point you to Wikipedia and let you struggle by yourself if one of the sites they link to gets it wrong!
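As a starting point for anyone writing their own reader, here is a minimal sketch of walking the chunks of an uncompressed WAV file that has already been loaded into memory. It is not the load_wav() function from the download; the wav_info struct and parse_wav() name are invented for this sketch, and it only handles plain PCM data (format tag 1) without converting it.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned channels;
    unsigned sample_rate;
    unsigned bits_per_sample;
    const uint8_t *data;   /* points into the loaded file */
    size_t data_bytes;
} wav_info;

/* WAV files are little-endian; read fields byte by byte to be portable. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and fills *out on success, 0 if the file isn't understood. */
int parse_wav(const uint8_t *file, size_t len, wav_info *out)
{
    int have_fmt = 0;
    size_t pos = 12;

    if (len < 12 || memcmp(file, "RIFF", 4) != 0 ||
        memcmp(file + 8, "WAVE", 4) != 0)
        return 0;

    while (pos + 8 <= len) {
        const uint8_t *chunk = file + pos;
        uint32_t size = rd32(chunk + 4);
        if (size > len - pos - 8)
            return 0;                        /* chunk runs off the end */
        if (memcmp(chunk, "fmt ", 4) == 0) {
            if (size < 16 || rd16(chunk + 8) != 1)
                return 0;                    /* not uncompressed PCM */
            out->channels = rd16(chunk + 10);
            out->sample_rate = rd32(chunk + 12);
            out->bits_per_sample = rd16(chunk + 22);
            have_fmt = 1;
        } else if (memcmp(chunk, "data", 4) == 0) {
            if (!have_fmt)
                return 0;
            out->data = chunk + 8;
            out->data_bytes = size;
            return 1;
        }
        pos += 8 + (size_t)size + (size & 1); /* chunks are word-padded */
    }
    return 0;
}
```

Unknown chunks are simply skipped, which is what lets a reader like this cope with the extra metadata chunks that some WAV files contain.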

2. SharedSound initialisation

This is performed by the init_sharedsound() function. After the buffer fill code has been registered with SharedSound, init_sharedsound() performs two other important tasks - it registers a couple of error handlers (more on that later) and reads the current sample rate. It uses the sample rate returned by SharedSound to calculate the playback rate of each WAV file. This is necessary because the WAV loader doesn't do any sample rate conversion. Instead, each sound is given its own playback rate.
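The playback rate calculation might look something like the sketch below. The function name is hypothetical, and it assumes the rate is stored as the 8.24 fixed-point ratio described in the Playback section, with the ratio staying below 256.

```c
#include <stdint.h>

/* 8.24 fixed-point ratio of the WAV sample rate to the output rate.
   Assumes output_rate_hz is non-zero and the ratio is below 256. */
uint32_t playback_step_8_24(uint32_t wav_rate_hz, uint32_t output_rate_hz)
{
    /* Widen to 64 bits so the shift cannot overflow. */
    return (uint32_t)(((uint64_t)wav_rate_hz << 24) / output_rate_hz);
}
```

A 22050Hz WAV played into a 22050Hz buffer gives a step of exactly 1.0 (0x1000000), while a 44100Hz WAV gives 2.0, meaning every other source sample is skipped.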

3. Playback

On the C side, not much occurs during playback. The code merely waits in a while() loop until it detects that all samples have reached their end. This means that playback is entirely under the control of SharedSound - and our buffer_fill() code.

Each time the buffer_fill() code gets called, it processes the sound list, adding each sound to the buffer one at a time. I've deliberately left the buffer fill code quite simple, so that you can see how it works. For one, it ignores most of the information SharedSound supplies it with, such as sample rates and volume levels - because it should already know the sample rate to use, and no volume scaling is performed. Secondly, no math overflow checks are performed on the samples as they are added into the sound buffer - this will result in clicks and other noise if you play too many loud sounds at once. Extending the code to support volume scaling and/or overflow protection is fairly trivial.
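As a rough idea of what volume scaling and overflow protection involve, here is a sketch in C (the example player does this work in assembler, and does neither). The names and the 0 to 256 volume range are assumptions made for this illustration.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate value to the 16-bit sample range. */
static int16_t clamp16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scale sample by volume (0 = silent, 256 = full volume; a range chosen
   for this sketch) and add it to the value already in the buffer,
   saturating instead of wrapping around on overflow. */
int16_t mix_sample(int16_t existing, int16_t sample, unsigned volume)
{
    int32_t scaled = ((int32_t)sample * (int32_t)volume) / 256;
    return clamp16((int32_t)existing + scaled);
}
```

Saturating still distorts the sound when too many loud samples overlap, but a flattened peak is far less objectionable than the click produced by a value wrapping from large positive to large negative.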

Note that because of the way sound works, in order to play two sounds at the same time we merely have to add the two samples to each other. This will only work with linear sound samples however - if this code were for the 8-bit sound system then it would be much more complicated (which is one of the many advantages of using the 16bit sound system instead).

It's also worth pointing out the use of R7, R12, and R8 in tracking the playback position. R7 and R12 contain the playback position of the sound; R7 is the whole part (i.e. number of samples played) and R12 the fractional part. R8 is the sample rate. This is calculated (in init_sharedsound()) as the ratio of the WAV sample rate to the SharedSound playback rate. So if you were playing an 11kHz WAV file into a 22kHz SharedSound buffer, the ratio would be 1:2, or 0.5. This ratio is stored in R8 as an 8.24 format fixed-point number (8 bit whole number, 24 bit fraction). So the 0.5 of the example would become 8388608 (or 0x800000 in hex). R12 is also stored in 8.24 format; this means that, each time the playback position is updated, R8 is added to R12, the upper 8 bits of R12 (the whole part) are added to R7, and those bits are then cleared from R12. R12's sole purpose is to track the fractional part of the sample position.
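The same position tracking can be sketched in C. This is an illustration rather than a translation of the assembler: the struct and function names are invented, the data is mono to keep things short, no overflow checks are made when adding, and unlike the example code the fractional part is kept in the struct between calls.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const int16_t *data;  /* mono sample data, for simplicity */
    uint32_t length;      /* number of samples in data */
    uint32_t pos;         /* whole part of the position (R7's role) */
    uint32_t frac;        /* fractional part, low 24 bits (R12's role) */
    uint32_t step;        /* 8.24 playback rate (R8's role) */
} voice;

/* Add up to count samples of v into out, resampling by stepping through
   the source at the 8.24 rate. Returns the number of samples written. */
size_t mix_voice(voice *v, int16_t *out, size_t count)
{
    size_t i;
    for (i = 0; i < count && v->pos < v->length; i++) {
        out[i] = (int16_t)(out[i] + v->data[v->pos]);
        v->pos += v->step >> 24;           /* whole part of the step */
        v->frac += v->step & 0x00FFFFFFu;  /* fractional part of the step */
        v->pos += v->frac >> 24;           /* carry into the whole part */
        v->frac &= 0x00FFFFFFu;
    }
    return i;
}
```

With a step of 0x800000 (0.5) each source sample is used twice, so a 4-sample sound fills 8 output samples, which is exactly what playing an 11kHz sound through a 22kHz buffer requires.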

Note that R7 is saved in the sound state, but R12 is not. This does mean that if any value is left in R12 after each buffer fill, it will have the effect of stretching the sample slightly. For example if there are 100 buffer fills per second, then at most the sample will be stretched by 100 samples per second, which for an 11kHz sample makes it roughly 1% longer and 1% lower in pitch. If I were writing a music player then I might store the value of R12 between buffer fills; but for a simple sample player there is little point, as the effect is minor and not guaranteed to occur anyway.

4. Shutdown

Shutdown is handled by the aptly-named shutdown_sharedsound() function. This function unregisters the SharedSound handler (if it is still installed, as an error may have forced it to be removed), and disables the ErrorV handler that was installed by init_sharedsound(). The ErrorV handler is used for error handling, which is discussed below.

Error handling

Error handling is a very important part of audio playback under RISC OS, especially if your buffer fill code runs from application space or relies on data stored in application space. If your program exits in an unexpected way then there's a chance that the buffer fill code will be left active - at which point anything could happen, from random data beginning to play to the machine locking up entirely. After struggling with error handling for some time, I've come across a solution that appears to work under all circumstances.

Firstly, a C atexit() handler is used to catch all cases of the C program terminating under the control of the C library. This handler will get called if the program exits via exit(), abort(), or by returning from main(). In our case, shutdown_sharedsound() is registered as an atexit function.
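The general pattern looks something like the sketch below. It is not the code from the download: the function names are hypothetical, and the places where the real SharedSound calls would go are only marked with comments. The important detail is the flag, which makes the shutdown safe to run more than once (for example once explicitly and once again from atexit()).

```c
#include <stdlib.h>

static int handler_installed = 0;

/* Remove the handler if it is still installed. Returns 1 if it was
   removed by this call, 0 if there was nothing to do. */
int remove_handler_once(void)
{
    if (!handler_installed)
        return 0;
    handler_installed = 0;
    /* Real code would ask SharedSound to remove the handler here. */
    return 1;
}

/* atexit() needs a void(void) function, so wrap the call. */
static void shutdown_at_exit(void)
{
    (void)remove_handler_once();
}

/* Returns 1 on success, 0 if the exit handler could not be registered. */
int init_sound_sketch(void)
{
    /* Real code would register the buffer fill routine here. */
    handler_installed = 1;
    return atexit(shutdown_at_exit) == 0;
}
```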

Secondly, an assembler function is attached to ErrorV, the vector that will be executed whenever the OS is made aware of an error. This is necessary to trap the cases which atexit() does not - for example more serious errors such as data aborts or undefined instructions. The assembler function that gets registered (error_handler()) disables the SharedSound handler, but, importantly, does not disable the ErrorV handler. This is because after testing it on my Iyonix I found that the machine would lock up completely on error, most likely because it does not support an error handler removing itself. Manually removing the error handler isn't deathly important anyway, as the OS is capable of removing it itself when the application exits.

Multitasking

A couple of extra considerations are needed for multitasking programs. Firstly - don't expect to be able to play sounds in the WIMP using a player in application space. This is because of how RISC OS manipulates the memory map to allow each program to run from the same &8000 base address. Secondly - don't expect to be able to do it from a TaskWindow either (hence the check near the start of main() in the example), again for the same reason as above. If you want to play sounds from within the WIMP then you'll have to rely on buffer fill code stored in the module area (ideally inside a module rather than as a random piece of floating code), and store your sample data in a dynamic area or in the module area.

Interrupts

One important thing to remember if you're doing more complex work with sounds is that all buffer fills occur in interrupts. This means that your program doesn't know when they will occur; if you have some C code that adds and removes sound effects from a list of effects to play, then there's a chance the buffer fill code will get called in the middle of your code updating the list. You should design your code with this in mind. For example, if you are swapping a long sample with a short sample, and you update the data pointer before updating the playback position or sample length, then there's a chance random data will be played instead of sound data. The best course of action in this case is likely to be to temporarily set a 'paused' flag for that sound, or set the data pointer to 0 (and have the buffer fill code interpret that case as there being no data to play). If your sample playback list is constantly changing then you might want to keep two lists - one which is in a safe state and is read by the buffer fill code, the other which is in a volatile state as it is constructed or updated. After a list has been constructed the 'active list' pointer will be swapped to point at the other list. If you take this approach, remember that you will have to store the playback position for each sound somewhere sensible (i.e. not in either of the two lists).
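The two-list approach can be sketched in C as follows. All of the names here are made up for the illustration. It assumes a single-core machine where the buffer fill interrupt runs to completion before the foreground code resumes, and where writing an aligned pointer is a single, indivisible store.

```c
#include <stddef.h>

#define MAX_ACTIVE 8

typedef struct {
    const void *sounds[MAX_ACTIVE];  /* pointers to the sounds to play */
    size_t count;
} sound_list;

static sound_list lists[2];
/* The list the buffer fill code reads. Only ever changed by one store. */
static sound_list *volatile active_list = &lists[0];

/* Foreground: get the list that is NOT in use by the fill code, emptied. */
sound_list *begin_update(void)
{
    sound_list *spare = (active_list == &lists[0]) ? &lists[1] : &lists[0];
    spare->count = 0;
    return spare;
}

/* Foreground: append a sound. Returns 0 if the list is full. */
int add_sound(sound_list *list, const void *sound)
{
    if (list->count >= MAX_ACTIVE)
        return 0;
    list->sounds[list->count++] = sound;
    return 1;
}

/* Foreground: publish the finished list with a single pointer write. */
void commit_update(sound_list *built)
{
    active_list = built;
}

/* Buffer fill code: read the pointer once, then use only that list. */
const sound_list *current_list(void)
{
    return active_list;
}
```

Because the fill code only ever sees a list through one pointer read, it always works from either the old complete list or the new complete list, never a half-built one.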

For example, the game I'm currently working on, DeathDawn, uses three structures to manage sounds. Each sample is stored in a sample structure. This contains the raw sample data, length, and sample rate of that data. Each object in the game has a fixed number of sound slots; each slot can reference a sample, and contains the playback position, volume and pitch modifiers, as well as pause and repeat flags. Although there may be 30 objects in the game world with 4 sound slots each, there may only be 4 slots which are both in use and in earshot of the player. For this reason, two lists of active sounds are kept. These lists only contain pointers to the sounds that should be played. As well as the sound pointer itself, each entry also contains the final volumes of the left and right stereo channels, and the fully-adjusted 8.24 fractional step to use for playback. At the end of each frame, the active list that is used by the buffer fill code is swapped with the list that has just been assembled by the C code. Although this solution solves many problems with handling sounds it is still not perfect, for special code must be used to delete sound structures (to ensure they are not referenced in either of the active lists), and if a sample is swapped for one of a different sample rate then there's a chance it will be played at the wrong rate for the duration of one buffer fill (although this can be fixed by moving the sample rate calculation into the buffer fill code). The reliance on active sound lists also introduces a delay of a few centiseconds between a sound being scheduled for playing and playback beginning.

Next time...

The next article, due sometime before judgement day, will deal once again with the topic of random map generators, as well as provide numerous examples of uses for the different container data structures that were discussed last time.