<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15"/>
<title>Ogg Vorbis Documentation</title>

<style type="text/css">
body {
  margin: 0 18px 0 18px;
  padding-bottom: 30px;
  font-family: Verdana, Arial, Helvetica, sans-serif;
  color: #333333;
  font-size: .8em;
}

a {
  color: #3366cc;
}

img {
  border: 0;
}

#xiphlogo {
  margin: 30px 0 16px 0;
}

#content p {
  line-height: 1.4;
}

h1, h1 a, h2, h2 a, h3, h3 a, h4, h4 a {
  font-weight: bold;
  color: #ff9900;
  margin: 1.3em 0 8px 0;
}

h1 {
  font-size: 1.3em;
}

h2 {
  font-size: 1.2em;
}

h3 {
  font-size: 1.1em;
}

li {
  line-height: 1.4;
}

#copyright {
  margin-top: 30px;
  line-height: 1.5em;
  text-align: center;
  font-size: .8em;
  color: #888888;
  clear: both;
}
</style>

</head>

<body>

<div id="xiphlogo">
  <a href="http://www.xiph.org/"><img src="fish_xiph_org.png" alt="Fish Logo and Xiph.org"/></a>
</div>

<h1>Ogg Vorbis stereo-specific channel coupling discussion</h1>

<h2>Abstract</h2>

<p>The Vorbis audio CODEC provides channel coupling
mechanisms designed to reduce effective bitrate by both eliminating
interchannel redundancy and eliminating stereo image information
labeled inaudible or undesirable according to spatial psychoacoustic
models. This document describes both the mechanical coupling
mechanisms available within the Vorbis specification and the
specific stereo coupling models used by the reference
<tt>libvorbis</tt> codec provided by xiph.org.</p>

<h2>Mechanisms</h2>

<p>In encoder release beta 4 and earlier, Vorbis supported multiple
channel encoding, but the channels were encoded entirely separately
with no cross-analysis or redundancy elimination between channels.
This multichannel strategy is very similar to mp3's <em>dual
stereo</em> mode, and Vorbis uses the same name for its analogous
uncoupled multichannel modes.</p>

<p>However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and
later implement, a coupled channel strategy. Vorbis has two specific
mechanisms that may be used alone or in conjunction to implement
channel coupling. The first is <em>channel interleaving</em> via
residue backend type 2, and the second is <em>square polar
mapping</em>. These two general mechanisms are particularly well
suited to coupling due to the structure of Vorbis encoding, as we'll
explore below. Using both, we can implement totally
<em>lossless stereo image coupling</em> [bit-for-bit decode-identical
to uncoupled modes] as well as various lossy models that seek to
eliminate inaudible or unimportant aspects of the stereo image in
order to reduce bitrate. The exact coupling implementation is
generalized to allow the encoder a great deal of flexibility in
implementation of a stereo or surround model without requiring any
significant complexity increase over the combinatorially simpler
mid/side joint stereo of mp3 and other current audio codecs.</p>

<p>A particular Vorbis bitstream may apply channel coupling directly to
more than a pair of channels; polar mapping is hierarchical such that
polar coupling may be extrapolated to an arbitrary number of channels
and is not restricted to only stereo, quadraphonics, ambisonics or 5.1
surround. However, the scope of this document restricts itself to the
stereo coupling case.</p>

<h3>Square Polar Mapping</h3>

<h4>maximal correlation</h4>

<p>Recall that the basic structure of a Vorbis I stream first generates
from input audio a spectral 'floor' function that serves as an
MDCT-domain whitening filter.
This floor is meant to represent the
rough envelope of the frequency spectrum, using whatever metric the
encoder cares to define. This floor is subtracted from the log
frequency spectrum, effectively normalizing the spectrum by frequency.
Each input channel is associated with a unique floor function.</p>

<p>The basic idea behind any stereo coupling is that the left and right
channels usually correlate. This correlation is even stronger if one
first accounts for energy differences in any given frequency band
across left and right; think for example of individual instruments
mixed into different portions of the stereo image, or a stereo
recording with a dominant feature not perfectly in the center. The
floor functions, each specific to a channel, provide the perfect means
of normalizing left and right energies across the spectrum to maximize
correlation before coupling. This feature of the Vorbis format is not
a convenient accident.</p>

<p>Because we strive to maximally correlate the left and right channels
and generally succeed in doing so, left and right residue is typically
nearly identical. We could use channel interleaving (discussed below)
alone to efficiently remove the redundancy between the left and right
channels as a side effect of entropy encoding, but a polar
representation gives benefits when left/right correlation is
strong.</p>

<h4>point and diffuse imaging</h4>

<p>The first advantage of a polar representation is that it effectively
separates the spatial audio information into a 'point image'
(magnitude) at a given frequency and located somewhere in the sound
field, and a 'diffuse image' (angle) that fills a large amount of
space simultaneously. Even if we preserve only the magnitude (point)
data, a detailed and carefully chosen floor function in each channel
provides us with a free, fine-grained, frequency relative intensity
stereo*.
Angle information represents diffuse sound fields, such as
reverberation that fills the entire space simultaneously.</p>

<p>*<em>Because the Vorbis model supports a number of different possible
stereo models and these models may be mixed, we do not use the term
'intensity stereo' when talking about Vorbis; instead we use the terms
'point stereo', 'phase stereo' and subcategories of each.</em></p>

<p>The majority of a stereo image is representable by polar magnitude
alone, as strong sounds tend to be produced at near-point sources;
even non-diffuse, fast, sharp echoes track very accurately using
magnitude representation almost alone (for those experimenting with
Vorbis tuning, this strategy works much better with the precise,
piecewise control of floor 1; the continuous approximation of floor 0
results in unstable imaging). Reverberation and diffuse sounds tend
to contain less energy and be psychoacoustically dominated by the
point sources embedded in them. Thus, we again tend to concentrate
more represented energy into a predictably smaller number of values.
Separating representation of point and diffuse imaging also allows us
to model and manipulate point and diffuse qualities separately.</p>

<h4>controlling bit leakage and symbol crosstalk</h4>

<p>Because polar
representation concentrates represented energy into fewer large
values, we reduce bit 'leakage' during cascading (multistage VQ
encoding) as a secondary benefit. A single large, monolithic VQ
codebook is more efficient than a cascaded book due to entropy
'crosstalk' among symbols between different stages of a multistage cascade.
Polar representation is a way of further concentrating entropy into
predictable locations so that codebook design can take steps to
improve multistage codebook efficiency.
It also allows us to cascade
various elements of the stereo image independently.</p>

<h4>eliminating trigonometry and rounding</h4>

<p>Rounding and computational complexity are potential problems with a
polar representation. As our encoding process involves quantization,
mixing a polar representation and quantization makes it potentially
impossible, depending on implementation, to construct a coupled stereo
mechanism that results in bit-identical decompressed output compared
to an uncoupled encoding should the encoder desire it.</p>

<p>Vorbis uses a mapping that preserves the most useful qualities of
polar representation, relies only on addition/subtraction (during
decode; high quality encoding still requires some trig), and makes it
trivial before or after quantization to represent an angle/magnitude
through a one-to-one mapping from possible left/right value
permutations. We do this by basing our polar representation on the
unit square rather than the unit circle.</p>

<p>Given a magnitude and angle, we recover left and right using the
following function (note that A/B may be left/right or right/left
depending on the coupling definition used by the encoder):</p>

<pre>
      /* inverse square polar mapping: recover A and B from magnitude and angle */
      if(magnitude>0){
        if(angle>0){
          A=magnitude;
          B=magnitude-angle;
        }else{
          B=magnitude;
          A=magnitude+angle;
        }
      }else{
        if(angle>0){
          A=magnitude;
          B=magnitude+angle;
        }else{
          B=magnitude;
          A=magnitude-angle;
        }
      }
</pre>

<p>The function is antisymmetric for positive and negative magnitudes in
order to eliminate a redundant value when quantizing.
For example, if
we're quantizing to integer values, we can visualize a magnitude of 5
and an angle of -2 as follows:</p>

<p><img src="squarepolar.png" alt="square polar"/></p>

<p>This representation loses or replicates no values; if A and B each
range over the integers -5 through 5, the number of possible Cartesian
permutations is 121. Represented in square polar notation, the
possible values are:</p>

<pre>
 0, 0

-1,-2  -1,-1  -1, 0  -1, 1

 1,-2   1,-1   1, 0   1, 1

-2,-4  -2,-3  -2,-2  -2,-1  -2, 0  -2, 1  -2, 2  -2, 3

 2,-4   2,-3   ... following the pattern ...

 ...    5, 1   5, 2   5, 3   5, 4   5, 5   5, 6   5, 7   5, 8   5, 9

</pre>

<p>...for a grand total of 121 possible values, the same number as in
Cartesian representation (note that, for example, <tt>5,-10</tt> is
the same as <tt>-5,10</tt>, so there's no reason to represent
both. <tt>2,10</tt> cannot happen, and there's no reason to account for it.)
It's also obvious that this mapping is exactly reversible.</p>

<h3>Channel interleaving</h3>

<p>We can remap an A/B vector using polar mapping into a magnitude/angle
vector, and it's clear that, in general, this concentrates energy in
the magnitude vector and reduces the amount of information to encode
in the angle vector. Encoding these vectors independently with
residue backend #0 or residue backend #1 will result in bitrate
savings. However, there are still implicit correlations between the
magnitude and angle vectors. The most obvious is that the amplitude
of the angle is bounded by its corresponding magnitude value.</p>

<p>Entropy coding the results, then, further benefits from the entropy
model being able to compress magnitude and angle simultaneously.
For
this reason, Vorbis implements residue backend #2, which pre-interleaves
a number of input vectors (in the stereo case, two, A and B) into a
single output vector (with the elements in the order of
A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus
each vector to be coded by the vector quantization backend consists of
matching magnitude and angle values.</p>

<p>The astute reader, at this point, will notice that in the theoretical
case in which we can use monolithic codebooks of arbitrarily large
size, we can directly interleave and encode left and right without
polar mapping; in fact, the polar mapping does not appear to lend any
benefit whatsoever to the efficiency of the entropy coding. Indeed,
it is perfectly possible and reasonable to build a Vorbis encoder that
dispenses with polar mapping entirely and merely interleaves the
channels. Libvorbis-based encoders may configure such an encoding and
it will work as intended.</p>

<p>However, when we leave the ideal/theoretical domain, we notice that
polar mapping does give additional practical benefits, as discussed in
the above section on polar mapping and summarized again here:</p>

<ul>
<li>Polar mapping aids in controlling entropy 'leakage' between stages
of a cascaded codebook.</li>
<li>Polar mapping separates the stereo image
into point and diffuse components which may be analyzed and handled
differently.</li>
</ul>

<h2>Stereo Models</h2>

<h3>Dual Stereo</h3>

<p>Dual stereo refers to stereo encoding where the channels are entirely
separate; they are analyzed and encoded as entirely distinct entities.
This terminology is familiar from mp3.</p>

<h3>Lossless Stereo</h3>

<p>Using polar mapping and/or channel interleaving, it's possible to
couple Vorbis channels losslessly, that is, construct a stereo
coupling encoding that both saves space and decodes
bit-identically to dual stereo. OggEnc 1.0 and later use this
mode in all high-bitrate encoding.</p>

<p>Overall, this stereo mode is overkill; however, it offers a safe
alternative to users concerned about the slightest possible
degradation to the stereo image or archival quality audio.</p>

<h3>Phase Stereo</h3>

<p>Phase stereo is the least aggressive means of gracefully dropping
resolution from the stereo image; it affects only diffuse imaging.</p>

<p>It's often quoted that the human ear is deaf to signal phase above
about 4kHz; this is nearly true and a passable rule of thumb, but it
can be demonstrated that even an average user can tell the difference
between high frequency in-phase and out-of-phase noise. Obviously
then, the statement is not entirely true. However, it's also the case
that one must resort to nearly such an extreme demonstration before
finding the counterexample.</p>

<p>'Phase stereo' is simply a more aggressive quantization of the polar
angle vector; above 4kHz it's generally quite safe to quantize noise
and noisy elements to only a handful of allowed phases, or to thin the
phase with respect to the magnitude.
The phases of high amplitude
pure tones may or may not be preserved more carefully (they are
relatively rare and L/R tend to be in phase, so there is generally
little reason not to spend a few more bits on them).</p>

<h4>example: eight phase stereo</h4>

<p>Vorbis may implement phase stereo coupling by preserving the entirety
of the magnitude vector (essential to fine amplitude and energy
resolution overall) and quantizing the angle vector to one of only
four possible values. Given that the magnitude vector may be positive
or negative, this results in left and right phase having eight
possible permutations, thus 'eight phase stereo':</p>

<p><img src="eightphase.png" alt="eight phase"/></p>

<p>Left and right may be in phase (positive or negative), the most common
case by far, or out of phase by 90 or 180 degrees.</p>

<h4>example: four phase stereo</h4>

<p>Similarly, four phase stereo takes the quantization one step further;
it allows only in-phase and 180 degree out-of-phase signals:</p>

<p><img src="fourphase.png" alt="four phase"/></p>

<h3>Point Stereo</h3>

<p>Point stereo eliminates the possibility of out-of-phase signal
entirely. Any diffuse quality to a sound source tends to collapse
inward to a point somewhere within the stereo image. A practical
example would be balanced reverberations within a large, live space;
normally the sound is diffuse and soft, giving a sonic impression of
volume. In point stereo, the reverberations would still exist, but
sound fairly firmly centered within the image (assuming the
reverberation was centered overall; if the reverberation is stronger
to the left, then the point of localization in point stereo would be
to the left). This effect is most noticeable at low and mid
frequencies and using headphones (which grant perfect stereo
separation).
Point stereo is a graceful but generally easy to
detect degradation to the sound quality and is thus used in frequency
ranges where it is least noticeable.</p>

<h3>Mixed Stereo</h3>

<p>Mixed stereo is the simultaneous use of more than one of the above
stereo encoding models, generally using more aggressive modes in
higher frequencies, lower amplitudes or 'nearly' in-phase sound.</p>

<p>It is also the case that near-DC frequencies should be encoded using
lossless coupling to avoid frame blocking artifacts.</p>

<h3>Vorbis Stereo Modes</h3>

<p>Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes
constructed out of lossless and point stereo. Phase stereo was used
in the rc2 encoder, but is not currently used for simplicity's sake. It
will likely be re-added to the stereo model in the future.</p>

<div id="copyright">
  The Xiph Fish Logo is a
  trademark (™) of Xiph.Org.<br/>

  These pages © 1994 - 2005 Xiph.Org. All rights reserved.
</div>

</body>
</html>