<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE chapter PUBLIC "-//Samba-Team//DTD DocBook V4.2-Based Variant V1.0//EN" "http://www.samba.org/samba/DTD/samba-doc">
<chapter id="SambaHA">
<chapterinfo>
	&author.jht;
	&author.jeremy;
</chapterinfo>

<title>High Availability</title>

<sect1>
<title>Features and Benefits</title>

<para>
<indexterm><primary>availability</primary></indexterm>
<indexterm><primary>intolerance</primary></indexterm>
<indexterm><primary>vital task</primary></indexterm>
Network administrators are often concerned about the availability of file and print
services. Network users are inclined toward intolerance of failures in the services they
depend on to perform vital tasks.
</para>

<para>
A sign in a computer room served to remind staff of their responsibilities. It read:
</para>

<blockquote>
<para>
<indexterm><primary>fail</primary></indexterm>
<indexterm><primary>managed by humans</primary></indexterm>
<indexterm><primary>economically wise</primary></indexterm>
<indexterm><primary>anticipate failure</primary></indexterm>
All humans fail, in both great and small ways we fail continually. Machines fail too.
Computers are machines that are managed by humans, the fallout from failure
can be spectacular. Your responsibility is to deal with failure, to anticipate it
and to eliminate it as far as is humanly and economically wise to achieve.
Are your actions part of the problem or part of the solution?
</para>
</blockquote>

<para>
If we are to deal with failure in a planned and productive manner, then first we must
understand the problem. That is the purpose of this chapter.
</para>

<para>
<indexterm><primary>high availability</primary></indexterm>
<indexterm><primary>CIFS/SMB</primary></indexterm>
<indexterm><primary>state of knowledge</primary></indexterm>
Parenthetically, in the following discussion there are seeds of information on how to
provision a network infrastructure against failure. Our purpose here is not to provide
a lengthy dissertation on the subject of high availability. Additionally, we have made
a conscious decision to not provide detailed working examples of high availability
solutions; instead we present an overview of the issues in the hope that someone will
rise to the challenge of providing a detailed document that is focused purely on
presentation of the current state of knowledge and practice in high availability as it
applies to the deployment of Samba and other CIFS/SMB technologies.
</para>

</sect1>

<sect1>
<title>Technical Discussion</title>

<para>
<indexterm><primary>SambaXP conference</primary></indexterm>
<indexterm><primary>Germany</primary></indexterm>
<indexterm><primary>inspired structure</primary></indexterm>
The following summary was part of a presentation by Jeremy Allison at the SambaXP 2003
conference that was held at Goettingen, Germany, in April 2003. Material has been added
from other sources, but it was Jeremy who inspired the structure that follows.
</para>

	<sect2>
	<title>The Ultimate Goal</title>

	<para>
<indexterm><primary>clustering technologies</primary></indexterm>
<indexterm><primary>affordable power</primary></indexterm>
<indexterm><primary>unstoppable services</primary></indexterm>
	All clustering technologies aim to achieve one or more of the following:
	</para>

	<itemizedlist>
		<listitem><para>Obtain the maximum affordable computational power.</para></listitem>
		<listitem><para>Obtain faster program execution.</para></listitem>
		<listitem><para>Deliver unstoppable services.</para></listitem>
		<listitem><para>Avert points of failure.</para></listitem>
		<listitem><para>Achieve the most effective utilization of resources.</para></listitem>
	</itemizedlist>

	<para>
	A clustered file server ideally has the following properties:
<indexterm><primary>clustered file server</primary></indexterm>
<indexterm><primary>connect transparently</primary></indexterm>
<indexterm><primary>transparently reconnected</primary></indexterm>
<indexterm><primary>distributed file system</primary></indexterm>
	</para>

	<itemizedlist>
		<listitem><para>All clients can connect transparently to any server.</para></listitem>
		<listitem><para>A server can fail and clients are transparently reconnected to another server.</para></listitem>
		<listitem><para>All servers serve out the same set of files.</para></listitem>
		<listitem><para>All file changes are immediately seen on all servers.</para>
			<itemizedlist><listitem><para>Requires a distributed file system.</para></listitem></itemizedlist></listitem>
		<listitem><para>Infinite ability to scale by adding more servers or disks.</para></listitem>
	</itemizedlist>

	</sect2>

	<sect2>
	<title>Why Is This So Hard?</title>

	<para>
	In short, the problem is one of <emphasis>state</emphasis>.
	</para>

	<itemizedlist>
		<listitem>
			<para>
<indexterm><primary>state information</primary></indexterm>
			All TCP/IP connections are dependent on state information.
			</para>
			<para>
<indexterm><primary>TCP failover</primary></indexterm>
			The TCP connection involves a packet sequence number. This
			sequence number would need to be dynamically updated on all
			machines in the cluster to effect seamless TCP failover.
			</para>
		</listitem>
		<listitem>
			<para>
<indexterm><primary>CIFS/SMB</primary></indexterm>
<indexterm><primary>TCP</primary></indexterm>
			CIFS/SMB (the Windows networking protocols) uses TCP connections.
			</para>
			<para>
			This means that from a basic design perspective, failover is not
			seriously considered.
			<itemizedlist>
				<listitem><para>
				All current SMB clusters are failover solutions
				&smbmdash; they rely on the clients to reconnect. They provide server
				failover, but clients can lose information due to a server failure.
<indexterm><primary>server failure</primary></indexterm>
				</para></listitem>
			</itemizedlist>
			</para>
		</listitem>
		<listitem>
			<para>
			Servers keep state information about client connections.
			<itemizedlist>
<indexterm><primary>state</primary></indexterm>
				<listitem><para>CIFS/SMB involves a lot of state.</para></listitem>
				<listitem><para>Every file open must be compared with other open files
						to check share modes.</para></listitem>
			</itemizedlist>
			</para>
		</listitem>
	</itemizedlist>

		<sect3>
		<title>The Front-End Challenge</title>

		<para>
<indexterm><primary>cluster servers</primary></indexterm>
<indexterm><primary>single server</primary></indexterm>
<indexterm><primary>TCP data streams</primary></indexterm>
<indexterm><primary>front-end virtual server</primary></indexterm>
<indexterm><primary>virtual server</primary></indexterm>
<indexterm><primary>de-multiplex</primary></indexterm>
<indexterm><primary>SMB</primary></indexterm>
		To make it possible for a cluster of file servers to appear as a single server that has one
		name and one IP address, the incoming TCP data streams from clients must be processed by the
		front-end virtual server. This server must de-multiplex the incoming packets at the SMB protocol
		layer and then feed each SMB packet to the appropriate server in the cluster.
		</para>

		<para>
<indexterm><primary>IPC$ connections</primary></indexterm>
<indexterm><primary>RPC calls</primary></indexterm>
		One could direct all IPC$ connections and RPC calls to one server to handle printing and user
		lookup requirements. RPC printing handles are shared between different IPC$ sessions &smbmdash; it is
		hard to split this across clustered servers!
		</para>

		<para>
		Conceptually speaking, all other servers would then provide only file services. This is a simpler
		problem to concentrate on.
		</para>

		</sect3>

		<sect3>
		<title>Demultiplexing SMB Requests</title>

		<para>
<indexterm><primary>SMB requests</primary></indexterm>
<indexterm><primary>SMB state information</primary></indexterm>
<indexterm><primary>front-end virtual server</primary></indexterm>
<indexterm><primary>complicated problem</primary></indexterm>
		De-multiplexing of SMB requests requires knowledge of SMB state information,
		all of which must be held by the front-end <emphasis>virtual</emphasis> server.
		This is a perplexing and complicated problem to solve.
		</para>

		<para>
<indexterm><primary>vuid</primary></indexterm>
<indexterm><primary>tid</primary></indexterm>
<indexterm><primary>fid</primary></indexterm>
		Windows XP and later have changed semantics so state information (vuid, tid, fid)
		must match for a successful operation. This makes things simpler than before and is a
		positive step forward.
		</para>

		<para>
<indexterm><primary>SMB requests</primary></indexterm>
<indexterm><primary>Terminal Server</primary></indexterm>
		SMB requests would be sent by vuid to their associated server. No code exists today to
		effect this solution. This problem is conceptually similar to the problem of
		correctly handling multiple requests from Windows 2000
		Terminal Server in Samba.
		</para>

		<para>
<indexterm><primary>de-multiplexing</primary></indexterm>
		One possibility is to start by exposing the server pool to clients directly.
		This could eliminate the de-multiplexing step.
		</para>

		</sect3>

		<sect3>
		<title>The Distributed File System Challenge</title>

		<para>
<indexterm><primary>Distributed File Systems</primary></indexterm>
		There exist many distributed file systems for UNIX and Linux.
		</para>

		<para>
<indexterm><primary>backend</primary></indexterm>
<indexterm><primary>SMB semantics</primary></indexterm>
<indexterm><primary>share modes</primary></indexterm>
<indexterm><primary>locking</primary></indexterm>
<indexterm><primary>oplock</primary></indexterm>
<indexterm><primary>distributed file systems</primary></indexterm>
		Many could be adopted as the backend for our cluster, so long as SMB
		semantics are kept in mind (share modes, locking, and oplock issues in particular).
		Common free distributed file systems include:
<indexterm><primary>NFS</primary></indexterm>
<indexterm><primary>AFS</primary></indexterm>
<indexterm><primary>OpenGFS</primary></indexterm>
<indexterm><primary>Lustre</primary></indexterm>
		</para>

		<itemizedlist>
			<listitem><para>NFS</para></listitem>
			<listitem><para>AFS</para></listitem>
			<listitem><para>OpenGFS</para></listitem>
			<listitem><para>Lustre</para></listitem>
		</itemizedlist>

		<para>
<indexterm><primary>server pool</primary></indexterm>
		The server pool (cluster) can use any distributed file system backend if all SMB
		semantics are performed within this pool.
		</para>

		</sect3>

		<sect3>
		<title>Restrictive Constraints on Distributed File Systems</title>

		<para>
<indexterm><primary>SMB services</primary></indexterm>
<indexterm><primary>oplock handling</primary></indexterm>
<indexterm><primary>server pool</primary></indexterm>
<indexterm><primary>backend file system pool</primary></indexterm>
		Where a clustered server provides purely SMB services, oplock handling
		may be done within the server pool without imposing a need for this to
		be passed to the backend file system pool.
		</para>

		<para>
<indexterm><primary>NFS</primary></indexterm>
<indexterm><primary>interoperability</primary></indexterm>
		On the other hand, where the server pool also provides NFS or other file services,
		it will be essential that the implementation be oplock-aware so it can
		interoperate with SMB services. This is a significant challenge today. A failure
		to provide this interoperability will result in a significant loss of performance that will be
		keenly felt by users of Microsoft Windows clients.
		</para>

		<para>
		Last, all state information must be shared across the server pool.
		</para>

		</sect3>

		<sect3>
		<title>Server Pool Communications</title>

		<para>
<indexterm><primary>POSIX semantics</primary></indexterm>
<indexterm><primary>SMB</primary></indexterm>
<indexterm><primary>POSIX locks</primary></indexterm>
<indexterm><primary>SMB locks</primary></indexterm>
		Most backend file systems support POSIX file semantics. This makes it difficult
		to push SMB semantics back into the file system. POSIX locks have different properties
		and semantics from SMB locks.
		</para>

		<para>
<indexterm><primary>smbd</primary></indexterm>
<indexterm><primary>tdb</primary></indexterm>
<indexterm><primary>Clustered smbds</primary></indexterm>
		All <command>smbd</command> processes in the server pool must of necessity communicate
		very quickly. The current <parameter>tdb</parameter> file structure that Samba
		uses is not suitable for this purpose across a network. Clustered <command>smbd</command>s must use something else.
		</para>

		</sect3>

		<sect3>
		<title>Server Pool Communications Demands</title>

		<para>
		High-speed interserver communication in the server pool is a design prerequisite
		for a fully functional system.
		Possibilities for this include:
		</para>

		<itemizedlist>
<indexterm><primary>Myrinet</primary></indexterm>
<indexterm><primary>scalable coherent interface</primary><see>SCI</see></indexterm>
			<listitem><para>
			Proprietary shared memory bus (example: Myrinet or SCI [scalable coherent interface]).
			These are high-cost items.
			</para></listitem>

			<listitem><para>
			Gigabit Ethernet (now quite affordable).
			</para></listitem>

			<listitem><para>
			Raw Ethernet framing (to bypass TCP and UDP overheads).
			</para></listitem>
		</itemizedlist>

		<para>
		We have yet to identify metrics for performance demands to enable this to happen
		effectively.
		</para>

		</sect3>

		<sect3>
		<title>Required Modifications to Samba</title>

		<para>
		Samba needs to be significantly modified to work with a high-speed server interconnect
		system to permit transparent failover clustering.
		</para>

		<para>
		Particular functions inside Samba that will be affected include:
		</para>

		<itemizedlist>
			<listitem><para>
			The locking database, oplock notifications,
			and the share mode database.
			</para></listitem>

			<listitem><para>
<indexterm><primary>failure semantics</primary></indexterm>
<indexterm><primary>oplock messages</primary></indexterm>
			Failure semantics need to be defined. Samba behaves the same way as Windows.
			When oplock messages fail, a file open request is allowed, but this is
			potentially dangerous in a clustered environment. So how should interserver
			pool failure semantics function, and how should such functionality be implemented?
			</para></listitem>

			<listitem><para>
			Should this be implemented using a point-to-point lock manager, or can this
			be done using multicast techniques?
			</para></listitem>

		</itemizedlist>

		</sect3>
	</sect2>

	<sect2>
	<title>A Simple Solution</title>

	<para>
<indexterm><primary>failover servers</primary></indexterm>
<indexterm><primary>exported file system</primary></indexterm>
<indexterm><primary>distributed locking protocol</primary></indexterm>
	Allowing failover servers to handle different functions within the exported file system
	removes the problem of requiring a distributed locking protocol.
	</para>

	<para>
<indexterm><primary>high-speed server interconnect</primary></indexterm>
<indexterm><primary>complex file name space</primary></indexterm>
	If only one server is active in a pair, the need for a high-speed server interconnect is avoided.
	This allows the use of existing high-availability solutions, instead of inventing a new one.
	This simpler solution comes at a price &smbmdash; the need to manage a more
	complex file name space. Since there is no longer a single file system, administrators
	must remember where all services are located &smbmdash; a complexity not easily dealt with.
	</para>

	<para>
<indexterm><primary>virtual server</primary></indexterm>
	The <emphasis>virtual server</emphasis> is still needed to redirect requests to backend
	servers. Backend file space integrity is the responsibility of the administrator.
	</para>

	</sect2>

	<sect2>
	<title>High-Availability Server Products</title>

	<para>
<indexterm><primary>resource failover</primary></indexterm>
<indexterm><primary>high-availability services</primary></indexterm>
<indexterm><primary>dedicated heartbeat</primary></indexterm>
<indexterm><primary>LAN</primary></indexterm>
<indexterm><primary>failover process</primary></indexterm>
	Failover servers must communicate in order to handle resource failover. This is essential
	for high-availability services.
	The use of a dedicated heartbeat is a common technique to
	introduce some intelligence into the failover process. This is often done over a dedicated
	link (LAN or serial).
	</para>

	<para>
<indexterm><primary>SCSI</primary></indexterm>
<indexterm><primary>Red Hat Cluster Manager</primary></indexterm>
<indexterm><primary>Microsoft Wolfpack</primary></indexterm>
<indexterm><primary>Fiber Channel</primary></indexterm>
<indexterm><primary>failover communication</primary></indexterm>
	Many failover solutions (like Red Hat Cluster Manager and Microsoft Wolfpack)
	can use a shared SCSI or Fiber Channel disk storage array for failover communication.
	Information regarding Red Hat high availability solutions for Samba may be obtained from
	<ulink url="http://www.redhat.com/docs/manuals/enterprise/RHEL-AS-2.1-Manual/cluster-manager/s1-service-samba.html">www.redhat.com</ulink>.
	</para>

	<para>
<indexterm><primary>Linux High Availability project</primary></indexterm>
	The Linux High Availability project is a resource worthy of consultation if your desire is
	to build a highly available Samba file server solution. Please consult the home page at
	<ulink url="http://www.linux-ha.org/">www.linux-ha.org/</ulink>.
	</para>

	<para>
<indexterm><primary>backend failures</primary></indexterm>
<indexterm><primary>continuity of service</primary></indexterm>
	Front-end server complexity remains a challenge for high availability because it must deal
	gracefully with backend failures, while at the same time providing continuity of service
	to all network clients.
	</para>

	</sect2>

	<sect2>
	<title>MS-DFS: The Poor Man's Cluster</title>

	<para>
<indexterm><primary>MS-DFS</primary></indexterm>
<indexterm><primary>DFS</primary><see>MS-DFS, Distributed File Systems</see></indexterm>
	MS-DFS links can be used to redirect clients to disparate backend servers.
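	As a minimal sketch of such links, assume a hypothetical share named <literal>dfs</literal>
	rooted at <filename>/export/dfsroot</filename>; both names are illustrative only and are not
	taken from any particular deployment. The <filename>smb.conf</filename> file on the Samba
	server hosting the DFS root might then contain:
<programlisting>
[global]
	host msdfs = yes

[dfs]
	path = /export/dfsroot
	msdfs root = yes
</programlisting>
	The links themselves are symbolic links created inside that directory. The server and share
	names used here (<literal>serverA</literal>, <literal>serverB</literal>,
	<literal>serverC</literal>) are likewise assumptions made for illustration:
<programlisting>
# run as root inside /export/dfsroot (hypothetical path)
ln -s msdfs:serverA\\shareA linka
# a link may name alternate targets, separated by commas
ln -s msdfs:serverB\\share,serverC\\share linkb
</programlisting>
	A DFS-aware client that opens <literal>linka</literal> under the <literal>dfs</literal> share
	is referred to <literal>shareA</literal> on <literal>serverA</literal>.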
	This pushes
	complexity back to the network client, where Microsoft has already included support for it.
	MS-DFS creates the illusion of a simple, continuous file system name space that works even
	at the file level.
	</para>

	<para>
	Above all, at the cost of complexity of management, a distributed system (pseudo-cluster) can
	be created using existing Samba functionality.
	</para>

	</sect2>

	<sect2>
	<title>Conclusions</title>

	<itemizedlist>
		<listitem><para>Transparent SMB clustering is hard to do!</para></listitem>
		<listitem><para>Client failover is the best we can do today.</para></listitem>
		<listitem><para>Much more work is needed before a practical and manageable high-availability transparent cluster solution will be possible.</para></listitem>
		<listitem><para>MS-DFS can be used to create the illusion of a single transparent cluster.</para></listitem>
	</itemizedlist>

	</sect2>

</sect1>
</chapter>