Soft-Fail Versus Conventional Redundancy
October 23rd, 2023
Redundancy, in the context of earth station electronics, refers to subsystems that are designed to limit service interruption to a matter of milliseconds (the time it takes for a switch to move from position one to position two) following the catastrophic failure of a major component (a 'component' being any significant device, such as a power amplifier, RF converter, LNA/B or modem). These subsystems come in many forms that are highly application-dependent.
The need for redundancy is a given for (dare I say it) 'mission critical applications' (one of the most overused terms in our industry). There are certainly instances where an interruption of service can cause great harm, like a life-threatening surgery being performed remotely via a satellite link or more importantly, the seamless transmission of a major sporting event.
Redundancy subsystems are built around a controller that monitors the health of the components in that subsystem and, upon detecting the failure of an online component, sends commands to a switching system that reroutes the signal to an identical standby that is sitting in a quiescent state, waiting to be called into service.
This architecture is what I refer to as 'conventional redundancy.' It's a concept that has been in play since the beginning of the industry for instances where a high level of availability is desired. It is a very simple approach, though not necessarily the best approach for all cases. But we'll get more into that later.
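The controller-and-switch arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the `healthy` flag and the two-position switch model are hypothetical and do not represent any vendor's controller.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical stand-in for an HPA, LNB, converter, etc."""
    name: str
    healthy: bool = True


class RedundancyController:
    """Illustrative 1:1 controller: one online unit, one dedicated standby."""

    def __init__(self, primary: Component, standby: Component) -> None:
        self.online = primary
        self.standby = standby
        self.switch_position = 1  # position one routes the signal via the primary

    def poll(self) -> str:
        """Check the online unit; if it has failed, switch to the standby.

        Returns the name of the unit carrying the signal after the check.
        """
        if not self.online.healthy and self.standby.healthy:
            # Swap roles and move the switch to its other position.
            self.online, self.standby = self.standby, self.online
            self.switch_position = 2 if self.switch_position == 1 else 1
        return self.online.name


if __name__ == "__main__":
    hpa_a, hpa_b = Component("HPA-A"), Component("HPA-B")
    controller = RedundancyController(hpa_a, hpa_b)
    hpa_a.healthy = False      # simulate a catastrophic failure
    print(controller.poll())   # the standby now carries the signal
```

Note that the standby does nothing useful until `poll()` detects a fault, which is exactly the trade-off of a dedicated backup.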
When it comes to earth station components commonly used to transmit and receive satcom links, we typically think in terms of 'single pol' and 'dual pol', which refers to the two orthogonally separated polarization fields (except for circular polarized satellites, but let's not go there) that exist between the ground station and the satellite. If the antenna feed is configured for dual pol access (full frequency reuse), four discrete ports are available - two transmit and two receive.
Conventional redundancy has a few downsides. The dedicated backup can't be used to carry additional services and still serve as a backup (at least not without the addition of some priority switching logic, and that can get messy). A dedicated backup is okay if the components are relatively inexpensive (cheap LNBs or low-power amplifiers). But if you're talking about more expensive products, like very high-power HPAs, you inherently tie up a lot of capital. And if the backup component sits in hot-standby (usually the case), it's aging along with the online component, while providing no benefit until there's a failure.
That pretty well covers the main elements of conventional redundancy, but the conversation wouldn't be complete without mentioning 'phase-combining' and its impact on redundancy architecture. It should be noted that solid state power amplifiers depend on Field-Effect Transistors (FETs) to generate RF power.
FETs come in all shapes and sizes with power levels that range from a few watts to somewhere around 130 watts at some frequencies. In order to generate respectable power levels, they must be cascaded, or phase-combined, such that their individual outputs can be summed. Phase-combining can be performed inside the amplifier with a break point of around 1kW or so at some frequencies. Beyond that, physical size and weight become prohibitive.
For higher power systems, one can choose to either externally phase combine these larger amplifiers, or one can choose to distribute the load over a larger number of smaller amplifiers. When it comes to redundancy in high-power systems, an alternative approach to consider is a modular system based on 'Soft-Fail Redundancy'.
Operationally, a soft-fail system isn't so different from a system that uses conventional redundancy. But behind the curtain, soft-fail systems are considerably more sophisticated and carry a host of additional benefits and cost savings that might not be readily apparent at a casual glance. But we'll get into the intimate details later.
When considering the purchase of high-power amplifier systems, two important metrics are 'Mean-Time-Between-Failures' (MTBF) and 'Mean-Time-To-Repair' (MTTR). They are important because together they determine the 'Availability' of the system: the fraction of its projected lifespan during which it will be usable or, in simpler terms, ROI.
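As a rough illustration of how the two metrics combine, the standard steady-state formula is Availability = MTBF / (MTBF + MTTR). The sketch below applies it; the MTBF and MTTR values in the usage lines are made-up numbers chosen only to show the effect of repair time, not figures for any real product.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is usable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year implied by the availability."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    mtbf = 50_000  # hypothetical MTBF in hours
    # Hypothetical repair times: a factory return versus an on-site swap.
    for label, mttr in (("factory return", 1_000), ("on-site swap", 2)):
        print(label,
              f"availability={availability(mtbf, mttr):.5f}",
              f"downtime={annual_downtime_hours(mtbf, mttr):.1f} h/yr")
```

With the MTBF held fixed, shrinking the MTTR is what moves the availability figure, which is why repair logistics matter as much as design quality.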
Where MTBF is more or less a reflection of a product's design quality, like how well it's able to extract heat from the transistors under high ambient conditions, MTTR is more of a reflection of how quickly and efficiently a failed component can be removed from the system, fixed and reinstalled. In other words, how long will the system be down if a component fails?
The steps include removal, packing, freight-time back to the factory, Customs-clearance, repair-time, test-time, repacking, freight-time back to the site, Customs-clearance and re-installation. For systems that employ conventional redundancy, the MTTR can be greatly reduced if there is a spare component at the site that can be placed into service while the failed component is off being repaired.
In this case, a lower MTTR can come at a significant expense, particularly for high-power systems. You now have two high-dollar components sitting in stasis - the offline backup plus the shelf spare. This could equate to hundreds of thousands of dollars in exchange for peace of mind (or job security). Another option is to employ soft-fail redundancy.
Back in the early 2000s, Maxtech, a manufacturer of solid state satcom power amplifiers, introduced a radical new concept in high power amplifier (HPA) redundancy, along with two new terms - soft-fail and hot-swap. The product was badged 'Modumax' and consisted of a rack mount chassis with eight power modules phase-combined to produce 1kW of Psat power in C-band.
It was eventually expanded into other frequency bands and power levels, but the cool thing about it was that the failure of a single module resulted in a maximum power loss of 1.2dB. And if sufficient power was held in back-off, the output of the remaining seven modules would automatically be increased to compensate for that loss (no mechanical switching required) - so the total RF output of the amplifier would remain constant (soft-fail).
And the beauty was that the failed module could be removed and replaced while the amplifier was in service (hot swap). As a result, the expense of sparing was reduced to the cost of a single module (and perhaps a power supply module that was also hot swap capable). In that scenario, MTTR went down from weeks or months to minutes or hours.
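The 1.2dB figure for an eight-module chassis is consistent with a simple model of phase-combining. The sketch below assumes an ideal, isolated combiner fed by N identical modules, where the output falls by 20·log10((N − k)/N) dB when k modules drop out; real hardware will deviate from this, and the function names are only illustrative.

```python
import math


def soft_fail_loss_db(total_modules: int, failed_modules: int = 1) -> float:
    """Drop in combined output power (dB) under the ideal-combiner assumption."""
    working = total_modules - failed_modules
    if total_modules <= 0 or working <= 0:
        raise ValueError("at least one module must remain in service")
    return -20.0 * math.log10(working / total_modules)


def can_hold_output(total_modules: int, backoff_db: float,
                    failed_modules: int = 1) -> bool:
    """True if the power held in back-off covers the loss, so the remaining
    modules can be driven harder to keep total output constant."""
    return backoff_db >= soft_fail_loss_db(total_modules, failed_modules)


if __name__ == "__main__":
    print(f"{soft_fail_loss_db(8):.2f} dB lost with 1 of 8 modules failed")
    print(can_hold_output(8, backoff_db=2.0))  # hypothetical 2 dB of back-off
```

Under this model, a system run with at least as much back-off as the single-failure loss can absorb that failure with no change in output.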
I was responsible for sales at the systems integration facility of VertexRSI at the time Modumax was introduced, and market acceptance was tepid at best. The comfort zone of the industry was centered around legacy, conventional redundancy (welcome to satcom). It took a while, but following a few successes, soft-fail became a staple of the industry for high power applications.
Modumax was only available in a rack mount configuration at a time when high power amplifier systems were moving out to the antenna to manage the RF insertion loss associated with long waveguide runs, reduce utility costs and eliminate the need for RF equipment shelters. Regardless, Modumax was a very successful product and remains so to this day.
A decade would pass before competing soft-fail systems came to market, when Paradise Datacom introduced 'PowerMAX' and Advantech launched 'Summit'. In both cases, the RF modules were complete amplifiers that had the capacity to generate significantly higher levels of RF output power and virtually eliminated single points of failure.
When Advantech Wireless brought the second and third generations of Summit to market (Summit II in 2019 and Summit III in 2022), great care was taken to exploit the benefits of soft-fail system architecture. These included individual amplifier modules, each capable of generating up to 1kW of RF power, and the introduction of CAN-Bus as an operating platform, chosen for its high processing speed and component-level diagnostics capability.
When CAN-Bus became integral to each amplifier in the system, the need for outboard controllers was eliminated, thus allowing any amplifier in the system to take over as the master in the event of a module failure.
In soft-fail, unlike in a conventional redundancy system, all of the amplifiers are in service and sharing the load, with the health of each (down to the device level) being constantly monitored and reported to the master in real time.
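To make the idea of 'any amplifier can take over as master' concrete, here is a minimal sketch of one possible hand-over rule. It is an assumption for illustration only, not a description of how Summit or any other product elects its master: each amplifier is given a hypothetical numeric node ID, and the lowest-numbered healthy node is treated as master.

```python
from typing import Dict, Optional


def elect_master(health_by_node: Dict[int, bool]) -> Optional[int]:
    """Return the lowest node ID that reports healthy, or None if none do.

    health_by_node maps a hypothetical node ID to its latest health report.
    """
    healthy_nodes = [node for node, ok in health_by_node.items() if ok]
    return min(healthy_nodes) if healthy_nodes else None


if __name__ == "__main__":
    # Four amplifiers sharing the load; all are in service.
    reports = {1: True, 2: True, 3: True, 4: True}
    print(elect_master(reports))  # node 1 acts as master

    reports[1] = False            # the master's own module fails
    print(elect_master(reports))  # the next healthy node takes over
```

Because every node can evaluate the same rule from the same health reports, no outboard controller is needed to decide who is in charge.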
Since the total system output power can be distributed over a larger number of smaller amplifiers, the loss from a single amplifier failure is reduced (0.6dB for a 16-amplifier system). The amplifiers are smaller and easier to handle, are less expensive to spare and the return freight costs for service are much lower.
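The relationship between amplifier count and single-failure loss can also be turned around into a sizing question: how many amplifiers are needed to keep the loss from one failure under a given target? The sketch below assumes the same ideal-combiner model, in which losing one of N equal amplifiers costs 20·log10(N/(N − 1)) dB; the function names and search limit are illustrative.

```python
import math


def single_failure_loss_db(amplifiers: int) -> float:
    """Loss in dB when one of `amplifiers` equal units fails (ideal combiner)."""
    if amplifiers < 2:
        raise ValueError("need at least two amplifiers to survive a failure")
    return 20.0 * math.log10(amplifiers / (amplifiers - 1))


def smallest_system(max_loss_db: float, limit: int = 1024) -> int:
    """Smallest amplifier count whose single-failure loss is within the target."""
    for count in range(2, limit + 1):
        if single_failure_loss_db(count) <= max_loss_db:
            return count
    raise ValueError("target loss not reachable within the search limit")


if __name__ == "__main__":
    print(f"16 amplifiers: {single_failure_loss_db(16):.2f} dB")
    print(f"amplifiers needed for <= 1.0 dB: {smallest_system(1.0)}")
```

The loss shrinks quickly as the count grows, so the gain from adding amplifiers has to be weighed against the extra combining hardware.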
Back to the MTBF and MTTR metrics. In soft-fail systems, switching is not required to facilitate redundancy, which removes the switch as a potential point of failure and so increases the MTBF, and at no point is the signal severed during the backup process.
As is always the case, one size doesn't fit all. Both conventional and soft-fail redundancy platforms have their respective 'perfect fit' scenarios, but it's great to know that a few new options are available for operators around the planet to ponder.
For more information, visit advantechwireless.com