The Hercules of Safety Critical systems
In principle, all products and systems must be safe; in other words, they must not carry any unacceptable risk. Some risk always remains, however: no product is 100% safe.
Functional safety means first identifying the hazards that can cause serious consequences for people or the environment and then determining the level of risk each one poses. If a risk level is higher than acceptable, measures must be taken to reduce the risk, and it must be defined how the hazardous situations can be detected and how their consequences can be mitigated.
Functional safety has to be taken into consideration in many areas. Factories, for example, have monitoring processes that ensure nothing unplanned happens. The control system can, for example, make sure that a paper machine roll doesn't start rotating too fast or that a wood chip digester doesn't heat up too much.
Functional safety requirements are defined in EU directives, national legislation, and standards. There are many different directives, for example the machinery, elevator, medical, and ATEX directives. The machinery directive specifies, among other things, that the safety of machinery is the responsibility of its manufacturer. This means that the device has to be designed to function safely throughout its entire lifespan.
Safety is built in layers
The safety of an industrial process, for example, can be thought of as being built in layers in the following way:
First, the process can be inherently safe or it can be designed to be safe. Such a process would not cause serious harm to people or the environment even if something did go wrong. An example could be a chemical process that is designed so that its materials, pressures, and temperatures remain in the safe range for people and the environment.
In the next safety layer, the process regulates itself. When raw products are heated, for example, the correct temperature is maintained by the factory’s automation system.
The next layer contains countermeasures to things that can go wrong. For example, a control room operator carries out predetermined measures if the automation system notices that certain limits have been exceeded.
After this comes the autonomous safety system. If the previous layers of protection are not able to maintain safety, the autonomous safety system kicks in and drives the system or process into a safe state. If, for example, a paper machine's motor starts to run too fast, the safety system turns off the power to the motor or even applies the motor’s brake to stop it immediately.
If none of the above safety layers helps, active protection systems such as fire extinguishing systems are activated. There are also passive safety systems, such as fire doors. These are the last resort for mitigating the possible consequences.
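The escalation through the layers described above can be sketched in a few lines of code. This is only an illustration: the speed limits, the `Action` names, and the `select_action` function are hypothetical and are not taken from any standard or real control system.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                      # process is within its normal range
    CONTROL = "automatic control"      # automation system regulates the process
    OPERATOR_ALARM = "operator alarm"  # operator carries out predetermined measures
    SAFE_STATE = "safe state"          # autonomous safety system cuts power or brakes


# Hypothetical limits for a roll speed in rpm (assumed values for illustration;
# real limits come from the risk analysis of the application).
SETPOINT_RPM = 1000.0
CONTROL_BAND_RPM = 20.0
ALARM_LIMIT_RPM = 1100.0
TRIP_LIMIT_RPM = 1200.0


def select_action(speed_rpm: float) -> Action:
    """Return the outermost protection layer that the measured speed calls for."""
    if speed_rpm >= TRIP_LIMIT_RPM:
        return Action.SAFE_STATE
    if speed_rpm >= ALARM_LIMIT_RPM:
        return Action.OPERATOR_ALARM
    if abs(speed_rpm - SETPOINT_RPM) > CONTROL_BAND_RPM:
        return Action.CONTROL
    return Action.NONE


if __name__ == "__main__":
    for rpm in (1000.0, 1050.0, 1150.0, 1250.0):
        print(rpm, select_action(rpm).value)
```

In a real plant the layers would not share one function like this. The autonomous safety system in particular is normally kept separate from the ordinary automation system so that a fault in one does not disable the other.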
Safety integrity level according to IEC 61508
The IEC 61508 standard has an important role as an umbrella standard for the functional safety of electrical, electronic, and programmable electronic systems. In IEC 61508, safety integrity is expressed through so-called safety integrity levels (SIL). These SILs indicate how effectively the autonomous safety system is able to reduce the risks. SIL 1 applications have the lowest level of risk reduction, SIL 4 the highest.
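To make the link between SIL and risk reduction concrete, the sketch below uses the target bands that IEC 61508 gives for safety functions operating in low-demand mode, expressed as the average probability of dangerous failure on demand (PFDavg). The function names are hypothetical, and the sketch only classifies a given PFDavg value; it does not replace the risk analysis that determines which SIL an application needs.

```python
# Low-demand mode target bands from IEC 61508: average probability of
# dangerous failure on demand (PFDavg). Lower bound inclusive, upper exclusive.
PFD_BANDS = {
    4: (1e-5, 1e-4),
    3: (1e-4, 1e-3),
    2: (1e-3, 1e-2),
    1: (1e-2, 1e-1),
}


def sil_for_pfd(pfd_avg: float):
    """Return the SIL band that a PFDavg value falls into, or None if outside all bands."""
    for sil, (low, high) in PFD_BANDS.items():
        if low <= pfd_avg < high:
            return sil
    return None


def risk_reduction_factor(pfd_avg: float) -> float:
    """Risk reduction factor: how many times less often the hazard gets through."""
    return 1.0 / pfd_avg


if __name__ == "__main__":
    for pfd in (5e-2, 5e-3, 5e-4, 5e-5):
        print(pfd, "SIL", sil_for_pfd(pfd), "RRF", round(risk_reduction_factor(pfd)))
```

A safety function with a PFDavg of 5e-4, for example, falls in the SIL 3 band and reduces the frequency of the hazardous event by a factor of about 2000.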
The safety integrity level requirements are based on risk analyses, i.e. risk assessments. SIL 4 is very strict and used mainly where risks have to be minimized in every way possible, such as in nuclear power plants.
There are no rules of thumb for SIL classification. Classification is always determined separately for each application and often requires input from an expert. The bigger the risks and the more serious the possible consequence, the higher the safety classification has to be. The choice of classification always needs to be justified by sufficient evidence.
As a basic rule, the components in the safety system should be reliable. When this is not the case, the alternatives are to add diagnostics good enough to detect faults, to increase redundancy, or both, so that the safety functions are still ensured. This is always a trade-off. When there are dozens or hundreds of components that can become defective, the theoretical failure rate can get so high that it is no longer possible to achieve the required SIL without using fault diagnostics.
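The effect of component count and diagnostics can be illustrated with a simplified calculation. The sketch assumes a single-channel system in which the failure rates of all components add up, and it compares the rate of dangerous undetected failures with the SIL 3 limit that IEC 61508 gives for high-demand or continuous mode (fewer than 1e-7 dangerous failures per hour). The component values are invented, and a real assessment also has to consider common-cause failures, proof testing, and architectural constraints.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    dangerous_fit: float        # dangerous failure rate in FIT (failures per 1e9 hours)
    diagnostic_coverage: float  # fraction of dangerous failures detected, 0.0 to 1.0


# IEC 61508, high-demand/continuous mode: SIL 3 requires fewer than
# 1e-7 dangerous failures per hour.
SIL3_PFH_LIMIT = 1e-7


def dangerous_undetected_per_hour(components) -> float:
    """Sum the dangerous failure rates that the diagnostics do NOT catch."""
    total_fit = sum(
        c.dangerous_fit * (1.0 - c.diagnostic_coverage) for c in components
    )
    return total_fit * 1e-9  # convert FIT to failures per hour


if __name__ == "__main__":
    # Hypothetical design: 200 components with 2 FIT of dangerous failures each.
    without_diag = [Component(f"part{i}", 2.0, 0.0) for i in range(200)]
    with_diag = [Component(f"part{i}", 2.0, 0.9) for i in range(200)]
    for label, parts in (("no diagnostics", without_diag), ("90% coverage", with_diag)):
        rate = dangerous_undetected_per_hour(parts)
        print(label, rate, "meets SIL 3 limit:", rate < SIL3_PFH_LIMIT)
```

With these assumed numbers the design without diagnostics accumulates about 4e-7 dangerous failures per hour and misses the limit, while 90% diagnostic coverage brings the undetected rate down to about 4e-8 per hour.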
Product certification means that a third party, a certification company, has issued a certificate of the product’s compliance with the relevant requirements. This is nowadays very often TÜV. TÜV inspects the design and documentation and checks that they comply with the relevant standards.
It is much easier to implement the safety functions using previously certified components. That is why a great variety of products and components are now available as already certified. Such components include various logic controllers, motor and device drives, measurement devices, and safety switches. The certification is a strong sales argument for these devices.
There are also safety-certified electronic components. An example is the Hercules processor from Texas Instruments. The idea of Hercules is that the certification of an entire product is easier when the critical building blocks are already certified. If ordinary, uncertified processors are used in product design, one has to search for the reliability data oneself and plan and carry out the development process in compliance with the relevant standards.
The Hercules processor is actually a platform that includes several different product families. The two most important product families are the RM and the TMS570. Hercules is based on a Cortex-R core and was developed to comply with SIL 3. The RM series is designed mainly for industrial and medical use, and the TMS570 is certified as SIL 3 and also according to the ISO 26262 ASIL D standard for automotive use.
The safety functions and diagnostics are implemented in the microcontroller according to the “safe island” concept. The safety functionality is protected by diagnostic measures and isolated from the non-safety-critical parts. This architecture contains two processor cores. From a safety perspective, the chip uses the “one out of one with diagnostics” (1oo1D) concept, in which one core carries out the safety functions while diagnostics and cross-comparison with the other core verify its operation. In principle it operates as a single channel, but with diagnostics added. In this way the system itself tries to ensure that it is working correctly.
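The 1oo1D idea can be illustrated with a small software analogy. On the actual chip the comparison is part of the processor's own diagnostics rather than application code, so the sketch below only shows the principle; the names `run_with_cross_check` and `SafeStateRequired` are hypothetical.

```python
class SafeStateRequired(Exception):
    """Raised when the diagnostic comparison detects a disagreement."""


def run_with_cross_check(compute, checker, value):
    """Single functional channel (compute) supervised by a comparison channel (checker).

    Only the result of `compute` is used as output. `checker` exists purely
    as a diagnostic: if the two disagree, no result is released and the
    caller has to drive the system into its safe state.
    """
    primary = compute(value)
    shadow = checker(value)
    if primary != shadow:
        raise SafeStateRequired(f"mismatch: {primary!r} != {shadow!r}")
    return primary


def scale(x):
    return x * 2


def faulty_scale(x):
    # Simulates a fault in one of the channels.
    return x * 2 + 1


if __name__ == "__main__":
    print(run_with_cross_check(scale, scale, 21))
    try:
        run_with_cross_check(scale, faulty_scale, 21)
    except SafeStateRequired as err:
        print("entering safe state:", err)
```

The point of the design is that the second channel adds no functionality of its own. It only makes a fault in the single functional channel detectable, which is what distinguishes 1oo1D from a redundant two-channel architecture.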
What benefits does Hercules bring?
Texas Instruments has certified the Hercules processors in terms of both hardware and software. The documentation explains how products have to be designed using the Hercules processor’s features so that the design complies with the IEC 61508 standard.
During the certification process, Texas Instruments had to go through the failure analysis very carefully together with the certification company. The Hercules platform provides a range of failure analysis tools. Different configurations can be selected using the failure analysis tools and corresponding failure rates calculated. These failure rates and modes are important data when the certification body assesses whether the failure rate complies with the SIL requirements.
The diagnostic functions in the processor significantly reduce the need to implement diagnostics in the program code itself. Other safety-related functions are implemented in the processor as well, including clock monitoring and CPU self-testing. This means that separate diagnostic functions or components outside of the processor are not needed. This reduces costs, saves time, and makes the design process easier.
The documentation explains which methods should be chosen to reduce systematic and random failures and their consequences. Many methods are obligatory, and the higher the safety target, the more complex and extensive the obligatory methods become.
Autonomous safety systems diversify
Safety systems are nowadays diverse and versatile. Safety functions are carried out by both devices and software. The further a problem situation escalates through the different layers of safety, the bigger the problem usually gets. Autonomous safety systems prevent the situation from escalating so far that active protection systems are activated, for example fire extinguishers, which themselves can already cause significant damage or at least annoyance and extra work.
About the authors
Tomi Lampola is a chief electronics engineer and a TÜV certified safety professional. Tomi has more than 15 years of experience in electronics development for industrial applications and more than 5 years of experience in design for safety-critical systems. Tomi can be reached at firstname.lastname@example.org.
Ville Sipinen is a chief software engineer and a TÜV certified safety professional. Ville has more than 5 years of experience in software development for safety-critical systems and, prior to this, several years of experience with mission-critical systems for space applications. Ville can be reached at email@example.com.