Mentioning the different layers of managing the impact of radiation (process, circuit, microarchitecture, software) was nice. The recognition of the value of post-launch change is expected from an FPGA vendor (though this brings in the power-performance-area and dynamic-configurability tradeoffs among software, FPGAs, and more traditional, mostly fixed-function logic), but the tradeoffs among one-time-programmable FPGAs (less dynamic configurability but more resilience to radiation), SRAM-based FPGAs, and hardwired logic might have been worth mentioning.
The tradeoffs between spatial and temporal redundancy might also have been worth mentioning. With a faster device (or looser timing constraints, possibly from processing tasks in parallel on separate processing elements made available by a finer-featured process), re-running a computation may be more reasonable than conditionally retrying or voting. Obviously, a persistent fault can make purely temporal fault detection inadequate. (Presumably phase-shifted redundancy has also been considered and used; by not operating in lockstep, some spatially broad but temporally local fault causes could behave less systematically and more stochastically. If the phase is shifted by an entire work unit, this becomes temporal redundancy with processor switching. One could also consider something like Todd M. Austin's DIVA, where a checker processor is less complex, possibly keeping pace via higher-level parallelism (multiple checkers overlapping operation: a task starts at the same time on a high-performance processor and a checker, and when the next task starts on the high-performance processor a different checker is used) or by avoiding cache-miss overheads. That kind of simpler checker design does not seem likely to work well generally, but it might be appropriate somewhere and is at least interesting to think about.)
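The contrast can be sketched with a toy fault model (entirely my own construction, not from the talk): spatial triple modular redundancy masks a fault that hits a single replica, while temporal re-execution detects a transient mismatch but is defeated by a persistent fault.

```python
def tmr_vote(f, x, fault=lambda v: v):
    """Spatial redundancy: run three replicas and take the majority vote.
    A transient fault injected into one replica is masked."""
    results = [fault(f(x)), f(x), f(x)]  # fault hits only the first replica
    return max(set(results), key=results.count)

def temporal_retry(f, x, fault=lambda v: v):
    """Temporal redundancy: run twice, and re-run once more on a mismatch.
    Detects transient faults, but a persistent fault corrupts every run."""
    a, b = fault(f(x)), f(x)
    return a if a == b else f(x)  # third run breaks the tie

square = lambda x: x * x
flip = lambda v: v ^ 1  # transient single-bit fault model

assert tmr_vote(square, 5, fault=flip) == 25        # faulty replica is outvoted
assert temporal_retry(square, 5, fault=flip) == 25  # mismatch triggers a re-run

# A persistent fault defeats purely temporal detection:
assert temporal_retry(lambda x: (x * x) ^ 1, 5) == 24  # wrong, and undetected
```

The asymmetry is the point of the passage above: spatial redundancy spends area to mask faults immediately, while temporal redundancy spends time and relies on the fault not recurring.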
Also, the reduced launch cost per payload (both from more multiple-payload launches and from lower launch costs) encourages reducing payload cost even at some expense in reliability and durability. (Reducing the cost and expected reliability of payloads can also reduce the cost of launches, since driving down the launch failure rate is expensive. Higher volume also leads to lower costs and higher reliability as one moves faster up the learning curve.) Components not rated for aerospace uses can be less expensive and defects
The link between approximate computing and reliability was hinted at. Some errors have no architectural effect (e.g., a comparison used for a branch decision can tolerate an error in the inputs or the processing as long as the decision remains correct); some errors produce an architectural effect that is not significant to the higher-level function (e.g., flipping the least significant bit of pixel data is usually not significant); and some errors are detectable and possibly correctable with on-ground processing (though distinguishing between an "impossible" observation and an improbable error presents challenges - science can be delayed by throwing out data that "can't be right"). Manufacture-time variation/defectivity also seems related (in theory, some of the redundancy used to improve manufacturability might be transferable to increase operational reliability).
Minal Sawant seemed to discount persistent faults, but a data-only fault in SRAM-based FPGAs will be quasi-persistent (requiring reprogramming to reset the error). Even in a hardwired system, persistent faults are possible, and radiation (like temperature and other environmental factors) presumably increases the probability of a persistent error. (Fault persistence is also not binary or one-dimensional. E.g., electromigration causing increased resistance might initially cause only certain inputs to fail timing under certain temperature conditions - from a hardware perspective this is a persistent fault, but from a higher level it will appear as transient faults.)
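A toy model (my own sketch, not anything presented in the talk) of why an upset in SRAM-based FPGA state is quasi-persistent: a flipped bit corrupts every subsequent evaluation until a scrub rewrites the configuration, so re-running the computation does not help the way it does for a truly transient fault.

```python
class SramFpgaModel:
    """Toy model: 'config' bits define the logic function (here, a 2-input
    XOR truth table). An upset corrupts all outputs until scrubbed."""

    def __init__(self):
        self.golden = [0, 1, 1, 0]      # XOR truth table held off-chip
        self.config = list(self.golden) # configuration memory (SRAM)

    def evaluate(self, a, b):
        return self.config[(a << 1) | b]

    def upset(self, index):
        self.config[index] ^= 1         # single-event upset flips a bit

    def scrub(self):
        self.config = list(self.golden) # reload from the golden copy

fpga = SramFpgaModel()
assert fpga.evaluate(1, 0) == 1
fpga.upset(2)                    # radiation flips a configuration bit
assert fpga.evaluate(1, 0) == 0  # wrong on every evaluation thereafter
assert fpga.evaluate(1, 0) == 0  # re-running does not clear the fault
fpga.scrub()
assert fpga.evaluate(1, 0) == 1  # restored only after reprogramming
```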
Very nice information overall (though I would have preferred reading rather than watching/listening) even if not 10,000 hours long to consider a broader range of the tradeoffs in more depth.☺