Building robust and resilient processors has emerged as a crucial nanotechnology challenge. In the late CMOS era, device-scaling trends have increased awareness of the many sources of unreliability at the chip level. Our goal is to build resilient processors, where resilience is a measure of a processor's ability to continue working in the presence of system degradation and failures. At the chip level, variation comes from three major sources: process, voltage, and thermal. Broadly, these variations can be grouped as either static or dynamic.
Our approach to making future systems robust and resilient involves (1) co-designing hardware and software for resiliency, to ease the burden of the growing hardware complexity that robustness requires; (2) applying deep learning techniques to develop self-healing systems capable of predictive failure analysis; and (3) designing feedback-directed resiliency techniques that learn from recurring behavior and take fault-tolerant measures in the future, through either hardware or software solutions.