Regional Information Center for Science and Technology - Infant Mortality--The Lesser Known Reliability Issue

Abstract :

Infant mortality problems have been around for a long time (which may be why many electronic products sometimes carry a shorter warranty period). Infant mortality is explained by left-over (or latent) defects: defects that do not necessarily expose themselves and can slip past all the manufacturing tests, including system test. Under electrical and thermal stress during use, however, they eventually degrade enough to cause a significant functionality problem, and the result is a failed system in the field. A product with such a latent defect trapped inside may last anywhere from hours to months. Since field failures are undesirable and must stay below a certain level to avoid general customer resentment, some kind of infant mortality acceleration is needed. Burn-in is the process that accelerates the latent defects that would eventually lead to infant mortality failures. The same electrical and thermal stress is applied to the chip during the burn-in test step, but at a much elevated level, so that months to years of product lifetime are consumed in hours. The latent defects are thus detected and screened within the manufacturing test flow, and the defective parts are not shipped to customers. So, if this is a production-worthy process that has been used extensively in many kinds of chip manufacturing, why is there a problem? The issue is that scaling has created very short channel transistors and very thin gate oxide. The shorter channels bring a higher electric field around the source and drain and create hot electrons, which damage the gate and shorten the life of the transistors. Higher Vcc also puts more stress on the gate oxide and can cause more soft breakdown. The electrical and thermal acceleration essentially exacerbates this wearout effect, even as it reduces infant mortality. We are therefore trading one type of reliability for another.
A process change (such as a changeover to a different gate stack) would push back this wearout effect by a generation, but the drive to scale to ever smaller devices will continue and the problem may come back. Lowering Vcc would likely contain the long-term wearout effect, but it would hurt performance, and voltage scaling has already slowed down substantially in recent years. The traditional solution of placing a test guardband into manufacturing test will not be cost effective in the long run, especially in a competitive environment. It is likely that we need both circuit-level and architectural-level solutions to deal with this. Will online testing or fault tolerance come to the rescue? Will fault-tolerant techniques be sufficient for dealing with the infant mortality problem? Will these high-reliability system features eventually move into mainstream computing products? Or better yet, will we have latent defect acceleration or screening without the ill effect of degrading the long-term lifetime of our products? All of these questions remain to be answered.
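
The trade-off described in the abstract, where burn-in compresses months of field stress into hours but also consumes wearout lifetime, can be illustrated with a rough calculation. The sketch below is not taken from the paper: it assumes an Arrhenius model for thermal acceleration and a simple exponential model for voltage acceleration, both common first-order choices in reliability work. The function names and the parameters `ea_ev` (activation energy) and `gamma_per_v` (voltage acceleration coefficient) are hypothetical and would have to be fitted to the failure mechanism of a real process.

```python
import math

# Boltzmann constant in eV/K.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def thermal_acceleration(ea_ev, t_use_c, t_stress_c):
    """Arrhenius acceleration factor between use and stress temperatures.

    ea_ev is an assumed activation energy in eV; temperatures are in Celsius.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


def voltage_acceleration(gamma_per_v, v_use, v_stress):
    """Exponential voltage acceleration factor (an assumed model).

    gamma_per_v is an assumed coefficient in 1/V.
    """
    return math.exp(gamma_per_v * (v_stress - v_use))


def equivalent_field_hours(burn_in_hours, af_thermal, af_voltage):
    """Field-use hours represented by a burn-in of the given length.

    The same number applies to latent defects and, under this simplified
    model, to wearout: those hours are removed from the useful life too.
    """
    return burn_in_hours * af_thermal * af_voltage


if __name__ == "__main__":
    # Illustrative values only, not measurements from any real product.
    af_t = thermal_acceleration(ea_ev=0.7, t_use_c=55.0, t_stress_c=125.0)
    af_v = voltage_acceleration(gamma_per_v=3.0, v_use=1.0, v_stress=1.4)
    hours = equivalent_field_hours(burn_in_hours=12.0, af_thermal=af_t, af_voltage=af_v)
    print(f"thermal AF = {af_t:.1f}, voltage AF = {af_v:.1f}")
    print(f"12 h of burn-in is roughly {hours:.0f} h of field use")
```

In practice infant mortality and wearout mechanisms usually have different activation energies and voltage sensitivities, so each would get its own pair of factors. Using a single pair here keeps the sketch short and only shows the direction of the trade-off.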

Keywords :

integrated circuit reliability; integrated circuit testing; life testing; burn-in test; chip manufacturing; electrical acceleration; fault tolerant techniques; field failures; functionality problem; infant mortality acceleration; infant mortality problem; latent defects; left over defects; manufacturing tests; online testing; thermal acceleration; very short channel transistors; very thin gate oxide; voltage scaling; wearout effect; Acceleration; Circuit testing; Fault tolerance; Life testing; Manufacturing; Production; System testing; Thermal degradation; Thermal stresses; Warranties;