FRAMINGHAM (09/24/2003) - Their makeshift home was a cramped, rectangular conference room. Token ventilation was provided by an oscillating fan that pushed around warm, stale air. An entire wall was festooned with an unintentional mosaic of fluorescent Post-it Notes.
Here, a team of eight software engineers would work around the clock. Several times a day, the tangle of electrical wires and cords would accidentally be kicked out of the power strip, crashing the network until one of the programmers crawled underneath the table and restored power.
This was the unlikely setting of a revolution in mass fatality identification science.
Staffers at Gene Codes, the Ann Arbor, Mich., bioinformatics company, had witnessed the carnage of Sept. 11, 2001, on a television that Howard Cash, the company's founder and CEO, hastily purchased that fateful day. They saw what the world saw, but were about to be charged with describing it -- in numbers.
The tools for the task were as bizarre as the environment in which it was carried out. Toothbrushes, razors, and clothing provided reference samples that might yield a billionth of a gram of DNA -- invaluable clues to a 20,000-piece human puzzle scattered over a violent 17-acre grave where the twin towers once stood.
For forensic biologists, the Sept. 11 attacks on the World Trade Center (WTC) marked the largest mass fatality identification project in history. Families and friends of the victims demanded answers -- and closure. While the relentless recovery effort at Ground Zero produced many heroes, others were to be found far from the spotlight.
Heeding the 9/11 Call
Four weeks after the disaster, on October 8, Robert Shaler, director of the Department of Forensic Biology at the Office of the Chief Medical Examiner (OCME) of New York City, asked Cash to create catharsis out of code.
Cash flew to New York expecting to donate some existing software to the recovery effort. But Shaler needed new software that would inventory and match the victims' remains, as well as catalog the reference samples required to name the dead -- and reunite them with their families.
The deadline for this unprecedented task: yesterday.
Cash warned his staff that identifying the victims of 9/11 would be a 24/7 marathon. And there could be absolutely no mistakes. His 16 colleagues readily agreed, and after a few additional hires, they set about reinventing the science of DNA mass identification. "We'd do it again in a heartbeat," Cash says. "Every employee, every shareholder was completely behind it," even though it would mean enduring arduous 12-hour programming shifts.
Whether out of patriotism or professionalism, staffers routinely arrived for work at 7 a.m. and left at midnight. Engineers like Dave Relyea just wanted to help. "We thought about the victims, the families, and the people at the Office of the Chief Medical Examiner working around the clock. What they were going through made us feel like we could never work hard enough."
Occasional relief came in the form of some old toys lying around the office. A boxing nun became their "integration token." Staff could submit source code changes, but only if they had the nun. A boxing Godzilla was also on hand, although there was little use of the Nerf guns.
Every Friday, Cash flew to New York to deliver the latest software release to Shaler, returning the following Monday. His colleagues are convinced that Cash subsisted on coffee and airline pretzels.
On Dec. 13, 2001, Gene Codes' Mass Fatality Identification System, "M-FISys" (pronounced "emphasis"), was born. M-FISys combines DNA profiles from three sources: victims' personal effects (toothbrushes, razors, combs, etc.); kinship references (relatives' cheek swabs); and the remains themselves. The software is able to crossmatch thousands of DNA profiles in minutes, a task that previously might have taken two weeks.
That first day alone, the OCME, normally about the business of solving homicides and sexual assaults, made 55 victim identifications from Ground Zero. However, while M-FISys can present almost all the knowledge necessary for making identifications, only the chief medical examiner, Charles Hirsch, can legally determine the identity of each set of remains.
The collapse of the twin towers did not merely kill thousands of people; many literally disappeared. Although most victims left behind traces of DNA (albeit often highly damaged), those who were vaporized or pulverized may never be identified, because they left nothing behind.
The WTC foundation was a six-story, 70-foot-deep, watertight shell designed to keep the New York Harbor at bay. Following the attacks, 1.6 million tons of debris, water, bodies, and burning jet fuel collected in the well, as if everything had been put in a pot and cooked. Fires burned for three months, and some of the remains stewed for as long as nine months.
Conventional means of identification such as dental records became virtually irrelevant. "If someone's been lying in burning jet fuel for three months, it's harder to figure out," Cash says. "DNA is relatively fragile stuff."
Only 287 nearly intact bodies were recovered from Ground Zero. The remains of one individual were found scattered in nearly 200 places. These remains were collected in 16 refrigerated trailers backed under a tent dubbed "Memorial Park."
The identification effort required new technologies to salvage readable sequence from the remnants of DNA. To identify individuals, forensic biologists usually measure the length of microsatellite, or short tandem repeat (STR), patterns in nuclear DNA. These are naturally occurring variations in the length of 13 to 15 stretches of repetitive DNA strewn across the human genome (see "Forensic Profiling," below).
As many of the DNA samples were badly degraded, M-FISys also included the option of matching the genetic patterns of mitochondrial DNA (mtDNA) and single nucleotide polymorphisms (SNPs).
Each human cell contains 1,000 mitochondria, each carrying a loop of mtDNA some 16,500 bases long. Two highly variable regions of about 1,000 bases within the mtDNA are checked. Unrelated individuals might have a handful of single-base differences within those regions. But about 7 percent of the Caucasian population has the identical sequence.
SNP technology is still under development at Orchid Cellmark, a Dallas-based business unit of Orchid BioSciences dedicated to forensic DNA testing services. But after hearing about Orchid's SNP panel for parental testing, the New York medical examiner asked whether the technology could be used on WTC specimens. Orchid developed a technique to examine DNA fragments of a mere 100 bases, rather than the standard 100- to 400-base lengths, which helped with such degraded samples.
According to Orchid BioSciences' Bob Giles, a panel of 71 SNPs provides more powerful identification on average than a full profile of 13 STR loci. "Many samples recovered at the WTC gave a partial STR profile, but typically four to five markers are not enough to make an identification. If you couple a partial STR profile with 20 to 30 SNP markers, identification is feasible. It's not an either-or; in some situations, it's both."
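Giles' point about coupling markers can be illustrated with a little arithmetic. The sketch below is not Orchid's or Gene Codes' method; it simply assumes the markers are statistically independent, so that per-marker genotype frequencies multiply (or, in log space, add). Every frequency in it is a made-up placeholder, and `log10_match_probability` is a hypothetical helper name.

```python
import math

def log10_match_probability(genotype_freqs):
    """Sum log10 genotype frequencies, assuming the markers are independent."""
    return sum(math.log10(f) for f in genotype_freqs)

# Hypothetical genotype frequencies, chosen only to show the arithmetic.
partial_str = [0.08, 0.05, 0.10, 0.06, 0.07]  # five STR loci recovered
snps = [0.45] * 25                            # 25 SNP markers

str_only = log10_match_probability(partial_str)
combined = log10_match_probability(partial_str + snps)

print(f"Partial STR alone: about 1 in 10^{-str_only:.1f}")
print(f"Partial STR + SNPs: about 1 in 10^{-combined:.1f}")
```

Each SNP on its own says little, since a typical genotype is common, but 25 of them stacked on a partial STR profile push the chance of a coincidental match down by several more orders of magnitude.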
So far, Orchid has tested 2,500 tissue specimens and continues to study the remaining samples. The SNP data are sent to Gene Codes to be incorporated into M-FISys so OCME analysts can plug gaps in the STR profiles if necessary.
Another key contributor to M-FISys and the identification project was The Bode Technology Group in Virginia. Mitchell Holland, Bode's laboratory director, also received an urgent call from Shaler in September 2001 asking "if we could develop a system for processing a very large number of bone samples."
Within a month, Bode was processing 1,000 bone samples per week. "Historically, we'd be lucky to do 100 to 200 per week," Holland says. "We increased that by a factor of 10 by developing a method that rapidly prepared bones for DNA extraction."
Extracting pure DNA from bones is hampered by contaminating material (e.g., soft tissue, dirt) that blocks amplification. Only when the surface is absolutely clean is a core sample taken. Bode reduced the processing time from 20 minutes to about four, boosting efficiency five- to tenfold.
Bode also enhanced the quality of results by developing a polymerase chain reaction (PCR) STR system for highly degraded DNA samples, re-engineering the PCR primers to halve the size of the target. "If you reduce the size of the target you increase the amount of DNA available for amplification." So far, Bode has performed more than 18,000 analyses and sent more than 30,000 results to the OCME.
Several other groups were also on the scene when Gene Codes arrived. Myriad Genetics, for example, had a pre-existing contract with the State Police lab in Albany to process rape kits. The Salt Lake City firm analyzed nearly 20,000 DNA samples, using a dozen high-speed sequencing machines. DNA technicians from Celera Genomics in Rockville, Md., also sequenced mtDNA samples from victims and relatives.
An IT Nightmare
Gene Codes spent October of 2001 looking at databases in New York with the medical examiner. At the Family Assistance Center, New York Police Department personnel interviewed friends and family members of victims, taking buccal swabs from family members. All swabs and personal effects were then sent to the Forensic Investigation Center (FIC) at the New York State Police headquarters in Albany. All of these "exemplars" were recorded and either tested within the FIC or shipped to collaborating commercial labs for STR analysis.
But the collection of family reference samples and personal effects bordered on chaos. Volunteers at Pier 94, a makeshift outreach center for victims' families, sometimes accepted toothbrushes with no names on them. Other family members brought personal effects or donated cheek swabs separately, resulting in some people entering the system twice. Handwritten interviews with mourning family members also slowed down data entry.
Researchers navigated through databases from FileMaker Pro to Oracle, collapsing data that were once held in 22 different laboratory databases across five states into neatly compiled aggregated profiles in M-FISys. With more than 164,000 lines of code, M-FISys links all the information in the identification project: 11,641 cheek swab samples from 7,166 family members; 7,681 personal effects and the results of the three types of DNA test; and nearly 20,000 human remains.
Prior to M-FISys, the medical examiner's laboratory tried to use the Combined DNA Index System (CODIS), a tool used by the FBI to allow government labs worldwide to match criminals' DNA profiles, for victims' identification. CODIS generates a report every time two DNA patterns match. But because so many bodies were fragmented into multiple pieces, most of the reports generated by CODIS were redundant, slowing the work. And with no way to shut it off, analysts had to verify every find against multiple reports.
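A rough count shows why pairwise reporting bogs down when remains are fragmented. The sketch below assumes, purely for illustration, that a tool emits one report for every matching pair of samples; exactly how CODIS batches its reports is not described here, and `pairwise_reports` is a hypothetical name.

```python
from math import comb

def pairwise_reports(fragments: int) -> int:
    """Number of distinct sample pairs among fragments that share one profile."""
    return comb(fragments, 2)

for n in (2, 10, 200):
    print(f"{n} fragments -> {pairwise_reports(n)} pairwise reports")
```

Under that assumption, a single individual recovered in nearly 200 pieces would generate close to 20,000 pairwise matches, all of them saying the same thing.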
"We were literally running from computer terminal to terminal to verify information," recalls Elaine Mar, a criminalist and lead supervisor of the WTC DNA Identification Unit at the OCME. "It was a logistical nightmare because we had to comb through so many databases to answer a question."
Mar says M-FISys solved most of her team's problems. Each victim sample was given a number according to when it was found. DM0100001 was the first sample found in "Disaster Manhattan 2001." Numbers with similar profiles collapse into aggregates. For every profile, M-FISys calculates the probability of another person having the same one.
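The idea of collapsing samples into aggregates can be sketched in a few lines. This is not Gene Codes' algorithm: it groups only on exact profile equality, whereas a production system would have to cope with partial and degraded profiles. The function name `aggregate` and every sample ID other than DM0100001 are invented for the example.

```python
from collections import defaultdict

def aggregate(samples):
    """Group sample IDs whose STR profiles are identical.

    `samples` maps a sample ID to a profile: a dict of locus -> (allele, allele).
    Illustrative only: exact matching ignores partial or degraded profiles.
    """
    groups = defaultdict(list)
    for sample_id, profile in samples.items():
        # Sort alleles within each locus so 15/16 and 16/15 compare equal,
        # and sort loci so the key does not depend on dict ordering.
        key = tuple(sorted((locus, tuple(sorted(alleles)))
                           for locus, alleles in profile.items()))
        groups[key].append(sample_id)
    return [sorted(ids) for ids in groups.values()]

samples = {
    "DM0100001": {"D3S1358": (15, 16), "VWA": (17, 18)},
    "DM0100002": {"D3S1358": (16, 15), "VWA": (17, 18)},  # same person, second fragment
    "DM0100003": {"D3S1358": (14, 14), "VWA": (17, 19)},
}
aggregates = aggregate(samples)
print(aggregates)
```

An analyst then reviews one aggregate per distinct profile rather than one report per pair of fragments.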
Initially, M-FISys was programmed not to match remains to personal effects unless the likelihood was 10^10 or better. As more remains were identified, this threshold was lowered and more pieces fell into the same aggregates. "CODIS couldn't give us a red flag, but M-FISys does," Mar says. "That's one of the amazing things about M-FISys -- it does the searching for us."
Because the software for M-FISys had to be developed so quickly, Gene Codes couldn't write specifications. So it adopted extreme programming (XP) to safeguard quick development against errors. In XP, programmers work in pairs, constantly checking and testing each other's code every step of the way.
At the end of each week's iteration, the staff holds a retrospective -- a ritual since November 2001. They list things that worked well, and what needs improvement, on fluorescent pink, green, and yellow Post-it Notes, transforming an entire wall into a case of art imitating life. Under "Worked Well," a note says "Figured out how to use debug form on a wrapped test class." One square under the "Needs Improvement" category simply reads, "I'm tired."
Amy Sutton, Gene Codes' manager of software quality assurance, says the early weeks were tough. Her team was tasked with safeguarding the integrity of M-FISys by automating tests to "break" it. "We have tests for crazy things they never even thought of. It's been a stressful project for most of us," Sutton says, though she seemingly thrives on the pressure.
Sutton was the first person to see the names and data together, working alone one night. "Everyone is always very aware of what the data represent," she says. "You can think you're prepared for dealing with the reality of these data, and then it hits you."
Sutton enjoys watching forensic biologists use M-FISys because it helps her to fine-tune the user experience. "Usability is our watchword here. When you're digging a ditch, any shovel will do, but one with a padded handle makes a big difference. I like to pad the handle."
Cash takes pride that his team has never made a software mistake that could result in a misidentification. "If we write a bug that destroys a computer, we buy a new one," Cash says. "If all the lights go out on the Eastern Seaboard, they'll eventually come back on. But none of these (errors) are as serious as misidentifying a person. How do you tell a family they have to give the body back, that their funeral didn't count?"
The chance of a false match based on the coincidental sharing of loci is less than 1 in 3.58 million.
The Future for M-FISys
The Gene Codes team has come a long way since M-FISys 0.1, which simply imported STR data and grouped them on screen. The latest iteration -- Version 6.03 -- is the 68th release of the program. "Only people who have written complex code understand how difficult it is and what an extraordinary job Gene Codes has done in providing it to us as quickly as they did," Shaler wrote in an August 2002 letter to Cash.
Once a positive ID has been made, the relatives may have a funeral director collect the remains from the OCME and request notification if additional remains are found. Others prefer not to be notified, in which case remains pass into the common memorial, individually tagged in case the family changes its mind. Families can also ask the OCME to retain the remains until all identification efforts are completed, and have the funeral director collect everything at once. Meanwhile, some families still missing any trace of their relative can request not to be informed, even if remains are discovered.
Authorities have had to deal with 80 attempts at fraudulent insurance claims for supposed victims. One woman is alleged to have fled to Peru after collecting US$70,000 in compensation for falsely reporting a death of a relative. But according to the chief medical examiner, the forensic effort will continue until every victim is identified. For now, unidentified remains are being saved for testing under future technologies. M-FISys recently added "virtual profiles," which are made up of several attempts to extract DNA from the same sample.
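How M-FISys assembles a virtual profile is not spelled out here, but the general idea of pooling several partial extractions can be sketched. The version below is an assumption-laden illustration: it keeps a locus only when every attempt that produced a result for it agrees, and it sets disagreements aside for review. The function name and data are hypothetical.

```python
def virtual_profile(attempts):
    """Merge partial STR results from repeated extractions of one sample.

    Each attempt maps locus -> (allele, allele); loci that failed are absent.
    Loci with disagreeing results are reported separately, not merged.
    """
    merged, conflicts = {}, set()
    for attempt in attempts:
        for locus, alleles in attempt.items():
            call = tuple(sorted(alleles))
            if locus in merged and merged[locus] != call:
                conflicts.add(locus)
            merged.setdefault(locus, call)
    # Drop any locus on which the attempts disagreed.
    for locus in conflicts:
        del merged[locus]
    return merged, conflicts

attempts = [
    {"D3S1358": (15, 16), "VWA": (17, 18)},
    {"VWA": (18, 17), "FGA": (22, 24)},
    {"FGA": (22, 25)},  # disagrees with the second attempt at FGA
]
profile, needs_review = virtual_profile(attempts)
print(profile, needs_review)
```

Being conservative about conflicts fits the project's stated priority: a locus left blank costs some statistical power, while a wrong call could contribute to a misidentification.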
While investigators wrap up as much as they can in New York, M-FISys is being recruited for other natural and manmade tragedies. Gene Codes has offered to donate the finished software to nonprofit forensic organizations like the International Commission on Missing Persons, which is working to identify remains from the war in Kosovo.
Gene Codes may also issue licenses to countries that would like to have the software as a national resource. "Several countries have expressed interest in M-FISys, and we will continue to do development in order to support those requests," Cash says. "But for now, the Office of the Chief Medical Examiner has absolute priority."
As the second anniversary of Sept. 11 approaches, 1,521 of the 2,792 people who perished in the WTC disaster have been identified.
The Right Decision
To create M-FISys, Gene Codes had to put its flagship software product Sequencher on hold for about a year and a half. "It's like dancing with a gorilla," says founder and CEO Howard Cash. "You don't stop when you're tired. You stop when the gorilla is tired. We're not done until the city says we're done."
First released in 1991, Sequencher has become the industry standard for DNA sequencing, and was an important tool in the early stages of the Human Genome Project. Indeed, some of M-FISys was based on Sequencher, but it was mostly built from scratch. The company, which was founded by Cash in 1988, has also created software used by the U.S. Army and the FBI for identifying old war remains and helping the government use DNA samples to identify criminals.
Gene Codes was profitable for 39 consecutive quarters before embarking on M-FISys. Cash set up Gene Codes Forensics Inc., a wholly owned subsidiary of Gene Codes, to create M-FISys, while protecting the rest of the company's assets. Gene Codes signed a three-year, $10-million contract with New York City but will bill for only real hours and expenses -- already close to $7 million.
"We made a decision early on not to try to profiteer on this project," Cash says. "We honestly thought we would be working on this for a year at the most." The company will probably see some paper profit eventually, but that will compensate for losses incurred by the pause in Sequencher development. "You don't make money selling software, you make money selling upgrades," Cash says.
Sales manager Carol Carriere says the company "knew it would put a strain on (Sequencher) sales, but the real loss wasn't financial. We lost time for development, and that gave our competitors a chance to catch up."
Even though it will take time before most staff members can refocus their efforts on Sequencher, Cash has no regrets. "If in the process we had brought some comfort to these hundreds and hundreds of families, but in the end we had lost the company, I would still have thought we had made a good investment and the right decision."
Forensic Profiling
DNA profiling typically involves the analysis of 13 core regions, or loci, of short tandem repeats (STRs). At each STR, the length of the repetitive tract of DNA varies according to the number of repeats. Most of the STRs used in forensic identification are tetranucleotide repeats (e.g., "AATG AATG AATG" repeated several times). Naturally occurring variations in the length of these STRs are inherited and, in aggregate, form a unique genetic fingerprint.
The U.S. Department of Justice (DOJ) set the standard of using 13 STR locations across the human genome where the frequencies in various ethnic populations are known. Each locus is genetically unlinked from the other 12, thereby allowing them to be treated as independent data points for statistical analysis. None are believed to contain information pertinent to medical history.
Since humans have two copies of each chromosome, they carry two versions of each STR. For example, an individual might inherit an STR containing eight repeats from one parent and 12 repeats from the other. Or each parent may contribute 10 repeat units, leading to a single value in the assay. The forensic biologist needs to determine only the length of each STR.
A representative STR profile might look like this, with the name of the repeat locus in the center column, and the number of repeats on the right:
Chromosome : Locus : STR repeats
3 : D3S1358 : 15/16
12 : VWA : 17/18
4 : FGA : 22/24
8 : D8S1179 : 13/14
21 : D21S11 : 29/30
18 : D18S51 : 13/17
5 : D5S818 : 11/12
13 : D13S317 : 8/8
7 : D7S820 : 10/11
16 : D16S539 : 9/11
11 : TH01 : 7/9
2 : TPOX : 8/11
5 : CSF1PO : 10/12
The likelihood of this profile occurring by chance in the population is one in 9.1 x 10^14, or less than one in 900 trillion.
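Because the loci are treated as independent, a figure like the one above comes from multiplying per-locus genotype frequencies. The sketch below shows that product rule under the standard Hardy-Weinberg assumption (p squared for a homozygous genotype, 2pq for a heterozygous one). The allele frequencies are hypothetical placeholders, not population data, and only three of the 13 loci are included, so the result is not comparable to the figure quoted above.

```python
from math import prod

def genotype_frequency(allele_a, allele_b, freqs):
    """Expected genotype frequency under Hardy-Weinberg proportions."""
    p, q = freqs[allele_a], freqs[allele_b]
    return p * p if allele_a == allele_b else 2 * p * q

# Hypothetical allele frequencies for three of the loci above; real casework
# uses published population tables.
allele_freqs = {
    "D3S1358": {15: 0.25, 16: 0.25},
    "D13S317": {8: 0.10},
    "TPOX": {8: 0.50, 11: 0.25},
}
profile = {"D3S1358": (15, 16), "D13S317": (8, 8), "TPOX": (8, 11)}

per_locus = [genotype_frequency(a, b, allele_freqs[locus])
             for locus, (a, b) in profile.items()]
match_probability = prod(per_locus)
print(f"Random match probability: about 1 in {1 / match_probability:,.0f}")
```

With these placeholder numbers, three loci already narrow the odds to roughly 1 in 3,200; each additional independent locus multiplies in another small fraction, which is how 13 loci reach the hundreds of trillions.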
In Gene Codes' frenzy to create M-FISys, founder and CEO Howard Cash recruited William Wake, an independent software coach, for one basic task: Exterminate the bugs before they hatch.
This is part of the philosophy behind Extreme Programming (XP), which was created by Kent Beck in 1996 and published in 1999. Beck sought a more efficient approach to building software through communication, simplicity, feedback, and courage.
Wake introduced Gene Codes engineers to the burgeoning Agile software development method in November 2001. The process aims to create an environment of increased interaction and communication within the programming team by scheduling frequent releases of software, tempered by constant testing and feedback from its users. Testing is done before, during, and after the code is written to ensure the same bugs don't surface twice. This results in a legion of automated tests, which serve as a safety net.
Two programmers produce code together on one computer. "The goal is to make sure everything gets code-reviewed as you go and to keep design ideas flowing," Wake says. "You'll rarely see people working by themselves."
The system is rebuilt many times per day. A test for a particular "story" (which describes a piece to be developed) or feature is written before the story itself, and the iteration is published so fellow programmers can see it. Each new story has to pass every previous test before running.
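The test-first rhythm described above might look something like the following in miniature. The "story" here (parsing a genotype such as "16/15" into an ordered pair) is invented for illustration and is not taken from the M-FISys code base; Python's standard `unittest` module stands in for whatever test tooling Gene Codes actually used.

```python
import unittest

def normalize_genotype(text):
    """Parse a genotype such as '16/15' into a sorted allele pair (15, 16).

    Hypothetical story, written only after the tests below existed.
    """
    a, b = (int(part) for part in text.split("/"))
    return (min(a, b), max(a, b))

class NormalizeGenotypeTest(unittest.TestCase):
    # In test-first style these cases are written before the function,
    # fail at first, and pass once the story is implemented.
    def test_orders_alleles(self):
        self.assertEqual(normalize_genotype("16/15"), (15, 16))

    def test_homozygous(self):
        self.assertEqual(normalize_genotype("8/8"), (8, 8))

# Run the suite without exiting the interpreter, as a build script might.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeGenotypeTest)
result = unittest.TextTestRunner().run(suite)
```

Every test written this way stays in the suite, so each later story is checked against all the earlier ones whenever the system is rebuilt.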
Wake uses the analogy of a manufacturing line to explain the difference between traditional programming and XP. "Someone develops a muffler and someone else develops the body, and at the end they screw all the pieces together, but from start to finish how long do the parts sit on the line? We're having everyone swoop under the car and build it, then build the next car, all in parallel with each other, in a short bit rather than a long pipeline of stuff coming out."
Wake, who wrote the book Extreme Programming Explored just before Sept. 11, spends one week per month at Gene Codes. "No one is all that sure what it means for XP to be in its full glory," he says. "We're dealing with quite a bit of change. That puts a lot of pressure on a project, to keep systems working all the time and be open to changes."
Wake credits the XP approach with allowing Gene Codes to deliver weekly updates of M-FISys to the OCME. "We're prepared to deliver every week, no matter what we accomplished last week. We realized that they needed new things all the time; they couldn't wait six months for what they needed today."