Background and the Benchmark Data
According to a 2013 report from the American Cancer Society, prostate cancer heads the society's list of the 10 most common cancers, with more than 238,000 new cases expected in the United States in 2013. Breast cancer and lung cancer are the next most common (www.cancer.org/research/cancerfactsfigures/cancerfactsfigures/cancer-facts-figures-2013, www.cancer.gov/cancertopics/types/prostate).
In this study, we investigate a set of prostate cancer data to see whether statistical methods and machine-learning tools can help identify the genes that are related to this disease. The data comprise 102 patients (52 cancer, 50 normal) and 6,033 genes. The original data were collected and analyzed by a team of 15 scientists from a dozen institutions, including Harvard Medical School, Whitehead Institute/Massachusetts Institute of Technology, and Bristol-Myers Squibb Inc., Princeton (Singh et al., 2002).
Efron and colleagues (Efron and Zhang, 2011; Efron, 2010, 2011) also discussed this data set in the context of the Benjamini-Hochberg false discovery rate (FDR) and Bayesian analysis. We are very grateful to Dr. Efron for emailing us the data he used in his papers. The data are in the Dap structure (http://fossies.org/dox/dap-3.8/classes.html), and the file is 11.5 MB in size. A glimpse of the data follows (here, we show only the first and the last three lines of the file):
To facilitate analysis in SAS, STATISTICA, R, and other software packages, we use the following SAS code to convert the Dap data:
To run the SAS code as is, the user needs to create a new folder called “Prostate_Cancer_data” on the C: drive, place the raw data in C:\Prostate_Cancer_data, and then run the above code in the SAS Editor window. The output data file will be written to the same folder, C:\Prostate_Cancer_data, and will be called “dapout.sas7bdat”.
After the conversion, the data set will have 102 rows, one for each of the n=102 patients in the study. The first column will be “_target” (1=cancer, 0=normal patient), the second column will be “PatientID”, and the remaining columns will be gene1–gene6033. One goal of the study is to find the genes that are related to the disease.
Because of the volume of data and the intricate conversion process, the run takes a few minutes to complete, so be patient.
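Once the conversion has finished, it can be worth confirming that the result has the layout described above: 102 rows, a “_target” column coded 0/1, a “PatientID” column, and gene1 through gene6033. The following Python sketch illustrates such a check. It is not part of the original SAS workflow; the function names (`expected_header`, `check_layout`, `make_synthetic_rows`) are hypothetical, and the rows are random placeholders rather than the real expression values. The PatientID numbering and the order of cancer and normal patients are likewise assumptions made only for illustration.

```python
import random

N_PATIENTS = 102   # number of rows stated in the text
N_GENES = 6033     # number of gene columns stated in the text


def expected_header(n_genes=N_GENES):
    """Column names in the order the text describes."""
    return ["_target", "PatientID"] + [f"gene{i}" for i in range(1, n_genes + 1)]


def check_layout(header, rows, n_patients=N_PATIENTS, n_genes=N_GENES):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    if header != expected_header(n_genes):
        problems.append("header is not _target, PatientID, gene1..geneN")
    if len(rows) != n_patients:
        problems.append(f"expected {n_patients} rows, found {len(rows)}")
    for i, row in enumerate(rows):
        if len(row) != len(header):
            problems.append(f"row {i}: {len(row)} values, expected {len(header)}")
        elif row[0] not in (0, 1):
            problems.append(f"row {i}: _target={row[0]!r}, expected 0 or 1")
    return problems


def make_synthetic_rows(n_cancer=52, n_normal=50, n_genes=N_GENES, seed=0):
    """Build placeholder rows with random values (NOT the real data).

    Assumptions for illustration only: cancer patients are listed first
    and PatientID simply counts from 1.
    """
    rng = random.Random(seed)
    targets = [1] * n_cancer + [0] * n_normal
    return [
        [target, patient_id] + [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for patient_id, target in enumerate(targets, start=1)
    ]


header = expected_header()
rows = make_synthetic_rows()
problems = check_layout(header, rows)
n_cancer = sum(row[0] for row in rows)  # count of rows with _target == 1
print(len(rows), len(header), n_cancer, problems)
```

To run the same check against the real file, the converted data would first have to be loaded into a header list and a list of rows. If pandas is installed, `pandas.read_sas` is one way to read a sas7bdat file, although that step has not been tested against this particular data set.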