Bio Info SIG: Data Mining Breast Cancer Clinical and Expression Data



  • The Monthly Meeting of the Bio Informatics SIG


    Eugenia Bastos, PhD, is an analytical consultant for the Life Sciences Organization at SAS Institute, San Francisco, California.

    Presentation Overview

    Data Mining Breast Cancer Clinical and Expression Data

    One in eight women in the US will develop breast cancer. Of these, one third will progress to fatal metastatic cancer (Peto et al., Lancet, 2000). Gene expression profiles have the potential to improve the accuracy of metastasis classification compared with the classical clinical variables.

    In this study, we re-analyze clinical and microarray data from primary breast cancer tumor tissue of 78 patients (van ’t Veer et al., Nature, 2002), with the primary objective of predicting metastasis status. The study sample includes sporadic and BRCA1-positive cases: 47.4% progressed to metastasis, only 18.6% had lymph node involvement, 61.9% were high grade, and 71.1% developed angioinvasion. Patients are classified into two groups according to their metastasis status: patients who remained disease free for at least 5 years form the “good prognosis” group, and patients who developed metastasis within 5 years form the “poor prognosis” group.

    We begin with the 78 sporadic cases, from which the original authors developed a classifier based on 70 of ~25,000 genes, selected by a filtering and cross-validation approach using correlation coefficients. They report a 16.7% misclassification rate under leave-one-out cross-validation (LOOCV). We present an approach based on analysis-of-variance filtering and cross-validated stepwise discriminant analysis, which yields a classifier based on 13 genes with an LOOCV error rate of 3.8%. We also applied classification methods such as Support Vector Machine (SVM) and Partial Least Squares (PLS) in our analyses; their error rates are even lower, at 0.04 and 0.01, respectively.
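
    As an illustration of the general idea (filter genes, then estimate error by LOOCV), here is a minimal sketch in plain Python. It is not the presenter's SAS workflow: a nearest-centroid rule stands in for stepwise discriminant analysis, the data are synthetic, and every function name below is hypothetical. The point it tries to show is that the ANOVA filter is re-run inside each fold, so the held-out sample never influences gene selection.

```python
import random
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F statistic for a single gene across the classes."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if df_between <= 0 or df_within <= 0:
        return 0.0
    grand = mean(values)
    group_means = {lab: mean(g) for lab, g in groups.items()}
    between = sum(len(g) * (group_means[lab] - grand) ** 2
                  for lab, g in groups.items())
    within = sum((v - group_means[lab]) ** 2
                 for lab, g in groups.items() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / df_between) / (within / df_within)


def select_genes(X, y, k):
    """Indices of the k genes with the largest F statistic."""
    n_genes = len(X[0])
    scores = [f_statistic([row[j] for row in X], y) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:k]


def nearest_centroid(X, y, x_new, genes):
    """Stand-in classifier (assumption): label of the closest class centroid."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroid = [mean(row[j] for row in rows) for j in genes]
        dist = sum((x_new[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_error(X, y, k):
    """LOOCV misclassification rate with gene selection redone in every fold."""
    errors = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_train, y_train, k)  # held-out sample excluded
        if nearest_centroid(X_train, y_train, X[i], genes) != y[i]:
            errors += 1
    return errors / len(X)


def make_toy_data(n_per_class=10, n_genes=50, n_informative=3,
                  shift=4.0, seed=0):
    """Synthetic expression matrix: only the first few genes carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for j in range(n_informative):
                row[j] += shift * label
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("LOOCV error rate:", loocv_error(X, y, k=5))
```

    Selecting genes once on all samples and cross-validating only the classifier would tend to give an optimistically low error estimate, which is why the filter sits inside the loop here.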

    Next, we extend our analyses to include the patients' clinical covariates. We consider additional mining approaches such as logistic regression, decision trees, and prediction based on profiles obtained by unsupervised k-means clustering. Because the data are observational, we advocate extensive cross-validation to help ensure generalizability. Substantial gains in predictive performance are evident.
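
    One possible reading of "prediction based on profiles obtained by unsupervised k-means clustering" is sketched below in plain Python. This is an assumption about the approach, not the presenter's implementation, and all names are hypothetical: samples are clustered without looking at their labels, each cluster is then tagged with the majority prognosis of its members, and a new sample receives the tag of its nearest cluster centroid.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], x))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; initial centroids are k sampled data points."""
    centroids = [list(p) for p in random.Random(seed).sample(points, k)]
    for _ in range(iters):
        members = [[] for _ in range(k)]
        for p in points:
            members[nearest(centroids, p)].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [
            [sum(col) / len(m) for col in zip(*m)] if m else centroids[c]
            for c, m in enumerate(members)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def fit_cluster_profiles(X, y, k, seed=0):
    """Cluster X ignoring y, then tag each cluster with its majority label."""
    centroids = kmeans(X, k, seed=seed)
    votes = [Counter() for _ in range(k)]
    for row, label in zip(X, y):
        votes[nearest(centroids, row)][label] += 1
    # Clusters that attracted no samples are dropped.
    return [(c, v.most_common(1)[0][0]) for c, v in zip(centroids, votes) if v]


def predict(profiles, x):
    """Label of the cluster profile whose centroid is nearest to x."""
    _, label = min(profiles, key=lambda cl: sq_dist(cl[0], x))
    return label
```

    In keeping with the emphasis on cross-validation above, the clustering step would be refit within each training fold rather than once on the full data set.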


    About the Presenter

    Eugenia Bastos

    With a PhD in Epidemiology and a Master's in Biostatistics, Eugenia Bastos has brought a unique set of skills to the Health and Life Sciences Division at SAS since December 2000. She has developed extensive knowledge of data mining, risk analysis, forecasting, bioinformatics and the Micro Array Solution. As a senior statistical consultant, she has offered seminars in data mining for in-depth analyses of clinical trial data, including the development of predictive models for pharmaceutical customers. In the bioinformatics field, she has worked on analyses of gene expression data, including variable selection and dimension reduction for wide data sets and statistical algorithms such as Support Vector Machine and Partial Least Squares.

    Having previously worked in academia, she has experience with reproductive health and cardiology clinical trials, including study design and sample size determination, as well as with developing models of risk factors and disease costs from health services utilization databases.


    Event Logistics:


    Hanson Bridgett
    333 Market Street, 21st Floor
    San Francisco, CA, 94105

    Note: Location provided courtesy of Hanson, Bridgett, Marcus, Vlahos, Rudy, LLP.


    6:30 - 7:00 p.m. Registration and Networking
    7:00 - 9:00 p.m. Presentation


    $15 at the door for non-SDForum members
    No charge for SDForum members
    Please call 408.494.8378 for student memberships
    Registration NOT required
